Root Cause Debugging Assistant
The Prompt
You are a senior {{language}} incident investigator. Help me find the most likely root cause of a production issue.
Error signal:
{{error_message}}
How to reproduce:
{{reproduction_steps}}
Recent changes:
{{recent_changes}}
Environment details:
{{environment}}
Output in this exact format:
1) Incident summary (3-5 bullets)
2) Ranked hypotheses (top 5, with confidence % and why)
3) Investigation plan (ordered steps with expected evidence)
4) Instrumentation and logging upgrades (specific metrics/log fields/traces to add)
5) Likely fix path (minimal-risk patch first, then long-term fix)
6) Verification checklist (pre-deploy, canary, post-deploy checks)
7) If blocked, list exactly what extra data is required
Constraints:
- Do not jump to a fix before ranking hypotheses.
- Prefer reversible and low-blast-radius actions first.
- Explicitly call out assumptions.
- Include at least one edge case that could invalidate the top hypothesis.
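If you drive this prompt from a script instead of pasting it by hand, a small helper can substitute the {{variable}} placeholders before the prompt goes to your model of choice. The sketch below assumes the prompt is stored as a string; the interface and function names are illustrative, not part of the template.

```typescript
// Minimal sketch: fill the {{variable}} placeholders before sending the prompt to a model.
// Only the variable-bearing head of the prompt is shown; the output-format and
// constraint lines from above would follow in the same string.
const promptTemplate = `You are a senior {{language}} incident investigator. Help me find the most likely root cause of a production issue.
Error signal:
{{error_message}}
How to reproduce:
{{reproduction_steps}}
Recent changes:
{{recent_changes}}
Environment details:
{{environment}}`;

interface IncidentInputs {
  language: string;
  error_message: string;
  reproduction_steps: string;
  recent_changes: string;
  environment: string;
}

function fillPrompt(template: string, inputs: IncidentInputs): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match, key) => {
    const value = inputs[key as keyof IncidentInputs];
    // Leave unknown placeholders untouched so missing data is obvious on review.
    return value !== undefined ? value : match;
  });
}

const prompt = fillPrompt(promptTemplate, {
  language: "TypeScript",
  error_message: "TypeError: Cannot read properties of undefined",
  reproduction_steps: "Open checkout, apply coupon, submit payment",
  recent_changes: "Upgraded Prisma 5.9 -> 5.12, enabled caching flag",
  environment: "Kubernetes, Node 20, Redis 7, only EU region impacted",
});
console.log(prompt);
```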
When to Use
Use this when a bug is real, urgent, and not obvious from a quick read. It is especially useful when logs are noisy, multiple recent changes overlap, or the error only appears in one environment.
Good scenarios:
- A production regression after a deployment
- An intermittent failure with low reproduction reliability
- A system where several services could be involved
- A “works locally, fails in staging/prod” mismatch
This template helps avoid random trial-and-error by forcing a hypothesis-first process. You get a prioritized path that starts with fast evidence gathering and low-risk checks, then moves toward targeted fixes.
Variables
| Variable | Description | Good input examples |
|---|---|---|
| language | Main implementation language | TypeScript, Python, Go, Rust |
| error_message | Exact error text, stack traces, alerts | "TypeError: Cannot read properties of undefined", Datadog alert excerpt |
| reproduction_steps | Clear sequence to reproduce or trigger conditions | "Open checkout, apply coupon, submit payment" |
| recent_changes | Relevant deploys, config toggles, dependency updates | "Upgraded Prisma 5.9 -> 5.12, enabled caching flag" |
| environment | Runtime context and constraints | "Kubernetes, Node 20, Redis 7, only EU region impacted" |
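The analysis is only as good as these inputs, so it can pay to flag empty or placeholder-quality values before spending a model call. This is a hypothetical pre-flight check, not part of the template; the key list and placeholder patterns are assumptions.

```typescript
// Hypothetical pre-flight check: flag variables that are empty or still look like
// placeholders before the prompt is used. Keys mirror the table above.
const REQUIRED_KEYS = [
  "language",
  "error_message",
  "reproduction_steps",
  "recent_changes",
  "environment",
] as const;

function findWeakInputs(inputs: Record<string, string>): string[] {
  const problems: string[] = [];
  for (const key of REQUIRED_KEYS) {
    const value = (inputs[key] ?? "").trim();
    if (value.length === 0) {
      problems.push(`${key} is empty`);
    } else if (/^(tbd|todo|n\/a|unknown)$/i.test(value)) {
      problems.push(`${key} looks like a placeholder ("${value}")`);
    }
  }
  return problems;
}

// Example: an environment of "unknown" is flagged before you burn an investigation pass.
console.log(findWeakInputs({
  language: "TypeScript",
  error_message: "TypeError: Cannot read properties of undefined",
  reproduction_steps: "Open checkout, apply coupon, submit payment",
  recent_changes: "Upgraded Prisma 5.9 -> 5.12, enabled caching flag",
  environment: "unknown",
}));
// -> [ 'environment looks like a placeholder ("unknown")' ]
```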
Tips & Variations
- Add a timeline: prepend incident timestamps to improve causal analysis.
- Add blast radius: include affected users, regions, or endpoints.
- For flaky issues, ask for “three competing hypotheses with disconfirming tests.”
- For distributed systems, require a trace-based investigation section.
- After root cause is found, run a second pass: “Draft a postmortem from this analysis.”
If your logs are weak, this prompt still works well because it requests instrumentation upgrades early instead of pretending certainty.
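As a concrete picture of what "instrumentation and logging upgrades" can mean, here is a sketch assuming a Node service that emits one JSON log line per event. The field names are illustrative, not prescribed by the prompt; the point is to add correlation fields that let you join failures against upstream behavior, deploys, and blast radius.

```typescript
// Sketch of a structured log event with the correlation fields a root-cause
// investigation tends to need. Field names and values are illustrative.
interface CheckoutLogEvent {
  timestamp: string;
  level: "info" | "warn" | "error";
  message: string;
  request_id: string;         // correlate one request across services
  route: string;
  upstream_service?: string;  // e.g. "profile-service"
  upstream_status?: number;   // e.g. 503 on the failing dependency call
  upstream_latency_ms?: number;
  deploy_version?: string;    // ties errors back to recent changes
  region?: string;            // confirms or rules out a region-only blast radius
}

function logEvent(event: CheckoutLogEvent): void {
  // One JSON object per line keeps the log easy to query in most aggregators.
  console.log(JSON.stringify(event));
}

logEvent({
  timestamp: new Date().toISOString(),
  level: "error",
  message: "checkout failed after upstream profile lookup",
  request_id: "req-12345",
  route: "POST /checkout",
  upstream_service: "profile-service",
  upstream_status: 503,
  upstream_latency_ms: 1240,
  deploy_version: "2024-05-02.1",
  region: "eu-west-1",
});
```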
Example Output
Ranked hypothesis #1 (68%): Null response from upstream profile service causes unguarded property access in checkout handler.
Investigation step 1: Correlate failed checkout requests with upstream profile-service 5xx spikes in the same minute.
Minimal-risk fix: Add null-guard and fallback path in checkout handler, then canary at 5% traffic.
Verification: Error rate below 0.1% for 30 minutes, no latency regression over p95 baseline.
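To make the "minimal-risk fix" step tangible, here is a hypothetical sketch of the null-guard described above. It assumes a checkout flow that reads a field off an upstream profile response; fetchProfile, preferredCurrency, and the fallback value are invented for illustration.

```typescript
// Hypothetical null-guard matching the example fix: tolerate a missing or failed
// profile lookup instead of dereferencing an undefined response.
interface Profile {
  preferredCurrency: string;
}

async function fetchProfile(userId: string): Promise<Profile | null> {
  // Stand-in for the real upstream call; returns null on a 5xx or timeout.
  return null;
}

async function resolveCurrency(userId: string): Promise<string> {
  try {
    const profile = await fetchProfile(userId);
    // Before the fix: `profile.preferredCurrency` threw when profile was null/undefined.
    if (profile?.preferredCurrency) {
      return profile.preferredCurrency;
    }
  } catch {
    // Swallow the upstream failure here; the fallback below keeps checkout working.
  }
  // Fallback path: degrade gracefully rather than failing the whole checkout.
  return "EUR";
}

resolveCurrency("user-42").then(console.log); // -> "EUR" while the upstream is down
```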