Root Cause Debugging Assistant
The Prompt
You are a senior {{language}} incident investigator. Help me find the most likely root cause of a production issue.
Error signal:
{{error_message}}
How to reproduce:
{{reproduction_steps}}
Recent changes:
{{recent_changes}}
Environment details:
{{environment}}
Output in this exact format:
1) Incident summary (3-5 bullets)
2) Ranked hypotheses (top 5, with confidence % and why)
3) Investigation plan (ordered steps with expected evidence)
4) Instrumentation and logging upgrades (specific metrics/log fields/traces to add)
5) Likely fix path (minimal-risk patch first, then long-term fix)
6) Verification checklist (pre-deploy, canary, post-deploy checks)
7) If blocked, list exactly what extra data is required
Constraints:
- Do not jump to a fix before ranking hypotheses.
- Prefer reversible and low-blast-radius actions first.
- Explicitly call out assumptions.
- Include at least one edge case that could invalidate the top hypothesis.
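If you drive this prompt from a script instead of pasting it by hand, a small helper can substitute the {{variable}} placeholders before the prompt goes to your model of choice. The sketch below assumes the prompt is stored as a string; the interface and function names are illustrative, not part of the template.

```typescript
// Minimal sketch: fill the {{variable}} placeholders before sending the prompt to a model.
// Only the variable-bearing head of the prompt is shown; the output-format and
// constraint lines from above would follow in the same string.
const promptTemplate = `You are a senior {{language}} incident investigator. Help me find the most likely root cause of a production issue.
Error signal:
{{error_message}}
How to reproduce:
{{reproduction_steps}}
Recent changes:
{{recent_changes}}
Environment details:
{{environment}}`;

interface IncidentInputs {
  language: string;
  error_message: string;
  reproduction_steps: string;
  recent_changes: string;
  environment: string;
}

function fillPrompt(template: string, inputs: IncidentInputs): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match, key) => {
    const value = inputs[key as keyof IncidentInputs];
    // Leave unknown placeholders untouched so missing data is obvious on review.
    return value !== undefined ? value : match;
  });
}

const prompt = fillPrompt(promptTemplate, {
  language: "TypeScript",
  error_message: "TypeError: Cannot read properties of undefined",
  reproduction_steps: "Open checkout, apply coupon, submit payment",
  recent_changes: "Upgraded Prisma 5.9 -> 5.12, enabled caching flag",
  environment: "Kubernetes, Node 20, Redis 7, only EU region impacted",
});
console.log(prompt);
```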
When to Use
Use this when a bug is real, urgent, and not obvious from a quick read. It is especially useful when logs are noisy, multiple recent changes overlap, or the error only appears in one environment.
Good scenarios:
- A production regression after a deployment
- An intermittent failure with low reproduction reliability
- A system where several services could be involved
- A “works locally, fails in staging/prod” mismatch
This template helps avoid random trial-and-error by forcing a hypothesis-first process. You get a prioritized path that starts with fast evidence gathering and low-risk checks, then moves toward targeted fixes.
Variables
| Variable | Description | Good input examples |
|---|---|---|
| language | Main implementation language | TypeScript, Python, Go, Rust |
| error_message | Exact error text, stack traces, alerts | "TypeError: Cannot read properties of undefined", Datadog alert excerpt |
| reproduction_steps | Clear sequence to reproduce or trigger conditions | "Open checkout, apply coupon, submit payment" |
| recent_changes | Relevant deploys, config toggles, dependency updates | "Upgraded Prisma 5.9 -> 5.12, enabled caching flag" |
| environment | Runtime context and constraints | "Kubernetes, Node 20, Redis 7, only EU region impacted" |
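The analysis is only as good as these inputs, so it can pay to flag empty or placeholder-quality values before spending a model call. This is a hypothetical pre-flight check, not part of the template; the key list and placeholder patterns are assumptions.

```typescript
// Hypothetical pre-flight check: flag variables that are empty or still look like
// placeholders before the prompt is used. Keys mirror the table above.
const REQUIRED_KEYS = [
  "language",
  "error_message",
  "reproduction_steps",
  "recent_changes",
  "environment",
] as const;

function findWeakInputs(inputs: Record<string, string>): string[] {
  const problems: string[] = [];
  for (const key of REQUIRED_KEYS) {
    const value = (inputs[key] ?? "").trim();
    if (value.length === 0) {
      problems.push(`${key} is empty`);
    } else if (/^(tbd|todo|n\/a|unknown)$/i.test(value)) {
      problems.push(`${key} looks like a placeholder ("${value}")`);
    }
  }
  return problems;
}

// Example: an environment of "unknown" is flagged before you burn an investigation pass.
console.log(findWeakInputs({
  language: "TypeScript",
  error_message: "TypeError: Cannot read properties of undefined",
  reproduction_steps: "Open checkout, apply coupon, submit payment",
  recent_changes: "Upgraded Prisma 5.9 -> 5.12, enabled caching flag",
  environment: "unknown",
}));
// -> [ 'environment looks like a placeholder ("unknown")' ]
```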
Tips & Variations
- Add a timeline: prepend incident timestamps to improve causal analysis.
- Add blast radius: include affected users, regions, or endpoints.
- For flaky issues, ask for “three competing hypotheses with disconfirming tests.”
- For distributed systems, require a trace-based investigation section.
- After root cause is found, run a second pass: “Draft a postmortem from this analysis.”
If your logs are weak, this prompt still works well because it requests instrumentation upgrades early instead of pretending certainty.
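As a concrete picture of what "instrumentation and logging upgrades" can mean, here is a sketch assuming a Node service that emits one JSON log line per event. The field names are illustrative, not prescribed by the prompt; the point is to add correlation fields that let you join failures against upstream behavior, deploys, and blast radius.

```typescript
// Sketch of a structured log event with the correlation fields a root-cause
// investigation tends to need. Field names and values are illustrative.
interface CheckoutLogEvent {
  timestamp: string;
  level: "info" | "warn" | "error";
  message: string;
  request_id: string;         // correlate one request across services
  route: string;
  upstream_service?: string;  // e.g. "profile-service"
  upstream_status?: number;   // e.g. 503 on the failing dependency call
  upstream_latency_ms?: number;
  deploy_version?: string;    // ties errors back to recent changes
  region?: string;            // confirms or rules out a region-only blast radius
}

function logEvent(event: CheckoutLogEvent): void {
  // One JSON object per line keeps the log easy to query in most aggregators.
  console.log(JSON.stringify(event));
}

logEvent({
  timestamp: new Date().toISOString(),
  level: "error",
  message: "checkout failed after upstream profile lookup",
  request_id: "req-12345",
  route: "POST /checkout",
  upstream_service: "profile-service",
  upstream_status: 503,
  upstream_latency_ms: 1240,
  deploy_version: "2024-05-02.1",
  region: "eu-west-1",
});
```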
Example Output
Ranked hypothesis #1 (68%): Null response from upstream profile service causes unguarded property access in checkout handler.
Investigation step 1: Correlate failed checkout requests with upstream profile-service 5xx spikes in the same minute.
Minimal-risk fix: Add null-guard and fallback path in checkout handler, then canary at 5% traffic.
Verification: Error rate below 0.1% for 30 minutes, no latency regression over p95 baseline.
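To make the "minimal-risk fix" step tangible, here is a hypothetical sketch of the null-guard described above. It assumes a checkout flow that reads a field off an upstream profile response; fetchProfile, preferredCurrency, and the fallback value are invented for illustration.

```typescript
// Hypothetical null-guard matching the example fix: tolerate a missing or failed
// profile lookup instead of dereferencing an undefined response.
interface Profile {
  preferredCurrency: string;
}

async function fetchProfile(userId: string): Promise<Profile | null> {
  // Stand-in for the real upstream call; returns null on a 5xx or timeout.
  return null;
}

async function resolveCurrency(userId: string): Promise<string> {
  try {
    const profile = await fetchProfile(userId);
    // Before the fix: `profile.preferredCurrency` threw when profile was null/undefined.
    if (profile?.preferredCurrency) {
      return profile.preferredCurrency;
    }
  } catch {
    // Swallow the upstream failure here; the fallback below keeps checkout working.
  }
  // Fallback path: degrade gracefully rather than failing the whole checkout.
  return "EUR";
}

resolveCurrency("user-42").then(console.log); // -> "EUR" while the upstream is down
```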