Root Cause Debugging Assistant

Category: Development
Subcategory: Debugging
Difficulty: Intermediate
Target models: Claude, GPT-5, Gemini
Variables: {{language}}, {{error_message}}, {{reproduction_steps}}, {{recent_changes}}, {{environment}}
Tags: debugging, root-cause, logs, incident, troubleshooting
Updated: February 14, 2026

The Prompt

You are a senior {{language}} incident investigator. Help me find the most likely root cause of a production issue.

Error signal:
{{error_message}}

How to reproduce:
{{reproduction_steps}}

Recent changes:
{{recent_changes}}

Environment details:
{{environment}}

Output in this exact format:
1) Incident summary (3-5 bullets)
2) Ranked hypotheses (top 5, with confidence % and why)
3) Investigation plan (ordered steps with expected evidence)
4) Instrumentation and logging upgrades (specific metrics/log fields/traces to add)
5) Likely fix path (minimal-risk patch first, then long-term fix)
6) Verification checklist (pre-deploy, canary, post-deploy checks)
7) If blocked, list exactly what extra data is required

Constraints:
- Do not jump to a fix before ranking hypotheses.
- Prefer reversible and low-blast-radius actions first.
- Explicitly call out assumptions.
- Include at least one edge case that could invalidate the top hypothesis.
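If you post-process the model's response programmatically, the numbered format above maps onto a small data structure. A minimal TypeScript sketch; the type and field names are assumptions, not something the prompt enforces:

    // Illustrative shape for a parsed response. Section boundaries still have to be
    // split out of the raw text; only the structure is sketched here.
    interface RankedHypothesis {
      rank: number;
      description: string;
      confidencePercent: number; // e.g. 68
      rationale: string;
    }

    interface RootCauseReport {
      incidentSummary: string[];                          // 1) 3-5 bullets
      hypotheses: RankedHypothesis[];                     // 2) top 5, ranked
      investigationPlan: string[];                        // 3) ordered steps with expected evidence
      instrumentationUpgrades: string[];                  // 4) metrics, log fields, traces to add
      fixPath: { minimalRisk: string; longTerm: string }; // 5) patch first, then durable fix
      verificationChecklist: string[];                    // 6) pre-deploy, canary, post-deploy
      missingData: string[];                              // 7) populated only when blocked
    }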

When to Use

Use this when a bug is real, urgent, and not obvious from a quick read of the code. It is especially useful when logs are noisy, multiple recent changes overlap, or the error only appears in one environment.

Good scenarios:

  • A production regression after a deployment
  • An intermittent failure with low reproduction reliability
  • A system where several services could be involved
  • A “works locally, fails in staging/prod” mismatch

This template helps avoid random trial-and-error by forcing a hypothesis-first process. You get a prioritized path that starts with fast evidence gathering and low-risk checks, then moves toward targeted fixes.

Variables

Variable | Description | Good input examples
language | Main implementation language | TypeScript, Python, Go, Rust
error_message | Exact error text, stack traces, alerts | “TypeError: Cannot read properties of undefined”, Datadog alert excerpt
reproduction_steps | Clear sequence to reproduce or trigger conditions | “Open checkout, apply coupon, submit payment”
recent_changes | Relevant deploys, config toggles, dependency updates | “Upgraded Prisma 5.9 -> 5.12, enabled caching flag”
environment | Runtime context and constraints | “Kubernetes, Node 20, Redis 7, only EU region impacted”
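To show how the variables slot into the template, here is a minimal TypeScript sketch; the Incident type, the buildPrompt helper, and the truncated promptTemplate string are illustrative, not part of the prompt:

    // Minimal sketch: fill the {{placeholders}} in the prompt from a plain object.
    interface Incident {
      language: string;
      error_message: string;
      reproduction_steps: string;
      recent_changes: string;
      environment: string;
    }

    // Only the first line of the template is shown here for brevity.
    const promptTemplate =
      "You are a senior {{language}} incident investigator. " +
      "Help me find the most likely root cause of a production issue.";

    function buildPrompt(template: string, incident: Incident): string {
      // Replace each known {{variable}}; leave unknown placeholders untouched.
      return template.replace(/\{\{(\w+)\}\}/g, (match, key) =>
        key in incident ? String(incident[key as keyof Incident]) : match,
      );
    }

    const prompt = buildPrompt(promptTemplate, {
      language: "TypeScript",
      error_message: "TypeError: Cannot read properties of undefined",
      reproduction_steps: "Open checkout, apply coupon, submit payment",
      recent_changes: "Upgraded Prisma 5.9 -> 5.12, enabled caching flag",
      environment: "Kubernetes, Node 20, Redis 7, only EU region impacted",
    });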

Tips & Variations

  • Add a timeline: prepend incident timestamps to improve causal analysis.
  • Add blast radius: include affected users, regions, or endpoints.
  • For flaky issues, ask for “three competing hypotheses with disconfirming tests.”
  • For distributed systems, require a trace-based investigation section.
  • After root cause is found, run a second pass: “Draft a postmortem from this analysis.”

If your logs are weak, this prompt still works well because it requests instrumentation upgrades early instead of pretending certainty.
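As a concrete example of the instrumentation upgrades the prompt asks for, here is a hedged TypeScript sketch of a structured checkout log line; the event name and fields are hypothetical:

    // Hypothetical structured log line: one JSON object per checkout attempt, so
    // failures can later be correlated with upstream status, region, and feature flags.
    function logCheckoutAttempt(fields: {
      requestId: string;
      userRegion: string;
      couponApplied: boolean;
      upstreamProfileStatus: number; // HTTP status returned by the profile service
      cacheFlagEnabled: boolean;
      outcome: "success" | "error";
      errorMessage?: string;
    }): void {
      console.log(
        JSON.stringify({ event: "checkout_attempt", ts: new Date().toISOString(), ...fields }),
      );
    }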

Example Output

Ranked hypothesis #1 (68%): Null response from upstream profile service causes unguarded property access in checkout handler.

Investigation step 1: Correlate failed checkout requests with upstream profile-service 5xx spikes in the same minute.

Minimal-risk fix: Add null-guard and fallback path in checkout handler, then canary at 5% traffic.

Verification: Error rate below 0.1% for 30 minutes, no latency regression over p95 baseline.
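For reference, the minimal-risk fix described in the example might look roughly like this TypeScript sketch; the handler, the Profile type, and the fallback response are all hypothetical:

    // Hypothetical checkout handler: guard the upstream profile response instead of
    // accessing properties on a possibly-null value.
    interface Profile {
      defaultPaymentMethodId: string;
    }

    async function handleCheckout(
      userId: string,
      fetchProfile: (id: string) => Promise<Profile | null>,
    ) {
      const profile = await fetchProfile(userId);

      // Null-guard: degrade gracefully rather than throwing
      // "Cannot read properties of undefined".
      if (!profile) {
        return { status: "degraded", paymentMethodId: null, reason: "profile_unavailable" };
      }

      return { status: "ok", paymentMethodId: profile.defaultPaymentMethodId };
    }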