Systematic Debugging
Root-cause-first debugging with a 3-strike limit and ML-specific diagnostics. Diagnose before you fix — always.
Core Rule
NEVER apply a fix without first presenting the diagnosis to the user. This is Gate 4 — it is mandatory. The user may have domain knowledge that changes the interpretation of the root cause.
| Property | Details |
|---|---|
| Trigger | Bug reports, training failures, unexpected behavior, NaN/divergence issues |
| Active Modes | Engineer Debugger |
| Checkpoint | G4 fires after diagnosis, before any fix |
5-Step Debugging Process
Step 1: Characterize the Symptom
Before investigating, write down exactly what's happening:
- What is the observed behavior?
- What is the expected behavior?
- When did this start? What changed?
- Is it reproducible? Always or intermittent?
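The checklist above can be captured as a structured record before any investigation starts. This is a sketch; the field names and the example values are one possible encoding of the checklist, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class SymptomReport:
    """Step 1 checklist as a structured record."""
    observed: str        # what is actually happening
    expected: str        # what should be happening
    first_seen: str      # when it started / what changed
    reproducible: str    # "always", "intermittent", or "unknown"

# Illustrative example, not from a real bug report:
report = SymptomReport(
    observed="loss becomes NaN around step 1200",
    expected="loss decreases smoothly",
    first_seen="after enabling mixed precision",
    reproducible="always",
)
print(report)
```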
Step 2: Form Hypotheses (Ranked)
Generate 3–5 hypotheses for the root cause, ranked by likelihood. For each hypothesis, state what evidence would confirm or rule it out.
Step 3: Gather Evidence
For each hypothesis (in order of likelihood): write targeted diagnostic checks, gather concrete evidence (specific values, shapes, line numbers), and confirm or rule out the hypothesis before moving to the next.
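As a sketch of one such targeted check, here is a diagnostic for the hypothesis "the NaN loss comes from log(0)". The scenario, function name, and input are illustrative; the point is that the check records concrete evidence (shapes, values, counts) and returns a confirm/rule-out verdict without changing any code:

```python
import numpy as np

def check_hypothesis_log_of_zero(probs: np.ndarray) -> dict:
    """Diagnostic for: 'NaN loss comes from log(0) in the loss term'.
    Gathers evidence only; never mutates state or applies a fix."""
    evidence = {
        "shape": probs.shape,
        "min": float(probs.min()),
        "num_zeros": int((probs == 0).sum()),
        "num_nans": int(np.isnan(probs).sum()),
    }
    # Confirmed if any probability is exactly zero *before* the log is taken.
    evidence["confirmed"] = evidence["num_zeros"] > 0
    return evidence

probs = np.array([[0.7, 0.3, 0.0], [0.5, 0.5, 0.0]])
print(check_hypothesis_log_of_zero(probs))
```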
Step 4: Present Diagnosis (Gate 4)
After identifying the root cause, present in the Gate 4 format. Wait for user approval before changing any code.
Step 5: Apply Fix & Verify
After user approves: apply the fix, run relevant validation, present evidence that the fix works. Ask whether to capture a retrospective entry.
Gate 4: Diagnosis Format
Gate 4 is the mandatory checkpoint before any fix is applied. The diagnosis must follow this structure:
## Diagnosis
### Symptom
[What the user reported]
### Root Cause
[What's actually happening, with evidence —
specific line numbers, values, shapes]
### Why This Happens
[Explanation of the mechanism, not just the symptom]
### Proposed Fix
[Exact changes, with file paths and line numbers]
### What This Fix Will Change
[Side effects, if any — other behavior
that will be affected]
### What This Fix Will NOT Fix
[Other issues that might look related but
have different causes]
After presenting the diagnosis, ask the user:
- "I believe the root cause is [X] because [evidence]. Does this match what you're seeing?"
- "I can fix this by [A: targeted fix] or [B: broader refactor]. A is faster but doesn't address the underlying fragility. Which do you prefer?"
- If applicable: "This fix will also change the behavior of [Y]. Is that acceptable?"
The 3-Strike Limit
If the same debugging approach fails three times, something is fundamentally wrong with the assumptions:
- Strike 1: Apply fix based on diagnosis. If it doesn't resolve the symptom, gather more evidence.
- Strike 2: Revised fix based on new evidence. If still failing, question whether the root cause hypothesis is correct.
- Strike 3: Stop. Present what was learned from all three attempts. Ask the user: "My assumption about the root cause may be wrong. Here's what I've learned from the failures. Should I try a fundamentally different approach?"
The temptation is to try "one more variation." This is almost always wrong. Three failed attempts of the same approach is strong evidence that the mental model is incorrect, not that the fix needs tweaking.
ML-Specific Diagnostics
When debugging training issues, check these systematically. Diagnostic scripts are written to scratch/debug/.
Loss Component Magnitudes
- Log each loss component individually
- Is one term dominating? (commitment loss >> reconstruction loss = codebook never learns)
- Are any terms exactly zero? (missing term, incorrect masking)
- Are magnitudes reasonable? (loss of 1e6 suggests numerical issue, loss of 0.0 suggests dead path)
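A minimal audit over logged loss components might look like this sketch (the component names and thresholds are illustrative, not canonical):

```python
def audit_loss_components(components: dict[str, float]) -> list[str]:
    """Flag suspicious loss terms: exact zeros (dead path), huge values
    (numerical issue), and one term dominating the others."""
    warnings = []
    for name, value in components.items():
        if value == 0.0:
            warnings.append(f"{name} is exactly 0.0 -- dead path or missing term?")
        elif abs(value) > 1e5:
            warnings.append(f"{name} = {value:.3g} -- numerical issue?")
    # Dominance check: is the largest nonzero term >100x the next largest?
    nonzero = {k: abs(v) for k, v in components.items() if v != 0.0}
    if len(nonzero) >= 2:
        largest = max(nonzero, key=nonzero.get)
        runner_up = max(v for k, v in nonzero.items() if k != largest)
        if nonzero[largest] > 100 * runner_up:
            warnings.append(f"{largest} dominates by >100x -- loss weighting issue?")
    return warnings

print(audit_loss_components(
    {"reconstruction": 0.8, "commitment": 120.0, "entropy": 0.0}))
```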
Gradient Norms Per Module
- Log gradient norms for each module separately
- Vanishing gradients: norm < 1e-8 (dead path, too many layers without residuals, stop_gradient in the wrong place)
- Exploding gradients: norm > 1e3 (missing clipping, bad initialization, numerical instability)
- Imbalanced gradients: one module's gradients 100x another's (loss weighting issue)
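The per-module check above can be sketched as a small classifier over gradient arrays. The module names and the exact thresholds are illustrative; substitute whatever gradient dict your framework exposes:

```python
import numpy as np

def gradient_norms_by_module(grads: dict[str, np.ndarray]) -> dict[str, str]:
    """Classify each module's gradient norm against rough vanishing/exploding
    thresholds. Thresholds match the heuristics above, not a fixed standard."""
    report = {}
    for module, g in grads.items():
        norm = float(np.linalg.norm(g))
        if norm < 1e-8:
            status = "VANISHING"
        elif norm > 1e3:
            status = "EXPLODING"
        else:
            status = "ok"
        report[module] = f"{norm:.3e} ({status})"
    return report

grads = {
    "encoder": np.full((4, 4), 1e-10),  # dead path: all-tiny gradients
    "decoder": np.full((4, 4), 0.5),    # healthy magnitude
}
print(gradient_norms_by_module(grads))
```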
Codebook / Discrete Metrics
For VQ-VAE, FSQ, and similar discrete representation models:
- Codebook utilization: what fraction of codes are being used? (<10% = collapse)
- Perplexity: effective number of codes used per batch (should be close to codebook size)
- Commitment loss: is the encoder being pushed toward codebook entries?
- Code frequency histogram: are some codes monopolizing? (mode collapse in discrete space)
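Utilization and perplexity can be computed from a batch of code indices alone. A sketch (the collapsed-codebook example below is synthetic; perplexity is exp of the entropy of the empirical code distribution):

```python
import numpy as np

def codebook_metrics(code_indices: np.ndarray, codebook_size: int) -> dict:
    """Utilization = fraction of codes used at least once in the batch;
    perplexity = exp(entropy), i.e. the effective number of codes in use."""
    counts = np.bincount(code_indices.ravel(), minlength=codebook_size)
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]
    entropy = -np.sum(nonzero * np.log(nonzero))
    return {
        "utilization": float((counts > 0).mean()),
        "perplexity": float(np.exp(entropy)),  # close to codebook_size is healthy
    }

# Synthetic collapsed codebook: 512 codes, but only 3 ever selected.
indices = np.random.default_rng(0).choice([7, 42, 99], size=4096)
m = codebook_metrics(indices, codebook_size=512)
print(m)  # utilization well under the 10% collapse threshold, perplexity near 3
```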
Anti-Patterns
| Anti-Pattern | What It Looks Like | What to Do Instead |
|---|---|---|
| Symptom Patching | Fixing the symptom without understanding the cause. "Loss is NaN → add gradient clipping" is a band-aid if the real issue is a log(0) in the loss. | Trace the NaN to its source. The clipping masks the real problem. |
| Shotgun Debugging | Changing 5 things at once to see if something works. Makes it impossible to know which change actually fixed the issue. | Change one thing. Verify. Then change the next. |
| Displaced Fix | Fixing in module B what's actually broken in module A. The regression-guard catches these, but avoid creating them. | Fix the code where the bug actually is, not where it's convenient. |
| Ignoring 3-Strike | "Let me try one more variation of the same approach." | Stop. Re-examine assumptions. Change direction. |
Delegation Rules
When the debugging investigation crosses into a different domain, delegate to the appropriate agent:
| If the failure involves... | Delegate to |
|---|---|
| A known issue or documented failure mode | failure-mode-researcher (internet search) |
| Subtle data flow or shape propagation | data-flow-tracer |
| JAX-specific semantics (vmap, scan, jit) | jax-logic-auditor |
| Paper/algorithm misalignment | paper-alignment-auditor |