Systematic Debugging
Root-cause-first debugging with a 3-strike limit and ML-specific diagnostics. Diagnose before you fix — always.
Core Rule
NEVER apply a fix without first presenting the diagnosis to the user. This is Gate 4 — it is mandatory. The user may have domain knowledge that changes the interpretation of the root cause.
| Property | Details |
|---|---|
| Trigger | Bug reports, training failures, unexpected behavior, NaN/divergence issues |
| Active Modes | Engineer Debugger |
| Checkpoint | G4 fires after diagnosis, before any fix |
5-Step Debugging Process
Step 1: Characterize the Symptom
Before investigating, write down exactly what's happening:
- What is the observed behavior?
- What is the expected behavior?
- When did this start? What changed?
- Is it reproducible? Always or intermittent?
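The checklist above can be captured as a structured record before any investigation starts. This is a sketch; the field names and the example values are one possible encoding of the checklist, not a fixed schema:

```python
from dataclasses import dataclass

@dataclass
class SymptomReport:
    """Step 1 checklist as a structured record."""
    observed: str        # what is actually happening
    expected: str        # what should be happening
    first_seen: str      # when it started / what changed
    reproducible: str    # "always", "intermittent", or "unknown"

# Illustrative example, not from a real bug report:
report = SymptomReport(
    observed="loss becomes NaN around step 1200",
    expected="loss decreases smoothly",
    first_seen="after enabling mixed precision",
    reproducible="always",
)
print(report)
```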
Step 2: Form Hypotheses (Ranked)
Generate 3–5 hypotheses for the root cause, ranked by likelihood. For each hypothesis, state what evidence would confirm or rule it out.
Step 3: Gather Evidence
For each hypothesis (in order of likelihood): write targeted diagnostic checks, gather concrete evidence (specific values, shapes, line numbers), and confirm or rule out the hypothesis before moving to the next.
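As a sketch of one such targeted check, here is a diagnostic for the hypothesis "the NaN loss comes from log(0)". The scenario, function name, and input are illustrative; the point is that the check records concrete evidence (shapes, values, counts) and returns a confirm/rule-out verdict without changing any code:

```python
import numpy as np

def check_hypothesis_log_of_zero(probs: np.ndarray) -> dict:
    """Diagnostic for: 'NaN loss comes from log(0) in the loss term'.
    Gathers evidence only; never mutates state or applies a fix."""
    evidence = {
        "shape": probs.shape,
        "min": float(probs.min()),
        "num_zeros": int((probs == 0).sum()),
        "num_nans": int(np.isnan(probs).sum()),
    }
    # Confirmed if any probability is exactly zero *before* the log is taken.
    evidence["confirmed"] = evidence["num_zeros"] > 0
    return evidence

probs = np.array([[0.7, 0.3, 0.0], [0.5, 0.5, 0.0]])
print(check_hypothesis_log_of_zero(probs))
```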
Step 4: Present Diagnosis (Gate 4)
After identifying the root cause, present in the Gate 4 format. Wait for user approval before changing any code.
Step 5: Apply Fix & Verify
After user approves: apply the fix, run relevant validation, present evidence that the fix works. Ask whether to capture a retrospective entry.
Gate 4: Diagnosis Format
Gate 4 is the mandatory checkpoint before any fix is applied. The diagnosis must follow this structure:
## Diagnosis
### Symptom
[What the user reported]
### Root Cause
[What's actually happening, with evidence —
specific line numbers, values, shapes]
### Why This Happens
[Explanation of the mechanism, not just the symptom]
### Proposed Fix
[Exact changes, with file paths and line numbers]
### What This Fix Will Change
[Side effects, if any — other behavior
that will be affected]
### What This Fix Will NOT Fix
[Other issues that might look related but
have different causes]
After presenting the diagnosis, ask the user:
- "I believe the root cause is [X] because [evidence]. Does this match what you're seeing?"
- "I can fix this by [A: targeted fix] or [B: broader refactor]. A is faster but doesn't address the underlying fragility. Which do you prefer?"
- If applicable: "This fix will also change the behavior of [Y]. Is that acceptable?"
The 3-Strike Limit
If the same debugging approach fails three times, something is fundamentally wrong with the assumptions:
- Strike 1: Apply fix based on diagnosis. If it doesn't resolve the symptom, gather more evidence.
- Strike 2: Revised fix based on new evidence. If still failing, question whether the root cause hypothesis is correct.
- Strike 3: Stop. Present what was learned from all three attempts. Ask the user: "My assumption about the root cause may be wrong. Here's what I've learned from the failures. Should I try a fundamentally different approach?"
The temptation is to try "one more variation." This is almost always wrong. Three failed attempts of the same approach is strong evidence that the mental model is incorrect, not that the fix needs tweaking.
ML-Specific Diagnostics
When debugging training issues, check these systematically. Diagnostic scripts are written to scratch/debug/.
Loss Component Magnitudes
- Log each loss component individually
- Is one term dominating? (commitment loss >> reconstruction loss = codebook never learns)
- Are any terms exactly zero? (missing term, incorrect masking)
- Are magnitudes reasonable? (loss of 1e6 suggests numerical issue, loss of 0.0 suggests dead path)
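A minimal audit over logged loss components might look like this sketch (the component names and thresholds are illustrative, not canonical):

```python
def audit_loss_components(components: dict[str, float]) -> list[str]:
    """Flag suspicious loss terms: exact zeros (dead path), huge values
    (numerical issue), and one term dominating the others."""
    warnings = []
    for name, value in components.items():
        if value == 0.0:
            warnings.append(f"{name} is exactly 0.0 -- dead path or missing term?")
        elif abs(value) > 1e5:
            warnings.append(f"{name} = {value:.3g} -- numerical issue?")
    # Dominance check: is the largest nonzero term >100x the next largest?
    nonzero = {k: abs(v) for k, v in components.items() if v != 0.0}
    if len(nonzero) >= 2:
        largest = max(nonzero, key=nonzero.get)
        runner_up = max(v for k, v in nonzero.items() if k != largest)
        if nonzero[largest] > 100 * runner_up:
            warnings.append(f"{largest} dominates by >100x -- loss weighting issue?")
    return warnings

print(audit_loss_components(
    {"reconstruction": 0.8, "commitment": 120.0, "entropy": 0.0}))
```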
Gradient Norms Per Module
- Log gradient norms for each module separately
- Vanishing gradients: norm < 1e-8 (dead path, too many layers without residuals, stop_gradient in the wrong place)
- Exploding gradients: norm > 1e3 (missing clipping, bad initialization, numerical instability)
- Imbalanced gradients: one module's gradients 100x another's (loss weighting issue)
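The per-module check above can be sketched as a small classifier over gradient arrays. The module names and the exact thresholds are illustrative; substitute whatever gradient dict your framework exposes:

```python
import numpy as np

def gradient_norms_by_module(grads: dict[str, np.ndarray]) -> dict[str, str]:
    """Classify each module's gradient norm against rough vanishing/exploding
    thresholds. Thresholds match the heuristics above, not a fixed standard."""
    report = {}
    for module, g in grads.items():
        norm = float(np.linalg.norm(g))
        if norm < 1e-8:
            status = "VANISHING"
        elif norm > 1e3:
            status = "EXPLODING"
        else:
            status = "ok"
        report[module] = f"{norm:.3e} ({status})"
    return report

grads = {
    "encoder": np.full((4, 4), 1e-10),  # dead path: all-tiny gradients
    "decoder": np.full((4, 4), 0.5),    # healthy magnitude
}
print(gradient_norms_by_module(grads))
```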
Codebook / Discrete Metrics
For VQ-VAE, FSQ, and similar discrete representation models:
- Codebook utilization: what fraction of codes are being used? (<10% = collapse)
- Perplexity: effective number of codes used per batch (should be close to codebook size)
- Commitment loss: is the encoder being pushed toward codebook entries?
- Code frequency histogram: are some codes monopolizing? (mode collapse in discrete space)
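Utilization and perplexity can be computed from a batch of code indices alone. A sketch (the collapsed-codebook example below is synthetic; perplexity is exp of the entropy of the empirical code distribution):

```python
import numpy as np

def codebook_metrics(code_indices: np.ndarray, codebook_size: int) -> dict:
    """Utilization = fraction of codes used at least once in the batch;
    perplexity = exp(entropy), i.e. the effective number of codes in use."""
    counts = np.bincount(code_indices.ravel(), minlength=codebook_size)
    probs = counts / counts.sum()
    nonzero = probs[probs > 0]
    entropy = -np.sum(nonzero * np.log(nonzero))
    return {
        "utilization": float((counts > 0).mean()),
        "perplexity": float(np.exp(entropy)),  # close to codebook_size is healthy
    }

# Synthetic collapsed codebook: 512 codes, but only 3 ever selected.
indices = np.random.default_rng(0).choice([7, 42, 99], size=4096)
m = codebook_metrics(indices, codebook_size=512)
print(m)  # utilization well under the 10% collapse threshold, perplexity near 3
```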
Anti-Patterns
| Anti-Pattern | What It Looks Like | What to Do Instead |
|---|---|---|
| Symptom Patching | Fixing the symptom without understanding the cause. "Loss is NaN → add gradient clipping" is a band-aid if the real issue is a log(0) in the loss. | Trace the NaN to its source. The clipping masks the real problem. |
| Shotgun Debugging | Changing 5 things at once to see if something works. Makes it impossible to know which change actually fixed the issue. | Change one thing. Verify. Then change the next. |
| Displaced Fix | Fixing in module B what's actually broken in module A. The regression-guard catches these, but avoid creating them. | Fix the code where the bug actually is, not where it's convenient. |
| Ignoring 3-Strike | "Let me try one more variation of the same approach." | Stop. Re-examine assumptions. Change direction. |
Delegation Rules
When the debugging investigation crosses into a different domain, delegate to the appropriate agent:
| If the failure involves... | Delegate to |
|---|---|
| A known issue or documented failure mode | failure-mode-researcher (internet search) |
| Subtle data flow or shape propagation | data-flow-tracer |
| JAX-specific semantics (vmap, scan, jit) | jax-logic-auditor |
| Paper/algorithm misalignment | paper-alignment-auditor |