Systematic Debugging

Root-cause-first debugging with a 3-strike limit and ML-specific diagnostics. Diagnose before you fix — always.

Core Rule

Diagnose Before You Fix

NEVER apply a fix without first presenting the diagnosis to the user. This is Gate 4 — it is mandatory. The user may have domain knowledge that changes the interpretation of the root cause.

| Property     | Details                                                                    |
| ------------ | -------------------------------------------------------------------------- |
| Trigger      | Bug reports, training failures, unexpected behavior, NaN/divergence issues |
| Active Modes | Engineer, Debugger                                                         |
| Checkpoint   | G4 fires after diagnosis, before any fix                                   |

5-Step Debugging Process

Step 1: Characterize the Symptom

Before investigating, write down exactly what's happening:

  • What is the observed behavior?
  • What is the expected behavior?
  • When did this start? What changed?
  • Is it reproducible? Always or intermittent?

Step 2: Form Hypotheses (Ranked)

Generate 3–5 hypotheses for the root cause, ranked by likelihood. For each hypothesis, state what evidence would confirm or rule it out.

Step 3: Gather Evidence

For each hypothesis (in order of likelihood): write targeted diagnostic checks, gather concrete evidence (specific values, shapes, line numbers), and confirm or rule out the hypothesis before moving to the next.
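A common targeted check is to bisect for the first non-finite intermediate value. A minimal sketch with NumPy, assuming activations can be captured as an ordered name-to-array mapping (the layer names here are illustrative):

```python
import numpy as np

def first_nonfinite(named_tensors):
    """Return the name of the first intermediate value containing a NaN
    or Inf, or None if everything is finite.

    `named_tensors` is an ordered dict of (name, array) pairs, e.g.
    activations captured in forward order.
    """
    for name, arr in named_tensors.items():
        if not np.isfinite(arr).all():
            return name
    return None

# Hypothetical captured activations: the NaN first appears in
# "layer2/logits", so the search narrows to the op that produced it.
acts = {
    "layer1/out": np.ones((4, 8)),
    "layer2/logits": np.array([[1.0, np.inf]]),
    "layer2/probs": np.array([[np.nan, np.nan]]),
}
print(first_nonfinite(acts))  # → layer2/logits
```

This pinpoints the producing op instead of the op where the NaN was finally observed, which is the evidence Step 4 needs.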

Step 4: Present Diagnosis (Gate 4)

After identifying the root cause, present in the Gate 4 format. Wait for user approval before changing any code.

Step 5: Apply Fix & Verify

After user approves: apply the fix, run relevant validation, present evidence that the fix works. Ask whether to capture a retrospective entry.
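The verification step can be sketched as a tiny harness that re-runs the exact reproduction from Step 1 and checks that the symptom is gone. Both helpers here (`reproduce`, `still_broken`) are hypothetical stand-ins:

```python
import numpy as np

def verify_fix(reproduce, still_broken):
    """Re-run the original reproduction and confirm the symptom is gone.

    `reproduce` re-runs the exact scenario from Step 1; `still_broken`
    returns True if the original symptom persists. The evidence is the
    concrete result, not the absence of an error message.
    """
    result = reproduce()
    if still_broken(result):
        raise AssertionError(f"Symptom persists after fix: {result!r}")
    return result

# Example: the symptom was a non-finite loss after the first step.
loss_after_fix = verify_fix(
    reproduce=lambda: 0.42,  # stand-in for one training step
    still_broken=lambda loss: not np.isfinite(loss),
)
print(loss_after_fix)  # → 0.42
```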

Gate 4: Diagnosis Format

Gate 4 is the mandatory checkpoint before any fix is applied. The diagnosis must follow this structure:

## Diagnosis

### Symptom
[What the user reported]

### Root Cause
[What's actually happening, with evidence —
 specific line numbers, values, shapes]

### Why This Happens
[Explanation of the mechanism, not just the symptom]

### Proposed Fix
[Exact changes, with file paths and line numbers]

### What This Fix Will Change
[Side effects, if any — other behavior
 that will be affected]

### What This Fix Will NOT Fix
[Other issues that might look related but
 have different causes]

After presenting the diagnosis, ask the user whether to proceed with the proposed fix or revise the diagnosis first.

The 3-Strike Limit

If the same debugging approach fails three times, something is fundamentally wrong with the assumptions:

  1. Strike 1: Apply fix based on diagnosis. If it doesn't resolve the symptom, gather more evidence.
  2. Strike 2: Revised fix based on new evidence. If still failing, question whether the root cause hypothesis is correct.
  3. Strike 3: Stop. Present what was learned from all three attempts. Ask the user: "My assumption about the root cause may be wrong. Here's what I've learned from the failures. Should I try a fundamentally different approach?"

Never Ignore the 3-Strike Limit

The temptation is to try "one more variation." This is almost always wrong. Three failed attempts of the same approach is strong evidence that the mental model is incorrect, not that the fix needs tweaking.

ML-Specific Diagnostics

When debugging training issues, check these systematically. Diagnostic scripts are written to scratch/debug/.

Loss Component Magnitudes
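
Log each loss term's magnitude separately; a term that is orders of magnitude larger than the rest dominates the gradients and is the first place to look. A framework-agnostic sketch, assuming the loss terms are available as a name-to-scalar dict (the term names are illustrative):

```python
import numpy as np

def report_loss_components(components):
    """Print each loss term's magnitude and its share of the total.

    `components` maps term name -> scalar value. Returns a dict of
    (value, share) pairs so the report can also be asserted on.
    """
    total = sum(abs(v) for v in components.values())
    report = {}
    for name, value in components.items():
        share = abs(value) / total if total > 0 else 0.0
        report[name] = (value, share)
        print(f"{name:>8}: {value:+.3e}  ({share:6.1%} of total magnitude)")
    return report

report = report_loss_components({
    "recon": 2.3e-1,
    "commit": 4.1e-3,
    "kl": 8.7e2,  # suspiciously dominant: >99% of the total
})
```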

Gradient Norms Per Module
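
A vanishing (~0) or exploding (>>1) module stands out immediately when gradient norms are grouped by module. A sketch in plain NumPy, assuming gradients arrive as a flat dict keyed by "module/param" paths (that grouping convention is an assumption):

```python
import numpy as np

def grad_norms_by_module(grads):
    """Global L2 norm of gradients, grouped by top-level module name.

    `grads` maps "module/param" -> gradient array. Squared sums are
    accumulated per module, then square-rooted once at the end.
    """
    sq_sums = {}
    for path, g in grads.items():
        module = path.split("/", 1)[0]
        sq_sums[module] = sq_sums.get(module, 0.0) + float(np.sum(g ** 2))
    return {m: float(np.sqrt(s)) for m, s in sq_sums.items()}

grads = {
    "encoder/w": np.full((3, 3), 1e-8),  # vanishing
    "encoder/b": np.full((3,), 1e-8),
    "decoder/w": np.full((3, 3), 2.0),   # large
}
for module, norm in grad_norms_by_module(grads).items():
    print(f"{module:>8}: {norm:.3e}")
```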

Codebook / Discrete Metrics

For VQ-VAE, FSQ, and similar discrete representation models:
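
Two standard collapse indicators are codebook usage (fraction of codes ever selected) and perplexity of the code distribution. A minimal sketch, assuming the selected code indices are available as an integer array:

```python
import numpy as np

def codebook_stats(code_indices, codebook_size):
    """Usage fraction and perplexity of a discrete codebook.

    Usage near 0 and perplexity far below `codebook_size` indicate
    codebook collapse: only a handful of codes are ever selected.
    """
    counts = np.bincount(code_indices.ravel(), minlength=codebook_size)
    usage = np.count_nonzero(counts) / codebook_size
    probs = counts / counts.sum()
    nz = probs[probs > 0]
    perplexity = float(np.exp(-np.sum(nz * np.log(nz))))
    return usage, perplexity

# Collapsed example: 512 codes, but only 3 ever used.
codes = np.random.default_rng(0).choice([7, 7, 7, 42, 99], size=10_000)
usage, ppl = codebook_stats(codes, codebook_size=512)
print(f"usage={usage:.1%}  perplexity={ppl:.1f}")
```

A healthy codebook has perplexity within the same order of magnitude as the codebook size; here perplexity is near 3, confirming collapse.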

Anti-Patterns

| Anti-Pattern | What It Looks Like | What to Do Instead |
| ------------ | ------------------ | ------------------ |
| Symptom Patching | Fixing the symptom without understanding the cause. "Loss is NaN → add gradient clipping" is a band-aid if the real issue is a log(0) in the loss. | Trace the NaN to its source. The clamp masks the real problem. |
| Shotgun Debugging | Changing 5 things at once to see if something works. Makes it impossible to know which change actually fixed the issue. | Change one thing. Verify. Then change the next. |
| Displaced Fix | Fixing in module B what's actually broken in module A. The regression-guard catches these, but avoid creating them. | Fix the code where the bug actually is, not where it's convenient. |
| Ignoring 3-Strike | "Let me try one more variation of the same approach." | Stop. Re-examine assumptions. Change direction. |

Delegation Rules

When the debugging investigation crosses into a different domain, delegate to the appropriate agent:

| If the failure involves... | Delegate to |
| -------------------------- | ----------- |
| A known issue or documented failure mode | failure-mode-researcher (internet search) |
| Subtle data flow or shape propagation | data-flow-tracer |
| JAX-specific semantics (vmap, scan, jit) | jax-logic-auditor |
| Paper/algorithm misalignment | paper-alignment-auditor |