Silent Bug Detector
Finds bugs that don't crash — the code runs, the loss decreases, training completes, but the results are wrong. A hard crash is a gift. A silent bug is a trap.
Overview
The Silent Bug Detector systematically scans code for every known pattern of silent failure in AI and robotics research. These are the most dangerous bugs because they waste compute and produce misleading results that get reported in papers or deployed to hardware.
The detector doesn't wait for symptoms — it checks proactively against a catalog of 11 failure categories, each representing a class of bugs that has burned real research time.
| Property | Details |
|---|---|
| Tools | Read, Grep, Glob (read-only) |
| Auto-Dispatch | Yes — after implementing or modifying model components, loss functions, data pipelines, or training loops |
| Trigger | Any code change touching model, loss, data, or training logic |
Not crashing is not evidence of correctness. Every bug in this catalog produces code that runs successfully. The absence of errors means nothing. "The loss is decreasing" proves nothing — a model can minimize a wrong loss, overfit to leaked data, or learn a degenerate solution while the loss curve looks healthy.
1. Broadcasting Bugs
The #1 source of silent errors. Two tensors interact with compatible but wrong shapes.
| Pattern | What Happens |
|---|---|
| (B, 1) + (1, T) | Did you mean element-wise or outer product? Check intent. |
| (B, T, F) * (F,) | Broadcasting works, but did you want per-feature scaling or a matmul? |
| Scalar promotion | A tensor that should be (B,) accidentally becomes () after a wrong reduction — then broadcasts against everything silently. |
| Missing keepdims | mean(axis=1) reduces (B, T, F) to (B, F), but if you needed (B, 1, F) for subsequent broadcasting, the missing keepdims=True causes a silent shape shift. |
For every binary operation between two tensors of different ranks, verify the broadcast semantics are intentional.
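A minimal NumPy sketch of two of these failure modes (the shapes and variable names are illustrative, not from any particular codebase):

```python
import numpy as np

B, T = 4, 8
per_sample = np.zeros((B, 1))  # e.g., a per-sample bias
per_step = np.zeros((1, T))    # e.g., a per-timestep offset

# Legal broadcast, but the result is an outer combination, not element-wise:
outer = per_sample + per_step
assert outer.shape == (B, T)

# Scalar promotion: a wrong reduction collapses (B,) to a 0-d scalar,
# which then broadcasts against anything without complaint.
weights = np.ones(B).mean()    # meant np.ones(B), got shape ()
assert weights.shape == ()
assert (weights * outer).shape == (B, T)  # still runs, silently wrong
```

Neither line raises an error; only an explicit shape assertion surfaces the mismatch.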
2. Wrong Reduction Axis
Loss looks fine, gradients flow, but the learning signal is wrong.
| Pattern | What Happens |
|---|---|
| Mean vs sum over batch | loss.mean() vs loss.sum() changes the effective learning rate by a factor of batch_size. Many papers don't specify which they use. |
| Mean over wrong axis | loss.mean(axis=0) averages over batch (correct), but loss.mean(axis=1) averages over time — should time steps be weighted equally? |
| Reduction inside vs outside vmap/scan | Reducing before or after a JAX transform changes what gets averaged. Check that reductions happen at the intended scope. |
| Per-element vs per-sample loss | Is the loss computed per element then averaged, or per sample? This affects how the model weighs short vs long sequences. |
For every .mean(), .sum(), .max(), verify which axis is being reduced and whether that matches the paper/intent.
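A small NumPy sketch of the batch-size coupling and the axis choice (batch and sequence sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
B, T = 32, 10
per_sample_loss = rng.random(B)

# mean vs sum over the batch: same minimizer, but gradients differ by a
# factor of B, which silently rescales the effective learning rate.
assert np.isclose(per_sample_loss.sum(), per_sample_loss.mean() * B)

# Axis choice decides what gets weighted equally:
per_element = rng.random((B, T))
per_sample = per_element.mean(axis=1)  # (B,): timesteps averaged within a sequence
per_step = per_element.mean(axis=0)    # (T,): batch averaged per timestep
assert per_sample.shape == (B,) and per_step.shape == (T,)
```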
3. Data Leakage
Model gets access to information it shouldn't have — artificially inflates metrics.
| Pattern | What Happens |
|---|---|
| Future leaking into past | In sequential/temporal data, the model at time t sees information from t+1 or later. Check attention masks, convolution padding (should be causal), and data loading. |
| Train leaking into val/test | Normalization statistics (mean, std) computed on the full dataset including val/test. Augmentations using parameters fit on the full dataset. |
| Target leaking into input | The label or ground truth accidentally included in input features. Off-by-one tensor slicing can include the target. |
| Cross-episode leakage (RL) | Observation at the start of a new episode contains information from the previous episode. Check auto-reset logic and observation buffers. |
At every point where data is filtered, masked, or split, verify that no future/test/target information crosses the boundary.
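The train-into-val/test case can be made concrete with a short NumPy sketch (the split point and distribution are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(5.0, 2.0, size=1000)
train, test = data[:800], data[800:]

# Correct: fit normalization statistics on train only, apply to both splits.
mu, sigma = train.mean(), train.std()
test_norm = (test - mu) / sigma

# Leaky: statistics computed on the full dataset include test information.
mu_leak, sigma_leak = data.mean(), data.std()
test_leak = (test - mu_leak) / sigma_leak

# The two normalizations differ: the leaky version quietly shifts the test set.
assert not np.allclose(test_norm, test_leak)
```

Both versions run, both produce plausible-looking normalized values; only the first is honest.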
4. Normalization Bugs
Statistics look reasonable, outputs are in range, but they're wrong.
| Pattern | What Happens |
|---|---|
| Stale running statistics | BatchNorm/LayerNorm running mean/var not updated because model is stuck in eval mode, or updated at the wrong point in the pipeline. |
| Wrong normalization axis | LayerNorm over (features,) vs (time, features) vs (batch, features) — all run without error, but only one is correct. |
| Normalizing after clipping | If you clip then normalize, statistics are biased. If you normalize then clip, the distribution is distorted. Check the order. |
| Train vs eval normalization | Eval is supposed to use running statistics but actually computes batch stats from a single sample, so eval outputs silently depend on batch composition. |
| RL observation normalization | Running stats computed across envs but applied per-env, or vice versa. Stats not updated after initial warmup. |
For every normalization operation, verify: correct axis, correct mode (train vs eval), correct statistics source.
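The wrong-axis case can be demonstrated with a hand-rolled LayerNorm in NumPy (no learned scale/shift; the tensor layout (batch, time, features) is assumed):

```python
import numpy as np

def layernorm(x, axis):
    # Normalize x to zero mean / unit variance along `axis`.
    mu = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mu) / np.sqrt(var + 1e-5)

rng = np.random.default_rng(2)
x = rng.normal(size=(4, 16, 32))  # (batch, time, features)

over_features = layernorm(x, axis=-1)       # each token normalized independently
over_time_feat = layernorm(x, axis=(1, 2))  # each sample normalized across time+features

# Both run without error and produce the same output shape; only one matches the paper.
assert over_features.shape == over_time_feat.shape == x.shape
assert not np.allclose(over_features, over_time_feat)
```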
5. Off-by-One and Indexing Errors
The most classic bug class — especially dangerous in sequential data.
| Pattern | What Happens |
|---|---|
| Temporal indexing | obs[t] paired with action[t] and reward[t] — is reward[t] the reward FOR action[t] or AFTER action[t]? Off-by-one means wrong credit assignment. |
| Sequence slicing | x[:, :-1] for input and x[:, 1:] for target — does this match your autoregressive convention? Is the shift correct? |
| GAE discount indexing | advantages[t] = delta[t] + gamma * lambda * advantages[t+1] — is the scan iterating in the right direction (backward from T to 0)? |
| Done mask application | next_value * (1 - done[t]) — is done[t] the done flag for the transition that produced obs[t] or the one starting from obs[t]? |
For every index [t], [t+1], [:-1], [1:], verify the temporal semantics explicitly. Draw out a timeline if needed.
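A sketch of the GAE recursion with the backward iteration and done-mask handling made explicit (this assumes the common convention where values carries a bootstrap entry at index T; gamma and lambda are illustrative):

```python
import numpy as np

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # rewards/dones have length T; values has length T+1 (bootstrap appended).
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):  # must iterate backward, from T-1 down to 0
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        running = delta + gamma * lam * nonterminal * running
        adv[t] = running
    return adv

rewards = np.array([1.0, 1.0, 1.0])
values = np.zeros(4)
dones = np.array([0.0, 0.0, 1.0])
adv = gae(rewards, values, dones)
# The terminal step must not bootstrap past the done: adv[2] is just r[2].
assert np.isclose(adv[2], 1.0)
assert np.isclose(adv[1], 1.0 + 0.99 * 0.95 * 1.0)
```

Iterating forward instead of backward also runs without error and produces plausible numbers, which is exactly why it belongs in this catalog.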
6. Gradient Flow Bugs
Gradients technically exist, but they don't carry the right signal.
| Pattern | What Happens |
|---|---|
| Accidental stop_gradient | A .detach() or jax.lax.stop_gradient on a tensor that should receive gradients. Model still trains via other paths but the blocked path doesn't learn. |
| Missing stop_gradient | Gradients flow through the target network, codebook, or value baseline when they shouldn't. Optimization is wrong but training continues. |
| Dead branches | A conditional branch (e.g., temperature annealed to 0 turns a softmax into an argmax) that cuts off gradients. The dead branch never learns. |
| Straight-through estimator bugs | Forward pass uses quantized value while backward pass uses un-quantized value (or vice versa). Check both directions. |
For every tensor in the loss, verify whether gradient should flow through it. For every stop_gradient/detach, verify it's intentional.
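The missing-stop_gradient case can be checked without any autodiff framework, using finite differences on a toy target-chasing loss (the 2x prediction function and values are invented for illustration; "detaching" is modeled by holding the target's copy of w fixed):

```python
def td_loss(w, w_target):
    # Toy TD-style loss: the prediction chases a target built from the same weight.
    # stop_gradient means we differentiate w only, holding w_target fixed.
    pred = 2.0 * w
    target = 2.0 * w_target + 1.0
    return (pred - target) ** 2

w, eps = 0.5, 1e-6

# Correct (target detached): only the prediction path sees the perturbation.
g_detached = (td_loss(w + eps, w) - td_loss(w - eps, w)) / (2 * eps)

# Buggy (gradient flows through the target too): perturb both copies of w.
g_leaky = (td_loss(w + eps, w + eps) - td_loss(w - eps, w - eps)) / (2 * eps)

assert abs(g_detached - (-4.0)) < 1e-3  # 2 * (pred - target) * 2 = 2 * (-1) * 2
assert abs(g_leaky) < 1e-3              # gradient vanishes: pred chases a moving target
```

The leaky version yields a zero gradient because the prediction and target move together, which is precisely the "optimization is wrong but training continues" failure above.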
7. Loss Function Bugs
Loss decreases during training, but it's optimizing the wrong thing.
| Pattern | What Happens |
|---|---|
| Wrong sign | Minimizing when you should maximize (or vice versa) on a specific loss term. Model converges to the opposite of what you want. |
| Wrong KL direction | KL(q||p) vs KL(p||q) — both are valid but have different mode-seeking vs mode-covering behavior. |
| Missing loss term | A regularization term (KL, commitment loss, entropy bonus) defined but not added to the total loss. Model trains fine without it but produces worse results. |
| Loss term always zero | A term computed but evaluating to zero due to a bug (masking that zeros everything, a coefficient of 0.0 instead of 1.0). |
| Loss not reaching right parameters | Loss computed on output A but only parameters B have requires_grad. Loss decreases by chance but intended parameters don't learn. |
Log each loss component individually. Verify all components are non-zero and contribute to the total. Verify the gradient of the total loss w.r.t. each parameter group.
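A sketch of that per-component check, with a deliberately dead term (the component names and values are made up):

```python
import numpy as np

recon = 0.42
kl = 0.07
mask = np.zeros((4, 10))  # BUG: an all-zero mask silently kills this term
masked_term = float((np.ones((4, 10)) * mask).mean())

components = {"recon": recon, "kl": kl, "masked": masked_term}
total = sum(components.values())

# Check each component individually, not just the total:
dead = [name for name, value in components.items() if value == 0.0]
assert dead == ["masked"]  # the dead term is caught before wasting a training run
```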
8. Train/Eval Mode Bugs
Model behaves differently in train vs eval, but the switch is wrong.
| Pattern | What Happens |
|---|---|
| Forgot eval mode | Dropout still active during evaluation produces noisy metrics. BatchNorm using batch stats instead of running stats gives inconsistent evaluation. |
| Forgot train mode back | After evaluation, model stays in eval mode for the next epoch. Dropout disabled, BatchNorm frozen — training degrades silently. |
| Stochastic components uncontrolled | Eval should be deterministic but random sampling is still active (e.g., VAE sampling without switching to mode/mean). |
Verify model.train() is called before training and model.eval() before evaluation. Check that stochastic components (dropout, sampling, noise injection) respect the mode.
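The dropout half of this can be sketched framework-free in NumPy (inverted dropout with an explicit training flag; p and shapes are arbitrary):

```python
import numpy as np

def dropout(x, p, rng, training):
    # Train: randomly zero units and rescale by 1/(1-p). Eval: identity, no noise.
    if not training:
        return x
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)

rng = np.random.default_rng(4)
x = np.ones((2, 8))

train_out = dropout(x, p=0.5, rng=rng, training=True)
eval_out = dropout(x, p=0.5, rng=rng, training=False)

# Eval must be deterministic and unchanged; train output is noisy/rescaled.
assert np.array_equal(eval_out, x)
assert not np.array_equal(train_out, x)
```

If the training flag is stuck on, evaluation metrics inherit this noise with no error raised.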
9. Random Seed / Reproducibility Bugs
Results vary between runs, or worse, are correlated when they shouldn't be.
| Pattern | What Happens |
|---|---|
| Same seed for different purposes | Using the same RNG key for weight initialization and data shuffling — produces correlated initialization and data ordering. |
| Seed not split in parallel envs | All environments get the same seed — identical rollouts, no diversity, model overfits to one trajectory. |
| Seed reuse after checkpoint | Restoring from checkpoint but not restoring RNG state — first epoch after resume has same randomness as first epoch of training. |
Verify every random call uses a unique, properly-split key. Verify RNG state is saved/restored with checkpoints.
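One concrete way to get properly-split streams in NumPy is SeedSequence.spawn, which derives decorrelated child seeds from one root (the root seed and child counts here are arbitrary):

```python
import numpy as np

# Spawn independent child streams from one root seed instead of reusing it.
root = np.random.SeedSequence(42)
init_seq, shuffle_seq, env_seq = root.spawn(3)

init_rng = np.random.default_rng(init_seq)        # weight initialization
shuffle_rng = np.random.default_rng(shuffle_seq)  # data shuffling

# The streams are decorrelated; reusing the root seed for both would not be.
assert not np.allclose(init_rng.random(5), shuffle_rng.random(5))

# Parallel envs each get their own child, never the same seed:
env_rngs = [np.random.default_rng(s) for s in env_seq.spawn(4)]
first_draws = [r.random() for r in env_rngs]
assert len(set(first_draws)) == 4  # no identical rollout streams
```

JAX's jax.random.split serves the same role there: one split per consumer, never the same key twice.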
10. Action and Observation Space Bugs
Robotics-specific: hardware doesn't crash, but the robot does the wrong thing.
| Pattern | What Happens |
|---|---|
| Wrong action scaling | Policy outputs in [-1, 1] but actuators expect radians. Missing denormalization produces small, useless movements or saturated commands. |
| Wrong coordinate frame | Action computed in world frame but applied in body frame (or vice versa). Movement direction is wrong but magnitude is plausible. |
| Observation ordering mismatch | Observation vector is [pos, vel, orientation] but model expects [orientation, pos, vel]. Shapes match, values are reasonable, semantics are scrambled. |
| Stale observations | Observation buffer not updated between steps — model sees the same observation repeatedly, policy looks stuck but there's no error. |
At every sim/real boundary, verify: correct scaling, correct frame, correct ordering, correct freshness.
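The scaling boundary can be pinned down with an explicit denormalization helper (the joint limits below are hypothetical):

```python
import numpy as np

def denormalize_action(a, low, high):
    # Map a policy output in [-1, 1] onto the actuator's actual command range.
    a = np.clip(a, -1.0, 1.0)
    return low + (a + 1.0) * 0.5 * (high - low)

# Hypothetical joint limits in radians:
low = np.array([-1.57, -0.5])
high = np.array([1.57, 2.0])

action = np.array([0.0, 1.0])  # raw policy output in [-1, 1]
cmd = denormalize_action(action, low, high)
assert np.allclose(cmd, [0.0, 2.0])  # midpoint of joint 0, upper limit of joint 1
```

Skipping this helper and sending the raw [-1, 1] output directly still moves the robot, just uselessly, which is what makes the bug silent.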
11. Masking Bugs
Masks are applied but they mask the wrong thing.
| Pattern | What Happens |
|---|---|
| Inverted mask | Using mask where 1 - mask was intended (or True/False flipped). Masked tokens get attention, unmasked tokens are ignored. |
| Mask not applied to loss | Padding tokens contribute to the loss — model wastes capacity learning to predict padding. |
| Mask at wrong scope | Mask computed at sequence level but applied at token level, or applied before padding instead of after. |
| Mask shape + broadcasting | Mask is (B, T) but attention is (B, H, T, T) — broadcasting may expand the mask incorrectly. |
For every mask, verify: which values are masked vs unmasked, that the mask reaches all places it should (loss, attention, metrics), and that broadcasting expands it correctly.
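The shape-and-broadcasting case can be sketched in NumPy; expanding the padding mask explicitly makes the intended axes visible (batch, head, and sequence sizes are illustrative):

```python
import numpy as np

B, H, T = 2, 4, 5
scores = np.zeros((B, H, T, T))          # attention logits
pad_mask = np.array([[1, 1, 1, 0, 0],    # 1 = real token, 0 = padding
                     [1, 1, 1, 1, 0]])   # shape (B, T)

# Explicitly expand to (B, 1, 1, T) so the mask broadcasts over heads and
# query positions while masking KEY positions. Without the explicit expansion,
# the broadcast can land on the wrong axes (or error only for some shapes).
expanded = pad_mask[:, None, None, :]
masked = np.where(expanded == 1, scores, -np.inf)

assert masked.shape == (B, H, T, T)
# Every query in sample 0 must ignore keys 3 and 4:
assert np.all(np.isinf(masked[0, :, :, 3:]))
```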
Audit Process
The detector follows a systematic four-step process:
- Full catalog scan — go through every category above systematically. Don't skip categories because "it's probably fine."
- Trace actual values — don't just check that the operation exists. Verify what it actually computes given the shapes and values in this specific codebase.
- Check interactions — many silent bugs only manifest when two individually-correct components interact wrong (e.g., normalization followed by clipping, correct mask applied to wrong tensor).
- Propose verification tests — for each potential bug, suggest a small test that would catch it (e.g., assert loss_component_kl > 0, or assert obs[t+1] != obs[t] after env.step)
Output Format
The detector produces a structured report with three sections:
- Bugs Found — each with location, category, description of wrong vs correct behavior, severity, specific fix, and verification test
- Suspicious Patterns — things that look off but need runtime confirmation, each with a specific check to confirm or rule out
- Verified Clean — categories confirmed to not have bugs in this code, with reasoning
Shapes matching is not correctness. Two tensors of shape (64, 128) can mean completely different things. Always verify semantic meaning, not just numerical compatibility. Check the boring stuff — most silent bugs are wrong axis in .mean(), flipped mask, off-by-one in time index.