Silent Bug Detector
Finds bugs that don't crash — the code runs, the loss decreases, training completes, but the results are wrong. A hard crash is a gift. A silent bug is a trap.
Overview
The Silent Bug Detector systematically scans code for every known pattern of silent failure in AI and robotics research. These are the most dangerous bugs because they waste compute and produce misleading results that get reported in papers or deployed to hardware.
The detector doesn't wait for symptoms — it checks proactively against a catalog of 11 failure categories, each representing a class of bugs that has burned real research time.
| Property | Details |
|---|---|
| Tools | Read, Grep, Glob (read-only) |
| Auto-Dispatch | Yes — after implementing or modifying model components, loss functions, data pipelines, or training loops |
| Trigger | Any code change touching model, loss, data, or training logic |
Not crashing is not evidence of correctness. Every bug in this catalog produces code that runs successfully. The absence of errors means nothing. "The loss is decreasing" proves nothing — a model can minimize a wrong loss, overfit to leaked data, or learn a degenerate solution while the loss curve looks healthy.
1. Broadcasting Bugs
The #1 source of silent errors. Two tensors interact with compatible but wrong shapes.
| Pattern | What Happens |
|---|---|
| (B, 1) + (1, T) | Did you mean element-wise or outer product? Check intent. |
| (B, T, F) * (F,) | Broadcasting works, but did you want per-feature scaling or a matmul? |
| Scalar promotion | A tensor that should be (B,) accidentally becomes () after a wrong reduction — then broadcasts against everything silently. |
| Missing keepdims | mean(axis=1) reduces (B, T, F) to (B, F), but if you needed (B, 1, F) for subsequent broadcasting, the missing keepdims=True causes a silent shape shift. |
For every binary operation between two tensors of different ranks, verify the broadcast semantics are intentional.
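A minimal NumPy sketch of two of these failure modes (the shapes and variable names are illustrative, not from any particular codebase):

```python
import numpy as np

B, T = 4, 8
per_sample = np.zeros((B, 1))  # e.g., a per-sample bias
per_step = np.zeros((1, T))    # e.g., a per-timestep offset

# Legal broadcast, but the result is an outer combination, not element-wise:
outer = per_sample + per_step
assert outer.shape == (B, T)

# Scalar promotion: a wrong reduction collapses (B,) to a 0-d scalar,
# which then broadcasts against anything without complaint.
weights = np.ones(B).mean()    # meant np.ones(B), got shape ()
assert weights.shape == ()
assert (weights * outer).shape == (B, T)  # still runs, silently wrong
```

Neither line raises an error; only an explicit shape assertion surfaces the mismatch.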
2. Wrong Reduction Axis
Loss looks fine, gradients flow, but the learning signal is wrong.
| Pattern | What Happens |
|---|---|
| Mean vs sum over batch | loss.mean() vs loss.sum() changes the effective learning rate by a factor of batch_size. Many papers don't specify which they use. |
| Mean over wrong axis | loss.mean(axis=0) averages over batch (correct), but loss.mean(axis=1) averages over time — should time steps be weighted equally? |
| Reduction inside vs outside vmap/scan | Reducing before or after a JAX transform changes what gets averaged. Check that reductions happen at the intended scope. |
| Per-element vs per-sample loss | Is the loss computed per element then averaged, or per sample? This affects how the model weighs short vs long sequences. |
For every .mean(), .sum(), .max(), verify which axis is being reduced and whether that matches the paper/intent.
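A small NumPy sketch of the batch-size coupling and the axis choice (batch and sequence sizes here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
B, T = 32, 10
per_sample_loss = rng.random(B)

# mean vs sum over the batch: same minimizer, but gradients differ by a
# factor of B, which silently rescales the effective learning rate.
assert np.isclose(per_sample_loss.sum(), per_sample_loss.mean() * B)

# Axis choice decides what gets weighted equally:
per_element = rng.random((B, T))
per_sample = per_element.mean(axis=1)  # (B,): timesteps averaged within a sequence
per_step = per_element.mean(axis=0)    # (T,): batch averaged per timestep
assert per_sample.shape == (B,) and per_step.shape == (T,)
```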
3. Data Leakage
Model gets access to information it shouldn't have — artificially inflates metrics.
| Pattern | What Happens |
|---|---|
| Future leaking into past | In sequential/temporal data, the model at time t sees information from t+1 or later. Check attention masks, convolution padding (should be causal), and data loading. |
| Train leaking into val/test | Normalization statistics (mean, std) computed on the full dataset including val/test. Augmentations using parameters fit on the full dataset. |
| Target leaking into input | The label or ground truth accidentally included in input features. Off-by-one tensor slicing can include the target. |
| Cross-episode leakage (RL) | Observation at the start of a new episode contains information from the previous episode. Check auto-reset logic and observation buffers. |
At every point where data is filtered, masked, or split, verify that no future/test/target information crosses the boundary.
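The train-into-val/test case can be made concrete with a short NumPy sketch (the split point and distribution are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(5.0, 2.0, size=1000)
train, test = data[:800], data[800:]

# Correct: fit normalization statistics on train only, apply to both splits.
mu, sigma = train.mean(), train.std()
test_norm = (test - mu) / sigma

# Leaky: statistics computed on the full dataset include test information.
mu_leak, sigma_leak = data.mean(), data.std()
test_leak = (test - mu_leak) / sigma_leak

# The two normalizations differ: the leaky version quietly shifts the test set.
assert not np.allclose(test_norm, test_leak)
```

Both versions run, both produce plausible-looking normalized values; only the first is honest.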
4. Normalization Bugs
Statistics look reasonable, outputs are in range, but they're wrong.
| Pattern | What Happens |
|---|---|
| Stale running statistics | BatchNorm/LayerNorm running mean/var not updated because model is stuck in eval mode, or updated at the wrong point in the pipeline. |
| Wrong normalization axis | LayerNorm over (features,) vs (time, features) vs (batch, features) — all run without error, but only one is correct. |
| Normalizing after clipping | If you clip then normalize, statistics are biased. If you normalize then clip, the distribution is distorted. Check the order. |
| Train vs eval normalization | Eval is supposed to use running statistics but actually computes batch stats from a single sample, so eval outputs silently depend on batch composition. |
| RL observation normalization | Running stats computed across envs but applied per-env, or vice versa. Stats not updated after initial warmup. |
For every normalization operation, verify: correct axis, correct mode (train vs eval), correct statistics source.
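The wrong-axis case can be demonstrated with a hand-rolled LayerNorm in NumPy (no learned scale/shift; the tensor layout (batch, time, features) is assumed):

```python
import numpy as np

def layernorm(x, axis):
    # Normalize x to zero mean / unit variance along `axis`.
    mu = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mu) / np.sqrt(var + 1e-5)

rng = np.random.default_rng(2)
x = rng.normal(size=(4, 16, 32))  # (batch, time, features)

over_features = layernorm(x, axis=-1)       # each token normalized independently
over_time_feat = layernorm(x, axis=(1, 2))  # each sample normalized across time+features

# Both run without error and produce the same output shape; only one matches the paper.
assert over_features.shape == over_time_feat.shape == x.shape
assert not np.allclose(over_features, over_time_feat)
```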
5. Off-by-One and Indexing Errors
The most classic bug class — especially dangerous in sequential data.
| Pattern | What Happens |
|---|---|
| Temporal indexing | obs[t] paired with action[t] and reward[t] — is reward[t] the reward FOR action[t] or AFTER action[t]? Off-by-one means wrong credit assignment. |
| Sequence slicing | x[:, :-1] for input and x[:, 1:] for target — does this match your autoregressive convention? Is the shift correct? |
| GAE discount indexing | advantages[t] = delta[t] + gamma * lambda * advantages[t+1] — is the scan iterating in the right direction (backward from T to 0)? |
| Done mask application | next_value * (1 - done[t]) — is done[t] the done flag for the transition that produced obs[t] or the one starting from obs[t]? |
For every index [t], [t+1], [:-1], [1:], verify the temporal semantics explicitly. Draw out a timeline if needed.
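A sketch of the GAE recursion with the backward iteration and done-mask handling made explicit (this assumes the common convention where values carries a bootstrap entry at index T; gamma and lambda are illustrative):

```python
import numpy as np

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # rewards/dones have length T; values has length T+1 (bootstrap appended).
    T = len(rewards)
    adv = np.zeros(T)
    running = 0.0
    for t in reversed(range(T)):  # must iterate backward, from T-1 down to 0
        nonterminal = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        running = delta + gamma * lam * nonterminal * running
        adv[t] = running
    return adv

rewards = np.array([1.0, 1.0, 1.0])
values = np.zeros(4)
dones = np.array([0.0, 0.0, 1.0])
adv = gae(rewards, values, dones)
# The terminal step must not bootstrap past the done: adv[2] is just r[2].
assert np.isclose(adv[2], 1.0)
assert np.isclose(adv[1], 1.0 + 0.99 * 0.95 * 1.0)
```

Iterating forward instead of backward also runs without error and produces plausible numbers, which is exactly why it belongs in this catalog.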
6. Gradient Flow Bugs
Gradients technically exist, but they don't carry the right signal.
| Pattern | What Happens |
|---|---|
| Accidental stop_gradient | A .detach() or jax.lax.stop_gradient on a tensor that should receive gradients. Model still trains via other paths but the blocked path doesn't learn. |
| Missing stop_gradient | Gradients flow through the target network, codebook, or value baseline when they shouldn't. Optimization is wrong but training continues. |
| Dead branches | A conditional branch (e.g., temperature annealed to 0 turns a softmax into an argmax) that cuts off gradients. The dead branch never learns. |
| Straight-through estimator bugs | Forward pass uses quantized value while backward pass uses un-quantized value (or vice versa). Check both directions. |
For every tensor in the loss, verify whether gradient should flow through it. For every stop_gradient/detach, verify it's intentional.
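The missing-stop_gradient case can be checked without any autodiff framework, using finite differences on a toy target-chasing loss (the 2x prediction function and values are invented for illustration; "detaching" is modeled by holding the target's copy of w fixed):

```python
def td_loss(w, w_target):
    # Toy TD-style loss: the prediction chases a target built from the same weight.
    # stop_gradient means we differentiate w only, holding w_target fixed.
    pred = 2.0 * w
    target = 2.0 * w_target + 1.0
    return (pred - target) ** 2

w, eps = 0.5, 1e-6

# Correct (target detached): only the prediction path sees the perturbation.
g_detached = (td_loss(w + eps, w) - td_loss(w - eps, w)) / (2 * eps)

# Buggy (gradient flows through the target too): perturb both copies of w.
g_leaky = (td_loss(w + eps, w + eps) - td_loss(w - eps, w - eps)) / (2 * eps)

assert abs(g_detached - (-4.0)) < 1e-3  # 2 * (pred - target) * 2 = 2 * (-1) * 2
assert abs(g_leaky) < 1e-3              # gradient vanishes: pred chases a moving target
```

The leaky version yields a zero gradient because the prediction and target move together, which is precisely the "optimization is wrong but training continues" failure above.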
7. Loss Function Bugs
Loss decreases during training, but it's optimizing the wrong thing.
| Pattern | What Happens |
|---|---|
| Wrong sign | Minimizing when you should maximize (or vice versa) on a specific loss term. Model converges to the opposite of what you want. |
| Wrong KL direction | KL(q||p) vs KL(p||q) — both are valid but have different mode-seeking vs mode-covering behavior. |
| Missing loss term | A regularization term (KL, commitment loss, entropy bonus) defined but not added to the total loss. Model trains fine without it but produces worse results. |
| Loss term always zero | A term computed but evaluating to zero due to a bug (masking that zeros everything, a coefficient of 0.0 instead of 1.0). |
| Loss not reaching right parameters | Loss computed on output A but only parameters B have requires_grad. Loss decreases by chance but intended parameters don't learn. |
Log each loss component individually. Verify all components are non-zero and contribute to the total. Verify the gradient of the total loss w.r.t. each parameter group.
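A sketch of that per-component check, with a deliberately dead term (the component names and values are made up):

```python
import numpy as np

recon = 0.42
kl = 0.07
mask = np.zeros((4, 10))  # BUG: an all-zero mask silently kills this term
masked_term = float((np.ones((4, 10)) * mask).mean())

components = {"recon": recon, "kl": kl, "masked": masked_term}
total = sum(components.values())

# Check each component individually, not just the total:
dead = [name for name, value in components.items() if value == 0.0]
assert dead == ["masked"]  # the dead term is caught before wasting a training run
```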
8. Train/Eval Mode Bugs
Model behaves differently in train vs eval, but the switch is wrong.
| Pattern | What Happens |
|---|---|
| Forgot eval mode | Dropout still active during evaluation produces noisy metrics. BatchNorm using batch stats instead of running stats gives inconsistent evaluation. |
| Forgot train mode back | After evaluation, model stays in eval mode for the next epoch. Dropout disabled, BatchNorm frozen — training degrades silently. |
| Stochastic components uncontrolled | Eval should be deterministic but random sampling is still active (e.g., VAE sampling without switching to mode/mean). |
Verify model.train() is called before training and model.eval() before evaluation. Check that stochastic components (dropout, sampling, noise injection) respect the mode.
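The dropout half of this can be sketched framework-free in NumPy (inverted dropout with an explicit training flag; p and shapes are arbitrary):

```python
import numpy as np

def dropout(x, p, rng, training):
    # Train: randomly zero units and rescale by 1/(1-p). Eval: identity, no noise.
    if not training:
        return x
    keep = rng.random(x.shape) >= p
    return x * keep / (1.0 - p)

rng = np.random.default_rng(4)
x = np.ones((2, 8))

train_out = dropout(x, p=0.5, rng=rng, training=True)
eval_out = dropout(x, p=0.5, rng=rng, training=False)

# Eval must be deterministic and unchanged; train output is noisy/rescaled.
assert np.array_equal(eval_out, x)
assert not np.array_equal(train_out, x)
```

If the training flag is stuck on, evaluation metrics inherit this noise with no error raised.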
9. Random Seed / Reproducibility Bugs
Results vary between runs, or worse, are correlated when they shouldn't be.
| Pattern | What Happens |
|---|---|
| Same seed for different purposes | Using the same RNG key for weight initialization and data shuffling — produces correlated initialization and data ordering. |
| Seed not split in parallel envs | All environments get the same seed — identical rollouts, no diversity, model overfits to one trajectory. |
| Seed reuse after checkpoint | Restoring from checkpoint but not restoring RNG state — first epoch after resume has same randomness as first epoch of training. |
Verify every random call uses a unique, properly-split key. Verify RNG state is saved/restored with checkpoints.
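One concrete way to get properly-split streams in NumPy is SeedSequence.spawn, which derives decorrelated child seeds from one root (the root seed and child counts here are arbitrary):

```python
import numpy as np

# Spawn independent child streams from one root seed instead of reusing it.
root = np.random.SeedSequence(42)
init_seq, shuffle_seq, env_seq = root.spawn(3)

init_rng = np.random.default_rng(init_seq)        # weight initialization
shuffle_rng = np.random.default_rng(shuffle_seq)  # data shuffling

# The streams are decorrelated; reusing the root seed for both would not be.
assert not np.allclose(init_rng.random(5), shuffle_rng.random(5))

# Parallel envs each get their own child, never the same seed:
env_rngs = [np.random.default_rng(s) for s in env_seq.spawn(4)]
first_draws = [r.random() for r in env_rngs]
assert len(set(first_draws)) == 4  # no identical rollout streams
```

JAX's jax.random.split serves the same role there: one split per consumer, never the same key twice.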
10. Action and Observation Space Bugs
Robotics-specific: hardware doesn't crash, but the robot does the wrong thing.
| Pattern | What Happens |
|---|---|
| Wrong action scaling | Policy outputs in [-1, 1] but actuators expect radians. Missing denormalization produces small, useless movements or saturated commands. |
| Wrong coordinate frame | Action computed in world frame but applied in body frame (or vice versa). Movement direction is wrong but magnitude is plausible. |
| Observation ordering mismatch | Observation vector is [pos, vel, orientation] but model expects [orientation, pos, vel]. Shapes match, values are reasonable, semantics are scrambled. |
| Stale observations | Observation buffer not updated between steps — model sees the same observation repeatedly, policy looks stuck but there's no error. |
At every sim/real boundary, verify: correct scaling, correct frame, correct ordering, correct freshness.
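The scaling boundary can be pinned down with an explicit denormalization helper (the joint limits below are hypothetical):

```python
import numpy as np

def denormalize_action(a, low, high):
    # Map a policy output in [-1, 1] onto the actuator's actual command range.
    a = np.clip(a, -1.0, 1.0)
    return low + (a + 1.0) * 0.5 * (high - low)

# Hypothetical joint limits in radians:
low = np.array([-1.57, -0.5])
high = np.array([1.57, 2.0])

action = np.array([0.0, 1.0])  # raw policy output in [-1, 1]
cmd = denormalize_action(action, low, high)
assert np.allclose(cmd, [0.0, 2.0])  # midpoint of joint 0, upper limit of joint 1
```

Skipping this helper and sending the raw [-1, 1] output directly still moves the robot, just uselessly, which is what makes the bug silent.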
11. Masking Bugs
Masks are applied but they mask the wrong thing.
| Pattern | What Happens |
|---|---|
| Inverted mask | Using mask where 1 - mask was intended (or True/False flipped). Masked tokens get attention, unmasked tokens are ignored. |
| Mask not applied to loss | Padding tokens contribute to the loss — model wastes capacity learning to predict padding. |
| Mask at wrong scope | Mask computed at sequence level but applied at token level, or applied before padding instead of after. |
| Mask shape + broadcasting | Mask is (B, T) but attention is (B, H, T, T) — broadcasting may expand the mask incorrectly. |
For every mask, verify: which values are masked vs unmasked, that the mask reaches all places it should (loss, attention, metrics), and that broadcasting expands it correctly.
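The shape-and-broadcasting case can be sketched in NumPy; expanding the padding mask explicitly makes the intended axes visible (batch, head, and sequence sizes are illustrative):

```python
import numpy as np

B, H, T = 2, 4, 5
scores = np.zeros((B, H, T, T))          # attention logits
pad_mask = np.array([[1, 1, 1, 0, 0],    # 1 = real token, 0 = padding
                     [1, 1, 1, 1, 0]])   # shape (B, T)

# Explicitly expand to (B, 1, 1, T) so the mask broadcasts over heads and
# query positions while masking KEY positions. Without the explicit expansion,
# the broadcast can land on the wrong axes (or error only for some shapes).
expanded = pad_mask[:, None, None, :]
masked = np.where(expanded == 1, scores, -np.inf)

assert masked.shape == (B, H, T, T)
# Every query in sample 0 must ignore keys 3 and 4:
assert np.all(np.isinf(masked[0, :, :, 3:]))
```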
Audit Process
The detector follows a systematic four-step process:
- Full catalog scan — go through every category above systematically. Don't skip categories because "it's probably fine."
- Trace actual values — don't just check that the operation exists. Verify what it actually computes given the shapes and values in this specific codebase.
- Check interactions — many silent bugs only manifest when two individually-correct components interact wrong (e.g., normalization followed by clipping, correct mask applied to wrong tensor).
- Propose verification tests — for each potential bug, suggest a small test that would catch it (e.g., assert loss_component_kl > 0, or assert obs[t+1] != obs[t] after env.step)
Output Format
The detector produces a structured report with three sections:
- Bugs Found — each with location, category, description of wrong vs correct behavior, severity, specific fix, and verification test
- Suspicious Patterns — things that look off but need runtime confirmation, each with a specific check to confirm or rule out
- Verified Clean — categories confirmed to not have bugs in this code, with reasoning
Shapes matching is not correctness. Two tensors of shape (64, 128) can mean completely different things. Always verify semantic meaning, not just numerical compatibility. Check the boring stuff — most silent bugs are wrong axis in .mean(), flipped mask, off-by-one in time index.