Silent Bug Detector

Finds bugs that don't crash — the code runs, the loss decreases, training completes, but the results are wrong. A hard crash is a gift. A silent bug is a trap.

Overview

The Silent Bug Detector systematically scans code for every known pattern of silent failure in AI and robotics research. These are the most dangerous bugs because they waste compute and produce misleading results that get reported in papers or deployed to hardware.

The detector doesn't wait for symptoms — it checks proactively against a catalog of 11 failure categories, each representing a class of bugs that has burned real research time.

| Property | Details |
| --- | --- |
| Tools | Read, Grep, Glob (read-only) |
| Auto-Dispatch | Yes — after implementing or modifying model components, loss functions, data pipelines, or training loops |
| Trigger | Any code change touching model, loss, data, or training logic |
Core Principle

If it doesn't crash, it's not safe. Every bug in this catalog produces code that runs successfully. The absence of errors means nothing. "The loss is decreasing" proves nothing — a model can minimize a wrong loss, overfit to leaked data, or learn a degenerate solution while the loss curve looks healthy.

1. Broadcasting Bugs

The #1 source of silent errors. Two tensors interact with compatible but wrong shapes.

| Pattern | What Happens |
| --- | --- |
| `(B, 1) + (1, T)` | Did you mean element-wise or outer product? Check intent. |
| `(B, T, F) * (F,)` | Broadcasting works, but did you want per-feature scaling or a matmul? |
| Scalar promotion | A tensor that should be `(B,)` accidentally becomes `()` after a wrong reduction — then broadcasts against everything silently. |
| Missing `keepdims` | `mean(axis=1)` reduces `(B, T, F)` to `(B, F)`, but if you needed `(B, 1, F)` for subsequent broadcasting, the missing `keepdims=True` causes a silent shape shift. |
Verification

For every binary operation between two tensors of different ranks, verify the broadcast semantics are intentional.
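A minimal NumPy sketch of the first row: both operands are rank-2, the add broadcasts to a full outer sum, and nothing errors. The `checked_add` helper is a hypothetical guard for illustration, not a library API:

```python
import numpy as np

B, T = 4, 8
per_sample = np.ones((B, 1))   # e.g. a per-sample weight, shape (B, 1)
per_step = np.ones((1, T))     # e.g. a per-timestep bias, shape (1, T)

# This broadcasts to a full (B, T) outer sum -- no error, even if you
# intended an element-wise add of two same-shaped tensors.
combined = per_sample + per_step
assert combined.shape == (B, T)

# A cheap guard: assert the shape you intended at the point of use.
def checked_add(a, b, expected_shape):
    out = a + b
    assert out.shape == expected_shape, f"got {out.shape}, expected {expected_shape}"
    return out
```

The assert costs nothing and turns a silent semantic drift into a loud failure.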

2. Wrong Reduction Axis

Loss looks fine, gradients flow, but the learning signal is wrong.

| Pattern | What Happens |
| --- | --- |
| Mean vs sum over batch | `loss.mean()` vs `loss.sum()` changes the effective learning rate by a factor of `batch_size`. Many papers don't specify which they use. |
| Mean over wrong axis | `loss.mean(axis=0)` averages over batch (correct), but `loss.mean(axis=1)` averages over time — should time steps be weighted equally? |
| Reduction inside vs outside `vmap`/`scan` | Reducing before or after a JAX transform changes what gets averaged. Check that reductions happen at the intended scope. |
| Per-element vs per-sample loss | Is the loss computed per element then averaged, or per sample? This affects how the model weighs short vs long sequences. |
Verification

For every .mean(), .sum(), .max(), verify which axis is being reduced and whether that matches the paper/intent.
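A deterministic NumPy sketch of the per-element vs per-sample distinction (the loss values are made up to make the arithmetic obvious):

```python
import numpy as np

# Two samples: one short (1 valid token), one long (4 valid tokens).
loss = np.array([[10.0, 0.0, 0.0, 0.0],
                 [ 1.0, 1.0, 1.0, 1.0]])
mask = np.array([[1.0, 0.0, 0.0, 0.0],
                 [1.0, 1.0, 1.0, 1.0]])

# Per-token mean: every valid token weighs equally -> 14 / 5 = 2.8.
per_token = (loss * mask).sum() / mask.sum()
# Per-sample mean: every SEQUENCE weighs equally -> (10 + 1) / 2 = 5.5.
per_sample = ((loss * mask).sum(axis=1) / mask.sum(axis=1)).mean()

assert np.isclose(per_token, 2.8)
assert np.isclose(per_sample, 5.5)

# And mean vs sum silently rescales the gradient by the element count.
assert np.isclose(loss.sum(), loss.mean() * loss.size)
```

Both reductions run without error; they just optimize different objectives.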

3. Data Leakage

Model gets access to information it shouldn't have — artificially inflates metrics.

| Pattern | What Happens |
| --- | --- |
| Future leaking into past | In sequential/temporal data, the model at time `t` sees information from `t+1` or later. Check attention masks, convolution padding (should be causal), and data loading. |
| Train leaking into val/test | Normalization statistics (mean, std) computed on the full dataset including val/test. Augmentations using parameters fit on the full dataset. |
| Target leaking into input | The label or ground truth accidentally included in input features. Off-by-one tensor slicing can include the target. |
| Cross-episode leakage (RL) | Observation at the start of a new episode contains information from the previous episode. Check auto-reset logic and observation buffers. |
Verification

At every point where data is filtered, masked, or split, verify that no future/test/target information crosses the boundary.
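The train-into-test case is the easiest to demonstrate: a sketch of the correct pattern, where normalization statistics are fit on the train split only and then applied to both splits:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=1000)
train, test = data[:800], data[800:]

# WRONG: statistics include the test split, so test information leaks
# into the preprocessing that the model sees.
leaky_mean, leaky_std = data.mean(), data.std()

# RIGHT: fit statistics on train only, apply them everywhere.
mu, sigma = train.mean(), train.std()
train_n = (train - mu) / sigma
test_n = (test - mu) / sigma

# The train split is exactly standardized; the test split is only
# approximately so -- and that is the correct behavior.
assert np.isclose(train_n.mean(), 0.0, atol=1e-8)
```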

4. Normalization Bugs

Statistics look reasonable, outputs are in range, but they're wrong.

| Pattern | What Happens |
| --- | --- |
| Stale running statistics | BatchNorm/LayerNorm running mean/var not updated because the model is stuck in eval mode, or updated at the wrong point in the pipeline. |
| Wrong normalization axis | LayerNorm over `(features,)` vs `(time, features)` vs `(batch, features)` — all run without error, but only one is correct. |
| Normalizing after clipping | If you clip then normalize, statistics are biased. If you normalize then clip, the distribution is distorted. Check the order. |
| Train vs eval normalization | Normalization differs between train and eval, but eval actually uses batch stats from a single sample instead of running stats. |
| RL observation normalization | Running stats computed across envs but applied per-env, or vice versa. Stats not updated after initial warmup. |
Verification

For every normalization operation, verify: correct axis, correct mode (train vs eval), correct statistics source.
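The wrong-axis row can be made concrete with a hand-rolled layer norm (a simplified sketch, not a framework implementation): every axis choice returns the same shape, so only a statistics check distinguishes them.

```python
import numpy as np

x = np.random.default_rng(1).normal(size=(8, 16, 32))  # (batch, time, features)

def layer_norm(x, axis):
    mu = x.mean(axis=axis, keepdims=True)
    var = x.var(axis=axis, keepdims=True)
    return (x - mu) / np.sqrt(var + 1e-5)

# All of these run without error and return the same shape --
# but they normalize over different statistics.
over_features = layer_norm(x, axis=-1)       # per (batch, time) position
over_time_feat = layer_norm(x, axis=(1, 2))  # per sample
over_batch = layer_norm(x, axis=0)           # batch-norm-like, usually wrong here

assert over_features.shape == over_time_feat.shape == over_batch.shape == x.shape
# Per-position normalization: each (b, t) slice has ~zero mean.
assert np.allclose(over_features.mean(axis=-1), 0.0, atol=1e-6)
```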

5. Off-by-One and Indexing Errors

A classic bug class — and especially dangerous in sequential data.

| Pattern | What Happens |
| --- | --- |
| Temporal indexing | `obs[t]` paired with `action[t]` and `reward[t]` — is `reward[t]` the reward FOR `action[t]` or AFTER `action[t]`? Off-by-one means wrong credit assignment. |
| Sequence slicing | `x[:, :-1]` for input and `x[:, 1:]` for target — does this match your autoregressive convention? Is the shift correct? |
| GAE discount indexing | `advantages[t] = delta[t] + gamma * lambda * advantages[t+1]` — is the scan iterating in the right direction (backward from T to 0)? |
| Done mask application | `next_value * (1 - done[t])` — is `done[t]` the done flag for the transition that produced `obs[t]` or the one starting from `obs[t]`? |
Verification

For every index [t], [t+1], [:-1], [1:], verify the temporal semantics explicitly. Draw out a timeline if needed.
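A sketch of verifying the shift explicitly on toy data instead of trusting the slicing (the toy environment convention here is an assumption for illustration):

```python
import numpy as np

# Tokens 0..9 for one sequence; autoregressive convention:
# inputs[t] should predict targets[t] == x[t + 1].
x = np.arange(10)[None, :]           # shape (1, 10)
inputs, targets = x[:, :-1], x[:, 1:]

# Each target is the NEXT token after its input; this holds only
# because x is 0..9, which makes the shift directly checkable.
assert inputs.shape == targets.shape
assert np.all(targets == inputs + 1)

# RL-style alignment: write the convention down as length invariants.
# Here: action[t] is taken in obs[t]; reward[t] is received for it;
# obs has one extra entry for the terminal observation.
T = 5
obs = np.arange(T + 1)
actions = np.arange(T)
rewards = obs[1:]                    # toy convention: reward = next obs
assert len(obs) == len(actions) + 1 == len(rewards) + 1
```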

6. Gradient Flow Bugs

Gradients technically exist, but they don't carry the right signal.

| Pattern | What Happens |
| --- | --- |
| Accidental stop-gradient | A `.detach()` or `jax.lax.stop_gradient` on a tensor that should receive gradients. The model still trains via other paths, but the blocked path doesn't learn. |
| Missing stop-gradient | Gradients flow through the target network, codebook, or value baseline when they shouldn't. Optimization is wrong but training continues. |
| Dead branches | A conditional branch (e.g., temperature annealed to 0 = argmax) that cuts off gradients. The dead branch never learns. |
| Straight-through estimator bugs | Forward pass uses the quantized value while the backward pass uses the un-quantized value (or vice versa). Check both directions. |
Verification

For every tensor in the loss, verify whether gradient should flow through it. For every stop_gradient/detach, verify it's intentional.
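One framework-free way to verify this: perturb each parameter group and check whether the loss responds at all. A minimal sketch in plain Python — the `value_head` term and its 0.0 coefficient are hypothetical illustrations of a dead path:

```python
def perturbation_check(loss_fn, params, name, eps=1e-4):
    """Does the loss respond to this parameter at all? A parameter the
    loss ignores will never learn, even though training runs fine."""
    base = loss_fn(params)
    bumped = dict(params, **{name: params[name] + eps})
    return abs(loss_fn(bumped) - base) > 1e-12

# Toy loss where the value term is silently dead (stray 0.0 coefficient;
# the same check catches an accidental detach/stop_gradient).
def loss_fn(p):
    policy_loss = p["policy"] ** 2
    value_loss = 0.0 * p["value_head"] ** 2   # bug: term contributes nothing
    return policy_loss + value_loss

params = {"policy": 1.0, "value_head": 1.0}
assert perturbation_check(loss_fn, params, "policy")          # gradient flows
assert not perturbation_check(loss_fn, params, "value_head")  # silently dead
```

In a real codebase the same idea runs once at startup, one parameter group at a time.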

7. Loss Function Bugs

Loss decreases during training, but it's optimizing the wrong thing.

| Pattern | What Happens |
| --- | --- |
| Wrong sign | Minimizing when you should maximize (or vice versa) on a specific loss term. The model converges to the opposite of what you want. |
| Wrong KL direction | `KL(q||p)` vs `KL(p||q)` — both are valid but have different mode-seeking vs mode-covering behavior. |
| Missing loss term | A regularization term (KL, commitment loss, entropy bonus) defined but not added to the total loss. The model trains fine without it but produces worse results. |
| Loss term always zero | A term computed but evaluating to zero due to a bug (masking that zeros everything, a coefficient of 0.0 instead of 1.0). |
| Loss not reaching the right parameters | Loss computed on output A but only parameters B have `requires_grad`. The loss decreases by chance but the intended parameters don't learn. |
Verification

Log each loss component individually. Verify all components are non-zero and contribute to the total. Verify the gradient of the total loss w.r.t. each parameter group.
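A sketch of per-component logging; the component names and coefficients are made up for illustration:

```python
import math

def total_loss_with_report(components, coeffs):
    """Combine weighted loss terms and return both the total and a
    per-component report, so dead or NaN terms are visible immediately."""
    report, total = {}, 0.0
    for name, value in components.items():
        weighted = coeffs.get(name, 1.0) * value
        report[name] = weighted
        total += weighted
    return total, report

components = {"recon": 0.8, "kl": 0.05, "commit": 0.0}   # 'commit' is dead
total, report = total_loss_with_report(components, {"kl": 0.1})
assert math.isclose(total, 0.8 + 0.005 + 0.0)

# The verification test the detector would propose: flag zero or NaN terms.
dead = [k for k, v in report.items() if v == 0.0]
assert dead == ["commit"]
for v in report.values():
    assert not math.isnan(v)
```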

8. Train/Eval Mode Bugs

Model behaves differently in train vs eval, but the switch is wrong.

| Pattern | What Happens |
| --- | --- |
| Forgot eval mode | Dropout still active during evaluation produces noisy metrics. BatchNorm using batch stats instead of running stats gives inconsistent evaluation. |
| Forgot to switch back to train mode | After evaluation, the model stays in eval mode for the next epoch. Dropout disabled, BatchNorm frozen — training degrades silently. |
| Stochastic components uncontrolled | Eval should be deterministic, but random sampling is still active (e.g., VAE sampling without switching to mode/mean). |
Verification

Verify model.train() is called before training and model.eval() before evaluation. Check that stochastic components (dropout, sampling, noise injection) respect the mode.
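A regression test for mode bugs can be framework-free: run the model twice in eval mode and require identical outputs. The `TinyDropout` class below is a toy stand-in, not a real framework layer:

```python
import numpy as np

class TinyDropout:
    """Minimal inverted dropout with an explicit train/eval flag."""
    def __init__(self, p=0.5, seed=0):
        self.p, self.training = p, True
        self.rng = np.random.default_rng(seed)

    def __call__(self, x):
        if not self.training:
            return x                      # eval: identity, deterministic
        keep = self.rng.random(x.shape) >= self.p
        return x * keep / (1.0 - self.p)  # scale so E[output] == input

layer = TinyDropout(p=0.5)
x = np.ones(1000)

layer.training = False
# The check the detector proposes: two eval passes must be identical.
assert np.array_equal(layer(x), layer(x))
assert np.array_equal(layer(x), x)

layer.training = True
# Train mode is stochastic; a model stuck in eval mode would make this fail.
assert not np.array_equal(layer(x), layer(x))
```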

9. Random Seed / Reproducibility Bugs

Results vary between runs, or worse, are correlated when they shouldn't be.

| Pattern | What Happens |
| --- | --- |
| Same seed for different purposes | Using the same RNG key for weight initialization and data shuffling — produces correlated initialization and data ordering. |
| Seed not split in parallel envs | All environments get the same seed — identical rollouts, no diversity, model overfits to one trajectory. |
| Seed reuse after checkpoint | Restoring from checkpoint but not restoring RNG state — first epoch after resume has same randomness as first epoch of training. |
Verification

Verify every random call uses a unique, properly-split key. Verify RNG state is saved/restored with checkpoints.
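In NumPy this splitting is exactly what `SeedSequence.spawn` provides (the JAX equivalent is `jax.random.split`); a sketch:

```python
import numpy as np

# One root seed, split into independent streams per purpose --
# never reuse one stream for initialization AND data shuffling.
root = np.random.SeedSequence(42)
init_ss, shuffle_ss, env_ss = root.spawn(3)

init_rng = np.random.default_rng(init_ss)
shuffle_rng = np.random.default_rng(shuffle_ss)

# The streams are decorrelated: same root seed, different draws.
a = init_rng.random(8)
b = shuffle_rng.random(8)
assert not np.allclose(a, b)

# Reproducibility check: rebuilding from the same root replays the run.
a2 = np.random.default_rng(np.random.SeedSequence(42).spawn(3)[0]).random(8)
assert np.allclose(a, a2)
```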

10. Action and Observation Space Bugs

Robotics-specific: hardware doesn't crash, but the robot does the wrong thing.

| Pattern | What Happens |
| --- | --- |
| Wrong action scaling | Policy outputs in `[-1, 1]` but actuators expect radians. Missing denormalization produces small, useless movements or saturated commands. |
| Wrong coordinate frame | Action computed in the world frame but applied in the body frame (or vice versa). Movement direction is wrong but magnitude is plausible. |
| Observation ordering mismatch | Observation vector is `[pos, vel, orientation]` but the model expects `[orientation, pos, vel]`. Shapes match, values are reasonable, semantics are scrambled. |
| Stale observations | Observation buffer not updated between steps — the model sees the same observation repeatedly; the policy looks stuck but there's no error. |
Verification

At every sim/real boundary, verify: correct scaling, correct frame, correct ordering, correct freshness.
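A sketch of the scaling check: a `denormalize_action` helper (hypothetical name, illustrative joint limits) that fails loudly on out-of-range inputs instead of silently clipping them.

```python
import numpy as np

def denormalize_action(a, low, high):
    """Map a policy output in [-1, 1] to the actuator range [low, high].
    Asserts the input range instead of silently clipping."""
    a = np.asarray(a, dtype=float)
    assert np.all(np.abs(a) <= 1.0 + 1e-6), f"policy output out of range: {a}"
    return low + (a + 1.0) * 0.5 * (high - low)

# Illustrative joint limits in radians for a 2-DoF arm.
low = np.array([-3.14, -1.57])
high = np.array([3.14, 1.57])

# Midpoint of a symmetric range is zero; extremes hit the limits exactly.
assert np.allclose(denormalize_action([0.0, 0.0], low, high), [0.0, 0.0])
assert np.allclose(denormalize_action([1.0, -1.0], low, high), [3.14, -1.57])
```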

11. Masking Bugs

Masks are applied but they mask the wrong thing.

| Pattern | What Happens |
| --- | --- |
| Inverted mask | Using `mask` where `1 - mask` was intended (or True/False flipped). Masked tokens get attention; unmasked tokens are ignored. |
| Mask not applied to loss | Padding tokens contribute to the loss — the model wastes capacity learning to predict padding. |
| Mask at wrong scope | Mask computed at sequence level but applied at token level, or applied before padding instead of after. |
| Mask shape + broadcasting | Mask is `(B, T)` but attention is `(B, H, T, T)` — broadcasting may expand the mask incorrectly. |
Verification

For every mask, verify: which values are masked vs unmasked, that the mask reaches all places it should (loss, attention, metrics), and that broadcasting expands it correctly.
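A NumPy sketch of the broadcasting row: both expansions of a `(B, T)` mask broadcast cleanly against `(B, H, T, T)` attention scores, but one masks key columns and the other masks query rows.

```python
import numpy as np

B, H, T = 2, 4, 5
valid = np.array([[1, 1, 1, 0, 0],
                  [1, 1, 1, 1, 1]], dtype=float)   # (B, T), 1 = real token

scores = np.zeros((B, H, T, T))                    # (batch, heads, query, key)

# Intended: mask padded KEY columns.
key_mask = valid[:, None, None, :]                 # (B, 1, 1, T)
masked = np.where(key_mask > 0, scores, -1e9)

# Wrong-axis version: masks QUERY rows instead -- broadcasts just as
# happily, same output shape, different semantics.
query_mask = valid[:, None, :, None]               # (B, 1, T, 1)
masked_wrong = np.where(query_mask > 0, scores, -1e9)

assert masked.shape == masked_wrong.shape == (B, H, T, T)
assert not np.array_equal(masked, masked_wrong)
assert np.all(masked[0, :, :, 3:] == -1e9)   # padded keys masked for every query

# Mask-aware loss: divide by the number of REAL tokens, not B * T.
loss = np.ones((B, T))
masked_mean = (loss * valid).sum() / valid.sum()
assert np.isclose(masked_mean, 1.0)
```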

Audit Process

The detector follows a systematic four-step process:

  1. Full catalog scan — go through every category above systematically. Don't skip categories because "it's probably fine."
  2. Trace actual values — don't just check that the operation exists. Verify what it actually computes given the shapes and values in this specific codebase.
  3. Check interactions — many silent bugs only manifest when two individually-correct components interact wrong (e.g., normalization followed by clipping, correct mask applied to wrong tensor).
  4. Propose verification tests — for each potential bug, suggest a small test that would catch it (e.g., assert loss_component_kl > 0, assert obs[t+1] != obs[t] after env.step).
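Step 4 in practice: each proposed test is a small assertion against the running system. A sketch with a hypothetical `toy_step` environment standing in for a real one:

```python
import numpy as np

def check_env_step_advances(env_step, obs, action):
    """Proposed verification test for stale observations: stepping the
    env must change the observation (in a non-static toy setting)."""
    next_obs, reward = env_step(obs, action)
    assert not np.array_equal(next_obs, obs), "observation did not advance"
    return next_obs, reward

# Hypothetical toy environment: next_obs = obs + action, reward = -|obs|.
def toy_step(obs, action):
    return obs + action, -float(np.abs(obs).sum())

obs = np.zeros(3)
obs, reward = check_env_step_advances(toy_step, obs, np.ones(3))
assert np.array_equal(obs, np.ones(3)) and reward == 0.0
```

Tests like these are cheap enough to leave in the training loop permanently.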

Output Format

The detector produces a structured report with three sections:

One Silent Bug Can Invalidate a Paper

Shapes matching is not correctness. Two tensors of shape `(64, 128)` can mean completely different things. Always verify semantic meaning, not just numerical compatibility. Check the boring stuff — most silent bugs are a wrong axis in `.mean()`, a flipped mask, or an off-by-one in a time index.