Regression Guard
Ensures new code additions or modifications don't silently break, alter, or degrade existing pipelines. A silent regression can invalidate weeks of experiments.
Overview
The Regression Guard audits new code changes to ensure backward compatibility. It traces every touchpoint between new and existing code to find unintended side effects — before they reach a training run.
| Property | Details |
|---|---|
| Tools | Read, Grep, Glob, Bash |
| Auto-Dispatch | Yes — before merging any feature branch |
| Trigger | Any feature branch with code changes; training loop or optimizer modifications |
Change Scope Analysis
Before anything else, the guard understands exactly what changed:
- Reads the full diff of all modified files
- Identifies every function, class, and module that was added, modified, or deleted
- Classifies each change: new addition, modification, refactor (same behavior, different structure), or deletion
- Maps the intended scope vs the actual scope of the change
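The classification step above can be sketched with the standard-library `ast` module. This is a minimal illustration, not the guard's actual implementation: the function names (`classify_changes`, `function_names`) and the toy sources are hypothetical.

```python
import ast

def function_names(source: str) -> set:
    """Collect every function and method name defined in a source string."""
    return {
        node.name
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }

def classify_changes(old_src: str, new_src: str) -> dict:
    """Bucket functions into added / deleted / common between two revisions.
    Names in 'common' are the candidates for modification or refactor checks;
    added and deleted names flag interface changes directly."""
    old, new = function_names(old_src), function_names(new_src)
    return {"added": new - old, "deleted": old - new, "common": old & new}

old = "def load(path):\n    return path\ndef train(cfg):\n    pass\n"
new = "def load(path, cache=True):\n    return path\ndef evaluate(cfg):\n    pass\n"
changes = classify_changes(old, new)
# changes["added"] == {"evaluate"}; changes["deleted"] == {"train"}
```

A real diff-scoped pass would parse only the files touched by the branch, but the added/deleted/common split is the same.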
Interface Compatibility
Checks that existing code calling into modified code still works:
- Function signatures — were arguments added, removed, reordered, or re-typed? Are new arguments optional with backward-compatible defaults?
- Return types and shapes — does any modified function return a different shape, dtype, or structure? Traces all callers to verify they handle the new return correctly.
- Config/YAML schemas — were config fields added, removed, or renamed? Do existing config files still parse?
- Checkpoint compatibility — if model architecture changed, can old checkpoints still be loaded?
- Import paths — were modules moved or renamed? Does anything still import from the old location?
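The signature check above can be approximated with `inspect.signature`. This is a hedged heuristic sketch, assuming positional parameters only; the example loss functions are hypothetical.

```python
import inspect

def is_backward_compatible(old_fn, new_fn) -> bool:
    """Heuristic: existing positional parameters must keep their name and
    position, and any newly added parameters must carry defaults."""
    old_params = list(inspect.signature(old_fn).parameters.values())
    new_params = list(inspect.signature(new_fn).parameters.values())
    if len(new_params) < len(old_params):
        return False  # a parameter was removed
    if any(o.name != n.name for o, n in zip(old_params, new_params)):
        return False  # renamed or reordered
    # Every parameter added beyond the old signature needs a default.
    return all(p.default is not inspect.Parameter.empty
               for p in new_params[len(old_params):])

def old_loss(logits, targets): ...
def ok_loss(logits, targets, label_smoothing=0.0): ...  # safe: defaulted
def bad_loss(logits, targets, weights): ...             # breaks old callers
```

A production check would also handle keyword-only parameters and `*args`/`**kwargs`, which this sketch ignores.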
Pipeline Regression Tracing
Traces the full pipeline end-to-end to find where new code touches existing code:
- Data pipeline — if loading/preprocessing changed, does it still produce the same output format for downstream consumers?
- Training loop — does the update logic still behave identically for all existing model configs?
- Evaluation/inference — does the evaluation pipeline still produce valid results with metrics computed the same way?
- Logging and checkpointing — do wandb/tensorboard logs still capture the same keys? Can the checkpoint loader restore from existing saved states?
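The logging and checkpoint checks above reduce to set comparisons over metric keys and state-dict keys. A minimal sketch, with hypothetical key names and a made-up `regression_report` helper:

```python
def regression_report(baseline_log_keys, new_log_keys,
                      checkpoint_keys, model_keys):
    """Flag dropped metric keys and checkpoint/model key mismatches."""
    return {
        "dropped_log_keys": sorted(set(baseline_log_keys) - set(new_log_keys)),
        "missing_from_checkpoint": sorted(set(model_keys) - set(checkpoint_keys)),
        "unexpected_in_checkpoint": sorted(set(checkpoint_keys) - set(model_keys)),
    }

report = regression_report(
    baseline_log_keys={"train/loss", "train/lr", "eval/accuracy"},
    new_log_keys={"train/loss", "train/lr", "eval/acc"},        # key renamed
    checkpoint_keys={"encoder.weight", "decoder.weight"},        # old save
    model_keys={"encoder.weight", "decoder.weight", "adapter.weight"},  # new layer
)
# report["dropped_log_keys"] == ["eval/accuracy"]
# report["missing_from_checkpoint"] == ["adapter.weight"]
```

Any non-empty bucket is a potential regression: renamed metrics break dashboards, and a new layer absent from old checkpoints means resume-from-checkpoint fails or silently initializes randomly.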
Behavioral Equivalence
For code that was refactored but should behave identically, the guard checks for subtle behavioral changes:
- Random seed consumption order — adding a new random call upstream shifts all downstream randomness
- Numerical precision — switching float64 to float32 intermediates, reordering operations that affect floating point accumulation
- Default values — new argument with a default that doesn't match the old implicit behavior
- Iteration order — switching between dict and OrderedDict, iterating over a set, or changing how items are sorted
Dependency and Side Effect Analysis
- Shared mutable state — does new code modify any global state, class variables, or module-level variables that existing code reads?
- Import side effects — does importing the new module trigger side effects (registering classes, modifying globals, monkey-patching)?
- File system effects — does new code write to paths that existing code reads from? Could it overwrite outputs, logs, or checkpoints?
- Environment changes — does new code set environment variables, modify sys.path, or change process state?
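Shared-mutable-state regressions can be caught by snapshotting the state before running the new code and diffing afterward. A minimal sketch, with a hypothetical module-level `REGISTRY` standing in for any shared dict:

```python
REGISTRY = {}  # module-level state that existing code reads

def register(name, fn):
    REGISTRY[name] = fn

def mutated_keys(action, state):
    """Run an action and report which keys of a shared dict it added or changed."""
    before = dict(state)
    action()
    return sorted(k for k in state
                  if k not in before or state[k] != before[k])

# The new code registers a component as an import-time side effect;
# the snapshot diff makes that mutation visible instead of silent.
changed = mutated_keys(lambda: register("relu", max), REGISTRY)
# changed == ["relu"]
```

The same snapshot-and-diff pattern works for `os.environ`, `sys.path`, and class attributes.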
Displaced Fix Detection
One of the most dangerous patterns in research code: fixing a problem in module A by changing module B. This creates hidden coupling, makes the codebase fragile, and often introduces new bugs.
The guard looks for these displaced fix patterns:
| Pattern | Example |
|---|---|
| Fix location doesn't match bug location | Bug is in the loss function but the "fix" changes data preprocessing. Bug is in the decoder but the "fix" normalizes the encoder output. |
| Compensating hacks | Adding * 0.5 upstream to counteract a doubled value downstream. Adding a transpose to "undo" a wrong axis convention from another module. |
| Workarounds in shared code | Fixing a problem specific to one model variant by changing shared infrastructure, forcing all other variants to live with the workaround. |
| Config-level fixes for code bugs | Adding loss_scale=0.5 because the loss is accidentally doubled somewhere. The correct fix is to remove the doubling, not mask it in config. |
| Shape manipulation papering over mismatches | Adding reshapes, squeezes, or transposes at module boundaries when the real issue is one module produces the wrong shape. |
For each change, the guard asks:
- Is the change in the same module/function where the problem originates?
- If not, why? Is there a legitimate architectural reason, or is this working around the root cause?
- Would this fix survive if the "other" module changed?
- Does this introduce an implicit contract between two modules that isn't enforced by tests?
If a fix changes module B to compensate for a problem in module A, reject it. When module A is later fixed properly, module B's workaround becomes a new bug. Always fix the root cause directly.
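The compensating-hack failure mode can be shown in a few lines. All function names here are invented for illustration:

```python
# Module A has the bug: the loss is accidentally doubled.
def compute_loss(pred, target):
    return 2 * abs(pred - target)   # bug: spurious factor of 2

# Displaced "fix" in module B: a compensating scale hides the bug.
def train_step(pred, target, loss_scale=0.5):
    return loss_scale * compute_loss(pred, target)

# Root-cause fix: repair module A itself; no compensation needed anywhere.
def compute_loss_fixed(pred, target):
    return abs(pred - target)

# Both give 1.0 today, so the displaced fix "works" -- until someone
# repairs compute_loss, at which point loss_scale=0.5 silently halves
# every loss. The guard rejects the compensating version.
assert train_step(3.0, 2.0) == compute_loss_fixed(3.0, 2.0) == 1.0
```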
Conditional Path Verification
When new code adds branches (if/else, new model types, new loss terms):
- Verify the existing branch still triggers for existing configs
- Check that the new branch is only entered when explicitly opted into (via a config flag or new argument)
- Verify shared code after the branch point receives compatible outputs from both branches
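The opt-in pattern above looks like this in practice. A minimal sketch; the flag name `use_focal_loss` and the `build_loss` helper are hypothetical:

```python
def build_loss(cfg):
    """The existing path must stay the default; new behavior is opt-in only."""
    if cfg.get("use_focal_loss", False):   # new branch: explicit opt-in flag
        return "focal"
    return "cross_entropy"                 # existing branch, unchanged default

assert build_loss({}) == "cross_entropy"                 # old configs: old path
assert build_loss({"use_focal_loss": True}) == "focal"   # opt-in: new path
```

The key property is that a config written before the change never reaches the new branch, because the flag defaults to the old behavior.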
Output Format
The guard produces a structured report containing:
- Change Summary — files modified, intended scope vs actual scope
- Interface Changes — what changed, backward compatibility status, callers affected
- Pipeline Impact — data pipeline, training loop, evaluation, checkpoints
- Displaced Fixes Detected — change location vs bug location, what's dangerous, correct fix
- Regressions Found — before/after behavior, severity, how to fix
- Recommended Tests — tests to add to catch this class of regression in the future
The default behavior must not change. If someone runs the exact same config and command as before the change, they must get the same results. New behavior should only activate when explicitly requested. "It still runs" is not "it still works" — a pipeline can run without errors but produce silently different results.
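One cheap way to enforce "same config, same results" is to fingerprint a baseline run's outputs and compare after the change. A sketch under the assumption that run metrics are JSON-serializable; `run_fingerprint` is an invented helper, not part of any real library:

```python
import hashlib
import json

def run_fingerprint(metrics):
    """Stable hash of a run's outputs; compare it before and after a change
    to catch pipelines that still run but produce different results."""
    blob = json.dumps(metrics, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

baseline = run_fingerprint({"loss": 0.4213, "accuracy": 0.917})
rerun    = run_fingerprint({"accuracy": 0.917, "loss": 0.4213})  # same run
assert baseline == rerun   # identical config + command must match exactly
```

For floating-point outputs that legitimately vary across hardware, compare within a tolerance instead of hashing; for deterministic pipelines, an exact fingerprint mismatch is a regression by definition.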