Regression Guard
Ensures new code additions or modifications don't silently break, alter, or degrade existing pipelines. A silent regression can invalidate weeks of experiments.
Overview
The Regression Guard audits new code changes to ensure backward compatibility. It traces every touchpoint between new and existing code to find unintended side effects — before they reach a training run.
| Property | Details |
|---|---|
| Tools | Read, Grep, Glob, Bash |
| Auto-Dispatch | Yes — before merging any feature branch |
| Trigger | Any feature branch with code changes; training loop or optimizer modifications |
Change Scope Analysis
Before anything else, the guard understands exactly what changed:
- Reads the full diff of all modified files
- Identifies every function, class, and module that was added, modified, or deleted
- Classifies each change: new addition, modification, refactor (same behavior, different structure), or deletion
- Maps the intended scope vs the actual scope of the change
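The classification step above can be sketched with the standard-library `ast` module. This is a minimal illustration, not the guard's actual implementation: the function names (`classify_changes`, `function_names`) and the toy sources are hypothetical.

```python
import ast

def function_names(source: str) -> set:
    """Collect every function and method name defined in a source string."""
    return {
        node.name
        for node in ast.walk(ast.parse(source))
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }

def classify_changes(old_src: str, new_src: str) -> dict:
    """Bucket functions into added / deleted / common between two revisions.
    Names in 'common' are the candidates for modification or refactor checks;
    added and deleted names flag interface changes directly."""
    old, new = function_names(old_src), function_names(new_src)
    return {"added": new - old, "deleted": old - new, "common": old & new}

old = "def load(path):\n    return path\ndef train(cfg):\n    pass\n"
new = "def load(path, cache=True):\n    return path\ndef evaluate(cfg):\n    pass\n"
changes = classify_changes(old, new)
# changes["added"] == {"evaluate"}; changes["deleted"] == {"train"}
```

A real diff-scoped pass would parse only the files touched by the branch, but the added/deleted/common split is the same.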
Interface Compatibility
Checks that existing code calling into modified code still works:
- Function signatures — were arguments added, removed, reordered, or re-typed? Are new arguments optional with backward-compatible defaults?
- Return types and shapes — does any modified function return a different shape, dtype, or structure? Traces all callers to verify they handle the new return correctly.
- Config/YAML schemas — were config fields added, removed, or renamed? Do existing config files still parse?
- Checkpoint compatibility — if model architecture changed, can old checkpoints still be loaded?
- Import paths — were modules moved or renamed? Does anything still import from the old location?
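The signature check above can be approximated with `inspect.signature`. This is a hedged heuristic sketch, assuming positional parameters only; the example loss functions are hypothetical.

```python
import inspect

def is_backward_compatible(old_fn, new_fn) -> bool:
    """Heuristic: existing positional parameters must keep their name and
    position, and any newly added parameters must carry defaults."""
    old_params = list(inspect.signature(old_fn).parameters.values())
    new_params = list(inspect.signature(new_fn).parameters.values())
    if len(new_params) < len(old_params):
        return False  # a parameter was removed
    if any(o.name != n.name for o, n in zip(old_params, new_params)):
        return False  # renamed or reordered
    # Every parameter added beyond the old signature needs a default.
    return all(p.default is not inspect.Parameter.empty
               for p in new_params[len(old_params):])

def old_loss(logits, targets): ...
def ok_loss(logits, targets, label_smoothing=0.0): ...  # safe: defaulted
def bad_loss(logits, targets, weights): ...             # breaks old callers
```

A production check would also handle keyword-only parameters and `*args`/`**kwargs`, which this sketch ignores.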
Pipeline Regression Tracing
Traces the full pipeline end-to-end to find where new code touches existing code:
- Data pipeline — if loading/preprocessing changed, does it still produce the same output format for downstream consumers?
- Training loop — does the update logic still behave identically for all existing model configs?
- Evaluation/inference — does the evaluation pipeline still produce valid results with metrics computed the same way?
- Logging and checkpointing — do wandb/tensorboard logs still capture the same keys? Can the checkpoint loader restore from existing saved states?
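The logging and checkpoint checks above reduce to set comparisons over metric keys and state-dict keys. A minimal sketch, with hypothetical key names and a made-up `regression_report` helper:

```python
def regression_report(baseline_log_keys, new_log_keys,
                      checkpoint_keys, model_keys):
    """Flag dropped metric keys and checkpoint/model key mismatches."""
    return {
        "dropped_log_keys": sorted(set(baseline_log_keys) - set(new_log_keys)),
        "missing_from_checkpoint": sorted(set(model_keys) - set(checkpoint_keys)),
        "unexpected_in_checkpoint": sorted(set(checkpoint_keys) - set(model_keys)),
    }

report = regression_report(
    baseline_log_keys={"train/loss", "train/lr", "eval/accuracy"},
    new_log_keys={"train/loss", "train/lr", "eval/acc"},        # key renamed
    checkpoint_keys={"encoder.weight", "decoder.weight"},        # old save
    model_keys={"encoder.weight", "decoder.weight", "adapter.weight"},  # new layer
)
# report["dropped_log_keys"] == ["eval/accuracy"]
# report["missing_from_checkpoint"] == ["adapter.weight"]
```

Any non-empty bucket is a potential regression: renamed metrics break dashboards, and a new layer absent from old checkpoints means resume-from-checkpoint fails or silently initializes randomly.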
Behavioral Equivalence
For code that was refactored but should behave identically, the guard checks for subtle behavioral changes:
- Random seed consumption order — adding a new random call upstream shifts all downstream randomness
- Numerical precision — switching float64 to float32 intermediates, reordering operations that affect floating point accumulation
- Default values — new argument with a default that doesn't match the old implicit behavior
- Iteration order — switching between dict and OrderedDict, iterating over a set, or changing how items are sorted
Dependency and Side Effect Analysis
- Shared mutable state — does new code modify any global state, class variables, or module-level variables that existing code reads?
- Import side effects — does importing the new module trigger side effects (registering classes, modifying globals, monkey-patching)?
- File system effects — does new code write to paths that existing code reads from? Could it overwrite outputs, logs, or checkpoints?
- Environment changes — does new code set environment variables, modify sys.path, or change process state?
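Shared-mutable-state regressions can be caught by snapshotting the state before running the new code and diffing afterward. A minimal sketch, with a hypothetical module-level `REGISTRY` standing in for any shared dict:

```python
REGISTRY = {}  # module-level state that existing code reads

def register(name, fn):
    REGISTRY[name] = fn

def mutated_keys(action, state):
    """Run an action and report which keys of a shared dict it added or changed."""
    before = dict(state)
    action()
    return sorted(k for k in state
                  if k not in before or state[k] != before[k])

# The new code registers a component as an import-time side effect;
# the snapshot diff makes that mutation visible instead of silent.
changed = mutated_keys(lambda: register("relu", max), REGISTRY)
# changed == ["relu"]
```

The same snapshot-and-diff pattern works for `os.environ`, `sys.path`, and class attributes.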
Displaced Fix Detection
One of the most dangerous patterns in research code: fixing a problem in module A by changing module B. This creates hidden coupling, makes the codebase fragile, and often introduces new bugs.
The guard looks for these displaced fix patterns:
| Pattern | Example |
|---|---|
| Fix location doesn't match bug location | Bug is in the loss function but the "fix" changes data preprocessing. Bug is in the decoder but the "fix" normalizes the encoder output. |
| Compensating hacks | Adding * 0.5 upstream to counteract a doubled value downstream. Adding a transpose to "undo" a wrong axis convention from another module. |
| Workarounds in shared code | Fixing a problem specific to one model variant by changing shared infrastructure, forcing all other variants to live with the workaround. |
| Config-level fixes for code bugs | Adding loss_scale=0.5 because the loss is accidentally doubled somewhere. The correct fix is to remove the doubling, not mask it in config. |
| Shape manipulation papering over mismatches | Adding reshapes, squeezes, or transposes at module boundaries when the real issue is one module produces the wrong shape. |
For each change, the guard asks:
- Is the change in the same module/function where the problem originates?
- If not, why? Is there a legitimate architectural reason, or is this working around the root cause?
- Would this fix survive if the "other" module changed?
- Does this introduce an implicit contract between two modules that isn't enforced by tests?
If a fix changes module B to compensate for a problem in module A, reject it. When module A is later fixed properly, module B's workaround becomes a new bug. Always fix the root cause directly.
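The compensating-hack failure mode can be shown in a few lines. All function names here are invented for illustration:

```python
# Module A has the bug: the loss is accidentally doubled.
def compute_loss(pred, target):
    return 2 * abs(pred - target)   # bug: spurious factor of 2

# Displaced "fix" in module B: a compensating scale hides the bug.
def train_step(pred, target, loss_scale=0.5):
    return loss_scale * compute_loss(pred, target)

# Root-cause fix: repair module A itself; no compensation needed anywhere.
def compute_loss_fixed(pred, target):
    return abs(pred - target)

# Both give 1.0 today, so the displaced fix "works" -- until someone
# repairs compute_loss, at which point loss_scale=0.5 silently halves
# every loss. The guard rejects the compensating version.
assert train_step(3.0, 2.0) == compute_loss_fixed(3.0, 2.0) == 1.0
```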
Conditional Path Verification
When new code adds branches (if/else, new model types, new loss terms):
- Verify the existing branch still triggers for existing configs
- Check that the new branch is only entered when explicitly opted into (via a config flag or new argument)
- Verify shared code after the branch point receives compatible outputs from both branches
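The opt-in pattern above looks like this in practice. A minimal sketch; the flag name `use_focal_loss` and the `build_loss` helper are hypothetical:

```python
def build_loss(cfg):
    """The existing path must stay the default; new behavior is opt-in only."""
    if cfg.get("use_focal_loss", False):   # new branch: explicit opt-in flag
        return "focal"
    return "cross_entropy"                 # existing branch, unchanged default

assert build_loss({}) == "cross_entropy"                 # old configs: old path
assert build_loss({"use_focal_loss": True}) == "focal"   # opt-in: new path
```

The key property is that a config written before the change never reaches the new branch, because the flag defaults to the old behavior.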
Output Format
The guard produces a structured report containing:
- Change Summary — files modified, intended scope vs actual scope
- Interface Changes — what changed, backward compatibility status, callers affected
- Pipeline Impact — data pipeline, training loop, evaluation, checkpoints
- Displaced Fixes Detected — change location vs bug location, what's dangerous, correct fix
- Regressions Found — before/after behavior, severity, how to fix
- Recommended Tests — tests to add to catch this class of regression in the future
The default behavior must not change. If someone runs the exact same config and command as before the change, they must get the same results. New behavior should only activate when explicitly requested. "It still runs" is not "it still works" — a pipeline can run without errors but produce silently different results.
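One cheap way to enforce "same config, same results" is to fingerprint a baseline run's outputs and compare after the change. A sketch under the assumption that run metrics are JSON-serializable; `run_fingerprint` is an invented helper, not part of any real library:

```python
import hashlib
import json

def run_fingerprint(metrics):
    """Stable hash of a run's outputs; compare it before and after a change
    to catch pipelines that still run but produce different results."""
    blob = json.dumps(metrics, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

baseline = run_fingerprint({"loss": 0.4213, "accuracy": 0.917})
rerun    = run_fingerprint({"accuracy": 0.917, "loss": 0.4213})  # same run
assert baseline == rerun   # identical config + command must match exactly
```

For floating-point outputs that legitimately vary across hardware, compare within a tolerance instead of hashing; for deterministic pipelines, an exact fingerprint mismatch is a regression by definition.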