Failure Mode Researcher

Investigates training failures and unexpected behavior by searching across domains — ML, robotics, RL, control theory, signal processing — and synthesizing actionable solutions.

Overview

When the team hits a wall (training fails unexpectedly, the model produces bizarre behavior, or results don't match the paper), the Failure Mode Researcher works out why by connecting the specific failure to the broader body of knowledge. It is not limited to the user's own domain: a reward hacking problem in RL might have solutions in game theory; mode collapse in a VAE might have parallels in the GAN literature; a numerical instability in JAX might be a known HPC issue.

Property | Details
Tools | Read, Grep, Glob, WebSearch, WebFetch, Task (can delegate to other agents)
Auto-Dispatch | On demand; triggered when investigating unexplained failures
Trigger | Unexplained training failures, unexpected model behavior, results not matching paper

Failure Characterization

Before searching, the researcher characterizes the failure precisely.

The researcher reads relevant code and logs to fill in gaps the user didn't provide. The better the characterization, the better the search results.
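
The fields that make up a characterization are not listed in this section, so the sketch below uses a hypothetical record rather than the researcher's actual schema. The names `FailureReport`, `symptom`, `framework`, `model_type`, `onset_step`, and `already_tried` are all assumptions chosen for illustration. The helper `missing_fields` shows one way to spot the gaps that would then be filled from code and logs.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class FailureReport:
    """Hypothetical characterization record; field names are illustrative only."""
    symptom: Optional[str] = None      # what was observed, e.g. "codebook collapse"
    framework: Optional[str] = None    # e.g. "JAX"
    model_type: Optional[str] = None   # e.g. "VQ-VAE"
    onset_step: Optional[int] = None   # training step where the failure first appeared
    already_tried: list[str] = field(default_factory=list)  # fixes that did not help


def missing_fields(report: FailureReport) -> list[str]:
    """Return the names of fields still unset (None), i.e. gaps to fill from code and logs."""
    return [f.name for f in fields(report) if getattr(report, f.name) is None]


# The user supplied only a symptom and a model type.
report = FailureReport(symptom="codebook collapse", model_type="VQ-VAE")
print(missing_fields(report))  # prints ['framework', 'onset_step']
```

An empty `already_tried` list is treated as "nothing tried yet" rather than as a gap, which is a design choice of this sketch.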

Cross-Domain Literature Search

The search casts progressively wider nets:

Level 1: Exact Match

Search for the specific symptom + framework + model type.

"VQ-VAE codebook collapse JAX" or "PPO reward not increasing MuJoCo"

Level 2: Technique-Level

Search for the technique + failure mode, without framework specifics.

"vector quantization codebook utilization" or "policy gradient high variance"

Level 3: Phenomenon-Level

Search for the underlying phenomenon across domains.

"mode collapse discrete bottleneck" or "credit assignment sparse reward"

Level 4: Cross-Domain Analogies

Search for the same pattern in different fields.

  • Codebook collapse ↔ dead neurons ↔ cluster collapse in k-means
  • Reward hacking ↔ Goodhart's law ↔ specification gaming
  • Training instability ↔ numerical methods divergence ↔ control systems instability
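
The four levels above can be sketched as a function that emits queries from narrowest to widest. This is a minimal illustration, not the researcher's actual query logic: the function name `build_queries`, its parameters, and the string templates are assumptions.

```python
def build_queries(symptom, framework, model_type, technique, phenomenon, analogies):
    """Return (level, query) pairs ordered from the narrowest search to the widest."""
    queries = [
        (1, f"{model_type} {symptom} {framework}"),  # Level 1: exact match
        (2, f"{technique} {symptom}"),               # Level 2: technique, no framework
        (3, phenomenon),                             # Level 3: underlying phenomenon
    ]
    # Level 4: one query per analogous pattern in another field.
    queries += [(4, analogy) for analogy in analogies]
    return queries


queries = build_queries(
    symptom="codebook collapse",
    framework="JAX",
    model_type="VQ-VAE",
    technique="vector quantization",
    phenomenon="mode collapse discrete bottleneck",
    analogies=["dead neurons", "cluster collapse in k-means"],
)
for level, query in queries:
    print(level, query)
```

Keeping the level alongside each query makes it possible to stop widening as soon as a narrower level returns a convincing match.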

Sources checked: arXiv papers (recent and seminal), GitHub issues (often have practical solutions), ML forums, research lab blog posts, conference workshop papers, Stack Overflow for implementation-specific issues.

GitHub Issues Over Papers

Papers describe elegant solutions. GitHub issues describe what actually worked when someone had the same error at 2am. For practical fixes, GitHub issues are often more valuable.

Delegation to Subagents

Based on literature findings, the researcher delegates to appropriate code-level subagents for deeper investigation:

If the failure could be... | Delegates to | With context
A silent bug | Silent Bug Detector | Specific patterns to check from literature
A data flow issue | Data Flow Tracer | Specific pipeline segment to trace
JAX-specific | JAX Logic Auditor | Specific transformation to audit
A paper misalignment | Paper Alignment Auditor | Specific equations/components to cross-reference

When delegating, the researcher always provides: the specific hypothesis from the literature search, the exact code location to investigate, and what to look for (so the subagent doesn't have to re-derive the hypothesis).
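
One way to enforce those three pieces of context is to refuse to build a delegation request without them. The sketch below is an assumption about how this could look, not the actual dispatch mechanism: the category keys, the `Delegation` record, the `delegate` function, and the example code location are all hypothetical. Only the subagent names come from the table above.

```python
from dataclasses import dataclass

# Subagent names come from the delegation table; the category keys are hypothetical labels.
SUBAGENTS = {
    "silent_bug": "Silent Bug Detector",
    "data_flow": "Data Flow Tracer",
    "jax_specific": "JAX Logic Auditor",
    "paper_misalignment": "Paper Alignment Auditor",
}


@dataclass(frozen=True)
class Delegation:
    subagent: str
    hypothesis: str     # the specific hypothesis from the literature search
    code_location: str  # the exact code location to investigate
    look_for: str       # what the subagent should check


def delegate(category, hypothesis, code_location, look_for):
    """Build a delegation request, rejecting unknown categories and missing context."""
    if category not in SUBAGENTS:
        raise ValueError(f"unknown failure category: {category!r}")
    if not all(s.strip() for s in (hypothesis, code_location, look_for)):
        raise ValueError("hypothesis, code_location and look_for are all required")
    return Delegation(SUBAGENTS[category], hypothesis, code_location, look_for)


request = delegate(
    "jax_specific",
    hypothesis="straight-through estimator drops gradients to the encoder",
    code_location="models/vq.py: quantize()",  # hypothetical path, for illustration
    look_for="stop_gradient applied to the wrong term",
)
print(request.subagent)  # prints JAX Logic Auditor
```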

Solution Synthesis

The researcher first extracts the relevant details from each finding, then synthesizes across findings.

Prioritized Next Steps

Delivers a prioritized list ordered by:

  1. Likelihood of being the fix (based on literature match)
  2. Effort to implement (quick wins first)
  3. Diagnostic value (even if it doesn't fix the problem, will it narrow down the cause?)
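
The ordering above can be sketched as a sort with a three-part key. Treating the criteria as strictly lexicographic (likelihood first, effort as a tie-breaker, diagnostic value last) is an assumption of this sketch; a weighted score would be an equally plausible reading. The `NextStep` fields and the numeric scores are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class NextStep:
    action: str
    likelihood: float        # 0..1, estimated strength of the literature match
    effort_hours: float      # rough cost to implement
    diagnostic_value: float  # 0..1, how much the outcome narrows the cause


def prioritize(steps):
    """Sort by likelihood (high first), then effort (low first), then diagnostic value (high first)."""
    return sorted(
        steps,
        key=lambda s: (-s.likelihood, s.effort_hours, -s.diagnostic_value),
    )


steps = [
    NextStep("Re-initialize dead codebook entries", 0.6, 4.0, 0.5),
    NextStep("Reduce learning rate from 3e-4 to 1e-4", 0.6, 0.5, 0.4),
    NextStep("Log per-code usage counts", 0.3, 0.5, 0.9),
]
for step in prioritize(steps):
    print(step.action)
```

With these illustrative scores, the learning-rate change comes first: it ties on likelihood with re-initialization but costs far less effort.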

Each step must be concrete:

Bad | Good
"Try adjusting the learning rate" | "Reduce learning rate from 3e-4 to 1e-4. Source: [paper] found that VQ-VAE with commitment loss > 0.5 needs lr ≤ 1e-4 to avoid codebook collapse. If loss_commitment in your logs is > 0.5, this is likely the issue."

Don't Cargo-Cult Solutions

A fix that worked for a ResNet on ImageNet may not apply to a policy network in MuJoCo. The researcher always explains WHY a solution should work in the specific context. Failed fixes from previous attempts are data — they eliminate hypotheses and narrow the search.
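
Treating failed fixes as data can be sketched as a simple filter over open hypotheses. This is a deliberately simplified illustration: the function name and the hypothesis-to-fix mapping are hypothetical, and it assumes each failed fix was applied correctly, so that its failure really does rule out the matching hypothesis.

```python
def remaining_hypotheses(hypotheses, failed_fixes):
    """Drop every hypothesis whose corresponding fix has already been tried and failed.

    hypotheses: dict mapping a hypothesis to the fix that should work if it is true.
    failed_fixes: iterable of fixes that were applied without resolving the failure.
    """
    failed = set(failed_fixes)
    return [h for h, fix in hypotheses.items() if fix not in failed]


hypotheses = {
    "learning rate too high": "lower learning rate",
    "dead codebook entries": "re-initialize unused codes",
    "commitment loss weight too large": "reduce commitment weight",
}
still_open = remaining_hypotheses(hypotheses, failed_fixes=["lower learning rate"])
print(still_open)  # prints ['dead codebook entries', 'commitment loss weight too large']
```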