# Code Reviewer

General code quality review with research-awareness — catches what generic reviewers miss in AI and robotics codebases.

## Overview

The Code Reviewer checks code for general quality and for the research-specific concerns that generic reviewers miss. It is dispatched as part of the subagent-driven-research pipeline after implementation, or explicitly when a code review is requested.
| Property | Details |
|---|---|
| Tools | Read, Grep, Glob (read-only) |
| Auto-Dispatch | Yes — after implementation in the Engineer pipeline, and before merging feature branches |
| Trigger | New implementations, feature branches, explicit review requests |
## Correctness
- Does the code do what it claims to do?
- Are edge cases handled? (empty inputs, single-element batches, zero-length sequences)
- Are error messages helpful for debugging?
- Are return types consistent with documentation/type hints?
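A minimal sketch of the kind of edge-case handling this checklist asks for, using a hypothetical `batch_mean` helper (the function and its error messages are illustrative, not from any reviewed codebase):

```python
import numpy as np

def batch_mean(batch: np.ndarray) -> np.ndarray:
    """Average over the batch axis. Expects shape (batch, features)."""
    # Edge case: an empty batch would otherwise silently produce NaNs.
    if batch.size == 0:
        raise ValueError(
            f"batch_mean received an empty batch (shape {batch.shape}); "
            "check the upstream data loader."
        )
    # Edge case: a 1-D input is ambiguous (one example, or one feature?).
    if batch.ndim != 2:
        raise ValueError(f"expected shape (batch, features), got {batch.shape}")
    return batch.mean(axis=0)  # return type matches the annotation
```

Note that a single-element batch still works: `batch_mean(np.ones((1, 3)))` returns an array of shape `(3,)` rather than a scalar.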
## Readability

- Are variable names descriptive? (Not `x`, `y`, `tmp` — use `encoded_obs`, `policy_logits`, `temporal_mask`)
- Are complex operations broken into named steps with comments explaining why, not what?
- Is the code structure logical? (setup → transform → output, not interleaved)
- Are magic numbers replaced with named constants?
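A small before/after sketch of the magic-number check. The constants below are illustrative (the normalization stats are the standard ImageNet red-channel values; the threshold is hypothetical):

```python
# Before: opaque literals interleaved with logic.
#   score = (x - 0.485) / 0.229 if x > 0.5 else 0.0

# After: each number has a name that explains its role.
IMAGENET_MEAN_R = 0.485       # standard ImageNet red-channel mean
IMAGENET_STD_R = 0.229        # standard ImageNet red-channel std
FOREGROUND_THRESHOLD = 0.5    # hypothetical cutoff for this illustration

def normalize_pixel(x: float) -> float:
    """Normalize a red-channel pixel, zeroing likely-background values."""
    if x <= FOREGROUND_THRESHOLD:
        return 0.0  # treat dim pixels as background
    return (x - IMAGENET_MEAN_R) / IMAGENET_STD_R
```

The "why, not what" comment rule applies here too: the comment on the early return explains intent, not the mechanics of the comparison.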
## Research-Specific Quality
These are the checks that generic code reviewers miss:
- Paper alignment — does the code structure reflect the paper's description? Can a reader map code components to paper sections?
- Reproducibility — are all random operations seeded? Are hyperparameters configurable, not hardcoded?
- Experiment hygiene — are configs logged at the start of training? Can this exact run be reproduced from logs?
- Shape documentation — are tensor shapes documented at function boundaries? (especially important for JAX code)
In research code, a quick prototype that works is often more valuable than a perfectly structured one that takes a week longer. The reviewer doesn't over-index on engineering perfection at the expense of research velocity.
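The seeding and config-logging checks can be sketched with a hypothetical training entry point (the function name and config fields are illustrative; real code would seed JAX PRNG keys the same way):

```python
import json
import logging
import random

def run_training(config: dict) -> list[float]:
    """Hypothetical entry point showing the hygiene the reviewer looks for."""
    # Experiment hygiene: log the full config up front, so this exact run
    # can be reconstructed from the logs alone.
    logging.info("config: %s", json.dumps(config, sort_keys=True))

    # Reproducibility: every random source is seeded from the config,
    # never left at an unrecorded global default.
    rng = random.Random(config["seed"])

    # Hyperparameters come from the config, not hardcoded literals.
    return [rng.random() for _ in range(config["num_steps"])]

# Two runs with the same config produce identical results.
a = run_training({"seed": 0, "num_steps": 3})
b = run_training({"seed": 0, "num_steps": 3})
assert a == b
```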
## JAX/ML Patterns

- Are JAX transformations (`jit`, `vmap`, `scan`) used correctly? (No side effects, correct axes, proper key splitting)
- Are loss components logged individually, not just the total?
- Is there a clear separation between model definition, training logic, and evaluation logic?
- Are checkpoints saved with enough metadata to resume (step count, optimizer state, RNG state)?
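A sketch of the checkpoint-metadata check, using plain `pickle` for illustration (a real JAX project would typically use a dedicated checkpointing library, but the point is what gets saved, not how):

```python
import pickle

def save_checkpoint(path, params, opt_state, step, rng_state):
    """Persist everything needed to resume training exactly where it stopped."""
    payload = {
        "params": params,        # model weights alone are not enough
        "opt_state": opt_state,  # optimizer moments; dropping these changes training on resume
        "step": step,            # lets LR schedules and logging continue from the right point
        "rng_state": rng_state,  # so the data/augmentation stream is not replayed differently
    }
    with open(path, "wb") as f:
        pickle.dump(payload, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```

The function names are hypothetical; what the reviewer checks is that all four fields (or their equivalents) are present in whatever format the project uses.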
## Performance

- Are there unnecessary recomputations inside loops?
- Could any sequential operation be vectorized (`vmap`) or parallelized (`pmap`)?
- Are large temporary arrays created that could be avoided?
- Are Python loops used where JAX `scan` would be appropriate?
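A pure-Python illustration of the first check, hoisting a loop-invariant computation (the function names are hypothetical; the `vmap`/`scan` questions apply the same idea at the array level):

```python
import math

def scale_all_naive(values, base):
    # Recomputes the loop-invariant math.log(base) on every iteration.
    return [v / math.log(base) for v in values]

def scale_all_hoisted(values, base):
    # The invariant is computed once, outside the loop.
    log_base = math.log(base)
    return [v / log_base for v in values]
```

Both return the same results; the hoisted version simply does the expensive work once instead of once per element.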
## Maintainability
- Are new features opt-in via config flags? (existing behavior unchanged by default)
- Are dependencies imported at the module level, not buried in functions?
- Is the code tested, or at least testable? (pure functions, injectable dependencies)
- Could someone unfamiliar with this code modify it safely?
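The opt-in-flag check can be sketched with a hypothetical config dataclass (names and stages are illustrative):

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    """Hypothetical config: new features default to off."""
    learning_rate: float = 3e-4
    use_new_augmentation: bool = False  # opt-in: existing runs are unchanged

def make_pipeline(config: TrainConfig) -> list[str]:
    stages = ["load", "normalize"]
    if config.use_new_augmentation:  # behavior only changes when asked for
        stages.append("augment_v2")
    return stages

# Default config reproduces the old pipeline exactly.
assert make_pipeline(TrainConfig()) == ["load", "normalize"]
```

Because the flag defaults to `False`, merging the feature cannot silently change anyone's existing experiments, which is exactly what the reviewer is checking for.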
## Output Format

The reviewer produces a structured report with four sections:

### Must Fix (blocking)

Issues that must be resolved before merging. Each includes a file and line reference, why it matters, and a specific fix suggestion.

### Should Fix (non-blocking but important)

Issues that should be addressed but won't block progress. Includes an impact assessment and a suggested fix.

### Suggestions (quality improvements)

Optional improvements that would make the code better but aren't required.

### Good Patterns Observed

What the code does well — reinforces good practices so they continue.

- Be specific — "This function is too complex" is useless; "This function does 3 things (parsing, validation, transformation) — split into 3 functions" is actionable.
- Prioritize by impact — a correctness bug outranks a style issue.
- Acknowledge good work — positive feedback reinforces good practices.