# Code Reviewer

General code quality review with research-awareness — catches what generic reviewers miss in AI and robotics codebases.

## Overview

The Code Reviewer checks code for general quality and for the research-specific concerns that generic reviewers miss. It is dispatched as part of the subagent-driven-research pipeline after implementation, or explicitly when a code review is requested.
| Property | Details |
|---|---|
| Tools | Read, Grep, Glob (read-only) |
| Auto-Dispatch | Yes — after implementation in the Engineer pipeline, and before merging feature branches |
| Trigger | New implementations, feature branches, explicit review requests |
## Correctness
- Does the code do what it claims to do?
- Are edge cases handled? (empty inputs, single-element batches, zero-length sequences)
- Are error messages helpful for debugging?
- Are return types consistent with documentation/type hints?
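A minimal sketch of the kind of edge-case handling this checklist asks for, using a hypothetical `batch_mean` helper (the function and its error messages are illustrative, not from any reviewed codebase):

```python
import numpy as np

def batch_mean(batch: np.ndarray) -> np.ndarray:
    """Average over the batch axis. Expects shape (batch, features)."""
    # Edge case: an empty batch would otherwise silently produce NaNs.
    if batch.size == 0:
        raise ValueError(
            f"batch_mean received an empty batch (shape {batch.shape}); "
            "check the upstream data loader."
        )
    # Edge case: a 1-D input is ambiguous (one example, or one feature?).
    if batch.ndim != 2:
        raise ValueError(f"expected shape (batch, features), got {batch.shape}")
    return batch.mean(axis=0)  # return type matches the annotation
```

Note that a single-element batch still works: `batch_mean(np.ones((1, 3)))` returns an array of shape `(3,)` rather than a scalar.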
## Readability

- Are variable names descriptive? (Not `x`, `y`, `tmp` — use `encoded_obs`, `policy_logits`, `temporal_mask`)
- Are complex operations broken into named steps with comments explaining why, not what?
- Is the code structure logical? (setup → transform → output, not interleaved)
- Are magic numbers replaced with named constants?
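A small before/after sketch of the magic-number check. The constants below are illustrative (the normalization stats are the standard ImageNet red-channel values; the threshold is hypothetical):

```python
# Before: opaque literals interleaved with logic.
#   score = (x - 0.485) / 0.229 if x > 0.5 else 0.0

# After: each number has a name that explains its role.
IMAGENET_MEAN_R = 0.485       # standard ImageNet red-channel mean
IMAGENET_STD_R = 0.229        # standard ImageNet red-channel std
FOREGROUND_THRESHOLD = 0.5    # hypothetical cutoff for this illustration

def normalize_pixel(x: float) -> float:
    """Normalize a red-channel pixel, zeroing likely-background values."""
    if x <= FOREGROUND_THRESHOLD:
        return 0.0  # treat dim pixels as background
    return (x - IMAGENET_MEAN_R) / IMAGENET_STD_R
```

The "why, not what" comment rule applies here too: the comment on the early return explains intent, not the mechanics of the comparison.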
## Research-Specific Quality
These are the checks that generic code reviewers miss:
- Paper alignment — does the code structure reflect the paper's description? Can a reader map code components to paper sections?
- Reproducibility — are all random operations seeded? Are hyperparameters configurable, not hardcoded?
- Experiment hygiene — are configs logged at the start of training? Can this exact run be reproduced from logs?
- Shape documentation — are tensor shapes documented at function boundaries? (especially important for JAX code)
In research code, a quick prototype that works is often more valuable than a perfectly structured one that takes a week longer. The reviewer doesn't over-index on engineering perfection at the expense of research velocity.
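The seeding and config-logging checks can be sketched with a hypothetical training entry point (the function name and config fields are illustrative; real code would seed JAX PRNG keys the same way):

```python
import json
import logging
import random

def run_training(config: dict) -> list[float]:
    """Hypothetical entry point showing the hygiene the reviewer looks for."""
    # Experiment hygiene: log the full config up front, so this exact run
    # can be reconstructed from the logs alone.
    logging.info("config: %s", json.dumps(config, sort_keys=True))

    # Reproducibility: every random source is seeded from the config,
    # never left at an unrecorded global default.
    rng = random.Random(config["seed"])

    # Hyperparameters come from the config, not hardcoded literals.
    return [rng.random() for _ in range(config["num_steps"])]

# Two runs with the same config produce identical results.
a = run_training({"seed": 0, "num_steps": 3})
b = run_training({"seed": 0, "num_steps": 3})
assert a == b
```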
## JAX/ML Patterns

- Are JAX transformations (`jit`, `vmap`, `scan`) used correctly? (No side effects, correct axes, proper key splitting)
- Are loss components logged individually, not just the total?
- Is there a clear separation between model definition, training logic, and evaluation logic?
- Are checkpoints saved with enough metadata to resume (step count, optimizer state, RNG state)?
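A sketch of the checkpoint-metadata check, using plain `pickle` for illustration (a real JAX project would typically use a dedicated checkpointing library, but the point is what gets saved, not how):

```python
import pickle

def save_checkpoint(path, params, opt_state, step, rng_state):
    """Persist everything needed to resume training exactly where it stopped."""
    payload = {
        "params": params,        # model weights alone are not enough
        "opt_state": opt_state,  # optimizer moments; dropping these changes training on resume
        "step": step,            # lets LR schedules and logging continue from the right point
        "rng_state": rng_state,  # so the data/augmentation stream is not replayed differently
    }
    with open(path, "wb") as f:
        pickle.dump(payload, f)

def load_checkpoint(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```

The function names are hypothetical; what the reviewer checks is that all four fields (or their equivalents) are present in whatever format the project uses.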
## Performance

- Are there unnecessary recomputations inside loops?
- Could any sequential operation be vectorized (`vmap`) or parallelized (`pmap`)?
- Are large temporary arrays created that could be avoided?
- Are Python loops used where JAX `scan` would be appropriate?
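A pure-Python illustration of the first check, hoisting a loop-invariant computation (the function names are hypothetical; the `vmap`/`scan` questions apply the same idea at the array level):

```python
import math

def scale_all_naive(values, base):
    # Recomputes the loop-invariant math.log(base) on every iteration.
    return [v / math.log(base) for v in values]

def scale_all_hoisted(values, base):
    # The invariant is computed once, outside the loop.
    log_base = math.log(base)
    return [v / log_base for v in values]
```

Both return the same results; the hoisted version simply does the expensive work once instead of once per element.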
## Maintainability
- Are new features opt-in via config flags? (existing behavior unchanged by default)
- Are dependencies imported at the module level, not buried in functions?
- Is the code tested, or at least testable? (pure functions, injectable dependencies)
- Could someone unfamiliar with this code modify it safely?
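The opt-in-flag check can be sketched with a hypothetical config dataclass (names and stages are illustrative):

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    """Hypothetical config: new features default to off."""
    learning_rate: float = 3e-4
    use_new_augmentation: bool = False  # opt-in: existing runs are unchanged

def make_pipeline(config: TrainConfig) -> list[str]:
    stages = ["load", "normalize"]
    if config.use_new_augmentation:  # behavior only changes when asked for
        stages.append("augment_v2")
    return stages

# Default config reproduces the old pipeline exactly.
assert make_pipeline(TrainConfig()) == ["load", "normalize"]
```

Because the flag defaults to `False`, merging the feature cannot silently change anyone's existing experiments, which is exactly what the reviewer is checking for.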
## Output Format

The reviewer produces a structured report with four sections:

### Must Fix (blocking)

Issues that must be resolved before merging. Each includes a file and line reference, why it matters, and a specific fix suggestion.

### Should Fix (non-blocking but important)

Issues that should be addressed but won't block progress. Includes an impact assessment and a suggested fix.

### Suggestions (quality improvements)

Optional improvements that would make the code better but aren't required.

### Good Patterns Observed

What the code does well — reinforces good practices so they continue.

- Be specific — "This function is too complex" is useless; "This function does 3 things (parsing, validation, transformation) — split into 3 functions" is actionable.
- Prioritize by impact — a correctness bug outranks a style issue.
- Acknowledge good work — positive feedback reinforces good practices.