Gates & Questioners

The seven human-in-the-loop checkpoints that prevent Claude from skipping ahead. Every gate has a mandatory format, every questioner has specific questions, and every one must be passed before the next phase begins.

Overview

Propel enforces five gates and two questioner checkpoints. Gates are hard stops where Claude presents structured findings and waits for human approval. Questioners are structured question sets that ground the next phase in concrete details rather than plausible defaults.

| Checkpoint | When | Purpose | Researcher | Engineer | Debugger | Trainer |
|------------|------|---------|------------|----------|----------|---------|
| Gate 0 | User describes task | Scope the work | Active | Active | Active | — |
| Q0 | After Gate 0 | Ground in references | Active | Active | — | — |
| Gate 1 | Investigation complete | Review findings | Active | Active | Active | — |
| Q1 | After Gate 1 | Nail down details | Active | Active | — | — |
| Gate 2 | Design complete | Review plan | — | Active | — | — |
| Gate 3 | After each component | Audit results | — | Active | — | — |
| Gate 4 | Root cause identified | Approve diagnosis | — | Active | Active | Active* |

* Trainer Mode uses Gate 4 for runtime bugs only — logic changes redirect to Engineer Mode.

Question Quality Rule

Gate questions must be disjunctive (A or B?), assumption-exposing, design-revealing, and evidence-based. Never ask "shall I proceed?" or "is this okay?" — these invite rubber-stamping. Questioner questions must be concrete and reference-seeking, grounding the work in existing examples.

Gate 0: Intake

When: The user describes what they want to build or fix.
Active in: Researcher, Engineer, Debugger

Gate 0 is the first line of defense against "just do it" requests. When a user describes a task, Claude does not start working immediately. Instead, it asks 3–5 scoping questions, one at a time, to expose assumptions and define boundaries before any investigation begins.

Format

  1. Ask one clarifying question at a time. Wait for the answer.
  2. Questions cover: scope boundaries, success criteria, implicit assumptions, what NOT to do, priority.
  3. Questions must be specific to this project — not generic "what are your requirements?"
  4. After enough answers, write a one-paragraph scope statement.
  5. Ask: "Is this what you want?" — user confirms before proceeding.

Question Types

| Type | Example |
|------|---------|
| Assumption-exposing | "You mentioned depth=2. Is that a hard design choice, or do you want arbitrary depth?" |
| Priority-revealing | "If we discover the paper's approach has a known failure mode, do you want alternatives or implement it anyway?" |
| Scope boundary | "Should the new quantizer support the existing stickiness bias, or is this a chance to remove it?" |
| Disjunctive | "Is the loss exploding (going to NaN/inf) or collapsing (going to zero/constant)?" |
| Temporal | "Did this break after a specific commit, or has it always been this way?" |
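A disjunctive exploding-vs-collapsing question can often be answered mechanically from a loss log before it is even asked. A minimal sketch, assuming losses are available as a plain list of floats; the function name, window size, and threshold are illustrative, not part of Propel:

```python
import math

def classify_loss(history, window=50, collapse_std=1e-6):
    """Classify a loss trajectory as 'exploding', 'collapsing', or 'neither'.

    history: list of float loss values, oldest first.
    window and collapse_std are illustrative defaults.
    """
    if not history:
        return "neither"
    recent = history[-window:]
    # Exploding: NaN or inf has appeared in the recent losses.
    if any(math.isnan(x) or math.isinf(x) for x in recent):
        return "exploding"
    # Collapsing: the loss has flatlined near a constant (often zero).
    mean = sum(recent) / len(recent)
    var = sum((x - mean) ** 2 for x in recent) / len(recent)
    if var ** 0.5 < collapse_std:
        return "collapsing"
    return "neither"
```

A "neither" result is itself evidence: it tells the human the failure mode is not one of the two obvious branches.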

Handling "Just Do It"

If the user says "just do it," Claude pushes back once: "I want to make sure I build what you actually need. These questions will save us a /clear cycle later." If the user insists, proceed with the most conservative interpretation and document assumptions explicitly.

Output

A confirmed one-paragraph scope statement that becomes the reference for all subsequent phases. This scope statement is recorded in the investigation README.

Questioner Q0: Grounding Before Investigation

When: After Gate 0 confirms scope, before investigation begins.
Active in: Researcher, Engineer

Why Q0 Exists

Claude is great at looking at one thing and morphing it into what you want, but bad at creating from scratch when the problem is unconstrained. Vague requests lead to plausible-looking but wrong implementations — Claude fills gaps with training-data averages instead of asking. Q0 forces the user to provide concrete reference points before investigation.

The 5 Grounding Questions

Asked one at a time, skipping any already answered during Gate 0:

| # | Category | Question |
|---|----------|----------|
| 1 | Reference implementation | "Is there an existing codebase or repo I should use as a starting point? If so, which files or modules are most relevant?" |
| 2 | Architecture pattern | "Is there a known architecture or design pattern you want to follow? (e.g., a specific paper's approach, an existing task implementation, a framework's canonical example)" |
| 3 | Example to study | "Is there a similar task/feature already implemented in this project or another project that I should study and adapt from?" |
| 4 | Benchmark / ground truth | "What benchmark, test case, or expected behavior should I use to verify correctness? Is there a known-good output I can compare against?" |
| 5 | Framework constraints | "Are there specific APIs, libraries, or framework conventions I must use? (e.g., 'use robosuite's SingleArmEnv base class', 'follow the MuJoCo MJX wrapper pattern')" |

What Happens After Q0

Example

User:  I want to build a can transport task with robosuite.
[Gate 0 confirms scope: single-arm can transport, dense reward, sim only]

Claude (Q0-1): Is there an existing task in robosuite I should use as a
               starting point? For example, PickPlace or Lift?
User:  Yeah, look at PickPlace in robosuite/environments/manipulation/

Claude (Q0-2): Got it. Is there a specific architecture pattern — should this
               follow the same reward structure as PickPlace, or do you have
               a different reward design in mind?
User:  Same structure, but add a transport phase between pick and place.

Claude (Q0-3): For verification — is there a known trajectory or benchmark
               I should test against, or should I use the PickPlace test suite
               as a baseline?
User:  Use PickPlace tests as baseline, plus a manual trajectory I'll provide.

[Claude records all three reference points in scratch/README.md]

Gate 1: Post-Investigation

When: Investigation is complete.
Active in: Researcher, Engineer, Debugger

Gate 1 is where Claude presents what it found during investigation and the human makes design decisions. Claude does not proceed to design or implementation without explicit approval of the findings.

Mandatory Presentation Format

## Investigation Summary

### What I Found
- [3-5 bullet findings, concrete and specific]

### Surprises / Risks
- [Unexpected complications discovered during investigation]

### Open Questions — I Need Your Input
1. [Trade-off question — disjunctive, not yes/no]
2. [Architecture choice]
3. [Priority question]

Findings Must Be Concrete

"The codebase is complex" is not a finding. "The loss function in model/vq.py:89-124 computes commitment loss across all depth levels without per-level scaling" is a finding. Every bullet must reference specific files, line numbers, or concrete observations.
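The concreteness rule lends itself to a mechanical check: a finding bullet that cites no file and line span is suspect. A hypothetical lint sketch; the `path.py:89-124` citation pattern is an assumption about how findings are written, not a Propel requirement:

```python
import re

# Matches location references like model/vq.py:89 or model/vq.py:89-124.
LOCATION = re.compile(r"\S+\.\w+:\d+(?:-\d+)?")

def is_concrete(finding: str) -> bool:
    """A finding counts as concrete if it names at least one file:line span."""
    return bool(LOCATION.search(finding))
```

This catches only the file-reference half of the rule; "concrete observations" without a location still need human judgment.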

Output

The human's design decisions are recorded in the investigation README under "Design Decisions (Human-Approved)". These decisions are binding for all subsequent phases.

Questioner Q1: Before Design

When: After Gate 1 confirms investigation findings, before design begins.
Active in: Researcher, Engineer

Why Q1 Exists

After investigation, Claude has findings and the user has made high-level design decisions. But there is a dangerous gap: Claude may still fill in critical implementation details with plausible defaults instead of asking. Q1 catches these before design begins, turning vague intent into binding constraints.

The 6 Detail Questions

Asked one at a time, only the questions relevant to what was discovered during investigation:

| # | Category | Question |
|---|----------|----------|
| 1 | Interface contracts | "Based on the investigation, here are the key interfaces I found: [list]. Should the new code match these exactly, or are we changing the API?" |
| 2 | Data formats | "The reference uses [format/shape/convention]. Should we follow the same convention, or does your use case need something different?" |
| 3 | Configuration | "Should this be configurable via [config file / constructor args / env vars]? What should the defaults be?" |
| 4 | Edge cases | "During investigation I found these edge cases: [list]. How should each be handled — error, fallback, or ignore?" |
| 5 | Integration points | "This will need to connect to [existing modules]. Should I modify those modules or create adapters?" |
| 6 | Scope confirmation | "Based on what I found, the minimal implementation needs [X, Y, Z]. Should I also include [A, B] which I noticed in the reference, or keep it minimal?" |

What Happens After Q1

Example

User:  [After Gate 1, approved investigation findings for can transport task]

Claude (Q1-1): PickPlace uses a 4-phase reward: reach → grasp → lift → place.
               For can transport, I'd add a "transport" phase between lift and
               place. Should I keep the same reward scale (0-1 per phase) or
               re-weight for 5 phases?
User:  Keep 0-1 per phase, so total reward range becomes 0-5 instead of 0-4.

Claude (Q1-2): PickPlace accepts a single placement target. Transport implies
               a waypoint. Should the waypoint be a constructor parameter, or
               computed from start/end positions?
User:  Constructor parameter. I want to be able to set it per episode.

Claude (Q1-3): The PickPlace test suite checks gripper-to-object distance.
               For transport, should I also check object-to-waypoint distance
               during the transport phase?
User:  Yes, that's the key metric for the new phase.

[Claude records all decisions and proceeds to design with concrete constraints]

Gate 2: Post-Design

When: Design and implementation plan are ready.
Active in: Engineer

Gate 2 presents the full implementation plan for human review. The human can approve, modify, or reject parts of the plan. Implementation does not begin until the human says "go."

Mandatory Presentation Format

## Implementation Plan Summary

### Components (in order)
1. [Component name] — maps to paper eq. [X] — touches files [Y, Z]
2. [Component name] — maps to paper eq. [X] — touches files [Y, Z]

### Paper-to-Code Mapping
| Paper Reference | Code Location | Status |
|----------------|--------------|--------|
| Eq. 3 (VQ loss) | model/vq.py:compute_loss() | New |
| Algorithm 1 (codebook update) | model/vq.py:update_codebook() | Modified |

### Regression Risk Assessment
- [What could break and how we're protecting it]

### What I'm Uncertain About
1. [Uncertainty + two options + trade-offs]
2. [Uncertainty + two options + trade-offs]

### Estimated Scope
- [N components, M files modified, K new files]

Component Order Matters

Components must be listed in implementation order. Dependencies flow downward — component 2 can depend on component 1, but not the reverse. If the human wants to change the order, Claude must re-check dependency chains.
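The downward-only dependency rule can be re-checked mechanically whenever the order changes. A sketch, assuming the plan is available as (name, dependencies) pairs; the representation is illustrative:

```python
def check_order(components):
    """Verify each component depends only on components listed earlier.

    components: list of (name, deps) pairs in planned implementation order.
    Returns a list of violation messages; an empty list means the order is valid.
    """
    seen, violations = set(), []
    for name, deps in components:
        for dep in deps:
            if dep not in seen:
                violations.append(
                    f"{name} depends on {dep}, which comes later or is missing"
                )
        seen.add(name)
    return violations
```

Running this after a human reorders the plan is exactly the "re-check dependency chains" step described above.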

Key Elements

| Element | Purpose |
|---------|---------|
| Components list | What will be built, in what order, with paper equation mappings |
| Paper-to-code mapping | Explicit link between paper equations/algorithms and code locations |
| Regression risk | What existing functionality could break and how it will be protected |
| Uncertainties | Open questions with concrete options and trade-offs for the human to decide |
| Estimated scope | Concrete count of components, files, and changes to set expectations |

Gate 3: Mid-Implementation

When: After each component is implemented.
Active in: Engineer

Gate 3 is a per-component checkpoint during implementation. After Claude implements each component, auditors run automatically and results are presented to the human. This is the most frequently triggered gate — it fires once per component.

Auditor Review Format

Each component goes through three stages before Gate 3:

  1. Implement — Subagent implements the specific task from the approved plan.
  2. Review — Spec compliance (does it match the plan?) and paper alignment (does it match the source equations?).
  3. Audit — Parallel auditors check for paper mismatches, JAX bugs, silent failures, and regressions.

Clean Pass Format

"All auditors passed for component 3. Spec ✓, Paper ✓, JAX ✓, Silent bugs ✓, Regression ✓. Moving to component 4?"

Issues Found Format

## Component 3 Audit Results

### Issues Found
- ✗ Silent Bug Detector: commitment loss applied to combined residual, not per-depth
  - Paper specifies per-depth (Section 3.2, Eq. 5)
  - Severity: Critical

### Question
Should I fix this to match the paper (per-depth commitment), or was the
combined approach an intentional simplification?

Context Management at Gate 3

Every 3 components, Claude offers a pause point for /clear. Long implementation sessions accumulate context rapidly. The living README in scratch/ ensures no findings are lost across context boundaries.

Auditor Dispatch Table

| What Changed | Auditors That Run |
|--------------|-------------------|
| Paper-derived component | paper-alignment-auditor |
| JAX code (scan, vmap, pmap, jit) | jax-logic-auditor |
| Model / loss / data code | silent-bug-detector |
| Any code change | regression-guard |
| Environment interaction code | env-researcher |
| Deep trace needed (explicit only) | data-flow-tracer |
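The dispatch logic in that table is simple enough to sketch directly. Change traits as a set of string flags are an assumption about how Propel represents a change; regression-guard is appended unconditionally because every change matches "any code change", and data-flow-tracer is omitted because it runs only on explicit request:

```python
def select_auditors(change):
    """Pick auditors for a change, mirroring the dispatch table above.

    change: set of traits, e.g. {"paper", "jax", "model", "env"}.
    """
    table = {
        "paper": "paper-alignment-auditor",
        "jax": "jax-logic-auditor",
        "model": "silent-bug-detector",
        "env": "env-researcher",
    }
    auditors = [name for trait, name in table.items() if trait in change]
    auditors.append("regression-guard")  # any code change triggers this
    return auditors
```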

Gate 4: Before Bug Fix

When: Root cause of a bug has been identified.
Active in: Engineer, Debugger, Trainer (runtime only)

Gate 4 is the mandatory checkpoint before any fix is applied. Claude must present its diagnosis in a structured format and get human approval. This prevents shotgun debugging, symptom patching, and displaced fixes.

The Core Rule

Diagnose before fixing. Never apply a fix without presenting the diagnosis first and getting human approval. No exceptions. If you don't understand the root cause well enough to fill in every field below, you haven't investigated enough.

Mandatory Diagnosis Format

Symptom:         What is observed (concrete, measurable)
Root Cause:      What is actually wrong (with evidence)
Why This Happens: The mechanism that produces the symptom from the root cause
Proposed Fix:    What to change and where (specific files and lines)
Side Effects:    What else this fix touches (other components, training dynamics)
What Won't Fix:  Approaches that look promising but won't work (and why)
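The "fill in every field" rule can be enforced with a simple structure. A sketch whose field names mirror the format above; the `Diagnosis` class itself is hypothetical, not part of Propel:

```python
from dataclasses import dataclass, fields

@dataclass
class Diagnosis:
    symptom: str           # concrete, measurable observation
    root_cause: str        # what is actually wrong, with evidence
    why_this_happens: str  # mechanism linking root cause to symptom
    proposed_fix: str      # specific files and lines to change
    side_effects: str      # what else the fix touches
    what_wont_fix: str     # rejected approaches and why each fails

    def is_complete(self) -> bool:
        """All six fields must be non-empty before the gate is presented."""
        return all(getattr(self, f.name).strip() for f in fields(self))
```

An incomplete diagnosis is the signal to keep investigating, not to present the gate with blanks.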

Field-by-Field Breakdown

| Field | Required Content | Anti-Pattern |
|-------|------------------|--------------|
| Symptom | Measurable observation: "NaN at step ~500 in commitment loss" | "Training doesn't work" (too vague) |
| Root Cause | Specific location + evidence: "RVQ commitment loss sums across depth levels without per-level scaling at model/rvq.py:89" | "Something is wrong with the loss" (no location) |
| Why This Happens | Causal chain: "VQ commitment loss assumes single codebook. RVQ adds levels but reuses same loss unchanged." | Restating the symptom |
| Proposed Fix | Concrete change: "Scale commitment loss by 1/num_levels per level" | "Fix the loss function" (not actionable) |
| Side Effects | Impact analysis: "Changes training dynamics. May need LR adjustment." | "No side effects" (almost never true) |
| What Won't Fix | Dead ends with reasons: "Gradient clipping masks symptom but doesn't fix imbalanced scaling" | Omitting this section |

The "What Won't Fix" Section

This is often the most valuable part of the diagnosis. It documents dead ends so neither the user nor a future session wastes time re-exploring them. Include every approach you considered and rejected, with the specific reason each one fails.

Bug Classification at Gate 4

Every issue must be classified before the diagnosis is presented. The classification determines the evidence standard:

| Classification | Evidence Required | What Happens Next |
|----------------|-------------------|-------------------|
| Code Bug | Specific line numbers, concrete values (expected vs. actual), traceback or reproduction steps | Fix is applied after Gate 4 approval |
| Design Issue | Literature backing mandatory — papers, known failure modes, established best practices that show why the current design is wrong | Actual redesign is redirected to Engineer Mode (/switch engineer) for the full pipeline |
| Environment / Config | Specific config values, version mismatches, environment state | Config change applied after Gate 4 approval |

Design Issues Need Literature

"I think the architecture is wrong" is not sufficient for a design issue classification. You must cite why it is wrong with references to papers, documented failure modes, or established best practices. If you cannot find literature backing, say so explicitly: "I believe this may be a design issue, but I haven't found literature confirming this."

The 3-Strike Limit

If the same debugging approach fails three times, the assumptions are wrong. This rule is absolute:

| Strike | Action |
|--------|--------|
| Strike 1 | Apply fix based on diagnosis. If it doesn't resolve the symptom, gather more evidence. |
| Strike 2 | Revised fix based on new evidence. If still failing, question whether the root cause hypothesis is correct. |
| Strike 3 | Stop. Present what was learned from all three attempts. Ask: "My assumption about the root cause may be wrong. Here's what I've learned. Should I investigate a fundamentally different direction?" |

Never Ignore the 3-Strike Limit

The temptation is to try "one more variation." This is almost always wrong. Three failed attempts of the same approach is strong evidence that the mental model is incorrect, not that the fix needs tweaking.
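The strike bookkeeping itself is trivial, which is part of why it works: the hard part is honoring the stop, not counting. A sketch that keys failures by approach so that "one more variation" of the same approach still increments the same counter; the keying scheme and return strings are illustrative:

```python
class StrikeTracker:
    """Count failed fixes per debugging approach and force a stop at three."""

    LIMIT = 3

    def __init__(self):
        self.failures = {}

    def record_failure(self, approach: str) -> str:
        """Record one failed fix attempt and return the mandated next action."""
        self.failures[approach] = self.failures.get(approach, 0) + 1
        n = self.failures[approach]
        if n >= self.LIMIT:
            return "stop: question the root-cause hypothesis"
        return "gather more evidence" if n == 1 else "question the hypothesis"
```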

Quick Reference

| Checkpoint | Key Output | Must Include |
|------------|------------|--------------|
| Gate 0 | Scope statement | 3-5 disjunctive questions, confirmed scope paragraph |
| Q0 | Reference sources | Reference impl, architecture pattern, example, benchmark, framework constraints |
| Gate 1 | Approved findings | 3-5 bullet findings, surprises/risks, open questions |
| Q1 | Binding constraints | Interface contracts, data formats, config, edge cases, integration points, scope |
| Gate 2 | Approved plan | Components, paper-to-code mapping, regression risk, uncertainties, scope estimate |
| Gate 3 | Audit results | Per-component: spec compliance, paper alignment, auditor findings |
| Gate 4 | Approved diagnosis | Symptom, root cause, why, fix, side effects, what won't fix |