Gates & Questioners
The seven human-in-the-loop checkpoints that prevent Claude from skipping ahead. Every gate has a mandatory format, every questioner has specific questions, and every one must be passed before the next phase begins.
Overview
Propel enforces 5 gates and 2 questioner checkpoints. Gates are hard stops where Claude presents structured findings and waits for human approval. Questioners are structured question sets that ground the next phase in concrete details rather than plausible defaults.
| Checkpoint | When | Purpose | Researcher | Engineer | Debugger | Trainer |
|---|---|---|---|---|---|---|
| Gate 0 | User describes task | Scope the work | Active | Active | Active | — |
| Q0 | After Gate 0 | Ground in references | Active | Active | — | — |
| Gate 1 | Investigation complete | Review findings | Active | Active | Active | — |
| Q1 | After Gate 1 | Nail down details | Active | Active | — | — |
| Gate 2 | Design complete | Review plan | — | Active | — | — |
| Gate 3 | After each component | Audit results | — | Active | — | — |
| Gate 4 | Root cause identified | Approve diagnosis | — | Active | Active | Active* |
* Trainer Mode uses Gate 4 for runtime bugs only — logic changes redirect to Engineer Mode.
Gate questions must be disjunctive (A or B?), assumption-exposing, design-revealing, and evidence-based. Never ask "shall I proceed?" or "is this okay?" — these invite rubber-stamping. Questioner questions must be concrete and reference-seeking, grounding the work in existing examples.
Gate 0: Intake
When: The user describes what they want to build or fix.
Active in: Researcher Engineer Debugger
Gate 0 is the first line of defense against "just do it" requests. When a user describes a task, Claude does not start working immediately. Instead, it asks 3–5 scoping questions, one at a time, to expose assumptions and define boundaries before any investigation begins.
Format
- Ask one clarifying question at a time. Wait for the answer.
- Questions cover: scope boundaries, success criteria, implicit assumptions, what NOT to do, priority.
- Questions must be specific to this project — not generic "what are your requirements?"
- After enough answers, write a one-paragraph scope statement.
- Ask: "Is this what you want?" — user confirms before proceeding.
Question Types
| Type | Example |
|---|---|
| Assumption-exposing | "You mentioned depth=2. Is that a hard design choice, or do you want arbitrary depth?" |
| Priority-revealing | "If we discover the paper's approach has a known failure mode, do you want alternatives, or should I implement it anyway?" |
| Scope boundary | "Should the new quantizer support the existing stickiness bias, or is this a chance to remove it?" |
| Disjunctive | "Is the loss exploding (going to NaN/inf) or collapsing (going to zero/constant)?" |
| Temporal | "Did this break after a specific commit, or has it always been this way?" |
If the user says "just do it," Claude pushes back once: "I want to make sure I build what you actually need. These questions will save us a /clear cycle later." If the user insists, proceed with the most conservative interpretation and document assumptions explicitly.
Output
A confirmed one-paragraph scope statement that becomes the reference for all subsequent phases. This scope statement is recorded in the investigation README.
Questioner Q0: Grounding Before Investigation
When: After Gate 0 confirms scope, before investigation begins.
Active in: Researcher Engineer
Claude is great at looking at one thing and morphing it into what you want, but bad at creating from scratch when the problem is unconstrained. Vague requests lead to plausible-looking but wrong implementations — Claude fills gaps with training-data averages instead of asking. Q0 forces the user to provide concrete reference points before investigation.
The 5 Grounding Questions
Asked one at a time, skipping any already answered during Gate 0:
| # | Category | Question |
|---|---|---|
| 1 | Reference implementation | "Is there an existing codebase or repo I should use as a starting point? If so, which files or modules are most relevant?" |
| 2 | Architecture pattern | "Is there a known architecture or design pattern you want to follow? (e.g., a specific paper's approach, an existing task implementation, a framework's canonical example)" |
| 3 | Example to study | "Is there a similar task/feature already implemented in this project or another project that I should study and adapt from?" |
| 4 | Benchmark / ground truth | "What benchmark, test case, or expected behavior should I use to verify correctness? Is there a known-good output I can compare against?" |
| 5 | Framework constraints | "Are there specific APIs, libraries, or framework conventions I must use? (e.g., 'use robosuite's SingleArmEnv base class', 'follow the MuJoCo MJX wrapper pattern')" |
What Happens After Q0
- All answers are recorded in the investigation README under "Reference Sources".
- If the user provides a reference codebase, Claude records it and focuses investigation on those files.
- If the user says "no reference, build from scratch" — Claude flags this as high-risk unconstrained implementation in the README. Investigation will be more thorough to compensate.
- The answers directly shape what gets investigated in the next phase.
Example
User: I want to build a can transport task with robosuite.
[Gate 0 confirms scope: single-arm can transport, dense reward, sim only]
Claude (Q0-1): Is there an existing task in robosuite I should use as a
starting point? For example, PickPlace or Lift?
User: Yeah, look at PickPlace in robosuite/environments/manipulation/
Claude (Q0-2): Got it. Is there a specific architecture pattern — should this
follow the same reward structure as PickPlace, or do you have
a different reward design in mind?
User: Same structure, but add a transport phase between pick and place.
Claude (Q0-3): For verification — is there a known trajectory or benchmark
I should test against, or should I use the PickPlace test suite
as a baseline?
User: Use PickPlace tests as baseline, plus a manual trajectory I'll provide.
[Claude records all three reference points in scratch/README.md]
Gate 1: Post-Investigation
When: Investigation is complete.
Active in: Researcher Engineer Debugger
Gate 1 is where Claude presents what it found during investigation and the human makes design decisions. Claude does not proceed to design or implementation without explicit approval of the findings.
Mandatory Presentation Format
## Investigation Summary
### What I Found
- [3-5 bullet findings, concrete and specific]
### Surprises / Risks
- [Unexpected complications discovered during investigation]
### Open Questions — I Need Your Input
1. [Trade-off question — disjunctive, not yes/no]
2. [Architecture choice]
3. [Priority question]
"The codebase is complex" is not a finding. "The loss function in model/vq.py:89-124 computes commitment loss across all depth levels without per-level scaling" is a finding. Every bullet must reference specific files, line numbers, or concrete observations.
Output
The human's design decisions are recorded in the investigation README under "Design Decisions (Human-Approved)". These decisions are binding for all subsequent phases.
Questioner Q1: Before Design
When: After Gate 1 confirms investigation findings, before design begins.
Active in: Researcher Engineer
After investigation, Claude has findings and the user has made high-level design decisions. But there is a dangerous gap: Claude may still fill in critical implementation details with plausible defaults instead of asking. Q1 catches these before design begins, turning vague intent into binding constraints.
The 6 Detail Questions
Asked one at a time, limited to the questions relevant to what was discovered during investigation:
| # | Category | Question |
|---|---|---|
| 1 | Interface contracts | "Based on the investigation, here are the key interfaces I found: [list]. Should the new code match these exactly, or are we changing the API?" |
| 2 | Data formats | "The reference uses [format/shape/convention]. Should we follow the same convention, or does your use case need something different?" |
| 3 | Configuration | "Should this be configurable via [config file / constructor args / env vars]? What should the defaults be?" |
| 4 | Edge cases | "During investigation I found these edge cases: [list]. How should each be handled — error, fallback, or ignore?" |
| 5 | Integration points | "This will need to connect to [existing modules]. Should I modify those modules or create adapters?" |
| 6 | Scope confirmation | "Based on what I found, the minimal implementation needs [X, Y, Z]. Should I also include [A, B] which I noticed in the reference, or keep it minimal?" |
What Happens After Q1
- Answers are recorded under "Implementation Decisions (Human-Approved)" in the investigation README.
- These answers become binding constraints — Claude must not deviate from them without asking.
- If the user is unsure about a detail, Claude flags it as an open question to revisit during implementation (Gate 3).
Example
User: [After Gate 1, approved investigation findings for can transport task]
Claude (Q1-1): PickPlace uses a 4-phase reward: reach → grasp → lift → place.
For can transport, I'd add a "transport" phase between lift and
place. Should I keep the same reward scale (0-1 per phase) or
re-weight for 5 phases?
User: Keep 0-1 per phase, so total reward range becomes 0-5 instead of 0-4.
Claude (Q1-2): PickPlace accepts a single placement target. Transport implies
a waypoint. Should the waypoint be a constructor parameter, or
computed from start/end positions?
User: Constructor parameter. I want to be able to set it per episode.
Claude (Q1-3): The PickPlace test suite checks gripper-to-object distance.
For transport, should I also check object-to-waypoint distance
during the transport phase?
User: Yes, that's the key metric for the new phase.
[Claude records all decisions and proceeds to design with concrete constraints]
Gate 2: Post-Design
When: Design and implementation plan are ready.
Active in: Engineer
Gate 2 presents the full implementation plan for human review. The human can approve, modify, or reject parts of the plan. Implementation does not begin until the human says "go."
Mandatory Presentation Format
## Implementation Plan Summary
### Components (in order)
1. [Component name] — maps to paper eq. [X] — touches files [Y, Z]
2. [Component name] — maps to paper eq. [X] — touches files [Y, Z]
### Paper-to-Code Mapping
| Paper Reference | Code Location | Status |
|----------------|--------------|--------|
| Eq. 3 (VQ loss) | model/vq.py:compute_loss() | New |
| Algorithm 1 (codebook update) | model/vq.py:update_codebook() | Modified |
### Regression Risk Assessment
- [What could break and how we're protecting it]
### What I'm Uncertain About
1. [Uncertainty + two options + trade-offs]
2. [Uncertainty + two options + trade-offs]
### Estimated Scope
- [N components, M files modified, K new files]
Components must be listed in implementation order. Dependencies flow downward — component 2 can depend on component 1, but not the reverse. If the human wants to change the order, Claude must re-check dependency chains.
Key Elements
| Element | Purpose |
|---|---|
| Components list | What will be built, in what order, with paper equation mappings |
| Paper-to-code mapping | Explicit link between paper equations/algorithms and code locations |
| Regression risk | What existing functionality could break and how it will be protected |
| Uncertainties | Open questions with concrete options and trade-offs for the human to decide |
| Estimated scope | Concrete count of components, files, and changes to set expectations |
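The "dependencies flow downward" ordering rule can be expressed as a small check. This is an illustrative sketch, not part of Propel: the function name, the plan representation, and the component names are all hypothetical.

```python
# Hypothetical sketch of the Gate 2 ordering rule: each component may
# depend only on components that appear earlier in the plan.

def order_is_valid(plan: list[tuple[str, list[str]]]) -> bool:
    """plan: (component, dependencies) pairs in implementation order."""
    seen: set[str] = set()
    for component, deps in plan:
        if any(d not in seen for d in deps):
            return False  # depends on a component not yet implemented
        seen.add(component)
    return True
```

If the human reorders components at Gate 2, re-running a check like this is one way to "re-check dependency chains" before implementation begins.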
Gate 3: Mid-Implementation
When: After each component is implemented.
Active in: Engineer
Gate 3 is a per-component checkpoint during implementation. After Claude implements each component, auditors run automatically and results are presented to the human. This is the most frequently triggered gate — it fires once per component.
Auditor Review Format
Each component goes through three stages before Gate 3:
- Implement — Subagent implements the specific task from the approved plan.
- Review — Spec compliance (does it match the plan?) and paper alignment (does it match the source equations?).
- Audit — Parallel auditors check for paper mismatches, JAX bugs, silent failures, and regressions.
Clean Pass Format
"All auditors passed for component 3. Spec ✓, Paper ✓, JAX ✓, Silent bugs ✓, Regression ✓. Moving to component 4?"
Issues Found Format
## Component 3 Audit Results
### Issues Found
- ✗ Silent Bug Detector: commitment loss applied to combined residual, not per-depth
- Paper specifies per-depth (Section 3.2, Eq. 5)
- Severity: Critical
### Question
Should I fix this to match the paper (per-depth commitment), or was the
combined approach an intentional simplification?
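The combined-vs-per-depth distinction in the example above can be made concrete with a small sketch. This is a hypothetical RVQ fragment, assuming made-up shapes and names; it is not the project's actual model/vq.py.

```python
import numpy as np

# Hypothetical sketch of the bug class in the audit example: commitment
# loss computed on the combined residual vs. per depth level.

def commitment_combined(residuals: np.ndarray) -> float:
    # residuals: (num_levels, dim) quantization error at each depth.
    # Summing levels first lets errors at different depths cancel out.
    return float(np.sum(residuals.sum(axis=0) ** 2))

def commitment_per_depth(residuals: np.ndarray) -> float:
    # Per-level loss, averaged so the scale does not grow with depth.
    per_level = np.sum(residuals ** 2, axis=1)
    return float(per_level.mean())
```

With opposite-signed residuals at two depths, the combined version reports zero loss while the per-depth version correctly penalizes both levels, which is why the auditor flags the combined form as a silent bug.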
Every 3 components, Claude offers a pause point for /clear. Long implementation sessions accumulate context rapidly. The living README in scratch/ ensures no findings are lost across context boundaries.
Auditor Dispatch Table
| What Changed | Auditors That Run |
|---|---|
| Paper-derived component | paper-alignment-auditor |
| JAX code (scan, vmap, pmap, jit) | jax-logic-auditor |
| Model / loss / data code | silent-bug-detector |
| Any code change | regression-guard |
| Environment interaction code | env-researcher |
| Deep trace needed (explicit only) | data-flow-tracer |
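The dispatch table above can be read as a simple rule set. The sketch below is illustrative only: the auditor names come from the table, but the function, its signature, and the change-description keys are assumptions, not Propel internals.

```python
# Hypothetical sketch of the auditor dispatch rules.

def select_auditors(change: dict) -> list[str]:
    """Map a description of a code change to the auditors that run."""
    auditors = []
    if change.get("paper_derived"):
        auditors.append("paper-alignment-auditor")
    if change.get("uses_jax"):            # scan, vmap, pmap, jit
        auditors.append("jax-logic-auditor")
    if change.get("area") in {"model", "loss", "data"}:
        auditors.append("silent-bug-detector")
    auditors.append("regression-guard")   # runs on any code change
    if change.get("touches_env"):
        auditors.append("env-researcher")
    if change.get("deep_trace"):          # explicit request only
        auditors.append("data-flow-tracer")
    return auditors
```

Note that regression-guard is unconditional: every code change gets at least one auditor.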
Gate 4: Before Bug Fix
When: Root cause of a bug has been identified.
Active in: Engineer Debugger Trainer (runtime only)
Gate 4 is the mandatory checkpoint before any fix is applied. Claude must present its diagnosis in a structured format and get human approval. This prevents shotgun debugging, symptom patching, and displaced fixes.
Diagnose before fixing. Never apply a fix without presenting the diagnosis first and getting human approval. No exceptions. If you don't understand the root cause well enough to fill in every field below, you haven't investigated enough.
Mandatory Diagnosis Format
Symptom: What is observed (concrete, measurable)
Root Cause: What is actually wrong (with evidence)
Why This Happens: The mechanism that produces the symptom from the root cause
Proposed Fix: What to change and where (specific files and lines)
Side Effects: What else this fix touches (other components, training dynamics)
What Won't Fix: Approaches that look promising but won't work (and why)
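The six mandatory fields can be modeled as a record with a completeness check. Propel specifies the format as prose; the dataclass below is an assumption about how one might enforce "fill in every field", and its names are illustrative.

```python
from dataclasses import dataclass, fields

# Hypothetical sketch: the six mandatory Gate 4 diagnosis fields.

@dataclass
class Diagnosis:
    symptom: str           # concrete, measurable observation
    root_cause: str        # what is actually wrong, with evidence
    why_this_happens: str  # mechanism linking root cause to symptom
    proposed_fix: str      # specific files and lines to change
    side_effects: str      # what else the fix touches
    what_wont_fix: str     # rejected approaches and why they fail

    def is_complete(self) -> bool:
        """A diagnosis is presentable only when no field is blank."""
        return all(getattr(self, f.name).strip() for f in fields(self))
```

A blank field here is the machine-checkable version of "you haven't investigated enough."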
Field-by-Field Breakdown
| Field | Required Content | Anti-Pattern |
|---|---|---|
| Symptom | Measurable observation: "NaN at step ~500 in commitment loss" | "Training doesn't work" (too vague) |
| Root Cause | Specific location + evidence: "RVQ commitment loss sums across depth levels without per-level scaling at model/rvq.py:89" | "Something is wrong with the loss" (no location) |
| Why This Happens | Causal chain: "VQ commitment loss assumes single codebook. RVQ adds levels but reuses same loss unchanged." | Restating the symptom |
| Proposed Fix | Concrete change: "Scale commitment loss by 1/num_levels per level" | "Fix the loss function" (not actionable) |
| Side Effects | Impact analysis: "Changes training dynamics. May need LR adjustment." | "No side effects" (almost never true) |
| What Won't Fix | Dead ends with reasons: "Gradient clipping masks symptom but doesn't fix imbalanced scaling" | Omitting this section |
The What Won't Fix field is often the most valuable part of the diagnosis. It documents dead ends so neither the user nor a future session wastes time re-exploring them. Include every approach you considered and rejected, with the specific reason each one fails.
Bug Classification at Gate 4
Every issue must be classified before the diagnosis is presented. The classification determines the evidence standard:
| Classification | Evidence Required | What Happens Next |
|---|---|---|
| Code Bug | Specific line numbers, concrete values (expected vs. actual), traceback or reproduction steps | Fix is applied after Gate 4 approval |
| Design Issue | Literature backing mandatory — papers, known failure modes, established best practices that show why the current design is wrong | Actual redesign is redirected to Engineer Mode (/switch engineer) for the full pipeline |
| Environment / Config | Specific config values, version mismatches, environment state | Config change applied after Gate 4 approval |
"I think the architecture is wrong" is not sufficient for a design issue classification. You must cite why it is wrong with references to papers, documented failure modes, or established best practices. If you cannot find literature backing, say so explicitly: "I believe this may be a design issue, but I haven't found literature confirming this."
The 3-Strike Limit
If the same debugging approach fails three times, the assumptions are wrong. This rule is absolute:
| Strike | Action |
|---|---|
| Strike 1 | Apply fix based on diagnosis. If it doesn't resolve the symptom, gather more evidence. |
| Strike 2 | Revised fix based on new evidence. If still failing, question whether the root cause hypothesis is correct. |
| Strike 3 | Stop. Present what was learned from all three attempts. Ask: "My assumption about the root cause may be wrong. Here's what I've learned. Should I investigate a fundamentally different direction?" |
The temptation is to try "one more variation." This is almost always wrong. Three failed attempts of the same approach is strong evidence that the mental model is incorrect, not that the fix needs tweaking.
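The 3-strike rule amounts to a tiny state machine. The sketch below is illustrative, assuming hypothetical class and method names; it is not a Propel internal.

```python
# Hypothetical sketch of the 3-strike rule: after three failed fix
# attempts for the same root-cause hypothesis, Claude must stop.

class StrikeCounter:
    LIMIT = 3

    def __init__(self) -> None:
        self.failures: list[str] = []  # what was learned from each attempt

    def record_failure(self, lesson: str) -> bool:
        """Record a failed fix attempt; return True when it's time to stop."""
        self.failures.append(lesson)
        return len(self.failures) >= self.LIMIT

    def reset(self) -> None:
        """A fundamentally different root-cause hypothesis starts fresh."""
        self.failures.clear()
```

Keeping the lessons alongside the count matters: at strike 3, the accumulated list is exactly what gets presented to the human.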
Quick Reference
| Checkpoint | Key Output | Must Include |
|---|---|---|
| Gate 0 | Scope statement | 3-5 disjunctive questions, confirmed scope paragraph |
| Q0 | Reference sources | Reference impl, architecture pattern, example, benchmark, framework constraints |
| Gate 1 | Approved findings | 3-5 bullet findings, surprises/risks, open questions |
| Q1 | Binding constraints | Interface contracts, data formats, config, edge cases, integration points, scope |
| Gate 2 | Approved plan | Components, paper-to-code mapping, regression risk, uncertainties, scope estimate |
| Gate 3 | Audit results | Per-component: spec compliance, paper alignment, auditor findings |
| Gate 4 | Approved diagnosis | Symptom, root cause, why, fix, side effects, what won't fix |