Gates & Questioners

The seven human-in-the-loop checkpoints that prevent Claude from skipping ahead. Every gate has a mandatory format, every questioner has specific questions, and every one must be passed before the next phase begins.

Overview

Propel enforces five gates and two questioner checkpoints. Gates are hard stops where Claude presents structured findings and waits for human approval. Questioners are structured question sets that ground the next phase in concrete details rather than plausible defaults.

| Checkpoint | When | Purpose | Researcher | Engineer | Debugger | Trainer |
|------------|------|---------|------------|----------|----------|---------|
| Gate 0 | User describes task | Scope the work | Active | Active | Active | — |
| Q0 | After Gate 0 | Ground in references | Active | Active | — | — |
| Gate 1 | Investigation complete | Review findings | Active | Active | Active | — |
| Q1 | After Gate 1 | Nail down details | Active | Active | — | — |
| Gate 2 | Design complete | Review plan | — | Active | — | — |
| Gate 3 | After each component | Audit results | — | Active | — | — |
| Gate 4 | Root cause identified | Approve diagnosis | — | Active | Active | Active* |

* Trainer Mode uses Gate 4 for runtime bugs only — logic changes redirect to Engineer Mode.

Question Quality Rule

Gate questions must be disjunctive (A or B?), assumption-exposing, design-revealing, and evidence-based. Never ask "shall I proceed?" or "is this okay?" — these invite rubber-stamping. Questioner questions must be concrete and reference-seeking, grounding the work in existing examples.

Gate 0: Intake

When: The user describes what they want to build or fix.
Active in: Researcher, Engineer, Debugger

Gate 0 is the first line of defense against "just do it" requests. When a user describes a task, Claude does not start working immediately. Instead, it asks 3–5 scoping questions, one at a time, to expose assumptions and define boundaries before any investigation begins.

Format

  1. Ask one clarifying question at a time. Wait for the answer.
  2. Questions cover: scope boundaries, success criteria, implicit assumptions, what NOT to do, priority.
  3. Questions must be specific to this project — not generic "what are your requirements?"
  4. After enough answers, write a one-paragraph scope statement.
  5. Ask: "Is this what you want?" — user confirms before proceeding.

Question Types

| Type | Example |
|------|---------|
| Assumption-exposing | "You mentioned depth=2. Is that a hard design choice, or do you want arbitrary depth?" |
| Priority-revealing | "If we discover the paper's approach has a known failure mode, do you want alternatives or implement it anyway?" |
| Scope boundary | "Should the new quantizer support the existing stickiness bias, or is this a chance to remove it?" |
| Disjunctive | "Is the loss exploding (going to NaN/inf) or collapsing (going to zero/constant)?" |
| Temporal | "Did this break after a specific commit, or has it always been this way?" |
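A disjunctive exploding-vs-collapsing question can often be answered mechanically from a loss log before it is even asked. A minimal sketch, assuming losses are available as a plain list of floats; the function name, window size, and threshold are illustrative, not part of Propel:

```python
import math

def classify_loss(history, window=50, collapse_std=1e-6):
    """Classify a loss trajectory as 'exploding', 'collapsing', or 'neither'.

    history: list of float loss values, oldest first.
    window and collapse_std are illustrative defaults.
    """
    if not history:
        return "neither"
    recent = history[-window:]
    # Exploding: NaN or inf has appeared in the recent losses.
    if any(math.isnan(x) or math.isinf(x) for x in recent):
        return "exploding"
    # Collapsing: the loss has flatlined near a constant (often zero).
    mean = sum(recent) / len(recent)
    var = sum((x - mean) ** 2 for x in recent) / len(recent)
    if var ** 0.5 < collapse_std:
        return "collapsing"
    return "neither"
```

A "neither" result is itself evidence: it tells the human the failure mode is not one of the two obvious branches.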

Handling "Just Do It"

If the user says "just do it," Claude pushes back once: "I want to make sure I build what you actually need. These questions will save us a /clear cycle later." If the user insists, proceed with the most conservative interpretation and document assumptions explicitly.

Output

A confirmed one-paragraph scope statement that becomes the reference for all subsequent phases. This scope statement is recorded in the investigation README.

Questioner Q0: Grounding Before Investigation

When: After Gate 0 confirms scope, before investigation begins.
Active in: Researcher, Engineer

Why Q0 Exists

Claude is great at looking at one thing and morphing it into what you want, but bad at creating from scratch when the problem is unconstrained. Vague requests lead to plausible-looking but wrong implementations — Claude fills gaps with training-data averages instead of asking. Q0 forces the user to provide concrete reference points before investigation.

The 5 Grounding Questions

Asked one at a time, skipping any already answered during Gate 0:

| # | Category | Question |
|---|----------|----------|
| 1 | Reference implementation | "Is there an existing codebase or repo I should use as a starting point? If so, which files or modules are most relevant?" |
| 2 | Architecture pattern | "Is there a known architecture or design pattern you want to follow? (e.g., a specific paper's approach, an existing task implementation, a framework's canonical example)" |
| 3 | Example to study | "Is there a similar task/feature already implemented in this project or another project that I should study and adapt from?" |
| 4 | Benchmark / ground truth | "What benchmark, test case, or expected behavior should I use to verify correctness? Is there a known-good output I can compare against?" |
| 5 | Framework constraints | "Are there specific APIs, libraries, or framework conventions I must use? (e.g., 'use robosuite's SingleArmEnv base class', 'follow the MuJoCo MJX wrapper pattern')" |

What Happens After Q0

Example

User:  I want to build a can transport task with robosuite.
[Gate 0 confirms scope: single-arm can transport, dense reward, sim only]

Claude (Q0-1): Is there an existing task in robosuite I should use as a
               starting point? For example, PickPlace or Lift?
User:  Yeah, look at PickPlace in robosuite/environments/manipulation/

Claude (Q0-2): Got it. Is there a specific architecture pattern — should this
               follow the same reward structure as PickPlace, or do you have
               a different reward design in mind?
User:  Same structure, but add a transport phase between pick and place.

Claude (Q0-3): For verification — is there a known trajectory or benchmark
               I should test against, or should I use the PickPlace test suite
               as a baseline?
User:  Use PickPlace tests as baseline, plus a manual trajectory I'll provide.

[Claude records all three reference points in scratch/README.md]

Gate 1: Post-Investigation

When: Investigation is complete.
Active in: Researcher, Engineer, Debugger

Gate 1 is where Claude presents what it found during investigation and the human makes design decisions. Claude does not proceed to design or implementation without explicit approval of the findings.

Mandatory Presentation Format

## Investigation Summary

### What I Found
- [3-5 bullet findings, concrete and specific]

### Surprises / Risks
- [Unexpected complications discovered during investigation]

### Open Questions — I Need Your Input
1. [Trade-off question — disjunctive, not yes/no]
2. [Architecture choice]
3. [Priority question]

Findings Must Be Concrete

"The codebase is complex" is not a finding. "The loss function in model/vq.py:89-124 computes commitment loss across all depth levels without per-level scaling" is a finding. Every bullet must reference specific files, line numbers, or concrete observations.
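The concreteness rule lends itself to a mechanical check: a finding bullet that cites no file and line span is suspect. A hypothetical lint sketch; the `path.py:89-124` citation pattern is an assumption about how findings are written, not a Propel requirement:

```python
import re

# Matches location references like model/vq.py:89 or model/vq.py:89-124.
LOCATION = re.compile(r"\S+\.\w+:\d+(?:-\d+)?")

def is_concrete(finding: str) -> bool:
    """A finding counts as concrete if it names at least one file:line span."""
    return bool(LOCATION.search(finding))
```

This catches only the file-reference half of the rule; "concrete observations" without a location still need human judgment.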

Output

The human's design decisions are recorded in the investigation README under "Design Decisions (Human-Approved)". These decisions are binding for all subsequent phases.

Questioner Q1: Before Design

When: After Gate 1 confirms investigation findings, before design begins.
Active in: Researcher, Engineer

Why Q1 Exists

After investigation, Claude has findings and the user has made high-level design decisions. But there is a dangerous gap: Claude may still fill in critical implementation details with plausible defaults instead of asking. Q1 catches these before design begins, turning vague intent into binding constraints.

The 6 Detail Questions

Asked one at a time, only the questions relevant to what was discovered during investigation:

| # | Category | Question |
|---|----------|----------|
| 1 | Interface contracts | "Based on the investigation, here are the key interfaces I found: [list]. Should the new code match these exactly, or are we changing the API?" |
| 2 | Data formats | "The reference uses [format/shape/convention]. Should we follow the same convention, or does your use case need something different?" |
| 3 | Configuration | "Should this be configurable via [config file / constructor args / env vars]? What should the defaults be?" |
| 4 | Edge cases | "During investigation I found these edge cases: [list]. How should each be handled — error, fallback, or ignore?" |
| 5 | Integration points | "This will need to connect to [existing modules]. Should I modify those modules or create adapters?" |
| 6 | Scope confirmation | "Based on what I found, the minimal implementation needs [X, Y, Z]. Should I also include [A, B] which I noticed in the reference, or keep it minimal?" |

What Happens After Q1

Example

User:  [After Gate 1, approved investigation findings for can transport task]

Claude (Q1-1): PickPlace uses a 4-phase reward: reach → grasp → lift → place.
               For can transport, I'd add a "transport" phase between lift and
               place. Should I keep the same reward scale (0-1 per phase) or
               re-weight for 5 phases?
User:  Keep 0-1 per phase, so total reward range becomes 0-5 instead of 0-4.

Claude (Q1-2): PickPlace accepts a single placement target. Transport implies
               a waypoint. Should the waypoint be a constructor parameter, or
               computed from start/end positions?
User:  Constructor parameter. I want to be able to set it per episode.

Claude (Q1-3): The PickPlace test suite checks gripper-to-object distance.
               For transport, should I also check object-to-waypoint distance
               during the transport phase?
User:  Yes, that's the key metric for the new phase.

[Claude records all decisions and proceeds to design with concrete constraints]

Gate 2: Post-Design

When: Design and implementation plan are ready.
Active in: Engineer

Gate 2 presents the full implementation plan for human review. The human can approve, modify, or reject parts of the plan. Implementation does not begin until the human says "go."

Mandatory Presentation Format

## Implementation Plan Summary

### Components (in order)
1. [Component name] — maps to paper eq. [X] — touches files [Y, Z]
2. [Component name] — maps to paper eq. [X] — touches files [Y, Z]

### Paper-to-Code Mapping
| Paper Reference | Code Location | Status |
|----------------|--------------|--------|
| Eq. 3 (VQ loss) | model/vq.py:compute_loss() | New |
| Algorithm 1 (codebook update) | model/vq.py:update_codebook() | Modified |

### Regression Risk Assessment
- [What could break and how we're protecting it]

### What I'm Uncertain About
1. [Uncertainty + two options + trade-offs]
2. [Uncertainty + two options + trade-offs]

### Estimated Scope
- [N components, M files modified, K new files]

Component Order Matters

Components must be listed in implementation order. Dependencies flow downward — component 2 can depend on component 1, but not the reverse. If the human wants to change the order, Claude must re-check dependency chains.
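The downward-only dependency rule can be re-checked mechanically whenever the order changes. A sketch, assuming the plan is available as (name, dependencies) pairs; the representation is illustrative:

```python
def check_order(components):
    """Verify each component depends only on components listed earlier.

    components: list of (name, deps) pairs in planned implementation order.
    Returns a list of violation messages; an empty list means the order is valid.
    """
    seen, violations = set(), []
    for name, deps in components:
        for dep in deps:
            if dep not in seen:
                violations.append(
                    f"{name} depends on {dep}, which comes later or is missing"
                )
        seen.add(name)
    return violations
```

Running this after a human reorders the plan is exactly the "re-check dependency chains" step described above.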

Key Elements

| Element | Purpose |
|---------|---------|
| Components list | What will be built, in what order, with paper equation mappings |
| Paper-to-code mapping | Explicit link between paper equations/algorithms and code locations |
| Regression risk | What existing functionality could break and how it will be protected |
| Uncertainties | Open questions with concrete options and trade-offs for the human to decide |
| Estimated scope | Concrete count of components, files, and changes to set expectations |

Gate 3: Mid-Implementation

When: After each component is implemented.
Active in: Engineer

Gate 3 is a per-component checkpoint during implementation. After Claude implements each component, auditors run automatically and results are presented to the human. This is the most frequently triggered gate — it fires once per component.

Auditor Review Format

Each component goes through three stages before Gate 3:

  1. Implement — Subagent implements the specific task from the approved plan.
  2. Review — Spec compliance (does it match the plan?) and paper alignment (does it match the source equations?).
  3. Audit — Parallel auditors check for paper mismatches, JAX bugs, silent failures, and regressions.

Clean Pass Format

"All auditors passed for component 3. Spec ✓, Paper ✓, JAX ✓, Silent bugs ✓, Regression ✓. Moving to component 4?"

Issues Found Format

## Component 3 Audit Results

### Issues Found
- ✗ Silent Bug Detector: commitment loss applied to combined residual, not per-depth
  - Paper specifies per-depth (Section 3.2, Eq. 5)
  - Severity: Critical

### Question
Should I fix this to match the paper (per-depth commitment), or was the
combined approach an intentional simplification?

Context Management at Gate 3

Every 3 components, Claude offers a pause point for /clear. Long implementation sessions accumulate context rapidly. The living README in scratch/ ensures no findings are lost across context boundaries.

Auditor Dispatch Table

| What Changed | Auditors That Run |
|--------------|-------------------|
| Paper-derived component | paper-alignment-auditor |
| JAX code (scan, vmap, pmap, jit) | jax-logic-auditor |
| Model / loss / data code | silent-bug-detector |
| Any code change | regression-guard |
| Environment interaction code | env-researcher |
| Deep trace needed (explicit only) | data-flow-tracer |
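The dispatch logic in that table is simple enough to sketch directly. Change traits as a set of string flags are an assumption about how Propel represents a change; regression-guard is appended unconditionally because every change matches "any code change", and data-flow-tracer is omitted because it runs only on explicit request:

```python
def select_auditors(change):
    """Pick auditors for a change, mirroring the dispatch table above.

    change: set of traits, e.g. {"paper", "jax", "model", "env"}.
    """
    table = {
        "paper": "paper-alignment-auditor",
        "jax": "jax-logic-auditor",
        "model": "silent-bug-detector",
        "env": "env-researcher",
    }
    auditors = [name for trait, name in table.items() if trait in change]
    auditors.append("regression-guard")  # any code change triggers this
    return auditors
```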

Gate 4: Before Bug Fix

When: Root cause of a bug has been identified.
Active in: Engineer, Debugger, Trainer (runtime only)

Gate 4 is the mandatory checkpoint before any fix is applied. Claude must present its diagnosis in a structured format and get human approval. This prevents shotgun debugging, symptom patching, and displaced fixes.

The Core Rule

Diagnose before fixing. Never apply a fix without presenting the diagnosis first and getting human approval. No exceptions. If you don't understand the root cause well enough to fill in every field below, you haven't investigated enough.

Mandatory Diagnosis Format

Symptom:         What is observed (concrete, measurable)
Root Cause:      What is actually wrong (with evidence)
Why This Happens: The mechanism that produces the symptom from the root cause
Proposed Fix:    What to change and where (specific files and lines)
Side Effects:    What else this fix touches (other components, training dynamics)
What Won't Fix:  Approaches that look promising but won't work (and why)
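The "fill in every field" rule can be enforced with a simple structure. A sketch whose field names mirror the format above; the `Diagnosis` class itself is hypothetical, not part of Propel:

```python
from dataclasses import dataclass, fields

@dataclass
class Diagnosis:
    symptom: str           # concrete, measurable observation
    root_cause: str        # what is actually wrong, with evidence
    why_this_happens: str  # mechanism linking root cause to symptom
    proposed_fix: str      # specific files and lines to change
    side_effects: str      # what else the fix touches
    what_wont_fix: str     # rejected approaches and why each fails

    def is_complete(self) -> bool:
        """All six fields must be non-empty before the gate is presented."""
        return all(getattr(self, f.name).strip() for f in fields(self))
```

An incomplete diagnosis is the signal to keep investigating, not to present the gate with blanks.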

Field-by-Field Breakdown

| Field | Required Content | Anti-Pattern |
|-------|------------------|--------------|
| Symptom | Measurable observation: "NaN at step ~500 in commitment loss" | "Training doesn't work" (too vague) |
| Root Cause | Specific location + evidence: "RVQ commitment loss sums across depth levels without per-level scaling at model/rvq.py:89" | "Something is wrong with the loss" (no location) |
| Why This Happens | Causal chain: "VQ commitment loss assumes single codebook. RVQ adds levels but reuses same loss unchanged." | Restating the symptom |
| Proposed Fix | Concrete change: "Scale commitment loss by 1/num_levels per level" | "Fix the loss function" (not actionable) |
| Side Effects | Impact analysis: "Changes training dynamics. May need LR adjustment." | "No side effects" (almost never true) |
| What Won't Fix | Dead ends with reasons: "Gradient clipping masks symptom but doesn't fix imbalanced scaling" | Omitting this section |

The "What Won't Fix" Section

This is often the most valuable part of the diagnosis. It documents dead ends so neither the user nor a future session wastes time re-exploring them. Include every approach you considered and rejected, with the specific reason each one fails.

Bug Classification at Gate 4

Every issue must be classified before the diagnosis is presented. The classification determines the evidence standard:

| Classification | Evidence Required | What Happens Next |
|----------------|-------------------|-------------------|
| Code Bug | Specific line numbers, concrete values (expected vs. actual), traceback or reproduction steps | Fix is applied after Gate 4 approval |
| Design Issue | Literature backing mandatory — papers, known failure modes, established best practices that show why the current design is wrong | Actual redesign is redirected to Engineer Mode (/switch engineer) for the full pipeline |
| Environment / Config | Specific config values, version mismatches, environment state | Config change applied after Gate 4 approval |

Design Issues Need Literature

"I think the architecture is wrong" is not sufficient for a design issue classification. You must cite why it is wrong with references to papers, documented failure modes, or established best practices. If you cannot find literature backing, say so explicitly: "I believe this may be a design issue, but I haven't found literature confirming this."

The 3-Strike Limit

If the same debugging approach fails three times, the assumptions are wrong. This rule is absolute:

| Strike | Action |
|--------|--------|
| Strike 1 | Apply fix based on diagnosis. If it doesn't resolve the symptom, gather more evidence. |
| Strike 2 | Revised fix based on new evidence. If still failing, question whether the root cause hypothesis is correct. |
| Strike 3 | Stop. Present what was learned from all three attempts. Ask: "My assumption about the root cause may be wrong. Here's what I've learned. Should I investigate a fundamentally different direction?" |

Never Ignore the 3-Strike Limit

The temptation is to try "one more variation." This is almost always wrong. Three failed attempts of the same approach is strong evidence that the mental model is incorrect, not that the fix needs tweaking.
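The strike bookkeeping itself is trivial, which is part of why it works: the hard part is honoring the stop, not counting. A sketch that keys failures by approach so that "one more variation" of the same approach still increments the same counter; the keying scheme and return strings are illustrative:

```python
class StrikeTracker:
    """Count failed fixes per debugging approach and force a stop at three."""

    LIMIT = 3

    def __init__(self):
        self.failures = {}

    def record_failure(self, approach: str) -> str:
        """Record one failed fix attempt and return the mandated next action."""
        self.failures[approach] = self.failures.get(approach, 0) + 1
        n = self.failures[approach]
        if n >= self.LIMIT:
            return "stop: question the root-cause hypothesis"
        return "gather more evidence" if n == 1 else "question the hypothesis"
```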

Quick Reference

| Checkpoint | Key Output | Must Include |
|------------|------------|--------------|
| Gate 0 | Scope statement | 3-5 disjunctive questions, confirmed scope paragraph |
| Q0 | Reference sources | Reference impl, architecture pattern, example, benchmark, framework constraints |
| Gate 1 | Approved findings | 3-5 bullet findings, surprises/risks, open questions |
| Q1 | Binding constraints | Interface contracts, data formats, config, edge cases, integration points, scope |
| Gate 2 | Approved plan | Components, paper-to-code mapping, regression risk, uncertainties, scope estimate |
| Gate 3 | Audit results | Per-component: spec compliance, paper alignment, auditor findings |
| Gate 4 | Approved diagnosis | Symptom, root cause, why, fix, side effects, what won't fix |