Propel

A structured constraint framework for Claude Code in research workflows — turning unconstrained LLM output into precise, paper-aligned research code.

Kaiwen Bian · Yuer Tang

Under Active Development. Propel currently ships as a Claude Code plugin, but the pipeline and logic are designed to be agent-agnostic — the structured constraint framework generalizes to any code-generating research agent.

Overview

Claude Code can be a powerful boost to research workflows — if used correctly and carefully. Without structure, an unconstrained LLM produces the mean of its training data: ask it to "implement RVQ" (residual vector quantization) and you get a plausible-looking average of every RVQ implementation it has seen, not the one that matches your paper, your architecture, your constraints. The output compiles, but it's noisy: wrong assumptions baked in, silent numerical bugs, design decisions made without asking.

The fix isn't better prompts — it's structured constraints. When we constrain an LLM with domain-specific rules, verification gates, and forced checkpoints, the output goes from "plausible average" to precisely what we need. Propel applies this to research workflows where the cost of undetected noise is highest — a silent broadcasting bug in a loss function doesn't crash, it produces subtly wrong training runs that waste compute.
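To make the broadcasting failure mode concrete, here is a minimal NumPy sketch (shapes and names are illustrative): a loss computed between a `(N,)` prediction vector and a `(N, 1)` target column does not raise an error — broadcasting silently expands it to an `(N, N)` matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
preds = rng.normal(size=(8,))      # model output, shape (8,)
targets = rng.normal(size=(8, 1))  # e.g. a dataloader that keeps a trailing dim

# (8,) - (8, 1) broadcasts to (8, 8): no crash, no warning —
# the "loss" quietly averages 64 pairwise values instead of 8.
buggy = np.mean((preds - targets) ** 2)

# The fix: make shapes explicit before subtracting.
correct = np.mean((preds - targets.squeeze(-1)) ** 2)
```

Both versions run and both return a finite scalar; only the second one trains the model you intended.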

Core Principles

Before the skills, gates, and agents — Propel is built on five non-negotiable principles that are injected into every session. These define the mindset, not the process.

Assistant, Not Agent

Claude gathers information, analyzes code, does deep research, and makes logical arguments based on evidence. It presents assessments for the user to decide on — it does not guess when it can investigate, or assume when it can ask.

  • Evidence standard: Every claim traceable to a line of code, a paper, or a concrete observation
  • Context discipline: Hallucination risk increases as context grows — proactively suggest clearing before reasoning degrades, preserving state in living READMEs first
  • Break logic loops: When stuck in circular reasoning, name it, reframe the question, or bring new data instead of pushing through

Critical Self-Reflection

Claude questions its own reasoning as critically as the user's. It watches for premature convergence, confirmation bias, and complexity bias. When it catches itself in a reasoning error, it says so out loud.

  • Retrospection: "What evidence would change my mind? Am I recommending this because the evidence supports it, or because it's the first thing I thought of?"
  • Honesty rule: If unsure about something said earlier, re-verify against written artifacts rather than trusting "memory"
  • 3-strike limit: If the same approach fails 3 times, stop, re-examine assumptions, and change direction

Anti-Sycophancy

Claude does not try to please the user. It resists leading questions, confirmation-seeking, and emotional framing. When it notices itself about to agree because agreement is expected, it stops and steel-mans the counterargument first.

  • Resist anchoring: When the user steers toward a predetermined conclusion, check the evidence independently before responding
  • Challenge false transfer: Success in one context does not guarantee success in another — verify that assumptions actually hold in the current setting
  • Ignore sunk cost: Time already invested is not evidence that an approach will work — present the data honestly regardless of prior effort

Vision: Beyond Claude Code

While Propel is currently implemented as a Claude Code plugin, its core design — structured gates, investigation-first methodology, domain-specific auditors, and living documentation — is not tied to any particular agent platform. The pipeline encodes how researchers should interact with code-generating agents, regardless of the underlying model or tool.

The same constraint framework that prevents Claude Code from producing noisy averages can prevent any research code agent from doing the same. Propel's gates, questioners, and auditors represent a general protocol for human-AI collaboration in research engineering.

The goal is a portable specification: structured intake, grounding questioners, investigation scaffolds, paper-alignment auditing, and experiment registries that work across agent systems — Claude Code, Cursor, Copilot Workspace, Devin, or whatever comes next.

The Pipeline

Propel enforces five human-in-the-loop gates, two questioner checkpoints, and automatic auditor dispatch after every code change. The pipeline ensures investigation happens before implementation, design decisions are explicit, and silent bugs are caught before they reach training.

Figure 1. The Propel pipeline — seven stages, five human-in-the-loop gates (G0–G4), and two questioner checkpoints (Q0, Q1): Intake → Q0 → Investigation → Gate 1 → Q1 → Design → Implementation → Debug → Training → Retrospective.

At each gate, the agent stops and asks structured questions that reveal design assumptions — never "shall I proceed?" but "should we [A] or [B]? A means [trade-off], B means [trade-off]." The questioners (Q0, Q1) ground the work in concrete reference implementations and specific implementation details before the agent acts.
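The "should we [A] or [B]?" shape can be sketched as a small data structure. This is a hypothetical illustration of the question format, not Propel's internal representation:

```python
from dataclasses import dataclass

@dataclass
class GateQuestion:
    """A disjunctive, assumption-exposing gate question (illustrative sketch)."""
    question: str
    option_a: str
    trade_off_a: str
    option_b: str
    trade_off_b: str

    def render(self) -> str:
        # Force the agent to present explicit alternatives with trade-offs,
        # never a bare "shall I proceed?"
        return (
            f"{self.question}\n"
            f"  [A] {self.option_a} — {self.trade_off_a}\n"
            f"  [B] {self.option_b} — {self.trade_off_b}"
        )

q = GateQuestion(
    question="Should the codebook be updated by EMA or by gradient descent?",
    option_a="EMA updates",
    trade_off_a="more stable, but adds non-differentiable state",
    option_b="gradient updates",
    trade_off_b="plain autodiff, but sensitive to learning rate",
)
print(q.render())
```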

Four Modes

Not every session needs the full pipeline. Choose a mode at the start of each session to filter which skills and gates are active. Each mode is a standalone tool — use Researcher just for literature, Debugger just for diagnosis, or Trainer just for launching runs.

Researcher

Gates 0 & 1

Literature, investigation, deep research. Understanding the problem space before building anything.

Engineer

All Gates (0–4)

Full pipeline. Investigation through implementation with all auditors. The default mode.

Debugger

Gates 0, 1 & 4

Deep root-cause analysis. Classifies code bugs vs. design issues with evidence. Literature-backed diagnosis.

Trainer

Gate 4 (runtime)

Code is ready. Launch training runs, monitor, fix CUDA/OOM/path errors.
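The mode-to-gate filtering above amounts to a simple lookup. A sketch (the dictionary below mirrors the table; it is illustrative, not Propel's actual configuration format):

```python
# Which gates each mode activates, per the mode descriptions above.
MODE_GATES = {
    "researcher": {0, 1},          # literature and investigation only
    "engineer":   {0, 1, 2, 3, 4}, # full pipeline — the default
    "debugger":   {0, 1, 4},       # characterize, investigate, diagnose
    "trainer":    {4},             # runtime-only
}

def gate_active(mode: str, gate: int) -> bool:
    """Return True if the given gate fires in the given mode."""
    return gate in MODE_GATES[mode]
```

Selecting a mode at session start is effectively choosing which of these checkpoints the agent is allowed to skip.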

See It In Action

Each mode shapes how the agent interacts with you. Watch simulated sessions for each mode.

Researcher Mode
Engineer Mode
Debugger Mode
Trainer Mode

When Would I Use This?

Three real-world starting points and how Propel guides you through each one.

1. “I have an idea and a reference codebase”

You know what you want to build. You have a reference implementation or paper to follow. You need to adapt it to your architecture and constraints.

Engineer Mode: G0 → Q0 → G1 → Q1 → G2 → G3 → G4
Gate 0
Scope it. Claude asks what exactly you're building, which parts of the reference to follow, and what to change. You provide the reference repo/paper.
Q0
Ground it. “Which file in the reference should I use as the starting point? What architecture pattern should I follow? What should I verify against?” — Claude asks for concrete anchors instead of guessing.
Investigation
Trace the reference. Claude reads the reference codebase, documents how it works in a scratch/ README, identifies what needs adapting vs. what can be reused directly.
Q1 → Design
Nail down details. Interface contracts, data formats, edge cases, config approach. Then a paper-to-code mapping with explicit component ordering and regression risk assessment.
Build → Audit
Implement with review. Each component gets a 3-stage review: implement → spec compliance → domain auditors (paper-alignment, silent-bug-detector, regression-guard). You approve at Gate 3 after each component.
What you get: Code that precisely matches your reference — not a plausible average. Every deviation is documented and human-approved.
2. “I inherited a codebase and need to understand it, then extend it”

Your teammate's project, a previous experiment, or an open-source repo. It generally works but you need to understand how before adding your features.

Researcher, then Engineer: G0 → G1 → /switch → G0 → Q0 → G1 → Q1 → G2 → G3
Researcher
Map the territory. Start in Researcher Mode. Claude traces the codebase end-to-end: entry points, data flow, key abstractions, configuration, and conventions. Everything goes into a scratch/ investigation README that persists across sessions.
Gate 1
Confirm understanding. Claude presents its findings: “Here's how the reward pipeline works, here are the 3 config files that control it, here's what surprised me.” You correct misunderstandings before anything gets built.
/switch engineer
Now build. Switch to Engineer Mode. The investigation README becomes the foundation. Q0 asks which existing patterns to follow. Q1 nails down how your new feature integrates with what's already there.
Design → Build
Extend with guardrails. Design respects existing conventions (from investigation). The regression-guard auditor verifies nothing breaks. Each new component is audited against the existing codebase, not just in isolation.
What you get: A documented understanding of the codebase that survives across sessions, plus new features that integrate cleanly without breaking existing functionality.
3. “Something is wrong and I need to find the real cause”

Training diverges, outputs look wrong, or a refactor broke something. You need to find the root cause — not just patch symptoms. Could be a code bug, a design flaw, or a config issue.

Debugger Mode: G0 → G1 → Classify → G4 → Fix
Gate 0
Characterize the symptom. Claude asks precise, disjunctive questions: “Is the loss exploding or collapsing? All configs or just one? After a specific commit or always?”
Investigation
Dispatch auditors. Based on the symptom, Claude dispatches relevant auditors — silent-bug-detector for numerical issues, paper-alignment-auditor for correctness, jax-logic-auditor for shape problems. Evidence is gathered, not guessed at.
Classify
Name the category. Every issue is classified as a code bug (concrete evidence: line numbers, values), design issue (requires literature backing), or config/environment problem (specific settings, version mismatches). The fix depends on the category.
Gate 4
Diagnose before fixing. Claude presents: Symptom → Root Cause → Why This Happens → Proposed Fix → Side Effects → What Won't Fix. You approve before any code changes. If it's a design issue, Claude searches the literature for validated alternatives and redirects to Engineer Mode for the redesign.
3-Strike
No infinite loops. If the same approach fails 3 times, Claude stops, re-examines assumptions, and asks whether the root cause hypothesis is wrong — instead of trying one more variation.
What you get: A proven root cause with evidence, not a symptom patch. Design issues get literature backing. Failed debugging approaches are documented so you don't repeat them.

Key Features

Human-in-the-Loop Gates

Five mandatory checkpoints that extract your specific insight before the agent acts. No rubber-stamping — every question is disjunctive and assumption-exposing.

Investigation-First

No implementation without a documented investigation in scratch/. Living READMEs preserve knowledge across sessions and /clear boundaries.

Paper-Alignment Auditing

Every paper-derived component is automatically cross-referenced against source equations. Catches mismatches between what the paper says and what the code does.

Silent Bug Detection

Active scanning for 11 categories of silent failures after every code change — broadcasting bugs, wrong reductions, detached gradients, and more.
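One of those categories, the wrong reduction, is easy to show in a few lines: using `sum` where `mean` is intended makes the loss scale with batch size, silently coupling the effective learning rate to the batch dimension (a generic NumPy sketch, not Propel code):

```python
import numpy as np

def mse_sum(pred, target):
    return np.sum((pred - target) ** 2)   # grows with batch size

def mse_mean(pred, target):
    return np.mean((pred - target) ** 2)  # invariant to batch size

# Doubling the batch doubles the summed loss (and its gradients),
# but leaves the mean loss unchanged. The sum version therefore
# changes training dynamics every time the batch size changes —
# without a single error or warning.
small = (np.ones(32), np.zeros(32))
large = (np.ones(64), np.zeros(64))
```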

Questioner Checkpoints

Q0 grounds work in concrete references before investigation. Q1 nails down implementation details before design. Prevents the agent from filling gaps with training-data averages.

Experiment Registry

Retrospectives capture what worked, what failed, and why. The failed attempts table is more valuable than the working solution — prevents repeating mistakes.
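A failed-attempts entry can be as lightweight as a structured record. The field names and values below are purely illustrative — Propel's actual registry schema may differ:

```python
# Hypothetical retrospective record for one failed debugging attempt.
failed_attempt = {
    "hypothesis": "loss divergence caused by too-short learning-rate warmup",
    "change_tried": "extended warmup from 1k to 10k steps",
    "outcome": "still diverged around step 12k",
    "evidence": "loss curves from the two runs",
    "why_it_failed": "root cause was an unnormalized reward, not the schedule",
}
```

The point is the `why_it_failed` field: recording the disproven hypothesis is what prevents the next session from trying it again.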

Skills

Propel includes 15+ skills organized by workflow phase. Each skill activates on specific triggers and guides the agent through a structured process.

| Category | Skill | Trigger |
|---|---|---|
| Meta | using-propel | Always active — routes to the correct skill |
| Literature | deep-research | "survey", "literature review", "compare methods" |
| Literature | paper-extraction | "process these papers", "build paper database" |
| Investigation | investigation | "start investigation", "trace X", "what touches X" |
| Design | research-design | "propose how to", "design the implementation" |
| Design | writing-plans | "write the plan", "break into tasks" |
| Implementation | subagent-driven-research | User says "go" after plan approval |
| Validation | research-validation | "validate this", "test the implementation" |
| Validation | verification-before-completion | Before claiming "done" |
| Debugging | systematic-debugging | Bug reports, training failures |
| Learning | retrospective | "retrospective", auto-suggests at ~20 turns |
| Cross-cutting | think-deeply | Confirmation-seeking statements |
| Cross-cutting | context-hygiene | >15 turns, "getting long" |
| Training | trainer-mode | "train", "launch training" (Trainer Mode) |
| Customization | project-customization | "customize Propel", "detect conventions" |

Agents (Auditors)

Domain-specific auditors auto-dispatch after code changes. They run as subagents with separate context windows and read-only access.

| Agent | Purpose | Auto? |
|---|---|---|
| paper-alignment-auditor | Cross-reference code against paper equations | Yes |
| jax-logic-auditor | Trace shapes through JAX transforms | Yes |
| silent-bug-detector | Scan for 11 silent failure categories | Yes |
| data-flow-tracer | End-to-end tensor annotation | No |
| regression-guard | Verify existing configs unchanged | Yes |
| env-researcher | Deep-dive simulation env docs | Yes |
| failure-mode-researcher | Internet search for training failures | No |
| code-reviewer | General code quality with research awareness | No |

Get Started

git clone https://github.com/KevinBian107/propel.git
cd propel && pip install -e .

cd /path/to/your/project
propel init

Then start Claude and run /intro to select a mode and set up your project. See the full Getting Started guide for a complete walkthrough.

What Only You Can Provide

Propel's constraints are necessary but not sufficient. The framework forces the agent to stop and ask structured questions at every phase transition, but the quality of the output depends entirely on what you bring to those checkpoints:

| Your Input | Why It Matters |
|---|---|
| Research question | Not "implement X" but "test whether X improves Y under condition Z." The more specific, the less the agent guesses. |
| Hypothesis | What do you expect and why? This is what auditors verify against. |
| Method | Which paper, which equations, which algorithmic choices. The agent cannot infer "use stop-gradient on the codebook as in Section 3.2." |
| Domain knowledge | Pitfalls not in any paper, configs that silently fail, things that only work in your setup. |
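As a concrete illustration of the method row: stop-gradient on the codebook is the standard VQ-VAE loss construction. A generic JAX sketch (variable names are illustrative; this is the textbook pattern, not code from any specific Propel project):

```python
import jax.numpy as jnp
from jax import lax

def vq_losses(z_e, z_q, beta=0.25):
    """VQ-VAE codebook + commitment losses with explicit stop-gradients.

    z_e: encoder outputs; z_q: selected codebook vectors (same shape).
    """
    # Codebook term: pulls codebook vectors toward (frozen) encoder outputs.
    codebook = jnp.mean((lax.stop_gradient(z_e) - z_q) ** 2)
    # Commitment term: pulls encoder outputs toward (frozen) codebook vectors.
    commitment = beta * jnp.mean((z_e - lax.stop_gradient(z_q)) ** 2)
    return codebook + commitment
```

Which tensor gets the stop-gradient, and the value of `beta`, are exactly the kind of paper-specific choices the agent cannot safely guess — they have to come from you.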

Acknowledgments

Propel combines ideas from multiple sources: