Propel

A structured constraint framework for Claude Code in research workflows — turning unconstrained LLM output into precise, paper-aligned research code.

Kaiwen Bian · Yuer Tang

Under Active Development. Propel currently ships as a Claude Code plugin, but the pipeline and logic are designed to be agent-agnostic — the structured constraint framework generalizes to any code-generating research agent.

Overview

Claude Code can be a powerful boost to research workflows — if used correctly and carefully. Without structure, an unconstrained LLM produces the mean of its training data: ask it to "implement RVQ" (residual vector quantization) and you get a plausible-looking average of every RVQ implementation it has seen, not the one that matches your paper, your architecture, your constraints. The output compiles, but it's noisy: wrong assumptions baked in, silent numerical bugs, design decisions made without asking.

The fix isn't better prompts — it's structured constraints. When we constrain an LLM with domain-specific rules, verification gates, and forced checkpoints, the output goes from "plausible average" to precisely what we need. Propel applies this to research workflows where the cost of undetected noise is highest — a silent broadcasting bug in a loss function doesn't crash, it produces subtly wrong training runs that waste compute.
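To make the broadcasting failure mode concrete, here is a minimal NumPy sketch (shapes and names are illustrative): a loss computed between a `(N,)` prediction vector and a `(N, 1)` target column does not raise an error — broadcasting silently expands it to an `(N, N)` matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
preds = rng.normal(size=(8,))      # model output, shape (8,)
targets = rng.normal(size=(8, 1))  # e.g. a dataloader that keeps a trailing dim

# (8,) - (8, 1) broadcasts to (8, 8): no crash, no warning —
# the "loss" quietly averages 64 pairwise values instead of 8.
buggy = np.mean((preds - targets) ** 2)

# The fix: make shapes explicit before subtracting.
correct = np.mean((preds - targets.squeeze(-1)) ** 2)
```

Both versions run and both return a finite scalar; only the second one trains the model you intended.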

Core Principles

Before the skills, gates, and agents — Propel is built on five non-negotiable principles that are injected into every session. These define the mindset, not the process.

Assistant, Not Agent

Claude gathers information, analyzes code, does deep research, and makes logical arguments based on evidence. It presents assessments for the user to decide on — it does not guess when it can investigate, or assume when it can ask.

  • Evidence standard: Every claim traceable to a line of code, a paper, or a concrete observation
  • Context discipline: Hallucination risk increases as context grows — proactively suggest clearing before reasoning degrades, preserving state in living READMEs first
  • Break logic loops: When stuck in circular reasoning, name it, reframe the question, or bring new data instead of pushing through

Critical Self-Reflection

Claude questions its own reasoning as critically as the user's. It watches for premature convergence, confirmation bias, and complexity bias. When it catches itself in a reasoning error, it says so out loud.

  • Retrospection: "What evidence would change my mind? Am I recommending this because the evidence supports it, or because it's the first thing I thought of?"
  • Honesty rule: If unsure about something said earlier, re-verify against written artifacts rather than trusting "memory"
  • 3-strike limit: If the same approach fails 3 times, stop, re-examine assumptions, and change direction

Anti-Sycophancy

Claude does not try to please the user. It resists leading questions, confirmation-seeking, and emotional framing. When it notices itself about to agree because agreement is expected, it stops and steel-mans the counterargument first.

  • Resist anchoring: When the user steers toward a predetermined conclusion, check the evidence independently before responding
  • Challenge false transfer: Success in one context does not guarantee success in another — verify that assumptions actually hold in the current setting
  • Ignore sunk cost: Time already invested is not evidence that an approach will work — present the data honestly regardless of prior effort

Vision: Beyond Claude Code

While Propel is currently implemented as a Claude Code plugin, its core design — structured gates, investigation-first methodology, domain-specific auditors, and living documentation — is not tied to any particular agent platform. The pipeline encodes how researchers should interact with code-generating agents, regardless of the underlying model or tool.

The same constraint framework that prevents Claude Code from producing noisy averages can prevent any research code agent from doing the same. Propel's gates, questioners, and auditors represent a general protocol for human-AI collaboration in research engineering.

The goal is a portable specification: structured intake, grounding questioners, investigation scaffolds, paper-alignment auditing, and experiment registries that work across agent systems — Claude Code, Cursor, Copilot Workspace, Devin, or whatever comes next.

The Pipeline

Propel enforces five human-in-the-loop gates, two questioner checkpoints, and automatic auditor dispatch after every code change. The pipeline ensures investigation happens before implementation, design decisions are explicit, and silent bugs are caught before they reach training.

Figure 1. The Propel pipeline — seven stages, five human-in-the-loop gates (G0–G4), and two questioner checkpoints (Q0, Q1): Intake → Q0 → Investigation → Gate 1 → Q1 → Design → Implementation → Debug → Training → Retrospective.

At each gate, the agent stops and asks structured questions that reveal design assumptions — never "shall I proceed?" but "should we [A] or [B]? A means [trade-off], B means [trade-off]." The questioners (Q0, Q1) ground the work in concrete reference implementations and specific implementation details before the agent acts.
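The "should we [A] or [B]?" shape can be sketched as a small data structure. This is a hypothetical illustration of the question format, not Propel's internal representation:

```python
from dataclasses import dataclass

@dataclass
class GateQuestion:
    """A disjunctive, assumption-exposing gate question (illustrative sketch)."""
    question: str
    option_a: str
    trade_off_a: str
    option_b: str
    trade_off_b: str

    def render(self) -> str:
        # Force the agent to present explicit alternatives with trade-offs,
        # never a bare "shall I proceed?"
        return (
            f"{self.question}\n"
            f"  [A] {self.option_a} — {self.trade_off_a}\n"
            f"  [B] {self.option_b} — {self.trade_off_b}"
        )

q = GateQuestion(
    question="Should the codebook be updated by EMA or by gradient descent?",
    option_a="EMA updates",
    trade_off_a="more stable, but adds non-differentiable state",
    option_b="gradient updates",
    trade_off_b="plain autodiff, but sensitive to learning rate",
)
print(q.render())
```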

Four Modes

Not every session needs the full pipeline. Choose a mode at the start of each session to filter which skills and gates are active. Each mode is a standalone tool — use Researcher just for literature, Debugger just for diagnosis, or Trainer just for launching runs.

Researcher

Gates 0 & 1

Literature, investigation, deep research. Understanding the problem space before building anything.

Engineer

All Gates (0–4)

Full pipeline. Investigation through implementation with all auditors. The default mode.

Debugger

Gates 0, 1 & 4

Deep root-cause analysis. Classifies code bugs vs. design issues with evidence. Literature-backed diagnosis.

Trainer

Gate 4 (runtime)

Code is ready. Launch training runs, monitor, fix CUDA/OOM/path errors.
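The mode-to-gate filtering above amounts to a simple lookup. A sketch (the dictionary below mirrors the table; it is illustrative, not Propel's actual configuration format):

```python
# Which gates each mode activates, per the mode descriptions above.
MODE_GATES = {
    "researcher": {0, 1},          # literature and investigation only
    "engineer":   {0, 1, 2, 3, 4}, # full pipeline — the default
    "debugger":   {0, 1, 4},       # characterize, investigate, diagnose
    "trainer":    {4},             # runtime-only
}

def gate_active(mode: str, gate: int) -> bool:
    """Return True if the given gate fires in the given mode."""
    return gate in MODE_GATES[mode]
```

Selecting a mode at session start is effectively choosing which of these checkpoints the agent is allowed to skip.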

See It In Action

Each mode shapes how the agent interacts with you. Watch simulated sessions for each mode.

Researcher Mode
Engineer Mode
Debugger Mode
Trainer Mode

When Would I Use This?

Three real-world starting points and how Propel guides you through each one.

1. “I have an idea and a reference codebase”

You know what you want to build. You have a reference implementation or paper to follow. You need to adapt it to your architecture and constraints.

Engineer Mode: G0 → Q0 → G1 → Q1 → G2 → G3 → G4
Gate 0
Scope it. Claude asks what exactly you're building, which parts of the reference to follow, and what to change. You provide the reference repo/paper.
Q0
Ground it. “Which file in the reference should I use as the starting point? What architecture pattern should I follow? What should I verify against?” — Claude asks for concrete anchors instead of guessing.
Investigation
Trace the reference. Claude reads the reference codebase, documents how it works in a scratch/ README, identifies what needs adapting vs. what can be reused directly.
Q1 → Design
Nail down details. Interface contracts, data formats, edge cases, config approach. Then a paper-to-code mapping with explicit component ordering and regression risk assessment.
Build → Audit
Implement with review. Each component gets a 3-stage review: implement → spec compliance → domain auditors (paper-alignment, silent-bug-detector, regression-guard). You approve at Gate 3 after each component.
What you get: Code that precisely matches your reference — not a plausible average. Every deviation is documented and human-approved.
2. “I inherited a codebase and need to understand it, then extend it”

Your teammate's project, a previous experiment, or an open-source repo. It generally works but you need to understand how before adding your features.

Researcher, then Engineer: G0 → G1 → /switch → G0 → Q0 → G1 → Q1 → G2 → G3
Researcher
Map the territory. Start in Researcher Mode. Claude traces the codebase end-to-end: entry points, data flow, key abstractions, configuration, and conventions. Everything goes into a scratch/ investigation README that persists across sessions.
Gate 1
Confirm understanding. Claude presents its findings: “Here's how the reward pipeline works, here are the 3 config files that control it, here's what surprised me.” You correct misunderstandings before anything gets built.
/switch engineer
Now build. Switch to Engineer Mode. The investigation README becomes the foundation. Q0 asks which existing patterns to follow. Q1 nails down how your new feature integrates with what's already there.
Design → Build
Extend with guardrails. Design respects existing conventions (from investigation). The regression-guard auditor verifies nothing breaks. Each new component is audited against the existing codebase, not just in isolation.
What you get: A documented understanding of the codebase that survives across sessions, plus new features that integrate cleanly without breaking existing functionality.
3. “Something is wrong and I need to find the real cause”

Training diverges, outputs look wrong, or a refactor broke something. You need to find the root cause — not just patch symptoms. Could be a code bug, a design flaw, or a config issue.

Debugger Mode: G0 → G1 → Classify → G4 → Fix
Gate 0
Characterize the symptom. Claude asks precise, disjunctive questions: “Is the loss exploding or collapsing? All configs or just one? After a specific commit or always?”
Investigation
Dispatch auditors. Based on the symptom, Claude dispatches relevant auditors — silent-bug-detector for numerical issues, paper-alignment-auditor for correctness, jax-logic-auditor for shape problems. Evidence is gathered, not guessed at.
Classify
Name the category. Every issue is classified as a code bug (concrete evidence: line numbers, values), design issue (requires literature backing), or config/environment problem (specific settings, version mismatches). The fix depends on the category.
Gate 4
Diagnose before fixing. Claude presents: Symptom → Root Cause → Why This Happens → Proposed Fix → Side Effects → What Won't Fix. You approve before any code changes. If it's a design issue, Claude searches the literature for validated alternatives and redirects to Engineer Mode for the redesign.
3-Strike
No infinite loops. If the same approach fails 3 times, Claude stops, re-examines assumptions, and asks whether the root cause hypothesis is wrong — instead of trying one more variation.
What you get: A proven root cause with evidence, not a symptom patch. Design issues get literature backing. Failed debugging approaches are documented so you don't repeat them.

Key Features

Human-in-the-Loop Gates

Five mandatory checkpoints that extract your specific insight before the agent acts. No rubber-stamping — every question is disjunctive and assumption-exposing.

Investigation-First

No implementation without a documented investigation in scratch/. Living READMEs preserve knowledge across sessions and /clear boundaries.

Paper-Alignment Auditing

Every paper-derived component is automatically cross-referenced against source equations. Catches mismatches between what the paper says and what the code does.

Silent Bug Detection

Active scanning for 11 categories of silent failures after every code change — broadcasting bugs, wrong reductions, detached gradients, and more.
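One of those categories, the wrong reduction, is easy to show in a few lines: using `sum` where `mean` is intended makes the loss scale with batch size, silently coupling the effective learning rate to the batch dimension (a generic NumPy sketch, not Propel code):

```python
import numpy as np

def mse_sum(pred, target):
    return np.sum((pred - target) ** 2)   # grows with batch size

def mse_mean(pred, target):
    return np.mean((pred - target) ** 2)  # invariant to batch size

# Doubling the batch doubles the summed loss (and its gradients),
# but leaves the mean loss unchanged. The sum version therefore
# changes training dynamics every time the batch size changes —
# without a single error or warning.
small = (np.ones(32), np.zeros(32))
large = (np.ones(64), np.zeros(64))
```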

Questioner Checkpoints

Q0 grounds work in concrete references before investigation. Q1 nails down implementation details before design. Prevents the agent from filling gaps with training-data averages.

Experiment Registry

Retrospectives capture what worked, what failed, and why. The failed attempts table is more valuable than the working solution — prevents repeating mistakes.
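A failed-attempts entry can be as lightweight as a structured record. The field names and values below are purely illustrative — Propel's actual registry schema may differ:

```python
# Hypothetical retrospective record for one failed debugging attempt.
failed_attempt = {
    "hypothesis": "loss divergence caused by too-short learning-rate warmup",
    "change_tried": "extended warmup from 1k to 10k steps",
    "outcome": "still diverged around step 12k",
    "evidence": "loss curves from the two runs",
    "why_it_failed": "root cause was an unnormalized reward, not the schedule",
}
```

The point is the `why_it_failed` field: recording the disproven hypothesis is what prevents the next session from trying it again.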

Skills

Propel includes 15+ skills organized by workflow phase. Each skill activates on specific triggers and guides the agent through a structured process.

| Category | Skill | Trigger |
|---|---|---|
| Meta | using-propel | Always active — routes to the correct skill |
| Literature | deep-research | "survey", "literature review", "compare methods" |
| Literature | paper-extraction | "process these papers", "build paper database" |
| Investigation | investigation | "start investigation", "trace X", "what touches X" |
| Design | research-design | "propose how to", "design the implementation" |
| Design | writing-plans | "write the plan", "break into tasks" |
| Implementation | subagent-driven-research | User says "go" after plan approval |
| Validation | research-validation | "validate this", "test the implementation" |
| Validation | verification-before-completion | Before claiming "done" |
| Debugging | systematic-debugging | Bug reports, training failures |
| Learning | retrospective | "retrospective", auto-suggests at ~20 turns |
| Cross-cutting | think-deeply | Confirmation-seeking statements |
| Cross-cutting | context-hygiene | >15 turns, "getting long" |
| Training | trainer-mode | "train", "launch training" (Trainer Mode) |
| Customization | project-customization | "customize Propel", "detect conventions" |

Agents (Auditors)

Domain-specific auditors auto-dispatch after code changes. They run as subagents with separate context windows and read-only access.

| Agent | Purpose | Auto? |
|---|---|---|
| paper-alignment-auditor | Cross-reference code against paper equations | Yes |
| jax-logic-auditor | Trace shapes through JAX transforms | Yes |
| silent-bug-detector | Scan for 11 silent failure categories | Yes |
| data-flow-tracer | End-to-end tensor annotation | No |
| regression-guard | Verify existing configs unchanged | Yes |
| env-researcher | Deep-dive simulation env docs | Yes |
| failure-mode-researcher | Internet search for training failures | No |
| code-reviewer | General code quality with research awareness | No |

Get Started

git clone https://github.com/KevinBian107/propel.git
cd propel && pip install -e .

cd /path/to/your/project
propel init

Then start Claude and run /intro to select a mode and set up your project. See the full Getting Started guide for a complete walkthrough.

What Only You Can Provide

Propel's constraints are necessary but not sufficient. The framework forces the agent to stop and ask structured questions at every phase transition, but the quality of the output depends entirely on what you bring to those checkpoints:

| Your Input | Why It Matters |
|---|---|
| Research question | Not "implement X" but "test whether X improves Y under condition Z." The more specific, the less the agent guesses. |
| Hypothesis | What do you expect and why? This is what auditors verify against. |
| Method | Which paper, which equations, which algorithmic choices. The agent cannot infer "use stop-gradient on the codebook as in Section 3.2." |
| Domain knowledge | Pitfalls not in any paper, configs that silently fail, things that only work in your setup. |
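As a concrete illustration of the method row: stop-gradient on the codebook is the standard VQ-VAE loss construction. A generic JAX sketch (variable names are illustrative; this is the textbook pattern, not code from any specific Propel project):

```python
import jax.numpy as jnp
from jax import lax

def vq_losses(z_e, z_q, beta=0.25):
    """VQ-VAE codebook + commitment losses with explicit stop-gradients.

    z_e: encoder outputs; z_q: selected codebook vectors (same shape).
    """
    # Codebook term: pulls codebook vectors toward (frozen) encoder outputs.
    codebook = jnp.mean((lax.stop_gradient(z_e) - z_q) ** 2)
    # Commitment term: pulls encoder outputs toward (frozen) codebook vectors.
    commitment = beta * jnp.mean((z_e - lax.stop_gradient(z_q)) ** 2)
    return codebook + commitment
```

Which tensor gets the stop-gradient, and the value of `beta`, are exactly the kind of paper-specific choices the agent cannot safely guess — they have to come from you.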

Acknowledgments

Propel combines ideas from multiple sources: