# Trainer Mode
Code is ready; time to launch and monitor training. Runtime debugging without logic changes.
## Overview
Trainer Mode handles the full training lifecycle: finding training commands, launching them in persistent screen sessions, and fixing runtime errors that prevent training from running. It does not modify training logic — no architecture changes, no loss function edits, no data pipeline rewrites.
| Property | Details |
|---|---|
| Active Gates | G4 (runtime bugs only) |
| Active Skills | `trainer-mode`, `systematic-debugging` (runtime only), `context-hygiene`, `verification-before-completion`, `project-customization` |
| Not Available | `investigation`, `deep-research`, `paper-extraction`, `research-design`, `writing-plans`, `subagent-driven-research`, `research-validation`, `using-git-worktrees` |
| Switch Command | `/switch trainer` |
## Trainer Mode Phases
When Trainer Mode activates, it immediately begins scanning your project. The workflow proceeds through four phases:
### Phase 1: Scan
Claude scans the project to detect training entry points:
- Training scripts — `train*.py`, `run*.py`, `main*.py` in root and `scripts/`, `tools/`, `bin/`
- Shell scripts — `*.sh` in root, `scripts/`, `bin/`
- Package entry points — `pyproject.toml` scripts, `setup.py` `console_scripts`, `Makefile` targets
- Config files — YAML/JSON/TOML in `configs/`, `conf/`, `config/`
- README instructions — grep for "train", "run", "usage" in documented commands
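The scan above can be sketched as a glob over the listed locations. This is a minimal illustration: the patterns and directory names come from the list above, but the function name and return shape are assumptions, and the real scan also parses `pyproject.toml`, `Makefile` targets, and README usage sections.

```python
from pathlib import Path

# Patterns and directories mirror the scan list above.
SCRIPT_PATTERNS = ["train*.py", "run*.py", "main*.py", "*.sh"]
SEARCH_DIRS = ["", "scripts", "tools", "bin"]
CONFIG_DIRS = ["configs", "conf", "config"]
CONFIG_SUFFIXES = {".yaml", ".yml", ".json", ".toml"}

def scan_entry_points(root: str = ".") -> dict:
    """Return candidate training scripts and config files under root."""
    root_path = Path(root)
    scripts, configs = [], []
    for d in SEARCH_DIRS:
        base = root_path / d if d else root_path
        if not base.is_dir():
            continue
        for pattern in SCRIPT_PATTERNS:
            scripts.extend(sorted(base.glob(pattern)))
    for d in CONFIG_DIRS:
        base = root_path / d
        if base.is_dir():
            configs.extend(p for p in sorted(base.rglob("*"))
                           if p.suffix in CONFIG_SUFFIXES)
    return {"scripts": scripts, "configs": configs}
```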
Results are presented as a structured table with detected commands, config files, environment details (Python version, framework, CUDA, GPUs, wandb). Claude then asks which command to launch.
### Phase 2: Pre-flight
Before launching, Claude proactively checks for common issues:
- Are all file paths in the config valid?
- Is GPU available and CUDA configured correctly?
- Are all dependencies installed (no missing imports)?
- Is there enough disk space for checkpoints and logs?
- Is wandb/tensorboard configured for monitoring?
Claude also prompts for useful context before launch: target metrics, whether a smoke test has passed, whether to resume from a checkpoint, and how logging is configured.
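The checklist above could look roughly like this in code. It is a hedged sketch: the function name, arguments, and 10 GB disk threshold are illustrative, and it omits the GPU/CUDA and wandb/tensorboard probes.

```python
import importlib.util
import shutil
from pathlib import Path

def preflight(config_paths, required_modules, workdir=".", min_free_gb=10):
    """Collect pre-flight problems; an empty list means all clear."""
    problems = []
    for p in config_paths:                      # are all file paths valid?
        if not Path(p).exists():
            problems.append(f"missing path: {p}")
    for mod in required_modules:                # are dependencies importable?
        if importlib.util.find_spec(mod) is None:
            problems.append(f"missing module: {mod}")
    free_gb = shutil.disk_usage(workdir).free / 1e9
    if free_gb < min_free_gb:                   # room for checkpoints and logs?
        problems.append(f"low disk: {free_gb:.1f} GB free")
    return problems
```

Returning a list of problems rather than raising on the first one lets all issues be reported in a single pre-flight pass.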
### Phase 3: Launch
Training is launched in a persistent screen (or tmux) session using the exact command the user confirmed:
```shell
screen -dmS propel-train-{timestamp} bash -c '{COMMAND}; exec bash'
```
Claude reports the session name and provides attach, detach, list, and kill commands. Critical rules: no modifications to the confirmed command, no temporary folders, no disabling of wandb or logging, no wrapper scripts.
### Phase 4: Runtime Debugging (G4)
When training crashes, Claude reads the traceback, classifies the error as runtime or logic, and applies Gate 4 lite for runtime bugs:
"The error is [X] because [Y]. I'll fix it by [Z].
This changes [file:line]. Okay?"
If the same error recurs after 2 fixes, Claude stops and reassesses rather than trying another variation.
## What Trainer Mode CAN Fix
Trainer Mode fixes runtime errors — things that prevent the training command from executing:
| Category | Examples |
|---|---|
| GPU / CUDA | `RuntimeError: CUDA error`, device mismatch, NCCL timeouts, DDP rank errors |
| OOM | `CUDA out of memory`, `torch.cuda.OutOfMemoryError` |
| Path errors | `FileNotFoundError`, missing checkpoint paths, `NotADirectoryError` |
| Dependencies | `ImportError`, `ModuleNotFoundError`, version conflicts |
| Config | Misspelled keys, wrong types, missing required fields |
| Environment | Wrong Python version, missing env vars, venv not activated |
| Permissions | `PermissionError` on log dirs, checkpoint dirs |
## Screen Session Management
Training runs in persistent screen sessions that survive terminal disconnects. Key commands:
| Action | Command |
|---|---|
| Attach to session | `screen -r propel-train-{timestamp}` |
| Detach from session | `Ctrl+A`, then `D` |
| List all sessions | `screen -ls` |
| Kill a session | `screen -X -S propel-train-{timestamp} quit` |
If `screen` is not available, Trainer Mode falls back to `tmux` with equivalent commands.
## What Trainer Mode Does NOT Do
Trainer Mode has a strict scope boundary. It does not make logic changes:
- No model architecture changes — "add a layer", "change the backbone"
- No loss function modifications — "try a different loss", "add regularization"
- No feature additions — "add gradient clipping", "add LR scheduling"
- No algorithm changes — "switch from Adam to SGD", "add warmup"
- No data pipeline changes — "augment differently", "change the dataloader"
- No hyperparameter tuning in code — suggesting values is okay, but changing code is not
If a training failure turns out to be caused by a logic bug (not a runtime error), Trainer Mode will identify this but will not fix it. Instead: "That's a logic change, not a runtime bug. Switch to Engineer Mode with `/switch engineer` to get the full investigation-design-implement workflow for that change."
## Redirect Rules
| If You Ask For... | Redirect |
|---|---|
| Logic bugs (architecture, loss, data pipeline) | `/switch engineer` — investigation-design-implement workflow |
| Design issues or new features | `/switch engineer` — full pipeline with all gates |
| Literature review or investigation | `/switch researcher` — research-focused skills |
| Root-cause diagnosis for non-runtime bugs | `/switch debugger` — evidence-backed diagnosis |
## Proactive Monitoring Prompts
Trainer Mode does not wait passively for problems. When you check in during training, Claude proactively flags common issues:
- Loss plateau — "Loss hasn't moved in N steps. Common causes: learning rate too low, data not shuffled, or a bug in the loss computation."
- Throughput — "What's your steps/second? If it's lower than expected, check `DataLoader` `num_workers` and `pin_memory`."
- Checkpointing — "Are checkpoints being saved? If training crashes at hour 8 without a checkpoint, you lose everything."
- Mixed precision — "If you're using `fp16`/`bf16`, watch for NaN in the first 100 steps. `bf16` is more stable."
- Multi-GPU scaling — "With N GPUs, effective batch size is N×. You may need to scale the learning rate accordingly."
- Gradient accumulation — "With `gradient_accumulation_steps=N`, effective batch size is `batch_size * N * num_gpus`."
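The last two prompts are simple arithmetic, sketched here with illustrative numbers:

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Per-device batch size times accumulation steps times GPU count."""
    return batch_size * grad_accum_steps * num_gpus

# Illustrative numbers: batch_size=16, gradient_accumulation_steps=4, 2 GPUs.
print(effective_batch_size(16, 4, 2))  # -> 128
```

A common heuristic (not a guarantee) is to scale the base learning rate linearly with the effective batch size.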
## Example Session
A typical Trainer Mode session launching and monitoring a training run:
User: /switch trainer
Claude: Trainer Mode active. Gate 4 (runtime bugs only).
Let me scan your project for training commands...
[Phase 1 — Scan]
Training Commands Detected:
| Source | Command |
|------------|----------------------------------------------|
| Script | python train.py --config configs/rvq.yaml |
| Makefile | make train CONFIG=configs/rvq.yaml |
Configs Found:
• configs/default.yaml
• configs/rvq.yaml
Environment:
• Python 3.11, JAX 0.4.25, CUDA 12.3
• 2x A100 (nvidia-smi)
• wandb configured
Which command should I launch?
User: python train.py --config configs/rvq.yaml
Claude: [Phase 2 — Pre-flight]
Checking paths, dependencies, GPU...
All clear. Target metric for this run?
User: Reconstruction loss below 0.05 by 50k steps.
Claude: [Phase 3 — Launch]
Training launched in screen session: propel-train-20260328-1430
Attach: screen -r propel-train-20260328-1430
Detach: Ctrl+A, then D
List: screen -ls
Kill: screen -X -S propel-train-20260328-1430 quit
Monitoring first 1000 steps...
User: It crashed. CUDA out of memory.
Claude: [Phase 4 — Runtime Debug]
Reading traceback...
The error is OOM at model forward pass because batch_size=64
exceeds A100 40GB VRAM for this model size. I'll fix it by
reducing batch_size to 32 in configs/rvq.yaml (line 12).
This changes only the config file. Okay?
User: Yes.
Claude: Fixed. Relaunching in the same screen session...
propel-train-20260328-1430 running. Monitoring...