# Trainer Mode
Code is ready; time to launch and monitor training. Runtime debugging without logic changes.
## Overview
Trainer Mode handles the full training lifecycle: finding training commands, launching them in persistent screen sessions, and fixing runtime errors that prevent training from running. It does not modify training logic — no architecture changes, no loss function edits, no data pipeline rewrites.
| Property | Details |
|---|---|
| Active Gates | G4 (runtime bugs only) |
| Active Skills | `trainer-mode`, `systematic-debugging` (runtime only), `context-hygiene`, `verification-before-completion`, `project-customization` |
| Not Available | `investigation`, `deep-research`, `paper-extraction`, `research-design`, `writing-plans`, `subagent-driven-research`, `research-validation`, `using-git-worktrees` |
| Switch Command | `/switch trainer` |
## Trainer Mode Phases
When Trainer Mode activates, it immediately begins scanning your project. The workflow proceeds through four phases:
### Phase 1: Scan
Claude scans the project to detect training entry points:
- Training scripts — `train*.py`, `run*.py`, `main*.py` in root and `scripts/`, `tools/`, `bin/`
- Shell scripts — `*.sh` in root, `scripts/`, `bin/`
- Package entry points — `pyproject.toml` scripts, `setup.py` `console_scripts`, `Makefile` targets
- Config files — YAML/JSON/TOML in `configs/`, `conf/`, `config/`
- README instructions — grep for "train", "run", "usage" in documented commands
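The scan above can be sketched as a glob over the listed locations. This is a minimal illustration: the patterns and directory names come from the list above, but the function name and return shape are assumptions, and the real scan also parses `pyproject.toml`, `Makefile` targets, and README usage sections.

```python
from pathlib import Path

# Patterns and directories mirror the scan list above.
SCRIPT_PATTERNS = ["train*.py", "run*.py", "main*.py", "*.sh"]
SEARCH_DIRS = ["", "scripts", "tools", "bin"]
CONFIG_DIRS = ["configs", "conf", "config"]
CONFIG_SUFFIXES = {".yaml", ".yml", ".json", ".toml"}

def scan_entry_points(root: str = ".") -> dict:
    """Return candidate training scripts and config files under root."""
    root_path = Path(root)
    scripts, configs = [], []
    for d in SEARCH_DIRS:
        base = root_path / d if d else root_path
        if not base.is_dir():
            continue
        for pattern in SCRIPT_PATTERNS:
            scripts.extend(sorted(base.glob(pattern)))
    for d in CONFIG_DIRS:
        base = root_path / d
        if base.is_dir():
            configs.extend(p for p in sorted(base.rglob("*"))
                           if p.suffix in CONFIG_SUFFIXES)
    return {"scripts": scripts, "configs": configs}
```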
Results are presented as a structured table with detected commands, config files, environment details (Python version, framework, CUDA, GPUs, wandb). Claude then asks which command to launch.
### Phase 2: Pre-flight
Before launching, Claude proactively checks for common issues:
- Are all file paths in the config valid?
- Is GPU available and CUDA configured correctly?
- Are all dependencies installed (no missing imports)?
- Is there enough disk space for checkpoints and logs?
- Is wandb/tensorboard configured for monitoring?
Claude also prompts for useful context before launch: target metrics, whether a smoke test has passed, whether to resume from a checkpoint, and how logging is configured.
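The checklist above could look roughly like this in code. It is a hedged sketch: the function name, arguments, and 10 GB disk threshold are illustrative, and it omits the GPU/CUDA and wandb/tensorboard probes.

```python
import importlib.util
import shutil
from pathlib import Path

def preflight(config_paths, required_modules, workdir=".", min_free_gb=10):
    """Collect pre-flight problems; an empty list means all clear."""
    problems = []
    for p in config_paths:                      # are all file paths valid?
        if not Path(p).exists():
            problems.append(f"missing path: {p}")
    for mod in required_modules:                # are dependencies importable?
        if importlib.util.find_spec(mod) is None:
            problems.append(f"missing module: {mod}")
    free_gb = shutil.disk_usage(workdir).free / 1e9
    if free_gb < min_free_gb:                   # room for checkpoints and logs?
        problems.append(f"low disk: {free_gb:.1f} GB free")
    return problems
```

Returning a list of problems rather than raising on the first one lets all issues be reported in a single pre-flight pass.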
### Phase 3: Launch
Training is launched in a persistent screen (or tmux) session using the exact command the user confirmed:
```shell
screen -dmS propel-train-{timestamp} bash -c '{COMMAND}; exec bash'
```
Claude reports the session name and provides attach, detach, list, and kill commands. Critical rules: no modifications to the confirmed command, no temporary folders, no disabling of wandb or logging, no wrapper scripts.
### Phase 4: Runtime Debugging (G4)
When training crashes, Claude reads the traceback, classifies the error as runtime or logic, and applies Gate 4 lite for runtime bugs:
"The error is [X] because [Y]. I'll fix it by [Z].
This changes [file:line]. Okay?"
If the same error recurs after 2 fixes, Claude stops and reassesses rather than trying another variation.
## What Trainer Mode CAN Fix
Trainer Mode fixes runtime errors — things that prevent the training command from executing:
| Category | Examples |
|---|---|
| GPU / CUDA | `RuntimeError: CUDA error`, device mismatch, NCCL timeouts, DDP rank errors |
| OOM | `CUDA out of memory`, `torch.cuda.OutOfMemoryError` |
| Path errors | `FileNotFoundError`, missing checkpoint paths, `NotADirectoryError` |
| Dependencies | `ImportError`, `ModuleNotFoundError`, version conflicts |
| Config | Misspelled keys, wrong types, missing required fields |
| Environment | Wrong Python version, missing env vars, venv not activated |
| Permissions | `PermissionError` on log dirs, checkpoint dirs |
## Screen Session Management
Training runs in persistent screen sessions that survive terminal disconnects. Key commands:
| Action | Command |
|---|---|
| Attach to session | `screen -r propel-train-{timestamp}` |
| Detach from session | `Ctrl+A`, then `D` |
| List all sessions | `screen -ls` |
| Kill a session | `screen -X -S propel-train-{timestamp} quit` |
If `screen` is not available, Trainer Mode falls back to `tmux` with equivalent commands.
## What Trainer Mode Does NOT Do
Trainer Mode has a strict scope boundary. It does not make logic changes:
- No model architecture changes — "add a layer", "change the backbone"
- No loss function modifications — "try a different loss", "add regularization"
- No feature additions — "add gradient clipping", "add LR scheduling"
- No algorithm changes — "switch from Adam to SGD", "add warmup"
- No data pipeline changes — "augment differently", "change the dataloader"
- No hyperparameter tuning in code — suggesting values is okay, but changing code is not
If a training failure turns out to be caused by a logic bug (not a runtime error), Trainer Mode will identify this but will not fix it. Instead: "That's a logic change, not a runtime bug. Switch to Engineer Mode with `/switch engineer` to get the full investigation-design-implement workflow for that change."
## Redirect Rules
| If You Ask For... | Redirect |
|---|---|
| Logic bugs (architecture, loss, data pipeline) | `/switch engineer` — investigation-design-implement workflow |
| Design issues or new features | `/switch engineer` — full pipeline with all gates |
| Literature review or investigation | `/switch researcher` — research-focused skills |
| Root-cause diagnosis for non-runtime bugs | `/switch debugger` — evidence-backed diagnosis |
## Proactive Monitoring Prompts
Trainer Mode does not wait passively for problems. When you check in during training, Claude proactively flags common issues:
- Loss plateau — "Loss hasn't moved in N steps. Common causes: learning rate too low, data not shuffled, or a bug in the loss computation."
- Throughput — "What's your steps/second? If it's lower than expected, check `DataLoader` `num_workers` and `pin_memory`."
- Checkpointing — "Are checkpoints being saved? If training crashes at hour 8 without a checkpoint, you lose everything."
- Mixed precision — "If you're using `fp16`/`bf16`, watch for NaN in the first 100 steps. `bf16` is more stable."
- Multi-GPU scaling — "With N GPUs, effective batch size is N×. You may need to scale the learning rate accordingly."
- Gradient accumulation — "With `gradient_accumulation_steps=N`, effective batch size is `batch_size * N * num_gpus`."
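The last two prompts are simple arithmetic, sketched here with illustrative numbers:

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int, num_gpus: int) -> int:
    """Per-device batch size times accumulation steps times GPU count."""
    return batch_size * grad_accum_steps * num_gpus

# Illustrative numbers: batch_size=16, gradient_accumulation_steps=4, 2 GPUs.
print(effective_batch_size(16, 4, 2))  # -> 128
```

A common heuristic (not a guarantee) is to scale the base learning rate linearly with the effective batch size.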
## Example Session
A typical Trainer Mode session launching and monitoring a training run:
User: /switch trainer
Claude: Trainer Mode active. Gate 4 (runtime bugs only).
Let me scan your project for training commands...
[Phase 1 — Scan]
Training Commands Detected:
| Source | Command |
|------------|----------------------------------------------|
| Script | python train.py --config configs/rvq.yaml |
| Makefile | make train CONFIG=configs/rvq.yaml |
Configs Found:
• configs/default.yaml
• configs/rvq.yaml
Environment:
• Python 3.11, JAX 0.4.25, CUDA 12.3
• 2x A100 (nvidia-smi)
• wandb configured
Which command should I launch?
User: python train.py --config configs/rvq.yaml
Claude: [Phase 2 — Pre-flight]
Checking paths, dependencies, GPU...
All clear. Target metric for this run?
User: Reconstruction loss below 0.05 by 50k steps.
Claude: [Phase 3 — Launch]
Training launched in screen session: propel-train-20260328-1430
Attach: screen -r propel-train-20260328-1430
Detach: Ctrl+A, then D
List: screen -ls
Kill: screen -X -S propel-train-20260328-1430 quit
Monitoring first 1000 steps...
User: It crashed. CUDA out of memory.
Claude: [Phase 4 — Runtime Debug]
Reading traceback...
The error is OOM at model forward pass because batch_size=64
exceeds A100 40GB VRAM for this model size. I'll fix it by
reducing batch_size to 32 in configs/rvq.yaml (line 12).
This changes only the config file. Okay?
User: Yes.
Claude: Fixed. Relaunching in the same screen session...
propel-train-20260328-1430 running. Monitoring...