# Trainer Mode Skill
Training execution and runtime debugging. Find training commands, launch them in persistent sessions, and fix runtime errors — without touching training logic.
## Overview
The Trainer Mode skill handles the full training lifecycle: finding training commands, launching them in persistent screen sessions, and fixing runtime errors that prevent training from running. It does NOT modify training logic — that requires switching to Engineer Mode.
## Phase 1: Training Command Detection
When Trainer Mode activates, it immediately scans the project for training entry points:
| Scan Target | What It Looks For |
|---|---|
| Training scripts | train*.py, run*.py, main*.py in root and subdirectories |
| Shell scripts | *.sh in project root, scripts/, bin/ |
| Package entry points | pyproject.toml scripts, setup.py console_scripts, Makefile targets |
| Config files | YAML/JSON/TOML in configs/, conf/, config/ directories |
| README instructions | Grep for "train", "run", "usage" in README.md |
Results are presented as a structured table with detected commands, configs, and environment details (Python version, framework, CUDA, GPUs, wandb status). The user must confirm a specific command before proceeding.
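The scan described above can be sketched as a small shell function. The directory names, glob patterns, and the function name are assumptions drawn from the table, not the skill's exact implementation:

```shell
#!/bin/sh
# Sketch of the Phase 1 scan: collect candidate training entry points.
scan_project() {
    root="$1"
    # Training scripts anywhere under the project
    find "$root" \( -name 'train*.py' -o -name 'run*.py' -o -name 'main*.py' \)
    # Shell scripts in conventional locations (tolerate missing dirs)
    find "$root" -maxdepth 1 -name '*.sh'
    find "$root/scripts" "$root/bin" -name '*.sh' 2>/dev/null || true
    # Config files in conventional config directories
    find "$root/configs" "$root/conf" "$root/config" \
         \( -name '*.yaml' -o -name '*.yml' -o -name '*.json' -o -name '*.toml' \) \
         2>/dev/null || true
    # README lines mentioning training or usage
    grep -in -e train -e usage "$root/README.md" 2>/dev/null || true
}
```

Package entry points (pyproject.toml scripts, Makefile targets) would need format-aware parsing on top of this and are omitted from the sketch.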
## Phase 2: Screen Session Management
Training runs in a named, persistent screen session:
```sh
screen -dmS propel-train-{timestamp} bash -c '{COMMAND}; exec bash'
```
Execute the EXACT command the user confirmed — no modifications. Do NOT create temporary folders, disable wandb, add wrapper scripts, or modify environment variables unless the user explicitly requests it.
After launching, the user receives session management commands:
```
Training launched in screen session: propel-train-{timestamp}
  Attach: screen -r propel-train-{timestamp}
  Detach: Ctrl+A, then D
  List:   screen -ls
  Kill:   screen -X -S propel-train-{timestamp} quit
```
If screen is not available, the skill falls back to tmux with equivalent commands.
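The launch-with-fallback behavior can be sketched as a wrapper function. The function names and the dry-run second argument are illustrative, not part of the skill's interface:

```shell
#!/bin/sh
# Sketch of the Phase 2 launch with tmux fallback.
session_name() {
    echo "propel-train-$(date +%Y%m%d-%H%M%S)"
}

launch_training() {
    cmd="$1"
    runner="${2:-}"      # pass "echo" to print the command instead of running it
    session=$(session_name)
    if command -v screen >/dev/null 2>&1; then
        $runner screen -dmS "$session" bash -c "$cmd; exec bash"
    elif command -v tmux >/dev/null 2>&1; then
        $runner tmux new-session -d -s "$session" "$cmd; exec bash"
    else
        echo "neither screen nor tmux is available" >&2
        return 1
    fi
    echo "$session"
}
```

The trailing `exec bash` keeps the session alive after the command exits, so the final output stays inspectable.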
## Phase 3: Runtime Bug Fixing
### Scope Boundary
This is the most critical distinction in Trainer Mode. The skill has a strict scope boundary between runtime bugs and logic changes:
#### CAN Fix (Runtime Bugs)
- GPU/CUDA errors (`RuntimeError: CUDA error`, device mismatch)
- OOM errors (`CUDA out of memory`)
- Path errors (`FileNotFoundError`, missing checkpoint paths)
- Dependency issues (`ImportError`, `ModuleNotFoundError`, version conflicts)
- Config typos (misspelled keys, wrong types, missing required fields)
- Environment issues (wrong Python version, missing env vars)
- Multi-GPU/DDP setup (NCCL timeouts, rank errors, `torch.distributed` failures)
- Permission errors on log and checkpoint directories
#### CANNOT Fix (Logic Changes)
- Model architecture changes ("add a layer", "change the backbone")
- Loss function modifications
- Feature additions (gradient clipping, LR scheduling)
- Algorithm changes (switching optimizers)
- Data pipeline changes
- Hyperparameter tuning (suggesting values is okay, changing code is not)
> "That's a logic change, not a runtime bug. Switch to Engineer Mode with `/switch engineer` to get the full investigation-design-implement workflow for that change."
### Fixing Process
1. Read the traceback: get the full error output from the screen session.
2. Classify: runtime bug or logic change?
3. If runtime bug: present a "Gate 4 lite" diagnosis ("The error is [X] because [Y]. I'll fix it by [Z]. This changes [file:line]. Okay?"), wait for confirmation, apply the minimal fix, and restart training.
4. If the same error recurs after 2 fixes: stop and reassess before trying again.
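A first-pass version of the classification step can be approximated by pattern-matching the traceback against the CAN Fix list above. The function name and pattern set are illustrative; real classification needs judgment, not just grep:

```shell
#!/bin/sh
# Rough classifier: does this traceback look like a runtime bug
# Trainer Mode can fix, or does it need closer inspection?
classify_error() {
    traceback="$1"
    runtime_patterns='CUDA error|CUDA out of memory|FileNotFoundError|ImportError|ModuleNotFoundError|NCCL|torch\.distributed|PermissionError'
    if echo "$traceback" | grep -Eq "$runtime_patterns"; then
        echo runtime
    else
        echo unknown   # possibly a logic issue: inspect, switch to Engineer Mode if so
    fi
}
```

Anything that falls through to `unknown` should be inspected by hand rather than auto-fixed.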
## Phase 4: Monitoring Patterns
Trainer Mode doesn't just wait for errors; it proactively prompts the user with questions and checks at each stage of the run.
### Before Training
- "What's your target metric for this run?"
- "Have you run a smoke test (a few steps on a small batch)?"
- "Is wandb/tensorboard configured?"
- "Is this resuming from a checkpoint? If so, which one?"
### During Training
- Loss plateau detection: "Loss hasn't moved in [N] steps. Common causes: learning rate too low, data not shuffled, or a loss computation bug."
- Throughput monitoring: "What's your steps/second? Check DataLoader num_workers and pin_memory if lower than expected."
- Checkpoint verification: "Are checkpoints being saved? A crash at hour 8 without a checkpoint loses everything."
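The plateau check from the first prompt can be sketched over a plain log of loss values. The log format (one numeric loss per line), window size, and threshold are assumptions:

```shell
#!/bin/sh
# Sketch: flag a plateau when the loss range over the last N logged
# values falls below a small threshold.
loss_plateaued() {
    file="$1"; n="${2:-100}"; eps="${3:-0.001}"
    tail -n "$n" "$file" | awk -v eps="$eps" '
        NR == 1 { min = $1; max = $1 }
        { if ($1 < min) min = $1; if ($1 > max) max = $1 }
        END { exit (max - min < eps) ? 0 : 1 }'
}
```

A zero exit status means "plateaued", so the check composes naturally with shell conditionals.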
### Common Pitfalls to Flag
| Pitfall | What to Watch For |
|---|---|
| fp16/bf16 instability | Watch for NaN in the first 100 steps. bf16 is more stable than fp16 for most training. |
| Multi-GPU scaling | With N GPUs, effective batch size is Nx. Learning rate may need scaling by sqrt(N) or linearly. |
| Missing warmup | If loss spikes at the start and never recovers, try LR warmup (100-1000 steps). |
| Gradient accumulation math | Effective batch = batch_size x accumulation_steps x num_gpus. Ensure LR accounts for this. |
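The gradient-accumulation arithmetic from the last row reduces to a one-line calculation. The example values (batch 16, 4 accumulation steps, 8 GPUs) are illustrative:

```shell
#!/bin/sh
# Effective batch = per-GPU batch x gradient-accumulation steps x GPU count.
effective_batch() {
    echo $(( $1 * $2 * $3 ))
}

# Example: 16 x 4 x 8 = 512. Under the linear scaling rule, the LR for
# effective batch 512 would be (512/16) = 32x the batch-16 baseline LR;
# under sqrt scaling it would be sqrt(32) ~ 5.7x.
effective_batch 16 4 8
```

Whichever scaling rule is used, the point is that the LR must be set against the *effective* batch, not the per-GPU one.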