# Trainer Mode Skill
Training execution and runtime debugging. Find training commands, launch them in persistent sessions, and fix runtime errors — without touching training logic.
## Overview
The Trainer Mode skill handles the full training lifecycle: finding training commands, launching them in persistent screen sessions, and fixing runtime errors that prevent training from running. It does NOT modify training logic — that requires switching to Engineer Mode.
## Phase 1: Training Command Detection
When Trainer Mode activates, it immediately scans the project for training entry points:
| Scan Target | What It Looks For |
|---|---|
| Training scripts | train*.py, run*.py, main*.py in root and subdirectories |
| Shell scripts | *.sh in project root, scripts/, bin/ |
| Package entry points | pyproject.toml scripts, setup.py console_scripts, Makefile targets |
| Config files | YAML/JSON/TOML in configs/, conf/, config/ directories |
| README instructions | Grep for "train", "run", "usage" in README.md |
Results are presented as a structured table with detected commands, configs, and environment details (Python version, framework, CUDA, GPUs, wandb status). The user must confirm a specific command before proceeding.
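The scan described above can be sketched as a small shell function. The directory names, glob patterns, and the function name are assumptions drawn from the table, not the skill's exact implementation:

```shell
#!/bin/sh
# Sketch of the Phase 1 scan: collect candidate training entry points.
scan_project() {
    root="$1"
    # Training scripts anywhere under the project
    find "$root" \( -name 'train*.py' -o -name 'run*.py' -o -name 'main*.py' \)
    # Shell scripts in conventional locations (tolerate missing dirs)
    find "$root" -maxdepth 1 -name '*.sh'
    find "$root/scripts" "$root/bin" -name '*.sh' 2>/dev/null || true
    # Config files in conventional config directories
    find "$root/configs" "$root/conf" "$root/config" \
         \( -name '*.yaml' -o -name '*.yml' -o -name '*.json' -o -name '*.toml' \) \
         2>/dev/null || true
    # README lines mentioning training or usage
    grep -in -e train -e usage "$root/README.md" 2>/dev/null || true
}
```

Package entry points (pyproject.toml scripts, Makefile targets) would need format-aware parsing on top of this and are omitted from the sketch.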
## Phase 2: Screen Session Management
Training runs in a named, persistent screen session:
```sh
screen -dmS propel-train-{timestamp} bash -c '{COMMAND}; exec bash'
```
Execute the EXACT command the user confirmed — no modifications. Do NOT create temporary folders, disable wandb, add wrapper scripts, or modify environment variables unless the user explicitly requests it.
After launching, the user receives session management commands:
```
Training launched in screen session: propel-train-{timestamp}
  Attach: screen -r propel-train-{timestamp}
  Detach: Ctrl+A, then D
  List:   screen -ls
  Kill:   screen -X -S propel-train-{timestamp} quit
```
If screen is not available, the skill falls back to tmux with equivalent commands.
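The launch-with-fallback behavior can be sketched as a wrapper function. The function names and the dry-run second argument are illustrative, not part of the skill's interface:

```shell
#!/bin/sh
# Sketch of the Phase 2 launch with tmux fallback.
session_name() {
    echo "propel-train-$(date +%Y%m%d-%H%M%S)"
}

launch_training() {
    cmd="$1"
    runner="${2:-}"      # pass "echo" to print the command instead of running it
    session=$(session_name)
    if command -v screen >/dev/null 2>&1; then
        $runner screen -dmS "$session" bash -c "$cmd; exec bash"
    elif command -v tmux >/dev/null 2>&1; then
        $runner tmux new-session -d -s "$session" "$cmd; exec bash"
    else
        echo "neither screen nor tmux is available" >&2
        return 1
    fi
    echo "$session"
}
```

The trailing `exec bash` keeps the session alive after the command exits, so the final output stays inspectable.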
## Phase 3: Runtime Bug Fixing
### Scope Boundary
This is the most critical distinction in Trainer Mode. The skill has a strict scope boundary between runtime bugs and logic changes:
#### CAN Fix (Runtime Bugs)
- GPU/CUDA errors (`RuntimeError: CUDA error`, device mismatch)
- OOM errors (`CUDA out of memory`)
- Path errors (`FileNotFoundError`, missing checkpoint paths)
- Dependency issues (`ImportError`, `ModuleNotFoundError`, version conflicts)
- Config typos (misspelled keys, wrong types, missing required fields)
- Environment issues (wrong Python version, missing env vars)
- Multi-GPU/DDP setup (NCCL timeouts, rank errors, `torch.distributed` failures)
- Permission errors on log and checkpoint directories
#### CANNOT Fix (Logic Changes)
- Model architecture changes ("add a layer", "change the backbone")
- Loss function modifications
- Feature additions (gradient clipping, LR scheduling)
- Algorithm changes (switching optimizers)
- Data pipeline changes
- Hyperparameter tuning (suggesting values is okay, changing code is not)
> "That's a logic change, not a runtime bug. Switch to Engineer Mode with `/switch engineer` to get the full investigation-design-implement workflow for that change."
### Fixing Process
1. Read the traceback: get the full error output from the screen session.
2. Classify: runtime bug or logic change?
3. If runtime bug: present a "Gate 4 lite" diagnosis ("The error is [X] because [Y]. I'll fix it by [Z]. This changes [file:line]. Okay?"), wait for confirmation, apply the minimal fix, and restart training.
4. If the same error recurs after 2 fixes: stop and reassess before trying again.
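A first-pass version of the classification step can be approximated by pattern-matching the traceback against the CAN Fix list above. The function name and pattern set are illustrative; real classification needs judgment, not just grep:

```shell
#!/bin/sh
# Rough classifier: does this traceback look like a runtime bug
# Trainer Mode can fix, or does it need closer inspection?
classify_error() {
    traceback="$1"
    runtime_patterns='CUDA error|CUDA out of memory|FileNotFoundError|ImportError|ModuleNotFoundError|NCCL|torch\.distributed|PermissionError'
    if echo "$traceback" | grep -Eq "$runtime_patterns"; then
        echo runtime
    else
        echo unknown   # possibly a logic issue: inspect, switch to Engineer Mode if so
    fi
}
```

Anything that falls through to `unknown` should be inspected by hand rather than auto-fixed.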
## Phase 4: Monitoring Patterns
Trainer Mode doesn't just wait for errors; it proactively prompts the user with questions and checks at each stage of the run.
### Before Training
- "What's your target metric for this run?"
- "Have you run a smoke test (a few steps on a small batch)?"
- "Is wandb/tensorboard configured?"
- "Is this resuming from a checkpoint? If so, which one?"
### During Training
- Loss plateau detection: "Loss hasn't moved in [N] steps. Common causes: learning rate too low, data not shuffled, or a loss computation bug."
- Throughput monitoring: "What's your steps/second? Check DataLoader num_workers and pin_memory if lower than expected."
- Checkpoint verification: "Are checkpoints being saved? A crash at hour 8 without a checkpoint loses everything."
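The plateau check from the first prompt can be sketched over a plain log of loss values. The log format (one numeric loss per line), window size, and threshold are assumptions:

```shell
#!/bin/sh
# Sketch: flag a plateau when the loss range over the last N logged
# values falls below a small threshold.
loss_plateaued() {
    file="$1"; n="${2:-100}"; eps="${3:-0.001}"
    tail -n "$n" "$file" | awk -v eps="$eps" '
        NR == 1 { min = $1; max = $1 }
        { if ($1 < min) min = $1; if ($1 > max) max = $1 }
        END { exit (max - min < eps) ? 0 : 1 }'
}
```

A zero exit status means "plateaued", so the check composes naturally with shell conditionals.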
### Common Pitfalls to Flag
| Pitfall | What to Watch For |
|---|---|
| fp16/bf16 instability | Watch for NaN in the first 100 steps. bf16 is more stable than fp16 for most training. |
| Multi-GPU scaling | With N GPUs, effective batch size is Nx. Learning rate may need scaling by sqrt(N) or linearly. |
| Missing warmup | If loss spikes at the start and never recovers, try LR warmup (100-1000 steps). |
| Gradient accumulation math | Effective batch = batch_size x accumulation_steps x num_gpus. Ensure LR accounts for this. |
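The gradient-accumulation arithmetic from the last row reduces to a one-line calculation. The example values (batch 16, 4 accumulation steps, 8 GPUs) are illustrative:

```shell
#!/bin/sh
# Effective batch = per-GPU batch x gradient-accumulation steps x GPU count.
effective_batch() {
    echo $(( $1 * $2 * $3 ))
}

# Example: 16 x 4 x 8 = 512. Under the linear scaling rule, the LR for
# effective batch 512 would be (512/16) = 32x the batch-16 baseline LR;
# under sqrt scaling it would be sqrt(32) ~ 5.7x.
effective_batch 16 4 8
```

Whichever scaling rule is used, the point is that the LR must be set against the *effective* batch, not the per-GPU one.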