Trainer Mode Skill

Training execution and runtime debugging. Find training commands, launch them in persistent sessions, and fix runtime errors — without touching training logic.

Overview

The Trainer Mode skill handles the full training lifecycle: finding training commands, launching them in persistent screen sessions, and fixing runtime errors that prevent training from running. It does NOT modify training logic — that requires switching to Engineer Mode.

Phase 1: Training Command Detection

When Trainer Mode activates, it immediately scans the project for training entry points:

| Scan Target | What It Looks For |
| --- | --- |
| Training scripts | train*.py, run*.py, main*.py in root and subdirectories |
| Shell scripts | *.sh in project root, scripts/, bin/ |
| Package entry points | pyproject.toml scripts, setup.py console_scripts, Makefile targets |
| Config files | YAML/JSON/TOML in configs/, conf/, config/ directories |
| README instructions | Grep for "train", "run", "usage" in README.md |
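The scan above can be sketched as a small shell helper. The function name and output format are illustrative assumptions, not the skill's actual API; the paths and glob patterns come from the table.

```shell
# Illustrative sketch of the Phase 1 scan; run from the project root.
scan_training_entrypoints() {
  # Training scripts in root and immediate subdirectories
  find . -maxdepth 2 -type f \( -name 'train*.py' -o -name 'run*.py' -o -name 'main*.py' \)
  # Shell scripts in project root, scripts/, bin/ (missing dirs are ignored)
  find . scripts bin -maxdepth 1 -type f -name '*.sh' 2>/dev/null
  # README lines mentioning train/run/usage, with line numbers
  if [ -f README.md ]; then grep -inE 'train|run|usage' README.md; fi
}
```

A real implementation would also parse pyproject.toml scripts and Makefile targets; this sketch covers only the file-glob and README passes.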

Results are presented as a structured table with detected commands, configs, and environment details (Python version, framework, CUDA, GPUs, wandb status). The user must confirm a specific command before proceeding.

Phase 2: Screen Session Management

Training runs in a named, persistent screen session:

screen -dmS propel-train-{timestamp} bash -c '{COMMAND}; exec bash'
Critical Rules

Execute the EXACT command the user confirmed — no modifications. Do NOT create temporary folders, disable wandb, add wrapper scripts, or modify environment variables unless the user explicitly requests it.
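The launch step can be sketched as a thin wrapper around the screen invocation shown above. The helper function, the echo format, and the concrete timestamp format are assumptions for illustration; the screen command itself matches the documented pattern.

```shell
# Sketch of the launch step; wrapper function and timestamp format are illustrative.
launch_training() {
  cmd=$1                                          # the EXACT command the user confirmed
  session="propel-train-$(date +%Y%m%d-%H%M%S)"   # assumed {timestamp} format
  screen -dmS "$session" bash -c "$cmd; exec bash"
  echo "$session"                                 # report session name to the caller
}
```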

After launching, the user receives session management commands:

Training launched in screen session: propel-train-{timestamp}

Attach:  screen -r propel-train-{timestamp}
Detach:  Ctrl+A, then D
List:    screen -ls
Kill:    screen -X -S propel-train-{timestamp} quit

If screen is not available, the skill falls back to tmux with equivalent commands.
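The fallback check can be sketched as follows. The tmux invocation is an assumed equivalent of the screen launch, and the helper's name and return values are illustrative.

```shell
# Sketch: prefer screen, fall back to tmux; report which multiplexer was used.
start_session() {
  session=$1; cmd=$2
  if command -v screen >/dev/null 2>&1; then
    screen -dmS "$session" bash -c "$cmd; exec bash"
    echo "screen"
  elif command -v tmux >/dev/null 2>&1; then
    tmux new-session -d -s "$session" "$cmd; exec bash"
    echo "tmux"
  else
    echo "none"; return 1
  fi
}
```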

Phase 3: Runtime Bug Fixing

Scope Boundary

This is the most critical distinction in Trainer Mode. The skill has a strict scope boundary between runtime bugs and logic changes:

CAN Fix (Runtime Bugs)

CANNOT Fix (Logic Changes)

Redirect Message

"That's a logic change, not a runtime bug. Switch to Engineer Mode with /switch engineer to get the full investigation-design-implement workflow for that change."

Fixing Process

  1. Read the traceback — get the full error output from the screen session
  2. Classify: runtime bug or logic change?
  3. If runtime bug: Present a "Gate 4 lite" diagnosis ("The error is [X] because [Y]. I'll fix it by [Z]. This changes [file:line]. Okay?"), wait for confirmation, apply minimal fix, restart training
  4. If the same error recurs after two fix attempts: stop and reassess before trying again
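Step 1 can be sketched with screen's hardcopy command, which dumps the window contents to a file. The pattern list in the classifier below is purely illustrative; the skill's actual runtime-vs-logic classification is a judgment call, not a fixed grep.

```shell
# Sketch: pull recent output from the training session for traceback reading.
capture_tail() {
  session=$1
  screen -S "$session" -X hardcopy /tmp/train-dump.txt   # dump window to a file
  tail -n 50 /tmp/train-dump.txt
}

# Illustrative first-pass classifier over a captured log (patterns are examples only).
classify_error() {
  if grep -qE 'ModuleNotFoundError|ImportError|FileNotFoundError' "$1"; then
    echo "runtime-bug"
  else
    echo "needs-review"
  fi
}
```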

Phase 4: Monitoring Patterns

Trainer Mode doesn't just wait for errors — it proactively prompts the user with checks worth making before and during the training run.

Before Training

During Training

Common Pitfalls to Flag

| Pitfall | What to Watch For |
| --- | --- |
| fp16/bf16 instability | Watch for NaN in the first 100 steps; bf16 is more stable than fp16 for most training. |
| Multi-GPU scaling | With N GPUs, the effective batch size is N times the single-GPU value. The learning rate may need scaling by sqrt(N) or linearly. |
| Missing warmup | If loss spikes at the start and never recovers, try LR warmup (100-1000 steps). |
| Gradient accumulation math | Effective batch = batch_size x accumulation_steps x num_gpus. Ensure the LR accounts for this. |
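The gradient-accumulation and LR-scaling math from the table, worked through with illustrative numbers (the base LR and batch settings below are made up for the example):

```shell
# Worked example of the effective-batch formula above (values illustrative).
batch_size=8
accumulation_steps=4
num_gpus=2
effective_batch=$((batch_size * accumulation_steps * num_gpus))
echo "effective batch: $effective_batch"   # 8 * 4 * 2 = 64
# sqrt(N) learning-rate scaling from the multi-GPU row (base LR assumed)
awk -v lr=0.0003 -v n="$num_gpus" 'BEGIN { printf "sqrt-scaled LR: %.6f\n", lr * sqrt(n) }'
```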