Trainer Mode

Code is ready; time to launch and monitor training. Runtime debugging only, with no logic changes.

Overview

Trainer Mode handles the full training lifecycle: finding training commands, launching them in persistent screen sessions, and fixing runtime errors that prevent training from running. It does not modify training logic — no architecture changes, no loss function edits, no data pipeline rewrites.

| Property | Details |
|----------|---------|
| Active Gates | G4 (runtime bugs only) |
| Active Skills | trainer-mode, systematic-debugging (runtime only), context-hygiene, verification-before-completion, project-customization |
| Not Available | investigation, deep-research, paper-extraction, research-design, writing-plans, subagent-driven-research, research-validation, using-git-worktrees |
| Switch Command | /switch trainer |

Trainer Mode Phases

When Trainer Mode activates, it immediately begins scanning your project. The workflow proceeds through four phases:

Phase 1: Scan

Claude scans the project to detect training entry points:

  • Training scripts — train*.py, run*.py, main*.py in root and scripts/, tools/, bin/
  • Shell scripts — *.sh in root, scripts/, bin/
  • Package entry points — pyproject.toml scripts, setup.py console_scripts, Makefile targets
  • Config files — YAML/JSON/TOML in configs/, conf/, config/
  • README instructions — grep for "train", "run", "usage" in documented commands

Results are presented as a structured table with detected commands, config files, environment details (Python version, framework, CUDA, GPUs, wandb). Claude then asks which command to launch.
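The script-detection part of the scan can be sketched as a glob over the patterns and directories listed above. This is a hypothetical `find_training_scripts` helper for illustration, not Trainer Mode's actual implementation; the real scan also covers shell scripts, package entry points, configs, and README commands:

```python
from pathlib import Path

# Patterns and directories mirroring the scan rules above.
PATTERNS = ("train*.py", "run*.py", "main*.py")
SEARCH_DIRS = ("", "scripts", "tools", "bin")

def find_training_scripts(root):
    """Return candidate training scripts under `root`, sorted per directory."""
    root = Path(root)
    hits = []
    for d in SEARCH_DIRS:
        base = root / d if d else root
        if base.is_dir():
            for pattern in PATTERNS:
                hits.extend(sorted(base.glob(pattern)))
    return hits
```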

Phase 2: Pre-flight

Before launching, Claude proactively checks for common issues:

  • Are all file paths in the config valid?
  • Is GPU available and CUDA configured correctly?
  • Are all dependencies installed (no missing imports)?
  • Is there enough disk space for checkpoints and logs?
  • Is wandb/tensorboard configured for monitoring?

Claude also prompts you to think ahead: the target metric for the run, whether a smoke test has completed, whether to resume from a checkpoint, and how logging is configured.
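The checklist above can be sketched as a single helper. This is a hypothetical `preflight` function, not Trainer Mode's code; the GPU/CUDA and wandb checks are framework-specific and omitted here:

```python
import importlib.util
import shutil
from pathlib import Path

def preflight(paths, modules, checkpoint_dir, min_free_gb=10.0):
    """Return a list of human-readable problems; an empty list means all clear."""
    problems = []
    for p in paths:                       # do the config's file paths exist?
        if not Path(p).exists():
            problems.append(f"missing path: {p}")
    for mod in modules:                   # are the required imports resolvable?
        if importlib.util.find_spec(mod) is None:
            problems.append(f"missing dependency: {mod}")
    free_gb = shutil.disk_usage(checkpoint_dir).free / 1e9
    if free_gb < min_free_gb:             # room for checkpoints and logs?
        problems.append(f"low disk: {free_gb:.1f} GB free")
    return problems
```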

Phase 3: Launch

Training is launched in a persistent screen (or tmux) session using the exact command the user confirmed:

screen -dmS propel-train-{timestamp} bash -c '{COMMAND}; exec bash'

Claude reports the session name and provides attach, detach, list, and kill commands. Critical rules: no modifications to the confirmed command, no temporary folders, no disabling of wandb or logging, no wrapper scripts.
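The launch step can be sketched as building the exact screen invocation shown above around the unmodified command. The `launch_in_screen` helper and its `dry_run` flag are hypothetical, added only so the construction is testable:

```python
import subprocess
import time

def launch_in_screen(command, dry_run=False):
    """Launch the user-confirmed command, unmodified, in a detached screen session.

    Returns the session name and the argv that was (or would be) executed.
    """
    session = f"propel-train-{time.strftime('%Y%m%d-%H%M')}"
    argv = ["screen", "-dmS", session, "bash", "-c", f"{command}; exec bash"]
    if not dry_run:
        subprocess.run(argv, check=True)
    return session, argv
```

Passing the command as a single `bash -c` argument keeps its quoting intact, and the trailing `exec bash` keeps the session alive after training exits so logs remain inspectable.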

Phase 4: Runtime Debugging (G4)

When training crashes, Claude reads the traceback, classifies the error as runtime or logic, and applies Gate 4 lite for runtime bugs:

"The error is [X] because [Y]. I'll fix it by [Z].
This changes [file:line]. Okay?"

If the same error recurs after 2 fixes, Claude stops and reassesses rather than trying another variation.
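The stop-and-reassess rule can be sketched as a small counter keyed by error signature. `FixTracker` is a hypothetical illustration of the policy, not part of Trainer Mode itself:

```python
from collections import Counter

class FixTracker:
    """Stop retrying once the same error signature has already been fixed twice."""

    def __init__(self, max_fixes=2):
        self.max_fixes = max_fixes
        self.fixes = Counter()

    def record_fix(self, error_signature):
        """Record a fix attempt; True means stop and reassess instead of retrying."""
        self.fixes[error_signature] += 1
        return self.fixes[error_signature] > self.max_fixes
```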

What Trainer Mode CAN Fix

Trainer Mode fixes runtime errors — things that prevent the training command from executing:

| Category | Examples |
|----------|----------|
| GPU / CUDA | RuntimeError: CUDA error, device mismatch, NCCL timeouts, DDP rank errors |
| OOM | CUDA out of memory, torch.cuda.OutOfMemoryError |
| Path errors | FileNotFoundError, missing checkpoint paths, NotADirectoryError |
| Dependencies | ImportError, ModuleNotFoundError, version conflicts |
| Config | Misspelled keys, wrong types, missing required fields |
| Environment | Wrong Python version, missing env vars, venv not activated |
| Permissions | PermissionError on log dirs, checkpoint dirs |
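The runtime-vs-logic split can be approximated by matching traceback text against the categories above. This is a hypothetical heuristic with made-up signature strings; real classification reads the whole traceback in context:

```python
# Substrings drawn from the error categories above (illustrative, not exhaustive).
RUNTIME_SIGNATURES = (
    "CUDA error", "out of memory", "OutOfMemoryError",
    "FileNotFoundError", "NotADirectoryError",
    "ImportError", "ModuleNotFoundError", "PermissionError",
)

def looks_like_runtime_error(traceback_text):
    """True if the traceback matches a known runtime-error category."""
    return any(sig in traceback_text for sig in RUNTIME_SIGNATURES)
```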

Screen Session Management

Training runs in persistent screen sessions that survive terminal disconnects. Key commands:

| Action | Command |
|--------|---------|
| Attach to session | screen -r propel-train-{timestamp} |
| Detach from session | Ctrl+A, then D |
| List all sessions | screen -ls |
| Kill a session | screen -X -S propel-train-{timestamp} quit |

If screen is not available, Trainer Mode falls back to tmux with equivalent commands.
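The fallback can be sketched as an availability check plus a lookup of the tmux counterparts of the screen commands above. Both helpers are hypothetical illustrations; the injectable `which` parameter exists only for testing:

```python
import shutil

# tmux equivalents of the screen commands above; {session} is filled in at launch.
TMUX_COMMANDS = {
    "attach": "tmux attach -t {session}",
    "detach": "Ctrl+B, then D",
    "list":   "tmux ls",
    "kill":   "tmux kill-session -t {session}",
}

def pick_multiplexer(which=shutil.which):
    """Prefer screen; fall back to tmux when screen is unavailable."""
    for tool in ("screen", "tmux"):
        if which(tool):
            return tool
    raise RuntimeError("neither screen nor tmux is installed")
```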

What Trainer Mode Does NOT Do

Trainer Mode has a strict scope boundary. It does not make logic changes:

Scope Boundary

If a training failure turns out to be caused by a logic bug (not a runtime error), Trainer Mode will identify this but will not fix it. Instead: "That's a logic change, not a runtime bug. Switch to Engineer Mode with /switch engineer to get the full investigation-design-implement workflow for that change."

Redirect Rules

| If You Ask For... | Redirect |
|-------------------|----------|
| Logic bugs (architecture, loss, data pipeline) | /switch engineer — investigation-design-implement workflow |
| Design issues or new features | /switch engineer — full pipeline with all gates |
| Literature review or investigation | /switch researcher — research-focused skills |
| Root-cause diagnosis for non-runtime bugs | /switch debugger — evidence-backed diagnosis |

Proactive Monitoring Prompts

Trainer Mode does not wait passively for problems. When you check in during training, Claude proactively flags any common issues it has spotted.

Example Session

A typical Trainer Mode session launching and monitoring a training run:

User:  /switch trainer
Claude: Trainer Mode active. Gate 4 (runtime bugs only).
        Let me scan your project for training commands...

        [Phase 1 — Scan]
        Training Commands Detected:

        | Source     | Command                                      |
        |------------|----------------------------------------------|
        | Script     | python train.py --config configs/rvq.yaml    |
        | Makefile   | make train CONFIG=configs/rvq.yaml           |

        Configs Found:
        • configs/default.yaml
        • configs/rvq.yaml

        Environment:
        • Python 3.11, JAX 0.4.25, CUDA 12.3
        • 2x A100 (nvidia-smi)
        • wandb configured

        Which command should I launch?

User:  python train.py --config configs/rvq.yaml

Claude: [Phase 2 — Pre-flight]
        Checking paths, dependencies, GPU...
        All clear. Target metric for this run?

User:  Reconstruction loss below 0.05 by 50k steps.

Claude: [Phase 3 — Launch]
        Training launched in screen session: propel-train-20260328-1430

        Attach:  screen -r propel-train-20260328-1430
        Detach:  Ctrl+A, then D
        List:    screen -ls
        Kill:    screen -X -S propel-train-20260328-1430 quit

        Monitoring first 1000 steps...

User:  It crashed. CUDA out of memory.

Claude: [Phase 4 — Runtime Debug]
        Reading traceback...

        The error is OOM at model forward pass because batch_size=64
        exceeds A100 40GB VRAM for this model size. I'll fix it by
        reducing batch_size to 32 in configs/rvq.yaml (line 12).
        This changes only the config file. Okay?

User:  Yes.

Claude: Fixed. Relaunching in the same screen session...
        propel-train-20260328-1430 running. Monitoring...