Monitor
A supervised self-monitoring loop for long-running training runs and multi-step command chains. Clarifies the success criteria first, prints them for the user, then iterates check → debug → verify until done.
/monitor runs an iterative loop that reads progress, evaluates against a checklist, and can dispatch debuggers each iteration. The skill requires explicit user confirmation before entering the loop. If you just need a single check, use a one-shot inspection instead.
Overview
Monitor turns Claude into a patient supervisor for work that takes time: training runs, hyperparameter sweeps, deploy pipelines, long command chains. It is deliberately not a dumb polling loop. Every run has three phases: clarify, loop, exit.
When It Triggers
| Trigger | Details |
|---|---|
| Manual | "/monitor", "watch this run", "keep an eye on", "monitor training", "run this and make sure it finishes" |
| Contextual | User hands off a multi-step chain and wants Claude to supervise through completion |
Phase 1 — Clarify Requirements
Before any loop runs, the skill extracts a concrete success checklist, hard blockers (terminate), and debuggable soft failures (try to fix). It prints a reference card for the user:
┌─ /monitor plan ─────────────────────────────────────────────────────┐
│ Goal: <one line> │
│ Success criteria: │
│ [ ] <item 1> │
│ [ ] <item 2> │
│ Hard blockers (terminate loop): │
│ • <item> │
│ Debuggable failures (try to fix): │
│ • <item> │
│ Budget: up to N iterations, ~M min between checks │
└───────────────────────────────────────────────────────┘
The user confirms or edits the plan before the loop starts.
Phase 2 — The Loop
Each iteration runs the same four steps, logged to scratch/monitor/{YYYY-MM-DD}-{short-name}/log.md:
- Inspect: cheap reads of the authoritative progress source (tail logs,
nvidia-smi,squeue, metric endpoint). - Evaluate: mark each checklist item as met / not-yet / regressed.
- Decide: exit on success or hard blocker, dispatch a debugger on soft failure, otherwise schedule the next check via
ScheduleWakeup. - Check budget: if iterations ≥ max, stop and report status — never silently keep going.
Phase 3 — Exit Protocol
On exit (success, failure, or budget exhaustion), the skill always produces: outcome, final checklist marks, what was done, what's left, and suggested next step. If the run produced learnings, it recommends /retrospective.
Guardrails
| Guardrail | Why |
|---|---|
| No destructive writes inside the loop without per-action approval | The loop has authority to read and diagnose, not to rewrite live infra. |
| Never claim success just because nothing crashed | Evaluate against the confirmed checklist — silence is not victory. |
| Renegotiate if criteria turn out to be ill-formed | Don't march toward the wrong goal just because the user wrote it down once. |
| Keep per-iteration logs to 2–5 lines | The log is for audit, not narrative. |