agent operations skill risk: low

Codex Training Metrics Monitor

The prompt instructs the model to enter an interactive watch mode that periodically checks WandB or fallback logs for training issues such as NaN values, divergence, spikes, and plate…

External action: medium

SKILL 1 file

SKILL.md


---
name: training-check
description: "Interactively monitor training metrics from the current Codex session, periodically checking WandB or fallback logs for NaN, divergence, plateaus, and broken runs."
---
# Training Check

You are now in **interactive watch** mode (interactive training monitoring).

Keep the current session open and report directly in the current terminal. The user is watching this terminal for updates. By default, run a training health check every 30 minutes, output a concise but complete analysis report after each check, state the next check time, then continue monitoring.

This skill checks training **quality**, not basic process health. Process health checks such as whether a tmux session exists or whether the GPU is idle can be handled by watchdog-style tooling; this skill focuses on whether the run is still worth continuing.

## Inputs To Establish First

Before the first check, identify or ask for the minimum monitoring context:

- WandB run path or URL, if available.
- Fallback log path, SSH command, or local command for reading recent training logs.
- Training target, expected baseline, and key metrics that define success.
- How the training was launched, so it can be stopped if needed.
- Project notes path for recording decisions and evidence.

If a source is unavailable, say so clearly and continue with the available source. If both WandB and fallback logs are unreachable, report the connectivity issue, classify the round as `WAIT`, and check again later. Do not infer that training is bad only because data is unreachable.
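
The source-selection rule above can be sketched as follows. This is a minimal illustration, not part of the skill itself: `read_metrics` and the two reader callables are hypothetical names, and each reader is assumed to return a list of recent metric rows or raise when its source is unreachable.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical reader type: returns recent metric rows, raises if unreachable.
Reader = Callable[[], List]


def read_metrics(wandb_reader: Optional[Reader],
                 log_reader: Optional[Reader]) -> Tuple[str, List]:
    """Try WandB first, then the fallback logs; never raise on connectivity."""
    if wandb_reader is not None:
        try:
            return "wandb_ok", wandb_reader()
        except Exception:
            pass  # WandB unreachable: fall through to the fallback logs
    if log_reader is not None:
        try:
            return "log_fallback", log_reader()
        except Exception:
            pass  # fallback logs unreachable as well
    # Neither source is reachable: the caller classifies the round as WAIT
    # and must not conclude that training itself is bad.
    return "unreachable", []
```

The returned label matches the `Data source` values used in the report, so the caller can pass it through unchanged.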

## Per-Round Check

Every round, read WandB first when configured. If WandB is unreachable, read the fallback logs. Inspect at least:

- Training loss trend over recent checkpoints or steps.
- Eval metrics and whether they improve, flatten, or degrade against baseline.
- NaN or Inf in loss, gradients, activations, or logged metrics.
- Sudden loss spikes, divergence, or repeated failed evaluations.
- Learning rate schedule behavior.
- Gradient norm, if logged.
- Plateau patterns that suggest the run is no longer useful.
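
Some of the checks above can be approximated with simple heuristics over a recent loss series. The sketch below is illustrative only: `find_anomalies` is a hypothetical helper, and every threshold (`spike_factor`, `plateau_window`, `plateau_tol`) is an assumption, since this skill does not define numeric cutoffs.

```python
import math
from statistics import median


def find_anomalies(losses, spike_factor=3.0, plateau_window=5, plateau_tol=1e-3):
    """Return anomaly labels for a recent series of training losses.

    All thresholds are illustrative assumptions, not values from the skill.
    """
    findings = []
    # Hard failure: any NaN or Inf in the logged values.
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        findings.append("nan_or_inf")
    finite = [x for x in losses if math.isfinite(x)]
    # Spike: the latest value is far above the median of the earlier values.
    if len(finite) >= 3:
        baseline = median(finite[:-1])
        if baseline > 0 and finite[-1] > spike_factor * baseline:
            findings.append("spike")
    # Plateau: the last few values barely move.
    if len(finite) >= plateau_window:
        recent = finite[-plateau_window:]
        if max(recent) - min(recent) < plateau_tol:
            findings.append("plateau")
    return findings
```

A single `spike` or `plateau` label from one round is a prompt for closer inspection, not a verdict; the decision still depends on whether the pattern is sustained.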

Output one report in the current terminal with this structure:

```text
## Training Check - <local timestamp>

- Data source: wandb_ok | log_fallback | unreachable
- Run: <wandb run or training identifier>
- Recent metrics: <loss/eval/lr/grad summary>
- Anomalies: <NaN/Inf/spike/divergence/plateau findings>
- Evidence: <WandB URL, log lines, metric values, or files inspected>
- Decision: CONTINUE | WAIT | STOP
- Reason: <why this decision is justified>
- Next check: <local timestamp, normally 30 minutes later unless ending>
```
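
One way to keep the report structure identical every round is to fill the template from a small function. This is a hedged sketch: `render_report`, the timestamp format, and the wording used for `Next check` on a `STOP` round are all assumptions.

```python
from datetime import datetime, timedelta

VALID_DECISIONS = {"CONTINUE", "WAIT", "STOP"}
TIME_FORMAT = "%Y-%m-%d %H:%M"  # assumed local timestamp format


def render_report(now, source, run, metrics, anomalies, evidence,
                  decision, reason, interval_minutes=30):
    """Fill the report template for one monitoring round."""
    if decision not in VALID_DECISIONS:
        raise ValueError(f"unknown decision: {decision}")
    if decision == "STOP":
        next_check = "none, monitoring ends"  # assumed wording
    else:
        next_check = (now + timedelta(minutes=interval_minutes)).strftime(TIME_FORMAT)
    lines = [
        f"## Training Check - {now.strftime(TIME_FORMAT)}",
        "",
        f"- Data source: {source}",
        f"- Run: {run}",
        f"- Recent metrics: {metrics}",
        f"- Anomalies: {anomalies}",
        f"- Evidence: {evidence}",
        f"- Decision: {decision}",
        f"- Reason: {reason}",
        f"- Next check: {next_check}",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    # Example values are placeholders, not real run data.
    print(render_report(datetime.now(), "wandb_ok", "<run id>", "<summary>",
                        "none", "<evidence>", "CONTINUE", "<reason>"))
```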

Use the decisions as follows:

| Decision | Meaning | Action |
|----------|---------|--------|
| `CONTINUE` | Run looks healthy enough to keep training. | Keep monitoring and check again in 30 minutes. |
| `WAIT` | Evidence is inconclusive, noisy, too early, or temporarily unreachable. | Do not stop training; keep monitoring and check again later. |
| `STOP` | Training is clearly problematic or no longer worth continuing. | Stop the training task, save evidence, write notes, output final summary, and end monitoring. |
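
The table can be read as a small decision function. The sketch below is one possible interpretation, not a rule set defined by this skill: `decide` is a hypothetical helper, treating NaN/Inf as an immediate hard failure is an assumption, and so is requiring an anomaly to persist for `min_sustained` rounds before stopping.

```python
def decide(source, anomalies, sustained_rounds=0, min_sustained=2):
    """Map one round's findings to CONTINUE, WAIT, or STOP.

    source           -- "wandb_ok", "log_fallback", or "unreachable"
    anomalies        -- list of anomaly labels found this round
    sustained_rounds -- consecutive rounds in which anomalies were seen
    min_sustained    -- assumed number of rounds before a trend counts
    """
    if source == "unreachable":
        return "WAIT"  # missing data is never evidence of a bad run
    if "nan_or_inf" in anomalies:
        return "STOP"  # assumption: NaN/Inf counts as a hard failure
    if anomalies:
        # Ordinary noise on a single round is not enough to stop.
        return "STOP" if sustained_rounds >= min_sustained else "WAIT"
    return "CONTINUE"
```

For example, a single loss spike yields `WAIT`, while the same spike seen across two or more consecutive rounds yields `STOP` under these assumed defaults.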

## Stop Behavior

When the decision is `STOP`:

- Stop the training task.
- If the context contains `stop_command`, run `stop_command` first.
- If no `stop_command` is available, choose the appropriate stop action from how the training was launched, such as stopping the relevant tmux session, local process, remote process, scheduler job, or notebook job.
- Save evidence: WandB URL, key metrics, relevant log snippets, files inspected, and the reason for stopping.
- Append a project note for debugging and future analysis.
- Output `FINAL_SUMMARY` in the terminal.
- End the interactive monitoring loop.

Never stop on the first sign of ordinary metric noise. Look for sustained trends, hard failures, or clear divergence. Always preserve enough evidence for a later agent or human to understand why the run was stopped.
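
The stop sequence might look like the sketch below, assuming `stop_command` is a shell command string and the project notes are a plain text file. `stop_and_record` and its note layout are hypothetical; the `run` parameter exists only so the command runner can be swapped out when testing.

```python
import subprocess
from datetime import datetime
from pathlib import Path


def stop_and_record(stop_command, notes_path, reason, evidence, run=subprocess.run):
    """Run the configured stop command, then append an evidence note.

    Returns True if a stop command ran and exited with code 0. When no
    stop_command is configured, nothing is executed here and the caller
    must pick a stop action based on how training was launched.
    """
    stopped = False
    if stop_command:
        result = run(stop_command, shell=True, capture_output=True, text=True)
        stopped = result.returncode == 0
    # Append, never overwrite, so earlier notes stay intact.
    note_lines = [
        "",
        f"## Training stopped - {datetime.now().strftime('%Y-%m-%d %H:%M')}",
        f"- Reason: {reason}",
        f"- Stop command: {stop_command or 'none configured'}",
        f"- Stop command succeeded: {stopped}",
        "- Evidence:",
    ]
    note_lines += [f"  - {item}" for item in evidence]
    with Path(notes_path).open("a", encoding="utf-8") as fh:
        fh.write("\n".join(note_lines) + "\n")
    return stopped
```

Writing the note even when the stop command fails keeps the evidence available for whoever has to finish stopping the run by hand.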

## Interactive Loop Guidance

- The normal interval is 30 minutes.
- If a round is `CONTINUE`, announce the next check time and wait until then.
- If a round is `WAIT`, explain what evidence is missing or noisy and check again later. Use a shorter interval only when the run looks suspicious but not yet stop-worthy.
- If an anomaly recovers, say so explicitly and continue monitoring.
- Keep the user-facing report short enough to read in a terminal, but include concrete metric values and evidence paths.
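
The interval rules above reduce to a small scheduling helper. In this sketch `next_check_time` is a hypothetical name and the 10-minute short interval is an assumption; the skill only fixes the normal interval at 30 minutes.

```python
from datetime import datetime, timedelta


def next_check_time(now, decision, suspicious=False,
                    normal_minutes=30, short_minutes=10):
    """Return the next check time, or None when monitoring ends.

    short_minutes is an assumed value; the skill does not specify one.
    """
    if decision == "STOP":
        return None  # the loop ends after the final summary
    # Shorten the interval only for a suspicious but not stop-worthy WAIT.
    use_short = decision == "WAIT" and suspicious
    minutes = short_minutes if use_short else normal_minutes
    return now + timedelta(minutes=minutes)
```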

REQUIRED CONTEXT

WandB run path or URL
fallback log path or command
training target and key metrics
how training was launched
project notes path

ROLES & RULES

Role assignments

You are now in **interactive watch** mode (interactive training monitoring).

Keep the current session open and report directly in the current terminal.
By default, run a training health check every 30 minutes.
Output a concise but complete analysis report after each check.
State the next check time, then continue monitoring.
This skill checks training quality, not basic process health.
Before the first check, identify or ask for the minimum monitoring context.
If a source is unavailable, say so clearly and continue with the available source.
If both WandB and fallback logs are unreachable, report the connectivity issue, classify the round as `WAIT`, and check again later.
Do not infer that training is bad only because data is unreachable.
Every round, read WandB first when configured.
If WandB is unreachable, read the fallback logs.
Inspect at least the listed training metrics and patterns.
Output one report in the current terminal with the exact specified structure.
Use the decisions CONTINUE | WAIT | STOP exactly as defined in the table.
When the decision is `STOP`, perform all listed stop actions in order.
Never stop on the first sign of ordinary metric noise.
Look for sustained trends, hard failures, or clear divergence.
Always preserve enough evidence for a later agent or human.
If a round is `CONTINUE`, announce the next check time and wait until then.
If a round is `WAIT`, explain what evidence is missing or noisy.
If an anomaly recovers, say so explicitly and continue monitoring.
Keep the user-facing report short enough to read in a terminal, but include concrete metric values and evidence paths.

EXPECTED OUTPUT

Format

structured_report

Schema

text_template · ## Training Check - <local timestamp>, - Data source:, - Run:, - Recent metrics:, - Anomalies:, - Evidence:, - Decision:, - Reason:, - Next check:

Constraints

use exact report template with sections Data source, Run, Recent metrics, Anomalies, Evidence, Decision, Reason, Next check
output FINAL_SUMMARY and stop loop on STOP decision
report in current terminal

SUCCESS CRITERIA

Identify or ask for minimum monitoring context before first check.
Read WandB first, then fallback logs.
Inspect loss trends, eval metrics, NaNs, spikes, LR, grad norms, and plateaus.
Output report using the exact structure every round.
Choose CONTINUE | WAIT | STOP according to the decision table.
On STOP, execute stop actions, save evidence, append project note, output FINAL_SUMMARY, and end loop.
Never stop on ordinary metric noise; preserve evidence.

FAILURE MODES

May classify round as WAIT when data is unreachable.
May use shorter interval only when run looks suspicious but not yet stop-worthy.

CAVEATS

Dependencies

WandB run path or URL
Fallback log path, SSH command, or local command
Training target, expected baseline, and key metrics
How the training was launched
Project notes path
stop_command if present

Missing context

Preferred shorter interval length when a run is suspicious
How to handle multiple concurrent training runs

Ambiguities

Does not define quantitative thresholds for anomalies such as 'sudden loss spikes' or 'plateau patterns'.

QUALITY

OVERALL: 0.88
CLARITY: 0.90
SPECIFICITY: 0.92
REUSABILITY: 0.85
COMPLETENESS: 0.88

IMPROVEMENT SUGGESTIONS

Add an optional 'thresholds' input section listing numeric cutoffs for loss spike, plateau length, and gradient norm.
