agent operations skill risk: low
Codex Training Metrics Monitor
The prompt instructs the model to enter interactive watch mode that periodically checks WandB or fallback logs for training issues such as NaN values, divergence, spikes, and plate…
- External action: medium
SKILL 1 file
SKILL.md
---
name: training-check
description: "Interactively monitor training metrics from the current Codex session, periodically checking WandB or fallback logs for NaN, divergence, plateaus, and broken runs."
---

# Training Check

You are now in **interactive watch** mode (interactive training monitoring). Keep the current session open and report directly in the current terminal. The user is watching this terminal for updates.

By default, run a training health check every 30 minutes, output a concise but complete analysis report after each check, state the next check time, then continue monitoring.

This skill checks training **quality**, not basic process health. Process health checks such as whether a tmux session exists or whether the GPU is idle can be handled by watchdog-style tooling; this skill focuses on whether the run is still worth continuing.

## Inputs To Establish First

Before the first check, identify or ask for the minimum monitoring context:

- WandB run path or URL, if available.
- Fallback log path, SSH command, or local command for reading recent training logs.
- Training target, expected baseline, and key metrics that define success.
- How the training was launched, so it can be stopped if needed.
- Project notes path for recording decisions and evidence.

If a source is unavailable, say so clearly and continue with the available source. If both WandB and fallback logs are unreachable, report the connectivity issue, classify the round as `WAIT`, and check again later. Do not infer that training is bad only because data is unreachable.

## Per-Round Check

Every round, read WandB first when configured. If WandB is unreachable, read the fallback logs.

Inspect at least:

- Training loss trend over recent checkpoints or steps.
- Eval metrics and whether they improve, flatten, or degrade against baseline.
- NaN or Inf in loss, gradients, activations, or logged metrics.
- Sudden loss spikes, divergence, or repeated failed evaluations.
- Learning rate schedule behavior.
- Gradient norm, if logged.
- Plateau patterns that suggest the run is no longer useful.

Output one report in the current terminal with this structure:

```text
## Training Check - <local timestamp>
- Data source: wandb_ok | log_fallback | unreachable
- Run: <wandb run or training identifier>
- Recent metrics: <loss/eval/lr/grad summary>
- Anomalies: <NaN/Inf/spike/divergence/plateau findings>
- Evidence: <WandB URL, log lines, metric values, or files inspected>
- Decision: CONTINUE | WAIT | STOP
- Reason: <why this decision is justified>
- Next check: <local timestamp, normally 30 minutes later unless ending>
```

Use the decisions as follows:

| Decision | Meaning | Action |
|----------|---------|--------|
| `CONTINUE` | Run looks healthy enough to keep training. | Keep monitoring and check again in 30 minutes. |
| `WAIT` | Evidence is inconclusive, noisy, too early, or temporarily unreachable. | Do not stop training; keep monitoring and check again later. |
| `STOP` | Training is clearly problematic or no longer worth continuing. | Stop the training task, save evidence, write notes, output final summary, and end monitoring. |

## Stop Behavior

When the decision is `STOP`:

- Stop the training task.
  - If the context contains `stop_command`, run `stop_command` first.
  - If no `stop_command` is available, choose the appropriate stop action from how the training was launched, such as stopping the relevant tmux session, local process, remote process, scheduler job, or notebook job.
- Save evidence: WandB URL, key metrics, relevant log snippets, files inspected, and the reason for stopping.
- Append a project note for debugging and future analysis.
- Output `FINAL_SUMMARY` in the terminal.
- End the interactive monitoring loop.

Never stop on the first sign of ordinary metric noise. Look for sustained trends, hard failures, or clear divergence. Always preserve enough evidence for a later agent or human to understand why the run was stopped.

## Interactive Loop Guidance

- The normal interval is 30 minutes.
- If a round is `CONTINUE`, announce the next check time and wait until then.
- If a round is `WAIT`, explain what evidence is missing or noisy and check again later. Use a shorter interval only when the run looks suspicious but not yet stop-worthy.
- If an anomaly recovers, say so explicitly and continue monitoring.
- Keep the user-facing report short enough to read in a terminal, but include concrete metric values and evidence paths.
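
The loop guidance above can be sketched as a small scheduling helper. This is a minimal illustration, not part of the skill: the function name `next_check_time`, the `suspicious` flag, and the 10-minute shorter interval are assumptions, since the skill only fixes the normal 30-minute interval.

```python
from datetime import datetime, timedelta

NORMAL_INTERVAL = timedelta(minutes=30)      # stated in the skill
SUSPICIOUS_INTERVAL = timedelta(minutes=10)  # assumption: the skill gives no number


def next_check_time(decision, now, suspicious=False):
    """Return the next check time, or None when monitoring ends."""
    if decision == "STOP":
        return None  # STOP ends the interactive loop
    if decision == "WAIT" and suspicious:
        # Shorter interval only for a suspicious but not stop-worthy run.
        return now + SUSPICIOUS_INTERVAL
    if decision in ("CONTINUE", "WAIT"):
        return now + NORMAL_INTERVAL
    raise ValueError(f"unknown decision: {decision!r}")
```

Returning `None` for `STOP` lets the caller treat "no next check" as the signal to leave the loop.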
REQUIRED CONTEXT
- WandB run path or URL
- fallback log path or command
- training target and key metrics
- how training was launched
- project notes path
ROLES & RULES
Role assignments
- You are now in **interactive watch** mode (interactive training monitoring).
- Keep the current session open and report directly in the current terminal.
- By default, run a training health check every 30 minutes.
- Output a concise but complete analysis report after each check.
- State the next check time, then continue monitoring.
- This skill checks training quality, not basic process health.
- Before the first check, identify or ask for the minimum monitoring context.
- If a source is unavailable, say so clearly and continue with the available source.
- If both WandB and fallback logs are unreachable, report the connectivity issue, classify the round as `WAIT`, and check again later.
- Do not infer that training is bad only because data is unreachable.
- Every round, read WandB first when configured.
- If WandB is unreachable, read the fallback logs.
- Inspect at least the listed training metrics and patterns.
- Output one report in the current terminal with the exact specified structure.
- Use the decisions CONTINUE | WAIT | STOP exactly as defined in the table.
- When the decision is `STOP`, perform all listed stop actions in order.
- Never stop on the first sign of ordinary metric noise.
- Look for sustained trends, hard failures, or clear divergence.
- Always preserve enough evidence for a later agent or human.
- If a round is `CONTINUE`, announce the next check time and wait until then.
- If a round is `WAIT`, explain what evidence is missing or noisy.
- If an anomaly recovers, say so explicitly and continue monitoring.
- Keep the user-facing report short enough to read in a terminal, but include concrete metric values and evidence paths.
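
The source-ordering rules above (WandB first, fallback logs second, `WAIT` when neither is reachable) might look like the following sketch. The function `read_metrics` and its two reader callables are hypothetical; how each reader actually fetches data (WandB API, SSH, local file) is left to the caller.

```python
from typing import Callable, Optional


def read_metrics(read_wandb: Optional[Callable[[], dict]],
                 read_logs: Optional[Callable[[], dict]]):
    """Try WandB first, then fallback logs.

    Returns (data_source, metrics, errors), where data_source is one of the
    labels used in the report: wandb_ok, log_fallback, or unreachable.
    """
    errors = []
    sources = (("wandb", "wandb_ok", read_wandb),
               ("logs", "log_fallback", read_logs))
    for name, label, reader in sources:
        if reader is None:
            continue  # source not configured
        try:
            return label, reader(), errors
        except Exception as exc:
            # Broad on purpose in this sketch: any failure means
            # "record it and try the next source".
            errors.append(f"{name}: {exc}")
    # Unreachable data is a reason to WAIT, never evidence of bad training.
    return "unreachable", None, errors
```

The collected `errors` list can be quoted in the report so the connectivity problem is stated clearly.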
EXPECTED OUTPUT
- Format
- structured_report
- Schema
- text_template · heading `## Training Check - <local timestamp>`, then one bullet each for `Data source`, `Run`, `Recent metrics`, `Anomalies`, `Evidence`, `Decision`, `Reason`, `Next check`
- Constraints
- use exact report template with sections Data source, Run, Recent metrics, Anomalies, Evidence, Decision, Reason, Next check
- output FINAL_SUMMARY and stop loop on STOP decision
- report in current terminal
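
As an illustration of the report constraints above, a renderer for the template could be sketched like this. The function name `render_report`, its parameters, and the timestamp format are assumptions; only the heading and the eight field labels come from the skill.

```python
from datetime import datetime


def render_report(now, data_source, run, recent_metrics, anomalies,
                  evidence, decision, reason, next_check):
    """Render one training-check report in the skill's template order."""
    fmt = "%Y-%m-%d %H:%M"  # assumption: the skill does not fix a format
    if next_check is None:
        next_line = "none (monitoring ended)"
    else:
        next_line = next_check.strftime(fmt)
    return "\n".join([
        f"## Training Check - {now.strftime(fmt)}",
        f"- Data source: {data_source}",
        f"- Run: {run}",
        f"- Recent metrics: {recent_metrics}",
        f"- Anomalies: {anomalies}",
        f"- Evidence: {evidence}",
        f"- Decision: {decision}",
        f"- Reason: {reason}",
        f"- Next check: {next_line}",
    ])
```

Keeping the field order in one place makes it harder for a round to drift from the exact structure the skill requires.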
SUCCESS CRITERIA
- Identify or ask for minimum monitoring context before first check.
- Read WandB first, then fallback logs.
- Inspect loss trends, eval metrics, NaNs, spikes, LR, grad norms, and plateaus.
- Output report using the exact structure every round.
- Choose CONTINUE | WAIT | STOP according to the decision table.
- On STOP, execute stop actions, save evidence, append project note, output FINAL_SUMMARY, and end loop.
- Never stop on ordinary metric noise; preserve evidence.
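
The `STOP` sequence in the criteria above can be sketched as follows. Everything here is illustrative: `stop_training`, the note layout, and the wording of the `FINAL_SUMMARY` line are assumptions, and the sketch covers only the case where a `stop_command` string is available. Choosing a stop action from the launch method is left to the operator.

```python
import shlex
import subprocess
from datetime import datetime
from pathlib import Path


def stop_training(stop_command, notes_path, evidence, reason):
    """Run stop_command, append evidence to the project notes, return a summary."""
    # 1. Stop the training task first.
    result = subprocess.run(shlex.split(stop_command),
                            capture_output=True, text=True)
    # 2. Save evidence and the reason in an appended project note.
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M")
    note_lines = [f"## Training stopped - {stamp}",
                  f"- Reason: {reason}",
                  f"- Stop command exit code: {result.returncode}"]
    note_lines += [f"- Evidence: {item}" for item in evidence]
    with Path(notes_path).open("a", encoding="utf-8") as fh:
        fh.write("\n".join(note_lines) + "\n\n")
    # 3. Hand back the final summary line for the terminal.
    return f"FINAL_SUMMARY: stopped ({reason}); exit code {result.returncode}"
```

Opening the notes file in append mode preserves earlier notes, so a later agent or human can still see the full history.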
FAILURE MODES
- May classify round as WAIT when data is unreachable.
- May use shorter interval only when run looks suspicious but not yet stop-worthy.
CAVEATS
- Dependencies
- WandB run path or URL
- Fallback log path, SSH command, or local command
- Training target, expected baseline, and key metrics
- How the training was launched
- Project notes path
- stop_command if present
- Missing context
- Preferred shorter interval length when a run is suspicious
- How to handle multiple concurrent training runs
- Ambiguities
- Does not define quantitative thresholds for anomalies such as 'sudden loss spikes' or 'plateau patterns'.
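
Because the skill leaves these thresholds undefined, any concrete check has to pick its own. The sketch below shows one possible way to flag non-finite values, a spike, and a plateau in a list of recent loss values. The function `find_anomalies` and all three default thresholds are hypothetical, and the spike test assumes positive loss values.

```python
import math
from statistics import mean


def find_anomalies(losses, spike_factor=3.0, plateau_window=5, plateau_tol=1e-3):
    """Return anomaly labels for a sequence of recent loss values.

    All thresholds are hypothetical defaults, not values from the skill.
    """
    # Hard failure: any NaN or Inf.
    if any(not math.isfinite(x) for x in losses):
        return ["nan_or_inf"]
    found = []
    # Spike: the latest value far above the mean of the earlier ones.
    if len(losses) >= 2:
        baseline = mean(losses[:-1])
        if losses[-1] > spike_factor * baseline:
            found.append("spike")
    # Plateau: the last few values barely move.
    if len(losses) >= plateau_window:
        recent = losses[-plateau_window:]
        if max(recent) - min(recent) < plateau_tol:
            found.append("plateau")
    return found
```

A single `spike` label from one round is not a reason to stop; the skill asks for sustained trends, hard failures, or clear divergence before choosing `STOP`.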
QUALITY
- OVERALL
- 0.88
- CLARITY
- 0.90
- SPECIFICITY
- 0.92
- REUSABILITY
- 0.85
- COMPLETENESS
- 0.88
IMPROVEMENT SUGGESTIONS
- Add an optional 'thresholds' input section listing numeric cutoffs for loss spike, plateau length, and gradient norm.