agent operations skill risk: medium
Machine Learning Experiment Monitor
Instructs the model to monitor running experiments by identifying the backend, collecting screen outputs, JSON results, and W&B metrics, then summarizing results in a comparison ta…
- Human review
- External action: high
SKILL.md
---
name: monitor-experiment
description: "Monitor running experiments, check progress, collect results. Use when user says /\"check results/\", /\"is it done/\", /\"monitor/\", or wants experiment output."
---
# Monitor Experiment Results
Monitor: $ARGUMENTS
## Workflow
### Step 1: Check What's Running
First identify the backend from `AGENTS.md`, run notes, or launch summary: local, SSH, Vast.ai, or Modal. Monitor the backend that was actually used; do not assume a plain SSH screen session when the run was launched through Vast.ai or Modal.
```bash
ssh <server> "screen -ls"
```
For Vast.ai, also check instance state, SSH reachability, hourly cost, and whether `auto_destroy` is pending. For Modal, check the Modal run/app logs, function status, timeout, volume outputs, and cloud cost exposure.
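For the plain SSH case, the `screen -ls` listing can be turned into structured session records before any output is collected. The following is a minimal sketch, not part of the original skill: the helper name `parse_screen_ls` and the sample text are hypothetical, and the exact `screen -ls` layout varies between screen versions, so the pattern may need adjusting.

```python
import re

# Assumed line shape: "<pid>.<name> ... (<State>)", optionally with a date column in between.
_SESSION_RE = re.compile(r"^[ \t]*(\d+)\.(\S+)\s.*\((\w+)\)\s*$", re.MULTILINE)


def parse_screen_ls(output: str) -> list:
    """Return one dict per screen session found in `screen -ls` output."""
    return [
        {"pid": int(pid), "name": name, "state": state}
        for pid, name, state in _SESSION_RE.findall(output)
    ]


# Hypothetical sample output, for illustration only.
sample = (
    "There are screens on:\n"
    "\t12345.train_baseline\t(Detached)\n"
    "\t12346.train_method_a\t(Attached)\n"
    "2 Sockets in /run/screen/S-user.\n"
)
sessions = parse_screen_ls(sample)
print(sessions)
```

An empty list here means no screen sessions were found, which by itself says nothing about Vast.ai or Modal runs.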
### Step 2: Collect Output from Each Screen
For each screen session, capture the last N lines:
```bash
ssh <server> "screen -S <name> -X hardcopy /tmp/screen_<name>.txt && tail -50 /tmp/screen_<name>.txt"
```
If hardcopy fails, check for log files or tee output.
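When several sessions need to be captured, the command above can be generated per session with shell quoting applied to the session name. This is a sketch under assumptions: `capture_command` is a hypothetical helper, `gpu-box` is a placeholder host, and the default of 50 lines simply mirrors the `tail -50` used above.

```python
import shlex


def capture_command(server: str, session: str, lines: int = 50) -> list:
    """Build the ssh argv that hardcopies one screen session and tails the dump."""
    dump = f"/tmp/screen_{session}.txt"
    remote = (
        f"screen -S {shlex.quote(session)} -X hardcopy {shlex.quote(dump)}"
        f" && tail -n {int(lines)} {shlex.quote(dump)}"
    )
    # Returned as an argv list so it can be passed to subprocess.run without shell=True.
    return ["ssh", server, remote]


cmd = capture_command("gpu-box", "train_baseline")
print(cmd)
```

Passing `cmd` to `subprocess.run(cmd, capture_output=True, text=True)` would run it; a non-zero return code is the signal to fall back to log files or tee output.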
### Step 3: Check for JSON Result Files
```bash
ssh <server> "ls -lt <results_dir>/*.json 2>/dev/null | head -20"
```
If JSON results exist, fetch and parse them:
```bash
ssh <server> "cat <results_dir>/<latest>.json"
```
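Once a result file has been fetched, its numeric fields can be extracted for the summary table. The sketch below assumes the file holds a single top-level JSON object; the field names in the sample (`eval_accuracy`, `epochs`, `done`) are hypothetical, since the skill does not define a result schema.

```python
import json


def numeric_fields(json_text: str) -> dict:
    """Return the top-level numeric entries of a JSON result document."""
    data = json.loads(json_text)
    return {
        key: float(value)
        for key, value in data.items()
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }


# Hypothetical result payload, for illustration only.
sample = '{"experiment": "method_a", "eval_accuracy": 0.91, "epochs": 3, "done": true}'
metrics = numeric_fields(sample)
print(metrics)
```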
### Step 3.5: Pull W&B Metrics (when `wandb: true` in AGENTS.md)
If the project enables W&B, pull metrics before interpreting results. Prefer W&B as the source of training curves and recent eval state, while still checking logs for crashes.
List recent runs:
```bash
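python3 - <<'PY'
from itertools import islice
import wandb

api = wandb.Api()
# per_page is only the page size; islice caps the listing at the 20 newest runs.
runs = api.runs("<entity>/<project>", order="-created_at", per_page=20)
for run in islice(runs, 20):
    print(run.name, run.state, run.url)
PY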
```
Pull recent history for a specific run:
```bash
python3 - <<'PY'
import wandb

api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")
keys = ["train/loss", "eval/loss", "eval/accuracy", "train/lr"]
# pandas=False returns a list of dicts; iterating the default DataFrame yields column names.
for row in run.history(samples=50, keys=keys, pandas=False):
    print(row)
print("summary:", dict(run.summary))
PY
```
If W&B is configured but unavailable, report the connectivity problem and fall back to screen/log/json evidence. Do not interpret missing W&B data as experiment failure by itself.
Always include W&B dashboard links (`run.url`) when available so later review and paper-writing agents can inspect the exact training curves.
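The fallback rule above can be sketched as a small wrapper that tries W&B first and records a warning instead of a failure when it is unreachable. The function `collect_evidence` and both fetcher callables are hypothetical stand-ins; a real fetcher would wrap the `wandb.Api()` calls shown earlier.

```python
from typing import Callable


def collect_evidence(fetch_wandb: Callable[[], dict], fetch_logs: Callable[[], dict]) -> dict:
    """Prefer W&B; on any error, report the problem and use screen/log/json evidence."""
    try:
        return {"source": "wandb", "data": fetch_wandb(), "warning": None}
    except Exception as exc:  # network, auth, or API errors
        return {
            "source": "logs",
            "data": fetch_logs(),
            "warning": (
                f"W&B unavailable ({type(exc).__name__}: {exc}); "
                "not treated as an experiment failure"
            ),
        }


def broken_wandb() -> dict:
    # Simulates a connectivity problem for the demo.
    raise ConnectionError("timed out")


result = collect_evidence(broken_wandb, lambda: {"last_line": "step 1200/5000"})
print(result["source"], "|", result["warning"])
```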
### Step 4: Summarize Results
Present results in a comparison table:
```
| Experiment | Metric | Delta vs Baseline | Status |
|-----------|--------|-------------------|--------|
| Baseline | X.XX | — | done |
| Method A | X.XX | +Y.Y | done |
```
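One way to build this table programmatically is sketched below. The helper `comparison_table` is hypothetical, the numbers are placeholders rather than real results, and the two-decimal formatting is an assumption that should be adapted to the metric in question.

```python
def comparison_table(rows: list, baseline_name: str) -> str:
    """Render (name, value, status) rows as a Markdown table with deltas vs the baseline."""
    base = next(value for name, value, _ in rows if name == baseline_name)
    lines = [
        "| Experiment | Metric | Delta vs Baseline | Status |",
        "|------------|--------|-------------------|--------|",
    ]
    for name, value, status in rows:
        # The baseline row has no delta against itself.
        delta = "-" if name == baseline_name else f"{value - base:+.2f}"
        lines.append(f"| {name} | {value:.2f} | {delta} | {status} |")
    return "\n".join(lines)


# Placeholder values, for illustration only.
table = comparison_table(
    [("Baseline", 0.80, "done"), ("Method A", 0.85, "running")],
    baseline_name="Baseline",
)
print(table)
```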
### Step 5: Interpret
- Compare against known baselines
- Flag unexpected results (negative delta, NaN, divergence)
- Suggest next steps based on findings
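The flagging step above can be sketched as a simple check per experiment. This is illustrative only: `flag_anomalies` is a hypothetical helper, and whether a negative delta is bad depends on the metric, which is why the direction is an explicit parameter.

```python
import math


def flag_anomalies(name: str, value: float, baseline: float, higher_is_better: bool = True) -> list:
    """Return human-readable flags for NaN/inf metrics and regressions vs the baseline."""
    if math.isnan(value) or math.isinf(value):
        return [f"{name}: metric is {value}; check training logs for divergence"]
    delta = value - baseline
    worse = delta < 0 if higher_is_better else delta > 0
    if worse:
        return [f"{name}: worse than baseline (delta {delta:+.3f})"]
    return []


# Placeholder values, for illustration only.
print(flag_anomalies("Method A", float("nan"), 0.80))
print(flag_anomalies("Method B", 0.75, 0.80))
print(flag_anomalies("Method C", 0.85, 0.80))
```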
### Step 6: Feishu Notification (if configured)
After results are collected, check `~/.codex/feishu.json`:
- Send `experiment_done` notification: results summary table, delta vs baseline
- If config absent or mode `"off"`: skip entirely (no-op)
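The skip logic can be sketched as a gate that runs before any notification is attempted. Only the config path and the `"off"` mode come from the skill itself; `feishu_enabled`, the `"push"` mode value in the demo, and treating a missing `mode` key as off are assumptions. How the notification is actually sent is not specified here and is left out.

```python
import json
import tempfile
from pathlib import Path

DEFAULT_CONFIG = Path.home() / ".codex" / "feishu.json"


def feishu_enabled(config_path: Path = DEFAULT_CONFIG) -> bool:
    """Return False when the config is absent, unreadable, or has mode "off"."""
    try:
        config = json.loads(Path(config_path).read_text())
    except (OSError, ValueError):  # missing file, bad permissions, or invalid JSON
        return False
    # Assumption: a config without a "mode" key is treated like "off".
    return isinstance(config, dict) and config.get("mode", "off") != "off"


# Demo against temporary files rather than the real config.
with tempfile.TemporaryDirectory() as tmp:
    on_path = Path(tmp) / "on.json"
    off_path = Path(tmp) / "off.json"
    on_path.write_text('{"mode": "push"}')  # "push" is a hypothetical mode value
    off_path.write_text('{"mode": "off"}')
    enabled_on = feishu_enabled(on_path)
    enabled_off = feishu_enabled(off_path)
    enabled_missing = feishu_enabled(Path(tmp) / "missing.json")

print(enabled_on, enabled_off, enabled_missing)
```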
## Key Rules
- Always show raw numbers before interpretation
- Compare against the correct baseline (same config)
- Note if experiments are still running (check progress bars, iteration counts)
- If results look wrong, check training logs for errors before concluding
- Include backend cost/risk notes for long-running Vast.ai or Modal jobs
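To check whether a run is still in progress, the last iteration counter in the captured output can be extracted. This is a rough sketch: `latest_progress` is a hypothetical helper, the `current/total` pattern is an assumption about the log format, and it can also match unrelated text such as dates.

```python
import re
from typing import Optional, Tuple

# Matches counters such as "1200/5000" (also the form used by tqdm-style progress bars).
_PROGRESS_RE = re.compile(r"(\d+)\s*/\s*(\d+)")


def latest_progress(log_tail: str) -> Optional[Tuple[int, int]]:
    """Return the last plausible (current, total) counter in the log tail, if any."""
    pairs = [(int(cur), int(total)) for cur, total in _PROGRESS_RE.findall(log_tail)]
    valid = [(cur, total) for cur, total in pairs if total > 0 and cur <= total]
    return valid[-1] if valid else None


# Hypothetical log lines, for illustration only.
tail = "epoch 1 step 400/5000 loss 1.92\nepoch 1 step 1200/5000 loss 1.41\n"
progress = latest_progress(tail)
print(progress)
```

A counter below its total suggests the run is still going, but only if the log timestamp is also recent; a stalled job shows the same counter on repeated checks.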
INPUTS
- $ARGUMENTS (required): experiment identifier or filter passed to the monitor command
REQUIRED CONTEXT
- AGENTS.md or launch summary for backend
- $ARGUMENTS for monitor target
- wandb entity/project when enabled
OPTIONAL CONTEXT
- results_dir path
- feishu.json config
TOOLS REQUIRED
- ssh
- code_execution
ROLES & RULES
- Monitor the backend that was actually used; do not assume a plain SSH screen session when the run was launched through Vast.ai or Modal.
- If W&B is configured but unavailable, report the connectivity problem and fall back to screen/log/json evidence. Do not interpret missing W&B data as experiment failure by itself.
- Always include W&B dashboard links (run.url) when available
- Always show raw numbers before interpretation
- Compare against the correct baseline (same config)
- Note if experiments are still running (check progress bars, iteration counts)
- If results look wrong, check training logs for errors before concluding
- Include backend cost/risk notes for long-running Vast.ai or Modal jobs
EXPECTED OUTPUT
- Format: markdown
- Schema: table · Experiment, Metric, Delta vs Baseline, Status
- Constraints
  - always show raw numbers before interpretation
  - present results in comparison table
  - include W&B dashboard links when available
  - note backend cost/risk for Vast/Modal jobs
SUCCESS CRITERIA
- Identify the correct backend from AGENTS.md or launch artifacts
- Capture last N lines from each screen session
- Check for and parse JSON result files
- Pull W&B metrics when enabled
- Present results in comparison table
- Interpret against baselines and flag anomalies
- Send Feishu notification if configured
CAVEATS
- Dependencies
  - AGENTS.md
  - run notes
  - launch summary
  - ~/.codex/feishu.json
- Missing context
  - How AGENTS.md is structured or where it lives
  - Default or parameter value for N in screen hardcopy
  - How to obtain or configure wandb entity/project when not in AGENTS.md
- Ambiguities
  - "last N lines": N is never defined
  - Placeholders <server>, <name>, <results_dir>, <entity>/<project> are used without definition or substitution rules
  - Trigger phrase escaping in description looks malformed: /"check results/"
QUALITY
- OVERALL: 0.81
- CLARITY: 0.85
- SPECIFICITY: 0.88
- REUSABILITY: 0.78
- COMPLETENESS: 0.72
IMPROVEMENT SUGGESTIONS
- Replace "last N lines" with a concrete default (e.g., 100) or an explicit parameter
- Add a short "Inputs" section listing required variables (server, results_dir, wandb entity/project) and how they are supplied
- Make the W&B code blocks use variables instead of literal "<entity>/<project>" strings