agent operations skill risk: medium

Machine Learning Experiment Monitor

Instructs the model to monitor running experiments by identifying the backend, collecting screen outputs, JSON results, and W&B metrics, then summarizing results in a comparison ta…

Human review
External action: high

SKILL 1 file

SKILL.md

---
name: monitor-experiment
description: "Monitor running experiments, check progress, collect results. Use when user says /\"check results/\", /\"is it done/\", /\"monitor/\", or wants experiment output."
---
# Monitor Experiment Results

Monitor: $ARGUMENTS

## Workflow

### Step 1: Check What's Running

First identify the backend from `AGENTS.md`, run notes, or launch summary: local, SSH, Vast.ai, or Modal. Monitor the backend that was actually used; do not assume a plain SSH screen session when the run was launched through Vast.ai or Modal.

```bash
ssh <server> "screen -ls"
```

For Vast.ai, also check instance state, SSH reachability, hourly cost, and whether `auto_destroy` is pending. For Modal, check the Modal run/app logs, function status, timeout, volume outputs, and cloud cost exposure.
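
For SSH-launched runs, the `screen -ls` listing can be turned into a list of session names to iterate over in the next step. This is a minimal sketch, assuming the usual `<pid>.<name>` line layout; the helper name `parse_screen_ls` and the sample listing are made up for illustration, and the exact output can differ between screen builds.

```python
import re
from typing import Dict, List

# Matches lines such as "\t12345.train_a\t(Detached)". Some screen builds print
# a date column between the name and the state; the greedy ".*" skips over it.
SESSION_RE = re.compile(r"^\s*(\d+)\.(\S+)\s+.*\(([^()]+)\)\s*$")


def parse_screen_ls(output: str) -> List[Dict[str, object]]:
    """Extract pid, session name, and state from `screen -ls` output."""
    sessions = []
    for line in output.splitlines():
        match = SESSION_RE.match(line)
        if match:
            pid, name, state = match.groups()
            sessions.append({"pid": int(pid), "name": name, "state": state})
    return sessions


# Hypothetical listing, not taken from a real server.
sample = (
    "There are screens on:\n"
    "\t12345.train_baseline\t(Detached)\n"
    "\t12399.train_method_a\t(01/02/2025 10:00:00 AM)\t(Attached)\n"
    "2 Sockets in /run/screen/S-user.\n"
)
sessions = parse_screen_ls(sample)
for session in sessions:
    print(session)
```

An empty list only means no screen sessions were found on that server. For Vast.ai or Modal runs it says nothing about whether the job is alive.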

### Step 2: Collect Output from Each Screen
For each screen session, capture the last N lines:
```bash
# -h includes the scrollback buffer; without it, hardcopy dumps only the visible window
ssh <server> "screen -S <name> -X hardcopy -h /tmp/screen_<name>.txt && tail -50 /tmp/screen_<name>.txt"
```

If hardcopy fails, check for log files or tee output.

### Step 3: Check for JSON Result Files
```bash
ssh <server> "ls -lt <results_dir>/*.json 2>/dev/null | head -20"
```

If JSON results exist, fetch and parse them:
```bash
ssh <server> "cat <results_dir>/<latest>.json"
```
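
Once result files have been copied to a local directory, picking and parsing the newest one can be sketched as follows. The helper name `load_latest_result`, the file names, and the metric values are illustrative assumptions; the real result schema is whatever the experiment script writes.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def load_latest_result(results_dir: str) -> Optional[Tuple[str, dict]]:
    """Return (file name, parsed JSON) for the newest readable *.json file."""
    files = sorted(
        Path(results_dir).glob("*.json"),
        key=lambda p: p.stat().st_mtime,
        reverse=True,
    )
    for path in files:
        try:
            return path.name, json.loads(path.read_text())
        except json.JSONDecodeError:
            # A run that is still writing can leave a partial file; try the next one.
            continue
    return None


# Demo with made-up files: the newest file is truncated, so the older one wins.
with tempfile.TemporaryDirectory() as tmp:
    old, new = Path(tmp) / "run_old.json", Path(tmp) / "run_new.json"
    old.write_text(json.dumps({"eval/accuracy": 0.5}))
    new.write_text('{"eval/accuracy": 0.')
    os.utime(old, (1_000, 1_000))
    os.utime(new, (2_000, 2_000))
    latest = load_latest_result(tmp)
print(latest)
```

Skipping a file that fails to parse keeps a half-written result from being reported as a final number. Such a file is also a hint that the run may still be in progress.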

### Step 3.5: Pull W&B Metrics (when `wandb: true` in AGENTS.md)

If the project enables W&B, pull metrics before interpreting results. Prefer W&B as the source of training curves and recent eval state, while still checking logs for crashes.

List recent runs:

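```bash
python3 - <<'PY'
import itertools
import wandb

api = wandb.Api()
# order="-created_at" sorts newest first; islice stops after 20 runs
# (per_page only sets the page size, not the total number returned)
runs = api.runs("<entity>/<project>", order="-created_at", per_page=20)
for run in itertools.islice(runs, 20):
    print(run.name, run.state, run.url)
PY
```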

Pull recent history for a specific run:

```bash
python3 - <<'PY'
import wandb

api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")
keys = ["train/loss", "eval/loss", "eval/accuracy", "train/lr"]
# pandas=False returns a list of dicts; the default DataFrame would iterate over column names
for row in run.history(samples=50, keys=keys, pandas=False):
    print(row)
print("summary:", dict(run.summary))
PY
```

If W&B is configured but unavailable, report the connectivity problem and fall back to screen/log/json evidence. Do not interpret missing W&B data as experiment failure by itself.

Always include W&B dashboard links (`run.url`) when available so later review and paper-writing agents can inspect the exact training curves.
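
The fallback rule above can be sketched as a small wrapper that tries W&B first and records a warning instead of declaring failure. The function name `collect_metrics` and the two callables are hypothetical; in practice the first would wrap the `wandb.Api()` calls and the second would read screen, log, or JSON evidence.

```python
from typing import Callable


def collect_metrics(fetch_wandb: Callable[[], dict], fetch_logs: Callable[[], dict]) -> dict:
    """Prefer W&B; on any error, report it and fall back to log evidence."""
    try:
        return {"source": "wandb", "metrics": fetch_wandb(), "warning": None}
    except Exception as exc:  # network errors, auth errors, missing package
        warning = (
            f"W&B unavailable ({type(exc).__name__}: {exc}); "
            "using screen/log/JSON evidence"
        )
        return {"source": "logs", "metrics": fetch_logs(), "warning": warning}


def broken_wandb() -> dict:
    # Stand-in for a real outage; a real fetcher would call wandb.Api().
    raise ConnectionError("could not reach the W&B API")


# The fallback metric value is made up for the demo.
report = collect_metrics(broken_wandb, lambda: {"eval/loss": 0.42})
print(report["source"], "|", report["warning"])
```

Keeping the warning separate from the metrics lets the final summary state the connectivity problem without marking the experiment itself as failed.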

### Step 4: Summarize Results

Present results in a comparison table:
```
| Experiment | Metric | Delta vs Baseline | Status |
|-----------|--------|-------------------|--------|
| Baseline  | X.XX   | —                 | done   |
| Method A  | X.XX   | +Y.Y              | done   |
```
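
A table in this shape could be generated from collected metrics roughly as below. This is a sketch under assumptions: `comparison_table` is a hypothetical helper, there is a single metric per experiment, and the numbers are invented for the demo.

```python
from typing import Dict


def comparison_table(results: Dict[str, dict], baseline: str = "Baseline") -> str:
    """Render a markdown comparison table with deltas against the baseline row."""
    base = results[baseline]["metric"]
    lines = [
        "| Experiment | Metric | Delta vs Baseline | Status |",
        "|------------|--------|-------------------|--------|",
    ]
    for name, entry in results.items():
        metric = entry["metric"]
        if metric is None:
            shown, delta = "pending", "pending"  # run has not reported yet
        elif name == baseline:
            shown, delta = f"{metric:.2f}", "n/a"
        else:
            shown, delta = f"{metric:.2f}", f"{metric - base:+.2f}"
        lines.append(f"| {name} | {shown} | {delta} | {entry['status']} |")
    return "\n".join(lines)


# Made-up numbers, for illustration only.
demo = {
    "Baseline": {"metric": 71.30, "status": "done"},
    "Method A": {"metric": 72.80, "status": "done"},
    "Method B": {"metric": None, "status": "running"},
}
table = comparison_table(demo)
print(table)
```

Unfinished runs stay in the table with a `pending` marker, so the reader can see which comparisons are still incomplete.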

### Step 5: Interpret
- Compare against known baselines
- Flag unexpected results (negative delta, NaN, divergence)
- Suggest next steps based on findings
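
The flagging step above might look like the following sketch. It assumes a higher-is-better metric. It also uses a made-up divergence rule, loss strictly increasing over the last `window` steps, which should be replaced by whatever criterion fits the project.

```python
import math
from typing import List, Sequence


def flag_anomalies(
    metric: float, baseline: float, losses: Sequence[float], window: int = 3
) -> List[str]:
    """Return labels for NaN values, a negative delta, or a rising loss trend."""
    flags = []
    if math.isnan(metric) or any(math.isnan(x) for x in losses):
        flags.append("NaN")
    if not math.isnan(metric) and metric < baseline:
        flags.append("negative delta")
    recent = [x for x in losses[-(window + 1):] if not math.isnan(x)]
    # Hypothetical rule: each of the last `window` steps is worse than the one before.
    if len(recent) == window + 1 and all(b > a for a, b in zip(recent, recent[1:])):
        flags.append("divergence")
    return flags


# Invented values for the demo.
print(flag_anomalies(70.1, 71.3, [0.9, 0.8, 0.7, 0.6]))
print(flag_anomalies(72.0, 71.3, [0.5, 0.6, 0.8, 1.4]))
```

A flag is a prompt to read the training logs, not a verdict on the experiment.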

### Step 6: Feishu Notification (if configured)

After results are collected, check `~/.codex/feishu.json`:
- If the config is absent or its mode is `"off"`: skip entirely (no-op)
- Otherwise, send an `experiment_done` notification containing the results summary table and the delta vs baseline
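
The skip logic can be sketched as below. Only the `mode` field and its `"off"` value come from the text above; the rest is assumed. That includes the `"push"` mode value, treating a missing `mode` as off, and the helper name `build_feishu_payload`. The payload uses the plain text message shape accepted by Feishu custom bot webhooks. Actually posting it is left out, since the config schema (for example, where the webhook URL lives) is not specified here.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def build_feishu_payload(config_path: Path, summary: str) -> Optional[dict]:
    """Return a text-message payload, or None when the notification is skipped."""
    if not config_path.exists():
        return None  # config absent: no-op
    config = json.loads(config_path.read_text())
    if config.get("mode", "off") == "off":  # assumption: missing mode means off
        return None
    return {"msg_type": "text", "content": {"text": f"[experiment_done]\n{summary}"}}


# Demo against a temporary file instead of ~/.codex/feishu.json.
with tempfile.TemporaryDirectory() as tmp:
    cfg = Path(tmp) / "feishu.json"
    skipped_absent = build_feishu_payload(cfg, "table")
    cfg.write_text(json.dumps({"mode": "off"}))
    skipped_off = build_feishu_payload(cfg, "table")
    cfg.write_text(json.dumps({"mode": "push"}))  # "push" is a hypothetical mode name
    payload = build_feishu_payload(cfg, "| Experiment | Metric |")
print(skipped_absent, skipped_off, payload)
```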

## Key Rules
- Always show raw numbers before interpretation
- Compare against the correct baseline (same config)
- Note if experiments are still running (check progress bars, iteration counts)
- If results look wrong, check training logs for errors before concluding
- Include backend cost/risk notes for long-running Vast.ai or Modal jobs
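
As one way to apply the "still running" and "check training logs" rules, a captured log tail can be scanned for a tqdm-style progress counter and a few common error markers. This is a heuristic sketch: `scan_log_tail` and the marker list are assumptions, and a stale log can look "running" long after the process has died, so the result should be cross-checked against the backend state.

```python
import re
from typing import Optional, Tuple

# tqdm-style counter, e.g. " 45%|####5     | 450/1000 [01:23<01:41,  5.42it/s]"
TQDM_RE = re.compile(r"\|\s*(\d+)/(\d+)\s+\[")
# Assumed markers; extend for the frameworks actually in use.
ERROR_MARKERS = ("Traceback (most recent call last)", "CUDA out of memory")


def scan_log_tail(text: str) -> dict:
    """Guess run status from the last progress counter and known error strings."""
    progress: Optional[Tuple[int, int]] = None
    for line in text.splitlines():  # splitlines also splits on carriage returns
        match = TQDM_RE.search(line)
        if match:
            progress = (int(match.group(1)), int(match.group(2)))
    errors = [marker for marker in ERROR_MARKERS if marker in text]
    if errors:
        status = "failed"
    elif progress and progress[0] >= progress[1]:
        status = "done"
    elif progress:
        status = "running"
    else:
        status = "unknown"
    return {"status": status, "progress": progress, "errors": errors}


# Invented log tail for the demo.
tail = "epoch 3\n 45%|####5     | 450/1000 [01:23<01:41,  5.42it/s]\n"
result = scan_log_tail(tail)
print(result)
```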

INPUTS

$ARGUMENTS (required): experiment identifier or filter passed to the monitor command

REQUIRED CONTEXT

AGENTS.md or launch summary for backend
$ARGUMENTS for monitor target
wandb entity/project when enabled

OPTIONAL CONTEXT

results_dir path
feishu.json config

TOOLS REQUIRED

ssh
code_execution

ROLES & RULES

Monitor the backend that was actually used; do not assume a plain SSH screen session when the run was launched through Vast.ai or Modal.
If W&B is configured but unavailable, report the connectivity problem and fall back to screen/log/json evidence. Do not interpret missing W&B data as experiment failure by itself.
Always include W&B dashboard links (run.url) when available
Always show raw numbers before interpretation
Compare against the correct baseline (same config)
Note if experiments are still running (check progress bars, iteration counts)
If results look wrong, check training logs for errors before concluding
Include backend cost/risk notes for long-running Vast.ai or Modal jobs

EXPECTED OUTPUT

Format

markdown

Schema

table · Experiment, Metric, Delta vs Baseline, Status

Constraints

always show raw numbers before interpretation
present results in comparison table
include W&B dashboard links when available
note backend cost/risk for Vast/Modal jobs

SUCCESS CRITERIA

Identify the correct backend from AGENTS.md or launch artifacts
Capture last N lines from each screen session
Check for and parse JSON result files
Pull W&B metrics when enabled
Present results in comparison table
Interpret against baselines and flag anomalies
Send Feishu notification if configured

CAVEATS

Dependencies

AGENTS.md
run notes
launch summary
~/.codex/feishu.json

Missing context

How AGENTS.md is structured or where it lives
Default or parameter value for N in screen hardcopy
How to obtain or configure wandb entity/project when not in AGENTS.md

Ambiguities

"last N lines" — N is never defined
Placeholders <server>, <name>, <results_dir>, <entity>/<project> are used without definition or substitution rules
Trigger phrase escaping in description looks malformed: /"check results/"

QUALITY

OVERALL: 0.81
CLARITY: 0.85
SPECIFICITY: 0.88
REUSABILITY: 0.78
COMPLETENESS: 0.72

IMPROVEMENT SUGGESTIONS

Replace "last N lines" with a concrete default (e.g., 100) or an explicit parameter
Add a short "Inputs" section listing required variables (server, results_dir, wandb entity/project) and how they are supplied
Make the W&B code blocks use variables instead of literal "<entity>/<project>" strings

USAGE

Copy the prompt above and paste it into your AI of choice — Claude, ChatGPT, Gemini, or anywhere else you're working. Replace any placeholder sections with your own context, then ask for the output.