
agent operations skill risk: medium

Machine Learning Experiment Monitor

Instructs the model to monitor running experiments by identifying the backend, collecting screen outputs, JSON results, and W&B metrics, then summarizing results in a comparison ta…

  • Human review
  • External action: high

SKILL 1 file

SKILL.md
---
name: monitor-experiment
description: "Monitor running experiments, check progress, collect results. Use when user says /\"check results/\", /\"is it done/\", /\"monitor/\", or wants experiment output."
---
# Monitor Experiment Results

Monitor: $ARGUMENTS

## Workflow

### Step 1: Check What's Running

First identify the backend from `AGENTS.md`, run notes, or launch summary: local, SSH, Vast.ai, or Modal. Monitor the backend that was actually used; do not assume a plain SSH screen session when the run was launched through Vast.ai or Modal.

```bash
ssh <server> "screen -ls"
```

For Vast.ai, also check instance state, SSH reachability, hourly cost, and whether `auto_destroy` is pending. For Modal, check the Modal run/app logs, function status, timeout, volume outputs, and cloud cost exposure.
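The backend check above can be sketched as a small lookup. This is only an illustration: the `backend: <name>` line format, the `KNOWN_BACKENDS` tuple, and the `detect_backend` helper are assumptions, since the structure of `AGENTS.md` is not specified here. Returning `None` when nothing is declared keeps the caller from silently assuming a plain SSH screen session.

```python
import re

# Backends named in this skill; the spelling of each value is an assumption.
KNOWN_BACKENDS = ("local", "ssh", "vast.ai", "modal")


def detect_backend(notes_text):
    """Return the declared backend, or None if no usable declaration is found."""
    # Hypothetical convention: a line such as "backend: modal" in AGENTS.md,
    # the run notes, or the launch summary.
    match = re.search(
        r"^\s*backend\s*:\s*(\S+)\s*$", notes_text, re.IGNORECASE | re.MULTILINE
    )
    if match is None:
        return None
    value = match.group(1).strip("\"'").lower()
    # Unknown values are treated like a missing declaration: ask, do not guess.
    return value if value in KNOWN_BACKENDS else None
```

When the result is `None`, ask the user or inspect the launch summary instead of defaulting to `screen -ls`.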

### Step 2: Collect Output from Each Screen
For each screen session, capture the last N lines:
```bash
# -h also dumps the scrollback buffer; without it only the visible window is captured.
ssh <server> "screen -S <name> -X hardcopy -h /tmp/screen_<name>.txt && tail -n 50 /tmp/screen_<name>.txt"
```

If hardcopy fails, check for log files or tee output.
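To loop over each screen session, the session names first have to be extracted from the `screen -ls` listing. A minimal sketch, assuming the usual listing layout (one indented `<pid>.<name>` entry per session, ending in `(Attached)` or `(Detached)`); the `parse_screen_sessions` helper is hypothetical, and other states such as multi-display sessions are not handled:

```python
import re

# Matches lines such as "\t12345.exp_baseline\t(Detached)"; newer screen
# versions insert a timestamp column, which the ".*" skips over.
SESSION_LINE = re.compile(r"^\s+(\d+)\.(\S+)\s.*\((Attached|Detached)\)\s*$")


def parse_screen_sessions(listing):
    """Return one dict per session found in the text printed by `screen -ls`."""
    sessions = []
    for line in listing.splitlines():
        match = SESSION_LINE.match(line)
        if match:
            sessions.append(
                {
                    "pid": int(match.group(1)),
                    "name": match.group(2),
                    "state": match.group(3),
                }
            )
    return sessions
```

Each returned `name` can then be substituted for `<name>` in the hardcopy command.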

### Step 3: Check for JSON Result Files
```bash
ssh <server> "ls -lt <results_dir>/*.json 2>/dev/null | head -20"
```

If JSON results exist, fetch and parse them:
```bash
ssh <server> "cat <results_dir>/<latest>.json"
```
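Once the JSON text has been fetched, parsing can be sketched as follows. The result schema is not defined in this skill, so the sketch assumes a flat JSON object and simply keeps its numeric fields; `parse_result` and `numeric_fields` are hypothetical helper names. A file that fails to parse is reported as `None` rather than raised, because a run that is still writing may leave a truncated file.

```python
import json


def parse_result(text):
    """Parse fetched JSON text; return None if it is incomplete or not an object."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None  # e.g. the experiment is still writing the file
    return data if isinstance(data, dict) else None


def numeric_fields(result):
    """Keep only int/float values (assumed to be metrics); booleans are excluded."""
    return {
        key: value
        for key, value in result.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }
```

A `None` result is a reason to re-check the logs, not evidence that the experiment failed.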

### Step 3.5: Pull W&B Metrics (when `wandb: true` in AGENTS.md)

If the project enables W&B, pull metrics before interpreting results. Prefer W&B as the source of training curves and recent eval state, while still checking logs for crashes.

List recent runs:

```bash
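python3 - <<'PY'
from itertools import islice
import wandb

api = wandb.Api()
# per_page is only the page size; islice is what caps the listing at 20 runs.
runs = api.runs("<entity>/<project>", order="-created_at", per_page=20)
for run in islice(runs, 20):
    print(run.name, run.state, run.url)
PY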
```

Pull recent history for a specific run:

```bash
python3 - <<'PY'
import wandb
api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")
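# pandas=False returns a list of dicts; looping over the default DataFrame yields column names, not rows.
for row in run.history(samples=50, keys=["train/loss", "eval/loss", "eval/accuracy", "train/lr"], pandas=False):
    print(row)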
print("summary:", dict(run.summary))
PY
```

If W&B is configured but unavailable, report the connectivity problem and fall back to screen/log/json evidence. Do not interpret missing W&B data as experiment failure by itself.

Always include W&B dashboard links (`run.url`) when available so later review and paper-writing agents can inspect the exact training curves.
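The fallback rule for an unavailable W&B can be sketched as a wrapper that records the failure and still collects the other evidence. The `collect_evidence` function and its two callables are hypothetical: `fetch_wandb` stands for any function that queries the W&B API, and `fetch_logs` for the screen/log/JSON collection from the earlier steps.

```python
def collect_evidence(fetch_wandb, fetch_logs):
    """Gather W&B data if reachable; always gather log evidence as well."""
    report = {"wandb": None, "wandb_error": None, "logs": None}
    try:
        report["wandb"] = fetch_wandb()
    except Exception as exc:  # connectivity, auth, missing run, ...
        # Recorded as a connectivity problem, not as an experiment failure.
        report["wandb_error"] = f"{type(exc).__name__}: {exc}"
    # Logs are checked in both cases, since crashes may only show up there.
    report["logs"] = fetch_logs()
    return report
```

A non-empty `wandb_error` belongs in the final report next to the log-based findings.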

### Step 4: Summarize Results

Present results in a comparison table:
```
| Experiment | Metric | Delta vs Baseline | Status |
|-----------|--------|-------------------|--------|
| Baseline  | X.XX   | —                 | done   |
| Method A  | X.XX   | +Y.Y              | done   |
```
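A table in this shape can be generated from collected metrics. The following is a minimal sketch under stated assumptions: `comparison_table` is a hypothetical helper, `results` maps each experiment name to a `(metric, status)` pair, a single scalar metric is compared, and two decimal places are used for both columns.

```python
def comparison_table(results, baseline="Baseline"):
    """Render a markdown table; results maps name -> (metric or None, status)."""
    base = results[baseline][0]
    lines = [
        "| Experiment | Metric | Delta vs Baseline | Status |",
        "|------------|--------|-------------------|--------|",
    ]
    for name, (metric, status) in results.items():
        metric_cell = "n/a" if metric is None else f"{metric:.2f}"
        if name == baseline or metric is None or base is None:
            delta_cell = "-"  # no delta for the baseline itself or unfinished runs
        else:
            delta_cell = f"{metric - base:+.2f}"
        lines.append(f"| {name} | {metric_cell} | {delta_cell} | {status} |")
    return "\n".join(lines)
```

Runs without a metric yet still get a row, so the table also shows what is still in progress.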

### Step 5: Interpret
- Compare against known baselines
- Flag unexpected results (negative delta, NaN, divergence)
- Suggest next steps based on findings
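The flagging step can be sketched for a single scalar metric. `flag_anomalies` and its `higher_is_better` switch are assumptions added for illustration: the switch matters because a negative delta is a regression for accuracy but an improvement for loss. Divergence detection would need the training curve and is not covered here.

```python
import math


def flag_anomalies(metric, baseline, higher_is_better=True):
    """Return a list of human-readable flags for one metric value."""
    if metric is None:
        return ["missing metric"]
    if math.isnan(metric) or math.isinf(metric):
        return ["non-finite metric (NaN/inf)"]
    flags = []
    delta = metric - baseline
    # "Worse" depends on the metric direction (accuracy vs. loss).
    worse = delta < 0 if higher_is_better else delta > 0
    if worse:
        flags.append(f"worse than baseline (delta {delta:+.4f})")
    return flags
```

An empty list means nothing was flagged by these checks; it does not replace reading the training logs.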

### Step 6: Feishu Notification (if configured)

After results are collected, check `~/.codex/feishu.json`:
- Send `experiment_done` notification: results summary table, delta vs baseline
- If config absent or mode `"off"`: skip entirely (no-op)
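The skip rule can be sketched as a single check. Only the file path and the `"off"` mode come from this skill; the `feishu_enabled` helper is hypothetical, and treating an unreadable file as "skip" and a missing `mode` key as "enabled" are assumptions.

```python
import json
from pathlib import Path


def feishu_enabled(config_path="~/.codex/feishu.json"):
    """Return True only if the config exists, parses, and mode is not "off"."""
    path = Path(config_path).expanduser()
    if not path.is_file():
        return False  # config absent: no-op
    try:
        config = json.loads(path.read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError):
        return False  # assumption: an unreadable config is treated like an absent one
    # Assumption: a config without a "mode" key counts as enabled.
    return isinstance(config, dict) and config.get("mode") != "off"
```

When this returns `False`, the notification step is skipped without reporting an error.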

## Key Rules
- Always show raw numbers before interpretation
- Compare against the correct baseline (same config)
- Note if experiments are still running (check progress bars, iteration counts)
- If results look wrong, check training logs for errors before concluding
- Include backend cost/risk notes for long-running Vast.ai or Modal jobs

INPUTS

$ARGUMENTS REQUIRED

experiment identifier or filter passed to the monitor command

REQUIRED CONTEXT

  • AGENTS.md or launch summary for backend
  • $ARGUMENTS for monitor target
  • wandb entity/project when enabled

OPTIONAL CONTEXT

  • results_dir path
  • feishu.json config

TOOLS REQUIRED

  • ssh
  • code_execution

ROLES & RULES

  1. Monitor the backend that was actually used; do not assume a plain SSH screen session when the run was launched through Vast.ai or Modal.
  2. If W&B is configured but unavailable, report the connectivity problem and fall back to screen/log/json evidence. Do not interpret missing W&B data as experiment failure by itself.
  3. Always include W&B dashboard links (run.url) when available
  4. Always show raw numbers before interpretation
  5. Compare against the correct baseline (same config)
  6. Note if experiments are still running (check progress bars, iteration counts)
  7. If results look wrong, check training logs for errors before concluding
  8. Include backend cost/risk notes for long-running Vast.ai or Modal jobs

EXPECTED OUTPUT

Format: markdown
Schema: table · Experiment, Metric, Delta vs Baseline, Status
Constraints
  • always show raw numbers before interpretation
  • present results in comparison table
  • include W&B dashboard links when available
  • note backend cost/risk for Vast/Modal jobs

SUCCESS CRITERIA

  • Identify the correct backend from AGENTS.md or launch artifacts
  • Capture last N lines from each screen session
  • Check for and parse JSON result files
  • Pull W&B metrics when enabled
  • Present results in comparison table
  • Interpret against baselines and flag anomalies
  • Send Feishu notification if configured

CAVEATS

Dependencies
  • AGENTS.md
  • run notes
  • launch summary
  • ~/.codex/feishu.json
Missing context
  • How AGENTS.md is structured or where it lives
  • Default or parameter value for N in screen hardcopy
  • How to obtain or configure wandb entity/project when not in AGENTS.md
Ambiguities
  • "last N lines" — N is never defined
  • Placeholders <server>, <name>, <results_dir>, <entity>/<project> are used without definition or substitution rules
  • Trigger phrase escaping in description looks malformed: /"check results/"

QUALITY

  • Overall: 0.81
  • Clarity: 0.85
  • Specificity: 0.88
  • Reusability: 0.78
  • Completeness: 0.72

IMPROVEMENT SUGGESTIONS

  • Replace "last N lines" with a concrete default (e.g., 100) or an explicit parameter
  • Add a short "Inputs" section listing required variables (server, results_dir, wandb entity/project) and how they are supplied
  • Make the W&B code blocks use variables instead of literal "<entity>/<project>" strings

