
agent operations skill risk: medium

ML Experiment Results Monitor

Instructs the model to monitor running experiments by checking SSH screen sessions, Vast.ai and Modal instances, collecting JSON outputs and W&B metrics, summarizing results in tab…

  • Policy sensitive
  • Human review
  • External action: high

SKILL 1 file

SKILL.md
---
name: auto-claude-code-research-in-sleep-monitor-experiment
description: "Monitor running experiments, check progress, collect results. Use when user says \"check results\", \"is it done\", \"monitor\", or wants experiment output."
---
# Monitor Experiment Results

Monitor: $ARGUMENTS

## Workflow

### Step 1: Check What's Running

**SSH server:**
```bash
ssh <server> "screen -ls"
```

**Vast.ai instance** (read `ssh_host`, `ssh_port` from `vast-instances.json`):
```bash
ssh -p <PORT> root@<HOST> "screen -ls"
```

Also check vast.ai instance status:
```bash
vastai show instances
```
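The schema of `vast-instances.json` is not shown in this skill, so the following is only a sketch. It assumes the file holds a JSON list of objects that carry `ssh_host` and `ssh_port`; the helper name `screen_ls_commands` and the sample values are hypothetical.

```python
import json
import shlex


def screen_ls_commands(instances_json: str) -> list[str]:
    """Build one `ssh ... 'screen -ls'` command per instance.

    Assumption (not confirmed by the source): the JSON is a list of
    objects with `ssh_host` and `ssh_port` keys.
    """
    commands = []
    for inst in json.loads(instances_json):
        host = shlex.quote(str(inst["ssh_host"]))
        port = int(inst["ssh_port"])
        commands.append(f"ssh -p {port} root@{host} {shlex.quote('screen -ls')}")
    return commands


# Made-up sample entry, for illustration only
sample = '[{"ssh_host": "ssh4.vast.ai", "ssh_port": 12345}]'
for command in screen_ls_commands(sample):
    print(command)
```

Building the commands as strings first makes it easy to show them to the user before anything is executed.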

**Modal** (when `gpu: modal` in CLAUDE.md):
```bash
modal app list         # List running/recent apps
modal app logs <app>   # Stream logs from a running app
```
Modal apps stop on their own when the run completes, so an app that is missing from the list (or listed as stopped) has already finished. Check results via `modal volume ls <volume>` or local output.

### Step 2: Collect Output from Each Screen
For each screen session, capture the last N lines:
```bash
ssh <server> "screen -S <name> -X hardcopy -h /tmp/screen_<name>.txt && tail -50 /tmp/screen_<name>.txt"
```

If hardcopy fails, check for log files or tee output.
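To loop over "each screen session", the session names first have to be pulled out of the `screen -ls` text. A minimal sketch of that parsing step follows; the function name `parse_screen_ls` is hypothetical, and the pattern only recognises the plain `(Attached)` and `(Detached)` states, so other states such as dead sessions are skipped.

```python
import re

# Matches lines such as "\t12345.exp1\t(Detached)". Newer screen versions
# insert a date column before the state, which the ".*" simply skips.
SESSION_RE = re.compile(
    r"^\s+(\d+)\.(\S+)\s+.*\((Attached|Detached)\)", re.MULTILINE
)


def parse_screen_ls(output: str) -> list[dict]:
    """Return one {"pid", "name", "state"} dict per listed session."""
    return [
        {"pid": int(pid), "name": name, "state": state}
        for pid, name, state in SESSION_RE.findall(output)
    ]


# Illustrative input, not captured from a real server
example = (
    "There are screens on:\n"
    "\t12345.exp1\t(Detached)\n"
    "\t12346.exp2\t(Attached)\n"
    "2 Sockets in /run/screen/S-user.\n"
)
for session in parse_screen_ls(example):
    print(session["name"], session["state"])
```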

### Step 3: Check for JSON Result Files
```bash
ssh <server> "ls -lt <results_dir>/*.json 2>/dev/null | head -20"
```

If JSON results exist, fetch and parse them:
```bash
ssh <server> "cat <results_dir>/<latest>.json"
```
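The layout of the result files is not specified here, so parsing has to be defensive. The sketch below assumes each file is a JSON object, possibly nested, and flattens every numeric leaf into a `path/to/key` entry; `numeric_metrics` is a hypothetical helper name.

```python
import json
from numbers import Real


def numeric_metrics(json_text: str) -> dict[str, float]:
    """Flatten a result document into {"path/to/key": number}.

    Assumption: results are nested JSON objects. Strings, lists, booleans,
    and nulls are ignored in this sketch.
    """
    flat: dict[str, float] = {}

    def walk(node, path: str) -> None:
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{path}/{key}" if path else str(key))
        elif isinstance(node, Real) and not isinstance(node, bool):
            flat[path] = float(node)

    walk(json.loads(json_text), "")
    return flat


# Made-up result document
print(numeric_metrics('{"eval": {"loss": 1.5, "ppl": 4}, "name": "run-a"}'))
```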

### Step 3.5: Pull W&B Metrics (when `wandb: true` in CLAUDE.md)

**Skip this step entirely if `wandb` is not set or is `false` in CLAUDE.md.**

Pull training curves and metrics from Weights & Biases via Python API:

```bash
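# List the 10 most recent runs in the project
ssh <server> "python3 -c \"
import itertools, wandb
api = wandb.Api()
runs = api.runs('<entity>/<project>', order='-created_at', per_page=10)
# per_page is only the page size, so cap the iteration explicitly;
# single quotes only, because escaped double quotes would not survive the remote shell
for r in itertools.islice(runs, 10):
    print(r.id, r.state, r.name, r.summary.get('eval/loss', 'N/A'), sep='  ')
\""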

# Pull specific metrics from a run (prints the last 10 rows; scan_history returns only rows that log every requested key)
ssh <server> "python3 -c \"
import wandb, json
api = wandb.Api()
run = api.run('<entity>/<project>/<run_id>')
history = list(run.scan_history(keys=['train/loss', 'eval/loss', 'eval/ppl', 'train/lr'], page_size=50))
print(json.dumps(history[-10:], indent=2))
\""

# Pull run summary (final metrics)
ssh <server> "python3 -c \"
import wandb, json
api = wandb.Api()
run = api.run('<entity>/<project>/<run_id>')
print(json.dumps(dict(run.summary), indent=2, default=str))
\""
```

**What to extract:**
- **Training loss curve** — is it converging? diverging? plateauing?
- **Eval metrics** — loss, PPL, accuracy at latest checkpoint
- **Learning rate** — is the schedule behaving as expected?
- **GPU memory** — any OOM risk?
- **Run status** — running / finished / crashed?
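The "converging / diverging / plateauing" question can be given a rough first answer from the pulled loss values. The sketch below is a crude heuristic rather than a standard method: it compares the mean of the first and second half of a window, and both the function name `classify_trend` and the 1% tolerance are assumptions.

```python
import math


def classify_trend(losses: list[float], rel_tol: float = 0.01) -> str:
    """Label a window of loss values by comparing its two halves."""
    if any(not math.isfinite(x) for x in losses):
        return "diverging"  # NaN or inf is treated as divergence here
    if len(losses) < 4:
        return "insufficient data"
    half = len(losses) // 2
    early = sum(losses[:half]) / half
    late = sum(losses[half:]) / (len(losses) - half)
    change = (late - early) / max(abs(early), 1e-12)
    if change < -rel_tol:
        return "converging"  # loss is still dropping
    if change > rel_tol:
        return "diverging"
    return "plateauing"


print(classify_trend([4.0, 3.0, 2.0, 1.0]))
```

A label from this kind of check is a prompt to look at the curve, not a replacement for reading the raw numbers.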

**W&B dashboard link** (include in summary for user):
```
https://wandb.ai/<entity>/<project>/runs/<run_id>
```

> This gives the auto-review-loop richer signal than just screen output — training dynamics, loss curves, and metric trends over time.

### Step 4: Summarize Results

Present results in a comparison table:
```
| Experiment | Metric | Delta vs Baseline | Status |
|-----------|--------|-------------------|--------|
| Baseline  | X.XX   | —                 | done   |
| Method A  | X.XX   | +Y.Y              | done   |
```
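A table in this shape can be generated from a mapping of experiment name to metric value. This is a sketch under assumptions: one metric per experiment, `None` meaning the run has not finished, and a hypothetical helper name `comparison_table`.

```python
from typing import Optional


def comparison_table(results: dict[str, Optional[float]], baseline: str) -> str:
    """Render {experiment: metric} as a Markdown comparison table."""
    base = results.get(baseline)
    rows = [
        "| Experiment | Metric | Delta vs Baseline | Status |",
        "|-----------|--------|-------------------|--------|",
    ]
    for name, value in results.items():
        if value is None:
            metric, delta, status = "n/a", "n/a", "running"
        elif name == baseline:
            metric, delta, status = f"{value:.2f}", "-", "done"
        elif base is None:
            metric, delta, status = f"{value:.2f}", "n/a", "done"
        else:
            metric, delta, status = f"{value:.2f}", f"{value - base:+.2f}", "done"
        rows.append(f"| {name} | {metric} | {delta} | {status} |")
    return "\n".join(rows)


# Made-up numbers, for illustration only
print(comparison_table({"Baseline": 0.80, "Method A": 0.85, "Method B": None}, "Baseline"))
```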

### Step 5: Interpret
- Compare against known baselines
- Flag unexpected results (negative delta, NaN, divergence)
- Suggest next steps based on findings
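The flagging rule can be sketched as a small check per result. Whether a negative delta is bad depends on the metric (accuracy versus loss), so the `higher_is_better` switch below is an assumption added for illustration, as is the function name `flag_result`.

```python
import math


def flag_result(value: float, baseline: float, higher_is_better: bool = True) -> list[str]:
    """Return human-readable warnings for one metric value."""
    if not math.isfinite(value):
        return ["non-finite metric (NaN or inf)"]
    delta = value - baseline
    worse = delta < 0 if higher_is_better else delta > 0
    if worse:
        return [f"worse than baseline (delta {delta:+.4g})"]
    return []


print(flag_result(float("nan"), 0.8))
print(flag_result(0.7, 0.8))
```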

### Step 6: Feishu Notification (if configured)

After results are collected, check `~/.claude/feishu.json`:
- Send `experiment_done` notification: results summary table, delta vs baseline
- If config absent or mode `"off"`: skip entirely (no-op)
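The skip rule can be expressed as a single guard. The sketch below assumes `feishu.json` is a JSON object with a `mode` key, as the text implies; treating a missing `mode` key or an unreadable file as "off" is an extra assumption, and `feishu_enabled` is a hypothetical name. Sending the notification itself is out of scope here.

```python
import json
from pathlib import Path


def feishu_enabled(config_path: Path) -> bool:
    """Return False when the config is absent, unparsable, or mode is "off"."""
    try:
        config = json.loads(config_path.expanduser().read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        return False
    # Assumption: a missing "mode" key is treated like "off"
    return isinstance(config, dict) and config.get("mode", "off") != "off"


print(feishu_enabled(Path("~/.claude/feishu.json")))
```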

## Key Rules
- Always show raw numbers before interpretation
- Compare against the correct baseline (same config)
- Note if experiments are still running (check progress bars, iteration counts)
- If results look wrong, check training logs for errors before concluding
- **Vast.ai cost awareness**: When monitoring vast.ai instances, report the running cost (hours * $/hr from `vast-instances.json`). If all experiments on an instance are done, remind the user to run `/vast-gpu destroy <instance_id>` to stop billing
- **Modal cost awareness**: Modal auto-scales to zero — no idle billing. When reporting results from Modal runs, note the actual execution time and estimated cost (time * $/hr from the GPU tier used). No cleanup action needed
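Both cost rules reduce to elapsed hours multiplied by an hourly rate. A minimal sketch follows; where the start time and the $/hr figure come from (for example which fields of `vast-instances.json`) is not specified in this skill, so they are passed in as plain arguments, and `running_cost` is a hypothetical name.

```python
from datetime import datetime, timezone
from typing import Optional


def running_cost(
    started_at: datetime,
    dollars_per_hour: float,
    now: Optional[datetime] = None,
) -> tuple[float, float]:
    """Return (elapsed_hours, estimated_cost) for one instance or run."""
    now = now or datetime.now(timezone.utc)
    hours = (now - started_at).total_seconds() / 3600
    return hours, hours * dollars_per_hour


# Made-up start time and rate, for illustration only
start = datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc)
end = datetime(2024, 1, 1, 5, 30, tzinfo=timezone.utc)
hours, cost = running_cost(start, 0.40, now=end)
print(f"{hours:.1f} h, ${cost:.2f}")
```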

INPUTS

ARGUMENTS REQUIRED

experiment monitor target or query

server REQUIRED

SSH server hostname

results_dir REQUIRED

directory containing JSON result files

REQUIRED CONTEXT

  • $ARGUMENTS
  • CLAUDE.md
  • vast-instances.json

OPTIONAL CONTEXT

  • wandb config
  • feishu.json config

TOOLS REQUIRED

  • ssh
  • vastai
  • modal

ROLES & RULES

  1. Skip the W&B metrics step (Step 3.5) entirely if `wandb` is not set or is `false` in CLAUDE.md.
  2. Always show raw numbers before interpretation
  3. Compare against the correct baseline (same config)
  4. Note if experiments are still running (check progress bars, iteration counts)
  5. If results look wrong, check training logs for errors before concluding
  6. When monitoring vast.ai instances, report the running cost (hours * $/hr from `vast-instances.json`)
  7. If all experiments on an instance are done, remind the user to run `/vast-gpu destroy <instance_id>` to stop billing
  8. When reporting results from Modal runs, note the actual execution time and estimated cost (time * $/hr from the GPU tier used)

EXPECTED OUTPUT

Format
markdown
Schema
table · Experiment, Metric, Delta vs Baseline, Status
Constraints
  • always show raw numbers before interpretation
  • compare against correct baseline
  • report running costs for vast.ai and modal
  • include W&B dashboard link when applicable
  • use comparison table format

SUCCESS CRITERIA

  • Monitor running experiments
  • Check progress
  • Collect results
  • Present results in a comparison table
  • Interpret results and suggest next steps

CAVEATS

Dependencies
  • CLAUDE.md
  • vast-instances.json
  • ~/.claude/feishu.json
Missing context
  • Definition or example of CLAUDE.md format and contents
  • Schema or example of vast-instances.json
  • How $ARGUMENTS is parsed to select servers/instances
Ambiguities
  • Placeholders such as <server>, <results_dir>, <entity>/<project> are used without defining how they are populated from $ARGUMENTS or context files.

QUALITY

OVERALL
0.86
CLARITY
0.88
SPECIFICITY
0.92
REUSABILITY
0.78
COMPLETENESS
0.85

IMPROVEMENT SUGGESTIONS

  • Add a short 'Inputs' section that explicitly lists required files (CLAUDE.md, vast-instances.json, feishu.json) and the expected shape of $ARGUMENTS.
  • Replace generic placeholders (<server>, <results_dir>) with named variables or extraction rules so the template can be used without additional human editing.

USAGE

Copy the prompt above and paste it into your AI of choice — Claude, ChatGPT, Gemini, or anywhere else you're working. Replace any placeholder sections with your own context, then ask for the output.
