agent research skill risk: low

ML Ablation Study Planner

Designs ablation studies for ML papers by spawning a reviewer agent to propose component, hyperparameter, and design-choice tests, then parsing the plan, checking compute feasibili…


---
name: auto-claude-code-research-in-sleep-ablation-planner
description: "Use when main results pass result-to-claim (`claim_supported = yes` or `partial`) and ablation studies are needed for paper submission. A secondary Codex agent designs ablations from a reviewer's perspective; the local executor reviews feasibility and implements."
---
# Ablation Planner

Systematically design ablation studies that answer the questions reviewers will ask. The reviewer agent leads the design; the local executor reviews feasibility and implements.

## Context: $ARGUMENTS

## When to Use

- Main results pass `/result-to-claim` with `claim_supported = yes` or `partial`
- The user explicitly requests ablation planning
- `/auto-review-loop` identifies missing ablations

## Workflow

### Step 1: Prepare Context

Read available project files to build the full picture:

- Method description and components (from `docs/research_contract.md`, project notes, or method docs)
- Current experiment results (from `EXPERIMENT_LOG.md`, `EXPERIMENT_TRACKER.md`, or W&B)
- Confirmed and intended claims (from `/result-to-claim` output or project notes)
- Available compute resources (from server notes, run configs, or user-provided budget)
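A minimal sketch of this context-gathering step, assuming the files sit at the paths named above relative to the project root (their real location is not specified here). `CONTEXT_SOURCES` and `gather_context` are hypothetical names; the point is to report which context items have no backing file instead of proceeding with gaps.

```python
from pathlib import Path

# Candidate files per context item. The paths mirror the ones named above,
# but their exact location in a given project is an assumption.
CONTEXT_SOURCES = {
    "method": ["docs/research_contract.md"],
    "results": ["EXPERIMENT_LOG.md", "EXPERIMENT_TRACKER.md"],
}


def gather_context(project_root, sources=CONTEXT_SOURCES):
    """Return (found, missing): file text per item, and items with no file."""
    root = Path(project_root)
    found, missing = {}, []
    for item, candidates in sources.items():
        for rel in candidates:
            path = root / rel
            if path.is_file():
                # First existing candidate wins for this context item.
                found[item] = path.read_text(encoding="utf-8")
                break
        else:
            missing.append(item)
    return found, missing
```

Items in `missing` would have to come from project notes, W&B, or the user before the reviewer agent is spawned.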

### Step 2: Codex Designs Ablations

```text
spawn_agent:
  model: gpt-5.5
  reasoning_effort: xhigh
  message: |
    You are a rigorous ML reviewer planning ablation studies.
    Given this method and results, design ablations that:

    1. Isolate the contribution of each novel component
    2. Answer questions reviewers will definitely ask
    3. Test sensitivity to key hyperparameters
    4. Compare against natural alternative design choices

    Method: [description from project files]
    Components: [list of removable or replaceable components]
    Current results: [key metrics from experiments]
    Claims: [what we claim and current evidence]

    For each ablation, specify:
    - name: what to change (for example, "remove module X", "replace Y with Z")
    - what_it_tests: the specific question this answers
    - expected_if_component_matters: what we predict if the component is important
    - priority: 1 (must-run) to 5 (nice-to-have)

    Also provide:
    - coverage_assessment: what reviewer questions these ablations answer
    - unnecessary_ablations: experiments that seem useful but will not add insight
    - suggested_order: run order optimized for maximum early information
    - estimated_compute: total GPU-hours estimate
```

If delegation is unavailable, generate the same plan locally and mark it `[pending external review]`.
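The bracketed placeholders in the message have to be filled before the agent is spawned, and the mechanism is not specified here. One possible approach, sketched with Python's standard `string.Template`: the variable names (`method_description`, `components`, `current_results`, `claims`) and the helper `build_reviewer_context` are assumptions chosen for illustration.

```python
from string import Template

# Hypothetical named variables standing in for the bracketed placeholders.
REVIEWER_CONTEXT = Template(
    "Method: $method_description\n"
    "Components: $components\n"
    "Current results: $current_results\n"
    "Claims: $claims\n"
)

REQUIRED = ("method_description", "components", "current_results", "claims")


def build_reviewer_context(values):
    """Fill the template; raise if any required value is missing or blank."""
    blank = [k for k in REQUIRED if not str(values.get(k, "")).strip()]
    if blank:
        raise ValueError(f"missing context values: {', '.join(blank)}")
    return REVIEWER_CONTEXT.substitute(values)
```

Failing loudly on a blank value keeps an unfilled placeholder from reaching the reviewer agent.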

### Step 3: Parse Ablation Plan

Normalize the response into a structured format:

```markdown
## Ablation Plan

### Component Ablations (highest priority)
| # | Name | What It Tests | Expected If Matters | Priority |
|---|------|---------------|---------------------|----------|
| 1 | remove module X | contribution of X | performance drops on metric Y | 1 |
| 2 | replace X with simpler Z | value of learned vs fixed | drops, especially on dataset A | 2 |

### Hyperparameter Sensitivity
| # | Parameter | Values to Test | What It Tests | Priority |
|---|-----------|----------------|---------------|----------|
| 3 | lambda | [0.01, 0.1, 1.0] | sensitivity to regularization | 3 |

### Design Choice Comparisons
| # | Name | What It Tests | Priority |
|---|------|---------------|----------|
| 4 | joint vs separate matching | whether joint adds value | 4 |

### Coverage Assessment
[What reviewer questions these ablations answer]

### Unnecessary Ablations
[Experiments that seem useful but will not add insight - skip these]

### Run Order
[Optimized for maximum early information]

### Estimated Compute
[Total GPU-hours]
```
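As a rough illustration of the normalization, the sketch below renders parsed ablations into the Component Ablations table from the template. The `Ablation` dataclass and `render_component_table` are hypothetical; how the reviewer's free-form response is parsed into these records is left open.

```python
from dataclasses import dataclass


@dataclass
class Ablation:
    name: str
    what_it_tests: str
    expected_if_component_matters: str
    priority: int  # 1 (must-run) to 5 (nice-to-have)


def render_component_table(ablations):
    """Render ablations as a markdown table, sorted by priority."""
    lines = [
        "| # | Name | What It Tests | Expected If Matters | Priority |",
        "|---|------|---------------|---------------------|----------|",
    ]
    ordered = sorted(ablations, key=lambda a: a.priority)
    for i, a in enumerate(ordered, start=1):
        lines.append(
            f"| {i} | {a.name} | {a.what_it_tests} | "
            f"{a.expected_if_component_matters} | {a.priority} |"
        )
    return "\n".join(lines)
```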

### Step 4: Local Executor Reviews Feasibility

Before running anything, the local executor checks:

- Compute budget - Does the estimated GPU-hour total for all ablations fit within the available GPUs and budget?
- Code changes - Which ablations need code modifications vs config-only changes?
- Dependencies - Which ablations can run in parallel?
- Cuts - If budget is tight, propose removing lower-priority ablations and ask the reviewer agent to re-prioritize when possible
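The budget check and cut proposal could look roughly like the sketch below. It assumes each ablation carries a per-run GPU-hour estimate under a hypothetical `gpu_hours` key (the reviewer output above only specifies a total), and it uses a simple greedy pass in priority order. The cut list is returned for re-prioritization, not dropped.

```python
def propose_cuts(ablations, budget_gpu_hours):
    """Split ablations into (keep, cut, used) under a GPU-hour budget.

    Each ablation is a dict with hypothetical keys "name", "priority"
    (1 = must-run) and "gpu_hours". Lower priority numbers are kept first.
    Greedy: a cheaper low-priority run may still fit after a costlier one is cut.
    """
    keep, cut, used = [], [], 0.0
    for abl in sorted(ablations, key=lambda a: a["priority"]):
        if used + abl["gpu_hours"] <= budget_gpu_hours:
            keep.append(abl["name"])
            used += abl["gpu_hours"]
        else:
            cut.append(abl["name"])
    return keep, cut, used
```

A non-empty `cut` list is the signal to go back to the reviewer agent with the proposed removals.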

### Step 5: Implement and Run

1. Create configs or scripts for each ablation (config-only changes first)
2. Smoke test each ablation before the full run
3. Run in the suggested order, using descriptive names (for example, `ablation-no-module-X`)
4. Track results in `EXPERIMENT_LOG.md`
5. After all ablations complete, update `findings.md` with insights
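Two small helpers can support steps 1 and 3: one derives descriptive run names in the `ablation-...` style, the other orders config preparation so that config-only ablations come first. This is a sketch under assumptions: `needs_code_change` is a hypothetical boolean key, and the actual run order should still follow the reviewer's suggested order.

```python
import re


def run_name(ablation_name):
    """Turn a short ablation label into an "ablation-..." run name."""
    slug = re.sub(r"[^A-Za-z0-9]+", "-", ablation_name).strip("-")
    return f"ablation-{slug}"


def preparation_order(ablations):
    """Config-only ablations first, then by priority (1 = must-run)."""
    # False sorts before True, so config-only entries lead.
    return sorted(ablations, key=lambda a: (a["needs_code_change"], a["priority"]))
```

For example, `run_name("no module X")` yields `ablation-no-module-X`.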

## Rules

- **The reviewer agent leads the design.** Do not pre-filter or bias the ablation list before external review sees it. The reviewer thinks like a reviewer; the local executor thinks like an engineer.
- Every ablation must have a clear `what_it_tests` and `expected_if_component_matters`. No "just try it" experiments.
- Config-only ablations take priority over those needing code changes (faster, less error-prone).
- If total compute exceeds budget, propose cuts and ask for re-prioritization - do not silently drop ablations.
- Component ablations (remove or replace) take priority over hyperparameter sweeps.
- Do not generate ablations for components identical to the baseline (no-op ablations).
- Record all ablation results in `EXPERIMENT_LOG.md`, including negative results (for example, component removal had no effect).
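Some of these rules can be checked mechanically before anything runs. The sketch below covers the required-field rule and the no-op rule. `validate_ablation` and the `config` key (the full configuration an ablation would run with) are hypothetical, and comparing configs for equality is only one possible way to detect a no-op.

```python
REQUIRED_FIELDS = (
    "name",
    "what_it_tests",
    "expected_if_component_matters",
    "priority",
)


def validate_ablation(ablation, baseline_config=None):
    """Return a list of rule violations for one ablation dict (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not str(ablation.get(field, "")).strip():
            problems.append(f"missing {field}")
    priority = ablation.get("priority")
    if isinstance(priority, int) and not 1 <= priority <= 5:
        problems.append("priority must be between 1 and 5")
    # Hypothetical "config" key: identical to baseline means a no-op ablation.
    if baseline_config is not None and ablation.get("config") == baseline_config:
        problems.append("no-op: config identical to baseline")
    return problems
```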

INPUTS

$ARGUMENTS REQUIRED: Context for the ablation planner

REQUIRED CONTEXT

method description and components
current experiment results
confirmed and intended claims
available compute resources

ROLES & RULES

Role assignments

You are a rigorous ML reviewer planning ablation studies.

The reviewer agent leads the design. Do not pre-filter or bias the ablation list before external review sees it.
Every ablation must have a clear `what_it_tests` and `expected_if_component_matters`. No "just try it" experiments.
Config-only ablations take priority over those needing code changes.
If total compute exceeds budget, propose cuts and ask for re-prioritization - do not silently drop ablations.
Component ablations (remove or replace) take priority over hyperparameter sweeps.
Do not generate ablations for components identical to the baseline (no-op ablations).
Record all ablation results in `EXPERIMENT_LOG.md`, including negative results.

EXPECTED OUTPUT

Format

markdown

Schema

markdown_sections · Ablation Plan, Component Ablations (highest priority), Hyperparameter Sensitivity, Design Choice Comparisons, Coverage Assessment, Unnecessary Ablations, Run Order, Estimated Compute

Constraints

use the exact Ablation Plan markdown structure with specified tables and sections
include name, what_it_tests, expected_if_component_matters, priority for each ablation
record results in EXPERIMENT_LOG.md

SUCCESS CRITERIA

Isolate the contribution of each novel component
Answer questions reviewers will definitely ask
Test sensitivity to key hyperparameters
Compare against natural alternative design choices
Every ablation must specify name, what_it_tests, expected_if_component_matters and priority

FAILURE MODES

May generate no-op ablations for baseline-identical components
May silently drop ablations when compute exceeds budget instead of proposing cuts
May prioritize code-change ablations over config-only ones

EXAMPLES

Includes one detailed spawn_agent message with ablation specification fields plus a full markdown output template containing four sample table rows.

CAVEATS

Dependencies

Requires project files (docs/research_contract.md, EXPERIMENT_LOG.md, EXPERIMENT_TRACKER.md, W&B, server notes)
Requires /result-to-claim output or project notes for claims
Requires user-provided compute budget or run configs

Missing context

Exact structure and location of files such as docs/research_contract.md and EXPERIMENT_LOG.md
Definition of the /result-to-claim command output format

Ambiguities

Does not specify exact mechanism or syntax for obtaining placeholder values such as [description from project files]
spawn_agent block uses fields (model, reasoning_effort) whose schema is not defined

QUALITY

OVERALL: 0.84
CLARITY: 0.88
SPECIFICITY: 0.92
REUSABILITY: 0.78
COMPLETENESS: 0.85

IMPROVEMENT SUGGESTIONS

Replace inline placeholders like [description from project files] with explicit, named template variables (e.g., {{method_description}})
Add a short 'Input Variables' section at the top listing every required external value
