agent planning skill risk: low

ML Ablation Study Planner

Designs ablation studies from a reviewer's perspective to isolate novel components, test hyperparameters, and answer expected questions, then parses results into tables and has CC…

External action: low

SKILL.md

---
name: ablation-planner
description: "Use when main results pass result-to-claim (claim_supported=yes or partial) and ablation studies are needed for paper submission. Codex designs ablations from a reviewer's perspective; CC reviews feasibility and implements."
---
# Ablation Planner

Systematically design ablation studies that answer the questions reviewers will ask. Codex leads the design from a reviewer's perspective; CC reviews feasibility and implements.

## Context: $ARGUMENTS

## When to Use

- Main results pass `/result-to-claim` with claim_supported = yes or partial
- User explicitly requests ablation planning
- `/auto-review-loop` reviewer identifies missing ablations

## Workflow

### Step 1: Prepare Context

CC reads available project files to build the full picture:
- Method description and components (from docs/research_contract.md or project CLAUDE.md)
- Current experiment results (from EXPERIMENT_LOG.md, EXPERIMENT_TRACKER.md, or W&B)
- Confirmed and intended claims (from result-to-claim output or project notes)
- Available compute resources (from CLAUDE.md server config, if present)
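The context-gathering step above can be sketched as follows. This is a minimal illustration, assuming the listed files live relative to a single project root. The `gather_context` helper and the `CONTEXT_SOURCES` mapping are hypothetical names, not part of the skill. Claims are left out of the mapping because the skill does not name a file for them.

```python
from pathlib import Path

# Hypothetical mapping from context item to candidate files, in lookup order.
# The paths mirror the ones named in Step 1; adjust them to the project layout.
CONTEXT_SOURCES = {
    "method": ["docs/research_contract.md", "CLAUDE.md"],
    "results": ["EXPERIMENT_LOG.md", "EXPERIMENT_TRACKER.md"],
    "compute": ["CLAUDE.md"],
}


def gather_context(root: str) -> tuple[dict[str, str], list[str]]:
    """Return (found, missing): the text of the first existing file per item,
    plus the items for which no candidate file exists."""
    base = Path(root)
    found: dict[str, str] = {}
    missing: list[str] = []
    for item, candidates in CONTEXT_SOURCES.items():
        for rel in candidates:
            path = base / rel
            if path.is_file():
                found[item] = path.read_text(encoding="utf-8")
                break
        else:
            # No candidate existed; report it instead of raising.
            missing.append(item)
    return found, missing
```

Returning the missing items separately lets the caller decide whether to ask the user for them before Step 2, rather than failing on an incomplete project.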

### Step 2: Codex Designs Ablations

```
mcp__codex__codex:
  config: {"model_reasoning_effort": "xhigh"}
  prompt: |
    You are a rigorous ML reviewer planning ablation studies.
    Given this method and results, design ablations that:

    1. Isolate the contribution of each novel component
    2. Answer questions reviewers will definitely ask
    3. Test sensitivity to key hyperparameters
    4. Compare against natural alternative design choices

    Method: [description from project files]
    Components: [list of removable/replaceable components]
    Current results: [key metrics from experiments]
    Claims: [what we claim and current evidence]

    For each ablation, specify:
    - name: what to change (e.g., "remove module X", "replace Y with Z")
    - what_it_tests: the specific question this answers
    - expected_if_component_matters: what we predict if the component is important
    - priority: 1 (must-run) to 5 (nice-to-have)

    Also provide:
    - coverage_assessment: what reviewer questions these ablations answer
    - unnecessary_ablations: experiments that seem useful but won't add insight
    - suggested_order: run order optimized for maximum early information
    - estimated_compute: total GPU-hours estimate
```
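One way to fill the bracketed placeholders in the prompt above is a small template helper. The sketch below uses Python's standard `string.Template` on a shortened copy of the prompt. The `$method`, `$components`, `$results`, and `$claims` field names and the `build_prompt` function are assumptions for illustration, not something the skill defines.

```python
from string import Template

# Shortened copy of the prompt; the $-style fields are hypothetical stand-ins
# for the bracketed placeholders in the original.
PROMPT_TEMPLATE = Template(
    "You are a rigorous ML reviewer planning ablation studies.\n"
    "Method: $method\n"
    "Components: $components\n"
    "Current results: $results\n"
    "Claims: $claims\n"
)


def build_prompt(method: str, components: list[str], results: str, claims: str) -> str:
    """Fill every field; Template.substitute raises KeyError if one is missing."""
    return PROMPT_TEMPLATE.substitute(
        method=method,
        components=", ".join(components),
        results=results,
        claims=claims,
    )
```

Using `substitute` rather than `safe_substitute` means an unfilled field raises an error instead of reaching Codex as a literal placeholder.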

### Step 3: Parse Ablation Plan

Normalize the Codex response into a structured format:

```markdown
## Ablation Plan

### Component Ablations (highest priority)
| # | Name | What It Tests | Expected If Matters | Priority |
|---|------|---------------|---------------------|----------|
| 1 | remove module X | contribution of X | performance drops on metric Y | 1 |
| 2 | replace X with simpler Z | value of learned vs fixed | drops, especially on dataset A | 2 |

### Hyperparameter Sensitivity
| # | Parameter | Values to Test | What It Tests | Priority |
|---|-----------|---------------|---------------|----------|
| 3 | lambda | [0.01, 0.1, 1.0] | sensitivity to regularization | 3 |

### Design Choice Comparisons
| # | Name | What It Tests | Priority |
|---|------|---------------|----------|
| 4 | joint vs separate matching | whether joint adds value | 4 |

### Coverage Assessment
[What reviewer questions these ablations answer]

### Unnecessary Ablations
[Experiments that seem useful but won't add insight — skip these]

### Run Order
[Optimized for maximum early information]

### Estimated Compute
[Total GPU-hours]
```
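As a rough sketch of the normalization, the component table can be rendered from structured entries. The `Ablation` dataclass and `render_component_table` function are hypothetical. The field names follow the per-ablation fields requested in the Codex prompt, and the column headers follow the table above.

```python
from dataclasses import dataclass


@dataclass
class Ablation:
    name: str
    what_it_tests: str
    expected_if_component_matters: str
    priority: int  # 1 (must-run) to 5 (nice-to-have)


def render_component_table(ablations: list[Ablation]) -> str:
    """Render ablations as a markdown table, lowest priority number first."""
    lines = [
        "| # | Name | What It Tests | Expected If Matters | Priority |",
        "|---|------|---------------|---------------------|----------|",
    ]
    ordered = sorted(ablations, key=lambda a: a.priority)
    for i, a in enumerate(ordered, start=1):
        # Note: a literal "|" inside a field would break the row; not handled here.
        lines.append(
            f"| {i} | {a.name} | {a.what_it_tests} | "
            f"{a.expected_if_component_matters} | {a.priority} |"
        )
    return "\n".join(lines)
```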

### Step 4: CC Reviews Feasibility

Before running anything, CC checks:
- Compute budget: can we afford all ablations with available GPUs?
- Code changes: which ablations need code modifications vs config-only changes?
- Dependencies: which ablations can run in parallel?
- Cuts: if budget is tight, propose removing lower-priority ablations and ask Codex to confirm
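The budget check and proposed cuts above can be sketched as a greedy pass over the plan. This assumes each ablation comes with its own GPU-hour estimate, which the skill does not require (Codex is only asked for a total). The `propose_cuts` name and the tuple layout are hypothetical.

```python
def propose_cuts(ablations, budget_gpu_hours):
    """Split ablations into (keep, cut, used_hours) under a GPU-hour budget.

    ablations: list of (name, priority, gpu_hours) tuples, where a lower
    priority number means more important. The result is only a proposal:
    per the workflow, Codex is asked to confirm the cuts.
    """
    keep, cut = [], []
    used = 0.0
    # Most important first; cheaper runs first within the same priority.
    for name, priority, hours in sorted(ablations, key=lambda a: (a[1], a[2])):
        if used + hours <= budget_gpu_hours:
            keep.append(name)
            used += hours
        else:
            cut.append(name)
    return keep, cut, used
```

The greedy pass keeps going after a run that does not fit, so a cheap low-priority ablation can survive while an expensive higher-priority one is cut. Whether that trade is acceptable is the kind of question to hand back to Codex.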

### Step 5: Implement and Run

1. Create configs/scripts for each ablation (config-only changes first)
2. Smoke test each ablation before full run
3. Run in suggested order, using descriptive names (e.g., `ablation-no-module-X`)
4. Track results in EXPERIMENT_LOG.md
5. After all ablations complete, update findings.md with insights
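A small sketch of the naming and logging steps above, assuming run names are derived from the ablation name and that results go into EXPERIMENT_LOG.md as markdown table rows. Both `run_name` and `log_entry`, and the row layout, are hypothetical. The skill specifies only the `ablation-...` naming style and the log file.

```python
import re


def run_name(ablation_name: str) -> str:
    """Turn 'no module X' into 'ablation-no-module-X' (case is preserved)."""
    slug = re.sub(r"[^A-Za-z0-9]+", "-", ablation_name).strip("-")
    return f"ablation-{slug}"


def log_entry(ablation_name: str, metric: str, baseline: float, ablated: float) -> str:
    """Format one result row (hypothetical layout), including the signed delta.

    A delta near zero is still logged: a removal with no effect is a finding.
    """
    delta = ablated - baseline
    return (
        f"| {run_name(ablation_name)} | {metric} | "
        f"{baseline:.3f} | {ablated:.3f} | {delta:+.3f} |"
    )
```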

## Rules

- **Codex leads the design. CC does not pre-filter or bias the ablation list** before Codex sees it. Codex thinks like a reviewer; CC thinks like an engineer.
- Every ablation must have a clear `what_it_tests` and `expected_if_component_matters`. No "just try it" experiments.
- Config-only ablations take priority over those needing code changes (faster, less error-prone).
- If total compute exceeds budget, CC proposes cuts and asks Codex to re-prioritize — don't silently drop ablations.
- Component ablations (remove/replace) take priority over hyperparameter sweeps.
- Do not generate ablations for components identical to the baseline (no-op ablations).
- Record all ablation results in EXPERIMENT_LOG.md, including negative results (a component whose removal has no effect is an important finding).
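Some of these rules can be checked mechanically before any run is launched. The sketch below covers the required-field rule and the 1 to 5 priority range from the Codex prompt. It assumes each ablation is a plain dictionary keyed by the field names used in that prompt, and the `validate_ablation` helper is hypothetical.

```python
REQUIRED_FIELDS = ("name", "what_it_tests", "expected_if_component_matters")


def validate_ablation(entry: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the entry passes."""
    problems = []
    for field in REQUIRED_FIELDS:
        # Treat absent, None, and whitespace-only values as missing.
        if not str(entry.get(field) or "").strip():
            problems.append(f"missing {field}")
    priority = entry.get("priority")
    is_int = isinstance(priority, int) and not isinstance(priority, bool)
    if not is_int or not 1 <= priority <= 5:
        problems.append("priority must be an integer from 1 to 5")
    return problems
```

An entry with no `what_it_tests` or no `expected_if_component_matters` is exactly the "just try it" experiment the rules exclude, so it is reported rather than silently repaired.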

INPUTS

$ARGUMENTS (required): context for the ablation planning task

REQUIRED CONTEXT

method description and components
current experiment results
confirmed and intended claims

OPTIONAL CONTEXT

available compute resources

TOOLS REQUIRED

codex

ROLES & RULES

Role assignments

You are a rigorous ML reviewer planning ablation studies.

Codex leads the design. CC does not pre-filter or bias the ablation list before Codex sees it.
Every ablation must have a clear `what_it_tests` and `expected_if_component_matters`.
Config-only ablations take priority over those needing code changes.
If total compute exceeds budget, CC proposes cuts and asks Codex to re-prioritize.
Component ablations take priority over hyperparameter sweeps.
Do not generate ablations for components identical to the baseline.
Record all ablation results in EXPERIMENT_LOG.md, including negative results.

EXPECTED OUTPUT

Format

markdown

Schema

markdown_sections · Ablation Plan, Component Ablations, Hyperparameter Sensitivity, Design Choice Comparisons, Coverage Assessment, Unnecessary Ablations, Run Order, Estimated Compute

Constraints

use specified table structure for component ablations, hyperparameter sensitivity, and design comparisons
include coverage_assessment, unnecessary_ablations, run_order, and estimated_compute sections

SUCCESS CRITERIA

Isolate the contribution of each novel component
Answer questions reviewers will definitely ask
Test sensitivity to key hyperparameters
Compare against natural alternative design choices

EXAMPLES

Includes one detailed example of the final normalized ablation plan in markdown with multiple tables and sections.

CAVEATS

Dependencies

Requires available project files (research_contract.md, CLAUDE.md, EXPERIMENT_LOG.md)
Requires result-to-claim output or project notes
Requires current experiment results and compute resources info

Missing context

Exact schema or example content for $ARGUMENTS
How to access or format the Codex tool call in different environments

Ambiguities

Context: $ARGUMENTS placeholder has no specified format or content requirements.
References to specific files (docs/research_contract.md, EXPERIMENT_LOG.md, CLAUDE.md) assume a fixed project structure without defining alternatives.

QUALITY

OVERALL: 0.85
CLARITY: 0.85
SPECIFICITY: 0.90
REUSABILITY: 0.80
COMPLETENESS: 0.85

IMPROVEMENT SUGGESTIONS

Replace the inline Codex prompt placeholders with explicit variables (e.g., {{method_description}}, {{components_list}}) so the template can be instantiated without manual editing.
Add a short 'Input contract' section listing the minimum files or data that must exist before the workflow runs.
