agent operations skill risk: low
ARIS Meta-Optimization Log Analyzer
Analyze ARIS usage logs from .aris/meta/events.jsonl to identify patterns in skill invocations, failures, and overrides, then generate ranked optimization opportunities and concret…
- External action: low
SKILL 1 file
SKILL.md
---
name: meta-optimize
description: "Analyze ARIS usage logs and propose optimizations to SKILL.md files, reviewer prompts, and workflow defaults. Outer-loop harness optimization inspired by Meta-Harness (Lee et al., 2026). Use when user says /\"优化技能/\", /\"meta optimize/\", /\"improve skills/\", /\"分析使用记录/\", or wants to optimize ARIS's own h"
---
# Meta-Optimize: Outer-Loop Harness Optimization for ARIS
Analyze accumulated usage logs and propose optimizations for: **$ARGUMENTS**
## Context
ARIS is a **research harness** — a system of skills, bridges, workflows, and artifact contracts that wraps around LLMs to orchestrate research. This skill implements a prototype **outer loop** that observes how the harness is used and proposes improvements to the harness itself (not to the research artifacts it produces).
Inspired by Meta-Harness (Lee et al., 2026): the key insight is that harness design matters as much as model weights, and harness engineering can be partially automated by logging execution traces and using them to guide improvements.
## What This Skill Optimizes (Harness Components)
| Component | Example | Optimizable? |
|-----------|---------|:---:|
| SKILL.md prompts | Reviewer instructions, quality gates, step descriptions | Yes |
| Default parameters | `difficulty: medium`, `MAX_ROUNDS: 4`, `threshold: 6/10` | Yes |
| Convergence rules | When to stop the review loop, retry counts | Yes |
| Workflow ordering | Skill chain sequence within a workflow | Yes |
| Artifact schemas | What fields go in EXPERIMENT_LOG.md, idea-stage/IDEA_REPORT.md | Cautious |
| MCP bridge config | Which reviewer model, routing rules | No (infra) |
**Not optimized**: The research artifacts themselves (papers, code, experiments). That's what the regular workflows do.
## Prerequisites
1. **Logging must be active.** Copy `templates/claude-hooks/meta_logging.json` into your project's `.claude/settings.json` (or merge the hooks section).
2. **Sufficient data.** At least 5 complete workflow runs logged in `.aris/meta/events.jsonl`. The skill will check and warn if insufficient.
## Workflow
### Step 0: Check Data Availability
```bash
EVENTS_FILE=".aris/meta/events.jsonl"
if [ ! -f "$EVENTS_FILE" ]; then
echo "ERROR: No event log found at $EVENTS_FILE"
echo "Enable logging first: copy templates/claude-hooks/meta_logging.json into .claude/settings.json"
exit 1
fi
EVENT_COUNT=$(wc -l < "$EVENTS_FILE")
SKILL_INVOCATIONS=$(grep -c '"skill_invoke"' "$EVENTS_FILE" || echo 0)
SESSIONS=$(grep -c '"session_start"' "$EVENTS_FILE" || echo 0)
echo "📊 Event log: $EVENT_COUNT events, $SKILL_INVOCATIONS skill invocations, $SESSIONS sessions"
if [ "$SKILL_INVOCATIONS" -lt 5 ]; then
echo "⚠️ Insufficient data (<5 skill invocations). Continue using ARIS normally and re-run later."
exit 0
fi
```
### Step 1: Analyze Usage Patterns
Read `.aris/meta/events.jsonl` and compute:
**Frequency analysis:**
- Which skills are invoked most often?
- Which slash commands do users type most?
- What parameter overrides are most common? (These suggest bad defaults.)
**Failure analysis:**
- Which tools fail most often? In which skills?
- What error patterns repeat? (OOM, import, compilation, timeout)
- How many auto-debug retries per workflow run?
**Convergence analysis (for auto-review-loop):**
- Average rounds to reach threshold
- Score trajectory shape (fast improvement? plateau? oscillation?)
- Which review round catches the most critical issues?
- Do users override difficulty mid-run?
**Human intervention analysis:**
- Where do users interrupt with manual prompts during workflows?
- What manual corrections do users make most? (These indicate skill gaps.)
Present findings as a structured summary table.
### Step 2: Identify Optimization Targets
Based on Step 1, rank optimization opportunities by expected impact:
```markdown
## Optimization Opportunities (ranked)
| # | Target | Signal | Proposed Change | Expected Impact |
|---|--------|--------|-----------------|-----------------|
| 1 | auto-review-loop default threshold | Users override to 7/10 in 60% of runs | Change default from 6/10 to 7/10 | Fewer manual overrides |
| 2 | experiment-bridge retry count | 40% of runs hit max retries on OOM | Add OOM-specific recovery (reduce batch size) | Fewer failed experiments |
| 3 | paper-write de-AI patterns | Users manually fix "delve" in 80% of runs | Add "delve" to default watchword list | Fewer manual edits |
```
If `$ARGUMENTS` specifies a target skill, focus analysis on that skill only.
If `$ARGUMENTS` is empty or "all", analyze all skills with sufficient data.
### Step 3: Generate Patch Proposals
For each optimization target, generate a concrete diff:
```diff
--- a/skills/auto-review-loop/SKILL.md
+++ b/skills/auto-review-loop/SKILL.md
@@ -15,7 +15,7 @@
## Constants
-- **SCORE_THRESHOLD = 6** — Minimum review score to accept.
+- **SCORE_THRESHOLD = 7** — Minimum review score to accept. (Raised based on usage data: 60% of users overrode to 7+.)
```
**Rules for patch generation:**
- One patch per optimization target
- Each patch must include a comment explaining WHY (with data from the log)
- Patches must be minimal — change only what the data supports
- Never change artifact schemas or MCP bridge config in v1
- Never change behavior that would break existing user workflows
### Step 4: Cross-Model Review of Patches
Send each patch to GPT-5.4 xhigh for adversarial review:
```
mcp__codex__codex:
model: gpt-5.5
config: {"model_reasoning_effort": "xhigh"}
prompt: |
You are reviewing a proposed optimization to an ARIS SKILL.md file.
## Original Skill (relevant section)
[paste original]
## Proposed Patch
[paste diff]
## Evidence from Usage Log
[paste summary stats]
Review this patch:
1. Does the evidence support the change?
2. Could this change hurt other use cases?
3. Is the change minimal and safe?
4. Score 1-10: should this be applied?
If score < 7, explain what additional evidence would be needed.
```
### Step 5: Present Results
Output a structured report:
```markdown
# ARIS Meta-Optimization Report
**Date**: [today]
**Data**: [N] events, [M] skill invocations, [K] sessions
**Target**: [skill name or "all"]
## Proposed Changes
### Change 1: [title]
- **Target**: [skill/file:line]
- **Signal**: [what the data shows]
- **Patch**: [diff]
- **Reviewer Score**: [X/10]
- **Reviewer Notes**: [summary]
- **Status**: ✅ Recommended / ⚠️ Needs more data / ❌ Rejected
### Change 2: ...
## Changes NOT Made (insufficient evidence)
- [pattern observed but too few samples]
## Recommendations
- [ ] Apply Change 1 (reviewer approved)
- [ ] Collect more data for Change 3 (need N more runs)
- [ ] Consider manual review of Change 2
## Next Steps
Run `/meta-optimize apply 1` to apply a specific change, or
`/meta-optimize apply all` to apply all recommended changes.
```
### Step 6: Apply Changes (if user approves)
If user runs `/meta-optimize apply [N]`:
1. Back up original SKILL.md to `.aris/meta/backups/`
2. Apply the patch
3. Log the change to `.aris/meta/optimizations.jsonl`
4. Remind user to test the changed skill on their next run
**Never auto-apply without user approval.**
## Key Rules
- **Log-driven, not speculative.** Every proposed change must cite specific data from the event log. No "I think this would be better."
- **Minimal patches.** Change one thing at a time. Don't rewrite entire skills.
- **Reviewer-gated.** Every patch goes through cross-model review before recommendation.
- **Reversible.** Always back up before applying. Always log what changed.
- **User-approved.** Never auto-apply. Present, explain, let the user decide.
- **Honest about uncertainty.** If the data is insufficient, say so. Don't optimize on noise.
- **Portable.** Optimizations should improve the skill for all users, not just one user's style. If a change seems user-specific, flag it.
## Event Schema Reference
The log at `.aris/meta/events.jsonl` contains JSONL records with these shapes:
```jsonl
{"ts":"...","session":"...","event":"skill_invoke","skill":"auto-review-loop","args":"difficulty: hard"}
{"ts":"...","session":"...","event":"PostToolUse","tool":"Bash","input_summary":"pdflatex main.tex"}
{"ts":"...","session":"...","event":"codex_call","tool":"mcp__codex__codex","input_summary":"review..."}
{"ts":"...","session":"...","event":"tool_failure","tool":"Bash","input_summary":"python train.py"}
{"ts":"...","session":"...","event":"slash_command","command":"/auto-review-loop","args":""}
{"ts":"...","session":"...","event":"user_prompt","prompt_preview":"change difficulty to hard"}
{"ts":"...","session":"...","event":"session_start","source":"startup","model":"claude-opus-4-6"}
{"ts":"...","session":"...","event":"session_end"}
```
## Triggering
This skill is NOT part of the standard W1→W1.5→W2→W3→W4 pipeline. It is a **maintenance workflow** with three trigger mechanisms:
1. **Passive logging** (always on): Claude Code hooks record events to `.aris/meta/events.jsonl` automatically during normal usage. Zero user effort.
2. **Automatic readiness check** (SessionEnd hook): When a Claude Code session ends, `check_ready.sh` counts skill invocations since the last `/meta-optimize` run. If ≥5 new invocations have accumulated, it prints a reminder:
```
📊 ARIS has logged 8 skill runs since last optimization. Run /meta-optimize to check for improvement opportunities.
```
This is a **suggestion only** — it does not auto-run optimization.
3. **Manual trigger**: User runs `/meta-optimize` when they see the reminder or whenever they want.
**After each `/meta-optimize` run**, the skill writes the current timestamp to `.aris/meta/.last_optimize` so the readiness check only counts new invocations.
## Acknowledgements
Inspired by [Meta-Harness](https://arxiv.org/abs/2603.28052) (Lee et al., 2026) — end-to-end optimization of model harnesses via filesystem-based experience access and agentic code search.
## Output Protocols
> Follow these shared protocols for all output files:
> - **[Output Versioning Protocol](../shared-references/output-versioning.md)** — write timestamped file first, then copy to fixed name
> - **[Output Manifest Protocol](../shared-references/output-manifest.md)** — log every output to MANIFEST.md
> - **[Output Language Protocol](../shared-references/output-language.md)** — respect the project's language setting
## Review Tracing
After each `mcp__codex__codex` or `mcp__codex__codex-reply` reviewer call, save the trace following `shared-references/review-tracing.md` (Policy C — forensic; never silently skip). Use `save_trace.sh` (resolved per the chain in `shared-references/integration-contract.md` §2) or write files directly to `.aris/traces/<skill>/<date>_run<NN>/`. Respect the `--- trace:` parameter (default: `full`).
INPUTS
- $ARGUMENTS
target skill name or 'all'
e.g. auto-review-loop
REQUIRED CONTEXT
- .aris/meta/events.jsonl usage log
OPTIONAL CONTEXT
- $ARGUMENTS (target skill or 'all')
TOOLS REQUIRED
- mcp__codex__codex
ROLES & RULES
- Log-driven, not speculative. Every proposed change must cite specific data from the event log.
- Minimal patches. Change one thing at a time. Don't rewrite entire skills.
- Reviewer-gated. Every patch goes through cross-model review before recommendation.
- Reversible. Always back up before applying. Always log what changed.
- User-approved. Never auto-apply. Present, explain, let the user decide.
- Honest about uncertainty. If the data is insufficient, say so.
- Portable. Optimizations should improve the skill for all users, not just one user's style.
- Never change artifact schemas or MCP bridge config in v1.
- Never change behavior that would break existing user workflows.
- Never auto-apply without user approval.
EXPECTED OUTPUT
- Format
- markdown
- Schema
- markdown_sections · ARIS Meta-Optimization Report, Proposed Changes, Change 1, Changes NOT Made (insufficient evidence), Recommendations, Next Steps
- Constraints
- structured report with specific sections
- include diffs and reviewer scores
- cite data from event log
SUCCESS CRITERIA
- Every proposed change must cite specific data from the event log
- Patches must be minimal and include a WHY comment with log data
- All patches must pass cross-model reviewer scoring >=7/10
- Present findings as structured summary tables
EXAMPLES
Includes event schema JSONL examples, optimization opportunity tables, patch diffs, structured report templates, and workflow step code blocks.
CAVEATS
- Dependencies
- Requires .aris/meta/events.jsonl with at least 5 skill invocations
- Requires templates/claude-hooks/meta_logging.json copied to .claude/settings.json
- Requires previous workflow runs for data analysis
- Missing context
- ARIS project root and exact file locations for templates/, .aris/, shared-references/
- Ambiguities
- Description field is truncated: "or wants to optimize ARIS's own h"
QUALITY
- OVERALL
- 0.68
- CLARITY
- 0.78
- SPECIFICITY
- 0.92
- REUSABILITY
- 0.28
- COMPLETENESS
- 0.93
IMPROVEMENT SUGGESTIONS
- Replace hardcoded paths (.aris/meta/events.jsonl, templates/claude-hooks/...) with explicit placeholders or variables so the prompt can be reused in other projects.
- Complete the truncated sentence in the name/description header.
USAGE
Copy the prompt above and paste it into your AI of choice — Claude, ChatGPT, Gemini, or anywhere else you're working. Replace any placeholder sections with your own context, then ask for the output.
MORE FOR AGENT
- Local Documentation Online Sync Automatoragentoperations
- HashiCorp Packer Golden Image Expertagentoperations
- ML Experiment GPU Deployment Workflowagentoperations
- Codex Training Metrics Monitoragentoperations
- Context Optimization Techniques Guideagentoperations
- Issue Triage State Machineagentoperations
- ML Experiment Results Monitoragentoperations
- DOCX Document Creation Editing Guideagentoperations
- Repo Agent Skills Configuration Setupagentoperations
- Git Worktree Isolated Workspace Setupagentoperations
- Agent Context Compression Strategiesagentoperations
- Parallel Agent Dispatcher for Independent Tasksagentoperations
- Scientific Computing Resource Detectoragentoperations
- PPTX File Handling Skill Guideagentoperations
- Interactive QA GitHub Issue Fileragentoperations
- Sprint Retrospective Facilitatoragentoperations
- Agent Skill Writing Guideagentoperations
- Brilliant Directories Rube MCP Automation Guideagentoperations
- Istio Linkerd Service Mesh Expertagentoperations
- Machine Learning Experiment Monitoragentoperations
- Benchling Python SDK Integrationagentoperations
- Blackbaud Automation via Rube MCPagentoperations
- DigitalOcean Automation via Rube MCPagentoperations
- Service Mesh Architecture Expertagentoperations
- WandB Training Metrics Health Checkeragentoperations