agent evaluation skill risk: low
LLM Experiment Integrity Auditor
Collects file paths for evaluation scripts, results, and claims without reading content, then sends them to a separate reviewer model (GPT-5.4) to check for fraud patterns such as…
- External action: low
SKILL 1 file
SKILL.md
---
name: auto-claude-code-research-in-sleep-experiment-audit
description: "Audit experiment integrity before claiming results. Uses cross-model review (GPT-5.4) to check for fake ground truth, score normalization fraud, phantom results, and insufficient scope. Use when user says /\"审计实验/\", /\"check experiment integrity/\", /\"audit results/\", /\"实验诚实度/\", or after experiments co"
---
# Experiment Audit: Cross-Model Integrity Verification
Audit experiment integrity for: **$ARGUMENTS**
## Why This Exists
LLM agents can produce fraudulent experimental results through:
1. **Fake ground truth** — creating synthetic "reference" from model outputs, then reporting high agreement as performance
2. **Score normalization** — dividing metrics by the model's own max to get 0.99+
3. **Phantom results** — claiming numbers from files that don't exist or functions never called
4. **Insufficient scope** — reporting 2-scene pilots as "comprehensive evaluation"
These are NOT intentional deception — they are failure modes of optimizing agents that lack integrity constraints. This skill adds that constraint.
## Core Principle
**The executor (Claude) collects file paths. The reviewer (GPT-5.4) reads code and judges integrity. The executor does NOT participate in integrity judgment.**
This follows `shared-references/reviewer-independence.md` and `shared-references/experiment-integrity.md`.
## Constants
- **REVIEWER_BACKEND = `codex`** — Default: Codex MCP (xhigh). Override with `— reviewer: oracle-pro` for GPT-5.4 Pro via Oracle MCP. See `shared-references/reviewer-routing.md`.
## Workflow
### Step 1: Collect Artifacts (Executor — Claude)
Locate and list these files WITHOUT reading or summarizing their content:
```
Scan project directory for:
1. Evaluation scripts: *eval*.py, *metric*.py, *test*.py, *benchmark*.py
2. Result files: *.json, *.csv in results/, outputs/, logs/
3. Ground truth paths: look in eval scripts for data loading (dataset paths, GT references)
4. Experiment tracker: EXPERIMENT_TRACKER.md, EXPERIMENT_LOG.md
5. Paper claims: NARRATIVE_REPORT.md, paper/sections/*.tex, PAPER_PLAN.md
6. Config files: *.yaml, *.toml, *.json configs with metric definitions
```
**DO NOT summarize, interpret, or explain any file content.** Only collect paths.
### Step 2: Send to Reviewer (GPT-5.4 via Codex MCP)
Pass ONLY file paths and the audit checklist to the reviewer. The reviewer reads everything directly.
```
mcp__codex__codex:
model: gpt-5.5
config: {"model_reasoning_effort": "xhigh"}
sandbox: read-only
cwd: [project directory]
prompt: |
You are an experiment integrity auditor. Read ALL files listed below
and check for the following fraud patterns.
Files to read:
- Evaluation scripts: [list paths]
- Result files: [list paths]
- Experiment tracker: [list paths]
- Paper claims: [list paths]
- Config files: [list paths]
## Audit Checklist
### A. Ground Truth Provenance
For each evaluation script:
1. Where does "ground truth" / "reference" / "target" come from?
2. Is it loaded from the DATASET, or generated/derived from MODEL OUTPUTS?
3. If derived: is it explicitly labeled as proxy evaluation?
4. Are official eval scripts used when available for this benchmark?
FAIL if: GT is derived from model outputs without explicit proxy labeling.
### B. Score Normalization
For each metric computation:
1. Is any metric divided by max/min/mean of the model's OWN output?
2. Are raw scores reported alongside any normalized scores?
3. Are any scores suspiciously close to 1.0 or 100%?
FAIL if: Normalization denominator comes from prediction statistics.
### C. Result File Existence
For each claim in the paper/narrative:
1. Does the referenced result file actually exist?
2. Does the claimed metric key exist in that file?
3. Does the claimed NUMBER match what's in the file?
4. Is the experiment tracker status DONE (not TODO/IN_PROGRESS)?
FAIL if: Claimed results reference nonexistent files or mismatched numbers.
### D. Dead Code Detection
For each metric function defined in eval scripts:
1. Is it actually CALLED in any evaluation pipeline?
2. Does its output appear in any result file?
WARN if: Metric functions exist but are never called.
### E. Scope Assessment
1. How many scenes/datasets/configurations were actually tested?
2. How many seeds/runs per configuration?
3. Does the paper use words like "comprehensive", "extensive", "robust"?
4. Is the actual scope sufficient for those claims?
WARN if: Scope language exceeds actual evidence.
### F. Evaluation Type Classification
Classify each evaluation as:
- real_gt: uses dataset-provided ground truth
- synthetic_proxy: uses model-generated reference
- self_supervised_proxy: no GT by design
- simulation_only: simulated environment
- human_eval: human judges
## Output Format
For each check (A-F), report:
- Status: PASS | WARN | FAIL
- Evidence: exact file:line references
- Details: what specifically was found
Overall verdict: PASS | WARN | FAIL
Be thorough. Read every eval script line by line.
```
### Step 3: Parse and Write Report (Executor — Claude)
Parse the reviewer's response and write `EXPERIMENT_AUDIT.md`:
```markdown
# Experiment Audit Report
**Date**: [today]
**Auditor**: GPT-5.4 xhigh (cross-model, read-only)
**Project**: [project name]
## Overall Verdict: [PASS | WARN | FAIL]
## Integrity Status: [pass | warn | fail]
## Checks
### A. Ground Truth Provenance: [PASS|WARN|FAIL]
[details + file:line evidence]
### B. Score Normalization: [PASS|WARN|FAIL]
[details]
### C. Result File Existence: [PASS|WARN|FAIL]
[details]
### D. Dead Code Detection: [PASS|WARN|FAIL]
[details]
### E. Scope Assessment: [PASS|WARN|FAIL]
[details]
### F. Evaluation Type: [real_gt | synthetic_proxy | ...]
[classification + evidence]
## Action Items
- [specific fixes if WARN or FAIL]
## Claim Impact
- Claim 1: [supported | needs qualifier | unsupported]
- Claim 2: ...
```
Also write `EXPERIMENT_AUDIT.json` for machine consumption:
```json
{
"date": "2026-04-10",
"auditor": "gpt-5.5-xhigh",
"overall_verdict": "warn",
"integrity_status": "warn",
"checks": {
"gt_provenance": {"status": "pass", "details": "..."},
"score_normalization": {"status": "warn", "details": "..."},
"result_existence": {"status": "pass", "details": "..."},
"dead_code": {"status": "pass", "details": "..."},
"scope": {"status": "warn", "details": "..."},
"eval_type": "real_gt"
},
"claims": [
{"id": "C1", "impact": "supported"},
{"id": "C2", "impact": "needs_qualifier"}
]
}
```
### Step 4: Print Summary
```
🔬 Experiment Audit Complete
GT Provenance: ✅ PASS — real dataset GT used
Score Normalization: ⚠️ WARN — boundary metric uses self-reference
Result Existence: ✅ PASS — all files exist, numbers match
Dead Code: ✅ PASS — all metric functions called
Scope: ⚠️ WARN — 2 scenes, paper says "comprehensive"
Overall: ⚠️ WARN
See EXPERIMENT_AUDIT.md for details.
```
## Integration with Other Skills
### Automatic in /research-pipeline (advisory, never blocks)
When integrated into the pipeline, this skill runs automatically after `/experiment-bridge` and before `/auto-review-loop`:
```
/experiment-bridge → results ready
↓
/experiment-audit (automatic, advisory)
├── PASS → continue normally
├── WARN → print ⚠️ warning, continue, tag claims as [INTEGRITY: WARN]
└── FAIL → print 🔴 alert, continue, tag claims as [INTEGRITY CONCERN]
↓
/auto-review-loop → proceeds with integrity tags visible to reviewer
```
**Never blocks the pipeline.** Even on FAIL, the pipeline continues — but claims carry visible integrity tags.
### Read by /result-to-claim (if exists)
```
if EXPERIMENT_AUDIT.json exists:
read integrity_status
attach to verdict: {claim_supported: "yes", integrity_status: "warn"}
if integrity_status == "fail":
downgrade verdict display: "yes [INTEGRITY CONCERN]"
else:
verdict as normal, integrity_status = "unavailable"
mark as "provisional — no integrity audit"
```
### Read by /paper-write (if exists)
```
if EXPERIMENT_AUDIT.json exists AND integrity_status == "fail":
add footnote to affected claims: "Note: integrity audit flagged concerns with this evaluation"
```
## Key Rules
- **Reviewer independence**: executor collects paths, reviewer judges. Period.
- **Never block**: warn loudly, never halt the pipeline.
- **File-as-switch**: no EXPERIMENT_AUDIT.md = skill was never run = zero impact on existing behavior.
- **Cross-model**: the reviewer MUST be a different model family from the executor.
- **Honest about limits**: the audit catches common patterns, not all possible fraud. It is a safety net, not a guarantee.
## Acknowledgements
Motivated by community-reported integrity issues (#57, #131) where executor agents created fake ground truth and self-normalized scores.
## Review Tracing
After each `mcp__codex__codex` or `mcp__codex__codex-reply` reviewer call, save the trace following `shared-references/review-tracing.md` (Policy C — forensic; never silently skip). Use `save_trace.sh` (resolved per the chain in `shared-references/integration-contract.md` §2) or write files directly to `.aris/traces/<skill>/<date>_run<NN>/`. Respect the `--- trace:` parameter (default: `full`).
INPUTS
- $ARGUMENTS REQUIRED
experiment identifier or description to audit
REQUIRED CONTEXT
- project directory containing eval scripts, result files, and paper claims
OPTIONAL CONTEXT
- reviewer backend override
- EXPERIMENT_TRACKER.md
- NARRATIVE_REPORT.md
TOOLS REQUIRED
- mcp__codex__codex
ROLES & RULES
Role assignments
- You are an experiment integrity auditor.
- Locate and list these files WITHOUT reading or summarizing their content
- DO NOT summarize, interpret, or explain any file content. Only collect paths.
- Pass ONLY file paths and the audit checklist to the reviewer.
- Be thorough. Read every eval script line by line.
- Never blocks the pipeline.
- Reviewer independence: executor collects paths, reviewer judges. Period.
- Cross-model: the reviewer MUST be a different model family from the executor.
- After each mcp__codex__codex or mcp__codex__codex-reply reviewer call, save the trace following shared-references/review-tracing.md
EXPECTED OUTPUT
- Format
- structured_report
- Schema
- markdown_sections · Overall Verdict, Integrity Status, Checks A-F, Action Items, Claim Impact, EXPERIMENT_AUDIT.json structure
- Constraints
- produce EXPERIMENT_AUDIT.md in specified markdown format
- produce EXPERIMENT_AUDIT.json in specified schema
- print terminal summary with emoji status
- executor only collects paths, reviewer performs all judgment
SUCCESS CRITERIA
- Report Status: PASS | WARN | FAIL with exact file:line evidence for each check A-F
- Classify each evaluation type
- Write EXPERIMENT_AUDIT.md and EXPERIMENT_AUDIT.json
- Print summary with emoji status
CAVEATS
- Dependencies
- $ARGUMENTS
- project directory
- shared-references/reviewer-independence.md
- shared-references/experiment-integrity.md
- shared-references/reviewer-routing.md
- shared-references/review-tracing.md
- shared-references/integration-contract.md
- Missing context
- Project directory path or how it is discovered
- Exact mechanism for invoking the Codex MCP call (tool name syntax)
- Ambiguities
- Description field cuts off mid-sentence: "or after experiments co"
- Does not specify how $ARGUMENTS should be parsed or formatted when multiple items are supplied
- Output date example uses a hardcoded future date (2026-04-10) without indicating source
QUALITY
- OVERALL
- 0.86
- CLARITY
- 0.78
- SPECIFICITY
- 0.92
- REUSABILITY
- 0.85
- COMPLETENESS
- 0.88
IMPROVEMENT SUGGESTIONS
- Add a short "Usage" section showing example invocations with $ARGUMENTS
- Replace the literal future date in the JSON example with a placeholder such as {{date}}
- Clarify whether the reviewer model string (gpt-5.5) is fixed or should also be parameterized
USAGE
Copy the prompt above and paste it into your AI of choice — Claude, ChatGPT, Gemini, or anywhere else you're working. Replace any placeholder sections with your own context, then ask for the output.
MORE FOR AGENT
- Advanced LLM-as-Judge Evaluationagentevaluation
- Context Degradation Pattern Diagnoseragentevaluation
- Agent Evaluation Framework Builderagentevaluation
- Experiment Results Claim Evaluation Gateagentevaluation
- NVIDIA Cosmos Policy LIBERO RoboCasa Evaluatoragentevaluation
- Nielsen Heuristics Mobile UX Auditoragentevaluation
- Scientific Manuscript Peer Revieweragentevaluation
- Nielsen Heuristics Mobile UX Auditoragentevaluation
- UX/UI Principles Agent Skillsagentevaluation
- Nielsen Heuristics Mobile UX Auditoragentevaluation
- ScholarEval Academic Research Evaluatoragentevaluation
- AI Agent Reasoning Trace Optimizeragentevaluation
- Experiment Integrity Cross-Model Auditoragentevaluation
- Research Result-to-Claim Evaluatoragentevaluation
- Web Interface Guidelines File Revieweragentevaluation
- UX/UI Principles Interface Evaluatoragentevaluation
- Comprehensive Final Review Checklist Assessoragentevaluation
- UX/UI Principles Agent Skillsagentevaluation
- Conductor Project Artifacts Validatoragentevaluation
- Comprehensive Codebase Bug Analysis and Fixeragentanalysis
- Xcode MCP Usage Guidelines for Agentsagenttool_use
- Xcode MCP Usage Guidelinesagenttool_use
- Rapid App MVP Prototyperagentcoding
- Local Documentation Online Sync Automatoragentoperations
- HashiCorp Packer Golden Image Expertagentoperations