agent evaluation skill risk: low
NVIDIA Cosmos Policy LIBERO RoboCasa Evaluator
Provides installation steps, environment configuration, evaluation commands, result parsing, and troubleshooting for running NVIDIA Cosmos Policy on LIBERO and RoboCasa simulation…
SKILL 3 files · 1 folder
SKILL.md
---
name: evaluating-cosmos-policy
description: "Evaluates NVIDIA Cosmos Policy on LIBERO and RoboCasa simulation environments. Use when setting up cosmos-policy for robot manipulation evaluation, running headless GPU evaluations with EGL rendering, or profiling inference latency on cluster or local GPU machines."
---
# Cosmos Policy Evaluation
Evaluation workflows for NVIDIA Cosmos Policy on LIBERO and RoboCasa simulation environments from the public `cosmos-policy` repository. Covers blank-machine setup, headless GPU evaluation, and inference profiling.
## Quick start
Run a minimal LIBERO evaluation using the official public eval module:
```bash
uv run --extra cu128 --group libero --python 3.10 \
python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
--config cosmos_predict2_2b_480p_libero__inference_only \
--ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
--config_file cosmos_policy/config/config.py \
--use_wrist_image True \
--use_proprio True \
--normalize_proprio True \
--unnormalize_actions True \
--dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
--t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
--trained_with_image_aug True \
--chunk_size 16 \
--num_open_loop_steps 16 \
--task_suite_name libero_10 \
--num_trials_per_task 1 \
--local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
--seed 195 \
--randomize_seed False \
--deterministic True \
--run_id_note smoke \
--ar_future_prediction False \
--ar_value_prediction False \
--use_jpeg_compression True \
--flip_images True \
--num_denoising_steps_action 5 \
--num_denoising_steps_future_state 1 \
--num_denoising_steps_value 1 \
--data_collection False
```
## Core concepts
**What Cosmos Policy is**: NVIDIA Cosmos Policy is a vision-language-action (VLA) model that uses Cosmos Tokenizer to encode visual observations into discrete tokens, then predicts robot actions conditioned on language instructions and visual context.
**Key architecture choices**:
| Component | Design |
|-----------|--------|
| Visual encoder | Cosmos Tokenizer (discrete tokens) |
| Language conditioning | Cross-attention to language embeddings |
| Action prediction | Autoregressive action token generation |
**Public command surface**: The supported evaluation entrypoints are `cosmos_policy.experiments.robot.libero.run_libero_eval` and `cosmos_policy.experiments.robot.robocasa.run_robocasa_eval`. Keep reproduction notes anchored to these public modules and their documented flags.
## Compute requirements
| Task | GPU | VRAM | Typical wall time |
|------|-----|------|-------------------|
| LIBERO smoke eval (1 trial) | 1x A40/A100 | ~16 GB | 5-10 min |
| LIBERO full eval (50 trials) | 1x A40/A100 | ~16 GB | 2-4 hours |
| RoboCasa single-task (2 trials) | 1x A40/A100 | ~18 GB | 10-15 min |
| RoboCasa all-tasks | 1x A40/A100 | ~18 GB | 4-8 hours |
## When to use vs alternatives
**Use this skill when:**
- Evaluating NVIDIA Cosmos Policy on LIBERO or RoboCasa benchmarks
- Profiling inference latency and throughput for Cosmos Policy
- Setting up headless EGL rendering for robot simulation on GPU clusters
**Use alternatives when:**
- Training or fine-tuning Cosmos Policy from scratch (use official Cosmos training docs)
- Working with OpenVLA-based policies (use `fine-tuning-openvla-oft`)
- Working with Physical Intelligence pi0 models (use `fine-tuning-serving-openpi`)
- Running real-robot evaluation rather than simulation
---
## Workflow 1: LIBERO evaluation
Copy this checklist and track progress:
```text
LIBERO Eval Progress:
- [ ] Step 1: Install environment and dependencies
- [ ] Step 2: Configure headless EGL rendering
- [ ] Step 3: Run smoke evaluation
- [ ] Step 4: Validate outputs and parse results
- [ ] Step 5: Run full benchmark if smoke passes
```
**Step 1: Install environment**
```bash
git clone https://github.com/NVlabs/cosmos-policy.git
cd cosmos-policy
# Follow SETUP.md to build and enter the supported Docker container.
# Then, inside the container:
uv sync --extra cu128 --group libero --python 3.10
```
**Step 2: Configure headless rendering**
```bash
export CUDA_VISIBLE_DEVICES=0
export MUJOCO_EGL_DEVICE_ID=0
export MUJOCO_GL=egl
export PYOPENGL_PLATFORM=egl
```
**Step 3: Run smoke evaluation**
```bash
uv run --extra cu128 --group libero --python 3.10 \
python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
--config cosmos_predict2_2b_480p_libero__inference_only \
--ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
--config_file cosmos_policy/config/config.py \
--use_wrist_image True \
--use_proprio True \
--normalize_proprio True \
--unnormalize_actions True \
--dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
--t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
--trained_with_image_aug True \
--chunk_size 16 \
--num_open_loop_steps 16 \
--task_suite_name libero_10 \
--num_trials_per_task 1 \
--local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
--seed 195 \
--randomize_seed False \
--deterministic True \
--run_id_note smoke \
--ar_future_prediction False \
--ar_value_prediction False \
--use_jpeg_compression True \
--flip_images True \
--num_denoising_steps_action 5 \
--num_denoising_steps_future_state 1 \
--num_denoising_steps_value 1 \
--data_collection False
```
**Step 4: Validate and parse results**
```python
import json
import glob
# Find latest evaluation result from the official log directory
log_files = sorted(glob.glob("cosmos_policy/experiments/robot/libero/logs/**/*.json", recursive=True))
with open(log_files[-1]) as f:
results = json.load(f)
print(results)
```
**Step 5: Scale up**
Run across all four LIBERO task suites with 50 trials:
```bash
for suite in libero_spatial libero_object libero_goal libero_10; do
uv run --extra cu128 --group libero --python 3.10 \
python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
--config cosmos_predict2_2b_480p_libero__inference_only \
--ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
--config_file cosmos_policy/config/config.py \
--use_wrist_image True \
--use_proprio True \
--normalize_proprio True \
--unnormalize_actions True \
--dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
--t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
--trained_with_image_aug True \
--chunk_size 16 \
--num_open_loop_steps 16 \
--task_suite_name "$suite" \
--num_trials_per_task 50 \
--local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
--seed 195 \
--randomize_seed False \
--deterministic True \
--run_id_note "suite_${suite}" \
--ar_future_prediction False \
--ar_value_prediction False \
--use_jpeg_compression True \
--flip_images True \
--num_denoising_steps_action 5 \
--num_denoising_steps_future_state 1 \
--num_denoising_steps_value 1 \
--data_collection False
done
```
---
## Workflow 2: RoboCasa evaluation
Copy this checklist and track progress:
```text
RoboCasa Eval Progress:
- [ ] Step 1: Install RoboCasa assets and verify macros
- [ ] Step 2: Run single-task smoke evaluation
- [ ] Step 3: Validate outputs
- [ ] Step 4: Expand to multi-task runs
```
**Step 1: Install RoboCasa**
```bash
git clone https://github.com/moojink/robocasa-cosmos-policy.git
uv pip install -e robocasa-cosmos-policy
python -m robocasa.scripts.setup_macros
python -m robocasa.scripts.download_kitchen_assets
```
This fork installs the `robocasa` Python package expected by Cosmos Policy while preserving the patched environment changes used in the public RoboCasa eval path. Verify `macros_private.py` exists and paths are correct.
**Step 2: Single-task smoke evaluation**
```bash
uv run --extra cu128 --group robocasa --python 3.10 \
python -m cosmos_policy.experiments.robot.robocasa.run_robocasa_eval \
--config cosmos_predict2_2b_480p_robocasa_50_demos_per_task__inference \
--ckpt_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B \
--config_file cosmos_policy/config/config.py \
--use_wrist_image True \
--num_wrist_images 1 \
--use_proprio True \
--normalize_proprio True \
--unnormalize_actions True \
--dataset_stats_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B/robocasa_dataset_statistics.json \
--t5_text_embeddings_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B/robocasa_t5_embeddings.pkl \
--trained_with_image_aug True \
--chunk_size 32 \
--num_open_loop_steps 16 \
--task_name TurnOffMicrowave \
--obj_instance_split A \
--num_trials_per_task 2 \
--local_log_dir cosmos_policy/experiments/robot/robocasa/logs/ \
--seed 195 \
--randomize_seed False \
--deterministic True \
--run_id_note smoke \
--use_variance_scale False \
--use_jpeg_compression True \
--flip_images True \
--num_denoising_steps_action 5 \
--num_denoising_steps_future_state 1 \
--num_denoising_steps_value 1 \
--data_collection False
```
**Step 3: Validate outputs**
- Confirm the eval log prints the expected task name, object split, and checkpoint/config values.
- Inspect the final `Success rate:` line in the log.
**Step 4: Expand scope**
Increase `--num_trials_per_task` or add more tasks. Keep `--obj_instance_split` fixed across repeated runs for comparability.
---
## Workflow 3: Blank-machine cluster launch
```text
Cluster Launch Progress:
- [ ] Step 1: Clone the public repo and enter the supported runtime
- [ ] Step 2: Sync the benchmark-specific dependency group
- [ ] Step 3: Export rendering and cache environment variables before eval
```
**Step 1: Clone and enter the supported runtime**
```bash
git clone https://github.com/NVlabs/cosmos-policy.git
cd cosmos-policy
# Follow SETUP.md, start the Docker container, and enter it before continuing.
```
**Step 2: Sync dependencies**
```bash
uv sync --extra cu128 --group libero --python 3.10
# or, for RoboCasa:
uv sync --extra cu128 --group robocasa --python 3.10
# then install the Cosmos-compatible RoboCasa fork:
git clone https://github.com/moojink/robocasa-cosmos-policy.git
uv pip install -e robocasa-cosmos-policy
```
**Step 3: Export runtime environment**
```bash
export CUDA_VISIBLE_DEVICES=0
export MUJOCO_EGL_DEVICE_ID=0
export MUJOCO_GL=egl
export PYOPENGL_PLATFORM=egl
export HF_HOME=${HF_HOME:-$HOME/.cache/huggingface}
export TRANSFORMERS_CACHE=${TRANSFORMERS_CACHE:-$HF_HOME}
```
---
## Expected performance benchmarks
Reference values from official evaluation (tied to specific setup and seeds):
| Task Suite | Success Rate | Notes |
|-----------|-------------|-------|
| LIBERO-Spatial | 98.1% | Official LIBERO spatial result |
| LIBERO-Object | 100.0% | Official LIBERO object result |
| LIBERO-Goal | 98.2% | Official LIBERO goal result |
| LIBERO-Long | 97.6% | Official LIBERO long-horizon result |
| LIBERO-Average | 98.5% | Official average across LIBERO suites |
| RoboCasa | 67.1% | Official RoboCasa average result |
**Reproduction note**: Published success rates still depend on checkpoint choice, task suite, seeds, and simulator setup. Record the exact command and environment alongside any reported number.
---
## Non-negotiable rules
- **EGL alignment**: Always set `CUDA_VISIBLE_DEVICES`, `MUJOCO_EGL_DEVICE_ID`, `MUJOCO_GL=egl`, and `PYOPENGL_PLATFORM=egl` together on headless GPU nodes.
- **Official runtime first**: If host-Python installs hit binary compatibility issues, fall back to the supported container workflow from `SETUP.md` before debugging package internals.
- **Cache consistency**: Use the same cache directory across setup and eval so Hugging Face and dependency caches are reused.
- **Run comparability**: Keep task name, object split, seed, and trial count fixed across repeated runs.
---
## Common issues
**Issue: binary compatibility or loader failures on host Python**
Fix: rerun inside the official container/runtime from `SETUP.md`. Do not assume host-package rebuilds will match the public release environment.
**Issue: LIBERO prompts for config path in a non-interactive shell**
Fix: pre-create `LIBERO_CONFIG_PATH/config.yaml`:
```python
import os, yaml
config_dir = os.path.expanduser("~/.libero")
os.makedirs(config_dir, exist_ok=True)
with open(os.path.join(config_dir, "config.yaml"), "w") as f:
yaml.dump({"benchmark_root": "/path/to/libero/datasets"}, f)
```
**Issue: EGL initialization or shutdown noise**
Fix: align EGL environment variables first. Treat teardown-only `EGL_NOT_INITIALIZED` warnings as low-signal unless the job exits non-zero.
**Issue: Kitchen object sampling NaNs or asset lookup failures in RoboCasa**
Fix: rerun asset setup and confirm the patched robocasa install is intact:
```bash
python -m robocasa.scripts.download_kitchen_assets
python -c "import robocasa; print(robocasa.__file__)"
```
**Issue: MuJoCo rendering mismatch**
Fix: verify GPU device alignment:
```python
import os
cuda_dev = os.environ.get("CUDA_VISIBLE_DEVICES", "not set")
egl_dev = os.environ.get("MUJOCO_EGL_DEVICE_ID", "not set")
assert cuda_dev == egl_dev, f"GPU mismatch: CUDA={cuda_dev}, EGL={egl_dev}"
print(f"Rendering on GPU {cuda_dev}")
```
---
## Advanced topics
**LIBERO command matrix**: See [references/libero-commands.md](references/libero-commands.md)
**RoboCasa command matrix**: See [references/robocasa-commands.md](references/robocasa-commands.md)
## Resources
- Cosmos Policy repository: https://github.com/NVlabs/cosmos-policy
- LIBERO benchmark: https://github.com/Lifelong-Robot-Learning/LIBERO
- Cosmos-compatible RoboCasa fork: https://github.com/moojink/robocasa-cosmos-policy
- Upstream RoboCasa project: https://github.com/robocasa/robocasa
- MuJoCo documentation: https://mujoco.readthedocs.io/
OPTIONAL CONTEXT
- evaluation environment (LIBERO or RoboCasa)
- task suite or task name
- number of trials
ROLES & RULES
- Always set CUDA_VISIBLE_DEVICES, MUJOCO_EGL_DEVICE_ID, MUJOCO_GL=egl, and PYOPENGL_PLATFORM=egl together on headless GPU nodes.
- If host-Python installs hit binary compatibility issues, fall back to the supported container workflow from SETUP.md before debugging package internals.
- Use the same cache directory across setup and eval so Hugging Face and dependency caches are reused.
- Keep task name, object split, seed, and trial count fixed across repeated runs.
EXPECTED OUTPUT
- Format
- markdown
- Constraints
- include bash and python code blocks
- use checklists for workflows
- provide tables for requirements and benchmarks
SUCCESS CRITERIA
- Follow EGL alignment rules
- Use official runtime first
- Validate outputs and parse results correctly
- Keep reproduction notes anchored to public modules
FAILURE MODES
- Binary compatibility or loader failures on host Python
- LIBERO prompts for config path in non-interactive shell
- EGL initialization or shutdown noise
- Kitchen object sampling NaNs or asset lookup failures in RoboCasa
- MuJoCo rendering mismatch
EXAMPLES
Includes multiple bash commands, Python snippets, checklists, tables, and full evaluation workflows for LIBERO and RoboCasa.
CAVEATS
- Dependencies
- Requires cosmos-policy repository clone
- Requires RoboCasa fork install
- Requires specific Hugging Face checkpoints and dataset stats
- Missing context
- User-specific paths for datasets, logs, and caches
- Target GPU hardware details beyond A40/A100 examples
QUALITY
- OVERALL
- 0.82
- CLARITY
- 0.90
- SPECIFICITY
- 0.95
- REUSABILITY
- 0.60
- COMPLETENESS
- 0.85
IMPROVEMENT SUGGESTIONS
- Introduce placeholders (e.g., {{GPU_ID}}, {{CKPT_PATH}}) in all command examples to improve reusability across environments.
- Extract the long repeated command flags into a single reusable YAML or dict template so users can override only the differing values.
USAGE
Copy the prompt above and paste it into your AI of choice — Claude, ChatGPT, Gemini, or anywhere else you're working. Replace any placeholder sections with your own context, then ask for the output.
MORE FOR AGENT
- Advanced LLM-as-Judge Evaluationagentevaluation
- Context Degradation Pattern Diagnoseragentevaluation
- Agent Evaluation Framework Builderagentevaluation
- LLM Experiment Integrity Auditoragentevaluation
- Experiment Results Claim Evaluation Gateagentevaluation
- Nielsen Heuristics Mobile UX Auditoragentevaluation
- Scientific Manuscript Peer Revieweragentevaluation
- Nielsen Heuristics Mobile UX Auditoragentevaluation
- UX/UI Principles Agent Skillsagentevaluation
- Nielsen Heuristics Mobile UX Auditoragentevaluation
- ScholarEval Academic Research Evaluatoragentevaluation
- AI Agent Reasoning Trace Optimizeragentevaluation
- Experiment Integrity Cross-Model Auditoragentevaluation
- Research Result-to-Claim Evaluatoragentevaluation
- Web Interface Guidelines File Revieweragentevaluation
- UX/UI Principles Interface Evaluatoragentevaluation
- Comprehensive Final Review Checklist Assessoragentevaluation
- UX/UI Principles Agent Skillsagentevaluation
- Conductor Project Artifacts Validatoragentevaluation
- Comprehensive Codebase Bug Analysis and Fixeragentanalysis
- Xcode MCP Usage Guidelines for Agentsagenttool_use
- Xcode MCP Usage Guidelinesagenttool_use
- Rapid App MVP Prototyperagentcoding
- Local Documentation Online Sync Automatoragentoperations
- HashiCorp Packer Golden Image Expertagentoperations