agent evaluation skill risk: low

NVIDIA Cosmos Policy LIBERO RoboCasa Evaluator

Provides installation steps, environment configuration, evaluation commands, result parsing, and troubleshooting for running NVIDIA Cosmos Policy on LIBERO and RoboCasa simulation…

SKILL 3 files · 1 folder

SKILL.md

Download

---
name: evaluating-cosmos-policy
description: "Evaluates NVIDIA Cosmos Policy on LIBERO and RoboCasa simulation environments. Use when setting up cosmos-policy for robot manipulation evaluation, running headless GPU evaluations with EGL rendering, or profiling inference latency on cluster or local GPU machines."
---
# Cosmos Policy Evaluation

Evaluation workflows for NVIDIA Cosmos Policy on LIBERO and RoboCasa simulation environments from the public `cosmos-policy` repository. Covers blank-machine setup, headless GPU evaluation, and inference profiling.

## Quick start

Run a minimal LIBERO evaluation using the official public eval module:

```bash
uv run --extra cu128 --group libero --python 3.10 \
  python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
    --config cosmos_predict2_2b_480p_libero__inference_only \
    --ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
    --config_file cosmos_policy/config/config.py \
    --use_wrist_image True \
    --use_proprio True \
    --normalize_proprio True \
    --unnormalize_actions True \
    --dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
    --t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
    --trained_with_image_aug True \
    --chunk_size 16 \
    --num_open_loop_steps 16 \
    --task_suite_name libero_10 \
    --num_trials_per_task 1 \
    --local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
    --seed 195 \
    --randomize_seed False \
    --deterministic True \
    --run_id_note smoke \
    --ar_future_prediction False \
    --ar_value_prediction False \
    --use_jpeg_compression True \
    --flip_images True \
    --num_denoising_steps_action 5 \
    --num_denoising_steps_future_state 1 \
    --num_denoising_steps_value 1 \
    --data_collection False
```

## Core concepts

**What Cosmos Policy is**: NVIDIA Cosmos Policy is a vision-language-action (VLA) model that uses Cosmos Tokenizer to encode visual observations into discrete tokens, then predicts robot actions conditioned on language instructions and visual context.

**Key architecture choices**:

| Component | Design |
|-----------|--------|
| Visual encoder | Cosmos Tokenizer (discrete tokens) |
| Language conditioning | Cross-attention to language embeddings |
| Action prediction | Autoregressive action token generation |

**Public command surface**: The supported evaluation entrypoints are `cosmos_policy.experiments.robot.libero.run_libero_eval` and `cosmos_policy.experiments.robot.robocasa.run_robocasa_eval`. Keep reproduction notes anchored to these public modules and their documented flags.

## Compute requirements

| Task | GPU | VRAM | Typical wall time |
|------|-----|------|-------------------|
| LIBERO smoke eval (1 trial) | 1x A40/A100 | ~16 GB | 5-10 min |
| LIBERO full eval (50 trials) | 1x A40/A100 | ~16 GB | 2-4 hours |
| RoboCasa single-task (2 trials) | 1x A40/A100 | ~18 GB | 10-15 min |
| RoboCasa all-tasks | 1x A40/A100 | ~18 GB | 4-8 hours |

## When to use vs alternatives

**Use this skill when:**
- Evaluating NVIDIA Cosmos Policy on LIBERO or RoboCasa benchmarks
- Profiling inference latency and throughput for Cosmos Policy
- Setting up headless EGL rendering for robot simulation on GPU clusters

**Use alternatives when:**
- Training or fine-tuning Cosmos Policy from scratch (use official Cosmos training docs)
- Working with OpenVLA-based policies (use `fine-tuning-openvla-oft`)
- Working with Physical Intelligence pi0 models (use `fine-tuning-serving-openpi`)
- Running real-robot evaluation rather than simulation

---

## Workflow 1: LIBERO evaluation

Copy this checklist and track progress:

```text
LIBERO Eval Progress:
- [ ] Step 1: Install environment and dependencies
- [ ] Step 2: Configure headless EGL rendering
- [ ] Step 3: Run smoke evaluation
- [ ] Step 4: Validate outputs and parse results
- [ ] Step 5: Run full benchmark if smoke passes
```

**Step 1: Install environment**

```bash
git clone https://github.com/NVlabs/cosmos-policy.git
cd cosmos-policy
# Follow SETUP.md to build and enter the supported Docker container.
# Then, inside the container:
uv sync --extra cu128 --group libero --python 3.10
```

**Step 2: Configure headless rendering**

```bash
export CUDA_VISIBLE_DEVICES=0
export MUJOCO_EGL_DEVICE_ID=0
export MUJOCO_GL=egl
export PYOPENGL_PLATFORM=egl
```

**Step 3: Run smoke evaluation**

```bash
uv run --extra cu128 --group libero --python 3.10 \
  python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
    --config cosmos_predict2_2b_480p_libero__inference_only \
    --ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
    --config_file cosmos_policy/config/config.py \
    --use_wrist_image True \
    --use_proprio True \
    --normalize_proprio True \
    --unnormalize_actions True \
    --dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
    --t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
    --trained_with_image_aug True \
    --chunk_size 16 \
    --num_open_loop_steps 16 \
    --task_suite_name libero_10 \
    --num_trials_per_task 1 \
    --local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
    --seed 195 \
    --randomize_seed False \
    --deterministic True \
    --run_id_note smoke \
    --ar_future_prediction False \
    --ar_value_prediction False \
    --use_jpeg_compression True \
    --flip_images True \
    --num_denoising_steps_action 5 \
    --num_denoising_steps_future_state 1 \
    --num_denoising_steps_value 1 \
    --data_collection False
```

**Step 4: Validate and parse results**

```python
import json
import glob

# Find latest evaluation result from the official log directory
log_files = sorted(glob.glob("cosmos_policy/experiments/robot/libero/logs/**/*.json", recursive=True))
with open(log_files[-1]) as f:
    results = json.load(f)

print(results)
```

**Step 5: Scale up**

Run across all four LIBERO task suites with 50 trials:

```bash
for suite in libero_spatial libero_object libero_goal libero_10; do
  uv run --extra cu128 --group libero --python 3.10 \
    python -m cosmos_policy.experiments.robot.libero.run_libero_eval \
      --config cosmos_predict2_2b_480p_libero__inference_only \
      --ckpt_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B \
      --config_file cosmos_policy/config/config.py \
      --use_wrist_image True \
      --use_proprio True \
      --normalize_proprio True \
      --unnormalize_actions True \
      --dataset_stats_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_dataset_statistics.json \
      --t5_text_embeddings_path nvidia/Cosmos-Policy-LIBERO-Predict2-2B/libero_t5_embeddings.pkl \
      --trained_with_image_aug True \
      --chunk_size 16 \
      --num_open_loop_steps 16 \
      --task_suite_name "$suite" \
      --num_trials_per_task 50 \
      --local_log_dir cosmos_policy/experiments/robot/libero/logs/ \
      --seed 195 \
      --randomize_seed False \
      --deterministic True \
      --run_id_note "suite_${suite}" \
      --ar_future_prediction False \
      --ar_value_prediction False \
      --use_jpeg_compression True \
      --flip_images True \
      --num_denoising_steps_action 5 \
      --num_denoising_steps_future_state 1 \
      --num_denoising_steps_value 1 \
      --data_collection False
done
```

---

## Workflow 2: RoboCasa evaluation

Copy this checklist and track progress:

```text
RoboCasa Eval Progress:
- [ ] Step 1: Install RoboCasa assets and verify macros
- [ ] Step 2: Run single-task smoke evaluation
- [ ] Step 3: Validate outputs
- [ ] Step 4: Expand to multi-task runs
```

**Step 1: Install RoboCasa**

```bash
git clone https://github.com/moojink/robocasa-cosmos-policy.git
uv pip install -e robocasa-cosmos-policy
python -m robocasa.scripts.setup_macros
python -m robocasa.scripts.download_kitchen_assets
```

This fork installs the `robocasa` Python package expected by Cosmos Policy while preserving the patched environment changes used in the public RoboCasa eval path. Verify `macros_private.py` exists and paths are correct.

**Step 2: Single-task smoke evaluation**

```bash
uv run --extra cu128 --group robocasa --python 3.10 \
  python -m cosmos_policy.experiments.robot.robocasa.run_robocasa_eval \
    --config cosmos_predict2_2b_480p_robocasa_50_demos_per_task__inference \
    --ckpt_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B \
    --config_file cosmos_policy/config/config.py \
    --use_wrist_image True \
    --num_wrist_images 1 \
    --use_proprio True \
    --normalize_proprio True \
    --unnormalize_actions True \
    --dataset_stats_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B/robocasa_dataset_statistics.json \
    --t5_text_embeddings_path nvidia/Cosmos-Policy-RoboCasa-Predict2-2B/robocasa_t5_embeddings.pkl \
    --trained_with_image_aug True \
    --chunk_size 32 \
    --num_open_loop_steps 16 \
    --task_name TurnOffMicrowave \
    --obj_instance_split A \
    --num_trials_per_task 2 \
    --local_log_dir cosmos_policy/experiments/robot/robocasa/logs/ \
    --seed 195 \
    --randomize_seed False \
    --deterministic True \
    --run_id_note smoke \
    --use_variance_scale False \
    --use_jpeg_compression True \
    --flip_images True \
    --num_denoising_steps_action 5 \
    --num_denoising_steps_future_state 1 \
    --num_denoising_steps_value 1 \
    --data_collection False
```

**Step 3: Validate outputs**

- Confirm the eval log prints the expected task name, object split, and checkpoint/config values.
- Inspect the final `Success rate:` line in the log.

**Step 4: Expand scope**

Increase `--num_trials_per_task` or add more tasks. Keep `--obj_instance_split` fixed across repeated runs for comparability.

---

## Workflow 3: Blank-machine cluster launch

```text
Cluster Launch Progress:
- [ ] Step 1: Clone the public repo and enter the supported runtime
- [ ] Step 2: Sync the benchmark-specific dependency group
- [ ] Step 3: Export rendering and cache environment variables before eval
```

**Step 1: Clone and enter the supported runtime**

```bash
git clone https://github.com/NVlabs/cosmos-policy.git
cd cosmos-policy
# Follow SETUP.md, start the Docker container, and enter it before continuing.
```

**Step 2: Sync dependencies**

```bash
uv sync --extra cu128 --group libero --python 3.10
# or, for RoboCasa:
uv sync --extra cu128 --group robocasa --python 3.10
# then install the Cosmos-compatible RoboCasa fork:
git clone https://github.com/moojink/robocasa-cosmos-policy.git
uv pip install -e robocasa-cosmos-policy
```

**Step 3: Export runtime environment**

```bash
export CUDA_VISIBLE_DEVICES=0
export MUJOCO_EGL_DEVICE_ID=0
export MUJOCO_GL=egl
export PYOPENGL_PLATFORM=egl
export HF_HOME=${HF_HOME:-$HOME/.cache/huggingface}
export TRANSFORMERS_CACHE=${TRANSFORMERS_CACHE:-$HF_HOME}
```

---

## Expected performance benchmarks

Reference values from official evaluation (tied to specific setup and seeds):

| Task Suite | Success Rate | Notes |
|-----------|-------------|-------|
| LIBERO-Spatial | 98.1% | Official LIBERO spatial result |
| LIBERO-Object | 100.0% | Official LIBERO object result |
| LIBERO-Goal | 98.2% | Official LIBERO goal result |
| LIBERO-Long | 97.6% | Official LIBERO long-horizon result |
| LIBERO-Average | 98.5% | Official average across LIBERO suites |
| RoboCasa | 67.1% | Official RoboCasa average result |

**Reproduction note**: Published success rates still depend on checkpoint choice, task suite, seeds, and simulator setup. Record the exact command and environment alongside any reported number.

---

## Non-negotiable rules

- **EGL alignment**: Always set `CUDA_VISIBLE_DEVICES`, `MUJOCO_EGL_DEVICE_ID`, `MUJOCO_GL=egl`, and `PYOPENGL_PLATFORM=egl` together on headless GPU nodes.
- **Official runtime first**: If host-Python installs hit binary compatibility issues, fall back to the supported container workflow from `SETUP.md` before debugging package internals.
- **Cache consistency**: Use the same cache directory across setup and eval so Hugging Face and dependency caches are reused.
- **Run comparability**: Keep task name, object split, seed, and trial count fixed across repeated runs.

---

## Common issues

**Issue: binary compatibility or loader failures on host Python**

Fix: rerun inside the official container/runtime from `SETUP.md`. Do not assume host-package rebuilds will match the public release environment.

**Issue: LIBERO prompts for config path in a non-interactive shell**

Fix: pre-create `LIBERO_CONFIG_PATH/config.yaml`:

```python
import os, yaml

config_dir = os.path.expanduser("~/.libero")
os.makedirs(config_dir, exist_ok=True)
with open(os.path.join(config_dir, "config.yaml"), "w") as f:
    yaml.dump({"benchmark_root": "/path/to/libero/datasets"}, f)
```

**Issue: EGL initialization or shutdown noise**

Fix: align EGL environment variables first. Treat teardown-only `EGL_NOT_INITIALIZED` warnings as low-signal unless the job exits non-zero.

**Issue: Kitchen object sampling NaNs or asset lookup failures in RoboCasa**

Fix: rerun asset setup and confirm the patched robocasa install is intact:

```bash
python -m robocasa.scripts.download_kitchen_assets
python -c "import robocasa; print(robocasa.__file__)"
```

**Issue: MuJoCo rendering mismatch**

Fix: verify GPU device alignment:

```python
import os
cuda_dev = os.environ.get("CUDA_VISIBLE_DEVICES", "not set")
egl_dev = os.environ.get("MUJOCO_EGL_DEVICE_ID", "not set")
assert cuda_dev == egl_dev, f"GPU mismatch: CUDA={cuda_dev}, EGL={egl_dev}"
print(f"Rendering on GPU {cuda_dev}")
```

---

## Advanced topics

**LIBERO command matrix**: See [references/libero-commands.md](references/libero-commands.md)
**RoboCasa command matrix**: See [references/robocasa-commands.md](references/robocasa-commands.md)

## Resources

- Cosmos Policy repository: https://github.com/NVlabs/cosmos-policy
- LIBERO benchmark: https://github.com/Lifelong-Robot-Learning/LIBERO
- Cosmos-compatible RoboCasa fork: https://github.com/moojink/robocasa-cosmos-policy
- Upstream RoboCasa project: https://github.com/robocasa/robocasa
- MuJoCo documentation: https://mujoco.readthedocs.io/

OPTIONAL CONTEXT

evaluation environment (LIBERO or RoboCasa)
task suite or task name
number of trials

ROLES & RULES

Always set CUDA_VISIBLE_DEVICES, MUJOCO_EGL_DEVICE_ID, MUJOCO_GL=egl, and PYOPENGL_PLATFORM=egl together on headless GPU nodes.
If host-Python installs hit binary compatibility issues, fall back to the supported container workflow from SETUP.md before debugging package internals.
Use the same cache directory across setup and eval so Hugging Face and dependency caches are reused.
Keep task name, object split, seed, and trial count fixed across repeated runs.

EXPECTED OUTPUT

Format

markdown

Constraints

include bash and python code blocks
use checklists for workflows
provide tables for requirements and benchmarks

SUCCESS CRITERIA

Follow EGL alignment rules
Use official runtime first
Validate outputs and parse results correctly
Keep reproduction notes anchored to public modules

FAILURE MODES

Binary compatibility or loader failures on host Python
LIBERO prompts for config path in non-interactive shell
EGL initialization or shutdown noise
Kitchen object sampling NaNs or asset lookup failures in RoboCasa
MuJoCo rendering mismatch

EXAMPLES

Includes multiple bash commands, Python snippets, checklists, tables, and full evaluation workflows for LIBERO and RoboCasa.

CAVEATS

Dependencies

Requires cosmos-policy repository clone
Requires RoboCasa fork install
Requires specific Hugging Face checkpoints and dataset stats

Missing context

User-specific paths for datasets, logs, and caches
Target GPU hardware details beyond A40/A100 examples

QUALITY

OVERALL: 0.82
CLARITY: 0.90
SPECIFICITY: 0.95
REUSABILITY: 0.60
COMPLETENESS: 0.85

IMPROVEMENT SUGGESTIONS

Introduce placeholders (e.g., {{GPU_ID}}, {{CKPT_PATH}}) in all command examples to improve reusability across environments.
Extract the long repeated command flags into a single reusable YAML or dict template so users can override only the differing values.

USAGE

Copy the prompt above and paste it into your AI of choice — Claude, ChatGPT, Gemini, or anywhere else you're working. Replace any placeholder sections with your own context, then ask for the output.