developer machine_learning skill risk: low
SimPO Preference Optimization Training Guide
The prompt provides installation steps, YAML configuration examples, and multiple workflows for running SimPO training on base and instruct LLMs using preference datasets.
SKILL 4 files · 1 folder
SKILL.md
--- name: simpo-training description: "Simple Preference Optimization for LLM alignment. Reference-free alternative to DPO with better performance (+6.4 points on AlpacaEval 2.0). No reference model needed, more efficient than DPO. Use for preference alignment when want simpler, faster training than DPO/PPO." --- # SimPO - Simple Preference Optimization ## Quick start SimPO is a reference-free preference optimization method that outperforms DPO without needing a reference model. **Installation**: ```bash # Create environment conda create -n simpo python=3.10 && conda activate simpo # Install PyTorch 2.2.2 # Visit: https://pytorch.org/get-started/locally/ # Install alignment-handbook git clone https://github.com/huggingface/alignment-handbook.git cd alignment-handbook python -m pip install . # Install Flash Attention 2 python -m pip install flash-attn --no-build-isolation ``` **Training** (Mistral 7B): ```bash ACCELERATE_LOG_LEVEL=info accelerate launch \ --config_file accelerate_configs/deepspeed_zero3.yaml \ scripts/run_simpo.py \ training_configs/mistral-7b-base-simpo.yaml ``` ## Common workflows ### Workflow 1: Train from base model (Mistral 7B) **Config** (`mistral-7b-base-simpo.yaml`): ```yaml # Model model_name_or_path: mistralai/Mistral-7B-v0.1 torch_dtype: bfloat16 # Dataset dataset_mixer: HuggingFaceH4/ultrafeedback_binarized: 1.0 dataset_splits: - train_prefs - test_prefs # SimPO hyperparameters beta: 2.0 # Reward scaling (2.0-10.0) gamma_beta_ratio: 0.5 # Target margin (0-1) loss_type: sigmoid # sigmoid or hinge sft_weight: 0.0 # Optional SFT regularization # Training learning_rate: 5e-7 # Critical: 3e-7 to 1e-6 num_train_epochs: 1 per_device_train_batch_size: 1 gradient_accumulation_steps: 8 # Output output_dir: ./outputs/mistral-7b-simpo ``` **Launch training**: ```bash accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \ scripts/run_simpo.py training_configs/mistral-7b-base-simpo.yaml ``` ### Workflow 2: Fine-tune instruct model (Llama 3 8B) **Config** (`llama3-8b-instruct-simpo.yaml`): ```yaml model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct dataset_mixer: argilla/ultrafeedback-binarized-preferences-cleaned: 1.0 beta: 2.5 gamma_beta_ratio: 0.5 learning_rate: 5e-7 sft_weight: 0.1 # Add SFT loss to preserve capabilities num_train_epochs: 1 per_device_train_batch_size: 2 gradient_accumulation_steps: 4 output_dir: ./outputs/llama3-8b-simpo ``` **Launch**: ```bash accelerate launch --config_file accelerate_configs/deepspeed_zero3.yaml \ scripts/run_simpo.py training_configs/llama3-8b-instruct-simpo.yaml ``` ### Workflow 3: Reasoning-intensive tasks (lower LR) **For math/code tasks**: ```yaml model_name_or_path: deepseek-ai/deepseek-math-7b-base dataset_mixer: argilla/distilabel-math-preference-dpo: 1.0 beta: 5.0 # Higher for stronger signal gamma_beta_ratio: 0.7 # Larger margin learning_rate: 3e-7 # Lower LR for reasoning sft_weight: 0.0 num_train_epochs: 1 per_device_train_batch_size: 1 gradient_accumulation_steps: 16 ``` ## When to use vs alternatives **Use SimPO when**: - Want simpler training than DPO (no reference model) - Have preference data (chosen/rejected pairs) - Need better performance than DPO - Limited compute resources - Single-node training sufficient **Algorithm selection**: - **SimPO**: Simplest, best performance, no reference model - **DPO**: Need reference model baseline, more conservative - **PPO**: Maximum control, need reward model, complex setup - **GRPO**: Memory-efficient RL, no critic **Use alternatives instead**: - **OpenRLHF**: Multi-node distributed training, PPO/GRPO - **TRL**: Need multiple methods in one framework - **DPO**: Established baseline comparison ## Common issues **Issue: Loss divergence** Reduce learning rate: ```yaml learning_rate: 3e-7 # Reduce from 5e-7 ``` Reduce beta: ```yaml beta: 1.0 # Reduce from 2.0 ``` **Issue: Model forgets capabilities** Add SFT regularization: ```yaml sft_weight: 0.1 # Add SFT loss component ``` **Issue: Poor preference separation** Increase beta and margin: ```yaml beta: 5.0 # Increase from 2.0 gamma_beta_ratio: 0.8 # Increase from 0.5 ``` **Issue: OOM during training** Reduce batch size: ```yaml per_device_train_batch_size: 1 gradient_accumulation_steps: 16 # Maintain effective batch ``` Enable gradient checkpointing: ```yaml gradient_checkpointing: true ``` ## Advanced topics **Loss functions**: See [references/loss-functions.md](references/loss-functions.md) for sigmoid vs hinge loss, mathematical formulations, and when to use each. **Hyperparameter tuning**: See [references/hyperparameters.md](references/hyperparameters.md) for beta, gamma, learning rate selection guide, and model-size-specific recommendations. **Dataset preparation**: See [references/datasets.md](references/datasets.md) for preference data formats, quality filtering, and custom dataset creation. ## Hardware requirements - **GPU**: NVIDIA A100/H100 recommended - **VRAM**: - 7B model: 1× A100 40GB (DeepSpeed ZeRO-3) - 8B model: 2× A100 40GB - 70B model: 8× A100 80GB - **Single-node**: DeepSpeed ZeRO-3 sufficient - **Mixed precision**: BF16 recommended **Memory optimization**: - DeepSpeed ZeRO-3 (default config) - Gradient checkpointing - Flash Attention 2 ## Resources - Paper: https://arxiv.org/abs/2405.14734 (NeurIPS 2024) - GitHub: https://github.com/princeton-nlp/SimPO - Models: https://huggingface.co/princeton-nlp - Alignment Handbook: https://github.com/huggingface/alignment-handbook
EXPECTED OUTPUT
- Format
- markdown
- Constraints
- include code blocks for configs and commands
- structure with sections for workflows, issues, and hardware
EXAMPLES
Includes three workflows with full YAML config examples plus installation/training commands for Mistral-7B, Llama-3-8B, and math/code models.
QUALITY
- OVERALL
- 0.65
- CLARITY
- 0.90
- SPECIFICITY
- 0.85
- REUSABILITY
- 0.30
- COMPLETENESS
- 0.80
IMPROVEMENT SUGGESTIONS
- Convert concrete model/dataset names and file paths into explicit placeholders (e.g., {{model_name}}, {{dataset_mixer}}) to raise reusability.
USAGE
Copy the prompt above and paste it into your AI of choice — Claude, ChatGPT, Gemini, or anywhere else you're working. Replace any placeholder sections with your own context, then ask for the output.
MORE FOR DEVELOPER
- Mergekit LLM Model Merging Guidedevelopermachine_learning
- Context7 Library Documentation Expertdevelopercoding
- Structured Python Production Code Generatordevelopercoding
- Minimax Music API Generation Agentdevelopercreative
- Angular Standalone Directive Generatordevelopercoding
- Pytest Unit Test Suite Generatordevelopercoding
- Unity Architecture Specialistdevelopercoding
- iOS App Store Compliance Auditordeveloperanalysis
- Reusable Enterprise Website Template Architectdevelopersoftware_engineering
- Web Typography CSS Generatordevelopercoding
- DevOps CI/CD Pipeline Automatordeveloperoperations
- Cascading System Failure Simulatordeveloperoperations
- Python Security Code Auditordevelopersecurity
- WCAG Web Accessibility Auditor Remediatordeveloperevaluation
- VSCode CodeTour File Expertdevelopercoding
- 6-Day Sprint Feature Prioritizerdeveloperplanning
- Senior Python Code Reviewerdevelopercoding
- Micro-SaaS Vibecoder MVP Blueprint Architectdeveloperproduct
- Structured Cross-Language Code Translatordevelopercoding
- Playwright Web App Testing Toolkitdeveloperoperations
- Multi-DB SQL Query Optimizer and Builderdevelopercoding
- Base R Programming Reference Guidedevelopercoding
- IT Project Backlog Generator from Docsdeveloperplanning
- Flutter Map SDK Layer Bug Fixerdevelopercoding
- Playwright Web App Testing Toolkitdevelopertool_use