Overview
SimPO (Simple Preference Optimization) is a technique for aligning Large Language Models (LLMs) with human preferences that offers a simpler, more efficient alternative to standard Direct Preference Optimization (DPO). Because its implicit reward is the length-normalized log-likelihood of a response under the policy itself, SimPO needs no reference model during training, which significantly reduces memory requirements and computational overhead while achieving superior results on benchmarks like AlpacaEval 2.0. This skill provides comprehensive workflows for training base and instruct models, optimizing reasoning-intensive tasks, and troubleshooting common issues such as loss divergence and capability forgetting, making it valuable for AI researchers and engineers pursuing state-of-the-art model performance on limited compute.
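
To make the reference-free objective concrete, below is a minimal PyTorch sketch of the SimPO loss. The function name `simpo_loss`, the argument names, and the hyperparameter defaults (β and γ in the ballpark reported in the SimPO paper) are illustrative assumptions, not this skill's actual API:

```python
import torch
import torch.nn.functional as F

def simpo_loss(policy_chosen_logps: torch.Tensor,
               policy_rejected_logps: torch.Tensor,
               chosen_lengths: torch.Tensor,
               rejected_lengths: torch.Tensor,
               beta: float = 2.0,      # reward scale; assumed default
               gamma: float = 1.0):    # target reward margin; assumed default
    """Sketch of the SimPO objective: a length-normalized
    log-likelihood margin, with no reference model (unlike DPO)."""
    # The implicit reward is the average per-token log-probability
    # of a response under the current policy, scaled by beta.
    chosen_rewards = beta * policy_chosen_logps / chosen_lengths
    rejected_rewards = beta * policy_rejected_logps / rejected_lengths
    # Require the chosen response to beat the rejected one
    # by at least the margin gamma.
    logits = chosen_rewards - rejected_rewards - gamma
    return -F.logsigmoid(logits).mean()
```

Note that only the current policy's log-probabilities appear in the loss; no frozen reference model must be held in memory or queried per step, which is where the savings over DPO come from.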