About
This skill provides guidance for training language models with reinforcement learning (RL) using Group Relative Policy Optimization (GRPO). It helps developers align models with complex domain behaviors, such as multi-step reasoning, strict XML/JSON formatting, and verifiable task accuracy, without requiring expensive labeled preference data. Built on the Transformer Reinforcement Learning (TRL) library, it covers patterns for reward function design, memory-optimized training configurations, and integrations with Unsloth and vLLM for faster training and generation.
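To make the reward function idea concrete, here is a minimal sketch of a format-checking reward in plain Python. It assumes completions arrive as a list of strings and that the reward is a list of floats, one per completion, which is the general shape TRL's GRPO trainer accepts for custom reward callables. The `<reasoning>`/`<answer>` tag layout and the name `format_reward` are illustrative assumptions, not something this skill prescribes.

```python
import re

# Hypothetical tag layout: the skill only mentions "strict XML/JSON formatting",
# so these tag names are an assumption chosen for illustration.
FORMAT_PATTERN = re.compile(
    r"^<reasoning>\s*.+?\s*</reasoning>\s*<answer>\s*.+?\s*</answer>$",
    re.DOTALL,
)


def format_reward(completions, **kwargs):
    """Return 1.0 for each completion that follows the tag layout, else 0.0.

    Extra keyword arguments (for example prompts or dataset columns) are
    accepted and ignored so the function stays compatible with callers that
    pass additional context.
    """
    return [
        1.0 if FORMAT_PATTERN.match(text.strip()) else 0.0
        for text in completions
    ]


# Small usage example with one well-formed and one malformed completion.
scores = format_reward(
    [
        "<reasoning>2 + 2 = 4</reasoning><answer>4</answer>",
        "The answer is 4.",
    ]
)
```

A binary reward like this is easy to verify but gives no partial credit. A graded variant that scores each tag separately may give a smoother learning signal, at the cost of being easier to game.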