Overview
This skill provides comprehensive patterns and best practices for implementing Group Relative Policy Optimization (GRPO) in LLM alignment workflows. It bridges the gap between basic Supervised Fine-Tuning (SFT) and full RLHF by offering structured guidance on GRPOTrainer configuration, reward function design, and KL-divergence constraints. It is particularly well suited to training thinking-aware reasoning models with memory-efficient techniques such as LoRA and Unsloth, making it a good fit for developers who want to refine model behavior or build complex reasoning capabilities without the overhead of a traditional critic model.
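To make the pieces concrete, here is a minimal sketch of how these elements fit together, assuming TRL's `GRPOTrainer`/`GRPOConfig` API with custom reward functions, a `beta` KL coefficient, and a PEFT LoRA config; the toy prompts, reward functions, and the base model checkpoint are illustrative placeholders, and exact parameter names may vary between library versions.

```python
# Sketch of a GRPO run: group sampling, rule-based rewards, KL penalty, LoRA.
# Assumes TRL's GRPOTrainer/GRPOConfig; dataset and rewards are toy examples.
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Toy prompt dataset; in practice, load your own reasoning prompts.
train_dataset = Dataset.from_dict(
    {"prompt": ["What is 6 * 7?", "What is 40 + 2?", "What is 50 - 8?", "What is 84 / 2?"]}
)

def format_reward(completions, **kwargs):
    """Reward completions that wrap their reasoning in <think>...</think> tags."""
    return [1.0 if "<think>" in c and "</think>" in c else 0.0 for c in completions]

def correctness_reward(completions, **kwargs):
    """Hypothetical correctness check: reward completions containing '42'."""
    return [1.0 if "42" in c else 0.0 for c in completions]

training_args = GRPOConfig(
    output_dir="grpo-demo",
    num_generations=4,            # group size: completions sampled per prompt
    max_completion_length=256,
    beta=0.04,                    # KL penalty toward the frozen reference policy
    learning_rate=5e-6,
    per_device_train_batch_size=4,
)

# LoRA keeps memory low: only adapter weights are trained.
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # any causal LM checkpoint (placeholder)
    reward_funcs=[format_reward, correctness_reward],
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
trainer.train()
```

Because GRPO scores each completion relative to the other completions in its group, the rule-based reward functions above stand in for the critic model, and the `beta` term keeps the policy from drifting too far from the reference model during training.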