About
RLOO (REINFORCE Leave-One-Out) is a specialized reinforcement learning skill for Claude Code that focuses on variance reduction during policy optimization. It generates multiple completions per prompt and scores each one against a leave-one-out baseline: the mean reward of the other completions for the same prompt. This yields significantly more stable gradients than traditional single-sample RL methods. The skill is particularly effective for training reasoning-capable models (such as Qwen-Thinking) and includes predefined patterns for reward function integration, thinking-aware token boundaries, and memory-efficient training configurations using Unsloth and TRL.
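The leave-one-out baseline described above can be sketched in a few lines. This is an illustrative NumPy implementation, not the skill's actual API; the function name, shapes, and reward values are assumptions for the example.

```python
import numpy as np

def rloo_advantages(rewards):
    """Compute leave-one-out advantages.

    rewards: array of shape (num_prompts, k), one reward per completion,
    with k completions sampled for each prompt (illustrative layout).
    """
    rewards = np.asarray(rewards, dtype=float)
    k = rewards.shape[1]
    # Each completion's baseline is the mean reward of the OTHER k-1
    # completions for the same prompt, so its own reward never leaks
    # into its baseline (this keeps the gradient estimator unbiased).
    baseline = (rewards.sum(axis=1, keepdims=True) - rewards) / (k - 1)
    return rewards - baseline

# Hypothetical rewards for one prompt with k=4 completions:
adv = rloo_advantages([[1.0, 2.0, 3.0, 4.0]])
```

Note that the advantages for each prompt sum to zero by construction, which is where the variance reduction over a single-sample REINFORCE estimate comes from.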