Overview
This skill provides comprehensive patterns and best practices for implementing Group Relative Policy Optimization (GRPO) in LLM alignment workflows. It bridges the gap between basic Supervised Fine-Tuning (SFT) and full RLHF by offering structured guidance on GRPOTrainer configuration, reward function design, and KL-divergence constraints. It is particularly well suited to training thinking-aware reasoning models with memory-efficient techniques such as LoRA and Unsloth, making it a good fit for developers who want to refine model behavior or build complex reasoning capabilities without the overhead of a traditional critic model.
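To make the pieces concrete, here is a minimal sketch of how these elements fit together, assuming TRL's `GRPOTrainer`/`GRPOConfig` API with custom reward functions, a `beta` KL coefficient, and a PEFT LoRA config; the toy prompts, reward functions, and the base model checkpoint are illustrative placeholders, and exact parameter names may vary between library versions.

```python
# Sketch of a GRPO run: group sampling, rule-based rewards, KL penalty, LoRA.
# Assumes TRL's GRPOTrainer/GRPOConfig; dataset and rewards are toy examples.
from datasets import Dataset
from peft import LoraConfig
from trl import GRPOConfig, GRPOTrainer

# Toy prompt dataset; in practice, load your own reasoning prompts.
train_dataset = Dataset.from_dict(
    {"prompt": ["What is 6 * 7?", "What is 40 + 2?", "What is 50 - 8?", "What is 84 / 2?"]}
)

def format_reward(completions, **kwargs):
    """Reward completions that wrap their reasoning in <think>...</think> tags."""
    return [1.0 if "<think>" in c and "</think>" in c else 0.0 for c in completions]

def correctness_reward(completions, **kwargs):
    """Hypothetical correctness check: reward completions containing '42'."""
    return [1.0 if "42" in c else 0.0 for c in completions]

training_args = GRPOConfig(
    output_dir="grpo-demo",
    num_generations=4,            # group size: completions sampled per prompt
    max_completion_length=256,
    beta=0.04,                    # KL penalty toward the frozen reference policy
    learning_rate=5e-6,
    per_device_train_batch_size=4,
)

# LoRA keeps memory low: only adapter weights are trained.
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",   # any causal LM checkpoint (placeholder)
    reward_funcs=[format_reward, correctness_reward],
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
trainer.train()
```

Because GRPO scores each completion relative to the other completions in its group, the rule-based reward functions above stand in for the critic model, and the `beta` term keeps the policy from drifting too far from the reference model during training.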