- Full RLHF pipeline implementation, including Reward Modeling
- Standardized training configurations for HuggingFace Transformers
- Direct Preference Optimization (DPO) for stable alignment
- Memory-efficient online RL with Group Relative Policy Optimization (GRPO)
- Supervised Fine-Tuning (SFT) for instruction following
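To illustrate the DPO objective the list refers to, here is a minimal sketch of the per-example DPO loss in plain Python. The function name and scalar log-probability inputs are illustrative assumptions, not this repository's API: DPO minimizes the negative log-sigmoid of the scaled gap between the policy's and the reference model's preference margins.

```python
import math

def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-example DPO loss (illustrative sketch, not this repo's API).

    Loss = -log sigmoid(beta * (policy margin - reference margin)),
    where each margin is log p(chosen) - log p(rejected).
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    logits = beta * (policy_margin - ref_margin)
    # Numerically plain logistic; real implementations use log-sigmoid for stability.
    return -math.log(1.0 / (1.0 + math.exp(-logits)))
```

When the policy prefers the chosen response more strongly than the reference does, the margin gap is positive and the loss drops below `log 2`; the `beta` temperature controls how hard the policy is pushed away from the reference.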
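The memory efficiency of GRPO comes from replacing a learned value network (critic) with group-relative baselines: each sampled completion's reward is standardized against the mean and standard deviation of its group. A sketch of that advantage computation, with an assumed helper name not taken from this repository:

```python
import math

def grpo_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Group-relative advantages (illustrative sketch).

    Standardizes each reward against its group's mean and std, so no
    critic network needs to be trained or held in memory.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    # eps guards against division by zero when all rewards in a group are equal.
    return [(r - mean) / (std + eps) for r in rewards]
```

The advantages for a group always sum to zero, so completions are rewarded only relative to their siblings sampled from the same prompt.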