How does Slime differ from other RL frameworks like verl or miles?

Slime is built specifically around Megatron-LM and SGLang, so it integrates more tightly for teams that already use THUDM (Tsinghua) workflows or need GLM architecture support.

Does Slime support multi-node distributed training?

Yes. It uses Megatron-LM's tensor (TP), pipeline (PP), data (DP), and sequence (SP) parallelism to scale both training and rollout across multiple GPU nodes.
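As a rough illustration of how these parallelism degrees combine across nodes, the sketch below derives the data-parallel size from a total GPU count and chosen TP and PP degrees. The function name `data_parallel_size` is hypothetical and not part of Slime or Megatron-LM. It assumes the common convention that each model replica occupies TP × PP GPUs and that sequence parallelism reuses the tensor-parallel group, and it ignores other dimensions such as context or expert parallelism.

```python
def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    """Return how many model replicas fit on `world_size` GPUs.

    Assumption: one replica spans tp * pp GPUs, and sequence
    parallelism shares the tensor-parallel group, so it adds no factor.
    """
    model_parallel = tp * pp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp={model_parallel}"
        )
    return world_size // model_parallel


# Example: 2 nodes with 8 GPUs each, TP=4 and PP=2.
dp = data_parallel_size(world_size=16, tp=4, pp=2)
print(dp)
```

With 16 GPUs, TP=4, and PP=2, each replica uses 8 GPUs, leaving room for two data-parallel replicas.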

Can I use Slime for GRPO training?

Yes. Slime includes dedicated workflows and parameters for Group Relative Policy Optimization (GRPO), which scores each sampled response relative to the other responses generated for the same prompt.
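To make the group-relative idea concrete, here is a minimal sketch of the advantage computation commonly associated with GRPO: rewards for one prompt's group of responses are centered on the group mean and scaled by the group standard deviation. This is not Slime's implementation. The function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one prompt's group of sampled responses.

    Assumption: population standard deviation plus a small eps to
    avoid division by zero when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses sampled for the same prompt, scored 1.0 (correct) or 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat the group average receive positive advantages and the rest receive negative ones. When all rewards in a group are identical, every advantage is zero and that group contributes no learning signal.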

What models does Slime support?

Slime provides native support and pre-configured scripts for GLM-4.x, Qwen3, DeepSeek V3/R1, and Llama 3 models.

Slime RL Post-Training

Author: Orchestra-Research


Data Science and ML

Scales LLM post-training via reinforcement learning by integrating Megatron-LM training with high-throughput SGLang inference.

Slime is a high-performance reinforcement learning framework for scaling LLM post-training workflows, and it powers the GLM-4 model series. It connects Megatron-LM's distributed training with SGLang's efficient rollout generation, so researchers can run algorithms such as GRPO and PPO at scale. This skill is particularly useful for teams building reasoning models or agentic systems that need custom data generation buffers and tight integration with production-grade model parallelism across large GPU clusters.

Key Features

01 3,983 GitHub stars

02 First-class support for GLM-4, Qwen3, DeepSeek V3, and Llama 3 architectures

03 Pre-configured workflows for GRPO, PPO, and Reinforce++ algorithms

04 High-throughput rollout generation using SGLang with integrated routing

05 Flexible data buffer system for custom prompt management and sample storage

06 Native Megatron-LM integration supporting TP, PP, DP, and Sequence Parallelism

Use Cases

01 Training reasoning-intensive models using Group Relative Policy Optimization (GRPO)

02 Scaling post-training for massive models across multi-node GPU infrastructures

03 Developing multi-turn agentic workflows with custom tool-use reward functions


Install with 🐟 Skill.Fish

npx skillfish add orchestra-research/ai-research-skills slime
