How do I prevent expert collapse during training?

The skill includes implementation details for an auxiliary load-balancing loss and a router Z-loss, which together encourage an even distribution of tokens across all available expert networks.
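As a rough illustration of the two losses named above, here is a minimal sketch in plain Python so that it runs without any framework. It assumes the Switch Transformer form of the auxiliary loss (alpha * N * sum of f_i * P_i, with top-1 dispatch fractions f_i and mean router probabilities P_i) and the ST-MoE form of the router Z-loss (mean squared log-sum-exp of the router logits). The function names and the coefficients alpha and beta are illustrative assumptions, not taken from the skill itself.

```python
import math

def softmax(logits):
    # Subtract the max for numerical stability.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def load_balancing_loss(router_logits, alpha=0.01):
    """Auxiliary loss (assumed Switch-style): alpha * N * sum_i f_i * P_i."""
    num_tokens = len(router_logits)
    num_experts = len(router_logits[0])
    probs = [softmax(row) for row in router_logits]
    # f_i: fraction of tokens whose top-1 choice is expert i.
    counts = [0] * num_experts
    for p in probs:
        counts[p.index(max(p))] += 1
    f = [c / num_tokens for c in counts]
    # P_i: mean router probability assigned to expert i.
    P = [sum(p[i] for p in probs) / num_tokens for i in range(num_experts)]
    return alpha * num_experts * sum(fi * pi for fi, pi in zip(f, P))

def router_z_loss(router_logits, beta=0.001):
    """Z-loss (assumed ST-MoE-style): beta * mean(logsumexp(logits) ** 2)."""
    total = 0.0
    for row in router_logits:
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse ** 2
    return beta * total / len(router_logits)

# Four tokens, four experts: one balanced batch and one collapsed batch.
balanced = [[10, 0, 0, 0], [0, 10, 0, 0], [0, 0, 10, 0], [0, 0, 0, 10]]
collapsed = [[10, 0, 0, 0]] * 4
print(load_balancing_loss(balanced), load_balancing_loss(collapsed))
```

In this sketch the collapsed batch, where every token prefers expert 0, scores a higher auxiliary loss than the balanced one, which is the pressure that discourages expert collapse. The Z-loss separately penalizes large router logits.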

Can this skill help with Mixtral-style models?

Yes. It provides specific architecture patterns, Top-2 routing logic, and configuration parameters used in state-of-the-art models such as Mixtral 8x7B and DeepSeek-V3.
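Top-2 routing can be sketched in a few lines of plain Python. The version below assumes the gate weights come from a softmax taken over only the selected experts' logits; the helper names top_k_route and moe_forward are hypothetical, and the toy "experts" are simple scaling functions standing in for feed-forward networks.

```python
import math

def top_k_route(logits, k=2):
    """Return [(expert_index, gate_weight), ...] for the k highest logits."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the selected logits only, so the k weights sum to 1.
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the k selected experts."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))

# Toy experts: expert i multiplies its input by (i + 1).
experts = [lambda x, s=s: s * x for s in (1, 2, 3, 4)]
route = top_k_route([1.0, 3.0, 2.0, 0.0])
print(route)                                  # experts 1 and 2 are selected
print(moe_forward(1.0, [1.0, 3.0, 2.0, 0.0], experts))
```

Only the two selected experts are evaluated for the token; the other experts do no work, which is where the sparsity savings come from.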

Which frameworks are supported for MoE training?

This skill covers implementations using DeepSpeed, HuggingFace Transformers, and PyTorch, with a specific focus on scaling through expert parallelism.

What are the benefits of MoE over dense models?

Because only a subset of experts is activated for each token, during both training and inference, an MoE model can need significantly less compute (up to 5x less) than a dense model with the same parameter count.
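The arithmetic behind this claim can be sketched as follows. The model is deliberately simplified: every token uses the shared parameters plus top_k experts, and per-token compute is assumed to scale roughly with the active parameter count. The function name and all sizes are hypothetical and do not describe any real model.

```python
def moe_param_counts(shared, per_expert, num_experts, top_k):
    """Return (total, active) parameter counts for a simplified MoE model.

    shared:     parameters every token uses (attention, embeddings, router)
    per_expert: parameters in one expert, summed over all MoE layers
    """
    total = shared + num_experts * per_expert
    active = shared + top_k * per_expert
    return total, active

# Hypothetical sizes in millions of parameters, chosen only for illustration.
total, active = moe_param_counts(shared=1_000, per_expert=5_000,
                                 num_experts=8, top_k=2)
print(f"total={total}M active={active}M ratio={total / active:.2f}x")
# prints: total=41000M active=11000M ratio=3.73x
```

Under these assumptions the ratio depends on how much of the model is shared and on how many experts are active per token, so the achievable saving varies with the configuration.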

MoE Training: Mixture of Experts

Author: Orchestra-Research

Data Science & Machine Learning

Trains and optimizes Mixture of Experts (MoE) models to scale AI capacity with significantly reduced compute costs.

This Claude Code skill provides comprehensive guidance for implementing and training Mixture of Experts (MoE) architectures, enabling developers to scale model capacity up to 5x more efficiently than dense models. It includes production-ready patterns for top-k routing, expert parallelism, and load balancing using industry-standard frameworks like DeepSpeed and HuggingFace. Whether you are building sparse models like Mixtral or DeepSeek or optimizing inference latency through sparse activation, this skill offers the implementation logic, auxiliary loss functions, and hardware configurations necessary for state-of-the-art AI research and engineering.

Key Features

01 Advanced load balancing with auxiliary and router Z-loss functions

02 3,983 GitHub stars

03 Mixtral-style architecture patterns with 8x7B expert structures

04 Inference optimization through sparse activation and capacity factor tuning

05 DeepSpeed MoE configuration for large-scale expert parallelism

06 Implementation of Top-k and Switch Transformer routing mechanisms
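To make the capacity factor mentioned in the feature list concrete, here is a minimal sketch in plain Python. It assumes the common definition in which each expert accepts at most ceil(capacity_factor * num_tokens / num_experts) tokens per batch and tokens beyond that limit are dropped. The function names and the first-come-first-served dispatch order are illustrative assumptions; real frameworks may handle overflow differently.

```python
import math

def expert_capacity(num_tokens, num_experts, capacity_factor=1.25):
    """Maximum number of tokens a single expert accepts per batch."""
    return math.ceil(capacity_factor * num_tokens / num_experts)

def dispatch(assignments, num_experts, capacity):
    """Place tokens into expert buckets in order; overflow tokens are dropped.

    assignments: list where assignments[t] is the expert chosen for token t.
    Returns (buckets, dropped) as lists of token indices.
    """
    buckets = [[] for _ in range(num_experts)]
    dropped = []
    for token, expert in enumerate(assignments):
        if len(buckets[expert]) < capacity:
            buckets[expert].append(token)
        else:
            dropped.append(token)
    return buckets, dropped

# Eight tokens, four experts; expert 0 is oversubscribed.
capacity = expert_capacity(num_tokens=8, num_experts=4, capacity_factor=1.25)
buckets, dropped = dispatch([0, 0, 0, 0, 1, 2, 3, 1], num_experts=4,
                            capacity=capacity)
print(capacity, buckets, dropped)
```

A larger capacity factor drops fewer tokens at the cost of more padding and memory per expert, which is the trade-off being tuned.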

Use Cases

01 Developing domain-specific models with specialized expert networks

02 Scaling large language models (LLMs) on a limited compute budget

03 Reducing inference latency for high-parameter models through sparsity

Install with 🐟 Skill.Fish

npx skillfish add orchestra-research/ai-research-skills moe-training
