About
This skill provides specialized guidance for training large-scale Mixture of Experts (MoE) models, enabling researchers and engineers to scale model capacity without a proportional increase in compute cost. It covers essential techniques such as top-k routing, expert parallelism with DeepSpeed, and load-balancing strategies that prevent expert collapse. Whether you are implementing sparse architectures like Mixtral 8x7B or DeepSeek-V3, or optimizing inference through sparse activation, this skill offers production-ready patterns for high-performance AI research and development.
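To make the routing and load-balancing ideas above concrete, here is a minimal, illustrative sketch of a top-k router with a Switch-style auxiliary load-balancing loss. It is not part of this skill's API or of DeepSpeed; the class name `TopKRouter`, its parameters, and the loss formulation are assumptions chosen for clarity.

```python
# Minimal sketch: top-k routing with an auxiliary load-balancing loss.
# All names (TopKRouter, gate, aux_loss) are illustrative, not a library API.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TopKRouter(nn.Module):
    """Routes each token to its top-k experts and computes a load-balancing loss."""

    def __init__(self, hidden_dim: int, num_experts: int, k: int = 2):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.num_experts = num_experts
        self.k = k

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, hidden_dim)
        logits = self.gate(x)                 # (num_tokens, num_experts)
        probs = F.softmax(logits, dim=-1)

        # Select the k highest-probability experts per token and renormalize
        # their weights so the combined expert outputs sum to one.
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        # Switch-Transformer-style auxiliary loss: penalize uneven expert
        # utilization so no single expert collapses into handling most tokens.
        importance = probs.mean(dim=0)        # mean router probability per expert
        dispatch = F.one_hot(topk_idx[:, 0], self.num_experts).float().mean(dim=0)
        aux_loss = self.num_experts * torch.sum(importance * dispatch)

        return topk_idx, topk_probs, aux_loss


# Usage: route 16 token representations across 8 experts with top-2 selection.
router = TopKRouter(hidden_dim=512, num_experts=8, k=2)
tokens = torch.randn(16, 512)
expert_idx, expert_weights, aux_loss = router(tokens)
```

In practice the auxiliary loss is added to the task loss with a small coefficient; frameworks such as DeepSpeed provide their own expert-parallel MoE layers that handle dispatch and all-to-all communication, which this sketch deliberately omits.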