- Performance tuning for 47%+ Model FLOPs Utilization (MFU)
- Automated hyperparameter configuration for LLaMA-style models
- Expert Parallelism for Mixture-of-Experts (MoE) training
- Advanced 3D Parallelism (Tensor, Pipeline, Data)
- FP8 mixed-precision support for NVIDIA H100 GPUs

Hedged sketches of each of these features follow below, in the same order.
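MFU is the ratio of the model FLOPs a training run actually sustains to the hardware's peak FLOP rate. A minimal sketch of the arithmetic, using the standard approximation of 6 FLOPs per parameter per token for forward plus backward; the 7B parameter count, throughput, and H100 peak figure are illustrative assumptions, not measurements from this project:

```python
# Minimal MFU arithmetic, assuming the standard 6 * N FLOPs-per-token
# approximation (forward + backward). All numbers below are
# illustrative assumptions, not measurements from this project.

def mfu(num_params: float, tokens_per_sec: float, peak_flops: float) -> float:
    """Achieved model FLOP/s divided by hardware peak FLOP/s."""
    achieved = 6.0 * num_params * tokens_per_sec
    return achieved / peak_flops

n_params = 7e9   # assumed 7B-parameter model
peak = 989e12    # H100 SXM BF16 dense peak, roughly 989 TFLOP/s
tps = 11_000     # assumed per-GPU training throughput, tokens/s

print(f"MFU = {mfu(n_params, tps, peak):.1%}")  # ~46.7%, near the 47% target
```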
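The project's actual auto-configuration logic is not spelled out here, so the following is a hypothetical sketch of one common approach: derive a LLaMA-style shape from a hidden size and depth using the published LLaMA conventions (head dimension 128, SwiGLU FFN width of about 8/3 of the hidden size rounded up to a multiple of 256), then check the resulting parameter count. The function names and the vocabulary size are assumptions.

```python
# Hypothetical sketch of automated hyperparameter derivation for a
# LLaMA-style model. The heuristics (head_dim = 128, SwiGLU FFN sized
# to ~8/3 * hidden, rounded up to a multiple of 256) follow published
# LLaMA conventions; this project's real logic may differ.
from dataclasses import dataclass

@dataclass
class LlamaConfig:
    hidden_size: int
    num_layers: int
    num_heads: int
    ffn_size: int
    vocab_size: int = 32_000  # assumed LLaMA-1/2 vocabulary size

def auto_configure(hidden_size: int, num_layers: int) -> LlamaConfig:
    num_heads = hidden_size // 128            # head_dim fixed at 128
    ffn = int(8 * hidden_size / 3)            # SwiGLU width heuristic
    ffn = ((ffn + 255) // 256) * 256          # round up to multiple of 256
    return LlamaConfig(hidden_size, num_layers, num_heads, ffn)

def param_count(cfg: LlamaConfig) -> int:
    attn = 4 * cfg.hidden_size ** 2             # Q, K, V, O projections
    mlp = 3 * cfg.hidden_size * cfg.ffn_size    # gate, up, down (SwiGLU)
    emb = 2 * cfg.vocab_size * cfg.hidden_size  # input + output embeddings
    return cfg.num_layers * (attn + mlp) + emb

cfg = auto_configure(hidden_size=4096, num_layers=32)
print(cfg, f"{param_count(cfg) / 1e9:.2f}B params")
```

With `hidden_size=4096` and `num_layers=32` this reproduces the LLaMA-7B shape: 32 heads, FFN width 11008, and roughly 6.74B parameters.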
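With Expert Parallelism, the experts of an MoE layer are sharded across GPUs and each token is sent, typically via an all-to-all exchange, to the ranks hosting its selected experts. The sketch below shows only the top-2 routing logic in a single process; the distributed exchange and the expert sharding are elided, and all names and sizes are illustrative.

```python
# Minimal single-process sketch of MoE top-2 routing. Under real Expert
# Parallelism the experts below would live on different GPUs and tokens
# would move via torch.distributed all-to-all; that exchange is elided.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Pick the top-k experts for every token.
        weights, idx = torch.topk(F.softmax(self.router(x), dim=-1), self.k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = idx[:, slot] == e  # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE(d_model=64, d_ff=256, num_experts=8)
print(moe(torch.randn(16, 64)).shape)  # torch.Size([16, 64])
```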
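3D parallelism factors the GPU fleet into a tensor x pipeline x data grid, and every global rank owns one coordinate in that grid. A hedged sketch of the rank mapping follows, assuming tensor-parallel ranks are innermost so they sit on the fastest interconnect; that is a common convention, not necessarily the one this project uses.

```python
# Hedged sketch of 3D-parallel rank decomposition: world size factored
# as DP * PP * TP, tensor-parallel ranks assumed innermost (adjacent
# ranks share the fastest links). The grouping order is an assumption.

def rank_to_coords(rank: int, tp: int, pp: int, dp: int) -> tuple[int, int, int]:
    assert 0 <= rank < tp * pp * dp
    tp_rank = rank % tp            # innermost: tensor parallel
    pp_rank = (rank // tp) % pp    # middle: pipeline parallel
    dp_rank = rank // (tp * pp)    # outermost: data parallel
    return dp_rank, pp_rank, tp_rank

# Example: 16 GPUs split as TP=2, PP=4, DP=2.
for r in range(16):
    print(r, rank_to_coords(r, tp=2, pp=4, dp=2))
```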
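FP8 training on H100 is usually driven through NVIDIA Transformer Engine, whose `fp8_autocast` context runs eligible layers on FP8 tensor-core kernels with delayed scaling. The sketch below uses Transformer Engine's public PyTorch API and requires the `transformer-engine` package plus an FP8-capable GPU; it illustrates the general mechanism, not necessarily how this project integrates it.

```python
# Minimal sketch of FP8 execution via NVIDIA Transformer Engine, the
# usual route to FP8 on H100. Shows the general mechanism only; this
# project's internal wiring may differ.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(
    margin=0,
    fp8_format=recipe.Format.HYBRID,  # E4M3 forward, E5M2 backward
)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

y.sum().backward()  # gradients flow back through the FP8 kernels
print(y.shape)
```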