About
This skill provides specialized guidance for training massive language models ranging from 2B to over 400B parameters. It enables developers to implement complex parallelism strategies (Tensor, Pipeline, Sequence, and Expert parallelism) to maximize GPU efficiency and Model FLOP Utilization (MFU) on NVIDIA hardware. Designed for researchers and engineers building production-grade models such as LLaMA, Nemotron, or DeepSeek, it offers standardized workflows for FP8 precision training, Mixture of Experts (MoE) configuration, and multi-node cluster optimization.
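To make the parallelism and MFU concepts concrete, here is a minimal Python sketch, not part of the skill itself: it shows how tensor and pipeline parallel degrees determine the remaining data-parallel degree, and how MFU is commonly estimated from the ~6N FLOPs-per-token rule of thumb (2N forward, 4N backward) for dense transformers. The function names, the 64-GPU layout, and the per-GPU peak of ~1e15 FLOP/s are illustrative assumptions, not values from this skill.

```python
def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    """Data-parallel degree left over once tensor parallelism (tp) and
    pipeline parallelism (pp) have carved up the cluster."""
    model_parallel = tp * pp
    if world_size % model_parallel:
        raise ValueError("world_size must be divisible by tp * pp")
    return world_size // model_parallel


def mfu(num_params: float, tokens_per_sec: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOP Utilization: achieved training FLOPs divided by peak
    hardware FLOPs, using the common ~6 * N FLOPs-per-token estimate
    for a dense transformer (forward + backward pass)."""
    achieved_flops = 6.0 * num_params * tokens_per_sec
    return achieved_flops / (num_gpus * peak_flops_per_gpu)


if __name__ == "__main__":
    # Hypothetical 64-GPU job: TP=8 within each node, PP=4 across nodes,
    # leaving 2 data-parallel replicas of the model.
    dp = data_parallel_size(world_size=64, tp=8, pp=4)
    print(f"data-parallel size: {dp}")  # -> 2

    # Illustrative numbers only: a 70B dense model at 61,000 aggregate
    # tokens/s on 64 GPUs, assuming ~1e15 FLOP/s peak per GPU.
    util = mfu(num_params=70e9, tokens_per_sec=61_000,
               num_gpus=64, peak_flops_per_gpu=1e15)
    print(f"MFU: {util:.2%}")  # roughly 40%
```

Note that frameworks differ in how expert parallelism composes with the other degrees (for example, whether it partitions the data-parallel group), so the two-factor layout above is a simplification.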