About
This skill provides specialized guidance for training massive language models ranging from 2B to over 400B parameters. It enables developers to implement complex parallelism strategies (Tensor, Pipeline, Sequence, and Expert parallelism) to maximize GPU efficiency and Model FLOP Utilization (MFU) on NVIDIA hardware. Designed for researchers and engineers building production-grade models such as LLaMA, Nemotron, or DeepSeek, it offers standardized workflows for FP8 precision training, Mixture of Experts (MoE) configuration, and multi-node cluster optimization.
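To make the parallelism and MFU concepts concrete, here is a minimal Python sketch, not part of the skill itself: it shows how tensor and pipeline parallel degrees determine the remaining data-parallel degree, and how MFU is commonly estimated from the ~6N FLOPs-per-token rule of thumb (2N forward, 4N backward) for dense transformers. The function names, the 64-GPU layout, and the per-GPU peak of ~1e15 FLOP/s are illustrative assumptions, not values from this skill.

```python
def data_parallel_size(world_size: int, tp: int, pp: int) -> int:
    """Data-parallel degree left over once tensor parallelism (tp) and
    pipeline parallelism (pp) have carved up the cluster."""
    model_parallel = tp * pp
    if world_size % model_parallel:
        raise ValueError("world_size must be divisible by tp * pp")
    return world_size // model_parallel


def mfu(num_params: float, tokens_per_sec: float,
        num_gpus: int, peak_flops_per_gpu: float) -> float:
    """Model FLOP Utilization: achieved training FLOPs divided by peak
    hardware FLOPs, using the common ~6 * N FLOPs-per-token estimate
    for a dense transformer (forward + backward pass)."""
    achieved_flops = 6.0 * num_params * tokens_per_sec
    return achieved_flops / (num_gpus * peak_flops_per_gpu)


if __name__ == "__main__":
    # Hypothetical 64-GPU job: TP=8 within each node, PP=4 across nodes,
    # leaving 2 data-parallel replicas of the model.
    dp = data_parallel_size(world_size=64, tp=8, pp=4)
    print(f"data-parallel size: {dp}")  # -> 2

    # Illustrative numbers only: a 70B dense model at 61,000 aggregate
    # tokens/s on 64 GPUs, assuming ~1e15 FLOP/s peak per GPU.
    util = mfu(num_params=70e9, tokens_per_sec=61_000,
               num_gpus=64, peak_flops_per_gpu=1e15)
    print(f"MFU: {util:.2%}")  # roughly 40%
```

Note that frameworks differ in how expert parallelism composes with the other degrees (for example, whether it partitions the data-parallel group), so the two-factor layout above is a simplification.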