- Performance tuning for 47%+ Model FLOPs Utilization (MFU)
- Automated hyperparameter configuration for LLaMA-style models
- Expert Parallelism for Mixture-of-Experts (MoE) training
- Advanced 3D Parallelism (Tensor, Pipeline, Data)
- FP8 mixed-precision support for NVIDIA H100 GPUs

Hedged sketches of each of these features follow below, in the same order.
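MFU is the ratio of the model FLOPs a training run actually sustains to the hardware's peak FLOP rate. A minimal sketch of the arithmetic, using the standard approximation of 6 FLOPs per parameter per token for forward plus backward; the 7B parameter count, throughput, and H100 peak figure are illustrative assumptions, not measurements from this project:

```python
# Minimal MFU arithmetic, assuming the standard 6 * N FLOPs-per-token
# approximation (forward + backward). All numbers below are
# illustrative assumptions, not measurements from this project.

def mfu(num_params: float, tokens_per_sec: float, peak_flops: float) -> float:
    """Achieved model FLOP/s divided by hardware peak FLOP/s."""
    achieved = 6.0 * num_params * tokens_per_sec
    return achieved / peak_flops

n_params = 7e9   # assumed 7B-parameter model
peak = 989e12    # H100 SXM BF16 dense peak, roughly 989 TFLOP/s
tps = 11_000     # assumed per-GPU training throughput, tokens/s

print(f"MFU = {mfu(n_params, tps, peak):.1%}")  # ~46.7%, near the 47% target
```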
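The project's actual auto-configuration logic is not spelled out here, so the following is a hypothetical sketch of one common approach: derive a LLaMA-style shape from a hidden size and depth using the published LLaMA conventions (head dimension 128, SwiGLU FFN width of about 8/3 of the hidden size rounded up to a multiple of 256), then check the resulting parameter count. The function names and the vocabulary size are assumptions.

```python
# Hypothetical sketch of automated hyperparameter derivation for a
# LLaMA-style model. The heuristics (head_dim = 128, SwiGLU FFN sized
# to ~8/3 * hidden, rounded up to a multiple of 256) follow published
# LLaMA conventions; this project's real logic may differ.
from dataclasses import dataclass

@dataclass
class LlamaConfig:
    hidden_size: int
    num_layers: int
    num_heads: int
    ffn_size: int
    vocab_size: int = 32_000  # assumed LLaMA-1/2 vocabulary size

def auto_configure(hidden_size: int, num_layers: int) -> LlamaConfig:
    num_heads = hidden_size // 128            # head_dim fixed at 128
    ffn = int(8 * hidden_size / 3)            # SwiGLU width heuristic
    ffn = ((ffn + 255) // 256) * 256          # round up to multiple of 256
    return LlamaConfig(hidden_size, num_layers, num_heads, ffn)

def param_count(cfg: LlamaConfig) -> int:
    attn = 4 * cfg.hidden_size ** 2             # Q, K, V, O projections
    mlp = 3 * cfg.hidden_size * cfg.ffn_size    # gate, up, down (SwiGLU)
    emb = 2 * cfg.vocab_size * cfg.hidden_size  # input + output embeddings
    return cfg.num_layers * (attn + mlp) + emb

cfg = auto_configure(hidden_size=4096, num_layers=32)
print(cfg, f"{param_count(cfg) / 1e9:.2f}B params")
```

With `hidden_size=4096` and `num_layers=32` this reproduces the LLaMA-7B shape: 32 heads, FFN width 11008, and roughly 6.74B parameters.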
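With Expert Parallelism, the experts of an MoE layer are sharded across GPUs and each token is sent, typically via an all-to-all exchange, to the ranks hosting its selected experts. The sketch below shows only the top-2 routing logic in a single process; the distributed exchange and the expert sharding are elided, and all names and sizes are illustrative.

```python
# Minimal single-process sketch of MoE top-2 routing. Under real Expert
# Parallelism the experts below would live on different GPUs and tokens
# would move via torch.distributed all-to-all; that exchange is elided.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model). Pick the top-k experts for every token.
        weights, idx = torch.topk(F.softmax(self.router(x), dim=-1), self.k, dim=-1)
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over k
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = idx[:, slot] == e  # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE(d_model=64, d_ff=256, num_experts=8)
print(moe(torch.randn(16, 64)).shape)  # torch.Size([16, 64])
```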
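3D parallelism factors the GPU fleet into a tensor x pipeline x data grid, and every global rank owns one coordinate in that grid. A hedged sketch of the rank mapping follows, assuming tensor-parallel ranks are innermost so they sit on the fastest interconnect; that is a common convention, not necessarily the one this project uses.

```python
# Hedged sketch of 3D-parallel rank decomposition: world size factored
# as DP * PP * TP, tensor-parallel ranks assumed innermost (adjacent
# ranks share the fastest links). The grouping order is an assumption.

def rank_to_coords(rank: int, tp: int, pp: int, dp: int) -> tuple[int, int, int]:
    assert 0 <= rank < tp * pp * dp
    tp_rank = rank % tp            # innermost: tensor parallel
    pp_rank = (rank // tp) % pp    # middle: pipeline parallel
    dp_rank = rank // (tp * pp)    # outermost: data parallel
    return dp_rank, pp_rank, tp_rank

# Example: 16 GPUs split as TP=2, PP=4, DP=2.
for r in range(16):
    print(r, rank_to_coords(r, tp=2, pp=4, dp=2))
```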
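FP8 training on H100 is usually driven through NVIDIA Transformer Engine, whose `fp8_autocast` context runs eligible layers on FP8 tensor-core kernels with delayed scaling. The sketch below uses Transformer Engine's public PyTorch API and requires the `transformer-engine` package plus an FP8-capable GPU; it illustrates the general mechanism, not necessarily how this project integrates it.

```python
# Minimal sketch of FP8 execution via NVIDIA Transformer Engine, the
# usual route to FP8 on H100. Shows the general mechanism only; this
# project's internal wiring may differ.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(
    margin=0,
    fp8_format=recipe.Format.HYBRID,  # E4M3 forward, E5M2 backward
)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(8, 4096, device="cuda", dtype=torch.bfloat16)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

y.sum().backward()  # gradients flow back through the FP8 kernels
print(y.shape)
```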