How does this skill handle Out of Memory (OOM) errors?

It provides built-in strategies for gradient checkpointing (recomputation), optimizer offloading to CPU/NVMe, and dynamic adjustment of tensor/pipeline parallelism degrees.
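Whether any of these strategies is needed depends largely on how much memory the weights, gradients, and optimizer state occupy on each GPU. The sketch below works through that arithmetic. It assumes the common mixed-precision Adam rule of thumb of 16 bytes per parameter (2 for bf16 weights, 2 for bf16 gradients, 12 for fp32 optimizer state), an even split of parameters across tensor and pipeline ranks, and no activation memory. `model_state_gib` and its arguments are hypothetical names for illustration, not part of the skill or of Megatron-Core.

```python
# Rule-of-thumb estimate of model-state memory per GPU (activations excluded).
# Assumption: mixed-precision Adam, 2 B weights + 2 B grads + 12 B optimizer
# state per parameter. All names here are illustrative, not a Megatron-Core API.

def model_state_gib(params, tp=1, pp=1, dp=1,
                    shard_optimizer=False, offload_optimizer=False):
    """Approximate GiB of weights, gradients and optimizer state on one GPU."""
    if min(tp, pp, dp) < 1:
        raise ValueError("parallel degrees must be >= 1")
    local_params = params / (tp * pp)    # TP and PP split the parameters
    weights_and_grads = 4 * local_params
    optimizer = 12 * local_params
    if shard_optimizer:                  # optimizer state sharded across DP ranks
        optimizer /= dp
    if offload_optimizer:                # optimizer state moved to CPU/NVMe
        optimizer = 0
    return (weights_and_grads + optimizer) / 2**30


if __name__ == "__main__":
    baseline = model_state_gib(70e9, tp=8, pp=4)
    sharded = model_state_gib(70e9, tp=8, pp=4, dp=16, shard_optimizer=True)
    offloaded = model_state_gib(70e9, tp=8, pp=4, offload_optimizer=True)
    print(f"baseline:  {baseline:.1f} GiB")
    print(f"sharded:   {sharded:.1f} GiB")
    print(f"offloaded: {offloaded:.1f} GiB")
```

Under these assumptions a 70B model with TP=8 and PP=4 holds about 32.6 GiB of model state per GPU before activations, which is why raising the parallelism degree or moving optimizer state off the GPU are the usual first responses to an OOM.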

What hardware is required for this training skill?

This skill requires NVIDIA Ampere or newer GPUs (A100, H100) and high-speed networking like InfiniBand or 400Gb Ethernet for multi-node scaling.

What models can I train with this skill?

You can train models ranging from 2B to 462B parameters, including architectures like LLaMA, Nemotron, and MoE models like Mixtral or DeepSeek.

Does this skill support FP8 training?

Yes, it includes specialized configurations for FP8 hybrid precision training designed specifically for NVIDIA Hopper (H100) and newer GPU architectures.
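To illustrate what "hybrid" FP8 usually means, here is a minimal sketch. It assumes the term carries the meaning it has in NVIDIA's Transformer Engine: the E4M3 format for forward-pass tensors and E5M2 for gradients in the backward pass. It is not the skill's configuration and not Transformer Engine code. The function names are hypothetical, and the scaling rule shown (format maximum divided by the largest recent absolute value) is a simplified form of delayed scaling.

```python
# Largest finite values representable in the two FP8 formats.
E4M3_MAX = 448.0      # more mantissa bits, used for forward tensors
E5M2_MAX = 57344.0    # more exponent range, used for gradients


def fp8_scale(amax_history, fp8_max):
    """Scale factor that maps the largest recent |value| onto the FP8 maximum."""
    amax = max(amax_history)
    if amax == 0.0:
        return 1.0    # nothing observed yet, leave values unscaled
    return fp8_max / amax


def hybrid_scales(forward_amax_history, backward_amax_history):
    """Per-direction scales under the hybrid recipe (illustrative only)."""
    return {
        "forward_e4m3": fp8_scale(forward_amax_history, E4M3_MAX),
        "backward_e5m2": fp8_scale(backward_amax_history, E5M2_MAX),
    }


if __name__ == "__main__":
    print(hybrid_scales([1.0, 2.0, 4.0], [0.5, 2.0]))
```

Gradients tend to span a wider dynamic range than activations, which is the usual motivation for giving the backward pass the format with more exponent bits.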

When should I use Megatron-Core over PyTorch FSDP?

Megatron-Core is preferred for models larger than 70B parameters or scenarios requiring maximum hardware efficiency (MFU > 40%) and fine-grained control over 3D parallelism.
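The MFU threshold can be checked against a measured throughput with a short calculation. The sketch below uses the widely cited approximation of 6 FLOPs per parameter per trained token for a dense transformer, which ignores attention FLOPs and any recomputation. The `mfu` helper, the throughput figure, and the 312 TFLOP/s peak (the dense BF16 figure for an A100) are assumptions for illustration, not values taken from the skill.

```python
def mfu(params, tokens_per_second, num_gpus, peak_flops_per_gpu):
    """Model FLOPs Utilization: achieved model FLOP/s over aggregate peak FLOP/s."""
    if num_gpus < 1 or peak_flops_per_gpu <= 0:
        raise ValueError("num_gpus and peak_flops_per_gpu must be positive")
    achieved_flops = 6 * params * tokens_per_second   # forward + backward, dense model
    return achieved_flops / (num_gpus * peak_flops_per_gpu)


if __name__ == "__main__":
    A100_BF16_PEAK = 312e12   # assumed dense BF16 peak per GPU, in FLOP/s
    # Hypothetical run: 70B parameters, 400k tokens/s across 1024 GPUs.
    value = mfu(70e9, 400_000, 1024, A100_BF16_PEAK)
    print(f"MFU: {value:.1%}")
```

For MoE models the parameter count in this formula should be the parameters active per token rather than the total, otherwise the result overstates utilization.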

Megatron-Core LLM Training

Author: Orchestra-Research

GitHub stars: 3,983

Category: Data Science & ML

Trains large-scale language models using NVIDIA Megatron-Core with advanced parallelism strategies for maximum GPU efficiency.

This skill helps AI researchers and engineers plan and run high-performance training for models ranging from 2B to over 400B parameters. Built on NVIDIA Megatron-Core, it provides standardized workflows for 3D parallelism (Tensor, Pipeline, and Data), Expert Parallelism for MoE models, and FP8 training on H100 hardware. It suits teams that need production-grade scalability and high Model FLOPs Utilization (MFU) for architectures like LLaMA, Nemotron, or DeepSeek.

Key Features

01 FP8 Precision Training optimized for NVIDIA H100/Hopper

02 Sequence and Context Parallelism for long-context windows

03 Advanced 3D Parallelism (Tensor, Pipeline, and Data)

04 Mixture of Experts (MoE) with Expert Parallelism support

05 Automated performance tuning for maximum Model FLOPs Utilization (MFU)

06 3,983 GitHub stars
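The three parallelism dimensions listed above have to multiply out to the number of GPUs in the job. The sketch below shows that bookkeeping: the data-parallel degree is whatever remains after the tensor, pipeline, and context degrees are fixed. The rank-to-coordinate mapping assumes tensor-parallel ranks vary fastest, then data-parallel, then pipeline-parallel, with context parallelism left at 1. This is an illustrative assumption about layout, and the function names are hypothetical rather than Megatron-Core APIs.

```python
def data_parallel_size(world_size, tp, pp, cp=1):
    """Data-parallel degree left over once TP, PP and CP degrees are chosen."""
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp*cp={model_parallel}"
        )
    return world_size // model_parallel


def rank_coords(rank, world_size, tp, pp):
    """(tp_rank, dp_rank, pp_rank) for a global rank, assuming TP varies fastest."""
    dp = data_parallel_size(world_size, tp, pp)
    if not 0 <= rank < world_size:
        raise ValueError("rank out of range")
    tp_rank = rank % tp
    dp_rank = (rank // tp) % dp
    pp_rank = rank // (tp * dp)
    return tp_rank, dp_rank, pp_rank


if __name__ == "__main__":
    world, tp, pp = 16, 2, 4
    print("data-parallel size:", data_parallel_size(world, tp, pp))
    for rank in range(world):
        print(rank, rank_coords(rank, world, tp, pp))
```

Keeping tensor-parallel ranks adjacent places the most communication-heavy groups on GPUs within the same node, where interconnect bandwidth is highest.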

Use Cases

01 Implementing sparse Mixture of Experts (MoE) architectures with efficient memory distribution

02 Training foundation models like LLaMA-3 from scratch on large GPU clusters

03 Optimizing throughput and reducing training costs on NVIDIA A100 and H100 infrastructure


Install with 🐟 Skill.Fish

npx skillfish add orchestra-research/ai-research-skills megatron-core
