About
This skill provides comprehensive technical guidance for implementing and optimizing PyTorch Fully Sharded Data Parallel (FSDP) to train massive models. It helps developers configure parameter sharding, mixed precision, and CPU offloading, and adopt the newer FSDP2 (`fully_shard`) API. It also covers distributed communication backends such as NCCL and Gloo, along with handling uneven inputs via the Join context manager, so that AI research agents can run high-performance, multi-node training workloads with a minimal per-GPU memory footprint.
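
As a concrete illustration, here is a minimal sketch of an FSDP2 setup combining sharding, mixed precision, and CPU offloading. It assumes PyTorch 2.6+, where `fully_shard`, `MixedPrecisionPolicy`, and `CPUOffloadPolicy` are exported from `torch.distributed.fsdp`; the stacked-linear model and the hyperparameters are placeholders, not a recommended configuration.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import (
    CPUOffloadPolicy,
    MixedPrecisionPolicy,
    fully_shard,
)

# Launched via `torchrun --nproc-per-node=<gpus> train.py`, which sets the
# environment variables that init_process_group reads.
dist.init_process_group("nccl")  # NCCL for GPU collectives; use "gloo" on CPU
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Stand-in model: a stack of large linear layers.
model = torch.nn.Sequential(*[torch.nn.Linear(4096, 4096) for _ in range(8)])

mp_policy = MixedPrecisionPolicy(
    param_dtype=torch.bfloat16,  # gather/compute parameters in bf16
    reduce_dtype=torch.float32,  # reduce gradients in fp32 for stability
)

# Shard each block first, then the root module, so FSDP2 groups parameters
# per block and can overlap communication with computation.
for layer in model:
    fully_shard(layer, mp_policy=mp_policy, offload_policy=CPUOffloadPolicy())
fully_shard(model, mp_policy=mp_policy, offload_policy=CPUOffloadPolicy())

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```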
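
For uneven inputs, the Join context manager lets ranks that exhaust their data early shadow the collective communications of ranks that are still training. The sketch below illustrates the pattern with DDP, where `Joinable` support is documented in PyTorch; whether a given FSDP version implements `Joinable` is version-dependent, and the linear model and batch counts are placeholders.

```python
import torch
from torch.distributed.algorithms import Join
from torch.nn.parallel import DistributedDataParallel as DDP

def train_uneven(rank: int) -> None:
    # Assumes the process group is already initialized (e.g. via torchrun).
    model = DDP(torch.nn.Linear(16, 16).cuda(rank), device_ids=[rank])

    # Give each rank a different number of batches to simulate uneven inputs.
    batches = [torch.randn(8, 16, device=f"cuda:{rank}") for _ in range(10 + rank)]

    # Inside the context, early-finishing ranks keep matching the collectives
    # issued by ranks that still have data, avoiding hangs.
    with Join([model]):
        for batch in batches:
            loss = model(batch).sum()
            loss.backward()
```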