01 Distributed checkpointing with HuggingFace interoperability
02 Standardized TOML configuration for Llama 3.1 and custom models
03 Multi-node scaling with SLURM and torchrun integration
04 Native 4D parallelism (FSDP2, TP, PP, and CP)
05 Float8 training support for H100 GPU performance boosts
06 3,983 GitHub stars
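
As a rough illustration of how the four parallelism dimensions listed above (FSDP2, TP, PP, CP) fit together, the sketch below assumes the usual convention that the product of all four degrees equals the total number of GPUs, and derives the data-parallel (FSDP) degree from the other three. The function name `data_parallel_degree` and its parameters are hypothetical and are not taken from any library's API.

```python
from math import prod


def data_parallel_degree(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Return the data-parallel (FSDP) degree implied by the other degrees.

    Assumption (not from the source): world_size == dp * tp * pp * cp.
    """
    degrees = {"tp": tp, "pp": pp, "cp": cp}
    for name, value in degrees.items():
        if value < 1:
            raise ValueError(f"{name} must be >= 1, got {value}")

    # Number of GPUs consumed by one model replica.
    model_parallel = prod(degrees.values())
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp*cp={model_parallel}"
        )
    # Whatever is left over is available for data parallelism.
    return world_size // model_parallel


if __name__ == "__main__":
    # Hypothetical job: 64 GPUs with TP=8, PP=2, CP=2.
    print(data_parallel_degree(64, tp=8, pp=2, cp=2))
```

A check like this is useful before launching a multi-node job, since a degree combination that does not divide the GPU count evenly cannot be laid out on the cluster.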