Support for native PyTorch 2.2+ Scaled Dot Product Attention (SDPA); a short usage sketch follows this list
Advanced workflows for multi-query attention and sliding-window attention (see the flash_attn_func sketch below)
2-4x speedup for Transformer attention layers (a rough timing sketch follows this list)
FlashAttention-3 implementation with FP8 support for maximum performance on H100 GPUs
10-20x reduction in GPU memory footprint via IO-aware tiling (illustrated by the pure-PyTorch sketch below)
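PyTorch's SDPA entry point is `torch.nn.functional.scaled_dot_product_attention`; since 2.2 its flash backend builds on FlashAttention-2. A minimal sketch, assuming a CUDA device, fp16 inputs, and illustrative shapes (the `sdpa_kernel`/`SDPBackend` import path is from recent PyTorch releases; older versions expose `torch.backends.cuda.sdp_kernel` instead):

```python
# Minimal SDPA sketch; shapes, device, and dtype are illustrative assumptions.
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend  # recent PyTorch; older: torch.backends.cuda.sdp_kernel

# (batch, heads, seq_len, head_dim)
q, k, v = (torch.randn(2, 8, 1024, 64, device="cuda", dtype=torch.float16) for _ in range(3))

# Explicitly restrict dispatch to the FlashAttention backend (optional; by
# default PyTorch picks the fastest eligible backend automatically).
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
```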
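For multi-query/grouped-query attention, `flash_attn.flash_attn_func` accepts fewer KV heads than query heads, and sliding-window attention is selected via its `window_size=(left, right)` tuple. A minimal sketch with assumed, illustrative sizes:

```python
# MQA/GQA plus sliding-window attention with flash-attn; sizes are illustrative.
import torch
from flash_attn import flash_attn_func

batch, seqlen, head_dim = 2, 4096, 64
n_q_heads, n_kv_heads = 32, 4  # GQA: 4 KV heads shared across 32 query heads (MQA would use 1)

# flash_attn_func expects (batch, seq_len, heads, head_dim) layout
q = torch.randn(batch, seqlen, n_q_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(batch, seqlen, n_kv_heads, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(batch, seqlen, n_kv_heads, head_dim, device="cuda", dtype=torch.float16)

# Causal attention restricted to the previous 1024 tokens;
# window_size=(-1, -1) would mean no window at all.
out = flash_attn_func(q, k, v, causal=True, window_size=(1024, 0))
```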
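A rough way to check the speedup claim on your own hardware is to time a naive implementation against the fused kernel. The sketch below uses `torch.utils.benchmark` with assumed shapes; it is illustrative, not a rigorous benchmark:

```python
# Rough timing sketch: naive attention vs. the fused SDPA kernel.
import torch
import torch.nn.functional as F
from torch.utils import benchmark

def naive_attention(q, k, v):
    # Materializes the full (seq x seq) score matrix; this is what tiling avoids.
    scores = q @ k.transpose(-2, -1) * q.shape[-1] ** -0.5
    return torch.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(2, 8, 2048, 64, device="cuda", dtype=torch.float16)

for name, fn in [("naive", naive_attention), ("sdpa", F.scaled_dot_product_attention)]:
    t = benchmark.Timer(stmt="fn(q, k, v)", globals={"fn": fn, "q": q, "k": k, "v": v})
    print(name, t.timeit(20))
```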
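The memory saving comes from never materializing the full attention matrix: keys and values are processed in blocks while a running (online) softmax tracks a running max and normalizer. A pure-PyTorch sketch of that idea, for illustration only (the real kernels fuse this loop into SRAM-resident CUDA):

```python
# Pure-PyTorch sketch of IO-aware tiling with an online softmax; the full
# (seq x seq) score matrix is never materialized.
import torch

def tiled_attention(q, k, v, block=256):
    scale = q.shape[-1] ** -0.5
    seq = k.shape[-2]
    m = torch.full(q.shape[:-1] + (1,), float("-inf"), dtype=q.dtype)  # running max
    l = torch.zeros_like(m)    # running softmax denominator
    acc = torch.zeros_like(q)  # running (unnormalized) output

    for start in range(0, seq, block):
        kb = k[..., start:start + block, :]
        vb = v[..., start:start + block, :]
        s = (q @ kb.transpose(-2, -1)) * scale  # scores for this block only
        m_new = torch.maximum(m, s.amax(dim=-1, keepdim=True))
        alpha = torch.exp(m - m_new)            # rescale previous accumulator
        p = torch.exp(s - m_new)
        l = alpha * l + p.sum(dim=-1, keepdim=True)
        acc = alpha * acc + p @ vb
        m = m_new
    return acc / l

q = k = v = torch.randn(2, 8, 2048, 64)
out = tiled_attention(q, k, v)
# Sanity check against the naive reference implementation.
ref = torch.softmax(q @ k.transpose(-2, -1) * q.shape[-1] ** -0.5, dim=-1) @ v
print(torch.allclose(out, ref, atol=1e-4))
```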