- Advanced memory optimization via attention slicing, VAE tiling, and BF16
- Generic pipeline support for FLUX, SDXL, Wan, and CogVideoX models
- Automated NPU environment and hardware resource pre-checks
- Dynamic LoRA adapter loading and multi-LoRA weight stacking
- Distributed multi-NPU inference using Context Parallelism and HCCL
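The memory and LoRA features above map closely onto the Hugging Face `diffusers` API. The sketch below shows how such a pipeline might be assembled; the model ID, LoRA paths, adapter names and weights, and the `npu` device string are illustrative assumptions, not values taken from this repository.

```python
def load_flux_pipeline(device: str = "npu"):
    """Sketch: build a BF16 FLUX pipeline with the memory optimizations
    listed above (attention slicing, VAE tiling) plus stacked LoRA adapters.

    All model IDs, paths, and adapter names below are placeholders.
    Requires `diffusers` and `torch` (and `torch_npu` for Ascend devices).
    """
    import torch
    from diffusers import FluxPipeline

    # BF16 halves weight memory vs. FP32 while keeping FP32's exponent range.
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev",  # placeholder model ID
        torch_dtype=torch.bfloat16,
    )

    # Attention slicing computes attention in chunks rather than one large matmul,
    # trading a little speed for a much lower peak memory footprint.
    pipe.enable_attention_slicing()

    # VAE tiling decodes the latent tile by tile, bounding decoder peak memory
    # at high resolutions.
    pipe.enable_vae_tiling()

    # Dynamic LoRA loading and multi-LoRA weight stacking:
    # each adapter is loaded under a name, then blended with per-adapter weights.
    pipe.load_lora_weights("path/to/style_lora", adapter_name="style")    # placeholder
    pipe.load_lora_weights("path/to/detail_lora", adapter_name="detail")  # placeholder
    pipe.set_adapters(["style", "detail"], adapter_weights=[0.8, 0.5])

    return pipe.to(device)
```

The function defers its imports so that simply defining it does not require the NPU stack; calling it performs the actual download and device placement.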