- Advanced memory optimization via attention slicing, VAE tiling, and BF16
- Generic pipeline support for FLUX, SDXL, Wan, and CogVideoX models
- Automated NPU environment and hardware resource pre-checks
- Dynamic LoRA adapter loading and multi-LoRA weight stacking
- Distributed multi-NPU inference using Context Parallelism and HCCL
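The memory and LoRA features above map closely onto the Hugging Face `diffusers` API. The sketch below shows how such a pipeline might be assembled; the model ID, LoRA paths, adapter names and weights, and the `npu` device string are illustrative assumptions, not values taken from this repository.

```python
def load_flux_pipeline(device: str = "npu"):
    """Sketch: build a BF16 FLUX pipeline with the memory optimizations
    listed above (attention slicing, VAE tiling) plus stacked LoRA adapters.

    All model IDs, paths, and adapter names below are placeholders.
    Requires `diffusers` and `torch` (and `torch_npu` for Ascend devices).
    """
    import torch
    from diffusers import FluxPipeline

    # BF16 halves weight memory vs. FP32 while keeping FP32's exponent range.
    pipe = FluxPipeline.from_pretrained(
        "black-forest-labs/FLUX.1-dev",  # placeholder model ID
        torch_dtype=torch.bfloat16,
    )

    # Attention slicing computes attention in chunks rather than one large matmul,
    # trading a little speed for a much lower peak memory footprint.
    pipe.enable_attention_slicing()

    # VAE tiling decodes the latent tile by tile, bounding decoder peak memory
    # at high resolutions.
    pipe.enable_vae_tiling()

    # Dynamic LoRA loading and multi-LoRA weight stacking:
    # each adapter is loaded under a name, then blended with per-adapter weights.
    pipe.load_lora_weights("path/to/style_lora", adapter_name="style")    # placeholder
    pipe.load_lora_weights("path/to/detail_lora", adapter_name="detail")  # placeholder
    pipe.set_adapters(["style", "detail"], adapter_weights=[0.8, 0.5])

    return pipe.to(device)
```

The function defers its imports so that simply defining it does not require the NPU stack; calling it performs the actual download and device placement.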