How much accuracy is lost during the pruning process?

At 50% sparsity, the techniques in this skill typically lose less than 1% accuracy, which is usually an acceptable trade-off for the memory and latency savings in production.

What is the difference between Wanda and SparseGPT pruning?

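Both are one-shot methods. Wanda scores each weight by its magnitude multiplied by the norm of the corresponding input activation and removes the lowest-scoring weights. SparseGPT uses approximate second-order (Hessian) information to choose which weights to remove and to adjust the remaining ones, which tends to preserve accuracy better at the cost of more computation.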
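The Wanda-style score described above can be sketched in a few lines. This is a minimal illustration, not the skill's implementation: it uses plain Python lists instead of tensors, the function name `wanda_prune` is hypothetical, and it assumes weights are compared within each output row.

```python
import math

def wanda_prune(weights, activations, sparsity=0.5):
    """Return a pruned copy of `weights` using a Wanda-style score.

    weights:     rows of shape [out_features][in_features]
    activations: calibration inputs of shape [n_samples][in_features]
    (Hypothetical helper for illustration only.)
    """
    in_features = len(weights[0])
    # L2 norm of each input feature across the calibration samples.
    norms = [math.sqrt(sum(x[j] ** 2 for x in activations))
             for j in range(in_features)]
    n_prune = int(in_features * sparsity)
    pruned = []
    for row in weights:
        # Score = |weight| * norm of the matching input feature.
        scores = [abs(w) * norms[j] for j, w in enumerate(row)]
        # Drop the lowest-scoring weights within this output row.
        drop = set(sorted(range(in_features), key=lambda j: scores[j])[:n_prune])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned

weights = [[0.9, -0.1, 0.5, 0.2]]
activations = [[0.1, 10.0, 0.1, 1.0],
               [0.1, 10.0, 0.1, 1.0]]
print(wanda_prune(weights, activations))  # [[0.0, -0.1, 0.0, 0.2]]
```

In this toy example, pure magnitude pruning would keep 0.9 and 0.5. The activation-aware score instead keeps the small weight -0.1, because its input feature is consistently large.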

Which model architectures are supported?

The provided implementation patterns are designed for Hugging Face Transformers and work across major transformer-based architectures including Llama, Mistral, and BERT-style models.

Can I get a speedup on consumer GPUs using these techniques?

Potentially. Unstructured pruning needs specialized sparse kernels before it runs any faster, whereas N:M semi-structured pruning (such as 2:4) is designed to be accelerated by the sparse tensor cores in NVIDIA Ampere and newer architectures.
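To make the N:M pattern concrete, here is a minimal sketch that keeps the 2 largest-magnitude values in every group of 4 consecutive weights. The function name `nm_prune` is hypothetical, and plain magnitude is used as the selection criterion purely as an assumption; a Wanda or SparseGPT score could be substituted.

```python
def nm_prune(row, n=2, m=4):
    """Keep the n largest-magnitude values in each group of m weights.

    Hypothetical helper for illustration; operates on a flat list of floats.
    """
    if len(row) % m != 0:
        raise ValueError("row length must be a multiple of m")
    out = []
    for start in range(0, len(row), m):
        group = row[start:start + m]
        # Indices of the n largest-magnitude entries in this group.
        keep = set(sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n])
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out

row = [0.1, -0.8, 0.3, 0.05, 1.0, 0.2, -0.4, 0.9]
print(nm_prune(row))  # [0.0, -0.8, 0.3, 0.0, 1.0, 0.0, 0.0, 0.9]
```

Zeroing weights in this pattern does not speed anything up by itself. The speedup comes only when the pruned matrix is stored in a compressed format and run through a kernel that targets the sparse tensor cores.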

Does this skill require retraining the model?

No. The primary techniques, Wanda and SparseGPT, are one-shot methods that prune the model without expensive retraining or extensive fine-tuning.

LLM Model Pruning

Name: LLM Model Pruning
Author: zechenzhangAGI


Data Science & ML

Compresses Large Language Models using advanced techniques like Wanda and SparseGPT to reduce memory footprint and accelerate inference speeds.

This skill provides implementation patterns for reducing the size and compute requirements of Large Language Models (LLMs) through pruning. It covers one-shot methods such as Wanda and SparseGPT, which can reach 50% sparsity with typically under 1% accuracy loss. Whether you are deploying models on edge devices, targeting NVIDIA sparse tensor cores with N:M semi-structured pruning, or lowering inference latency in production, the skill offers the patterns and best practices needed to balance model quality against resource efficiency.

Key Features

01 One-shot pruning with Wanda and SparseGPT for rapid compression

02 Support for N:M semi-structured sparsity (2:4, 4:8) for hardware acceleration

03 Customizable layer-wise and iterative pruning strategies

04 Accuracy-aware pruning achieving <1% loss at 50% sparsity

05 384 GitHub stars

06 Implementation of both structured and unstructured pruning methods
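As an illustration of the iterative strategy mentioned in the feature list, the sketch below raises the target sparsity in equal steps instead of pruning all at once. Everything here is an assumption made for illustration: the helper names are hypothetical, the schedule is linear, and plain magnitude pruning stands in for whichever criterion is actually used.

```python
def sparsity_schedule(final_sparsity, steps):
    """Linearly increasing sparsity targets, ending at final_sparsity."""
    return [final_sparsity * (k + 1) / steps for k in range(steps)]

def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction of a flat list of weights."""
    n_prune = int(len(weights) * sparsity)
    drop = set(sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:n_prune])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

def iterative_prune(weights, final_sparsity=0.5, steps=2):
    """Prune in several rounds with a rising sparsity target."""
    for target in sparsity_schedule(final_sparsity, steps):
        weights = magnitude_prune(weights, target)
        # An optional recovery step (e.g. brief fine-tuning) would go here.
        # Without one, the result matches one-shot magnitude pruning.
    return weights

weights = [0.5, -0.1, 0.3, 0.05, -0.9, 0.2, 0.7, -0.4]
print(sparsity_schedule(0.5, 2))  # [0.25, 0.5]
print(iterative_prune(weights))   # [0.5, 0.0, 0.0, 0.0, -0.9, 0.0, 0.7, -0.4]
```

A layer-wise variant could call `iterative_prune` with a different `final_sparsity` per layer, for example pruning less aggressively in layers that prove more sensitive.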

Use Cases

01 Deploying large transformer models on mobile or edge devices with limited VRAM

02 Reducing cloud hosting costs by decreasing the memory footprint of served LLMs

03 Accelerating inference throughput on NVIDIA A100/H100 GPUs using sparse tensor cores


Install with 🐟 Skill.Fish

npx skillfish add zechenzhangagi/ai-research-skills model-pruning
