Automated calibration workflows for custom model weights
Support for high-performance inference kernels such as ExLlamaV2 and Marlin
Integration with PEFT for memory-efficient QLoRA fine-tuning
Post-training 4-bit quantization with less than 2% accuracy loss
4x reduction in VRAM requirements for large model deployment
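The 4-bit post-training scheme above can be illustrated with a minimal sketch of symmetric group-wise weight quantization. This is a simplified stand-in, not the library's actual implementation: real kernels such as Marlin pack two 4-bit values per byte and use calibration data to choose scales, whereas here scales are taken directly from each group's absolute maximum. It also shows where the 4x memory figure comes from (4-bit integers versus 16-bit floats).

```python
import numpy as np

def quantize_4bit(w: np.ndarray, group_size: int = 128):
    """Symmetric per-group int4 quantization: map each group to [-7, 7]."""
    groups = w.reshape(-1, group_size)
    # One scale per group, chosen so the group's max magnitude hits 7.
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover approximate float weights from int4 codes and group scales."""
    return (q.astype(np.float32) * scale).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(size=4096).astype(np.float32)

q, s = quantize_4bit(w)
w_hat = dequantize_4bit(q, s)

# Relative weight reconstruction error (note: this is weight error,
# not end-task accuracy loss, which depends on the model and eval).
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"relative reconstruction error: {rel_err:.4f}")

# Storage: 4 bits/weight vs 16 bits/weight -> 4x smaller, ignoring scales.
print(f"compression vs fp16: {16 / 4:.0f}x")
```

In production libraries the quantized codes would be bit-packed (two per byte) and the per-group scales stored in half precision, so the effective ratio is slightly below 4x once scale overhead is counted.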