- 384 GitHub stars
- Calibration-free quantization requiring no sample datasets (see the data-free quantization sketch after this list)
- Full compatibility with PEFT and LoRA for fine-tuning quantized models (see the PEFT example below)
- Support for extreme compression, from 8-bit down to 1-bit precision
- Optimized inference backends, including Marlin, BitBLAS, and TorchAO
- Seamless integration with Hugging Face Transformers and vLLM (see the vLLM example below)
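
The calibration-free and low-bit bullets above describe the core idea; the sketch below illustrates what data-free quantization means in practice: scales and zero-points are computed from the weight tensor alone, with no forward passes over sample data. This is a minimal round-to-nearest, group-wise scheme written for illustration, not the library's actual algorithm; the bit width, group size, and tensor shapes are assumptions. The same routine covers the 8-bit-to-1-bit range simply by changing `n_bits`.

```python
import torch

def quantize_data_free(weight: torch.Tensor, n_bits: int = 4, group_size: int = 64):
    """Round-to-nearest, group-wise asymmetric quantization.

    Calibration-free: the scale and zero-point for each group come only
    from the weight values themselves -- no sample inputs, no activation
    statistics, no optimization loop.
    """
    out_features, in_features = weight.shape
    w = weight.reshape(-1, group_size)                # one (scale, zero) per group
    w_min = w.amin(dim=1, keepdim=True)
    w_max = w.amax(dim=1, keepdim=True)
    q_max = 2 ** n_bits - 1                           # e.g. 15 for 4-bit, 1 for 1-bit
    scale = (w_max - w_min).clamp(min=1e-8) / q_max
    zero = torch.round(-w_min / scale)
    q = torch.clamp(torch.round(w / scale) + zero, 0, q_max)  # integer codes
    w_hat = (q - zero) * scale                                 # dequantized weights
    q = q.reshape(out_features, in_features).to(torch.uint8)
    w_hat = w_hat.reshape(out_features, in_features)
    return q, scale, zero, w_hat

# Illustrative usage on a random "linear layer" weight matrix.
weight = torch.randn(256, 512)
q, scale, zero, w_hat = quantize_data_free(weight, n_bits=4, group_size=64)
print("mean abs reconstruction error:", (weight - w_hat).abs().mean().item())
```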
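
For the PEFT and LoRA bullet, the list does not spell out a workflow, so the following is a generic sketch of attaching LoRA adapters to a Transformers causal LM with the `peft` package. The model name `facebook/opt-125m` and the target module names are illustrative stand-ins; with a quantized model, the base weights would instead be loaded through the library's quantization config and stay frozen while only the adapters train.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Stand-in base model; in the quantized-fine-tuning case the base model
# would be loaded with the library's quantization config so that LoRA
# adapters train on top of frozen, quantized weights.
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

lora_config = LoraConfig(
    r=8,                                   # adapter rank
    lora_alpha=16,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # OPT attention projections (illustrative)
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```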
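
Similarly, the Transformers and vLLM bullet only claims integration; below is a hedged sketch of serving a quantized checkpoint with vLLM, assuming the library exports a model directory whose config carries the quantization settings (the path is a placeholder, not a real checkpoint). For supported formats, vLLM picks the quantization method up from the checkpoint's config at load time.

```python
from vllm import LLM, SamplingParams

# Placeholder path to a checkpoint exported by the quantization library.
llm = LLM(model="path/to/quantized-model")

params = SamplingParams(temperature=0.7, max_tokens=64)
outputs = llm.generate(["Explain weight-only quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```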