About
HQQ (Half-Quadratic Quantization) is a high-performance model optimization skill designed for rapid LLM compression. Unlike calibration-based methods such as GPTQ or AWQ, HQQ is calibration-free: developers can quantize models to ultra-low bit-widths (down to 1-bit) in minutes rather than hours, with no external calibration dataset required. It integrates natively with HuggingFace Transformers and vLLM, supports multiple optimized CUDA backends such as Marlin and BitBlas, and remains compatible with PEFT/LoRA fine-tuning for efficient model adaptation on consumer-grade hardware.