Quantization reduces model size and speeds up inference by using lower-precision numbers. This guide compares the major LLM quantization methods—GPTQ, AWQ, and GGUF—to help you choose the right approach.
Quick Comparison
| Method | Bits | Speed | Quality | Best For |
|---|---|---|---|---|
| GPTQ | 2-8 | Fast | Good | GPU inference |
| AWQ | 4 | Fast | Better | GPU inference |
| GGUF | 2-8 | Medium | Good | CPU + GPU hybrid |
| bitsandbytes | 4, 8 | Medium | Good | Training (QLoRA) |
Understanding Quantization
What Quantization Does
Quantization maps floating-point weights to lower-precision integers:
# FP16 weight: 0.0234375 (2 bytes)
# INT4 quantized: 2 (0.5 bytes) + shared scale factor

# Quantization formula: round to the nearest multiple of `scale`
scale = 0.01
weight = 0.0234375
quantized = round(weight / scale)   # 2
dequantized = quantized * scale     # 0.02, a small rounding error is introduced
Memory Savings
| Precision | Bits | Memory (7B model) |
|---|---|---|
| FP32 | 32 | 28 GB |
| FP16 | 16 | 14 GB |
| INT8 | 8 | 7 GB |
| INT4 | 4 | 3.5 GB |
| INT2 | 2 | 1.75 GB |
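These figures follow directly from parameter count times bits per weight. The short sketch below reproduces the table (weights only; activations and KV cache are not included):

# Approximate weight memory for a 7-billion-parameter model at different precisions
params = 7e9
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    gigabytes = params * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{name}: {gigabytes:.2f} GB")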
GPTQ: GPU Post-Training Quantization
How GPTQ Works
GPTQ uses second-order information (Hessian) to minimize quantization error:
# GPTQ quantizes weights to minimize output error
# Key insight: Some weights matter more than others
# Pseudocode (not runnable): column-by-column quantization
for col in weight_matrix.columns():
    # Compute the optimal quantization for this column considering:
    #   1. Weight magnitude
    #   2. Input activation patterns (from calibration data)
    #   3. Error from already-quantized columns
    quantized_col = optimal_quantization(col, hessian_info)
    # Compensate for the error in the remaining, not-yet-quantized columns
    remaining_cols -= error_compensation
GPTQ Implementation
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Quantization config
gptq_config = GPTQConfig(
    bits=4,               # Quantization bits
    group_size=128,       # Weights per scale factor
    dataset="c4",         # Calibration dataset
    desc_act=True,        # Quantize columns in order of decreasing activation size
    sym=False,            # Asymmetric quantization
    tokenizer=tokenizer,  # Needed to tokenize the calibration dataset
)

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=gptq_config,
    device_map="auto",
)

# Or load a pre-quantized checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto",
)
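Once loaded, a GPTQ model behaves like any other transformers model. A minimal generation check, reusing the tokenizer loaded above:

# Generate with the quantized model
inputs = tokenizer("Explain quantization in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))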
GPTQ Parameters
| Parameter | Values | Effect |
|---|---|---|
| bits | 2, 3, 4, 8 | Lower = smaller but worse quality |
| group_size | 32, 64, 128 | Smaller = better quality, larger model |
| desc_act | True/False | True = better quality, slower quantization |
| sym | True/False | Symmetric/asymmetric quantization |
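To see why a smaller group_size increases model size, note that every group stores its own scale (and, for asymmetric quantization, a zero point). A rough sketch of the metadata overhead, assuming an FP16 scale and a 4-bit zero point per group:

# Extra bits spent per weight on group-wise quantization metadata (assumed storage format)
def overhead_bits_per_weight(group_size, scale_bits=16, zero_bits=4):
    return (scale_bits + zero_bits) / group_size

for g in (32, 64, 128):
    print(f"group_size={g}: ~{overhead_bits_per_weight(g):.3f} extra bits/weight")
# group_size=32 -> ~0.625, group_size=64 -> ~0.312, group_size=128 -> ~0.156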
GPTQ Quality
Perplexity on WikiText-2 (lower is better):
| Model | FP16 | GPTQ 4-bit | GPTQ 3-bit |
|---|---|---|---|
| LLaMA-7B | 5.68 | 5.85 | 6.61 |
| LLaMA-13B | 5.09 | 5.20 | 5.62 |
| LLaMA-70B | 3.31 | 3.37 | 3.59 |
AWQ: Activation-aware Weight Quantization
How AWQ Works
AWQ observes that not all weights are equally important—some channels have much larger activations:
# AWQ key insight:
# Protecting ~1% of salient weight channels greatly reduces quantization error.

# Step 1: Identify salient input channels (high activation magnitude)
activations = model(calibration_data)
channel_importance = activations.abs().mean(dim=0)
salient_channels = channel_importance.topk(k=int(0.01 * num_channels))

# Step 2: Scale salient weight channels up before quantization,
# which reduces their relative quantization error
scale = compute_optimal_scale(weights, activations)
scaled_weights = weights * scale

# Step 3: Quantize the scaled weights
quantized = quantize(scaled_weights)

# Step 4: Compensate by folding the inverse scale into the preceding operation,
# so the layer's output is mathematically unchanged
previous_op.weights = previous_op.weights / scale
AWQ Implementation
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Quantize (uses a default calibration dataset)
model.quantize(
    tokenizer,
    quant_config={
        "w_bit": 4,
        "q_group_size": 128,
        "zero_point": True,
        "version": "GEMM",  # or "GEMV" for single-batch decoding
    },
)

# Save quantized model and tokenizer
model.save_quantized("llama-2-7b-awq")
tokenizer.save_pretrained("llama-2-7b-awq")
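For inference later, the saved checkpoint can be reloaded with AutoAWQ's from_quantized; a short sketch (fuse_layers enables the fused kernels for faster decoding):

# Reload the quantized checkpoint for inference
model = AutoAWQForCausalLM.from_quantized("llama-2-7b-awq", fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained("llama-2-7b-awq")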
AWQ vs GPTQ Quality
| Model | FP16 | GPTQ 4-bit | AWQ 4-bit |
|---|---|---|---|
| LLaMA-7B | 5.68 | 5.85 | 5.78 |
| LLaMA-13B | 5.09 | 5.20 | 5.14 |
| Vicuna-7B | 6.22 | 6.45 | 6.31 |
At the same bit width, AWQ achieves lower (better) perplexity than GPTQ across these models.
AWQ Speed
AWQ's GEMM kernel is highly optimized:
| Model | FP16 | GPTQ | AWQ |
|---|---|---|---|
| LLaMA-7B | 1.0x | 1.8x | 2.1x |
| LLaMA-13B | 1.0x | 1.9x | 2.2x |
GGUF: CPU-Friendly Quantization
What is GGUF?
GGUF (GPT-Generated Unified Format) is designed for llama.cpp, supporting:
- CPU inference with SIMD optimizations
- GPU offloading of selected layers
- Multiple quantization levels in one file
GGUF Quantization Types
| Type | Bits/weight (effective) | Quality | Use Case |
|---|---|---|---|
| Q2_K | 2.5 | Poor | Extreme compression |
| Q3_K_M | 3.4 | Fair | Very small models |
| Q4_K_M | 4.6 | Good | Balanced |
| Q5_K_M | 5.7 | Very Good | Quality-focused |
| Q6_K | 6.6 | Excellent | Near-FP16 quality |
| Q8_0 | 8.0 | Best | Maximum quality |
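The bit counts above are effective bits per weight, since the K-quant formats store sub-block scales alongside the quantized values. A rough file-size estimate for a 7B model (weights only; metadata and any higher-precision tensors add a bit more):

# Rough GGUF size from effective bits per weight (7B parameters)
params = 7e9
for qtype, bpw in [("Q2_K", 2.5), ("Q4_K_M", 4.6), ("Q6_K", 6.6), ("Q8_0", 8.0)]:
    print(f"{qtype}: ~{params * bpw / 8 / 1e9:.1f} GB")
# Q2_K ~2.2 GB, Q4_K_M ~4.0 GB, Q6_K ~5.8 GB, Q8_0 ~7.0 GB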
Converting to GGUF
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Convert the HF model to GGUF at FP16 (the script was `convert.py` in older releases)
python convert_hf_to_gguf.py /path/to/model --outtype f16 --outfile model.gguf

# Quantize (build the tools first; the binary was `quantize` in older releases)
./llama-quantize model.gguf model-q4_k_m.gguf q4_k_m
Using GGUF with llama.cpp
# CPU inference (the binary was `main` in older llama.cpp releases)
./llama-cli -m model-q4_k_m.gguf -p "Hello, world" -n 100

# GPU offload (50 layers to GPU)
./llama-cli -m model-q4_k_m.gguf -p "Hello" -ngl 50

# Server mode (formerly `server`)
./llama-server -m model-q4_k_m.gguf --host 0.0.0.0 --port 8080
Using GGUF with Python
from llama_cpp import Llama

# Load GGUF model
llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_ctx=2048,        # Context length
    n_gpu_layers=35,   # Layers to offload to GPU
    n_threads=8,       # CPU threads
)

# Generate
output = llm(
    "Hello, world!",
    max_tokens=100,
    temperature=0.7,
)
print(output["choices"][0]["text"])
bitsandbytes: Training-Friendly Quantization
NF4 for QLoRA
bitsandbytes provides the NormalFloat4 (NF4) data type, designed for normally distributed weights and used as the base quantization in QLoRA fine-tuning:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in bf16
    bnb_4bit_use_double_quant=True,         # Quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)
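The NF4-quantized base model is then frozen and fine-tuned through LoRA adapters. A minimal QLoRA-style sketch using the peft library; the target_modules listed here are an assumption for Llama-style attention projections:

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the 4-bit model for training and attach LoRA adapters
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumed projection names for Llama-style models
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Only the adapters are trainable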
bitsandbytes vs GPTQ/AWQ
| Feature | bitsandbytes | GPTQ/AWQ |
|---|---|---|
| Primary use | Training (QLoRA) | Inference |
| Quantization | Dynamic | Static (calibrated) |
| Speed | Slower | Faster |
| Quality | Good | Better |
| Calibration | Not required | Required |
Choosing the Right Method
Decision Matrix
| Scenario | Recommended Method |
|---|---|
| GPU inference, quality-critical | AWQ |
| GPU inference, speed-critical | GPTQ or AWQ |
| CPU inference | GGUF Q4_K_M |
| Mixed CPU+GPU | GGUF with offloading |
| Fine-tuning | bitsandbytes NF4 |
| Extreme compression | GGUF Q2_K or GPTQ 2-bit |
| Maximum quality | GGUF Q8_0 or no quantization |
Hardware Recommendations
| Hardware | Best Method | Notes |
|---|---|---|
| NVIDIA GPU (12GB+) | AWQ/GPTQ 4-bit | Full model in VRAM |
| NVIDIA GPU (<12GB) | GPTQ 3-bit or GGUF | May need offloading |
| Apple Silicon | GGUF | MLX also supported |
| CPU only | GGUF Q4_K_M | Use all cores |
| Multi-GPU | AWQ + tensor parallel | vLLM/TGI |
Benchmarks
Speed Comparison (tokens/second, RTX 4090)
LLaMA-2 7B, batch size 1:
| Method | Prefill | Decode |
|---|---|---|
| FP16 | 2,400 | 85 |
| GPTQ 4-bit | 3,100 | 142 |
| AWQ 4-bit | 3,300 | 156 |
| GGUF Q4_K_M (GPU) | 2,800 | 128 |
Memory Usage
| Method | LLaMA-7B | LLaMA-13B | LLaMA-70B |
|---|---|---|---|
| FP16 | 14 GB | 26 GB | 140 GB |
| GPTQ 4-bit | 4.2 GB | 7.8 GB | 38 GB |
| AWQ 4-bit | 4.2 GB | 7.8 GB | 38 GB |
| GGUF Q4_K_M | 4.4 GB | 8.1 GB | 40 GB |
Quality Comparison (MT-Bench)
| Model | FP16 | GPTQ 4b | AWQ 4b | GGUF Q4 |
|---|---|---|---|---|
| LLaMA-2 7B | 6.27 | 6.12 | 6.19 | 6.15 |
| LLaMA-2 13B | 6.65 | 6.51 | 6.58 | 6.54 |
| Mistral 7B | 7.61 | 7.48 | 7.55 | 7.50 |
Serving Quantized Models
vLLM with AWQ/GPTQ
from vllm import LLM

# AWQ model
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="half",
)

# GPTQ model
llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    dtype="half",
)
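Generation then uses the standard vLLM API. A short usage sketch; tensor_parallel_size is only needed when sharding across multiple GPUs:

from vllm import SamplingParams

sampling = SamplingParams(max_tokens=100, temperature=0.7)
outputs = llm.generate(["Explain quantization in one sentence."], sampling)
print(outputs[0].outputs[0].text)

# Multi-GPU: shard the AWQ model across two GPUs
# llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq", tensor_parallel_size=2)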
TGI with Quantization
docker run --gpus all \
-p 8080:80 \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id TheBloke/Llama-2-7B-AWQ \
--quantize awq
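To verify the container is serving, query the generate endpoint (mapped to localhost:8080 above); a minimal sketch using requests:

import requests

# TGI's /generate endpoint takes a prompt plus generation parameters
resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": "Hello, world", "parameters": {"max_new_tokens": 50}},
)
print(resp.json()["generated_text"])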
Common Issues
Issue 1: Quality Degradation
# Solution: Use higher bit quantization
gptq_config = GPTQConfig(bits=8) # Instead of 4
# Or use AWQ which typically has better quality
Issue 2: Slow Inference
# Ensure you're using optimized kernels.
# For GPTQ: enable the ExLlama kernel through the quantization config
from transformers import AutoModelForCausalLM, GPTQConfig

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    quantization_config=GPTQConfig(bits=4, use_exllama=True),
    device_map="auto",
)

# For AWQ: quantize with the GEMM kernel version (see the AWQ example above)
model.quantize(tokenizer, quant_config={"version": "GEMM"})
Issue 3: Calibration Data Mismatch
# Use domain-appropriate calibration data
gptq_config = GPTQConfig(
    bits=4,
    dataset="wikitext2",  # General text
    # Or pass custom calibration data for your domain (see the sketch below)
)
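In transformers, the dataset argument also accepts a plain list of strings, so domain text can be passed directly. A sketch with made-up calibration samples (tokenizer loaded as in the GPTQ example above):

# Hypothetical domain-specific calibration samples
calibration_texts = [
    "Patient presents with acute chest pain radiating to the left arm.",
    "Administer 325 mg aspirin and obtain a 12-lead ECG.",
    # ...use a few hundred representative samples from your domain
]

gptq_config = GPTQConfig(
    bits=4,
    dataset=calibration_texts,  # Custom calibration data instead of a named dataset
    tokenizer=tokenizer,
)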
References
- Frantar, E., et al. (2023). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023.
- Lin, J., et al. (2023). "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv:2306.00978.
- Dettmers, T., et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." arXiv:2305.14314.
- Gerganov, G. (2025). "llama.cpp: LLM inference in C/C++." GitHub.