Quantization reduces model size and speeds up inference by using lower-precision numbers. This guide compares the major LLM quantization methods—GPTQ, AWQ, and GGUF—to help you choose the right approach.
Quick Comparison
| Method | Bits | Speed | Quality | Best For |
|---|---|---|---|---|
| GPTQ | 2-8 | Fast | Good | GPU inference |
| AWQ | 4 | Fast | Better | GPU inference |
| GGUF | 2-8 | Medium | Good | CPU + GPU hybrid |
| bitsandbytes | 4, 8 | Medium | Good | Training (QLoRA) |
Understanding Quantization
What Quantization Does
Quantization maps floating-point weights to lower-precision integers:
# FP16 weight: 0.0234375 (2 bytes)
# INT4 quantized: 2 (0.5 bytes) + shared scale factor

# Quantization formula: round to the nearest multiple of `scale`
scale = 0.01
weight = 0.0234375
quantized = round(weight / scale)   # 2
dequantized = quantized * scale     # 0.02, a small rounding error is introduced
Memory Savings
| Precision | Bits | Memory (7B model) |
|---|---|---|
| FP32 | 32 | 28 GB |
| FP16 | 16 | 14 GB |
| INT8 | 8 | 7 GB |
| INT4 | 4 | 3.5 GB |
| INT2 | 2 | 1.75 GB |
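These figures follow directly from parameter count times bits per weight. The short sketch below reproduces the table (weights only; activations and KV cache are not included):

# Approximate weight memory for a 7-billion-parameter model at different precisions
params = 7e9
for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4), ("INT2", 2)]:
    gigabytes = params * bits / 8 / 1e9  # bits -> bytes -> GB
    print(f"{name}: {gigabytes:.2f} GB")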
GPTQ: GPU Post-Training Quantization
How GPTQ Works
GPTQ uses second-order information (Hessian) to minimize quantization error:
# GPTQ quantizes weights to minimize output error
# Key insight: Some weights matter more than others
# Pseudocode (not runnable): column-by-column quantization
for col in weight_matrix.columns():
    # Compute the optimal quantization for this column considering:
    #   1. Weight magnitude
    #   2. Input activation patterns (from calibration data)
    #   3. Error from already-quantized columns
    quantized_col = optimal_quantization(col, hessian_info)
    # Compensate for the error in the remaining, not-yet-quantized columns
    remaining_cols -= error_compensation
GPTQ Implementation
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Quantization config
gptq_config = GPTQConfig(
    bits=4,               # Quantization bits
    group_size=128,       # Weights per scale factor
    dataset="c4",         # Calibration dataset
    desc_act=True,        # Quantize columns in order of decreasing activation size
    sym=False,            # Asymmetric quantization
    tokenizer=tokenizer,  # Needed to tokenize the calibration dataset
)

# Load and quantize
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=gptq_config,
    device_map="auto",
)

# Or load a pre-quantized checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto",
)
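Once loaded, a GPTQ model behaves like any other transformers model. A minimal generation check, reusing the tokenizer loaded above:

# Generate with the quantized model
inputs = tokenizer("Explain quantization in one sentence.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))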
GPTQ Parameters
| Parameter | Values | Effect |
|---|---|---|
| bits | 2, 3, 4, 8 | Lower = smaller but worse quality |
| group_size | 32, 64, 128 | Smaller = better quality, larger model |
| desc_act | True/False | True = better quality, slower quantization |
| sym | True/False | Symmetric/asymmetric quantization |
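To see why a smaller group_size increases model size, note that every group stores its own scale (and, for asymmetric quantization, a zero point). A rough sketch of the metadata overhead, assuming an FP16 scale and a 4-bit zero point per group:

# Extra bits spent per weight on group-wise quantization metadata (assumed storage format)
def overhead_bits_per_weight(group_size, scale_bits=16, zero_bits=4):
    return (scale_bits + zero_bits) / group_size

for g in (32, 64, 128):
    print(f"group_size={g}: ~{overhead_bits_per_weight(g):.3f} extra bits/weight")
# group_size=32 -> ~0.625, group_size=64 -> ~0.312, group_size=128 -> ~0.156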
GPTQ Quality
Perplexity on WikiText-2 (lower is better):
| Model | FP16 | GPTQ 4-bit | GPTQ 3-bit |
|---|---|---|---|
| LLaMA-7B | 5.68 | 5.85 | 6.61 |
| LLaMA-13B | 5.09 | 5.20 | 5.62 |
| LLaMA-70B | 3.31 | 3.37 | 3.59 |
AWQ: Activation-aware Weight Quantization
How AWQ Works
AWQ observes that not all weights are equally important—some channels have much larger activations:
# AWQ key insight:
# Protecting ~1% of salient weight channels greatly reduces quantization error.

# Step 1: Identify salient input channels (high activation magnitude)
activations = model(calibration_data)
channel_importance = activations.abs().mean(dim=0)
salient_channels = channel_importance.topk(k=int(0.01 * num_channels))

# Step 2: Scale salient weight channels up before quantization,
# which reduces their relative quantization error
scale = compute_optimal_scale(weights, activations)
scaled_weights = weights * scale

# Step 3: Quantize the scaled weights
quantized = quantize(scaled_weights)

# Step 4: Compensate by folding the inverse scale into the preceding operation,
# so the layer's output is mathematically unchanged
previous_op.weights = previous_op.weights / scale
AWQ Implementation
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Quantize (uses a default calibration dataset)
model.quantize(
    tokenizer,
    quant_config={
        "w_bit": 4,
        "q_group_size": 128,
        "zero_point": True,
        "version": "GEMM",  # or "GEMV" for single-batch decoding
    },
)

# Save quantized model and tokenizer
model.save_quantized("llama-2-7b-awq")
tokenizer.save_pretrained("llama-2-7b-awq")
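For inference later, the saved checkpoint can be reloaded with AutoAWQ's from_quantized; a short sketch (fuse_layers enables the fused kernels for faster decoding):

# Reload the quantized checkpoint for inference
model = AutoAWQForCausalLM.from_quantized("llama-2-7b-awq", fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained("llama-2-7b-awq")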
AWQ vs GPTQ Quality
| Model | FP16 | GPTQ 4-bit | AWQ 4-bit |
|---|---|---|---|
| LLaMA-7B | 5.68 | 5.85 | 5.78 |
| LLaMA-13B | 5.09 | 5.20 | 5.14 |
| Vicuna-7B | 6.22 | 6.45 | 6.31 |
At the same bit width, AWQ achieves lower (better) perplexity than GPTQ across these models.
AWQ Speed
AWQ's GEMM kernel is highly optimized:
| Model | FP16 | GPTQ | AWQ |
|---|---|---|---|
| LLaMA-7B | 1.0x | 1.8x | 2.1x |
| LLaMA-13B | 1.0x | 1.9x | 2.2x |
GGUF: CPU-Friendly Quantization
What is GGUF?
GGUF (GPT-Generated Unified Format) is designed for llama.cpp, supporting:
- CPU inference with SIMD optimizations
- GPU offloading of selected layers
- Multiple quantization levels in one file
GGUF Quantization Types
| Type | Bits/weight (effective) | Quality | Use Case |
|---|---|---|---|
| Q2_K | 2.5 | Poor | Extreme compression |
| Q3_K_M | 3.4 | Fair | Very small models |
| Q4_K_M | 4.6 | Good | Balanced |
| Q5_K_M | 5.7 | Very Good | Quality-focused |
| Q6_K | 6.6 | Excellent | Near-FP16 quality |
| Q8_0 | 8.0 | Best | Maximum quality |
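The bit counts above are effective bits per weight, since the K-quant formats store sub-block scales alongside the quantized values. A rough file-size estimate for a 7B model (weights only; metadata and any higher-precision tensors add a bit more):

# Rough GGUF size from effective bits per weight (7B parameters)
params = 7e9
for qtype, bpw in [("Q2_K", 2.5), ("Q4_K_M", 4.6), ("Q6_K", 6.6), ("Q8_0", 8.0)]:
    print(f"{qtype}: ~{params * bpw / 8 / 1e9:.1f} GB")
# Q2_K ~2.2 GB, Q4_K_M ~4.0 GB, Q6_K ~5.8 GB, Q8_0 ~7.0 GB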
Converting to GGUF
# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Convert the HF model to GGUF at FP16 (the script was `convert.py` in older releases)
python convert_hf_to_gguf.py /path/to/model --outtype f16 --outfile model.gguf

# Quantize (build the tools first; the binary was `quantize` in older releases)
./llama-quantize model.gguf model-q4_k_m.gguf q4_k_m
Using GGUF with llama.cpp
# CPU inference (the binary was `main` in older llama.cpp releases)
./llama-cli -m model-q4_k_m.gguf -p "Hello, world" -n 100

# GPU offload (50 layers to GPU)
./llama-cli -m model-q4_k_m.gguf -p "Hello" -ngl 50

# Server mode (formerly `server`)
./llama-server -m model-q4_k_m.gguf --host 0.0.0.0 --port 8080
Using GGUF with Python
from llama_cpp import Llama

# Load GGUF model
llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_ctx=2048,        # Context length
    n_gpu_layers=35,   # Layers to offload to GPU
    n_threads=8,       # CPU threads
)

# Generate
output = llm(
    "Hello, world!",
    max_tokens=100,
    temperature=0.7,
)
print(output["choices"][0]["text"])
bitsandbytes: Training-Friendly Quantization
NF4 for QLoRA
bitsandbytes provides the NormalFloat4 (NF4) data type, designed for normally distributed weights and used as the base quantization in QLoRA fine-tuning:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,  # Compute in bf16
    bnb_4bit_use_double_quant=True,         # Quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)
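The NF4-quantized base model is then frozen and fine-tuned through LoRA adapters. A minimal QLoRA-style sketch using the peft library; the target_modules listed here are an assumption for Llama-style attention projections:

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the 4-bit model for training and attach LoRA adapters
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # assumed projection names for Llama-style models
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # Only the adapters are trainable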
bitsandbytes vs GPTQ/AWQ
| Feature | bitsandbytes | GPTQ/AWQ |
|---|---|---|
| Primary use | Training (QLoRA) | Inference |
| Quantization | Dynamic | Static (calibrated) |
| Speed | Slower | Faster |
| Quality | Good | Better |
| Calibration | Not required | Required |
Choosing the Right Method
Decision Matrix
| Scenario | Recommended Method |
|---|---|
| GPU inference, quality-critical | AWQ |
| GPU inference, speed-critical | GPTQ or AWQ |
| CPU inference | GGUF Q4_K_M |
| Mixed CPU+GPU | GGUF with offloading |
| Fine-tuning | bitsandbytes NF4 |
| Extreme compression | GGUF Q2_K or GPTQ 2-bit |
| Maximum quality | GGUF Q8_0 or no quantization |
Hardware Recommendations
| Hardware | Best Method | Notes |
|---|---|---|
| NVIDIA GPU (12GB+) | AWQ/GPTQ 4-bit | Full model in VRAM |
| NVIDIA GPU (<12GB) | GPTQ 3-bit or GGUF | May need offloading |
| Apple Silicon | GGUF | MLX also supported |
| CPU only | GGUF Q4_K_M | Use all cores |
| Multi-GPU | AWQ + tensor parallel | vLLM/TGI |
Benchmarks
Speed Comparison (tokens/second, RTX 4090)
LLaMA-2 7B, batch size 1:
| Method | Prefill | Decode |
|---|---|---|
| FP16 | 2,400 | 85 |
| GPTQ 4-bit | 3,100 | 142 |
| AWQ 4-bit | 3,300 | 156 |
| GGUF Q4_K_M (GPU) | 2,800 | 128 |
Memory Usage
| Method | LLaMA-7B | LLaMA-13B | LLaMA-70B |
|---|---|---|---|
| FP16 | 14 GB | 26 GB | 140 GB |
| GPTQ 4-bit | 4.2 GB | 7.8 GB | 38 GB |
| AWQ 4-bit | 4.2 GB | 7.8 GB | 38 GB |
| GGUF Q4_K_M | 4.4 GB | 8.1 GB | 40 GB |
Quality Comparison (MT-Bench)
| Model | FP16 | GPTQ 4b | AWQ 4b | GGUF Q4 |
|---|---|---|---|---|
| LLaMA-2 7B | 6.27 | 6.12 | 6.19 | 6.15 |
| LLaMA-2 13B | 6.65 | 6.51 | 6.58 | 6.54 |
| Mistral 7B | 7.61 | 7.48 | 7.55 | 7.50 |
Serving Quantized Models
vLLM with AWQ/GPTQ
from vllm import LLM

# AWQ model
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="half",
)

# GPTQ model
llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    dtype="half",
)
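Generation then uses the standard vLLM API. A short usage sketch; tensor_parallel_size is only needed when sharding across multiple GPUs:

from vllm import SamplingParams

sampling = SamplingParams(max_tokens=100, temperature=0.7)
outputs = llm.generate(["Explain quantization in one sentence."], sampling)
print(outputs[0].outputs[0].text)

# Multi-GPU: shard the AWQ model across two GPUs
# llm = LLM(model="TheBloke/Llama-2-7B-AWQ", quantization="awq", tensor_parallel_size=2)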
TGI with Quantization
docker run --gpus all \
-p 8080:80 \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id TheBloke/Llama-2-7B-AWQ \
--quantize awq
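To verify the container is serving, query the generate endpoint (mapped to localhost:8080 above); a minimal sketch using requests:

import requests

# TGI's /generate endpoint takes a prompt plus generation parameters
resp = requests.post(
    "http://localhost:8080/generate",
    json={"inputs": "Hello, world", "parameters": {"max_new_tokens": 50}},
)
print(resp.json()["generated_text"])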
Common Issues
Issue 1: Quality Degradation
# Solution: Use higher bit quantization
gptq_config = GPTQConfig(bits=8) # Instead of 4
# Or use AWQ which typically has better quality
Issue 2: Slow Inference
# Ensure you're using optimized kernels.
# For GPTQ: enable the ExLlama kernel through the quantization config
from transformers import AutoModelForCausalLM, GPTQConfig

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    quantization_config=GPTQConfig(bits=4, use_exllama=True),
    device_map="auto",
)

# For AWQ: quantize with the GEMM kernel version (see the AWQ example above)
model.quantize(tokenizer, quant_config={"version": "GEMM"})
Issue 3: Calibration Data Mismatch
# Use domain-appropriate calibration data
gptq_config = GPTQConfig(
    bits=4,
    dataset="wikitext2",  # General text
    # Or pass custom calibration data for your domain (see the sketch below)
)
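In transformers, the dataset argument also accepts a plain list of strings, so domain text can be passed directly. A sketch with made-up calibration samples (tokenizer loaded as in the GPTQ example above):

# Hypothetical domain-specific calibration samples
calibration_texts = [
    "Patient presents with acute chest pain radiating to the left arm.",
    "Administer 325 mg aspirin and obtain a 12-lead ECG.",
    # ...use a few hundred representative samples from your domain
]

gptq_config = GPTQConfig(
    bits=4,
    dataset=calibration_texts,  # Custom calibration data instead of a named dataset
    tokenizer=tokenizer,
)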
References
- Frantar, E., et al. (2023). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023.
- Lin, J., et al. (2023). "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv:2306.00978.
- Dettmers, T., et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." arXiv:2305.14314.
- Gerganov, G. (2025). "llama.cpp: LLM inference in C/C++." GitHub.