
LLM Quantization: GPTQ, AWQ, and GGUF Compared

Complete comparison of LLM quantization methods. Learn how GPTQ, AWQ, and GGUF work, their quality-speed trade-offs, and when to use each for production deployment.

Flash Attention Team · January 8, 2026 · 9 min read
Tags: LLM quantization, GPTQ, AWQ, GGUF, model compression, INT4, inference optimization

Quantization reduces model size and speeds up inference by using lower-precision numbers. This guide compares the major LLM quantization methods—GPTQ, AWQ, and GGUF—to help you choose the right approach.

Quick Comparison

| Method | Bits | Speed | Quality | Best For |
|---|---|---|---|---|
| GPTQ | 2-8 | Fast | Good | GPU inference |
| AWQ | 4 | Fast | Better | GPU inference |
| GGUF | 2-8 | Medium | Good | CPU + GPU hybrid |
| bitsandbytes | 4, 8 | Medium | Good | Training (QLoRA) |

Understanding Quantization

What Quantization Does

Quantization maps floating-point weights to lower-precision integers:

# FP16 weight: 0.0234375 (2 bytes)
# INT4 quantized: 2 (0.5 bytes) + a shared scale factor

# Quantization formula:
#   quantized   = round(weight / scale)
#   dequantized = quantized * scale

# Example with scale = 0.01
scale = 0.01
weight = 0.0234375
quantized = round(weight / scale)    # 2
dequantized = quantized * scale      # 0.02, a small error is introduced
print(weight, quantized, dequantized)

Memory Savings

| Precision | Bits | Memory (7B model) |
|---|---|---|
| FP32 | 32 | 28 GB |
| FP16 | 16 | 14 GB |
| INT8 | 8 | 7 GB |
| INT4 | 4 | 3.5 GB |
| INT2 | 2 | 1.75 GB |
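A quick way to sanity-check these numbers is to multiply the parameter count by the bytes per weight. The helper below is a rough sketch: it ignores the small overhead of scale factors and any layers kept in higher precision.

# Rough weight-memory estimate: parameters x bytes per weight.
# Ignores scale/zero-point metadata and embeddings kept in FP16.
def estimate_weight_memory_gb(num_params: float, bits: int) -> float:
    bytes_per_weight = bits / 8
    return num_params * bytes_per_weight / 1e9

for bits in (32, 16, 8, 4, 2):
    print(f"{bits:>2}-bit: {estimate_weight_memory_gb(7e9, bits):.2f} GB")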

GPTQ: GPU Post-Training Quantization

How GPTQ Works

GPTQ uses second-order information (Hessian) to minimize quantization error:

# GPTQ quantizes weights column by column to minimize the layer's output error.
# Key insight: some weights matter more than others.

# Pseudocode (optimal_quantization and hessian_info are conceptual placeholders):
for col in weight_matrix_columns:
    # Choose the quantized values considering:
    #   1. Weight magnitude
    #   2. Input activation statistics (from calibration data)
    #   3. Error already introduced by previously quantized columns
    quantized_col = optimal_quantization(col, hessian_info)

    # Spread the resulting error onto the not-yet-quantized columns
    remaining_cols -= error_compensation

GPTQ Implementation

from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Quantization config
gptq_config = GPTQConfig(
    bits=4,                    # Quantization bits
    group_size=128,            # Weights per scale factor
    dataset="c4",              # Calibration dataset
    desc_act=True,             # Quantize columns in order of decreasing activation
    sym=False,                 # Asymmetric quantization
    tokenizer=tokenizer,       # Needed to tokenize the calibration dataset
)

# Load and quantize (calibration runs while loading)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=gptq_config,
    device_map="auto",
)

# Or load a pre-quantized checkpoint
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    device_map="auto",
)

GPTQ Parameters

| Parameter | Values | Effect |
|---|---|---|
| bits | 2, 3, 4, 8 | Lower = smaller model, lower quality |
| group_size | 32, 64, 128 | Smaller = better quality, slightly larger model |
| desc_act | True/False | True = better quality, slower quantization |
| sym | True/False | Symmetric vs. asymmetric quantization |
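group_size determines how many weights share one scale factor, so smaller groups mean more metadata per weight. The sketch below estimates effective bits per weight under the assumption of a 16-bit scale per group plus a 4-bit zero point for asymmetric quantization; exact overhead varies by implementation.

# Effective bits per weight = weight bits + per-group metadata spread over the group.
# Assumes a 16-bit scale and a 4-bit zero point per group (implementation-dependent).
def effective_bits(bits: int, group_size: int, zero_point_bits: int = 4) -> float:
    return bits + (16 + zero_point_bits) / group_size

for g in (32, 64, 128):
    print(f"group_size={g:>3}: {effective_bits(4, g):.3f} bits/weight")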

GPTQ Quality

Perplexity on WikiText-2 (lower is better):

| Model | FP16 | GPTQ 4-bit | GPTQ 3-bit |
|---|---|---|---|
| LLaMA-7B | 5.68 | 5.85 | 6.61 |
| LLaMA-13B | 5.09 | 5.20 | 5.62 |
| LLaMA-70B | 3.31 | 3.37 | 3.59 |
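If you want to run a similar measurement on your own quantized checkpoint, a minimal (and slow) fixed-window perplexity sketch looks roughly like this; the model ID and the 2048-token window are assumptions to adapt, and GPTQ checkpoints additionally require the relevant kernels (optimum/auto-gptq) to be installed.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Llama-2-7B-GPTQ"  # any causal LM checkpoint works here
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Concatenate the WikiText-2 test split and score it in fixed-size windows
text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
input_ids = tokenizer(text, return_tensors="pt").input_ids
window = 2048
nlls = []
for start in range(0, input_ids.size(1) - window, window):
    chunk = input_ids[:, start : start + window].to(model.device)
    with torch.no_grad():
        # With labels=chunk, the model returns the mean cross-entropy over the window
        nlls.append(model(chunk, labels=chunk).loss)

print("perplexity:", torch.exp(torch.stack(nlls).mean()).item())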

AWQ: Activation-aware Weight Quantization

How AWQ Works

AWQ observes that not all weights are equally important—some channels have much larger activations:

# AWQ key insight:
# Protecting ~1% of salient weight channels preserves most of the quality.

# Pseudocode (helper names are conceptual placeholders):

# Step 1: Identify salient channels (largest average activation magnitude)
activations = model(calibration_data)
channel_importance = activations.abs().mean(dim=0)
salient_channels = channel_importance.topk(k=int(0.01 * num_channels)).indices

# Step 2: Scale salient weights up before quantization,
# which reduces their relative quantization error
scale = compute_optimal_scale(weights, activations)
scaled_weights = weights * scale

# Step 3: Quantize the scaled weights
quantized = quantize(scaled_weights)

# Step 4: Fold the inverse scale into the operation that produces the
# activations (e.g., the preceding layer or normalization), so the
# end-to-end computation is unchanged
producer_layer.weights = producer_layer.weights / scale

AWQ Implementation

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Load model and tokenizer
model = AutoAWQForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Quantize
model.quantize(
    tokenizer,
    quant_config={
        "w_bit": 4,
        "q_group_size": 128,
        "zero_point": True,
        "version": "GEMM",  # or "GEMV" for single-batch decoding
    }
)

# Save quantized model and tokenizer
model.save_quantized("llama-2-7b-awq")
tokenizer.save_pretrained("llama-2-7b-awq")
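Once saved, the checkpoint can be loaded back for inference. A minimal sketch using AutoAWQ's from_quantized (fuse_layers enables fused attention/MLP kernels and is optional):

from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

# Load the weights produced by save_quantized()
model = AutoAWQForCausalLM.from_quantized("llama-2-7b-awq", fuse_layers=True)
tokenizer = AutoTokenizer.from_pretrained("llama-2-7b-awq")

inputs = tokenizer("Hello, world!", return_tensors="pt").to("cuda")
output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))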

AWQ vs GPTQ Quality

| Model | FP16 | GPTQ 4-bit | AWQ 4-bit |
|---|---|---|---|
| LLaMA-7B | 5.68 | 5.85 | 5.78 |
| LLaMA-13B | 5.09 | 5.20 | 5.14 |
| Vicuna-7B | 6.22 | 6.45 | 6.31 |

AWQ consistently achieves better quality at the same bit width.

AWQ Speed

AWQ's GEMM kernel is highly optimized. Throughput relative to FP16 (higher is better):

| Model | FP16 | GPTQ | AWQ |
|---|---|---|---|
| LLaMA-7B | 1.0x | 1.8x | 2.1x |
| LLaMA-13B | 1.0x | 1.9x | 2.2x |

GGUF: CPU-Friendly Quantization

What is GGUF?

GGUF (GPT-Generated Unified Format) is designed for llama.cpp, supporting:

  • CPU inference with SIMD optimizations
  • GPU offloading of selected layers
  • Multiple quantization levels in one file

GGUF Quantization Types

| Type | Bits per weight (approx.) | Quality | Use Case |
|---|---|---|---|
| Q2_K | 2.5 | Poor | Extreme compression |
| Q3_K_M | 3.4 | Fair | Very small models |
| Q4_K_M | 4.6 | Good | Balanced |
| Q5_K_M | 5.7 | Very good | Quality-focused |
| Q6_K | 6.6 | Excellent | Near-FP16 quality |
| Q8_0 | 8.0 | Best | Maximum quality |

Converting to GGUF

# Clone llama.cpp
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Convert an HF model to GGUF
# (newer llama.cpp releases replace convert.py with convert_hf_to_gguf.py)
python convert.py /path/to/model --outtype f16 --outfile model.gguf

# Quantize to Q4_K_M
# (newer releases rename the binary to llama-quantize)
./quantize model.gguf model-q4_k_m.gguf q4_k_m

Using GGUF with llama.cpp

# CPU inference (newer llama.cpp releases rename ./main to llama-cli)
./main -m model-q4_k_m.gguf -p "Hello, world" -n 100

# GPU offload (put 50 layers on the GPU)
./main -m model-q4_k_m.gguf -p "Hello" -ngl 50

# Server mode (newer releases rename ./server to llama-server)
./server -m model-q4_k_m.gguf --host 0.0.0.0 --port 8080

Using GGUF with Python

from llama_cpp import Llama

# Load GGUF model
llm = Llama(
    model_path="model-q4_k_m.gguf",
    n_ctx=2048,           # Context length
    n_gpu_layers=35,      # Layers to offload to GPU
    n_threads=8,          # CPU threads
)

# Generate
output = llm(
    "Hello, world!",
    max_tokens=100,
    temperature=0.7,
)
print(output["choices"][0]["text"])

bitsandbytes: Training-Friendly Quantization

NF4 for QLoRA

bitsandbytes provides the NormalFloat4 (NF4) data type, designed for normally distributed weights and used in QLoRA to keep the frozen base model in 4-bit during fine-tuning:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat4
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,      # Quantize the quantization constants
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)
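NF4 is mainly used as the frozen base for QLoRA fine-tuning. A minimal sketch pairing the 4-bit model above with LoRA adapters via peft; the target_modules listed here are typical for LLaMA-style attention layers and may need adjusting for other architectures:

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Prepare the frozen 4-bit base model for adapter training
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LLaMA attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable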

bitsandbytes vs GPTQ/AWQ

| Feature | bitsandbytes | GPTQ/AWQ |
|---|---|---|
| Primary use | Training (QLoRA) | Inference |
| Quantization | Dynamic | Static (calibrated) |
| Speed | Slower | Faster |
| Quality | Good | Better |
| Calibration | Not required | Required |

Choosing the Right Method

Decision Matrix

| Scenario | Recommended Method |
|---|---|
| GPU inference, quality-critical | AWQ |
| GPU inference, speed-critical | GPTQ or AWQ |
| CPU inference | GGUF Q4_K_M |
| Mixed CPU+GPU | GGUF with offloading |
| Fine-tuning | bitsandbytes NF4 |
| Extreme compression | GGUF Q2_K or GPTQ 2-bit |
| Maximum quality | GGUF Q8_0 or no quantization |

Hardware Recommendations

| Hardware | Best Method | Notes |
|---|---|---|
| NVIDIA GPU (12 GB+) | AWQ/GPTQ 4-bit | Full model in VRAM |
| NVIDIA GPU (<12 GB) | GPTQ 3-bit or GGUF | May need offloading |
| Apple Silicon | GGUF | MLX also supported |
| CPU only | GGUF Q4_K_M | Use all cores |
| Multi-GPU | AWQ + tensor parallelism | vLLM/TGI |

Benchmarks

Speed Comparison (tokens/second, RTX 4090)

LLaMA-2 7B, batch size 1:

| Method | Prefill (tok/s) | Decode (tok/s) |
|---|---|---|
| FP16 | 2,400 | 85 |
| GPTQ 4-bit | 3,100 | 142 |
| AWQ 4-bit | 3,300 | 156 |
| GGUF Q4_K_M (GPU) | 2,800 | 128 |
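Numbers like these depend heavily on hardware, batch size, and kernel versions, so it is worth re-measuring on your own setup. A rough decode-throughput sketch with transformers, assuming a quantized model and tokenizer are already loaded on GPU as in the earlier examples:

import time
import torch

prompt = "Explain the difference between GPTQ and AWQ."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Warm-up run so CUDA kernels and caches are initialized
model.generate(**inputs, max_new_tokens=16)
torch.cuda.synchronize()

start = time.perf_counter()
output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = output.shape[1] - inputs["input_ids"].shape[1]
print(f"decode throughput: {new_tokens / elapsed:.1f} tokens/s")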

Memory Usage

| Method | LLaMA-7B | LLaMA-13B | LLaMA-70B |
|---|---|---|---|
| FP16 | 14 GB | 26 GB | 140 GB |
| GPTQ 4-bit | 4.2 GB | 7.8 GB | 38 GB |
| AWQ 4-bit | 4.2 GB | 7.8 GB | 38 GB |
| GGUF Q4_K_M | 4.4 GB | 8.1 GB | 40 GB |

Quality Comparison (MT-Bench)

| Model | FP16 | GPTQ 4-bit | AWQ 4-bit | GGUF Q4 |
|---|---|---|---|---|
| LLaMA-2 7B | 6.27 | 6.12 | 6.19 | 6.15 |
| LLaMA-2 13B | 6.65 | 6.51 | 6.58 | 6.54 |
| Mistral 7B | 7.61 | 7.48 | 7.55 | 7.50 |

Serving Quantized Models

vLLM with AWQ/GPTQ

from vllm import LLM

# AWQ model
llm = LLM(
    model="TheBloke/Llama-2-7B-AWQ",
    quantization="awq",
    dtype="half",
)

# GPTQ model
llm = LLM(
    model="TheBloke/Llama-2-7B-GPTQ",
    quantization="gptq",
    dtype="half",
)
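Generation then works the same as with an unquantized model; a short usage sketch with the llm object above:

from vllm import SamplingParams

sampling = SamplingParams(temperature=0.7, max_tokens=100)
outputs = llm.generate(["Explain LLM quantization in one paragraph."], sampling)
print(outputs[0].outputs[0].text)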

TGI with Quantization

docker run --gpus all \
    -p 8080:80 \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id TheBloke/Llama-2-7B-AWQ \
    --quantize awq
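Once the container is up, requests go to TGI's /generate endpoint. A minimal client sketch using requests; the host and port match the docker command above:

import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Explain quantization in one sentence.",
        "parameters": {"max_new_tokens": 100, "temperature": 0.7},
    },
)
print(resp.json()["generated_text"])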

Common Issues

Issue 1: Quality Degradation

# Solution: Use higher bit quantization
gptq_config = GPTQConfig(bits=8)  # Instead of 4

# Or use AWQ which typically has better quality

Issue 2: Slow Inference

# Ensure you are using optimized kernels

# For GPTQ: enable the ExLlama kernels via the quantization config
from transformers import AutoModelForCausalLM, GPTQConfig

model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GPTQ",
    quantization_config=GPTQConfig(bits=4, use_exllama=True),
    device_map="auto",
)

# For AWQ: quantize with the GEMM kernel version
model.quantize(tokenizer, quant_config={
    "w_bit": 4, "q_group_size": 128, "zero_point": True, "version": "GEMM",
})

Issue 3: Calibration Data Mismatch

# Use domain-appropriate calibration data
gptq_config = GPTQConfig(
    bits=4,
    dataset="wikitext2",  # General text
    # Or use custom calibration data for your domain
)
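transformers also accepts a list of strings as the calibration dataset, so domain text can be passed directly. A brief sketch, assuming the tokenizer from the earlier GPTQ example; the sample sentences are placeholders:

# Pass your own calibration samples as a list of strings
calibration_samples = [
    "Example sentence from your target domain.",
    "Another representative sample of the text the model will see.",
    # a few hundred samples is typical
]

gptq_config = GPTQConfig(
    bits=4,
    dataset=calibration_samples,
    tokenizer=tokenizer,
)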

References

  1. Frantar, E., et al. (2023). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023

  2. Lin, J., et al. (2023). "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv:2306.00978

  3. Dettmers, T., et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." arXiv:2305.14314

  4. Gerganov, G. (2025). "llama.cpp: LLM inference in C/C++." GitHub
