Deploying LLMs in production requires careful optimization to balance latency, throughput, and cost. This guide covers the major techniques for making LLM inference fast and efficient.
Understanding LLM Inference
The Autoregressive Bottleneck
LLMs generate text one token at a time:
```python
# Pseudocode for autoregressive generation
def generate(prompt, max_tokens):
    tokens = tokenize(prompt)
    for _ in range(max_tokens):
        # Full forward pass to produce ONE new token
        logits = model(tokens)
        next_token = sample(logits[-1])
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens
```
This creates two distinct phases:
| Phase | Compute Pattern | Bottleneck |
|---|---|---|
| Prefill | Process all prompt tokens | Compute-bound |
| Decode | Generate one token at a time | Memory-bound |
Key Metrics
| Metric | Definition | Target Range |
|---|---|---|
| Time to First Token (TTFT) | Latency until first token | 100-500ms |
| Inter-Token Latency (ITL) | Time between tokens | 20-50ms |
| Throughput | Tokens per second | 50-500+ tok/s |
| Tokens per Dollar | Cost efficiency | Maximize |
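TTFT and ITL fall directly out of per-token timestamps when generation is streamed. A minimal measurement sketch, assuming a hypothetical `stream_tokens(prompt)` generator that yields one decoded token at a time:

```python
import time

def measure_latency(stream_tokens, prompt):
    """Compute TTFT, mean ITL, and throughput from streamed token timestamps."""
    start = time.perf_counter()
    # One timestamp per yielded token
    timestamps = [time.perf_counter() for _ in stream_tokens(prompt)]

    ttft = timestamps[0] - start                       # time to first token
    itls = [b - a for a, b in zip(timestamps, timestamps[1:])]
    mean_itl = sum(itls) / len(itls) if itls else 0.0  # inter-token latency
    throughput = len(timestamps) / (timestamps[-1] - start)
    return ttft, mean_itl, throughput
```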
KV Cache: The Foundation
How KV Cache Works
Without caching, each new token requires recomputing attention over all previous tokens:
```python
# Without KV cache: each generation step re-runs attention over the
# whole prefix, so the cost is O(n²) per token and O(n³) in total
for i in range(seq_len):
    for j in range(i + 1):  # causal: position i attends to positions 0..i
        attention[i, j] = compute(Q[i], K[j], V[j])
```
With KV cache, we store and reuse Key and Value projections:
```python
# With KV cache: O(n) per token, O(n²) total
kv_cache = []
for i in range(seq_len):
    k_i, v_i = project(hidden[i])
    kv_cache.append((k_i, v_i))
    # Attend to all cached keys/values
    attention[i] = compute(Q[i], kv_cache)
```
KV Cache Memory
KV Cache Size = 2 × num_layers × num_heads × head_dim × seq_len × batch_size × bytes
For LLaMA-2 70B with 4K context:
- 2 × 80 layers × 64 heads × 128 dim × 4096 tokens × 2 bytes ≈ 10.7 GB per sequence (assuming all 64 heads store K/V; with the model's 8 GQA KV heads, described below, this drops to ≈ 1.3 GB)
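The same arithmetic can be packaged as a small helper to compare configurations (a sketch using the formula above; the head counts are the ones quoted in this section):

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """KV cache size: 2 (K and V) × layers × KV heads × head_dim × tokens × bytes."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# LLaMA-2 70B shapes at 4K context, FP16
print(kv_cache_bytes(80, 64, 128, 4096) / 1e9)  # ≈ 10.7 GB if all 64 heads keep K/V
print(kv_cache_bytes(80, 8, 128, 4096) / 1e9)   # ≈ 1.3 GB with GQA's 8 KV heads
```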
KV Cache Optimizations
1. Multi-Query Attention (MQA)
Share K,V across attention heads:
```python
# Standard multi-head attention: each head has its own K, V
#   K, V shape: [batch, num_heads, seq, head_dim]

# MQA: a single K, V shared by all heads
#   K, V shape: [batch, 1, seq, head_dim]

# Reduces the KV cache by a factor of num_heads (e.g., 32x)
```
2. Grouped-Query Attention (GQA)
LLaMA-2 70B, Mistral, and most newer models use GQA, a middle ground between MHA and MQA:
```python
# GQA: groups of query heads share a single K, V head
# num_kv_heads = num_heads // group_size
# LLaMA-2 70B: 64 query heads, 8 KV heads → 8x smaller KV cache
```
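The mechanics are easy to sketch in plain PyTorch: the cached K/V tensors have only `num_kv_heads` heads and are expanded to match the query heads at attention time. Shapes below are illustrative rather than any specific model's implementation, and MQA is the `num_kv_heads=1` special case:

```python
import torch
import torch.nn.functional as F

def gqa_attention(q, k, v):
    """q: [batch, num_heads, seq, head_dim]; k, v: [batch, num_kv_heads, seq, head_dim]."""
    num_heads, num_kv_heads = q.shape[1], k.shape[1]
    group_size = num_heads // num_kv_heads
    # Expand each shared KV head to its group of query heads.
    # Only the small K/V tensors ever need to be cached.
    k = k.repeat_interleave(group_size, dim=1)
    v = v.repeat_interleave(group_size, dim=1)
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

# LLaMA-2 70B-style head counts: 64 query heads, 8 KV heads
q = torch.randn(1, 64, 16, 128)
k = torch.randn(1, 8, 16, 128)
v = torch.randn(1, 8, 16, 128)
out = gqa_attention(q, k, v)  # [1, 64, 16, 128]
```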
3. Paged Attention (vLLM)
Manage KV cache like virtual memory:
```python
# Traditional: contiguous cache pre-allocated for the maximum length
cache = torch.zeros(max_seq_len, hidden_dim)  # wasteful for short sequences

# Paged: allocate fixed-size blocks on demand, like OS virtual memory
block_table = {}  # maps (sequence, logical block) → physical block

def get_block(seq_id, block_idx):
    if (seq_id, block_idx) not in block_table:
        block_table[(seq_id, block_idx)] = allocate_block()
    return block_table[(seq_id, block_idx)]
```
Quantization for Inference
Quantization Methods Comparison
| Method | Bits | Speed | Quality | Use Case |
|---|---|---|---|---|
| FP16 | 16 | 1.0x | Best | Quality-critical |
| INT8 | 8 | 1.5x | Excellent | Balanced |
| INT4 (GPTQ) | 4 | 2.0x | Good | Memory-limited |
| INT4 (AWQ) | 4 | 2.0x | Better | Production |
| GGUF Q4 | 4 | 1.8x | Good | CPU inference |
GPTQ Quantization
Post-training quantization using calibration data:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

gptq_config = GPTQConfig(
    bits=4,                # 4-bit weights
    dataset="c4",          # calibration dataset
    tokenizer=tokenizer,   # needed to tokenize the calibration data
    group_size=128,
    desc_act=True,         # activation-order quantization (better accuracy)
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=gptq_config,
    device_map="auto",
)
```
AWQ (Activation-aware Weight Quantization)
Preserves important weights based on activation patterns:
```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_path)

model = AutoAWQForCausalLM.from_pretrained(model_path)
model.quantize(
    tokenizer,
    quant_config={
        "w_bit": 4,           # 4-bit weights
        "q_group_size": 128,  # quantization group size
        "zero_point": True,   # asymmetric quantization
    },
)
```
Memory Savings from Quantization
| Model | FP16 | INT8 | INT4 |
|---|---|---|---|
| 7B | 14 GB | 7 GB | 3.5 GB |
| 13B | 26 GB | 13 GB | 6.5 GB |
| 70B | 140 GB | 70 GB | 35 GB |
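The table is essentially parameter count × bits per weight, as the sketch below shows (real deployments add overhead for embeddings, quantization scales, activations, and the KV cache):

```python
def weight_memory_gb(params_billion, bits):
    """Approximate weight memory in GB for a dense model at a given precision."""
    return params_billion * 1e9 * bits / 8 / 1e9

for size in (7, 13, 70):
    fp16, int8, int4 = (weight_memory_gb(size, b) for b in (16, 8, 4))
    print(f"{size}B: {fp16:.1f} / {int8:.1f} / {int4:.1f} GB")
```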
Batching Strategies
Static Batching
Simple but inefficient—wait for all sequences:
```python
# Static batching
def static_batch_generate(prompts, max_tokens):
    # Pad all prompts to the same length
    padded = pad_sequences(prompts)
    # Generate for a fixed number of steps
    for _ in range(max_tokens):
        outputs = model(padded)
        # Every sequence takes the same number of steps, even those
        # that hit EOS early (wasted compute)
```
Problem: Short sequences wait for long ones.
Continuous Batching
Add/remove sequences dynamically:
```python
# Continuous batching (as in vLLM, TGI)
class ContinuousBatcher:
    def __init__(self, max_batch_size):
        self.max_batch_size = max_batch_size
        self.active_sequences = []
        self.waiting_queue = []

    def step(self):
        # Generate one token for every active sequence
        outputs = model.generate_step(self.active_sequences)
        # Retire finished sequences immediately
        finished = [s for s in self.active_sequences if s.is_done()]
        self.active_sequences = [s for s in self.active_sequences if not s.is_done()]
        # Backfill free slots from the waiting queue
        while self.waiting_queue and len(self.active_sequences) < self.max_batch_size:
            self.active_sequences.append(self.waiting_queue.pop(0))
        return finished
```
Throughput Comparison
| Batching | Throughput | GPU Utilization |
|---|---|---|
| No batching | 30 tok/s | 5% |
| Static (batch=8) | 150 tok/s | 25% |
| Continuous (batch=32) | 800 tok/s | 70% |
| Continuous + PagedAttn | 1500 tok/s | 85% |
Speculative Decoding
Use a small draft model to propose tokens, then verify them with the large target model:
```python
def speculative_decode(prompt, draft_model, target_model, k=4):
    tokens = tokenize(prompt)
    while tokens[-1] != EOS:
        # Draft: generate k candidate tokens with the small model (fast)
        draft_tokens = draft_model.generate(tokens, num_tokens=k)
        # Verify: score prompt + drafts in one parallel pass of the large model
        target_logits = target_model(tokens + draft_tokens)
        # Accept the matching prefix; reject from the first mismatch
        accepted = verify_and_accept(draft_tokens, target_logits)
        tokens.extend(accepted)
    return tokens
```
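A sketch of the `verify_and_accept` step used above, in its simplest greedy form (production implementations use rejection sampling over draft and target probabilities, which preserves the target model's output distribution):

```python
def verify_and_accept(draft_tokens, target_logits):
    """Greedy verification: accept the longest prefix of drafts that matches
    the target model's argmax, then substitute the target's own token at the
    first mismatch so every step emits at least one token."""
    k = len(draft_tokens)
    preds = target_logits[-(k + 1):]  # target predictions for the k drafts + 1 bonus
    accepted = []
    for i, draft in enumerate(draft_tokens):
        target_token = int(preds[i].argmax())
        if target_token == draft:
            accepted.append(draft)
        else:
            accepted.append(target_token)  # correction from the target model
            return accepted
    # Every draft matched: keep the target's extra "bonus" token as well
    accepted.append(int(preds[k].argmax()))
    return accepted
```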
Speculative Decoding Speedup
| Draft Model | Target Model | Acceptance Rate | Speedup |
|---|---|---|---|
| 68M | 7B | 70% | 2.1x |
| 160M | 7B | 80% | 2.5x |
| 1B | 70B | 75% | 2.8x |
Key insight: the target model verifies all k draft tokens in a single forward pass, which in the memory-bound decode regime costs roughly the same wall-clock time as generating one token.
Serving Frameworks
vLLM
High-throughput serving with PagedAttention:
```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-2-7b-hf",
    tensor_parallel_size=1,        # number of GPUs to shard across
    gpu_memory_utilization=0.9,    # fraction of GPU memory for weights + KV cache
    max_num_batched_tokens=8192,   # cap on tokens processed per scheduler step
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=256,
)

outputs = llm.generate(prompts, sampling_params)
```
TensorRT-LLM
NVIDIA's optimized inference engine:
```python
# Simplified sketch of the build-then-run flow; in practice, engines are
# usually built with the trtllm-build CLI (or the high-level LLM API), and
# the exact Python interface differs between TensorRT-LLM releases.
from tensorrt_llm import Builder

builder = Builder()
engine = builder.build(
    model_dir="llama-7b",
    dtype="float16",
    max_batch_size=32,
    max_input_len=2048,
    max_output_len=512,
)

# Run inference with the compiled engine
outputs = engine.generate(
    input_ids,
    max_new_tokens=256,
    temperature=0.7,
)
```
Text Generation Inference (TGI)
Hugging Face's production server:
```bash
# Docker deployment (gated models like Llama 2 also need a Hugging Face
# access token passed into the container)
docker run --gpus all -p 8080:80 \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id meta-llama/Llama-2-7b-hf \
  --quantize bitsandbytes-nf4 \
  --max-batch-prefill-tokens 4096
```
Framework Comparison
| Feature | vLLM | TensorRT-LLM | TGI |
|---|---|---|---|
| PagedAttention | Yes | Yes | Yes |
| Continuous Batching | Yes | Yes | Yes |
| Speculative Decoding | Yes | Yes | No |
| Multi-GPU | Yes | Yes | Yes |
| Quantization | AWQ, GPTQ | FP8, INT8, INT4 | BnB, GPTQ |
| Setup Complexity | Low | High | Medium |
Flash Attention for Inference
Flash Attention speeds up both prefill and decode:
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.float16,
    device_map="auto",
)

# Flash Attention is used automatically in the forward pass
output = model.generate(input_ids, max_new_tokens=100)
```
Impact on Inference
| Sequence Length | Standard Attention | Flash Attention | Speedup |
|---|---|---|---|
| 512 | 15ms | 8ms | 1.9x |
| 2048 | 89ms | 32ms | 2.8x |
| 8192 | 1420ms | 128ms | 11x |
| 32768 | OOM | 512ms | N/A (baseline runs out of memory) |
torch.compile for Inference
PyTorch 2.x's compiler delivers speedups with a one-line change, at the cost of compilation time on the first call:
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.float16,
)

# Compile for inference
model = torch.compile(model, mode="reduce-overhead")

# The first call triggers compilation (slow);
# subsequent calls run the optimized graph
output = model.generate(input_ids, max_new_tokens=100)
```
Compile Modes
| Mode | Compilation Time | Runtime Speed | Use Case |
|---|---|---|---|
| default | Medium | 1.3x | General |
| reduce-overhead | Longer | 1.5x | Latency-critical |
| max-autotune | Very long | 1.7x | Production deploy |
Production Optimization Checklist
Memory Optimization
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 1. Use quantization (4-bit via bitsandbytes)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    ),
)

# 2. Enable Flash Attention
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="flash_attention_2",
)

# 3. Optimize the KV cache
#    - Prefer GQA models (LLaMA-2 70B, Mistral)
#    - Serve with paged attention (vLLM)
```
Latency Optimization
```python
# 1. Compile the model
model = torch.compile(model, mode="reduce-overhead")

# 2. Use CUDA graphs (for fixed shapes)
#    Enabled automatically in vLLM and TensorRT-LLM

# 3. Use speculative decoding for interactive workloads
#    (draft model proposes, target model verifies)
```
Throughput Optimization
```python
# 1. Continuous batching
#    Use vLLM or TGI instead of naive static batching

# 2. Maximize batch size
#    Profile to find the largest batch your GPU memory allows

# 3. Tensor (and, for very large models, pipeline) parallelism
tensor_parallel_size = 4  # split the model across 4 GPUs (vLLM argument)
```
Benchmarks
Single GPU Performance (A100 80GB)
LLaMA-2 7B, 2K input + 256 output tokens:
| Configuration | TTFT | ITL | Throughput |
|---|---|---|---|
| Naive PyTorch | 850ms | 45ms | 22 tok/s |
| + Flash Attention | 320ms | 28ms | 35 tok/s |
| + torch.compile | 280ms | 22ms | 45 tok/s |
| + INT4 Quantization | 180ms | 15ms | 65 tok/s |
| vLLM (batch=32) | 400ms | 8ms | 420 tok/s |
| TensorRT-LLM | 150ms | 6ms | 580 tok/s |
Multi-GPU Scaling
LLaMA-2 70B throughput (tokens/second):
| GPUs | Tensor Parallel | Pipeline Parallel | Combined |
|---|---|---|---|
| 1 | OOM | N/A | N/A |
| 2 | 85 tok/s | N/A | 85 tok/s |
| 4 | 180 tok/s | 160 tok/s | 220 tok/s |
| 8 | 320 tok/s | 280 tok/s | 420 tok/s |
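For reference, the tensor-parallel rows above correspond to sharding the model across GPUs at load time; with vLLM this is a single argument (a sketch, with illustrative model and prompt):

```python
from vllm import LLM, SamplingParams

# Shard LLaMA-2 70B across 8 GPUs with tensor parallelism
llm = LLM(model="meta-llama/Llama-2-70b-hf", tensor_parallel_size=8)
outputs = llm.generate(
    ["Summarize the benefits of paged attention."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
```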
Cost Optimization
Tokens per Dollar (approximate, cloud pricing)
| Setup | Cost/hour | Throughput | Tokens/$ |
|---|---|---|---|
| A100 40GB (vLLM) | $3.50 | 400 tok/s | 411K |
| A100 80GB (vLLM) | $5.00 | 600 tok/s | 432K |
| 4x A10G (TGI) | $5.60 | 800 tok/s | 514K |
| H100 (TensorRT) | $8.00 | 1500 tok/s | 675K |
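The last column is simply throughput × 3600 seconds divided by the hourly cost:

```python
def tokens_per_dollar(throughput_tok_s, cost_per_hour):
    """Tokens generated per dollar at a sustained throughput."""
    return throughput_tok_s * 3600 / cost_per_hour

print(f"{tokens_per_dollar(400, 3.50):,.0f}")  # ≈ 411,429 for A100 40GB + vLLM
```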
Right-sizing Recommendations
| Use Case | Model Size | Hardware | Framework |
|---|---|---|---|
| Chatbot (low latency) | 7B | A10G | vLLM |
| Batch processing | 7-13B | A100 | TensorRT-LLM |
| High quality | 70B | 8x A100 | vLLM + TP |
| Cost-sensitive | 7B INT4 | T4 | TGI |
References
- Kwon, W., et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." SOSP 2023.
- "Comparative Analysis of Large Language Model Inference Serving Systems." (2025). arXiv:2511.17593.
- "Paged Attention Meets FlexAttention: Unlocking Long-Context Efficiency." (2025). arXiv:2506.07311.
- Frantar, E., et al. (2023). "GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers." ICLR 2023.
- Lin, J., et al. (2023). "AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration." arXiv:2306.00978.
- NVIDIA. (2025). "TensorRT-LLM." GitHub.