Choosing between LoRA, QLoRA, and full fine-tuning significantly impacts your training cost, time, and final model quality. This guide provides concrete comparisons to help you make the right choice.
Quick Decision Guide
| Scenario | Recommended Method |
|---|---|
| Consumer GPU (16-24GB) | QLoRA |
| Single A100/H100 | LoRA or Full (7B models) |
| Multi-GPU setup | Full fine-tuning |
| Small dataset (<5K) | LoRA (any rank) |
| Large dataset (>50K) | Full fine-tuning |
| Quick experiments | QLoRA |
| Production deployment | Full or merged LoRA |
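The table can also be read as a rough decision rule. A minimal sketch that encodes it (the suggest_method helper and its thresholds are illustrative heuristics taken from the table, not hard limits):
def suggest_method(gpu_memory_gb: int, dataset_size: int, multi_gpu: bool = False) -> str:
    # Mirror the table above: check memory first, then dataset size
    if multi_gpu and dataset_size > 50_000:
        return "full fine-tuning"
    if gpu_memory_gb <= 24:
        return "QLoRA"
    if dataset_size > 50_000:
        return "full fine-tuning or high-rank LoRA"
    return "LoRA"

print(suggest_method(gpu_memory_gb=80, dataset_size=8_000))  # -> LoRA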
How Each Method Works
Full Fine-Tuning
Updates every parameter in the model:
from transformers import AutoModelForCausalLM

# All parameters are trainable
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# ~6.7B parameters, all updated during training
Memory requirement: Full model weights + optimizer states + gradients + activations
For the AdamW optimizer:
- Model weights: 2 bytes/param (BF16)
- Optimizer states: 8 bytes/param (momentum + variance in FP32)
- Gradients: 2 bytes/param
- Total: ~12 bytes/param minimum
7B model: 7B × 12 bytes = 84GB (plus activations)
LoRA (Low-Rank Adaptation)
Freezes the base model and adds small trainable matrices:
# Original attention: Y = XW
# LoRA attention: Y = XW + X(BA)
# Where B: [d, r], A: [r, k], r << d, k
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                  # Rank
    lora_alpha=32,         # Scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
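Attaching this config to a model loaded as in the earlier snippet takes two more lines with peft; a minimal sketch (the printed counts vary with the base model and target modules):
from peft import get_peft_model

peft_model = get_peft_model(model, lora_config)   # model loaded as shown earlier
peft_model.print_trainable_parameters()
# e.g. "trainable params: ~13M || all params: ~6.7B || trainable%: ~0.2"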
Memory requirement: Frozen model + LoRA weights + optimizer states for LoRA
For LLaMA-7B with r=16 on attention layers:
- Frozen model: 2 bytes/param × 7B = 14GB
- LoRA params: ~13M × 12 bytes = ~160MB
- Total: ~15GB (plus activations)
QLoRA (Quantized LoRA)
Quantizes the base model to 4-bit and trains LoRA adapters:
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)
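Adapters are then attached on top of the quantized weights just as with plain LoRA; a minimal sketch of the usual peft preparation step (the config mirrors the earlier LoRA example):
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model = prepare_model_for_kbit_training(model)   # upcasts norm layers, enables input grads
model = get_peft_model(model, LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
))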
Memory requirement: 4-bit model + LoRA weights + optimizer states
For LLaMA-7B with QLoRA:
- Quantized model: 0.5 bytes/param × 7B = 3.5GB
- LoRA params: ~13M × 12 bytes = ~160MB
- Total: ~5GB (plus activations)
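All three back-of-envelope totals come from the same bytes-per-parameter arithmetic; a rough sketch that reproduces them (activations, quantization constants, and framework overhead are ignored, which is why the guide's totals run a little higher):
def train_memory_gb(frozen_params, trainable_params, frozen_weight_bytes=2.0):
    # Frozen weights + 12 bytes/param (BF16 weight + grad + FP32 AdamW states) for trainable params
    return (frozen_params * frozen_weight_bytes + trainable_params * 12) / 1e9

print(train_memory_gb(0, 7e9))             # full FT            ≈ 84 GB
print(train_memory_gb(7e9, 13e6))          # LoRA r=16          ≈ 14.2 GB
print(train_memory_gb(7e9, 13e6, 0.5))     # QLoRA (NF4 base)   ≈ 3.7 GB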
Memory Comparison
Training Memory by Model Size
| Model | Full FT | LoRA (r=16) | QLoRA |
|---|---|---|---|
| 7B | 84GB+ | 16GB | 6GB |
| 13B | 156GB+ | 28GB | 10GB |
| 34B | 408GB+ | 72GB | 24GB |
| 70B | 840GB+ | 150GB | 42GB |
Figures assume gradient checkpointing and BF16 training (with NF4-quantized base weights for QLoRA).
Minimum GPU Requirements
| Model | Full FT | LoRA | QLoRA |
|---|---|---|---|
| 7B | A100 80GB | RTX 4090 | RTX 3080 |
| 13B | 2× A100 | A100 40GB | RTX 4090 |
| 70B | 8× A100 | 2× A100 | A6000 48GB |
Performance Benchmarks
Training Speed (samples/second, LLaMA-7B)
| Hardware | Full FT | LoRA | QLoRA |
|---|---|---|---|
| A100 80GB | 12.4 | 18.2 | 14.1 |
| RTX 4090 | OOM | 8.6 | 7.2 |
| A6000 48GB | OOM | 10.4 | 9.8 |
LoRA is often faster than full fine-tuning because:
- Fewer parameters to update
- Smaller optimizer state
- Better memory efficiency → larger batch sizes
QLoRA is slower than LoRA because the 4-bit base weights must be dequantized on the fly during every forward and backward pass.
Model Quality After Fine-Tuning
Benchmarks on instruction-following tasks (MT-Bench scores, higher is better):
| Method | LLaMA-7B | LLaMA-13B | LLaMA-70B |
|---|---|---|---|
| Base model | 4.2 | 5.1 | 6.3 |
| Full FT | 6.8 | 7.2 | 7.9 |
| LoRA r=64 | 6.6 | 7.0 | 7.7 |
| LoRA r=16 | 6.4 | 6.8 | 7.5 |
| QLoRA r=64 | 6.5 | 6.9 | 7.6 |
Key findings:
- Full fine-tuning achieves the best results
- LoRA with higher rank approaches full FT performance
- QLoRA matches LoRA quality (4-bit quantization doesn't hurt much)
- Larger models show smaller gaps between methods
LoRA Rank Selection
The rank r controls LoRA's capacity:
Impact of Rank on Quality
| Rank | Params (7B model) | MMLU | MT-Bench |
|---|---|---|---|
| 4 | 3.3M | 44.2 | 6.1 |
| 8 | 6.6M | 45.8 | 6.3 |
| 16 | 13.1M | 46.9 | 6.4 |
| 32 | 26.2M | 47.4 | 6.5 |
| 64 | 52.4M | 47.8 | 6.6 |
| 128 | 104.9M | 48.0 | 6.7 |
| Full | 6,738M | 48.3 | 6.8 |
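These counts grow linearly with the rank: each adapted weight of shape [d, k] gains B: [d, r] plus A: [r, k], i.e. r × (d + k) extra parameters. A quick per-layer sketch for LLaMA-7B-style dimensions (4096 hidden size, 11008 MLP width; the exact totals in the table depend on which projections were wrapped):
def lora_params(r, d, k):
    # B: [d, r] plus A: [r, k] -> r * (d + k) extra parameters per adapted matrix
    return r * (d + k)

r = 16
attn = 4 * lora_params(r, 4096, 4096)     # q_proj, k_proj, v_proj, o_proj
mlp = 3 * lora_params(r, 4096, 11008)     # gate_proj, up_proj, down_proj
print(f"per layer at r={r}: attention {attn/1e6:.2f}M, attention+MLP {(attn + mlp)/1e6:.2f}M")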
Rank Recommendations
| Dataset Size | Task Complexity | Recommended r |
|---|---|---|
| <1K | Simple | 4-8 |
| <1K | Complex | 8-16 |
| 1K-10K | Simple | 8-16 |
| 1K-10K | Complex | 16-32 |
| 10K-100K | Any | 32-64 |
| >100K | Any | 64-128 or Full FT |
Target Modules
Which Layers to Adapt
# Conservative: attention only
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]
# Aggressive: attention + MLP
target_modules = [
"q_proj", "k_proj", "v_proj", "o_proj", # Attention
"gate_proj", "up_proj", "down_proj" # MLP
]
# Maximum: everything (approaches full FT memory)
target_modules = "all-linear"
Impact on Performance
| Target modules | Trainable params (% of model) | Memory | Quality (vs. q,v baseline) |
|---|---|---|---|
| q,v only | 0.04% | Low | Baseline |
| All attention | 0.12% | Low | +3-5% |
| Attention + MLP | 0.24% | Medium | +5-8% |
| All linear | 0.48% | Higher | +8-10% |
When to Use Each Method
Choose Full Fine-Tuning When:
- You have large compute budget (8+ A100s)
- Dataset is very large (>100K examples)
- Target domain is significantly different from pretraining
- You need maximum possible performance
- The model will be deployed long-term
Choose LoRA When:
- Single GPU with 24-80GB memory
- Medium datasets (1K-100K examples)
- Need to iterate quickly on experiments
- Want to maintain multiple task-specific adapters
- Base model capabilities should be largely preserved
Choose QLoRA When:
- Limited GPU memory (<24GB)
- Training models >13B parameters
- Prototyping before committing to full training
- Cost-sensitive scenarios
- Performance parity with LoRA is acceptable
Combining Methods
Multi-LoRA Deployment
Train multiple LoRA adapters for different tasks, switch at inference:
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model once
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Attach the first adapter
model = PeftModel.from_pretrained(base_model, "lora-adapter-1", adapter_name="task-1")
output_1 = model.generate(...)

# Load a second adapter and switch to it on demand
model.load_adapter("lora-adapter-2", adapter_name="task-2")
model.set_adapter("task-2")
output_2 = model.generate(...)
Merging for Deployment
For production, merge LoRA weights into base model:
# Merge and save
merged = model.merge_and_unload()
merged.save_pretrained("merged-model")
# Now deploy as a regular model with no LoRA overhead
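The merged checkpoint then loads like any other Hugging Face model directory; a minimal sketch of reloading it for inference (saving the tokenizer alongside keeps the directory self-contained):
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.save_pretrained("merged-model")          # ship the tokenizer with the weights

reloaded = AutoModelForCausalLM.from_pretrained("merged-model", torch_dtype="auto")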
Practical Tips
LoRA Alpha Scaling
The lora_alpha parameter scales LoRA outputs:
effective_scale = lora_alpha / r
Common settings:
- r=16, alpha=32 → scale = 2
- r=64, alpha=64 → scale = 1
- r=8, alpha=32 → scale = 4
Higher scale = larger LoRA contribution. Start with alpha = 2 × r.
Gradient Checkpointing
Always enable for large models:
model.gradient_checkpointing_enable()
training_args = TrainingArguments(
gradient_checkpointing=True,
...
)
Reduces memory by recomputing activations instead of storing them.
Mixed Precision
Use BF16 for modern GPUs:
training_args = TrainingArguments(
bf16=True, # Better than fp16 for training stability
...
)
References
- Hu, E., et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685
- Dettmers, T., et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." arXiv:2305.14314
- Lialin, V., et al. (2023). "Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning." arXiv:2303.15647
- Biderman, D., et al. (2024). "LoRA Learns Less and Forgets Less." arXiv:2405.09673