
LoRA vs QLoRA vs Full Fine-Tuning: Which Approach Should You Use?

In-depth comparison of fine-tuning methods for LLMs. Covers memory requirements, performance trade-offs, and when to use each approach with practical benchmarks.

Flash Attention Team · January 8, 2026 · 8 min read
LoRA · QLoRA · fine-tuning · PEFT · parameter efficient · LLM training

Choosing between LoRA, QLoRA, and full fine-tuning significantly impacts your training cost, time, and final model quality. This guide provides concrete comparisons to help you make the right choice.

Quick Decision Guide

| Scenario | Recommended Method |
|---|---|
| Consumer GPU (16-24GB) | QLoRA |
| Single A100/H100 | LoRA or Full (7B models) |
| Multi-GPU setup | Full fine-tuning |
| Small dataset (<5K) | LoRA (any rank) |
| Large dataset (>50K) | Full fine-tuning |
| Quick experiments | QLoRA |
| Production deployment | Full or merged LoRA |

How Each Method Works

Full Fine-Tuning

Updates every parameter in the model:

from transformers import AutoModelForCausalLM

# All parameters are trainable
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# ~6.7B parameters, all updated during training

Memory requirement: Full model weights + optimizer states + gradients + activations

For AdamW optimizer:

  • Model weights: 2 bytes/param (BF16)
  • Optimizer states: 8 bytes/param (momentum + variance in FP32)
  • Gradients: 2 bytes/param
  • Total: ~12 bytes/param minimum

7B model: 7B × 12 bytes = 84GB (plus activations)
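As a rough sanity check, the byte counts above can be turned into a quick estimate (a back-of-the-envelope sketch only; real usage also depends on batch size, sequence length, and activation memory):

def full_ft_memory_gb(n_params):
    """Rough training-memory estimate for full fine-tuning with AdamW."""
    weights = 2 * n_params     # BF16 weights
    optimizer = 8 * n_params   # FP32 momentum + variance
    gradients = 2 * n_params   # BF16 gradients
    return (weights + optimizer + gradients) / 1e9

print(full_ft_memory_gb(7e9))  # ~84 GB, before activations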

LoRA (Low-Rank Adaptation)

Freezes the base model and adds small trainable matrices:

# Original attention: Y = XW
# LoRA attention:     Y = XW + X(BA)
# where B: [d, r], A: [r, k], and r << d, k

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                    # Rank
    lora_alpha=32,           # Scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

Memory requirement: Frozen model + LoRA weights + optimizer states for LoRA

For LLaMA-7B with r=16 on attention layers:

  • Frozen model: 2 bytes/param × 7B = 14GB
  • LoRA params: ~13M × 12 bytes = ~160MB
  • Total: ~15GB (plus activations)
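To verify the trainable-parameter count for a given configuration, peft can report it directly. A minimal sketch reusing the LoraConfig from above:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # prints trainable vs. total parameter counts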

QLoRA (Quantized LoRA)

Quantizes the base model to 4-bit and trains LoRA adapters:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)

Memory requirement: 4-bit model + LoRA weights + optimizer states

For LLaMA-7B with QLoRA:

  • Quantized model: 0.5 bytes/param × 7B = 3.5GB
  • LoRA params: ~13M × 12 bytes = ~160MB
  • Total: ~5GB (plus activations)
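In practice the quantized model is then wrapped with LoRA adapters via peft. A minimal sketch continuing from the snippet above (the rank and alpha values are illustrative):

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Standard preparation for training on a k-bit quantized base model
# (freezes the base weights and enables gradient checkpointing).
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)  # only the LoRA adapters are trainable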

Memory Comparison

Training Memory by Model Size

| Model | Full FT | LoRA (r=16) | QLoRA |
|---|---|---|---|
| 7B | 84GB+ | 16GB | 6GB |
| 13B | 156GB+ | 28GB | 10GB |
| 34B | 408GB+ | 72GB | 24GB |
| 70B | 840GB+ | 150GB | 42GB |

With gradient checkpointing and BF16/NF4

Minimum GPU Requirements

| Model | Full FT | LoRA | QLoRA |
|---|---|---|---|
| 7B | A100 80GB | RTX 4090 | RTX 3080 |
| 13B | 2× A100 | A100 40GB | RTX 4090 |
| 70B | 8× A100 | 2× A100 | A6000 48GB |

Performance Benchmarks

Training Speed (samples/second, LLaMA-7B)

| Hardware | Full FT | LoRA | QLoRA |
|---|---|---|---|
| A100 80GB | 12.4 | 18.2 | 14.1 |
| RTX 4090 | OOM | 8.6 | 7.2 |
| A6000 48GB | OOM | 10.4 | 9.8 |

LoRA is often faster than full fine-tuning because:

  • Fewer parameters to update
  • Smaller optimizer state
  • Better memory efficiency → larger batch sizes

QLoRA is slower than LoRA due to quantization/dequantization overhead.

Model Quality After Fine-Tuning

Benchmarks on instruction-following tasks (MT-Bench scores, higher is better):

| Method | LLaMA-7B | LLaMA-13B | LLaMA-70B |
|---|---|---|---|
| Base model | 4.2 | 5.1 | 6.3 |
| Full FT | 6.8 | 7.2 | 7.9 |
| LoRA r=64 | 6.6 | 7.0 | 7.7 |
| LoRA r=16 | 6.4 | 6.8 | 7.5 |
| QLoRA r=64 | 6.5 | 6.9 | 7.6 |

Key findings:

  • Full fine-tuning achieves the best results
  • LoRA with higher rank approaches full FT performance
  • QLoRA matches LoRA quality (4-bit quantization doesn't hurt much)
  • Larger models show smaller gaps between methods

LoRA Rank Selection

The rank r controls LoRA's capacity:
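In the notation from the LoRA section above (W of shape [d, k]), the adapter matrices B [d, r] and A [r, k] add r × (d + k) trainable parameters per adapted layer, so capacity and memory grow linearly with the rank:

lora_params_per_layer = r * (d + k)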

Impact of Rank on Quality

| Rank | Params (7B model) | MMLU | MT-Bench |
|---|---|---|---|
| 4 | 3.3M | 44.2 | 6.1 |
| 8 | 6.6M | 45.8 | 6.3 |
| 16 | 13.1M | 46.9 | 6.4 |
| 32 | 26.2M | 47.4 | 6.5 |
| 64 | 52.4M | 47.8 | 6.6 |
| 128 | 104.9M | 48.0 | 6.7 |
| Full | 6,738M | 48.3 | 6.8 |

Rank Recommendations

| Dataset Size | Task Complexity | Recommended r |
|---|---|---|
| <1K | Simple | 4-8 |
| <1K | Complex | 8-16 |
| 1K-10K | Simple | 8-16 |
| 1K-10K | Complex | 16-32 |
| 10K-100K | Any | 32-64 |
| >100K | Any | 64-128 or Full FT |

Target Modules

Which Layers to Adapt

# Conservative: attention only
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

# Aggressive: attention + MLP
target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
    "gate_proj", "up_proj", "down_proj"       # MLP
]

# Maximum: all linear layers (most adapter parameters, highest memory)
target_modules = "all-linear"

Impact on Performance

| Target | Params | Memory | Quality |
|---|---|---|---|
| q,v only | 0.04% | Low | Baseline |
| All attention | 0.12% | Low | +3-5% |
| Attention + MLP | 0.24% | Medium | +5-8% |
| All linear | 0.48% | Higher | +8-10% |

When to Use Each Method

Choose Full Fine-Tuning When:

  • You have large compute budget (8+ A100s)
  • Dataset is very large (>100K examples)
  • Target domain is significantly different from pretraining
  • You need maximum possible performance
  • The model will be deployed long-term

Choose LoRA When:

  • Single GPU with 24-80GB memory
  • Medium datasets (1K-100K examples)
  • Need to iterate quickly on experiments
  • Want to maintain multiple task-specific adapters
  • Base model capabilities should be largely preserved

Choose QLoRA When:

  • Limited GPU memory (<24GB)
  • Training models >13B parameters
  • Prototyping before committing to full training
  • Cost-sensitive scenarios
  • Performance parity with LoRA is acceptable

Combining Methods

Multi-LoRA Deployment

Train multiple LoRA adapters for different tasks, switch at inference:

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model once
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Attach the first adapter
model = PeftModel.from_pretrained(base_model, "lora-adapter-1")
output_1 = model.generate(...)

# Load a second adapter and switch to it
model.load_adapter("lora-adapter-2", adapter_name="adapter-2")
model.set_adapter("adapter-2")
output_2 = model.generate(...)

Merging for Deployment

For production, merge LoRA weights into base model:

# Merge and save
merged = model.merge_and_unload()
merged.save_pretrained("merged-model")

# Now deploy as a regular model with no LoRA overhead

Practical Tips

LoRA Alpha Scaling

The lora_alpha parameter scales LoRA outputs:

effective_scale = lora_alpha / r

Common settings:

  • r=16, alpha=32 → scale = 2
  • r=64, alpha=64 → scale = 1
  • r=8, alpha=32 → scale = 4

Higher scale = larger LoRA contribution. Start with alpha = 2 × r.

Gradient Checkpointing

Always enable for large models:

model.gradient_checkpointing_enable()

training_args = TrainingArguments(
    gradient_checkpointing=True,
    ...
)

Reduces memory by recomputing activations instead of storing them.

Mixed Precision

Use BF16 for modern GPUs:

training_args = TrainingArguments(
    bf16=True,  # Better than fp16 for training stability
    ...
)

References

  1. Hu, E., et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685

  2. Dettmers, T., et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." arXiv:2305.14314

  3. Lialin, V., et al. (2023). "Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning." arXiv:2303.15647

  4. Biderman, S., et al. (2024). "LoRA Learns Less and Forgets Less." arXiv:2405.09673
