
LoRA vs QLoRA vs Full Fine-Tuning: Which Approach Should You Use?

In-depth comparison of fine-tuning methods for LLMs. Covers memory requirements, performance trade-offs, and when to use each approach with practical benchmarks.

Flash Attention Team · January 8, 2026 · 8 min read
LoRA · QLoRA · fine-tuning · PEFT · parameter efficient · LLM training

Choosing between LoRA, QLoRA, and full fine-tuning significantly impacts your training cost, time, and final model quality. This guide provides concrete comparisons to help you make the right choice.

Quick Decision Guide

| Scenario | Recommended Method |
|---|---|
| Consumer GPU (16-24GB) | QLoRA |
| Single A100/H100 | LoRA or Full (7B models) |
| Multi-GPU setup | Full fine-tuning |
| Small dataset (<5K) | LoRA (any rank) |
| Large dataset (>50K) | Full fine-tuning |
| Quick experiments | QLoRA |
| Production deployment | Full or merged LoRA |

How Each Method Works

Full Fine-Tuning

Updates every parameter in the model:

from transformers import AutoModelForCausalLM

# All parameters are trainable
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
# ~6.7B parameters, all updated during training

Memory requirement: Full model weights + optimizer states + gradients + activations

For AdamW optimizer:

  • Model weights: 2 bytes/param (BF16)
  • Optimizer states: 8 bytes/param (momentum + variance in FP32)
  • Gradients: 2 bytes/param
  • Total: ~12 bytes/param minimum

7B model: 7B × 12 bytes = 84GB (plus activations)
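As a rough sanity check, the byte counts above can be turned into a quick estimate (a back-of-the-envelope sketch only; real usage also depends on batch size, sequence length, and activation memory):

def full_ft_memory_gb(n_params):
    """Rough training-memory estimate for full fine-tuning with AdamW."""
    weights = 2 * n_params     # BF16 weights
    optimizer = 8 * n_params   # FP32 momentum + variance
    gradients = 2 * n_params   # BF16 gradients
    return (weights + optimizer + gradients) / 1e9

print(full_ft_memory_gb(7e9))  # ~84 GB, before activations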

LoRA (Low-Rank Adaptation)

Freezes the base model and adds small trainable matrices:

# Original attention: Y = XW
# LoRA attention:     Y = XW + X(BA)
# where B: [d, r], A: [r, k], and r << d, k

from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                    # Rank
    lora_alpha=32,           # Scaling
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)

Memory requirement: Frozen model + LoRA weights + optimizer states for LoRA

For LLaMA-7B with r=16 on attention layers:

  • Frozen model: 2 bytes/param × 7B = 14GB
  • LoRA params: ~13M × 12 bytes = ~160MB
  • Total: ~15GB (plus activations)
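To verify the trainable-parameter count for a given configuration, peft can report it directly. A minimal sketch reusing the LoraConfig from above:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # prints trainable vs. total parameter counts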

QLoRA (Quantized LoRA)

Quantizes the base model to 4-bit and trains LoRA adapters:

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
)

Memory requirement: 4-bit model + LoRA weights + optimizer states

For LLaMA-7B with QLoRA:

  • Quantized model: 0.5 bytes/param × 7B = 3.5GB
  • LoRA params: ~13M × 12 bytes = ~160MB
  • Total: ~5GB (plus activations)
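In practice the quantized model is then wrapped with LoRA adapters via peft. A minimal sketch continuing from the snippet above (the rank and alpha values are illustrative):

from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# Standard preparation for training on a k-bit quantized base model
# (freezes the base weights and enables gradient checkpointing).
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)  # only the LoRA adapters are trainable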

Memory Comparison

Training Memory by Model Size

| Model | Full FT | LoRA (r=16) | QLoRA |
|---|---|---|---|
| 7B | 84GB+ | 16GB | 6GB |
| 13B | 156GB+ | 28GB | 10GB |
| 34B | 408GB+ | 72GB | 24GB |
| 70B | 840GB+ | 150GB | 42GB |

With gradient checkpointing and BF16/NF4

Minimum GPU Requirements

| Model | Full FT | LoRA | QLoRA |
|---|---|---|---|
| 7B | A100 80GB | RTX 4090 | RTX 3080 |
| 13B | 2× A100 | A100 40GB | RTX 4090 |
| 70B | 8× A100 | 2× A100 | A6000 48GB |

Performance Benchmarks

Training Speed (samples/second, LLaMA-7B)

| Hardware | Full FT | LoRA | QLoRA |
|---|---|---|---|
| A100 80GB | 12.4 | 18.2 | 14.1 |
| RTX 4090 | OOM | 8.6 | 7.2 |
| A6000 48GB | OOM | 10.4 | 9.8 |

LoRA is often faster than full fine-tuning because:

  • Fewer parameters to update
  • Smaller optimizer state
  • Better memory efficiency → larger batch sizes

QLoRA is slower than LoRA due to quantization/dequantization overhead.

Model Quality After Fine-Tuning

Benchmarks on instruction-following tasks (MT-Bench scores, higher is better):

| Method | LLaMA-7B | LLaMA-13B | LLaMA-70B |
|---|---|---|---|
| Base model | 4.2 | 5.1 | 6.3 |
| Full FT | 6.8 | 7.2 | 7.9 |
| LoRA r=64 | 6.6 | 7.0 | 7.7 |
| LoRA r=16 | 6.4 | 6.8 | 7.5 |
| QLoRA r=64 | 6.5 | 6.9 | 7.6 |

Key findings:

  • Full fine-tuning achieves the best results
  • LoRA with higher rank approaches full FT performance
  • QLoRA matches LoRA quality (4-bit quantization doesn't hurt much)
  • Larger models show smaller gaps between methods

LoRA Rank Selection

The rank r controls LoRA's capacity:
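In the notation from the LoRA section above (W of shape [d, k]), the adapter matrices B [d, r] and A [r, k] add r × (d + k) trainable parameters per adapted layer, so capacity and memory grow linearly with the rank:

lora_params_per_layer = r * (d + k)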

Impact of Rank on Quality

| Rank | Params (7B model) | MMLU | MT-Bench |
|---|---|---|---|
| 4 | 3.3M | 44.2 | 6.1 |
| 8 | 6.6M | 45.8 | 6.3 |
| 16 | 13.1M | 46.9 | 6.4 |
| 32 | 26.2M | 47.4 | 6.5 |
| 64 | 52.4M | 47.8 | 6.6 |
| 128 | 104.9M | 48.0 | 6.7 |
| Full | 6,738M | 48.3 | 6.8 |

Rank Recommendations

| Dataset Size | Task Complexity | Recommended r |
|---|---|---|
| <1K | Simple | 4-8 |
| <1K | Complex | 8-16 |
| 1K-10K | Simple | 8-16 |
| 1K-10K | Complex | 16-32 |
| 10K-100K | Any | 32-64 |
| >100K | Any | 64-128 or Full FT |

Target Modules

Which Layers to Adapt

# Conservative: attention only
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj"]

# Aggressive: attention + MLP
target_modules = [
    "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
    "gate_proj", "up_proj", "down_proj"       # MLP
]

# Maximum: all linear layers (most adapter parameters, highest memory)
target_modules = "all-linear"

Impact on Performance

| Target | Params | Memory | Quality |
|---|---|---|---|
| q,v only | 0.04% | Low | Baseline |
| All attention | 0.12% | Low | +3-5% |
| Attention + MLP | 0.24% | Medium | +5-8% |
| All linear | 0.48% | Higher | +8-10% |

When to Use Each Method

Choose Full Fine-Tuning When:

  • You have large compute budget (8+ A100s)
  • Dataset is very large (>100K examples)
  • Target domain is significantly different from pretraining
  • You need maximum possible performance
  • The model will be deployed long-term

Choose LoRA When:

  • Single GPU with 24-80GB memory
  • Medium datasets (1K-100K examples)
  • Need to iterate quickly on experiments
  • Want to maintain multiple task-specific adapters
  • Base model capabilities should be largely preserved

Choose QLoRA When:

  • Limited GPU memory (<24GB)
  • Training models >13B parameters
  • Prototyping before committing to full training
  • Cost-sensitive scenarios
  • Performance parity with LoRA is acceptable

Combining Methods

Multi-LoRA Deployment

Train multiple LoRA adapters for different tasks, switch at inference:

from transformers import AutoModelForCausalLM
from peft import PeftModel

# Load the base model once
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

# Attach the first adapter
model = PeftModel.from_pretrained(base_model, "lora-adapter-1")
output_1 = model.generate(...)

# Load a second adapter and switch to it
model.load_adapter("lora-adapter-2", adapter_name="adapter-2")
model.set_adapter("adapter-2")
output_2 = model.generate(...)

Merging for Deployment

For production, merge LoRA weights into base model:

# Merge and save
merged = model.merge_and_unload()
merged.save_pretrained("merged-model")

# Now deploy as a regular model with no LoRA overhead

Practical Tips

LoRA Alpha Scaling

The lora_alpha parameter scales LoRA outputs:

effective_scale = lora_alpha / r

Common settings:

  • r=16, alpha=32 → scale = 2
  • r=64, alpha=64 → scale = 1
  • r=8, alpha=32 → scale = 4

Higher scale = larger LoRA contribution. Start with alpha = 2 × r.

Gradient Checkpointing

Always enable for large models:

model.gradient_checkpointing_enable()

training_args = TrainingArguments(
    gradient_checkpointing=True,
    ...
)

Reduces memory by recomputing activations instead of storing them.

Mixed Precision

Use BF16 for modern GPUs:

training_args = TrainingArguments(
    bf16=True,  # Better than fp16 for training stability
    ...
)

References

  1. Hu, E., et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685

  2. Dettmers, T., et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." arXiv:2305.14314

  3. Lialin, V., et al. (2023). "Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning." arXiv:2303.15647

  4. Biderman, S., et al. (2024). "LoRA Learns Less and Forgets Less." arXiv:2405.09673
