The Ultimate Guide to Fine-Tuning Large Language Models in 2025

Comprehensive guide to LLM fine-tuning covering full fine-tuning, LoRA, QLoRA, data preparation, hyperparameters, and evaluation. Includes code examples for LLaMA, Mistral, and other models.

Flash Attention Team · January 8, 2026 · 10 min read
Tags: fine-tuning, LLM training, LoRA, QLoRA, transformers, Hugging Face

Fine-tuning transforms general-purpose language models into specialized tools for your specific use case. Whether you're building a customer service chatbot, a code assistant, or a domain-specific expert, this guide covers everything you need to know about fine-tuning LLMs effectively in 2025.

Why Fine-Tune?

Pre-trained models like LLaMA, Mistral, and Qwen have impressive general capabilities, but fine-tuning offers:

Benefit               | Description
Task Specialization   | Dramatically better performance on specific tasks
Style Control         | Consistent output format, tone, and behavior
Knowledge Injection   | Add domain-specific information
Instruction Following | Better adherence to complex instructions
Efficiency            | Smaller fine-tuned models can outperform larger general ones

A fine-tuned 7B model often beats a general-purpose 70B model on targeted tasks while being 10x cheaper to run.

Fine-Tuning Approaches

Full Fine-Tuning

Updates all model parameters. Best for:

  • Large datasets (>100K examples)
  • Significant domain shift
  • Maximum performance when compute isn't constrained

Resource requirements (approximate):

Model Size | GPU Memory       | Training Time (10K samples)
7B         | 80GB (A100)      | 4-8 hours
13B        | 160GB (2x A100)  | 8-16 hours
70B        | 640GB (8x A100)  | 24-48 hours

Parameter-Efficient Fine-Tuning (PEFT)

Updates only a small subset of parameters. Methods include:

  • LoRA: Low-rank adaptation matrices
  • QLoRA: LoRA with quantized base model
  • Adapters: Small bottleneck layers
  • Prefix Tuning: Learnable prompt prefixes

Resource requirements with LoRA:

Model Size | GPU Memory      | Training Time (10K samples)
7B         | 16GB (RTX 4090) | 2-4 hours
13B        | 24GB (RTX 4090) | 4-8 hours
70B        | 48GB (A6000)    | 12-24 hours

Choosing Your Approach

Decision Tree:

1. Do you have >100K high-quality examples?
   Yes → Consider full fine-tuning
   No → Use PEFT (LoRA/QLoRA)

2. GPU memory available?
   <24GB → QLoRA (4-bit quantization)
   24-48GB → LoRA with BF16
   >80GB → Full fine-tuning possible

3. How different is your domain from pretraining?
   Very different → Full fine-tuning or larger LoRA rank
   Somewhat similar → Standard LoRA (r=16-64)
   Just style/format changes → Small LoRA (r=8-16)

Data Preparation

Data quality is the single most important factor in fine-tuning success.

Dataset Formats

Instruction format (recommended for most use cases):

{
  "instruction": "Summarize the following article",
  "input": "The article text here...",
  "output": "The summary here..."
}

Chat format (for conversational models):

{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help you?"}
  ]
}
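
When training chat models, the tokenizer's chat template can render this message list into the exact prompt string the model expects, which avoids hand-building special tokens. A short sketch, assuming a chat-tuned checkpoint that ships a template:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")

messages = [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi! How can I help you?"},
]

# Render the conversation with the model's own chat template
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)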

Completion format (for continued pretraining):

{
  "text": "Document content for continued pretraining..."
}
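
Whichever format you choose, a quick load-and-check pass catches schema problems before training. A minimal sketch with the datasets library (the file name train.jsonl is an assumption):

from datasets import load_dataset

# One JSON object per line, in any of the formats above
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# Verify the expected fields exist (instruction format shown here)
required = {"instruction", "output"}
missing = required - set(dataset.column_names)
if missing:
    raise ValueError(f"Missing fields: {missing}")

print(dataset[0])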

Data Quality Checklist

  • Remove duplicates: Exact and near-duplicates hurt generalization (see the dedup sketch after this list)
  • Verify correctness: Wrong examples teach wrong behavior
  • Balance distribution: Avoid over-representation of any category
  • Length diversity: Include short and long examples
  • Edge cases: Include challenging examples explicitly
  • Format consistency: Same structure across all examples
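
A minimal sketch of the first check, exact deduplication, continuing from the loading snippet above (near-duplicate detection needs fuzzier matching, e.g. MinHash):

from datasets import load_dataset

dataset = load_dataset("json", data_files="train.jsonl", split="train")  # as loaded above

# Keep only the first occurrence of each (instruction, input, output) triple
seen = set()
def first_occurrence(example):
    key = (example["instruction"], example.get("input", ""), example["output"])
    if key in seen:
        return False
    seen.add(key)
    return True

deduped = dataset.filter(first_occurrence)
print(f"{len(dataset) - len(deduped)} exact duplicates removed")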

Dataset Size Guidelines

Use Case            | Minimum | Recommended | Notes
Style transfer      | 100     | 500-1K      | Consistent examples crucial
Task-specific       | 500     | 2K-10K      | Quality over quantity
Domain knowledge    | 1K      | 10K-50K     | Diverse coverage needed
General improvement | 10K     | 50K-100K    | Broad distribution

Full Fine-Tuning Implementation

Basic Setup with Hugging Face

from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
from datasets import load_dataset
import torch

# Load model and tokenizer
model_id = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"  # Use Flash Attention!
)

# Load and prepare dataset
dataset = load_dataset("your_dataset")

def format_instruction(example):
    text = f"### Instruction:\n{example['instruction']}\n\n"
    if example.get('input'):
        text += f"### Input:\n{example['input']}\n\n"
    text += f"### Response:\n{example['output']}"
    return {"text": text}

dataset = dataset.map(format_instruction)

def tokenize(example):
    return tokenizer(
        example["text"],
        truncation=True,
        max_length=2048,
        padding="max_length"
    )

tokenized_dataset = dataset.map(tokenize, remove_columns=dataset["train"].column_names)

# Training arguments
training_args = TrainingArguments(
    output_dir="./llama-2-7b-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    weight_decay=0.01,
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    save_strategy="epoch",
    bf16=True,
    gradient_checkpointing=True,
    optim="adamw_torch_fused",
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)

trainer.train()

LoRA Fine-Tuning

LoRA (Low-Rank Adaptation) adds small trainable matrices to attention layers while keeping the base model frozen.

How LoRA Works

Instead of updating weight matrix W directly:

W_new = W + ΔW

LoRA decomposes ΔW into low-rank matrices:

W_new = W + BA

Where:

  • W: Original weights (frozen), shape [d, k]
  • B: Low-rank matrix, shape [d, r]
  • A: Low-rank matrix, shape [r, k]
  • r: Rank (typically 8-64), much smaller than d or k

This reduces trainable parameters from d×k to r×(d+k).
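
As a quick worked example, take a single 4096×4096 attention projection (the size used in LLaMA-2 7B) with r=16:

# Trainable parameters for one projection: full update vs. LoRA
d, k, r = 4096, 4096, 16
full_update = d * k             # 16,777,216 parameters
lora_update = r * (d + k)       # 131,072 parameters
print(full_update // lora_update)   # 128x fewer trainable parameters for this layer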

LoRA Implementation

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model, TaskType
import torch

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)

# Configure LoRA
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                          # Rank
    lora_alpha=32,                 # Scaling factor
    lora_dropout=0.05,             # Dropout for regularization
    target_modules=[               # Which layers to adapt
        "q_proj", "k_proj", "v_proj", "o_proj",  # Attention
        "gate_proj", "up_proj", "down_proj"       # MLP
    ],
    bias="none"
)

# Create PEFT model
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Prints roughly: trainable params: ~40M || all params: ~6.78B || trainable%: ~0.59%
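
From here, training looks the same as in the full fine-tuning example; only the learning rate changes. A minimal sketch that reuses the tokenizer and tokenized_dataset prepared earlier:

from transformers import TrainingArguments, Trainer, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./llama-2-7b-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,               # roughly 5-10x higher than full fine-tuning
    warmup_ratio=0.03,
    lr_scheduler_type="cosine",
    logging_steps=10,
    bf16=True,
)

trainer = Trainer(
    model=model,                      # the PEFT-wrapped model from above
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()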

LoRA Hyperparameters

Parameter      | Range    | Impact
r (rank)       | 4-256    | Higher = more capacity, more memory
lora_alpha     | r to 2×r | Scaling; alpha/r acts as an effective learning rate multiplier
lora_dropout   | 0-0.1    | Regularization; higher for small datasets
target_modules | varies   | More modules = more capacity

Recommended starting points:

  • Small dataset (<1K): r=8, alpha=16
  • Medium dataset (1K-10K): r=16, alpha=32
  • Large dataset (>10K): r=32-64, alpha=64

QLoRA Fine-Tuning

QLoRA combines LoRA with 4-bit quantization, enabling fine-tuning of large models on consumer GPUs.

QLoRA Implementation

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

# 4-bit quantization config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True       # Nested quantization
)

# Load quantized model
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-70b-hf",
    quantization_config=bnb_config,
    device_map="auto",
    attn_implementation="flash_attention_2"
)

# Prepare for k-bit training
model = prepare_model_for_kbit_training(model)

# Add LoRA
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj"
    ],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
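
Training then proceeds exactly as with LoRA. The QLoRA-specific choices worth sketching are gradient checkpointing and a paged optimizer, which smooths out optimizer-state memory spikes (values below are illustrative, not tuned):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./llama-2-70b-qlora",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,      # effective batch size 16 on a single GPU
    learning_rate=2e-4,                  # QLoRA tolerates higher learning rates (see below)
    bf16=True,
    gradient_checkpointing=True,         # recompute activations to save memory
    optim="paged_adamw_8bit",            # paged bitsandbytes optimizer avoids OOM spikes
    logging_steps=10,
)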

Memory Comparison

Fine-tuning LLaMA-2 70B:

Method           | GPU Memory | Trainable Params
Full fine-tuning | ~640GB     | 70B (100%)
LoRA (BF16)      | ~160GB     | 168M (0.24%)
QLoRA (4-bit)    | ~42GB      | 168M (0.24%)

QLoRA makes 70B model fine-tuning possible on a single A6000 or 2x RTX 4090.
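
The ~42GB figure is easy to sanity-check with back-of-envelope arithmetic (rough numbers only; activations, CUDA context, and double-quantization overhead shift the total):

# Rough QLoRA memory estimate for a 70B model (illustrative arithmetic only)
base_weights_gb = 70e9 * 0.5 / 1e9     # 4-bit weights: 0.5 bytes/param ≈ 35 GB
adapter_gb      = 168e6 * 2 / 1e9      # LoRA adapters in BF16 ≈ 0.3 GB
optimizer_gb    = 168e6 * 8 / 1e9      # Adam moments in FP32 (8 bytes/param) ≈ 1.3 GB
print(base_weights_gb + adapter_gb + optimizer_gb)   # ≈ 36.7 GB before activations and overhead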

Hyperparameter Tuning

Learning Rate

The most critical hyperparameter:

Method           | Starting LR | Range
Full fine-tuning | 1e-5        | 5e-6 to 5e-5
LoRA             | 1e-4        | 5e-5 to 3e-4
QLoRA            | 2e-4        | 1e-4 to 5e-4

PEFT methods tolerate higher learning rates because only a small set of freshly initialized adapter parameters is updated while the base weights stay frozen.

Batch Size and Gradient Accumulation

Effective batch size = per_device_batch_size × num_devices × gradient_accumulation_steps

Recommendations:

  • Effective batch size: 32-128 for most tasks
  • Larger batches (256+) for simpler tasks
  • Smaller batches (8-16) for complex reasoning
# Example: Achieve batch size 64 on single GPU
training_args = TrainingArguments(
    per_device_train_batch_size=4,      # Fits in memory
    gradient_accumulation_steps=16,      # 4 × 16 = 64 effective
    ...
)

Training Duration

Dataset Size | Recommended Epochs
<1K          | 5-10 epochs
1K-10K       | 3-5 epochs
10K-100K     | 1-3 epochs
>100K        | 1 epoch (or less)

Monitor validation loss—stop when it starts increasing (overfitting).
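
One way to automate that check is early stopping on the evaluation loss. A minimal sketch, assuming an eval split is passed to the Trainer (newer transformers versions spell the first argument eval_strategy):

from transformers import TrainingArguments, Trainer, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="./checkpoints",
    evaluation_strategy="steps",      # evaluate periodically during training
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,
    load_best_model_at_end=True,      # restore the best checkpoint, not the last one
    metric_for_best_model="eval_loss",
    greater_is_better=False,
)

trainer = Trainer(
    model=model,                      # model, datasets, and collator as in the earlier examples
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],  # stop after 3 evals without improvement
)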

Evaluation

Automated Metrics

from evaluate import load

# Perplexity (lower is better)
perplexity = load("perplexity")
results = perplexity.compute(predictions=generated_texts, model_id=model_id)

# BLEU/ROUGE for summarization/translation
rouge = load("rouge")
results = rouge.compute(predictions=predictions, references=references)

# Exact match for QA
exact_match = sum(p == r for p, r in zip(predictions, references)) / len(predictions)

Task-Specific Evaluation

For instruction-following models, use benchmarks like:

  • MT-Bench: Multi-turn conversation quality
  • AlpacaEval: Instruction following comparison
  • MMLU: Multi-task knowledge
  • HumanEval: Code generation

Human Evaluation

For production use, always include human evaluation:

  1. Relevance: Does the output address the query?
  2. Correctness: Is the information accurate?
  3. Helpfulness: Is it actually useful?
  4. Safety: No harmful content?
  5. Format: Follows expected structure?

Common Issues and Solutions

Overfitting

Symptoms: Training loss keeps decreasing while validation loss increases.

Solutions:

  • Reduce training epochs
  • Add dropout (lora_dropout=0.05-0.1)
  • Reduce LoRA rank
  • Add more diverse training data
  • Use early stopping

Catastrophic Forgetting

Symptoms: Model loses general capabilities after fine-tuning.

Solutions:

  • Use LoRA instead of full fine-tuning
  • Lower learning rate
  • Mix in general instruction data
  • Shorter training

Poor Generation Quality

Symptoms: Outputs are repetitive, incoherent, or off-topic.

Solutions:

  • Check data quality (garbage in = garbage out)
  • Ensure proper tokenization
  • Adjust generation parameters (temperature, top_p)
  • Increase training data diversity

Saving and Deploying

Saving LoRA Weights

# Save only LoRA weights (small file)
model.save_pretrained("./lora-weights")

# Merge LoRA into base model for deployment
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged-model")
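
To serve without merging, for example when switching between several task adapters, the saved LoRA weights can be attached to the base model at load time. A sketch using PEFT:

from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Attach the saved adapter on top of the frozen base weights
model = PeftModel.from_pretrained(base, "./lora-weights")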

Deployment Considerations

Deployment         | Recommendation
Low latency        | Merge LoRA weights, quantize
Memory constrained | Keep LoRA separate, load on demand
Multi-task         | Multiple LoRA adapters, switch dynamically
Production         | Merge, quantize (GPTQ/AWQ), serve with vLLM
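
As a concrete example of the production row, the merged model directory can be served with vLLM's offline API (a minimal sketch; GPTQ/AWQ quantization would happen before this step):

from vllm import LLM, SamplingParams

llm = LLM(model="./merged-model")   # path to the merged fine-tuned model
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=256)

prompt = "### Instruction:\nSummarize the following article\n\n### Input:\n...\n\n### Response:\n"
outputs = llm.generate([prompt], params)
print(outputs[0].outputs[0].text)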

References

  1. Hu, E., et al. (2021). "LoRA: Low-Rank Adaptation of Large Language Models." arXiv:2106.09685

  2. Dettmers, T., et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." arXiv:2305.14314

  3. LLaMA Survey. (2025). "Evolution of Meta's LLaMA Models and Parameter-Efficient Fine-Tuning of Large Language Models: A Survey." arXiv:2510.12178

  4. Wolf, T., et al. (2020). "Transformers: State-of-the-Art Natural Language Processing." EMNLP 2020.

  5. Mangrulkar, S., et al. (2022). "PEFT: State-of-the-art Parameter-Efficient Fine-Tuning methods." GitHub: https://github.com/huggingface/peft
