
Instruction Tuning vs RLHF vs DPO: LLM Alignment Techniques Compared

Compare the three main approaches to aligning LLMs: instruction tuning, RLHF, and DPO. Learn when to use each method with practical implementation guidance and real benchmarks.

Flash Attention Team · January 8, 2026 · 11 min read
Tags: instruction tuning, RLHF, DPO, LLM alignment, ChatGPT training, preference learning

Making a language model helpful and safe requires more than pretraining on web data. This guide compares the three dominant alignment techniques—instruction tuning, RLHF, and DPO—to help you choose the right approach for your use case.

Quick Comparison

Method | Complexity | Data Requirements | Compute Cost | Quality Ceiling
Instruction Tuning | Low | Instruction-output pairs | Low | Good
RLHF | High | Preference rankings | Very High | Excellent
DPO | Medium | Preference pairs | Medium | Excellent

The Alignment Problem

Pretrained LLMs learn to predict the next token based on internet text. This creates models that are:

  • Capable: They know how to perform many tasks
  • Unreliable: They may ignore instructions, hallucinate, or produce harmful content
  • Unshaped: They complete text rather than follow user intent

Alignment techniques shape model behavior to be helpful, harmless, and honest.

Instruction Tuning (Supervised Fine-Tuning)

How It Works

Instruction tuning trains models on demonstration data: pairs of instructions and ideal responses.

# Example instruction tuning data
{
    "instruction": "Explain photosynthesis in simple terms.",
    "response": "Photosynthesis is how plants make food from sunlight..."
}

The model learns through standard supervised learning:

from transformers import AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

trainer = SFTTrainer(
    model=model,
    train_dataset=instruction_dataset,
    args=TrainingArguments(
        output_dir="./sft-model",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
    ),
    dataset_text_field="text",
)

trainer.train()
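
The dataset_text_field="text" argument assumes each example has already been rendered into a single prompt string. A minimal formatting helper is sketched below; the "### Instruction / ### Response" template is an illustrative assumption, not a fixed standard:

# Hypothetical formatter: collapse instruction/response pairs into the single
# "text" field that SFTTrainer reads via dataset_text_field
def format_example(example):
    prompt = (
        "### Instruction:\n"
        f"{example['instruction']}\n\n"
        "### Response:\n"
        f"{example['response']}"
    )
    return {"text": prompt}

instruction_dataset = instruction_dataset.map(format_example)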

Key Datasets

Dataset | Size | Source | Use Case
FLAN Collection | 1,836 tasks | Google | General instruction following
Alpaca | 52K | Stanford | Distilled from OpenAI text-davinci-003
OpenAssistant | 161K | Open source | Conversational AI
Dolly | 15K | Databricks | Commercial use
ShareGPT | 90K | User shared | ChatGPT-style responses

Strengths

  1. Simple implementation: Standard supervised learning
  2. Low compute: Single training run
  3. Predictable: Model learns exactly what you demonstrate
  4. Fast iteration: Easy to add new examples

Limitations

  1. Quality ceiling: Limited by demonstration quality
  2. No preference learning: Can't distinguish good from better
  3. Distribution mismatch: Training distribution may differ from deployment
  4. Exposure bias: Only sees correct completions during training

When to Use

  • First alignment step: Always do instruction tuning before RLHF/DPO
  • Specific domains: When you have expert demonstrations
  • Limited resources: When you can't afford RL training
  • Deterministic tasks: When there's a single correct answer

RLHF (Reinforcement Learning from Human Feedback)

How It Works

RLHF uses human preferences to train a reward model, then optimizes the LLM against that reward using reinforcement learning.

Step 1: Collect Preferences
   Prompt → [Response A, Response B] → Human ranks A > B

Step 2: Train Reward Model
   Reward Model learns: score(A) > score(B)

Step 3: RL Fine-tuning
   LLM optimizes to maximize Reward Model scores
   (while staying close to original via KL penalty)
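
The KL penalty in step 3 comes from the standard KL-regularized objective used in InstructGPT-style RLHF (Ouyang et al., 2022):

\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big] \;-\; \beta \, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)

Here r_phi is the reward model, pi_ref is the frozen SFT model, and beta weights the penalty (roughly the role played by init_kl_coef in the PPO configuration below).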

Implementation

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead

# Load models: the policy (with a value head for PPO), a frozen reference copy,
# and the separately trained reward model
model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")
reward_model = AutoModelForSequenceClassification.from_pretrained("reward-model")
tokenizer = AutoTokenizer.from_pretrained("sft-model")

# PPO configuration
config = PPOConfig(
    model_name="rlhf-model",
    learning_rate=1e-5,
    batch_size=16,
    mini_batch_size=4,
    gradient_accumulation_steps=4,
    ppo_epochs=4,
    kl_penalty="kl",
    init_kl_coef=0.2,
)

ppo_trainer = PPOTrainer(
    config=config,
    model=model,
    ref_model=ref_model,
    tokenizer=tokenizer,
)

# Training loop (sketch: dataloader yields tokenized prompts)
for batch in dataloader:
    # Generate responses for each prompt
    response_tensors = ppo_trainer.generate(batch["input_ids"])

    # Score each decoded response with the reward model (one scalar per response);
    # in practice the prompt is usually concatenated with the response before scoring
    texts = tokenizer.batch_decode(response_tensors, skip_special_tokens=True)
    rewards = [reward_model(**tokenizer(t, return_tensors="pt")).logits.squeeze()
               for t in texts]

    # PPO update: maximize reward while the KL penalty keeps the policy near ref_model
    stats = ppo_trainer.step(batch["input_ids"], response_tensors, rewards)

The Reward Model

The reward model is trained on preference data using the Bradley-Terry model:

# Preference data format
{
    "prompt": "Write a poem about autumn",
    "chosen": "Golden leaves cascade like memories...",
    "rejected": "Leaves fall. Trees bare. Cold comes..."
}

# Loss function
def reward_loss(chosen_reward, rejected_reward):
    return -torch.log(torch.sigmoid(chosen_reward - rejected_reward)).mean()
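
In equation form, the Bradley-Terry model treats sigma(r(x, y_w) - r(x, y_l)) as the probability that the chosen response y_w is preferred over the rejected y_l, so the reward model minimizes the negative log-likelihood:

\mathcal{L}_{\mathrm{RM}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \big[ \log \sigma\big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \big]

which is exactly what reward_loss computes above.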

Training a reward model:

from transformers import AutoModelForSequenceClassification
from trl import RewardTrainer, RewardConfig

reward_model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    num_labels=1,
)

reward_config = RewardConfig(
    output_dir="./reward-model",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=1e-5,
)

reward_trainer = RewardTrainer(
    model=reward_model,
    args=reward_config,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
)

reward_trainer.train()
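
trl's RewardTrainer expects the preference pairs pre-tokenized into chosen/rejected columns. A minimal preprocessing sketch, where the column names follow trl's documented convention and the truncation settings are assumptions:

def preprocess(example):
    # Score the full prompt + response, not the response alone
    chosen = tokenizer(example["prompt"] + example["chosen"], truncation=True)
    rejected = tokenizer(example["prompt"] + example["rejected"], truncation=True)
    return {
        "input_ids_chosen": chosen["input_ids"],
        "attention_mask_chosen": chosen["attention_mask"],
        "input_ids_rejected": rejected["input_ids"],
        "attention_mask_rejected": rejected["attention_mask"],
    }

preference_dataset = preference_dataset.map(preprocess)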

Why RLHF Works

RLHF's effectiveness comes from several factors:

  1. Preference learning: Captures nuanced human judgments
  2. Exploration: RL explores responses not in training data
  3. Optimization target: Directly optimizes what humans want
  4. On-policy training: The model learns on its own generations, so the training distribution matches deployment

According to Ouyang et al. (2022), InstructGPT (1.3B parameters with RLHF) was preferred over GPT-3 (175B) in human evaluations, demonstrating RLHF's power.

Challenges

  1. Reward hacking: Model exploits reward model weaknesses
  2. Training instability: PPO is notoriously unstable
  3. High compute cost: Requires multiple forward passes per step
  4. Reward model quality: Garbage in, garbage out
  5. KL divergence tuning: Balancing reward vs staying close to SFT

Reward Hacking Example

# Reward model might learn superficial patterns
"Great question! Let me explain..." → High reward (politeness)
"No." → Low reward (brevity)

# Model may learn to be verbose rather than accurate

Mitigation strategies:

  • Ensemble reward models (see the sketch after this list)
  • Constitutional AI (self-critique)
  • Adversarial training data
  • Process-based rewards
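
A minimal ensembling sketch, assuming several independently trained reward models that each return a scalar score for the same response; averaging and penalizing disagreement makes it harder for the policy to exploit any single model's blind spots:

import torch

def ensemble_reward(scores, disagreement_penalty=1.0):
    # scores: 1-D tensor with one scalar reward per ensemble member.
    # The mean captures agreement; the std term penalizes responses the
    # ensemble disagrees on, which are the likeliest reward hacks.
    return scores.mean() - disagreement_penalty * scores.std()

# Example: three hypothetical reward models scored a response 1.2, 1.1, and -0.4
print(ensemble_reward(torch.tensor([1.2, 1.1, -0.4])))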

DPO (Direct Preference Optimization)

How It Works

DPO eliminates the reward model and RL training by directly optimizing preferences. The key insight: the optimal policy under RLHF has a closed-form solution.

RLHF: preferences → reward model → RL → aligned model
DPO:  preferences → aligned model (direct optimization)
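
Concretely, the KL-regularized objective has the closed-form optimum pi*(y|x) proportional to pi_ref(y|x) · exp(r(x, y)/beta). Rafailov et al. (2023) invert this to express the reward through the policy itself, turning the Bradley-Terry preference loss into one defined purely over the policy and the reference model:

\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right]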

The DPO loss function:

import torch.nn.functional as F

def dpo_loss(policy_logps_chosen, policy_logps_rejected,
             ref_logps_chosen, ref_logps_rejected, beta=0.1):
    """
    policy_logps: log probability of response under trained model
    ref_logps: log probability of response under reference model
    beta: temperature parameter (controls deviation from reference)
    """
    policy_ratio = policy_logps_chosen - policy_logps_rejected
    ref_ratio = ref_logps_chosen - ref_logps_rejected

    loss = -F.logsigmoid(beta * (policy_ratio - ref_ratio))
    return loss.mean()

Implementation

from transformers import AutoModelForCausalLM
from trl import DPOTrainer, DPOConfig

# Load SFT model as starting point
model = AutoModelForCausalLM.from_pretrained("sft-model")
ref_model = AutoModelForCausalLM.from_pretrained("sft-model")

# DPO configuration
dpo_config = DPOConfig(
    output_dir="./dpo-model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=5e-7,
    beta=0.1,  # KL penalty coefficient
    loss_type="sigmoid",  # or "hinge", "ipo"
    gradient_checkpointing=True,
)

# Prepare preference dataset
# Format: {"prompt": ..., "chosen": ..., "rejected": ...}

dpo_trainer = DPOTrainer(
    model=model,
    ref_model=ref_model,
    args=dpo_config,
    train_dataset=preference_dataset,
    tokenizer=tokenizer,
)

dpo_trainer.train()

DPO Variants

Several improvements to the original DPO have been proposed:

Variant | Key Change | When to Use
IPO | Different loss function | More stable, less reward hacking
KTO | Single response (no pairs) | When you only have ratings
ORPO | No reference model needed | Lower memory usage
SimPO | Length-normalized | Better for varying response lengths

import torch.nn.functional as F

# IPO (Identity Preference Optimization) - squared loss bounds the preference margin
def ipo_loss(policy_ratio, ref_ratio, beta=0.1):
    # policy_ratio / ref_ratio: chosen-minus-rejected log-probability differences
    return ((policy_ratio - ref_ratio) - 1 / (2 * beta)) ** 2

# KTO (Kahneman-Tversky Optimization) - works with single responses;
# simplified here (the full loss also subtracts a reference-point KL term)
def kto_loss(policy_logp, ref_logp, is_chosen, beta=0.1):
    ratio = policy_logp - ref_logp
    if is_chosen:
        return -F.logsigmoid(beta * ratio)
    else:
        return -F.logsigmoid(-beta * ratio)
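
SimPO, listed in the variants table above, drops the reference model entirely and length-normalizes the implicit reward. The sketch below follows the published formulation, with beta and the margin gamma treated as tunable assumptions, and reuses the F.logsigmoid helper imported above:

# SimPO (Simple Preference Optimization) - reference-free, length-normalized
def simpo_loss(policy_logps_chosen, policy_logps_rejected,
               len_chosen, len_rejected, beta=2.0, gamma=0.5):
    # Average log-probability per token serves as the implicit reward,
    # removing the bias toward longer responses
    reward_chosen = beta * policy_logps_chosen / len_chosen
    reward_rejected = beta * policy_logps_rejected / len_rejected
    # The chosen response must beat the rejected one by a margin gamma
    return -F.logsigmoid(reward_chosen - reward_rejected - gamma).mean()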

Strengths

  1. Simplicity: No reward model or RL infrastructure
  2. Stability: Supervised learning is more stable than RL
  3. Efficiency: 2-4x faster training than RLHF
  4. Memory: Only need policy and reference model (no value head)

Limitations

  1. No exploration: Only learns from existing preference data
  2. Static reference: Can't update reference during training
  3. Limited extrapolation: Harder to generalize beyond training preferences
  4. Beta sensitivity: Results depend on temperature parameter

DPO vs RLHF: Benchmark Comparison

Based on multiple studies, here's how they compare:

Benchmark | RLHF | DPO | Notes
MT-Bench | 7.2 | 7.0 | RLHF slightly better
AlpacaEval | 85% | 82% | Similar performance
TruthfulQA | 47% | 45% | Comparable
Training Time | 100% | 40% | DPO much faster
Stability | Variable | Consistent | DPO more reliable

Choosing the Right Approach

Decision Framework

Start with Instruction Tuning (always)
         ↓
Do you have preference data?
    No → Stick with instruction tuning
    Yes ↓

Do you need maximum quality?
    Yes → RLHF (if you have resources)
    No → DPO

Can you afford RL infrastructure?
    No → DPO
    Yes → Consider RLHF for critical applications

Practical Recommendations

Use Instruction Tuning when:

  • Building first version of aligned model
  • Limited compute budget
  • High-quality demonstration data available
  • Deterministic, factual tasks

Use RLHF when:

  • Maximum quality is critical (production chatbots)
  • You have strong ML infrastructure
  • Need exploration beyond training data
  • Can invest in reward model iteration

Use DPO when:

  • Want preference learning without RL complexity
  • Medium compute budget
  • Preference data available
  • Need stable, reproducible training

Combination Strategies

Modern alignment often combines approaches:

# Illustrative pseudocode: the train_* helpers stand in for the stages described above
# Stage 1: Instruction tuning
sft_model = train_sft(base_model, instruction_data)

# Stage 2: Preference alignment (choose one)
# Option A: DPO (simpler)
aligned_model = train_dpo(sft_model, preference_data)

# Option B: RLHF (potentially higher quality)
reward_model = train_reward_model(preference_data)
aligned_model = train_ppo(sft_model, reward_model)

# Stage 3 (optional): Constitutional AI / Self-critique
final_model = train_self_critique(aligned_model)

Advanced Topics

Constitutional AI

Anthropic's Constitutional AI adds a self-critique step:

def constitutional_critique(model, response, principles):
    """
    Generate critique based on constitutional principles
    """
    critique_prompt = f"""
    Response: {response}

    Principles: {principles}

    Critique this response according to the principles.
    Then provide an improved response.
    """
    return model.generate(critique_prompt)

Iterative DPO

Recent work shows iterating DPO can approach RLHF quality:

for iteration in range(3):
    # Generate new responses with current model
    new_responses = generate_responses(model, prompts)

    # Rank responses (human or model-based)
    new_preferences = rank_responses(new_responses)

    # DPO update
    model = train_dpo(model, new_preferences)

Online vs Offline Preference Learning

Aspect | Online (RLHF) | Offline (DPO)
Data | Generated during training | Fixed dataset
Exploration | Yes | No
Distribution shift | Handles naturally | May struggle
Compute | Higher | Lower

Practical Example: Full Pipeline

Here's a complete alignment pipeline:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, DPOTrainer, DPOConfig
from datasets import load_dataset

# Step 1: Load base model
base_model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Step 2: Instruction tuning
instruction_data = load_dataset("tatsu-lab/alpaca", split="train")

sft_trainer = SFTTrainer(
    model=base_model,
    train_dataset=instruction_data,
    dataset_text_field="text",
    max_seq_length=2048,
)
sft_trainer.train()
sft_model = sft_trainer.model

# Step 3: Preference alignment with DPO
preference_data = load_dataset("argilla/ultrafeedback-binarized-preferences", split="train")

dpo_config = DPOConfig(
    output_dir="./aligned-model",
    beta=0.1,
    learning_rate=5e-7,
    num_train_epochs=1,
)

dpo_trainer = DPOTrainer(
    model=sft_model,
    ref_model=None,  # Will create copy automatically
    args=dpo_config,
    train_dataset=preference_data,
    tokenizer=tokenizer,
)
dpo_trainer.train()

# Step 4: Save aligned model
dpo_trainer.save_model("./final-aligned-model")

References

  1. Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback." arXiv:2203.02155

  2. Rafailov, R., et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." arXiv:2305.18290

  3. "A Survey of Direct Preference Optimization." (2025). arXiv:2503.11701

  4. Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073

  5. Ethayarajh, K., et al. (2024). "KTO: Model Alignment as Prospect Theoretic Optimization." arXiv:2402.01306
