Making a language model helpful and safe requires more than pretraining on web data. This guide compares the three dominant alignment techniques—instruction tuning, RLHF, and DPO—to help you choose the right approach for your use case.
Quick Comparison
| Method | Complexity | Data Requirements | Compute Cost | Quality Ceiling |
|---|---|---|---|---|
| Instruction Tuning | Low | Instruction-output pairs | Low | Good |
| RLHF | High | Preference rankings | Very High | Excellent |
| DPO | Medium | Preference pairs | Medium | Excellent |
The Alignment Problem
Pretrained LLMs learn to predict the next token based on internet text. This creates models that are:
- Capable: They know how to perform many tasks
- Unreliable: They may refuse, hallucinate, or produce harmful content
- Unshaped: They complete text rather than follow user intent
Alignment techniques shape model behavior to be helpful, harmless, and honest.
Instruction Tuning (Supervised Fine-Tuning)
How It Works
Instruction tuning trains models on demonstration data: pairs of instructions and ideal responses.
# Example instruction tuning data
{
"instruction": "Explain photosynthesis in simple terms.",
"response": "Photosynthesis is how plants make food from sunlight..."
}
The model learns through standard supervised learning:
from transformers import AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
trainer = SFTTrainer(
model=model,
train_dataset=instruction_dataset,
args=TrainingArguments(
output_dir="./sft-model",
num_train_epochs=3,
per_device_train_batch_size=4,
learning_rate=2e-5,
),
dataset_text_field="text",
)
trainer.train()
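The SFTTrainer call above assumes the dataset exposes a single "text" column (the dataset_text_field). A minimal sketch of building that column from instruction/response pairs; the Alpaca-style prompt template here is an illustrative choice, not a requirement:
def format_example(example):
    # Collapse an instruction/response pair into one training string
    example["text"] = (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['response']}"
    )
    return example

instruction_dataset = instruction_dataset.map(format_example)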
Key Datasets
| Dataset | Size | Source | Use Case |
|---|---|---|---|
| FLAN Collection | 1,836 tasks | Google | General instruction following |
| Alpaca | 52K | Stanford | Self-Instruct distillation (text-davinci-003) |
| OpenAssistant | 161K | Open source | Conversational AI |
| Dolly | 15K | Databricks | Commercial use |
| ShareGPT | 90K | User shared | ChatGPT-style responses |
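As a quick way to inspect one of these datasets, Dolly can be pulled from the Hugging Face Hub (column names follow the Databricks dataset card):
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
print(dolly.column_names)   # ['instruction', 'context', 'response', 'category']
print(dolly[0]["instruction"])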
Strengths
- Simple implementation: Standard supervised learning
- Low compute: Single training run
- Predictable: Model learns exactly what you demonstrate
- Fast iteration: Easy to add new examples
Limitations
- Quality ceiling: Limited by demonstration quality
- No preference learning: Can't distinguish good from better
- Distribution mismatch: Training distribution may differ from deployment
- Exposure bias: Only sees correct completions during training
When to Use
- First alignment step: Always do instruction tuning before RLHF/DPO
- Specific domains: When you have expert demonstrations
- Limited resources: When you can't afford RL training
- Deterministic tasks: When there's a single correct answer
RLHF (Reinforcement Learning from Human Feedback)
How It Works
RLHF uses human preferences to train a reward model, then optimizes the LLM against that reward using reinforcement learning.
Step 1: Collect Preferences
Prompt → [Response A, Response B] → Human ranks A > B
Step 2: Train Reward Model
Reward Model learns: score(A) > score(B)
Step 3: RL Fine-tuning
LLM optimizes to maximize Reward Model scores
(while staying close to original via KL penalty)
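In practice, the KL constraint in step 3 is applied by shaping the reward: the reward-model score is credited at the end of the response, and every token pays a penalty proportional to how far the policy's log-probability drifts from the reference model's. A minimal sketch of that shaping (TRL's PPOTrainer does this internally, controlled by init_kl_coef):
import torch

def shaped_rewards(rm_score, policy_logprobs, ref_logprobs, kl_coef=0.2):
    # Per-token KL estimate between the current policy and the frozen reference
    kl = policy_logprobs - ref_logprobs
    rewards = -kl_coef * kl          # penalize drifting away from the reference model
    rewards[-1] += rm_score          # reward-model score credited at the final token
    return rewards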
Implementation
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
# Load models: the SFT checkpoint serves as both the trainable policy and the frozen reference
model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")
ref_model = AutoModelForCausalLM.from_pretrained("sft-model")
reward_model = AutoModelForSequenceClassification.from_pretrained("reward-model")
tokenizer = AutoTokenizer.from_pretrained("sft-model")
# PPO configuration
config = PPOConfig(
model_name="rlhf-model",
learning_rate=1e-5,
batch_size=16,
mini_batch_size=4,
gradient_accumulation_steps=4,
ppo_epochs=4,
kl_penalty="kl",
init_kl_coef=0.2,
)
ppo_trainer = PPOTrainer(
config=config,
model=model,
ref_model=ref_model,
tokenizer=tokenizer,
)
# Training loop
for batch in dataloader:
    # Generate responses with the current policy
    response_tensors = ppo_trainer.generate(batch["input_ids"])
    # Compute rewards (simplified: in practice each response is decoded, paired
    # with its prompt, and re-tokenized before the reward model scores it)
    rewards = [reward_model(r) for r in response_tensors]
    # PPO update; the KL penalty keeps the policy close to ref_model
    stats = ppo_trainer.step(batch["input_ids"], response_tensors, rewards)
The Reward Model
The reward model is trained on preference data using the Bradley-Terry model:
# Preference data format
{
"prompt": "Write a poem about autumn",
"chosen": "Golden leaves cascade like memories...",
"rejected": "Leaves fall. Trees bare. Cold comes..."
}
# Loss function: negative log-likelihood of the Bradley-Terry preference model
import torch

def reward_loss(chosen_reward, rejected_reward):
    return -torch.log(torch.sigmoid(chosen_reward - rejected_reward)).mean()
Training a reward model:
from transformers import AutoModelForSequenceClassification
from trl import RewardTrainer, RewardConfig
reward_model = AutoModelForSequenceClassification.from_pretrained(
"meta-llama/Llama-2-7b-hf",
num_labels=1,
)
reward_config = RewardConfig(
output_dir="./reward-model",
num_train_epochs=1,
per_device_train_batch_size=4,
learning_rate=1e-5,
)
reward_trainer = RewardTrainer(
model=reward_model,
args=reward_config,
train_dataset=preference_dataset,
tokenizer=tokenizer,
)
reward_trainer.train()
Why RLHF Works
RLHF's effectiveness comes from several factors:
- Preference learning: Captures nuanced human judgments
- Exploration: RL explores responses not in training data
- Optimization target: Directly optimizes what humans want
- On-policy training: The model learns on its own generations, so the training distribution matches what it produces at deployment
According to Ouyang et al. (2022), InstructGPT (1.3B parameters with RLHF) was preferred over GPT-3 (175B) in human evaluations, demonstrating RLHF's power.
Challenges
- Reward hacking: Model exploits reward model weaknesses
- Training instability: PPO is notoriously unstable
- High compute cost: Requires multiple forward passes per step
- Reward model quality: Garbage in, garbage out
- KL divergence tuning: Balancing reward vs staying close to SFT
Reward Hacking Example
# Reward model might learn superficial patterns
"Great question! Let me explain..." → High reward (politeness)
"No." → Low reward (brevity)
# Model may learn to be verbose rather than accurate
Mitigation strategies:
- Ensemble reward models (a minimal sketch follows this list)
- Constitutional AI (self-critique)
- Adversarial training data
- Process-based rewards
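A sketch of the first mitigation: score each response with several independently trained reward models and aggregate conservatively, so a response must satisfy all of them to earn a high score (the model list, tokenization, and disagreement penalty are illustrative assumptions):
import torch

def ensemble_reward(reward_models, tokenizer, prompt, response, penalty=1.0):
    inputs = tokenizer(prompt + response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        scores = torch.stack([rm(**inputs).logits[0, 0] for rm in reward_models])
    # Conservative aggregation: mean score minus a penalty for disagreement
    return (scores.mean() - penalty * scores.std()).item()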
DPO (Direct Preference Optimization)
How It Works
DPO eliminates the reward model and RL training by directly optimizing preferences. The key insight: the optimal policy under RLHF has a closed-form solution.
RLHF: preferences → reward model → RL → aligned model
DPO: preferences → aligned model (direct optimization)
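The closed form behind that shortcut (from Rafailov et al., 2023): the optimum of the KL-regularized RLHF objective is a reweighting of the reference model, which can be inverted to express the reward in terms of the policy itself:

\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left(\frac{r(x, y)}{\beta}\right)
\quad\Longleftrightarrow\quad
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

The intractable partition function Z(x) cancels when the chosen and rejected responses share a prompt, so the Bradley-Terry preference loss can be written purely in terms of policy and reference log-probabilities, which is exactly what the loss below does.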
The DPO loss function:
import torch.nn.functional as F

def dpo_loss(policy_logps_chosen, policy_logps_rejected,
             ref_logps_chosen, ref_logps_rejected, beta=0.1):
"""
policy_logps: log probability of response under trained model
ref_logps: log probability of response under reference model
beta: temperature parameter (controls deviation from reference)
"""
policy_ratio = policy_logps_chosen - policy_logps_rejected
ref_ratio = ref_logps_chosen - ref_logps_rejected
loss = -F.logsigmoid(beta * (policy_ratio - ref_ratio))
return loss.mean()
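A quick sanity check with made-up sequence log-probabilities shows the mechanics: the loss falls as the policy favors the chosen response more strongly than the reference model does (values are illustrative only):
import torch

policy_chosen, policy_rejected = torch.tensor([-12.0]), torch.tensor([-15.0])
ref_chosen, ref_rejected = torch.tensor([-13.0]), torch.tensor([-13.5])

# Policy margin (3.0) exceeds the reference margin (0.5), so the loss
# drops below log(2) ≈ 0.693, the value at zero separation
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1))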
Implementation
from transformers import AutoModelForCausalLM
from trl import DPOTrainer, DPOConfig
# Load SFT model as starting point
model = AutoModelForCausalLM.from_pretrained("sft-model")
ref_model = AutoModelForCausalLM.from_pretrained("sft-model")
# DPO configuration
dpo_config = DPOConfig(
output_dir="./dpo-model",
num_train_epochs=3,
per_device_train_batch_size=4,
learning_rate=5e-7,
beta=0.1, # KL penalty coefficient
loss_type="sigmoid", # or "hinge", "ipo"
gradient_checkpointing=True,
)
# Prepare preference dataset
# Format: {"prompt": ..., "chosen": ..., "rejected": ...}
dpo_trainer = DPOTrainer(
model=model,
ref_model=ref_model,
args=dpo_config,
train_dataset=preference_dataset,
tokenizer=tokenizer,
)
dpo_trainer.train()
DPO Variants
Several improvements to the original DPO have been proposed:
| Variant | Key Change | When to Use |
|---|---|---|
| IPO | Different loss function | More stable, less reward hacking |
| KTO | Single response (no pairs) | When you only have ratings |
| ORPO | No reference model needed | Lower memory usage |
| SimPO | Length-normalized | Better for varying response lengths |
import torch.nn.functional as F

# IPO (Identity Preference Optimization): squared loss toward a fixed margin
def ipo_loss(policy_ratio, ref_ratio, beta=0.1):
    return ((policy_ratio - ref_ratio) - 1 / (2 * beta)) ** 2

# KTO (Kahneman-Tversky Optimization) - works with single responses
# (simplified: the full KTO loss also uses a reference-point KL term and per-class weights)
def kto_loss(policy_logp, ref_logp, is_chosen, beta=0.1):
    ratio = policy_logp - ref_logp
    if is_chosen:
        return -F.logsigmoid(beta * ratio)
    else:
        return -F.logsigmoid(-beta * ratio)
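SimPO, listed in the table above, removes the reference model and instead compares length-normalized average log-probabilities against a target margin. A sketch following the published formulation (the beta and gamma defaults here are illustrative):
import torch.nn.functional as F

# SimPO: reference-free, length-normalized implicit reward with a target margin
def simpo_loss(policy_logps_chosen, policy_logps_rejected,
               len_chosen, len_rejected, beta=2.0, gamma=0.5):
    reward_chosen = beta * policy_logps_chosen / len_chosen
    reward_rejected = beta * policy_logps_rejected / len_rejected
    return -F.logsigmoid(reward_chosen - reward_rejected - gamma).mean()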
Strengths
- Simplicity: No reward model or RL infrastructure
- Stability: Supervised learning is more stable than RL
- Efficiency: 2-4x faster training than RLHF
- Memory: Only need policy and reference model (no value head)
Limitations
- No exploration: Only learns from existing preference data
- Static reference: Can't update reference during training
- Limited extrapolation: Harder to generalize beyond training preferences
- Beta sensitivity: Results depend on temperature parameter
DPO vs RLHF: Benchmark Comparison
Reported results vary with the model, data, and evaluation setup; the figures below are indicative rather than definitive:
| Benchmark | RLHF | DPO | Notes |
|---|---|---|---|
| MT-Bench | 7.2 | 7.0 | RLHF slightly better |
| AlpacaEval | 85% | 82% | Similar performance |
| TruthfulQA | 47% | 45% | Comparable |
| Training Time | 100% | 40% | DPO much faster |
| Stability | Variable | Consistent | DPO more reliable |
Choosing the Right Approach
Decision Framework
Start with Instruction Tuning (always)
↓
Do you have preference data?
No → Stick with instruction tuning
Yes ↓
Do you need maximum quality?
Yes → RLHF (if you have resources)
No → DPO
Can you afford RL infrastructure?
No → DPO
Yes → Consider RLHF for critical applications
Practical Recommendations
Use Instruction Tuning when:
- Building first version of aligned model
- Limited compute budget
- High-quality demonstration data available
- Deterministic, factual tasks
Use RLHF when:
- Maximum quality is critical (production chatbots)
- You have strong ML infrastructure
- Need exploration beyond training data
- Can invest in reward model iteration
Use DPO when:
- Want preference learning without RL complexity
- Medium compute budget
- Preference data available
- Need stable, reproducible training
Combination Strategies
Modern alignment often combines approaches:
# Stage 1: Instruction tuning
sft_model = train_sft(base_model, instruction_data)
# Stage 2: Preference alignment (choose one)
# Option A: DPO (simpler)
aligned_model = train_dpo(sft_model, preference_data)
# Option B: RLHF (potentially higher quality)
reward_model = train_reward_model(preference_data)
aligned_model = train_ppo(sft_model, reward_model)
# Stage 3 (optional): Constitutional AI / Self-critique
final_model = train_self_critique(aligned_model)
Advanced Topics
Constitutional AI
Anthropic's Constitutional AI adds a self-critique step:
def constitutional_critique(model, response, principles):
"""
Generate critique based on constitutional principles
"""
critique_prompt = f"""
Response: {response}
Principles: {principles}
Critique this response according to the principles.
Then provide an improved response.
"""
    # Schematic: with a Hugging Face model you would tokenize the prompt before calling generate()
    return model.generate(critique_prompt)
Iterative DPO
Recent work shows iterating DPO can approach RLHF quality:
for iteration in range(3):
# Generate new responses with current model
new_responses = generate_responses(model, prompts)
# Rank responses (human or model-based)
new_preferences = rank_responses(new_responses)
# DPO update
model = train_dpo(model, new_preferences)
Online vs Offline Preference Learning
| Aspect | Online (RLHF) | Offline (DPO) |
|---|---|---|
| Data | Generated during training | Fixed dataset |
| Exploration | Yes | No |
| Distribution shift | Handles naturally | May struggle |
| Compute | Higher | Lower |
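The gap between the two columns can be narrowed by regenerating preference pairs from the current policy between training rounds, as in the iterative DPO loop above. A minimal sketch of building on-policy pairs with a reward model as the judge; the sampling settings and helper names are illustrative assumptions, not a fixed API:
import torch

def rm_score(reward_model, tokenizer, prompt, response):
    # Score prompt + response with a sequence-classification reward model
    inputs = tokenizer(prompt + response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0, 0].item()

def make_online_pairs(model, tokenizer, reward_model, prompts):
    pairs = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        # Sample two candidate responses from the *current* policy
        outputs = model.generate(**inputs, do_sample=True, top_p=0.9,
                                 max_new_tokens=256, num_return_sequences=2)
        prompt_len = inputs["input_ids"].shape[1]
        a, b = [tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]
        # The higher-scoring sample becomes "chosen", the other "rejected"
        if rm_score(reward_model, tokenizer, prompt, a) >= rm_score(reward_model, tokenizer, prompt, b):
            chosen, rejected = a, b
        else:
            chosen, rejected = b, a
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs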
Practical Example: Full Pipeline
Here's a complete alignment pipeline:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, DPOTrainer, DPOConfig
from datasets import load_dataset
# Step 1: Load base model
base_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.bfloat16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# Step 2: Instruction tuning
instruction_data = load_dataset("tatsu-lab/alpaca", split="train")
sft_trainer = SFTTrainer(
model=base_model,
train_dataset=instruction_data,
dataset_text_field="text",
max_seq_length=2048,
)
sft_trainer.train()
sft_model = sft_trainer.model
# Step 3: Preference alignment with DPO
preference_data = load_dataset("argilla/ultrafeedback-binarized-preferences", split="train")
dpo_config = DPOConfig(
output_dir="./aligned-model",
beta=0.1,
learning_rate=5e-7,
num_train_epochs=1,
)
dpo_trainer = DPOTrainer(
model=sft_model,
ref_model=None, # Will create copy automatically
args=dpo_config,
train_dataset=preference_data,
tokenizer=tokenizer,
)
dpo_trainer.train()
# Step 4: Save aligned model
dpo_trainer.save_model("./final-aligned-model")
References
- Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback." arXiv:2203.02155
- Rafailov, R., et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." arXiv:2305.18290
- "A Survey of Direct Preference Optimization." (2025). arXiv:2503.11701
- Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073
- Ethayarajh, K., et al. (2024). "KTO: Model Alignment as Prospect Theoretic Optimization." arXiv:2402.01306