Making a language model helpful and safe requires more than pretraining on web data. This guide compares the three dominant alignment techniques—instruction tuning, RLHF, and DPO—to help you choose the right approach for your use case.
Quick Comparison
| Method | Complexity | Data Requirements | Compute Cost | Quality Ceiling |
|---|---|---|---|---|
| Instruction Tuning | Low | Instruction-output pairs | Low | Good |
| RLHF | High | Preference rankings | Very High | Excellent |
| DPO | Medium | Preference pairs | Medium | Excellent |
The Alignment Problem
Pretrained LLMs learn to predict the next token based on internet text. This creates models that are:
- Capable: They know how to perform many tasks
- Unreliable: They may refuse, hallucinate, or produce harmful content
- Unshaped: They complete text rather than follow user intent
Alignment techniques shape model behavior to be helpful, harmless, and honest.
Instruction Tuning (Supervised Fine-Tuning)
How It Works
Instruction tuning trains models on demonstration data: pairs of instructions and ideal responses.
# Example instruction tuning data
{
"instruction": "Explain photosynthesis in simple terms.",
"response": "Photosynthesis is how plants make food from sunlight..."
}
The model learns through standard supervised learning:
from transformers import AutoModelForCausalLM, TrainingArguments
from trl import SFTTrainer
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
trainer = SFTTrainer(
model=model,
train_dataset=instruction_dataset,
args=TrainingArguments(
output_dir="./sft-model",
num_train_epochs=3,
per_device_train_batch_size=4,
learning_rate=2e-5,
),
dataset_text_field="text",
)
trainer.train()
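The SFTTrainer call above assumes the dataset exposes a single "text" column (the dataset_text_field). A minimal sketch of building that column from instruction/response pairs; the Alpaca-style prompt template here is an illustrative choice, not a requirement:
def format_example(example):
    # Collapse an instruction/response pair into one training string
    example["text"] = (
        f"### Instruction:\n{example['instruction']}\n\n"
        f"### Response:\n{example['response']}"
    )
    return example

instruction_dataset = instruction_dataset.map(format_example)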
Key Datasets
| Dataset | Size | Source | Use Case |
|---|---|---|---|
| FLAN Collection | 1,836 tasks | Google | General instruction following |
| Alpaca | 52K | Stanford | Self-Instruct distillation (text-davinci-003) |
| OpenAssistant | 161K | Open source | Conversational AI |
| Dolly | 15K | Databricks | Commercial use |
| ShareGPT | 90K | User shared | ChatGPT-style responses |
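As a quick way to inspect one of these datasets, Dolly can be pulled from the Hugging Face Hub (column names follow the Databricks dataset card):
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
print(dolly.column_names)   # ['instruction', 'context', 'response', 'category']
print(dolly[0]["instruction"])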
Strengths
- Simple implementation: Standard supervised learning
- Low compute: Single training run
- Predictable: Model learns exactly what you demonstrate
- Fast iteration: Easy to add new examples
Limitations
- Quality ceiling: Limited by demonstration quality
- No preference learning: Can't distinguish good from better
- Distribution mismatch: Training distribution may differ from deployment
- Exposure bias: Only sees correct completions during training
When to Use
- First alignment step: Always do instruction tuning before RLHF/DPO
- Specific domains: When you have expert demonstrations
- Limited resources: When you can't afford RL training
- Deterministic tasks: When there's a single correct answer
RLHF (Reinforcement Learning from Human Feedback)
How It Works
RLHF uses human preferences to train a reward model, then optimizes the LLM against that reward using reinforcement learning.
Step 1: Collect Preferences
Prompt → [Response A, Response B] → Human ranks A > B
Step 2: Train Reward Model
Reward Model learns: score(A) > score(B)
Step 3: RL Fine-tuning
LLM optimizes to maximize Reward Model scores
(while staying close to original via KL penalty)
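In practice, the KL constraint in step 3 is applied by shaping the reward: the reward-model score is credited at the end of the response, and every token pays a penalty proportional to how far the policy's log-probability drifts from the reference model's. A minimal sketch of that shaping (TRL's PPOTrainer does this internally, controlled by init_kl_coef):
import torch

def shaped_rewards(rm_score, policy_logprobs, ref_logprobs, kl_coef=0.2):
    # Per-token KL estimate between the current policy and the frozen reference
    kl = policy_logprobs - ref_logprobs
    rewards = -kl_coef * kl          # penalize drifting away from the reference model
    rewards[-1] += rm_score          # reward-model score credited at the final token
    return rewards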
Implementation
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
# Load models: the SFT checkpoint serves as both the trainable policy and the frozen reference
model = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")
ref_model = AutoModelForCausalLM.from_pretrained("sft-model")
reward_model = AutoModelForSequenceClassification.from_pretrained("reward-model")
tokenizer = AutoTokenizer.from_pretrained("sft-model")
# PPO configuration
config = PPOConfig(
model_name="rlhf-model",
learning_rate=1e-5,
batch_size=16,
mini_batch_size=4,
gradient_accumulation_steps=4,
ppo_epochs=4,
kl_penalty="kl",
init_kl_coef=0.2,
)
ppo_trainer = PPOTrainer(
config=config,
model=model,
ref_model=ref_model,
tokenizer=tokenizer,
)
# Training loop
for batch in dataloader:
    # Generate responses with the current policy
    response_tensors = ppo_trainer.generate(batch["input_ids"])
    # Compute rewards (simplified: in practice each response is decoded, paired
    # with its prompt, and re-tokenized before the reward model scores it)
    rewards = [reward_model(r) for r in response_tensors]
    # PPO update; the KL penalty keeps the policy close to ref_model
    stats = ppo_trainer.step(batch["input_ids"], response_tensors, rewards)
The Reward Model
The reward model is trained on preference data using the Bradley-Terry model:
# Preference data format
{
"prompt": "Write a poem about autumn",
"chosen": "Golden leaves cascade like memories...",
"rejected": "Leaves fall. Trees bare. Cold comes..."
}
# Loss function: negative log-likelihood of the Bradley-Terry preference model
import torch

def reward_loss(chosen_reward, rejected_reward):
    return -torch.log(torch.sigmoid(chosen_reward - rejected_reward)).mean()
Training a reward model:
from transformers import AutoModelForSequenceClassification
from trl import RewardTrainer, RewardConfig
reward_model = AutoModelForSequenceClassification.from_pretrained(
"meta-llama/Llama-2-7b-hf",
num_labels=1,
)
reward_config = RewardConfig(
output_dir="./reward-model",
num_train_epochs=1,
per_device_train_batch_size=4,
learning_rate=1e-5,
)
reward_trainer = RewardTrainer(
model=reward_model,
args=reward_config,
train_dataset=preference_dataset,
tokenizer=tokenizer,
)
reward_trainer.train()
Why RLHF Works
RLHF's effectiveness comes from several factors:
- Preference learning: Captures nuanced human judgments
- Exploration: RL explores responses not in training data
- Optimization target: Directly optimizes what humans want
- On-policy training: The model learns on its own generations, so the training distribution matches what it produces at deployment
According to Ouyang et al. (2022), InstructGPT (1.3B parameters with RLHF) was preferred over GPT-3 (175B) in human evaluations, demonstrating RLHF's power.
Challenges
- Reward hacking: Model exploits reward model weaknesses
- Training instability: PPO is notoriously unstable
- High compute cost: Requires multiple forward passes per step
- Reward model quality: Garbage in, garbage out
- KL divergence tuning: Balancing reward vs staying close to SFT
Reward Hacking Example
# Reward model might learn superficial patterns
"Great question! Let me explain..." → High reward (politeness)
"No." → Low reward (brevity)
# Model may learn to be verbose rather than accurate
Mitigation strategies:
- Ensemble reward models (a minimal sketch follows this list)
- Constitutional AI (self-critique)
- Adversarial training data
- Process-based rewards
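A sketch of the first mitigation: score each response with several independently trained reward models and aggregate conservatively, so a response must satisfy all of them to earn a high score (the model list, tokenization, and disagreement penalty are illustrative assumptions):
import torch

def ensemble_reward(reward_models, tokenizer, prompt, response, penalty=1.0):
    inputs = tokenizer(prompt + response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        scores = torch.stack([rm(**inputs).logits[0, 0] for rm in reward_models])
    # Conservative aggregation: mean score minus a penalty for disagreement
    return (scores.mean() - penalty * scores.std()).item()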
DPO (Direct Preference Optimization)
How It Works
DPO eliminates the reward model and RL training by directly optimizing preferences. The key insight: the optimal policy under RLHF has a closed-form solution.
RLHF: preferences → reward model → RL → aligned model
DPO: preferences → aligned model (direct optimization)
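The closed form behind that shortcut (from Rafailov et al., 2023): the optimum of the KL-regularized RLHF objective is a reweighting of the reference model, which can be inverted to express the reward in terms of the policy itself:

\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left(\frac{r(x, y)}{\beta}\right)
\quad\Longleftrightarrow\quad
r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)

The intractable partition function Z(x) cancels when the chosen and rejected responses share a prompt, so the Bradley-Terry preference loss can be written purely in terms of policy and reference log-probabilities, which is exactly what the loss below does.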
The DPO loss function:
import torch.nn.functional as F

def dpo_loss(policy_logps_chosen, policy_logps_rejected,
             ref_logps_chosen, ref_logps_rejected, beta=0.1):
"""
policy_logps: log probability of response under trained model
ref_logps: log probability of response under reference model
beta: temperature parameter (controls deviation from reference)
"""
policy_ratio = policy_logps_chosen - policy_logps_rejected
ref_ratio = ref_logps_chosen - ref_logps_rejected
loss = -F.logsigmoid(beta * (policy_ratio - ref_ratio))
return loss.mean()
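A quick sanity check with made-up sequence log-probabilities shows the mechanics: the loss falls as the policy favors the chosen response more strongly than the reference model does (values are illustrative only):
import torch

policy_chosen, policy_rejected = torch.tensor([-12.0]), torch.tensor([-15.0])
ref_chosen, ref_rejected = torch.tensor([-13.0]), torch.tensor([-13.5])

# Policy margin (3.0) exceeds the reference margin (0.5), so the loss
# drops below log(2) ≈ 0.693, the value at zero separation
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1))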
Implementation
from transformers import AutoModelForCausalLM
from trl import DPOTrainer, DPOConfig
# Load SFT model as starting point
model = AutoModelForCausalLM.from_pretrained("sft-model")
ref_model = AutoModelForCausalLM.from_pretrained("sft-model")
# DPO configuration
dpo_config = DPOConfig(
output_dir="./dpo-model",
num_train_epochs=3,
per_device_train_batch_size=4,
learning_rate=5e-7,
beta=0.1, # KL penalty coefficient
loss_type="sigmoid", # or "hinge", "ipo"
gradient_checkpointing=True,
)
# Prepare preference dataset
# Format: {"prompt": ..., "chosen": ..., "rejected": ...}
dpo_trainer = DPOTrainer(
model=model,
ref_model=ref_model,
args=dpo_config,
train_dataset=preference_dataset,
tokenizer=tokenizer,
)
dpo_trainer.train()
DPO Variants
Several improvements to the original DPO have been proposed:
| Variant | Key Change | When to Use |
|---|---|---|
| IPO | Different loss function | More stable, less reward hacking |
| KTO | Single response (no pairs) | When you only have ratings |
| ORPO | No reference model needed | Lower memory usage |
| SimPO | Length-normalized | Better for varying response lengths |
import torch.nn.functional as F

# IPO (Identity Preference Optimization): squared loss toward a fixed margin
def ipo_loss(policy_ratio, ref_ratio, beta=0.1):
    return ((policy_ratio - ref_ratio) - 1 / (2 * beta)) ** 2

# KTO (Kahneman-Tversky Optimization) - works with single responses
# (simplified: the full KTO loss also uses a reference-point KL term and per-class weights)
def kto_loss(policy_logp, ref_logp, is_chosen, beta=0.1):
    ratio = policy_logp - ref_logp
    if is_chosen:
        return -F.logsigmoid(beta * ratio)
    else:
        return -F.logsigmoid(-beta * ratio)
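SimPO, listed in the table above, removes the reference model and instead compares length-normalized average log-probabilities against a target margin. A sketch following the published formulation (the beta and gamma defaults here are illustrative):
import torch.nn.functional as F

# SimPO: reference-free, length-normalized implicit reward with a target margin
def simpo_loss(policy_logps_chosen, policy_logps_rejected,
               len_chosen, len_rejected, beta=2.0, gamma=0.5):
    reward_chosen = beta * policy_logps_chosen / len_chosen
    reward_rejected = beta * policy_logps_rejected / len_rejected
    return -F.logsigmoid(reward_chosen - reward_rejected - gamma).mean()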
Strengths
- Simplicity: No reward model or RL infrastructure
- Stability: Supervised learning is more stable than RL
- Efficiency: 2-4x faster training than RLHF
- Memory: Only need policy and reference model (no value head)
Limitations
- No exploration: Only learns from existing preference data
- Static reference: Can't update reference during training
- Limited extrapolation: Harder to generalize beyond training preferences
- Beta sensitivity: Results depend on temperature parameter
DPO vs RLHF: Benchmark Comparison
Reported results vary with the model, data, and evaluation setup; the figures below are indicative rather than definitive:
| Benchmark | RLHF | DPO | Notes |
|---|---|---|---|
| MT-Bench | 7.2 | 7.0 | RLHF slightly better |
| AlpacaEval | 85% | 82% | Similar performance |
| TruthfulQA | 47% | 45% | Comparable |
| Training Time | 100% | 40% | DPO much faster |
| Stability | Variable | Consistent | DPO more reliable |
Choosing the Right Approach
Decision Framework
Start with Instruction Tuning (always)
↓
Do you have preference data?
No → Stick with instruction tuning
Yes ↓
Do you need maximum quality?
Yes → RLHF (if you have resources)
No → DPO
Can you afford RL infrastructure?
No → DPO
Yes → Consider RLHF for critical applications
Practical Recommendations
Use Instruction Tuning when:
- Building first version of aligned model
- Limited compute budget
- High-quality demonstration data available
- Deterministic, factual tasks
Use RLHF when:
- Maximum quality is critical (production chatbots)
- You have strong ML infrastructure
- Need exploration beyond training data
- Can invest in reward model iteration
Use DPO when:
- Want preference learning without RL complexity
- Medium compute budget
- Preference data available
- Need stable, reproducible training
Combination Strategies
Modern alignment often combines approaches:
# Stage 1: Instruction tuning
sft_model = train_sft(base_model, instruction_data)
# Stage 2: Preference alignment (choose one)
# Option A: DPO (simpler)
aligned_model = train_dpo(sft_model, preference_data)
# Option B: RLHF (potentially higher quality)
reward_model = train_reward_model(preference_data)
aligned_model = train_ppo(sft_model, reward_model)
# Stage 3 (optional): Constitutional AI / Self-critique
final_model = train_self_critique(aligned_model)
Advanced Topics
Constitutional AI
Anthropic's Constitutional AI adds a self-critique step:
def constitutional_critique(model, response, principles):
"""
Generate critique based on constitutional principles
"""
critique_prompt = f"""
Response: {response}
Principles: {principles}
Critique this response according to the principles.
Then provide an improved response.
"""
    # Schematic: with a Hugging Face model you would tokenize the prompt before calling generate()
    return model.generate(critique_prompt)
Iterative DPO
Recent work shows iterating DPO can approach RLHF quality:
for iteration in range(3):
# Generate new responses with current model
new_responses = generate_responses(model, prompts)
# Rank responses (human or model-based)
new_preferences = rank_responses(new_responses)
# DPO update
model = train_dpo(model, new_preferences)
Online vs Offline Preference Learning
| Aspect | Online (RLHF) | Offline (DPO) |
|---|---|---|
| Data | Generated during training | Fixed dataset |
| Exploration | Yes | No |
| Distribution shift | Handles naturally | May struggle |
| Compute | Higher | Lower |
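The gap between the two columns can be narrowed by regenerating preference pairs from the current policy between training rounds, as in the iterative DPO loop above. A minimal sketch of building on-policy pairs with a reward model as the judge; the sampling settings and helper names are illustrative assumptions, not a fixed API:
import torch

def rm_score(reward_model, tokenizer, prompt, response):
    # Score prompt + response with a sequence-classification reward model
    inputs = tokenizer(prompt + response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0, 0].item()

def make_online_pairs(model, tokenizer, reward_model, prompts):
    pairs = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        # Sample two candidate responses from the *current* policy
        outputs = model.generate(**inputs, do_sample=True, top_p=0.9,
                                 max_new_tokens=256, num_return_sequences=2)
        prompt_len = inputs["input_ids"].shape[1]
        a, b = [tokenizer.decode(o[prompt_len:], skip_special_tokens=True) for o in outputs]
        # The higher-scoring sample becomes "chosen", the other "rejected"
        if rm_score(reward_model, tokenizer, prompt, a) >= rm_score(reward_model, tokenizer, prompt, b):
            chosen, rejected = a, b
        else:
            chosen, rejected = b, a
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs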
Practical Example: Full Pipeline
Here's a complete alignment pipeline:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import SFTTrainer, DPOTrainer, DPOConfig
from datasets import load_dataset
# Step 1: Load base model
base_model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-2-7b-hf",
torch_dtype=torch.bfloat16,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
# Step 2: Instruction tuning
instruction_data = load_dataset("tatsu-lab/alpaca", split="train")
sft_trainer = SFTTrainer(
model=base_model,
train_dataset=instruction_data,
dataset_text_field="text",
max_seq_length=2048,
)
sft_trainer.train()
sft_model = sft_trainer.model
# Step 3: Preference alignment with DPO
preference_data = load_dataset("argilla/ultrafeedback-binarized-preferences", split="train")
dpo_config = DPOConfig(
output_dir="./aligned-model",
beta=0.1,
learning_rate=5e-7,
num_train_epochs=1,
)
dpo_trainer = DPOTrainer(
model=sft_model,
ref_model=None, # Will create copy automatically
args=dpo_config,
train_dataset=preference_data,
tokenizer=tokenizer,
)
dpo_trainer.train()
# Step 4: Save aligned model
dpo_trainer.save_model("./final-aligned-model")
References
- Ouyang, L., et al. (2022). "Training language models to follow instructions with human feedback." arXiv:2203.02155
- Rafailov, R., et al. (2023). "Direct Preference Optimization: Your Language Model is Secretly a Reward Model." arXiv:2305.18290
- "A Survey of Direct Preference Optimization." (2025). arXiv:2503.11701
- Bai, Y., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv:2212.08073
- Ethayarajh, K., et al. (2024). "KTO: Model Alignment as Prospect Theoretic Optimization." arXiv:2402.01306