Instruction Tuning, RLHF, and DPO

Trace how base models become assistants through supervised instruction tuning and preference optimization methods like RLHF and DPO.

Difficulty advanced
Read time 12 min
instruction-tuning sft rlhf dpo preference-optimization alignment
Updated February 11, 2026

What Are Instruction Tuning, RLHF, and DPO?

A base language model is like a brilliant intern who has read the internet but has never worked in your company. It can complete text, but it does not naturally behave like a reliable assistant. It might answer in odd formats, ignore user intent, or optimize for “likely continuation” instead of “helpful outcome.”

The alignment pipeline that turns a base model into an assistant is usually described in three stages:

  1. Instruction tuning (SFT): teach the model how good assistant responses look.
  2. RLHF: optimize the model using human preference signals.
  3. DPO: optimize preferences directly without a separate RL loop.

Technical definition:

  • Instruction tuning is supervised fine-tuning on prompt-response pairs written to reflect desired assistant behavior.
  • RLHF (Reinforcement Learning from Human Feedback) learns a reward model from preference data and then optimizes the policy model to increase that reward.
  • DPO (Direct Preference Optimization) uses preference pairs to directly update the policy with a classification-style objective, bypassing explicit reward-model-plus-RL optimization.

Why Does It Matter?

If you build with LLMs, this stack explains why modern assistants feel different from plain next-token predictors.

It matters for:

  • Helpfulness: models learn to follow instructions and answer in usable formats.
  • Harmlessness and policy behavior: preference training can discourage unsafe or low-quality behavior.
  • Product consistency: models are nudged toward responses that humans rate as better.
  • Training strategy choices: teams must decide whether SFT alone is enough or whether preference optimization is also needed.

It also explains tradeoffs you see in practice. Over-optimizing preferences can produce overly cautious models. Under-optimizing can produce fluent but unhelpful outputs.

How It Works

Stage 0: Pretraining (starting point)

Pretraining teaches a model broad statistical knowledge by predicting tokens on massive corpora. This creates strong general capabilities, but not task-specific assistant behavior.
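
To make that objective concrete, here is a minimal sketch of the next-token prediction loss, assuming a causal language model that has already produced a tensor of logits (the function and variable names are illustrative, not any particular library's API):

  import torch
  import torch.nn.functional as F

  def next_token_loss(logits: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
      # logits: (batch, seq_len, vocab_size); token_ids: (batch, seq_len)
      # Position t predicts token t+1, so shift predictions and targets by one.
      pred = logits[:, :-1, :].reshape(-1, logits.size(-1))
      target = token_ids[:, 1:].reshape(-1)
      return F.cross_entropy(pred, target)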

Stage 1: Instruction tuning (SFT)

You collect high-quality examples:

  • input instruction,
  • preferred assistant response.

Then you fine-tune with supervised learning so the model imitates these responses.
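
A minimal sketch of the SFT loss, assuming a causal LM that returns logits for a tokenized prompt-plus-response sequence (all names here are placeholders, not a specific library API). The detail that distinguishes it from plain pretraining is the mask: only the response tokens contribute to the loss, so the model learns to imitate the assistant reply rather than the prompt.

  import torch
  import torch.nn.functional as F

  def sft_loss(model, prompt_ids: torch.Tensor, response_ids: torch.Tensor) -> torch.Tensor:
      # Concatenate prompt and response into one sequence: (batch, seq_len).
      input_ids = torch.cat([prompt_ids, response_ids], dim=1)
      logits = model(input_ids)  # assumed shape: (batch, seq_len, vocab_size)

      # Shift so that position t predicts token t+1.
      pred = logits[:, :-1, :]
      target = input_ids[:, 1:]

      # Mask out prompt positions; after the shift, the first predicted
      # response token sits at index prompt_len - 1.
      prompt_len = prompt_ids.size(1)
      loss_mask = torch.zeros_like(target, dtype=torch.bool)
      loss_mask[:, prompt_len - 1:] = True

      return F.cross_entropy(pred[loss_mask], target[loss_mask])

In practice, teams usually run this through an off-the-shelf trainer; the prompt-masking idea is the part worth remembering.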

What SFT gives you:

  • better adherence to user requests,
  • cleaner structure and formatting,
  • improved multi-step answer style.

What SFT does not fully solve:

  • nuanced preference tradeoffs (concise vs detailed, cautious vs direct),
  • hard-to-specify quality signals.

Stage 2: RLHF

RLHF usually has three substeps.

  1. Collect preference comparisons

    Human raters compare two or more candidate outputs for the same prompt.

  2. Train a reward model

    Learn a function R(prompt, response) that predicts which outputs humans prefer (a pairwise loss sketch follows this list).

  3. Optimize the policy model

    Use RL (often PPO variants) to increase expected reward while constraining drift from the SFT model.
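
For substep 2, a common choice is a Bradley-Terry style pairwise loss. The sketch below assumes a reward_model callable that maps (prompt, response) token ids to one scalar score per example; the names are illustrative:

  import torch
  import torch.nn.functional as F

  def reward_model_loss(reward_model, prompt_ids, chosen_ids, rejected_ids) -> torch.Tensor:
      r_chosen = reward_model(prompt_ids, chosen_ids)      # (batch,) scores
      r_rejected = reward_model(prompt_ids, rejected_ids)  # (batch,) scores
      # Push the preferred response's score above the dispreferred one's.
      return -F.logsigmoid(r_chosen - r_rejected).mean()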

Intuition: rather than imitating exact target responses, the model is nudged toward the kinds of outputs humans tend to prefer.

Operational downside: RLHF pipelines can be complex and unstable. You must manage reward hacking, KL penalties, and training sensitivity.
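
A minimal sketch of the constraint used in substep 3, assuming you have per-token log-probabilities of a sampled response under both the current policy and the frozen SFT reference; kl_coef and the variable names are illustrative assumptions:

  import torch

  def shaped_reward(reward_scores: torch.Tensor,
                    policy_logprobs: torch.Tensor,
                    ref_logprobs: torch.Tensor,
                    kl_coef: float = 0.1) -> torch.Tensor:
      # Per-token estimate of how far the policy has drifted from the reference.
      kl = policy_logprobs - ref_logprobs              # (batch, seq_len)
      # Subtract the scaled drift from the reward model's score, so the policy
      # is rewarded for preferred behavior but penalized for drifting too far.
      return reward_scores - kl_coef * kl.sum(dim=-1)  # (batch,)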

Stage 3: DPO

DPO keeps the same preference data idea but simplifies optimization.

Given a prompt with preferred response y+ and dispreferred response y-, DPO updates the model to increase the relative likelihood of y+ over y-, usually with a reference policy as an anchor.
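
A minimal sketch of the DPO loss from Rafailov et al. (2023), assuming you have already summed the log-probability of each full response under both the policy being trained and a frozen reference model (typically the SFT checkpoint); beta controls how strongly the update is anchored to that reference:

  import torch
  import torch.nn.functional as F

  def dpo_loss(policy_chosen_logp: torch.Tensor,
               policy_rejected_logp: torch.Tensor,
               ref_chosen_logp: torch.Tensor,
               ref_rejected_logp: torch.Tensor,
               beta: float = 0.1) -> torch.Tensor:
      # Implicit "rewards": how much more likely each response is under the
      # policy than under the reference model.
      chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
      rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
      # Classification-style objective: prefer the chosen response's implicit
      # reward over the rejected one's.
      return -F.logsigmoid(chosen_reward - rejected_reward).mean()

Because everything reduces to log-probabilities from forward passes over the preference pairs, no sampling loop or separately trained reward model is needed, which is what "bypassing explicit reward-model-plus-RL optimization" means in practice.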

Intuition:

  • RLHF says “learn reward, then optimize reward.”
  • DPO says “optimize pairwise preference directly.”

Why teams like DPO:

  • simpler training stack,
  • no separate reward model deployment,
  • often more stable iteration for many workloads.

Putting it together

Most real pipelines are not strictly one method forever. A common loop is:

  • start with strong SFT,
  • add preference optimization where behavior still misses product goals,
  • evaluate continuously with task-specific and safety benchmarks.

Key Terminology

  • SFT (Supervised Fine-Tuning): Training on curated instruction-response examples.
  • Preference data: Human judgments that one response is better than another for the same prompt.
  • Reward model: Model trained to predict human preference scores.
  • Policy optimization: Updating the assistant model to improve a target objective.
  • DPO (Direct Preference Optimization): Directly increasing the likelihood of preferred responses relative to rejected ones.

Real-World Applications

  • General assistants: Improve instruction following, tone control, and refusal quality.
  • Enterprise copilots: Align responses to internal writing style and policy constraints.
  • Domain-specific assistants: Prioritize clinically safer phrasing in healthcare or more auditable phrasing in finance.
  • Developer tools: Prefer outputs that compile, pass tests, and follow style guides over merely plausible code.

Common Misconceptions

  1. “RLHF is only about safety filtering.” It also shapes broad utility preferences such as clarity, relevance, and tone.

  2. “DPO replaces SFT.” In practice, DPO usually builds on top of SFT, not as a complete substitute.

  3. “Preference optimization guarantees truthfulness.” It improves preferred behavior, but factuality still depends on grounding, retrieval, and evaluation design.

Further Reading

  • Ouyang et al. (2022), Training language models to follow instructions with human feedback.
  • Rafailov et al. (2023), Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
  • OpenAI and Anthropic alignment documentation on instruction tuning and preference optimization workflows.