
Reinforcement Learning from Human Feedback (RLHF)

How a pile of "A is better than B" clicks turned GPT-3 into something you'd actually talk to

RLHF aligns a language model to human preferences in three stages: supervised fine-tuning, training a reward model from pairwise preference rankings, then optimizing the policy against that reward with PPO under a KL penalty.

  • Stages: SFT → reward model → PPO
  • Human signal: Pairwise preferences
  • Reward model loss: Bradley-Terry
  • RL objective: reward − β·KL
  • Canonical paper: InstructGPT, 2022

The core idea: humans can't score, but they can compare

A pretrained language model is a brilliant autocomplete and a terrible assistant. Ask GPT-3 base "Explain the moon landing to a six-year-old" and it might answer — or it might continue with three more exam questions, because that's a plausible continuation of the text it saw on the web. The model has the knowledge; it has no idea what you want. RLHF is the technique that closed that gap, and it's the reason ChatGPT, Claude, and Gemini feel like they're cooperating with you instead of completing a document.

The central trick is a concession about humans. If you ask a labeler "rate this answer from 1 to 10," you get noisy, drifting, incomparable numbers — one annotator's 7 is another's 4. But if you show them two answers to the same prompt and ask "which is better?", they're fast, consistent, and cheap. RLHF is built entirely around that asymmetry: collect pairwise preferences, distill them into a learned reward model that can output a scalar score, and then use reinforcement learning to push the language model toward outputs the reward model likes.

The canonical recipe is three stages, formalized by Ouyang et al. in the 2022 InstructGPT paper (the direct ancestor of ChatGPT), building on Christiano et al. (2017) and Stiennon et al. (2020):

  1. Supervised fine-tuning (SFT). Fine-tune the base model on high-quality human demonstrations of the behavior you want (InstructGPT used roughly 13k prompts). This gives a model that at least tries to follow instructions. Call this policy π_SFT; it also serves as the frozen reference later.
  2. Reward modeling (RM). Sample several completions per prompt, have humans rank them, and train a reward model r_φ(x, y) to predict which completion a human would prefer.
  3. RL optimization. Treat the LM as a policy. For each prompt, sample a completion, score it with the reward model, and update the policy with PPO (Proximal Policy Optimization) to raise that score, while a KL penalty keeps it from straying too far from the SFT model.

Stage 2: the reward model and the Bradley-Terry loss

The reward model is usually the SFT model with its final token-prediction head swapped for a single scalar head. It reads a prompt x and a completion y and emits one number r_φ(x, y): "how good is this?"

How do you train a scalar predictor when your labels are comparisons, not scores? With the Bradley-Terry model (1952), the standard model for paired comparisons. It says the probability that completion y_w (the winner) beats y_l (the loser) is the logistic of their reward difference:

P(y_w ≻ y_l | x) = σ( r_φ(x, y_w) − r_φ(x, y_l) )

where σ(z) = 1 / (1 + e^(−z))

Maximizing the likelihood of the observed preferences gives the reward-model loss — a pairwise logistic (a.k.a. ranking) loss:

L_RM(φ) = − E_(x, y_w, y_l) [ log σ( r_φ(x, y_w) − r_φ(x, y_l) ) ]

Three things fall out of this. First, only the difference in rewards matters, so the absolute scale is arbitrary; the loss is unchanged if you add a constant to every reward. InstructGPT pins the scale down by adding a bias so that the labeler demonstrations score a mean of 0 before RL starts. Second, when a labeler ranks K completions for one prompt, you don't treat the C(K,2) pairs as independent examples; InstructGPT found that overfits, because each completion then drives up to K−1 separate gradient updates. Instead, all pairs from one prompt go in a single batch element and the summed loss is scaled by 1/C(K,2). Third, the reward model is only as good as its training distribution: push the policy into regions the RM never saw and its scores become meaningless. That fragility is the whole reason the next stage needs a KL leash.
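As a rough illustration of the first two points, the sketch below expands one ranked list of K completions into its C(K,2) winner/loser pairs, averages the Bradley-Terry loss over them, and checks that shifting every reward by a constant leaves the loss unchanged. The reward values and the helper names (log_sigmoid, ranking_loss) are made up for the example; a real reward model would produce the scores.

```python
import math
from itertools import combinations


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def ranking_loss(rewards_best_to_worst: list[float]) -> float:
    """Bradley-Terry loss for one prompt whose K completions are ranked best-first.

    combinations() keeps list order, so the first element of each pair is the
    human-preferred completion. The summed pair loss is divided by C(K, 2).
    """
    pairs = list(combinations(rewards_best_to_worst, 2))
    total = sum(-log_sigmoid(r_w - r_l) for r_w, r_l in pairs)
    return total / len(pairs)


# Hypothetical reward-model outputs for K = 4 completions, best first.
scores = [2.0, 0.5, -1.0, -1.5]
loss = ranking_loss(scores)
shifted_loss = ranking_loss([s + 10.0 for s in scores])  # same differences

print(f"pairs: {math.comb(len(scores), 2)}")
print(f"loss: {loss:.4f}, loss after +10 shift: {shifted_loss:.4f}")
```

Because the loss only sees r_w − r_l, the two printed losses agree, which is why the absolute scale has to be fixed by a separate normalization step.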

Stage 3: PPO and the KL-penalized objective

Now we optimize. The thing we want to maximize, per prompt x drawn from the dataset and completion y sampled from the current policy π_θ, is:

objective(θ) = E_(x ~ D, y ~ π_θ) [ r_φ(x, y) − β · log( π_θ(y|x) / π_ref(y|x) ) ]

The first term is "make the reward model happy." The second is the KL penalty: the log-ratio is a single-sample estimate of how far the policy has drifted from the frozen reference π_ref (the SFT model), and in practice it is applied token by token. The coefficient β, often around 0.01–0.1 and sometimes adapted online to hit a target KL, sets the leash length. Drop the KL term and the policy will gleefully reward-hack: it discovers that the reward model loves long, hedged, sycophantic answers and starts emitting those exclusively, because the RM is a flawed proxy and the policy has found its blind spots.

Why PPO and not plain policy gradient? Because language generation is a one-shot bandit-like setting where a single bad gradient step can wreck a model that cost millions to train. PPO (Schulman et al., 2017) maximizes a clipped surrogate that refuses to move the policy too far in one update:

ratio  = π_θ(y|x) / π_θ_old(y|x)
L_clip = E[ min( ratio · Â,  clip(ratio, 1−ε, 1+ε) · Â ) ]

 = advantage (token reward − value baseline, often via GAE)
ε ≈ 0.2

The clip is a trust region you get for free: if the new policy already assigns much higher probability to a good action (ratio > 1+ε), the gradient is clipped so the model doesn't sprint off a cliff. A separate value network estimates the baseline, so PPO is actor-critic. InstructGPT also mixes in a small amount of the original pretraining loss (the "PTX" term) to stop the aligned model from forgetting general capabilities — the alignment tax.
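To make the clipping behavior concrete, here is a small sketch that evaluates the clipped surrogate for a single sample at a few (ratio, advantage) combinations and reports when the clipped branch is the active one. The function names are illustrative, and scalar Python floats stand in for the per-token tensors a real implementation would use.

```python
def clipped_surrogate(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Per-sample PPO objective: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


def gradient_is_blocked(ratio: float, advantage: float, eps: float = 0.2) -> bool:
    """True when the clipped branch wins, so the objective is flat in the ratio."""
    too_high = advantage > 0 and ratio > 1.0 + eps   # good action already boosted
    too_low = advantage < 0 and ratio < 1.0 - eps    # bad action already suppressed
    return too_high or too_low


cases = [(1.0, 1.0), (1.5, 1.0), (0.5, 1.0), (1.5, -1.0), (0.5, -1.0)]
for ratio, adv in cases:
    value = clipped_surrogate(ratio, adv)
    blocked = gradient_is_blocked(ratio, adv)
    print(f"ratio={ratio:.1f} adv={adv:+.1f} objective={value:+.2f} blocked={blocked}")
```

Note the asymmetry: the clip only blocks updates that would push further in the direction the policy has already moved. A good action whose probability dropped (ratio 0.5, positive advantage) still gets its full gradient.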

When RLHF earns its complexity — and when it doesn't

  • The objective is fuzzy and human-judged. "Be helpful, honest, and harmless," "write like a friendly tutor," "summarize the way an editor would." There's no closed-form loss for taste — preferences are the only signal, and that's exactly what RLHF consumes.
  • You already have a strong base + SFT model. RLHF polishes; it does not teach knowledge. If the model can't do the task at all after SFT, RLHF won't conjure the ability.
  • You can afford a labeling pipeline and an RL stack. Four networks in memory at once (policy, reference, reward, and value) plus an unstable optimizer is real engineering. Teams short on either increasingly reach for DPO instead.
  • You want on-policy improvement. RLHF's reward model can score brand-new samples the policy generates as it trains — a feedback loop that purely offline methods can't replicate.

Skip RLHF when a verifiable reward exists. If correctness is checkable — a unit test passes, a math answer matches, a theorem prover accepts the proof — use that signal directly (RL with verifiable rewards / RLVR). You don't need a learned human-preference proxy when the ground truth is computable.
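A verifiable reward can be as small as a string check. The sketch below is one hypothetical example for arithmetic prompts: it treats the last number in the completion as the model's answer and returns 1.0 only on an exact match. The function name and the last-number convention are assumptions made for illustration; real verifiers (unit-test runners, proof checkers) are more careful about parsing.

```python
import re


def math_answer_reward(completion: str, expected: str) -> float:
    """Return 1.0 if the last number in the completion equals the expected answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0  # no numeric answer at all
    return 1.0 if float(numbers[-1]) == float(expected) else 0.0


print(math_answer_reward("12 times 7 is 84", "84"))
print(math_answer_reward("I think the answer is 85.", "84"))
```

Nothing here is learned, so there is no proxy for the policy to exploit beyond the parsing rule itself.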

RLHF vs the alternatives

| | RLHF (PPO) | DPO | RLAIF | Best-of-N | RRHF / RAFT | RLVR |
|---|---|---|---|---|---|---|
| Explicit reward model? | Yes (separate) | No (implicit) | Yes (AI-labeled) | Yes (for ranking) | Yes (for ranking) | No (rule/verifier) |
| Online RL loop? | Yes | No | Yes | No (inference-time) | No (offline) | Yes |
| Models held in memory | 4 (policy, ref, RM, value) | 2 (policy, ref) | 4 | 2 (gen, RM) | 2 | 2–3 |
| Training stability | Fiddly: KL collapse, reward hacking | High: supervised-style | As PPO | N/A (no training) | High | Moderate |
| Human labels needed | High | High | Low (AI feedback) | Medium (for RM) | Medium | None (verifier) |
| On-policy improvement | Yes | No (fixed dataset) | Yes | No | Limited | Yes |
| Where it shines | Frontier chat alignment | Cheap, reproducible alignment | Scaling labels via a constitution | Cheap quality bump at inference | Simple offline fine-tune | Math, code, reasoning |

The big shift since 2023 is DPO. By proving that the KL-constrained RL solution has a closed form, DPO collapses the whole reward-model-plus-PPO machine into a single classification-style loss on preference pairs (more on this below). Many open models — Zephyr, much of the Llama 3 instruct line, Tülu — are aligned with DPO or its descendants rather than PPO, because it's dramatically easier to get right. RLHF-with-PPO remains the choice when you want a reusable reward model and genuine on-policy exploration, which is why frontier labs still run it.

What the numbers actually say

  • A model more than 100× smaller won on preference. InstructGPT's 1.3B-parameter RLHF model was preferred by labelers over the 175B-parameter GPT-3 base: same architecture family, roughly 174 billion fewer parameters, beaten by alignment alone. That single result is why every lab adopted the pipeline.
  • Human supervision is cheaper than it looks. InstructGPT used ~13k SFT prompts, ~33k reward-model prompts (each ranked into C(K,2) pairs, so hundreds of thousands of comparisons), and ~31k RL prompts. Llama 2 (2023) scaled the preference set past 1M comparisons. The reward model is the amortizer: it answers the billions of reward queries PPO issues, none of which touch a human.
  • The KL leash has a price tag. Raising the reward by chasing a higher RM score buys you measurable KL divergence from the reference. Plot reward-model score against KL and you get the classic frontier — past a point, extra "reward" is pure hacking, and held-out human win-rate goes down even as RM score goes up. The β coefficient is the knob that picks a point on that curve.
  • Reward hacking shows up as length. A near-universal failure mode is the policy learning that longer answers score higher under an imperfect RM, so it pads. Teams routinely length-normalize the reward or add an explicit length penalty to counter it.
  • Four model copies, not one. A PPO step needs forward passes through the policy and the value network, a forward pass through the reward model, and a forward pass through the frozen reference for the KL term — roughly 4× the memory and compute of plain fine-tuning, the chief reason DPO (2 copies) is so attractive.
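The length-penalty countermeasure from the list above can be sketched in a few lines. This is one possible form, not a standard recipe: the target length, the per-token cost alpha, and the function name are all assumptions chosen for illustration.

```python
def length_penalized_reward(rm_score: float, num_tokens: int,
                            target_tokens: int = 200, alpha: float = 0.005) -> float:
    """Subtract alpha per token beyond target_tokens; shorter answers are untouched."""
    excess = max(0, num_tokens - target_tokens)
    return rm_score - alpha * excess


# Hypothetical scores: the padded answer wins on raw RM score but loses once penalized.
concise = length_penalized_reward(rm_score=2.0, num_tokens=150)
padded = length_penalized_reward(rm_score=2.3, num_tokens=400)
print(f"concise: {concise:.2f}, padded: {padded:.2f}")
```

A hinge on excess length, rather than a penalty on every token, avoids rewarding answers for being too short.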

JavaScript: the reward model and the KL-penalized reward

You won't run PPO on a transformer in the browser, but the two ideas at RLHF's heart — the Bradley-Terry reward loss and the KL-penalized reward signal — are tiny and worth seeing in plain code.

const sigmoid = z => 1 / (1 + Math.exp(-z));

// ── Stage 2: one gradient step of Bradley-Terry reward-model training ──
// Toy linear reward model: r(features) = w · features.
// Each example is a pair: features of the winner vs the loser.
function rmStep(w, winnerFeat, loserFeat, lr = 0.05) {
  const dot = (a, b) => a.reduce((s, v, i) => s + v * b[i], 0);
  const rW = dot(w, winnerFeat);
  const rL = dot(w, loserFeat);
  // Loss = -log σ(rW - rL).  dLoss/d(rW-rL) = σ(rW-rL) - 1
  const g = sigmoid(rW - rL) - 1;          // in (-1, 0); largest in magnitude when the model is most wrong
  // Chain rule: gradient w.r.t. w is g * (winnerFeat - loserFeat)
  return w.map((wi, i) => wi - lr * g * (winnerFeat[i] - loserFeat[i]));
}

// ── Stage 3: the per-sample reward PPO actually optimizes ──
// rmScore: reward model's scalar.  logp / logpRef: log-probs of this
// completion under the current policy and the frozen reference.
function rlhfReward(rmScore, logp, logpRef, beta = 0.02) {
  const klPerSample = logp - logpRef;       // estimate of KL(π ‖ π_ref)
  return rmScore - beta * klPerSample;      // reward, leashed to the reference
}

// Demo: the model currently prefers the WINNER, so it learns a little.
let w = [0.1, -0.2, 0.0];
w = rmStep(w, [1, 0, 1], [0, 1, 0]);
console.log('updated reward weights', w.map(x => x.toFixed(4)));
console.log('leashed reward', rlhfReward(2.3, -8.1, -7.9).toFixed(4)); // 2.3 - 0.02*(-0.2)

Notice rmStep only ever uses the difference winnerFeat − loserFeat — the Bradley-Terry model is blind to absolute reward scale. And rlhfReward is the entire alignment objective in one line: maximize the proxy, but pay β for every nat of drift from where you started.

Python: a minimal PPO-on-text training step

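This is a deliberately stripped PPO update for one batch of prompts: sample completions, score them, and take one clipped policy-gradient step. It is written against a PyTorch-style API so the moving parts are visible without a 2,000-line RL library; the policy, value, ref, and reward_model wrappers (and methods such as log_probs) are illustrative stand-ins, not a real library's interface.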

import torch
import torch.nn.functional as F

def rm_loss(reward_chosen, reward_rejected):
    """Stage 2 — Bradley-Terry pairwise loss for the reward model."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

def ppo_step(policy, value, ref, reward_model, tokenizer, prompts,
             beta=0.02, clip_eps=0.2, optimizer=None):
    """Stage 3 — one PPO update on freshly sampled completions."""
    # 1. Roll out: sample completions from the CURRENT policy (on-policy).
    seqs   = policy.generate(prompts, do_sample=True, max_new_tokens=128)
    logp   = policy.log_probs(seqs)            # log π_θ(y|x), per token
    logp_old = logp.detach()                   # frozen snapshot for the ratio

    # 2. Score them: reward model gives a scalar per sequence.
    with torch.no_grad():
        rm_score = reward_model(seqs)          # r_φ(x, y)
        logp_ref = ref.log_probs(seqs)         # log π_ref(y|x), frozen SFT

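    # 3. KL-penalized reward, applied as a per-token shaping term.
    #    Shapes: logp_* and values are [batch, tokens]; rm_score is [batch].
    #    Simplification: assumes no padding, so column -1 is each final token.
    kl      = logp_old - logp_ref              # per-token estimate of KL(π ‖ π_ref)
    reward  = -beta * kl                       # KL penalty at every token
    reward[:, -1] += rm_score                  # RM score lands on the final token only
    returns = reward.flip(1).cumsum(1).flip(1) # undiscounted reward-to-go per token
    values  = value(seqs)                      # per-token value estimates
    adv     = (returns - values).detach()      # advantage (GAE in practice)
    adv     = (adv - adv.mean()) / (adv.std() + 1e-8)  # normalize for stability

    # 4. Clipped surrogate: PPO's trust region.
    #    On this first pass ratio == 1; it only moves when the same rollout
    #    is reused for several optimizer steps.
    ratio   = torch.exp(logp - logp_old)
    surr1   = ratio * adv
    surr2   = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    pg_loss = -torch.min(surr1, surr2).mean()

    # 5. Value-function regression so the baseline tracks the return.
    v_loss  = F.mse_loss(values, returns)
    loss    = pg_loss + 0.5 * v_loss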

    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(policy.parameters(), 1.0)
    optimizer.step()
    return {"reward": rm_score.mean().item(), "kl": kl.mean().item()}

The shape of a real loop: repeat ppo_step over prompt batches, watch the returned kl, and if it explodes, raise beta (or use an adaptive KL controller that targets a fixed KL). If reward climbs while a held-out human win-rate stalls, you're reward hacking — time to refresh the reward model on new on-policy samples. Libraries like Hugging Face TRL, OpenAI's original lm-human-preferences, and Allen AI's open-instruct implement exactly this with the production details (GAE, micro-batching, reference-model sharing) filled in.
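An adaptive KL controller of the kind mentioned above can be a simple proportional rule: compare the observed KL to the target and scale beta up or down by a small, clipped fraction. The sketch below is one such rule under assumed constants (target KL, horizon, and the ±0.2 clip are illustrative, and the class name is made up); production libraries ship their own variants.

```python
class AdaptiveKLController:
    """Proportional controller that nudges beta toward a target KL."""

    def __init__(self, init_beta: float = 0.02, target_kl: float = 6.0,
                 horizon: int = 10_000):
        self.beta = init_beta
        self.target_kl = target_kl
        self.horizon = horizon  # larger horizon means slower adjustments

    def update(self, observed_kl: float, n_samples: int) -> float:
        # Relative error, clipped so one noisy batch cannot swing beta wildly.
        error = max(-0.2, min(observed_kl / self.target_kl - 1.0, 0.2))
        self.beta *= 1.0 + error * n_samples / self.horizon
        return self.beta


controller = AdaptiveKLController()
for observed in [12.0, 9.0, 6.0, 3.0]:  # hypothetical per-batch KL readings
    beta = controller.update(observed, n_samples=256)
    print(f"observed KL {observed:>4.1f} -> beta {beta:.6f}")
```

The value returned by update would be passed as beta to the next training step: KL above target tightens the leash, KL below target loosens it.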

Variants worth knowing

DPO (Direct Preference Optimization), Rafailov et al. 2023. The big one. DPO observes that the KL-constrained RL objective has a closed-form optimal policy, and that you can invert it to express the implicit reward in terms of the policy itself: r(x,y) = β·log(π_θ(y|x) / π_ref(y|x)) + β·log Z(x), where the Z(x) term depends only on the prompt. Substitute that into the Bradley-Terry loss, where Z(x) cancels in the winner-minus-loser difference, and the reward model and the RL loop both vanish: you train directly on preference pairs with a single supervised-style loss. No reward model, no PPO, two models in memory instead of four. It's the default for most open-weight alignment now.
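For a single preference pair, the resulting DPO loss is the Bradley-Terry loss with the implicit rewards plugged in. The sketch below computes it from four sequence log-probabilities; the numbers are invented, and plain floats stand in for what would be batched tensors in a training loop.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float, beta: float = 0.1) -> float:
    """DPO loss for one (winner, loser) pair of sequence log-probabilities."""
    implicit_r_w = beta * (logp_w - ref_logp_w)   # implicit reward of the winner
    implicit_r_l = beta * (logp_l - ref_logp_l)   # implicit reward of the loser
    return -log_sigmoid(implicit_r_w - implicit_r_l)


# At initialization the policy equals the reference, so the margin is 0.
at_init = dpo_loss(-20.0, -22.0, -20.0, -22.0)
# After some training the policy has raised the winner and lowered the loser.
after = dpo_loss(-18.0, -25.0, -20.0, -22.0)
print(f"loss at init: {at_init:.4f}, after training: {after:.4f}")
```

The loss starts at log 2 regardless of the reference log-probabilities and falls as the policy widens the winner-loser margin relative to the reference.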

RLAIF / Constitutional AI (Anthropic, 2022). Replace human preference labelers with an AI model that ranks outputs according to a written set of principles (a "constitution"). This scales the label supply almost for free and was central to early Claude. The RL machinery is identical; only the source of preferences changes.

Best-of-N sampling (a.k.a. rejection sampling). The cheapest way to use a reward model: sample N completions at inference time, score each with the RM, return the best. No training at all — you spend compute at serving time instead. It's a strong baseline and is often used to generate the SFT data for the next round (RAFT / rejection-sampling fine-tuning, as in Llama 2's iterative pipeline).
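Best-of-N fits in a few lines once a sampler and a scorer exist. In the sketch below, both are passed in as callables; toy_generate and toy_score are toy stand-ins invented for the example (a real setup would call a language model and a trained reward model).

```python
import random
from typing import Callable


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 8) -> str:
    """Sample n completions and return the one the scorer rates highest."""
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda completion: score(prompt, completion))


# Toy stand-ins, purely for illustration.
_rng = random.Random(0)
_canned = ["Yes.", "It depends on the context.", "The moon is a rock in the sky."]


def toy_generate(prompt: str) -> str:
    return _rng.choice(_canned)


def toy_score(prompt: str, completion: str) -> float:
    # Hypothetical scorer: prefers completions that mention the prompt's topic.
    return float("moon" in completion.lower())


print(best_of_n("Explain the moon.", toy_generate, toy_score, n=4))
```

Cost scales linearly with n at serving time, which is the trade the text describes: no training, more inference compute.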

RRHF / SLiC / IPO / KTO. A family of offline preference losses that, like DPO, skip PPO. IPO fixes a DPO overfitting failure mode; KTO drops the need for pairs entirely, learning from independent thumbs-up / thumbs-down labels using a prospect-theory-style loss.

GRPO (Group Relative Policy Optimization). A PPO variant that drops the separate value network — it estimates the advantage by comparing each sample to the mean reward of a group of samples for the same prompt. Cheaper (one fewer model) and popular for reasoning models trained with verifiable rewards.
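The group-relative advantage can be sketched without any neural network: score a group of completions for the same prompt, subtract the group mean, and (as in the GRPO formulation) divide by the group standard deviation. The rewards below are hypothetical pass/fail outcomes, and the function name is illustrative.

```python
import statistics


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Advantage of each sample relative to its own group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions for one prompt, scored by a pass/fail verifier.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])
```

If every completion in a group gets the same reward, all advantages are zero and that prompt contributes no gradient, which is why group size and prompt difficulty matter in practice.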

Common failure modes and edge cases

  • Reward hacking / over-optimization. The defining RLHF pathology: the policy exploits the reward model's imperfections — verbosity, sycophancy, formulaic openings — driving RM score up while true quality falls. Counter with the KL penalty, length normalization, RM ensembles, and retraining the RM on fresh on-policy data.
  • KL collapse or KL explosion. Too-large β pins the policy to the SFT model (no learning); too-small β lets it drift into gibberish that fools the RM. Adaptive KL controllers that target a fixed KL per step are the standard fix.
  • Forgetting the reference is the SFT model, not the base model. The KL anchor must be the SFT checkpoint. Anchoring to the raw pretrained base pulls the policy back toward non-instruction-following behavior.
  • Treating all pairs from one ranking as independent. InstructGPT showed this overfits the reward model; put all C(K,2) pairs from a prompt in one batch and average the loss.
  • Distribution shift on the reward model. The RM is accurate only near its training distribution. As PPO moves the policy, the RM's scores degrade exactly where you need them — hence iterative pipelines that re-collect preferences on the evolving policy's outputs.
  • Value-function lag. If the critic's baseline can't keep up with the moving reward, advantages get noisy and PPO stalls or oscillates. Normalize advantages per batch and give the value head enough capacity.
  • Reward sparsity at the sequence level. The RM scores the whole completion, so the reward lands only at the final token. Credit assignment across 128 tokens is what makes this RL and not supervised learning — GAE and the per-token KL shaping spread the signal back.
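To see how a reward that lands on the final token gets spread backward, here is a minimal sketch of Generalized Advantage Estimation over one completion. The per-token log-probabilities, value estimates, and RM score are invented numbers, and the code assumes a single unpadded sequence with the value after the last token taken as zero.

```python
def gae(rewards: list[float], values: list[float],
        gamma: float = 1.0, lam: float = 0.95) -> list[float]:
    """Generalized Advantage Estimation for one sequence, computed right to left."""
    advantages = [0.0] * len(rewards)
    next_value, next_adv = 0.0, 0.0  # nothing follows the final token
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]  # one-step TD error
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    return advantages


beta, rm_score = 0.02, 1.5
logp = [-1.2, -0.8, -2.0, -0.5]       # hypothetical policy log-probs per token
logp_ref = [-1.0, -0.9, -1.5, -0.5]   # hypothetical reference log-probs per token
rewards = [-beta * (p - q) for p, q in zip(logp, logp_ref)]  # per-token KL shaping
rewards[-1] += rm_score                                      # sparse RM reward at the end
values = [0.9, 1.0, 1.2, 1.4]         # hypothetical critic outputs

print([round(a, 4) for a in gae(rewards, values)])
```

With lam = 0 each token sees only its own one-step error; with lam = 1 every token sees the full remaining return minus its baseline. Values in between trade bias for variance.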

Frequently asked questions

Why does RLHF use a reward model instead of asking humans to score every output directly?

Humans are unreliable at assigning absolute scores but consistent at picking the better of two outputs. RLHF collects pairwise preferences, fits a Bradley-Terry reward model to them, and then queries that model millions of times during PPO — far cheaper and lower-variance than putting a human in every training loop.

What is the KL penalty in RLHF and why is it needed?

The PPO objective subtracts β·KL(policy ‖ reference) from the reward. Without it the policy drifts far from the SFT model to maximize a reward model that is only accurate near the training distribution, producing fluent gibberish that scores high — classic reward hacking. The KL term tethers the policy to the reference, trading a little reward for staying in-distribution.

What are the three stages of RLHF?

1) Supervised fine-tuning (SFT): fine-tune the base model on high-quality human demonstrations. 2) Reward modeling: collect pairwise preference labels and train a reward model with a Bradley-Terry loss. 3) RL optimization: use PPO to maximize the reward model's score minus a KL penalty against the SFT model. The InstructGPT paper (Ouyang et al., 2022) defines this canonical pipeline.

How is RLHF different from DPO?

Direct Preference Optimization removes the explicit reward model and the RL loop entirely. It rewrites the KL-constrained RL objective so that the optimal policy IS its own implicit reward model, yielding a simple classification-style loss trained directly on preference pairs. DPO is far simpler and more stable to train but ties you to your fixed preference dataset; RLHF's separate reward model can be queried on fresh on-policy samples.

What is reward hacking in RLHF?

Reward hacking is when the policy finds outputs that score highly under the reward model but are bad by human standards — exploiting blind spots in an imperfect proxy. Common symptoms are excessive length, repeated flattery, and formulaic structure. Mitigations include the KL penalty, length-normalizing the reward, reward-model ensembles, and periodically retraining the reward model on fresh on-policy samples.

How much human labeling does RLHF actually require?

Less than people expect. InstructGPT was trained on roughly 13k SFT demonstrations and about 33k reward-model prompts (each ranked into C(K,2) pairs, so hundreds of thousands of comparison labels), and the resulting 1.3B-parameter model was preferred over the 175B GPT-3 base. Llama 2 used over 1M human preference comparisons. The reward model amortizes that cost — it answers the billions of reward queries PPO needs without any further human input.