Proximal Policy Optimization (PPO)

Take the biggest step you can — then clip it before it hurts

Proximal Policy Optimization (PPO) is a clipped policy-gradient reinforcement learning algorithm that takes the biggest safe step toward a better policy by clipping the probability ratio to [1−ε, 1+ε], making it stable enough to be the default engine behind RLHF.

  • Family: On-policy actor-critic
  • Clip range ε: 0.2 (typical)
  • Epochs per batch: 3–10
  • Optimizer: First-order (Adam)
  • Introduced: Schulman et al., 2017

The problem PPO solves: how big a step is too big?

Reinforcement learning by policy gradient is conceptually simple: roll out the current policy, see which actions earned more reward than expected, and push their probabilities up. The catch is the step size. Take too small a step and training crawls. Take too big a step and the policy can leap into a region where its own freshly-collected data is misleading — it overwrites a decent policy with a terrible one, the next rollout is garbage, and the run never recovers. Vanilla policy gradient is notoriously brittle for exactly this reason: there's no safety rail on how far one update can move the policy.

The 2015 answer was Trust Region Policy Optimization (TRPO), which forced each update to stay inside a "trust region" by constraining the KL divergence between the old and new policies. It worked, but it demanded a second-order optimization — conjugate gradients, Fisher-vector products, a backtracking line search — that's painful to implement and slow to run. In 2017 John Schulman and colleagues at OpenAI published Proximal Policy Optimization, which gets nearly the same stability with a trick you can write in a few lines and optimize with plain Adam.

The key object is the probability ratio r(θ) = π_θ(a|s) / π_θ_old(a|s) — how much more (or less) likely the new policy makes the action than the policy that collected the data. PPO multiplies that ratio by the action's advantage, but clips the ratio so that once it leaves the band [1−ε, 1+ε], pushing further yields no extra reward in the objective. The gradient flattens. The policy is gently told: you've moved far enough, stop.

The clipped surrogate objective

Let A_t be the estimated advantage of action a_t in state s_t — how much better that action was than the critic's baseline expectation. Define the ratio r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t). The PPO-Clip objective, maximized over θ, is:

L_CLIP(θ) = E_t[ min( r_t(θ) · A_t ,  clip(r_t(θ), 1−ε, 1+ε) · A_t ) ]

The two terms inside min are the unclipped surrogate and the clipped surrogate. Reading them by the sign of the advantage is the whole intuition:

  • Positive advantage (a good action). We want to raise its probability, so r_t climbs above 1. But the clipped term caps the objective at (1+ε)·A_t. Once r_t > 1+ε, the gradient is zero — no reward for pushing the action's probability up further. The step is bounded.
  • Negative advantage (a bad action). We want to lower its probability, so r_t falls below 1. Once r_t < 1−ε, the min selects the clipped term (1−ε)·A_t, which is constant in θ, so the gradient is zero and there is no reward for pushing the probability down further. The min matters on the other side of the band: because A_t is negative, the unclipped term r_t·A_t is the smaller (more pessimistic) one when r_t > 1+ε, so if a previous epoch already pushed the action's probability too high, PPO is still allowed to pull it back down. The clip never traps you in a mistake.

That asymmetry is the cleverest part. The clip is a one-sided pessimism: it removes the incentive to over-commit, but it never blocks the gradient from undoing an over-commitment. A naive "just clip the ratio" without the min would do both, and would get stuck.
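To make the four cases concrete, here is a minimal sketch in plain Python that evaluates the per-sample objective and its slope with respect to the ratio. The function name clipped_surrogate and the sample values are illustrative assumptions, not taken from any library.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO-Clip objective and its derivative with respect to the ratio."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    unclipped = ratio * advantage
    clipped = clipped_ratio * advantage
    if unclipped <= clipped:
        # min() selects the unclipped term, whose slope in the ratio is A_t.
        return unclipped, advantage
    # min() selects the clipped term, which is flat in the ratio out here.
    return clipped, 0.0


# (ratio, advantage) pairs covering both signs on both sides of the band.
cases = [(1.5, 1.0), (0.5, 1.0), (0.5, -1.0), (1.5, -1.0)]
for ratio, adv in cases:
    objective, slope = clipped_surrogate(ratio, adv)
    print(f"r={ratio:.1f} A={adv:+.0f} objective={objective:+.2f} slope={slope:+.1f}")
```

The slope is zero only in the two cases where the policy has already moved far enough in the direction the advantage favors (r = 1.5 with A > 0, r = 0.5 with A < 0). In the other two cases the slope is still A_t, so gradient ascent can undo the overshoot.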

In practice you optimize the full objective L = L_CLIP − c₁·L_VF + c₂·H, where L_VF is the value-function (critic) mean-squared error, H is an entropy bonus to keep exploring, and c₁ ≈ 0.5, c₂ ≈ 0.0–0.01. The advantages A_t come from Generalized Advantage Estimation (GAE).

Cost per update and where the time goes

PPO's per-iteration work is dominated by two passes. First, collection: run the policy for N parallel actors × T timesteps to gather a batch of N·T transitions — this is one forward pass of the policy network per environment step, so O(N·T · F) where F is the network's forward cost. Second, optimization: K epochs of minibatch SGD over those same N·T samples, costing O(K · N·T · B) where B is the forward+backward cost.

The headline number versus TRPO is the optimizer. TRPO's KL-constrained step needs a conjugate-gradient solve plus a line search every update — typically 10+ Fisher-vector products, each a full forward/backward pass, before a single parameter update lands. PPO replaces all of that with K ordinary Adam minibatch steps. There is no second-order machinery, no matrix to invert, no per-step line search. Memory is just the rollout buffer (N·T transitions) plus one set of network weights and Adam moments — no replay buffer of millions of transitions like off-policy methods keep.
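The bookkeeping above is easy to check with a few lines of arithmetic. This sketch counts the work in one PPO iteration; the function name and the example sizes (8 actors, 128 steps, 4 epochs, minibatches of 256) are illustrative assumptions rather than recommended settings.

```python
import math


def ppo_iteration_cost(n_actors, horizon, epochs, minibatch_size):
    """Count the work in one PPO iteration (collection plus optimization)."""
    rollout = n_actors * horizon                       # N*T transitions in the buffer
    adam_steps_per_epoch = math.ceil(rollout / minibatch_size)
    return {
        "rollout_transitions": rollout,                # also the buffer size in memory
        "collection_forward_evals": rollout,           # one policy evaluation per transition
        "adam_steps": epochs * adam_steps_per_epoch,   # parameter updates per iteration
        "optimization_sample_evals": epochs * rollout, # forward+backward evaluations
    }


cost = ppo_iteration_cost(n_actors=8, horizon=128, epochs=4, minibatch_size=256)
print(cost)
```

With these sizes the buffer holds 1,024 transitions and the optimizer takes 16 Adam steps per iteration, each one an ordinary first-order update.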

When to reach for PPO

  • You want a strong baseline that just works. PPO is the most common first thing to try in deep RL precisely because it tolerates rough hyperparameters and rarely diverges catastrophically.
  • Continuous control and robotics. MuJoCo locomotion, dexterous manipulation, and sim-to-real pipelines lean on PPO heavily — it handles continuous action spaces with a Gaussian policy out of the box.
  • Massively parallel simulators. When you can run thousands of environments at once (Isaac Gym, EnvPool), PPO's on-policy hunger for fresh data stops being a liability and its simple update shines.
  • RLHF and language-model alignment. The policy is the LM, the reward comes from a learned reward model, and a KL penalty to the reference model keeps generations sane. PPO drove InstructGPT and early ChatGPT.

Reach for something else when sample efficiency is the binding constraint. If each environment step is expensive (a real robot, a slow simulator), an off-policy method like SAC or TD3 that replays old data many times will reach the same performance with far fewer interactions. PPO throws its data away after a handful of epochs.

PPO vs other policy-optimization methods

| Property | PPO | TRPO | A2C / A3C | DDPG / TD3 | SAC | DPO |
|---|---|---|---|---|---|---|
| On / off policy | On-policy | On-policy | On-policy | Off-policy | Off-policy | Offline (preferences) |
| Step-size control | Ratio clip [1−ε, 1+ε] | Hard KL trust region | None (raw PG) | Target nets + slow τ | Entropy + target nets | Implicit via β·KL |
| Optimizer order | First-order (Adam) | Second-order (CG + line search) | First-order | First-order | First-order | First-order |
| Sample efficiency | Moderate | Moderate | Low | High | High | Very high (no rollouts) |
| Action space | Discrete + continuous | Discrete + continuous | Discrete + continuous | Continuous only | Continuous (+ discrete variant) | Sequence / token |
| Implementation difficulty | Low | High | Low | Medium | Medium | Low |
| Where it's used | RLHF, robotics, games | Research baselines | Lightweight baselines | Continuous control | Continuous control, robotics | LLM preference tuning |

The honest summary: PPO is the practical compromise. TRPO has stronger theoretical guarantees but is a nightmare to ship; A2C is simpler but unstable; the off-policy crowd is more sample-efficient but trickier to stabilize. PPO trades a little of each for an algorithm that's easy to implement and hard to break — which is exactly why it became the default.

What the numbers actually say

  • ε = 0.2 is the canonical clip. The original paper swept ε over {0.1, 0.2, 0.3}, and 0.2 scored best on its MuJoCo benchmark suite. It means an action's probability ratio can move about 20% away from 1 in either direction before its gradient is cut off: small enough to be safe, large enough to learn.
  • K = 3 to 10 epochs per batch. Reusing each rollout for ~3–10 passes of SGD is what gives PPO its modest sample-efficiency edge over single-pass A2C. Push K too high and most samples' ratios end up outside the clip range, where their gradient is zero, so the extra epochs do little useful work.
  • GAE λ = 0.95, γ = 0.99. These two near-1 values are the de-facto defaults across the field; λ trades advantage bias for variance, γ sets the effective horizon (≈100 steps at 0.99).
  • "37 implementation details." A widely-cited 2022 ICLR blog catalogued 37 separate code-level tricks — observation normalization, advantage normalization, value-loss clipping, orthogonal init, learning-rate annealing — that swing PPO's Atari/MuJoCo scores enormously. The algorithm is simple; the engineering is not, and reproducibility hinges on those details.
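The "≈100 steps" figure for γ = 0.99 is just 1 / (1 − γ), and the same arithmetic applies to GAE, which weights the TD error k steps ahead by (γλ)^k. A small sketch, with effective_horizon as a hypothetical helper name:

```python
def effective_horizon(discount):
    """Sum of the geometric weights 1 + d + d^2 + ..., i.e. 1 / (1 - d)."""
    return 1.0 / (1.0 - discount)


gamma, lam = 0.99, 0.95

# Rewards are discounted by gamma^k, so the return "sees" about this many steps.
reward_horizon = effective_horizon(gamma)

# GAE discounts TD errors by (gamma * lam)^k, a noticeably shorter window.
gae_horizon = effective_horizon(gamma * lam)

print(f"reward horizon ~ {reward_horizon:.1f} steps")
print(f"GAE horizon    ~ {gae_horizon:.1f} steps")
```

So with the default pair, the advantage estimate leans mostly on the next 17 or so TD errors, even though the underlying return looks roughly 100 steps ahead.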

JavaScript implementation

The clip and the advantage estimator are the parts worth seeing in code; the network plumbing is standard. Here's the PPO-Clip loss and a GAE pass in plain JavaScript (no tensor library, so you can read every step).

// Generalized Advantage Estimation over one trajectory.
// rewards[t], values[t] from the critic, dones[t] in {0,1}.
function gae(rewards, values, dones, gamma = 0.99, lambda = 0.95) {
  const n = rewards.length;
  const adv = new Array(n).fill(0);
  let lastGae = 0;
  // values has one extra bootstrap entry for the state after the last step.
  for (let t = n - 1; t >= 0; t--) {
    const mask = 1 - dones[t];                 // 0 if episode ended at t
    const delta = rewards[t] + gamma * values[t + 1] * mask - values[t];
    lastGae = delta + gamma * lambda * mask * lastGae;
    adv[t] = lastGae;
  }
  const returns = adv.map((a, t) => a + values[t]); // critic targets
  return { adv, returns };
}

// Clipped surrogate loss for a single sample.
//   ratio   = exp(logProbNew - logProbOld)   (= π_new / π_old)
//   advantage normalized across the minibatch beforehand.
function ppoClipLoss(ratio, advantage, eps = 0.2) {
  const unclipped = ratio * advantage;
  const clipped = clamp(ratio, 1 - eps, 1 + eps) * advantage;
  // We minimize the NEGATIVE of the objective, hence -min(...).
  return -Math.min(unclipped, clipped);
}

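// Helpers used in this section. mbSize = 64 is an illustrative default, not a tuned value.
const clamp = (x, lo, hi) => Math.max(lo, Math.min(hi, x));
const avg = xs => xs.reduce((s, x) => s + x, 0) / xs.length;

// Yields shuffled arrays of sample indices, mbSize at a time.
function* minibatches(batch, mbSize = 64) {
  const idx = batch.adv.map((_, i) => i);
  for (let i = idx.length - 1; i > 0; i--) {   // Fisher-Yates shuffle
    const j = Math.floor(Math.random() * (i + 1));
    [idx[i], idx[j]] = [idx[j], idx[i]];
  }
  for (let s = 0; s < idx.length; s += mbSize) yield idx.slice(s, s + mbSize);
}

// One PPO update: K epochs of minibatch SGD over a fixed rollout.
// policy.logProb and policy.step stand in for your network and optimizer;
// step is assumed to differentiate the loss with respect to the policy parameters.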
function ppoUpdate(batch, policy, eps = 0.2, epochs = 4) {
  // Normalize advantages once — a critical stabilizing trick.
  const mean = avg(batch.adv);
  const std = Math.sqrt(avg(batch.adv.map(a => (a - mean) ** 2))) + 1e-8;
  const advN = batch.adv.map(a => (a - mean) / std);

  for (let e = 0; e < epochs; e++) {
    for (const mb of minibatches(batch)) {
      let loss = 0;
      for (const i of mb) {
        const logpNew = policy.logProb(batch.states[i], batch.actions[i]);
        const ratio = Math.exp(logpNew - batch.logpOld[i]);
        loss += ppoClipLoss(ratio, advN[i], eps);
      }
      policy.step(loss / mb.length);          // Adam on the averaged loss
    }
  }
}

Two details carry most of the stability. First, logpOld is captured once when the batch is collected and frozen for all K epochs — that's what keeps the ratio meaningful across epochs. Second, advantages are normalized to zero mean and unit variance per batch before clipping; skip this and the loss scale wanders and training stalls.

Python implementation (PyTorch)

The same loss in idiomatic PyTorch, which is how PPO actually ships. Note that torch.min and torch.clamp are elementwise across the whole minibatch.

import torch
import torch.nn.functional as F

def gae(rewards, values, dones, gamma=0.99, lam=0.95):
    # values length = len(rewards) + 1 (bootstrap value for final state)
    adv = torch.zeros_like(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * mask - values[t]
        last = delta + gamma * lam * mask * last
        adv[t] = last
    returns = adv + values[:-1]
    return adv, returns

def ppo_loss(logp_new, logp_old, adv, value, value_target,
             eps=0.2, vf_coef=0.5, ent_coef=0.01, entropy=0.0):
    # Normalize advantages across the minibatch.
    adv = (adv - adv.mean()) / (adv.std() + 1e-8)

    ratio = torch.exp(logp_new - logp_old)            # π_new / π_old
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    policy_loss = -torch.min(unclipped, clipped).mean()  # PPO-Clip

    # Value loss (often itself clipped, omitted here for clarity).
    value_loss = F.mse_loss(value, value_target)

    # Maximize entropy -> subtract it from the loss we minimize.
    return policy_loss + vf_coef * value_loss - ent_coef * entropy

# Training loop sketch:
#   1. Collect a rollout of N*T steps with the CURRENT policy; store
#      states, actions, rewards, dones, and logp_old (frozen).
#   2. Compute adv, returns with gae().
#   3. For K epochs, shuffle into minibatches and step Adam on ppo_loss.
#   4. Discard the rollout. Repeat.

The RLHF variant adds one term: a per-token KL penalty β · KL(π_θ ‖ π_ref) against the frozen reference (pre-RL) model, folded into the reward. Without it the policy quickly learns to emit reward-model-hacking gibberish that scores high but reads as nonsense.
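One common way to fold that penalty into the reward is sketched below in plain Python. It rests on assumptions that vary between implementations: the log-ratio at each sampled token serves as a single-sample KL estimate, the reward model's scalar score is added on the final token only, and beta = 0.05 is an arbitrary illustrative value. The function name kl_shaped_rewards is hypothetical.

```python
def kl_shaped_rewards(logp_policy, logp_ref, score, beta=0.05):
    """Per-token rewards for one completion.

    logp_policy[t], logp_ref[t]: log-prob of the sampled token t under the
    current policy and the frozen reference model.
    score: scalar reward-model score for the whole completion.
    """
    # Penalize each token by beta times its log-ratio (a sample-based KL estimate).
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    # Assumption: the sequence-level score is credited to the last token.
    rewards[-1] += score
    return rewards


# The policy is more confident than the reference on token 0, equal on token 1.
print(kl_shaped_rewards([-1.0, -2.0], [-1.5, -2.0], score=1.0, beta=0.1))
```

These per-token rewards then feed GAE exactly like environment rewards would, so the rest of the PPO machinery is unchanged.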

Variants worth knowing

PPO-Penalty (adaptive KL). The paper's other variant drops the clip and instead adds −β · KL(π_old ‖ π_θ) to the objective, adapting β up or down to hit a target KL. It's closer in spirit to TRPO but generally underperforms PPO-Clip, which is why "PPO" almost always means the clipped version.
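The adaptation rule in the paper is a simple threshold test, sketched here in plain Python. The factor 1.5 and the halving/doubling follow the paper's description; the function name and the example numbers are illustrative.

```python
def adapt_beta(beta, kl, kl_target):
    """Adjust the KL penalty coefficient after a policy update.

    kl is the measured KL between the old and new policy on the batch.
    """
    if kl < kl_target / 1.5:
        return beta / 2.0   # policy barely moved: loosen the penalty
    if kl > kl_target * 1.5:
        return beta * 2.0   # policy moved too far: tighten the penalty
    return beta             # within tolerance: leave it alone


beta = 1.0
for measured_kl in [0.001, 0.05, 0.01]:
    beta = adapt_beta(beta, measured_kl, kl_target=0.01)
    print(f"kl={measured_kl:.3f} -> beta={beta}")
```

The updated β is used for the next policy update, so the penalty reacts one iteration late; that lag is part of why the clip tends to be the more forgiving option.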

Value-function clipping. Many implementations also clip the critic's update so the value estimate can't move more than ε from its old value, mirroring the policy clip. It's one of the "37 details" — sometimes helpful, sometimes neutral, and the subject of recurring debate.
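A per-sample sketch of the value clip, in plain Python. It assumes the form found in many public implementations: clip the new prediction to within ε of the old one, then take the larger of the clipped and unclipped squared errors. Scaling constants (such as a factor of 0.5) differ between codebases and are left out; the function name is hypothetical.

```python
def clipped_value_loss(value, value_old, target, eps=0.2):
    """Pessimistic squared error between the critic's prediction and its target."""
    # Keep the new prediction within eps of the prediction made at collection time.
    delta = max(-eps, min(eps, value - value_old))
    value_clipped = value_old + delta
    unclipped_loss = (value - target) ** 2
    clipped_loss = (value_clipped - target) ** 2
    # max() mirrors the policy side: once the prediction has moved more than
    # eps toward the target, the clipped (constant) term wins and the gradient is cut.
    return max(unclipped_loss, clipped_loss)


print(clipped_value_loss(value=2.0, value_old=1.0, target=3.0))
```

Note that ε here is in units of the value function, so its sensible size depends on reward scale, which is one reason the trick helps in some setups and not others.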

Phasic Policy Gradient (PPG). Separates policy and value training into distinct phases so the two heads stop fighting over shared features, improving sample efficiency on some benchmarks.

GRPO (Group Relative Policy Optimization). A 2024 variant popularized for LLM reasoning that drops the learned value critic entirely: it samples a group of completions per prompt, uses the group's mean reward as the baseline (usually also dividing by the group's reward standard deviation), and then applies a PPO-style clip. Cheaper than PPO for RLHF because there's no separate value network to train.
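The group baseline takes only a few lines. This sketch subtracts the group's mean reward and, as one common formulation does, also divides by the group's standard deviation; the use of the population standard deviation, the small eps, and the function name are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Advantages for a group of completions sampled from the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)   # population std; an assumption
    # Each completion is scored relative to its siblings, no critic needed.
    return [(r - mean) / (std + eps) for r in rewards]


# Four completions for one prompt, rewarded 1 if correct and 0 otherwise.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Every token in a completion typically shares that completion's advantage, which then goes through the same clipped-ratio objective as in PPO.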

DPO (Direct Preference Optimization). Not a PPO variant but its main rival for alignment. DPO reformulates preference tuning as a simple classification loss with no rollouts, no reward model, and no RL loop — much simpler, though PPO can still edge it out when an online reward signal is available.
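For comparison, the DPO loss for a single preference pair fits in one function. This is a minimal sketch in plain Python; the argument names are hypothetical, each log-prob is the summed log-probability of a whole completion, and beta = 0.1 is an illustrative value.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """-log(sigmoid(margin)) for one (chosen, rejected) completion pair."""
    # How much more the policy prefers the chosen completion than the reference does.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable softplus(-margin) = log(1 + exp(-margin)).
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Policy identical to the reference: margin is 0 and the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
```

There is no ratio clip, advantage, or rollout anywhere in it: the reference model's log-probs play the role that the KL leash plays in PPO-based RLHF.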

Common bugs and edge cases

  • Recomputing logp_old inside the epoch loop. The old log-probs must be frozen at collection time. If you recompute them from the current policy, the ratio is always 1, the clip never fires, and PPO silently degrades to vanilla policy gradient.
  • Forgetting to normalize advantages. Per-batch advantage normalization (zero mean, unit std) is nearly mandatory; without it the loss scale drifts with the reward magnitude and learning becomes erratic.
  • Wrong GAE bootstrap at episode boundaries. The mask = 1 − done term must zero out the bootstrap value when an episode ends, or advantages leak reward across episode boundaries. A truncated (time-limit) episode needs the bootstrap; a true terminal state does not — conflating them is a classic bug.
  • Too many epochs, too large ε. Crank K to 30 or ε to 0.5 and the policy walks far outside the trust region within a single batch; the clip can't save you because by then the whole batch is off-policy. Symptoms: KL spikes, entropy collapses, reward craters.
  • Sharing the optimizer/network between actor and critic carelessly. If the value loss dominates the combined loss, it can swamp the policy gradient. Tune vf_coef or split the networks.
  • Reward hacking in RLHF. Drop the KL-to-reference penalty and the model finds adversarial completions that fool the reward model — repeated tokens, flattery, formatting tricks. The KL leash is not optional.
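Several of these bugs show up in two cheap diagnostics that many PPO codebases log per update: an approximate KL between the old and new policy, and the fraction of samples whose ratio has left the clip band. A minimal sketch in plain Python, with hypothetical names:

```python
import math


def ppo_diagnostics(logp_new, logp_old, eps=0.2):
    """Return (approx_kl, clip_fraction) for one minibatch of log-probs."""
    log_ratios = [n - o for n, o in zip(logp_new, logp_old)]
    # Sample estimate of KL(old || new): mean of (logp_old - logp_new)
    # over actions that were drawn from the old policy.
    approx_kl = -sum(log_ratios) / len(log_ratios)
    ratios = [math.exp(lr) for lr in log_ratios]
    clip_fraction = sum(abs(r - 1.0) > eps for r in ratios) / len(ratios)
    return approx_kl, clip_fraction


logp_old = [-1.0, -2.0, -0.5, -1.5]
logp_new = [-0.7, -2.1, -0.5, -1.9]
print(ppo_diagnostics(logp_new, logp_old))
```

A clip fraction that stays at exactly zero through every epoch is consistent with the recomputed-logp_old bug, while a KL that jumps sharply within one batch points at too many epochs or too large an ε.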

Frequently asked questions

Why does PPO clip the probability ratio instead of using a KL constraint like TRPO?

TRPO enforces a hard trust region with a KL-divergence constraint, which requires a second-order conjugate-gradient solve and a line search every update — expensive and fiddly to implement. PPO replaces that machinery with a first-order trick: clip the importance-sampling ratio to [1−ε, 1+ε] so the surrogate objective flattens once the new policy drifts too far. You get most of TRPO's stability with plain Adam and a few lines of code.

What does the epsilon (ε) clip parameter actually control in PPO?

ε is the half-width of the band the probability ratio may move in. The new policy's probability for an action can rise to (1+ε) times, or fall to (1−ε) times, the old policy's probability before that sample's gradient is zeroed out. The canonical value is 0.2, meaning each action's probability can move by roughly ±20% per batch of updates. Smaller ε is more conservative and stable; larger ε learns faster but risks collapse.

Why does PPO take the minimum of the clipped and unclipped surrogate objectives?

The min() makes the clip a pessimistic lower bound on improvement. When the advantage is positive, min() caps the upside so the policy can't over-commit to a good action. When the advantage is negative, min() removes the ceiling on the penalty, so a bad action that grew too likely is still pushed down hard even outside the clip range. Without the min, the clip would also block undoing harmful over-shoots.

Is PPO on-policy or off-policy, and why does that matter?

PPO is on-policy: it can only reuse data collected by a policy close to the current one. The importance ratio lets it squeeze a handful of gradient epochs (typically 3–10) out of each rollout batch, but once the policy moves far enough that the ratio leaves the clip range, the data is stale and must be discarded. That makes PPO less sample-efficient than off-policy methods like SAC, but far more stable and simpler to tune.

How is PPO used in RLHF to fine-tune language models?

In RLHF the language model is the policy: a state is the prompt-plus-tokens-so-far and the action is the next token. A separate reward model scores completions, and PPO nudges the LM to produce higher-scoring text while a per-token KL penalty to the original model keeps it from drifting into gibberish that games the reward. PPO was the algorithm behind InstructGPT and the first ChatGPT, though direct-preference methods like DPO now compete with it.

What is GAE and why is PPO almost always paired with it?

GAE — Generalized Advantage Estimation — computes the advantage A(s,a) as an exponentially-weighted average of multi-step TD errors, controlled by a parameter λ (typically 0.95). It trades bias against variance: λ=0 is low-variance one-step TD, λ=1 is high-variance Monte-Carlo. PPO clips the policy step, but the quality of the advantage estimate it clips against still drives learning, so GAE's variance reduction is what makes PPO converge smoothly.
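The two extremes are easy to see numerically. This is a minimal sketch in plain Python on a single trajectory with no episode boundary inside it, so the done mask is omitted; the rewards and values are made-up numbers.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """GAE over one unbroken trajectory; values has one extra bootstrap entry."""
    adv = [0.0] * len(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # one-step TD error
        last = delta + gamma * lam * last
        adv[t] = last
    return adv


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3, 0.2]   # last entry bootstraps the state after step 2

# lam = 0 keeps only the one-step TD error at each timestep.
print(gae(rewards, values, gamma=0.9, lam=0.0))
# lam = 1 sums all future TD errors: discounted return (plus bootstrap) minus V(s_t).
print(gae(rewards, values, gamma=0.9, lam=1.0))
```

Values of λ in between blend these two estimates, which is the bias-variance dial described above.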