Actor-Critic Methods

Two networks, one team: a doer and a judge that trains it

Actor-critic methods pair a policy network (the actor) that picks actions with a value network (the critic) that scores them, using the critic's advantage estimate to cut the variance of policy-gradient updates.

  • Actor: policy π(a|s; θ)
  • Critic: value V(s; w)
  • Learning signal: advantage A = Q − V
  • TD error: δ = r + γV(s′) − V(s)
  • Update cost: O(params) per step

How actor-critic methods work

Reinforcement learning has two broad camps. Value-based methods (like Q-learning) learn how good states or actions are, then act greedily on those numbers. Policy-based methods skip the value table entirely and directly tune a parameterized policy by gradient ascent on expected reward. Actor-critic is the hybrid: it keeps both, and lets each fix the other's weakness.

The actor is the policy π(a | s; θ) — a neural network that takes a state and outputs a distribution over actions. This is the thing you actually deploy. The critic is a value function V(s; w) (or a state-action value Q(s,a; w)) — a second network that never chooses anything. Its only job is to estimate how much reward the actor can expect from a state, so it can grade the actor's decisions.

The loop, per step: the actor samples an action, the environment returns a reward r and a next state s′, and the critic computes the temporal-difference error

δ = r + γ·V(s′) − V(s)

That single number does double duty. It trains the critic (drive δ → 0 by regressing V(s) toward r + γV(s′)), and it trains the actor: δ is a one-step sample of the advantage (unbiased if V were the exact value function of the current policy, biased by the critic's error otherwise), so push the chosen action's log-probability up when δ > 0 and down when δ < 0. A positive δ means "that action turned out better than I expected here", so reinforce it.
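
As a small worked example of the TD error, the sketch below computes δ for a single transition. The function name `td_error` and all the numbers are made up for illustration; only the formula comes from the text above.

```python
def td_error(reward, v_s, v_next, gamma=0.99, done=False):
    """One-step TD error: delta = r + gamma * V(s') - V(s).

    The V(s') term is dropped when the transition ends the episode.
    """
    bootstrap = 0.0 if done else gamma * v_next
    return reward + bootstrap - v_s


# Hypothetical numbers: the critic predicted V(s) = 2.0, the step paid
# r = 1.0, and the critic values the next state at V(s') = 1.5.
delta = td_error(reward=1.0, v_s=2.0, v_next=1.5, gamma=0.9)

# 1.0 + 0.9 * 1.5 - 2.0 is positive: the action did better than predicted,
# so the critic should raise V(s) and the actor should raise log pi(a|s).
print(round(delta, 3))
```

With these numbers δ is positive, so both updates move in the "better than expected" direction; flipping the reward to 0.0 would make δ negative and reverse both.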

Why the critic cuts variance

The classic policy gradient (the REINFORCE estimator, Williams 1992) is

∇θ J(θ) = E[ ∇θ log π(a|s; θ) · G_t ]

where G_t is the full discounted return from time t. The estimator is unbiased, but G_t is a sum of many noisy future rewards, so its variance grows with the horizon (roughly linearly in the episode length when rewards are undiscounted and close to independent). High variance means the gradient points in nearly random directions from one episode to the next, so you need huge batches or tiny learning rates to make progress.

The fix exploits a key identity: subtracting any function b(s) that does not depend on the action leaves the gradient unbiased. The reason is that the probabilities sum to one, so E_a[∇θ log π(a|s)] = Σ_a ∇θ π(a|s) = ∇θ 1 = 0, and therefore E[∇θ log π(a|s) · b(s)] = 0. The variance-minimizing choice of baseline is close to the value function V(s). Plugging it in turns the return into the advantage:

∇θ J(θ) = E[ ∇θ log π(a|s; θ) · A(s,a) ],   A(s,a) = Q(s,a) − V(s)

Now the actor is rewarded for choosing actions that beat the average action in that state, not for the raw return. Centering the signal around zero is what kills the variance. The critic's whole reason to exist is to supply this baseline cheaply and online, without waiting for the episode to end.
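
The baseline argument can be checked exactly on a toy problem. The sketch below assumes a single state with three actions, a softmax policy, and fixed action values (all numbers hypothetical), and enumerates every action instead of sampling, so the mean and variance of the gradient estimator are exact rather than estimated.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_pi(probs, action):
    # d/d(logit_i) of log softmax(action) = 1[i == action] - probs[i]
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def estimator_stats(probs, q_values, baseline):
    """Exact mean and total variance of grad log pi(a) * (Q(a) - b), a ~ pi."""
    n = len(probs)
    samples = [[g * (q_values[a] - baseline) for g in grad_log_pi(probs, a)]
               for a in range(n)]
    mean = [sum(probs[a] * samples[a][i] for a in range(n)) for i in range(n)]
    var = sum(probs[a] * (samples[a][i] - mean[i]) ** 2
              for a in range(n) for i in range(n))
    return mean, var


probs = softmax([0.2, -0.1, 0.5])                 # hypothetical policy logits
q_values = [10.0, 11.0, 12.0]                     # hypothetical action values
v = sum(p * q for p, q in zip(probs, q_values))   # V(s) = E_pi[Q(s, a)]

mean_raw, var_raw = estimator_stats(probs, q_values, baseline=0.0)
mean_adv, var_adv = estimator_stats(probs, q_values, baseline=v)

print("same mean:", all(abs(x - y) < 1e-9 for x, y in zip(mean_raw, mean_adv)))
print("variance without baseline:", round(var_raw, 3))
print("variance with V(s) baseline:", round(var_adv, 3))
```

The two estimators share the same expected gradient, but the one centered on V(s) has much smaller variance here because the action values sit far from zero while differing little from each other.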

When to use actor-critic

  • Continuous action spaces. Value-based methods need a max over actions, which is intractable when actions are real-valued (a robot's joint torques). The actor outputs the action directly, so DDPG/TD3/SAC dominate continuous control.
  • Stochastic policies you can sample online. If you want a probabilistic policy and learning signal at every step (not just episode end), actor-critic gives you both.
  • Long episodes where REINFORCE's variance explodes. Bootstrapping with the critic lets you update mid-episode instead of waiting for a return.
  • Large or continuous state spaces where tabular value methods can't fit — both actor and critic are function approximators.

Reach for a value-based method instead when actions are few and discrete and sample efficiency matters most — DQN-style replay buffers reuse data more aggressively than on-policy actor-critics. Reach for plain REINFORCE only for tiny problems where the simplicity outweighs the variance.

Actor-critic vs other RL families

Property           | Actor-critic (A2C) | REINFORCE          | Q-learning / DQN | PPO               | DDPG / SAC
Learns a policy?   | Yes (actor)        | Yes                | No (greedy on Q) | Yes               | Yes
Learns a value fn? | Yes (critic)       | No                 | Yes (Q)          | Yes (critic)      | Yes (Q-critic)
Gradient variance  | Low (advantage)    | High (full return) | n/a              | Low (clipped)     | Low (deterministic)
On / off policy    | On-policy          | On-policy          | Off-policy       | On-policy         | Off-policy
Sample efficiency  | Medium             | Low                | High (replay)    | Medium            | High (replay)
Continuous actions | Yes                | Yes                | No (needs max)   | Yes               | Yes (specialty)
Update timing      | Every step (TD)    | End of episode     | Every step       | Per rollout batch | Every step
Bias               | Some (bootstrap)   | None               | Some (bootstrap) | Tunable (GAE)     | Some (bootstrap)

The headline trade-off is bias for variance. REINFORCE is unbiased but noisy; the critic's bootstrap injects a little bias in exchange for a dramatic drop in variance, which usually wins on wall-clock convergence. PPO, DDPG, TD3, and SAC are all actor-critics — the family is the backbone of modern deep RL, not a niche.

What the numbers actually say

  • Variance of REINFORCE grows with episode length. For an undiscounted episode of length T with independent per-step rewards of variance σ², the return variance is T·σ². A 1,000-step episode therefore carries ~1,000× the variance of a single-step bandit, which is why long-horizon tasks stall without a baseline. Discounting caps the growth at about σ²/(1 − γ²), roughly 50·σ² for γ = 0.99.
  • A3C trained Atari in ~1 day on 16 CPU cores (Mnih et al., 2016), beating the original DQN's ~8 GPU-days on many games — actor-critic plus asynchrony, no replay buffer or GPU required.
  • A2C ≈ A3C in sample efficiency. OpenAI's 2017 study found the synchronous A2C matched A3C while using the GPU better through batched forward passes, so most codebases ship A2C.
  • GAE's λ knob is one float. Setting λ = 0 gives pure one-step TD (low variance, high bias); λ = 1 gives the full Monte Carlo advantage (high variance, zero bias). Schulman et al. (2016) report λ ≈ 0.95 and γ ≈ 0.99 as robust defaults across MuJoCo locomotion.
  • Cost per update is O(|θ| + |w|) — one forward and backward pass through each network — independent of how long the agent has been training. No replay, no max-over-actions search.
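
The variance arithmetic in the first bullet can be sketched in a few lines. This assumes independent per-step rewards with equal variance, which real environments rarely satisfy, so treat it as an order-of-magnitude guide. The function name `return_variance` is made up for this example.

```python
def return_variance(T, sigma2=1.0, gamma=1.0):
    """Variance of G = sum_k gamma^k * r_k over T independent rewards,
    each with variance sigma2: sigma2 * sum_k gamma^(2k)."""
    return sigma2 * sum(gamma ** (2 * k) for k in range(T))


for T in (1, 10, 100, 1000):
    undiscounted = return_variance(T)
    discounted = return_variance(T, gamma=0.99)
    print(T, undiscounted, round(discounted, 2))
```

Without discounting the variance is exactly T·σ². With γ < 1 the geometric series saturates near σ²/(1 − γ²), so the discount factor bounds how much noise distant rewards can add.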

JavaScript implementation

A minimal one-step advantage actor-critic. The networks are sketched as plain objects; the load-bearing part is the update rule, which is identical across every framework.

// One-step actor-critic (A2C-style) update for a single transition.
// actor.forward(s)  -> array of action logits
// critic.forward(s) -> scalar V(s)
// Both expose .backprop(grad, s): grad is the gradient of a loss with respect
// to the network's output at s, pre-scaled by the learning rate, and the
// network takes a descent step on its parameters.

const gamma = 0.99;
const lrActor = 1e-3;
const lrCritic = 5e-3;

// Numerically stable softmax: subtract the max logit before exponentiating.
function softmax(logits) {
  const m = Math.max(...logits);
  const exps = logits.map(z => Math.exp(z - m));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / sum);
}

// Inverse-CDF sampling; the clamp guards against floating-point round-off.
function sample(probs) {
  let r = Math.random(), i = 0;
  while ((r -= probs[i]) > 0) i++;
  return Math.min(i, probs.length - 1);
}

function step(state, env, actor, critic) {
  const probs = softmax(actor.forward(state));
  const action = sample(probs);

  const { reward, nextState, done } = env.step(action);

  // Critic: bootstrap target and TD error (the advantage estimate).
  const v  = critic.forward(state);
  const vNext = done ? 0 : critic.forward(nextState);
  const target = reward + gamma * vNext;
  const delta = target - v;            // δ = advantage estimate

  // Critic update: minimize ½δ²  ->  gradient with respect to V(s) is -δ.
  // The state is passed explicitly because forward(nextState) ran last.
  critic.backprop(-delta * lrCritic, state);

  // Actor update: minimize -log π(a|s) · δ, i.e. ascend log π(a|s) · δ.
  // Gradient of -log π(a|s) with respect to the logits: (probs - oneHot(a)).
  const grad = probs.map((p, i) => (p - (i === action ? 1 : 0)) * delta * lrActor);
  actor.backprop(grad, state);

  return { nextState, done, delta };
}

Two things to notice. First, delta is computed once and reused for both networks; that shared TD error is the core of the method. Second, vNext is forced to zero on terminal states. Forgetting this is a common actor-critic bug, because it makes the critic bootstrap from a state that does not exist and so predict reward beyond the end of the episode.

Python implementation

The same algorithm in PyTorch, with a shared-body two-head network — the standard A2C architecture where the actor and critic share early features.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.body   = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.actor  = nn.Linear(hidden, n_actions)   # policy logits
        self.critic = nn.Linear(hidden, 1)            # V(s)

    def forward(self, s):
        h = self.body(s)
        return self.actor(h), self.critic(h).squeeze(-1)

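# s, s_next: float tensors [batch, obs_dim]; a: long tensor [batch];
# r, done: float tensors [batch], with done = 1.0 where the episode ended.
def update(model, opt, s, a, r, s_next, done, gamma=0.99, c_v=0.5, c_ent=0.01):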
    logits, v = model(s)
    with torch.no_grad():
        _, v_next = model(s_next)
        target = r + gamma * v_next * (1.0 - done)   # zero past terminal
    advantage = target - v                            # δ = A(s,a) estimate

    dist = Categorical(logits=logits)
    log_prob = dist.log_prob(a)

    # Actor: ascend  log π(a|s) · A   ->  loss = -(log π · A.detach())
    actor_loss  = -(log_prob * advantage.detach()).mean()
    # Critic: regress V toward the bootstrap target.
    critic_loss = F.mse_loss(v, target)
    # Entropy bonus keeps the policy exploring.
    entropy     = dist.entropy().mean()

    loss = actor_loss + c_v * critic_loss - c_ent * entropy
    opt.zero_grad(); loss.backward()
    nn.utils.clip_grad_norm_(model.parameters(), 0.5)
    opt.step()
    return advantage.mean().item()

Note advantage.detach() in the actor loss: gradients must not flow from the actor's objective back into the critic, or the critic would be trained to make the actor's loss small rather than to predict value accurately. The torch.no_grad() around the target serves the same purpose — the bootstrap target is treated as a fixed label, not a differentiable quantity.

Variants worth knowing

A3C: Asynchronous Advantage Actor-Critic (Mnih et al., 2016). Many worker threads each interact with their own environment copy, compute gradients, and update a shared model asynchronously. The decorrelation between workers replaces the experience-replay buffer, and because the workers are ordinary CPU threads, A3C was designed to run on a multi-core CPU without a GPU.

A2C — the synchronous sibling. Workers step in lockstep, gradients are averaged into one batched update. Simpler and more GPU-friendly; OpenAI showed it matches A3C, so it became the default.

GAE — Generalized Advantage Estimation (Schulman et al., 2016). Instead of a one-step TD error, take an exponentially-weighted average of n-step advantages controlled by λ. One knob slides smoothly from low-variance/high-bias (λ=0) to high-variance/zero-bias (λ=1).
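
A minimal sketch of the GAE recursion, assuming a single trajectory segment with no episode boundary inside it. The function name, argument names, and the three-step example data are made up for illustration.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory segment.

    rewards[t] and values[t] = V(s_t) cover t = 0..T-1. last_value is V(s_T)
    and should be 0.0 if the segment ended in a terminal state.
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]   # one-step TD error
        running = delta + gamma * lam * running   # (gamma*lam)-discounted sum of deltas
        advantages[t] = running
        next_value = values[t]
    return advantages


# Hypothetical three-step segment that ends in a terminal state.
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3]

one_step = gae(rewards, values, last_value=0.0, gamma=0.9, lam=0.0)
monte_carlo = gae(rewards, values, last_value=0.0, gamma=0.9, lam=1.0)
print(one_step)      # each entry is that step's TD error
print(monte_carlo)   # each entry is the discounted return minus V(s_t)
```

The two extremes fall out of the same loop: with λ = 0 the running sum is reset to the current δ at every step, and with λ = 1 the δ terms telescope into the Monte Carlo return minus the critic's estimate.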

PPO — Proximal Policy Optimization (Schulman et al., 2017). An actor-critic that clips the policy ratio so a single batch can't move the policy too far. Robust, hard to destabilize, and the default for most applied RL and for RLHF on large language models.
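
The clipping rule can be sketched for a single sample as follows. The function name is hypothetical, and clip_eps = 0.2 is a commonly used setting rather than something this article specifies.

```python
import math


def ppo_clip_objective(logp_new, logp_old, advantage, clip_eps=0.2):
    """Clipped surrogate for one sample: min(r * A, clip(r, 1-eps, 1+eps) * A),
    where r = pi_new(a|s) / pi_old(a|s)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped * advantage)


# The new policy makes the action 1.5x as likely as the old one did.
logp_old = 0.0
logp_new = math.log(1.5)

# Good action (A > 0): the gain is capped at (1 + eps) * A, so there is no
# incentive to push the ratio beyond 1.2.
print(ppo_clip_objective(logp_new, logp_old, advantage=1.0))

# Bad action (A < 0) that became more likely: the unclipped, more
# pessimistic term is kept, so the full penalty still applies.
print(ppo_clip_objective(logp_new, logp_old, advantage=-1.0))
```

Taking the minimum of the clipped and unclipped terms makes the objective a pessimistic bound: moves that would improve the objective stop paying off outside the clip range, while moves that hurt it are never hidden.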

DDPG / TD3 / SAC. Off-policy actor-critics for continuous control. The actor outputs a deterministic action (DDPG, TD3) or a squashed-Gaussian sample (SAC), and the Q-critic supplies the gradient. SAC adds an entropy term to the objective, which keeps the policy exploring and makes it one of the more sample-efficient methods for continuous control.
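
To make "squashed-Gaussian sample" concrete, here is a one-dimensional sketch: draw u from a Gaussian, squash it with tanh so the action lands in [−1, 1], and correct the log-probability for the change of variables. The function name and the 1e-6 stabilizer are assumptions for this example, not part of any particular library.

```python
import math
import random


def sample_squashed_gaussian(mean, log_std, rng=random):
    """Sample a = tanh(u) with u ~ N(mean, std^2); return (a, log pi(a))."""
    std = math.exp(log_std)
    u = rng.gauss(mean, std)
    a = math.tanh(u)
    # Log-density of the Gaussian sample before squashing.
    logp_u = (-0.5 * ((u - mean) / std) ** 2
              - log_std - 0.5 * math.log(2.0 * math.pi))
    # Change of variables: da/du = 1 - tanh(u)^2, so subtract its log.
    # The small constant avoids log(0) when tanh saturates.
    logp_a = logp_u - math.log(1.0 - a * a + 1e-6)
    return a, logp_a


rng = random.Random(0)   # seeded only so the sketch is repeatable
action, log_prob = sample_squashed_gaussian(mean=0.0, log_std=-0.5, rng=rng)
print(action, log_prob)
```

The squashing keeps actions inside a bounded range such as normalized joint torques, and the corrected log-probability is what an entropy term would be computed from.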

Common bugs and edge cases

  • Not zeroing V(s′) at terminal states. The bootstrap target r + γV(s′) must drop the γV(s′) term when the episode ends, or the critic invents reward past the finish line and both networks diverge.
  • Letting actor gradients flow into the critic. The advantage must be detached in the actor loss. Without it, the critic optimizes the wrong objective and value estimates collapse.
  • Critic learning rate too low relative to the actor. If the actor outruns a stale critic, the advantage signal is garbage and the policy chases noise. A common fix is a larger critic learning rate (or more critic updates per actor update).
  • No entropy bonus. The policy collapses to a near-deterministic action early and stops exploring. A small entropy regularizer keeps the distribution spread out.
  • Reusing on-policy data. Vanilla A2C/A3C are on-policy — once the actor changes, old transitions are off-distribution and must be discarded. Replaying them (as DQN does) silently biases the gradient.
  • Unnormalized advantages. Raw advantages can have wildly varying scale across batches; normalizing them to zero mean and unit variance per batch stabilizes the actor's step size, a trick PPO codebases use by default.
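
The last item in the list, per-batch advantage normalization, can be sketched with the standard library alone. The function name and the eps value are assumptions; eps only guards against division by zero when every advantage in the batch is identical.

```python
import statistics


def normalize_advantages(advantages, eps=1e-8):
    """Rescale a batch of advantages to zero mean and unit variance."""
    mean = statistics.fmean(advantages)
    std = statistics.pstdev(advantages)
    return [(a - mean) / (std + eps) for a in advantages]


# Hypothetical batch with one outlier that would otherwise dominate the step.
batch = [10.0, 12.0, 14.0, 40.0]
normalized = normalize_advantages(batch)
print([round(x, 3) for x in normalized])
```

Normalization changes only the scale and offset of the signal within a batch, so the ordering of good and bad actions is preserved while the effective actor step size stays comparable from batch to batch.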

Frequently asked questions

Why do actor-critic methods reduce variance compared to REINFORCE?

REINFORCE scales each gradient by the full Monte Carlo return, which swings wildly across episodes. Actor-critic replaces that return with the advantage A = Q − V, where the critic's value V acts as a baseline. Subtracting a state-dependent baseline leaves the gradient unbiased but shrinks its variance, so updates are smoother and the actor converges with far fewer samples.

What is the difference between the actor and the critic?

The actor is the policy π(a|s; θ) — it maps a state to a distribution over actions and is what you actually deploy. The critic is a value function V(s; w) or Q(s,a; w) — it never picks actions, it only scores how good the actor's choices were, producing the learning signal that trains the actor.

What is the advantage function in actor-critic?

The advantage A(s,a) = Q(s,a) − V(s) measures how much better action a is than the policy's average action in state s. A one-step estimate uses the TD error δ = r + γV(s′) − V(s), which is an unbiased sample of the advantage when V is the true value function of the current policy and a biased one when V is a learned critic. Positive advantage pushes the action's probability up; negative pushes it down.

How is A2C different from A3C?

A3C (Asynchronous Advantage Actor-Critic, 2016) runs many worker threads that each compute gradients and update a shared model asynchronously. A2C is the synchronous variant: all workers step in lockstep, gradients are averaged, then one update is applied. A2C is simpler, uses the GPU better via batching, and OpenAI found it matches A3C's sample efficiency.

Is the critic's bias a problem in actor-critic methods?

Yes — bootstrapping with r + γV(s′) introduces bias because the critic is only an estimate, not the true value. This is the bias-variance trade-off: a one-step TD target has low variance but high bias, while a full Monte Carlo return has zero bias but high variance. Generalized Advantage Estimation (GAE) interpolates between them with a single λ knob.

Are PPO and DDPG actor-critic methods?

Yes. PPO is an on-policy actor-critic that clips the policy update to stay near the old policy. DDPG, TD3, and SAC are off-policy actor-critics for continuous actions where the actor outputs a deterministic or squashed-Gaussian action and the critic is a Q-network. The actor-critic skeleton — a policy trained by a learned value signal — underpins nearly all modern deep RL.