Machine Learning

Q-Learning

Q: Why is Q-learning called off-policy?

Because the update uses max over next-state actions — the value of the greedy action — regardless of which action the agent actually took. The agent can explore randomly with ε-greedy while still learning the value of the optimal (greedy) policy. The behaviour policy that collects data differs from the target policy being evaluated, which is the definition of off-policy.

Q: What's the difference between Q-learning and SARSA?

Q-learning bootstraps from max Q(s′, a′) — the best possible next action — so it learns the optimal policy directly (off-policy). SARSA bootstraps from Q(s′, a′) where a′ is the action actually taken next, so it learns the value of the policy it follows, exploration included (on-policy). On a cliff-walking task SARSA learns a safer path that stays away from the edge; Q-learning learns the shorter, riskier optimal path and falls off more during training.

Q: Does Q-learning always converge to the optimal policy?

In the tabular case, yes — with probability 1, provided every state-action pair is visited infinitely often and the learning rate satisfies the Robbins-Monro conditions (the sum of learning rates diverges while the sum of their squares converges). With function approximation, like a neural network, those guarantees vanish; the 'deadly triad' of bootstrapping, off-policy updates, and function approximation can cause divergence.

Q: What does the discount factor γ do in Q-learning?

γ in [0, 1) weights how much future rewards matter relative to immediate ones. γ = 0 makes the agent myopic — it only cares about the next reward. γ near 1 makes it far-sighted but slows convergence and amplifies estimation error, because the effective horizon is roughly 1/(1−γ) steps: γ = 0.99 plans about 100 steps ahead. Episodic tasks can use γ = 1; continuing tasks need γ < 1 to keep returns finite.

Q: Why does Q-learning overestimate action values?

The max operator in the target picks the largest noisy estimate, so positive estimation errors get selected and propagated — a systematic upward bias known as maximization bias. Double Q-learning fixes it by keeping two value tables: one chooses the maximizing action, the other evaluates it, decoupling selection from evaluation so the noise cancels instead of compounding.

Q: Can Q-learning handle continuous or huge state spaces?

Not in tabular form — a table needs one entry per state-action pair, which is impossible for continuous states or anything like a chess board's 10^40 positions. The fix is function approximation: replace the table with a parameterized function Q(s, a; θ). A Deep Q-Network uses a neural net for θ, which is how Q-learning scaled to play 49 Atari games from raw pixels in 2015.

Learn the value of every action — without ever modeling the world

Q-learning is a model-free, off-policy reinforcement learning algorithm that learns the value of each action by bootstrapping its own estimates toward observed rewards, provably converging to the optimal policy without ever knowing the environment's dynamics.

TypeModel-free, off-policy TD
LearnsOptimal action-value Q*(s, a)
Update costO(1) per step (tabular)
MemoryO(|S| · |A|) table
Convergencew.p. 1 under Robbins-Monro

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How Q-learning works

Suppose you drop an agent into a maze it has never seen. It can't read the floor plan, doesn't know where the exit is, and only finds out it reached the goal because a reward suddenly appears. Q-learning is how that agent figures out the best move in every square — purely by trial, error, and arithmetic.

The core object is the action-value function Q(s, a): the expected total discounted reward from taking action a in state s and then acting optimally forever after. If you knew the true Q*, the optimal policy would be trivial — in every state, pick the action with the highest Q-value. Q-learning's whole job is to estimate Q* from experience.

It does this with a single update applied after every step. The agent is in state s, takes action a, lands in s′, and collects reward r. Then:

Q(s, a) ← Q(s, a) + α · [ r + γ · max_a′ Q(s′, a′) − Q(s, a) ]
                          └──────────── TD target ───────────┘
          └──────────────── TD error δ ──────────────────────┘

Read that bracket left to right. r + γ · max Q(s′, a′) is a fresh, slightly-better estimate of what Q(s, a) should be: one real reward plus the discounted value of the best thing you could do next. Subtract the old estimate and you get the temporal-difference error δ — how surprised the agent was. Nudge the old value toward the new one by a fraction α (the learning rate), and repeat.

The deep idea is bootstrapping: the agent improves its guess of Q(s, a) using another of its own guesses, Q(s′, a′), rather than waiting for the episode to finish. This is what makes Q-learning a temporal-difference method, sitting between Monte Carlo (wait for the final return) and dynamic programming (need a full model). It learns from a single transition, online, with no model of how the world transitions or pays out.

Why "off-policy" is the magic word

The max in the update is doing something subtle and powerful. The agent might have wandered into s′ by taking a random exploratory action — but when it bootstraps, it assumes it will act greedily from s′ onward. So the data is collected by one policy (the exploratory behaviour policy) while the values being learned belong to a different policy (the greedy target policy). That decoupling is what "off-policy" means.

This buys you two things. First, the agent can explore aggressively — even act randomly — and still converge to the value of the optimal policy. Second, you can learn from old experience generated by any behaviour, which is exactly why off-policy methods power experience replay in deep RL: store millions of past transitions and replay them to train, no matter what policy generated them.

Exploration is usually handled with ε-greedy: with probability ε pick a random action, otherwise pick argmax_a Q(s, a). Start with ε near 1 (almost all exploration) and decay it toward ~0.05 as the table sharpens. The convergence proof requires that every state-action pair keeps getting visited, which ε-greedy with ε > 0 guarantees.

When to use Q-learning

Discrete, modest state spaces where a table of |S| · |A| entries fits in memory — gridworlds, board-game endgames, inventory control, simple robotics with discretized states.
No model of the environment — you can sample transitions (a simulator or the real world) but can't write down the transition probabilities. Q-learning never needs them.
You want the optimal policy, not a safe one. Off-policy learning targets the best achievable behaviour even while exploring recklessly.
Reusable experience. Off-policy means you can train on logged data or replay buffers, decoupling data collection from learning.

Reach for something else when the state space is continuous or astronomically large (use a neural network approximator — a Deep Q-Network), when you need on-policy safety guarantees during training (use SARSA), or when the action space is continuous (use policy-gradient or actor-critic methods, which Q-learning's argmax over actions can't cleanly handle).

Q-learning vs other RL algorithms

	Q-learning	SARSA	Expected SARSA	Monte Carlo control	DQN
Policy type	Off-policy	On-policy	Either	On-policy	Off-policy
Bootstraps?	Yes (max)	Yes (sampled a′)	Yes (expectation)	No (full return)	Yes (max)
Target	r + γ·max Q(s′,·)	r + γ·Q(s′,a′)	r + γ·𝔼[Q(s′,·)]	actual return Gₜ	r + γ·max Q̄(s′,·)
Learns optimal π?	Directly	Policy it follows	Target policy	Policy it follows	Directly
Update timing	Per step (online)	Per step (online)	Per step (online)	End of episode	Per minibatch
Variance vs bias	Low var, biased	Low var, biased	Lowest var (no a′ sampling)	High var, unbiased	Low var, biased
State space	Tabular / small	Tabular / small	Tabular / small	Tabular / small	High-dim (pixels)
Maximization bias	Yes	No	Reduced	No	Yes (Double DQN fixes)

The clean way to remember the family: Q-learning and SARSA differ only in one symbol of the target. Q-learning takes the max (optimistic, off-policy); SARSA plugs in the action it actually took next (realistic, on-policy); Expected SARSA averages over the policy's action distribution, killing the variance from sampling a single a′.

What the numbers actually say

Tabular memory is exact and brutal. A table costs one float per state-action pair. A 100×100 gridworld with 4 actions is 40,000 floats — 320 KB, trivial. But backgammon's ~10^20 positions or chess's ~10^40 are flatly impossible to tabulate; that's the wall function approximation exists to break.
Per-step cost is O(|A|). Each update needs one max over the |A| actions in the next state. With 4 actions that's 4 comparisons — a few nanoseconds. The expense is the number of episodes, not the per-step math.
Sample complexity, not compute, is the bottleneck. A 4×4 FrozenLake converges in a few thousand episodes; a noisy 10×10 maze can need 10⁵–10⁶ steps. The 2015 DQN that learned Atari trained on 50 million frames (about 10 days of game time) per game.
The effective horizon is ~1/(1−γ). γ = 0.9 sees ~10 steps ahead; γ = 0.99 sees ~100. Push γ too close to 1 and the bootstrapped error compounds over that horizon, slowing or destabilizing learning.
Maximization bias is measurable. On Sutton & Barto's small MDP where every path's true value is ≤ 0, vanilla Q-learning picks the biased "left" action far more than 5% (optimal) early in training; Double Q-learning tracks the correct ~5% almost immediately.

JavaScript implementation

A complete tabular Q-learner with ε-greedy exploration, applied to a tiny deterministic gridworld. The environment is abstracted behind step() — the agent never sees its internals.

class QLearner {
  constructor(nStates, nActions, { alpha = 0.1, gamma = 0.99, epsilon = 1.0 } = {}) {
    this.Q = Array.from({ length: nStates }, () => new Float64Array(nActions));
    this.nActions = nActions;
    this.alpha = alpha;       // learning rate
    this.gamma = gamma;       // discount factor
    this.epsilon = epsilon;   // exploration rate
  }

  argmax(s) {
    const row = this.Q[s];
    let best = 0;
    for (let a = 1; a < this.nActions; a++) if (row[a] > row[best]) best = a;
    return best;
  }

  act(s) {                                    // ε-greedy behaviour policy
    if (Math.random() < this.epsilon) return Math.floor(Math.random() * this.nActions);
    return this.argmax(s);
  }

  update(s, a, r, sNext, done) {
    // Off-policy: bootstrap from the GREEDY next value, not the action we'll take.
    const bootstrap = done ? 0 : this.Q[sNext][this.argmax(sNext)];
    const tdTarget = r + this.gamma * bootstrap;
    const tdError = tdTarget - this.Q[s][a];
    this.Q[s][a] += this.alpha * tdError;     // nudge toward the target
    return tdError;
  }
}

// Training loop — env.reset() returns a start state; env.step(a) returns [sNext, r, done]
function train(env, agent, episodes = 5000) {
  for (let ep = 0; ep < episodes; ep++) {
    let s = env.reset(), done = false;
    while (!done) {
      const a = agent.act(s);
      const [sNext, r, isDone] = env.step(a);
      agent.update(s, a, r, sNext, isDone);
      s = sNext; done = isDone;
    }
    agent.epsilon = Math.max(0.05, agent.epsilon * 0.999);  // decay exploration
  }
}

Two details carry the whole method. The done ? 0 : ... guard zeroes the bootstrap at a terminal state — a frequent and silent bug if you forget it (the agent keeps adding phantom future value past the goal). And the bootstrap uses argmax(sNext), the greedy action, never act(sNext) — that one substitution is the entire difference between Q-learning and SARSA.

Python implementation

The same agent in NumPy, plus the famous FrozenLake / gridworld control loop and a greedy-policy extractor — the artifact you actually deploy after training.

import numpy as np

class QLearner:
    def __init__(self, n_states, n_actions, alpha=0.1, gamma=0.99, epsilon=1.0):
        self.Q = np.zeros((n_states, n_actions))
        self.n_actions = n_actions
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, s):                              # epsilon-greedy
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.n_actions)
        return int(np.argmax(self.Q[s]))

    def update(self, s, a, r, s_next, done):
        bootstrap = 0.0 if done else np.max(self.Q[s_next])   # OFF-POLICY: greedy max
        td_target = r + self.gamma * bootstrap
        self.Q[s, a] += self.alpha * (td_target - self.Q[s, a])

def train(env, agent, episodes=5000, eps_min=0.05, eps_decay=0.999):
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = agent.act(s)
            s_next, r, done = env.step(a)
            agent.update(s, a, r, s_next, done)
            s = s_next
        agent.epsilon = max(eps_min, agent.epsilon * eps_decay)

def greedy_policy(agent):
    # After training: the deployable policy is just argmax over each state's row.
    return np.argmax(agent.Q, axis=1)

Note what you ship: not the algorithm, just np.argmax(Q, axis=1) — a lookup table from state to best action. Training is the expensive part; inference is a single array index.

Variants worth knowing

Double Q-learning. Keeps two tables, Q_A and Q_B. On each step, flip a coin to decide which to update; use one to select the maximizing action and the other to evaluate it: Q_A(s,a) ← ... + γ·Q_B(s′, argmax Q_A(s′,·)). Decoupling selection from evaluation cancels the maximization bias. This idea, lifted into deep RL as Double DQN (2015), measurably improved Atari scores.

Deep Q-Network (DQN). Replace the table with a neural net Q(s, a; θ). The two tricks that made it stable: an experience replay buffer (sample random past transitions to break correlation) and a slowly-updated target network θ⁻ for the bootstrap, so the target doesn't chase a moving estimate. DeepMind's 2015 Nature paper used these to reach human-level play on 49 Atari games from raw pixels.

Expected SARSA. Replaces the sampled or max next-value with the expectation over the policy's action probabilities: 𝔼_a′[Q(s′,a′)]. It removes the variance of sampling a single a′ and, if the policy is greedy, reduces exactly to Q-learning — so it's a strict generalization.

n-step Q-learning & Q(λ). Instead of bootstrapping after one step, accumulate n real rewards before bootstrapping, trading bias for variance. Q(λ) blends all horizons with an eligibility-trace decay λ, interpolating smoothly between one-step TD (λ=0) and Monte Carlo (λ=1).

Prioritized experience replay. Sample transitions with large TD error more often, since they carry the most learning signal. Roughly 2× faster convergence than uniform replay on many Atari games.

Common bugs and edge cases

Bootstrapping past a terminal state. If done is true, the target is just r — there is no s′. Forgetting the done guard leaks phantom value backward and inflates every Q-value.
Using the behaviour action in the bootstrap. Writing Q[s_next][act(s_next)] instead of max Q[s_next] silently turns Q-learning into a buggy SARSA. The update must use the greedy max.
Never decaying ε. If ε stays high, the agent keeps acting randomly forever and the learned policy is fine but its behaviour isn't — and you never get clean greedy rollouts. Decay ε toward a small floor.
Learning rate too high. Large α on a stochastic environment makes Q-values oscillate and never settle; the Robbins-Monro conditions require α to shrink over time for convergence.
γ = 1 on a continuing task. Without termination, undiscounted returns are infinite and Q-values blow up. Use γ < 1 for any non-episodic problem.
The deadly triad. Combine bootstrapping + off-policy + function approximation (i.e. naive deep Q-learning) and the values can diverge to infinity. Target networks and replay buffers exist specifically to tame this.
Sparse rewards. If reward only appears at the goal, early episodes propagate almost no signal — the agent flails for thousands of steps. Reward shaping or optimistic initialization (start Q high to encourage exploration) helps.

Frequently asked questions

Why is Q-learning called off-policy?