Machine Learning

Deep Q-Networks (DQN)

The algorithm that learned to play Atari from raw pixels — and made deep reinforcement learning work

A Deep Q-Network (DQN) is a reinforcement-learning agent that uses a convolutional neural net to approximate the action-value function Q(s,a) from raw pixels, stabilized by experience replay and a separate, slowly-updated target network.

IntroducedMnih et al., 2013 / Nature 2015
LearnsQ(s,a) from pixels
Replay buffer1,000,000 transitions
Target sync (C)every 10,000 steps
Atari benchmark≥ human on 29/49 games

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The intuition: a neural net that scores every move

Imagine a chess clock that, for any board you show it, instantly prints a number next to each legal move — the expected total reward if you play that move and then keep playing well. Pick the highest number, repeat, and you play near-optimally without ever planning ahead. That number is the action-value, written Q(s, a): the expected discounted return from taking action a in state s and acting greedily thereafter.

Classic Q-learning stores those numbers in a table — one cell per state-action pair. That works for grid-worlds with a few thousand states. It collapses the instant the state is an image. An Atari screen is roughly 210×160 pixels with 128 colors; the number of distinct screens dwarfs the number of atoms in the universe. You cannot tabulate it, and even if you could, you would never visit the same screen twice to fill a cell.

DQN's move is to replace the table with a function approximator: a convolutional neural network Q(s, a; θ) that takes the raw pixels in and emits one value per action out. Pixels in, action-values out, one forward pass. Because the network shares features across similar screens, it generalizes — it can score a screen it has never seen by analogy to ones it has. That single substitution, plus two stabilizers we will get to, is the whole idea. DeepMind's 2013 workshop paper and the 2015 Nature article "Human-level control through deep reinforcement learning" turned it from idea into the result that launched modern deep RL.

How DQN works, step by step

DQN is Q-learning with three changes: a network instead of a table, a replay buffer instead of on-line updates, and a frozen copy of the network to compute targets. Here is the loop the agent runs on every step.

Observe and act. Build the state s from the last four grayscale 84×84 frames. With probability ε pick a random action (explore); otherwise pick argmax_a Q(s, a; θ) (exploit). ε is annealed from 1.0 down to 0.1 over the first million frames.
Step the environment. Receive reward r and next state s′.
Remember. Push the transition (s, a, r, s′, done) into a circular replay buffer holding up to one million transitions; the oldest is evicted when full.
Replay. Sample a random minibatch of 32 transitions from the buffer.
Compute targets with the frozen net. For each transition, the target is y = r + γ · max_a′ Q(s′, a′; θ⁻) if the episode continues, or just y = r if done. Note the weights θ⁻ — these belong to the target network, a stale snapshot, not the live network.
Fit. Take one gradient-descent step (via backpropagation) on the squared error between the live prediction and that target: L(θ) = (y − Q(s, a; θ))², summed over the minibatch.
Sync occasionally. Every C = 10,000 gradient steps, copy θ → θ⁻ so the target network catches up.

The discount factor γ (0.99 in the Atari setup) trades off immediate versus future reward. The only supervision is the scalar game score; nobody labels the right move. The agent bootstraps — it trains its own predictions toward slightly-more-grounded versions of those same predictions — which is exactly why it needs the next section's tricks to not fall apart.

The mechanism and why it is fragile

At the heart of DQN is the Bellman optimality equation for action-values:

Q*(s, a) = E[ r + γ · max  Q*(s′, a′) ]
                          a′

If you knew Q* exactly, greedy action selection is optimal. DQN approximates Q* by minimizing the temporal-difference loss over minibatches drawn from the replay buffer D:

L(θ) = E              [ ( r + γ · max  Q(s′,a′; θ⁻) − Q(s,a; θ) )² ]
        (s,a,r,s′)~D                a′
                                    └─ target network θ⁻ (frozen)

Three forces conspire to make naïvely training this diverge — the "deadly triad" of function approximation, bootstrapping, and off-policy data:

Moving target. If the same weights θ appear in both the prediction and the target, every update changes the thing you are fitting to. You chase a target that runs away from you. Fix: the target network θ⁻, frozen for C steps.
Correlated samples. Consecutive frames are nearly identical; SGD assumes roughly i.i.d. data. Training on a stream of correlated frames causes the network to overfit the current stretch of gameplay and forget the rest. Fix: experience replay — sample uniformly at random from a huge buffer.
Maximization bias. The max picks the action with the highest estimate, and noisy estimates have some too-high outliers, so the max is biased upward. Fix: Double DQN (see variants) decouples selection from evaluation.

Complexity. Each environment step costs one forward pass to choose an action, plus one minibatch of size B (=32) of forward and backward passes to learn — so per-step compute is O(B · F), where F is the cost of one network pass. Action selection over |A| actions is O(|A|) on top of the shared convolutional trunk, which is why DQN handles discrete action sets cheaply but does not scale to continuous actions (you would have to maximize over an infinite set). Memory is dominated by the replay buffer: one million transitions of four 84×84 uint8 frames each is about 1e6 × 84 × 84 × 4 ≈ 28 GB if stored naïvely, which is why real implementations store frames as uint8 and share overlapping frame-stacks.

When to reach for DQN — and when not to

Discrete, finite action sets. DQN needs an explicit max_a over actions. Steering wheels and joint torques are continuous — use a policy-gradient or actor-critic method (DDPG, SAC, PPO) instead.
Cheap, simulable environments. DQN is famously sample-hungry — tens of millions of frames per game. If each real-world step is expensive (a physical robot, a recommendation served to a user), the cost is prohibitive without a simulator.
High-dimensional observations. The whole point is learning from pixels or other rich sensory input where a table is impossible.
Off-policy reuse matters. Because it learns off-policy from a buffer, DQN can squeeze many updates out of each interaction — valuable when interaction, not compute, is the bottleneck.

Avoid DQN for short-horizon bandit problems (a contextual bandit is simpler), for problems where you already have a good model of the dynamics (use planning / Monte-Carlo tree search), and for partially-observable tasks needing long memory (frame-stacking only buys you a few steps of history; a recurrent DRQN does better).

DQN vs other value- and policy-based methods

	Tabular Q-learning	DQN	Double DQN	Dueling DQN	PPO (policy gradient)	SAC (actor-critic)
Value representation	lookup table	CNN Q(s,a;θ)	CNN, two-net target	CNN split V + advantage	policy net + value baseline	twin Q + stochastic policy
Action space	discrete, small	discrete	discrete	discrete	discrete or continuous	continuous
On/off-policy	off-policy	off-policy	off-policy	off-policy	on-policy	off-policy
Replay buffer	none	1M transitions	1M transitions	1M transitions	none (rollout only)	large buffer
Stability trick	tabular, exact	target net + replay	decoupled max	advantage normalization	clipped surrogate	entropy + twin critics
Maximization bias	present	significant	largely removed	present	n/a (no max)	mitigated by min of twins
Typical use	grid worlds	Atari from pixels	Atari, default upgrade	Atari, sparse-action games	robotics, LLM RLHF	continuous control

The headline is that DQN sits at the value-based, off-policy, discrete-action corner of the design space. Its descendants (Double, Dueling, Prioritized Replay, and the combined "Rainbow" agent) keep the same skeleton and patch individual weaknesses; policy-gradient methods like PPO take a different route entirely by parameterizing the policy directly, which is what you need for continuous control and what powers RLHF for large language models.

What the numbers actually say

Human-level on 29 of 49 games. The 2015 Nature DQN, with one fixed architecture and one set of hyperparameters, matched or beat a professional human games tester on 29 of 49 Atari 2600 titles, and on some (Breakout, Video Pinball) scored many times the human reference.
50 million frames per game. Training used 50M frames — roughly 38 days of game experience, the figure the paper itself reports — per title, on a single GPU over 1–2 weeks of compute in 2015. This is the "sample-hungry" cost in concrete units.
Replay buffer ≈ 28 GB if naïve. 1,000,000 transitions × four 84×84 frames × 1 byte ≈ 28 GB. Storing frames as uint8 (not float32) saves 4× — the difference between fitting in RAM and not.
Target sync every 10,000 steps (C). Updating the target net every step (C=1) reproduces the unstable, diverging baseline; C=10,000 was the value that made 49-game training stable in the original paper.
Double DQN's overestimation gap. On games like Asterix and Wizard of Wor, vanilla DQN's value estimates ran an order of magnitude above the true returns; Double DQN brought estimates back in line and roughly doubled normalized score on the worst-affected games (van Hasselt et al., 2016).

JavaScript implementation (core training step)

A compact, framework-agnostic sketch of the DQN update. The network forward/backward are stubbed so the logic — replay sampling, frozen target, the done-mask, the periodic sync — is visible without a tensor library.

class ReplayBuffer {
  constructor(capacity) { this.cap = capacity; this.data = []; this.pos = 0; }
  push(t) {                          // t = {s, a, r, s2, done}
    if (this.data.length < this.cap) this.data.push(t);
    else this.data[this.pos] = t;    // overwrite oldest (circular)
    this.pos = (this.pos + 1) % this.cap;
  }
  sample(batch) {
    const out = [];
    for (let i = 0; i < batch; i++)
      out.push(this.data[(Math.random() * this.data.length) | 0]);
    return out;                      // uniform random — breaks correlation
  }
  get size() { return this.data.length; }
}

const GAMMA = 0.99, BATCH = 32, C = 10000;

function trainStep(online, target, buffer, step) {
  if (buffer.size < BATCH) return;                  // warm up first
  const batch = buffer.sample(BATCH);

  const targets = batch.map(({ r, s2, done }) => {
    if (done) return r;                             // terminal: no bootstrap
    const qNext = target.forward(s2);               // FROZEN weights θ⁻
    return r + GAMMA * Math.max(...qNext);          // max over actions a′
  });

  // One SGD step on  L = mean( (y − Q(s,a;θ))²  for the taken action a )
  online.fitActionValues(
    batch.map(t => t.s),
    batch.map(t => t.a),
    targets
  );

  if (step % C === 0) target.copyWeightsFrom(online); // θ⁻ ← θ
}

function selectAction(online, s, eps, nActions) {
  if (Math.random() < eps)                          // ε-greedy explore
    return (Math.random() * nActions) | 0;
  const q = online.forward(s);
  return q.indexOf(Math.max(...q));                 // exploit: argmax Q
}

Two details carry the whole algorithm. First, target.forward(s2) uses the frozen weights, never the live ones — swap that to online.forward and you get the diverging non-target baseline. Second, the done branch returns the bare reward: bootstrapping past a terminal state injects phantom future value and is a classic bug.

Python implementation (PyTorch)

The same logic with real tensors. This is close to a runnable training step for a Gym environment with a small MLP Q-network.

import random, torch
import torch.nn as nn
import torch.nn.functional as F
from collections import deque

class QNet(nn.Module):
    def __init__(self, n_obs, n_act):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_obs, 128), nn.ReLU(),
            nn.Linear(128, 128),  nn.ReLU(),
            nn.Linear(128, n_act))          # one output per action
    def forward(self, x): return self.net(x)

GAMMA, BATCH, C = 0.99, 32, 10_000
buffer = deque(maxlen=1_000_000)            # circular replay buffer

online = QNet(n_obs, n_act)
target = QNet(n_obs, n_act)
target.load_state_dict(online.state_dict()) # start identical
opt = torch.optim.Adam(online.parameters(), lr=2.5e-4)

def train_step(step):
    if len(buffer) < BATCH:
        return
    batch = random.sample(buffer, BATCH)
    s, a, r, s2, done = map(lambda x: torch.tensor(x), zip(*batch))

    # Q(s,a) for the action actually taken
    q = online(s.float()).gather(1, a.long().unsqueeze(1)).squeeze(1)

    with torch.no_grad():                   # target net is not trained
        q_next = target(s2.float()).max(dim=1).values
        y = r + GAMMA * q_next * (1 - done.float())   # zero out if terminal

    loss = F.smooth_l1_loss(q, y)           # Huber loss — robust to outliers
    opt.zero_grad(); loss.backward()
    nn.utils.clip_grad_norm_(online.parameters(), 10.0)  # clip exploding grads
    opt.step()

    if step % C == 0:                       # periodic hard target sync
        target.load_state_dict(online.state_dict())

Note torch.no_grad() around the target computation — gradients must not flow into θ⁻ — and the Huber (smooth_l1) loss, which the original paper used because squared error on the occasionally-large TD error produced unstable gradients. Clipping the gradient norm to 10 is the third belt-and-braces stabilizer.

Variants worth knowing

Double DQN (van Hasselt et al., 2016). The single biggest fix. Use the online net to select the next action and the target net to evaluate it: y = r + γ · Q(s′, argmax_a′ Q(s′,a′; θ); θ⁻). Removes most maximization bias at zero extra network cost.

Dueling DQN (Wang et al., 2016). Split the head into a state-value stream V(s) and an advantage stream A(s,a), then recombine as Q = V + (A − mean A). Lets the net learn which states are valuable without having to learn the effect of every action there — a big win when many actions are irrelevant.

Prioritized Experience Replay (Schaul et al., 2016). Sample transitions in proportion to their TD-error magnitude instead of uniformly, so surprising transitions are replayed more. Needs importance-sampling weights to stay unbiased; typically implemented with a sum-tree for O(log n) sampling.

Rainbow (Hessel et al., 2018). Combines six improvements — Double, Dueling, Prioritized Replay, multi-step returns, distributional RL (C51), and Noisy Nets — into one agent that substantially outperforms each component alone on the Atari suite.

DRQN. Replace the first fully-connected layer with an LSTM so the agent has genuine memory, for partially-observable tasks where four stacked frames are not enough history.

Common bugs and edge cases

Computing the target with the online network. The single most common DQN bug — it reproduces the unstable 2013-baseline and quietly diverges. Always use θ⁻.
Bootstrapping past terminal states. Forgetting the (1 − done) mask adds phantom future reward at episode boundaries. Values inflate, scores collapse.
Letting gradients flow into the target net. Omitting torch.no_grad() (or .detach()) trains both nets toward each other and defeats the purpose of freezing.
Reward scale. Atari rewards span orders of magnitude across games; the paper clips rewards to {−1, 0, +1}. Skip clipping and the loss is dominated by the highest-scoring game.
Training before the buffer warms up. Sampling 32 from a near-empty buffer means replaying the same handful of transitions thousands of times — overfit. Wait for a minimum fill (e.g. 50,000 transitions).
Float32 frame storage. Storing replay frames as float32 instead of uint8 quadruples memory and pushes the 1M buffer past 100 GB. Store uint8, convert to float only inside the forward pass.
Using DQN for continuous actions. There is no tractable max_a over a continuous set. Discretizing throws away precision and explodes with action dimensionality; reach for DDPG/SAC instead.

Frequently asked questions

Why does DQN need a separate target network?

Without it, the same network produces both the prediction Q(s,a) and the bootstrap target r + γ·max Q(s′,a′). Every gradient step shifts the target you are chasing, creating a feedback loop that diverges. DQN freezes a copy of the weights as the target network and only refreshes it every C steps (C = 10,000 in the original Atari paper), giving the online network a stationary target to fit.

What problem does experience replay solve?

Consecutive frames of gameplay are highly correlated, and stochastic gradient descent assumes roughly independent, identically distributed samples. Experience replay stores millions of past transitions (s, a, r, s′) in a buffer and trains on random minibatches drawn from it. This breaks temporal correlation and lets each transition be reused many times, improving both stability and sample efficiency.

What is the difference between Q-learning and DQN?

Tabular Q-learning stores one value per (state, action) cell, which is impossible when states are 84×84 pixel images. DQN replaces the table with a neural network Q(s,a;θ) that generalizes across states, and adds two stabilizers — experience replay and a target network — to make training a non-linear function approximator on a moving target actually converge.

Why does DQN stack four frames as input?

A single Atari frame is not a Markov state — you cannot tell from one image whether a ball is moving up or down. DQN stacks the last four grayscale 84×84 frames so velocity and direction are inferable, turning a partially-observable problem into an approximately-Markov one without a recurrent network.

What is the maximization bias in DQN, and how does Double DQN fix it?

The max operator in r + γ·max Q(s′,a′) both selects and evaluates the next action with the same noisy estimates, so it systematically overestimates values. Double DQN decouples them: the online network picks the argmax action, and the target network evaluates it — r + γ·Q_target(s′, argmax_a Q_online(s′,a)). This removes most of the upward bias at almost zero extra cost.

How well did the original DQN actually perform on Atari?

Mnih et al. (Nature, 2015) trained a single architecture and set of hyperparameters across 49 Atari 2600 games, learning only from pixels and the score. It reached at least human-level play on 29 of the 49 games, and exceeded a professional human tester on those, using 50 million frames (about 38 days of game experience) per game.