Machine Learning
Monte Carlo Tree Search (MCTS)
Grow a game tree from random guesses — the planner that cracked Go
Monte Carlo Tree Search (MCTS) builds a search tree one node at a time using random rollouts, then balances exploration against exploitation with the UCT formula — the planning algorithm at the heart of AlphaGo.
- Invented2006 (Coulom; Kocsis & Szepesvári)
- Per-iteration phasesSelect · Expand · Simulate · Backprop
- Selection ruleUCT (UCB1 on a tree)
- Needs a heuristic?No — random rollouts suffice
- Famous useAlphaGo / AlphaZero / MuZero
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The idea: judge a move by playing it out
Suppose it's your turn in a game so large that you can't possibly look ahead to the end — Go has roughly 2.1 × 10170 legal positions — vastly more than the number of atoms in the observable universe (about 1080), so brute-force lookahead is hopeless. How do you decide which move is best? Monte Carlo Tree Search answers with a deceptively simple trick: from each candidate move, play a few random games all the way to the end and see who tends to win. Moves that lead to more wins in these random "rollouts" get explored more carefully; moves that consistently lose get abandoned.
The genius is what happens with repetition. A single random game tells you almost nothing — it's noise. But run thousands of them and steer each new one slightly toward the moves that have looked promising so far, and the noisy estimates sharpen into a remarkably accurate ranking. MCTS spends its compute budget asymmetrically: deep, careful lines for moves that matter, and barely a glance at obvious blunders. Over many iterations the tree grows lopsided, mirroring the parts of the game that actually decide the outcome.
This is why MCTS broke a 40-year logjam. Before 2006, computer Go programs were stuck around the level of a weak amateur because nobody could write a good static evaluation function for a Go board. MCTS sidestepped the problem entirely: it doesn't need to know what a good position looks like, it just needs to be able to finish a game and check who won. Rémi Coulom introduced the tree-building scheme in 2006; Levente Kocsis and Csaba Szepesvári gave it its decisive selection rule, UCT, the same year.
The four phases of one iteration
MCTS repeats a four-step loop, each pass adding exactly one node to the tree and refining the win-rate estimates along one root-to-leaf path.
- Selection. Start at the root. Repeatedly descend to a child, always picking the one that maximizes the UCT score (below). Stop when you reach a node that still has at least one move you've never tried — or a terminal state.
- Expansion. If the node you stopped at isn't terminal, add one new child for an untried move. The tree grows by exactly one node per iteration.
- Simulation (rollout / playout). From that new node, play moves until the game ends. In vanilla MCTS the moves are uniformly random; in stronger engines they follow a cheap policy. Score the terminal state (e.g. +1 win, 0 loss).
- Backpropagation. Walk back up the path from the new node to the root. At every node on the path, increment the visit count
Nand add the rollout result to the total rewardW. Flip the reward's sign or perspective at each ply for two-player games.
After the budget is spent — a fixed number of iterations, or a wall-clock deadline — you pick a move from the root. Crucially, you return the child with the most visits, not the highest average reward. High visit count means the search kept choosing to investigate that move, a far more stable signal than a win rate computed from a handful of lucky rollouts.
The math: UCT and the exploration–exploitation trade-off
Every selection step is a multi-armed-bandit problem: each child is an "arm," and you want to spend pulls on the arm with the best true payoff without prematurely abandoning an arm that just got unlucky. MCTS uses UCT — Upper Confidence bounds applied to Trees — which is the UCB1 bandit formula applied at every internal node. For a child i you compute:
UCT(i) = W_i / N_i + C · sqrt( ln(N_parent) / N_i )
└─ exploitation ─┘ └──────── exploration ────────┘
The first term is the child's empirical win rate (average reward). The second is an exploration bonus that shrinks as a child is visited more (the N_i in the denominator) and grows, slowly, as the parent accumulates visits (the ln(N_parent) in the numerator). A child that has never been visited has N_i = 0, giving an infinite bonus, so MCTS always tries every move at a node at least once before it starts being selective.
The constant C sets how much you value exploration. For rewards scaled to [0, 1], the value that makes the UCB1 regret bound hold is C = √2 ≈ 1.41, but in practice engines tune it by self-play, typically landing between 0.5 and 2. The headline theoretical result (Kocsis & Szepesvári, 2006) is that, given enough iterations, the probability of selecting a sub-optimal move at the root converges to zero; the regret at the root grows only as O(log n) in the number of simulations n.
Complexity. Each iteration costs O(d) for selection down a path of depth d, plus the cost of one rollout (O(L) for a game of length L), plus O(d) for backpropagation — so a single iteration is O(d + L). Memory grows by one node per iteration, so n iterations use O(n) space. There is no exponential blow-up: that linear memory profile is exactly what lets MCTS run for millions of iterations where minimax would choke.
When to reach for MCTS
- Huge branching factor, no good heuristic. Go (≈250 legal moves per turn), general game playing, and many planning problems where you can't hand-write an evaluation function.
- You have a simulator but not a model. If you can roll the world forward to an outcome — even stochastically — MCTS works. It's a staple of model-based reinforcement learning and was the backbone of MuZero.
- Anytime requirements. Robotics and real-time agents that must return some action by a deadline benefit from MCTS's monotone improvement.
- Single-agent and stochastic domains. Puzzles, scheduling, the Canadian Traveller Problem, and games of chance all have MCTS variants.
Avoid plain MCTS when the domain is tactical and "spiky" — chess, where one missed refutation flips the result, defeats random rollouts. Reach for alpha-beta minimax with a strong evaluation instead, or graft a neural network onto MCTS as AlphaZero did. Also skip it when a cheap exact solver exists: you don't Monte-Carlo a problem you can solve with dynamic programming.
MCTS vs other search and planning methods
| MCTS (UCT) | Minimax + α-β | A* | Greedy / hill-climbing | AlphaZero (MCTS + NN) | |
|---|---|---|---|---|---|
| Needs an evaluation heuristic | No (random rollouts) | Yes (critical) | Yes (admissible h) | Yes | Learned by self-play |
| Search shape | Asymmetric, deepens promising lines | Uniform to fixed depth | Guided by f = g + h | Single greedy path | Asymmetric, policy-guided |
| Anytime (usable if stopped early) | Yes | Only after a full ply | No (needs goal reached) | Yes | Yes |
| Memory per step | O(n) — one node / iteration | O(depth) with α-β | O(frontier), can explode | O(1) | O(n) |
| Handles huge branching factor | Excellent | Poor without pruning | Depends on h quality | n/a | Excellent |
| Optimality guarantee | Converges to optimal as n→∞ | Exact to search depth | Optimal if h admissible | None (local optima) | Empirical, super-human |
| Signature domain | Go, GGP, MuZero planning | Chess (Stockfish classic), checkers | Pathfinding, routing | Quick approximations | Go, chess, shogi, Atari |
The headline split is heuristic dependence. Minimax and A* are only as good as the evaluation or distance function you feed them; MCTS bootstraps a value estimate from rollouts alone. The cost is that MCTS converges slowly on tactical positions where averaging is the wrong aggregation — which is precisely the gap a neural-network policy and value head closes in AlphaZero.
What the numbers actually say
- Go's search space is ≈2.1 × 10170 legal positions (Tromp, 2016) with a branching factor near 250. Full-width minimax to even modest depth is impossible; MCTS made strong play feasible on commodity hardware.
- UCT regret is O(log n). Cumulative regret grows only logarithmically, so each doubling of the iteration budget adds just a constant to it while the average regret per simulation keeps falling — the law of diminishing-but-reliable returns that makes "just run more rollouts" a sound strategy.
- AlphaGo (2016) searched ≈ tens of thousands of positions per move — vs Deep Blue's ≈200 million per second for chess in 1997. MCTS wins by looking at far fewer positions, but the right ones, guided by a neural network.
- One node per iteration. A 1-million-iteration search builds a tree of about one million nodes — linear, predictable memory, in contrast to the exponential frontier of breadth-first or uniform-cost search.
- Crazy Stone and MoGo, the first UCT Go engines (2006–2008), jumped from weak-amateur to strong-amateur strength almost overnight — a leap classical evaluation-based programs had failed to make in three decades.
JavaScript implementation
A complete, game-agnostic UCT search. Plug in any game object exposing getMoves, play, isTerminal, and reward. Rewards are from the perspective of the player to move at the root, and we negate at each ply so backpropagation works for two-player zero-sum games.
class MCTSNode {
constructor(state, parent = null, move = null) {
this.state = state; // game state at this node
this.parent = parent;
this.move = move; // move that led here
this.children = [];
this.untried = state.getMoves(); // moves not yet expanded
this.N = 0; // visit count
this.W = 0; // total reward (from mover's view)
}
get isFullyExpanded() { return this.untried.length === 0; }
get isTerminal() { return this.state.isTerminal(); }
}
const C = Math.SQRT2; // exploration constant ≈ 1.41
function uct(child, parentN) {
const exploit = child.W / child.N;
const explore = C * Math.sqrt(Math.log(parentN) / child.N);
return exploit + explore;
}
function bestChild(node) {
let best = null, bestVal = -Infinity;
for (const c of node.children) {
const v = uct(c, node.N);
if (v > bestVal) { bestVal = v; best = c; }
}
return best;
}
// 1. SELECTION — descend by UCT until expandable or terminal
function select(node) {
while (!node.isTerminal && node.isFullyExpanded) node = bestChild(node);
return node;
}
// 2. EXPANSION — add one child for an untried move
function expand(node) {
const i = (Math.random() * node.untried.length) | 0;
const move = node.untried.splice(i, 1)[0];
const child = new MCTSNode(node.state.play(move), node, move);
node.children.push(child);
return child;
}
// 3. SIMULATION — random rollout to a terminal state
function rollout(state) {
let s = state;
while (!s.isTerminal()) {
const moves = s.getMoves();
s = s.play(moves[(Math.random() * moves.length) | 0]);
}
return s.reward(); // e.g. +1 / 0 / -1 for the player to move
}
// 4. BACKPROPAGATION — push reward up, flipping sign each ply
function backprop(node, reward) {
while (node) {
node.N += 1;
node.W += reward;
reward = -reward; // opponent's perspective one level up
node = node.parent;
}
}
function mcts(rootState, iterations = 10000) {
const root = new MCTSNode(rootState);
for (let i = 0; i < iterations; i++) {
let node = select(root);
if (!node.isTerminal) node = expand(node);
backprop(node, rollout(node.state));
}
// Return the most-visited child — the "robust" move
return root.children.reduce((a, b) => (b.N > a.N ? b : a)).move;
}
Two details matter. First, expand mutates untried with splice, so once a node's untried array is empty it is "fully expanded" and selection will UCT-pick among its children. Second, the final move is chosen by max visit count, not max W/N — averaging over few samples is noisy, and the most-visited child is the one the search itself trusted most.
Python implementation
The same algorithm in Python, with one common upgrade: a per-node UCT cache and an explicit two-player reward negation. The State protocol is the only thing you supply per game.
import math, random
class Node:
__slots__ = ('state', 'parent', 'move', 'children', 'untried', 'N', 'W')
def __init__(self, state, parent=None, move=None):
self.state = state
self.parent = parent
self.move = move
self.children = []
self.untried = list(state.get_moves())
self.N = 0 # visits
self.W = 0.0 # total reward, from this node's mover view
C = math.sqrt(2)
def uct(child, parent_N):
return child.W / child.N + C * math.sqrt(math.log(parent_N) / child.N)
def best_child(node):
return max(node.children, key=lambda c: uct(c, node.N))
def tree_policy(node): # selection + expansion
while not node.state.is_terminal():
if node.untried: # expand one untried move
move = node.untried.pop(random.randrange(len(node.untried)))
child = Node(node.state.play(move), node, move)
node.children.append(child)
return child
node = best_child(node) # else descend by UCT
return node
def rollout(state): # random simulation
while not state.is_terminal():
state = state.play(random.choice(state.get_moves()))
return state.reward()
def backprop(node, reward):
while node is not None:
node.N += 1
node.W += reward
reward = -reward # flip for the opponent
node = node.parent
def mcts(root_state, iterations=10_000):
root = Node(root_state)
for _ in range(iterations):
leaf = tree_policy(root)
backprop(leaf, rollout(leaf.state))
return max(root.children, key=lambda c: c.N).move # robust child
Note the asymmetry between the JS and Python listings: here tree_policy folds selection and expansion into one descent (the standard textbook form from Browne et al., 2012), expanding the first node that still has an untried move. Both are correct — the only invariant that matters is "add exactly one node per iteration, then roll out from it."
Variants worth knowing
PUCT (Predictor + UCT). AlphaGo and AlphaZero replace the √(ln N / Nᵢ) exploration term with a prior-weighted form, C · P(i) · √N_parent / (1 + Nᵢ), where P(i) is a neural-network policy probability. The network also supplies a value estimate, so the random rollout is dropped entirely — the simulation phase becomes a single network evaluation.
RAVE / AMAF (All-Moves-As-First). Share statistics between moves that appear anywhere in a rollout, not just at the node where they were played. This warms up estimates fast with few simulations; modern engines blend RAVE and UCT scores with a decaying weight.
Progressive widening / progressive unpruning. In games with enormous or continuous action spaces, only reveal new children gradually as a node's visit count grows (typically when Nα crosses the next integer), so the search isn't swamped by thousands of untried moves.
Root and leaf parallelization. Run independent trees and merge their root statistics (root parallel), or share one tree across threads with virtual loss to discourage collisions (tree parallel) — how engines exploit many cores.
Flat Monte Carlo. The degenerate ancestor: just roll out each root move many times and pick the best average, with no tree and no UCT. It's the baseline MCTS beats by spending its budget asymmetrically.
Common bugs and edge cases
- Forgetting to negate the reward in backprop. In a two-player zero-sum game, a win for the player to move at a leaf is a loss for the player who moved into it. Skip the sign flip and the tree confidently steers into losing lines.
- Dividing by zero in UCT. An unvisited child has
N = 0. Treat it as +∞ priority (visit unexplored moves first) rather than evaluating the formula — otherwise you getNaNand a crash. - Returning the highest-win-rate child instead of the most-visited one. A child with one lucky win has a 100% rate from a single sample. Always return max-visits.
- Mutating the shared game state. Each rollout must work on a copy. Aliasing the root state into rollouts corrupts the tree as moves are applied in place.
- Random rollouts in tactical games. In chess, uniform-random playouts almost never find the one refuting capture, so MCTS over-values blunders. Use a heuristic rollout policy, or replace rollouts with a value network.
- Not normalizing rewards for the chosen C. The
C = √2default assumes rewards in[0, 1]. Feeding scores in[-100, 100]makes the exploration term negligible, collapsing the search into pure greed.
Frequently asked questions
What does the C constant in the UCT formula control?
C is the exploration weight. A larger C inflates the exploration bonus, so MCTS keeps trying rarely-visited children even when their average reward looks poor; a smaller C makes the search greedier toward the current best move. For rewards normalized to [0, 1] the classic theoretical value is √2 ≈ 1.41, but tuned engines often use values between 0.5 and 2 found by self-play.
Does MCTS need an evaluation function like minimax does?
No — that is its headline advantage. Plain MCTS estimates a position's value by playing random games to the end and averaging the outcomes, so it needs only the game rules, not a hand-crafted heuristic. That is why it broke through on Go in 2006, where no good static evaluation existed. AlphaGo later replaced the random rollout with a neural-network value estimate, but that is an enhancement, not a requirement.
Why is MCTS called an anytime algorithm?
Each iteration is independent and improves the statistics at the root, so you can stop after any number of iterations and read off the best move so far. Give it 1,000 rollouts or 10 million — it always returns a usable answer, and the answer monotonically improves with more compute. Depth-limited minimax, by contrast, must finish a full ply before its result is trustworthy.
What are the four phases of one MCTS iteration?
Selection (walk down from the root choosing the highest-UCT child until you reach a node with unexpanded moves), Expansion (add one new child for an untried move), Simulation or rollout (play to a terminal state, randomly or with a policy, and score it), and Backpropagation (push that score back up the path, incrementing each ancestor's visit count and total reward).
Which child does MCTS finally return as its move?
Not the one with the highest average reward — the one with the highest visit count (the "robust child"). High visits mean the UCT search kept returning to that move, which is a lower-variance signal than a possibly-lucky win rate from few samples. Some implementations combine both, but max-visits is the standard and is what AlphaGo used.
Why does basic MCTS struggle with chess but excel at Go?
Chess is full of tactical traps where a single forced refutation flips the evaluation, and random rollouts almost never find the precise refuting line — so MCTS over-rates losing moves. Go positions are more positional and "smooth," so averaging random outcomes correlates well with true value. AlphaZero only made MCTS dominant at chess by replacing random rollouts with a trained neural network that supplies tactical judgement.