Algorithms
Simulated Annealing
Climb downhill mostly, uphill sometimes, and cool down
Simulated annealing is a probabilistic optimization method that escapes local minima by occasionally accepting worse moves. Borrowed from metallurgy — slow cooling lets a crystal find a low-energy state — it routinely finds tours within 2% of optimal on 10,000-city TSPs in seconds.
- ClassMetaheuristic
- Solution keptSingle state
- Acceptance ruleMetropolis
- ConvergenceProbabilistic
- Famous useVLSI placement, TSP
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How simulated annealing works
Heat a metal until its atoms are jostling freely, then cool it slowly. Atoms have time to settle into a low-energy crystalline structure rather than freezing into whatever messy configuration they were in. Cool too fast and you trap a brittle high-energy state. Cool slowly enough and you reach the global minimum of the energy landscape. That physical intuition is the algorithm, almost literally.
The optimization version replaces "energy" with the cost function you want to minimize and "temperature" with a single scalar T that controls how willing the search is to accept a worse move. The loop:
- Start from any solution s and a high temperature T.
- Generate a random neighbour s' (e.g. swap two cities in a TSP tour).
- Compute the cost change ΔE = cost(s') − cost(s).
- If ΔE ≤ 0, accept the new solution.
- If ΔE > 0, accept it with probability exp(−ΔE / T) — the Metropolis criterion.
- Lower T a bit. Repeat until T is below a stopping threshold or you've used your iteration budget.
At high temperature almost every uphill move is accepted, so the search wanders the landscape broadly. As T falls, uphill moves become exponentially less likely. By the end the algorithm is effectively hill-climbing inside whatever basin it has settled near.
When to use simulated annealing
- The search landscape is full of local minima — pure hill-climbing gets stuck.
- You can compute the cost delta of a small perturbation cheaply (often O(1) on combinatorial problems with the right representation).
- You need a near-optimal answer fast, not a guaranteed optimum.
- The state space resists clean recursive decomposition — DP and branch-and-bound don't apply.
If your cost function is convex, use gradient descent. If subproblems repeat heavily, use dynamic programming. If you can prune the search exhaustively, use greedy with backtracking. SA's niche is everything else: messy, combinatorial, multi-modal landscapes.
SA vs hill-climbing vs genetic algorithms
| Hill-climbing | Simulated annealing | Genetic algorithm | |
|---|---|---|---|
| Solutions kept | One | One | Population (50–500) |
| Accepts worse moves? | Never | Yes, with prob exp(−ΔE/T) | Implicitly via selection pressure |
| Escape local minima | No (restart only) | Yes | Yes (crossover, mutation) |
| Hyperparameters | None | T₀, schedule, stop | Pop size, mutation, crossover, selection |
| Memory footprint | O(state) | O(state) | O(pop · state) |
| Parallelism | Trivial restart parallelism | Multi-chain parallel tempering | Embarrassingly parallel evals |
| Strongest on | Convex / unimodal | Combinatorial, multi-modal | Continuous + discrete, multi-objective |
The trio is often blended: a "hybrid GA" runs SA as its local-search step inside each generation. The empirical winner on a given problem is rarely the one the textbook predicts.
Pseudo-code
function simulatedAnnealing(s0, T0, alpha, iters):
s, best = s0, s0
T = T0
for k in 1..iters:
s' = randomNeighbour(s)
dE = cost(s') - cost(s)
if dE <= 0 or random() < exp(-dE / T):
s = s'
if cost(s) < cost(best): best = s
T = T * alpha # geometric cooling
return best
JavaScript: TSP via 2-opt simulated annealing
function tspSA(coords, T0 = 100, alpha = 0.995, iters = 100000) {
const n = coords.length;
const dist = (i, j) => {
const dx = coords[i][0] - coords[j][0];
const dy = coords[i][1] - coords[j][1];
return Math.sqrt(dx*dx + dy*dy);
};
const tourCost = t => t.reduce((s, _, i) => s + dist(t[i], t[(i+1)%n]), 0);
// Start from a random permutation.
let tour = [...Array(n).keys()].sort(() => Math.random() - 0.5);
let cost = tourCost(tour);
let best = tour.slice(), bestCost = cost;
let T = T0;
for (let k = 0; k < iters; k++) {
// 2-opt: pick i < j and reverse tour[i..j].
const i = 1 + (Math.random() * (n - 2)) | 0;
const j = i + 1 + (Math.random() * (n - i - 1)) | 0;
// Energy delta in O(1): only 2 edges change.
const a = tour[i-1], b = tour[i], c = tour[j], d = tour[(j+1) % n];
const dE = dist(a, c) + dist(b, d) - dist(a, b) - dist(c, d);
if (dE < 0 || Math.random() < Math.exp(-dE / T)) {
// Apply the reversal.
for (let l = i, r = j; l < r; l++, r--) [tour[l], tour[r]] = [tour[r], tour[l]];
cost += dE;
if (cost < bestCost) { bestCost = cost; best = tour.slice(); }
}
T *= alpha;
}
return { tour: best, cost: bestCost };
}
Python: same TSP solver
import math, random
def tsp_sa(coords, T0=100, alpha=0.995, iters=100_000):
n = len(coords)
def dist(i, j):
dx = coords[i][0] - coords[j][0]
dy = coords[i][1] - coords[j][1]
return math.hypot(dx, dy)
def tour_cost(t):
return sum(dist(t[i], t[(i+1) % n]) for i in range(n))
tour = list(range(n))
random.shuffle(tour)
cost = tour_cost(tour)
best, best_cost = tour[:], cost
T = T0
for _ in range(iters):
i = random.randint(1, n - 2)
j = random.randint(i + 1, n - 1)
a, b = tour[i-1], tour[i]
c, d = tour[j], tour[(j+1) % n]
dE = dist(a, c) + dist(b, d) - dist(a, b) - dist(c, d)
if dE < 0 or random.random() < math.exp(-dE / T):
tour[i:j+1] = tour[i:j+1][::-1]
cost += dE
if cost < best_cost:
best_cost, best = cost, tour[:]
T *= alpha
return best, best_cost
Variants and cooling schedules
- Geometric cooling. Tk+1 = α · Tk, with α ∈ [0.85, 0.99]. Cheap, predictable, the textbook default.
- Logarithmic cooling. Tk = c / log(k+1). The unique schedule with a convergence proof (Geman & Geman 1984) but absurdly slow in practice.
- Linear cooling. Tk = T0 − k·η. Easy to reason about for fixed budgets but accepts too few uphill moves late.
- Adaptive cooling. Monitor the acceptance rate; cool faster when high, hold or even reheat when low. Lam & Delosme (1988) is the classic reference.
- Re-annealing / restarts. When the temperature has fallen to near zero, jump back to a high T from the current best (or a random state). Trades a little determinism for a much better worst case.
- Parallel tempering. Run several chains at different temperatures simultaneously and occasionally swap states between them. Excellent at escaping deep, narrow minima — the modern workhorse for hard combinatorial problems.
- Threshold accepting. Replace the stochastic acceptance with a deterministic threshold: accept if ΔE ≤ τ, decrement τ on a schedule. Faster in practice with little quality loss.
Cost in real terms
On a 1,000-city TSP, a naive exhaustive search has 1,000! ≈ 102,567 tours. SA with 2-opt and 100,000 iterations runs in about 0.3 seconds in compiled code and reliably finds tours within 3% of the Concorde-computed optimum. On 10,000 cities, scaling iterations to 5 million takes roughly 30 seconds and gives tours within 5% of optimal — orders of magnitude better than greedy nearest-neighbour at a fraction of integer-programming runtime.
Common bugs and edge cases
- Bad initial temperature. Set T₀ too low and the algorithm degenerates into hill-climbing on the random starting state. Calibrate it: sample neighbours, target an 80% initial acceptance rate, and back-solve T₀ = −avgΔE / ln(0.8).
- Cooling too quickly. α = 0.7 freezes the chain in tens of iterations and the result is barely better than a greedy heuristic. Start at α = 0.95 and drop only if you must finish faster.
- Forgetting to track the best-so-far. The current state can drift away from a great solution it briefly visited; always keep
bestseparately fromcurrent. - O(n) cost recomputation per move. Re-summing the tour cost after each 2-opt nukes the speed. Compute the delta from the four affected edges and mutate
costincrementally. - Asymmetric or non-reversible moves. The Metropolis derivation assumes the proposal distribution is symmetric. If your neighbourhood isn't (e.g. insert/remove with different probabilities), you need the Metropolis-Hastings correction or you'll converge to the wrong stationary distribution.
- Numerical underflow on exp(−ΔE/T). When T is tiny and ΔE moderately positive, exp(−ΔE/T) is exactly zero in double precision. That's actually fine — you just never accept. The bug appears if you accidentally compute exp(ΔE/T) (sign flipped) or feed a NaN into Math.random() < ...
Frequently asked questions
Does simulated annealing always converge to the global optimum?
Only in a theoretical limit. Geman & Geman (1984) proved that with a logarithmic cooling schedule T_k = c / log(k+1) and infinite time, the probability of being at the global optimum tends to 1. In practice nobody can wait that long, so people use geometric cooling and accept that they find an excellent — but not provably optimal — solution.
Why accept a worse solution?
Pure hill-climbing gets stuck in any local minimum. Accepting an occasional uphill move lets the search climb out of a shallow basin and look for a deeper one. The Metropolis criterion P = exp(-ΔE / T) makes the willingness to climb a function of how bad the move is and how hot the system currently is.
What temperature should I start at?
A common rule of thumb: choose T_0 so that the initial acceptance probability of an average-bad move is about 0.8. Run a short calibration: sample a few hundred random neighbours, compute the average ΔE, and set T_0 = -avgΔE / ln(0.8). This auto-scales to the energy units of your problem.
Geometric vs logarithmic cooling — which should I use?
Logarithmic cooling has the convergence proof but is far too slow for any real-world budget. Geometric cooling T_{k+1} = α·T_k with α ∈ [0.85, 0.99] is the practical default — fast enough to finish, slow enough to find very good optima. Linear and adaptive (re-heating) schedules also see use.
What is a good neighbourhood function for TSP?
The 2-opt swap: pick two edges of the tour, reverse the path between them. It changes the tour by O(1) edges, the energy delta is computable in O(1), and the move is reversible. 2-opt + simulated annealing reliably finds tours within a few percent of optimal on instances up to 10,000 cities.
How is simulated annealing different from genetic algorithms?
SA holds a single solution and walks it; GAs maintain a population that crosses over and mutates. GAs explore breadth-first across many basins simultaneously; SA explores depth-first, controlled by temperature. SA usually has fewer hyperparameters and fits problems where defining a good crossover is unnatural (TSP, layout, scheduling).