Computer Architecture

Cache Associativity

Where can a memory block live in the cache — and why the answer decides your hit rate

Cache associativity decides where in the cache a memory block is allowed to live: direct-mapped (exactly one slot), fully associative (any slot), or the N-way set-associative middle ground that nearly every real CPU ships.

Direct-mapped1 slot / block
N-way set assoc.N candidate lines
Fully associativeany line
Lookup comparatorsN per access
Typical L18-way

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The intuition: a parking-lot policy

Imagine the cache as a parking lot and every memory block as a car looking for a spot. Associativity is the parking policy — the rule that says which spaces a given car is allowed to use.

A direct-mapped cache is the strictest valet in the world: your license plate hashes to exactly one space, and if another car is already there, it gets towed — even if the lot is half empty. A fully associative cache is a free-for-all: park anywhere there's room, and only when the whole lot fills does anyone get evicted. The N-way set-associative cache is the realistic compromise: your plate picks a small numbered section, and within that section you may take any of N spaces.

That single design choice ripples through everything: how the hardware splits an address, how many comparators fire on each access, how often a hot loop thrashes itself, and ultimately how many nanoseconds your program spends waiting on memory. Get it wrong by a factor of two and a tight numerical kernel can run an order of magnitude slower for reasons that never appear in the source code.

How the address gets carved up

Every cache works on fixed-size blocks (also called lines), commonly 64 bytes on x86-64 and Arm. When the CPU issues an address, the cache slices its bits into three fields:

|<------------- tag ------------->|<-- set index -->|<-- offset -->|
 high bits                          middle bits        low bits

Offset — the low log₂(line size) bits, selecting a byte inside the block. For a 64-byte line that's 6 bits.
Set index — the next log₂(number of sets) bits, choosing which set the block belongs to.
Tag — everything left over, stored alongside the data so the cache can confirm the block in a line is the one you asked for.

The number of sets is the lever associativity pulls:

number_of_sets = cache_size / (line_size × associativity)
index_bits     = log₂(number_of_sets)

Concretely, take a 32 KiB cache with 64-byte lines:

Direct-mapped (1-way): 32768 / 64 = 512 sets → 9 index bits. A block has exactly one home.
8-way set-associative: 32768 / (64 × 8) = 64 sets → 6 index bits. A block may live in any of 8 lines within its set.
Fully associative: 1 set → 0 index bits. Every line is a candidate; the entire address-above-offset is the tag.

Notice the conservation law: doubling associativity halves the number of sets, removes one index bit, and hands that bit to the tag. The data array stays the same size; only the lookup logic changes.

What happens on a lookup

On an access the cache uses the index bits to pick a set, then compares the request's tag against the stored tag of every line in that set, in parallel. A direct-mapped cache compares 1 tag; an N-way compares N; a fully associative cache compares all of them. If a tag matches and the line is valid, it's a hit and the offset selects the byte. Otherwise it's a miss: the block is fetched from the next level, and if the set is full, the replacement policy chooses a victim to evict.

This is why associativity is a hardware cost, not just a math curiosity. Each "way" needs its own tag comparator wired to fire simultaneously, plus a wider multiplexer to route the winning line's data out. The comparison must finish inside the cache's hit-latency budget — typically 4–5 cycles for L1 — so the comparator count is hard-capped by physics, not preference.

Conflict misses: the whole reason associativity exists

Caches miss for three reasons, the classic "three Cs" coined by Mark Hill in 1987:

Compulsory — the first-ever reference to a block; unavoidable without prefetching.
Capacity — the working set is simply larger than the cache; only a bigger cache helps.
Conflict — the block was evicted even though the cache wasn't full, because too many hot blocks mapped to the same set.

Conflict misses are the ones associativity attacks. Consider a direct-mapped 32 KiB cache and a loop that alternately touches address A and address A + 32768. Both addresses share the same index bits, so they map to the same single line and evict each other on every iteration — a 100% miss rate on what looks like a perfectly cache-friendly loop. A 2-way cache fixes it instantly: both blocks coexist in the same set's two ways. This pathology, "cache thrashing," is exactly why pure direct-mapped caches fell out of favor for L1.

Direct-mapped vs N-way vs fully associative

	Direct-mapped (1-way)	N-way set-associative	Fully associative
Slots a block may use	exactly 1	N (the set's ways)	all lines
Number of sets	lines	lines / N	1
Tag comparators / access	1	N	all lines
Conflict misses	worst	few (drop fast with N)	none (only capacity)
Hit latency	lowest	moderate	highest
Replacement policy	none needed	LRU / PLRU / RRIP	LRU / PLRU / RRIP
Hardware cost	cheapest	moderate	most expensive
Typical real use	some old L1, simple uControllers	nearly all modern L1/L2/L3	TLBs, small victim caches

Direct-mapped and fully associative are just the two endpoints of the same axis: a 1-way cache is direct-mapped, and an N-way cache with N equal to the line count is fully associative. Everything CPUs ship lives in between, because the endpoints trade away the wrong things — direct-mapped gives up hit rate, fully associative gives up latency and power.

What the numbers actually say

The "2× rule." Hennessy and Patterson's measurements give a durable rule of thumb: an 8-way cache has roughly the same miss rate as a direct-mapped cache of twice the capacity. Going from direct-mapped to 2-way typically cuts the miss rate by 15–30% on SPEC-style workloads; 2-way to 4-way and 4-way to 8-way each buy progressively less.
Diminishing returns are real. Past 8-way the miss-rate curve is nearly flat for most code, which is exactly why Intel's and AMD's L1 data caches have parked at 8-way for over a decade while L2/L3 run 8- to 16-way.
Latency is the counterweight. Each doubled way adds comparator delay and a wider mux. That's why associativity isn't cranked to the ceiling — a 32-way L1 might hit 1% more often but cost an extra cycle on every access, a losing trade.
The thrash penalty is enormous. An L1 hit is ~4 cycles; an L1 miss that goes to DRAM is 200–300 cycles. One avoidable conflict miss per iteration can make a loop 50× slower, which is why padding arrays to dodge same-set collisions is a standard performance trick.

JavaScript: an N-way set-associative cache simulator

The cleanest way to internalize associativity is to simulate it. This model takes the cache size, line size, and number of ways, derives the set count, and tracks LRU per set.

class SetAssociativeCache {
  constructor({ cacheBytes, lineBytes, ways }) {
    this.lineBytes = lineBytes;
    this.ways = ways;
    this.numSets = cacheBytes / (lineBytes * ways);   // direct-mapped ⇒ ways=1
    this.offsetBits = Math.log2(lineBytes);
    this.indexBits  = Math.log2(this.numSets);
    // Each set is an array of { tag, lastUsed }; LRU via a monotonic clock.
    this.sets = Array.from({ length: this.numSets }, () => []);
    this.clock = 0;
    this.hits = 0; this.misses = 0;
  }

  // Split an address into (tag, setIndex). The offset is discarded — a
  // whole line is fetched, so byte-within-line never affects hit/miss.
  decode(addr) {
    const blockId  = Math.floor(addr / this.lineBytes);   // drops offset bits
    const setIndex = blockId % this.numSets;              // the index bits
    const tag      = Math.floor(blockId / this.numSets);  // the high bits
    return { tag, setIndex };
  }

  access(addr) {
    const { tag, setIndex } = this.decode(addr);
    const set = this.sets[setIndex];
    const line = set.find(l => l.tag === tag);

    if (line) {                       // HIT — refresh recency
      line.lastUsed = ++this.clock;
      this.hits++;
      return 'hit';
    }

    this.misses++;                    // MISS — install the block
    if (set.length < this.ways) {
      set.push({ tag, lastUsed: ++this.clock });
    } else {                          // set full → evict LRU victim
      let victim = 0;
      for (let i = 1; i < set.length; i++)
        if (set[i].lastUsed < set[victim].lastUsed) victim = i;
      set[victim] = { tag, lastUsed: ++this.clock };
    }
    return 'miss';
  }

  get missRate() { return this.misses / (this.hits + this.misses); }
}

// The thrash demo: alternate two addresses exactly one cache-size apart.
const STRIDE = 32 * 1024;
function run(ways) {
  const c = new SetAssociativeCache({ cacheBytes: 32 * 1024, lineBytes: 64, ways });
  for (let i = 0; i < 1000; i++) { c.access(0); c.access(STRIDE); }
  return c.missRate;
}

console.log('direct-mapped miss rate:', run(1)); // ≈ 1.0  (constant eviction)
console.log('2-way miss rate:        ', run(2)); // ≈ 0.001 (both blocks fit)

The only line that encodes the whole concept is setIndex = blockId % this.numSets. With one set (fully associative) the modulo is a no-op and nothing ever conflicts; with many sets (direct-mapped) collisions are frequent. Everything else — the tag, the LRU bookkeeping — is the same regardless of associativity.

Python: address decoding with real bit fields

The JavaScript version used arithmetic for clarity. Real hardware works on bit fields, and seeing the masks makes the "doubling associativity steals one index bit" rule concrete.

from math import log2

def cache_geometry(cache_bytes, line_bytes, ways):
    num_sets    = cache_bytes // (line_bytes * ways)
    offset_bits = int(log2(line_bytes))
    index_bits  = int(log2(num_sets))
    return offset_bits, index_bits

def decode(addr, offset_bits, index_bits):
    offset = addr & ((1 << offset_bits) - 1)
    index  = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag    = addr >> (offset_bits + index_bits)
    return tag, index, offset

# 32 KiB cache, 64-byte lines — watch the field widths shift with `ways`.
for ways in (1, 8, 512):                      # 512-way ⇒ fully associative here
    off, idx = cache_geometry(32 * 1024, 64, ways)
    tag, index, offset = decode(0xDEADBEEF, off, idx)
    label = {1: 'direct-mapped', 512: 'fully assoc.'}.get(ways, f'{ways}-way')
    print(f'{label:14} offset_bits={off} index_bits={idx:>2}  '
          f'tag={tag:#x} set={index}')

# direct-mapped  offset_bits=6 index_bits= 9  tag=0x1bd5b set=251
# 8-way          offset_bits=6 index_bits= 6  tag=0xdeadb set=59
# fully assoc.   offset_bits=6 index_bits= 0  tag=0x37ab6fb set=0

As ways climbs, index_bits shrinks and the tag widens by exactly the same amount. At full associativity there are zero index bits — the set is always 0, and the entire address above the offset is the tag, which is why a fully associative cache must compare every stored tag on every access.

Variants and refinements worth knowing

Victim caches. Norman Jouppi's 1990 idea: bolt a tiny fully associative cache (4–16 entries) onto a direct-mapped L1 to catch the blocks it just evicted. It absorbs conflict misses cheaply, giving most of the benefit of higher associativity at a fraction of the comparator cost.

Skewed-associative caches. Instead of using the same index function for every way, André Seznec's design hashes the address differently per way, so two addresses that collide in way 0 likely won't collide in way 1. The result is conflict behavior closer to a cache with more ways, without paying for them.

Pseudo-associative / way-prediction. Behave like a direct-mapped cache for speed, but on a miss probe a second location (e.g., flip the top index bit) before going to the next level. Way-prediction generalizes this: guess which way will hit and check only that one first, recovering direct-mapped latency while keeping N-way hit rates.

Approximate replacement. True LRU on 16 ways would need to order 16 items, so production caches use tree-PLRU (one bit per internal node of a binary tree), NRU (not-recently-used), or RRIP (re-reference interval prediction, which resists scan and thrash patterns that wreck LRU).

The TLB connection. Translation Lookaside Buffers are usually small and fully associative or highly associative, because a TLB miss is catastrophically expensive and the entry count is small enough to afford the comparators.

Common pitfalls and edge cases

Confusing "associativity" with "cache size." Raising associativity does not add storage — it rearranges the same lines into fewer, wider sets. It only helps conflict misses, never capacity misses.
Power-of-two array strides. Allocating a 2D array whose row stride is a power of two often makes every column map to the same handful of sets, silently thrashing. The classic fix is padding the row to a non-power-of-two length.
Assuming direct-mapped needs a replacement policy. It doesn't — there's a single candidate line, so eviction is forced. Adding LRU logic to a 1-way cache is a no-op that wastes gates.
Forgetting the offset is discarded for hit/miss. Two addresses in the same 64-byte block always hit together; reasoning about hits at byte granularity instead of block granularity gives wrong answers.
Treating LRU as exact. Most hardware approximates it. A microbenchmark tuned to defeat textbook LRU may behave differently on a tree-PLRU or RRIP cache.
Cranking associativity for its own sake. Past 8–16 ways the hit-rate gain is marginal while latency and power keep climbing — the curve has a knee, and CPU designers sit right at it.

Frequently asked questions

What is cache associativity in one sentence?

Associativity is the number of distinct cache lines a given memory block is allowed to occupy. Direct-mapped allows exactly one, fully associative allows all of them, and N-way set-associative allows any of the N lines inside one set.

Why don't CPUs just use a fully associative cache?

A fully associative lookup must compare the address tag against every line in parallel, which needs one comparator per line plus a wide priority encoder. For a 1 MiB cache with 64-byte lines that is 16,384 comparators firing every access — too much area, power, and latency. Set-associativity caps the comparison at N (typically 8 or 16), which fits in a single fast clock edge.

What is a conflict miss?

A conflict miss happens when a block was evicted even though the cache was not full, because too many hot blocks mapped to the same set. Direct-mapped caches suffer the most conflict misses; raising associativity converts conflict misses into hits until you hit the capacity limit, where only a bigger cache helps.

How does the CPU split an address for an N-way cache?

The low bits are the block offset (log2 of the line size), the next bits are the set index (log2 of the number of sets), and the remaining high bits are the tag. Crucially, the number of index bits is log2(cache size / (line size × associativity)) — so doubling associativity removes one index bit and adds it to the tag.

Does higher associativity always make a cache faster?

No. Hit rate improves with diminishing returns — the classic rule of thumb is that an 8-way cache misses about as often as a direct-mapped cache twice its size, but past 8 ways the curve flattens. Meanwhile each way you add costs comparator area, power, and a slightly longer hit latency, so most L1 caches stop at 8-way and L2/L3 at 16-way.

What replacement policy do set-associative caches use?

True LRU is common at low associativity, but tracking exact recency for 16 ways needs 16! orderings, so hardware approximates it with tree-PLRU, NRU, or RRIP. Direct-mapped caches need no replacement policy at all — there is only one candidate slot — which is one reason they were historically attractive.