Memory Allocation
Slab Allocator
Pre-allocate N identical objects per slab — turn alloc and free into O(1) list ops
A slab allocator caches objects of one fixed size in a pre-built free list. Alloc pops, free pushes — both O(1). It's Bonwick's 1994 trick; Linux's SLUB powers every kernel object type.
- Alloc costO(1) — pop free-list head
- Free costO(1) — push to free-list head
- Internal fragmentationNear zero (objects identically sized)
- ConstructorRun once per object lifetime, not per alloc
- Invented byJeff Bonwick, Sun Microsystems, 1994
- Linux variantsSLAB, SLUB (default), SLOB
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How the slab allocator works
The Linux kernel allocates millions of small objects per second — a TCP socket, a directory entry, a task struct, an inode. Each has a fixed size and a known initial state. A general-purpose malloc would carry a header on each, fragment over time, lock-contend at the global heap, and constantly re-initialize bytes the caller is about to re-initialize anyway. Bonwick's slab allocator says: just keep a stack of pre-constructed objects of size X, refill it in batches from the page allocator, and never look at headers.
kmem_cache "task_struct" size = 9472 B, align = 64 B
per-CPU active slab (cpu0) partial slabs (per-node) full slabs
┌──────────────────────────┐ ┌────────────┬────────────┐ ┌─────────┐
│ obj 0 | obj 1 | ... | n-1│ ──→ │ partial 1 │ partial 2 │ → │ full 1 │
│ ▲ free list head │ │ some free │ some free │ │ all in │
└──────────────────────────┘ └────────────┴────────────┘ │ use │
│ └─────────┘
▼ refill on miss
┌──────────────────────────┐
│ buddy allocator (order-N) │ refill returns one slab = N contiguous pages
└──────────────────────────┘
Each slab is one or more physically contiguous pages obtained from the buddy allocator, carved into obj_per_slab equal-sized objects. The slab carries a free-list — in SLUB, an inline singly-linked list threaded through the free objects' first bytes — so allocation just pops the head and freeing just pushes it. Per-CPU slabs make the hot path lock-free; the partial list serves CPU misses; empty slabs are returned to the buddy when reclaim asks.
The structural pieces
Cache (kmem_cache). Per-type metadata: object size, alignment, constructor function, list of partial slabs, per-CPU slab pointers. Created by kmem_cache_create("name", size, align, flags, ctor). The kernel creates dozens at boot, more from modules.
Slab. One physically contiguous run of pages divided into N equal objects, owned by exactly one cache. The slab tracks how many of its objects are free, its place in the cache's partial/full lists, and which CPU (if any) is currently using it as the per-CPU active slab.
Per-CPU slot. Each online CPU holds a pointer to a "current" slab. Allocation pops the next free object from that slab without a lock. If the slab becomes full, the CPU promotes a partial slab from the per-node partial list; if no partial is available, it requests a fresh one from the buddy.
Free list. A singly-linked list threaded through the free objects themselves — the first 8 bytes of each free object hold a pointer to the next. No external bookkeeping; no per-object header.
Shrinker callback. Reclaim under memory pressure calls each cache's shrinker. Caches with releasable empty slabs hand pages back to the buddy allocator. The dentry and inode caches are the most prolific shrinker clients in a running system.
When a slab allocator wins
- Many small, identical objects. Kernel objects, network protocol headers, file-system metadata, language-runtime small objects. The cache hit rate makes alloc cheaper than memcpy.
- Reusable initial state. If
init_my_object()is expensive, the slab's constructor runs it once when the page is first carved up; subsequent allocations skip it. Free can be a "destructor" that restores state for the next reuse, not a full re-init. - Multi-core scaling. Per-CPU slabs make the hot path lock-free. Even on 96 cores, alloc/free latency stays single-digit nanoseconds.
- Allocation profiling. Per-cache statistics tell you exactly which subsystem is leaking.
/proc/slabinfoin Linux lists every kmem_cache with its allocation count, slab count, and waste.
Slabs lose when allocations are big, rare, or highly variable in size — every cache holds one size class, so 50 sizes need 50 caches with their own per-CPU storage. For variable-size general allocations, kmalloc uses a fixed family of size-class caches (kmalloc-8, kmalloc-16, ..., kmalloc-8192) and rounds up.
Slab vs other kernel/userspace allocators
| Slab / SLUB | Buddy | vmalloc | kmalloc (size class) | tcmalloc | dlmalloc / ptmalloc | |
|---|---|---|---|---|---|---|
| Granularity | One fixed object size | Power-of-2 pages | Page-aligned vmem range | Rounded to size class | Per-thread size class | Bins of size classes |
| Alloc cost | O(1) pop | O(log MAX_ORDER) split | Page-table setup | O(1) into slab | O(1) per-thread cache | O(1) amortized |
| Fragmentation | Near zero internal | Order-aligned waste | Page-level only | Class rounding | Class rounding | Internal headers + external |
| Per-CPU/thread cache | Yes (SLUB) | Yes (pageset) | No | Inherited from slab | Yes (per-thread) | Per-arena |
| Constructor support | Yes | No | No | No | No | No |
| Used by | Linux kernel objects | Page allocator | Module space, big buffers | Mixed-size kernel objects | Google userspace | glibc default malloc |
The slab is unique in knowing in advance exactly what size every allocation will be. That single piece of information eliminates the bookkeeping every general-purpose allocator must carry.
What slab allocators actually cost
- O(1) alloc / free. Both reduce to a list pop or push — a handful of memory ops, no locks on the hot per-CPU path. Real numbers:
kmem_cache_allocmeasured at 30-80 ns under no contention, ~150 ns under heavy contention before falling back to per-CPU. - Internal fragmentation near zero. A cache with object size 192 B in 4-KB slabs holds 21 objects (waste ~96 B per slab, ~2.3%). Compare to a general malloc rounding 192 B up to a 256-B size class (25% waste per allocation).
kmem_cache_createexists for every long-lived object type. Linux declares 50–200 caches at boot (varies by config and modules)./proc/slabinfoon a typical desktop shows ~120 named caches.- Per-CPU storage. Each cache holds one pointer to an active slab per CPU. On a 96-core box that's tens of KB of per-cache overhead — usually negligible compared to the slab pages themselves.
- Memory returned only on shrinker pressure. Empty slabs sit on the partial/empty list until
/proc/sys/vm/drop_cachesor watermark crossing wakes a shrinker. A box that just ran a bigfindoften shows hundreds of MB pinned in the dentry cache. - Object reuse rate >99% in steady state. Hot kernel caches (struct file, struct sock, struct dentry) reuse the same memory hundreds of times before any slab returns to the buddy.
JavaScript implementation (sketch)
// A toy slab cache. One cache, one fixed object size, free list threaded
// through the unused entries. Allocation and free are O(1) list ops.
class Slab {
// capacity = how many objects this slab holds.
constructor(slabId, capacity, ctor) {
this.slabId = slabId;
this.capacity = capacity;
this.objects = new Array(capacity);
this.next = new Int32Array(capacity); // free-list pointers
for (let i = 0; i < capacity - 1; i++) this.next[i] = i + 1;
this.next[capacity - 1] = -1; // end of list
this.freeHead = 0;
this.freeCount = capacity;
// run constructor once per object so allocation never has to.
if (ctor) for (let i = 0; i < capacity; i++) this.objects[i] = ctor();
else for (let i = 0; i < capacity; i++) this.objects[i] = {};
}
alloc() {
if (this.freeHead === -1) return null;
const idx = this.freeHead;
this.freeHead = this.next[idx];
this.freeCount--;
return { slabId: this.slabId, idx, obj: this.objects[idx] };
}
free(idx) {
this.next[idx] = this.freeHead;
this.freeHead = idx;
this.freeCount++;
}
isFull() { return this.freeCount === 0; }
isEmpty() { return this.freeCount === this.capacity; }
}
class KmemCache {
constructor(name, objectsPerSlab, ctor) {
this.name = name;
this.objectsPerSlab = objectsPerSlab;
this.ctor = ctor;
this.partial = []; // slabs with some free objects
this.full = []; // slabs entirely allocated
this.empty = []; // slabs entirely free (reclaim candidates)
this.nextSlabId = 0;
}
_refill() {
// In a real kernel: pop from the buddy allocator. Here: just create one.
const s = new Slab(this.nextSlabId++, this.objectsPerSlab, this.ctor);
this.partial.push(s);
return s;
}
alloc() {
let slab = this.partial[this.partial.length - 1] ?? this._refill();
const got = slab.alloc();
if (slab.isFull()) {
this.partial.pop();
this.full.push(slab);
}
return got; // { slabId, idx, obj }
}
free(ref) {
const { slabId, idx } = ref;
let slab = this.partial.find(s => s.slabId === slabId)
?? this.full.find(s => s.slabId === slabId);
const wasFull = slab.isFull();
slab.free(idx);
if (wasFull) {
this.full = this.full.filter(s => s !== slab);
this.partial.push(slab);
}
if (slab.isEmpty()) {
this.partial = this.partial.filter(s => s !== slab);
this.empty.push(slab); // candidate for shrinker
}
}
shrink() { // give pages back to buddy
const released = this.empty.length;
this.empty = [];
return released;
}
}
// Usage: one cache per object type, like the kernel does it.
const taskCache = new KmemCache('task_struct', 64, () => ({ pid: 0, state: 0 }));
const t = taskCache.alloc();
t.obj.pid = 42;
taskCache.free(t);
Real SLUB threads the free list through the unused objects themselves — no separate next array — and stores per-slab metadata in the page struct. The structure is the same: per-cache partial list, O(1) alloc/free, empty-slab return on shrinker callback.
Python implementation — per-CPU caches
import threading
from dataclasses import dataclass, field
from typing import Callable, Optional
@dataclass
class Slab:
slab_id: int
capacity: int
free_head: int = 0
free_count: int = 0
next: list[int] = field(default_factory=list)
storage: list = field(default_factory=list)
def __post_init__(self) -> None:
self.next = list(range(1, self.capacity)) + [-1]
self.free_count = self.capacity
self.storage = [None] * self.capacity
def alloc(self) -> Optional[int]:
if self.free_head == -1: return None
idx = self.free_head
self.free_head = self.next[idx]
self.free_count -= 1
return idx
def free(self, idx: int) -> None:
self.next[idx] = self.free_head
self.free_head = idx
self.free_count += 1
def is_full(self) -> bool: return self.free_count == 0
def is_empty(self) -> bool: return self.free_count == self.capacity
class KmemCache:
"""Slab cache with per-CPU active slabs and a partial/full list under a lock."""
def __init__(self, name: str, objects_per_slab: int, ctor: Callable, cpu_count: int = 4):
self.name = name
self.objects_per_slab = objects_per_slab
self.ctor = ctor
self.partial: list[Slab] = []
self.full: list[Slab] = []
self.empty: list[Slab] = []
self._next_id = 0
self._lock = threading.Lock()
self.per_cpu: list[Optional[Slab]] = [None] * cpu_count
def _new_slab(self) -> Slab:
s = Slab(slab_id=self._next_id, capacity=self.objects_per_slab)
self._next_id += 1
for i in range(self.objects_per_slab):
s.storage[i] = self.ctor()
return s
def alloc(self, cpu_id: int) -> tuple[Slab, int]:
# Fast path: per-CPU active slab. No global lock.
slab = self.per_cpu[cpu_id]
if slab is not None and not slab.is_full():
idx = slab.alloc()
return slab, idx
# Slow path: promote a partial slab (or refill from buddy).
with self._lock:
if slab is not None: # retire the now-full one
self.full.append(slab)
slab = self.partial.pop() if self.partial else self._new_slab()
self.per_cpu[cpu_id] = slab
idx = slab.alloc()
return slab, idx
def free(self, slab: Slab, idx: int) -> None:
was_full = slab.is_full()
slab.free(idx)
if was_full:
with self._lock:
if slab in self.full:
self.full.remove(slab)
self.partial.append(slab)
if slab.is_empty():
with self._lock:
if slab in self.partial:
self.partial.remove(slab)
self.empty.append(slab)
# In a real kernel, the shrinker decides when to return to buddy.
def shrink(self) -> int:
with self._lock:
released = len(self.empty)
self.empty.clear()
return released
The hot per-CPU path is lock-free; only refills (which are batched) hit the global lock. This is precisely the structure Linux SLUB uses to scale to hundreds of cores — locking only on slab transitions, never on per-object allocation.
Variants and real-world deployments
Bonwick SLAB (Solaris, 1994). The original. Three free queues per cache: per-CPU magazines (depot), per-magazine cache, central cache. Object coloring to spread cache-line offsets across slabs.
Linux SLAB. Direct port of Bonwick's design. Three-queue structure, slab coloring. Removed from the kernel in 6.x once SLUB matured.
Linux SLUB (default since 2.6.23). Simplified queue structure, per-CPU active slab, partial-slab list per NUMA node, metadata in the page struct. The default for x86, ARM64, RISC-V.
Linux SLOB. Minimum-footprint allocator for embedded kernels. Best-fit list of pages, no per-type caching. Trades performance for code size.
FreeBSD UMA (Universal Memory Allocator). Slab plus per-CPU buckets plus per-domain (NUMA) zones. Universal in the sense that it's the allocator for both kernel objects and the buddy of the page allocator.
illumos / Solaris kmem. The direct descendant of Bonwick's original. Adds magazines and depot layers for stronger per-CPU caching, plus KGI for kernel-aware leak/use-after-free debug.
Windows lookaside lists. The structural analog — per-type free list of objects of identical size, refilled in batches from the executive pool allocator.
Userspace echoes. jemalloc, tcmalloc, mimalloc are slab-flavored: size classes plus per-thread caches plus central free lists. The kernel slab is the design ancestor of all of them.
Common bugs and edge cases
- Use-after-free poisoning. The freed object sits in the cache, ready to be re-handed-out. A stale pointer to it may successfully read the bytes of an unrelated new allocation. SLUB debug mode poisons freed memory with 0x6b to catch this.
- Double free corruption. Pushing the same index twice onto the free list creates a cycle. Allocation later returns the same object to two callers. SLUB's redzones and freelist randomization help; production builds remove them for performance.
- Object that escapes its cache. Freeing a slab object to the wrong cache scrambles two free lists. Type-stable allocations make this a memory-safety hole.
- Slab leak in the dentry cache. A subsystem holding dentries forever quietly consumes RAM until the shrinker can't catch up. Diagnose with
slabtopandslub_debug=U. - False sharing across objects. Two unrelated 192-byte objects packed into the same 64-byte cache line cause two cores to ping-pong the line. Slab coloring used to mitigate this; SLUB largely relies on object-size-rounded-to-cache-line alignment instead.
- Per-CPU slab not flushed on offline. Hot-unplug must flush the CPU's slab back to the partial list, or those objects leak. Kernel bugs here have caused per-CPU memory leaks measured in GB on long-running boxes.
- Constructor side effects. A bad constructor (e.g., one that takes a sleeping lock) deadlocks at slab-refill time inside a non-preemptive context. Constructors must be small, deterministic, and atomic.
Frequently asked questions
Why is the slab allocator faster than malloc?
Each cache holds objects of one fixed size, so alloc/free reduce to a pop/push against a singly-linked free list — a couple of memory ops and no header parsing. The fragment-and-coalesce dance that general malloc has to perform on every operation is replaced by a constant-time list manipulation. Per-CPU caches (the second tier in Linux SLUB) further remove the lock from the hot path so multi-core scaling stays linear.
What is the kmem_cache structure in Linux?
kmem_cache is the per-object-type metadata: object size, alignment, the constructor function (if any), and pointers to the cache's slabs and per-CPU page slots. Every kernel object type that wants its own slab — task_struct, dentry, inode, mm_struct, sock — calls kmem_cache_create at boot or module init. After that, kmem_cache_alloc returns an aligned, often pre-initialized object in nanoseconds, and kmem_cache_free returns it to the cache for the next caller.
How does a slab cache reclaim memory?
Each cache tracks its slabs as partial, full, or empty. Empty slabs (every object free) are candidates to return to the buddy allocator. A 'shrinker' callback is registered per cache so memory reclaim under watermark pressure can ask each cache to release some empty slabs. Big caches like the dentry and inode caches have explicit shrinkers run by the VFS reclaimer; trimming them is what 'echo 2 > /proc/sys/vm/drop_caches' actually does.
What is the difference between SLAB, SLUB, and SLOB?
SLAB was Linux's first port of Bonwick's design — three queues (CPU, node, global), aggressive coloring for cache-line distribution. SLUB (default since 2.6.23) keeps a per-CPU slab and a per-node partial list, drops most of the queue machinery, and stores metadata in the page struct instead of inline — smaller memory footprint, better multi-core scaling. SLOB is for embedded/tiny kernels: a simple first-fit list-of-pages allocator, no per-type caching, minimum kernel footprint.
How does slab compare to userspace malloc?
General malloc must handle any byte size from any caller; it uses size-class bins, headers, coalescing, and complex free-list management. Slab knows in advance that every allocation from a given cache is the same size — so no per-allocation header, no in-place coalescing, no fragmentation between sizes. Userspace allocators like jemalloc and tcmalloc adopt the same idea (size classes plus per-thread arenas) but ship general-purpose code that handles arbitrary sizes by routing each through its size class.
What is slab coloring?
Two slabs holding the same object type, allocated back-to-back from the buddy allocator, would put corresponding objects at the same offsets inside their pages — and therefore at the same cache-line index inside the L1. Slab coloring stagger the start-of-slab offset by one cache-line each, so consecutive objects across slabs spread across cache sets. The cost is a few bytes of wasted space at the start of each slab; the win is far fewer L1 conflict misses. SLUB drops most coloring because modern caches have grown enough associativity that the gain is marginal.