Memory Architecture
NUMA
Each socket has its own DRAM — local is fast, remote is slow
On NUMA hardware each socket has its own local DRAM. Local access is ~80 ns; reaching another socket's memory across UPI/Infinity Fabric is ~140 ns. Pin threads to their node.
- Local DRAM latency~80 ns
- Remote (1-hop) latency~140 ns
- Remote / local ratio1.5× – 2× (2P), up to 3× (8P)
- UPI 4.0 link bandwidth~32 GB/s per direction
- Sub-NUMA clusterEPYC NPS4 → 4 nodes/socket
- Linux default policyFirst-touch (allocate on toucher's node)
Interactive visualization
Press play, or step through manually. Watch a thread on socket A hit its local DRAM in 80 ns — then watch the same thread reach socket B's memory and pay the inter-socket toll.
How NUMA works
A single-socket server has one set of memory channels driven by the CPU's integrated memory controllers. Every core sees the same memory at the same latency — that's uniform memory access. Plug in a second socket and the picture changes. Each socket has its own memory controllers wired to its own DIMM slots. A core on socket 0 can address socket 1's DRAM, but the request has to travel across an inter-socket interconnect — Intel's UPI or AMD's Infinity Fabric — to the remote memory controller and the data has to come back the same way. That round trip is what makes it non-uniform.
From the operating system's view, the machine has multiple NUMA nodes. Each node is a (cores + memory + memory controllers) tuple with its own latency and bandwidth characteristics. lscpu shows the NUMA layout; numactl --hardware shows the latency matrix between nodes, in units of 10 (so a value of 11 means 1.1× baseline, 21 means 2.1× baseline).
From the program's view, every load and store works regardless of where the page lives — the CPU does cross-socket cache coherence transparently. The only visible effect is latency and bandwidth: code that touches mostly local pages runs fast; code that constantly crosses the interconnect runs slow.
Common NUMA topologies
| Topology | Nodes | Local latency | Remote latency | Inter-socket link |
|---|---|---|---|---|
| Single socket Xeon | 1 | ~80 ns | n/a | n/a |
| 2P Sapphire Rapids | 2 (4 with SNC) | ~80 ns | ~140 ns | UPI 2.0, 16 GT/s × 24 lanes |
| 2P EPYC Genoa, NPS1 | 2 | ~95 ns | ~160 ns | Infinity Fabric, ~64 GB/s/dir |
| 2P EPYC Genoa, NPS4 | 8 (4 per socket) | ~95 ns | ~130 ns (intra-socket), ~165 ns (inter-socket) | Infinity Fabric |
| 4P Skylake-SP, fully-connected | 4 | ~80 ns | ~140 ns (1-hop), ~190 ns (2-hop) | 3× UPI per socket |
| 8P SGI UV / glueless | 8 | ~80 ns | ~240 ns (long diagonal) | HBM scaling fabric |
The ratio matters more than the absolute number. A 1.5× penalty is forgiving — most workloads can tolerate occasional remote hits. A 2× or 3× penalty is brutal: a hot loop that does 50% local + 50% remote runs in roughly the time of 100% remote.
First-touch and why malloc is misleading
The Linux memory allocator returns a virtual address space — no physical pages have been allocated yet. The page is only backed by physical RAM at the moment of the first touch (load or store) by some thread. That physical page comes from the NUMA node of the touching CPU.
That sounds harmless until you write:
// BAD — main thread touches everything first.
float *buf = malloc(N * sizeof(float));
memset(buf, 0, N * sizeof(float)); // page-fault each page on node 0
#pragma omp parallel for
for (int i = 0; i < N; i++) buf[i] = f(i); // workers on nodes 0+1
// Every page is on node 0. Worker threads on node 1 pay remote cost.
The fix is to let each worker thread touch its own slice first:
// GOOD — first-touch from the consumer thread.
float *buf = malloc(N * sizeof(float));
#pragma omp parallel for
for (int i = 0; i < N; i++) buf[i] = 0.0f; // each thread page-faults its slice
#pragma omp parallel for
for (int i = 0; i < N; i++) buf[i] = f(i); // accesses stay local
Controlling NUMA placement
# Show the NUMA topology, latency matrix, free memory per node.
numactl --hardware
# Run an app pinned to node 0 — CPUs and memory both from node 0.
numactl --cpunodebind=0 --membind=0 ./my-app
# Interleave pages across all nodes (good for memory-bound, locality-blind).
numactl --interleave=all ./my-app
# Show per-process local vs remote allocations.
numastat -p $(pidof my-app)
# Migrate an existing process's memory to node 1.
migratepages $(pidof my-app) 0 1
#include <numa.h>
#include <numaif.h>
if (numa_available() < 0) abort();
// Bind the calling thread's allocations to node 1.
numa_set_membind(numa_parse_nodestring("1"));
// Allocate explicitly on node 0.
void *p = numa_alloc_onnode(1 << 30, 0); // 1 GiB
// Pin the thread to socket 0 CPUs.
struct bitmask *cpus = numa_allocate_cpumask();
numa_node_to_cpus(0, cpus);
numa_sched_setaffinity(0, cpus);
// Low-level pin: set_mempolicy
unsigned long nodemask = 1UL << 0;
set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8);
Performance numbers
- Local DRAM hit: ~80 ns on current Xeon, ~95 ns on EPYC Genoa.
- Remote DRAM hit, 2-socket Xeon: ~140 ns. That's a ~60 ns penalty — equivalent to roughly 300 wasted cycles at 5 GHz.
- Sustained per-socket DRAM bandwidth on Sapphire Rapids (8 channels DDR5-4800): ~300 GB/s local. Pulling from the other socket halves available bandwidth and shares the UPI links.
- UPI 2.0 bandwidth budget on Sapphire Rapids: ~32 GB/s per direction per link, 3 links per socket on top SKUs. Coherence traffic eats from this budget too.
- Misaligned NUMA placement on a STREAM TRIAD benchmark: ~50% throughput loss compared to first-touch correct version.
- NPS4 mode on EPYC Genoa: ~10 ns lower local latency (~85 ns) and ~5–10% throughput uplift for NUMA-aware code, at the cost of needing software to handle 4 nodes per socket.
Common pitfalls
- Initializing memory from the main thread. First-touch puts every page on node 0. Always parallel-init in the same pattern you'll parallel-access.
- OS-level autoscheduling moving threads across sockets. Linux's scheduler can migrate threads for load balance — taking them away from their warm cache and local memory. Pin with
numactl,taskset, orsched_setaffinityfor steady-state workloads. - Interleave for the wrong reasons.
--interleave=allspreads pages across nodes, which is great for memory-bandwidth-bound HPC kernels but worse than node-binding for cache-friendly code. - Forgetting THP-touched-once. Transparent huge pages allocate 2 MiB on the toucher's node. A bad first-touch on one byte gives the wrong socket the whole 2 MiB.
- Cross-socket synchronization. A mutex or atomic stored on one socket and contended from the other costs an inter-socket coherence round trip per acquire-release. Co-locate hot synchronization with the threads using it.
Frequently asked questions
What does NUMA stand for?
Non-Uniform Memory Access. Each socket has its own local memory controller and DIMMs. Local access is fast; reaching another socket's memory routes across UPI or Infinity Fabric and back.
How much slower is remote NUMA access?
Roughly 1.5–2× on 2-socket boxes (~80 ns local, ~140 ns remote), up to 3× on long-diagonal 8-socket. Bandwidth is also reduced because traffic shares the inter-socket links.
What is a NUMA node?
A NUMA node is one cluster of cores sharing a local memory controller. 2-socket Xeon = 2 nodes; EPYC NPS4 splits each socket into 4 nodes. numactl --hardware enumerates them.
How do I pin a thread or memory to a NUMA node?
numactl --cpunodebind=0 --membind=0 ./app on Linux. Programmatically: sched_setaffinity, mbind, set_mempolicy, move_pages. libnuma wraps these. Default policy is first-touch.
What is first-touch policy?
The page only gets physical RAM when first written. Linux allocates that RAM from the node of the touching CPU. Initialize from the consumer threads, not the main thread.
How do I detect NUMA imbalance?
numastat -p <pid> for per-process local vs remote ratios; perf for remote DRAM hit counters; VTune or amd-uprof for visual NUMA timelines. Healthy HPC code is >95% local.
Should I use 1 big socket or 2 smaller sockets?
If your workload fits in one socket's cores and bandwidth, 1 socket is faster — no NUMA. Two sockets win when you need the aggregate cores or DRAM and partition cleanly.