Question 1

What does NUMA stand for?

Accepted Answer

Non-Uniform Memory Access. On a multi-socket or chiplet-based server, each CPU socket has its own bank of DRAM directly attached to its integrated memory controllers. A core can reach its local bank quickly; reaching another socket's bank requires routing the request across a coherent interconnect — Intel UPI, AMD Infinity Fabric — and back, adding latency and consuming inter-socket bandwidth.

Question 2

How much slower is remote NUMA access?

Accepted Answer

Roughly 1.5× to 2× the latency of a local access on current 2-socket Xeon and EPYC, and as much as 3× across the long diagonal of an 8-socket box. Concretely: ~80 ns for a local DRAM access, ~140 ns for a remote one (1-hop). Sustained bandwidth from remote nodes is also lower, because remote memory traffic shares the inter-socket interconnect with cache-coherence traffic.
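To see what those figures mean for a whole workload, the average DRAM latency can be blended from the local and remote numbers. The sketch below is a simplified model that ignores caches, bandwidth contention, and multi-hop paths; the function name is hypothetical and the defaults are the ~80 ns / ~140 ns figures quoted above.

```python
def average_latency_ns(local_fraction, local_ns=80.0, remote_ns=140.0):
    """Blend local and remote DRAM latency by the fraction of accesses that are local.

    Simplified model: every access is either fully local or one hop remote.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


if __name__ == "__main__":
    for frac in (1.0, 0.95, 0.80, 0.50):
        print(f"{frac:.0%} local: about {average_latency_ns(frac):.0f} ns per DRAM access")
```

Under this model, a workload that is 80% local pays about 92 ns per DRAM access on average, and one that is split 50/50 pays about 110 ns.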

Question 3

What is a NUMA node?

Accepted Answer

A NUMA node is a group of cores that share a local memory controller and the DIMMs attached to it. On a dual-socket Xeon there are typically 2 nodes (one per socket); Intel's sub-NUMA clustering (SNC) can split each socket further. On AMD EPYC Genoa and Bergamo, the equivalent BIOS setting is NPS (nodes per socket): NPS4 mode splits each socket into 4 NUMA nodes to reduce intra-socket cross-die hops. lscpu and numactl --hardware enumerate the nodes.
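The same node-to-CPU mapping is exposed by Linux under /sys/devices/system/node, where each nodeN directory has a cpulist file such as "0-3,8-11". A minimal sketch of reading it follows; the function names are hypothetical, and on a system without that sysfs directory the result is simply an empty dictionary.

```python
import glob
import os
import re


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8-11' into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a memory-only node
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_numa_topology(root="/sys/devices/system/node"):
    """Return {node_id: [cpu ids]} from sysfs, or {} if the directory is absent."""
    topology = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*")):
        node_id = int(re.search(r"node(\d+)$", path).group(1))
        with open(os.path.join(path, "cpulist")) as handle:
            topology[node_id] = parse_cpulist(handle.read())
    return topology


if __name__ == "__main__":
    for node_id, cpus in sorted(read_numa_topology().items()):
        print(f"node {node_id}: {len(cpus)} CPUs")
```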

Question 4

How do I pin a thread or memory to a NUMA node?

Accepted Answer

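On Linux, numactl --cpunodebind=0 --membind=0 ./app constrains the whole process to node 0's CPUs and memory. Programmatically, sched_setaffinity sets a thread's CPU mask; set_mempolicy sets the calling thread's default memory policy, mbind sets the policy for an address range, and move_pages migrates pages that are already allocated. libnuma wraps these calls. On Windows, use SetThreadIdealProcessor (a scheduling hint; SetThreadAffinityMask enforces a hard mask) and VirtualAllocExNuma. By default, Linux uses a first-touch policy: a page is allocated on the node of the CPU that first touches it.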
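A common way to apply the numactl form is to launch one worker per node, each bound to its own CPUs and memory. The sketch below only builds the command lines and does not run them; the helper name and the ./app path are placeholders, and it assumes numactl is installed on the target machine.

```python
import shlex


def per_node_commands(nodes, app_argv):
    """Build one numactl command line per NUMA node, binding CPUs and memory to that node."""
    commands = []
    for node in nodes:
        if node < 0:
            raise ValueError("node ids are non-negative")
        commands.append(
            ["numactl", f"--cpunodebind={node}", f"--membind={node}", *app_argv]
        )
    return commands


if __name__ == "__main__":
    # "./app" stands in for the real binary; nodes 0 and 1 assume a 2-node machine.
    for argv in per_node_commands([0, 1], ["./app"]):
        print(shlex.join(argv))
        # To actually launch: subprocess.Popen(argv)
```

Each resulting list can be passed to subprocess.Popen as-is, which avoids shell quoting issues.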

Question 5

What is first-touch policy?

Accepted Answer

When a process maps memory (via mmap or malloc-then-page-fault), the page isn't physically allocated until the first store touches it. Linux allocates that physical page from the NUMA node of the touching CPU. So a thread that calls memset on a giant buffer ends up with all those pages on its node — even if other threads will be the consumers. Pre-touching pages in parallel from the threads that will use them is a standard NUMA-aware optimization.
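The effect can be illustrated with a toy model. This is not a measurement and does not touch real memory: it just applies the first-touch rule to a list of page indices, assuming 2 nodes and consumers that split the buffer in half. All names are hypothetical.

```python
def first_touch_placement(num_pages, toucher_node):
    """Apply the first-touch rule: each page lands on the node of the thread that touches it first."""
    return [toucher_node(page) for page in range(num_pages)]


def local_fraction(placement, consumer_node):
    """Fraction of pages that sit on the same node as the thread that will consume them."""
    local = sum(1 for page, node in enumerate(placement) if node == consumer_node(page))
    return local / len(placement)


NUM_PAGES = 1000


def consumer(page):
    # Assumed access pattern: node 0 consumes the first half, node 1 the second half.
    return 0 if page < NUM_PAGES // 2 else 1


# Serial initialization: one thread on node 0 memsets the whole buffer.
serial = first_touch_placement(NUM_PAGES, lambda page: 0)
# Parallel pre-touch: each consumer thread initializes its own half.
parallel = first_touch_placement(NUM_PAGES, consumer)

if __name__ == "__main__":
    print("serial init local fraction:  ", local_fraction(serial, consumer))
    print("parallel init local fraction:", local_fraction(parallel, consumer))
```

In this model the serial initializer leaves half of the pages remote to their consumer, while the parallel pre-touch leaves none.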

Question 6

How do I detect NUMA imbalance?

Accepted Answer

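Linux: numastat -p <pid> shows how a process's memory is distributed across nodes, and plain numastat reports the per-node numa_hit, numa_miss, local_node, and other_node allocation counters. On Intel, perf stat -e mem_load_uops_l3_miss_retired.remote_dram counts retired loads that were served from remote DRAM; the exact event name varies by microarchitecture, so check perf list. AMD uProf and Intel VTune have dedicated NUMA views. A healthy ratio for HPC code is >95% local; under 80% local often points to allocations on the wrong node, missing first-touch, or unbalanced thread placement.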
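As a rough first check, the same allocation counters can be read from /sys/devices/system/node/nodeN/numastat, which holds one "name value" pair per line. The sketch below parses that text and applies the 95% and 80% thresholds mentioned above. Note the assumptions: these counters are system-wide allocation counts since boot, not per-process access counts, the function names are hypothetical, and the sample numbers are invented for illustration.

```python
def parse_node_numastat(text):
    """Parse 'name value' lines from a per-node numastat file into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            counters[fields[0]] = int(fields[1])
    return counters


def local_ratio(counters):
    """Share of this node's allocations that were requested by CPUs on the same node."""
    total = counters["local_node"] + counters["other_node"]
    return counters["local_node"] / total if total else 1.0


def classify(ratio):
    """Apply the rule of thumb: above 95% local is healthy, under 80% needs a look."""
    if ratio > 0.95:
        return "healthy"
    if ratio < 0.80:
        return "investigate placement"
    return "borderline"


# Invented sample in the sysfs format, for illustration only.
SAMPLE = """numa_hit 10000
numa_miss 0
numa_foreign 0
interleave_hit 0
local_node 9000
other_node 1000
"""

if __name__ == "__main__":
    ratio = local_ratio(parse_node_numastat(SAMPLE))
    print(f"{ratio:.1%} local: {classify(ratio)}")
```

For per-process or per-access detail, numastat -p and the hardware counters remain the better tools; this only flags a node whose allocations mostly come from elsewhere.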

Question 7

Should I use 1 big socket or 2 smaller sockets?

Accepted Answer

If your workload fits in one socket's cores and bandwidth, 1 socket is almost always faster: there are no inter-socket hops, and with SNC or NPS splitting disabled the socket appears as a single node. Two sockets win when you need the aggregate core count or DRAM capacity and your workload partitions cleanly. Cloud and HPC schedulers increasingly carve large hosts into NUMA-aligned VMs so each guest sees a single node.

Topology	Nodes	Local latency	Remote latency	Inter-socket link
Single socket Xeon	1	~80 ns	n/a	n/a
2P Sapphire Rapids	2 (4 with SNC)	~80 ns	~140 ns	UPI 2.0, 16 GT/s × 24 lanes
2P EPYC Genoa, NPS1	2	~95 ns	~160 ns	Infinity Fabric, ~64 GB/s/dir
2P EPYC Genoa, NPS4	8 (4 per socket)	~95 ns	~130 ns (intra-socket), ~165 ns (inter-socket)	Infinity Fabric
4P Skylake-SP	4	~80 ns	~140 ns (1-hop), ~190 ns (2-hop, ring topology only)	3× UPI per socket (fully connected, all 1-hop) or 2× UPI (ring)
8P SGI UV / glueless	8	~80 ns	~240 ns (long diagonal)	NUMAlink (SGI UV) or UPI (glueless Xeon)
