Memory Architecture
Write-Combining
Coalesce many stores into one 64-byte bus burst
Write-combining buffers coalesce multiple CPU stores to the same cache line into one transaction. The user-visible WC memory type accelerates GPU and PCIe MMIO traffic.
- Buffer width64 bytes (one cache line)
- Buffers per core~6 (Intel) / ~10 (AMD Zen 3+)
- Memory typesWB, WC, WT, WP, UC, UC−
- Drain triggersSFENCE, full buffer, line miss
- Speedup vs UC for MMIO5–20× on burst traffic
- OrderingWeak (within buffer, across lines)
Interactive visualization
Press play, or step through manually. Watch sixteen separate stores either go out one-at-a-time (UC) or accumulate in a 64-byte WC buffer and flush as a single bus burst.
How write-combining works
A CPU writing to normal writeback memory updates a cache line in L1, the line gets marked dirty, and at some later point the coherence protocol writes it back to memory or hands it off to another core. The store is essentially free from the program's view because L1 is fast and the bus traffic is amortized.
That story falls apart for memory-mapped device regions. A GPU's BAR window or a NIC's doorbell register can't be cached: the device needs to see every store, and the cache has no way to prefetch from it. The old solution was UC (Uncached) memory: every store goes out as a separate transaction on the bus, in strict program order. UC is correct but painfully slow — eight 8-byte stores to UC memory means eight bus transactions, each costing 50–200 ns.
Write-combining is the middle ground. WC memory bypasses the cache, but instead of issuing every store immediately the CPU buffers them in a small dedicated 64-byte structure — the WC buffer. Each core has several. When all the bytes of a 64-byte line have been written, or when something forces a drain (SFENCE, MFENCE, lock prefix, serializing instruction, a different line being touched, full buffer pool), the entire 64-byte block goes out as one bus transaction. PCIe sees it as a single Transaction Layer Packet up to the device's max-payload size (256 or 512 bytes).
The savings are substantial. Sixteen 4-byte stores to a single 64-byte WC line cost one bus burst instead of sixteen. On a framebuffer or a GPU command-queue, that's a 16× reduction in PCIe traffic for a noticeable speedup.
WB vs WC vs UC and friends
| Type | Cached | Coalescing | Ordering | Typical use |
|---|---|---|---|---|
| WB · Writeback | Yes (L1/L2/L3) | Yes (via cache) | x86 TSO — total store order | Default for normal RAM |
| WT · Write-through | Reads cached, writes through | No | TSO | Some legacy ROM regions |
| WP · Write-protect | Reads cached, writes uncached | No | TSO | Read-only mappings, BIOS |
| WC · Write-combining | No | Yes (WC buffer) | Weak — needs SFENCE | GPU/PCIe MMIO, framebuffers |
| UC− · Uncached, weakly ordered | No | No | Weak | Some PCIe regions |
| UC · Uncached | No | No | Strict — each store separately | Control registers requiring strict order |
The four-by-six PAT table lets the OS pick the type per-page; older systems used MTRRs (fixed regions). Linux's /proc/mtrr shows the legacy ranges; pgprot_writecombine() sets WC on new mappings.
When to choose WC
- Streaming writes to a GPU framebuffer or aperture. Software renderers, V4L2 capture buffers, X server pixmaps in shared memory — anything that bulk-writes pixels.
- NIC and NVMe doorbells and submission rings. Write the queue tail (4 or 8 bytes) plus the descriptor body in nearby addresses; the WC buffer coalesces, SFENCE forces the doorbell to be the visible tail.
- Streaming memcpy to slow targets. Glibc's memcpy uses non-temporal stores for very large copies (over the L3 working-set size) precisely because NT stores ride the WC buffer.
- Persistent memory writes. Some Intel Optane / CXL.mem persistent-memory regions are mapped WC to avoid cache pollution;
clwbandsfencesequence guarantees persistence.
Code: WC mappings and non-temporal stores
// Kernel: map a PCIe BAR as write-combining.
void __iomem *base = ioremap_wc(bar_phys, bar_len);
// User-space: PROT_WRITE-COMBINE via /dev/mem (root) or driver-provided mmap.
void *fb = mmap(NULL, fb_size, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);
// On Linux the driver decides the memory type for an mmap'd PCIe BAR;
// i915, amdgpu, and nvidia all pick WC for the aperture / WC-BAR.
// Non-temporal stores trigger WC even on WB memory.
#include <immintrin.h>
void wc_memcpy_4k(void *dst, const void *src) {
const __m256i *s = (const __m256i *)src;
__m256i *d = (__m256i *)dst; // dst must be 32-byte aligned
for (int i = 0; i < 128; i++) { // 128 × 32 = 4096 bytes
__m256i v = _mm256_load_si256(s + i); // normal load
_mm256_stream_si256(d + i, v); // NT store → WC buffer
}
_mm_sfence(); // drain WC buffers, make visible
}
// Doorbell pattern.
void post_descriptor(struct ring *r, struct desc *d) {
int tail = r->tail;
memcpy(&r->entries[tail], d, sizeof(*d)); // body, often WC mapped
_mm_sfence(); // ordering boundary
*r->doorbell_mmio = tail + 1; // single 4-byte write to the ring
_mm_sfence(); // ensure doorbell visible before next op
}
# Find WC ranges in /proc/mtrr (legacy, but still informative).
cat /proc/mtrr
# reg00: base=0xf0000000 (3840MB), size= 256MB, count=1: write-combining
# reg01: ...
# PCIe BARs and their attributes.
lspci -vvv | grep -A2 "VGA compatible"
# Confirm a process's mappings include WC pages.
cat /proc/$(pidof game)/smaps | grep -E '^VmFlags'
# Look for the 'wc' flag in VmFlags.
Performance numbers
- WC buffer width: 64 bytes (one cache line). Drain happens when full or on SFENCE.
- x86 has 6–10 WC buffers per core (Skylake 6, Sunny Cove 6, Zen 3 10, Zen 4 12). With this many buffers the core can stream to several different lines in parallel before the oldest one drains.
- Sixteen 4-byte stores to a 64-byte line on WC memory: 1 PCIe TLP. On UC memory: 16 TLPs. The bus-side cost difference is ~10–20×.
- Non-temporal memcpy at large sizes (>16 MiB) on Skylake-X: ~22 GB/s with NT stores vs ~14 GB/s with regular stores — both DRAM-bandwidth limited, NT wins by avoiding write-back of the destination.
- GPU framebuffer streaming over PCIe: 12 GB/s with WC, 1–2 GB/s with UC. Without WC, software rendering on integrated GPUs would be unviable.
- SFENCE drain cost: ~10–20 cycles on modern x86 (the actual buffer drain runs in the background; the fence just establishes the ordering boundary).
Common pitfalls
- Reading from WC memory. Reads bypass cache and are slow — sometimes hundreds of cycles. Treat WC ranges as write-only; cache the values in normal RAM if you also need to read.
- Forgetting SFENCE before a doorbell. Without it, the device may see the doorbell before the descriptor body. The classic "GPU draws garbage" or "NIC drops packets" symptom.
- Mixing locked and WC stores. A LOCK prefix is a serializing event that drains buffers — fine for correctness, terrible for throughput if it happens inside a streaming loop.
- Partial line eviction. Touching too many distinct cache lines while WC buffers are still being filled can cause partial-line evictions. PCIe accepts them but they're less efficient than full-line bursts. Keep contiguous-line writes contiguous.
- Assuming MTRR-WC and PAT-WC agree. They don't always; PAT bits in the PTE override MTRRs for that page. Linux's ioremap_wc handles this for kernel callers, but driver-managed mappings sometimes get it wrong.
- Using WC for control registers that need strict ordering. A control register that must be observed before another is a UC use case, not WC. The buffer can reorder writes within a line; that breaks strict programming sequences.
Frequently asked questions
What is write-combining?
A CPU mechanism that holds multiple consecutive stores to the same cache line in a 64-byte buffer, then flushes them as one bus transaction. Each core has ~4–10 WC buffers.
How is WC different from WB and UC memory?
WB is cached and coherent (default for RAM). WC bypasses cache but buffers writes to coalesce them. UC is uncached with strict ordering — slowest. WC is fast for write-streaming to MMIO.
What is WC memory used for?
GPU and PCIe MMIO regions. Framebuffer writes, GPU command queue, NIC doorbells, NVMe submission queues — anywhere the CPU streams large amounts of data to a memory-mapped device.
How do I make memory WC?
MTRRs (legacy) and PAT bits in the PTE (modern). Linux: ioremap_wc() in kernel, pgprot_writecombine in user space. NT-stores (_mm_stream_si128) trigger WC even on WB memory.
How big is the write-combining buffer?
One cache line — 64 bytes on x86 and ARM. Skylake 6 buffers/core, Zen 3 10/core. Drain on SFENCE, MFENCE, locked ops, full pool, or line miss.
What ordering guarantees does WC give me?
Weak. Stores within a buffer and across lines can reorder. SFENCE forces a drain and establishes a global-visibility point. Pair NT-store bursts with SFENCE before any consumer read.
Why are non-temporal stores related to write-combining?
NT stores (movnti/movntdq) bypass cache and ride the same WC buffer mechanism — coalesced into 64-byte bursts. Glibc large memcpy and graphics drivers use them for cache-friendly streaming.