What is write-combining?

Write-combining is a CPU mechanism that holds multiple consecutive stores to the same cache line in a small dedicated buffer, then flushes them as one bus transaction. On x86 each core has a small pool of write-combining buffers (roughly 4 to 12, depending on the microarchitecture), each 64 bytes wide. When a buffer fills, or a fence or serializing event forces a drain, the whole line goes out on the memory bus as one burst instead of as N separate transactions.

How is WC different from WB and UC memory?

WB (writeback) is the default for normal RAM: cached in L1/L2/L3, store-forwarded, fully coherent. WC (write-combining) bypasses caches but buffers writes to coalesce them; reads are uncached and slow. UC (uncached) sends every load and store as its own bus transaction with strict ordering, which makes it the slowest but most deterministic type. The trade-off: WB is fast for reread data, WC is fast for write-streaming to MMIO, UC is needed when ordering must be strict and traffic is small.

What is WC memory used for?

Mostly GPU and PCIe device MMIO regions: framebuffer pixel writes, GPU command-queue submission, and the descriptor rings that sit behind NIC and NVMe doorbells. It fits anywhere the CPU streams large amounts of data to a memory-mapped device and never reads it back. Combining lets PCIe deliver a full line as a single Transaction Layer Packet, far cheaper than per-store UC traffic.

How do I make memory WC?

On x86 the memory type is controlled by MTRRs (legacy) and the PAT (Page Attribute Table) bits in the PTE. In the Linux kernel, ioremap_wc() creates a WC kernel mapping, and a driver's mmap handler applies pgprot_writecombine() to vma->vm_page_prot to hand user space a WC mapping. On modern systems, GPU drivers (i915, amdgpu, nvidia) routinely map BAR1 / aperture pages as WC for fast pixel streaming. There's also a non-temporal-store path: _mm_stream_si128 / movntdq writes bypass cache and use the write-combining buffers even on WB memory.

How big is the write-combining buffer?

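One cache line: 64 bytes on x86 (and on many, though not all, ARM cores). Each core has a small pool of WC buffers (Skylake reports 6, Zen 3 reports 10). When all are in use and a new store maps to a different line, one buffer is evicted, as a single 64-byte burst if the line is complete or as smaller partial writes if it is not. Fences and serializing events (SFENCE, MFENCE, locked instructions, serializing instructions such as CPUID) also drain the buffers; LFENCE orders loads and is not documented as draining them.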

What ordering guarantees does WC give me?

Weaker than normal writeback. Stores to WC memory can be reordered freely within and across buffers; from another agent's view the order of writes within a 64-byte line is undefined until the buffer drains. SFENCE forces a drain and a global-visibility point. Don't depend on store order to WC memory across cache lines without explicit fences — the doorbell trick (one final store fenced to MMIO) is the standard pattern.

Why are non-temporal stores related to write-combining?

Non-temporal stores (movnti, movntps, movntdq, vmovntdq) write to memory without polluting the cache. Internally they go through the same WC buffer mechanism — the CPU coalesces them into 64-byte bursts before sending them out. This is why large memcpy implementations and graphics drivers use NT stores: avoid cache pollution, get write-combined bus efficiency. Always pair NT-store bursts with SFENCE before the consumer reads, because NT-store ordering is weak.
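The "NT stores, then SFENCE, then tell the consumer" rule can be sketched in user space as follows. This is a minimal illustration, not a driver: struct mailbox, mailbox_publish, and mailbox_try_read are hypothetical names, the payload size is arbitrary, and on targets without SSE2 the sketch falls back to a plain memcpy.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si128, _mm_loadu_si128, _mm_sfence */
#endif

/* Hypothetical shared block: a payload followed by a ready flag. */
struct mailbox {
    _Alignas(64) uint8_t payload[256];
    atomic_int ready;
};

/* Fill the payload with NT stores, fence, then publish the flag. */
static void mailbox_publish(struct mailbox *m, const uint8_t *src)
{
    /* _mm_stream_si128 requires a 16-byte-aligned destination. */
    assert(((uintptr_t)m->payload & 15) == 0);
#if defined(__SSE2__)
    for (size_t i = 0; i < sizeof m->payload; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_stream_si128((__m128i *)(m->payload + i), v);
    }
    _mm_sfence();   /* NT stores become globally visible before the flag */
#else
    memcpy(m->payload, src, sizeof m->payload);   /* portable fallback */
#endif
    atomic_store_explicit(&m->ready, 1, memory_order_release);
}

/* Consumer side: returns 1 and copies the payload once it is published. */
static int mailbox_try_read(struct mailbox *m, uint8_t *dst)
{
    if (!atomic_load_explicit(&m->ready, memory_order_acquire))
        return 0;
    memcpy(dst, m->payload, sizeof m->payload);
    return 1;
}
```

Dropping the _mm_sfence() would let the flag become visible while payload bytes are still sitting in a WC buffer, which is exactly the weak ordering described above.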

Write-Combining — WC Memory, 64-Byte Burst

How write-combining works

A CPU writing to normal writeback memory updates a cache line in L1, the line gets marked dirty, and at some later point the coherence protocol writes it back to memory or hands it off to another core. The store is essentially free from the program's view because L1 is fast and the bus traffic is amortized.

That story falls apart for memory-mapped device regions. A GPU's BAR window or a NIC's doorbell register can't be cached: the device needs to see every store, and the cache has no way to prefetch from it. The old solution was UC (Uncached) memory: every store goes out as a separate transaction on the bus, in strict program order. UC is correct but painfully slow — eight 8-byte stores to UC memory means eight bus transactions, each costing 50–200 ns.

Write-combining is the middle ground. WC memory bypasses the cache, but instead of issuing every store immediately the CPU buffers them in a small dedicated 64-byte structure, the WC buffer. Each core has several. When all the bytes of a 64-byte line have been written, or when something forces a drain (SFENCE, MFENCE, a locked instruction, a serializing instruction, or a store to a new line while every buffer in the pool is in use), the entire 64-byte block goes out as one bus transaction. On PCIe that is normally a single 64-byte write Transaction Layer Packet, which fits within the usual max-payload sizes (128, 256, or 512 bytes).

The savings are substantial. Sixteen 4-byte stores to a single 64-byte WC line cost one bus burst instead of sixteen. On a framebuffer or a GPU command queue, that is 16× fewer PCIe transactions, each of which would otherwise carry its own header overhead.
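The coalescing behavior can be illustrated with a toy software model of a single WC buffer. This is only a sketch for counting transactions: the struct and function names are made up for this example, real hardware has several buffers and more drain triggers, and no data is actually moved.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 64u

/* Toy model of one write-combining buffer: a line address plus a
 * byte-valid mask. It counts bus transactions and moves no data. */
struct wc_buf {
    uint64_t line;    /* base address of the line being combined */
    uint64_t mask;    /* bit i set => byte i of the line was written */
    int      open;    /* 1 while the buffer holds pending bytes */
    unsigned bursts;  /* transactions issued so far */
};

/* Send out whatever is pending as one transaction. */
static void wc_flush(struct wc_buf *b)
{
    if (b->open) {
        b->bursts++;
        b->open = 0;
        b->mask = 0;
    }
}

/* Record a store of `len` bytes at `addr` (must not cross a line). */
static void wc_store(struct wc_buf *b, uint64_t addr, unsigned len)
{
    uint64_t line = addr & ~(uint64_t)(LINE_BYTES - 1);
    unsigned off  = (unsigned)(addr - line);

    assert(len > 0 && off + len <= LINE_BYTES);
    if (b->open && b->line != line)
        wc_flush(b);                  /* different line: evict first */
    b->line = line;
    b->open = 1;
    uint64_t bits = (len == 64) ? ~0ull : ((1ull << len) - 1);
    b->mask |= bits << off;
    if (b->mask == ~0ull)
        wc_flush(b);                  /* all 64 bytes valid: full-line burst */
}

/* Count bursts for `n` back-to-back stores of `len` bytes each. */
static unsigned wc_bursts_for(unsigned n, unsigned len)
{
    struct wc_buf b = {0};
    for (unsigned i = 0; i < n; i++)
        wc_store(&b, 0x1000 + (uint64_t)i * len, len);
    wc_flush(&b);                     /* models the trailing SFENCE */
    return b.bursts;
}
```

In this model wc_bursts_for(16, 4) returns 1, whereas the same sixteen stores to UC memory would be sixteen separate transactions.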

WB vs WC vs UC and friends

Type	Cached	Coalescing	Ordering	Typical use
WB · Writeback	Yes (L1/L2/L3)	Yes (via cache)	x86 TSO — total store order	Default for normal RAM
WT · Write-through	Reads cached, writes through	No	TSO	Some legacy ROM regions
WP · Write-protect	Reads cached, writes uncached	No	TSO	Read-only mappings, BIOS
WC · Write-combining	No	Yes (WC buffer)	Weak — needs SFENCE	GPU/PCIe MMIO, framebuffers
UC− · Uncached, MTRR-overridable	No	No	Strict, same as UC	Default for Linux ioremap() on x86; an MTRR can turn it into WC
UC · Uncached	No	No	Strict — each store separately	Control registers requiring strict order

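The PAT (Page Attribute Table) has eight entries, each holding one of these six types; the PAT, PCD, and PWT bits in a page-table entry select an entry, so the OS can pick the type per page. Older systems relied on MTRRs alone, which assign types to physical address ranges. Linux's /proc/mtrr shows those legacy ranges; pgprot_writecombine() sets WC on new mappings.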
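As a rough illustration of how a page-table entry selects a memory type, the sketch below decodes an IA32_PAT value. The function names are invented for this example; the type encodings and the power-on default value follow the Intel SDM as best I recall, and an operating system is free to reprogram the MSR (for example, to place WC in one of the slots).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Memory-type encodings held in the low 3 bits of each PAT entry. */
static const char *pat_type_name(unsigned enc)
{
    switch (enc) {
    case 0:  return "UC";
    case 1:  return "WC";
    case 4:  return "WT";
    case 5:  return "WP";
    case 6:  return "WB";
    case 7:  return "UC-";
    default: return "reserved";
    }
}

/* Pick the PAT entry selected by a PTE's PAT, PCD and PWT bits. */
static const char *pte_memory_type(uint64_t pat_msr, int pat, int pcd, int pwt)
{
    assert((unsigned)(pat | pcd | pwt) <= 1u);       /* each bit is 0 or 1 */
    unsigned idx = (unsigned)(pat << 2 | pcd << 1 | pwt);
    unsigned enc = (unsigned)(pat_msr >> (8 * idx)) & 0x7u;
    return pat_type_name(enc);
}

/* Power-on default: entries 0..3 are WB, WT, UC-, UC, then repeated. */
#define PAT_RESET_DEFAULT 0x0007040600070406ull

/* Illustrative value only: the default with entry 1 reprogrammed to WC. */
#define PAT_EXAMPLE_WITH_WC 0x0007040600070106ull
```

With the reset default, a PTE with PAT=0, PCD=0, PWT=1 resolves to WT; with the illustrative reprogrammed value the same bits resolve to WC, which is the kind of change an OS makes to expose write-combining per page.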

When to choose WC

Streaming writes to a GPU framebuffer or aperture. Software renderers, V4L2 capture buffers, X server pixmaps in shared memory — anything that bulk-writes pixels.
NIC and NVMe submission rings. Write the descriptor body into the WC-mapped ring, issue SFENCE, then write the queue tail (4 or 8 bytes) to the doorbell; the fence guarantees the device sees the complete descriptor before the new tail.
Streaming memcpy to slow targets. Glibc's memcpy switches to non-temporal stores for very large copies (above a threshold derived from the cache size) precisely because NT stores ride the WC buffers.
Persistent memory writes. Intel Optane and similar persistent-memory regions are normally mapped WB, but bulk writes often use non-temporal stores followed by SFENCE to avoid cache pollution; ordinary cached stores need CLWB followed by SFENCE to guarantee persistence.
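The streaming-memcpy case can be sketched as a size-based dispatch. This is not glibc's implementation: NT_THRESHOLD is a hypothetical cutoff chosen for illustration, copy_maybe_streaming is an invented name, and the sketch uses SSE2 intrinsics with a plain memcpy fallback on other targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si128, _mm_loadu_si128, _mm_sfence */
#endif

/* Hypothetical cutoff; real libraries derive theirs from cache sizes. */
#define NT_THRESHOLD ((size_t)1 << 20)

static void *copy_maybe_streaming(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
#if defined(__SSE2__)
    if (n >= NT_THRESHOLD) {
        uint8_t *d = dst;
        const uint8_t *s = src;
        /* Copy the unaligned head normally so NT stores start 16-byte aligned. */
        size_t head = (size_t)(-(uintptr_t)d & 15);
        memcpy(d, s, head);
        d += head; s += head; n -= head;

        size_t body = n & ~(size_t)15;           /* whole 16-byte chunks */
        for (size_t i = 0; i < body; i += 16) {
            __m128i v = _mm_loadu_si128((const __m128i *)(s + i));
            _mm_stream_si128((__m128i *)(d + i), v);   /* bypasses the cache */
        }
        _mm_sfence();                            /* drain WC buffers */
        memcpy(d + body, s + body, n - body);    /* remaining tail bytes */
        return dst;
    }
#endif
    return memcpy(dst, src, n);                  /* small copy: stay in cache */
}
```

Small copies stay on the cached path because the destination is likely to be read again soon; only copies too large to benefit from caching take the NT route.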

Code: WC mappings and non-temporal stores

// Kernel: map a PCIe BAR as write-combining.
void __iomem *base = ioremap_wc(bar_phys, bar_len);

// User space: mmap() has no write-combining flag; the mapping comes from a
// driver-provided mmap (or /dev/mem, root only).
void *fb = mmap(NULL, fb_size, PROT_READ | PROT_WRITE,
                MAP_SHARED, fd, 0);
// On Linux the driver decides the memory type for an mmap'd PCIe BAR;
// i915, amdgpu, and nvidia all pick WC for the aperture / WC-BAR.

// Non-temporal stores trigger WC even on WB memory.
#include <immintrin.h>

// Requires AVX (compile with -mavx); src and dst must both be 32-byte aligned.
void wc_memcpy_4k(void *dst, const void *src) {
    const __m256i *s = (const __m256i *)src;  // aligned load below needs 32-byte alignment
    __m256i       *d = (__m256i *)dst;        // NT store needs 32-byte alignment
    for (int i = 0; i < 128; i++) {           // 128 × 32 = 4096 bytes
        __m256i v = _mm256_load_si256(s + i); // normal (cached) load
        _mm256_stream_si256(d + i, v);        // NT store → WC buffer
    }
    _mm_sfence();                              // drain WC buffers, make visible
}

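// Doorbell pattern (sketch: struct ring, its size field, and struct desc are illustrative).
void post_descriptor(struct ring *r, const struct desc *d) {
    int tail = r->tail;
    memcpy(&r->entries[tail], d, sizeof(*d));  // body, often WC mapped
    _mm_sfence();                              // body must be visible before the doorbell
    tail = (tail + 1) % r->size;               // advance and wrap the tail index
    r->tail = tail;                            // remember it for the next post
    *(volatile uint32_t *)r->doorbell_mmio = (uint32_t)tail;  // single 4-byte write to the device
    _mm_sfence();                              // push the doorbell out if it is WC mapped
}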

# Find WC ranges in /proc/mtrr (legacy, but still informative).
cat /proc/mtrr
# reg00: base=0xf0000000 (3840MB), size= 256MB, count=1: write-combining
# reg01: ...

# PCIe BARs and their attributes.
lspci -vvv | grep -A2 "VGA compatible"

# Confirm which physical ranges are currently mapped WC (x86; needs root and debugfs).
# smaps VmFlags has no WC flag; 'io' and 'pf' only mark device/PFN mappings.
sudo grep write-combining /sys/kernel/debug/x86/pat_memtype_list

Performance numbers

WC buffer width: 64 bytes (one cache line). Drain happens when full or on SFENCE.
x86 has roughly 6–12 WC buffers per core (reported figures: Skylake 6, Sunny Cove 6, Zen 3 10, Zen 4 12). With this many buffers the core can stream to several different lines in parallel before the oldest one drains.
Sixteen 4-byte stores to a 64-byte line on WC memory: 1 PCIe TLP. On UC memory: 16 TLPs. The bus-side cost difference is ~10–20×.
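Non-temporal memcpy at large sizes (>16 MiB) on Skylake-X: ~22 GB/s with NT stores vs ~14 GB/s with regular stores. Both are DRAM-bandwidth limited; NT wins by skipping the read-for-ownership of each destination line, so the destination is written once instead of being read and later written back.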
GPU framebuffer streaming over PCIe: 12 GB/s with WC, 1–2 GB/s with UC. Without WC, software rendering on integrated GPUs would be unviable.
SFENCE cost: ~10–20 cycles on modern x86 when no WC buffers are pending. When buffers are still in flight, stores after the fence wait until they have drained, so the cost grows with the amount of outstanding data.
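The TLP comparison above can be turned into rough link-byte arithmetic. The sketch below assumes a fixed per-TLP overhead (header plus framing and CRC); the 24-byte figure used in the note afterwards is an assumption for illustration, not a measured value, and wire_bytes is an invented helper name.

```c
#include <assert.h>

/* Bytes on the wire to deliver `total` payload bytes when each TLP carries
 * `per_tlp` payload bytes and costs `overhead` bytes of header and framing. */
static unsigned wire_bytes(unsigned total, unsigned per_tlp, unsigned overhead)
{
    assert(per_tlp > 0 && total % per_tlp == 0);
    return (total / per_tlp) * (per_tlp + overhead);
}
```

With an assumed 24 bytes of overhead per TLP, wire_bytes(64, 4, 24) is 448 and wire_bytes(64, 64, 24) is 88, about a 5× difference in link bytes alone. This counts wire bytes only; per-transaction latency and ordering stalls on the UC path are what push the observed gap higher.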

Common pitfalls

Reading from WC memory. Reads bypass cache and are slow — sometimes hundreds of cycles. Treat WC ranges as write-only; cache the values in normal RAM if you also need to read.
Forgetting SFENCE before a doorbell. Without it, the device may see the doorbell before the descriptor body. The classic "GPU draws garbage" or "NIC drops packets" symptom.
Mixing locked and WC stores. A locked instruction drains the WC buffers, which is fine for correctness but terrible for throughput if it happens inside a streaming loop.
Partial line eviction. Touching too many distinct cache lines while WC buffers are still being filled can cause partial-line evictions. PCIe accepts them but they're less efficient than full-line bursts. Keep contiguous-line writes contiguous.
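Assuming MTRR-WC and PAT-WC agree. They don't always; the effective type is a combination of the MTRR type for the physical range and the page's PAT entry (a PAT entry of WC wins over any MTRR type, while a PAT entry of WB defers to the MTRR). Linux's ioremap_wc handles this for kernel callers, but driver-managed mappings sometimes get it wrong.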
Using WC for control registers that need strict ordering. A control register that must be observed before another is a UC use case, not WC. The buffer can reorder writes within a line; that breaks strict programming sequences.
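The partial-line eviction pitfall can be made concrete with a toy model of a pool of WC buffers. Everything here is a simplification made up for illustration: the names, the round-robin victim choice, and the buffer counts are assumptions, and real eviction policies are not documented at this level.

```c
#include <assert.h>
#include <stdint.h>

#define LINE_BYTES 64u
#define MAX_BUFS   16

struct wc_pool {
    uint64_t line[MAX_BUFS];
    uint64_t mask[MAX_BUFS];   /* byte-valid mask; 0 means the slot is free */
    unsigned nbufs;            /* buffers in the modeled core */
    unsigned next_victim;      /* round-robin eviction pointer (assumed policy) */
    unsigned full_bursts;
    unsigned partial_bursts;
};

static void pool_evict(struct wc_pool *p, unsigned i)
{
    if (p->mask[i] == 0) return;
    if (p->mask[i] == ~0ull) p->full_bursts++;
    else                     p->partial_bursts++;
    p->mask[i] = 0;
}

static void pool_store(struct wc_pool *p, uint64_t addr, unsigned len)
{
    uint64_t line = addr & ~(uint64_t)(LINE_BYTES - 1);
    unsigned off  = (unsigned)(addr - line);
    assert(p->nbufs > 0 && p->nbufs <= MAX_BUFS);
    assert(len > 0 && len < 64 && off + len <= LINE_BYTES);

    unsigned slot = p->nbufs;
    for (unsigned i = 0; i < p->nbufs; i++)       /* already combining this line? */
        if (p->mask[i] && p->line[i] == line) { slot = i; break; }
    if (slot == p->nbufs)
        for (unsigned i = 0; i < p->nbufs; i++)   /* otherwise take a free slot */
            if (!p->mask[i]) { slot = i; break; }
    if (slot == p->nbufs) {                       /* pool full: evict a victim */
        slot = p->next_victim;
        p->next_victim = (p->next_victim + 1) % p->nbufs;
        pool_evict(p, slot);
    }
    p->line[slot] = line;
    p->mask[slot] |= ((1ull << len) - 1) << off;
    if (p->mask[slot] == ~0ull)
        pool_evict(p, slot);                      /* complete line: full burst */
}

/* Models SFENCE: drain every pending buffer. */
static void pool_fence(struct wc_pool *p)
{
    for (unsigned i = 0; i < p->nbufs; i++) pool_evict(p, i);
}

/* Write `lines` cache lines in 8-byte stores. With `interleave` set, every
 * line is visited once per pass; otherwise each line is finished first. */
static void pool_run(struct wc_pool *p, unsigned lines, int interleave)
{
    for (unsigned a = 0; a < (interleave ? 8u : lines); a++)
        for (unsigned b = 0; b < (interleave ? lines : 8u); b++) {
            unsigned ln = interleave ? b : a, chunk = interleave ? a : b;
            pool_store(p, (uint64_t)ln * LINE_BYTES + chunk * 8u, 8);
        }
    pool_fence(p);
}
```

In this model, writing 8 lines contiguously through a 4-buffer pool produces 8 full bursts and no partial ones, while interleaving the same stores across all 8 lines produces 64 partial bursts and no full ones. Raising the pool to 8 buffers makes the interleaved pattern combine fully again, which is why keeping the number of concurrently written lines below the buffer count matters.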
