Direct Memory Access (DMA)

Network cards and SSDs read/write RAM directly — the CPU only orchestrates

Direct Memory Access (DMA) is a hardware mechanism that lets peripherals (NICs, SSDs, GPUs) read and write system memory directly, without involving the CPU for each byte. The CPU sets up a DMA descriptor (source, destination, length) and issues a single command; the device's DMA engine transfers the data and raises an interrupt on completion. Without DMA, a 10 Gbps NIC would keep a CPU core fully busy with data copying alone. Modern systems use scatter-gather DMA (multiple non-contiguous regions in one descriptor list), an IOMMU (address translation and memory protection for DMA, which blocks rogue-DMA attacks), and zero-copy paths that combine DMA with memory-mapped buffers (mmap).

  • CPU involvement: setup + completion only
  • Throughput: line-rate (10-100 Gbps)
  • Modes: programmed I/O (PIO) vs DMA
  • Scatter-gather: multi-region descriptor
  • IOMMU: address translation + protection
  • Zero-copy: combines with mmap

Why DMA matters

  • Line-rate networking. A 100 Gbps NIC moves 12.5 GB/s. With PIO that is roughly 3.1 billion four-byte moves per second (each a load plus a store), which no CPU core can sustain alongside real work. DMA lets the NIC's silicon be the data mover; the CPU spends a few microseconds setting up rings and processing completions.
  • NVMe latency. A low-latency PCIe Gen4 NVMe SSD can complete a 4 KB random read in roughly 10 microseconds end-to-end (typical NAND drives take several times longer). Part of that budget is spent in the SSD; the rest is descriptor setup and DMA on the bus. Without DMA, copying the 4 KB through the CPU one uncached register read at a time would dwarf the device latency.
  • GPU compute. CUDA's cudaMemcpy uses DMA between system RAM and GPU HBM. Pinned (page-locked) memory plus DMA approaches PCIe line rate (~63 GB/s theoretical on Gen5 x16, 50+ GB/s in practice); pageable memory is staged through pinned bounce buffers and typically runs at half that or less.
  • Kernel bypass. DPDK and SPDK pin huge pages and let userspace post DMA descriptors directly to the device. This eliminates kernel involvement on every packet or I/O; for simple forwarding, DPDK can sustain 10 GbE line rate at minimum-size frames (14.88 Mpps) on a single core.
  • Audio/video. Sound cards DMA-stream audio buffers to the codec; the CPU only fills new buffer halves. The same model drives every video display via the GPU's display DMA engine ("scanout").
  • Disk encryption / checksumming. Modern controllers do AES, CRC, and parity in the DMA path. The CPU sees only the plaintext data after decryption, with no extra copy.
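
The line-rate argument in the first bullet is simple arithmetic. The sketch below assumes PIO moves one four-byte word per load/store pair and ignores framing overhead; the function name is illustrative, not part of any real API.

```python
def pio_word_moves_per_second(link_gbps: float, word_bytes: int = 4) -> float:
    """Word-sized moves per second a CPU would need to keep up with a link using PIO."""
    bytes_per_second = link_gbps * 1e9 / 8   # bits per second -> bytes per second
    return bytes_per_second / word_bytes


for gbps in (10, 100):
    moves = pio_word_moves_per_second(gbps)
    print(f"{gbps} Gbps = {gbps / 8} GB/s, about {moves / 1e9:.2f} billion 4-byte moves per second")
```

Each of those moves would also be an uncached access to a device register, so the real cost per move is far higher than a normal memory access.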

The DMA lifecycle

A single I/O operation through DMA looks like:

  1. Allocate. Kernel obtains pages and pins them so the memory manager cannot reclaim or migrate them mid-DMA: dma_alloc_coherent() for driver-owned buffers, or get_user_pages() for user buffers.
  2. Map. Kernel calls dma_map_single() / dma_map_sg() to obtain a bus address (an IOVA if IOMMU is on, else a physical address). The IOMMU page table is updated.
  3. Build descriptor. Kernel writes a (bus address, length, flags, completion-queue index) entry into the device's command ring.
  4. Doorbell. Kernel writes to a memory-mapped register to tell the device "new commands available." This is one MMIO write — the only synchronous CPU cost in the data path.
  5. Device DMAs. The device's bus master reads or writes RAM autonomously, using multiple bus transactions that may complete out of order with respect to other commands.
  6. Completion. Device writes a completion entry to the host's completion ring (also via DMA), then raises an MSI-X interrupt or marks the entry for polling.
  7. Unmap. Kernel calls dma_unmap_*() to flush caches if needed and tear down the IOMMU mapping, then releases the pin.
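
The seven steps can be traced in a small userspace model. This is only a sketch: ToyDevice, Descriptor, and dma_send are hypothetical names, RAM is a bytearray, the mapping is an identity mapping (no IOMMU), and the "device" runs synchronously inside the doorbell call rather than in parallel with the CPU.

```python
from dataclasses import dataclass


@dataclass
class Descriptor:
    bus_addr: int   # where the device should read from
    length: int     # how many bytes to transfer
    tag: int        # echoed back in the completion entry


class ToyDevice:
    """Hypothetical bus-master device that reads host RAM on its own."""

    def __init__(self, ram):
        self.ram = ram
        self.command_ring = []      # descriptors posted by the host
        self.completion_ring = []   # tags written back by the device
        self.received = b""

    def doorbell(self):
        # Stands in for the single MMIO write; the device then drains the ring.
        while self.command_ring:
            d = self.command_ring.pop(0)
            self.received += bytes(self.ram[d.bus_addr:d.bus_addr + d.length])
            self.completion_ring.append(d.tag)


def dma_send(ram, device, pinned, offset, payload, tag):
    pinned.add(offset)                                   # 1. allocate and pin
    ram[offset:offset + len(payload)] = payload
    bus_addr = offset                                    # 2. map (identity mapping assumed)
    device.command_ring.append(
        Descriptor(bus_addr, len(payload), tag))         # 3. build descriptor
    device.doorbell()                                    # 4-5. doorbell, device transfers
    completed = device.completion_ring.pop(0)            # 6. completion
    pinned.discard(offset)                               # 7. unmap and release the pin
    return completed


ram = bytearray(4096)
dev = ToyDevice(ram)
pinned = set()
print(dma_send(ram, dev, pinned, 256, b"hello", tag=7), dev.received)
```

The point of the model is the ordering: the buffer stays pinned from before the descriptor is posted until after the completion has been consumed.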

Cache coherency and DMA

On x86 and on many ARMv8 platforms, DMA is cache-coherent: the device's reads see CPU writes that are still in cache, and the device's writes invalidate stale CPU cache lines. The cost is paid by the coherency interconnect (Intel UPI, AMD Infinity Fabric, ARM CCN). On older or embedded systems without hardware coherency, the kernel must explicitly write back or invalidate caches around each DMA; that is what dma_sync_single_for_device() and dma_sync_single_for_cpu() are for. Forgetting these calls produces silent corruption that often shows up only under heavy load.
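
The failure mode on a non-coherent system can be shown with a toy write-back cache. This is a simplified model, not the kernel API: NonCoherentSystem and its methods are hypothetical, and sync_for_device / sync_for_cpu only loosely mirror the write-back and invalidate roles of the real dma_sync_single_* calls.

```python
class NonCoherentSystem:
    """Toy model: a CPU with a write-back cache, and a device that sees only RAM."""

    def __init__(self, size):
        self.ram = bytearray(size)
        self.cache = {}   # address -> (value, dirty flag)

    def cpu_write(self, addr, value):
        self.cache[addr] = (value, True)          # stays in cache until written back

    def cpu_read(self, addr):
        if addr not in self.cache:
            self.cache[addr] = (self.ram[addr], False)
        return self.cache[addr][0]

    def device_read(self, addr):
        return self.ram[addr]                     # the device bypasses the CPU cache

    def device_write(self, addr, value):
        self.ram[addr] = value                    # the cache is not told about this

    def sync_for_device(self, addr, length):
        """Write dirty lines back to RAM so the device sees CPU writes."""
        for a in range(addr, addr + length):
            if a in self.cache and self.cache[a][1]:
                self.ram[a] = self.cache[a][0]
                self.cache[a] = (self.cache[a][0], False)

    def sync_for_cpu(self, addr, length):
        """Drop cached lines so the CPU re-reads what the device wrote."""
        for a in range(addr, addr + length):
            self.cache.pop(a, None)


system = NonCoherentSystem(16)
system.cpu_write(0, 0xAB)
stale = system.device_read(0)        # device still sees the old RAM contents
system.sync_for_device(0, 1)
fresh = system.device_read(0)        # now the write is visible
print(hex(stale), hex(fresh))
```

The same model shows the receive direction: a CPU read cached before the device writes stays stale until sync_for_cpu drops the line.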

Example throughput numbers

  • NVMe Gen4 (line rate). Single drive: 7 GB/s sequential, 1.5 M IOPS random 4K. Eight drives in RAID0 saturate PCIe Gen4 x16 (~31 GB/s).
  • 100 Gbps NIC. 12.5 GB/s sustained, ~150 M packets/sec at 64-byte packets. Without DMA, impossible on any CPU shipping in 2026.
  • GPU PCIe transfer. Pinned memory + DMA: 24-25 GB/s on PCIe Gen4 x16; 50+ GB/s on Gen5. Pageable memory drops to ~10 GB/s.
  • RDMA verbs. Userspace posts a Work Request describing remote and local virtual addresses; the NICs at both ends perform the transfer without involving the remote CPU. Latency is ~1 microsecond, throughput line-rate.
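
The PCIe and packet-rate figures above can be reproduced from first principles. The sketch assumes the 128b/130b line encoding used from PCIe Gen3 onward and 20 bytes of per-frame Ethernet overhead on the wire (preamble, start delimiter, inter-frame gap); the function names are illustrative.

```python
# Per-lane signalling rate in transfers per second, by PCIe generation.
GT_PER_S = {3: 8e9, 4: 16e9, 5: 32e9}


def pcie_gbytes_per_s(gen: int, lanes: int) -> float:
    """Raw one-direction PCIe bandwidth in GB/s, before packet and protocol overhead."""
    return GT_PER_S[gen] * lanes * (128 / 130) / 8 / 1e9


def ethernet_mpps(link_gbps: float, frame_bytes: int = 64) -> float:
    """Maximum frames per second (in millions) at a given frame size."""
    wire_bytes = frame_bytes + 20   # preamble + start delimiter + inter-frame gap
    return link_gbps * 1e9 / (wire_bytes * 8) / 1e6


print(f"Gen4 x16: {pcie_gbytes_per_s(4, 16):.1f} GB/s")
print(f"Gen5 x16: {pcie_gbytes_per_s(5, 16):.1f} GB/s")
print(f"10 GbE, 64-byte frames: {ethernet_mpps(10):.2f} Mpps")
print(f"100 GbE, 64-byte frames: {ethernet_mpps(100):.1f} Mpps")
```

Measured copy rates such as 24-25 GB/s on Gen4 x16 sit below the raw figure because transaction headers and flow control consume part of the link.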

Common misconceptions

  • "DMA is slow because of setup." Setup is microseconds; the data transfer dominates. For large transfers (>= 1 KB) DMA is unambiguously faster than PIO. Only for sub-cache-line operations does PIO break even.
  • "IOMMU adds significant overhead." Modern Intel VT-d and AMD-Vi cache translations in a hardware IOTLB, so a hit is often free or costs low single-digit nanoseconds. The expensive operations are IOTLB misses and invalidations; the kernel batches the latter.
  • "DMA buffers must be physically contiguous." Not since scatter-gather. SG descriptors handle dozens to hundreds of physically-discontiguous pages in one transfer. dma_alloc_coherent() is typically reserved for small, long-lived structures such as descriptor rings.
  • "DMA is automatic." Every byte path through DMA was designed by a driver author — including page pinning, DMA mask validation (32-bit vs 64-bit addressing), and unmapping order. Bugs here cause kernel panics or silent corruption.
  • "DMA is safe by default." Without an IOMMU, any DMA-capable device can access all RAM. This is why "Kernel DMA Protection" and IOMMU strict mode exist.
  • "DMA = kernel bypass." They are different. DMA is the hardware mechanism; kernel bypass is the software architecture (DPDK, SPDK, RDMA verbs) that lets userspace drive DMA without a syscall per operation.

Inside an Ethernet RX path

  • NIC receives a frame, runs Receive Side Scaling to pick a queue, reads the next descriptor from the host's RX ring (DMA read).
  • NIC writes the frame into the buffer pointed to by the descriptor (DMA write), updates the descriptor's status, optionally writes hash/timestamp.
  • NIC raises MSI-X to the CPU that owns this queue; or, if NAPI polling is active, just sets the status bit and the polling thread picks it up next round.
  • Kernel reads the descriptor, hands the buffer (now containing the frame) up the network stack, allocates a fresh buffer, writes a new descriptor — ready for the next frame.
  • At 64-byte packets and 10 Gbps line rate that's 14.88 million round-trips per second through this loop. Modern NICs amortize via interrupt coalescing and batched descriptor writes.
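
The ring handshake above can be sketched as a toy model. RxRing is a hypothetical class, not a real driver structure: a per-slot done flag stands in for the descriptor status bit, frames are assumed to fit in one buffer, and RSS, interrupts, and coalescing are left out.

```python
class RxRing:
    """Toy RX descriptor ring shared between a NIC and its driver."""

    def __init__(self, slots, buf_size):
        self.bufs = [bytearray(buf_size) for _ in range(slots)]
        self.length = [0] * slots
        self.done = [False] * slots   # status bit: set by the NIC, cleared by the driver
        self.nic_idx = 0              # next slot the NIC will fill
        self.drv_idx = 0              # next slot the driver will harvest

    def nic_receive(self, frame):
        i = self.nic_idx
        if self.done[i]:
            return False              # no free descriptor: the frame is dropped
        self.bufs[i][:len(frame)] = frame      # "DMA write" of the frame
        self.length[i] = len(frame)
        self.done[i] = True                    # descriptor status update
        self.nic_idx = (i + 1) % len(self.bufs)
        return True

    def driver_poll(self, budget):
        frames = []
        while len(frames) < budget and self.done[self.drv_idx]:
            i = self.drv_idx
            frames.append(bytes(self.bufs[i][:self.length[i]]))   # hand up the stack
            self.bufs[i] = bytearray(len(self.bufs[i]))           # fresh buffer
            self.done[i] = False                                  # re-arm the descriptor
            self.drv_idx = (i + 1) % len(self.bufs)
        return frames


ring = RxRing(slots=2, buf_size=64)
ring.nic_receive(b"frame-1")
ring.nic_receive(b"frame-2")
print(ring.nic_receive(b"frame-3"))   # ring is full until the driver polls
print(ring.driver_poll(budget=8))
```

The budget argument mirrors the way polling drivers bound the work done per pass, so one busy queue cannot starve the rest of the system.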

Frequently asked questions

What's the difference between PIO and DMA?

Programmed I/O (PIO) means the CPU executes load and store instructions to move each word between memory and a device register, so the CPU is the data mover. DMA means the device has its own bus master that reads or writes RAM directly through the system's memory controller while the CPU does other work. With PIO, a 1500-byte Ethernet frame takes 375 four-byte loads/stores; at a 10 Gbps line rate that saturates a CPU core on the copy alone. With DMA, the CPU writes a single descriptor (a pointer plus length) to the NIC, which then does the entire transfer with zero CPU cycles spent moving bytes. The CPU only handles setup and the completion interrupt.

How does scatter-gather DMA work?

A scatter-gather (SG) descriptor is a list of (physical address, length) pairs rather than a single contiguous range. The DMA engine walks the list, transferring each segment in turn. This matters because a 64 KB user buffer in virtual memory is typically split across 16 physical 4 KB pages — non-contiguous in RAM. Without SG, the kernel would have to copy the data into a contiguous bounce buffer before DMA. With SG, the kernel pins the user pages in place, builds a 16-entry descriptor, and the NIC reads them directly. Modern NICs and NVMe drives support hundreds of SG entries per descriptor, eliminating bounce buffers entirely for normal-size I/O.
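
A minimal sketch of how such a list might be built, assuming 4 KB pages and a hypothetical page_table dict that maps virtual page numbers to physical page numbers. Merging physically adjacent pages into one entry is a common optimization, shown here as an assumption rather than as any specific driver's behavior.

```python
PAGE = 4096


def build_sg_list(virt_addr, length, page_table):
    """Return a list of (physical address, length) segments covering the buffer."""
    sg = []
    while length > 0:
        offset = virt_addr % PAGE
        phys = page_table[virt_addr // PAGE] * PAGE + offset
        chunk = min(PAGE - offset, length)          # never cross a page boundary
        if sg and sg[-1][0] + sg[-1][1] == phys:
            sg[-1] = (sg[-1][0], sg[-1][1] + chunk) # extend a physically adjacent segment
        else:
            sg.append((phys, chunk))
        virt_addr += chunk
        length -= chunk
    return sg


# A 64 KB buffer whose 16 pages landed on scattered physical frames.
scattered = {vpn: 1000 - 2 * vpn for vpn in range(16)}
sg = build_sg_list(0, 64 * 1024, scattered)
print(len(sg), sum(seg_len for _, seg_len in sg))
```

A buffer that does not start on a page boundary simply gets a short first segment and a short last one; the device walks the list the same way.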

What is an IOMMU and why does it matter?

An IOMMU (I/O Memory Management Unit) is the device-side analog of the CPU's MMU: it translates the I/O virtual addresses (IOVAs) that devices issue into physical addresses, with per-device page tables. It serves two purposes. (1) Protection: a malicious or buggy device can only DMA into memory that the OS has explicitly mapped to it, preventing arbitrary RAM overwrites. (2) Virtualization: VMs can pass through PCI devices safely because the IOMMU enforces the guest's view of physical memory. Intel's VT-d and AMD's AMD-Vi are IOMMU implementations. On Linux, intel_iommu=on enables VT-d on kernels that do not enable it by default; the AMD IOMMU driver is enabled by default when the hardware and firmware expose it. Modern hardware makes the per-DMA cost negligible (an IOTLB lookup) compared to the security gain.
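
The per-device translation and protection can be modeled in a few lines. ToyIOMMU and IOMMUFault are hypothetical names for this sketch; real IOMMUs use multi-level page tables and a hardware IOTLB, neither of which is modeled here.

```python
PAGE = 4096


class IOMMUFault(Exception):
    """Raised when a device touches an address it has no mapping for."""


class ToyIOMMU:
    def __init__(self):
        self.tables = {}   # device id -> {IOVA page number: (physical page number, writable)}

    def map(self, device, iova, phys, writable=True):
        self.tables.setdefault(device, {})[iova // PAGE] = (phys // PAGE, writable)

    def unmap(self, device, iova):
        self.tables.get(device, {}).pop(iova // PAGE, None)

    def translate(self, device, iova, write=False):
        entry = self.tables.get(device, {}).get(iova // PAGE)
        if entry is None:
            raise IOMMUFault(f"{device}: no mapping for IOVA {iova:#x}")
        phys_page, writable = entry
        if write and not writable:
            raise IOMMUFault(f"{device}: write to read-only IOVA {iova:#x}")
        return phys_page * PAGE + iova % PAGE


iommu = ToyIOMMU()
iommu.map("nic0", iova=0x10000, phys=0x7F000)
print(hex(iommu.translate("nic0", 0x10010)))
try:
    iommu.translate("rogue-dock", 0x10010)   # a different device has no such mapping
except IOMMUFault as fault:
    print("blocked:", fault)
```

Because the tables are keyed by device, a mapping granted to one device gives no access to any other, which is the property that both the protection and the pass-through use cases rely on.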

How do DMA attacks (Thunderclap, DMA over Thunderbolt) work?

Thunderbolt and PCIe expose direct memory access to plugged-in devices. Without an IOMMU enforcing isolation, a malicious Thunderbolt dock or PCIe card can read arbitrary RAM: kernel keys, password hashes, BitLocker recovery state. The Thunderclap research (2019) demonstrated attacks even on systems with an IOMMU, because operating systems left it disabled or configured it permissively for compatibility and performance. Mitigations: 'Kernel DMA Protection' on Windows, 'IOMMU strict mode' on Linux (which forces an IOTLB flush on every unmap, slower but secure), and Thunderbolt 3 'security levels' that prompt the user before allowing new devices. Many modern laptops ship with DMA protection for external ports enabled by default.

How does zero-copy networking use DMA?

A traditional read() + write() loop goes disk → kernel page cache (DMA from SSD), kernel → userspace (CPU copy), userspace → kernel socket buffer (CPU copy), kernel → NIC (DMA). Zero-copy collapses this: sendfile() and splice() let the kernel post page-cache pages directly in the NIC's TX descriptors, so the NIC DMAs from the page cache with no CPU-driven copy. Combined with TCP segmentation offload (TSO) on the NIC, the CPU does microseconds of work to transmit megabytes. Netflix has reported serving 800 Gbps from a single FreeBSD box this way; Linux io_uring with registered buffers extends the same model to userspace applications. For RDMA NICs, the data does not even traverse kernel memory: the NIC writes directly into the destination's pre-registered userspace buffer.
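
From an application's point of view, the zero-copy path is a single call. The sketch below uses Python's socket.sendfile(), which uses os.sendfile() where the platform provides it and otherwise falls back to ordinary send(). The helper names and the socketpair demo are illustrative only: a local socket pair involves no NIC, so this shows the API shape, not the DMA itself.

```python
import os
import socket
import tempfile


def send_file_zero_copy(path, sock):
    """Send a whole file over a connected stream socket without a userspace buffer."""
    with open(path, "rb") as f:
        return sock.sendfile(f)   # returns the number of bytes sent


def demo(payload=b"x" * 8192):
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(payload)
    sender, receiver = socket.socketpair()
    try:
        sent = send_file_zero_copy(tmp.name, sender)
        sender.close()                      # signal end-of-stream to the receiver
        received = b""
        while chunk := receiver.recv(65536):
            received += chunk
    finally:
        sender.close()
        receiver.close()
        os.unlink(tmp.name)
    return sent, received == payload


if __name__ == "__main__":
    print(demo())
```

Whether the kernel actually avoids every copy depends on the operating system, the socket type, and the NIC's offload support; the application code is the same either way.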

Why does NVMe have its own DMA queues?

Legacy SATA had a single command queue with a 32-deep AHCI ring. NVMe was designed for parallel SSDs that can serve millions of IOPS, so it provides up to 65,535 I/O queue pairs, each up to 65,536 entries deep, mapped into the host's memory. The host driver writes a 64-byte command (opcode, command ID, LBA, length, and PRP or SGL data pointers) into the SQ and rings the SQ tail doorbell; the SSD's controller reads the command via DMA, executes the I/O, writes the completion entry via DMA into the CQ paired with that SQ, and (optionally) raises an MSI-X interrupt. Per-queue isolation lets each CPU core have its own queue, eliminating cross-core lock contention. Combined with polling rather than interrupts on hot paths, modern NVMe drives sustain on the order of one to a few million 4K IOPS per device.
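
A toy model of one submission/completion queue pair shows how the host can detect new completions by polling, without an interrupt. QueuePair is a hypothetical class; the phase-tag scheme follows the general NVMe idea (the controller flips a phase bit on each pass through the completion queue), while doorbell registers, DMA, and completion-queue overflow handling are simplified away.

```python
class QueuePair:
    """Toy NVMe-style submission queue (SQ) and completion queue (CQ)."""

    def __init__(self, depth):
        self.depth = depth
        self.sq = [None] * depth
        self.cq = [(None, 0)] * depth   # (command id, phase); phase starts at 0
        self.sq_tail = 0                # advanced by the host
        self.sq_head = 0                # advanced by the device
        self.cq_tail = 0                # advanced by the device
        self.cq_head = 0                # advanced by the host
        self.dev_phase = 1              # phase the device writes on its current pass
        self.host_phase = 1             # phase the host expects on its current pass

    def submit(self, cid, lba, nblocks):
        nxt = (self.sq_tail + 1) % self.depth
        if nxt == self.sq_head:
            raise RuntimeError("submission queue full")
        self.sq[self.sq_tail] = (cid, lba, nblocks)
        self.sq_tail = nxt              # stands in for the SQ tail doorbell write

    def device_process(self):
        while self.sq_head != self.sq_tail:
            cid, _lba, _nblocks = self.sq[self.sq_head]      # "DMA read" of the command
            self.sq_head = (self.sq_head + 1) % self.depth
            self.cq[self.cq_tail] = (cid, self.dev_phase)    # "DMA write" of the completion
            self.cq_tail = (self.cq_tail + 1) % self.depth
            if self.cq_tail == 0:
                self.dev_phase ^= 1                          # flip phase on wrap-around

    def reap(self):
        done = []
        while self.cq[self.cq_head][1] == self.host_phase:   # new entry iff phase matches
            done.append(self.cq[self.cq_head][0])
            self.cq_head = (self.cq_head + 1) % self.depth
            if self.cq_head == 0:
                self.host_phase ^= 1
        return done


qp = QueuePair(depth=4)
for cid in (1, 2, 3):
    qp.submit(cid, lba=cid * 8, nblocks=8)
qp.device_process()
print(qp.reap())
```

Giving each CPU core its own QueuePair instance is what removes the shared lock: no two cores ever touch the same head or tail index.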