Memory Architecture
ECC Memory
Hamming SEC-DED on every DRAM access — the silent defender against bit rot
ECC memory adds 8 parity bits per 64-bit word, enough to correct any single-bit flip and detect any double-bit flip on every read. The standard in servers and workstations.
- Word width: 72 bits (64 data + 8 ECC)
- Code: Hamming SEC-DED (72,64)
- Overhead: 12.5% extra DRAM
- Corrects: Any 1-bit flip silently
- Detects: Any 2-bit flip (no correction)
- Found in: Servers, workstations, Apple Silicon
How ECC memory works
DRAM is fragile. A capacitor leaks; an alpha particle from chip packaging strikes a cell; a cosmic-ray neutron deposits enough charge to flip a bit. The memory looks fine — the value just changed. Without ECC, your program sees the corrupted value and carries on. ECC notices.
The mechanism is a Hamming SEC-DED code. For every 64-bit data word the controller writes to DRAM, it computes 8 extra check bits and stores them alongside in dedicated DRAM chips. On every read, it recomputes those 8 bits and XORs them with the stored copy. The result is the syndrome, an 8-bit number interpreted as follows:
- Zero means no error (or a 4+ bit error that fooled the code).
- Non-zero, odd weight means a single bit flipped, and the syndrome identifies which one. The controller flips it back, returns clean data, and logs an event.
- Non-zero, even weight means at least two bits flipped. The controller can't tell which two, so it raises a machine-check exception, and the OS panics or terminates the offending process.
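This three-way response is why the standard is called SEC-DED: Single Error Correction, Double Error Detection. The (72, 64) Hamming code carries exactly the redundancy needed: 7 bits to locate one error among 64+7=71 positions, plus 1 overall parity bit to separate single from double errors. Strictly, the odd/even syndrome-weight test is a property of the odd-weight-column (Hsiao) arrangement common in hardware; in the textbook layout with a separate overall parity bit, that bit alone distinguishes an odd number of flips from an even one.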
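The "7 + 1" count follows from the Hamming bound: an r-bit syndrome must be able to name every one of the k + r codeword positions plus the "no error" case. A minimal sketch of that arithmetic, where the function name `sec_ded_check_bits` is made up for this illustration:

```python
def sec_ded_check_bits(data_bits: int) -> int:
    """Check bits needed by an extended-Hamming SEC-DED code over `data_bits` data bits."""
    r = 1
    # Single-error correction: 2**r syndromes must cover data_bits + r
    # possible error positions plus the all-zero "no error" syndrome.
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall-parity bit adds double-error detection


for k in (8, 16, 32, 64, 128):
    c = sec_ded_check_bits(k)
    print(f"{k:>3} data bits -> {c} check bits ({c / k:.1%} overhead)")
```

For 64 data bits this gives 8 check bits (12.5%). Wider words amortize the cost better, which is one reason the code is applied per 64-bit word rather than per byte.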
The 72-bit DIMM
A non-ECC DDR DIMM has eight ×8 DRAM chips, total width 64 bits. An ECC DIMM has nine ×8 chips, total 72 bits — the ninth chip stores the check bits. You can spot ECC modules visually: an extra chip on each side, often unlabeled. Registered ECC (RDIMM) adds a register chip between controller and DRAM to fan out signals for high-capacity servers; load-reduced ECC (LRDIMM) goes further with a buffer.
Non-ECC DIMM: [chip 0][chip 1]...[chip 7] 64 bits per access
ECC DIMM: [chip 0][chip 1]...[chip 7][chip ECC] 72 bits per access
data bits: 64
check bits: 8 ← 12.5% overhead
total: 72
Hamming(72,64) corrects 1 bit, detects 2.
The ninth chip is the entire cost of ECC. At ~$5-15 per DIMM extra, on a server with 256 GB of RAM, ECC adds roughly $50-200. The cheapest insurance any datacenter buys.
What the numbers actually look like
- SEC-DED corrects 1 bit, detects 2 in a 72-bit word. Triple-bit errors can be miscorrected (mistaken for a single-bit flip in a different position). In practice such events are dominated by row-level DRAM failures and trapped by Chipkill.
- Google fleet study (Schroeder et al., 2009): 25,000-70,000 correctable errors per billion device-hours per Mbit. Averaged over the fleet, that is a few hundred corrected errors per day for a 64 GB server, though errors cluster on a minority of DIMMs; an uncorrectable double event occurs roughly once per 5-10 years per machine.
- 12.5% storage overhead. 8 check bits per 64-bit word. The ninth chip is read in parallel with the other eight, so a read needs no extra DRAM access and the bandwidth penalty is negligible.
- 2-3% latency penalty from syndrome computation in the memory controller. For most workloads this is below noise.
- Apple Silicon (M1/M2/M3/M4) includes on-die ECC silently — no separate "ECC SKU." Older Intel desktop CPUs (without Xeon branding) typically lack ECC support; AMD Ryzen supports ECC on most chips but motherboard validation is hit-or-miss.
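To turn the fleet error rate into a per-machine figure, the sketch below converts a rate into errors per day. It assumes the rate is expressed as errors per billion device-hours per Mbit (the unit used in the abstract of the 2009 study) and treats 1 GB as 2^30 bytes. The function name is invented for this example, and the result is a fleet-wide mean, not a prediction for any one machine.

```python
MBIT_PER_GIB = 1024 * 8  # 1 GiB = 8192 Mbit


def corrected_errors_per_day(capacity_gib: float, errors_per_1e9_hours_per_mbit: float) -> float:
    """Mean correctable errors per day for a machine with the given DRAM capacity."""
    errors_per_hour = errors_per_1e9_hours_per_mbit * capacity_gib * MBIT_PER_GIB / 1e9
    return errors_per_hour * 24


for rate in (25_000, 70_000):
    per_day = corrected_errors_per_day(64, rate)
    print(f"{rate:>6} -> about {per_day:.0f} corrected errors per day on 64 GiB")
```

Because errors concentrate on a small share of DIMMs, most healthy machines sit well below this mean while a few faulty ones sit far above it.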
ECC variants in production
| | Plain ECC (SEC-DED) | Chipkill | Advanced ECC | Parity | None | On-die ECC (DDR5) |
|---|---|---|---|---|---|---|
| Detects | 1 or 2 bit flips | Full-chip failure | Multi-bit + chip | 1 bit flip | Nothing | 1 bit per ×4 burst |
| Corrects | 1 bit | 1 full chip | 2-4 bits | None | None | 1 bit per ×4 burst |
| Overhead | 12.5% | 12.5%-25% | 25%-50% | ~12.5% | 0% | Internal — invisible |
| Width | 72-bit DIMM | 72-bit + interleave | 144-bit | 72-bit (older) | 64-bit | 64-bit external |
| Used in | Most servers | IBM Power, Xeon-SP | Mainframes, HPC | 1980s-90s PCs | Consumer PCs | All DDR5 modules |
| Latency cost | 2-3% | 3-5% | 5-8% | ~1% | 0% | 1-2% |
Chipkill (IBM, then everyone) interleaves bits across multiple chips so a whole-chip failure looks like single-bit errors per word — still correctable. DDR5 on-die ECC is internal to each chip and doesn't replace external ECC; servers stack the two.
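The interleaving idea can be modeled in a few lines. This is a toy layout, not any vendor's actual Chipkill scheme: it assumes 72 hypothetical ×4 chips, with chip c holding bit c of four separate 72-bit codewords. Production designs often use symbol-based codes instead.

```python
WORD_BITS = 72   # one SEC-DED codeword
CHIP_WIDTH = 4   # each hypothetical chip supplies 4 bits per access


def scatter(words):
    """Spread CHIP_WIDTH codewords over WORD_BITS chips: chip c holds bit c of every word."""
    return [sum(((w >> c) & 1) << k for k, w in enumerate(words))
            for c in range(WORD_BITS)]


def gather(chips):
    """Rebuild the CHIP_WIDTH codewords from the per-chip nibbles."""
    return [sum(((chips[c] >> k) & 1) << c for c in range(WORD_BITS))
            for k in range(CHIP_WIDTH)]


words = [0x0123456789ABCDEF01, 0xFFFFFFFFFFFFFFFFFF, 0, 0xA5A5A5A5A5A5A5A5A5]
chips = scatter(words)
chips[17] ^= 0xF                 # chip 17 fails: all four of its bits are wrong
damaged = gather(chips)
flips = [bin(a ^ b).count("1") for a, b in zip(words, damaged)]
print(flips)  # [1, 1, 1, 1]
```

Each codeword sees exactly one bad bit, so ordinary SEC-DED can repair all four even though an entire chip returned garbage.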
Worked example — a flipped bit, caught
Suppose the CPU writes the byte 0x4F = 0100 1111 to memory address 0x1000. The controller computes 4 check bits for a Hamming(12, 8) toy code and stores all 12 bits. Later a cosmic ray flips bit 3 of the stored byte from 1 to 0, giving 0100 0111 = 0x47.
On read, the controller recomputes the 4 check bits over the corrupted data and XORs against the stored check bits. The syndrome bits encode the binary index of the flipped position within the 12-bit codeword, not within the byte. With check bits at positions 1, 2, 4 and 8 and data bits filling positions 3, 5, 6, 7, 9, 10, 11, 12 in order, data bit 3 sits at codeword position 7, so the syndrome is 0111. The controller flips that position back, returns 0x4F to the CPU, and increments a hardware counter. The OS sees the corrected-error log entry the next time it scans /sys/devices/system/edac/ and may mark the row as suspect after enough flips.
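The worked example can be replayed in code. The sketch assumes the conventional layout in which check bits occupy codeword positions 1, 2, 4, 8 and the data bits fill the remaining positions in order, so bit 3 of the byte lands at codeword position 7. The helper names are invented for this illustration.

```python
PARITY_POSITIONS = (1, 2, 4, 8)
DATA_POSITIONS = [p for p in range(1, 13) if p not in PARITY_POSITIONS]  # 3,5,6,7,9,10,11,12


def encode_12_8(byte: int) -> dict[int, int]:
    """Return {codeword position: bit} for positions 1..12."""
    code = {pos: 0 for pos in range(1, 13)}
    for i, pos in enumerate(DATA_POSITIONS):
        code[pos] = (byte >> i) & 1
    for p in PARITY_POSITIONS:
        # Check bit p makes the XOR of every position whose index contains p equal zero.
        for pos in range(1, 13):
            if pos & p and pos != p:
                code[p] ^= code[pos]
    return code


def syndrome(code: dict[int, int]) -> int:
    """XOR of the indices of all set bits; zero for a valid codeword."""
    s = 0
    for pos, bit in code.items():
        if bit:
            s ^= pos
    return s


code = encode_12_8(0x4F)
code[DATA_POSITIONS[3]] ^= 1          # cosmic ray hits data bit 3 (codeword position 7)
s = syndrome(code)
code[s] ^= 1                          # controller flips the indicated position back
decoded = sum(code[pos] << i for i, pos in enumerate(DATA_POSITIONS))
print(s, hex(decoded))  # 7 0x4f
```

XOR-ing the indices of the set bits is equivalent to recomputing each check bit and comparing it with the stored copy, which is why the syndrome reads directly as a position.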
Python — Hamming(72,64) encode and correct
```python
def hamming_72_64_encode(data: int) -> int:
    """Encode 64 data bits into a 72-bit word (8 ECC bits)."""
    # Bit positions 1..71; positions that are powers of 2 (1,2,4,8,16,32,64) are parity bits.
    # Position 72 is the overall parity bit for DED.
    bits = [0] * 72  # bits[i] holds the value of position i+1
    di = 0
    for i in range(1, 72):
        if (i & (i - 1)) != 0:  # not a power of 2 → data bit
            bits[i - 1] = (data >> di) & 1
            di += 1
    # Compute 7 Hamming parity bits at positions 1,2,4,8,16,32,64
    for p in (1, 2, 4, 8, 16, 32, 64):
        x = 0
        for i in range(1, 72):
            if i & p and i != p:
                x ^= bits[i - 1]
        bits[p - 1] = x
    # Position 72: overall parity (XOR of everything before it) → DED
    bits[71] = sum(bits[:71]) & 1
    return sum(b << i for i, b in enumerate(bits))


def hamming_72_64_decode(word: int) -> tuple[int, str]:
    """Decode a 72-bit word; return (data, status)."""
    bits = [(word >> i) & 1 for i in range(72)]
    # Recompute Hamming syndrome
    syndrome = 0
    for p in (1, 2, 4, 8, 16, 32, 64):
        x = 0
        for i in range(1, 72):
            if i & p:
                x ^= bits[i - 1]
        if x:
            syndrome |= p
    overall = sum(bits) & 1  # parity of the 72-bit word
    if syndrome == 0 and overall == 0:
        status = 'clean'
    elif syndrome != 0 and overall == 1:
        if syndrome <= 71:
            bits[syndrome - 1] ^= 1  # correct single-bit error
            status = f'corrected bit {syndrome}'
        else:
            # Syndrome names a position that does not exist: 3+ bits flipped
            status = 'UNCORRECTABLE: multi-bit error detected'
    elif syndrome != 0 and overall == 0:
        status = 'UNCORRECTABLE: 2-bit error detected'  # double error
    else:
        # syndrome == 0 but overall parity is odd: only position 72 flipped
        status = 'check-bit flip (data intact)'
    # Extract the 64 data bits from the non-power-of-2 positions
    data = 0
    di = 0
    for i in range(1, 72):
        if (i & (i - 1)) != 0:
            data |= bits[i - 1] << di
            di += 1
    return data, status
```
Production controllers do this in pure combinational logic — a fixed XOR tree taking a couple of nanoseconds. The Python is a model, not the implementation.
Beyond plain ECC — Chipkill, Memory Mirroring, RAS
Chipkill. Spreads each word's bits across many chips so a whole DRAM chip failing manifests as one bit-error per word — still correctable. Mandatory in modern Xeon-SP and EPYC server platforms.
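Memory mirroring. The memory controller commits every write to two physical regions (typically mirrored channels); reads are served from one copy and fall back to the other when ECC reports an uncorrectable error. Survives failures that plain ECC can only detect, at 50% capacity cost. Used in mission-critical configs (financial trading, telecom switches).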
Patrol scrubbing. The memory controller proactively reads every cache line on a slow cycle (typically every 24 hours) and corrects soft errors before they accumulate into uncorrectable double-bit events. Critical for long-uptime servers.
Row-hammer mitigation. Repeated access to the same DRAM row can flip bits in adjacent rows. DDR4 introduced TRR (Target Row Refresh); DDR5 mandates RFM (Refresh Management). ECC alone is not enough against engineered Rowhammer attacks.
Spare DIMM / fault-tolerant memory. Server platforms reserve an entire DIMM as a hot spare. When ECC error counts cross a threshold, the platform copies the failing module's contents to the spare and remaps. Production systems run for years on this.
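Patrol scrubbing is cheap because the required read rate is tiny next to DRAM bandwidth. The sketch below works out that rate, assuming 64-byte cache lines and the 24-hour cycle mentioned above. The function name and defaults are illustrative and do not correspond to any real platform setting.

```python
def scrub_rate(capacity_gib: int, period_hours: float = 24.0, line_bytes: int = 64):
    """Return (cache lines per second, MiB per second) to scrub all memory once per period."""
    total_bytes = capacity_gib * 2**30
    seconds = period_hours * 3600
    lines_per_s = total_bytes / line_bytes / seconds
    mib_per_s = total_bytes / 2**20 / seconds
    return lines_per_s, mib_per_s


lines, mib = scrub_rate(256)
print(f"{lines:,.0f} lines/s, {mib:.2f} MiB/s")
```

For a 256 GiB server this comes to roughly 50,000 lines per second, or about 3 MiB/s of background reads.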
When ECC is mandatory
- Databases. A flipped bit in a B-tree page index can return wrong rows for years before discovery. PostgreSQL, MySQL, Oracle, SQL Server — all assume ECC.
- Hypervisors. A bit flip in a page table entry remaps memory between VMs. ESXi and KVM tenants share enough that this becomes a security boundary.
- Compilers and build farms. A flipped bit during compilation embeds a corrupted constant into binaries shipped to millions of users. Google's internal benchmarks justified ECC purely on build correctness.
- Long-running scientific simulations. Multi-day runs with single-precision floats are particularly vulnerable; one corrupted sample propagates through stencil updates.
- Financial transaction processing. One mis-credited cent is a regulatory event. ECC is non-negotiable.
Consumer workloads (games, browsing, video editing) generally tolerate bit flips — a single corrupted pixel in a frame disappears in milliseconds. The shift point is workloads that persist corrupted state.
Common ECC gotchas
- Mixing ECC and non-ECC DIMMs. Most platforms refuse to boot, but a few will silently fall back to non-ECC operation. Check the platform documentation.
- Motherboard ECC enablement. The CPU may support ECC, but the BIOS may ship with it disabled by default on consumer-leaning AM5/Ryzen boards. Always verify in /sys/devices/system/edac/mc/ after boot.
- Correctable-error storms. A single stuck-at DRAM cell generates millions of corrected errors before the page is retired. Without rate limiting, the log floods. Linux EDAC and mcelog deduplicate these.
- Rowhammer. ECC catches a few flips but a determined attacker can flip 3+ bits per word and slip past SEC-DED. Mitigated at the DRAM refresh layer, not in the ECC code.
- Counting "errors" instead of "events." One stuck cell hit a million times reports a million correctable events but represents a single physical failure. Dedupe by physical address before alerting.
- BIOS hiding errors. Some firmware silently logs ECC events to a vendor-specific area without exposing them to the OS. Tools like ipmitool sel, or platform-specific ones (Dell OpenManage, HPE iLO), are required to see them.
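The post-boot check against /sys/devices/system/edac/mc/ can be scripted. This sketch assumes the Linux EDAC sysfs layout, in which each registered controller appears as an mcN directory with ce_count and ue_count files. The function name is invented here. An empty result suggests no EDAC driver has registered a controller, which often (though not always) means ECC is inactive.

```python
from pathlib import Path


def read_edac_counts(root: str = "/sys/devices/system/edac/mc") -> dict[str, tuple[int, int]]:
    """Return {controller: (correctable, uncorrectable)} from EDAC sysfs counters."""
    base = Path(root)
    counts: dict[str, tuple[int, int]] = {}
    if not base.is_dir():
        return counts  # EDAC not present on this system
    for mc in sorted(base.glob("mc[0-9]*")):
        try:
            ce = int((mc / "ce_count").read_text())
            ue = int((mc / "ue_count").read_text())
        except (OSError, ValueError):
            continue  # controller directory without readable counters
        counts[mc.name] = (ce, ue)
    return counts


counts = read_edac_counts()
if not counts:
    print("No EDAC memory controller found; ECC may be disabled or unsupported.")
for name, (ce, ue) in counts.items():
    print(f"{name}: {ce} correctable, {ue} uncorrectable")
```

These are raw event totals per controller, so a single stuck cell can inflate them; they confirm that reporting works but do not replace per-address deduplication.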
Frequently asked questions
What does SEC-DED actually mean?
Single Error Correction, Double Error Detection. The memory controller can transparently fix any one bit flip in a 64-bit word using the 8 ECC bits, and reliably notice (but not correct) any two flips. Three or more flips can be miscorrected; in practice they're vanishingly rare in a single memory word.
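Why three flips can be miscorrected is easy to see in a small model. The sketch assumes the textbook layout: Hamming positions 1..71 plus an overall parity bit, with the syndrome equal to the XOR of the flipped positions. The function `decoder_action` is a made-up name that reports what a decoder would do; it is not a full codec.

```python
def decoder_action(flipped_positions: set[int]) -> str:
    """Describe how a SEC-DED decoder reacts to flips at the given positions (1..71)."""
    syndrome = 0
    for pos in flipped_positions:
        syndrome ^= pos                        # Hamming syndrome = XOR of flipped positions
    odd = len(flipped_positions) % 2 == 1      # overall parity check fails on an odd count
    if syndrome == 0:
        return "flip overall parity bit" if odd else "no error reported"
    if odd and syndrome <= 71:
        return f"flip position {syndrome}"
    return "detected, uncorrectable"


print(decoder_action({9}))           # flip position 9 (correct repair)
print(decoder_action({3, 5}))        # detected, uncorrectable
print(decoder_action({3, 5, 9}))     # flip position 15 (miscorrection: now four bits are wrong)
print(decoder_action({1, 2, 4, 7}))  # no error reported (four flips cancel out)
```

The triple error at positions 3, 5 and 9 is indistinguishable from a single flip at position 15, so the decoder "repairs" a bit that was never wrong.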
Why are ECC DIMMs 72 bits wide instead of 64?
A standard non-ECC DIMM carries 64 data bits per access. ECC adds 8 extra bits (12.5% overhead) that store a Hamming SEC-DED check vector. The extra chip is visible on the module — 9 chips instead of 8. Mainboard and CPU memory controller must both support it; consumer chipsets often don't.
How often do bit flips actually happen?
Google's 2009 study of its fleet found 25,000-70,000 correctable errors per billion device-hours per Mbit, roughly one flip per GB every few hours when averaged over the fleet. At datacenter scale that's millions of silent corruptions per day without ECC. Cosmic rays and alpha decay from chip packaging are the classic soft-error sources, although the same study found evidence that hard, repeating faults account for much of the total.
Is ECC slower than regular memory?
Slightly. The memory controller has to compute the syndrome on every read, which costs around 2-3% in latency. The extra DRAM chip also adds capacitance. Most servers eat the cost gladly — silent corruption is worse than a fractional slowdown.
What's the difference between ECC and parity memory?
Parity uses 1 extra bit per byte (or per word) to detect single-bit errors but cannot correct them. ECC uses a Hamming code with enough redundancy to both detect AND correct single-bit errors. Modern servers all use ECC; parity-only memory died out in the 1990s.
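A short sketch of the limitation, assuming one even-parity bit per byte (the helper name `parity` is chosen for this example):

```python
def parity(byte: int) -> int:
    """Even-parity bit: 1 if the byte has an odd number of set bits."""
    return bin(byte & 0xFF).count("1") & 1


stored = 0x4F
stored_parity = parity(stored)

one_flip = stored ^ 0b0000_1000    # single-bit error
two_flips = stored ^ 0b0001_1000   # double-bit error

print(parity(one_flip) != stored_parity)   # True: detected, but the flipped bit is unknown
print(parity(two_flips) != stored_parity)  # False: the two flips cancel and go unnoticed
```

A mismatch only says that an odd number of bits changed. With no position information there is nothing to correct, and an even number of flips passes the check entirely.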
Do laptops and gaming PCs use ECC?
Rarely. Intel historically gated ECC to Xeon (server) and some workstation chipsets. AMD Ryzen supports ECC on most desktop CPUs unofficially. Apple Silicon uses on-die ECC silently. If your machine runs a database, hypervisor, or compiler workload for hours, ECC pays for itself.