Storage Systems
Data Deduplication
Hash each chunk, store identical chunks once — how backups shrink 10-50×
Data deduplication splits stored data into chunks, hashes each one, and keeps a single physical copy of every chunk whose hash has already been seen. VM backups commonly shrink 10-50×. It is a standard feature of Veeam, Avamar, and ZFS.
- Chunk size: 4 KB to 1 MB typical
- Hash: SHA-256 or BLAKE3 (32 bytes)
- VM backup ratio: 10-50× reduction
- File backup ratio: 5-20× reduction
- Chunking: fixed or content-defined (Rabin)
- Index cost: ~0.5% of stored size in RAM
How deduplication works
Most data isn't original. Across a hundred VM backups, the Windows system files, the patched kernels, and the standard Office binaries are byte-identical. Across a corporate file server's nightly snapshots, 99.9% of files haven't changed since yesterday. Across a research lab's archive, the same reference datasets sit in dozens of subdirectories. Storing these redundantly is paying for the same bytes a hundred times.
Deduplication recognizes the duplicates and stores them once. The procedure is mechanical:
- Chunk. Slice incoming data into pieces of 4 KB to 1 MB.
- Hash. Compute SHA-256 (or BLAKE3, or another cryptographic hash) of each chunk.
- Lookup. Check the global index for that hash.
- Branch. If found, just record a pointer (the logical block maps to an existing physical chunk). If not, write the chunk to storage and add it to the index.
The logical-to-physical mapping replaces the data. A 100 GB VM image with 95% duplicated content stores 5 GB of unique chunks plus a tiny manifest of pointers.
file_A:  [chunk1] [chunk2] [chunk3] [chunk4] [chunk5]
            ↓        ↓        ↓        ↓        ↓
           hash     hash     hash     hash     hash
            │        │        │        │        │
            └────────┴────────┼────────┴────────┘
                              ▼
           ┌─────────────────────────────────────┐
           │ Global hash → physical block index  │
           └─────────────────────────────────────┘

After dedup:
  file_A manifest: [ptr1, ptr2, ptr3, ptr4, ptr5]
  file_B manifest: [ptr1, ptr7, ptr3, ptr8, ptr5]              ← shares 3 chunks with A
  physical store:  {ptr1, ptr2, ptr3, ptr4, ptr5, ptr7, ptr8}  ← 7 chunks, not 10
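The diagram can be reproduced in a few lines. This is a minimal sketch, assuming fixed 4-byte chunks and made-up file contents so the manifests stay readable; `put`, `index`, and `physical` are illustrative names, not part of any product.

```python
import hashlib

CHUNK = 4  # toy chunk size in bytes; real systems use 4 KB to 1 MB


def put(index: dict[str, int], physical: list[bytes], data: bytes) -> list[int]:
    """Chunk, hash, look up, then either store the chunk or just point at it."""
    manifest = []
    for off in range(0, len(data), CHUNK):
        chunk = data[off:off + CHUNK]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in index:            # unseen content: write it once
            index[digest] = len(physical)
            physical.append(chunk)
        manifest.append(index[digest])     # always record a pointer
    return manifest


index, physical = {}, []
file_a = b"AAAABBBBCCCCDDDDEEEE"
file_b = b"AAAAXXXXCCCCYYYYEEEE"   # shares chunks 1, 3 and 5 with file_a
manifest_a = put(index, physical, file_a)
manifest_b = put(index, physical, file_b)
print(manifest_a)      # [0, 1, 2, 3, 4]
print(manifest_b)      # [0, 5, 2, 6, 4]
print(len(physical))   # 7
```

Ten logical chunks map onto seven physical ones, which matches the manifests in the diagram.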
Fixed-block vs content-defined chunking
Fixed-block chunking slices at every 4 KB (or 64 KB, or whatever size is configured). It is trivial and fast, but it breaks on insertions. Insert one byte at the start of a 1 GB file and every subsequent chunk shifts by one byte, every hash changes, and no chunk matches the prior version. The dedup ratio collapses for editable documents.
Content-defined chunking (CDC), first built on Rabin fingerprinting and later modernized by FastCDC, picks chunk boundaries based on the content itself. A rolling hash slides over a small window (48 or 64 bytes is common); whenever the low N bits of the hash match a target pattern, a boundary is declared, which gives an average chunk size of about 2^N bytes. The result: an insertion disturbs at most one or two chunks before the boundaries re-sync, and every chunk after that hashes exactly as it did before the edit.
Fixed-block, file edited at offset 100:
  original: [chunk_0..4KB ][chunk_4..8KB ][chunk_8..12KB] …
  edited:   [byte at 100 inserted]
            [chunk_0..4KB']     ← every chunk new
            [chunk_4..8KB']
            [chunk_8..12KB']
  → no shared chunks

Content-defined, same edit:
  original: [chunk_0..3782 ][chunk_3782..7891][chunk_7891..12044] …
  edited:   [chunk_0..3783']      ← only first chunk changed (insertion absorbed)
            [chunk_3783..7892]    ← identical content, new offset, same hash
            … matches resume immediately
  → all later chunks reused
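The same comparison can be run on real bytes. The sketch below is an illustration under stated assumptions: the input is 64 KB of seeded pseudo-random data, the chunker uses a gear-style rolling hash (a simplification in the spirit of FastCDC, not its actual algorithm), and minimum and maximum chunk sizes are left out to keep it short.

```python
import hashlib
import random

rng = random.Random(0)
GEAR = [rng.getrandbits(32) for _ in range(256)]   # one random value per byte
MASK = (1 << 10) - 1                               # low 10 bits: ~1 KB average chunk


def fixed_chunks(data: bytes, size: int = 4096) -> list[bytes]:
    return [data[i:i + size] for i in range(0, len(data), size)]


def cdc_chunks(data: bytes) -> list[bytes]:
    """Cut after any byte where the low bits of the rolling hash are all zero."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        # Shifting left ages older bytes out of the low bits, so the cut
        # decision depends only on the last few bytes of content.
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF
        if (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def shared(old: list[bytes], new: list[bytes]) -> int:
    """Number of distinct chunks in `new` that already exist in `old`."""
    old_digests = {hashlib.sha256(c).digest() for c in old}
    new_digests = {hashlib.sha256(c).digest() for c in new}
    return len(old_digests & new_digests)


original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
edited = original[:100] + b"!" + original[100:]    # one byte inserted at offset 100

fixed_old, fixed_new = fixed_chunks(original), fixed_chunks(edited)
cdc_old, cdc_new = cdc_chunks(original), cdc_chunks(edited)
print("fixed-block shared:    ", shared(fixed_old, fixed_new), "of", len(fixed_old))
print("content-defined shared:", shared(cdc_old, cdc_new), "of", len(cdc_old))
```

The fixed-block run shares nothing after the insertion, while the content-defined run should lose only the chunk or two around offset 100.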
What dedup actually delivers
- VM backup ratios of 10-50× are routine, because most blocks repeat across guest OSes (Windows patches, common system files, zero-filled blocks). Vendors such as Veeam (Backup & Replication) and Dell EMC (Avamar) cite ratios in this range.
- File-server backup ratios of 5-20× over 30-day retention windows — small daily changes against an enormous static baseline.
- Index RAM cost ≈ 0.5% of stored size. 100 TB of dedup'd backup data needs ~500 GB of index RAM for hot lookup. ZFS dedup famously fails at scale on memory-starved boxes.
- Encrypted or compressed inputs dedup at 1×. Random-looking bytes have no repeat structure. Encrypt before dedup and you lose all benefit.
- Inline dedup adds 100-300 µs per write (hash computation + index probe). Post-process dedup writes at full bandwidth and deduplicates during idle windows; the tradeoff is that storage use spikes until the duplicates are reclaimed.
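The point about compressed inputs can be checked with a small experiment. This is a sketch on a made-up dataset using Python's `zlib`; the record format and the `reused` helper are assumptions for illustration, and the exact counts depend on the compressor.

```python
import hashlib
import zlib

CHUNK = 4096


def chunk_digests(data: bytes) -> list[bytes]:
    return [hashlib.sha256(data[i:i + CHUNK]).digest()
            for i in range(0, len(data), CHUNK)]


def reused(old: bytes, new: bytes) -> tuple[int, int]:
    """Return (chunks of `new` already present in `old`, total chunks of `new`)."""
    known = set(chunk_digests(old))
    new_digests = chunk_digests(new)
    return sum(d in known for d in new_digests), len(new_digests)


# Two backups of the same synthetic dataset; the second differs in one byte.
v1 = b"".join(f"record {i}: value {i * i}\n".encode() for i in range(20000))
changed = bytearray(v1)
changed[10] ^= 0xFF
v2 = bytes(changed)

raw_hits, raw_total = reused(v1, v2)                                 # dedup first
zip_hits, zip_total = reused(zlib.compress(v1), zlib.compress(v2))   # compress first
print(f"dedup on raw data:        {raw_hits}/{raw_total} chunks reused")
print(f"dedup on compressed data: {zip_hits}/{zip_total} chunks reused")
```

On the raw data every chunk except the first is reused. After compression the two streams typically diverge from the changed byte onward, so few or no chunks line up.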
Dedup variants — block, file, source, target
| | Block-level inline | Block-level post-process | File-level (whole-file) | Source-side | Target-side | CDC |
|---|---|---|---|---|---|---|
| Granularity | 4 KB-1 MB chunks | 4 KB-1 MB chunks | Entire files | Whatever the client chunks | Whatever the server chunks | Variable (1-128 KB) |
| Dedup ratio | 10-50× | 10-50× | 1.2-2× | Same as block (10-50×) | Same as block (10-50×) | 15-60× |
| Write latency | +100-300 µs | 0 (deferred) | +ms (hash whole file) | Hashing cost on client CPU | Hashing cost on server CPU | +200-500 µs |
| Bandwidth savings | None on wire | None on wire | Moderate (whole-file skip) | Massive (new chunks only) | None on wire | None on wire |
| Examples | ZFS dedup, Pure Storage | Avamar daily window | Git LFS, Dropbox Snapshot | Avamar agent | Data Domain | BorgBackup, Restic, zpaq |
| Best for | VM, primary storage | Backup appliance | Document libraries | WAN backup | LAN-attached vault | Versioned editing |
The dimensions are independent: granularity (block vs file), timing (inline vs post-process), locality (source vs target), and chunking (fixed vs content-defined) all combine. A typical enterprise backup product is block-level, inline, and target-side, with either fixed or content-defined chunks; Veeam, Rubrik, Cohesity, and Data Domain all fall in this family.
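Source-side dedup is the variant that saves bandwidth, and the exchange behind it is short enough to sketch. The `Server` class and `backup` function below are hypothetical; they model a two-round protocol (send hashes, then send only the missing chunks) with in-memory dictionaries rather than a network.

```python
import hashlib


class Server:
    """Hypothetical backup target that stores chunks keyed by SHA-256 digest."""

    def __init__(self) -> None:
        self.chunks: dict[bytes, bytes] = {}
        self.manifests: dict[str, list[bytes]] = {}

    def missing(self, digests: list[bytes]) -> set[bytes]:
        """Round 1: the client asks which digests the server does not hold."""
        return {d for d in digests if d not in self.chunks}

    def upload(self, name: str, digests: list[bytes],
               new_chunks: dict[bytes, bytes]) -> None:
        """Round 2: the client sends the missing chunks plus the full manifest."""
        self.chunks.update(new_chunks)
        self.manifests[name] = digests


def backup(server: Server, name: str, data: bytes, chunk_size: int = 4096) -> int:
    """Client side. Returns the number of chunk bytes sent over the wire."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    digests = [hashlib.sha256(c).digest() for c in chunks]
    need = server.missing(digests)
    payload = {d: c for d, c in zip(digests, chunks) if d in need}
    server.upload(name, digests, payload)
    return sum(len(c) for c in payload.values())


server = Server()
monday = b"".join(bytes([i]) * 4096 for i in range(4))        # four distinct 4 KB chunks
tuesday = monday[:8192] + b"\xff" * 4096 + monday[12288:]     # third chunk changed
sent_monday = backup(server, "monday", monday)
sent_tuesday = backup(server, "tuesday", tuesday)
print(sent_monday)    # 16384
print(sent_tuesday)   # 4096
```

The second backup ships one chunk instead of four. A target-side design would have sent all 16 KB and discarded the duplicates on arrival.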
Python — toy CDC deduplicator
import hashlib
from typing import Iterator, Optional


class CDCChunker:
    """Content-defined chunking via a simple polynomial window hash."""

    MIN_CHUNK = 4 * 1024
    AVG_CHUNK = 16 * 1024
    MAX_CHUNK = 64 * 1024
    WINDOW = 48
    # log2(AVG_CHUNK) low bits; a boundary is declared when (hash & MASK) == 0
    MASK = AVG_CHUNK - 1
    PRIME = 31

    def chunk(self, data: bytes) -> Iterator[bytes]:
        n = len(data)
        start = 0
        i = self.MIN_CHUNK  # never cut before MIN_CHUNK bytes
        while i < n:
            if i - start >= self.MAX_CHUNK:
                # No natural boundary found: force a cut at MAX_CHUNK.
                yield data[start:i]
                start = i
                i += self.MIN_CHUNK
                continue
            # Hash of the WINDOW bytes ending at i. Recomputed from scratch
            # here for clarity; a real chunker updates it incrementally.
            h = 0
            for b in data[max(0, i - self.WINDOW):i]:
                h = (h * self.PRIME + b) & 0xFFFFFFFF
            if (h & self.MASK) == 0:
                yield data[start:i]
                start = i
                i += self.MIN_CHUNK
            else:
                i += 1
        if start < n:
            yield data[start:]  # trailing partial chunk


class DedupStore:
    def __init__(self):
        self.index: dict[bytes, int] = {}      # sha256 digest → physical id
        self.physical: list[bytes] = []        # unique chunks
        self.files: dict[str, list[int]] = {}  # filename → list of physical ids

    def put(self, name: str, data: bytes, chunker: Optional[CDCChunker] = None):
        chunker = chunker or CDCChunker()
        manifest = []
        for chunk in chunker.chunk(data):
            h = hashlib.sha256(chunk).digest()
            if h in self.index:
                manifest.append(self.index[h])  # duplicate: pointer only
            else:
                pid = len(self.physical)
                self.physical.append(chunk)
                self.index[h] = pid
                manifest.append(pid)
        self.files[name] = manifest

    def get(self, name: str) -> bytes:
        return b''.join(self.physical[pid] for pid in self.files[name])

    def stats(self) -> dict:
        # Logical size counts every reference; physical size counts each chunk once.
        logical = sum(len(self.physical[p]) for ids in self.files.values() for p in ids)
        physical = sum(len(c) for c in self.physical)
        return {'logical': logical, 'physical': physical,
                'ratio': logical / physical if physical else 0}
Production systems often replace SHA-256 with a faster hash such as BLAKE3 (same 256-bit output), keep the physical store on log-structured disk, and shard the index across SSDs. The shape of the algorithm is identical.
JavaScript dedup example
// Browser-side fixed-block dedup using the Web Crypto API.
// store = { index: Map<hashHex, pid>, physical: Uint8Array[], files: Map<name, pid[]> }
async function dedupAddFile(store, name, fileBytes, chunkSize = 8192) {
  const manifest = [];
  for (let i = 0; i < fileBytes.length; i += chunkSize) {
    const chunk = fileBytes.slice(i, i + chunkSize);
    const hashBuf = await crypto.subtle.digest('SHA-256', chunk);
    const hashHex = [...new Uint8Array(hashBuf)]
      .map(b => b.toString(16).padStart(2, '0')).join('');
    if (store.index.has(hashHex)) {
      manifest.push(store.index.get(hashHex)); // duplicate: pointer only
    } else {
      const pid = store.physical.length;
      store.physical.push(chunk);
      store.index.set(hashHex, pid);
      manifest.push(pid);
    }
  }
  store.files.set(name, manifest);
  return manifest;
}

// Reassemble a file from its manifest.
function dedupGet(store, name) {
  const manifest = store.files.get(name);
  const parts = manifest.map(pid => store.physical[pid]);
  return new Blob(parts);
}

// fileBytes1 and fileBytes2 are assumed to be Uint8Arrays loaded elsewhere,
// e.g. new Uint8Array(await file.arrayBuffer()). Top-level await needs an ES module.
const store = { index: new Map(), physical: [], files: new Map() };
await dedupAddFile(store, 'vm-disk-1.img', fileBytes1);
await dedupAddFile(store, 'vm-disk-2.img', fileBytes2);
// Ratio by chunk count: logical chunks referenced / unique chunks stored.
console.log('dedup ratio:',
  (store.files.get('vm-disk-1.img').length + store.files.get('vm-disk-2.img').length)
    / store.physical.length);
Production systems — Veeam, Avamar, Data Domain, ZFS
Veeam Backup & Replication. Uses target-side block-level dedup with optional source-side WAN acceleration. Default 1 MB blocks; "extra-large" jobs use 4 MB. Typical 10-25× ratios for VM backups.
Dell EMC Avamar. Source-side variable-length CDC. The agent on each backed-up host computes hashes locally and only sends not-yet-stored chunks over the wire — massive WAN bandwidth savings for remote office backup.
Dell EMC Data Domain (PowerProtect DD). Target-side inline dedup with stream-informed segment hashing. Specializes in handling many backup streams concurrently; vendor claims reach 30-50× at the high end on real datasets.
ZFS dedup. Inline, block-level, recordsize-aligned. Notorious for needing the entire dedup table in RAM; the usual rule of thumb is about 5 GB of RAM per TB of dedup'd data. Works well on small homelab setups, infamous in production at scale.
BorgBackup, Restic, Kopia. Open-source backup tools using CDC + content-addressable storage. Same techniques as enterprise products, repository-on-disk format.
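Microsoft Windows Server Data Deduplication (also used on Storage Spaces Direct volumes). Post-process dedup running on Windows Server file shares. Chunks are variable-size, averaging about 64 KB, and optimization runs as a scheduled background job.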
When dedup pays
- Backup repositories. The single biggest win. 30-day retention × 10 hosts × shared OS files = enormous duplication.
- VDI / virtual desktops. Hundreds of identical user images. Without dedup, completely unaffordable; with dedup, routine.
- Email archives. Attachments forwarded among colleagues are stored once.
- Code or asset repositories. Git already does this (object store keyed by SHA-1). Most CI artifact stores dedup similarly.
- Document libraries with versioning. Small edits to large files — dedup recovers 80-95% of the storage.
Don't dedup encrypted blobs, unique research data, or media that's been compressed to near entropy — overhead with no return. The CPU cost is real; on hot OLTP volumes it's usually not worth the savings.
Common dedup pitfalls
- Encrypted then dedup'd. Encrypted ciphertext is statistically random; no chunks repeat. Always dedup before encryption (or use convergent encryption — same plaintext yields same ciphertext).
- Compressed then dedup'd. Same effect. Dedup first, then optionally compress the unique chunks. Order matters.
- Out-of-memory index. ZFS dedup's classic failure mode: the dedup table grows to many gigabytes, and once it spills out of RAM every lookup becomes a disk read, so write latency can rise by orders of magnitude.
- Fragmentation on read. Each logical file ends up scattered across the physical chunk store. Reading a single file becomes a many-small-IO workload, hurting sequential read throughput.
- Hash collisions (theoretical). With SHA-256 a collision is far less likely than an undetected disk error, but paranoid archival systems still pay for a byte-for-byte verify on each duplicate hit.
- Repository corruption is catastrophic. Losing the chunk store means losing every file that references it. Dedup repositories need their own checksumming + redundancy layer (typically erasure coding or replication).
- Backup chain dependency. Many backup products dedupe against earlier backups, so deleting an old full backup can require rewriting the chain. Plan retention windows carefully.
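Convergent encryption, mentioned in the first pitfall, derives the key from the chunk's own hash so that identical plaintext always produces identical ciphertext. The sketch below shows only that property. Its keystream (SHA-256 in counter mode, XORed with the data) is a toy stand-in chosen so the example runs on the standard library; it is an assumption for illustration and not a vetted cipher.

```python
import hashlib


def keystream(key: bytes, length: int) -> bytes:
    """Toy keystream: SHA-256 over key + counter. Illustration only."""
    out = bytearray()
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])


def convergent_encrypt(chunk: bytes) -> tuple[bytes, bytes]:
    """Key = hash of the plaintext, so equal chunks encrypt to equal bytes."""
    key = hashlib.sha256(chunk).digest()
    cipher = bytes(a ^ b for a, b in zip(chunk, keystream(key, len(chunk))))
    return key, cipher


def convergent_decrypt(key: bytes, cipher: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(cipher, keystream(key, len(cipher))))


k1, c1 = convergent_encrypt(b"same chunk of plaintext")
k2, c2 = convergent_encrypt(b"same chunk of plaintext")
k3, c3 = convergent_encrypt(b"different chunk of text")
print(c1 == c2)                     # True: ciphertexts dedup like plaintext
print(c1 == c3)                     # False
print(convergent_decrypt(k1, c1))   # b'same chunk of plaintext'
```

The cost is that anyone who already holds a given plaintext can confirm whether the store contains it, since they can derive the same key and ciphertext.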
Frequently asked questions
How does deduplication actually save space?
Each file is split into chunks (4 KB to 1 MB typical). Every chunk is hashed (SHA-256 or BLAKE3). The hash is looked up in an index — if it already exists, the file's logical block points to the existing physical chunk and no new bytes are stored. Identical chunks across backups, VMs, or users share a single physical copy.
Fixed-block vs content-defined chunking — which wins?
Fixed-block (e.g., 4 KB) is fast but vulnerable to the 'insert shift' problem — inserting one byte in the middle of a file changes every subsequent chunk boundary, so no later chunks match the old version. Content-defined chunking (Rabin fingerprint, FastCDC) picks boundaries where the data itself has a property, so insertions only invalidate one or two chunks. Better dedup ratio at higher CPU cost.
What dedup ratios are realistic?
VM backups: 10-50× (most blocks repeat across guest OSes). File-server backups: 5-20× (long retention windows produce many identical versions). Email archives: 3-10×. Database backups: 2-5×. General mixed file storage: 1.5-3×. Encrypted or compressed data: typically about 1×, which is why dedup has to run before compression or encryption.
Inline vs post-process deduplication — which is better?
Inline (compute the hash during the write, skip duplicates immediately) saves disk write bandwidth and capacity but adds latency. Post-process (write all bytes, scan for duplicates later) is faster on the write path but needs room for the full data to land first. Inline is the usual choice for backup appliances, where ingest bandwidth is the bottleneck; post-process suits primary storage with idle cycles.
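The capacity difference between the two modes shows up even in a toy model. The functions below are illustrative assumptions, not any product's write path; each returns the peak and final number of bytes held.

```python
import hashlib


def digest(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()


def inline_write(blocks: list[bytes]) -> tuple[int, int]:
    """Hash on the write path; duplicates never reach disk."""
    stored: dict[bytes, bytes] = {}
    for block in blocks:
        stored.setdefault(digest(block), block)   # keep the first copy only
    size = sum(len(b) for b in stored.values())
    return size, size                             # peak equals final


def post_process_write(blocks: list[bytes]) -> tuple[int, int]:
    """Land everything first, then reclaim duplicates in a later pass."""
    landed = list(blocks)                         # ingest: no hashing at all
    peak = sum(len(b) for b in landed)
    unique = {digest(b): b for b in landed}       # background dedup pass
    return peak, sum(len(b) for b in unique.values())


blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]
print(inline_write(blocks))         # (8192, 8192)
print(post_process_write(blocks))   # (16384, 8192)
```

Both end at the same size; only the post-process path needs twice the space in the meantime.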
What about hash collisions?
A SHA-256 collision is astronomically unlikely. By the birthday bound, the chance of any collision among n stored chunks is roughly n^2 / 2^257, so it would take on the order of 2^128 chunks before a collision became likely. Paranoid deployments still add a byte-for-byte comparison on every duplicate hit for archival data. The chance of silent corruption from a hash collision is many orders of magnitude lower than from disk failure.
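To put a number on it, the standard birthday-bound approximation p ≈ n^2 / 2^(bits + 1) can be evaluated directly. The chunk count below is an assumed example (about 1.6 billion chunks, roughly a 100 TB store at 64 KB per chunk).

```python
def collision_probability(num_chunks: int, hash_bits: int = 256) -> float:
    """Birthday-bound approximation, valid while the result is much less than 1."""
    return num_chunks ** 2 / 2 ** (hash_bits + 1)


chunks = 1_600_000_000
p = collision_probability(chunks)
print(p)   # roughly 1.1e-59
```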
Why is the dedup index so large?
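For every chunk stored, the index keeps a (hash → physical location) entry. At 32-byte hashes and 64 KB chunks, a 100 TB store needs roughly 1.6 billion entries, or about 50 GB for the hashes alone. Smaller chunks multiply that: at 8 KB chunks the same store needs around 12 billion entries, which is where the ~0.5% rule of thumb comes from. ZFS dedup is notorious for this: the index must fit in RAM for tolerable lookup latency, which limits deployments to small-to-medium scale.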
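The arithmetic is easy to rerun for other chunk sizes. In this sketch `index_bytes` is an illustrative helper, TB is taken as 10^12 bytes (so the entry count comes out nearer 1.5 billion than 1.6), and the 40-byte entry in the second case (32-byte hash plus an 8-byte location) is an assumption.

```python
def index_bytes(stored_bytes: int, chunk_size: int, entry_bytes: int) -> int:
    """One index entry per chunk, times the size of one entry."""
    entries = -(-stored_bytes // chunk_size)   # ceiling division
    return entries * entry_bytes


TB = 10 ** 12
KIB = 1024
store = 100 * TB

# 64 KB chunks, 32 bytes per entry (hash only).
print(index_bytes(store, 64 * KIB, 32) / 10 ** 9)   # about 49 GB
# 8 KB chunks, assumed 40 bytes per entry (hash plus location).
print(index_bytes(store, 8 * KIB, 40) / 10 ** 9)    # about 488 GB
```

The second figure is just under 0.5% of the 100 TB stored, so the rule of thumb corresponds to small chunks with a compact entry.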