Data Deduplication

Hash each chunk, store identical chunks once — how backups shrink 10-50×

Data deduplication splits incoming data into chunks, hashes each chunk, and keeps one physical copy per distinct hash. VM backups commonly shrink 10-50×. Veeam, Avamar, and ZFS dedup are standard examples.

  • Chunk size: 4 KB to 1 MB typical
  • Hash: SHA-256 or BLAKE3 (32 bytes)
  • VM backup ratio: 10-50× reduction
  • File backup ratio: 5-20× reduction
  • Chunking: fixed or content-defined (Rabin)
  • Index cost: ~0.5% of stored size in RAM

How deduplication works

Most data isn't original. Across a hundred VM backups, the Windows system files, the patched kernels, and the standard Office binaries are byte-identical. Across a corporate file server's nightly snapshots, 99.9% of files haven't changed since yesterday. Across a research lab's archive, the same reference datasets sit in dozens of subdirectories. Storing these redundantly is paying for the same bytes a hundred times.

Deduplication recognizes the duplicates and stores them once. The procedure is mechanical:

  1. Chunk. Slice incoming data into pieces of 4 KB to 1 MB.
  2. Hash. Compute SHA-256 (or BLAKE3, or another cryptographic hash) of each chunk.
  3. Lookup. Check the global index for that hash.
  4. Branch. If found, just record a pointer (the logical block maps to an existing physical chunk). If not, write the chunk to storage and add it to the index.

The logical-to-physical mapping replaces the data. A 100 GB VM image with 95% duplicated content stores 5 GB of unique chunks plus a tiny manifest of pointers.

file_A:  [chunk1][chunk2][chunk3][chunk4][chunk5]
              ↓     ↓     ↓     ↓     ↓
            hash  hash  hash  hash  hash
              │     │     │     │     │
              └─────┴──┬──┴─────┴─────┘
                       ▼
              ┌─────────────────────────────────────┐
              │ Global hash → physical block index  │
              └─────────────────────────────────────┘

After dedup:
  file_A manifest: [ptr1, ptr2, ptr3, ptr4, ptr5]
  file_B manifest: [ptr1, ptr7, ptr3, ptr8, ptr5]   ← shares 3 chunks with A
  physical store: {ptr1, ptr2, ptr3, ptr4, ptr5, ptr7, ptr8}  ← 7 chunks, not 10
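
The manifest example above can be reproduced in a few lines. This is a minimal sketch, not any product's implementation: it assumes fixed 4-byte chunks so the data fits on one line, and the names `dedup_put`, `index`, and `physical` are made up for illustration.

```python
import hashlib

CHUNK = 4  # illustrative chunk size; real systems use 4 KB to 1 MB

index = {}      # sha256 digest -> physical id
physical = []   # unique chunks, addressed by physical id

def dedup_put(data: bytes) -> list:
    """Chunk, hash, look up, and branch; returns the file's manifest."""
    manifest = []
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        digest = hashlib.sha256(chunk).digest()
        if digest not in index:            # new content: store it once
            index[digest] = len(physical)
            physical.append(chunk)
        manifest.append(index[digest])     # new or duplicate, record a pointer
    return manifest

file_a = b"AAAABBBBCCCCDDDDEEEE"   # five chunks
file_b = b"AAAAXXXXCCCCYYYYEEEE"   # shares chunks 1, 3, and 5 with file_a
manifest_a = dedup_put(file_a)
manifest_b = dedup_put(file_b)
print(manifest_a)       # [0, 1, 2, 3, 4]
print(manifest_b)       # [0, 5, 2, 6, 4]
print(len(physical))    # 7 physical chunks for 10 logical ones
```

The physical ids here are zero-based list positions rather than the `ptr1`…`ptr8` labels in the diagram, but the sharing pattern is the same.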

Fixed-block vs content-defined chunking

Fixed-block chunking slices at every 4 KB (or 64 KB, or whatever). Trivial, fast — but breaks on insertions. Insert one byte at the start of a 1 GB file, every subsequent chunk shifts by one byte, every hash changes, no chunk matches the prior version. Dedup ratio collapses for editable documents.

Content-defined chunking (CDC), built on Rabin fingerprinting and modernized by FastCDC, picks chunk boundaries from the content itself. A rolling hash slides over a small window (48 to 64 bytes is typical); whenever the low N bits of the hash match a target pattern, a boundary is declared. The result: an insertion typically disturbs only one or two chunks before the boundaries re-sync, and every chunk after that hashes exactly as it did before the edit.

Fixed-block, file edited at offset 100:
  original:  [chunk_0..4KB ][chunk_4..8KB ][chunk_8..12KB] …
  edited:    [byte at 100 inserted]
             [chunk_0..4KB']                                     ← every chunk new
             [chunk_4..8KB']
             [chunk_8..12KB']
  → no shared chunks

Content-defined, same edit:
  original:  [chunk_0..3782 ][chunk_3782..7891][chunk_7891..12044] …
  edited:    [chunk_0..3783']  ← only first chunk changed (insertion absorbed)
             [chunk_3783..7892]  ← identical content, new offset, same hash
             …                    matches resume immediately
  → all later chunks reused
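
The two diagrams above can be checked empirically. The following is a rough sketch under several assumptions: seeded random test data, a 16-byte window, a plain polynomial rolling hash standing in for Rabin fingerprinting, and no minimum or maximum chunk size (real chunkers enforce both). All names are illustrative.

```python
import hashlib
import random

WINDOW = 16        # bytes of context that decide a boundary (demo value)
MASK = 0xFF        # cut when the low 8 hash bits are zero: ~256-byte chunks
BASE = 31
MOD = 1 << 32

def fixed_chunks(data, size=1024):
    return [data[i:i + size] for i in range(0, len(data), size)]

def cdc_chunks(data):
    """Cut after byte i when the hash of the last WINDOW bytes hits the pattern."""
    chunks, start, h = [], 0, 0
    top = pow(BASE, WINDOW - 1, MOD)       # weight of the byte leaving the window
    for i, b in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * top) % MOD   # slide: drop the oldest byte
        h = (h * BASE + b) % MOD                      # slide: add the newest byte
        if i + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def digests(chunks):
    return {hashlib.sha256(c).digest() for c in chunks}

rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(64 * 1024))
edited = original[:100] + b"!" + original[100:]    # one byte inserted at offset 100

fixed_old, fixed_new = digests(fixed_chunks(original)), digests(fixed_chunks(edited))
cdc_old, cdc_new = digests(cdc_chunks(original)), digests(cdc_chunks(edited))
print("fixed: shared", len(fixed_old & fixed_new), "of", len(fixed_old))
print("cdc:   shared", len(cdc_old & cdc_new), "of", len(cdc_old))
```

Because a boundary here depends only on the preceding `WINDOW` bytes, the edit can only disturb chunks that overlap the `WINDOW` bytes following the insertion point. Everything further along is cut at the same content positions and hashes identically.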

What dedup actually delivers

  • VM backup ratios of 10-50× are routine — most blocks repeat across guest OSes (Windows patches, common system files, blank-page padding). Production Veeam Backup & Replication and Dell EMC Avamar consistently report this range.
  • File-server backup ratios of 5-20× over 30-day retention windows — small daily changes against an enormous static baseline.
  • Index RAM cost ≈ 0.5% of stored size. 100 TB of dedup'd backup data needs ~500 GB of index RAM for hot lookup. ZFS dedup famously fails at scale on memory-starved boxes.
  • Encrypted or compressed inputs dedup at 1×. Random-looking bytes have no repeat structure. Encrypt before dedup and you lose all benefit.
  • Inline dedup adds 100-300 µs per write (hash computation + index probe). Post-process dedup runs during idle windows; the tradeoff is that writes land at full bandwidth, but storage use spikes until the dedup pass reclaims the duplicates.
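
The inline versus post-process tradeoff in the last bullet comes down to when the hash-and-lookup step runs. A toy comparison, assuming in-memory lists in place of disks and hypothetical function names, shows where the storage spike comes from:

```python
import hashlib

def inline_write(chunks):
    """Hash on the write path; duplicates never reach the disk."""
    index, written = {}, 0
    for c in chunks:
        d = hashlib.sha256(c).digest()
        if d not in index:
            index[d] = c
            written += len(c)
    return written, written          # (peak bytes on disk, final bytes on disk)

def post_process_write(chunks):
    """Land every chunk first, then collapse duplicates in a later pass."""
    landed = list(chunks)                        # full-speed writes, no hashing
    peak = sum(len(c) for c in landed)           # the storage spike
    unique = {hashlib.sha256(c).digest(): c for c in landed}   # deferred dedup pass
    final = sum(len(c) for c in unique.values())
    return peak, final

chunks = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]
print(inline_write(chunks))        # (8192, 8192)
print(post_process_write(chunks))  # (16384, 8192)
```

Both paths end at the same final size; only the peak differs, and the inline path pays for it with hashing on every write.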

Dedup variants — block, file, source, target

Variant           | Block-level inline      | Block-level post-process | File-level (whole-file)    | Source-side            | Target-side            | CDC
Granularity       | 4 KB-1 MB chunks        | 4 KB-1 MB chunks         | Entire files               | Whatever client chunks | Whatever server chunks | Variable (1-128 KB)
Dedup ratio       | 10-50×                  | 10-50×                   | 1.2-2×                     | Same as block (10-50×) | Same as block (10-50×) | 15-60×
Write latency     | +100-300 µs             | 0 (deferred)             | +ms (hash file)            | Network savings        | Server CPU             | +200-500 µs
Bandwidth savings | None on wire            | None on wire             | Moderate (whole-file skip) | Massive (chunks only)  | None on wire           | None on wire
Examples          | ZFS dedup, Pure Storage | Avamar daily window      | Git LFS, Dropbox Snapshot  | Avamar agent           | Data Domain            | BorgBackup, Restic, ZPaq
Best for          | VM, primary storage     | Backup appliance         | Document libraries         | WAN backup             | LAN-attached vault     | Versioned editing

The dimensions are independent: granularity (block vs file), timing (inline vs post-process), and locality (source vs target) combine freely, and fixed-block versus CDC is a further choice. A typical enterprise backup is block-level, inline, and target-side, often with CDC: Veeam, Rubrik, Cohesity, Data Domain.
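
To make the source-side column concrete, here is a sketch of the exchange it implies: the client hashes locally, asks which hashes are unknown, and ships only those chunks. The `Target` class and its `missing`/`upload` methods are hypothetical stand-ins, not any vendor's protocol, and manifests are omitted for brevity.

```python
import hashlib

class Target:
    """Hypothetical backup server holding the chunk store."""
    def __init__(self):
        self.chunks = {}                       # digest -> chunk bytes

    def missing(self, digests):
        """Return the digests the server does not have yet."""
        return [d for d in digests if d not in self.chunks]

    def upload(self, chunks):
        for c in chunks:
            self.chunks[hashlib.sha256(c).digest()] = c

def source_side_backup(target, data, chunk_size=4096):
    """Client hashes locally and ships only chunks the target lacks.
    Returns the number of chunk bytes that crossed the wire."""
    by_digest = {}
    for i in range(0, len(data), chunk_size):
        c = data[i:i + chunk_size]
        by_digest[hashlib.sha256(c).digest()] = c
    needed = target.missing(list(by_digest))   # small message: hashes only
    target.upload(by_digest[d] for d in needed)
    return sum(len(by_digest[d]) for d in needed)

target = Target()
monday = b"".join(bytes([i]) * 4096 for i in range(8))      # 8 distinct chunks
tuesday = monday[:-4096] + b"\xff" * 4096                    # last chunk changed
print(source_side_backup(target, monday))    # 32768
print(source_side_backup(target, tuesday))   # 4096
```

Target-side dedup would send all 32768 bytes both days and discard the duplicates on arrival, which is why its bandwidth column reads "None on wire".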

Python — toy CDC deduplicator

import hashlib
from typing import Iterator

class CDCChunker:
    """Content-defined chunking via simple rolling polynomial hash."""
    MIN_CHUNK = 4 * 1024
    AVG_CHUNK = 16 * 1024
    MAX_CHUNK = 64 * 1024
    WINDOW = 48
    # Mask of (log2(AVG_CHUNK)) bits; boundary when (hash & mask) == 0
    MASK = AVG_CHUNK - 1
    PRIME = 31

    def chunk(self, data: bytes) -> Iterator[bytes]:
        n = len(data)
        start = 0
        i = self.MIN_CHUNK              # never cut before MIN_CHUNK bytes
        while i < n:
            if i - start >= self.MAX_CHUNK:
                # Forced cut: no content boundary found within MAX_CHUNK.
                yield data[start:i]
                start = i
                i += self.MIN_CHUNK
                continue
            # Polynomial hash of the last WINDOW bytes. Recomputed from
            # scratch at each position for clarity; a true rolling hash
            # would update it in O(1) per byte.
            h = 0
            for b in data[max(0, i - self.WINDOW):i]:
                h = (h * self.PRIME + b) & 0xFFFFFFFF
            if (h & self.MASK) == 0:
                yield data[start:i]
                start = i
                i += self.MIN_CHUNK
            else:
                i += 1
        if start < n:
            yield data[start:]

class DedupStore:
    def __init__(self):
        self.index: dict[bytes, int] = {}     # sha256 → physical id
        self.physical: list[bytes] = []        # chunks
        self.files: dict[str, list[int]] = {}  # filename → list of physical ids

    def put(self, name: str, data: bytes, chunker=None):
        chunker = chunker or CDCChunker()  # build the default per call, not at def time
        manifest = []
        for chunk in chunker.chunk(data):
            h = hashlib.sha256(chunk).digest()
            if h in self.index:
                manifest.append(self.index[h])    # duplicate — pointer only
            else:
                pid = len(self.physical)
                self.physical.append(chunk)
                self.index[h] = pid
                manifest.append(pid)
        self.files[name] = manifest

    def get(self, name: str) -> bytes:
        return b''.join(self.physical[pid] for pid in self.files[name])

    def stats(self) -> dict:
        logical = sum(len(self.physical[p]) for ids in self.files.values() for p in ids)
        physical = sum(len(c) for c in self.physical)
        return {'logical': logical, 'physical': physical,
                'ratio': logical / physical if physical else 0}

Production systems often replace SHA-256 with BLAKE3 (much faster, with a comparable 256-bit digest), keep the physical store on log-structured disk, and shard the index across SSDs. The shape of the algorithm is identical.

JavaScript dedup example

// Browser-side dedup using the Web Crypto API
async function dedupAddFile(store, name, fileBytes, chunkSize = 8192) {
  const manifest = [];
  for (let i = 0; i < fileBytes.length; i += chunkSize) {
    const chunk = fileBytes.slice(i, i + chunkSize);
    const hashBuf = await crypto.subtle.digest('SHA-256', chunk);
    const hashHex = [...new Uint8Array(hashBuf)]
      .map(b => b.toString(16).padStart(2, '0')).join('');
    if (store.index.has(hashHex)) {
      manifest.push(store.index.get(hashHex));      // duplicate
    } else {
      const pid = store.physical.length;
      store.physical.push(chunk);
      store.index.set(hashHex, pid);
      manifest.push(pid);
    }
  }
  store.files.set(name, manifest);
  return manifest;
}

async function dedupGet(store, name) {
  const manifest = store.files.get(name);
  const parts = manifest.map(pid => store.physical[pid]);
  return new Blob(parts);
}

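// Usage: top-level await needs an ES module; fileBytes1 and fileBytes2 are
// assumed to be Uint8Arrays loaded elsewhere (e.g. from File.arrayBuffer()).
const store = { index: new Map(), physical: [], files: new Map() };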
await dedupAddFile(store, 'vm-disk-1.img', fileBytes1);
await dedupAddFile(store, 'vm-disk-2.img', fileBytes2);
console.log('dedup ratio:',
  (store.files.get('vm-disk-1.img').length + store.files.get('vm-disk-2.img').length)
  / store.physical.length);

Production systems — Veeam, Avamar, Data Domain, ZFS

Veeam Backup & Replication. Uses target-side block-level dedup with optional source-side WAN acceleration. Default 1 MB blocks; "extra-large" jobs use 4 MB. Typical 10-25× ratios for VM backups.

Dell EMC Avamar. Source-side variable-length CDC. The agent on each backed-up host computes hashes locally and only sends not-yet-stored chunks over the wire — massive WAN bandwidth savings for remote office backup.

Dell EMC Data Domain (PowerProtect DD). Target-side inline dedup with stream-informed segment hashing. Specializes in handling lots of backup streams concurrently; 99th-percentile ratio claims of 30-50× on real datasets.

ZFS dedup. Inline, block-level, recordsize-aligned. Notorious for needing the entire dedup table in RAM; the rule of thumb is 5 GB RAM per TB of dedup'd data. Brilliant on small homelab setups, infamous in production at scale.

BorgBackup, Restic, Kopia. Open-source backup tools using CDC + content-addressable storage. They apply the same techniques as enterprise products, with an on-disk repository format.

Microsoft Storage Spaces Direct dedup. Post-process dedup running on Windows Server file shares. Default chunk 64 KB, scans every 24 hours.

When dedup pays

  • Backup repositories. The single biggest win. 30-day retention × 10 hosts × shared OS files = enormous duplication.
  • VDI / virtual desktops. Hundreds of identical user images. Without dedup, completely unaffordable; with dedup, routine.
  • Email archives. Attachments forwarded among colleagues are stored once.
  • Code or asset repositories. Git already does this (object store keyed by SHA-1). Most CI artifact stores dedup similarly.
  • Document libraries with versioning. Small edits to large files — dedup recovers 80-95% of the storage.

Don't dedup encrypted blobs, unique research data, or media that's been compressed to near entropy — overhead with no return. The CPU cost is real; on hot OLTP volumes it's usually not worth the savings.

Common dedup pitfalls

  • Encrypted then dedup'd. Encrypted ciphertext is statistically random; no chunks repeat. Always dedup before encryption (or use convergent encryption — same plaintext yields same ciphertext).
  • Compressed then dedup'd. Same effect: compressed output looks random, so chunks stop matching. Dedup first, then optionally compress the unique chunks. Order matters.
  • Out-of-memory index. ZFS dedup's classic failure mode: the dedup table grows to many gigabytes, and once it spills out of RAM every write pays a disk lookup, so write latency can rise by as much as 1000×.
  • Fragmentation on read. Each logical file ends up scattered across the physical chunk store. Reading a single file becomes a many-small-IO workload, hurting sequential read throughput.
  • Hash collisions (theoretical). SHA-256 makes this safer than disk failures, but paranoid archival systems pay the bytewise-verify cost on duplicate-hit.
  • Repository corruption is catastrophic. Losing the chunk store means losing every file that references it. Dedup repositories need their own checksumming + redundancy layer (typically erasure coding or replication).
  • Backup chain dependency. Many backup products dedupe against earlier backups, so deleting an old full backup can require rewriting the chain. Plan retention windows carefully.
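
The bytewise-verify step mentioned under hash collisions is cheap to express. A minimal sketch, assuming an in-memory store and hypothetical names (`VerifyingStore`, `CollisionError`); the weak one-byte hash exists only to force a collision for the demo:

```python
import hashlib

class CollisionError(Exception):
    pass

class VerifyingStore:
    """Toy chunk store that compares bytes whenever a hash is already indexed."""
    def __init__(self, hash_fn=lambda b: hashlib.sha256(b).digest()):
        self.hash_fn = hash_fn
        self.chunks = {}                       # digest -> chunk bytes

    def put(self, chunk: bytes) -> bytes:
        digest = self.hash_fn(chunk)
        stored = self.chunks.get(digest)
        if stored is None:
            self.chunks[digest] = chunk        # first sighting: store it
        elif stored != chunk:                  # same hash, different bytes
            raise CollisionError(digest.hex())
        return digest                          # verified duplicate or new chunk

store = VerifyingStore()
store.put(b"hello")
store.put(b"hello")                            # verified duplicate, stored once
print(len(store.chunks))                       # 1

# Force a collision with a deliberately weak 1-byte "hash" to exercise the check.
weak = VerifyingStore(hash_fn=lambda b: b[:1])
weak.put(b"apple")
try:
    weak.put(b"avocado")
except CollisionError:
    print("collision caught")
```

The cost is one read of the stored chunk per duplicate hit, which is why only archival-grade systems tend to pay it.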

Frequently asked questions

How does deduplication actually save space?

Each file is split into chunks (4 KB to 1 MB typical). Every chunk is hashed (SHA-256 or BLAKE3). The hash is looked up in an index — if it already exists, the file's logical block points to the existing physical chunk and no new bytes are stored. Identical chunks across backups, VMs, or users share a single physical copy.

Fixed-block vs content-defined chunking — which wins?

Fixed-block (e.g., 4 KB) is fast but vulnerable to the 'insert shift' problem — inserting one byte in the middle of a file changes every subsequent chunk boundary, so no later chunks match the old version. Content-defined chunking (Rabin fingerprint, FastCDC) picks boundaries where the data itself has a property, so insertions only invalidate one or two chunks. Better dedup ratio at higher CPU cost.

What dedup ratios are realistic?

VM backups: 10-50× (most blocks repeat across guest OSes). File-server backups: 5-20× (long retention windows produce many identical versions). Email archives: 3-10×. Database backups: 2-5×. General mixed file storage: 1.5-3×. Encrypted or compressed data: typically 1× — pre-compression dedup matters.

Inline vs post-process deduplication — which is better?

Inline (compute hash during write, skip duplicates immediately) saves disk write bandwidth but adds latency. Post-process (write all bytes, scan for duplicates later) is faster on the write path but needs the full data to land first. Inline is mandatory for backup appliances where ingest bandwidth is the bottleneck; post-process suits primary storage with idle cycles.

What about hash collisions?

A SHA-256 collision is astronomically unlikely. By the birthday bound, a store would need on the order of 2^128 chunks before a collision becomes likely, and real stores hold far fewer than 2^64. Paranoid deployments still add a byte-for-byte comparison on every duplicate hit for archival data. The chance of silent corruption from a hash collision is many orders of magnitude lower than from disk failure.
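
For a sense of scale, the birthday approximation p ≈ n² / 2^(bits+1) can be evaluated directly. This is a back-of-envelope sketch that is only meaningful while p is small; the function name is illustrative.

```python
import math

def log2_collision_probability(num_chunks: int, hash_bits: int = 256) -> float:
    """Birthday approximation: p ≈ n^2 / 2^(bits + 1), returned as log2(p)."""
    return 2 * math.log2(num_chunks) - (hash_bits + 1)

# 2^40 chunks of 64 KB is 64 PiB of unique data.
print(log2_collision_probability(2**40))    # -177.0, i.e. p ≈ 2^-177
print(log2_collision_probability(2**128))   # -1.0, i.e. p ≈ 1/2
```

The second line is where the 2^128 figure comes from: only at that chunk count does the estimate approach even odds.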

Why is the dedup index so large?

For every chunk stored, the index keeps (hash → physical location). At 32-byte hashes and 64 KB chunks, a 100 TB store needs ~1.6 billion entries, or about 50 GB for the hashes alone, before location pointers and hash-table overhead. Smaller chunks multiply that figure. ZFS dedup is notorious for this: the index must fit in RAM for tolerable lookup latency, gating deployments at small-to-medium scale.
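
The arithmetic is easy to rerun for other chunk sizes. In this sketch the 16-byte location field is an assumed figure for illustration, and binary units (TiB, KiB, GiB) are used so the numbers come out exact:

```python
def index_size_bytes(store_bytes: int, chunk_bytes: int, entry_bytes: int = 32) -> int:
    """Entries = unique chunks; size = entries * bytes kept per entry."""
    entries = store_bytes // chunk_bytes
    return entries * entry_bytes

TIB, KIB, GIB = 2**40, 2**10, 2**30
print(index_size_bytes(100 * TIB, 64 * KIB) / GIB)        # 50.0  (hashes only)
print(index_size_bytes(100 * TIB, 64 * KIB, 48) / GIB)    # 75.0  (hash + assumed 16-byte location)
print(index_size_bytes(100 * TIB, 8 * KIB, 48) / GIB)     # 600.0 (same entry, 8 KB chunks)
```

The last line lands near 0.6% of the stored size, in line with the roughly 0.5% rule of thumb for small-chunk stores.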