Data Structures

Persistent Data Structures

Every update produces a new version while sharing structure — used in Clojure, Git, undo systems

A persistent data structure preserves all previous versions when modified — every update returns a new version while sharing unchanged structure with the original. Partial persistence: only the latest version is mutable; older versions are read-only. Full persistence: any version can be modified. Confluent persistence: versions can be merged. Achieved via path copying (O(log n) extra nodes per update for trees) or fat-node techniques. Driscoll, Sarnak, Sleator, Tarjan formalized the technique in 1989. Foundational in Clojure, Scala (Vector/HashMap), Git's content-addressable model, and undo/redo systems.

  • Update costO(log n) extra nodes
  • Space per versionO(log n) shared
  • Path copyingO(log n)
  • Fat nodeO(1) per update
  • FormalizedDriscoll Sarnak Sleator Tarjan 1989
  • Used inClojure, Git, Scala, undo

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How persistence works

Picture a tree of a million elements. You want to change one leaf. The naive approach is to copy the entire tree — a million allocations, a million pointers rewritten. That's untenable. Persistent data structures take a different route: only the path from root to the changed leaf is rebuilt. Every other subtree is pointed at by both the old root and the new root. Two distinct versions, sharing 99.999% of their bytes.

The classical 1989 paper by Driscoll, Sarnak, Sleator and Tarjan established the formal vocabulary. They distinguished three flavors:

  • Partial persistence. Old versions are read-only. Only the latest can be updated. Versions form a line.
  • Full persistence. Any version can be updated, branching the version history into a tree.
  • Confluent persistence. An update can combine two versions, making the version history a DAG.

And two implementation techniques:

  • Path copying. On update, allocate new copies of every node on the path from root to modified leaf. O(log n) per update, O(log n) extra space per version. Simple, intuitive — the dominant approach in practice.
  • Fat nodes with version stamps. Each node stores a list of (version, value) pairs. On update, append a new entry. O(1) amortized per update — but lookups now need a binary search on version. Used in algorithm theory more than in production code.

Path copying — concretely

Suppose you have a balanced binary search tree with root R, and you want to update key K. Walk down to find K's leaf. On the way back up, allocate a new copy of each node, with one child pointer rewritten to the new path and the other still pointing at the original (shared) subtree. The new root R' points at the new path; R still points at the old. Both are valid trees.

function update(node, key, value):
  if node is null: return new Node(key, value)
  if key < node.key:
    return new Node(node.key, node.value,
                    update(node.left, key, value),
                    node.right)
  if key > node.key:
    return new Node(node.key, node.value,
                    node.left,
                    update(node.right, key, value))
  return new Node(key, value, node.left, node.right)

Each call allocates one node and returns it; the recursion depth is the tree height. For a balanced tree of a million elements, that's about 20 allocations per update. Compared to a mutable structure, that's a real cost — but versions are essentially free to keep around, since they share most of their contents.

Clojure's PersistentVector

Clojure pushes path copying to its limit with a 32-way trie. Each internal node has up to 32 children. To find element i, you treat i as a base-32 number — five hexits index five tree levels. Depth is ceil(log32 n). For a billion elements, depth is six.

On update, path copying allocates six new 32-pointer arrays — about 1.5 KB total. Reading also takes six indirections, but each step touches a 256-byte array that fits in two cache lines, so the constant factor is small. Benchmarks place Clojure's persistent vector at roughly 2-4× the cost of a mutable Java ArrayList for single appends. The gap shrinks for larger structures because both pay similar memory-system costs.

Scala's Vector follows the same blueprint. Haskell's Data.Sequence uses a 2-3 finger tree with O(1) amortized updates at either end and O(log n) random access. Each language picks a different point in the trade-off space.

Git as a persistent file system

Every Git commit is a snapshot of the entire repository, but storage doesn't blow up because Git is structurally a Merkle tree of immutable objects. Each blob (file content), tree (directory listing), and commit is content-addressed by its SHA-1 hash. A directory is a tree node listing entries, each pointing at a blob or sub-tree by hash.

When you change one file, Git allocates a new blob for the new content (new hash), a new tree for the directory containing that blob, then new trees for every ancestor directory all the way to the repository root, then a new commit object referencing that root. Sibling directories don't change — their hashes are referenced by both the old commit and the new commit. That's path copying, just with hashes as pointers.

This is why git switch between branches is instant: the operation is a pointer flip. It's why branch storage is essentially free: branches share their unchanged subtrees. Garbage collection (git gc) reclaims objects no commit references, exactly like a runtime GC reclaims unreachable persistent versions.

Undo, redo, and time-travel

Editor undo stacks are usually implemented as patch lists — record each edit, apply in reverse to undo. Persistent data structures offer an alternative: every edit produces a new version of the document, and the undo stack is a sequence of references to past versions. To undo, point at the previous version. To redo, point at the next.

The trade-off is space: patches are typically smaller than version snapshots, but persistent versions are O(log n) per edit anyway. For text editors editing million-character documents, persistent ropes (concatenable strings as balanced trees) are a real win — every keystroke is a new version, all sharing structure.

Time-travel debugging tools like Elm Reactor, Redux DevTools, and rr exploit the same idea. Hold the entire history of state in memory; jump to any past instant by reseating one pointer. Without structural sharing this would require gigabytes per minute; with it, megabytes.

Why persistence matters

  • Concurrency without locks. A reader sees a consistent snapshot — even if a writer is mutating in parallel, the reader holds a reference to a frozen past version. This is the basis of Clojure's STM and Scala's parallel collections.
  • Time-travel debugging. Hold every past state in memory cheaply.
  • Version control. Git, Mercurial, Pijul all rest on persistent trees of immutable content.
  • Functional purity. Pure-functional languages need persistent data structures by construction — there's no other way to support efficient update without mutation.
  • Speculative evaluation. Try a sequence of edits; commit only one branch; the others fall away as garbage. Used in computational geometry algorithms (Driscoll's persistent search trees).
  • Audit and compliance. Financial and legal systems often need every past state recoverable. Persistence makes that storage-efficient.

Common misconceptions

  • Memory bombs. Persistent doesn't mean every version is fully copied — structural sharing prevents the blowup. A million-version sequence of single-element updates uses O(n + v log n) space, not O(n × v).
  • Always slow. Only a log-factor slower than mutable for typical operations. The constant is small (4-6 indirections in Clojure/Scala) and lookups can be cache-friendly.
  • Same as immutable. Immutability means values can't change; persistence is the additional guarantee that old versions remain accessible. A frozen Java array is immutable but not persistent — there's no efficient update.
  • Requires garbage collection. Persistence works in any language, but reference counting and GC make freeing old versions automatic. Manual memory management requires careful version-lifetime tracking.
  • Append is always O(log n). Many persistent vectors achieve O(1) amortized append by tracking a tail buffer; only when the buffer fills is the trie grown.
  • Hash maps don't apply. Hash array-mapped tries (HAMTs) — the structure behind Clojure's PersistentHashMap — bring persistence to hash-based collections. Updates are O(log32 n) just like vectors.

Frequently asked questions

What is path copying?

Path copying is the simplest technique for making a tree persistent. When you update a node, you allocate a new copy of that node, then a new copy of its parent (pointing at the new child), then a new copy of its grandparent, all the way up to the root. Only the path from root to modified leaf is copied — every other subtree is shared with the original version. For a balanced tree of n elements this is O(log n) new nodes per update. The old root remains valid and points at the old structure; the new root points at the new path plus all the unchanged subtrees.

How is partial vs full persistence different?

Partial persistence allows reads on any past version but only the latest version can be updated — versions form a linear sequence. Full persistence allows updates on any version, so versions form a tree (every update branches off the version it modified). Confluent persistence is even stronger: it allows a single update to combine two existing versions, so versions form a DAG. Each level adds implementation complexity. Most real systems (Clojure collections, Git) only need full persistence; partial persistence is enough for many algorithms work like persistent search trees in computational geometry.

How does Clojure's PersistentVector achieve O(log₃₂ n)?

Clojure's PersistentVector is a 32-way trie — each internal node has up to 32 children. The element at index i is found by treating i as a base-32 number: the digits index successive levels. Depth is ceil(log32 n), so for a billion elements depth is just 6. On update, path copying allocates 6 new 32-element node arrays (around 1.5 KB) — compared to copying the whole vector. The branching factor of 32 is chosen to fit a CPU cache line and make the constant small enough that O(log32 n) feels like O(1) for any practical size.

Are persistent structures slower than mutable?

Yes, but the gap is smaller than people expect. A persistent vector update is O(log32 n) versus O(1) for a mutable array, but for typical n the constant is 4-6 indirections. Benchmarks show Clojure persistent vectors are 2-4x slower than mutable Java ArrayLists for single updates, but cost is amortized cleanly when many small versions are kept. For workloads that need versioning anyway (history, undo, time-travel debugging), mutable structures must explicitly snapshot, often O(n) per snapshot — and persistent structures win.

Why does Git use a Merkle tree of immutable objects?

Every Git object — blob, tree, commit — is content-addressed by SHA-1 (now SHA-256) and immutable. A commit is a tree of trees of blobs; each tree node lists the SHA-1s of its children. Modifying one file produces a new blob, then a new tree node listing that blob, then a new tree for each ancestor directory, then a new commit pointing at the new root. Untouched directories share their old SHA-1 — the new commit literally references the old tree objects unchanged. This is path copying, and it's why git switch between branches is fast and branch storage is cheap.

What is structural sharing?

Structural sharing is the property that two versions of a persistent data structure share the parts that didn't change. If you update one element of a million-element vector, the new version doesn't allocate a million new nodes — it allocates O(log n) new nodes on the path from root to the changed leaf, and all other subtrees are pointed to by both versions. The old version is unaware that a new version exists. Garbage collection reclaims an old version's nodes only when no remaining version references them. Structural sharing is what makes persistence affordable in space.