Distributed Systems
CRDTs (Conflict-free Replicated Data Types)
Replicas that always agree without ever asking each other
CRDTs are data structures whose replicas converge to the same state under any order of merges. They turn distributed conflict resolution from a runtime problem into a math problem — making collaborative editing, offline-first apps, and AP databases possible without consensus.
- ConvergenceStrong eventual — guaranteed
- Coordination costNone at write time
- Merge cost (typical)O(N) — O(log N) for sequence CRDTs
- Two familiesState-based (CvRDT), op-based (CmRDT)
- Used inYjs, Automerge, Riak, Redis, Figma, Linear
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How CRDTs work
Every distributed application that lets users edit shared data offline faces the same question: when two devices come back online with conflicting changes, how do you reconcile them? You can pick a winner (last-writer-wins, losing data), ask the user (annoying), or run consensus before every write (slow). CRDTs offer a fourth path: design the data structure so that any way of reconciling concurrent updates produces the same final state, on every replica, no coordination required.
The mathematical trick is a join-semilattice. Define a partial order on states and a merge operator that returns the least upper bound of two states. As long as merge is associative, commutative, and idempotent, replicas that have seen the same set of updates — in any order, with any duplication — converge to the same state. That's it. Pick any partial order with a join operation that satisfies those three properties and you have a state-based CRDT.
Two flavours dominate practice:
- State-based (CvRDT). A replica's full state is the unit of replication. Merges are pairwise joins. Send the whole state, accept the whole state, take the lub. Network resilient — duplicates and reorderings don't matter.
- Operation-based (CmRDT). Operations are the unit of replication; each op runs locally on each replica. Convergence requires that all concurrent ops commute and that the network delivers ops in causal order with exactly-once semantics. Smaller messages, stronger network requirements.
The simplest concrete CRDT is the G-Counter: an array of per-replica counters that only ever increases. Each replica increments its own slot; merge takes the elementwise max; the visible value is the sum. Once you can build a counter that two parties never disagree on, you can compose your way up to sets, maps, and sequences.
When to use CRDTs
- You need offline-first behaviour — the app must keep working without network and reconcile cleanly when it returns. (Linear, Notion, Apple Notes.)
- Multi-master replication where every replica accepts writes — collaborative editors, shopping carts, multiplayer state, IoT counters.
- You can express your data shape as a counter, set, register, map, or sequence; or you can build a custom CRDT with a join-semilattice.
- You can tolerate eventual convergence rather than linearizable reads. Money transfer is the wrong fit; collaborative cursor positions are the right one.
CRDTs are not a fit when ordering of conflicting operations actually matters (financial ledgers, inventory allocation, unique-username assignment) — those need consensus. They are also overkill if you only have a single writer per object; a plain LWW register or last-write-only field is cheaper.
CRDT zoo: G-Counter vs PN-Counter vs G-Set vs OR-Set vs LWW-Register vs Sequence
| G-Counter | PN-Counter | G-Set | OR-Set | LWW-Register | Sequence (RGA / Logoot) | |
|---|---|---|---|---|---|---|
| Operations | increment | increment, decrement | add | add, remove | assign(value, ts) | insert, delete at position |
| State per replica | vector of counters | 2 vectors (P and N) | set | set of (element, unique tag) | (value, timestamp, replica) | tree of position-tagged ops |
| Merge | elementwise max | elementwise max of P and N | set union | union of tags, minus removed | greatest timestamp wins | tree merge by position id |
| Merge cost | O(R) | O(R) | O(N) | O(N) typical | O(1) | O(log N) amortized |
| Tombstones? | No | No | No (no removes) | Yes — removed-tag set | No (overwrite) | Yes — deleted nodes |
| Lossy? | No | No | No | No | Yes (concurrent writes drop) | No |
| Typical use | view counts, likes | balances, score deltas | append-only feeds | shopping carts, contacts | presence, profile fields | collaborative text, lists |
Composition is the real power. A LWW-Map of OR-Sets gives you a mutable map whose values are mergeable sets. An RGA of OR-Sets becomes the data model behind Linear-style multiplayer kanban. Once you have the primitives, building richer types is gluing semilattices together.
What CRDTs actually cost
Per-operation overhead is small but non-zero. A G-Counter increment costs one integer write plus a vector entry per replica — typically 8–16 bytes. An OR-Set add costs the element plus a unique tag (UUID or replica-id-plus-counter, ~16 bytes). For a record with 100 fields, that adds up: a per-field LWW timestamp doubles record size; OR-Set tags can multiply it 4–8×. Yjs achieves better space behaviour for text by sharing position metadata across runs of inserts.
Merge cost is dominated by metadata, not data. Two G-Counters merge in O(R) where R is the number of replicas — fast, even for thousands. Two OR-Sets merge in O(N) where N is the number of unique adds (live + tombstoned). Two sequence CRDTs (RGA, Yjs) merge in O(log N) per operation thanks to tree representations. The painful case is a 24-hour offline laptop catching up on a million edits — whichever way you replicate, you pay roughly the size of the diff.
Tombstone growth is the dominant long-term cost. An OR-Set that never sees a coordinated GC accumulates one tombstone per remove forever; for a chat app where users delete messages routinely, that's gigabytes per year per channel. Production systems compact periodically by running a quick consensus round to agree "everyone has seen everything older than T, drop tombstones." Without that escape hatch, CRDTs leak space monotonically.
Network cost depends on flavour. State-based CRDTs send the full state on every gossip — fine for counters and small sets, prohibitive for documents. Delta-CRDTs (Almeida et al., 2018) send only the changed parts, getting state-based reliability with op-based bandwidth. Op-based CRDTs are bandwidth-minimal at the cost of needing causal-order delivery; if you can't guarantee that, you can't safely use a CmRDT.
JavaScript: G-Counter merge
// G-Counter: replicas only ever increment their own slot
class GCounter {
constructor(replicaId) {
this.id = replicaId;
this.counts = {}; // { replicaId: nonNegativeInt }
}
// Local increment — only on our own slot
increment(amount = 1) {
if (amount < 0) throw new Error("G-Counter is increment-only");
this.counts[this.id] = (this.counts[this.id] ?? 0) + amount;
}
// The visible value is the sum of all replicas' contributions
value() {
return Object.values(this.counts).reduce((a, b) => a + b, 0);
}
// Idempotent, commutative, associative merge: elementwise max
merge(other) {
for (const id of new Set([...Object.keys(this.counts), ...Object.keys(other.counts)])) {
this.counts[id] = Math.max(this.counts[id] ?? 0, other.counts[id] ?? 0);
}
}
}
// === Demo: replicas converge regardless of merge order ===
const a = new GCounter('A');
const b = new GCounter('B');
a.increment(3); // A says 3
b.increment(5); // B says 5
a.merge(b); // A: {A:3, B:5} → 8
b.merge(a); // B: {A:3, B:5} → 8
console.log(a.value(), b.value()); // 8 8 — converged
// Concurrent re-increment then merge from a third replica
a.increment(2); // A: {A:5, B:5}
b.increment(7); // B: {A:3, B:12}
const c = new GCounter('C');
c.merge(a); c.merge(b); // C: {A:5, B:12} → 17
b.merge(c); a.merge(b); // every replica converges to 17
Notice that we never had to coordinate. Every replica independently incremented its own slot; merges in any order produced 17 on every node. The key is that max(max(x, y), z) === max(x, max(y, z)) — associativity — and max(x, x) === x — idempotence — and max(x, y) === max(y, x) — commutativity. Those three properties are the entire mathematical foundation.
Python: PN-Counter and OR-Set
from collections import defaultdict
from uuid import uuid4
class PNCounter:
"""Two G-Counters: one for positive ops, one for negative. Value = P - N."""
def __init__(self, replica_id: str):
self.id = replica_id
self.P = defaultdict(int) # increments
self.N = defaultdict(int) # decrements
def inc(self, amount: int = 1):
if amount < 0: raise ValueError("inc requires non-negative amount")
self.P[self.id] += amount
def dec(self, amount: int = 1):
if amount < 0: raise ValueError("dec requires non-negative amount")
self.N[self.id] += amount
def value(self) -> int:
return sum(self.P.values()) - sum(self.N.values())
def merge(self, other: "PNCounter"):
for k in set(self.P) | set(other.P):
self.P[k] = max(self.P[k], other.P[k])
for k in set(self.N) | set(other.N):
self.N[k] = max(self.N[k], other.N[k])
class ORSet:
"""Observed-Remove Set: each add gets a unique tag; removes target tags."""
def __init__(self):
self.adds: set = set() # set of (element, tag)
self.tombstones: set = set() # set of (element, tag) that were removed
def add(self, element):
self.adds.add((element, uuid4().hex))
def remove(self, element):
# Tombstone every (element, tag) currently visible
live = {(e, t) for (e, t) in self.adds if e == element and (e, t) not in self.tombstones}
self.tombstones |= live
def contains(self, element) -> bool:
return any(e == element and (e, t) not in self.tombstones for (e, t) in self.adds)
def merge(self, other: "ORSet"):
self.adds |= other.adds
self.tombstones |= other.tombstones
# === Concurrent add/remove on different replicas ===
a, b = ORSet(), ORSet()
a.add("milk") # A adds "milk" with tag t1
a.merge(b); b.merge(a) # both see "milk"
a.remove("milk") # A tombstones t1
b.add("milk") # B concurrently adds "milk" with tag t2
a.merge(b); b.merge(a)
print(a.contains("milk"), b.contains("milk")) # True True — concurrent add wins
# An LWW set would have lost one of the two updates depending on timestamps.
# OR-Set preserves the intent: "the add I didn't see when I removed survives."
The OR-Set's behaviour during a concurrent add and remove is the test that separates a real CRDT from an ad-hoc "merge by union" hack. Because the remove only tombstones tags it has observed, an unrelated concurrent add survives. That matches user intent in almost every collaborative-editing scenario — if you didn't know about the new item when you pressed delete, your delete shouldn't kill it. Achieving that with a plain set requires consensus or LWW; the OR-Set does it with metadata only.
Variants and extensions
- Delta-CRDTs (Almeida et al., 2018). Send only the local "delta" from the last sync rather than full state — bandwidth like op-based, robustness like state-based. Used in Yjs and Automerge to support multi-megabyte documents efficiently.
- Sequence CRDTs. RGA, Logoot, Treedoc, and Yjs's underlying structure all solve the same problem: ordered sequences with concurrent inserts. Each insert assigns a unique fractional position id; merges interleave positions deterministically.
- Last-Writer-Wins variants. The simplest possible "register" CRDT — pick the value with the highest timestamp. Convergent but lossy. Surprisingly common in production for fields where conflict is rare and acceptable when it happens.
- Operational Transform (OT). The pre-CRDT approach to collaborative editing, used by Google Docs. Operations are transformed against concurrent ones to produce equivalent ops on each replica. Powerful but requires a central server to enforce ordering; CRDTs have largely supplanted OT for new systems because they don't.
- Causal-Counter CRDTs. A G-Counter combined with a vector clock so the value monotonically increases per causal lineage — useful when you want to view a counter "as of a point in causal time."
- Riak Datatypes. Production-hardened versions of Maps, Sets, Counters, Registers, and Flags. The map-of-maps pattern lets you build nested CRDT documents without designing your own join-semilattice.
- Pure-op CRDTs. An alternative formulation where operations carry just enough metadata to commute even under unreliable delivery — bridges the gap between strict op-based requirements and real-world networks.
Common bugs and edge cases
- Tombstones bloating the OR-Set. Every removed (element, tag) pair must persist forever, otherwise a delayed concurrent add with the same tag could resurrect it. Without periodic compaction, sets grow monotonically. Production systems run a stop-the-world quiescence-based GC every hour or every gigabyte.
- Forgetting that LWW silently drops data. A LWW-Register is convergent but not lossless — the loser of a timestamp tie evaporates. For shopping-cart contents, this is a customer-support nightmare; use OR-Sets where lossiness matters.
- Decrementing a G-Counter. By design, a G-Counter only goes up. Wrapping a "subtract" by reaching into another replica's slot breaks the merge invariant — you'll get spurious resets after the next merge from a peer that didn't see the subtraction.
- Op-based CRDT used over an unreliable transport. Op-based CRDTs assume causal-order, exactly-once delivery. UDP gossip won't do; you need a reliable causal broadcast layer (vector-clock-tagged messages with retransmits). Ignoring this gives intermittent divergence that's almost impossible to debug.
- Sequence CRDTs with insufficient position-id entropy. Logoot-style position ids that grow monotonically can become huge over time as concurrent edits force ever-finer fractional positions. Yjs's "items with origin pointers" formulation avoids this by using a tree, but RGA and Logoot literature is full of position-id-bloat war stories.
- Composing CRDTs incorrectly. A LWW-Map whose values are OR-Sets is a CRDT only if you use the OR-Set's merge inside the map's merge — naively replacing the value field with "whichever has the higher timestamp" loses every concurrent set update. Composition requires lifting the inner CRDT's merge into the outer one's join.
Frequently asked questions
Do CRDTs replace consensus algorithms like Raft and Paxos?
No. CRDTs and consensus solve different problems. Consensus gives you a single agreed-upon order of operations across nodes — you need it for transferring money, allocating unique sequence numbers, or anything where the order of conflicting operations matters. CRDTs give you eventual convergence under any order, but only for operations that are designed to commute. You can't represent "transfer $100 from A to B if A has the funds" as a CRDT, because the answer depends on the order of withdrawals.
What's the difference between state-based and operation-based CRDTs?
State-based CRDTs (CvRDTs) propagate the entire local state to other replicas; the merge function is a join over a join-semilattice — idempotent, associative, commutative. Operation-based CRDTs (CmRDTs) propagate individual operations and require reliable causal-order delivery. CmRDTs send less data per change but need a stronger network primitive (causal broadcast). Most production systems are state-based or "delta-CRDTs" that send only state diffs to get the best of both worlds.
Why do OR-Sets accumulate tombstones?
An OR-Set tags every add with a unique id; a remove records "remove the adds I currently know about". To stay convergent under reordering, you cannot delete the removed-tags eagerly — a delayed add for the same element with an older tag-id would resurrect the value otherwise. So removed tags persist as tombstones forever (or until a coordinated garbage-collection round). For long-running collaborative documents, tombstones can dwarf the live data; Yjs and Automerge use periodic compaction passes.
Are CRDTs slower than CRDTless replication?
They're usually larger but not slower. Merge cost is O(N) for most CRDTs, where N is the number of items. Sequence CRDTs (RGA, Logoot, Yjs's structure) achieve O(log N) merges by treating the document as a tree. Per-update overhead is typically a few bytes of metadata — a unique tag, a vector-clock entry, or a position id. The dominant cost in production is the unbounded growth of metadata over time, not per-operation latency.
Can CRDTs guarantee strong consistency?
They guarantee strong eventual consistency: replicas that have observed the same set of updates are mathematically guaranteed to be in the same state, regardless of update order. That is strictly stronger than plain eventual consistency (which gives no convergence guarantee in the face of reordering). It is strictly weaker than linearizability — a CRDT cannot prevent two clients from concurrently committing operations that, in a serializable system, would have ordered one after the other.
How do CRDTs handle text editing in real time?
Sequence CRDTs (RGA, Logoot, Yjs, Treedoc) treat the document as a tree of position-tagged characters. Each insert assigns a unique fractional position id between its neighbors; the position id encodes causal ordering and breaks ties by replica id. Concurrent inserts at the same logical position get distinct ids and merge deterministically. Modern editors like Linear, Notion, and Figma use sequence CRDTs to power offline-first collaborative editing without a central server during the edit.