Distributed Systems

Vector Clocks

An array of counters that knows what happened before what

Vector clocks are per-process counter arrays that capture happens-before relationships in a distributed system. They tell you which events causally preceded which, and which were truly concurrent — information a single Lamport timestamp cannot recover.

  • Per-event sizeO(N) ints — N writers
  • Comparison costO(N) per comparison
  • DetectsCausal precedence + concurrency
  • Authored byFidge / Mattern, 1988
  • Used inDynamo, Riak, Voldemort, version-vector replication

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How vector clocks work

In a distributed system, "what happened first?" has no global answer. Different machines see events at different physical times; clocks drift; networks delay messages. Lamport's 1978 paper gave the field a clean abstraction: the happens-before relation. Event A happens-before B if (a) they're on the same process and A came first, (b) A is the send of a message and B is the corresponding receive, or (c) transitively from a chain of those. Pairs of events that satisfy neither A→B nor B→A are concurrent.

A vector clock is a length-N array, one entry per process. Each process maintains its own array. The rules are simple:

  1. On a local event at process i: increment V[i].
  2. On send: attach a copy of V to the outgoing message.
  3. On receive of message with attached vector U: set V[k] = max(V[k], U[k]) for all k, then increment V[i].

To compare two vectors V and W:

  • V < W (V happened-before W) iff V[k] ≤ W[k] for all k AND there exists some k where V[k] < W[k].
  • V == W iff every entry equal — same event.
  • Otherwise V and W are concurrent (notation: V || W) — neither preceded the other.

The third case is where vector clocks earn their keep. A single integer can never tell you "these events have no causal relation" — it can only impose an arbitrary total order. Vector clocks preserve the partial-order truth.

When to use vector clocks

  • You're building an AP database where concurrent writes must be detected and surfaced (Dynamo, Riak, Voldemort).
  • You need to debug "what caused what" across services — distributed tracing, root-cause analysis, replay.
  • You're implementing optimistic concurrency control across replicas without locking.
  • You're studying happens-before formally — vector clocks are the canonical encoding.

If you only need a total order for ID generation or coarse "newer than" queries, a Lamport timestamp or hybrid logical clock is cheaper. If you need millisecond-accurate physical wall-clock time, vector clocks won't give it to you — they only encode logical causality.

Vector clocks vs Lamport vs version vectors vs HLC vs ITC vs dotted version vectors

LamportVector clockVersion vectorHybrid Logical Clock (HLC)Interval Tree ClockDotted Version Vector
Size per timestampO(1)O(N) — one per processO(R) — one per replicaO(1) — physical time + logical ctrAdaptive (grows with concurrency)O(R) + dot
Detects concurrency?NoYesYesNo (only ordering)YesYes
Total or partial orderTotal (arbitrary tie-break)PartialPartialTotalPartialPartial
Tracks process membership change?N/AManually (tombstones)ManuallyN/AYes (fork / join)Better (separate dot)
Approximates wall clock?NoNoNoYes (within clock skew)NoNo
Designed forTotal orderingGeneral causalityReplicated objectsCockroach-style snapshot readsOpen / forking systemsDynamo-style with sibling pruning
OriginatorLamport, 1978Fidge / Mattern, 1988Parker et al., 1983Kulkarni et al., 2014Almeida et al., 2008Preguiça et al., 2010

The right tool depends on what you actually need. Detect concurrent siblings? Vector clock or version vector. Order events for snapshot isolation across regions? HLC. Open system with nodes joining and leaving constantly? Interval tree clocks. None of these is strictly best — they trade size against expressiveness.

What vector clocks actually cost

Per-message overhead is the headline. A 4-writer system with 32-bit counters carries 16 bytes of clock per message — negligible. A 1,000-writer system carries 4 KB. A million-writer system carries 4 MB per message, which is unworkable for a database that handles millions of writes per second. This is the practical reason no AP database lets every client be a "writer" in vector-clock terms — Riak buckets writes through a small fixed set of vnodes, so the clock dimension is bounded by replication factor (typically 3), not by client count.

Comparison cost is O(N) per pair: every dimension must be checked. For 100-dim vectors that's 100 integer comparisons per check, fine; for 10,000-dim vectors it's 10K, painful when you do it on every read. The expensive case is "find all siblings of this object" — N comparisons against M existing siblings = O(N·M).

Storage cost adds up too. Cassandra-style systems store the clock alongside every column value, multiplying per-cell metadata. Riak uses Concise Version Vectors and dot pruning to shrink this. Dynamo's original 2007 paper limited the clock to the most recent 10 writers, accepting that older causal links would be lost.

Tombstone cost is the silent killer. When a writer leaves, you cannot drop its dimension immediately — partitioned replicas may still emit messages stamped with the dead writer's id. A tombstone records "this writer once existed" so its dimension stays in the comparison. Every retired writer is an entry in every clock forever, unless you garbage-collect after a quiescence window. In long-running systems with frequent membership churn, tombstones can dwarf the live-writer dimensions.

JavaScript: increment, merge, and happens-before

// A vector clock keyed by node id, defaulting absent entries to 0
class VectorClock {
  constructor(initial = {}) { this.v = { ...initial }; }

  // Local event on node `id`
  tick(id) { this.v[id] = (this.v[id] ?? 0) + 1; return this; }

  // On receive: take elementwise max with the incoming clock, then tick self
  merge(other, selfId) {
    const keys = new Set([...Object.keys(this.v), ...Object.keys(other.v)]);
    for (const k of keys) {
      this.v[k] = Math.max(this.v[k] ?? 0, other.v[k] ?? 0);
    }
    if (selfId) this.tick(selfId);
    return this;
  }

  // Returns 'before', 'after', 'equal', or 'concurrent'
  compare(other) {
    const keys = new Set([...Object.keys(this.v), ...Object.keys(other.v)]);
    let allLE = true, allGE = true, anyLT = false, anyGT = false;
    for (const k of keys) {
      const a = this.v[k] ?? 0, b = other.v[k] ?? 0;
      if (a < b) { allGE = false; anyLT = true; }
      if (a > b) { allLE = false; anyGT = true; }
    }
    if (allLE && !anyLT && allGE && !anyGT) return 'equal';
    if (allLE && anyLT) return 'before';
    if (allGE && anyGT) return 'after';
    return 'concurrent';
  }
}

// === Demo: detecting concurrent updates ===
const a = new VectorClock();
const b = new VectorClock();
a.tick('A');                                    // {A:1}
b.tick('B');                                    // {B:1} — concurrent with a
console.log(a.compare(b));                      // 'concurrent'

a.merge(b, 'A');                                // a learns about b → {A:2, B:1}
console.log(a.compare(b));                      // 'after' (a observed b)
console.log(b.compare(a));                      // 'before'

The default-to-zero rule for missing keys is what makes the implementation tolerate writers joining over time without rewriting every clock — a missing entry means "this writer hasn't done anything yet I've seen," which is observationally identical to a zero entry.

Python: shopping-cart sibling resolution

from typing import Dict, List, Tuple

class VectorClock(dict):
    def tick(self, node: str) -> "VectorClock":
        self[node] = self.get(node, 0) + 1
        return self

    def merge(self, other: "VectorClock") -> "VectorClock":
        for k in set(self) | set(other):
            self[k] = max(self.get(k, 0), other.get(k, 0))
        return self

    def le(self, other: "VectorClock") -> bool:
        return all(self.get(k, 0) <= other.get(k, 0) for k in set(self) | set(other))

    def concurrent_with(self, other: "VectorClock") -> bool:
        return not self.le(other) and not other.le(self)


def resolve_cart_siblings(siblings: List[Tuple[VectorClock, set]]) -> set:
    """
    Riak-style sibling resolution for a shopping cart represented as a set.
    Concurrent updates are merged by union (this is the cart's CRDT semantics —
    you'd lose deletions; an OR-Set fixes that, but for adds-only this works).
    """
    surviving: List[Tuple[VectorClock, set]] = []
    for clock, value in siblings:
        # Drop any existing sibling that this one strictly succeeds
        surviving = [(c, v) for c, v in surviving if not c.le(clock) or c == clock]
        # Add this one if no existing sibling already supersedes it
        if not any(clock.le(c) and clock != c for c, _ in surviving):
            surviving.append((clock, value))

    # Merge whatever survives via union
    merged_clock = VectorClock()
    merged_value: set = set()
    for c, v in surviving:
        merged_clock.merge(c)
        merged_value |= v
    return merged_value


# === Two clients edit the same cart on different replicas during a partition ===
client_a = (VectorClock({"R1": 1}), {"book"})           # added book on R1
client_b = (VectorClock({"R2": 1}), {"book", "lamp"})   # added lamp on R2
# Neither clock dominates the other → concurrent siblings
print(resolve_cart_siblings([client_a, client_b]))      # {'book', 'lamp'}

This is Dynamo's original use case, distilled: a partition lets the two replicas accept the customer's adds independently. Vector clocks identify the writes as concurrent (neither {R1:1} nor {R2:1} dominates the other), so the system surfaces both versions. The application layer merges by set union — a CRDT-flavoured choice that turns out to be correct for an "adds-only" cart but quietly drops deletions. A real production cart should use an OR-Set CRDT, which we cover in the CRDT article.

Variants and extensions

  • Lamport timestamps. The O(1) ancestor — a single integer that imposes a total order consistent with happens-before, but cannot detect concurrency. Useful for ID generation, snapshot isolation, and any case where you only need "newer than."
  • Version vectors. A specialization for replicated objects. Each replica increments its own entry per write; reads compare vectors to detect siblings. Identical math, narrower scope.
  • Dotted version vectors (Preguiça et al., 2010). Separate the "summary" dimensions (which replicas have observed which writes) from the "dot" identifying the specific event. Solves a subtle correctness bug in Dynamo's original sibling pruning where context handed back to the client could mis-causally-overwrite later updates.
  • Interval tree clocks (Almeida et al., 2008). Built for systems where processes fork and join. Each node owns a sub-interval of [0,1]; the clock is a tree representing event counts within each interval. Adapts gracefully to membership change without explicit tombstones.
  • Hybrid logical clocks (Kulkarni et al., 2014). Combine physical wall-clock time with a small logical counter. O(1) size, total order respecting happens-before, approximate wall-clock alignment. CockroachDB uses HLCs for snapshot isolation across regions. Trade-off: HLCs cannot detect concurrency, only order.
  • Bloom clocks. Approximate, space-bounded vector clocks via Bloom filters. Constant size at the cost of false-positive "happened before" claims; useful in opportunistic networks with millions of ephemeral peers.

Common bugs and edge cases

  • Forgetting to default missing entries to zero. If your comparison only iterates the keys present in one of the two vectors, you'll mis-classify pairs as equal when they differ on a key absent from the iterated vector. Always iterate the union of keys.
  • Tombstone explosion. Naively retiring a node by removing its entry corrupts comparisons against any in-flight message that still carries it. Use explicit tombstones with a quiescence window before garbage collection.
  • Identifying writers by mutable identity. If two replicas share an id (rare but happens after a botched bootstrap), their increments overwrite each other and concurrent updates are silently lost. Use UUIDs or epoch-suffixed ids.
  • Comparing clocks across schema versions. When you add a new replica, every old clock has zero in the new dimension. That's fine — but if the new replica increments before learning that its dimension is "new", it can write entries indistinguishable from a long-running peer's updates.
  • Confusing version vectors for ordering. A write with vector {A:5} and another with {A:3, B:1} are concurrent — neither dominates. Many engineers expect "later in time" to mean "larger vector" and are surprised when a read returns siblings.
  • Pruning siblings without dotted version vectors. The classic Dynamo bug: a read returns the merged sibling clock as "context"; the next write uses that context, overwriting an unrelated concurrent update on a third replica. Dotted version vectors fix this by carrying the actual event id, not just the merged summary.

Frequently asked questions

What's the difference between a Lamport timestamp and a vector clock?

A Lamport timestamp is a single integer per event. It guarantees that if event A causally precedes event B, then ts(A) < ts(B), but the converse is not true — ts(A) < ts(B) does NOT imply A precedes B. Vector clocks fix this by tracking a counter per process. With vector clocks you can definitively distinguish "A happened before B", "B happened before A", and "A and B were concurrent." That extra information costs O(N) bytes per event versus O(1) for Lamport.

Do vector clocks scale to millions of nodes?

Not naively. A vector clock has one entry per process that ever wrote, so a system with one million writers produces 100MB+ clocks per message at 100 bytes each. Production systems use one of three escape hatches: limit writers (Riak puts a vnode-scoped clock on each replica, not a per-client clock), prune entries for departed nodes (with caveats around tombstones), or switch to hybrid logical clocks (HLC) which combine physical time with a small bounded counter.

How do vector clocks detect concurrent updates?

Compare two clocks element-wise. V1 happened-before V2 iff every entry of V1 is ≤ the corresponding entry of V2 AND at least one entry is strictly less. If neither V1 ≤ V2 nor V2 ≤ V1 holds, the events are concurrent. In a shopping-cart context, that means two customers both modified the cart on different replicas during a partition; the application has to merge them rather than picking one.

Can vector clocks be compared in O(1)?

No — comparison is O(N) in the number of entries because you have to check every dimension. Hash-based "compressed clock" tricks can speed equality checks but not happens-before. For practical systems with bounded writer counts (10s to 100s), O(N) is fine. For unbounded writers, hybrid logical clocks or interval tree clocks compress the representation at the cost of less precise concurrency detection.

What is a tombstone in vector-clock systems?

When a writer leaves the system, you can't just delete its entry from every clock — a partitioned replica might still emit messages stamped with that writer's id. A tombstone is a permanent marker recording that the writer existed, so receiving its messages later doesn't accidentally introduce a "new" dimension. Tombstones grow forever; production systems garbage-collect them after a safe quiescence window or via dotted version vectors that move retirement metadata into separate state.

Are vector clocks the same as version vectors?

Closely related but not identical. Version vectors track the version of an object across replicas — incremented per write, propagated on replication. Vector clocks are the more general primitive, tracking causal relationships among arbitrary events including local computation steps and message sends. Most distributed-database literature uses "version vector" for the data-replication use case and "vector clock" for the general-purpose causality tracking, though the math is the same.