Distributed Systems

Spanner & TrueTime

The database that turned clock uncertainty into a number — and then waited it out

Spanner is Google's globally distributed database that uses TrueTime — a GPS-and-atomic-clock API exposing time as a bounded interval [earliest, latest] — to give externally consistent (linearizable) transactions across continents by waiting out the clock uncertainty before commit.

  • Consistency modelExternal consistency (linearizable)
  • TrueTime APIinterval [earliest, latest]
  • Uncertainty ε (2012)≈ 1–7 ms, avg ~4 ms
  • Commit-wait≈ 2·ε per read-write txn
  • Clock sourcesGPS + atomic (Armageddon)
  • Replication / commitPaxos groups + 2PC

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

The idea: stop pretending clocks are exact

Every distributed database has the same nightmare. Two transactions commit on machines a continent apart. Which one happened first? If you trust the local wall clock on each machine, you are doomed — ordinary servers running NTP drift by tens of milliseconds, and worse, they have no idea how far off they are. A timestamp of 14:00:00.000 might really be 14:00:00.030, or 13:59:59.970, and the clock will swear it is correct either way.

Google's Spanner, described by Corbett et al. at OSDI 2012, made a radical move: instead of pretending the clock is exact, expose the error as a first-class value. The TrueTime API does not return a single instant. It returns an interval:

TT.now() → { earliest, latest }   // guarantee: earliest ≤ t_abs ≤ latest

The contract is the part that matters. TrueTime guarantees that the true absolute time t_abs lies somewhere inside [earliest, latest]. The width of that interval is 2·ε — twice the instantaneous uncertainty. If ε is small, you know almost exactly when "now" is. If your hardware can't say, the interval widens honestly instead of lying. That single change — a clock that admits what it doesn't know — is the foundation everything else is built on.

External consistency, and why ordering is the whole game

Spanner's headline guarantee is external consistency, which is just a database-flavored name for linearizability over transactions. Stated plainly: if transaction T1 finishes (commits) before transaction T2 begins in real, wall-clock time — even if T1 ran in Belgium and T2 in Oregon — then Spanner guarantees commit_ts(T1) < commit_ts(T2). The whole system behaves as if every transaction took effect at one global instant, in exactly the order a person holding a perfect stopwatch would have seen.

This is strictly stronger than serializability. Serializability says transactions are equivalent to some serial order; it does not promise that order matches the real-world clock. Snapshot isolation is weaker still. External consistency is the property that lets you build correct cross-region systems where causality leaks through side channels — you commit a transaction, phone a colleague across the ocean, and they start a transaction that must see your write. Only an external-consistency guarantee makes that work without application-level coordination.

The mechanism that delivers it has two halves:

  1. Pick a timestamp that is "in the future." A read-write transaction chooses its commit timestamp s = TT.now().latest at commit time. Using latest guarantees s is at least as large as true time everywhere.
  2. Commit-wait. Before releasing locks or telling the client "done," the coordinator sleeps until TT.now().earliest > s — until TrueTime is certain that absolute time has moved past s everywhere in the world.

The commit-wait trick, step by step

Commit-wait is the cleverest line in the paper, and it is almost embarrassingly simple. Watch the two invariants line up:

commit a transaction:
  s ← TT.now().latest            # timestamp ≥ true time now
  ... do Paxos write at s ...
  while TT.now().earliest ≤ s:   # commit-wait
      sleep()
  release locks; ack client

Why does this force the right order? Suppose T1 commits, then T2 starts afterward in real time. T1 only acknowledged after TT.now().earliest > s1 — meaning absolute time was already past s1 when T1 finished. T2 begins after that, so when T2 later picks s2 = TT.now().latest, true time is already > s1, and latest ≥ t_abs > s1, forcing s2 > s1. The two events can never get tangled, because T1 literally refuses to finish until the clock is provably past its own timestamp. The waiting buys the ordering.

The cost of that wait is the cost of ε. Spanner overlaps commit-wait with the two-phase-commit and Paxos round trips that are happening anyway, so on a healthy network the wait often hides entirely behind work the transaction had to do regardless. On a fast local commit, you pay the full 2·ε — single-digit milliseconds.

When Spanner is the right tool — and when it isn't

  • You need strong consistency across regions. Financial ledgers, inventory, ad-spend counters — anywhere a stale or reordered read is a correctness bug, not a UX wrinkle. This is Spanner's home turf and the reason it powers Google AdWords and Cloud Spanner.
  • You want SQL and ACID at global scale. Spanner gives you a relational schema, secondary indexes, and externally consistent multi-row transactions across thousands of machines — the thing the original "NoSQL gives up consistency for scale" trade said you couldn't have.
  • Your write latency budget tolerates a few milliseconds of commit-wait plus a cross-region Paxos quorum (tens of milliseconds if your replicas span continents).

It is the wrong tool when you don't need cross-region linearizability. If a 50–100 ms cross-continent write latency or the operational cost of GPS antennas and atomic clocks in every datacenter is unjustified, a single-region Postgres, a Dynamo-style AP store, or an eventually-consistent design will be cheaper and faster. Spanner trades latency and hardware for a guarantee — only buy it if you need the guarantee.

Spanner vs other distributed databases

SpannerCockroachDBDynamoDBCassandraCalvin
ConsistencyExternal (linearizable)Serializable (no real-time guarantee by default)Eventual / strong per-keyTunable (eventual default)Strict serializable
Clock dependencyTrueTime (GPS + atomic)Hybrid logical clocks (NTP)none (no global order)last-write-wins by wall clocknone (deterministic order)
How order is fixedcommit-wait on εuncertainty restart windowper-item, no cross-key ordertimestamp on cellspre-agreed global log
Cross-region txn latencyquorum + ~2·εquorum + retries on overlapn/a (no cross-region txn)n/aquorum, no commit-wait
Read-only readslock-free MVCC snapshotMVCC snapshotper-keyper-keyfrom the ordered log
Special hardwareyes (time masters)nononono
Open / managedCloud Spanner (Google)open sourceAWS managedopen sourceresearch / FaunaDB-influenced

The defining axis is how each system decides who-came-first. Spanner spends a few milliseconds of real wall-clock waiting to make timestamps trustworthy. CockroachDB drops the special hardware and instead uses hybrid logical clocks plus an "uncertainty interval": if a read might have raced a write inside the clock-skew window, the read restarts at a higher timestamp. Calvin sidesteps clocks entirely by agreeing on a deterministic global transaction order first, then executing — no commit-wait, but every transaction's read/write set must be known up front.

What the numbers actually say

  • ε sawtooths between ~1 ms and ~7 ms. In the OSDI 2012 measurements, uncertainty rose linearly between syncs and snapped back on each poll, averaging roughly 4 ms. The slope is the assumed 200 µs/s oscillator drift, and the poll interval is 30 seconds — so worst-case added drift before a sync is about 200 µs/s × 30 s = 6 ms, on top of a ~1 ms base.
  • Commit-wait is about 2·ε. At ε ≈ 4 ms that is roughly 8 ms of latency — but much of it overlaps with Paxos and 2PC round trips, so the marginal cost is often a fraction of that.
  • The brittle dependency is ε, not bandwidth. If the time-master infrastructure misbehaves and ε balloons, Spanner deliberately slows down (longer commit-waits) rather than returning wrong answers — availability degrades, correctness never does.
  • Read-only transactions pay zero commit-wait. They are lock-free MVCC snapshot reads; the only "wait" is for a replica to be caught up to the read timestamp (the safe-time condition), which on a local replica is typically negligible.
  • Spanner runs in production at planetary scale. It backs Google's F1 / AdWords system and is offered as Cloud Spanner; the design target was millions of machines across hundreds of datacenters.

JavaScript: a TrueTime simulator with commit-wait

You can capture the entire idea in a few dozen lines. We model a clock with bounded error, expose now() as an interval, and implement commit() with commit-wait. The assertion at the end is the external-consistency invariant.

// A clock whose true time we know (for the sim), but which only
// exposes a bounded interval to callers — exactly TrueTime's contract.
class TrueTime {
  constructor(epsilonMs = 4) { this.epsilon = epsilonMs; }
  trueNow() { return Date.now(); }            // ground truth (sim only)
  now() {
    const t = this.trueNow();
    return { earliest: t - this.epsilon, latest: t + this.epsilon };
  }
}

const sleep = ms => new Promise(r => setTimeout(r, ms));

class Spanner {
  constructor(tt) { this.tt = tt; this.lastCommit = -Infinity; }

  // Read-write transaction: pick latest, do work, then commit-wait.
  async commit(apply) {
    const s = this.tt.now().latest;          // timestamp ≥ true time
    apply(s);                                 // Paxos write would happen here
    // commit-wait: block until TrueTime is sure abs time passed s
    while (this.tt.now().earliest <= s) {
      await sleep(1);
    }
    // external consistency invariant: timestamps are monotone in real time
    if (s <= this.lastCommit) throw new Error('ordering violated!');
    this.lastCommit = s;
    return s;                                 // commit timestamp
  }
}

// Demo: two sequential transactions never get reordered.
(async () => {
  const db = new Spanner(new TrueTime(4));
  const s1 = await db.commit(() => {});       // T1 fully commits...
  const s2 = await db.commit(() => {});       // ...then T2 starts
  console.log(s1 < s2);                        // always true
})();

The crucial line is the while loop. Drop it, set ε to a few milliseconds, and run two near-simultaneous commits: the timestamps can come out reversed and the invariant fires. Re-add commit-wait and the violation disappears. That tiny loop is the entire difference between "serializable" and "externally consistent."

Python: snapshot reads and the safe-time condition

Read-write transactions get the commit-wait. Read-only transactions get something more elegant: they never block writers and never take locks. They pick a read timestamp and wait only until each replica has applied every write at or below it — the safe time.

import time

class TrueTime:
    def __init__(self, epsilon_ms=4):
        self.eps = epsilon_ms / 1000.0
    def now(self):
        t = time.time()
        return (t - self.eps, t + self.eps)   # (earliest, latest)

class Replica:
    """MVCC store: each key holds (timestamp, value) versions."""
    def __init__(self):
        self.versions = {}          # key -> list of (ts, value), ts-ordered
        self.safe_time = 0.0        # all writes with ts <= safe_time applied

    def apply_write(self, key, ts, value):
        # apply a committed write; safe_time advances as the replica catches up
        self.versions.setdefault(key, []).append((ts, value))
        self.safe_time = max(self.safe_time, ts)

    def heartbeat(self, ts):
        # Paxos keeps replicas current even with no writes: a leader lease /
        # heartbeat at ts means "no write with ts' <= ts is still in flight."
        self.safe_time = max(self.safe_time, ts)

    def read_at(self, key, t_read):
        # safe-time condition: don't answer until the replica is caught up to t_read
        while self.safe_time < t_read:
            time.sleep(0.001)
        # MVCC: newest version with ts <= t_read
        best = None
        for ts, v in self.versions.get(key, []):
            if ts <= t_read and (best is None or ts > best[0]):
                best = (ts, v)
        return best[1] if best else None

tt = TrueTime(4)
r = Replica()
# A write commits at t_write; a read-only txn then reads a consistent snapshot.
t_write = tt.now()[1]              # commit at 'latest' edge (>= true time)
r.apply_write("x", t_write, "v1")
t_read = tt.now()[1]              # read timestamp >= t_write
r.heartbeat(t_read)               # replica reports it has caught up to t_read
print(r.read_at("x", t_read))     # 'v1' — no locks, no commit-wait

This is why Spanner reads scale to read-only replicas anywhere on Earth: a snapshot read is just "wait to be caught up, then look at the right version." No coordination with writers, no commit-wait, no two-phase locking.

Variants and the systems descended from it

Hybrid Logical Clocks (HLC). The most influential reaction to Spanner. CockroachDB and YugabyteDB wanted external-ish consistency without GPS/atomic hardware, so they combine a physical wall clock with a Lamport counter. Instead of commit-wait, they carry an uncertainty interval per read and restart a transaction at a higher timestamp if it might have raced a write inside the skew window. Cheaper hardware, but the guarantee is conditional on a configured max clock skew rather than a hardware-enforced bound.

Calvin. Drops clocks entirely. A sequencing layer agrees on a global transaction order up front (via a replicated log), then nodes execute deterministically. No commit-wait at all, but every transaction must declare its read/write set in advance, which complicates interactive SQL.

AWS commit-wait without TrueTime. Several systems (and even later Google work) use "wait out the uncertainty" with NTP-derived bounds; the wait simply grows when the bound is loose. The pattern outlives the specific hardware.

Spanner's own read paths. Beyond read-write transactions, Spanner exposes read-only transactions (lock-free, externally consistent snapshots) and snapshot reads (at a stale, client-chosen timestamp for cheap, slightly-old data). Picking the right read mode is most of the performance tuning in practice.

Common misconceptions and edge cases

  • "TrueTime makes the clocks accurate." No — it makes the error bound trustworthy. The clocks are still wrong; TrueTime just tells you by how much, and that honesty is what the algorithm exploits.
  • "Spanner waits 2·ε on every operation." Only read-write commits pay commit-wait, and much of it overlaps with Paxos/2PC work. Read-only transactions and snapshot reads pay none.
  • Confusing external consistency with serializability. Serializable systems can legally reorder transactions versus real time; externally consistent ones cannot. If you only need serializability, you don't need TrueTime.
  • Assuming small ε is free. ε is small only because of the GPS + atomic time-master fleet and a 30 s poll interval. Pull that infrastructure and ε grows, commit-waits lengthen, and write latency rises — the cost moves, it doesn't vanish.
  • Reading at earliest instead of latest for commit. Commit timestamps must use latest so they sit at-or-ahead of true time; using earliest would let a future transaction undercut the timestamp and break ordering.
  • Forgetting the safe-time wait on reads. A snapshot read that doesn't wait for the replica to catch up to its read timestamp can miss a committed write, silently returning stale data that violates the snapshot it promised.

Frequently asked questions

What problem does TrueTime actually solve?

It bounds clock error so Spanner can assign transaction commit timestamps that respect real-world order. Ordinary NTP clocks can be off by tens of milliseconds with no way to know how far — TrueTime instead returns an explicit interval [earliest, latest] guaranteed to contain true absolute time, so the database can reason about whether two events could possibly have overlapped.

What is external consistency in Spanner?

External consistency (equivalently, linearizability over transactions) means: if transaction T1 commits before T2 starts in real time — even on different machines on different continents — then T1's commit timestamp is less than T2's. The system behaves as if every transaction executed instantaneously at a single global instant, in the same order a human with a perfect clock would observe.

How does commit-wait work?

When a transaction picks commit timestamp s = TT.now().latest, Spanner does not release locks or acknowledge the client until TT.now().earliest > s — i.e. until TrueTime is certain absolute time has passed s. This wait, roughly 2·ε on average, guarantees no later transaction can be assigned a smaller timestamp, which is what makes the ordering externally consistent.

How big is the TrueTime uncertainty ε in practice?

In the 2012 OSDI paper, ε sawtooths between about 1 ms and 7 ms as the local clock drifts and is re-synced; the average was around 4 ms, dominated by the 200 µs/s assumed drift over the 30-second master-polling interval. Commit-wait is roughly 2·ε, so transactions pay single-digit milliseconds of added latency.

Why does Spanner need GPS and atomic clocks instead of just NTP?

The two clock sources fail in uncorrelated ways. GPS receivers can lose antenna signal or suffer leap-second and spoofing bugs; atomic clocks (Armageddon masters) drift slowly and predictably but can't be corrected externally. Cross-checking both lets a time master detect and evict a lying clock, keeping ε small and trustworthy — NTP alone gives you an estimate with no provable bound.

Can read-only transactions skip locks and commit-wait?

Yes. Spanner serves lock-free, non-blocking snapshot reads at a chosen timestamp using multi-version concurrency control. A read at timestamp t_read just waits until each replica has applied all writes with timestamp ≤ t_read (the safe-time condition), then reads that MVCC snapshot — no two-phase locking and no commit-wait, so reads scale to any replica including read-only ones.