Distributed Systems

Distributed Lock

Mutual exclusion across machines — and the TTL trap that breaks naive implementations

A distributed lock coordinates mutual exclusion across processes and machines. Redis, ZooKeeper, etcd, DynamoDB. The TTL versus pause hazard is the correctness landmine.

  • Acquire latency (Redis)~0.5–2 ms
  • Acquire latency (ZK/etcd)~5–20 ms (quorum)
  • Redlock clock skew tolerance~10 ms total drift
  • TTL hazard window= holder pause minus TTL
  • Safety primitiveFencing token (monotonic)
  • Used inJob runners, schedulers, leader election

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How a distributed lock works

A local mutex is easy — two threads in the same process share memory, and the hardware guarantees atomicity. A distributed mutex is hard because the participants don't share memory, the network can drop messages, the clock can lie, and the holder can pause at any moment for reasons it can't observe.

The basic primitive is the same as a single-machine lock: only one client can hold the lock at a time. The simplest implementation in Redis is one line:

SET lock:order:42 "client-A" NX PX 30000

NX means "only set if not exists" — the atomic compare-and-swap. PX 30000 sets a 30-second expiry. Whichever client wins the race becomes the holder. Others see the key already exists and back off.

The TTL is mandatory because if the holder crashes without releasing, the lock would persist forever. With a TTL, eventually the lock expires and another client can take it. That single mandatory TTL is also the source of the most dangerous bug in distributed-locking, and we'll come back to it.

The TTL hazard — Kleppmann's critique

Consider a single client holding the lock. It does its work, then renews the lock periodically. Now a long GC pause happens, or the process is paused by the OS, or it's hit with a network partition that delays renewals. Time passes. The lock's TTL expires on the server. The lock service grants the lock to a second client.

The first client eventually wakes up. It still believes it holds the lock. It commits its work to the protected resource. The second client also commits. Both clients held what they thought was an exclusive lock at the same time.

This is the failure mode Martin Kleppmann highlighted in his 2016 critique of Redlock. The TTL doesn't prevent it; it just hides it from the lock service. The lock service is internally consistent — it thinks one client at a time held the lock. The reality, as observed by the protected resource, is that two clients wrote.

The fix is a fencing token. Every lock acquisition gets a monotonically increasing token (a sequence number, an epoch, a ZooKeeper Zxid). Every write to the protected resource carries that token. The resource refuses any write whose token is less than the highest it has seen. So even if client A wakes up still holding "the lock," its write carries token 33 — and the resource has already accepted token 34 from client B. A's write is rejected.

acquire():
    token = lockService.acquire('lock:order:42')
    return token

write(resource, data, token):
    if token <= resource.lastSeenToken:
        raise StaleLockError
    resource.lastSeenToken = token
    resource.commit(data)

Fencing moves correctness from the lock service to the resource. The resource is the final arbiter, which is the only honest place for correctness to live.

When to use a distributed lock

  • Singleton workers. Only one instance of a cron job should run cluster-wide. Acquire a named lock at start; release on completion.
  • Leader election in stateless services. Whichever instance holds the lock is the leader. Heartbeat-renew while alive, drop on shutdown.
  • Serializing access to an external system without idempotency. Charging a card, sending an SMS — when you can't dedupe at the gateway, you serialize at the producer.
  • Coarse-grained coordination. "Only one migration runs at a time across the fleet." The lock is held for minutes; the network round-trip cost is irrelevant.

Avoid distributed locks for fine-grained coordination — they will dominate latency and become a single point of failure. Most "I need a distributed lock" requests are actually "I need an idempotent operation" in disguise.

Redis Redlock vs ZooKeeper vs etcd vs DB row vs DynamoDB

Redis RedlockSingle Redis SET NXZooKeeperetcdPostgres rowDynamoDB CAS
Latency (acquire)2–5 ms0.5–2 ms5–20 ms5–20 ms1–5 ms5–10 ms
Throughput~10–50k/s~50–100k/s~5–20k/s~5–20k/s~5–20k/s~5–10k/s
Built-in fencingNo (use Lua/version)NoYes (Zxid)Yes (revision)Yes (xmin)Yes (version attr)
Safe under GC pausesNo (Kleppmann)NoYes (w/ fencing)Yes (w/ fencing)Yes (w/ fencing)Yes (w/ fencing)
Clock-skew sensitiveYes (~10 ms)LessNoNoNoNo
Consensus requiredPer-node majorityNoYes (Zab)Yes (Raft)No (single DB)Implicit (managed)
Operational cost5 Redis nodes1 Redis node3–5 ZK ensemble3–5 etcd clusterExisting DBManaged AWS

The pragmatic decision tree: if you need correctness, use ZooKeeper, etcd, or a DB row with fencing tokens. If you need performance and can tolerate occasional double-acquire (with fencing as a backstop), Redis is fine. If you have neither requirement and just want exclusion that "usually works," a single Redis SET NX is enough — but document loudly that it's not safe under GC pauses.

What distributed locks actually cost

Latency: every protected operation pays at least one round-trip to the lock service on acquire and one on release. For a Redis lock on the same network, that's roughly 1–2 ms; for a ZooKeeper or etcd lock, 5–20 ms because the quorum write goes through the consensus log. Multiply by the number of operations per request and locks can easily become a third of total latency.

Contention amplifies cost. When two clients race for the same lock, one waits — and the wait queue plus the lock-service back-pressure can drive p99 latency up by 10–100×. If your locks see high contention, you're using them for fine-grained work; refactor to per-key locks or remove them entirely via idempotent operations.

Operational cost: a Redlock setup requires 5 Redis nodes; ZooKeeper or etcd requires a 3- or 5-node ensemble with persistent storage. None of these are free to operate. Many teams choose a DB row lock specifically to avoid running another stateful system.

Pseudo-code

// Single-Redis lock with fencing token.

acquire(key, ttl_ms):
    token = redis.incr('fence:' + key)            // monotonic, persistent
    ok = redis.set(key, token, NX=true, PX=ttl_ms)
    return token if ok else null

renew(key, token, ttl_ms):
    // Lua script for atomicity
    SCRIPT = """
      if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('pexpire', KEYS[1], ARGV[2])
      else return 0 end"""
    return redis.eval(SCRIPT, [key], [token, ttl_ms])

release(key, token):
    SCRIPT = """
      if redis.call('get', KEYS[1]) == ARGV[1] then
        return redis.call('del', KEYS[1])
      else return 0 end"""
    return redis.eval(SCRIPT, [key], [token])

// Resource side — refuse stale tokens.
write(resource, data, token):
    if token <= resource.fence_high_water:
        raise StaleLockError
    atomic:
        resource.fence_high_water = max(resource.fence_high_water, token)
        resource.apply(data)

Python: a complete fenced Redis lock

import redis, time

r = redis.Redis(host="localhost", port=6379)

RENEW_SCRIPT = r.register_script("""
  if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('pexpire', KEYS[1], ARGV[2])
  else return 0 end
""")

RELEASE_SCRIPT = r.register_script("""
  if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
  else return 0 end
""")

def acquire(key, ttl_ms=30000):
    """Returns a fencing token on success, None on failure."""
    token = r.incr(f"fence:{key}")
    if r.set(key, token, nx=True, px=ttl_ms):
        return token
    return None

def renew(key, token, ttl_ms=30000):
    return RENEW_SCRIPT(keys=[key], args=[token, ttl_ms])

def release(key, token):
    return RELEASE_SCRIPT(keys=[key], args=[token])

def with_lock(key, work, ttl_ms=30000):
    token = acquire(key, ttl_ms)
    if not token:
        raise Exception("lock not acquired")
    try:
        # Watchdog renewal every ttl/3
        import threading
        stop = threading.Event()
        def watchdog():
            while not stop.wait(ttl_ms / 3000):
                if not renew(key, token, ttl_ms):
                    return  # lock lost
        threading.Thread(target=watchdog, daemon=True).start()
        return work(token)        # pass fencing token to caller
    finally:
        stop.set()
        release(key, token)

The work(token) callback receives the fencing token so it can pass it to the protected resource. Without the token plumbing, the lock is structurally unsafe — Kleppmann's exact point.

Variants and extensions

  • Redlock. The original Redis-only protocol: acquire the same key on 3 of 5 independent Redis instances within a bounded time window. More resistant to single-node failure; still vulnerable to clock skew and GC pauses per Kleppmann.
  • ZooKeeper ephemeral nodes. Create an ephemeral sequential node under a parent; the holder is the one with the lowest sequence number. Disconnection automatically deletes the node, releasing the lock. The Zxid acts as a fencing token.
  • etcd lease + key. Bind a key to a lease with a TTL; if the lease isn't renewed, the key is deleted. Built-in revision numbers fence.
  • DynamoDB conditional write. PutItem ConditionExpression="attribute_not_exists(lockKey)" with a TTL attribute and version number gives a fenced lock with no extra infrastructure.
  • Reentrant lock. Track the holder's identity and a recursion count; same client can acquire repeatedly without blocking itself. Most distributed-lock libraries support this.
  • Read-write lock. Many concurrent readers, one exclusive writer. Built on top of a basic lock with a counter; Redis has SET-and-counter recipes, ZooKeeper has explicit RW lock primitive in Curator.
  • Hierarchical lock / tree lock. Lock a parent to lock all children; ZooKeeper's directory structure naturally supports this.

Common pitfalls and edge cases

  • Releasing someone else's lock. A delayed RELEASE from the previous holder can DEL the lock that a new holder just acquired. Always include the token in the release: if get(key) == my_token then del(key).
  • TTL too short. A pause longer than the TTL silently loses the lock. Set TTL to the longest plausible work duration plus headroom. Production heuristic: 30 seconds is a sane default for human-interactive flows; minutes for batch jobs.
  • TTL too long. A crashed holder pins the lock for the entire TTL — minutes of unavailability. Use a watchdog to shorten the effective hold, or pair TTL with a holder-heartbeat mechanism.
  • Skipping fencing tokens. Most "distributed lock" tutorials don't mention fencing. Their locks work until the first GC pause longer than the TTL. Fence at the resource — your locks become correctness-preserving even when the lock service is wrong.
  • Lock contention on hot keys. One popular key becomes a serialization bottleneck. Shard the key by hash of (key + bucket), or refactor to optimistic concurrency (compare-and-swap on the resource itself).
  • Mixing lock services across services. Service A uses Redlock, service B uses a Postgres row lock — both believe they have exclusive access. The lock service isn't the resource; protected access must be coordinated by the resource (fence token) or by the same lock service.

Frequently asked questions

Why can't I just use a database row as a distributed lock?

You can — and it's the simplest correct primitive available. SELECT ... FOR UPDATE on a single row gives you mutual exclusion within one transaction, and the database fences the lock automatically. The downsides are scale (every lock acquisition is a DB round-trip plus locks held during the work) and that database failover can release the lock to a new owner unexpectedly. For low contention, DB-row locks are still the right answer.

What is the TTL hazard in distributed locks?

Every distributed lock has a time-to-live. If the holder takes longer than the TTL — because of a GC pause, network blip, or just a slow query — the lock auto-expires while the holder is still running. The lock service grants the lock to a second client, and now two clients believe they hold it simultaneously. Both write. State corrupts. This is the failure mode Martin Kleppmann's famous critique of Redlock highlights.

How does a fencing token fix the TTL hazard?

When a client acquires the lock, the lock service hands it a monotonically increasing token. Every write to the protected resource carries that token, and the resource rejects writes with a stale token. So if client A's lock expired and client B acquired it with token 34, client A's write with token 33 is rejected — even if A still believes it holds the lock. The resource itself enforces correctness, not the lock service.

Redis Redlock vs ZooKeeper vs etcd — which to pick?

ZooKeeper and etcd offer linearizable consensus-backed locks with built-in fencing via Zxid or revision numbers. They're slower (~5–20ms per acquire) but provably safe. Redis Redlock is fast (~1ms) but Kleppmann showed it's not safe under clock skew or process pauses; Redis maintainers disagree, the debate continues. For correctness-critical operations, use ZK/etcd. For coarse-grained coordination where occasional double-acquire is acceptable, Redlock is fine.

How big is the clock-skew tolerance in Redlock?

Redlock assumes bounded clock drift across nodes — roughly 10 ms of total skew is the order-of-magnitude figure that breaks Redlock's safety argument. On a well-NTP-synced cluster you're typically below 1 ms. On a poorly-managed system, container restarts, VM suspension, or NTP adjustments can produce 100+ ms jumps that violate the assumption. This is the core of Kleppmann's critique.

What's a typical lock acquisition latency?

Redis SET NX PX: ~0.5–2 ms on the same network. Redlock with 5 nodes: ~2–5 ms (waits for majority). ZooKeeper or etcd: ~5–20 ms (quorum write to consensus log). A Postgres row lock: ~1–5 ms uncontended. DynamoDB conditional write: ~5–10 ms. Plan for the wide end of these numbers when locks are contended.

Should I implement lock renewal (lease extension)?

Yes if the protected operation might exceed the initial TTL. A watchdog thread renews the lock every TTL/3 with the same token; if it can't renew (because the lock expired), it aborts the local work immediately. This shrinks the TTL hazard window but doesn't eliminate it — a stalled process whose watchdog is also stalled is indistinguishable from a crashed one, and the lock will eventually expire.