Distributed Systems
Hybrid Logical Clock
Wall-clock milliseconds in the high bits, a Lamport counter in the low bits
A hybrid logical clock pairs wall-clock time with a Lamport-style logical counter into one 64-bit timestamp. It preserves causality like a Lamport clock and stays close to physical time like a wall clock — the trick that lets CockroachDB and YugabyteDB run distributed transactions without TrueTime hardware.
- Encoding48-bit ms + 16-bit counter
- Per-event overhead~50 bytes (timestamp + node ID)
- EstablishedKulkarni et al., 2014
- Causal propertyA→B ⇒ HLC(A) < HLC(B)
- Drift boundstays within NTP skew (~ms)
- Used inCockroachDB, YugabyteDB, MongoDB
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
The two-component tuple
A hybrid logical clock is a tuple (l, c). The first component l tracks the largest wall-clock millisecond the node has observed (locally or from any message). The second component c is a Lamport-style logical counter that increments when l didn't advance between events. Together, (l, c) is a strictly monotonic identifier compared lexicographically: (l1, c1) < (l2, c2) iff l1 < l2, or l1 = l2 and c1 < c2.
The whole structure fits in 64 bits. CockroachDB's layout puts 48 bits of physical milliseconds in the high portion (enough to address every millisecond until the year 10895) and 16 bits of logical counter in the low portion (65,536 events per millisecond per node before overflow). That packing matters: a 64-bit HLC slots into the same space databases already allocate for wall-clock timestamps, so storage and on-wire formats need no expansion.
The three update rules
Let pt be the local physical time (read from gettimeofday), and let the current HLC be (l, c). Three rules cover every event:
- Local event or send. Compute
l' = max(l, pt). Ifl' = l, the wall clock didn't advance — increment the logical counter:c' = c + 1. Otherwise reset:c' = 0. New HLC =(l', c'). - Receive of message with timestamp (l_msg, c_msg). Compute
l' = max(l, l_msg, pt). Ifl' = l = l_msg, takec' = max(c, c_msg) + 1. Ifl' = lonly, takec' = c + 1. Ifl' = l_msgonly, takec' = c_msg + 1. Otherwise reset:c' = 0. - Read. Just return the current
(l, c)— no mutation.
The receive rule looks intimidating but encodes one idea: the new l' dominates everything seen so far, and the logical counter only increments when ties leave no room for the wall clock to differentiate. If a fresher wall clock breaks the tie, c resets to 0 and we're back to a clean physical timestamp.
A worked example
Three nodes A, B, C. NTP-synchronized but slightly skewed.
- A's wall clock is 100 ms. A starts with HLC
(100, 0). - A does a local write at wall=101. Apply rule 1:
l' = max(100, 101) = 101, advanced, soc' = 0. HLC becomes(101, 0). - A immediately does a second local event at wall=101 (same ms).
l' = max(101, 101) = 101, didn't advance,c' = 1. HLC =(101, 1). - A sends a message with HLC
(101, 1)to B. B's wall is 95 (slightly behind A). - B receives.
l' = max(B.l=95, msg_l=101, B.pt=95) = 101. Sincel' = l_msgonly,c' = 1 + 1 = 2. B's HLC becomes(101, 2). - B does a local event one ms later (B.pt = 96, but B.l = 101 > 96).
l' = max(101, 96) = 101, didn't advance,c' = 3. HLC =(101, 3).
Notice B's clock effectively jumped forward into A's time. Once B's wall clock catches up to 101 ms, the logical counter resets to 0 and physical time takes over again. The logical part bridges skew; the physical part anchors the timeline.
Variants and refinements
- HLC with bounded uncertainty (B-HLC). Augments each timestamp with an uncertainty interval, similar to TrueTime, derived from NTP's reported error bound. Used in YugabyteDB's "wait-out" commit protocol.
- HLC with safe-read horizon. CockroachDB tracks a per-replica "closed timestamp" — the highest HLC at which the replica guarantees no future writes. Follower reads at or below this horizon are safe.
- Stripe-aware HLC (Cassandra). Within a Cassandra coordinator, HLC is generated per write-coordinator-stripe to avoid contention on a single atomic counter at very high throughput.
- Bounded HLC for clock-rollback safety. Imposes a hard cap: if a received
l_msgexceeds localptby more than the configured skew bound (default 500 ms in CockroachDB), refuse the message and crash. Prevents a misbehaving node from poisoning the cluster's clocks.
When HLC is the right choice
- You want distributed transactions across multiple regions but don't have TrueTime-grade hardware (GPS + atomic clocks in every datacenter).
- You need timestamps that operators can read at a glance — "this happened around 10:42:17" rather than "Lamport counter 4,712,338."
- You want a single 64-bit timestamp that fits everywhere a wall-clock timestamp already does, without expanding storage or wire formats.
- You need causal consistency across replicas without committing to vector-clock sized metadata (O(N) bytes per timestamp).
HLC is the wrong tool when you need to detect concurrent writes for sibling reconciliation (use vector clocks or dotted version vectors), or when you need provable real-time external consistency across regions without round-trips (use Spanner's TrueTime).
HLC vs Lamport vs vector clocks vs TrueTime
| Lamport | Vector clock | HLC | TrueTime | |
|---|---|---|---|---|
| Size per timestamp | 8 bytes | 8N bytes (N writers) | 8 bytes | 16 bytes (interval) |
| Tracks physical time? | No | No | Yes (within NTP skew) | Yes (with bounds) |
| Detects concurrency? | No | Yes | No | No |
| External consistency? | No | No | Best-effort | Yes (commit-wait) |
| Needs special hardware? | No | No | No | Yes (GPS, atomic) |
| Production users | Kafka offsets, etcd ZXID | Riak, Voldemort | CockroachDB, YugabyteDB, MongoDB | Spanner only |
The trade-off is clear: HLC takes the simplicity and tiny footprint of Lamport, layers on physical-time-readability, and gets you 90% of the way to TrueTime at none of the hardware cost. That's why every distributed database not run by Google has standardized on it.
Pseudo-code
// HLC state at each node
struct HLC {
l: u64 // wall-clock milliseconds (max observed)
c: u32 // logical counter
}
// Rule 1: local event or send
fn local_or_send(hlc, pt):
l_new = max(hlc.l, pt)
if l_new == hlc.l:
c_new = hlc.c + 1
else:
c_new = 0
hlc.l, hlc.c = l_new, c_new
return (l_new, c_new) // attach to outgoing message if send
// Rule 2: receive a message with (l_msg, c_msg)
fn receive(hlc, l_msg, c_msg, pt):
l_new = max(hlc.l, l_msg, pt)
if l_new == hlc.l and l_new == l_msg:
c_new = max(hlc.c, c_msg) + 1
else if l_new == hlc.l:
c_new = hlc.c + 1
else if l_new == l_msg:
c_new = c_msg + 1
else:
c_new = 0
hlc.l, hlc.c = l_new, c_new
// Sanity check
fn check_skew(l_msg, pt, max_skew_ms = 500):
if l_msg > pt + max_skew_ms:
panic("HLC drift exceeded — clock misconfigured")
Go implementation (CockroachDB-style)
package hlc
import (
"sync"
"time"
)
type Timestamp struct {
L int64 // milliseconds
C uint32 // logical counter
}
type Clock struct {
mu sync.Mutex
state Timestamp
maxSkewMs int64
nowMs func() int64 // injectable for tests
}
func New(maxSkewMs int64) *Clock {
return &Clock{
maxSkewMs: maxSkewMs,
nowMs: func() int64 { return time.Now().UnixMilli() },
}
}
// Now returns the next HLC timestamp for a local event.
func (c *Clock) Now() Timestamp {
c.mu.Lock()
defer c.mu.Unlock()
pt := c.nowMs()
if pt > c.state.L {
c.state.L = pt
c.state.C = 0
} else {
c.state.C++
}
return c.state
}
// Update merges in a remote timestamp from an incoming message.
func (c *Clock) Update(t Timestamp) (Timestamp, error) {
c.mu.Lock()
defer c.mu.Unlock()
pt := c.nowMs()
if t.L > pt+c.maxSkewMs {
return c.state, ErrClockDriftExceeded
}
lNew := max3(c.state.L, t.L, pt)
switch {
case lNew == c.state.L && lNew == t.L:
if t.C > c.state.C {
c.state.C = t.C + 1
} else {
c.state.C++
}
case lNew == c.state.L:
c.state.C++
case lNew == t.L:
c.state.C = t.C + 1
default:
c.state.C = 0
}
c.state.L = lNew
return c.state, nil
}
Common pitfalls
- Forgetting to bump c on local events when wall clock is stuck. If your monotonic clock returns the same millisecond twice in a row (very common at sub-microsecond event rates), two local events both produce
(l, 0)and your "monotonic" clock returns duplicates. The logical counter is what keeps the timestamps distinct. - Using a 16-bit counter without a hard cap. If a node truly bursts 65,536 events into a single ms, the counter wraps to 0 and time appears to go backward. Real implementations stall the producer until the next ms or widen the counter.
- Ignoring the max-skew bound. Without rejecting messages whose
l_msgexceeds localptby too much, a single misconfigured node (clock set five years in the future) poisons every other node it talks to. CockroachDB and YugabyteDB both default to a 500 ms tolerated skew and fatally crash on violation. - Treating HLC as wall-clock time for legal logging. HLC drifts above wall clock during skew; "the user clicked at HLC 10:42:17.213" is not exactly when the user clicked. For audit logs, record the original wall-clock separately.
- Mismatched encoding between nodes. If one node packs HLC as (48-bit ms, 16-bit counter) and another packs it as (51-bit ms, 13-bit counter), comparisons are nonsense. Lock the encoding in your protocol spec, not as a code convention.
Performance and cost characteristics
- ~50 bytes per event on the wire. 8-byte HLC timestamp + 4-byte node ID + 8-byte transaction ID + headers. Negligible at any scale.
- One atomic CAS or mutex per timestamp. ~20–50 ns per call on modern x86; rarely a bottleneck except at extreme single-coordinator throughput (millions of timestamps/sec from one core).
- Bounded logical-counter growth. In healthy operation, c rarely exceeds 10–20 because wall_clock advances every millisecond. During clock skew or burst contention, c can spike to thousands but is bounded by the 16-bit cap.
- NTP-grade clock skew is enough. CockroachDB recommends NTP accuracy within 250 ms; YugabyteDB defaults to 500 ms. Cloud provider clocks (AWS Time Sync, GCP NTP) routinely deliver sub-millisecond accuracy, far better than required.
- No coordination required. Unlike consensus-based clocks (Raft-term-and-index), HLC is generated locally with zero round-trips. That's the central performance win: timestamps cost zero network operations.
Frequently asked questions
What is a hybrid logical clock?
A hybrid logical clock (HLC) is a 64-bit timestamp tuple (l, c) where l tracks the maximum wall-clock millisecond observed and c is a Lamport-style logical counter. On every event, l is bumped to max(l, wall_clock, any_received_l) and c is incremented when l didn't advance — so HLC preserves Lamport's happens-before property while staying close to physical time. Introduced by Kulkarni, Demirbas, Madappa, Avva, and Leone in 2014.
Why is HLC better than a plain Lamport clock?
A Lamport clock is a single monotonic integer that can drift arbitrarily far from wall time. After a long-running distributed batch, the Lamport counter might read 47 million, which means nothing to a human operator. HLC keeps the high bits aligned with physical milliseconds, so timestamps remain human-readable (you can see at a glance that an event happened around 09:14:22 today) while still respecting causal order from message timestamps.
Why not just use vector clocks instead of HLC?
Vector clocks need O(N) bytes per timestamp where N is the number of writers. For a 1000-node cluster, that's 8 KB per timestamp attached to every message — prohibitive at high throughput. HLC fits in 64 bits regardless of cluster size. The trade-off: HLC gives you a total order (with tiebreaker), not concurrency detection. If you need to detect concurrent writes for siblings (Dynamo-style), use vector clocks; if you need a cheap, scalable, causally-consistent timestamp for snapshot reads, use HLC.
How does HLC compare to Google's TrueTime?
TrueTime returns an interval [earliest, latest] for the current wall-clock time, with bounded uncertainty (typically under 7 ms) backed by GPS receivers and atomic clocks in every data center. Spanner uses it to commit transactions after waiting out the uncertainty window. HLC achieves similar causality guarantees with commodity NTP-synchronized clocks — no specialized hardware. The cost is a slightly weaker external-consistency model: HLC orders events that exchanged messages but cannot reorder concurrent events into real-time order without help. For most workloads, HLC is good enough and orders of magnitude cheaper to deploy.
What happens when a node's clock jumps backward?
HLC handles backward jumps gracefully. Because the algorithm always takes max(l, wall_clock, received_l), a backward jump in wall_clock simply means the wall-clock branch loses to the local l — the logical counter c continues incrementing, and the timestamp keeps moving forward. The clock effectively becomes a Lamport clock during skew, then re-aligns to physical time once wall_clock catches back up. CockroachDB additionally caps the maximum tolerated skew (default 500 ms) and crashes nodes that drift beyond it, since clock skew that large indicates a configuration bug, not a transient hiccup.
How many bits does each HLC component need?
Common encoding: 48 high bits for physical milliseconds (good until year 10895), 16 low bits for the logical counter (65,536 events per millisecond per node before overflow). CockroachDB uses this layout. Some implementations use 51 bits / 13 bits for finer-grained physical timestamps. The logical counter resets to 0 whenever the physical bits advance, which is why it rarely approaches its cap — you need 65k events within a single millisecond from a single sender without any wall-clock progress, an unlikely combination on real hardware.
Which databases use HLC?
CockroachDB uses HLC for all transactional read and commit timestamps. YugabyteDB inherits the design via Raft. MongoDB introduced HLC-based ClusterTime in 3.6 for causal consistency sessions and change streams. FaunaDB, Aerospike, and TiKV (TiDB's storage) all implement HLC variants. Any database that wants distributed transactions without TrueTime hardware — which is essentially every database not named Spanner — has converged on HLC.