Databases

Database Sharding

One database, many machines, one shard key

Database sharding splits one logical database across many physical machines so that no single node has to hold all the data or absorb all the traffic. The hard part is not the splitting — it's choosing a shard key that scales linearly under your real query mix without creating hot shards or cross-shard joins.

  • Write throughput~linear in shards
  • Cross-shard txn cost2PC, ~5-10× slower
  • Hash-key rebalance1/N keys move
  • Common shard count32-256
  • Open-source proxyVitess, Citus

Interactive visualization

Press play, or step through manually. The visualization is yours to drive — try it before reading on.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How sharding works

A sharded database has three layers. The bottom is N independent database servers, each holding a slice of the rows. The middle is a router (sometimes called a vshard, mongos, or proxy) that knows which shard each key lives on. The top is your application, which sends queries to the router and pretends the database is one machine.

The router answers one question: given a row's shard key, which shard? The answer comes from a strategy:

  • Hash sharding. shard = hash(key) mod N. Even distribution by construction; loses ordered scans because adjacent keys land on different shards.
  • Range sharding. Each shard owns a contiguous key range. Range queries stay on one shard; sequential keys (timestamps, auto-increment IDs) become a hot shard.
  • Directory (lookup) sharding. A directory table maps key → shard. Maximum flexibility, plus a single point of metadata failure.
  • Consistent hashing. Hash both keys and shards onto a ring; each key goes to the next shard clockwise. Adding a shard moves only the keys between its position and its predecessor.

Real production systems are usually a hybrid: range partitions over a hashed prefix, lookup tables for "VIP" tenants, consistent hashing inside each tenant. Vitess, Citus, MongoDB, DynamoDB, and Cassandra each pick a different blend.

Sharding strategies side by side

HashRangeDirectoryConsistent hashingHybrid
Even distributionExcellentSkew-proneExcellentExcellentGood
Range scansBad (fan-out)ExcellentGoodBadMixed
Adding a shardRebalance ~all keysSplit a rangeUpdate directoryMove 1/N keysSplit + remap
Hot-key riskLowHigh (sequential)ManualLowManageable
Metadata SPOFNoneRange mapDirectory tableRing configMany
Used byCassandra, DynamoDBHBase, BigTable, CockroachDBVitess (vindexes), FoursquareRiak, Discord, MemcachedPostgres + Citus
Auto-splitManual (resharding)Yes (Spanner, Cockroach)ManualManualEngine-dependent

Consistent hashing shard router — JavaScript

// Consistent-hashing router with virtual nodes for smoother distribution.
import { createHash } from 'node:crypto';

class ShardRouter {
  constructor(shards, virtualNodes = 200) {
    // ring is sorted [{ pos, shard }, ...] keyed by 32-bit hash position.
    this.ring = [];
    for (const s of shards) {
      for (let v = 0; v < virtualNodes; v++) {
        this.ring.push({ pos: this.hash(`${s}#${v}`), shard: s });
      }
    }
    this.ring.sort((a, b) => a.pos - b.pos);
  }

  hash(s) {
    // 32-bit prefix of SHA-256 — fast enough, well-distributed.
    const h = createHash('sha256').update(s).digest();
    return h.readUInt32BE(0);
  }

  // Binary search for the first ring position >= keyHash.
  shardFor(key) {
    const k = this.hash(String(key));
    let lo = 0, hi = this.ring.length - 1;
    while (lo < hi) {
      const mid = (lo + hi) >>> 1;
      if (this.ring[mid].pos < k) lo = mid + 1; else hi = mid;
    }
    // Wrap around if k is past the last node.
    return this.ring[this.ring[lo].pos >= k ? lo : 0].shard;
  }

  addShard(shard, virtualNodes = 200) {
    for (let v = 0; v < virtualNodes; v++) {
      this.ring.push({ pos: this.hash(`${shard}#${v}`), shard });
    }
    this.ring.sort((a, b) => a.pos - b.pos);
  }
}

const router = new ShardRouter(['db-1', 'db-2', 'db-3', 'db-4']);
router.shardFor('user:42');         // -> 'db-3'
router.shardFor('user:1057');       // -> 'db-1'
router.addShard('db-5');            // ~1/5 of keys reroute

The 200 virtual nodes per shard matter. With one ring point per shard, shard sizes can vary by 2-3× simply because of where four random points landed. With 200 virtual points each, shard imbalance drops to single-digit percent — the same trick Memcached's ketama and Discord's user-routing both use.

Pick-key example — Python shard router

import hashlib
from bisect import bisect_left

class ShardRouter:
    def __init__(self, shards, virtual_nodes: int = 200):
        self.ring: list[tuple[int, str]] = []
        for s in shards:
            for v in range(virtual_nodes):
                self.ring.append((self._hash(f"{s}#{v}"), s))
        self.ring.sort()
        self.positions = [p for p, _ in self.ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int.from_bytes(hashlib.sha256(s.encode()).digest()[:4], 'big')

    def shard_for(self, key) -> str:
        k = self._hash(str(key))
        i = bisect_left(self.positions, k) % len(self.ring)
        return self.ring[i][1]

# Picking a shard key for an "orders" table.
# Bad:  order_id (auto-increment)  -> sequential, all writes on the highest shard
# Bad:  region                     -> low cardinality (8 regions, lots of skew)
# Bad:  created_at                 -> time-skewed, hot recent shard
# Good: customer_id                -> high cardinality, query joins on it
# Best: (customer_id, order_id)    -> hash on customer_id keeps a customer's
#                                     orders co-located so per-customer JOINs
#                                     stay on one shard

router = ShardRouter(['db-1', 'db-2', 'db-3', 'db-4'])
def order_shard(order):
    return router.shard_for(order['customer_id'])

The comments are the lesson. A shard key isn't just an even hash — it's whatever lets your most-frequent multi-row query stay on a single shard. If "show this customer's orders" is your hottest query, customer_id is the right key even if order_id has higher cardinality.

Variants — production sharding systems

  • Vitess (YouTube / PlanetScale): directory sharding via vindexes (a vindex maps key → keyspace_id → shard range). Auto-splits ranges when a shard hits a size or QPS threshold.
  • Citus (Postgres extension): hash sharding into 32 default shards per distributed table. Co-located tables share the same shard key, so joins on that key run shard-local.
  • MongoDB: range or hash sharding via a configured shard key; balancer migrates "chunks" between shards as imbalance grows.
  • Cassandra / DynamoDB / Bigtable / HBase: hash- or range-partitioned, with auto-split (Bigtable, HBase) or manual repartition (Cassandra ring resize).
  • CockroachDB / Spanner / TiDB: range sharding with online auto-split when ranges exceed ~512MB. A range is the unit of replication, scheduling, and transaction commit.
  • Discord (custom): consistent-hashing router by user_id over hundreds of Cassandra/ScyllaDB shards; the canonical "we shard on user, period" pattern.
  • Figma's split-shards (2024): moved monolithic Postgres to per-table sharding using a Vitess-style proxy and table_shard pairs co-located by document_id.

What sharding buys and costs

The good news: write throughput scales close to linearly. A 64-shard cluster handles roughly 10× more sustained writes than a single primary at the same dollar — write fan-out doesn't hit network overhead until you cross-shard JOIN. Storage scales linearly forever. Discord's user-data cluster runs at hundreds of nodes; Twitter's old Manhattan ran at thousands.

The bad news comes in three flavors. (1) Cross-shard transactions need two-phase commit, which is ~5-10× slower than local commit and brittle under partial failures — most teams avoid them entirely. (2) Re-sharding once chosen badly is days-to-weeks of dual-writing and backfill. Foursquare lost 11 hours in 2010 to a single bad re-shard. (3) Schema changes, online migrations, and "global" queries (count rows, max ID) all turn O(N) in shard count.

The operational rule that keeps showing up across postmortems: pick a high-cardinality, low-skew shard key that matches your dominant query, over-provision shard count by 4-8× expected initial load (you can pack many shards on few hosts and re-split later), and forbid cross-shard transactions in application code.

Common bugs and edge cases

  • Hot-key skew. An "orders" table sharded by customer_id has one customer who is Walmart. That shard is on fire; the rest are idle. Solutions: composite key with a salt, separate shards for VIP tenants, application-level rate limits.
  • Cross-shard joins. JOIN orders ⋈ users when each is sharded differently fans out to every shard; aggregation latency = max latency × N. Either co-locate (same shard key) or denormalize.
  • Distributed transactions hung in 2PC. Coordinator dies between PREPARE and COMMIT; participants hold locks indefinitely. Recovery requires manual intervention or coordinator timeout protocols (e.g., Spanner's Paxos-based coordinator).
  • Time-based shard key. Sharding by created_at puts every new row on the latest shard. Looks fine in test, melts the cluster on day 30. Hash a composite key instead.
  • Auto-increment ID across shards. Two shards both hand out id=1. Use UUIDs, ULID, Snowflake-style 64-bit composite IDs, or a shared sequence service.
  • Re-shard with plain hash mod N. Going from 4 to 5 shards moves ~80% of the keys. Use consistent hashing or a hash-bucket-per-shard map so you only move 1/N.
  • Forgotten secondary index. Indexes on non-shard-key columns require querying every shard. Either accept fan-out or maintain a secondary index sharded on the indexed column.

When to shard (and when not to)

  • Single-primary writes are saturated (CPU, IOPS, fsync queue) and read replicas can't help.
  • Working set exceeds RAM on the largest single host you can run.
  • Per-tenant isolation requirements demand physical separation (multi-region compliance, noisy-neighbor SLOs).

Don't shard prematurely. A correctly-tuned single Postgres on modern NVMe handles 50-100k writes/s; you can grow to billions of rows before sharding pays for the operational complexity. Vertical scaling, read replicas, partitioning within one host, and aggressive caching all come first.

Frequently asked questions

What is a shard key?

The column (or columns) the database uses to decide which shard a row belongs on. Picking it well is the single most important decision in a sharded system — change it later and you re-shard the whole cluster. Pick something with high cardinality, low skew, and that appears in your most common query.

Why use consistent hashing instead of plain modulo?

Plain hash mod N rebalances every key when N changes — adding one node moves about (N-1)/N of the keys. Consistent hashing places shards on a ring; adding a node only moves 1/N of the keys, the ones now closer to the new shard's position on the ring.

Can I do JOINs across shards?

Yes, but expensively. A cross-shard JOIN requires fanning the query out to every shard, gathering partial results, and stitching them at the proxy. Most production systems either co-locate joined data on the same shard via a shared shard key, or denormalize so JOINs aren't needed.

What is a hot shard?

A shard receiving disproportionately more traffic than others — usually because the shard key is skewed (one tenant has 1000× the data of others, or the key is monotonically increasing time and all writes hit the latest range). The fix is either splitting that shard or hashing into the key to disperse load.

Sharding vs replication — what's the difference?

Replication makes copies of the same data; sharding splits different data across machines. Most production clusters do both — every shard is also replicated for availability. Sharding scales write throughput; replication scales read throughput and durability.

When should I shard?

When a single primary can no longer handle write throughput, when the data set exceeds reasonable single-host storage, or when a per-tenant isolation requirement demands it. Shard later than you think — schema changes and cross-shard transactions are dramatically harder once data is split.