Distributed Patterns
Leaky Bucket Rate Limit
FIFO queue draining at constant rate — smooth bursty input to a steady output
A bucket with a hole. Bursty input fills it; output leaks at fixed rate r. Overflow drops. Output is dead-smooth — exactly r req/sec. Used in network shaping and APIs.
- OriginatedTurner 1986 · ATM networks
- Output rateexactly r req/sec (smooth)
- Burst tolerancenone — queued or dropped
- Capacity formulab = r × max_latency
- vs. Token bucketsmooth output vs. burst-allowed
- Used inCisco, Linux tc, Stripe, Twilio
Interactive visualization
Bursty requests pour in. The bucket queues them. Output drips out at a perfectly constant rate. Overflow drops.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
How the leaky bucket works
Picture a literal bucket with a hole in the bottom. Water (requests) pours in from the top — sometimes a trickle, sometimes a torrent. Through the hole, water leaks at a constant rate: exactly r liters per second, no matter how much is in the bucket. If the bucket has capacity b and pouring would overflow it, the excess water spills onto the floor — lost. If the input rate equals r, the level stays constant; if it's lower, the bucket drains; if it's higher, the bucket fills, then overflows.
That's the entire algorithm. Translate water → requests, hole → downstream processor, overflow → dropped requests, and you have the leaky bucket rate limiter. The output is a perfectly smooth stream at rate r, regardless of how chaotic the input.
Formally:
- r — leak rate. Requests per second the downstream is willing to accept. For an API: 100 req/sec. For an ISP link: 10 Mbps of committed bandwidth.
- b — bucket capacity. How many requests can be queued at once. Maximum queueing latency = b / r seconds.
- L — current level. Number of queued (or virtual-queued) requests right now. Starts at 0, leaks toward 0 at rate r.
Algorithm per request:
on_request(now):
# Leak first
elapsed = now - last_leak_time
L = max(0, L - elapsed * r)
last_leak_time = now
# Try to add
if L + 1 > b:
DROP # or return 429
else:
L += 1
ACCEPT # request will egress within (L / r) seconds
This is the counter-based variant — constant memory per client (just L and last_leak_time). The queue-based variant maintains a real FIFO; both produce identical egress behavior.
Worked example — smoothing a bursty client
An API serves r = 5 req/sec to one client, with bucket capacity b = 10. The client sends 20 requests over the first half-second, then idles.
Input arrivals (req# / time-ms / level after):
1 / 0 ms / L = 1 accepted
2 / 25 ms / L ≈ 2 accepted
3 / 50 ms / L ≈ 3 accepted
...
10 / 225 ms / L ≈ 10 accepted (bucket full)
11 / 250 ms / L would be 11 → DROP
12 / 275 ms / DROP
13 / 300 ms / DROP
...
20 / 475 ms / DROP
Egress (timed by leak rate r = 5/sec, so 200 ms apart):
req 1 exits at t ≈ 200 ms
req 2 exits at t ≈ 400 ms
req 3 exits at t ≈ 600 ms
...
req 10 exits at t ≈ 2000 ms
The client sent 20 requests in 500 ms.
The bucket dropped 10 of them.
The 10 accepted requests reached downstream at a perfectly even 5 req/sec
over 2 seconds. Downstream never sees more than 5 RPS.
Compare this to no rate limit (downstream sees a 40-RPS spike) or to a token bucket of capacity 10 + refill 5/sec (downstream sees an initial burst of 10 in < 100 ms then 5 RPS sustained). Only leaky bucket gives you the dead-smooth 5 RPS that some downstreams require.
Variants
- Counter-based (virtual queue). Track
Landlast_leak_time. O(1) memory per client. Used by Redis-Lua libraries, Stripe's rate limiter blog post variant, Cloudflare's edge rate limit. - Queue-based (real FIFO). Maintain an actual queue of pending request objects with a worker that pops one every 1/r seconds. Memory is O(b) per client. Used by hardware traffic shapers and any system that needs to reorder or modify queued requests.
- Hierarchical leaky bucket. Multiple buckets stacked: per-user, per-tenant, global. A request must pass all of them. Used in multi-tenant API gateways like Apigee and Kong.
- Weighted leaky bucket. Different requests cost different amounts of "water." A POST might cost 5 while a GET costs 1. Lets you rate-limit by computed cost rather than count. Stripe and SendGrid use weighted variants.
- Generic Cell Rate Algorithm (GCRA). A clever continuous-time reformulation that compares the next-arrival time to a "theoretical arrival time." Mathematically equivalent to leaky bucket; popular in ATM networks and modern Redis libraries like
redis-cell.
Leaky bucket vs token bucket
| Aspect | Leaky Bucket | Token Bucket |
|---|---|---|
| Output rate | Exactly r req/sec, smooth | Average r, allows burst up to b |
| Idle reward | None — bucket just drains | Tokens accumulate; later burst possible |
| Burst tolerance | None (or queued up to b) | Yes, up to b in one instant |
| State per client | L (level), last_leak_time | tokens, last_refill_time |
| Typical use | Network shaping, SMTP, payments | Public APIs, NIC, AWS DynamoDB |
| Implementation cost | O(1) counter or O(b) queue | O(1) counter |
| User experience | Steady delay or drop under burst | Brief acceleration, then drop |
The mathematical relationship: leaky bucket = token bucket with token capacity 1. Both algorithms enforce the same long-run rate; they differ only in their short-term burst behavior. Many production systems run both in series: token bucket at the edge (allow legitimate bursts), leaky bucket at the downstream (enforce smooth load).
Redis implementation
Distributed rate limiting needs a shared store with atomic operations. Redis is the canonical choice — sub-millisecond p99 latency, and Lua scripts run atomically as a single transaction.
-- KEYS[1] = "leaky:"
-- ARGV[1] = rate r (per second)
-- ARGV[2] = capacity b
-- ARGV[3] = now (milliseconds since epoch)
local rate = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local data = redis.call("HMGET", KEYS[1], "L", "ts")
local L = tonumber(data[1]) or 0
local ts = tonumber(data[2]) or now
-- Leak
local elapsed = (now - ts) / 1000
L = math.max(0, L - elapsed * rate)
if L + 1 > capacity then
redis.call("HMSET", KEYS[1], "L", L, "ts", now)
redis.call("PEXPIRE", KEYS[1], 60000)
return 0 -- DROP
end
L = L + 1
redis.call("HMSET", KEYS[1], "L", L, "ts", now)
redis.call("PEXPIRE", KEYS[1], 60000)
return 1 -- ACCEPT
This script returns 1 for accept, 0 for drop. Eviction TTL of 60 seconds means idle clients automatically clear from Redis. Throughput on a single Redis instance is around 100,000 calls per second; for higher throughput, shard by client ID.
Python implementation
import time
from collections import defaultdict
from threading import Lock
class LeakyBucket:
def __init__(self, rate, capacity):
self.rate = rate # leak rate (req/sec)
self.capacity = capacity # max queued
self.buckets = defaultdict(lambda: [0, time.time()]) # client -> [L, ts]
self.lock = Lock()
def allow(self, client_id):
with self.lock:
L, ts = self.buckets[client_id]
now = time.time()
L = max(0, L - (now - ts) * self.rate)
if L + 1 > self.capacity:
self.buckets[client_id] = [L, now]
return False
self.buckets[client_id] = [L + 1, now]
return True
# Usage
limiter = LeakyBucket(rate=5, capacity=10)
if limiter.allow("user_42"):
handle_request()
else:
return Response(status=429, headers={"Retry-After": "1"})
Common pitfalls
- Confusing leaky-bucket-as-shaper with leaky-bucket-as-limiter. A shaper queues bursts and releases slowly (higher latency under load). A limiter rejects immediately when the bucket is full. Most production code does both — queue up to b, reject the b+1th.
- Picking capacity too small. b = 1 effectively bans any burst — every micro-spike triggers drops. b should accommodate at least one second of expected burst traffic.
- Picking capacity too large. A bucket of 10,000 means a stuck client can fill it once and then delay legitimate traffic for minutes as it slowly drains. Keep b bounded.
- Sharing one bucket across multiple endpoints with different cost profiles. A POST that touches the DB and a GET that hits the cache cost very different downstream resources. Use weighted leaky bucket or separate buckets per endpoint class.
- No client-side feedback. If you just drop without saying anything, clients retry blindly. Return HTTP 429 with a Retry-After header (in seconds or HTTP-date) so clients can back off appropriately.
- Sharing one bucket across an entire client cluster. If your service is multi-instance and each instance has its own in-memory bucket, your effective rate is N × r — meaningless. Use Redis or a coordinator-style shared store.
- Forgetting clock drift. Distributed implementations must use a single authoritative clock (the Redis server's, typically). Trusting each client's wall clock leads to nondeterministic behavior under skew.
Where leaky bucket is actually used
- Network traffic shaping. Cisco's
traffic-shape, Juniper'straffic-policer, Linuxtcwithtbfandhtb— all leaky-bucket variants. ISPs use them to enforce committed information rate (CIR) on customer links. - Cellular networks. Base stations smooth per-subscriber uplink to protect backhaul. 3GPP standards specify GCRA (a leaky-bucket equivalent) for policing.
- Email and SMS APIs. SendGrid, Twilio, Mailgun smooth outbound message rates to respect downstream SMTP server capacity and SMS carrier per-second limits.
- Payment APIs. Stripe smooths outbound to card networks; Square does similar. The card networks (Visa, Mastercard) require strict per-second caps that cannot be burst-exceeded.
- API gateways. Kong, Apigee, Tyk all offer leaky-bucket rate limiters as a configurable policy. Often paired with a token-bucket for the edge limit.
Performance and impact
The counter-based leaky bucket adds essentially zero per-request overhead — a single arithmetic check against a stored level, an arithmetic update, a memory write. Within a single process this is ~50 nanoseconds. Through Redis, it's a single round-trip — typically 200-500 microseconds on a colocated Redis, dominated by network latency. At 100,000 RPS through a single Redis instance, the rate limiter itself is rarely the bottleneck.
The smoothing impact downstream is dramatic. Picture a downstream system that handles 100 RPS but breaks at 200 RPS. Without a leaky bucket, a client sending 1000 RPS for 1 second can crash the downstream. With a leaky bucket sized r=100, b=20, the downstream sees exactly 100 RPS — never more, never less — and the client gets 980 dropped requests over that second. The 980 drops are a fair price for a downstream that stays up. Stripe's published reliability post-mortems repeatedly cite leaky-bucket-style smoothing on outbound traffic as the reason they could handle Black Friday spikes without overwhelming partner card networks.
Frequently asked questions
What is the leaky bucket algorithm in one sentence?
A FIFO queue (the bucket) with capacity b that drains at constant rate r. Requests enter the bucket; if it overflows, they are dropped. The bucket emits exactly r requests per second regardless of input burstiness, so downstream sees a perfectly smooth load.
Leaky bucket vs token bucket — what's the difference?
Both limit the long-run average rate to r. The difference is short-term burstiness. Token bucket allows a burst up to bucket size b — a client that has been idle can spend accumulated tokens in one rapid burst. Leaky bucket has no such concession: output is exactly r requests/sec, smooth as silk. Use leaky bucket when the downstream system requires a smooth load (cellular base stations, network shapers); use token bucket when you want to reward idle clients with occasional bursts (most public APIs).
Why is leaky bucket called 'shaping' rather than 'limiting'?
Limiting (in the strict sense) means rejecting requests outright when a threshold is exceeded. Shaping means delaying them — the bucket queues bursty input and releases it at the regulated rate. Most leaky-bucket implementations are hybrids: they queue up to b requests (shape) and reject only the b+1th (limit). The network-shaping origin is why Cisco and Juniper documentation prefers the term 'shaping.'
What capacity b should I pick?
Capacity controls how much burstiness the bucket can absorb before dropping. Too small and legitimate brief bursts are dropped; too large and bad-actor bursts soak up so much queue space they delay good traffic. For an API at r=100 req/sec, a bucket of b=20 absorbs 200 ms of full-rate burst. A common formula: b = r × max_tolerable_latency. If you want max queueing delay 500 ms at r=100 RPS, set b = 50.
How does leaky bucket handle long-running outages?
If the downstream is briefly down and the bucket fills, all subsequent requests drop or queue until the bucket drains. After the downstream recovers, the bucket releases at rate r — the queue drains over b/r seconds. The bucket cannot 'lend' capacity from future seconds the way token bucket can borrow from saved tokens. This is sometimes a feature (the downstream sees no burst spike on recovery) and sometimes a bug (legitimate users see prolonged 429s as the queue drains).
How is leaky bucket implemented efficiently?
Two patterns. (1) Counter-based: store current_level and last_update_time. On each request, advance the level by (now - last_update) × r, clamp at zero, then check if level + 1 > b. If yes, reject; else level += 1. Constant memory per client. (2) Queue-based: a real FIFO queue with a timer that pops one element per 1/r seconds. Higher memory but supports exact reordering. The counter-based 'virtual queue' is what Redis-Lua leaky-bucket libraries use; the queue-based form is what hardware traffic shapers use.
Where is leaky bucket actually used?
Network traffic shaping (Cisco, Juniper, Linux tc) — enforcing committed information rate (CIR) on ISP links. Cellular base stations smoothing per-subscriber uplink. Some API gateways for endpoints that require strict smoothing — email sending APIs that respect downstream SMTP capacity, payment APIs that smooth load on the card network. Stripe, Twilio, and SendGrid all use leaky-bucket-style smoothing on outbound calls.