Networking
Rate Limiting (Token & Leaky Bucket)
Cap the request rate without punishing honest bursts — three algorithms, three trade-offs
Rate limiting caps how fast clients can hit a service. Token bucket allows bursts up to a capacity, leaky bucket smooths traffic to a constant rate, and sliding window counts requests over a rolling interval — each trading burst tolerance for fairness.
- Token / leaky bucket checkO(1) time & space
- Sliding-window logO(N) memory per key
- Sliding-window counterO(1), ≈0.003% error
- Burst toleranceToken bucket only
- Reject status429 + Retry-After
Interactive visualization
Press play, or step through manually. The visualization is yours to drive — try it before reading on.
Watch the 60-second explainer
A condensed visual walkthrough — narrated, captioned, under a minute.
What rate limiting actually does
Every public endpoint has a finite budget — CPU, database connections, a paid downstream API with its own quota. Without a governor, a single misbehaving client (a runaway retry loop, a scraper, a credential-stuffing bot) can consume that whole budget and take the service down for everyone else. Rate limiting is the governor: it decides, for each incoming request, "is this client allowed to proceed right now, or do we reject it?"
The decision is keyed on something — an API key, a user ID, an IP address, or a tuple of those. For each key the limiter keeps a tiny piece of state and runs one of a handful of algorithms against the clock. The three that matter in practice are token bucket, leaky bucket, and the sliding window family. They differ in exactly one question: how do you treat a client that was quiet and then suddenly bursts?
A rejected request conventionally gets HTTP 429 Too Many Requests (RFC 6585, 2012) with a Retry-After header. Well-designed limiters also expose RateLimit-Remaining and RateLimit-Reset so clients can self-pace instead of retrying blindly.
Token bucket — pay-as-you-go with savings
Picture a bucket that holds at most b tokens and refills at r tokens per second. Every request must remove one token; if the bucket is empty, the request is rejected (or, in a queuing variant, made to wait). The bucket never overflows past b.
The genius is the saved-up credit. A client that has been idle for ten seconds with r = 5/s and b = 20 has a full bucket of 20 tokens, so it can fire 20 requests instantly — a legitimate burst — and then settles to the steady 5/s as the bucket drains and refills. The long-run average is capped at r, but short bursts up to b are forgiven.
You never run a background timer to add tokens. Instead you store the token count and a lastRefill timestamp, and compute the refill lazily on each request: tokens = min(b, tokens + (now − lastRefill) · r). That makes the whole check O(1) time and O(1) space — two numbers per key. This is why token bucket backs throttling in Stripe and AWS API Gateway. (NGINX's limit_req looks similar but is actually a leaky bucket — its burst parameter is the queue size, not saved-up token credit.)
Leaky bucket — a queue that drains at constant rate
The leaky bucket inverts the metaphor. Requests pour into a bucket (a FIFO queue of fixed capacity) and leak out the bottom at a fixed rate r. If the bucket is full when a request arrives, that request spills over and is dropped. The processor downstream sees a perfectly smooth, constant stream — no matter how spiky the input.
The crucial difference from token bucket: leaky bucket has no memory of idle time. Being quiet doesn't buy you a burst, because there's no credit to accumulate — the leak rate is constant. That makes it the right choice when the thing you're protecting truly cannot absorb bursts: a legacy mainframe, a payment processor with a hard per-second cap, an SMS gateway.
There are two flavors. The queuing leaky bucket actually buffers requests and delays them, adding latency but losing nothing until the queue overflows. The meter variant (GCRA — Generic Cell Rate Algorithm, from ATM networking) doesn't queue anything; it just tracks a virtual "theoretical arrival time" and rejects requests that arrive too early. GCRA is mathematically a leaky bucket but runs in O(1) with a single stored timestamp, which is why redis-cell uses it.
Fixed window, sliding log, sliding counter
The simplest limiter is the fixed-window counter: count requests in the current calendar minute, reset to zero at the boundary. It's one integer per key and trivially O(1) — but it has a notorious flaw. A client can send the full limit in the last second of 12:00 and the full limit in the first second of 12:01: double the limit in a two-second span straddling the boundary.
The sliding-window log fixes this exactly. Store the timestamp of every accepted request in a sorted set; on each new request, evict timestamps older than the window and count what's left. Precise to the millisecond — but memory is O(N) in the number of requests inside the window, so a hot key doing 10,000 req/s with a 60-second window stores up to 600,000 timestamps. That's the cost of exactness.
The sliding-window counter is the pragmatic middle. Keep two fixed-window counts — the current window and the previous one — and estimate the rolling rate by weighting the previous window by how much of it still overlaps the sliding window:
estimate = current_count
+ previous_count × (1 − elapsed_fraction_of_current_window)
This is O(1) space (two counters) and Cloudflare reported that, across 400 million requests, it differed from the exact sliding-window log by only about 0.003% of requests — just 3 sources were wrongly allowed, all of them under 15% over the true cap. For almost every API that error is invisible, which is why the sliding-window counter is the most widely deployed accurate limiter.
Algorithms compared
| Token bucket | Leaky bucket (queue) | Leaky bucket (GCRA) | Fixed window | Sliding log | Sliding counter | |
|---|---|---|---|---|---|---|
| Allows bursts | Yes, up to b | Buffered, delayed | No | Yes (at boundary) | No | No |
| Output shape | Bursty then steady | Perfectly smooth | Smooth | Spiky | Capped exact | Capped approx |
| Time per check | O(1) | O(1) | O(1) | O(1) | O(log N) | O(1) |
| Memory per key | 2 numbers | queue + count | 1 timestamp | 1 counter | O(N) timestamps | 2 counters |
| Boundary exploit | None | None | None | 2× limit | None | ≈0.003% error |
| Drops vs delays | Drops excess | Delays, then drops | Drops excess | Drops excess | Drops excess | Drops excess |
| Typical use | APIs, gateways | Traffic shaping | Redis (redis-cell) | Crude quotas | Exact billing limits | Production default |
The headline split is burst handling. Token bucket rewards a quiet client with a burst; leaky bucket refuses to, smoothing everything to a constant rate. Sliding window doesn't care about smoothness at all — it cares about a precise count over a rolling interval. Pick token bucket for friendliness to bursty-but-honest clients, leaky bucket when the downstream genuinely can't take a spike, and sliding-window counter when you need an accurate per-interval cap without the memory of a log.
What the numbers actually say
- Token bucket is two numbers per key. A 16-byte float pair plus an 8-byte key fits in < 32 bytes; ten million keys is under 320 MB — comfortable for one Redis node.
- Sliding-window log is O(N) memory. At 10k req/s over a 60 s window, one hot key holds 600,000 timestamps × 8 bytes ≈ 4.8 MB per key. A thousand such keys is 4.8 GB — the reason logs are reserved for low-volume exact limits.
- Sliding-window counter's error is ≈0.003%. Cloudflare's published figure over 400 M requests; the worst case is bounded — it can over-count by at most the previous window's count, never silently leak unbounded traffic.
- Fixed window's worst case is exactly 2×. Limit of 100/min permits 200 requests in a 2-second span straddling the minute boundary — provably the maximum.
- A Redis round trip is ~0.2–1 ms. A centralized atomic check-and-increment (INCR + EXPIRE, or a Lua script) adds sub-millisecond latency, which is why distributed limiters lean on Redis rather than per-node counters that drift.
JavaScript implementation
A lazy-refill token bucket — the algorithm you'll reach for 80% of the time. No timers; tokens are recomputed on each request from the elapsed time.
class TokenBucket {
// capacity b = max burst; refillRate r = tokens added per second
constructor(capacity, refillRate) {
this.capacity = capacity;
this.rate = refillRate;
this.tokens = capacity; // start full
this.last = Date.now() / 1000; // seconds
}
// returns true if the request is allowed (and consumes a token)
tryConsume(cost = 1) {
const now = Date.now() / 1000;
// lazily add tokens for the elapsed time, capped at capacity
this.tokens = Math.min(this.capacity, this.tokens + (now - this.last) * this.rate);
this.last = now;
if (this.tokens >= cost) {
this.tokens -= cost;
return { allowed: true, remaining: Math.floor(this.tokens) };
}
// seconds until enough tokens exist again — feed into Retry-After
const wait = (cost - this.tokens) / this.rate;
return { allowed: false, remaining: 0, retryAfter: Math.ceil(wait) };
}
}
// 5 req/s sustained, bursts up to 20
const limiter = new TokenBucket(20, 5);
const r = limiter.tryConsume();
if (!r.allowed) res.status(429).set('Retry-After', r.retryAfter).end();
The sliding-window counter — the accurate-enough O(1) limiter — leaning on two fixed-window counts:
function slidingWindowAllowed(store, key, limit, windowMs, now = Date.now()) {
const curWindow = Math.floor(now / windowMs);
const elapsed = (now % windowMs) / windowMs; // 0..1 into current
const cur = store.get(`${key}:${curWindow}`) || 0;
const prev = store.get(`${key}:${curWindow - 1}`) || 0;
// weight the previous window by the fraction still inside the sliding view
const estimate = cur + prev * (1 - elapsed);
if (estimate >= limit) return false;
store.incr(`${key}:${curWindow}`, windowMs * 2); // TTL spans 2 windows
return true;
}
Python implementation
The same token bucket, plus the GCRA (leaky-bucket meter) that powers single-timestamp distributed limiters.
import time, math
class TokenBucket:
def __init__(self, capacity, refill_rate):
self.capacity = capacity
self.rate = refill_rate
self.tokens = capacity
self.last = time.monotonic()
def try_consume(self, cost=1):
now = time.monotonic()
self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
self.last = now
if self.tokens >= cost:
self.tokens -= cost
return True, math.floor(self.tokens), 0
wait = (cost - self.tokens) / self.rate
return False, 0, math.ceil(wait) # allowed, remaining, retry_after
class GCRA:
"""Leaky bucket as a meter — one stored timestamp, no queue.
period = seconds between requests at the steady rate;
burst = max requests allowed back-to-back (burst of 1 = pure rate)."""
def __init__(self, period, burst=1):
self.period = period
self.tolerance = period * (burst - 1) # how far ahead of TAT is OK
self.tat = time.monotonic() # theoretical arrival time
def allowed(self):
now = time.monotonic()
tat = max(self.tat, now)
if now < tat - self.tolerance: # arrived too early -> reject
return False
self.tat = tat + self.period # advance the virtual schedule
return True
# 5 req/s, tolerate bursts of 20
bucket = TokenBucket(capacity=20, refill_rate=5)
gcra = GCRA(period=1/5, burst=20)
Note the GCRA stores exactly one number — tat, the theoretical arrival time — yet reproduces leaky-bucket behavior precisely. That single-timestamp footprint is what lets a Redis Lua version scale to millions of keys in O(1) per check.
Variants worth knowing
Hierarchical / nested limits. Real APIs stack limits: 10 req/s per IP, 100 req/s per API key, 10,000 req/s per organization. Each request must pass every applicable bucket. The tightest binding limit wins, and you return the matching Retry-After.
Distributed token bucket. Per-node counters drift — three nodes each allowing 5/s yields 15/s globally. The fix is a centralized atomic counter in Redis (a Lua script makes check-and-decrement a single round trip), or a token-lease scheme where nodes pull blocks of tokens from a central authority and limit locally between refills, trading exactness for fewer round trips.
Concurrency limiting (in-flight cap). A cousin that caps simultaneous requests rather than rate — a semaphore of size k. Useful when the constraint is connection pool size, not requests per second.
Adaptive / congestion-based limiting. Instead of a fixed rate, adjust the limit based on observed latency or error rate, much like TCP congestion control — additive increase, multiplicative decrease. Netflix's concurrency-limits library does this with a Little's-law estimate of the safe in-flight count.
Common bugs and edge cases
- The fixed-window boundary burst. The single most common mistake: shipping a fixed-window counter and being surprised when clients sustain 2× the limit. Use sliding-window counter unless you truly don't care.
- Forgetting the clock can move backward. Use a monotonic clock for refill math (
time.monotonic(), not wall clock). An NTP step backward with wall-clock time can hand a client free tokens or lock it out. - Non-atomic check-then-increment in a distributed limiter. Reading the counter, deciding, then incrementing in separate calls races under load and overshoots. Do it in one atomic operation — a Redis Lua script or
INCRwith conditional logic. - Limiting by raw IP behind a proxy or NAT. Thousands of users behind one corporate NAT share an IP and trip each other's limit; meanwhile a botnet rotates IPs freely. Limit on the authenticated key when you have one.
- No
Retry-Afterheader. Returning 429 with no hint just makes clients retry immediately and harder, amplifying the storm. Always tell them when to come back. - Unbounded sliding-window log on a hot key. A celebrity or a scraper turns the O(N) log into a memory blowup. Cap the stored timestamps, or switch that key to a counter.
- Counting rejected requests against the limit. A rejected request shouldn't consume a token in a token bucket — otherwise a client that keeps getting 429s can never recover. Only decrement on accept.
Frequently asked questions
What is the difference between token bucket and leaky bucket?
Token bucket lets you spend saved-up credit in a burst — if the bucket holds b tokens, a client that has been idle can fire b requests instantly, then is throttled to the refill rate r. Leaky bucket has no such memory: requests drain from a queue at a fixed rate regardless of how long the client was quiet, so the output is perfectly smooth but bursts get queued or dropped.
Why does the fixed-window counter let through double the limit?
A fixed window resets its counter at a wall-clock boundary. A client can send the full limit in the last second of one window and the full limit in the first second of the next — twice the intended rate inside a two-second span straddling the boundary. The sliding-window algorithms exist precisely to close this edge.
How does the sliding window log work and why is it expensive?
It stores a timestamp for every accepted request, drops timestamps older than the window, and counts what remains. It is exact, but memory grows with the number of requests in the window — O(N) per key — so at high throughput a single hot key can store millions of timestamps. Sliding-window-counter approximates it in O(1) space using two adjacent fixed-window counts.
How do you rate limit across many servers?
Centralize the counter in a shared store such as Redis, updated atomically with INCR plus EXPIRE or a Lua script so the check-and-increment is a single round trip. The classic GCRA / token-bucket Lua script used by redis-cell runs in O(1) and stores just one timestamp per key, which is why it scales to millions of keys.
What HTTP status code should a rate limiter return?
429 Too Many Requests, defined in RFC 6585, ideally with a Retry-After header telling the client how many seconds to wait. Many APIs also send RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset headers (the IETF draft standardizes these) so well-behaved clients can self-throttle instead of hammering until they get blocked.
Is rate limiting the same as load balancing or backpressure?
No. Rate limiting caps the rate of accepted work and rejects the excess. Backpressure signals upstream to slow down rather than dropping, so no work is lost. Load balancing spreads accepted work across replicas. They compose: a gateway rate-limits per client, applies backpressure to its own queue, and a balancer fans surviving requests across servers.