Distributed Patterns

Token Bucket Rate Limit

Tokens refill at rate r — each request takes one — allow bursts up to capacity b

Tokens drip into a bucket at rate r, capped at b. Each request takes 1 token. Empty bucket → reject. Average = r req/sec; burst up to b. AWS, Stripe standard.

  • Average rater tokens/sec
  • Burst capacityb tokens (instantaneous)
  • State per clienttokens, last_refill_ms
  • Refilllazy — computed at check time
  • RejectionHTTP 429 + Retry-After
  • Used inAWS DynamoDB, Stripe, Cloudflare, NIC

Interactive visualization

Tokens accumulate during idle. A burst spends them all at once. Then the rate falls back to r as the bucket refills.

Open visualization fullscreen ↗

Watch the 60-second explainer

A condensed visual walkthrough — narrated, captioned, under a minute.

How the token bucket works

Imagine a parking meter that refills with quarters at a steady rate — one quarter every six seconds, ten quarters per minute. The meter holds at most ten quarters; any further drops are ignored. To park, you spend one quarter from the meter. If the meter is empty, you can't park. If you've stayed off the street for an hour, the meter is fully loaded and you can park ten times rapidly before needing to wait again.

That's the token bucket algorithm. Quarters are tokens. The meter is the bucket. Parking is a request. The refill rate (one quarter every six seconds) is r; the meter's capacity (ten quarters) is b. Long-run, you park at most r times per second. Short-run, you can burst up to b times when the bucket is full.

Formally:

  • r — refill rate. Tokens added per second. Equal to the long-run allowed request rate.
  • b — capacity. Maximum tokens the bucket can hold. Equal to the maximum allowed instantaneous burst.
  • tokens — current count. Refills toward b at rate r; decremented by each accepted request.

The algorithm per request:

on_request(now):
    # Lazy refill — compute tokens earned since last check
    elapsed = now - last_refill_time
    tokens = min(b, tokens + elapsed * r)
    last_refill_time = now

    if tokens >= 1:
        tokens -= 1
        ACCEPT
    else:
        REJECT with HTTP 429
        # Retry-After: ceil((1 - tokens) / r) seconds

The "lazy refill" is the critical optimization. You don't run a timer that drops a token every 1/r seconds — that's an expensive, lossy thing to schedule. Instead, you store last_refill_time and compute the elapsed-time × rate at check time. Constant memory per client, O(1) per request, no scheduler involved.

Worked example — bursting and recovering

A token bucket with r = 10 tokens/sec, b = 50. Client has been idle for a long time (bucket = 50 full). At t=0 it starts sending requests at 60 req/sec.

Time     | Requests sent | Tokens after | Result
---------|---------------|--------------|-------
0.000 s  |       1       |    49.0      | ✓
0.016 s  |       1       |    48.0      | ✓
0.033 s  |       1       |    47.0      | ✓
...
0.833 s  |       1       |     0.0      | ✓ (50th)  ← all initial tokens spent
0.850 s  |       1       |     0.0      | ✗ (51st)  → 429, Retry-After: 1
0.867 s  |       1       |    -0.0      | ✗
...     | (10 reqs/sec sustained, refilling at 10 tokens/sec, so 1 in 6 succeeds)

Stable state at 60 req/sec sustained input:
  10/60 ≈ 16.7% of requests succeed; 83% get 429.
  Effective accepted rate = 10/sec = r.

The first second sees 50 accepts (the burst spending the full bucket). After that, sustained input above r results in some rejections — exactly enough to hold the average at r. The visible UX: a snappy initial burst (UI loads fast, batch jobs surge through), then steady-state throttling once the burst is exhausted.

Idle behavior: if the client stops sending for 5 seconds, the bucket refills to min(50, 0 + 5 × 10) = 50. The next burst can spend all 50 again. This "reward for being polite" is what users perceive as fair — heavy users get throttled, light users get full speed.

Variants

  • Standard token bucket. One bucket per client. Refill rate r, capacity b. Most common.
  • Hierarchical / nested buckets. Multiple buckets stacked — per-user, per-tenant, global. A request must claim a token from all of them. Used in multi-tenant API gateways (Kong, Apigee).
  • Weighted token bucket. Each request costs a variable number of tokens. AWS DynamoDB charges 0.5/1/2 RCU for eventual/strong/transactional reads. Stripe charges higher token cost for POSTs than GETs.
  • Burst-credit DynamoDB style. AWS provisioned-capacity table accumulates up to 5 minutes' worth of unused capacity as a "burst credit" — effectively b = r × 300 seconds. Allows surprise spikes without spinning up auto-scaling.
  • Adaptive r. The refill rate is itself tuned based on downstream health. If downstream P99 latency rises, r drops automatically. Adaptive concurrency in Netflix's Resilience4j fork is one production example.
  • Multi-instance with shared store. Buckets live in Redis or a custom coordinator; every app instance checks against the shared bucket so total rate is enforced cluster-wide. Stripe's published architecture; Envoy's global rate limit service.

Token bucket vs leaky bucket

AspectToken BucketLeaky Bucket
Average accepted raterr
Burst behaviorAllows up to b in one instantNo burst — exactly r out
Idle rewardAccumulate up to b tokensNone — bucket just empties
State per clienttokens, last_refill_mslevel, last_leak_ms
Action on overflowReject (HTTP 429)Drop or queue
UX impressionSnappy for polite clientsAlways smooth, never instant
Typical usersPublic APIs, AWS, StripeCellular, payments, SMTP

You can think of leaky bucket as token bucket with b = 1 — no burst tolerance, instant smoothing. Real systems often run both: an outer token bucket at the edge for the user-facing burst-friendly limit, an inner leaky bucket between the service and a strict downstream that cannot tolerate bursts.

Redis implementation

For multi-instance services, a shared Redis with a Lua script is the standard:

-- KEYS[1] = "tb:"
-- ARGV[1] = rate r (tokens per second)
-- ARGV[2] = capacity b
-- ARGV[3] = now (milliseconds since epoch)
-- ARGV[4] = cost (tokens to consume, default 1)

local rate     = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local now      = tonumber(ARGV[3])
local cost     = tonumber(ARGV[4] or 1)

local data = redis.call("HMGET", KEYS[1], "tokens", "ts")
local tokens = tonumber(data[1]) or capacity
local ts     = tonumber(data[2]) or now

-- Lazy refill
local elapsed = (now - ts) / 1000
tokens = math.min(capacity, tokens + elapsed * rate)
ts = now

if tokens >= cost then
    tokens = tokens - cost
    redis.call("HMSET", KEYS[1], "tokens", tokens, "ts", ts)
    redis.call("PEXPIRE", KEYS[1], 60000)
    return {1, tokens, 0}     -- accept, tokens_remaining, retry_after_ms
else
    local retry_ms = math.ceil((cost - tokens) / rate * 1000)
    redis.call("HMSET", KEYS[1], "tokens", tokens, "ts", ts)
    redis.call("PEXPIRE", KEYS[1], 60000)
    return {0, tokens, retry_ms}
end

The script returns {allow, tokens_remaining, retry_after_ms}. The caller uses these to set the response headers X-RateLimit-Remaining and Retry-After. Throughput on a single Redis: 50-100k limiter checks per second.

Python implementation

import time
from collections import defaultdict
from threading import Lock

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.state = defaultdict(lambda: [capacity, time.time()])
        self.lock = Lock()

    def allow(self, client_id, cost=1):
        with self.lock:
            tokens, ts = self.state[client_id]
            now = time.time()
            tokens = min(self.capacity, tokens + (now - ts) * self.rate)
            if tokens >= cost:
                self.state[client_id] = [tokens - cost, now]
                return True, int(tokens - cost), 0
            else:
                retry_after = (cost - tokens) / self.rate
                self.state[client_id] = [tokens, now]
                return False, int(tokens), retry_after

# Usage
bucket = TokenBucket(rate=100, capacity=200)

ok, remaining, retry_after = bucket.allow(request.client_id)
response.headers["X-RateLimit-Remaining"] = str(remaining)
if not ok:
    response.headers["Retry-After"] = str(int(retry_after + 1))
    return Response(status=429)

Case study — AWS DynamoDB

DynamoDB provisioned tables use a token bucket per partition. The user configures Read Capacity Units (RCU) and Write Capacity Units (WCU) — these are the rate r. The bucket capacity is sized so each partition can accumulate up to 5 minutes' worth of unused capacity. A query that needs more RCU than the bucket has available returns a ProvisionedThroughputExceededException — the application is expected to retry with exponential backoff.

The 5-minute burst credit is a deliberate design decision: it lets sudden surges (a hot Black-Friday product page) succeed against a table provisioned for steady-state traffic, without forcing the user to over-provision. The token bucket is what makes this work — without a burst-allowing algorithm, a steady-state-provisioned table would reject every Black-Friday surge.

DynamoDB on-demand (no explicit RCU/WCU) uses a multi-layer adaptive token bucket: per-partition limits raised by the auto-scaler as traffic increases, with a global account-level cap. Same algorithm, just the rate r is dynamic.

Common pitfalls

  • In-memory bucket on multi-instance service. Each instance has its own bucket; total cluster rate is N × r, not r. Use Redis or a coordinator for any production rate limiter on a horizontally-scaled service.
  • Returning HTTP 429 without Retry-After. Clients retry blindly, often immediately. Always include Retry-After (in seconds) computed as ceil((cost - tokens) / r).
  • Capacity b set equal to r. Effectively token bucket = leaky bucket — no burst allowed, no idle reward. Pick b based on what the downstream can absorb in one instant.
  • Capacity b set absurdly high. b = 10,000 with r = 10 means a malicious client idle for 1000 seconds can fire 10,000 requests instantly. Cap b at a value that the downstream can survive.
  • Ignoring clock skew across instances. If each instance trusts its own clock to drive the bucket's lazy refill, drift across instances causes inconsistent enforcement. Use a single authoritative clock — Redis's TIME or a NTP-disciplined coordinator.
  • Counting before checking. Decrementing tokens, then checking if >= 0 leaks slightly — under contention, two parallel calls each see tokens = 1, both decrement, finish at tokens = -1, and both succeed. Use atomic Lua scripts or compare-and-swap.
  • Same bucket for endpoints with different costs. A POST that processes a 10 MB payload and a GET that returns a cached value cost the downstream very differently. Use weighted token bucket or separate buckets per endpoint.

Performance and impact

The per-request cost of a token bucket is dominated by the state access. In-memory: 50-100 nanoseconds (a hashmap lookup, an arithmetic check, a write). Through Redis: 200-500 microseconds — a single round-trip dominated by network latency. At any rate the token bucket is rarely the bottleneck of a real API.

The downstream protection is the same as leaky bucket — long-run rate capped at r — but the user experience differs sharply. A token bucket lets a freshly-arrived user fire 50 requests instantly (b = 50, fully refilled bucket), which feels like a snappy app. A leaky bucket forces every user to a single regulated rate, which feels like uniform slowness. Most public APIs (Stripe, GitHub, Slack) pick token bucket because the burst-friendliness is part of the product feel.

Token bucket is what powers AWS DynamoDB's burst credits, Stripe's per-account rate limits, GitHub's API limits (with both per-user and per-IP), Cloudflare's bot management, and the network-level rate enforcement in every modern NIC firmware. It is, without exaggeration, the most-deployed rate-limiting algorithm in the industry.

Frequently asked questions

What is the token bucket algorithm in one sentence?

A bucket holds tokens, refilling at rate r tokens per second up to capacity b. Each request consumes one token; if the bucket is empty, the request is rejected. Average accepted rate over the long run is r, but a single burst of up to b requests can pass instantly if the bucket is full.

Token bucket vs leaky bucket — which to pick?

Both enforce the same long-run rate r. Token bucket allows a burst up to b instantly (rewards idle clients); leaky bucket forces every output to a strictly smooth rate. Choose token bucket for public APIs where occasional bursts are fine and a snappy UX matters (Stripe, AWS, GitHub all use token-bucket style). Choose leaky bucket when downstream cannot handle bursts at all (cellular base stations, network shapers, outbound SMTP).

How does the refill work continuously without a timer?

You don't run a timer that adds a token every 1/r seconds — that's expensive and lossy. Instead, store last_refill_time. On each request, compute new_tokens = min(b, tokens + (now - last_refill_time) × r), then check if new_tokens >= 1. If yes, tokens = new_tokens - 1, update last_refill_time = now, accept. If no, reject. This 'lazy refill' is O(1), no scheduler, no clock-driven side effects.

What does the burst capacity b control?

b is the maximum number of requests you'll accept in one instant after a long idle period. Average accepted rate is still r per second, but a client idle for ten seconds will have b tokens accumulated and can spend them in a single burst. AWS DynamoDB uses 5 minutes of accumulated capacity as its 'burst' — you can briefly exceed provisioned RCU/WCU. Stripe API allows 100-request bursts at sustained 100 req/sec. Pick b based on what the downstream can absorb in one instant.

How is token bucket implemented in Redis for a distributed system?

Use a Lua script for atomicity. Store tokens and last_refill_ms as fields of a Redis hash keyed by client_id. The script computes elapsed time, adds tokens (clamped to b), then decrements one if available. Returns 1 (allow) or 0 (reject). A typical single-instance Redis handles 50-100k limiter checks/sec. For higher throughput, shard by client_id across multiple Redis instances; envoy global rate limit uses this exact pattern.

What about cost-weighted requests?

Many real APIs charge different requests different token counts. AWS DynamoDB's 'eventually consistent read' costs 0.5 RCU; a 'strongly consistent read' costs 1 RCU; a 'transactional read' costs 2 RCU. To implement: change tokens >= 1 to tokens >= cost and tokens -= cost. Stripe charges higher token cost for POST than GET. This lets you rate-limit by downstream resource consumed rather than by request count.

What HTTP response should a token bucket return on rejection?

HTTP 429 Too Many Requests. Include Retry-After header — the number of seconds until the bucket has 1 token, computed as ceil((1 - tokens) / r). Also include the X-RateLimit-Limit (=b), X-RateLimit-Remaining (=floor(tokens)), X-RateLimit-Reset (=unix timestamp at which bucket is full again). RFC 9239 (2024) standardizes the un-prefixed RateLimit-* headers but X-RateLimit-* is still more widely understood.