API Rate Limiting

Cap requests per client to protect backends: several classical algorithms, each with trade-offs

API rate limiting caps the number of requests a client can make in a time window to protect backend resources, prevent abuse, and ensure fair multi-tenant usage. Common algorithms: Token bucket (refill rate r, capacity b, bursts allowed up to b), used by Stripe and AWS API Gateway. Leaky bucket (constant outflow rate, excess dropped), smoother but with no burst tolerance. Fixed window counter (count per minute, reset at the boundary), simple but allows up to 2× the limit across a window boundary. Sliding window log and sliding window counter, fairer at the boundary, with the log costing more memory. Distributed rate limiting typically uses Redis with Lua scripts or a specialized proxy (Envoy global rate limit). Over-limit requests receive HTTP 429, usually accompanied by the headers X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, and Retry-After.

Status code: 429 Too Many Requests
Algorithms: Token bucket, leaky bucket, sliding window
Standard headers: X-RateLimit-*
AWS default: 10,000 RPS
Stripe: Token bucket
Distributed: Redis + Lua

Why rate limiting matters

DDoS and abuse prevention. Without rate limits, a single misbehaving (or malicious) client can saturate backend capacity, taking down service for everyone. Rate limits are the first line of defense before WAF and DDoS scrubbing.
Fairness across tenants. Multi-tenant SaaS APIs must prevent one customer's traffic from starving another. Per-tenant limits guarantee a baseline of capacity.
Cost control. Every backend call costs money — database CPU, third-party API fees, GPU inference. Rate limits cap exposure to runaway costs from buggy clients or compromised credentials.
Tier enforcement. Free and paid tiers often differ primarily in rate limits. Stripe, for example, has documented default live-mode limits of 100 read and 100 write requests per second, with lower limits in test mode.
Backpressure. Rate limiting communicates that the server is at capacity; clients should slow down rather than retry hot.
Login security. Authentication endpoints need especially aggressive rate limits to deter credential-stuffing — e.g., 5 failed attempts per 15 minutes per username.
Compliance. Many regulated APIs (banking, health) require documented rate limits as part of their security and resilience posture.

Token bucket in depth

State. Per client: tokens (float) and last_refill (timestamp).
Refill. On each request, compute elapsed = now - last_refill; tokens = min(b, tokens + elapsed * r).
Consume. If tokens >= 1: tokens -= 1, accept. Else reject with 429.
Burst. A client idle for b/r seconds accumulates b tokens and can spend them in a single burst. This is desirable for bursty workloads (Stripe webhook retries, AWS Lambda spikes).
Variable cost. Operations of different cost can consume more than 1 token (a complex GraphQL query: 5 tokens; a simple GET: 1).
Adopters. Stripe API, AWS API Gateway, GitHub REST and GraphQL, Cloudflare Workers, Slack Web API.
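
The state, refill, and consume steps above can be sketched in a few lines of Python. This is a minimal single-process sketch, not any vendor's implementation: the class name TokenBucket, the allow method, and the injectable now argument (there so the behavior can be checked without sleeping) are assumptions made for illustration.

```python
import time
from typing import Optional


class TokenBucket:
    """Single-client token bucket with refill rate r and capacity b."""

    def __init__(self, rate: float, capacity: float, now: Optional[float] = None):
        self.rate = rate              # r: tokens added per second
        self.capacity = capacity      # b: maximum tokens held
        self.tokens = capacity        # start full, so a new client may burst
        self.last_refill = time.monotonic() if now is None else now

    def allow(self, cost: float = 1.0, now: Optional[float] = None) -> bool:
        """Refill lazily, then spend `cost` tokens if enough are available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False  # the caller would answer with HTTP 429


if __name__ == "__main__":
    bucket = TokenBucket(rate=1.0, capacity=3.0, now=0.0)
    # Three requests pass as a burst; the fourth is rejected until a refill.
    print([bucket.allow(now=0.0) for _ in range(4)])
```

The cost argument covers the variable-cost case: an expensive operation can pass cost=5 while a simple GET uses the default of 1. A request whose cost exceeds the capacity can never succeed, so a real limiter would probably reject it up front.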

Leaky bucket

Two flavors. Leaky bucket as a meter (analogous to token bucket but inverted) and leaky bucket as a queue (incoming requests join a fixed-size queue, drained at constant rate; queue full => drop).
Smooth output. Output rate is exactly r; no bursts ever leak through. Good for protecting downstream systems that cannot tolerate spikes.
Latency cost. Queue-style leaky bucket adds wait time when the queue is non-empty. Token bucket either rejects immediately or accepts immediately.
Networking heritage. Originated in ATM traffic shaping (Turner 1986). Used by network QoS systems and CoAP.
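
The queue flavor described above can be sketched as follows. This is an illustrative model under stated assumptions, not a production shaper: the names LeakyBucketQueue, offer, and drain are hypothetical, time is passed in explicitly, and a real server would run drain from a worker loop or timer.

```python
from collections import deque


class LeakyBucketQueue:
    """Leaky bucket as a queue: bounded buffer drained at a constant rate."""

    def __init__(self, rate: float, queue_size: int, now: float = 0.0):
        self.interval = 1.0 / rate    # seconds between two releases
        self.queue_size = queue_size
        self.queue = deque()
        self.next_release = now       # earliest time the next item may leave

    def offer(self, item, now: float) -> bool:
        """Enqueue the item, or return False (drop) when the queue is full."""
        if len(self.queue) >= self.queue_size:
            return False
        if not self.queue:
            # An idle bucket earns no credit: never schedule in the past.
            self.next_release = max(self.next_release, now)
        self.queue.append(item)
        return True

    def drain(self, now: float):
        """Return (release_time, item) pairs that are due, one per interval."""
        released = []
        while self.queue and self.next_release <= now:
            released.append((self.next_release, self.queue.popleft()))
            self.next_release += self.interval
        return released


if __name__ == "__main__":
    lb = LeakyBucketQueue(rate=2.0, queue_size=3)
    print([lb.offer(x, now=0.0) for x in "abcd"])  # the fourth item is dropped
    print(lb.drain(now=1.0))                        # releases spaced 0.5 s apart
```

Unlike the token bucket, an accepted request here may wait in the queue before it is served, which is the latency cost the section mentions.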

Fixed window counter

State. Per client: count and window_start.
Algorithm. If now is in the current window: increment and check. If now is in a new window: reset to 1.
Memory. Trivial: O(1) per client.
Boundary problem. A client can fire 2× the limit across a window boundary — 100 requests in the last second of the minute, then 100 in the first second of the next minute. Tolerable for coarse safety limits, broken for fairness guarantees.
Adopters. Quick to implement; common in homegrown rate limiters; good as a first-pass defense before a more sophisticated algorithm.
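
A short sketch makes the boundary problem concrete. The class name FixedWindowCounter and the in-memory dict are assumptions for illustration; timestamps are passed in as plain seconds so the example is deterministic.

```python
class FixedWindowCounter:
    """Per-client fixed window counter with O(1) state per client."""

    def __init__(self, limit: int, window_s: float):
        self.limit = limit
        self.window_s = window_s
        self.state = {}  # client_id -> (window_start, count)

    def allow(self, client_id: str, now: float) -> bool:
        window_start = (now // self.window_s) * self.window_s
        start, count = self.state.get(client_id, (window_start, 0))
        if start != window_start:
            # A new window has begun: the counter starts over.
            start, count = window_start, 0
        if count >= self.limit:
            self.state[client_id] = (start, count)
            return False
        self.state[client_id] = (start, count + 1)
        return True


if __name__ == "__main__":
    fw = FixedWindowCounter(limit=100, window_s=60)
    late = sum(fw.allow("client-a", 59.0) for _ in range(150))
    early = sum(fw.allow("client-a", 60.0) for _ in range(150))
    # Both bursts are admitted in full, one second apart.
    print(late, early)
```

With a limit of 100 per minute, the two bursts at t=59 s and t=60 s are each admitted in full, so 200 requests pass within about one second.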

Sliding window log and counter

Sliding window log. Store timestamps of recent requests in a sorted set. On each request: ZREMRANGEBYSCORE removes timestamps older than now-W; ZCARD checks count; if below limit, ZADD and accept. Memory: O(limit) per client.
Sliding window counter. Approximate the log with two fixed-window counts: previous_count and current_count, plus a fraction-of-window-elapsed factor. estimated = previous_count * (1 - elapsed_fraction) + current_count. Memory: O(1) per client; accuracy within a few percent of the log.
Trade-off. Log is exact but expensive. Counter is cheap but approximate. Real systems often use sliding window counter as the production choice.
Adopters. Cloudflare's per-zone rate limits use a sliding window counter; many internal Stripe/Twilio rate limiters do too.
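
Both variants can be sketched in memory for a single client to show how they differ. The class names and the acceptance rule estimated + 1 <= limit are assumptions of this sketch; production limiters may round or compare differently.

```python
from collections import deque


class SlidingWindowLog:
    """Exact: keeps one timestamp per accepted request, O(limit) memory."""

    def __init__(self, limit: int, window_s: float):
        self.limit = limit
        self.window_s = window_s
        self.timestamps = deque()

    def allow(self, now: float) -> bool:
        # Drop timestamps that have fallen out of the window (now - W, now].
        while self.timestamps and self.timestamps[0] <= now - self.window_s:
            self.timestamps.popleft()
        if len(self.timestamps) < self.limit:
            self.timestamps.append(now)
            return True
        return False


class SlidingWindowCounter:
    """Approximate: two fixed-window counts plus interpolation, O(1) memory."""

    def __init__(self, limit: int, window_s: float):
        self.limit = limit
        self.window_s = window_s
        self.window_index = None
        self.previous_count = 0
        self.current_count = 0

    def allow(self, now: float) -> bool:
        index = int(now // self.window_s)
        if self.window_index is None:
            self.window_index = index
        elif index != self.window_index:
            # Carry the old count over only if the windows are adjacent.
            adjacent = index == self.window_index + 1
            self.previous_count = self.current_count if adjacent else 0
            self.current_count = 0
            self.window_index = index
        elapsed_fraction = (now % self.window_s) / self.window_s
        estimated = self.previous_count * (1 - elapsed_fraction) + self.current_count
        if estimated + 1 <= self.limit:
            self.current_count += 1
            return True
        return False
```

With a limit of 100 per 60 s and 100 requests at t=59 s, the counter admits nothing at t=60 s and 50 requests at t=90 s (half of the previous window still counts). The log admits nothing until t=119 s, when the old timestamps expire.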

Distributed rate limiting

The problem. Multiple backend instances need a shared view of per-client request counts. In-memory counters per instance are inconsistent.
Redis + Lua. Each instance forwards rate checks to Redis. A Lua script atomically reads/updates state; runs as a single isolated transaction. Sub-millisecond latency in the same datacenter.
Sliding-window log in Lua. Use a sorted set keyed by client: ZREMRANGEBYSCORE to drop old timestamps, ZCARD to count what remains, ZADD to record the request if it is under the limit, then expire the key after the window. All four operations run in one EVAL.
Envoy global rate limit. Envoy proxies forward limits to a gRPC rate-limit service backed by Redis. Lyft open-sourced their reference implementation.
Edge limits. CDNs (Cloudflare, Fastly) enforce limits at the edge across hundreds of POPs — gossip-based count synchronization or shard-by-key strategies are used.
Eventual consistency. Some systems accept slight over-allowance for performance: each PoP keeps a local count and reconciles asynchronously, allowing brief over-limits in exchange for low latency.
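
A sliding-window check in a single Lua script might look like the sketch below. Several things here are assumptions rather than a reference implementation: the key prefix ratelimit:, the function name allow_request, the unique-member scheme, and the use of the application server's clock (Redis TIME would avoid clock skew between instances). The client argument is assumed to expose redis-py's eval(script, numkeys, *keys_and_args).

```python
import time
import uuid

# Runs atomically inside Redis: trim, count, and conditionally record.
SLIDING_WINDOW_LUA = """
local key = KEYS[1]
local now_ms = tonumber(ARGV[1])
local window_ms = tonumber(ARGV[2])
local limit = tonumber(ARGV[3])
local member = ARGV[4]
redis.call('ZREMRANGEBYSCORE', key, '-inf', now_ms - window_ms)
if redis.call('ZCARD', key) >= limit then
  return 0
end
redis.call('ZADD', key, now_ms, member)
redis.call('PEXPIRE', key, window_ms)
return 1
"""


def allow_request(client, client_id, limit, window_ms, now_ms=None):
    """Return True if the request fits in the client's sliding window."""
    now_ms = int(time.time() * 1000) if now_ms is None else now_ms
    key = f"ratelimit:{client_id}"
    # Members must be unique, or two requests in the same millisecond
    # would collapse into a single sorted-set entry.
    member = f"{now_ms}-{uuid.uuid4().hex}"
    result = client.eval(SLIDING_WINDOW_LUA, 1, key, now_ms, window_ms, limit, member)
    return result == 1


# Usage, assuming the redis-py package and a reachable Redis server:
#   import redis
#   r = redis.Redis(host="localhost", port=6379)
#   if not allow_request(r, "client-42", limit=100, window_ms=60_000):
#       ...  # respond with HTTP 429
```

Checking the count before ZADD means rejected requests are not recorded. Adding first and counting afterwards is the stricter alternative, since a client that keeps hammering then stays blocked.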

HTTP headers and 429

HTTP 429 Too Many Requests. Defined in RFC 6585. Standard response when a client exceeds limits.
Retry-After. Defined in RFC 7231, now superseded by RFC 9110. The value is either a number of seconds ("Retry-After: 60") or an HTTP date. The client should wait at least this long before retrying.
X-RateLimit-Limit. Total requests allowed in the window.
X-RateLimit-Remaining. Requests left.
X-RateLimit-Reset. Unix timestamp of next reset (or seconds remaining; vendors vary).
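IETF draft (draft-ietf-httpapi-ratelimit-headers). Proposes standardized RateLimit fields without the X- prefix; early revisions used RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset. It has not been published as an RFC. Adoption is partial; many APIs still use the older X- form.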
Cache. 429 responses should not be cached by intermediaries unless the API explicitly opts in via Cache-Control.
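
On the server side, the headers above could be assembled by a small helper like this one. It is a sketch under assumptions: the function name rate_limit_response is hypothetical, X-RateLimit-Reset is emitted as a Unix timestamp (one of the two vendor conventions), and 200 stands in for whatever status the real handler would return.

```python
import math


def rate_limit_response(allowed, limit, remaining, reset_epoch, now_epoch):
    """Return (status, headers) for one rate-limit decision.

    reset_epoch and now_epoch are Unix timestamps in seconds.
    """
    headers = {
        "X-RateLimit-Limit": str(limit),
        "X-RateLimit-Remaining": str(max(0, remaining)),
        "X-RateLimit-Reset": str(int(reset_epoch)),
    }
    if allowed:
        return 200, headers
    # Round up so the client never retries a fraction of a second early.
    headers["Retry-After"] = str(max(1, math.ceil(reset_epoch - now_epoch)))
    # Keep intermediaries from caching the rejection.
    headers["Cache-Control"] = "no-store"
    return 429, headers


if __name__ == "__main__":
    print(rate_limit_response(False, 100, 0, 1700000060, 1700000018.5))
```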

Per-IP, per-user, per-API-key

Per-IP. Catches anonymous abuse but punishes corporate NATs and mobile carrier IPs. Cloudflare's Bot Fight Mode and similar tools layer behavior signals.
Per-user. Fair for authenticated traffic. Costs an auth round-trip before the limit check.
Per-API-key. Industry standard for B2B APIs; cleanly aligns with billing tiers.
Per-route. Different endpoints have different costs; login gets a tighter limit than a status check.
Per-organization. Multi-tenant SaaS often limits by org rather than user.
Layered. Production: per-IP at CDN, per-key at the gateway, per-user at the app, per-resource for expensive endpoints (e.g., pdf-generation).
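
One way to express layered limits is to derive several limiter keys from each request and require every one of them to pass. The sketch below only builds the keys; the Request fields, the key formats, and the numbers in LIMITS_PER_MINUTE are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Request:
    ip: str
    route: str
    api_key: Optional[str] = None
    user_id: Optional[str] = None


# Hypothetical per-minute limits for each layer.
LIMITS_PER_MINUTE = {
    "ip": 300,
    "key": 1000,
    "user": 120,
    "route:/login": 5,
}


def limit_keys(req: Request) -> List[Tuple[str, int]]:
    """Return (limiter_key, limit) pairs; the request must pass all of them."""
    keys = [(f"ip:{req.ip}", LIMITS_PER_MINUTE["ip"])]
    if req.api_key:
        keys.append((f"key:{req.api_key}", LIMITS_PER_MINUTE["key"]))
    if req.user_id:
        keys.append((f"user:{req.user_id}", LIMITS_PER_MINUTE["user"]))
    route_limit = LIMITS_PER_MINUTE.get(f"route:{req.route}")
    if route_limit is not None:
        # Scope sensitive routes to the user when known, else to the IP.
        subject = req.user_id or req.ip
        keys.append((f"route:{req.route}:{subject}", route_limit))
    return keys
```

Each returned key would then be checked against whichever algorithm the layer uses, with the request rejected as soon as any layer says no.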

Client-side patterns

Read X-RateLimit-Remaining. Smart clients slow down before hitting the limit, not after.
Honor Retry-After. Always wait the indicated time. Hot retries make the situation worse.
Exponential backoff with jitter. When Retry-After is missing, use exponential backoff with random jitter to avoid synchronized retry storms.
Bulk endpoints. Where APIs offer batch alternatives (e.g., Slack's conversations.history vs N message reads), bulk operations dramatically reduce limit pressure.
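Concurrency caps. Limit in-flight requests client-side, for example with a semaphore or a bounded worker pool. Official SDKs such as Stripe's and OpenAI's build in automatic retries with backoff, but capping concurrency is usually left to the caller.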
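
The Retry-After and backoff rules above can be combined into one delay function. This is a sketch, assuming "full jitter" (a uniform draw between zero and the capped exponential) as the jitter strategy; the function names and the base and cap defaults are illustrative choices, not values from any particular SDK.

```python
import random
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def parse_retry_after(value, now=None):
    """Parse Retry-After (delta-seconds or HTTP date) into seconds, or None."""
    value = value.strip()
    if value.isdigit():
        return float(value)
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # unparseable header: the caller falls back to backoff
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())


def retry_delay(attempt, retry_after=None, base=0.5, cap=30.0, now=None):
    """Seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None:
        seconds = parse_retry_after(retry_after, now)
        if seconds is not None:
            return seconds  # the server's instruction wins
    # Full jitter: spread clients across the whole backoff interval.
    return random.uniform(0.0, min(cap, base * 2 ** attempt))
```

A caller would sleep for retry_delay(attempt, response_headers.get("Retry-After")) after each 429 and give up after a bounded number of attempts.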

Common misconceptions

"Rate limit means block." The standard response is 429 + Retry-After, telling the client when to retry. Outright blocking is a separate enforcement mode (firewall rule, IP ban) used for abusive traffic, not normal rate-limit overflow.
"Fixed window is fine." The boundary doubles allowed bursts, breaking fair-share guarantees. Use sliding window counter for any limit where 2× spike is unacceptable.
"Client-side throttling is enough." Client-side libraries are useful but cannot be trusted; malicious or buggy clients ignore them. Server-side enforcement is mandatory.
"One global rate limit covers everything." Different endpoints have different costs and different abuse vectors. Layered limits (global + per-route + per-resource) are standard.
"Token bucket is always best." Token bucket allows bursts, which is bad when downstream cannot tolerate them. Leaky bucket is preferable when smoothness matters more than responsiveness.
"Redis is the only option." Memcached with CAS, DynamoDB with conditional writes, and even Postgres with row-level locks all work for distributed rate limiting. Redis is most popular due to latency and Lua atomicity.

Frequently asked questions

Token bucket vs leaky bucket — what's the difference?

Token bucket models a bucket that refills at rate r tokens per second up to capacity b. Each request consumes one token; if the bucket is empty, the request is rejected. It allows bursts up to b: a client that has been idle accumulates tokens and can spend them in a single burst. Leaky bucket (in its queue form) models a queue that drains at constant rate r: incoming requests join the queue if there is room and are dropped otherwise. Output is smooth at no more than r requests/sec, with no burst tolerance. Token bucket is easy to implement lazily, since the refill is computed on each request from two stored values (tokens and last_refill) with no background timer; leaky bucket is preferred when downstream systems need a strictly smooth load.

What is the boundary problem with fixed windows?

Fixed window counter: maintain count[client], reset every N seconds. With a 100-req/min limit, a client can send 100 requests in the last second of one window plus 100 in the first second of the next: 200 requests in about 2 seconds, double the per-minute limit and far above the intended average rate. This is acceptable for coarse limits but breaks fair-share guarantees at the boundary. Sliding window counter mitigates the problem by weighting the previous window's count by the fraction of it that still overlaps the current sliding window.

How is sliding window log implemented?

For each client, store a list of timestamps of recent requests. On each new request: drop entries older than now-W; if list size is below limit, append now and accept; otherwise reject. Memory is O(limit) per client — for 10K-req/min limits across millions of users this can be expensive. Sliding window counter approximates this with O(1) memory by tracking the previous and current fixed-window counts and interpolating.

How does distributed rate limiting work in Redis?

Redis is the canonical store because it's fast (sub-ms p99) and atomic. Pattern: each instance forwards rate-limit checks to Redis via a Lua script that atomically reads the current count, checks the limit, and updates state — Lua scripts run as a single isolated transaction on Redis. Token bucket: Lua reads last_refill and tokens, computes new tokens given elapsed time, decrements if >= 1. Sliding window: ZADD timestamp + ZREMRANGEBYSCORE old + ZCARD. Envoy's global rate limit service uses gRPC + Redis; Stripe and Cloudflare run custom variants.
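
The token bucket variant described above might be scripted as follows. Treat it as a sketch under assumptions: the hash fields tokens and last_refill mirror the prose, while the bucket: key prefix, the take_token name, the expiry of twice the refill time, and the use of the caller's clock are illustrative choices. The client argument is assumed to expose redis-py's eval(script, numkeys, *keys_and_args).

```python
import time

# Runs atomically inside Redis: read state, refill, spend, write back.
TOKEN_BUCKET_LUA = """
local key = KEYS[1]
local rate = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local now = tonumber(ARGV[3])
local cost = tonumber(ARGV[4])
local state = redis.call('HMGET', key, 'tokens', 'last_refill')
local tokens = tonumber(state[1]) or capacity
local last_refill = tonumber(state[2]) or now
tokens = math.min(capacity, tokens + math.max(0, now - last_refill) * rate)
local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end
redis.call('HSET', key, 'tokens', tostring(tokens), 'last_refill', tostring(now))
redis.call('EXPIRE', key, math.ceil(capacity / rate) * 2)
return allowed
"""


def take_token(client, client_id, rate, capacity, cost=1, now=None):
    """Return True if `cost` tokens were available for this client."""
    now = time.time() if now is None else now
    key = f"bucket:{client_id}"
    return client.eval(TOKEN_BUCKET_LUA, 1, key, rate, capacity, now, cost) == 1
```

A missing key is treated as a full bucket, so expiring idle keys is safe: by the time a key expires, the bucket would have refilled completely anyway.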

When is per-IP vs per-user vs per-API-key appropriate?

Per-IP catches anonymous abuse (login brute force, signup spam) but penalizes shared NATs and corporate networks where many users share one IP. Per-user is fairest for authenticated traffic but costs an auth round-trip. Per-API-key is the standard for paid APIs (Stripe, OpenAI, AWS) and aligns rate limits with billing tier. Production stacks layer all three: per-IP at the CDN edge, per-API-key at the gateway, per-user at the application for resource-specific actions.

What's the HTTP standard for rate limit headers?

The de facto standard headers are X-RateLimit-Limit (cap per window), X-RateLimit-Remaining (calls left), and X-RateLimit-Reset (Unix timestamp of the next reset, or seconds remaining; vendors vary). On 429 Too Many Requests (RFC 6585), the Retry-After header (RFC 7231, now RFC 9110) tells the client how long to wait, either in seconds or as an HTTP date. The rate limit headers themselves have no published RFC yet. The IETF draft draft-ietf-httpapi-ratelimit-headers proposes RateLimit fields without the X- prefix, but adoption is partial; many APIs still use the X- variants for backward compatibility.