Distributed Patterns
Circuit Breaker
Wrap an unreliable dependency in a state machine — fail fast, recover safely
A circuit breaker wraps unreliable remote calls. It trips to OPEN once a failure threshold is crossed, returns immediate errors for about 30 s, then moves to HALF-OPEN and lets a probe request through. The pattern was made famous by Hystrix and exists to prevent cascading failures.
- States: CLOSED · OPEN · HALF-OPEN
- Trip threshold (Hystrix): 50% failure over 20 reqs
- Half-open timeout: ~30 s typical
- Probe count in HALF-OPEN: 1-5
- Overhead per call: ~100 ns counter increment
- Famous implementations: Hystrix, Resilience4j, Polly, Envoy
How a circuit breaker works
An electrical circuit breaker trips when the current exceeds a safe limit, cutting power before the wire melts. A software circuit breaker does the same for remote calls — when failures exceed a safe rate, it cuts the connection before the failure cascade burns down upstream services. The pattern was popularized by Michael Nygard's book Release It! in 2007 and made famous by Netflix's Hystrix library in 2012.
The breaker sits between your code and the remote dependency. Every call goes through it. The breaker is a state machine with three states:
- CLOSED. The normal state. Requests flow through to the dependency. The breaker counts failures in a rolling window. When the failure rate (and total request count) cross a threshold, the breaker trips to OPEN.
- OPEN. Every call is rejected immediately — either by throwing an exception or returning a fallback. No actual network call is made. The breaker holds OPEN for a configured cooldown (commonly 30 seconds). At cooldown end, transition to HALF-OPEN.
- HALF-OPEN. One (or a small number of) probe requests are allowed through. If the probe succeeds, the breaker closes and normal traffic resumes. If it fails, back to OPEN for another cooldown.
The state machine is small but the failure-prevention payoff is enormous. A downstream service that takes 30 seconds to time out can lock up every thread in the caller in seconds — without a breaker, the caller becomes unresponsive too. With a breaker, after a handful of failures the calls return immediately, threads stay free, and the caller stays alive to handle traffic that doesn't depend on the broken downstream.
The state machine in detail
The transitions:
            failure threshold exceeded
  CLOSED ──────────────────────────────▶ OPEN
    ▲                                      │
    │ probe succeeded                      │ cooldown elapsed
    │                                      ▼
    └──────────── HALF-OPEN ◀──────────────┘
                      │
                      └─── probe failed ───▶ OPEN
Counting failures correctly is the only subtle part. A naive counter that just counts errors-ever is wrong — it never resets and stays tripped forever after one bad period. Use a rolling window: count the last N requests (typically 10-100), or count requests in the last T seconds. When the failure rate exceeds the threshold AND the total request count exceeds a minimum (so a single failed request doesn't flip the breaker), trip.
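The count-based variant of that rolling window can be sketched in a few lines. This is a minimal illustration, not any library's implementation: the class name `RollingFailureWindow` and its parameters are assumptions made for the example, and it trips at "greater than or equal to" the threshold.

```python
from collections import deque


class RollingFailureWindow:
    """Count-based sliding window over the most recent call outcomes."""

    def __init__(self, size=20, min_calls=10, failure_rate_threshold=0.5):
        self.outcomes = deque(maxlen=size)  # True = failure, False = success
        self.min_calls = min_calls
        self.failure_rate_threshold = failure_rate_threshold

    def record(self, failed):
        # Appending to a full deque silently evicts the oldest outcome.
        self.outcomes.append(bool(failed))

    def failure_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_trip(self):
        # Both conditions must hold: enough samples AND too many failures.
        if len(self.outcomes) < self.min_calls:
            return False
        return self.failure_rate() >= self.failure_rate_threshold


window = RollingFailureWindow(size=20, min_calls=10, failure_rate_threshold=0.5)
window.record(True)
tripped_after_one = window.should_trip()   # one failure is below the request floor
for _ in range(9):
    window.record(True)
tripped_after_ten = window.should_trip()   # 10 of 10 recorded calls failed
for _ in range(20):
    window.record(False)                   # old failures age out of the window
recovered = not window.should_trip()
```

Because the deque has a fixed maximum length, old outcomes fall out automatically, which is what lets the breaker forget a bad period instead of staying tripped.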
HALF-OPEN is the clever bit. When the cooldown expires, you can't just resume full traffic — if the downstream is still broken, you've just slammed it with another wave. Instead, allow exactly one probe through. If the probe succeeds, close the breaker; if not, open again with a fresh timer. This handles the partial-recovery case where the downstream returns to health and you smoothly re-engage.
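One way to enforce the probe limit is a small gate that admits a fixed number of calls while the breaker is HALF-OPEN. The sketch below is an assumption-laden illustration (the `ProbeGate` name and its methods are hypothetical), showing only the admission logic, not a full breaker.

```python
import threading


class ProbeGate:
    """Admits at most `max_probes` calls while the breaker is HALF-OPEN."""

    def __init__(self, max_probes=1):
        self.max_probes = max_probes
        self._admitted = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        # The lock makes the check-and-increment atomic across threads.
        with self._lock:
            if self._admitted >= self.max_probes:
                return False      # caller should be rejected as if OPEN
            self._admitted += 1
            return True

    def reset(self):
        # Call when the breaker leaves HALF-OPEN (to CLOSED or back to OPEN).
        with self._lock:
            self._admitted = 0


gate = ProbeGate(max_probes=1)
first = gate.try_acquire()        # the single probe is admitted
second = gate.try_acquire()       # everyone else is turned away
gate.reset()
after_reset = gate.try_acquire()  # next HALF-OPEN period gets a fresh probe
```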
Failure modes detected
What counts as a "failure"? Any of the following:
- Exceptions thrown by the call. Connection refused, IOException, HTTP 5xx.
- Timeouts. A call that exceeds its deadline is a failure regardless of what it eventually returns.
- Slow calls (advanced). Calls that succeed but take longer than a slow-call threshold (e.g. >P99 latency). Resilience4j supports this as a separate trigger through its slow-call rate and duration thresholds.
- Specific error codes. HTTP 429 (rate limited) and 503 (service unavailable) often warrant tripping; 404 (not found) typically does not.
The decision of what counts as a failure is critical. Counting 404s as failures will trip the breaker every time a missing key is requested. Counting only network errors but ignoring timeouts means a hung downstream will still lock up your callers. Production breakers usually expose a predicate so the application can decide which exceptions/responses count.
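Such a predicate might look like the following. It is a sketch under assumptions: the function name `is_breaker_failure` and the exact choice of status codes and exception types are illustrative, and a real service would tune them to its own dependency.

```python
def is_breaker_failure(status=None, exc=None):
    """Return True if this outcome should count against the breaker."""
    if exc is not None:
        # Timeouts and connection-level errors suggest an unhealthy dependency;
        # other exceptions (e.g. bad input) are treated as the caller's problem.
        return isinstance(exc, (TimeoutError, ConnectionError))
    if status is None:
        return False
    # 429 and 5xx count; 404 and other 4xx responses do not.
    return status == 429 or status >= 500
```

With this in place, a missing key (404) leaves the breaker untouched, while a hung or refusing downstream is counted.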
When to use a circuit breaker
- Synchronous calls to external services. Any HTTP, gRPC, or RPC call that crosses a process or network boundary.
- Database calls if they can stall. A locked-up replica, a long-running query, a DNS resolution timeout — all can lock up callers if not protected.
- Cache lookups that fall back to origin. Wrap the origin call; if origin is failing, keep serving stale-cache hits without retrying origin every request.
- External API integrations. Third-party APIs go down — wrap them so your service doesn't go down too.
Avoid circuit breakers around fast in-process calls (the overhead exceeds the benefit), around calls that should always retry (use retry with backoff instead), and as the only resilience pattern — combine with timeouts, bulkheads, and rate limits for a real defence in depth.
Circuit breaker vs related patterns
| Pattern | What it does | When to use |
|---|---|---|
| Circuit Breaker | State machine that stops calls when failure rate is high | Cascading-failure protection for unreliable remote calls |
| Timeout | Bounds the wait time per call | Always — pair with everything else |
| Retry | Re-attempt failed calls with backoff | Transient errors; combine with breaker |
| Bulkhead | Bounded thread pool per dependency | Prevent one slow dependency exhausting all threads |
| Rate Limiting | Cap requests per second | Protect downstream proactively |
| Backpressure | Slow consumer signals upstream | Streaming; dynamic flow control |
| Fallback | Default value when primary fails | Degraded but still useful response |
Production systems typically stack four of these around every remote call: bulkhead (bounded threads) + circuit breaker (fail fast on systemic failure) + timeout (bound per-call wait) + retry (transient errors only, inside the breaker, with backoff and jitter). Hystrix bundles the bulkhead, breaker, timeout, and fallback into one command abstraction but leaves retry to the caller; Resilience4j keeps each pattern as a composable module.
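The retry layer of that stack could be sketched as below. The helper name `retry_with_backoff`, its defaults, and the `flaky` demo function are all hypothetical; the `sleep` parameter is injectable only so the example can run without actually waiting.

```python
import random
import time


def retry_with_backoff(func, attempts=3, base_delay=0.1, max_delay=2.0,
                       retry_on=(TimeoutError, ConnectionError),
                       sleep=time.sleep):
    """Call func, retrying transient errors with exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise                      # out of attempts: surface the error
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))  # full jitter: random delay up to the cap


# Demo with a hypothetical flaky call that fails twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient")
    return "ok"


delays = []
result = retry_with_backoff(flaky, sleep=delays.append)  # record delays instead of sleeping
```

With a breaker that exposes a `call(func)` method, the ordering described above would be `breaker.call(lambda: retry_with_backoff(flaky))`, so the breaker records one outcome per logical call rather than one per attempt.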
Pseudo-code
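class CircuitBreaker:
    # Three states; the breaker starts CLOSED.
    CLOSED, OPEN, HALF_OPEN = "CLOSED", "OPEN", "HALF_OPEN"

    OPEN_TIMEOUT_MS = 30000
    FAILURE_THRESHOLD = 5
    SUCCESS_THRESHOLD_HALF_OPEN = 1

    def __init__(self):
        self.state = self.CLOSED
        self.failures = 0
        self.successes = 0
        self.last_failure_time = 0

    def call(self, func):
        # now() is assumed to return the current time in milliseconds.
        if self.state == self.OPEN:
            if now() - self.last_failure_time > self.OPEN_TIMEOUT_MS:
                self.state = self.HALF_OPEN      # cooldown elapsed: let a probe through
            else:
                raise CircuitBreakerOpenError()  # fail fast, no remote call
        try:
            result = func()
        except Exception:
            self.on_failure()
            raise
        self.on_success()
        return result

    def on_success(self):
        if self.state == self.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.SUCCESS_THRESHOLD_HALF_OPEN:
                self.state = self.CLOSED
                self.failures = 0
                self.successes = 0
        else:
            self.failures = 0                    # counts consecutive failures only

    def on_failure(self):
        self.failures += 1
        self.last_failure_time = now()
        # Any failed probe reopens the breaker; otherwise trip at the threshold.
        if self.state == self.HALF_OPEN or self.failures >= self.FAILURE_THRESHOLD:
            self.state = self.OPEN
            self.successes = 0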
Java implementation (Resilience4j)
import io.github.resilience4j.circuitbreaker.*;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig.SlidingWindowType;
import java.time.Duration;
import java.util.function.Supplier;

CircuitBreakerConfig config = CircuitBreakerConfig.custom()
    .failureRateThreshold(50)                  // trip at 50% failures
    .slowCallRateThreshold(80)                 // trip at 80% slow calls
    .slowCallDurationThreshold(Duration.ofSeconds(2))
    .waitDurationInOpenState(Duration.ofSeconds(30))
    .slidingWindowType(SlidingWindowType.COUNT_BASED)
    .slidingWindowSize(20)                     // last 20 calls
    .minimumNumberOfCalls(10)                  // need 10+ before tripping
    .permittedNumberOfCallsInHalfOpenState(3)
    .build();

CircuitBreaker breaker = CircuitBreaker.of("paymentService", config);

// Decorate any call.
Supplier<Payment> decorated = CircuitBreaker
    .decorateSupplier(breaker, () -> paymentApi.charge(invoice));

// Inside a method that returns Payment:
try {
    Payment p = decorated.get();
} catch (CallNotPermittedException e) {
    // Breaker is OPEN: return fallback or fail fast.
    return Payment.failed("payment service unavailable");
}

// Observe state transitions.
breaker.getEventPublisher().onStateTransition(event ->
    log.warn("CircuitBreaker {} {} → {}",
        event.getCircuitBreakerName(),
        event.getStateTransition().getFromState(),
        event.getStateTransition().getToState()));
Python implementation
import time
from enum import Enum


class State(Enum):
    CLOSED = 1
    OPEN = 2
    HALF_OPEN = 3


class CircuitBreakerOpenError(Exception):
    """Raised when a call is rejected because the breaker is OPEN."""


class CircuitBreaker:
    # Simplified: trips on consecutive failures rather than a rolling-window rate.
    # Not thread-safe; guard the state with a lock in concurrent code.
    def __init__(self, failure_threshold=5, recovery_timeout=30, expected_exception=Exception):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout      # seconds to stay OPEN
        self.expected_exception = expected_exception  # exception type(s) that count as failures
        self.failure_count = 0
        self.last_failure_time = None
        self.state = State.CLOSED

    def call(self, func, *args, **kwargs):
        if self.state == State.OPEN:
            # time.monotonic() is unaffected by wall-clock adjustments.
            if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                self.state = State.HALF_OPEN
            else:
                raise CircuitBreakerOpenError()
        try:
            result = func(*args, **kwargs)
        except self.expected_exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self.failure_count = 0
        if self.state == State.HALF_OPEN:
            self.state = State.CLOSED

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.monotonic()
        # A failed probe reopens immediately; otherwise trip at the threshold.
        if self.state == State.HALF_OPEN or self.failure_count >= self.failure_threshold:
            self.state = State.OPEN


# Route calls to the dependency through the breaker.
# payment_api stands for the application's own client object.
breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def call_payment_api(invoice):
    return breaker.call(payment_api.charge, invoice)
Common pitfalls
- Single-counter breakers. Counting "errors ever" instead of "errors in a rolling window" makes the breaker trip once and never reset. Use a sliding count- or time-window.
- Tripping without a minimum-request floor. If the breaker can trip on a single failed call, transient blips will flap the state continuously. Require 10-20 requests in the window before tripping is allowed.
- Counting expected errors as failures. A 404 from a "user not found" API isn't a system failure. Filter exceptions/status codes before counting.
- No probe limit in HALF-OPEN. If you let unlimited traffic through when transitioning out of OPEN, the recovering service gets slammed. Allow 1-5 probes only.
- Forgetting metrics. A silent circuit breaker is a hidden problem. Emit metrics for state transitions, current state, and failure rate so operators see when a dependency degrades.
- One breaker for all downstreams. Different services have different failure modes. Use one breaker per dependency so an outage on service A doesn't trip the breaker for service B.
- Breaker without a timeout. If individual calls can hang for minutes, the breaker never sees enough requests to trip. Wrap calls in a timeout so a hung call counts as a failure quickly.
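The "one breaker per dependency" advice is often implemented with a small registry keyed by dependency name. The sketch below is illustrative only: `BreakerRegistry` is a hypothetical name, and a plain dict stands in for a real breaker object so the example stays self-contained.

```python
import threading


class BreakerRegistry:
    """Hands out one breaker per dependency name, created lazily."""

    def __init__(self, factory):
        self._factory = factory          # callable: name -> breaker object
        self._breakers = {}
        self._lock = threading.Lock()

    def get(self, name):
        # The lock ensures two threads never create two breakers for one name.
        with self._lock:
            if name not in self._breakers:
                self._breakers[name] = self._factory(name)
            return self._breakers[name]


# Stand-in breaker for the demo; a real one would hold state and thresholds.
registry = BreakerRegistry(factory=lambda name: {"name": name, "state": "CLOSED"})
payments = registry.get("payments")
inventory = registry.get("inventory")
payments["state"] = "OPEN"               # an outage on payments...
inventory_state = inventory["state"]     # ...leaves the inventory breaker alone
```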
Performance and impact
The per-call overhead of a breaker is tiny: Resilience4j adds about 100 nanoseconds to a remote call that itself costs milliseconds. The benefit is dramatic. Consider an upstream service that sends 1000 requests per second to a downstream that takes 30 seconds to time out under failure. Without a breaker, every failing request occupies a thread for 30 seconds; at 1000 RPS, the upstream needs 30,000 concurrent threads to stay responsive, which it does not have. With a breaker that trips after 5 failures (and a per-call timeout short enough for those failures to register quickly), the roughly 30,000 requests that arrive during the 30-second cooldown are rejected in under a microsecond each. Threads stay free and the upstream stays alive.
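The thread arithmetic here is Little's law (concurrency equals arrival rate times the time each request is held). A tiny worked version, with `threads_needed` as a hypothetical helper name:

```python
def threads_needed(requests_per_second, seconds_per_call):
    """Little's law: concurrency = arrival rate x time each request is held."""
    return requests_per_second * seconds_per_call


without_breaker = threads_needed(1000, 30)   # every call waits out the 30 s timeout
fast_fail = threads_needed(1000, 1e-6)       # each call rejected in about a microsecond
```

The first figure is the 30,000 threads quoted above; the second is a small fraction of a single thread, which is why fast-failing keeps the caller responsive.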
The HALF-OPEN single-probe behavior is what makes recovery smooth. When the downstream recovers, the next probe request succeeds, the breaker closes, and normal traffic resumes. Compare that with "just try again after 30 s", which would slam the recovering service with all waiting traffic the moment it came back, often killing it again.
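Typical production tunings: window of 20 requests, 50% failure threshold, 30-second cooldown, 3-5 half-open probes. The 20-request volume and 50% threshold match Netflix's original Hystrix defaults; Hystrix itself defaulted to a 5-second sleep window and a single probe, so the longer cooldown and extra probes are common adjustments rather than Hystrix values. Together they remain a reasonable starting point a decade later.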
Frequently asked questions
What is a cascading failure and how does a circuit breaker prevent it?
A cascading failure happens when one service's slowdown propagates upstream. Service A calls service B; B is slow, so each A-request occupies a thread waiting for B for 30 seconds; A's threads exhaust; A itself becomes unresponsive; A's callers fail too. A circuit breaker around the A-to-B call short-circuits this — after a few B failures, it returns errors immediately for 30 seconds without consuming threads or hitting B. The fire is contained to one cell.
What are the three states and how do you transition between them?
CLOSED: requests pass through; the breaker counts failures in a rolling window. When failure rate exceeds threshold (50% over the last 20 calls is a common default), trip to OPEN. OPEN: every call short-circuits with an exception or fallback. After a cooldown timeout (30 s typical), transition to HALF-OPEN. HALF-OPEN: one probe call is allowed; if it succeeds, return to CLOSED; if it fails, go back to OPEN for another 30 s.
What thresholds should I use?
Hystrix's defaults are well-tested: a minimum of 20 requests in the window, trip at 50% failure rate, 5-second sleep window. Resilience4j also defaults to a 50% failure rate but is more conservative elsewhere (a 100-call sliding window and a 60-second wait in OPEN). The right thresholds depend on the downstream's normal error rate: if the downstream normally returns 5% errors, set the threshold at 30-40% to avoid false trips. Always require a minimum request count (10-20) before the breaker can trip; otherwise a single failed call on a quiet system can flap the breaker.
What should the breaker return when OPEN?
Three options. Throw an exception (CallNotPermittedException in Resilience4j, CircuitBreakerOpenError in the sketches above) and let the caller decide; this is the simplest option, and it is what Hystrix does when no fallback is defined. Return a fallback value: a default, a cached result, or a degraded response. Return a fail-fast signal to the user: "service unavailable, try again in 30 seconds" is honest and lets the user move on. Pick fallback only when a stale or default response is actually useful; do not fabricate data.
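The cached-result option can be sketched as follows. All names here (`BreakerOpen`, `get_with_fallback`, the `healthy` and `rejected` demo functions) are assumptions for illustration; `BreakerOpen` stands in for whatever rejection error the breaker in use raises.

```python
class BreakerOpen(Exception):
    """Hypothetical stand-in for the breaker's rejection error."""


def get_with_fallback(key, fetch, cache):
    """Try the protected fetch; on rejection serve a cached value if one exists."""
    try:
        value = fetch(key)
    except BreakerOpen:
        if key in cache:
            return cache[key], "stale"
        raise                      # nothing useful to serve: fail fast instead
    cache[key] = value             # remember the last good value
    return value, "fresh"


cache = {}


def healthy(key):
    return key.upper()


def rejected(key):
    raise BreakerOpen()


fresh = get_with_fallback("a", healthy, cache)
stale = get_with_fallback("a", rejected, cache)
```

Returning the "stale" marker alongside the value lets callers decide whether a degraded answer is acceptable, and a key with no cached value still fails fast rather than producing invented data.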
How does HALF-OPEN avoid hammering a recovering service?
In HALF-OPEN, only one request (or a small configured number, e.g. 5) is allowed through. The breaker tracks the admitted probes atomically, so every other request is still rejected as if the breaker were OPEN. If the probe succeeds, the breaker transitions to CLOSED and all traffic resumes. If the probe fails, the breaker returns to OPEN and starts a fresh timeout. This avoids the thundering herd problem where 10,000 simultaneous requests slam a service the moment it comes back online.
Should every remote call have a circuit breaker?
Yes for inter-service calls in a microservices architecture. The breaker pairs with a timeout and a bulkhead (bounded thread pool per dependency). Skip it for very-low-traffic calls where the breaker's minimum-request threshold can never be met. Skip it inside the same process — call-stack errors propagate fine without a breaker. Always pair with monitoring — silently failing breakers hide real problems.
What is the difference between a circuit breaker and a retry?
Retry attempts a failed call again, usually with exponential backoff — useful for transient errors. A circuit breaker stops calling entirely when a downstream is unhealthy — useful when retries would just amplify load on a broken service. They compose well: retry inside the breaker (re-attempt a flaky call), breaker outside the retry (give up entirely if even retries keep failing). Retry without a breaker is how you turn one failure into a thundering herd.