Event-Driven Architecture

Services publish events to a bus; subscribers fan out. This is the integration model behind many modern microservices platforms.

Producers emit events; consumers subscribe. The bus decouples them in space and time. One PaymentCompleted event triggers email, analytics, and fraud checks in parallel — none of them knows the others exist.

Coupling: Loose (bus decouples producers/consumers)
Consistency: Eventual (10-500 ms typical lag)
Common buses: Kafka, RabbitMQ, EventBridge, NATS
Uber scale: ~5T messages/day, 4000+ services
Throughput: 1M msg/sec/broker (Kafka)
Used in: Uber, Netflix, LinkedIn, Airbnb

Why event-driven is the default at scale

Imagine a synchronous, point-to-point ordering pipeline. The order service POSTs to the email service. Then POSTs to analytics. Then POSTs to the fraud service. Then POSTs to the warehouse. Now product wants you to add a recommendation-engine update — sixth POST. And a tax-calculation update — seventh. Each addition is a code change to the order service, a deploy, and a new failure mode (what if the seventh POST times out?).

Inverting the relationship fixes this. The order service publishes one event — OrderPlaced — and goes back to handling the next request. Email, analytics, fraud, warehouse, recommendations, tax: each independently subscribes and reacts on its own timeline. New consumer? Subscribe to the existing event. Nothing changes on the producer.

This is the heart of event-driven architecture: producers don't know consumers. Consumers don't know producers. They're connected by a durable, replayable bus. The bus does the buffering, the fan-out, and the temporal decoupling — three things that would otherwise be the producer's problem.
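
The inversion described above can be sketched with a toy in-memory bus. This is only an illustration: the class name InMemoryBus and the handlers are hypothetical, delivery here is synchronous, and nothing is persisted, so it shows the fan-out and the producer's ignorance of its consumers but not the durability or buffering a real broker provides.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class InMemoryBus:
    """Toy stand-in for a broker: maps a topic name to its subscribers."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, event: dict) -> int:
        """Deliver the event to every subscriber; return how many received it."""
        handlers = self._subscribers.get(topic, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


bus = InMemoryBus()
received: List[str] = []

bus.subscribe("orders.OrderPlaced", lambda e: received.append(f"email:{e['order_id']}"))
bus.subscribe("orders.OrderPlaced", lambda e: received.append(f"fraud:{e['order_id']}"))

# The producer publishes once and knows nothing about who is listening.
delivered = bus.publish("orders.OrderPlaced", {"order_id": "ord_8h2a"})

# Adding a consumer later requires no change to the producer code above.
bus.subscribe("orders.OrderPlaced", lambda e: received.append(f"tax:{e['order_id']}"))
delivered_after = bus.publish("orders.OrderPlaced", {"order_id": "ord_9k3b"})
```

The second publish reaches three handlers even though the publishing call is unchanged, which is the property the paragraph describes.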

Anatomy of an EDA system

Four ingredients are present in every event-driven system:

Events. Immutable past-tense facts: OrderPlaced, PaymentSucceeded, DriverArrived. Each has a stable schema (Avro, Protobuf, JSON Schema) versioned over time.
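Producers. Services that emit events when business state changes. Producers record the event as part of, or right alongside, their local transaction (the outbox pattern is a common way to do this), so a state change is never committed without its event.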
Bus. The middleman: Kafka, RabbitMQ, EventBridge, Kinesis, NATS, Pub/Sub. Persists events, partitions for ordering, broadcasts to subscribers.
Consumers. Services that subscribe to event types and process them. Each consumer tracks its own offset / acknowledgment.

A typical e-commerce stack runs 30-100 event types, each with 1-10 producers (yes, multiple services can produce the same event type) and 1-20 consumers. The bus carries everything.

A worked example: order placement

An e-commerce site receives a checkout request. The order service does its local work — validates the cart, creates an Order row, debits inventory tokens — and emits a single event:

topic: orders.OrderPlaced
{
  "order_id": "ord_8h2a",
  "user_id":  "usr_91xc",
  "total_usd": 89.50,
  "items": [...],
  "placed_at": "2026-05-26T14:23:01Z"
}

Six independent services subscribe to this topic:

Email service. Renders "Order confirmation" and sends via SES.
Analytics. Writes a row to the warehouse for revenue dashboards.
Fraud detection. Scores the order against rules + ML; flags if suspicious.
Warehouse / fulfillment. Picks the order for shipping.
Recommendation engine. Updates the user's "recently purchased" signal.
Tax service. Records taxable revenue for nexus calculations.

Each consumer runs at its own pace. The email service might take 200 ms; the warehouse picks at the next 5-minute batch; analytics writes a row immediately; recommendations rebuild a user feature hourly. The order service doesn't wait for any of them. If the email service is down for an hour, emails queue in Kafka and replay when it comes back — no orders lost.
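
A rough way to picture "each consumer runs at its own pace" is an append-only log with one read offset per consumer. The sketch below is a simplified model loosely inspired by Kafka's offset idea, not Kafka's API; TopicLog and its methods are hypothetical names, and a real consumer would normally commit its offset only after processing succeeds.

```python
from typing import Dict, List


class TopicLog:
    """Append-only list of events plus one read offset per consumer."""

    def __init__(self) -> None:
        self._events: List[dict] = []
        self._offsets: Dict[str, int] = {}

    def append(self, event: dict) -> None:
        self._events.append(event)

    def poll(self, consumer: str, max_events: int = 100) -> List[dict]:
        """Return unread events for this consumer and advance only its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._events[start:start + max_events]
        self._offsets[consumer] = start + len(batch)
        return batch

    def lag(self, consumer: str) -> int:
        """Number of events this consumer has not read yet."""
        return len(self._events) - self._offsets.get(consumer, 0)


log = TopicLog()
log.append({"order_id": "ord_1"})
log.append({"order_id": "ord_2"})

# Analytics keeps up; the email service is down and polls nothing.
analytics_first = log.poll("analytics")

log.append({"order_id": "ord_3"})
email_lag_while_down = log.lag("email")

# When the email service recovers it reads everything it missed, in order.
email_catch_up = log.poll("email")
analytics_second = log.poll("analytics")
```

Because offsets are tracked per consumer, the email service's outage never affects what analytics sees, and nothing is dropped while it is away.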

EDA vs synchronous RPC

Dimension	Event-driven (bus)	Synchronous RPC (REST / gRPC)	Point-to-point queue	Shared database
Producer-consumer coupling	None (bus decouples)	Tight (addressed by name)	Tight (per queue)	Tight (schema)
Add a new consumer	Free (subscribe)	Producer code change	Producer code change	Schema change
Temporal coupling	None (async)	Tight (caller blocks)	None (async)	Tight (locks)
Replay history	Yes (Kafka offsets)	No	No (queues drain)	Snapshot only
Read-your-writes	No (eventual)	Yes	No	Yes (same DB)
Throughput ceiling	1M+ msg/sec/broker (Kafka)	~10k req/sec/service	~50k msg/sec/queue	~10k tx/sec/DB
Failure isolation	Good (bus absorbs)	Bad (caller fails)	Medium	Poor (shared deadlocks)
Debug complexity	High (distributed)	Low (call graph)	Medium	Medium

Picking the right bus

Kafka. Append-only log, partitioned by key, durable for days-to-forever. Multi-consumer with independent offsets. 1M+ msg/sec/broker. Best for: high-throughput streaming, event sourcing, multi-consumer fan-out. Operationally heavy (ZooKeeper or KRaft, partition rebalancing).
RabbitMQ. AMQP queues with rich routing (direct, topic, headers, fanout exchanges). 30-50k msg/sec/node. Best for: traditional pub-sub, RPC over a bus, complex routing. Easier to operate than Kafka.
AWS EventBridge. Managed serverless, deep AWS integration (Lambda, Step Functions, SaaS-event ingestion). Default 10k events/sec/account. Best for: cross-account event routing, low ops overhead.
NATS. Lightweight, sub-10 ms p99, small footprint. JetStream adds Kafka-like durability. Best for: in-cluster pub-sub at single-digit ms latency.
Google Pub/Sub. Managed, autoscales, similar to Kafka semantics. Best when you're already on GCP.
Redis Streams. Tiny ops footprint, ~50k msg/sec/node, consumer groups. Best for: simple pub-sub at small scale.

The honest answer is "use Kafka if you're already running streaming infrastructure or expect 100k+ msg/sec; use a managed broker (EventBridge / Pub/Sub) otherwise." RabbitMQ is the right answer for organizations with deep AMQP investment.

Designing events well

Bad event taxonomy haunts you for years. Six rules:

Past tense. OrderPlaced, not PlaceOrder. Events are facts about the past. Commands (PlaceOrder) are different — they're requests for future action.
Business-relevant. CustomerSignedUp, not RowInsertedIntoUsersTable. Events should make sense to a product manager, not just a database admin.
Versioned schema. Use Avro / Protobuf with a schema registry. JSON-Schema with explicit version field is acceptable for simpler shops.
Self-contained. Include enough context for downstream consumers without a round-trip back to the producer. OrderPlaced carries items, total, user — not just an order_id.
Idempotency keys. Every event has a stable, unique event_id so consumers can dedupe. At-least-once delivery is the default; consumers must dedupe.
Aggregate-keyed partitioning. Partition by entity ID (order_id, user_id) so all events for one entity land on one Kafka partition and stay in order.
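
The last two rules can be illustrated together. The following is a minimal sketch under stated assumptions: partition_for and IdempotentConsumer are hypothetical names, the hash is MD5 purely because it is in the standard library and stable across runs (Kafka's default partitioner for keyed records uses murmur2), and the set of seen IDs lives in memory, whereas a production consumer would need to persist it.

```python
import hashlib
from typing import Set


def partition_for(key: str, num_partitions: int) -> int:
    """Map an aggregate ID to a partition with a hash that is stable across runs.

    Python's built-in hash() is randomized per process for strings, so it is
    not suitable here; MD5 is used only as a convenient stable hash.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions


class IdempotentConsumer:
    """Applies each event_id at most once, even if the bus redelivers it."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()
        self.emails_sent = 0

    def handle(self, event: dict) -> bool:
        if event["event_id"] in self._seen:
            return False  # duplicate delivery: skip the side effect
        self._seen.add(event["event_id"])
        self.emails_sent += 1  # stands in for the real side effect
        return True


event = {"event_id": "evt_001", "type": "OrderPlaced", "order_id": "ord_8h2a"}
consumer = IdempotentConsumer()
first = consumer.handle(event)
second = consumer.handle(event)  # simulated at-least-once redelivery

# The same order_id always maps to the same partition, preserving its order.
same_partition = partition_for("ord_8h2a", 12) == partition_for("ord_8h2a", 12)
```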

The eventual-consistency tax

When the order service publishes OrderPlaced, the email service might process it 200 ms later. During that 200 ms, the system is inconsistent: the order exists in the order DB, but no email has been sent and no analytics row has been written. The customer might refresh the UI and not see their order in "recent activity."

Three coping strategies:

Read-your-writes via the producer. The UI queries the order service (not the analytics warehouse) immediately after placing the order. Same service, same DB, strongly consistent.
Wait for an acknowledgement event. The producer waits for a downstream OrderConfirmed event before showing success. Slower but consistent UX.
Show the eventual state. "Order received — processing" UI explicitly tells the user there's lag. Acceptable for non-critical paths.
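
The first strategy can be made concrete with a small simulation. Everything here is hypothetical (OrderService, RecentActivityView, and the in-process event queue standing in for the bus); it is meant only to show why the producer can answer during the lag window while a consumer-side read model cannot.

```python
from typing import Dict, List, Optional


class OrderService:
    """Producer: owns the order store and queues events for later delivery."""

    def __init__(self) -> None:
        self._orders: Dict[str, dict] = {}
        self.pending_events: List[dict] = []

    def place_order(self, order_id: str, total_usd: float) -> None:
        self._orders[order_id] = {"order_id": order_id, "total_usd": total_usd}
        self.pending_events.append({"type": "OrderPlaced", "order_id": order_id})

    def get_order(self, order_id: str) -> Optional[dict]:
        return self._orders.get(order_id)


class RecentActivityView:
    """Consumer-side read model; only knows what it has processed so far."""

    def __init__(self) -> None:
        self.order_ids: List[str] = []

    def apply(self, event: dict) -> None:
        self.order_ids.append(event["order_id"])


orders = OrderService()
activity = RecentActivityView()

orders.place_order("ord_8h2a", 89.50)

# During the lag window the producer can answer, the read model cannot.
visible_from_producer = orders.get_order("ord_8h2a") is not None
visible_in_view_before = "ord_8h2a" in activity.order_ids

# Later the consumer drains the queued events and the two converge.
while orders.pending_events:
    activity.apply(orders.pending_events.pop(0))
visible_in_view_after = "ord_8h2a" in activity.order_ids
```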

Real-world EDA deployments

Uber. ~4,000 microservices, ~5 trillion Kafka messages/day. Each domain (rides, eats, freight) owns topics; inter-domain integration is event-first.
LinkedIn. Invented Kafka. ~7 trillion messages/day across activity events, member-feed updates, ML feature streams.
Netflix. Kinesis + EventBridge + custom. Every view, play, pause, search is an event for personalization and operational telemetry.
Airbnb. Kafka-driven event sourcing for the booking lifecycle (search → request → accept → check-in → review). 50+ event types across the funnel.
Stripe. ~100M events/day. Webhooks are EDA exposed to customers — Stripe events fan out to your endpoint.
Twitter / X. The home timeline is a giant fan-out problem solved with event-driven indexing.

Common misconceptions and traps

"EDA replaces synchronous APIs." No. EDA is for asynchronous integration. Synchronous reads (fetch my order details) still need REST/gRPC. The two coexist.
"More events = more decoupling." Fine-grained events couple consumers to producer internals (every column change becomes a topic). Coarse-grained business events are usually better.
"Kafka guarantees ordering globally." Only within a partition. Cross-partition ordering is not preserved. Partition by aggregate ID.
"At-least-once is fine; consumers will sort it out." Only if consumers actually implement idempotency. Without it, retries double-execute.
"We'll add tracing later." Without distributed tracing, debugging an EDA system is forensic archaeology. Add it from day one.
"Eventual consistency is everyone's problem." It's the consumer's problem. Producers shouldn't compromise their write path to accommodate downstream lag.
"The bus is the source of truth." The producer's DB is. The bus is the integration channel. Pair with the outbox pattern to keep them in sync.
"EDA is a free architectural upgrade." It adds operational complexity (broker, schema registry, consumer groups, dead-letter queues) and debugging cost. For 3-service systems, REST is simpler.

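The outbox pattern named in the list above can be sketched with SQLite from the standard library. This is an illustrative outline rather than a production design: the table layout, place_order, and relay_outbox are assumptions made for the example, and the publish callback stands in for a real broker client.

```python
import json
import sqlite3
from typing import Callable, List

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT PRIMARY KEY, total_usd REAL)")
conn.execute(
    "CREATE TABLE outbox ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, topic TEXT, payload TEXT, "
    "published INTEGER DEFAULT 0)"
)


def place_order(order_id: str, total_usd: float) -> None:
    """Write the order row and its event in one local transaction."""
    payload = json.dumps({"order_id": order_id, "total_usd": total_usd})
    with conn:  # commits on success, rolls back if either insert raises
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total_usd))
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("orders.OrderPlaced", payload),
        )


def relay_outbox(publish: Callable[[str, dict], None]) -> int:
    """Publish unpublished rows in insertion order, then mark them as sent."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        # A crash between publish and the UPDATE causes a re-publish later,
        # which is why delivery is at-least-once and consumers must dedupe.
        publish(topic, json.loads(payload))
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)


sent: List[tuple] = []
place_order("ord_8h2a", 89.50)
first_pass = relay_outbox(lambda topic, event: sent.append((topic, event)))
second_pass = relay_outbox(lambda topic, event: sent.append((topic, event)))
```

The producer's database stays the source of truth; the relay only copies committed facts onto the bus.
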
Performance characteristics

End-to-end latency. Producer commit → consumer ack: 10-500 ms typical, p99 may stretch to 1-2 sec under load.
Producer throughput. Kafka: 100k-1M msg/sec/broker; RabbitMQ: 30-50k/node; NATS: 1M+/server (in-memory).
Storage. Kafka retention 7 days default, can be infinite. ~500 bytes/event compressed; ~500 GB / 1B events before replication.
Fan-out scaling. N consumers see the same event with no extra producer cost — the killer feature of EDA.
Failure isolation. One slow consumer doesn't slow producers or other consumers — independent offsets.
Operational headcount. Running Kafka responsibly is typically a 1-3 person dedicated platform team at scale.
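
The storage figure is simple arithmetic, and a short calculation makes it easy to rerun with other numbers. The 500 bytes/event size comes from the list above; the replication factor of 3 and the 10,000 msg/sec rate are assumptions chosen only for illustration, and the helper names are hypothetical.

```python
def storage_gb(events: int, bytes_per_event: int, replication_factor: int = 1) -> float:
    """Disk needed for `events` messages, in decimal gigabytes (10**9 bytes)."""
    return events * bytes_per_event * replication_factor / 1e9


def retained_events(msgs_per_sec: int, retention_days: int) -> int:
    """How many events a topic holds at steady state for a retention window."""
    return msgs_per_sec * 86_400 * retention_days


# 1 billion events at ~500 bytes each.
one_billion_gb = storage_gb(1_000_000_000, 500)

# Same data with an assumed replication factor of 3.
replicated_gb = storage_gb(1_000_000_000, 500, replication_factor=3)

# An assumed 10,000 msg/sec topic kept for the 7-day default retention.
week_at_10k = retained_events(10_000, 7)
week_at_10k_gb = storage_gb(week_at_10k, 500)
```

Under these assumptions a single copy of one billion events is about 500 GB, and a week of a 10,000 msg/sec topic is roughly 3 TB before replication.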

Frequently asked questions

What's the difference between event-driven and request-response?

Request-response (REST, gRPC) is synchronous and addressed — service A calls service B by name, blocks until B responds. A and B are tightly coupled at runtime: if B is down, A is down. Event-driven is asynchronous and broadcast — service A publishes 'PaymentCompleted' to a topic; whoever subscribed handles it. A doesn't know B exists; B doesn't know A exists. The bus absorbs the temporal coupling. Trade-off: lose synchronous read-your-writes (A can't immediately see B's response) and gain elasticity, resilience, and the ability to add new consumers without touching producers.

When does event-driven architecture make sense?

Five signals: (1) multiple services need to react to one upstream change — email, analytics, fraud detection all care about PaymentCompleted. EDA fans out for free. (2) Asymmetric scaling — producers do 10k/sec, one consumer does 50k/sec while another does 100/sec. The bus buffers. (3) Cross-team independence — Team A doesn't want to coordinate deploys with Team B every time a downstream behavior changes. (4) High write throughput with delayed processing acceptable — ingestion, telemetry, audit. (5) Long-running workflows — sagas, approvals, multi-step processes. Anti-signal: tightly-coupled synchronous read-modify-write across two services on a hot path — REST is simpler.

Which message bus should I choose?

Kafka: log-based, partitioned, multi-consumer with offset replay, 1M+ msg/sec/broker. Best for high-throughput streaming, event sourcing, multi-consumer fan-out. RabbitMQ: queue-based, lower throughput (~50k msg/sec/node) but rich routing (exchanges, headers, RPC patterns). Best for traditional pub-sub and command queueing. AWS EventBridge: managed, serverless, deep AWS integration, ~10k events/sec/account default. Best for cross-account and SaaS-event ingestion. NATS: lightweight (sub-10 ms p99), small ops footprint, weaker durability than Kafka unless JetStream is enabled. Best for low-latency request-response over a bus. Google Pub/Sub: similar to Kafka but managed, autoscales. Pick the bus that matches your throughput / latency / durability profile, not the bus you've heard of.

What is eventual consistency and why does it matter?

In EDA, when service A publishes an event, services B, C, D process it asynchronously — they're behind A by some lag (typically 10-500 ms). During that lag the system is inconsistent: A says 'order placed,' B's database doesn't yet know, the UI may show stale data. Eventually they converge. This is eventual consistency. It is fine for analytics, search indexing, and audit logs, where the user doesn't see micro-inconsistency. It is dangerous for hot-path reads (showing the customer their order immediately after placing it). Solutions: read-your-writes by querying A directly (the producer), or wait for an acknowledgement event before showing success.

How do you debug event-driven systems?

Hardest part of EDA. The global flow doesn't exist in any single service's code — it's distributed across producers, the bus, and consumers. Three production tools: (1) Distributed tracing — Jaeger / OpenTelemetry propagating trace IDs through event headers so you can follow a request across services. (2) Schema registry — Confluent, AWS Schema Registry — every event type is versioned and discoverable; you can audit who produces and who consumes. (3) Event replay — Kafka's offset model lets you replay a topic into a dev environment and re-run consumers against historical events. Without these, EDA debugging is 'grep across N services.' Invest in tracing from day one.
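
The idea behind trace propagation can be shown without any tracing library. In the sketch below the header names trace_id and span_id, the OrderScored event, and the service functions are all hypothetical; real OpenTelemetry setups typically carry this context in a W3C traceparent header and use the SDK's propagators rather than hand-rolled dictionaries.

```python
import uuid
from typing import Dict, List, Optional, Tuple

Message = Tuple[Dict[str, str], dict]  # (headers, payload)


def make_message(payload: dict, parent_headers: Optional[Dict[str, str]] = None) -> Message:
    """Build a message that reuses the caller's trace ID or starts a new trace."""
    trace_id = (parent_headers or {}).get("trace_id") or uuid.uuid4().hex
    headers = {"trace_id": trace_id, "span_id": uuid.uuid4().hex[:16]}
    return headers, payload


trace_log: List[Tuple[str, str]] = []  # (service, trace_id) pairs, like log lines


def order_service() -> Message:
    msg = make_message({"type": "OrderPlaced", "order_id": "ord_8h2a"})
    trace_log.append(("order-service", msg[0]["trace_id"]))
    return msg


def fraud_service(incoming: Message) -> Message:
    headers, payload = incoming
    trace_log.append(("fraud-service", headers["trace_id"]))
    # Any follow-up event carries the same trace ID forward.
    return make_message({"type": "OrderScored", "order_id": payload["order_id"]}, headers)


placed = order_service()
scored = fraud_service(placed)
```

Searching logs for the one shared trace ID then reconstructs the path of a single order across both services, instead of grepping each service separately.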

How does Uber scale to thousands of services with EDA?

Uber moved from a single Python monolith to 4,000+ microservices between 2014 and 2020, all coordinated through Kafka. Their pattern: each domain (rides, payments, eats, freight, maps) owns its own services and Kafka topics. Inter-service communication is event-first; synchronous gRPC is only for read paths that need it. The 'Eats' Order Lifecycle alone publishes 30+ event types: order_created, restaurant_accepted, courier_assigned, delivered, etc. Each is a Kafka topic with multiple consumers. Total Kafka throughput at Uber is ~5 trillion messages/day across multiple clusters. The bus is the integration layer; without it, fully connecting 4,000 services would take up to 4000 × 3999 / 2 ≈ 8M point-to-point connections.