TCP Congestion Control

How TCP decides how fast to send

TCP congestion control decides how fast a sender pushes bytes onto a shared network. From Reno's loss-based AIMD to BBR's bandwidth-delay model, the algorithm is what keeps the internet from collapsing.

  • Initial window: 10 MSS (RFC 6928)
  • Slow start growth: ×2 per RTT
  • Congestion avoidance: +1 MSS per RTT
  • Loss response (Reno): cwnd ÷ 2
  • Linux default (2026): Cubic

How TCP congestion control works

The TCP sender doesn't know how much bandwidth is available, what the current queue depth is at the bottleneck, or how many other flows are competing. It only sees ACKs returning. Congestion control is the discipline of inferring "how fast can I send" from that single signal — and adjusting the congestion window cwnd in response.

The classical Reno algorithm has four phases:

  1. Slow start. Begin with cwnd = 10 MSS (modern default per RFC 6928, up from the 2-4 segments of older stacks). For every ACK, add 1 MSS, so cwnd doubles every RTT. Continue until a loss or until cwnd ≥ ssthresh.
  2. Congestion avoidance. Add 1 MSS per RTT (additive increase). Linear ramp.
  3. Fast retransmit. Three duplicate ACKs ⇒ infer a single packet was lost. Retransmit immediately without waiting for the RTO timer.
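  4. Multiplicative decrease. On a loss detected by duplicate ACKs, set ssthresh = cwnd / 2 and cwnd = ssthresh, then resume congestion avoidance. A retransmission timeout is treated more harshly: cwnd falls back to 1 MSS and slow start begins again. (NewReno, the standard for two decades, refines fast recovery to handle multiple losses per RTT.)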
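
The four phases above can be sketched as a per-ACK state machine. This is a minimal illustration, not a kernel implementation: the class name RenoSender and its method names are hypothetical, cwnd is counted in segments, and the timeout branch (cwnd back to 1 segment) follows RFC 5681.

```python
from dataclasses import dataclass


@dataclass
class RenoSender:
    """Hypothetical per-ACK Reno state, with cwnd counted in segments."""
    cwnd: float = 10.0                 # initial window (RFC 6928)
    ssthresh: float = float("inf")
    dup_acks: int = 0

    def on_new_ack(self) -> None:
        self.dup_acks = 0
        if self.cwnd < self.ssthresh:
            self.cwnd += 1.0               # slow start: +1 per ACK, doubles per RTT
        else:
            self.cwnd += 1.0 / self.cwnd   # congestion avoidance: about +1 per RTT

    def on_dup_ack(self) -> bool:
        """Return True when the caller should fast-retransmit."""
        self.dup_acks += 1
        if self.dup_acks == 3:
            self.ssthresh = max(2.0, self.cwnd / 2)
            self.cwnd = self.ssthresh      # multiplicative decrease
            return True
        return False

    def on_timeout(self) -> None:
        # Assumption from RFC 5681: an RTO restarts slow start from 1 segment.
        self.ssthresh = max(2.0, self.cwnd / 2)
        self.cwnd = 1.0
        self.dup_acks = 0


sender = RenoSender()
for _ in range(10):                    # one RTT worth of ACKs in slow start
    sender.on_new_ack()
print(sender.cwnd)                     # 20.0
```

Adding 1/cwnd per ACK is the usual way to get roughly one extra segment per RTT without tracking RTT boundaries explicitly.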

The shape this draws on a cwnd-vs-time graph is the famous "TCP sawtooth": exponential climb, halve, linear climb, halve, repeat. AIMD (additive-increase, multiplicative-decrease) is provably fair: flows that share a bottleneck and have similar RTTs converge on equal shares, because every halving cuts the gap between them in half while the additive steps leave it unchanged. It's also slow on modern networks. Cubic, BBR, and the rest are attempts to keep the fairness while ramping faster.
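
The fairness claim can be checked with a small simulation. The sketch below assumes two flows with identical RTTs and perfectly synchronized losses (both halve whenever their combined window exceeds the link capacity); the function name and the capacity of 100 segments are illustrative choices, not values from a real stack.

```python
def aimd_two_flows(x, y, capacity=100.0, rounds=2000):
    """Evolve two AIMD windows that share one bottleneck, one RTT per round."""
    for _ in range(rounds):
        if x + y > capacity:
            x, y = x / 2, y / 2      # multiplicative decrease halves the gap
        else:
            x, y = x + 1, y + 1      # additive increase keeps the gap constant
    return x, y


x, y = aimd_two_flows(80.0, 5.0)     # start far from a fair split
print(abs(x - y) < 1e-6)             # True: the windows have converged
```

Real flows see unsynchronized losses and different RTTs, so convergence in practice is noisier than this idealized model suggests.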

Reading ss -ti output

Linux exposes the kernel's per-socket congestion state through ss -ti:

State Recv-Q Send-Q  Local Address:Port    Peer Address:Port
ESTAB 0      131072  10.0.0.5:55432       151.101.1.69:443
   cubic wscale:7,7 rto:208 rtt:7.234/2.1 ato:40 mss:1448 rcvmss:536
   pmtu:1500 cwnd:42 ssthresh:32 bytes_sent:8421376 bytes_acked:8290304
   segs_out:5824 segs_in:1853 data_segs_out:5821
   send 67.3Mbps lastsnd:8 lastrcv:12 pacing_rate:80.7Mbps
   delivery_rate:64.2Mbps app_limited busy:7984ms
   rcv_space:14600 rcv_ssthresh:64088 minrtt:6.812

Key fields: cubic is the active congestion-control algorithm; cwnd:42 allows 42 segments in flight; ssthresh:32 is finite, which indicates the flow has already backed off after at least one loss; rtt:7.234/2.1 is a smoothed RTT of 7.2 ms with a mean deviation (rttvar) of 2.1 ms; delivery_rate:64.2Mbps is the kernel's most recent measurement of how fast data was actually delivered. The send figure is simply cwnd × mss / rtt. app_limited is critical: when the sender has no data to push, the measured rate reflects the application, and the algorithm should not interpret it as a network signal.
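
As a rough cross-check of those fields, the send figure can be recomputed from cwnd, mss, and rtt. The parser below is a minimal sketch that only handles the key:value tokens from the sample above; parse_ss_info is a hypothetical helper, not part of any library.

```python
SAMPLE = "cubic wscale:7,7 rto:208 rtt:7.234/2.1 ato:40 mss:1448 cwnd:42 ssthresh:32"


def parse_ss_info(line: str) -> dict:
    """Collect key:value tokens from one ss -ti info line (bare words are skipped)."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition(":")
        if sep:
            fields[key] = value
    return fields


info = parse_ss_info(SAMPLE)
srtt_ms, rttvar_ms = (float(v) for v in info["rtt"].split("/"))
cwnd, mss = int(info["cwnd"]), int(info["mss"])

# One cwnd of data per smoothed RTT, converted to megabits per second.
send_mbps = cwnd * mss * 8 / (srtt_ms / 1000) / 1e6
print(round(send_mbps, 1))   # 67.3, matching the "send 67.3Mbps" field
```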

Congestion-control algorithms compared

Property | Reno | Cubic | BBR | Vegas | DCTCP
Signal | Loss | Loss | Bandwidth + minRTT | RTT delta | ECN marks
Growth in CA | Linear (+1/RTT) | Cubic in time | Pacing at est. BW | Linear, RTT-aware | ECN-proportional
Loss response | cwnd ÷ 2 | cwnd × 0.7 | None (uses BW estimate) | cwnd ÷ 2 | Proportional to ECN%
Long-fat-pipe | Slow recovery | Fast cubic ramp | Excellent | OK | N/A (datacenter)
Bufferbloat | Causes it | Causes it | Avoids it | Avoids it | N/A
Fairness with peers | Reference fair | ~Reno-fair | Aggressive vs Cubic | Loses to loss-based | DCTCP-only
Where deployed | Legacy | Linux default since 2.6.19 | Google fleet, YouTube | Research, FreeBSD | Datacenters (Microsoft, Meta)

JavaScript implementation — simulating cwnd evolution

// Toy Reno simulator: steps one RTT at a time and applies random loss,
// recording cwnd over time. There is no queue model. Educational, not production.
function simulateReno({ rttMs = 50, durationMs = 5000, lossProb = 0.005 } = {}) {
  const MSS = 1448;       // payload bytes per segment
  let cwnd = 10;          // initial window in segments
  let ssthresh = Infinity;
  let inSlowStart = true;
  const trace = [];

  for (let t = 0; t < durationMs; t += rttMs) {
    // One RTT worth of segments is sent. Each is lost independently with
    // probability lossProb, so P(at least one loss) = 1 - (1 - p)^cwnd.
    const lost = Math.random() < 1 - Math.pow(1 - lossProb, cwnd);
    if (lost) {
      ssthresh = Math.max(2, Math.floor(cwnd / 2));
      cwnd = ssthresh;        // fast recovery: skip slow start
      inSlowStart = false;
    } else if (inSlowStart) {
      cwnd = Math.min(cwnd * 2, ssthresh);
      if (cwnd >= ssthresh) inSlowStart = false;
    } else {
      cwnd += 1;              // congestion avoidance
    }
    // Throughput if a full cwnd is delivered every RTT.
    trace.push({ t, cwnd, throughputMbps: (cwnd * MSS * 8) / (rttMs / 1000) / 1e6 });
  }
  return trace;
}

const trace = simulateReno({ rttMs: 50, durationMs: 5000, lossProb: 0.002 });
console.table(trace.filter((_, i) => i % 10 === 0));

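Run this and you'll see the sawtooth: cwnd climbs, a packet drops, cwnd halves, and the cycle restarts. The peaks vary from run to run because loss is random; at lossProb = 0.002 they typically sit in the tens of segments once slow start has ended. Switching lossProb to 0 makes cwnd grow without bound, because the toy has no queue to overflow. On a real path, that unchecked growth is what fills router buffers and produces bufferbloat.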

Python implementation — Cubic vs Reno recovery time on a long-fat-pipe

"""
Compare how long Reno-style AIMD and Cubic take to climb back to the BDP
after a single loss on a 10 Gbps x 100 ms link
(BDP ~ 125 MB, ~84,000 segments).
"""

def reno_aimd(bdp_segs, rtt_ms, max_rtts=200_000):
    # A loss at cwnd == BDP halves the window; recovery is +1 segment per RTT.
    cwnd = bdp_segs // 2
    rtts = 0
    while cwnd < bdp_segs and rtts < max_rtts:
        cwnd += 1
        rtts += 1
    return rtts, rtts * rtt_ms / 1000

def cubic(bdp_segs, rtt_ms, w_max, beta=0.7, c=0.4):
    # cwnd(t) = c*(t - K)^3 + w_max, with t in seconds since the loss and
    # K = (w_max*(1-beta)/c)^(1/3) the time at which cwnd returns to w_max.
    k = (w_max * (1 - beta) / c) ** (1 / 3)
    step_s = rtt_ms / 1000
    rtts = 0
    cwnd = w_max * beta                  # window right after the loss
    while cwnd < bdp_segs and rtts * step_s < 1000:
        rtts += 1
        cwnd = c * (rtts * step_s - k) ** 3 + w_max
    return rtts, rtts * step_s

bdp = 84_000
print("Reno:  ", reno_aimd(bdp, 100))    # (RTTs, seconds)
print("Cubic: ", cubic(bdp, 100, w_max=bdp))

Reno needs roughly 42,000 RTTs (about 70 minutes at 100 ms) to climb from half the BDP back to the BDP after a single loss. Cubic reaches the same window in about 400 RTTs (roughly 40 seconds). On a 10 Gbps × 100 ms transcontinental link this is the difference between "TCP works" and "TCP works once you go to lunch."

Concrete costs and numbers

  • Slow start ramp: from cwnd 10 to cwnd 1024 takes 7 RTTs. On a 50 ms link, that's 350 ms of warm-up before TCP is at full throttle — frequently longer than the entire HTTP request.
  • Initial window history: RFC 2581 (1999): up to 2 segments. RFC 3390 (2002): up to 4 segments (about 4 KB). RFC 6928 (2013): 10 segments. Some CDNs experimentally raise to 30+; the trade-off is bursty loss when the bottleneck buffer is small.
  • Reno's recovery rate: +1 MSS / RTT. To climb from cwnd 500 back to 1000 after a loss takes 500 RTTs.
  • BBR throughput gain: Google reports 2-25% improvement on YouTube QoE metrics; on lossy paths (≥1% drop) gains are 2-25× because Cubic interprets every drop as congestion while BBR uses bandwidth probes.
  • Bufferbloat magnitude: a 64 MB router buffer on a 10 Mbps link adds 50+ seconds of queuing. CoDel was designed because TCP loss-based algorithms cannot distinguish bloat from competition.
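  • RTO minimum: Linux clamps RTO at 200 ms. In datacenters with sub-millisecond RTTs, a single timeout costs hundreds to thousands of round trips (200 ms is 2,000× a 100 µs RTT). Mitigation: lower the floor with the per-route rto_min option of ip route (or the tcp_rto_min_us sysctl on recent kernels), or use DCTCP/ECN to avoid most timeouts.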
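
Several of the figures in this list are plain arithmetic and can be reproduced in a few lines. The helper names below are hypothetical, and the buffer size is taken as decimal megabytes (64 MB = 64 × 10^6 bytes).

```python
import math


def slow_start_rtts(initial_segs, target_segs):
    """RTTs of doubling needed to grow from initial_segs to at least target_segs."""
    return math.ceil(math.log2(target_segs / initial_segs))


def queue_delay_s(buffer_bytes, link_bps):
    """Time to drain a full buffer at the link rate, in seconds."""
    return buffer_bytes * 8 / link_bps


def reno_recovery_rtts(cwnd_before_loss):
    """RTTs for Reno to climb from cwnd/2 back to cwnd at +1 segment per RTT."""
    return cwnd_before_loss - cwnd_before_loss // 2


print(slow_start_rtts(10, 1024))       # 7 RTTs, i.e. 350 ms at a 50 ms RTT
print(queue_delay_s(64e6, 10e6))       # 51.2 seconds for 64 MB at 10 Mbps
print(reno_recovery_rtts(1000))        # 500 RTTs from cwnd 500 back to 1000
```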

Variants and refinements

  • Vegas (1995) — RTT-based: slow down when RTT increases, even before loss. Theoretically best, in practice loses bandwidth to greedy loss-based competitors and was abandoned for the public internet.
  • NewReno (RFC 6582) — repairs Reno's failure to handle multiple losses per RTT by tracking partial ACKs.
  • Cubic (2008) — Linux default for 15+ years. The cubic curve is stable around the previous maximum and aggressive away from it.
  • BBR (2016, BBRv3 in 2023) — model-based: measures bottleneck bandwidth and propagation RTT, then paces sends to fill the pipe without filling buffers. Used by Google, YouTube, and many CDNs.
  • DCTCP (2010) — datacenter-only. Uses ECN marks (not loss) to estimate congestion proportionally, allowing very small switch buffers and microsecond RTOs.
  • ECN (RFC 3168) — Explicit Congestion Notification. Routers mark packets when their queue is filling rather than dropping them; the sender slows down without paying for retransmission. Long history of middlebox interference; finally widely supported in 2026.
  • Pacing — instead of dumping cwnd worth of packets at line rate, space them across the RTT. Reduces microbursts, plays nicer with shallow buffers. Required for BBR; Linux implements it in the fq qdisc or in TCP's internal pacing, with the rate scaled by the tcp_pacing_ss_ratio and tcp_pacing_ca_ratio sysctls.
  • ABC (Appropriate Byte Counting) — count bytes acknowledged, not number of ACKs, when growing cwnd. Defends against ACK-division attacks.
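
To make DCTCP's "proportional" response concrete, here is a minimal sketch of its once-per-window update as specified in RFC 8257: a moving average alpha of the marked fraction, and a window cut of alpha/2. The function name and calling convention are illustrative assumptions; g = 1/16 is the gain the RFC recommends.

```python
def dctcp_update(cwnd, alpha, marked, acked, g=1 / 16):
    """Apply one observation window of ECN feedback; return (cwnd, alpha).

    marked / acked is the fraction of ACKed segments that carried an ECN echo.
    """
    frac = marked / acked if acked else 0.0
    alpha = (1 - g) * alpha + g * frac          # smoothed congestion estimate
    if marked:
        cwnd = max(1.0, cwnd * (1 - alpha / 2))  # cut in proportion to alpha
    return cwnd, alpha


print(dctcp_update(100.0, 1.0, 10, 10))   # (50.0, 1.0): every segment marked, Reno-like halving
print(dctcp_update(100.0, 0.0, 1, 16))    # (99.8046875, 0.00390625): one mark, tiny cut
```

When every segment is marked, alpha stays at 1 and the cut equals Reno's halving; light marking produces a correspondingly light reduction, which is why DCTCP can keep switch queues short without sacrificing throughput.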

Common bugs and edge cases

  • Bufferbloat. A loss-based algorithm fills a 64 MB ISP buffer; pings inflate from 20 ms to 1 s. VoIP and gaming die. Diagnosis: ping while saturating upload. Fix at scale: deploy fq_codel/cake on the bottleneck; switch to BBR.
  • Incast collapse. 50 servers each send 100 KB to one client; the ToR switch's 1 MB buffer overflows; every flow hits a 200 ms RTO; aggregate throughput drops to a tenth of nominal. Fix: shrink RTO_min to microseconds and enable DCTCP with ECN.
  • BBR vs Cubic unfairness. BBRv1 starves Cubic when sharing a shallow buffer; BBRv2/v3 added fairness logic. Mixed-fleet datacenters had a real outage class from this.
  • Spurious retransmits on Wi-Fi. Wi-Fi link-layer retries can add hundreds of milliseconds of jitter. A delay spike that outlasts the RTO fires a spurious timeout; the sender treats it as loss, retransmits data that was never dropped, and cuts cwnd unnecessarily. F-RTO (RFC 5682) detects such timeouts and limits the damage.
  • Initial-window pacing absent. Sending cwnd=10 segments back-to-back creates a 14 KB microburst. Shallow-buffer switches drop part of it; the connection enters loss recovery before exiting slow start. Pacing solves this; older stacks lacked it.
  • app_limited misclassified. Algorithms must distinguish "sender ran out of data" from "network slowed down." Misclassification leaves cwnd stuck low. Linux's app_limited tracking in its delivery-rate sampling (the flag that ss reports) fixed many subtle bugs here.

Frequently asked questions

What is the congestion window?

cwnd is the sender's estimate of how many unacknowledged bytes the network can hold without dropping. The actual in-flight cap is min(cwnd, rwnd) — the smaller of congestion window and the receiver's advertised window. cwnd is purely sender-side state, never sent on the wire.
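
The min(cwnd, rwnd) rule translates directly into the sender's "how much may I send right now" calculation. A minimal sketch, with a hypothetical function name and all quantities in bytes:

```python
def send_allowance(cwnd_bytes, rwnd_bytes, in_flight_bytes):
    """Bytes the sender may still put on the wire under both window limits."""
    window = min(cwnd_bytes, rwnd_bytes)     # the tighter of the two limits wins
    return max(0, window - in_flight_bytes)  # never negative once the window is full


print(send_allowance(60_816, 131_072, 50_000))  # 10816: congestion-limited
print(send_allowance(60_816, 20_000, 5_000))    # 15000: receiver-limited
```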

How fast does slow start ramp up?

cwnd doubles every RTT. Starting from the modern initial window of 10 segments, you reach 1024 segments (~1.5 MB at 1500-byte MSS) after about 7 RTTs. From there Reno-style algorithms switch to additive increase — +1 MSS per RTT.

Why was Cubic invented?

Reno's linear AIMD recovers too slowly on long fat pipes. On a 10 Gbps link with 100 ms RTT the pipe holds roughly 84,000 segments, so after a loss halves cwnd, Reno needs roughly 42,000 RTTs (about 70 minutes) to refill it. Cubic uses a cubic curve in time-since-last-loss, so it climbs aggressively far from the previous max and gently near it, recovering in tens of seconds.

Is BBR actually better than Cubic?

On bufferbloated paths and long-fat-pipes, BBR delivers 2-25× higher throughput at lower latency because it estimates bandwidth and minimum RTT instead of reacting to loss. On well-provisioned paths or when sharing with loss-based flows, BBR can be either fairer or more aggressive depending on version. Google reports BBRv1 reduced YouTube rebuffer rates by 10-20% globally.

What is bufferbloat?

When a router has multi-megabyte buffers, loss-based congestion control fills them before noticing. The result: the queue stays full, RTT inflates from 20ms to 500ms, and interactive traffic crawls. Loss-based algorithms cause it; AQM (CoDel, fq_codel) and bandwidth-based algorithms (BBR) avoid it.

What is incast collapse?

In a datacenter, when many servers reply to one client at once (e.g., a fanout query), all responses arrive in microseconds and overflow the top-of-rack switch's buffer. TCP enters retransmission timeout (RTO_min ≈ 200 ms) on every flow, killing throughput. Fixes: smaller RTO_min, DCTCP with ECN marking, or RDMA.