WebSockets

One handshake, then a two-way pipe that never hangs up

WebSockets upgrade a single HTTP connection into a persistent, full-duplex channel: after one Upgrade handshake, client and server exchange framed messages in both directions with 2–14 bytes of header overhead per frame and push latency limited mainly by the network's one-way delay.

Standard: RFC 6455 (2011)
Handshake: HTTP Upgrade → 101 Switching Protocols
Direction: Full-duplex, either side sends anytime
Frame overhead: 2–14 bytes per message
Schemes / ports: ws:// (80), wss:// (443)

How WebSockets turn one HTTP request into a permanent channel

Plain HTTP is a vending machine: you put in a request, you get out a response, and the transaction is over. If the server later wants to tell you something — a new chat message arrived, a stock moved, another player fired — it can't. It has to wait for you to ask again. For decades the workaround was polling: ask "anything new?" every second, burn a full request/response cycle, and usually hear "no."

WebSockets, standardized as RFC 6455 in 2011, fix this by keeping the door open. The connection is born as HTTP — a normal GET with two magic headers — and then it stops being HTTP. The server answers 101 Switching Protocols, both sides keep the same TCP socket, and from that moment the bytes on the wire follow the WebSocket framing protocol, not HTTP. Either side can now send a message the instant it has one. That property is full-duplex: simultaneous, independent traffic in both directions, like a phone call rather than walkie-talkies.

Client                                 Server
  │  GET /ws HTTP/1.1                     │
  │  Upgrade: websocket                   │
  │  Connection: Upgrade                  │
  │  Sec-WebSocket-Key: dGhlIHNhbXBsZQ==  │
  │ ───────────────────────────────────► │
  │                                       │
  │  HTTP/1.1 101 Switching Protocols     │
  │  Upgrade: websocket                   │
  │  Sec-WebSocket-Accept: OEPSNnbv...    │
  │ ◄─────────────────────────────────── │
  │                                       │
  │ ═══════ full-duplex frames ═════════► │  client pushes
  │ ◄═══════ full-duplex frames ════════  │  server pushes
  │          (one TCP connection)         │

The Sec-WebSocket-Accept value is the server's proof it speaks WebSocket: it is base64(SHA-1(Sec-WebSocket-Key + "258EAFA5-E914-47DA-95CA-C5AB0DC85B11")). That GUID is a fixed magic constant baked into the spec. A dumb HTTP server that just echoed the key back would send the wrong accept value, so the client refuses the upgrade: a cheap guard against accidental upgrades.
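The accept check described above can be sketched from the client's point of view. This is a minimal illustration, not a full handshake: the helper names (new_key, expected_accept, server_speaks_websocket) are made up for this example, and the sample key/accept pair is the one printed in RFC 6455, section 1.3.

```python
import base64
import hashlib
import os

GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"  # fixed constant from RFC 6455

def new_key() -> str:
    """Client side: a fresh Sec-WebSocket-Key is 16 random bytes, base64-encoded."""
    return base64.b64encode(os.urandom(16)).decode("ascii")

def expected_accept(key: str) -> str:
    """The value a WebSocket-aware server must send back for this key."""
    digest = hashlib.sha1((key + GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")

def server_speaks_websocket(key: str, accept_header: str) -> bool:
    """Client-side check: reject the upgrade unless the accept value matches."""
    return accept_header.strip() == expected_accept(key)

# Sample key/accept pair from RFC 6455, section 1.3
sample_key = "dGhlIHNhbXBsZSBub25jZQ=="
print(expected_accept(sample_key))                      # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
print(server_speaks_websocket(sample_key, sample_key))  # False: echoing the key fails
```

SHA-1 is used here only as a fingerprint that an HTTP-only server would not produce by accident; it is not protecting a secret.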

The frame format — where the 2-byte overhead comes from

Once upgraded, every message is wrapped in a frame. The header is tiny and bit-packed:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-------+-+-------------+-------------------------------+
|F|R|R|R| op    |M| Payload len |    Extended payload length    |
|I|S|S|S| code  |A|   (7 bits)  |        (16 or 64 bits)        |
|N|V|V|V| (4)   |S|             |                               |
| |1|2|3|       |K|             |                               |
+-+-+-+-+-------+-+-------------+ - - - - - - - - - - - - - - - +
|     Masking-key (4 bytes, client→server only)  |  Payload... |
+---------------------------------------------------------------+

FIN — 1 if this is the last frame of a message (messages can be split across frames).
Opcode — 0x1 text, 0x2 binary, 0x8 close, 0x9 ping, 0xA pong, 0x0 continuation.
MASK — must be 1 on every client→server frame, 0 on every server→client frame.
Payload length — 7 bits for ≤125 bytes, else 126 + a 16-bit length, or 127 + a 64-bit length.

So the absolute minimum frame header is 2 bytes (unmasked, tiny payload); a masked frame carrying a 64-bit length tops out at 14 bytes. Compare that to an HTTP/1.1 message, where headers like Host, User-Agent, Cookie, and Accept routinely cost 500–800 bytes per request. Sending a 20-byte "player moved" update 30 times a second over HTTP would spend 95%+ of bytes on headers; over WebSockets it's a couple of bytes of frame plus the payload.
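To see where the 2-byte and 14-byte figures come from, here is a small sketch that builds only the frame header for a given payload length. The function name frame_header and the example masking key are illustrative choices, and the RSV bits are simply left at zero.

```python
import struct
from typing import Optional

def frame_header(opcode: int, payload_len: int,
                 mask_key: Optional[bytes] = None, fin: bool = True) -> bytes:
    """Build just the header of one WebSocket frame (RSV bits left at 0)."""
    b0 = (0x80 if fin else 0x00) | (opcode & 0x0F)
    mask_bit = 0x80 if mask_key is not None else 0x00

    if payload_len <= 125:                       # length fits in the 7-bit field
        header = bytes([b0, mask_bit | payload_len])
    elif payload_len <= 0xFFFF:                  # 126 + 16-bit length
        header = bytes([b0, mask_bit | 126]) + struct.pack(">H", payload_len)
    else:                                        # 127 + 64-bit length
        header = bytes([b0, mask_bit | 127]) + struct.pack(">Q", payload_len)

    if mask_key is not None:
        if len(mask_key) != 4:
            raise ValueError("masking key must be exactly 4 bytes")
        header += mask_key
    return header

TEXT = 0x1
key = b"\x37\xfa\x21\x3d"                        # arbitrary example key
for n in (20, 126, 70_000):
    print(n, len(frame_header(TEXT, n)), len(frame_header(TEXT, n, key)))
# 20 2 6
# 126 4 8
# 70000 10 14
```

The three rows cover the three length encodings: unmasked headers are 2, 4, or 10 bytes, and the 4-byte masking key brings client frames to 6, 8, or 14.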

Why the client must mask every frame

The masking rule looks like pointless busywork until you see the attack it blocks. Without masking, a clever attacker could host a malicious page that opened a WebSocket connection and sent bytes that happened to spell a valid HTTP request, like GET /admin HTTP/1.1. A naive intermediary proxy sitting between the browser and the origin, not understanding WebSockets, might interpret those bytes as a real request and cache the response, poisoning the cache for every other user.

RFC 6455 defends against this by forcing the browser to XOR every outbound payload byte with a fresh, random 4-byte key chosen per frame:

transformed[i] = original[i] XOR maskingKey[i mod 4]

Because the key is random and unpredictable from the page's JavaScript, the attacker can't steer the post-mask bytes into a crafted HTTP request. The server simply XORs again with the same key to recover the payload (XOR is its own inverse). The cost is one cheap XOR pass per byte; the benefit is that a hostile script cannot forge plaintext on the wire.
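The XOR rule is short enough to try directly. The sketch below uses a hypothetical helper, apply_mask, and checks it against the masked "Hello" example given in RFC 6455, section 5.7; the GET /admin string is only there to mirror the attack described above.

```python
import os

def apply_mask(data: bytes, key: bytes) -> bytes:
    """XOR each byte with key[i mod 4]; the same call masks and unmasks."""
    return bytes(b ^ key[i % 4] for i, b in enumerate(data))

# Known example from RFC 6455, section 5.7: "Hello" under key 37 fa 21 3d
key = bytes.fromhex("37fa213d")
masked = apply_mask(b"Hello", key)
print(masked.hex())                 # 7f9f4d5158
print(apply_mask(masked, key))      # b'Hello'

# The page picks the payload, but the browser picks a fresh key per frame,
# so the page cannot predict which bytes actually reach the wire.
wanted = b"GET /admin HTTP/1.1\r\n"
on_wire = apply_mask(wanted, os.urandom(4))
```

Masking is not encryption: the key travels in the same frame, so anyone who parses the frame can undo it. Its only job is to stop the sender from choosing the wire bytes.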

When to reach for WebSockets — and when not to

Bidirectional, low-latency, high-frequency. Chat, multiplayer games, collaborative editors (Figma, Google Docs), live trading dashboards. If both sides push and milliseconds matter, this is the tool.
Many small messages. The frame overhead is so small that thousands of tiny updates per second stay cheap.
Long-lived sessions. Pay the handshake cost once, then amortize it over minutes or hours of traffic.

Reach for something simpler when the pattern is one-directional or rare:

Server → client only? Use Server-Sent Events. They ride plain HTTP, auto-reconnect, and pass through most proxies as ordinary HTTP responses, provided the proxy does not buffer the stream.
Occasional request/response? A REST call is stateless, cacheable, and far easier to load-balance. Don't hold a socket open to fetch a profile once.
Audio/video or peer-to-peer? WebRTC sends media as RTP and data channels as SCTP, both on top of UDP, and can be configured to tolerate packet loss. TCP-based WebSockets can't: head-of-line blocking stalls every later message until a single dropped packet has been retransmitted.

WebSockets vs the real-time alternatives

	WebSockets	HTTP polling	HTTP long-polling	Server-Sent Events	WebRTC data channel
Direction	Full-duplex	Client → server	Client → server	Server → client	Full-duplex (peer↔peer)
Transport	TCP (one socket)	TCP (new conn/poll)	TCP (held open)	TCP (held open)	UDP / SCTP
Server→client latency	≈ one-way network delay	½ × poll interval (≈500 ms at 1 s polls)	≈ one-way delay, plus a re-request gap	≈ one-way network delay	≈ one-way network delay
Per-message overhead	2–14 bytes	Full HTTP headers (~0.5 KB)	Full HTTP headers	Field name + newline	~12+ bytes (DTLS/SCTP)
Auto-reconnect	No (DIY)	N/A (each poll fresh)	No (DIY)	Yes (Last-Event-ID)	No (re-negotiate)
Ordered + reliable	Yes (TCP)	Yes	Yes	Yes	Configurable (can drop)
Proxy / firewall friendliness	Good over wss:// (443)	Excellent	Excellent	Excellent	Needs STUN/TURN
Best for	Chat, games, collab	"Anything new?" checks	Notifications	One-way feeds	P2P media / games

The honest summary: WebSockets win when traffic is two-way and frequent, SSE wins for one-way feeds, polling wins for rare and simple checks, and WebRTC wins when you'd rather drop a stale packet than wait for a retransmit.

What the numbers actually say

Header savings. A typical HTTP/1.1 request/response pair carries 500–800 bytes of headers. A WebSocket frame carrying the same 20-byte JSON payload adds 6 bytes (masked, short). For a cursor that broadcasts 30 updates/second, that's roughly (800 − 6) × 30 ≈ 24 KB/s saved per client in pure overhead.
Latency. Polling delivers a server event after, on average, half the poll interval. At a 1-second poll that's 500 ms of dead time; WebSockets push it in the time it takes one TCP segment to traverse the network, typically sub-millisecond within the same data center and single-digit milliseconds across a region.
Connection density. The C10K problem (10,000 concurrent connections) was a hard 1999-era barrier; modern epoll/kqueue event loops blew past it, and the C1M demos hold 1,000,000+ idle connections on one tuned box at roughly 4–10 KB RAM each. The wall is memory and file descriptors, not CPU.
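Keepalive cost. A ping/pong heartbeat every 30 s to detect dead peers is tiny: an empty server ping is a 2-byte frame and the client's masked pong is 6 bytes, so about 8 bytes of WebSocket framing per cycle, or roughly 0.27 bytes/second/client (TCP/IP headers excluded). That is negligible even at a million connections.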
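The arithmetic behind these figures is easy to re-run with your own inputs. The snippet below is only a back-of-the-envelope sketch using the numbers quoted above (500–800 header bytes, a 6-byte masked frame header, 30 updates per second, 4–10 KB per idle socket); none of them are measurements.

```python
# Inputs taken from the figures quoted above
http_headers = (500, 800)      # bytes of headers per HTTP request/response pair
ws_header = 6                  # bytes: masked frame, payload <= 125 bytes
updates_per_s = 30

saved = [(h - ws_header) * updates_per_s for h in http_headers]
print(saved)                   # [14820, 23820] bytes/s saved per client

poll_interval_ms = 1000
print(poll_interval_ms / 2)    # 500.0 ms average wait for a polled event

connections = 1_000_000
ram_gb = [connections * kb * 1024 / 1e9 for kb in (4, 10)]
print(ram_gb)                  # [4.096, 10.24] GB for idle sockets alone
```

The 24 KB/s headline corresponds to the 800-byte end of the header range; at 500 bytes the saving is closer to 15 KB/s per client.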

JavaScript: client and a from-scratch server handshake

The browser client is dead simple — the event-driven API hides all the framing:

let ws;
let backoff = 500;                  // ms; doubles after each failed attempt

function connect() {
  ws = new WebSocket('wss://example.com/ws');

  ws.addEventListener('open', () => {
    backoff = 500;                  // healthy connection: reset the backoff
    ws.send(JSON.stringify({ type: 'join', room: 'general' }));
  });

  ws.addEventListener('message', (e) => {
    const msg = JSON.parse(e.data); // e.data is a string for text frames
    render(msg);                    // render() is your app's own UI hook
  });

  ws.addEventListener('close', (e) => {
    // 1000 = normal; 1006 = abnormal (no close frame, often a dropped conn)
    if (e.code !== 1000) scheduleReconnect();
  });
}

// Resilient reconnect with exponential backoff + jitter
function scheduleReconnect() {
  const wait = Math.min(backoff, 30_000) + Math.random() * 500;
  setTimeout(connect, wait);
  backoff = Math.min(backoff * 2, 30_000); // 0.5s, 1s, 2s, 4s, ... capped at 30s
}

connect();

On the server, computing the handshake accept value by hand shows exactly what the magic GUID does:

import crypto from 'node:crypto';

const MAGIC = '258EAFA5-E914-47DA-95CA-C5AB0DC85B11';

function acceptValue(secWebSocketKey) {
  return crypto
    .createHash('sha1')
    .update(secWebSocketKey + MAGIC)
    .digest('base64');
}

// On a raw TCP upgrade event, reply with the 101:
function respondHandshake(socket, key) {
  socket.write(
    'HTTP/1.1 101 Switching Protocols\r\n' +
    'Upgrade: websocket\r\n' +
    'Connection: Upgrade\r\n' +
    'Sec-WebSocket-Accept: ' + acceptValue(key) + '\r\n\r\n'
  );
}

In production you'd use the ws library (Node) or your framework's built-in support rather than parsing frames by hand — but the accept computation is genuinely just those three lines.

Python: decoding a masked client frame

The trickiest part of a hand-rolled server is unmasking. Here's the minimal frame decoder, the piece every WebSocket library implements:

import struct

def read_client_frame(sock):
    b0, b1 = sock.recv(2)
    fin    = b0 >> 7
    opcode = b0 & 0x0F
    masked = b1 >> 7                 # MUST be 1 for client frames
    length = b1 & 0x7F

    if length == 126:               # 16-bit extended length
        length = struct.unpack('>H', sock.recv(2))[0]
    elif length == 127:             # 64-bit extended length
        length = struct.unpack('>Q', sock.recv(8))[0]

    if not masked:                  # spec violation — close the connection
        raise ValueError('client frame was not masked')

    mask    = sock.recv(4)
    payload = bytearray(sock.recv(length))
    for i in range(length):         # XOR is its own inverse
        payload[i] ^= mask[i % 4]

    return fin, opcode, bytes(payload)

Two details bite beginners. First, recv(n) can return fewer than n bytes on a real socket — production code loops until it has the full count. Second, a single application message may arrive as several frames (FIN=0 continuation frames), so you buffer payloads until you see FIN=1.
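Both fixes can be sketched in a few lines. The helpers below, recv_exact and reassemble, are hypothetical names for this example; reassemble expects the (fin, opcode, payload) tuples that a frame decoder like the one above returns, and it skips control frames rather than answering them.

```python
import socket

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Keep calling recv() until exactly n bytes have arrived."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:                       # peer closed mid-frame
            raise ConnectionError("socket closed before frame was complete")
        buf += chunk
    return bytes(buf)

def reassemble(frames):
    """Join (fin, opcode, payload) tuples into whole messages.

    The first frame of a message carries the real opcode (0x1 text or
    0x2 binary); later fragments use opcode 0x0 (continuation).
    Control frames (opcode >= 0x8) are skipped here for brevity.
    """
    opcode, parts = None, []
    for fin, op, payload in frames:
        if op >= 0x8:
            continue
        if op != 0x0:                       # start of a new message
            opcode, parts = op, []
        parts.append(payload)
        if fin:                             # last fragment: emit the message
            yield opcode, b"".join(parts)
            opcode, parts = None, []

frames = [(0, 0x1, b"Hel"), (0, 0x0, b"lo, "), (1, 0x0, b"world")]
print(list(reassemble(frames)))   # [(1, b'Hello, world')]
```

A real server would also cap the total reassembled size, since a peer can otherwise send continuation frames forever.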

Variants and layers worth knowing

Socket.IO. A higher-level library that uses WebSockets when available and falls back to HTTP long-polling when a proxy blocks the upgrade. It adds rooms, automatic reconnection, acknowledgements, and multiplexed "namespaces" — conveniences raw WebSockets don't provide. The trade-off is a non-standard wire protocol: a Socket.IO client can't talk to a plain WebSocket server.

permessage-deflate. A negotiated extension (RFC 7692) that compresses each message with DEFLATE. Great for repetitive JSON, but it keeps a per-connection compression context that can cost tens of KB of memory each — a real concern at a million connections.
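The per-connection context is easiest to see with Python's zlib module. This is a rough sketch of the message transform only (raw DEFLATE, sync flush, trailing 00 00 ff ff removed), assuming context takeover is enabled; extension negotiation and the RSV1 bit are left out, and the function names are illustrative.

```python
import json
import zlib

TAIL = b"\x00\x00\xff\xff"   # empty-block marker that RFC 7692 strips from each message

def deflate_message(compressor, payload: bytes) -> bytes:
    """Compress one message, keeping the LZ77 window for the next one."""
    data = compressor.compress(payload) + compressor.flush(zlib.Z_SYNC_FLUSH)
    return data[:-4] if data.endswith(TAIL) else data

def inflate_message(decompressor, data: bytes) -> bytes:
    """Restore the stripped tail, then decompress with the shared window."""
    return decompressor.decompress(data + TAIL)

# Raw DEFLATE (negative wbits = no zlib header), one context per connection
comp = zlib.compressobj(wbits=-15)
decomp = zlib.decompressobj(wbits=-15)

msg = json.dumps({"type": "cursor", "user": "ada", "x": 100, "y": 200}).encode()
first = deflate_message(comp, msg)
second = deflate_message(comp, msg)    # same JSON again: mostly back-references
print(len(msg), len(first), len(second))
assert inflate_message(decomp, first) == msg
assert inflate_message(decomp, second) == msg
```

The comp and decomp objects are the memory cost mentioned above: each holds a sliding window that must live as long as the connection does.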

STOMP / MQTT over WebSockets. Messaging protocols tunneled inside WebSocket frames so browsers can speak to message brokers. STOMP is common with RabbitMQ; MQTT-over-WS lets web dashboards subscribe to IoT topics.

WebTransport. The newer HTTP/3-based successor built on QUIC. It offers both reliable streams and unreliable datagrams over a single multiplexed connection, dodging TCP head-of-line blocking. It targets the same low-latency niche but is still rolling out across browsers.

Common bugs and edge cases

No reconnection logic. The browser API does not auto-reconnect. A dropped Wi-Fi connection silently fires close with code 1006 and the channel is gone until you rebuild it — always implement backoff + jitter.
Silent dead connections. TCP can keep a socket "open" long after the peer vanished (no FIN ever arrived). Without app-level ping/pong heartbeats you'll keep pushing into a black hole. Send a ping every 20–30 s and drop the connection if no pong returns.
Idle-timeout proxies. Many load balancers and CDNs kill idle connections after 30–60 s. The heartbeat above doubles as keepalive traffic to prevent that.
Sticky-session requirement. A WebSocket is pinned to the backend that accepted it. Behind a load balancer, a message published on server B reaches a client connected to server A only through a shared pub/sub layer such as Redis; sticky routing helps when reconnects or long-polling fallbacks depend on state held on one backend.
Backpressure ignored. If you send() faster than the network can drain, bufferedAmount climbs and memory balloons. Check it and pause producing when it's high.
Forgetting to mask (or to reject unmasked). A client that sends unmasked frames, or a server that accepts them, violates RFC 6455. Conformant servers must close the connection on an unmasked client frame.
Assuming one frame = one message. Large payloads fragment across continuation frames; reassemble by FIN before parsing.
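The dead-connection check from the list above comes down to two timers. Here is one possible sketch, independent of any particular WebSocket library: HeartbeatMonitor is a made-up helper, the 25 s interval and 10 s timeout are arbitrary defaults, and the injected clock exists only so the logic can be exercised without waiting.

```python
import time

class HeartbeatMonitor:
    """Tracks ping/pong timing for one connection (hypothetical helper)."""

    def __init__(self, interval: float = 25.0, timeout: float = 10.0,
                 clock=time.monotonic):
        self.interval = interval        # seconds between pings
        self.timeout = timeout          # how long to wait for the pong
        self.clock = clock
        self.last_pong = clock()
        self.ping_sent_at = None        # None = no ping outstanding

    def should_ping(self) -> bool:
        """True when no ping is pending and a full interval has passed."""
        return (self.ping_sent_at is None
                and self.clock() - self.last_pong >= self.interval)

    def on_ping_sent(self) -> None:
        self.ping_sent_at = self.clock()

    def on_pong(self) -> None:
        self.last_pong = self.clock()
        self.ping_sent_at = None

    def is_dead(self) -> bool:
        """True when a ping has gone unanswered for longer than the timeout."""
        return (self.ping_sent_at is not None
                and self.clock() - self.ping_sent_at > self.timeout)

# Drive it with a fake clock to show the state changes
now = [0.0]
hb = HeartbeatMonitor(clock=lambda: now[0])
now[0] = 25.0
print(hb.should_ping())   # True
hb.on_ping_sent()
now[0] = 36.0
print(hb.is_dead())       # True: 11 s without a pong
```

In a server loop you would call should_ping() and is_dead() on a timer for every connection, send the ping frame or close the socket accordingly, and call on_pong() from the frame handler.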

Frequently asked questions

What's the difference between WebSockets and HTTP?

HTTP is request/response: the client asks, the server answers, and the connection is logically done. WebSockets keep one TCP connection open and let either side send a message at any time — full-duplex — with only 2–14 bytes of framing overhead instead of HTTP's hundreds of bytes of headers per message.

How does the WebSocket handshake work?

The client sends an ordinary HTTP/1.1 GET with Upgrade: websocket, Connection: Upgrade, Sec-WebSocket-Version: 13, and a random Sec-WebSocket-Key. The server replies 101 Switching Protocols with Sec-WebSocket-Accept = base64(SHA-1(key + magic GUID)). After that single round trip the same TCP socket switches from HTTP framing to WebSocket framing.

Why must client-to-server WebSocket frames be masked?

RFC 6455 requires every browser-sent frame to XOR its payload with a random 4-byte masking key. This prevents cache-poisoning attacks where a malicious page tricks a browser into sending bytes that an intermediary proxy mistakes for a crafted HTTP request. Server-to-client frames are never masked.

When should I use Server-Sent Events instead of WebSockets?

Use SSE when data only flows server → client (news feeds, stock tickers, notifications). SSE runs over plain HTTP, auto-reconnects with Last-Event-ID, and is simpler to scale. Choose WebSockets when the client also needs to push frequently — chat, multiplayer games, collaborative editors.

How many WebSocket connections can one server hold?

The ceiling is memory and file descriptors, not CPU. The C10K problem set the original 10,000-connection bar, and the later C1M demos showed a single tuned Linux box holding 1,000,000+ idle connections at roughly 4–10 KB of kernel and heap per idle socket. Throughput, not connection count, is usually the real limit.

Do WebSockets work through HTTP proxies and load balancers?

Yes, and most reliably over wss:// (TLS): encryption hides the traffic from older proxies that would otherwise mangle the Upgrade. Load balancers must forward the Upgrade/Connection headers and keep each connection on one backend; layer-7 balancers like NGINX and HAProxy need explicit WebSocket configuration.