Question 1

How does UTF-8 encode a code point?

Accepted Answer

The bit pattern depends on the code point's value. U+0000 to U+007F (ASCII range): one byte, 0xxxxxxx — bit-identical to ASCII. U+0080 to U+07FF: two bytes, 110xxxxx 10xxxxxx — 11 bits of payload. U+0800 to U+FFFF (the rest of the BMP): three bytes, 1110xxxx 10xxxxxx 10xxxxxx — 16 bits of payload. U+10000 to U+10FFFF (supplementary planes): four bytes, 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx — 21 bits of payload. The leading byte's high bits tell you how many bytes follow. The continuation bytes always start with 10 — that's what makes UTF-8 self-synchronizing.
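The four layouts above can be sketched as a small hand-written encoder. This is an illustrative sketch, not a replacement for a language's built-in codec; the function name `utf8_encode` is hypothetical, and the surrogate check (U+D800 to U+DFFF) is an addition taken from RFC 3629 rather than from the answer above.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (illustrative sketch)."""
    # Assumption beyond the text above: RFC 3629 forbids encoding surrogates.
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


print(utf8_encode(0x41).hex())     # 41
print(utf8_encode(0xE9).hex())     # c3a9
print(utf8_encode(0x20AC).hex())   # e282ac
print(utf8_encode(0x1F600).hex())  # f09f9880
```

Each branch shifts the code point right in 6-bit steps and masks with 0x3F, which is how the payload bits end up spread across the continuation bytes.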

Question 2

Why was UTF-8 designed by Pike and Thompson in a diner?

Accepted Answer

In September 1992 Rob Pike and Ken Thompson were working on Plan 9 at Bell Labs and needed a way to handle Unicode without breaking existing ASCII tooling. They sketched UTF-8 on a placemat at Murray Hill's Roselle Cafe (now Roselle's Diner). The design constraints they imposed: ASCII bytes must encode the same in UTF-8 (preserving backward compatibility); no byte in a multi-byte sequence may have its high bit zero (so ASCII tools don't accidentally interpret middle bytes as separators); the encoding must be self-synchronizing (any byte tells you whether you're at a code-point boundary). They wrote a draft that night and, by Pike's account, had a Plan 9 implementation working within days. UTF-8 was later incorporated into ISO/IEC 10646 and the Unicode Standard; the IETF specified it in RFC 2279 and later RFC 3629.

Question 3

What does 'self-synchronizing' mean?

Accepted Answer

From any byte in a UTF-8 stream, you can tell whether it starts a new character or continues a previous one — without looking at preceding bytes. Specifically: if the high bit is 0, it's a single-byte ASCII character. If the byte starts with 11, it's the leading byte of a multi-byte sequence (and the number of leading 1s tells you the length). If it starts with 10, it's a continuation byte. This property makes UTF-8 robust to byte-stream corruption: drop a byte, and the decoder loses one character but resynchronizes at the next non-10 byte. Compare UTF-16, where losing one byte misaligns every subsequent character until the end of the stream.

Question 4

Why is UTF-8 ASCII-compatible?

Accepted Answer

All 128 ASCII characters (U+0000 to U+007F) encode in UTF-8 as a single byte with the high bit zero, exactly the same byte as in ASCII. So any ASCII text is already valid UTF-8; any pure-ASCII file (source code, configuration, CSV with no non-ASCII content) doesn't need conversion. C string functions that treat 0 as a terminator keep working too, and not only for ASCII-only data: the byte 0x00 appears in UTF-8 only as the encoding of U+0000, never inside a multi-byte sequence. The byte values of ASCII characters (e.g., 'a' is 0x61, '/' is 0x2F, newline is 0x0A) are unchanged. This compatibility was the key design goal that let UTF-8 displace older encodings: existing tools didn't have to break.
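A quick way to check these claims is to compare encodings directly. The sketch below is illustrative only; the helper name `high_bit_set_everywhere` and the sample strings are made up for the demonstration.

```python
def high_bit_set_everywhere(ch: str) -> bool:
    """True if every UTF-8 byte of the character has its high bit set."""
    return all(b & 0x80 for b in ch.encode("utf-8"))


# Pure ASCII text produces identical bytes under both encodings.
ascii_text = "key=value/path\n"
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# A byte-oriented tool can split on '/' (0x2F) without decoding first,
# because 0x2F never occurs inside a multi-byte sequence.
path = "résumé/naïve.txt".encode("utf-8")
parts = [p.decode("utf-8") for p in path.split(b"/")]

print(same_bytes)                                          # True
print(parts)                                               # ['résumé', 'naïve.txt']
print(all(high_bit_set_everywhere(c) for c in "éï€😀"))     # True
print(0 in path)                                           # False
```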

Question 5

How many bytes does 'café' take?

Accepted Answer

Five bytes. 'c', 'a', 'f' are ASCII (one byte each = 3 bytes). 'é' (U+00E9, e with acute) is in the U+0080 to U+07FF range, encoded as two bytes: 0xC3 0xA9 (binary 11000011 10101001; the payload bits 00011 101001 decode to 0xE9). Total: 5 bytes. Note that 'é' could alternatively be written as 'e' + combining acute (U+0065 + U+0301), which takes 1 + 2 = 3 bytes for the 'é' alone and makes the whole word 6 bytes. Both forms render identically; Unicode normalization (NFC, NFD) governs which to use. Most modern text uses NFC (precomposed): single code point, two UTF-8 bytes for the é.
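The byte counts for both forms can be verified with Python's standard `unicodedata` module. This is a small sketch; the variable names are arbitrary.

```python
import unicodedata

precomposed = "caf\u00e9"   # 'é' as the single code point U+00E9
decomposed = "cafe\u0301"   # 'e' followed by combining acute U+0301

print(precomposed.encode("utf-8").hex(" "))   # 63 61 66 c3 a9
print(decomposed.encode("utf-8").hex(" "))    # 63 61 66 65 cc 81
print(len(precomposed.encode("utf-8")))       # 5
print(len(decomposed.encode("utf-8")))        # 6

# Normalization converts between the two forms.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", precomposed)
print(nfc == precomposed, nfd == decomposed)  # True True
```

Comparing strings byte for byte without normalizing first would treat the two spellings as different, which is why normalization usually happens before comparison or hashing.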

Question 6

Why is UTF-8 the dominant web encoding?

Accepted Answer

Three reasons compounded. First, ASCII compatibility: legacy English-text websites didn't need to migrate. Second, self-synchronization: web servers and proxies could chunk and slice UTF-8 byte streams without parsing them. Third, standards bodies backed it: IETF policy (RFC 2277, 1998) required new Internet protocols to support UTF-8, and the W3C made it a default encoding for XML 1.0 the same year. By 2008 it overtook ASCII as the most common web encoding; by 2024 it was used on 98.4 percent of the websites tracked by W3Techs. UTF-16 and UTF-32 exist but are confined to internal application data (Java strings, Windows APIs); UTF-8 dominates the wire format and storage.

Code point range	Bytes	Binary layout	Bits of payload
U+0000 – U+007F (ASCII)	1	0xxxxxxx	7
U+0080 – U+07FF	2	110xxxxx 10xxxxxx	11
U+0800 – U+FFFF (rest of BMP)	3	1110xxxx 10xxxxxx 10xxxxxx	16
U+10000 – U+10FFFF (supplementary)	4	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx	21
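Reading the table in the other direction gives a decoder: mask off the prefix bits of the leading byte, then append six payload bits per continuation byte. The sketch below is a minimal illustration; `decode_one` is a hypothetical name, and the checks for overlong forms, surrogates, and values above U+10FFFF are assumptions drawn from RFC 3629 rather than from the table.

```python
from typing import Tuple


def decode_one(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one code point at data[pos]; return (code_point, next_pos)."""
    lead = data[pos]
    if lead < 0x80:                      # 0xxxxxxx
        return lead, pos + 1
    if lead & 0xE0 == 0xC0:              # 110xxxxx
        length, cp, minimum = 2, lead & 0x1F, 0x80
    elif lead & 0xF0 == 0xE0:            # 1110xxxx
        length, cp, minimum = 3, lead & 0x0F, 0x800
    elif lead & 0xF8 == 0xF0:            # 11110xxx
        length, cp, minimum = 4, lead & 0x07, 0x10000
    else:
        raise ValueError(f"invalid leading byte {lead:#04x} at offset {pos}")

    tail = data[pos + 1 : pos + length]
    if len(tail) != length - 1 or any(b & 0xC0 != 0x80 for b in tail):
        raise ValueError(f"truncated or malformed sequence at offset {pos}")
    for b in tail:
        cp = (cp << 6) | (b & 0x3F)      # append 6 payload bits

    # Assumption from RFC 3629: reject overlong forms, surrogates, and
    # values beyond U+10FFFF.
    if cp < minimum or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"invalid code point {cp:#x} at offset {pos}")
    return cp, pos + length


print(decode_one(b"\xc3\xa9"))            # (233, 2)
print(decode_one("€".encode("utf-8")))    # (8364, 3)
```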

Property	UTF-8	UTF-16	UTF-32
Bytes per code point	1, 2, 3, or 4	2 or 4 (surrogate pairs)	Always 4
ASCII-compatible encoding	Yes	No	No
Self-synchronizing	Yes (byte level)	Code-unit level only	Code-unit level only
Endianness	None	BE or LE; BOM or declared byte order	BE or LE; BOM or declared byte order
Size on English text	1 byte/char	2 bytes/char (2× larger)	4 bytes/char (4× larger)
Size on CJK text	3 bytes/char	2 bytes/char (33% smaller)	4 bytes/char
Used by	Web, Linux, modern internet	Java strings, Windows APIs, JS	Internal lookup tables
Standardized	RFC 3629 (2003)	RFC 2781 (2000)	ISO/IEC 10646
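The size rows in the comparison can be reproduced with Python's built-in codecs. This is a small sketch; `encoded_sizes` is a hypothetical helper, and the little-endian codec names are used so that no byte order mark is counted.

```python
def encoded_sizes(text: str) -> dict:
    """Byte length of text under each encoding (BOM excluded)."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


print(encoded_sizes("hello"))   # {'utf-8': 5, 'utf-16-le': 10, 'utf-32-le': 20}
print(encoded_sizes("日本語"))   # {'utf-8': 9, 'utf-16-le': 6, 'utf-32-le': 12}
print(encoded_sizes("😀"))      # {'utf-8': 4, 'utf-16-le': 4, 'utf-32-le': 4}
```

The last line shows a supplementary-plane character: UTF-8 uses four bytes, UTF-16 a surrogate pair of two 16-bit units, and UTF-32 its usual four bytes.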

UTF-8 Encoding

How UTF-8 stores a code point

Four worked examples

Why UTF-8 won — the design constraints

UTF-8 vs UTF-16 vs UTF-32

Parsing UTF-8 — code patterns

Common UTF-8 mistakes

The placemat at the diner

Frequently asked questions