Encoding Standards

UTF-8 Encoding

Variable-length Unicode in 1 to 4 bytes — self-synchronizing, ASCII-compatible, the web's default

UTF-8 encodes every Unicode code point in 1, 2, 3, or 4 bytes. ASCII characters take 1 byte and are bit-identical to their ASCII representation. Latin Extended, Greek, Cyrillic, Hebrew, and Arabic take 2 bytes. Most CJK ideographs and the rest of the Basic Multilingual Plane take 3 bytes. Supplementary characters (emoji, ancient scripts, mathematical alphabets) take 4 bytes. The encoding is self-synchronizing (continuation bytes start with the bits 10), backward-compatible with ASCII tooling, and runs the modern web: over 98 percent of pages use it. It was designed by Ken Thompson and Rob Pike in a New Jersey diner in September 1992.

  • Variable length: 1 to 4 bytes per code point
  • ASCII subset: U+0000–U+007F, 1 byte
  • BMP (most scripts): 2–3 bytes
  • Supplementary (emoji): 4 bytes
  • Self-synchronizing: Yes
  • Designed by: Pike & Thompson, 1992

How UTF-8 stores a code point

Unicode assigns each character a "code point" — a number from 0 to 0x10FFFF (1,114,111). UTF-8 maps that number to a sequence of 1 to 4 bytes, with the byte count determined by the code point's value:

Code point range                    | Bytes | Binary layout                       | Bits of payload
U+0000 – U+007F (ASCII)             | 1     | 0xxxxxxx                            | 7
U+0080 – U+07FF                     | 2     | 110xxxxx 10xxxxxx                   | 11
U+0800 – U+FFFF (rest of BMP)       | 3     | 1110xxxx 10xxxxxx 10xxxxxx          | 16
U+10000 – U+10FFFF (supplementary)  | 4     | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | 21
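
The range table translates almost directly into a lookup. The sketch below is illustrative only: the function name utf8_length is made up for this example, and the surrogate check anticipates a rule covered later in the article. It cross-checks itself against Python's built-in encoder.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for one code point (hypothetical helper)."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode range")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


# Cross-check the table against Python's own encoder.
for cp in (0x41, 0xE9, 0x4E2D, 0x1D11E):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```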

The leading byte tells you how many bytes follow:

  • High bit 0 (0xxxxxxx): single-byte ASCII.
  • 110xxxxx: two-byte sequence starts here, one continuation follows.
  • 1110xxxx: three-byte sequence, two continuations.
  • 11110xxx: four-byte sequence, three continuations.
  • 10xxxxxx: continuation byte — never appears as a leading byte.

This is what makes UTF-8 self-synchronizing. From any byte in the stream you can tell whether you're at a character boundary: if it starts with 0 or 11, you're at the start of a character; if it starts with 10, you're mid-character. Skip forward until you find a non-10 byte and you're resynced.
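
Resynchronizing can be sketched in a few lines. The helper names is_continuation and next_boundary are invented for this example, and the code assumes the input is otherwise valid UTF-8.

```python
def is_continuation(byte: int) -> bool:
    """True for 10xxxxxx bytes, which never start a character."""
    return (byte & 0b1100_0000) == 0b1000_0000


def next_boundary(data: bytes, pos: int) -> int:
    """Return the first index >= pos that sits on a character boundary."""
    while pos < len(data) and is_continuation(data[pos]):
        pos += 1
    return pos


data = "a中b".encode("utf-8")       # bytes: 61 E4 B8 AD 62
start = next_boundary(data, 2)      # index 2 is 0xB8, the middle of '中'
print(start)                        # 4
print(data[start:].decode("utf-8")) # b
```

Landing mid-character costs at most three skipped bytes before decoding can resume.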

Four worked examples

Take four characters representative of the four UTF-8 lengths:

  • 'A' (U+0041) — 1 byte. Decimal 65, binary 01000001. ASCII range, so UTF-8 byte is 01000001 = 0x41. Same as ASCII.
  • 'é' (U+00E9) — 2 bytes. Decimal 233, binary 00011101001 (11 bits, padded). Two-byte form: 110xxxxx 10xxxxxx → 11000011 10101001 → 0xC3 0xA9. Verify: extract payload bits 00011 101001 = 233 = U+00E9. Check.
  • '中' (U+4E2D) — 3 bytes. Decimal 20013, binary 0100111000101101 (16 bits). Three-byte form: 1110xxxx 10xxxxxx 10xxxxxx → 11100100 10111000 10101101 → 0xE4 0xB8 0xAD. Payload bits 0100 111000 101101 = 0x4E2D. Check.
  • '𝄞' (U+1D11E, musical G clef) — 4 bytes. Decimal 119,070, binary 000011101000100011110 (21 bits). Four-byte form: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx → 11110000 10011101 10000100 10011110 → 0xF0 0x9D 0x84 0x9E.
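
The four worked examples can be reproduced with plain shifts and masks. This is a minimal sketch, not a production encoder: encode_utf8 is a name chosen for illustration, and real code would normally call the language's built-in encoder.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point by hand, following the layout table."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point")
    if cp <= 0x7F:
        return bytes([cp])
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


for cp in (0x41, 0xE9, 0x4E2D, 0x1D11E):
    print(f"U+{cp:04X} -> {encode_utf8(cp).hex(' ')}")
# U+0041 -> 41
# U+00E9 -> c3 a9
# U+4E2D -> e4 b8 ad
# U+1D11E -> f0 9d 84 9e
```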

The "café" example: 5 bytes total. c = 1 byte (0x63), a = 1 byte (0x61), f = 1 byte (0x66), é = 2 bytes (0xC3 0xA9). The visual length is 4 characters; the byte length is 5. This is why strlen on a UTF-8 string gives byte count, not character count — and why Python 3 distinguishes len(s) (code points) from len(s.encode('utf-8')) (bytes).
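
Because continuation bytes are recognizable on their own, code points can be counted straight from the bytes without decoding. A small sketch, assuming valid UTF-8 input; count_code_points is a hypothetical helper name:

```python
def count_code_points(data: bytes) -> int:
    """Count code points by counting every byte that is not 10xxxxxx."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


data = "caf\u00e9".encode("utf-8")          # "café" with a precomposed é
print(len(data), count_code_points(data))   # 5 4
```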

Why UTF-8 won — the design constraints

Pike and Thompson designed UTF-8 around a small set of properties no other encoding satisfied simultaneously:

  1. ASCII preserved bit-for-bit. Any ASCII file is already valid UTF-8 without conversion. Legacy English-text software keeps working.
  2. No ASCII byte in multi-byte sequences. Continuation bytes (10xxxxxx) and leading bytes of multi-byte sequences (11xxxxxx) all have the high bit set. ASCII tools that look for specific bytes (newline = 0x0A, slash = 0x2F, null = 0x00) never get false matches inside multi-byte characters.
  3. Self-synchronizing. Drop a byte from the stream; the decoder loses one character but resynchronizes at the next non-10 byte. UTF-16 lacks this at the byte level: drop one byte and every subsequent 16-bit unit misaligns.
  4. Unique encoding per code point. Each code point has exactly one canonical UTF-8 encoding. Multiple encodings of the same character (overlong encodings, like a 3-byte form for an ASCII character) are explicitly forbidden by the standard. This closes a class of security bugs where attackers smuggle dangerous characters through filters.
  5. Lexicographic byte order matches code-point order. Sorting UTF-8 strings byte-by-byte gives the same result as sorting by Unicode code point. UTF-16 lacks this for supplementary characters due to surrogate-pair quirks.
  6. No byte-order mark needed. UTF-8 has no endianness — it's a stream of bytes, not a stream of multi-byte units. UTF-16 needs a BOM (or out-of-band agreement) to specify big- vs little-endian.
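
Properties 4 and 5 are easy to check empirically. The sketch below relies on two behaviors of Python 3: str comparison is by code point, and the built-in UTF-8 decoder is strict. The word list is arbitrary sample data.

```python
words = ["zebra", "\u00e9clair", "中文", "\U0001D11E clef", "apple", "\uFF21 fullwidth"]

by_code_point = sorted(words)  # Python compares str values by code point
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
by_utf16_units = sorted(words, key=lambda w: w.encode("utf-16-be"))

print(by_code_point == by_utf8_bytes)   # True
print(by_code_point == by_utf16_units)  # False: U+1D11E becomes D834 DD1E, which sorts before FF21

# Property 4: the overlong two-byte form of '/' (C0 AF) must be rejected.
try:
    b"\xc0\xaf".decode("utf-8")
    overlong_rejected = False
except UnicodeDecodeError:
    overlong_rejected = True
print(overlong_rejected)                # True
```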

UTF-8 vs UTF-16 vs UTF-32

                      | UTF-8                       | UTF-16                            | UTF-32
Bytes per code point  | 1, 2, 3, or 4               | 2 or 4 (surrogate pairs)          | Always 4
ASCII-compatible      | Yes                         | No                                | No
Self-synchronizing    | Yes, at any byte            | Only at 16-bit unit boundaries    | Only at 32-bit unit boundaries
Endianness            | None                        | BE or LE; BOM needed              | BE or LE; BOM needed
Size on English text  | 1 byte/char                 | 2 bytes/char (2× larger)          | 4 bytes/char (4× larger)
Size on CJK text      | 3 bytes/char                | 2 bytes/char (33% smaller)        | 4 bytes/char
Used by               | Web, Linux, modern internet | Java strings, Windows APIs, JS    | Internal lookup tables
Standardized          | RFC 3629 (2003)             | RFC 2781 (2000)                   | ISO/IEC 10646

UTF-16 has a size advantage for CJK-dominant content but loses on ASCII-heavy text. Windows and Java adopted 16-bit Unicode strings in the 1990s, before UTF-8 was in widespread use; a system designed today would be unlikely to make the same choice. JavaScript strings are sequences of UTF-16 code units (indexing a string returns one 16-bit unit), but JS source files and most JSON I/O are UTF-8. UTF-32 is fixed-width but pays 4 bytes per character even for ASCII; it is used internally for lookup tables and Unicode property queries but rarely as a wire format.
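
The size rows of the table can be measured directly. In this sketch the two sample strings are arbitrary, encoded_sizes is a helper invented for the example, and the little-endian codec names are used so that Python does not prepend a BOM to the UTF-16 and UTF-32 output.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded byte length of text in each Unicode encoding form."""
    return {enc: len(text.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}


print(encoded_sizes("Hello, world"))  # {'utf-8': 12, 'utf-16-le': 24, 'utf-32-le': 48}
print(encoded_sizes("你好，世界"))      # {'utf-8': 15, 'utf-16-le': 10, 'utf-32-le': 20}
```

Real documents mix scripts with ASCII markup, digits, and whitespace, so the UTF-16 advantage on CJK text is usually smaller in practice than the pure-text figure suggests.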

Parsing UTF-8 — code patterns

// JavaScript: strings are UTF-16 in memory, UTF-8 on the wire
const s = "café 中 🎵";
console.log(s.length);                    // 9 (UTF-16 code units; 🎵 takes two)
console.log([...s].length);               // 8 (code points)
console.log(new TextEncoder().encode(s).length); // 14 (UTF-8 bytes)

// Iterate code points correctly
for (const ch of s) console.log(ch);      // 'c', 'a', 'f', 'é', ' ', '中', ' ', '🎵'

// Encode/decode explicitly
const bytes = new TextEncoder().encode(s);          // Uint8Array of UTF-8 bytes
const back = new TextDecoder('utf-8').decode(bytes); // round-trip

# Python 3: strings are sequences of code points (str type)
s = "café 中 🎵"
print(len(s))                  # 8 (code points)
print(len(s.encode('utf-8')))  # 14 (UTF-8 bytes)

# Encode to bytes
bytes_data = s.encode('utf-8')   # b'caf\xc3\xa9 \xe4\xb8\xad \xf0\x9f\x8e\xb5'

# Decode back
recovered = bytes_data.decode('utf-8')  # 'café 中 🎵'

// Go: strings are UTF-8 bytes natively
package main

import (
    "fmt"
    "unicode/utf8"
)

func main() {
    s := "café 中 🎵"
    fmt.Println(len(s))                    // 14 (UTF-8 bytes)
    fmt.Println(utf8.RuneCountInString(s)) // 8 (code points)

    // Iterate runes (code points)
    for _, r := range s {
        fmt.Printf("%c ", r)
    }
}

Common UTF-8 mistakes

  • Confusing bytes and characters. strlen("café") in C returns 5, not 4, because it counts bytes. Most string libraries today have explicit "byte length" and "code point length" APIs; use byte length for buffers and storage limits, and code-point or grapheme counts for anything shown to users.
  • Truncating mid-character. A 100-byte buffer holding UTF-8 may end in the middle of a multi-byte sequence. Always truncate at a code-point boundary; some libraries provide a "scrub" function that cleans up a dangling partial sequence.
  • Assuming 1 character = 1 visual glyph. Combining characters (é = e + combining acute), emoji modifiers (skin tones), and zero-width joiners can make one visual glyph span multiple code points. Use a grapheme-cluster library for visual length.
  • BOM in UTF-8. The byte sequence 0xEF 0xBB 0xBF is the UTF-8 "BOM" — historically a marker but unnecessary since UTF-8 has no endianness. Some Windows tools add it; most Unix tools treat it as content. Avoid emitting it.
  • Mixing encodings. Decoding bytes with the wrong encoding produces "mojibake": UTF-8 text read as Latin-1 (ISO 8859-1) turns é into Ã©, while Latin-1 text read as UTF-8 yields decode errors or replacement characters. Always specify the encoding explicitly; never rely on a system default.
  • Surrogate code points in UTF-8. Unicode reserves U+D800–U+DFFF for UTF-16 surrogate pairs. UTF-8 explicitly forbids encoding these as standalone code points. Some lenient decoders accept "CESU-8" or "WTF-8" (a hack to round-trip invalid UTF-16); strict UTF-8 rejects them.
  • Indexing strings by byte offset. s[5] in a UTF-8 byte stream might be a continuation byte, not a character start. Use code-point or grapheme-cluster indexing for user-facing string operations.
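
As one illustration of the truncation rule above, here is a sketch of cutting a byte string at a code-point boundary. It assumes the input is valid UTF-8 and that max_bytes is non-negative; truncate_utf8 is a hypothetical helper, not a standard-library function.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut data to at most max_bytes without splitting a multi-byte sequence."""
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # data[cut] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut falls mid-character, so back up
    # until the dropped byte is the start of a character.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


word = "caf\u00e9".encode("utf-8")   # 63 61 66 C3 A9
print(truncate_utf8(word, 4))        # b'caf'
print(truncate_utf8(word, 5))        # b'caf\xc3\xa9'
```

This respects code-point boundaries only. Keeping a base letter together with its combining marks or emoji modifiers needs grapheme-cluster segmentation on top.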

The placemat at the diner

In September 1992 an X/Open committee was reviewing FSS-UTF, a proposed replacement for UTF-1. UTF-1 kept ASCII characters as single bytes but let ASCII byte values such as 0x2F (slash) appear inside multi-byte sequences, which broke file-system code. The FSS-UTF draft fixed that but was not self-synchronizing. Asked to review the draft, Pike and Thompson, working on Plan 9 at Bell Labs, drove to dinner at the Roselle Diner in Murray Hill, New Jersey. Pike laid out the design constraints on a placemat. Within hours they had sketched what became UTF-8.

The next day Thompson wrote a Plan 9 implementation. Within weeks UTF-8 was running in production on Plan 9 systems. Pike and Thompson presented it publicly at the USENIX conference in January 1993, and the IETF later specified it in RFC 2044 (1996), revised as RFC 2279 (1998) and the current RFC 3629 (2003). ISO standardized it as Annex R of ISO 10646. The W3C made it the default for XML in 1998 and for HTML in HTML5 (2008). By 2024, 98.4% of web pages used UTF-8, making it by far the most widely deployed character encoding on the web.

Frequently asked questions

How does UTF-8 encode a code point?

The bit pattern depends on the code point's value. U+0000 to U+007F (ASCII range): one byte, 0xxxxxxx — bit-identical to ASCII. U+0080 to U+07FF: two bytes, 110xxxxx 10xxxxxx — 11 bits of payload. U+0800 to U+FFFF: three bytes, 1110xxxx 10xxxxxx 10xxxxxx. U+10000 to U+10FFFF: four bytes, 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. The continuation bytes always start with 10 — that's what makes UTF-8 self-synchronizing.
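
Read in the other direction, the same bit patterns give a decoder. The sketch below handles one sequence at a time; decode_one is an illustrative name, and the function deliberately skips the checks a strict decoder needs (overlong forms, surrogates, values above U+10FFFF).

```python
def decode_one(data: bytes, pos: int = 0) -> tuple:
    """Decode one code point starting at pos; return (code_point, bytes_used)."""
    first = data[pos]
    if first < 0x80:                  # 0xxxxxxx
        return first, 1
    if first >> 5 == 0b110:           # 110xxxxx
        length, cp = 2, first & 0x1F
    elif first >> 4 == 0b1110:        # 1110xxxx
        length, cp = 3, first & 0x0F
    elif first >> 3 == 0b11110:       # 11110xxx
        length, cp = 4, first & 0x07
    else:
        raise ValueError("not a leading byte")
    tail = data[pos + 1 : pos + length]
    if len(tail) != length - 1:
        raise ValueError("truncated sequence")
    for byte in tail:
        if (byte & 0xC0) != 0x80:
            raise ValueError("expected a continuation byte")
        cp = (cp << 6) | (byte & 0x3F)  # append 6 payload bits
    return cp, length


print(decode_one(b"\xc3\xa9"))          # (233, 2)
print(decode_one(b"\xf0\x9d\x84\x9e"))  # (119070, 4)
```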

Why was UTF-8 designed by Pike and Thompson in a diner?

In September 1992 Rob Pike and Ken Thompson were working on Plan 9 at Bell Labs and needed a way to handle Unicode without breaking existing ASCII tooling. They sketched UTF-8 on a placemat at the Roselle Diner in Murray Hill, New Jersey. The design constraints were that ASCII bytes must encode the same in UTF-8 (preserving backward compatibility), that no byte in a multi-byte sequence may have its high bit zero, and that the encoding must be self-synchronizing.

What does 'self-synchronizing' mean?

From any byte in a UTF-8 stream, you can tell whether it starts a new character or continues a previous one — without looking at preceding bytes. If the high bit is 0, it's a single-byte ASCII character. If the byte starts with 11, it's the leading byte of a multi-byte sequence. If it starts with 10, it's a continuation byte. This property makes UTF-8 robust to byte-stream corruption.

Why is UTF-8 ASCII-compatible?

All 128 ASCII characters (U+0000 to U+007F) encode in UTF-8 as a single byte with the high bit zero, exactly the same byte as in ASCII. So any ASCII text is already valid UTF-8, and a pure-ASCII file needs no conversion. C string functions that treat 0 as a terminator keep working on UTF-8 text, because the byte 0x00 appears only as the encoding of U+0000. This compatibility was the key design goal that let UTF-8 displace older encodings.
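
A quick way to see this compatibility in practice: byte-oriented code that only knows about ASCII delimiters still works on UTF-8 data. The path below is made-up sample data.

```python
# Every ASCII byte decodes to the same character under both codecs.
ascii_bytes = bytes(range(128))
print(ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8"))  # True

# Splitting on the ASCII slash byte never cuts a multi-byte character,
# because every byte of a multi-byte sequence is 0x80 or higher.
path = "/home/jos\u00e9/文档/notes.txt".encode("utf-8")
parts = [p.decode("utf-8") for p in path.split(b"/")]
print(parts)  # ['', 'home', 'josé', '文档', 'notes.txt']
```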

How many bytes does 'café' take?

Five bytes. 'c', 'a', 'f' are ASCII (one byte each = 3 bytes). 'é' (U+00E9) is in the U+0080 to U+07FF range, encoded as two bytes: 0xC3 0xA9. Total: 5 bytes. Note that 'é' could alternatively be written as 'e' + combining acute (U+0065 + U+0301), which takes 1 + 2 = 3 bytes for that character and makes the whole word 6 bytes. Both forms render identically; Unicode normalization (NFC, NFD) governs which one is used.
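
The two spellings of 'é' can be compared with Python's standard unicodedata module. A short sketch; the escapes make explicit which form each string uses.

```python
import unicodedata

composed = "caf\u00e9"      # é as one code point (NFC form)
decomposed = "cafe\u0301"   # e followed by a combining acute accent (NFD form)

print(len(composed), len(composed.encode("utf-8")))      # 4 5
print(len(decomposed), len(decomposed.encode("utf-8")))  # 5 6
print(composed == decomposed)                            # False

# Normalizing both sides to the same form makes them compare equal.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```

Normalizing text to one form before comparing or hashing it avoids treating these two spellings as different strings.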

Why is UTF-8 the dominant web encoding?

Three reasons compounded. First, ASCII compatibility: legacy English-text websites didn't need to migrate. Second, self-synchronization: web servers and proxies could chunk and slice UTF-8 byte streams without parsing them. Third, standards bodies converged on it: IETF policy (RFC 2277, 1998) requires Internet protocols to support UTF-8, and the W3C made it the default for XML in 1998 and later for HTML5. By 2024 it was used on 98.4 percent of all web pages tracked by W3Techs.