Published 26 December 2024 by Ray Morgan. Updated 3 January 2025.

UTF-8

First things first: For web-facing projects, unless you have a specific, compelling reason to opt for UTF-16 or anything else, standardize your whole system on UTF-8.

UTF-8 (Unicode Transformation Format 8-bit) is the most commonly used encoding for Unicode. It is backward compatible with ASCII, meaning any valid ASCII text is also valid UTF-8 text. This keeps ASCII-heavy text compact and ensures compatibility with a wide range of systems and applications, including legacy systems that use ASCII.
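As a quick illustration of that backward compatibility, the following Python sketch encodes an ASCII-only string both ways and compares the bytes. The sample string is arbitrary.

```python
text = "Hello, world!"  # arbitrary ASCII-only sample

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")

# For pure ASCII input, both encodings produce exactly the same bytes.
same_bytes = ascii_bytes == utf8_bytes

# Bytes written by an ASCII-only system can be decoded as UTF-8 unchanged.
round_trip = ascii_bytes.decode("utf-8")

print(same_bytes, round_trip)
```

The reverse does not hold: UTF-8 text that contains any non-ASCII character is not valid ASCII.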

Unicode and UTF-8 are often conflated because UTF-8 is the dominant encoding used to implement Unicode on the web, in files, and in software. When people say "a string is in Unicode," they usually mean "it's in a UTF-8 encoded representation of Unicode."
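The difference is easier to see in a language that keeps text and bytes apart. In the Python sketch below, a `str` holds Unicode code points, and `encode` produces one particular byte representation of them. The euro sign is just an arbitrary sample character.

```python
ch = "\u20ac"  # EURO SIGN, an arbitrary sample character

code_point = ord(ch)            # the Unicode code point, independent of any encoding
utf8 = ch.encode("utf-8")       # one byte representation of that code point
utf16 = ch.encode("utf-16-be")  # a different representation of the same code point

print(f"U+{code_point:04X}")    # U+20AC
print(utf8.hex(" ").upper())    # E2 82 AC
print(utf16.hex(" ").upper())   # 20 AC
```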

Overview

Web Standard: UTF-8 is widely adopted and has become the preferred encoding across the web. It is used for everything from HTML to databases to source code. A cross-platform application or API that exchanges data with other systems and services using JSON or XML will normally use UTF-8 encoding.

Efficiency: UTF-8 is efficient for texts that contain mainly ASCII characters (e.g., English letters, numbers, and basic punctuation), as these only require one byte each. This makes UTF-8 particularly suitable for text consisting primarily of English and other languages using the Latin alphabet. It is also efficient for HTML and most other source code.

A multilingual website that includes content in English, Spanish, and Chinese would benefit from UTF-8, as it efficiently handles the mixture of ASCII and non-ASCII characters.
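To get a rough feel for the size trade-off, the sketch below compares UTF-8 and UTF-16 byte counts for three short sample phrases. The phrases are arbitrary, and real pages also contain ASCII markup, which shifts the balance further toward UTF-8.

```python
# Arbitrary sample phrases, written with escapes so the exact code points are unambiguous.
samples = {
    "English": "Hello, world",
    "Spanish": "\u00bfC\u00f3mo est\u00e1s?",           # "¿Cómo estás?"
    "Chinese": "\u4f60\u597d\uff0c\u4e16\u754c",        # "你好，世界"
}

def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count without BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

sizes = {lang: encoded_sizes(text) for lang, text in samples.items()}
for lang, (utf8_len, utf16_len) in sizes.items():
    print(f"{lang}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")
```

In this sample the ASCII and Latin-script phrases are smaller in UTF-8, while the pure Chinese phrase is smaller in UTF-16, since each of its characters takes three bytes in UTF-8 and two in UTF-16.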

Variable Encoding Length: ASCII characters are encoded in one byte, while other characters (e.g., accented letters, non-Latin scripts, symbols, and emojis) require between two and four bytes.
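A minimal Python check of those lengths, using one sample character from each size class:

```python
# One sample character per encoded length.
chars = [
    "A",           # U+0041, ASCII
    "\u00e9",      # U+00E9, é
    "\u4e2d",      # U+4E2D, 中
    "\U0001F600",  # U+1F600, 😀
]

byte_counts = [len(ch.encode("utf-8")) for ch in chars]
print(byte_counts)
```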

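Complex Parsing: Parsing and handling UTF-8 can be more complex due to its variable-length nature. The byte offset of the nth character cannot be computed directly, and code that slices, truncates, or counts by bytes can split a multi-byte character in the middle.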
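One common way this shows up is when data arrives in chunks whose boundaries fall inside a character. The sketch below, using Python's standard `codecs` module, shows a naive decode failing on such a chunk and an incremental decoder handling it. The sample text and the split position are arbitrary.

```python
import codecs

data = "na\u00efve caf\u00e9".encode("utf-8")  # "naïve café"

# A byte-based split can cut a multi-byte character in half:
# the first 3 bytes are 'n', 'a', and only the lead byte of 'ï'.
first, rest = data[:3], data[3:]

try:
    first.decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

# An incremental decoder holds back the incomplete tail until more bytes arrive.
decoder = codecs.getincrementaldecoder("utf-8")()
parts = [decoder.decode(first), decoder.decode(rest, final=True)]

print(naive_ok, parts)
```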

Byte Order: In UTF-8, the bytes of multi-byte characters are always in the same order, regardless of the platform, making UTF-8 byte-order independent. No byte order mark (BOM) is needed, and in most cases it is omitted. (In fact, a BOM at the start of a UTF-8 file can cause interpretation problems in software that does not expect it.)
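A short sketch of how a stray BOM can leak into data, and one way to tolerate it in Python. `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library. The CSV-like payload is a made-up example.

```python
import codecs

payload = "id,name\n"  # hypothetical file content
with_bom = codecs.BOM_UTF8 + payload.encode("utf-8")  # bytes EF BB BF, then the text

# Decoding as plain UTF-8 keeps the BOM as an invisible U+FEFF character,
# so the first field name no longer compares equal to "id".
plain = with_bom.decode("utf-8")
first_field = plain.split(",")[0]

# The "utf-8-sig" codec strips a leading BOM if one is present.
stripped = with_bom.decode("utf-8-sig")

print(repr(first_field), repr(stripped))
```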

How UTF-8 is Encoded

UTF-8 is a variable-length encoding that uses 1 to 4 bytes to represent Unicode code points. The structure of each byte in a sequence indicates whether it is a single-byte character or part of a multi-byte sequence; the leading bits in each byte signal the presence of multi-byte sequences and distinguish between 2-, 3-, and 4-byte characters:

  • 0xxxxxxx — single-byte characters (identical to ASCII)
  • 110xxxxx — the first byte in a two-byte sequence
  • 1110xxxx — the first byte in a three-byte sequence
  • 11110xxx — the first byte in a four-byte sequence
  • 10xxxxxx — continuation byte in a multi-byte sequence
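The bit patterns above translate directly into mask-and-compare checks. The following is a minimal sketch. `classify_byte` is a hypothetical helper name, and it looks only at the leading bits, so it does not reject bytes such as 0xC0 or 0xF5 that match a pattern but can never occur in well-formed UTF-8.

```python
def classify_byte(b: int) -> str:
    """Classify one byte (0-255) by its leading bits only."""
    if (b & 0b1000_0000) == 0b0000_0000:
        return "single"        # 0xxxxxxx
    if (b & 0b1100_0000) == 0b1000_0000:
        return "continuation"  # 10xxxxxx
    if (b & 0b1110_0000) == 0b1100_0000:
        return "lead-2"        # 110xxxxx
    if (b & 0b1111_0000) == 0b1110_0000:
        return "lead-3"        # 1110xxxx
    if (b & 0b1111_1000) == 0b1111_0000:
        return "lead-4"        # 11110xxx
    return "invalid"           # 11111xxx never appears in UTF-8

kinds = [classify_byte(b) for b in "\u00e9".encode("utf-8")]  # é is C3 A9
print(kinds)
```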

Single-Byte Characters (1 Byte):

Single-byte characters represent ASCII characters directly.

Range: U+0000 to U+007F (0 to 127 in decimal)

Binary bits: 0xxxxxxx (the leading bit is always 0)


2-Byte Characters:

In every multi-byte character, the initial byte is followed by one or more continuation bytes, which have a distinct structure: 10xxxxxx. The leading bits 10 indicate that the byte is not the start of a new character but a continuation of the current one.

Range: U+0080 to U+07FF

Binary bits: 110xxxxx 10xxxxxx
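As a worked example of the two-byte layout, the sketch below packs the 11 payload bits of U+00E9 (é) into the pattern by hand and compares the result with Python's built-in encoder.

```python
cp = 0x00E9  # é, binary 000 1110 1001 (11 payload bits)

# First byte: marker 110 followed by the top 5 payload bits.
byte1 = 0b1100_0000 | (cp >> 6)
# Second byte: marker 10 followed by the low 6 payload bits.
byte2 = 0b1000_0000 | (cp & 0b0011_1111)

manual = bytes([byte1, byte2])
print(manual.hex(" ").upper())  # C3 A9
```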

3-Byte Characters:

Range: U+0800 to U+FFFF (excluding the surrogate range U+D800 to U+DFFF, which is reserved for UTF-16 and is not valid in UTF-8)

Binary bits: 1110xxxx 10xxxxxx 10xxxxxx
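The surrogate exclusion can be observed in practice. The sketch below shows Python's strict UTF-8 codec refusing a lone surrogate in both directions. Other languages and libraries may behave differently.

```python
lone_surrogate = chr(0xD800)  # a surrogate code point, not a real character

# Encoding a surrogate to UTF-8 is rejected.
try:
    lone_surrogate.encode("utf-8")
    encode_rejected = False
except UnicodeEncodeError:
    encode_rejected = True

# ED A0 80 is what the three-byte pattern would yield for U+D800;
# a strict decoder rejects it as ill-formed.
try:
    b"\xed\xa0\x80".decode("utf-8")
    decode_rejected = False
except UnicodeDecodeError:
    decode_rejected = True

print(encode_rejected, decode_rejected)
```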

4-Byte Characters:

Range: U+10000 to U+10FFFF

Binary bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
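Putting the four layouts together, a code point can be encoded with a few shifts and masks. This is a minimal teaching sketch, not a replacement for a library encoder. `encode_code_point` is a hypothetical name.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 using the bit layouts above."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp <= 0x7F:        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Compare against the built-in encoder for one code point per length.
for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):
    assert encode_code_point(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encode_code_point(cp).hex(' ').upper()}")
```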

UTF-8 Encoding Examples

1-Byte Characters (ASCII Range, U+0000 to U+007F)

Char  Code Point  UTF-8 (Hex)  UTF-8 (Binary)  UTF-8 (Decimal)
A     U+0041      41           01000001        65
B     U+0042      42           01000010        66
a     U+0061      61           01100001        97
1     U+0031      31           00110001        49
!     U+0021      21           00100001        33

2-Byte Characters (U+0080 to U+07FF)

Char  Code Point  UTF-8 (Hex)  UTF-8 (Binary)     UTF-8 (Decimal)
é     U+00E9      C3 A9        11000011 10101001  195 169
ç     U+00E7      C3 A7        11000011 10100111  195 167
ö     U+00F6      C3 B6        11000011 10110110  195 182
ğ     U+011F      C4 9F        11000100 10011111  196 159
ə     U+0259      C9 99        11001001 10011001  201 153

3-Byte Characters (U+0800 to U+FFFF)

Char  Code Point  UTF-8 (Hex)  UTF-8 (Binary)              UTF-8 (Decimal)
अ     U+0905      E0 A4 85     11100000 10100100 10000101  224 164 133
♥     U+2665      E2 99 A5     11100010 10011001 10100101  226 153 165
中    U+4E2D      E4 B8 AD     11100100 10111000 10101101  228 184 173

4-Byte Characters (U+10000 to U+10FFFF)

Char  Code Point  UTF-8 (Hex)  UTF-8 (Binary)                       UTF-8 (Decimal)
😀    U+1F600     F0 9F 98 80  11110000 10011111 10011000 10000000  240 159 152 128
𝄞     U+1D11E     F0 9D 84 9E  11110000 10011101 10000100 10011110  240 157 132 158
🧡    U+1F9E1     F0 9F A7 A1  11110000 10011111 10100111 10100001  240 159 167 161
𤭢    U+24B62     F0 A4 AD A2  11110000 10100100 10101101 10100010  240 164 173 162
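Rows like these can be reproduced for any code point with a few lines of Python. `utf8_views` is a hypothetical helper name, and `bytes.hex` with a separator argument assumes Python 3.8 or later.

```python
def utf8_views(cp: int) -> tuple[str, str, str]:
    """Return the UTF-8 bytes of a code point as (hex, binary, decimal) strings."""
    raw = chr(cp).encode("utf-8")
    return (
        raw.hex(" ").upper(),
        " ".join(f"{b:08b}" for b in raw),
        " ".join(str(b) for b in raw),
    )

for cp in (0x0041, 0x00E9, 0x4E2D, 0x1F600):
    hex_view, bin_view, dec_view = utf8_views(cp)
    print(f"U+{cp:04X}  {hex_view}  {bin_view}  {dec_view}")
```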


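Because lead bytes and continuation bytes have distinct bit patterns, UTF-8 is self-synchronizing: if you start reading in the middle of a sequence, you can skip any continuation bytes (10xxxxxx) until you reach the next lead or single byte, and from there determine the boundaries of each character.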
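A small sketch of resynchronizing from an arbitrary byte offset. `next_boundary` is a hypothetical helper, and the sample string is arbitrary. It assumes the input is otherwise well-formed UTF-8.

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the index of the first byte at or after pos that starts a character."""
    # Continuation bytes all match 10xxxxxx; skip them.
    while pos < len(data) and (data[pos] & 0b1100_0000) == 0b1000_0000:
        pos += 1
    return pos

# Layout: 'a' (1 byte), U+4E2D (3 bytes), U+1F600 (4 bytes), 'b' (1 byte)
data = "a\u4e2d\U0001F600b".encode("utf-8")

start = next_boundary(data, 2)  # offset 2 is in the middle of the 3-byte character
tail = data[start:].decode("utf-8")
print(start, tail)
```

Starting at offset 2 lands inside the three-byte character, so the helper skips forward to offset 4, where the next character begins, and decoding succeeds from there.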

By following these patterns, UTF-8 efficiently encodes all Unicode characters while remaining backward-compatible with ASCII.