UTF-8
First things first: For web-facing projects, unless you have a specific, compelling reason to opt for UTF-16 or anything else, standardize your whole system on UTF-8.
UTF-8 (Unicode Transformation Format 8-bit) is the most commonly used encoding for Unicode. It is backward compatible with ASCII, meaning any valid ASCII text is also valid UTF-8 text. This keeps ASCII-heavy text compact and ensures compatibility with a wide range of systems and applications, including legacy systems that use ASCII.
Unicode and UTF-8 are often conflated because UTF-8 is the dominant encoding used to implement Unicode on the web, in files, and in software. When people say "a string is in Unicode," they usually mean "it's in a UTF-8 encoded representation of Unicode."
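The difference between Unicode code points and their UTF-8 bytes can be seen in a few lines of Python. This is a minimal sketch; the sample string is an arbitrary choice for illustration.

```python
# A Python str holds Unicode code points; bytes only appear once it is encoded.
text = "h\u00e9llo"  # "héllo", an arbitrary sample string

code_points = [f"U+{ord(ch):04X}" for ch in text]
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

print(code_points)       # the abstract code points, independent of any encoding
print(len(text))         # 5 code points
print(len(utf8_bytes))   # 6 bytes: the accented letter takes two bytes in UTF-8
print(len(utf16_bytes))  # 10 bytes: two bytes per character in UTF-16 here

# Pure ASCII text produces identical bytes under ASCII and UTF-8.
assert "hello".encode("ascii") == "hello".encode("utf-8")
```

The same code points yield different byte sequences depending on the encoding, which is why "Unicode" and "UTF-8" are not interchangeable terms.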
Overview
Web Standard: UTF-8 is widely adopted and has become the preferred encoding across the web. It is used for everything from HTML to databases to source code. A cross-platform application or API that exchanges data with other systems and services using JSON or XML will normally use UTF-8 encoding.
Efficiency: UTF-8 is efficient for texts that contain mainly ASCII characters (e.g., English letters, numbers, and basic punctuation), as these only require one byte each. This makes UTF-8 particularly suitable for text consisting primarily of English and other languages using the Latin alphabet. It is also efficient for HTML and most other source code.
A multilingual website that includes content in English, Spanish, and Chinese would benefit from UTF-8, as it efficiently handles the mixture of ASCII and non-ASCII characters.
Variable Encoding Length: ASCII characters are encoded in one byte, while other characters (e.g., accented letters, non-Latin scripts, symbols, and emojis) require between two and four bytes.
Complex parsing: Parsing and handling UTF-8 can be more complex due to its variable-length nature, especially when dealing with multi-byte characters.
Byte Order: In UTF-8, the bytes of a multi-byte character always appear in the same order regardless of platform, so UTF-8 is byte-order independent. No byte order mark (BOM) is needed, and in most cases it is omitted. (In fact, a BOM at the start of a UTF-8 file can cause problems for software that does not expect it.)
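As an illustration of the BOM issue, the sketch below shows how Python's standard codecs treat a UTF-8 BOM. The sample text is an assumption chosen for the example; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the Python standard library.

```python
import codecs

payload = "caf\u00e9".encode("utf-8")   # UTF-8 bytes without a BOM
with_bom = codecs.BOM_UTF8 + payload     # same text preceded by EF BB BF

# The plain "utf-8" codec keeps the BOM as a U+FEFF character at the start.
decoded_plain = with_bom.decode("utf-8")

# The "utf-8-sig" codec strips a leading BOM if one is present...
decoded_sig = with_bom.decode("utf-8-sig")

# ...and also accepts input that has no BOM at all.
decoded_no_bom = payload.decode("utf-8-sig")

print(repr(decoded_plain))
print(repr(decoded_sig))
print(repr(decoded_no_bom))
```

Reading with `utf-8-sig` is one way to tolerate files from tools that add a BOM, while still writing plain `utf-8` on output.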
How UTF-8 is Encoded
UTF-8 is a variable-length encoding that uses 1 to 4 bytes to represent Unicode code points. The structure of each byte in a sequence indicates whether it is a single-byte character or part of a multi-byte sequence; the leading bits in each byte signal the presence of multi-byte sequences and distinguish between 2-, 3-, and 4-byte characters:
0xxxxxxx: single-byte characters (identical to ASCII)
110xxxxx: the first byte in a two-byte sequence
1110xxxx: the first byte in a three-byte sequence
11110xxx: the first byte in a four-byte sequence
10xxxxxx: continuation byte in a multi-byte sequence
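The leading-bit patterns above can be checked with simple bit masks. The function name `classify_byte` and its return labels are hypothetical, chosen for this sketch. It inspects only the leading bits and does not reject bytes such as 0xC0 or 0xF5, which match a pattern but never occur in valid UTF-8.

```python
def classify_byte(b: int) -> str:
    """Return the role of one byte (0-255) in a UTF-8 stream, by leading bits only."""
    if (b & 0b1000_0000) == 0b0000_0000:
        return "single"        # 0xxxxxxx
    if (b & 0b1100_0000) == 0b1000_0000:
        return "continuation"  # 10xxxxxx
    if (b & 0b1110_0000) == 0b1100_0000:
        return "lead-2"        # 110xxxxx
    if (b & 0b1111_0000) == 0b1110_0000:
        return "lead-3"        # 1110xxxx
    if (b & 0b1111_1000) == 0b1111_0000:
        return "lead-4"        # 11110xxx
    return "invalid"           # 11111xxx matches none of the patterns


# Classify every byte of a two-byte character (U+00E9, encoded as C3 A9).
roles = [classify_byte(b) for b in "\u00e9".encode("utf-8")]
print(roles)  # ['lead-2', 'continuation']
```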
Single-Byte Characters (1 Byte):
Single-byte characters represent ASCII characters directly.
Range: 0x00 to 0x7F (0 to 127 in decimal)
Binary bits: 0xxxxxxx (the leading bit is always 0)
2-Byte Characters:
In multi-byte characters, the initial byte is followed by continuation bytes, which have a distinct structure: 10xxxxxx. These leading bits indicate that the byte is not the start of a new character but a continuation of the current one.
Range: U+0080 to U+07FF
Binary bits: 110xxxxx 10xxxxxx
3-Byte Characters:
Range: U+0800 to U+FFFF (excluding surrogates from U+D800 to U+DFFF)
Binary bits: 1110xxxx 10xxxxxx 10xxxxxx
4-Byte Characters:
Range: U+10000 to U+10FFFF
Binary bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
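The four ranges and bit layouts above can be turned into a small hand-written encoder. This is a sketch for illustration, not a replacement for the built-in `str.encode`; the function name `encode_code_point` is hypothetical.

```python
def encode_code_point(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 using the bit layouts above."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates cannot be encoded in UTF-8")
    if cp <= 0x7F:       # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


# Cross-check against Python's built-in encoder for one character per length.
for ch in "A\u00e9\u4e2d\U0001F600":
    assert encode_code_point(ord(ch)) == ch.encode("utf-8")
```

Each branch shifts the code point right in 6-bit steps, so every continuation byte carries exactly six payload bits under the `10` prefix.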
UTF-8 Encoding Examples
1-Byte Characters (ASCII Range, U+0000 to U+007F)

Char | Code Point | UTF-8 (Hex) | UTF-8 (Binary) | UTF-8 (Decimal) |
---|---|---|---|---|
A | U+0041 | 41 | 01000001 | 65 |
B | U+0042 | 42 | 01000010 | 66 |
a | U+0061 | 61 | 01100001 | 97 |
1 | U+0031 | 31 | 00110001 | 49 |
! | U+0021 | 21 | 00100001 | 33 |
2-Byte Characters (U+0080 to U+07FF)

Char | Code Point | UTF-8 (Hex) | UTF-8 (Binary) | UTF-8 (Decimal) |
---|---|---|---|---|
é | U+00E9 | C3 A9 | 11000011 10101001 | 195 169 |
ç | U+00E7 | C3 A7 | 11000011 10100111 | 195 167 |
ö | U+00F6 | C3 B6 | 11000011 10110110 | 195 182 |
ğ | U+011F | C4 9F | 11000100 10011111 | 196 159 |
ə | U+0259 | C9 99 | 11001001 10011001 | 201 153 |
3-Byte Characters (U+0800 to U+FFFF)

Char | Code Point | UTF-8 (Hex) | UTF-8 (Binary) | UTF-8 (Decimal) |
---|---|---|---|---|
अ | U+0905 | E0 A4 85 | 11100000 10100100 10000101 | 224 164 133 |
♥ | U+2665 | E2 99 A5 | 11100010 10011001 10100101 | 226 153 165 |
中 | U+4E2D | E4 B8 AD | 11100100 10111000 10101101 | 228 184 173 |
4-Byte Characters (U+10000 to U+10FFFF)

Char | Code Point | UTF-8 (Hex) | UTF-8 (Binary) | UTF-8 (Decimal) |
---|---|---|---|---|
😀 | U+1F600 | F0 9F 98 80 | 11110000 10011111 10011000 10000000 | 240 159 152 128 |
𝄞 | U+1D11E | F0 9D 84 9E | 11110000 10011101 10000100 10011110 | 240 157 132 158 |
🧡 | U+1F9E1 | F0 9F A7 A1 | 11110000 10011111 10100111 10100001 | 240 159 167 161 |
𤭢 | U+24B62 | F0 A4 AD A2 | 11110000 10100100 10101101 10100010 | 240 164 173 162 |
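Rows like those in the tables above can be regenerated for any character with Python's built-in encoder. The helper name `utf8_row` is hypothetical, and the column order simply mirrors the tables.

```python
def utf8_row(ch: str) -> tuple:
    """Return (char, code point, hex bytes, binary bytes, decimal bytes) for one character."""
    encoded = ch.encode("utf-8")
    return (
        ch,
        f"U+{ord(ch):04X}",
        " ".join(f"{b:02X}" for b in encoded),
        " ".join(f"{b:08b}" for b in encoded),
        " ".join(str(b) for b in encoded),
    )


# Print one row per byte length, in the same pipe-separated layout as the tables.
for ch in "A\u00e9\u4e2d\U0001F600":
    print(" | ".join(utf8_row(ch)))
```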
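This design ensures that UTF-8 is self-synchronizing: continuation bytes always start with 10 and no other byte does, so a reader that lands in the middle of a sequence can skip forward to the next non-continuation byte and resume at a character boundary.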
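Self-synchronization can be sketched as a scan that skips continuation bytes until it reaches the start of a character. The function name `next_boundary` is hypothetical, and the sketch assumes the input is otherwise valid UTF-8.

```python
def next_boundary(data: bytes, pos: int) -> int:
    """Return the index of the first character start at or after pos."""
    # Continuation bytes have the form 10xxxxxx; skip them.
    while pos < len(data) and (data[pos] & 0xC0) == 0x80:
        pos += 1
    return pos


# Layout: "a" (1 byte), U+4E2D (3 bytes), U+1F600 (4 bytes), "b" (1 byte).
data = "a\u4e2d\U0001F600b".encode("utf-8")

# Offset 2 falls inside the 3-byte character; the next character starts at 4.
start = next_boundary(data, 2)
print(start)                        # 4
print(data[start:].decode("utf-8"))  # decodes cleanly from the boundary onward
```

Because at most three continuation bytes can follow a lead byte, the scan never needs to move more than three positions in valid input.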
By following these patterns, UTF-8 efficiently encodes all Unicode characters while remaining backward-compatible with ASCII.