Converting Between UTF Encodings
UTF-8, UTF-16, and UTF-32 are all encodings of the same Unicode character set, but they use different sequences of bytes to represent code points.
Each four-byte (32-bit) value in UTF-32 maps directly to that code point's numerical value, so UTF-32 is straightforward. But UTF-8 and UTF-16 use entirely different rules to encode data, both from UTF-32 and from each other, so direct conversion without interpreting the bytes as code points will almost certainly misinterpret data, resulting in invalid characters or encoding errors.
Converting between UTF-8 and UTF-16 involves an intermediate step: first translating byte sequences into Unicode code points, then re-encoding those code points in the target format. Since Unicode code points are the common representation that both UTF-8 and UTF-16 map to and from, conversion looks like:
source encoding → Unicode code point → target encoding
Why? Because UTF-8 represents code points as sequences of 1–4 bytes, whereas UTF-16 represents them as sequences of one or two 16-bit units.
So, when libraries or tools (e.g., JavaScript's `TextEncoder`/`TextDecoder` or Python's `encode`/`decode` methods) convert between encodings, the software typically:
- Decodes the UTF-8 or UTF-16 byte sequence into Unicode code points.
- Re-encodes those code points into the desired format (UTF-8 or UTF-16).
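In Python, those two steps map directly onto `bytes.decode` and `str.encode`. A minimal sketch (the variable names are illustrative, and `utf-16-be` is chosen to make the byte order explicit and avoid a byte-order mark):

```python
# A sketch of the two-step pipeline: bytes -> code points -> bytes.
source_bytes = "café".encode("utf-8")     # pretend these arrived as UTF-8 input

text = source_bytes.decode("utf-8")       # step 1: bytes -> Unicode code points (str)
target_bytes = text.encode("utf-16-be")   # step 2: code points -> UTF-16 bytes

print(source_bytes.hex(" "))              # 63 61 66 c3 a9
print(target_bytes.hex(" "))              # 00 63 00 61 00 66 00 e9
```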
Example — UTF-8 to UTF-16:
- UTF-8 input: `F0 9F 98 80` (bytes representing 😀, `U+1F600`)
- Decode to Unicode code point: `U+1F600` (code point for 😀)
- Re-encode as UTF-16: `D83D DE00` (surrogate pair for 😀 in UTF-16)
Example — UTF-16 to UTF-8:
- UTF-16 input: `D83D DE00` (surrogate pair for 😀)
- Decode to Unicode code point: `U+1F600` (code point for 😀)
- Encode as UTF-8: `F0 9F 98 80`
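Both worked examples can be checked in Python (again with the big-endian `utf-16-be` codec so no byte-order mark appears in the output):

```python
# UTF-8 -> UTF-16: decode the four UTF-8 bytes, then re-encode.
emoji = bytes.fromhex("f0 9f 98 80").decode("utf-8")
print(f"U+{ord(emoji):04X}")               # U+1F600
print(emoji.encode("utf-16-be").hex(" "))  # d8 3d de 00

# UTF-16 -> UTF-8: decode the surrogate pair, then re-encode.
emoji = bytes.fromhex("d8 3d de 00").decode("utf-16-be")
print(emoji.encode("utf-8").hex(" "))      # f0 9f 98 80
```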
In practice, this process is abstracted away, so you don't have to handle code points manually unless you need to.
Unicode Code Points vs. Extended ASCII
The original ASCII standard includes 128 characters (0–127). Unicode’s first 128 code points (U+0000 to U+007F) are identical to the original ASCII characters, ensuring backward compatibility.
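A quick, minimal check in Python: ASCII-only text produces byte-for-byte identical output in both encodings.

```python
text = "Hello"                                        # ASCII-only text
print(text.encode("ascii") == text.encode("utf-8"))   # True: identical bytes
print(text.encode("utf-8").hex(" "))                  # 48 65 6c 6c 6f
```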
Extended ASCII, an extension adopted by various systems, includes 256 characters (0–255). The upper characters (128–255) are included in Unicode but are mapped differently depending on the specific extended ASCII variant. ISO 8859-1's upper range maps one-to-one onto Unicode's Latin-1 Supplement block (U+0080 to U+00FF), while Windows-1252 reassigns some of those positions (for example, `0x80` is €, U+20AC). Ambiguity can arise when transitioning or converting content between systems that use different encodings (e.g., translating ISO 8859-1 into Unicode).
Developers should consider encoding issues when converting text from legacy systems that use extended ASCII into Unicode to ensure no data is misrepresented or lost. If a system reads an ISO-8859-1 or Windows-1252 encoded text file while expecting UTF-8, characters in the range `0x80–0xFF` are likely to be misinterpreted because UTF-8 and these encodings handle bytes in that range differently.
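As a small sketch (the byte `0xE9` is "é" in both ISO-8859-1 and Windows-1252), a lone upper-range byte decodes fine as Latin-1 but fails as UTF-8:

```python
data = bytes([0xE9])                    # 'é' in ISO-8859-1 / Windows-1252

print(data.decode("latin-1"))           # é  (maps directly to U+00E9)

try:
    data.decode("utf-8")                # 0xE9 announces a multi-byte sequence...
except UnicodeDecodeError as e:         # ...but nothing follows, so decoding fails
    print(e)

print(data.decode("utf-8", errors="replace"))  # � (replacement character)
```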
How ISO-8859-1 and Windows-1252 Work
Text files encoded in ISO-8859-1 or Windows-1252 use `0x80–0xFF` as single-byte characters. Bytes in the range `0x00–0x7F` match ASCII, so they are directly compatible with UTF-8 and safe to interpret as-is.
Characters in the range `0x80–0xFF` represent additional symbols or letters in extended ASCII. In UTF-8, bytes in this range never stand alone; they appear only as the lead or continuation bytes of a multi-byte sequence. A system expecting UTF-8 therefore treats those bytes as parts of multi-byte sequences. This may result in invalid characters, where the system displays placeholders (e.g., �) for unrecognized sequences, or in silently wrong output if the bytes accidentally form a valid UTF-8 sequence.
Misinterpretation Example
In ISO-8859-1 or Windows-1252:
- `0xC3` → Ã (Latin capital letter A with tilde)
- `0xA9` → © (copyright symbol)
In UTF-8, these two bytes form a single multi-byte sequence, in which `0xC3` starts the sequence and `0xA9` completes it: `0xC3 0xA9` → é (Latin small letter e with acute). The result is that the standalone pair `Ã©` is misinterpreted as the single character `é`.
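A short Python check makes the mismatch concrete, along with the classic mojibake in the opposite direction:

```python
data = bytes([0xC3, 0xA9])

print(data.decode("latin-1"))                  # Ã© -- two characters, as ISO-8859-1 intends
print(data.decode("utf-8"))                    # é  -- one character, when read as UTF-8

# The same confusion in reverse: UTF-8 bytes displayed as Latin-1.
print("é".encode("utf-8").decode("latin-1"))   # Ã©
```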
How to Prevent Misinterpretation
- Specify Encoding: When opening or processing files, ensure the correct encoding is specified. For example, in Python:

```python
with open('file.txt', encoding='iso-8859-1') as f:
    content = f.read()
```
- Convert Legacy Files: If possible, convert files from ISO-8859-1 or Windows-1252 to UTF-8 explicitly:

```
iconv -f ISO-8859-1 -t UTF-8 input.txt -o output.txt
```
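If `iconv` isn't available, the same conversion is a few lines of Python (a sketch; the file names are placeholders):

```python
# Read the bytes with the legacy encoding, then write them back out as UTF-8.
with open('input.txt', encoding='iso-8859-1') as src:
    text = src.read()

with open('output.txt', 'w', encoding='utf-8') as dst:
    dst.write(text)
```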
- Detect Encoding: Use tools or libraries to detect the file encoding if it's not explicitly stated (e.g., `chardet` or `charset-normalizer` in Python).
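A sketch using `chardet` (assuming the package is installed; the file name is a placeholder). Detection returns a best guess with a confidence score, not a guarantee:

```python
import chardet

with open('mystery.txt', 'rb') as f:   # read raw bytes; don't decode yet
    raw = f.read()

guess = chardet.detect(raw)            # e.g., {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
print(guess['encoding'], guess['confidence'])

text = raw.decode(guess['encoding'])   # decode with the detected encoding
```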