Unicode Transformation Formats (UTFs)
A short introduction to UTF-8, UTF-16, and UTF-32 encodings.
Unicode itself does not define how code points are stored in memory or transmitted. While the entire range of Unicode code points can be (and in some cases is) represented directly using 4-byte (32-bit) values, such an approach is excessively bloated and inefficient for most practical applications.
That’s where encodings come in. To optimize storage and transmission, various encoding schemes — Unicode Transformation Formats (UTF) — have been developed to represent code points more compactly. UTFs define how to translate Unicode code points into sequences of bytes for storage or transmission.
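As a quick illustration (a Python sketch; `"utf-8"`, `"utf-16-le"`, and `"utf-32-le"` are Python's codec names for these formats), the same string turns into byte sequences of very different lengths:

```python
# Encode the same text with each UTF and compare the resulting byte counts.
text = "héllo €"  # 7 code points: ASCII letters, an accented letter, a symbol

for codec in ("utf-8", "utf-16-le", "utf-32-le"):
    data = text.encode(codec)
    print(f"{codec:>9}: {len(data):2d} bytes -> {data.hex(' ')}")

# utf-8 needs 10 bytes for this string, utf-16-le 14, utf-32-le 28.
```

Note that UTF-8 wins here only because most of the characters are ASCII; for text dominated by higher code points the ranking can change, which is exactly the trade-off the table below summarizes.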
The primary and most widely used formats* are UTF-8 and UTF-16, with UTF-32 being limited to special applications. Understanding the differences between these formats is crucial for developers working on internationalization, as the choice of encoding can impact performance, compatibility, and data size.
All UTF encodings can represent all Unicode characters. The choice of format depends on the specific requirements of your application, including the predominant languages you are dealing with, the need for compatibility with particular technologies, and the importance of storage efficiency versus processing simplicity. Understanding the trade-offs and advantages of each encoding format will help you make an informed decision that best suits your internationalization needs.
| | UTF-8 | UTF-16 | UTF-32 |
|---|---|---|---|
| bytes per code point | 1–4 (variable) | 2 or 4 (variable) | 4 (fixed) |
| byte order mark (BOM) | not recommended | recommended or required | recommended or required |
| storage | most efficient for ASCII; variable-length for others | more compact for some scripts (e.g., Chinese, Japanese, Korean, Thai); variable-length | always 4 bytes per character; least efficient |
| applications | web standard; dominant encoding for files, the web, and communication | common in programming environments such as JavaScript, Java, and the Windows APIs | rarely used outside specific contexts; primarily internal processing and debugging |
| processing | variable-length decoding | requires handling surrogate pairs | simplest: 1 code point = 1 unit |
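The last two table rows can be made concrete with a short Python sketch (the codec names are Python's; the byte values follow from the encoding rules): a code point above U+FFFF forces a surrogate pair in UTF-16, and Python's endianness-agnostic `"utf-16"` codec emits a BOM while the endian-specific codecs do not.

```python
# U+1F600 lies above U+FFFF, so UTF-16 must split it into a surrogate pair.
ch = "\U0001F600"  # 😀
print(ch.encode("utf-8").hex(" "))      # f0 9f 98 80  (four UTF-8 bytes)
print(ch.encode("utf-16-be").hex(" "))  # d8 3d de 00  (surrogate pair D83D/DE00)
print(ch.encode("utf-32-be").hex(" "))  # 00 01 f6 00  (one 4-byte unit)

# The plain "utf-16" codec prepends a byte order mark in the platform's
# native byte order; "utf-16-le"/"utf-16-be" leave it out.
bom = "A".encode("utf-16")[:2]
assert bom in (b"\xff\xfe", b"\xfe\xff")
```

This is why UTF-16 code must treat a surrogate pair as a single character during iteration and length counting, while UTF-32 can index code points directly.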
* Other formats also exist, such as CESU-8, SCSU, BOCU-1, GB 18030, and UTF-EBCDIC, but they serve niche purposes and see limited adoption due to their complexity or redundancy. They are not considered part of the UTF series.