UTF-16
UTF-16 might be preferable for applications that perform extensive text processing and manipulation, particularly with texts containing many non-Latin characters, like Chinese, Japanese, or Korean.
Variable-Length Encoding: UTF-16 uses either two or four bytes to represent each character. Most common characters (including those from the Basic Multilingual Plane, which covers most modern languages) are encoded in two bytes, while less common characters (from supplementary planes) require four bytes.
Fixed-Length for Most Characters: Many common characters are represented with two bytes, which can simplify processing for texts that contain a significant number of non-ASCII characters.
Efficiency for Certain Scripts: UTF-16 can be more efficient than UTF-8 for texts that contain many characters from non-Latin scripts, such as Chinese, Japanese, and Korean, which often require two bytes per character in UTF-16 but three bytes in UTF-8.
Simplified Processing: Easier to handle in memory for certain applications, such as those that frequently manipulate non-Latin text, due to its fixed-length encoding for many characters. For example, a desktop application that requires efficient manipulation of large volumes of multilingual, non-Latin text data might benefit from using UTF-16, as would one that heavily interacts with Windows system APIs.
Increased Size for ASCII Text: Less efficient for texts with a high proportion of ASCII characters, as each character uses at least two bytes.
Compatibility: Less commonly used on the web, which can lead to interoperability issues with web-based applications and services.
Byte Order: The order of the two bytes within each 16-bit code unit can differ depending on the system's byte order (endianness). This is where the byte order mark (BOM) becomes relevant.
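The size trade-offs and the byte-order point above can be checked with a short sketch. The helper names `encodeUtf16`, `utf8Length`, and `toHex` are hypothetical and introduced only for illustration; the sketch assumes a runtime that provides the standard `TextEncoder` and `DataView` globals (modern browsers, recent Node.js).

```javascript
// Hypothetical helper: serialize a JavaScript string to UTF-16 bytes.
// Each 16-bit code unit becomes two bytes, written in the requested byte order.
function encodeUtf16(str, littleEndian = true) {
  const bytes = new Uint8Array(str.length * 2);
  const view = new DataView(bytes.buffer);
  for (let i = 0; i < str.length; i++) {
    view.setUint16(i * 2, str.charCodeAt(i), littleEndian);
  }
  return bytes;
}

// Hypothetical helper: number of bytes the string occupies in UTF-8.
const utf8Length = (str) => new TextEncoder().encode(str).length;

// Hypothetical helper: render bytes as space-separated hex pairs.
const toHex = (bytes) =>
  Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");

for (const sample of ["hello", "中文字", "😀"]) {
  console.log(sample, "UTF-8:", utf8Length(sample), "UTF-16:", encodeUtf16(sample).length);
}
// hello UTF-8: 5 UTF-16: 10
// 中文字 UTF-8: 9 UTF-16: 6
// 😀 UTF-8: 4 UTF-16: 4

// The same code unit serialized in both byte orders.
console.log(toHex(encodeUtf16("A", true)));  // 41 00 (little-endian)
console.log(toHex(encodeUtf16("A", false))); // 00 41 (big-endian)

// The BOM is the code point U+FEFF; its byte order reveals the endianness.
console.log(toHex(encodeUtf16("\uFEFF", true)));  // ff fe
console.log(toHex(encodeUtf16("\uFEFF", false))); // fe ff
```

ASCII text doubles in size under UTF-16, the CJK sample shrinks by a third, and a character outside the BMP takes four bytes in both encodings.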
The Basic Multilingual Plane and Surrogate Pairs
The Basic Multilingual Plane (BMP), the core part of Unicode, contains the majority of characters needed for everyday writing in most modern scripts. It includes characters with code points from U+0000 to U+FFFF, which can be represented directly by a single two-byte (16-bit) code unit in UTF-16.
Two blocks within this range are set aside and do not hold standard characters:
- U+E000 to U+F8FF: a “private use area” reserved for custom, non-standard characters.
- U+D800 to U+DFFF: reserved for use exclusively as part of surrogate pairs (see below).
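As a minimal sketch, the two reserved blocks can be expressed as simple range checks. The function names here are hypothetical, and the label "other BMP code point" only means "neither of the two blocks", not that a character is actually assigned there.

```javascript
// Range checks for the two reserved blocks described above.
const isSurrogate = (cp) => cp >= 0xD800 && cp <= 0xDFFF;
const isPrivateUse = (cp) => cp >= 0xE000 && cp <= 0xF8FF;

// Hypothetical helper: label a code point by the block it falls into.
function describeBmpCodePoint(cp) {
  if (cp < 0 || cp > 0xFFFF) return "outside the BMP";
  if (isSurrogate(cp)) return "surrogate";
  if (isPrivateUse(cp)) return "private use";
  return "other BMP code point";
}

console.log(describeBmpCodePoint(0x0041));  // other BMP code point
console.log(describeBmpCodePoint(0xD83D));  // surrogate
console.log(describeBmpCodePoint(0xE000));  // private use
console.log(describeBmpCodePoint(0x1F600)); // outside the BMP
```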
The BMP was designed to cover the majority of characters used in modern scripts and symbols. It contains modern scripts (Latin, Greek, Cyrillic, Arabic, Hebrew, Devanagari), mathematical symbols, standard and specialized punctuation marks and symbols, and ASCII control characters like newline and tab. It also includes characters added for backward compatibility with older character sets like ASCII and ISO 8859.
Examples of BMP Characters:
- A (Latin uppercase letter A): U+0041
- é (Latin lowercase letter e with acute): U+00E9
- Ω (Greek uppercase letter omega): U+03A9
- 中 (Chinese character for "middle"): U+4E2D
- अ (Devanagari letter A): U+0905
- ♥ (Black heart suit): U+2665
Characters outside the BMP (code points U+10000 and above) include less commonly used scripts, emojis, historical scripts, and specialized symbols. These are assigned to the supplementary planes:
Supplementary Multilingual Plane (SMP): Historical scripts, emojis, etc.
Supplementary Ideographic Plane (SIP): Additional Chinese, Japanese, and Korean ideographs.
Other Planes: Specialized uses, like the Supplementary Special-purpose Plane (SSP).
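Because each plane spans 0x10000 code points, the plane number is simply the code point shifted right by 16 bits. The sketch below illustrates this; `planeOf`, `planeName`, and the `PLANE_NAMES` table are hypothetical names, and the table covers only the planes named above.

```javascript
// Abbreviations for the planes mentioned above; other planes fall back to a number.
const PLANE_NAMES = { 0: "BMP", 1: "SMP", 2: "SIP", 14: "SSP" };

// Each plane holds 0x10000 code points, so the plane index is the upper bits.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("Not a Unicode code point: " + codePoint);
  }
  return codePoint >> 16;
}

function planeName(codePoint) {
  const plane = planeOf(codePoint);
  return PLANE_NAMES[plane] ?? `Plane ${plane}`;
}

console.log(planeName("中".codePointAt(0))); // BMP
console.log(planeName("😀".codePointAt(0))); // SMP
console.log(planeName(0x20000));             // SIP
console.log(planeName(0xE0001));             // SSP
```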
In UTF-16, code points outside the BMP, in the range U+10000 to U+10FFFF, cannot fit into a single 16-bit (two-byte) code unit, so they are represented using four bytes: two 16-bit code units that together form a surrogate pair. (Surrogate pairs are only used in UTF-16. UTF-8 encodes code points directly into 1–4 bytes, so it does not need surrogate pairs. UTF-32 uses a fixed-width 4-byte representation for all code points, avoiding surrogate pairs entirely.)
Each surrogate pair consists of:
- high surrogate: a value in the range 0xD800 to 0xDBFF.
- low surrogate: a value in the range 0xDC00 to 0xDFFF.
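The text above gives only the two ranges. The arithmetic that maps a code point onto them is the standard UTF-16 rule: subtract 0x10000 to get a 20-bit value, then put the upper 10 bits in the high surrogate and the lower 10 bits in the low surrogate. A minimal sketch, with hypothetical function names:

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("Code point does not need a surrogate pair");
  }
  const offset = codePoint - 0x10000;     // 20-bit value
  const high = 0xD800 + (offset >> 10);   // upper 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // lower 10 bits
  return [high, low];
}

// Recombine a surrogate pair into the original code point.
function fromSurrogatePair(high, low) {
  return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
}

const [high, low] = toSurrogatePair(0x1F600); // 😀
console.log(high.toString(16), low.toString(16));         // d83d de00
console.log(fromSurrogatePair(high, low).toString(16));   // 1f600
console.log(String.fromCharCode(high, low) === "😀");     // true
```

In everyday JavaScript the built-in `String.fromCodePoint()` and `codePointAt()` perform this conversion, so the manual version is mainly useful for understanding what they do.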
JavaScript strings are encoded as UTF-16, so surrogate pairs are used for characters outside the BMP. This can lead to some counter-intuitive behaviors.
For example, String.prototype.charCodeAt() retrieves individual 16-bit units:
const str = "😀";
console.log(str.charCodeAt(0).toString(16)); // High surrogate: d83d
console.log(str.charCodeAt(1).toString(16)); // Low surrogate: de00
To get the full code point, use String.prototype.codePointAt():
console.log(str.codePointAt(0).toString(16)); // 1f600
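The same mismatch between code units and code points shows up in string length, slicing, and iteration. The sketch below uses a made-up sample string to illustrate a few of these cases; the string iterator (used by spread, `for...of`, and `Array.from`) walks code points, while `length` and `slice()` count 16-bit code units.

```javascript
const text = "a😀b";

// length counts 16-bit code units, so the emoji contributes 2.
console.log(text.length);      // 4

// The string iterator yields whole code points.
console.log([...text].length); // 3

// Format each code point in the usual U+XXXX notation.
const codePoints = Array.from(
  text,
  (ch) => "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);
console.log(codePoints.join(" ")); // U+0061 U+1F600 U+0062

// slice() works on code units and can cut a surrogate pair in half,
// leaving a lone high surrogate at the end of the result.
const broken = text.slice(0, 2);
console.log(broken.charCodeAt(1).toString(16)); // d83d
```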
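Also note the behavior of atob() and btoa(): both work on binary strings in which every code unit stands for a single byte, so btoa() throws when given a string that contains any code unit above 0xFF, such as "中" or "😀".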
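One common way around this limitation, sketched here under the assumption that UTF-8 is an acceptable byte representation, is to convert the text to UTF-8 bytes first and Base64-encode those. The helper names `toBase64Utf8` and `fromBase64Utf8` are hypothetical; `btoa`, `atob`, `TextEncoder`, and `TextDecoder` are assumed to be available as globals (browsers, recent Node.js).

```javascript
// btoa() only accepts code units in the range 0x00..0xFF.
let failed = false;
try {
  btoa("😀");
} catch (err) {
  failed = true; // an InvalidCharacterError is expected here
}
console.log(failed); // true

// Hypothetical helper: text -> UTF-8 bytes -> binary string -> Base64.
function toBase64Utf8(str) {
  const bytes = new TextEncoder().encode(str);
  let binary = "";
  for (const byte of bytes) {
    binary += String.fromCharCode(byte);
  }
  return btoa(binary);
}

// Hypothetical helper: Base64 -> binary string -> UTF-8 bytes -> text.
function fromBase64Utf8(base64) {
  const binary = atob(base64);
  const bytes = Uint8Array.from(binary, (ch) => ch.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

const encoded = toBase64Utf8("😀");
console.log(encoded);                 // 8J+YgA==
console.log(fromBase64Utf8(encoded)); // 😀
```

The loop builds the binary string one byte at a time, which is fine for short inputs; very large inputs may call for a chunked approach.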