UTF-16 in JavaScript
JavaScript strings are represented internally as sequences of UTF-16 code units, not Unicode code points. Characters with code points in the Basic Multilingual Plane (BMP), U+0000 to U+FFFF, are represented as a single 16-bit code unit. Characters outside the BMP (U+10000 to U+10FFFF) are represented as surrogate pairs, which require two 16-bit code units.
Examples:
- "A" (U+0041): Single UTF-16 code unit (0x0041).
- "𐍈" (U+10348): Surrogate pair (0xD800 and 0xDF48).
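The surrogate pair in the second example can be derived by hand: subtract 0x10000 from the code point, then split the remaining 20 bits into two 10-bit halves. The sketch below illustrates that arithmetic; toSurrogatePair is a hypothetical helper name, not a built-in.

```javascript
// Split a supplementary code point (U+10000..U+10FFFF) into its UTF-16 surrogate pair.
// toSurrogatePair is a hypothetical helper used only for illustration.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("code point is not outside the BMP");
  }
  const offset = codePoint - 0x10000;        // 20-bit value
  const high = 0xD800 + (offset >> 10);      // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);     // bottom 10 bits -> low surrogate
  return [high, low];
}

const [high, low] = toSurrogatePair(0x10348);
console.log(high.toString(16), low.toString(16)); // d800 df48
console.log("𐍈".length);                          // 2 (code units, not characters)
console.log("𐍈".charCodeAt(0) === high, "𐍈".charCodeAt(1) === low); // true true
```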
Converting between JavaScript strings and UTF-8 requires explicitly encoding or decoding the string, as these two encodings are not directly compatible. Here's how JavaScript handles these conversions:
To convert between UTF-8 byte sequences and JavaScript strings, you can use the TextEncoder and TextDecoder APIs, which are available in modern browsers and Node.js.
The TextEncoder API converts a JavaScript string into a UTF-8 encoded Uint8Array:
const str = "Hello, 𐍈!";
const encoder = new TextEncoder();
const utf8Array = encoder.encode(str);
console.log(utf8Array);
// Output: Uint8Array(12) [72, 101, 108, 108, 111, 44, 32, 240, 144, 141, 136, 33]
Each code point in the string is encoded as one to four UTF-8 bytes. The ASCII characters take one byte each, while 𐍈 takes four (240, 144, 141, 136).
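To see how the encoded length varies by character, the following sketch measures a few single-character strings. The names sampleEncoder and utf8ByteLength are illustrative only.

```javascript
// Count the UTF-8 bytes produced for a string.
// utf8ByteLength is a hypothetical helper used only for illustration.
const sampleEncoder = new TextEncoder();

function utf8ByteLength(text) {
  return sampleEncoder.encode(text).length;
}

console.log(utf8ByteLength("A"));         // 1 (U+0041)
console.log(utf8ByteLength("\u00E9"));    // 2 (U+00E9, e with acute accent)
console.log(utf8ByteLength("\u20AC"));    // 3 (U+20AC, euro sign)
console.log(utf8ByteLength("\u{10348}")); // 4 (U+10348, outside the BMP)
```

Note that the UTF-8 byte count and the string's length property measure different things: "𐍈" has a length of 2 code units but encodes to 4 bytes.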
The TextDecoder API decodes a UTF-8 Uint8Array back into a UTF-16 JavaScript string:
const utf8Array = new Uint8Array([72, 101, 108, 108, 111, 44, 32, 240, 144, 141, 136, 33]);
const decoder = new TextDecoder();
const str = decoder.decode(utf8Array);
console.log(str);
// Output: "Hello, 𐍈!"
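Encoding and then decoding should return the original string as long as it is well-formed. The sketch below checks that, and also shows two edge cases worth knowing about: a lone surrogate does not survive the round trip, and a decoder created with the fatal option throws on malformed bytes instead of substituting U+FFFD. The helper name roundTrips is hypothetical.

```javascript
// Check whether a string survives a UTF-8 encode/decode round trip.
// roundTrips is a hypothetical helper used only for illustration.
function roundTrips(text) {
  const bytes = new TextEncoder().encode(text);
  return new TextDecoder().decode(bytes) === text;
}

console.log(roundTrips("Hello, 𐍈!")); // true

// A lone surrogate is not valid Unicode text; TextEncoder replaces it with U+FFFD.
console.log(roundTrips("\uD800"));     // false

// With { fatal: true }, malformed UTF-8 throws instead of yielding U+FFFD.
const truncatedBytes = new Uint8Array([240, 144, 141]); // first three of the four bytes of U+10348
const strictDecoder = new TextDecoder("utf-8", { fatal: true });
try {
  strictDecoder.decode(truncatedBytes);
} catch (err) {
  console.log("malformed input rejected:", err.name);
}
```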
The String.prototype.codePointAt() method retrieves the full Unicode code point at a given index, including for characters that require surrogate pairs:
const char = "𐍈";
console.log(char.codePointAt(0).toString(16)); // Output: "10348"
The String.fromCodePoint() method converts Unicode code points to JavaScript strings:
const codePoint = 0x10348;
console.log(String.fromCodePoint(codePoint)); // Output: "𐍈"
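One detail that is easy to miss: the index passed to codePointAt() counts code units, not code points. A brief sketch of what happens when the index lands in the middle of a surrogate pair (the variable name gothic is illustrative):

```javascript
const gothic = String.fromCodePoint(0x10348);

// Index 0 is the start of the pair, so the full code point is returned.
console.log(gothic.codePointAt(0).toString(16)); // "10348"

// Index 1 lands on the second half of the pair, so only the low surrogate is returned.
console.log(gothic.codePointAt(1).toString(16)); // "df48"

// Building the string from its two code units gives an identical result.
console.log(gothic === String.fromCharCode(0xD800, 0xDF48)); // true
```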
Characters outside the BMP (e.g., emojis or rare scripts) are represented as surrogate pairs in JavaScript. To handle these properly:
- Use String.prototype.codePointAt() and String.fromCodePoint() for full code point operations.
- Avoid naive string indexing (str[i]), as it treats surrogate pairs as two separate code units.
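As a rough illustration of the difference, the sketch below compares naive indexing with iteration by code point. It relies on the built-in string iterator (used by for...of and Array.from), which yields whole code points; reverseByCodePoint is a hypothetical helper name.

```javascript
const greeting = "Hi 𐍈!";

// Naive indexing walks code units, so the surrogate pair is split in two.
console.log(greeting.length);        // 6
console.log(greeting[3] === "𐍈");   // false (greeting[3] is a lone high surrogate)

// The string iterator yields whole code points.
const codePoints = Array.from(greeting);
console.log(codePoints.length);      // 5
console.log(codePoints[3] === "𐍈"); // true

// Hypothetical helper: reverse a string without splitting surrogate pairs.
function reverseByCodePoint(text) {
  return Array.from(text).reverse().join("");
}
console.log(reverseByCodePoint(greeting)); // "!𐍈 iH"
```

Reversing with split("") instead would separate the two halves of the pair and produce an ill-formed string.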