UTF-16 in JavaScript

JavaScript strings are sequences of UTF-16 code units, not Unicode code points. Characters in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF) take a single 16-bit code unit. Characters outside the BMP (U+10000 to U+10FFFF) are encoded as surrogate pairs: two 16-bit code units, a high surrogate (0xD800 to 0xDBFF) followed by a low surrogate (0xDC00 to 0xDFFF).

Examples:

"A" (U+0041): Single UTF-16 code unit (0x0041).
"𐍈" (U+10348): Surrogate pair (0xD800 and 0xDF48).

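The surrogate pair for U+10348 can be derived by hand: subtract 0x10000, split the remaining 20-bit value into its top and bottom 10 bits, then add 0xD800 and 0xDC00 respectively. The sketch below illustrates this; toSurrogatePair is a hypothetical helper name chosen for this example, not a built-in.

```javascript
// Hypothetical helper (not a built-in): splits a supplementary code point
// (U+10000 to U+10FFFF) into its UTF-16 high and low surrogates.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
    throw new RangeError("Expected a code point outside the BMP");
  }
  const offset = codePoint - 0x10000;    // 20-bit value
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // bottom 10 bits
  return [high, low];
}

const [high, low] = toSurrogatePair(0x10348);
console.log(high.toString(16), low.toString(16)); // Output: d800 df48

// The string itself holds the same two code units.
const s = "\u{10348}"; // "𐍈"
console.log(s.length); // Output: 2
console.log(s.charCodeAt(0) === high, s.charCodeAt(1) === low); // Output: true true
```
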
Because JavaScript strings use UTF-16 and UTF-8 is a different encoding, converting between the two requires an explicit encoding or decoding step. The TextEncoder and TextDecoder APIs, available in modern browsers and Node.js, perform these conversions.

The TextEncoder API converts a JavaScript string into a UTF-8 encoded Uint8Array:

const str = "Hello, 𐍈!";
const encoder = new TextEncoder();
const utf8Array = encoder.encode(str);
console.log(utf8Array);
// Output: Uint8Array(12) [72, 101, 108, 108, 111, 44, 32, 240, 144, 141, 136, 33]

Each character is encoded into its own UTF-8 byte sequence: the ASCII characters take one byte each, while 𐍈 takes the four bytes 240, 144, 141, 136.
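To make the per-character byte counts concrete, the following sketch encodes one character at a time and prints how many UTF-8 bytes each needs. The sample characters (A, é, €, 𐍈) are chosen for illustration and are written as escapes so the source encoding does not matter.

```javascript
const encoder = new TextEncoder();

// for...of walks the string by code point, so "𐍈" arrives as one item.
for (const ch of "A\u00E9\u20AC\u{10348}") {
  const bytes = encoder.encode(ch);
  const hex = Array.from(bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");
  const label = ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");
  console.log(`U+${label}: ${bytes.length} byte(s): ${hex}`);
}
// Output:
// U+0041: 1 byte(s): 41
// U+00E9: 2 byte(s): c3 a9
// U+20AC: 3 byte(s): e2 82 ac
// U+10348: 4 byte(s): f0 90 8d 88
```

The UTF-8 length therefore depends on the code point's magnitude, not on how many UTF-16 code units the character occupies in the string.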

The TextDecoder API decodes a UTF-8 Uint8Array back into a UTF-16 JavaScript string:

const utf8Array = new Uint8Array([72, 101, 108, 108, 111, 44, 32, 240, 144, 141, 136, 33]);
const decoder = new TextDecoder();
const str = decoder.decode(utf8Array);
console.log(str);
// Output: "Hello, 𐍈!"
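When the bytes arrive in pieces (for example from a network stream), a multi-byte character may be split across two chunks. A minimal sketch, assuming the chunk boundary falls in the middle of 𐍈: passing { stream: true } to decode() lets the decoder hold the incomplete sequence until the rest arrives. The chunk split is contrived for the example.

```javascript
const bytes = new Uint8Array([72, 101, 108, 108, 111, 44, 32, 240, 144, 141, 136, 33]);

// Split inside the 4-byte sequence for "𐍈": the first chunk ends with 240, 144.
const firstChunk = bytes.subarray(0, 9);
const secondChunk = bytes.subarray(9);

const decoder = new TextDecoder();
let text = decoder.decode(firstChunk, { stream: true });
text += decoder.decode(secondChunk, { stream: true });
text += decoder.decode(); // flush any buffered bytes
console.log(text);
// Output: "Hello, 𐍈!"

// Decoding each chunk independently replaces the broken halves with U+FFFD.
const naive = new TextDecoder().decode(firstChunk) + new TextDecoder().decode(secondChunk);
console.log(naive === text); // Output: false
console.log(naive.includes("\uFFFD")); // Output: true
```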

JavaScript also provides tools for working directly with Unicode code points and surrogate pairs. The String.prototype.codePointAt() method returns the full code point that starts at a given code-unit index, combining a surrogate pair into a single value:

const char = "𐍈";
console.log(char.codePointAt(0).toString(16)); // Output: "10348"

The String.fromCodePoint() method converts Unicode code points to JavaScript strings:

const codePoint = 0x10348;
console.log(String.fromCodePoint(codePoint)); // Output: "𐍈"

Characters outside the BMP (e.g., emojis or rare scripts) are represented as surrogate pairs in JavaScript. To handle these properly:

Use String.prototype.codePointAt() and String.fromCodePoint() for full code point operations.
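Avoid naive string indexing (str[i]) and counting characters with str.length, as both operate on code units and split surrogate pairs. Iterate with for...of or the spread syntax ([...str]) instead, which step through the string by code point.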
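
The difference between code-unit and code-point access can be seen in a short sketch. The sample string "a𐍈b" is an assumption made for illustration; it is written with an escape so the source encoding does not matter.

```javascript
const str = "a\u{10348}b"; // "a𐍈b"

console.log(str.length);      // Output: 4 (code units)
console.log([...str].length); // Output: 3 (code points)

// Indexing by code unit splits the pair: str[1] is a lone high surrogate.
console.log(str[1] === "\uD800");            // Output: true
console.log(str.charCodeAt(1).toString(16)); // Output: "d800"

// for...of iterates by code point, so the pair stays together.
const codePoints = [];
for (const ch of str) {
  codePoints.push(ch.codePointAt(0).toString(16));
}
console.log(codePoints.join(" ")); // Output: "61 10348 62"
```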