Normalization
There are cases in which Unicode provides multiple valid ways to encode the same glyph (i.e., the visual representation of a character or combination of characters). Some examples:
- Diacritics and Combining Marks: Characters like accents, tildes, or dots can be combined with base characters (a, e) to form precomposed characters (e.g., á or ä), or these characters can be decomposed into their components.
- Ligatures: A ligature is a single glyph that combines two or more characters into one visual unit for stylistic or functional purposes. For example, “æ” (U+00E6) is a ligature of “a” (U+0061) and “e” (U+0065), historically used in Latin and still common in languages such as Danish and Icelandic. Another is “œ” (U+0153), a ligature of “o” (U+006F) and “e” (U+0065), used in French words like "œuvre." These ligatures combine two distinct characters into a single glyph while retaining their original meaning and pronunciation in context. Note that Unicode encodes æ and œ as letters in their own right, with no decomposition mapping, so normalization leaves them unchanged; typographic ligatures such as “ﬁ” (U+FB01) do have a compatibility decomposition (f + i).
- Compatibility Characters: Special forms such as superscripts (²), fractions (¼), or circled letters (ⓐ) may be transformed into their compatibility equivalents (2, 1⁄4, or a).
- Special Composite Glyphs: Characters with multiple components, such as Hangul syllables in Korean, may be decomposed into their constituent Jamo elements or composed into a single glyph.
Precomposed character | Code point | Constituent characters | Constituent code points |
---|---|---|---|
é | U+00E9 | e + ́ | U+0065 U+0301 |
ä | U+00E4 | a + ̈ | U+0061 U+0308 |
æ (ligature) | U+00E6 | a + e | U+0061 U+0065 |
œ (ligature) | U+0153 | o + e | U+006F U+0065 |
ﬁ (ligature) | U+FB01 | f + i | U+0066 U+0069 |
² (superscript) | U+00B2 | 2 | U+0032 |
¼ (fraction) | U+00BC | 1 + ⁄ + 4 | U+0031 U+2044 U+0034 |
ⓐ (circled) | U+24D0 | a | U+0061 |
한 (Hangul syllable) | U+D55C | ᄒ + ᅡ + ᆫ | U+1112 U+1161 U+11AB |

Only é, ä, and 한 decompose canonically. ﬁ, ², ¼, and ⓐ have compatibility decompositions, and æ and œ have no decomposition mapping at all, so no normalization form splits them into the letters shown.
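The mappings in the table can be inspected directly. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` and the `samples` list are illustrative choices, not part of any API. A mapping that starts with a tag such as `<compat>` is a compatibility decomposition, an untagged mapping is canonical, and an empty string means the character has no mapping. Hangul syllables are decomposed algorithmically rather than through a stored mapping, so the sketch normalizes 한 to NFD instead.

```python
import unicodedata

def describe(ch: str) -> str:
    """Classify the decomposition mapping recorded for a single character."""
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    if mapping.startswith("<"):
        # Tagged mappings (<compat>, <super>, <fraction>, <circle>, ...) are compatibility ones.
        return "compatibility " + mapping
    return "canonical " + mapping

# Characters from the table above, written as escapes to avoid ambiguity.
samples = ["\u00e9", "\u00e4", "\u00e6", "\u0153", "\ufb01", "\u00b2", "\u00bc", "\u24d0"]
report = {f"U+{ord(ch):04X}": describe(ch) for ch in samples}
for code_point, kind in report.items():
    print(code_point, kind)

# Hangul: decomposition is computed arithmetically by the normalizer.
hangul_jamo = [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", "\ud55c")]
print(hangul_jamo)  # ['U+1112', 'U+1161', 'U+11AB']
```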
Having multiple ways to represent the same character(s) can cause unexpected behaviors and cryptic bugs. For example, operations like string comparison or sorting may give wrong results when visually identical texts differ in the bytes that make up their underlying encoding.
Normalization converts equivalent text to a single, consistent representation, so that operations like comparison, searching, and storage treat visually identical strings the same way.
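To make the problem concrete, here is a small sketch in Python (standard library only). The word "café" and the variable names are illustrative assumptions; the point is that two strings which render identically compare as unequal until both are normalized to the same form.

```python
import unicodedata

precomposed = "caf\u00e9"   # é as the single code point U+00E9
decomposed = "cafe\u0301"   # e followed by U+0301 COMBINING ACUTE ACCENT

# The strings look the same on screen but differ in code points and bytes.
raw_equal = precomposed == decomposed
lengths = (len(precomposed), len(decomposed))
byte_lengths = (len(precomposed.encode("utf-8")), len(decomposed.encode("utf-8")))

# Normalizing both sides to the same form makes the comparison reliable.
nfc_equal = unicodedata.normalize("NFC", precomposed) == unicodedata.normalize("NFC", decomposed)

print(raw_equal)     # False
print(lengths)       # (4, 5)
print(byte_lengths)  # (5, 6)
print(nfc_equal)     # True
```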
Unicode Normalization Forms
Normalization forms refer to standardized ways of representing text to resolve differences in how characters are encoded.
The Unicode Standard defines four normalization forms:
Equivalence | Composed | Decomposed |
---|---|---|
Canonical Equivalence: Covers precomposed and decomposed forms of characters that are considered equivalent in meaning and appearance. | NFC, Normalization Form C (Canonical Composition): Combines base characters with combining marks into precomposed characters where possible. Well suited to storage, display, and most practical applications, as it produces compact, composed forms that are visually consistent. | NFD, Normalization Form D (Canonical Decomposition): Decomposes precomposed characters into base characters and combining marks. Useful for text analysis, searching, indexing, linguistic processing, or when a granular representation of text is needed. |
Compatibility Equivalence: Additionally expands compatibility characters into their simpler equivalents. | NFKC, Normalization Form KC (Compatibility Composition): Similar to NFC but also replaces compatibility characters with their compatibility equivalents. Suited to cases where formatting distinctions should be ignored, such as matching text across systems that encode it differently. | NFKD, Normalization Form KD (Compatibility Decomposition): Similar to NFD but also replaces compatibility characters with their compatibility equivalents. Useful for more aggressive text processing, such as preparing data for indexing or matching. |
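One way to get a feel for the four forms is to ask which of them a given string already satisfies. The sketch below assumes Python 3.8 or later, where `unicodedata.is_normalized` is available; the helper `forms_satisfied` is a hypothetical name introduced for illustration.

```python
import unicodedata
from typing import List

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

def forms_satisfied(text: str) -> List[str]:
    """Return the normalization forms that `text` is already in."""
    return [form for form in FORMS if unicodedata.is_normalized(form, text)]

print(forms_satisfied("\u00e9"))   # precomposed é: ['NFC', 'NFKC']
print(forms_satisfied("e\u0301"))  # e + combining accent: ['NFD', 'NFKD']
print(forms_satisfied("\ufb01"))   # ﬁ ligature: ['NFC', 'NFD']
print(forms_satisfied("abc"))      # plain ASCII: ['NFC', 'NFD', 'NFKC', 'NFKD']
```

The ligature case shows the split between the two rows of the table: ﬁ is already in both canonical forms, and only the compatibility forms change it.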
Practical Examples
Input: éﬁ ( [é] + [ﬁ] )
Representations:
- Precomposed: U+00E9 U+FB01
- Fully decomposed: U+0065 U+0301 U+0066 U+0069
Normalized Outputs:
- NFC: éﬁ (U+00E9 U+FB01)
- NFD: éﬁ (U+0065 U+0301 U+FB01). The accent is split off, but the ligature stays, because its decomposition is a compatibility mapping rather than a canonical one.
- NFKC: éfi (U+00E9 U+0066 U+0069)
- NFKD: éfi (U+0065 U+0301 U+0066 U+0069)
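Listings like this are easy to get wrong by hand, so it can help to compute them. The following sketch uses Python's `unicodedata.normalize` on the same input; `code_points` is a hypothetical formatting helper, and the comments show the output Python's implementation produces.

```python
import unicodedata

def code_points(text: str) -> str:
    """Format a string as space-separated U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

source = "\u00e9\ufb01"  # precomposed é followed by the ﬁ ligature

results = {
    form: code_points(unicodedata.normalize(form, source))
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
for form, cps in results.items():
    print(f"{form}: {cps}")

# NFC: U+00E9 U+FB01
# NFD: U+0065 U+0301 U+FB01
# NFKC: U+00E9 U+0066 U+0069
# NFKD: U+0065 U+0301 U+0066 U+0069
```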
Key Points
Normalization Is Essential for Text Comparison: Visually identical strings can have different binary representations, making normalization crucial for reliable comparisons.
Use Case-Specific: NFC and NFD focus on canonical equivalence (base + diacritics). NFKC and NFKD go further, addressing compatibility characters.
Not Always Reversible: NFKC and NFKD discard formatting distinctions (for example, ﬁ becomes fi and ² becomes 2), so the original text cannot be recovered from the normalized form.
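These points can be sketched as a small comparison helper in Python. The function name `equivalent` and its default form are assumptions made for illustration; which form is appropriate depends on the application.

```python
import unicodedata

def equivalent(a: str, b: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

# Canonical equivalence: precomposed é matches e + combining accent.
canonical_match = equivalent("\u00e9", "e\u0301")

# The ﬁ ligature matches "fi" only under a compatibility form.
ligature_nfc = equivalent("\ufb01", "fi")
ligature_nfkc = equivalent("\ufb01", "fi", form="NFKC")

# Compatibility normalization is lossy: the superscript cannot be recovered.
original = "x\u00b2"
flattened = unicodedata.normalize("NFKC", original)
recoverable = unicodedata.normalize("NFC", flattened) == original

print(canonical_match, ligature_nfc, ligature_nfkc)  # True False True
print(flattened, recoverable)                        # x2 False
```

Because the compatibility forms drop information, one reasonable design is to keep the original text (or its NFC form) in storage and apply NFKC or NFKD only to derived values such as search keys.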