Normalization

Unicode often provides more than one valid way to encode the same glyph (i.e., the visual representation of a character or combination of characters). Some examples:

Diacritics and Combining Marks: Marks such as accents, tildes, or dots can follow a base character (a, e) as separate combining code points, or the pair can be encoded as a single precomposed character (e.g., á or ä). The two encodings are canonically equivalent.
Ligatures: A ligature is a single glyph that represents two or more characters joined into one visual unit for stylistic or functional purposes. Typographic ligatures such as “ﬁ” (U+FB01) have a compatibility decomposition into their parts, “f” (U+0066) and “i” (U+0069). By contrast, “æ” (U+00E6), historically a ligature of “a” and “e” and still common in languages like Danish and Icelandic, and “œ” (U+0153), used in French words like "œuvre," are encoded as letters in their own right. They have no decomposition mapping, so no normalization form splits them.
Compatibility Characters: Special forms such as superscripts (²), fractions (¼), or circled letters (ⓐ) may be replaced by their compatibility equivalents (2, 1⁄4 written with the fraction slash U+2044, or a).
Special Composite Glyphs: Characters with multiple components, such as Hangul syllables in Korean, may be decomposed into their constituent Jamo or composed into a single precomposed syllable.

PRECOMPOSED	DECOMPOSED
character	code point	constituent characters	constituent code points
é	`U+00E9`	e + ́	`U+0065 U+0301`
ä	`U+00E4`	a + ̈	`U+0061 U+0308`
æ (letter, historically a ligature)	`U+00E6`	none (no decomposition mapping)	none
œ (letter, historically a ligature)	`U+0153`	none (no decomposition mapping)	none
ﬁ (ligature)	`U+FB01`	f + i	`U+0066 U+0069`
² (superscript)	`U+00B2`	2	`U+0032`
¼ (fraction)	`U+00BC`	1 + ⁄ + 4	`U+0031 U+2044 U+0034`
ⓐ (circled)	`U+24D0`	a	`U+0061`
한 (Hangul syllable)	`U+D55C`	ᄒ + ᅡ + ᆫ	`U+1112 U+1161 U+11AB`
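
Whether a given character has a canonical mapping, a compatibility mapping, or none at all can be read from its decomposition field in the Unicode Character Database. The following is a minimal sketch using Python's standard `unicodedata` module; the helper name `describe` is hypothetical and exists only for this illustration. A mapping that starts with a tag such as `<compat>` or `<super>` is a compatibility mapping, an untagged mapping is canonical, and an empty string means the character has no mapping.

```python
import unicodedata

def describe(ch: str) -> str:
    """Classify the decomposition mapping of a single character.

    Illustrative helper, not part of any library.
    """
    mapping = unicodedata.decomposition(ch)
    if not mapping:
        return "none"
    if mapping.startswith("<"):
        return "compatibility"
    return "canonical"

# é, ä, æ, œ, ﬁ, ², ¼, ⓐ
for ch in ["\u00e9", "\u00e4", "\u00e6", "\u0153", "\ufb01", "\u00b2", "\u00bc", "\u24d0"]:
    print(f"U+{ord(ch):04X}", describe(ch), unicodedata.decomposition(ch))
```

Hangul syllables are left out here because the standard decomposes them with an arithmetic rule rather than with per-character mappings.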

Having multiple ways to represent the same character(s) can cause unexpected behaviors and cryptic bugs. For example, operations like string comparison or sorting may fail when visually identical texts differ in the code points, and therefore the bytes, of their underlying encoding.
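
A small Python sketch of the problem, using two spellings of "café" chosen for this illustration (the variable names are arbitrary):

```python
precomposed = "caf\u00e9"    # é as a single code point
decomposed = "cafe\u0301"    # e followed by a combining acute accent

print(precomposed == decomposed)           # False
print(len(precomposed), len(decomposed))   # 4 5
print(precomposed.encode("utf-8"))         # b'caf\xc3\xa9'
print(decomposed.encode("utf-8"))          # b'cafe\xcc\x81'

# The two spellings also end up as separate dictionary keys.
counts = {}
for word in (precomposed, decomposed):
    counts[word] = counts.get(word, 0) + 1
print(len(counts))                         # 2
```

Both strings render identically in most fonts, yet every check above treats them as different text.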

Normalization converts text to a single, consistent representation so that processing, display, and operations like comparison, searching, and storage behave predictably.
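
In practice this usually means normalizing both sides before comparing. The sketch below assumes NFC as the target form and uses Python's `unicodedata.normalize`; the helper names `nfc_equal` and `nfc_sorted` are hypothetical.

```python
import unicodedata

def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

def nfc_sorted(words):
    """Sort by NFC form so canonically equivalent spellings end up adjacent."""
    return sorted(words, key=lambda w: unicodedata.normalize("NFC", w))

words = ["cafe\u0301", "cafz", "caf\u00e9"]

print(nfc_equal("caf\u00e9", "cafe\u0301"))   # True
print(sorted(words) == ["cafe\u0301", "cafz", "caf\u00e9"])      # True: spellings split apart
print(nfc_sorted(words) == ["cafz", "cafe\u0301", "caf\u00e9"])  # True: spellings adjacent
```

The sort here is still plain code point order, not language-aware collation; normalization only guarantees that equivalent spellings are treated alike.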

Unicode Normalization Forms

A normalization form is a standardized way of representing text that resolves differences in how equivalent characters are encoded.

The Unicode Standard defines four normalization forms:

	composed	decomposed
Canonical Equivalence: Covers precomposed and decomposed forms of characters that are equivalent in meaning and appearance.	NFC (Normalization Form C, Canonical Composition): Combines base characters with combining marks into precomposed characters where possible. Ideal for storage, display, and most practical applications, as it produces compact, composed forms that are visually consistent.	NFD (Normalization Form D, Canonical Decomposition): Decomposes precomposed characters into base characters and combining marks. Useful for text analysis, searching, indexing, linguistic processing, or when a granular representation of text is needed.
Compatibility Equivalence: Additionally expands compatibility characters into their simpler equivalents.	NFKC (Normalization Form KC, Compatibility Composition): Similar to NFC but also replaces compatibility characters with their compatibility equivalents. Suited to cases where formatting distinctions such as ligatures or superscripts should not matter, for example when matching identifiers or search terms.	NFKD (Normalization Form KD, Compatibility Decomposition): Similar to NFD but also replaces compatibility characters with their compatibility equivalents. Useful for more aggressive text processing, such as preparing data for indexing or matching.
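
As a rough illustration of how the four forms differ, the Python sketch below applies them to a Hangul syllable and a superscript. It relies on `unicodedata.normalize` and on `unicodedata.is_normalized`, which is available from Python 3.8; the sample strings are chosen only for this example.

```python
import unicodedata

syllable = "\ud55c"                              # 한 as one precomposed syllable
jamo = unicodedata.normalize("NFD", syllable)    # canonical decomposition

print([f"U+{ord(c):04X}" for c in jamo])         # ['U+1112', 'U+1161', 'U+11AB']
print(unicodedata.normalize("NFC", jamo) == syllable)   # True

# is_normalized reports whether text is already in a given form.
print(unicodedata.is_normalized("NFC", syllable))       # True
print(unicodedata.is_normalized("NFD", syllable))       # False

# Canonical forms leave compatibility characters alone; the K forms fold them.
print(unicodedata.normalize("NFC", "x\u00b2") == "x\u00b2")   # True
print(unicodedata.normalize("NFKC", "x\u00b2") == "x2")       # True
```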

Practical Examples

Input: éﬁ ( [é] + [ﬁ] )

Representations:

Precomposed: U+00E9 U+FB01
Fully decomposed (canonical and compatibility mappings applied): U+0065 U+0301 U+0066 U+0069

Normalized Outputs:

NFC: éﬁ (U+00E9 U+FB01)
NFD: éﬁ (U+0065 U+0301 U+FB01); the ligature is kept because it has only a compatibility mapping
NFKC: éfi (U+00E9 U+0066 U+0069)
NFKD: éfi (U+0065 U+0301 U+0066 U+0069)
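
The outputs for this input can be checked with Python's standard `unicodedata` module. In the sketch below, the `expected` table is written out by hand for this example: the canonical forms (NFC, NFD) keep the ligature U+FB01 because it has only a compatibility mapping, and the composing forms (NFC, NFKC) keep é as U+00E9.

```python
import unicodedata

text = "\u00e9\ufb01"   # é followed by the ﬁ ligature

# Hand-written expectations for each form.
expected = {
    "NFC":  "\u00e9\ufb01",    # U+00E9 U+FB01
    "NFD":  "e\u0301\ufb01",   # U+0065 U+0301 U+FB01
    "NFKC": "\u00e9fi",        # U+00E9 U+0066 U+0069
    "NFKD": "e\u0301fi",       # U+0065 U+0301 U+0066 U+0069
}

results = {}
for form, want in expected.items():
    got = unicodedata.normalize(form, text)
    results[form] = (got == want)
    print(form, " ".join(f"U+{ord(c):04X}" for c in got), got == want)
```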

Key Points

Normalization Is Essential for Text Comparison: Visually identical strings can have different binary representations, making normalization crucial for reliable comparisons.

Use Case-Specific: NFC and NFD focus on canonical equivalence (base + diacritics). NFKC and NFKD go further, addressing compatibility characters.

Not Always Reversible: NFKC/NFKD transformations are lossy. Once a ligature such as ﬁ has become "fi", or ² has become "2", the original form cannot be recovered from the normalized text.
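
A short Python sketch of that loss, using sample strings chosen for this illustration. It also shows one case, the Angstrom sign, where even a canonical form replaces the original code point.

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

original = "\ufb01x\u00b2"                        # ﬁx²
folded = unicodedata.normalize("NFKC", original)
print(folded)                                     # fix2

# No normalization form turns the folded text back into the original.
recovered = any(unicodedata.normalize(f, folded) == original for f in FORMS)
print(recovered)                                  # False

# For text that is already in NFC, a trip through NFD and back is lossless.
s = "\u00e9"
round_trip = unicodedata.normalize("NFC", unicodedata.normalize("NFD", s))
print(round_trip == s)                            # True

# Some characters are replaced even by canonical forms:
# ANGSTROM SIGN (U+212B) normalizes to U+00C5 under NFC.
angstrom = unicodedata.normalize("NFC", "\u212b")
print(f"U+{ord(angstrom):04X}")                   # U+00C5
```

If the original spelling matters, for example for display or auditing, one option is to store the raw text alongside a normalized key used only for matching.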