Normalization in Python

Python supports Unicode normalization through the unicodedata module in the standard library. Its normalize() function converts a string to one of the four Unicode normalization forms.

Syntax:

unicodedata.normalize(form, unistr)

Parameters:

form: The desired normalization form, passed as a string. Possible values are:
- "NFC" (canonical decomposition, followed by canonical composition)
- "NFD" (canonical decomposition)
- "NFKC" (compatibility decomposition, followed by canonical composition)
- "NFKD" (compatibility decomposition)
unistr: The string to be normalized.

Return Value:

The normalized string.
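
To make the difference between the four forms concrete, the following sketch applies each of them to one sample string and prints the resulting length and code points. The sample string and the variable names are illustrative choices, not part of the module.

```python
import unicodedata

# Illustrative sample: precomposed "é" (U+00E9) followed by the "ﬁ" ligature (U+FB01)
sample = "\u00e9\ufb01"

lengths = {}
for form in ("NFC", "NFD", "NFKC", "NFKD"):
    result = unicodedata.normalize(form, sample)
    lengths[form] = len(result)
    # Show the form, the number of code points, and the code points themselves
    print(form, len(result), [f"U+{ord(ch):04X}" for ch in result])

# NFC keeps 2 code points, NFD and NFKC produce 3, NFKD produces 4
```

The canonical forms (NFC, NFD) leave the ligature alone, while the compatibility forms (NFKC, NFKD) replace it with the plain letters "f" and "i".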

Examples

Example 1: Canonical Composition

import unicodedata
text = "e\u0301"  # "e" + combining acute accent (two code points)
normalized = unicodedata.normalize("NFC", text)
print(normalized)       # Outputs: "é" (single code point U+00E9)
print(len(normalized))  # Outputs: 1

Example 2: Canonical Decomposition

# Continues from Example 1
decomposed = unicodedata.normalize("NFD", normalized)
print(decomposed)       # Displays "é", but it is now "e" + U+0301 (combining acute accent)
print(len(decomposed))  # Outputs: 2
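
Because the composed and decomposed strings look the same when printed, it can help to list their code points explicitly. The sketch below does this with unicodedata.name(); the helper name describe is a hypothetical choice for this illustration.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' entry per code point in text (hypothetical helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

composed = unicodedata.normalize("NFC", "e\u0301")
decomposed = unicodedata.normalize("NFD", composed)

print(describe(composed))    # ['U+00E9 LATIN SMALL LETTER E WITH ACUTE']
print(describe(decomposed))  # ['U+0065 LATIN SMALL LETTER E', 'U+0301 COMBINING ACUTE ACCENT']
```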

Use Case: Comparing Strings

Two strings that look identical may consist of different code point sequences, in which case == reports them as unequal. Normalizing both strings to the same form before comparing makes canonically equivalent strings compare equal.

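# Normalizing for comparison
import unicodedata
str1 = "e\u0301"  # e + combining acute accent
str2 = "\u00e9"   # single precomposed character "é"
print(str1 == str2)  # False
# Normalize both sides so the result does not depend on how either string was produced
print(unicodedata.normalize("NFC", str1) == unicodedata.normalize("NFC", str2))  # True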
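
When this comparison is needed in several places, it can be wrapped in a small helper. The following is a minimal sketch; the function name normalized_equal and its form parameter are hypothetical, not part of unicodedata.

```python
import unicodedata

def normalized_equal(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form (hypothetical helper)."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)

print(normalized_equal("e\u0301", "\u00e9"))           # True: canonically equivalent
print(normalized_equal("\u2460", "1"))                 # False: "①" is not canonically equal to "1"
print(normalized_equal("\u2460", "1", form="NFKC"))    # True: equal under compatibility normalization
```

Defaulting to NFC keeps the comparison strict; passing "NFKC" is a deliberate choice to also treat compatibility characters as equal to their plain equivalents.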

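Use Case: Compatibility Normalization

The compatibility forms (NFKC and NFKD) also replace compatibility characters, such as circled digits, ligatures, and superscripts, with their plain equivalents. This simplifies search and matching, but it discards the original formatting distinction and cannot be reversed.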

# Compatibility normalization (NFKC)
import unicodedata
text = "\u2460"  # "①", circled digit one
normalized = unicodedata.normalize("NFKC", text)
print(normalized)  # Outputs: "1"
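
The sketch below contrasts NFC and NFKC on a few compatibility characters to show that only the compatibility form changes them. The sample list and the results dictionary are illustrative names chosen for this example.

```python
import unicodedata

# Illustrative samples: circled digit one, "fi" ligature, "x" with superscript two
samples = ["\u2460", "\ufb01", "x\u00b2"]

results = {}
for sample in samples:
    nfc = unicodedata.normalize("NFC", sample)
    nfkc = unicodedata.normalize("NFKC", sample)
    results[sample] = (nfc, nfkc)
    print(repr(sample), "NFC unchanged:", nfc == sample, "NFKC:", repr(nfkc))
```

Note that "x²" becomes "x2" under NFKC, which changes its meaning; this is one reason to apply compatibility normalization to search keys or identifiers rather than to text that will be displayed or stored.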

Limitations and Dependencies

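Performance: normalize() processes the whole string and returns a new one, so its cost grows with the amount of text. For large datasets, it is usually cheaper to normalize once when data is read or stored than to normalize on every comparison.
Dependencies: The unicodedata module is part of Python's standard library, so no additional installation is needed. The Unicode version it implements depends on the Python version and is available as unicodedata.unidata_version.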
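
One possible way to reduce the cost on large inputs is to check first whether a string is already in the target form. The sketch below assumes Python 3.8 or later, where unicodedata.is_normalized() is available; the helper name normalize_if_needed is hypothetical.

```python
import unicodedata

def normalize_if_needed(text, form="NFC"):
    """Return text unchanged if it is already in the given form (hypothetical helper).

    Assumes Python 3.8+, which added unicodedata.is_normalized().
    """
    if unicodedata.is_normalized(form, text):
        return text
    return unicodedata.normalize(form, text)

print(normalize_if_needed("plain ascii"))       # already NFC, returned as-is
print(normalize_if_needed("e\u0301") == "\u00e9")  # True: decomposed input gets composed
```

Whether this check pays off depends on the data: it helps most when the majority of strings are already normalized, which is common for ASCII-heavy text.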