Normalization in Python
Python supports Unicode normalization through the unicodedata module. The module's normalize() function converts a string to one of the four Unicode normalization forms.
Syntax:
unicodedata.normalize(form, unistr)
Parameters:
- form: The desired normalization form, given as a string. Possible values are:
  - NFC (Canonical Composition)
  - NFD (Canonical Decomposition)
  - NFKC (Compatibility Composition)
  - NFKD (Compatibility Decomposition)
- unistr: The string to be normalized.
Return Value:
- The normalized string.
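To make the difference between the four forms concrete, the following sketch normalizes one short string with each form and lists the resulting code points. The sample string and the names sample and results are illustrative choices, not part of the unicodedata API.

```python
import unicodedata

# U+00E9 is the precomposed "é"; U+FB01 is the "ﬁ" ligature,
# a compatibility character that only the NFK* forms expand.
sample = "\u00e9\ufb01"

# Map each form name to the code points it produces for the sample.
results = {
    form: [f"U+{ord(ch):04X}" for ch in unicodedata.normalize(form, sample)]
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}

for form, code_points in results.items():
    print(form, code_points)
# NFC  ['U+00E9', 'U+FB01']
# NFD  ['U+0065', 'U+0301', 'U+FB01']
# NFKC ['U+00E9', 'U+0066', 'U+0069']
# NFKD ['U+0065', 'U+0301', 'U+0066', 'U+0069']
```

The canonical forms (NFC, NFD) leave the ligature alone, while the compatibility forms (NFKC, NFKD) replace it with the plain letters "f" and "i".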
Examples
Example 1: Canonical Composition
import unicodedata

text = "e\u0301"  # "e" + combining acute accent
normalized = unicodedata.normalize("NFC", text)
print(normalized)  # Outputs: "é"
Example 2: Canonical Decomposition
decomposed = unicodedata.normalize("NFD", normalized)
print(decomposed)  # Outputs: "é" (split into base and combining mark)
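Because the composed and decomposed strings usually render identically, printing them does not reveal the difference. One way to inspect it, sketched here with a hypothetical helper named describe, is to compare the string lengths and look up each character's official Unicode name with unicodedata.name().

```python
import unicodedata

composed = unicodedata.normalize("NFC", "e\u0301")
decomposed = unicodedata.normalize("NFD", composed)


def describe(text):
    """Return the Unicode name of every code point in text (illustrative helper)."""
    return [unicodedata.name(ch) for ch in text]


print(len(composed), describe(composed))
# 1 ['LATIN SMALL LETTER E WITH ACUTE']
print(len(decomposed), describe(decomposed))
# 2 ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
```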
Use Case: Comparing Strings
Two strings that look identical may be stored as different sequences of code points, in which case == reports them as unequal. Normalizing both strings to the same form before comparing removes that difference.
import unicodedata

# Normalizing for comparison
str1 = "e\u0301"  # e + combining acute accent
str2 = "\u00e9"   # é as a single precomposed character
print(str1 == str2)  # False
print(unicodedata.normalize("NFC", str1) == str2)  # True
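When neither input is known to be normalized already, it is safer to normalize both sides. The helper below is a minimal sketch of that idea; the name same_text and its default form are assumptions made for illustration, not something unicodedata provides.

```python
import unicodedata


def same_text(a: str, b: str, form: str = "NFC") -> bool:
    """Return True if a and b are equal after normalizing both to `form`."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


print(same_text("e\u0301", "\u00e9"))  # True: both become U+00E9 under NFC
print(same_text("cafe", "caf\u00e9"))  # False: the strings really differ
```

Normalizing both arguments means the result does not depend on which of the two strings happened to arrive in composed form.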
Use Case: Compatibility Normalization
The compatibility forms (NFKC and NFKD) also replace compatibility characters, such as circled digits, with their plain equivalents, which makes the text easier to process.
import unicodedata

# Compatibility normalization (NFKC)
symbol = "①"  # Circled number one (U+2460)
normalized = unicodedata.normalize("NFKC", symbol)
print(normalized)  # Outputs: "1"
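Compatibility normalization is lossy: once a character has been replaced, the original formatting cannot be recovered from the result. The sketch below applies NFKC to a few more compatibility characters; the samples dictionary and its keys are illustrative.

```python
import unicodedata

# A few compatibility characters, written as escapes to avoid ambiguity.
samples = {
    "ligature": "\ufb01",      # "ﬁ" ligature
    "superscript": "x\u00b2",  # "x" followed by superscript two
    "fullwidth": "\uff21",     # fullwidth Latin capital "A"
}

folded = {
    name: unicodedata.normalize("NFKC", text)
    for name, text in samples.items()
}

for name, text in folded.items():
    print(name, repr(text))
# ligature 'fi'
# superscript 'x2'
# fullwidth 'A'
```

Note that "x²" becomes "x2", which changes the meaning of the text. For that reason the compatibility forms tend to suit searching and matching better than storing or displaying text.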
Limitations and Dependencies
- Performance: Normalization has to examine every character, so its cost grows with the amount of text and can become noticeable on large datasets.
- Dependencies: The unicodedata module is part of Python's standard library, so no additional installation is needed.
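One possible way to limit the cost on large inputs is to skip strings that are already in the target form. The sketch below uses unicodedata.is_normalized(), which is available in Python 3.8 and later; the wrapper name normalize_if_needed is hypothetical, and whether the check saves time in practice depends on the data.

```python
import unicodedata


def normalize_if_needed(text: str, form: str = "NFC") -> str:
    """Normalize text only when it is not already in the requested form."""
    # is_normalized() requires Python 3.8+.
    if unicodedata.is_normalized(form, text):
        return text
    return unicodedata.normalize(form, text)


print(normalize_if_needed("plain ascii"))       # returned unchanged
print(len(normalize_if_needed("e\u0301")))      # 1: composed to U+00E9
```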