Published 26 December 2024 by Ray Morgan · Updated 3 January 2025

Normalization in Python

Python provides Unicode normalization through the unicodedata module in the standard library. Its normalize() function converts a string into any of the four Unicode normalization forms.

Syntax:

unicodedata.normalize(form, unistr)

Parameters:

  • form: The desired normalization form. Possible values are:
    • NFC (Canonical Composition)
    • NFD (Canonical Decomposition)
    • NFKC (Compatibility Composition)
    • NFKD (Compatibility Decomposition)
  • unistr: The string to be normalized.

Return Value:

  • The normalized string.
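
To see how the four forms differ, the following sketch applies each of them to the same input and prints the resulting code points. The sample string is chosen only for illustration; any string works.

import unicodedata

sample = "e\u0301\u2460"  # "e" + combining acute accent + circled digit one

for form in ("NFC", "NFD", "NFKC", "NFKD"):
    result = unicodedata.normalize(form, sample)
    codepoints = " ".join(f"U+{ord(ch):04X}" for ch in result)
    print(f"{form}: {codepoints}")

# NFC: U+00E9 U+2460
# NFD: U+0065 U+0301 U+2460
# NFKC: U+00E9 U+0031
# NFKD: U+0065 U+0301 U+0031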

Examples

Example 1: Canonical Composition

import unicodedata

text = "e\u0301"  # "e" followed by U+0301, the combining acute accent
normalized = unicodedata.normalize("NFC", text)
print(normalized)  # Outputs: "é" (the single precomposed character U+00E9)

Example 2: Canonical Decomposition

# Continues from Example 1: decompose the composed string back into parts
decomposed = unicodedata.normalize("NFD", normalized)
print(decomposed)       # Outputs: "é" (rendered the same, but stored as "e" + U+0301)
print(len(decomposed))  # Outputs: 2

Use Case: Comparing Strings

Two strings that appear identical may have different internal Unicode representations. Normalization ensures consistency for accurate comparisons.

import unicodedata

# Normalizing for comparison
str1 = "e\u0301"  # "e" + combining acute accent (two code points)
str2 = "é"        # single precomposed character (U+00E9)
print(str1 == str2)                                 # False
print(unicodedata.normalize("NFC", str1) == str2)   # True
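
In practice it is safest to normalize both operands before comparing. A small helper along these lines keeps that logic in one place (the name nfc_equal is purely illustrative):

import unicodedata

def nfc_equal(a, b):
    """Compare two strings after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(nfc_equal("e\u0301", "é"))  # True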

Use Case: Compatibility Decomposition

Normalization can transform compatibility characters into their simpler equivalents for easier processing.

import unicodedata

# Compatibility normalization (NFKC)
circled = "①"  # Circled digit one (U+2460)
normalized = unicodedata.normalize("NFKC", circled)
print(normalized)  # Outputs: "1"
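
The same form also folds other compatibility characters, such as ligatures and superscripts. A brief sketch (the characters below are just common cases):

import unicodedata

print(unicodedata.normalize("NFKC", "ﬁ"))  # Outputs: "fi" (latin small ligature fi, U+FB01)
print(unicodedata.normalize("NFKC", "²"))  # Outputs: "2" (superscript two, U+00B2)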

Limitations and Dependencies

  • Performance — Normalization can be computationally intensive for large datasets; when inputs are often already normalized, checking first can avoid redundant work (see the sketch after this list).
  • Dependencies — The unicodedata module is part of Python's standard library, so no additional installation is needed.
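
One way to reduce the performance cost, assuming Python 3.8 or later, is to skip strings that are already in the target form using unicodedata.is_normalized(). A minimal sketch (the helper name is illustrative):

import unicodedata

def normalize_if_needed(form, text):
    # Skip the relatively expensive normalization when the string
    # is already in the requested form (requires Python 3.8+).
    if unicodedata.is_normalized(form, text):
        return text
    return unicodedata.normalize(form, text)

print(normalize_if_needed("NFC", "é"))        # Already NFC: returned unchanged
print(normalize_if_needed("NFC", "e\u0301"))  # Composed to "é"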