Published 26 December 2024 by Ray Morgan. Updated 3 January 2025.

Normalization in Java

Java supports Unicode normalization through the java.text.Normalizer class, part of the JDK since Java 6. Its static methods convert text into any of the four standard Unicode normalization forms and test whether text is already in a given form.

Normalizer.normalize()

Converts a character sequence to the specified normalization form.

Syntax:

String Normalizer.normalize(CharSequence input, Normalizer.Form form)

Parameters:

  • input: The character sequence to normalize (any CharSequence, such as a String or StringBuilder).
  • form: The desired normalization form, represented by the Normalizer.Form enum:
    • Normalizer.Form.NFC (canonical decomposition, followed by canonical composition)
    • Normalizer.Form.NFD (canonical decomposition)
    • Normalizer.Form.NFKC (compatibility decomposition, followed by canonical composition)
    • Normalizer.Form.NFKD (compatibility decomposition)

Return Value:

  • The normalized string.
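
To make the difference between the four forms concrete, the following sketch normalizes one input with each form and prints the resulting UTF-16 code units. The class name NormalizationForms and the helper codeUnits are illustrative names, not part of the JDK.

```java
import java.text.Normalizer;

public class NormalizationForms {
    // Illustrative helper: renders each UTF-16 code unit as U+XXXX
    // so that differences invisible on screen become visible.
    public static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "fi" ligature (U+FB01) followed by precomposed "é" (U+00E9)
        String input = "\uFB01\u00E9";

        for (Normalizer.Form form : Normalizer.Form.values()) {
            String result = Normalizer.normalize(input, form);
            System.out.println(form + ": " + codeUnits(result));
        }
        // NFD:  U+FB01 U+0065 U+0301   (ligature kept, "é" split)
        // NFC:  U+FB01 U+00E9          (ligature kept, "é" composed)
        // NFKD: U+0066 U+0069 U+0065 U+0301
        // NFKC: U+0066 U+0069 U+00E9
    }
}
```

The canonical forms (NFC, NFD) leave the ligature alone, while the compatibility forms (NFKC, NFKD) replace it with the plain letters "f" and "i".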

Normalizer.isNormalized()

Checks if a string is already in the specified normalization form.

Syntax:

boolean Normalizer.isNormalized(CharSequence input, Normalizer.Form form)

Parameters:

  • input: The string to check.
  • form: The normalization form to check against.

Return Value:

  • true if the input is already in the given normalization form, false otherwise.
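
As a short illustration of isNormalized(), the sketch below checks a decomposed and a precomposed "é" against both canonical forms. The class name IsNormalizedExample is chosen for this example only.

```java
import java.text.Normalizer;

public class IsNormalizedExample {
    public static void main(String[] args) {
        String decomposed = "e\u0301"; // "e" + combining acute accent
        String composed = "\u00E9";    // precomposed "é"

        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFC)); // false
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFD)); // true
        System.out.println(Normalizer.isNormalized(composed, Normalizer.Form.NFC));   // true
        System.out.println(Normalizer.isNormalized(composed, Normalizer.Form.NFD));   // false
    }
}
```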

Examples

Example 1: Canonical Composition

import java.text.Normalizer;

public class NormalizationExample {
    public static void main(String[] args) {
        String str = "e\u0301"; // "e" + combining acute accent
        String normalized = Normalizer.normalize(str, Normalizer.Form.NFC);
        System.out.println(normalized); // Outputs: "é"
    }
}

Example 2: Canonical Decomposition

import java.text.Normalizer;

public class NormalizationExample {
    public static void main(String[] args) {
        String str = "\u00E9"; // Precomposed "é" (U+00E9)
        String decomposed = Normalizer.normalize(str, Normalizer.Form.NFD);
        System.out.println(decomposed); // Still renders as "é", but is now "e" (U+0065) + combining acute accent (U+0301)
        System.out.println(decomposed.length()); // Outputs: 2
    }
}

Use Case: Comparing Strings

Two strings can render identically yet contain different sequences of code points, in which case equals() returns false. Normalizing both strings to the same form before comparing makes the comparison reflect what the reader sees.

import java.text.Normalizer;

public class NormalizationComparison {
    public static void main(String[] args) {
        String str1 = "e\u0301"; // "e" + combining acute accent
        String str2 = "\u00E9"; // Single precomposed character "é"

        System.out.println(str1.equals(str2)); // false
        System.out.println(Normalizer.normalize(str1, Normalizer.Form.NFC).equals(str2)); // true
    }
}
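
The comparison above normalizes only one side, which works because the other string is known to be in NFC already. When neither input can be trusted, a small helper that normalizes both sides is safer. The sketch below is one possible shape; the class NormalizedEquality and the method equalsNormalized are hypothetical names, and the choice of NFC is an assumption (any form works as long as both sides use the same one).

```java
import java.text.Normalizer;

public class NormalizedEquality {
    // Hypothetical helper: compares two strings by their NFC forms.
    // Two nulls are treated as equal; a null never equals a non-null string.
    public static boolean equalsNormalized(String a, String b) {
        if (a == null || b == null) {
            return a == b;
        }
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        // Angstrom sign (U+212B) vs. "A" + combining ring above (U+030A):
        // neither is in NFC, but both normalize to U+00C5.
        System.out.println(equalsNormalized("\u212B", "A\u030A")); // true

        // Plain "e" and "é" are different letters, not equivalent forms.
        System.out.println(equalsNormalized("e", "\u00E9"));       // false
    }
}
```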

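Use Case: Compatibility Normalization

The compatibility forms (NFKC and NFKD) also replace compatibility characters, such as circled digits, ligatures, and full-width letters, with their plain equivalents, which simplifies searching and matching. This mapping discards formatting distinctions, so the original text cannot be recovered from the result.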

import java.text.Normalizer;

public class CompatibilityNormalization {
    public static void main(String[] args) {
        String str = "①"; // Circled number one
        String normalized = Normalizer.normalize(str, Normalizer.Form.NFKC);
        System.out.println(normalized); // Outputs: "1"
    }
}
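
A few more inputs show what compatibility normalization does in practice, including a case where it loses information. This is a minimal sketch; the class CompatibilityFolding and the helper nfkc are names made up for the example.

```java
import java.text.Normalizer;

public class CompatibilityFolding {
    // Illustrative shorthand for NFKC normalization.
    public static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "fi" ligature (U+FB01) becomes the two letters "f" and "i".
        System.out.println(nfkc("\uFB01le"));           // file

        // Full-width Latin letters (U+FF21..U+FF23) become ASCII letters.
        System.out.println(nfkc("\uFF21\uFF22\uFF23")); // ABC

        // Superscript two (U+00B2) becomes a plain digit:
        // the result no longer reads as "x squared".
        System.out.println(nfkc("x\u00B2"));            // x2
    }
}
```

The last case is why the compatibility forms tend to suit search keys and identifiers better than text that will be displayed back to users.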

Limitations and Dependencies

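  • Performance: Normalizing every string has a cost that adds up on large datasets, so it is usually better to normalize text once, when it enters the system, than at every comparison.
  • Complexity: Java does not normalize strings automatically. String.equals() and hashCode() work on the raw UTF-16 code units, so normalization has to be called explicitly wherever it matters.
  • Dependencies: The Normalizer class is part of the standard Java Development Kit (JDK), so no additional libraries are needed.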
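
One way to soften the first two points is to confine normalization to a single helper and apply it when text is stored. The sketch below assumes NFC as the storage form; NormalizedSet and toNfc are hypothetical names. The isNormalized() check in front of normalize() is an optional guard, and whether it saves any time depends on the JDK version and the data, so it is worth measuring before relying on it.

```java
import java.text.Normalizer;
import java.util.HashSet;
import java.util.Set;

// Hypothetical wrapper: a string set whose members are stored in NFC.
public class NormalizedSet {
    private final Set<String> values = new HashSet<>();

    // The only place where normalization happens.
    // Text that is already in NFC is returned unchanged.
    public static String toNfc(String s) {
        return Normalizer.isNormalized(s, Normalizer.Form.NFC)
                ? s
                : Normalizer.normalize(s, Normalizer.Form.NFC);
    }

    public boolean add(String s) {
        return values.add(toNfc(s));
    }

    public boolean contains(String s) {
        return values.contains(toNfc(s));
    }

    public int size() {
        return values.size();
    }

    public static void main(String[] args) {
        NormalizedSet names = new NormalizedSet();
        names.add("Jos\u00E9");                           // precomposed "é"
        System.out.println(names.add("Jose\u0301"));      // false: same name after NFC
        System.out.println(names.contains("Jose\u0301")); // true
        System.out.println(names.size());                 // 1
    }
}
```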