Normalization in Java
Java provides Unicode normalization capabilities through the java.text.Normalizer
class, which allows strings to be normalized into different Unicode normalization forms.
Syntax:
String Normalizer.normalize(CharSequence input, Normalizer.Form form)
Parameters:
- input: The string to be normalized.
- form: The desired normalization form, represented by the Normalizer.Form enum:
  - Normalizer.Form.NFC (Canonical Composition)
  - Normalizer.Form.NFD (Canonical Decomposition)
  - Normalizer.Form.NFKC (Compatibility Composition)
  - Normalizer.Form.NFKD (Compatibility Decomposition)
Return Value:
- The normalized string.
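To make the difference between the four forms concrete, the following sketch normalizes one string into every form and prints the resulting UTF-16 code units. The class name, the codeUnits helper, and the sample input are illustrative assumptions, not part of the Normalizer API.

```java
import java.text.Normalizer;

public class NormalizationForms {
    // Illustrative helper: renders each UTF-16 code unit as U+XXXX
    // so that visually identical strings can be told apart.
    static String codeUnits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Sample input: precomposed "é" followed by the "fi" ligature (U+FB01).
        String input = "\u00E9\uFB01";

        // Apply every constant of the Normalizer.Form enum to the same input.
        for (Normalizer.Form form : Normalizer.Form.values()) {
            String out = Normalizer.normalize(input, form);
            System.out.println(form + ": " + codeUnits(out));
        }
    }
}
```

For this input, the canonical forms (NFC, NFD) leave the ligature untouched and only compose or decompose the accented letter, while the compatibility forms (NFKC, NFKD) also replace the ligature with the plain letters "f" and "i".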
Normalizer.isNormalized()
Checks if a string is already in the specified normalization form.
Syntax:
boolean Normalizer.isNormalized(CharSequence input, Normalizer.Form form)
Parameters:
- input: The string to check.
- form: The normalization form to check against.
Return Value:
- true if the string is normalized, false otherwise.
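A minimal sketch of how isNormalized() behaves for the same character in composed and decomposed form. The class name and the isNfc helper are hypothetical names chosen for this illustration.

```java
import java.text.Normalizer;

public class IsNormalizedExample {
    // Hypothetical convenience wrapper around Normalizer.isNormalized for NFC.
    static boolean isNfc(CharSequence text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "e\u0301";  // "e" + combining acute accent
        String precomposed = "\u00E9";  // single precomposed code point

        System.out.println(isNfc(decomposed));   // false
        System.out.println(isNfc(precomposed));  // true

        // The decomposed sequence is already in NFD, so this check passes.
        System.out.println(Normalizer.isNormalized(decomposed, Normalizer.Form.NFD)); // true
    }
}
```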
Examples
Example 1: Canonical Composition
import java.text.Normalizer;

public class NormalizationExample {
    public static void main(String[] args) {
        String str = "e\u0301"; // "e" + combining acute accent
        String normalized = Normalizer.normalize(str, Normalizer.Form.NFC);
        System.out.println(normalized); // Outputs: "é"
    }
}
Example 2: Canonical Decomposition
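import java.text.Normalizer;

public class NormalizationExample {
    public static void main(String[] args) {
        // Written as an escape so the precomposed form survives any editor or encoding changes
        String str = "\u00E9"; // "é" as a single precomposed character
        String decomposed = Normalizer.normalize(str, Normalizer.Form.NFD);
        System.out.println(decomposed); // Outputs: "é" (split into base and combining mark)
        System.out.println(decomposed.length()); // Outputs: 2
    }
}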
Use Case: Comparing Strings
Strings may look identical but differ in their internal Unicode representations. Normalization ensures consistency for accurate comparisons.
import java.text.Normalizer;

public class NormalizationComparison {
    public static void main(String[] args) {
        String str1 = "e\u0301"; // "e" + combining acute accent
        String str2 = "\u00E9";  // "é" as a single precomposed character
        System.out.println(str1.equals(str2)); // false
        System.out.println(Normalizer.normalize(str1, Normalizer.Form.NFC).equals(str2)); // true
    }
}
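When neither string is known to be normalized, both sides can be brought to the same form before comparing. The helper below is a sketch under that assumption. The name equalsNormalized is hypothetical, and NFC is just one reasonable choice of common form.

```java
import java.text.Normalizer;

public class NormalizedEquality {
    // Hypothetical helper: normalizes both inputs to NFC, then compares them.
    static boolean equalsNormalized(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String decomposed = "cafe\u0301";  // "cafe" + combining acute accent
        String precomposed = "caf\u00E9";  // "café" with a precomposed "é"

        System.out.println(decomposed.equals(precomposed));            // false
        System.out.println(equalsNormalized(decomposed, precomposed)); // true
    }
}
```

Normalizing both arguments means the result does not depend on which form either caller happened to supply.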
Use Case: Compatibility Decomposition
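Compatibility normalization (NFKC or NFKD) replaces compatibility characters, such as circled digits, with simpler equivalents for easier processing. The example uses NFKC, which applies compatibility decomposition and then canonical composition.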
import java.text.Normalizer;

public class CompatibilityNormalization {
    public static void main(String[] args) {
        String str = "①"; // Circled number one
        String normalized = Normalizer.normalize(str, Normalizer.Form.NFKC);
        System.out.println(normalized); // Outputs: "1"
    }
}
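Compatibility mappings cover more than circled digits. The sketch below applies NFKC to full-width letters and a superscript digit. The class and the toNfkc helper are hypothetical names for illustration. Note that the mapping is lossy: once the superscript is folded to a plain digit, the original formatting cannot be recovered.

```java
import java.text.Normalizer;

public class CompatibilityFolding {
    // Hypothetical helper: applies NFKC to the given text.
    static String toNfkc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Full-width Latin letters spelling "Java" (U+FF2A U+FF41 U+FF56 U+FF41).
        String fullWidth = "\uFF2A\uFF41\uFF56\uFF41";
        System.out.println(toNfkc(fullWidth)); // Outputs: "Java"

        // "x" followed by SUPERSCRIPT TWO (U+00B2); the superscript becomes a plain "2".
        String squared = "x\u00B2";
        System.out.println(toNfkc(squared)); // Outputs: "x2"
    }
}
```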
Limitations and Dependencies
- Performance: Normalization can be computationally intensive for large datasets.
- Complexity: Requires explicit calls to the Normalizer class, making normalization an additional step in text processing workflows.
- Dependencies: The Normalizer class is part of the standard Java Development Kit (JDK), so no additional dependencies are needed.
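One common way to limit the cost on large inputs is to check isNormalized() first and call normalize() only when needed. The sketch below assumes that most input is already in NFC. The helper name toNfc is hypothetical, and any actual speedup depends on the data and JDK version and should be measured.

```java
import java.text.Normalizer;

public class NfcFastPath {
    // Hypothetical helper: returns the input unchanged when it is already in NFC,
    // and otherwise returns a newly normalized string.
    static String toNfc(String text) {
        if (Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
            return text; // already normalized, no new string is created
        }
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String plain = "hello";
        String decomposed = "e\u0301"; // "e" + combining acute accent

        System.out.println(toNfc(plain) == plain);          // true (same object returned)
        System.out.println(toNfc(decomposed).length());     // 1
    }
}
```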