Normalization and Diacritic Folding
Character encoding and normalization have already been discussed, but we need to briefly examine how they relate to search and indexing. Encoding defines how text is represented in bytes, and standard encodings like UTF-8 ensure that characters from diverse languages can be stored and processed without conflicts. However, challenges arise when there are multiple ways of representing the same character. For example, the character “é” can be encoded as a single code point (precomposed: U+00E9) or as a combination of two code points (decomposed: U+0065 for “e” + U+0301 for the combining accent). These differences can cause inconsistencies during indexing and retrieval.
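To make this concrete, here is a minimal Python sketch (standard library only) showing that the two spellings of “é” are distinct strings even though they render identically:

```python
precomposed = "\u00e9"   # single code point: LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

print(precomposed, decomposed)            # both render as "é"
print(precomposed == decomposed)          # False: the underlying code points differ
print(len(precomposed), len(decomposed))  # 1 2
```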
Normalization resolves this issue by standardizing how characters are represented before indexing or processing. It ensures that equivalent forms of a character are treated identically, regardless of how they were originally encoded. For example, Unicode normalization can convert all characters to a consistent form, such as NFC (Normalization Form C, which uses precomposed characters) or NFD (Normalization Form D, which uses decomposed characters). This is crucial in multilingual systems, as it ensures that searches for text like “résumé” will match regardless of how the accented characters were encoded in the source data. Incorporating normalization into your indexing workflow is essential for consistent and accurate search functionality across languages and platforms.
Example: A document contains the word “résumé” in decomposed form (with accents stored as code points separate from their base characters) as:
r + e + ́ + s + u + m + e + ́
A user searches for “résumé,” entered in precomposed form (each accented letter as a single code point) as:
résumé
Without normalization, the search may fail because the system treats the two forms as different strings.
Solution: To address this, normalization ensures that all text is represented consistently before indexing. The most common approaches are:
- Normalization Form C (NFC): Converts characters to their precomposed forms.
- Normalization Form D (NFD): Converts characters to their decomposed forms.
Implementation Example:
- In Python:

```python
import unicodedata

text = "résumé"
decomposed = unicodedata.normalize('NFD', text)   # 'r', 'e', U+0301, 's', 'u', 'm', 'e', U+0301
precomposed = unicodedata.normalize('NFC', text)  # 'résumé' with precomposed characters
print(decomposed, precomposed)
```
- In Elasticsearch: Use the `icu_normalizer` filter (provided by the ICU analysis plugin) to standardize text:

```json
"settings": {
  "analysis": {
    "filter": {
      "nfc_normalizer": {
        "type": "icu_normalizer",
        "name": "nfc"
      }
    },
    "analyzer": {
      "default": {
        "tokenizer": "standard",
        "filter": ["lowercase", "nfc_normalizer"]
      }
    }
  }
}
```
By normalizing text during preprocessing and indexing, you ensure that searches for “résumé” will match regardless of how the accented characters were encoded in the source data. This approach eliminates inconsistencies and improves search accuracy across multilingual datasets.
Normalizing Search Queries
The search query and the indexed content must both be normalized in the same way for consistency and accurate search results. If the query and index use different normalization forms, even equivalent characters may not match, leading to failed searches or incorrect results.
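As a minimal illustration of the failure mode, assume the index pipeline happened to use NFD while the query pipeline used NFC; an exact comparison then fails on equivalent text:

```python
import unicodedata

indexed = unicodedata.normalize('NFD', "résumé")  # index side normalized to NFD
query = unicodedata.normalize('NFC', "résumé")    # query side normalized to NFC

print(indexed == query)  # False: equivalent text, mismatched normalization forms
print(unicodedata.normalize('NFC', indexed) == query)  # True once both sides agree
```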
As for which form is preferred, the choice between NFC (precomposed) and NFD (decomposed) depends on the specific use case, but NFC is typically the better default in practice, including for search systems.
Why NFC is Preferred
- Compactness: NFC combines characters into their precomposed forms, which tend to be shorter. This can save storage space and make string comparisons more efficient; for example, NFC represents “é” as the single code point U+00E9, while NFD represents it as U+0065 (“e”) + U+0301 (combining accent). See the sketch after this list. NFC also avoids the need to process additional code points, simplifying operations like sorting and matching.
- Better Compatibility: Many systems, libraries, and fonts are optimized for NFC because it aligns more closely with how characters are traditionally displayed and stored. NFC is the default form used in most text inputs and outputs, including HTML, XML, and database systems.
- Ease of Implementation: When using NFC, you don't need to handle combining characters explicitly, as they're already composed.
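A quick check of the compactness point, using Python's standard unicodedata module:

```python
import unicodedata

word = "résumé"
print(len(unicodedata.normalize('NFC', word)))  # 6: each accented letter is one code point
print(len(unicodedata.normalize('NFD', word)))  # 8: each accent is a separate combining mark
```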
Why You Might Use NFD
- Granular Processing: NFD provides finer control over the individual components of characters, such as separating base letters from diacritics. This is useful in specialized tasks like phonetic matching or transliteration.
- Interoperability with Systems Using Decomposed Forms: If you're working with tools or datasets that favor decomposed forms, NFD ensures consistency.
Recommendations
- Normalize Both Index and Query: Ensure that both the indexed content and the search query are normalized to the same form, typically NFC.
- Specify the Form Explicitly: When configuring systems like Elasticsearch or Solr, use normalization filters and make sure they are applied consistently across both indexing and querying.
Practical Example:
- Python:

```python
import unicodedata

# Normalize both query and content to the same form
def normalize_text(text):
    return unicodedata.normalize('NFC', text)

index = normalize_text("résumé")
query = normalize_text("résumé")
print(index == query)  # True
```
- Elasticsearch: Ensure that both the index-time `analyzer` and the `search_analyzer` are configured to use the same normalization filter (e.g., NFC):

```json
"settings": {
  "analysis": {
    "filter": {
      "nfc_normalizer": {
        "type": "icu_normalizer",
        "name": "nfc"
      }
    },
    "analyzer": {
      "default": {
        "tokenizer": "standard",
        "filter": ["lowercase", "nfc_normalizer"]
      }
    }
  }
}
```
Conclusion
NFC is generally preferred because of its compactness, compatibility, and ease of use. However, the key to successful search functionality is ensuring consistent normalization across both your indexed data and search queries. If you decide to use NFD for specific needs, make sure the entire system adheres to that standard.
Diacritic Folding
Normalization to NFC alone does not reconcile differences between characters with diacritics and those without. NFC ensures consistency in how characters with diacritics are represented (e.g., combining characters vs. precomposed characters), but it does not remove or ignore diacritics. Reconciling such differences requires an additional step: diacritic folding.
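A two-line check makes the limitation concrete: neither normalization form turns an accented word into its accent-free counterpart:

```python
import unicodedata

print(unicodedata.normalize('NFC', "résumé") == "resume")  # False: NFC keeps the accents
print(unicodedata.normalize('NFD', "résumé") == "resume")  # False: NFD only separates them
```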
Diacritic folding (or accent folding) is the process of removing diacritics from characters to create a simplified form. For example:
- Original: `résumé`
- Folded: `resume`

This allows a search query like `resume` to match indexed terms like `résumé` without requiring users to include diacritics in their input.
Matching Characters With and Without Diacritics
- Indexing Approach: During indexing, apply diacritic folding to create a simplified version of the text while retaining the original form for display purposes. This allows searches to match terms regardless of whether the query includes diacritics. For example, index both `résumé` (original) and `resume` (folded).
- Implementation in Elasticsearch: Use the `asciifolding` filter to remove diacritics:

```json
"settings": {
  "analysis": {
    "filter": {
      "asciifolding_filter": {
        "type": "asciifolding",
        "preserve_original": true
      }
    },
    "analyzer": {
      "default": {
        "tokenizer": "standard",
        "filter": ["lowercase", "asciifolding_filter"]
      }
    }
  }
}
```

In this configuration, `résumé` is indexed as both `résumé` and `resume`.
- Query Handling: Apply the same diacritic folding to the search query to ensure consistency. Example in Python:

```python
import unicodedata

def remove_diacritics(text):
    # Decompose to NFD, then drop the combining marks
    normalized = unicodedata.normalize('NFD', text)
    return ''.join(c for c in normalized if not unicodedata.combining(c))

index = "résumé"
query = "resume"
print(remove_diacritics(index) == query)  # True
```
Combined Strategy
To reconcile these differences:
- Normalize all text to NFC to standardize how characters with diacritics are represented.
- Apply Diacritic Folding during indexing and querying to remove diacritics where necessary.
- Retain Original Form for display purposes to ensure results appear as users expect.
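Putting the three steps together, here is a minimal in-memory sketch (plain Python with a hypothetical SimpleIndex class, not tied to any particular search engine) of how normalization and folding combine at index and query time:

```python
import unicodedata

def normalize(text):
    """Standardize representation (NFC)."""
    return unicodedata.normalize('NFC', text)

def fold(text):
    """Remove diacritics: decompose to NFD, then drop combining marks."""
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

class SimpleIndex:
    """Hypothetical index that stores a folded key alongside the original text."""

    def __init__(self):
        self.entries = []  # list of (folded_key, original_for_display) pairs

    def add(self, text):
        original = normalize(text)  # keep the NFC original for display
        self.entries.append((fold(original).lower(), original))

    def search(self, query):
        key = fold(normalize(query)).lower()
        return [original for folded, original in self.entries if folded == key]

idx = SimpleIndex()
idx.add("résumé")
print(idx.search("resume"))  # ['résumé']: a match despite the missing accents
```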
Practical Example
- Indexed Content:
  - Original: `résumé`
  - Normalized: `résumé` (NFC applied)
  - Folded: `resume` (diacritics removed)
- Search Query:
  - Input: `resume`
  - Normalized: `resume` (NFC applied)
  - Folded: `resume` (no diacritics to remove)
- Matching Logic: Both folded versions (`resume`) are compared, ensuring a match.
Conclusion
Normalization to NFC ensures consistent representation of characters with diacritics but does not reconcile them with non-diacritic characters. To handle this, diacritic folding should be applied during both indexing and query processing. This ensures that searches for terms like `resume` can match both `resume` and `résumé` while maintaining the fidelity of the original text for display.