published
9 January 2025
by
Ray Morgan

Normalization and Diacritic Folding

Character encoding and normalization have already been discussed, but we need to briefly examine how they relate to search and indexing. Encoding defines how text is represented in bytes, and standard encodings like UTF-8 ensure that characters from diverse languages can be stored and processed without conflicts. However, challenges arise when there are multiple ways of representing the same character. For example, the character “é” can be encoded as a single code point (precomposed: U+00E9) or as a combination of two code points (decomposed: U+0065 for "e" + U+0301 for the combining accent). These differences can cause inconsistencies during indexing and retrieval.
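
The difference is easy to see in Python. A minimal sketch (the escape sequences spell out the two encodings of “é”):

  precomposed = "\u00E9"   # "é" as a single code point
  decomposed = "e\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

  print(precomposed, decomposed)            # both render as "é"
  print(precomposed == decomposed)          # False: different code point sequences
  print(len(precomposed), len(decomposed))  # 1 2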

Normalization resolves this issue by standardizing how characters are represented before indexing or processing. It ensures that equivalent forms of a character are treated identically, regardless of how they were originally encoded. For example, Unicode normalization can convert all characters to a consistent form, such as NFC (Normalization Form C, which uses precomposed characters) or NFD (Normalization Form D, which uses decomposed characters). This is crucial in multilingual systems, as it ensures that searches for text like “résumé” will match regardless of how the accented characters were encoded in the source data. Incorporating normalization into your indexing workflow is essential for consistent and accurate search functionality across languages and platforms.

Example: A document contains the word “résumé” in decomposed form (with the accents stored as separate code points after their base characters) as:

r + e + ́ + s + u + m + e + ́ 

A user searches for “résumé,” which is entered in precomposed form (with each accented character as a single code point) as:

résumé

Without normalization, the search may fail because the system treats the two forms as different strings.
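
This is easy to reproduce with a naive substring search; a minimal sketch, with escape sequences standing in for the decomposed document text and the precomposed query:

  import unicodedata

  document = "My re\u0301sume\u0301 is attached."  # decomposed "résumé"
  query = "r\u00e9sum\u00e9"                       # precomposed "résumé"

  print(query in document)  # False: the code point sequences differ
  print(unicodedata.normalize('NFC', query) in
        unicodedata.normalize('NFC', document))    # True once both are normalized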

Solution: Normalize all text to a single, consistent form before indexing. The most common forms are:

  • Normalization Form C (NFC): Converts characters to their precomposed forms.
  • Normalization Form D (NFD): Converts characters to their decomposed forms.

Implementation Example:

  • In Python:
    import unicodedata
    
    text = "résumé"
    decomposed = unicodedata.normalize('NFD', text)   # "re\u0301sume\u0301": accents as separate combining marks
    precomposed = unicodedata.normalize('NFC', text)  # "résumé": accents combined into single code points
    print(decomposed, precomposed)
    
  • In Elasticsearch: Use the icu_normalizer filter (provided by the analysis-icu plugin, which must be installed separately) to standardize text:
    "settings": {
      "analysis": {
        "filter": {
          "nfc_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc"
          }
        },
        "analyzer": {
          "default": {
            "tokenizer": "standard",
            "filter": ["lowercase", "nfc_normalizer"]
          }
        }
      }
    }
    

By normalizing text during preprocessing and indexing, you ensure that searches for “résumé” will match regardless of how the accented characters were encoded in the source data. This approach eliminates inconsistencies and improves search accuracy across multilingual datasets.

Normalizing Search Queries

The search query and the indexed content must both be normalized in the same way for consistency and accurate search results. If the query and index use different normalization forms, even equivalent characters may not match, leading to failed searches or incorrect results.

Which form should you choose? The decision between NFC (precomposed) and NFD (decomposed) depends on the specific use case, but NFC is typically preferred in most practical applications, including search systems.

Why NFC is Preferred

  1. Compactness:

    • NFC combines characters into their precomposed forms, which tend to be shorter. This can save storage space and make string comparisons more efficient.
    • Example:
      • NFC: U+00E9 ("é")
      • NFD: U+0065 ("e") + U+0301 (combining acute accent).
    • NFC avoids the need to process additional code points, simplifying operations like sorting and matching (see the sketch after this list).
  2. Better Compatibility:

    • Many systems, libraries, and fonts are optimized for NFC because it aligns more closely with how characters are traditionally displayed and stored.
    • NFC is the default form used in most text inputs and outputs, including HTML, XML, and database systems.
  3. Ease of Implementation:

    • When using NFC, you don't need to handle combining characters explicitly, as they're already composed.
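
The compactness point is easy to quantify; a minimal sketch comparing code point and UTF-8 byte counts for the two forms:

  import unicodedata

  text = "é"
  nfc = unicodedata.normalize('NFC', text)
  nfd = unicodedata.normalize('NFD', text)

  print(len(nfc), len(nfd))                                  # 1 2 (code points)
  print(len(nfc.encode('utf-8')), len(nfd.encode('utf-8')))  # 2 3 (bytes)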

Why You Might Use NFD

  1. Granular Processing:

    • NFD provides finer control over individual components of characters, such as separating base letters and diacritics. This is useful in specialized tasks like phonetic matching or transliteration; a short sketch follows this list.
  2. Interoperability with Systems Using Decomposed Forms:

    • If you're working with tools or datasets that favor decomposed forms, NFD ensures consistency.
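
For instance, NFD makes it easy to inspect the individual components of an accented character; a minimal sketch using the standard unicodedata module:

  import unicodedata

  for ch in unicodedata.normalize('NFD', "é"):
      print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
  # U+0065 LATIN SMALL LETTER E
  # U+0301 COMBINING ACUTE ACCENT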

Recommendations

  1. Normalize Both Index and Query:
    • Ensure that both the indexed content and the search query are normalized to the same form, typically NFC.
  2. Specify the Form Explicitly:
    • When configuring systems like Elasticsearch or Solr, use normalization filters and make sure they are applied consistently across both indexing and querying.

Practical Example:

  • Python:
    import unicodedata
    
    # Normalizing both query and content
    def normalize_text(text):
        return unicodedata.normalize('NFC', text)
    
    index = normalize_text("résumé")
    query = normalize_text("résumé")
    
    print(index == query)  # True
    
  • Elasticsearch: Ensure that a field's analyzer and search_analyzer both apply the same normalization filter (e.g., NFC). By default, search_analyzer falls back to analyzer, but setting both makes the intent explicit. A sketch, using a hypothetical title field:
    "settings": {
      "analysis": {
        "filter": {
          "nfc_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc"
          }
        },
        "analyzer": {
          "normalized": {
            "tokenizer": "standard",
            "filter": ["lowercase", "nfc_normalizer"]
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "normalized",
          "search_analyzer": "normalized"
        }
      }
    }
    

Conclusion

NFC is generally preferred because of its compactness, compatibility, and ease of use. However, the key to successful search functionality is ensuring consistent normalization across both your indexed data and search queries. If you decide to use NFD for specific needs, make sure the entire system adheres to that standard.

Diacritic Folding

Normalization to NFC alone does not reconcile differences between characters with diacritics and those without. NFC ensures consistency in how characters with diacritics are represented (e.g., combining characters vs. precomposed characters), but it does not remove or ignore diacritics. Reconciling such differences requires an additional step: diacritic folding.
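
A quick check makes this concrete (a minimal sketch):

  import unicodedata

  print(unicodedata.normalize('NFC', "résumé") == "resume")  # False: NFC keeps the accents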

Diacritic folding (or accent folding) is the process of removing diacritics from characters to create a simplified form. For example:

  • Original: résumé
  • Folded: resume

This allows a search query like resume to match indexed terms like résumé without requiring users to include diacritics in their input.

Matching Text With and Without Diacritics

  1. Indexing Approach:

    • During indexing, apply diacritic folding to create a simplified version of the text while retaining the original form for display purposes. This allows searches to match terms regardless of whether the query includes diacritics.

    • Example:

      • Index both résumé (original) and resume (folded).
    • Implementation in Elasticsearch: Use the asciifolding filter to remove diacritics:

      "settings": {
        "analysis": {
          "filter": {
            "asciifolding_filter": {
              "type": "asciifolding",
              "preserve_original": true
            }
          },
          "analyzer": {
            "default": {
              "tokenizer": "standard",
              "filter": ["lowercase", "asciifolding_filter"]
            }
          }
        }
      }
      

      In this configuration, résumé is indexed as both résumé and resume.

  2. Query Handling:

    • Apply the same diacritic folding to the search query to ensure consistency.
    • Example in Python:
      import unicodedata
      
      def remove_diacritics(text):
          normalized = unicodedata.normalize('NFD', text)
          return ''.join(c for c in normalized if not unicodedata.combining(c))
      
      index = "résumé"
      query = "resume"
      
      print(remove_diacritics(index) == query)  # True
      

Combined Strategy

To reconcile these differences:

  1. Normalize all text to NFC to standardize how characters with diacritics are represented.
  2. Apply Diacritic Folding during indexing and querying to remove diacritics where necessary.
  3. Retain Original Form for display purposes to ensure results appear as users expect.

Practical Example

  • Indexed Content:

    • Original: résumé
    • Normalized: résumé (NFC applied)
    • Folded: resume (diacritics removed)
  • Search Query:

    • Input: resume
    • Normalized: resume (NFC applied; unchanged)
    • Folded: resume (no diacritics to remove)
  • Matching Logic: Both folded versions (resume) are compared, ensuring a match.
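
Putting the pieces together, here is a minimal end-to-end sketch of the normalize-then-fold pipeline in Python (the function names are illustrative):

  import unicodedata

  def normalize(text):
      # Step 1: standardize representation (NFC)
      return unicodedata.normalize('NFC', text)

  def fold(text):
      # Step 2: decompose (NFD) and drop the combining marks
      decomposed = unicodedata.normalize('NFD', text)
      return ''.join(c for c in decomposed if not unicodedata.combining(c))

  # Indexing: keep the original for display, store the folded form for matching
  original = "re\u0301sume\u0301"      # decomposed "résumé" from the source data
  indexed = fold(normalize(original))  # "resume"

  # Querying: run user input through the same pipeline
  query = fold(normalize("resume"))    # "resume"

  print(indexed == query)  # True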

Conclusion

Normalization to NFC ensures consistent representation of characters with diacritics but does not reconcile them with non-diacritic characters. To handle this, diacritic folding should be applied during both indexing and query processing. This ensures that searches for terms like resume can match both resume and résumé while maintaining the fidelity of the original text for display.