Published 9 January 2025 by Ray Morgan

8. Practical Implementation


8.1 Setting Up Language-Specific Indices

  • Challenges:
    • Managing separate indices for each language while ensuring consistency in search results.
  • Examples:
    • A news website with English, Spanish, and French articles requires language-specific tokenization and stemming.
  • Implementation:
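    • Creating Language-Specific Indices in Elasticsearch (the custom stemmer filter must be defined in the same settings block, and the analyzer only takes effect once a field is mapped to it):
      PUT /news_en
      {
        "settings": {
          "analysis": {
            "filter": {
              "english_stemmer": {
                "type": "stemmer",
                "language": "english"
              }
            },
            "analyzer": {
              "english_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "english_stemmer"]
              }
            }
          }
        },
        "mappings": {
          "properties": {
            "content": { "type": "text", "analyzer": "english_analyzer" }
          }
        }
      }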
      
      Repeat the process for news_es (Spanish) and news_fr (French) with corresponding analyzers.
    • Querying Across Indices: Use multi-index search to query all languages:
      GET /news_*/_search
      {
        "query": { "match": { "content": "football" } }
      }
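
Because the three index definitions differ only in the language name, they can be generated rather than written by hand. The following is a minimal Python sketch under stated assumptions: the helper names `build_index_body` and `build_all_indices` are hypothetical, the `content` field comes from the query above, and the language names are the ones Elasticsearch's `stemmer` filter accepts. It only builds the request bodies; sending them to a cluster is left out.

```python
import json

# Assumed mapping from index suffix to the language name used by the stemmer filter.
LANGUAGES = {"en": "english", "es": "spanish", "fr": "french"}


def build_index_body(language: str) -> dict:
    """Return index settings and mappings for one language (hypothetical helper)."""
    stemmer = f"{language}_stemmer"
    analyzer = f"{language}_analyzer"
    return {
        "settings": {
            "analysis": {
                "filter": {stemmer: {"type": "stemmer", "language": language}},
                "analyzer": {
                    analyzer: {
                        "type": "custom",
                        "tokenizer": "standard",
                        "filter": ["lowercase", stemmer],
                    }
                },
            }
        },
        "mappings": {
            "properties": {"content": {"type": "text", "analyzer": analyzer}}
        },
    }


def build_all_indices() -> dict:
    """Map each index name (news_en, news_es, news_fr) to its request body."""
    return {f"news_{code}": build_index_body(name) for code, name in LANGUAGES.items()}


if __name__ == "__main__":
    for index_name, body in build_all_indices().items():
        print(f"PUT /{index_name}")
        print(json.dumps(body, indent=2))
```

Generating the bodies from one template keeps the analyzer chain identical across languages, which helps with the consistency challenge noted above.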
      

8.2 Configuring Stop Words and Stemming

  • Challenges:
    • Each language requires a tailored approach to stop words and stemming for effective full-text search.
  • Examples:
    • Searching for “running” in English should match “run,” while in German “laufend” should reduce to “lauf”; the exact stems depend on the stemmer chosen for each language.
  • Implementation:
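    • Custom Stop Words: Add a stop word filter for each language. Solr and Elasticsearch both support this; the Elasticsearch form below references a built-in list by name (for example _english_, _spanish_, or _french_):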
      "filter": {
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        }
      }
      
    • Stemming: Configure a stemmer token filter for each supported language (Elasticsearch also provides a separate snowball filter type for the Snowball stemmers):
      "filter": {
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        }
      }
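
To make the effect of these two filters concrete, here is a toy Python sketch of the same pipeline: tokenize, lowercase, drop stop words, then stem. It is an illustration only. The stop list is a tiny assumed sample and `naive_stem` is a hypothetical suffix stripper, not the Porter or Snowball algorithm that the search engine actually applies.

```python
import re

# Tiny illustrative stop list; real deployments use the engine's built-in lists.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "of", "to"}


def naive_stem(token: str) -> str:
    """Strip a few English suffixes. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            stem = token[: -len(suffix)]
            # Collapse a doubled final consonant, e.g. "runn" -> "run".
            if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] not in "aeiou":
                stem = stem[:-1]
            return stem
    return token


def analyze(text: str) -> list[str]:
    """Lowercase and tokenize, remove stop words, then stem each token."""
    tokens = re.findall(r"\w+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


if __name__ == "__main__":
    print(analyze("The dog is running"))
    print(analyze("run"))
```

Since both the indexed text and the query pass through the same chain, “running” and “run” end up as the same term, which is why one matches the other.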
      

8.3 Building Synonym Support

  • Challenges:
    • Ensuring synonym mappings are relevant and contextual.
  • Examples:
    • A search for “car” in English retrieves results for “automobile,” while “film” matches “movie.”
  • Implementation:
    • Defining Synonyms in Elasticsearch:
      "filter": {
        "synonym_filter": {
          "type": "synonym",
          "synonyms": [
            "car, automobile, vehicle",
            "film, movie"
          ]
        }
      }
      
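    • Testing Synonym Behavior: Run a query against a field whose analyzer includes synonym_filter; documents containing “automobile” or “vehicle” should then match as well: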
      GET /products/_search
      {
        "query": {
          "match": {
            "description": "car"
          }
        }
      }
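
The comma-separated rules above declare groups of equivalent terms. As a rough sketch of what that equivalence means at query time, the Python below parses the same two rules into a lookup table and expands a single token. The function names `build_synonym_lookup` and `expand` are hypothetical, and the real filter works on token streams inside the analyzer rather than on a dictionary like this.

```python
# The same rules as in the filter definition above.
SYNONYM_RULES = ["car, automobile, vehicle", "film, movie"]


def build_synonym_lookup(rules: list[str]) -> dict[str, set[str]]:
    """Map every term to the full set of terms in its equivalence group."""
    lookup: dict[str, set[str]] = {}
    for rule in rules:
        group = {term.strip().lower() for term in rule.split(",") if term.strip()}
        for term in group:
            lookup.setdefault(term, set()).update(group)
    return lookup


def expand(token: str, lookup: dict[str, set[str]]) -> set[str]:
    """Return the token together with its synonyms, or just the token if none exist."""
    key = token.lower()
    return lookup.get(key, {key})


if __name__ == "__main__":
    table = build_synonym_lookup(SYNONYM_RULES)
    print(sorted(expand("car", table)))
    print(sorted(expand("bus", table)))
```

A table like this also makes the relevance concern visible: grouping “vehicle” with “car” means a search for “car” may return trucks and buses.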
      

8.4 Implementing Fuzzy Search

  • Challenges:
    • Correcting typos and phonetic variations in queries while maintaining relevance.
  • Examples:
    • Searching for “restuarant” should match “restaurant.”
  • Implementation:
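    • Fuzzy Queries in Elasticsearch (these match terms within a small edit distance of the query term; phonetic matching needs a separate phonetic analysis filter):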
      {
        "query": {
          "fuzzy": {
            "name": {
              "value": "restuarant",
              "fuzziness": "AUTO"
            }
          }
        }
      }
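
The fuzzy query is based on edit distance, with a swap of two adjacent characters counted as a single edit by default. With "fuzziness": "AUTO", the allowed distance depends on term length: terms of up to 2 characters must match exactly, terms of 3 to 5 characters allow one edit, and longer terms allow two. The Python sketch below reproduces that rule with a small optimal-string-alignment distance. The function names are hypothetical and the code illustrates the matching rule only, not how Lucene implements it.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance where an adjacent transposition counts as one edit."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ua" <-> "au".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def auto_fuzziness(term: str) -> int:
    """Maximum edits allowed by the AUTO setting for a term of this length."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_match(query_term: str, indexed_term: str) -> bool:
    """True if the indexed term is within the AUTO edit budget of the query term."""
    return osa_distance(query_term, indexed_term) <= auto_fuzziness(query_term)


if __name__ == "__main__":
    print(osa_distance("restuarant", "restaurant"))
    print(fuzzy_match("restuarant", "restaurant"))
```

Here “restuarant” differs from “restaurant” by one transposition, well inside the two edits allowed for a ten-character term.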
      

8.5 Autocomplete Suggestions

  • Challenges:
    • Providing real-time, localized autocomplete suggestions for multilingual users.
  • Examples:
    • Typing “rec” in English suggests “recipes,” while in French it suggests “recettes.”
  • Implementation:
    • Building Autocomplete in Elasticsearch:
      PUT /suggest_index
      {
        "mappings": {
          "properties": {
            "suggest": {
              "type": "completion"
            }
          }
        }
      }
      
      Add suggestion data:
      PUT /suggest_index/_doc/1
      {
        "suggest": { "input": ["recipes", "recettes"] }
      }
      
    • Query for suggestions (the prefix “rec” is shared by both inputs):
      GET /suggest_index/_search
      {
        "suggest": {
          "recipe-suggest": {
            "prefix": "rec",
            "completion": {
              "field": "suggest"
            }
          }
        }
      }
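
A plain completion field returns every input that starts with the prefix, whatever its language. One way to localize the suggestions is the completion suggester's category context, which tags each suggestion and filters on that tag at query time. The Python sketch below builds the three request bodies. The context name `locale` and the helper function names are assumptions made for illustration, and the code builds dictionaries only, without calling a cluster.

```python
import json


def suggest_mapping() -> dict:
    """Index mapping: a completion field with a category context named 'locale'."""
    return {
        "mappings": {
            "properties": {
                "suggest": {
                    "type": "completion",
                    "contexts": [{"name": "locale", "type": "category"}],
                }
            }
        }
    }


def suggestion_doc(term: str, locale: str) -> dict:
    """Document body that tags one suggestion with its locale."""
    return {"suggest": {"input": [term], "contexts": {"locale": [locale]}}}


def suggest_query(prefix: str, locale: str) -> dict:
    """Search body that only returns suggestions tagged with the given locale."""
    return {
        "suggest": {
            "recipe-suggest": {
                "prefix": prefix,
                "completion": {
                    "field": "suggest",
                    "contexts": {"locale": [locale]},
                },
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(suggest_mapping(), indent=2))
    print(json.dumps(suggestion_doc("recettes", "fr"), indent=2))
    print(json.dumps(suggest_query("rec", "fr"), indent=2))
```

With this layout a French user typing “rec” would be offered “recettes” but not “recipes”, at the cost of indexing one suggestion document per locale.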
      

8.6 Supporting Multilingual Thesauri

  • Challenges:
    • Handling synonyms, translations, and alternate spellings across multiple languages.
  • Examples:
    • A search for “doctor” retrieves results for “médico” (Spanish) and “Arzt” (German).
  • Implementation:
    • Adding Multilingual Synonyms:
      • Keep the synonyms in a file and reference it with synonyms_path, which is resolved relative to the Elasticsearch config directory:
        "filter": {
          "multilingual_synonyms": {
            "type": "synonym",
            "synonyms_path": "analysis/synonyms.txt"
          }
        }
        
      Example synonyms.txt:
      doctor, médico, arzt
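
Maintaining such a file by hand becomes error-prone as the number of concepts grows. A possible approach, sketched below in Python, is to keep one record per concept with a term for each language and generate the file lines from it. The record layout, the language order, and the function name `to_synonym_lines` are assumptions for illustration, not part of any search engine API.

```python
# One record per concept, keyed by language code (assumed layout).
CONCEPTS = [
    {"en": "doctor", "es": "médico", "de": "arzt"},
]

# Order in which terms appear on each line of the generated file.
LANGUAGE_ORDER = ["en", "es", "de"]


def to_synonym_lines(concepts: list[dict], order: list[str]) -> list[str]:
    """Turn concept records into comma-separated synonym rules."""
    lines = []
    for concept in concepts:
        terms = [concept[lang] for lang in order if lang in concept]
        # A rule needs at least two terms to be meaningful.
        if len(terms) > 1:
            lines.append(", ".join(terms))
    return lines


if __name__ == "__main__":
    for line in to_synonym_lines(CONCEPTS, LANGUAGE_ORDER):
        print(line)
```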
      

8.7 Handling Mixed-Language Queries

  • Challenges:
    • Processing queries that include multiple languages or scripts.
  • Examples:
    • A query like “Tokyo 東京 hotels” mixes English and Japanese.
  • Implementation:
    • Tokenization for Mixed Scripts:
      • Use a script-aware tokenizer such as ICU, available in Solr and in Elasticsearch (where it requires the analysis-icu plugin):
        "analyzer": {
          "mixed_script_analyzer": {
            "tokenizer": "icu_tokenizer",
            "filter": ["lowercase"]
          }
        }
        
    • Fallback Mechanism:
      • If a query doesn’t match in one language index, attempt searches in others.
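
The fallback mechanism can be sketched as a loop over language indices in priority order, stopping at the first one that returns hits. In the Python below, the search call is injected as a function so the logic stays independent of any client library. The index name `news_ja`, the in-memory `FAKE_DOCS` data, and the helper names are all hypothetical and exist only to make the example runnable.

```python
from typing import Callable, Optional

# Signature assumed for the injected search call: (index, query) -> list of hits.
SearchFn = Callable[[str, str], list]

# Stand-in data so the example runs without a search cluster.
FAKE_DOCS = {
    "news_en": ["Tokyo hotels guide"],
    "news_ja": ["東京 ホテル"],
}


def fake_search(index: str, query: str) -> list:
    """Naive substring match over the in-memory documents."""
    return [doc for doc in FAKE_DOCS.get(index, []) if query in doc]


def search_with_fallback(
    query: str, indices: list[str], search: SearchFn
) -> tuple[Optional[str], list]:
    """Try each index in order and return the first non-empty result."""
    for index in indices:
        hits = search(index, query)
        if hits:
            return index, hits
    return None, []


if __name__ == "__main__":
    print(search_with_fallback("東京", ["news_en", "news_ja"], fake_search))
    print(search_with_fallback("Tokyo", ["news_en", "news_ja"], fake_search))
```

Trying indices one at a time costs extra round trips when the first index misses, so the order should reflect the user's most likely language.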

8.8 Building a Unified Index for Multilingual Search

  • Challenges:
    • Combining multilingual content into a single index while preserving relevance and performance.
  • Examples:
    • An article in English and its French translation are tied together in search results.
  • Implementation:
    • Unified Indexing with Metadata:
      • Store translations as separate documents with a shared content_id:
        PUT /content/_doc/1
        {
          "content_id": "123",
          "locale": "en",
          "title": "Hello World"
        }
        PUT /content/_doc/2
        {
          "content_id": "123",
          "locale": "fr",
          "title": "Bonjour le Monde"
        }
        
      Query for a specific language:
      GET /content/_search
      {
        "query": {
          "bool": {
            "filter": [
              { "term": { "content_id": "123" } },
              { "term": { "locale": "fr" } }
            ]
          }
        }
      }
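
When a search runs across all locales, the shared content_id lets the application show one result per article in the reader's language. The Python sketch below groups hits by content_id and picks the preferred locale, falling back to another locale when no translation exists. The function name `pick_translations`, the default fallback locale, and the third sample hit are assumptions added for illustration.

```python
def pick_translations(
    hits: list[dict], preferred_locale: str, fallback_locale: str = "en"
) -> list[dict]:
    """Return one hit per content_id, preferring the requested locale."""
    by_content: dict[str, dict[str, dict]] = {}
    for hit in hits:
        by_content.setdefault(hit["content_id"], {})[hit["locale"]] = hit

    results = []
    for translations in by_content.values():
        chosen = (
            translations.get(preferred_locale)
            or translations.get(fallback_locale)
            # Last resort: any available translation.
            or next(iter(translations.values()))
        )
        results.append(chosen)
    return results


if __name__ == "__main__":
    sample_hits = [
        {"content_id": "123", "locale": "en", "title": "Hello World"},
        {"content_id": "123", "locale": "fr", "title": "Bonjour le Monde"},
        {"content_id": "456", "locale": "en", "title": "Only English"},  # hypothetical
    ]
    for hit in pick_translations(sample_hits, "fr"):
        print(hit["title"])
```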
      

8.9 Real-Time Index Updates

  • Challenges:
    • Keeping the search index synchronized with new or updated content without impacting performance.
  • Examples:
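    • A newly added blog post should appear in search results within about a second; Elasticsearch is near real-time, so new documents become searchable after the next index refresh (every second by default).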
  • Implementation:
    • Using Bulk Indexing for Real-Time Updates:
      POST /_bulk
      { "index": { "_index": "content", "_id": "1" } }
      { "title": "New Post", "locale": "en" }
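
The bulk request body is newline-delimited JSON: each document takes an action line followed by a source line, and the body must end with a newline. The Python sketch below assembles such a body from a list of documents. The function name `build_bulk_body` is hypothetical, and sending the result to the `_bulk` endpoint with an NDJSON content type is left to whichever HTTP client is in use.

```python
import json


def build_bulk_body(index: str, docs: list[tuple[str, dict]]) -> str:
    """Build an NDJSON body of index actions for the bulk endpoint."""
    lines = []
    for doc_id, source in docs:
        # Action line: which index and document ID to write.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        # Source line: the document itself, kept on a single line.
        lines.append(json.dumps(source, ensure_ascii=False))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    body = build_bulk_body(
        "content",
        [
            ("1", {"title": "New Post", "locale": "en"}),
            ("2", {"title": "Nouvel article", "locale": "fr"}),
        ],
    )
    print(body, end="")
```

Batching several documents into one request reduces per-request overhead, which helps keep indexing from competing with search traffic.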