Published 9 January 2025 by Ray Morgan

4. Full-Text Search


4.1 Word Boundaries

  • Challenges:
    • Recognizing word boundaries in languages without spaces (e.g., Chinese, Japanese) or with complex word compounding (e.g., German) is essential for accurate search results.
  • Examples:
    • Chinese text: “我喜欢学习” should be tokenized into “我” (I), “喜欢” (like), and “学习” (study).
    • German compound words like “Arbeitszeitgesetz” (Working Hours Act) need segmentation.
  • Solutions:
    • Language-Specific Tokenizers:
      • Use tools like Jieba for Chinese:
        import jieba
        tokens = jieba.cut("我喜欢学习")
        print(list(tokens))  # Output: ['我', '喜欢', '学习']
        
      • For German, configure Elasticsearch with a decompounding filter, e.g. dictionary_decompounder with a subword list:
        "filter": {
          "german_decompound": {
            "type": "dictionary_decompounder",
            "word_list": ["arbeit", "zeit", "gesetz"]
          }
        }
        
    • Hybrid Tokenization: Combine rule-based and statistical models to improve accuracy for complex cases.
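      • A minimal hybrid sketch: jieba itself layers a statistical HMM over its dictionary rules, and you can extend the rule-based side with domain terms at runtime:
        import jieba

        # Rule-based side: add a domain term to the dictionary
        # (jieba.load_userdict can load a whole file of them).
        jieba.add_word("自然语言处理")  # "natural language processing"

        # Statistical side: HMM=True (the default) enables the hidden
        # Markov model that segments words missing from the dictionary.
        tokens = jieba.cut("我喜欢学习自然语言处理", HMM=True)
        print(list(tokens))  # e.g. ['我', '喜欢', '学习', '自然语言处理']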

4.2 Stop Word Handling

  • Challenges:
    • Stop words differ by language and affect query performance and relevance.
    • Removing critical stop words can change query meaning.
  • Examples:
    • “The Matrix” in English may lose relevance if “the” is removed.
    • French stop words like “le” and “la” could impact phrase searches.
  • Solutions:
    • Custom Stop Word Lists:
      • Define stop words per language in your search engine. For example, Elasticsearch:
        "filter": {
          "french_stop": {
            "type": "stop",
            "stopwords": "_french_"
          }
        }
        
    • Dynamic Stop Word Override:
      • Allow users to bypass stop words for exact queries, such as by enclosing terms in quotes: "The Matrix".
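      • A minimal routing sketch in Python, assuming the index defines a hypothetical content.exact subfield analyzed without a stop filter:
        import json

        def build_query(user_input):
            """Send quoted input to the stop-word-preserving subfield;
            everything else goes through the default analyzer, which
            strips stop words."""
            text = user_input.strip()
            if text.startswith('"') and text.endswith('"'):
                # Exact phrase against the subfield that kept stop words.
                return {"query": {"match_phrase": {"content.exact": text.strip('"')}}}
            return {"query": {"match": {"content": text}}}

        print(json.dumps(build_query('"The Matrix"'), indent=2))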

4.3 Handling Synonyms

  • Challenges:
    • Synonyms enhance search relevance but must be tailored to language and context.
  • Examples:
    • Searching for “movie” should also retrieve results for “film” in English.
    • For technical terms, “AI” and “artificial intelligence” need to be equivalent.
  • Solutions:
    • Synonym Mapping:
      • Define synonyms in your search engine:
        "filter": {
          "synonym_filter": {
            "type": "synonym",
            "synonyms": [
              "movie, film",
              "AI, artificial intelligence"
            ]
          }
        }
        
    • Context-Sensitive Synonyms:
      • Use machine learning models to expand queries dynamically based on context.
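      • Short of a full ML model, context can gate which synonym set fires; the synonym table and context keywords below are illustrative:
        # Expand a term only when the rest of the query matches the
        # context its synonyms were defined for.
        SYNONYMS = {
            ("AI", "tech"): ["artificial intelligence", "machine learning"],
            ("AI", "film"): ["A.I. Artificial Intelligence"],  # the 2001 film
        }
        CONTEXT_KEYWORDS = {
            "tech": {"software", "model", "startup"},
            "film": {"movie", "cast", "review"},
        }

        def detect_context(terms):
            for context, keywords in CONTEXT_KEYWORDS.items():
                if keywords & set(terms):
                    return context
            return None

        def expand(query):
            terms = query.split()
            context = detect_context(terms)
            extra = []
            for term in terms:
                extra.extend(SYNONYMS.get((term, context), []))
            return terms + extra

        print(expand("AI movie review"))  # adds only the film-sense synonym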

4.4 Language Detection for Query Routing

  • Challenges:
    • Multilingual content requires routing queries to the appropriate language index.
    • Mixed-language queries need special handling.
  • Examples:
    • A French user searching for “restaurants à Paris” should route to the French index.
    • Mixed queries like “best pizza em Lisboa” (English mixed with Portuguese) require splitting.
  • Solutions:
    • Detect and Route:
      • Use tools like langdetect to determine query language and prioritize indices:
        from langdetect import detect
        language = detect("restaurants à Paris")
        print(language)  # Output: 'fr'
        
    • Mixed-Language Handling:
      • Split queries and match each part against its respective language index.
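      • A per-token sketch with langdetect; detection on single words is noisy, so production systems work on larger segments or character n-grams:
        from langdetect import DetectorFactory, detect
        from langdetect.lang_detect_exception import LangDetectException

        DetectorFactory.seed = 0  # langdetect is nondeterministic by default

        def route_segments(query):
            """Group tokens by detected language so each group can be
            searched against its own language index."""
            routes = {}
            for token in query.split():
                try:
                    lang = detect(token)
                except LangDetectException:
                    lang = "und"  # undetermined, e.g. bare digits
                routes.setdefault(lang, []).append(token)
            return routes

        print(route_segments("best pizza em Lisboa"))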

4.5 Indexing Phrases

  • Challenges:
    • Phrases often need to be treated as single units for search accuracy.
    • Exact phrase matching is critical for some queries.
  • Examples:
    • “United Nations” should not match results for “United” or “Nations” alone.
    • “Artificial intelligence” as a phrase is distinct from individual words.
  • Solutions:
    • Phrase Indexing:
      • Use Elasticsearch’s match_phrase query to prioritize phrase matches:
        {
          "query": {
            "match_phrase": {
              "content": "United Nations"
            }
          }
        }
        
    • Custom N-grams:
      • Index bigrams and trigrams to enhance phrase search capabilities.
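      • Elasticsearch's shingle token filter emits word-level bigrams and trigrams at index time; a sketch of the settings, built as a Python dict for readability:
        import json

        # A shingle is a word-level n-gram: with this filter, "united
        # nations" is indexed as one token alongside "united" and "nations".
        settings = {
            "settings": {
                "analysis": {
                    "filter": {
                        "phrase_shingles": {
                            "type": "shingle",
                            "min_shingle_size": 2,
                            "max_shingle_size": 3,
                        }
                    },
                    "analyzer": {
                        "shingle_analyzer": {
                            "type": "custom",
                            "tokenizer": "standard",
                            "filter": ["lowercase", "phrase_shingles"],
                        }
                    },
                }
            }
        }
        print(json.dumps(settings, indent=2))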

4.6 Relevance Scoring

  • Challenges:
    • Relevance scores must reflect user expectations for different languages and scripts.
    • Cultural and linguistic nuances impact scoring.
  • Examples:
    • In English, “cheap flights” should prioritize price-related results.
    • In Japanese, kanji-based terms may carry different weights than katakana words.
  • Solutions:
    • Boost Important Fields:
      • Adjust field weights dynamically. For example:
        {
          "query": {
            "multi_match": {
              "query": "cheap flights",
              "fields": ["title^3", "description^1"]
            }
          }
        }
        
    • Cultural Customization:
      • Implement language-specific relevance rules using user behavior data.
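      • One common pattern folds a behavioral signal into the score with a function_score query; the ctr field below is a hypothetical stored click-through rate:
        # Boost documents by click-through rate, assuming each document
        # stores a "ctr" field updated from user behavior logs.
        query = {
            "query": {
                "function_score": {
                    "query": {"match": {"title": "cheap flights"}},
                    "field_value_factor": {
                        "field": "ctr",
                        "modifier": "log1p",  # dampen large values
                        "missing": 0.0,       # documents without CTR data
                    },
                    "boost_mode": "sum",
                }
            }
        }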

4.7 Handling Mixed-Script Queries

  • Challenges:
    • Mixed-script queries are common in multilingual settings, especially with Latin characters combined with local scripts.
  • Examples:
    • “東京 2025” (“Tokyo 2025”: Japanese characters plus ASCII digits) or “ラーメン recipes” (“ramen recipes”: katakana plus Latin script).
  • Solutions:
    • Script-Aware Tokenizers:
      • Use multi-script tokenizers to segment queries appropriately (see the sketch after this list).
    • Unified Indexing:
      • Index content in all relevant scripts, using transliteration where necessary.
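    • Sketch: a rough script-run splitter for the tokenization above, using only the standard library (production systems would use ICU or the regex module's \p{Script} classes):
        import itertools
        import unicodedata

        SCRIPTS = ("CJK", "HIRAGANA", "KATAKANA", "LATIN", "CYRILLIC", "ARABIC")

        def script_of(ch):
            # Rough script bucket derived from the Unicode character name.
            name = unicodedata.name(ch, "")
            for script in SCRIPTS:
                if script in name:
                    return script
            return "COMMON"  # digits, spaces, punctuation

        def split_by_script(text):
            """Group consecutive characters by script so each run can be
            handed to a script-appropriate tokenizer."""
            return [(script, "".join(chars))
                    for script, chars in itertools.groupby(text, key=script_of)]

        print(split_by_script("東京 2025 marathon"))
        # [('CJK', '東京'), ('COMMON', ' 2025 '), ('LATIN', 'marathon')]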

4.8 Full-Text Search Tuning

  • Challenges:
    • Different languages require distinct settings for tokenization, stemming, and stop word removal.
  • Examples:
    • Arabic text requires proper normalization and stemming to handle roots.
    • German compound words need special filters to break them down meaningfully.
  • Solutions:
    • Language-Specific Analyzers:
      • Configure a separate analyzer for each language in your search engine (a field-mapping sketch follows at the end of this section). For example:
        "settings": {
          "analysis": {
            "analyzer": {
              "arabic_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "arabic_stem"]
              }
            }
          }
        }
        
    • User Testing:
      • Conduct usability testing to fine-tune search performance for multilingual users.
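    • Example: attaching per-language analyzers to separate fields (field names are illustrative; arabic_analyzer is defined above, german_analyzer would be defined the same way, and english is built in):
        # Each language field gets its own analyzer, so Arabic content is
        # normalized and stemmed with Arabic rules, German content is
        # decompounded, and so on.
        mapping = {
            "mappings": {
                "properties": {
                    "content_ar": {"type": "text", "analyzer": "arabic_analyzer"},
                    "content_de": {"type": "text", "analyzer": "german_analyzer"},
                    "content_en": {"type": "text", "analyzer": "english"},
                }
            }
        }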