Published 9 January 2025 by Ray Morgan

4. Full-Text Search


4.1 Word Boundaries

  • Challenges:
    • Recognizing word boundaries in languages without spaces (e.g., Chinese, Japanese) or with complex word compounding (e.g., German) is essential for accurate search results.
  • Examples:
    • Chinese text: “我喜欢学习” should be tokenized into “我” (I), “喜欢” (like), and “学习” (study).
    • German compound words like “Arbeitszeitgesetz” (Working Hours Act) need segmentation.
  • Solutions:
    • Language-Specific Tokenizers:
      • Use tools like Jieba for Chinese:
        import jieba
        tokens = jieba.cut("我喜欢学习")
        print(list(tokens))  # Output: ['我', '喜欢', '学习']
        
      • For German, configure Elasticsearch with a decompounding filter, e.g. dictionary_decompounder with a subword list:
        "filter": {
          "german_decompound": {
            "type": "dictionary_decompounder",
            "word_list": ["arbeit", "zeit", "gesetz"]
          }
        }
        
    • Hybrid Tokenization: Combine rule-based and statistical models to improve accuracy for complex cases.
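      • A minimal hybrid sketch: jieba itself layers a statistical HMM over its dictionary rules, and you can extend the rule-based side with domain terms at runtime:
        import jieba

        # Rule-based side: add a domain term to the dictionary
        # (jieba.load_userdict can load a whole file of them).
        jieba.add_word("自然语言处理")  # "natural language processing"

        # Statistical side: HMM=True (the default) enables the hidden
        # Markov model that segments words missing from the dictionary.
        tokens = jieba.cut("我喜欢学习自然语言处理", HMM=True)
        print(list(tokens))  # e.g. ['我', '喜欢', '学习', '自然语言处理']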

4.2 Stop Word Handling

  • Challenges:
    • Stop words differ by language and affect query performance and relevance.
    • Removing critical stop words can change query meaning.
  • Examples:
    • “The Matrix” in English may lose relevance if “the” is removed.
    • French stop words like “le” and “la” could impact phrase searches.
  • Solutions:
    • Custom Stop Word Lists:
      • Define stop words per language in your search engine. For example, Elasticsearch:
        "filter": {
          "french_stop": {
            "type": "stop",
            "stopwords": "_french_"
          }
        }
        
    • Dynamic Stop Word Override:
      • Allow users to bypass stop words for exact queries, such as by enclosing terms in quotes: "The Matrix".
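      • A minimal routing sketch in Python, assuming the index defines a hypothetical content.exact subfield analyzed without a stop filter:
        import json

        def build_query(user_input):
            """Send quoted input to the stop-word-preserving subfield;
            everything else goes through the default analyzer, which
            strips stop words."""
            text = user_input.strip()
            if text.startswith('"') and text.endswith('"'):
                # Exact phrase against the subfield that kept stop words.
                return {"query": {"match_phrase": {"content.exact": text.strip('"')}}}
            return {"query": {"match": {"content": text}}}

        print(json.dumps(build_query('"The Matrix"'), indent=2))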

4.3 Handling Synonyms

  • Challenges:
    • Synonyms enhance search relevance but must be tailored to language and context.
  • Examples:
    • Searching for “movie” should also retrieve results for “film” in English.
    • For technical terms, “AI” and “artificial intelligence” need to be equivalent.
  • Solutions:
    • Synonym Mapping:
      • Define synonyms in your search engine:
        "filter": {
          "synonym_filter": {
            "type": "synonym",
            "synonyms": [
              "movie, film",
              "AI, artificial intelligence"
            ]
          }
        }
        
    • Context-Sensitive Synonyms:
      • Use machine learning models to expand queries dynamically based on context.
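      • Short of a full ML model, context can gate which synonym set fires; the synonym table and context keywords below are illustrative:
        # Expand a term only when the rest of the query matches the
        # context its synonyms were defined for.
        SYNONYMS = {
            ("AI", "tech"): ["artificial intelligence", "machine learning"],
            ("AI", "film"): ["A.I. Artificial Intelligence"],  # the 2001 film
        }
        CONTEXT_KEYWORDS = {
            "tech": {"software", "model", "startup"},
            "film": {"movie", "cast", "review"},
        }

        def detect_context(terms):
            for context, keywords in CONTEXT_KEYWORDS.items():
                if keywords & set(terms):
                    return context
            return None

        def expand(query):
            terms = query.split()
            context = detect_context(terms)
            extra = []
            for term in terms:
                extra.extend(SYNONYMS.get((term, context), []))
            return terms + extra

        print(expand("AI movie review"))  # adds only the film-sense synonym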

4.4 Language Detection for Query Routing

  • Challenges:
    • Multilingual content requires routing queries to the appropriate language index.
    • Mixed-language queries need special handling.
  • Examples:
    • A French user searching for “restaurants à Paris” should route to the French index.
    • Mixed queries like “best pizza em Lisboa” (English mixed with Portuguese) require splitting.
  • Solutions:
    • Detect and Route:
      • Use tools like langdetect to determine query language and prioritize indices:
        from langdetect import detect
        language = detect("restaurants à Paris")
        print(language)  # Output: 'fr'
        
    • Mixed-Language Handling:
      • Split queries and match each part against its respective language index.
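      • A per-token sketch with langdetect; detection on single words is noisy, so production systems work on larger segments or character n-grams:
        from langdetect import DetectorFactory, detect
        from langdetect.lang_detect_exception import LangDetectException

        DetectorFactory.seed = 0  # langdetect is nondeterministic by default

        def route_segments(query):
            """Group tokens by detected language so each group can be
            searched against its own language index."""
            routes = {}
            for token in query.split():
                try:
                    lang = detect(token)
                except LangDetectException:
                    lang = "und"  # undetermined, e.g. bare digits
                routes.setdefault(lang, []).append(token)
            return routes

        print(route_segments("best pizza em Lisboa"))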

4.5 Indexing Phrases

  • Challenges:
    • Phrases often need to be treated as single units for search accuracy.
    • Exact phrase matching is critical for some queries.
  • Examples:
    • “United Nations” should not match results for “United” or “Nations” alone.
    • “Artificial intelligence” as a phrase is distinct from individual words.
  • Solutions:
    • Phrase Indexing:
      • Use Elasticsearch’s match_phrase query to prioritize phrase matches:
        {
          "query": {
            "match_phrase": {
              "content": "United Nations"
            }
          }
        }
        
    • Custom N-grams:
      • Index bigrams and trigrams to enhance phrase search capabilities.
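      • Elasticsearch's shingle token filter emits word-level bigrams and trigrams at index time; a sketch of the settings, built as a Python dict for readability:
        import json

        # A shingle is a word-level n-gram: with this filter, "united
        # nations" is indexed as one token alongside "united" and "nations".
        settings = {
            "settings": {
                "analysis": {
                    "filter": {
                        "phrase_shingles": {
                            "type": "shingle",
                            "min_shingle_size": 2,
                            "max_shingle_size": 3,
                        }
                    },
                    "analyzer": {
                        "shingle_analyzer": {
                            "type": "custom",
                            "tokenizer": "standard",
                            "filter": ["lowercase", "phrase_shingles"],
                        }
                    },
                }
            }
        }
        print(json.dumps(settings, indent=2))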

4.6 Relevance Scoring

  • Challenges:
    • Relevance scores must reflect user expectations for different languages and scripts.
    • Cultural and linguistic nuances impact scoring.
  • Examples:
    • In English, “cheap flights” should prioritize price-related results.
    • In Japanese, kanji-based terms may carry different weights than katakana words.
  • Solutions:
    • Boost Important Fields:
      • Adjust field weights dynamically. For example:
        {
          "query": {
            "multi_match": {
              "query": "cheap flights",
              "fields": ["title^3", "description^1"]
            }
          }
        }
        
    • Cultural Customization:
      • Implement language-specific relevance rules using user behavior data.
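      • One common pattern folds a behavioral signal into the score with a function_score query; the ctr field below is a hypothetical stored click-through rate:
        # Boost documents by click-through rate, assuming each document
        # stores a "ctr" field updated from user behavior logs.
        query = {
            "query": {
                "function_score": {
                    "query": {"match": {"title": "cheap flights"}},
                    "field_value_factor": {
                        "field": "ctr",
                        "modifier": "log1p",  # dampen large values
                        "missing": 0.0,       # documents without CTR data
                    },
                    "boost_mode": "sum",
                }
            }
        }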

4.7 Handling Mixed-Script Queries

  • Challenges:
    • Mixed-script queries are common in multilingual settings, especially with Latin characters combined with local scripts.
  • Examples:
    • “東京 2025” (“Tokyo 2025”: Japanese characters plus ASCII digits) or “ラーメン recipes” (“ramen recipes”: katakana plus Latin script).
  • Solutions:
    • Script-Aware Tokenizers:
      • Use multi-script tokenizers to segment queries appropriately (see the sketch after this list).
    • Unified Indexing:
      • Index content in all relevant scripts, using transliteration where necessary.
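    • Sketch: a rough script-run splitter for the tokenization above, using only the standard library (production systems would use ICU or the regex module's \p{Script} classes):
        import itertools
        import unicodedata

        SCRIPTS = ("CJK", "HIRAGANA", "KATAKANA", "LATIN", "CYRILLIC", "ARABIC")

        def script_of(ch):
            # Rough script bucket derived from the Unicode character name.
            name = unicodedata.name(ch, "")
            for script in SCRIPTS:
                if script in name:
                    return script
            return "COMMON"  # digits, spaces, punctuation

        def split_by_script(text):
            """Group consecutive characters by script so each run can be
            handed to a script-appropriate tokenizer."""
            return [(script, "".join(chars))
                    for script, chars in itertools.groupby(text, key=script_of)]

        print(split_by_script("東京 2025 marathon"))
        # [('CJK', '東京'), ('COMMON', ' 2025 '), ('LATIN', 'marathon')]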

4.8 Full-Text Search Tuning

  • Challenges:
    • Different languages require distinct settings for tokenization, stemming, and stop word removal.
  • Examples:
    • Arabic text requires proper normalization and stemming to handle roots.
    • German compound words need special filters to break them down meaningfully.
  • Solutions:
    • Language-Specific Analyzers:
      • Configure a separate analyzer for each language in your search engine (a field-mapping sketch follows at the end of this section). For example:
        "settings": {
          "analysis": {
            "analyzer": {
              "arabic_analyzer": {
                "type": "custom",
                "tokenizer": "standard",
                "filter": ["lowercase", "arabic_stem"]
              }
            }
          }
        }
        
    • User Testing:
      • Conduct usability testing to fine-tune search performance for multilingual users.
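    • Example: attaching per-language analyzers to separate fields (field names are illustrative; arabic_analyzer is defined above, german_analyzer would be defined the same way, and english is built in):
        # Each language field gets its own analyzer, so Arabic content is
        # normalized and stemmed with Arabic rules, German content is
        # decompounded, and so on.
        mapping = {
            "mappings": {
                "properties": {
                    "content_ar": {"type": "text", "analyzer": "arabic_analyzer"},
                    "content_de": {"type": "text", "analyzer": "german_analyzer"},
                    "content_en": {"type": "text", "analyzer": "english"},
                }
            }
        }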