Published 10 January 2025 by Ray Morgan

Term Frequency-Inverse Document Frequency (TF-IDF)

Term Frequency-Inverse Document Frequency is a statistical measure used to evaluate the importance of a word in a document relative to a collection (or corpus) of documents. It captures the intuition that rarer words are often more meaningful for judging a document's relevance to a query.

TF-IDF is a foundational concept in text analysis and search relevance ranking, complementing field weighting. By emphasizing rare and specific terms, TF-IDF helps focus search results on the documents most likely to match the user's intent. It’s a powerful method that underpins many modern search engines and information retrieval systems.

Term Frequency (TF) measures how often a word appears in a document. Words that occur frequently in a document (e.g., "apple" in a document about apples) are given higher importance.

Inverse Document Frequency (IDF) reduces the weight of common words that appear in many documents (e.g., "the," "is") and increases the weight of rare, specific words. Rare words across the corpus get higher scores, as they are more likely to carry specific meaning.

The TF-IDF score combines the two measures, typically by multiplying them, to calculate a word’s importance in a specific document relative to the entire corpus.
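
As a rough illustration, here is a minimal Python sketch of all three quantities, assuming simple whitespace tokenization and a base-10 logarithm (production systems usually normalize text more carefully and use smoothed IDF variants):

```python
import math
from collections import Counter

def term_frequency(term, document):
    # Fraction of the document's tokens that are the term.
    tokens = document.lower().split()
    return Counter(tokens)[term.lower()] / len(tokens)

def inverse_document_frequency(term, corpus):
    # log10 of (total documents / documents containing the term);
    # rarer terms get larger values, a term found in every document gets 0.
    containing = sum(1 for doc in corpus if term.lower() in doc.lower().split())
    return math.log10(len(corpus) / containing) if containing else 0.0

def tf_idf(term, document, corpus):
    # A word's weight in one document, relative to the whole corpus.
    return term_frequency(term, document) * inverse_document_frequency(term, corpus)
```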

How TF-IDF Relates to Search and Field Weighting

Prioritizing Rare Words: Words like "quantum" or "photosynthesis" in a scientific document carry more weight because they are rare across the corpus, while common words like "the" or "and" are deprioritized.

Improving Search Relevance: TF-IDF is used in many search engines to rank documents based on the importance of query terms. For example:

  • A query for "quantum computing" would give higher scores to documents with these rarer terms compared to documents with more generic language.
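
Concretely, one way to rank documents for a multi-term query is to sum the TF-IDF scores of its terms in each document, as in this hypothetical sketch (it reuses the tf_idf helper from above; real engines add normalization, smoothing, and many other signals):

```python
def rank(query, corpus):
    # Score each document by the summed TF-IDF of the query's terms,
    # then return documents from most to least relevant.
    terms = query.lower().split()
    scored = [(sum(tf_idf(t, doc, corpus) for t in terms), doc) for doc in corpus]
    return [doc for score, doc in sorted(scored, reverse=True)]

# Example: rank("quantum computing", corpus)
```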

Custom Field Weighting: TF-IDF scores can be combined with field weights to give certain fields (e.g., titles) even more influence in search ranking.
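
A minimal sketch of that combination, again building on the tf_idf helper above; the field names and weights here are purely illustrative, not any particular engine's defaults:

```python
# Hypothetical weights: a title match counts three times as much as a body match.
FIELD_WEIGHTS = {"title": 3.0, "body": 1.0}

def field_weighted_score(query, doc_fields, corpus_fields):
    # doc_fields:    {"title": "...", "body": "..."} for the document being scored
    # corpus_fields: {"title": [all titles], "body": [all bodies]} used for IDF
    terms = query.lower().split()
    return sum(
        FIELD_WEIGHTS.get(field, 1.0) * tf_idf(term, text, corpus_fields[field])
        for field, text in doc_fields.items()
        for term in terms
    )
```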

Practical Example

  • Document 1: "Quantum computing is a fascinating field of study."
  • Document 2: "The computing field is broad and includes many topics."

Calculating TF-IDF for "quantum":

  • TF (Document 1): Appears once in 8 words → ( TF = 1/8 ).
  • TF (Document 2): Does not appear → ( TF = 0 ).
  • IDF: Assuming a corpus of 10 documents, with "quantum" appearing in 1 document, and using a base-10 logarithm →
    [ IDF = \log_{10} \frac{10}{1} = 1.0 ]

Final Scores:

  • Document 1: ( TF\text{-}IDF = (1/8) \times 1.0 = 0.125 ).
  • Document 2: ( TF\text{-}IDF = 0 ).
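
To double-check the arithmetic, here is a small self-contained sketch that lowercases, strips trailing punctuation, and takes the corpus statistics (10 documents, "quantum" in one of them) as given rather than computing them from a real corpus:

```python
import math

doc1 = "Quantum computing is a fascinating field of study."
doc2 = "The computing field is broad and includes many topics."

def tf(term, text):
    # Whitespace tokenization with trailing punctuation stripped.
    tokens = [t.strip(".,").lower() for t in text.split()]
    return tokens.count(term.lower()) / len(tokens)

idf_quantum = math.log10(10 / 1)  # corpus of 10 documents, term in 1 -> 1.0

print(round(tf("quantum", doc1) * idf_quantum, 3))  # 0.125
print(round(tf("quantum", doc2) * idf_quantum, 3))  # 0.0
```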