Recognizing Keywords from Documents


Problem Statement

  • Several applications need to determine the keywords of a document quickly, for example:
      1. Contextual Advertising
      2. News Query Extraction
      3. Email Query Extraction
  • One technique to solve this problem is a four-step process:
      1. Pre-processing
          • The document is pre-processed so that markup tags are removed and the text is given prominence.
      2. Candidate Selection
          • Candidate phrases are extracted from the text. Typically these are nouns and noun phrases.
      3. Scoring
          • The features for each candidate are computed.
          • All the candidates are then scored.
          • The scoring here uses a supervised learning model, with annotators pre-annotating test pages.
      4. Post-processing
          • At the end of scoring, all candidates are listed in decreasing order of score.
          • Other rule-based constraints may also be applied to the scored items.
  • The paper describes such a system used in “Contextual Advertising”.
  • The details are easy to read, and the system is simple and practical.
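
The four steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions, not the system from the paper: candidate selection uses stopword-free word n-grams as a stand-in for real noun-phrase extraction, and the score is a hand-set formula standing in for the supervised model. All function names, the stopword list, and the weights are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"a", "an", "the", "of", "and", "or", "in", "on", "to",
             "is", "are", "for", "with"}

def preprocess(html):
    """Step 1: remove markup tags, collapse whitespace, lowercase."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip().lower()

def select_candidates(text, max_len=2):
    """Step 2: stand-in for noun-phrase extraction.

    Emits every run of 1..max_len consecutive words that contains no stopword.
    """
    words = re.findall(r"[a-z]+", text)
    candidates = []
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if not any(w in STOPWORDS for w in gram):
                candidates.append(" ".join(gram))
    return candidates

def score_candidates(candidates):
    """Step 3: features are frequency and phrase length.

    The hand-set weighting replaces the supervised model (assumption).
    """
    counts = Counter(candidates)
    return {c: counts[c] * (1.0 + 0.5 * (len(c.split()) - 1)) for c in counts}

def postprocess(scores, top_n=3, min_chars=3):
    """Step 4: apply a simple rule, then list by decreasing score."""
    kept = [(c, s) for c, s in scores.items() if len(c) >= min_chars]
    kept.sort(key=lambda cs: (-cs[1], cs[0]))  # ties broken alphabetically
    return kept[:top_n]

def extract_keywords(html, top_n=3):
    text = preprocess(html)
    return postprocess(score_candidates(select_candidates(text)), top_n)

if __name__ == "__main__":
    page = "<p>Digital cameras and camera lenses. Digital cameras are popular.</p>"
    for phrase, score in extract_keywords(page):
        print(phrase, score)
```

A real system would replace `select_candidates` with a part-of-speech tagger or chunker, and `score_candidates` with a model trained on the annotated pages.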

Latent Semantic Analysis

Latent Semantic Analysis is a widely adopted technique for associating Documents with Terms.

Concept Space

  • Documents and Terms are indirectly associated with each other through “Concepts”.
  • The number of Concepts is far smaller than the number of Documents or Terms.
  • Concepts are only abstract. They may have no concrete meaning or relevance.

Latent Semantic Analysis Process

  1. Estimate the importance of each Term within each Document. A typical metric is tf-idf.
  2. Pass the Document-Term matrix to the SVD algorithm, and pick the top K singular values.
  3. Truncate the Document and Term matrices so that each vector keeps only those K factors.
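
The three steps can be sketched with NumPy as follows. This is a minimal illustration, assuming a small dense count matrix with one row per Document and one column per Term. The helper names `tfidf` and `lsa` are hypothetical, and the tf-idf variant shown (relative frequency times log(N/df)) is just one of several common definitions.

```python
import numpy as np

def tfidf(counts):
    """Step 1: weight raw counts (documents x terms) by tf-idf."""
    counts = np.asarray(counts, dtype=float)
    # Term frequency: counts normalised by document length.
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    # Inverse document frequency: log(N / number of documents containing the term).
    df = (counts > 0).sum(axis=0)
    idf = np.log(counts.shape[0] / np.maximum(df, 1))
    return tf * idf

def lsa(matrix, k):
    """Steps 2 and 3: run SVD and keep only the top k factors."""
    # numpy returns the singular values in descending order.
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]

if __name__ == "__main__":
    counts = [[2, 1, 0, 0],
              [1, 2, 0, 0],
              [0, 0, 1, 2],
              [0, 0, 2, 1]]
    doc_factors, singular_values, term_factors = lsa(tfidf(counts), k=2)
    print(doc_factors.shape, singular_values.shape, term_factors.shape)
```

For large, sparse collections a truncated solver is the usual choice instead of a full dense SVD.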

Strength of the Association

Document-Term: The association between a Document and a Term is given by the dot product of the Document row and the Term column.
Term-Term: The association between two Terms is given by the dot product of the first Term's vector and the second Term's vector.
Document-Document: The association between two Documents is given by the dot product of the first Document's vector and the second Document's vector.
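
The three associations can be computed from the truncated SVD factors, as in the sketch below. It assumes one common convention, which the notes above do not spell out: Document and Term vectors are scaled by the singular values, so that the dot products match the entries of the rank-K matrix and its two Gram matrices. The function name `lsa_associations` is hypothetical.

```python
import numpy as np

def lsa_associations(matrix, k):
    """Return Document-Term, Term-Term and Document-Document association matrices."""
    u, s, vt = np.linalg.svd(np.asarray(matrix, dtype=float), full_matrices=False)
    u_k, s_k, v_k = u[:, :k], s[:k], vt[:k, :].T

    # Assumed convention: scale each factor by its singular value.
    doc_vecs = u_k * s_k    # one row per Document
    term_vecs = v_k * s_k   # one row per Term

    doc_term = doc_vecs @ v_k.T          # entry (i, j): Document i with Term j
    term_term = term_vecs @ term_vecs.T  # entry (i, j): Term i with Term j
    doc_doc = doc_vecs @ doc_vecs.T      # entry (i, j): Document i with Document j
    return doc_term, term_term, doc_doc

if __name__ == "__main__":
    weights = np.array([[1.0, 0.5, 0.0, 0.0],
                        [0.5, 1.0, 0.0, 0.2],
                        [0.0, 0.0, 1.0, 0.8]])
    doc_term, term_term, doc_doc = lsa_associations(weights, k=2)
    print(np.round(doc_doc, 3))
```

When K equals the full rank, `doc_term` reproduces the input matrix exactly. With a smaller K the values are smoothed through the shared Concepts, which is where the indirect associations come from.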

Lucene Conferences

Introduction to Search Technology

The Once and Future History of Enterprise Search and Open Source
In this article, Mark Krellenstein from Lucid Imagination covers a lot of background on the history of Search, and the importance of the Search problem in Enterprises today.
Rating: Thought-Provoking

The Guardian Platform

