- Several applications need to quickly determine the keywords of a document, for example:
- Contextual Advertising
- News Query Extraction
- Email Query Extraction
- One technique to solve this problem is a four-step process:
- The document is pre-processed: markup tags are removed so that the text itself gets prominence
- Candidate Selection
- The candidate phrases are extracted from the text. Typically these are Nouns and Noun Phrases.
- The features for each candidate are computed.
- All the candidates are then scored.
- The scoring here uses a supervised learning model, with annotators labelling a set of pages in advance.
- At the end of the scoring, all candidates are listed in order of decreasing score.
- Other rule-based constraints may also be applied when scoring the items.
- The paper describes such a system used in “Contextual Advertising”.
- The details are easy to read, and the system is simple and practical
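The four steps above can be sketched roughly as follows. This is only an illustration, not the system from the paper: the stop-word list, the two features (`tf` and `first`) and the hand-set `WEIGHTS` are all assumptions, and runs of non-stop-words stand in for real noun-phrase detection. A supervised model trained on annotated pages would replace the fixed weights.

```python
import re
from collections import Counter

# Hypothetical stop-word list, kept tiny for illustration.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}

# Assumed weights; a real system would learn these from annotated pages.
WEIGHTS = {"tf": 1.0, "first": -0.5}


def preprocess(html):
    """Step 1: remove markup tags, collapse whitespace, lower-case the text."""
    text = re.sub(r"<[^>]+>", " ", html)
    return re.sub(r"\s+", " ", text).strip().lower()


def select_candidates(text, max_len=2):
    """Step 2: word n-grams without stop words, a crude stand-in for noun phrases."""
    words = re.findall(r"[a-z]+", text)
    candidates = []
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if not any(w in STOP_WORDS for w in gram):
                candidates.append((" ".join(gram), i))
    return candidates


def compute_features(candidates, total_words):
    """Step 3: phrase frequency and relative position of the first occurrence."""
    counts = Counter(phrase for phrase, _ in candidates)
    first_pos = {}
    for phrase, pos in candidates:
        first_pos.setdefault(phrase, pos)
    return {
        p: {"tf": counts[p], "first": first_pos[p] / max(total_words, 1)}
        for p in counts
    }


def score(features):
    """Step 4: linear score per candidate, listed in decreasing score."""
    scored = {
        p: sum(WEIGHTS[name] * value for name, value in f.items())
        for p, f in features.items()
    }
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))


def extract_keywords(html, top_n=3):
    text = preprocess(html)
    total_words = len(re.findall(r"[a-z]+", text))
    features = compute_features(select_candidates(text), total_words)
    return score(features)[:top_n]


keywords = extract_keywords("<p>Solar panels are cheap. Solar panels save money.</p>")
```

Any rule-based constraints (for example, dropping candidates that are too short) would fit naturally as a filter between `compute_features` and `score`.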
Latent Semantic Analysis is a widely adopted technique to associate Documents with Terms
- Documents and Terms are indirectly associated with each other through “Concepts”.
- The number of Concepts is far less than the number of Documents and Terms.
- Concepts are only abstract. They may have no concrete meaning or relevance.
Latent Semantic Analysis Process
- Estimate the importance of a Term within the document. A typical metric is tf-idf
- Send the Document-Term matrix to the SVD algorithm, and pick the top K singular values
- Prune the vectors for the Document and Term matrices to contain only the K factors
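A minimal sketch of these three steps with NumPy is shown below. The toy count matrix, the function names `tfidf` and `lsa`, and the smoothed idf formula are assumptions made for illustration; tf-idf has several common variants and the notes do not say which one is intended.

```python
import numpy as np


def tfidf(counts):
    """Weight a documents x terms count matrix by tf-idf (one smoothed variant)."""
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    # Term frequency: counts normalised by document length.
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    # Document frequency: number of documents containing each term.
    df = np.count_nonzero(counts, axis=0)
    idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0
    return tf * idf


def lsa(matrix, k):
    """Run SVD and keep only the K largest singular values and their vectors."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    # NumPy returns singular values in descending order, so slicing keeps the top K.
    return u[:, :k], s[:k], vt[:k, :]


# Toy data: 4 documents (rows) and 4 terms (columns).
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 1, 2],
    [0, 0, 2, 1],
])

weighted = tfidf(counts)
doc_factors, strengths, term_factors = lsa(weighted, k=2)
```

Here `doc_factors` has one row per Document and `term_factors` one column per Term, each restricted to the K retained Concepts.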
Strength of the Association
- The association between a Document and a Term is given by the dot-product between the Document row and the Term column
- The association between one Term and another is given by the dot-product between the vectors of the first Term and the second Term.
- The association between one Document and another is given by the dot-product between the vectors of the first Document and the second Document.
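The three kinds of association can be sketched as below. The toy matrix and the helper names are hypothetical. Scaling the vectors by the singular values is an assumption here (a common convention that makes the dot-products match the rank-K approximation of the original matrix); the notes do not specify the scaling.

```python
import numpy as np

# Toy document-term matrix (rows: documents, columns: terms); values are made up.
matrix = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [0.0, 0.0, 1.0, 3.0],
])
k = 2  # number of Concepts to keep

u, s, vt = np.linalg.svd(matrix, full_matrices=False)
doc_vecs = u[:, :k] * s[:k]       # one row per document, scaled by singular values
term_vecs = vt[:k, :].T * s[:k]   # one row per term, scaled by singular values


def doc_term(d, t):
    """Document row (scaled) dotted with the term column of V^T."""
    return float(doc_vecs[d] @ vt[:k, t])


def term_term(t1, t2):
    """Dot-product between two scaled term vectors."""
    return float(term_vecs[t1] @ term_vecs[t2])


def doc_doc(d1, d2):
    """Dot-product between two scaled document vectors."""
    return float(doc_vecs[d1] @ doc_vecs[d2])


print("doc 0 / term 0:", doc_term(0, 0))
print("term 0 / term 1:", term_term(0, 1))
print("doc 0 / doc 2:", doc_doc(0, 2))
```

In this toy data, documents 0 and 1 share terms with each other but not with documents 2 and 3, so associations across the two groups come out at approximately zero.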
Introduction to Search Technology
The Once and Future History of Enterprise Search and Open Source
In this article, Marc Krellenstein from Lucid Imagination covers a lot of background on the history of Search, and the importance of the Search problem in Enterprises today.
The Guardian Platform