Crisp Reading Notes on Latest Technology Trends and Basics


Recognizing Keywords from Documents


http://www.cs.cmu.edu/~vitor/papers/www06.pdf

Problem Statement

  • Several applications need to determine the keywords of a document quickly, such as
  1. Contextual Advertising
  2. News Query Extraction
  3. Email Query Extraction
  • One technique for solving this problem is a four-step process
  1. Pre-processing
  • The document is pre-processed so that markup tags are removed and the text itself gets prominence
  2. Candidate Selection
  • The candidate phrases are extracted from the text. Typically these are Nouns and Noun Phrases.
  3. Scoring
  • The features for each candidate are computed.
  • Then all the candidates are scored
  • The scoring uses a supervised learning model, trained on pages that annotators have labelled with keywords in advance.
  4. Post-processing
  • At the end of scoring, all candidates are listed in order of decreasing score.
  • Other rule-based constraints may then be applied to the scored items
  • The paper describes such a system used in “Contextual Advertising”.
  • The details are easy to read, and the system is simple and practical
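
The four steps above can be sketched in a few lines of Python. This is only a toy illustration, not the system from the paper: the stopword-bounded n-grams stand in for real noun-phrase detection, the three features are picked for illustration, and the numbers in WEIGHTS are made-up placeholders for what a supervised learner would fit from annotated pages. All function names here are hypothetical.

```python
import math
import re
from html import unescape

STOPWORDS = {"the", "a", "an", "of", "and", "or", "to", "in", "on",
             "for", "is", "are", "with", "by", "at"}

# Placeholder weights (assumption): a real system would learn these
# from annotated training pages instead of hard-coding them.
WEIGHTS = {"bias": -2.0, "log_tf": 1.5, "early": 1.0, "in_title": 2.0}


def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())


def preprocess(html):
    """Step 1: drop script/style blocks and markup tags, keep the text."""
    text = re.sub(r"(?is)<(script|style).*?>.*?</\1>", " ", html)
    text = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", unescape(text)).strip()


def select_candidates(text, max_len=3):
    """Step 2: n-grams that neither start nor end with a stopword.

    A crude stand-in for noun-phrase detection (no POS tagger here).
    """
    words = tokenize(text)
    counts, first_pos = {}, {}
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            phrase = " ".join(gram)
            counts[phrase] = counts.get(phrase, 0) + 1
            first_pos.setdefault(phrase, i)
    return counts, first_pos, len(words)


def score(features):
    """Step 3: logistic score from a weighted sum of the features."""
    z = WEIGHTS["bias"] + sum(WEIGHTS[k] * v for k, v in features.items())
    return 1.0 / (1.0 + math.exp(-z))


def extract_keywords(html, title="", top_n=5):
    counts, first_pos, n_words = select_candidates(preprocess(html))
    padded_title = " " + " ".join(tokenize(title)) + " "
    scored = []
    for phrase, tf in counts.items():
        features = {
            "log_tf": math.log(1 + tf),
            "early": 1.0 - first_pos[phrase] / max(n_words, 1),
            "in_title": 1.0 if f" {phrase} " in padded_title else 0.0,
        }
        scored.append((score(features), phrase))
    # Step 4: an example rule-based constraint (drop purely numeric
    # candidates), then list the rest in decreasing score.
    scored = [(s, p) for s, p in scored if not p.replace(" ", "").isdigit()]
    scored.sort(key=lambda sp: (-sp[0], sp[1]))
    return scored[:top_n]


page = ("<html><head><style>p{color:red}</style></head><body>"
        "<h1>Digital cameras</h1><p>Compare digital cameras and camera "
        "lenses. The best digital cameras of the year.</p></body></html>")
for s, phrase in extract_keywords(page, title="Digital Cameras"):
    print(f"{s:.3f}  {phrase}")
```

Swapping the hand-set weights for a trained classifier would leave the rest of the pipeline unchanged, which is part of what makes this kind of design practical.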

Latent Semantic Analysis

Latent Semantic Analysis is a widely adopted technique to associate Documents with Terms

Concept Space

  • Documents and Terms are indirectly associated with each other through “Concepts”.
  • The number of Concepts is far smaller than the number of Documents and Terms.
  • Concepts are purely abstract. An individual Concept may have no concrete meaning or relevance.

Latent Semantic Analysis Process

  1. Estimate the importance of each Term within each Document. A typical metric is tf-idf
  2. Send the resulting Document-Term matrix to the SVD algorithm, and pick the top K singular values
  3. Prune the Document and Term matrices so that their vectors contain only those K factors
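
The three steps above map fairly directly onto NumPy. The sketch below is one possible reading, not a reference implementation: the function names tfidf and lsa are made up for illustration, and the smoothed idf formula is just one common tf-idf variant chosen as an assumption.

```python
import numpy as np


def tfidf(counts):
    """Step 1: weight a documents x terms count matrix with tf-idf.

    Uses a smoothed idf, log((1 + N) / (1 + df)) + 1 (an assumption;
    other tf-idf variants would work equally well here).
    """
    counts = np.asarray(counts, dtype=float)
    n_docs = counts.shape[0]
    # Term frequency: counts normalised by document length.
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    # Document frequency: number of documents containing each term.
    df = np.count_nonzero(counts, axis=0)
    idf = np.log((1.0 + n_docs) / (1.0 + df)) + 1.0
    return tf * idf


def lsa(counts, k):
    """Steps 2 and 3: run the SVD and keep only the top k factors."""
    X = tfidf(counts)
    # NumPy returns the singular values in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :]


# Rows are documents, columns are terms (toy data).
counts = [[2, 1, 1, 0, 0],
          [1, 2, 1, 0, 0],
          [0, 0, 0, 2, 1],
          [0, 0, 0, 1, 2]]
U_k, s_k, Vt_k = lsa(counts, k=2)
print(U_k.shape, s_k.shape, Vt_k.shape)
```

Each row of U_k is a Document expressed in K concepts, and each column of Vt_k is a Term expressed in the same K concepts.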

Strength of the Association

  • Document-Term: the association between a Document and a Term is given by the dot product of the Document's row and the Term's column in the K-factor matrices, weighted by the K singular values.
  • Term-Term: the association between one Term and another is given by the dot product of the two Terms' vectors.
  • Document-Document: the association between one Document and another is given by the dot product of the two Documents' vectors.
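
As a rough illustration of these three dot products, the sketch below runs the SVD on a small raw count matrix (tf-idf weighting is skipped to keep it short) and scales the Document and Term vectors by the singular values before taking dot products. The function name associations and the toy data are assumptions made for the example.

```python
import numpy as np


def associations(X, k):
    """Return (doc_term, term_term, doc_doc) strengths in a k-concept space."""
    U, s, Vt = np.linalg.svd(np.asarray(X, dtype=float), full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    doc_vecs = U_k * s_k        # one row per document, scaled by singular values
    term_vecs = Vt_k.T * s_k    # one row per term, scaled by singular values
    doc_term = doc_vecs @ Vt_k            # document row . term column
    term_term = term_vecs @ term_vecs.T   # term . term
    doc_doc = doc_vecs @ doc_vecs.T       # document . document
    return doc_term, term_term, doc_doc


# Rows: documents 0-3. Columns: cat, dog, pet, stock, market (toy data).
X = [[2, 1, 1, 0, 0],
     [1, 2, 1, 0, 0],
     [0, 0, 0, 2, 1],
     [0, 0, 0, 1, 2]]
doc_term, term_term, doc_doc = associations(X, k=2)
print("doc0 vs doc1:", doc_doc[0, 1])
print("doc0 vs doc2:", doc_doc[0, 2])
print("cat vs dog:", term_term[0, 1])
print("cat vs stock:", term_term[0, 3])
```

When k equals the full rank of the matrix, doc_term reproduces X itself, and term_term and doc_doc reduce to the plain products X^T X and X X^T. Smaller values of k trade that exactness for associations routed through fewer Concepts.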

Lucene Conferences

http://www.lucidimagination.com/devzone/events/conferences/revolution/2011

Introduction to Search Technology

The Once and Future History of Enterprise Search and Open Source
In this article, Marc Krellenstein of Lucid Imagination covers a lot of background on the history of Search and on the importance of the Search problem in Enterprises today.
Rating: Thought-Provoking

The Guardian Platform
