Crisp Reading Notes on Latest Technology Trends and Basics

Archive for the ‘Statistics’ Category

Optimal Stopping Time

The notes here (optimal_secretaries) summarize two problems:

  1. How to maximize one’s chance of picking the best candidate from a finite pool when a decision must be made immediately after each observation, with no knowledge of what the rest of the pool holds.
  2. How to maximize the chance of acting on the last occurrence of an event, and so get a favorable exit from a given situation.
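For the first problem, a minimal sketch of the classic "observe first, then commit" rule may help. It assumes the textbook secretary setup (distinct scores, random arrival order) and the well-known 1/e cutoff; the function names `secretary_choice` and `success_rate` are made up for this illustration and are not taken from the linked notes.

```python
import math
import random

def secretary_choice(scores):
    """Return the index picked by the classic 1/e stopping rule."""
    n = len(scores)
    cutoff = int(n / math.e)  # observe-only phase: never accept these candidates
    best_seen = max(scores[:cutoff], default=float("-inf"))
    for i in range(cutoff, n):
        if scores[i] > best_seen:  # first candidate better than everything seen so far
            return i
    return n - 1  # nobody beat the benchmark, so we are forced to take the last one

def success_rate(n=100, trials=20_000, seed=0):
    """Estimate how often the rule picks the single best candidate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        scores = rng.sample(range(n), n)  # random permutation of distinct ranks
        if scores[secretary_choice(scores)] == n - 1:
            wins += 1
    return wins / trials
```

Calling `success_rate()` should give an estimate in the neighbourhood of 1/e (about 0.37), which is the known limiting success probability for this rule.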



Chi-Square Distribution and Tests


  • Chi-square tests measure how far an observed frequency distribution deviates from the expected distribution
  • This has two common use cases
    • Goodness of fit: how well the population from which a sample was drawn fits a theoretical distribution
    • Independence: whether two variables are independent, judged by how far the observed counts deviate from the counts expected under independence
  • Beyond the theory, the statistic is a simple, straightforward formula that can be applied effectively
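The formula in question is the sum of (observed - expected)² / expected over all cells. The sketch below shows it for both use cases in plain Python; the function names are illustrative, and turning the statistic into a p-value (which needs the degrees of freedom and a chi-square table or library) is deliberately left out.

```python
def chi_square_statistic(observed, expected):
    """Sum of (O - E)^2 / E over matching cells (goodness of fit)."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def expected_counts(table):
    """Expected cell counts under independence: row total * column total / grand total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]

def chi_square_independence(table):
    """Chi-square statistic for a contingency table of observed counts."""
    expected = expected_counts(table)
    return sum(
        chi_square_statistic(obs_row, exp_row)
        for obs_row, exp_row in zip(table, expected)
    )
```

For example, `chi_square_statistic([20, 30], [25, 25])` compares 50 coin flips against a fair coin, and `chi_square_independence([[10, 20], [20, 40]])` is zero because the rows of that (made-up) table are exactly proportional.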

Generalized Linear Regression

Document in MS-Word format: glm_regression

This is a long post on an intermediate-level topic of interest to people working in Big Data.

These are my notes, written as I struggled to understand the topic from the available references. Unfortunately, the references all used the same wording and the same areas of focus. They were deficient in some crucial areas, the missing links being:

  1. To explain the Link function as the “Maximum Likelihood function” of the original distribution
  2. To explain how Maximum Likelihood applies to regression: the objective is to fit a probability distribution function that maximizes the probability that the observed independent variables would give as output the observed dependent variables
  3. That only a single pdf is being fit, even though the observations are in N dimensions
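In the usual GLM notation (an assumption here, since the excerpt itself introduces no symbols), the three points can be written compactly. Let x_i be the vector of independent variables for observation i, y_i the dependent variable, β the coefficient vector, g the link function, and f the single chosen pdf with mean μ_i:

```latex
% Linear predictor and link: the mean of y_i is tied to x_i through g
\eta_i = x_i^{\top}\beta, \qquad g(\mu_i) = \eta_i, \qquad \mu_i = \mathbb{E}[\, y_i \mid x_i \,]

% Maximum likelihood: one pdf f, one shared \beta, summed over all N observations
\hat{\beta} = \arg\max_{\beta} \sum_{i=1}^{N} \log f\!\left( y_i \,;\, \mu_i = g^{-1}\!\left(x_i^{\top}\beta\right) \right)
```

Read this way, the same family f is used for every observation; only its mean changes with x_i, and the fit consists of choosing the one β that makes the observed y_i most probable.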


Latent Semantic Analysis

Latent Semantic Analysis is a widely adopted technique to associate Documents with Terms

Concept Space

  • Documents and Terms are indirectly associated with each other through “Concepts”.
  • The number of Concepts is far less than the number of Documents and Terms.
  • Concepts are only abstract. They may have no concrete meaning or relevance.

Latent Semantic Analysis Process

  1. Estimate the importance of a Term within the document. A typical metric is tf-idf
  2. Send the Document-Term matrix to the SVD algorithm, and pick the top K singular values
  3. Prune the vectors for the Document and Term matrices to contain only the K factors

Strength of the Association

Document-Term: The association between a Document and a Term is given by the dot-product between the Document row and the Term column.
Term-Term: The association between one Term and another is given by the dot-product between the two Term vectors.
Document-Document: The association between one Document and another is given by the dot-product between the two Document vectors.
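As a small illustration, all three associations reduce to the same dot-product once documents and terms are expressed as K-dimensional concept vectors. The vectors below are hypothetical values for K = 2, chosen by hand rather than produced by an actual SVD.

```python
def dot(u, v):
    """Dot-product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical K=2 concept-space vectors (illustrative numbers only)
doc_vectors = {"doc_a": [0.9, 0.1], "doc_b": [0.2, 0.8]}
term_vectors = {"python": [0.8, 0.0], "finance": [0.1, 0.9]}

doc_term = dot(doc_vectors["doc_a"], term_vectors["python"])      # Document-Term
term_term = dot(term_vectors["python"], term_vectors["finance"])  # Term-Term
doc_doc = dot(doc_vectors["doc_a"], doc_vectors["doc_b"])         # Document-Document
```

A larger value means a stronger association; with these made-up numbers, `doc_a` is tied more closely to `python` than to `finance`.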
