Thursday, July 17, 2014

Topic Modeling - How Google Extracts "Topics" and Establishes Relationship Between Documents

Topic modeling is a highly technical subject associated with the on-page optimization part of the search engine optimization process. A topic model is a kind of statistical model that helps to extract "topics" from web pages in order to determine the relevancy of a page with respect to search queries. It helps to discover hidden topic-based patterns and assigns scores based on the relevancy of documents. In documents related to the keyword SHIP, we can expect common words like "sea", "water", "fleet", "river", "mast", "vessel", "cargo" and "fish", while documents related to SPACESHIP would use words like "earth", "planets", "aircraft", "space mission", "satellite", "lunar" and "orbit". Both kinds of documents, however, will also contain function words like "the", "has", "is" and "for" that carry no topical signal. A study of the words clustered around a particular topic helps to unveil the main or important topic of a document.
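The idea above can be sketched in a few lines of code. This is a toy illustration, not Google's actual algorithm: it scores a document against two hand-picked topic vocabularies (the SHIP and SPACESHIP word lists from the paragraph) while ignoring stopwords that carry no topical signal.

```python
# Toy sketch (not a real search-engine model): score a document against
# hand-picked topic vocabularies, ignoring non-topical function words.
STOPWORDS = {"the", "has", "is", "for", "a", "an", "of", "and", "with", "to"}

TOPIC_WORDS = {
    "ship": {"sea", "water", "fleet", "river", "mast", "vessel", "cargo", "fish"},
    "spaceship": {"earth", "planets", "aircraft", "satellite", "lunar", "orbit"},
}

def topic_scores(text):
    """Share of non-stopword tokens that match each topic vocabulary."""
    tokens = [t for t in text.lower().split() if t not in STOPWORDS]
    return {topic: sum(t in words for t in tokens) / len(tokens)
            for topic, words in TOPIC_WORDS.items()}

scores = topic_scores("The vessel left the river for the sea with cargo and fish")
```

Here the "ship" vocabulary matches most of the content words, so the document scores far higher for that topic than for "spaceship".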

Similarly, topic modeling helps to find the web pages most relevant to the main topic of a search query. For example, the query "Charles Dickens" will return web pages that score highly because a large percentage of their words are associated with that topic, ahead of pages that score low. A document in which 70% of the words relate to the main topic will rank higher than one in which only 40% do. (Please note that actual Google ranking is based on more than 200 factors, and topic modeling is just one of those signals.)
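The 70%-versus-40% comparison can be made concrete with a small sketch. The topic vocabulary and the two sample documents below are hypothetical, made up purely for illustration:

```python
# Hypothetical illustration of ranking by topic-word share: the document
# whose words overlap the topic vocabulary more ranks first.
topic_words = {"dickens", "novel", "victorian", "oliver", "twist", "copperfield"}

def topic_share(doc):
    """Fraction of a document's words that belong to the topic vocabulary."""
    tokens = doc.lower().split()
    return sum(t in topic_words for t in tokens) / len(tokens)

docs = {
    "a": "dickens wrote the victorian novel oliver twist and david copperfield too",
    "b": "a page that mentions dickens once among many unrelated words in it",
}

# Sort documents by descending topic share -- the topical page wins.
ranked = sorted(docs, key=lambda d: topic_share(docs[d]), reverse=True)
```

Document "a" has roughly half of its words in the topic vocabulary, while "b" has only one match, so "a" ranks first under this one signal.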

The Types of Topic Models

LSA/LSI - Latent Semantic Analysis or Latent Semantic Indexing (For determining relationships)

The technique of LSA was developed by Scott Deerwester, Susan Dumais, George Furnas, Richard Harshman, Thomas Landauer, Karen Lochbaum and Lynn Streeter, with the seminal paper appearing in 1990. Commonly used in natural language processing, LSA identifies concepts present in two or more similar documents and establishes a relationship between them. It assumes that words close in meaning occur in similar sets of documents. For example, a general word like "pet" can appear both in documents that discuss dogs and in documents that discuss cats, while words like "litter" and "bark" appear in documents about cats and dogs respectively. LSA uses a technique called Singular Value Decomposition (SVD) to identify these patterns and then establishes relationships between terms and concepts. LSI is helpful in overcoming the difficulties of synonymy (different words with similar meanings) and polysemy (one word with multiple meanings).
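The starting point for LSA is a term-document matrix. The sketch below builds that matrix for three toy documents echoing the pet/dog/cat example; real LSA would then apply SVD to this matrix to compress terms and documents into a shared low-dimensional "concept" space (the SVD step itself is omitted here).

```python
# Sketch of the term-document count matrix that LSA starts from.
# Real LSA then factorizes this matrix with Singular Value Decomposition.
from collections import Counter

docs = [
    "my pet dog likes to bark at the mailman",
    "a pet cat needs a clean litter box",
    "dogs bark when another dog walks past",
]

vocab = sorted({w for d in docs for w in d.split()})
counts = [Counter(d.split()) for d in docs]

# Rows = terms, columns = documents; each cell counts one term in one doc.
matrix = [[c[w] for c in counts] for w in vocab]

# "bark" appears in documents 1 and 3 but not the cat document; SVD would
# push "bark"/"dog" and "litter"/"cat" into distinct concept directions.
bark_row = matrix[vocab.index("bark")]
```

Note how "pet" spans both the dog and cat documents; it is exactly this kind of shared term that lets SVD relate otherwise disjoint vocabularies.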

LDA - Latent Dirichlet Allocation (For identifying topic probability)

This model is helpful in determining a topic for a document. Let us take an example:

Consider these sentences:

  • Online shopping requires the use of a credit card.
  • Major online shopping sites accept payments through credit cards.
  • Chocolate cake is one of my favorite recipes.
  • You need eggs and cocoa powder to prepare chocolate cakes.

Through LDA we can determine that the 1st and 2nd sentences relate to the topic "online shopping" while the 3rd and 4th relate to the topic "chocolate cakes". LDA is good at identifying the mixture of topics in a document and estimating what percentage of the document each topic accounts for.

For example, in the 1st sentence, the words "online", "shopping", "credit" and "card" are all associated with the topic "online shopping", which accounts for roughly half of the sentence's words.

Similarly, taking all four sentences together, we get an LDA topic mixture of roughly 1/2 "online shopping" and 1/2 "chocolate cakes".
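The counting above can be recreated in code. Real LDA infers topic proportions through Bayesian inference over the whole corpus; this sketch just counts topic-associated words directly, using made-up topic vocabularies:

```python
# Toy recreation of the topic-mixture arithmetic. Real LDA infers these
# proportions statistically; here we simply count topic-associated words.
topic_vocab = {
    "online shopping": {"online", "shopping", "credit", "card", "cards",
                        "payments", "sites"},
    "chocolate cakes": {"chocolate", "cake", "cakes", "recipe", "eggs",
                        "cocoa", "powder"},
}

def topic_mixture(sentence):
    """Fraction of the sentence's words associated with each topic."""
    tokens = sentence.lower().rstrip(".").split()
    return {topic: sum(w in vocab for w in tokens) / len(tokens)
            for topic, vocab in topic_vocab.items()}

mix = topic_mixture("Online shopping requires the use of a credit card.")
# "online", "shopping", "credit", "card" match -> 4 of the 9 words
```

Applying the same count to all four sentences yields roughly equal shares for the two topics, matching the 1/2-and-1/2 mixture described above.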

ESA - Explicit Semantic Analysis

ESA was designed by Evgeniy Gabrilovich and Shaul Markovitch as a way to represent text in terms of explicitly defined, labeled concepts. It is similar to LSI but differs in that it uses a knowledge base like Wikipedia, the ODP (Open Directory Project) or the Reuters corpus to label the concepts or entities. Google's Knowledge Graph reflects a similar idea of organizing search around explicitly labeled entities.
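ESA's core move is to represent a text as a weighted vector over knowledge-base concepts. The sketch below uses a tiny made-up concept index standing in for real Wikipedia articles; every word contributes weight to the concepts it is associated with:

```python
# Hedged toy of ESA's core idea: map text to a weighted vector of explicit
# concepts. The index below is invented, standing in for a real knowledge
# base such as Wikipedia.
concept_index = {
    "dickens": {"Charles Dickens": 3, "Victorian literature": 2},
    "novel":   {"Victorian literature": 1, "Novel": 2},
    "orbit":   {"Satellite": 2, "Space flight": 3},
}

def concept_vector(text):
    """Sum per-word concept weights into one concept vector for the text."""
    vec = {}
    for token in text.lower().split():
        for concept, weight in concept_index.get(token, {}).items():
            vec[concept] = vec.get(concept, 0) + weight
    return vec

vec = concept_vector("dickens wrote a novel")
```

Because the resulting dimensions are named concepts rather than anonymous SVD directions, ESA representations are directly interpretable, which is the key contrast with LSI.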

Topic Models and Search Engine Rankings

Topic modeling has been used extensively by search engines like Google and Bing. Vector-based analysis in their algorithms helps search engines store and retrieve data based on topics and the relationships between them, and Google has successfully implemented scalable search technology drawing on these ideas. An internet marketer with a sound knowledge of techniques like LSI, PLSI, ESA and LDA can draft a document in a manner that aligns with the signals these algorithms measure: creating documents focused around a topic, and using topically related words throughout, makes a document look highly relevant in the eyes of a search engine. Features like LSI, the TF-IDF score, co-occurrences and co-citations, and the Knowledge Graph are all applications of topic modeling. However, as said earlier, Google uses more than 200 signals for determining rankings, and topic modeling is just one of them.
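Of the signals listed above, TF-IDF is simple enough to show directly. The sketch below uses the standard formula, term frequency times inverse document frequency, on a made-up three-document collection: a term weighs more when it is frequent in one document but rare across the collection.

```python
# Sketch of the TF-IDF score: tf (how often a term appears in one document)
# times idf (how rare the term is across the whole collection).
import math

docs = [
    "ship ship sea cargo",
    "spaceship orbit satellite",
    "ship harbor dock",
]

def tf_idf(term, doc, docs):
    tokens = doc.split()
    tf = tokens.count(term) / len(tokens)          # term frequency in this doc
    df = sum(term in d.split() for d in docs)      # documents containing term
    idf = math.log(len(docs) / df)                 # rarity across collection
    return tf * idf

score_ship = tf_idf("ship", docs[0], docs)    # frequent but common term
score_cargo = tf_idf("cargo", docs[0], docs)  # rarer term, higher idf
```

Even though "ship" appears twice in the first document and "cargo" only once, "cargo" ends up with the higher score because it is unique to that document, which is exactly the discriminative behavior TF-IDF is designed for.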
