Topic Modelling

Heard about Topic Modelling?

“Be watchful with the words you use in your talk; you cannot have the same keywords for every speech.” This is a well-known saying among speakers.

A talk or speech should be clear, precise and to the point. To make a talk more interesting, speakers often adjust their words to suit the audience. Faulty word choice, incorrect grammar or straying from the context often leaves the speaker in an awkward situation.

Modifying a speech to suit a particular situation is one of the most difficult tasks. It is largely a data-driven process, because it requires the right set of words to be used. Hence, proofreading the script is an important step. Self-review and reviews from peers help arrive at a better end product. The ideas or topics conveyed through the talk are equally important, as the talk should convey the right message in the right tone to the right audience.

NLP techniques like topic modelling can help many speakers, because they review transcripts comprehensively. Topic modelling is a technique that extracts hidden topics from a group of documents. For each extracted topic, it provides a distribution of words, and for each document it provides a distribution of topics, making it easier for speakers to review their scripts. This is one example of how data science and NLP can serve our day-to-day needs.

Statistical techniques like Latent Dirichlet Allocation (LDA) are used to identify the hidden topics in a large corpus of documents.

In natural language processing, Latent Dirichlet Allocation is a generative statistical model that allows sets of observations (words or terms) to be explained by unobserved groups (topics) that explain why some parts of the data are similar. LDA is an unsupervised learning technique based on Bayesian statistics. It treats each document as a mixture of multiple topics and each topic as a distribution over words, and it identifies the hidden topics using a probabilistic approach.

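For a given topic, LDA assumes that the probability of a word appearing in a document is the product of two probability terms: the probability of the word appearing in the topic multiplied by the probability of the topic appearing in the document. Summing this product over all topics gives the overall probability of the word in the document.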
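As a small worked example of this product, the sketch below uses made-up probabilities for one document and two topics. The names `topic_given_doc` and `word_given_topic`, the topic labels and all the numbers are illustrative assumptions, not values from a fitted model.

```python
# Hypothetical probabilities for one document and two topics (illustrative only).
topic_given_doc = {"sports": 0.7, "politics": 0.3}  # P(topic | document)
word_given_topic = {                                 # P(word | topic)
    "sports": {"match": 0.20, "vote": 0.01},
    "politics": {"match": 0.02, "vote": 0.25},
}


def word_topic_products(word):
    """Return P(word | topic) * P(topic | document) for every topic."""
    return {
        topic: word_given_topic[topic][word] * p_topic
        for topic, p_topic in topic_given_doc.items()
    }


def word_given_doc(word):
    """Sum the per-topic products to get P(word | document)."""
    return sum(word_topic_products(word).values())


products = word_topic_products("match")
print(products)                 # one product per topic
print(word_given_doc("match"))  # sum of the products
```

Here the word "match" gets most of its probability from the "sports" topic, because that topic is both prominent in the document and likely to produce the word.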

LDA decomposes the document-term matrix (DTM) into two matrices: a document-to-topic matrix and a topic-to-term matrix.
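To make the shapes of these matrices concrete, here is a minimal sketch in plain Python. The toy corpus and the two factor matrices `doc_topic` and `topic_term` are hand-written assumptions for illustration; a real LDA run would estimate them from the data.

```python
from collections import Counter

# Toy corpus, made up for illustration.
docs = [
    "match goal team match",
    "vote election vote party",
    "team vote match",
]
vocab = sorted({word for doc in docs for word in doc.split()})
# vocab is: election, goal, match, party, team, vote

# Document-term matrix: one row per document, one column per vocabulary word.
dtm = []
for doc in docs:
    counts = Counter(doc.split())
    dtm.append([counts[word] for word in vocab])

# Hypothetical factors (not fitted): document-to-topic is 3 x 2,
# topic-to-term is 2 x 6. Every row is a probability distribution.
doc_topic = [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]]
topic_term = [
    [0.02, 0.30, 0.40, 0.02, 0.24, 0.02],  # topic 0 favours sports words
    [0.30, 0.02, 0.04, 0.20, 0.04, 0.40],  # topic 1 favours politics words
]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


# The product has the same shape as the DTM (documents x terms) and holds
# the modelled probability of each word in each document.
word_probs = matmul(doc_topic, topic_term)
for row in word_probs:
    print([round(p, 3) for p in row])
```

The two small matrices together describe the corpus with far fewer numbers than the full DTM would need for a large vocabulary.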

In the first iteration, the LDA algorithm starts by randomly assigning each word in each document to a topic, and then builds the document-to-topic matrix and the topic-to-term matrix from those assignments.
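The random starting point can be sketched as follows. The function name `initialise`, the toy documents and the fixed seed are assumptions made for this illustration; the matrices are stored as raw counts rather than probabilities.

```python
import random


def initialise(docs, num_topics, seed=0):
    """Randomly assign a topic to every word and build the two count matrices."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})

    # One random topic index per word occurrence.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]

    # doc_topic[d][t]: number of words in document d assigned to topic t.
    doc_topic = [[0] * num_topics for _ in docs]
    # topic_term[t][w]: number of times word w is assigned to topic t.
    topic_term = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]

    for d, doc in enumerate(docs):
        for word, topic in zip(doc, assignments[d]):
            doc_topic[d][topic] += 1
            topic_term[topic][word] += 1
    return assignments, doc_topic, topic_term


# Toy tokenised documents, made up for illustration.
docs = [["match", "goal", "team"], ["vote", "election", "party"]]
assignments, doc_topic, topic_term = initialise(docs, num_topics=2)
print(doc_topic)
print(topic_term)
```

At this stage the topics are meaningless, since the assignments are pure chance; the later iterations are what give them structure.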

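In the next iteration, for each term in each document, the probability of assigning it to a new topic is computed from two probabilities: p(topic t | document d), the proportion of words in document d that are currently assigned to topic t, and p(word w | topic t), the proportion of assignments to topic t, across all documents, that come from word w. These probabilities help decide whether a term should belong to, say, topic 1 or topic 2.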

Based on the new topic assignments, the document-to-topic and topic-to-term matrices are populated again. This process is repeated many times until it converges.
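The iterative reassignment can be sketched with collapsed Gibbs sampling, which is one common way to fit LDA and is assumed here rather than taken from the text above. The function name `gibbs_lda`, the smoothing values `alpha` and `beta`, the toy documents and the fixed number of iterations (used in place of a real convergence check) are all illustrative assumptions.

```python
import random


def gibbs_lda(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Tiny collapsed Gibbs sampler for LDA; returns the two count matrices."""
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    vocab_size = len(vocab)

    # Random initial assignment of every word occurrence to a topic.
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * num_topics for _ in docs]
    topic_term = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]
    topic_total = [0] * num_topics
    for d, doc in enumerate(docs):
        for word, topic in zip(doc, z[d]):
            doc_topic[d][topic] += 1
            topic_term[topic][word] += 1
            topic_total[topic] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the word's current assignment from the counts.
                old = z[d][i]
                doc_topic[d][old] -= 1
                topic_term[old][word] -= 1
                topic_total[old] -= 1

                # Weight of each topic: (how common the topic is in this
                # document) * (how common this word is in the topic),
                # with alpha and beta as smoothing so no weight is zero.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_term[t][word] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]

                # Sample a new topic and add the word back to the counts.
                new = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = new
                doc_topic[d][new] += 1
                topic_term[new][word] += 1
                topic_total[new] += 1

    return doc_topic, topic_term


# Toy tokenised documents, made up for illustration.
docs = [
    ["match", "goal", "team", "match"],
    ["vote", "election", "party", "vote"],
    ["team", "goal", "match"],
    ["party", "election", "vote"],
]
doc_topic, topic_term = gibbs_lda(docs, num_topics=2)
print(doc_topic)
print(topic_term)
```

Because the sampler is random, the topic numbering and the exact counts depend on the seed; only the totals per document are fixed by the input.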

Topic modelling can be used to review speeches, to automatically tag or annotate files in a system, and to tag files in an email inbox. Since words are grouped together into topics, topic modelling can also be used as a dimensionality reduction technique.