Building a Recommendation Engine using Machine Learning
We all love TED Talks, don’t we? Recently I came across a dataset on Kaggle (https://www.kaggle.com/rounakbanik/ted-talks), and it gave me an opportunity to put some NLP and text-mining techniques to the test. If you look at the data, you will realise that there are two files: one contains the metadata of the different TED Talks, such as the speaker's name, the title, the length of the talk, a list of similar talks, etc. The other file contains the complete transcripts of the talks. This got me thinking:
Since I have the transcripts of many TED Talks, can I come up with a way to recommend talks based on their similarity, just as the official TED page does?
The data came as a tabular flat file, with the transcript for each talk stored in a column named transcript. Here is what the file looked like:
```python
import pandas as pd

transcripts = pd.read_csv("E:\\Kaggle\\ted-data\\transcripts.csv")
transcripts.head()
```
| | transcript | url |
|---|---|---|
| 0 | Good morning. How are you?(Laughter)It’s been … | https://www.ted.com/talks/ken_robinson_says_sc… |
| 1 | Thank you so much, Chris. And it’s truly a gre… | https://www.ted.com/talks/al_gore_on_averting_… |
| 2 | (Music: “The Sound of Silence,” Simon & Garfun… | https://www.ted.com/talks/david_pogue_says_sim… |
| 3 | If you’re here today — and I’m very happy that… | https://www.ted.com/talks/majora_carter_s_tale… |
| 4 | About 10 years ago, I took on the task to teac… | https://www.ted.com/talks/hans_rosling_shows_t… |
After examining what the data looked like, I figured out that I could easily extract the title of each talk from its URL. My eventual goal was to use the text in the transcript column to create a measure of similarity, and then recommend the 4 most similar titles for a given talk. Separating the title from the URL was a straightforward string-split operation, as shown below (stripping the trailing newline that each URL carries):
```python
# The last URL segment is the talk's title; strip() removes the trailing newline
transcripts['title'] = transcripts['url'].map(lambda x: x.strip().split("/")[-1])
transcripts.head()
```
| | transcript | url | title |
|---|---|---|---|
| 0 | Good morning. How are you?(Laughter)It’s been … | https://www.ted.com/talks/ken_robinson_says_sc… | ken_robinson_says_schools_kill_creativity |
| 1 | Thank you so much, Chris. And it’s truly a gre… | https://www.ted.com/talks/al_gore_on_averting_… | al_gore_on_averting_climate_crisis |
| 2 | (Music: “The Sound of Silence,” Simon & Garfun… | https://www.ted.com/talks/david_pogue_says_sim… | david_pogue_says_simplicity_sells |
| 3 | If you’re here today — and I’m very happy that… | https://www.ted.com/talks/majora_carter_s_tale… | majora_carter_s_tale_of_urban_renewal |
| 4 | About 10 years ago, I took on the task to teac… | https://www.ted.com/talks/hans_rosling_shows_t… | hans_rosling_shows_the_best_stats_you_ve_ever_… |
At this point, I was ready to begin piecing together the components that would make up a talk recommender. To achieve this, I had to:
- Create a vector representation of each transcript
- Create a similarity matrix for the vector representation created above
- For each talk, based on some similarity metric, select the 4 most similar talks
Since our final goal is to recommend talks based on the similarity of their content, the first thing we have to do is create a representation of the transcripts that is amenable to comparison. One way of doing this is to create a Tf-Idf vector for each transcript. But what is this Tf-Idf business anyway? Let’s discuss that first.
Corpus, Document and Count Matrix
To represent text, we will think of each transcript as one “Document” and the set of all documents as a “Corpus”. Then we will create a vector representing the count of words that occur in each document, something like this:
As you can see, for each document we have created a vector counting the number of times each word occurs. So the vector (1, 1, 1, 1, 0, 0) represents the counts of the words “This”, “is”, “sentence”, “one”, “two”, “three” in document 1. This is known as a count matrix. There is one issue with such a representation of text, though: it doesn’t take into account the importance of words in a document. For example, the word “one” occurs only once in document 1 and is missing from the other documents, so from the point of view of importance, “one” characterises document 1. But if we look at the count vector for document 1, we can see that “one” gets a weight of 1, the same as words like “This” and “is”. Issues regarding the importance of words in a document can be handled using what is known as Tf-Idf.
Term Frequency-Inverse Document Frequency (Tf-Idf)
In order to understand how Tf-Idf helps identify the importance of words, let’s do a thought experiment and ask ourselves: what determines whether a word is important?
- If the word occurs a lot in the document?
- If the word occurs rarely in the corpus?
- Both 1 and 2?
A word is important in a document if it occurs a lot in that document but rarely in the other documents of the corpus. Term Frequency measures how often a word appears in a given document, while Inverse Document Frequency measures how rare the word is across the corpus. The product of these two quantities measures the importance of a word and is known as Tf-Idf. If you are working with a machine-learning framework such as scikit-learn, creating a Tf-Idf matrix representation of text data is fairly straightforward:
```python
from sklearn.feature_extraction import text

# Note: TfidfVectorizer's `input` parameter expects 'content', 'filename'
# or 'file', so the list of transcripts is passed to fit_transform instead
Text = transcripts['transcript'].tolist()
tfidf = text.TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(Text)
print(matrix.shape)
Once we sort out the issue of representing word vectors while taking the importance of words into account, we are all set to tackle the next question: how do we find out which documents (in our case, TED Talk transcripts) are similar to a given document?
To find similar documents among different documents, we need to compute a measure of similarity. When dealing with Tf-Idf vectors, we usually use cosine similarity. Think of cosine similarity as measuring how close one Tf-Idf vector is to another. If you remember from the previous discussion, we were able to represent each transcript as a vector, so cosine similarity becomes a means for us to find out how similar the transcript of one TED Talk is to another’s. So, essentially, I created a matrix of cosine similarities from the Tf-Idf vectors to represent how similar each document was to every other, schematically something like:
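To make the idea concrete: cosine similarity is the dot product of two vectors divided by the product of their norms, so it measures the angle between them rather than their magnitudes. A small NumPy sketch on toy vectors (not real Tf-Idf rows):

```python
import numpy as np

def cosine(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 1.0, 0.0])
c = np.array([0.0, 0.0, 1.0])

print(cosine(a, b))  # ~1.0: identical direction, maximally similar
print(cosine(a, c))  # 0.0: no shared non-zero dimensions
```

Since Tf-Idf vectors have non-negative entries, the scores range from 0 (no words in common) to 1 (identical word distribution), and each document always scores 1 against itself.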
Again, using sklearn, doing this was very straightforward:
```python
# Get similarity scores using cosine similarity
from sklearn.metrics.pairwise import cosine_similarity

sim_unigram = cosine_similarity(matrix)
```
All I had to do now was, for each transcript, find the 4 most similar ones based on cosine similarity. Algorithmically, this amounts to finding, for each row of the cosine matrix constructed above, the indices of the four columns most similar to the document (transcript, in our case) corresponding to that row, excluding the document itself. This was accomplished using a few lines of code:
```python
def get_similar_articles(x):
    # argsort gives indices in ascending score order; [-5:-1] keeps the
    # 4 highest scores while dropping the very highest (the talk itself)
    return ",".join(transcripts['title'].loc[x.argsort()[-5:-1]])

transcripts['similar_articles_unigram'] = [get_similar_articles(x) for x in sim_unigram]
```
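The slicing in that snippet can look cryptic, so here is what argsort()[-5:-1] does on a made-up row of similarity scores (the numbers are invented for illustration):

```python
import numpy as np

# Hypothetical similarity scores of one talk against six talks;
# index 2 is the talk itself, so its score is exactly 1.0
row = np.array([0.10, 0.40, 1.00, 0.25, 0.70, 0.05])

idx = row.argsort()   # indices in ascending score order
print(idx)            # [5 0 3 1 4 2]
print(idx[-5:-1])     # [0 3 1 4] -> the 4 most similar, self excluded
```

Note that the four surviving indices come out in ascending similarity order, and the trick of dropping the last index relies on a talk's similarity with itself always being the row maximum.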
Let’s check how we fared by examining the recommendations. Let’s pick any TED Talk title from the list, say:
'AL GORE ON AVERTING CLIMATE CRISIS'
Then, based on our analysis, the four most similar titles are:
['RORY BREMNER S ONE MAN WORLD SUMMIT',
 'ALICE BOWS LARKIN WE RE TOO LATE TO PREVENT CLIMATE CHANGE HERE S HOW WE ADAPT',
 'TED HALSTEAD A CLIMATE SOLUTION WHERE ALL SIDES CAN WIN',
 'AL GORE S NEW THINKING ON THE CLIMATE CRISIS']
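To wrap things up, the whole pipeline can be bundled into a small recommend-by-title helper. This is a self-contained sketch on an invented three-talk corpus (the titles and transcripts below are made up, not the real Kaggle data), but the same function works unchanged on the real transcripts DataFrame and similarity matrix:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy stand-in for the real transcripts DataFrame
transcripts = pd.DataFrame({
    "title": ["talk_a", "talk_b", "talk_c"],
    "transcript": [
        "climate change and the climate crisis",
        "climate policy and carbon emissions",
        "teaching creativity in schools",
    ],
})

matrix = TfidfVectorizer(stop_words="english").fit_transform(transcripts["transcript"])
sim = cosine_similarity(matrix)

def recommend(title, n=2):
    """Return the n talks most similar to `title`, excluding the talk itself."""
    i = transcripts.index[transcripts["title"] == title][0]
    order = sim[i].argsort()[::-1]            # indices in descending similarity
    picks = [j for j in order if j != i][:n]  # drop the talk itself, keep top n
    return transcripts["title"].iloc[picks].tolist()

print(recommend("talk_a"))  # talk_b shares the climate vocabulary, so it ranks first
```

Filtering out the query index explicitly, rather than relying on slice arithmetic, makes the self-exclusion more obvious and works even if two talks happen to tie at a similarity of 1.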