Building a Recommendation Engine using Machine Learning

In [1]:
from IPython.display import HTML
HTML('''<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.input').hide();
 } else {
 $('div.input').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
Out[1]:

We all love TED Talks, don’t we? Recently I came across a dataset on Kaggle (https://www.kaggle.com/rounakbanik/ted-talks). This dataset gave me an opportunity to put some NLP and text-mining techniques to the test. If you look at the data, you will realise that there are two files: one contains the metadata of different TED Talks, such as the speaker name, title, length of the talk, list of similar talks, etc.; the other has the complete transcripts of the talks. This got me thinking:

Given the transcripts of many TED Talks, can I come up with a way to recommend talks based on their similarity, just as the official TED page does?

The data came in as a tabular flat file; the transcript for each talk was stored in a column named transcript. Here is how the file looked:

In [2]:
import pandas as pd
transcripts=pd.read_csv("E:\\Kaggle\\ted-data\\transcripts.csv")
transcripts.head()
Out[2]:
   transcript                                         url
0  Good morning. How are you?(Laughter)It’s been …   https://www.ted.com/talks/ken_robinson_says_sc…
1  Thank you so much, Chris. And it’s truly a gre…   https://www.ted.com/talks/al_gore_on_averting_…
2  (Music: “The Sound of Silence,” Simon & Garfun…   https://www.ted.com/talks/david_pogue_says_sim…
3  If you’re here today — and I’m very happy that…   https://www.ted.com/talks/majora_carter_s_tale…
4  About 10 years ago, I took on the task to teac…   https://www.ted.com/talks/hans_rosling_shows_t…

After examining the data, I figured I could easily extract the title of each talk from its url. My eventual goal was to use the text in the transcript column to create a measure of similarity, and then recommend the four most similar titles for a given talk. Separating the title from the url was straightforward using a simple string split, as shown below

In [3]:
# the last path segment of the url is the talk's title slug
transcripts['title'] = transcripts['url'].map(lambda x: x.split("/")[-1])
transcripts.head()
Out[3]:
   transcript                                         url                                                title
0  Good morning. How are you?(Laughter)It’s been …   https://www.ted.com/talks/ken_robinson_says_sc…    ken_robinson_says_schools_kill_creativity\n
1  Thank you so much, Chris. And it’s truly a gre…   https://www.ted.com/talks/al_gore_on_averting_…    al_gore_on_averting_climate_crisis\n
2  (Music: “The Sound of Silence,” Simon & Garfun…   https://www.ted.com/talks/david_pogue_says_sim…    david_pogue_says_simplicity_sells\n
3  If you’re here today — and I’m very happy that…   https://www.ted.com/talks/majora_carter_s_tale…    majora_carter_s_tale_of_urban_renewal\n
4  About 10 years ago, I took on the task to teac…   https://www.ted.com/talks/hans_rosling_shows_t…    hans_rosling_shows_the_best_stats_you_ve_ever_…

At this point, I was ready to begin piecing together the components that would help me build a talk recommender. In order to achieve this, I had to:

  1. Create a vector representation of each transcript
  2. Create a similarity matrix for the vector representation created above
  3. For each talk, based on some similarity metric, select the four most similar talks

Since our final goal is to recommend talks based on the similarity of their content, the first thing we have to do is create a representation of the transcripts that is amenable to comparison. One way of doing this is to create a tf-idf vector for each transcript. But what is this tf-idf business anyway? Let’s discuss that first.

Corpus, Document and Count Matrix

To represent text, we will think of each transcript as one “Document” and the set of all documents as a “Corpus”. Then we will create a vector representing the count of words that occur in each document, something like this:

As you can see, for each document we have created a vector counting the number of times each word occurs. The vector $(1,1,1,1,0,0)$ represents the counts of the words “This”, “is”, “sentence”, “one”, “two”, “three” in document 1. This is known as a count matrix. There is one issue with such a representation of text, though: it doesn’t take into account the importance of words in a document. For example, the word “one” occurs only once in document 1 and is missing from the other documents, so from the point of view of importance, “one” characterises document 1. But if we look at the count vector for document 1, we can see that “one” gets a weight of 1, the same as generic words like “This” and “is”. This issue of word importance can be handled using what is known as tf-idf.

Term Frequency-Inverse Document Frequency (Tf-Idf)

In order to understand how tf-idf helps identify the importance of words, let’s do a thought experiment and ask ourselves: what determines whether a word is important?

  1. If the word occurs a lot in the document?
  2. If the word occurs rarely in the corpus?
  3. Both 1 and 2?

A word is important in a document if it occurs a lot in that document but rarely in other documents in the corpus. Term frequency measures how often the word appears in a given document, while inverse document frequency measures how rare the word is in the corpus. The product of these two quantities measures the importance of the word and is known as tf-idf. Creating a tf-idf representation is fairly straightforward if you are working with a machine learning framework such as scikit-learn

In [4]:
from sklearn.feature_extraction import text

# List of transcripts; each transcript is one document in our corpus
Text = transcripts['transcript'].tolist()

# Build tf-idf vectors, dropping common English stop words.
# (Note: TfidfVectorizer's `input` parameter selects a mode such as "content"
# or "filename"; the documents themselves are passed to fit_transform.)
tfidf = text.TfidfVectorizer(stop_words="english")
matrix = tfidf.fit_transform(Text)
#print(matrix.shape)
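To make the “rarity” part of tf-idf concrete, here is the smoothed idf formula that scikit-learn’s TfidfVectorizer uses by default (smooth_idf=True), computed by hand for a hypothetical three-document corpus; the document-frequency numbers are made up for illustration:

```python
import math

# Hypothetical corpus statistics (illustrative numbers, not from the TED data)
n_docs = 3     # documents in the corpus
df_rare = 1    # a rare word, appearing in only 1 document
df_common = 3  # a common word, appearing in all 3 documents

# Smoothed idf, as used by scikit-learn's TfidfVectorizer (smooth_idf=True):
# idf(t) = ln((1 + n) / (1 + df(t))) + 1
idf_rare = math.log((1 + n_docs) / (1 + df_rare)) + 1
idf_common = math.log((1 + n_docs) / (1 + df_common)) + 1

print(round(idf_rare, 4))    # 1.6931 -- rare words get boosted
print(round(idf_common, 4))  # 1.0    -- ubiquitous words get no boost
```

Multiplying these idf values by the per-document term frequencies yields the tf-idf weights, so a word like “one” from the earlier example ends up weighted more heavily than “This” or “is”.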

So once we sort out the issue of representing word vectors by taking into account the importance of words, we are all set to tackle the next question: how do we find out which documents (in our case, TED Talk transcripts) are similar to a given document?

To find similar documents, we need to compute a measure of similarity. When dealing with tf-idf vectors, we usually use cosine similarity. Think of cosine similarity as measuring how close one tf-idf vector is to another. As discussed above, we were able to represent each transcript as a vector, so cosine similarity becomes a means to find out how similar the transcript of one TED Talk is to another. So essentially, I created a cosine similarity matrix from the tf-idf vectors to represent how similar each document was to every other, schematically something like:
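Cosine similarity is just the dot product of two vectors divided by the product of their lengths. A minimal NumPy sketch, using two made-up tf-idf vectors:

```python
import numpy as np

def cosine(u, v):
    # cos(theta) = (u . v) / (||u|| * ||v||)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Two hypothetical tf-idf vectors (illustrative values)
a = np.array([1.0, 1.0, 0.0])
b = np.array([1.0, 0.0, 1.0])

print(round(cosine(a, a), 4))  # 1.0 -- a vector is maximally similar to itself
print(round(cosine(a, b), 4))  # 0.5 -- the vectors share one of their two nonzero terms
```

A score of 1 means the vectors point in the same direction (identical word-weight profiles); a score of 0 means they share no weighted terms at all.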

Again, using sklearn, doing this was very straightforward

In [5]:
### Get Similarity Scores using cosine similarity
from sklearn.metrics.pairwise import cosine_similarity
sim_unigram=cosine_similarity(matrix)

All I had to do now was, for each transcript, find the four most similar ones based on cosine similarity. Algorithmically, this amounts to finding, for each row in the cosine matrix constructed above, the indices of the four columns most similar to the document (transcript, in our case) corresponding to that row, excluding the document itself, which is trivially most similar to itself. This was accomplished using a few lines of code

In [6]:
def get_similar_articles(x):
    # argsort sorts ascending, so the last index is the document itself
    # (self-similarity is 1.0); the slice [-5:-1] keeps the next four
    return ",".join(transcripts['title'].loc[x.argsort()[-5:-1]])
transcripts['similar_articles_unigram'] = [get_similar_articles(x) for x in sim_unigram]
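The argsort indexing trick above can be illustrated on a toy similarity row; the similarity values here are made up:

```python
import numpy as np

# A hypothetical row of cosine similarities: document 0 against 6 documents
row = np.array([1.00, 0.10, 0.80, 0.30, 0.95, 0.20])

# argsort sorts ascending, so the last index is the document itself
# (self-similarity 1.00 at index 0); the slice [-5:-1] keeps the next
# four, ordered from least to most similar
top4 = row.argsort()[-5:-1]
print(top4)  # [5 3 2 4]
```

Note the result is in ascending order of similarity, so the closest match appears last, which is also why the most similar talk appears last in the recommendation lists below.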

Let’s check how we fared by examining the recommendations. Let’s pick any TED Talk title from the list; say we pick:

In [7]:
transcripts['title'].str.replace("_"," ").str.upper().str.strip()[1]
Out[7]:
'AL GORE ON AVERTING CLIMATE CRISIS'

Then, based on our analysis, the four most similar titles are:

In [8]:
transcripts['similar_articles_unigram'].str.replace("_"," ").str.upper().str.strip().str.split("\n")[1]
Out[8]:
['RORY BREMNER S ONE MAN WORLD SUMMIT',
 ',ALICE BOWS LARKIN WE RE TOO LATE TO PREVENT CLIMATE CHANGE HERE S HOW WE ADAPT',
 ',TED HALSTEAD A CLIMATE SOLUTION WHERE ALL SIDES CAN WIN',
 ',AL GORE S NEW THINKING ON THE CLIMATE CRISIS']

You can clearly see that by using tf-idf vectors to compare transcripts of the talks, we were able to pick up talks on similar themes. You can find the complete code HERE.