How to Create a Word Cloud in R
In my post on Word Clouds a few weeks back, I explained the concept and gave you some insights as to where they can be used as well as a list of Free Word Cloud Generators. In this post I am going to go a little deeper and explain how to actually create a Word Cloud in R.
1. Choose the Text File
Choose the text file for which you want to create a word cloud. For instance, I am going to create a word cloud of PM Modi’s speech at the United Nations. Why would I want to do this? Well, I am just curious to know which topics made the speech go viral!
The first step is to create a plain-text version of the speech. Copy and paste the speech (which you will find in PDF format online) into a plain text file (e.g. Speech.txt).
2. Installing Packages
Open RStudio. You will need to install the packages “tm” and “wordcloud”, and then load them in R.
Run the following commands in RStudio.
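A minimal sketch of this step, assuming both packages come from CRAN. The requireNamespace() check is an addition so the snippet can be re-run without reinstalling anything:

```r
# Install tm and wordcloud only if they are not already available.
for (pkg in c("tm", "wordcloud")) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
}

# Load the packages into the current R session.
library(tm)
library(wordcloud)
```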
3. Reading the File
The following commands read a text file into R. readLines() returns one character string per line of the file:
speech = "D:\\Desktop stuff\\Text mining Project\\bjp\\project\\BJP2014.txt"
modi_txt = readLines(speech)
4. Converting the text file into a Corpus
To process or clean the text with the tm package, you first need to convert the plain text into a format called a corpus, which tm can then work on. A corpus is a collection of documents (although in our case they all come from a single file). Following is the command to convert the text into a corpus.
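A minimal sketch of this conversion, assuming the lines read in step 3 are wrapped with VectorSource(), so that each line becomes one document in the corpus. VCorpus() is used here (tm's Corpus() is an alternative), the name modi is chosen to match the inspect() command in this step, and the two sample lines are placeholders so the snippet runs on its own:

```r
library(tm)

# Placeholder lines so the sketch runs on its own; in this post the
# vector comes from modi_txt = readLines(speech).
modi_txt <- c("This is the first line of some sample text.",
              "This is the second line of the sample text.")

# VectorSource() treats every element of the vector as one document.
modi <- VCorpus(VectorSource(modi_txt))

# Number of documents in the corpus (one per line of input).
length(modi)
```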
To see the first few documents in the corpus, type the R command: inspect(modi[1:10])
5. Data Cleaning
Execute the following commands in RStudio:
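A minimal sketch of the cleaning commands, assuming the corpus is called modi and the cleaned result is stored in modi_data (the name used in the later steps). The sample lines are placeholders, and wrapping tolower in content_transformer() is what recent versions of tm expect:

```r
library(tm)

# Placeholder corpus so the sketch runs on its own.
modi <- VCorpus(VectorSource(c("The  Speech mentioned 3 topics, twice!",
                               "The topics were   Peace and Development.")))

# Collapse repeated white space into single spaces.
modi_data <- tm_map(modi, stripWhitespace)
# Convert everything to lower case.
modi_data <- tm_map(modi_data, content_transformer(tolower))
# Drop common English words such as "the".
modi_data <- tm_map(modi_data, removeWords, stopwords("english"))
# Optionally drop digits and punctuation marks.
modi_data <- tm_map(modi_data, removeNumbers)
modi_data <- tm_map(modi_data, removePunctuation)
```

Removing words leaves gaps behind, so you may want to run stripWhitespace once more at the end.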
As you can see, the commands above use tm_map() from the tm package to process the text. They do the following: strip unnecessary white space, convert everything to lower case (since the tm package is case sensitive), and remove common English words like ‘the’ (so-called ‘stopwords’). You can also explicitly remove numbers and punctuation with the removeNumbers and removePunctuation transformations.
After looking at the text document, I also noticed the following additional stop words, which I wanted to remove:
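A minimal sketch of removing extra words, assuming the cleaned corpus is modi_data. The words in extra_stopwords are hypothetical placeholders, not the list from the speech; substitute the ones you spot in your own text:

```r
library(tm)

# Placeholder corpus so the sketch runs on its own.
modi_data <- VCorpus(VectorSource(c("also the word cloud shows many topics",
                                    "many words also appear in the cloud")))

# Hypothetical extra stop words; replace with the ones you want to drop.
extra_stopwords <- c("also", "many")

# removeWords deletes every listed word from each document.
modi_data <- tm_map(modi_data, removeWords, extra_stopwords)
```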
6. Create a Term Document Matrix
A term-document matrix (TDM) is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a term-document matrix, rows correspond to the words in the collection and columns correspond to documents.
We could create a word cloud even without a TDM, but the advantage of building one here is that it lets us look at the frequency of words.
tdm_modi <- TermDocumentMatrix(modi_data) # Creates a TDM
TDM1 <- as.matrix(tdm_modi) # Converts it into a plain matrix
v = sort(rowSums(TDM1), decreasing = TRUE) # Gives you the frequency of every word, highest first
summary(v) gives us the distribution of word frequencies, so we can see the minimum and maximum number of times a word occurs. This helps us set the “max.words” parameter in the next step.
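To see these frequencies on a small scale, here is a sketch that builds the matrix from a three-line placeholder corpus (not the speech) and prints the summary and the top words. The names docs and tdm are chosen only for this example:

```r
library(tm)

# Placeholder corpus so the sketch runs on its own.
docs <- VCorpus(VectorSource(c("peace peace development",
                               "development yoga peace",
                               "yoga")))

# Rows are terms, columns are documents.
tdm <- TermDocumentMatrix(docs)
# Total count of each term across all documents, highest first.
v <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)

summary(v)   # minimum, median, mean and maximum frequency
head(v, 3)   # the most frequent terms with their counts
length(v)    # number of unique terms, an upper bound for max.words
```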
7. Create your first word cloud!
wordcloud(modi_data, scale=c(5,0.5), max.words=1, random.order=FALSE, rot.per=0.35, use.r.layout=FALSE, colors=brewer.pal(8, "Dark2"))
scale controls the range of font sizes, from the largest to the smallest; max.words limits the number of words in the cloud (if you omit it, R will try to squeeze in every word that meets the minimum frequency, min.freq); random.order=FALSE plots the words in decreasing order of frequency; rot.per is the proportion of words drawn vertically (rotated by 90 degrees); and colors sets the color palette, applied from the least to the most frequent words.
As you may notice, since the speech we used was short, not many words were repeated. So the purpose of this particular exercise was simply to look at the topics that were touched upon in the speech. If you use bigger text documents (I would suggest you try some short stories, your favorite plays, or tweets on a trending topic), you can set max.words to 50 or 100, depending on how many words you want in your word cloud.
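For a longer text, one option is to pass the frequencies from step 6 straight to wordcloud() through its words and freq arguments. The sketch below assumes v is a sorted frequency vector like the one built earlier; the counts here are placeholders, and the scale, min.freq and max.words values are only examples:

```r
library(wordcloud)
library(RColorBrewer)   # provides brewer.pal()

# Placeholder frequencies so the sketch runs on its own; in this post,
# v comes from sort(rowSums(TDM1), decreasing = TRUE).
v <- c(cloud = 12, word = 9, text = 7, corpus = 5, matrix = 4,
       term = 4, speech = 3, plot = 2, color = 2, font = 1)

set.seed(42)   # makes the layout reproducible between runs
wordcloud(words = names(v), freq = v,
          scale = c(4, 0.5), min.freq = 2, max.words = 50,
          random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
```

Here min.freq drops words that occur fewer than two times, and max.words caps the cloud at the 50 most frequent of the remaining words.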