Some Fun Analysis on R Packages
Today’s post is by the team at Vozag, who use data & analytics to better understand the world around us. Their data-driven approach often sheds new light on topics we cover. In this article they show us how they analyzed Stack Overflow questions to see which R packages have the most questions asked about them. This will help us understand which R topics analysts and data scientists are having the most trouble with, & explore ideas to get better answers faster.
Here is our list of the top 8 R packages with the most questions asked about them, based on Stack Overflow data:
The chart below illustrates the number of questions asked about each package & the percentage that remain unanswered.
The R package with the most questions asked about it is ggplot2 with ~7200 questions, followed by data.table (2135), plyr (1213) & knitr (1136). Each of the remaining packages has fewer than 1000 questions in total. However, when we look at unanswered questions as a percentage of questions asked, the packages with the highest unanswered share are knitr, lattice and igraph with 24.2%, 23.5% and 19.8% respectively.
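To make the unanswered percentage concrete, here is a minimal sketch of the arithmetic, assuming the metric is simply unanswered questions divided by total questions asked. The function names are hypothetical and only the knitr figures quoted above (1136 questions, 24.2% unanswered) are used; the implied unanswered count is an approximation derived from those two numbers, not a figure from the original data.

```python
def unanswered_pct(total, unanswered):
    """Unanswered questions as a percentage of all questions asked."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * unanswered / total


def implied_unanswered(total, pct):
    """Invert the metric: approximate unanswered count from a total and a percentage."""
    return round(total * pct / 100.0)


# Figures quoted in the article for knitr (assumed to be unanswered / total).
knitr_total = 1136
knitr_pct = 24.2

knitr_unanswered = implied_unanswered(knitr_total, knitr_pct)
print(knitr_unanswered)                                          # 275
print(round(unanswered_pct(knitr_total, knitr_unanswered), 1))   # 24.2
```

In other words, roughly 275 of the knitr questions had no answer at the time of the analysis, a much smaller absolute number than the percentage alone might suggest.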
When we compare this data on questions asked with data on the most downloaded R packages, we find very little correlation between them. For example, while ggplot2 has the most questions asked about it and is the second most downloaded package, data.table (ranked second by number of questions) is not even among the top 100 downloaded packages. Another example is knitr, which is among the top 5 packages by questions asked but ranks only 27th among the most downloaded packages.
What are some of the implications of this data analysis and what can we learn to drive better understanding of R in the community?
The first lesson seems to be that the raw material for an FAQ on the key concepts of R packages already exists, scattered across the internet and in the minds of R practitioners. We need to do a better job of understanding where these questions are being asked and what those questions are.
Second, we need to develop a better answering framework for some of these frequently asked questions and concepts. For example, a lot of the questions about ggplot2 concern simple topics like graph alignment, axis alignment & UI issues. These topics are not complicated to learn or explain; these common & niggling problems just need a simple & quick explanation.
Third, we need to drive these explanations into the package documentation & features themselves. For example, if we have a question around showing graphs or ordering frames to matrices, we need to deliver the explanations within the UI itself, in the form of tool tips or simpler ways to search the existing documentation.
These are some ideas on driving better adoption of R and its various packages. What are your thoughts?