Five Lessons in Data Preparation

Part II of the Diary of a Freshly Minted Data Scientist Series


The last two weeks have been an exercise in understanding the challenges a data scientist faces in the process of data preparation.

In this time, my attempts to analyse Shakespeare and Indian football along with creating some animated graphics have resulted in little interesting output but massive beginner wisdom.

I am new to the field of data science and this is the second in a series of posts describing my efforts to keep my skills updated and keep improving as a professional in the field.

It is my intention to share some of the lessons I have learnt, hoping that everyone who is in, or has been through, this phase of their career can relate my story to their own experiences.

Shakespeare or soccer, analysis needs data preparation

A meeting with an adman-turned-text-analytics expert set me off on a quest to represent Shakespeare’s Julius Caesar using word clouds as part of an internship I sort of thrust upon myself.

Suresh Manian had a successful career in advertising before he decided to venture into areas he knew nothing about. Over the last several years, he has created Metaphic – a text analytics platform capable of generating magical insights from data that ranges from tweets to emails to customer support to comments by users on forums.

As part of a discussion on collaborating to develop end-user applications using Metaphic’s super powers, I got access to many of the platform’s capabilities and gave myself the mandate of developing content to help showcase what Metaphic can achieve.

After much deliberation and time wasted on poor choices of topic, I decided to get my feet wet in the sea of text analytics using a play that I had studied in school – William Shakespeare’s Julius Caesar.

Another ongoing project involves analysing data on Indian football. Add to that a learning exercise in the basics of animated graphics and you can understand that I have spent a lot of time thinking about machine learning and AI.

My actions, however, have been more about the rather less glamorous aspect of a data scientist’s job – data preparation and cleaning.

During the Post Graduate Program in Machine Learning and Data Science course that I completed, it was a lesson well taught and learnt that data preparation consumes most of the time in any project.

Yet, in these early days of working with data, the inevitability of data preparation, and how critical it is every single time you want to achieve something meaningful, has been a real revelation.

Data preparation lessons learnt the hard way

There is a lot of great material and insights available online about the importance of data preparation, the challenges faced as well as best practices and tools available for it. Most of these have been created by veterans from the industry or academicians with great knowledge.

I don’t have their expertise or experience with data projects but I can claim to be an expert at making mistakes. So, let me stick to my strengths and share with you the lessons I learnt while making mistakes over the last couple of weeks.

Lesson One: You can get nothing done without preparation

When I first set out to make a word cloud, I had seen a do-it-in-two-minutes tutorial. I had the text I needed and a clear picture of what I wanted my cloud to communicate. I thought I would be sharing my findings on social media in half an hour.

Half a week later, I was still struggling to have my text prepared so that any algorithm I used could create a word cloud that showed useful information.

And this is a story that keeps repeating. Any project – even a little beyond the ordinary – ends up taking far more time than initially anticipated because as you get deeper into it, you not only encounter more problems that you need to fix, but also figure out additional changes to data that may lead to better results!
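
For readers wondering what "preparing the text" can involve, here is a minimal sketch in Python, not the code used for this project. It assumes the text is plain English, and the stop-word list is a hypothetical, deliberately tiny one chosen only for illustration. It lowercases the text, strips punctuation, drops stop words and counts what is left, which is roughly the input a word cloud needs.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "and", "a", "of", "to", "i", "is", "in", "that", "not", "me", "my", "your"}


def word_frequencies(text, stop_words=STOP_WORDS):
    """Lowercase the text, keep alphabetic tokens, drop stop words and count the rest."""
    # Treat curly apostrophes like straight ones so words such as "caesar's" stay whole.
    text = text.replace("\u2019", "'").lower()
    tokens = re.findall(r"[a-z']+", text)
    # Strip apostrophes left at the edges of tokens by quotation marks.
    tokens = [token.strip("'") for token in tokens]
    return Counter(token for token in tokens if token and token not in stop_words)


sample = "Caesar! Beware the ides of March. Caesar, beware."
freqs = word_frequencies(sample)
print(freqs.most_common())
```

Even in a sketch this small, each step hides a decision (which words count as noise, what to do with names and archaic spellings), and that is where the half hour turns into half a week.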

Lesson Two: There is a lot of data but not much kept ready for your special needs

It took me many hours of work to finally extract useful data about the Indian football league from online sources.

Five seconds of euphoria gave way to a long groan of despair when I started inspecting the tables and found that there was a lot more work to be done to get the data ready for analysis.

Once again, I had fallen into the trap of calling success too early and found that my project plan would need to build in a lot more time for getting the data ready than I had ever anticipated. After all, there are variables with repeated names, variables that make no sense, data for the same event spread across different tables and not always consistent, missing values, and even tables within tables! I mean, who does that?

But of course, when the mind is calmer, I realize that the good people who made the website had other priorities to address. Rather than complaining about how much work I need to put into the data to get it into shape, I had better thank them for at least providing rich data on a subject where not much data-driven exploration has happened yet.
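
Two of the problems listed in this lesson, repeated variable names and missing values, are easy to check for before any analysis starts. The following is a minimal sketch using only Python's standard library. The function name `audit_table` and the sample table are made up for illustration and are not taken from the football data described here.

```python
import csv
import io
from collections import Counter


def audit_table(csv_text):
    """Report repeated column names and count blank cells per column position."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    # Column names that appear more than once in the header.
    repeated = sorted(name for name, count in Counter(header).items() if count > 1)
    # Blank cells per column, keyed by position because names may repeat.
    missing = {
        index: sum(1 for row in body if index >= len(row) or not row[index].strip())
        for index in range(len(header))
    }
    return repeated, missing


# Made-up sample data, not taken from any real source.
sample = """team,goals,goals,venue
Alpha FC,2,2,Home
Beta FC,,1,
"""
repeated, missing = audit_table(sample)
print("repeated columns:", repeated)
print("blank cells by column index:", missing)
```

A check like this does not fix anything, but it turns a vague sense that the table is messy into a list of specific things to clean.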

Lesson Three: Preparing data is like being on an emotional rollercoaster of possibilities

I may be being a touch dramatic here (spent time dabbling with Shakespeare after all) but every time I look back at a session where I have been trying to set up the perfect dataset, I feel like I have been on some sort of adventure.

Every look at the dataset makes the impossible seem possible. The mind races to solutions that you have already created based on the amazing potential that you have been able to detect in the data. And just when you feel that greatness has been achieved, the bubble bursts.

By the time you have an experimental output to see if you are on the right path, you realize that there was less insight and power in the data than you had earlier anticipated.

Don’t get me wrong!

Data science, machine learning and AI are building amazing solutions and so can you. But if every hunch that you had a winner turned out to be true, you would end up changing the world every single day!

Lesson Four: It’s about finding the balance between being smart and choosing the easy options

While analyzing the text of the play Julius Caesar, I figured I wanted to separate the characters from the dialogues and put them into different columns of a table.

I had a text document containing the entire play and I decided to see if the table was going to be helpful by manually copying and pasting the first scene of the play into an Excel sheet in the required format.

It so happened that soon I was doing the same with the second scene, then the entire first Act of the play, and then even more. Before I knew it, I was sucked into a copy-paste world where I felt compelled to just get the data ready for the next scene from the play and test a new hypothesis.

Needless to say, I was being excessively inefficient and before long had spent more time copying and pasting data than any reasonable person would have.

So, I decided to automate, which would have been the smart thing to do as soon as I figured out that I did see use in having the whole play in a neat Excel sheet. Of course, I now took too much time to write a tool that would do the remaining job for me. I might have finished what was left faster if I had stuck to my inefficient ways.

There are other stories and other projects where I have spent too much time making a tool when five minutes using MS Excel would have done the job!

I have no clarity yet on how I will approach this question the next time I encounter it, but at least I feel that I will be smarter about making my choice than I have been so far.
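
To give a sense of what the automated route can look like, here is a minimal sketch of the character/dialogue split described in this lesson. It is not the tool mentioned above. It assumes a particular layout in which each speech begins with the speaker's name in capitals followed by a full stop (for example "BRUTUS. What means this shouting?"), and other editions of the play will be formatted differently. Stage directions are not handled and would simply be appended to the preceding speech.

```python
import csv
import io
import re

# Assumed layout: a speech starts with the speaker's name in capitals followed by
# a full stop, e.g. "BRUTUS. What means this shouting?". Other editions will differ.
SPEAKER = re.compile(r"^([A-Z][A-Z ]+)\.\s*(.*)$")


def split_speeches(play_text):
    """Return a list of [character, dialogue] rows from plain play text."""
    rows = []
    for line in play_text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = SPEAKER.match(line)
        if match:
            rows.append([match.group(1), match.group(2)])
        elif rows:
            # A line without a speaker prefix continues the previous speech.
            rows[-1][1] = (rows[-1][1] + " " + line).strip()
    return rows


sample = """BRUTUS. What means this shouting? I do fear the people
Choose Caesar for their king.
CASSIUS. Ay, do you fear it?
Then must I think you would not have it so.
"""
rows = split_speeches(sample)

# Write the rows as CSV text, which Excel can open directly.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["character", "dialogue"])
writer.writerows(rows)
print(buffer.getvalue())
```

Whether twenty-odd lines like these beat an afternoon of copy and paste depends on how clean the source text is, which is exactly the judgment call this lesson is about.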

Lesson Five: The job is never really over

Let’s say you do finally reach a stage where you are able to run that final piece of code for the model or create that final visualization that you set out to do. I have found that just as night follows day, there is almost always an immediate realization of small changes that can be made to improve the solution. And very often, those changes need to be made right at the data preparation stage.

I guess that this is the inevitable process of feedback and correction associated with creating any solution but it does lead to the question about when you feel that what you have is good enough to show to others or put into production.

Moving on

The biggest conclusion I have come to over these last two weeks is that the process of checking if the data is valid, complete, consistent, uniform and accurate must be enjoyed rather than looked on as a chore.

And it can be! There are amazing packages to discover that solve very specific problems you may be facing, there are new methods to learn all the time and the process of gaining insights from data can potentially begin here.

Then there is the reward. With all the ifs and buts and failures along the way, every iteration of improving the data leads to improved results. And finally, you have something to show for your efforts, something that has come a long way from the first messy shot you took at your project.

Here is another reward for my efforts – one of my first attempts at creating an animated graphic using the excellent gganimate package in R.