As we move further into the information age, we are finding ourselves in an increasingly data-rich and data-oriented world. Almost everything that a company does today is dictated by data and analytics in one way or another. Organizations have begun using sophisticated analytics techniques to identify new growth areas, streamline costs, increase operating margins, make better human resource decisions and devise more effective budgets. The influence of data also spills over into our personal lives, health care, the environment and much more.
Experts have predicted that this influence will only continue to grow over the next few decades. Annual data generation was projected to increase by 4,300 percent by the year 2020. By 2015, 4.4 million jobs had been created globally to support Big Data, and over the next four years, around 6 million more jobs are expected to be generated by the information economy in North America alone.
At Jigsaw Academy, we strongly believe that every facet of our lives will soon be dictated by data and that the success of any business will depend on the efficiency of its data initiatives. Yet despite its ubiquity and ever-growing influence, analytics is still a new concept to a lot of people, particularly when it comes to buzzwords and key terms. At the same time, there is no shortage of young professionals looking to find a way to break into the analytics space. With this article, we hope to inspire them to begin educating themselves in data and to use this to build towards a career in what is already one of the hottest industries of the 21st century.
What is Analytics?
In his popular book ‘Competing on Analytics’, Thomas Davenport defines analytics as “The extensive use of data, statistical and quantitative analysis, exploratory and predictive models and fact-based management to drive decisions and actions.” To put it in layman’s terms, it refers to “the analysis of data to draw hidden insights, to aid in decision-making”.
In a business context, the importance of analytics cannot be overstated. It is used to examine and understand historical patterns and to predict and improve future business practices. As such, it is integral to the success of any organization, hence the widespread demand for analytics experts. For instance, a call centre manager analyses his team’s performance data for resource optimization. HR managers use employee data to predict attrition. Marketers use sales and marketing ROI data to decide the optimal allocation of their marketing budget.
Broadly, analytics is classified into two types – descriptive and predictive. Descriptive analytics is used to analyse and derive insights from past data whereas predictive analytics is used to study trends and predict what will happen in the future. Later in the post, we will be taking a more detailed look at both, with examples of how they are used.
Working with Data
In the first section of this article, we’ll be taking a broad look at working with data, and important factors related to that – first and foremost, defining what data is, storage terms, types of data, as well as how one would go about analyzing either structured or unstructured data.
What is Data?
The first step towards using data as an effective aid to decision making is to fully understand what data is, how it is generated and stored, and the different types of data and their characteristics.
When information is in its “raw” and “unorganized” form, it is called data. This data is then analyzed and valuable information is extracted from it. When we hear the term “data”, a lot of us immediately imagine spreadsheets and tables but data is by no means restricted to just numbers. In fact, data that is collected today is as varied as Facebook likes, heart rates on a device like a Fitbit and even your location detected by your smartphone. Data can be numeric, text-based, audio, or video and the amount of data that humans capture now is a staggering increase from even a few years ago. Indeed, there’s a popular quote attributed to IBM from 2013, which states that almost 90% of the world’s data had been captured in just the previous two years!
So where is all this data coming from? A majority of it is from relatively traditional sources, especially in businesses. These include transactional systems, billing systems, CRM (customer relationship management) systems, ERP (Enterprise Resource Planning) systems and others like them. They capture information such as what has been purchased from the inventory, prices paid by customers, location and time of the purchase etc. Together, this is where almost all of a business’ data resides and these systems form a crucial part of any traditional organization’s setup.
Data Storage and Capture
In recent times, we have begun to observe more unconventional sources of data. Over and above what is captured by CRM and ERP systems, businesses are now seeing a massive influx of data from the web, especially with the increasing influence of smartphones and other machines in our daily lives. As these devices typically generate a lot more text, image and video data, this presents its own challenges in terms of storage and processing.
The first step is enabling the proper collection of this data, for which technology is constantly evolving and improving. It now typically happens through digital data capture systems such as barcode scanners and voice or video recording systems, though there are still instances of manual intervention (for example, in a population census). Once the data has been captured, it is put into storage systems.
Most businesses today store data in databases, which form a part of data warehousing systems. However, these are not simply repositories into which raw data is dumped. The relationships between different data elements have to be correctly captured so that data can be retrieved easily for analysis later on. Over time, technology has been developed to ensure that raw data is stored properly, but a data scientist still needs to be aware of every step in the process.
Data Storage Terms
Let us now take a look at popular terms that relate to the data storage process. These are in very common usage in analytics and you will no doubt come across them at one point or another.
RDBMS (Relational Database Management Systems) – These are how most business data has traditionally been stored. Information is represented in tables, and a “primary key” helps connect and understand the relationship between different tables. Popular examples include Oracle, Microsoft SQL Server, MySQL and others.
SQL (Structured Query Language) – This is a very popular language used to query databases. A company would use this to identify what its top products are in terms of unit sales in a particular location.
Data Warehouses and Marts – These are both examples of databases created for reporting and analysis. Data marts are generally specific to a team, whereas data warehouses hold data relevant to the whole enterprise.
ETL (Extract, Transform, Load) – This is an important stage of data processing. It refers to the extraction of data from operational systems, cleaning (transforming) it to suit analytical requirements, and then loading the clean data into the appropriate warehousing system.
Data lake – A data lake is very similar to a data warehouse, with the difference being that data in a lake is in a much more “raw” form than what is stored in a warehouse. Moreover, a lake can contain all kinds of data, both structured and unstructured (which we will discuss later on in this article).
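As a simple illustration, the three ETL stages can be sketched in a few lines of Python. The order data here is made up, and an in-memory SQLite table stands in for the warehouse; real ETL tools add scheduling, error logging and far larger volumes, but the stages are the same.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from an operational system (Extract)
raw = "order_id,price\n1001, 19.99 \n1002,\n1003, 5.50 \n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: trim whitespace, convert types, drop rows with missing prices
clean = [(int(r["order_id"]), float(r["price"]))
         for r in rows if r["price"].strip()]

# Load into a warehouse table (in-memory SQLite standing in for the real thing)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, price REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", clean)
print(conn.execute("SELECT COUNT(*), SUM(price) FROM orders").fetchone())
```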
Data Types and Attributes
Now that we have seen how data is captured and stored, let us take a look at what it contains – specifically, the types of data we deal with and some common terminology associated with it.
As we saw earlier, even as recently as a decade ago, data was almost exclusively structured in rows and columns and generally in a numeric form. Now we have to deal with a wider variety of data, such as text (Amazon reviews), images (photos added to social media sites), audio (call recordings) and so on. While the possibilities with these are very exciting, these advancements present their own challenges to the data analysis process.
First and foremost, this type of data is considered to be semi-structured or unstructured. It is not possible to automatically store it in rows and columns, and it needs a fair amount of processing to be given some sort of structure. This processing is reasonably complex in itself. For instance, each review on a site would be written in its own style and tone, something that an algorithm would struggle with.
Data requirements for unstructured data are naturally far higher than for structured data. For example, to even categorise photos of men and women, you would need several thousand photos for the algorithm to reach a decent level of accuracy. As a result of this recent boom of unstructured data, it is important to know how to deal with it as a part of any data analysis process.
Structured Data Analysis
With any sort of data, the ultimate aim is to get it into a reasonably structured format. As a result, when data is discussed, we generally assume it to be in the form of a table, with rows and columns. In a standard data table, a row is called an ‘observation’ (referring to a specific, recorded data point), whereas a column is a ‘variable’.
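To make the terminology concrete, here is a toy table in Python (the store figures are made up): each dictionary is one observation and each key is a variable.

```python
# A tiny table: each dict is one observation (row); each key is a variable (column).
sales = [
    {"store": "A", "day": "Mon", "units": 40},
    {"store": "A", "day": "Tue", "units": 35},
    {"store": "B", "day": "Mon", "units": 52},
]

observations = len(sales)           # number of rows
variables = list(sales[0].keys())   # column names
print(observations, variables)      # 3 ['store', 'day', 'units']
```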
There are different data types within structured data. Broadly speaking, most programming languages classify data into three categories – numeric, character-based and date- or time-based.
Numeric data is what most of us visualize when we think of data. It can be broadly classified into Discrete and Continuous. Discrete data is in whole numbers and cannot be further broken down. For example, the number of customers who visit a store on a particular day.
Continuous data, on the other hand, contains both whole numbers as well as fractions. This could include time, revenue, price etc.
There are other forms of numeric data based on measurement scales which each have unique characteristics –
Nominal data: This refers to numbers where arithmetic processes wouldn’t apply. For example, pin codes or employee IDs.
Ordinal data: A rating scale from 1 to 5 is an example of this, where (for instance) 1 stands for “did not meet expectations” and 5 indicates “exceeds expectations”. The values have an order, but the gaps between them are not necessarily equal.
Interval data: Here, there is both order and a consistent distance between values. A temperature scale is an example – the difference between 25 and 20 degrees is the same as the difference between 20 and 15. However, there is no true zero (zero degrees does not mean an absence of temperature), so ratios are meaningless – 10 degrees is not “twice as hot” as 5 degrees.
Ratio data: This is exactly the same as interval data, only with the presence of a true zero value (for example, a weight scale).
Structured Data Analysis – Non-numeric data
After numeric data, the next broad category is character data. This is data made up of text. Examples include brand names, or the classification of gender as male and female. In general, the idea is to encode such qualitative data numerically so that mathematical operations can be applied to it.
This is not always possible, however, as not every character variable can be converted that way. To classify gender, for instance, you can assign the values 0 and 1 to male and female respectively. However, some character variables are purely descriptive, such as a product name or a customer name, and no such encoding is meaningful.
The third type of structured data is date and time data. It is treated as a separate class in most statistical software, as it may contain character as well as numeric elements. It is expected that mathematical operations may be performed on a given date (for example, computing the number of days since the last purchase).
The main challenge with this type of data is for the tools to recognize them as dates and times, since they may be represented in different formats (with slashes, colons, words etc).
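A small Python sketch shows the problem: the same date written three ways must all be recognized as dates before any date arithmetic is possible (the formats and dates here are illustrative).

```python
from datetime import datetime

# The same date written in three common formats; a tool must recognize all of them.
candidates = ["2017-03-15", "15/03/2017", "March 15, 2017"]
formats = ["%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y"]

def parse_date(text):
    # Try each known format in turn until one matches.
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    raise ValueError("unrecognized date format: " + text)

parsed = [parse_date(c) for c in candidates]
# Once recognized, arithmetic becomes possible, e.g. days since last purchase:
print((datetime(2017, 4, 1).date() - parsed[0]).days)   # 17
```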
There are also instances where data is in a mixed form, containing both characters and numbers. Let’s take a customer ID as an example, say BG7634-21. As this contains both text and numbers, it will usually be stored as character data. In this case, depending on the specifications of the tool being used, the person performing the analysis may choose to separate out the numeric or the character part.
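Using the customer ID from the example above, a regular expression in Python can separate the character and numeric parts (the pattern assumed here is letters, then digits, a hyphen, and more digits).

```python
import re

# Split a mixed-form ID such as the customer ID in the example above.
customer_id = "BG7634-21"
match = re.match(r"([A-Z]+)(\d+)-(\d+)", customer_id)
prefix, number, suffix = match.groups()
print(prefix, int(number), int(suffix))   # BG 7634 21
```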
Semi Structured and Unstructured Data Analysis
As we saw earlier, we are increasingly dealing with information that cannot be used in the context of traditional structured data. Emails, photos, tweets – how do we deal with all this?
Semi-structured data may have some structure, but it is often mixed with data that has no structure at all. Examples of semi-structured data are seen in what we call XML files. These are similar to HTML files; however, they contain information not just about how to display images or text on the web but also about the type of object each value represents. For instance, an address tag will contain an address, a phone tag will contain a phone number and so on.
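A short Python sketch illustrates this: the tags in a (made-up) XML record tell us what each value represents, so the record can be pulled into a structured form.

```python
import xml.etree.ElementTree as ET

# A made-up semi-structured record: tags describe what each value is.
doc = """
<customer>
  <name>Jane Doe</name>
  <address>12 High Street, Pune</address>
  <phone>+91-98765-43210</phone>
</customer>
"""

root = ET.fromstring(doc)
# Turn the tagged values into a structured dictionary.
record = {child.tag: child.text for child in root}
print(record["name"])   # Jane Doe
```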
Unstructured data, as the name suggests, is completely lacking in any type of structure. For example, consider the review of a movie. The aim is to extract some information from this review, and it is certainly not the kind of data that can be plotted in columns and rows. Data of this sort therefore needs two things – a system that can manage this kind of data and an analysis system that can extract meaningful information from it.
Analyzing unstructured data of this sort can be done in various ways such as by looking at the metadata and searching for specific things, such as a text string, or a particular image. Once these have been identified, dictionaries or corpuses can be created and applied to the data to extract meaning from it.
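As a minimal sketch of this dictionary-based approach, here is a made-up review scored against a tiny sentiment dictionary (a real corpus would contain thousands of entries).

```python
import re
from collections import Counter

# A made-up movie review and a tiny sentiment dictionary.
review = "A brilliant, moving film with a weak ending but superb acting."
positive = {"brilliant", "moving", "superb", "great"}
negative = {"weak", "dull", "boring"}

# Tokenize the free text, then count only the dictionary words.
words = re.findall(r"[a-z]+", review.lower())
counts = Counter(w for w in words if w in positive | negative)
score = sum(w in positive for w in words) - sum(w in negative for w in words)
print(score)   # 2  (three positive hits, one negative)
```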
At the end of the day, all data, whether structured or unstructured can and will be analyzed. The only challenge with the latter is that the storage and analysis methods are significantly different from the traditional techniques used for structured data.
In this next section, we will take a look at popular terms in analytics and what they mean. Some of them you may have seen before, but each one serves a distinct purpose and function in the data analysis process.
Popular terms in Analytics – BI, Data Science
Two fairly common terms that you are likely to come across in the analytics space are Business Intelligence (BI) and Data Science.
Business Intelligence refers to the extraction of meaningful information from business data without the use of statistical models or Machine Learning techniques. It is typically used to retroactively examine data and draw inferences from past events. While it is considered by many to be a lesser variant of analytics, a more accurate description would be that it is a component of the overarching field of analytics. The term ‘Business Intelligence’ is older and was in vogue long before ‘analytics’.
On the other hand, ‘Data Science’ has entered popular usage more recently. In fact, the term only overtook ‘Business Intelligence’ in popularity some time around late 2015. It refers to the science that incorporates both statistics as well as Machine Learning in its data analysis techniques. A Data Scientist works with structured as well as unstructured data and has a good command over programming as well as mathematics. This article helps provide more context on everything that Data Science encompasses.
Popular terms in Analytics – Big Data
The term ‘Big Data’ tends to be a source of confusion for a lot of people, largely because of how it is used interchangeably with ‘analytics’ in the media. In fact, Big Data is a subset of analytics as a whole, one that can be characterized by the ‘3 Vs’ –
- Volume
- Variety
- Velocity
For a very long time, companies would only work with very structured data – what they generated internally (like customer or transaction data) or the information they purchased from data aggregator businesses. More recently, companies have had to start contending with vast amounts of unstructured data such as audio, video and social media data, which would not be in a standard tabular format. Traditional analytics techniques are not designed to contend with data of this nature.
Soon enough, businesses realized that there was a need for alternative tools, databases and platforms in order to effectively analyze the vast amounts of unstructured data being generated. This led to the classification of Big Data as an independent concept while still being a part of analytics as a whole. To summarize, Big Data refers to the subset of analytics that deals with –
- Unstructured data
- Massive data sets
- Data being generated at very high velocity
Next up, we’ll take a look at the different types of analytics, beginning with Descriptive Analytics. While the dynamic pricing used by airlines is a very good example, let’s use an instance from the retail world to illustrate it better.
When you visit your local supermarket, there are certain items you buy every single time – staples like milk, bread, fruit etc. Each time you buy an item, say a litre of milk, that information is recorded in the store’s inventory system. Over a period of a few months, the store can analyze this data to see how much milk has been sold, which in turn gives the retailer very interesting insight into customer buying patterns and helps it manage supplies.
This is, quite simply, descriptive analytics. All the information is coming from the customer’s shopping basket and that data is just being sliced and diced and looked at from different angles in order to draw relevant conclusions. Descriptive analytics is the simplest form of data analysis, as it can only be used to analyze data from the past. Due to this limitation and the ease with which it can be learnt, it is often considered unglamorous. However, it remains a very powerful tool for retail, sales, marketing and much more.
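The “slicing and dicing” described above can be sketched in a few lines of Python, using hypothetical point-of-sale records: the same data is aggregated along two different dimensions.

```python
from collections import defaultdict

# Hypothetical point-of-sale records: (month, product, units sold)
transactions = [
    ("Jan", "milk", 120), ("Jan", "bread", 95),
    ("Feb", "milk", 130), ("Feb", "bread", 90),
    ("Mar", "milk", 145),
]

# Slice the same data two ways: total units per product and per month.
by_product = defaultdict(int)
by_month = defaultdict(int)
for month, product, units in transactions:
    by_product[product] += units
    by_month[month] += units

print(dict(by_product))   # {'milk': 395, 'bread': 185}
```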
Let’s use another illustrative example from the local supermarket. Have you ever wondered why milk is always at the far end of a store while magazines and chewing gum are at the checkout counter? Through extensive analysis, retailers realized that while walking all the way to pick up your essentials, you might be tempted to buy something else as well. Plus, gum and magazines are cheap impulse buys, so you’re unlikely to think twice before putting them in your basket just before you ring up your purchases.
This, in a nutshell, is how predictive analytics works – by identifying patterns in historical data and then using statistics to make inferences about the future. We essentially attempt to fit the data into a certain pattern and if the data is following that pattern, we can predict what will happen in the future with some certainty.
If a person buys product A, retailers want to know if he or she is also likely to buy product B or C. Understanding this relationship between products is called product affinity analysis or association analysis. Predictive analytics is widely used across multiple industries such as retail, telecom, pharmaceuticals and more, as a way for companies to optimize their business practices.
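A minimal sketch of association analysis, assuming a handful of made-up shopping baskets, is simply counting how often each pair of products appears together.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets; count how often each pair of products co-occurs.
baskets = [
    {"milk", "bread", "gum"},
    {"milk", "bread"},
    {"milk", "magazine"},
    {"bread", "gum"},
]

pair_counts = Counter()
for basket in baskets:
    # Sort so the same pair always gets the same key, e.g. ('bread', 'milk').
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts[("bread", "milk")])   # 2
```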
Prescriptive Analytics is something that came into existence only about a decade ago and has since proven to be a very popular and powerful tool with businesses around the world. It can be used to analyze data in the present as well as predict what might happen going forward. Most significantly, it provides insights into what steps should be taken based on the available data and what the impact of these decisions would be. This tremendous versatility places Prescriptive Analytics at the cutting edge of analytics technology.
An excellent example of an industry where this technique is extensively used is aviation. Airline companies are always looking for ways to optimize their routes for maximum efficiency, which can help them save billions of dollars. The sheer numbers in the industry are staggering – globally, there are over 50 million commercial flights every year, which works out to more than one flight every second, and even a simple route like San Francisco to Boston has 2,000+ possible route options.
And so, the industry is constantly using Prescriptive Analytics to identify more efficient ways to operate, which can keep airline costs down and profits up.
Let us now take a look at some of the topics that have gained a lot of prominence in recent times. We will begin with a term that you have almost certainly heard before – Machine Learning.
Simply put, Machine Learning refers to the ability of computers to learn without explicit programming or human interference. A machine runs statistical analysis on vast amounts of data, observes underlying patterns and teaches itself to perform tasks that are otherwise very hard to program. The machine gets smarter with each step and over time the rules it builds for itself become increasingly complex.
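As a minimal illustration of learning without explicit rules, here is a one-nearest-neighbour classifier in Python: no classification rule is hand-coded; predictions come entirely from the (made-up) labelled examples.

```python
# A minimal learning-from-data sketch: a 1-nearest-neighbour classifier.
# The "model" is just the labelled training examples themselves.
train = [((1.0, 1.0), "small"), ((1.2, 0.8), "small"),
         ((6.0, 5.5), "large"), ((5.8, 6.2), "large")]

def predict(point):
    # Return the label of the closest training example (squared distance).
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(train, key=lambda ex: dist(ex[0], point))[1]

print(predict((1.1, 0.9)))   # small
print(predict((6.1, 6.0)))   # large
```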
Even though the concept of Machine Learning is as old as the computer itself, it has had a recent resurgence in popularity for two main reasons. Firstly, computing power is far more sophisticated and can support ML technology more ably. Secondly, the massive amount of data available to us today makes this the ideal environment for Machine Learning to thrive. An algorithm learns through data, and so the more data that is available, the better it can fine-tune its internal logic and produce superior results. As ML technology continues to advance, it can only be beneficial for humans, particularly in the business space.
For a long time, statistical algorithms have dominated the world of Business Analytics. So much so that about a decade ago, almost 90% of all analysis was done using statistical algorithms. Regression techniques like logistic and generalized linear models (or GLM) were the most popular algorithms for data scientists.
Back then, on the rare occasion that neural networks or random forests were used for a project, it would generate a fair amount of buzz across the entire analytics team. Since then, Machine Learning technology has blossomed and its algorithms have become more and more ubiquitous. Of the most popular algorithms in 2017, more than half are used primarily in Machine Learning –
- Logistic regression
- OLS regression
- Decision trees
- Time series
- Random forests
- Neural networks
- Support vector machines
- Principal component analysis
- K-means clustering
- Monte Carlo simulation
- Boosting and bagging
All this is to say that if you are serious about pursuing Data Science today, focusing on mastering Machine Learning techniques will serve you well as the technology continues to evolve and become even more prominent in the industry.
Deep Learning and Artificial Intelligence
While on the topic of the future of analytics, it’s only natural that we examine two more terms that are bound to increase in popularity as we move forward – Artificial Intelligence and Deep Learning.
Artificial Intelligence is the overarching concept of developing machines to perform tasks that would normally require human intelligence. While it has existed as a concept since the 1950s and 60s, it has only really taken off since 2015. This is largely due to the wide availability of GPUs that allow for much faster, cheaper and more powerful parallel processing. The simultaneous sharp increase in available data and practically unlimited storage has also contributed significantly.
Deep Learning, on the other hand, is a subset of Machine Learning that focuses on artificial neural networks. Deep Learning algorithms are generally far more accurate than other ML algorithms, especially for functions like speech recognition (used in services like Siri and Alexa) and image and video recognition (implemented in self-driving cars). Given their scope for advancement, Deep Learning and AI are considered to be the future of analytics, and so it is important for us to understand these concepts and not get left behind in a fast-changing world.
Analytics Use Cases
Given how widely data is used in the world today, there are naturally going to be more than a few very interesting use cases highlighting how analytics has been used effectively by businesses. A great example of this is the immensely popular show House of Cards and how Netflix used data to create the show.
They started by collecting massive amounts of data from their users, such as the kinds of shows and movies they watched and what actors and directors were the most popular. After crunching the numbers, they were able to identify that director David Fincher was among the most popular, as was actor Kevin Spacey. Interestingly, they also found that the original British version of House of Cards was very well received by these same users. The numbers weighed very heavily in favour of combining these three elements (the show, Spacey and Fincher), which prompted them to buy the rights for the show and create an American version, which released in early 2013.
Did all this data work? The answer is a resounding yes! The show proved to be a massive hit, and Netflix gained around 2 million new subscribers in the first quarter of 2013 in the US alone. A further 1 million new users came from other markets around the world, and this influx of new subscribers helped recoup the production costs of the show. This just goes to show the power of data and how it can help companies make transformational business decisions. It’s little wonder that analytics is becoming integral to commercial decisions taken by companies around the world.
Now that we’ve had a look at the very basics of analytics – the popular terms, different techniques and use cases – let’s dive a bit deeper into the technical side of things. Specifically, we’ll be examining the technology landscape in analytics, where we will take a look at the popular tools and technologies being used in the space today, along with the strengths and limitations of each.
Analytics is a field that exists at the intersection of business and technology. More often than not, analytics is performed in the context of a business problem or opportunity, which is why business understanding is crucial for data scientists to be at their most effective. At the same time, having the right tools and platforms is an equally important prerequisite.
What is interesting in all this is the extremely dynamic nature of the technology landscape in analytics. Scores of new tools have been developed in the last 10 years or so, even as existing tools have continued to advance and evolve. Many popular platforms, such as data warehousing systems for unstructured data, have been developed fairly recently. And then we have to consider the integration of various tools with different platforms. All in all, it is a very fluid landscape – once you complete this article, you will be well placed to identify the analytics tools that match your organizational needs.
Data Retrieval Tools
Data retrieval refers to the process of obtaining or extracting data from a data warehouse. Typically, an organization’s database will contain a vast amount of data, but only a small subset will be required for analysis. For an analytics project, the first step would be to retrieve the relevant data from the data management system.
SQL (Structured Query Language) is one of the most popular tools for data retrieval. By entering the conditions for the required data, analysts can easily extract just a small, relevant subset of the larger database. However, this only works for structured data, whereas unstructured data exists in what is called NoSQL databases. Extracting data from NoSQL requires a slightly different approach and data scientists often rely on a mechanism called MapReduce.
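The MapReduce idea can be sketched on a single machine in Python: a map step emits key–value pairs, the framework groups them by key, and a reduce step aggregates each group. Here it is applied to a word count over two tiny made-up documents.

```python
from collections import defaultdict

documents = ["big data big ideas", "data science"]

def map_phase(doc):
    # Map: emit one (word, 1) pair per word in the document.
    return [(word, 1) for word in doc.split()]

# Shuffle: group the emitted values by key.
grouped = defaultdict(list)
for doc in documents:
    for key, value in map_phase(doc):
        grouped[key].append(value)

# Reduce: aggregate the values for each key.
counts = {key: sum(values) for key, values in grouped.items()}
print(counts["big"], counts["data"])   # 2 2
```

In a real cluster the map and reduce steps run in parallel across many machines; the logic per key is the same.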
Another popular extraction technique is data scraping (also called data harvesting). This refers to a wholesale collection of information from web pages. It is essentially a form of copying, in which specific data is gathered and copied from the web into a central database for later analysis. Popular tools for data scraping include Import.io as well as Scrapy, an open source tool written in Python, which is particularly useful for programmers.
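The parsing half of a scraper can be sketched with Python’s standard-library HTMLParser, run here on a made-up page snippet; a real scraper such as Scrapy would also handle downloading, crawling and rate limiting.

```python
from html.parser import HTMLParser

# A made-up snippet standing in for a downloaded web page.
page = "<ul><li>Milk - 42</li><li>Bread - 31</li></ul>"

class ItemParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = []

    def handle_data(self, data):
        # Extract "name - count" pairs from the text inside tags.
        if "-" in data:
            name, _, count = data.partition("-")
            self.items.append((name.strip(), int(count)))

parser = ItemParser()
parser.feed(page)
print(parser.items)   # [('Milk', 42), ('Bread', 31)]
```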
In the next few sections of this post, we will take a closer look at the tools for statistical and exploratory data analysis. As a data scientist, these will be your “go-to” tools and they will be the ones you use for the majority of your analytical work. And so, it’s vital to be aware of the various options available to you.
Tools for Analysis: SAS
For a very long time, SAS was the dominant analytical tool used by data scientists everywhere. SAS stands for ‘Statistical Analysis System’ and is developed by the SAS Institute, a privately held company. It is a highly versatile tool that can do a lot with data, including mining, manipulation, management and sophisticated statistical analysis.
Given its widespread use and popularity, there are naturally plenty of advantages to the tool, its versatility being first and foremost. Moreover, it is very robust and can handle even very large volumes of data without any problems. It is also fairly easy to learn – indeed, many people find it an easier language to grasp than R or Python.
The biggest drawback with SAS is the price. It is a paid tool and while that does mean better support, it can be prohibitively expensive for individuals. You may also have to pay to use the tool for additional tasks, over and above the regular modules. Due to the cost factor, SAS is primarily used by larger organizations such as banks and healthcare corporations. While many companies are beginning to move towards the use of R, especially with the growing importance of Machine Learning, SAS remains a popular and highly effective tool in the industry at large.
Tools for Analysis: R
Over the last few years, R has become the most widely used tool in analytics. It became more robust and versatile over time and the fact that it is open source has only served to boost its popularity.
R and all its modules and packages – from visualization and time series to Machine Learning – are completely free. It integrates well with open source platforms for Big Data and can therefore be used to perform analysis on very large datasets. The programming is done in the R language, which is fairly compact and easier to learn than other computing languages.
Since the tool is open source, there is a whole community of developers worldwide, constantly working to evolve and improve it. If any of the packages falls slightly below the required standard, it will quickly evolve into a more robust product through a collaborative effort. Due to all this and the fact that it is free of cost, there are very few areas (if any) in which R falls short. The only potential disadvantage could be that the language is slightly more difficult to learn than SAS. Nevertheless, the benefits far outweigh the drawbacks.
Tools for Analysis: Python
Python is a high-level programming language designed for simple syntax and concise code. Since 2003, it has consistently ranked among the top 10 programming languages in the world, and with libraries like NumPy and SciPy, it has become very popular with data scientists as well.
When stacked up against R, Python has many of its own benefits. For one, it is a better choice than R when it comes to data manipulation whereas the latter is better for ad hoc analysis. And while R is more popular among analysts and data scientists, Python is preferred in an engineering environment.
On the whole, the benefits of Python are many. It is easy to learn, as the code is easy to write, read and debug. It works well for any kind of web integration, and its speed lets it perform routine data mining tasks very quickly. All that being said, it still has minor drawbacks, such as being less versatile than R when it comes to statistical analysis. And while it doesn’t enjoy the same level of technical support online, it has become widely popular in its own right.
Tools for Analysis: SQL
As we saw earlier, SQL refers to 'Structured Query Language' – a domain-specific language designed for working with relational database management systems (RDBMS). It is a great language for mining and manipulating data and for performing aggregations, pivots and the like. While it has its limitations, particularly for sophisticated statistical analysis, it remains widely popular among data scientists due to the speed at which it can handle data.
That being said, there are a couple of reasons why most people use other analytics tools, either alongside or instead of SQL. SQL assumes the data is structured, but not all data fits neatly into rows and columns. To get around this, a data scientist may first apply other methods to unstructured data to impose structure on it, and then use SQL to perform the analysis. SQL also has a few other limitations, particularly when it comes to Machine Learning and predictive analytics.
There are still many advantages to using SQL, such as the easy availability of talent. Its 'English-like' syntax also makes it easy to learn. However, due to its limitations with unstructured data (something that is becoming increasingly important), SQL is likely to be used in conjunction with other tools for some time to come.
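To make the aggregation point concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a full RDBMS (the table and figures are invented for illustration):

```python
import sqlite3

# In-memory database standing in for an RDBMS; the call-centre data is made up
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (region TEXT, duration REAL)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [("North", 3.2), ("North", 4.8), ("South", 2.1), ("South", 5.5), ("South", 1.4)],
)

# A typical aggregation: call count and average duration per region
rows = conn.execute(
    "SELECT region, COUNT(*), AVG(duration) FROM calls GROUP BY REGION ORDER BY region"
).fetchall()
print(rows)
```

Note how the 'English-like' query reads almost as a sentence – this is the accessibility the paragraph above describes.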
Let us now take a closer look at the specialized tools that are ideal for Machine Learning.
Machine Learning tools – Python
As we saw earlier, Python is a widely used and popular programming language, and this is especially true for professionals who work on Machine Learning systems. One of the most commonly cited reasons is the syntax of Python, which has been variously described as "elegant" and "maths-like". This makes it a lot more accessible and, coupled with the general ease of learning the language, adds to its wide appeal.
Other users have pointed out that particular tools make Python an easy language to use. Some cite its array of frameworks and libraries, along with extensions like NumPy, which make the language a lot easier to work with. When compared to other languages such as Java, Ruby on Rails, C and Perl, Python has even been described as a "toy language" due to its ease of use, even for basic users.
As a general-purpose language, Python can do a lot of things, which helps with complex Machine Learning tasks. Due to its broad appeal, there is also a large support community for the language online, which also helps broaden its appeal further.
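As a small worked example of the kind of task this supports, here is a sketch of fitting a simple linear model by ordinary least squares with NumPy – a hand-rolled stand-in for what libraries such as scikit-learn automate, on made-up data:

```python
import numpy as np

# Tiny synthetic dataset: y is roughly 2*x + 1 with a little noise
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.0, 8.9])

# Fit y = slope*x + intercept by ordinary least squares
A = np.vstack([x, np.ones_like(x)]).T
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

# Use the fitted model to predict an unseen point
predicted = slope * 5.0 + intercept
```

The "maths-like" quality mentioned above is visible here: the code mirrors the textbook formulation of the model almost line for line.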
Machine Learning tools – TensorFlow, Keras, Spark
TensorFlow (TF) is an ecosystem for developing deep learning models. It is a Python library developed by Google for large-scale numerical computation, used to build complex Machine Learning models from scratch. Since it is open source and backed by Google, it is one of the most popular deep learning libraries and has a very vibrant online community. However, TF is not the easiest technology to use.
Keras, on the other hand, is a high-level API built on top of TF, which is comparatively easy and user-friendly. It is used to build simple or complex neural networks with minimal coding, and it is far easier to debug than raw TensorFlow, though it exposes less low-level control.
Apache Spark is known as a fast and easy-to-use engine for big data processing that has built-in modules for streaming, SQL, Machine Learning and graph processing. Spark’s general ML library is designed for simplicity, scalability and easy integration with other tools to help with faster iteration of data problems. To address the limitations of traditional tools and techniques (such as processing of data on a single machine), Spark provides a powerful, unified engine that is faster and easier to use.
With the numerous language choices it provides – Scala, Java, Python and R – and given its versatility and efficiency, it's unsurprising that Spark is a very popular choice among Machine Learning experts.
Other Machine Learning tools
Apart from the popular tools mentioned above, there are many other tools and libraries available for Machine Learning and Deep Learning. Among these, Matlab has been around for a long time and has been used extensively by researchers and academics. Despite its popularity, Matlab's potential applications in Machine Learning aren't widely known. In recent times, it has developed capabilities in Machine Learning, Deep Learning, image recognition and sensor data processing, and it is well equipped to handle new-age data.
PyTorch is another popular tool and is very similar to TensorFlow. Both are Python-based and used extensively in the professional space, most notably by massive tech companies like Google (which prefers TensorFlow) and Facebook (which favours PyTorch). A good example of a non-Python tool, on the other hand, is Caffe, which is written in C++ and offers Matlab and Python interfaces. A lighter, more computationally efficient successor, Caffe2, is also available.
To sum up, if you are looking for a versatile and flexible solution for your Machine Learning and Deep Learning needs, TensorFlow is the ideal choice. If you deal with Big Data, Apache Spark's MLlib library in combination with R or Python could work for you. If you are a Matlab user, then the Caffe or Caffe2 frameworks are the best option.
BI and Visualization Tools
Because of the way the human brain processes information, using charts or graphs to visualize large amounts of complex data is far more impactful than poring over spreadsheets or reports – it’s a great way to convey concepts in a universal manner. In this section, we will take a look at the popular tools that are used to create data visualizations.
Tableau is an interactive tool that enables you to create visualizations in the form of dashboards or worksheets, which can help you draw unique business insights for the betterment of your company. Even people without very high technical ability can use it to create meaningful dashboards. The most eye-opening aspect of using Tableau is how it unearths a lot of insights that wouldn’t have been apparent otherwise.
Qlikview is another visualization tool, in direct competition with Tableau. However, Tableau is more feature-rich and is preferred by businesses, whereas Qlikview is a more popular choice in the IT space. Microsoft's Power BI is another popular tool. It is similar to Excel, much cheaper than Tableau and hence a popular choice among many organizations. Other tools, such as Alteryx and Spotfire, while not without their limitations, are also excellent alternatives to the more popular options.
At the end of the day, it comes down to which team needs to use the tool. Tableau is best suited to business needs, whereas Qlikview would work better for an IT team. Power BI is similar to them both, but is mainly just a cheaper alternative.
Next, we move on to the topic of Excel, which is by far the most popular analytics tool in the world today. There are over a billion Microsoft Office users worldwide, and it's safe to assume that most of them use Excel. It is extremely easy to use, has plenty of features and great visualizations, and is GUI (Graphical User Interface) based – all of which contribute to its widespread popularity.
Despite all these advantages, Excel does have certain drawbacks. For instance, it can't handle very large datasets or perform sophisticated statistical analysis and Machine Learning processes. However, add-ins have been developed to help Excel overcome these limitations. For example, Excel Solver is especially useful for optimization and "what-if" problems, such as analyzing marketing budgets under different scenarios. Typical problems solved using Solver include linear programming, quadratic programming and risk analysis.
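The "what-if" budget analysis described above can be sketched in plain Python. This brute-force search is only an illustration of the kind of allocation problem Solver handles with more sophisticated methods; the budget, channels and return model are all assumptions made up for the example:

```python
# Brute-force "what-if" analysis: split a fixed marketing budget between two
# channels and pick the split with the highest assumed return.
# All figures here are hypothetical.

BUDGET = 100  # total budget, in thousands

def estimated_return(tv, online):
    # Assumed diminishing-returns model for each channel
    return 3.0 * tv ** 0.5 + 4.0 * online ** 0.5

# Try every whole-number split of the budget and keep the best one
best = max(
    ((tv, BUDGET - tv) for tv in range(BUDGET + 1)),
    key=lambda split: estimated_return(*split),
)
```

Solver solves the same class of problem without enumerating every scenario, which is what makes it practical for larger models.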
RExcel is an Excel add-in that allows access to the R statistics package within Excel. This is loaded with a number of excellent features that expand Excel’s capabilities. With this, you can:
- Transfer data between R and Excel in both directions
- Run R code directly from Excel ranges
- Write macros calling R to perform calculations without exposing the user to R
Another very user-friendly, comprehensive and affordable statistical software package is XLSTAT. Its features include data visualization, statistical modeling and data mining. The highlight, however, is its Machine Learning capabilities, which take Excel to a completely new level.
While Excel is quite versatile in itself, it may require certain add-ins to perform beyond its native capabilities. Depending on what your analytics needs are, you can identify and choose the right one to use.
GUI-based analytics tools
A common problem many people face when moving into analytics is that they are not familiar or comfortable with programming languages. There is no real requirement for such expertise when working on the business side of things, yet many of the most powerful tools (such as R, Python and SAS) do require programming. However, there are alternatives – Graphical User Interface (GUI) based tools – available to anyone who wants to do sophisticated analyses without having to learn a new language.
Let us begin with one of the most widely used analytics tools, R. While the in-built console in the tool is the most popular, the graphical interfaces such as RStudio, RCommander and Rattle are also used quite extensively, with RStudio being the most popular of the lot. While they do have certain limitations, these interfaces are all excellent ways for non-programmers to use R’s immense statistical capabilities.
SAS also has a GUI-based alternative called SAS Enterprise Miner (E-Miner), an extremely powerful tool with a highly intuitive interface. However, its adoption is limited by the fact that it is very expensive. Other options for non-coders include popular tools we have already discussed earlier in this book – visualization software like Tableau, as well as Excel with its numerous versatile add-ins. At the end of the day, while programming knowledge is beneficial in analytics, it is by no means essential.
Big Data Introduction
Before we get into the types and uses of Big Data tools, let us quickly revisit the concept in a little more detail. There is no complete consensus in the world of analytics about the exact definition of the term so we will stick to one that is easy to understand.
As we saw towards the beginning of this post, Big Data refers to the technologies that relate to massive volumes of data, which businesses are trying to fully connect and understand. For a simple example, consider Airtel, a telecom mammoth. Over 2 billion calls are made by Airtel users daily and the sheer scale of this data is immense. Traditional analytical tools can’t contend with this volume of information and so Airtel would rely on Big Data tools and solutions to get the most out of their data.
Among these Big Data tools, you would need a cloud-based platform to store all the data. Next comes a system where you can store the data in a distributed environment and finally, the tools you would use for the analysis. All of these together form the Big Data ecosystem of an organization. Tech giants such as Amazon and Google provide cloud services (Amazon Web Services and Google Cloud) which have similar features and prices and these are typically the most common choices for companies.
Big Data ETL and Storage
When it comes to the ETL (Extract, Transform and Load) and storage options for Big Data, there are numerous options available that data scientists can choose from.
Hadoop is a name you’re likely to have heard before. It is a very popular open source platform that supports distributed computing. There are several Hadoop-based options for ETL and storage such as Azure and Amazon each of which has its own data lakes.
When it comes to NoSQL databases, the most common examples are MongoDB, Cassandra and HBase, and each has its own unique strengths. HBase is useful for handling sparse datasets, such as identifying 100 items out of a set of a billion. Cassandra offers SQL-like querying via its own language (CQL), which makes it easier for developers used to working with traditional databases. It is best used when your data does not fit on a single server and you need a NoSQL database but want a more familiar query interface.
If you are analyzing data that does not change much and calls for pre-defined queries, CouchDB is a strong option. On the flip side, if the data is more dynamic, MongoDB is the ideal choice. Ultimately, there are numerous options available, each with its own unique capabilities, and choosing the right database should come after you have correctly identified the business needs of your organization.
Big Data Tools
And now we move on to the final piece of the Big Data puzzle – the tools and algorithms. Similar to the databases, there are numerous options available for the data analysis process:
- Mahout: A collection of algorithms that enable Machine Learning functions on Hadoop. It is very useful for clustering, classification and collaborative filtering, which makes it a popular choice in e-commerce and retail.
- Impala: It’s a query engine that allows Big Data specialists to perform analytics on data stored on Hadoop via SQL or other Business Intelligence tools.
- H2O: Written in Java, R and Python, it can perform various algorithms like regression, clustering and neural networks on Big Data.
- Spark: Another popular option for large-scale batch processing, stream processing and Machine Learning.
In addition to these, Apache Flink and Google BigQuery are also popular options for Big Data analysis.
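Under the hood, engines like Hadoop and Spark generalize a map/shuffle/reduce pattern across many machines. A single-machine Python sketch of that pattern – counting words across a handful of toy documents – looks like this:

```python
from collections import defaultdict

# Toy "documents"; in a real cluster these would be partitions of a huge dataset
documents = ["big data tools", "big data big impact", "data engines"]

# Map: emit (word, 1) pairs from each document
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted values by key
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word
word_counts = {word: sum(counts) for word, counts in groups.items()}
```

The distributed engines add fault tolerance and parallelism on top of this basic pattern, but the map, shuffle and reduce stages are the same.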
With this, we wrap up our comprehensive look at the technological landscape in the analytics space. This section of the book ought to have given you a deeper sense of the immense range of tools and software available for data analytics, as well as their capabilities.
Over the course of this book, we have covered a very broad range of topics. We have looked at what data is, its different forms and how each one is dealt with, as well as the different tools used in data science and their capabilities.
We are sure the book has given you substantial insights into what constitutes the world of data analytics and what you can expect to face as you dive deeper into the subject matter.