What Makes Data Big?
Last month we shared an excerpt of the first chapter of an eBook published by Jigsaw Academy’s Big Data Team, titled ‘Beginner’s Guide to Big Data Analytics’. The book aims to give readers an understanding of Big Data and what makes data big, while also explaining in simple language the challenges of Big Data, the emerging technologies and the Big Data landscape. It also gives valuable insights into careers in Big Data and a glimpse into what the future could hold for the industry, and is a useful companion for those of you enrolled in Jigsaw’s Big Data course.
To download the full eBook for free, please go to the free trial section of our website, www.jigsawacademy.com. There you can download this eBook and also get access to our other free resources, eBooks and reports. You can also get a first-hand feel of our courses, as we have loads of free course videos and class recordings.
The first chapter, ‘What is Big Data’, helped us understand the concept of Big Data in today’s business scenario. Today we would like to share the next chapter, ‘What Makes Data Big?’.
We live in the era of big data, and it is leaving no industry untouched, be it financial services, consumer goods, e-commerce, transportation, manufacturing or social media. Across all industries, enterprises now have access to an increasing number of both internal and external big data sources. Internal sources typically track information around demographics, online or offline transactions, sensors, server logs, website traffic and emails. This list is not exhaustive and varies from industry to industry. External sources, on the other hand, are mostly related to social media: online discussions, comments and feedback shared about a particular product or service. Another major source of big data is machine data, which consists of real-time information from sensors and web logs that monitor customers’ online activity. In the coming years, as we continue to develop new ways of generating data online and offline by leveraging technological advancements, the one safe prediction we can make is this: the growth of big data is not going to stop.
Although big data is, at a high level, about data being captured from multiple sources and about sheer size, there are several technical definitions that provide more clarity. The O’Reilly Strata group states that “Big data is data that exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or doesn’t fit the structures of your database architectures”. In simple terms, big data needs multiple systems working together, rather than a single system, to handle and process it efficiently. Say an online e-commerce enterprise operating in a single region generates about 1,000 gigabytes of data daily, which can be handled and processed using traditional database systems. On expanding operations to a global level, its daily data generation increases 10,000 times to 10 petabytes (1 petabyte = 1,000,000 gigabytes). Traditional database systems do not have the capabilities required to handle this kind of data, and enterprises need to depend on big data technologies such as Hadoop, which uses a distributed computing framework. We will learn more about these technical topics in subsequent chapters.
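The arithmetic in the e-commerce example can be checked with a few lines of Python. This is only a sketch: the variable names are our own, and the figures are the illustrative ones from the paragraph above, using decimal units (1 petabyte = 1,000,000 gigabytes).

```python
# Decimal (SI) storage units, expressed in gigabytes.
GB_PER_TB = 1_000
GB_PER_PB = 1_000_000

# Illustrative figures from the e-commerce example (not real measurements).
regional_daily_gb = 1_000   # daily data when operating in a single region
growth_factor = 10_000      # increase after expanding globally

global_daily_gb = regional_daily_gb * growth_factor
global_daily_pb = global_daily_gb / GB_PER_PB

print(f"Regional: {regional_daily_gb / GB_PER_TB:g} TB per day")
print(f"Global:   {global_daily_pb:g} PB per day")
```

The regional figure works out to 1 terabyte per day and the global figure to 10 petabytes per day, which matches the numbers quoted in the example.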
To further simplify our understanding of big data, we can rely on its three major characteristics, namely volume, variety and velocity, which are commonly referred to as the 3 V’s of big data. Occasionally, some resources also talk about a less common characteristic, veracity, which is referred to as the 4th V of big data. Together, these 4 characteristics provide more detail about the nature, scope and structure of big data.
Volume deals with the size aspect of big data. With technical advancements in global connectivity and social media, the amount of data generated on a daily basis is growing exponentially. Every day, about 30 billion pieces of information are shared globally. An IDC Digital Universe study estimated that global data volume was about 1.8 zettabytes as of 2011 and would grow about 5 times by 2015. A zettabyte is a quantity of information or information storage capacity equal to one billion terabytes, which is 1 followed by 21 zeroes in bytes. Across many enterprises, these increasing volumes pose an immediate challenge to traditional database systems with regard to the storing and processing of data.
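To put the zettabyte figures in perspective, here is a small Python sketch of the unit conversion. The constant and variable names are our own, and the 2015 value is only a rough figure obtained by reading “about 5 times” as a simple multiplier; it is not a number taken from the IDC study itself.

```python
# Decimal (SI) definitions of the units mentioned above.
BYTES_PER_TB = 10**12   # terabyte
BYTES_PER_ZB = 10**21   # zettabyte: 1 followed by 21 zeroes

# How many terabytes fit in one zettabyte?
tb_per_zb = BYTES_PER_ZB // BYTES_PER_TB

# Rough projection: assumes "about 5 times" means multiplying by 5.
volume_2011_zb = 1.8
growth_multiple = 5
projected_2015_zb = round(volume_2011_zb * growth_multiple, 1)

print(f"1 ZB = {tb_per_zb:,} TB")
print(f"2011: {volume_2011_zb} ZB, rough 2015 projection: {projected_2015_zb} ZB")
```

The first line of output confirms that one zettabyte is one billion terabytes.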
Variety deals with the range of sources and formats of big data. Big data comes from sources such as conversations on social media, media files shared on social networks, online transactions, smartphone usage, climate sensor data, financial market data and many more. The underlying formats of data coming out of these sources vary, from Excel sheets and text documents to audio files and server log files, and can be broadly classified as either structured or unstructured data. Structured data formats typically refer to a defined way of storing data, i.e. clearly marked out rows and columns, whereas unstructured data formats do not have any predefined order and mostly refer to text, audio and video data. Unstructured formats of data are a more recent phenomenon, and traditional database systems do not possess the capabilities required to process this kind of information.
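The difference between structured and unstructured data can be illustrated with a short Python sketch. The sample order table and customer comment below are made up for illustration. The structured data can be queried by column name straight away, whereas the free text first has to be broken into words before even a simple question can be asked of it.

```python
import csv
import io

# Structured: rows and columns with named fields (hypothetical sample data).
structured = io.StringIO("order_id,amount\n1001,250\n1002,75\n")
rows = list(csv.DictReader(structured))
total_amount = sum(int(row["amount"]) for row in rows)

# Unstructured: free text with no predefined fields (hypothetical comment).
comment = "Loved the phone, but delivery took a week. Would buy again."
words = comment.lower().replace(",", " ").replace(".", " ").split()
mentions_delivery = "delivery" in words

print(f"Total order amount: {total_amount}")
print(f"Comment mentions delivery: {mentions_delivery}")
```

Real unstructured data such as audio or video needs far more processing than this simple word split, which is part of why it strains traditional database systems.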
Increased volumes of data have put a lot of stress on the processing abilities of traditional database systems. Enterprises should be able to quickly process incoming data from various sources and then share it with the business to ensure the smooth functioning of day-to-day operations. This quick flow of data within an enterprise is the velocity characteristic of big data. Another important aspect is the ability to provide relevant services to the end user in real time. For example, Amazon provides instant recommendations depending on the user’s search and location. Based on the entered keyword, these services need to search through their entire transaction history and share relevant results, which hopefully convert into a purchase. The effects of velocity are very similar to those of volume, and enterprises need to rely on advanced processing technologies to handle big data efficiently.
Though enterprises have access to a lot of big data, some of it will be missing or inconsistent; this is the veracity characteristic of big data. Over the years, we have learned that data quality issues usually arise from human entry errors or from individuals withholding information. In the big data era, where most data capturing processes are automated, the same issues can occur because of technical lapses such as system or sensor malfunction. Whatever the reason, one should take care to deal with inconsistencies in big data before using it for any kind of analysis.
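As a simple illustration of checking data before analysis, the Python sketch below separates sensor readings that have a usable value from those that do not. The readings, the field name `temp_c` and the helper `split_by_completeness` are all hypothetical; real data quality checks would usually cover much more, such as out-of-range values and duplicates.

```python
# Hypothetical sensor readings; some are incomplete.
readings = [
    {"sensor": "s1", "temp_c": 21.5},
    {"sensor": "s2", "temp_c": None},  # e.g. a sensor malfunction
    {"sensor": "s3"},                  # field was never recorded
    {"sensor": "s4", "temp_c": 22.1},
]


def split_by_completeness(records, field):
    """Separate records that have a value for `field` from those that do not."""
    complete, incomplete = [], []
    for record in records:
        if record.get(field) is None:
            incomplete.append(record)
        else:
            complete.append(record)
    return complete, incomplete


complete, incomplete = split_by_completeness(readings, "temp_c")
print(f"{len(complete)} usable readings, {len(incomplete)} need attention")
```

Keeping the incomplete records in a separate list, rather than silently dropping them, makes it possible to investigate why they are missing.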
Click here to read the chapter What is Big Data.
Interested in a career in Big Data? Check out Jigsaw Academy’s Big Data courses and see how you can get trained to become a Big Data specialist.