Advantages of Apache Spark™
Yes, Hadoop is the ultimate store for all your semi-structured data within HDFS, and yes, you can query all of that data using MapReduce. Hadoop is indeed the backbone of Big Data analytics, and today many companies use it to do wonders with their data.
However, Apache Spark, an open-source cluster-computing framework for data analytics, is pretty awesome as well. It was developed in the AMPLab at UC Berkeley in 2009 and open-sourced in 2010. Since then it has grown into one of the largest open-source communities in Big Data, with over 200 contributors from 50+ organizations. In 2013, Cloudera began offering commercial support for Apache Spark.
The numerous advantages of Apache Spark make it a very attractive big data framework. In this blog we will talk through the many advantages of using Spark. The video above by the creator of Spark, Matei Zaharia, will also give you a lot more information about “How Companies are Using Spark, and Where the Edge in Big Data Will Be”.
Integration with Hadoop
Spark’s framework is built on top of the Hadoop Distributed File System (HDFS), so it is advantageous for those who are already familiar with Hadoop.
Spark starts with the same concept of being able to run MapReduce jobs, except that it first places the data into RDDs (Resilient Distributed Datasets), which can be cached in memory. Because the data is then accessed in memory rather than read from disk, the same MapReduce jobs can run much faster.
Real-time stream processing
Every year, the volume of real-time data collected from various sources grows rapidly, and being able to process and manipulate that data as it arrives becomes increasingly valuable. Spark lets us analyse real-time data as and when it is collected.
Applications include fraud detection, electronic trading data, and log processing in live streams (e.g. website logs).
Apart from stream processing, Spark can also be used for graph processing. From advertising to social data analysis, graph processing captures the relationships between entities in data, say people and objects, which are then mapped out. This has contributed to recent advances in machine learning and data mining.
Today, companies often maintain two different systems to handle their data, and hence end up building separate applications for each: one to ingest and store real-time data, and another to manipulate and analyse it. This costs both storage space and computation time. Spark gives us the flexibility to implement both batch and stream processing of data in a single framework, which allows organisations to simplify deployment, maintenance and application development.
Hence, you can write sophisticated parallel applications quickly in Java, Scala, or Python without having to think only in terms of “map” and “reduce” operators. This makes Spark well suited to machine learning algorithms.
Spark is known as the Swiss army knife of Big Data analytics. It is popular for its speed, its support for iterative computing and, most importantly, its caching of intermediate data in memory for faster access.