Who wins the battle between Hadoop and Spark?


Distributed computing, or parallel processing, has been around for decades. Programming languages forked into two major categories, namely functional and imperative (procedural) styles. For quite a long time, however, distributed computing was confined to defense and university research labs, and so stayed largely underground. Then the BIG DATA monster began yelling loud enough for everyone to hear, begging that these underground technologies be opened to the public. That’s when HADOOP and a bag of associated tools and technologies came to our rescue and got popular. This was around 2010, and it was excitedly embraced by several companies eager to enter the Big Data fray (although Yahoo was already running HADOOP internally, and Google had long used its own MapReduce system, which inspired it).

The word HADOOP, which is a BUZZWORD right now, also echoes the word MAP-REDUCE in our minds. That said, HADOOP is a platform: a bunch of technologies and tools built to help us tackle problems in the BIG DATA arena. MAP-REDUCE was the fundamental framework around which many of HADOOP’s tools were built (PIG, SQOOP, HIVE etc.). MAP-REDUCE itself is a framework that natively supports the JAVA programming language.
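To make the MAP-REDUCE idea concrete, here is a minimal sketch of the classic word-count job. It is written in plain Python rather than HADOOP's JAVA API, and it runs on one machine, so treat it as an illustration of the model only. The function names (map_phase, shuffle, reduce_phase, word_count) are made up for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "spark and hadoop"])
print(counts["big"])  # 2
```

In a real cluster the map and reduce steps run on many machines in parallel, and the framework handles the shuffle between them. The three-phase shape stays the same.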

HADOOP and JAVA MAP-REDUCE help to solve a lot of problems in BIG DATA processing, like data analysis and data transformation, but all of it applies only to BATCH PROCESSING of jobs (non-interactive, not real-time analysis). It did wonders when no better technology existed. However, in technology, one guy ruling the world for a long time seldom happens. The same is the case with the MAP-REDUCE framework of HADOOP.

Big Data problems fast evolved from batch-processed report generation to REAL-TIME interactive analytics on extremely large datasets. Suddenly MAP-REDUCE seemed quite slow, and that’s when a new technology called “APACHE SPARK” emerged.

SPARK is another technology and set of tools on the HADOOP platform. It is not a replacement for HADOOP, but to a large extent it did replace the MAP-REDUCE approach. Perhaps 70% of the MAP-REDUCE use cases were replaced by APACHE SPARK and SCALA, because the latter was better and faster. There is quite a bit of research reporting that moving to SPARK improved speed by 100x in some cases, 10x in most cases and at least 3x in all cases when compared to MAP-REDUCE.

SPARK is the ideal framework for iterative jobs, which are at the core of machine learning algorithms and which MAP-REDUCE cannot handle efficiently. SPARK can handle batch, interactive and iterative jobs, whereas MAP-REDUCE can handle only BATCH jobs, and those more slowly than SPARK.

Having praised SPARK over MAP-REDUCE, we should address a common misconception: that SPARK demands hardware-intensive resources (high on CPU and memory) because it is all about IN-MEMORY computation. In fact, IN-MEMORY computation is just one of SPARK’s features, and it can be selectively turned on. Insights from the industry suggest that SPARK gives 3x to 10x performance improvements over MAP-REDUCE on the existing hardware for the same set of use cases.
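The "selectively turned on" point can be sketched with a toy model. SPARK exposes caching through cache() and persist() calls on its datasets. The code below does not use SPARK at all: Dataset, iterative_mean and load_numbers are hypothetical names invented for this illustration, and the "disk read" is simulated by a plain function call. It only shows why an iterative job benefits when caching is opted into, and that nothing is held in memory unless you ask for it.

```python
class Dataset:
    """Hypothetical stand-in for a dataset whose source is slow to read."""

    def __init__(self, loader):
        self._loader = loader    # function that simulates a disk read
        self._cached = None      # in-memory copy, filled only if caching is on
        self._use_cache = False  # caching is off by default
        self.loads = 0           # counts how often the source was read

    def cache(self):
        """Opt in to keeping the data in memory after the first read."""
        self._use_cache = True
        return self

    def read(self):
        """Return the data, reading the source again unless a copy is cached."""
        if self._use_cache and self._cached is not None:
            return self._cached
        self.loads += 1
        data = self._loader()
        if self._use_cache:
            self._cached = data
        return data


def iterative_mean(dataset, iterations):
    """Toy iterative job: move an estimate halfway toward the mean each pass."""
    estimate = 0.0
    for _ in range(iterations):
        data = dataset.read()  # every pass needs the full input again
        mean = sum(data) / len(data)
        estimate += 0.5 * (mean - estimate)
    return estimate


def load_numbers():
    """Simulated slow source; a real job would read from disk here."""
    return [1.0, 2.0, 3.0, 4.0]


uncached = Dataset(load_numbers)
cached = Dataset(load_numbers).cache()

result_uncached = iterative_mean(uncached, 10)
result_cached = iterative_mean(cached, 10)

print(uncached.loads, cached.loads)  # 10 1
```

Both runs compute the same answer. The only difference is how many times the input is re-read, which is where batch-style processing loses time on iterative workloads.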

The final verdict is that using SPARK over MAP-REDUCE is a clear WIN-WIN in any given scenario. APACHE SPARK is the new kid on the block, and a very promising kid, who will eventually grow up to be the NINJA of Big Data, at least until it confronts another SAMURAI!
