On Hadoop and Its Place in Big Data Technologies Today
Working in new and emerging technologies, I am aware that a solution put together today can quickly become out of date as newer and better technologies and techniques keep arriving. I felt the full import of this when I was thinking through the details of how to implement in Spark a solution that we had built in the middle of last year (2017) on the Hadoop platform for a large private bank.
One of the largest private banks in India is a customer of my client. The bank wanted a system that estimates the propensity of its borrowers to default on the monthly loan installment (EMI), specifically borrowers who have defaulted in the last 2 months and in the last 3 months. We provided a Hadoop-based solution. I was responsible for the data engineering part, which included ETL, pre-processing and post-processing of the data, and also for the end-to-end development and deployment of the solution, including setting up an internal Hadoop cluster, with a team of one Hadoop admin and one Hive developer. Two data scientists/analysts, assisted by an SQL developer, developed the models used for scoring/prediction.
We were given access to 4 tables in an instance of the customer’s RDBMS, which contained the loan details, the demographic data of the borrowers, the payment history and the elaborate tracking of follow-up actions that you typically see in financial institutions. The models were arrived at from these after considerable exploration, analysis, testing and iteration.
As for the technology platform, the solution design was in place more or less from the outset, but the timeline got extended due to a couple of change requests and requests for POCs. For example:
- The customer asked us to include the borrower’s qualification as one of the variables. The data in this field was free-form text, so it required a lot of cleansing to make it uniform and turn it into a factor (categorical) variable with a manageable number of levels.
- The data scientists figured out that adding some published indexes, such as the weightage of the borrowers’ employment domains (IT/ITES, auto manufacturing, hospitality), would enrich the model.
- The customer requested a web-based interface for the final output records. We did a POC that provided a web UI to the Hive tables through HiveServer2, though it was later dropped.
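To give an idea of what cleansing a free-text field like qualification involves, here is a minimal Python sketch. The level names, the keyword lists and the function name `normalize_qualification` are illustrative assumptions, not the mapping actually used in the project.

```python
import re
from collections import Counter

# Hypothetical keyword-to-level mapping, checked in order (most specific first).
LEVEL_KEYWORDS = [
    ("POSTGRADUATE", ("mba", "mtech", "msc", "mcom", "post", "postgraduate", "masters")),
    ("GRADUATE", ("bsc", "bcom", "btech", "graduate", "degree")),
    ("DIPLOMA", ("diploma", "iti")),
    ("SCHOOL", ("ssc", "hsc", "10th", "12th", "matric")),
]


def normalize_qualification(raw):
    """Map a free-text qualification to one of a few factor levels."""
    if raw is None:
        return "UNKNOWN"
    # Lower-case and drop dots so that "B.Tech" and "M.B.A." become "btech" and "mba".
    text = raw.lower().replace(".", "")
    # Split on anything that is not a letter or digit.
    tokens = set(re.split(r"[^a-z0-9]+", text))
    tokens.discard("")
    if not tokens:
        return "UNKNOWN"
    for level, keywords in LEVEL_KEYWORDS:
        if tokens & set(keywords):
            return level
    return "OTHER"


# Made-up sample values, only to show the resulting level counts.
samples = ["B.Tech (Mech)", "  M.B.A. ", "Post Graduate", "12th pass", "", "n/a"]
level_counts = Counter(normalize_qualification(value) for value in samples)
```

In practice the keyword lists would be built up iteratively by inspecting the values that fall into `OTHER`, until that bucket is small enough to ignore.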
The picture below gives an overview of the design and flow of the application that was deployed and working at the site.
- The input data is made available in 4 flat tables of the customer’s RDBMS.
- The data is imported using Sqoop into HDFS as tab-separated files.
- All the required transformations are done in Hive by creating external tables over these files.
- Once the data is ready for scoring, a Pig script is run which applies the model developed by the analysts and scores all the records.
- The final output, which is a subset of the total input records, is provided as a CSV file and also exported to the RDBMS for the customer to take the required actions based on the scores and scored categories.
All the above steps are put into a Linux shell script, which is scheduled using cron on the Hadoop cluster’s NameNode to run on a monthly basis. I think this can be termed a classical Hadoop use case.
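The deployed job was a shell script. Purely as an illustration of the sequence of steps, here is a rough Python rendering of such a driver. The JDBC URL, table names, HDFS paths and script names are all hypothetical, credentials are left out, and the commands are only built and listed unless `dry_run` is switched off.

```python
import subprocess

# All names below are hypothetical placeholders, not the project's actual values.
JDBC_URL = "jdbc:mysql://dbhost:3306/loans"
SOURCE_TABLES = ["loan_details", "borrower_demographics",
                 "payment_history", "followup_actions"]


def build_steps():
    """Return the monthly pipeline as a list of command lines, in run order."""
    steps = []
    # 1. Import each source table into HDFS as tab-separated files.
    #    (Connection credentials such as --username are omitted in this sketch.)
    for table in SOURCE_TABLES:
        steps.append([
            "sqoop", "import",
            "--connect", JDBC_URL,
            "--table", table,
            "--target-dir", f"/data/raw/{table}",
            "--fields-terminated-by", "\\t",
        ])
    # 2. Run the Hive transformations over external tables on those files.
    steps.append(["hive", "-f", "transform.hql"])
    # 3. Apply the model and score the records with a Pig script.
    steps.append(["pig", "-f", "score.pig"])
    # 4. Export the scored CSV output back to the RDBMS.
    steps.append([
        "sqoop", "export",
        "--connect", JDBC_URL,
        "--table", "scored_output",
        "--export-dir", "/data/scored",
        "--input-fields-terminated-by", ",",
    ])
    return steps


def run_pipeline(dry_run=True):
    """Run the steps in order, stopping at the first failure.

    With dry_run=True nothing is executed; the commands are only returned.
    """
    executed = []
    for command in build_steps():
        if not dry_run:
            subprocess.run(command, check=True)
        executed.append(command)
    return executed
```

A crontab entry along the lines of `0 2 1 * * python3 /opt/scoring/run_monthly.py` (path hypothetical) would then trigger the run on the first day of each month.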
Migrating the application to Spark, as we know, would make it faster (“lightning fast”, as the Spark website puts it) and bring all the other good things that Spark offers, most importantly a uniform platform. The customer, naturally, is more interested in using the deployed system than in migrating to newer ways of doing the same thing.
However, thinking through the details of implementing this application in Spark, we see that:
- The data extraction can be done by connecting to the RDBMS using Spark SQL.
- All the transformations/pre-processing can likewise be done in Spark SQL, on data frames that are distributed across the Spark cluster.
- The scoring engine can be written as a Scala routine that runs on an RDD in a distributed manner.
- The final output can be written into a table in the RDBMS by Spark SQL.
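The four steps above could look roughly like the following sketch. It uses PySpark rather than Scala, only to keep the example short. The table names, column names, the logistic form of the model and its coefficients are all assumptions made for illustration; the actual model developed by the analysts is not described here. A JDBC driver for the RDBMS would also need to be on Spark's classpath.

```python
import math

# Hypothetical logistic-model coefficients, for illustration only.
COEFFICIENTS = {"missed_emis_last_3m": 0.9, "loan_to_income": 0.4}
INTERCEPT = -2.0


def score_record(record):
    """Return a default-propensity score in (0, 1) for one borrower record (a dict)."""
    z = INTERCEPT + sum(
        weight * float(record.get(name) or 0.0)  # missing/NULL values count as 0
        for name, weight in COEFFICIENTS.items()
    )
    return 1.0 / (1.0 + math.exp(-z))


def run_job(jdbc_url, user, password):
    # Imported lazily so that score_record can be used and tested without Spark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("emi-default-scoring").getOrCreate()

    def read_table(name):
        # Step 1: extract a table from the RDBMS over JDBC.
        return (spark.read.format("jdbc")
                .option("url", jdbc_url).option("dbtable", name)
                .option("user", user).option("password", password)
                .load())

    read_table("loan_details").createOrReplaceTempView("loans")
    read_table("payment_history").createOrReplaceTempView("payments")

    # Step 2: transformations in Spark SQL on distributed data frames.
    features = spark.sql("""
        SELECT l.loan_id,
               l.loan_to_income,
               SUM(CASE WHEN p.missed = 1 THEN 1 ELSE 0 END) AS missed_emis_last_3m
        FROM loans l JOIN payments p ON l.loan_id = p.loan_id
        GROUP BY l.loan_id, l.loan_to_income
    """)

    # Step 3: score every record on the RDD, in parallel across the cluster.
    scored = features.rdd.map(
        lambda row: (row["loan_id"], score_record(row.asDict())))
    result = spark.createDataFrame(scored, ["loan_id", "score"])

    # Step 4: write the scores back to a table in the RDBMS.
    (result.write.format("jdbc")
        .option("url", jdbc_url).option("dbtable", "scored_output")
        .option("user", user).option("password", password)
        .mode("append").save())
    spark.stop()
```

Keeping the scoring logic in a plain function such as `score_record` means the model can be unit-tested on a laptop before it is ever run on a cluster.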
So, as a matter of fact, we wouldn’t even need Hadoop and HDFS! All we need is a cluster of commodity servers with, say, 32 GB or more of RAM and a TB or two of hard disk each. I generally brush aside articles with titles like ‘Hadoop is out!’ or, worse still, ‘Is Hadoop dead?’, considering them alarmist or attempts to get attention (or, to use a trendy phrase, to grab eyeballs), but they are not totally off the mark after all.
However, if we look at it at the enterprise level, a data extraction exercise like the one in this case is most likely going to be used by multiple applications and not by just a single application.
This is where Hadoop can serve as a veritable data lake – collecting and storing the data from all sources and channels in whatever form it is given. Each application like the one above can dip into this data lake, take the data in its available form, cleanse it and bottle it as per the processing, reporting and analytics requirements.
The more data there is, the better and more accurate the analytic models tend to be. And all analytics require, if not demand, a robust pipeline of data pre-processing steps, from cleansing to transformation to data reduction and so on.
So Hadoop certainly has its prime place among Big Data technologies, though it may no longer be synonymous with Big Data as it was just a few years ago.
This article was originally published on my LinkedIn Pulse. You can read the original publication here.