Oozie in Action: Workflow Scheduler for Hadoop
Oozie – the workflow scheduler for Hadoop – is perhaps the only major component in the Hadoop ecosystem that does not work on or handle data directly, whether by data ingestion or data processing. Let us look at what it does and where and how it is used, through a production-scenario case study.
Consider the production scenario of a typical Hadoop project. Data ingestion, i.e. loading data into HDFS, often accounts for nearly half of a project in terms of time and effort, for various reasons. In any enterprise, data usually comes into the Big Data systems from multiple sources. For instance, in the insurance domain, policy data can come from different sources such as the normal sales channels or from online transactions via the website. Similarly, other details like payments or claim requests can come from different sources too.
Since data comes from different channels, there will be discrepancies and inconsistencies in the data fields and content. Some sources may send additional fields which need to be filtered out. Fields from some channels may contain junk or control characters which need to be removed. Another common inconsistency: one source may indicate a field, say Gender, with an “F” or an “M”, while another source may fill in the same field with the full forms “Female” and “Male”. All of this needs to be cleaned and the inconsistencies ironed out before the data is ready to be loaded for downstream analytics applications.
An efficient and robust data pipeline that takes the incoming data and then cleans, extracts and transforms it into a form that can be loaded into Hadoop for downstream operations is essential for the smooth functioning of the entire project, much as in a Data Warehouse application.
Oozie allows us to build a workflow for these operations as a Directed Acyclic Graph (DAG), and its Coordinator functionality lets us schedule that workflow to be triggered at a specific time or on a specific event, such as the arrival of data files.
We at Jigsaw Academy have taken a typical production scenario from the auto insurance domain, with a dataset of over 9,000 records, for this case study. The dataset contains two types of data:
- Policy data – data that is static over the life of a policy, such as the holder’s name, vehicle details and so on. This file is generated daily at a specified hour, covering all the policies issued that day.
- Monthly data – data that changes every month, such as months-since-inception, months-since-last-claim, open claims, claim amount, open complaints and so on. This file is delivered monthly on a specified day and time.
After looking at the sample records of both the data files we decided to:
- Write a simple shell script to cleanse the data by removing the control characters that appear in some records in both data files
- Write a brief Pig script for each data file to extract the required fields and transform the data – for example, to make the Gender field consistent in the Policy data file and to correct the Claim Amount field in the Monthly data file
- Write Hive queries to load the data into the respective tables
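As a rough sketch of the first of these steps, a shell function like the following could strip control characters from a record stream. The function name and the exact character ranges are illustrative assumptions, not the script used in the case study:

```shell
# Delete ASCII control characters (NUL..BS, VT..US, DEL),
# keeping tab (\011) and newline (\012) so record structure survives.
clean_stream() {
  tr -d '\000-\010\013-\037\177'
}

# Example: a record polluted with a backspace and a bell character
printf 'John\bDoe,\aM,42\n' | clean_stream
# -> JohnDoe,M,42
```

In practice the same function would be applied to each incoming file, writing the cleansed output to a staging directory before the Pig step picks it up.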
So our workflow basically consists of three actions for these cleaning and ETL tasks as shown below.
Shell Script → Pig Script → Hive Queries
This sequence of actions needs to be executed as a workflow for each of the two files as and when they are made available. Once a month, it is possible that both files are made available at the same time. To handle this scenario, instead of scheduling a second workflow with the same actions for the two files, we used the Decision control node and the Fork & Join control nodes of Oozie Workflow.
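As a hedged sketch, the control-flow skeleton of such a workflow might look like the following. The node names, the EL flags `bothFiles`/`policyOnly`, and the shell-action details are illustrative assumptions; the Pig and Hive actions and the monthly-file chain are elided and would follow the same pattern:

```xml
<workflow-app name="insurance-ingest-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="which-files"/>

  <!-- Decision node: route according to which file(s) arrived -->
  <decision name="which-files">
    <switch>
      <case to="fork-both">${bothFiles eq "true"}</case>
      <case to="clean-policy">${policyOnly eq "true"}</case>
      <default to="clean-monthly"/>
    </switch>
  </decision>

  <!-- Fork/Join: run both pipelines in parallel when both files arrive -->
  <fork name="fork-both">
    <path start="clean-policy"/>
    <path start="clean-monthly"/>
  </fork>

  <!-- First action of the Shell -> Pig -> Hive chain for the policy file -->
  <action name="clean-policy">
    <shell xmlns="uri:oozie:shell-action:0.2">
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <exec>cleanse.sh</exec>
      <file>${appPath}/cleanse.sh#cleanse.sh</file>
    </shell>
    <ok to="policy-pig"/>
    <error to="fail"/>
  </action>

  <join name="join-both" to="end"/>

  <kill name="fail">
    <message>Ingestion failed [${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
```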
A question may come up as to why we cannot use cron, the scheduler that comes with Linux, or another scheduling tool such as Autosys, which is available in most mid-to-large data-flow environments.
Where Oozie Gets an Upper-hand
- Oozie is designed to take advantage of the distributed environment of the Hadoop cluster. It launches each job on a different node, distributing the load, so the capacity to launch workflows grows with the cluster size.
- Oozie has built-in Hadoop actions, so not only does building and scheduling workflows become easier, but so do maintenance and troubleshooting.
- Oozie’s Web UI allows us to see the logs and drill down to specific errors on the data nodes, which is time-consuming with other systems.
- Oozie Coordinator can trigger actions when data files arrive or when a directory is ready in HDFS, which would be difficult to implement with other tools.
To implement an Oozie Workflow as well as a Coordinator, we just need to define two files for each:
- An XML file listing the sequence of actions and
- A Properties file for Oozie to get the required configuration details of the jobs
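A minimal properties file might look like the following sketch; the host names, ports and paths are placeholders, not the values from the case study:

```properties
nameNode=hdfs://namenode:8020
jobTracker=resourcemanager:8032
queueName=default
appPath=${nameNode}/user/hdfs/insurance-ingest
oozie.wf.application.path=${appPath}
```

The workflow can then be submitted with the Oozie command-line client, e.g. `oozie job -oozie http://oozie-host:11000/oozie -config job.properties -run`.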
The Workflow is built, and the Coordinator is defined to trigger it when the data files become available in a specified directory. The workflow is shown in the chart below; Oozie’s Web UI depicts it graphically in a similar fashion, highlighting the path being executed.
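As a sketch of the data-availability trigger, a Coordinator that waits for the daily policy file could look like the following. The dataset name, URI template, frequency and done-flag are assumptions for illustration:

```xml
<coordinator-app name="insurance-ingest-coord" frequency="${coord:days(1)}"
                 start="${startTime}" end="${endTime}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
  <datasets>
    <dataset name="policy" frequency="${coord:days(1)}"
             initial-instance="${startTime}" timezone="UTC">
      <!-- Directory the daily policy file is dropped into -->
      <uri-template>${nameNode}/data/policy/${YEAR}${MONTH}${DAY}</uri-template>
      <!-- Run only once the producer writes the _SUCCESS marker -->
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="policyReady" dataset="policy">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${appPath}</app-path>
    </workflow>
  </action>
</coordinator-app>
```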
Oozie is tailor-made for scheduling workflows on Hadoop, not only for data ingestion but also for downstream descriptive-analytics operations, such as generating standard reports.
Working through and implementing this case study gives you sufficient understanding of and familiarity with Oozie and its features to make you production-ready.