HBase – A Versatile Data Store


HBase is a Hadoop ecosystem component that I would describe as an all-rounder in the big data analytics playground. Let me explain with an example. Take a typical Hadoop-based big data analytics project – be it building a decision tree/random forest model or a recommender system or a regression model. The data pipeline in the project can be depicted as below.

 

  • Raw data from the operational data store or other existing data sources in the organization is ingested into Hadoop, either by spooling the data from different tables into different files or by extracting the data with a tool like Sqoop as text files or ORC/Parquet files.
  • Pre-processing of the data is taken up using tools like Pig or Hive which involves tasks such as data cleansing, binning, and data reduction. Data is made consistent or standardized as listed in the examples below.
    • Gender field having F or Female and M or Male is populated with 0 or 1
    • Fields such as Education are converted into factor variables by defining levels such as Metric, Pre-university, Engineering Graduate, Non-Engineering Graduate, Post-graduate, PhD and denoting each level with a number from 0 to 5.
    • Numeric fields such as Age or Income are converted into fields such as age-group or income-bracket, with each category denoted by a number from 0 up to 4 or 5, as above.
  • Exploratory analysis of the data is done next using Pig/Hive and any graphics/visualization tools where several views of the data are generated by filtering and grouping with different criteria to get better insights into the data and the patterns thereof.
  • These exercises yield the parameters required for the machine learning algorithms/models and, finally, the data frames for the models.
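
The standardization examples in the pre-processing step can be sketched in a few lines of Python. This is only an illustration of the idea, not the Pig or Hive code a project would actually use: which gender value maps to 0 or 1 and the age cut points are assumptions made for this sketch, while the education levels are the ones listed above.

```python
from bisect import bisect_right

# Assumed code assignments: the text only says the field ends up as 0 or 1.
GENDER_CODES = {"f": 0, "female": 0, "m": 1, "male": 1}

# Levels in the order listed above, numbered 0 to 5 by position.
EDUCATION_LEVELS = [
    "Metric",
    "Pre-university",
    "Engineering Graduate",
    "Non-Engineering Graduate",
    "Post-graduate",
    "PhD",
]

# Hypothetical cut points that split age into groups 0 to 4.
AGE_CUT_POINTS = [18, 30, 45, 60]


def encode_record(record):
    """Build a new dict in which each field is replaced by its integer code."""
    return {
        "gender": GENDER_CODES[record["gender"].strip().lower()],
        "education": EDUCATION_LEVELS.index(record["education"]),
        "age_group": bisect_right(AGE_CUT_POINTS, record["age"]),
    }


print(encode_record({"gender": "Female", "education": "PhD", "age": 42}))
```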

This phase of the project can easily take a few months or a few quarters, and some of these exercises are iterative as well. During this period, additional data usually arrives, with new records and updates to existing records. New records can be appended to the existing data on Hadoop; updating existing records, however, can be cumbersome. This is where HBase can be very valuable as a data store, because of its ability to provide random, real-time read/write access to your big data, as mentioned on the official HBase website.

HBase Data Model:

HBase is an open-source, distributed, versioned, non-relational database modelled after Google’s Bigtable that runs on top of Hadoop. It is built to handle tables with billions of rows and millions of columns. The following illustrates HBase’s data model without getting into too many details.

HBase’s data model is quite different from that of an RDBMS. An HBase table consists of rows and column families. Each column family can contain any number of columns. Rows are sorted by row key. There are no data types in HBase; data is stored as byte arrays in the cells of an HBase table. The value in a cell is versioned by the timestamp at which the value is stored, so each cell of an HBase table may contain multiple versions of data. As any row, or any cell of any row, can be accessed directly using the row key, HBase affords us random read/write access to any required row among the billions of rows of big data.
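
One way to picture this model is as a sorted map of maps: row key, then column (family:qualifier), then timestamp, then value. The toy class below is a minimal in-memory sketch of that picture. `ToyTable` and its methods are hypothetical names made up for this illustration and are not part of any HBase client API.

```python
import time


class ToyTable:
    """In-memory stand-in for one HBase table (illustration only)."""

    def __init__(self, families):
        self.families = set(families)
        # row key -> {b"family:qualifier" -> {timestamp -> value}}
        self.rows = {}

    def put(self, row_key, column, value, ts=None):
        family = column.split(b":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family!r}")
        ts = int(time.time() * 1000) if ts is None else ts
        # A put never replaces older values; it adds a version under a new timestamp.
        self.rows.setdefault(row_key, {}).setdefault(column, {})[ts] = value

    def get(self, row_key, column, versions=1):
        cell = self.rows.get(row_key, {}).get(column, {})
        newest_first = sorted(cell.items(), reverse=True)
        return [value for _, value in newest_first[:versions]]

    def scan(self):
        # Rows come back ordered by the bytes of the row key.
        for row_key in sorted(self.rows):
            yield row_key, self.rows[row_key]


table = ToyTable([b"products", b"customers"])
table.put(b"2", b"products:price", b"199.99", ts=1)
table.put(b"10", b"products:price", b"24.99", ts=1)
table.put(b"2", b"products:price", b"189.99", ts=2)  # new version of the same cell

print(table.get(b"2", b"products:price"))              # latest version only
print(table.get(b"2", b"products:price", versions=2))  # latest two versions
print([key for key, _ in table.scan()])                # b"10" sorts before b"2"
```

Everything here is bytes, mirroring the absence of data types, and the scan order shows why row key 10 sorts before row key 2. Real HBase additionally caps the number of versions it retains per column family, which this sketch does not model.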

The logical representation of the rows in an HBase table can be shown as below:

rowkey | products:product_name | products:price | products:quantity | products:image | customers:customer_name | customers:state | customers:zip code
--- | --- | --- | --- | --- | --- | --- | ---
1 | Diamondback Women’s Serene Classic Comfort | 299.98 | 1 | | Mary Malone | NC | 28601
2 | Pelican Sunstream 100 Kayak | 199.99 | 1 | | David Rodriguez | IL |
3 | Nike Men’s Dri-FIT Victory Golf Polo | 50 | 5 | | David Rodriguez | IL |
4 | Nike Men’s CJ Elite 2 TD Football Cleat | 129.99 | 1 | | David Rodriguez | IL |
5 | Team Golf New England Patriots Putter Grip | 24.99 | 2 | | Brian Wilson | |
6 | Perfect Fitness Perfect Rip Deck | 59.99 | 5 | | Brian Wilson | |

 

Using HBase:

HBase comes with a command line interface, the HBase shell. Using it you can create an HBase table, add rows to it, get rows by row key, and run commands such as scan, which displays all the rows and can optionally filter them on specified criteria, along with a few others.

We can also work with HBase through Pig, which is very useful for loading data in bulk into HBase tables. For example, you can load the following records into an HBase table using the Pig script given below.

1,Richard Hernandez,6303 Heather Plaza,Brownsville,TX,78521
2,Mary Barrett,9526 Noble Embers Ridge,Littleton,CO,80126
3,Ann Smith,3422 Blue Pioneer Bend,Caguas,PR,00725
4,Mary Jones,8324 Little Common,San Marcos,CA,92069
5,Robert Hudson,10 Crystal River Mall,Caguas,PR,00725
6,Mary Smith,3151 Sleepy Quail Promenade,Passaic,NJ,07055
7,Melissa Wilcox,9453 High Concession,Caguas,PR,00725
8,Megan Smith,3047 Foggy Forest Plaza,Lawrence,MA,01841
9,Mary Perez,3616 Quaking Street,Caguas,PR,00725
10,Melissa Smith,8598 Harvest Beacon Plaza,Stafford,VA,22554
11,Mary Huffman,3169 Stony Woods,Caguas,PR,00725
12,Christopher Smith,5594 Jagged Embers By-pass,San Antonio,TX,78227

To try it out follow these steps:

  • Copy the above records into a text file named, say, customerdet.csv.
  • Run hbase shell command and create the HBase table named ‘customer’ with column family ‘details’ by giving the following command at hbase shell prompt:

create 'customer', 'details'

  • Now start Pig in local mode if you are using a virtual machine; otherwise you can run Pig in mapreduce mode.

pig -x local

  • And run the following Pig commands one by one.

input_data = LOAD 'customerdet.csv' USING PigStorage(',');

dump input_data;

STORE input_data INTO 'hbase://customer' USING org.apache.pig.backend.hadoop.hbase.HBaseStorage('details:name details:address details:city details:state details:zipcode');

  • Once the data is successfully loaded you can quit Pig (grunt) shell; run hbase shell and give the command below to see the records loaded. You can obviously use all the HBase shell commands.

scan 'customer'
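
Note that the STORE statement lists five columns for six input fields: HBaseStorage takes the first field of each tuple as the row key and maps the remaining fields, in order, to the listed columns. The Python sketch below only mimics that mapping for one record so the resulting cells are easy to see. The function name `to_hbase_cells` is made up for this illustration.

```python
import csv

# Column list copied from the STORE statement above.
COLUMNS = [
    "details:name",
    "details:address",
    "details:city",
    "details:state",
    "details:zipcode",
]


def to_hbase_cells(line):
    """Split one CSV record into (row key, {column: value})."""
    fields = next(csv.reader([line]))
    row_key, values = fields[0], fields[1:]
    return row_key, dict(zip(COLUMNS, values))


row_key, cells = to_hbase_cells("3,Ann Smith,3422 Blue Pioneer Bend,Caguas,PR,00725")
print(row_key)
for column, value in cells.items():
    print(column, "=", value)
```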

We can use the same class, org.apache.pig.backend.hadoop.hbase.HBaseStorage, to read (load) the data from HBase into Pig scripts for any data manipulation or transformations.

Hive is integrated with HBase very closely as well. We can use HBase through Hive and run Hive queries on HBase tables. For example we can create an external table in Hive and point it to the HBase table created above and query the HBase data through Hive queries.

You can try it out with the following steps.

  • Run Hive shell and create a database and an external table in it using these commands.

CREATE DATABASE retail1;

USE retail1;

SET hive.cli.print.current.db=true;

CREATE EXTERNAL TABLE customer1(customerid int, name string, address string, city string, state string, zipcode string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,details:name,details:address,details:city,details:state,details:zipcode") TBLPROPERTIES ("hbase.table.name" = "customer");

  • Now run the commands below and check the table descriptions and the data (records).

DESCRIBE FORMATTED customer1;

SELECT * FROM customer1 WHERE name LIKE 'Mary%';

HBase also has a Thrift Server and REST Server which enable remote clients such as web applications to access HBase data. Moreover we can use Sqoop to ingest data into an HBase table from any RDBMS.

Designing an HBase Table:

Let us now take an RDBMS and look at how to design the HBase table to ingest its data. The following is the schema of a database from the retail industry domain.

We can describe the relationships of the database tables as: Customers place an order which has several order items. Each order item is for a product and each product belongs to a particular category. For analysing customer purchase behaviour and/or for building a recommender system the data from this RDBMS needs to be ingested into an HBase table.

In a relational database design, the starting point would be the entities and their relationships with one another. Factors like how the data is accessed and the indexes to be built come in later. In HBase the major factors to consider for the table design are:

  1. Row Key
  2. Column families
  3. Read and Write patterns and
  4. Versions of cell content

One other aspect is normalization. While designing relational databases, the schema is normalized to avoid storing the same data more than once, and repeating data is put in a separate table. This, however, results in table joins when retrieving the data, and retrieving data from multiple tables through joins may be slow. In HBase table design you actually de-normalize the data; this causes redundancy, but it can be seen as a substitute for joins.
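
As a rough sketch of what de-normalizing means here, the Python below resolves the joins once, up front, and emits one flat record per order item. The product and customer values come from the sample data in this post; the order ids and the exact shape of the source tables are assumptions made for this illustration.

```python
# Normalized source tables, reduced to plain dicts and tuples.
customers = {11599: "Mary Malone", 256: "David Rodriguez"}
products = {
    957: ("Diamondback Women's Serene Classic Comfort", 299.98),
    1073: ("Pelican Sunstream 100 Kayak", 199.99),
    502: ("Nike Men's Dri-FIT Victory Golf Polo", 50.0),
}
orders = {1: 11599, 2: 256}  # order_id -> customer_id (hypothetical order ids)
order_items = [(1, 957, 1), (2, 1073, 1), (2, 502, 5)]  # (order_id, product_id, quantity)


def denormalize(order_items, orders, customers, products):
    """Resolve the joins once and return one flat record per order item."""
    rows = []
    for order_id, product_id, quantity in order_items:
        customer_id = orders[order_id]
        product_name, price = products[product_id]
        rows.append({
            "product_id": product_id,
            "product_name": product_name,
            "price": price,
            "quantity": quantity,
            "customer_id": customer_id,
            "customer_name": customers[customer_id],
        })
    return rows


flat_rows = denormalize(order_items, orders, customers, products)
for row in flat_rows:
    print(row)
```

The customer name is repeated in every record for that customer, which is the redundancy traded for not having to join at read time.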

If the data is accessed mainly by product, let us see whether we can use the product id as the row key. We can then group the product details in a column family called products, and the customer details can be grouped into a customer column family with multiple columns. In this case the rows in the HBase table could look as shown below:

product_id (row key) | products:product_name | products:price | products:quantity | customer:customer_id1 | customer:customer_name1 | customer:customer_id2 | customer:customer_name2
--- | --- | --- | --- | --- | --- | --- | ---
957 | Diamondback Women’s Serene Classic Comfort | 299.98 | 1 | 11599 | Mary Malone | 11318 | Mary Henry
1073 | Pelican Sunstream 100 Kayak | 199.99 | 1 | 256 | David Rodriguez | |
502 | Nike Men’s Dri-FIT Victory Golf Polo | 50 | 5 | 256 | David Rodriguez | 8827 | Brian Wilson
403 | Nike Men’s CJ Elite 2 TD Football Cleat | 129.99 | 1 | 256 | David Rodriguez | |
897 | Team Golf New England Patriots Putter Grip | 24.99 | 2 | 8827 | Brian Wilson | |
365 | Perfect Fitness Perfect Rip Deck | 59.99 | 5 | 8827 | Brian Wilson | 11318 | Mary Henry
1014 | O’Brien Men’s Neoprene Life Vest | 49.98 | 4 | 8827 | Brian Wilson | |

This approach is clearly not feasible because the number of customers who purchase a product can be large. Even when you have a use case and dataset with a large but finite number of columns, filtering on column values in HBase tables can be slow. Hence tall, narrow tables are advisable over wide, flat tables when designing row keys for HBase tables.

So a more appropriate way is to store details of one customer and the details of one product in each row. Since the row key has to be unique we need to use a combination (concatenation) of product id and customer id as the row key. This will make the row narrow with one column family for product data and one column family for customer data. And the number of rows in the table will be large since there will be as many rows as the number of customers who bought the product; and the product details will be repeating in each of these rows as shown below.

hbaserowkey | products:product_id | products:product_name | products:price | products:quantity | customers:customer_id | customers:customer_name
--- | --- | --- | --- | --- | --- | ---
0957-11599 | 957 | Diamondback Women’s Serene Classic Comfort | 299.98 | 1 | 11599 | Mary Malone
1073-00256 | 1073 | Pelican Sunstream 100 Kayak | 199.99 | 1 | 256 | David Rodriguez
0502-00256 | 502 | Nike Men’s Dri-FIT Victory Golf Polo | 50 | 5 | 256 | David Rodriguez
0403-00256 | 403 | Nike Men’s CJ Elite 2 TD Football Cleat | 129.99 | 1 | 256 | David Rodriguez
0897-08827 | 897 | Team Golf New England Patriots Putter Grip | 24.99 | 2 | 8827 | Brian Wilson
0365-08827 | 365 | Perfect Fitness Perfect Rip Deck | 59.99 | 5 | 8827 | Brian Wilson
0502-08827 | 502 | Nike Men’s Dri-FIT Victory Golf Polo | 50 | 3 | 8827 | Brian Wilson
1014-08827 | 1014 | O’Brien Men’s Neoprene Life Vest | 49.98 | 4 | 8827 | Brian Wilson
0957-11318 | 957 | Diamondback Women’s Serene Classic Comfort | 299.98 | 1 | 11318 | Mary Henry
0365-11318 | 365 | Perfect Fitness Perfect Rip Deck | 59.99 | 5 | 11318 | Mary Henry
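
The concatenated row keys above are zero-padded, and that matters because HBase sorts rows by the bytes of the key rather than by numeric value. A minimal Python sketch of building such keys follows. The widths (4 digits for the product id, 5 for the customer id) are read off the sample keys, and `make_row_key` is a name made up for this illustration.

```python
def make_row_key(product_id, customer_id):
    """Concatenate zero-padded ids so that string order follows numeric order."""
    return f"{product_id:04d}-{customer_id:05d}"


purchases = [(957, 11599), (1073, 256), (502, 256), (502, 8827), (957, 11318)]

# HBase keeps rows sorted by key, so all rows of one product end up adjacent.
row_keys = sorted(make_row_key(p, c) for p, c in purchases)
print(row_keys)

# A prefix match on the product part finds every customer who bought product 502.
polo_rows = [key for key in row_keys if key.startswith("0502-")]
print(polo_rows)

# Without padding, product 1073 would sort ahead of product 957.
print(sorted(["957-11599", "1073-256"]))
```

With the product id leading the key, reads by product become a contiguous range of rows; if the data were mainly read by customer, putting the customer id first would be the analogous choice.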

 

The advantage of storing the data in HBase is that if any field changes, for instance if a product is returned or a quantity is decreased or increased, only that cell needs to be updated. You can do this through Pig or Hive, with the put command in HBase shell, or with the equivalent API if a client or web-based application is built. Based on the row key, HBase updates the cell by adding a new version of the value with the current timestamp; it does not insert another row. Updates of this kind are not easy or straightforward if you are processing the data as HDFS files with Pig scripts or with Hive tables defined on top of this data.

You may see one problem with the above design of the HBase table. If a customer orders the same product again after a month or so, the row key for the new order will be the same as that of the old order, and hence instead of inserting a new row HBase will update the existing row for that row key. We can remedy this by adding the order date or order id to the row key.
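
The collision and its remedy can be seen with plain Python dicts standing in for tables. This is a sketch only: the order ids, the 7-digit padding, and the second order's quantity are hypothetical, and the dict keeps just the latest value per cell, which is what a default read would return.

```python
def put(table, row_key, column, value):
    """Write one cell; an existing row with the same key is updated, not duplicated."""
    table.setdefault(row_key, {})[column] = value


def row_key(product_id, customer_id, order_id=None):
    key = f"{product_id:04d}-{customer_id:05d}"
    if order_id is not None:
        key += f"-{order_id:07d}"  # hypothetical width for the order id
    return key


# Customer 256 orders product 502 twice: (order_id, product_id, customer_id, quantity).
repeat_orders = [(1001, 502, 256, 5), (2040, 502, 256, 2)]

without_order_id = {}
with_order_id = {}
for order_id, product_id, customer_id, quantity in repeat_orders:
    put(without_order_id, row_key(product_id, customer_id), "products:quantity", quantity)
    put(with_order_id, row_key(product_id, customer_id, order_id), "products:quantity", quantity)

print(without_order_id)  # one row: the second order replaced the first as the visible value
print(with_order_id)     # two rows: one per order
```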

All the steps of the above process can be performed by loading the data set into a database of any RDBMS and using Sqoop to load the data into HBase.

A MySQL database was created with the abovementioned schema and data was loaded into all the tables using mysqlimport tool. To keep it simple an SQL join statement was used to join the records from the required tables and load them into a new table retail2.

An HBase table is created with the following HBase shell command and then Sqoop import commands below are run to load the data from the RDBMS table into the HBase table.

hbase(main) > create 'retail2','products','customers'

$ sqoop import \
--connect jdbc:mysql://<hostname.domainname>/retaildb \
--username <username> \
-P \
--query 'select hbaserowkey, product_id, product_category, product_name, product_description, product_price, product_image, product_quantity from retail2 WHERE $CONDITIONS' \
--hbase-table retail2 \
--column-family products \
--hbase-row-key hbaserowkey -m 1

$ sqoop import \
--connect jdbc:mysql://<hostname.domainname>/retaildb \
--username <username> \
-P \
--query 'select hbaserowkey, customer_id, customer_name, customer_email, customer_street, customer_city, customer_state, customer_zipcode from retail2 WHERE $CONDITIONS' \
--hbase-table retail2 \
--column-family customers \
--hbase-row-key hbaserowkey -m 1

Now with the data available in HBase we can access and manage it as required for any processing or analysis using any tools as afforded by HBase.

You can use HBase not only with the tools like Pig, Hive and Sqoop as shown above but also with Spark; you can also write client or web-based applications using its Thrift or REST services. As we saw HBase requires only the table name and the column family names at the time of table creation. There are no data types in HBase as all data is treated as byte arrays.

HBase gives us a very flexible data store; we just need to design the tables carefully, bearing in mind factors such as data access patterns, column families and, above all, row key design.

PS: You can download the sample datasets used and mentioned in this blog here <LINK> by signing in with a valid e-mail address.

Photo by rawpixel on Unsplash

 
