This is the second of two posts examining the use of Hive for interaction with HBase tables. This is a hands-on exploration, so the first post isn’t required reading for consuming this one. Still, it might be good context.
“Nick!” you exclaim, “that first post had too many words and I don’t care about JIRA tickets. Show me how I use this thing!”
This post is exactly that: a concrete, end-to-end example of consuming HBase over Hive. The whole mess was tested to work on a tiny little 5-node cluster running HDP-1.3.2, which means Hive 0.11.0 and HBase 0.94.6.1.
Grab some data and register it in Hive
We’ll need some data to work with. For this purpose, grab some traffic stats from Wikipedia. Once we have some data, copy it up to HDFS.
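Something along these lines works, assuming the pagecounts-raw dumps on dumps.wikimedia.org; the dates, file names, and HDFS path here are illustrative.

```bash
# Grab the first ten hourly pagecount files (dates/filenames are illustrative).
mkdir pagecounts && cd pagecounts
for h in {0..9} ; do
  wget "http://dumps.wikimedia.org/other/pagecounts-raw/2008/2008-10/pagecounts-20081001-0${h}0000.gz"
done

# Copy the files up to HDFS (the target path is an arbitrary choice).
hadoop fs -mkdir /tmp/pagecounts
hadoop fs -copyFromLocal pagecounts-*.gz /tmp/pagecounts/
```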
For reference, this is what the data looks like.
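Each line is four space-delimited fields; the rows below are illustrative rather than actual values.

```bash
$ zcat pagecounts-20081001-000000.gz | head -n 6
# project  page_name            views  bytes   (illustrative rows)
aa   Main_Page                  4      41431
ab   Special:Statistics         1      840
af.b Tuisblad                   1      36236
en   Anna_Karenina              2      44269
en   Main_Page                  55     1134205
fr   Wikipédia:Accueil          21     709312
```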
As I understand it, each record is a count of page views of a specific page on Wikipedia. The first column is the language code, second is the page name, third is the number of page views, and fourth is the size of the page in bytes. Each file contains an hour’s worth of aggregated data. None of the above pages were particularly popular that hour.
Now that we have data and understand its raw schema, create a Hive table over it. To do that, we’ll use a DDL script that looks like this.
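A sketch of that script, here called pagecounts.ddl; the table name, column names, and HDFS location are assumptions.

```sql
-- pagecounts.ddl: an external table over the raw, space-delimited files.
-- Hive reads the .gz text files directly; no decompression step is needed.
CREATE EXTERNAL TABLE IF NOT EXISTS pagecounts (
  projectcode STRING,
  pagename    STRING,
  pageviews   STRING,
  bytes       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION '/tmp/pagecounts';
```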
Run the script to register our dataset with Hive.
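Assuming the DDL was saved as pagecounts.ddl:

```bash
hive -f pagecounts.ddl
```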
Verify that the schema mapping works by calculating a simple statistic over the dataset.
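A simple row count will do:

```bash
hive -e "SELECT COUNT(*) FROM pagecounts;"
```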
Hive says the 10 files we downloaded contain just over 36.5mm records. Let’s just confirm things are working as expected by getting a second opinion. This isn’t that much data, so confirm on the command line.
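Counting the raw lines locally should give the same number (the local path is illustrative):

```bash
zcat pagecounts/pagecounts-*.gz | wc -l
```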
The record counts match up – excellent.
Transform the schema for HBase
The next step is to transform the raw data into a schema that makes sense for HBase. In our case, we’ll create a schema that allows us to calculate aggregate summaries of pages according to their titles. To do this, we want all the data for a single page grouped together. We’ll manage that by creating a Hive view that represents our target HBase schema. Here’s the DDL.
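Something like this, assuming the view is named pgc and sits over the pagecounts table from above; the view name and the exact regex are assumptions.

```sql
-- pgc.ddl: a view whose first column is the compound HBase rowkey.
-- INPUT__FILE__NAME is Hive's virtual column holding the source file path;
-- the regex pulls the date/hour out of the pagecounts file name.
CREATE VIEW IF NOT EXISTS pgc (rowkey, pageviews, bytes) AS
SELECT concat_ws('/',
         projectcode,
         pagename,
         regexp_extract(INPUT__FILE__NAME, 'pagecounts-(\\d{8}-\\d{6})', 1)),
       pageviews,
       bytes
FROM pagecounts;
```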
The SELECT statement uses Hive to build a compound rowkey for HBase. It concatenates the project code, page name, and date, joined by the '/' character. A handy trick: it uses a simple regex to extract the date from the source file names. Run it now.
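Assuming the view DDL was saved as pgc.ddl:

```bash
hive -f pgc.ddl
```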
This is just a view, so the SELECT statement won’t be evaluated until we query it for data; registering it with Hive doesn’t actually process any data. Again, make sure it works by querying Hive for a subset of the data.
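For instance, pull a handful of rows from the view; the rowkey filter here is just an example.

```bash
hive -e "SELECT * FROM pgc WHERE rowkey LIKE 'en/Q%' LIMIT 10;"
```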
Register the HBase table
Now that we have a dataset in Hive, it’s time to introduce HBase. The first step is to register our HBase table in Hive so that we can interact with it using Hive queries. That means another DDL statement. Here’s what it looks like.
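In rough form, it’s a statement like this:

```sql
-- Register an HBase-backed table in Hive (a sketch).
CREATE TABLE pagecounts_hbase (rowkey STRING, pageviews STRING, bytes STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
TBLPROPERTIES ('hbase.table.name' = 'pagecounts');
```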
This statement tells Hive to go create an HBase table named pagecounts with the single column family f. It registers that HBase table in the Hive metastore under the name pagecounts_hbase with 3 columns: rowkey, pageviews, and bytes. The SerDe property hbase.columns.mapping makes the association from Hive column to HBase column: the Hive column rowkey is mapped to the HBase table’s rowkey, the Hive column pageviews to the HBase column f:c1, and bytes to the HBase column f:c2. To keep the example simple, we have Hive treat all these columns as the STRING type.
In order to use the HBase library, we need to make the HBase jars and configuration available to the local Hive process (at least until HIVE-5518 is resolved). Do that by specifying a value for the HADOOP_CLASSPATH environment variable before executing the statement.
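For example; the jar and config paths below are illustrative for an HDP-1.3 install, so adjust them to match your cluster.

```bash
# Put the HBase config and jars on the local Hive client's classpath
# (paths/versions are illustrative).
export HADOOP_CLASSPATH=/etc/hbase/conf:/usr/lib/hbase/hbase-0.94.6.1-security.jar:/usr/lib/zookeeper/zookeeper.jar
hive -f pagecounts_hbase.ddl
```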
Populate the HBase table
Now it’s time to write data to HBase. This is done using a regular Hive INSERT statement, sourcing data from the view with SELECT. There’s one more bit of administration we need to take care of, though: this INSERT statement will run a mapreduce job that writes data to HBase, so we need to tell Hive to ship the HBase jars and dependencies with the job.
Note that this is a separate step from the classpath modification we did previously. Normally you can do this with an export statement from the shell, the same way we specified HADOOP_CLASSPATH. However, there’s a bug in HDP-1.3 that requires me to use Hive’s SET statement in the script instead.
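The script looks roughly like this; the jar paths are illustrative, and the row filter just keeps the example small.

```sql
-- populate_pagecounts_hbase.hql (jar paths and the row filter are illustrative).
SET hive.aux.jars.path = file:///usr/lib/hive/lib/hive-hbase-handler-0.11.0.jar,file:///usr/lib/hbase/hbase-0.94.6.1-security.jar,file:///usr/lib/zookeeper/zookeeper.jar;

-- Load a small slice of the view into the HBase-backed table.
INSERT OVERWRITE TABLE pagecounts_hbase
SELECT * FROM pgc WHERE rowkey LIKE 'en/Q%' LIMIT 10;
```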
Note there’s a big ugly bug in Hive 0.12.0 which means this doesn’t work with that version. Never fear though, we have a patch in progress. Follow along at HIVE-5515.
If you choose to use a different method for setting Hive’s auxpath, be advised that it’s a tricky process: depending on how you specify it (HIVE_AUX_JARS_PATH, --auxpath), Hive will interpret the argument differently. HIVE-2349 seeks to remedy this unfortunate state of affairs.
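With the classpath still exported as before, run the script (paths illustrative):

```bash
export HADOOP_CLASSPATH=/etc/hbase/conf:/usr/lib/hbase/hbase-0.94.6.1-security.jar:/usr/lib/zookeeper/zookeeper.jar
hive -f populate_pagecounts_hbase.hql
```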
Be advised also that this step is currently broken on secured HBase deployments. Follow along at HIVE-5523 if that’s of interest to you.
Query data from HBase-land
40 seconds later, you now have data in HBase. Let’s have a look using the HBase shell.
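A scan of the new table shows the loaded rows; the transcript below is abbreviated, and the rowkeys and values are placeholders.

```
$ hbase shell
hbase(main):001:0> scan 'pagecounts'
ROW                              COLUMN+CELL
 en/Quark/20081001-000000        column=f:c1, timestamp=..., value=...
 en/Quark/20081001-000000        column=f:c2, timestamp=..., value=...
 ...
10 row(s) in ... seconds
```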
Here we have 10 rows with two columns each, containing the data loaded using Hive. It’s now accessible in your online world using HBase. For example, perhaps you receive an updated data file and have a corrected value for one of the stats. You can update the record in HBase with a regular put command.
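For example, with a made-up rowkey and value:

```
hbase(main):002:0> put 'pagecounts', 'en/Quark/20081001-000000', 'f:c1', '5'
```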
Verify data from Hive
The HBase table remains available to your Hive world; Hive’s HBaseStorageHandler works both ways, after all.
Note that this command expects HADOOP_CLASSPATH to still be set, and HIVE_AUX_JARS_PATH as well if your query is complex.
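A quick spot-check, for instance:

```bash
hive -e "SELECT * FROM pagecounts_hbase LIMIT 10;"
```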
Continue using Hive for analysis
Since the HBase table is accessible from Hive, you can continue to use Hive for your ETL processing with mapreduce. Keep in mind that the auxpath considerations apply here too, so I’ve scripted out the query instead of just running it directly at the command line.
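An illustrative script, here called query_hbase.hql; the aggregate and jar paths are assumptions.

```sql
-- query_hbase.hql: an example aggregate over the HBase-backed table
-- (jar paths/versions are illustrative).
SET hive.aux.jars.path = file:///usr/lib/hive/lib/hive-hbase-handler-0.11.0.jar,file:///usr/lib/hbase/hbase-0.94.6.1-security.jar,file:///usr/lib/zookeeper/zookeeper.jar;

-- pageviews is stored as a STRING, so cast it before aggregating.
SELECT AVG(CAST(pageviews AS INT)) AS avg_views
FROM pagecounts_hbase;
```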
Run it the same way we did the others.
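That is, with the HBase jars on the classpath (paths illustrative):

```bash
export HADOOP_CLASSPATH=/etc/hbase/conf:/usr/lib/hbase/hbase-0.94.6.1-security.jar:/usr/lib/zookeeper/zookeeper.jar
hive -f query_hbase.hql
```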
There you have it: a hands-on, end-to-end demonstration of interacting with HBase from Hive. You can learn more about the nitty-gritty details in Enis’s deck on the topic, or see the presentation he and Ashutosh gave at HBaseCon. If you’re inclined to make the intersection of these technologies work better (faster, stronger), I encourage you to pick up any of the JIRA issues mentioned in this post or the previous.
Happy hacking!