This is the first of two posts examining the use of Hive for interaction with HBase tables. The second post is now available.
One of the things I’m frequently asked about is how to use HBase from Apache Hive. Not just how to do it, but what works, how well it works, and how to make good use of it. I’ve done a bit of research in this area, so hopefully this will be useful to someone besides myself. This is a topic we did not get to cover in HBase in Action; perhaps these notes will become the basis for the 2nd edition ;) These notes are applicable to Hive 0.11.x used in conjunction with HBase 0.94.x. They should be largely applicable to 0.12.x + 0.96.x, though I haven’t tested everything yet.
The Hive project includes an optional library for interacting with HBase. This is where the bridge layer between the two systems is implemented. The primary interface you use when accessing HBase from Hive queries is called the HBaseStorageHandler. You can also interact with HBase tables directly via Input and Output formats, but the handler is simpler and works for most uses.
HBase tables from Hive
Use the HBaseStorageHandler to register HBase tables with the Hive metastore. You can optionally specify the HBase table as EXTERNAL, in which case Hive will not create or drop that table directly – you’ll have to use the HBase shell to do so.
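In its simplest form, the registration statement looks something like the following. The column definitions and the hbase.columns.mapping value are illustrative (the mapping is covered in the next section), but the table names match the example: a Hive table foo backed by the HBase table bar.

    CREATE TABLE foo(rowkey STRING, a STRING, b STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
    TBLPROPERTIES ('hbase.table.name' = 'bar');

Swap CREATE TABLE for CREATE EXTERNAL TABLE to get the EXTERNAL behavior described above.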
The above statement registers the HBase table named bar in the Hive metastore, accessible from Hive by the name foo.
Under the hood, HBaseStorageHandler delegates interaction with the HBase table to HiveHBaseTableInputFormat and HiveHBaseTableOutputFormat. You can register your HBase table in Hive using those classes directly if you desire. The above statement is roughly equivalent to:
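(This is a sketch; I’m assuming the stock HBaseSerDe handles the (de)serialization.)

    CREATE TABLE foo(rowkey STRING, a STRING, b STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
    STORED AS
      INPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHBaseTableInputFormat'
      OUTPUTFORMAT 'org.apache.hadoop.hive.hbase.HiveHBaseTableOutputFormat'
    TBLPROPERTIES ('hbase.table.name' = 'bar');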
Also provided is the HiveHFileOutputFormat, which means it should be possible to generate HFiles for bulkloading from Hive as well. In practice, I haven’t gotten this to work end-to-end (see HIVE-4627).
Schema mapping
Registering the table is only the first step. As part of that registration, you also need to specify a column mapping. This is how you link Hive column names to the HBase table’s rowkey and columns. Do so using the hbase.columns.mapping SerDe property.
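For the foo/bar example used above, the registration with an explicit mapping looks roughly like this:

    CREATE TABLE foo(rowkey STRING, a STRING, b STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
    TBLPROPERTIES ('hbase.table.name' = 'bar');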
The values provided in the mapping property correspond one-for-one with column names of the Hive table. HBase column names are fully qualified by column family, and you use the special token :key to represent the rowkey. The above example makes rows from the HBase table bar available via the Hive table foo. The foo column rowkey maps to the HBase table’s rowkey, a to c1 in the f column family, and b to c2, also in the f family.
You can also associate Hive’s MAP data structures with HBase column families. In this case, only the STRING Hive type is used. The other Hive type currently supported is BINARY. See the wiki page for more examples.
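As a sketch, mapping an entire column family onto a Hive MAP looks something like the following. A trailing family name with no qualifier (here f:) pulls in the whole family; foo_map is a made-up table name.

    CREATE TABLE foo_map(rowkey STRING, f MAP<STRING, STRING>)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:')
    TBLPROPERTIES ('hbase.table.name' = 'bar');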
Interacting with data
With the column mappings defined, you can now access HBase data just like you would any other Hive data. Only simple query predicates are currently supported.
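For example, a simple scan with a predicate on one of the mapped columns (the literal is made up):

    SELECT b FROM foo WHERE a = 'some-value';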
You can also populate an HBase table using Hive. This works with both INTO and OVERWRITE clauses.
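For example, reading from some other Hive table (source_table here is a stand-in):

    INSERT INTO TABLE foo SELECT id, col1, col2 FROM source_table;
    INSERT OVERWRITE TABLE foo SELECT id, col1, col2 FROM source_table;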
Be advised that there is a regression in Hive 0.12.0 which breaks this feature, see HIVE-5515.
In practice
There’s still a little finesse required to get everything wired up properly at runtime. The HBase interaction module is completely optional, so you have to make sure it and its HBase dependencies are available on Hive’s classpath.
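One way to do that is to extend HADOOP_CLASSPATH before launching the CLI. The jar paths below are placeholders; use whatever your distribution ships.

    $ export HADOOP_CLASSPATH=/path/to/hive-hbase-handler.jar:/path/to/hbase.jar:/path/to/zookeeper.jar
    $ hive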
The installation environment could do a better job of handling this for users, but for the time being you must manage it yourself. Ideally the hive bin script would detect the presence of HBase and automatically make the necessary CLASSPATH adjustments. This enhancement appears to be tracked in HIVE-2055. The last mile is provided by the distribution itself, which ensures the environment variables are set for hive. This functionality is provided by BIGTOP-955.
You also need to make sure the necessary jars are shipped out to the MapReduce jobs when you execute your Hive statements. Hive provides a mechanism for shipping additional job dependencies via the auxjars feature.
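With placeholder paths again, that looks something like this (HIVE_AUX_JARS_PATH takes a comma-separated list, if memory serves):

    $ export HIVE_AUX_JARS_PATH=/path/to/hive-hbase-handler.jar,/path/to/hbase.jar,/path/to/zookeeper.jar
    $ hive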
I did discover a small bug in HDP-1.3 builds which masks user-specified values of HIVE_AUX_JARS_PATH. With administrative rights, this is easily fixed by correcting the line in hive-env.sh to respect an existing value. The work-around in user scripts is to use the SET statement to provide a value once you’ve launched the Hive CLI.
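Again with placeholder paths:

    SET hive.aux.jars.path = file:///path/to/hive-hbase-handler.jar,file:///path/to/hbase.jar,file:///path/to/zookeeper.jar;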
Hive should be able to detect which jars are necessary and add them itself. HBase provides the TableMapReduceUtil#addDependencyJars methods for this purpose. It appears that this is done in hive-0.12.0, at least according to HIVE-2379.
Future work
Much has been said about proper support for predicate pushdown (HIVE-1643, HIVE-2854, HIVE-3617, HIVE-3684) and data type awareness (HIVE-1245, HIVE-2599). These go hand-in-hand as predicate semantics are defined in terms of the types upon which they operate. More could be done to map Hive’s complex data types like Maps and Structs onto HBase column families as well (HIVE-3211). Support for HBase timestamps is a bit of a mess; they’re not made available to Hive applications with any level of granularity (HIVE-2828, HIVE-2306). The only interaction a user has is via a storage handler setting for writing a custom timestamp with all operations.
From a performance perspective, there are things Hive can do today (i.e., not dependent on data types) to take advantage of HBase. There’s also the possibility of an HBase-aware Hive making use of HBase tables as an intermediate storage location (HIVE-3565), facilitating map-side joins against dimension tables loaded into HBase. Hive could make use of HBase’s natural indexed structure (HIVE-3634, HIVE-3727), potentially saving huge scans. Currently, the user doesn’t have (any?) control over the scans which are executed. Configuration on a per-job, or at least per-table, basis should be enabled (HIVE-1233). That would enable an HBase-savvy user to provide Hive with hints regarding how it should interact with HBase. Support for simple split sampling of HBase tables (HIVE-3399) could also be easily done because HBase manages table partitions already.
Other access channels
Everything discussed thus far has required Hive to interact with online HBase RegionServers. Applications may stand to gain significant throughput and enjoy greater flexibility by interacting directly with HBase data persisted to HDFS. This also has the benefit of preventing Hive workloads from interfering with online SLA-bound HBase applications (at least, until we see HBase improvements in QOS isolation between tasks, HBASE-4441).
As mentioned earlier, there is the HiveHFileOutputFormat. Resolving HIVE-4627 should make Hive a straightforward way to generate HFiles for bulk loading. Once you’ve created the HFiles using Hive, there’s still the last step of running the LoadIncrementalHFiles utility to copy and register them in the regions. For this, the HiveStorageHandler interface will need some kind of hook to influence the query plan as it’s created, allowing it to append steps. Once in place, it should be possible to SET a runtime flag, switching an INSERT operation to use bulkload.
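For reference, that manual last step looks something like this today; the HFile directory and table name are placeholders.

    $ hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles_from_hive bar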
HBase recently introduced the table snapshot feature. This allows a user to create a persisted point-in-time view of a table, persisted to HDFS. HBase is able to restore a table from a snapshot to a previous state, and to create an entirely new table from an existing snapshot. Hive does not currently support reading from an HBase snapshot. For that matter, HBase doesn’t yet support MapReduce jobs over snapshots, though the feature is a work in progress (HBASE-8369).
Conclusions
The interface between HBase and Hive is young, but has nice potential. There’s a lot of low-hanging fruit that can be picked up to make things easier and faster. The most glaring issue barring real application development is the impedance mismatch between Hive’s typed, dense schema and HBase’s untyped, sparse schema. This is as much a cognitive problem as a technical one. Solutions here would allow a number of improvements to fall out, including much in the way of performance improvements. I’m hopeful that continuing work to add data types to HBase (HBASE-8089) can help bridge this gap.
Basic operations mostly work, at least in a rudimentary way. You can read data out of and write data back into HBase using Hive. Configuring the environment is an opaque and manual process, one which likely stymies novices from adopting the tools. There’s also the question of bulk operations – support for writing HFiles and reading HBase snapshots using Hive is entirely lacking at this point. And of course, there are bugs sprinkled throughout. The biggest recent improvement is the deprecation of HCatalog’s interface, removing the need for an upfront decision regarding which interface to use.
Hive provides a very usable SQL interface on top of HBase, one which integrates easily into many existing ETL workflows. That interface requires simplifying some of the BigTable semantics HBase provides, but the result will be to open up HBase to a much broader audience of users. The Hive interop complements the experience provided by Phoenix extremely well. Hive has the benefit of not requiring the deployment complexities currently required by that system. Hopefully the common definition of types will allow a complementary future.