Several months ago I took up a monster of a project I had no clear idea how to tackle. What was abundantly clear was that I would need to gain a broad-reaching understanding of company policies, internal systems, and open source technologies to succeed. I had one quarter to deliver meaningfully. In an act of desperation, I reached for my coding agent as a research assistant.
I have previously released artifacts for Apache HBase using macOS and Windows 11 + WSL2. Now I am running a native Linux installation, and so I again have some minor details to work through. This install is built on systemd, which is of minor concern. More interestingly, I decided to drop Docker and instead use Podman and crun as my interface to Linux containers.
While working on HBase bug fixes and feature development, it’s often quite convenient to test changes on a local-mode HBase. This is done by running HBase right out of your developer sandbox. Though a lot of HBase development happens on Macs these days, it’s a system designed first to run on Linux. That means there are a couple minor annoyances for non-Linux users. Let me show you how I work around one of them.

Between HBaseCon and Hadoop Summit I took a short trip to Europe. I got to spend some more time working alongside Nicolas and meet some of the Scaled Risk crew. I also took a small holiday through the hills of Romania! Along the way, I was invited to present for both the Paris HPC Meetup and the London HBase Meetup.
Every year at Hadoop Summit there’s a little un-conference, called the Birds of a Feather sessions, or BoF for short. These are topical meetups that take place after the conference proceedings and are open to non-attendees. This year I helped organize the HBase BoF, along with Subash D’Souza.

The Latency Talk Nicolas and I gave at HBaseCon has been accepted for Hadoop Summit San Jose. If you missed us at HBaseCon, you get one more opportunity! We’re speaking on June 4th at 3:25p.
See you in June!
Edit: Unfortunately, Nicolas was unable to make it so I presented solo. I hope I did his section justice.

HBaseCon was another fantastic conference this year! It’s a great resource for information about and around HBase, no matter where you are along your path. This year I presented a talk along with a colleague of mine, Nicolas Liochon of Scaled Risk fame. Our topic: HBase as an online, low-latency system.
The HBase BlockCache is an important structure for enabling low latency reads. As of HBase 0.96.0, there are no fewer than three different BlockCache implementations to choose from. But how do you know when to use one over the other? There’s a little bit of guidance floating around out there, but nothing concrete. It’s high time the HBase community changed that! I did some benchmarking of these implementations, and I’d like to share the results with you here.
Note that this is my second post on the BlockCache. In my previous post, I provide an overview of the BlockCache in general as well as brief details about each of the implementations. I’ll assume you’ve read that one already.
Edit: The sequel post, BlockCache Showdown, is now available!
HBase is a distributed database built around the core concepts of an ordered write log and a log-structured merge tree. As with any database, optimized I/O is a critical concern for HBase. When possible, the priority is to not perform any I/O at all. This means that memory utilization and caching structures are of utmost importance. To this end, HBase maintains two cache structures: the “memory store” and the “block cache”. The memory store, implemented as the MemStore, accumulates data edits as they’re received, buffering them in memory. The block cache, an implementation of the BlockCache interface, keeps data blocks resident in memory after they’re read.
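For reference, the block cache is steered from hbase-site.xml. A minimal sketch, assuming an HBase 0.96-era deployment — these property names are real, but the example values (and the units of the BucketCache size) are illustrative and vary by version, so check your release’s documentation:

```xml
<!-- hbase-site.xml -->
<!-- Fraction of the JVM heap given to the default on-heap LruBlockCache. -->
<property>
  <name>hfile.block.cache.size</name>
  <value>0.4</value>
</property>
<!-- To try one of the alternative implementations, enable the BucketCache. -->
<!-- "offheap" is one option; it also requires a MaxDirectMemorySize JVM flag. -->
<property>
  <name>hbase.bucketcache.ioengine</name>
  <value>offheap</value>
</property>
<property>
  <name>hbase.bucketcache.size</name>
  <value>4096</value>
</property>
```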
This is the second of two posts examining the use of Hive for interaction with HBase tables. This is a hands-on exploration so the first post isn’t required reading for consuming this one. Still, it might be good context.
“Nick!” you exclaim, “that first post had too many words and I don’t care about JIRA tickets. Show me how I use this thing!”
This post is exactly that: a concrete, end-to-end example of consuming HBase over Hive. The whole mess was tested to work on a tiny little 5-node cluster running HDP-1.3.2, which means Hive 0.11.0 and HBase 0.94.6.1.
This is the first of two posts examining the use of Hive for interaction with HBase tables. The second post is now available.
One of the things I’m frequently asked about is how to use HBase from Apache Hive. Not just how to do it, but what works, how well it works, and how to make good use of it. I’ve done a bit of research in this area, so hopefully this will be useful to someone besides myself. This is a topic that we did not get to cover in HBase in Action; perhaps these notes will become the basis for the 2nd edition ;) These notes are applicable to Hive 0.11.x used in conjunction with HBase 0.94.x. They should be largely applicable to 0.12.x + 0.96.x, though I haven’t tested everything yet.
I spent last week in NYC at this year’s Strata+Hadoop World, where I was invited to speak. The title of this talk is the same as the talk I gave at the Big Data Deep Dive in May, but the content received a thorough overhaul. Thanks to all the attendees and friends who gave me great advice on this first go-around. Hopefully the improvements were helpful.

Wow, what a busy summer. In addition to Hadoop Summit, HBaseCon, and a little holiday, I managed to squeeze the foundation patches for a client-managed data type API into HBase 0.95.2. I also received word that my proposal to speak at Strata/NYC was accepted!
My work on adding data types to HBase has come along far enough that ambiguities in the conversation are finally starting to shake out. These were issues I’d hoped to address through initial design documentation and a draft specification. Unfortunately, it’s not until there’s real code implemented that the finer points are addressed concretely. I’d like to take a step back from the code for a moment to initiate the conversation again and hopefully clarify some points about how I’ve approached this new feature.
Edit: this entry has been cross-posted onto the Apache HBase blog. You might find more comments and discussion over there.
I find Cascalog’s choice of name for the lazy-generator to be a bit of a misnomer. That is, it’s not actually lazy! The lazy-generator entirely consumes your lazy-seq into a temporary tap. This necessary inconvenience results in a convenient side-effect, however.
I had the honor of presenting to a full house at FOSS4G-NA 2013 this May. This is a rough transcript of that presentation. Just like my talk at the Big Data Deep Dive, no recording was made, as far as I’m aware. So just like that transcript, this is a recitation from memory.
The deck is available on slideshare, and embedded at the bottom of the post.
In case you haven’t heard, Hadoop2 is on the way! There are loads more new features than I can begin to enumerate, including lots of interesting enhancements to HDFS for online applications like HBase. One of the most anticipated new features is YARN, an entirely new way to think about deploying applications across your Hadoop cluster. It’s easy to think of YARN as the infrastructure necessary to turn Hadoop into a cloud-like runtime for deploying and scaling data-centric applications. Early examples of such applications are rare, but two noteworthy examples are Knitting Boar and Storm on YARN. Hadoop2 will also ship a MapReduce implementation built on top of YARN that is binary compatible with applications written for MapReduce on Hadoop-1.x.
The HBase project is raring to get onto this new platform as well. Hadoop2 will be a fully supported deployment environment for the HBase 0.96 release. There are still lots of bugs to squish and the build lights aren’t green yet. That’s where you come in!
I was invited to speak at the Seattle Technical Forum’s first “Big Data Deep Dive”. The event was very well organized and all three presentations dove-tailed into each other quite well. No recording was made of the event, so this is a transcription of my talk based on notes and memory.
The deck is available on slideshare, and embedded at the bottom of the post.
Aside from re-skinning the place, I’ve been pretty quiet here lately. I’m busy working on my type system experiment (HBASE-8089) and simplifying interoperability between HBase and Pig (PIG-2786, PIG-3285), Hive (HIVE-2055, HIVE-2379), and HCatalog (HCAT-621). I’m also preparing some talks for later next month. The first one will be here in Seattle (Bellevue, really) and the second in Minneapolis. If you’re able to make either one, do step up and introduce yourself.

You use git and have a Dropbox account, right? Here’s a little trick I use from time to time for archiving Git repositories. Create a bare repository in your Dropbox account and push a mirror. Now you can delete your local sandbox, but you’ll still have the full history available if you need it later. Sure, you could set up private repos on Github, but that’ll become expensive fast, while Dropbox is free, at least from the beginning.
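In case that’s too hand-wavy, here’s a sketch of the whole round trip. The temp directories stand in for your Dropbox folder and working sandbox, so substitute your real paths (e.g. ~/Dropbox/archive):

```shell
# Stand-ins for your Dropbox folder and working sandbox; substitute real paths.
DROPBOX="$(mktemp -d)"
WORK="$(mktemp -d)"

# A throwaway repository with one commit, playing the part of your project.
cd "$WORK"
git init -q .
git -c user.name=me -c user.email=me@example.com commit -q --allow-empty -m "initial"

# 1. Create a bare repository inside the Dropbox folder.
git init -q --bare "$DROPBOX/myproject.git"

# 2. Push a full mirror: every branch, tag, and ref.
git remote add dropbox "$DROPBOX/myproject.git"
git push -q --mirror dropbox

# 3. Delete the sandbox whenever; later, resurrect it from the archive.
git clone -q "$DROPBOX/myproject.git" "$WORK/restored"
git -C "$WORK/restored" log --oneline   # the full history is intact
```

Because `--mirror` pushes every ref, the archive is a complete copy, not just the current branch; Dropbox then syncs the bare repository like any other folder.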
Yesterday I spoke at this month’s Seattle Scalability Meetup. My topic didn’t deviate too far from what was originally posted. Here are the slides. If you were able to join us yesterday, please take a moment to leave some feedback.
With Posterous shutting their doors, I’m finally motivated to reexamine the web space I don’t really maintain. The whole point of choosing Posterous was to have a minimal barrier to posting. To that end, the string of short-text-plus-images posts proves the format effective. In search of a replacement, I’m not excited about anything I’ve found. However, since finishing the book, I have a number of ideas and half-written pieces to share. So, it’s time to make something work.
HBasecon2012, the first of its kind, happened on Wednesday. I had the honor of presenting a lightning talk at the end of one of the Applications tracks. I shared a little of what I’ve learned over the last couple months in the new-to-me domain of GIS. I think the talk went well, despite my nerves, because I had many good questions from the audience. I look forward to continuing the work and providing more details the next time around.
Seattle Log: the 18th day of January in this year of our Lord, two-thousand and twelve.
It just happens I took a photo of yesterday’s blue sky. Today is quite the contrast.
If a window is smashed in the night and no one is awake to hear it, does it make a sound?
I just returned from a couple weeks of work in San Francisco. On the way down, I snapped some shots as we were departing SEA. Through the magic of white-balance correction I’ve managed to pull out a few of the nicer ones; I’m pleased with the results. Enjoy.
Salmon and asparagus beside clams and leeks drowned in Chardonnay. The asparagus ended up a little over-cooked but the seafood was perfect. I now need to make good on 1.5c of delicious clam juice. Yum!
Yesterday I was gifted these lovely flowers, fresh from Pike’s Market. Lacking any kind of vase, I found a use for the gurgling pot.
This project has been a long time coming. Last week I purchased a 2009 Yamaha FZ6R from a friend of a friend. While technically not my first bike, it’s the first one I’ve ridden (as opposed to worked on). I went for my first ride on Saturday and stopped by a friend’s house along the way. He’s also an amateur photographer and kindly snapped a few shots for me.
A recent post in Read Write Web calls out a short and sweet quote which hits close to home:
For when I can’t find this later:
I had the good fortune of being invited out for Drum and Bass night on Saturday – who knew Temple Billiards has a basement AND a weekly show with a thriving community? There was a lot more nuance to the beats than I expected though it could have been louder. The people were really friendly and very into it. No one seemed to mind me using my camera and I enjoyed playing in the low-light.
Great live show. Ravenous energy consumed the crowd. Authentic sound.
I start a new consulting position on Wednesday. This employer requires that I submit to a background check, including providing my last 7 years of residences. I had most of the information compiled from a previous bureaucratic encounter, so I just had to add a couple more addresses.
Here’s a couple charts from my new toy. I really like having access to the statistics regarding my runs. With the ability to evaluate past performance and make calculated changes to improve future performance, I might consider calling this “training”.
Only cook the pasta to 2/3 done. Pull and place in a bowl of room-temp water. By the time you’ve cooked all your pasta, they’ll have absorbed enough additional liquid to be “cooked”. Plus, they’re far less likely to rip while handling. Plus, they won’t stick together between the pot and assembly.
Three hours until my YC interview and all I can think to do is write a blog post. As you well know, I don’t really blog. This morning, however, I’m compelled to put thoughts to ether. Thoughts surrounding the sequence of decisions which have led me to this spot: sitting in a hotel room in the Silicon Valley, wearing a waffled hotel robe, drinking pretty decent hotel coffee, preparing to go in front of people upon whom I’ve had an internet crush for the last 6 years and compel them to give me - over all the other people they will interview today - a boost in starting my company.
Turns out the Fit Sport comes stock with some nice rims which all the kids are boosting these days. Guy at the dealership says the only reason they had the wheels in stock is because they see 4-5 of these a month.
VMWare buys Spring? I’m quite sure the sky is falling. Who’d have thought Spring would be worth $400+ mil?
Word is, Posterous is the shite. Here’s to learning. Also, the option of expressing myself to the tubes in more than 140 characters is new and exciting.
As if I needed to rub it in, here’s a shot from my back porch. And my family wonders why I won’t move back to the Midwest, HA!


