I find Cascalog’s choice of name for the lazy-generator to be a bit of a misnomer. That is, it’s not actually lazy! The lazy-generator entirely consumes your lazy-seq into a temporary tap. This necessary inconvenience has a convenient side effect, however.

Clojure’s lazy-seq, as a producer of values, cannot be serialized. When you use a lazy-seq directly as a cascalog generator, cascalog side-steps this problem by realizing the entire seq into memory. That value is then serialized like any other literal value. This happens where the lazy-seq-backed generator is defined, in the process executing your application’s -main. In MapReduce terms, this is the JVM containing the JobClient instance. As you might expect, when your realized seq is large, this leads to runtime problems. For one, you can blow the process’s heap. Even if you have enough RAM, a more subtle issue can manifest as jobconf size exceptions.
It’s also worth noting that all this serialization business is happening outside of the MapReduce processing pipeline. That is, there’s no Mapper or Reducer executing while this goes on. This can be confusing as it looks like your job isn’t doing anything while this work is performed.
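To make the problem concrete, here’s a minimal sketch of the direct approach. The seq of number tuples and the query are hypothetical; the point is simply that the seq itself is used as the generator.

```clojure
(use 'cascalog.api)

;; A hypothetical lazy-seq of 1-tuples. Nothing is realized yet.
(def numbers (map vector (range 1000000)))

(defn -main [& args]
  ;; Using the seq directly as a generator. When this query is planned,
  ;; the entire seq is realized in this JVM (the one holding the
  ;; JobClient) and serialized as a literal, which is where the heap
  ;; blowups and jobconf size exceptions come from.
  (?<- (stdout)
       [?n]
       (numbers ?n)
       (< ?n 10)))
```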
Enter the lazy-generator. This handy bit of code transforms a lazy-seq into an hfs-backed tap. Just like consuming the lazy-seq as a cascalog generator directly, this consumes the entire seq! Instead of realizing the whole seq into memory, however, the seq is serialized into an anonymous tap. This process is generally slower than consuming the lazy-seq directly because of the additional IO involved. Again, this code runs in the JVM executing your application’s -main, outside of any Map or Reduce step. Et voilà, no more memory problems or mysterious jobconf size exceptions.
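Here’s the same sketch rewritten around lazy-generator. The details below reflect my reading of the Cascalog 1.x API and should be double-checked against your version: I’m assuming lazy-generator is exported from cascalog.api, that it takes a temporary path plus the seq, and that with-fs-tmp (which manages a temporary HDFS directory) lives in cascalog.io.

```clojure
(use 'cascalog.api)
(use '[cascalog.io :only [with-fs-tmp]])  ;; assumed namespace; see note above

(def numbers (map vector (range 1000000)))

(defn -main [& args]
  ;; with-fs-tmp supplies a temporary HDFS path and cleans it up afterwards.
  (with-fs-tmp [_ tmp-dir]
    ;; lazy-generator walks the seq once, writing its tuples out under
    ;; tmp-dir, and returns an hfs-backed tap to use as the generator.
    (let [gen (lazy-generator tmp-dir numbers)]
      (?<- (stdout)
           [?n]
           (gen ?n)
           (< ?n 10)))))
```

The query itself is unchanged; only the generator is swapped, and the seq is still consumed exactly once, just into a tap instead of into memory.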
Wrapping your lazy-seq in a lazy-generator is a little inconvenient, but it has another benefit, especially when your seq is large. Because the data is read from HDFS, Hadoop’s InputSplit logic is applied. No longer is your lazy-seq-backed step limited to a single task process! MapReduce will split the data in the usual way and spread the work of processing around the cluster.