Pushmi-pullyu

For my rngzip XML compression tool, written in Java, I wanted to incorporate LZMA, a fast compressor based on Lempel-Ziv (like gzip), but with better compression ratios and the same or better speed. Apparently 7-Zip is somewhat popular in the Microsoft world, but has not made much impact on Unix yet, perhaps because the original implementation was extremely dependent on Microsoft APIs. That seems to be changing: ‘lzma’ and ‘p7zip-full’ packages are available in the new Debian etch release.

Anyway, there is a Java implementation of LZMA, but it does not use the standard FilterOutputStream approach of the java.util.zip package – closely followed for implementations of BZip and PPM-based arithmetic coders. And it's not nearly as simple as changing around some method names.

No, the LZMA encoder wants to pull bytes from an InputStream and push the compressed representation to an OutputStream. In contrast, the FilterOutputStream approach is a ‘push’ model – the caller remains in control, pushing bytes to the encoder as it pleases. I had seen this contrast before, for example in XML parsing itself. The Simple API for XML (SAX) is a ‘push’ technique, which is why writing SAX event handlers is such a pain. I can hardly fault the author of the LZMA encoder for wanting to structure the code this way. It makes it much simpler, and if all you're doing is compressing existing files, you don't need a stream processor.

Of course, the programming languages geek in me recognizes this dilemma as a need for coroutines! The push/pull controversy is really about which module wants to be in control. With a single thread of control, all but one of the cooperating modules must be turned inside-out, written in a passive structure. Sometimes this isn't too hard, and sometimes it's monstrous.

With coroutines (or continuations, which can implement them), all of the modules can be written as though they maintain control, simply ‘yielding’ results to each other across the divide. Java doesn't have coroutines or continuations, but it does have multi-threading. So, once I recognized what was needed, in short order I whipped up an LzmaOutputStream (and LzmaInputStream) that spawns the Encoder (Decoder) in a separate thread and passes bytes to it using a blocking queue. Classic producer/consumer concurrency, which should delight my students in Operating Systems.

I'll make the code available here at some point, but if you're looking and can't find it yet, let me know! Efficiency is horrible at the moment, because I'm currently using ArrayBlockingQueue<Integer> from the Java library. This means, I presume, that each byte dropped into the queue is allocated as a new object! Might have to implement my own blocking queue with semaphores and a lighter-weight int[] representation.

Update (18 April 2007): code is now available here, and I ditched the generic ArrayBlockingQueue.

©20022015 Christopher League