Friday 21 December 2007 @10:00
Thanks to changedetection.com, I noticed this morning that our RNGzip paper went up on the site of Academy Publisher — JCP volume 2, number 10. JCP is open-access, so the full text can be downloaded by anyone.
I also decided to post my own full-text version of the graph-theory paper to appear in Congressus Numerantium. I believe that’s acceptable according to the copyright terms, but I suppose I ought to read the fine print. See my publications page.
One other paper (on CS education) is under review right now, and there’s a draft (on types in compilation) that’s nearly ready for submission. I hope to hear about the former and work on the latter over winter break.
Monday 9 April 2007 @22:44
For my rngzip XML compression tool, written in Java, I wanted to incorporate LZMA, a fast compressor based on Lempel-Ziv (like gzip), but with better compression ratios and the same or better speed. Apparently 7-Zip is somewhat popular in the Microsoft world, but has not made much impact on Unix yet, perhaps because the original implementation was extremely dependent on Microsoft APIs. That seems to be changing: ‘lzma’ and ‘p7zip-full’ packages are available in the new Debian etch release.
Anyway, there is a Java implementation of LZMA, but it does not use the standard FilterOutputStream approach of the java.util.zip package — closely followed for implementations of BZip and PPM-based arithmetic coders. And it’s not nearly as simple as changing around some method names.
No, the LZMA encoder wants to pull bytes from an InputStream and push the compressed representation to an OutputStream. In contrast, the FilterOutputStream approach is a ‘push’ model — the caller remains in control, pushing bytes to the encoder as it pleases. I had seen this contrast before, for example in XML parsing itself. The Simple API for XML (SAX) is a ‘push’ technique, which is why writing SAX event handlers is such a pain. I can hardly fault the author of the LZMA encoder for wanting to structure the code this way. It makes it much simpler, and if all you’re doing is compressing existing files, you don’t need a stream processor.
Of course, the programming languages geek in me recognizes this dilemma as a need for coroutines! The push/pull controversy is really about which module wants to be in control. With a single thread of control, all but one of the cooperating modules must be turned inside-out, written in a passive structure. Sometimes this isn’t too hard, and sometimes it’s monstrous.
With coroutines (or continuations, which can implement them), all of the modules can be written as though they maintain control, simply ‘yielding’ results to each other across the divide. Java doesn’t have coroutines or continuations, but it does have multi-threading. So, once I recognized what was needed, in short order I whipped up an LzmaOutputStream (and LzmaInputStream) that spawns the Encoder (Decoder) in a separate thread and passes bytes to it using a blocking queue. Classic producer/consumer concurrency, which should delight my students in Operating Systems.
I’ll make the code available here at some point, but if you’re looking and can’t find it yet, let me know! Efficiency is horrible at the moment, because I’m currently using ArrayBlockingQueue<Integer> from the Java library. This means, I presume, that each byte dropped into the queue is allocated as a new object! Might have to implement my own blocking queue with semaphores and a lighter-weight int[] representation.
Update (18 April 2007): code is now available here, and I ditched the generic ArrayBlockingQueue.
Monday 24 April 2006 @16:20
Each spring, our campus has an event called “discovery day,” where faculty and students get together in one space to exhibit their latest research and scholarship, using posters and demonstrations. In some ways, a presentation at an event like this is necessarily shallow; it just isn’t possible to appreciate one another’s contributions without a lot of background in the field. Sometimes this is even hard with different sub-disciplines within a computer science department, to say nothing of exhibiting to your colleagues in chemistry, media arts, or political science. Nevertheless, I think it’s a great exercise. I’m reminded of Feynman’s sentiment (paraphrased): “If I can’t teach a freshman lecture on it, it means I don’t really understand it.”
Last year, I presented my work on MetaOCaml Server Pages, which at that point was under consideration for a journal (now accepted). This year, I wanted to exhibit some work that is more preliminary. In fact, this is the first public mention of the result. My poster is called Type-based compression of XML streams (1.4M, PDF). [If you choose to print it, be careful to specify shrink-to-fit-page, because the PDF is formatted for the full size 40×30ʺ poster. It should still be readable shrunk to a letter-size page.]
The idea is basically to use information from the XML document type to encode the tree structure extremely compactly. Depending on the nature of the document type, many tags will take no space at all in the compressed form, and others are represented in just a few bits. This is essentially similar to how Amme, et al. encode their SafeTSA intermediate language.
In fact, my goal for this work is to be able to use XML libraries and tools for compiler intermediate representations. Other than that application area, I was never especially interested in XML. To me, it seemed that S-expressions were an improvement on XML, and they were around in the 1950s. (Lately I’m finding many advances in computing seem to be summed up by the phrase “same sh*t, different buzzword.”)
Our results are still very preliminary, and I’m probably not ready to release the code just yet. But as it is implemented now, our technique does extremely well on very ‘taggy’ XML documents, and is still somewhat reasonable (though not always the winner) on more ‘texty’ XML documents. Our best-case examples are XML results from the National Center for Biotechnology Information at NIH. I expect compiler intermediate representations to be extremely taggy too, but more on that as things develop.
One important advantage of our technique (over just gzipped XML, for example) is that I can generate SAX events directly from the compressed tree representation. This should prove to be much faster than uncompressing to text, and then re-parsing the XML.