Discovery day poster: XML compression

Each spring, our campus has an event called “discovery day,” where faculty and students get together in one space to exhibit their latest research and scholarship, using posters and demonstrations. In some ways, a presentation at an event like this is necessarily shallow; it just isn’t possible to appreciate one another’s contributions without a lot of background in the field. Sometimes this is even hard with different sub-disciplines within a computer science department, to say nothing of exhibiting to your colleagues in chemistry, media arts, or political science. Nevertheless, I think it's a great exercise. I’m reminded of Feynman’s sentiment (paraphrased): “If I can’t teach a freshman lecture on it, it means I don’t really understand it.”

Last year, I presented my work on MetaOCaml Server Pages, which at that point was under consideration for a journal (now accepted). This year, I wanted to exhibit some work that is more preliminary. In fact, this is the first public mention of the result. My poster is called Type-based compression of XML streams (1.4M, PDF). [If you choose to print it, be careful to specify shrink-to-fit-page, because the PDF is formatted for the full size 40×30ʺ poster. It should still be readable shrunk to a letter-size page.]

The idea is basically to use information from the XML document type to encode the tree structure extremely compactly. Depending on the nature of the document type, many tags will take no space at all in the compressed form, and others are represented in just a few bits. This is essentially similar to how Amme, et al. encode their SafeTSA intermediate language.

In fact, my goal for this work is to be able to use XML libraries and tools for compiler intermediate representations. Other than that application area, I was never especially interested in XML. To me, it seemed that S-expressions were an improvement on XML, and they were around in the 1950s. (Lately I’m finding many advances in computing seem to be summed up by the phrase “same sh*t, different buzzword.”)

Our results are still very preliminary, and I’m probably not ready to release the code just yet. But as it is implemented now, our technique does extremely well on very ‘taggy’ XML documents, and is still somewhat reasonable (though not always the winner) on more ‘texty’ XML documents. Our best-case examples are XML results from the National Center for Biotechnology Information at NIH. I expect compiler intermediate representations to be extremely taggy too, but more on that as things develop.

One important advantage of our technique (over just gzipped XML, for example) is that I can generate SAX events directly from the compressed tree representation. This should prove to be much faster than uncompressing to text, and then re-parsing the XML.

©20022015 Christopher League