Monday 9 April 2007 @22:44
For my rngzip XML compression tool, written in Java, I wanted to incorporate LZMA, a fast compressor based on Lempel-Ziv (like gzip), but with better compression ratios and the same or better speed. Apparently 7-Zip is somewhat popular in the Microsoft world, but has not made much impact on Unix yet, perhaps because the original implementation was extremely dependent on Microsoft APIs. That seems to be changing: ‘lzma’ and ‘p7zip-full’ packages are available in the new Debian etch release.
Anyway, there is a Java implementation of LZMA, but it does not use the standard FilterOutputStream approach of the java.util.zip package — closely followed for implementations of BZip and PPM-based arithmetic coders. And it’s not nearly as simple as changing around some method names.
No, the LZMA encoder wants to pull bytes from an InputStream and push the compressed representation to an OutputStream. In contrast, the FilterOutputStream approach is a ‘push’ model — the caller remains in control, pushing bytes to the encoder as it pleases. I had seen this contrast before, for example in XML parsing itself. The Simple API for XML (SAX) is a ‘push’ technique, which is why writing SAX event handlers is such a pain. I can hardly fault the author of the LZMA encoder for wanting to structure the code this way. It makes it much simpler, and if all you’re doing is compressing existing files, you don’t need a stream processor.
Of course, the programming languages geek in me recognizes this dilemma as a need for coroutines! The push/pull controversy is really about which module wants to be in control. With a single thread of control, all but one of the cooperating modules must be turned inside-out, written in a passive structure. Sometimes this isn’t too hard, and sometimes it’s monstrous.
With coroutines (or continuations, which can implement them), all of the modules can be written as though they maintain control, simply ‘yielding’ results to each other across the divide. Java doesn’t have coroutines or continuations, but it does have multi-threading. So, once I recognized what was needed, in short order I whipped up an LzmaOutputStream (and LzmaInputStream) that spawns the Encoder (Decoder) in a separate thread and passes bytes to it using a blocking queue. Classic producer/consumer concurrency, which should delight my students in Operating Systems.
I’ll make the code available here at some point, but if you’re looking and can’t find it yet, let me know! Efficiency is horrible at the moment, because I’m currently using ArrayBlockingQueue<Integer> from the Java library. This means, I presume, that each byte dropped into the queue is allocated as a new object! Might have to implement my own blocking queue with semaphores and a lighter-weight int[] representation.
Update (18 April 2007): code is now available here, and I ditched the generic ArrayBlockingQueue.
Saturday 7 April 2007 @17:27
Ugh, again I fight this battle. While hacking on my XML compression tool ‘rngzip’, I’m subclassing Java’s input/output stream hierarchy. The read and write methods use int instead of byte so that we may use -1 to represent end-of-file.
But surprisingly, Java bytes are signed. There is no way to specify unsigned numbers, and the primitive casts automatically do sign extension So when you cast a byte containing 0xFF into an integer you get 0xFFFFFFFF, which is -1. This can cause a great many bugs, some of them not apparent until you’re processing a binary stream with bytes equal to 0xFF.
The decision to use int here is questionable too, but Java has no lightweight way to specify ‘byte option’.
Sunday 19 November 2006 @8:36
The byte type in Java is, strangely, signed. So bytes range from –127 to +128. So you cannot write byte b = 0xCA directly (because 202 > 128) but the (byte) cast does sign extension.
The problem I’m fighting with now is with input/output streams. The OutputStream class has two methods that take byte arrays, but the abstract method write that takes an int. With the InputStream, the abstract read returns an int mainly so that it can return –1 for end-of-file, so I guess that write uses an int for consistency.
But here’s where it gets really weird. Normally converting between byte and int does sign extension, but apparently that’s not what you want here… or is it? The specification of void OutputStream.write(int):
Writes the specified byte to this output stream. The general contract for write is that one byte is written to the output stream. The byte to be written is the eight low-order bits of the argument b. The 24 high-order bits of b are ignored.
and of int InputStream.read():
Reads the next byte of data from the input stream. The value byte is returned as an int in the range 0 to 255. If no byte is available because the end of the stream has been reached, the value -1 is returned. This method blocks until input data is available, the end of the stream is detected, or an exception is thrown.
Got it? In both methods, just the lower 8 bits are used, except in the case where read() returns –1 for EOF. But here’s the thing—OutputStream violates its own contract:
import java.io.*;
class StreamTest extends OutputStream {
public void write(int b) throws IOException {
System.out.println("Writing int " + b);
}
public static void main(String[] args) throws IOException {
byte[] bs = { (byte)0xCA, (byte)0xFE,
(byte)07F, (byte)032 };
new StreamTest().write(bs);
}
}
The above code sets up a byte array, and passes it to OutputStream.write(byte[]), which in turn iterates and calls my overridden write(int). But in converting byte to int, it does the normal sign-extension widening, so that it’s not just the lower 8 bits I need to pay attention to in write, it’s the lower 7 bits and the sign. (Update: oops, I overlooked that the representations are the same thanks to two’s complement!)
Arguably, write doesn’t need to be as picky about the sign as read, since read uses –1 to indicate EOF. Indeed, if I try:
import java.io.*;
class ReadTest extends InputStream {
private byte[] bs = { (byte)0xCA, (byte)040,
(byte)0xFF, (byte)032 };
private int i = 0;
public int read() throws IOException {
return bs[i++];
}
public static void main(String[] args) throws IOException {
byte[] buf = new byte[4];
new ReadTest().read(buf);
for( int i=0; i<4; i++ ) System.out.println(buf[i]);
}
}
then it outputs –54 64 0 0, because the 0xFF = –1 is interpreted as EOF. The read method needs some sign-inverting logic like: bs[i] > 0? bs[i] : bs[i]+256
This all started with some 3rd-party library code I’m trying to use, for arithmetic encoding (compression) using prediction by partial match. They defined filtering input and output streams that seem to work okay in tests, but fail miserably as soon as I wrap them in a DataOutputStream. It turned out that they were doing sign-inversions — not in the way I showed above — but in a weird place. As long as you write bytes as an array and then read them as an array, it happened to work. But if you write an array and then read one at a time, the conversions don’t work out. They also seemed to take the specification of write(int) at face value, just assuming that the parameter would be non-negative.
What a mess!
Tuesday 23 May 2006 @23:57
I’m toying with the newish JUnit version 4 for setting up test cases for my current Java project. I had tried the 3.x series, and didn’t find the experience quite as satisfying. Setting up parameterized suites with the TestSuite class, and having to rely on method naming conventions seemed… well… kinda ugly.
JUnit 4 is a radically different API, and it depends on lots of newish features in the Java language, including annotations on classes and methods. I really knew nothing about this feature until I saw it used in JUnit, and I still know very little. I’m a language nerd, but not a Java nerd, I guess. But it may be worth exploring further.
Anyway, I need to be able to pull test cases from files within a directory, so the Parameterized ‘runner’ is just the thing. You designate one method to produce all the test cases as a Collection, and then JUnit instantiates your class with arguments taken from that collection.
Unfortunately, when a test fails, the output leaves a lot to be desired. It prints the name of the class, the name of the method, but as far as which test case failed… they’re just numbered serially, like this:
There were 4 failures:
1) tryFib[1](FibTest)
java.lang.AssertionError: expected:<1> but was:<0>
2) tryFib[2](FibTest)
java.lang.AssertionError: expected:<1> but was:<0>
3) tryFib[3](FibTest)
java.lang.AssertionError: expected:<2> but was:<0>
4) tryFib[4](FibTest)
java.lang.AssertionError: expected:<3> but was:<0>
How are you supposed to know which test actually failed? As someone else pointed out, there really needs to be a way to annotate a name or description for each case.
Until then, here’s my silly work-around: just wrap your whole test method in a try/catch block, and rethrow the exception with a new message containing a description of the test case. I use the toString() method to produce that description. Complete example follows.
Here is the failure output now:
There were 4 failures:
1) tryFib[1](FibTest)
java.lang.Error: fib(1): expected:<1> but was:<0>
2) tryFib[2](FibTest)
java.lang.Error: fib(2): expected:<1> but was:<0>
3) tryFib[3](FibTest)
java.lang.Error: fib(3): expected:<2> but was:<0>
4) tryFib[4](FibTest)
java.lang.Error: fib(4): expected:<3> but was:<0>
The differences are highlighted. I guess maybe this isn’t such a convincing example, since the test description—fib(2)—is not sufficiently different from the serial number (2). In my application, test cases will be described by the input file name.
Sunday 21 May 2006 @19:00
I’m working on a rather sophisticated program in Java, where I ended up building on lots of libraries and other programs. Java is never my first language choice, but with the new features in 5.0 (or 1.5 or Java 2 or whatever the hell they’re calling it)—generics, for-each loop, auto-boxing, enums, assertions, variadic methods—I find it almost usable. And it is fairly easy to bring in external libraries: download the jar file, set your class path, browse the javadoc site, and you’re on your way.
The system is not ready for release yet, but I did start contemplating the licensing terms today. Because of all the libraries I’m depending on, it’s kind of complicated. I’m a big advocate of free software. For end users, the license doesn’t matter very much; any of the ‘open source’ ones will do. For lone developers, you just choose one that matches your goals. For me, that’s GNU GPL or LGPL, depending on the situation.
Unfortunately, if you’re building a system that incorporates the work of many other people, it gets complicated. Here are the libraries and programs I’m embedding… so far:
- org.kohsuke.bali — BSD-new
- com.colloquial.arithcode — BSD-new
- org.apache.tools.bzip2 — Apache 1.1
- xerces — Apache 1.1
- junit — Common Public License 1.0
- gnu.bytecode — GPL 2
- gnu.getopt — LGPL 2
So, under what license can I distribute my own code, which is combined with all of these systems? The main complication seems to be that the Apache and Common licenses are not strictly compatible with the GPL, even though they are meant to be, “in spirit.”
Some people would blame this on the GPL. It does seem to be the odd man out; if gnu.bytecode were released under Apache/BSD or even LGPL, then I could distribute the whole thing under Apache/BSD and be done with it. But I’m a big supporter of the GPL, and I can’t blame Per Bothner for selecting it for the bytecode library (part of Kawa, a Scheme-to-Java system, by the way).
I see a few options here:
1. Remove dependencies on the Apache/Common components, making them optional at build time, then release under GPL. End users can still grab and link those libraries, but I don’t have to redistribute them for a basic configuration.
2. Remove dependence on gnu.bytecode, perhaps using Byte-Code Engineering Library instead (it has an Apache license). Then release everything under Apache or BSD. Regardless of the relative merits of the two libraries, I’d hate to reject one just because it’s GPL.
3. Perhaps it’s possible to redistribute Apache/BSD-licensed software under the GPL, even though I’m not the copyright holder? That seems to be the opinion of Roy T. Fielding, of the Apache Software Foundation:
Whether or not [Apache and GPL] are considered compatible by the FSF is an opinion only they can make, but given that a derivative work consisting of both Apache Licensed code and GPL code can be distributed under the GPL (according to our opinion), there really isn’t anything to be discussed. — 24 Jan 2004
Option 3 would be ideal, if it turns out to be legal. I usually assume that I’m bound to redistribute under the same license the author used, but really it’s just the GPL requires that… ?