10 July 2007

compression = artificial intelligence?: the Hutter Prize

From Slashdot: Alexander Ratushnyak has won the second Hutter Prize. This is a prize for compressing the first 100 megabytes of Wikipedia to as small a size as possible. The size that he achieved is 16,481,655 bytes; in March of 2006, before this prize was announced, the best known result was 18,324,887 bytes.
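For scale, that is roughly a ten percent reduction relative to the 2006 figure; a quick check of the arithmetic on the two sizes quoted above (a minimal Python sketch, nothing more):

    # Relative improvement of the winning entry over the March 2006 figure quoted above.
    baseline = 18_324_887   # bytes, best known result in March 2006
    winner = 16_481_655     # bytes, Ratushnyak's winning entry

    print(f"{(baseline - winner) / baseline:.1%} smaller")   # about 10.1%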

Hutter claims that "being able to compress well is closely related to acting intelligently" -- this is based on the thesis that the way to compress data well is to have a good idea of what's coming next, and the way to do that is to understand the structure of the data. So far, so good. But does it follow that understanding the structure of the data requires intelligence on the part of the compression algorithm? I don't think so. The MP3 algorithm, for example, compresses music mostly by relying on psychoacoustic facts about how humans perceive sound; designing it required intelligence on the part of its inventors -- but I wouldn't ascribe intelligence to the algorithm itself.
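To make the prediction-compression link concrete: under any probabilistic model, an ideal arithmetic coder can encode a symbol of probability p in about log2(1/p) bits, so a model that predicts the text better yields a shorter encoding. A minimal sketch -- the sample text and the two toy models below are mine, purely for illustration:

    import math
    from collections import Counter

    def codelength_bits(text, prob):
        """Ideal code length, in bits, if each character c costs -log2 prob(c)."""
        return sum(-math.log2(prob(c)) for c in text)

    text = "the quick brown fox jumps over the lazy dog " * 20

    # Model 1: no prediction at all -- every one of 256 byte values equally likely.
    uniform = lambda c: 1 / 256

    # Model 2: an order-0 model that knows this text's character frequencies.
    counts = Counter(text)
    frequency = lambda c: counts[c] / len(text)

    print(f"uniform model:   {codelength_bits(text, uniform):7.0f} bits")
    print(f"frequency model: {codelength_bits(text, frequency):7.0f} bits")
    # The better the model predicts the next character, the shorter the ideal encoding.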

Supposedly there is a proof that "ideal lossless compression" implies passing the Turing test; the e-mail there by Matt Mahoney states:
The second problem remains: ideal lossy compression does not imply passing the Turing test. For lossless compression, it can be proven that it does. Let p(s) be the (unknown) probability that s will be the prefix of a text dialog. Then a machine that can compute p(s) exactly is able to generate response A to question Q with the distribution p(QA)/p(Q) which is indistinguishable from human. The same model minimizes the compressed size, E[log 1/p(s)].
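To spell out the sampling step in that argument: a model that knows p(s) for every text prefix s can answer a question Q by drawing a continuation A with probability p(QA)/p(Q), and the same p gives the shortest expected code length E[log 1/p(s)]. A toy sketch, with a hand-made table standing in for the (unknown) true distribution -- the prefixes and probabilities below are invented for illustration:

    import random

    # Invented stand-in for the (unknown) distribution p(s) over text prefixes.
    p = {
        "Q: what is 2+2? A:": 0.10,
        "Q: what is 2+2? A: 4": 0.08,
        "Q: what is 2+2? A: 5": 0.01,
        "Q: what is 2+2? A: four": 0.01,
    }

    def answer(question, candidates):
        """Sample a continuation A with probability p(QA)/p(Q), as in the quoted argument."""
        weights = [p[question + a] / p[question] for a in candidates]
        return random.choices(candidates, weights=weights, k=1)[0]

    print(answer("Q: what is 2+2? A:", [" 4", " 5", " four"]))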

Mahoney's argument may well be correct -- I'm not competent to evaluate the proof. But it does not follow that a compression method which falls 1% short of ideal lossless compression is 1% away from passing the Turing test. And we don't know what the ideal even is! It is possible to estimate the entropy of natural language, but the estimates only seem to be good to about one decimal place.
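To illustrate why those estimates are rough: simple character-level models only give upper bounds on the entropy rate, and the bound keeps dropping as the model's context grows, with no obvious point of convergence. (Shannon's guessing experiments put printed English somewhere around 0.6 to 1.3 bits per character.) A rough sketch -- "sample.txt" is a placeholder for any large English text file:

    import math
    from collections import Counter

    def entropy_bound_bits_per_char(text, k):
        """Empirical entropy of the next character given the previous k characters.
        This is only an upper bound on the true entropy rate of the source."""
        contexts = Counter(text[i:i + k] for i in range(len(text) - k))
        extended = Counter(text[i:i + k + 1] for i in range(len(text) - k))
        total = sum(extended.values())
        return sum(-(n / total) * math.log2(n / contexts[s[:k]])
                   for s, n in extended.items())

    sample = open("sample.txt", encoding="utf-8").read()   # placeholder: any large English text
    for k in range(4):
        print(f"order-{k} bound: {entropy_bound_bits_per_char(sample, k):.2f} bits/char")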

Dan has pointed out, at my post secret messages in human DNA?, that general-purpose compressors don't work on all data. If you gzip the human genome, it gets bigger. In fact, it's not possible to write a program that will losslessly compress all input strings, for the following reason: suppose we could write a program that compresses every string of length up to N bits down to at most N-1 bits. There are 2^(N+1) - 1 strings of length up to N bits (counting the empty string), but only 2^N - 1 strings of length N-1 bits or less; so two different inputs would have to compress to the same output, and we wouldn't be able to properly uncompress the data.
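Here is the counting argument in miniature, together with the practical symptom: feed a general-purpose compressor data it has no model for (random bytes in this sketch, with zlib standing in for gzip) and the output comes out larger than the input, because the format still has to pay its own bookkeeping costs.

    import os
    import zlib

    # Pigeonhole count from the argument above.
    N = 10
    print(2 ** (N + 1) - 1, "bit strings of length <=", N)
    print(2 ** N - 1, "bit strings of length <=", N - 1)

    # Practical symptom: incompressible (random) input gets bigger, not smaller.
    data = os.urandom(100_000)
    compressed = zlib.compress(data, 9)
    print(len(data), "bytes in ->", len(compressed), "bytes out")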

1 comment:

dan said...

The conference I am currently attending included several papers yesterday about the provable limits of common compression algorithms, such as the bzip algorithm.

You might be interested in learning some Kolmogorov complexity. The book by Li and Vitanyi is supposed to be very readable. (Some day, I'll have time...)