
10 September 2007

How many well-formed formulas are there?

The following is exercise 1.1.2 of "A Mathematical Introduction to Logic", by Herbert Enderton:
Show that there are no [well-formed formulas] of length 2, 3, or 6, but that any other positive length is possible.


Enderton defines a wff as follows: every sentence symbol A_n is a wff. If α and β are wffs, then so are (¬ α), (α ∧ β), (α ∨ β), (α → β), and (α ↔ β); furthermore, that's all of the wffs. (The parentheses count as symbols. The sentence symbol A_n counts as a single symbol.)

The problem is pretty straightforward. A, (A ∧ B), and (A ∧ (B ∧ C)) are wffs of length 1, 5, and 9 respectively. If α is a wff of length n, then (¬ α) is a wff of length n+3; thus we can generate wffs with lengths in the following sequences: {1, 4, 7, ...}, {5, 8, 11, ...}, {9, 12, 15, ...}. Together these cover every positive integer except 2, 3, and 6.

Now, each of the ways of forming new wffs from old takes one or two wffs, concatenates them, and adds three new symbols. So we can't get a wff of length 2 or 3, because it would have to come from a wff (or wffs) of total length -1 or 0. Furthermore, there are no wffs of length 6: such a wff would have to be either of the form (¬ α) where α has length 3, or of the form (α * β) where * is ∧, ∨, →, or ↔ and α and β have total length 3, which would force one of α and β to have length 2; and we have just seen that lengths 2 and 3 are impossible.
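For what it's worth, this is easy to check by brute force. Here is a small Python sketch (the cutoff of 50 is arbitrary): a length is achievable if it is 1, if it is 3 more than an achievable length, or if it is 3 more than the sum of two achievable lengths.

    # Mark which wff lengths are achievable, up to an arbitrary cutoff.
    N = 50
    achievable = [False] * (N + 1)
    achievable[1] = True          # a lone sentence symbol
    for n in range(2, N + 1):
        # (¬ α) is 3 symbols longer than α
        if n >= 4 and achievable[n - 3]:
            achievable[n] = True
        # (α * β) is 3 symbols longer than α and β combined
        elif any(achievable[a] and achievable[n - 3 - a] for a in range(1, n - 3)):
            achievable[n] = True
    print([n for n in range(1, N + 1) if not achievable[n]])   # prints [2, 3, 6]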

The natural follow-up question, to me at least, is "how many wffs are there?" This is silly to ask if we have infinitely many sentence symbols, because then there are of course infinitely many wffs of any length for which there is at least one wff. So let's restrict ourselves to one sentence symbol, A. Let f_n be the number of wffs of length n, and let f(z) = Σ_n f_n z^n be the generating function of this sequence. Then we have

f(z) = z + z^3 f(z) + 4z^3 f(z)^2

where the first term on the right-hand side corresponds to the formula which consists of just the sentence symbol, the second term to all those formulas of type (¬ α), and the third term to all those formulas of type (α * β). (The coefficient 4 comes up because there are four things that "*" could be.)
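As a sanity check (a quick sketch; the function name is mine), the coefficients can also be computed directly from the recurrence this equation encodes, namely f_n = f_{n-3} + 4 Σ_{a+b=n-3} f_a f_b with f_1 = 1:

    # f[n] = number of wffs of length n over the single sentence symbol A.
    def wff_counts(N):
        f = [0] * (N + 1)
        f[1] = 1                                   # the wff "A"
        for n in range(4, N + 1):
            f[n] = f[n - 3]                        # wffs of the form (¬ α)
            f[n] += 4 * sum(f[a] * f[n - 3 - a]    # wffs of the form (α * β)
                            for a in range(1, n - 3))
        return f

    print(wff_counts(15)[1:])
    # [1, 0, 0, 1, 4, 0, 1, 12, 32, 1, 24, 160, 321, 40, 480]

These numbers agree with the series expansion below.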

Solving this for f, we get

f(z) = \frac{1 - z^3 - \sqrt{1 - 2z^3 + z^6 - 16z^4}}{8z^3}
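(We take the root of the quadratic with the minus sign; the other branch blows up at z = 0.) If you'd rather not do the algebra by hand, here is a sketch using sympy that solves the quadratic and expands the analytic branch, which should reproduce the series below:

    import sympy as sp

    z, F = sp.symbols('z F')
    # Solve the quadratic 4 z^3 F^2 + (z^3 - 1) F + z = 0 for F = f(z),
    # keeping the branch that vanishes at z = 0.
    sols = sp.solve(sp.Eq(F, z + z**3 * F + 4 * z**3 * F**2), F)
    branch = next(s for s in sols if sp.limit(s, z, 0) == 0)
    print(sp.series(branch, z, 0, 16))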

and so we get

f(z) = z + z^4 + 4z^5 + z^7 + 12z^8 + 32z^9 + z^10 + 24z^11 + 160z^12 + 321z^13 + 40z^14 + 480z^15 + ...

which is a bit surprising; the coefficients don't grow "smoothly". This is because, for example, wffs of length 9 are of the form (A * (A * A)) or ((A * A) * A) where * is as above (and yes, these two forms are different), so there are 32 of these; the only wff of length 10 is (¬ (¬ (¬ A))), since any other such formula would come from two formulas with total length 7, and one of those would have to be of length 2, 3, or 6 which is impossible. But I'm more interested in the asymptotics.

It's a general principle that the coefficients of a generating function grow roughly like f_n ~ (1/ρ)^n, where ρ is the absolute value of the singularity of f(z) = Σ_n f_n z^n closest to the origin; this is true up to a "subexponential factor" which has to do with the nature of that singularity (polar, algebraic, etc.). The singularities of f(z) in this case arise from the expression under the square root approaching 0; thus they are located at the roots of the polynomial 1 - 2z^3 + z^6 - 16z^4. The root of smallest absolute value is about ρ = 0.4728339090, so the number of wffs of length n grows like (1/ρ)^n, or about 2.11^n, times some subexponential factor. If you compute coefficients further out, they do grow by a ratio of about 2.11.
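Here is a quick numerical check of ρ (a sketch using numpy; numpy.roots takes the coefficients from the highest degree down):

    import numpy as np

    # Roots of 1 - 2z^3 + z^6 - 16z^4, written as z^6 - 16z^4 - 2z^3 + 1.
    roots = np.roots([1, 0, -16, -2, 0, 0, 1])
    rho = min(abs(r) for r in roots)
    print(rho, 1 / rho)          # about 0.4728 and 2.1149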

The subexponential factor is probably something like n^{-3/2} -- that is, the number of wffs of length n is probably about 2.11^n n^{-3/2} -- because that particular subexponential factor often arises when one analyzes the asymptotics of tree-like structures. Don't ask me to explain that.

The interpretation this brings to mind is that if you write a wff by just writing symbols at random, at any given step there are in some sense "on average" 2.11 symbols that you could write down which would keep the wff grammatical. There are eight possible symbols, so this suggests that the best special-purpose compression algorithm for this formal language would shorten formulas to log(2.11)/log(8), or about 36%, of their original length.
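In case you want to redo that arithmetic (a back-of-the-envelope sketch, using the growth rate found above):

    import math

    rho = 0.4728339                         # dominant singularity from above
    bits_per_symbol = math.log2(1 / rho)    # about 1.08 bits of choice per symbol
    print(bits_per_symbol / math.log2(8))   # about 0.36, i.e. 36% of original length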

10 July 2007

compression = artificial intelligence?: the Hutter Prize

From Slashdot: Alexander Ratushnyak has won the second Hutter prize. This is a prize for compressing the first 100 megabytes of Wikipedia to a small size. The size he achieved is 16,481,655 bytes; in March 2006, before this prize was announced, the best that had been achieved was 18,324,887 bytes.

Hutter claims that "being able to compress well is closely related to acting intelligently" -- this is based on the thesis that the way to compress data well is to have a good idea of what's coming next, and the way to do that is to understand the structure of the data. So far, so good. But does it follow that understanding the structure of the data requires intelligence on the part of the compression algorithm? I don't think so. The MP3 algorithm, for example, compresses music for the most part by relying on psychological information about how humans perceive sound, and it required intelligence on the part of the inventor -- but I wouldn't ascribe intelligence to the algorithm.

Supposedly there is a proof that "ideal lossless compression" implies passing the Turing Test: the e-mail there by Matt Mahoney states
The second problem remains: ideal lossy compression does not imply passing the Turing test. For lossless compression, it can be proven that it does. Let p(s) be the (unknown) probability that s will be the prefix of a text dialog. Then a machine that can compute p(s) exactly is able to generate response A to question Q with the distribution p(QA)/p(Q) which is indistinguishable from human. The same model minimizes the compressed size, E[log 1/p(s)].

This may be true -- I'm not competent to evaluate the proof. But it does not follow that a compression method which is 1% short of ideal lossless compression is 1% away from passing the Turing test. And we don't know what the ideal even is! It is possible to estimate the entropy of natural language, but the estimates seem to be good to, say, one decimal place.

Dan has pointed out at my post "secret messages in human DNA?" that general-purpose compressors don't work on all data. If you gzip the human genome, it gets bigger. In fact, it's not possible to write a program that losslessly compresses (that is, strictly shortens) every input string, for the following reason: say we could write a program which losslessly compresses all strings of length up to N bits. There are 2^{N+1} - 1 such strings, but only 2^N - 1 strings of length N-1 bits or less; so two different inputs would have to compress to the same output, and we wouldn't be able to properly uncompress the data.
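A quick way to see this in practice (an illustrative sketch, not specific to the genome): hand a general-purpose compressor data that is already incompressible, such as random bytes, and the output comes out longer than the input.

    import gzip, os

    data = os.urandom(100_000)         # random bytes have no structure to exploit
    compressed = gzip.compress(data)
    print(len(data), len(compressed))  # the "compressed" version is slightly larger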