17 October 2007

Extraordinary claims require extraordinary evidence

Here's a book that's available online that I learned about recently: Information Theory, Inference, and Learning Algorithms by David MacKay. I learned about it from Steve at Information Processing; he learned about it from Nerd Wisdom.

I rather like the trend of books being available online. For one thing, it means that I do not have to carry books back and forth between my office and my apartment. (This is unfortunately not true for the calculus book I'm teaching from, which is about 1200 pages; fortunately my officemate has a copy of the book he doesn't use, so I use his copy when I'm on campus and my copy when I'm at home.)

This book seems to be a good introduction to information theory, machine learning, Bayesian inference, etc.; I have not had the chance to read any part of it thoroughly but I have randomly sampled from it and it seems quite interesting.

A few days ago I wrote about the question of finding the next element in the sequence "1, 2, 5, ..." and some of my readers complained that this problem is not well-posed. MacKay, in his chapter on Occam's razor, gives a similar example: what's the next number in the sequence "-1, 3, 7, 11"? You probably say 15. Consider two possible theories -- the sequence is an arithmetic sequence, or it is given by a cubic function of the form cx^3 + dx^2 + e, where c, d, and e are rational numbers. (The omission of the linear term is deliberate; otherwise the first theory would be a case of the second one.) The first one turns out to be much more likely, for the simple reason that there are fewer parameters to be tweaked! Let's say it's equally likely that the underlying function is linear or cubic; there are just a lot more possible cubic functions, so each particular cubic function is less likely. (For the details, see p. 345 of MacKay.)
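
To make the parameter-counting concrete, here's a small Python sketch of the evidence calculation in the spirit of MacKay's; the parameter ranges and priors below are my own arbitrary choices, so the numbers are only illustrative.

    from fractions import Fraction

    # Read the data -1, 3, 7, 11 as "each term determines the next one".
    # H_arith: next = x + d, with the difference d uniform over the integers -50..50.
    # H_cubic: next = c*x**3 + d*x**2 + e, with c, d, e each drawn uniformly as a
    #          fraction p/q, p in -50..50, q in 1..50.
    # (Both hypotheses also have to produce the first term -1; its probability is
    # the same under each, so it cancels in the ratio and is left out here.)

    def prob_of_fraction(target, pmax=50, qmax=50):
        """Prior probability that a uniformly chosen p/q equals target."""
        hits = sum(1 for p in range(-pmax, pmax + 1) for q in range(1, qmax + 1)
                   if Fraction(p, q) == target)
        return Fraction(hits, (2 * pmax + 1) * qmax)

    # Under H_arith only d = 4 produces 3, 7, 11 from -1.
    evidence_arith = Fraction(1, 101)

    # Under H_cubic the unique map of this form sending -1 -> 3 -> 7 -> 11 has
    # c = -1/11, d = 9/11, e = 23/11.
    evidence_cubic = (prob_of_fraction(Fraction(-1, 11))
                      * prob_of_fraction(Fraction(9, 11))
                      * prob_of_fraction(Fraction(23, 11)))

    print(float(evidence_arith))                    # about 0.01
    print(float(evidence_cubic))                    # about 2.5e-10
    print(float(evidence_arith / evidence_cubic))   # tens of millions to one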

By logic such as this, the most likely next element in that sequence is difficult to say, though... should we prefer 10, because then the nth term is given by the simple explicit formula (n-1)^2 + 1? Or should we prefer 12 or 13, which are both given by simple linear recurrences? My instinct is that it depends on where the problem comes from, since the various possible next terms arise from different sorts of combinatorial structures, and in this sense the problem was ill-posed. In reality we wouldn't start out by assuming that all possible theories have equal probability; for one thing, there are infinitely many of them! The "simple" theories (the sequence has an explicit formula which is a polynomial, or a linear recursion with constant coefficients, or something like that) have higher prior probabilities... but given enough evidence, any complicated theory that starts out with nonzero probability of being true could turn out to be true. Extraordinary claims, though -- those with low prior probability -- require extraordinary evidence. There's a nice toy example a little further on in MacKay (p. 351) showing that if you see a piece of a box sticking out to the left of a tree, and a piece of a box of the same height and color sticking out to the right, it is nearly certain that those are actually pieces of the same box, and not two different boxes.

(What I've sketched in the previous paragraph is a bit of a lie, though; mathematical reasoning is rarely anywhere near this fuzzy. One could almost argue that it never is, because if we are so unsure about whether a result is true or not then we just don't call it mathematics.)

PS: I realized that I had a bunch of things I wanted to write about that were kind of piling up, so I might be digging into stuff that made its way through the blogosphere a month or so ago. I would feel no need to apologize for this except that blogs move so quickly.

PPS: The title of this post is due to Carl Sagan. I did not realize that when I titled the post, but I knew I'd heard the phrase somewhere before, and Google tells me it's his.

20 September 2007

Baseball entropy

I was thinking of putting together a prediction of the Phillies' odds of making the postseason -- it's late enough in the season that I can do the calculations exactly, if I'm willing to ignore the teams that plainly have no chance -- but the results would be depressing. (Although I did do a prediction of the Phillies' ten-thousandth loss, which came a couple days earlier than I predicted. That was depressing, too.) The good folks at Baseball Prospectus do a simulation where they run the rest of the season a million times; at this time of the season, with ten games left, the odds fluctuate wildly with each game.

I've also wondered if it would be possible to determine some sort of information entropy (the link goes to a nice intuitive explanation of what that means) from these postseason odds, and use that as a single quantity to determine how "close" the playoff races are at a given moment. For example, at basically any moment this season, the National League has been "closer" than the American League. Okay, by "wondered" I mean I thought "I don't really want to do the computations, because I'm lazy". The information entropy measures "how surprising" a random variable is. The entropy of a random variable which takes each of n values 1, ..., n with probabilities p_1, ..., p_n is

-(p_1 log p_1 + ... + p_n log p_n)

where we adopt the convention that 0 log 0 = 0 (or, equivalently, we just drop the outcomes that have probability zero). For example, consider the winner of this year's National League pennant. If there are, say, three known playoff teams and a fourth team which is equally likely to be one of two teams, then the probability that the pennant winner is any of the first three teams is 1/4, and either of the other two teams 1/8; then the entropy of that random variable is

3 (1/4 log 4) + 2 (1/8 log 8) = 9/4

(logs are base 2 here; this means that entropy is measured in bits). The entropy of a random variable which is equally likely to take any of n values is log n bits; thus there are in some sense 2^(9/4) = 4.756... contenders, in that if we could have a random variable which took 2^(9/4) values with equal probability, it would have the same entropy. This interpolates between four and five; there are five contenders but three of them are clearly stronger than the other two. As of right now the probabilities of each National League team making the playoffs, according to Baseball Prospectus, are

.9449017 (Mets), .8990023 (Diamondbacks), .8109878 (Padres), .7179337 (Cubs), .2938922 (Phillies), .2825803 (Brewers), .0310565 (Rockies), .0106489 (Dodgers), .0089891 (Braves), .0000075 (Cardinals), all other teams zero

I'll assume that each of these teams, should it make the playoffs, has a one-in-four chance of winning the pennant; thus the entropy of the pennant winner is given by summing a bunch of terms which look like

-.9449017/4 log (.9449017/4)

and in the end we get 2.5312 bits, corresponding to 2^2.5312 = 5.780 contenders. This seems reasonable; there are basically six contending teams at this point.
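
For the record, here's the whole calculation in a few lines of Python; the only assumption, as above, is that every playoff team wins the pennant with probability 1/4.

    import math

    # Entropy of the pennant winner, using the Baseball Prospectus odds quoted
    # above.  Dividing each team's playoff odds by 4 gives its (assumed) pennant
    # probability; these sum to 1 because the ten odds sum to 4, one per playoff spot.
    nl_odds = [.9449017, .8990023, .8109878, .7179337, .2938922,
               .2825803, .0310565, .0106489, .0089891, .0000075]
    pennant = [p / 4 for p in nl_odds]

    def entropy_bits(probs):
        return -sum(p * math.log2(p) for p in probs if p > 0)

    print(entropy_bits([1/4] * 3 + [1/8] * 2))   # the toy example above: 2.25 = 9/4 bits
    print(entropy_bits(pennant))                 # about 2.53 bits
    print(2 ** entropy_bits(pennant))            # about 5.78 "effective" contenders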

The American League has four teams above 99 percent right now (Red Sox, Yankees, Indians, Angels), and the entropy of their pennant winner is 2.019 bits or 4.054 "effective" contenders.

And this post was originally supposed to be about math humor in a baseball radio broadcast: Ryan Howard, with the bases empty, has fourteen home runs and fourteen RBIs so far this season.

I was just (well, a couple innings ago now) informed of this by the Phillies radio announcers; it appeared on the monitor that tells them the statistics.

They chuckled at this. I assume they were chuckling because it's trivial, as one of them said "well, how many RBIs was he supposed to have?" The only way one can bat in a run if the bases are empty is to hit a home run; furthermore that will bat in exactly one run.

(I assume that the monitor in question breaks down a player's statistics by the eight possible situations for who is on the bases; the other lines probably seem less silly. They won't show this same sort of thing, because when runners are on base it's possible to get them in without scoring a home run.)

21 August 2007

information-theoretic entropy in the weather

Right now, in Philadelphia, it's sixty-one degrees, with a light rain.

I have long maintained that this is "generic Philadelphia weather". By this I do not mean that it's always like this here. What I mean is that if I for some reason do not know what season it is, and I head outside and find it is sixty-one degrees with a light rain, this gives me very little information, because this sort of weather can happen any time of year. Another characteristic of such a day is that the high and low temperatures are very close together, say within ten degrees of each other; it's been between 60 and 64 since midnight and probably won't get out of the sixties (in either direction) all day.

Looking at the data from the last year, June 14 was pretty close to this, although it was just overcast, not rainy; January 6 might look that way on paper, except I remember it clearly and it was actually a freakishly warm and sunny day. I wore shorts. It was January. We had the New Year's Day parade that day. November 8, 13, and 14 fit as well; also October 11 and a few other October days; September 1. (I remember the weather on September 1 quite well, because I moved that day. The rain was light for most of the day and got heavier about an hour after my movers were done getting everything in.) I'm deliberately being vague about what constitutes a day like this.

Not surprisingly, this sort of weather is most common in the spring and fall (mostly because I care about temperature) but it is possible in the winter or summer as well. And this gets me wondering -- in general, what is the information content of the weather? If it's 100 degrees and sunny, there might be a 2% chance that it's July 24; a 0.5% chance it's September 1; a 0.01% chance that it's November 1; and a one-in-a-million chance it's January 15. This sort of weather is very localized towards a certain time of year. One could imagine calculating the Shannon entropy corresponding to this distribution; it would be a lot smaller than the entropy you'd get from a similar distribution if you conditioned on sixty degrees and light rain.
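
Here's roughly what that computation could look like in Python; the daily-records format and the weather buckets are hypothetical, just to fix ideas.

    import math
    from collections import Counter

    # Hypothetical sketch: given a few years of daily records as
    # (month, high_temp_F, conditions) tuples, estimate the distribution of the
    # month conditioned on a given kind of weather and compute its entropy.
    # Low entropy means the weather pins down the time of year; entropy near
    # log2(12) means it could be almost any month.

    def month_entropy(records, weather_matches):
        counts = Counter(m for (m, temp, cond) in records if weather_matches(temp, cond))
        total = sum(counts.values())
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    # e.g., with a real `records` list:
    # month_entropy(records, lambda t, c: 55 <= t <= 65 and c == "rain")   # high
    # month_entropy(records, lambda t, c: t >= 95 and c == "sunny")        # low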

Of course, in this formulation, the whole idea is kind of silly -- when am I not going to know what the date is? But looking at the information-theoretic entropy of weather seems like a potentially useful way to quantify how extreme the seasons in some place might be; it's possible but not normal to get the same weather in winter and summer in Philadelphia, say; routine in San Francisco; unheard of in Minneapolis. (I am picking these places without looking at actual statistics, so I might be wrong.) Why one would want to quantify that, though, I'm not sure.

03 August 2007

information theory for architects

Paul Bourke asks a question about the intersection of three cylinders (via Anarchaia):
Question. Is a 3D object uniquely described by a plan, front and side view?
Answer: No!
In particular, consider a sphere at the center of a Cartesian coordinate system. It can be inscribed in a cylinder with central axis the x-axis, or the y-axis, or the z-axis. The intersection of these three cylinders has the same plan, front, and side view (i. e. projections onto the three coordinate planes) as the sphere, but it's not a sphere. If you want to see what it looks like, there's a picture at the post, of one made from wood. Bourke writes:
I originally designed this object because Architecture students would ask me why the computer couldn't scan their plans and elevations and automatically build a 3D model. This was the simplest demonstration I could come up with that there isn't enough information about a 3D object in its plane projections.
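
If you'd rather check this numerically than visualize it, here's a quick sanity check of the projection claim, a sketch of my own (not from Bourke's post) that samples points rather than proving anything.

    import random

    # Check that the intersection of the three unit cylinders casts the same
    # shadow on the xy-plane as the unit ball; by symmetry the same holds for
    # the other two coordinate planes.

    def in_tricylinder(x, y, z):
        return x*x + y*y <= 1 and y*y + z*z <= 1 and x*x + z*z <= 1

    def in_ball(x, y, z):
        return x*x + y*y + z*z <= 1

    def in_shadow(contains, x, y, steps=201):
        """Is (x, y) in the projection of the solid onto the xy-plane?"""
        return any(contains(x, y, -1 + 2 * k / (steps - 1)) for k in range(steps))

    random.seed(0)
    for _ in range(10_000):
        x, y = random.uniform(-1, 1), random.uniform(-1, 1)
        assert in_shadow(in_tricylinder, x, y) == in_shadow(in_ball, x, y)
    print("xy-projections agree at all sampled points")
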
Thinking about how much "information" is contained in something is in general a useful idea. My first instinct is that these three projections, being plane curves, contain together about as much information as, say, three cross-sections of the object; you would never expect to be able to recreate an object from three of its cross-sections. I'm not sure if I would have been able to come up with this particular counterexample, but the previous argument would have reassured me that one probably exists.

But this sort of intuition can be misleading, especially when one is doing continuous (as opposed to discrete) mathematics. This line of reasoning would also lead you to think that Fourier series don't exist, because how can we encode uncountably many real numbers (the values of a function at every point in some real interval) by only countably many real numbers (its Fourier coefficients)? The correct way to interpret this might be that functions which have Fourier series are somehow extremely rare among all functions, which is true because they've got to be piecewise smooth and continuous. I once asked a foolish question along these lines at a colloquium on the famous problem "Can you hear the shape of a drum?" The problem is basically what it sounds like. There's a membrane stretched in a certain shape, and when you hit it its motion is governed by a certain partial differential equation. The eigenvalues of the PDE, with boundary conditions governed by the shape of the drum, are given to you; these correspond to the various frequencies one would hear in the sound if the drum were hit, hence the question. Can you (armed with an oscilloscope and a big fancy computer) tell what the drum looks like, if you don't get to see it? (The answer is "sort of".) It seemed to me for a moment that there's no way you could recover the whole shape from a countable sequence of numbers... but then I was reminded that Fourier series exist.

I prefer to only use this sort of intuition in discrete mathematics, because then usually one can count the things in question (at least in principle) since there are only finitely many of them. For example: it's easy to prove that any algorithm for sorting n items, where we're only allowed to compare them pairwise, must make at least on the order of n log n comparisons in the worst case. Basically, we want to determine which permutation in S_n is associated with the original list (in layman's terms, what order the items started out in). From an information-theoretic point of view, each comparison gives us at most one bit of information; the total amount of information in the original permutation is log(n!), which is on the order of n log n by Stirling's approximation.
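
A quick way to see the Stirling comparison numerically (just a sanity check, not a proof):

    import math

    # There are n! possible orderings, so pinning one down requires log2(n!)
    # bits, and each yes/no comparison yields at most one bit.  Stirling's
    # approximation says log2(n!) grows like n log2 n.
    for n in (10, 100, 1000, 10**6):
        bits_needed = math.lgamma(n + 1) / math.log(2)   # log2(n!)
        print(n, round(bits_needed), round(n * math.log2(n)))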