05 November 2007

An implication of the St. Petersburg paradox

Many of you are probably familiar with the St. Petersburg paradox. The paradox is as follows: I offer to play a game with you. I will flip a fair coin repeatedly until it comes up heads. If the first time it comes up heads is on the nth toss, I will pay you 2^n dollars. How much are you willing to pay to play this game?

Well, the probability that the coin comes up heads on the first toss is 1/2; in this case you get 2 dollars. The probability that it comes up heads for the first time on the second toss is 1/4; in this case you get 4 dollars. In general, the probability that the coin comes up heads for the first time on the nth toss is 1/2^n, and in this case you get 2^n dollars. So your expected winnings are

2(1/2) + 4(1/4) + 8(1/8) + 16(1/16) + ...

and each term here is 1; the series diverges. So you should be willing to pay an infinite amount of money to play this game. Yet you're not. (If you are, let me know. I'd like to have an infinite amount of money. Notice that you will only win a finite amount of money playing this game, so even after I pay you I will still have an infinite amount of money.)

Yet you're not. You may suspect that this is because you know the person you're betting against doesn't have an infinite amount of money, so your expected winnings don't come from actually summing the whole infinite series. For example, suppose Bill Gates is willing to play this game with you, and that his net worth is 2^36 dollars (about $69 billion). Then your expected winnings are

2(1/2) + 4(1/4) + 8(1/8) + ... + 2^35(1/2^35) + 2^36(1/2^36) + 2^36(1/2^37) + ...

where the first thirty-six terms are all 1; then what follows is 1/2, 1/4, 1/8, and so on. So you should be willing to spend $37 to play this game if the Gates fortune is backing it.
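To make the $37 figure concrete, here is a minimal sketch (my own, not from the post) that computes the expected winnings when the payout is capped at a bankroll of 2^36 dollars; the function name and the cutoff of 200 tosses are just illustrative choices.

```python
# A minimal sketch (not from the post): expected winnings of the St. Petersburg
# game when the payout is capped at a bankroll of 2**36 dollars.  The cutoff of
# 200 tosses is an illustrative choice; later terms are negligible.
def expected_winnings(bankroll_exponent=36, max_tosses=200):
    total = 0.0
    for n in range(1, max_tosses + 1):
        payout = min(2**n, 2**bankroll_exponent)  # winnings capped at the bankroll
        total += payout / 2**n                    # heads first on toss n: probability 1/2**n
    return total

print(expected_winnings())  # 37.0: 36 full terms of 1, plus a geometric tail summing to 1
```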

Of course, you might argue -- and a lot of economists have -- that $2n is not worth twice as much to you as $n. The usual assumption here is that the utility of $n is something like log_2(n) "utils" (I'm not sure how they handle the problem of the units here), and that people play to maximize their expected number of utils, not their expected number of dollars. Then the "expected value" of the previous game, in utils, is
1(1/2) + 2(1/4) + 3(1/8) + ... = 2
and you should be willing to pay that amount of money which is worth 2 utils to you, namely $4.
But I can construct a similar paradox. If the coin comes up heads on the first toss, I pay you $2. If it first comes up heads on the second toss, I pay you $4. If it first comes up heads on the third toss, I pay you $16. If it first comes up heads on the fourth toss, I pay you $256, and so on; the payout on the nth toss is $2^(2^(n-1)). Then you receive 1 util if the coin comes up heads on the first toss, 2 on the second toss, 4 on the third toss, 8 on the fourth toss, and so on. So you should still be willing to pay infinitely much to play this game. And in general one can construct such a payoff sequence for any unbounded utility function.
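Here is a small sketch (mine, not from the post) contrasting the two games in expected utils, taking the utility of $x to be log_2(x) as above; the function names are illustrative only.

```python
# Expected utils of the original game versus the "doubled-exponent" game,
# under the log2-utility assumption.  A quick sketch, not the post's own code.
def utils_original(max_tosses):
    # payout 2**n on toss n has log2-utility n, and probability 1/2**n
    return sum(n / 2.0**n for n in range(1, max_tosses + 1))

def utils_modified(max_tosses):
    # payout 2**(2**(n - 1)) on toss n has log2-utility 2**(n - 1)
    return sum(2.0**(n - 1) / 2.0**n for n in range(1, max_tosses + 1))

print(utils_original(60))  # approaches 2 utils, i.e. a fair price of $4
print(utils_modified(60))  # every term is 1/2, so this grows without bound (here, 30.0)
```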

The somewhat counterintuitive resolution that I heard for this recently is that utility functions must be bounded. So say $1 gives me a certain amount of utility. Then in order to make it impossible to construct a St.-Petersburg-style wager, for which I would be willing to pay an infinite amount of money, there must be some K such that any amount of money gives me at most K times as much utility as $1. I'm not sure I believe this, either... it just goes to show that sometimes expected value might not be the way to go.

04 November 2007

more on Dijkstra

A few days ago I posted a link to Edsger Dijkstra's "On the cruelty of really teaching computer science"; an anonymous commentator has pointed to the published version, which includes a series of rebuttals to Dijkstra's claim that computer science was under-mathematized. Probably the most important point made is that although it may be theoretically possible to formally prove that one's programs work:
1. mistakes are possible in proofs, just as they are in programming, and
2. engineering has historically used both the formal methods of mathematics and more pragmatic methods.

As for my claim that anthropomorphization of mathematical objects is bad, I stand by that, but that's really more a linguistic pet peeve than anything else, and I may just be saying it because, for other reasons, I don't like the people I associate with using the word "guy" for mathematical objects. That being said, evolutionarily we are used to reasoning about people, and we should take advantage of that in problem solving.

Some mathematical videos

Some videos:

Meep's Math Matters is a new series of videos that look at some simple mathematics. (You don't see Meep; you hear her voice and see her writing on an imaginary "board".) The first two installments are available on YouTube, and embedded below. The first one looks at the standard trick for summing an arithmetic series, and includes one of about a zillion retellings of the Gauss 1 + 2 + ... + 100 = 5050 anecdote; if you want to read more such retellings, go here.



The second installment shows how to sum consecutive even and odd numbers:



From Science after Sunclipse: all of first-year physics (a ridiculously fast-paced explanation of derivatives and integrals, the basic conservation laws, and so on) in nine minutes! This is from Caltech's The Mechanical Universe, a series of computer animations by Jim Blinn.



And The story of π, from Caltech's Project Mathematics, made around the same time; the animation is supposedly due to Blinn, and the video is credited to Tom Apostol (the number theorist). From this, I learned that "one of the most practical uses of supercomputers is calculating a billion digits of π." Some other videos I found at Google Video include The Law of Large Numbers (which includes a wonderful animation illustrating the law of large numbers in terms of random walks, set to some sort of futuristic music...) and The Optiverse, by John M. Sullivan, George Francis and Stuart Levy, showing an optimal (in some energetic sense) way to turn a sphere inside out! More information is available here. Mostly it is good for the pretty pictures.

03 November 2007

Fun with gamma and zeta

I'd recently dealt with the following problem in the class I'm teaching: use double integrals to approximate the sum
\sum_{i=1}^n \sum_{j=1}^n {1 \over i+j}.

The best estimate that one can get without Euler-Maclaurin summation or other such trickery is
\int_{1/2}^{n+1/2} \int_{1/2}^{n+1/2} {1 \over x+y} \, dx \, dy

which in the case we cared about (n = 100) was about 133.71. (The limits come from what I've heard called the "continuity correction" -- basically, when approximating a Riemann sum by an integral, the points at which one needs to evaluate the integrand to get that particular sum should be in the middle of the intervals into which one partitions the domain of integration.) This integral is (2n+1) log (2n+1) - (2n+2) log (n+1); it's not too hard to write an asymptotic series for that,
(2 \log 2) n - \log n - 1 + \log 2 - {3 \over 4n} + {7 \over 24n^2} + O(n^{-3})
if you just remember that \log(1 + x^{-1}) has a Taylor series in 1/x: x^{-1} - x^{-2}/2 + x^{-3}/3 + O(x^{-4}).
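As a sanity check (my own sketch, not part of the original post), here is a quick comparison of the closed form of the integral with the asymptotic series above; the agreement improves roughly like 1/n^3.

```python
# Compare the closed form (2n+1) log(2n+1) - (2n+2) log(n+1) with its
# asymptotic series.  A quick check, not from the original post.
import math

def closed_form(n):
    return (2*n + 1) * math.log(2*n + 1) - (2*n + 2) * math.log(n + 1)

def series(n):
    return (2 * math.log(2) * n - math.log(n) - 1 + math.log(2)
            - 3 / (4*n) + 7 / (24 * n**2))

for n in (10, 100, 1000):
    print(n, closed_form(n), series(n))  # e.g. n = 100 gives about 133.71 for both
```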
There's no estimate of the error here, which is kind of frustrating; however, note that we can rewrite the original sum as
{1 \over 2} + {2 \over 3} + \cdots + {n-1 \over n} + {n \over n+1} + {n-1 \over n+2} + \cdots + {1 \over 2n}

since it contains one term 1/2, two terms 1/3, and so on. This is easily rewritten as
n - \left( {1 \over 2} + {1 \over 3} + \cdots + {1 \over n+1} \right) + \sum_{k=2}^n {n-k+1 \over n+k}

and now we've reduced the problem to doing two single sums; one is well-known and Euler-Maclaurin summation easily applies to the other one. I won't go through the details, but one turns out to get an asymptotic series for this sum
(2 \log 2) n - \log n - {1 \over 2} - \gamma + \log 2 - {5 \over 8n} + {7 \over 48n^2} + O(n^{-3})

which falls below the crude estimate by (γ - 1/2) + O(n^{-1}). The Euler-Mascheroni constant γ is about 0.577, so the crude estimate is quite good, and basically shows that I wasted my time trying to come up with a better one.
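For the curious, here is a small numerical check (my own, not from the post) of the n = 100 case: the exact double sum, the crude integral estimate, and the refined series.

```python
# Exact double sum for n = 100 versus the crude integral estimate and the
# refined Euler-Maclaurin series quoted above.
import math

n = 100
gamma = 0.5772156649015329  # Euler-Mascheroni constant

exact = sum(1.0 / (i + j) for i in range(1, n + 1) for j in range(1, n + 1))
crude = (2*n + 1) * math.log(2*n + 1) - (2*n + 2) * math.log(n + 1)
refined = (2 * math.log(2) * n - math.log(n) - 0.5 - gamma + math.log(2)
           - 5 / (8*n) + 7 / (48 * n**2))

print(exact)    # about 133.63
print(crude)    # about 133.71, high by roughly gamma - 1/2
print(refined)  # about 133.63
```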

But then I got to thinking -- how do we know that γ = 0.5772156649...? For those of you who don't know, this constant is usually defined as
\gamma = \lim_{n \to \infty} \left( \sum_{k=1}^n {1 \over k} \right) - \log n.

So I suppose you could just take n really large and evaluate this; it turns out that the error in this estimate is about 1/(2n), so we wouldn't want to use this for computation. If we exponentiate both sides of the definition of γ, then we get
\exp(\gamma) = \lim_{n \to \infty} \left( {\prod_{k=1}^n e^{1/k} \over n} \right)

and rewriting the denominator as a telescoping product, this is
\exp(\gamma) = \lim_{n \to \infty} \left( {\prod_{k=1}^n e^{1/k} \over \prod_{k=1}^{n-1} {k+1 \over k}} \right).

Combine the two products into one; the numerator has one extra term, so we pull it out. This gives
\exp(\gamma) = \lim_{n \to \infty} \left( e^{1/n} \prod_{k=1}^{n-1} {e^{1/k} \over 1 + {1 \over k}} \right).

The exp(1/n) term goes away as n goes to infinity; we can replace n-1 by n to get
\exp(\gamma) = \lim_{n \to \infty} \left( \prod_{k=1}^n {e^{1/k} \over 1 + {1 \over k}} \right).

Taking logs gives
\gamma = \lim_{n \to \infty} \sum_{k=1}^n \left( {1 \over k} - \log \left( 1 + {1 \over k} \right) \right)

and writing the logarithm as a Taylor series in 1/k gives
\gamma = \lim_{n \to \infty} \sum_{k=1}^n \left( {1 \over 2k^2} - {1 \over 3k^3} + {1 \over 4k^4} - \cdots \right).

Then we distribute the k-sum over the Taylor series (this isn't quite kosher since there are positive and negative terms, but it works, and it can probably be justified by taking the terms of the Taylor series in pairs) to get
\gamma = \lim_{n \to \infty} \left( \sum_{k=1}^n {1 \over 2k^2} - \sum_{k=1}^n {1 \over 3k^3} + \sum_{k=1}^n {1 \over 4k^4} - \cdots \right)

which is just
\gamma = {1 \over 2} \zeta(2) - {1 \over 3} \zeta(3) + {1 \over 4} \zeta(4) - {1 \over 5} \zeta(5) + \cdots.

This sum doesn't converge all that quickly; ζ(n) is very nearly 1 for large n. But I thought it was kind of cute nonetheless. (It's mentioned in the Wikipedia article, along with a bunch of other, similar sums, some of which are a lot more useful for actual computation.)
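Here is a quick numerical look (my own sketch) at just how slowly the zeta series converges; it assumes the mpmath library, whose zeta() function and built-in constant euler are used below.

```python
# Partial sums of gamma = (1/2) zeta(2) - (1/3) zeta(3) + (1/4) zeta(4) - ...
# using mpmath (assumed to be installed).
from mpmath import mp, zeta, euler

mp.dps = 25  # working precision in decimal digits

def partial(M):
    """Sum of (-1)**m * zeta(m) / m for m = 2, ..., M."""
    return sum((-1)**m * zeta(m) / m for m in range(2, M + 1))

for M in (10, 100, 1000):
    s = partial(M)
    print(M, s, abs(s - euler))  # the error decays only like 1/M, since zeta(m) -> 1
```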

02 November 2007

How long are extra-inning games?

From the Baseball Reference Stat of the Day: the number of Major League Baseball games that went exactly n innings this year is as follows:



n                           10   11   12   13   14   15   16   17
number of n-inning games    97   58   34   16    9    3    1    2

(The table in the original post actually gives twice the number of games which have gone at least n innings; I've done the obvious manipulations to get this table.)
If you look at this sequence, one thing that sticks out is that it's roughly geometric: each number is just about half the one before. This indicates that a game has roughly the same probability of ending in any given extra inning, namely about one-half. The geometric distribution is "memoryless"; this means that at the beginning of the nth inning, for any n ≥ 10, the probability of the game going exactly one more inning is 1/2, exactly two more innings 1/4, three more innings 1/8, and so on. The expected number of additional innings to be played is always the same -- around two.

This isn't all that surprising -- every extra inning "looks the same" when it starts. Things might start to change around the fifteenth or sixteenth inning, when teams just run out of pitchers: it wouldn't surprise me to learn that scoring rates go up in, say, the seventeenth inning, because whoever is on the mound is either the guy you never let play or a guy who usually pitches an inning or two and is now in his fifth inning of work. But there's not enough data from just a single season to say anything for sure.

But if we look a bit closer at the data, it does look like something like that happens. To be somewhat more precise, there were 220 extra-inning games played this year, and 465 extra innings; thus at the beginning of the tenth, one should expect a game to go 2.11 more innings. There were 123 games of 11 innings or more played this year, and 245 innings numbered 11 or higher; thus at the beginning of the eleventh a game should go 1.99 more innings. Similar numbers for the beginning of the twelfth, ..., seventeenth are 1.88, 1.84, 1.73, 1.83, 1.67, 1.00. The distribution isn't quite geometric, and the expected number of innings remaining in an extra-inning game does, on the whole, decrease as time goes on.
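Here is a short script (my own, not from the Baseball Reference post) that recovers these expected-remaining-innings numbers from the table above.

```python
# Expected number of innings still to be played, given that a game has
# reached the start of a given extra inning.  Counts are from the table above.
games = {10: 97, 11: 58, 12: 34, 13: 16, 14: 9, 15: 3, 16: 1, 17: 2}

for start in range(10, 18):
    reached = sum(c for n, c in games.items() if n >= start)
    innings_left = sum(c * (n - start + 1) for n, c in games.items() if n >= start)
    print(start, round(innings_left / reached, 2))
# prints 2.11, 1.99, 1.88, 1.84, 1.73, 1.83, 1.67, 1.0
```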

(Also, in general the fifth and sixth innings are surprisingly high-scoring. I blame the fact that starters seem to fall apart a lot in the fifth or sixth these days. I'm not saying I could do better, but these guys make a hundred times what I do, so they should be held to higher standards.)

I actually saw this in action on July 25th, when the Phillies played the Nationals and won in fourteen innings, of which I only saw the first nine. My father and I went to the game, and I believe my father had never left a baseball game early in his life. But we left after the ninth, because we had to pick my mother up at the train station. She said that she wouldn't have minded waiting, say, an inning or so. But you never know how long the game is going to last, which is why we left... and if we had stayed around for the rest of the game she would have been waiting at the train station for two hours. You always think the game will end soon. And it usually will. But sometimes it doesn't.

01 November 2007

Expecting the unexpected in lotteries

Pattern analysis of MegaMillions Lottery Numbers, from omninerd.com, via slashdot. The author suggests that one should play the numbers which have come up often in the lottery, because they're more likely to come up often again, and backs it up with a pile of meaningless charts.

But of course some numbers are going to come up more often than others. It would be kind of creepy if all the numbers had come up equally often! For example:
Players have had the option of drawing numbers between 1 and 56 since June 22, 2005. Every number has been drawn at least ten times while the most frequently drawn number has appeared thirty times. Overall, each number has appeared an average of twenty times. Despite being drawn the most, both 7 and 53 are only 2.17 deviations from the mean. Even the least drawn number, 47, is -2.17 deviations from the mean.

Let's assume that the number of times each of the balls has appeared is independent (this is roughly true because there are so many balls). These counts should be independent binomial distributions with mean np and variance np(1-p), where n is the number of drawings in question and p is the probability that a ball comes up in a given drawing; these can be approximated by normal random variables. The maximum of k numbers chosen uniformly at random from [0,1] has expected value k/(k+1) (this isn't obvious, but it's a fairly standard fact); the maximum of the 56 z-scores (number of standard deviations from the mean) for the balls should be at about the 100(56/57) ≈ 98.2 percentile of the normal distribution.
And that's about 2.11 standard deviations from the mean.

Not to mention that expressing things in terms of standard deviations really throws away a huge amount of the data...

I'm not going to dissect this much further. But by the previous analysis, consider a 56-ball lottery where 5 balls are picked each day. After n days, the number of times that ball 1 (or any other ball fixed in advance) has been picked is approximately normally distributed with mean 5n/56 and variance n(5/56)(51/56); the standard deviation is thus the square root of this, about 0.285 n^{1/2}. The maximum of the 56 random variables defined like this is about 2.11 standard deviations above the mean, so about 5n/56 + 0.602 n^{1/2}.

Say n is 1000; then the maximum of the number of times ball k has been picked, over all k from 1 to 56, can be estimated by plugging in n = 1000 here; you get 108. But the average ball gets chosen 5000/56 times, or about 89 times. Each individual ball will be chosen near 5n/56 times; but there's bound to be some outlier. More generally, something unexpected happens in just about any random process, but you can't predict what unexpected thing will happen. That's why it's unexpected!
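Here is a quick check of these back-of-the-envelope numbers (my own sketch, using only the Python standard library; NormalDist needs Python 3.8 or later).

```python
# Expected maximum, over 56 balls, of the number of times a ball is drawn in
# a 56-ball lottery picking 5 balls a day for 1000 days, via the normal
# approximation described above.
from math import sqrt
from statistics import NormalDist

k, p, n = 56, 5 / 56, 1000

z_max = NormalDist().inv_cdf(k / (k + 1))  # expected maximum z-score, about 2.11
mean = n * p                               # about 89.3
sd = sqrt(n * p * (1 - p))                 # about 9.0

print(z_max)              # about 2.11
print(mean + z_max * sd)  # about 108, versus an average of about 89
```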

(Incidentally, if I were going to pick numbers to bet on based on probability, I'd pick the ones that come up about the expected number of times. Some people will bet on numbers that have come up a lot recently because they think they're "hot"; some people will bet on numbers that have come up rarely recently because they think they're "due". Since I don't believe numbers are "hot" or "due", my key to success in playing the lottery would be to play numbers that other people are less likely to play. But an even more successful strategy is not playing at all.)

Where are the zeros of zeta of s?

Tom Apostol's number theory song, which exists in various copies all over the web. It begins:
Where Are The Zeros of zeta of s?
G.F.B. Riemann has made a good guess:
"They're all on the critical line", stated he,
"And their density is one over two pi log T."

This is, of course, the Riemann hypothesis; the rest of the song goes on to give a lot of the history of this problem.

I was reminded of this because I checked out Tom Apostol's book Introduction to Analytic Number Theory; I keep saying that I should learn some analytic number theory and this was recommended to me as a good place to start. About a month ago I wrote about probabilities of relative primality, which was inspired by some things I thought about while reading Mark Dominus' post on a similar topic; Mark gave a picture of a quilt which tabulates the GCD for integers. Apostol's book has a similar picture on the cover.