01 November 2007

Expecting the unexpected in lotteries

Pattern analysis of MegaMillions Lottery Numbers, from omninerd.com, via Slashdot. The author suggests that one should play the numbers which have come up often in the lottery, because they're more likely to come up often again, and backs it up with a pile of meaningless charts.

But of course some numbers are going to come up more often than others. It would be kind of creepy if all the numbers had come up equally often! For example:
"Players have had the option of drawing numbers between 1 and 56 since June 22, 2005. Every number has been drawn at least ten times while the most frequently drawn number has appeared thirty times. Overall, each number has appeared an average of twenty times. Despite being drawn the most, both 7 and 53 are only 2.17 deviations from the mean. Even the least drawn number, 47, is -2.17 deviations from the mean."

Let's assume that the numbers of times the individual balls have appeared are independent (this is roughly true because there are so many balls). Each count is then binomially distributed with mean np and variance np(1-p), where n is the number of drawings in question and p is the probability that a given ball comes up in a given drawing; these can be approximated by normal random variables. The maximum of k numbers chosen uniformly at random from [0,1] has expected value k/(k+1) (this isn't obvious, but it's a fairly standard fact). Feeding a z-score (number of standard deviations from the mean) through the normal distribution function gives a number that is roughly uniform on [0,1], so the maximum of the 56 z-scores for the balls should be near the 100(56/57) ≈ 98.2 percentile of the normal distribution.
And that's about 2.11 standard deviations from the mean, right in line with the 2.17 in the passage quoted above.
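Here's a quick way to check both numbers, as a rough sketch in Python using only the standard library. The function names (mean_max_uniform, heuristic_max_z), the trial count and the seed are my own choices for illustration; nothing here comes from the omninerd article.

```python
import random
from statistics import NormalDist, mean

K = 56  # number of balls, as in the post


def mean_max_uniform(k, trials=20000, seed=0):
    """Average, over many trials, of the largest of k uniform draws on [0, 1]."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    return mean(max(rng.random() for _ in range(k)) for _ in range(trials))


def heuristic_max_z(k):
    """z-score at the k/(k+1) quantile of the standard normal distribution."""
    return NormalDist().inv_cdf(k / (k + 1))


print(f"k/(k+1)            = {K / (K + 1):.4f}")
print(f"simulated mean max = {mean_max_uniform(K):.4f}")
print(f"heuristic max z    = {heuristic_max_z(K):.2f}")
```

The first two printed values should agree to a few decimal places, and the third is the z-score the percentile argument points to. Keep in mind that the percentile argument is a heuristic for the typical size of the largest z-score, not an exact expected value.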

Not to mention that expressing things in terms of standard deviations really throws away a huge amount of the data...

I'm not going to dissect this much further. But by the previous analysis, consider a 56-ball lottery where 5 balls are picked each day. After n days, the number of times that ball 1 (or any other ball fixed in advance) has been picked is approximately normally distributed with mean 5n/56 and variance n(5/56)(51/56); the standard deviation is thus the square root of this, about 0.285 n^(1/2). The maximum of the 56 random variables defined like this is about 2.11 standard deviations above the mean, so about 5n/56 + 0.602 n^(1/2).

Say n is 1000; then the maximum of the number of times ball k has been picked, over all k from 1 to 56, can be estimated by plugging in n = 1000 here; you get about 108. But the average ball gets chosen 5000/56 times, or about 89 times. Each individual ball will be chosen near 5n/56 times, but there's bound to be some outlier. More generally, something unexpected happens in just about any random process, but you can't predict what unexpected thing will happen. That's why it's unexpected!
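If you'd rather see this than take my word for it, here's a small simulation, again a sketch under my own assumptions: each drawing picks 5 distinct balls out of 56 uniformly at random, and the function names, trial count and seed are made up for illustration. It compares the estimate above with the average, over many simulated lotteries, of the count for the most-drawn ball.

```python
import math
import random
from statistics import NormalDist, mean

BALLS, PICKED = 56, 5  # the 56-ball, 5-per-drawing lottery from the post


def estimated_max_count(n, balls=BALLS, picked=PICKED):
    """Mean count plus the heuristic z-score times the standard deviation."""
    p = picked / balls
    z = NormalDist().inv_cdf(balls / (balls + 1))  # about 2.11 for 56 balls
    return n * p + z * math.sqrt(n * p * (1 - p))


def simulated_max_count(n, trials=100, seed=0, balls=BALLS, picked=PICKED):
    """Average, over simulated lotteries, of the most-drawn ball's count."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    maxima = []
    for _ in range(trials):
        counts = [0] * balls
        for _ in range(n):
            # One drawing: `picked` distinct balls chosen uniformly at random.
            for ball in rng.sample(range(balls), picked):
                counts[ball] += 1
        maxima.append(max(counts))
    return mean(maxima)


n = 1000
print(f"mean count per ball : {n * PICKED / BALLS:.1f}")
print(f"estimated max count : {estimated_max_count(n):.1f}")
print(f"simulated max count : {simulated_max_count(n):.1f}")
```

The simulated figure should land in the same neighborhood as the estimate, well above the 89 or so that an average ball gets. The two needn't match exactly, since the percentile argument is only a heuristic and the counts aren't perfectly independent.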

(Incidentally, if I were going to pick numbers to bet on based on probability, I'd pick the ones that come up about the expected number of times. Some people will bet on numbers that have come up a lot recently because they think they're "hot"; some people will bet on numbers that have come up rarely recently because they think they're "due". Since I don't believe numbers are "hot" or "due", my key to success in playing the lottery would be to play numbers that other people are less likely to play. But an even more successful strategy is not playing at all.)
