A Journey To Baseball's Alternate Universe, in today's New York Times, by Samuel Arbesman and Steven Strogatz.
Arbesman and Strogatz ask the question: how likely is it that some major league baseball player, at some point, would have had a 56-game hitting streak, as Joe DiMaggio did in 1941?
I've seen attempts to determine this before, but they're usually hand-waving arguments that start out by saying "assume everybody bats .266 and gets 3.83 at-bats a game" (the actual averages for the 2007 National League), and then compute the probability that such a player has a 56-game hitting streak in any given sequence of 56 games. In this case that's easy: treating at-bats as independent, the average player has a probability (1 - .266)^3.83 = 0.306 of not getting a hit in any given game, thus a probability 0.694 of getting a hit in any given game; raising this to the 56th power tells you that the average player has a probability of 1.31 in a billion of getting hits in, say, the 56 games starting tomorrow and ending sometime in early June.
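Here is a minimal sketch of that arithmetic in Python, assuming (as the back-of-the-envelope model does) that every at-bat is an independent trial and that a fractional number of at-bats can simply be used as an exponent. The function names are mine, chosen for illustration.

```python
def game_hit_probability(avg, at_bats_per_game):
    """Chance of at least one hit in a game, treating at-bats as independent."""
    return 1.0 - (1.0 - avg) ** at_bats_per_game


def streak_frame_probability(avg, at_bats_per_game, games=56):
    """Chance of a hit in every one of `games` specific consecutive games."""
    return game_hit_probability(avg, at_bats_per_game) ** games


# 2007 National League averages quoted in the text
p_game = game_hit_probability(0.266, 3.83)
p_frame = streak_frame_probability(0.266, 3.83)

print(f"hit in a given game:       {p_game:.3f}")
print(f"hit in 56 specific games:  {p_frame:.3g}")
```

The two printed values should match the 0.694 and 1.31-in-a-billion figures above.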
So what's the expected number of 56-game hitting streaks this season, according to this model? There are 107 ways any given player could get a 56-game streak in a 162-game season -- starting in game 1, 2, ..., 107. So the expected number of 56-game streaks for Joe Qankee (yes, I'm reviving the Qankees) is this probability times 107, or 1.41 × 10^-7. Now, assume there are eight Qankees that play every day. (The Qankees are an extraordinarily healthy National League team. The fact that their name rhymes with that of the American League team that DiMaggio actually had his streak with is pure coincidence.) The expected number of 56-game hitting streaks by Qankees this season is thus 1.13 × 10^-6. (Note that this is not the probability that one of them has such a streak. A 57-game streak would get counted twice here, a 58-game streak three times, and so on. However, it is an upper bound for the probability of a Qankee having a streak of at least 56 games.)
Now, there have been something less than three thousand "team seasons" in Major League Baseball (one team playing for one season). So the expected number of 56-game streaks is bounded above by (1.13 × 10^-6) × 3000 ≈ 0.00338, or about one in 300.
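The whole chain of multiplications fits in a few lines. This is only a sketch of the crude model: the 162-game season, the eight everyday players, and the round figure of 3000 team seasons are the simplifying assumptions from the text (historical seasons were often shorter), and the constant names are mine.

```python
AVG = 0.266               # league batting average (2007 NL)
AB_PER_GAME = 3.83        # at-bats per game for an everyday player
SEASON_GAMES = 162        # assumed season length
STREAK = 56
REGULARS_PER_TEAM = 8     # the everyday Qankees
TEAM_SEASONS = 3000       # rough upper bound on team seasons played

p_game = 1 - (1 - AVG) ** AB_PER_GAME        # hit in a given game
p_frame = p_game ** STREAK                   # hit in 56 specific games
frames = SEASON_GAMES - STREAK + 1           # possible starting games: 107

# Linearity of expectation: add up the chance of each 56-game frame.
per_player = frames * p_frame
per_team = REGULARS_PER_TEAM * per_player
all_history = TEAM_SEASONS * per_team

print(f"per player-season: {per_player:.3g}")
print(f"per team-season:   {per_team:.3g}")
print(f"all of history:    {all_history:.3g}  (about 1 in {1 / all_history:.0f})")
```

Because overlapping frames are counted separately, these are expected counts of 56-game frames, not probabilities; the last line should come out close to one in 300.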
But we've had one. That seems like a lot.
What's the problem here? Well, the average player isn't the one that's going to have that streak. A .280 hitter will put together a streak in 7.40 56-game frames out of every billion. A .300 hitter, in 69 out of every billion. A .320 hitter, in 498 out of every billion. (And I'm still assuming such a player only gets 3.83 at-bats a game; that's probably not true, because a player who hits well will lead his team as a whole to have more at-bats.) But a hitter who is equally far below average doesn't drag down the expectation nearly as much, so replacing everyone with the average player understates the total. I've also ignored batting order (which Arbesman and Strogatz did take into account, implicitly; their inputs for each player are the total number of hits, number of games played, and number of plate appearances, and the number of plate appearances varies with position in the batting order).
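To see how steeply the frame probability climbs with batting average, here is a small sketch under the same assumptions as before (independent at-bats, 3.83 at-bats a game for everyone). The helper name `frames_per_billion` and the .232 hitter are mine; .232 is just as far below .266 as .300 is above it.

```python
def frames_per_billion(avg, at_bats_per_game=3.83, games=56):
    """Expected number of all-hit 56-game frames per billion frames."""
    p_game = 1 - (1 - avg) ** at_bats_per_game
    return 1e9 * p_game ** games


for avg in (0.232, 0.266, 0.280, 0.300, 0.320):
    print(f"{avg:.3f} hitter: {frames_per_billion(avg):8.2f} per billion")

# A .300 hitter and a .232 hitter average out to .266, but their
# combined streak rate is far above that of two .266 hitters.
mixed = (frames_per_billion(0.300) + frames_per_billion(0.232)) / 2
print(f"mixed pair: {mixed:.2f}  vs  average hitter: {frames_per_billion(0.266):.2f}")
```

The rate is a sharply convex function of batting average, which is why the good hitters dominate the total and the weak ones barely subtract from it.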
Rather than making some assumptions on how batting averages are distributed (which would probably be wrong, and even if they were right in the peak of the distribution would still be wrong because what really matters is the tails), I'll defer to Arbesman and Strogatz. Their method is to simulate the entire history of baseball 10,000 times, which is enough to get a nice basically-smooth curve for the distribution of the length of the longest streak. The median length of the longest streak, in their simulations, is 53 games.
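To give a feel for what such a simulation involves, here is a toy sketch. It is not Arbesman and Strogatz's code: they work from each real player's season statistics, while this simulates a single hypothetical player whose chance of a hit is the same in every game. All function names, the 162-game season, and the trial count are my assumptions.

```python
import random


def longest_streak(game_results):
    """Length of the longest run of True values in a sequence of games."""
    best = run = 0
    for got_hit in game_results:
        run = run + 1 if got_hit else 0
        best = max(best, run)
    return best


def simulate_season(p_game, games, rng):
    """Longest hitting streak in one simulated season."""
    return longest_streak(rng.random() < p_game for _ in range(games))


def median_longest_streak(p_game, games=162, trials=2000, seed=0):
    """Median, over many simulated seasons, of the longest streak."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    results = sorted(simulate_season(p_game, games, rng) for _ in range(trials))
    return results[len(results) // 2]


# The average hitter from the text gets a hit in about 69.4% of games.
print(median_longest_streak(0.694))
```

A full version would loop over every player-season in the record books, each with its own per-game hit probability, and track the longest streak across all of them in each simulated history.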
Simulation might not be entirely necessary, though. It's routine to calculate the distribution of the length of the longest streak in a sequence of biased coin flips; aggregating that information across all players and seasons is a little harder. But I don't care enough to do it, so I'll stop here.
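For the record, the single-player piece of that calculation can be sketched as a short dynamic program over the length of the current run. This is one standard way to do it, not anything taken from the article; the function name is mine, and it assumes each game is an independent coin flip with the same bias.

```python
def prob_streak_at_least(n, k, p):
    """Probability that n independent flips with success chance p
    contain a run of at least k consecutive successes."""
    # state[j] = probability that no run of k has happened yet and the
    # current run of successes has length j (0 <= j < k)
    state = [1.0] + [0.0] * (k - 1)
    reached = 0.0
    for _ in range(n):
        # a success from run length k-1 completes a streak
        reached += state[k - 1] * p
        # a failure resets the run; a success extends it by one
        state = [sum(state) * (1 - p)] + [s * p for s in state[:-1]]
    return reached


p_game = 0.694                      # per-game hit chance from the text
exact = prob_streak_at_least(162, 56, p_game)
single_frame = p_game ** 56
print(f"streak of 56+ in 162 games: {exact:.3g}")
print(f"ratio to one frame:         {exact / single_frame:.1f}")
```

The exact probability for one player-season falls between the single-frame probability and the 107-frame upper bound; the aggregation over players with different hit probabilities is the part left undone.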