30 March 2008

It's almost baseball season again...

A Journey To Baseball's Alternate Universe, in today's New York Times, by Samuel Arbesman and Steven Strogatz.

Arbesman and Strogatz ask the question: how likely is it that some major league baseball player, at some point, would have had a 56-game hitting streak, as Joe DiMaggio did in 1941?

I've seen attempts to determine this before, but they're usually handwaving things that start out by saying "assume everybody bats .266 and gets 3.83 at-bats a game" (actual averages for the 2007 National League), and then compute the probability that such a player has a 56-game hitting streak in any given sequence of 56 games. In this case that's easy; the average player has a probability (1 - .266)^3.83 = 0.306 of not getting a hit in any given game, thus a probability 0.694 of getting a hit in any given game. Raising this to the 56th power tells you that the average player has a probability of 1.31 in a billion of getting hits in, say, the 56 games starting tomorrow and ending sometime in early June.
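The arithmetic in this paragraph can be checked with a few lines of Python. This is only a sketch of the handwaving model described above; the function names are made up for illustration, and it assumes every at-bat is an independent trial with the same success probability.

```python
def p_hit_in_game(avg, ab_per_game):
    """Chance of at least one hit in a game, treating at-bats as independent."""
    return 1.0 - (1.0 - avg) ** ab_per_game


def p_streak_window(avg, ab_per_game, length=56):
    """Chance of a hit in every one of `length` specific consecutive games."""
    return p_hit_in_game(avg, ab_per_game) ** length


# 2007 National League averages quoted in the post.
p_game = p_hit_in_game(0.266, 3.83)
p_window = p_streak_window(0.266, 3.83)

print(f"P(hit in a given game)      = {p_game:.3f}")
print(f"P(hits in 56 specific games) = {p_window:.3g}")
```

The two printed values should match the 0.694 and the roughly 1.31 in a billion quoted above.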

So what's the expected number of 56-game hitting streaks this season, according to this model? In a 162-game season there are 162 - 56 + 1 = 107 ways any given player could get a 56-game streak -- starting in game 1, 2, ..., 107. So the expected number of 56-game streaks for Joe Qankee (yes, I'm reviving the Qankees) is this probability times 107, or 1.41 × 10^-7. Now, assume there are eight Qankees that play every day. (The Qankees are an extraordinarily healthy National League team. The fact that their name rhymes with that of the American League team that DiMaggio actually had his streak with is purely coincidence.) The expected number of 56-game hitting streaks by Qankees this season is thus 1.13 × 10^-6. (Note that this is not the probability that one of them has such a streak. A 57-game streak would get counted twice here, a 58-game streak three times, and so on. However, it is an upper bound for the probability of a Qankee having a streak of at least 56 games.)

Now, there have been something less than three thousand "team seasons" in Major League Baseball (one team playing for one season). So the expected number of 56-game streaks is bounded above by (1.13 × 10^-6) × 3000 = 0.0034, or about one in 300.
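The chain of multiplications in the last two paragraphs is short enough to write out. The sketch below just repeats the post's assumptions (a 162-game season, eight everyday players per team, 3000 team seasons as a round upper figure); the variable names are invented for illustration.

```python
GAMES_PER_SEASON = 162   # assumption: modern schedule length
STREAK_LENGTH = 56
EVERYDAY_PLAYERS = 8     # the eight Qankees who play every day
TEAM_SEASONS = 3000      # "something less than three thousand"

# Per-window probability for the average 2007 NL hitter (.266, 3.83 AB/game).
p_game = 1.0 - (1.0 - 0.266) ** 3.83
p_window = p_game ** STREAK_LENGTH

# A streak can start in game 1, 2, ..., 107.
windows = GAMES_PER_SEASON - STREAK_LENGTH + 1

per_player = p_window * windows            # expected 56-game windows, one player
per_team = per_player * EVERYDAY_PLAYERS   # one team, one season
all_history = per_team * TEAM_SEASONS      # every team season ever played

print(f"windows per season: {windows}")
print(f"per player: {per_player:.3g}")
print(f"per team:   {per_team:.3g}")
print(f"all of MLB: {all_history:.3g}  (about 1 in {1 / all_history:.0f})")
```

Because overlapping windows are counted separately, each of these numbers is an expected count rather than a probability, which is why it only bounds the probability from above.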

But we've had one. That seems like a lot.

What's the problem here? Well, the average player isn't the one that's going to have that streak. A .280 hitter will put together a streak in 7.40 56-game frames out of every billion. A .300 hitter, in 69 out of every billion. A .320 hitter, in 498 out of every billion. (And I'm still assuming such a player only gets 3.83 at-bats a game; that's probably not true, because the player who hits well will lead his team as a whole to have more at-bats.) But a hitter who is worse than average by the same margin doesn't drag down the expectation nearly as much, because his chance of a streak was already next to nothing. I've also ignored batting order (which Arbesman and Strogatz did take into account, implicitly; their inputs for each player are the total number of hits, number of games played, and number of plate appearances, and number of plate appearances varies with position in the batting order).
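To see how steeply the streak probability climbs with batting average, here is a small sketch under the same assumptions as before (3.83 independent at-bats per game for everyone). The helper name `streaks_per_billion` and the .246/.286 pair used for the comparison are illustrative choices, not figures from the article.

```python
def streaks_per_billion(avg, ab_per_game=3.83, length=56):
    """Expected number of successful 56-game frames per billion frames."""
    p_game = 1.0 - (1.0 - avg) ** ab_per_game
    return 1e9 * p_game ** length


for avg in (0.266, 0.280, 0.300, 0.320):
    print(f".{round(avg * 1000):03d} hitter: {streaks_per_billion(avg):8.2f} per billion")

# The good hitter adds far more than the equally bad hitter takes away:
# one .286 hitter plus one .246 hitter beat two .266 hitters.
mixed = (streaks_per_billion(0.286) + streaks_per_billion(0.246)) / 2
uniform = streaks_per_billion(0.266)
print(f"average of .246 and .286: {mixed:.2f} per billion")
print(f"two .266 hitters:         {uniform:.2f} per billion")
```

The comparison at the end is the whole point: the function is so convex in batting average that spreading the averages out, while keeping the mean fixed, raises the expected number of streaks.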

Rather than making some assumptions on how batting averages are distributed (which would probably be wrong, and even if they were right in the peak of the distribution would still be wrong because what really matters is the tails), I'll defer to Arbesman and Strogatz. Their method is to simulate the entire history of baseball 10,000 times, which is enough to get a nice basically-smooth curve for the distribution of the length of the longest streak. The median length of the longest streak, in their simulations, is 53 games.
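For a feel of what simulating a season looks like, here is a toy Monte Carlo sketch. It is not Arbesman and Strogatz's code and does not use their data: the five-player "lineup" is hypothetical, every player is assumed to play all 162 games with 3.83 at-bats each, and 1,000 trials are run instead of 10,000 to keep it quick.

```python
import random


def longest_hit_streak(p_game, n_games, rng):
    """Simulate one season and return the longest run of games with a hit."""
    longest = current = 0
    for _ in range(n_games):
        if rng.random() < p_game:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def simulate_longest(p_games, n_games=162, n_trials=1000, seed=0):
    """For each trial, record the longest streak by any player in the list."""
    rng = random.Random(seed)
    return [
        max(longest_hit_streak(p, n_games, rng) for p in p_games)
        for _ in range(n_trials)
    ]


# Hypothetical batting averages, converted to per-game hit probabilities.
lineup = [1.0 - (1.0 - avg) ** 3.83 for avg in (0.250, 0.266, 0.280, 0.300, 0.320)]
results = sorted(simulate_longest(lineup))
median = results[len(results) // 2]
print(f"median longest streak over {len(results)} simulated seasons: {median}")
```

Scaling this up to every player-season in history, with each player's real hits, games, and plate appearances, is what the article's approach amounts to.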

Simulation might not be entirely necessary, though. It's routine to calculate the distribution of the length of the longest streak in a sequence of biased coin flips; aggregating that information together is a little harder. But I don't care enough to do it, so I'll stop here.
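The "routine" calculation for a single player can be sketched as a small dynamic program over the current streak length. This is one standard way to do it, not necessarily the one the author had in mind; the function name is made up, and games are assumed independent with a fixed hit probability.

```python
def p_streak_at_least(p, n_games, k):
    """Probability of a run of at least k successes in n_games biased coin flips."""
    # state[j] = P(no run of k so far, and the current run has length j)
    state = [0.0] * k
    state[0] = 1.0
    done = 0.0  # probability that a run of length k has already occurred
    for _ in range(n_games):
        new = [0.0] * k
        new[0] = (1.0 - p) * sum(state)   # a hitless game resets the run
        for j in range(k - 1):
            new[j + 1] = p * state[j]     # a hit extends the run by one
        done += p * state[k - 1]          # a hit that completes a run of k
        state = new
    return done


# Average 2007 NL hitter over a 162-game season.
p_game = 1.0 - (1.0 - 0.266) ** 3.83
exact = p_streak_at_least(p_game, 162, 56)
bound = 107 * p_game ** 56
print(f"exact probability of a 56+ game streak: {exact:.3g}")
print(f"upper bound from counting windows:      {bound:.3g}")
```

Calling it for each k gives the full distribution of the longest streak for one player. Combining many players with different probabilities, for example as one minus the product of the individual no-streak probabilities if the players are treated as independent, is the harder aggregation step mentioned above.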

3 comments:

michaeldcassidy said...

If I remember correctly, Stephen Jay Gould wrote that he didn't believe we would see another .400 hitter because of changes in defense; I wonder if that also plays into never seeing Joe's 56-game hitting streak broken.

I also wonder how many of DiMaggio's hits came in the late innings with tired pitchers? Now a manager automatically puts in a reliever whether he believes a pitcher is tired or not.

Isabel Lugo said...

Michael,

that's probably true. An achievement like DiMaggio's streak requires there to be some weak pitching to beat up on.

If I remember correctly, not only are there fewer .400 hitters than there used to be, but there are fewer, say, .150 hitters than there used to be; the mean batting average has stayed roughly constant over time but the distribution has gotten a lot tighter. Since the long streaks come almost exclusively from the hitters with high batting averages, the fact that the tail of the distribution is disappearing makes streaks less likely.

Anonymous said...

I always figured the same thing, that DiMaggio greatly benefitted from tired pitchers. He probably did to some extent, but maybe not as much as we thought.

A while back, I compiled info on which inning DiMaggio got his first hit of each of the 56 games. I also did it for a few other long streaks for which I could get data.

Here it is:
AVERAGE INNING OF FIRST HIT:
DiMaggio (56g, 1941): 3.93
Rose (44g, 1978): 3.84
Dahlen (42g, 1894): 3.43
Molitor (39g, 1987): 3.87
Rollins (38g, 2005-06): 4.13
Utley (35g, 2006): 3.97

PERCENTAGE OF GAMES WITH FIRST HIT IN 7TH INN OR LATER:
DiMaggio: 21.4%
Rose: 18.2%
Dahlen: 12.5%
Molitor: 17.9%
Rollins: 34.2%
Utley: 25.7%

So DiMaggio's streak isn't really exceptional in that regard. Rollins's streak was amazing; probably nothing else like it has ever happened. He got his first hit in the 8th or 9th inning much, much more often than would be expected.

--Trent McCotter, SABR. (I didn't feel like registering after I'd already typed this, so I'm doing this "anonymously")