## 29 April 2010

### A simple model for baseball

From the April Notices of the AMS, John D'Angelo writes Baseball and Markov Chains: Power Hitting and Power Series. Consider the following simple model of baseball. Players only hit singles; three singles score a run. That is, the third and every following player to get a hit in a given inning score a run. This can either be interpreted as that, say, all runners score from second on a single or all runners go from first to third on a single -- but not both! -- or that every third hit is actually a double. (And I do mean exactly every third hit, not some random one-third of hits, so this is a bit unnatural.) Then the expected number of runs per half inning is p3(3p2-10p+10)/(1-p). For real baseball the average number of runs per half-inning is around one half, which corresponds to p = 0.361.

D'Angelo gives this as an exercise, but I independently came up with this model a while ago and can't resist sharing the solution. Let q = 1-p. The probability of getting k hits in an inning is pk q3 -- that's the probability of getting those hits in a certain order -- times the number of ways in which k hits and 3 outs can be arranged. Since the last batter of an inning must get out, the number of possible arrangements is the number of ways to pick 2 batters out of the first k+2 to get out, which is (k+2)(k+1)/2.

The probability of getting k runs, if k is at least 1, is just the probability of getting k+2 hits, which is pk+2q3(k+4)(k+3)/2. Call this f(k); then

f(1) + 2f(2) + 3f(3) + ... = p3(3p2-10p+10)/(1-p)

by some annoying algebra. I'm pretty sure I came up with this exact model while procrastinating from some real work a couple years ago; it's probably been independently reinvented many times.

With p = 0.361, the probabilities of scoring 0, 1, 2, 3, 4, 5 runs in an inning are .748, .123, .066, .034, .016, .008 (rounded to three decimal places). (Probabilities of larger numbers of runs can also be calculated; together they have probability around .006.)

Assuming that each half-inning is independent, the probability G(k) of a team scoring k runs in a game is, for each k,

 k 0 1 2 3 4 5 G(k) 0.073 0.108 0.129 0.133 0.124 0.108 k 6 7 8 9 10 11 G(k) 0.088 0.069 0.052 0.038 0.026 0.018 k 12 13 14 15 16 17 G(k) 0.012 0.008 0.005 0.003 0.002 0.001

with probability about 0.0006 of scoring 18 runs or more. (This seems a bit low to me -- three times a season in the major leagues -- but after all this is a very crude model!) But one interesting thing here is that the distribution of the number of runs per game, which is a sum of nine skewed distributions, is still skewed; the mode is 3, and the median 4. Recall that I chose p so that the mean would be 4.5. And the actual distribution is similarly skewed.

Of course a more sophisticated model of baseball is as a Markov chain. There are twenty-five states in this chain -- zero, one or two outs combined with eight possible ways to have runners on base, and three outs. We assume that each hitter hits randomly according to his actual statistics, and the runners move in the "appropriate" way. Of course determining what's appropriate here would be a bit tricky. How do runners move? A runner is probably more likely to take an extra base when a power hitter is hitting, but the sample size for any individual is fairly small. But one could probably predict from some measure of the hitter's power (say, the number of doubles and home runs, combined appropriately) the chances of a runner taking an extra base on a single. Something similar is necessary for sacrifice flies (which have to be deep enough to score the runner), grounding into double plays, etc. I'm not sure if the Markov models that are out there, such as that by Sagarin, do this. Sagarin computes the (offensive) value of a player by determining how many runs per game a team composed of only that player would score.