17 July 2007

how often is a team at .500?

The Phillies' current record is 46 wins and 46 losses.

When I heard this, I thought "hmm, the Phillies have been at .500 quite often this season". Baseball-reference.com tells us that they have been 0-0 (yes, that counts!), 20-20, 21-21, 22-22, 23-23, 24-24, 26-26, 28-28, 29-29, 44-44, and 46-46 this season; that's eleven times. Is that a lot? (I remember first noticing this between the 40th and 48th games of the season; after they were 20-20 they lost, won, lost, won, lost, won, lost, and won, in that order.)

Given that the team is 46-46, how many times should we expect them to have had the same number of wins and losses? It's a lot easier to work this out, of course, if we replace "46" with some smaller number.

For example, say the team had won two games and lost two games. Then there are C(4,2) = 6 ways we can arrange their two wins and two losses: WWLL, WLWL, WLLW, LWWL, LWLW, LLWW. In the first and last of these, the team was 0-0 and 2-2 at various times; in all the others they were also 1-1 after two games. This seems kind of obtuse, but let's flip things around. In six of these possibilities (which are all equally likely, because we've assumed the team wins exactly half its games), they're 0-0 after 0 games. In four of them, they're 1-1 after 2 games. In six of them, they're 2-2 after 4 games. The expected number of times that the team is at .500? It's (6+4+6)/6, or 16/6.

Sixteen is a power of two.

If we try this again for a 3-3 team, there are C(6,3) = 20 ways we can arrange three wins and three losses; there are 20, 12, 12, and 20 ways to arrange them so that the team is at some point 0-0, 1-1, 2-2, and 3-3 respectively. So the total number of times we expect them to be at .500? It's (20+12+12+20)/20, or 64/20.

Sixty-four is again a power of two. Hmm, this can't be a coincidence.
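The two small cases above can be checked by brute force. Here is a minimal sketch in Python (the function name `total_ties` is my own, not anything standard): it walks through every ordering of n wins and n losses, counts the moments at which wins equal losses (including 0-0), and adds those counts up. That total is the numerator in the fractions above.

```python
from itertools import combinations
from math import comb


def total_ties(n):
    """Sum, over every ordering of n wins and n losses, of the number of
    moments (0-0 included) at which wins equal losses."""
    total = 0
    # Choose which of the 2n games are wins; the rest are losses.
    for win_games in combinations(range(2 * n), n):
        wins = set(win_games)
        diff = 0   # wins minus losses so far
        ties = 1   # the 0-0 start counts
        for game in range(2 * n):
            diff += 1 if game in wins else -1
            if diff == 0:
                ties += 1
        total += ties
    return total


for n in range(1, 6):
    numerator = total_ties(n)
    orderings = comb(2 * n, n)
    print(n, numerator, orderings, numerator / orderings)
```

This only works for small n, since the number of orderings C(2n,n) grows very quickly; for n=46 something smarter is needed.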

Let's try to find that sum in the numerator in general. If the team has n wins and n losses (so eventually I'll set n=46 to solve the original problem), then how many ways are there to arrange the wins and losses so that the team wins m of the first 2m games? Clearly this is C(2m,m) C(2(n-m), n-m); we first have to pick which of the first 2m games are the first m wins, then which of the remaining 2(n-m) games are the n-m remaining wins. So what we want to find is the sum

C(0,0) C(2n,n) + C(2,1) C(2n-2, n-1) + ... + C(2n-2, n-1) C(2,1) + C(2n, n) C(0,0)

and I don't see how to do this directly. However, consider the (infinite) power series

1 + 2z + 6z^2 + 20z^3 + 70z^4 + ...

where the coefficients are C(0,0) = 1, C(2,1) = 2, C(4,2) = 6, C(6,3) = 20, C(8,4) = 70, and so on. (This is called the generating function of this sequence; generating functions are a ridiculously powerful tool which I will only scratch the surface of here.) This turns out to be the Taylor series of the function (1-4z)^(-1/2) at z=0. Now, consider what happens if we multiply this power series by itself, so we have

(1 + 2z + 6z^2 + 20z^3 + 70z^4 + ...)(1 + 2z + 6z^2 + 20z^3 + 70z^4 + ...)
= (1)(1) + [(2)(1) + (1)(2)]z + [(6)(1)+(2)(2)+(1)(6)] z^2 + [(20)(1)+(6)(2)+(2)(6)+(1)(20)] z^3 + ...

and the coefficient of z^n is exactly the sum we want to find! But the power series multiplied by itself is just (1-4z)^(-1), so the coefficient of z^n is 4^n.
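Both claims in this argument are easy to sanity-check numerically. The sketch below (Python again; `series_coeff` and `convolution_sum` are names I made up) computes the Taylor coefficients of (1-4z)^(-1/2) from the generalized binomial theorem using exact fractions, and then checks that the sum of products of central binomial coefficients really is a power of four.

```python
from fractions import Fraction
from math import comb


def series_coeff(k):
    """Coefficient of z^k in (1-4z)^(-1/2), via the generalized
    binomial theorem: binom(-1/2, k) * (-4)^k."""
    c = Fraction(1)
    for j in range(k):
        c *= Fraction(-1, 2) - j   # (-1/2)(-3/2)(-5/2)...
        c /= j + 1                 # divide by k! one factor at a time
    return c * (-4) ** k


def convolution_sum(n):
    """C(0,0)C(2n,n) + C(2,1)C(2n-2,n-1) + ... + C(2n,n)C(0,0)."""
    return sum(comb(2 * m, m) * comb(2 * (n - m), n - m) for m in range(n + 1))


# The series coefficients should be the central binomial coefficients.
for k in range(8):
    assert series_coeff(k) == comb(2 * k, k)

# The coefficient of z^n in the squared series should be 4^n.
for n in (2, 3, 10, 46):
    assert convolution_sum(n) == 4 ** n

print([int(series_coeff(k)) for k in range(5)])
```

This is a check, not a proof; the generating function argument is what explains why the pattern holds for every n.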

Finally, we conclude that if we work out the expected number of times at .500 for a team with n wins and n losses, it's 4^n/C(2n,n). But it's well-known that C(2n,n) is approximately 4^n/(πn)^(1/2). So a team with n wins and n losses is expected to have been at .500 very nearly (πn)^(1/2) times.

When n=46, this approximation gives 12.021. (The exact number 4^46/C(92,46) is, to three decimal places, 12.054.) The Phillies have been at .500 eleven times so far; this is actually less than the expectation, which surprised me. A team which is .500 at the end of the season is expected to have been at .500 sixteen times during the season. For the Phillies, though, since they're already at 46-46, that adjusts the estimate upward, to around twenty-two.
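For anyone who wants to reproduce these numbers, here is a short sketch (the helper name `expected_ties` is mine). It evaluates the exact ratio with Python's arbitrary-precision integers and prints it next to the square-root approximation for the small cases, for n=46, and for a full-season 81-81 team.

```python
from fractions import Fraction
from math import comb, pi, sqrt


def expected_ties(n):
    """Exact expected number of times at .500 (0-0 included) for a team
    with n wins and n losses, all orderings equally likely."""
    return Fraction(4 ** n, comb(2 * n, n))


for n in (2, 3, 46, 81):
    exact = expected_ties(n)
    approx = sqrt(pi * n)
    print(f"n={n:2d}  exact={float(exact):.3f}  sqrt(pi*n)={approx:.3f}")
```

The approximation runs slightly low, but the gap shrinks in relative terms as n grows.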

In general, though, one might not want to use the expectation of a random variable like this. It's possible that most teams which are .500 at the end of the season rarely hit that mark in the middle of the season, but a very few teams are .500 some ridiculously large number of times. However, the most times a team can be at .500 over a 162-game regular season is 82 (0-0, 1-1, 2-2, ..., 81-81), so the expectation probably is a decent guess. Also, the expectation is often a lot more accessible than more detailed information. It is in this case, because I haven't figured out how to get the whole distribution yet, so I don't know the probability, say, that a 46-46 team has been at .500 eleven or more times. That seems harder to figure out, and the best way to find that number would probably be via a simulation; getting an exact, analytic answer doesn't seem easy.
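A simulation along these lines might look like the following sketch. The assumptions are the same as in the rest of the post (every ordering of the 46 wins and 46 losses is equally likely), and the function names, the trial count, and the fixed seed are arbitrary choices of mine. It shuffles the season repeatedly, counts the ties in each shuffle, and estimates both the mean and the chance of eleven or more.

```python
import random
from collections import Counter


def count_ties(season):
    """Number of moments (0-0 included) at which wins equal losses."""
    diff = 0   # wins minus losses so far
    ties = 1   # the 0-0 start counts
    for result in season:
        diff += 1 if result == "W" else -1
        if diff == 0:
            ties += 1
    return ties


def simulate(n, trials, seed=0):
    """Shuffle n wins and n losses `trials` times; tally the tie counts."""
    rng = random.Random(seed)
    season = ["W"] * n + ["L"] * n
    counts = Counter()
    for _ in range(trials):
        rng.shuffle(season)
        counts[count_ties(season)] += 1
    return counts


counts = simulate(46, 20_000)
trials = sum(counts.values())
mean = sum(k * c for k, c in counts.items()) / trials
at_least_11 = sum(c for k, c in counts.items() if k >= 11) / trials
print(f"mean ties: {mean:.2f}")
print(f"estimated P(11 or more): {at_least_11:.3f}")
```

The estimates will wobble a bit with the seed and the number of trials; more trials give a tighter answer at the cost of running time.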

3 comments:

John Armstrong said...

Not unexpectedly, the expectation works out to 4^n/((n+1)C_n), where C_n is the n-th Catalan number.

The UM

frank said...

This is really easy to do by simulation. Given that a team is 46-46 (and all permutations of wins and losses are equally likely), the probability of # of ties is:
0 or 1 - 0 (obviously)
2 - 0.0109
3 - 0.0219
4 - 0.0325
5 - 0.0425
6 - 0.0514
7 - 0.0587
8 - 0.0643
9 - 0.0683
10 - 0.0706
11 - 0.0706
12 - 0.0691
13 - 0.0658
14 - 0.0612
15 - 0.0560
16 - 0.0498
17 - 0.0436
18 - 0.0370
19 - 0.0305
20 - 0.0249
21 - 0.0194
22 - 0.0150
23 - 0.0111
> 23 - 0.0248 (37 was the highest that I found out of 1,000,000 simulations)

This leads to a mean of 12.055 (as you pointed out)

(Sorry for the poor formatting)

If you look at a 0.500 team for a full season and don't condition on the record at any point during the season, the expected number of times at 0.500 is actually only 10.20.

Now in reality, if a team is truly expected to win 1/2 its games, there will be some games (e.g. when Hamels starts) where they expect to win more than half the time and vice versa. So I bet we'd expect to get more than 10.2 in reality with a .500 team.

Michael Lugo said...

Frank,

thanks for doing the simulation. Now I don't have to. It wouldn't have been hard, but I didn't feel like programming it.

and if I start taking into account the pitching rotation, then this becomes more of a baseball blog than I intended it to be.