God Plays Dice: journalism

18 October 2007

will the best team win?

Will the Best Team Win? Maybe -- by Alan Schwarz, in the October 14 New York Times.

Basically, there's a very good chance that the best team does not win in the Major League Baseball playoffs.

On a related note, before a playoff series sportswriters often make predictions of the form "[team] in [number of games]". In a best-of-(2n-1) series, if we know the probability p that a team (let's call them the Phillies) wins a game against some other team (as I've done before, let's call them the Qankees), then we can compute the probability that they'll win the series. Let q = 1 - p be the probability that the Qankees win a single game. Then the probability that the Phillies win in k games, where k is between n and 2n-1, can be obtained as follows. There are ${k-1 \choose n-1}$ arrangements of wins and losses in a k-game series that allow the Phillies to win in k games -- they must win game k and n-1 out of the first k-1 games. Each of these occurs with probability $p^n q^{k-n}$. So the probability that the Phillies win in k games is

$P_{n,k}(p) = {k-1 \choose n-1} p^n (1-p)^{k-n}$

and likewise the probability that the Qankees win in k games is

$Q_{n,k}(p) = {k-1 \choose n-1} (1-p)^n p^{k-n}$

(The number n is the number of games needed to win the series; in a best-of-seven series, n = 4.) For example, $P_{4,6}(.6)$ is the probability that the Phillies win a best-of-seven series in six games, given that they have a .6 probability of winning each game; it's ${5 \choose 3} (.6)^4 (.4)^2 = 0.20736$.
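The two formulas are easy to check numerically. Here is a minimal Python sketch; the function names `win_in`, `lose_in` and `series_table` are my own labels for illustration, and the whole thing assumes independent games with a fixed p. It tabulates every outcome of a best-of-seven series at p = .6 and adds the probabilities up, which should give 1 since the series has to end somehow.

```python
from math import comb

def win_in(n, k, p):
    """P_{n,k}(p): the Phillies clinch a first-to-n series in exactly k games."""
    return comb(k - 1, n - 1) * p**n * (1 - p)**(k - n)

def lose_in(n, k, p):
    """Q_{n,k}(p): the same formula with the roles of p and 1 - p swapped."""
    return win_in(n, k, 1 - p)

def series_table(n, p):
    """Map ('win' or 'lose', k) to its probability for k = n, ..., 2n-1."""
    table = {}
    for k in range(n, 2 * n):
        table[("win", k)] = win_in(n, k, p)
        table[("lose", k)] = lose_in(n, k, p)
    return table

table = series_table(4, 0.6)
for (result, k), prob in sorted(table.items()):
    print(f"{result} in {k}: {prob:.5f}")
print("total:", round(sum(table.values()), 12))
```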

A prediction of the winner of a series and the number of games it will take amounts to a prediction of p. If we assume that the predictor simply predicts the most likely outcome of the series given what they believe p to be, then we want to find the largest of

$P_{n,n}(p), \ldots, P_{n,2n-1}(p), Q_{n,n}(p), \ldots, Q_{n,2n-1}(p)$.

To do this, we start by finding the ratio $P_{n,k+1}(p)/P_{n,k}(p)$; this is k(1-p)/(k-n+1). If this is greater than 1, a win in k+1 games is more likely than a win in k games; it becomes greater than 1 once p drops below (n-1)/k. So, for example, as we decrease p, a five-game win in a best-of-seven series becomes more likely than a sweep when p = 3/4 (we have n=4 and k=4); a six-game win in a best-of-seven series becomes more likely than a five-game win when p = 3/5; a seven-game win becomes more likely than a six-game win when p = 3/6 = 1/2.

And in general, as we decrease p, a win in (2n-1) games becomes more likely than a win in (2n-2) games when p = (n-1)/(2n-2) = 1/2.
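As a sanity check on the thresholds p = (n-1)/k, here is a short Python sketch using exact fractions. The helper names `win_in` and `crossover` are mine, and the model is still independent games with a fixed p. At each threshold the win in k games and the win in k+1 games should be exactly tied.

```python
from fractions import Fraction
from math import comb

def win_in(n, k, p):
    """Probability of clinching a first-to-n series in exactly k games."""
    return comb(k - 1, n - 1) * p**n * (1 - p)**(k - n)

def crossover(n, k):
    """Value of p at which a win in k+1 games ties a win in k games."""
    return Fraction(n - 1, k)

n = 4  # best-of-seven
for k in range(n, 2 * n - 1):
    p = crossover(n, k)
    tied = win_in(n, k, p) == win_in(n, k + 1, p)
    print(f"k={k}: p={p}, tied={tied}")
```

Exact rational arithmetic is used so that the tie can be tested with `==` rather than with a floating-point tolerance.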

But when p dips below 1/2, that's also when losses should become more likely than wins!

In particular, the ratio between the Phillies' probability of winning in 2n-1 games and that of losing in 2n-1 games is $P_{n,2n-1}(p)/Q_{n,2n-1}(p) = p/(1-p)$; if p < 1-p then winning in 2n-1 games can't be the most common outcome. At best, winning in 2n-1 games is as probable as winning in 2n-2 games... when p = 1/2, and at that moment losing in 2n-1 and losing in 2n-2 have the same probability.

Concretely, in a best-of-seven series you should predict that the Phillies:

win in four, if p > 3/4;

win in five, if 3/5 < p < 3/4;

win in six, if 1/2 < p < 3/5;

lose in six, five, or four in cases symmetric to the three above.

If p = 1/2, then a win in six, a win in seven, a loss in six, and a loss in seven all have the same probability, 5/32 each.
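The prediction rule above can be reproduced by brute force. This is only a sketch under the same assumptions (independent games, fixed p); `outcomes` and `best_predictions` are hypothetical names I made up for it. It lists every outcome tied for the highest probability, so the four-way tie at p = 1/2 shows up as four entries.

```python
from fractions import Fraction
from math import comb

def outcomes(n, p):
    """Probability of each (result, games) outcome in a first-to-n series."""
    q = 1 - p
    table = {}
    for k in range(n, 2 * n):
        ways = comb(k - 1, n - 1)
        table[("win", k)] = ways * p**n * q**(k - n)
        table[("lose", k)] = ways * q**n * p**(k - n)
    return table

def best_predictions(n, p):
    """All outcomes tied for the highest probability, in sorted order."""
    table = outcomes(n, p)
    top = max(table.values())
    return sorted(key for key, prob in table.items() if prob == top)

for p in (Fraction(4, 5), Fraction(7, 10), Fraction(11, 20),
          Fraction(1, 2), Fraction(3, 10)):
    print(p, best_predictions(4, p))
```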

The point here is that either type of seven-game series is never the sole most likely outcome in this model (although it may be in reality, because games aren't independent -- home-field advantage, who's starting that day, and so on enter into the picture), and that it almost never makes sense to predict a sweep (playoff teams will be evenly enough matched that the worse one should be able to beat the better one more than one-quarter of the time).

Yet four- and seven-game series happen. I'm not saying that these are ridiculously rare events, just that it doesn't make sense to predict them a priori. It's a bit surprising, though -- if you actually played all seven games, 4-3 would be the most common outcome for series that are nearly evenly matched -- but enough of those come from the team already down 4-2 winning the last game that you don't see that in the best-of-seven format.
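To put numbers on that last remark, here is a small Python sketch for p = 1/2, assuming all seven games are played no matter what. The name `full_schedule` and the two split variables are my own. It takes the chance that the Phillies finish exactly 4-3 and splits it by where things stood after six games.

```python
from fractions import Fraction
from math import comb

def full_schedule(games, p):
    """Distribution of Phillies wins when all `games` games are played."""
    return {w: comb(games, w) * p**w * (1 - p)**(games - w)
            for w in range(games + 1)}

p = Fraction(1, 2)
dist = full_schedule(7, p)
four_three = dist[4]  # Phillies finish exactly 4-3

# 3-3 after six games, then the Phillies win game seven:
# this is a genuine seven-game series.
tied_after_six = comb(6, 3) * p**3 * (1 - p)**3 * p

# 4-2 after six games (series already decided), then they lose game seven.
ahead_after_six = comb(6, 4) * p**4 * (1 - p)**2 * (1 - p)

print("finish 4-3:", four_three)
print("  from 3-3 after six:", tied_after_six)
print("  from 4-2 after six:", ahead_after_six)
```

Only the first piece corresponds to a best-of-seven series that actually goes seven games; in the second, the series would already have ended.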

Realistically, though, a prediction of "[team] in 7" is just a sportswriter's way of signaling "I think this team is slightly better than its opponent", which is all it should be taken as.

30 July 2007

Language Log dissects science journalism

From Language Log: Two simple numbers and Thou shalt not report odds ratios by Mark Liberman.

The first of these, from a week ago, suggests the following rule:

Today's prescription is a trivial rule of scientific rhetoric. When there's a claim that some genomic variant is associated with some phenotypic trait -- whether it's breast cancer or homosexuality or conservatism or stuttering -- we need to know four simple numbers. Specifically: (A) the number of "case subjects" in the study (people with the trait in question); (B) the number of "control subjects" in the study; (C) the proportion of the case subjects with the genomic variant in question; and (D) the proportion of the controls with the genomic variant in question.

If four numbers are too many, leave out (A) and (B), as long as they're not really small. But stick with (C) and (D) -- they're the medicine that really does the work here.

This is something that I've often worried about; in one of the examples that Liberman cites, (C) and (D) are 77% and 66%.

Also, there's a link to a New York Times article (July 19) with the headline Scientists Find Genetic Link for a Disorder (Next, Respect?). Does a disease need a genetic basis in order for people to take diagnoses of it seriously? All of someone's genes are determined before they're born; this seems to imply that things which happen during a person's life which affect their health don't matter. (Please don't get me started on people who think that homosexuality is okay if, and only if, it's genetic. And even if there is a "gay gene", it's not like everyone who has it is gay and everyone who doesn't have it isn't. If the inheritance patterns for homosexuality were that simple we'd have figured it out already.)

But, you know, numbers scare people. If you put numbers in a newspaper article they'll throw up their hands and turn on some reality television.

At least in the first case I had realized that there was missing information. The second of these seems more insidious to me, because I'd never thought about it before, and I'm smarter than most people about these things. (You probably are, too, if you're reading this. If you don't believe me, get out of the house some time.) A recent study was reported in the popular press with phrases such as this (from the New York Times):

Doctors are only 60% as likely to order cardiac catheterization for women and blacks as for men and whites.

As it turns out, the referral rate for white men was 90.4%, and for women and blacks 84.7%. (While I'm on the subject: conflating "women" and "blacks" like this seems kind of silly. And by "men and whites" they apparently actually meant "white men".) The study reports an "odds ratio"; the odds of a white man being referred are about 9.4 to 1 (90.4 to 9.6), and the odds of a black person or woman being referred are about 5.5 to 1 (84.7 to 15.3). The ratio of these numbers is where the 60% comes from.

The following sentence would actually be pretty close to true:

Doctors are only 60% as likely to not order cardiac catheterization for white men as for women and blacks.

The relevant percentages are 9.6% and 15.3%, which are close enough to zero that the results don't get distorted too badly by all this manipulation. When it's put that way, it's hard to understand, but if we take not ordering catheterization as some sort of negligence you can see how it would come about. Still, it's the sort of sentence with lots of quantifiers that only a mathematician could love.
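The arithmetic is short enough to spell out. Here is a minimal Python sketch that uses only the two referral rates quoted above; the variable names are mine. It computes the odds ratio, the plain ratio of referral rates, and the ratio of non-referral rates.

```python
def odds(rate):
    """Convert a probability into odds in favor (e.g. 0.5 -> 1.0)."""
    return rate / (1 - rate)

white_men = 0.904          # referral rate quoted for white men
women_and_blacks = 0.847   # referral rate quoted for women and blacks

odds_ratio = odds(women_and_blacks) / odds(white_men)
rate_ratio = women_and_blacks / white_men
non_referral_ratio = (1 - white_men) / (1 - women_and_blacks)

print(f"odds ratio:            {odds_ratio:.3f}")
print(f"referral rate ratio:   {rate_ratio:.3f}")
print(f"non-referral ratio:    {non_referral_ratio:.3f}")
```

The odds ratio and the non-referral ratio both land near 60%, while the ratio of the referral rates themselves is much closer to 1, which is the number "60% as likely" would suggest to most readers.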

It seems that odds ratios are often given in the medical literature because they arise more naturally from certain statistical tests. But the media has a responsibility to translate the facts into language that the hypothetical "educated layperson" can understand. And the schools have a responsibility to create "educated laypeople" who can then read such an article and understand it, but this is not a post about education.
