10 July 2008

Why medians are dangerous

Greg Mankiw provides a graph of the salaries of newly minted lawyers, originally from Empirical Legal Studies.

There are two peaks, one centered at about $45,000 and one centered at about $145,000. The higher peak corresponds to people working for Big Law Firms; the lower to people working for nonprofits, the government, etc.

The median is reported at $62,000, just to the right of the first peak, since the first peak contains slightly more people. But one gets the impression that if a few more people were to shift from the left peak to the right peak, the median would jump drastically upwards. We usually hear that it's better to look at the median than the mean when looking at distributions of incomes, house prices, etc., because these distributions are heavily skewed to the right. But even the median starts to break down when the distribution is bimodal.
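
To make that impression concrete, here is a minimal sketch in Python. The numbers are made up for illustration and are not the Empirical Legal Studies data: 100 hypothetical salaries sit in two tight clusters, $40,000 to $50,000 and $140,000 to $150,000, with nothing in between. The helper name bimodal_sample is my own invention.

```python
from statistics import mean, median


def bimodal_sample(n_low, n_high):
    """Hypothetical salaries: n_low values spread evenly over $40k-$50k
    and n_high values spread evenly over $140k-$150k (made-up data)."""
    low = [40_000 + 10_000 * i / (n_low - 1) for i in range(n_low)]
    high = [140_000 + 10_000 * i / (n_high - 1) for i in range(n_high)]
    return low + high


# Before: the left peak holds a slim majority (52 of 100 people).
before = bimodal_sample(52, 48)
# After: four people have moved from the left peak to the right peak.
after = bimodal_sample(48, 52)

for label, sample in (("before", before), ("after", after)):
    print(f"{label}: median = {median(sample):,.0f}, mean = {mean(sample):,.0f}")
```

In this idealized setup, moving four people out of a hundred carries the median from just under $50,000 to just over $140,000, while the mean rises by only $4,000. The real data has some people between the peaks, which is why the reported median can sit at $62,000, so the real jump would be less abrupt than in this sketch.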


Efrique said...

That's not an argument specifically against medians - it's an argument against a single location measure for a bimodal distribution. (Mean, mode, and, well, any single location statistic really: none of them tells you what you need to know here.)

For a distribution with two substantial peaks, essentially any single-number description of location is going to be dangerous, since it must either describe one of the peaks (ignoring the other), or something in between the peaks (or, I guess, under some circumstances, something outside the peaks).

With a strongly bimodal distribution, the first important thing to say is "bimodal", and then identify the two modes (in pretty much the way you did in your text). That says a lot.

Location measures can make some kind of sense for a unimodal distribution (and possibly also in the case where there's a sufficiently large number of modes, as long as they tend to "die away" in the tails in such a way that there's some kind of "center"). We shouldn't expect them to necessarily make sense the rest of the time.

Michael Lugo said...

Right, but a lot of the people with a little knowledge (as opposed to no knowledge) know that they shouldn't use the mean or the mode, so they use the median.