08 May 2008

Sampling is tricky

From Statistical Modeling, etc.: The candy weighing demonstration, or, the unwisdom of crowds.

Basically, you have a bag of candies of various sizes. You want to know how much the bag weighs, and for some reason you know the number of candies. To estimate it, you pick five candies at random, weigh them together, and divide by five to get the average weight of a candy. Then you multiply by the number of candies.

Your estimate will tend to be too high, because in practice you are more likely to pick the large candies than the small ones, so the five you weigh are not a uniform sample.
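Here is a rough simulation of the effect. Everything specific in it is my own assumption for illustration: the bag contents are made up, the function name `estimate_bag_weight` is hypothetical, and I assume the chance of grabbing a candy is proportional to its weight, drawing with replacement to keep the code short. Real hands in real bags will be biased in some other way, but the direction should be the same.

```python
import random

def estimate_bag_weight(weights, k=5, size_biased=True, rng=random):
    """Estimate the total weight of the bag from k sampled candies."""
    if size_biased:
        # Assumption: probability of picking a candy is proportional to its weight.
        sample = rng.choices(weights, weights=weights, k=k)
    else:
        # Uniform sampling: every candy is equally likely to be picked.
        sample = rng.choices(weights, k=k)
    # Average weight of the sampled candies, scaled up by the known count.
    return sum(sample) / k * len(weights)

rng = random.Random(0)            # fixed seed so runs are repeatable
bag = [5] * 80 + [50] * 20        # hypothetical bag: 80 small and 20 large candies (grams)
true_total = sum(bag)             # 1400
trials = 10_000

biased_mean = sum(
    estimate_bag_weight(bag, size_biased=True, rng=rng) for _ in range(trials)
) / trials
uniform_mean = sum(
    estimate_bag_weight(bag, size_biased=False, rng=rng) for _ in range(trials)
) / trials

print("true total:         ", true_total)
print("size-biased average:", round(biased_mean))
print("uniform average:    ", round(uniform_mean))
```

Under these assumptions the uniform estimates average out close to the true total, while the size-biased ones land well above it.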

This reminds me of a classic statistical paradox: academic institutions like to measure the "average number of students per course", which they do by adding up the number of students in each course and dividing by the number of courses. But what really matters from the students' point of view is the average size of the courses they're in. Big classes have more students in them (that's what makes them big!), so this latter average is larger whenever the courses differ in size. Assume, for example, we have a university with two courses; one has 30 students and one has 60. Every student takes one course. The average number of students per course is (30+60)/2 = 45. But if you ask the students "how many people are in your course?", one-third will say 30 and two-thirds will say 60, so from the students' point of view the average is (30·30 + 60·60)/90 = 50. One could conclude that universities would use their teaching resources more efficiently if they didn't have such a wide range of course sizes, but this wouldn't be practical.
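The two averages are easy to compute side by side. This is just a sketch of the arithmetic; the function names are mine, and it assumes every student takes exactly one course, as in the example.

```python
def per_course_average(sizes):
    """The institution's view: total enrollment divided by number of courses."""
    return sum(sizes) / len(sizes)

def per_student_average(sizes):
    """The students' view: each course of size n is reported by n students,
    so each size is weighted by itself."""
    return sum(n * n for n in sizes) / sum(sizes)

sizes = [30, 60]
print(per_course_average(sizes))   # 45.0
print(per_student_average(sizes))  # 50.0
```

If all the courses were the same size, the two functions would return the same number; the gap opens up only when sizes vary.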

1 comment:

Mark Dominus said...

This reminds me of the observation that you're most likely to find yourself in the slow, crowded lane of the highway precisely because it is the crowded lane.