02 April 2011

A street-fighting approach to the variance of a hypergeometric random variable

So you all1 know that if I have a biased coin with probability p of coming up heads, and I flip it n times, then the expected number of heads is np and the variance is npq. That's the binomial distribution. Alternatively, if I have an urn containing pN white balls and qN black balls, with p + q = 1, and I draw n balls with replacement then the distribution of the number of white balls has that mean and variance.

Some of you know that if I sample without replacement from that same urn -- that is, if I take balls out and don't put them back -- then the expected number of white balls is np and the variance is npq(N-n)/(N-1). The distribution of the number of white balls is the hypergeometric distribution.

So it makes sense, I think, to think of (N-n)/(N-1) as a "correction factor" for going from sampling with replacement to sampling without replacement. This is the approach taken in Freedman, Pisani, and Purves, for example, which is the book I'm teaching intro stats from this semester.

How do you prove this? On this, FPP are silent. The proof I know -- see, for example, Pitman -- is as follows. Write the number of white balls, when sampling without replacement, as

Sn = I1 + ... + In

where ISk is 1 if the kth draw gives a white ball and 0 otherwise. Then E(Ik) is just the probability of getting a white ball on the kth draw, and so it's equal to p by symmetry. By linearity of expectation E(Sn) = np. To get the variance, it's enough to get E(Sn2). And by expanding out that sum of indicators there, you get

Sn2 = (I12 + ... + In2) + (I1 I2 + I1 I3 + ... + In-1 In).

There are n terms inside the first set of parentheses, and n(n-1) inside the second set, which includes every pair Ij Ik where j and k aren't equal. By linearity of expectation and symmetry,

E(Sn2) = nE(I1) + n(n-1)E(I1 I2).

The first term, we already know, is np. The second term is n(n-1) times the probability that both the first and second draws yield white balls. The first draw yields a white ball with probability p. For the second draw there are N-1 balls left, of which pN-1 are white, so that draw yields a white ball with probability (pN-1)/(N-1). The probability is the product of these. Do the algebra, let the dust settle, and you get the formula I claimed.

But this doesn't explain things in terms of the correction factor. It doesn't refer back to the binomial distribution at all! But in the limit where your sample is small compared to your population, sampling without replacement and smapling with replacement are the same! So can we use this somehow? Let's try to guess the correction factor without writing down any random variables. We'll write

Variance without replacement = f(N,n) npq

where n is the sample size and N is the population size, and think about what we know about f(N,n)

First, f(N,1) = 1. If you have a sample of size 1, sampling with and without replacement are actually the same thing.

Second, f(N,N) = 0. If your sample is the entire population, you always get the same result.

But most important is that if we sample without replacement, and take samples of size n or of size N-n, we should get the same variance! Taking a sample of size N-n is the same as taking a sample of size n and deciding to take all the other balls instead. So for each sample of size n with w white balls, there's a corresponding sample of size N-n with pN-w white balls. The distributions of numbers of white balls are mirror images of each other, so they have the same variance. So you get

nf(N,n)pq = (N-n)f(N, N-n)pq.

Of course the pq factors cancel. For ease of notation, let g(x) = f(N,x). Then we need to find some function g such that g(1) = 1, g(N)=0, and ng(n) = (N-n)g(N-n). Letting n = 1 you get g(1) = (N-1)g(N-1), so g(N-1) = 1/(N-1). The three values of g that we have so far are consistent with the guess that g is linear. So let's assume it is -- why should it be anything more complicated? And that gives you the formula. This strikes me as the Street-Fighting Mathematics approach to this problem.

Question: Is there a way to rigorize this "guess" -- some functional equation I'm not seeing, for example?

1. I use "all" in the mathematician's sense. This means I wish you knew this, or I think you should know it. Some of you probably don't. That's okay.


Anonymous said...

I believe there a typo in your second paragraph. I think you meant to say "without replacement," instead of "with replacement.

Unknown said...
This comment has been removed by the author.
Michael Lugo said...

You're right. Thanks for pointing out the mistake.

CarlBrannen said...

It's been 25 years since I took this sort of class. My instinct would be to begin with the binomial theorem where you take N samples, and write it as the sum of two random variables where you take n and N-n samples. The averages add up, the variances are related. But to get the hypergeometric, you have to sum out over the specific cases for the results of the N samples. This is natural when you consider the generating functions. Like I say, it's been a long, long, time.

Michael B. Moore said...

For some visualizations of symmetry in hypergeometric distributions, take a look at deckulator which also links to an online calculator for a multivariate hypergeometric distribution.