18 December 2009

Uniquely identifying people by birth date, gender, and zip code

Netflix Spilled Your Brokeback Mountain Secret, Lawsuit Claims. A woman is suing Netflix because she was in the closet, and her movie-rental data was part of the Netflix prize dataset. She claims this means that people could figure out her secret.

Now Netflix is starting a second contest, and rumor has it that the data will include the zip code, birthdate, and gender of each individual. According to this paper (abstract only, unfortunately, so I can't comment on methods) by Latanya Sweeney, this is enough to uniquely identify 87% of the US population. A paper by Philippe Golle gives the figure as 63%, based on actual Census Bureau data. (The Census gives the number of people with each birth year and gender in each zip code.)

Is it surprising that people can be identified this easily?

From the Golle paper, there are 33,233 "Zip Code Tabulation Areas" in the United States.

US life expectancy is 77.7 years. Since this is a back-of-the-envelope calculation, let's assume that everybody drops dead after 77.7 years (28,379 days), and therefore that the birthdate of a random individual is uniformly distributed over the last 28,379 days. (It pains me to say this, because my grandmother is 85 and still living.)

There are, to a first approximation, two genders.

Therefore there are 28,379 * 33,233 * 2, or about 1.9 billion, possible combinations of birthdate, zip code, and gender. There are about 300 million Americans. If we assume all of these are equally likely (which they're not; some ages are more likely than others, and some zip codes have more people than others), and that they're independent (which they're not, as anybody who's lived in a college town can tell you; Golle notes the college-town effect, and also a military-base effect), then on average the number of people having a given (birthdate, zip code, gender) triplet is about 0.16.
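The arithmetic in the previous paragraph can be checked with a short Python sketch. The constant and function names are made up for illustration; the numbers are the ones quoted above, and the uniformity and independence assumptions are the same crude ones already flagged.

```python
# Back-of-the-envelope inputs, all taken from the post.
DAYS = 28_379             # 77.7 years, in days
ZCTAS = 33_233            # Zip Code Tabulation Areas (from the Golle paper)
GENDERS = 2               # to a first approximation
POPULATION = 300_000_000  # roughly the US population


def mean_people_per_triplet(population=POPULATION, days=DAYS,
                            zctas=ZCTAS, genders=GENDERS):
    """Return (number of triplets, average people per triplet).

    Assumes every (birthdate, zip code, gender) triplet is equally
    likely, which the post notes is not really true.
    """
    triplets = days * zctas * genders
    return triplets, population / triplets


triplets, lam = mean_people_per_triplet()
print(f"{triplets:,} triplets, about {lam:.2f} people per triplet")
```

The product comes to a little under 1.9 billion, and the mean rounds to 0.16.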

So we'll model the population of the US as 1.9 billion Poisson random variables, each of mean 0.16, and each corresponding to a birthdate-zip code-gender triplet. How many of these do we expect to have value 1 (meaning that that triplet picks out exactly one person)? The probability that a Poisson(0.16) random variable takes the value 1 is exp(-0.16)*(0.16). Thus we find that there are (1.9 billion)*(0.16)*exp(-0.16) people uniquely identified by this triplet, out of (1.9 billion)*(0.16) people in total.

According to this crude model, the probability that a random individual is uniquely identified by these three pieces of information, then, is exp(-0.16), or about 85%. Why is everybody so surprised?
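Here is a minimal Python sketch of that last step, along with a small sanity-check simulation. The function names and the simulation sizes (16,000 people in 100,000 cells, chosen so the mean is still 0.16) are my own illustrative choices, not anything from the Sweeney or Golle papers.

```python
import math
import random
from collections import Counter


def fraction_unique(lam):
    """Fraction of people who are alone in their cell under the Poisson model.

    Cells holding exactly one person occur with probability lam * exp(-lam);
    the expected number of people per cell is lam. The ratio is exp(-lam).
    """
    p_exactly_one = lam * math.exp(-lam)
    return p_exactly_one / lam


def simulate_fraction_unique(n_people, n_cells, seed=0):
    """Drop n_people uniformly at random into n_cells and return the
    fraction of people who end up alone in their cell."""
    rng = random.Random(seed)
    occupancy = Counter(rng.randrange(n_cells) for _ in range(n_people))
    alone = sum(1 for count in occupancy.values() if count == 1)
    return alone / n_people


print(f"Poisson model: {fraction_unique(0.16):.3f}")
print(f"Simulation:    {simulate_fraction_unique(16_000, 100_000):.3f}")
```

The simulation is scaled down so it runs instantly; only the ratio of people to cells matters for the answer, so it should land close to exp(-0.16).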


Anonymous said...

There are, to a first approximation, two genders.

Given some people I know, I found this hilarious. Good to see you posting again. Tough semester?

Michael Lugo said...

Actually, this semester hasn't been too bad; it's just that most of my internet-math energies have been at mathoverflow.

Tom LaGatta said...

Michael, what's the college-town effect?

UserGoogol said...

Tom: In towns where much of the population are college students, you have ZIP codes where the date of birth tends to be disproportionately likely to be around "about twenty years ago," making date of birth less informative.

Veky said...

I don't live in USA and I don't know much about zip codes, but what's wrong with the following argument?

Just go to any hospital any given day in any zipcode area with reasonable population, and see how many women give birth to girls and boys. I suppose it will be a lot more than 2.

Are zipcode equivalence classes in USA really that small, or is there something I'm missing?

Michael Lugo said...

Veky: there are about eight thousand hospitals in the US, and as I said in the original post about thirty thousand zip codes. In rural areas, for example, a zip code often corresponds to a single small town -- usually there is one for each post office -- and such towns may not have their own hospitals.

Anonymous said...

Hmm, how does this match with the Birthday Paradox? Oh, ok that does not say anything about the birth year, which may be different even for people with same birthday. Does it take 78 times 22 people to have someone in the set with your exact birth DATE? then, twice as much to have that someone have your same gender? Gosh what a mental mess you put me in...

Walt said...

Not only are there many zip codes without a hospital, but there are many hospitals with much less than one birth per day.

At our local hospital, the first baby of the year is often not until the 2nd, 3rd, or 4th of January.

Michael Lugo said...

Walt, I assume that the zip code of birth is the zip code in which the parents reside, not the zip code of the hospital you were born in.

Also, depending on the source there seem to be between five thousand and eight thousand hospitals in the US, and about four million annual births. This means that an average hospital has somewhere between 500 and 800 births a year, so it's not surprising that in smaller-than-average hospitals days go by without births. (Although not every hospital has a maternity ward, so that also alters the calculation...)