Showing posts with label Netflix. Show all posts
Showing posts with label Netflix. Show all posts

18 December 2009

Uniquely identifying people by birth date, gender, and zip code

Netflix Spilled Your Brokeback Mountain Secret, Lawsuit Claims. A woman is suing Netflix because she was in the closet, and her movie-rental data was part of the Netflix prize dataset. She claims this means that people could figure out her secret.

Now Netflix is starting a second contest, and rumor has it that the data will include the zip code, birthdate, and gender of each individual. According to this paper (abstract only, unfortunately, so I can't comment on methods) by Latanya Sweeney, this is enough to uniquely identify 87% of the US population. A paper by Phillippe Golle gives the figure as 63%, based on actual Census Bureau data. (The Census gives the number of people with each birth year and gender in each zip code.)

Is it surprising that people can be identified this easily?

From the Golle paper, there are 33,233 "Zip Code Tabulation Areas" in the United States.

US life expectancy is 77.7 years. Since this is a back-of-the-envelope calculation, let's assume that everybody drops dead after 77.7 years (28,379 days), and therefore that the age of a random individual is uniformly distributed over the last 28,000 days. (It pains me to say this, because my grandmother is 85 and still living.)

There are, to a first approximation, two genders.

Therefore there are 28,379 * 33,233 * 2, or about 1.9 billion, possible combinations of birthdate, zip code, and gender. There are about 300 million Americans. If we assume all of these are equally likely (which they're not; some ages are more likely than others, and some zip codes have more people than others), and that they're independent (which they're not, as anybody who's lived in a college town can tell you; Golle notes the college-town effect, and also a military-base effect), then on average the number of people having a given (birthdate, zip code, gender) triplet is about 0.16.

So we'll model the population of the US as 1.9 billion Poisson random variables, each of mean 0.16, and each corresponding to a birthdate-zip code-gender triplet. How many of these do we expect to have value 1 (meaning that that triplet picks out exactly one person)? The probability that a Poisson(0.16) random variable takes the value 1 is exp(-0.16)*(0.16). Thus we find that there are (1.9 billion)*(0.16)*exp(-0.16) people uniquely identified by this triplet, out of (2.5 billion)*(0.16) people.

According to this crude model, the probability that a random individual is uniquely identified by these three pieces of information, then, is exp(-0.16), or about 85%. Why is everybody so surprised?