I'm applying for jobs. Many jobs, because that's how academic job searches work these days. So I have a spreadsheet (in OpenOffice) to keep track of them.
Among the things that I track for each job, there is a column with 0, 1, or 2 in it. 0 means that I haven't submitted anything; 1 means I've submitted something, but not everything that was asked for; 2 means the application is complete. Averaging these numbers and dividing by 2 tells me what proportion of the search is complete.
But I also wanted to know how many 0s, 1s, and 2s there were. And as far as I know the built-in functions in OpenOffice won't do that.
What they will do, however, is this. I have a column consisting of 0s, 1s, 2s, and empty cells. By doing
COUNT(J8:J1000)
SUM(J8:J1000)
SUMPRODUCT(J8:J1000;J8:J1000)
I get the number of cells in that column which are nonempty; their sum; and the sum of their squares. (The SUMPRODUCT function takes 2 arrays of the same shape and returns the sum of the products of corresponding cells.) "8" is the row that contains the first job on the list, and "1000" is just a number that is comfortably more than the number of jobs I am applying for. Call these a, b, and c respectively. Let n0, n1, and n2 be the number of entries which are 0, 1, and 2 respectively. Then I have
a = n0 + n1 + n2
b = n1 + 2n2
c = n1 + 4n2
which is a three-by-three linear system, and can be solved for n0, n1, n2, giving
n0 = a - 3b/2 + c/2, n1 = 2b-c, n2 = (c-b)/2
and so I can recover the number of applications with status 0, 1, or 2 from this. From the sums of the 0th, 1st, and 2nd powers I can recover the distribution of the values themselves. (The actual code is slightly different, but of course equivalent, because I solved the system "by inspection" and never actually explicitly wrote it out until just now.)
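For what it's worth, here's a minimal Python sketch of the same bookkeeping outside the spreadsheet; the status column is made up for illustration, and the formulas are exactly the ones above.

# Status column: 0 = nothing submitted, 1 = partial, 2 = complete, None = empty cell (made-up data).
statuses = [2, 0, 1, 2, None, 0, 2, 1]
filled = [x for x in statuses if x is not None]

a = len(filled)                   # COUNT: number of nonempty cells
b = sum(filled)                   # SUM
c = sum(x * x for x in filled)    # SUMPRODUCT of the column with itself

n0 = a - 3 * b / 2 + c / 2        # applications not yet started
n1 = 2 * b - c                    # partially submitted
n2 = (c - b) / 2                  # complete

print(n0, n1, n2)                 # 2.0 2 3.0: two not started, two partial, three complete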
Believe it or not, I actually use this trick in a preprint, "The number of cycles of specified normalized length in permutations", to do some actual mathematics! There I find the expectations of X^0, X^1, X^2, ..., X^k, where X is a certain random variable known to take on the values 0, 1, ..., k, namely the number of cycles of length in the interval [γn, δn] in a permutation of [n] chosen uniformly at random, where γ and δ are constants. Here k is the greatest integer less than or equal to 1/γ; for example, if we're looking at cycles of length at least 0.15n in permutations of [n], there can't be more than six of them. This gives a linear system like the one above, from which the probability that X takes on each value 0, 1, ..., k can be recovered.
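For the curious, here is a hedged numpy sketch of the general idea (not the preprint's actual computation): if X takes values 0, 1, ..., k, the moments E[X^0], ..., E[X^k] determine the distribution via a Vandermonde system. The distribution used below is invented purely to illustrate the inversion.

import numpy as np

k = 3
p_true = np.array([0.4, 0.3, 0.2, 0.1])    # made-up distribution of X on {0, 1, 2, 3}

# moments[i] = E[X^i], computed directly from the made-up distribution
moments = np.array([sum(p_true[j] * j**i for j in range(k + 1)) for i in range(k + 1)])

# V[i, j] = j**i, so that V @ p = moments; solving the system recovers the distribution
V = np.array([[j**i for j in range(k + 1)] for i in range(k + 1)], dtype=float)
p = np.linalg.solve(V, moments)
print(p)                                    # recovers 0.4, 0.3, 0.2, 0.1 (up to floating-point error)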
31 December 2009
18 December 2009
Uniquely identifying people by birth date, gender, and zip code
Netflix Spilled Your Brokeback Mountain Secret, Lawsuit Claims. A woman is suing Netflix because she was in the closet, and her movie-rental data was part of the Netflix prize dataset. She claims this means that people could figure out her secret.
Now Netflix is starting a second contest, and rumor has it that the data will include the zip code, birthdate, and gender of each individual. According to this paper (abstract only, unfortunately, so I can't comment on methods) by Latanya Sweeney, this is enough to uniquely identify 87% of the US population. A paper by Philippe Golle gives the figure as 63%, based on actual Census Bureau data. (The Census gives the number of people with each birth year and gender in each zip code.)
Is it surprising that people can be identified this easily?
From the Golle paper, there are 33,233 "Zip Code Tabulation Areas" in the United States.
US life expectancy is 77.7 years. Since this is a back-of-the-envelope calculation, let's assume that everybody drops dead after 77.7 years (28,379 days), and therefore that the age of a random individual is uniformly distributed over the last 28,000 days. (It pains me to say this, because my grandmother is 85 and still living.)
There are, to a first approximation, two genders.
Therefore there are 28,379 * 33,233 * 2, or about 1.9 billion, possible combinations of birthdate, zip code, and gender. There are about 300 million Americans. If we assume all of these are equally likely (which they're not; some ages are more likely than others, and some zip codes have more people than others), and that they're independent (which they're not, as anybody who's lived in a college town can tell you; Golle notes the college-town effect, and also a military-base effect), then on average the number of people having a given (birthdate, zip code, gender) triplet is about 0.16.
So we'll model the population of the US as 1.9 billion Poisson random variables, each of mean 0.16, and each corresponding to a birthdate-zip code-gender triplet. How many of these do we expect to have value 1 (meaning that that triplet picks out exactly one person)? The probability that a Poisson(0.16) random variable takes the value 1 is exp(-0.16)*(0.16). Thus we find that there are (1.9 billion)*(0.16)*exp(-0.16) people uniquely identified by this triplet, out of (1.9 billion)*(0.16) people.
According to this crude model, the probability that a random individual is uniquely identified by these three pieces of information, then, is exp(-0.16), or about 85%. Why is everybody so surprised?
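A few lines of Python reproduce the arithmetic; this is just the crude model above, nothing fancier.

from math import exp

cells = 28_379 * 33_233 * 2    # possible (birthdate, zip code, gender) triplets, about 1.9 billion
population = 300e6             # rough US population
mu = population / cells        # mean occupancy per cell, about 0.16

# cells * mu * exp(-mu) people sit alone in their cell, out of cells * mu people in total,
# so the chance that a random person is uniquely identified is exp(-mu).
print(mu, exp(-mu))            # about 0.159 and 0.853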
08 December 2009
Distribution of Putnam scores
The distributions of Putnam exam scores are interesting. See, for example, the 2001 distribution. It takes a bit of number-crunching to get an actual distribution of scores from the data; they report the "rank" of the people getting each score. The rank corresponding to a given score is, I assume, A+(B+1)/2 where A is the number of people scoring higher than that score and B is the number of people scoring that particular score. For example, in 2001 -- which happens to be one of the years in which I took the Putnam -- the table begins
Score | 101 | 100 | 86 | 80 | 79 | 77 | 73 | 72 | 71 | 70 | 69 | 68 |
Rank | 1 | 2 | 3 | 4.5 | 6 | 7.5 | 9 | 11 | 14 | 16.5 | 19 | 23.5 |
Number | 1 | 1 | 1 | 2 | 1 | 2 | 1 | 3 | 3 | 2 | 3 | 6 |
where the first two rows are provided by the organizers, and the third row can be worked out by working left to right. For example, once we know 17 people got 70 or better, the fact that the score 69 corresponds to rank 19 means that the people scoring 69 must have been the 18th, 19th, and 20th-best; so there were three of them. (Incidentally, most increasing sequences of half-integers, when interpreted as sequences of ranks, don't appear to correspond to legitimate score distributions; the number of people getting certain scores ends up negative if you're not careful.)
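Here is a short Python sketch of that left-to-right reconstruction, using the top of the 2001 table above; since rank = A + (B+1)/2, each count is B = 2(rank - A) - 1, where A is the running total of people who scored higher.

scores = [101, 100, 86, 80, 79, 77, 73, 72, 71, 70, 69, 68]
ranks = [1, 2, 3, 4.5, 6, 7.5, 9, 11, 14, 16.5, 19, 23.5]

counts = []
higher = 0                          # A: people with a strictly higher score so far
for r in ranks:
    b = int(2 * (r - higher) - 1)   # B: people with exactly this score
    counts.append(b)
    higher += b

print(counts)                       # [1, 1, 1, 2, 1, 2, 1, 3, 3, 2, 3, 6], matching the third row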
Anyway, if you crunch the numbers on a typical Putnam score distribution you observe two things:
- the scores follow, roughly, a power law; the number of people scoring 10n decays like some power of n, for integer n.
- once you remove this decay (which I haven't actually done; I've just eyeballed it), there are "spikes" at multiples of 10. For example, the numbers of people scoring 18, 19, 20, 21, 22, 23 in 2001 were 8, 23, 99, 60, 39, 11. Twenty-four people scored 50; seven scored each of 49 and 51.
I can't explain the first one (and it may just be an artifact of the way I'm doing the plotting; lots of things look close to linear when plotted on a logarithmic scale). But the second one is actually easy to explain; Putnam problems are worth ten points each, and most scores on an individual problem are 0 or 10, with a smattering of 1, 2, 8, or 9. Scores between 3 and 7 on a problem are exceedingly rare. So to get a score of, say, 55, one has to get five problems right and have made a bit of progress on three to five more, which is less likely than straight-out solving five or six problems (for 50 or 60, respectively).
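As a sanity check on that explanation, here's a quick simulation; the per-problem score probabilities are invented (I don't have the real ones), but anything weighted heavily toward 0 and 10 produces the same spikes.

import random
from collections import Counter

random.seed(0)
problem_scores = [0, 1, 2, 8, 9, 10]
weights = [0.75, 0.03, 0.02, 0.01, 0.04, 0.15]    # made-up per-problem probabilities

# Total score over twelve problems, for 100,000 simulated contestants.
totals = Counter(sum(random.choices(problem_scores, weights, k=12)) for _ in range(100_000))
for s in range(15, 26):
    print(s, totals[s])    # the count at 20 stands well above those at 19 and 21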
Incidentally, I haven't looked at the problems from the 2009 Putnam, because I have work to do.
22 October 2009
Math Overflow
Because I'd have to forfeit my math blogger card otherwise: you should know about Math Overflow. This is a site where people can ask mathematics questions -- the level is basically that of questions which would be of interest to professional mathematicians. I'm rather enjoying it; it's procrastination and education at the same time!
It is also useful for asking questions and being told that I already answered them on this blog.
08 October 2009
Evidence that mathematicians have a big Internet presence
If you google genealogy, the first hit is the Mathematics Genealogy Project. In some sense, according to "the Internet", mathematical genealogy is more important than real genealogy! (I have a feeling that the biological parents of mathematicians would be offended by this, so I won't tell my parents.)
If you google AMS, the first hit is the American Mathematical Society. (Societies of meteorologists, musicologists, Montessori schools, etc. show up further down the list.) I have a musicologist friend with whom I joke about this, claiming that the mathematical society is the real AMS.
I suspect this is because mathematicians found the Internet early, and seem to be more likely to have personal web pages than even most other academics.
Edit, 4:18 pm: In a comment by Boris, I'm reminded that Google personalizes search results, so what I've said is not true.
30 September 2009
Is zero even?
Did you know that there are actually things to say about whether zero is even or odd (from Wikipedia)? Obviously it's even, but the math-ed folks have seriously looked at this.
I found this via a comment by John Thacker at The Volokh Conspiracy. There's a poll there; right now 2% of people have said 0 is odd, 51% even, 43% both, 4% neither. I can kind of understand what's going on with people saying "neither" (perhaps they're getting this from some elementary-school notions), but how is 0 odd?
My answer: yes, zero is even, because it's twice an integer.
(Or because the identity permutation on n letters is an element of the alternating group An -- I've been thinking about permutations a lot lately. But if you understand that, you probably are like me, think zero is even, and didn't even think there was anything to discuss.)
Incidentally, sometime recently -- I forget the context -- I saw something that referred to the Gaussian integer a+bi as "uneven" if and only if a and b had different parity.
27 September 2009
Economic impact of mathematics?
Tim Gowers wrote in The Importance of Mathematics:
If you were to work out what mathematical research has cost the world in the last 100 years, and then work out what the world has gained, in crude economic terms, then you would discover that the world has received an extraordinary return on a very small investment.
I don't doubt this. But has anyone actually tried to do this? (And would the numbers even be meaningful?)