31 December 2009

A hack I'm disturbingly proud of, and its connection to some real math

I'm applying for jobs. Many jobs, because that's how academic job searches work these days. So I have a spreadsheet (in OpenOffice) to keep track of them.

Among the things that I track for each job, there is a column with 0, 1, or 2 in it. 0 means that I haven't submitted anything; 1 means I've submitted something, but not everything that was asked for; 2 means the application is complete. Averaging these numbers and dividing by 2 tells me what proportion of the search is complete.

But I also wanted to know how many 0s, 1s, and 2s there were. And as far as I know the built-in functions in OpenOffice won't do that.

What they will do, however, is this. I have a column consisting of 0s, 1s, 2s, and empty cells. By doing


I get the number of cells in that column which are nonempty; their sum; and the sum of their squares. (The SUMPRODUCT function takes 2 arrays of the same shape and returns the sum of the products of corresponding cells.) "8" is the row that contains the first job on the list, and "1000" is just a number that is comfortably more than the number of jobs I am applying for. Call these a, b, and c respectively. Let n0, n1, and n2 be the number of entries which are 0, 1, and 2 respectively. Then I have

a = n0 + n1 + n2
b = n1 + 2n2
c = n1 + 4n2

which is a three-by-three linear system, and can be solved for n0, n1, n2, giving

n0 = a - 3b/2 + c/2, n1 = 2b-c, n2 = (c-b)/2

and so I can recover the number of applications with status 0, 1, or 2 from this. From the sums of the 0th, 1st, and 2nd powers I can recover the distribution of the values themselves. (The actual code is slightly different, but of course equivalent, because I solved the system "by inspection" and never actually explicitly wrote it out until just now.)

Believe it or not, I actually use this trick in a preprint, "The number of cycles of specified normalized length in permutations", to do some actual mathematics! There I find the expectation of X0, X1, X2, ..., Xk where X is a certain random variable known to take on the values 0, 1, ..., k, namely the number of cycles of length in the interval [γ n, δ n] in a permutation of [n] chosen uniformly at random where γ and δ are constants. k is the greatest integer less than or equal to 1/γ ; for example, if we're looking at cycles of length at least 0.15n in permutations of n, there can't be more than six of them. This gives a linear system like the one above which gives the probability that X takes on each value 0, 1, ..., k.


christopherdrup said...

Have you tried the Frequency command in OpenOffice? I think it will perform the counting operation you are looking for.

Unknown said...

I believe that the COUNTIF function is implemented which should let you count how many items have 0, > 0, 1, > 1, or 2.

Mike said...

Good luck on your job search, btw. The process does seem to be good at producing unlimited amounts of anxiety.

Anonymous said...

In EXCEL, @COUNTIF(range,criteria)

Counta for non-empty.

Or, if you have a column to waste, set up a column that is 100^n (where n is the corresponding 0, 1, or 2. The sum of the new column will be a nifty 6 digit number, encoding all the info you want.

Omar said...

Nice hack, of course, but it sounded silly to me that OpenOffice didn't have a function to count values satisfying some condition (like being equal to some specific number, for example).

So, as I would have done, had I needed that function, I used a popular search engine and found the countif function your other commenters mentioned.

I'm curious: did you do it your way instead of googling the name of the appropriate function because it was more fun? I often do things like that...

Unknown said...

I usually use the data pilot for this kind of analysis. By default it presents sums, but it's easy to change to counts instead.