18 August 2007

more fun with sexual means

I, and about a zillion other people, responded to Gina Kolata's article in the NYT last week, which confused medians and means while claiming that men had more sexual partners on average than women.

There is a follow-up article by Kolata in today's Times (Sunday).

There was blogospheric clamor for the full distribution of the number of sexual partners of men and women; the original report from the CDC, it turns out, doesn't have that, but groups people into four groups -- those who have had 0 or 1 sexual partners, 2 to 6, 7 to 14, and 15 or more. This strikes me as insufficient resolution. In particular, zero is much different than one, as any virgin could tell you. Two seems a lot different than six, as well; two sexual partners could be someone who had sex with their spouse and one other person, whereas six sexual partners in a lifetime, although not a lot, can't have such a simple story behind it.

In any case, I'll reproduce the table. (The numbers are percentages.)
Partners   0-1    2-6    7-14   15+
Men        16.6   33.8   20.7   28.9
Women      25.0   44.3   21.3   9.4

The claimed medians for the number of sexual partners for men and women are seven and four, respectively. But 50.4% of men have had six or fewer sexual partners, according to this data. My earlier claim that the data might support a two-peaked distribution for women seems unlikely, but can't be ruled out at this resolution. (But I've seen enough other distributions that would explain the difference in medians that I don't really believe my own theory any more.) You can't extract means from this data -- and in fact at this resolution, it's theoretically possible that all the men could have 0, 2, 7, or 15 sex partners, for a mean of 6.46, and all the women 1, 6, 14, or [large number] sex partners, for a mean of at least 7.3 (if [large number] is in fact 15), making the female mean actually higher than the male mean. It would be simple (though I won't do it) to tweak the numbers so that the two means came out exactly equal.
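For anyone who wants to check the arithmetic, here is a minimal Python sketch of the extreme case described above. The percentages are the ones from the table; the function name mean_partners is my own, and using 15 for the women's open-ended top group is an assumption (it is simply the smallest value that group allows).

```python
# Percentages from the CDC table, in bin order: 0-1, 2-6, 7-14, 15+.
men = [16.6, 33.8, 20.7, 28.9]
women = [25.0, 44.3, 21.3, 9.4]


def mean_partners(percentages, values):
    """Mean number of partners if everyone in each bin had exactly
    the corresponding value (a hypothetical, not real data)."""
    total = sum(percentages)
    return sum(p * v for p, v in zip(percentages, values)) / total


# Every man sits at the bottom of his bin.
lowest_men_mean = mean_partners(men, [0, 2, 7, 15])

# Every woman sits at the top of her bin; the open-ended
# "15+" bin is assumed to be exactly 15, its minimum.
highest_women_mean = mean_partners(women, [1, 6, 14, 15])

# Share of men with six or fewer partners (first two bins).
men_six_or_fewer = men[0] + men[1]

print(f"men, lowest possible mean:  {lowest_men_mean:.2f}")
print(f"women, mean at bin tops:    {highest_women_mean:.2f}")
print(f"men with six or fewer:      {men_six_or_fewer:.1f}%")
```

Raising the value assumed for the women's top group only pushes their mean higher, which is why 7.3 is a floor for this scenario and not an estimate.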

Kolata (who has a master's in math, according to Wikipedia), however, claims that the data is inconsistent, in that there's no way to make the means equal: "I got between 40 percent and 75 percent more male than female partners depending on how you guess the average on each interval." I wonder what she tried. Sure, I'm just showing that it's possible the means are equal, not that it's likely. But someone with mathematical training should know better.

Ten thousand

The ten thousandth page load on this blog came today, at 5:31:47 AM (US Eastern Daylight Time, UTC-4), from someone located in or near Ithaca, New York, using Time Warner Cable's RoadRunner internet service; they viewed my post on the high school prom theorem (the article from the New York Times that inspired this, in which medians and means were confused, has been mentioned in quite a few blogs).

They came from this post at The Volokh Conspiracy; I'm not mentioned in the post, but John Armstrong provides a link to me in the comments.

They were using Firefox 2.0.0, which is the browser used by slightly more than half of my readers. (Generally the statistics hover around 60% Firefox, 20% IE, 10% Safari, and 10% random other browsers, which is much different from the Internet as a whole.)

Ten thousand isn't a huge number, but it's enough to make me realize that people actually are willing to hear what I want to say, and I appreciate that. Thanks for reading.

Social interactions aren't random

When I graduated from high school six years ago, I received a prestigious national award. The people who received it were selected based on some combination of standardized test scores, academic ability, a bunch of essays we wrote, and some sort of correction for geography I never quite understood. About one hundred and forty people -- all high school seniors -- receive this award each year.

In the two years since I've graduated from college, I have met two other people who have received this award, and who I did not meet through the schools that I attended. (It doesn't seem quite fair to include people who went to the same schools as I did, or indeed who I met through any academic channels, because the award was supposedly selecting for academic ability.) One is an ex-girlfriend; the other is someone that I met the most recent time that I was looking for a place to live. (I live alone; after a string of potential roommates had "friends" materialize that were suddenly interested in their empty rooms, I realized that people were trying to tell me something.) Neither of them volunteered this particular piece of information (to be honest, it's basically irrelevant); in both cases I found it out by Googling them.

And actually, upon scanning the names of people who've gotten this award, I found a third one I know, the semi-invisible housemate of some friends of mine.

So, how likely is it that I would have met two (or three) such people in the last two years, out of all the people I've met in that time? The word "met" is a bit vague here, but since in both cases I found out this piece of information by Googling, in order to identify this fact I would have to have enough information to find them on Google. In particular, I'd have to know their last name.

How many people have I met over the last two years? I don't know an exact answer, but it seems like one person a day would be fairly accurate, if perhaps a bit high -- for a total of seven hundred and thirty people. This is a tricky number to estimate because on most days I meet no new people, but on some days I meet lots of new people. I suspect I'm not unique here.

Furthermore, I'm going to make the very crude assumption that everyone I meet is between one year younger than me and three years older than me. This is surprisingly close to being the truth -- and actually, as you'll see, the exact range is irrelevant. This means that there are five years' worth of potential award recipients for me to meet, or 700 people. (As I said previously, there are 140 recipients each year.) The total number of 17-year-olds in the country in 2000 was about four million. (There were about twenty million people between the ages of 15 and 19; I'm assuming one-fifth of them were seventeen.) So the fraction of people who were recipients of this award is 700 (the number of recipients in five years) in twenty million (the number of people that were the right age in those five years), or about one in thirty thousand. You'd expect me to have met about one-fortieth of one of these people. I've met three. Just how unlikely is this?

We can compute this using the binomial distribution. I'll let C(n,k) denote the binomial coefficient "n choose k", which is n!/(k! (n-k)!). (Incidentally, these are some of my favorite numbers. Yes, I have favorite numbers.) We can look at the probability that each person I've met is an award recipient, which is about one in thirty thousand; let p = 1/30000. Let q = 1-p be the probability that a given person is not an award recipient. (Note that in probability, q is almost always 1-p. This differs from the convention in number theory, where p is a prime, and q is a prime that isn't p.)

The assumption of independence is a bit sketchy, but each of the three people in question I've met through different people and in different ways, so it seems reasonable. This sort of thing isn't always reasonable; for example, one feature of my current life is that I know a lot of people who went to Moravian College, which seems noteworthy. But I met one of them and then she introduced me to the others, so it's not all that weird.

Anyway, the probability that I've met exactly k award recipients is

P(k) = C(730,k) p^k q^{730-k}

and we compute P(0) = 0.9760, P(1) = 0.0237, P(2) = 2.89 × 10^{-4}, P(3) = 2.33 × 10^{-6}.

The chances that I've met at least two award recipients are P(2) + P(3) + ... + P(730); since all the other terms are ridiculously small compared to P(2), we'll call it just P(2), and we see that the chances of meeting two award recipients -- assuming that I meet people at random -- are about one in 3400. The chances that I meet at least three are very nearly P(3), or one in 420,000 or so.
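Here is a short Python sketch of the binomial calculation, for anyone who wants to reproduce it. It uses the same rough inputs as the text (730 people met, p = 1/30000) and keeps the same independence assumption; the function names prob_exactly and prob_at_least are mine.

```python
from math import comb

n = 730        # people met in two years (the rough estimate above)
p = 1 / 30000  # chance that a random person of the right age got the award
q = 1 - p


def prob_exactly(k):
    """Binomial probability of meeting exactly k recipients."""
    return comb(n, k) * p**k * q**(n - k)


def prob_at_least(k):
    """Probability of meeting k or more recipients (complement of fewer than k)."""
    return 1 - sum(prob_exactly(j) for j in range(k))


print(f"expected recipients met: {n * p:.4f}")
for k in range(4):
    print(f"P({k}) = {prob_exactly(k):.3e}")
for k in (2, 3):
    print(f"at least {k}: about one in {1 / prob_at_least(k):,.0f}")
```

Computing the tail as a complement avoids adding up hundreds of vanishingly small terms, and it shows directly how little the terms beyond the first contribute.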

What do I conclude from this? That I don't meet people randomly. Neither do you. We generally tend to meet people who are like us in terms of socioeconomic status, level of education, age, political leanings, and so on -- all things that are probably correlated with this award. (Yes, I said "political leanings" are correlated with an academic award. I invite you to contradict me.) This is also a problem that occurs in the small world phenomenon (sometimes more popularly known as "six degrees of separation") -- we generally know people that are like us, but somehow we're linked by short chains to people who have absolutely nothing in common with us. I expect I'll write about this in the future.

17 August 2007

some thoughts on numberpedia

Numberpedia: Store, share and search the world's statistics. From The Numbers Guy.

The goal of this site is nicely put at their overview:

Up to 25% of all news articles written in any given day are based around some kind of statistics. Many times when we read one of these articles we want to remember a particular statistic without bookmarking the entire page. Search results still require you click a link and read through paragraphs before we find the relevant number. In some instances, statistics are described in different way but are logically comparable. Search engines have a hard time returning all of the relevant statistics and projections because of the nuances of language and thought around forecasts and projections.

For the most part the numbers that are there are taken directly from news articles; what I'd like to see would be something that explains where a given statistic is really coming from, a sort of provenance for numbers. (Of course, one can presumably do this by tracing back to the original source.) This would be nice because then one could know when two statistics are comparable; I want to know if I'm comparing apples and apples, apples and oranges, apples and hamburgers, or apples and televisions. (Televisions are obviously very different from apples, because you can't eat them. I was going to say "apples and computers" but then saw the potential for misinterpretation.) Of course, this would also require a lot more work, and a lot of the time you just can't tell where people got their numbers from. It appears that the software includes the feature of "discussing" a given number, though, so this might naturally evolve if the site takes off as people begin to see numbers juxtaposed and wonder what to make of it.

The nuances of language as they apply to various questions of forecasting and projecting might be lost on people, too. It's interesting how logically equivalent questions can give much different results in political polls, for example.

more on communication and education

Vlorbik on Math Ed brings us the pencil rant, in which you expect him to say -- but he actually doesn't -- that math should be done in pencil, at least by students. Personally, I don't care what my students use, so long as their work is readable and not full of cross-outs. (It is surprisingly difficult to get them to do this; they seem to believe that there is no value in making one's work be presentable.) I use pen, myself, mostly because pens are lower-maintenance than pencils -- they don't need to be sharpened. Also, things I wrote in pen are still legible months or years later, while pencils smudge. Most importantly, if I erased everything I did that I thought was wrong but later turned out to be right, I'd never get anything done because I would constantly be retracing my steps! A teacher of mine in middle school once took off points because I took a quiz in pen. (If I remember correctly -- and I might not -- he had never explicitly said to use pencil.) My father was outraged when he learned about this; he said that the teacher ought to have given me extra points for being confident enough to write in something non-erasable.

Calculating the Word Spurt from MathTrek. Certain words are easier for children to learn than others (what makes a word easy to learn isn't entirely clear, but it seems to depend on a large number of factors, so the "difficulty" of learning a word is normally distributed). A child needs to hear an "easy" word fewer times than they need to hear a "hard" word in order to learn it. Thus a child will start out by learning a trickle of words, but when they get to the point when they've heard the medium-difficulty words enough times to learn them, that's when the flood comes.

From what I can gather from the coverage I've read of this (I can't see the actual article), this particular theory applies only to little kids, and considers words independently of each other. But I would imagine that if you know certain words, it's easier to learn others. The obvious example is a lot of technical terminology which has explicit definitions which use other technical terminology, but non-technical natural language could be the same way. People talk about figuring out vocabulary by context. If there's one word in a sentence you don't know, you might be able to figure out what it means; if there are two words you don't know, probably not. (However, if you hear those two words in two sentences you might be able to figure it out; I'm seeing flickers of an analogy which identifies sentences with equations and unknown words with variables.)

This last analogy has some interesting (and probably nearly trivial) ramifications for mathematics education, indeed education of all kinds -- try not to introduce too many new concepts at once. I have had professors who might have, say, ten things they want to say in a given lecture, and they cram them all into the first ten minutes, or the last ten minutes. By simply reordering what they say they could probably do a better job of facilitating the learning process. Similarly, giving a definition of the form "An X that has properties P1, P2, ... P10 is a Y", although logically sound, isn't cognitively sound -- it's better to break the definition up into chunks. "An X which has properties P1, P2, and P3 is said to be R. An X which has properties P4, P5, and P6 is said to be S. An X which has properties P7 through P10 is said to be T. Something which is R, S, and T is said to be Y." Those of us learning are not computers; we are humans, with human brains.

This sort of "chunking" probably comes about naturally if one does not talk from meticulously prepared notes, which is Vlorbik's suggestion at Jazz Math Ed. The human brain won't remember that list of ten things, but it will remember the chunked version, so the chunked version is what will come out. I believe that one of the worst sins of some mathematicians is to write everything as if it is reference material for those who already know it, therefore making it incredibly difficult to digest. (I almost called this blog "fuck Bourbaki", in fact, since that's a hallmark of the Bourbaki style, but having an obscenity in the title seemed unwise.)

16 August 2007

Fibonacci win points

The weekend after next, I'm going to the National Baseball Hall of Fame. While poking around on the internet, I found some references to Whatever Happened to the Hall of Fame by Bill James. James is well-known as one of the first people to apply statistical methods to baseball (though I must confess I've never read any of his work; I might get this book. If the Hall of Fame is at all competent, they'll have it at the gift shop.) To be frank, from what I've heard a lot of his work doesn't really have a sound mathematical basis, but it's inspired people to look at the numbers in order to judge players' performance, and a lot of people (like, say, the folks at Baseball Prospectus) have taken their inspiration from him.

Anyway, the Wikipedia article mentions various methods that James came up with to judge a player over the course of a career; this includes the intriguingly named "Fibonacci win score", but doesn't explain how this is calculated. Naturally, I was curious. Google turned up this thread at baseball fever, which says that the Fibonacci win score for a pitcher is the number of career wins, times the winning percentage, plus the number of "marginal wins" (i.e. wins minus losses). This is typical of James in that it doesn't make sense at first -- why would you multiply wins by winning percentage?

It's called "Fibonacci" because of the answer to the natural question -- how does the Fibonacci win score for a player compare to their actual number of wins? Say a pitcher's winning percentage is k, and he won W games in his career. Then he lost [(1-k)/k]W games, and his number of win points is kW + W - [(1-k)/k]W. For this to be equal to W, we have

k + 1 - (1-k)/k = 1

and this has one root with k between 0 and 1, namely k = (√5 - 1)/2, or about .618; this is the limit of the ratio between consecutive Fibonacci numbers, hence the name. A pitcher with a better winning percentage than this will have a higher win score than his actual number of wins; a pitcher with a worse record than this will have a lower win score than his actual number of wins. (The highest win score in history is 511, by Cy Young, who won 511 games and lost 316 in his career; indeed, Young's win percentage was .618.) The purpose of this statistic is to reward pitchers that pitched well and penalize pitchers who were just mediocre over very long careers.
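The break-even claim is easy to check numerically. Below is a small Python sketch using the formula as described in the baseball fever thread (wins times winning percentage, plus wins minus losses); the function name win_score and the 100-win break-even pitcher are my own illustrative choices.

```python
from math import sqrt


def win_score(wins, losses):
    """Fibonacci win score: wins * winning percentage + (wins - losses)."""
    pct = wins / (wins + losses)
    return wins * pct + (wins - losses)


# Break-even winning percentage: the root of k + 1 - (1 - k)/k = 1
# lying between 0 and 1.
k = (sqrt(5) - 1) / 2

# A hypothetical pitcher with 100 wins and exactly that winning percentage
# (fractional losses are allowed here purely for the check).
wins = 100.0
losses = wins * (1 - k) / k
break_even_score = win_score(wins, losses)

# Cy Young's record as quoted above: 511 wins, 316 losses.
young_pct = 511 / (511 + 316)
young_score = win_score(511, 316)

print(f"break-even percentage: {k:.4f}")
print(f"score at break-even with 100 wins: {break_even_score:.4f}")
print(f"Cy Young: pct {young_pct:.3f}, win score {young_score:.1f}")
```

A winning percentage above k gives a score larger than the win total, and one below k gives a smaller score, which is the reward-and-penalty behavior described above.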

The other question that comes to mind is -- if a pitcher wins a game, or loses a game, what does this do to his number of win points? Let f(W,L) denote the win score of a pitcher with W wins and L losses; then we have

f(W,L) = W (W/(W+L)) + W - L = (2W^2 - L^2)/(W+L)

Incidentally, in this formula the numerator is negative for a pitcher whose winning percentage is less than 1/(1+√2), or .414. If we differentiate this with respect to W and simplify, we see that

f_W(W,L) = 2 - [L/(W+L)]^2

and thus an additional win gets a pitcher two win points, minus the square of his losing percentage. Similarly, we have

f_L(W,L) = -1 - [W/(W+L)]^2

and so a loss costs a pitcher one win point, plus the square of his winning percentage. It almost seems meaningless to say that, because there's no a priori reason why this particular arrangement of variables should mean anything -- though at least it's dimensionally consistent, and has units of wins; there are a lot of random-looking combinations of statistics that don't even do that!
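Since wins and losses actually come one at a time, it's worth checking how well the derivatives approximate the effect of a single extra game. This is a rough Python sketch; the 200-150 record is a made-up example, and the function names are mine.

```python
def win_score(wins, losses):
    """Win score in closed form: (2W^2 - L^2) / (W + L)."""
    return (2 * wins**2 - losses**2) / (wins + losses)


def marginal_win(wins, losses):
    """Partial derivative in W: two, minus the squared losing percentage."""
    return 2 - (losses / (wins + losses)) ** 2


def marginal_loss(wins, losses):
    """Partial derivative in L: minus one, minus the squared winning percentage."""
    return -1 - (wins / (wins + losses)) ** 2


# Hypothetical career record, chosen only for illustration.
W, L = 200, 150

# Actual change in win score from one more win or one more loss.
actual_win_change = win_score(W + 1, L) - win_score(W, L)
actual_loss_change = win_score(W, L + 1) - win_score(W, L)

print(f"one more win:  actual {actual_win_change:+.4f}, "
      f"derivative {marginal_win(W, L):+.4f}")
print(f"one more loss: actual {actual_loss_change:+.4f}, "
      f"derivative {marginal_loss(W, L):+.4f}")
```

For a pitcher with a few hundred decisions the two agree to within about a thousandth of a win point, so the derivative is a fair stand-in for the effect of one game.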

IN UR QUANTUM BOX... MAYBE

IN UR QUANTUM BOX... MAYBE, perhaps the best lolcat ever. (But then again, I'm biased due to the affection I have for the whole Schrodinger's Cat problem, because I like quantum mechanics and I like cats.)

The comments include the following gems:
"Of corse, kitteh 2 big 2 b unserten in qwantum way witout bein reely reely reely cold. Like reely neer absuloot 0."

"Oh hai! I brought u a wayv funkshun. But i collapsded it."

"I wuz playin wit string theoriez, but now Iz restin."

"Kitteh partikalz givin off much cyootness energi. Round O’cheezburgerz fer evybodi!"

"umm, wen box closed and sealed from environment, Shrodingur’s kitteh in ’sooper-posishon’ of beeing bof alive an ded. Wen u open box, you interfear wiv da kitteh, da wayv funkshun kolaps and u see kitteh is eyver alive or ded."

and about four-fifths of the way down, a copy of this list of feline laws of physics.

(from I CAN HAZ CHEEZBURGER?)