29 October 2007

Bayesian gender spam

A Bayesian explanation of how to determine the gender of a person on the street (from observable cues), by Meep.

It's rather similar to Bayesian spam filtering (Paul Graham, see also here). The major difference is that one can generally assume that most e-mail is spam, whereas one cannot assume that most people are of one or the other of the two canonical genders.

In the spam filtering case, though, the prior probability that a message is spam doesn't seem to matter much; Graham claims that most e-mails come out either very likely or very unlikely to be spam. But there are probably more words in an e-mail than there are easily observable cues to a person's gender, so a middling result (say, a 60% probability that a person is male) seems much more likely than a 60% probability that an e-mail is spam.
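To make that concrete, here is a rough sketch in Python of the naive-Bayes style combination used in this kind of filter, extended with an explicit prior. The function name `posterior` and all of the numbers are made up for illustration, and the cues are assumed to be independent, which real cues (and real words) are not.

```python
def posterior(prior, cue_probs):
    """Combine a prior with per-cue probabilities, treating cues as independent.

    Each entry of cue_probs is the probability of the class given that one
    cue alone, under an even (50/50) prior. Values must lie strictly
    between 0 and 1.
    """
    odds = prior / (1 - prior)        # prior odds
    for p in cue_probs:
        odds *= p / (1 - p)           # each cue multiplies the odds
    return odds / (1 + odds)          # convert odds back to a probability


# Hypothetical numbers: three weak cues about a person on the street.
few_cues = posterior(0.5, [0.6, 0.55, 0.5])

# Hypothetical numbers: fifteen fairly strong words in an e-mail,
# evaluated under two very different priors.
many_words_even_prior = posterior(0.5, [0.9] * 15)
many_words_low_prior = posterior(0.2, [0.9] * 15)

print(few_cues, many_words_even_prior, many_words_low_prior)
```

With only a few weak cues the result stays in the middle and the prior still has a say; with many strong cues the result is pushed towards 0 or 1 and the choice of prior barely changes it.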

Also, it's a lot easier to collect the necessary data for spam filtering than for gender determination.


Anonymous said...

Women tend to use many more hand gestures while talking. What I noticed lately is that everyone is on a cell phone and women are moving their hands more than men.

Men drink scotch or beer; women drink white wine.

Anonymous said...

I found it interesting that you posted this on the heels of your previous thoughts on merging lists. When I read that original post my first thought was actually to consider Bayesian statistics.

The reason I thought that was, consider two lists like, for example, the list of males on the street and the list of females on the street. We can further imagine that the list is an ordered list such that easily identified men appear higher up on the list while more androgynous men would appear much lower. Likewise for the list of females.

Now, at the risk of offending someone somewhere, I can suggest that perhaps, instead of two disjoint lists, we in fact have one single list that describes a gender spectrum. Merging the two lists then becomes trivial.

This is of course identical to Bayesian spam filters. The system typically separates mail into two folders upon receipt, In Box and Junk. Sometimes spam makes it into the In Box because it just barely exhibits the statistical characteristics of a valid message, while sometimes valid messages just barely exhibit spam-like qualities and get sent to the Junk folder.

Instead of viewing one's mail as two distinct lists it can also be viewed as a spectrum of spam-ness where a request from the thesis adviser is 100% valid email while a Nigerian business deal is 100% not valid. Again the two lists merge easily into a single list.

Perhaps I'm missing the overall point of the exercise.