03 September 2007

data mining, or, Shakespeare was a combinatorialist

A rant on the virtues of data mining (from Statistical Modeling, Causal Inference, and Social Science) and "Data Mining" = voodoo science (from The Geomblog).

Aleks at Statistical Modeling, etc. says that he does data mining, and that social scientists are not happy to see this and say things like "that's not science". (Incidentally, I tend to be skeptical of any discipline that has "science" in its name. If you have to tell everyone you're a science, perhaps you're trying to hide something. It's like restaurants that serve "authentic Chinese cuisine".)

Now, I don't know a huge amount about data mining techniques. But it seems to me that it would be irresponsible to do the following:

  • Take a data set from which one could generate a large number of different hypotheses.

  • Start picking hypotheses at random.

  • When you find a hypothesis that is true at the 95% confidence level, say "Eureka!" and publish.


This is silly because at the 95% confidence level, one expects a hit about one time out of twenty purely by chance, even when there is nothing there to find. I sincerely hope that if the people doing this stuff do things analogous to what I just described, they use some much higher confidence level.
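To put a rough number on this, here is a minimal sketch of the arithmetic. It assumes the hypotheses are tested independently and that none of them is actually true, which is the worst case for the "Eureka!" strategy. The function names are mine, and the Bonferroni correction is just one standard way of choosing a "much higher confidence level"; I am not claiming it is what data miners actually use.

```python
def prob_at_least_one_false_hit(num_hypotheses, alpha=0.05):
    """Chance that at least one of num_hypotheses independent tests of
    false hypotheses comes out 'significant' at level alpha.

    Assumes independence between tests, which real data rarely gives you.
    """
    return 1.0 - (1.0 - alpha) ** num_hypotheses


def bonferroni_threshold(num_hypotheses, alpha=0.05):
    """Per-test significance level under the Bonferroni correction, which
    keeps the overall chance of any false hit at or below alpha."""
    return alpha / num_hypotheses


if __name__ == "__main__":
    for m in (1, 20, 100):
        print(m, prob_at_least_one_false_hit(m), bonferroni_threshold(m))
```

With twenty hypotheses the chance of at least one spurious hit is already about 64%, so finding "something" is the expected outcome rather than a discovery.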

Data miners also seem to draw a distinction between exploratory and confirmatory data mining. They might use a technique like the one I just described (although more sophisticated) to find hypotheses that are worth looking at and studying in more detail. This, of course, is something we all do every day -- we look around and when things are interesting we look at them more closely. We cannot look at everything very closely because then our brains would explode.

The trick here, I suppose, is knowing how to distinguish signal from noise. My favorite example of the moment is Example I.11 of Flajolet and Sedgewick's Analytic Combinatorics (link goes to 12 MB PDF of the book). They tell us that the pattern "combinatorics" is hidden in the text of Hamlet, which begins as follows:

Who's there?

Nay, answer me: stand, and unfold yourself.

Long live the king!

Bernardo?

He.

You come most carefully upon your hour.

'Tis now struck twelve; get thee to bed, Francisco.

For this relief much thanks: 'tis bitter cold,
And I am sick at heart.

Have you had quiet guard?

Not a mouse stirring.

Well, good night.
If you do meet Horatio and Marcellus,

At this point we look at the bolded letters, jump up and down and say that Shakespeare was trying to tell us something, despite the fact that Shakespeare had never heard the word "combinatorics". But Shakespeare is also a Yankees fan (see the italicized letters) and went to Harvard (see the underlined letters), so we ought to be suspicious. It turns out that Hamlet contains 1.63 x 10^39 instances of the word "combinatorics", whereas a random sequence of letters chosen uniformly at random from the English alphabet and of the same length as Hamlet contains on average 6.96 x 10^37 such sequences, and a random sequence of letters chosen at random with the same distribution as normal English text contains on average 1.71 x 10^39 such sequences. So all we can conclude, it seems, is that maybe Shakespeare was writing in the same language that the word "combinatorics" is taken from. (One could try to compute the standard deviation associated with that 1.71 x 10^39 figure, but it's silly because English text is not created by reaching into a Scrabble bag over and over again.) Of course there are less frivolous examples -- one of them is looking at whether certain patterns that appear in the human genome occur more or less often than you'd expect by chance.
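For anyone who wants to play with this, here is a small sketch of how one might count hidden occurrences of a word (that is, occurrences as a not-necessarily-contiguous subsequence) and compare against the Scrabble-bag expectation. This is my own illustration, not the method from Flajolet and Sedgewick's book. The function names are made up, and the expectation formula assumes each letter of the text is drawn independently with the given probabilities.

```python
from math import comb


def letters_only(text):
    """Lowercase the text and drop everything that is not a letter a-z."""
    return "".join(ch for ch in text.lower() if "a" <= ch <= "z")


def count_hidden(pattern, text):
    """Count the ways pattern appears in text as a subsequence.

    ways[j] holds the number of ways to embed pattern[:j] in the part of
    the text read so far; j runs downward so each text letter is used at
    most once per embedding.
    """
    ways = [0] * (len(pattern) + 1)
    ways[0] = 1
    for ch in text:
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == ch:
                ways[j] += ways[j - 1]
    return ways[-1]


def expected_hidden(pattern, text_length, letter_prob):
    """Expected number of hidden occurrences in a random text whose letters
    are independent draws with probabilities letter_prob (a dict).

    By linearity of expectation this is (number of position choices)
    times (probability that those positions spell the pattern).
    """
    prob = 1.0
    for ch in pattern:
        prob *= letter_prob[ch]
    return comb(text_length, len(pattern)) * prob


if __name__ == "__main__":
    opening = letters_only("Who's there? Nay, answer me: stand, and unfold yourself.")
    uniform = {chr(ord("a") + i): 1 / 26 for i in range(26)}
    print(count_hidden("the", opening))
    print(expected_hidden("the", len(opening), uniform))
```

Swapping in measured English letter frequencies for `uniform` gives the second kind of baseline mentioned above, and the counts are exact integers, so a text the length of Hamlet is no problem.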
