## 03 September 2007

### data mining, or, Shakespeare was a combinatorialist

A rant on the virtues of data mining (from Statistical Modeling, Causal Inference, and Social Science) and "Data Mining" = voodoo science (from The Geomblog).

Aleks at Statistical Modeling, etc. says that he does data mining, and that social scientists are not happy to see this and say things like "that's not science". (Incidentally, I tend to be skeptical of any discipline that has "science" in its name. If you have to tell everyone you're a science, perhaps you're trying to hide something. It's like restaurants that serve "authentic Chinese cuisine".)

Now, I don't know a huge amount about data mining techniques. And it seems to me that it would be irresponsible to do the following:

• Take a data set from which one could generate a large number of different hypotheses.

• Start picking hypotheses at random.

• When you find a hypothesis that is significant at the 95% confidence level, say "Eureka!" and publish.

This is silly because at the 95% confidence level, you'd expect to get a hit one time out of twenty purely by chance, even when there is nothing there to find. I sincerely hope that if the people doing this stuff do things analogous to what I just described, they use some much higher confidence level.

Data miners also seem to draw a distinction between exploratory and confirmatory data mining. They might use a technique like the one I just described (although more sophisticated) to find hypotheses that are worth looking at and studying in more detail. This, of course, is something we all do every day -- we look around and when things are interesting we look at them more closely. We cannot look at everything very closely because then our brains would explode.

The trick here, I suppose, is knowing how to distinguish signal from noise. My favorite example of the moment is Example I.11 of Flajolet and Sedgewick's Analytic Combinatorics (link goes to 12 MB PDF of the book). They tell us that the pattern "combinatorics" is hidden in the text of Hamlet, which begins as follows:

Who's there?

Nay, answer me: stand, and unfold yourself.

Long live the king!

Bernardo?

He.

You come most carefully upon your hour.

'Tis now struck twelve; get thee to bed, Francisco.

For this relief much thanks: 'tis bitter cold,
And I am sick at heart.
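If "hidden" is read as "occurring as a subsequence", that is, the letters of the pattern appear in order but not necessarily next to each other, then occurrences can be counted with a short dynamic program. This is my own sketch rather than anything taken from the book; the function name is made up, and ignoring case is an assumption on my part.

```python
def count_hidden_occurrences(text: str, pattern: str) -> int:
    """Count the ways pattern occurs in text as a (not necessarily
    contiguous) subsequence. Case is ignored, which is an assumption."""
    text = text.lower()
    pattern = pattern.lower()
    # ways[j] = number of ways to match the first j letters of pattern
    # using the part of text read so far.
    ways = [0] * (len(pattern) + 1)
    ways[0] = 1  # the empty prefix matches in exactly one way
    for ch in text:
        # Go right to left so each text character extends a match at most once.
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == ch:
                ways[j] += ways[j - 1]
    return ways[len(pattern)]


if __name__ == "__main__":
    opening = (
        "Who's there? Nay, answer me: stand, and unfold yourself. "
        "Long live the king! Bernardo? He. "
        "You come most carefully upon your hour. "
        "'Tis now struck twelve; get thee to bed, Francisco. "
        "For this relief much thanks: 'tis bitter cold, "
        "And I am sick at heart."
    )
    print(count_hidden_occurrences(opening, "combinatorics"))
```

The running time is proportional to the length of the text times the length of the pattern, so the same function could be run over the whole play. Python integers do not overflow, which matters because such counts grow very quickly with the length of the text.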