Showing posts with label data mining. Show all posts

03 September 2007

data mining, or, Shakespeare was a combinatorialist

A rant on the virtues of data mining (from Statistical Modeling, Causal Inference, and Social Science) and "Data Mining" = voodoo science (from The Geomblog).

Aleks at Statistical Modeling, etc. says that he does data mining, and that social scientists are not happy to see this and say things like "that's not science". (Incidentally, I tend to be skeptical of any discipline that has "science" in its name. If you have to tell everyone you're a science, perhaps you're trying to hide something. It's like restaurants that serve "authentic Chinese cuisine".)

Now, I don't know a huge amount about data mining techniques. And it seems to me that it would be irresponsible to do the following:

  • Take a data set from which one could generate a large number of different hypotheses.

  • Start picking hypotheses at random.

  • When you find a hypothesis that is true at the 95% confidence level, say "Eureka!" and publish.


This is silly because at the 95% confidence level, one expects that you'd get a hit one time out of twenty purely by chance. I sincerely hope that if the people doing this stuff do things analogous to what I just said, they use some much higher confidence level.
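To put rough numbers on that worry, here is a minimal sketch. It assumes the hypotheses are all false and the tests are independent, which real data-mining runs rarely guarantee. The function names are mine, and the Bonferroni correction is one standard choice of "much higher confidence level", not something the data miners quoted above are known to use.

```python
def chance_of_false_hit(num_hypotheses: int, alpha: float = 0.05) -> float:
    """Probability that at least one of num_hypotheses tests of a false
    hypothesis looks 'significant' at level alpha (independent tests assumed)."""
    return 1.0 - (1.0 - alpha) ** num_hypotheses


def bonferroni_level(num_hypotheses: int, alpha: float = 0.05) -> float:
    """Per-test level that keeps the overall chance of a false hit
    at or below alpha (the Bonferroni correction)."""
    return alpha / num_hypotheses


if __name__ == "__main__":
    for n in (1, 20, 100):
        print(n, round(chance_of_false_hit(n), 3), bonferroni_level(n))
```

With twenty hypotheses the chance of at least one spurious "Eureka!" is already about 64%; testing each one at the 99.75% level instead brings it back under 5%.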

Data miners also seem to draw a distinction between exploratory and confirmatory data mining. They might use a technique like the one I just said (although more sophisticated) to find hypotheses that are worth looking at and studying in more detail. This, of course, is something we all do every day -- we look around and when things are interesting we look at them more closely. We cannot look at everything very closely because then our brains would explode.

The trick here, I suppose, is knowing how to distinguish signal from noise. My favorite example of the moment is Example I.11 of Flajolet and Sedgewick's Analytic Combinatorics (link goes to 12 MB PDF of the book). They tell us that the pattern "combinatorics" is hidden in the text of Hamlet, which begins as follows:

Who's there?

Nay, answer me: stand, and unfold yourself.

Long live the king!

Bernardo?

He.

You come most carefully upon your hour.

'Tis now struck twelve; get thee to bed, Francisco.

For this relief much thanks: 'tis bitter cold,
And I am sick at heart.

Have you had quiet guard?

Not a mouse stirring.

Well, good night.
If you do meet Horatio and Marcellus,

At this point we look at the bolded letters, jump up and down and say that Shakespeare was trying to tell us something, despite the fact that Shakespeare had never heard the word "combinatorics". But Shakespeare is also a Yankees fan (see the italicized letters) and went to Harvard (see the underlined letters), so we ought to be suspicious. It turns out that Hamlet contains 1.63 × 10^39 instances of the word "combinatorics", whereas a random sequence of letters chosen uniformly at random from the English alphabet and of the same length as Hamlet contains on average 6.96 × 10^37 such sequences, and a random sequence of letters chosen at random with the same distribution as normal English text contains on average 1.71 × 10^39 such sequences. So all we can conclude, it seems, is that maybe Shakespeare was writing in the same language that the word "combinatorics" is taken from. (One could try to compute the standard deviation associated with that 1.71 × 10^39 figure, but it's silly because English text is not created by reaching into a Scrabble bag over and over again.) Of course there are less frivolous examples -- one of them is looking at whether certain patterns that appear in the human genome occur more or less often than you'd expect by chance.
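Counts like these can be reproduced with a few lines of code. This is a sketch that assumes an "instance" means an occurrence of the word as a subsequence (its letters in order, with arbitrary gaps). The function names are mine, and the expected-value formula covers only the uniform-letters case.

```python
from math import comb


def count_hidden(text: str, pattern: str) -> int:
    """Count the ways pattern occurs in text as a subsequence
    (letters in order, with arbitrary gaps between them)."""
    # ways[j] = number of ways to match the first j pattern letters so far
    ways = [1] + [0] * len(pattern)
    for ch in text:
        # go right to left so one text letter fills at most one slot per match
        for j in range(len(pattern), 0, -1):
            if pattern[j - 1] == ch:
                ways[j] += ways[j - 1]
    return ways[len(pattern)]


def expected_hidden_uniform(text_length: int, pattern_length: int,
                            alphabet_size: int = 26) -> float:
    """Expected count in a text of independent, uniformly random letters:
    choose the positions, then each one must match with chance 1/alphabet_size."""
    return comb(text_length, pattern_length) / alphabet_size ** pattern_length
```

To try it on a real text, strip everything but letters, lowercase it, and call `count_hidden(letters, "combinatorics")`. Python's arbitrary-precision integers handle the enormous counts without any special care.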

01 September 2007

Beauty is in the eye of the beholder

Ben Goldacre, writing in Bad Science, tells us the story of how he was contacted by a PR firm which wanted him to concoct some equations that would "prove" that certain celebrities had sexier walks than certain other celebrities, as part of a promotion for Veet hair removal cream. I am not making this up. It's an extended version of his column in the Guardian.

The press release seems to have been reproduced by about a zillion British newspapers, usually in a shortened version; the longest version I could find is here, and I'll be quoting from it.
JESSICA Alba, the film actress, has the ultimate sexy strut, according to a team of UK mathematicians. Beauty is no longer in the eye of the beholder - it can now be worked out using a simple mathematical formula.

Bullshit! And this isn't just me being bitter that I might not fit some social standard of beauty. It's me being bitter that mathematics is being "used" in this way. In Goldacre's column, we learn that there was no team. Richard Weber at Cambridge is the mathematician who's mentioned there, but only after Goldacre was contacted -- and Goldacre as far as I can tell is a medical doctor. (By "there" I mean Goldacre's blog post; I can't find a version of the press release that mentions Weber.)
The academics found that it is the ratio between hips and waist that puts the sway into a woman's walk - and the nearer that ratio is to 0.7, the better.

No they didn't! This seems like the stuff you see out there every so often about the Golden Ratio making things more beautiful -- and in fact that ratio's been used to sell pants! -- where there's some "magic number". There's at least some justification for the 0.7 number, though, in that Real Scientists have done studies, although the preferred ratio varies by culture. And it never seems to be given to more than one decimal place, which suggests that a few inches don't matter. Apparently Weber made a more nuanced comment that got cut down to this.
This ratio provides the body with the right torso strength to produce a more angular swing and bounce to the hips during the walking motion.

Furthermore, the waist-to-hip ratio might actually be important for physical attractiveness, but nobody said that had anything to do with the walk. I don't know much about biomechanics, though. But it looks like the causality just isn't there.

Oh, and they screwed the survey up so badly that it doesn't even mention anything that hair removal cream could actually do. You'd expect a study by a hair removal cream company to say that having smooth, shiny, hairless [insert body part here] was an important part of beauty.

Fortunately, they're just using this to sell hair removal cream. Recent studies have also shown that studies which are funded by pharmaceutical companies are far more likely to say that drugs do something good than studies which are not funded by pharmaceutical companies; that sort of hijacking of the scientific apparatus is a lot more insidious, as people could actually die.

I'm not saying that beauty can't be encapsulated in some sort of formula. But it'll be a lot more complicated than this one. (In fact, it might not be a "formula" at all, in the sense that you put in numbers which describe the person and get out a numerical rating of their beauty; much more feasible would be a recommendation system like that on Netflix or Amazon. As I understand it, those systems work by recommending books or movies to you that people who have bought the same books or rented the same movies as you also liked.)
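As a toy illustration of that last idea, here is a minimal sketch of "people who liked what you liked also liked...". It is certainly not how Netflix or Amazon actually do it; the function name, the data layout, and the overlap-count weighting are all assumptions of mine.

```python
from collections import Counter


def recommend(likes: dict[str, set[str]], user: str, top_n: int = 3) -> list[str]:
    """Suggest items the user has not picked yet, weighted by how much
    each other person's likes overlap with the user's own."""
    mine = likes[user]
    scores = Counter()
    for other, theirs in likes.items():
        if other == user:
            continue
        overlap = len(mine & theirs)  # shared tastes act as the weight
        if overlap == 0:
            continue  # nothing in common, so their opinion carries no weight
        for item in theirs - mine:
            scores[item] += overlap
    # sort by score (descending), then by name so ties come out the same every time
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:top_n]]
```

Note that nothing in here produces a single "rating" for an item. What comes out depends entirely on whose tastes go in.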

And I'm skeptical of any formula that says that the same people will seem beautiful to everyone, because that's simply not true. I think someone could come up with a formula that will tell them who I am likely to find beautiful -- there are definitely patterns. But my tastes are not the same as yours. And don't try to tell me they should be.

P. S. I moved into my apartment a year ago today. As I sit here, I look out my kitchen window and see a truck from the movers I used. This is probably not a coincidence, though, as September 1 is kind of a big moving day in my neighborhood.

27 June 2007

secret messages in human DNA?

In yesterday's New York Times, Dennis Overbye writes about the possibility of hiding secret messages in human DNA.

This seems vaguely plausible. Each strand of DNA is composed of a sequence of the four bases adenine, cytosine, guanine, and thymine. One could use these like the digits 0, 1, 2, and 3 in a base-4 number system; equivalently, they could be used as 00, 01, 10, 11 in a binary number system, so each base represents two bits.
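A minimal sketch of the two-bits-per-base idea, assuming the alphabetical assignment A=00, C=01, G=10, T=11. Any other assignment would work just as well, and the function names are mine.

```python
BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11


def bytes_to_dna(data: bytes) -> str:
    """Write each byte as four bases, most significant bit pair first."""
    out = []
    for byte in data:
        for shift in (6, 4, 2, 0):
            out.append(BASES[(byte >> shift) & 0b11])
    return "".join(out)


def dna_to_bytes(dna: str) -> bytes:
    """Inverse of bytes_to_dna; expects a multiple of four bases."""
    if len(dna) % 4:
        raise ValueError("need a multiple of four bases")
    out = bytearray()
    for i in range(0, len(dna), 4):
        byte = 0
        for base in dna[i:i + 4]:
            byte = (byte << 2) | BASES.index(base)
        out.append(byte)
    return bytes(out)
```

Under this mapping one byte takes four bases, so a 32-bit word such as DEADBEEF would occupy sixteen bases.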

Humans have done things like this. Freshly allocated memory in certain computing environments is filled with the repeated string (in hexadecimal notation) DEADBEEF; also ABADBABE, BAADF00D, CAFEBABE have been used. (CAFEBABE is apparently used in Java-related contexts; see this archive of a thread "why CAFEBABE?" on comp.lang.java.) It is of course quite unlikely that any of these strings would be found repeatedly in a computer's memory, if the memory is filled at random; each DEADBEEF is 32 bits, so the chance of getting, say, ten DEADBEEFs in a row (assuming there's not a process that's just copying some string over and over again) is one in 2^320, which is more than the number of subatomic particles in the universe.

As you may know, the Central Dogma of molecular biology says that DNA is transcribed into RNA, which is translated into proteins; each triplet of DNA bases maps to a single amino acid, of which there are twenty. There's a code that assigns a letter to each amino acid; the letters B, X, and Z are "special" letters; U, O, and J aren't used. It's possible to spell things with the remaining twenty letters, though, and I've heard that some genetically engineered food includes the name of the company doing the engineering in the junk DNA.

So what if someone designed us? Maybe they'd hide a message in the DNA? (For the record, I don't believe in intelligent design; however, if we were intelligently designed, that leads inexorably to the question of "who designed the designer"?) But how would they hide that message? They don't know what language we speak, and they certainly don't know that we'll invent this twenty-letter way of describing protein sequences concisely. And unlike in the DEADBEEF example, there appear to be reasons why you'd want stretches of DNA to be the same thing over and over again; these occur in the so-called junk DNA. Like many mathematicians, I'm inclined to believe that they'd hide the prime numbers. The idea behind this is that the primes should never occur due to a natural process, but any culture which is the least bit mathematically sophisticated should have them. (The idea comes from people who are searching for extraterrestrial intelligence; they assume that both us and the other species involved have radio astronomy, and inventing radios without mathematics is Hard.) The sequence

2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, ...

which in base-4 is

2, 3, 11, 13, 23, 31, 101, 103, 113, 131, 133, 211, 221, 223, 233, ...

(Note that a very large number of these base-4 numbers, when read in base 10, are also prime! This is just a coincidence, although the fact that 4 and 10 are both even -- and therefore numbers which are odd remain odd under this transformation -- helps.) Replacing 0, 1, 2, 3 with A, C, G, T, we get

GTCCCTGTTCCACCATCCTCTCCTTGCCGGCGGTGTT

and so if we see this string in DNA, perhaps we should be suspicious? Well, it's 37 base pairs long; thus we expect it to occur once in every 4^37 base pairs. The human genome is about 3,000,000,000 base pairs long, so if the genome were random, the probability of this string occurring is about 3,000,000,000/4^37 = 1.6 × 10^-13.
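The encoding and the back-of-the-envelope estimate can be checked mechanically. This is a minimal sketch with function names of my own choosing; it treats the genome as independent, uniformly random bases, which real DNA is not.

```python
def primes_up_to(limit: int) -> list[int]:
    """Primes <= limit by trial division (fine for small limits)."""
    return [n for n in range(2, limit + 1)
            if all(n % d for d in range(2, int(n ** 0.5) + 1))]


def to_base4(n: int) -> str:
    """Base-4 digits of a non-negative integer, as a string."""
    digits = ""
    while n:
        digits = str(n % 4) + digits
        n //= 4
    return digits or "0"


def primes_as_dna(limit: int = 47) -> str:
    """Concatenate the base-4 primes and replace 0, 1, 2, 3 with A, C, G, T."""
    digits = "".join(to_base4(p) for p in primes_up_to(limit))
    return digits.translate(str.maketrans("0123", "ACGT"))


def expected_occurrences(pattern_length: int,
                         genome_length: int = 3_000_000_000) -> float:
    """Rough expected number of places a fixed pattern appears in a
    uniformly random genome (one chance in 4**length per position)."""
    return genome_length / 4 ** pattern_length


if __name__ == "__main__":
    message = primes_as_dna()
    print(message, len(message), expected_occurrences(len(message)))
```

For the primes up to 47 this yields a 37-base string, and `expected_occurrences(37)` comes out near 1.6 × 10^-13.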

So if we find it? Then yes, there's probably a Designer. But this doesn't mean that creationists should go fishing for hidden patterns in the genome. First, my choice of how to encode the primes was entirely random. We could reorder A, C, G, and T. We could have encoded the primes in base 3, using the fourth base to separate them. We could have encoded the primes as

CAACAAACAAAAACAAAAAAAC...

where the number of A's between each pair of C's is prime. And so on. Creationists looking in DNA would, I suspect, take a Bible code-like approach to the search. And if there were slight errors? They'd blame it on mutations, which are inevitable (the Times article points out that there are certain "ultraconserved" segments of the genome -- but those sections also appear to be functional, so it would be harder to hide a message in them -- but then if these hypothetical designers are so smart, maybe they can make those sections be functional and hide messages...)
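The gap-based encoding mentioned above, with primes written as runs of A's between C's, is just as easy to produce and to read back. That is rather the point: there are many equally "natural" encodings to go fishing for. A minimal sketch, with hypothetical function names:

```python
def primes_as_gaps(primes: list[int], marker: str = "C", filler: str = "A") -> str:
    """Encode each number as a run of filler bases, separated by marker bases."""
    return marker + "".join(filler * p + marker for p in primes)


def gaps_to_numbers(dna: str, marker: str = "C") -> list[int]:
    """Read the run lengths between markers back out of the string."""
    return [len(run) for run in dna.strip(marker).split(marker)]
```

Swapping `marker` and `filler` for any other pair of bases gives twelve variants of this one scheme alone.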

Sequencing the human genome is good for lots of reasons. But the search for messages from the past probably isn't one of them. They might be there, but we'd be searching for a needle in a haystack. And there would be lots of shiny things that aren't needles there, too.