30 June 2007

drug + drug = better drug

Old Drugs In, New Ones Out -- from today's New York Times.

A field known as combinatorial chemistry has recently emerged. Many molecules have similar "backbones" to each other and only differ in, say, a few groups of atoms hanging off of the end; the canonical example is proteins, which are built up from just twenty different amino acids. The amino acids all look like the image at the left, differing only in the group called "R". The actual protein is made up by sticking these molecules together via peptide bond formation, which eliminates the -OH group at the right end and one of the hydrogen atoms at the left end, bonding the carbon and nitrogen in adjacent amino acids together directly.

In drug design, it seems that what's often considered is the pharmacophore -- basically, the "business end" of a molecule. If you synthesize a bunch of molecules that are the same at one end but different at the other end, well, that means that the "business end" won't be exactly the same in each instance, and some might be better than others.

But what they're doing now takes this to a new level. Drugs that have already been created are now being combined with other drugs -- not chemically, just being put in the same pill. (Although sometimes more subtly than just throwing them both in, which means that you can't just take the two pills separately.) And of course, there are a lot of combinations you can get this way. What's more, the combinations aren't what a mathematician would call "linear" -- if you take a drug that does A, and a drug that does B, and stick them together, you don't always get a drug that does A-and-B. For example, one drug mentioned by the article -- Avanir's Zenvia -- takes a cough suppressant and a drug used to treat heart rhythm disturbances, and gets out a drug to stop uncontrolled laughing and crying. Predicting which combinations of drugs will have effects like this is tricky, and a lot of the work is in screening the combinations. But synthesizing all those combinations is also hard. Here's a patent for robotic synthesis.

One company, CombinatoRx, got my attention because their name is pronounced like the word "combinatorics". The article states that their current research program is to take two thousand generic drugs, make all possible pairs, and screen them to see if they do anything interesting; then develop the interesting drugs. There are two million possible pairs of drugs. They test "several thousand pairs of medicines a day". How long can this last? Well, if you assume "several thousand" means "two thousand", then it can last a thousand days. (Presumably they could expand their library of generic drugs, though.)

The next step would then be to try three-part drugs -- with the same library, you'd have about 1.3 billion of them. At 2000 combinations tested a day, that would take about two thousand years to test.
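
These counts are easy to check. A quick sketch in Python, assuming (as above) a library of 2,000 drugs and reading "several thousand" as 2,000 tests a day; the variable names are mine:

```python
from math import comb

LIBRARY_SIZE = 2000    # generic drugs in the library (figure from the article)
TESTS_PER_DAY = 2000   # assumption: "several thousand" read as two thousand

pairs = comb(LIBRARY_SIZE, 2)     # unordered pairs of distinct drugs
triples = comb(LIBRARY_SIZE, 3)   # unordered triples of distinct drugs

days_for_pairs = pairs / TESTS_PER_DAY
years_for_triples = triples / TESTS_PER_DAY / 365.25

print(f"{pairs:,} pairs: {days_for_pairs:.0f} days")
print(f"{triples:,} triples: {years_for_triples:.0f} years")
```

The pairs come to just under two million and the triples to about 1.33 billion, so the three-part screen works out to a little over 1,800 years at this rate.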

For a triple combination, the F.D.A. might want evidence that the trio is better than not only the individual parts but also better than any of the possible pairs. Showing that would require huge and costly clinical trials.

One wonders if it would be as huge and costly as implied here. My instinct is that combinations of three, four, or more drugs would come from adding a single drug to an already existing combination -- or, in the case of four-part drugs, taking two two-part drugs and putting them together. So some of the testing would already be done. From what I've heard about the FDA, though, they're likely not to care.

29 June 2007

ordering the Supreme Court justices

Today I put nine people in order -- and it wasn't a baseball team.
I came across the following table, which gives the percentage of the time that each pair of justices of the U.S. Supreme Court agreed with each other in non-unanimous decisions:

I came across the data in the Philadelphia Inquirer (June 29, 2007, page A14); the table lists as its source the Supreme Court Institute at the Georgetown University Law Center. It's a version of the table on p. 18 of that institute's October Term 2006 overview (although, rather inexplicably, some of the numbers are different between the two tables!)
The table above, though, is sorted in a conservative-to-liberal order derived from the alignment data. As I originally saw the table, the justices were in the order Roberts, Stevens, Scalia, Kennedy, Souter, Thomas, Ginsburg, Breyer, Alito. This is the Chief Justice followed by the eight Associate Justices in the order in which they were appointed.
These nine people are basically just names to me -- and in fact, some of them aren't even that. I hadn't heard of some of these people until today. Anyway, here's how I did the sorting: Thomas and Stevens agree with each other least often, so one assumes they have the largest ideological difference; put them first and ninth in the order. Scalia is the justice who's most likely to agree with Thomas, so put him second; similarly Ginsburg is most likely to agree with Stevens, so put her eighth. Roberts is most likely to agree with Scalia of the five justices who have yet to be picked, so put him third. Alito and Roberts are the two justices who are most likely to agree with each other, so clearly they should be next to each other in this ordering; put Alito fourth. Thus, so far we have
Thomas, Scalia, Roberts, Alito, ?, ?, ?, Ginsburg, Stevens.
(Note that the people on the left here are more conservative than the ones on the right, the reverse of what you'd expect from the usual use of "left" and "right" as "liberal" and "conservative".) We still have to position Kennedy, Souter, and Breyer. At this point I created the following table:


It's a tough call, but I figured that since Souter is the most likely to agree with Ginsburg, they should be next to each other; thus Souter gets the seventh slot. So now we have to put Kennedy and Breyer in order. Thomas, Scalia, Roberts, and Alito were all more likely to agree with Kennedy than with Breyer; Souter, Ginsburg, and Stevens were all more likely to agree with Breyer than with Kennedy. On this basis, Kennedy gets the fifth slot and Breyer the sixth, which gives the order you can read off the first row of the big table which opened this post.
What surprised me is how coherent the assignment ended up being. If you look at the table above, for the most part each row or column increases smoothly to 100 and then decreases smoothly. (The coloring marks the exceptions; each pair of adjacent entries that's in the same color is a pair in the "wrong" order.) Furthermore, if you switch the order of any two adjacent justices you create more exceptions; this is clearly at least a local minimum. What this basically says is that the current court is one-dimensional; on most cases you can probably draw a line between two people in the ordering I gave and all the "yes" votes will be on one side, all the "no" votes on the other. (I may check this at some future date.)
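
The "exception" count and the local-minimum check can be sketched in a few lines of Python. The four justices W, X, Y, Z and their agreement percentages below are made up purely for illustration (they are not the Georgetown numbers), and the function names are mine:

```python
def build_table(pairs):
    """Turn {(a, b): percent} into a symmetric lookup with 100 on the diagonal."""
    names = sorted({name for pair in pairs for name in pair})
    table = {a: {b: 100 for b in names} for a in names}
    for (a, b), pct in pairs.items():
        table[a][b] = pct
        table[b][a] = pct
    return table

def count_exceptions(table, order):
    """Count adjacent entries, row by row, that break rise-to-100-then-fall."""
    n = len(order)
    exceptions = 0
    for i in range(n):
        row = [table[order[i]][order[j]] for j in range(n)]
        for j in range(n - 1):
            if j + 1 <= i and row[j] > row[j + 1]:   # left of the diagonal: should rise
                exceptions += 1
            elif j >= i and row[j] < row[j + 1]:     # right of the diagonal: should fall
                exceptions += 1
    return exceptions

def is_local_minimum(table, order):
    """True if no swap of two adjacent justices gives fewer exceptions."""
    base = count_exceptions(table, order)
    for k in range(len(order) - 1):
        swapped = order[:k] + [order[k + 1], order[k]] + order[k + 2:]
        if count_exceptions(table, swapped) < base:
            return False
    return True

# Hypothetical agreement percentages, invented for this example only.
TOY = {("W", "X"): 80, ("W", "Y"): 60, ("W", "Z"): 40,
       ("X", "Y"): 85, ("X", "Z"): 65, ("Y", "Z"): 75}
table = build_table(TOY)

good = count_exceptions(table, ["W", "X", "Y", "Z"])
bad = count_exceptions(table, ["W", "Y", "X", "Z"])
print(good, bad, is_local_minimum(table, ["W", "X", "Y", "Z"]))
```

With the toy numbers, the natural order has no exceptions and swapping the middle two creates some, which is the same pattern I saw in the real table.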
I'm kind of curious if this could be done for, say, the Senate, although since the Senate has a hundred members and I'd have to compile the data myself I'm not going to do it. This took maybe ten or twenty minutes once I had the table, and was an amusing break from what I was working on this morning; doing it for the Senate would be actual work.
What's more, this ordering is consistent with what the Supreme Court Institute has to say in the narrative parts of their report. Kennedy was in the majority in all 22 5-4 decisions the court made this term. They call Scalia a "conservative pole", though, which doesn't line up with my data.
Also, it would probably be possible to incorporate past justices into this ordering, if they'd overlapped with enough of the current justices that where they fell in the ordering was obvious. But I doubt that it's possible to rank all 110 justices this way; for one thing, our left-right distinction is different from the one that's existed at various points in history. And even if "left" and "right" had carried the same meaning over the last two hundred years, we wouldn't be able to compare every pair of justices -- just those that had overlapped -- and extending that sort of partial order to a total order is hard.

when's the Fourth of July weekend this year?

Phillyist asks:

We've always wondered what the protocol is for celebrating a holiday weekend if the actual holiday falls squarely in the middle of the week. Should we be celebrating Independence Day this weekend? Or next weekend? Or should we just celebrate both weekends and spend two weekends in a row gorging ourselves on various grilled meats and icy-cold Coronas and margaritas? (This Phillyist votes the latter.)

and Jacqueline Urgo of the Philadelphia Inquirer asks the same question:

Surely there'll be a Fourth of July weekend at the Jersey Shore. But when?
Because the Fourth falls on Wednesday this year, schedule shilly-shallying has driven the Shore into a near panic.
Will bars and restaurants need those extra ice cubes this week or next week? What about more linens for the tables? More food for the hungry?

In general, I imagine people are taking off more time for the Fourth on average this year. My guess is that the most common behavior among people taking vacations is as follows, depending on the day of the week on which the 4th falls:

  • Thursday: people take off Thursday the 4th through Sunday the 7th.

  • Friday: people take off Friday the 4th through Sunday the 6th

  • Saturday: people take off Friday the 3rd through Sunday the 5th

  • Sunday: people take off Saturday the 3rd through Monday the 5th

  • Monday: people take off Saturday the 2nd through Monday the 4th

  • Tuesday: people take off Saturday the 1st through Tuesday the 4th

But this year, do you take off from Saturday, June 30 through the 4th? Or from the 4th through Sunday, July 8? Or just throw up your hands and take the whole week? (One person is quoted in the Inquirer article as saying that people will take the weekend after, not the weekend before, because the weekend before falls partially in June. He might be on to something, although I'm not totally sure how much month boundaries affect people.)
But the Inquirer article makes it sound like this never happens. In fact, it happens about one year in seven. You'd think that the people who have been in business for a while could go back and see what happened in 2001. Or 1990. Or 1984. Or... you get the idea. More precisely, it happens in 58 years out of every 400, very nearly one in seven. (The link is to the Wikipedia article on "Dominical letter", which is the Catholic Church's system for encoding how the days of the week fall in a given year with a single letter; in every year with dominical letter G or AG, the Fourth of July falls on a Wednesday. Looking at the table there makes it easy to count.)
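
If you'd rather not count dominical letters by hand, the standard library will do it. A small sketch; the year ranges are my choice (any 400 consecutive Gregorian years give the same total, since the calendar repeats every 400 years):

```python
from datetime import date

WEDNESDAY = 2  # date.weekday() numbers Monday as 0

def fourth_on_wednesday(year):
    """True if July 4 of the given year falls on a Wednesday."""
    return date(year, 7, 4).weekday() == WEDNESDAY

# Recent years in which the Fourth fell on a Wednesday.
recent = [y for y in range(1980, 2008) if fourth_on_wednesday(y)]
print(recent)  # → [1984, 1990, 2001, 2007]

# One full 400-year Gregorian cycle.
cycle_count = sum(fourth_on_wednesday(y) for y in range(2000, 2400))
print(cycle_count, "of 400")  # → 58 of 400
```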
A few random facts about the Gregorian calendar:

I'll leave that as an exercise for the reader.

28 June 2007

help! the Earth is sinking!

Earth's inner heat keeps cities afloat. The rocks that the Earth is made of expand when it's warmer, like most materials; thus if the inside of the Earth were not as hot the Earth would be smaller.

Derrick Hasterok and David Chapman, of the University of Utah, say that the significance of this heating has been overlooked. In particular, it's stronger in some areas than in others -- the rock under the western U. S. is hotter than that under the eastern U. S., so the general fact that the West tends to be higher than the East is in part due to this phenomenon.

However, they claim that "New York would drop to 1,427 feet below the Atlantic ocean, Boston and Miami even deeper. Los Angeles would rest 3,756 feet below the surface of the Pacific ocean." This just doesn't feel right. Perhaps those places would fall to those depths below the current sea level -- I take this to mean they'd be slightly closer to the center of the Earth. But sea level would be redefined to be the new average height of the sea. The only way all these places could suddenly be under sea level is if there were more water.

In any case, it doesn't matter, because the heat is coming from radioactive decay of some very long-lived isotopes. Worry about global warming.

(Those of you who thought this blog was supposed to be about probability -- as the title might lead you to believe -- may be wondering why I'm making this post. But this blog is also about silly uses of mathematics in the media.)

The round house

Updating a House of Tomorrow, by Eve M. Kahn in today's New York Times.
Theodore and Susan Pound recently bought a house in the Buckhead section of Atlanta designed by the architect Cecil Alexander. Most of the people who saw the house when it was for sale didn't much like it, because the rooms were oddly shaped; the house has a circular plan, which you can see a picture of in the article.
Alexander, when asked why the house was round, said:

My first plans were L’s or squares or rectangles [....] But then I realized those shapes waste so much space — a circle is compact, it gives you the maximum interior room for the minimum amount of exposed wall.

This is true; it's the well-known isoperimetric inequality. It's related to a lot of other geometric inequalities.

But I'm not sure that minimizing wall space is necessarily the way to go. My bedroom is round -- the corner I live on is an acute angle, and so whoever designed the building stuck on a round turret so the building didn't stick out into the intersection and stab people. Also, it's difficult to work with curved walls when you don't have curved furniture.

Finally, in a very dense neighborhood circular houses would waste land; there are inevitably holes between the houses, as in the picture below (taken from Wikipedia)
which take up about ten percent of the space. Atlanta's sprawling enough already; they don't need more wasted land, so they probably shouldn't start building neighborhoods of circular houses close together. However, the house in question is 5,500 square feet on four acres of land, so that's not a problem here.
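
Both numbers in this post can be checked with a few lines of arithmetic. A sketch, with two assumptions of mine: the 5,500 square feet is treated as a single circular floor, and the circles are packed in the tightest (hexagonal) arrangement, which is where a figure of about ten percent comes from:

```python
import math

def circle_perimeter(area):
    """Length of wall around a circular floor of the given area."""
    return 2 * math.sqrt(math.pi * area)

def square_perimeter(area):
    """Length of wall around a square floor of the given area."""
    return 4 * math.sqrt(area)

AREA = 5500.0  # square feet; assumption: all on one floor

# Ratio of circular to square wall length (independent of the area).
wall_ratio = circle_perimeter(AREA) / square_perimeter(AREA)

# Fraction of the plane covered by equal circles in a hexagonal packing.
hex_density = math.pi / (2 * math.sqrt(3))
wasted = 1 - hex_density

print(f"circle needs {wall_ratio:.3f} of the square's wall")
print(f"hexagonal packing wastes {wasted:.3f} of the land")
```

So the round plan saves about 11 percent of the exterior wall compared with a square of the same area, and closely packed circles leave a bit over 9 percent of the land in the gaps.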

27 June 2007

the 10,000th Phillies loss will come on the West Coast

Walking around this morning, I saw the Philadelphia Weekly's cover story: Losing proposition. This is an article about how the Phillies are very close to having ten thousand losses. The New York Times made fun of us a couple weeks ago (but the Times mocks anything involving Philadelphia). There are sites like Countdown to 10000 and Celebrate 10000 in honor of it. They sell T-shirts. Some people claim the 10,000th loss was in June of 2005, against the Red Sox -- but this is only true if you count the Worcester Worcesters of 1880-1882 as being the Phillies. They're not.
(Yes, the Worcester Worcesters. Some sources call them the Brown Stockings, but I like calling them the Worcesters because it shows even less ingenuity in naming than the name "Phillies" does.)
There are three Facebook groups. (I wonder if there's a MySpace group; the link goes to a paper that's been circulating about the class differences between Facebook and MySpace.)
Then I remembered that I have Phillies tickets for their game against the Cardinals on July 13th, the first game after the All-Star break.
I got to thinking -- what are the chances that I'd see the Phillies' ten thousandth loss? They've lost 9,991 games so far; they've got nine more to go.
Surely the 10,000th loss is a historic moment in all of professional sports. No team has lost this many games. (The San Francisco (formerly New York) Giants have won 10,000.)
It's not so hard to compute this. What I needed to know was the probability that the Phillies lose each particular game. This can be found via a method which for some cryptic reason is called the "log5 method", which I learned about from this article from Diamond Mind which computed the probabilities that each of the 2002 playoff teams would win the World Series. The method is as follows: if team A wins pA of its games, and team B wins pB of its games, then the probability that team A wins in any given game against team B is
pA(1-pB) / (pA(1-pB) + pB(1-pA)).
The best justification for this formula is that it works when you test it on actual data. (Actual baseball data, that is; I'm not sure if it's good for other sports.) But an intuitive justification for it is as follows: you have two coins, coin A and coin B. Each coin has "win" on one side and "loss" on the other. Coin A comes up "win" with probability pA, and coin B comes up "win" with probability pB. To simulate a game, flip the two coins. If one comes up "win" and one comes up "loss", that gives you the outcome of the game; if they both come up the same, flip again. Notice that the formula passes a couple of sanity checks. If pA = 0, then it always gives 0 -- that is, if a team never wins, then its probability of winning against any opponent is zero. If pB = 1/2, then it just gives pA -- so a team which is playing against average teams performs how it usually performs.
To adjust for home field advantage, I added 0.02 to the home team's winning percentage and subtracted 0.02 from the visiting team's winning percentage; this is the method used at Baseball Prospectus' postseason odds simulation, which I'll have more to say about later.
So, for example, the Phillies play the Reds tonight, in Philadelphia. The Reds have won 29 games and lost 48, so their winning percentage is .377; we replace this with .357 since the Reds will be playing on the road. The Phillies have won 40 and lost 36, so their winning percentage is .526; we replace this with .546 since they're playing at home. The formula tells us that the Reds' chance of winning tonight is
(.357)(1-.546) / ((.357)(1-.546) + (.546)(1-.357))
which is 0.315. This is the Phillies' chance of losing, which is what I'm interested in.
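
The same calculation can be written as a short function. A minimal sketch, assuming the flat 0.02 home-field adjustment described above; the function and variable names are mine, not Diamond Mind's or Baseball Prospectus':

```python
HOME_EDGE = 0.02  # added to the home team's percentage, subtracted from the visitor's

def log5(p_a, p_b):
    """Probability that team A beats team B, given their winning percentages."""
    return p_a * (1 - p_b) / (p_a * (1 - p_b) + p_b * (1 - p_a))

def home_loss_probability(home_wins, home_losses, away_wins, away_losses):
    """Chance the home team loses, using unrounded winning percentages."""
    p_home = home_wins / (home_wins + home_losses) + HOME_EDGE
    p_away = away_wins / (away_wins + away_losses) - HOME_EDGE
    return log5(p_away, p_home)

# Phillies (40-36) at home against the Reds (29-48).
phillies_lose = home_loss_probability(40, 36, 29, 48)
print(round(phillies_lose, 3))  # → 0.315
```

Working from the unrounded winning percentages is what gives 0.315; plugging in the three-digit values by hand lands a hair higher, at about 0.316.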
So after tonight, the Phillies will have eight losses to go with probability 0.315; they'll have nine losses to go with probability 1-0.315, or 0.685.
They'll play the Reds again tomorrow night. After that game, they have seven losses to go with probability (0.315)^2 = 0.099; they have eight losses to go with probability (.315)(.685)+(.685)(.315) = .432; they have nine losses to go with probability (0.685)^2 = 0.469.
Thus, I set up a spreadsheet which calculates the probability that after each game, they have 9, 8, 7, ..., 1 losses to go. The probability of the Phillies getting their ten-thousandth loss on a certain day is the probability that they have 9,999 losses before that day ("1 loss to go"), times the probability of losing that day.
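
For anyone who'd rather not build the spreadsheet, here is a sketch of the same bookkeeping in Python. It is my reconstruction of the idea, not the spreadsheet itself, and it only plays the two Reds games worked out above, using the rounded 0.315 loss probability:

```python
def play_game(to_go, p_loss):
    """Update the distribution of "losses to go" after one game.

    to_go[k] is the probability that exactly k more losses are needed (k >= 1);
    index 0 is unused. Returns the updated list and the probability that the
    final loss comes in this particular game.
    """
    final_today = to_go[1] * p_loss
    updated = [0.0] * len(to_go)
    for k in range(1, len(to_go)):
        updated[k] += to_go[k] * (1 - p_loss)     # a win: still k losses to go
        if k >= 2:
            updated[k - 1] += to_go[k] * p_loss   # a loss: now k - 1 to go
    return updated, final_today

LOSSES_NEEDED = 9
to_go = [0.0] * (LOSSES_NEEDED + 1)
to_go[LOSSES_NEEDED] = 1.0   # before tonight: nine to go, with certainty

P_LOSS_VS_REDS = 0.315       # from the log5 calculation above
for _ in range(2):           # the two home games against the Reds
    to_go, final_today = play_game(to_go, P_LOSS_VS_REDS)

print([round(p, 3) for p in to_go[7:]])  # → [0.099, 0.432, 0.469]
```

Feeding in each later game's loss probability in schedule order and recording the second return value for each date gives a column like the one in the table.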
The results are as follows. The rows in red are home games, following the same color scheme as the sorted schedule. The winning percentages are from mlb.com standings as of June 27.

Date     Opponent        Chance of 10,000th loss
Jun 27   v. Reds         0.000000
Jun 28   v. Reds         0.000000
Jun 29   v. Mets         0.000000
Jun 29   v. Mets         0.000000
Jun 30   v. Mets         0.000000
Jul 01   v. Mets         0.000000
Jul 02   @ Astros        0.000000
Jul 03   @ Astros        0.000000
Jul 04   @ Astros        0.000467
Jul 06   @ Rockies       0.002946
Jul 07   @ Rockies       0.009603
Jul 08   @ Rockies       0.021746
Jul 13   v. Cardinals    0.030071
Jul 14   v. Cardinals    0.041621
Jul 15   v. Cardinals    0.052571
Jul 16   @ Dodgers       0.091757
Jul 17   @ Dodgers       0.106722
Jul 18   @ Dodgers       0.112506
Jul 19   @ Padres        0.108077
Jul 20   @ Padres        0.097745
Jul 21   @ Padres        0.083264
Jul 22   @ Padres        0.067340
Jul 24   v. Nationals    0.031618
Jul 25   v. Nationals    0.026675
Jul 26   v. Nationals    0.022282
Jul 27   v. Pirates      0.018717
Jul 28   v. Pirates      0.015312
Jul 29   v. Pirates      0.012425
Jul 30   @ Cubs          0.014016
Jul 31   @ Cubs          0.010085
Aug 01   @ Cubs          0.007142
Aug 02   @ Cubs          0.004984
Aug 03   @ Brewers       0.004103
Aug 04   @ Brewers       0.002533
Aug 05   @ Brewers       0.001533
Aug 07   v. Marlins      0.000613
Aug 08   v. Marlins      0.000441
Aug 09   v. Marlins      0.000317
Aug 10   v. Braves       0.000251
Aug 11   v. Braves       0.000171
Aug 12   v. Braves       0.000115
Aug 14   @ Nationals     0.000074
Aug 15   @ Nationals     0.000051
Aug 16   @ Nationals     0.000034
Aug 17   @ Pirates       0.000024
Aug 18   @ Pirates       0.000016
Aug 19   @ Pirates       0.000011
Aug 21   v. Padres       0.000008
Aug 22   v. Padres       0.000005
Aug 23   v. Padres       0.000003
Aug 24   v. Dodgers      0.000002
Aug 25   v. Dodgers      0.000001
Aug 26   v. Dodgers      0.000001
Aug 27   v. Mets         0.000000

So it appears most likely that the Phillies will have their ten-thousandth loss on the West Coast, between July 16 and July 22; there's a 66.7% chance of it happening in those seven games. This is where you'd "naturally" expect things to peak anyway -- since the team loses about half the time, you'd expect it to take them 18 games in order to lose 9. That road trip is the 16th through 22nd games if we start counting from today. Plus, they'll be on the road, and the Dodgers and Padres are both good teams. It actually surprised me to see that the 10,000th loss is nearly twice as likely in the first game of that road trip (July 16 @ Dodgers) as in the last game of the preceding homestand (July 15 v. Cardinals), and similarly for the last game of the road trip (July 22) and the first game back (July 24). The soonest it can happen, as of this writing, is July 4, if they lose the next nine -- that would seem somehow appropriate, given what happened in Philadelphia on a long-ago July 4. The tail of the distribution is long -- there's always that slim chance that the Phillies could get ridiculously hot and stretch this out for thirty games or more. I wouldn't bet on it, though.

And I've only got a three percent chance of seeing this historic moment on the 13th of July. I hope I don't see it, because that would mean the Phillies would only win four out of their next thirteen.

edit (Friday, 2:39 pm): Frank at hot dogs and beer features a similar analysis.

iPhone bets part 2, and aggregating customer service information

Yesterday I posted betting on the iPhone, about an online bookmaker that was taking odds on various circumstances surrounding the release of the iPhone.
Wired has "Ready for an IPhone? Tips to End Your Existing Cell Contract", which is what it sounds like. Their tips include "odds of success", such as:

Pawn it off
Don’t want your contract anymore? Find someone who does. Websites like Celltradeusa.com specialize in connecting thousands of people together for the express purpose of transferring the financial responsibilities of cell contracts from one person to another. As long as the recipient meets the minimum qualifications (credit check, etc.) you can transfer the plan over without getting hit with the early termination fee.
Odds of success: 2-to-1

but I think they just made up these odds. It would actually be interesting to know these sorts of things about various companies' customer service. Some of the tips they offer are more of a gamble -- for example, abuse the system by using up lots of minutes while you're roaming, if roaming is free for you. They offer 20-to-1 odds on that. Before I tried that, I might like to know "1000 people say they've tried this; it worked for 50 of them". It would be interesting to see a web site that collected such information -- there are many problems that a lot of people have and having data on how they solved them could be useful.

However, there's always the issue that I'll call the "asymmetry" of word of mouth. If you expect a plan like that not to work, you'll tell lots of people if it does work and not that many people if it doesn't work. It's kind of like how you hear about the actors that made it (because they're on your TV all the time) but you don't hear about the ones that didn't make it (although you do see them if you go to a restaurant, because they're waiting tables to make ends meet). So any information on such a web site would have to be taken with a shaker of salt.

secret messages in human DNA?

In yesterday's New York Times, Dennis Overbye writes about the possibility of hiding secret messages in human DNA.

This seems vaguely plausible. Each strand of DNA is composed of a sequence of the four bases adenine, cytosine, guanine, and thymine. One could use these like the digits 0, 1, 2, and 3 in a base-4 number system; equivalently, they could be used as 00, 01, 10, 11 in a binary number system, so each base represents two bits.

Humans have done things like this. Freshly allocated memory in certain computing environments is filled with the repeated string (in hexadecimal notation) DEADBEEF; also ABADBABE, BAADF00D, CAFEBABE have been used. (CAFEBABE is apparently used in Java-related contexts; see this archive of a thread "why CAFEBABE?" on comp.lang.java.) It is of course quite unlikely that any of these strings would be found repeatedly in a computer's memory, if the memory is filled at random; the chance of getting, say, ten DEADBEEFs in a row (assuming there's not a process that's just copying some string over and over again) is one in 2^320, and 2^320 is more than the number of subatomic particles in the universe.

As you may know, the Central Dogma of molecular biology says that DNA is transcribed into RNA, which is translated into proteins; each triplet of DNA bases maps to a single amino acid, of which there are twenty. There's a code that assigns a letter to each amino acid; the letters B, X, and Z are "special" letters; U, O, and J aren't used. It's possible to spell things with the remaining twenty letters, though, and I've heard that some genetically engineered food includes the name of the company doing the engineering in the junk DNA.

So what if someone designed us? Maybe they'd hide a message in the DNA? (For the record, I don't believe in intelligent design; however, if we were intelligently designed, that leads inexorably to the question of "who designed the designer"?) But how would they hide that message? They don't know what language we speak, and they certainly don't know that we'll invent this twenty-letter way of describing protein sequences concisely. And unlike in the DEADBEEF example, there appear to be reasons why you'd want stretches of DNA to be the same thing over and over again; these occur in the so-called junk DNA. Like many mathematicians, I'm inclined to believe that they'd hide the prime numbers. The idea behind this is that the primes should never occur due to a natural process, but any culture which is the least bit mathematically sophisticated should have them. (The idea comes from people who are searching for extraterrestrial intelligence; they assume that both us and the other species involved have radio astronomy, and inventing radios without mathematics is Hard.) The sequence

2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, ...

which in base-4 is

2, 3, 11, 13, 23, 31, 101, 103, 113, 131, 133, 211, 221, 223, 233, ...

(Note that a very large number of these base-4 numbers, when read in base 10, are also prime! This is just a coincidence, although the fact that 4 and 10 are both even -- and therefore numbers which are odd remain odd under this transformation -- helps.) Replacing 0, 1, 2, 3 with A, C, G, T, we get

GTCCCTGTTCCACCATCCTCTCCTTGCCGGCGGTGTT

and so if we see this string in DNA, perhaps we should be suspicious? Well, it's 37 base-pairs long; thus we expect it to occur once in every 4^37 base pairs. The human genome is about 3,000,000,000 base-pairs long, so if the genome were random, the probability of this string occurring is 3,000,000,000/4^37 = 1.6 × 10^-13.
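
The encoding is mechanical enough to reproduce in a few lines. A sketch in Python, using the same fifteen primes and the 0, 1, 2, 3 to A, C, G, T mapping from this post; the helper names are mine, and "random genome" here means each base is independent and equally likely:

```python
PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47]
BASES = "ACGT"                 # digit 0 -> A, 1 -> C, 2 -> G, 3 -> T
GENOME_LENGTH = 3_000_000_000  # approximate length of the human genome

def to_base4(n):
    """Write a positive integer as a string of base-4 digits."""
    digits = ""
    while n > 0:
        digits = str(n % 4) + digits
        n //= 4
    return digits

def is_prime(n):
    """Trial division; fine for numbers this small."""
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

base4 = [to_base4(p) for p in PRIMES]
message = "".join(BASES[int(digit)] for digit in "".join(base4))

# Expected number of occurrences in a random genome; for a value this
# small it is essentially the probability of seeing the string at all.
expected = GENOME_LENGTH / 4 ** len(message)

# How many of the base-4 strings are also prime when read in base 10.
also_prime = sum(is_prime(int(s)) for s in base4)

print(message, len(message))
print(f"{expected:.2e}")
print(also_prime, "of", len(base4))
```

Thirteen of the fifteen base-4 strings are prime when read in base 10; the exceptions are 133 = 7 × 19 and 221 = 13 × 17.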

So if we find it? Then yes, there's probably a Designer. But this doesn't mean that creationists should go fishing for hidden patterns in the genome. First, my choice of how to encode the primes was entirely arbitrary. We could reorder A, C, G, and T. We could have encoded the primes in base 3, using the fourth base to separate them. We could have encoded the primes as

CAACAAACAAAAACAAAAAAAC...

where the number of A's between each pair of C's is prime. And so on. Creationists looking in DNA would, I suspect, take a Bible code-like approach to the search. And if there were slight errors? They'd blame it on mutations, which are inevitable (the Times article points out that there are certain "ultraconserved" segments of the genome -- but those sections also appear to be functional, so it would be harder to hide a message in them -- but then if these hypothetical designers are so smart, maybe they can make those sections be functional and hide messages...)

Sequencing the human genome is good for lots of reasons. But the search for messages from the past probably isn't one of them. They might be there, but we'd be searching for a needle in a haystack. And there would be lots of shiny things that aren't needles there, too.

26 June 2007

betting on the iPhone

You can bet on everything these days!

BetUS.com -- which appears to be mostly a sports betting site -- is giving odds on various iPhone-related events. (I came to this via Marginal Revolution.)

I can't get inside, but livescience.com (the first link above) claims that BetUS.com is offering the following odds:

Consumers are reported camping out waiting for an iPhone—3/1

At first glance, I'd take this bet. People camp out now for product launches, it's What They Do in this consumer culture. And the sort of people who do that are, to some extent, Apple's target market. However, the iPhone is being released at 6pm local time on Friday. And the iPhone is expensive -- $500 just for the physical device, and then depending on who you believe somewhere around $80 for the service -- so you've got to think that maybe the people buying them will have jobs. (I'm sure there's a Steve Jobs joke in here somewhere, but I can't find it.)

Apple’s stock jumps at least 10% in value in regards to the price on 6/30/07—1/2

Technically, this can't happen. Why? Because June 30 is a Saturday. Stocks don't trade on Saturdays. But assuming they mean the next trading day after the release -- that is, Monday, July 2nd -- this would be an interesting disproof of the efficient market hypothesis. This hypothesis claims that the price of a traded asset -- such as Apple stock -- reflects all the knowledge that's available about the company.

On the other hand, Apple has been trading around 125 lately; it was at 90 as recently as mid-April. Either a lot of information about Apple has suddenly come out, or investors are just crazy. Or both.

Consumers pay at least three times the original price ($1,500) on ebay - 2/1

Hard to call. Did consumers learn from when people tried to flip PS3s and Xboxes last winter? Sure, some people pulled it off, but a lot got stuck with them.

iPhone spontaneously combusts—150/1

I hope this is a joke.

Judging from the little information I have, though, and the fact that the odds are simple integer ratios, I'm guessing that these odds don't move, but are set by BetUS.com. I was expecting something like tradesports.com or intrade.com, in which people can buy and sell "contracts" on various events -- these are rapidly emerging as an interesting means of predicting the probability of various "complicated" events, where one can't come up with a simple model to make a decent guess at the probability of an event. We expect that, if people are willing to pay $25 for a "contract" that pays out $100 if people are reported camping out waiting for an iPhone, then if we could repeat this experiment over and over again, one time out of four there would be people camping out. (The question of what this even means is kind of tricky, though, because there aren't going to be three more iPhones. Tonight I prefer the interpretation of complicated probabilities like these in terms of wagers, but that could always change.)
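
The conversion between a bookmaker's odds and a contract price is simple arithmetic. A sketch, assuming the quoted figures are fractional odds (profit/stake) and ignoring the bookmaker's cut; the function names are mine:

```python
from fractions import Fraction

def implied_probability(profit, stake):
    """Break-even probability for fractional odds of profit/stake."""
    return Fraction(stake, profit + stake)

def contract_price(probability, payout=100):
    """Fair price of a contract that pays `payout` if the event happens."""
    return probability * payout

camping = implied_probability(3, 1)      # the 3/1 odds on people camping out
stock_jump = implied_probability(1, 2)   # the 1/2 odds on the stock jump
combustion = implied_probability(150, 1) # the 150/1 odds on combustion

print(camping, contract_price(camping))  # → 1/4 25
print(stock_jump, combustion)            # → 2/3 1/151
```

Read this way, the 1/2 odds on Apple's stock jumping 10% imply a two-in-three chance, which seems generous.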

edit, 5:08 pm: People are already camping out. Reuters reports that as of this morning, there were four people in line outside the Apple Store on 5th Avenue in Manhattan.

25 June 2007

The Simpsons use decimal numbers

In The Simpsons, people have four fingers on each hand. Eight fingers in total. Therefore, shouldn't they use numbers in base 8?

(The reason that they have four fingers is the same reason that most animated characters have four fingers -- it's easier to draw. In at least one episode, God appears; God has five fingers.)

This occurred to me while watching the episode The Canine Mutiny, in which "After using his credit card to buy another dog, Bart must choose between his new wonder-pooch and the bumbling but loyal Santa's Little Helper." Bart gets the credit card in the name of his old dog, Santa's Little Helper; to order the new dog from a catalog, he has to dial an 800 number. He says "I don't think our phone goes up to 800", which got me thinking about what kind of numbers they use in Simpsons-world.

simpsonsmath.com, by Sarah Greenwald and Andrew Nestler, has a list of mathematical references on the Simpsons. This is not one of them.

It's actually possible to prove that the Simpsons universe has numbers in base 10. The baseball attendance figures in Marge and Homer turn a Couple Play are 8191, 8128, and 8208. The use of 9 indicates that we're in base at least 10. If we assume this is supposed to be a mathematical joke, 8191 and 8128 are immediately recognizable as 2^13 - 1 (a Mersenne prime) and (2^7 - 1) * 2^6 (a perfect number). 8208 is also 2^13 + 2^4, but more importantly it's the sum of the fourth powers of its digits. This would only be true in base 10. (Incidentally, most mathematicians regard properties of numbers that are based on their digits as not worthy of investigation, because they are basically accidents of the fact that we have ten fingers.)
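For the skeptical, here is a small Python sketch that checks the three attendance figures. The helper names are my own, and the search over bases 9 through 36 is an arbitrary range chosen for illustration (the digit 8 needs a base of at least 9).

```python
def is_prime(n):
    """Trial division; plenty fast for four-digit numbers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def is_perfect(n):
    """True if n equals the sum of its proper divisors."""
    return n == sum(d for d in range(1, n) if n % d == 0)


def digits_value(digits, base):
    """Value of a digit string (most significant digit first) in a given base."""
    value = 0
    for d in digits:
        value = value * base + d
    return value


DIGITS = [8, 2, 0, 8]
fourth_power_sum = sum(d ** 4 for d in DIGITS)

# Bases in which the numeral "8208" equals the sum of the fourth powers of its digits.
matching_bases = [b for b in range(9, 37) if digits_value(DIGITS, b) == fourth_power_sum]

print(2 ** 13 - 1 == 8191 and is_prime(8191))
print((2 ** 7 - 1) * 2 ** 6 == 8128 and is_perfect(8128))
print(matching_bases)
```

Since the value of the numeral grows with the base, at most one base can match, which is the sense in which the digit property pins things down.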

24 June 2007

Bloomberg as a kingmaker?

President? Or Kingmaker? by Patrick Healy, in today's New York Times.

Michael Bloomberg, mayor of New York City, recently officially changed from a member of the Republican party to an independent. This has been interpreted as a harbinger of a presidential run as an independent. However, that's a long shot, even though there's speculation that Bloomberg would be willing to spend a billion dollars of his own money on his campaign.

Healy suggests that Bloomberg ought to run as a "kingmaker". He should attempt to win one or two large states (New York is the obvious choice, since he's mayor of the city that makes up nearly half that state's population) and basically forget about the others. After that, he would need to hope that neither the Republican nor the Democratic candidate has 270 electoral votes. The election then by default goes to the House of Representatives. However, electoral votes aren't cast until December 15, six weeks after the general election. So Bloomberg could make deals with one of the two major-party candidates.

This has been tried before; George Wallace attempted it in 1968, Strom Thurmond in 1948. But neither of those elections was close enough for the strategy to work.

This raises a question, though. Let's say Bloomberg can win New York (31 electoral votes). What are the chances that the other states are evenly split enough?

Let's assume that each state's winner is decided by flipping a coin. (This, of course, does not reflect the reality of American politics -- some states are much more likely to break one way or the other -- but bear with me.) Then each candidate expects to win half of the remaining electoral votes -- that's 253.5. The variance of the number of electoral votes won by, say, the Democratic candidate is the sum of the variances of the number of electoral votes won in each state. In a state with n electoral votes, that's n^2/4. Adding the results up for each state, we see that the variance of the number of electoral votes won by the Democrat is 2326.25; the standard deviation is the square root of this, 48.23. I'll assume that the distribution is normal -- if all the states were the same size, this would be the Central Limit Theorem, and hopefully the fact that the states aren't all the same size doesn't kill us. So the probability that the Democrat gets 270 electoral votes in this scheme is the probability that a normally distributed random variable with mean 253.5 and standard deviation 48.23 is at least 269.5; that's 37%. Similarly for the Republican. That leaves Bloomberg a 26% chance -- barely one in four -- that this scheme would work. He might be willing to take those odds.

But, of course, there are some states that are sure to go one way or the other. Say only one-third of states (representing one-third of electoral votes) are sure to go to the Democrats, one-third to the Republicans, and one-third in play. Then the variance gets divided by 3; the standard deviation is now 27.85; Bloomberg's chances of the election being close enough for this strategy to come into play are 44%.
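The normal-approximation arithmetic is easy to redo by machine. The sketch below takes the mean (253.5), the variance (2326.25) and the 269.5 threshold straight from the text rather than recomputing them from state-by-state electoral votes; the function names are mine, and the tail probability is built from the standard library's `math.erf`.

```python
from math import erf, sqrt

MEAN = 253.5        # half of the 507 electoral votes outside New York
VARIANCE = 2326.25  # sum of n^2/4 over the remaining states, as quoted above
THRESHOLD = 269.5   # 270 votes, with a continuity correction


def normal_tail(x, mean, sd):
    """P(X >= x) for a normal random variable with the given mean and sd."""
    z = (x - mean) / sd
    return 0.5 * (1 - erf(z / sqrt(2)))


def deadlock_probability(variance):
    """Chance that neither major-party candidate reaches the threshold."""
    sd = sqrt(variance)
    one_candidate_wins = normal_tail(THRESHOLD, MEAN, sd)
    # By symmetry the other candidate has the same chance of winning outright.
    return 1 - 2 * one_candidate_wins


coin_flip_everywhere = deadlock_probability(VARIANCE)
one_third_in_play = deadlock_probability(VARIANCE / 3)

print(f"all states in play:      {coin_flip_everywhere:.3f}")
print(f"one third of votes in play: {one_third_in_play:.3f}")
```

The first number lands on the 26% above, and the second comes out close to the 44% figure; neither accounts for the correlation between states discussed next.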

And this whole analysis neglects the finer points of electoral college strategy. States aren't independent of each other -- we wouldn't see an election in which Utah went Democratic while Massachusetts went Republican, or even one where Virginia went Democratic but New Jersey went Republican, to be a little more reasonable. (Both of those states are probably in play, but New Jersey is far enough left of Virginia that they shouldn't break that way.) And in the end it could come down to just a few states -- the 2004 election basically came down to Florida, Ohio, and Pennsylvania -- in which case this whole normal approximation breaks down. But we won't know which states those are for a long time yet.

edit, 12:09pm: Can Bloomberg Win? suggests the reverse of Healy's plan -- Bloomberg wins a few states, the Democrat and Republican split the rest, and Bloomberg cuts a deal with the electors of the party that gets fewer electoral votes, making him President. Rasmussen Reports talks about possible "electoral chaos" which could fundamentally change the way we elect our Presidents.

edit, 7:47pm: As reader Elizabeth has pointed out in a comment, New York is reliably Democratic; this changes things a bit, so the chances that Bloomberg plays the spoiler by allowing neither other candidate to get 270 electoral votes (under the second set of assumptions) are more like 38%.

23 June 2007

You can't win if you don't play

A Dutch woman has sued the Nationale Postcode Loterij ("National Postcode Lottery"); from what I can gather, some of the winnings in this lottery are shared among all the people who bought a ticket and who share the same postal code as the jackpot winner.

This woman didn't buy a ticket, and sued for emotional distress.

A lot of US states (Pennsylvania, my home state, included) have two fairly low-stakes lottery games, one of which consists of picking three numbers from 0 to 9, and one of which consists of picking four. One could concatenate those numbers to get something of the form 867-5309, which is coincidentally the form of US phone numbers without the area code. I've always thought it would be kind of amusing if the people who had that phone number got some money. Not a large amount, because they didn't buy a ticket -- maybe a few thousand dollars. That would cause a lot less neighborly stress; Pennsylvania has ten area codes and the people who won would be in ten different geographical places, and almost certainly don't know each other.

Checkout lines and genderfree bathrooms

Yes, the two things mentioned in the title have something in common.

A Long Line for a Shorter Wait -- June 23 New York Times.

Whole Foods stores in New York City have moved from having a line for each checkout register (which is for the most part standard in American food stores) to a single line for the whole store. This means that the line looks longer but customers get through it faster. A few of the commenters at the NYT article have pointed out that you don't actually get through the line faster with this system; however, the probability of waiting a very long time is reduced. And that's really what the store wants to minimize. When you go grocery shopping, you understand that you're going to have to wait in line.

In general, if you wait in a line and there's a line on either side of you, the chances are one in three that your line will be faster than both of the lines adjacent to you. So there's a two in three chance that you'll regret your choice and think "damn, I should have gotten in that one!" -- and that's if you can't see any lines other than those two. I suspect that what a grocery store actually wants to minimize is a combination of average waiting time, some sort of "maximum" waiting time (maybe the 95th percentile?), and the number of people who feel like they got screwed over.

The article claims that the waits are much longer in NYC than elsewhere, though. If this is true, why? My guess is the following. Let's say your store's checkout people can serve 5 people per minute. Then in any given minute, the line only gets longer if more than 5 people come in. If on average four people come to the checkout per minute, then the line will only get longer in 22% of minutes (the minutes when six or more people get in line); it'll stay the same length in 16% of minutes (those when five people get in line); it'll shrink in 62% of minutes. So the line doesn't have much of a chance to get long. If on average 4.8 people come to the checkout per minute (96% of your store's capacity), these probabilities are 35%, 17%, 48% respectively; suddenly it's easier for the line to get long.
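Those percentages come from treating the number of arrivals per minute as a Poisson random variable, which is my modeling assumption and not something the article says. A short sketch of the calculation follows; the function names and the default capacity of 5 per minute are just the example's numbers.

```python
from math import exp, factorial


def poisson_pmf(k, rate):
    """P(exactly k arrivals) when arrivals are Poisson with the given mean."""
    return exp(-rate) * rate ** k / factorial(k)


def line_change_probabilities(arrival_rate, capacity=5):
    """Return (P(line grows), P(line stays the same), P(line shrinks)) for one minute."""
    shrink = sum(poisson_pmf(k, arrival_rate) for k in range(capacity))
    same = poisson_pmf(capacity, arrival_rate)
    grow = 1 - shrink - same
    return grow, same, shrink


for rate in (4.0, 4.8, 5.0):
    grow, same, shrink = line_change_probabilities(rate)
    print(f"rate {rate}: grow {grow:.3f}, same {same:.3f}, shrink {shrink:.3f}")
```

The output agrees with the figures above to within a percentage point of rounding. Note that "shrinks" here quietly assumes there is a line to shrink.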

If on average 5 people come per minute, the line is equally likely to grow or shrink in any given minute. And if the store's understaffed (someone's sick, maybe?) then forget it -- the line is more likely to grow than to shrink and will probably get out of control. Perhaps NYC grocery stores are slightly understaffed relative to grocery stores elsewhere but this translates into big differences for line length.

Incidentally, a related problem comes to mind. There are a large number of small business establishments (restaurants, coffee shops, etc.) which have two bathrooms, each of which has a single toilet in it. In many cases these two bathrooms are marked "men" and "women". This means one has to wait longer than if there were two bathrooms, both of which were marked "bathroom". One might argue, though, that since men on average take less time in the bathroom than women, a system such as this would actually slow things down for men.

I've seen more and more people taking this into their own hands by using whatever bathroom is free in such establishments. The usual protocol seems to be to try the bathroom marked with one's gender first, and then try the other one if the "right" one is occupied.

And let's not forget that some people face a real quandary trying to decide which bathroom to use! safe2pee.org -- bathrooms for everyone has listings and maps of bathrooms which are safe for such people, either because they are single-occupancy or because they are explicitly genderfree.

22 June 2007

Six murders in one day in Philadelphia.

There were six homicides in Philadelphia yesterday. The headline in the Philadelphia Inquirer is "Summer's beginning: Six dead in one day". The events happened as follows:

  • a triple homicide in North Philadelphia;

  • a triple shooting in Kensington -- two died, one was critically wounded;

  • one man shot to death in Kingsessing.

I saw the headline while walking past a newspaper box well before I read the article. I thought "hmm, six murders in one day, is that a lot?" Last year Philadelphia had 406 murders; this year there have been 195 so far, as compared to 177 up until this time last year. The number I carry around in my head is that Philadelphia has one murder a day, although the actual 2006 figure was about 1.11 murders per day.

Since I didn't know that there had only been three incidents, I assumed that the six murders had all been separate. Furthermore, I assumed that murders are committed independently, since the murderers aren't aware of each other's actions. This second assumption seems believable to me. I've heard that, say, school shootings inspire copycats, mostly because they create a media circus around them -- at the time of the Virginia Tech massacres I remember people saying that the media shouldn't cover the shootings so much because they might "give people ideas", and I vaguely recall similar sentiments around the time of Columbine. But a single murder, in a city where the average day sees one murder, doesn't draw much attention.

If the murders are independent, then I figure I can model the random variable "number of murders per day" with a Poisson distribution. The rate of the distribution would be the average number of murders per day, which is 1.11; thus the probability of having n murders in a day should be e^(-1.11) (1.11)^n / n!. This leads to the numbers:

n (murders in one day)   0        1        2        3        4        5        6         7+
Probability              0.3296   0.3658   0.2030   0.0751   0.0209   0.0046   0.00086   0.00016

So six or more murders should happen in a day about one day in a thousand, or once in almost three years. That seems like an argument for newsworthiness. But on the other hand, let's say there's some lesser crime -- crime X -- that is committed in Philadelphia with such frequency that crime X does not occur on only one day in a thousand. (Such a crime would be something that happens 2516 times per year, or 6.9 times a day.) I don't see that being front-page news. Lots of one-in-a-thousand things happen every day.
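The Poisson figures can be regenerated in a few lines. This sketch uses the 1.11-per-day rate from the text; the function name is mine.

```python
from math import exp, factorial

RATE = 1.11  # average murders per day in 2006, from the text


def poisson_pmf(n, rate=RATE):
    """P(exactly n murders in a day) under the Poisson model."""
    return exp(-rate) * rate ** n / factorial(n)


for n in range(7):
    print(f"P({n} murders) = {poisson_pmf(n):.5f}")

# Tail probabilities come from subtracting the head from 1.
six_or_more = 1 - sum(poisson_pmf(n) for n in range(6))
seven_or_more = 1 - sum(poisson_pmf(n) for n in range(7))
print(f"P(7 or more) = {seven_or_more:.5f}")
print(f"P(6 or more) = {six_or_more:.5f}, or one day in {1 / six_or_more:.0f}")
```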

Of course, what actually occurred yesterday was not six independent murders. It sounds like there were only three murderers. So it's time for new assumptions. Let's now assume that all murderers act independently, but that two in five of them kill one person; two in five kill two people; one in five kill three people. This means the average murderer kills 1.8 people. Further, let's say that murderers go out and kill people as a Poisson process with rate 0.62 -- that's the old rate divided by 1.8, so there are still the same number of murders.

(The assumptions of how many people a murderer murders are made up, I admit, but the only list of murders I can find are the Inquirer's interactive maps, and it doesn't seem worth the time to harvest the data I'd need from them.)

Now, for example, the probability that three people are murdered on any given day is the sum of the probabilities of three scenarios: one triple homicide; one double and one single; or three singles. Running through the computation, I get:

n (murders in one day)   0        1        2        3        4        5        6        7+
Probability              0.5379   0.1334   0.1499   0.1012   0.0372   0.0230   0.0103   0.0071

The probability of one or two murders in a day goes down; the probability of zero, or of three or more, goes up. Suddenly yesterday isn't nearly as rare. Days with six or more murders are, under these assumptions, 1.74% of all days -- just over six per year.
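Here is one way to run through that computation, as a sketch: condition on the number of murderers active in a day (Poisson with rate 0.62) and convolve the made-up victims-per-murderer distribution with itself that many times. The names are mine, and this is a brute-force approach rather than anything clever.

```python
from math import exp, factorial

RATE = 0.62                        # murderers per day, from the text
VICTIMS = {1: 0.4, 2: 0.4, 3: 0.2} # made-up distribution of victims per murderer
MAX_N = 7                          # largest daily total we track exactly


def convolve(a, b, size):
    """Distribution of the sum of two independent counts, truncated to `size` entries."""
    out = [0.0] * size
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if i + j < size:
                out[i + j] += x * y
    return out


def murders_per_day(max_n=MAX_N):
    """P(n murders in a day) for n = 0..max_n under the compound Poisson model."""
    size = max_n + 1
    single = [0.0] * size
    for victims, prob in VICTIMS.items():
        single[victims] = prob
    totals = [0.0] * size
    k_fold = [1.0] + [0.0] * max_n  # victim count from zero murderers
    # Every murderer kills at least one person, so k <= max_n covers all cases.
    for k in range(size):
        weight = exp(-RATE) * RATE ** k / factorial(k)
        for n in range(size):
            totals[n] += weight * k_fold[n]
        k_fold = convolve(k_fold, single, size)
    return totals


probs = murders_per_day()
six_or_more = 1 - sum(probs[:6])
for n, p in enumerate(probs):
    print(f"P({n} murders) = {p:.4f}")
print(f"P(6 or more) = {six_or_more:.4f}")
```

The results match the figures above to within rounding in the last digit.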

The calculation I'm afraid to do -- if I even could do it -- is "how likely am I to get murdered each time I go outside?" Fortunately I live in a decent neighborhood; but some neighborhoods not that far away from me have had some of the worst violence. But it occurred to me that at 400 murders a year, if you live in Philadelphia for 75 years there will be thirty thousand murders in that time span. Philly has about 1.5 million people. So if things stay like they are, the average Philadelphian has a one in fifty chance of dying by murder. In comparison, the nationwide murder rate in 2005 was 5.6 per 100,000; multiplying by an average lifespan of 75 years we get 420 murders per 100,000 people. So one in every two hundred and forty Americans will die of murder, if things stay like they are.

21 June 2007

Dartboards are about luck. Even in baseball.

David Gassko of The Hardball Times has a feature he calls the THT dartboard. They rank all 30 Major League Baseball teams by something they call the "dartboard factor", which is the number of wins a team would be expected to get -- over the 162-game season -- if they played an average schedule and hadn't gotten particularly unlucky or lucky in winning close games.

This seems like the exact opposite of what I would call a "dartboard factor". I would think the "dartboard factor" would be how much of a team's record is due to luck, or to playing an easy or hard schedule. Incidentally, I'd say that strength of schedule is a luck issue, because the schedule makers aren't supposed to take into account which teams are easy and which teams are hard, though it's not nearly as important in baseball as it is in a sport like (American) football.

For example, they say the Phillies will go 78-84 this season, and they call 78 the "dartboard factor". The Phillies are currently 37-35; over 162 games that projects to 83-79. (There ought to be some regression to the mean taken into account here, but the mean is a .500 team anyway, so I'll ignore it.) So if I were going to call anything the "dartboard factor", it would be the five "extra" games that the Phillies will win (assuming they keep going at this pace) over the 78 they "should" win. Dartboards are about luck, not skill. I'm reminded of the semi-mythical people who pick stocks by throwing darts, or their modern-day equivalent, the million-dollar waitress.

I won't comment on where their predictions are coming from, because the links in the original post don't actually make it clear.

Also, my guess is that luck doesn't play as big a role in baseball's regular season as it does in, say, football, because there's more time for everything to even out. (Also, football has more season-ending injuries than baseball -- that sort of thing can really screw a team over -- although baseball certainly isn't immune.) Teams get lucky in the playoffs -- anything can happen in a short series.

Quantum probability, and one and a half dead cat jokes

There exists such a thing as "quantum probability". Basically, it's like ordinary probability, except that instead of having probability density functions, you have wavefunctions. (Sometimes you have "discrete wavefunctions", which are wavefunctions that are concentrated on, say, the integers; I don't know if this is the right technical terminology. This doesn't seem to occur in actual quantum systems -- what with the world being continuous and all -- but that doesn't stop a mathematician!) These are annoying because wavefunctions are complex-valued and I can't keep nearly as many complex numbers straight in my head as real numbers. I'll probably have more to say about quantum probability in the future.

Anyway, I was walking down the street earlier today, lamenting this fact (after trying to do such computations in my head and failing) and I saw somebody wearing a T-shirt which said "Schrodinger's cat is dead". I thought this was sad! Then after she walked past I saw that on the back of her shirt it said "Schrodinger's cat is not dead".

This joke's been done before, though. Griffiths' text on quantum mechanics has a picture of a live cat on the front and a dead cat on the back.

An ex of mine said that the only legitimate use of HTML's <blink> tag (which fortunately is used a lot less often than it was in, say, the late nineties) was the following:
Schrodinger's cat is <blink>NOT</blink> dead.

The summer solstice and the longest days

Quick, what's the longest day of the year? (In the northern hemisphere.)

If you answered June 21 (today!), you're probably right. That's the date of the summer solstice, at least in most years and in most time zones. (You may have thought that the solstice was an entire day, but in fact it's just a moment in time, the moment when the sun is furthest north. The sun doesn't move, of course, at least not in the usual treatment of astronomy -- but the Earth moves around the sun, so sometimes its northern part is pointed more towards the sun and sometimes its southern part is.)

But on what day does the sun rise the earliest? Or set the latest?

This is a trickier question. "Trickier", here, means "I don't remember the answer". But the U. S. Naval Observatory makes available a sun or moon rise/set table for one year. You can enter your location and it'll tell you when the sun rises and sets on each day in, say, 2007. The patterns don't change from year to year, because the Gregorian calendar is what we call a "solar calendar" and is pretty well correlated with the seasons. Its predecessor, the Julian calendar, didn't have this property -- it slipped relative to the seasons by a bit under a day per century. For more than you ever wanted to know about calendars, see Claus Tondering's calendar FAQ.

If I enter my location -- Philadelphia -- into the table, it tells me that the day the sun rises the earliest is any day between June 10 and June 18, when it rises at 5:31 am. Let's say that the actual earliest sunrise is in the middle of this period, June 14. Similarly, the latest sunset is on any day between June 26 and June 29; let's call it June 27. These are a week earlier and later than the solstice. On the winter side of things, the shortest day is December 20 (only nine hours and nineteen minutes - sunrise is at 7:19, sunset at 4:38), but the earliest sunset is around December 7 (4:35) and the latest sunrise is around January 5, 2008 (7:23).

What's the cause of this? It's a little something known as the equation of time, which basically says that the earth runs "fast" in some seasons and "slow" in others. In December it's running faster than in January, and in early June it's running faster than in late June.

I first really became aware of this phenomenon when I lived in Boston. In Boston winters, night comes very early -- it's not uncommon to see the pink and purple shades of sunset at around 3:30 on a December afternoon. The actual earliest sunset comes at 4:12 on the 8th of December; as you head further north the earliest sunset and latest sunrise both move towards December 21, because the "equation of time" becomes less significant with respect to the variation in day length. (In Miami, for example, they're November 29 and January 14; in Anchorage, they're December 15 and December 26.) You end up getting some strange asymmetries. You'd think that on dates equidistant from the winter solstice, you'd have the same time of sunset.

But you don't. In mid-November and late January, the sun sets in alignment with MIT's Infinite Corridor, which is a very long hallway running through the center of campus. In November it happens around November 12 (thirty-nine days before the winter solstice), at 4:20 pm; in January, it happens around January 29 (thirty-nine days after), at 4:50 pm. That actually helps in the Boston winters, believe it or not -- by the time it's getting really cold at least it feels like the sunlight is starting to come back. At least if you were someone like me who was never awake for sunrise.

Finally, in the summer the sun lines up with Manhattan streets at sunset, on May 28 and July 12. This is called "Manhattanhenge", and some people claim that the alignments are cosmic signs of Memorial Day and baseball's All-Star break. Of course, they're not; Manhattan just isn't aligned with the "north" that we usually call by that name. Most places with a regular grid of streets will have a day like this, although I haven't seen references to it happening in places other than Manhattan.

20 June 2007

Good health is worth $631,000 a year?

Nattavudh Powdthavee claims that improving your health from "very poor" to "excellent" makes you as much happier as $631,000 extra per year would.

Does this seem reasonable to you? It doesn't to me. From what I understand, studies such as this are done by asking people to assume they are in excellent health, and then asking how much money they would accept to have (say) a 1% chance of being in very poor health. On average they say $6,310 a year. (In the case of health, it might work by looking at how much people are willing to pay for health insurance; this seems like the sort of thing the folks at Freakonomics might do, although I don't know if they've done it.) Divide by .01 and you get this $631,000 figure.

It gets even weirder when you look at it in reverse. The article claims that "Widowhood packs a psychic punch of $421,000 a year in losses". Taken literally, this means that the average widow would pay $421,000 to have her husband back for a year -- despite the fact that this is almost certainly a large multiple of her entire annual income. Maybe more than her house costs. She'd become homeless in order to have her husband back? For a year?

Or: Increasing face time with friends and relatives from "once or twice a month" to "on most days" feels like getting a $179,000 raise. That's about $500 a day. Are you saying you'd go out today to see your friends if I offered you five hundred bucks to stay in?

I suspect the problem here is that it's hard to express happiness in terms of money. Dollars are not the natural unit of happiness. If you gave me an extra $20,000 a year, I'd be happy -- that would be about a doubling of my income. If you gave that money to the punks who hang out on the street in my neighborhood, who have no income, they'd be even happier -- now they'd have a roof over their head! But if you gave Bill Gates that money, he wouldn't care at all. For him it's a rounding error. Some people have claimed that, say, getting a 10% raise feels the same to everybody, which seems a lot more reasonable. So what people are trying to maximize is actually the logarithm of the amount of money they have.

Note that if you choose to redistribute all the money in the world so that the sum of the logarithms of everybody's net worth (or income) is maximized, then everybody should have the same amount of money. Intuitively, if I have $30,000 and you have $10,000, then the total amount of happiness is less than if we both have $20,000; if you apply this to all pairs of people then you get this flat distribution. I leave critiquing this argument as an exercise for the reader.
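The two-person example can be checked numerically. This is only a sketch of the logarithmic-happiness assumption from the previous paragraph, with a function name I made up; it does nothing to address the holes in the argument.

```python
from math import log


def total_log_happiness(amounts):
    """Sum of log(money) over everybody, the quantity being maximized."""
    return sum(log(a) for a in amounts)


unequal = total_log_happiness([30_000, 10_000])
equal = total_log_happiness([20_000, 20_000])

print(f"unequal split: {unequal:.4f}")
print(f"equal split:   {equal:.4f}")
print(equal > unequal)
```

The comparison boils down to 20,000 * 20,000 being larger than 30,000 * 10,000, since a sum of logarithms is the logarithm of the product.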

The Phillies defeat the Qankees in nine games?

Scott Boras is pushing for a nine-game World Series, with the first two games to be played at a neutral site. This gives me a reason to post the following, which I wrote a couple weeks ago, about how good the World Series is at identifying the "best" team. A best-of-nine World Series wouldn't be much of an improvement over a best-of-seven World Series, in this regard. Let's say one of the two teams in the World Series has a 55% chance of winning over the other in any given game; they'd have about a 61% chance of winning a seven-game series, and a 62% chance of winning nine. It would bring more money. It would also lengthen the season by, say, another three days (two playing days and a travel day), and as many people pointed out in April the season is already too long for an outdoor sport. Game 7 of this year's World Series, if it happens, will happen on Thursday, November 1. Do we really want November baseball to be a routine event?

(Also, Boras likens these first two, neutral-site games to the Super Bowl. Which makes me wonder -- has the Super Bowl ever been played at a site that wasn't actually neutral, because it just happened that the host team did well that year?)

Anyway, on to the math.

In the World Series of baseball, two teams P and Q play against each other until one has won four games; this team is declared the champion. If team P has probability p of winning each individual game, and the games are independent, what is the probability that team P wins the series?

(From here on out, I will call the two teams the Phillies and the Qankees. "Phillies" because, well, I actually want them to win; Qankees because Qankees is fun to say. It's pronounced quan-keys, like how a little kid would say "cranky".)

This problem is often posed in introductory probability texts (in fact, a couple days ago I showed a student I'm tutoring how to do it), and the solution those texts have in mind runs something like this. If the Phillies win, they do it in four, five, six, or seven games. The probability of them winning in four games is clearly p^4. The probability of them winning in five games is 4 p^4 q, where q = 1 - p. This is because there are four ways for the Phillies to win in five games -- we just pick which of the first four games they lose. Similarly, the probability of the Phillies winning in six is 10 p^4 q^2 -- the number "10" is the number of ways we can pick two games out of the first five for them to lose. And they win in seven with probability 20 p^4 q^3.

Thus, the total probability of the Phillies winning is p^4 (1 + 4q + 10q^2 + 20q^3). If we remember that q = 1 - p, we get that this is p^4 (35 - 84p + 70p^2 - 20p^3). Plugging in numbers, we see, for example, that if the Phillies have a 55% chance of winning each individual game, they have about a 60.8% chance of winning the entire series, and that the Qankees would then be throwing temper tantrums because they didn't get their twenty-seventh championship. (How you would know they have a 55% chance of winning each game is a different story.)
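The same counting argument works for any series length, which makes it easy to put in code. A sketch, with a function name of my own choosing: the team wins the clinching game, and we choose which of the earlier games it lost.

```python
from math import comb


def series_win_probability(p, wins_needed=4):
    """Probability of winning a first-to-`wins_needed` series with per-game win chance p."""
    q = 1 - p
    return sum(
        # Choose which of the first (wins_needed - 1 + losses) games were losses;
        # the final game must be a win.
        comb(wins_needed - 1 + losses, losses) * p ** wins_needed * q ** losses
        for losses in range(wins_needed)
    )


print(f"best of seven at p = 0.55: {series_win_probability(0.55, 4):.4f}")
print(f"best of nine at p = 0.55:  {series_win_probability(0.55, 5):.4f}")
```

This reproduces the roughly 61% and 62% figures quoted for the seven-game and nine-game series.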

But if you're designing a system of playoffs, this polynomial isn't that interesting. It seems to me that you can basically sum up the whole problem in a single number. You can view the playoffs as a probabilistic algorithm for determining which team is better. (They're a pretty weak probabilistic algorithm, at least when the teams are fairly evenly matched.) The natural question to ask seems to be: if the Phillies have a probability 1/2 + ε of winning a single game, what's the probability of them winning the series? (This turns out to be a sneaky way to compute the derivative of the above polynomial at p = 1/2.) So we let p = 1/2 + ε in the above computation, and -- here's the trick -- we act as if ε^2 = 0. Then the Phillies' probability of winning is

(1/2 + ε)^4 (1 + 4(1/2 - ε) + 10(1/2 - ε)^2 + 20(1/2 - ε)^3)

which simplifies to 1/2 + (35/16)ε. I'll call this coefficient 35/16 the amplification of this playoff system.

Doing the same computation for different series lengths gives:

Number of wins needed        2      3      4      5      6      8      12     16
Amplification (as decimal)   1.50   1.88   2.19   2.46   2.71   3.14   3.87   4.48

The amplification is very nearly proportional to the square root of the number of wins needed. I'd bet it's 2/√π times the square root of the number of wins -- the constant seems to work out, and π seems to occur a lot in these types of problems, because in the end we have to compute factorials and Stirling's approximation is a very good approximation to factorials that includes π. (A confession: the numerical work actually comes from computing the probabilities in the first way given above, because I have a fast computer at my disposal and didn't want to figure out how to program the second solution.)
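For what it's worth, the "act as if ε^2 = 0" trick is not hard to program. The sketch below uses a tiny hand-rolled `Dual` class (my own, not from any library) holding a + b*ε with exact fractions, and reads the amplification off as the ε coefficient of the series-winning probability.

```python
from fractions import Fraction
from math import comb, pi, sqrt


class Dual:
    """A number a + b*eps, where eps**2 is treated as 0."""

    def __init__(self, a, b=0):
        self.a = Fraction(a)
        self.b = Fraction(b)

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.b + other.b)

    __radd__ = __add__

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, dropping the eps^2 term.
        return Dual(self.a * other.a, self.a * other.b + self.b * other.a)

    __rmul__ = __mul__


def power(x, k):
    """x raised to a non-negative integer power by repeated multiplication."""
    result = Dual(1)
    for _ in range(k):
        result = result * x
    return result


def amplification(wins_needed):
    """Coefficient of eps in P(series win) when the per-game chance is 1/2 + eps."""
    p = Dual(Fraction(1, 2), 1)
    q = Dual(Fraction(1, 2), -1)
    total = Dual(0)
    for losses in range(wins_needed):
        total = total + comb(wins_needed - 1 + losses, losses) * power(q, losses)
    return (power(p, wins_needed) * total).b


for n in (2, 3, 4, 5, 6, 8, 12, 16):
    exact = amplification(n)
    guess = 2 * sqrt(n / pi)
    print(f"{n:2d} wins: {exact} = {float(exact):.2f}, 2*sqrt(n/pi) = {guess:.2f}")
```

For four wins this returns exactly 35/16, and the decimal values agree with the table; the last column printed is the conjectured 2/√π times the square root of the number of wins, for comparison.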

What does this tell us, then? It tells us that to get the amplification to be twice as good, we have to play four times as many games! This is something that occurs pretty often -- political opinion polling, for example, follows the same principle. To get the "margin of error" down from the standard 3% to 1.5% requires polling four thousand people instead of one thousand.

A principle that occurs fairly often in randomized algorithm design is that if you have an algorithm that gives you a correct answer with probability greater than 1/2, then you can run the algorithm repeatedly and be more confident in its results. This is actually the principle behind playing a series of games instead of a single game, but what if we played a series of series? Or a tennis match, with its multi-tiered structure (point, game, set, match)? I'll look at this in a future post.

The Million-Dollar Waitress and stock-picking scams

The Million-Dollar Waitress, from Business Week.

A waitress in Ohio, Mary Sue Williams, may win the million-dollar grand prize in a CNBC stock-picking contest. She was in sixth place when the contest ended on May 25. But a flaw in the way the contest was set up meant that, basically, the five people ahead of her "cheated". They found a way to select stocks to buy at, say, $20 -- and then wait until they went up to $25 before pressing the "buy" button. Because of the way the contest had been programmed, they only had to pay $20 in fake money to buy the stock. (I'm putting "cheated" in quotes because although this is clearly against the spirit of the contest -- you couldn't do this in the real market -- perhaps one could argue that the contest is defined by whatever the computer lets people do.)

Furthermore, it's not really clear how meaningful a contest like this is. From what I can tell, the contest lasted for ten weeks; the winner each week won $10,000 and the grand-prize winner won $1,000,000. Although I can't find information about the prize structure, my guess would be that any other prizes were much smaller. So this encourages risk-taking that nobody would take with actual money. Let's say there's a second-place prize, and it's $100,000. In "real life", most people would be happy with that money. But if you bet -- and I'm using "bet" here because this really does feel like gambling -- all your money on some very risky stock, then you have a shot at multiplying your money by ten. But it's probably a lot more likely that you'll lose it all.

Jim Kraber, on the other hand, played legitimately but at one point had 1600 portfolios. This enabled him to take the sort of risks that someone with a single portfolio -- or even a number of portfolios that could reasonably be played with hard currency -- would never take, because he could afford to just throw out a portfolio that wasn't succeeding.

Williams admits that "Part of this was luck... a lot of it was a gut feeling, some eenie-meenie-minie-moe, and common sense." Now, I'm not saying that she can't pick stocks -- who knows, she might be quite good at it. But this story reminds me of the following scam. I get a mailing list with 64,000 people on it. (I'm not sure whether this should be postal mail or e-mail; the story, which is not mine and which I learned from a book by John Allen Paulos, predates e-mail.) I tell 32,000 of them they should, say, bet on football team A to win this week, and 32,000 that they should bet on the same team to lose. The team either wins or loses. Next week, I take the 32,000 that got the right answer this week, and split them in half. To 16,000 I say that team B will win this week, and to 16,000 I say that team B will lose. I repeat this for six weeks in all, until 1,000 people have received six correct predictions. Note that when I begin this scheme I don't know which 1,000 people this will be, but I know they'll exist. Then I try to sell these thousand people my "system". Maybe they'd buy it!
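The arithmetic of the scam is just repeated halving. Here's a tiny sketch (the function name is made up for illustration) that tracks how many people have seen nothing but correct predictions after each week.

```python
def surviving_recipients(initial, weeks):
    """People who have received only correct predictions after each week.

    Each week the remaining list is split in half and the two halves are
    sent opposite predictions, so exactly half of them see a correct one.
    """
    remaining = initial
    history = [remaining]
    for _ in range(weeks):
        remaining //= 2
        history.append(remaining)
    return history

print(surviving_recipients(64000, 6))
# [64000, 32000, 16000, 8000, 4000, 2000, 1000]
```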

19 June 2007

Names for large numbers

Language Log on number delimitation. In the United States, we put commas between every three digits: 123,456,789. In Europe they use periods instead of commas, but in the same places.

In China, they group into sets of four digits instead of three: 1,2345,6789. (I've seen this a few times written with the commas, in English-language texts; it's very disorienting.) The ancient Greeks also did something like this: they referred to "myriad", "myriad myriad", and so on, where "myriad" is 10^4. ("Myrio-" and "myria-" are also obsolete metric prefixes for 10^4 and 10^-4 respectively.)

In India, they break into a low-order group of 3 and then groups of 2: 12,34,56,789. This reflects the structure of the language -- certain odd powers of 10 have names, with 10^5 being "lakh" and 10^7 being "crore".
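The three grouping conventions so far differ only in the size of the lowest-order group and the size of every group after it. Here's a small sketch of that; the function and its parameter names are my own, and it only handles non-negative integers.

```python
def group_digits(n, first, rest, sep=","):
    """Group the digits of a non-negative integer n.

    first is the size of the lowest-order group and rest is the size of
    every group after it (both are illustrative parameters).
    """
    digits = str(n)
    groups = [digits[-first:]]
    digits = digits[:-first]
    while digits:
        groups.append(digits[-rest:])
        digits = digits[:-rest]
    return sep.join(reversed(groups))

print(group_digits(123456789, 3, 3))  # 123,456,789  (threes)
print(group_digits(123456789, 4, 4))  # 1,2345,6789  (fours)
print(group_digits(123456789, 3, 2))  # 12,34,56,789 (three, then twos)
```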

In a way, though, we do the same thing, at least if you consider the etymology of the number names thousand, million, billion, trillion, and so on. (By "billion" I mean an American billion, 10^9.) "Thousand" is somehow special; we're treating the first power of 10^3 differently than all the others. The British system, where successive powers of 1000 are named thousand, million, milliard, billion, billiard, trillion, ... is more "logical". But shouldn't "thousand" be "thousard" or something like that? And "billiard" is also a name for that game where you hit the balls with the sticks.

There's also the Knuth -yllion notation, which answers a question that Poser asked: are there systems with groups that double in size? The answer is yes, although I don't think anyone seriously uses this system.

Forbes on roulette (and a few other casino games)

The Best Bets at the Casino, from Forbes.com. (I found this one through the bar at the top of the screen in gmail that delivers me all sorts of random news.)

As I've said before, probability theory was invented to solve problems in games of chance. Supposedly the first nontrivial probability problem was the problem of points. Two people are flipping a coin; one bets on heads and the other bets on tails. They agree to flip until either heads has occurred ten times or tails has occurred ten times. But they have to quit when the score is seven to five; how should they split up the money?
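One classical answer, usually credited to Pascal and Fermat, is to split the pot in proportion to each player's chance of winning if the game were played out. Here's a sketch of that calculation, assuming a fair coin; the function name and its arguments are hypothetical.

```python
from fractions import Fraction
from math import comb

def share_for_first_player(target, a, b):
    """Fraction of the pot owed to the player with a points, assuming a fair
    coin and a split proportional to the chance of reaching target first.

    The game is certainly over within need_a + need_b - 1 more flips, and the
    first player wins exactly when at least need_a of those go their way.
    """
    need_a = target - a
    need_b = target - b
    remaining = need_a + need_b - 1
    winning = sum(comb(remaining, k) for k in range(need_a, remaining + 1))
    return Fraction(winning, 2**remaining)

print(share_for_first_player(10, 7, 5))  # 99/128
```

So under this rule the player with seven gets 99/128 of the money, a bit more than three quarters.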

Anyway, the Forbes article claims: "If your goal is to nab the best risk-adjusted return (as opposed to playing for hours on end), place fewer, smarter and larger bets." I'm not entirely sure what this means, because "risk-adjusted return" is a vague phrase. If you bet, say, $10 each on 100 spins of the roulette wheel, or $100 each on 10 spins of the roulette wheel, either way you expect to lose about fifty bucks -- $52.63 to be exact. This is what we mean when we say that the "house edge" in roulette is 5.26% -- it means that the house will, on average, take 5.26% of your bet. (If all your bets are on "red" or "black", then roulette is basically like flipping a coin -- a very complicated coin -- except that there are certain special outcomes where both "red" and "black" lose.)
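Here's the expected-loss arithmetic as a sketch. It assumes an American wheel (38 pockets, 18 of them winning for an even-money bet like "red"), which is what a 5.26% house edge corresponds to; the function name and defaults are mine.

```python
from fractions import Fraction

def expected_loss(bet, spins, pockets=38, winning=18):
    """Expected total loss on even-money bets of size bet over spins spins.

    Assumes an American wheel: 38 pockets, 18 of which win the bet.
    """
    p_win = Fraction(winning, pockets)
    expected_per_dollar = p_win - (1 - p_win)  # -1/19, i.e. about -5.26%
    return -expected_per_dollar * bet * spins

print(float(expected_loss(10, 100)))  # about 52.63
print(float(expected_loss(100, 10)))  # the same: only the total wagered matters
```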

But the variance of your winnings on a single spin at $10 is 100 "square dollars" or so, so the variance of your winnings on 100 spins is 10,000 "square dollars". The standard deviation is about $100. For a single spin at $100, the variance is 10,000 "square dollars", and so the variance on 10 spins at $100 each is 100,000 "square dollars"; the standard deviation is about $316. What does this mean? It means that if you make a few big bets then the random fluctuations won't cancel out. So what you do have is a greater probability of coming out ahead. Maybe this is what they mean by "risk-adjusted return".
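The round numbers above can be checked exactly. This sketch assumes even-money bets on an American wheel (win probability 18/38) and independent spins; the function names are made up for illustration.

```python
from math import comb, sqrt

P_WIN = 18 / 38  # even-money bet on an American wheel (assumption)

def std_dev(bet, spins):
    """Standard deviation of total winnings over independent spins."""
    mean_one = bet * (2 * P_WIN - 1)   # expected winnings on one spin
    var_one = bet**2 - mean_one**2     # winnings are +bet or -bet
    return sqrt(spins * var_one)

def prob_ahead(spins):
    """Chance of winning strictly more spins than you lose (equal bets)."""
    return sum(
        comb(spins, k) * P_WIN**k * (1 - P_WIN) ** (spins - k)
        for k in range(spins // 2 + 1, spins + 1)
    )

print(std_dev(10, 100))   # about 99.86
print(std_dev(100, 10))   # about 315.79
print(prob_ahead(100), prob_ahead(10))  # fewer, bigger bets: better odds of being ahead
```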

To make the statement about "random fluctuations" more precise, I could use the central limit theorem. The central limit theorem says, essentially, that when you add up a bunch of random things you get something that's approximately normally distributed. But it's not really fair to do that here. Why? Because the normal distribution is a continuous distribution -- that is, it can take whatever value we like. The distribution of the amount of money you have after ten spins is quite "lumpy" -- it can be, say, $0 (if we win on five spins and lose on five spins), or $200 (if we win on six and lose on four), but not something in between. I'm afraid of losing something in the holes between $0 and $200, as it were.
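To see the lumpiness directly, here's a sketch that lists every possible net result of ten $100 even-money bets with its exact binomial probability. As before it assumes an American wheel, and the function name is hypothetical.

```python
from math import comb

def net_winnings_distribution(bet=100, spins=10, p_win=18 / 38):
    """Map each possible net result to its exact probability.

    With wins wins out of spins, the net result is
    bet * (wins - losses), so results come in steps of 2 * bet.
    """
    dist = {}
    for wins in range(spins + 1):
        losses = spins - wins
        net = bet * (wins - losses)
        dist[net] = comb(spins, wins) * p_win**wins * (1 - p_win) ** losses
    return dist

for net, prob in sorted(net_winnings_distribution().items()):
    print(net, prob)
```

Only eleven outcomes are possible, spaced $200 apart, which is a long way from a smooth bell curve.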

(By the way, there are roulette bets that let you bet that any of 1, 2, 3, 4, 5, 6, 12, or 18 numbers come up. These all have the same house edge -- except for the 5. Don't take that one.)
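Here's a quick check of that claim. The payout table is not from the Forbes article; it's the standard American-roulette schedule as I understand it (35 to 1 on a single number down to 1 to 1 on eighteen numbers, with the five-number bet paying 6 to 1), so treat it as an assumption.

```python
from fractions import Fraction

# Assumed standard American payouts: numbers covered -> payout (to 1).
PAYOUTS = {1: 35, 2: 17, 3: 11, 4: 8, 5: 6, 6: 5, 12: 2, 18: 1}

def house_edge(numbers_covered, payout, pockets=38):
    """Expected loss per dollar bet on a wheel with 38 pockets."""
    p_win = Fraction(numbers_covered, pockets)
    return -(p_win * payout - (1 - p_win))

for covered, payout in PAYOUTS.items():
    print(covered, house_edge(covered, payout))
```

Every bet comes out to 1/19 (about 5.26%) except the five-number bet, which comes out to 3/38 (about 7.89%).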

Also, rumor has it that there are devices that you can sneak into a casino which will watch the roulette wheel and tell you where it's more likely to come up. I'm not sure if they're small and unobtrusive enough -- or sufficiently good at prediction -- to be good for anything.

They also seem to claim that it's possible to win at blackjack in expectation without counting cards -- that's not so! The house edge in blackjack under most rules is about 0.5%. But even if you could win in expectation, I'm not sure if it's worth the trouble. If I'm going to have to pay attention to what I'm doing, I don't want to have a high chance of losing money.

18 June 2007

first post

You've stumbled across "God plays dice".

What is this? It's a blog. I plan to use it to talk about probability. Probability's everywhere around us, and in fact the world is basically governed by it. As you probably know, the ultimate laws of physics don't reflect what you see around you every day. The state-of-the-art theories in physics involve quantum mechanics. In our world, you can't walk through walls. But if you were a subatomic particle, maybe you could -- it's called quantum tunnelling. The probability of it happening would be very low -- but it could happen. And in fact, there's a chance that you, too, could end up on the other side of the wall next to you, or that your chair could suddenly disintegrate, or something like that. It's just that that chance is so small that it would take many times the lifetime of the universe for that to happen. Nevertheless, I had a friend in college who claimed to be afraid of quantum tunnelling. He was very smart and knew more physics than I do, so I hope he was joking.

Some people -- people who know a little bit about physics -- think, "oh, quantum mechanics says that things are random" and assume that's it. But that's not true. There's something called a wavefunction which tells us just how likely it is for a particle to be at a certain point in space. And physicists can compute these likelihoods very precisely. (Sometimes the computations get hard, but in principle there's always an answer even if they can't write down the number.) You might not like this. Einstein didn't like this, and Einstein was a very smart man. Einstein once wrote in a letter: "I, at any rate, am convinced that He does not throw dice." Niels Bohr is said to have said "Albert, stop telling God what to do."

(By the way, Einstein also didn't wear socks, or so they say. I tried this for a little while once. It's not such a good idea -- your shoes end up smelling horrible if you don't wear socks. So he didn't know everything.)

That's where the name of this blog comes from. But this isn't a blog about quantum mechanics. It's a blog about probability in all its guises. For example, do you worry that all the air in your room is going to go over to where you aren't, and suffocate you? Probably not. If you have money in the stock market, you might worry that the stock market goes up and down -- but you keep your money there because the "experts" tell you that the market won't collapse. The experts might not know what they're talking about, though -- from what I've heard a lot of them use models that assume the way prices vary follows something called the normal distribution. This isn't because prices actually do -- it's just that that makes the math easier.

And that's why, at least in this blog, I will strive to be a "probabilist" and not a mathematician who does probability. What's the difference? Well, I'm a graduate student in mathematics. In mathematics there's a very strict way of doing things. You start out with certain axioms, which are things that you just assume are true, and then using the laws of logic you can show lots and lots of things are true. This goes back as far as Euclid's Elements, over two thousand years ago. Euclid started from a few pages of axioms about points, lines, and planes, and from that it's possible to prove a wide variety of statements about things you can draw in the plane. But here's something important -- he never defines what a point, a line, or a plane is. It's expected that you just know these things from being a human being. And it turns out that Euclid's geometry is not the geometry that applies, say, on the surface of the earth. On the plane, if you stand at the intersection of two lines, they never meet again. But on the surface of the earth, if you are standing at the intersection of two "lines", then those lines also intersect at a point that is literally half a world away. Probability has a similar batch of axioms -- the probability axioms. And you can reason about these all you want and come up with nice results -- say, the central limit theorems, or the Lovász local lemma. But when you apply these things to the "real world", you get things like the Doomsday argument. This says, among other things, that we don't know how long the human species is likely to survive, so we ought to act as if the human species will continue to exist for exactly as long as it already has existed. Or does it say that the number of people who are yet to be born is the same as the number who have already been born? Since population has grown a lot lately, these aren't the same things. We get ourselves into trouble when we apply probability to the real world -- but that's what I hope to do here.

I hope to convince you, my readers, that randomness is everywhere. (I admit I have not done that in this post.) And I aim to use this blog to teach you about probability, and to try to learn more of it for myself. As I said, I'm a graduate student in math at a Big Fancy University. But we don't talk about things like this. We just assume that Someone Else has figured out how to translate what we're doing into mathematical symbols, and we take it from there. We lose track of "reality", whatever that is. But probability theory was first developed not because a bunch of mathematicians were sitting around wondering how to torture their students -- no, it came into being because people wanted to analyze games of chance. And the world is a game of chance.