
07 September 2008

First number not in Google?

Walking Randomly asks: what's the smallest positive integer that gives no Google hits?

I don't know exactly, but it's in the high eight digits. Why, you ask?

Well, I searched a series of numbers, starting with 1 and roughly doubling. This gave the following data:




Search string    Number of hits
13033319         158
26066638         37
52133277         17
104266555        12
208533110        4

(Each number in the left column is either twice the previous one, or twice the previous one plus one.)

So here's what I'm thinking: numbers around 26 million have, on average, 37 Google hits. Since containing a given number is a Rare Event, I'm guessing that we can treat the number of hits of numbers around that magnitude as Poisson with parameter 37. The probability that a Poisson(37) random variable is equal to zero is exp(-37), or about one in 10^16. So the first integer with no Google hits is probably larger than 26 million.

But by the same argument, numbers around 52 million have probability exp(-17), or about one in 25 million, of having no Google hits. So we expect one between, say, 52 million and 77 million.

And by the time we get up near 100 million, we should be seeing these numbers at a frequency of one in exp(12), or six in a million; they should become commonplace.
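
Here's a quick sketch of this arithmetic in Python, using the hit counts from the table above and the Poisson assumption just described; it's only as good as those estimates:

```python
import math

# Hit counts from the table above (search string -> number of Google hits).
samples = {
    13033319: 158,
    26066638: 37,
    52133277: 17,
    104266555: 12,
    208533110: 4,
}

# Under the Poisson assumption, a number whose "typical" hit count is lam
# has probability exp(-lam) of having no hits at all.
for n, lam in samples.items():
    p_zero = math.exp(-lam)
    print(f"around {n:>11,}: P(no hits) = exp(-{lam}) ~ {p_zero:.2g}, "
          f"about 1 in {1 / p_zero:,.0f}")
```

Around 52 million, for instance, that comes out to roughly one in 25 million, which is where the "expect one between 52 million and 77 million" estimate comes from.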

To get a better model, I'd take more samples. The number of Google hits for a number seems to follow a power law; the number of hits for n is a constant times n^(-α) for some exponent α, somewhere between 1 and 1.5. (There are issues around saying that things follow power laws, though; it's easy to see them even when they're not there.) And there are various complications -- for example, powers of ten and powers of two are more common than the numbers around them. And how do we know that the occurrence of a number on various web pages is actually independent? To be honest, we don't; if a number exists on a web page, it's there for a reason, and if one person has something to say about it, why not someone else?
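
For what it's worth, here's a rough power-law fit to the five samples above; it just does least squares on the log-log data and ignores all the complications mentioned, so take the exponent with a grain of salt:

```python
import math

# (number searched for, Google hits) from the table above
samples = [(13033319, 158), (26066638, 37), (52133277, 17),
           (104266555, 12), (208533110, 4)]

# Fit log(hits) = log(C) - alpha * log(n) by ordinary least squares.
xs = [math.log(n) for n, _ in samples]
ys = [math.log(h) for _, h in samples]
xbar, ybar = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
print(f"estimated exponent alpha ~ {-slope:.2f}")   # about 1.2 on these five points
```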

11 August 2008

Lucky babies redux

Two babies born at 8:08 am on 8/8/08, weighing eight pounds, eight ounces, both in the United States.

How many would you expect?

The 2007 crude birth rate for the US is 14.2 per 1000, per year; the estimated US population is 304,843,316. The product of these is 4,328,775 births per year, or 8.25 births per minute.

From here I can find the distribution of birth weights (in Norway, 1992-1998 -- better figures would be appreciated). About six percent of babies weigh between 3850 and 3950 grams, which is a 3.5-ounce-wide interval; thus about 6%/3.5 = 1.7% of babies weigh 8 pounds, 8 ounces (to the nearest ounce) at birth.

So the expected number of babies born in the US at that particular minute, at that weight, is about 1.7% of 8.25, or 0.14.

There were two. The probability of this happening, assuming births are a Poisson process, is about one in 112. I wouldn't trust this number too much, because birth weights are supposedly growing with time and the Norwegian distribution is probably different from the US distribution.
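
If you want to redo this arithmetic, here's a sketch in Python; the inputs are the estimates above (birth rate, population, the Norwegian weight distribution), so the output is no more trustworthy than they are:

```python
import math

births_per_year = 304_843_316 * 14.2 / 1000           # ~4.33 million
births_per_minute = births_per_year / (365 * 24 * 60)  # ~8.2

frac_8lb8oz = 0.06 / 3.5   # ~1.7% of babies weigh 8 lb 8 oz to the nearest ounce
lam = births_per_minute * frac_8lb8oz   # expected such babies in one minute, ~0.14

# Poisson probability of seeing two or more in that single minute
p_two_or_more = 1 - math.exp(-lam) * (1 + lam)
print(f"expected ~ {lam:.2f}, P(at least two) ~ {p_two_or_more:.4f}, "
      f"about 1 in {1 / p_two_or_more:.0f}")
```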

So if I had to guess, people at the hospitals are fudging the numbers; if we were being totally honest, those babies might turn out to have been born at 8:09 am and weighed eight pounds, seven ounces, or something like that. Not that there's anything wrong with that.

(This post borrows a lot from a post I just remembered I made, lucky babies, about babies born on July 7, 2007 at 7:07 and weighing seven pounds, seven ounces -- but those were fictional babies.)

08 September 2007

mapping functions and genes "crossing over"

In genetics they have a unit called the centimorgan. It's a unit of what is called recombination frequency, and it doesn't seem to be well-defined. For those of you who don't remember your biology (and I'll admit I'm one of them), recall that almost all cells contain chromosomes in pairs (23 pairs in humans). In the process of meiosis, cells are produced which contain a copy of one member of each pair. When fertilization occurs, these come together to form a new pair of chromosomes. However, this new pair mixes up or "recombines" parts of the old pair, as can be seen in the image.

The result is that two genes which are physically close together on the same chromosome will be inherited together, but two genes which are physically far apart might not be inherited together. When one learns about this in an introductory biology class, I think the fact that two crossovers are, in a sense, the same as no crossover at all is ignored. That is, if the chromosomes cross over twice, or four times, or six times, or any even number of times between two genes, then those genes will end up on the same copy of the chromosome even after crossing over. (A more quotidian analogy: you walk down a street, arbitrarily crossing it "when the mood strikes"; the probability that at some given moment in the future you are on the opposite side from where you started is not the same as the probability that you have ever crossed the street, because you might have crossed back.)

Certainly, I don't remember hearing it in high school biology, and it's not mentioned in Time, Love, Memory: A Great Biologist and His Quest for the Origins of Behavior, which is the book I'm reading right now. It's nominally a biography of Seymour Benzer (who is still an active researcher) but is also something of a history of molecular biology.

Anyway, two genes are said to be one centimorgan apart if the probability of a crossover occurring between them is 0.01 -- or if the probability of an odd number of crossovers occurring between them is 0.01 -- or if the average number of crossovers between them is 0.01 -- I can't determine which. From what I can gather, molecular biologists seem to think of centimorgans as additive, which seems to require the third definition. (It looks like sometimes they use the other definitions and use something called a mapping function to correct for this, but I'm not entirely sure I'm reading this correctly.)

Now, a first guess would be that crossovers occur basically at random over the entire chromosome, and are a Poisson process. For the sake of simplicity assume that crossovers form a Poisson process with rate 1 -- that is, in a piece of the chromosome of length λ, the number of crossovers has a Poisson distribution with mean λ, and non-overlapping pieces have independent numbers of crossovers. What is the probability of an odd number of crossovers occurring in a segment of length λ? Let X be a Poisson(λ) random variable; then it's f(λ) = P(X = 1) + P(X = 3) + P(X = 5) + ... The logical question to ask is: is this an increasing function of λ? That is, as we consider points further and further apart on the chromosome, does the linkage between them actually become less strong? You could imagine that the function might not be increasing. For example, say that after one crossover, the next crossover always occurred between 9 and 11 space-units down the line. Then two genes between 11 and 18 units apart would always end up on opposite chromosomes, and two genes between 22 and 27 units apart would always end up on the same chromosome, and in general you'd have some sort of oscillatory behavior.

Under the Poisson assumption, though, the answer is yes. In fact, we have
P(X = 1) + P(X = 3) + P(X = 5) + ...
= λ e^(-λ) + (λ^3 e^(-λ))/3! + (λ^5 e^(-λ))/5! + ...
= e^(-λ) (λ + λ^3/3! + λ^5/5! + ...)
= e^(-λ) sinh λ
= (1 - e^(-2λ))/2
which is known as Haldane's mapping function. It's hard to find a clear derivation of this online, because most of what's available online is course notes that are intended for people who will be using this in their work and don't particularly need to know the derivation.
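
If you don't trust the series manipulation, here's a small Python check that the odd-term Poisson sum really does equal (1 - e^(-2λ))/2:

```python
import math

def p_odd_crossovers(lam, terms=100):
    """P(X is odd) for X ~ Poisson(lam), summed term by term."""
    return sum(math.exp(-lam) * lam**k / math.factorial(k)
               for k in range(1, terms, 2))

def haldane(lam):
    """Haldane's mapping function: (1 - exp(-2*lam)) / 2."""
    return (1 - math.exp(-2 * lam)) / 2

for lam in (0.01, 0.1, 0.5, 1.0, 3.0):
    print(lam, p_odd_crossovers(lam), haldane(lam))
# The two columns agree, and both increase toward 1/2 as lam grows,
# so linkage does weaken with distance under this model.
```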

What this tells us is that two genes which are separated by λ "units of space" will recombine with frequency (1 - e^(-2λ))/2. Note that if λ is small, this is only very slightly smaller than λ, since cases where there is more than one crossover in the space between the genes are vanishingly rare. But it also tells us that if two genes A and B recombine with frequency p, then they are not p of these "natural units" apart; rather, they are a distance λ apart with (1 - e^(-2λ))/2 = p, so λ = -log(1-2p)/2. So, for example, if two genes A and B recombine with frequency .20, the average number of crossovers between them is not .20, but -log(.6)/2 = .255. And if another two genes B and C recombine with that frequency, and they are arranged on the chromosome in the order A, B, C, then the distance between A and C is -log(.6), and the recombination frequency is (1 - e^(2 log .6))/2 = (1 - .36)/2 = .32, not .40.

In general, if A and B recombine with frequency p, and B and C recombine with frequency q, then A and C recombine with frequency p + q - 2pq. This can be derived from the Haldane mapping function, but the following argument is nicer. In order for A and C to recombine, exactly one of the pairs (A, B) and (B, C) must recombine. With probability p(1-q), A and B recombine while B and C don't; with probability q(1-p) the reverse happens; adding these gives p + q - 2pq. Again, this formula seems to recur without justification in the notes that I can find online.
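
The same relationships in code (just the formulas above, with p = q = 0.20 as in the example):

```python
import math

def haldane(d):
    """Recombination frequency at map distance d: (1 - e^(-2d)) / 2."""
    return (1 - math.exp(-2 * d)) / 2

def inverse_haldane(p):
    """Map distance giving recombination frequency p: -log(1 - 2p) / 2."""
    return -math.log(1 - 2 * p) / 2

p = q = 0.20
d_ab = inverse_haldane(p)            # ~0.255 "natural units", not 0.20
d_ac = d_ab + inverse_haldane(q)     # distances (average crossover counts) add
print(haldane(d_ac))                 # ~0.32, the A-C recombination frequency
print(p + q - 2 * p * q)             # also 0.32, from the direct argument
```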

If you know anything about special relativity, this sort of reminds me of how rapidities add (while velocities don't). The rapidity of a particle with velocity v is tanh^(-1)(v/c), which is approximately v/c (or v, if you like natural units); relative rapidities are additive, while relative velocities are only approximately additive, and then only for small velocities. Something similar is going on in the genetic situation, where the usual measure of "distance" is only additive for things that are close together and a correction has to be used when they get far apart.
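
Just to make the analogy concrete, here's the relativistic version in the same style (velocities in units of c; the particular numbers are arbitrary):

```python
import math

u, v = 0.6, 0.6
combined_velocity = (u + v) / (1 + u * v)               # 0.882..., not 1.2
combined_rapidity = math.atanh(u) + math.atanh(v)       # rapidities just add
print(combined_velocity, math.tanh(combined_rapidity))  # same number either way
```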

And how would this be different if chromosomes came in, say, triplets instead of pairs? Maybe I should be a mad scientist in my next life. Then I could find out. (Or I could just do the calculation now, but I've got better things to do.)

(I suspect there are places here where I'm not using correct biological terminology; here I follow in the footsteps of Feynman, who when he learned about zoology once went to a library and asked them for a "map of the cat".)

09 August 2007

80-degree lows in Philadelphia

I find myself interested in the weather, both because my main means of transportation is walking, and because it's a complex system that enough other people are interested in that they try to predict it and which is a source of large data sets. (Surprisingly, I have no weather-forecasting ability. I figure I could learn how if I wanted to, but I'm not that curious, because I'd just be discouraged when I realized the professionals are better than me at it.) It hit 97 here in Philadelphia yesterday; more crazy-sounding is that Wednesday morning's low was 80 degrees. This has only happened 40 times in recorded Philadelphia weather history, i.e., since 1876.

I find myself wondering how the number of such days is distributed, but it's hard to come to any sort of conclusion. My instinct, though, is something like this: the number of clusters of 80-degree lows (that is, consecutive days with them) in a given summer is probably Poisson-distributed.

Of course, counting these "clusters" is a silly thing to do, because it seems to imply that, say, the lows on August 16, 2002 and August 18, 2002 (both 80 degrees) are independent events, just because it didn't get up to eighty on August 17. (The low that day was 77.) If you look at the data, there have been 109 summers with no 80-degree lows, 20 summers with one cluster of them, one summer with two clusters, and two summers with three clusters; that last one makes me suspicious, but one of those is 2002 (July 30, August 16, August 18) and one is 1995 (July 15, 26, 29).

Second, how long is each individual cluster? There are 19 clusters of length 1, 7 of length 2, and 2 of length 3; I'm inclined to suggest a geometric distribution. Once there's an 80-degree low, the probability of having an 80-degree low on the next day is a constant, approximately 1/4. It seems reasonable that this probability would be less than 1/2 but not too much less; an 80-degree low in Philadelphia is very unusual, so the next day is expected to be a bit cooler. I've heard that the simplest weather-forecasting rule is that "tomorrow will be like today". I suspect that a slightly better rule is "tomorrow will be like today, but a little less so" -- that is, it will regress to the mean a bit. But to apply this rule one has to have some idea what the mean is. On a day with a high of 70 in Philadelphia in July I'd want this rule to predict the next day would be warmer; on the same day in November I'd want it to predict the next day would be colder.

So the number of 80-degree days in a given summer, I expect, is the sum of some number of geometrically distributed variables with p = 3/4, the number of such variables being given by a Poisson distribution with mean about .21 (there were 28 observed clusters in 132 years). Determining what this says in terms of actual probabilities is left as an exercise for the reader, mostly because I don't trust these numbers enough. It seems like a reasonable first stab at a model, though. But it only has any chance of working for extreme temperatures; if I replaced 80 with 70 (the average low in Philly this time of year) then the model wouldn't work so well, because it depends on this clustering phenomenon. (The highest recorded low temperature ever in Philadelphia is 82, so you can see that 80 is extreme.)
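
Here's a simulation of that first stab at a model, with the rough parameters estimated above (clusters per summer Poisson with mean 28/132, cluster lengths geometric with continuation probability 1/4); don't read too much into it:

```python
import math
import random

def sample_poisson(lam):
    """Sample a Poisson(lam) variate (Knuth's method; fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1

def days_with_80_degree_lows(cluster_rate=28/132, p_continue=0.25):
    """Total 80-degree-low days in one simulated summer under the model above."""
    days = 0
    for _ in range(sample_poisson(cluster_rate)):
        length = 1
        while random.random() < p_continue:   # geometric cluster length, mean 4/3
            length += 1
        days += length
    return days

random.seed(0)
trials = 100_000
counts = [days_with_80_degree_lows() for _ in range(trials)]
print("P(no 80-degree lows all summer) ~", counts.count(0) / trials)
# should be near exp(-28/132) ~ 0.81; the observed fraction is 109/132 ~ 0.83
```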

While I'm on the subject of weather, check out weatherbonk.com, which displays the weather, live, on a map, showing the temperature at various volunteer-run weather stations. I'm never sure how much to trust any of these stations, though, because you hear that sometimes they're in an asphalt-paved parking lot next to the place where the heat comes out for a central AC unit. I suspect it would be more interesting in a place like San Francisco where the microclimates are pronounced enough that you can see them even over that noise.

09 July 2007

lucky babies

Stephen at Freakonomics asks:

What I want to hear about is the 7 lbs.-7 oz. kid who was born at 7:07 a.m. on 7/7/07. Any leads? She will probably grow up to be a poker champ.

I wonder if there are any. You'd expect 11,000/1440 = 7.5 babies to be born in that minute, and 7.5 in the corresponding p.m. minute, for a total of fifteen. About two percent of babies weigh between 7 lbs, 6.5 oz and 7 lbs, 7.5 oz at birth; I'm getting this from this analysis of Norwegian birth weights, which is all I could find quickly. (Seven percent of babies weigh between 3350 grams and 3450 grams at birth; one ounce is very nearly two-sevenths of one hundred grams.) So the expected number of 7 lb, 7 oz. babies born at 7:07 (a.m. or p.m., local time) on 7/7/07 is about two percent of fifteen, or 0.3. I'd guess that the number of babies born during any minute is Poisson-distributed; the probability there are no such babies is thus e^(-0.3), or about 74%.

Then again, as has been pointed out, if you had a seven-pound, six-ounce baby born at 7:08, maybe you'd make everything be 7s just for the hell of it. Hospital staff aren't above this sort of thing. I was born prematurely and weighed five pounds, eight ounces at birth, but lost weight once I was born; the hospital had a rule that babies under five pounds couldn't leave. On Christmas Eve I weighed a bit under five pounds; apparently the records show five pounds exactly, because the nurse felt that I should be home for Christmas. (Given who gets stuck with working at a hospital on Christmas Day, this might have even made sense from a medical point of view.)

Of course, the U.S. isn't the entire world. We're about one-twentieth of the world's population; very crudely, just multiplying that 0.3 by twenty gives six. (This of course doesn't take into account the different birth rates or distributions of birth weights in different countries.) Then the probability that there are no such babies is about e^(-6), or one in 400 (by the way, knowing e^3 is very nearly 20 is useful), and even without rounding there probably do exist such supremely "lucky" babies.
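
The whole calculation fits in a few lines (the 0.3 and the factor of twenty are the crude estimates above):

```python
import math

lam_us = 0.3                 # expected 7-7-7, 7 lb 7 oz babies in the U.S.
print(math.exp(-lam_us))     # ~0.74: chance of no such baby in the U.S.

lam_world = lam_us * 20      # very crude: the U.S. is ~1/20 of world population
print(math.exp(-lam_world))  # ~0.0025, about 1 in 400
```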

22 June 2007

Six murders in one day in Philadelphia.

There were six homicides in Philadelphia yesterday. The headline in the Philadelphia Inquirer is "Summer's beginning: Six dead in one day". The events happened as follows:

  • a triple homicide in North Philadelphia;

  • a triple shooting in Kensington -- two died, one was critically wounded;

  • one man shot to death in Kingsessing.


I saw the headline while walking past a newspaper box well before I read the article. I thought "hmm, six murders in one day, is that a lot?" Last year Philadelphia had 406 murders; this year there have been 195 so far, as compared to 177 up until this time last year. The number I carry around in my head is that Philadelphia has one murder a day, although the actual 2006 figure was about 1.11 murders per day.

Since I didn't know that there had only been three incidents, I assumed that the six murders had all been separate. Furthermore, I assumed that murders are committed independently, since the murderers aren't aware of each other's actions. This second assumption seems believable to me. I've heard that, say, school shootings inspire copycats, mostly because they create a media circus around them -- at the time of the Virginia Tech massacres I remember people saying that the media shouldn't cover the shootings so much because they might "give people ideas", and I vaguely recall similar sentiments around the time of Columbine. But a single murder, in a city where the average day sees one murder, doesn't draw much attention.

If the murders are independent, then I figure I can model the random variable "number of murders per day" with a Poisson distribution. The rate of the distribution would be the average number of murders per day, which is 1.11; thus the probability of having n murders in a day should be e^(-1.11) (1.11)^n / n!. This leads to the numbers:



n:                              0        1        2        3        4        5        6         7+
Prob. of n murders in one day:  0.3296   0.3658   0.2030   0.0751   0.0209   0.0046   0.00086   0.00016
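
These probabilities are easy to recompute (a sketch, with the 2006 rate of 406 murders in 365 days):

```python
import math

rate = 406 / 365    # ~1.11 murders per day
probs = [math.exp(-rate) * rate**n / math.factorial(n) for n in range(7)]
probs.append(1 - sum(probs))                  # lump 7+ together
for n, p in enumerate(probs):
    print(f"P({n if n < 7 else '7+'} murders) = {p:.5f}")
print("P(6 or more) =", probs[6] + probs[7])  # about 0.001, one day in a thousand
```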

So six or more murders should happen in a day about one day in a thousand, or once in almost three years. That seems like an argument for newsworthiness. But on the other hand, let's say there's some lesser crime -- crime X -- that is committed in Philadelphia with such frequency that crime X does not occur on only one day in a thousand. (Such a crime would be something that happens 2516 times per year, or 6.9 times a day.) I don't see that being front-page news. Lots of one-in-a-thousand things happen every day.

Of course, what actually occurred yesterday was not six independent murders. It sounds like there were only three murderers. So it's time for new assumptions. Let's now assume that all murderers act independently, but that two in five of them kill one person, two in five kill two people, and one in five kill three people. This means the average murderer kills 1.8 people. Further, let's say that murderers go out and kill people as a Poisson process with rate 0.62 -- that's the old rate divided by 1.8, so there are still the same number of murders.

(The assumptions of how many people a murderer murders are made up, I admit, but the only list of murders I can find are the Inquirer's interactive maps, and it doesn't seem worth the time to harvest the data I'd need from them.)

Now, for example, the probability that three people are murdered on any given day is the sum of the probabilities that there's one triple homicide, one double and one single, or three singles. Running through the computation, I get:



n:                              0        1        2        3        4        5        6        7+
Prob. of n murders in one day:  0.5379   0.1334   0.1499   0.1012   0.0372   0.0230   0.0103   0.0071
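
Here's the computation behind that table, a compound-Poisson convolution using the made-up victims-per-incident distribution above:

```python
import math

rate = 0.62                         # murder incidents per day
victims = {1: 0.4, 2: 0.4, 3: 0.2}  # victims per incident (the assumed 2:2:1 mix)

max_n = 12
pmf = [0.0] * (max_n + 1)           # pmf[n] = P(n victims in a day)
for k in range(max_n + 1):
    p_k = math.exp(-rate) * rate**k / math.factorial(k)   # P(k incidents)
    dist = {0: 1.0}                 # distribution of total victims given k incidents
    for _ in range(k):
        new = {}
        for total, p in dist.items():
            for v, pv in victims.items():
                new[total + v] = new.get(total + v, 0.0) + p * pv
        dist = new
    for total, p in dist.items():
        if total <= max_n:
            pmf[total] += p_k * p

print([round(p, 4) for p in pmf[:7]])      # 0.5379, 0.1334, 0.1499, 0.1012, ...
print("P(6 or more) ~", 1 - sum(pmf[:6]))  # about 0.0174, just over six days a year
```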

The probability of one or two murders in a day goes down; the probability of zero, or of three or more, goes up. Suddenly yesterday isn't nearly as rare. Days with six or more murders are, under these assumptions, 1.74% of all days -- just over six per year.

The calculation I'm afraid to do -- if I even could do it -- is "how likely am I to get murdered each time I go outside?" Fortunately I live in a decent neighborhood; but some neighborhoods not that far away from me have had some of the worst violence. But it occurred to me that at 400 murders a year, if you live in Philadelphia for 75 years there will be thirty thousand murders in that time span. Philly has about 1.5 million people. So if things stay like they are, the average Philadelphian has a one in fifty chance of dying by murder. In comparison, the nationwide murder rate in 2005 was 5.6 per 100,000; multiplying by an average lifespan of 75 years we get 420 murders per 100,000 people. So one in every two hundred and forty Americans will die of murder, if things stay like they are.
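
The closing arithmetic, for what it's worth (these are the same rough figures as above, not a real risk model):

```python
philly_murders_per_year = 400
philly_population = 1_500_000
lifespan = 75

lifetime_philly = philly_murders_per_year * lifespan / philly_population
print(f"Philadelphia: about 1 in {1 / lifetime_philly:.0f}")        # ~1 in 50

us_rate = 5.6 / 100_000          # 2005 national murder rate, per person per year
print(f"U.S. overall: about 1 in {1 / (us_rate * lifespan):.0f}")   # ~1 in 238
```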