
15 December 2011

Solution to distance between random points from a sphere

So I asked on Sunday the following question: pick two points on a unit sphere uniformly at random. What is the expected distance between them?

Without loss of generality we can fix one of the points to be (1, 0, 0). The other will be chosen uniformly at random and will be (X, Y, Z). The distance between the two points is therefore

√((1-X)² + Y² + Z²)

which does not look all that pleasant. But the point is on the sphere! So X² + Y² + Z² = 1, and this can be rewritten as

√((1-X)² + 1 - X²)

or after some simplification

√(2-2X).

But by a theorem of Archimedes (Wolfram Alpha calls it Archimedes' Hat-Box Theorem but I don't know if this name is standard), X is uniformly distributed on (-1, 1). Let U = 2-2X; U is uniformly distributed on (0, 4). The expectation of √(U) is therefore

∫₀⁴ (1/4) u^(1/2) du

and integrating gives 4^(3/2)/6 = 8/6 = 4/3.

(The commenter "inverno" got this.)

Of course it's not hard to simulate this in, say, R, if you know that the joint distribution of three independent standard normals is spherically symmetric, and so one way to simulate a random point on a sphere is to take a vector of three standard normals and normalize it to have unit length. This code does that:

xx1=rnorm(10^6,0,1); yy1=rnorm(10^6,0,1); zz1=rnorm(10^6,0,1)
d1=sqrt(xx1^2+yy1^2+zz1^2)
x1=xx1/d1; y1=yy1/d1; z1=zz1/d1
xx2=rnorm(10^6,0,1); yy2=rnorm(10^6,0,1); zz2=rnorm(10^6,0,1)
d2=sqrt(xx2^2+yy2^2+zz2^2)
x2=xx2/d2; y2=yy2/d2; z2=zz2/d2
d=sqrt((x1-x2)^2+(y1-y2)^2+(z1-z2)^2)

and then mean(d), the average of the simulated distances, is 1.333659; the histogram of the distances d is a right triangle. (The code doesn't make the assumption that one point is (1, 0, 0); that's a massive simplification if you want to do the problem analytically, but not nearly as important in simulation.)

11 December 2011

A geometric probability problem

Here's a cute problem (from Robert M. Young, Excursions in Calculus, p. 244): "What is the average straight line distance between two points on a sphere of radius 1?"

(Answer to follow.)

If any of my students are reading this: no, this should not be interpreted as a hint to what will be on the final exam.

16 November 2011

In which I declare four things which my probability class is not about

In class today, I said approximately this:

So people decide whether to have children by flipping a coin, and if it comes up tails they have a kid, and if it comes up heads they don't. They repeat this until it comes up heads. This is probably not a good model of how people decide whether or not to have children, but maybe it's good in the aggregate. And anyway this isn't a class about how people decide whether to have kids.

Then there are two kinds of children, girls and boys -- well, not always, but this isn't a class about that -- and each child is equally likely to be a boy or a girl -- well, wait, that's not exactly true, but it's not a horrible assumption about how reproduction works on a cellular level, but this isn't a class about that either.

And people's decisions to stop having kids are independent of the sexes of the children they've had -- which says this isn't China, because people do interesting things under the one-child policy -- but this isn't a class about that.

(Then I actually did some math -- namely, assume that the number of children a random family has is geometrically distributed with some parameter p, and assume that all children are equally likely to be male or female and that their genders are independent of the gender of any other children or the number of children in the family. Pick a random family with no boys. What is the distribution of the number of children they have?)
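The classroom answer is easy to sanity-check by simulation. Here's a Python sketch (the parameter value p = 1/2 and the trial count are my choices); under the model, conditioning on a family having no boys should leave a geometric distribution with success parameter 1 - q/2, where q = 1 - p:

```python
import random

def no_boy_family_dist(p=0.5, trials=200_000, seed=42):
    """Simulate family sizes, keep only the all-girl families,
    and return the empirical distribution of their sizes."""
    random.seed(seed)
    counts = {}
    kept = 0
    for _ in range(trials):
        n = 0
        while random.random() >= p:   # flip until "heads" (probability p)
            n += 1                    # each "tails" is another child
        if all(random.random() < 0.5 for _ in range(n)):  # no boys
            counts[n] = counts.get(n, 0) + 1
            kept += 1
    return {n: c / kept for n, c in counts.items()}

dist = no_boy_family_dist()
```

With p = q = 1/2 the conditional distribution should be geometric with parameter 1 - q/2 = 3/4, so the simulated probabilities should start near .75, .1875, .047, and so on.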

07 May 2011

Two no-hitters four days apart is not that rare

Justin Verlander just threw a no-hitter for the Detroit Tigers. On May 3rd, Francisco Liriano threw one for the Minnesota Twins.

There have only been 271 no-hitters in one hundred years of Major League Baseball, so two separated by four days seems unusual.

But two no-hitters within four days of each other has happened several times before. From Wikipedia, there have been two no-hitters within four days of each other on the following dates:

August 19 and 20, 1880
September 19 and 20, 1882
two on April 22, 1898
September 18 and 20, 1908
August 26 and 30, 1916
May 2, 5, and 6, 1917
September 4 and 7, 1923
June 11 and 15, 1938
June 26 and 30, 1962
September 17 and 18, 1968
September 26 and 29, 1983
June 1 and 2, 1990
two on June 29, 1990
September 4 and 8, 1993
May 11 and 14, 1996

Is this list surprisingly long? If you assume that baseball has been played 180 days a year for 130 years, then that's 23,400 days on which baseball has been played. There have been 271 no-hitters, so on an average baseball-playing day there are 0.01158 no-hitters. After any given no-hitter there's a four-day window in which the list I gave above could be added to. So you'd expect (271)(0.01158)(4) = 12.5 pairs in that list. There are 17 pairs on the list. (I'm counting the 1917 triplet as three pairs. I'm not counting today's no-hitter.) So there doesn't seem to be particularly strong evidence for no-hitters somehow causing more no-hitters in their wake. (Although my model of the baseball schedule is, I admit, ridiculously crude. In particular I have ignored the fact that the number of teams isn't constant.)
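The back-of-the-envelope count is a few lines of Python (same crude 180-days-a-year, 130-year schedule as above):

```python
playing_days = 180 * 130        # crude model: 23,400 days of baseball
no_hitters = 271
rate = no_hitters / playing_days        # no-hitters per playing day
expected_pairs = no_hitters * rate * 4  # 4-day window after each no-hitter
```

This gives about 12.5 expected pairs, against the 17 observed.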

02 April 2011

A street-fighting approach to the variance of a hypergeometric random variable

So you all¹ know that if I have a biased coin with probability p of coming up heads, and I flip it n times, then the expected number of heads is np and the variance is npq, where q = 1-p. That's the binomial distribution. Alternatively, if I have an urn containing pN white balls and qN black balls, and I draw n balls with replacement, then the distribution of the number of white balls has that mean and variance.

Some of you know that if I sample without replacement from that same urn -- that is, if I take balls out and don't put them back -- then the expected number of white balls is np and the variance is npq(N-n)/(N-1). The distribution of the number of white balls is the hypergeometric distribution.

So it makes sense, I think, to think of (N-n)/(N-1) as a "correction factor" for going from sampling with replacement to sampling without replacement. This is the approach taken in Freedman, Pisani, and Purves, for example, which is the book I'm teaching intro stats from this semester.

How do you prove this? On this, FPP are silent. The proof I know -- see, for example, Pitman -- is as follows. Write the number of white balls, when sampling without replacement, as

Sₙ = I₁ + ... + Iₙ

where Iₖ is 1 if the kth draw gives a white ball and 0 otherwise. Then E(Iₖ) is just the probability of getting a white ball on the kth draw, and so it's equal to p by symmetry. By linearity of expectation E(Sₙ) = np. To get the variance, it's enough to get E(Sₙ²). And by expanding out that sum of indicators there, you get

Sₙ² = (I₁² + ... + Iₙ²) + (I₁I₂ + I₁I₃ + ... + Iₙ₋₁Iₙ).

There are n terms inside the first set of parentheses, and n(n-1) inside the second set, which includes every pair IⱼIₖ where j and k aren't equal. By linearity of expectation and symmetry,

E(Sₙ²) = nE(I₁) + n(n-1)E(I₁I₂).

The first term, we already know, is np. The second term is n(n-1) times the probability that both the first and second draws yield white balls. The first draw yields a white ball with probability p. For the second draw there are N-1 balls left, of which pN-1 are white, so that draw yields a white ball with probability (pN-1)/(N-1). The probability is the product of these. Do the algebra, let the dust settle, and you get the formula I claimed.

But this doesn't explain things in terms of the correction factor. It doesn't refer back to the binomial distribution at all! But in the limit where your sample is small compared to your population, sampling without replacement and sampling with replacement are the same! So can we use this somehow? Let's try to guess the correction factor without writing down any random variables. We'll write

Variance without replacement = f(N,n) npq

where n is the sample size and N is the population size, and think about what we know about f(N,n).

First, f(N,1) = 1. If you have a sample of size 1, sampling with and without replacement are actually the same thing.

Second, f(N,N) = 0. If your sample is the entire population, you always get the same result.

But most important is that if we sample without replacement, and take samples of size n or of size N-n, we should get the same variance! Taking a sample of size N-n is the same as taking a sample of size n and deciding to take all the other balls instead. So for each sample of size n with w white balls, there's a corresponding sample of size N-n with pN-w white balls. The distributions of numbers of white balls are mirror images of each other, so they have the same variance. So you get

nf(N,n)pq = (N-n)f(N, N-n)pq.

Of course the pq factors cancel. For ease of notation, let g(x) = f(N,x). Then we need to find some function g such that g(1) = 1, g(N)=0, and ng(n) = (N-n)g(N-n). Letting n = 1 you get g(1) = (N-1)g(N-1), so g(N-1) = 1/(N-1). The three values of g that we have so far are consistent with the guess that g is linear. So let's assume it is -- why should it be anything more complicated? And that gives you the formula. This strikes me as the Street-Fighting Mathematics approach to this problem.
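The guess can at least be checked numerically. Here's a Python sketch that computes the exact variance of the without-replacement count by enumeration and compares it with npq(N-n)/(N-1); the urn sizes in the checks are arbitrary values of my choosing:

```python
from math import comb

def exact_hypergeom_var(N, K, n):
    # exact variance of the number of white balls when drawing n
    # without replacement from an urn with K white and N - K black
    lo, hi = max(0, n - (N - K)), min(n, K)
    probs = {w: comb(K, w) * comb(N - K, n - w) / comb(N, n)
             for w in range(lo, hi + 1)}
    mean = sum(w * pr for w, pr in probs.items())
    return sum((w - mean) ** 2 * pr for w, pr in probs.items())

def claimed_var(N, K, n):
    # the formula npq(N-n)/(N-1) with p = K/N
    p = K / N
    return n * p * (1 - p) * (N - n) / (N - 1)
```

For instance, exact_hypergeom_var(20, 8, 5) and claimed_var(20, 8, 5) agree, and sampling the whole urn (n = N) gives variance 0, as the f(N,N) = 0 condition demands.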

Question: Is there a way to rigorize this "guess" -- some functional equation I'm not seeing, for example?

1. I use "all" in the mathematician's sense. This means I wish you knew this, or I think you should know it. Some of you probably don't. That's okay.

25 March 2011

Vallentin's probability cheat sheet

John Allen Paulos pointed me to Matthias Vallentin's probability and statistics cheat sheet. It's a big "sheet" -- twenty-seven pages -- but maybe you have a big blank wall to put it on.

(To my students, if you read this: remember that you only get one page of notes on the midterm, and you have to write it yourself.)

22 March 2011

Are food-borne pathogen survival times really exponentially distributed?

From Scientific American, an excerpt from Modernist Cuisine: The Art and Science of Cooking on the complex origins of food safety rules.

This is the six-volume, six-hundred-dollar magnum opus of Nathan Myhrvold (former chief technology officer at Microsoft) and chefs Chris Young and Maxime Bilet; you can read more about it at the Wall Street Journal.

In particular I noticed the following:
If a 1D reduction requires 18 minutes at 54.4 degrees C / 130 degrees F, then a 5D reduction would take five times as long, or 90 minutes, and a 6.5D reduction would take 6.5 times as long, or 117 minutes.

An "nD" reduction is one that kills all but 10⁻ⁿ of the foodborne pathogens.

What struck me here is that the distribution of the pathogen lifetimes, assuming these numbers are actually correct, is exponential. And, therefore, memoryless -- if you're a bacterium under these conditions, your chances of dying in the first eighteen minutes are ninety percent, and if you're still alive at ninety minutes, your chances of dying in the next eighteen minutes are still ninety percent. This surprised me. The decay of radioactive atoms can be described in this way -- but are bacteria really so simple?

The excerpt as a whole is quite interesting -- apparently a lot more than just science is going into recommendations of how long food should be cooked.

(Myhrvold has a bachelor's degree in math and a PhD in mathematical economics, among other degrees; Young has a bachelor's degree in math and was working on a doctoral degree before he left for the culinary world. So perhaps it is fair of me to think that they would get this right.)

23 June 2010

The ridiculously long match at Wimbledon

As you may have heard, there's a match at Wimbledon, between John Isner and Nicolas Mahut, in which the last set is tied at 59 games. (The previous longest set at Wimbledon was 24-22.)

A set goes until one player has won six games, and has also won two more than the opponent. This means that ever since the set was tied at 5 games each, the players have split consecutive pairs of games: games 11 and 12 were split by the two players; so were 13 and 14; and so on up to 117 and 118.

Terence Tao points out that this is very unlikely using a reasonable naive model of tennis, which assumes that the player serving has a fixed probability of winning the game. (Service alternates between games.) His guess is that some other factor is at play; for example, "both players may perform markedly better when they are behind".
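Tao's point can be made concrete with a toy calculation. Under the naive model, a pair of consecutive games (one serve for each player) is split exactly when both servers hold or both are broken, and by 59-59 the players had split 54 consecutive pairs (games 11 through 118). The service-hold probability below is an illustrative guess of mine, not a number from Tao's post:

```python
p_hold = 0.9                  # assumed probability the server wins a game
q = 1 - p_hold
p_split = p_hold**2 + q**2    # both hold serve, or both get broken
p_54_splits = p_split**54     # 54 consecutive split pairs of games
```

With these numbers the probability comes out on the order of 10⁻⁵, which is the sense in which the match is "very unlikely" under the naive model.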

This seems statistically checkable, at least if records of that sort of thing are kept. I'm not sure if they are; it seems like tennis scores are often reported as just the number of games won by each player in each set, not their order. Another hypothesis, of course, is that the match has taken on a life of its own and, subconsciously, the players are playing to keep the pattern going.

Edit (Thurs. 7:49 am): More on Isner-Mahut: Tim Gowers' comments, and some odds being offered by William Hill, the betting shop.

17 May 2010

Innumeracy and the NBA draft lottery

I don't really know much about basketball. But this New York Times article suggests that the first pick in the NBA lottery might not be worth much this year, and then goes on to say:

But history suggests that he [Rod Thorn, president of the New Jersey Nets] will not have that decision to make. Since 1994, the team with the worst record has won the lottery only once — Orlando in 2004.


Here's how the NBA draft lottery works. In short: there are thirty teams in the NBA. Sixteen make the playoffs. The other fourteen are entered in the draft lottery. Fourteen ping-pong balls (it's a coincidence that the numbers are the same) are placed in a tumbler. There are 1001 ways to pick four balls from fourteen. Of these, 1000 are assigned to the various teams; the worse teams are assigned more combinations. 250 are assigned to the worst team, 199 to the second-worst team, "and so on". (It's not clear to me where the numbers come from.)

Then four balls are picked. The team that this set corresponds to gets the first pick in the draft. Those balls are replaced; another set is picked, and this team (assuming it's not the team already picked) gets the second pick. This process is repeated to determine the team with the third pick. At this point there's an arbitrary cutoff; the 4th through 14th picks are assigned to the eleven unassigned teams, from worst to best. The reason for this method seems to be so that all the lottery teams have some chance of getting one of the first three picks, but no team does much worse than would be expected from its record; if the worst team got the 14th pick they wouldn't be happy.

So the probability that the team with the worst record wins the lottery is one in four, by construction; this "history suggests" is meaningless. (And the article even mentions the 25 percent probability!) This isn't like situations within the game itself where the probabilities can't be derived from first principles and have to be worked out from observation.

Also, let's say we continued iterating this process to pick the order of all the lottery teams. How would one expect the order of draft picks to compare to the order of finish in the league? I don't know off the top of my head.

01 May 2010

The probability that 901 coins have total value $100

Here's a cute little problem from Reddit: Tough question for you guys. Let's say you have 901 coins that come out to exactly $100. What are the odds? (Also here.)

Everyone there who gets a solution is assuming that all the possible coins are equally likely, which isn't a reasonable assumption. Years ago I looked at the density of money, where I used a model in which I get back from each transaction n cents with probability 0.01, for n = 0, 1, ... 99; furthermore I always get back the smallest possible number of coins. The only coins allowed are pennies, nickels, dimes, and quarters (worth 1, 5, 10, and 25 cents respectively).

As I calculated before, if I make 100 transactions, and I get each number of cents back exactly once, I'll get 200 pennies, 40 nickels, 80 dimes, and 150 quarters. This is a total of 470 coins, and worth $49.50. Thus the "average coin" is worth 495/47 = 10.53 cents; 901 coins are "on average" worth $94.89. The value $100 isn't that unreasonable.

So consider a jar with 901 coins, which are independent; they each have probability 20/47 of being a penny, 4/47 of being a nickel, 8/47 of being a dime, and 15/47 of being a quarter. The mean value of a coin is 495/47 = 10.53 cents; the variance is 238840/2209 = 108.12 "square cents".

The mean value of 901 coins, then, is 9489 cents; the variance is 97417 "square cents", so the standard deviation is 312 cents. (Everything here is rounded to the nearest integer.)

Invoking the central limit theorem, then, we say that the value of 901 randomly chosen coins is normally distributed with this mean and standard deviation. The probability of having value exactly 10,000 cents is approximated by the probability density function of this variable at 10,000; that's 0.000335, or about 1 in 3000.

An exact answer is feasible -- but not worth computing, I'd say, because the error in the central limit theorem is surely much smaller than the error from the fact that this isn't a realistic model of what actually ends up in your change jar.
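For what it's worth, the whole calculation fits in a few lines of Python (a sketch under the coin mix assumed above; the variable names are mine):

```python
from math import exp, pi, sqrt

values = [1, 5, 10, 25]                 # penny, nickel, dime, quarter
probs = [20/47, 4/47, 8/47, 15/47]      # the model's coin mix

mean = sum(v * p for v, p in zip(values, probs))               # 495/47 cents
var = sum(v * v * p for v, p in zip(values, probs)) - mean**2  # per-coin variance

n = 901
total_mean, total_sd = n * mean, sqrt(n * var)

# CLT: P(total value = 10000 cents) is approximated by the normal density there
z = (10000 - total_mean) / total_sd
density = exp(-z * z / 2) / (total_sd * sqrt(2 * pi))
```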

29 April 2010

A simple model for baseball

From the April Notices of the AMS, John D'Angelo writes Baseball and Markov Chains: Power Hitting and Power Series. Consider the following simple model of baseball. Players only hit singles; three singles score a run. That is, the third and every following player to get a hit in a given inning scores a run. This can be interpreted as saying that, say, all runners score from second on a single or all runners go from first to third on a single -- but not both! -- or that every third hit is actually a double. (And I do mean exactly every third hit, not some random one-third of hits, so this is a bit unnatural.) Then, if each batter gets a hit with probability p, the expected number of runs per half-inning is p³(3p² - 10p + 10)/(1-p). For real baseball the average number of runs per half-inning is around one half, which corresponds to p = 0.361.

D'Angelo gives this as an exercise, but I independently came up with this model a while ago and can't resist sharing the solution. Let q = 1-p. The probability of getting k hits in an inning is pᵏq³ -- that's the probability of getting those hits in a certain order -- times the number of ways in which k hits and 3 outs can be arranged. Since the last batter of an inning must get out, the number of possible arrangements is the number of ways to pick 2 batters out of the first k+2 to get out, which is (k+2)(k+1)/2.

The probability of getting k runs, if k is at least 1, is just the probability of getting k+2 hits, which is pᵏ⁺²q³(k+4)(k+3)/2. Call this f(k); then

f(1) + 2f(2) + 3f(3) + ... = p³(3p² - 10p + 10)/(1-p)

by some annoying algebra. I'm pretty sure I came up with this exact model while procrastinating from some real work a couple years ago; it's probably been independently reinvented many times.

With p = 0.361, the probabilities of scoring 0, 1, 2, 3, 4, 5 runs in an inning are .748, .123, .066, .034, .016, .008 (rounded to three decimal places). (Probabilities of larger numbers of runs can also be calculated; together they have probability around .006.)
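Here's the computation as a Python sketch (the function names are mine; the probabilities follow the hit-count formula above):

```python
p = 0.361          # per-batter probability of a single
q = 1 - p

def p_hits(k):
    # k hits and 3 outs, last batter out: choose 2 of the first k+2 to be out
    return p**k * q**3 * (k + 2) * (k + 1) / 2

def p_runs(r):
    # r >= 1 runs means exactly r+2 hits; 0 runs means 0, 1, or 2 hits
    return sum(p_hits(k) for k in range(3)) if r == 0 else p_hits(r + 2)

run_dist = [p_runs(r) for r in range(6)]          # .748, .123, .066, ...
mean_runs = sum(r * p_runs(r) for r in range(200))  # should be about 0.5
```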

Assuming that each half-inning is independent, the probability G(k) of a team scoring k runs in a game is, for each k,


k      0     1     2     3     4     5
G(k)  .073  .108  .129  .133  .124  .108
k      6     7     8     9    10    11
G(k)  .088  .069  .052  .038  .026  .018
k     12    13    14    15    16    17
G(k)  .012  .008  .005  .003  .002  .001

with probability about 0.0006 of scoring 18 runs or more. (This seems a bit low to me -- three times a season in the major leagues -- but after all this is a very crude model!) But one interesting thing here is that the distribution of the number of runs per game, which is a sum of nine skewed distributions, is still skewed; the mode is 3, and the median 4. Recall that I chose p so that the mean would be 4.5. And the actual distribution is similarly skewed.

Of course a more sophisticated model of baseball is as a Markov chain. There are twenty-five states in this chain -- zero, one or two outs combined with eight possible ways to have runners on base, and three outs. We assume that each hitter hits randomly according to his actual statistics, and the runners move in the "appropriate" way. Of course determining what's appropriate here would be a bit tricky. How do runners move? A runner is probably more likely to take an extra base when a power hitter is hitting, but the sample size for any individual is fairly small. But one could probably predict from some measure of the hitter's power (say, the number of doubles and home runs, combined appropriately) the chances of a runner taking an extra base on a single. Something similar is necessary for sacrifice flies (which have to be deep enough to score the runner), grounding into double plays, etc. I'm not sure if the Markov models that are out there, such as that by Sagarin, do this. Sagarin computes the (offensive) value of a player by determining how many runs per game a team composed of only that player would score.

31 December 2009

A hack I'm disturbingly proud of, and its connection to some real math

I'm applying for jobs. Many jobs, because that's how academic job searches work these days. So I have a spreadsheet (in OpenOffice) to keep track of them.

Among the things that I track for each job, there is a column with 0, 1, or 2 in it. 0 means that I haven't submitted anything; 1 means I've submitted something, but not everything that was asked for; 2 means the application is complete. Averaging these numbers and dividing by 2 tells me what proportion of the search is complete.

But I also wanted to know how many 0s, 1s, and 2s there were. And as far as I know the built-in functions in OpenOffice won't do that.

What they will do, however, is this. I have a column consisting of 0s, 1s, 2s, and empty cells. By doing

COUNT(J8:J1000)
SUM(J8:J1000)
SUMPRODUCT(J8:J1000;J8:J1000)

I get the number of cells in that column which are nonempty; their sum; and the sum of their squares. (The SUMPRODUCT function takes 2 arrays of the same shape and returns the sum of the products of corresponding cells.) "8" is the row that contains the first job on the list, and "1000" is just a number that is comfortably more than the number of jobs I am applying for. Call these a, b, and c respectively. Let n₀, n₁, and n₂ be the number of entries which are 0, 1, and 2 respectively. Then I have

a = n₀ + n₁ + n₂
b = n₁ + 2n₂
c = n₁ + 4n₂

which is a three-by-three linear system, and can be solved for n₀, n₁, n₂, giving

n₀ = a - 3b/2 + c/2, n₁ = 2b - c, n₂ = (c-b)/2

and so I can recover the number of applications with status 0, 1, or 2 from this. From the sums of the 0th, 1st, and 2nd powers I can recover the distribution of the values themselves. (The actual code is slightly different, but of course equivalent, because I solved the system "by inspection" and never actually explicitly wrote it out until just now.)
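The same trick in a few lines of Python, for anyone who'd rather see it outside a spreadsheet (COUNT, SUM, and SUMPRODUCT correspond to a, b, and c; the function name is mine):

```python
def status_counts(column):
    """Recover the number of 0s, 1s, and 2s in a column of 0/1/2
    values from its count, sum, and sum of squares."""
    a = len(column)                   # COUNT
    b = sum(column)                   # SUM
    c = sum(v * v for v in column)    # SUMPRODUCT(col; col)
    return (a - 3 * b / 2 + c / 2,    # n0
            2 * b - c,                # n1
            (c - b) / 2)              # n2

n0, n1, n2 = status_counts([0, 0, 1, 2, 2, 2])
```

For that sample column the result is (2, 1, 3).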

Believe it or not, I actually use this trick in a preprint, "The number of cycles of specified normalized length in permutations", to do some actual mathematics! There I find the expectations of X⁰, X¹, X², ..., Xᵏ where X is a certain random variable known to take on the values 0, 1, ..., k, namely the number of cycles of length in the interval [γn, δn] in a permutation of [n] chosen uniformly at random, where γ and δ are constants. k is the greatest integer less than or equal to 1/γ; for example, if we're looking at cycles of length at least 0.15n in permutations of [n], there can't be more than six of them. This gives a linear system like the one above which gives the probability that X takes on each value 0, 1, ..., k.

18 December 2009

Uniquely identifying people by birth date, gender, and zip code

Netflix Spilled Your Brokeback Mountain Secret, Lawsuit Claims. A woman is suing Netflix because she was in the closet, and her movie-rental data was part of the Netflix prize dataset. She claims this means that people could figure out her secret.

Now Netflix is starting a second contest, and rumor has it that the data will include the zip code, birthdate, and gender of each individual. According to this paper (abstract only, unfortunately, so I can't comment on methods) by Latanya Sweeney, this is enough to uniquely identify 87% of the US population. A paper by Philippe Golle gives the figure as 63%, based on actual Census Bureau data. (The Census gives the number of people with each birth year and gender in each zip code.)

Is it surprising that people can be identified this easily?

From the Golle paper, there are 33,233 "Zip Code Tabulation Areas" in the United States.

US life expectancy is 77.7 years. Since this is a back-of-the-envelope calculation, let's assume that everybody drops dead after 77.7 years (28,379 days), and therefore that the age of a random individual is uniformly distributed over the last 28,000 days. (It pains me to say this, because my grandmother is 85 and still living.)

There are, to a first approximation, two genders.

Therefore there are 28,379 * 33,233 * 2, or about 1.9 billion, possible combinations of birthdate, zip code, and gender. There are about 300 million Americans. If we assume all of these are equally likely (which they're not; some ages are more likely than others, and some zip codes have more people than others), and that they're independent (which they're not, as anybody who's lived in a college town can tell you; Golle notes the college-town effect, and also a military-base effect), then on average the number of people having a given (birthdate, zip code, gender) triplet is about 0.16.

So we'll model the population of the US as 1.9 billion Poisson random variables, each of mean 0.16, and each corresponding to a birthdate-zip code-gender triplet. How many of these do we expect to have value 1 (meaning that that triplet picks out exactly one person)? The probability that a Poisson(0.16) random variable takes the value 1 is exp(-0.16)*(0.16). Thus we find that there are (1.9 billion)*(0.16)*exp(-0.16) people uniquely identified by this triplet, out of (1.9 billion)*(0.16) people.

According to this crude model, the probability that a random individual is uniquely identified by these three pieces of information, then, is exp(-0.16), or about 85%. Why is everybody so surprised?
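The whole back-of-the-envelope, in Python (the constants are the ones quoted above):

```python
from math import exp

cells = 28_379 * 33_233 * 2       # birthdate x ZCTA x gender combinations
population = 300_000_000
lam = population / cells          # mean occupancy of a cell, about 0.16

# a random person shares their cell with k others with Poisson(lam)
# probability at k; they are uniquely identified exactly when k = 0
p_unique = exp(-lam)
```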

21 September 2009

Perfection "squared" on standardized tests

I came across an article about a student who got a perfect score on both the ACT and the SAT. (These are the two standardized tests used for university admissions in the US; generally schools on the coasts use the SAT and schools in the interior of the country use the ACT, although this is a vast generalization. The geographical separation seems to be a function of where the tests originated, in Iowa and New Jersey respectively.)

This article (which I'm not linking to because I found it by googling a student, and the student is probably already not happy that this is all over the Internet) points out that less than 1 percent of students get a perfect score on each of these tests. (As you'll see below, this is quite an understatement.) I think we're supposed to come to the conclusion that less than 1 in 10000 students would get a perfect score on both.

But of course scores on these tests are positively correlated! So the probability of getting a perfect score on both tests is much higher than the product of the probability of getting a perfect score on each. (I don't think knowing that would help you on the SAT. But it's been a while. In my day they were out of 1600.)

This article indicates that 294 of the high school seniors graduating in 2008 got a perfect score on the SAT, and 514 out of 1.4 million got a perfect score on the ACT. Wikipedia puts the number of SAT takers at 1.5 million per year; let's knock this down to 1 million since some people take the test more than once and we're talking about the total number of students. So the probability that a random student who takes both tests gets a perfect score on both is something like (294/1000000)(514/1400000), which is about one in nine million. The number of students taking both tests is less than a million (many people only take one of the two), so assuming independence there should be less than one student per year who gets a perfect score on both tests.
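The arithmetic, for the record (the one-million and 1.4-million pools are the rough test-taker counts above):

```python
p_sat = 294 / 1_000_000       # perfect SAT scores per (rough) test-taker
p_act = 514 / 1_400_000       # perfect ACT scores per test-taker
p_both = p_sat * p_act        # under the (false) independence assumption
odds = 1 / p_both             # roughly nine million to one
```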

But a quick glance at the Google results will convince you that there are a few students per year who pull this off.

03 June 2009

Random Walk: The visualization of randomness

Random Walk: The visualization of randomness, Daniel Becker's diploma thesis, shows fascinating pictures that illustrate various stochastic phenomena: dart-throwing and the Poisson distribution, Benford's law, Monte Carlo methods, some hidden high-order correlations in pseudo-random number generators, and so on.

17 April 2009

The Art of the Probable: Literature and Probability

From MIT's OpenCourseWare: The Art of the Probable: Literature and Probability. The course readings include both some of the classical mathematical writings about probability (Pascal, Fermat, Leibniz, Bernoulli, Bayes, Quetelet, etc.) as well as various more "literary" pieces.

Only at MIT...

(Seriously, though, I would have liked to take this class. And one of the readings from the last week is "the Bohr-Einstein dialogue", which you may know refers to whether God does or does not play dice.)

04 March 2009

Why isn't it expnormal?

We say that a random variable X has a lognormal distribution if its logarithm, Y = log X, is normally distributed. The normal distribution often occurs when a random variable comes about by combining a bunch of small independent contributions, but those contributions combine additively; when the combination is multiplicative instead, lognormals occur. For example, lognormal distributions often occur in models of financial markets.

But of course X = exp Y, so the variable we care about is the exponential of a normal. Why isn't it called expnormal?

26 January 2009

Probabilistic fun with the n-sphere

On reddit: a link to the old chestnut that high-dimensional spheres are weird. If you consider the volume of the unit ball in Rⁿ as a function of n, it increases up to n=5 and then decreases. The volume is π^(n/2)/((n/2)!). (By the way, I generally use the factorial notation, not the Γ notation, even when the argument isn't an integer.)

But I hear you complaining that it doesn't make sense to compare volumes in different dimensions! Fair enough. Compare the volume of the unit ball in Rⁿ to the cube circumscribing it, which has volume 2ⁿ. Then the portion of the cube which is inside the ball is f(n) = π^(n/2)/((n/2)! 2ⁿ). This is rapidly decreasing with n. For n = 2, it's π/4 -- the volume of the unit disc is π, and it can be inscribed in a square of area 4. For n = 3, the unit ball has volume 4π/3 and it's inscribed in a cube of volume 8, so we get f(3) = π/6. But f(n) decreases superexponentially: f(10) is about 0.0025, and f(20) is about 25 in a billion.

I was surprised that it decayed that quickly -- I'd never bothered to work it out. But if you think about it probabilistically, it kind of makes sense. Namely, a random point in the unit n-cube can be identified with its coordinates (x₁, x₂, ..., xₙ). It's in the unit n-ball if and only if the sum of the squares of those coordinates is less than 1. Let yᵢ = xᵢ² -- then yᵢ has mean 1/3 and variance 4/45. (That's calculus.) So the sum of the squares of the coordinates is a sum of n such independent random variables, and is thus itself a random variable with mean n/3 and variance 4n/45 -- no surprise, then, that for large n most of its mass is at values greater than 1. One could probably use large deviation inequalities to quantify this, but come on, it's 11 at night and I have real work to do.
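The superexponential decay is easy to tabulate; here's a Python sketch (gamma(n/2 + 1) plays the role of (n/2)!):

```python
from math import pi, gamma

def ball_cube_ratio(n):
    # volume of the unit n-ball divided by the volume 2**n
    # of the cube circumscribing it
    return pi ** (n / 2) / (gamma(n / 2 + 1) * 2 ** n)
```

ball_cube_ratio(2) is π/4 and ball_cube_ratio(3) is π/6, as above, while the values for n = 10 and n = 20 come out around 0.0025 and 2.5 × 10⁻⁸.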

30 November 2008

I'm thankful for the Borel-Cantelli lemma

Cosmic Variance is thankful for the spin-statistics theorem, because it enables the division between matter and force, which is kind of important.

In this spirit, I am thankful for the second Borel-Cantelli lemma, which states that if countably many events E₁, E₂, E₃, ... are independent and the sum of the probabilities of the Eₙ diverges to infinity, then the probability that infinitely many of them occur is 1. Let these events be "something interesting happens at time n", for a suitable quantization of time; then given infinite time, infinitely many interesting things will happen. (Of course I'm making an independence assumption here.) I like interesting things.

I am also thankful for whoever it was that took a picture of a monkey at a typewriter.