08 September 2007

mapping functions and genes "crossing over"

In genetics there is a unit called the centimorgan. It is a unit of what is called recombination frequency, and it doesn't seem to be well-defined. For those of you who don't remember your biology (and I'll admit I'm one of them), recall that almost all cells contain chromosomes in pairs (23 pairs in humans). In the process of meiosis, cells are produced which contain only one chromosome from each pair, and when fertilization occurs two such cells come together to form new pairs. However, the chromosome that gets passed on is not a straight copy of either member of the old pair: during meiosis the two members swap segments, or "recombine", as can be seen in the image.

The result is that two genes which are physically close together on the same chromosome will usually be inherited together, but two genes which are physically far apart might not be. When one learns about this in an introductory biology class, I think the fact that two crossovers are, in a sense, the same as no crossover at all is ignored. That is, if the chromosomes cross over twice, or four times, or six times, or any even number of times between two genes, then those genes will end up on the same copy of the chromosome even after crossing over. (A more quotidian analogy: you walk down a street, arbitrarily crossing it "when the mood strikes"; the probability that at some given moment in the future you are on the opposite side from where you started is not the same as the probability that you have ever crossed the street, because you might have crossed back.)

Certainly, I don't remember hearing it in high school biology, and it's not mentioned in Time, Love, Memory: A Great Biologist and His Quest for the Origins of Behavior, which is the book I'm reading right now. It's nominally a biography of Seymour Benzer (who is still an active researcher) but is also something of a history of molecular biology.

Anyway, two genes are said to be one centimorgan apart if the probability of a crossover occurring between them is 0.01 -- or if the probability of an odd number of crossovers occurring between them is 0.01 -- or if the average number of crossovers between them is 0.01 -- I can't determine which. From what I can gather, molecular biologists seem to think of centimorgans as additive, which seems to require the third definition. (It looks like sometimes they use the other definitions and use something called a mapping function to correct for this, but I'm not entirely sure I'm reading this correctly.)

Now, a first guess would be that crossovers occur basically at random over the entire chromosome, and are a Poisson process. For the sake of simplicity assume that crossovers form a Poisson process with rate 1 -- that is, in a piece of the chromosome of length λ, the number of crossovers has a Poisson distribution with mean λ, and non-overlapping pieces have independent numbers of crossovers. What is the probability of an odd number of crossovers occurring in a segment of length λ? Let X be a Poisson(λ) random variable; then it's f(λ) = P(X = 1) + P(X = 3) + P(X = 5) + ...

The logical question to ask is: is this an increasing function of λ? That is, as we consider points further and further apart on the chromosome, does the linkage between them actually become less strong? You could imagine that the function might not be increasing. For example, say that after one crossover, the next crossover always occurred between 9 and 11 space-units down the line. Then, measuring from a gene that sits just past a crossover, a second gene between 11 and 18 units away would always end up on the opposite chromosome, one between 22 and 27 units away would always end up on the same chromosome, and in general you'd have some sort of oscillatory behavior.
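To see the oscillation in that toy example, here is a rough simulation sketch in Python. It rests on assumptions that are mine rather than anything from genetics: a crossover has just happened at position 0, each gap to the next crossover is drawn uniformly from 9 to 11 units, and the function names are made up for illustration.

```python
import random


def odd_crossovers(distance, rng, min_gap=9.0, max_gap=11.0):
    """Return True if an odd number of crossovers falls within `distance` of position 0.

    Assumes a crossover has just occurred at position 0 and that each later
    crossover lies between min_gap and max_gap units past the previous one
    (uniformly distributed, which is an assumption of this sketch).
    """
    position = 0.0
    count = 0
    while True:
        position += rng.uniform(min_gap, max_gap)
        if position > distance:
            return count % 2 == 1
        count += 1


def recombination_frequency(distance, trials=2000, seed=0):
    """Estimate the chance that two genes `distance` apart end up on opposite copies."""
    rng = random.Random(seed)
    hits = sum(odd_crossovers(distance, rng) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for d in (5, 10, 14, 20, 25, 30):
        print(d, recombination_frequency(d))
```

Distances between 11 and 18 always give exactly one crossover and distances between 22 and 27 always give exactly two, so the estimated frequency swings between 1 and 0 instead of settling down.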

Under the Poisson assumption, though, the answer is yes. In fact, we have
P(X = 1) + P(X = 3) + P(X = 5) + ...
= λ e^{-λ} + (λ^3 e^{-λ})/3! + (λ^5 e^{-λ})/5! + ...
= e^{-λ} (λ + λ^3/3! + λ^5/5! + ...)
= e^{-λ} sinh λ
= (1 - e^{-2λ})/2
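which is known as Haldane's mapping function. (The last step uses sinh λ = (e^λ - e^{-λ})/2.) It is increasing in λ and approaches 1/2 as λ grows, so under this model linkage only ever gets weaker with distance. It's hard to find a clear derivation of this online, because most of what's available online is course notes that are intended for people who will be using this in their work and don't particularly need to know the derivation.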
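As a sanity check on the derivation, here is a small Python sketch that adds up the odd Poisson terms directly and compares the total with the closed form. The function names and the truncation at 40 terms are my own choices for illustration.

```python
import math


def odd_crossover_probability(lam, terms=40):
    """Sum P(X = k) over k = 1, 3, 5, ... for X ~ Poisson(lam).

    The series is truncated after `terms` odd values of k, which is far more
    than needed for the small values of lam used here.
    """
    total = 0.0
    for j in range(terms):
        k = 2 * j + 1
        total += math.exp(-lam) * lam**k / math.factorial(k)
    return total


def haldane(lam):
    """Closed form (1 - e^{-2 lam}) / 2 for the same probability."""
    return (1.0 - math.exp(-2.0 * lam)) / 2.0


if __name__ == "__main__":
    for lam in (0.01, 0.1, 0.5, 1.0, 3.0):
        print(lam, odd_crossover_probability(lam), haldane(lam))
```

The two columns should agree to within floating-point error, and the values climb toward 1/2 without ever reaching it.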

What this tells us is that two genes which are separated by λ "units of space" will recombine with frequency (1 - e^{-2λ})/2. Note that if λ is small, this is only very slightly smaller than λ, since cases when there is more than one crossover in the space between the genes are vanishingly rare. But it also tells us that if two genes A and B recombine with frequency p, then they are not p of these "natural units" apart, but rather they are a distance λ apart with (1 - e^{-2λ})/2 = p, so λ = -log(1 - 2p)/2. So, for example, if two genes A and B recombine with frequency .20, the average number of crossovers between them is not .20, but -log(.6)/2 = .255. And if another two genes B and C recombine with that same frequency, and they are arranged on the chromosome in the order A, B, C, then the distance between A and C is -log(.6) = .511, and the recombination frequency of A and C is (1 - e^{2 log .6})/2 = (1 - .36)/2 = .32, not .40.

In general, if A and B recombine with frequency p, and B and C recombine with frequency q, then A and C recombine with frequency p + q - 2pq. This can be derived from the Haldane mapping function, but the following argument is nicer. In order for A and C to recombine, exactly one of the pairs (A, B) and (B, C) must recombine. Under the Poisson assumption, crossovers in the two non-overlapping segments are independent, so with probability p(1-q), A and B recombine while B and C don't; with probability q(1-p) the reverse happens; and the two add up to p + q - 2pq. Again, this formula seems to recur without justification in notes that I can find online.
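The arithmetic in the A, B, C example can be checked with a few lines of Python. This is only a sketch under the Poisson model used in this post, and the function names are invented for the occasion.

```python
import math


def haldane(lam):
    """Recombination frequency for two genes a map distance lam apart."""
    return (1.0 - math.exp(-2.0 * lam)) / 2.0


def haldane_inverse(p):
    """Map distance -log(1 - 2p)/2 for a recombination frequency p in [0, 1/2)."""
    if not 0.0 <= p < 0.5:
        raise ValueError("recombination frequency must be in [0, 0.5)")
    return -math.log(1.0 - 2.0 * p) / 2.0


def combine(p, q):
    """Recombination frequency of A and C, given A-B (p) and B-C (q), B in the middle."""
    return p + q - 2.0 * p * q


if __name__ == "__main__":
    d_ab = haldane_inverse(0.20)   # about 0.255
    d_ac = 2.0 * d_ab              # map distances add, so about 0.511
    print(d_ab, d_ac)
    print(haldane(d_ac))           # about 0.32, via the mapping function
    print(combine(0.20, 0.20))     # about 0.32, via p + q - 2pq
```

Both routes give .32, which is the point: distances add, recombination frequencies don't.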

If you know anything about special relativity, this might remind you of how rapidities add (while velocities don't). The rapidity of a particle with velocity v is tanh^{-1}(v/c), which is approximately v/c for small v (or v, if you like natural units); relative rapidities are additive, while relative velocities are only approximately additive, and then only for small velocities. Something similar is going on in the genetic situation, where the usual measure of "distance" is only additive for things that are close together and a correction has to be used when they get far apart.
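For anyone who wants to see the relativity side of the analogy in numbers, here is a tiny Python sketch in natural units (c = 1). The function names are mine; the velocity-addition formula is the standard one for motion along a single line.

```python
import math


def add_velocities(u, v):
    """Relativistic addition of collinear velocities, in units where c = 1."""
    return (u + v) / (1.0 + u * v)


def rapidity(v):
    """Rapidity tanh^{-1}(v) of a velocity v with |v| < 1."""
    return math.atanh(v)


if __name__ == "__main__":
    u, v = 0.5, 0.6
    w = add_velocities(u, v)
    print(w)                            # less than u + v = 1.1
    print(rapidity(w))                  # rapidity of the combined velocity
    print(rapidity(u) + rapidity(v))    # same number: rapidities simply add
```

Here rapidity plays the role that λ = -log(1 - 2p)/2 plays for genes: it is the quantity that adds, while v (like the recombination frequency p) needs a correction formula.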

And how would this be different if chromosomes came in, say, triplets instead of pairs? Maybe I should be a mad scientist in my next life. Then I could find out. (Or I could just do the calculation now, but I've got better things to do.)

(I suspect there are places here where I'm not using correct biological terminology; here I follow in the footsteps of Feynman, who when he learned about zoology once went to a library and asked them for a "map of the cat".)


Rohan said...

I don't see how you go from
e^-λ sinh λ to (1 - e^-2λ)/2...

Rohan said...

Oh never mind. Wikipedia shows the light.