29 December 2008

A combinatorial problem from Crick

I recently read What Mad Pursuit: A Personal View of Scientific Discovery
, which is Francis Crick's account of the "classical period" of molecular biology, from the discovery of the double helix structure of DNA to the eventual figuring out of the genetic code. It differs from the more well-known book by James Watson, The Double Helix: A Personal Account of the Discovery of the Structure of DNA, which focuses more on the characters involved and less on the science.

Crick was trained as a physicist, and learned some mathematics as well, and every so often this pokes through. For example, back when the nature of the genetic code wasn't known, combinatorial problems arose to prove that a genetic code of a certain type was or was not possible. One idea, due to Gamow and Ycas was that since there are twenty combinations of four bases taken three at a time where order doesn't matter, perhaps each one of those corresponded to a different amino acid. This turned out to be false. Another, more interesting problem comes from asking how the cell knows where to begin reading the code. What is the largest size of a collection of triplets of four bases such that if UVW and XYZ are both in the collection, then neither VWX nor WXY is? The reason for this constraint is so that the "phase" of a genetic sequence is unambiguous; if we see the sequence UVWXYZ, we know to start reading at the U, not the V or the W. Thus the collection can't contain any triplet in which all three elements are the same, and it can contain at most one of {XYZ, YZX, ZXY} for any bases X, Y, Z, not necessarily distinct. There are sixty triplets where not all three elements are the same, thus at most twenty amino acids can be encoded in such a code. There are solutions that acheive twenty; see the paper of Crick, Griffith, and Orgel.

Note that the "20" in the two types of code here arises in different ways. If we assume a triplet code with n bases, then the first type of code can encode as many as n(n+1)(n+2)/6 amino acids, the second (n3-n)/3.

Crick says that the more general problem of enumerating the number of codes which imply their own "reading frame" was considered by Golomb and Welch, and separately Freudenthal. Based on the title and the date, I think the first of these is the paper I point to below -- but our library doesn't have that journal in electronic form, and the physical library is closed this week!

References
F. H. C. Crick, J. S. Griffith, L. E. Orgel. Codes Without Commas. Proceedings of the National Academy of Sciences of the United States of America, Vol. 43, No. 5 (May 15, 1957), pp. 416-421.

George Gamow, Martynas Ycas. Statistical Correlation of Protein and Ribonucleic Acid Composition Statistical Correlation of Protein and Ribonucleic Acid Composition. Vol. 41, No. 12 (Dec. 15, 1955), pp. 1011-1019.

Golomb, S.W., Gordon, B., and Welch, L.R., "Comma-Free Codes", The Canadian Journal of Mathematics, Vol. 10, 1958. (Citation from this list of Golomb's publications; I haven't read it.)

1 comment:

CarlBrannen said...

There's another question about DNA, but at a lower level that I blogged about some time ago. Why does DNA only use 4 Nucleotides?.

Two nucleotides would be binary, but by using 4, DNA is more efficient in terms of carrying capacity (bits per molecule). And there are a lot more nucleotides available, something like a few dozen. The question was the subject of a Nature article titled "Why are There Four Letters in the Genetic Alphabet?". My answer is that 4 happens to be the optimum for speed of transcription due to chemistry.