Published online before print
February 9, 2007, 10.1101/gr.5987307 Genome Res. 17:405-412, 2007 ©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07 $5.00 OPEN ACCESS ARTICLE
Letter
The genetic code is nearly optimal for allowing additional information within protein-coding sequences
1 Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot 76100, Israel; 2 Department of Physics of Complex Systems, Weizmann Institute of Science, Rehovot 76100, Israel
DNA sequences that code for proteins need to convey, in addition to the protein-coding information, several different signals at the same time. These "parallel codes" include binding sequences for regulatory and structural proteins, signals for splicing, and RNA secondary structure. Here, we show that the universal genetic code can efficiently carry arbitrary parallel codes much better than the vast majority of other possible genetic codes. This property is related to the identity of the stop codons. We find that the ability to support parallel codes is strongly tied to another useful property of the genetic code: minimization of the effects of frame-shift translation errors. Whereas many of the known regulatory codes reside in nontranslated regions of the genome, the present findings suggest that protein-coding regions can readily carry abundant additional information.
The genetic code is the mapping of 64 three-letter codons to 20 amino acids and a stop signal (Woese 1965). There exist a large number of alternative genetic codes that are equivalent to the real code in two prominent features, the number of codons per amino acid and the impact of misread errors (Fig. 1). Here we ask whether the real code stands out among these alternative codes as being optimal for other properties.
We consider the ability of the genetic code to support, in addition to the protein-coding sequence, additional information that can carry biologically meaningful signals. These signals can include binding sequences of regulatory proteins that bind within coding regions (Robison et al. 1998). We find that the universal genetic code can allow arbitrary sequences of nucleotides within coding regions much better than the vast majority of other possible genetic codes. We further find that the ability to support parallel codes is strongly correlated with an additional property: minimization of the effects of frame-shift translation errors. Selection for either or both of these traits may have helped to shape the universal genetic code.
Ability to include additional sequences
We first considered the ability of the genetic code to support, in addition to the protein-coding sequence, additional sequences that can carry biological signals. For this purpose, we studied the properties of all alternative genetic codes that share the known optimality features of the real code (Fig. 1). Each alternative code has the same number of codons per amino acid and the same impact of misread errors as the real code. We tested the ability of the genetic codes to include arbitrary sequences, denoted n-mers, within protein-coding regions. As an example, consider the 5-mer "UGACA." This sequence may be a protein-binding site, which should appear within a protein-coding region. This 5-mer sequence can appear within a coding sequence in one of the three reading frames: UGA|CAN, NNU|GAC|ANN, or NUG|ACA, where N denotes any nucleotide and the vertical lines separate consecutive codons. To assess the probability that this 5-mer appears in a coding region, one needs to sum over the three possible reading frames (Fig. 2A). In one of the frames, this sequence generates a stop codon, UGA. The 5-mer cannot appear in a coding region in this frame, because coding regions have no in-frame stop codons. The sequence can, however, appear in one of the two other frames. Overall, the probability that this 5-mer appears in coding regions will tend to be lower than that of 5-mers that do not include stop codons.
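The three-frame decomposition of an n-mer can be sketched in a few lines of Python. This is only an illustration of the bookkeeping described above; the function names `frames` and `blocked` are hypothetical and do not come from the paper.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}  # stop codons of the real genetic code

def frames(nmer):
    """Return the n-mer embedded in each of the three reading frames.

    Positions outside the n-mer are written as 'N' (any nucleotide)
    and consecutive codons are separated by '|'.
    """
    result = []
    for offset in range(3):
        padded = "N" * offset + nmer
        padded += "N" * (-len(padded) % 3)  # pad to a whole number of codons
        codons = [padded[i:i + 3] for i in range(0, len(padded), 3)]
        result.append("|".join(codons))
    return result

def blocked(frame, stops=STOP_CODONS):
    """True if a fully specified codon in this frame is a stop codon."""
    return any(codon in stops for codon in frame.split("|"))

for frame in frames("UGACA"):
    print(frame, "blocked" if blocked(frame) else "allowed")
# UGA|CAN blocked
# NUG|ACA allowed
# NNU|GAC|ANN allowed
```

Passing a different `stops` set lets the same check be applied to an alternative code, for example `{"AAA"}` for the 5-mer AAAAA.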
Each genetic code has n-mer sequences, such as the above-mentioned sequence UGACA in the real genetic code, which are difficult to include in coding regions: these "difficult" sequences contain stop codons, and thus cannot appear in at least one of the three frames, since protein-coding regions do not contain stop codons. We find that the real genetic code is able to include even the most difficult n-mers because it has a special property: its stop codons, when frame shifted, tend to form abundant codons. Hence, n-mers that cannot be included in one frame can be included with high probability in the other frames. To understand the relation between the stop codons and the ability of the genetic code to include arbitrary n-mers, consider the 5-mer S = AAAAA (Fig. 2C). This 5-mer can appear within a coding sequence in one of the three reading frames: AAA|AAN, NNA|AAA|ANN, or NAA|AAA. Alternative genetic codes that assign AAA as one of their stop codons (Fig. 3D) can never include S in a protein-coding sequence. The problem is that the stop codon AAA overlaps with itself when frame shifted; hence, strings such as S include a stop codon in each of the three frames, precluding their presence in a coding region.
Another example is the 5-mer S = CCGGU. In an alternative code with stop codons CCA, CCG, and CGG, this n-mer can only appear in one of the three reading frames (Fig. 2D). This is because two of the stop codons, CCG and CGG, overlap each other. In contrast, the real genetic code has the stop codons UAA, UAG, and UGA that do not overlap with themselves or with each other, no matter how they are frame shifted. Furthermore, frame-shifted versions of the real stop codons overlap with the codons of the most abundant amino acids. For example, the UGA stop codon in a 1 frame-shift message results in the di-codon NNU|GAN, where N is any nucleotide (Fig. 2B). The GAN codons encode Asp and Glu, which are among the three amino acids with the most abundant codons (Table 1). Therefore, n-mers with the letters UGA can be included with high probability in protein sequences without generating an in-frame stop. The same idea applies to the other two stop codons in the real code; this property occurs in only very few of the alternative genetic codes. In short, optimality for including arbitrary n-mer sequences within coding regions is due to stop codons that do not overlap each other, but which do overlap codons for abundant amino acids.
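The overlap property discussed above can be checked mechanically. The sketch below assumes that two stop codons "overlap" when the tail of one equals the head of the other after a shift of one or two nucleotides; the helper name `stop_overlaps` is hypothetical.

```python
def stop_overlaps(stops):
    """List (first, second, shift) triples in which the tail of one stop
    codon matches the head of another (or of itself) after a frame
    shift of 1 or 2 nucleotides."""
    found = []
    for a in sorted(stops):
        for b in sorted(stops):
            for shift in (1, 2):
                if a[shift:] == b[:3 - shift]:
                    found.append((a, b, shift))
    return found

print(stop_overlaps({"UAA", "UAG", "UGA"}))  # real code: []
print(stop_overlaps({"CCA", "CCG", "CGG"}))  # [('CCG', 'CGG', 1)]
print(stop_overlaps({"AAA"}))                # AAA overlaps itself at both shifts
```

An empty list for the real stop codons reflects the fact that all three begin with U and none has U in its second or third position.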
We calculated the probability of including all n-mer sequences for each alternative genetic code by summing up, for every n-mer sequence, the probabilities of all codon combinations that contain it (Fig. 2A; for details see Methods). The codon probabilities were determined according to the known amino acid frequencies in proteins (Table 1). The results presented in the main text are for uniform codon usage, but they apply to a wide range of different codon usages (Supplemental material). We find that the real code shows significantly higher probabilities of including arbitrary sequences. The average of the logarithm of all n-mer probabilities is significantly higher in the real code than in the vast majority of alternative codes (Table 2), with a P-value < 0.05 for n-mer sequences with n greater than seven. In addition, the real code shows significantly higher probabilities of including the most difficult sequences (n-mers with the lowest probability of appearing in a coding region) than the vast majority of alternative codes (Fig. 2E; Table 2; Supplemental Fig. 4). For example, the average probability of including the 20% most difficult sequences is exceeded by only 3% of the alternative codes for 8-mers and 1% of the alternative codes for 9-mers. This property can be seen when examining the distribution of the probabilities of n-mers appearing within protein-coding sequences. In the real code there are significantly fewer n-mers with low probabilities (Fig. 2E).
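A minimal sketch of this calculation follows, under two loudly stated assumptions that are not the paper's actual inputs: codons are drawn independently, and every sense codon is equally likely (Table 1 amino acid frequencies are not reproduced here). With independent codons, the sum over all codon combinations containing the n-mer factorizes into a product of per-codon pattern sums. The names `pattern_prob` and `nmer_prob` are hypothetical.

```python
from itertools import product
from math import prod

BASES = "UCAG"
STOPS = {"UAA", "UAG", "UGA"}

# ASSUMPTION: uniform probability over the 61 sense codons, a stand-in
# for probabilities derived from amino acid frequencies.
SENSE = ["".join(c) for c in product(BASES, repeat=3) if "".join(c) not in STOPS]
CODON_P = {c: 1 / len(SENSE) for c in SENSE}

def pattern_prob(pattern, codon_p):
    """Total probability of the codons matching a pattern such as 'GAN'."""
    return sum(p for codon, p in codon_p.items()
               if all(x in ("N", y) for x, y in zip(pattern, codon)))

def nmer_prob(nmer, codon_p=CODON_P):
    """Sum, over the three reading frames, of the probability that the
    codons covering the n-mer spell it out. Stop codons have no entry
    in codon_p, so a frame containing one contributes zero."""
    total = 0.0
    for offset in range(3):
        padded = "N" * offset + nmer
        padded += "N" * (-len(padded) % 3)
        patterns = [padded[i:i + 3] for i in range(0, len(padded), 3)]
        total += prod(pattern_prob(pat, codon_p) for pat in patterns)
    return total

print(nmer_prob("UGACA") < nmer_prob("GGACA"))  # True: UGACA loses one frame
```

Supplying a different `codon_p` dictionary (with the stop codons of an alternative code omitted) applies the same calculation to that code.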
The optimality of the real genetic code relative to alternative codes seems to increase with the length of the n-mers (Fig. 2F). This is because as the length of the n-mers increases, the fraction of n-mers that include stop codons increases dramatically. Above n = 16, more than half of all n-mers include at least one stop codon. The real genetic code is able to include all n-mers with n < 11 in at least one, and often many, combinations of amino acid codons. For n-mers of any length, the real code appears to exceed almost all of the alternative codes in its ability to include a large fraction of possible n-mers within coding regions (Fig. 2F; Table 2).
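The growth in the fraction of n-mers that contain a stop triplet can be counted exactly with a small dynamic program over the last two nucleotides. This is a sketch, assuming that "include a stop codon" means containing UAA, UAG, or UGA at any position; the function name `fraction_with_stop` is hypothetical.

```python
from itertools import product

def fraction_with_stop(n, stops=("UAA", "UAG", "UGA")):
    """Fraction of all 4**n n-mers containing a stop triplet at any position."""
    bases = "UCAG"
    if n < 3:
        return 0.0
    # counts[suffix] = number of stop-free strings ending in that 2-letter suffix
    counts = {"".join(p): 1 for p in product(bases, repeat=2)}
    for _ in range(n - 2):
        new = dict.fromkeys(counts, 0)
        for suffix, c in counts.items():
            for b in bases:
                if suffix + b not in stops:
                    new[suffix[1] + b] += c
        counts = new
    return 1 - sum(counts.values()) / 4 ** n

for n in (3, 8, 16, 17):
    print(n, round(fraction_with_stop(n), 3))
```

Running the loop over a range of n shows where the fraction crosses one half, which can be compared with the statement in the text.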
Robustness to translational frame-shift errors
To abort translation after a frame shift, the ribosome must encounter a stop codon in the shifted frame. It has been suggested that codon usage in some organisms may be biased toward codons that can form stop codons upon translational frame shift (Seligmann and Pollock 2004). Interestingly, the ability to abort translation after frame shift is closely related to the ability to include arbitrary parallel codes (Fig. 4). Robustness to frame-shift errors occurs because the frame-shifted codons for abundant amino acids overlap with the stop codons, hence increasing the probability that a stop is encountered upon frame shift. As mentioned above, it is precisely this property that allows the real genetic code to include arbitrary sequences within protein-coding regions, including those with stop sequences, with a significantly higher probability than alternative codes.
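To make the idea of a hidden, off-frame stop concrete, the sketch below counts how many codons are read from a given start position before a stop codon is reached. The function name `codons_until_stop` and the toy sequence are illustrative assumptions, not data from the paper.

```python
STOPS = {"UAA", "UAG", "UGA"}

def codons_until_stop(seq, start, stops=STOPS):
    """Number of sense codons read from `start` before a stop codon is
    reached; None if the sequence ends without a stop in that frame."""
    for count, i in enumerate(range(start, len(seq) - 2, 3)):
        if seq[i:i + 3] in stops:
            return count
    return None

# Toy message: AUG|GAU|GAA|UAA (Met-Asp-Glu-stop in the correct frame).
seq = "AUGGAUGAAUAA"
print(codons_until_stop(seq, 0))  # 3: the in-frame stop after three codons
print(codons_until_stop(seq, 2))  # 1: shifted frame reads GGA, then UGA
print(codons_until_stop(seq, 1))  # None: UGG, AUG, AAU, no stop
```

In this toy message the off-frame UGA spans the adjacent Asp and Glu codons (GAU|GAA), matching the NNU|GAN pattern described earlier.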
The present optimality features are shared also by almost all of the nonuniversal codes such as those found in mitochondria (Osawa et al. 1992).
In summary, we found that the genetic code is nearly optimal for encoding additional information in parallel to its main function of encoding the amino acid sequence of proteins. This optimality is related to the identity of the stop codons in the universal code: when frame shifted, the stop codons overlap with codons of abundant amino acids. We showed that this optimality is strongly tied to a second useful property: minimization of the effect of translational frame-shift errors.
Robustness to frame-shift errors may be a reasonable inherent constraint on the early genetic code. One may therefore propose that the ability to carry parallel codes may have emerged as a side effect that was later exploited to allow genes and mRNA molecules to support a wide range of signals to regulate and modify biological processes in cells (Kirschner et al. 2005).
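Whereas many of the currently known regulatory codes reside in nontranslated regions of the genome (Robison et al. 1998), the present findings suggest that protein-coding regions can readily carry abundant additional information.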
Alternative genetic codes
The alternative genetic codes were obtained by independently permuting the nucleotides in the three codon positions while preserving the amino acid assignment (Fig. 1). These permutations preserve both the number of codons per amino acid and the effect of misread errors on the translated protein, as defined in Freeland and Hurst (1998) G permutation. The ensemble of alternative codes therefore contains 24 × 24 × 2 = 1152 codes. In the Supplemental material, we show that relaxing the wobble constraint does not change any of the present conclusions (Supplemental Fig. 1).
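The ensemble of 1152 codes can be sketched as follows. The standard code table is well established, but the source text is garbled where it specifies the third-position permutations, so this sketch ASSUMES the two third-position options are the identity and an A↔G swap; that choice is exposed as the hypothetical parameter `third_position_swaps`.

```python
from itertools import permutations, product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
REAL_CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def alternative_codes(third_position_swaps=(None, ("A", "G"))):
    """Build codes by permuting nucleotides independently per codon position.

    Positions 1 and 2 use all 4! = 24 permutations. Position 3 uses only
    the given swaps (ASSUMPTION: identity plus an A<->G swap).
    """
    full = [dict(zip(BASES, p)) for p in permutations(BASES)]
    third = []
    for swap in third_position_swaps:
        mapping = {b: b for b in BASES}
        if swap:
            a, b = swap
            mapping[a], mapping[b] = b, a
        third.append(mapping)
    return [
        {p1[c[0]] + p2[c[1]] + p3[c[2]]: aa for c, aa in REAL_CODE.items()}
        for p1, p2, p3 in product(full, full, third)
    ]

codes = alternative_codes()
print(len(codes))  # 1152
```

Because each code only relabels codons, the number of codons per amino acid (and the number of stop codons) is identical across the ensemble.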
Inclusion of arbitrary sequences within protein-coding sequences
For each code we calculated the average logarithm of the probabilities of all n-mer sequences. To avoid singularities, a small number
In addition to an average of the logarithm of all n-mer probabilities, for each alternative genetic code we calculated the arithmetic average probability of obtaining the fraction x of n-mers, sorted from the most difficult to the easiest (lowest to highest probability). For every x, we assigned a P-value to the real code, which is the fraction of alternative codes for which the average probability of the x most difficult n-mers is equal to or higher than in the real code (Supplemental Fig. 4). Table 2 shows the P-value for the average probability of obtaining the x most difficult n-mers for different n-mer sizes, with x = 20%. The values of x for which small P-values are found increase with the size of the n-mers under consideration (see below). The FDR method was used to determine the range of difficult n-mers for which the average probability in the real code is significantly higher than in alternative codes, with a threshold that corresponds to a false discovery rate of 15% (Supplemental Fig. 4). For n > 8, the calculations were based on 10^5 randomly sampled n-mers.
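The P-value defined above can be sketched directly from its description. The function names `mean_hardest` and `p_value` are hypothetical, and the numbers in the usage example are made-up toy values, not results from the paper.

```python
def mean_hardest(probs, x):
    """Arithmetic mean of the fraction x of n-mers with the lowest
    probability (the "most difficult" n-mers)."""
    k = max(1, int(len(probs) * x))
    return sum(sorted(probs)[:k]) / k

def p_value(real_probs, alt_probs_list, x=0.2):
    """Fraction of alternative codes whose mean probability over the x most
    difficult n-mers is equal to or higher than that of the real code."""
    real = mean_hardest(real_probs, x)
    hits = sum(mean_hardest(alt, x) >= real for alt in alt_probs_list)
    return hits / len(alt_probs_list)

# Toy illustration with invented probabilities for five n-mers per code.
real = [0.1, 0.2, 0.3, 0.4, 0.5]
alternatives = [
    [0.05, 0.2, 0.3, 0.4, 0.5],  # hardest n-mer is harder than in `real`
    [0.2, 0.3, 0.3, 0.4, 0.5],   # hardest n-mer is easier
    [0.1, 0.1, 0.3, 0.4, 0.5],   # tie, which counts toward the P-value
]
print(p_value(real, alternatives))  # 2 of 3 alternatives are equal or higher
```

Ties count toward the P-value here because the text specifies "equal to or higher than".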
We find that in the real code, all sequences with n
The probability of encountering a frame-shifted stop
Selection pressure of frame shift errors
We thank James Shapiro for suggesting this problem, Orna Man for the amino acid probabilities data, and Tsvi Tlusti, Eran Segal, Liran Shlush, and all members of our lab for useful comments. We thank Minerva, HFSP and the Kahn family Foundation for support. S.I. acknowledges support from the Horowitz Complexity Science Foundation.
3 Corresponding author.
E-mail uri.alon{at}weizmann.ac.il; fax 972-8-934125. [Supplemental material is available online at www.genome.org.] Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.5987307
Alon, U. 2006. An introduction to systems biology. CRC Press, London, UK.
Archetti, M. 2004. Codon usage bias and mutation constraints reduce the level of error minimization of the genetic code. J. Mol. Evol. 59: 258-266.
Brooks, D.J., Fresco, J.R., and Singh, M. 2004. A novel method for estimating ancestral amino acid composition and its application to proteins of the Last Universal Ancestor. Bioinformatics 20: 2251-2257.
Cartegni, L., Chew, S.L., and Krainer, A.R. 2002. Listening to silence and understanding nonsense: Exonic mutations that affect splicing. Nat. Rev. Genet. 3: 285-298.
Crick, F.H. 1968. The origin of the genetic code. J. Mol. Biol. 38: 367-379.
Dekel, E. and Alon, U. 2005. Optimality and evolutionary tuning of the expression level of a protein. Nature 436: 588-592.
Di Giulio, M. 2005. The origin of the genetic code: Theories and their relationships, a review. Biosystems 80: 175-184.
Draper, D.E. 1999. Themes in RNA-protein recognition. J. Mol. Biol. 293: 255-270.
Dufton, M.J. 1997. Genetic code synonym quotas and amino acid complexity: Cutting the cost of proteins? J. Theor. Biol. 187: 165-173.
Farabaugh, P.J. and Bjork, G.R. 1999. How translational accuracy influences reading frame maintenance. EMBO J. 18: 1427-1434.
Freeland, S.J. and Hurst, L.D. 1998. The genetic code is one in a million. J. Mol. Evol. 47: 238-248.
Freeland, S.J., Knight, R.D., Landweber, L.F., and Hurst, L.D. 2000. Early fixation of an optimal genetic code. Mol. Biol. Evol. 17: 511-518.
Gilis, D., Massar, S., Cerf, N.J., and Rooman, M. 2001. Optimality of the genetic code with respect to protein stability and amino-acid frequencies. Genome Biol. 2: research0049.
Gusev, V.D., Nemytikova, L.A., and Chuzhanova, N.A. 1999. On the complexity measures of genetic sequences. Bioinformatics 15: 994-999.
Hasegawa, M. and Miyata, T. 1980. On the antisymmetry of the amino acid code table. Orig. Life 10: 265-270.
Katz, L. and Burge, C.B. 2003. Widespread selection for local RNA secondary structure in coding regions of bacterial genes. Genome Res. 13: 2042-2051.
Kellis, M., Patterson, N., Endrizzi, M., Birren, B., and Lander, E.S. 2003. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423: 241-254.
Kirschner, M., Gerhart, J.C., and Norton, J. 2005. The plausibility of life: Resolving Darwin's dilemma. Yale University Press, New Haven, CT.
Knight, R.D., Freeland, S.J., and Landweber, L.F. 2001. Rewiring the keyboard: Evolvability of the genetic code. Nat. Rev. Genet. 2: 49-58.
Konecny, J., Schoniger, M., Hofacker, I., Weitze, M.D., and Hofacker, G.L. 2000. Concurrent neutral evolution of mRNA secondary structures and encoded proteins. J. Mol. Evol. 50: 238-242.
Lieb, J.D., Liu, X., Botstein, D., and Brown, P.O. 2001. Promoter-specific binding of Rap1 revealed by genome-wide maps of protein-DNA association. Nat. Genet. 28: 327-334.
Muto, A. and Osawa, S. 1987. The guanine and cytosine content of genomic DNA and bacterial evolution. Proc. Natl. Acad. Sci. 84: 166-169.
Osawa, S., Jukes, T.H., Watanabe, K., and Muto, A. 1992. Recent evidence for evolution of the genetic code. Microbiol. Rev. 56: 229-264.
Parker, J. 1989. Errors and alternatives in reading the universal genetic code. Microbiol. Rev. 53: 273-298.
Pe'er, I., Felder, C.E., Man, O., Silman, I., Sussman, J.L., and Beckmann, J.S. 2004. Proteomic signatures: Amino acid and oligopeptide compositions differentiate among phyla. Proteins 54: 20-40.
Robison, K., McGuire, A.M., and Church, G.M. 1998. A comprehensive library of DNA-binding site matrices for 55 proteins applied to the complete Escherichia coli K-12 genome. J. Mol. Biol. 284: 241-254.
Satchwell, S.C., Drew, H.R., and Travers, A.A. 1986. Sequence periodicities in chicken nucleosome core DNA. J. Mol. Biol. 191: 659-675.
Segal, E., Fondufe-Mittendorf, Y., Chen, L., Thastrom, A., Field, Y., Moore, I.K., Wang, J.P., and Widom, J. 2006. A genomic code for nucleosome positioning. Nature 442: 772-778.
Seligmann, H. and Pollock, D.D. 2004. The ambush hypothesis: Hidden stop codons prevent off-frame gene reading. DNA Cell Biol. 23: 701-705.
Shpaer, E.G. 1985. The secondary structure of mRNAs from Escherichia coli: Its possible role in increasing the accuracy of translation. Nucleic Acids Res. 13: 275-288.
Stormo, G.D. 2000. DNA binding sites: Representation and discovery. Bioinformatics 16: 16-23.
Trifonov, E.N. 1989. The multiple codes of nucleotide sequences. Bull. Math. Biol. 51: 417-432.
Troyanskaya, O.G., Arbell, O., Koren, Y., Landau, G.M., and Bolshoy, A. 2002. Sequence complexity profiles of prokaryotic genomic sequences: A fast algorithm for calculating linguistic complexity. Bioinformatics 18: 679-688.
Wagner, A. 2005a. Energy constraints on the evolution of gene expression. Mol. Biol. Evol. 22: 1365-1374.
Wagner, A. 2005b. Robustness and evolvability in living systems. Princeton University Press, Princeton, N.J.
Wan, H. and Wootton, J.C. 2000. A global compositional complexity measure for biological sequences: AT-rich and GC-rich genomes encode less complex proteins. Comput. Chem. 24: 71-94.
Woese, C. 1998. The universal ancestor. Proc. Natl. Acad. Sci. 95: 6854-6859.
Woese, C.R. 1965. Order in the genetic code. Proc. Natl. Acad. Sci. 54: 71-75.
Zuker, M. and Stiegler, P. 1981. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res. 9: 133-148.
Received September 22, 2006; accepted in revised form November 29, 2006.