Vol. 10, Issue 2, 228-236, February 2000
LETTER
| |
ABSTRACT |
|---|
|
|
|---|
When transcription is to the right of the promoter, the "top," mRNA-synonymous strand of DNA tends to be purine-rich. When transcription is to the left of the promoter, the top, mRNA-template strand tends to be pyrimidine-rich. This transcription-direction rule suggests that there has been an evolutionary selection pressure for the purine-loading of RNAs. The politeness hypothesis states that purine-loading prevents distracting RNA-RNA interactions and excessive formation of double-stranded RNA, which might trigger various intracellular alarms. Because RNA-RNA interactions have a distinct entropy-driven component, the pressure for the evolution of purine-loading might be greater in organisms living at high temperatures. In support of this, we find that Chargaff differences (a measure of purine-loading) are greater in thermophiles than in nonthermophiles and extend to both purine bases. In thermophiles the pressure to purine-load affects codon choice, indicating that some features of their amino acid composition (e.g., high levels of glutamic acid) might reflect purine-loading pressure (i.e., constraints on mRNA) rather than direct constraints on protein structure and function.
| |
INTRODUCTION |
|---|
|
|
|---|
Duplex DNA can be represented as two horizontal lines, representing "top" (5' → 3') and "bottom" (3' → 5') strands. When transcription is to the right of the promoter, the top, mRNA-synonymous strand tends to be purine-rich. When transcription is to the left of the promoter, the top, mRNA-template strand tends to be pyrimidine-rich (Szybalski et al. 1966; Smithies et al. 1981). It follows that mRNAs, whatever the direction of their transcription, tend to be purine-rich (Bell et al. 1998; Dang et al. 1998). Usually one of the two purine bases is most heavily involved in purine-loading, and the other may appear indifferent. For organisms of low genomic (C+G)% the purine is usually A. For organisms of high (C+G)% the purine is usually G. The extra purines are located in the loop regions of computer-folded mRNA structures (Bell and Forsdyke 1999a,b).
To explain the phenomenon of purine-loading, it was pointed out that the physical and chemical state of the "crowded" cytosol (Fulton 1982; Forsdyke 1995) is probably adapted to facilitate a reaction of fundamental importance: tRNA-mRNA interaction. This connection between genotype and phenotype (mRNA translation) must occur rapidly and with high specificity. If cytosolic conditions were such as to optimize this process, then there would be an increased probability not only of efficient tRNA-mRNA interactions, but also of efficient mRNA-mRNA interactions. These would initiate by way of "kissing" between the loops of folded RNAs (Eguchi et al. 1991). Such interactions might (1) directly impede protein synthesis, and (2) generate double-stranded RNA segments of lengths sufficient to trigger various intracellular alarms (Cristillo 1998; Fire 1999; Forsdyke 1999a; A.D. Cristillo, T.P. Lillicrap, and D.R. Forsdyke, unpubl.). Thus, there would have been a selection pressure for mRNAs to be "polite" (Zuckerkandl 1986) and avoid unnecessary interactions. This would have been achieved, wherever compatible with other mRNA functions, by loading loops of all mRNAs with non-Watson-Crick pairing bases (e.g., all purines or all pyrimidines).
Exploratory kissing interactions between hybridizing nucleic acids involve transient stacking interactions (Eguchi et al. 1991), with the exclusion of structured water. Such reactions have a strong entropy-driven component (Cantor and Schimmel 1980) and so might increase as temperature increases (Lauffer 1975). Hence, perhaps
counterintuitively, nucleic acids should be more "sticky" at high
temperatures and the selection pressure to avoid formation of
double-stranded RNA should be greater. To examine this, we compare the
magnitude (evaluated as Chargaff differences) and range (one or both
purines) of purine-loading in the genomes of thermophilic bacteria with
those of the genomes of mesophilic bacteria, which normally exist at
37°C or lower temperatures. We also examine the extent to which the
pressure to purine-load has affected codon choice, and hence,
potentially, protein composition and function.
| |
RESULTS |
|---|
|
|
|---|
Large Chargaff Differences in Methanococcus jannaschii
Figure 1 shows a plot of Chargaff differences (%) along a segment
of the genome of the thermophilic bacterium Methanococcus jannaschii. In previous studies of
nonthermophilic organisms, such plots had to be carefully examined to
see whether the S or the W bases were the best predictors of
transcription direction in accordance with Szybalski's transcription
direction rule. Chargaff differences were seldom >20% (especially
Chargaff differences for the S bases), and simply plotting the ratio of
purines to pyrimidines (R/Y) was not particularly informative (Bell et al. 1998; Dang et al. 1998; Bell and Forsdyke 1999a,b). In contrast, for M. jannaschii it is observed that (1) both purines follow Szybalski's transcription direction rule, (2) the magnitude of the Chargaff differences is often >20% (especially Chargaff differences for the S bases), and (3) the R/Y ratio strongly correlates with transcription direction.
Even a small rightward-transcribed ORF (in the 15.5- to 15.8-kb region), which is surrounded by leftward-transcribed ORFs, is detectable as a small dip in the S-base plot (G>C), although the W-base plot is not affected. No ORF has been reported at the beginning of the sequence (1- to 2-kb region), but the curve pattern (A>T; G>C) predicts that any ORFs found here should be transcribed to the right of the promoter.
Quadrant Analysis of M. jannaschii
To show that these features of a 20-kb segment (Fig. 1) are typical
of the whole genome, the two Chargaff differences (for the W and S
bases) were plotted against each other to generate quadrant plots (Bell et al. 1998) for leftward-transcribed ORFs (Fig. 2a)
and for rightward-transcribed ORFs (Fig. 2b). Each point in such plots
represents a 1-kb window. Because M. jannaschii is 1.66 Mb and
windows are taken at 0.1-kb intervals, there are several thousand
points in each plot. For windows whose centers overlap
leftward-transcribed ORFs, most points indicate enrichment both in T
and C, implying that the corresponding mRNA synonymous strands would be
enriched both in A and G. For windows whose centers overlap
rightward-transcribed ORFs, most points indicate enrichment both in A
and G, again implying that the mRNA synonymous strands are enriched
both in A and G.
As in Figure 1, Chargaff differences for the S bases are generally greater than those for the W bases (e.g., in Fig. 2b points are generally farther to the left of the ordinate, indicating G-richness, than they are above the abscissa, indicating A-richness). Although widely scattered, the points fit linear regressions sloping downward, so that, as suggested by Figure 1, it is likely that windows enriched in one pyrimidine are also enriched in the other (Fig. 2a) and that windows enriched in one purine are also enriched in the other (Fig. 2b).
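The quadrant assignment described above can be sketched in a few lines. This is an illustration only: the function name `quadrant` and the example window values are hypothetical, and the two inputs are assumed to be the W-base difference (A − T) and the S-base difference (C − G) of a single 1-kb window.

```python
from collections import Counter

def quadrant(w_diff, s_diff):
    """Name the W base and S base enriched in one window.

    w_diff is the A - T difference and s_diff is the C - G difference,
    following the alphabetical subtraction order used in the Methods.
    A value of exactly zero is reported as "-" (no enrichment).
    """
    w_base = "A" if w_diff > 0 else "T" if w_diff < 0 else "-"
    s_base = "C" if s_diff > 0 else "G" if s_diff < 0 else "-"
    return w_base + s_base

# Hypothetical (w_diff, s_diff) pairs for windows over rightward-transcribed ORFs.
example_windows = [(9.0, -21.0), (12.5, -30.0), (-1.0, -8.0)]
tally = Counter(quadrant(w, s) for w, s in example_windows)
```

Windows over rightward-transcribed ORFs are expected to fall mostly in the "AG" quadrant, and windows over leftward-transcribed ORFs mostly in the "TC" quadrant.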
Large Chargaff Differences in Other Thermophiles
The extreme purine-loading found for the entire genome of M. jannaschii was also found for large segments of the genomes of two
other thermophilic bacteria (Table 1). The
thermophilic bacteria strictly comply with Szybalski's transcription
direction rule for both Watson-Crick base pairs, and the magnitude of
the differences is generally greater than in the case of the
nonthermophilic bacteria. The latter sometimes do not comply with
respect to one Watson-Crick base pair (for the Escherichia
coli segment, rightward-transcribed ORFs are slightly T-rich; for
the Haemophilus influenzae segment, leftward transcribed ORFs
are slightly A-rich).
The difference between thermophiles and nonthermophiles is more readily appreciated when the four Chargaff difference values (base excesses) for each organism are combined to provide an index of the purine-loading of the corresponding RNAs. For the thermophiles the index is simply the sum of the four absolute Chargaff differences, as the transcription rule is followed in each of the four cases. For the nonthermophiles, in cases where the transcription rule is not followed, the corresponding values are subtracted from the overall sum. On average, the three thermophiles show more purine-loading than the three nonthermophiles (Table 1). Mean purine-loading in thermophiles [(153 + 208 + 288)/3 ≈ 216] exceeds that of nonthermophiles [(42 + 64 + 64)/3 ≈ 57] by 160 bases/kb window (P = 0.04; paired t-test with 2 df).
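The comparison of the six indices can be checked with a short calculation. The sketch below uses only the index values quoted in the text. Pairing the organisms in the order listed is an assumption, since the text does not state how the pairs were formed. The P value uses the closed form of Student's t distribution for exactly 2 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev

thermophiles = [153, 208, 288]    # purine-loading indices quoted in the text
nonthermophiles = [42, 64, 64]

# Assumption: organisms are paired in the order listed.
diffs = [t - n for t, n in zip(thermophiles, nonthermophiles)]
mean_diff = mean(diffs)

# Paired t statistic: mean difference divided by its standard error.
t_stat = mean_diff / (stdev(diffs) / sqrt(len(diffs)))

# Two-tailed P for Student's t with 2 df (valid for t_stat > 0):
# CDF(t) = 1/2 + t / (2 * sqrt(2 + t^2)), so P = 1 - t / sqrt(2 + t^2).
p_two_tailed = 1 - t_stat / sqrt(2 + t_stat ** 2)
```

Under this pairing the mean difference rounds to 160 bases/kb window and the two-tailed P rounds to 0.04, in agreement with the values reported.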
More Purine-Loading at Low (C+G)%
For thermophiles the purine-loading index shows an inverse linear relationship to (C+G)%, with a slope that is significantly different from zero (r² = 0.994; P = 0.05). The relationship reflects purine-loading with G residues more than with A residues [i.e., with decreasing (C+G)% there is an increasing tendency for C-excess in leftward-transcribed ORFs and for G-excess in rightward-transcribed ORFs]. Thus, although there are fewer S bases available to support these excesses when the (C+G)% is low, those present are more readily utilized for the purine-loading function (i.e., they are likely to be locally unpaired in loops, rather than in stems). Whereas in their absolute numbers, the S bases and the W bases contribute about equally to purine-loading in thermophilic bacteria, usually one class dominates in the nonthermophilic bacteria.
Genome-Wide Distribution of Purine-Loading
That large segments of genomes are likely to be representative of
the entire genomes with respect to purine-loading was shown for M. jannaschii. The genome was divided into six segments of approx.
276.5 kb each. At this level of resolution, there are only minor
fluctuations between segments in (C+G)%, Chargaff differences, and
purine-loading indices (Table 2). For example, for
rightward-transcribed ORFs, the genomic average Chargaff difference
(%) for the W bases is 9.92 ± 0.09 (i.e., A>T; Table 1), and
corresponding values for the six segments range from 8.46 to 10.57.
The genome of M. jannaschii shows a remarkable symmetry between leftward and rightward ORFs in the actual numbers of S and W bases contributing to purine-loading (Table 1). Thus, the average 1-kb window has pyrimidine excesses of 68 (T) and 76 (C) for leftward-transcribed ORFs, and purine excesses of 68 (A) and 76 (G) for rightward-transcribed ORFs. This tendency toward symmetry also occurs in the six segments (Table 2) and is precise in segment 2.
Influence of the Origin of Replication?
Although leftward and rightward ORFs are covered by approximately
equal numbers of overlapping 1-kb windows (7494 and 8202, respectively), the distribution varies (Table 2). The number of windows
corresponding to leftward ORFs tends to decrease with segment number,
whereas the number of windows corresponding to rightward ORFs tends to
increase with segment number. Because the M. jannaschii
chromosome is circular, a sharp switch is indicated between segments 6 and 1 from a predominantly purine-rich top strand (reflecting an excess
of windows covering rightward ORFs in segment 6) to a predominantly
pyrimidine-rich top strand (reflecting an excess of windows covering
leftward ORFs in segment 1). This probably relates to the origin of
replication (site not currently known), as when windows much greater
than 1 kb are used for Chargaff difference analysis (thus tending to
obscure the local effects of individual ORFs), the results can provide
a guide to the position of the origin of replication (Smithies et al. 1981; Lobry 1996).
Same Optimum Window for Thermophiles and Nonthermophiles
In previous studies the optimum window (1 kb) for examining Chargaff
differences was determined in a range of organisms by comparing natural
with the corresponding shuffled sequences (Bell et al. 1998; Dang et al. 1998; Bell and Forsdyke 1999a). A similar study of the three
thermophile genomes considered above indicated that a 1-kb window would
also be optimum in these organisms. For example, for Figure 3 absolute
Chargaff difference values were plotted against window size in a
276.5-kb segment of M. jannaschii. The
difference between values of points on the curves for the natural and
shuffled sequences reaches an optimum at a window size of ~1 kb. As
with the nonthermophiles, the ratio of the values on these curves
reaches an optimum at higher window sizes. It was concluded that it was
appropriate to use 1-kb windows when comparing the Chargaff differences
of thermophile and nonthermophile genomes.
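The comparison of natural with shuffled sequences can be sketched as follows. This is not the program used in the study. The function names are hypothetical, only the W-base difference is computed, and shuffling the whole sequence once with a fixed seed is an assumption about how the shuffled control was made.

```python
import random

def mean_abs_w_difference(seq, window, step=None):
    """Mean absolute (A - T)/W, in percent, over sliding windows."""
    step = step or max(1, window // 10)   # default step: one tenth of the window
    values = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        a, t = chunk.count("A"), chunk.count("T")
        if a + t:                          # skip windows with no W bases
            values.append(abs(100.0 * (a - t) / (a + t)))
    return sum(values) / len(values) if values else 0.0

def natural_vs_shuffled(seq, window_sizes, seed=0):
    """Return {window: (natural_value, shuffled_value)} for each window size."""
    rng = random.Random(seed)
    bases = list(seq)
    rng.shuffle(bases)                     # destroys base order, keeps composition
    shuffled = "".join(bases)
    return {w: (mean_abs_w_difference(seq, w),
                mean_abs_w_difference(shuffled, w))
            for w in window_sizes}
```

Plotting the difference between the two values against window size would then show where the natural sequence departs most from its shuffled counterpart.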
Purine-Loading Affects Codon Choice
The strong pressure on thermophiles to purine-load their mRNAs, as
revealed by Chargaff difference analysis, suggested that choice of
synonymous codons might provide an independent measure of this
evolutionary force. Furthermore, thermophiles show some regularities in
amino acid compositions (Kagawa et al. 1984; Deckert et al. 1998; Jaenicke and Bohm 1998), raising the possibility that the choice of
nonsynonymous codons might also be affected.
Table 3 shows data for some amino acids corresponding to purine-rich
codons. In the case of glycine (4 codons) and
arginine (6 codons) the opportunity is provided for choice of
synonymous codon, and in both cases purine-rich codons are preferred by
thermophiles. Thus, thermophiles prefer GGR over GGY, and
nonthermophiles prefer GGY over GGR. This suggests a selection pressure
acting at the nucleic acid level, rather than at the protein level.
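The GGR versus GGY preference for glycine can be tallied directly from an ORF read in frame. The following is a minimal sketch with a hypothetical function name. It assumes the ORF is given as the mRNA-synonymous DNA strand, starting at the first base of the first codon.

```python
def glycine_codon_choice(orf):
    """Count purine-ending (GGR) and pyrimidine-ending (GGY) glycine codons."""
    orf = orf.upper()
    counts = {"GGR": 0, "GGY": 0}
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(0, len(orf) - 2, 3):
        codon = orf[i:i + 3]
        if codon[:2] == "GG":
            if codon[2] in "AG":      # R = purine in the third position
                counts["GGR"] += 1
            elif codon[2] in "CT":    # Y = pyrimidine in the third position
                counts["GGY"] += 1
    return counts
```

For example, the toy ORF ATG GGA GGG GGC TAA contains two GGR codons and one GGY codon.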
There is a significant increase in codons for glutamic acid in thermophiles, consistent with a selection pressure on nonsynonymous codons. However, although there is a suggestion of a similar pressure for arginine, it is not found for aspartic acid or lysine. The thermophile Archaeoglobus fulgidus is very similar to the other thermophiles (Table 3), indicating that in future work, purine-loading might also be demonstrable by the Chargaff difference method in this organism.
| |
DISCUSSION |
|---|
|
|
|---|
An Adaptation for Survival at High Temperatures
It is likely that Szybalski's transcription direction rule is a consequence of purine-loading (Bell and Forsdyke 1999b). Under this assumption, the difference between thermophilic and nonthermophilic bacteria in (1) the extent of their compliance with Szybalski's transcription direction rule (Table 1), and (2) their choice of purine-rich codons (Table 3) suggests that adaptation for survival at high temperatures requires that mRNAs be more heavily purine-loaded. Consistent with this, in a survey of 12 chloroplast genomes (R.J. Rasile and D.R. Forsdyke, unpubl.) we find that the genome of the thermophile Cyanidium caldarium has a greater purine-loading index (108 bases/kb) than those of 11 nonthermophilic organisms (average 41.1 ± 6.2 bases/kb). An extreme example of the latter is the genome of chloroplasts of the parasitic plant Epifagus virginiana, in which the chloroplasts are degenerate so that pressure on mRNAs to be polite is likely to be decreased (purine-loading index only 22 bases/kb).
The Politeness Hypothesis
The Gibbs free energy equation ($\Delta G = \Delta H - T\Delta S$) implies that with increasing temperature, reactions with a significant entropy-driven component can occur more readily (Lauffer 1975). Because the base-pairing involved in RNA-RNA interactions has a considerable entropic component (Cantor and Schimmel 1980), such interactions, whether desirable or undesirable, should be favored at high temperatures. Thus, if chemically and biologically feasible, it is possible that RNA sequences would have adapted to avoid undesirable interactions while not impairing desirable ones. Purine-loading would seem to achieve this. In general, mRNAs "drive" politely on the purine "side of the road," and thermophile mRNAs appear excessively polite. The politeness is not trivial (contrast the "polite DNA" of Zuckerkandl 1986). Driving on the correct side of the road is conducive to efficient highway operation. Failure to do so might be lethal if it led to the formation of dsRNA and the false triggering of intracellular alarms.
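The temperature dependence invoked here can be written out explicitly. The relation below is the standard thermodynamic identity at constant pressure. The sign of the entropy change is taken from the argument in the text (a net entropy gain, attributed to the exclusion of structured water), not from any measurement reported here.

```latex
\Delta G = \Delta H - T\,\Delta S
\qquad\Longrightarrow\qquad
\left(\frac{\partial\,\Delta G}{\partial T}\right)_{P} = -\,\Delta S
```

Thus, for an interaction with $\Delta S > 0$, $\Delta G$ becomes more negative as $T$ rises, so that the interaction is increasingly favored at high temperatures.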
Exceptions to Szybalski's Rule
There are exceptions to Szybalski's rule (Cristillo 1998; Bell and Forsdyke 1999b; A.D. Cristillo, T.P. Lillicrap, and D.R. Forsdyke, unpubl.). Certain viruses with a prolonged period of clinical latency
load their mRNAs with pyrimidines. Thus, whereas human immunodeficiency
virus 1 (moderately committed to latency) is polite (mRNAs heavily
purine-loaded), human T-cell leukemia virus 1 (profoundly committed to
latency) is extremely impolite (mRNAs heavily pyrimidine-loaded).
Epstein-Barr virus (like many other herpes viruses) is also profoundly
committed to latency and pyrimidine-loads most of its RNAs. However,
unlike human T-cell leukemia virus 1, Epstein-Barr virus has an
important transcript expressed in all forms of latency, the transcript
encoding the EBNA-1 antigen. Remarkably, this transcript is polite, its
purine-loading being amplified by inclusion of a simple-sequence
element encoding, by preferential employment of purine-rich codons, a
glycine-alanine repeat that can be removed without greatly affecting
EBNA-1 function. Despite this, much current work is based on the
premise that in the compact virus genome, the simple-sequence element
persists because of a function at the protein level rather than at the nucleic acid level (for references, see Lee et al. 1999).
Purine-Loading Might Affect the Composition of Proteins
Conventional natural selection provides an extrinsic pressure on the
phenotype and so determines which genotypes will survive. However, it
has long been recognized that genomes are also molded by intrinsic
forces (Romanes 1886; Forsdyke 1999b,c). One such force is a component of the base composition of DNA, (C+G)%. This can affect protein composition and, hence, possibly the phenotype (Sueoka 1961; Ball 1973; Grantham 1980; Bronson and Anderson 1994; Forsdyke 1996, 1998). In
light of the present work it appears that another component of base
composition, R/Y, may also affect protein composition and, hence,
possibly the phenotype.
We have identified purine-loading as an evolutionary force and have
suggested an adaptive basis related to the intrinsic workings of the
cell (Bell et al. 1998; Bell and Forsdyke 1999b; Fire 1999; Forsdyke 1999a). Thermophilic bacteria appear particularly susceptible to this
force, so that the pressure to purine-load RNA in these organisms might
have been powerful enough to affect choice both of synonymous and
nonsynonymous codons. In thermophiles, an increase in the proportion of
glutamic acid (encoded by codons containing only purines) has been
noted (Deckert et al. 1998), and this is confirmed in Table 3. Although it is tempting to believe that all such changes in the amino acid composition of proteins are related to the need to maintain protein stability and function at high temperatures (Kagawa et al. 1984; Jaenicke and Bohm 1998), our results raise the possibility that the
needs of efficient mRNA function at high temperatures might also have
affected the composition of proteins. However, Jaenicke and Bohm (1998)
find it difficult to define what they call "traffic rules" of
thermophilic adaptation in terms of significant differences in amino
acid composition.
Which Purine to Use?
From our initial studies of the purine-loading phenomenon, the
generalization emerged that organisms with high (C+G)% genomes would
preferentially load with G residues, and organisms with low (C+G)%
genomes would preferentially load with A residues (Bell and Forsdyke 1999b). This seemed logical, as organisms with a low (C+G)% would
appear to have less flexibility for loading codons with scarce G
residues (e.g., the Gs would be required for critically placed codons,
which might not match RNA loop regions). To our surprise, the present
work (Table 1) indicates either that this generalization is not valid
or that thermophiles are a special case. Perhaps because the strength
of bonding between the S bases is greater than that between the W
bases, it may be the avoidance of C residues as much as the inclusion
of G residues that generates such large Chargaff differences with
respect to the S bases in thermophiles. The unpaired Gs would locate to
the loop regions of stem-loop secondary structures. In considering
this matter, one should also take into account the fact that nearest
neighbors are of considerable importance in base-pairing interactions
(Turner 1996), and in low (C+G)% DNA an S base is more likely to have a W base nearest neighbor.
| |
METHODS |
|---|
|
|
|---|
Chargaff Difference Analysis
Chargaff's first parity rule for duplex DNA (%A = %T; %C = %G) applies, to a close approximation, to ssDNA (Chargaff's second parity rule; Chargaff 1979). Deviations from parity are referred to as Chargaff differences. The base composition of successive 1-kb windows, moved in steps of 0.1 kb, was assessed as described by Dang et al. (1998). Chargaff differences were either calculated as $(A - T)/W$ and $(C - G)/S$ and expressed as percentages or were expressed directly as positive or negative base excesses ($A - T$, or $C - G$). Here, A, T, C, and G refer to the number of the corresponding base in a window. The direction of subtraction ($A - T$ or $T - A$) is determined alphabetically. W is the sum of the W bases (A + T) and S is the sum of the S bases (C + G).
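The window calculation can be illustrated with a short sketch. This is not the original program (the analyses used GCG programs, Unix scripts, and C++). The function name and its return format are hypothetical. The 1-kb window and 0.1-kb step follow the text.

```python
from collections import Counter

def chargaff_differences(seq, window=1000, step=100):
    """Yield (start, w_diff_pct, s_diff_pct) for each window of the top strand.

    w_diff_pct is 100 * (A - T) / (A + T); s_diff_pct is 100 * (C - G) / (C + G).
    The subtraction order is alphabetical, as in the Methods.
    """
    seq = seq.upper()
    for start in range(0, len(seq) - window + 1, step):
        counts = Counter(seq[start:start + window])
        a, t, c, g = (counts[base] for base in "ATCG")
        w, s = a + t, c + g
        # Guard against windows that lack W or S bases entirely.
        w_diff = 100.0 * (a - t) / w if w else 0.0
        s_diff = 100.0 * (c - g) / s if s else 0.0
        yield start, w_diff, s_diff
```

A positive W value indicates A > T and a negative S value indicates G > C in that window of the top strand.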
Purine-Loading
Purine-loading of mRNAs is indicated when the regions of top DNA
strands corresponding to leftward-transcribed ORFs are enriched in
pyrimidines and when the regions of top DNA strands corresponding to
rightward-transcribed ORFs are enriched in purines. These enrichments may be assessed at the DNA level as Chargaff differences. Because the
direction of subtraction is determined alphabetically, purine-loading has been supported when Chargaff differences for the W bases ($A - T$) are positive and/or Chargaff differences for the S bases ($C - G$) are negative.
An overall index of the purine-loading of RNA was obtained by summing absolute Chargaff difference values (positive or negative base excesses per 1-kb window) for the pyrimidine excess in leftward-transcribed ORFs and for the purine excess in rightward-transcribed ORFs. In circumstances where Szybalski's transcription direction rule was not followed, the corresponding absolute values were subtracted from the overall total.
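The index can be expressed compactly by giving each of the four base excesses the sign expected under Szybalski's rule. The sketch below uses a hypothetical function name. Its arguments are the mean base excesses per 1-kb window, with subtraction in alphabetical order (A − T and C − G).

```python
def purine_loading_index(left_a_minus_t, left_c_minus_g,
                         right_a_minus_t, right_c_minus_g):
    """Combine four base excesses (bases per 1-kb window) into one index."""
    # Sign expected for each excess when the transcription direction rule holds:
    #   leftward ORFs  (top strand pyrimidine-rich): A - T < 0 and C - G > 0
    #   rightward ORFs (top strand purine-rich):     A - T > 0 and C - G < 0
    terms = [
        (left_a_minus_t, -1),
        (left_c_minus_g, +1),
        (right_a_minus_t, +1),
        (right_c_minus_g, -1),
    ]
    # expected_sign * excess equals +|excess| when the rule is followed
    # and -|excess| when it is not, which implements the subtraction step.
    return sum(expected_sign * excess for excess, expected_sign in terms)

# M. jannaschii excesses quoted in the Results: T 68 and C 76 (leftward ORFs),
# A 68 and G 76 (rightward ORFs).
m_jannaschii_index = purine_loading_index(-68, 76, 68, -76)
```

With the M. jannaschii excesses quoted in the Results, the sketch returns 288, the largest of the three thermophile indices given in the text.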
When both purines contributed to purine-loading, the latter was assessed as the purine/pyrimidine ratio (R/Y; usually expressed as a percentage). In some circumstances codon choice provided a measure.
Sequences
Sequence information refers to the top strand as designated in the GenBank record. Genomic sequences examined were from Aquifex aeolicus (Deckert et al. 1998), A. fulgidus (Klenk et al. 1997), E. coli (Blattner et al. 1997), H. influenzae (Fleischmann et al. 1995), Methanobacterium thermoautotrophicum (Smith et al. 1997), M. jannaschii (Bult et al. 1996), Mycoplasma genitalium (Fraser et al. 1995), and Mycoplasma pneumoniae (Himmelreich et al. 1996).
These sequences were analyzed with programs of the Genetics Computer
Group (GCG, Madison, WI) and our own programs written as Unix scripts
or in C++. Codon usage tables for complete genomes, calculated using
the GCG program CodonFrequency, were obtained from C. Brown (Department
of Biochemistry, University of Otago, Dunedin, New Zealand).
| |
ACKNOWLEDGMENTS |
|---|
We thank Chris Brown for codon usage tables, Jim Gerlach for assistance with computer configuration, Gregory Hill for a program for determining optimum window sizes, and Robert Rasile for data on chloroplast genomes. The National Research Council of Canada, Academic Press, Cold Spring Harbor Laboratory Press, and Elsevier Publishing Corporation gave permission for the inclusion of full-text versions of the relevant papers cited herein at our internet site (http://post.queensu.ca/~forsdyke/homepage.htm).
| |
FOOTNOTES |
|---|
1 Corresponding author.
E-MAIL forsdyke{at}post.queensu.ca; FAX (613) 533-2497.
| |
REFERENCES |
|---|
|
|
|---|
H: Functional analysis and comparative genomics. J. Bacteriol. 179: 7135-7155.
and A fetal globin gene region. Cell 26: 345-353.
Received August 23, 1999; accepted in revised form December 16, 1999.