Vol. 11, Issue 4, 540-546, April 2001
LETTER
| |
ABSTRACT |
|---|
|
|
|---|
We examined dinucleotide relative abundances and their biases in recent sequences of eukaryotic genomes and chromosomes, including human chromosomes 21 and 22, Saccharomyces cerevisiae, Arabidopsis thaliana, and Drosophila melanogaster. We found that dinucleotide relative abundances are remarkably constant across human chromosomes and within the DNA of a particular species. The dinucleotide biases differ between species, providing a genome signature that is characteristic of the bulk properties of an organism's DNA. We detail the relations between species genome signatures and suggest possible mechanisms for their origin and maintenance.
| |
INTRODUCTION |
|---|
|
|
|---|
The recent sequencing of the complete genomes of
Saccharomyces cerevisiae, Caenorhabditis elegans and
Drosophila melanogaster, along with human chromosomes 21 and
22 and chromosomes 2 and 4 of Arabidopsis thaliana, provides
new opportunities for studying higher eukaryote genome organization
(C. elegans Sequencing Consortium 1998; Dunham et al. 1999; Lin et al. 1999;
Mayer et al. 1999; Adams et al. 2000; Hattori et al. 2000). Every genome has
a unique signature based on dinucleotide relative abundances (Karlin and
Ladunga 1994). This genome signature is
a characteristic of the genome as a whole and does not depend on
knowledge of individual genes or alignment of homologous sequences.
Instead, it reflects the response of the whole genome to overall
selective pressures, operating through limits on compositional and/or
structural variations in DNA. It is essentially constant in both coding
and noncoding sequences and is independent of renaturation fraction
(G + C isochores) and of base compositional fractions (Russell et al. 1976;
Russell and Subak-Sharpe 1977). The mechanisms that determine and
maintain the signature are not understood, but they could involve DNA
replication and repair mechanisms and biases in DNA modification
processes. They can operate on the whole genome through DNA structure
(e.g., base-step stacking energies and DNA conformational tendencies), context-dependent mutation, and DNA methylation patterns (for review,
see Karlin 1998).
Dinucleotide Relative Abundances
The dinucleotide relative abundance is defined as

$$\rho^*_{XY} = \frac{f^*_{XY}}{f^*_X \, f^*_Y},$$

where f*X is the frequency of nucleotide X and f*XY is the frequency of
dinucleotide XY, both computed from the sequence concatenated with its
inverted complement. ρ*XY
measures the abundance of dinucleotides relative to what would be
expected from the component base frequencies. Hence, ρ* (actually
ρ* - 1) can also be referred to as the dinucleotide bias.
The vector of ρ* values constitutes the genome signature. In
practice, a given sequence is split into equal (typically 50-kb) segments and the signature is calculated for each. Distributions of
ρ* values for the 50-kb segments can be compared with each other
within a species or between different species. Thus, it can be judged
which dinucleotide pairs are relatively over- or underrepresented in
the genome. Theoretical and empirical studies indicate that if the
dinucleotide XY has a mean ρ*XY ≤ 0.78, then XY is
significantly underrepresented (suppressed), whereas ρ*XY ≥ 1.23 indicates
overrepresentation. Corresponding expressions can be constructed for
tri- and tetranucleotide relative abundances but add little additional
information, suggesting that DNA conformational stacking arrangements
are determined mainly through the dinucleotide base-step configurations.
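As an illustration of the definition above, the following is a minimal Python sketch and not the authors' code. It assumes the usual symmetrized form, in which base and dinucleotide counts are pooled over the sequence and its inverted complement. The names `rho_star` and `classify` are hypothetical, and the cutoffs 0.78 and 1.23 are the thresholds quoted in the text.

```python
from collections import Counter

BASES = "ACGT"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the inverted complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def rho_star(seq):
    """Symmetrized relative abundances rho*_XY = f*_XY / (f*_X * f*_Y).

    Counts are pooled over the sequence and its inverted complement.
    Characters other than A, C, G, T (for example N) are skipped.
    """
    seq = seq.upper()
    base_counts = Counter()
    dinuc_counts = Counter()
    for strand in (seq, reverse_complement(seq)):
        base_counts.update(b for b in strand if b in BASES)
        for x, y in zip(strand, strand[1:]):
            if x in BASES and y in BASES:
                dinuc_counts[x + y] += 1
    n_bases = sum(base_counts.values())
    n_dinucs = sum(dinuc_counts.values())
    if n_bases == 0 or n_dinucs == 0:
        raise ValueError("sequence contains no usable dinucleotides")
    rho = {}
    for x in BASES:
        for y in BASES:
            expected = (base_counts[x] / n_bases) * (base_counts[y] / n_bases)
            observed = dinuc_counts[x + y] / n_dinucs
            # Undefined when one of the component bases never occurs.
            rho[x + y] = observed / expected if expected > 0 else float("nan")
    return rho


def classify(value, low=0.78, high=1.23):
    """Label a rho* value using the cutoffs quoted in the text."""
    if value <= low:
        return "underrepresented"
    if value >= high:
        return "overrepresented"
    return "normal"
```

Because counts are pooled over both strands, the 16 values returned contain only 10 distinct components (for example, the AC and GT entries are always equal), which matches the 10-component signature vector described in the text.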
The genome signature is highly invariant across the DNA of an organism
and is similar for closely related species. Strong support for the
invariance of the signature within species comes from both sequence
analysis and experimental studies of nearest-neighbor frequencies,
which have shown that the set of dinucleotide relative abundance values
for 50-kb DNA contigs is a characteristic of an organism's DNA and
distinguishes it from other species (Russell et al. 1976; Russell and
Subak-Sharpe 1977; Karlin and Burge 1995; Karlin 1998).
ρ*XY Distributions Across Species
ρ*XY values were determined for each sample. For every organism, one obtains a list of
ρ* values for each 50-kb sample for all dinucleotides XY.
These are plotted as histograms of ρ* values for each dinucleotide
in Figure 1, which compares the
distributions for human, S. cerevisiae, D. melanogaster, C. elegans, and A. thaliana. The distributions are all
homogeneous within species and distinctly different between species.
Histograms are superior to simple variance statistics. Individual ρ*
values do not discriminate between, for example, yeast and
Arabidopsis, between mouse and human, between the protists Plasmodium falciparum and Trypanosoma brucei, or
among most prokaryotes. The whole genome signature vector (10 components) does discriminate these cases.
ρ*GC. Human DNA has higher relative abundances
of CC/GG, AG/CT, and CA/TG dinucleotides than the other species, but
none of these dinucleotide pairs is significantly biased.
ρ*CA/TG is slightly high in human but normal in
Drosophila, yeast, Arabidopsis, and C. elegans. TA is modestly suppressed in all organisms, with human and
C. elegans showing the lowest ρ*TA.
Yeast and Arabidopsis have very similar ρ* values for all
dinucleotides, with generally sharply peaked distributions and low
variance, the exception being CG in Arabidopsis. In contrast,
human, C. elegans, and, to a lesser extent,
Drosophila all exhibit a moderate spread in ρ* values.
AC/GT, AA/TT, and AT relative abundances do not differ much between
species and are all in the normal (unbiased) range of ρ* values.
Human Chromosomes 21 and 22
The recent completion of human chromosomes 21 and 22 makes them particularly interesting sequences to study. Both were partitioned into contiguous 50-kb windows and the
ρ*XY values for each window are plotted across the chromosomes in Figure 2.
One can see immediately that, with minor
exceptions, all dinucleotide biases are clearly invariant both across
and between chromosomes. This is conspicuous in the ρ* values for
CG, GC, GA/TC, AC/GT, and AT. From around position 10 Mb to 25 Mb on
chromosome 21, the AG/CT dinucleotide bias is slightly reduced compared
to the rest of the chromosome and to chromosome 22. In addition, the chromosome-21 TA bias is slightly elevated over this region. The only
other notable variation is around position 13.4 Mb of chromosome 22 in
the 406-kb-long contig NT002447. Closer inspection reveals that a large
portion of this contig (GenBank accession no. AP000536) is dominated by
a 47-kb tandem repeat of an ~50-bp subunit.
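The windowed scan described above can be sketched as follows. This is a minimal illustration and not the authors' pipeline. The function names are hypothetical, the input is assumed to contain only A, C, G, and T, and a trailing partial window is simply dropped. Because CG is its own inverted complement, the symmetrized value reduces to a closed form: f*CG = n_CG / (n - 1) and f*C = f*G = (n_C + n_G) / (2n) for a window of length n.

```python
def windows(seq, size=50_000):
    """Yield (start, subsequence) for contiguous, nonoverlapping full-size windows."""
    for start in range(0, len(seq) - size + 1, size):
        yield start, seq[start:start + size]


def rho_star_cg(window):
    """Symmetrized rho*_CG for one window, using the closed form for a
    self-complementary dinucleotide (assumes only A, C, G, T are present)."""
    w = window.upper()
    n = len(w)
    strong = w.count("C") + w.count("G")
    if n < 2 or strong == 0:
        return float("nan")
    f_cg = w.count("CG") / (n - 1)   # pooled dinucleotide frequency
    f_c = strong / (2 * n)           # pooled frequency of C (equal to that of G)
    return f_cg / (f_c * f_c)


def cg_profile(seq, size=50_000):
    """List of (window start, rho*_CG) pairs along a sequence."""
    return [(start, rho_star_cg(w)) for start, w in windows(seq, size)]
```

Plotting the second element of each pair against the first gives a profile of the kind described for Figure 2; a localized departure, such as the tandem-repeat region noted above, would appear as an outlying window.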
{ρ*XY} Comparisons
ρ*XY values of nonoverlapping 50-kb samples for
each dinucleotide pair in several eukaryotes and for each human
chromosome. Mean ρ*XY values are strongly conserved across all human chromosomes. The ranges are CG, 0.18 to
0.31; GC, 0.96 to 1.02; TA, 0.66 to 0.75; AT, 0.84 to 0.89; CC/GG, 1.22 to 1.24; TT/AA, 1.11 to 1.13; TG/CA, 1.20 to 1.24; AG/CT, 1.15 to 1.24;
AC/GT, 0.82 to 0.86; and GA/TC, 0.98 to 1.00. The largest variation is
in ρ*CG, where the highest value is 0.31 for
chromosome 19, followed by chromosomes 16 and 22 at 0.28. The lowest
values occur for chromosomes Y (0.18), X (0.20), and 18 (0.21). There
is a positive correlation between ρ*XY values
and the CG-island densities implied by in situ fluorescence
hybridization of human chromosomes during metaphase (Cross and Bird
1995
ρ*CG values noted above.
ρ*CG = 0.21 in mouse and 0.18-0.31 for human
chromosomes. CG suppression is usually explained through the
methylation-deamination-mutation hypothesis, whereby methylation of the
cytosine in CG to 5-methylcytosine and subsequent deamination to thymine results, if
unrepaired, in conversion of CG to TG/CA. The methylation hypothesis is
supported by the fact that invertebrates that do not possess a
methylase, such as Drosophila and C. elegans, do not
exhibit significant CG dinucleotide bias
(ρ*CG = 0.92 for Drosophila and 0.96 for C. elegans).
ρ*CG is significantly
low in A. thaliana (0.72) but not in yeast (0.80), concurring
with the occurrence of methylation in dicots such as
Arabidopsis but with its absence from monocots. However, in
human and mouse, TG/CA is only marginally overrepresented
(ρ*TG/CA = 1.20-1.24 and 1.20-1.23,
respectively), in marked contrast to the extreme underrepresentation of
CG. Moreover, CG is underrepresented in all animal mitochondria despite
the lack of methylase activity in mitochondria. There is also no
significant bias in TG/CA in animal mitochondria. This indicates that
although methylation may contribute to vertebrate CG suppression, it
does not fully account for it.
All of the eukaryotes except Plasmodium falciparum (0.99) show
low ρ*TA, ranging from 0.56 in Leishmania
major and 0.62 in C. elegans to 0.75 in
Arabidopsis and Drosophila. TA is the least stable
dinucleotide stacking pair and is prominent in some regulatory signals,
such as the TATA box and 3' polyadenylation signal. Avoidance of
spurious signal sequences and considerations of DNA stability could
both act to suppress overall levels of TA. In coding regions, TA may be
low because UA is disfavored in mRNAs, where it is relatively
susceptible to cleavage by ribonucleases (Beutler et al. 1989).
ρ*GC is high in Drosophila (1.27), whereas C. elegans has high
ρ*TT/AA = 1.28. Mouse shows high ρ*AG/CT (1.25), with human (range 1.15-1.22)
hardly biased. The other most biased dinucleotide abundances are in
Leishmania (ρ*TG/CA = 1.25) and
Plasmodium (ρ*CC/GG = 1.51). Yeast is
unusual among these eukaryotes in having no significantly biased
dinucleotide relative abundances. All yeast ρ* values are in the range
0.8-1.13 except for ρ*TA, which qualifies as marginally underrepresented at 0.77.
Table 2 shows the unsymmetrized
(single-strand) ρ values for CG and TA at different codon
positions, introns, and intergenic regions in human DNA. Both CG and TA
are suppressed in coding and noncoding regions, with TA being less
biased in all cases. Introns and intergenic DNA exhibit stronger CG
suppression than coding sequences but are less biased in TA. This is
consonant with higher substitution rates in noncoding regions, which lack the constraints on amino acid and codon usage that affect
coding sequences. The higher CG usage at codon positions 1,2 compared
to 2,3 or 3,1 probably reflects the fact that in human proteins,
arginine is more frequently coded for by CGN (3.2% of the time) than
by an AGR codon (2.2%). Paradoxically, G is highest at codon position
1 (32%) and C is highest at position 3 (29%), yet CG is highly
suppressed at positions 3,1.
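A codon-position analysis of this kind could be sketched as below. This is a hypothetical illustration, not the procedure behind Table 2: it assumes an in-frame coding sequence, and it assumes that the single-strand ρ at positions (i, j) is the observed frequency of XY at those adjacent positions divided by the product of the frequency of X at position i and of Y at position j. The function name `codon_position_rho` is an invention for this example.

```python
from collections import Counter


def codon_position_rho(cds, x="C", y="G"):
    """Unsymmetrized rho_XY at codon positions (1,2), (2,3) and (3,1).

    `cds` is assumed to be an in-frame coding sequence; any trailing
    partial codon is ignored. Position (3,1) pairs the last base of one
    codon with the first base of the next.
    """
    cds = cds.upper()
    n_codons = len(cds) // 3
    if n_codons == 0:
        raise ValueError("coding sequence shorter than one codon")
    cds = cds[: 3 * n_codons]
    # Base counts at each of the three codon positions.
    position_counts = [Counter(cds[p::3]) for p in range(3)]
    result = {}
    for label, first, second in (("1,2", 0, 1), ("2,3", 1, 2), ("3,1", 2, 0)):
        # Adjacent base pairs whose first base sits at codon position `first`.
        pairs = [cds[i:i + 2] for i in range(first, len(cds) - 1, 3)]
        f_x = position_counts[first][x] / n_codons
        f_y = position_counts[second][y] / n_codons
        if not pairs or f_x == 0 or f_y == 0:
            result[label] = float("nan")
            continue
        observed = pairs.count(x + y) / len(pairs)
        result[label] = observed / (f_x * f_y)
    return result
```

Calling `codon_position_rho(cds, "T", "A")` gives the corresponding TA values. In practice the counts would be pooled over many genes before taking the ratio, since a single short gene gives noisy estimates.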
δ* Comparisons
It is useful to have a measure of the difference between the
signatures of DNA sequences. For this purpose, we use the dinucleotide relative abundance distance, which for sequences p and
q is defined as

$$\delta^*(p, q) = \frac{1}{16} \sum_{XY} \left| \rho^*_{XY}(p) - \rho^*_{XY}(q) \right|,$$

where the sum extends over all 16 dinucleotides. δ* is quoted after multiplying by 1000. The average
distance δ* between random sequences of length 50 kb is then
~10-20. In comparing DNA sequences, the mean δ* value is
found for all pairwise comparisons of 50-kb contigs. This can be done
within a species and between different species. Thus, a matrix of
distances is built up, which is the mean δ* distance between 50-kb
segments from each species or sequence. Extensive testing has shown
that the δ* distance is not distorted by extreme biases in a single
dinucleotide (Karlin and Ladunga 1994).
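The distance and the pairwise averaging described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it takes the distance to be the mean absolute difference of the 16 symmetrized ρ* values, it assumes every segment contains all four bases (otherwise some ρ* values are undefined), and all function names are hypothetical.

```python
from collections import Counter
from itertools import combinations, product

BASES = "ACGT"
DINUCS = ["".join(pair) for pair in product(BASES, repeat=2)]
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def rho_star(seq):
    """Symmetrized rho*_XY, pooling counts over both strands."""
    seq = seq.upper()
    strands = (seq, seq.translate(COMPLEMENT)[::-1])
    bases, dinucs = Counter(), Counter()
    for strand in strands:
        bases.update(b for b in strand if b in BASES)
        dinucs.update(a + b for a, b in zip(strand, strand[1:])
                      if a in BASES and b in BASES)
    n_b, n_d = sum(bases.values()), sum(dinucs.values())
    return {d: (dinucs[d] / n_d) / ((bases[d[0]] / n_b) * (bases[d[1]] / n_b))
            for d in DINUCS}


def delta_star(rho_p, rho_q):
    """Mean absolute difference between two signatures (unscaled)."""
    return sum(abs(rho_p[d] - rho_q[d]) for d in DINUCS) / 16


def mean_delta_between(segments_p, segments_q, scale=1000):
    """Mean delta* over all pairs of segments drawn from two sequences."""
    sigs_p = [rho_star(s) for s in segments_p]
    sigs_q = [rho_star(s) for s in segments_q]
    dists = [delta_star(a, b) for a in sigs_p for b in sigs_q]
    return scale * sum(dists) / len(dists)


def mean_delta_within(segments, scale=1000):
    """Mean delta* over all distinct pairs of segments from one sequence."""
    sigs = [rho_star(s) for s in segments]
    if len(sigs) < 2:
        raise ValueError("need at least two segments")
    dists = [delta_star(a, b) for a, b in combinations(sigs, 2)]
    return scale * sum(dists) / len(dists)
```

With 50-kb segments as input, `mean_delta_within` corresponds to a diagonal entry of the distance matrix and `mean_delta_between` to an off-diagonal entry. The `scale=1000` default mirrors the convention of quoting δ* after multiplying by 1000.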
δ* Comparisons within Species
Within human chromosomes, δ* scores range from 30 in chromosome 7 to 48 in chromosome 11. The range of
δ* between chromosomes is from
30 (chromosome 18 vs. 13) to 54 (19 vs. Y), with 35-45 being typical.
The δ* distance between chromosomes, therefore, is approximately
the same as within chromosomes, despite the differences in base
composition, gene density, and repeat frequencies between them. The
δ* distances between and within Drosophila chromosomes range from 42 to 68. As shown in Figure 3,
the left (L) and right (R) arms of chromosomes 2 and 3 are a close
group, with δ* between 42 and 57. The Drosophila X
chromosome is slightly more variable both within itself (δ* = 68)
and in comparison to the other chromosomes, with a distance of 57 from
2L, 3L, and 3R and 65 from 2R. With the exception of the X chromosome,
these values are similar to human within- and between-chromosome δ*
values. Finally, in C. elegans the six chromosomes exhibit a
range of δ* within themselves from 49 (chromosome 4) to 70 (chromosome 2). Between-chromosome distances are from 51 (chromosome X
vs. 4 and 5) to 70 (3 vs. 2 and X). The δ* values thus exhibit the
same invariance within a species as the dinucleotide ρ* biases.
δ* Comparisons between Species
δ* distances
between the eukaryotes discussed above. P. falciparum
chromosomes 2 and 3, L. major chromosome 1, and the complete
E. coli genome are included for comparison.
δ* = 58), as one would
expect. Arabidopsis and yeast are close (δ* = 45);
surprisingly, their δ* distance from each other is nearly as low as
their mean within-species distances. E. coli is very distant
from human (210), mouse (241), and both protoctists (196, 174). It is
also distant from C. elegans (128), Arabidopsis
(148), and yeast (122). Mysteriously, however, there is moderate
similarity between the signatures of E. coli and D. melanogaster (δ* = 74).
| |
DISCUSSION |
|---|
|
|
|---|
We have confirmed, through our analysis of the current complete
eukaryotic genomes and chromosomes 21 and 22 of human, the constancy
and validity of the genome signature for each species. Signature
comparisons have revealed a number of intriguing relations between
organisms. For example, bacterial phage genome signatures are strongly
correlated with the nature of the host and the extent to which the
phage uses the host-cell machinery (Blaisdell et al. 1996). Both
broad-range and specialized plasmids in prokaryotes share moderate to
close genome signatures with their hosts (Campbell et al. 1999). Although
mammalian mitochondria are close to each other in signature and reflect
relationships parallel to those derived from nuclear DNA, they are not
close to their host nuclear DNA, with typical δ* differences
between 140 and 200 (Karlin and Mrázek 1997).
Among bacteria, there are signature similarities between closely
related species (such as E. coli vs. Salmonella
typhimurium and Streptococcus pyogenes vs.
Lactococcus lactis) but no groupings that can be attributed to
obvious causes such as the environment in which the bacteria live.
Likewise, archaea do not form a coherent clade in terms of their
signature; for example, Halobacterium sp. and methanogens have extremely
different genome signatures. Anomalies in the signature have been used
to detect bacterial pathogenicity islands and laterally transferred
operons in Helicobacter pylori and Mycobacterium
tuberculosis (Karlin 1998) and in Neisseria meningitidis,
Vibrio cholerae, Campylobacter jejuni, and E. coli (data
not shown). Unmethylated CG shows normal dinucleotide bias in most
proteobacteria and can provoke an immune response in mammals (Krieg et
al. 1998). CG is also suppressed in most small (<30 kb in length)
vertebrate viral genomes, except for a few togaviruses (Karlin et al.
1994). Another intriguing result is that the signature of mammalian
retroviruses shows moderate similarity to that of the nuclear DNA into which
they integrate, with δ* in the range 70-90 (data not shown). This
might have resulted from the processing of the viral genetic program by
the host-cell machinery or a selective shift in the viral genome toward
a genome signature that is more compatible with the host.
There are a number of unanswered questions concerning the nature of the
genome signature. The homogeneity of the signature is clearly
maintained by processes that operate at the scale of the whole genome.
However, it is not known if the signature corresponds to a frozen event
or if it is a dynamical feature of a genome that changes over time,
albeit slowly. How did the signature arise for a given genome and how
fast can it change? Many DNA repair enzymes recognize the shape of the
DNA molecule rather than specific sequences (Echols and Goodman 1991;
Kunkel 1992). Stacking energies, charge interactions, and
conformational tendencies all bear on local DNA structure and thus
influence the intrinsic curvature of DNA (Bolshoy 1995). In addition,
the efficiency of DNA repair is affected by neighboring-base context.
| |
METHODS |
|---|
|
|
|---|
Data
The human, mouse, A. thaliana, P. falciparum, L. major, C. elegans, S. cerevisiae, and E. coli sequences were
acquired from GenBank. Except for chromosomes 21 and 22, sequence sets
for human chromosomes were produced using the lists of contigs
maintained by the Computational Biosciences Section at Oak Ridge
National Laboratory. Only contigs ≥50 kb in length were used. The
complete D. melanogaster genome was obtained from the Gadfly
database maintained by the Berkeley Drosophila Genome Project.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| |
NOTE ADDED IN PROOF |
|---|
The ρ* values for human chromosomes are essentially unchanged when
calculated across the recently released draft sequence of the complete
human genome (International Human Genome Sequencing Consortium 2001).
| |
FOOTNOTES |
|---|
1 E-MAIL karlin{at}math.stanford.edu; FAX (650) 725-2040.
Article and publication are at www.genome.org/cgi/doi/10.1101/gr.163101.
| |
REFERENCES |
|---|
|
|
|---|
Received August 30, 2000; accepted in revised form February 5, 2001.