Vol. 10, Issue 7, 967-981, July 2000
LETTER
ABSTRACT
We examined the abundance of microsatellites with repeated unit
lengths of 1-6 base pairs in several eukaryotic taxonomic groups:
primates, rodents, other mammals, nonmammalian vertebrates, arthropods,
Caenorhabditis elegans, plants, yeast, and other fungi. Distribution of simple sequence repeats was compared between exons, introns, and intergenic regions. Tri- and hexanucleotide repeats prevail in protein-coding exons of all taxa, whereas the dependence of
repeat abundance on the length of the repeated unit shows a very
different pattern as well as taxon-specific variation in intergenic
regions and introns. Although it is known that coding and noncoding
regions differ significantly in their microsatellite distribution, in
addition we could demonstrate characteristic differences between
intergenic regions and introns. We observed a striking relative abundance of (CCG)n·(CGG)n trinucleotide repeats in intergenic regions of all vertebrates, in contrast to the almost complete absence of this motif from introns. Taxon-specific variation could also be detected in the frequency distributions of simple sequence motifs. Our results suggest that strand-slippage theories alone are insufficient to explain microsatellite distribution in the
genome as a whole. Other possible factors contributing to the observed
divergence are discussed.
INTRODUCTION
Microsatellites or simple sequence repeats (SSRs)
are tandemly repeated tracts of DNA composed of 1-6 base pair (bp)
long units. They are ubiquitous in prokaryotes and eukaryotes, present even in the smallest bacterial genomes (Field and Wills 1996; Hancock 1996a). A subset of SSRs, namely trinucleotide repeats, are of great interest because of the role they play in many human neurodegenerative disorders (fragile X syndrome, Huntington's disease, myotonic dystrophy, spinal-bulbar muscular atrophy, spinocerebellar ataxia, etc.; for reviews, see Warren and Nelson 1993; Bates and Lehrach 1994; Reddy and Housman 1997) and in some human cancers, e.g. hereditary nonpolyposis colorectal carcinoma (Wooster et al. 1994; Arzimanoglou et al. 1998). The alteration responsible for these genetic diseases is the expansion of triplet repeats, where the rate of mutation depends on the number of tandem units within the repeat. Hence the term 'dynamic mutation' was coined by Richards and Sutherland (1992).
Microsatellites can be found anywhere in the genome, both in
protein-coding and noncoding regions. Because of their high mutability, microsatellites are thought to play a significant role in genome evolution by creating and maintaining quantitative genetic variation (Tautz et al. 1986; Kashi et al. 1997). In promoter regions, the length of SSRs may influence transcriptional activity (Kashi et al. 1997). Length of polyglutamine or polyproline tracts encoded by SSRs may affect protein-protein interactions involving transcription factors (Gerber et al. 1994; Perutz et al. 1994).
It has been shown that SSRs in exons are less abundant than in
noncoding regions (Hancock 1995), and that different taxa exhibit different preferences for SSR types (Beckmann and Weber 1992; Lagercrantz et al. 1993; Tautz and Schlötterer 1994). Moreover, the overall microsatellite content in the genome correlates with the genome size of the organisms (Hancock 1996b).
SSRs are inherently unstable. Two models have been proposed to explain
microsatellite generation and instability: DNA polymerase slippage and
unequal recombination. The first model involves transient dissociation
of the replicating DNA strands, followed by misaligned reassociation
(Richards and Sutherland 1994). The slipped structure may be stabilized by hairpin, triplex, or quadruplex arrangement of DNA strands (for review, see Pearson and Sinden 1998; Sinden 1999). Thus, it is expected
that those repeats that are able to form such alternative DNA
conformations would be generated more frequently than others. The
possible structures of triplet repeats involved in human diseases have
been studied extensively. The repeats that show a considerable potential to form alternative structures include (CTG)n·(CAG)n, (CCG)n·(CGG)n, (GAA)n·(TTC)n, (AGG)n·(CCT)n, and (TGG)n·(CCA)n (Gacy et al. 1995; Bidichandani et al. 1998; Usdin 1998). However, some sequences with theoretically high hairpin-forming potential [e.g. (CCG)n] show the slowest in vitro slippage rate (Schlötterer and Tautz 1992). Moreover, the rate of alterations is likely to be controlled at multiple steps in vivo. An active role of the DNA mismatch repair system to stabilize simple sequence repeats has been revealed in Escherichia coli, yeast, and humans (for review, see Sia et al. 1997). Although a number of experimental results argue in favor of the above model, homologous recombination may also result in genetic instability of certain SSRs (Jakupciak and Wells 1999).
We can expect that the fixation of de novo-generated SSRs is determined
by the interplay of several factors, of which the repeat type, the
genomic position of the SSR, and the genetic-biochemical background of
the cell are the most important. In our study we addressed the
questions of whether the abundance of various microsatellite types is
similar or not in different taxonomic groups and how SSR frequencies
differ in exons, introns, and intergenic regions. We intended to give a
detailed picture by analyzing all possible (501) SSR motifs, to complement the results of a previous study on primate DNA sequence data (Jurka and Pethiyagoda 1995), and to place them into a comparative evolutionary perspective.
RESULTS
We examined the distribution of perfect SSRs longer than 12 bp, so if
not explicitly stated otherwise, our results described here apply to
microsatellites meeting this criterion. To assess expandability of the
repeats, we also analyzed perfect repeats longer than 24 bp (see
Methods) and compared the results to those obtained using the shorter
cutoff length. Data presented below always refer to duplex DNA, even if
we show only the sequence of the repeated motif on one strand for
simplicity, i.e. notations like AC and (AC)n·(GT)n are equivalent.
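The strand equivalence described above, together with the fact that a tandem repeat can be read starting from any position within its unit, can be captured by reducing each motif to a canonical form. The following Python sketch is illustrative only: the function names (`canonical_motif`, `count_motif_classes`) and the choice of the alphabetically smallest representation are assumptions, not the authors' implementation. Under these assumptions, enumerating all units of 1-6 bp that are not themselves repeats of a shorter unit yields 501 classes, which agrees with the number of motifs quoted in the Introduction.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Return the motif as read on the opposite strand."""
    return motif.translate(COMPLEMENT)[::-1]


def is_primitive(motif):
    """True if the motif is not a tandem repeat of a shorter unit."""
    k = len(motif)
    return not any(
        k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k)
    )


def canonical_motif(motif):
    """Alphabetically smallest rotation over both strands (assumed convention)."""
    candidates = []
    for strand in (motif, reverse_complement(motif)):
        candidates.extend(strand[i:] + strand[:i] for i in range(len(strand)))
    return min(candidates)


def count_motif_classes(max_unit=6):
    """Count distinct duplex motifs for each unit length from 1 to max_unit."""
    counts = {}
    for k in range(1, max_unit + 1):
        units = ("".join(p) for p in product("ACGT", repeat=k))
        counts[k] = len({canonical_motif(u) for u in units if is_primitive(u)})
    return counts


if __name__ == "__main__":
    print(canonical_motif("GT"))   # falls in the same class as AC
    print(canonical_motif("CAG"))  # falls in the same class as AGC
    counts = count_motif_classes()
    print(counts, sum(counts.values()))
```

With this convention, motif names such as AGC (=CAG) and CCG (=CGG) used in the text correspond to the canonical representative of their class.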
The nonoverlapping groups of DNA sequences used in this study will be referred to as taxonomic groups or taxa. These groups represent either individual species (Caenorhabditis elegans and Saccharomyces cerevisiae), or groups of related species such as Primates, Rodentia, and Mammalia. Thus our taxa are defined rather arbitrarily based primarily on sequence availability (see Methods). We carried out the analyses on sequences classified into three genomic regions (intergenic regions, introns, and exons), and on a superset referred to as all sequences. The latter contained all sequence entries that passed the filtering criteria described in Methods, even if they could not be assigned to genomic regions.
To estimate database bias caused by the use of GenBank, we also
included the full sequence of the human chromosome 22 in our study. The
results obtained for chromosome 22 are in good agreement with those for
all primate sequences, confirming the validity of our approach. The
30% increase in total microsatellite content in the full chromosomal
sequence (see the last column of Table 1) is mostly
due to greater abundance of (A + T)-rich repeats, especially
poly(A/T) tracts (Tables 2 and 6).
To assess the contribution of repeated unit length to microsatellite abundance, we calculated the total lengths of all mono-, di-, tri-, tetra-, penta-, and hexanucleotide repeats per megabase pair (Mbp) of DNA sequence (Table 1). In exons, trinucleotide repeats are invariably the most abundant in all taxa, with hexanucleotide repeats being the second most common. Intergenic regions and introns, however, contain more hexanucleotide repeats than exons do, Embryophyta and S. cerevisiae introns being the only exceptions to this rule.
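The quantity used here, total repeat length per Mbp broken down by unit length, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, "longer than 12 bp" is taken as a run of at least 13 bp, and a trailing partial unit is counted as part of the run, which is an assumption the text does not confirm.

```python
from collections import defaultdict


def is_primitive(unit):
    """True if the unit is not itself a repeat of a shorter unit."""
    k = len(unit)
    return not any(k % d == 0 and unit == unit[:d] * (k // d) for d in range(1, k))


def find_perfect_ssrs(seq, min_len=13, max_unit=6):
    """Yield (start, end, unit) for maximal perfect repeats of at least min_len bp.

    A run has period k if every base equals the base k positions before it.
    Runs whose unit is a repeat of a shorter unit are skipped, so (AC)n is
    reported once as a dinucleotide repeat and not again as ACAC or ACACAC.
    """
    seq = seq.upper()
    n = len(seq)
    for k in range(1, max_unit + 1):
        i = 0
        while i + k <= n:
            j = i + k
            while j < n and seq[j] == seq[j - k]:
                j += 1
            unit = seq[i:i + k]
            if j - i >= min_len and is_primitive(unit) and set(unit) <= set("ACGT"):
                yield i, j, unit
            # The comparison at position j failed, so the next run with
            # period k can start no earlier than j - k + 1.
            i = j - k + 1


def density_by_unit_length(seqs, min_len=13, max_unit=6):
    """Total repeat length (bp) per Mbp of sequence, keyed by unit length."""
    total_bp = sum(len(s) for s in seqs)
    repeat_bp = defaultdict(int)
    for s in seqs:
        for start, end, unit in find_perfect_ssrs(s, min_len, max_unit):
            repeat_bp[len(unit)] += end - start
    return {k: bp * 1e6 / total_bp for k, bp in repeat_bp.items()}


if __name__ == "__main__":
    example = "GG" + "AC" * 7 + "TT"
    print(list(find_perfect_ssrs(example)))
    print(density_by_unit_length([example]))
```

Applying `density_by_unit_length` separately to exon, intron, and intergenic sequence sets would give one row per region in the style of Table 1.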
In primates, mononucleotide repeats are the most copious. In introns and intergenic regions they are more than twice as frequent as di- and tetranucleotide repeats. The latter are of similar abundance, and interestingly, much more frequent than trinucleotide repeats. In rodents, repeats with dinucleotide units are about three times more frequent than those with mononucleotides. Dinucleotide repeats are dominant in introns and intergenic regions of many other taxa, except for Primates, Embryophyta, S. cerevisiae, and Fungi. In rodent introns and intergenic regions, the rarity of triplet repeats is also quite pronounced in comparison to di- and tetranucleotide repeats.
The relative abundance of tetranucleotide over trinucleotide repeats in introns and intergenic regions is characteristic of all vertebrate taxa but not of any other taxonomic group studied. In all mammalian taxa, even pentanucleotide repeats are more frequent in introns and intergenic regions than triplet repeats. In invertebrates and fungi, tetranucleotide repeats constitute the least frequent class of microsatellites in introns and intergenic regions, whereas in vascular plants they are as rare as hexanucleotide repeats.
When comparing various taxonomic groups, it is evident that rodents contain many more microsatellites than any other group we examined. C. elegans, however, contains the fewest SSRs per Mbp of DNA, fewer than S. cerevisiae and other fungi.
A more detailed picture could be drawn when we analyzed the
distribution of SSRs by the sequence of the repeated motif. Results obtained for mono-, di-, and trinucleotide repeats are shown in Tables 2-5. The most frequent tetra-, penta-, and hexanucleotide repeats are listed in Tables 6-9. More data are available online in our SSRDB database at http://genetics.elte.hu/ssr.
Mononucleotide Repeats
In general, poly(A/T) tracts are more abundant in each taxon than poly(C/G) sequences (Tables 2-5). This difference is the least characteristic in C. elegans and most pronounced in primates. The total length of mononucleotide repeats, taking together both patterns, is also greatest in primates (Table 1). Nonmammalian vertebrates show the second highest ratio of poly(C/G) to poly(A/T). Besides C. elegans, they constitute the only group where poly(C/G) repeats appear in exons in a proportion comparable to poly(A/T) (Table 5). Intergenic regions show an interesting preference for poly(C/G) over poly(A/T) in C. elegans (Table 3). Introns contain more poly(A/T) than poly(C/G) repeats in each taxon (Table 4).
Dinucleotide Repeats
Dinucleotide repeats are most abundant in rodents and least frequent in fungi (Table 1). Characteristic differences between taxa can only be observed for intergenic regions and introns (Tables 3 and 4) because of the rarity of dinucleotide repeats in exons (Table 5). Curiously, we have found one 16-bp long CG repeat in the protein-coding region of the beta-1 adrenergic receptor gene from Canis familiaris. Otherwise, CG repeats are very rare.
In all vertebrates and arthropods, AC is the most frequent dinucleotide repeat motif (Tables 2-4). C. elegans prefers AG in intergenic regions, AT in introns. In embryophytes, yeast, and fungi, AT repeats are the most frequent in general, except for introns in fungi where AC is more abundant (Table 4).
Trinucleotide Repeats
Trinucleotide repeats can be found in each genomic region with a significant frequency (Tables 2-5). However, the frequency distribution by repeat type shows major differences in various genomic regions and among taxa. In all vertebrates, (G+C)-rich repeats dominate in exons, whereas they are less pronounced in other regions. AAC and AAG are the most frequent repeat types in Embryophyta exons, and an interesting relative abundance of (A+T)-rich repeats can also be observed in the exons of yeast and other fungi.
Generally there is an underrepresentation of ACG and ACT repeats in most taxa. The lack of ACG repeats is worth noting, because the triplet repeat with the same base composition (AGC=CAG) is found much more frequently in all regions. There is also a noticeable excess of AGC repeats in exons compared to introns and intergenic regions. In primates and rodents, CCG constitutes the second most frequent repeat type in exons. CCG repeats are almost totally absent from introns. ACC repeats are relatively infrequent in intergenic regions and introns, with the exception of rodents, where their occurrence exceeds that of ATC repeats.
Apart from these general trends, a relatively unique pattern of distribution can be observed for each taxon. While intergenic CCG repeats are quite significant in all vertebrates, they are underrepresented in other taxa. In sharp contrast with this, there is a lack of CCG repeats in vertebrate introns (Tables 3 and 4). Rodents have a relatively balanced distribution of most triplet repeat types in intergenic regions and introns showing generally higher frequencies than most other taxa. AAT repeats are the most abundant in the introns of primates, vertebrates, arthropods, yeast and other fungi, whereas they come out third after AGG and AAG in rodents. Interestingly, in mammalian introns, AAC turns out to be the most frequent triplet repeat.
Tetranucleotide Repeats
Exons contain almost no tetranucleotide repeats (Tables 1 and 9). Therefore, data can only be evaluated for introns and intergenic regions. The abundance of tetranucleotide repeats in vertebrate introns and intergenic regions exceeds that of trinucleotide repeats. Repeat frequency by type shows a general dependence on the base composition of the repeat unit. Repeats with <50% of G+C are generally more abundant (Tables 6-8). There are, however, a few notable exceptions, e.g. AAGG, which constitutes the second most frequent tetranucleotide repeat in mammals, and the fourth one in primates and rodents. Repeats of the type AAAB, where B denotes any base other than A, are very abundant in primates and rodents. AAAG and AAAT are also highly represented in other mammals.
Pentanucleotide Repeats
In all mammalian taxa, pentanucleotide repeats are at least as abundant as triplet repeats both in introns and intergenic regions (Table 1). They are underrepresented in exons of all taxa, whereas their frequency is comparable to that of trinucleotide repeats in introns and intergenic regions of nonmammalian genomes. In nonvertebrate taxa, they are invariably more frequent than tetranucleotide repeats. Within the whole genome, the most common types always include (A+T)-rich ones, such as AAAAC in primates and rodents, or AAAAT in vertebrates, arthropods, C. elegans, vascular plants, and fungi, as the dominant tract (Tables 6-9). The exclusive dominance of AAAAB type repeats is clear for primates, a bit less striking for rodents, and also occurs in vascular plants and fungi. An interesting finding is that the CpG-containing CCCCG repeat is present in the top 50% of pentanucleotide repeats found in vertebrate intergenic regions.
Hexanucleotide Repeats
Hexanucleotide repeats constitute the second most frequent type after trinucleotide repeats in exons (Table 1). In introns and intergenic regions of nonvertebrate taxa, they are generally more abundant than tetranucleotide repeats, and in C. elegans their density also exceeds that of pentanucleotide repeats.
The repeat motifs present in exons show a great variation and are relatively (G+C)-rich (Table 9). A dominance of (A+T)-rich repeats can be observed in primate, plant, yeast, and fungal introns and intergenic regions (Tables 7 and 8). A few telomere-like repeat motifs are also found, like AACCCT in vertebrates and fungi, or AATCCC in vertebrates and arthropods. Interestingly, AACCCT repeats are present in vertebrate introns and intergenic regions. The presence of the (G+C)-rich ACCCCC motif in the top 50% of simple sequence repeats in introns of rodents and mammals is also noteworthy. Two CpG-containing repeats (AGAGCG and ACACGC) are relatively abundant in mammalian intergenic regions.
Rare Repeats
We could not find in our database subsets any of the following 27 sequence motifs in repeats longer than 12 bp: the pentanucleotide ACGCT, the hexanucleotides AAACGT, AAAGCG, AACGAG, AACGCG, AACGCT, AACGTT, AAGAGT, AAGCGC, ACACCG, ACACTG, ACCGAG, ACGACT, ACGATC, ACGCCT,
ACGCGT, ACGCTC, ACGGCT, ACTAGC, AGATCT, AGCGCT, AGCTCG, ATATCG, ATCGCG,
ATGCGC, CCCGGG, and CCGCGG. It should be noted here that 23 of them
contain the dinucleotide CpG and four of them contain two CpG
motifs. Ten of them are palindromes. Of the four hexanucleotides that do not contain the CpG dinucleotide (AAGAGT, ACACTG, ACTAGC, AGATCT), the first three include the trinucleotide duplex (ACT)·(AGT), and three contain a stop codon in at least one frame.
Considering the cumulative size (>380 Mbp, see Table 10) of the sequences we analyzed, the total absence of a repeat type may well indicate either a sequence disfavored by the mechanism generating repeats or strong selective pressure against repeated occurrence of the particular sequence. The very low frequency of ACT trinucleotide repeats in all sequences is also striking (Table 2). It cannot be explained by the presence of a stop codon on one strand since genomic regions other than exons are also affected.
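The composition statements made about the 27 absent motifs listed above (CpG content and palindromic units) can be checked directly from the motif list. The sketch below is an illustration with hypothetical helper names; it assumes that CpG dinucleotides are counted within the tandemly repeated unit (so a CpG spanning the junction between two units would count), and that "palindrome" means that some rotation of the unit equals its own reverse complement.

```python
RARE_MOTIFS = [
    "ACGCT",
    "AAACGT", "AAAGCG", "AACGAG", "AACGCG", "AACGCT", "AACGTT", "AAGAGT",
    "AAGCGC", "ACACCG", "ACACTG", "ACCGAG", "ACGACT", "ACGATC", "ACGCCT",
    "ACGCGT", "ACGCTC", "ACGGCT", "ACTAGC", "AGATCT", "AGCGCT", "AGCTCG",
    "ATATCG", "ATCGCG", "ATGCGC", "CCCGGG", "CCGCGG",
]

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(motif):
    """Return the motif as read on the opposite strand."""
    return motif.translate(COMPLEMENT)[::-1]


def cpg_count(motif):
    """Number of CpG dinucleotides per unit of the tandem repeat."""
    circular = motif + motif[0]  # include the junction to the next unit
    return sum(circular[i:i + 2] == "CG" for i in range(len(motif)))


def is_palindromic(motif):
    """True if some rotation of the unit equals its own reverse complement."""
    rotations = (motif[i:] + motif[:i] for i in range(len(motif)))
    return any(r == reverse_complement(r) for r in rotations)


if __name__ == "__main__":
    with_cpg = [m for m in RARE_MOTIFS if cpg_count(m) > 0]
    two_cpg = [m for m in RARE_MOTIFS if cpg_count(m) == 2]
    palindromes = [m for m in RARE_MOTIFS if is_palindromic(m)]
    print(len(RARE_MOTIFS), len(with_cpg), len(two_cpg), len(palindromes))
```

Under these assumptions the counts reproduce the figures given in the text: 23 CpG-containing motifs, four with two CpGs, and ten palindromes.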
Repeats Longer than 24 bp
The above results apply to repeats longer than 12 bp. To estimate the instability of the various repeat motifs, we also analyzed repeats longer than 24 bp and defined the expandability of a repeat motif as the total length of repeats longer than 24 bp divided by the total length of repeats longer than 12 bp. The overall distribution of these longer repeats follows trends comparable to those presented above for all repeats considered (data not shown; for details see the SSRDB database at http://genetics.elte.hu/ssr). The contribution of SSRs with different unit lengths is generally similar to that observed for repeats longer than 12 bp, albeit with modified ratios. Mononucleotide repeats are, however, replaced by dinucleotide repeats as the dominant repeat type in primate, plant, and yeast intergenic regions and introns. Although the abundance of the repeats longer than 24 bp is much lower and some motifs are missing, the relative frequencies of various motifs are mostly conserved. An interesting exception is the AAC repeat in the exons of embryophytes, which is much more abundant than AAG at the greater length threshold, although AAG is the most frequent repeat at the shorter threshold (101 bp/Mbp vs. 18 bp/Mbp at the longer threshold, compared with 253 bp/Mbp vs. 317 bp/Mbp at the shorter one, for AAC vs. AAG).
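The expandability ratio defined above can be written as a short function. This is a sketch under the stated definition only; the function name, the argument layout (a list of repeat lengths in bp for one motif), and the choice to return `None` when no repeat exceeds 12 bp are assumptions made for illustration.

```python
def expandability(repeat_lengths, short_cutoff=12, long_cutoff=24):
    """Total length of repeats > long_cutoff divided by total length of repeats > short_cutoff.

    repeat_lengths: lengths (bp) of all perfect repeats found for one motif.
    Returns None if no repeat is longer than short_cutoff.
    """
    total_short = sum(n for n in repeat_lengths if n > short_cutoff)
    total_long = sum(n for n in repeat_lengths if n > long_cutoff)
    if total_short == 0:
        return None
    return total_long / total_short


if __name__ == "__main__":
    # Hypothetical lengths for a single motif: 13, 20, 30, and 40 bp.
    print(expandability([13, 20, 30, 40]))
```

Because every repeat longer than 24 bp is also longer than 12 bp, the ratio always lies between 0 and 1, which is why the values are reported as percentages.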
The contribution of repeats longer than 24 bp to the observed SSR
distribution is well represented by the expandability values, which not
surprisingly, turn out to be repeat- and taxon-dependent. In all
sequences, rodents show the highest and arthropods the lowest values
(data not shown). The expandability of AC, AG, and AT repeats is almost
uniformly high, although a preference for long (AC)n·(GT)n repeats is observed in primates.
However, consistent with their general underrepresentation, no CG
repeats longer than 24 bp were found. In rodent intergenic regions and
introns, AC, AG, and AT dinucleotide repeats show very high
expandability values (55%-80%), and most of these repeats are longer
than 24 bp in rodent exons (79%-100%), even though dinucleotide
repeats are generally rare in exons. In the case of trinucleotide
repeats, repeat abundance and expandability rarely correlate: e.g., in primate intergenic regions, the second most abundant AAC displays the
lowest expandability (10%), whereas 45% of the total length of the
moderately frequent AAG originates from tracts longer than 24 bp.
Trinucleotide repeats in exons exhibit uniformly low expandability: AGC
is the only trinucleotide motif for which repeats longer than 24 bp can
be found in all taxa. However, the expandability values for AGC in
exons vary between 3% (arthropods) and 57% (rodents).
DISCUSSION
We examined the distribution of microsatellites composed of motifs 1-6 bp long in primates, other mammals, other vertebrates, arthropods, C. elegans, embryophytes, S. cerevisiae and other fungi. To obtain a detailed picture, we analyzed the frequencies of perfect SSRs longer than 12 bp in exons, introns, and intergenic regions for all of these taxa. Our results show that the abundance of certain repeat types varies with the genomic region and distribution is also characteristic of the taxonomic group examined.
It should be noted here that due to biased sequence availability in the
databases, our results apply mainly to those regions of the genomes
that contain protein-coding genes. Even in the case of 'all'
sequences, where we did not select for genes (see Methods), the
contribution of gene-rich sequences is considerable, as can be judged
from the relatively high ratio of exon sequences compared to the total
(Table 10). In an attempt to analyze regions less represented in
GenBank, we included the human chromosome 22 sequence. Data obtained
for this chromosome agree well with those obtained for all primate
sequences, although an increase in (A+T)-rich microsatellites could
be observed. We suggest that the poly(A/T) tails of densely scattered
retroposed sequences, like Alu, LINE-1, and processed pseudogenes are
responsible for this higher proportion of (A+T)-rich repeats.
Chromosome 22 sequence, however, includes only the euchromatic portion,
namely the relatively gene-rich long arm, 22q (Dunham et al. 1999).
Thus, any interpretation of the results should bear in mind that
telomeric regions or genomic regions with very low gene density are not
covered in the present analysis. Repeat abundance and distribution in
such regions may differ from those presented here.
Nonetheless, analysis of the datasets resulted in several noteworthy findings. First, it is very interesting to compare repeat occurrence in introns and intergenic regions. Whereas the constraints shaping protein-coding DNA sequences obviously differ from those that affect these two regions of the genome, comparison of the latter could reveal some less trivial differences. In all vertebrates, the microsatellite distribution in introns and intergenic regions is quite similar but the abundance of CCG triplets differs: Introns do not contain this type of repeat whereas it is relatively abundant in intergenic regions. Because CCG is one of the most abundant repeats in vertebrate exons, a potential bias caused by error in distinguishing exons and intergenic regions cannot be ignored (see Methods). However, we have taken sufficient and appropriate measures to avoid such errors, and we argue that the observed difference is not due to incorrect assignment of exon sequences to intergenic regions. A short calculation carried out on primate data supports this argument: Assuming that microsatellite distribution in intergenic sequences is identical to that of introns, and the increased length of CCG repeats observed in the intergenic regions can be attributed only to exonic sequences, the expected total length of AGC repeats (the dominant trinucleotide repeat of all vertebrate exons) would be almost three times greater in intergenic regions than the observed value.
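The short calculation mentioned above is not spelled out in the text. One way to formalize it, offered here only as a possible reading, is a two-component mixture in which the apparent intergenic data consist of intron-like sequence plus a hypothetical fraction f of misassigned exonic sequence. The symbols d_R(m) and f are introduced for illustration and do not appear in the original.

```latex
% d_R(m): repeat length per Mbp for motif m in region R,
% with R = G (intergenic), I (intron), E (exon).
% f: hypothetical fraction of exonic sequence mislabeled as intergenic.
\begin{align*}
d_G(m) &= (1-f)\,d_I(m) + f\,d_E(m) \\
f &= \frac{d_G(\mathrm{CCG}) - d_I(\mathrm{CCG})}{d_E(\mathrm{CCG}) - d_I(\mathrm{CCG})} \\
\hat{d}_G(\mathrm{AGC}) &= d_I(\mathrm{AGC}) + f\,\bigl[d_E(\mathrm{AGC}) - d_I(\mathrm{AGC})\bigr]
\end{align*}
```

If the predicted value for AGC greatly exceeds the observed intergenic value, as reported for the primate data (almost threefold), misassigned exons alone cannot account for the intergenic CCG signal.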
The absence of CCG and ACG repeats from introns of all vertebrates
could be explained by the presence of the highly mutable CpG
dinucleotide within the motif. An elevated level of CCG repeats was found in intergenic regions of all vertebrates but not in the other taxonomic groups examined. This result suggests that intergenic sequences containing regulatory DNA elements are sufficiently unmethylated in all vertebrates to prevent the 5-methylcytosine-directed spontaneous mutations that would efficiently disrupt repeated stretches of the CCG triplet, as is observed for intronic sequences. An
alternative explanation would be that a specific mechanism exists to
maintain the observed level of CCG repeats in intergenic regions of all
vertebrates. The role of cytosine methylation in histone deacetylation, chromatin remodeling, and gene silencing (Razin 1998) and the presence of CpG islands (Bird 1986) may account for this phenomenon. Coffee et al. (1999) demonstrated histone deacetylation as a consequence of CGG (=CCG) repeat expansion at the 5' end of FMR1 in fragile
X-syndrome cells. Although the association with acetylated histones
depends on the methylation state of DNA, we suggest that the length of
the repetitive tract may be an important factor determining the level
of methylation, not only in the CGG microsatellite but also in the
proximal CpG island of FMR1. Boyes and Bird (1992) demonstrated that transcriptional repression by DNA methylation depends on CpG density. Thus, (CCG)n·(CGG)n repeats may
play an active role in vertebrates by allowing regulatory switches via
the processes of DNA methylation/demethylation and, consequently,
histone acetylation/deacetylation. The low level of CCG repeats in
intergenic regions of species that do not methylate their DNA (C. elegans, Drosophila and yeast) suggests that, even in the
absence of methyl-directed CpG suppression, CCG repeats are not favored
outside the protein-coding regions. This supports the idea that either
the maintenance of CCG repeats in intergenic regions of vertebrates or
their suppression in most nonvertebrate sequences is an active process.
Another interesting problem is the absence of CCG from introns. In
addition to the above mentioned effect of the CpG dinucleotide, CCG
repeats may also be selected against because of the requirements of the
splicing machinery. Repeated elements containing the motif GGG located
at the 5' end of human introns proved to be involved in splice site selection (Sirand-Pugnet et al. 1995). Long CCG sequences could compete
with this region in recruiting splicing machinery components resulting
in inadequate splicing. Furthermore, CCG repeats, which exhibit
considerable hairpin- and quadruplex-forming potential, may influence
the secondary structure of the pre-mRNA molecule. If we consider the observations showing that intron self-complementarity (Howe and Ares 1997) and mRNA secondary structure (stem loops, Coleman and Roesser 1998; hairpins, Goguel et al. 1993) modulate the efficiency and
accuracy of splicing, we can assume that the presence of repeated CCG
tracts would interfere with the formation of mature mRNA.
Differences between introns and intergenic regions can also be observed in nonvertebrate taxa. Intergenic regions of arthropods and vascular plants show excess of AAC and AAG repeats, respectively, when compared to introns of the same taxon. In fungi, AAT is the most frequent trinucleotide repeat in both intergenic regions and introns, but its abundance is much higher in the latter. Other biases (e.g., C, AG, and AAG in C. elegans; AC in yeast and other fungi) also suggest that the selective forces acting on intergenic regions and introns differ from each other in a taxon-specific manner.
It is also worth noting that tetranucleotide repeats represent a higher proportion of all vertebrate genomes than triplet repeats (Table 1), in spite of the fact that exons seem to tolerate only trinucleotide and hexanucleotide repeats effectively. The observed dependence of repeat abundance on repeated unit length deviates markedly from the expected trend of gradual decrease. SSRs with even unit length seem to be favored strongly in rodent introns and intergenic regions, and, to a lesser extent, in other vertebrates. In sharp contrast to this, penta- and hexanucleotide repeats are almost invariably more frequent than tetranucleotide repeats in all nonvertebrate taxa. This varying dependence on repeat unit length suggests fundamental differences between vertebrates and other taxa in the mechanisms of generation and fixation of simple repetitive DNA.
Although our analysis cannot measure microsatellite polymorphism per se, the maximum, average, and variance of SSR lengths may give a good indication of the expected instability (data available online). As a rough estimate for this expandability, we compared the abundance of SSRs longer than 24 bp to that of repeats longer than 12 bp. AC, AG, and AT dinucleotide repeats show a striking dominance among long SSRs in introns and intergenic regions of all taxa, except for fungi. This suggests that dinucleotide repeats other than CG are the most expandable types in higher eukaryotes, a statement well supported by the numerous dinucleotide microsatellite markers used in mapping studies.
Our study confirmed the previous results indicating that the
microsatellite patterns of coding and noncoding regions in eukaryotes show divergence that can be explained on the basis of differential selection (Hancock 1995). However, where Hancock (1995), using a different approach, found high correlation between introns and intergenic regions in Homo sapiens, C. elegans and
S. cerevisiae, we observed characteristic differences between
the two regions in all taxa examined. The notion of differential
selection can also be invoked to explain these differences. Moreover,
our results clearly demonstrate that the preferred SSR types in exons
and other genomic regions are taxon-dependent. Each repeat type that was shown to be flexible in forming various nonconventional intra- or
interstrand structures (Pearson and Sinden 1998
; Sinden 1999
) can be
found in relatively high frequencies in one or more, but never in all,
taxa. This observation may indicate differences in repair enzyme
specificities or other divergent factors acting at the level of selection.
Our results show, in accordance with many other studies, that
strand-slippage theories alone cannot explain microsatellite distribution in the genome as a whole. The inherent potential of a
sequence to form alternative DNA conformations can be important for the
generation of SSRs, but cannot account for the differences observed
among taxa. Enzymes and other proteins involved in various aspects of
DNA-processing (i.e., replication and repair) and chromatin remodeling
may be responsible for the taxon-specificity of microsatellite abundance. It should be emphasized that not only does the repetitiveness of the genomes differ (Hancock 1996b), but also the
preferred microsatellite types are quite different. This may indicate
that SSRs play an important role in genome evolution whereas the
processes responsible for SSR generation and fixation must also have
undergone alteration during evolution.
METHODS
DNA Sequences
Sequences were obtained from GenBank releases 107 (for primates),
109 (for rodents, mammals, and vertebrates) and 110 (for all other
taxa) (ftp://ncbi.nlm.nih.gov/genbank). The taxonomic groups examined
were the following: primates, rodents, other mammals (excluding
primates and rodents), other vertebrates (excluding mammals),
arthropods, C. elegans, embryophytes, S. cerevisiae, and other fungi. The human chromosome 22 sequence superlink was obtained from the Sanger Center web site
(http://www.sanger.ac.uk/HGP/Chr22). Only genomic (chromosomal)
sequences were included in our study. To decrease the effect of
database bias as much as possible, we eliminated all GenBank entries
defined as either tandem repeats, microsatellites, minisatellites,
SSRs, telomeric or centromeric sequences. All mRNA, cDNA, and
structural RNA sequences were excluded from the analysis. Standard
UNIX tools (e.g., grep, awk) and Perl scripts were used to
carry out the necessary filtering steps. From the remaining sequences,
we selected those ≥250 bp long (≥1000 bp in the case of primate
sequences). The redundancy of sequences present in the database was
minimized using the program CLEANUP (Grillo et al. 1996). We eliminated
sequences that were
95% similar to and overlapped by
60% with
another, longer sequence. The sizes of the database subsets used for
the analysis, also broken down to intergenic regions, introns, and
exons, are listed in Table 10. The taxonomic groups are rather
arbitrarily defined, primarily based on sequence availability. The
species contributing to >5% of sequences in the appropriate
database subset are listed in Table 11.
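The keyword and length filters described above can be sketched in a few lines. This is an illustration only, not the original grep/awk/Perl pipeline: the function name `keep_entry`, the keyword pattern, and the reading of the length limits as minimum lengths are all assumptions, and structural RNA entries and the CLEANUP redundancy step are not covered.

```python
import re

# Hypothetical keyword pattern (an assumption): entries whose definition line
# mentions repeats, telomeres, centromeres, or transcripts are discarded.
EXCLUDED = re.compile(
    r"tandem repeat|microsatellite|minisatellite|\bSSRs?\b|"
    r"telomer|centromer|\bmRNA\b|\bcDNA\b",
    re.IGNORECASE,
)


def keep_entry(definition, length, is_primate=False):
    """Return True if an entry passes the keyword and minimum-length filters."""
    # Assumed thresholds: at least 1000 bp for primates, at least 250 bp otherwise.
    min_length = 1000 if is_primate else 250
    return length >= min_length and EXCLUDED.search(definition) is None
```

For example, `keep_entry("Homo sapiens gene X, complete cds", 1500, is_primate=True)` passes both filters, whereas the same entry at 800 bp would be rejected by the primate length limit.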
Although full chromosomal sequences are available for S. cerevisiae and C. elegans, the unconfirmed nature of the majority of sequence annotations prevented their meaningful use in our study. The potential risk of incorrectly classifying DNA fragments into exons, introns, and intergenic regions cannot be neglected even for sequences derived from the traditional GenBank database sections. Although the extent of such bias did not seem to be large, we tried to minimize it by excluding from the analysis all entries that contained no CDS line and by a rather conservative handling of alternative splicing (either biologically relevant or due to uncertain predictions or database errors). We eliminated from our analysis all DNA fragments in which the exon-intron junctions of a protein-coding gene were specified in two or more different, contradictory ways. We also ignored putative intergenic regions before and after such genes. Despite our precautions, there still may be a few exon or intron sequences specified incorrectly as intergenic regions. We think, however, that the resultant bias should not affect our conclusions.
Because most of our results were obtained from sequences containing protein-coding genes, we were also interested in whether or not this caused a bias in the SSR distribution. To test this, we also carried out the analysis on the full sequence of the human chromosome 22. The sequence was used as a whole, i.e., no attempt was made to assign portions of the chromosome 22 sequence to exon, intron, or intergenic regions.
SSR Analysis
From the database subsets obtained for each taxon, we extracted all
perfect tandem repeats with a maximum unit size of six that contained
at least two consecutive units, as described by Jurka and Pethiyagoda
(1995). The SSRs were then grouped according to their localization in
the genome (i.e., within exons, introns, or intergenic regions) using
Perl scripts. This classification was based on the information provided
in the CDS feature table lines of the GenBank entries. Intergenic
regions were defined as being the part of DNA from the end of the last
exon of one gene to the beginning of the first exon of the following
gene (similar to Hancock 1995). Fragments derived from entries
containing no CDS line were not assigned to a region but were retained
in the analysis of all sequences.
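The extraction and classification steps can be illustrated with a short sketch. It is not the tandem repeat extraction program used in this study. The function names, the choice to report maximal periodic runs (which may end in a partial unit), the classification by a single position, and the label "unclassified" for sequence outside the outermost genes are assumptions made for illustration.

```python
def is_primitive(unit):
    """True if the unit is not itself a repetition of a shorter unit."""
    n = len(unit)
    return not any(n % d == 0 and unit == unit[:d] * (n // d) for d in range(1, n))


def find_perfect_repeats(seq, max_unit=6, min_units=2):
    """Return sorted (start, end, unit) tuples for perfect tandem repeats.

    Coordinates are 0-based and half-open. A run is reported when the
    sequence has period k over at least min_units * k bases.
    """
    seq = seq.upper()
    n = len(seq)
    hits = []
    for k in range(1, max_unit + 1):
        j = k
        while j < n:
            if seq[j] != seq[j - k]:
                j += 1
                continue
            first_match = j
            while j < n and seq[j] == seq[j - k]:
                j += 1
            start, end = first_match - k, j
            unit = seq[start:start + k]
            if (end - start >= min_units * k and is_primitive(unit)
                    and set(unit) <= set("ACGT")):
                hits.append((start, end, unit))
    return sorted(hits)


def classify_position(pos, genes):
    """Classify a position given genes as sorted lists of (start, end) exons.

    Genes are assumed to be sorted and non-overlapping. Positions between
    two genes are intergenic; positions outside the outermost genes are
    left unclassified (an assumption of this sketch).
    """
    if not genes:
        return "unclassified"
    for exons in genes:
        if exons[0][0] <= pos < exons[-1][1]:
            return "exon" if any(s <= pos < e for s, e in exons) else "intron"
    if genes[0][0][0] <= pos < genes[-1][-1][1]:
        return "intergenic"
    return "unclassified"
```

In this sketch the exon coordinates would come from the CDS feature table lines of each GenBank entry; parsing those lines is omitted.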
Further data analysis (classification of SSRs by unit patterns and
computing the values listed in the tables) was carried out as described
by Jurka and Pethiyagoda (1995). In the present analysis, repeats whose
unit patterns are circular permutations and/or reverse complements of
each other were grouped together as one type. The total number of such
nonoverlapping types is 501 for 1-6-bp long motifs (for details see
Jurka and Pethiyagoda 1995).
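The grouping of unit patterns can be made concrete with a small enumeration. The sketch below is illustrative and is not the code used in the study; the function names and the choice of the alphabetically smallest pattern as the representative of each type are assumptions. Units that are repetitions of a shorter unit (e.g., ACAC) are counted with the shorter unit.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(unit):
    """Return the reverse complement of a DNA string."""
    return unit.translate(COMPLEMENT)[::-1]


def is_primitive(unit):
    """True if the unit is not itself a repetition of a shorter unit."""
    n = len(unit)
    return not any(n % d == 0 and unit == unit[:d] * (n // d) for d in range(1, n))


def canonical_motif(unit):
    """Smallest string among all rotations of the unit and of its reverse complement."""
    candidates = []
    for u in (unit, reverse_complement(unit)):
        candidates.extend(u[i:] + u[:i] for i in range(len(u)))
    return min(candidates)


def count_motif_types(max_len=6):
    """Count distinct motif types for unit lengths 1..max_len."""
    types = set()
    for k in range(1, max_len + 1):
        for letters in product("ACGT", repeat=k):
            unit = "".join(letters)
            if is_primitive(unit):
                types.add(canonical_motif(unit))
    return len(types)
```

Under these conventions (CCG)n and (CGG)n fall into the same type, and `count_motif_types()` yields the 501 types stated above.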
We mainly examined the distribution of perfect repeats >12-bp long. Because microsatellites are often disrupted by single base substitutions, the contribution of various repetitive motifs to the overall repetitivity of the genome could be better estimated using this relatively short cutoff length. However, to assess expandability of the repeats, we also identified repeats longer than 24 bp. For a particular motif, expandability is defined as the total length of repeats longer than 24 bp divided by the total length of repeats longer than 12 bp.
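The definition of expandability translates directly into a small calculation. The following is a minimal sketch; the function name and the convention of returning `None` when no repeat exceeds the shorter cutoff are assumptions.

```python
def expandability(repeat_lengths, short_cutoff=12, long_cutoff=24):
    """Total length of repeats > long_cutoff divided by total length of repeats > short_cutoff.

    repeat_lengths holds the lengths (in bp) of all repeats of one motif.
    Returns None when no repeat is longer than short_cutoff.
    """
    total_short = sum(n for n in repeat_lengths if n > short_cutoff)
    total_long = sum(n for n in repeat_lengths if n > long_cutoff)
    if total_short == 0:
        return None
    return total_long / total_short
```

For a motif with repeats of 10, 14, 20, 30, and 36 bp, the repeats longer than 12 bp total 100 bp and those longer than 24 bp total 66 bp, giving an expandability of 0.66.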
To allow direct comparisons regardless of the cumulative size of the genomic regions in the database subsets, the total lengths of the microsatellites were normalized to 1 Mbp of the appropriate genomic sequence type.
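As a worked illustration of this normalization (a sketch only; the function name and the example numbers are hypothetical):

```python
def normalized_length_per_mbp(total_repeat_bp, region_bp):
    """Scale a total repeat length (bp) to 1 Mbp of the sequence type examined."""
    if region_bp <= 0:
        raise ValueError("region_bp must be positive")
    return total_repeat_bp * 1_000_000 / region_bp
```

With these hypothetical numbers, 500 bp of a given repeat type found in 2 Mbp of intron sequence corresponds to 250 bp per Mbp.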
| |
ACKNOWLEDGMENTS |
|---|
This work was supported by grant OTKA T19278 from the Hungarian National Scientific Research Fund. We thank Ágnes Major for helpful discussion and Paul Klonowski for the computer program of tandem repeat extraction. We also thank the anonymous referees for their useful comments and suggestions.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| |
FOOTNOTES |
|---|
3 Corresponding author.
E-MAIL tothg{at}ludens.elte.hu; FAX (+36-1) 266-2694.
| |
REFERENCES |
|---|
|
|
|---|
, N., Macura, S., and MacMurray, C.T. 1995. Trinucleotide repeats that expand in human disease form hairpin structures in vitro. Cell 81: 533-540.
CAG) repeats occur by recombination. J. Biol. Chem. 274: 23468-23479.
a three-way connection. EMBO J. 17: 4905-4908.
-tropomyosin pre-mRNA. Nucl. Acids Res. 23: 3501-3507.
Received January 5, 2000; accepted in revised form May 4, 2000.