Vol. 12, Issue 8, 1185-1189, August 2002
LETTER
ABSTRACT
Intron-size distributions for most multicellular (and some unicellular) eukaryotes have a sharp peak at their "minimal intron" size. Across the human population, these minimal introns exhibit an abundance of insertion-deletion polymorphisms, the effect of which is to maintain their optimal size. We argue that minimal introns affect function by enhancing the rate at which mRNA is exported from the cell nucleus.
INTRODUCTION
Decades of research on the mechanisms of pre-mRNA splicing have revealed a remarkably intricate process. These complexities include exon versus intron recognition (Berget 1995), co-transcriptional splicing (Goldstrohm et al. 2001), alternative splicing (Dredge et al. 2001; Grabowski and Black 2001), exonic splicing enhancers (Blencowe 2000; Nissim-Rafinia and Kerem 2002), intronic splicing enhancers (McCullough and Berget 2000), and tight coupling between splicing and efficient mRNA export from the nucleus (Luo and Reed 1999; Zhou et al. 2000). Given such complexity, it is not hard to imagine that different introns are processed differently, not only between species but also within species. That being the case, can we segregate intron sequences according to differences in how they are processed? If so, might these differences be reflected in the nature of the sequence polymorphisms that are found in the population? We will argue that the answer to both questions is yes.
On the surface, this is an implausible idea, because intron sequences are poorly conserved. Known splicing motifs (e.g., GT-AG, branch points) are only a few bases in size, whereas intron lengths can be hundreds of kilobases. In the transition from cold-blooded to warm-blooded vertebrates, many introns experienced a twofold increase in GC content (Bernardi 2000). The fact that intron sequence contents are so pliable is the reason why introns are often considered "junk." On the other hand, the enzymatic degradation of the excised introns must be a significant biochemical burden for the cell, especially if most of the human genome is transcribed (Wong et al. 2000, 2001). Why would the cell go to so much trouble? Why not just get rid of the introns? People who do experiments on transgenic mice have an answer, for they have long known that some introns are essential for a high level of expression (Choi et al. 1991; Palmiter et al. 1991). If introns can influence expression levels, they are certainly not junk. Where might one go to find such introns?
One of the most conspicuous features of eukaryotic genomes is that a significant fraction of the introns are often clustered around a species-specific peak at the low end of the size distribution. We call them "minimal" introns because there are none smaller. Our objective is to show that the evolutionary persistence of such an optimal intron size is due to functional constraints. However, there are many practical difficulties. Minimal intron sequence contents are degenerate (Lim and Burge 2001). Not every intron is size constrained, and any benefits to having an optimal intron size are likely to be marginal. We reasoned that our chances of success would be best in a genome in which large introns are prevalent, like human, because evolution would have already separated those introns that need to remain small from those that do not. Because of the recent expansion in the human population (Harpending and Rogers 2000), even a slight benefit, as might be expected from a small change in the intron size, would have a high probability of being fixed in the population. One might thus expect to see an abundance of minor alleles that embody the process of intron size optimization.
Specifically, we present resequencing data on a collection of 93 minimal introns sampled across a diverse human population. The data reveal an abundance of insertion-deletion (indel) polymorphisms whose effect is to maintain the optimal intron size. From an analysis of the yeast expression data, we will show that minimal introns can enhance mRNA synthesis rates. In essence, we present an example of selection based on conservation of intron size, as opposed to conservation of sequence content. In fact, studies of recombination rates in Drosophila melanogaster have indicated that there are selective pressures on intron size (Carvalho and Clark 1999). Perhaps the perception that introns are junk is an artifact of an overly narrow focus on conservation of sequence content as the only signature of selection.
RESULTS
Minimal Introns Are Found in Most Multicellular Eukaryotes
The distributions for intron size in Homo sapiens, Arabidopsis thaliana, D. melanogaster, and Caenorhabditis elegans are displayed in Figure 1. All of these data are based on cDNA-to-genomic alignments, not gene-prediction programs. The mean number of introns per gene is 12.1, 6.2, 4.7, and 7.7, respectively. A significant fraction of the introns is always clustered about a species-specific minimum size, reflected by the sharp "spike" in the distribution centered around a mean (±SD) intron size of 92±14, 89±12, 61±10, and 48±9 bp, respectively. The idea that introns might have a minimum size, independent of sequence content, is not new (Wieringa et al. 1984). Presumably, it is a reflection of the physical constraints imposed by the cellular machinery, and the dimensions of this machinery are species specific. Minimal introns are also observed in Mus musculus, Gallus gallus, Xenopus laevis, Fugu rubripes, and Oryza sativa, albeit with annotation data parsed from GenBank. Yeast also has minimal introns, at 92±20 and 49±11 bp, for Saccharomyces cerevisiae (Ares et al. 1999; Spingola et al. 1999) and Schizosaccharomyces pombe (Wood et al. 2002), respectively.
In contrast, there is an enormous variability from species to species in the distributions for larger introns. In the most extreme case, H. sapiens, there is a broad "hump" in the intron size distribution, extending out to hundreds of kilobases. Sequence content analysis with RepeatMasker (http://ftp.genome.washington.edu/RM/RepeatMasker.html) reveals a gradual transition, from introns with no detectable transposons, at <1 Kb, to introns with one or more transposons, at >1 Kb. It is thus reasonable to make a distinction between "minor" humps owing to introns <1 Kb, and "major" humps owing to introns >1 Kb. Given that the absence of a major hump can be owing to acquisition biases against large sequence contigs, the only statement that we are comfortable with is that there is a major hump in H. sapiens, M. musculus, G. gallus, and X. laevis.
Minimal Introns Are Not Randomly Distributed Among Genes
Considering that so many H. sapiens introns have been expanded by transposon insertions, one has to wonder why the minimal intron peak persists. Are there benefits to the organism for maintaining some introns at an optimal size? Given the predominance of the neutral theory of evolution (Kimura 1983), we must first eliminate any neutral or semineutral explanations. Perhaps some introns were never bombarded by transposons. Even if an intron was bombarded, it could have deleted back to the minimum because of the mutational bias for deletions over insertions, which is well known from comparisons of processed pseudogenes with their functional paralogs (Ophir and Graur 1997; Petrov 2001). Combined with the selection against introns too small for the splicing machinery, it would appear that persistence of minimal introns can be explained without invoking any special functional constraints. However, something else must also be going on, as there is a glaring inconsistency in the data.
A distinguishing characteristic of the above-mentioned processes is that they do not favor any particular intron or gene. Therefore, if minimal introns are a fraction $f_m$ of the total, and if there are $R$ introns per gene, the probability that a gene has at least one minimal intron would be $1 - (1 - f_m)^R$. More precisely, we define a minimal intron as anything that lies within three standard deviations of the optimum, which in H. sapiens amounts to 13.6% of the introns. Because $f_m$ varies with GC content, we compute $f_m$ in four groups (similar results are obtained with eight groups) based on GC content in 10-Kb windows at each end of the gene. We then integrate over all observed values of $R$. The computation is performed on 882 genes from a previous analysis (Wong et al. 2000), containing every gene with a cDNA sequence that could be aligned in its entirety to finished genomic sequence. The neutral expectation is that minimal introns will be found in 56.2% of the genes, but the reality is 44.4%. The difference is statistically significant, $P(\text{binomial}) = 5 \times 10^{-13}$, which means that minimal introns tend to cluster in certain genes.
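The neutral expectation described here can be sketched numerically. The Python below is only an illustration under stated assumptions, not the authors' pipeline: the function names and the toy list of intron counts are hypothetical, a single value of the minimal-intron fraction is used instead of the four GC-content groups, and the one-sided binomial tail is one plausible reading of the test reported in the text. Only the 13.6% fraction, the 882 genes, and the 56.2% and 44.4% figures come from the article.

```python
import math

def prob_gene_has_minimal_intron(f_m: float, r: int) -> float:
    """Probability that a gene with r introns has at least one minimal intron,
    if each intron is independently minimal with probability f_m."""
    return 1.0 - (1.0 - f_m) ** r

def expected_fraction(f_m: float, intron_counts: list[int]) -> float:
    """Average the per-gene probability over the observed intron counts R."""
    probs = [prob_gene_has_minimal_intron(f_m, r) for r in intron_counts]
    return sum(probs) / len(probs)

def binomial_lower_tail(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), with each term built in log space."""
    total = 0.0
    for i in range(k + 1):
        log_term = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                    + i * math.log(p) + (n - i) * math.log(1.0 - p))
        total += math.exp(log_term)
    return total

# Hypothetical intron counts for a handful of genes (illustration only).
toy_counts = [1, 3, 6, 12, 20]
toy_expected = expected_fraction(0.136, toy_counts)

# One-sided test using the figures quoted in the text.
n_genes = 882
observed_genes = round(0.444 * n_genes)
p_value = binomial_lower_tail(observed_genes, n_genes, 0.562)

print(f"toy expected fraction: {toy_expected:.3f}")
print(f"P(X <= {observed_genes}) = {p_value:.1e}")
```

Because the article stratifies by GC content before integrating over the intron counts, a faithful reproduction would call `expected_fraction` once per GC group and combine the results weighted by group size.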
In our four primary data sets (H. sapiens, A. thaliana, D. melanogaster, and C. elegans), the magnitude of the difference between the observed and expected number of genes with at least one minimal intron is 11.7%, 2.0%, 6.5%, and 4.7%, respectively. Evidently, the nonrandom distribution of minimal introns among genes is most readily observable in those species with a greater number of extremely large introns. Assuming that these introns are the result of transposon bombardment, this implies that transposon activity acts as a probe of how sensitive each gene is to the presence of minimal introns. Without significant transposon activity over the evolutionary history of a species, minimal introns remain randomly distributed. Thus, the genome we should resequence is H. sapiens, because evolution has already separated those introns that need to remain small from those that do not.
Minimal Introns Are Full of Indel Polymorphisms
According to Kimura (1983), "polymorphism is just a transient phase of molecular evolution." We reasoned that, if evolution is really trying to maintain some species-specific minimal intron size, it might be possible to catch this process in action from an analysis of minimal intron polymorphisms across a diverse population. Without further justification, we will let the data speak for themselves.
Our resequencing efforts were focused on introns with sizes close to the human optimum of 92±14 bp. We resequenced 93 of these introns in a population of diverse ethnicity (Collins et al. 1998), over an average of 45.7 individuals (91.4 chromosomes). These introns were small, so there were no transposons in them. Their mean (±SD) size was 94±14 bp. To minimize the potential biases arising from differences in mutation and recombination rates, most of which are correlated with local GC content, we selected introns that span the full range of GC content, as depicted in Figure 2. We identified 42 polymorphic sites in all, 30 single-base substitutions and 12 indels, with the indels in nine different introns. To compare our results with the published data, we adjusted for variations in sample depths and sequenced lengths. If $K$ polymorphic sites are found in a region of length $L$ after sequencing $n$ chromosomes, the commonly used population genetics parameter (Cargill et al. 1999; Halushka et al. 1999) is the normalized number of variant sites, $\theta = K / \left( L \sum_{i=1}^{n-1} 1/i \right)$. Splitting $\theta$ into separate components gives $\theta(\text{subst}) = 6.75 \times 10^{-4}$ for substitutions and $\theta(\text{indel}) = 2.70 \times 10^{-4}$ for indels. Strikingly, this substitution rate is not significantly different from the substitution rate of $7.51 \times 10^{-4}$ reported for the human genome (Sachidanandam et al. 2001).
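As a rough check on the normalized number of variant sites, the sketch below evaluates the standard estimator, K divided by the product of L and the sum of 1/i for i from 1 to n-1, on the counts given in the text. It rests on two assumptions that are not stated in the article: the total sequenced length is approximated as 93 introns times the 94-bp mean size, and the average of 91.4 chromosomes is rounded to 91. The function names are hypothetical. Under those assumptions the results land close to the reported component values.

```python
def harmonic_sum(n_chromosomes: int) -> float:
    """Sum of 1/i for i = 1 .. n-1, the correction for sample depth."""
    return sum(1.0 / i for i in range(1, n_chromosomes))

def normalized_variant_sites(k_sites: int, length_bp: int, n_chromosomes: int) -> float:
    """theta = K / (L * sum_{i=1}^{n-1} 1/i)."""
    return k_sites / (length_bp * harmonic_sum(n_chromosomes))

# Assumptions (not from the article): L ~ 93 introns x 94 bp, n rounded to 91.
total_length = 93 * 94
n = 91

theta_subst = normalized_variant_sites(30, total_length, n)  # 30 substitutions
theta_indel = normalized_variant_sites(12, total_length, n)  # 12 indels

print(f"theta(subst) = {theta_subst:.2e}")
print(f"theta(indel) = {theta_indel:.2e}")
```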
However, 28.6% of our minimal intron polymorphisms were indels, which is significantly more than usual. To make this point, we need a background rate for human indels. Such a rate is not readily available, as most large-scale polymorphism discovery projects are focused on the easier-to-genotype substitution polymorphisms. Many of the putative indels are in poly-N tracts, where N is any nucleotide, and these are usually caused by sequencing errors. For example, 13% of chromosome 22 polymorphisms were indels, but this ratio was only 4% when poly-N tracts were ignored (Mullikin et al. 2000). We note that in our minimal intron indels, the longest poly-N tract was a run of only nine Gs, as shown in Table 1. For a background rate, we used the data from the Environmental Genome Project (http://www.genome.washington.edu/projects/egpsnps). These data focus on introns of every size but are restricted to the first few hundred bases flanking the exons, much like our minimal intron data, which also never stray far from the exons. Averaged over 90 genes, 7.9% of 392 intron polymorphisms were indels. Taking 7.9% as the null hypothesis, the observation of 28.6% indel polymorphisms in minimal introns is statistically significant, with $P(\text{binomial}) = 6 \times 10^{-5}$.
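The indel excess can be illustrated with an exact binomial tail. This is a minimal sketch, assuming the test is one-sided (12 or more indels among 42 polymorphic sites at a background rate of 7.9%); the article does not spell out the exact form of its test, and the function and variable names are hypothetical. Under that assumption the tail probability comes out on the order of 10^-5, in line with the reported significance.

```python
import math

def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p), by direct summation."""
    return sum(math.comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k, n + 1))

n_polymorphisms = 42      # polymorphic sites found in the minimal introns
n_indels = 12             # of which were indels
background_rate = 0.079   # indel share in the background intron data

observed_share = n_indels / n_polymorphisms
p_value = binomial_upper_tail(n_indels, n_polymorphisms, background_rate)

print(f"observed indel share: {observed_share:.1%}")
print(f"P(X >= {n_indels}) = {p_value:.1e}")
```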
The observed indels lie in two distinct clusters. There are 10 rare indels of minor allele frequency f < 0.06, plus two common indels of minor allele frequency f > 0.35. The direction of the intron size change, relative to the major allele, is shown in Figure 3. All the rare indels drive the introns back toward their optimal size of 92 bp. The exceptions are the two common indels, which likely arose from different population dynamics. We further note that the probability of 10 indels in a row with the correct sign, under a null hypothesis that the sign is random, is $1 \times 10^{-3}$. To confirm that the major allele is the ancestral allele, we resequenced these indel-containing introns in a panel of 10 primates, ranging from chimpanzees to lemurs. Our polymerase chain reaction primers failed on the three most GC-rich introns, but in the other introns, the major allele agreed with the primate orthologs. The sole exception was in the most AT-rich intron, in which both human alleles were observed in different primates.
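The sign argument for the rare indels reduces to simple arithmetic, which the following sketch makes explicit. The allele sizes in `toy_indels` are hypothetical and chosen only for illustration (the real values belong to Figure 3 and Table 1); the function names are likewise assumptions. The only figures taken from the text are the 92-bp optimum and the count of 10 rare indels.

```python
OPTIMUM_BP = 92  # human minimal-intron optimum quoted in the text

def moves_toward_optimum(major_size: int, minor_size: int,
                         optimum: int = OPTIMUM_BP) -> bool:
    """True if the minor allele is closer to the optimum than the major allele."""
    return abs(minor_size - optimum) < abs(major_size - optimum)

def prob_all_same_sign(n_events: int) -> float:
    """Chance that n independent events all show the prespecified sign
    when either sign is equally likely."""
    return 0.5 ** n_events

# Hypothetical (major, minor) intron sizes in bp, for illustration only.
toy_indels = [(98, 95), (85, 88), (104, 100)]
directions = [moves_toward_optimum(major, minor) for major, minor in toy_indels]

print(directions)
print(f"{prob_all_same_sign(10):.1e}")
```

With 10 rare indels the null probability is 1/1024, which rounds to the value quoted in the text.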
Minimal Introns Can Enhance the Export of Spliced mRNAs
These data indicate that at least for some genes, the presence of a minimal intron can be beneficial. Transgenic mice experiments (Choi et al. 1991; Palmiter et al. 1991) have long shown that some introns can affect expression levels. The most current explanation (Luo and Reed 1999; Zhou et al. 2000) is that "splicing generates a specific isolable complex that promotes rapid and efficient mRNA export." Thus, our conjecture is that minimal introns can affect mRNA maturation by coupling more efficiently to the biochemically linked machineries of splicing and export, thereby increasing the rate at which mRNA is exported from the cell nucleus.
Support for this conjecture can be found in the yeast expression data (Holstege et al. 1998). One must be careful not to mix up S. cerevisiae and S. pombe, because their minimal intron peaks are at very different sizes, and their genomes contain different sets of splicing-related proteins (Kaufer and Potashkin 2000). For this purpose, S. cerevisiae is more appealing because, among the 229 of the 6188 genes that do have introns, there is generally only one intron per gene. Moreover, there is a striking dichotomy in the types of introns found in different types of genes. Ribosomal-protein genes have nonminimal introns, but nonribosomal genes have minimal introns. As we show in Table 2, mRNA synthesis rates were 3.4 times higher in nonribosomal genes with minimal introns than in nonribosomal genes without introns. Ribosomal-protein genes with and without nonminimal introns showed no such differences in mRNA synthesis rates. Although there may be other explanations for this observation besides mRNA export, these data are not inconsistent with our conjecture.
Additional support for our conjecture is observed in Drosophila populations with a 66-bp intron presence-absence polymorphism in the jingwei (jgw) gene. Absence of this minimal intron reduces the expression level by almost a factor of two (Llopart et al. 2002).
DISCUSSION
The general understanding is that many mRNAs are actively transported out of the nucleus, not passively diffused. Incompletely processed mRNAs are poor substrates for this export machinery, indicating that mRNA export is a type of quality control to ensure that only functional mRNAs reach the cytoplasm (Cullen 2000). We envision a competition to get out of the cell nucleus, with at least three different export paths: one each for mRNAs with no introns, mRNAs with minimal introns, and mRNAs with nonminimal introns. Perhaps minimal introns function as "routing" tags that define a more secure export path. Comparisons between species might not be simple. Particular export paths may not be present in some species, and orthologous genes need not use the same export path. For example, in Encephalitozoon cuniculi (Katinka et al. 2001) and in the nucleomorph chromosomes of Guillardia theta (Douglas et al. 2001), all introns are minimal (23 to 52 bp in E. cuniculi and 42 to 52 bp in G. theta), but they are in ribosomal-protein genes, the opposite of the situation for S. cerevisiae.
Our conclusion is that genes that rely on the improved rate of mRNA export provided by minimal introns would be more resistant to intron expansion. Furthermore, any intron that drifts away from this optimum will be returned to it at the first opportunity. Assuming this is the correct explanation, it is difficult to see how such a complicated system of interacting molecules could ever be reconstituted in a typical in vitro experiment. What we did, in effect, was let evolution perform the in vivo experiments for us and then query the results through a statistical analysis of the extant human polymorphisms. The interesting question is whether this methodology can be applied to other degenerate sequences with functional significance, such as promoter motifs associated with transcription regulation.
METHODS
We constructed high-quality databases of intron sequences, based exclusively on cDNA-to-genomic sequence alignments (Wong et al. 2000), for all of the multicellular eukaryotes for which a significant fraction of the genome had been finished. Resequencing was performed on the National Human Genome Research Institute (NHGRI)/Coriell Human Diversity Panel, which is a representation of all of the major ethnicities, including Northern European, Chinese, Indo-Pakistani, African American, Middle Eastern, Southwestern American Indian, Japanese, Mexican, and Puerto Rican (Collins et al. 1998). For ancestral alleles, we sequenced a primate panel from Coriell, which has chimpanzee, pygmy chimpanzee, lowland gorilla, orangutan, rhesus macaque, pig-tailed macaque, red-bellied tamarin, woolly monkey, black-handed spider monkey, and ring-tailed lemur. Polymerase chain reaction primers were designed from exon sequences flanking the selected introns. Sequencing was performed with dye-terminator chemistry on capillary sequencers. Every polymorphism, particularly the indels, was confirmed by visual inspection of the sequence traces. SNPs were submitted to GenBank/dbSNP with the handle UWGC (batches 2.12.2001.1 to 2.12.2001.4).
WEB SITE REFERENCES
http://ftp.genome.washington.edu/RM/RepeatMasker.html; The RepeatMasker program is available at this site.
http://www.genome.washington.edu/projects/egpsnps; Data from the Environmental Genome Project.
ACKNOWLEDGMENTS
We thank Maynard Olson, Lars Bolund, and Changqing Zeng for comments and suggestions. This analysis was partially supported by a grant from the National Institute of Environmental Health Sciences (1 RO1 ES09909).
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
5 These authors contributed equally to this work.
6 Corresponding author.
E-MAIL junyu{at}u.washington.edu; FAX (206) 685-7344.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.224602.
REFERENCES
-globin intron. Cell 37: 915-925.
Received August 9, 2001; accepted in revised form June 12, 2002.