Published online before print
October 15, 2001, 10.1101/gr.188001
Vol. 11, Issue 11, 1848-1853, November 2001
LETTER
ABSTRACT
Completion of the human genome sequence provides evidence for a gene count with a lower bound of 30,000-40,000. Significant protein complexity may derive in part from multiple transcript isoforms. Recent EST-based studies have revealed that alternate transcription, including alternative splicing, polyadenylation, and transcription start sites, occurs within at least 30-40% of human genes. Transcript form surveys have yet to integrate the genomic context, expression, frequency, and contribution to protein diversity of isoform variation. We determine here the degree to which protein coding diversity may be influenced by alternate expression of transcripts, by exhaustive manual confirmation of genome sequence annotation and comparison to available transcript data to accurately associate skipped exon isoforms with genomic sequence. Relative expression levels of transcripts are estimated from EST database representation. The rigorous in silico method accurately identifies exon skipping using verified genome sequence. A total of 545 genes have been studied in this first hand-curated assessment of exon skipping on chromosome 22. Combining manual assessment with software screening of exon boundaries provides a highly accurate and internally consistent indication of skipping frequency. Fifty-seven of 62 exon-skipping events occur in the protein coding regions of 52 genes. A single gene (FBXO7) expresses an exon repetition. Of the highly represented multi-exon genes, 59% are likely to express exon-skipped isoforms in ratios that vary from 1:1 to 1:>100. The proportion of all transcripts corresponding to multi-exon genes that exhibit an exon skip is estimated to be 5%.
INTRODUCTION
Gene expression products can have variable forms, characterized by alternate start sites of transcription and polyadenylation (Gautheret et al. 1998), exon skipping, and alternate donor and acceptor sites at exon boundaries (Mironov et al. 1999a; Brett et al. 2000; Croft et al. 2000). Exon skipping in transcript isoforms is the most frequent event altering the protein coding sequence of genes (Mironov et al. 1999b; International Human Genome Sequencing Consortium 2001) (http://industry.ebi.ac.uk/~thanaraj/gene.html). Surveys of the incidence of alternative splicing, including exon skipping, have been performed (Andreadis et al. 1987; Iida 1997; Valentine 1998; Thanaraj 1999), and a growing number of anecdotal observations confirm the utilization of exon-skipped transcripts in developmental (Dufour et al. 1998; Lambert de Rouvroit et al. 1999; Lim et al. 1999; Unsworth et al. 1999; Kawahara et al. 2000), tissue-specific (Zacharias et al. 1995), and disease-specific (Jiang and Wu 1999; Mercatante and Kole 2000; Strehler and Zacharias 2001) states.
Several approaches have successfully used hybridization experiments both in silico (Wolfsberg and Landsman 1997; Gautheret et al. 1998; Mironov et al. 1999b; Thanaraj 1999; Beaudoing et al. 2000; Brett et al. 2000; Croft et al. 2000; Schweighhoffer 2000; International Human Genome Sequencing Consortium 2001) and in vitro (Schweighhoffer 2000; Strehler and Zacharias 2001) to assess alternate transcript diversity. Nevertheless there are difficulties with interpretation of the results that include (1) the existence of gene families, paralogs, gene copies, and pseudogenes that have similar DNA sequences, providing false positive hybridization; (2) the existence of orphan genes that are located in the complementary strand of intronic or flanking regions (Mironov et al. 1999a); (3) insufficient representation of expressed sequence data in public expressed sequence tag (EST) databases to identify all transcript isoforms.
We have taken an exhaustive approach to the detection of exon skipping from carefully annotated, protein-confirmed genes to maximize the accurate assessment of the degree of isoform diversity.
RESULTS
To develop an unambiguous assessment of the degree to which exon
skipping contributes to expressed transcript isoform diversity, and to
assess the impact on protein coding of exon-skipping events within
coding regions of transcripts from known genomic loci, we have compared
ESTs to 545 annotated genes on chromosome 22. Although no standard
measure of relative spliceform frequencies for human genes exists,
coverage of exon boundaries by ESTs provides a measure of the diversity
of isoforms for a particular gene. The incidence of captured ESTs
spanning exon junctions may also provide a reasonable, though
noncomprehensive, view of transcript diversity and expression.
Detection of transcripts displaying exon skipping was performed using
novel software, j_explorer, which reduced the complexity
of the gene sequences to a set of possible splice junctions that were
used to search public EST databases to identify ESTs spanning the
annotated exon-exon junctions. The software employs standard data
format (EMBL sequence format) and visualization tools (ARTEMIS,
Rutherford et al. 2000) in the analysis
(http://www.sanbi.ac.za/exon_skipping/). Removal of single and double
exon genes reduced the set to 347 multi-exon genes (Table 1), of which
10 were annotated previously in literature or public databases as
having experimentally confirmed exon skips (Table 2). Exon-skipping events
were recorded when all original junctions involved in the skipping
event, including flanking exons, were confirmed by EST sequences. All
ESTs supporting exon-skipping events were subsequently confirmed to be
unambiguous transcripts of the corresponding gene and not products of
paralogous genes, pseudogenes, or related members of an extended gene
family by BLAST searches against the nonredundant (nr)
database at NCBI. Highly specific identification of exon-skipping and
exon-repetition events has resulted.
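
The recording rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the j_explorer implementation: the function name `confirmed_skips` and the representation of EST support as a set of (upstream exon, downstream exon) index pairs are assumptions made for this example.

```python
def confirmed_skips(supported):
    """Return nonconsecutive junctions whose spanned linear junctions all have EST support.

    `supported` is a set of (i, j) exon-number pairs with i < j; each pair means
    that at least one EST spans the junction joining exon i to exon j.
    (Hypothetical data structure, chosen for illustration.)
    """
    skips = []
    for i, j in sorted(supported):
        if j - i < 2:
            continue  # a consecutive junction is not a skip
        # Require EST support for every linear junction i->i+1, ..., (j-1)->j.
        if all((k, k + 1) in supported for k in range(i, j)):
            skips.append((i, j))
    return skips


# Hypothetical five-exon gene: junctions 1-2, 2-3 and 3-4 are EST-covered, 4-5 is not.
support = {(1, 2), (2, 3), (3, 4), (1, 3), (3, 5)}
print(confirmed_skips(support))
```

In this toy input the exon 1-exon 3 junction is reported because both linear junctions it bypasses are covered, whereas the exon 3-exon 5 junction is withheld because no EST confirms the exon 4-exon 5 junction.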
Sensitivity was assessed using the 10 genes with experimentally confirmed exon skipping. j_explorer accurately identified the previously reported skipped exons in four of the genes (NF2, ADSL, CLTCL1, and GGT1). Novel isoforms were detected in EWSR1, PLA2G6, and GGT1 (Table 2), whereas previously described exon-skipping events in four genes (CACNA1I, BZRP, MTMR3, SEP3) were not detected because ESTs mapping to these exon junctions were not available in the public EST databases. The approach yields zero false positives, as confirmed by available mRNA and genomic data, and provides a solid basis for the development of models of transcript diversity that can be generated from a single gene.
We have discovered 62 exon-skipping events in 52 genes (Table 2); 57 of the 62 (92%) exon-skipping events occur within the protein-coding region. The remainder occur in either the 3' (1/62) or 5' (4/62) untranslated region (UTR). In 31/62 (50%) of cases the reading frame is maintained but regions are deleted. In 18/62 cases (29%) the introduction of a skip destroys the reading frame resulting in a frame shift. Proteins for the remaining 8/62 (13%) could not be reconstructed. In four cases an alternative stop codon is used, whereas in five cases there is an alternative start codon introduced.
Gene transcripts were scanned for exon repetition using similarity
searching of repeated exon constructs against public EST data. A single
tandem repetition of exon 2 of the F-box protein (NM_012179) was
detected with high identity to EST AA569698. Exon repetition has
previously been reported only in rodents (Frantz et al. 1999).
Ratios of transcript isoforms are difficult to resolve using only EST data; however, the relative capture frequency of skipped exons provides an indication of the incidence of more commonly occurring isoforms (four or more ESTs confirm the isoform with exon skipping in CLTCL1, ADSL, GGT1, GSTT1, HMG2L1, MFNG, and dJ222E13.1) as compared to rarer isoforms. In 47/62 (76%) cases, the reference isoform, constructed from the genomic EMBL entry, is represented more frequently than a skipped exon isoform (Table 2).
The degree to which the level of gene expression, and hence database
representation, affects the probability of finding a skipped transcript
was assessed using the number of EST exon-exon junction captures per
gene as a relative measure of transcript representation. Three
categories comprising equal gene numbers were selected: low capture,
which corresponds to <14 EST matches per gene; medium capture, those
from 14 to 50 EST matches per gene; and high capture, those with 50 or
more EST matches per gene (Table 3).
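
The binning can be illustrated with a short Python sketch. The function name and the example counts are hypothetical. Because the text assigns a count of exactly 50 to both the medium and the high category, the sketch assumes it belongs to the high category. How genes with no EST matches were handled is not stated, so the sketch simply places them in the low bin.

```python
from collections import Counter


def capture_category(est_matches):
    """Assign a gene to a capture category from its EST junction-match count."""
    if est_matches < 14:
        return "low"
    if est_matches < 50:
        return "medium"
    return "high"  # assumption: a count of exactly 50 is treated as high


# Hypothetical per-gene match counts, for illustration only.
counts = {"geneA": 3, "geneB": 14, "geneC": 49, "geneD": 50, "geneE": 212}
tally = Counter(capture_category(n) for n in counts.values())
print(tally)
```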
Forty-four genes had no matches to ESTs. We found that 33 have >50 EST
matches per gene and that >60% of genes that demonstrate exon
skipping have large numbers of ESTs matching to them. Although no
relationship between degree of gene expression and extent of skipping
can be determined from this study, the degree to which exon junctions
are represented in transcripts reveals that highly represented genes
demonstrate skipping more frequently. Ten of the 17 (59%) most highly
represented multi-exon genes show exon skipping and of these, three
(18%) express more than one isoform (Table 2,
http://www.sanbi.ac.za/exon_skipping).
To calculate the proportion of mRNAs that may contain a skipped exon we
treat each EST spanning an exon junction as an independent sample of
the exon junctions. The number of times an EST spans an exon junction
is 23,922. The number of times an EST spans a nonconsecutive junction
is 149. The number of times an EST spans a consecutive junction is
23,773. From this we estimate that the probability that a given exon
junction in a given mRNA is nonconsecutive is $f \approx 149/23{,}773$.
There are 2893 exon junctions in the 347 multi-exon genes, therefore
the average number of exon junctions per multi-exon gene is
$m = 2893/347$. The probability that a given multi-exon mRNA has at
least one nonconsecutive exon junction is therefore
$1 - (1 - f)^m = 0.051$. As this estimate is derived
from a large sample of exon junctions, it may be applicable as a
genome-wide estimate.
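
The arithmetic behind this estimate can be reproduced with the counts given above. The sketch below is only a check of the calculation; the variable and function names are ours. It evaluates the expression with f taken as stated in the text (149/23,773) and, for comparison, with f taken relative to all junction-spanning EST matches (149/23,922). This comparison is an addition for illustration and is not part of the original analysis.

```python
# Counts reported in the text.
spanning_total = 23_922   # EST matches spanning any exon junction
nonconsecutive = 149      # EST matches spanning a nonconsecutive junction
consecutive = 23_773      # EST matches spanning a consecutive junction
junctions = 2893          # exon junctions in the multi-exon genes
genes = 347               # multi-exon genes


def p_at_least_one_skip(f, m):
    """Probability that at least one of m independent junctions is nonconsecutive."""
    return 1.0 - (1.0 - f) ** m


m = junctions / genes                    # average junctions per multi-exon gene
f_text = nonconsecutive / consecutive    # ratio as written in the text
f_total = nonconsecutive / spanning_total  # alternative: fraction of all matches

print(round(p_at_least_one_skip(f_text, m), 3))
print(round(p_at_least_one_skip(f_total, m), 3))
```

Both choices of denominator give the same value to three decimal places, so the reported figure of about 5% does not depend on which one is used.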
DISCUSSION
Our approach precisely identifies exon skipping when EST transcript
data that spans exon boundaries is available. The number of ESTs that
cover an exon-exon boundary determines the likelihood of discovering
an exon skip, but capture of exon-skipping events is dependent on the
ratio of low-abundance to high-abundance isoforms of transcripts from
the gene. The depth of transcript representation in EST databases,
level of expression, and number and length of exons all contribute to
the complexity of estimation of the number of genes that may have
exon-skipped expressed transcripts. Estimation of the genome-wide
extent of exon skipping is supported here by 52 of 347 multi-exon genes
(~15%). This conservative estimate reflects the fact that only 68%
of exon-exon junctions have EST coverage, and that this coverage is
skewed towards over-representation of the 3' UTRs. In contrast, 59% of
multi-exon, highly EST-represented genes present exon skipping. If exon
skipping is independent of the level of expression of a gene, then 59%
of all multi-exon genes could exhibit skipping. The fact that
increasing EST coverage results in the detection of increasing numbers
of exon-skipping events indicates that exon-skipped transcripts are
relatively rare. We have estimated the probability of detecting an
exon-skipped transcript from a pool of multi-exon transcripts to be
~5%. More sensitive transcript capture techniques may discover exon
skipping to be far more widespread than the previous estimates of
~10%-~20% (Mironov et al. 1999a; Croft et al. 2000)
(http://industry.ebi.ac.uk/~thanaraj/gene.html), which have been
based on EST frequency-independent measures. Expression studies will
clarify the relationship between level of expression and degree of exon
skipping in transcripts. The diversity of skipped-exon transcript forms
is likely to contribute significantly to the diversity of protein
products encoded by the genome, especially because the ratio of skipped
isoforms of transcripts appears to vary widely, which is likely to have
significant functional impact on the proteins for which they code. At
least 50% of exon skips that we have detected result in in-frame
deletions in the predicted protein products. In 29% of cases, exon
skipping results in a disruption of the reading frame which may change
or disrupt the function of the protein product. Functional roles for
these protein isoforms remain to be explored experimentally.
METHODS
j_explorer (available for download from
http://www.sanbi.ac.za/exon_skipping) was used to assemble exon
constructs from mRNA-annotated genomic sequences produced by the Human
Chromosome 22 Sequencing Group at the Sanger Centre (Chr22.genes.dna file at
http://www.sanger.ac.uk/HGP/Chr22/cwa_archive/Nature_02-12-1999/Chr22Genes.tar.gz). Using a 50-bp tag from the 3' terminus of the preceding exon and a
50-bp tag from the 5' terminus of all downstream exons, a set of all
consecutive and nonconsecutive exon-exon junctions for each gene was
created. Each junction was submitted for similarity searching against
dbEST (human) using BLAST 2.0 (Altschul et al. 1990). By
combining junctions in a consecutive (e.g., exon 1-exon 2 junction)
and nonconsecutive (e.g., exon 1-exon 3 junction) manner the incidence
of exon skipping was assessed. A skipping event is reported when an EST
is detected that does not contain the exon(s) in question, but does
contain an uninterrupted tag made up of 50 bp from each of the flanking
exons. Exon repetition was investigated by creating splice junctions
composed of the concatenation of the 3' and 5' 50-bp splice junctions
of the same exon. ESTs showing significant
($P < 1 \times 10^{-40}$) homology to an exon junction were
extracted and aligned to the corresponding genomic sequence using
sim4 (Florea et al. 1998). To exclude the possibility that
ESTs confirming exon-skipping events were the products of paralogous
genes or members of gene families, all ESTs identifying exon skipping
were confirmed to be unique to a single target gene from Chromosome 22. Both interchromosomal and intrachromosomal specificity of the
transcripts was confirmed using BLAST with a cut-off score
of $1 \times 10^{-30}$. sim4 was employed where
ambiguous matches were encountered. The resulting "unambiguous
transcripts" can therefore be assigned unambiguously to the correct
gene of origin. The effect of these transcripts on the reading frame of
the protein for which they code was assessed for frameshifts and
in-frame deletions. We have confirmed that the results contain no false
positives by manual analysis of all cases of possible misalignments.
All exon skips reported by j_explorer therefore represent
valid skipping events. The identity and genomic location of each of the
ESTs was converted into EMBL format and added as annotation to the
relevant EMBL sequence file. Sequences were then analyzed using
ARTEMIS (Rutherford et al. 2000) and are presented
together with supplemental information, annotated EMBL entries, and
links to ENSEMBL genes and transcripts at
http://www.sanbi.ac.za/exon_skipping. All exon structure annotations
for the genes used (both confirmed and predicted) were confirmed by
manual inspection to be correct. To prevent the detection of skips as a
result of incorrectly annotated exon boundaries we required that an EST
spanning consecutive (or linear) exon boundaries was present in
addition to the ESTs confirming the skip. All linear junctions that
could not be confirmed by ESTs resulted in that junction being excluded
from further analysis. To address data consistency, we confirmed that
in EMBL release 64 (GenBank 119) and 65 (GenBank 121) ~68% of splice
junctions are covered with an EST. This figure does not vary
significantly between the two releases.
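
The junction construction described above can be sketched as follows. This is a minimal illustration under stated assumptions and not the j_explorer code: the function names are hypothetical, exons are assumed to be given as plain sequence strings in transcript order, and exons shorter than the tag length simply contribute their full sequence.

```python
from itertools import combinations

TAG = 50  # bp taken from each side of a junction, as stated in the text


def junction_queries(exons, tag=TAG):
    """Build every consecutive and nonconsecutive exon-exon junction query.

    Returns a dict mapping (upstream, downstream) 1-based exon numbers to the
    3' tag of the upstream exon joined to the 5' tag of the downstream exon.
    """
    queries = {}
    for i, j in combinations(range(len(exons)), 2):
        queries[(i + 1, j + 1)] = exons[i][-tag:] + exons[j][:tag]
    return queries


def repetition_queries(exons, tag=TAG):
    """Join the 3' tag of each exon to its own 5' tag to probe tandem repetition."""
    return {n + 1: exon[-tag:] + exon[:tag] for n, exon in enumerate(exons)}


# Toy exons with a 3-bp tag so the junctions are easy to read.
toy = ["AAACCC", "GGGTTT", "CCCAAA"]
print(junction_queries(toy, tag=3))
print(repetition_queries(toy, tag=3))
```

A gene with n exons yields n(n-1)/2 junction queries under this scheme, each of which would then be searched against the EST database.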
To determine the effect of exon skipping on protein production, the
genomic sequences of the 52 genes with exon-skipping events affecting
the coding sequence were compared to the cognate mRNA using
sim4. Exon-skipping events in the first or last protein-coding exon were recorded as altering the start/stop codon. A
UTR-skipping event was recorded when the exon skipped was located in
either the 3' or 5' UTR. When the skipped exon was located between two
coding exons, framefinder (Slater 2000) was used to
predict open reading frames (ORFs) for the ESTs confirming the skip. If
the ORF showed homology to the protein isoform then it was considered
to be a valid representation of the protein isoform.
BLASTX alignment of the EST and the predicted protein was
used to determine whether or not the exon-skipping event introduced a frameshift.
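
As a rough illustration of the frameshift question, the sketch below uses a simpler check than the framefinder and BLASTX procedure described above: it tests whether the total length of the skipped exon(s) is a multiple of three. This is a swapped-in simplification. It assumes the skipped exons lie entirely within the coding sequence, and the function name is hypothetical.

```python
def frame_effect(skipped_exon_lengths):
    """Classify a skip of fully coding exons by the total number of bases removed."""
    removed = sum(skipped_exon_lengths)
    # Removing a multiple of 3 bases keeps the downstream reading frame intact.
    return "in-frame deletion" if removed % 3 == 0 else "frameshift"


print(frame_effect([120]))      # one skipped exon of 120 bp
print(frame_effect([95]))       # one skipped exon of 95 bp
print(frame_effect([100, 41]))  # two adjacent skipped exons, 141 bp in total
```

A length check of this kind cannot replace alignment against the predicted protein, because it ignores skips that remove the start or stop codon.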
ACKNOWLEDGMENTS
We thank L. Sullivan and M. Ramsay for comments on the manuscript, and S. Bardien-Kruger, T. Broveak, and T-M. Chern for support, suggestions, and helpful discussions throughout the study. This work had financial support from the South African Government through the Department of Arts, Culture, Science, and Technology initiated Innovation Fund Program, grant 32146 (W.A.H.) and the South African National Research Foundation (J.F.K).
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
1 These authors contributed equally to this work.
2 Present address: Center for Bioinformatics, University of Pennsylvania, Philadelphia, PA, USA.
3 Corresponding author.
E-MAIL winhide{at}sanbi.ac.za; FAX .
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.188001.
Received March 9, 2001; accepted in revised form August 24, 2001.