Genome Res. 14:1617-1623, 2004
©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00

Methods

A Non-EST-Based Method for Exon-Skipping Prediction

1 Department of Human Genetics, Sackler Faculty of Medicine, Tel Aviv University, Ramat Aviv 69978, Israel
2 Compugen, Tel Aviv 69512, Israel
3 School of Computer Science, Sackler Faculty of Exact Sciences, Tel Aviv University, Tel Aviv 69978, Israel
It is estimated that between 35% and 74% of all human genes can undergo alternative splicing. Currently, the most efficient methods for large-scale detection of alternative splicing use expressed sequence tags (ESTs) or microarray analysis. As these methods merely sample the transcriptome, splice variants that do not appear in deeply sampled tissues have a low probability of being detected. We present a new method by which we can predict that an internal exon is skipped (namely whether it is a cassette-exon) merely based on its naked genomic sequence and on the sequence of its mouse ortholog. No other data, such as ESTs, are required for the prediction. Using our method, which was experimentally validated, we detected hundreds of novel splice variants that were not detectable using ESTs. We show that a substantial fraction of the splice variants in the human genome could not be identified through current human EST or cDNA data.
Alternative splicing is a mechanism allowing one gene to produce multiple, sometimes functionally distinct, protein variants (Maniatis and Tasic 2002
Although much progress has been made in the field of computational detection of alternative splicing in recent years (for review, see Graveley 2001
Indeed, Johnson and colleagues, who recently investigated the extent of human alternative splicing using large-scale microarray experiments, reported on numerous events of alternative splicing that were not represented in ESTs (Johnson et al. 2003
Comparative genomics has recently proven a useful approach for alternative splicing research (Modrek and Lee 2003
To identify and characterize features that distinguish between alternative and constitutive exons, we used the training exon sets from Sorek and Ast (2003
Table 1 summarizes the major classifying features that we characterized. In short, alternatively spliced exons are flanked by intronic sequences that are more conserved between human and mouse; they are shorter than constitutively spliced exons; their size tends to be a multiple of three; and they have a higher identity level when aligned to their mouse counterpart exon (Fig. 1A-E). These differences probably stem from the unique function of the alternative exons: Because these exons are cassette exons that are sometimes inserted and sometimes skipped, their size should be a multiple of three so that their skipping would not alter the reading frame of the downstream exons. This constraint, which was also recently reported by Resch et al. (2004
The features described above could be used to identify exons that are skipped in the human and the mouse genomes. However, each feature by itself provides only a weak classification for exons. Our goal was to find a combination of features that would detect a substantial fraction of the alternative exons, while making near-zero false-positive detection errors. The features we have chosen are the following: (1) exon length, (2) divisible/not divisible by three, (3) percent identity when aligned to the mouse counterpart, and (4) conservation in the upstream and downstream intronic sequences. Each of the two "intronic conservation" features (upstream and downstream) was divided into two subfeatures: (1) length of the best human/mouse local alignment in the 100 intronic nucleotides nearest to the exon (where only local alignments with at least 12 consecutive perfectly matching nucleotides were considered) and (2) identity level in this local alignment. This gives seven features in total. For each of the features we defined a set of thresholds (see Methods). For example, the "human/mouse exon identity" threshold can be set to 100%, at least 99%, at least 98%, and so forth. Similarly, the thresholds for "length of conserved upstream region" can be set to 100, at least 95, at least 90, and so forth. By using a threshold for each of the seven features above, one gets a classification rule that classifies as alternative all exons that pass all seven thresholds. Such a rule might, for example, be: "all exons that are at least 99% conserved with their mouse counterpart and have at least 95 conserved nucleotides upstream of the exon and are divisible by three and...". We enumerated all possible rules (about 100 million rules) and tested the quality of the resulting classification on our training set of 243 alternative and 1753 constitutive exons. We sought a rule that would correctly identify a maximum number of alternative exons from the training set while making no false-positive identification.
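The exhaustive search described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the `Exon` record, the `passes` and `best_rule` helpers, the toy exons, and the coarse two-value grids are all hypothetical, and the strict "below" comparison on exon length is an assumption based on the wording in Methods.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Exon:
    length: int        # exon length (bp)
    identity: float    # % identity to the mouse counterpart exon
    up_len: int        # length of best upstream intronic local alignment
    up_id: float       # % identity within that upstream alignment
    down_len: int      # length of best downstream intronic local alignment
    down_id: float     # % identity within that downstream alignment
    alternative: bool  # training label


def passes(exon, rule):
    """True if the exon clears all seven thresholds of the rule."""
    (max_len, need_div3, min_id,
     min_up_len, min_up_id, min_down_len, min_down_id) = rule
    return (
        exon.length < max_len
        and (not need_div3 or exon.length % 3 == 0)
        and exon.identity >= min_id
        and exon.up_len >= min_up_len
        and exon.up_id >= min_up_id
        and exon.down_len >= min_down_len
        and exon.down_id >= min_down_id
    )


def best_rule(exons, grids):
    """Try every threshold combination; keep the rule with the most
    true positives among those with zero false positives."""
    best, best_tp = None, -1
    for rule in product(*grids):
        tp = sum(1 for e in exons if e.alternative and passes(e, rule))
        fp = sum(1 for e in exons if not e.alternative and passes(e, rule))
        if fp == 0 and tp > best_tp:
            best, best_tp = rule, tp
    return best, best_tp


# Hypothetical toy training set (values invented for illustration only).
toy_exons = [
    Exon(90, 98.0, 40, 90.0, 20, 100.0, True),
    Exon(120, 96.0, 15, 85.0, 12, 100.0, True),
    Exon(61, 97.0, 0, 0.0, 0, 0.0, True),
    Exon(100, 99.0, 50, 95.0, 30, 100.0, False),
    Exon(150, 88.0, 0, 0.0, 0, 0.0, False),
]

# Hypothetical coarse grids: two values per feature, 2**7 = 128 rules.
toy_grids = (
    (198, 1000),     # exon length below this value
    (False, True),   # require divisibility by three?
    (90.0, 95.0),    # minimum exon identity (%)
    (0, 15),         # minimum upstream alignment length
    (0.0, 85.0),     # minimum upstream alignment identity (%)
    (0, 12),         # minimum downstream alignment length
    (0.0, 100.0),    # minimum downstream alignment identity (%)
)

rule, true_positives = best_rule(toy_exons, toy_grids)
print(rule, true_positives)
```

With finer grids the number of combinations grows multiplicatively, which is how a search over seven features reaches the order of 100 million rules.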
The best rule that emerged was the following: At least 95% identity with the mouse exon counterpart; exon size is a multiple of three; a best local alignment of at least 15 intronic nucleotides upstream of the exon with at least 85% identity; and a perfect match of at least 12 consecutive intronic nucleotides downstream of the exon. This combination of features identified 76 exons, or 31% of the 243 alternatively spliced exons in our training set, whereas none of the 1753 constitutively spliced exons matched these features. To check the robustness of this analysis we employed five-way cross validation (see Supplemental material for details). The average sensitivity in these five analyses was 32.3%, and the average specificity was 99.72%. The above combination of parameters can therefore be used to identify alternatively spliced exons with very high specificity, making less than 0.3% false-positive calls. We note that because the ratio of constitutive to alternative exons in the genome is probably higher than in our training set, and because our training set may have some other unknown bias, the performance in genome-wide application of the rule may be somewhat lower.
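As an illustration, the best rule reported above can be written as a single predicate, and the training-set sensitivity and specificity follow directly from the counts in the text. The function and parameter names below are hypothetical; only the thresholds and counts come from the paragraph above.

```python
def predicted_alternative(exon_len, exon_identity,
                          up_aln_len, up_aln_identity,
                          down_perfect_run):
    """Apply the reported rule to one exon.

    exon_identity and up_aln_identity are percentages;
    down_perfect_run is the longest run of consecutive perfectly
    matching intronic nucleotides downstream of the exon.
    """
    return (
        exon_identity >= 95.0
        and exon_len % 3 == 0
        and up_aln_len >= 15
        and up_aln_identity >= 85.0
        and down_perfect_run >= 12
    )


# Training-set counts quoted in the text.
TRAIN_ALT, TRAIN_CONST = 243, 1753
HITS_ALT, HITS_CONST = 76, 0

sensitivity = HITS_ALT / TRAIN_ALT
specificity = (TRAIN_CONST - HITS_CONST) / TRAIN_CONST
print(f"sensitivity = {sensitivity:.1%}, specificity = {specificity:.1%}")
```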
To test this classifier in a genome-wide manner, as well as to discover novel splice variants in the human genome, we collected a large set of 108,983 human exons for which a mouse counterpart could be identified (see Methods). To ensure the coherence of the analysis, we excluded our training exons from this analysis. For each of the exons, all classifying parameters were calculated. Out of the 108,983 human exons, 952 passed the classification rule and were considered candidate alternative exons. To check whether these exons are indeed alternatively spliced, we searched for human expressed sequences (ESTs or cDNAs) that skip the exons but contain the two flanking exons. For 453 (48%) of the 952 candidate alternative exons there was such skipping evidence. For comparison, only 7% (7495 exons) out of our entire set of 108,983 exons had similar skipping EST evidence. This means that our classification rule indeed substantially enriches for alternatively spliced exons.
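The enrichment claimed here can be checked with simple arithmetic on the counts given in the paragraph. The variable names are illustrative only.

```python
# Counts quoted in the text.
candidates, candidates_with_skip = 952, 453
all_exons, all_with_skip = 108_983, 7_495

rate_candidates = candidates_with_skip / candidates
rate_background = all_with_skip / all_exons
fold_enrichment = rate_candidates / rate_background

print(f"candidates with skipping evidence: {rate_candidates:.1%}")
print(f"all exons with skipping evidence:  {rate_background:.1%}")
print(f"fold enrichment: {fold_enrichment:.1f}")
```

The rule therefore selects exons with EST-supported skipping at roughly seven times the background rate.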
Moreover, there is evidence that EST databases can contain spurious sequences that appear as splice variants but are, in fact, artifacts caused by aberrant splicing. Such splicing artifacts are usually characterized by low EST support, although there are cases in which real, functional splice variants are supported by a single EST (Sorek et al. 2004).

We manually examined the remaining 499 candidate alternative exons (952 - 453) for which no EST/cDNA showing an exon skipping event was found, by using the UCSC genome browser (April 2003). We found that for 190 additional exons (out of the 499) there was a human expressed sequence showing patterns of alternative splicing other than exon skipping [41 cases (22%) of alternative donor/acceptor; 33 cases (17%) of intron retention; 14 cases (7%) of mutually exclusive exons; more complicated types, such as double and triple exon skipping, comprise the remainder]. Thus, for 643 (453 + 190; 68%) of the 952 candidate alternative exons identified by our method, there was independent evidence for alternative splicing in dbEST and RefSeq. But what about the remaining 309 candidate exons for which no EST or cDNA indicating the skipped isoform was found? These can still be rarely expressed alternatively spliced exons, or exons that are specific to a tissue, developmental stage, or condition that is underrepresented in dbEST, so that an EST representing their skipping isoform has not been sequenced yet. Indeed, although on average there were 32 supporting expressed sequences per exon in our general set of 108,983 exons (median 10), the support for the 309 candidate alternative exons was much smaller, averaging 14 sequences (median 7). This shows that the 309 candidate exons are supported by fewer ESTs than the average exon, in accordance with our hypothesis that underrepresentation in dbEST is the cause for not identifying them as alternatively spliced.
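The bookkeeping in this paragraph is easy to lose track of, so the following sketch recomputes it from the quoted counts. The size of the "more complicated types" group is not stated in the text; here it is derived as the remainder, which is an inference rather than a reported figure.

```python
candidates = 952
skipping_evidence = 453

# Exons with expressed-sequence evidence for other splicing patterns.
other_total = 190
other_types = {
    "alternative donor/acceptor": 41,
    "intron retention": 33,
    "mutually exclusive exons": 14,
}
# Remainder attributed to more complicated types (derived, not reported).
complex_cases = other_total - sum(other_types.values())

no_skipping_evidence = candidates - skipping_evidence   # examined manually
supported = skipping_evidence + other_total             # any independent evidence
unsupported = candidates - supported                     # no evidence in dbEST/RefSeq

for name, count in other_types.items():
    print(f"{name}: {count} ({count / other_total:.0%})")
print(f"more complicated types: {complex_cases}")
print(f"supported: {supported} ({supported / candidates:.0%}), "
      f"unsupported: {unsupported}")
```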
To test whether these candidate alternative exons for which no skipping ESTs were found are indeed alternative, we selected 5% of them (15 exons) for experimental verification (Table 2). Only exons with EST support equal to or less than the average (14 sequences) were selected for this verification, as such alternative splicing events are more likely to have been missed in dbEST due to low sampling and not due to their appearance in a transient developmental state or in a rare condition. For each of these 15 exons, primers were designed from the two flanking exons. RT-PCR reactions were carried out with RNA extractions of 14 different tissue types (see Methods). For nine of these exons, a splice variant was detected in at least one of the 14 tissues tested (Fig. 2). In six of the nine cases the variant represented exon skipping. Interestingly, in the other three cases the exon was alternatively spliced, but in a pattern other than exon skipping: two cases (genes BAZ1A and SMARCD1) of alternative acceptor site, and one case (VLDLR) of intron retention. This is consistent with our genome-wide scan, where 453/643 (70%) cases that were identified according to the classifying parameters were exon skipping, whereas the remaining 30% exhibited other types of alternative splicing.
The above experimental results indicate that at least 60% (9/15) of our predictions are true (although this estimate can have a relatively large variance, due to the small size of exon set tested). Some or all of the remaining six exons might also be alternatively spliced, but in a tissue other than the ones we tested, or in an early developmental stage. We therefore believe that the actual prediction rate of this method may be even higher.
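To give a sense of the "relatively large variance" mentioned above, one can attach a confidence interval to the 9/15 validation rate. The Wilson score interval used below is our choice for illustration; the paper does not specify any interval method.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 ~ 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


low, high = wilson_interval(9, 15)
print(f"validated fraction: {9 / 15:.0%}, "
      f"approx. 95% interval: {low:.0%} to {high:.0%}")
```

Under this assumption the interval spans roughly 36% to 80%, which illustrates how loosely a sample of 15 exons constrains the true validation rate.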
The classification rule that was chosen for the experimental verification retrieves alternatively spliced exons with a very high specificity (less than 0.3% false-positive rate) but at the price of a relatively low sensitivity (20%-32%). Other rules can be chosen in which sensitivity is higher, but naturally this would increase the false-positive rate of the prediction. Figure 3 presents a sensitivity versus false-positive rate plot (ROC curve) for different rules selecting for increasing number of alternative exons from our test set of 243 exons. As shown in the figure, it is possible to employ a rule that would identify up to 73% of the alternative exons, but this rule would also retrieve 36% of the constitutively spliced exons (the upper limit of 73% is due to the Boolean nature of the "divisibility by 3" feature). Note that because most of the exons in the human genome are constitutive, such a rule would have low predictability for exon skipping: Assuming, for example, that
Our method is able to identify alternative splicing ab initio. Other computational approaches to detect alternative splicing were previously described, but most of them used ESTs and/or cDNAs, or information from transcripts predicted using ESTs, to predict alternative splicing (e.g., Clamp et al. 2003).

We have described a novel computational method for prediction of alternative splicing. A possible improvement of the method could be the addition of more classifying features. One such feature could be the comparison of the flanking intronic sequences between the human and other genomes. For example, we were able to locate in the chicken genome 72 and 328 exons from our original alternative and constitutive training sets, respectively. Of the 72 alternatively spliced exons, 34 (47%) had conserved sequences in both their upstream and downstream introns when human and chicken genomes were compared; only 10 (3%) of the 328 constitutively spliced exons that could be found in the chicken genome had such intronic conservation (data not shown). Currently, our classifier mainly identifies exon-skipping events in exons conserved between human and mouse. In the future, it could develop into a more general alternative splicing predictor that would identify other types of alternative splicing. The ultimate goal of such a predictor would be genome-based prediction of all splice variants, including their pattern of alternative splicing (i.e., in which tissue the exon would be inserted). This could set the foundations for understanding the absolute number of exons that are alternatively spliced and might ultimately lead to narrowing the gap between the genome and the proteome, and thereby advance toward revealing the full extent of our proteome's complexity.
Enumeration Over Features in Training Set
Training sets of alternatively spliced internal exons and constitutively spliced internal exons were taken from our previous study (Sorek and Ast 2003
The thresholds used in the enumeration of classification rules were as follows: Exon identity thresholds were 100%, at least 99%, at least 98%, and so forth until 80%; exon lengths were below 18 bp, 23 bp, 28 bp,..., 198 bp and 1000 bp; length of human/mouse local alignment of the 100 nearest upstream (or downstream) intronic nucleotides using Sim4 (Florea et al. 1998
Genome-Wide Retrieval of Human and Mouse Orthologous Exons
To find the mouse ortholog for each human exon, we first aligned the mouse expressed sequences from GenBank version 136 to the human genome, as described (Sorek and Ast 2003). Human exons for which no spanning mouse expressed sequence was detected were aligned directly to the mouse genome. Hits that spanned the full length of the exon and were flanked by legal AG/GT or AG/GC splice sites were declared as the orthologous mouse exons.
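The splice-site filter in the last step can be illustrated with a short check on the dinucleotides flanking a candidate hit. This is a sketch under stated assumptions: the function name, the 0-based half-open coordinates, and the plus-strand orientation are ours, and the example sequences are invented.

```python
def has_legal_splice_sites(genome, start, end):
    """Check that the exon genome[start:end] (0-based, half-open,
    plus strand) is preceded by AG and followed by GT or GC."""
    if start < 2 or end + 2 > len(genome):
        return False  # not enough flanking sequence to inspect
    acceptor = genome[start - 2:start].upper()
    donor = genome[end:end + 2].upper()
    return acceptor == "AG" and donor in ("GT", "GC")


# Invented sequences: lowercase = intron, uppercase = exon.
print(has_legal_splice_sites("ttttagATGGCCgtaagt", 6, 12))  # AG ... GT
print(has_legal_splice_sites("ttttagATGGCCgcaagt", 6, 12))  # AG ... GC
print(has_legal_splice_sites("ttttacATGGCCgtaagt", 6, 12))  # AC acceptor
```

A hit on the minus strand would need to be reverse-complemented before applying the same check.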
Altogether, these searches retrieved 108,983 pairs of exons in the human and mouse genomes (this set does not contain the exons from our two training sets). For each such exon, all classifying parameters were calculated as follows. Conservation between exons was calculated from aligning the human and mouse exons using the global alignment program "GAP" of the GCG software package with default parameters (Womble 2000
Reverse Transcription of mRNA Samples
RNA was incubated with a random hexamer primer mix (Invitrogen), denatured at 70°C for 5 min, and transferred to 4°C for hexamer annealing. Reverse transcription was done by Superscript II Reverse transcriptase (Invitrogen) in the presence of RNAsin (Promega) at 37°C for 1 h. Reaction was terminated by enzyme deactivation on beads (Promega).
Amplification of Splicing Products
We thank Amos Tanay, Irit Gat-Viks, and Gideon Dror for fruitful discussion, and Kinneret Savitsky, Dvir Dahary, and Pini Akiva for critical reading. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2572604.
4 Corresponding author. [Supplemental material is available online at www.genome.org.]
Berget, S.M. 1995. Exon recognition in vertebrate splicing. J. Biol. Chem. 270: 2411-2414.
Brett, D., Hanke, J., Lehmann, G., Haase, S., Delbruck, S., Krueger, S., Reich, J., and Bork, P. 2000. EST comparison indicates 38% of human mRNAs contain possible alternative splice forms. FEBS Lett. 474: 83-86.
Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78-94.
Cartegni, L., Chew, S.L., and Krainer, A.R. 2002. Listening to silence and understanding nonsense: Exonic mutations that affect splicing. Nat. Rev. Genet. 3: 285-298.
Clamp, M., Andrews, D., Barker, D., Bevan, P., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., et al. 2003. Ensembl 2002: Accommodating comparative genomics. Nucleic Acids Res. 31: 38-42.
Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M., and Miller, W. 1998. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8: 967-974.
Graveley, B.R. 2001. Alternative splicing: Increasing diversity in the proteomic world. Trends Genet. 17: 100-107.
Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith Jr., R.K., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D., et al. 2003. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31: 5654-5666.
Johnson, J.M., Castle, J., Garrett-Engele, P., Kan, Z., Loerch, P.M., Armour, C.D., Santos, R., Schadt, E.E., Stoughton, R., and Shoemaker, D.D. 2003. Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science 302: 2141-2144.
Kan, Z., Rouchka, E.C., Gish, W.R., and States, D.J. 2001. Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res. 11: 889-900.
Kan, Z., States, D., and Gish, W. 2002. Selecting for functional alternative splices in ESTs. Genome Res. 12: 1837-1845.
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921.
Maniatis, T. and Tasic, B. 2002. Alternative pre-mRNA splicing and proteome expansion in metazoans. Nature 418: 236-243.
Mironov, A.A., Fickett, J.W., and Gelfand, M.S. 1999. Frequent alternative splicing of human genes. Genome Res. 9: 1288-1293.
Modrek, B. and Lee, C. 2002. A genomic view of alternative splicing. Nat. Genet. 30: 13-19.
Modrek, B. and Lee, C.J. 2003. Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nat. Genet. 34: 177-180.
Modrek, B., Resch, A., Grasso, C., and Lee, C. 2001. Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res. 29: 2850-2859.
Nurtdinov, R.N., Artamonova, I.I., Mironov, A.A., and Gelfand, M.S. 2003. Low conservation of alternative splicing patterns in the human and mouse genomes. Hum. Mol. Genet. 12: 1313-1320.
Resch, A., Xing, Y., Alekseyenko, A., Modrek, B., and Lee, C. 2004. Evidence for a subpopulation of conserved alternative splicing events under selection pressure for protein reading frame preservation. Nucleic Acids Res. 32: 1261-1269.
Sorek, R. and Ast, G. 2003. Intronic sequences flanking alternatively spliced exons are conserved between human and mouse. Genome Res. 13: 1631-1637.
Sorek, R. and Safer, H.M. 2003. A novel algorithm for computational identification of contaminated EST libraries. Nucleic Acids Res. 31: 1067-1074.
Sorek, R., Ast, G., and Graur, D. 2002. Alu-containing exons are alternatively spliced. Genome Res. 12: 1060-1067.
Sorek, R., Shamir, R., and Ast, G. 2004. How prevalent is functional alternative splicing in the human genome? Trends Genet. 20: 68-71.
Thanaraj, T.A. and Stamm, S. 2003. Prediction and statistical analysis of alternatively spliced exons. Prog. Mol. Subcell. Biol. 31: 1-31.
Womble, D.D. 2000. GCG: The Wisconsin Package of sequence analysis programs. Methods Mol. Biol. 132: 3-22.
http://genes.mit.edu/GENSCANinfo.html; GENSCAN.
www.ncbi.nlm.nih.gov/dbEST; GenBank version 136 (June 2003).
www.ncbi.nlm.nih.gov/genome/guide/human; Human genome (April 2003 assembly).
Received March 14, 2004; accepted in revised format June 2, 2004.