Published online before print March 9, 2007, 10.1101/gr.6049107
Genome Res. 17:503-509, 2007
©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07 $5.00
OPEN ACCESS ARTICLE
Methods
Genome-wide identification of spliced introns using a tiling microarray
1 Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA; 2 Departments of Genome Sciences and Medicine, University of Washington, Seattle, Washington 98195, USA
The prediction of gene models from genome sequence remains an unsolved problem. One hallmark of eukaryotic gene structure is the presence of introns, which are spliced out of pre-mRNAs prior to translation. The excised introns are released in the form of lariats, which must be debranched prior to their turnover. In the yeast Saccharomyces cerevisiae, the absence of the debranching enzyme causes these lariat RNAs to accumulate. This accumulation allows a comparison of tiling array signals of RNA from the debranching mutant to the wild-type parent strain, and thus the identification of lariats on a genome-wide scale. This approach identified 141 of 272 known introns, confirmed three previously predicted introns, predicted four novel introns (of which two were experimentally confirmed), and led to the reannotation of four others. In many instances, signals from the tiling array delineated the 5' splice site and branchpoint site, confirming predicted gene structures. Nearly all introns that went undetected are present in mRNAs expressed at low levels. Overall, 97% of the significant probes could be attributed either to spliced introns or to genes up-regulated by deletion of the debranching enzyme. Because the debranching enzyme is conserved among eukaryotes, this approach could be generally applicable for the annotation of eukaryotic genes and the detection of novel and alternative splice forms.
Gene annotation remains a formidable challenge following the completion of a whole genome sequence. Annotation typically relies on available expressed sequence tags (ESTs) or other cDNA sequences, alignment to protein sequences, comparative analysis of genomes, or de novo prediction programs that use statistical models to detect codons and conserved motifs for transcription initiation, polyadenylation, and splicing (Brent 2005 5% of the genes and possess a near invariant UACUAAC sequence at the branchpoint, and alternative splicing and pseudogenes are not major concerns (Spingola et al. 1999
Using S. cerevisiae as a model system, we sought to develop a tiling array-based method for the genome-wide detection of pre-mRNA introns. Tiling arrays that contain overlapping oligonucleotide probes covering millions of bases have been used to measure chromosomal copy number changes (Wilson et al. 2006
The basis of our approach is the detection of tiling array signals corresponding to introns, due to the accumulation of a splicing intermediate in the appropriate yeast mutant. Following transcription, primary pre-mRNA transcripts are processed by the spliceosome to remove introns, which are released as lariats. Lariat RNAs must be subsequently debranched prior to their turnover by cellular exonucleases (Fig. 1A). In wild-type cells, the half-life of lariats is short; however, in yeast cells that lack the debranching RNA endonuclease Dbr1, whose activity initiates lariat degradation, lariat RNAs accumulate to high levels (Chapman and Boeke 1991
In order to identify spliced introns via the detection of accumulated lariats, we isolated total RNA from diploid dbr1/dbr1 and DBR+/DBR+ yeast strains, labeled it as double-stranded cDNA, and hybridized the cDNA to S. cerevisiae tiling arrays. Signals enriched in the dbr1 strain were identified and mapped to genomic coordinates. We integrated current intron annotations, the presence of splice signals, and dbr1-specific hybridization patterns to assess the ability of the array to identify intronic regions and novel RNA splice forms. In S. cerevisiae, the great majority of introns are readily identifiable by the presence of conserved splicing signals at the 5' and 3' boundaries and branchpoint (Lim and Burge 2001
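The idea of recognizing an intron from its conserved splicing signals can be illustrated with a small sequence scan. This is a minimal sketch and not the authors' pipeline: it assumes the DNA-sense motifs GTATGT for the 5' splice site and YAG for the 3' splice site (neither consensus is stated in the text), uses TACTAAC for the branchpoint (the UACUAAC sequence mentioned above), and the function names and distance parameters are hypothetical.

```python
import re

# Assumed consensus motifs, written in DNA sense (illustrative only).
FIVE_SS = "GTATGT"                 # assumed canonical 5' splice site
BRANCH = "TACTAAC"                 # branchpoint (UACUAAC in RNA)
THREE_SS = re.compile(r"[CT]AG")   # assumed 3' splice site, YAG


def find_all(seq, motif, start=0):
    """Yield every 0-based start position of motif in seq at or after start."""
    pos = seq.find(motif, start)
    while pos != -1:
        yield pos
        pos = seq.find(motif, pos + 1)


def candidate_introns(seq, min_branch_offset=20, max_tail=200):
    """Return (start, branch, end) tuples for 5'SS ... branchpoint ... YAG.

    start is the first intron base, branch is the start of the branchpoint
    motif, and end is the exclusive index just past the 3' AG. Only the
    first YAG within max_tail bases after each branchpoint is reported.
    """
    seq = seq.upper()
    found = []
    for five in find_all(seq, FIVE_SS):
        for bp in find_all(seq, BRANCH, five + min_branch_offset):
            tail_start = bp + len(BRANCH)
            match = THREE_SS.search(seq, tail_start, tail_start + max_tail)
            if match:
                found.append((five, bp, match.end()))
    return found


toy = "AAAA" + "GTATGT" + "A" * 30 + "TACTAAC" + "A" * 10 + "TAG" + "CCCC"
hits = candidate_introns(toy)
```

A scan like this only proposes candidates; in the approach described here, a candidate is supported when dbr1-enriched array signal falls between the predicted 5' splice site and branchpoint.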
We identified probes significantly enriched in the dbr1 sample using a window-based statistical test. Because we performed multiple, simultaneous statistical tests, we evaluated the array data using the false discovery rate (FDR), which measures the expected proportion of false positives in a set of predictions (Storey and Tibshirani 2003).
We examined the overall correspondence of dbr1-enriched signals with our annotated intronic regions (Fig. 2). At an estimated FDR of 10%, 141 of 272 annotated introns contained significant probes in the dbr1 sample. Significant tiling array signals were detected for all but three of 105 introns in ribosomal protein genes, including 89 in coding regions and 13 in 5' untranslated regions (UTRs). Of 158 introns found in the coding regions of nonribosomal genes, 32 were detected by the tiling array, and of seven introns found in the 5' UTRs of nonribosomal genes, five were detected on the tiling array. In addition, the two introns in snoRNAs were detected.
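Thresholding many per-probe tests at a fixed FDR can be sketched as follows. This is an illustration under stated assumptions: it uses the Benjamini-Hochberg adjustment as a simpler stand-in for the q-value method of Storey and Tibshirani cited above, it does not reproduce the window-based test (which is described only in the Supplemental Methods), and the function names and example P-values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_at(pvalues, fdr=0.10):
    """Flag the tests whose adjusted p-value is at or below the FDR level."""
    return [q <= fdr for q in benjamini_hochberg(pvalues)]


example_p = [0.001, 0.02, 0.04, 0.5, 0.9]   # made-up probe P-values
flags = significant_at(example_p, fdr=0.10)
```

With an FDR of 10%, roughly one in ten of the flagged probes is expected to be a false positive, which is the sense in which the counts above are "at an estimated FDR of 10%".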
In order to assess the specificity and sensitivity of the approach for identifying annotated introns, we also used a receiver-operator characteristic (ROC) plot. In this analysis, we labeled intronic regions "positive," and all other genomic regions were labeled "negative." With these designations, 13,000 probes map to 67 kb in the positive regions (introns), and 2.4 million probes map to 12 Mb in the negative regions (rest of the genome). We used the P-values assigned to each probe based on its enrichment in dbr1 versus DBR+ data sets (Supplemental Methods) as thresholds for the construction of a ROC plot (Supplemental Fig. 1). At each threshold, the proportion of significant probes found in the positive and negative regions was plotted. The area under the ROC curve is typically used to assess the quality of a classification method; a score of 1.0 indicates a perfect classifier, whereas a score of 0.5 indicates a random classifier. The tiling array data scored 0.9. We also performed a more stringent ROC analysis focused on intronic boundaries (Supplemental Fig. 1). Here, we designated intronic regions to be positives, and regions up- and downstream equal to half the size of the intervening intron to be negatives. The ROC score of the tiling array data remained 0.9, although there was a loss of sensitivity and specificity in this second classification at an estimated FDR of 10%.
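The area under the ROC curve can be computed directly from per-probe P-values and positive/negative labels. The sketch below is an assumption-laden illustration rather than the analysis actually run: it uses the rank-based identity that the area equals the probability that a randomly chosen intronic probe has a smaller P-value than a randomly chosen non-intronic probe (ties counted as one half), and the function name and toy data are hypothetical.

```python
def roc_auc(pvalues, is_intron):
    """Area under the ROC curve when probes are ranked by ascending p-value."""
    pairs = sorted(zip(pvalues, is_intron), key=lambda t: t[0])
    n_pos = sum(1 for _, label in pairs if label)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative probe")
    wins = 0.0
    neg_seen = 0  # negatives with a strictly smaller p-value so far
    i = 0
    while i < len(pairs):
        # Collect the group of probes that share the same p-value.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        tied_pos = sum(1 for k in range(i, j) if pairs[k][1])
        tied_neg = (j - i) - tied_pos
        neg_larger = n_neg - neg_seen - tied_neg
        # Each positive beats every negative with a larger p-value and
        # scores one half against negatives tied with it.
        wins += tied_pos * (neg_larger + 0.5 * tied_neg)
        neg_seen += tied_neg
        i = j
    return wins / (n_pos * n_neg)


perfect = roc_auc([0.01, 0.02, 0.5, 0.8], [True, True, False, False])
mixed = roc_auc([0.01, 0.5, 0.02, 0.8], [True, True, False, False])
```

On this scale the toy "perfect" ranking scores 1.0 and an uninformative ranking scores 0.5, matching the interpretation given above for the reported score of 0.9.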
The likeliest reason for a failure to detect an intron in the dbr1-specific signals is that the primary transcript was expressed at too low a level. Alternatively, there could be a relationship between the size of an intron and its identification on the array. To address RNA expression levels, we calculated the average intensity of tiling array probes within the exons of intron-containing genes expressed in DBR+ cells and plotted these intensities versus intron length (Fig. 3A; Supplemental Table 3). Introns that were detected were likely to be either more highly expressed or larger than the mean intron length. At an estimated FDR of 10%, all of the 124 genes (131 introns) for which an intronic signal was not observed were below the mean exonic probe intensity of these genes, and all but one were below the mean intron length. Those genes not transcribed under the culture conditions used or transcribed at a level below the threshold required for significance will be missed, such as HMRa1 (a silent mating cassette that is not expressed) and SPO22 (expressed mainly during meiosis) (Primig et al. 2000
Because the lariat RNA structure could give rise to biases in the array labeling and hybridization process, we looked at the distribution of tiling array signals at annotated intron boundaries and splice signals (Fig. 3B). We calculated the position of significant probes within the 272 annotated introns by normalizing the intron loop and tail lengths to their mean values (210 and 36 nt, respectively). In order to assess signals at intron boundaries, we also considered significant probes that were found within 100-bp regions flanking the introns. At an estimated FDR of 10%, we observed a total of 8947 probes in lariat loop regions, 91 significant probes within 100 bp upstream of the 5' splice site, and 31 significant probes downstream of 3' splice site. One reason for this "spillover" to flanking regions is the inaccuracy of intron annotation. For example, among 91 significant probes found upstream of a 5' splice site, 26 are upstream of the RPL26B intron, which we considered a case for reannotation. Only six significant probes were found in lariat tails. This absence of signal could be due in part to the short length of lariat tails in S. cerevisiae, which average 36 nt. Alternatively, this bias could be due to the exonucleolytic degradation of lariat tails (Chapman and Boeke 1991
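One way to place probes from introns of different sizes on a common axis is a piecewise-linear rescaling of the loop and tail segments to the mean lengths quoted above (210 and 36 nt). This is a sketch of one plausible reading of "normalizing," not necessarily the procedure used; the function name and coordinate convention (offsets along the transcript strand, with the loop running from the 5' splice site to the branchpoint and the tail from the branchpoint to the 3' splice site) are assumptions.

```python
MEAN_LOOP = 210  # nt, mean lariat loop length quoted in the text
MEAN_TAIL = 36   # nt, mean lariat tail length quoted in the text


def normalized_position(probe_pos, five_ss, branch, three_ss):
    """Map a probe coordinate inside an intron onto a mean-intron axis.

    Returns a value in [0, MEAN_LOOP] for loop probes and in
    (MEAN_LOOP, MEAN_LOOP + MEAN_TAIL] for tail probes.
    """
    if not five_ss <= probe_pos <= three_ss:
        raise ValueError("probe lies outside the intron")
    if probe_pos <= branch:
        loop_len = branch - five_ss
        frac = (probe_pos - five_ss) / loop_len if loop_len else 0.0
        return frac * MEAN_LOOP
    tail_len = three_ss - branch  # positive here because probe_pos > branch
    return MEAN_LOOP + (probe_pos - branch) / tail_len * MEAN_TAIL


loop_mid = normalized_position(150, five_ss=100, branch=200, three_ss=220)
tail_mid = normalized_position(210, five_ss=100, branch=200, three_ss=220)
```

After rescaling, probes from all introns can be pooled into a single histogram along the 246-nt mean intron, which is the kind of summary the loop and tail counts above describe.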
We examined a gene model for RPL7A, a dual intron-containing gene, in more detail (Fig. 4A). Significant tiling array probes are found within both introns, and the ratio of dbr1/DBR+ signals illustrates the correspondence of annotated introns with the accumulation of intron-specific signals. The expression of RPL7A in DBR+ cells is shown to demarcate intron-exon boundaries, and the conservation among seven related yeast species is shown to highlight the lack of conservation found in intronic regions (Cliften et al. 2003
Comparative studies have identified novel introns by searching for conserved splice-donor and branchpoint signals among related yeast species (Brachat et al. 2003).
Although the S. cerevisiae genome has been studied extensively by traditional genetics, functional genomic approaches, and comparative sequence analysis of related genomes, we identified at least four likely spliced sequences that were previously undetected. An intron in the 5' UTR of YPR153W contains tiling array signals that lie between a canonical 5' splice site signal and noncanonical branchpoint (5'-AACTAAC) (Fig. 4E; Supplemental Table 2). PTC7 contains an intron within its coding region (Fig. 4F). RT-PCR and sequencing of these products confirmed the presence of both of these introns (Supplemental Fig. 2). The reading frame of PTC7, encoding a mitochondrial protein phosphatase, remains intact before and after splicing, raising the possibility that the mRNA codes for two protein isoforms. We were unable to confirm putative introns in BDF2 (coding region) or YEL023C (5' UTR) (Supplemental Fig. 3D,E).
Array data could also be used to identify misannotated introns. We found three ribosomal genes (RPL26B, RPL20A, and RPL20B) for which the array signals differed from the annotated intron regions, agreeing with other studies (M. Ares, pers. comm.). RPL26B contains a previously annotated 354-nt intron in its 5' UTR (Fig. 4G). However, signals from the array suggest that the intron is 123 nt larger, with a different 5' splice site that coincides with the array signals; the updated annotation for RPL26B was confirmed by RT-PCR and sequencing (Supplemental Fig. 2), and N-terminal sequencing of RPL26B is consistent with this annotation (Otaka et al. 1984).
We characterized a total of 898 significant probes that fell outside of annotated intronic regions. Among them, 88 are due to reannotation (RPL26B, RPL20A, RPL20B, NHP6B/YBR090C) and 236 to newly predicted introns (URA2, BMH2, PTC7, HRB1, YPR153W). Several other genes are apparently up-regulated by DBR1 deletion, including FMP45 (172 probes), HSP12 (47 probes), and SYN8 (69 probes), accounting for 32% of the 898. Overall, 97% of the 9851 significant probes can be attributed to spliced introns or gene up-regulation in the dbr1 strain. Some of the significant probes could belong to bona fide spliced introns that remain to be experimentally verified, or could be due to spurious detection events, such as cases in which a significant dbr1/DBR+ ratio is observed for a gene expressed at a low level.
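The percentages in this accounting can be checked from the probe counts given in the text. The short tally below uses only those reported numbers; the variable names are ours, and treating the three named genes as the full up-regulated set is an assumption that happens to reproduce the stated 32%.

```python
total_significant = 9851   # all significant probes
outside_annotated = 898    # significant probes outside annotated introns

# Probes inside annotated introns (8947 in loops plus 6 in tails).
in_annotated = total_significant - outside_annotated

reannotated = 88           # probes explained by intron reannotation
novel_introns = 236        # probes explained by newly predicted introns
upregulated = {"FMP45": 172, "HSP12": 47, "SYN8": 69}
up_total = sum(upregulated.values())

explained = in_annotated + reannotated + novel_introns + up_total
unexplained = total_significant - explained

pct_upregulated = round(100 * up_total / outside_annotated)
pct_explained = round(100 * explained / total_significant)
```

The tally gives 288 up-regulated probes (32% of 898) and 9565 explained probes (97% of 9851), leaving 286 significant probes unattributed.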
We demonstrate an approach for identifying spliced introns on a genome-wide basis by the detection of lariat signals on a tiling array; the signals arise because of a mutation in the debranching enzyme necessary for lariat turnover. In yeast, this method is capable of identifying more than half of the known introns, providing gene models that typically allow the delineation of both the 5' splice site and the branchpoint. Introns that were not detected lie in genes that are either not expressed or expressed at a low level under the conditions of our experiments. Despite the intense analysis of yeast introns in the decade since the genome sequence became available, our approach predicted novel introns and led to the reannotation of others.
Previous studies have used microarrays to study mRNA splicing and its regulation (Clark et al. 2002
Spliced intronic lariats serve as markers for transcriptional as well as spliceosomal activity. This utility is in marked contrast with total RNA hybridization to tiling arrays (Kapranov et al. 2002
Because splicing in metazoans is significantly more complicated than in yeast, the tiling array method could have limitations when applied to other organisms. For example, alternative splicing would give rise to complex tiling array signals; however, alternative splice forms might be addressed by the identification of subsets of introns within a gene that have distinct dbr1/DBR+ signals. Another concern would be the assignment of 5' and 3' splice sites and branchpoints, which can be highly degenerate (Lim and Burge 2001
Our approach may therefore be applicable to at least a subset of genes in most eukaryotes, and it should be complementary to other approaches to identify introns and annotate genes. The regulation of intron turnover in metazoans via debranching is largely unknown. The Pfam database (Finn et al. 2006
Strains and culturing
S. cerevisiae BY4743 (MATa/MATα his3Δ1/his3Δ1 leu2Δ0/leu2Δ0 lys2Δ0/+ met15Δ0/+ ura3Δ0/ura3Δ0) and its corresponding dbr1 double-deletion strain were obtained from Open Biosystems. Yeast was cultured in rich medium (YPD) at 30°C. Escherichia coli strain DH5α was used in cDNA cloning.
RNA preparation and hybridization
Tiling array data analysis
False discovery rate estimation
RT-PCR
We thank W. Noble and P. Green for helpful discussions, and M. Ares for access to unpublished data. J.H. was supported by an NIH NRSA (F32HG003439-02), P41 RR11823, and a Rosetta Fellowship provided to the University of Washington by Merck Research Laboratories. S.F. is an investigator of the Howard Hughes Medical Institute.
3 These authors contributed equally to this work.
E-mail fields{at}u.washington.edu; fax (206) 543-0754. [Supplemental material is available online at www.genome.org and at http://depts.washington.edu/sfields/supplemental_data/intron_tiling_supplement/.] Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.6049107
Brachat, S., Dietrich, F.S., Voegeli, S., Zhang, Z., Stuart, L., Lerch, A., Gates, K., Gaffney, T., and Philippsen, P. 2003. Reinvestigation of the Saccharomyces cerevisiae genome annotation by comparison to the genome of a related fungus: Ashbya gossypii. Genome Biol. 4: R45.
Brent, M.R. 2005. Genome annotation past, present, and future: How to define an ORF at each locus. Genome Res. 15: 1777-1786.
C. elegans Sequencing Consortium. 1998. Genome sequence of the nematode C. elegans: A platform for investigating biology. Science 282: 2012-2018.
Chapman, K.B. and Boeke, J.D. 1991. Isolation and characterization of the gene encoding yeast debranching enzyme. Cell 65: 483-492.
Christie, K.R., Weng, S., Balakrishnan, R., Costanzo, M.C., Dolinski, K., Dwight, S.S., Engel, S.R., Feierbach, B., Fisk, D.G., Hirschman, J.E., et al. 2004. Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res. 32: D311-D314.
Clark, T.A., Sugnet, C.W., and Ares Jr., M. 2002. Genomewide analysis of mRNA processing in yeast using splicing-specific microarrays. Science 296: 907-910.
Cliften, P., Sudarsanam, P., Desikan, A., Fulton, L., Fulton, B., Majors, J., Waterston, R., Cohen, B.A., and Johnston, M. 2003. Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 301: 71-76.
Conklin, J.F., Goldman, A., and Lopez, A.J. 2005. Stabilization and analysis of intron lariats in vivo. Methods 37: 368-375.
David, L., Huber, W., Granovskaia, M., Toedling, J., Palm, C.J., Bofkin, L., Jones, T., Davis, R.W., and Steinmetz, L.M. 2006. A high-resolution map of transcription in the yeast genome. Proc. Natl. Acad. Sci. 103: 5320-5325.
Davis, C.A., Grate, L., Spingola, M., and Ares Jr., M. 2000. Test of intron predictions reveals novel splice sites, alternatively spliced mRNAs and new introns in meiotically regulated genes of yeast. Nucleic Acids Res. 28: 1700-1706.
Engebrecht, J.A., Voelkel-Meiman, K., and Roeder, G.S. 1991. Meiosis-specific RNA splicing in yeast. Cell 66: 1257-1268.
Finn, R.D., Mistry, J., Schuster-Bockler, B., Griffiths-Jones, S., Hollich, V., Lassmann, T., Moxon, S., Marshall, M., Khanna, A., Durbin, R., et al. 2006. Pfam: Clans, web tools and services. Nucleic Acids Res. 34: D247-D251.
Grate, L. and Ares Jr., M. 2002. Searching yeast intron data at Ares lab Web site. Methods Enzymol. 350: 380-392.
Gresham, D., Ruderfer, D.M., Pratt, S.C., Schacherer, J., Dunham, M.J., Botstein, D., and Kruglyak, L. 2006. Genome-wide detection of polymorphisms at nucleotide resolution with a single DNA microarray. Science 311: 1932-1936.
Halasz, G., van Batenburg, M.F., Perusse, J., Hua, S., Lu, X.J., White, K.P., and Bussemaker, H.J. 2006. Detecting transcriptionally active regions using genomic tiling arrays. Genome Biol. 7: R59.
Hiller, M., Huse, K., Szafranski, K., Jahn, N., Hampe, J., Schreiber, S., Backofen, R., and Platzer, M. 2004. Widespread occurrence of alternative splicing at NAGNAG acceptors contributes to proteome plasticity. Nat. Genet. 36: 1255-1257.
Johnson, J.M., Castle, J., Garrett-Engele, P., Kan, Z., Loerch, P.M., Armour, C.D., Santos, R., Schadt, E.E., Stoughton, R., and Shoemaker, D.D. 2003. Genome-wide survey of human alternative pre-mRNA splicing with exon junction microarrays. Science 302: 2141-2144.
Kampa, D., Cheng, J., Kapranov, P., Yamanaka, M., Brubaker, S., Cawley, S., Drenkow, J., Piccolboni, A., Bekiranov, S., Helt, G., et al. 2004. Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Res. 14: 331-342.
Kapranov, P., Cawley, S.E., Drenkow, J., Bekiranov, S., Strausberg, R.L., Fodor, S.P., and Gingeras, T.R. 2002. Large-scale transcriptional activity in chromosomes 21 and 22. Science 296: 916-919.
Kellis, M., Patterson, N., Endrizzi, M., Birren, B., and Lander, E.S. 2003. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423: 241-254.
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921.
Lim, L.P. and Burge, C.B. 2001. A computational analysis of sequence features involved in recognition of short introns. Proc. Natl. Acad. Sci. 98: 11193-11198.
Manak, J.R., Dike, S., Sementchenko, V., Kapranov, P., Biemar, F., Long, J., Cheng, J., Bell, I., Ghosh, S., Piccolboni, A., et al. 2006. Biological function of unannotated transcription during the early development of Drosophila melanogaster. Nat. Genet. 38: 1151-1158.
Miura, F., Kawaguchi, N., Sese, J., Toyoda, A., Hattori, M., Morishita, S., and Ito, T. 2006. A large-scale full-length cDNA analysis to explore the budding yeast transcriptome. Proc. Natl. Acad. Sci. 103: 17846-17851.
Otaka, E., Higo, K., and Itoh, T. 1984. Yeast ribosomal proteins: VIII. Isolation of two proteins and sequence characterization of twenty-four proteins from cytoplasmic ribosomes. Mol. Gen. Genet. 195: 544-546.
Primig, M., Williams, R.M., Winzeler, E.A., Tevzadze, G.G., Conway, A.R., Hwang, S.Y., Davis, R.W., and Esposito, R.E. 2000. The core meiotic transcriptome in budding yeasts. Nat. Genet. 26: 415-423.
Sabo, P.J., Kuehn, M.S., Thurman, R., Johnson, B.E., Johnson, E.M., Cao, H., Yu, M., Rosenzweig, E., Goldy, J., Haydock, A., et al. 2006. Genome-scale mapping of DNase I sensitivity in vivo using tiling DNA microarrays. Nat. Methods 3: 511-518.
Schumacher, A., Kapranov, P., Kaminsky, Z., Flanagan, J., Assadzadeh, A., Yau, P., Virtanen, C., Winegarden, N., Cheng, J., Gingeras, T., et al. 2006. Microarray-based DNA methylation profiling: Technology and applications. Nucleic Acids Res. 34: 528-542.
Sinha, I., Wiren, M., and Ekwall, K. 2006. Genome-wide patterns of histone modifications in fission yeast. Chromosome Res. 14: 95-105.
Spingola, M. and Ares Jr., M. 2000. A yeast intronic splicing enhancer and Nam8p are required for Mer1p-activated splicing. Mol. Cell 6: 329-338.
Spingola, M., Grate, L., Haussler, D., and Ares Jr., M. 1999. Genome-wide bioinformatic and molecular analysis of introns in Saccharomyces cerevisiae. RNA 5: 221-234.
Storey, J.D. and Tibshirani, R. 2003. Statistical significance for genomewide studies. Proc. Natl. Acad. Sci. 100: 9440-9445.
Wilson, G.M., Flibotte, S., Missirlis, P.I., Marra, M.A., Jones, S., Thornton, K., Clark, A.G., and Holt, R.A. 2006. Identification by full-coverage array CGH of human DNA copy number increases relative to chimpanzee and gorilla. Genome Res. 16: 173-181.
Ye, Y., De Leon, J., Yokoyama, N., Naidu, Y., and Camerini, D. 2005. DBR1 siRNA inhibition of HIV-1 replication. Retrovirology 2: 63.
Received October 18, 2006; accepted in revised format January 3, 2007.