Genome Res. 14:2424-2429, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Letter
The mouse genome: Experimental examination of gene predictions and transcriptional start sites
Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, USA
The completion of the mouse and other mammalian genome sequences will provide necessary, but not sufficient, knowledge for an understanding of much of mouse biology at the molecular level. As a requisite next step in this process, the genes in mouse and their structure must be elucidated. In particular, knowledge of the transcriptional start site of these genes will be necessary for further study of their regulatory regions. To assess whether the current state of mouse genome annotation can support this activity, we identified several hundred gene predictions in mouse with varying levels of supporting evidence and tested them using RACE-PCR. Modifications were made to the procedure to allow pooling of RNA samples, resulting in a scalable procedure. The results illustrate potential errors or omissions in the current 5' end annotations in 58% of the genes detected. In testing experimentally unsupported gene predictions, we were able to identify 58 that are not usually annotated as genes but produced spliced transcripts (approximately 25% success rate). In addition, in many genes we were able to detect novel exons not predicted by any gene prediction algorithm. In 19.8% of the genes detected in this study, multiple transcript species were observed. These data show an urgent need to provide direct experimental validation of gene annotations. Moreover, these results show that direct validation using RACE-PCR can be an important component of genome-wide validation. This approach can be a useful tool in the ongoing efforts to increase the quality of gene annotations, especially transcriptional start sites, in complex genomes.
The sequence of the human genome has been completed, and the mouse genome is rapidly nearing completion (Lander et al. 2001
A canonical set of mouse genes is rapidly coalescing, and similar progress is also being made on other species. Programs contributing to this progress, at both Ensembl and NCBI, make use of full-length cDNAs and add the genes predicted by ab initio algorithms such as Genscan (Burge and Karlin 1997
Current gene predictions and annotation also focus largely on predicting and confirming coding regions. Transcriptional start sites (TSSs) have been identified for some genes by full-length cDNA approaches. Other efforts such as the MGC (http://mgc.nci.nih.gov/
An important facet of the annotation of any genome is the determination of the TSSs of both coding and noncoding RNA genes. Even for many well-characterized genes, TSSs are often unknown. This is of critical importance because of the association between the TSS and the cis-acting sites that regulate transcription. Information about the start sites of all the genes is ultimately required for large-scale computational analysis of sequence patterns that are associated with transcriptional regulatory themes. A specific example will perhaps clarify this point. If a set of microarray data indicates that several hundred genes are expressed together in a pattern, it would be of interest to search the putative promoters of all of these genes for common structural motifs. We cannot do this now because of a lack of confirmed TSSs for many, perhaps even most, genes.
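The promoter search described above can be sketched in a few lines. This is only an illustration of the idea, not a method used in this study: the function names, the window length, the k-mer size, and the toy sequences are all assumptions, and a real analysis would use a proper motif-discovery tool with a background model.

```python
from collections import Counter

# Translation table for complementing a DNA string (toy alphabet, no Ns).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def upstream_window(chrom_seq, tss, strand, length=10):
    """Return `length` bases immediately upstream of a TSS.

    `tss` is a 0-based index into `chrom_seq`. The result is always given
    5'->3' on the transcribed strand, so minus-strand windows are
    reverse-complemented.
    """
    if strand == "+":
        return chrom_seq[max(0, tss - length):tss]
    if strand == "-":
        window = chrom_seq[tss + 1:tss + 1 + length]
        return window.translate(_COMPLEMENT)[::-1]
    raise ValueError("strand must be '+' or '-'")


def shared_kmers(promoters, k=5):
    """Return the sorted k-mers that occur in every promoter sequence."""
    counts = Counter()
    for seq in promoters:
        # Use a set so each k-mer is counted once per promoter.
        counts.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return sorted(kmer for kmer, n in counts.items() if n == len(promoters))


if __name__ == "__main__":
    toy_chromosome = "ACGTTATAAGCATG"  # hypothetical sequence
    print(upstream_window(toy_chromosome, 9, "+", length=5))
    print(shared_kmers(["GGTATAAC", "CTATAAGG", "TATAATTT"], k=5))
```

The sketch makes the dependence on the TSS explicit: every window is defined relative to the `tss` coordinate, so a misannotated start site shifts the window and the wrong sequence is searched.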
Approximately 40% of the known human genes have completely noncoding first exons (Davuluri et al. 2000
Here we describe results from a systematic study in which gene discovery and the understanding of gene structure are approached by integrating existing gene models and ab initio gene predictions into an experimental pipeline geared both to identify the TSSs of known and novel genes and to develop more accurate gene models. Three hundred genes with varying degrees of associated experimental evidence were chosen to test the approach. The structure of the 5' end of a sizable number of both previously known and unknown genes was established by using the methods described here. Based on our results, this approach seems well suited to validating a large number of gene predictions, and to improving the annotation of TSSs in known genes, in a rapid yet reliable manner.
Gene categorization
The 300 gene predictions or annotated genes selected for analysis were grouped into five different categories, based on the quality and quantity of supporting evidence for the gene model (for the selection process, see Methods and ftp://ftp.cshl.org/pub/sequences/mouse/data_for_paper/
Amplification of 5' end of transcripts by RACE-PCR
The result of 5' rapid amplification of cDNA ends (RACE)-PCR on a set of 15 mouse tissues/stages is shown in Table 1. The RACE-PCR fragments were cloned, and eight clones for each amplification were sequenced. Of the 300 genes in all the categories, 106 were successfully amplified. The genes in the EPD set served as our internal positive control. All 13 genes in the EPD category were detected. For all of these genes, at least one splice variant was detected that agreed with the annotated first exon for the corresponding gene in the EPD. We amplified from the 5' cap of the transcript to an internal exon so that the presence of a spliced product would rule out the possibility of genomic contamination. Spliced products for a majority of the well-characterized and curated genes (100% in the EPD and 74% in the RefSeq categories) were successfully detected. Fewer of the category C (24%) and D (26%) genes were detected, which was expected for several possible reasons. First, the predicted gene may not be expressed in the tissues/stages tested, which may also be the case for the 26% of the RefSeq genes and the 35% of category B genes that were not detected. Alternatively, the gene model may have an incorrect structure. Finally, the predicted gene may be a false positive of the prediction algorithm and not truly be an expressed gene.
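The logic of using a spliced product to exclude genomic contamination can be illustrated with a small check on alignment blocks. This is a hedged sketch, not the analysis pipeline used here: the block representation, the function name `is_spliced`, and the `min_intron` threshold of 50 bases are assumptions chosen for illustration.

```python
def is_spliced(blocks, min_intron=50):
    """Report whether an aligned clone sequence spans at least one intron.

    `blocks` is a list of (start, end) genomic intervals (half-open) for the
    aligned segments of one clone. A product amplified from contaminating
    genomic DNA would align as one contiguous block; a spliced transcript
    aligns as two or more blocks separated by intron-sized gaps.
    `min_intron` is an assumed threshold that keeps small alignment gaps
    from being counted as introns.
    """
    ordered = sorted(blocks)
    return any(
        nxt_start - cur_end >= min_intron
        for (_, cur_end), (nxt_start, _) in zip(ordered, ordered[1:])
    )


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    print(is_spliced([(100, 200), (1200, 1300)]))  # two exons, 1000-base gap
    print(is_spliced([(100, 300)]))                # single contiguous block
```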
Annotation at the 5' ends of genes is incomplete
Of the 106 genes successfully detected in this study, 61 (58%) produced 5' RACE-PCR sequences that were longer than the annotation. We analyzed these sequences in relation to their alignment to their associated gene model or prediction (Fig. 1A). In 20 of these 61 genes, we found at least one additional exon upstream of the existing annotation/prediction (Fig. 1B). We extended the annotated first exon of two genes (15%) in the EPD and six genes (30%) in the RefSeq categories by an average of 33 and 48 transcribed bases, respectively. EST or similar evidence in GenBank supported our results concerning all eight of these well-characterized genes whose annotation was in disagreement with our results.
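Deciding whether a RACE-PCR sequence is "longer than the annotation" depends on the strand of the gene. The following minimal sketch shows one way to compute the extension for the simple case in which the RACE sequence extends the annotated first exon contiguously; the function name and the coordinates are hypothetical, and when an additional upstream exon is involved the genomic distance would also include intron sequence.

```python
def five_prime_extension(annotated_tss, race_five_prime, strand):
    """Return how many bases the RACE 5' end lies upstream of the annotated TSS.

    Both positions are genomic coordinates on the same chromosome. A positive
    value means the RACE sequence extends the annotation upstream; zero or a
    negative value means it starts at or downstream of the annotated TSS.
    """
    if strand == "+":
        # Upstream on the plus strand means a smaller coordinate.
        return annotated_tss - race_five_prime
    if strand == "-":
        # Upstream on the minus strand means a larger coordinate.
        return race_five_prime - annotated_tss
    raise ValueError("strand must be '+' or '-'")


if __name__ == "__main__":
    # Hypothetical coordinates, for illustration only.
    print(five_prime_extension(1000, 952, "+"))   # plus-strand gene
    print(five_prime_extension(1000, 1033, "-"))  # minus-strand gene
```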
Additionally, for two other genes in the EPD category, alternative TSSs defining an exon located in the annotated first intron were detected by RACE-PCR (Fig. 1B). The 5' RACE-PCR sequence obtained for the c-myc oncogene (GenBank accession no. NM_010849) identifies an exon that has supporting evidence in the form of one EST, and our mapped TSS has previously been reported as an alternate transcription start site in the EPD database (ID nos. EP14066 and EP14067). However, this exon is not represented in the RefSeq annotation. In the second case, sequence obtained for the myb oncogene (GenBank accession no. NM_010848) indicates an alternate first exon, which has no experimental support in GenBank or EPD (see Supplemental Fig. 3). Interestingly, this exon sequence is conserved in human but not in Takifugu rubripes. In the RefSeq category, four genes were observed in which the first exon was located in the first intronic region. Altogether, we detected 23 cases across all the categories (Fig. 1B) in which the 5' RACE sequence aligned to the first intronic region. These findings show the limitations of current annotation, which must collate various empirical data sources (such as ESTs) and computer predictions in an attempt to determine the TSS of a gene.
Detection of novel genes
Comparison of 5' RACE-PCR sequences with full-length RIKEN cDNA
After the initiation of this study, at least one full-length cDNA was submitted by the RIKEN group for 28 of the 237 genes tested in categories C and D. Of the 58 category C and D genes that were detected in this study, nine overlapped with RIKEN clones. The TSS suggested by RACE-PCR agreed with that of the RIKEN clone for all nine genes (data not shown). In addition, eight of the 237 genes tested in categories C and D have subsequently been annotated as RefSeq genes. However, our data indicate that the RefSeq annotation of the start sites of four of these eight genes is incomplete.
First exon variation
CpG island association with first exons
We have carried out a study both to determine the TSSs of a number of mouse genes and to assess the accuracy of current mouse genome annotation. The results show that there are considerable deficiencies in the current annotation of TSSs. For 61 of the 106 genes (58%) successfully amplified (Table 1), the transcripts detected by 5' RACE-PCR showed that the TSS annotation was incomplete. This is not surprising, since in the absence of full-length cDNA information, the annotation of transcription start sites must collate and evaluate data from disparate sources, assessing their relative quality in making a "final" determination of the TSS. We would contend that this is an extremely difficult task. Since a substantial number of genes are not represented in the publicly available full-length cDNA sets, misannotated TSSs can be expected to be very common. Our results suggest that a viable way to address this issue rigorously is a systematic program of directed RACE-PCR, based on known gene structures and/or gene predictions, that will aid in "correcting" the existing gene annotation. Such corrections will capture either previously misannotated structures or, alternatively and importantly, alternate TSSs for correctly annotated genes. Such variation in start sites undoubtedly exists for many genes, and a directed approach is probably the only way to capture it. Only with such data in hand will we be able to begin carrying out large-scale computational studies on likely motifs used in the coregulation of expression of large sets of genes.
In addition, we found that the conservative nature of gene annotation in general leads to a very low false-positive rate but a surprisingly high number of missed genes. It is interesting to note that the current estimates of gene numbers in mammalian genomes are lower than previously thought (Lander et al. 2001
We chose RACE-PCR rather than reverse transcriptase (RT)-PCR to carry out confirmation of gene predictions for two major reasons. First, in addition to confirming a transcript, this approach would provide information about the TSS. Second, the work of Vidal and colleagues in experimental validation of gene predictions in C. elegans (Reboul et al. 2001
The optimized 5' RACE-PCR protocol used to obtain the present data was designed to be performed in a high-throughput format. Figure 2 provides an overview of the workflow that was used in this study. Other than the production of 5' RACE cDNA from specific tissues, all steps were carried out in 96-well format with liquid handling performed by the Biomek FX robotic workstation or a multichannel digital pipettor. Based on the procedures and workflow we have optimized in the course of this work, we estimate that two people can obtain 500-1000 5' RACE-PCR sequences per week.
There are 6000 Twinscan and GenomeScan predictions matching the criteria that we have used for assigning gene predictions to categories C and D (ftp://ftp.cshl.org/pub/sequences/mouse/data_for_paper/ 25%), we estimate that there are likely to be at least 1450 additional genes in the mouse genome that are not part of the current annotated gene set. Our results clearly suggest that a significant fraction of the transcriptome may go undetected if gene verification strategies are restricted to testing only those gene predictions that have highly conserved structures or previous EST matches. It has previously been reported that three-quarters of all human genes (and by extrapolation, mouse genes) can be recognized in the Fugu genome (Aparicio et al. 2002 21% of the 58 novel genes detected in this study have identifiable homologs in the sequenced Fugu genome. This may reveal a circularity in current gene annotation efforts that has the effect of pushing gene estimates unrealistically downward.
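The extrapolation in this paragraph can be reproduced from counts reported in Results: 58 of the 237 category C and D predictions tested were detected, and 6000 predictions match the same criteria. The sketch below is plain arithmetic under the assumption that the tested predictions are representative of all 6000; the function name is hypothetical.

```python
def estimate_additional_genes(detected, tested, candidate_predictions):
    """Scale an observed detection rate up to a larger pool of predictions.

    Assumes the tested predictions are a representative sample of the pool,
    which is a simplification made for illustration.
    Returns (success_rate, estimated_gene_count).
    """
    success_rate = detected / tested
    return success_rate, candidate_predictions * success_rate


if __name__ == "__main__":
    rate, estimate = estimate_additional_genes(
        detected=58,                 # category C and D genes detected
        tested=237,                  # category C and D genes tested
        candidate_predictions=6000,  # predictions matching the C/D criteria
    )
    print(f"success rate: {rate:.1%}")
    print(f"estimated additional genes: {estimate:.0f}")
```

The detection rate works out to just under 25%, and the scaled estimate is a little under 1500, which is consistent with the figure of at least 1450 additional genes given in the text.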
Three hundred annotated genes or gene predictions were chosen for analysis. Some of these were selected to allow an evaluation of the quality of the annotation describing the TSS of well-characterized genes, while others represent predictions chosen to allow an estimate of the unannotated genes that might exist in the genome. Table 1 describes the criteria for categorizing the genes into different groups based on the quality and confidence level of information associated with each known gene or gene prediction (for data sources, and detailed information about gene selection, see Supplemental information; the complete gene set is available at ftp://ftp.cshl.org/pub/sequences/mouse/data_for_paper/
Primer design
RACE protocol
RACE-PCR amplification
Cloning and sequencing of RACE-PCR products
Sequence analysis
CpG island analysis
This work was supported by grant 3 U54 HG02135 from the National Human Genome Research Institute. We thank Damon J. Kelly for assistance in programming PERL modules used in primer selection. We would also like to thank William Tansey and Lincoln Stein for critical reading of the manuscript.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.3158304.
1 These authors contributed equally.
2 Corresponding author. [Supplemental material is available online at www.genome.org. The sequence data from this study have been submitted to dbEST/GenBank under accession nos. CV303589-CV309218.]
Antequera, F. and Bird, A. 1993. Number of CpG islands and genes in human and mouse. Proc. Natl. Acad. Sci. 90: 11995-11999.
Aparicio, S., Chapman, J., Stupka, E., Putnam, N., Chia, J.M., Dehal, P., Christoffels, A., Rash, S., Hoon, S., Smit, A., et al. 2002. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297: 1301-1310.
Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78-94.
Carninci, P., Shiraki, T., Mizuno, Y., Muramatsu, M., and Hayashizaki, Y. 2002. Extra-long first-strand cDNA synthesis. Biotechniques 32: 984-985.
Carninci, P., Waki, K., Shiraki, T., Konno, H., Shibata, K., Itoh, M., Aizawa, K., Arakawa, T., Ishii, Y., Sasaki, D., et al. 2003. Targeting a complex transcriptome: The construction of the mouse full-length cDNA encyclopedia. Genome Res. 13: 1273-1289.
Davuluri, R.V., Suzuki, Y., Sugano, S., and Zhang, M.Q. 2000. CART classification of human 5' UTR sequences. Genome Res. 10: 1807-1816.
Davuluri, R.V., Grosse, I., and Zhang, M.Q. 2001. Computational identification of promoters and first exons in the human genome. Nat. Genet. 29: 412-417.
Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M., and Miller, W. 1998. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8: 967-974.
Kampa, D., Cheng, J., Kapranov, P., Yamanaka, M., Brubaker, S., Cawley, S., Drenkow, J., Piccolboni, A., Bekiranov, S., Helt, G., et al. 2004. Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Res. 14: 331-342.
Kapranov, P., Cawley, S.E., Drenkow, J., Bekiranov, S., Strausberg, R.L., Fodor, S.P., and Gingeras, T.R. 2002. Large-scale transcriptional activity in chromosomes 21 and 22. Science 296: 916-919.
Kent, W.J. 2002. BLAT: The BLAST-like alignment tool. Genome Res. 12: 656-664.
Korf, I., Flicek, P., Duan, D., and Brent, M.R. 2001. Integrating genomic homology into gene structure prediction. Bioinformatics 17: S140-S148.
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921.
Larsen, F., Gundersen, G., Lopez, R., and Prydz, H. 1992. CpG islands as gene markers in the human genome. Genomics 13: 1095-1107.
Reboul, J., Vaglio, P., Tzellas, N., Thierry-Mieg, N., Moore, T., Jackson, C., Shin-i, T., Kohara, Y., Thierry-Mieg, D., Thierry-Mieg, J., et al. 2001. Open-reading-frame sequence tags (OSTs) support the existence of at least 17,300 genes in C. elegans. Nat. Genet. 27: 332-336.
Reboul, J., Vaglio, P., Rual, J.F., Lamesch, P., Martinez, M., Armstrong, C.M., Li, S., Jacotot, L., Bertin, N., Janky, R., et al. 2003. C. elegans ORFeome version 1.1: Experimental verification of the genome annotation and resource for proteome-scale protein expression. Nat. Genet. 34: 35-41.
Rinn, J.L., Euskirchen, G., Bertone, P., Martone, R., Luscombe, N.M., Hartman, S., Harrison, P.M., Nelson, F.K., Miller, P., Gerstein, M., et al. 2003. The transcriptional activity of human chromosome 22. Genes & Dev. 17: 529-540.
Rozen, S. and Skaletsky, H. 2000. Primer3 on the WWW for general users and for biological programmers. In Bioinformatics methods and protocols: Methods in molecular biology (eds. S. Krawetz and S. Misner), pp. 365-386. Humana Press, Totowa, NJ.
Shoemaker, D.D., Schadt, E.E., Armour, C.D., He, Y.D., Garrett-Engele, P., McDonagh, P.D., Loerch, P.M., Leonardson, A., Lum, P.Y., Cavet, G., et al. 2001. Experimental annotation of the human genome using microarray technology. Nature 409: 922-927.
Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A., et al. 2001. The sequence of the human genome. Science 291: 1304-1351.
Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520-562.
Yeh, R.F., Lim, L.P., and Burge, C.B. 2001. Computational inference of homologous gene structures in the human genome. Genome Res. 11: 803-816.
Zavolan, M., van Nimwegen, E., and Gaasterland, T. 2002. Splice variation in mouse full-length cDNAs identified by mapping to the mouse genome. Genome Res. 12: 1377-1385.
Zhang, T., Haws, P., and Wu, Q. 2004. Multiple variable first exons: A mechanism for cell- and tissue-specific gene regulation. Genome Res. 14: 79-89.
http://mgc.nci.nih.gov/; Mammalian Gene Collection.
ftp://ftp.cshl.org/pub/sequences/mouse/data_for_paper/; Author's additional mouse data.
Received August 17, 2004; accepted in revised form September 23, 2004.