Genome Res. 14:661-664, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Methods
Accurate Identification of Novel Human Genes Through Simultaneous Gene Prediction in Human, Mouse, and Rat
1 Department of Electrical Engineering, University of California-Berkeley, Berkeley, California 94720, USA
2 Department of Mathematics, University of California-Berkeley, Berkeley, California 94720, USA
3 Affymetrix Inc., Emeryville, California 94608, USA
4 Fraunhofer-Chalmers Centre, SE-412 88 Gothenburg, Sweden
5 Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA
We describe a new method for simultaneously identifying novel homologous genes with identical structure in the human, mouse, and rat genomes by combining pairwise predictions made with the SLAM gene-finding program. Using this method, we found 3698 gene triples in the human, mouse, and rat genomes which are predicted with exactly the same gene structure. We show, both computationally and experimentally, that the introns of these triples are predicted accurately as compared with the introns of other ab initio gene prediction sets. Computationally, we compared the introns of these gene triples, as well as those from other ab initio gene finders, with known intron annotations. We show that a unique property of SLAM, namely that it predicts gene structures simultaneously in two organisms, is key to producing sets of predictions that are highly accurate in intron structure when combined with other programs. Experimentally, we performed reverse transcription-polymerase chain reaction (RT-PCR) in both the human and rat to test the exon pairs flanking introns from a subset of the gene triples for which the human gene had not been previously identified. By performing RT-PCR on orthologous introns in both the human and rat genomes, we additionally explore the validity of using RT-PCR as a method for confirming gene predictions.
The difficulty of accurate ab initio gene finding has been well documented (e.g., Mathe et al. 2002
Although comparative gene finders use sequence data from multiple genomes, most predict in only one genome at a time. Among gene finders that have been used to annotate entire genomes, a unique characteristic of the SLAM gene finder is its simultaneous prediction of genes having identical structure in two genomes. With the addition of a third genome, combining the results from two SLAM runs allows for the prediction of genes having identical structure in all three genomes. Previous studies (e.g., Rogic et al. 2002
Finally, we show that our strategy for gene prediction using the human, mouse, and rat genomes leads to 924 novel human gene predictions (along with corresponding mouse and rat orthologs). One intron from each of a subset of these genes (48 in human and the corresponding 48 in rat) was experimentally tested by reverse transcription-polymerase chain reaction (RT-PCR) sequencing. Combined with our computational analysis, the experiments suggest that up to roughly 80% of our novel gene predictions correspond to transcribed sequence. Furthermore, the design of our experiments (simultaneous RT-PCR in both human and rat tissue) provides a method for concurrent validation of the RT-PCR technique for identifying novel gene orthologs.
A homology map was constructed for the human (November 2002), mouse (February 2002), and rat genomes (November 2002; Bray and Pachter, 2004
In order to compare SLAM with other gene finders, whole-genome gene sets were obtained from the UCSC Genome Browser Web site (Kent et al. 2002
All available human (November 2002) gene sets from ab initio gene finders (Geneid, Genscan, SGP, and Twinscan), as well as from evidence-based methods (ENSEMBL, Known genes, and RefSeq), were obtained. The ab initio sets along with the SLAM hm and SLAM hr sets were each compared pairwise to produce "consensus" sets of gene predictions that contained genes predicted identically in human in two different sets. The accuracies of the introns of the ab initio gene prediction sets and the consensus sets were measured by comparison with the introns of the human RefSeq gene set.
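The pairwise consensus construction and the intron-accuracy measurement described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the representation of a gene as a tuple of exon coordinates, the function names, and the toy coordinates are all hypothetical. Accuracy is taken here to be the fraction of predicted introns overlapping a reference intron that match it exactly, which is one reading of how the measurement is described.

```python
from typing import Set, Tuple

Exon = Tuple[int, int]                    # (start, end), inclusive coordinates
Gene = Tuple[str, str, Tuple[Exon, ...]]  # (chromosome, strand, exons sorted by start)
Intron = Tuple[str, str, int, int]        # (chromosome, strand, start, end)


def consensus(set_a: Set[Gene], set_b: Set[Gene]) -> Set[Gene]:
    """Genes predicted with exactly the same exon structure in both sets."""
    return set_a & set_b


def introns(genes: Set[Gene]) -> Set[Intron]:
    """Introns implied by each pair of consecutive exons."""
    found = set()
    for chrom, strand, exons in genes:
        for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
            found.add((chrom, strand, left_end + 1, right_start - 1))
    return found


def overlaps(a: Intron, b: Intron) -> bool:
    """True if two introns share at least one base on the same strand."""
    return a[0] == b[0] and a[1] == b[1] and a[2] <= b[3] and b[2] <= a[3]


def intron_accuracy(predicted: Set[Intron], reference: Set[Intron]) -> float:
    """Exact matches divided by predicted introns that overlap any reference intron."""
    overlapping = [p for p in predicted if any(overlaps(p, r) for r in reference)]
    if not overlapping:
        return 0.0
    exact = sum(1 for p in overlapping if p in reference)
    return exact / len(overlapping)


# Toy data (hypothetical coordinates): the two runs agree on the first gene only.
slam_hm = {
    ("chr1", "+", ((100, 200), (300, 400))),
    ("chr1", "+", ((1000, 1100), (1250, 1300))),
}
slam_hr = {
    ("chr1", "+", ((100, 200), (300, 400))),
    ("chr1", "+", ((1000, 1100), (1260, 1300))),
}
refseq = {
    ("chr1", "+", ((100, 200), (300, 400), (500, 600))),
    ("chr1", "+", ((1000, 1100), (1250, 1300))),
}

triples = consensus(slam_hm, slam_hr)
hr_accuracy = intron_accuracy(introns(slam_hr), introns(refseq))
consensus_accuracy = intron_accuracy(introns(triples), introns(refseq))
```

In this toy case the consensus set is smaller than either input set but drops the one intron that the human/rat run got wrong, which mirrors the accuracy-for-sensitivity trade that the consensus sets make.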
The consensus set for the SLAM hm and hr runs contained 3698 genes. It is important to note that by virtue of the SLAM constraints this set consisted of genes in human, mouse, and rat, all predicted to have exactly the same structure. In comparison with the set of all SLAM hm predictions, the consensus set is enriched for single-exon genes but has a similar distribution of coding sequence length. In the interest of finding novel genes, this set was filtered for those predictions that did not overlap at all with genes in the ENSEMBL, Known genes, and RefSeq sets (Guigó et al. 2003
We set out to confirm, using RT-PCR, one pair of exons flanking an intron from each of a subset of the filtered ortholog set in order to get an experimental estimate of the accuracy of these predictions. Using Primer3 (Rozen and Skaletsky 1996
Source RNA was pooled from 20 human tissues including adrenal gland, bone marrow, brain cerebellum, brain (whole), fetal brain, fetal liver, heart, kidney, liver, lung, placenta, prostate, salivary gland, skeletal muscle, spleen, testis, thymus, thyroid gland, trachea, and uterus (Clontech Human Total RNA master panel II), and 18 rat tissues including 10-12-d embryo, adrenal gland, bladder, brain (whole), brain cerebellum, colon, heart, kidney, liver, lung, ovary, spleen, testicle, thymus (Clontech), mammary gland, pancreas, placenta, and prostate (Ambion). Reverse transcription (RT) reactions were primed with oligo-dT using Superscript II reverse transcriptase (Invitrogen). The RT reactions were followed by PCR using the Clontech Advantage 2 PCR Enzyme System. The PCR program consisted of 95°C for 30 sec, followed by 35 cycles of 95°C for 10 sec and 68°C for 30 sec, with a final extension at 72°C for 1 min. The pair of exons flanking each intron to be tested was amplified with specific primers. RT-PCR products were examined by agarose gel electrophoresis (Figure 2, below). Kodak Digital Software was used to estimate the product sizes. PCR products were purified with a QIAquick 96-well PCR purification kit (QIAGEN) and sequenced using both forward and reverse primers for each predicted gene.
The amplified sequences were compared with the original SLAM predictions to verify the identity of recovered products. Sequence alignments were computed using standard penalties (match +1, mismatch -1, gap -2, gapExtend -1) and the resulting alignments were considered "valid" if they were at least 40 bp long, overlapped the boundaries of the predicted intron with its flanking exons, and contained 75% sequence similarity (determined by counting the number of matches and dividing by the alignment length). An intron was considered to be verified if the sequenced product had a valid alignment with the predicted product. The gene predictions for which the introns were tested were also subject to further analysis in the form of BLAST alignments against standard databases, and comparison with other existing gene annotations and EST evidence.
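The validity criteria above can be expressed as a small check. The sketch below assumes that an alignment is already available as two equal-length gapped strings together with its coordinates on the predicted spliced product; the function and parameter names are hypothetical. Two further assumptions are labeled in the code: "contained 75% sequence similarity" is read as "at least 75%", and "overlapped the boundaries of the predicted intron" is read as spanning the exon-exon junction in the predicted product. The alignment step itself and its scoring penalties are not reproduced.

```python
def percent_identity(aligned_pred: str, aligned_seq: str) -> float:
    """Matches divided by alignment length; gap columns count toward the length."""
    if len(aligned_pred) != len(aligned_seq):
        raise ValueError("aligned strings must have equal length")
    if not aligned_pred:
        return 0.0
    matches = sum(
        1 for a, b in zip(aligned_pred, aligned_seq) if a == b and a != "-"
    )
    return matches / len(aligned_pred)


def is_valid_alignment(
    aligned_pred: str,
    aligned_seq: str,
    pred_start: int,
    pred_end: int,
    junction: int,
    min_length: int = 40,
    min_identity: float = 0.75,
) -> bool:
    """Apply the three validity criteria to one alignment.

    pred_start, pred_end: 0-based half-open interval of the predicted product
        covered by the alignment (hypothetical convention).
    junction: index in the predicted product of the first base of the
        downstream exon, i.e. where the intron was spliced out.
    """
    long_enough = len(aligned_pred) >= min_length
    # Assumption: the alignment must include bases on both sides of the junction.
    spans_junction = pred_start < junction < pred_end
    # Assumption: the 75% threshold is inclusive.
    similar_enough = percent_identity(aligned_pred, aligned_seq) >= min_identity
    return long_enough and spans_junction and similar_enough


# Toy example: a 60-bp predicted product with the exon-exon junction at position 30.
predicted = "ACGT" * 15
full_match = is_valid_alignment(predicted, predicted, 0, 60, 30)
one_exon_only = is_valid_alignment(predicted[:30], predicted[:30], 0, 30, 30)
poor_match = is_valid_alignment(predicted, "T" * 60, 0, 60, 30)
```

Under this reading, an intron counts as verified when at least one sequenced product yields an alignment for which the check returns true.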
The results of the SLAM gene finding runs, the comparison of the intron predictions with known introns, and the confirmation of the intron predictions by RT-PCR are summarized in Table 1. The accuracy of the introns of all available ab initio whole-genome gene prediction sets and of the consensus sets generated from each pair is shown in Figure 1. A companion Web site at http://hanuman.math.berkeley.edu/~cdewey/SLAMHMR/index.html
We mention a few interesting examples. The gene M4H1U1D4r70.005 contains five exons and has an intron that was validated only in rat. The gene lies inside an intron of a known gene (NMNAT) but on the opposite strand. The gene M16H3U2D1r112.003 (validated in both human and rat) is known only in mouse, but the human/mouse gene predictions align with 97% identity, and the human/rat predictions also align with 97% identity. In fact, the prediction is part of an 18-exon gene (>1000 amino acids) that was known only in mouse. This illustrates the power of the comparative method not only to identify novel genes, but also to extend annotations from one organism to another.
Our analysis of the accuracy of the introns of currently available whole-genome ab initio gene prediction sets and consensus sets generated from them reveals several important facts. First, unsurprisingly, comparative gene finders produce more accurate intron predictions than noncomparative ab initio gene finders. Every comparative gene prediction set analyzed (SGP, SLAM hm, SLAM hr, and Twinscan) had higher intron accuracy than the noncomparative gene prediction sets (Genscan and Geneid). In terms of exact intron predictions, the noncomparative gene prediction sets had a mean accuracy of 68% whereas the comparative sets had a mean accuracy of 77%. The consensus gene prediction sets greatly improve accuracy (up to 98% accuracy), but at a large loss of sensitivity (the most accurate sets had just below 1000 introns overlapping a RefSeq intron). Figure 1 shows that similar gene finders, such as Twinscan and Genscan or SGP and Geneid, give rise to larger consensus sets with lower accuracy. It follows from this observation that the most dissimilar gene finders can be combined to give the most accurate consensus sets. A case of combining the results from extremely similar gene finders is the SLAM hm/SLAM hr consensus set, which utilizes the same exact gene finder but different comparative data sets. Of all the consensus sets, this set is the largest and has the lowest accuracy. It strikes a good compromise between accuracy and sensitivity, with about 8500 intron predictions overlapping RefSeq introns and an accuracy of 90%, a 10% improvement over the accuracy obtained by the best comparative gene finders by themselves. The largest consensus set with accuracy above 91% (Twinscan/SGP) has only 4806 predictions. It is important to note that the SLAM hm/SLAM hr set is also unique in that it represents simultaneous predictions of orthologs in the human, mouse, and rat genomes.
The fact that SLAM is required for all consensus sets with accuracy greater than 96% indicates that SLAM is quite different from other gene finders. The most accurate consensus sets had accuracies up to 98% and were those resulting from combining SLAM with the two noncomparative gene finders, Genscan and Geneid. Some aspect of SLAM's comparative nature must account for its uniqueness, as SLAM and Genscan are based on very similar gene models. As the consensus sets involving the other comparative gene finders had lower accuracy than those involving SLAM, it is likely that it is SLAM's ability to predict simultaneously in two genomes (Pachter et al. 2002
In other computational analyses where we have analyzed exon and whole-gene structure accuracy, the results are essentially the same as for the intron analysis (see Supplemental Data). Consensus sets including SLAM have the highest accuracies: up to 95% at the exon level, and 83% at the whole-gene level. Interestingly, consensus-set intron accuracy was greater than consensus-set exon accuracy, and nonconsensus-set intron accuracy was lower than nonconsensus-set exon accuracy. This suggests that introns are generally harder to predict accurately than exons, but that by using gene-finder consensus, the task becomes much easier. This seems important in light of the fact that RT-PCR experiments validate introns and not exons.
Our RT-PCR intron validation rates (60% in human, 71% in rat, 66% overall, and 73% for intron pairs when requiring validation in only one organism) are encouraging compared to the rates obtained in a previous study by Guigó et al. (2003
Although we did not explicitly study alternative splicing here, we have implemented a sampling strategy for SLAM in which it is possible to sample alternative orthologous transcripts instead of obtaining just one prediction (Cawley and Pachter 2003
By using our intron accuracy rates obtained both computationally and experimentally, we can make an estimate of the number of novel gene predictions obtained by our method. If we assume that the RefSeq gene annotation set is somewhat representative of the entire human gene set, then we estimate from our computational analysis that, given that a SLAM hm/SLAM hr predicted intron overlaps with a real intron, it will match that intron exactly 90% of the time. Because of our RT-PCR procedures, we could only validate predicted introns that were exactly correct. Therefore, given a 73% validation rate for our predicted introns (where we consider an intron validated if the predicted product is sequenced in either organism), we estimate that 0.73/0.90 = 81% of the introns in the SLAM hm/SLAM hr set overlap with a real gene. With 322 genes from the filtered ortholog set (potentially novel genes) possessing one or more introns, a rough estimate of the number of novel multiexon genes that can be discovered and validated (by RT-PCR validation of an intron) through this method is 322 × 81% ≈ 260.
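The estimate above is a short chain of arithmetic, reproduced here as a sketch. The variable names are illustrative; the input values are the ones quoted in the text. Without intermediate rounding the product comes out near 261, which is consistent with the rough figure of 260 given in the text.

```python
# Values quoted in the text.
validation_rate = 0.73        # introns validated by RT-PCR in either organism
exact_given_overlap = 0.90    # predicted intron is exact, given it overlaps a real intron
multiexon_candidates = 322    # filtered ortholog set genes with one or more introns

# Only exactly correct introns can validate, so
# validation_rate = overlap_fraction * exact_given_overlap.
overlap_fraction = validation_rate / exact_given_overlap
expected_validated = multiexon_candidates * overlap_fraction

print(f"Fraction of introns overlapping a real gene: {overlap_fraction:.0%}")
print(f"Expected novel multiexon genes: {expected_validated:.0f}")
```

The division step relies on the assumption stated in the text that RefSeq is somewhat representative of the full human gene set, so that the 90% exact-match rate carries over to introns of genes not yet annotated.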
It is important to note that RT-PCR validations of predicted genes, as undertaken here as well as by Guigó et al. (2003
L.P. and C.D. were partially supported by NIH grant R01 HG2362-2. The whole-genome SLAM runs were performed on the Affymetrix computing cluster. R.G. and J.Q.W. were partially supported by grants from the NHGRI/NHLBI (1 U54 HG02345) and NCI/SAIC (20XS182A). The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1939804.
6 These authors contributed equally to this work.
7 Corresponding author. [Supplemental material is available online at http://hanuman.math.berkeley.edu/~cdewey/SLAMHMR/index.html.]
Alexandersson, M., Cawley, S., and Pachter, L. 2003. SLAM: Cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Res. 13: 496-502.
Bray, N. and Pachter, L. 2004. MAVID: Constrained ancestral alignment of multiple sequences. Genome Res. (this issue).
Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78-94.
Burset, M. and Guigó, R. 1996. Evaluation of gene structure prediction programs. Genomics 34: 353-367.
Cawley, S. and Pachter, L. 2003. HMM sampling and applications to gene finding and alternative splicing. Bioinformatics 19: ii36-ii41.
Guigó, R., Dermitzakis, E.T., Agarwal, P., Ponting, C., Parra, G., Reymond, A., Abril, J.F., Keibler, E., Lyle, R., Ucla, C., et al. 2003. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc. Natl. Acad. Sci. 100: 1140-1145.
Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., and Haussler, D. 2002. The Human Genome Browser at UCSC. Genome Res. 12: 996-1006.
Korf, I., Flicek, P., Duan, D., and Brent, M.R. 2001. Integrating genomic homology into gene structure prediction. Bioinformatics 1: S1-S9.
Mathe, C., Sagot, M-F., Schiex, T., and Rouze, P. 2002. SURVEY and SUMMARY: Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res. 30: 4103-4117.
Modrek, B. and Lee, C.J. 2003. Alternative splicing in the human, mouse and rat genomes is associated with an increased frequency of exon creation and/or loss. Nat. Genet. 34: 177-180.
Nurtdinov, R.N., Artamonova, I.I., Mironov, A.A., and Gelfand, M.S. 2003. Low conservation of alternative splicing patterns in the human and mouse genomes. Hum. Mol. Genet. 12: 1313-1320.
Pachter, L., Alexandersson, M., and Cawley, S. 2002. Applications of generalized pair hidden Markov models to alignment and gene finding problems. J. Comp. Biol. 9: 389-399.
Parra, G., Agarwal, P., Abril, J.F., Wiehe, T., Fickett, J.W., and Guigó, R. 2003. Comparative gene prediction in human and mouse. Genome Res. 13: 108-117.
Reese, M.G., Kulp, D., Tammana, H., and Haussler, D. 2000. Genie: Gene finding in Drosophila melanogaster. Genome Res. 10: 529-538.
Rogic, S., Ouellette, B.F.F., and Mackworth, A.K. 2002. Improving gene recognition accuracy by combining predictions from two gene-finding programs. Bioinformatics 16: 1034-1045.
Rozen, S. and Skaletsky, H.J. 1996. Primer3. Code available at http://www-genome.wi.mit.edu/genome_software/other/primer3.html
Thanaraj, T.A., Clark, F., and Muilu, J. 2003. Conservation of human alternative splice events in mouse. Nucleic Acids Res. 31: 2544-2552.
Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520-562.
http://hanuman.math.berkeley.edu/~cdewey/SLAMHMR/index.html; Supplemental data.
Received November 5, 2003; accepted in revised form January 26, 2004.