Genome Res. 15:1777-1786, 2005 ©2005 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/05 $5.00
Perspective
Genome annotation past, present, and future: How to define an ORF at each locus
Laboratory for Computational Genomics and Department of Computer Science, Washington University, St. Louis, Missouri 63130, USA
Driven by competition, automation, and technology, the genomics community has far exceeded its ambition to sequence the human genome by 2005. By analyzing mammalian genomes, we have shed light on the history of our DNA sequence, determined that alternatively spliced RNAs and retroposed pseudogenes are incredibly abundant, and glimpsed the apparently huge number of non-coding RNAs that play significant roles in gene regulation. Ultimately, genome science is likely to provide comprehensive catalogs of these elements. However, the methods we have been using for most of the last 10 years will not yield even one complete open reading frame (ORF) for every gene, the first plateau on the long climb toward a comprehensive catalog. These strategies (sequencing randomly selected cDNA clones, aligning protein sequences identified in other organisms, sequencing more genomes, and manual curation) will have to be supplemented by large-scale amplification and sequencing of specific predicted mRNAs. The steady improvements in gene prediction that have occurred over the last 10 years have increased the efficacy of this approach and decreased its cost. In this Perspective, I review the state of gene prediction roughly 10 years ago, summarize the progress that has been made since, argue that the primary ORF identification methods we have relied on so far are inadequate, and recommend a path toward completing the Catalog of Protein Coding Genes, Version 1.0.
The 10 years since Genome Research began publication bracket a complete era of genome research, an era of stunning successes and nagging loose ends, promise exceeded and promise as yet unfulfilled. The years 1996-2005 were characterized by tremendous optimism and productivity. In 1996, the sequencing of the human genome was scheduled to be completed in 2005 (Collins and Galas 1993
Among the things we have learned by analyzing mammalian genomes are the incredible abundance of alternatively spliced RNAs (Modrek and Lee 2002
It is abundantly clear that the methods we have been using to identify ORFs for most of the last 10 years are inadequate for finishing the job. In this Perspective, I argue that we cannot rely on any of the following to get us through the home stretch of ORF identification:
All of these things are valuable, but none of them is likely to get us to a new, higher plateau in the quest for a complete ORF at each protein coding locus. Instead, we will have to rely on large-scale PCR amplification of specific cDNAs followed by sequencing of the amplicons. To amplify cDNAs, we need reasonably accurate, though not necessarily perfect, gene predictions to use for PCR primer design. The further a prediction is from a true gene structure, the greater the likelihood that PCR primers designed for it will fail. Each failure increases the cost per gene identified and may reduce the completeness of the resulting collection of cDNA sequences. This method is feasible and in use today (Guigó et al. 2003
In the end, success in translating genome to ORFeome will take the same route as success in sequencing itself: investment in technology development, process optimization, and improved automation. Of course, transcripts that are completely unexpressed except in very specific circumstances will tend to be missed, but we can use these high-throughput methods to make a qualitative leap in the completeness of our ORF annotation.
To provide historical context for the argument outlined above, I will first review the state of the major gene prediction methods roughly 10 years ago, when Genome Research began publishing. The second section below provides a brief summary of the progress that has been made in the last decade. The third section presents the argument that the methods used for most of the last 10 years are not suitable for the end stages of ORF identification. The final section spells out some details of the recommended path toward understanding the most basic products of a genome.
Over the last 10 years, we have relied on three fundamental methods for identifying ORFs in genomic sequence: (1) sequencing randomly selected cDNA clones and aligning the sequences to their genomic sources; (2) finding ORFs that could produce proteins similar to proteins that are already in databases; and (3) finding ORFs de novo, without reference to cDNA sequences or their conceptual translations. Each of these methods came of age in the middle 1990s.
Aligning cDNA and protein sequences
As an example of the ambiguities that arise in cDNA-to-genome alignment, consider a short cDNA segment that can be aligned as the 3' end of a long exon with mismatches (Fig. 1A) or as a short independent exon without mismatches (Fig. 1B). Traditional spliced alignment programs, such as EST_GENOME (Mott 1997
A related gene prediction approach is to align protein sequences or profiles from existing databases to a genome sequence (Birney et al. 1996
Philosophically, these are very different approaches. The evidence that a cDNA sequence provides about the exon-intron structure from which it is assembled is much more direct than the evidence that a protein sequence provides about the loci of putative homologs. Cross-locus protein aligners must accept a significant degree of mismatch between the protein to be aligned and the target locus, which can lead to difficulty in distinguishing between functional homologs and nontranscribed pseudogenes (Birney et al. 2004b
GeneWise (Birney and Durbin 1997
Instead of using the P-value, Birney et al. (2004b
De novo gene prediction
Combining prediction methods
Aligning cDNA sequences
The accuracy of prediction systems based on aligning cDNA or protein sequence depends on the sequences that are available for alignment as well as the algorithms used to align them. There can be no doubt that both the quantity and quality of expressed sequences have improved dramatically in the last 10 years. For example, the human EST database has gone from 415,000 sequences in 1997 to over 6 million in 2005. Several projects, including the Mammalian Gene Collection (MGC) (http://mgc.nci.nih.gov/
There have been improvements in alignment algorithms, too. Traditional cDNA-to-genome alignment programs do not explicitly model the probability of mismatches in the correct alignment (Fig. 1A) as compared with the probability of an additional intron in the correct alignment (Fig. 1B). In fact, mismatches in correct alignments are either sequencing errors or differences between the reference genome and the genome from which the cDNA was transcribed. (Occasionally, they may also result from post-transcriptional events such as RNA editing.) Thus, the probabilities of these events depend on both the sequence quality and the rate of polymorphism (for within-species alignments) or divergence (for cross-species alignments). On the other hand, the probability of an additional intron depends on the frequency of introns in the species at hand.
A new generation of cDNA-to-genome alignment programs models all these things using pair hidden Markov models with parameters estimated from the specific cDNA collection and the genome sequence to be aligned (M. Arumugam and M.R. Brent, in prep.). For example, such systems can easily model the fact that sequencing errors are much less likely when aligning an MGC cDNA sequence to the finished human genome than when aligning a single-pass EST sequence to the draft dog genome. For 70%-80% of high-quality cDNA sequences, these more precise models will result in the same alignment as a program like EST_GENOME. In many of the remaining cases, however, they produce better alignments. For example, they are better able to distinguish small exons from sequencing errors. This accuracy improvement is made possible by the availability of very high quality sequences to align and the availability of sufficient computing power to run the pairHMM algorithms in a reasonable amount of time.
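The trade-off between explaining a short cDNA segment by mismatches and explaining it by an additional intron can be illustrated with a toy calculation. The sketch below is not the pair hidden Markov model referred to above; it is a minimal log-probability comparison under simplifying assumptions (independent per-base mismatches, a single fixed probability for opening an extra intron). The function names and all numeric values are hypothetical and chosen only for illustration.

```python
import math


def log_prob_with_mismatches(n_aligned, n_mismatches, mismatch_rate):
    """Log-probability of aligning the segment contiguously (as in Fig. 1A),
    paying for each mismatched base. Assumes independent per-base mismatches."""
    n_matches = n_aligned - n_mismatches
    return (n_mismatches * math.log(mismatch_rate)
            + n_matches * math.log(1.0 - mismatch_rate))


def log_prob_with_extra_intron(n_aligned, mismatch_rate, intron_prob):
    """Log-probability of aligning the segment perfectly as a separate exon
    (as in Fig. 1B), paying once for the additional intron."""
    return math.log(intron_prob) + n_aligned * math.log(1.0 - mismatch_rate)


def prefer_extra_intron(n_aligned, n_mismatches, mismatch_rate, intron_prob):
    """True if the extra-intron explanation is more probable than the
    mismatch explanation under this toy model."""
    return (log_prob_with_extra_intron(n_aligned, mismatch_rate, intron_prob)
            > log_prob_with_mismatches(n_aligned, n_mismatches, mismatch_rate))


# Hypothetical parameters: a 12-base segment with one mismatch, and an
# assumed probability of 0.001 for an additional intron at this position.
high_quality = prefer_extra_intron(12, 1, mismatch_rate=1e-4, intron_prob=1e-3)
single_pass = prefer_extra_intron(12, 1, mismatch_rate=0.03, intron_prob=1e-3)
print(high_quality, single_pass)
```

With these illustrative numbers, a low mismatch rate (as assumed for a high-quality cDNA against finished sequence) favors the small-exon interpretation, while a higher mismatch rate (as assumed for a single-pass EST) favors treating the discrepancy as an error. This is the qualitative behavior the text attributes to models whose parameters are estimated from the specific sequences being aligned.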
Single-genome de novo gene prediction
Genscan (Burge and Karlin 1997
As test sets became more realistic, estimates of Genscan's accuracy at predicting complete human ORFs dropped. Guigó et al. (2000
Currently, gene prediction programs are used primarily for whole genome annotation. As described above, their accuracy when evaluated on a whole genome is typically much lower than their accuracy when evaluated on isolated genes or artificially concatenated sets of single genes. Even whole chromosomes can be deceptive. For example, human chromosome 22, besides being the smallest autosome, is also unusually gene dense, with smaller than average introns and intergenic regions and above average GC content. Most gene prediction programs, including Genscan, tend to perform best on high GC, gene-dense regions. Thus, evaluation on chromosome 22 systematically overestimates the accuracy of most systems. In the current environment, the minimal standard for evaluation of gene prediction programs must be based on whole genome annotation runs. Some may argue that, since we do not know all the exon-intron structures for the human or any other genome, we cannot know the accuracy of a prediction set for the whole genome. This is true, but it should not be an impediment to evaluating whole genome annotations. Sensitivity estimates based on the subset of genes whose structures are known should be an unbiased estimate of sensitivity on all genes, to the extent that the sets of known genes and unknown genes do not differ in ways that greatly affect accuracy. While it is possible that unknown genes are radically different from known genes in this way, there is no reason to believe that they are. Specificity will be systematically underestimated when the predictions are compared to known genes rather than to all genes. Under the same assumption described above, dividing by the fraction of genes that are known (or the fraction of exons that are known, for exon-level specificity) corrects the underestimate. 
The exact value of that correction factor does not matter when comparing the specificities of two programs: the one with the higher raw estimate will also have the higher corrected estimate. Another approach, which seems to always give qualitatively similar results, is to use only gene predictions that overlap known genes by at least one nucleotide when computing specificity (Wei et al. 2005
Determining gene boundaries is one of the most challenging aspects of ORF prediction (much more so than predicting the boundaries of exons with splices on both sides), and so it is also the area in which the potential for improvement is greatest. Many improvements to gene prediction algorithms have a large effect on accuracy as measured by exact ORF prediction, even though they have little effect on the accuracy of exon prediction. Thus, it is critical to include measures of exact ORF prediction in comparative evaluations of gene prediction programs.
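The specificity correction described earlier (dividing the raw estimate by the fraction of genes that are known) can be written out as a short calculation. This is a minimal sketch under the stated assumption that known and unknown genes are predicted equally well; the function names and the example counts are hypothetical.

```python
def raw_specificity(n_predictions, n_matching_known):
    """Fraction of predictions that exactly match a known gene.
    This underestimates true specificity because correct predictions of
    not-yet-known genes are counted as wrong."""
    return n_matching_known / n_predictions


def corrected_specificity(n_predictions, n_matching_known, fraction_known):
    """Raw specificity divided by the fraction of genes that are known.
    Assumes known and unknown genes are predicted with equal accuracy;
    if that assumption fails, the result can even exceed 1."""
    return raw_specificity(n_predictions, n_matching_known) / fraction_known


# Hypothetical numbers: two programs evaluated against the same known set,
# of which we assume half of all genes are represented.
fraction_known = 0.5
program_a = corrected_specificity(1000, 200, fraction_known)  # raw 0.20
program_b = corrected_specificity(1000, 150, fraction_known)  # raw 0.15
print(program_a, program_b)
```

Because both programs are divided by the same factor, their ranking is unchanged by the correction, which is why the exact value of the factor does not matter for comparisons.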
Statistics on the exact-ORF accuracies of programs are important, but there is a legitimate argument that the value of these programs is not in predicting known genes but in predicting novel genes. Thus, the most convincing evaluation of a program or a set of programs is the extent to which its novel predictions can be verified experimentally. The trend toward publishing experimental evaluations of prediction sets (Wu et al. 2004
Dual- and multi-genome de novo predictors
A new level of accuracy was achieved this year by N-SCAN, a version of TWINSCAN with a new, phylogenetic conservation model that is capable of considering alignments among multiple genomes (Gross and Brent 2005
Many of the challenges of de novo gene prediction that have been observed over the years remain challenges today. Even the best prediction programs tend to split and fuse genes, and they have difficulty accurately predicting stop codons and especially start codons. They only predict a single isoform at each locus, even though a large fraction of human genes are alternatively spliced. Yet there has been enormous progress. We have moved from predicting a correct ORF at one tenth of the human loci to predicting a correct ORF at one third. We can now predict long introns (Gross and Brent 2005
Combining prediction methods
Most Ensembl gene predictions are ultimately created by GeneWise, a protein-alignment program, although Genscan is used to help identify the best proteins to align from other species (Curwen et al. 2004
Given the recent progress in de novo gene prediction, it is worth asking whether GeneWise is still more accurate, or even more conservative, than the best de novo predictors. A direct comparison would be most informative, but the data set that Birney et al. (2004b
A direct comparison was recently made among integrated annotation pipelines as part of the E-GASP community evaluation (Guigó and Reese 2005
A more recent approach to integrating predictions is to score each potential exon using a weighted combination of evidence from alignment-based predictions and de novo predictions (Allen et al. 2004
Manual annotation has also progressed over the last 10 years. In 2000, the Drosophila community held an annotation "jamboree," in which fly biologists and bioinformaticians gathered at Celera Genomics for two weeks to create an initial annotation of the Drosophila genome (Pennisi 2000
Limits of sequencing random cDNA clones
Improving the accuracy of annotations based on expressed sequences depends, to a large extent, on improving the collection of sequences that are available to align. The vast majority of ESTs and cDNA sequences currently in databases were obtained by sequencing clones selected at random from cDNA libraries. However, this method has been found to saturate well short of the full gene set (The MGC Project Team 2004
Limits of protein alignment
Limits of combiners
Limits of manual annotation
Limits of comparative genomics
There are several possible reasons for the failure, so far, to achieve substantial accuracy improvements by using multigenome alignments. It may be that we do not yet have the right combinations of genomes sequenced to sufficiently high quality; draft sequence may not be good enough. Or, it may be that we simply cannot align these genomes precisely enough to draw accurate inferences about selection. If these are the reasons, then finished sequences from more mammals, especially primates (Boffelli et al. 2003
It is my conviction that a finished genome sequence should reveal the set of ORFs it encodes. Therefore, I believe we must develop a cost-effective technology for translating a genome to a set of exon-intron structures and the proteins they encode. The outlines of this technology are now becoming clear, but its cost must still be reduced through automation and optimization.
The current gold standard of evidence for gene structures is cDNA sequence aligned to the genomic locus from which it was transcribed. This leaves something to be desired, in that one must still infer the exon-intron structure by alignment and the protein product by conceptual translation of the most likely-looking open reading frame. Both of these inferences are subject to error, so one might hope for confirmation by direct experimental evidence. However, there is as yet no economical, high-throughput method for obtaining such evidence. In particular, there is no analog of RT-PCR for proteins, that is, an economical method of directly amplifying or purifying hypothesized, low-abundance proteins. Since we must rely on computational inference of protein products that aren't easily picked up by high-throughput proteomics, it is possible that incorrectly processed pre-mRNAs, such as those with retained introns, would yield incorrect inferences about functional proteins. The best approach to flagging such cases may be to screen for cDNAs that are likely candidates for nonsense-mediated decay: those with splice junctions more than 50-55 nt 3' of the inferred ORF (Lejeune and Maquat 2005
The most efficient way to obtain cDNA sequence for every protein-coding gene is to combine standard EST sequencing, gene prediction, and RT-PCR using primers designed to amplify predicted transcripts. A small to moderate collection of ESTs should be developed first by the standard method, sequencing randomly selected cDNA clones. This will produce sequence from transcripts that are relatively abundant, and will completely determine the exon-intron structures of abundant transcripts that are shorter than two read lengths (currently about 1400-1800 bp). The cost per transcript will remain relatively low as long as a fairly high proportion of sequences produced are new. 
By calculating the number of clones that must be sequenced to obtain a new EST and multiplying by the cost per clone one can estimate the cost per new cDNA read. When this cost exceeds the estimated cost per new read by RT-PCR, EST sequencing should be stopped. The resulting ESTs should be aligned to the genome using cDNA-to-genome alignment tools based on strong models of gene structure, and those that do not align well should be discarded or set aside for manual inspection if time permits. High-quality EST alignments that overlap one another must then be grouped together and computational techniques used to determine which groups are likely to contain a complete ORF. Those that do form the core set of genes in the annotation. Once the core set has been determined, the rest of the genes must be identified by a series of RT-PCR and sequencing steps, starting with the most confident predictions and progressing toward the less confident (Fig. 2). Considering the analyses described above, predictions based on cross-locus and cross-species protein alignments are more reliable than de novo predictions only when the aligned protein is highly similar to the predicted one (probably >95% identity). Such predictions should be used to design primers for the first round of RT-PCR and sequencing experiments. After each RT-PCR and sequencing step, the resulting cDNA sequences should be aligned, grouped, and sorted by completeness of the predicted ORF as described above. Aligning the experimental sequences to the genome may confirm parts of the predicted gene structure, but it may also reveal errors in other parts of the predicted structure. The updated set of full-ORF gene structures can now be used to train a de novo gene prediction algorithm. Typically, clusters of genomes within a clade are sequenced at once, so it is usually possible to use dual- or, potentially, multi-genome de novo prediction methods. 
The EST alignments that do not cover a full ORF can be used to guide the prediction algorithms, which will predict complete structures that are consistent with the alignments, but may extend them with additional exons and/or link several ESTs together into a single predicted transcript (C. Wei and M.R. Brent, in prep.). The unconfirmed regions of predictions that extend or link EST alignments can then be tested in the next round of RT-PCR. After aligning the resulting sequences to the genome, the gene structures they define can be used as additional examples for retraining the gene predictor and as additional guidance around which the gene predictor can build models. If this process is taken to convergence, where all gene models have been tested, the result will be an annotation of exon-intron structures that is more complete than any we have now and that is fully verified by native cDNA sequences.
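The rule given above for deciding when to stop random EST sequencing (estimate the cost per new cDNA read and stop once it exceeds the estimated cost per new read by RT-PCR) can be sketched as follows. This is only an illustration: estimating "clones per new EST" from the most recent batch is an assumption made here, and the function names and dollar figures are hypothetical.

```python
def cost_per_new_est_read(clones_sequenced, new_ests_found, cost_per_clone):
    """Estimated cost of obtaining one new EST: the number of clones that
    must be sequenced per new EST, multiplied by the cost per clone.
    The yield is estimated from a recent batch (an assumption of this sketch)."""
    if new_ests_found == 0:
        return float("inf")  # the batch produced nothing new
    clones_per_new_est = clones_sequenced / new_ests_found
    return clones_per_new_est * cost_per_clone


def should_stop_est_sequencing(clones_sequenced, new_ests_found,
                               cost_per_clone, rtpcr_cost_per_new_read):
    """True once random EST sequencing costs more per new read than RT-PCR."""
    est_cost = cost_per_new_est_read(clones_sequenced, new_ests_found,
                                     cost_per_clone)
    return est_cost > rtpcr_cost_per_new_read


# Hypothetical batches of 1000 clones at 2.0 cost units per clone, compared
# with an assumed RT-PCR cost of 20.0 units per new read.
early_batch = should_stop_est_sequencing(1000, 400, 2.0, 20.0)  # 5.0 per new EST
late_batch = should_stop_est_sequencing(1000, 50, 2.0, 20.0)    # 40.0 per new EST
print(early_batch, late_batch)
```

In this toy example the early batch still yields new ESTs cheaply, so sequencing continues, whereas the late batch has saturated enough that targeted RT-PCR becomes the cheaper route.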
Several variants of this approach are also being developed. One is to use Rapid Amplification of cDNA Ends (RACE) PCR, a method in which a universal primer at one end of the cDNA is paired with a single gene-specific primer inside the predicted cDNA. Certain RACE methods selectively amplify 5' complete mRNAs with a 7 methyl guanine cap, allowing amplification of the 5' end without knowing a sequence in the 5' end. Only the sequences of one or more internal exons are needed for the design of the gene specific primer. Since only one exon needs to be predicted correctly, this method can be more sensitive than ordinary RT-PCR. Specificity is often a problem with RACE, but this can be ameliorated by a second round of PCR using a nested pair of universal and gene specific primers. McCombie and colleagues (Dike et al. 2004
Of course, some transcripts are expressed transiently during development, or only under rare environmental conditions. We can increase the number of detectable transcripts by pooling RNA from many tissues. Cloning artifacts can be reduced by amplifying reverse-transcriptase products directly rather than using cloned cDNA libraries and by sequencing PCR products directly rather than sequencing clones. But there will still be rare transcripts that cannot be verified by a high-throughput annotation system. In the end, these will have to be identified on a case-by-case basis using traditional biochemical or genetic approaches. Nonetheless, we can use high-throughput methods to get much closer than we have so far to determining the most basic elements on the parts list of an organism. To make this vision a reality, we must bring the cost of the RT-PCR and sequencing experiments down as far as possible. This means relying on end-to-end automation. Much of the necessary automation consists of software pipelines for selecting predictions to test, designing primer pairs to test them, and analyzing the resulting sequences to determine new gene structures. The physical processes of setting up PCR and sequencing reactions must also be optimized and automated. Finally, the accuracy of the gene predictions will be a central determinant of the cost and completeness of the resulting annotation. Prediction errors may lead to one or more PCR experiments that fail to amplify their targets and produce no useful sequence, thus raising the cost per transcript annotated. Therefore, we must continue to improve the accuracy of gene prediction by developing more complete and more realistic models of the signals in the genome sequence that guide the transcription and processing of mRNA. The genomics community is used to rapid progress and headline-making excitement, so the temptation to "declare victory and move on" is understandable. 
I have heard it said numerous times that the identification of protein-coding genes is well understood, and the real challenges now are identifying transcription factor binding sites, non-coding RNA genes, and other exciting sequence elements. While these are important challenges, we must resist the temptation to leave the identification of protein-coding genes incomplete while we chase after the hottest new features. We must not forget that the defining characteristic of genomics is the all-out effort to view an organism globally by analyzing data sets that are as complete as we can possibly make them.
I am grateful to Paul Flicek for help with analyzing the EGASP evaluation results and to Mark Diekhans for analysis of the MGC cDNA sequences. M.R.B. is supported in part by R01 HG02278, R01 AI051209, and U01 HG003150 from the National Institutes of Health; in part by grant DBI-0501758 from the National Science Foundation; and in part by National Cancer Institute funds for the Mammalian Gene Collection project under Contract No. N01-CO-12400.
E-mail brent{at}cse.wustl.edu; fax (314) 935-7302. Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.3866105.
Alexandersson, M., Cawley, S., and Pachter, L. 2003. SLAM: Cross-species gene finding and alignment with a generalized pair hidden Markov model. Genome Res. 13: 496-502.
Allen, J.E., Pertea, M., and Salzberg, S.L. 2004. Computational gene prediction using multiple sources of evidence. Genome Res. 14: 142-148.
Ansari-Lari, M.A., Timms, K.M., and Gibbs, R. 1996. Improved ligation-anchored PCR strategy for identification of 5' ends of transcripts. BioTechniques 21: 34-38.
Ansari-Lari, M.A., Shen, Y., Muzny, D.M., Lee, W., and Gibbs, R.A. 1997. Large-scale sequencing in human chromosome 12p13: Experimental and computational gene structure determination. Genome Res. 7: 268-280.
Aparicio, S., Chapman, J., Stupka, E., Putnam, N., Chia, J.M., Dehal, P., Christoffels, A., Rash, S., Hoon, S., Smit, A., et al. 2002. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297: 1301-1310.
Bafna, V. and Huson, D.H. 2000. The conserved exon method for gene finding. Proc. Int. Conf. Intell. Syst. Mol. Biol. 8: 3-12.
Batzoglou, S., Pachter, L., Mesirov, J.P., Berger, B., and Lander, E.S. 2000. Human and mouse gene structure: Comparative analysis and application to exon prediction. Genome Res. 10: 950-958.
Birney, E. and Durbin, R. 1997. Dynamite: A flexible code generating language for dynamic programming methods used in sequence comparison. Proc. Int. Conf. Intell. Syst. Mol. Biol. 5: 56-64.
Birney, E. and Durbin, R. 2000. Using GeneWise in the Drosophila annotation experiment. Genome Res. 10: 547-548.
Birney, E., Thompson, J.D., and Gibson, T.J. 1996. PairWise and SearchWise: Finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. Nucleic Acids Res. 24: 2730-2739.
Birney, E., Andrews, T.D., Bevan, P., Caccamo, M., Chen, Y., Clarke, L., Coates, G., Cuff, J., Curwen, V., Cutts, T., et al. 2004a. An overview of Ensembl. Genome Res. 14: 925-928.
Birney, E., Clamp, M., and Durbin, R. 2004b. GeneWise and Genomewise. Genome Res. 14: 988-995.
Boffelli, D., McAuliffe, J., Ovcharenko, D., Lewis, K.D., Ovcharenko, I., Pachter, L., and Rubin, E.M. 2003. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299: 1391-1394.
Brackenridge, S., Wilkie, A.O., and Screaton, G.R. 2003. Efficient use of a `dead-end' GA 5' splice site in the human fibroblast growth factor receptor genes. EMBO J. 22: 1620-1631.
Brown, R.H., Gross, S.S., and Brent, M.R. 2005. Begin at the beginning: Predicting genes with 5' UTRs. Genome Res. 15: 742-747.
Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78-94.
Burset, M. and Guigó, R. 1996. Evaluation of gene structure prediction programs. Genomics 34: 353-367.
Cheng, J., Kapranov, P., Drenkow, J., Dike, S., Brubaker, S., Patel, S., Long, J., Stern, D., Tammana, H., Helt, G., et al. 2005. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308: 1149-1154.
Collins, F. and Galas, D. 1993. A new five-year plan for the U.S. Human Genome Project. Science 262: 43-46.
Collins, F.S., Green, E.D., Guttmacher, A.E., and Guyer, M.S. 2003. A vision for the future of genomics research. Nature 422: 835.
Curwen, V., Eyras, E., Andrews, T.D., Clarke, L., Mongin, E., Searle, S.M., and Clamp, M. 2004. The Ensembl automatic gene annotation system. Genome Res. 14: 942-950.
Dike, S., Balija, V.S., Nascimento, L.U., Xuan, Z., Ou, J., Zutavern, T., Palmer, L.E., Hannon, G., Zhang, M.Q., and McCombie, W.R. 2004. The mouse genome: Experimental examination of gene predictions and transcriptional start sites. Genome Res. 14: 2424-2429.
Drysdale, R.A. and Crosby, M.A. 2005. FlyBase: Genes and gene models. Nucleic Acids Res. 33: D390-D395.
The ENCODE Project Consortium. 2004. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 306: 636-640.
Eyras, E., Reymond, A., Castelo, R., Bye, J.M., Camara, F., Flicek, P., Huckle, E.J., Parra, G., Shteynberg, D.D., Wyss, C., et al. 2005. Gene finding in the chicken genome. BMC Bioinformatics 6: 131.
Flicek, P., Keibler, E., Hu, P., Korf, I., and Brent, M.R. 2003. Leveraging the mouse genome for gene prediction in human: From whole-genome shotgun reads to a global synteny map. Genome Res. 13: 46-54.
Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M., and Miller, W. 1998. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8: 967-974.
Furey, T.S., Diekhans, M., Lu, Y., Graves, T.A., Oddy, L., Randall-Maher, J., Hillier, L.W., Wilson, R.K., and Haussler, D. 2004. Analysis of human mRNAs with the reference genome sequence reveals potential errors, polymorphisms, and RNA editing. Genome Res. 14: 2034-2040.
Gelfand, M.S., Mironov, A.A., and Pevzner, P.A. 1996. Gene recognition via spliced sequence alignment. Proc. Natl. Acad. Sci. 93: 9061-9066.
Gibbs, R.A., Weinstock, G.M., Metzker, M.L., Muzny, D.M., Sodergren, E.J., Scherer, S., Scott, G., Steffen, D., Worley, K.C., Burch, P.E., et al. 2004. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428: 493-521.
Gross, S.S. and Brent, M.R. 2005. Using multiple alignments to improve gene prediction. In 9th Annual International Conference, RECOMB 2005 (eds. S. Miyano et al.), pp. 374-388. Springer, Boston.
Gross, S.S. and Brent, M.R. 2006. Using multiple alignments to improve gene prediction. J. Comput. Biol. 13: (in press).
Guigó, R. and Reese, M.G. 2005. EGASP: Collaboration through competition to find human genes. Nat. Methods 2: 575-577.
Guigó, R., Knudsen, S., Drake, N., and Smith, T. 1992. Prediction of gene structure. J. Mol. Biol. 226: 141-157.
Guigó, R., Agarwal, P., Abril, J.F., Burset, M., and Fickett, J.W. 2000. An assessment of gene prediction accuracy in large DNA sequences. Genome Res. 10: 1631-1642.
Guigó, R., Dermitzakis, E.T., Agarwal, P., Ponting, C., Parra, G., Reymond, A., Abril, J.F., Keibler, E., Lyle, R., Ucla, C., et al. 2003. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc. Natl. Acad. Sci. 100: 1140-1145.
Hillier, L.D., Lennon, G., Becker, M., Bonaldo, M.F., Chiapelli, B., Chissoe, S., Dietrich, N., DuBuque, T., Favello, A., Gish, W., et al. 1996. Generation and analysis of 280,000 human expressed sequence tags. Genome Res. 6: 807-828.
Hillier, L.W., Miller, W., Birney, E., Warren, W., Hardison, R.C., Ponting, C.P., Bork, P., Burt, D.W., Groenen, M.A., Delany, M.E., et al. 2004. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432: 695-716.
International Human Genome Sequencing Consortium. 2004. Finishing the euchromatic sequence of the human genome. Nature 431: 931-945.
Kapranov, P., Drenkow, J., Cheng, J., Long, J., Helt, G., Dike, S., and Gingeras, T.R. 2005. Examples of the complex architecture of the human transcriptome revealed by RACE and high-density tiling arrays. Genome Res. 15: 987-997.
Korf, I., Flicek, P., Duan, D., and Brent, M.R. 2001. Integrating genomic homology into gene structure prediction. Bioinformatics 17 Suppl 1: S140-S148.
Kulp, D., Haussler, D., Reese, M.G., and Eeckman, F.H. 1996. A generalized hidden Markov model for the recognition of human genes in DNA. Proc. Int. Conf. Intell. Syst. Mol. Biol. 4: 134-142.
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921.
Lejeune, F. and Maquat, L.E. 2005. Mechanistic links between nonsense-mediated mRNA decay and pre-mRNA splicing in mammalian cells. Curr. Opin. Cell Biol. 17: 309-315.