Genome Research

Vol. 10, Issue 10, 1631-1642, October 2000

METHODS
An Assessment of Gene Prediction Accuracy in Large DNA Sequences

Roderic Guigó,1,3 Pankaj Agarwal,2 Josep F. Abril,1 Moisés Burset,1 and James W. Fickett2

1 Grup de Recerca en Informàtica Mèdica, Institut Municipal d'Investigació Mèdica, Universitat Pompeu Fabra, E-08003 Barcelona, Spain; 2 Department of Bioinformatics, SmithKline Beecham Pharmaceuticals Research and Development, King of Prussia, Pennsylvania 19406, USA

One of the first useful products from the human genome will be a set of predicted genes. Besides its intrinsic scientific interest, this data set's accuracy and completeness are of considerable importance for human health and medicine. Though progress has been made on computational gene identification in terms of both methods and accuracy evaluation measures, most of the sequence sets on which the programs are tested consist of short genomic sequences, and there is concern that these accuracy measures may not extrapolate well to larger, more challenging data sets. Given the absence of experimentally verified large genomic data sets, we constructed a semiartificial test set comprising a number of short single-gene genomic sequences with randomly generated intergenic regions. This test set, which should still present an easier problem than real human genomic sequence, mimics the ~200-kb-long BACs being sequenced. In our experiments with these longer genomic sequences, the accuracy of GENSCAN, one of the most accurate ab initio gene prediction programs, dropped significantly, although its sensitivity remained high. Conversely, the accuracy of similarity-based programs, such as GENEWISE, PROCRUSTES, and BLASTX, was not affected significantly by the presence of random intergenic sequence, but depended on the strength of the similarity to the protein homolog. As expected, the accuracy dropped if the models were built using more distant homologs, and we were able to estimate this decline quantitatively. However, the specificity of these techniques remains rather good even when the similarity is weak, which is a desirable characteristic for driving expensive follow-up experiments. Our experiments suggest that although gene prediction will improve with every new protein that is discovered and through improvements in the current set of tools, we still have a long way to go before we can decipher the precise exonic structure of every gene in the human genome using purely computational methodology.
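The test-set construction and the sensitivity/specificity distinction described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function names, the uniform base composition of the random spacers, the fixed spacer length, and the 0-based half-open exon coordinates are all assumptions made for illustration. The sketch also assumes the nucleotide-level convention common in gene-prediction evaluation, in which sensitivity is the fraction of true coding bases that are predicted and specificity is the fraction of predicted coding bases that are truly coding; the abstract itself does not define these measures.

```python
import random


def build_semiartificial(genes, spacer_len, seed=0):
    """Concatenate single-gene sequences separated by random intergenic DNA.

    genes: list of (sequence, exons); exons are 0-based half-open
    (start, end) intervals relative to that gene's own sequence.
    Returns the combined sequence and the exons in combined coordinates.
    """
    rng = random.Random(seed)
    parts, exons = [], []
    offset = 0
    for seq, gene_exons in genes:
        # Assumption: spacers have uniform base composition and fixed length.
        spacer = "".join(rng.choice("ACGT") for _ in range(spacer_len))
        parts.append(spacer)
        offset += spacer_len
        # Shift this gene's exons into the coordinates of the long sequence.
        exons.extend((s + offset, e + offset) for s, e in gene_exons)
        parts.append(seq)
        offset += len(seq)
    return "".join(parts), exons


def coding_positions(exons):
    """Set of all base positions covered by the given exon intervals."""
    return {p for s, e in exons for p in range(s, e)}


def nucleotide_accuracy(true_exons, predicted_exons):
    """Return (sensitivity, specificity) at the nucleotide level."""
    truth = coding_positions(true_exons)
    pred = coding_positions(predicted_exons)
    tp = len(truth & pred)
    sn = tp / len(truth) if truth else 0.0  # TP / (TP + FN)
    sp = tp / len(pred) if pred else 0.0    # TP / (TP + FP)
    return sn, sp


# Toy example: two single-exon "genes" joined with 5-base random spacers.
toy_genes = [("ATGAAATAG", [(0, 9)]), ("ATGCCCGGGTAA", [(0, 12)])]
long_seq, true_exons = build_semiartificial(toy_genes, spacer_len=5)

# A hypothetical prediction: first exon exact, second exon truncated,
# plus a false exon inside the first random spacer.
predicted = [(5, 14), (19, 25), (0, 3)]
sn, sp = nucleotide_accuracy(true_exons, predicted)
```

In this toy case the false exon placed in the random spacer lowers specificity without affecting sensitivity, while the truncated exon lowers sensitivity. This shows how added intergenic sequence can degrade one measure and leave the other high.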


3 Corresponding author.


10:1631-1642 ©2000 by Cold Spring Harbor Laboratory Press  ISSN 1088-9051/00 $5.00




