Genome Research


Vol. 10, Issue 4, 547-548, April 2000

METHODS
Using GeneWise in the Drosophila Annotation Experiment

Ewan Birney,1 and Richard Durbin

Sanger Centre, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK



    ABSTRACT

The GeneWise method for combining gene prediction and homology searches was applied to the 2.9-Mb region from Drosophila melanogaster. The results from the Genome Annotation Assessment Project (GASP) showed that GeneWise provided reasonably accurate gene predictions. Further investigation indicates that many of the incorrect gene predictions from GeneWise were due to transposons with valid protein-coding genes, and that the remaining cases are pseudogenes or possible annotation oversights.



    INTRODUCTION

The critical assessment of machine learning techniques is necessary to assess the effectiveness of computational methods. The critical assessment of protein structure prediction (CASP) has become a benchmark for protein structure assessment worldwide (Moult et al. 1999). We welcomed the opportunity offered by Reese and coworkers (2000) to independently assess the gene prediction methods available and provided one of the methods we developed, GeneWise, for this study.

The use of protein and EST similarity to help gene prediction is widespread, including methods such as Genie (Kulp et al. 1996) and GRAIL (Uberbacher et al. 1996). The GeneWise approach builds on the success of hidden Markov models (HMMs) for modeling both protein family information (Krogh et al. 1994; Eddy 1998) and gene predictions (Kulp et al. 1996; Burge and Karlin 1997; Krogh 1997). GeneWise is an HMM that is formed by the principled combination of two separate HMMs (E. Birney and R. Durbin, in prep.). GeneWise therefore can be thought of as considering every possible gene prediction in a genomic sequence and comparing each one to the protein profile-HMM. The best combined score of both the gene prediction and the protein profile-HMM is used to provide a simultaneous gene prediction and protein alignment.

To use GeneWise for gene prediction one needs a source of homology information. In this case, we used protein profile-HMMs from PFAM (Bateman et al. 2000). One of the major drawbacks to using GeneWise is the prohibitive computational cost of the method. This was solved in this case by using the halfwise method, which prefilters the protein profile-HMMs used in the comparison (see Methods). The results presented here are the completely automatic annotation from GeneWise, without any manual intervention in the process.

    RESULTS

A total of 165 gene predictions with 252 exons were made in the 2.9-Mb genomic segment. Of the 252 exons, 216 overlapped in some way with the std3 dataset of definite and possible predictions. This left 36 exons in 23 predictions outside of this set. Of these, 16 exons came from matches to profile HMMs of transposons or retroviral transposons. The remaining 20 exons were potential mispredictions or annotation mistakes. Manual examination of these cases revealed four potential mispredictions by GeneWise, each a trailing exon in an otherwise correct gene prediction. Of the remaining 16 exons, 10 were clear annotation oversights, leaving 6 that were less clear-cut; pseudogenes, for example, might explain these hits. GeneWise made no predictions of completely wrong genes, in line with our expectation, because GeneWise only predicts genes by virtue of their homology to other genes. We would therefore place our base pair accuracy far higher than reported (in the 90% range) and the number of wrong gene predictions at 0.
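The exon counts quoted above can be checked with a short tally. The following Python sketch uses only the numbers stated in this section; the variable names and category labels are ours, chosen for illustration.

```python
# Exon counts as stated in the Results text.
total_exons = 252
overlapping_std3 = 216
transposon_exons = 16
mispredicted_exons = 4
annotation_oversights = 10

# Successive subtractions described in the text.
outside_std3 = total_exons - overlapping_std3
non_transposon = outside_std3 - transposon_exons
after_mispredictions = non_transposon - mispredicted_exons
unclear = after_mispredictions - annotation_oversights

breakdown = {
    "overlap std3": overlapping_std3,
    "transposon matches": transposon_exons,
    "potential mispredictions": mispredicted_exons,
    "annotation oversights": annotation_oversights,
    "less clear-cut": unclear,
}

# The five categories should partition the full set of predicted exons.
assert sum(breakdown.values()) == total_exons

for label, count in breakdown.items():
    print(f"{label:<26}{count:>4}  ({count / total_exons:.1%})")
```

These are exon-level fractions only; they are not the base pair accuracy discussed above, which depends on exon lengths not given here.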

    DISCUSSION

The GASP assessment was a valuable exercise in providing independent evaluation of gene prediction effectiveness. Providing clear-cut assessment of gene predictions is a difficult task and was not helped by the time pressures on both the contributing groups and the assessing group. It is clear that the rules for which predictions will be considered real need to be spelled out in the future, and the ability to assess such things as pseudogene predictions may also be important. Ideally the assessing group would perform experiments after the gene predictions have been made, so that it is clearer that an attempt has been made to verify each gene prediction experimentally.

The predictions made by GeneWise were very much in line with the predictions made using the BLOCKS method (Henikoff and Henikoff 2000). The BLOCKS method considers smaller, ungapped and unspliced motifs drawn from a broader database than PFAM. The result is that there are differences due to the different database source and due to the method; in particular, GeneWise tends to predict more coding sequence than BLOCKS for a particular family.

The effectiveness of GeneWise in this study was reported at below the levels we believe to be correct. We believe that the specificity numbers for all methods are not well assessed in this study, and that they should not be quoted without considerable discussion of the shortcomings of this assessment, namely the scoring of transposon genes as errors and the annotation oversights. Even so, this exercise is valuable in raising awareness of the problems in both prediction and assessment. We look forward to participating in future studies.

    METHODS

The method used in this study, halfwise, is part of the Wise2 package available from http://www.sanger.ac.uk/Software/Wise2. halfwise is a Perl script that uses BLASTX to compare the DNA sequence against a protein database designed to represent the protein space covered by the PFAM database. The BLASTX search selects a number of potential PFAM models to be used in the more computationally expensive GeneWise method.
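The prefiltering idea can be illustrated with a minimal Python sketch. This is not the halfwise script itself, whose selection rule is not described here: the Hit record, the bit-score cutoff, the cap on the number of models, and the family labels are all hypothetical and serve only to show how a cheap search can narrow the set of models passed to an expensive one.

```python
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One BLASTX-style hit (hypothetical record, for illustration only)."""
    family: str       # PFAM family represented by the matched protein
    bit_score: float  # score of the hit


def select_models(hits: Iterable[Hit],
                  min_bit_score: float = 50.0,  # assumed cutoff
                  max_models: int = 10) -> List[str]:  # assumed cap
    """Return candidate families, best-scoring first, for the costly step."""
    best: Dict[str, float] = {}
    for hit in hits:
        if hit.bit_score >= min_bit_score:
            # Keep only the best score seen for each family.
            best[hit.family] = max(best.get(hit.family, 0.0), hit.bit_score)
    ranked = sorted(best, key=lambda family: best[family], reverse=True)
    return ranked[:max_models]


# Made-up hits; the family names are placeholders, not real PFAM entries.
example_hits = [
    Hit("famA", 120.0),
    Hit("famB", 35.0),
    Hit("famA", 80.0),
    Hit("famC", 60.0),
]
candidates = select_models(example_hits)
print(candidates)
```

Only the families returned by such a filter would then be compared against the DNA with the full GeneWise model, which is where the saving in computation comes from.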

The DNA sequence was split up into 100-kb chunks with no overlaps, and each chunk was run through the halfwise method. The resulting GFF output was then processed to assemble the complete GFF file. The total time to perform the analysis was a weekend of off-peak computer resources at the Sanger Centre.
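The chunking and reassembly step can be sketched in Python as follows. This is an illustration under stated assumptions, not the scripts used in the study: the function names are ours, the region length is taken as exactly 2,900,000 bp for the example (the text gives only "2.9 Mb"), and we assume each chunk's GFF output uses chunk-local, 1-based coordinates in the standard start and end columns (columns 4 and 5).

```python
from typing import Iterator, List, Tuple

CHUNK_SIZE = 100_000  # 100 kb, as stated in the text


def split_into_chunks(sequence_length: int,
                      chunk_size: int = CHUNK_SIZE) -> Iterator[Tuple[int, int]]:
    """Yield (offset, length) for consecutive, non-overlapping chunks."""
    for offset in range(0, sequence_length, chunk_size):
        yield offset, min(chunk_size, sequence_length - offset)


def shift_gff_line(line: str, offset: int) -> str:
    """Move one chunk-local GFF line into whole-region coordinates."""
    if not line.strip() or line.startswith("#"):
        return line  # leave blank lines and comments untouched
    fields = line.rstrip("\n").split("\t")
    fields[3] = str(int(fields[3]) + offset)  # start column
    fields[4] = str(int(fields[4]) + offset)  # end column
    return "\t".join(fields)


def merge_chunk_gff(per_chunk: List[Tuple[int, List[str]]]) -> List[str]:
    """Concatenate per-chunk GFF lines after shifting each by its offset."""
    merged: List[str] = []
    for offset, lines in per_chunk:
        merged.extend(shift_gff_line(line, offset) for line in lines)
    return merged


chunks = list(split_into_chunks(2_900_000))  # assumed exact length
local_line = "chunk\thalfwise\texon\t1\t300\t.\t+\t0\t."  # made-up feature
merged = merge_chunk_gff([(chunks[1][0], [local_line])])
print(len(chunks), merged[0].split("\t")[3:5])
```

One consequence of using chunks with no overlaps is that a gene spanning a chunk boundary is never seen whole within a single chunk, so predictions near boundaries deserve extra scrutiny.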

    ACKNOWLEDGMENTS

This work was supported by the Wellcome Trust. E.B. is a Wellcome Trust Prize Student.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.

    FOOTNOTES

1 Corresponding author.

Present address: European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK.

E-MAIL Birney{at}ebi.ac.uk; FAX 44-1-2223-494468.

    REFERENCES

  • Bateman, A., E. Birney, R. Durbin, S.R. Eddy, K.L. Howe, and E.L.L. Sonnhammer. 2000. The Pfam protein families database. Nucleic Acids Res. 28: 263-266.
  • Burge, C. and S. Karlin. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78-94.
  • Eddy, S.R. 1998. Profile hidden Markov models. Bioinformatics 14: 755-763.
  • Henikoff, J.G. and S. Henikoff. 2000. Genomic sequence annotation based on translated searching of the BLOCKS+ database. Genome Res. (this issue).
  • Krogh, A. 1997. Two methods for improving performance of an HMM and their application for gene finding. Intell. Syst. Mol. Biol. 5: 179-186.
  • Krogh, A., M. Brown, I.S. Mian, K. Sjolander, and D. Haussler. 1994. Hidden Markov models in computational biology: Applications to protein modeling. J. Mol. Biol. 235: 1501-1531.
  • Kulp, D., D. Haussler, M.G. Reese, and F.H. Eeckman. 1996. A generalized hidden Markov model for the recognition of human genes in DNA. Intell. Syst. Mol. Biol. 4: 134-142.
  • Moult, J., T. Hubbard, K. Fidelis, and J.T. Pedersen. 1999. Critical assessment of methods of protein structure prediction (CASP): Round III. Proteins Suppl. 3: 2-6.
  • Reese, M., G. Hartzell, N.L. Harris, U. Ohler, and S.E. Lewis. 2000. Genome annotation assessment in Drosophila melanogaster. Genome Res. (this issue).
  • Uberbacher, E.C., Y. Xu, and R.J. Mural. 1996. Discovering and understanding genes in human DNA sequence using GRAIL. Methods Enzymol. 266: 259-281.


10:547-548 ©2000 by Cold Spring Harbor Laboratory Press  ISSN 1088-9051/00 $5.00
