Vol. 10, Issue 4, 547-548, April 2000
METHODS
ABSTRACT
The GeneWise method for combining gene prediction and homology searches was applied to the 2.9-Mb region from Drosophila melanogaster. The results from the Genome Annotation Assessment Project (GASP) showed that GeneWise provided reasonably accurate gene predictions. Further investigation indicates that many of the incorrect gene predictions from GeneWise were due to transposons with valid protein-coding genes and the remaining cases are pseudogenes or possible annotation oversights.
INTRODUCTION
Critical assessment of machine learning techniques is necessary to establish how effective computational methods are. The critical assessment of protein structure prediction (CASP) has become a benchmark for protein structure assessment worldwide (Moult et al. 1999). We welcomed the opportunity offered by Reese and coworkers (2000) to independently assess the gene prediction methods available and provided one of the methods we developed, GeneWise, for this study.
The use of protein and EST similarity to help gene prediction is widespread, including methods such as Genie (Kulp et al. 1996) and GRAIL (Uberbacher et al. 1996). The GeneWise approach builds on the success of hidden Markov models (HMMs) for modeling both protein family information (Krogh et al. 1994; Eddy 1998) and gene predictions (Kulp et al. 1996; Burge and Karlin 1997; Krogh 1997). GeneWise is an HMM that is formed by the principled combination of two separate HMMs (E. Birney and R. Durbin, in prep.). GeneWise can therefore be thought of as considering every possible gene prediction in a genomic sequence and comparing each one to the protein profile-HMM. The best combined score of both the gene prediction and the protein profile-HMM is used to provide a simultaneous gene prediction and protein alignment.
To use GeneWise for gene prediction, one needs a source of homology information. In this case, we used protein profile-HMMs from PFAM (Bateman et al. 2000). One of the major drawbacks of using GeneWise is its prohibitive computational cost. We addressed this here by using the halfwise method, which prefilters the protein profile-HMMs used in the comparison (see Methods). The results presented here are the completely automatic annotation from GeneWise, without any manual intervention in the process.
RESULTS
A total of 165 gene predictions with 252 exons were made in the 2.9-Mb genomic segment. Of the 252 exons, 216 overlapped in some way with the std3 dataset of definite and possible predictions. This left 36 exons in 23 predictions outside of this set. Of these, 16 were matches to profile-HMMs of transposons or retroviral transposons. The remaining 20 exons were potential mispredictions or annotation mistakes. By manual examination of these cases we found four potential mispredictions by GeneWise, in each case a trailing exon in an otherwise correct gene prediction. Of the remaining 16 exons, 10 were clear annotation oversights, leaving 6 that were less clear cut; pseudogenes, for example, might explain the presence of these hits. There were no predictions by GeneWise of completely wrong genes, in line with our expectation, as GeneWise only predicts genes by virtue of their homology to other genes. We would therefore place our base pair accuracy far higher than reported (in the 90% range) and the number of wrong gene predictions at 0.
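The exon accounting above can be checked with a short tally. This is only a sketch that uses the counts quoted in the text; the variable names are ours, and the two fractions are simple exon-level ratios, not the base pair accuracy measure used in the assessment.

```python
# Exon accounting, using only the counts stated in the Results text.
total_exons = 252
overlapping_std3 = 216

# Exons that fall outside the std3 set of definite and possible predictions.
outside_std3 = total_exons - overlapping_std3

# Exons outside std3 that matched transposon profile-HMMs.
transposon = 16
unexplained = outside_std3 - transposon

# Manual examination of the unexplained exons.
mispredictions = 4          # trailing exons in otherwise correct predictions
annotation_oversights = 10  # clear oversights in the reference annotation
unclear = unexplained - mispredictions - annotation_oversights

# Illustrative exon-level ratios (not the GASP base pair measure).
frac_overlapping = overlapping_std3 / total_exons
frac_mispredicted = mispredictions / total_exons

print(outside_std3, unexplained, unclear)
print(f"{frac_overlapping:.3f} {frac_mispredicted:.3f}")
```

The tally reproduces the figures in the text: 36 exons outside std3, 20 left after removing transposon matches, and 6 unclear cases after the manual examination.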
DISCUSSION
The GASP assessment was a valuable exercise in providing independent evaluation of gene prediction effectiveness. Providing clear-cut assessment of gene predictions is a difficult task and was not helped by the time pressures on both the contributing groups and the assessing group in producing this study. It is clear that the rules for which predictions will be considered real need to be detailed in the future, and the ability to assess such things as pseudogene predictions may also be important. Ideally, the assessing group would carry out experiments after the gene predictions have been made, so that it is clearer that people have at least attempted to verify a gene prediction experimentally.
The predictions made by GeneWise were very much in line with the predictions made using the BLOCKS method (Henikoff et al. 2000). The BLOCKS method considers smaller, ungapped and unspliced motifs drawn from a broader database than PFAM. As a result, there are differences due to the different database source and due to the method; in particular, GeneWise tends to predict more coding sequence than BLOCKS for a particular family.
The effectiveness of GeneWise in this study was reported at below the levels we believe to be correct. It is our belief that the specificity numbers for all methods are not well assessed in this study, and that people should not quote them without considerable discussion of the shortcomings of this assessment, that is, the counting of transposon genes and annotation oversights as errors. Even so, this exercise is valuable in raising awareness of the problems in both prediction and assessment. We look forward to participating in future studies.
METHODS
The method used in this study, halfwise, is part of the Wise2 package available from http://www.sanger.ac.uk/Software/Wise2. halfwise is a Perl script that uses BLASTX to compare the DNA sequence against a protein database designed to represent the protein space covered by the PFAM database. The BLASTX search selects a number of potential PFAM models to be used in the more computationally expensive GeneWise method.
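The prefiltering idea can be illustrated with a short Python sketch. It is not the halfwise code: the hit format, the protein-to-family mapping, the `max_evalue` cutoff, and the family names are all hypothetical, chosen only to show how a cheap similarity search can narrow the set of profile-HMMs passed to an expensive comparison.

```python
def select_models(blastx_hits, protein_to_family, max_evalue=1e-3):
    """Pick candidate families from similarity-search hits.

    blastx_hits: iterable of (subject_id, evalue) pairs (hypothetical format).
    protein_to_family: dict mapping a database protein to the family it represents.
    max_evalue: illustrative cutoff, not a value taken from halfwise.
    Returns family names ordered from best (lowest) E-value to worst.
    """
    best = {}
    for subject_id, evalue in blastx_hits:
        family = protein_to_family.get(subject_id)
        if family is None or evalue > max_evalue:
            continue  # unknown protein or hit too weak to justify a full run
        if family not in best or evalue < best[family]:
            best[family] = evalue  # keep the strongest hit per family
    return sorted(best, key=best.get)


# Toy data: four hits against proteins representing three made-up families.
hits = [("p1", 1e-20), ("p2", 0.5), ("p3", 1e-5), ("p4", 1e-30)]
mapping = {"p1": "FAM_A", "p2": "FAM_B", "p3": "FAM_A", "p4": "FAM_C"}
selected = select_models(hits, mapping)
print(selected)
```

Only the families returned by such a filter would then be compared to the DNA with the full method, which is where the saving in computation comes from.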
The DNA sequence was split up into 100-kb chunks with no overlaps, and each chunk was run through the halfwise method. The resulting GFF output was then processed to assemble the complete GFF file. The total time to perform the analysis was a weekend of off-peak computer resources at the Sanger Centre.
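The chunk-and-reassemble step can be sketched as follows. This is a minimal illustration under our own assumptions, not the script used in the study: the function names are hypothetical, chunk bounds are 0-based half-open, and GFF start and end values are taken to be the 1-based fourth and fifth tab-separated columns.

```python
CHUNK_SIZE = 100_000  # 100-kb chunks, as stated in the text


def chunk_bounds(seq_length, chunk_size=CHUNK_SIZE):
    """Return 0-based, half-open (start, end) bounds of non-overlapping chunks."""
    return [(start, min(start + chunk_size, seq_length))
            for start in range(0, seq_length, chunk_size)]


def shift_gff_line(line, offset):
    """Shift the start and end columns of one GFF line by a chunk offset."""
    line = line.rstrip("\n")
    if not line or line.startswith("#"):
        return line  # leave blank lines and comments untouched
    fields = line.split("\t")
    fields[3] = str(int(fields[3]) + offset)  # start, 1-based within the chunk
    fields[4] = str(int(fields[4]) + offset)  # end, 1-based within the chunk
    return "\t".join(fields)


def merge_chunk_gff(per_chunk_lines, bounds):
    """Combine per-chunk GFF lines into whole-sequence coordinates."""
    merged = []
    for (start, _end), lines in zip(bounds, per_chunk_lines):
        merged.extend(shift_gff_line(line, start) for line in lines)
    return merged


bounds = chunk_bounds(2_900_000)
print(len(bounds), bounds[0], bounds[-1])
```

One consequence of using chunks with no overlap is that a gene spanning a chunk boundary cannot be predicted within a single chunk, so predictions near boundaries may deserve a second look.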
ACKNOWLEDGMENTS
This work was supported by the Wellcome Trust. E.B. is a Wellcome Trust Prize Student.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
1 Corresponding author.
Present address: European Molecular Biology Laboratory-European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK.
E-MAIL Birney{at}ebi.ac.uk; FAX 44-1-2223-494468.