Vol. 10, Issue 4, 511-515, April 2000
METHODS
ABSTRACT
GeneID is a program to predict genes in anonymous genomic sequences, designed with a hierarchical structure. In the first step, splice sites and start and stop codons are predicted and scored along the sequence using position weight matrices (PWMs). In the second step, exons are built from the sites. Exons are scored as the sum of the scores of the defining sites, plus the log-likelihood ratio of a Markov model for coding DNA. In the last step, the gene structure is assembled from the set of predicted exons, maximizing the sum of the scores of the assembled exons. In this paper we describe how the PWMs for sites and the Markov model of coding DNA were derived for Drosophila melanogaster. We also compare other models of coding DNA with the Markov model. Finally, we present and discuss the results obtained when GeneID is used to predict genes in the Adh region. These results show that the accuracy of GeneID predictions is currently comparable to that of other existing tools, but that GeneID is likely to be more efficient in terms of speed and memory usage. GeneID is available at http://www1.imim.es/~eblanco/GeneId.
INTRODUCTION
GeneID (Guigó et al. 1992) was
one of the first programs to predict full exonic structures of
vertebrate genes in anonymous DNA sequences. GeneID was
designed with a hierarchical structure: First, gene-defining signals
(splice sites and start and stop codons) were predicted along the query
DNA sequence. Next, potential exons were constructed from these sites,
and finally the optimal scoring gene prediction was assembled from the
exons. In the original GeneID the scoring function to
optimize was rather heuristic: The sequence sites were predicted and
scored using position weight matrices (PWMs), a number of coding
statistics were computed on the predicted exons, and each exon was
scored as a function of the scores of the exon-defining sites and of the coding statistics. A neural network was used to estimate the coefficients of this function. An exhaustive search of the space of possible
gene assemblies was performed to rank predicted genes according to a score
obtained through a complex function of the scores of the assembled exons.
In recent years GeneID has seen some use, mostly through
a now nonfunctional e-mail server at Boston University (geneid{at}darwin.bu.edu) and through a WWW server at the IMIM
(http://www1.imim.es/geneid.html). During this period, however, there
have been substantial developments in the field of computational gene
identification (for recent reviews, see Claverie 1997; Burge and Karlin 1998; Haussler 1998), and the original GeneID has become
clearly inferior to other existing tools. Therefore, some time ago we
began developing an improved version of the GeneID
program, which is at least as accurate as other existing tools but much
more efficient at handling very large genomic sequences, both in terms
of speed and usage of memory. This new version maintains the
hierarchical structure (signal to exon to gene) in the original
GeneID, but we have simplified the scoring schema and
furnished it with a probabilistic meaning: Scores for both
exon-defining signals and protein-coding potential are computed as
log-likelihood ratios; for a given predicted exon these are summed
into the exon score, which is therefore also a log-likelihood ratio. Then,
a dynamic programming algorithm (Guigó 1998) is used to search
the space of predicted exons to assemble the gene structure (in the
general case, multiple genes in both strands) maximizing the sum of the
scores of the assembled exons, which can also be assumed to be a
log-likelihood ratio. Execution time in this new version of
GeneID grows linearly with the size of the input sequence,
currently at ~2 Mb per minute on a Pentium III (500 MHz) running
Linux. The amount of memory required is also proportional to the length
of the sequence: ~1 megabyte (MB) per Mb of sequence plus a constant
~15 MB. Thus,
GeneID is able to analyze sequences of virtually any
length, for instance, chromosome-sized sequences.
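As a rough illustration of the linear scaling quoted above, the following sketch turns the stated figures (~2 Mb per minute, ~1 MB per Mb plus ~15 MB) into an estimate for a sequence of a given length. The function name, its parameters, and the 100-Mb example are assumptions made for illustration; the figures only apply to the hardware mentioned in the text.

```python
def estimate_resources(length_bp, mb_per_minute=2.0,
                       mem_per_mb=1.0, mem_constant=15.0):
    """Return (minutes, megabytes) estimated for a sequence of length_bp.

    Defaults are the approximate figures quoted in the text; the function
    itself is a hypothetical helper, not part of GeneID.
    """
    length_mb = length_bp / 1_000_000          # sequence length in Mb
    minutes = length_mb / mb_per_minute        # linear running time
    memory_mb = mem_per_mb * length_mb + mem_constant
    return minutes, memory_mb


# A hypothetical 100-Mb chromosome-sized sequence.
minutes, memory_mb = estimate_resources(100_000_000)
print(f"{minutes:.0f} min, {memory_mb:.0f} MB")  # prints "50 min, 115 MB"
```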
In this paper we describe the "training" of GeneID to
predict genes in the genome of Drosophila melanogaster. In the
context of GeneID, training essentially means computing PWMs for splice sites and start codons, and deriving a model of coding
DNA, which, in this case, is a Markov model of order 5, similar to the
models introduced by Borodovsky and McIninch (1993). Therefore, in the
following sections, we describe the training data set used,
particularly our attempt to recreate a more realistic scenario to train
and test GeneID by generating semiartificial large genomic
contigs from single-gene DNA sequences, and we briefly describe the
main features of GeneID for D. melanogaster. Then, we present the results obtained in the training data set when
different schemas are used to compute scores for sites and coding
potential, and the results obtained on the D. melanogaster Adh
region when the optimal scoring schema in the training set is used to
predict genes in this region.
METHODS
Data Sets
We have merged the sets of 275 multi- and 141 single-exon
sequences provided by Martin Reese (Reese et al. 2000) as a set of
known D. melanogaster gene-encoding sequences into a single set, the MR set. From the MR set we inferred PWMs for splice sites and start
codons, and the Markov model of order 5 for coding regions. The MR set
contains only single-gene sequences. To assess the accuracy of the
predictions in a more realistic scenario, we have randomly embedded the
sequences in the MR set in a background of artificial random intergenic
DNA as described (R. Guigó, P. Agarwal, J.F. Abril, M. Burset,
and J.W. Fickett, in prep.). Thus, a single sequence of 5,689,206 bp
embedding the 416 genes in the MR set has been used to evaluate the
accuracy of the predictions. The sequence, and the coordinates of the
embedded exons are available at http://www1.imim.es/~gparra/GASP1.
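The embedding idea can be sketched as follows. The actual procedure is the one described in the manuscript in preparation cited above and is not reproduced here; this sketch simply assumes uniformly random background DNA and uniformly distributed gap lengths, and the function name, gap bounds, and toy sequences are hypothetical.

```python
import random


def embed_genes(genes, min_gap=1000, max_gap=20000, seed=0):
    """Place gene sequences, in random order, in random intergenic DNA.

    Returns (contig, placed), where placed is a list of
    (gene, start, end) with 0-based half-open coordinates in the contig.
    Uniform random background and gap lengths are assumptions.
    """
    rng = random.Random(seed)

    def random_dna(n):
        return "".join(rng.choice("ACGT") for _ in range(n))

    order = list(genes)
    rng.shuffle(order)
    pieces, placed, pos = [], [], 0
    for gene in order:
        gap = random_dna(rng.randint(min_gap, max_gap))
        pieces.append(gap)
        pos += len(gap)
        placed.append((gene, pos, pos + len(gene)))
        pieces.append(gene)
        pos += len(gene)
    pieces.append(random_dna(rng.randint(min_gap, max_gap)))  # trailing gap
    return "".join(pieces), placed


# Toy example with two very short made-up "genes".
contig, placed = embed_genes(["ATGAAATAA", "ATGCCCTGA"], min_gap=5, max_gap=10)
for gene, start, end in placed:
    assert contig[start:end] == gene
```

Recording the coordinates at embedding time is what makes it possible to evaluate predictions on the contig afterwards.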
GeneID
As outlined, GeneID for D. melanogaster uses PWMs to predict potential splice sites and start codons. Potential sites are scored as log-likelihood ratios. From the set of predicted sites (which also includes all potential stop codons), the set of all potential exons is built. Exons are scored as the sum of the scores of the defining sites, plus the log-likelihood ratio of the Markov model for coding sequences. Finally, the gene structure is assembled from the set of predicted exons, maximizing the sum of the scores of the assembled exons. The procedure is illustrated in Figure 1, which shows the GeneID predictions in a small region of the Adh sequence.
Predicting and Scoring Sites
Actual splice sites and start codons were extracted from the MR set.
Donor Sites
The MR set contains 757 donor sites. From them, a frequency matrix P was derived from position
−3 to +6 around the
exon-intron boundary, with position 0 being the first position in the
intron. $P_{ij}$ is the probability of observing
nucleotide i [i ∈ (A,C,G,T)] at position
j [j ∈ (−3,...,+6)] in an actual donor
site. The positional frequency Q of nucleotides in the region
−3 to +6 around all GT dinucleotides was also computed (with
position 0 being the position corresponding to the nucleotide G in the
GT dinucleotide). Then, a PWM for donor sites D was
calculated as
$$D_{ij} = \log \frac{P_{ij}}{Q_{ij}} \qquad (1)$$
|
(2) |
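A minimal sketch of how a log-ratio PWM of this kind can be built and applied is given below. It assumes aligned, equal-length windows for true donor sites and for background GT dinucleotides, adds a pseudocount to avoid zero frequencies, and scores a candidate by summing the matrix entries of its nucleotides, which is the usual way a PWM is applied. The pseudocount, the function names, and the toy sequences are assumptions, not values from the paper.

```python
import math

NUCLEOTIDES = "ACGT"


def frequency_matrix(sites, pseudocount=1.0):
    """Per-position nucleotide probabilities for aligned, equal-length sites."""
    matrix = []
    for j in range(len(sites[0])):
        counts = {n: pseudocount for n in NUCLEOTIDES}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        matrix.append({n: counts[n] / total for n in NUCLEOTIDES})
    return matrix


def log_ratio_pwm(true_sites, background_sites, pseudocount=1.0):
    """PWM whose entries are log(P_ij / Q_ij) for each position j, nucleotide i."""
    p = frequency_matrix(true_sites, pseudocount)
    q = frequency_matrix(background_sites, pseudocount)
    return [{n: math.log(p[j][n] / q[j][n]) for n in NUCLEOTIDES}
            for j in range(len(p))]


def score_site(pwm, candidate):
    """Sum of PWM entries over the candidate window (a log-likelihood ratio)."""
    return sum(pwm[j][nuc] for j, nuc in enumerate(candidate))


# Toy 10-nt windows (positions -3..+6), with GT at indices 3 and 4.
donors = ["CAGGTAAGTA", "AAGGTGAGTT", "CAGGTAAGAC"]
background = ["TTTGTCCCAG", "ACCGTTTGCA", "GGAGTCATTC"]
pwm = log_ratio_pwm(donors, background)
print(score_site(pwm, "CAGGTAAGTA") > score_site(pwm, "TTTGTCCCAG"))  # True
```

Because both matrices are computed around GT dinucleotides, the invariant G and T columns contribute zero and only the flanking positions discriminate.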
Predicting and Scoring Exons
GeneID distinguishes four types of exons: (1) initial ORFs, defined by a start codon and a donor site; (2) internal ORFs, defined by an acceptor site and a donor site; (3) terminal ORFs, defined by an acceptor site and a stop codon; and (4) single ORFs, defined by a start codon and a stop codon, which correspond to intronless genes. GeneID constructs all potential exons that are compatible with the predicted sites. (Only the five highest scoring donor sites within frame are considered for each start codon and acceptor site.)
CODING POTENTIAL
All exon and intron sequences were extracted from the MR multiexon data set. A Markov model of order 5 was estimated to model both exon and intron sequences; that is, we estimated the probability distribution of each nucleotide given the pentanucleotide preceding it in exon and intron sequences. From the exon sequences we estimated this probability for each of the three possible frames, building the transition probability matrices $F_1$, $F_2$, $F_3$. $F_j(s_1s_2s_3s_4s_5s_6)$ is the observed probability of finding hexamer $s_1s_2s_3s_4s_5s_6$ with $s_1$ in codon position j, given that pentamer $s_1s_2s_3s_4s_5$ occurs with $s_1$ in codon position j. An initial probability matrix, $I_j$, was estimated from the observed pentamer frequencies at each codon position. From the intron sequences a single transition matrix, $F_0$, was computed, as well as a single initial probability matrix, $I_0$. Then, for each hexamer h and frame j a log-likelihood ratio was computed:
$$\log \frac{F_j(h)}{F_0(h)} \qquad (3)$$
|
(4) |
|
(5) |
|
(6) |
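The frame-specific hexamer log-likelihood ratios described in this section can be sketched as follows. This is an illustration under stated assumptions rather than the GeneID implementation: training exons are assumed to start at codon position 0, a pseudocount is added to every hexamer, and the initial pentamer term is omitted from the score for brevity. All function names and the toy sequences are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

PENTAMERS = ["".join(p) for p in product("ACGT", repeat=5)]


def transition_probs(sequences, frames, pseudocount=1.0):
    """Estimate P(sixth nucleotide | preceding pentamer), one table per frame.

    frames=3 keeps a table per codon position of the hexamer's first base
    (sequences are assumed to start at codon position 0); frames=1 pools
    all positions, as for introns.
    """
    counts = [defaultdict(float) for _ in range(frames)]
    for seq in sequences:
        for i in range(len(seq) - 5):
            counts[i % frames][seq[i:i + 6]] += 1
    tables = []
    for c in counts:
        table = {}
        for pent in PENTAMERS:
            total = sum(c[pent + n] + pseudocount for n in "ACGT")
            for n in "ACGT":
                table[pent + n] = (c[pent + n] + pseudocount) / total
        tables.append(table)
    return tables


def coding_llr_tables(exons, introns):
    """For each frame j, map hexamer h to log(F_j(h) / F_0(h))."""
    coding = transition_probs(exons, frames=3)
    noncoding = transition_probs(introns, frames=1)[0]
    return [{h: math.log(f[h] / noncoding[h]) for h in f} for f in coding]


def coding_score(llr, seq, frame=0):
    """Sum hexamer log-ratios; frame is the codon position of seq[0]."""
    return sum(llr[(frame + i) % 3][seq[i:i + 6]]
               for i in range(len(seq) - 5))


# Toy training data (made up): one repetitive "exon" and one "intron".
exon = "ATG" + "GCT" * 12
intron = "GTAAGT" + "TA" * 15 + "TTTCAG"
llr = coding_llr_tables([exon], [intron])
print(coding_score(llr, exon, frame=0) > coding_score(llr, intron, frame=0))  # True
```

Keeping one table per codon position is what lets the score distinguish the correct reading frame from the two shifted ones.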
Assembling Genes
GeneID predicts gene structures, which may comprise multiple genes on both strands, as sequences of frame-compatible nonoverlapping exons. A minimum intron length of 40 bp and a minimum intergenic distance of 300 bp are enforced. If a gene structure, g, is a sequence of exons, $e_1, e_2, \ldots, e_n$, a natural scoring function is
|
(7) |
|
(8) |
7.
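The assembly step can be illustrated with a simplified chaining sketch. It selects the set of nonoverlapping exons with the maximum summed score, subject to the 40-bp minimum intron length, using sorting and binary search. Frame compatibility, exon types, strands, and the minimum intergenic distance are deliberately ignored, so this is not the algorithm of Guigó (1998), only an illustration of the one-dimensional chaining idea. The function name and the example exons are hypothetical.

```python
from bisect import bisect_right


def best_chain(exons, min_intron=40):
    """Return (total_score, chain) for the best-scoring chain of exons.

    exons: list of (start, end, score), 1-based inclusive coordinates.
    Consecutive exons must be separated by at least min_intron bases.
    """
    exons = sorted(exons, key=lambda e: e[1])
    ends = [e[1] for e in exons]
    best = []    # best[k] = (score of best chain ending at exon k, previous index)
    leader = []  # leader[k] = index of the best chain end among exons[0..k]
    for k, (start, end, score) in enumerate(exons):
        # Exons 0..p end early enough to precede exon k.
        p = bisect_right(ends, start - min_intron - 1, 0, k) - 1
        prev = leader[p] if p >= 0 else None
        if prev is not None and best[prev][0] > 0:
            best.append((best[prev][0] + score, prev))
        else:
            best.append((score, None))
        if k > 0 and best[leader[k - 1]][0] >= best[k][0]:
            leader.append(leader[k - 1])
        else:
            leader.append(k)
    if not exons or best[leader[-1]][0] <= 0:
        return 0.0, []  # no exon is worth predicting
    k = leader[-1]
    total = best[k][0]
    chain = []
    while k is not None:
        chain.append(exons[k])
        k = best[k][1]
    return total, chain[::-1]


candidates = [(100, 200, 5.0), (150, 260, 6.0), (230, 300, 2.0),
              (320, 400, 4.0), (420, 500, -1.0)]
print(best_chain(candidates))
# (10.0, [(150, 260, 6.0), (320, 400, 4.0)])
```

After sorting, each exon is handled with one binary search, which reflects why chaining scales well to very long sequences.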
RESULTS
Training GeneID
We tested two additional models of coding DNA before settling on a
Markov model of order 5: a codon usage model, and a model that combined
a Markov model of order 1 of the translated amino acid sequence and a
codon preference model (see Guigó 1999 for details on these
models). In both cases, log-likelihood ratios were obtained in a
similar way to the Markov model log-likelihood ratios (see Methods).
For instance, in the case of the codon usage model, for each triplet
s, we estimated the probability of the codon s in
coding sequences, U(s), and the probability of the triplet in
noncoding sequences, $U_0(s)$, and built the
log-likelihood ratio
$$\log \frac{U(s)}{U_0(s)}$$
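The codon usage log-likelihood ratio just described can be sketched as follows. The sketch assumes that coding probabilities are estimated from in-frame codons, that noncoding probabilities are estimated from all overlapping triplets, and that a pseudocount is added to every triplet; these choices, the function names, and the toy sequences are assumptions rather than details taken from the paper.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def triplet_probs(sequences, step, pseudocount=1.0):
    """Triplet probabilities; step=3 reads in frame, step=1 reads all triplets."""
    counts = Counter()
    for seq in sequences:
        for i in range(0, len(seq) - 2, step):
            counts[seq[i:i + 3]] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: (counts[c] + pseudocount) / total for c in CODONS}


def codon_usage_llr(coding, noncoding):
    """Map each triplet s to log(U(s) / U0(s))."""
    u = triplet_probs(coding, step=3)       # in-frame codons
    u0 = triplet_probs(noncoding, step=1)   # every overlapping triplet
    return {c: math.log(u[c] / u0[c]) for c in CODONS}


def score_orf(llr, seq):
    """Sum the log-ratios of the in-frame codons of seq."""
    return sum(llr[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


# Toy training sequences (made up).
llr_codon = codon_usage_llr(["ATGGCTGCTGCTAAA"], ["TATATATATATTTTATA"])
print(llr_codon["GCT"] > 0, llr_codon["TAT"] < 0)  # True True
```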
Results in the Adh Region
Table 2 shows the results when GeneID, with the
parameters estimated above, is used to predict genes in the
Adh region. Both the results originally
submitted to the Genome Annotation Assessment Project (GASP) and the
results obtained with the currently available version of
GeneID are given (see Discussion). In addition, we provide
information on execution time and memory requirements of
GeneID to analyze the Adh region. The detailed
exon coordinates of the predictions by GeneID can be found
at http://www1.imim.es/~gparra/GASP1.
DISCUSSION
The results presented above indicate that the current version of GeneID shows an accuracy, as measured by the GASP contest, comparable to that of the programs based on hidden Markov models (HMMs), which exhibited the highest accuracy in GASP. In favor of GeneID is the simplicity and modularity of its structure, which is likely to make the program more efficient in terms of speed and memory usage. In GeneID the gene identification problem is stated as a one-dimensional chaining problem, for which more efficient algorithms may be designed than for an alignment problem, which is how gene identification is implicitly formulated in HMMs. Against GeneID is the somewhat less rigorous probabilistic treatment of the scoring schema. For instance, we are currently unable to justify the "magic number" (IW, see Methods), which needs to be added to the exon scores to obtain accurate predictions.
The GeneID predictions originally submitted to GASP were rather poor (see Table 2). Two bugs in the version of the program under development at that time were to blame. They were discovered and a second prediction was submitted (see Table 2). After GASP we changed a rather complex scoring schema to the simpler and more natural schema described in Methods, which resulted in higher accuracy. This is the scoring schema currently in use in GeneID.
Although GeneID is currently fully functional, we are still developing
it further. Our short-term plans include, among others, training GeneID to predict genes in the human and
the Arabidopsis thaliana genomes and including the
possibility of incorporating the results of database searches
(both ESTs and proteins) in the GeneID prediction schema, which
can be done rather naturally. The possibility of including external
evidence to "force" known genes or exons into the prediction is
already included in the working version of GeneID. This
may be useful for reannotation of very large genomic sequences.
Finally, the current structure of GeneID can be highly
parallelized, and we are also working in this direction.
ACKNOWLEDGMENTS
We thank Josep F. Abril and Moisès Burset for helpful discussions and constant encouragement. This work was supported by a grant from Plan Nacional de I+D (BIO98-0443-C02-01) from the Ministerio de Educación y Ciencia (Spain).
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
1 Corresponding author.
E-MAIL rguigo{at}imim.es; FAX 34-93-221-3237.
Received February 9, 2000; accepted in revised form February 28, 2000.