Genome Research

Vol. 10, Issue 4, 511-515, April 2000

METHODS
GeneID in Drosophila

Genís Parra, Enrique Blanco, and Roderic Guigó1

Grup de Recerca en Informàtica Mèdica, Institut Municipal d'Investigació Mèdica (IMIM), Universitat Pompeu Fabra, E-08003 Barcelona, Spain



    ABSTRACT

GeneID is a hierarchically structured program for predicting genes in anonymous genomic sequences. In the first step, splice sites and start and stop codons are predicted and scored along the sequence using position weight matrices (PWMs). In the second step, exons are built from the sites. Exons are scored as the sum of the scores of the defining sites, plus the log-likelihood ratio of a Markov model for coding DNA. In the last step, the gene structure is assembled from the set of predicted exons, maximizing the sum of the scores of the assembled exons. In this paper we describe how the PWMs for sites and the Markov model of coding DNA were obtained for Drosophila melanogaster. We also compare other models of coding DNA with the Markov model. Finally, we present and discuss the results obtained when GeneID is used to predict genes in the Adh region. These results show that the accuracy of GeneID predictions is currently comparable to that of other existing tools but that GeneID is likely to be more efficient in terms of speed and memory usage. GeneID is available at http://www1.imim.es/~eblanco/GeneId.



    INTRODUCTION

GeneID (Guigó et al. 1992) was one of the first programs to predict full exonic structures of vertebrate genes in anonymous DNA sequences. GeneID was designed with a hierarchical structure: First, gene-defining signals (splice sites and start and stop codons) were predicted along the query DNA sequence. Next, potential exons were constructed from these sites, and finally the optimal scoring gene prediction was assembled from the exons. In the original GeneID the scoring function to optimize was rather heuristic: The sequence sites were predicted and scored using position weight matrices (PWMs), a number of coding statistics were computed on the predicted exons, and each exon was scored as a function of the scores of the exon-defining sites and of the coding statistics. A neural network was used to estimate the coefficients of this function. An exhaustive search of the space of possible gene assemblies was performed to rank predicted genes according to a score obtained through a complex function of the scores of the assembled exons.

In recent years GeneID has seen some use, mostly through a now nonfunctional e-mail server at Boston University (geneid{at}darwin.bu.edu) and through a WWW server at the IMIM (http://www1.imim.es/geneid.html). During this period, however, there have been substantial developments in the field of computational gene identification (for recent reviews, see Claverie 1997; Burge and Karlin 1998; Haussler 1998), and the original GeneID has become clearly inferior to other existing tools. Therefore, some time ago we began developing an improved version of the GeneID program, which is at least as accurate as other existing tools but much more efficient at handling very large genomic sequences, both in terms of speed and usage of memory. This new version maintains the hierarchical structure (signal to exon to gene) of the original GeneID, but we have simplified the scoring schema and furnished it with a probabilistic meaning: Scores for both exon-defining signals and protein-coding potential are computed as log-likelihood ratios, which for a given predicted exon are summed up into the exon score, which is consequently also a log-likelihood ratio. Then, a dynamic programming algorithm (Guigó 1998) is used to search the space of predicted exons to assemble the gene structure (in the general case, multiple genes in both strands) maximizing the sum of the scores of the assembled exons, which can also be assumed to be a log-likelihood ratio. Execution time in this new version of GeneID grows linearly with the size of the input sequence, currently at ~2 Mb per minute on a Pentium III (500 MHz) running Linux. Memory use also grows linearly with the length of the sequence: ~1 megabyte (MB) per Mb of sequence, plus a constant amount of ~15 MB that is independent of the length of the sequence. Thus, GeneID is able to analyze sequences of virtually any length, for instance, chromosome-size sequences.
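
As a rough worked example of the linear time and memory figures quoted above, the following Python sketch turns them into an estimate for a sequence of a given length. The function name and parameter names are hypothetical, the default values are simply the approximate figures from the text for that particular hardware, and the result should be read as an order-of-magnitude estimate rather than a measurement.

```python
def estimate_resources(length_bp, mb_per_minute=2.0, mem_per_mb=1.0, mem_constant=15.0):
    """Return (minutes, megabytes) under the linear model quoted in the text.

    mb_per_minute: throughput in megabases per minute (~2 on the quoted machine).
    mem_per_mb:    megabytes of memory per megabase of sequence (~1).
    mem_constant:  fixed memory overhead in megabytes (~15).
    """
    megabases = length_bp / 1_000_000
    minutes = megabases / mb_per_minute
    megabytes = mem_constant + mem_per_mb * megabases
    return minutes, megabytes


# Example: a sequence of 5,689,206 bp.
minutes, megabytes = estimate_resources(5_689_206)
print(f"{minutes:.1f} min, {megabytes:.1f} MB")  # prints "2.8 min, 20.7 MB"
```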

In this paper we describe the "training" of GeneID to predict genes in the genome of Drosophila melanogaster. In the context of GeneID, training essentially means computing PWMs for splice sites and start codons, and deriving a model of coding DNA, which, in this case, is a Markov model of order 5, similar to the models introduced by Borodovsky and McIninch (1993). In the following sections, we describe the training data set used, particularly our attempt to recreate a more realistic scenario to train and test GeneID by generating semiartificial large genomic contigs from single-gene DNA sequences, and we briefly describe the main features of GeneID for D. melanogaster. Then, we present the results obtained in the training data set when different schemas are used to compute scores for sites and coding potential, and the results obtained on the D. melanogaster Adh region when the optimal scoring schema in the training set is used to predict genes in this region.

    METHODS

Data Sets

We have merged the sets of 275 multi- and 141 single-exon sequences provided by Martin Reese (Reese et al. 2000) as a set of known D. melanogaster gene-encoding sequences into a single set, the MR set. From the MR set we inferred PWMs for splice sites and start codons, and the Markov model of order 5 for coding regions. The MR set contains only single-gene sequences. To assess the accuracy of the predictions in a more realistic scenario, we have randomly embedded the sequences in the MR set in a background of artificial random intergenic DNA as described (R. Guigó, P. Agarwal, J.F. Abril, M. Burset, and J.W. Fickett, in prep.). Thus, a single sequence of 5,689,206 bp embedding the 416 genes in the MR set has been used to evaluate the accuracy of the predictions. The sequence and the coordinates of the embedded exons are available at http://www1.imim.es/~gparra/GASP1.

GeneID

As outlined, GeneID for D. melanogaster uses PWMs to predict potential splice sites and start codons. Potential sites are scored as log-likelihood ratios. From the set of predicted sites (which includes, in addition, all potential stop codons), the set of all potential exons is built. Exons are scored as the sum of the scores of the defining sites, plus the log-likelihood ratio of the Markov model for coding sequences. Finally, the gene structure is assembled from the set of predicted exons, maximizing the sum of the scores of the assembled exons. The procedure is illustrated in Figure 1, which shows the GeneID predictions in a small region of the Adh sequence.


Figure 1   Predictions obtained by GeneID in the region 462500-477500 from the Adh sequence, compared with the annotation in the standard std3 set. In a first step, GeneID identifies and scores all possible donor (blue) and acceptor (yellow) sites, start codons (green), and stop codons (red) using PWMs; the height of the corresponding spike is proportional to the site score. A total of 4704 sites were generated along this 15,000-bp region by GeneID; only the highest scoring ones are displayed here. In a second step, GeneID builds all exons compatible with these sites. A total of 11,967 exons were built in this particular region (not displayed). Exons are scored as the sum of the scores of the defining sites, plus the score of their coding potential measured according to a Markov model of order 5. The coding potential is displayed along the DNA sequence (MM_score). Regions strong in red are more likely to be coding than regions strong in blue. From the set of predicted exons, the gene structure is generated, maximizing the sum of the scores of the assembled exons. Exons assembled in the predicted genes are drawn with heights proportional to their scores. A two-color code is used to indicate frame compatibility: Two adjacent exons are frame compatible if the right half of the upstream exon (the remainder) matches the color of the left half of the downstream exon (the frame). Data are from the gff2ps program (available at http://www1.imim.es/~jabril/GFFTOOLS/GFF2PS.html). The input GFF and the configuration files required for gff2ps to generate this diagram can be found at http://www1.imim.es/~gparra/GASP1.

Predicting and Scoring Sites

Actual splice sites and start codons were extracted from the MR set.

Donor Sites

The MR set contains 757 donor sites. From them, a frequency matrix P was derived from position -3 to +6 around the exon-intron boundary, with position 0 being the first position in the intron. Pij is the probability of observing nucleotide i (i ∈ {A, C, G, T}) at position j (j ∈ {-3, ..., +6}) in an actual donor site. The positional frequency Q of nucleotides in the region -3 to +6 around all GT dinucleotides was also computed (with position 0 being the position of the nucleotide G in each GT dinucleotide). Then, a PWM for donor sites, D, was calculated as
$$ D_{ij} = \log\left(\frac{P_{ij}}{Q_{ij}}\right) \qquad (1) $$
PWMs for acceptor sites, A, and start codons, S, were obtained in a similar way. These matrices can be obtained from http://www1.imim.es/~gparra/GASP1.
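
The construction in Equation 1 can be sketched in a few lines of Python. This is a minimal illustration, not the GeneID code: the function names are hypothetical, the input is assumed to be a list of equal-length aligned windows over the alphabet ACGT, and the add-one pseudocount (used to avoid taking the logarithm of zero) is an assumption that is not described in the text.

```python
import math

NUCLEOTIDES = "ACGT"


def frequency_matrix(sites, pseudocount=1.0):
    """Column-wise nucleotide probabilities for equal-length aligned windows.

    Returns a list with one dict per position, mapping nucleotide -> probability.
    The pseudocount is an assumption made here to keep every probability > 0.
    """
    length = len(sites[0])
    matrix = []
    for j in range(length):
        counts = {n: pseudocount for n in NUCLEOTIDES}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        matrix.append({n: counts[n] / total for n in NUCLEOTIDES})
    return matrix


def log_ratio_pwm(P, Q):
    """Equation 1: D[j][i] = log(P[j][i] / Q[j][i]) for position j, nucleotide i."""
    return [{n: math.log(p[n] / q[n]) for n in NUCLEOTIDES} for p, q in zip(P, Q)]


# Toy usage: P from "actual" sites, Q from all candidate windows.
P = frequency_matrix(["AG", "AG"])
Q = frequency_matrix(["AG", "CT"])
D = log_ratio_pwm(P, Q)
```

In the setting described above, P would be estimated from the windows around the 757 actual donor sites and Q from the windows around every GT dinucleotide; nucleotides enriched in true sites receive positive entries and depleted ones negative entries.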

PWMs can be used to score each potential donor site (GT), acceptor site (AG), and start codon (ATG), along a given sequence. The score of a potential donor site, S = s1s2 . . . s10 within the sequence is computed as
$$ L_D(S) = \sum_{i=1}^{10} D_{s_i i} \qquad (2) $$
This is the log-likelihood ratio of the probability of observing this particular sequence S in an actual site versus the probability of observing S in any false GT site. Similar scores are computed for acceptor sites (LA) and start codons (LB).
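
Scoring with Equation 2 amounts to a table lookup per position followed by a sum. The sketch below scans a sequence for GT dinucleotides and scores the window around each one; the function names, the toy matrix, and the `offset` parameter (number of window positions that precede the G of GT, which would be 3 for the -3 to +6 window described above) are illustrative assumptions.

```python
def score_site(pwm, window):
    """Equation 2: sum the PWM entry for the observed nucleotide at each position."""
    return sum(pwm[i][nt] for i, nt in enumerate(window))


def scan_donors(seq, pwm, offset):
    """Score every GT in seq whose full window fits inside the sequence.

    offset is the number of window positions before the G of GT.
    Returns a list of (position_of_G, score) pairs.
    """
    width = len(pwm)
    hits = []
    pos = seq.find("GT")
    while pos != -1:
        start = pos - offset
        if start >= 0 and start + width <= len(seq):
            hits.append((pos, score_site(pwm, seq[start:start + width])))
        pos = seq.find("GT", pos + 1)
    return hits


def toy_pwm(consensus):
    """Hypothetical matrix: +1 for the consensus nucleotide, -1 otherwise."""
    return [{n: (1.0 if n == c else -1.0) for n in "ACGT"} for c in consensus]


hits = scan_donors("CAGTACGTT", toy_pwm("AGTA"), offset=1)
```

With a real matrix, a threshold on the score would typically be applied to discard weak candidates; the text does not state the cutoff used, so none is shown here.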

Predicting and Scoring Exons

GeneID distinguishes four types of exons: (1) initial ORFs, defined by a start codon and a donor site; (2) internal ORFs, defined by an acceptor site and a donor site; (3) terminal ORFs, defined by an acceptor site and a stop codon; and (4) single ORFs, defined by a start codon and a stop codon, which correspond to intronless genes. GeneID constructs all potential exons that are compatible with the predicted sites. (Only the five highest scoring donor sites within frame are considered for each start codon and acceptor site.)
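
To make the exon-building step concrete, the following Python sketch enumerates internal exons only (acceptor to donor); the other three types would pair different site kinds in the same way. It is a simplified illustration under stated assumptions, not the GeneID procedure: the coordinate conventions, the reading of "within frame" as "at least one reading frame free of stop codons", and all function names are assumptions.

```python
STOPS = {"TAA", "TAG", "TGA"}


def open_frames(seq, start, end):
    """Frames (0, 1, 2) in which seq[start:end] has no in-frame stop codon."""
    frames = []
    for f in range(3):
        codons = (seq[i:i + 3] for i in range(start + f, end - 2, 3))
        if not any(c in STOPS for c in codons):
            frames.append(f)
    return frames


def internal_exons(seq, acceptors, donors, max_donors=5):
    """Build candidate internal exons.

    acceptors: list of (index of the A in AG, site score)
    donors:    list of (index of the G in GT, site score)
    Returns (exon_start, exon_end, summed site score) with the exon being
    seq[exon_start:exon_end]. Only the max_donors best donors are kept
    for each acceptor.
    """
    exons = []
    for a_pos, a_score in acceptors:
        start = a_pos + 2  # first exon base follows the AG
        candidates = []
        for d_pos, d_score in donors:
            if d_pos > start and open_frames(seq, start, d_pos):
                candidates.append((d_score, d_pos))
        for d_score, d_pos in sorted(candidates, reverse=True)[:max_donors]:
            exons.append((start, d_pos, a_score + d_score))
    return exons


# Toy usage: AG at index 2, exon GCCGCC, GT at index 10.
exons = internal_exons("CCAGGCCGCCGTCC", [(2, 1.0)], [(10, 2.0)])
```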

CODING POTENTIAL

All exon and intron sequences were extracted from the MR multiexon data set. A Markov model of order 5 was estimated to model both exon and intron sequences; that is, we estimated the probability distribution of each nucleotide given the pentanucleotide preceding it in exon and intron sequences. From the exon sequences we estimated this probability for each of the three possible frames, building the transition probability matrices F1, F2, F3. Fj(s1s2s3s4s5s6) is the observed probability of finding hexamer s1s2s3s4s5s6 with s1 in codon position j, given that pentamer s1s2s3s4s5 occurs with s1 in codon position j. An initial probability matrix, Ij, was estimated from the observed pentamer frequencies at each codon position. From the intron sequences a single transition matrix, F0, was computed, as well as a single initial probability matrix, I0. Then, for each hexamer h and frame j a log-likelihood ratio was computed:
$$ LF^{j}(h) = \log \frac{F^{j}(h)}{F_{0}(h)} \qquad (3) $$
as well as for each pentamer p and frame j
$$ LI^{j}(p) = \log \frac{I^{j}(p)}{I_{0}(p)} \qquad (4) $$
The distributions F and I can be obtained from http://www1.imim.es/~gparra/GASP1.

Then, given a sequence S of length l in frame j, the coding potential of the sequence is defined as
$$ L_M(S) = LI^{j}(S_{1..5}) + \sum_{i=1}^{l-5} LF^{j}(S_{i..i+5}) \qquad (5) $$
where Si..k is the subsequence of S starting in position i and ending in position k.
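
Equations 3 to 5 can be illustrated with the following Python sketch, which estimates the transition and initial probabilities from counts and then sums the log-likelihood ratios along a sequence. Several points are assumptions rather than statements about GeneID: the class and function names are hypothetical; codon positions are numbered 0, 1, 2 rather than 1, 2, 3; training exon sequences are assumed to start at codon position 0; add-one smoothing is used to avoid zero probabilities; and the codon position of each hexamer is advanced along the sequence as (frame + i) mod 3, which is one reading of the superscript j in Equation 5.

```python
import math
from collections import defaultdict

ORDER = 5  # each nucleotide is conditioned on the 5 preceding ones
ALPHABET = "ACGT"


def kmer_counts(sequences, k, periodic):
    """Count k-mers; if periodic, keep one table per codon position of the first base."""
    n_tables = 3 if periodic else 1
    tables = [defaultdict(float) for _ in range(n_tables)]
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            tables[i % n_tables][seq[i:i + k]] += 1.0
    return tables


class MarkovModel:
    def __init__(self, sequences, periodic, pseudo=1.0):
        self.hexamers = kmer_counts(sequences, ORDER + 1, periodic)
        self.pentamers = kmer_counts(sequences, ORDER, periodic)
        self.pseudo = pseudo  # smoothing constant (assumption)

    def transition(self, hexamer, j):
        """P(last base | preceding pentamer) at codon position j (the F matrices)."""
        table = self.hexamers[j % len(self.hexamers)]
        context = sum(table.get(hexamer[:ORDER] + n, 0.0) for n in ALPHABET)
        return (table.get(hexamer, 0.0) + self.pseudo) / (context + 4 * self.pseudo)

    def initial(self, pentamer, j):
        """Probability of a pentamer at codon position j (the I matrices)."""
        table = self.pentamers[j % len(self.pentamers)]
        total = sum(table.values())
        return (table.get(pentamer, 0.0) + self.pseudo) / (total + self.pseudo * 4 ** ORDER)


def coding_potential(seq, frame, coding, intron):
    """Equation 5: initial log ratio plus the sum of hexamer log ratios."""
    score = math.log(coding.initial(seq[:ORDER], frame) / intron.initial(seq[:ORDER], frame))
    for i in range(len(seq) - ORDER):
        h = seq[i:i + ORDER + 1]
        j = frame + i  # codon position of the hexamer's first base (taken modulo 3)
        score += math.log(coding.transition(h, j) / intron.transition(h, j))
    return score


# Toy usage with artificial training data.
coding = MarkovModel(["ATGGCC" * 9], periodic=True)
intron = MarkovModel(["AT" * 27], periodic=False)
exon_like = coding_potential("ATGGCCATGGCC", 0, coding, intron)
intron_like = coding_potential("ATATATATATAT", 0, coding, intron)
```

In this toy setting the first score is positive and the second negative, which is the behavior the log-likelihood ratio is meant to capture.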

The score of a potential exon, S, LE(S) defined by sites sa (start/acceptor) and sd (stop/donor) is computed as
$$ L_E(S) = L_A(s_a) + L_D(s_d) + L_M(S) \qquad (6) $$
This score can be assumed to be the log-likelihood ratio of the probability of finding such sites and sequence composition given an actual exon over the probability of finding it on a random sequence bounded by AG and GT dinucleotides. Because LM is the logarithm of the ratio of the probability of the sequence under the coding model over the probability under the noncoding model (not under a random model), LM only approximates such a log-likelihood ratio.

Assembling Genes

GeneID predicts gene structures, which can be multiple genes in both strands, as sequences of frame-compatible nonoverlapping exons. A minimum intron length of 40 bp and a minimum intergenic distance of 300 bp are enforced. If a gene structure, g, is a sequence of exons, e1, e2,...en, a natural scoring function is
$$ L_G(g) = L_E(e_1) + L_E(e_2) + \dots + L_E(e_n) \qquad (7) $$
LG(g) can be approximately interpreted as the log-likelihood ratio of the probability of the defining sites and the hexamer composition of the resulting product given a gene sequence, over this probability given a nongene sequence. In GeneID, the gene structure predicted for a given sequence is the gene maximizing LG(g), among all gene structures that can be assembled from the set of predicted exons for the sequence. Because of the number of approximations made, the simple sum of log-likelihood ratios does not necessarily produce genes with the correct number of exons (if LE is positive, the genes tend to have a large number of exons; if LE is negative, the genes tend to have a small number of exons), and the score of the exons is corrected by adding a constant, IW. Thus, given an exon, e, the actual score of e is
$$ L^{*}_{E}(e) = L_E(e) + IW \qquad (8) $$
To estimate this constant, a simple optimization procedure was performed. Genes were predicted in the training semiartificial genomic sequence for different values of IW, and the value was chosen that maximized the correlation coefficient between the actual and predicted coding nucleotides. This value was found to be IW = -7.
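
The effect of the constant IW on the assembled structure can be seen in a deliberately simplified chaining sketch. This is not the linear-time algorithm of Guigó (1998): it is a quadratic dynamic program that only enforces nonoverlap and a minimum gap between consecutive exons, and it ignores frame compatibility, exon types, strands, and the intergenic distance. The function name, the tuple layout, and the convention that an empty prediction scores 0 are assumptions.

```python
def best_chain(exons, iw=-7.0, min_gap=40):
    """Return (total score, chain) for the best-scoring chain of exons.

    exons: list of (start, end, L_E) with inclusive coordinates.
    Each exon contributes L_E + iw (Equation 8); consecutive exons must be
    separated by at least min_gap bases.
    """
    ordered = sorted(exons, key=lambda e: e[1])
    best = []  # best[k]: best total score of a chain ending with ordered[k]
    prev = []  # back-pointers used to recover the chain
    for k, (start, end, score) in enumerate(ordered):
        best_k, prev_k = score + iw, -1
        for m in range(k):
            gap = start - ordered[m][1] - 1
            if gap >= min_gap and best[m] + score + iw > best_k:
                best_k, prev_k = best[m] + score + iw, m
        best.append(best_k)
        prev.append(prev_k)
    if not best or max(best) <= 0:
        return 0.0, []  # predicting nothing is taken to score 0 (assumption)
    k = max(range(len(best)), key=best.__getitem__)
    total, chain = best[k], []
    while k != -1:
        chain.append(ordered[k])
        k = prev[k]
    return total, chain[::-1]


candidates = [(100, 200, 12.0), (150, 260, 20.0), (300, 400, 10.0), (450, 500, 3.0)]
with_penalty = best_chain(candidates, iw=-7.0)
without_penalty = best_chain(candidates, iw=0.0)
```

On this toy input the chain found with iw = 0 contains three exons, whereas with iw = -7 only the single strongest exon is kept, which mirrors the tendency described above: positive exon scores favor structures with many exons, and a negative constant counteracts it.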

    RESULTS

Training GeneID

Before settling on a Markov model of order 5, we tested two additional models of coding DNA: a codon usage model, and a model that combined a Markov model of order 1 of the translated amino acid sequence with a codon preference model (see Guigó 1999 for details on these models). In both cases, log-likelihood ratios were obtained in a similar way to the Markov model log-likelihood ratios (see Methods). For instance, in the case of the codon usage model, for each triplet s, we estimated the probability of the codon s in coding sequences, U(s), and the probability of the triplet in noncoding sequences, U0(s), and built the log-likelihood ratio
$$ LU(s) = \log \frac{U(s)}{U_{0}(s)} $$
Then, given a sequence, S, of length l in frame 0 (i.e., S1S2S3 form a codon), the coding potential of the sequence is computed as
$$ L_C(S) = \sum_{i = 1,4,7,\dots}^{l-2} LU(S_i S_{i+1} S_{i+2}) $$
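
A minimal Python sketch of the codon usage score follows. The function names are hypothetical, the add-one smoothing is an assumption, and so is the choice to count in-frame codons for the coding estimate and every overlapping triplet for the noncoding estimate; the text does not specify these details.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]


def triplet_probs(sequences, step, pseudo=1.0):
    """Smoothed triplet probabilities; step=3 reads in-frame codons, step=1 all triplets."""
    counts = Counter()
    for seq in sequences:
        for i in range(0, len(seq) - 2, step):
            counts[seq[i:i + 3]] += 1
    total = sum(counts[c] for c in CODONS) + pseudo * len(CODONS)
    return {c: (counts[c] + pseudo) / total for c in CODONS}


def codon_log_ratios(coding_seqs, noncoding_seqs):
    """LU(s) = log(U(s) / U0(s)) for every triplet s."""
    U = triplet_probs(coding_seqs, step=3)
    U0 = triplet_probs(noncoding_seqs, step=1)
    return {c: math.log(U[c] / U0[c]) for c in CODONS}


def codon_usage_score(seq, LU):
    """L_C(S): sum of LU over the codons of seq, read in frame 0."""
    return sum(LU[seq[i:i + 3]] for i in range(0, len(seq) - 2, 3))


LU = codon_log_ratios(["ATGGCCGCCGCC"], ["ATATATATATAT"])
score = codon_usage_score("GCCGCC", LU)
```
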
The models were inferred from the MR set, as the Markov model was, and tested on the MR-set sequences embedded in the large artificial genomic contig. To test the models, genes were predicted using GeneID, but exons were scored using only the scores derived under the coding DNA model (i.e., the scores from the exon-defining sites were ignored). Predictions were compared with the annotated genes, and the usual measures of accuracy were computed (Reese et al. 2000). Results are shown in Table 1. For comparison, we also show the results when only the scores of the sites are used to score the exons. As can be seen, the Markov model of order 5 produces more accurate results than the other models; it was therefore chosen for use in GeneID to predict the genes in the Adh region. As described above, GeneID scores the exons as the sum of the scores of the sites and the Markov model score. Results under this scoring schema, the one effectively used to predict genes in the Adh region, are also given in Table 1.
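
For readers unfamiliar with the nucleotide-level accuracy measures, the sketch below computes sensitivity, specificity, and the correlation coefficient between actual and predicted coding nucleotides, using the definitions that are common in the gene-prediction literature (note that specificity here is TP/(TP+FP)). The exact measures used in this paper are those of Reese et al. (2000); the function name and the input representation (one flag per nucleotide) are assumptions.

```python
import math


def nucleotide_accuracy(actual, predicted):
    """Return (sensitivity, specificity, correlation coefficient).

    actual, predicted: equal-length sequences of truthy/falsy flags,
    one per nucleotide, where a truthy value means "coding".
    """
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1
        elif p:
            fp += 1
        elif a:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else float("nan")

    sn = ratio(tp, tp + fn)  # fraction of coding nucleotides that were predicted
    sp = ratio(tp, tp + fp)  # fraction of predicted nucleotides that are coding
    denom = math.sqrt((tp + fn) * (tn + fp) * (tp + fp) * (tn + fn))
    cc = ratio(tp * tn - fn * fp, denom)
    return sn, sp, cc


actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
sn, sp, cc = nucleotide_accuracy(actual, predicted)
```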

                              
Table 1.   Testing Different Models of Coding DNA in the Training Semiartificial Genomic Sequence

Results in the Adh Region

Table 2 shows the results when GeneID, with the parameters estimated above, is used to predict genes in the Adh region. Both the results originally submitted to the Genome Annotation Assessment Project (GASP) and the results obtained with the currently available version of GeneID are given (see Discussion). In addition, we provide information on execution time and memory requirements of GeneID to analyze the Adh region. The detailed exon coordinates of the predictions by GeneID can be found at http://www1.imim.es/~gparra/GASP1.

                              
Table 2.   Accuracy of GeneID in the Adh Region

    DISCUSSION

The results presented above indicate that the current version of GeneID shows an accuracy, as measured by the GASP contest, comparable to the accuracy of the programs based on hidden Markov models (HMMs), which in GASP exhibited the highest accuracy. In favor of GeneID is the simplicity and modularity of its structure, which, as a consequence, is likely to make the program more efficient in terms of speed and memory usage. In GeneID the gene identification problem is stated as a one-dimensional chaining problem, for which more efficient algorithms may be designed than for an alignment problem, which is how gene identification is implicitly formulated in HMMs. Against GeneID is the somewhat less rigorous probabilistic treatment of the scoring schema. For instance, we are currently unable to justify the "magic number" (IW, see Methods), which needs to be added to the exon scores to obtain accurate predictions.

GeneID submitted rather poor predictions to GASP (see Table 2). Two bugs in the version of the program under development at that time were to blame. They were discovered, and a second prediction was submitted (see Table 2). After GASP we changed a rather complex scoring schema to the simpler and more natural schema described in Methods, which resulted in higher accuracy. This is the scoring schema currently in use in GeneID.

Although GeneID is currently fully functional, we are still developing it further. Our short-term plans include, among others, training GeneID to predict genes in the human and the Arabidopsis thaliana genomes and incorporating the results of database searches (both ESTs and proteins) into the GeneID prediction schema, which can be done rather naturally. The possibility of including external evidence to "force" known genes or exons into the prediction is already included in the working version of GeneID. This may be useful for reannotation of very large genomic sequences. Finally, the current structure of GeneID can be highly parallelized, and we are also working in this direction.

    ACKNOWLEDGMENTS

We thank Josep F. Abril and Moisès Burset for helpful discussions and constant encouragement. This work was supported by a grant from Plan Nacional de I+D (BIO98-0443-C02-01) from the Ministerio de Educación y Ciencia (Spain).

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.

    FOOTNOTES

1 Corresponding author.

E-MAIL rguigo{at}imim.es; FAX 34-93-221-3237.

    REFERENCES

  • Borodovsky, M. and J. McIninch. 1993. Genmark: Parallel gene recognition for both DNA strands. Comput. Chem. 17: 123-133.
  • Burge, C.B. and S. Karlin. 1998. Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 8: 346-354.
  • Claverie, J.M. 1997. Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet. 6: 1735-1744.
  • Guigó, R. 1998. Assembling genes from predicted exons in linear time with dynamic programming. J. Comput. Biol. 5: 681-702.
  • Guigó, R. 1999. DNA composition, codon usage and exon prediction. In Nucleic protein databases (ed. M. Bishop), pp. 53-80. Academic Press, San Diego, CA.
  • Guigó, R., S. Knudsen, N. Drake, and T.F. Smith. 1992. Prediction of gene structure. J. Mol. Biol. 226: 141-157.
  • Haussler, D. 1998. Computational genefinding. Trends in Biochemical Sciences, Supplementary Guide to Bioinformatics: 12-15. Trends Genet.
  • Reese, M.G., G. Hartzell, N.L. Harris, U. Ohler, and S.E. Lewis. 2000. Genome annotation assessment in Drosophila melanogaster. Genome Res. (this issue).

Received February 9, 2000; accepted in revised form February 28, 2000.


10:511-515 ©2000 by Cold Spring Harbor Laboratory Press  ISSN 1088-9051/00 $5.00
