|
|
|
|
Vol. 10, Issue 4, 483-501, April 2000
LETTER
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |
ABSTRACT |
|---|
|
|
|---|
Computational methods for automated genome annotation are critical to our community's ability to make full use of the large volume of genomic sequence being generated and released. To explore the accuracy of these automated feature prediction tools in the genomes of higher organisms, we evaluated their performance on a large, well-characterized sequence contig from the Adh region of Drosophila melanogaster. This experiment, known as the Genome Annotation Assessment Project (GASP), was launched in May 1999. Twelve groups, applying state-of-the-art tools, contributed predictions for features including gene structure, protein homologies, promoter sites, and repeat elements. We evaluated these predictions using two standards, one based on previously unreleased high-quality full-length cDNA sequences and a second based on the set of annotations generated as part of an in-depth study of the region by a group of Drosophila experts. Although these standard sets only approximate the unknown distribution of features in this region, we believe that when taken in context the results of an evaluation based on them are meaningful. The results were presented as a tutorial at the conference on Intelligent Systems in Molecular Biology (ISMB-99) in August 1999. Over 95% of the coding nucleotides in the region were correctly identified by the majority of the gene finders, and the correct intron/exon structures were predicted for >40% of the genes. Homology-based annotation techniques recognized and associated functions with almost half of the genes in the region; the remainder were only identified by the ab initio techniques. This experiment also presents the first assessment of promoter prediction techniques for a significant number of genes in a large contiguous region. We discovered that the promoter predictors' high false-positive rates make their predictions difficult to use. Integrating gene finding and cDNA/EST alignments with promoter predictions decreases the number of false-positive classifications but discovers less than one-third of the promoters in the region. We believe that by establishing standards for evaluating genomic annotations and by assessing the performance of existing automated genome annotation tools, this experiment establishes a baseline that contributes to the value of ongoing large-scale annotation projects and should guide further research in genome informatics.
| |
INTRODUCTION |
|---|
|
|
|---|
Genome annotation is a rapidly evolving field in genomics made possible by the large-scale generation of genomic sequences and driven predominantly by computational tools. The goal of the annotation process is to assign as much information as possible to the raw sequence of complete genomes with an emphasis on the location and structure of the genes. This can be accomplished by ab initio gene finding, by identifying homologies to known genes from other organisms, by the alignment of full-length or partial mRNA sequences to the genomic DNA, or through combinations of such methods. Related techniques can also be used to identify other features, such as the location of regulatory elements or repetitive sequence elements. The ultimate goal of genome annotation, the functional classification of all the identified genes, currently depends on discovering homologies to genes with known functions.
We are interested in an objective assessment of the state of the art in automated tools and techniques for annotating complete genomes. The Genome Annotation Assessment Project (GASP) was organized to formulate guidelines and accuracy standards for evaluating computational tools and to encourage the development of new models and the improvement of existing approaches through a careful assessment and comparison of the predictions made by current state-of-the-art programs.
The GASP experiment, the first of its kind, was similar in many ways to
the CASP (Critical Assessment of techniques for
protein Structure Prediction) contests for
protein structure prediction (Dunbrack et al. 1997
; Levitt 1997
; Moult
et al. 1997
, 1999
; Sippl et al. 1999
; Zemla et al. 1999
),
described at http://predictioncenter.llnl.gov. However, unlike
the CASP contest, GASP was promoted as a collaboration to evaluate
various techniques for genome annotation.
The GASP experiment consisted of the following stages: (1) Training
data for the Adh region, including 2.9 Mb of Drosophila melanogaster genomic sequence, was collected by the organizers and
provided to the participants; (2) a set of standards was developed to
evaluate submissions while the participating groups produced and
submitted their annotations for the region; and (3) the participating groups' predictions were compared with the standards, a team of independent assessors evaluated the results of the comparison, and the
results were presented as a tutorial at ISMB-99(Reese et al. 1999
).
Participants were given the finished sequence for the Adh
region and some related training data, but they did not have access to
the full-length cDNA sequences that were sequenced for the paper by
Ashburner et al. (1999b)
that describes the Adh region in
depth. The experiment was widely announced and open to any participants. Submitters were allowed to use any available technologies and were encouraged to disclose their methods. Because we were fortunate to attract a large group of participants who provided a wide
variety of annotations, we believe that our evaluation addresses the
state of art in genome annotation.
Twelve groups participated in GASP, submitting annotations in one or
more of six categories: ab initio gene finding, promoter recognition,
EST/cDNA alignment, protein similarity, repetitive sequence
identification, and gene function. Table 1 lists each participating group, the names of the programs or systems it used, and
which of the six classes of annotations it submitted (see enclosed
poster in this issue for a graphic overview of all the groups'
results). Additional papers in this issue are written by the
participants themselves and describe their methods and results in detail.
|
It should be noted that the lack of a standard that is absolutely
correct makes evaluating predictions problematic. The expert annotations described by the Drosophila experts (Ashburner et al. 1999b
) are our best available resource, but their accuracy will
certainly improve as more data becomes available. At best, the data we
had in hand is representative of the true situation, and our
conclusions would be unchanged by using a more complete data set. At
worst, there is a bias in the available data that makes our conclusions
significantly misleading. We believe that the data is not unreasonable
and that conclusions based on it are correct enough to be valuable as
the basis for discussion and future development. We do not believe that
the values for the various statistics introduced below are precisely
what they would be using the extra information, and we emphasize that
they should always be considered in the context of this particular annotated data set [for a further detailed discussion of evaluating these predictions, see Birney and Durbin (2000)
].
In the next section we describe the target genomic sequence and the
auxiliary data, including a critical discussion of our standard sets.
Methods gives a short description of existing annotation methods that
complements other papers in this issue, including a review article of
existing gene-finding methods by Stormo (2000)
and papers describing
the methods used by the individual participants. Results assesses the
individual annotation methods and the Conclusions discusses what the
experiment revealed about issues involved in annotating complete
genomes. An article by Ashburner (2000)
provides a biological
perspective on the experiment.
Data: The Benchmark Sequence: The Adh Region in D. melanogaster
The selection of a genomic target region for assessing the accuracy of computational genome annotation methods was a difficult task for several reasons: The genomic region had to be large enough, the organism had to be well studied, and enough auxiliary data had to be available to have a good experimentally verified "correct answer," but the data should be anonymous so that a blind test would be possible. The Adh region of the D. melanogaster genome met these criteria. D. melanogaster is one of the most important model organisms, and although the Adh region had been extensively studied, the best gene annotations and cDNAs for the region were not published until after the conclusion of the GASP experiment. The 2.9 Mb Adh contig was large enough to be challenging, contained genes with a variety of sizes and structures, and included regions of high and low gene density. It was not a completely blind test, however, because several cDNA and genomic sequences for known genes in the region were available prior to the experiment.
Genomic DNA Sequence
The contiguous genomic sequence of the Adh region in the D. melanogaster genome spans nearly 3 Mb and has been sequenced from a series of overlapping P1 and BAC clones as a part of the Berkeley Drosophila Genome Project (BDGP; Rubin et al. 1999Curated Training Sequences
We provided several D. melanogaster-specific data sets to the GASP participants. This enabled participants to tune their tools for Drosophila and facilitated a comparison of the various approaches that was unbiased by organism-specific factors. The following curated sequence sets, extracted from Flybase and EMBL (provided by the EDGP at Cambridge and provided by the BDGP, were made available and can be found at http://www.fruitfly.org/GASP/data/data.html): (1) A set of complete coding sequences (start to stop codon), excluding transposable elements, pseudogenes, noncoding RNAs, and mitochondrial and viral sequences (2122 entries); (2) nonredundant set of repetitive sequences, not including transposable elements (96 entries); (3) transposon sequences, containing only the longest sequence of each transposon family and excluding defective transposable elements (44 entries); (4) genomic DNA data from 275 multi- and 141 single-exon nonredundant genes together with their start and stop codons and splice sites, taken from GenBank version 109; (5) a set of 256 unrelated promoter regions, taken from the Eukaryotic Promoter Database (EPD; Cavin Périer et al. 1999Resources for Assessing Predictions: The "Correct" Answer
In a comparative study, the gold standard used to evaluate solutions is the most important factor in determining the usefulness of the study's results. For the results to be meaningful, the standard must be appropriate and correct in the eyes of the study's audience. Because our goal was to evaluate tools that predict genes and gene structure in complex eukaryotic organisms, we drew our standard from a complex eukaryotic model organism, choosing to work with a 2.9-Mb sequence contig from the Adh region of D. melanogaster. Comparing predicted annotations in such a region is only consequential if the standard is believed to be correct, if that correctness has been established by techniques that are independent of the approaches being studied, and if the predictors had no prior knowledge of the standard. Ideally, it would contain the correct structure of all the genes in the region without any extraneous annotations. Unfortunately, such a set is impossible to obtain because the underlying biology is incompletely understood. We built a two-part approximation to the perfect data set, taking advantage of data from the BDGP cDNA sequencing project (http://www.fruitfly.org/EST) and a Drosophila community effort to build a set of curated annotations for this region (Ashburner et al. 1999b
0.35 threshold, which gives a 98%
true positive rate, and 3' splice sites
0.25, which gives
a 92% true positive rate) using a neural network splice site predictor
trained on D. melanogaster data (Reese et al. 1997Data Exchange Format
One of the challenges of a gene annotation study is finding a common format in which to express the various groups' predictions. The format must be simple enough that all of the groups involved can adapt their software to use it and still be rich enough to express the various annotations. We found that the General Feature Format (GFF) (formerly known as the Gene Feature Finding format) was an excellent fit to our needs. The GFF format is an extension of a simple name, start, end record that includes some additional information about the sequence being annotated: the source of the feature; the type of feature; the location of the feature in the sequence; and a score, strand, and frame for the feature. It has an optional ninth field that can be used to group multiple predictions into single annotations. More information can be found at the GFF web site: http://www.sanger.ac.uk/Software/formats/GFF/. Our evaluation tools used a GFF parser for the PERL programming language that is also available at the GFF web site. We found that it was necessary to specify a standard set of feature names within the GFF format, for instance, declaring that submitters should describe coding exons with the feature name CDS. We produced a small set of example files (accessible from the GASP web site) that we distributed to the submitters and were pleased with how easily we were able to work with their results.| |
METHODS |
|---|
|
|
|---|
Genome annotation is an ongoing effort to assign functional features to locations on the genomic DNA sequence. Traditionally, most of these annotations record information about an organism's genes, including protein-coding regions, RNA genes, promoters, and other gene regulatory elements, as well as gene function. In addition to these gene features, the following general genome structure features are also commonly annotated: repetitive elements and general A, C, G, T content measures (e.g., isochores).
Genome Annotation Classes
Although the GASP experiment invited and encouraged any class of annotations, most submissions were for gene-related features, emphasizing ab initio gene predictions and promoter predictions. In addition, two groups submitted functional protein domain annotations, and two groups submitted repeat element annotations. In the sections that follow, we categorize and discuss the submitted predictions.
Gene Finding
Protein coding region identification is a major focus of computational biology. A separate article in this issue (Stormo 2000
predominantly coding bias, coding preference, and consensus sequences for start codon, splice sites, and
stop codons
only two groups used protein similarity information or
promoter information to predict gene structure. More than half of the
groups incorporated sequence information from cDNA sequences. In
general, state-of-the-art gene prediction systems use complex models
that integrate multiple gene features into a unified model.
|
Promoter Prediction
The complicated nature of the transcription initiation process makes computational promoter recognition a hard problem. We define promoter prediction as the identification of TSSs of protein coding genes that are transcribed by eukaryotic RNA polymerase II. A detailed description of the structure of promoter regions and existing promoter prediction systems is beyond the scope of this paper. Fickett and Hatzigeorgiou (1997)Repeat Finders
Detecting repeated elements plays a very important role in modeling the three-dimensional structure of a DNA molecule, specifically, the packing of the DNA in the cell nucleus. It is believed that the packing of the DNA around the nucleosome is correlated with the global sequence structure produced predominantly by repetitive elements. Repeats also play a major role in evolution (for review, see Jurka 1998Protein Homology Annotation
Homologies to gene sequences from other organisms can often be used to identify protein-coding regions in anonymous genomic sequence. In addition to the location, it is often possible to infer the function of the predicted gene based on the function of the homologous gene in the other organism or of a known structural and functional protein element in the gene. Whereas the tools in the gene prediction category and the EST/cDNA alignment category are usually intended to determine the exact structure of a gene, the protein homology-based tools are usually optimized to find conserved parts of the sequence without worrying about the exact gene structure. Traditionally, this area of genome annotations has been dominated by the suite of local alignment search tools of BLAST (Altschul et al. 1990EST/cDNA Alignment
Computational predictions of gene location and structure go hand in hand with EST/cDNA sequencing and alignment techniques for building transcript annotations in genomic sequence. Either can be used as a discovery tool, with the other held in reserve for verification. A researcher can verify the existence and structure of predicted genes by sequencing the corresponding mRNA molecules and aligning their sequences to the original genomic sequence. Alternatively, one can start with an EST or cDNA sequence and build an alignment to the genomic sequence that has been guided and/or verified by tools from the gene prediction arsenal, for example, using likely splice site locations and checking for long ORFs and potential frame shifts. There are many tools for aligning sequences. Although they have generally been specialized for aligning sequences that are evolutionarily related, some are designed for niche applications such as recognizing overlaps among sequencing runs. Aligning EST/cDNA sequences to the original genomic sequence also presents a unique set of tradeoffs and issues. In some cases (interspecies EST/genomic alignments), these tools must model evolutionary changes in the sequence. Sometimes (e.g., for low-quality EST sequences), they need to model errors in the sequence generated by the sequencing process. For multiexon genes, they need to model the intron regions as cost-free gaps tied to a model for recognizing splice sites. Several tools have been developed for this task: Mott (1997)Gene Function
Gene function predictions are the most difficult annotations to produce and to evaluate. Current technologies use similarity to proteins (or protein domains) with known function to predict functional domains in genomic sequence. Although some tools use simple sequence alignments, more powerful tools have developed significantly more sensitive models. It quickly became apparent that a consistent and correct assessment of function predictions as part of the GASP experiment was not possible because of the incomplete understanding of the protein products encoded by the 222 genes in the Adh region.Evaluating Gene Predictions
An ideal gene prediction tool would produce annotations that were exactly correct and entirely complete. The fact that no existing tool has these characteristics reflects our incomplete understanding of the underlying biology as well as the difficulty to build adequate gene models in a computer. Although no tool is perfect, each tool has particular strengths and weaknesses, and any performance evaluation should be in the context of an intended use. For example, researchers who are interested in identifying gene-rich regions of a genome for sequencing would be happy with a tool that successfully recognizes a gene's approximate location, even if it incorrectly described splice site boundaries. On the other hand, someone trying to predict protein structures is more interested in getting a gene's structure exactly right than in a tool's ability to predict every gene in the genome.
When assessing the accuracy of predictions, each prediction falls into
one of four categories. A true-positive (TP) prediction is one that
correctly predicts the presence of a feature. A false-positive (FP)
prediction incorrectly predicts the presence of a feature. A
true-negative (TN) prediction is correct in not predicting the presence
of a feature when it isn't there. A false-negative (FN) prediction
fails to predict the existence of a feature that actually exists. The
sensitivity (Sn) of a tool is defined as TP / (TP + FN) and can be
thought of as a measure of how successful the tool is at finding things
that are really there. The specificity (Sp) of a tool is defined as
TP / (TP + FP) and can be thought of as a measure of how careful a
tool is about not predicting things that aren't really there. Burset
and Guigó (1996)
also use a correlation coefficient and an
average correlation coefficient. We chose not to use these measures
because they depend on predictors' TN information, and we recognize
that our evaluation sets were constructed in such a way that the TN
information is not trustworthy. These Sn and Sp metrics are used for
evaluating the submissions in the gene-finding, promoter recognition,
and gene identification using protein homology categories. In the gene
finding category, they are used for all three levels: base level, exon
level, and gene level. In the protein homology category, they are used
for base level and gene level only.
In one of the first reviews of gene prediction accuracy, Fickett and
Tung (1992)
developed a method that measured predictors' ability to
correctly recognize coding regions in genomic sequence. They used their
method to compare published techniques and concluded that in-frame
hexamer counts were the most accurate measure of a region's coding
potential. Burset and Guigó (1996)
recognized that there are a
wide variety of uses for gene predictions and developed
measures
including base level, exon level, and gene level Sp and
Sn
that describe a predictor's suitability for a particular task.
Base Level
The base level score measures whether a predictor is able to correctly label a base in the genomic sequence as being part of some gene. It rewards predictors that get the broad sweeps of a gene correct, even if they don't get the details such as the splice site boundaries entirely correct. It penalizes predictors that miss a significant portion of the coding sequence, even if they get the details correct for the genes they do predict. We used the Sn and Sp measures defined above as the measures of success in this category.Exon Level
Exon level scores measure whether a predictor is able to identify exons and correctly recognize their boundaries. Being off by a single base at either end of the exon makes the prediction incorrect. Because we only considered coding exons in our assessment, the first exon is bracketed by the start codon and a 5' splice site, the last exon is bracketed by a 3' splice site and the stop codon, and the interior exons are bracketed by a pair of splice sites. As measures of success in this category, we used two statistics in addition to Sn and Sp. The missed exon (ME) score is a measure of how frequently a predictor completely failed to identify an exon (no prediction overlap at all), whereas the wrong exon (WE) score is a measure of how frequently a predictor identifies an exon that has no overlap with any exon in the standard sets. The ME score is the percentage of exons in the standard set for which there were no overlapping exons in the predicted set. Similarly, the WE score is the percentage of exons in the predicted set for which there were no overlapping exons in the standard set.Gene Level
Gene level Sn and Sp measure whether a predictor is able to correctly identify and assemble all of a gene's exons. For a prediction to be counted as a TP, all of the coding exons must be identified, every intron-exon boundary must be exactly correct, and all of the exons must be included in the proper gene. This is a very strict measure that addresses a tool's ability to perfectly identify a gene. In addition to the Sn and Sp measures based on absolute accuracy, we used the missed genes (MG) score as a measure of how frequently a predictor completely missed a gene (a standard gene is considered missed if none of its exons are overlapped by a predicted coding gene) and the wrong genes (WG) score as a measure of how frequently a predictor incorrectly identified a gene (a prediction is considered wrong if none of its exons are overlapped by a gene from the standard set).Split and Joined Genes
The exon level scores discussed above measure how well a predictor recognizes exons and gets their boundaries exactly correct. The gene level scores measure how well a predictor can recognize exons and assemble them into complete genes. Neither of these scores directly measures a predictor's tendency to incorrectly assemble a set of predicted exons into more or fewer genes than it should. We developed two new measures, split genes (SG) and joined genes (JG), which describe how frequently a predictor incorrectly splits a gene's exons into multiple genes and how frequently a predictor incorrectly assembles multiple genes' exons into a single gene. Because the coverage of the std1 data set is so incomplete, we have only included SG and JG scores from the comparison with std3. A gene from the standard set is considered split if it overlaps more than one predicted gene. Similarly, a predicted gene is considered joined if it overlaps more than one gene in the standard set. The SG measure is defined as the sum of the number of predicted genes that overlap each standard gene divided by the number of standard genes that were split. Similarly, the JG measure is the sum of the number of standard genes that overlap each predicted gene divided by the number of predicted genes that were joined. A score of 1 is perfect and means that all of the genes from one set overlap exactly one gene from the other set.Application of These Measures to Correct Answer Data Sets std1/std3
We built the std1 data set in such a way that we believe it is correct in the details of the genes that it describes, though we know that it only includes a small portion of the genes in the region. The std3 data set, on the other hand, is as complete as was possible but does not have rigorous independent evidence for all of its annotations. For the std1 data set, we believe that the TP count (it was predicted, and it exists in the standard) and FN count (it was not predicted, but it does exist in the standard) are reliable because of the confidence that we have in the correctness of the predictions in the set. On the other hand, we do not believe that the TN count (it was not predicted, and it is not in the standard set) and FP count (it was predicted, but is not in the standard set) are reliable because they both assume that the standard correctly describes the absence of a feature and we know that there are genes missing from std1. It follows that we believe that Sn is meaningful for std1 because it only depends on TP and FN but that we are less confident about the Sp score because it depends on TP and FP. A similar logic applies to the std3 data set, where our confidence in the set's completeness but not its fine details suggests that the TP and FP scores are usable but that the TN and FN scores are not. This means that for std3, we believe that the Sp measure can be used to describe a predictor's performance but that Sn is likely to be misleading.Evaluation of Promoter Predictions
We adopted the measures proposed by Fickett and Hatzigeorgiou
(1997)
. They evaluated the success of promoter predictions by giving
the percentage of correctly identified TSSs versus the FP rate. A TSS
is regarded as identified if a program makes one or more predictions
within a certain "likely" region around the annotated site. The FP
rate is defined as the number of predictions within the "unlikely"
regions outside the likely regions divided by the total number of bases
contained in the unlikely set. As our annotation of the TSS is only
preliminary and not experimentally confirmed, we chose a rather large
region of 500 bases upstream and 50 bases downstream of the annotated
TSS as the likely region. The upstream region is always taken as the
likely region, even if it overlaps with a neighboring gene annotation
on the same strand. The unlikely region for each gene then consists of
the rest of the gene annotation, from base 51 downstream of the TSS to
the end of the final exon.
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Visualization of the Annotations
Generating "good" annotations generally requires integrating multiple sources of information, such as the results of various sequence analysis tools plus supporting biological information. Visualization tools that display sequence annotations in a browsable graphical framework make this process much more efficient. In this experiment we found that visualization tools are essential to evaluate the genome annotation submissions. When annotations are displayed visually, overall trends become apparent, for example, gene-rich versus gene-poor regions, genes that were predicted by most participants versus those that were predicted by few. Additionally, as we discuss below, a visualization tool that is capable of displaying annotations at multiple levels of detail provides a way to examine individual predictions in detail.
Building genome annotation visualization tools is a daunting task. Many
such tools have been developed, starting with ACeDB (Eeckman and Durbin
1995
; Stein and Thierry-Mieg 1998
). We were fortunate in that the BDGP
has built a flexible suite of genome visualization tools (Helt et al.
1999
) that could be extended to display the GASP submissions. We
adapted the BDGP's annotated clone display and editing tool,
CloneCurator (Harris et al. 1999
), which is based on a
genomic visualization toolkit (Helt et al. 1999
), to read the
annotation submissions in GFF format and display each
team's predictions in a unique color and location.
CloneCurator (see Fig. 1) displays features on a sequence as colored rectangles. Features on the forward strand appear above the axis, whereas those on the reverse strand appear below the axis. The display can be zoomed and scrolled to view areas of interest in more detail. A configuration file identifies the feature types that are to be displayed and assigns colors and offsets to each one. For example, the std1 and std3 exons appear in yellow and orange close to the central axis.
|
| |
RESULTS |
|---|
|
|
|---|
The results of an experiment such as GASP are only meaningful if
enough groups participate. We were fortunate to have 12 diverse groups
involved, and we were very grateful for the speed with which they were
able to submit their predictions. We believe that these 12 groups
provide a fair representation of the state of the art in annotation
system technology. We collected submissions by electronic mail and
evaluated them using the std1 and std3 data sets as described above.
Before releasing our results at the Intelligent Systems in Molecular
Biology conference in August 1999 in Heidelberg, Germany, we assembled
a team of independent assessors (Ashburner et al. 1999a
) to review our
techniques and conclusions. As discussed in the introduction, the
accuracy of the various measures discussed below depends heavily on how
well our standard sets capture the true set of features in the region. These values should only be considered in the context of the standard data sets.
A detailed description of the results and the evaluation techniques we used can be accessed through the GASP homepage at http://www.fruitfly.org/GASP/.
Gene Finding
Table 3 summarizes the performance of the
gene-finding tools using the measures defined above. Three groups
submitted multiple submissions. The first group, Fgenes1,
Fgenes2, and Fgenes3, submitted three
predictions at varying stringency (for details, see Salamov and
Solovyev 2000
). For the GeneID program, two submitted
versions are presented, version 1 (GeneID v1) being the
original submission and version 2 (GeneID v2) being a
newer submission from a corrected version of the original program (for
details, see Parra et al. 2000
). The third group with multiple
submissions used three versions of the Genie program: the
first a pure statistical approach (Genie), the second
including EST alignment information (GenieEST), and the
third using protein homology information (GenieESTHOM)
(for details, see Reese et al. 2000
). For all other groups from Table
2, only one submission was evaluated. The following sections discuss
the base level, exon level, and gene level performance of these
submissions.
|
Base Level Results
Several gene prediction tools had a Sn of >0.95 at the base level. This suggests that current technology is able to correctly identify >95% of the D. melanogaster proteome. A few tools demonstrated a specificity of >0.90 at the base level, only infrequently labeling a noncoding base as coding. Generally, the tools have a higher Sn than Sp. Two programs, Fgenes2 and GeneID, were designed to be conservative about their predictions and do not follow this trend.Exon Level Results
There was a great deal of variability in the exon level scores. Several tools had Sn scores ~0.75, correctly identifying both exon boundaries ~75% of the time. Their Sps were generally much lower (the highest was 0.68), probably a reflection of the strict definition of exon level scores both splice sites had to be predicted correctly and possible inaccuracies in the std3 data set. The low ME scores (several <0.05) combined with the fairly high Sn suggest that several tools were successful at identifying exons but had trouble finding the correct exon boundaries. Programs that incorporate EST alignment information, such as GenieEST and HMMGene, had sensitivity scores that were up to 10% better than the other tools. The high WE scores suggest either that the tools are overpredicting or that there are genes that are missing even from std3.Gene Level Results
All of the predictors had considerable difficulty correctly assembling complete genes. The best tools were able to achieve Sns between 0.33 and 0.44, meaning that they are incorrect over half of the time. This value seems to be very similar in Drosophila and human sequences, based on a recent analysis of the BRCA2 region in human (T.J. Hubbard, pers. comm.). Even on the more complete std3 data set, the programs tended to incorrectly predict many genes. The very low MG score (as low as 4.6%) is reassuring because it suggests that several tools are able to recognize a gene, even if they have difficulty figuring out the exact details of its structure. Comparing the WG and MG measures suggests that existing tools tend to predict genes that do not exist more often than they miss genes that do exist. Because it is almost certain that there are real genes that are missing from both standard sets, this conclusion must be viewed with some skepticism. Although there were several tools with good SG or JG scores, none of them performed well in both categories.Promoter Prediction
Table 4 shows the performance of the promoter
prediction systems, grouped by approach: search-by-signal,
search-by-region, and gene prediction programs.
|
Gene-finding programs that include a prediction of the TSS obtained the best results. The number of false predictions made by the region-based programs is very high (giving them a low Sp), and because the signal-specific programs only identify one promoter, their Sn is very low. The high Sp of the gene finders is obviously due to the context information: All promoter predictions within gene predictions are ruled out in advance, and the location of the possible start codon provides the system with a good initial guess of where to look for a promoter. The MAGPIE system also uses EST alignments to obtain information on 5' UTRs, which mirrors the way the std sets were constructed: Roughly one-third of the putative TSS assignments rely on cDNAs that were publicly available in GenBank. A closer look at the results reveals that the region-based programs have a Sn that is comparable with the gene finders and the signal based program had only a single FP, showing that both types of tools can be used for different applications.
Our data set, and the evaluation based on it, relies on the assumption that the 5' ends of the full-length cDNAs are reasonably close to the TSS. This makes it very hard to draw strong conclusions from the presented results. Even the most sensitive systems could identify only roughly one third of the start sites. This could of course be caused by the fact that the existing annotation is only an approximation and some of the true TSSs may be located further upstream. It also hints at the diversity of promoter regions that mirrors the possibilities for gene regulation and at the existing bias toward housekeeping genes in the current data sets used for the training of the models.
Gene Identification Using Protein Homology
Gene-finding evaluation statistics, such as those described above,
can be used to summarize the ability of a program to identify complete
and accurate gene structures in genomic DNA. In Table 5 we have applied the same evaluation statistics to
the homology-based search programs GeneWise and
BLOCKS+. Because these programs are not optimized to deal
with exact exon boundary assignments, Table 5 only shows the
performance for the base level and the MG and WG.
|
The very low Sns at the base level are not surprising, because the
programs identify only conserved protein motifs or particular domains
and make no effort to predict complete genes. Sp, which should be high
given that only conserved protein motifs are scored, was lower than
expected. Detailed studies of these predictions (see Birney and Durbin
2000
; Henikoff and Henikoff 2000
) show that most of the FP predictions
were hits to transposable elements or to possible genes that are
missing in the standard sets. Both programs use a database of protein
domains or conserved protein motifs. Both databases are large and are
believed to contain at least 50% of the existing protein domains. The
high number of MG, 62.7% for BLOCKS and 69.7% for
GeneWise, means that these programs will miss a
significant number of Drosophila genes when used to search
genomic DNA directly. The WG scores of 12.9% BLOCKS and
14.1% for GeneWise are lower than the gene finding
programs discussed in the previous section.
Gene Identification Using EST/cDNA Alignments
It is believed that some cDNA information exists for approximately half of the genes in the D. melanogaster genome. This cDNA database (available as the EST data set at the GASP web site) was used as a basis for the cDNA/EST alignment category. The Sn of 31% for MAGPIEEST and GrailSimilarity (Table 5) implies that the coding portion of the available EST data currently covers one-third of the genome's coding sequence. The low Sp is very surprising and suggests that the EST/cDNA alignment problem is not a trivial one. The only program that tried to align complete cDNAs to genomic DNA, MAGPIEcDNA, could find complete cDNAs for only 2.4% of the genes. EST alignments also resulted in high numbers of missed genes, suggesting that the EST libraries are biased toward highly expressed genes. The high WG scores suggest that some genes are missing even from std3.
Selected Gene Annotations
The summary statistics discussed above only provide a global view of the predicting programs' characteristics. A much better understanding of how the various approaches behave can be obtained by looking at individual gene annotations. Such a detailed examination can also help identify issues that are not addressed by current systems.
In the following paragraphs we will discuss a few interesting examples.
Figure 1 shows the color codes of the participating groups that are
used throughout this section. Genes located on the top of each map are
transcribed from distal to proximal (with respect to the telomere of
chromosome arm 2L); those on the bottom are transcribed from proximal
to distal. std1 and std3 are the expert annotations described in
Ashburner et al.(1999b)
. Just below the axis, you can see the
annotations for the two repeat finding programs. These have no sequence
orientation and are therefore only shown on one side. Farther away from
the axis, after std1 and std3, we grouped all of the ab initio
gene-finding programs together. Next to the gene finders are the
homology-based annotations. On the bottom and the top of the figure we
show the three promoter annotations, but for clarity we did not include
these annotations in the subsequent figures. (On the front page and in
the legend of Fig. 1, you can see the full set of annotations of all
programs, which are also accessible from the GASP web site.)
Our first example is a "busy" region with 12 complete genes and 1 partial gene in a stretch of only 40 kb (Fig. 2A). This region is located at the 3' end of the Adh region from base 2,735,000 to base 2,775,000. Genes exist on both strands, and it is striking that in this region the genes tend to alternate between the forward and the reverse strands. We selected this region for its gene density and because it has characteristics that are typical of the complete Adh region. Figure 2A vividly demonstrates that all of the gene-finding programs' predictions are highly correlated with the annotated genes from std1/std3. In the past, gene finders had often mistakenly predicted a gene on the noncoding strand opposite of a real gene, leading to FP predictions known as "shadow exons." Figure 2A makes it clear that gene finders have overcome this problem, because there are almost no shadow exon predictions for any of the genes in std3. Another characteristic, captured in the high base level sensitivity and the low missing genes statistics, is that every gene in the std3 set was predicted by at least a few groups and that most of these predictions agree with each other. Except for the second and third genes [DS02740.5, I(2)35Fb] on the forward strand (2,740,000-2,745,000), which seem to be single exon genes, all of the genes in this region are multiexon genes with between two and eight exons. The exon size varies widely. There are genes that consist of only two large exons, some that consist of a mix of large and small exons, and some that are made up exclusively of many small exons. The distribution seems to be almost random. Except for the long final intron in the last gene on the reverse strand (cact), the region consists exclusively of short introns.
|
Predictions on the reverse strand indicate a possible gene from base
2,741,000 to base 2,745,000. Most of the gene finders agree on this
prediction, but neither std1 nor std3 describes a gene at this
location. This could be a real gene that was missed by the expert
annotation pathway described in Ashburner et al (1999b)
. Neither
BLOCKS+ nor GeneWise found any homologies in
this region, but we can see from the table in the previous section that
many real genes do not have any homology annotations. Interestingly,
this is the only area in the region where two gene finders predicted a
possible gene that likely consists of shadow exons.
The fifth gene on the forward strand (DS02740.10, bases 2,752,500-2,755,000) shows that long genes with multiple exons are much harder to predict than single exon genes or genes with only a few exons. In this region splitting and joining genes does not seem to be a problem. Repeats occur sparsely and mostly in noncoding regions, predominantly in introns.
In contrast to the busy region in Figure 2A, Figure 2B highlights a region of almost equal size in which only one gene (DS01759.1) is present in both std1 and std3. There are very few FP predictions by any group, but there is one case where the "false" predictions by different programs are located at very similar positions (on the reverse strand near base 620,000). This suggests a real gene that is missing from both standard sets.
Figure 3, A-D, depicts selected genes that
illustrate some interesting challenges in gene finding. Figure 3A shows
the Adh and the Adhr genes that occur as gene
duplicates. The encoded proteins have a sequence identity of 33%. The
positions of the two introns interrupting the coding regions are
conserved and give additional evidence to tandem duplication. Both
genes are under the control of the same regulatory promoter, the
Adhr gene does not have a TSS of its own, and its transcript
is always found as part of an Adh-Adhr dicistronic mRNA. Gene
duplications occur very frequently in the Drosophila
genome
estimates show that at least 20% of all genes occur in gene
family duplications. In an additional twist, Adh and
Adhr are located within an intron of another gene,
outspread (osp), that is found on the opposite strand (for details, see Fig. 3B). The Adh gene is correctly
predicted by most of the programs, although one erroneously predicts an additional first exon. Most of the programs also predict the structure of Adhr correctly; one program misses the initial exon and
shortens the second exon. Both Adh and Adhr show hits
to the protein motifs in BLOCKS+ as well as alignments to
a PFAM protein domain family through GeneWise. Both genes
hit two different PFAM families, and the order of these two domains is
conserved in the gene structure.
|
Figure 3B highlights the osp gene region. This is an example
of a gene with exceptionally long (>20 kb) introns, making it hard
for any gene finder to predict the entire structure correctly. In
addition, there are a number of smaller genes [including the Adh and Adhr genes discussed above,
DS09219.1 (r.) and DS07721.1 (f.)] within the
introns of osp. No current gene finder includes overlapping
gene structures in its model; as a consequence, none of the GASP gene
finders were able to predict the osp structure without
disruption. This is clearly a shortcoming of the programs because genes
containing other genes are often observed in Drosophila (Ashburner et al. 1999b
report 17 cases for the Adh region).
However, it should be noted that most of the gene finders predict the
3' end of osp correctly and therefore get most of the
coding region right. The region that includes the 5' end of
osp shows a lot of gene prediction activity, but there is no
consistency among the predictions. One program
(FGenesCCG3) does correctly predict the DS09219.1 gene.
Figure 3C shows the entire gene structure of the Ca-
1D
gene. This gene is the most complex gene in the Adh region,
with >30 exons. This is a very good example for studying gene
splitting. Several predictors break the gene up into several genes, but
some groups make surprisingly close predictions. This shows the complex structure that genes can exhibit and that extent to which this complexity has been captured in the state-of-the-art prediction models.
It is interesting to note that most of the larger exons are predicted,
whereas the shorter exons are missed. Such a large complex gene is a
good candidate for alternative splicing, which can ultimately be
detected only by extensive cDNA sequencing.
Figure 3D shows the triple duplication of the idgf gene (idgf1, idgf2, and idgf3) on the forward strand. Two programs mistakenly join the first two genes into a single gene; all the others correctly predict all three