Vol. 11, Issue 9, 1574-1583, September 2001
METHODS
ABSTRACT
Conventional methods of gene prediction rely on the recognition of DNA-sequence signals, the coding potential or the comparison of a genomic sequence with a cDNA, EST, or protein database. Reasons for limited accuracy in many circumstances are species-specific training and the incompleteness of reference databases. Lately, comparative genome analysis has attracted increasing attention. Several analysis tools that are based on human/mouse comparisons are already available. Here, we present a program for the prediction of protein-coding genes, termed SGP-1 (Syntenic Gene Prediction), which is based on the similarity of homologous genomic sequences. In contrast to most existing tools, the accuracy of SGP-1 depends little on species-specific properties such as codon usage or the nucleotide distribution. SGP-1 may therefore be applied to nonstandard model organisms in vertebrates as well as in plants, without the need for extensive parameter training. In addition to predicting genes in large-scale genomic sequences, the program may be useful to validate gene structure annotations from databases. To this end, SGP-1 output also contains comparisons between predicted and annotated gene structures in HTML format. The program can be accessed via a Web server at http://soft.ice.mpg.de/sgp-1. The source code, written in ANSI C, is available on request from the authors.
INTRODUCTION
Given homologous genomic sequences from two
species, their local alignment usually shows a patchwork pattern of
conserved and less conserved segments. Generally, coding sequences tend to be more conserved than noncoding sequences. Highly conserved fragments may sometimes also be attributed to gene regulatory (Hardison et al. 1997) or other DNA elements, such as clade-specific repeats.
Although the level of sequence similarity depends strongly on the
evolutionary distance of the compared species, recent studies (Roest Crollius et al. 2000; Wiehe et al. 2000; R. Guigó, L. Duret, and T. Wiehe, unpubl.) suggest that homology-based gene prediction can be very
reliable over a fairly wide spectrum of species and evolutionary
divergence times. Gene prediction has received considerable attention
from computer scientists and biologists during the past decade (for
reviews, see Burge and Karlin 1998; Claverie 1998). These efforts have
led to considerable progress, but the problem is still far from a
satisfactory solution (Guigó et al. 2000). Current methods can be
roughly grouped into two main categories, ab initio and homology-based
methods. The former methods recognize signals or compositional features
in a single input sequence by pattern-matching, probabilistic, or
statistical methods. An example is Genscan (Burge and Karlin 1997). The homology-based methods use external information such
as comparison of the query sequence with protein, EST, or cDNA
databases. Examples are BLASTX (Altschul et al. 1990) and
the more sophisticated spliced alignment algorithms
Procrustes (Gelfand et al. 1996) or GeneWise (www.sanger.ac.uk/Software/Wise2). Lately, gene prediction tools have
become available that infer gene structures from alignments of
anonymous genomic sequences and the resulting pattern of conserved segments. For instance, ExoFish (Roest Crollius et al. 2000) predicts human exons by comparison with a database of random sequences from Tetraodon nigroviridis. Bafna and Huson (2000) and Batzoglou et al. (2000) have developed programs for gene prediction by pairwise comparison of human and mouse homologous sequences. SGP-1 is a similar gene prediction tool, but it is not species specific. It is designed for large-scale genomic sequences, such as complete bacterial artificial chromosomes (BACs), of vertebrates and plants.
Today we are in a situation where analysis programs need to cope with low sequence quality of published draft genomes. Whereas classical tools are sensitive to sequence quality, similarity-based tools are much less so, because similarity levels are less affected by sequencing or assembly errors. Rather, such errors may even be rectified with the help of sequence comparison programs.
For gene prediction in SGP-1, two lines of reasoning are
combined (Fig. 1). First, a pairwise local
alignment is computed. This may be done either on a DNA level (e.g.,
with BLASTN or SIM96 [Huang and Miller 1991]) or on an amino acid level (e.g., with TBLASTX [Altschul et al. 1990]). The evolutionary distance of the compared
sequences is an important criterion to choose between the methods.
Generally, we obtained better results with DNA-based alignments for
closely related sequences and with amino acid-based alignments for
distantly related sequences (Wiehe et al. 2000). If computation speed
is a main concern, a BLAST-like alignment is preferable
over a dynamic programming algorithm such as SIM96. In any
case, a postprocessing step to reduce noise may be applied to the
resulting local alignment. Second, for both sequences we generate
separate lists of potential exons, termed precandidates. A
subroutine called filter retains only those precandidates
that are compatible with the alignment. Subsequently, exons are
rescored and then assembled into a gene model.
Because the two tasks, computation of the alignment and gene prediction, are separated from each other, SGP-1 can work with the (possibly reformatted) output of an arbitrary alignment program. Furthermore, run time of the program can be significantly reduced, if a precomputed alignment is provided as input.
We applied SGP-1 to several sets of homologous sequence
pairs from vertebrates and from plants. Unlike ProGen (Novichkov et al. 2000) or ExoFish (Roest Crollius et al. 2000), SGP-1 emphasizes similarity and is therefore particularly suited for well-conserved genes or sufficiently closely related species. For instance, good results can be obtained for species
that are evolutionarily at least as close as Homo sapiens and
Gallus gallus (R. Guigó, L. Duret, and T. Wiehe, unpubl.). Finally, we show how SGP-1 may be useful for verifying gene structure annotations.
RESULTS
Algorithm
Gene prediction with SGP-1 proceeds in two separate
steps, calculation of a pairwise alignment and processing of sequence
and alignment files. This modularity makes the tool very fast and
versatile: given a suitable format-conversion tool, SGP-1 may be combined with any pairwise alignment program. We successfully ran SGP-1 on alignments produced by SIM96 (Huang and Miller 1991), BLASTN, TBLASTX (Altschul et al. 1990), BLASTZ (W. Miller, pers. comm.),
and MUMMER (Delcher et al. 1999). Given two sequences and
their alignment as input, the program calls subroutines for (1)
alignment postprocessing, (2) generating exon precandidates, (3)
filtering, (4) rescoring, and (5) gene assembly and output. The
subroutine with the highest time complexity is filter (see
Methods). A very rough bound for its run time is given by
O(nm), where n and m are the lengths of the
input query sequences. This is due to the fact that the size of the two
exon precandidate lists depends linearly on n and m,
respectively, but pairs of precandidates, one from each list, have to
be processed. For all other subroutines the time requirement is
subquadratic. Memory space is dynamically allocated in all subroutines.
An upper bound for the required memory space is also given by
O(nm), because lists with pairs of exon precandidates have to
be stored and handled. However, absolute running times and space
requirements also depend on sequence properties such as the level of
similarity. For instance, gene prediction with SGP-1 for
two homologous sequences from the human and mouse HOX regions (176 kb
and 214 kb, respectively) took 10.5 sec CPU time on a Linux PC (RedHat,
distribution 7.0) with a 400 MHz Pentium II processor and 256 Mb RAM.
Memory size was sufficient for the program to run without swapping. In
detail, the time requirements for the individual subroutines (1) to (5) were 0.3 sec, 2.6 sec, 4.5 sec, 1.9 sec, and 1.2 sec, respectively. In
contrast, calculation of the pairwise alignment with
BLASTZ, a heuristic alignment method (W. Miller, pers.
comm.), for these two sequences took 89 sec. To take advantage of the
modularity, the Web server provides the possibility of uploading
precomputed alignment files.
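The timing figures quoted above can be checked with a few lines of arithmetic. The sketch below only restates the reported numbers; the dictionary keys and variable names are labels chosen here for illustration, not identifiers from the SGP-1 source.

```python
# Per-subroutine CPU times (seconds) reported for the human/mouse HOX example.
# The keys are descriptive labels introduced for this sketch.
stage_seconds = {
    "alignment postprocessing": 0.3,
    "exon precandidates": 2.6,
    "filter": 4.5,
    "rescoring": 1.9,
    "gene assembly and output": 1.2,
}
blastz_seconds = 89.0  # reported BLASTZ alignment time for the same sequence pair

total = sum(stage_seconds.values())
shares = {name: secs / total for name, secs in stage_seconds.items()}
slowest = max(stage_seconds, key=stage_seconds.get)
alignment_ratio = blastz_seconds / total

print(f"total SGP-1 time: {total:.1f} s")
for name, share in shares.items():
    print(f"  {name}: {share:.0%}")
print(f"slowest stage: {slowest}")
print(f"alignment time / prediction time: {alignment_ratio:.1f}")
```

The five stage times add up to the reported 10.5 sec, filter is the largest single contribution, and the alignment step takes roughly eight to nine times as long as the prediction itself, which is why supplying a precomputed alignment pays off.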
Evaluation of Test Sets
To measure gene-prediction accuracy of SGP-1, we
generated several test sets from human/rodent (S1, S2) and
plant (T1) homologous sequences. S1 is the set
originally used by Batzoglou et al. (2000) as a test set for
Rosetta. S2 is a set of large homologous
chromosomal fragments that contains multiple genes in both species.
Accuracy is measured in terms of sensitivity and specificity (Burset and Guigó 1996; see Methods). The results for SGP-1 on
set S1 of single genes are comparable to those of
Genscan and Rosetta (Table 1), Genscan being slightly
inferior and Rosetta being slightly superior to
SGP-1 on nucleotide level accuracy. Data set S1
contains several exons with nonstandard splice sites, which are not
detected in the current version of SGP-1. This explains
the low sensitivity on exon level (SN) of
SGP-1 compared with Rosetta in set
S1. Similarity-based programs tend to be more accurate than
conventional methods for large-scale sequences with multiple genes
(Guigó et al. 2000; Wiehe et al. 2000). This property was also found
when we evaluated test set S2. Because the divergence between
two species may vary considerably along their genomes (Fig. 2), measures have to be taken to cope with different levels of conservation. For less conserved sequences, SGP-1 performs better if it is based on an amino acid alignment; for highly conserved sequences it performs better if it is
based on a DNA alignment. More generally, this pattern was found both
for vertebrate and for plant sequences. On set T1, the
performance of SGP-1 on nucleotide level was better when
based on a DNA alignment; on exon level the performance was better when
based on an amino acid alignment (Table 1). In both cases,
SGP-1 performed better than Genscan, which
was in particular due to a higher specificity.
When comparing duplicated regions within a single species, the same
dependence of prediction accuracy on conservation levels is observed.
Genome duplication is particularly common among plants. For example,
the most recent large-scale duplication events in Arabidopsis
thaliana are estimated to have occurred between 50 and 100 Myr B.P.
(Vision et al. 2000). In such cases, similarity-based gene prediction
can be useful to detect genes even if a homologous sequence of a second
species is not available. For an example, we tested a pair of
duplicated segments residing on chromosomes 3 and 5 in Arabidopsis
thaliana. Sensitivity and specificity results are
Sn = 0.87 and Sp = 0.84 (nucleotide level), and
SN = 0.62 and SP = 0.65 (exon level),
respectively. Comparing a pair of BACs from Oryza sativa and
Zea mays that contain orthologous genes for AdHI (rice and
maize) and the paralogous AdHII gene (rice), SGP-1 detects
these three genes with Sn = 0.94 and Sp = 0.98
on nucleotide level, and SN = 0.65 and
SP = 0.72 on exon level. Finally, we also applied SGP-1, without specific training, to the complete chloroplast
genomes of Oryza sativa and Zea mays. Sensitivity and
specificity on nucleotide level are both 0.80. For comparison,
Genscan with standard settings for nuclear genes in
Zea mays yielded sensitivity and specificity of 0.01 and 0.38, respectively, for the chloroplast genome of Zea mays. Not
surprisingly, both programs do very poorly in terms of exon level
accuracy (0.01). However, it would be a relatively simple task to
provide SGP-1 with a splice-site profile that is adequate
for organellar instead of nuclear genes and to enhance the exon level accuracy.
Codon Bias Versus Splice-Site Conservation
Codon bias can vary to a large extent among species within the same
taxonomic group, and even among genes within the same species (Sharp et al. 1995). On the other hand, the average splice-site profile, that is,
the nucleotide distribution around splice sites, appears to be more
conserved. In particular, this is true for human/rodent comparisons.
Codon bias is on average much lower in mice and rats than in humans,
which is perhaps due to the different rates of neutral evolution in the
two lineages: an acceleration in the rodent lineage and a slow-down in
the primate lineage (Britten 1986; Li et al. 1987). We calculated codon
bias (Peden 1997) individually for the human and rodent genes in set
S1 of single human/rodent genes and then determined the
difference in codon bias for each pair of homologous genes. Applying a
two-tailed t-test (level α = 0.01) to the differences, we
rejected the null hypothesis that the difference is zero
(p = 4.5 × 10⁻⁶, Fig. 3).
Similarly, we calculated the difference of the scores of homologous
human and rodent acceptor and donor sites. The distribution of both the
acceptor and donor score differences is more symmetrical around zero
than is the distribution of the codon bias difference (Fig. 3). In
fact, the null hypothesis of the difference being zero cannot be
rejected in a two-tailed t-test (level α = 0.05).
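The test applied to the paired differences can be sketched as a one-sample t-test against a mean of zero. The sketch below uses only the Python standard library and therefore compares the statistic with a critical value taken from a t table instead of computing a p-value. The function names and the sample values are hypothetical and are not data from set S1.

```python
import math
import statistics


def one_sample_t(differences, mu0=0.0):
    """Return (t statistic, degrees of freedom) for H0: mean difference == mu0."""
    n = len(differences)
    if n < 2:
        raise ValueError("need at least two paired differences")
    mean = statistics.fmean(differences)
    sd = statistics.stdev(differences)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    return (mean - mu0) / (sd / math.sqrt(n)), n - 1


def reject_two_tailed(t, critical_value):
    """Two-tailed decision: reject H0 if |t| exceeds the tabulated critical value."""
    return abs(t) > critical_value


# Hypothetical per-gene-pair differences (human minus rodent), for illustration only.
shifted = [0.10, 0.20, 0.30, 0.20, 0.20]     # consistently positive differences
symmetric = [-0.10, 0.10, -0.20, 0.20, 0.0]  # scattered around zero

t_shifted, df = one_sample_t(shifted)
t_symmetric, _ = one_sample_t(symmetric)

# Two-tailed critical values of Student's t for df = 4:
# 4.604 at alpha = 0.01 and 2.776 at alpha = 0.05.
print(df, round(t_shifted, 3), reject_two_tailed(t_shifted, 4.604))
print(round(t_symmetric, 3), reject_two_tailed(t_symmetric, 2.776))
```

With these made-up values the shifted sample leads to rejection at the 0.01 level, whereas the symmetric sample does not lead to rejection even at the 0.05 level, mirroring the qualitative contrast between codon bias and splice-site scores described above.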
Application to Gene Structure Validation
Annotations of gene structures submitted to sequence databases such as GenBank (www.ncbi.nlm.nih.gov), EMBL (www.ebi.ac.uk), or DDBJ (www.ddbj.nig.ac.jp) can sometimes be erroneous. SGP-1 provides an option to compare a CDS (coding sequence) annotation with the gene prediction result. This feature may be helpful for cross-checking and validating annotations because discrepancies between the given annotation and the prediction are highlighted. To that end, annotated and predicted exons are written (in GFF [General Feature; http://www.sanger.ac.uk/Software/formats/GFF] format) into an HTML file that can be viewed with any Web browser. Such potential annotation errors may include sequencing errors, wrongly annotated start or stop codons, or wrongly annotated splice sites. Figure 4 shows a discrepancy between GenBank annotation and prediction in the mouse preproinsulin gene II (Accession Number X04724). Donor site of exon 1 and acceptor site of exon 2 are wrongly annotated. As a consequence, the inferred intron phase would differ from that of the homologous human intron.
DISCUSSION
Genome analysis has entered a stage in which comparative methods
play an increasingly important role, not only for computational gene
finding but also for determining gene regulatory regions and
delineating gene function. Various programs (Bafna and Huson 2000; Batzoglou et al. 2000; Roest Crollius et al. 2000; Novichkov et al. 2000) have
already been published or are under development. Here, we present a
method that is based on DNA or amino acid pairwise alignments to
predict coding regions and exon-intron structure of multiple genes,
and to validate gene-structure annotations. One of the shortcomings of
traditional gene prediction tools has been that they are extremely
species specific and that their accuracy may drop dramatically when
they are applied to species for which they have not been trained. In
contrast, comparative gene prediction may rely exclusively or primarily
on the pattern of conservation between a pair of species, exploiting
the fact that functional (which here means amino acid coding) parts of
the genome are generally more conserved than nonfunctional parts.
Therefore, such programs should be more versatile and perform well
across a wide spectrum of species, no matter whether bacterial, animal,
or plant genomes are compared. In practice, however, there is probably
no single tool that works equally well regardless of the evolutionary
divergence between the compared sequences. Underprediction and
overprediction, depending on the evolutionary distance, are common
problems. Furthermore, prediction accuracy is sensitive not only to the
choice of an appropriate species pair, but may also vary considerably
along the genome within a particular species pair. SGP-1
is designed for comparative analysis in evolutionarily closely related species such as Homo sapiens and Mus musculus,
Arabidopsis thaliana and Brassica oleracea,
Caenorhabditis elegans and Caenorhabditis briggsae,
or more closely related species. A central strategy of
SGP-1 is to rely as little as possible on species-specific DNA characteristics, such as nucleotide composition, isochore distribution, codon bias, or repetitive elements. Therefore, the precandidate exons (see Methods) do not receive scores that depend on
the coding potential or codon usage. Rather, scoring at the initial
step relies exclusively on splice-site quality. Splice profiles are
generally less variable within a taxonomic group than is codon usage.
SGP-1 is an alignment-based method. Ideally, the alignment
is computed with dynamic programming such as that implemented in
SIM96, and which guarantees an optimal alignment to be
found. Often, however, the time requirement is prohibitive for such a
method to be applicable. The current Web-server version of
SGP-1 provides alternative alignment options:
BLASTN, TBLASTX, or the possibility to upload
a precomputed alignment. SGP-1 relies on local rather than
global alignments. It is well known (Doolittle 1990) that local
alignments are more appropriate to identify short regions of similarity
that may be embedded in regions of high dissimilarity, as is the case
with coding regions embedded in large intragenic stretches. With a
global alignment program, such short conserved stretches may only be
detected if the gap penalties are extremely well adapted to the
problem, which would pose a severe restriction on program versatility.
The generally accepted strategy to individually anchor highly
conserved, but possibly short, stretches is to produce a set of
suboptimal local alignments rather than a single, global alignment.
Furthermore, global alignments necessarily yield a colinear similarity
pattern. Therefore, particularly in the absence of colinearity, two
sequences may sensibly be compared only in terms of a local alignment.
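To illustrate why a local alignment isolates a short conserved stretch from dissimilar flanks, the following minimal sketch computes a Smith-Waterman-style local alignment score with a linear gap penalty. It is not SIM96 or BLAST; the function name, the scoring parameters, and the toy sequences are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (linear gap penalty).

    Every cell is floored at zero, so an alignment may start and end anywhere
    and dissimilar flanking sequence cannot drag the score down.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


# Toy example: an 8-bp conserved core embedded in unrelated flanks.
core = "ACGTACGT"
seq1 = "TTTTTT" + core + "CCCCCC"
seq2 = "GGGGGG" + core + "AAAAAA"

print(local_alignment_score(seq1, seq2))      # at least len(core)
print(local_alignment_score("AAAA", "CCCC"))  # no similarity at all
```

The zero floor is the design choice that matters here: a global alignment would have to pay for the unrelated flanks, whereas the local score is determined by the conserved core alone.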
The currently distributed version of SGP-1 is designed for
nuclear eukaryotic DNA sequences as input. A parameter file, which is
easily accessible by the user and which describes splice profiles
and/or genetic code, needs to be edited to treat nonnuclear DNA.
When comparing SGP-1 with other, not similarity based,
gene finders, one of the most remarkable features is the generally much
higher specificity. SGP-1 also performs well in large-scale genomic sequences. In particular, in problem zones, such as
unusually large introns, SGP-1 may be superior to other
gene-finding programs: the prediction of SGP-1 of the
Human MeCP2 gene structure is correct around intron 2 (size 60 kb),
whereas Genscan returns a number of false-positive results
in this region. We compared SGP-1 with other similarity-based gene finders such as Rosetta and
ProGen (see Table 1). ProGen uses an amino
acid alignment rather than a DNA alignment. Its strength is in
detecting more distant relationships, such as seen when, for example,
Human and Fugu sequences are compared. Rosetta is
primarily designed for human/rodent comparisons (Batzoglou et al. 2000). ExoFish (Roest Crollius et al. 2000) compares a
human query sequence with a sequence database of the pufferfish T. nigroviridis; it is designed for gene prediction in humans, not in
arbitrary species. Prediction accuracy of SGP-1 does not
depend on the availability of ESTs or CDSs or the completeness of EST
or CDS databases. Given two homologous genomic sequences,
SGP-1 is expected to be superior to programs that rely on
extrinsic information, and spliced alignment programs of the first
generation, such as Procrustes (Gelfand et al. 1996),
which were designed for a particular species. Even if homologous BACs
of two or more species are not available, gene prediction by homology
may still yield reliable results. We applied SGP-1 to a
self-aligned 340-kb genomic BAC of Oryza sativa (Accession
Number AF172282). The rice AdH region is known (Tarchini 2000) to have
undergone several micro duplication events. In principle, duplicated
genes may be identified by homology-based programs. Again, accuracy
depends on the time when the duplication event occurred and on the
speed of divergence. Comparing the rice AdH (AdHI and AdHII) genes with
that in Arabidopsis thaliana and assuming that the split
between Dicotyledons and Monocotyledons occurred about 100 Myr ago, we
estimate the duplication event between AdHI and AdHII in rice at about
44 Myr B.P. SGP-1 correctly predicts the gene structures
of AdHI and AdHII, except for the terminal exon. More generally,
members of a gene family may be identified by comparing a single gene,
or even only a CDS, with an entire chromosome or genome of the same or
a related organism. The question of whether two genes are an
orthologous or paralogous pair is per se irrelevant for gene
identification by similarity. What matters, however, is the time and
speed of their divergence. In addition to local duplications, extant
organisms carry traces of a history of genome or chromosome
duplications. This is particularly common in plants that may have
undergone several rounds of genome duplication. This fact can be
usefully applied to homology-based gene prediction by aligning two
chromosomes of a single organism. For example, chromosome 3 and 5 of
Arabidopsis thaliana contain syntenic regions over large parts
of the chromosomes (Blanc et al. 2000; Vision et al. 2000). When
SGP-1 is applied to two 230-kb regions (Fig. 5) in the two chromosomes, related genes and gene families are identified. Clearly, unique genes will be missed
by such an approach. Therefore, values of prediction accuracy, in
particular sensitivity, for SGP-1, or any other
homology-based program, are not very informative in such a region. This
is aggravated in the preceding example by the fact that most of the
annotated genes in this region are not experimentally confirmed but are only computer predicted.
In the future, we will see an increasing need not only for computerized prediction of gene structures, but also of regulatory regions in particular, and for reliable statements about inferred gene function in the absence of experimental validation. Comparative genome analysis will undoubtedly play an important role in accomplishing these tasks.
METHODS
Algorithm
Sequence Alignment
SGP-1 requires a pairwise local alignment of two genomic sequences, such as produced by SIM96 (Huang and Miller 1991).
Generating Precandidates
Input DNA sequences are scanned for patterns such as start codons, stop codons, and splice sites. The patterns are represented in a tree-like data structure known as a keyword tree (Aho and Corasick 1975).
Filtering
The subroutine FILTER checks whether begin and end positions of any pair of precandidates are contained in the postprocessed alignment. If there is a discrepancy, a pair is discarded. Optionally, the filter can be relaxed to allow for an offset between alignment and exon precandidate. There are two parameters: x, the number of base pairs by which locally aligned segments are extended, and d, the maximal distance (in bp) by which the ends of two paired precandidates may be separated (Fig. 6). The parameter values can be selected by the user via a command line switch. Computation time depends on parameter settings. For the general case one has as upper limit
|
|
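The filtering step can be sketched as follows, under one reading of the parameters x and d. The alignment is assumed here to be a list of ungapped forward-strand segments, each given as (start in sequence 1, start in sequence 2, length); this representation and all function names are hypothetical and are not taken from the SGP-1 source.

```python
from typing import List, Tuple

Segment = Tuple[int, int, int]  # (start in seq 1, start in seq 2, length), ungapped
Exon = Tuple[int, int]          # (begin, end), inclusive coordinates


def project(pos: int, segments: List[Segment], x: int) -> List[int]:
    """Map a position of sequence 1 onto sequence 2 through every segment that
    covers it once the segment is extended by x bp on both sides."""
    hits = []
    for s1, s2, length in segments:
        if s1 - x <= pos <= s1 + length - 1 + x:
            hits.append(s2 + (pos - s1))
    return hits


def compatible(e1: Exon, e2: Exon, segments: List[Segment], x: int, d: int) -> bool:
    """A pair is kept if both ends of e1 project to within d bp of the ends of e2."""
    begin_ok = any(abs(p - e2[0]) <= d for p in project(e1[0], segments, x))
    end_ok = any(abs(p - e2[1]) <= d for p in project(e1[1], segments, x))
    return begin_ok and end_ok


def filter_pairs(list1: List[Exon], list2: List[Exon],
                 segments: List[Segment], x: int = 0, d: int = 0):
    """Check every pair of precandidates, one from each list, against the alignment."""
    return [(e1, e2) for e1 in list1 for e2 in list2
            if compatible(e1, e2, segments, x, d)]


# Toy data: positions 100..149 of sequence 1 align with 500..549 of sequence 2.
segments = [(100, 500, 50)]
list1 = [(105, 140), (95, 140), (300, 350)]
list2 = [(505, 540), (497, 540)]

strict = filter_pairs(list1, list2, segments)              # x = 0, d = 0
relaxed = filter_pairs(list1, list2, segments, x=5, d=2)   # relaxed filter
print(strict)
print(relaxed)
```

In this toy case the strict filter keeps only the pair whose ends coincide exactly with aligned positions, while the relaxed setting additionally keeps the pair whose begin lies 5 bp outside the aligned segment. The nested loop over both precandidate lists is where the rough O(nm) bound stated in Results comes from.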
Rescoring
The output of FILTER consists of pairs of precandidates, each of which is uniquely characterized by its position, strand label ("+" or "−"), and reading frame. Hence, translation is
unambiguous, and for each pair of amino acid sequences a similarity
score can be computed, for example, by a dynamic programming method
similar to the Needleman and Wunsch (1970) algorithm.
Gene Assembly
Assembly is performed independently for both species. Here, we use the method described by Guigó (1998).
Evaluating Gene Prediction Accuracy
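Prediction accuracy is measured by the quantities Sn
("sensitivity") and Sp ("specificity"), as defined by
Burset and Guigó (1996). On the level of nucleotides, sensitivity is Sn = TP/(TP + FN) and specificity is Sp = TP/(TP + FP), where TP is the number of coding nucleotides correctly predicted as coding, FN is the number of coding nucleotides predicted as noncoding, and FP is the number of noncoding nucleotides predicted as coding. On the level of exons, SN is the fraction of annotated exons that are predicted exactly, and SP is the fraction of predicted exons that exactly match an annotated exon.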
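As a worked illustration of these accuracy measures, the sketch below computes nucleotide-level and exon-level sensitivity and specificity from two exon lists, assuming the standard definitions of Burset and Guigó (1996): Sn = TP/(TP + FN) and Sp = TP/(TP + FP) on nucleotides, and exact exon matches on the exon level. The function names and coordinates are hypothetical.

```python
def nucleotide_accuracy(annotated, predicted):
    """Return (Sn, Sp) on the nucleotide level.

    annotated, predicted: lists of (begin, end) exon coordinates, inclusive.
    """
    true_coding = {p for b, e in annotated for p in range(b, e + 1)}
    pred_coding = {p for b, e in predicted for p in range(b, e + 1)}
    tp = len(true_coding & pred_coding)  # coding and predicted as coding
    fn = len(true_coding - pred_coding)  # coding but missed
    fp = len(pred_coding - true_coding)  # predicted coding but noncoding
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


def exon_accuracy(annotated, predicted):
    """Return (SN, SP) on the exon level; an exon counts only if both
    boundaries are predicted exactly."""
    ann, pred = set(annotated), set(predicted)
    correct = len(ann & pred)
    sn = correct / len(ann) if ann else 0.0
    sp = correct / len(pred) if pred else 0.0
    return sn, sp


# Hypothetical annotation and prediction on one strand of one sequence.
annotated = [(1, 100), (201, 300)]
predicted = [(1, 100), (211, 300), (401, 420)]

sn_nt, sp_nt = nucleotide_accuracy(annotated, predicted)
sn_ex, sp_ex = exon_accuracy(annotated, predicted)
print(f"nucleotide level: Sn = {sn_nt:.2f}, Sp = {sp_nt:.2f}")
print(f"exon level:       SN = {sn_ex:.2f}, SP = {sp_ex:.2f}")
```

Here the second exon is predicted with a shifted acceptor site, so it contributes most of its nucleotides to the nucleotide-level scores but counts as a miss on the exon level. This is the same effect that makes exon-level values lower than nucleotide-level values throughout Results.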
Output and Visualization of Results
The program returns an ASCII file with predicted genes in GFF
format. The file also contains the amino acid sequence (in
FASTA format) of the predicted proteins. Optionally, gene
prediction results may be visualized via an annotated two-dimensional
dotplot (Abril et al. 1999). The Web server contains a switch to
produce such a graphical output (see Fig. 2). It is in PostScript or
PDF format and can be saved to a local file. Furthermore, an HTML file
can be generated. It contains a list of the DNA sequences of predicted
and annotated exons. Special features, such as splice sites or start or
stop codons, are highlighted along the sequence (see Fig. 4).
Test Sets
Gene prediction accuracy is evaluated on several test sets. A problem with available test sets is that they often contain only sequences with single genes. However, the analysis of megabase-sized draft sequences is a routine task in many laboratories, and gene finders need to perform well also with large-scale sequences that may include multiple genes on both strands.
Human/Rodent
A set (S1) of 116 homologous human/rodent single gene sequences was kindly provided by S. Batzoglou. A further human/rodent test set of 57 pairs was compiled by N. Jarborg and is available at www.sanger.ac.uk/Software/Alfresco/mmhs.shtml. Because the two sets are not disjoint, we generated a disjoint subset from the latter, comprising 39 homologous sequence pairs, which we then used as a training set to optimize program parameters. Set S2 consists of four homologous human/rodent pairs of partially unfinished BACs. They include the human and mouse MHC-II (accession numbers X87344; AF100956, AF027865), ERCC2 (accession numbers L47234; L47235) and MeCP2 regions (accession numbers AF030876, Z47046, Z47066; AF121351), and the HOX cluster (accession numbers AC009336; L1084). Parameter optimization was done manually on a low-dimensional discrete grid. Parameters to be tuned were c, the lower score cutoff for exon precandidates; d, the radius of fuzziness; x, the alignment extension (both in module FILTER); the weight w of splice site- versus similarity-score (module RESCORE); and s, a value by which the entire distribution of candidate scores is shifted (module ASSEMBLY, Guigó 1999).
Plants
A set (T1) of 20 homologous nuclear gene pairs of Brassicaceae was obtained from U. Göbel (pers. comm.). In each pair, one species is Arabidopsis thaliana and the other is Brassica oleracea or Brassica napus. This set was generated by first searching SWISS-PROT (release 39.0) for the taxon name Brassicaceae. Sequences were then clustered into species. Each possible pair of sequences from different species was globally aligned (GAP in GCG package [1999]). Pairs with a minimum protein identity of 30% were considered further and their respective DNA entry was extracted from GenBank, release 117. If the GenBank entry contained the keyword "gene" or "complete cds" in the DEFINITION line, the pair was retained; otherwise it was discarded. The remaining entries were manually checked for complete annotation. We also analyzed (set T2) several BACs of Arabidopsis thaliana (AC002291, AC009465, AC009177; AC007123, AF007271), Oryza sativa (AF172282), and Zea mays (AF123535), which are known to contain several sets of duplicated genes. Finally, we applied the program to the complete chloroplast genomes of Oryza sativa (NC_001320) and Zea mays (NC_001666).
Implementation
The program is written in ANSI C. The source is available under the general GNU license agreement from the authors on request. Furthermore, a Web server is accessible at http://soft.ice.mpg.de/sgp-1. Memory requirement depends on the size of the sequences to be analyzed and on the chosen options. Running the program on a Linux PC with a single Pentium II processor, we found 256 Mbyte RAM generally to be sufficient for analyzing sequences in the range of up to 200 kb.
ACKNOWLEDGMENTS
We thank two anonymous reviewers for valuable comments, and Bernhard Haubold, Webb Miller, and Matthias Platzer for stimulating discussions. We are grateful to Ulrike Göbel who helped to compile a set of homologous plant genes and to René Kiessig for support in CGI-programming. Laurent Duret provided unpublished material on orthologous vertebrate genes. This work has been supported by the Max-Planck Gesellschaft (Germany), by Proyecto del Plan Nacional de I+D, BIO98-0443-C02-01, Beca de Formación de Personal Investigador, FP95-3881 7943 from the Ministero de Educación y Ciencia (Spain), and by a TMR grant (ERBFMGECT950062) from the European Community.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
3 Corresponding author.
E-MAIL twiehe{at}ice.mpg.de; FAX 49-3641-64-3668.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.177401.
Received January 1, 2001; accepted in revised form June 5, 2001.