Published online before print
December 30, 2002, 10.1101/gr.789803
Vol 13, Issue 1, 97-102, January 2003
METHODS
AVID: A Global Alignment Program
Nick Bray1,2,
Inna Dubchak1 and
Lior Pachter2,3
1Lawrence Berkeley National Laboratory, Berkeley,
California 94720, USA; 2Department of Mathematics, University
of California, Berkeley, California 94720, USA
ABSTRACT
In this paper we describe a new global alignment method called AVID.
The method is designed to be fast, memory efficient, and practical for
sequence alignments of large genomic regions up to megabases long. We
present numerous applications of the method, ranging from the
comparison of assemblies to alignment of large syntenic genomic regions
and whole genome human/mouse alignments. We have also performed a
quantitative comparison of AVID with other popular alignment tools. To
this end, we have established a format for the representation of
alignments and methods for their comparison. These formats and methods
should be useful for future studies. The tools we have developed for
the alignment comparisons, as well as the AVID program, are publicly
available. See Web Site References section for AVID Web address and Web
addresses for other programs discussed in this paper.
The comparison of biological sequences is one of the oldest
problems in computational biology, and early work on
the problem (Needleman and Wunsch 1970; Altschul et al. 1990) resulted
in what were arguably the first highly successful and widely adopted
applications of computer science to biology. It became apparent early
on that alignment programs could be divided into two types:
- Local alignment methods (e.g., BLASTZ; Schwartz et al. 2000) are
designed to search for highly similar regions in two sequences, where
the regions of similarity are not necessarily conserved in order and
orientation. BLAST-like methods work by first finding very short common
segments between the sequences, and then expanding out the matching
regions as far as possible. Algorithms such as Smith-Waterman
dynamic programming (Smith and Waterman 1981) work by identifying the
most likely significant matches according to an evolutionary model.
- Global alignment algorithms (e.g., Needleman and Wunsch 1970) are
suitable when an extra assumption holds, namely, that the highly
similar regions in the sequences appear in the same order and
orientation. These methods attempt to find the "global map" between
the sequences, in the process rejecting alignments that overlap or
cross over.
Local alignment algorithms are generally very useful in
finding similarity between regions that may be related but are inverted
or rearranged with respect to each other. There has also been evidence
that transcription factor binding sites are prone to reordering and so
are more suited to detection by local alignment methods. A problem with
local alignment algorithms is that, because of the weaker assumptions
in place, there is less power in finding weakly conserved regions;
furthermore, identified conserved regions may not be true homologs
(i.e., related via a common ancestor).
Global alignment algorithms have been found to be useful in many
situations because biological sequences from related organisms tend to
satisfy the order assumption, assuming that the regions being examined
are sufficiently small. For example, on average, the human and mouse
genome appear to have order and orientation preserved for regions up to
8 Mb in length (Mural et al. 2002). On the other hand,
sophisticated scoring functions and global alignment models lead to
slow algorithms that are also very memory intensive. Thus, until
relatively recently, global alignment algorithms have generally only
been applied to short sequences.
The AVID alignment method is our attempt to address existing
shortcomings of global alignment programs. As we will show, AVID is
sensitive in finding homologous regions, but is also specific and
avoids the false-positive problem of local alignment programs. At the
same time, it is fast and highly reliable. It has been used to align
thousands of submitted sequence pairs from biological researchers
worldwide, and it has also been used as a key component in an alignment
of the entire human and mouse genomes (O. Couronne et al. 2003).
In order to assess the performance of AVID, we compared it to both
local and global alignment programs. Despite the emergence of many
alignment programs in recent years (Wiehe et al. 2000; Chain 2001;
Miller 2001; Dubchak and Pachter 2002), there has not been a detailed
performance comparison as has been done in the gene-finding field
(Burset and Guigo 1996). This is partially due to the difficulty in
assessing the "correctness" of an alignment. We have devised
various ways of doing so and we describe the results following.
Finally, we describe various applications of AVID (e.g., the alignment
of assemblies), which have been prohibitive with previous
computationally expensive approaches.
Algorithm
The AVID method is summarized in Figure
1. The input to the program consists of two
genomic sequences; the output is a global alignment with additional
information (e.g., an overall score). The details of the components are
described as follows.
Repeat Masking (Optional)
The input sequences can be processed with the RepeatMasker program
(http://ftp.genome.washington.edu/RM/RepeatMasker.html, Smit and
Green), but, unlike standard alignment programs, both the
masked and unmasked sequences are used in the alignment process.
Matches are divided into two groups: those overlapping repeats that we
call repeat matches, and those not overlapping repeats that we
call clean matches (it is important to note that the term
"match" here refers to a maximal match that is not necessarily
unique). Clean and repeat matches are used in different ways by the
program and we discuss this in more detail as follows.
Finding Matches Using Suffix Trees
A maximal repeated substring of a string is a repeated substring with
the property that no longer substring containing it is also repeated in
the string (Gusfield 1997). The problem of finding all maximal repeated
substrings of a single string has a straightforward solution using
suffix trees (Fig. 2). A maximal match
between two sequences is a pair of matching substrings (one from
each sequence) whose flanking bases are mismatches. In AVID, the
problem of finding all maximal matches between two sequences is
transformed to the problem of finding maximal repeated substrings in
one string. This is done by concatenating the two sequences and placing
the character N between them. A maximal repeat in this string
with one occurrence on each side of the boundary between the two
sequences represents a maximal match between the two sequences.
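The definition of a maximal match can be illustrated with a short sketch. This is not the suffix-tree method that AVID uses: it is a simple quadratic scan along the diagonals of the comparison matrix, chosen only because it fits in a few lines. The function name `maximal_matches` and the `min_len` parameter are hypothetical.

```python
def maximal_matches(a, b, min_len=2):
    """Return (start_a, start_b, length) for every maximal exact match.

    A match is maximal when it cannot be extended to the left or right
    without hitting a mismatch or the end of a sequence.
    Quadratic-time illustration only; AVID itself uses a suffix tree.
    """
    matches = []
    n, m = len(a), len(b)
    # Each diagonal is identified by the offset d = j - i.
    for d in range(-(n - 1), m):
        i = max(0, -d)
        j = i + d
        run = 0  # length of the current run of identical bases
        while i < n and j < m:
            if a[i] == b[j]:
                run += 1
            else:
                if run >= min_len:
                    matches.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        # A run that reaches the end of either sequence is also maximal.
        if run >= min_len:
            matches.append((i - run, j - run, run))
    return sorted(matches)


print(maximal_matches("ACGTTGCA", "TTACGTAA", min_len=3))
```

Because every run on a diagonal is bounded by a mismatch or a sequence end, each reported match is maximal by construction. A suffix tree finds the same matches in time roughly linear in the input, which is what makes the approach practical for megabase sequences.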

Figure 2. Finding maximal matches using a suffix tree: The suffixes of the word
at the root are represented by the characters along the paths from the
root to the leaves. Branchings in the tree correspond to locations
where different suffixes shared the same prefix, and therefore are
matches. Every internal node in the tree is therefore a match (with the
matching sequence corresponding to the path characters along the path
from the root). Maximal matches can be efficiently detected by
considering some additional criteria.
Anchor Selection
Once the match finding has been completed, AVID begins the recursive
process of anchoring and aligning the sequences. An anchor set is a
collection of nonoverlapping, noncrossing matches (e.g., the red
matches shown in Fig. 3).

Figure 3. Selecting anchors from the set of matches. Every maximal match is shown
in blue. A set of good anchors is shown in red.
First, the entire match set is reduced to eliminate "noisy" matches
from those being considered for anchors. Our current heuristic is to
remove matches that are less than half the length of the longest match
from initial consideration. The shorter matches will be reconsidered
for anchoring in later rounds. The matches are then ordered, with clean
matches appearing first (sorted by length), followed by repeat matches.
Repeat matches will not be considered for anchoring until there are no
more clean matches.
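The filtering and ordering heuristic just described can be sketched as follows. The `Match` record and the function name `candidate_matches` are hypothetical, and two details are assumptions: "sorted by length" is read as longest first, and repeat matches are also sorted by length.

```python
from typing import NamedTuple


class Match(NamedTuple):
    start_a: int
    start_b: int
    length: int
    is_repeat: bool  # True if the match overlaps a masked repeat


def candidate_matches(matches):
    """Drop matches shorter than half the longest, then order the rest."""
    if not matches:
        return []
    longest = max(m.length for m in matches)
    # Keep a match only if its length is at least half the longest length.
    kept = [m for m in matches if 2 * m.length >= longest]
    # Clean matches first (longest first), then repeat matches.
    return sorted(kept, key=lambda m: (m.is_repeat, -m.length))


example = [
    Match(0, 0, 10, False),
    Match(20, 20, 4, False),   # shorter than half of 10, dropped this round
    Match(30, 30, 8, True),
    Match(50, 50, 6, False),
    Match(70, 70, 5, True),
]
print([m.length for m in candidate_matches(example)])
```

Matches dropped here are not discarded for good: as the text notes, they are reconsidered in later rounds, when the longest remaining match is shorter.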
The anchors are selected using a variant of the Smith-Waterman
algorithm. The gap score used is zero, and the mismatch score
is -Infinity. The score assigned to a match is based on its length and
the alignment score of the regions flanking the match (10 bp on each
side). Anchors are also required to be nonoverlapping (hence a minor
modification of the Smith-Waterman algorithm). This process of anchor
selection is similar to the idea first adopted in the GLASS algorithm
(Pachter 1999; Batzoglou et al. 2000).
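One way to read the anchor selection step is as a search for the highest-scoring chain of nonoverlapping, noncrossing matches. The sketch below implements that reading with a quadratic dynamic program. It is an illustration, not AVID's implementation: the function name `select_anchors` is hypothetical, and each match is scored by its length alone, whereas the score described above also includes the alignment score of the flanking regions.

```python
def select_anchors(matches):
    """Pick the highest-scoring chain of matches.

    matches: list of (start_a, start_b, length) tuples.
    A chain is valid when each match ends, in both sequences, before the
    next one starts (nonoverlapping and noncrossing).
    Score is the match length only (a simplifying assumption).
    """
    ms = sorted(matches)
    if not ms:
        return []
    best = [m[2] for m in ms]   # best chain score ending at match k
    prev = [None] * len(ms)     # back-pointers for the traceback
    for k, (a, b, length) in enumerate(ms):
        for j in range(k):
            pa, pb, plen = ms[j]
            fits = pa + plen <= a and pb + plen <= b
            if fits and best[j] + length > best[k]:
                best[k] = best[j] + length
                prev[k] = j
    # Trace back from the best-scoring chain end.
    k = max(range(len(ms)), key=best.__getitem__)
    chain = []
    while k is not None:
        chain.append(ms[k])
        k = prev[k]
    return chain[::-1]


print(select_anchors([(0, 0, 5), (10, 40, 6), (20, 20, 4), (50, 60, 3)]))
```

In the example, the match at (20, 20) crosses the longer match at (10, 40), so it is left out of the chain. A gap score of zero corresponds to placing no penalty on the distance between consecutive anchors.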
There is no guarantee that the matches in the anchor set produced by
this procedure are biologically significant. For regions that are too
long to align by the Needleman-Wunsch algorithm, there is no choice but
to use the anchors. For shorter regions, the use of anchors can speed
up the alignment procedure, but may result in a lower quality alignment
than that which would be arrived at using the Needleman-Wunsch
algorithm. Therefore, the anchors should only be used if we are
confident that they are correct. When AVID aligns regions short enough
to perform an optimal alignment, it uses anchors only if the total
length of the anchor set is >50% of the length of the sequence;
otherwise the regions are aligned using the Needleman-Wunsch algorithm
using standard parameters.
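The 50% rule can be written as a one-line test. In this sketch the function name `should_use_anchors` is hypothetical, and the reference length is taken to be the shorter of the two regions, which is an assumption: the text says only "the length of the sequence".

```python
def should_use_anchors(anchors, len_a, len_b, min_fraction=0.5):
    """Return True when the anchors cover more than min_fraction of the region.

    anchors: list of (start_a, start_b, length) tuples.
    The shorter region is used as the reference length (an assumption).
    """
    covered = sum(length for _, _, length in anchors)
    return covered > min_fraction * min(len_a, len_b)


print(should_use_anchors([(0, 0, 30), (40, 45, 30)], 100, 110))
print(should_use_anchors([(0, 0, 20)], 100, 110))
```

When the test fails for a short region, the region would be passed to a full Needleman-Wunsch alignment rather than being split at the anchors.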
Recursion
Once the anchors have been selected, they will form part of the
final global alignment. We think of them as having been set.
If n anchors have been set, there will be n + 1
regions between these anchors that remain to be aligned. By filtering
the current list of matches, we produce n + 1 lists of
matches, the ith list being the list of all maximal matches
between the ith interanchor pair. This is done by checking for
each match whether that match (or any sufficiently long part of that
match) lies entirely between two sets of anchors. Once the maximal
matches have been obtained, the smaller interanchor regions are
realigned using the anchor selection step described earlier.
The recursion terminates when there are either no remaining bases to be
aligned, or there are no significant matches in the remaining
sequences. If the sequences are short ( 4 kb each), they are aligned
using the Needleman-Wunsch algorithm. For long sequences, we conclude
that the lack of anchors indicates no significant alignment between
them, and because a Needleman-Wunsch type algorithm is meaningless, we
return a trivial alignment, where both sequences are
completely gapped.
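The bookkeeping behind the recursion can be sketched in a few lines: n anchors leave n + 1 region pairs to align, and a long region with no anchors receives a trivial, fully gapped alignment. The function names are hypothetical, and placing the first sequence before the second in the trivial alignment is an arbitrary choice made for the illustration.

```python
def interanchor_regions(anchors, len_a, len_b):
    """Return the n + 1 region pairs left unaligned by n sorted anchors.

    anchors: list of (start_a, start_b, length), sorted and noncrossing.
    Each region is ((a_start, a_end), (b_start, b_end)), half-open.
    """
    regions = []
    a_pos = b_pos = 0
    for start_a, start_b, length in anchors:
        regions.append(((a_pos, start_a), (b_pos, start_b)))
        a_pos, b_pos = start_a + length, start_b + length
    regions.append(((a_pos, len_a), (b_pos, len_b)))
    return regions


def trivial_alignment(a, b):
    """Pair every base of each sequence with a gap in the other."""
    return a + "-" * len(b), "-" * len(a) + b


anchors = [(0, 0, 5), (10, 40, 6), (50, 60, 3)]
for region in interanchor_regions(anchors, 60, 70):
    print(region)
print(trivial_alignment("ACG", "TT"))
```

Some regions may be empty in one or both sequences, as with the first region in the example. An empty pair corresponds to the termination case in which no bases remain to be aligned.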
Draft
AVID has the ability to order and orient draft sequence by using
comparisons to a finished sequence. If one of the sequences is
submitted in draft format (in multiple FASTA format), then the contigs
are first aligned separately to the finished genome. Alignments are
performed in both orientations, and the resulting scores are used to
determine the correct orientation. The matching locations of the 5'
ends of the contigs are then used to sort them, and consecutive contigs
are then aligned to each other to determine the amount of overlap. The
overlap information is then used to generate a "merged" draft
sequence, which is then realigned to the finished sequence to produce a
"finished-draft" alignment.
Testing
Despite the release of a number of software packages for alignment of large
genomic regions in recent years, there has been a lack of quantitative
comparison between methods. One of the problems in comparing alignment
methods is that local and global alignment methods have fundamentally
different outputs, and comparison is complicated by the myriad of
alignment formats and standards for output that are being used. We have
developed convenient and general alignment formats, and tools for
converting output of popular programs to these formats, in order to
test sensitivity and specificity of the different aligners.
Sensitivity
The comparison of the sensitivity of different alignment programs is
complicated by the fact that different programs are based on different
sets of parameters, whose setting can greatly influence the amount of
coverage. In order to fairly compare aligners, we adopted a technique
first suggested by Jim Kent (Waterston et al. 2002), which is to filter
the alignments after they have been generated to retain only those
portions of the alignments that score above a certain threshold
according to a suitable scoring matrix. After filtering, we computed
two relevant statistics: the overall amount of coverage and the
coverage of certain select features such as coding sequences and
untranslated transcribed regions (UTRs).
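The filter-then-measure procedure can be sketched as follows. Everything specific here is an assumption made for illustration: the fixed window size, the match, mismatch, and gap scores, the threshold, and the function name `filtered_coverage` are not taken from the paper, which refers only to "a suitable scoring matrix".

```python
def filtered_coverage(aln_a, aln_b, features, window=10, threshold=4,
                      match=1, mismatch=-1, gap=-2):
    """Score an alignment in fixed windows and keep windows >= threshold.

    aln_a, aln_b: equal-length aligned strings with '-' for gaps.
    features: half-open (start, end) intervals on ungapped sequence a.
    Returns (covered bases of a, covered bases of a inside features).
    All scoring parameters are illustrative assumptions.
    """
    covered = set()
    pos_a = 0  # coordinate in ungapped sequence a of the next base
    for w in range(0, len(aln_a), window):
        score, bases = 0, []
        for x, y in zip(aln_a[w:w + window], aln_b[w:w + window]):
            if x == "-" or y == "-":
                score += gap
            else:
                score += match if x == y else mismatch
                bases.append(pos_a)  # base of a aligned to a base of b
            if x != "-":
                pos_a += 1
        if score >= threshold:
            covered.update(bases)
    in_features = sum(1 for p in covered
                      if any(s <= p < e for s, e in features))
    return len(covered), in_features


aln_a = "ACGTACGTAC" + "ACGT--ACGT"
aln_b = "ACGTACGTAC" + "TGCATTTGCA"
print(filtered_coverage(aln_a, aln_b, features=[(5, 15)]))
```

Applying the same filter to the output of every program puts the coverage numbers on a common footing, whatever each program's internal parameters were.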
Specificity
It is difficult to construct accurate tests for specificity of
alignment programs because it is often hard to ascertain whether a
reported alignment is biologically significant, or the result of random
matching between the sequences. In the mouse analysis paper (Waterston
et al. 2002), a "reverse" test suggested by Arian Smit was
used, in which the mouse genome was reversed (not reverse complemented)
and aligned in order to measure the amount of alignment due to random
matches. Here we have instead measured the average amount of
overprediction resulting from alignments that are not order and
orientation preserving. In other words, we have measured the amount
of alignment when the alignments are required to be order and
orientation preserving, versus the amount when they are not. This test
was performed for BLASTZ thanks to its "chaining option", which
makes the measurement easy. It is not necessarily the case that all
extra alignments are false positives, but, in the short regions we have
analyzed, we have checked that order and orientation is preserved and
thus extra alignments can be assumed to be mostly incorrect.
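The difference between reversing and reverse complementing a sequence, and the overprediction measure, can be made concrete with a short sketch. The function names are hypothetical, and expressing overprediction as a fraction of the order-preserving (chained) coverage is an assumption about how the ratio is normalized.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq):
    """Reverse a sequence without complementing it (the control sequence)."""
    return seq[::-1]


def reverse_complement(seq):
    """Return the opposite strand, which is a biologically real sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def overprediction(unconstrained_bp, chained_bp):
    """Extra aligned bases without the order and orientation constraint,
    as a fraction of the bases aligned with the constraint."""
    return (unconstrained_bp - chained_bp) / chained_bp


print(reverse("AACG"), reverse_complement("AACG"))
print(overprediction(110_000, 100_000))
```

A reversed sequence keeps the base composition of the original but has no evolutionary relationship to the other sequence, so any alignment to it reflects random matching.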
Testing Sets
We compiled a number of testing sets in order to analyze the
performance of the programs:
- Finished sequences from the cat (14), chicken (11), chimp (6), cow
(16), dog (10), pig (27), and rat (33) for a total of 117 sequences
with an average length of 170,000 bp. The corresponding human
sequences were obtained together with RefSeq annotations.
- The Celera and EDGP assemblies of the tip of the X chromosome in
Drosophila (Benos et al. 2001).
- Celera mouse chromosome 16 (Mural et al. 2002) for alignment to the
public mouse genome.
The test sets are available for download at
http://baboon.math.berkeley.edu/~syntenic/avid/tests/.
Programs
We attempted to test as many programs as possible, and we report on
comparisons of AVID, MUMmer (Delcher et al. 1999, 2002), BLASTZ
(Schwartz et al. 2000), CHAOS (Brudno and Morgenstern 2002), and GLASS
(Pachter 1999; Batzoglou et al. 2000). Other programs such as DBA
(Jareborg et al. 1999), WABA (Kent and Zahler 2000), and DIALIGN
(Morgenstern et al. 1998, 2002) were not tested for one or more of a
number of reasons: problems obtaining or working with the code,
difficulty in parsing the output for comparison, or slow running time
for large sequences. The main problem with DIALIGN was the speed of the
program: runs on a few hundred kilobases did not finish in hours. We
did not succeed in merging the CHAOS and DIALIGN programs for a
significant speedup (Morgenstern et al. 2002), but this is possible.
All the programs were tested with their default parameters. It is
possible that certain results could improve with different parameters,
but such analysis for each program is beyond the scope of this paper.
BLASTZ was run with and without the chaining option.
Running Times and Memory Usage
We measured the running time for each of the programs on a typical
Linux-based PC (2-GHz processor). Repeat Masking times were not
included. Memory usage was not reported because we found that memory
usage varied substantially with the alignment structure of the
sequences and not just their length (this is probably due to the
heuristic nature of the alignment programs). We did find that AVID and
BLASTZ require approximately 100 MB to align bacterial artificial
chromosome (BAC)-sized sequences.
Formats
A major problem in comparing aligners is that every alignment
program has its own output format, and these vary greatly between local
and global aligners, and multiple alignment programs. We have
introduced a new alignment format we call AVX, which is a hybrid of the
CLUSTALW and FASTA formats and allows for recording multiple local or
global alignments. The exact specification is described in
http://baboon.math.berkeley.edu/avid/.
RESULTS
Table 1 summarizes the
results of the BAC testing. The table shows the amount of coding exon
coverage in base pairs rather than percentages because of the
difficulty in determining how much of the annotated coding regions were
truly coverable by alignments. It is reassuring to note that most of
the programs are good at identifying coding exons, and this holds true
for all the organisms tested. It is interesting that the local and
global aligners did not exhibit much difference in this measure, AVID
and BLASTZ (with the chaining option) having very similar coverage.
BLASTZ did have slightly higher coding exon coverage without chaining,
but this is associated with a considerable increase in total coverage
(~10%), most of which cannot be homologous alignment. Nevertheless,
the overall coverage of BLASTZ in chaining mode is very similar to the
AVID coverage. We believe that the large discrepancy in coverage
between the BLASTZ chaining and nonchaining modes is mostly due to
false-positive alignments. The reason for this is that these finished
sequences were selected for containing the same genes in the same order
with no rearrangements between the sequences. It is possible that a
part of the 10% is due to alignments of transcription factor binding
sites that are not preserved in order, and to repetitive sequence.
Table 1. Coverage Results for the Different Programs on Human Sequence Alignments With Cat, Chicken, Cow, Dog, Pig, and Rat
It is reassuring to note that the BLASTZ coverage with chaining on is
very similar to the AVID coverage. It is important to note, however,
that the coverage is not always at the same place. We observed that
roughly 7% of the alignments were unique to AVID and another 7% to
BLASTZ. This is similar to results obtained on the whole genome
alignments (Waterston et al. 2002).
A comparison of coverage results for the different organisms shows that
the different programs are sensitive to the evolutionary distances of
the sequences. The chimp sequences were alignable with all of the
programs except GLASS, although CHAOS and BLASTZ (without chaining)
took much longer than the other programs. Coding exon coverage results
were not reported because they are not informative for highly similar
sequences.
AVID currently has worse results on chicken in comparison to the other
organisms because of the low similarity between human and chicken on
the nucleotide level. We are currently trying different approaches to
optimize the use of both protein and DNA matches for increasing
sensitivity on coding regions in highly divergent organisms.
The running time of the methods was also measured because this measure
is important in practice, especially if the alignment methods are to be
used on a whole genome scale. All the programs were fast, except for
GLASS. Of the three global alignment programs, AVID was, in general,
the fastest.
The Celera and EDGP assemblies of the tip of the X chromosome from
Drosophila melanogaster were used to test the efficiency of
AVID in aligning assemblies. The alignment took 30 sec on a 2-GHz Linux
machine, and used less than 600 MB of RAM. Results are
posted at http://baboon.math.berkeley.edu/~syntenic/avid/benos/.
The alignment compares favorably with the MUMmer alignment performed in
the original paper (Benos et al. 2001). The process of aligning the
assemblies is completely automatic, requiring no manual intervention
(as was needed with MUMmer), and the subsequent visualization is easy
to construct and examine (see Web site). Aside from MUMmer, none of the
other programs in this paper were able to align assemblies.
AVID was also used to align the Celera mouse chromosome 16 sequence to
the public mouse chromosome 16. This comparison represents the largest
assembly comparison to date (the Celera chromosome is 92 Mb long). This
alignment required the division of the sequences into 10 pieces, but
should be doable on a large memory machine in one pass. Results can be
viewed at http://baboon.math.berkeley.edu/~syntenic/avid/celera/.
Results on the whole genome alignment of human and mouse using AVID are
reported in another paper (O. Couronne et al. 2003).
DISCUSSION
The problem of alignment has expanded and become substantially more
complicated during the past two decades. Whereas the original alignment
problem asked for a comparison of two short sequences according to a
straightforward evolutionary model, the current sequences being
analyzed demand robust alignment algorithms that satisfy many criteria
and requirements. Among the desirable attributes a program should have
are speed (in order to be able to process whole genomes); the ability
to deal with rearrangements, duplications, and other large-scale
genomic events; sensitivity for the detection of remote homologies and
short coding exons; accurate results regardless of the evolutionary
distance of the sequences; functionality for dealing with draft
sequence; seamless integration with visualization tools; and methods
for incorporating phylogenetic information in multiple alignment.
Our results show that AVID is a very effective and practical alignment
tool that addresses many, although not all, of these problems. The
critical assumption that the sequences being aligned have the property
that their functional elements are preserved in order and orientation
is used to reduce false alignments, but, at the same time, is a
weakness in that it restricts the possible applications of the program.
This latter problem can be remedied by combining AVID with a local
alignment program, as we have done for the whole human and mouse genome
alignment (O. Couronne et al. 2003). Nevertheless, it remains an
unsolved problem (and in our opinion an important one) to develop a
sophisticated yet efficient alignment program that combines the best of
both local and global alignment algorithms.
Our comparison of alignment programs on several BAC-sized regions
reveals that coding exons are relatively straightforward to identify,
and are correctly aligned by most methods (global and local). Weaker
homologies are more difficult to detect, and differences emerge between
local and global alignment programs. The extra alignments that result
from local alignments that are not required to be order and orientation
preserving in regions in which gene order is conserved are strong
indicators that local alignment methods may align regions that are
similar in sequence, but are not necessarily biologically significant.
Perhaps most important, the fact that different alignment programs do
not return the exact same alignments, even for conserved regions,
indicates that users may want to try all the available methods when
aligning their sequences.
The comparison of alignment programs is a non-trivial issue; in
particular, we were not successful in running all the available
programs on a modestly sized data set. We hope that any omissions
resulting from errors on our part will be rectified by further
investigations that lead to detailed comparative studies of alignment
methods analogous to those that have been undertaken for gene finding.
In order to facilitate such comparisons, we have made all our format
conversion tools available at
http://baboon.math.berkeley.edu/avid/.
The draft mode of AVID has proved to be extremely useful to us, and we
use it routinely to order and orient GenBank draft sequence by
alignment with finished sequence on the VISTA server
http://www-gsd.lbl.gov/vista/ (Mayor et al. 2000). AVID is also used in
the genome VISTA server at http://pipeline.lbl.gov/cgi-bin/GenomeVista
(O. Couronne et al. 2003), which is useful for locating and aligning
user-submitted sequences to whole genomes (human or mouse at the
present time).
AVID can be used online at http://bio.math.berkeley.edu/avid/. AVID
integrates well with the VISTA visualization tool (Mayor et al. 2000).
Figure 4 shows an example of a VISTA
visualization of an AVID alignment of cat and mouse sequence. Coding
exons are displayed in blue and easy to identify. Noncoding highly
conserved regions are displayed in red. A server for running AVID and
displaying alignments using VISTA is operational at
http://www-gsd.lbl.gov/vista/. These two servers are currently aligning
almost 2000 sequence pairs per month, with submissions originating from
over 30 countries. The bio.math.berkeley.edu server is set up to handle
requests up to 2 Mb in size. The program is also available for download
(free for nonprofit use) at
http://www-gsd.lbl.gov/vista/VISTAdownload2.html. Executables are
available for Solaris, Mac OS X, Linux, and Alpha, and the source code
is available on request.

Figure 4. Cat versus mouse: A VISTA picture showing an AVID alignment of the 5'
region of the MET gene in cat and mouse. The top panel shows the
alignment of the two finished sequences with the X-axis
showing coordinates in the mouse, and the Y-axis showing the
%identity in a 100-bp window. The bottom panel shows the results of a
draft placement simulation: the cat sequence has been sliced at
locations corresponding to the vertical black lines. The resulting
contigs were permuted (order and orientation changed randomly) and
realigned to the mouse.
WEB SITE REFERENCES
http://baboon.math.berkeley.edu/avid/; AVX alignment format, test
sets for comparing programs, Celera mouse chromosome 16 to public mouse
chromosome 16 alignment, Drosophila assembly comparison,
AVID syntenic map whole-genome human-mouse alignment.
http://bibiserv.techfak.uni-bielefeld.de/dialign/; DIALIGN.
http://bio.cse.psu.edu/; BLASTZ.
http://bio.math.berkeley.edu/avid/; AVID.
http://crossspecies.lcs.mit.edu/; GLASS.
http://ftp.genome.washington.edu/cgi-bin/RepeatMasker; RepeatMasker,
Smit, A. and Green, P.
http://pipeline.lbl.gov/cgi-bin/GenomeVista; VISTA/AVID.
http://www.cse.ucsc.edu/~kent/xenoAli/; WABA.
http://www-gsd.lbl.gov/vista/; VISTA/AVID.
http://www-gsd.lbl.gov/vista/VISTAdownload2.html; VISTA/AVID download.
http://www.sanger.ac.uk/Software/Wise2/dbaform.shtml; DBA.
http://www.stanford.edu/~brudno/chaos/; CHAOS.
http://www.tigr.org/software/mummer/; MUMmer.
Acknowledgements
We thank Alex Poliakov for helping to set up the AVID Web
servers and for providing extensive debugging support and assistance. We
also thank Jody Schwartz for help in testing and debugging AVID and Jim
Lord who helped in developing overlap identification methods for draft
contigs. Thanks also to the Mouse Sequencing Consortium for generating
whole genome mouse sequence, which helped greatly in refining and
streamlining AVID. Some of the sequence data used to benchmark the
alignment programs were generated by the NIH Intramural Sequencing
Center (www.nisc.nih.gov). This project was supported in part by a
Program in Genomic Applications grant (PGA) from the National Heart
Lung and Blood Institute and a grant from the NIH
(ROI-HG02362-01).
The publication costs of this article
were defrayed in part by payment of page charges. This article must
therefore be hereby marked "advertisement" in accordance with 18
USC section 1734 solely to indicate this fact.
Footnotes
3 Corresponding author. 
E-MAIL lpachter{at}math.berkeley.edu; FAX (510) 642-8204.
Article and publication are at
http://www.genome.org/cgi/doi/10.1101/gr.789803. Article published online before print in December
2002.
REFERENCES
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403-410.[CrossRef][Medline]
Batzoglou, S., Pachter, L., Mesirov, J.P., Berger, B., and Lander, E.S. 2000. Human and mouse gene structure: Comparative analysis and application to exon prediction. Genome Res. 10: 950-958.[Abstract/Free Full Text]
Benos, P.V., Gatt, M.K., Murphy, L., Harris, D., Barrell, B., Ferraz, C., Vidal, S., Brun, C., Demaille, J., Cadieu, E., et al. 2001. From first base: The sequence of the tip of the X chromosome of Drosophila melanogaster, a comparison of two sequencing strategies. Genome Res. 11: 710-730.[Abstract/Free Full Text]
Brudno, M. and Morgenstern, B., 2002. Fast and sensitive alignment of large genomic sequences. In Proceedings of the First IEEE Computer Society Conference on Bioinformatics. IEEE Computer Society Press.
Burset, M. and Guigo, R. 1996. Evaluation of gene structure prediction programs. Genomics 34: 353-357.[CrossRef][Medline]
Chain, P. 2001. Examining the problems of whole genome comparison: A review. http://cmgm.stanford.edu/biochem218/Projects2001/Chain.pdf.
Couronne, O., Poliakov, A., Bray, N., Ishkhanov, T., Ryaboy, D., Rubin, E., Pachter, L., Dubchak, I. 2003. Strategies and tools for whole-genome alignments. Genome Res. (this issue).
Delcher, A.L., Kasif, S., Fleischmann, R.D., Peterson, J., White, O., and Salzberg, S.L. 1999. Alignment of whole genomes. Nucleic Acids Res. 27: 2369-2376.[Abstract/Free Full Text]
Delcher, A.L., Phillippy, A., Carlton, J., and Salzberg, S.L. 2002. Fast algorithms for large-scale genome alignment and comparison. Nucleic Acids Res. 30: 2478-2483.[Abstract/Free Full Text]
Dubchak, I. and Pachter, L. 2002. The computational challenges of applying comparative-based computational methods to whole genomes. Brief. Bioinform. 3: 18-22.[Abstract/Free Full Text]
Gusfield, D., 1997. Algorithms on strings, trees, and sequences: Computer science and computational biology. Cambridge University Press, Cambridge, UK.
Jareborg, N., Birney, E., and Durbin, R. 1999. Comparative analysis of non-coding regions of 77 orthologous mouse and human gene pairs. Genome Res. 9: 815-824.[Abstract/Free Full Text]
Kent, J. and Zahler, M. 2000. The Intronerator: Exploring introns and alternative splicing in C. elegans genomic alignment. Genome Res. 10: 1115-1125.[Abstract/Free Full Text]
Mayor, C., Brudno, M., Schwartz, J.R., Poliakov, A., Rubin, E.M., Frazer, K.A., Pachter, L.S., and Dubchak, I. 2000. VISTA: Visualizing global DNA sequence alignments of arbitrary length. Bioinformatics 16: 1046-1047.[Abstract/Free Full Text]
Miller, W. 2001. Comparison of genomic DNA sequences: Solved and unsolved problems. Bioinformatics 17: 391-397.[Abstract/Free Full Text]
Morgenstern, B., Frech, K., Dress, A., and Werner, T. 1998. DIALIGN: Finding local similarities by multiple sequence alignment. Bioinformatics 14: 290-294.[Abstract/Free Full Text]
Morgenstern, B., Rinner, O., Abdeddaïm, S., Haase, D., Mayer, K., Dress, A., and Mewes, H-W. 2002. Exon discovery by genomic sequence alignment. Bioinformatics 18: 777-787.[Abstract/Free Full Text]
Mural, R., Adams, M.D., Myers, E.W., Smith, H.O., Miklos, G.L., Wides, R., Halpern, A., Li, P.W., Sutton, G.G., Nadeau, J., et al. 2002. A comparison of whole-genome-shotgun-derived mouse chromosome 16 and the human genome. Science 296: 1667-1671.
Needleman, S.B. and Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443-453.[CrossRef][Medline]
Pachter, L. 1999. "Domino tiling, gene recognition, and mice." Ph.D thesis, MIT, Cambridge, Massachusetts.
Schwartz, S., Zhang, Z., Frazer, K.A., Smit, A., Riemer, C., Bouck, J., Gibbs, R., Hardison, R., and Miller, W. 2000. PipMakerA web server for aligning two genomic DNA sequences. Genome Res. 10: 577-586.[Abstract/Free Full Text]
Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147: 195-197.
Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520-562.
Wiehe, T., Guigó, R., and Miller, W. 2000. Genome sequence comparisons: Hurdles in the fast lane to functional genomics. Briefings in Bioinformatics 1: 381-388.
Received September 9, 2002; accepted in revised form November 7, 2002.
13:97-102 © 2003 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/03 $5.00
