|
|
|
|
Genome Res. 14:693-699, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00 Methods MAVID: Constrained Ancestral Alignment of Multiple SequencesDepartment of Mathematics, University of California at Berkeley, Berkeley, California 94720, USA
We describe a new global multiple-alignment program capable of aligning a large number of genomic regions. Our progressive-alignment approach incorporates the following ideas: maximum-likelihood inference of ancestral sequences, automatic guide-tree construction, protein-based anchoring of ab-initio gene predictions, and constraints derived from a global homology map of the sequences. We have implemented these ideas in the MAVID program, which is able to accurately align multiple genomic regions up to megabases long. MAVID is able to effectively align divergent sequences, as well as incomplete unfinished sequences. We demonstrate the capabilities of the program on the benchmark CFTR region, which consists of 1.8 Mb of human sequence and 20 orthologous regions in marsupials, birds, fish, and mammals. Finally, we describe two large MAVID alignments, an alignment of all the available HIV genomes and a multiple alignment of the entire human, mouse, and rat genomes.
The multiple-alignment problem is difficult for many reasons (for example, see Notredame 2002
Despite the overwhelming complexity of the problem, it is important to observe that the multiple alignment problem for genomic sequences is not equivalent to the mathematical problem of producing an optimal alignment maximizing some score function (Gusfield 1997
In this study, we propose a method capable of rapidly aligning multiple large genomic regions by incorporating biologically meaningful heuristics with theoretically sound alignment strategies. The core of our approach is a probabilistic ancestral alignment scheme (Feng and Doolittle 1987
To incorporate biological information into the alignment procedure, the progressive alignment is constrained by gene-based anchors. These anchors are precomputed on the basis of ab-initio gene predictions and their protein alignments and form part of the input to the program. In addition, nontrivial positional constraints (Hardison et al. 1993
The alignment of the ancestor sequences is based on the AVID (Bray et al. 2003
We have combined all of these ideas into a new program called MAVID, which we used to align the human, mouse, and rat genomes. We also show that MAVID is suitable for aligning very large numbers of sequences, and is therefore practical for the alignment of multiple HIV genomes (Korber et al. 2001
Our method consists of a core progressive ancestral alignment step, which can incorporate preprocessed constraints (see Fig. 1).
To clarify the presentation of the method, we begin by defining some terminology. A match is any identified similar region between two sequences. A match does not have to be exact at the sequence level; for example, we could declare a match between two orthologous gene regions, even if the sequence does not match exactly. A maximal exact match is a match that is exact at the sequence level, and is maximal (i.e., cannot be extended on either side without creating a mismatch). An anchor is a match that is used in the alignment. A constraint ai bj in a multiple alignment means that position i in sequence a must appear before position j in sequence b in the multiple alignment.
Progressive Alignment We associate multiple alignments to the vertices of T recursively, starting from the leaves. For a vertex x in T, let Tx be the subtree consisting of x and all of the vertices beneath x. If x is a leaf, then Tx is a trivial tree (i.e., a tree consisting of only one vertex), and we will label x with the sequence corresponding to x in the phylogenetic tree. This sequence can be considered a trivial multiple alignment (i.e., an alignment of only one sequence). If x is an internal node, then it has two children, u and v, which are labeled with the multiple alignments Au and Av, respectively. We then construct an alignment of all the sequences in Tx by aligning the two alignments Au and Av. This procedure is applied recursively, so the program works its way from the leaves of the phylogenetic tree to the root, at which point it will have constructed an alignment for all of the sequences in T.
The key difference between our progressive alignment schema and more standard methods is that instead of aligning Au and Av directly, we first infer ancestral sequences su and sv using standard phylogenetic models for inference of the common ancestor (Felsenstein 1981
The discussion above ignores the issue of gaps. Gaps can be modeled as a fifth symbol, which is equivalent to assigning a linear gap penalty. We have implemented the procedure in this form, but affine gap penalties are preferable. Furthermore, it is desirable to infer the deletion or insertion of bases in the ancestor, and models for this already exist (e.g., TKF; Thorne et al. 1992
After the ancestral sequence calculation, su and sv are aligned with AVID (Bray et al. 2003 The alignment of the ancestral sequences is then used to glue together Au and Av to produce a new multiple alignment, which is assigned to the vertex x. In particular, if position i in su matches position j in sv, then column i in the multiple alignment assigned to u is aligned with column j in the multiple alignment assigned to v. Gaps in the ancestral sequence alignments lead to gaps in the multiple alignment in the obvious way. The procedure terminates with a final pairwise alignment at the root node of the tree.
Exon Anchoring and Constraints Genes are considered to match if they form a reciprocal best hit. The gene matches are assembled into runs, which then form the basis of the homology map. Genome sequence coordinates for exon matches are inferred from the protein alignments, thus producing a set of pairwise matches between all of the sequences. These matches are used in the obvious way when aligning Au and Av at node x; all matches that are between a sequence in u and a sequence in v are collected. Every such match can be converted into a match between the ancestral sequences su and sv, which is then used in the AVID alignment.
In addition to anchoring the alignment of the ancestral sequences, the exon matches can be used in more subtle ways to shape the final multiple alignment. It is illustrative to consider 1-bp anchors, that is, single matches between the sequences. Suppose we have sequences a, b, and c, and that ai is anchored to cx, and bj is anchored to cy. If we are aligning sequences a and b, then the given anchors to c do not allow us to anchor the alignment, but they do allow us to constrain it. If x is less than (resp. greater than) y, we must have that ai comes before (resp. after) bj in the alignment of a and b if we are to produce an alignment of a, b, and c, which is consistent with both of the anchors. In the language of Myers et al. (1997 Thus, when constructing the multiple alignment at node x, every triplet of sequences (a,b,c) with a in u, b in v, and c not in x provides a potential constraint for the alignment. This can lead to a combinatorial explosion of constraints. If there are n sequences in the alignment, then there are O(n3) such triplets, each of which may imply many constraints. Fortunately, we do not need to find the set of all possible constraints, many of which will be redundant. Instead, we wish to find a set of prime constraints (i.e., a set such that no constraint is implied by the others) that is equivalent to, but potentially much smaller than the set of all constraints implied by the gene matches. Such a set can be inferred from the homology map. If there are m sets of orthologous exons (not all of which will be in every sequence), then at node x there can be at most O(m) prime constraints, and a prime set that is equivalent to all possible constraints can easily be found in O(mk) time, where k is the number of leaves below x. Thus, the sets of all prime constraints can be found in O(mk2) time with a small constant factor. Matches between the ancestral sequences that are inconsistent with this set of constraints can then be filtered out in time O(Nlogm), where N is the total number of matches. For typical values of m and k, the time taken computing and utilizing the constraints is negligible. Figure 2 shows an example of a constraint, and how it is enforced in the AVID alignment of the ancestral sequences.
The preprocessing step of finding all exon matches is quadratic in the number of sequences; however, as the protein alignments are gene based, they are typically computed on <5% of the sequence. Thus, the gene matching is actually significantly faster than translated match finding, which requires searching the entire sequence in all three frames and on both strands. Furthermore, by comparing only the proteins produced by a gene-prediction program, the program implicitly takes into account splice sites and other gene features in building gene anchors. It is also important to note that this approach is completely ab initio, even though a gene-finding step is necessary; no information beyond the sequences is used. For this study, we performed alignments using this strategy in order to demonstrate the performance of an ab initio approach. However, it is possible to make use of mRNA and EST data, thus incorporating known biological annotation about the sequences into the alignment.
Tree Building The initial guide tree is selected randomly from the set of complete binary trees (or almost complete binary trees in the case in which the number of sequences is not a power of 2). For a given number of nodes, these are the binary trees with minimal depth, and thus, initial errors in pairwise alignments have less opportunity to propagate through the tree. The sequences are aligned using this random tree, and then a phylogenetic tree is inferred from the resulting multiple alignment. The likelihood of the tree given the alignment can be used as a quantitative measure of the quality of the tree and the process is iterated until the alignment and tree are satisfactory.
For small numbers of sequences, the inference of the tree from the multiple alignment can be done using maximum-likelihood methods and accounts for only a small percentage of the running time. However, as the number of sequences increases, we have found that ML reconstruction becomes impractical and neighbor joining must be used. Because pairwise alignments are easy to infer from a multiple alignment, we can perform neighbor-joining reconstruction rapidly, even with large numbers of sequences. We have tested MAVID with the fastDNAml (Olsen et al. 1994
Instead of computing all pairwise alignments, only O (nk) alignments are necessary to perform n iterations with k sequences. We found that for typical alignment problems, only a small number of iterations were necessary (see results on the HIV sequences below). It is important to note that our iterative method (multiple alignment alternating with neighbor joining) is considerably less sophisticated than ML methods such as SEMPHY (Friedman et al. 2002
A Human, Mouse, and Rat Whole-Genome Multiple Alignment We aligned the human (April 2003), mouse (February 2003), and rat (June 2003) genomes using MAVID. A homology map for the genomes was built by C. Dewey (in prep.), and was used to generate gene anchors and constraints. Figure 3 summarizes the exon coverage of the alignment on chromosome 20; it shows how many of the RefSeq genes were covered by anchors (and, therefore, automatically aligned correctly), and how many were subsequently aligned by MAVID. Chromosome 20 was chosen because it aligns almost completely with mouse chromosome 2, and therefore, the quoted numbers should be useful for comparing MAVID to other alignment approaches that do not explicitly separate out orthologous from paralogous alignments.
The MAVID alignments have been used to estimate evolutionary rates for the genomes and to identify evolutionary hotspots in which one of the rodent genomes has been evolving much more slowly than the other (these results are reported in a companion paper by Yap and Pachter 2004). They are also used to support the K-BROWSER (Chakrabarti and Pachter 2004
CFTR Region: 21 Organisms
The map-building step takes It is difficult to assess the overall quality of the alignment, but one feature that can be verified is the alignment of exons. To do so, we projected the alignment onto the human sequence in order to produce pairwise alignments between human and each of the other 20 sequences. This analysis was complicated by the fact that the sequencing is not complete, and so not every exon has been sequenced in every organism. To address this shortcoming, we calculated the fraction of human exons that were aligned with each of the sequences. An exon was considered to be aligned if at least 70% of it was covered by alignment, and at least 50% of the bases were matching.
The MAVID alignments were compared with MLAGAN, version 1.1 (Brudno et al. 2003
To better understand how alignment accuracy varies with the number of sequences being aligned, we compared the MAVID and MLAGAN alignments on 21 different data sets, beginning with a pairwise alignment of the most distant organisms, and adding in one (mutually most distant) organism at a time, all the way up to a comparison on the 21 sequences. To do this, we computed the human clamped k-MST trees (Boffelli et al. 2003 The results of the alignments show that both programs correctly aligned mammalian sequences. The alignment of distant organisms shows much greater variability with respect to the sequences included in the alignment problem. For example, adding fugu to an alignment of human, zebrafish, and dunnart may improve the alignment, but as Table 1 demonstrates, adding platypus can degrade it. MAVID shows significant improvement over MLAGAN in this respect.
HIV1/SIV: Complete Genomes From 242 Individuals The HIV databases maintained at LANL contain a collection of HIV-1, HIV-2, and SIV sequences, carefully linked with individuals and their histories. We extracted the complete genomic sequences of HIV1 and SIV from this database (currently totaling 242) and aligned them with MAVID. The alignment of the sequences takes 2.5 min. A phylogenetic tree was constructed with neighbor joining taking an additional 30 sec. Again, it is difficult to assess the quality of the alignment, but it is accurate enough that the different strains cluster in the inferred tree (see Fig. 4). To understand the stability of the tree building/alignment iteration, we examined 100 alignment runs on a reduced sequence set with the recombinant strains removed (because recombination is not addressed by standard phylogenetic models). We found that the correct strains were grouped together within one round of alignment (starting with a random tree) in all of the 100 runs. To validate the iterative alignment/tree building procedure, we also examined the number of subtrees that were fixed with each successive round of alignment. The tree inferred after the second round of alignment has 45.87 of its subtrees in agreement with the tree after the first run (on average), and the number fixed between the second and third rounds is 54.08, on average. Starting with different random trees and aligning for three rounds, we find that 54.16 of the subtrees agree, on average. Our criteria for comparing trees (exact agreement of subtrees) is rather strict, and these numbers are very encouraging (one would expect about one nontrivial subtree to agree for two random trees). Although a MAVID multiple alignment combined with a neighbor-joining tree may not be as accurate as hand-edited alignments, followed by maximum likelihood tree building, it can serve to provide very fast results that can then form the basis for further refinement. An alignment problem of this size is not practical on a standard desktop computer if all of the pairwise alignments are computed in order to build an initial guide tree (as is done in many programs, e.g., CLUSTALW).
The alignments and phylogenetic trees for all the above sequences are downloadable at http://baboon.math.berkeley.edu/mavid/data.
Conclusion
Our approach is also consistent with a number of other ideas that we have not yet implemented, but which could be easily integrated into MAVID and will improve results. Iterative refinement, the process of realigning across an edge in the tree, fits in naturally with our framework (Gotoh 1993 MAVID compares favorably with existing programs. As we have pointed out, it is significantly more accurate than MLAGAN on the alignment of the CFTR benchmark region. MLAGAN is the only other program we know of that can even align such a large data set. We also know of no other programs that can quickly align hundreds of viral or mitochondrial genomes.
A MAVID Web server has been operational for over 6 mo and processes over 1000 requests a month (Bray and Pachter 2003
We thank Colin Dewey for providing access to his homology map-making programs. We thank Von Bing Yap for helping with the evolutionary models used in MAVID. Thanks to Ingileif Brynd's Hallgr'msdóttir for her help throughout the project and for her comments on the final manuscript. The data used in the multiple alignment of the CFTR region was generated by the NIH Intramural Sequencing Center (www.nisc.nih.gov), and was used subject to their 6-mo hold policy. The HIV sequences were downloaded from the HIV database (hiv-web.lanl.gov). Thanks also to the Rat Sequencing Consortium, both for providing the rat sequence to align, and for facilitating helpful collaborations and discussions. Finally, we thank the anonymous reviewers for their insightful comments and suggestions. This work was partially supported by funding from the NIH (grant R01-HG02362-01) and the Berkeley PGA grant from the NHLBI. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1960404.
1 Corresponding author. [Supplemental material is available online at http://baboon.math.berkeley.edu/mavid/data.]
Boffelli, B., McAuliffe, J., Ovcharenko, D., Lewis, K.D., Ovcharenko, I., Pachter, L., and Rubin, E.M. 2003. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299: 1391-1394.
Bray, N. and Pachter, L. 2003. MAVID multiple alignment server. Nucleic Acids Res. 31: 3525-3526.
Bray, N., Dubchak, I., and Pachter, L. 2003. AVID: A global alignment program. Genome Res. 13: 97-102.
Brudno, M., Do, C.B., Cooper, G.M., Kim, M.F., Davydov, E., N.C.S. Program, Green, E.D., Sidow, A., and Batzoglou, S. 2003. LAGAN and Multi-LAGAN: Efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 13: 721-731. Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78-94.[CrossRef][Medline] Chakrabarti, K. and Pachter, L. 2004. Visualization of multiple genome annotations and alignments with the K-BROWSER. Genome Res. (this issue). Felsenstein, J. 1981. Evolutionary trees from DNA sequences: A maximum likelihood approach. J. Mol. Evol. 17: 368-376.[CrossRef][Medline] Feng, D.F. and Doolittle, R.F. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25: 351-360.[Medline] Friedman, N., Ninio, M., Pe'er, I., and Pupko, T. 2002. A structural EM algorithm for phylogenetic inference. J. Computat. Biol. 9: 331-353. Gonnet, G.H. and Benner, S.A. 1996. Probabilistic ancestral sequences and multiple alignments. In Algorithm theory, pp. 380-391. Proceedings of SWAT '96, Reykjavik, Iceland.
Gotoh, O. 1993. Optimal alignment between groups of sequences and its application to multiple sequence alignment. Comput. Appl. Biosci. 9: 361-370. . 1996. Significant improvement in accuracy of multiple protein alignments by iterative refinement as assessed by reference to structural alignments. J. Mol. Biol. 264: 823-838.[CrossRef][Medline] Gusfield, D. 1997. Algorithms on strings, trees and sequences. Cambridge University Press, Cambridge, UK.
Hardison, R.C., Chao, K.M., Adamkiewicz, M., Price, D., Jackson, J., Zeigler, T., Stojanovic, N., and Miller, W. 1993. Positive and negative regulatory elements of the rabbit embryonic Hein, J. 2001. An algorithm for statistical alignment of sequences related by a binary tree. Proc. Pacific Symp. Biocomput. 179-190. Hein, J., Wiuf, C., Knudsen, B., Moller, M.B., and Wibling, G. 2000. Statistical alignment: Computational properties, homology testing and goodness-of-fit. J. Mol. Biol. 302: 265-279.[CrossRef][Medline] Herrnstadt, C., Elson, J.L., Fahy, E., Preston, G., Turnbull, D.M., Anderson, C., Ghosh, S.S., Olefsky, J.M., Beal, M.F., Davis, R.E., et al. 2002. Reduced-median-network analysis of complete mitochondrial DNA coding-region sequences for the major african, asian, and european haplogroups. Amer. J. Hum. Genet. 70: 1152-1171.[CrossRef][Medline] Höhl, M., Kurtz, S., and Enno, O. 2002. Efficient multiple genome alignment. Bioinformatics 18: 5312-5320. Holmes, I. 2003. Using guide trees to construct multiple-sequence evolutionary HMMs. In Proceedings of the Eleventh ISMB conference. pp. 147-157, AAAI Press, Menlo Park, California.
Holmes, I. and Bruno, W.J. 2001. Evolutionary HMMs: A Bayesian approach to multiple alignment. Bioinformatics 17: 803-820.
Kent, W.J. 2002. BLATThe BLAST-like alignment tool. Genome Res. 12: 656-664. Korber, B.T.M., Brander, C., Haynes, B.F., Koup, R., Kuiken, C., Moore, J.P., Walker, B.D., and Watkins, D.I. (Ed.) 2001. In HIV molecular immunology, pp. 02-4663. Los Alamos National Laboratory, Theoretical Biology and Biophysics, Los Alamos, NM.
Löytynoja, A. and Milinkovitch, M.C. 2003. A hidden Markov model for progressive multiple alignment. Bioinformatics 19: 1505-1513.
Morgenstern, B., Frech, K., Dress, A., and Werner, T. 1998. DIALIGN: Finding local similarities by multiple sequence alignment. Bioinformatics 14: 290-294. Myers, G., Selznick, S., Zhang, Z., and Miller, W. 1997. Progressive multiple alignment with constraints. In Proceedings of the first annual international conference on computational molecular biology, pp. 220-225. Sante Fe, New Mexico. Notredame, C. 2002. Recent progresses in multiple sequence alignment: A survey. Pharmacogenomics 3: 1-14.[CrossRef][Medline]
Olsen, G.J., Matsuda, H., Hagstrom, R., and Overbeek, R. 1994. fastDNAml: A tool for construction of phylogenetic trees of DNA sequences using maximum likelihood. Comput. Appl. Biosci. 10: 41-48. Slowinski, J.B. 1998. The number of multiple alignments. Mol. Phylogenet. Evol. 10: 264-266.[CrossRef][Medline] Thomas, J.W., Touchman, J.W., Blakesley, R.W., Bouffard, G.G., Beckstrom-Sternberg, S.M., Margulies, E.H., Blanchette, M., Siepel, A.C., Thomas, P.J., McDowell, J.C., et al. 2003. Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424: 788-793.[CrossRef][Medline]
Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673-4680. Thorne, J.L., Kishino, H., and Felsenstein, J. 1991. Evolutionary model for maximum likelihood alignment of DNA. J. Mol. Evol. 33: 114-124.[CrossRef][Medline] . 1992. Inching toward reality: An improved likelihood model of sequence evolution. J. Mol. Evol. 34: 3-16.[CrossRef][Medline] Yap, V.B. and Pachter, L. Identification of evolutionary hotspots in the rodent genomes. Genome Res. (this issue).
http://www.nisc.nih.gov/; NIH Intramural Sequencing Center. http://hiv-web.lanl.gov/; LANL HIV Databases. http://baboon.math.berkeley.edu/mavid/; The MAVID Web server. http://baboon.math.berkeley.edu/mavid/data/; Supplemental Data. http://hanuman.math.berkeley.edu/kbrowser/; K-BROWSER.
Received September 10, 2003;
accepted in revised format November 17, 2003.
This article has been cited by other articles:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||