|
|
|
|
Genome Res. 14:2336-2346, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00 Methods A novel method for multiple alignment of sequences with repeated and shuffled elementsDepartment of Computer Science and Engineering, University of California, San Diego, La Jolla, California 92093-0114, USA
We describe ABA (A-Bruijn alignment), a new method for multiple alignment of biological sequences. The major difference between ABA and existing multiple alignment methods is that ABA represents an alignment as a directed graph, possibly containing cycles. This representation provides more flexibility than does a traditional alignment matrix or the recently introduced partial order alignment (POA) graph by allowing a larger class of evolutionary relationships between the aligned sequences. Our graph representation is particularly well-suited to the alignment of protein sequences with shuffled and/or repeated domain structure, and allows one to construct multiple alignments of proteins containing (1) domains that are not present in all proteins, (2) domains that are present in different orders in different proteins, and (3) domains that are present in multiple copies in some proteins. In addition, ABA is useful in the alignment of genomic sequences that contain duplications and inversions. We provide several examples illustrating the applications of ABA.
Multiple sequence alignment (MSA) is arguably among the most studied (Sankoff 1975
This representation of the alignment as a linear sequence of alignment columns implicitly assumes that all regions of all sequences are similar over their entire length. However, for many biological sequences this assumption does not hold. For example, multidomain protein families evolve not only through mutation of individual amino acids but also through operations such as domain duplications and domain recombinations (Doolittle 1995
Recently, in a pioneering paper, Lee et al. (2002
However, even the DAG representation used in POA is not flexible enough to capture the full complexity of the similarities between biological sequences. For example, related protein sequences frequently share common domains, but the order of the domains may be different in different proteins (shuffled domains), or a single domain may be repeated in a single protein (repeated domains). To represent shuffled or repeated domains, the alignment representation must permit directed cycles (Fig. 2). Hence, neither the rowcolumn representation nor the DAG representation provides an accurate representation of shuffled or repeated domains. We take the approach of Lee et al. (2002
We emphasize that the real multiple alignment problem is more difficult that the schematic representation in Figure 2. For example, when aligning multidomain protein sequences, the delineation of the sequences into domains is not known in advance and needs to be derived from raw protein sequences. Neuwald et al. (1997
In this article, we describe a new representation for a multiple alignment as a weighted directed graph (possibly containing cycles) called the A-Bruijn graph. The A-Bruijn graph was recently introduced and applied to fragment assembly and de novo repeat identification (Pevzner et al. 2004
The MSA problem involves two tasks: finding a graph that represents the domain structure and finding a mapping of each sequence to this graph. Our approach constructs this graph based on a predetermined set of local similarities (e.g., pairwise alignments) between the input sequences. Intuitively, the A-Bruijn graph is obtained for a collection of pairwise alignments by "gluing" the aligned positions.
Although Figure 2F is illustrative of the alignment representation that we wish to obtain, it is not immediately clear how to obtain such a representation. The major challenge is the determination of the regions of similarity that should be "glued together" in the graph to represent the protein domains (boxes in Fig. 2). One cannot use a stringent criterion for similarity, such as exact
The A-Bruijn and ABA graphs
These short cycles occlude the recognition of domains in protein alignments or conserved segments in genomic alignments. Thus, we want to "clean" the A-Bruijn graph by removing these short cycles while retaining the large cycles that indicate multidomain organization. To accomplish this task, we simplify the A-Bruijn graph by solving the maximum subgraph with large girth (MSLG) problem as described in Pevzner et al. (2004
Case study: Proteins with SH2, SH3, and Pkinase domains
The ABA graph reveals the shared domains as edges of multiplicity two (Fig. 5B). Furthermore, the ABA graph reflects the shuffled domain structure as a cycle consisting of two edges of multiplicity twocorresponding to the shared domains and two edges of multiplicity onecorresponding to the unique interdomain regions in each sequence. This cyclic structure cannot be represented as a rowcolumn alignment or as a POA graph.
As a second example, we present an alignment of four human proteins: MATK, ABL1, GRB2, and CRKL. Lee et al. (2002
To obtain the ABA graph, we identify pairwise local alignments between the four protein sequences by using the BLAST program with BLOSUM80 matrix. Hits with minimal length of 40 and at least 40% conserved (as defined by BLAST) are input to the ABA algorithm. The resulting ABA graph (Fig. 6C) clearly shows the domain structures as edges with high multiplicity. In the ABA graph, the edge (2 1) of multiplicity four corresponds to the SH2 domain shared by all four sequences. The edge (1 2) of multiplicity five corresponds to the five SH3 domains in four sequences. Notably, the two SH3 domains in GRB2 are glued together on this edge. As a result, the path through the graph corresponding to the GRB2 sequence contains a cycle signifying duplication of the SH3 domain. Also note that there is a second SH3 domain at the C-terminal end of CRKL that is not glued by ABA to the other SH3 domains. The reason for the isolation of this SH3 domain is that it is sufficiently diverged from the other SH3 domains so that there are no significant pairwise local alignments (satisfying our criteria above) between this SH3 domain and the other sequences detected by BLAST.
Case study: Proteins with GAF, Response_reg, GGDEF, and EAL domains
To construct the domain shuffling network, we select a subset of Pfam domains (7316 domains, as at February 2004) according to the following criteria: (1) is >50 amino acids (aa); (2) is > 21% conserved; and (3) contained in at least 500 proteins. A total of 119 domains satisfy these criteria, and 47 of these appear in different orders in the Pfam annotation of some SWISS-PROT proteins. There are a total of 56 edges representing shuffles between these 47 domains. The network has 10 connected components. The largest connected component of the domain shuffling network (Fig. 7) prominently displays the protein kinase domain (Pkinase) as the highest degree vertex. This reflects the fact that domain shuffling is a common feature in the kinase family. However, the domain shuffling network reveals that shuffling of domains is not restricted to kinases. We now describe an example of domain shuffling and duplication outside the protein kinase family. We analyzed the four proteins Q82U13, ETR1_ARATH, PHY2_SYNY3, and Q7MD98 from SWISS-PROT, each containing some but not all of the Pfam domains GAF, Response_reg, GGDEF and EAL (Fig. 8). The BLAST program with the BLOSUM80 matrix gives eight significant pairwise local alignments between these sequences that satisfy the constraints that alignment length is >40 aa and is >40% conserved. We inputed these alignments into the ABA program, and obtained the graph shown in Figure 8.
Edges of high multiplicity (or a chain of high multiplicity edges) in the ABA graph corresponds to domains shared by the sequences. Table 1 shows four edges (chain of edges), each representing a significant local multiple alignment. We emphasize that the correspondence between these edges and known protein domains is approximate, because the edges result directly from significant alignments from BLAST. We can extract the subsequences corresponding to high multiplicity edges, and refine the multiple alignment using an existing tool like CLUSTALW. We remark that ABA eliminates a time-consuming, manual clipping procedure.
Domain shuffling creates directed cycles in the ABA graph. In this example there are two domain shuffles: Response_reg versus GAF, and EAL versus GAF/GGDEF represented by two directed cycles (2 0 1 2 and 2 3 4 5 7 2) in Figure 8. The different domain orders in individual sequences are reflected by the different paths traversing the ABA graph that visit edges in different orders.
When we align these four sequences by using POA (Lee et al. 2002
Case study: Proteins with condensation, AMP-binding, and PP-binding domains
We obtain the ABA graph shown in Figure 9 by using the same BLAST parameters as the previous case studies. Most of the long edges in the graph correspond to the long domains: condensation and AMP-binding. These domains are typically not well conserved over their full length, and ABA reveals the well conserved parts as high multiplicity edges (e.g., A
Genomic sequences ABA is also applicable to the alignment of genomic sequences, and the ABA graph directly reveals duplications and inversions that are often found in alignments of long mammalian genomic sequences. The input to ABA is a set of t DNA sequences (with the t reverse complements) and the pairwise local alignments between the 2t sequences. The resulting ABA graph is a collection of 2t pathscorresponding to the t input sequences and their t reverse complementsthat are glued together according to the local alignments. A duplication in a single sequence corresponds to a directed cycle in the path corresponding to this sequence, whereas an inversion corresponds to a gluing of the direct strand of one sequence to the reverse strand of another sequence.
We apply ABA to a pair of plant chloroplast genomes, Arabidopsis thaliana and Oenothera elata, and produce the graph in Figure 10A. We compare our results to the alignment obtained by the Threaded Blockset Aligner (TBA) of Blanchette et al. (2004
We note that the current implementation of TBA produces a limited type of threaded blockset, namely, TBA "does not accommodate inversions2 and duplications, and it is restricted to finding matches that occur in the same order and orientation in the given sequences". ABA has no such restrictions. Indeed, in the ABA graph of the chloroplast genomes, block 7 appears twice along the path corresponding to the Arabidopsis genome, once in the direct strand and once in the reverse strand. (TBA now can handle reverse-strand matches and inversions [W. Miller, pers. comm.]).
To further our comparison of the blocks extracted from the long edges of the ABA graph with the blocks produced by TBA, we examined the human, mouse, and rat sequences from the NISC target region T1. The complete set of sequences from 12 species was first analyzed in Thomas et al. (2003
ABA and TBA use different algorithmic approaches to blockset generation (discussed below), yet most of the blocks produced by TBA and ABA have significant overlaps. However, we observe three differences. First, ABA generates blocks of multiplicity higher than three (dark gray blocks in Fig. 11), demonstrating the ability of ABA to handle duplications and inversions. Second, TBA detects a few blocks that are missed by ABA: These blocks represent short three-way alignments. ABA misses these short alignments because it uses only pairwise alignments, whereas TBA implements a progressive multiple alignment engine (MULTIZ). Third, ABA generally produces longer blocks (or concatenations of blocks). The above results demonstrate that (1) ABA is able to automatically generate threaded blocksets for genomic sequence alignment, and (2) ABA handles duplications and matches of sequences that are in different orders in different genomes. A more detailed comparison of the two approaches and the possibility of synergistic combinations of both approaches are important questions for future study.
The important feature of ABA is the ability to produce multiple alignments of sequences that include shuffled and repeated regions, a feature lacking in other alignment methods. We now compare ABA with other approaches to MSA, and describe further applications and extensions of ABA.
Alignment representation
Lee et al. (2002
Zhang and Waterman (2003
The Zhang and Waterman (2003
Very recently, Blanchette et al. (2004 The algorithmic approach of TBA is very different from ABA. TBA progressively aligns input sequences along a phylogenetic tree from leaves to the root. The blocks in the blockset at a parent node result from intersections or exclusive-ORs of the blocks at its children. The blocks are only split into smaller blocks during the progressive steps of TBA. There is no mechanism to merge blocksTBA follows the maxim "Once a block boundary, always a block boundary." In contrast, ABA permits the merging and simplification of very small blocks. Every blockset represented by TBA is a somewhat simplified linear view of a multiple alignment. In reality, some alignments within a blockset may extend over several blocks, whereas other alignments may be significantly shorter. In a sense, the individual blocks in a blockset have the same limitations as the rowcolumn alignment, in comparison to a DAG alignment that was discussed in the introduction. ABA has a more flexible approach to defining the block boundaries that are expressed as "tangles" in the ABA graph.
Applications and extensions
The ability of the ABA graph to succinctly represent proteins with shuffled and repeated domains makes it useful for de novo domain finding and studies of domain structure. Galperin and Koonin (1998 Further refinements in the ABA algorithm will be required to extend its application. One possible improvement to ABA is the implementation of an iterative refinement procedure: After we construct the initial ABA graph using pairwise similarities, we identify the important edges and refine the alignment at each important edge using more accurate alignment procedures. Finally, we rethread individual sequences through the important edges; if the topology of the graph changes, the procedure is repeated.
We describe the construction of the A-Bruijn and ABA graphs that we use to represent a multiple alignment. Our presentation follows that in Pevzner et al. (2004 t.
Let
The main obstacle to combining these local similarities into an alignment is finding consistent sets of similarities. We now form a multigraph G = (V,E) with vertices that are the connected components of the A-graph. Accordingly, let V denote the connected components of the A graph, and for si in S, let vi denote the connected component containing si. The set E of edges of G are (vi
In the case of inconsistent and "weak" alignments the resulting A-Bruijn graph is often very complex and contains many short cycles. We distinguish directed short cycles, whirls, and undirected cycles, bulges. Whirls result from inconsistencies in pairwise alignments. Often these inconsistencies can be removed by moving gaps or removing matches in the alignment. Bulges result from gaps in pairwise alignments. Bulges and whirls can form complex structures of overlapping bulges/whirls, complicating their removal. We solve the MSLG problem, using the method described in Pevzner et al. (2004 After removing bulges and whirls, the resulting graph may still contain many short edges, due to ambiguities at the ends of aligned regions. These edges add unnecessary complexity to the A-Bruijn graph. To reveal aligned regions, we are interested in important edges: edges of high multiplicity and edges with length greater than some threshold. Therefore, we apply a two-step rethreading procedure: (1) remove the unimportant edges, and (2) thread each sequence Si through the remaining important edges.
Finally, for visual display purposes, we apply a short edge removal heuristic that simply collapses any connected component of short edges in the graph into a single super-vertex, represented by boxes in the figures. We use the Graphviz package (Gansner and North 1999 For short sequences (e.g., protein sequences) the running time of ABA is negligible compared to the time taken in computing all local pairwise alignments that form the input to ABA. For longer sequences (e.g., megabase-sized genomic sequence), the major constraint is memory. The humanmouserat sequences considered above required 2 h of processing time and three gigabytes of memory on an Alpha ES40 workstation. Improvements in memory usage will be necessary for scaling ABA to larger genomic regions. We are currently implementing a version of ABA with reduced memory requirements.
We are greatly indebted to Nick Grishin, Eugene Koonin, Webb Miller, and Yuri Wolf for critical readings of the manuscript. We also thank Neil Jones and Michele Day for helpful discussions. This work is supported by NHGRI grant 1 R01 HG02366. B.R. is supported by a fellowship from the Alfred P. Sloan Foundation.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2657504.
1 Corresponding author. [Supplemental material is available online at www.genome.org.]
2 To avoid the situation when two sources or two sinks are glued into a single vertex (i.e., when the ends of two different sequences are aligned in one alignment in
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., et al. 2004. The Pfam protein families database. Nucleic Acids Res. 32: D138-D141.
Blanchette, M., Kent, J.W., Riemer, C., Elnitski, L., Smit, A.F., Roskin, K.M., Baertsch, R., Rosenbloom, K., Clawson, H., Green, E.D., et al. 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14: 708-715. Böcker, S. 2003. Sequencing from compomers: Using mass spectrometry for DNA de-novo sequencing of 200+ nt. In: Third workshop on algorithms in bioinformatics, Vol. 2812 of Lecture Notes in Computer Science, pp. 476-497. Springer, New York.
Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M., Estreicher, A., Gasteiger, E., Martin, M., Michoud, K., O'Donovan, C., Phan, I., et al. 2003. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids. Res. 31: 365-370. Cormode, G., Paterson, M., Sahinalp, S., and Vishkin, U. 2000. Communication complexity of document exchange. In: Proceedings of the 11th annual ACM-SIAM symposium on discrete algorithms, pp. 197-206. Society for Industrial and Applied Mathematics, Philadelphia.
Darling, A.C., Mau, B., Blattner, F.R., and Perna, N.T. 2004. Mauve: Multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 14: 1394-1403. Doolittle, R.F. 1995. The multiplicity of domains in proteins. Annu. Rev. Biochem. 64: 287-314.[CrossRef][Medline] Eddy, S.R. 1998. Multiple-alignment and sequence searches. Trends Guide to Bioinformatics, pp. 15-18. Elsevier Science, Amsterdam. Ergün, F., Muthukrishnan, S., and Sahinalp, S.C. 2003. Comparing sequences with segment rearrangements. FST TCS 2003: Foundations of software technology and theoretical computer science, Vol. 2914 of Lecture Notes in Computer Science, pp. 183-194. Springer, New York. Feng, D.F. and Doolittle, R.F. 1987. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol. 25: 351-360.[Medline] Galperin, M.Y. and Koonin, E.V. 1998. Sources of systematic error in functional annotation of genomes: Domain rearrangement, non-orthologous gene displacement and operon disruption. In Silico Biol. 1: 55-67.[Medline] Gansner, E.R. and North, S.C. 1999. An open graph visualization system and its applications to software engineering. Software-Practice and Experience. http://www.research.att.com/sw/tools/graphviz/.
Grasso, C. and Lee, C. 2004. Combining partial order alignment and progressive multiple sequence alignment increases alignment speed and scalability to very large alignment problems. Bioinformatics 20: 1546-1556. Heber, S., Alekseyev, M., Sze, S.H., Tang, H., and Pevzner, P.A. 2002. Splicing graphs and EST assembly problem. Bioinformatics 18(Suppl 1): S181-S188.[Abstract] Higgins, D.G. and Sharp, P.M. 1988. CLUSTAL: A package for performing multiple sequence alignment on a microcomputer. Gene 73: 237-244.[CrossRef][Medline] Idury, R.M. and Waterman, M.S. 1995. A new algorithm for DNA sequence assembly. J. Comput. Biol. 2: 291-306.[Medline] Kececioglu, J. 1993. The maximum weight trace problem in multiple sequence alignment. Combinatorial pattern matching (Padova, 1993), Vol. 684 of Lecture Notes in Computer Science, pp. 106-119. Springer, New York.
Kent, J., Sugnet, C., Furey, T., Roskin, K., Pringle, T., Zahler, A.M., and Haussler, D. 2002. The Human Genome Browser at UCSC. Genome Res. 12: 996-1006.
Lee, C., Grasso, C., and Sharlow, M.F. 2002. Multiple sequence alignment using partial order graphs. Bioinformatics 18: 452-464.
Li, X. and Waterman, M.S. 2003. Estimating the repeat structure and length of DNA sequences using
Lipman, D.J., Altschul, S.F., and Kececioglu, J.D. 1989. A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. 86: 4412-4415.
Marchler-Bauer, A., Anderson, J.B., DeWeese-Scott, C., Fedorova, N.D., Geer, L.Y., He, S., Hurwitz, D.I., Jackson, J.D., Jacobs, A.R., Lanczycki, C.J., et al. 2003. CDD: A curated Entrez database of conserved domain alignments. Nucleic Acids Res. 31: 383-387.
Morgenstern, B., Dress, A., and Werner, T. 1996. Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc. Natl. Acad. Sci. 93: 12098-12103.
Morgenstern, B., Frech, K., Dress, A., and Werner, T. 1998. DIALIGN: Finding local similarities by multiple sequence alignment. Bioinformatics 14: 290-294. Myers, E.W. 1996. Approximate matching of network expressions with spacers. J. Comput. Biol. 3: 33-51.[Medline]
Neuwald, A.F., Liu, J.S., Lipman, D.J., and Lawrence, C.E. 1997. Extracting protein alignment models from the sequence database. Nucl. Acids Res. 25: 1665-1677. Notredame, C. 2002. Recent progress in multiple sequence alignment: A survey. Pharmacogenomics 3: 131-144.[CrossRef][Medline] Notredame, C., Higgins, D.G., and Heringa, J. 2000. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302: 205-217.[CrossRef][Medline]
Pe'er, I., Arbili, N., and Shamir, R. 2002. A computational method for resequencing long DNA targets by universal oligonucleotide arrays. Proc. Natl. Acad. Sci. 99: 15492-15496.
Pei, J., Sadreyev, R., and Grishin, N. 2003. PCMA: Fast and accurate multiple sequence alignment based on profile consistency. Bioinformatics 19: 427-428.
Pevzner, P.A. 1989.
Pevzner, P.A., Tang, H., and Waterman, M.S. 2001. An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. 98: 9748-9753. Pevzner, P.A., Tang, H., and Tesler, G. 2004. De novo repeat classification and fragment assembly. In: Proceedings of the fifth ACM conference on computational molecular biology (RECOMB), pp. 213-222. ACM Press, New York. Sammeth, M., Morgenstern, B., and Stoye, J. 2003. Divide-and-conquer multiple alignment with segment-based constraints. Bioinformatics 19(Suppl 2): II189-II195. Sankoff, D. 1975. Minimal mutation trees of sequences. SIAM J. Appl. Math. 28: 35-42. Sankoff, D. and Kruskal, J.B., eds. 1983. Time warps, string edits, and macromolecules: The theory and practice of sequence comparison. Addison-Wesley, Boston. Schuler, G., Altschul, S., and Lipman, D. 1991. A workbench for multiple alignment construction and analysis. Proteins 9: 180-190.[CrossRef][Medline]
Schwartz, S., Elnitski, L., Li, M., Weirauch, M., Riemer, C., Smit, A., Green, E.D., Hardison, R.C., and Miller, W. 2003a. MultiPipMaker and supporting tools: Alignments and analysis of multiple genomic DNA sequences. Nucleic Acids Res. 31: 3518-3524. Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D., and Miller, W. 2003b. Humanmouse alignments with BLASTZ. Genome Res. 13: 103-107. Shamir, R. and Tsur, D. 2002. Large scale sequencing by hybridization. J. Comput. Biol. 9: 413-428.[CrossRef][Medline] Thomas, J.W., Touchman, J.W., Blakesley, R.W., Bouffard, G.G., Beckstrom-Sternberg, S.M., Margulies, E.H., Blanchette, M., Siepel, A.C., Thomas, P.J., McDowell, J.C., et al. 2003. Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424: 788-793.[CrossRef][Medline]
Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. CLUSTAL W: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673-4680. Vingron, M. and Argos, P. 1991. Motif recognition and alignment for many sequences by comparison of dot-matrices. J. Mol. Biol. 218: 33-43.[CrossRef][Medline] Wang, L. and Jiang, T. 1994. On the complexity of multiple sequence alignment. J. Comput. Biol. 1: 337-348.[Medline] Waterman, M.S., Smith, T.F., and Beyer, W.A. 1976. Some biological sequence metrics. Adv. Math. 20: 367-387.[CrossRef]
Wuchty, S. 2001. Scale-free behavior in protein domain networks. Mol. Biol. Evol. 18: 1694-1702.
Ye, Y. and Godzik, A. 2004. Comparative analysis of protein domain organization. Genome Res. 14: 343-353. Zhang, Y. and Waterman, M.S. 2003. An Eulerian path approach to global multiple alignment for DNA sequences. J. Comput. Biol. 10: 803-819.[CrossRef][Medline]
http://nbcr.sdsc.edu/euler/; EULER Project Homepage. http://www.cse.ucsd.edu/groups/bioinformatics/browser-tba-aba-human.bed; ABA and TBA blocks for humanmouserat sequences.
Received April 7, 2004; accepted in revised format August 16, 2004. |