|
|
|
|
Published online before print
September 18, 2006, 10.1101/gr.5383506 Genome Res. 16:1557-1565, 2006 ©2006 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/06 $5.00 OPEN ACCESS ARTICLE
Methods Reconstructing contiguous regions of an ancestral genome1 Center for Comparative Genomics and Bioinformatics, Penn State University, University Park, Pennsylvania 16802, USA; 2 Department of Mathematics, National University of Singapore, Singapore 117543; 3 Center for Biomolecular Science and Engineering, University of California Santa Cruz, Santa Cruz, California 95064, USA; 4 School of Computer Science, McGill University, Montreal, Quebec H3A 2B4, Canada
This article analyzes mammalian genome rearrangements at higher resolution than has been published to date. We identify 3171 intervals, covering 92% of the human genome, within which we find no rearrangements larger than 50 kilobases (kb) in the lineages leading to human, mouse, rat, and dog from their most recent common ancestor. Combining intervals that are adjacent in all contemporary species produces 1338 segments that may contain large insertions or deletions but that are free of chromosome fissions or fusions as well as inversions or translocations >50 kb in length. We describe a new method for predicting the ancestral order and orientation of those intervals from their observed adjacencies in modern species. We combine the results from this method with data from chromosome painting experiments to produce a map of an early mammalian genome that accounts for 96.8% of the available human genome sequence data. The precision is further increased by mapping inversions as small as 31 bp. Analysis of the predicted evolutionary breakpoints in the human lineage confirms certain published observations but disagrees with others. Although only a few mammalian genomes are currently sequenced to high precision, our theoretical analyses and computer simulations indicate that our results are reasonably accurate and that they will become highly accurate in the foreseeable future. Our methods were developed as part of a project to reconstruct the genome sequence of the last ancestor of human, dogs, and most other placental mammals.
Using computer simulations, we have shown (Blanchette et al. 2004
The regional correspondence between modern and ancestral chromosomes has been predicted with increasing accuracy by a number of groups using a variety of methods. Currently, the main experimental technique is chromosomal painting (for surveys, see Wienberg 2004 To predict large-scale relationships among modern and ancestral genomes with sufficient accuracy for our needs, we have devised new methods. Conserved genomic segments are identified directly from freely available data and analyzed by a new computer program, as described below. We also estimate the accuracy of our results, compare them with published analyses, and explore the biological properties of rearrangement sites. The computer software described herein and details of our predictions for human, mouse, rat, and dog are freely available at http://www.bx.psu.edu/miller_lab/.
Segmenting the genomes based on pair-wise alignments To predict segments of the ancestral genome, we start with "nets" (Kent et al. 2003
Our methods for constructing orthology blocks and conserved segments are illustrated in Figure 2, while later sections and the Supplemental material explain the construction of CARs. Figure 2A shows humanmouse nets. Four mouse intervals are depicted, as ordered and oriented by the orthologous human segments. The second and third mouse intervals are actually adjacent (and appropriately orientated) on a mouse chromosome, and the intervening bases, if any, do not align to human; this is depicted by a thin line connecting the representations of those intervals.
Figure 2B shows the humanmouse, humanrat, and humandog nets for a segment of the human sequence and illustrates the creation of orthology blocks. A dashed line between orthology blocks lies halfway between two intervals that are adjacent relative to human. For instance, the line at human position 2 in Figure 2B is midway between two mouse intervals. When the gap between two adjacent intervals in one species overlaps a gap relative to another species, as at position 1, we use the point halfway between the larger of the interval endpoints to the left and the smaller of the endpoints on the right. We discard all orthology blocks that cover <50 kb of human, because experiments showed that they tend to be unreliable (e.g., aligned segments are not always clearly orthologous). For this reason we describe our orthology blocks as having "50-kb resolution." Application of this process created 3171 genomic intervals, which include 92.36% of the available human genome sequence.
As illustrated in Figure 2C, we fuse runs of consecutive orthology blocks whenever the order and orientation of these blocks are conserved in each of the contemporary genomes. In terms of the convention used in Figure 2A, this means that for all nonhuman species, the boundary between the blocks is crossed either by a net or by a thin line. The results of the fusion process are independent of the order that fusions are performed. We call each resulting union of blocks a conserved segment. We found 1338 conserved segments, each containing an average of Figure 3 shows the length distributions of orthology blocks and conserved segments across the whole genome.
Predicting contiguous ancestral regions from modern adjacencies Adjacencies of genomic content (e.g., genes) have been used as a binary character to infer phylogeny in a parsimony framework (for a survey, see Savva et al. 2003
If the same segment is adjacent to the chosen segment in human, mouse, and rat but not in dog, we need more information to confidently predict the ancestral configuration, since there is a chance that the dog adjacency is ancestral and that the breakage occurred on the short branch from the humandog ancestor to the humanrodent ancestor (see Fig. 1). To help resolve these cases, we add outgroup information in the form of matches to opossum (build monDom4, Jan 2006, http://www.ncbi.nlm.nih.gov/; K. Linblad-Toh, pers. comm.) and chicken sequence (build galGal2, Feb 2004) (Hillier et al. 2004
We have generalized these observations to develop a computational procedure for predicting the order and orientation of conserved segments (and hence of orthology blocks) in the ancestor, based on observed adjacency relationships in the modern genomes. In broadest outline, the method is analogous to Fitchs parsimony method (Fitch 1971
In order to estimate the breakages on each lineage, we also reconstructed the intermediate ancestral genomes, i.e., the rodent ancestor and humanrodent ancestor. Boreoeutherian adjacencies were propagated to the intermediate ancestors. See the Methods section for the inference algorithm.
Identification of small inversions This method assigns 856 inversions to human, 3210 to mouse, 3067 to rat, and 4924 to dog. Among the 4924 inversions assigned to the dog lineage, only 703 were confirmed by an outgroup (e.g., opossum agreed with human); many of the remaining 4221 will be resolved by better outgroup data, e.g., from elephant. Figure 5 shows the length distributions of observed inversions assigned to each species. Among human inversions, 29 were assigned to the short branch leading from the Boreoeutherian ancestor to the humanrodent ancestor. The shortest inversions we found are 87 bp (in human), 31 bp (in mouse), 36 bp (in rat), and 34 bp (in dog). Figure 6 gives a detailed map of CAR 16, including small inversions. The detailed coordinates of these inversions and the tools for identifying them are freely available from http://www.bx.psu.edu/miller_lab/.
Properties of the breakpoints Of 1309 pairs of conserved intervals that were predicted to be adjacent in the Boreoeutherian ancestor, 149 (11%) were separated by events in at least two independent lineages (12 were separated in three lineages). When we omit rat (because of the potential assembly problems indicated in Fig. 1), we find breakpoint reuse for 57 of 742 (8%). The ratio of breakpoint reuse we found is lower than what was reported in Murphy et al. (2005) We inspected intervals around breakpoints in the human sequence, looking for properties that might help explain why breaks occur at some positions but not others. We used 50-kb intervals centered on the end of a conserved segment where the adjacent segments in the ancestor and human differ. Breaks that occurred only in the human lineage were treated separately from those that were reused in another species, giving 16.97 Mb of human-specific breakpoint regions and 11.96 Mb around reused breakpoints. For small inversions in human, we used 1-kb intervals centered on each endpoint, covering 1.60 Mb.
Our observations are summarized in Table 3. GC content around breakpoints is slightly higher than the genome average, but not as elevated as reported for dog chromosomal breaks by Webber and Ponting (2005)
Theoretical analysis We wanted to assess what proportion of the ancestral adjacencies inferred by our method might be correct. We started with a theoretical analysis, which necessitated some rather severe assumptions. Following the method of Sankoff and Blanchette (1999) with n elements that evolves through a series of rearrangements. An entry of can be thought of as one of the 2n numbers 1, 1, 2, 2, ., n, n, where the sign designates the elements orientation. No assumptions are made about the relative frequencies of inversions, translocations, fusions, and fissions. Instead, if f precedes g in the genome = .fg. at a particular time, then g may be changed to h (other than f, f or g), with each of the 2n 3 possibilities equally likely. The probabilities that the successor of f is changed in time t along a specific branch are pivotal parameters of the model. We estimated those parameters using data observed in real genomes. A surprisingly complex computation (see the Supplemental material) then estimated that 90.18% of the adjacencies we predict in the Boreoeutherian ancestor are correct. Mathematical tractability of this model is purchased at a high cost in realism. For instance, the assumption that all replacements are equally likely contradicts the observations that inversions are more frequent than are other rearrangements and that short inversions are more common than are long ones.
Simulations
For the species shown in Figure 1, we repeated the simulation 50 times, in each case running our program for inferring CARs on the resulting data set and comparing the predicted adjacencies with the known (simulated) ones. For determining the success rate, we considered only the ancestral joins that were broken in at least one lineage, since the unbroken joins will be found by essentially any procedure.
When using human, mouse, rat, and dog, with opossum and chicken as outgroups (Fig. 1), the frequency of correctly predicted adjacencies was 98.96% (SD = 0.39) for the Boreoeutherian ancestor, 98.37% (SD = 0.55) for the humanrodent ancestor, and 97.07% (SD = 1.01) for the mouserat ancestor. Note that as for inference of nucleotides (Blanchette et al. 2004 We also reconstructed the Boreoeutherian ancestor without using opossum and chicken; the accuracy decreased to 97.40% (SD = 0.69). If we retain the outgroups but leave out rat, the accuracy drops to 98.29% (SD = 0.67). However, if chimp, cow, and macaque are included in the reconstruction, the simulation indicates that joins in the Boreoeutherian ancestor are computed with 99.34% (SD = 0.29) accuracy.
Comparison with other reconstructions
The predictions of Murphy et al. (2005)
A number of additional mammals are already being sequenced "at low redundancy" for the purpose of identifying human regions under negative selection for substitutions (Margulies et al. 2005 Our computational methods need further refinement. In particular, we anticipate improvements in the handling of large duplications and deletions. Additional progress may be possible through the modeling of other evolutionary events, such as gene conversion or expansion/contraction of short tandem repeats caused by strand slippage. Moreover, advances are needed to facilitate simultaneous reconstruction of ancestral genomes at all internal nodes of the phylogenetic tree. An accurate conceptual model of large-scale evolutionary events will be critical for successful reconstruction of ancestral genomes. The discrepancy between our theoretical analysis of reconstruction accuracy and the results from simulation underline the need for more work in this area. A number of challenges remain before the genome sequences of mammalian ancestors can be accurately predicted at nucleotide resolution. However, we believe that this goal is an appropriate focus for our twin aims of understanding the evolutionary history of every position in the human genome and of providing an Internet resource that optimally organizes and presents the ever-expanding wealth of mammalian and vertebrate sequence data in the context of the ancestral sequence they have in common.
Inferring CARs Given information about adjacencies between conserved segments in each modern species, our goal is to infer segment order in the ancestral genome. To get a clean and precise statement of the problem, we formalize it using graph theory. We develop an algorithm that identifies a most parsimonious scenario for the history of each individual adjacency, although the whole-genome prediction is not guaranteed to optimize traditional measures like the number of breakpoints. We introduce weights to the graph edges to model the reliability of each adjacency.
By encoding each genomic element as a number, we view a chromosome as a signed permutation where the sign corresponds to the orientation (strand) of the encoded element. For example,
Recall that the Fitch algorithm infers the ancestral DNA sequence at the root of a phylogenetic tree from the DNA sequences at the leaves, site-by-site in two stages (Felsenstein 2004 To infer the ancestral predecessor of an element x, our algorithm does exactly the same thing as described above. In a bottom-up fashion, for each node u, we compute a set Pu(i) of the "predecessors" for each element i according to the following rule:
The above inference method outputs a "predecessor graph" at each internal node. In this graph there is an arc from each element in Pu(i) to element i. In general, such a predecessor graph may not be a union of vertex-disjoint paths. This is because two different series of evolutionary events may transform different genomes into the same one, and hence, we often do not have enough information to determine the true predecessor and successor relationships in the ancestor. However, the vast majority of the ancestral adjacencies are likely to be present as edges in the inferred graph. We treat outgroups in a consistent manner. We first infer the predecessor set PR(i) in the common ancestor R of all the species, including outgroups. Then we propagate PR(i) down the tree until we reach the target ancestor T. During the propagation process, if O and A are ancestor and descendant on one branch, respectively, for each element i, we adjust the inferred predecessor set PA(i) of i at the mammalian ancestor A as follows: If PO(i) and PA(i) share common elements, we just take them as the predecessor set of i at A; otherwise, PA(i) is unchanged. In our reconstruction, we first infer the predecessor set of humanchicken common ancestor. Then we used it to adjust the humanopossum common ancestor. And finally, we used human-opossum ancestor to adjust the humandog common ancestor. Similarly, we infer a "successor graph" at each internal node in the given phylogeny. In a successor graph, there is an arc from element i to each element in its inferred successor set S(i) at each internal node. The predecessor and successor graphs are the same at each leaf, but those at an internal node are generally different, although they typically have many common parts. We find the consistent parts by taking the intersection graph G of the predecessor and successor graphs. In the intersection graph, an element could still be involved in three kinds of ambiguities, as depicted in Figure 7. If none of these ambiguous cases is present, the graph itself forms the set of paths that covers all the elements and provides the reconstructed ancestral genome structure.
When ambiguity exists, we need to resolve it and choose appropriate directed edges to form CARs. We assign a weight to each of the directed edges in the intersection graph using the following approach. If the edge (i, j) is neither one of the incoming edges of case (a) nor one of the outgoing edges of case (b), we set w (i, j) = 1. Otherwise, the weight w (i, j) is recursively determined from the leaves by
. Here, wL(i, j) and wR(i, j) are the edge weights to left child and right child. Therefore, in the graph, all the edge weights are between 0 and 1. On a leaf genome, if (i, j) is present in the predecessor graph, we set w(i, j) = 1; otherwise w(i, j) = 1. This weight can be determined by a post-order traversal. Note that if an edge (i, j) is involved in ambiguous case (a) or (b), then w(i, j) < 1. The greater the weight of an edge (i, j), the more likely it is that i and j should be joined. The underlying assumption is that rearrangement is more likely to happen on longer branches. Our purpose is to have a set of paths that cover all the nodes in the directed graph and at the same time maximize the total edge weights in the paths (we allow degenerate paths where there is only one node in the path). We propose a greedy heuristic approach. We first sort all the edges by weight and start to add edges to the vertex-disjoint paths representing the CARs. When an edge (i, j) is being added, we make sure that it will not cause ambiguous cases. If it will, we discard that edge and choose the next available one. The process is performed until no edge can be added. In the graph resulting from this step, all the edges that are not involved in ambiguous cases (a) and (b) will be retained. For ambiguous case (c) where a cycle is formed, we can discard any one edge to break the cycle. In fact, we can easily prove that if there is a cycle, the weight of each edge in that cycle is 1. When we are adding elements into an existing path, particular care is needed to avoid putting j and j in the same CAR. In addition, we add both (i, j) and its symmetric version, (j, i). For each path found by this approach, a symmetric path in the opposite orientation is also found, since we have nodes for both i and i. The two paths correspond to the same CAR, and we choose one of them. A detailed example of the algorithm can be found in the Supplemental materials.
We thank the Broad Institute for making the opossum assembly publicly available, Francesca Chiaromonte for several helpful suggestions, Elliott H. Margulies for providing the phylogenetic tree used in ENCODE project, and Guillaume Bourque for the MGR program. We especially would like to thank the anonymous reviewers for their comments on the initial draft of this paper. J.M., R.C.B., and W.M. were supported by NIH grant HG02238. L.X.Z. was supported by ARF grant 146-000-068-112. W.J.K, B.J.R, and D.H. were supported by NHGRI grant 1P41HG02371 and NCI contract 22XS013A, and D.H. was supported additionally by the Howard Hughes Medical Institute.
5 Present address: Center for Biomolecular Science and Engineering, University of California, Santa Cruz, California 95064, USA.
E-mail jianma{at}bx.psu.edu; fax (814) 863-6699. [Supplemental material is available online at www.genome.org and http://www.bx.psu.edu/miller_lab/.] Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.5383506
Bailey, J.A., Yavor, A.M., Massa, H.F., Trask, B.J., and Eichler, E.E. 2001. Segmental duplications: Organization and impact within the current human genome project assembly. Genome Res. 11: 10051017. Blanchette, M., Green, E.D., Miller, W., and Haussler, D. 2004. Reconstructing large regions of an ancestral mammalian genome in silico. Genome Res. 14: 24122423. Bourque, G. and Pevzner, P.A. 2002. Genome-scale evolution: Reconstructing gene orders in the ancestral species. Genome Res. 12: 2636. Bourque, G., Tesler, G., and Pevzner, P.A. 2006. The convergence of cytogenetics and rearrangement-based models for ancestral genome reconstruction. Genome Res. 16: 311313. Felsenstein, J. 2004. Inferring phylogenies. Sinauer, Sunderland, MA. Fitch, W.M. 1971. Toward defining the course of evolution: Minimum change for a specific tree topology. Syst. Zool. 20: 406416.[CrossRef] Froenicke, L., Caldes, M.G., Graphodatsky, A., Muller, S., Lyons, L.A., Robinson, T.J., Volleth, M., Yang, F., and Wienberg, J. 2006. Are molecular cytogenetics and bioinformatics suggesting diverging models of ancestral mammalian genomes? Genome Res. 16: 306310. Hillier, L.W., Miller, W., Birney, E., Warren, W., Hardison, R.C., Ponting, C.P., Bork, P., Burt, D.W., Groenen, M.A., and Delany, M.E., et al. 2004. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432: 695716.[CrossRef][Medline] . Human Genome Sequencing Consortium 2001. Initial sequencing and analysis of the human genome. Nature 409: 860921.[CrossRef][Medline] Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., and Haussler, D. 2002. The human genome browser at UCSC. Genome Res. 12: 9961006. Kent, W.J., Baertsch, R., Hinrichs, A., Miller, W., and Haussler, D. 2003. Evolutions cauldron: Duplication, deletion, and rearrangement in the mouse and human genomes. Proc. Natl. Acad. Sci. 100: 1148411489. Lindblad-Toh, K., Wade, C.M., Mikkelsen, T.S., Karlsson, E.K., Jaffe, D.B., Kamal, M., Clamp, M., Chang, J.L., Kulbokas III, E.J., and Zody, M.C., et al. 2005. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature 438: 803819.[CrossRef][Medline] Margulies, E.H., Vinson, J.P., Miller, W., Jaffe, D.B., Lindblad-Toh, K., Chang, J.L., Green, E.D., Lander, E.S., Mullikin, J.C., and Clamp, M. 2005. An initial strategy for the systematic identification of functional elements in the human genome by low-redundancy comparative sequencing. Proc. Natl. Acad. Sci. 102: 47954800. . Mouse Genome Sequencing Consortium 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520562.[CrossRef][Medline] Murphy, W.J., Larkin, D.M., Everts-van der Wind, A., Bourque, G., Tesler, G., Auvil, L., Beever, J.E., Chowdhary, B.P., Galibert, F., and Gatzke, L., et al. 2005. Dynamics of mammalian chromosome evolution inferred from multispecies comparative maps. Science 309: 613617. Peng, Q., Pevzner, P.A., and Tesler, G. 2006. The fragile breakage versus random breakage models of chromosome evolution. PLoS Comp. Biol. 2: e14.[CrossRef] . Rat Genome Sequencing Consortium 2004. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428: 493521.[CrossRef][Medline] Sankoff, D. 2006. The signal in the genomes. PLoS Comput. Biol. 2: e35.[CrossRef][Medline] Sankoff, D. and Blanchette, M. 1999. Probability models for genome rearrangement and linear invariants for phylogenetic inference. In RECOMB 99: Proceedings of the third annual international conference on computational molecular biology, pp. 302309. ACM Press, New York. Savva, G., Dicks, J., and Roberts, I.N. 2003. Current approaches to whole genome phylogenetic analysis. Brief. Bioinform. 4: 6374. Webber, C. and Ponting, C.P. 2005. Hotspots of mutation and breakage in dog and human chromosomes. Genome Res. 15: 17871797. Wienberg, J. 2004. The evolution of eutherian chromosomes. Curr. Opin. Genet. Dev. 14: 657666.[CrossRef][Medline]
Received April 7, 2006; accepted in revised format June 22, 2006. This article has been cited by other articles:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||