|
|
|
|
Genome Res. 14:574-579, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00 Letter Identification of Evolutionary Hotspots in the Rodent GenomesDepartment of Mathematics, University of California, Berkeley, California 94720-3840, USA
We describe a whole-genome comparative analysis of the human, mouse, and rat genomes to describe the average substitution patterns of four genomic regions: ancient repeats, rodent-specific DNA, exons, and conserved (coding and noncoding) regions, and to identify rodent evolutionary hotspots. In all types of regions, except the rodent-specific DNA, the rat branch is slightly longer than the mouse branch. Moreover, the mouserat distance is longer in the rodent-specific DNA than in the ancient repeats. Analysis of individual conserved regions with different substitution models yielded the conclusion that the JukesCantor model is inadequate, and the HasegawaKishinoYano model is almost as good as the REV model. Using human as an outgroup, we identified 5055 evolutionary hotspots, which are highly conserved subalignment blocks (each consisting of at least 100 aligned sites and a small fraction of gaps) with a large and statistically significant difference in the branch lengths of the rodent species. The cutoffs used to identify the hotspots are partially based on estimates of the average rates of substitution. The fractions of hotspots overlapping with the rodent RefSeq genes, RefSeq exons, and ESTs are all higher than expected. Still, more than half of the hotspots lie in noncoding regions of the mouse genome. We believe that the hotspots represent biologically interesting regions in the rodent genomes.
The sequencing of the rat genome makes possible, for the first time, a whole-genome comparative analysis of three large mammalian genomes. Despite the exciting prospects of such an analysis, existing methods on a whole (mammalian) genome scale are scarce. Some examples include methods based on gene order comparison rather than sequence comparison (Moret et al. 2001
Three species comparison of genomic regions was first undertaken by Lee et al. (1998
We performed a comparative analysis on a whole-genome multiple alignment of human, mouse, and rat DNA sequences, to describe the average substitution patterns of four types of DNA: ancient repeats, rodent-specific DNA, exons, and conserved regions (coding and noncoding), and to identify rodent evolutionary hotspots, which are well-conserved regions where the rodent branch lengths are very different. The data were obtained by extracting appropriate gapped subalignments, called blocks, from the whole-genome alignment. The blocks were aggregated to estimate the average substitution rates and branch lengths of the unrooted tree relating the three species (Fig. 1) for each type of DNA, by maximum likelihood on the REV substitution model (Tavaré 1986
The ancient repeats (Waterston et al. 2002 An evolutionary hotspot is a conserved region with the property that the human branch is shorter than 0.25 substitutions per site, the absolute difference between the rodent branch lengths is more than twice the asymptotic standard error, and the ratio of the mouse branch to the rat branch is at least 10 or at most 0.1. These are well-conserved regions, likely to contain functional elements, where the rodents have evolved at very different rates.
Average Substitution Patterns Table 1 displays some statistics and the branch lengths for four types of regions: ancient repeats, rodent-specific DNA, exons, and conserved regions; the substitution patterns were aggregated in these analyses. Because the rodent-specific blocks are really pairwise alignments, the distance between the rodents, but not their respective distances to the rodent ancestor, may be estimated via the REV model. For all other regions, the mouse branch is slightly shorter than the rat branch, the ratios ranging from 0.91 to 0.95, indicating that mouse generally evolved slightly more slowly, relative to rat. The fractions of blocks with the rat branch longer than the mouse branch were 54% for long ancient repeats (>100 aligned sites), 52% for long exons (>100 aligned sites), and 55% for conserved regions. The ratio of the human branch to the mouse branch is 5. These are consistent with observations by others (Cooper et al. 2004
The estimated REV rate matrices are shown in Table 2. All rate matrices are not of the HKY type, R(C, A) being typically larger than R(T, A); the exon rate matrix is the closest to HKY. However, they are very close to being strand-symmetric, that is, the rates are invariant under complementation. For example, Q(A, C) Q(T, G). The AR and Rodent rates are remarkably similar, and they are both similar to the rates for conserved regions. Finally, all the rate matrices are more similar to each other than to those corresponding to the 4D sites and ancient repeats in Hardison et al. (2003
Sensitivity of Branch Length Estimates to Substitution Model The average substitution patterns were studied using aggregated blocks, in which blocks of any size were effectively glued together to form a large block. On the other hand, to identify interesting regions for further study, it is necessary to apply the estimation procedure to individual blocks, which were required to have at least 100 aligned sites so that the estimates were not too variable. We compare the branch lengths of 646,741 conserved regions estimated via the Jukes-Cantor (JC), Hasegawa-Kishino-Yano (HKY), and general reversible (REV) models by maximum likelihood. Between JC and REV, the fractions of blocks for which the difference is <0.01 is 85% (rodents) and 27% (human). The corresponding fractions for comparing HKY and REV are 94% (rodents) and 79% (human). Thus, HKY is significantly more accurate than JC. Generally, the REV estimates are larger than the HKY estimates, which are, in turn, larger than the JC estimates. Also, the difference between REV and HKY (also REV and JC) tends to decrease modestly as the branch length (with the REV estimate as a proxy) decreases (see Fig. 2), which is consistent with the fact that the variance is larger for a longer branch. Interestingly, for very short branches, the HKY and JC estimates tend to be larger than REV. This phenomenon may partially account for the fact that the number of hotspots increases as the substitution model becomes more accurate: 3672 (JC), 4522 (HKY), and 5055 (REV). We conclude that JC should be avoided, and although HKY may be adequate for branch length estimation, the REV model is better and thus we used the latter.
Evolutionary Hotspots We found 5055 evolutionary hotspots among the conserved regions. The average and standard deviation (SD) of the number of aligned sites are 190 and 86, respectively (histogram in Fig. 3). If the conserved regions were not filtered by the requirement that the three pairwise similarities were at least 60%, then the number of hotspots found was 5086. Thus, the effect of the filter is rather small for identification of hotspots.
Small fractions of the hotspots have some overlap with ancient repeats (mouse 6%, rat 5%) and simple and low-complexity repeats (mouse 9%, rat 8%). We also computed the fractions that have some overlap with RefSeq genes, RefSeq exons, and ESTs. Treating hotspots as points on the genome, if they were scattered randomly, then we would expect the fraction that landed on, say, the ESTs, to be roughly the fraction of ESTs in the genome. The observed and expected fractions are reported in Table 3. The hotspots are overrepresented in the three regions by factors ranging from 2.0 (mouse ESTs) to 6.7 (rat exons). This observation still holds for each individual chromosome. Because the mouse Ref-Seq database is more complete, we infer that about 37% of the hotspots lie totally in the noncoding portion of mouse genes. Assuming that the mouse ESTs cover all RefSeq genes, 27% are intergenic in the mouse genome. Thus, 64%, or more than half, of the hotspots are probably functional noncoding sequences in the mouse genome.
The evolutionary hotspots are available for downloading at http://baboon.math.berkeley.edu/hotrodent/
The sequencing of the rat genome has provided us, for the first time, the opportunity to compare and contrast closely related vertebrate genomes. We have specifically used the human genome as an outgroup to the rodent genomes to identify evolutionary hotspots; it is important to note that this analysis would not have been possible without the complete sequence for all three genomes. Although the phylogenetic tree for the human, mouse, and rat is rather simple, the estimation of the branch lengths is not, and, as we have shown, several interesting results emerge from a detailed analysis of the branch length estimates and their sensitivity to parameters.
One of the clearcut results is that the rat branch length is slightly longer than the mouse branch. This finding is true not only on average, but also locally across the genomes. Furthermore, this observation is confirmed by independent analyses on different alignments (Cooper et al. 2004 We showed that the relatively simple HKY model worked almost as well as the general REV model, and much better than the simplest JC model. This confirms the well-known observation that base composition and substitution bias should be taken into account in branch length estimation. Furthermore, given that the HKY model is simpler than the REV model, our finding suggests that using HKY for branch length estimation can be a workable compromise between accuracy and speed when analyzing large data sets. The close distance of rat to its common ancestor with mouse means that the total predictive power of the mouse and rat genomes for identifying conserved regions in the human genome is not that much greater than using the mouse alone. Nevertheless, as we have pointed out, treating human as an outgroup to two similar genomes (mouse and rat), allows for the targeted identification of regions in the rodent genomes that are evolving in unexpected ways. We believe that the 5055 blocks we have identified represent a conservative estimate of the number of such regions. Because we only selected hotspots with at least 100 aligned sites, we necessarily miss the shorter ones. Perhaps one way to identify them is by sliding a window along a conserved region to detect very different rodent branch lengths. It is important to note that the cutoffs used in our methodology are motivated partly by analyses of the average substitution patterns in the human, mouse, and rat lineages. Although the estimates depend on the thresholds used, and also on the alignment, systemic patterns do emerge from independent analyses, and the universal observation that the rat lineage is evolving faster than the mouse is confirmation of that. Thus, we believe that our identified hotspots do represent biologically interesting regions, and are not merely artifacts of selected parameters and heuristic cutoffs.
The natural generalization of our study is to identify regions in a multiple alignment where the inferred tree differs substantially from the consensus tree for the genome. Recent work by Billera et al. (2001
Alignment A multiple alignment of the human, mouse, and rat genomes was generated by first constructing a three-way homology map between the genomes and then aligning the homologous regions with MAVID (Bray and Pachter 2003
Ancient Repeats
Exons
Rodent-Specific DNA and Conserved Regions
Evolutionary Hotspots
Estimation To obtain the average substitution rates and branch lengths in Tables 1 and 2, we aggregated the blocks over the whole-genome alignment. We also applied the estimation procedure on individual blocks for some blocks in the ancient repeats and exons (Fig. 4) and for all conserved regions. Because aggregation is equivalent to gluing alignments into a long alignment, it suffices to explain the estimation procedure on a single block. Let h, m, and r be the respective sequences from human, mouse, and rat. Assuming, as usual, that there was a common ancestor to human, mouse, and rat, which then split into a human lineage and a rodent lineage, which, in turn, split into mouse and rat, we then have a rooted phylogenetic tree relating the sequences as depicted in Figure 1. The branch lengths t1, t2, t3, and t4 represent the evolutionary distances between the appropriate sequences, measured in the number of evolutionary events that occurred, averaged over all possible substitution paths. Because the rate of evolution is likely to have varied across lineages, there is no general correspondence between the evolutionary distances and the chronological time intervals.
The REV rate matrix Q can be represented as
= ( A, C, G, T) is the equilibrium distribution. Equivalently, Q = R , where is diagonal and R is symmetric:
= = = and = , then Q is an HKY matrix. If is uniform and = = = = = , then Q is a JC matrix. Finally, if A = T, C = G, = and = , then the substitution rates are invariant under complementation; for example, Q(A, C) = Q(T, G). In this case, Q is called strand-symmetric.
Suppose that Q is calibrated, that is,
Assuming site independence and homogeneity, the probability of a subalignment (without gaps) is the product of the site-specific probabilities. This is maximized numerically to obtain estimates of the rate matrix and the branch lengths. By suitably constraining the rate matrix, we get the maximum likelihood (ML) estimates of an REV, HKY, or JC rate matrix. The whole-genome analysis for REV could be finished within 7 h on a single 2.6-Ghz processor.
We thank Nicolas Bray and Colin Dewey for generating the whole-genome alignments; and Greg Cooper, Ross Hardison, David Haussler, Webb Miller, and Arend Sidow for extensive discussions. Special appreciation goes to Krishna Roskin for reconciling the findings of the different groups. We also thank the referees for many helpful comments and suggestions. L.P. and V.B.Y. were partially funded by a grant from the NIH (R01-HG02362-01). The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be herebymarked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1967904.
1 Corresponding author.
Billera, L.J., Holmes, S.P., and Vogtmann, K. 2001. Geometry of the space of phylogenetic trees. Adv. Appl. Math. 27: 733-767.[CrossRef]
Boffelli, D., McAuliffe, J., Ovcharenko, D., Lewis, K.D., Ovcharenko, I., Pachter, L., and Rubin, E.M. 2003. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299: 1391-1394.
Bray, N. and Pachter, L. 2003. The MAVID multiple alignment server. Nucleic Acids Res. 31: 3525-3526. . 2004. MAVID: Constrained ancestral alignment of multiple sequences. Genome Res. (this issue).
Cooper, G.M., Brudno, M., NISC Comparative Sequencing Program, Green, E.D., Batzoglou, S., and Sidow, A. 2003. Quantitative estimates of sequence divergence. Genome Res. 13: 813-820. Cooper, G.M., Brudno, M., Stone, E.A., Dubchak, I., Batzoglou, S., and Sidow, A. 2004. Characterization of evolutionary rates and constraints in three mammalian genomes. Genome Res. (in press).
Dubchak, I., Brudno, M., Loots, G.G., Pachter, L., Mayor, C., Rubin, E.M., and Frazer, K.A. 2000. Active conservation of noncoding sequences revealed by three-way species comparison. Genome Res. 10: 1304-1306.
Hardison, R.C., Roskin, K.M., Yang, S., Dickhans, M., Kent, W.J., Weber, R., Elnitski, L., Li, J., O'Connor, M., Kolbe, D., et al. 2003. Covariation in frequencies of substitution, deletion, transposition, and recombination during eutherian evolution. Genome Res. 13: 13-26.
Lee, I.Y., Westway, D., Smit, A.F.A., Wang, K., Seto, J., Chen, L., Acharya, C., Ankener, M., Baskin, D., Cooper, C., et al. 1998. Complete genome sequence and analysis of prion protein gene region from three mammalian species. Genome Res. 8: 1022-1037.
Miller, W. 2001. Comparison of genomic DNA sequences. Bioinformatics 17: 391-397. Moret, B.M.E., Wang, L.-S., Warnow, T., and Wyman, S. 2001. New approaches for reconstructing phylogenies based on gene order. Bioinformatics 17 Suppl. 1: S165-173.[Abstract] Rat Genome Sequencing Project Consortium. 2004. Genome sequence of the Brown Norway Rat yields insights into mammalian evolution. Nature (in press). Tavaré, S. 1986. Some probabilistic and statistical problems in the analysis of DNA sequences. Lectures on Mathematics in the Life Sciences 17: 57-86. Thomas, J.W., Touchman, J.W., Blakesley, R.W., Bouffard, G.G., Beckstrom-Sternberg, S.M., Margulies, E.H., Blanchette, M., Siepel, A.C., Thomas, P.J., McDowell, J.C., et al. 2003. Comparative analysis of multi-species sequences from targeted genomic regions. Nature 424: 788-793.[CrossRef][Medline] Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520-562.[CrossRef][Medline] Wiehe, T., Guigo, R., and Miller, W., 2000. Genome sequence comparisons: Hurdles in the fast lane to functional genomics. Briefings in Bioinformatics 4: 381-388. Yang, S., Schwartz, S., Chiaromonte, F., Roskin, K.M., Haussler, D., Miller, W., and Hardison, R.C. 2004. Patterns of insertions and their covariation with substitutions in the rat, mouse and human genomes. Genome Res. (this issue). Yang, Z. 1994. Estimating the pattern of nucleotide substitution. J. Mol. Evol. 39: 105-111.[Medline] Yap, V.B. and Speed, T.P. 2004. Modeling DNA base substitution in large genomic regions from two organisms. J. Mol. Evol. 58: 12-16.[CrossRef][Medline]
http://baboon.math.berkeley.edu/hotrodent/; rodent hotspots. http://baboon.math.berkeley.edu/mavid/; MAVID alignment server. http://genome.ucsc.edu/; UCSC Genome Bioinformatics.
Received September 11, 2003;
accepted in revised format December 27, 2003.
This article has been cited by other articles:
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||