|
|
|
|
Genome Res. 14:1610-1616, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00 Methods Indel-Based Evolutionary Distance and MouseHuman Divergence1 National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland 20892, USA 2 Division of Genetics, Brigham and Women's Hospital and Harvard Medical School, Boston, Massachusetts 02115, USA
We propose a method for estimating the evolutionary distance between DNA sequences in terms of insertions and deletions (indels), defined as the per site number of indels accumulated in the course of divergence of the two sequences. We derive a maximal likelihood estimate of this distance from differences between lengths of orthologous introns or other segments of sequences delimited by conservative markers. When indels accumulate, lengths of orthologous introns diverge only slightly slower than linearly, because long indels occur with substantial frequencies. Thus, saturation is not a major obstacle for estimating indel-based evolutionary distance. For introns of medium lengths, our method recovers the known evolutionary distance between rat and mouse, 0.014 indels per site, with good precision. We estimate that mousehuman divergence exceeds ratmouse divergence by a factor of 4, so that mousehuman evolutionary distance in terms of selectively neutral indels is 0.056. Because in mammals, indels are 14 times less frequent than nucleotide substitutions, mousehuman evolutionary distance in terms of selectively neutral substitutions is 0.8.
Evolutionary distance (ED) between two homologous DNA sequences is defined as the per nucleotide site number of mutations that have been fixed in the course of evolution of the sequences from their last common ancestor. For not-too-tightly related sequences, ED exceeds their dissimilarity (DS), the per site number of differences between the properly aligned sequences. This happens because several mutations can affect the same site. Different methods of inferring the number of such multiple hits from the observed DS often produce rather different estimates of ED (see Li 1997
ED is usually estimated on the basis of single nucleotide substitutions. Then, each site serves as an independent timer, which is advantageous when only short sequences are available. However, substitution-based ED also has two substantial limitations.
First, as DS at single-site timers can assume only two states (match or mismatch), it rapidly reaches saturation when ED increases. In the simplest case of equally frequent nucleotides and no selective constraint, DS approaches 0.75 and ED is almost impossible to estimate when the number of hits per timer exceeds two or three (see Li 1997
Second, even when the number of hits per neutral site is below one or two, so that recovering ED from DS may be feasible, selectively neutral sequences often cannot be used, due to their unreliable alignments. Even for placental mammals from different orders, accumulated insertions and deletions often make it impossible to establish homology of individual sites within introns and other noncoding sequences (Shabalina et al. 2001
Thus, it may be preferable to estimate ED on the basis of length-difference mutations, insertions, and deletions (indels). Then, a timer is the segment of the sequence delimited by unambiguous markers, for example, an intron flanked by conservative exons. Availability of large-scale sequence data makes such long timers acceptable. If homologous timer sequences are similar enough to allow their unambiguous alignments, the indel-based ED between them must be close to their indel-based DS, the number of gaps in their alignment over its length. Recently, methods of phylogenetic reconstruction based on individually recognizable indels become available (see Sanchis et al. 2001 Indel-based ED is less prone to saturation than substitution-based ED, due to two factors. First, lengths of two homologous segments of DNA can diverge almost without a limit. Second, as indels are less common than substitutions, even distant sequences can be compared without encountering too many hits. This study is concerned with estimating indel-based ED between unalignable timer sequences. We derive a maximal likelihood estimate of indel-based ED, test it by recovering the known ED between rat and mouse, and apply it to estimate ED between mouse and human.
First, let us assume that we know the distribution p( ) of the length of individual indels that occurred in a timer sequence (e.g., an intron) in the course of evolution of a pair of species (e.g., mouse and human) from their last common ancestor (Table 1 presents the notations). We arbitrarily assign one (mouse) sequence to be the first, and the other (human) the second. The lengths of these timer sequences are L1 and L2, respectively, and = L1L2. Then, an indel of positive length makes the mouse sequence longer than the human sequence, and vice versa. Usually, we cannot tell a deletion in the murine lineage from an insertion of the same length in the human lineage. Thus, an indel of a positive (negative) length was either an insertion into murine (human) lineage or a deletion from human (murine) lineage.
If different indels are fixed independently, the distribution fk(
We need to recover k, the per timer number of indels, from the observed
Because there are no a priori restrictions on k, the likelihood function for q for the i-th timer with length difference
For a set of many independent timers,
The value of q which maximizes l(q), q^, is the maximal likelihood estimate of q for this set.
Now, let us address the problem of estimating p(
We will consider probably the most realistic among such intermediate situations. It is easy to see that
) (Pd( )) is the probability that an insertion (deletion) has the length , and S is a normalizing constant. We will assume that Pi( ) and Pd( ) are the same for all the three distributions and consider the consequences of changes in A and I. We will also assume that Pi( ) = Pd( ) = P( ), that is, that distributions of length of insertions and of deletions coincide. If so, the shapes of the branches of p( ), that is, the distributions corresponding to only positive or only negative values of , p+( ) and p( ), are always the same and coincide with P( ), and the probability that an indel has positive length is given by:
Thus, with I = 0.5 or A = 0.5, p(
We will ascertain P(
The asymmetry parameter a will be estimated from the relationship of differences between the lengths of timer sequences and absolute values of these differences.
We identified 8817 of triplets of mouse, rat, and human orthologous protein-coding genes, using the standard reciprocal best hit approach (Tatusov et al. 1997
Figure 1 presents data on distributions of lengths of individual indels within alignments of orthologous introns. Figure 1A displays pm( ) and ph( ). Area under the positive branch, a, is 0.46 (605,079 indels of positive lengths vs. 715,175 indels of negative lengths) in pm( ) and 0.48 (12,866 vs. 13,801) in ph( ), reflecting the fact that gaps in the first of the aligned sequences (rat or human) are slightly more common than in the second sequence (mouse or OWM).
Shapes of the positive and negative branches are nearly identical in both pm( ) (Fig. 1B; elevated frequencies of indels of lengths 100 and 200 are due to insertions of B1 and B2 SINEs) and ph( ) (data not reported). The shape of the branches is also very similar between pm( ) and ph( ) (Fig. 1C; elevated frequencies of indels of lengths 320 in humanOWM alignments are due to insertions of Alu SINEs). The shapes of these branches computed only for those parts of ratmouse and humanOWM alignments where neither of the two sequences is masked by RepeatMasker, are similar (Fig. 1D), and the total number of gaps that are due to insertion of recognizable transposable elements is <3%. Although short indels are slightly more frequent in ph( ) than in pm( ), in both distributions the median indel length is 3. The average length of an indel is not a good parameter because when , P( ) declines as 2 (Fig. 1E), and all its moments diverge. Alignments of introns of different lengths yield different P( ) (Fig. 1F). Not surprisingly, long indels are more common within longer introns.
To estimate mousehuman indel-based ED, we calculate p(
Figure 2 shows how properties of intron pairs depend on k. Such data exist only for ratmouse (and humanOWM) pairs. The average intron length increases linearly with k (Fig. 2A). Figure 2, B and C compare the data on the mean length difference and median absolute value of length difference between orthologous introns, M(
Figure 3 presents properties of intron pairs as functions of L. In ratmouse intron pairs, M( ) and Med(| |) depend on L almost exactly as on k, which is to be expected, as L is a good proxy for k. Probably, the same is also true for mousehuman intron pairs. Figure 4 shows how a can be estimated from the relationship between M( ) and Med(| |). In the course of mousehuman divergence a was 0.42.
Figure 5 shows estimates of q between rat and mouse and between mouse and human, together with actual values of q between rat and mouse. Using the correct value of a, 0.46, leads to the best estimate of ratmouse ED. Using smaller a underestimates q (because the same number of indels leads, on average, to a large divergence of intron lengths), and vice versa (data not reported). Thus, we assume that for mouse and human, where q cannot be observed directly, its best estimate is also obtained under the correct value of a = 0.42.
Figure 6 compares ratmouse and mousehuman divergence of lengths of orthologous introns.
Our results demonstrate that q, the indel-based ED, can be estimated from length differences between orthologous timer sequences. Figure 1A and shows that pm( ) and ph( ) are close to each other (very similar data have been obtained previously; Britten 2002 0.5, as the shapes of their positive and negative branches are almost the same (Fig. 1B). Thus, pm( ) or ph( ) can be described by equation 7 with P( ) = (p+( ) + p( ))/2 and a = 0.46 or 0.48, respectively.
Rigorously speaking, any observed distribution of gap lengths is a convolution of the distribution of length of individual indels, because multiple indels can occur on top of each other. However, as rat and mouse and, in particular, human and OWM, are close enough to each other, ascertaining pm(
We will assume that P(
Data on ratmouse intron pairs show that the number of accumulated indels k is proportional to the average intron length L (Fig. 2A). Thus, the per site density of indels q is almost independent of intron length. For rat and mouse, q = 0.014. When k increases, both M(
The key property of timer lengths evolution is evident from Figure 2, B and C. When k increases, lengths of orthologous introns diverge only slightly slower than linearly. As a result, saturation, a major obstacle to estimating substitution-based ED, is much less of a problem in the case of indel-based ED. Two extreme cases elucidate this pattern. On the one hand, if all indels were of length 1, and indels of positive and negative lengths were equally frequent (a = 0.5), Med(|
Two facts are obvious if we compare the patterns in divergence of lengths within ratmouse and mousehuman intron pairs (Fig. 3). First, for pairs with the same L, Med(|
Data on mammalian evolution (Springer et al. 2003 Indel-based ED between rat and mouse is 0.014 (Fig. 5). Slight decline of q with L may be due to its underestimation in long introns, whose alignments often contain long gaps, as additional indels cannot be recorded within such gaps.
Maximal likelihood estimate of q recovers its real values for rat and mouse with good precision (Fig. 5). In long introns, q is slightly underestimated because Med(|
Thus, the best estimates of q are obtained for introns of lengths 3001500, where accumulation of indels is closest to independent. Within this range, the average q between mouse and human is 0.056, that is, four times higher than between rat and mouse (Fig. 5). Simple considerations support this estimate. Med(|
Perhaps mousehuman ED estimated for completely neutral indels would be slightly above 0.056, as stabilizing selection must slow down the divergence of intron lengths. This is certainly the case for very short introns (Fig. 5). However, Med(|
Insertions of transposable elements is a distinct, important mechanism of accumulation of indels that is probably less homogeneous over evolutionary times than other mechanisms (Waterson et al. 2002). Thus, it is interesting to estimate indel-based ED, which is independent of this process. For this purpose, we used p(
We believe that masking transposable elements causes overestimation of mousehuman divergence. Some elements inserted early in the mammalian radiation changed beyond recognition (Waterson et al. 2002) and are not detected by RepeatMasker. Such elements still contribute to the length difference between orthologous mouse and human introns, even when all masked sequences are ignored. In contrast, RepeatMasker must recognize almost all transposable elements inserted after ratmouse (or humanOWM) divergence, so that p( Therefore, which ED, substitution-based or indel-based, should be calculated for a pair of species? Obviously, if the species are so close that even their neutral sequences can be aligned, both measures can be used, and substitution-based ED is preferable, as it requires less data. In the opposite extreme case of very distant species, ED can only be estimated from non-neutral sequences. Traditionally, this is done on the basis of substitutions, and using indels, although feasible, would require analysis different from presented here. However, there appears to be a substantial range of moderate distances between species for which only indel-based ED can be estimated for neutral sequences. Mouse and human are close to the lower boundary of this range; if they were approximately two times more similar than they are, their introns would be alignable. The upper boundary of this range is currently unknown and the possibility of indel-based estimate of ED between, for example, Fugu and mammals is worth studying.
Both indel-based and substitution-based estimates of ED are vulnerable to changes in the parameters of the underlying process. The distribution of the length of individual indels did not stay exactly invariant (Fig. 1). However, relative rates of nucleotide substitutions of different types also change in the course of evolution (Duret et al. 2002
Indel-based and substitution-based ED's can be converted into each other, as long as we know the indel/substitution ratio R among fixed mutations. Within primates, R = 1/13 for noncoding sequences (Silva and Kondrashov 2002
Thus, if we could construct the correct alignment of human and mouse selectively neutral sequences, we would see
Knowing the evolutionary distance between murine and human neutral sequences is essential for identifying selectively constrained regions. Quantitative analysis is necessary to determine whether such regions cover >10% (Shabalina et al. 2001
We thank two anonymous reviewers for a number of useful suggestions. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2450504.
3 Corresponding author.
Arndt, P.F., Petrov, D.A., and Hwa, T. 2003. Distinct changes of genomic biases in nucleotide substitution at the time of mammalian radiation. Mol. Biol. Evol. 20: 18871896.
Britten, R.J. 2002. Divergence between samples of chimpanzee and human DNA sequences is 5%, counting indels. Proc. Natl. Acad. Sci. 99: 1363313635.
Britten, R.J., Rowen, L., Williams, J., and Cameron, R.A. 2003. Majority of divergence between closely related DNA samples is due to indels. Proc. Natl. Acad. Sci. 100: 46614665.
Castresana, J. 2002. Genes on human chromosome 19 show extreme divergence from the mouse orthologs and a high GC content. Nucleic Acids Res. 30: 17511756.
Cooper, G.M., Brudno, M., Stone, E.A., Dubchak, I., Batzoglou, S., and Sidow, A. 2004. Characterization of evolutionary rates and constraints in three mammalian genomes. Genome Res. 14: 539548.
Duret, L., Semon, M., Piganeau, G., Mouchiroud, D., and Galtier, N. 2002. Vanishing GC-rich isochores in mammalian genomes. Genetics 162: 18371847. Feller, W. 1968. An introduction to probability theory and its applications. John Wiley & Sons, New York.
Hellman, I., Zolliner, S., Enard, W., Ebersberger, I., Nickel, B., and Paabo, S. 2003. Selection on human genes as revealed by comparisons to chimpanzee cDNA. Genome Res. 13: 831837. Kondrashov, A.S. 2003. Direct estimates of human per nucleotide mutation rates at 20 loci causing Mendelian diseases. Hum. Mutat. 21: 1227.[CrossRef][Medline] Li, W.-H. 1997. Molecular evolution. Sinauer Associates, Sunderland, MA.
Makalowski, W. and Boguski, M.S. 1998. Evolutionary parameters of the transcribed mammalian genome: An analysis of 2,820 orthologous rodent and human sequences. Proc. Natl. Acad. Sci. 95: 94079412. Nei, M. and Kumar, S. 2000. Molecular evolution and phylogenetics. Oxford Univ. Press, Oxford.
Ogurtsov, A.Y., Roytberg, M.A., Shabalina, S.A., and Kondrashov, A.S. 2002. OWEN: Aligning long collinear regions of genomes. Bioinformatics 18: 17031704. Rat Genome Sequencing Project Consortium 2004. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428: 493521.[CrossRef][Medline]
Sanchis, A., Michelena, J.M., Latorre, A., Quicke, D.L.J., Gardenfors, U., and Belshaw, R. 2001. The phylogenetic analysis of variable-length sequence data: Elongation factor-1 Shabalina, S.A., Ogurtsov, A.Y., Kondrashov, V.A., and Kondrashov, A.S. 2001. Selective constraint in intergenic regions of human and mouse genomes. Trends Genet. 17: 373376.[CrossRef][Medline] Silva, J.C. and Kondrashov, A.S. 2002. Patterns in spontaneous mutation revealed by humanbaboon sequence comparison. Trends Genet. 18: 544547.[CrossRef][Medline] Smith, N.G.C. and Eyre-Walker, A. 2003. Human disease genes: Patterns and predictions. Gene 318: 169175.[CrossRef][Medline]
Springer, M.S., Murphy, W.J., Eizirik, E., and O'Brien, S.J. 2003. Placental mammals diversification and the Cretaceous-Tertiary boundary. Proc. Natl. Acad. Sci. 100: 10561061.
Tatusov, R.L., Koonin, E.V., and Lipman, D.J. 1997. A genomic perspective on protein families. Science 278: 631637. Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520562.[CrossRef][Medline]
Yang, Z. 1997. PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13: 555556.
Received August 28, 2003; accepted in revised format April 22, 2004. This article has been cited by other articles:
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||