|
|
|
|
Genome Res. 15:1566-1575, 2005 ©2005 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/05 $5.00 Methods Genomic scans for selective sweeps using SNP data1 Department of Biological Statistics and Computational Biology, Cornell University, Ithaca, New York 14853, USA 2 Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853, USA 3 Center for Bioinformatics and Department of Biology, University of Copenhagen, Copenhagen, Denmark 4 Department of Biology, University of Rochester, Rochester, New York 14627, USA
Detecting selective sweeps from genomic SNP data is complicated by the intricate ascertainment schemes used to discover SNPs, and by the confounding influence of the underlying complex demographics and varying mutation and recombination rates. Current methods for detecting selective sweeps have little or no robustness to the demographic assumptions and varying recombination rates, and provide no method for correcting for ascertainment biases. Here, we present several new tests aimed at detecting selective sweeps from genomic SNP data. Using extensive simulations, we show that a new parametric test, based on composite likelihood, has a high power to detect selective sweeps and is surprisingly robust to assumptions regarding recombination rates and demography (i.e., has low Type I error). Our new test also provides estimates of the location of the selective sweep(s) and the magnitude of the selection coefficient. To illustrate the method, we apply our approach to data from the Seattle SNP project and to Chromosome 2 data from the HapMap project. In Chromosome 2, the most extreme signal is found in the lactase gene, which previously has been shown to be undergoing positive selection. Evidence for selective sweeps is also found in many other regions, including genes known to be associated with disease risk such as DPP10 and COL4A3.
When a new beneficial mutation increases in frequency in a population because of natural selection, the standing genetic variation in neighboring regions will be affected. The level of variability will be reduced, the level of linkage disequilibrium increased, and the pattern of allele frequencies will be skewed (e.g., Maynard Smith and Haigh 1974
Full genomic scans for selective sweeps face several challenges. First, the effects of selection are confounded by the effects of demographic factors. The neutral null hypothesis (absence of selective sweeps) is a composite hypothesis that also makes assumptions regarding the demography of the populations investigated. Typically, it is assumed that the population is in equilibrium at constant size and with no population subdivision or gene-flow with other populations. Clearly, there is no human population for which these assumptions are true. The second challenge faced in genomic scans of selective sweeps is that much of the available data consist of SNP genotypes that had been initially identified using an ascertainment (or SNP discovery) process. Because these data deviate from a random sample of fully identified genotypes, standard population genetic methods cannot be applied without taking this "ascertainment bias" into account. Typically, the SNPs have been ascertained by direct sequencing in a relatively small sample, and then, if variable in the small sample, they have been typed by single-SNP genotyping assays in a much larger sample. The levels of variability, distribution of allele frequencies, and levels of linkage disequilibrium will all be strongly affected by such ascertainment schemes (e.g., Nielsen and Signorovitch 2003
There are many different methods available for detecting selective sweeps from DNA sequence data (e.g., Tajima 1989
An alternative approach for detecting selective sweeps, when data have been sampled from more than one population, is to identify genomic regions with elevated levels of population subdivision (e.g., Lewontin and Krakauer 1973
In this paper, we present two new methods for detecting selective sweeps based on ascertained SNP data. The methods are similar to the method by Kim and Stephan (2002
Consider first SNP data without structure, that is, where the SNP data have been obtained from a single population or where the individuals are not labeled with respect to population. In the following we assume that it is known which allele is ancestral and which allele is derived for a particular SNP (known polarity). When such information is available, the power to detect selective sweeps may be improved. However, the methods described here have also been implemented for SNPs in which the polarity of the mutation is not known (i.e., the folded site-frequency spectrum). For human data it will usually be possible to infer, with acceptable confidence, the polarity of the mutation from consideration of the nucleotide state in the chimpanzee.
Test 1: Aberrant site frequency spectrum
Let the (unknown) probability of observing a derived allele of frequency j in the sample be pj, j = 1, 2,..., n 1, then assuming homogeneity along the genome Pr (Xi = j) = pj. Let p =(p1, p2,..., pn 1). For k SNPs, a composite likelihood function is formed by multiplying the sampling probabilities along the chromosome:
j = kj/k, j =1, 2,..., n 1. The composite likelihood can also be calculated for a window of SNPs, from SNP v to b, as
v b and , respectively. The first approach we will consider is a sliding window approach where the test statistic in a window running from SNP v to SNP b is given by
v b) from the global sets of allele frequencies ( ).
This statistic can readily be used to scan the genome for regions with aberrant allele frequency distributions (frequency spectra). However, to test if the deviation from the expectation is beyond what would be expected under a particular population genetic model, simulations are needed. Such simulations can also incorporate population growth, bottlenecks, and other challenging demographic changes, all readily performed using available methods (e.g., Hudson 2002
Test 2: Parametric approach
Before a selective sweep, the allele frequency spectrum in the population will be p =(p1, p2,..., pn 1). Barton (1998
is a parameter that depends on the rate of recombination, the effective population size, and the selection coefficient of the selected mutation. For example, in the approximation by Durrett and Schweinsberg (2004 = r ln (2N)/s, where N is the population size, r is the recombination rate per base pair, and s is the selection coefficient.
The assumption that each ancestral lineage escapes the sweep independently with probability Pe is equivalent to a model in which, looking backward in time, (1) the coalescence among lineages linked to the beneficial allele happens at the end of the selective phase (beginning of the sweep), and (2) recombination events by which lineages escape the sweep happen before these coalescent events. This is a simple approximation to the more accurate model of hitchhiking in which the rates of the coalescent and recombination events considered increase by 1/x and 1 x, respectively, where x is the frequency of the beneficial mutation in the population, as x decreases during the selective phase (Kaplan et al. 1989
If k lineages escape the sweep, the ancestral sample right before the sweep contains H = min{n, k + 1} lineages. If the distribution of allele frequencies in a sample of size n, in the absence of a selective sweep, is given by p =(p1, p2,..., pn 1), then the probability of observing j mutant lineages in an ancestral sample of size H is given by
This expression allows us to calculate the composite likelihood for a set of SNP data assuming a selective sweep of intensity
This method can be considered a modification of the method by Kim and Stephan (2002
Tajima's D test, Mann-Whitney's U test, and the correction for multiple tests To identify candidate SNPs that may be associated with a selective sweep, we form confidence regions for the location of a selective sweep. The confidence intervals are constructed using the composite likelihood score based on simulations (see Methods for details).
Ascertainment Ascertainment biases can be taken into account using the approach of Nielsen et al. (2004 ) be symbolized by . For SNP i, the likelihood function is modified by conditioning on the ascertainment condition of SNP i (ASCi), that is,
i | ) is the usual likelihood function in the absence of any ascertainment bias (e.g., equation 6). In the simplest possible case where ascertainment has occurred in a subsample of size d and no population structure is assumed,
Likewise, even in cases where the data have been obtained by full sequencing, we may want to include only variable sites in the analysis. The motivation for doing this is to increase the robustness to assumptions regarding
Analysis of simulated data
The power to detect a selective sweep based on the folded frequency spectrum is depicted in Figure 2. Tajima's D does not take advantage of outgroup information and is performed identically for unfolded and folded data. The power of the MWU test and Tajima's D test are now comparable, but the power of the other tests is largely unaffected. The reason is presumably that a selective sweep affects both the proportion of high-frequency-derived and low-frequency-derived mutations. It is possible that the increase in information associated with the use of the unfolded frequency spectrum is counteracted by the increase in the number of parameters in models of unfolded frequency spectra. This suggests that the MWU should be performed using the folded, and not the unfolded frequency spectrum, because use of the unfolded frequency spectrum involves additional assumptions regarding the outgroup species and may be sensitive to mis-specifications of ancestral states.
Next we were interested in examining the power and robustness of the test under a demographic model that includes population growth, when the neutral equilibrium model is used as the null model (Fig. 3). In this case, Tajima's D rejects in 100% of the cases, even in the absence of a selective sweep. As shown in Simonsen et al. (1995 In practical applications, it may be desirable to perform the tests using more realistic demographic models as the null model. If the assumed demographic model is correct, this circumvents the problem of robustness to demographic factors. Although there is considerable uncertainty regarding what the correct model for human demography might be, using more realistic demographic models might nonetheless in many applications be more desirable than the simplistic standard neutral model. Therefore, it might also be of interest to see how the power of the tests compare when the correct demographic model has been used as a null model in the presence of population growth (Fig. 4). Notice that Test 1 and Test 2 in this case have much more power than the MWU test and Tajima's D. Applying Tajima's D test, and possibly other related tests, while correcting for population growth, may lead to tests with very low power.
The accuracy of the inference of the locations of the selective sweep is illustrated in Figure 5 based on the significant results under 2Ns = 500, 750, and 1000. Notice that almost all the inferred sweeps are within a window of 10 kb centered on the true location of the sweep. While the accuracy of the location of the inferred sweep is strongly dependent on the SNP density and recombination rate, these results suggest that even for genomic SNP data it is possible to identify the location of a selective sweep to a particular gene. For older selective sweeps, the accuracy may be lower.
Robustness of Test 2 To evaluate the robustness of Test 2 further, we determined the distribution of the test statistic for various realistic models of human demography, such as the model of divergence and population growth explored by Marth et al. (2004 =2NR =103) and low ( =2NR = 0) values of the recombination rate are shown in Figure 6. Notice that the null distributions are very similar for all investigated demographic models. This suggests that Test 2 has a very high degree of robustness against demographic assumptions. Close inspection of the tails reveals some differences that will affect the critical values of tests performed at the 5% and 1% significance levels. However, it is important to notice that for both high recombination rates and low recombination rates, the standard neutral model provides the most conservative critical values. This means that using the standard neutral null model, when the Marth et al. (2004
We also examined the null distribution under varying assumptions regarding the recombination rate, to investigate the robustness of Test 1, Test 2, and the MWU test to such assumptions (Fig. 7). The distribution of the test statistic of the MWU test (Fig. 7A) shows weak dependence on the recombination rate when the rate of recombination is high, but strong dependence when the recombination rate is low. Clearly, accurate estimates of the recombination rate are important when applying this test. Test 1 shows even stronger dependence on the recombination rate (Fig. 7B). If the recombination rate assumed is higher than the true recombination rate, application of this test will lead to an elevated type I error. In contrast to Test 1, Test 2 has a very high degree of robustness to assumptions regarding the recombination rate (Fig. 7C). As in the case of the other tests, R =0 provides the most conservative assumption. The extreme difference between the behavior of Test 1 and Test 2 tests can probably best be explained by noting that much of the information for Test 2 comes from the spatial pattern along the sequence, and that distances for this test depend on the selection coefficient, which is a free parameter. Wrong assumptions about the recombination rate will then lead to biased estimates of the selection coefficient, but will not have a strong effect on the null distribution of the test statistic (Kim and Nielsen 2004
Analysis of Seattle SNP data Results for the gene regions with p-values < 0.05 are shown in Table 1. Figure 8 also shows the maximized composite likelihood function along the length of the gene region of the gene showing the strongest evidence for a selective sweep, C3. The maximum composite likelihood estimate of the location of the sweep falls approximately in an intron at position 30,525. This intron contains several insertion/deletion differences when compared to the chimpanzee sequence and is a region associated with alternative splicing.
These data have previously been examined by other researchers for the purpose of detecting selective sweeps. Akey et al. (2002
Analysis of HapMap data The main effect of the ascertainment bias correction of the HapMap data is to reduce the composite likelihood ratio value for several of the significant peaks (Fig. 9). It also changes the relative height of the peaks, of the likelihood function, depending on the underlying ascertainment sample size in the region. This illustrates the importance of correcting for ascertainment biases, and suggests that lack of correction for ascertainment bias may lead to excess false positives.
The genes identified to lie within a confidence region for a selective sweep are listed in Table 2. The major peak (at position 1.36 x 108) of the composite likelihood surface is centered on the lactase (LCT) locus. The evidence for selection on the lactase locus dates back to the early seventies (e.g., Cavalli-Sforza 1973
There are several disease factors on the gene list in Table 2, including COL4A3, a known disease factor for Alport syndrome. The peak at location 14.4 x 107 is associated with the KYNU gene encoding kynurenine hydrolase. Elevated levels of this enzyme are associated with cerebral and systemic inflammatory conditions. A peak at position 12.2 x 107 is located on the TSN gene. The protein encoded by this gene, translin, binds to single-stranded DNA ends generated by staggered breaks, such as those occurring at recombination and translocation breakpoints. A peak at location 1.71 x 107 is located on the DPP10 gene, a gene that is primarily expressed in the brain but also has been implicated as a disease factor for asthma. There are several other genes that are specific to the brain or overexpressed in neuronal tissue. A peak at location 7.5 x 107 is located on the SEMA4F gene, a gene that functions in axon guidance. ACVR1C, the only gene in the confidence region under the peak with the fourth highest CLR, is a type I receptor for TGF that plays an unknown role during mammalian brain development.
The modification of the Kim and Stephan (2002
Beyond demographic factors, significant results of the neutrality tests considered here may be caused by other types of selection than a selective sweep. Test 2 may in this sense give spurious results if other types of selection are acting locally to mimic the signature of a selective sweep. It is unknown to what degree selection against slightly deleterious mutations could mimic the spatial pattern providing the power of Test 2. Test 2 is designed specifically to detect a recent selective sweep and may have very little power to detect other types of selection, such as balancing selection. It may, therefore, be preferable to use other methods, such as Test 1 or the MWU-based test, if other types of selection are of interest. Alternatively, extensions of Test 2 that specifically accommodate different modes of selection, such as balancing selection, could be implemented. Ascertainment was here modeled explicitly for the HapMap data based on information regarding the empirical distribution of ascertainment sample sizes. However, many complexities of the ascertainment procedure for the HapMap data were ignored. Nonetheless, it is clear from the current results that ascertainment, and its correction, seems to matter, even for Test 2. Also, it is clear that selective sweep mapping using Test 2 has great potential for identifying regions in the human genome that have been targeted by recent sweeps. The simplified analysis performed here found the strongest evidence for selection associated with the lactase locus, which previously has been demonstrated to be under selection, but it also identified several other interesting potential targets of selective sweeps. We hope that further analysis of extensive genome-wide human polymorphism data, especially when combined with full information regarding the ascertainment scheme, may help determine which genes in the human genome have been targeted by selective sweeps.
Simulations The sliding window sizes for the MWU test, Tajima's D test, and Test 1 were chosen based on preliminary runs to maximize the power of the tests. No choice of window size is necessary for Test 2. We first simulated data using the method of Kim and Stephan (2002 = 0.005. The selective sweep, of strength , was placed in the center of the chromosome. Critical values were found by repeatedly simulating data from a standard neutral coalescence model under values of estimated from the data using the number of segregating sites and the true values of the recombination rate. Simulations were performed using the standard coalescent simulation algorithm (e.g., Hudson 2002
To determine the power in the presence of population growth, we modified the simulation method of Kim and Stephan (2002 We investigated the robustness of Test 2 further, using an additional set of simulations. In all, 100,000 new data sets of 200 SNPs were simulated for varying assumptions regarding demography and recombination. For each 200-SNP window, we performed Test 2 using the remaining data sets to provide the background frequency spectrum. We examined four demographic conditions:
Ascertainment was also modeled in these simulations using varying discovery sample sizes corresponding to the ascertainment method used for the Perlegen data (Hinds et al. 2005
Construction of confidence regions
Analysis of Seattle SNP data
Analysis of HapMap data To obtain critical values, we again use neutral coalescence simulations, with parameter values estimated from the real data. The critical value obtained for the HapMap data is the maximum composite likelihood ratio observed in the entire chromosome. This procedure then automatically controls for multiple tests.
We explicitly model the ascertainment process when simulating the data (e.g., Nielsen et al. 2004
We thank J.C. Mullikin for helpful information regarding the HapMap ascertainment data. This research was supported by grants NSF/NIH Grant DMS/NIGMS0201037 and NIH grant HG-003229. Y.K. is supported by NSF grant DEB-0449581.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.4252305. Freely available online through the Genome Research Immediate Open Access option. [The following individuals kindly provided reagents, samples, or unpublished information as indicated in the paper: J.C. Mullikin.]
5 Corresponding author.
Akashi, H. 1999. Inferring the fitness effects of DNA mutations from patterns of polymorphism and divergence: Statistical power to detect directional selection under stationarity and free recombination. Genetics 151: 221238.
Akey, J.M., Zhang, G., Zhang, K., Jin, L., and Shriver, M.D. 2002. Interrogating a high-density SNP map for signatures of natural selection. Genome Res. 12: 18051814. Barton, N.H. 1998. The effect of hitch-hiking on neutral genealogies. Genet. Res. 72: 123133.[CrossRef]
Barton, N.H. and Etheridge, A.M. 2004. The effect of selection on genealogies. Genetics 166: 11151131. Bersaglieri, T., Sabeti, P.C., Patterson, N., Vanderploeg, T., Schaffner, S.F., Drake, J.A., Rhodes, M., Reich, D.E., and Hirschhorn, J.N. 2004. Genetic signatures of strong recent positive selection at the lactase gene. Am. J. Hum. Genet. 74: 11111120.[CrossRef][Medline] Cavalli-Sforza, L. 1973. Analytic review: Some current problems of population genetics. Am. J. Hum. Genet. 25: 82104, 156.[Medline]
Clark, A.G., Glanowski, S., Nielsen, R., Thomas, P.D., Kejariwal, A., Todd, M.A., Tanenbaum, D.M., Civello, D., Lu, F., Murphy, B., et al. 2003. Inferring non-neutral evolution from humanchimpmouse orthologous gene trios. Science 302: 19601963. Durrett, R. and Schweinsberg, J. 2004. Approximating selective sweeps. Theor. Popul. Biol. 66: 129138.[CrossRef][Medline]
Fay, J.C. and Wu, C.-I. 2000. Hitchhiking under positive Darwinian selection. Genetics 155: 14051413.
Gilad, Y., Rosenberg, S., Przeworski, M., Lancet, D., and Skorecki, K. 2002. Evidence for positive selection and population structure at the human MAO-A gene. Proc. Natl. Acad. Sci. 99: 862867.
Harr, B., Kauer, M., and Schlötterer, C. 2002. Hitchhiking mapping: A population-based fine-mapping strategy for adaptive mutations in Drosophila melanogaster. Proc. Natl. Acad. Sci. 99: 1294912954.
Hinds, D.A., Stuve, L.L., Nilsen, G.B., Halperin, E., Eskin, E., Ballinger, D.G., Frazer, K.A., and Cox, D.R. 2005. Whole-genome patterns of common DNA variation in three human populations. Science 307: 10721079.
Hudson, R.R. 2002. Generating samples under a Wright-Fisher neutral model. Bioinformatics 18: 337338. The International HapMap Consortium. 2003. The International HapMap Project. Nature 426: 789796.[CrossRef][Medline]
Jensen, J.D., Kim, Y., Dumont, V.B., Aquadro, C.F., and Bustamante, C.D. 2005. Distinguishing between selective sweeps and demography using DNA polymorphism data. Genetics 170: 14011410.
Kaplan, N.L., Hudson, R.R., and Langley, C.H. 1989. The `hitchhiking effect' revisited. Genetics 123: 887899.
Kim, Y. and Stephan, W. 2002. Detecting a local signature of genetic hitchhiking along a recombining chromosome. Genetics 160: 765777.
Kim, Y. and Nielsen, R. 2004. Linkage disequilibrium as a signature of selective sweeps. Genetics 167: 15131524.
Lewontin, R.C. and Krakauer, J. 1973. Distribution of gene frequency as a test of the theory of the selective neutrality of polymorphisms. Genetics 74: 175195.
Marth, G.T., Czabarka, E., Murvai, J., and Sherry, S.T. 2004. The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics 166: 351372. Maynard Smith, J. and Haigh, J. 1974. The hitch-hiking effect of a favourable gene. Genet. Res. 23: 2335.[Medline] Nielsen, R. 2004. Population genetic analysis of ascertained SNP data. Hum. Genomics 1: 218224.[Medline] Nielsen, R. and Signorovitch, J. 2003. Correcting for ascertainment biases when analyzing SNP data: Applications to the estimation of linkage disequilibrium. Theor. Pop. Biol. 63: 245255.[CrossRef][Medline]
Nielsen, R., Todd, M.J., and Clark, A.G. 2004. Reconstituting the frequency spectrum of ascertained SNP data. Genetics 168: 23732382.
Parsch, J., Meiklejohn, C.D., and Hartl, D.L. 2001. Patterns of DNA sequence variation suggest the recent action of positive selection in the janusocnus region of Drosophila simulans. Genetics 159: 647657.
Przeworski, M. 2002. The signature of positive selection at randomly chosen loci. Genetics 160: 11791189.
Przeworski, M. 2003. Estimating the time since the fixation of a beneficial allele. Genetics 164: 16671676. Sabeti, P.C., Reich, D.E., Higgins, J.M., Levine, H.Z., Richter, D.J., Schaffner, S.F., Gabriel, S.B., Platko, J.V., Patterson, N.J., McDonald, G.J., et al. 2002. Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832837.[CrossRef][Medline] Simonsen, K.L., Churchill, G.A., and Aquadro, C.F. 1995. Properties of statistical tests of neutrality for DNA polymorphism data. Genetics 141: 413429.[Abstract] Stephan, W., Wiehe, T.H.E., and Lenz, M.W. 1992. The effect of strongly selected substitutions on neutral polymorphism: Analytical results based on diffusion theory. Theor. Pop. Biol. 41: 237254. Stephens, M., Smith, N., and Donnelly, P. 2001. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68: 978989.[CrossRef][Medline]
Tajima, F. 1989. The effect of change in population size on DNA polymorphism. Genetics 123: 597601. Watterson, G.A. 1975. On the number of segregating sites in genetical models without recombination. Theor. Pop. Biol. 7: 256276.[CrossRef][Medline]
Williamson, S.H., Hernandez, R., Fledel-Alon, A., Zhu, L., Nielsen, R., and Bustamante, C.D. 2005. Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc. Natl. Acad. Sci. 102: 78827887. Wootton, J.C., Feng, X., Ferdig, M.T., Cooper, R.A., Mu, J., Baruch, D.I., Magill, A.J., and Su, X.-Z. 2002. Genetic diversity and chloroquine selective sweeps in Plasmodium falciparum. Nature 418: 320323.[CrossRef][Medline]
http://pga.gs.washington.edu; the Seattle SNP database [Feb. 2004]SeattleSNPs. NHLBI Program for Genomic Applications, SeattleSNPs, Seattle, WA.
Received June 9, 2005; accepted in revised format September 6, 2005. This article has been cited by other articles:
|