|
|
|
|
Genome Res. 15:1576-1583, 2005 ©2005 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/05 $5.00 Methods Calibrating a coalescent simulation of human genome sequence variation1 Program in Medical and Population Genetics, The Broad Institute, Cambridge, Massachusetts 02139, USA 2 Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA 3 Department of Medicine, Harvard Medical School, Boston, Massachusetts 02115, USA 4 Department of Molecular Biology and Diabetes Unit, Massachusetts General Hospital, Boston, Massachusetts 02114, USA
Population genetic models play an important role in human genetic research, connecting empirical observations about sequence variation with hypotheses about underlying historical and biological causes. More specifically, models are used to compare empirical measures of sequence variation, linkage disequilibrium (LD), and selection to expectations under a "null" distribution. In the absence of detailed information about human demographic history, and about variation in mutation and recombination rates, simulations have of necessity used arbitrary models, usually simple ones. With the advent of large empirical data sets, it is now possible to calibrate population genetic models with genome-wide data, permitting for the first time the generation of data that are consistent with empirical data across a wide range of characteristics. We present here the first such calibrated model and show that, while still arbitrary, it successfully generates simulated data (for three populations) that closely resemble empirical data in allele frequency, linkage disequilibrium, and population differentiation. No assertion is made about the accuracy of the proposed historical and recombination model, but its ability to generate realistic data meets a long-standing need among geneticists. We anticipate that this model, for which software is publicly available, and others like it will have numerous applications in empirical studies of human genetics.
The search for inherited influences on human disease and the effort to understand the history of human populations rely on a detailed knowledge of human genome sequence variation. Population genetic models serve as important tools in this quest. By simulating variation under neutral evolution, they provide background expectations about genetic variation, for example, for the frequency distribution of disease alleles and for patterns of linkage disequilibrium. They also serve as a tool to evaluate evidence for evolutionary selection, and to infer the history of populations. A primary difficulty in using such models lies in deciding which ones to employ. Given the complex history of human populations and the wide variety of plausible model parameters, as well as considerable uncertainty about how mutation and recombination rates vary, it is impossible rigorously to infer from data a single "correct" model: The problem is badly underdetermined. Until comparatively recently, in fact, genome-scale data were scarce enough that they provided few constraints on models. As a result, a common practice in simulating human variation has been to use a set of simple models that are easy to implement. The most basic model is the standard Wright-Fisher neutral model of a freely mixing, constant-sized population, with uniform rates of recombination and mutation across the genome. This model has been (and continues to be) widely used, as have other simple models (e.g., island models).
Simple models have been of enormous utility as heuristics. In some respects, they have also offered surprisingly good fits to empirical data (see, e.g., Fig. 1C,D). As empirical data have accumulated, however, disagreements between expectation and observation have become clear. Perhaps the most obvious discrepancy occurs in predictions of linkage disequilibrium, the nonrandom association of nearby alleles (Frisse et al. 2001
Given the prevailing ignorance about the details of human demographic history, and because of the evident inconsistencies between simple models and empirical data, working models of human genetic variation are increasingly including a range of additional features (Wakeley et al. 2001 What have not been published to date, however, are models that have been calibrated by comparison to many aspects of large, multilocus empirical data sets. As a result, despite all of the work done on modeling (much of it quite sophisticated), no tool currently exists for creating simulated human genetic data that is a good approximation to real, empirical data. It is to address this lack that we have carried out the model calibration described here. We have two goals in doing so: First, to create, for our and others' use, a simulation package that can produce realistic genetic data; and second, to encourage the development of similar (and better) calibrated models.
In order to carry out our simulations, we implemented a coalescent population genetic model (Hudson 1990 sfs/cosi.
As the basis for calibrating our model, we used a set of empirical observations from numerous genomic loci studied in population samples from three continental regions: West Africa, Europe, and East Asia (see Methods for details of the empirical data sets used). Since our goal was to create simulated data that simultaneously matches many aspects of empirical data, we compared the results of simulation to a broad set of empirical measures: (1) the frequency distribution of alleles, (2) the relationship between the frequency of an allele and the probability that it is the ancestral allele (estimated for empirical data from the chimpanzee allele), (3) a measure of the genetic distance between current populations (FST), (4 and 5) two measures of the extent of linkage disequilibrium (r2 and the fraction of marker pairs with D' = 1), and (6) the total diversity (measured as heterozygosity, or Since our simulated data are intended for a variety of applications, there is no single natural choice of a metric for assessing how closely the simulated results match empirical observations. We therefore adopted a simple statistical framework that incorporated all of our statistical measures. For each measure (e.g., allele frequency, r2), we estimated the root-mean-square (RMS) deviation between simulated values and the mean empirical value, calculating it for each of the distributions shown in Figure 2 (see Methods). Four of the measures (minor allele frequency, ancestral/chimpanzee fraction, and the two LD measures) contribute three distributions each, one for each population; treating each of the pairwise FST values as a separate distribution, we have a total of 15 distributions. While the choice of this particular combination of measures was arbitrary, the goal of the procedure was to produce a model yielding adequate simulations of all measures, making it useful for a variety of applications. We compared the statistical uncertainty (standard error on the mean, see Methods for details) of our training data set to the distance between the mean values of the prediction and the empirical data. That is, our goal was to achieve an agreement between the prediction and the data that was comparable to the statistical uncertainty in a large empirical data set such as ours; for this purpose, we set our threshold for acceptable performance at 1.5 times the statistical uncertainty. As an overall composite measure of statistical fit, we calculated the RMS error (RMSE) for all 15 distributions together, normalizing each by its empirical statistical uncertainty. We started our calibrated model from what was essentially a standard neutral model, modified to incorporate an "Out of Africa" scenario for the origin of our three test populations. In the base model, each population was a Wright-Fisher population of effective size 10,000; dates for the splitting of the populations were set to 3500 generations (before the present) for the separation between the African and non-African populations, and to 2000 generations for the separation of the two non-African populations (see Fig. 3). Predictions for this simple model, assuming a constant recombination rate of 1.3 cM/Mb, matched the empirical data quite poorly (Fig. 2): the RMSE was 4.7, that is, the disagreement between empirical measures and model prediction was almost five times larger than the statistical uncertainty in our empirical data set. To this base model we introduced additional parameters and tuned them until the overall match between simulation and empirical data met our criterion. Where possible, we tried to make choices that were demographically and biologically plausible. The resulting parameters, however, represent a mechanism for producing realistic-looking data, not an inference about the actual history of human populations or about details of recombination rate variation. Since the potential search space among many parameters is large, we took a stepwise approach: first choosing a set of parameters to add to the model, optimizing them by minimizing the RMSE, and then iterating the procedure by adding further parameters until the model fit was acceptable. At each step, fitting first used coarse step sizes (e.g., ±2000 for population sizes), and then finer ones, in an effort to avoid local minima.
Beginning with the base model described above, we first altered parameters that affect single-locus features of the data, which are influenced only by demography (not recombination): specifically, heterozygosity, allele frequency spectrum, fraction of ancestral/chimpanzee alleles, and FST. Of these, we first fit the West African allele frequency spectrum, since it is generally accepted that the human population originated in Africa, finding that models with a historical population expansion resulted in an improved fit to the data by increasing the fraction of low-frequency alleles. Next, we considered the remaining single-locus measures, and successively added parameters (primarily population bottlenecks, but also small amounts of continuing migration between populations, which served to reduce the genetic distances between populations) until the RMSE for the single-locus measures was 1.15. That is, based solely on the characteristics of single markers, the model fit the data nearly as well as the sampling error within the empirical observations, and much better than our base, standard neutral, model. Finally, the mutation rate in the model was tuned to match the observed heterozygosity (see Methods) (Sachidanandam et al. 2001
We then turned to the recombination model. In order to generate the considerable extent of LD seen in empirical data, we held the demographic parameters fixed and introduced variation in recombination rates, first by including observed large-scale variation, as measured in the deCODE genetic map (Kong et al. 2002 In total, approximately 2 billion coalescent trees were generated in the fitting process. The best-fitting set of parameters (Table 1; Fig. 2) yielded good agreement with all aspects of the observed data listed above. Agreement was not perfect: The RMSE between predicted values and the mean empirical values was on average 1.35 (that is, the disagreement was 35% larger than what would be seen solely based on random sampling of the empirical data). The agreement was, however, far superior to the discrepancy of 4.7 found with the standard neutral model. We explored the effects of re-estimating the parameters once all had been added to the model (i.e., iterating the fit); we found further improvements of the fit to be negligible. To our knowledge, this is the first time population genetic simulations have produced data that agree with multiple aspects of empirical data from multiple parts of the genome and in more than one population sample.
Having tuned our simulation, we next examined how well it predicted aspects of the data not used in the tuning process. First, we used the model to generate predictions for the same measures as above, but for the X-chromosome instead of the autosomes. We compared the results to an X-chromosome data set, derived from the same population samples as the autosomal data. Results for both our best-fitting model and the standard neutral model are shown in Figure 4. The statistical power of this data set for evaluating simulations is limited, but it is sufficient to demonstrate that the calibrated model does perform visibly better than the standard neutral model (RMSE = 0.97 for the best-fit model vs. 1.51 for the standard neutral model). The smaller effective population size of the X-chromosome (three-quarters that of the autosomes) makes it a useful test of how well the simulation models genetic drift, which is dependent on population size: The effect of the smaller population can be clearly seen in the larger genetic distances and the high fraction of X-linked markers that are monomorphic. Second, we looked at how well the calibrated model simulated haplotype blocks, contiguous stretches of chromosome observed to have very low rates of historical recombination and low haplotype diversity (Fig. 4L,M). Again the model performed very well (within the statistical limits of our ability to measure it), and much better than the standard neutral model (Phillips et al. 2003
Finally, we show in Figure 5 an application of our calibrated model to a study of human genetic variation, a search for evidence of positive selection in a set of
We have described the development of a particular calibrated model. Since this was not an exhaustive survey, it is certain that better-fitting parameters can be found even within the same model framework, and it is likely that a broad array of different models would also perform acceptably. The model parameters, therefore, do not represent inferences about real-world processes, but values that happen to generate useful simulations. Nevertheless, our experience in developing our model suggests, but does not prove, that some of its features are likely to recur in any successful simulation. In particular, we found that substantially increased coalescence in our non-African lineages was a necessary component. Thus, our best-fitting model had a probability of European coalescence of 22% (equivalent to a bottleneck with an inbreeding coefficient of 0.22 relative to the source population) (Reich et al. 2001 0.200.25. More detailed features of the model (e.g., migration rates, or the attribution of inbreeding into population size and discrete bottleneck), on the other hand, are likely to be quite variable between successful models. Similarly, a substantial non-uniformity in recombination rate was required, but the specifics of how the non-uniformity was implemented are not likely to be very meaningful.
The justification for this kind of model, therefore, is not that it enables us to draw conclusions about human history. There are, instead, two reasons for developing it. First, the ability to produce realistic-looking data is itself useful, regardless of the historical accuracy of the underlying model. It is true that there are many applications for which simple models of human genetic variation serve admirably. For many other applications, however, it is important to be able to produce simulated data that bear a close resemblance to empirical data. For example, simulated data sets for comparing haplotype-phasing algorithms, or for assessing the density of markers needed to provide good coverage for disease studies, are only useful if they accurately reflect LD patterns in human populations. In these cases, a calibrated model is greatly preferable to something like the standard neutral model, and the model presented here is already in use for studying haplotype phasing. We anticipate that it (and similar models, when they appear) will continue to be used, and will continue to be refined as additional comparisons with empirical data are made. Second, while the model is unlikely to reflect accurately the details of either demographic history or recombination, it does represent one of the many models that is consistent with what is currently known about human genetics. While it would be preferable to have the true model, for many purposes it is better to have one model that is consistent with data than to have none. For example, consider the X-chromosome data presented earlier (Fig. 4). In the absence of any model, or with only the standard neutral model as guidance, there is no way to interpret the differences between the X data and that from the autosomes. Are the increased monomorphism in non-African populations and the increased FST values to be expected from the smaller effective population of the X, or are they suggestive of positive selection acting on these loci? Given that our model, a randomly selected instance drawn from the space of consistent neutral models, predicts such features, there seems little need to invoke selection. Clearly, for more robust inference, especially in the case that model and data disagree, a broad range of calibrated models would be preferable, and we anticipate that such additional models will be developed in the near future.
Data sets The autosomal data set (previously described in Gabriel et al. 2002
Samples and genotyping for the genes shown in Figure 5 are described in Walsh et al. (2005
Empirical measures
Simulation software
Demographic model
Recombination model
SNP ascertainment Sensitivity to details of ascertainment modeling was tested by (1) varying the source-population composition of the public genome model by ±20% for each population, and (2) carrying out the simulation with only 1x or only 4x coverage. Variation in the source population, and keeping all model parameters fixed, resulted in RMS deviations of 1.361.40 (vs. 1.35 for the original model), and incorrectly specifying the coverage resulted in deviations of 1.39 and 1.45. Such modest changesall results met our threshold requirement for acceptable resultssuggest that details of the ascertainment model are not of great importance.
Parameter fitting
i is the RMSE for one of the 15 distributions (1 per population for r2, D', allele frequency spectrum, and fraction ancestral/chimpanzee; and 1 for each of the three genetic distances). i is defined as
ij is the empirical mean for the same value, is the standard error on the empirical mean, and the sum runs over the n bins in the distribution i. Note that, with these definitions, tot calculated from a random resampling (rather than from model predictions) of the empirical data is 1.0.
Not all parameters were of equal importance in the model. Setting the three least important parameters (the migration rates and the African bottleneck) to zero yielded agreement only slightly worse than our threshold value (1.55 times the standard error). In contrast, eliminating bottlenecks in the non-African lineages (whether supplied by an explicit bottleneck or by a small population size) produced discrepancies at least a factor of 2 larger than those seen in the full model. In the recombination model, eliminating either the regional variation or the discrete hotspots produced roughly the same loss of performance (discrepancy
We thank David Cutler for helpful comments, members of the Broad Institute's Program in Medical and Population Genetics for useful discussions, and the International Haplotype Map Consortium for financial support.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.3709305. Freely available online through the Genome Research Immediate Open Access option.
5 Corresponding author.
Akey, J.M., Eberle, M.A., Rieder, M.J., Carlson, C.S., Shriver, M.D., Nickerson, D.A., and Kruglyak, L. 2004. Population history and natural selection shape patterns of genetic variation in 132 genes. PLoS Biol. 2: e286.[CrossRef][Medline] Anderson, E.C. and Slatkin, M. 2004. Population-genetic basis of haplotype blocks in the 5q31 region. Am. J. Hum. Genet. 74: 4049.[CrossRef][Medline] Ardlie, K.G., Kruglyak, L., and Seielstad, M. 2002. Patterns of linkage disequilibrium in the human genome. Nat. Rev. Genet. 3: 299309.[CrossRef][Medline]
Collins, F.S., Brooks, L.D., and Chakravarti, A. 1998. A DNA polymorphism discovery resource for research on human genetic variation. Genome Res. 8: 12291231. Crawford, D.C., Bhangale, T., Li, N., Hellenthal, G., Rieder, M.J., Nickerson, D.A., and Stephens, M. 2004. Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat. Genet. 36: 700706.[CrossRef][Medline] Cullen, M., Perfetto, S.P., Klitz, W., Nelson, G., and Carrington, M. 2002. High-resolution patterns of meiotic recombination across the human major histocompatibility complex. Am. J. Hum. Genet. 71: 759776.[CrossRef][Medline] Frisse, L., Hudson, R.R., Bartoszewicz, A., Wall, J.D., Donfack, J., and Di Rienzo, A. 2001. Gene conversion and different population histories may explain the contrast between polymorphism and linkage disequilibrium levels. Am. J. Hum. Genet. 69: 831843.[CrossRef][Medline]
Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M., et al. 2002. The structure of haplotype blocks in the human genome. Science 296: 22252229. Hudson, R.R. 1990. Gene genealogies and the coalescent process. In Oxford surveys in evolutionary biology (eds. D.J. Futuyma and J. Antonovics), pp. 144. Oxford University Press, Oxford, UK.
Hudson, R.R. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18: 337338. Jeffreys, A.J., Kauppi, L., and Neumann, R. 2001. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat. Genet. 29: 217222.[CrossRef][Medline]
Kauppi, L., Sajantila, A., and Jeffreys, A.J. 2003. Recombination hotspots rather than population history dominate linkage disequilibrium in the MHC class II region. Hum. Mol. Genet. 12: 3340. Kong, A., Gudbjartsson, D.F., Sainz, J., Jonsdottir, G.M., Gudjonsson, S.A., Richardsson, B., Sigurdardottir, S., Barnard, J., Hallbeck, B., Masson, G., et al. 2002. A high-resolution recombination map of the human genome. Nat. Genet. 31: 241247.[CrossRef][Medline]
Marth, G.T., Czabarka, E., Murvai, J., and Sherry, S.T. 2004. The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics 166: 351372. May, C.A., Shone, A.C., Kalaydjieva, L., Sajantila, A., and Jeffreys, A.J. 2002. Crossover clustering and rapid decay of linkage disequilibrium in the Xp/Yp pseudoautosomal gene SHOX. Nat. Genet. 31: 272275.[CrossRef][Medline]
McVean, G.A., Myers, S.R., Hunt, S., Deloukas, P., Bentley, D.R., and Donnelly, P. 2004. The fine-scale structure of recombination rate variation in the human genome. Science 304: 581584. Phillips, M.S., Lawrence, R., Sachidanandam, R., Morris, A.P., Balding, D.J., Donaldson, M.A., Studebaker, J.F., Ankener, W.M., Alfisi, S.V., Kuo, F.S., et al. 2003. Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots. Nat. Genet. 33: 382387.[CrossRef][Medline] Pritchard, J.K. and Przeworski, M. 2001. Linkage disequilibrium in humans: Models and data. Am. J. Hum. Genet. 69: 114.[CrossRef][Medline] Reich, D.E., Cargill, M., Bolk, S., Ireland, J., Sabeti, P.C., Richter, D.J., Lavery, T., Kouyoumjian, R., Farhadian, S.F., Ward, R., et al. 2001. Linkage disequilibrium in the human genome. Nature 411: 199204.[CrossRef][Medline] Reich, D.E., Schaffner, S.F., Daly, M.J., McVean, G., Mullikin, J.C., Higgins, J.M., Richter, D.J., Lander, E.S., and Altshuler, D. 2002. Human genome sequence variation and the influence of gene history, mutation and recombination. Nat. Genet. 32: 135142.[CrossRef][Medline] Sabeti, P.C., Reich, D.E., Higgins, J.M., Levine, H.Z., Richter, D.J., Schaffner, S.F., Gabriel, S.B., Platko, J.V., Patterson, N.J., McDonald, G.J., et al. 2002. Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832837.[CrossRef][Medline] Sachidanandam, R., Weissman, D., Schmidt, S.C., Kakol, J.M., Stein, L.D., Marth, G., Sherry, S., Mullikin, J.C., Mortimore, B.J., Willey, D.L., et al. 2001. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409: 928933.[CrossRef][Medline]
Tenaillon, M.I., U'Ren, J., Tenaillon, O., and Gaut, B.S. 2004. Selection versus demography: A multilocus investigation of the domestication process in maize. Mol. Biol. Evol. 21: 12141225.
Wakeley, J. and Lessard, S. 2003. Theory of the effects of population structure and sampling on patterns of linkage disequilibrium applied to genomic data from humans. Genetics 164: 10431053. Wakeley, J., Nielsen, R., Liu-Cordero, S.N., and Ardlie, K. 2001. The discovery of single-nucleotide polymorphismsand inferences about human demographic history. Am. J. Hum. Genet. 69: 13321347.[CrossRef][Medline] Wall, J.D. and Pritchard, J.K. 2003a. Assessing the performance of the haplotype block model of linkage disequilibrium. Am. J. Hum. Genet. 73: 502515.[CrossRef][Medline] Wall, J.D. and Pritchard, J.K. 2003b. Haplotype blocks and linkage disequilibrium in the human genome. Nat. Rev. Genet. 4: 587597.[CrossRef][Medline]
Wall, J.D., Andolfatto, P., and Przeworski, M. 2002. Testing models of selection and demography in Drosophila simulans. Genetics 162: 203216. Walsh, E.C., Sabeti, P., Hutcheson, H., Fry, B., Schaffner, S.F., de Bakker, P.I.W., Varilly, P., Roy, J., Cooper, R., Zeng, Y., et al. 2005. Large-scale survey of variation and signals of selection in 168 immunological genes. Human Genetics (in press).
Wiuf, C. and Posada, D. 2003. A coalescent model of recombination hotspots. Genetics 164: 407417.
http://www.broad.mit.edu/
Received January 17, 2005; accepted in revised format May 17, 2005. This article has been cited by other articles:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||