Genome Res. 13:1873-1879, 2003
©2003 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/03 $5.00
Letter
A Population Threshold for Functional Polymorphisms
1 University of Washington Genome Center, Department of Medicine, Seattle, Washington 98195, USA
2 James Watson Institute of Zhejiang University, Hangzhou Genomics Institute, Key Laboratory of Bioinformatics of Zhejiang Province, Hangzhou 310007, China
3 Beijing Institute of Genomics, Center of Genomics and Bioinformatics, Chinese Academy of Sciences, Beijing 101300, China
4 Walter and Eliza Hall Institute of Medical Research, Parkville, Victoria 3050, Australia
5 The Institute of Human Genetics, University of Aarhus, DK-8000 Århus C, Denmark
We sequenced 114 genes (for DNA repair, cell cycle arrest, apoptosis, and detoxification) in a mixed human population and observed a sudden increase in the number of functional polymorphisms below a minor allele frequency of 6%. Functionality is assessed by considering the ratio of the number of nonsynonymous single nucleotide polymorphisms (SNPs) to the number of synonymous or intron SNPs. This ratio is steady from below 1% in frequency (the regime traditionally associated with rare Mendelian diseases) all the way up to about 6% in frequency, after which it falls precipitously. We consider possible explanations for this threshold effect. There are four candidates, as follows: (1) deleterious variants that have yet to be purified from the population, (2) balancing selection, in which a selective advantage accrues to the heterozygotes, (3) population-specific functional polymorphisms, and (4) adaptive variants that are accumulating in the population as a response to the dramatic environmental changes of the last 7,000 to 17,000 years.
The prevailing view in human genetics is that most polymorphisms are selectively neutral (Kimura 1983 5% of the genome, and even fewer have searched deep enough to identify polymorphisms that occur infrequently in the population. To the extent that it has been done, subtle anomalies in the polymorphism distribution at low frequencies (Cargill et al. 1999
Before proceeding, we must clarify the use of the word functional in the context of the single nucleotide polymorphisms (SNPs) that are the focus of our work. Functional SNPs can be located anywhere. We define four categories. SNPs found in the protein-coding regions are either nonsynonymous or synonymous, depending on whether they do or do not change the protein sequence. SNPs found in the non-protein-coding regions are either intron or intergenic. The essential point is that a randomly chosen nonsynonymous SNP is more likely to be functional than a randomly chosen synonymous, intron, or intergenic SNP. This assertion is supported by the observed frequencies of occurrence in the human population for the different SNP categories. Nonsynonymous SNPs tend to be found less frequently, given the availability of sites. In the analysis to be presented, synonymous and intron SNPs both behave as expected by neutral theory. This does not imply that they are never functional, only that an extremely small fraction of them are functional; hence, to a first approximation, either can be used to establish a neutral theory baseline to normalize out the complexities of population history. Given the lower quality of our intron data, we focus on the NON/SYN ratio for the number of nonsynonymous to synonymous SNPs. Under neutral theory, this ratio is expected to be constant as a function of the minor allele frequency. The fact that it is not is the main point of interest.
Cargill et al. (1999
A total of 114 genes were resequenced under the Environmental Genome Project, with a focus on genes that are implicated in DNA repair, cell cycle arrest, apoptosis, and detoxification. The entire list is at http://www.genome.washington.edu/projects/egpsnps
For comparison with the published data, we correct for variations in sample depth and sequence length by estimating the mutation parameter
$\theta(\mathrm{NON}) = 3.23 \times 10^{-4}$, $\theta(\mathrm{SYN}) = 2.67 \times 10^{-4}$, and $\theta(\mathrm{INT}) = 5.89 \times 10^{-4}$, for nonsynonymous, synonymous, and intron SNPs, respectively. When we further adjust for the fact that there is a 0.775 probability of any substitution in the coding region of our specific genes to be nonsynonymous, we find that $\theta(\mathrm{NON}) = 4.17 \times 10^{-4}$ and $\theta(\mathrm{SYN}) = 11.9 \times 10^{-4}$. These numbers are in good agreement with Cargill et al. (1999 $\theta(\mathrm{NON}) = 3.59 \times 10^{-4}$ and $\theta(\mathrm{SYN}) = 10.0 \times 10^{-4}$, despite the different genes and population samples.
The main point of departure in our analyses of the NON/SYN ratio as a function of the minor allele frequency is that we partition the frequency axis into five histogram bins instead of three. One must be sensitive to the limitations imposed by having sampled only 44 diploid individuals in so many genes, as it means that the minor allele frequencies are essentially discretized in multiples of 1/(2 × 44) = 0.0114. It is critical that every histogram bin capture at least one of these discrete units. Specifically, we partitioned the frequency axis at 0.0000, 0.0126, 0.0280, 0.0614, 0.2346, and 0.5000. Bin 1 is meant to capture the singlets, that is, those SNPs that are observed in only one heterozygote. To accommodate the occasional sample failures, we set the upper bound to 0.0126 instead of 0.0114. The other four bins are designed to each capture an approximately equal number of coding SNPs, to equalize their statistical properties. We had the computer look at the actual distribution of allele frequencies above 0.0126, and then set the remaining partitions to make the number of coding SNPs per bin as uniform as possible. As so defined, bin 2 captures the doublets, whereas bin 3 captures the triplets, quadruplets, and quintuplets.
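As a minimal sketch of the bookkeeping described above, the following Python snippet reproduces the per-site adjustment and assigns a minor allele count to one of the five bins. The function names (`adjust_theta`, `bin_for_count`) are illustrative and not part of the original analysis. Reading the adjustment as a division by the fraction of nonsynonymous (0.775) or synonymous (0.225) substitutions is an assumption, although it does reproduce the reported values.

```python
from bisect import bisect_left

N_CHROMOSOMES = 2 * 44  # 44 diploid individuals
# Upper edges of the five frequency bins quoted in the text.
BIN_UPPER_EDGES = [0.0126, 0.0280, 0.0614, 0.2346, 0.5000]
P_NONSYN = 0.775  # probability that a coding substitution is nonsynonymous


def adjust_theta(theta_non, theta_syn, p_nonsyn=P_NONSYN):
    """Rescale per-coding-site estimates to per-nonsynonymous-site and
    per-synonymous-site values (assumed interpretation of the adjustment)."""
    return theta_non / p_nonsyn, theta_syn / (1.0 - p_nonsyn)


def bin_for_count(minor_allele_count, n_chromosomes=N_CHROMOSOMES):
    """Return the 1-based bin index for a SNP whose minor allele was
    observed `minor_allele_count` times among `n_chromosomes`."""
    freq = minor_allele_count / n_chromosomes
    if not 0.0 < freq <= 0.5:
        raise ValueError("minor allele frequency must lie in (0, 0.5]")
    # bisect_left finds the first upper edge that is >= freq.
    return bisect_left(BIN_UPPER_EDGES, freq) + 1


theta_non, theta_syn = adjust_theta(3.23e-4, 2.67e-4)
print(f"{theta_non:.2e} {theta_syn:.2e}")
print([bin_for_count(k) for k in range(1, 7)])
```

With 88 chromosomes, counts of 1 through 6 fall into bins 1, 2, 3, 3, 3, and 4, which matches the description of singlets, doublets, and triplets through quintuplets.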
Figure 1 shows that, as a function of minor allele frequency, the NON/SYN ratio exhibits a threshold at a frequency of 0.0614,
One of the motivations for studying ratios like NON/SYN, instead of comparing NON or SYN directly to neutral theory expectations, is that the expected allele frequency distribution is confounded by the complexities of population history, which tend to affect the results at the low-frequency end, where we observe the threshold. Ratios offer a built-in control against much of this complexity, but to be safe, we wanted a second control, to ensure that nothing unusual is happening with the synonymous distribution itself, and that it behaves in an approximately neutral manner. Intron SNPs are the solution. As we show in Figure 1, comparisons against the intron data reveal that the threshold effect is due to changes in the nonsynonymous distribution, not changes in the synonymous distribution. This suggests that some kind of selection might be involved.
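The normalization argued for above can be sketched in a few lines of Python. The helper names are hypothetical, and the counts in the usage lines are placeholders invented only to exercise the code; they are not the counts observed in the study.

```python
def ratios_by_bin(non_counts, reference_counts):
    """Per-bin ratio of nonsynonymous SNP counts to a presumed neutral
    reference class (synonymous or intron SNP counts)."""
    if len(non_counts) != len(reference_counts):
        raise ValueError("both count lists must cover the same bins")
    return [n / r for n, r in zip(non_counts, reference_counts)]


def relative_to_last_bin(ratios):
    """Normalize by the highest-frequency bin, so that a flat (neutral)
    profile becomes all ones and low-frequency enrichment appears as > 1."""
    return [r / ratios[-1] for r in ratios]


# Placeholder counts for five bins (NOT data from the paper).
non_counts = [30, 20, 20, 10, 10]
syn_counts = [10, 10, 10, 10, 10]
print(relative_to_last_bin(ratios_by_bin(non_counts, syn_counts)))
```

Running the same two functions with intron counts as the reference class gives the second control: if the NON/INT profile shows the same step as NON/SYN, the change is attributable to the nonsynonymous class.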
Because not every nonsynonymous SNP is functional, we asked whether there is a second threshold effect in the probability that a nonsynonymous SNP is functional. We estimated this probability from the extent to which the polymorphic site is conserved across all available homologs in the public databases, using the program SIFT (Ng and Henikoff 2001
We can also test for departures from neutrality by determining the ancestral allele for each SNP, on the basis of orthologous chimpanzee and gorilla sequences. One chimpanzee and one gorilla sample were resequenced. Our experiments were successful in 544 of 616 coding SNPs. Of the rejects, 3 were eliminated because the primate alleles did not match either human allele, and 11 were ambiguous in that both human alleles were observed in the primates. PCR failures accounted for the remainder of the failures. The neutral theory expectation is that the probability of any allele being ancestral is equal to its frequency in the population (Watterson and Guess 1977
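The neutral expectation stated above lends itself to a simple tally, sketched here in Python. The function names and the example inputs are hypothetical; the sketch only assumes that each SNP comes with a minor allele frequency and a flag saying whether the minor allele matched the primate (ancestral) allele.

```python
def expected_minor_ancestral(minor_allele_freqs):
    """Under neutrality, P(allele is ancestral) equals its frequency, so the
    expected number of SNPs whose minor allele is ancestral is the sum of
    the minor allele frequencies."""
    return sum(minor_allele_freqs)


def observed_minor_ancestral(minor_is_ancestral):
    """Count the SNPs whose minor allele was called ancestral."""
    return sum(1 for flag in minor_is_ancestral if flag)


# Placeholder inputs for three SNPs (NOT data from the paper).
freqs = [0.1, 0.2, 0.5]
flags = [False, False, True]
print(expected_minor_ancestral(freqs), observed_minor_ancestral(flags))
```

Applying the two functions separately within each frequency bin, and separately to nonsynonymous and synonymous SNPs, shows where the observed count departs from the neutral expectation.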
Proposed Explanations
The original explanation that enrichment in low-frequency nonsynonymous SNPs is due to deleterious variants that have yet to be purified from the population must be re-examined in light of this 6% threshold. In the standard model, the likelihood that a simple recessive deleterious allele would reach 1% in frequency in a large population is thought to be astronomically small (Zwick et al. 2000
Perhaps the simplest explanation is just to admit that the longstanding association of Mendelian diseases with variants of below 1% in frequency was always arbitrary, and that deleterious alleles and/or balancing selection are far more prevalent than commonly thought. If we had to choose, we would choose the former. Lack of change in NON/SYN below 6% implies a continuum of Mendelian diseases, presumably with reduced disease severity as allele frequencies increase. This follows from the equation for the equilibrium frequency in recessive deleterious alleles, f =
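For orientation, the standard textbook result for mutation-selection balance at a fully recessive deleterious locus is sketched below. It is assumed, not confirmed, to be the relation the text refers to. The symbols $\mu$ (mutation rate per generation) and $s$ (selection coefficient against the homozygote) follow common textbook notation rather than the authors' own.

```latex
\hat{f} \approx \sqrt{\frac{\mu}{s}}
\qquad\Longrightarrow\qquad
s \approx \frac{\mu}{\hat{f}^{2}}
```

Under this relation, and at a fixed mutation rate, raising the equilibrium frequency from 1% to 6% corresponds to a selection coefficient that is smaller by a factor of 36, which is in line with the idea of reduced disease severity at higher allele frequencies.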
The first explanation relies on the fact that some polymorphisms are found in all populations, and others are found only in specific populations. Suppose that synonymous SNPs have a frequency dependence S(f), and nonsynonymous SNPs have a frequency dependence [N1 + N2] · S(f), in which N1 is population independent and N2 is population specific. In a pure population, there would be no threshold effect. But, in a mixture of M populations, in which for simplicity we assume identical parameters for all populations, any population-specific SNP of frequency f in the source population would have an apparent frequency of f' = f/M when averaged over the mixture. Hence, there would be two components to the NON/SYN ratio. For f' > 0.5/M, this ratio would be N1, whereas for f'
We can estimate the appropriate parameters from our observed data. If we assume that M = 8, the apparent threshold frequency would be 0.0625. Given that the HDP panel is a 2:1:1 mixture of three major populations (European, African, Asian), one might argue that that is not a justifiable assumption, and that the observed 6% threshold frequency is, at worst, a few times smaller than it would have been in a pure population. But, given the method by which these samples were collected, we cannot rule out additional complexity in the underlying population structure. Moving on, if one assumes that f is not too close to zero, the standard model would predict that S(f)
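The dilution argument can be checked numerically with a short Python sketch. The equal-mixture functions follow the f' = f/M relation given in the text. The weighted version, in which a SNP private to a population with mixture weight w has apparent frequency w·f, is an extension assumed here for illustration and is not stated in the text.

```python
def apparent_frequency(source_freq, n_populations):
    """Apparent frequency f' = f / M of a population-specific SNP in an
    equal mixture of M populations."""
    return source_freq / n_populations


def threshold_frequency(n_populations, max_minor_freq=0.5):
    """Largest apparent minor allele frequency a population-specific SNP
    can reach in an equal mixture of M populations."""
    return max_minor_freq / n_populations


def thresholds_for_weights(weights, max_minor_freq=0.5):
    """Assumed generalization to unequal mixtures: one threshold per
    population, proportional to that population's share of the sample."""
    total = sum(weights)
    return [max_minor_freq * w / total for w in weights]


print(threshold_frequency(8))             # equal mixture with M = 8
print(thresholds_for_weights([2, 1, 1]))  # 2:1:1 mixture
```

For M = 8 the threshold is 0.0625, as quoted above. For a 2:1:1 mixture the assumed weighted version gives thresholds of 0.25, 0.125, and 0.125, which are higher than the observed 6% unless additional substructure is present.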
Should the first explanation be too disturbing, we can offer a second, on the basis of the temporal dynamics of how adaptive variants are fixed. Notice that by adaptive, we mean in the evolutionary sense, which has no implications for post-reproductive phenotypes. In fact, it has been argued that loss-of-function might be the preferred adaptive response in a rapidly changing environment (Olson 1999
Regardless of the parameter settings or the mode of inheritance, transitions from frequencies of 0.1 to 0.9 are always fast, taking just a few dozen generations. In dominant mode, the favored allele quickly rises to 0.9 in frequency, and then it slows down, taking another 500 generations to reach its asymptote. Conversely, in recessive mode, the initial drift to 0.1 in frequency takes 500 generations or more, but afterward, the favored allele is quickly fixed. Precisely when this rapid transition from 0.1 to 0.9 occurs is sensitive to the initial frequency, which may be larger than one over the population size, as genetic variations can accumulate through mutation-selection balance (Orr and Betancourt 2001
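The qualitative behavior described above can be explored with the textbook deterministic single-locus selection recursion, sketched here in Python. This is not necessarily the model the authors ran. The selection coefficient `s = 0.1`, the frequency checkpoints, and the function names are illustrative assumptions, so the generation counts printed will differ from the figures quoted in the text.

```python
def next_frequency(p, s, mode):
    """One generation of deterministic selection on a favored allele at
    frequency p, assuming random mating and an infinite population."""
    q = 1.0 - p
    if mode == "dominant":       # fitnesses: AA = Aa = 1 + s, aa = 1
        w_hom, w_het = 1.0 + s, 1.0 + s
    elif mode == "recessive":    # fitnesses: AA = 1 + s, Aa = aa = 1
        w_hom, w_het = 1.0 + s, 1.0
    else:
        raise ValueError("mode must be 'dominant' or 'recessive'")
    mean_fitness = p * p * w_hom + 2.0 * p * q * w_het + q * q
    return (p * p * w_hom + p * q * w_het) / mean_fitness


def generations_between(p_start, p_end, s, mode, max_generations=1_000_000):
    """Number of generations for the favored allele to rise from p_start
    to at least p_end."""
    p, generations = p_start, 0
    while p < p_end:
        if generations >= max_generations:
            raise RuntimeError("target frequency not reached")
        p = next_frequency(p, s, mode)
        generations += 1
    return generations


s = 0.1  # illustrative value only
for mode in ("dominant", "recessive"):
    phases = [generations_between(a, b, s, mode)
              for a, b in ((0.01, 0.1), (0.1, 0.9), (0.9, 0.99))]
    print(mode, phases)
```

The three numbers printed for each mode are the durations of the early, middle, and late phases. The slow phase is at the end for a dominant favored allele and at the beginning for a recessive one.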
An important point worth emphasizing is that we are envisioning a transient effect from a massive one-time change in the environment. In contrast, most theoretical models, including the more sophisticated Poisson random field (Bustamante et al. 2001
If we multiply 500 generations by a nominal human generation time of 20 yr, we get 10,000 yr. This neatly coincides with the end of the last ice age, the melting of the glaciers, and the development of agriculture, all of which happened 7,000
One might wonder why such a simple effect had never been observed before. The answer is that surprisingly few people have been in any position to look for it. Despite the massive amounts of SNP (Sachidanandam et al. 2001
As for our more radical explanations, there is a major caveat on the applicability of the second explanation regarding adaptive selection to agriculture. We only sequenced 114 genes. This is a small subset of the 30,000 to 40,000 genes in the genome, and it is a biased subset at that, carefully selected for potential relevance to environmental diseases. Recall the assumptions that went into our model. We required the existence of a selection coefficient on the high end of what had been measured. Because most gene knockouts show no obvious phenotypes (Tautz 2000
Aside from being a potentially important clue to understanding the basis of human genetic variation, the threshold has a practical implication for the HapMap Project, which hopes to unravel the genetic basis of important complex diseases (Couzin 2002
There will be no definitive answers until a larger number of genes are studied in a larger population sample with a better-defined history. Nevertheless, it is instructive how, by focusing the experiments on SNPs that are likely to be functional, and dividing out the complexities of population history, one can discover an interesting anomaly that stands as a challenge to the existing conceptual framework.
PCR primers were placed in the introns. The objective was to acquire, along with the targeted exon, 100 bp of flanking intron sequence on each side. Short 1-kb amplicons were chosen. Sequencing was performed by capillary electrophoresis and dye-terminator chemistry. We took the NHGRI/Coriell Human Diversity Panel (HDP) as a representative of all the major populations. For ancestral alleles, we used the human-specific primers to sequence chimpanzee and gorilla. Initial polymorphism detection was done by PolyPhred (Nickerson et al. 1997
We thank Maynard Olson, Joe Felsenstein, and Pauline Ng for their help with the manuscript. This project was sponsored by the National Institute of Environmental Health Sciences (Grant no. 1 RO1 ES09909) and the National Human Genome Research Institute (Grant no. 1 P50 HG02351). The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1324303.
6 Corresponding authors. E-MAIL gksw{at}u.washington.edu; FAX (206)685-7344. E-MAIL junyu{at}u.washington.edu; FAX (206)685-7344.
Bailey, J.A., Gu, Z., Clark, R.A., Reinert, K., Samonte, R.V., Schwartz, S., Adams, M.D., Myers, E.W., Li, P.W., and Eichler, E.E. 2002. Recent segmental duplications in the human genome. Science 297: 1003-1007.
Barbujani, G., Magagni, A., Minch, E., and Cavalli-Sforza, L.L. 1997. An apportionment of human DNA diversity. Proc. Natl. Acad. Sci. 94: 4516-4519.
Bustamante, C.D., Wakeley, J., Sawyer, S., and Hartl, D.L. 2001. Directional selection and the site-frequency spectrum. Genetics 159: 1779-1788.
Cargill, M., Altshuler, D., Ireland, J., Sklar, P., Ardlie, K., Patil, N., Shaw, N., Lane, C.R., Lim, E.P., Kalyanaraman, N., et al. 1999. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. 22: 231-238.
Collins, F.S., Brooks, L.D., and Chakravarti, A. 1998. A DNA polymorphism discovery resource for research on human genetic variation. Genome Res. 8: 1229-1231.
Cook, G.S. and Hill, A.V. 2001. Genetics of susceptibility to human infectious disease. Nat. Rev. Genet. 2: 967-977.
Couzin, J. 2002. New mapping project splits the community. Science 296: 1391-1393.
Diamond, J. 2002. Evolution, consequences and future of plant and animal domestication. Nature 418: 700-707.
Fay, J.C., Wyckoff, G.J., and Wu, C.I. 2001. Positive and negative selection on the human genome. Genetics 158: 1227-1234.
Halushka, M.K., Fan, J.B., Bentley, K., Hsie, L., Shen, N., Weder, A., Cooper, R., Lipshutz, R., and Chakravarti, A. 1999. Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nat. Genet. 22: 239-247.
Harpending, H. and Rogers, A. 2000. Genetic perspectives on human origins and differentiation. Annu. Rev. Genomics Hum. Genet. 1: 361-385.
Hartl, D.L. 2000. A primer of population genetics. 3d ed. Sinauer Associates, Sunderland, MA.
Kimura, M. 1983. The neutral theory of molecular evolution. Cambridge University Press, Cambridge, UK.
Kingsolver, J.G., Hoekstra, H.E., Hoekstra, J.M., Berrigan, D., Vignieri, S.N., Hill, C.E., Hoang, A., Gibert, P., and Beerli, P. 2001. The strength of phenotypic selection in natural populations. Am. Nat. 157: 245-261.
Lander, E.S. 1996. The new genomics: Global views of biology. Science 274: 536-539.
Ng, P.C. and Henikoff, S. 2001. Predicting deleterious amino acid substitutions. Genome Res. 11: 863-874.
Nickerson, D.A., Tobe, V.O., and Taylor, S.L. 1997. PolyPhred: Automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. Nucleic Acids Res. 25: 2745-2751.
Olson, M.V. 1999. When less is more: Gene loss as an engine of evolutionary change. Am. J. Hum. Genet. 64: 18-23.
Orr, H.A. and Betancourt, A.J. 2001. Haldane's sieve and adaptation from the standing genetic variation. Genetics 157: 875-884.
Pritchard, J.K. and Cox, N.J. 2002. The allelic architecture of human disease genes: Common disease-common variant... or not? Hum. Mol. Genet. 11: 2417-2423.
Przeworski, M., Hudson, R.R., and Di Rienzo, A. 2000. Adjusting the focus on human variation. Trends Genet. 16: 296-302.
Reich, D.E. and Lander, E.S. 2001. On the allelic spectrum of human disease. Trends Genet. 17: 502-510.
Rosenberg, N.A., Pritchard, J.K., Weber, J.L., Cann, H.M., Kidd, K.K., Zhivotovsky, L.A., and Feldman, M.W. 2002. Genetic structure of human populations. Science 298: 2381-2385.
Rutherford, S.L. and Lindquist, S. 1998. Hsp90 as a capacitor for morphological evolution. Nature 396: 336-342.
Sachidanandam, R., Weissman, D., Schmidt, S.C., Kakol, J.M., Stein, L.D., Marth, G., Sherry, S., Mullikin, J.C., Mortimore, B.J., Willey, D.L., et al. 2001. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409: 928-933.
Stephens, J.C., Schneider, J.A., Tanguay, D.A., Choi, J., Acharya, T., Stanley, S.E., Jiang, R., Messer, C.J., Chew, A., Han, J.H., et al. 2001. Haplotype variation and linkage disequilibrium in 313 human genes. Science 293: 489-493.
Tautz, D. 2000. A genetic uncertainty problem. Trends Genet. 16: 475-477.
Terwilliger, J.D. and Weiss, K.M. 1998. Linkage disequilibrium mapping of complex disease: Fantasy or reality? Curr. Opin. Biotechnol. 9: 578-594.
Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520-562.
Watterson, G.A. and Guess, H.A. 1977. Is the most frequent allele the oldest? Theor. Popul. Biol. 11: 141-160.
Weiss, K.M. and Clark, A.G. 2002. Linkage disequilibrium and the mapping of complex human traits. Trends Genet. 18: 19-24.
Wright, A., Charlesworth, B., Rudan, I., Carothers, A., and Campbell, H. 2003. A polygenic basis for late-onset disease. Trends Genet. 19: 97-106.
Yu, J., Yang, Z., Kibukawa, M., Paddock, M., Passey, D.A., and Wong, G.K.S. 2002. Minimal introns are not "junk". Genome Res. 12: 1185-1189.
Zwick, M.E., Cutler, D.J., and Chakravarti, A. 2000. Patterns of genetic variation in Mendelian and complex traits. Annu. Rev. Genomics Hum. Genet. 1: 387-407.
http://www.genome.washington.edu/projects/egpsnps; University of Washington Genome Center Repository of Candidate-Gene Polymorphisms for Environmental Genome Project (EGP).
Received March 7, 2003; accepted in revised form June 4, 2003.