Genome Research

Genome Res. 15:1211-1221, 2005
©2005 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/05 $5.00

Letter

Fitting background-selection predictions to levels of nucleotide variation and divergence along the human autosomes

Floyd A. Reed1,3,4, Joshua M. Akey2 and Charles F. Aquadro1

1 Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York 14853, USA
2 Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA


    ABSTRACT
The roles of positive directional selection (selective sweeps) and negative selection (background selection) in shaping the genome-wide distribution of genetic variation in humans remain largely unknown. Here, we optimize the parameter values of a model of the removal of deleterious mutations (background selection) to observed levels of human polymorphism, controlling for mutation rate heterogeneity by using interspecific divergence. A point of "best fit" was found between background-selection predictions and estimates of human effective population sizes, with reasonable parameter estimates whose uncertainty was assessed by bootstrapping. The results suggest that the purging of deleterious alleles has had some influence on shaping levels of human variation, although the effects may be subtle over the majority of the human genome. A significant relationship was found between background-selection predictions and measures of skew in the allele frequency distribution. The genome-wide action of selection (positive and/or negative) is required to explain this observation.


Levels of human nucleotide polymorphism are positively correlated with the physical density of genetic recombination in humans (Nachman et al. 1998; Przeworski et al. 2000; Nachman 2001; Hellmann et al. 2003). This correlation also exists in many other eukaryotic species including Drosophila (e.g., Begun and Aquadro 1992; for reviews, see Andolfatto 2001; Aquadro et al. 2001; Schlötterer 2002). Two principal nonexclusive classes of hypotheses posed to explain this observation are (1) genetic recombination is, or is correlated with, a mutagenic process (Lercher and Hurst 2002; Waterston et al. 2002; Hardison et al. 2003; Hellmann et al. 2003); and (2) recombination allows increased independence from the effects of widespread diversity-reducing selection (for reviews, see Andolfatto 2001; Aquadro et al. 2001). The primary selective processes hypothesized to explain the apparent reduction of variation in regions of low recombination are (2a) background selection associated with the ongoing selective removal of new deleterious mutations from the population and (2b) hitchhiking associated with positive directional selection.

Background selection is related to the classical concept of purging a mutational load (Haldane 1937; Muller 1950; Crow 1958; Charlesworth et al. 1993). Assuming a uniform distribution of deleterious mutations across the genome, regions of lower recombination will have an increased probability of linkage between neutral and deleterious variants; therefore, the probability of the removal of neutral polymorphism from the population is increased (but see also Palsson and Pamilo 1999). Background selection thus predicts reductions in levels of variation in regions of rarer crossing over. The effects of background selection can be summarized as the fraction of neutral variation remaining (f0) after the reducing effects of selection on linked deleterious mutants, which is often thought of as a regional reduction in the effective population size (Ne). The predicted amount of neutral polymorphism removed from a population increases with the deleterious mutation rate (u). The amount of variation removed also increases as the strength of selection against heterozygous deleterious mutants (sh) decreases, because weakly deleterious alleles persist longer in the population before they are ultimately removed (Kimura et al. 1963; Crow and Simmons 1983; Charlesworth et al. 1993). Indeed, the ability to provide some degree of independence among loci in order to more efficiently purge deleterious alleles has been considered one of the primary reasons for the evolution of multiple chromosomes and meiotic recombination (e.g., Felsenstein 1974; Kondrashov 1988; Charlesworth 1990; Antezana and Hudson 1997). Furthermore, to the extent that deleterious mutations are known to be a frequent occurrence, the process of background selection has been proposed as the appropriate null selective model to reject in favor of hitchhiking in explaining the correlation between diversity and recombination (Charlesworth et al. 1993; Stephan 1995; Hamblin and Aquadro 1996).
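The qualitative dependence of f0 on the deleterious mutation rate, the strength of selection, and recombination can be illustrated with a commonly quoted closed-form approximation for a neutral site in the middle of a region, f0 ≈ exp(-U / (2sh + R)), where U is the diploid deleterious mutation rate of the region and R its total map length in Morgans. This is only a sketch under stated assumptions: the function name and all parameter values are hypothetical, conventions for the factors of 2 vary between treatments, and it is not the modified Hudson and Kaplan (1995) model fitted in this report.

```python
import math

def fraction_remaining(U, sh, R):
    """Approximate fraction of neutral variation left (f0).

    U  : diploid deleterious mutation rate for the region (assumed units)
    sh : selection against heterozygous deleterious mutants
    R  : total map length of the region in Morgans
    Simplified approximation, not the model fitted in the paper.
    """
    return math.exp(-U / (2.0 * sh + R))

# Hypothetical values chosen only to show the direction of each effect.
for R in (0.0, 0.01, 0.1, 1.0):
    print(f"R={R:<5} f0={fraction_remaining(U=0.01, sh=0.02, R=R):.3f}")
```

Under this approximation f0 rises toward 1 as recombination increases, and falls as U increases or as sh decreases, which matches the verbal description above.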

Hitchhiking refers to the rapid increase in frequency of genetic variants linked to a positively selected rare allele. This process is predicted to reduce genetic polymorphism, with the size and amount of the reduction determined primarily by the strength of selection and local rates of recombination (Maynard Smith and Haigh 1974; Kaplan et al. 1989; Stephan et al. 1992; Durrett and Schweinsberg 2004). Apparent instances of hitchhiking are reported for several loci in humans, based on deviations from both flanking levels of variation and the expectation of a steady-state allele frequency distribution (for reviews, see Aquadro et al. 2001; Bamshad and Wooding 2003). However, positive selection cannot reasonably explain all of the changes in genetic variation in humans because the required positively selected mutation rate would have to equal or exceed the neutral mutation rate (Andolfatto 2001).



Figure 1. Plots of the bootstrapping distribution of the estimated deleterious mutation rate (û/Mb), strength of selection (sh), and the effective population size without selection (N̂0). The best-fit value for all the data is contained in the weight of the outcomes. û and sh appear to be positively correlated (i.e., as the deleterious mutation rate increases, selection strength must also increase to maintain a similar outcome). û and N̂0 also appear to be correlated (as the deleterious mutation rate increases, the effective population size must also increase to maintain a similar outcome). A small number of points that were widely dispersed below sh = 0.00001 and/or û/Mb = 0.00001 are not included in the plot.

 
Hellmann et al. (2003) found that changes in mutation rate, correlated with recombination rates, are large enough to account for the genome-wide positive correlation between changes in polymorphism and rates of recombination in humans. However, mutation rate changes associated with recombination can only explain 6% of the total variation in polymorphism. Humans appear to have a high deleterious mutation rate (e.g., Eyre-Walker and Keightley 1999; Nachman and Crowell 2000), and genome-wide integrated genetic-physical maps are available (Kong et al. 2002); thus, it is natural to ask how much, if any, of the remaining variation in polymorphism along the chromosomes, after correcting for mutation rate heterogeneity, can be explained by background selection.

A great deal of uncertainty surrounds the parameter values necessary for creating baseline background-selection predictions along the human chromosomes. Therefore, in this report, we optimize the parameters of a background-selection model to observed levels of human polymorphism using a weighted least-squares regression and estimate the uncertainty associated with our parameter estimates by bootstrapping. We control for mutation rate heterogeneity by using human-chimpanzee divergence to estimate an effective population size (N̂e) for each genic region. The parameter estimates of these predictions are of interest in that they allow estimates of the fraction of the genome under constraint (i.e., purifying selection) and of the strength of selection against deleterious mutants, and they affect how we interpret skewed allele frequency spectra. The predicted levels of variation can be used as a null hypothesis representing the genome-wide effects of neutral and deleterious variation. Locus-specific departures can be evaluated against this prediction as candidates for positive selection. Additionally, these predictions can be used to select regions of the genome that maximize or minimize the region-specific effective population size in order to study various selective and demographic processes.


    Results
We estimated effective population sizes (N̂e) of 126 autosomal loci (SeattleSNPs 2005, http://pga.gs.washington.edu) using average nucleotide heterozygosity (π) and human-chimpanzee levels of nucleotide divergence (d, equation 6 in Methods; Supplemental Table 1). The parameters of a background-selection model (the deleterious mutation rate u, the strength of selection against heterozygous deleterious mutants sh, and the effective population size without selection N0; slightly modified from Hudson and Kaplan 1995) were varied over a grid of parameter values, incorporating physical locations and changes in rates of recombination across the majority of the human genome (Kong et al. 2002). The point of "best fit" between the estimated and expected Ne (i.e., the maximum variance that can be explained by the model) as measured by r² was recorded (equations 7 and 8 in Methods). After this optimization, a maximum correlation between the nucleotide polymorphism and divergence data and model predictions was found with a per generation deleterious mutation rate of û/Mb = 0.0016 (genomic Û = 2 × 0.0016 û/Mb × 3200 Mb = 10.2), a relative strength of selection against heterozygous deleterious mutants of sh = 0.018, and an effective population size before selection of N̂0 = 32,800 (Fig. 1). Predictions of the gene-specific effective population sizes were reduced from strict neutral levels (N̂0) to effective population sizes of 30,800 to 11,800 for the loci included (Supplemental Table 1). This range of population sizes leads to as much as a 62% reduction in polymorphism among the observed loci predicted by background selection alone. However, at these best-fit values, the model of background selection can explain only r² = 7.9% of the actual variation in effective population size estimates among loci (equation 8 in Methods). The correlation between observed and predicted Ne values is not statistically significant (determined by nonparametric bootstrapping, P = 0.28). Nevertheless, we consider the observed relationship to be biologically meaningful for three reasons. First, bootstrapping is typically conservative because chance outliers (and perhaps loci affected by other forms of selection) in the total data set are overrepresented in a subset of the pseudo-samples (Efron and Tibshirani 1993). Second, reasonable parameter values are found that are consistent with previous literature and/or theoretical expectations (see below). Third, significant correlations between background-selection predictions and distortions in the allelic spectrum were found that require a role of selection to explain (also below).
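The arithmetic behind two of the figures quoted above can be checked directly. The sketch below uses only numbers stated in the text; the variable names are illustrative, and the reading of the 62% figure as the reduction between the highest and lowest predicted locus-specific effective population sizes is an assumption.

```python
# Best-fit per-megabase deleterious mutation rate and genome size from the text.
u_per_mb = 0.0016      # per generation, per single-copy megabase
genome_mb = 3200       # single-copy genome size in Mb

# Diploid genomic rate: two copies of every megabase.
U_genomic = 2 * u_per_mb * genome_mb

# Range of predicted locus-specific effective population sizes.
ne_high, ne_low = 30_800, 11_800
max_reduction = 1 - ne_low / ne_high   # assumed meaning of the 62% figure

print(f"genomic U = {U_genomic:.2f}")
print(f"largest predicted reduction among loci = {max_reduction:.1%}")
```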



Figure 2. Predicted (lines) and estimated (circles) effective population size estimates (Ne) along the human autosomes under the model of background selection (equation 2). The upper and lower 90% bootstrapping-based confidence intervals are solely for the background-selection estimates. The deviations of individual gene regions depend on evolutionary variance and their individual sampling properties. Filled black circles correspond to positive Tajima's D values, filled gray circles correspond to Tajima's D values between 0 and -1; filled white circles correspond to Tajima's D values <-1.

 
Assessing parameter uncertainty by bootstrapping
In order to estimate the uncertainty associated with our parameter estimates, we carried out nonparametric bootstrapping, which produced a "cloud" of points clustered around the original estimate (Fig. 1). Confidence intervals can be estimated from the bootstrapping outcomes using the "reflection" method by removing the same number of outcomes from the upper and lower edges of the bootstrapping distribution (Efron and Tibshirani 1993). The 90% confidence interval for the per generation deleterious mutation rate is 7.7 × 10⁻⁵ to 4.7 × 10⁻³ per single-copy megabase or 0.49 to 30 per diploid genome complement (2 × 3200 Mb). This is consistent with previous lower-bound estimates of the genomic deleterious mutation rate based on relative ratios of nonsynonymous to synonymous substitutions among humans and other primates (e.g., Û = 1.6-3.1 [Eyre-Walker and Keightley 1999]; Û = 1.5-4.0 [Nachman and Crowell 2000]) and is higher than the deleterious mutation rate estimated from amino-acid-altering protein electrophoretic band morph mutations (Û = 0.4 [Neel et al. 1988; Keightley and Eyre-Walker 1999]).
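A minimal sketch of the interval construction described above: resample the data with replacement, recompute the statistic, then trim the same number of outcomes from each tail of the sorted bootstrap distribution. The function names and the toy usage are hypothetical, and whether the authors applied any step beyond this equal-tail trimming is not stated in this section.

```python
import random
import statistics

def bootstrap(data, statistic, n_rep=1000, seed=0):
    """Recompute `statistic` on n_rep resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic([data[rng.randrange(n)] for _ in range(n)])
            for _ in range(n_rep)]

def trimmed_interval(outcomes, level=0.90):
    """Drop an equal number of outcomes from each tail and return the bounds."""
    ordered = sorted(outcomes)
    k = int(round(len(ordered) * (1 - level) / 2))
    return ordered[k], ordered[len(ordered) - 1 - k]

# Toy usage with made-up data: 90% interval for a sample mean.
toy = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.9]
low, high = trimmed_interval(bootstrap(toy, statistics.mean))

# Converting the reported per-Mb bounds to the diploid genomic scale.
genomic_low = 7.7e-5 * 2 * 3200
genomic_high = 4.7e-3 * 2 * 3200
```

The last two lines reproduce the conversion quoted in the text from per-megabase bounds to per-diploid-genome bounds.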

The 90% confidence interval of sh, the strength of selection against deleterious mutations in the heterozygote, is 2.0 × 10⁻⁵ to 0.17. Values of sh are not predicted to be less than (2Ne)⁻¹ if selection is to have any appreciable effect on a per nucleotide basis, since below this point the force of stochastic drift in changing allele frequencies is greater than the selective differences (Kimura 1955). The lower bound for our estimate of sh is very close to this theoretical boundary.
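To see how close the lower bound is to the drift threshold, the boundary (2Ne)⁻¹ can be evaluated for effective population sizes mentioned elsewhere in this report. Which Ne the comparison should use is an assumption here; the values below are simply the best-fit N̂0 and the lowest predicted locus-specific size.

```python
def drift_threshold(ne):
    """Selection coefficient below which drift dominates: 1 / (2 * Ne)."""
    return 1.0 / (2.0 * ne)

sh_lower_bound = 2.0e-5            # lower 90% bound reported for sh

for ne in (32_800, 11_800):        # values taken from the Results text
    print(f"Ne={ne:>6}: threshold={drift_threshold(ne):.2e}")
```

With these inputs the reported lower bound falls between the two thresholds, in line with the statement that it sits near the theoretical boundary.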

The 90% confidence interval of N0, the effective population size without the reducing effect of background selection, is 28,700 to 51,000 diploid individuals. In general, the effective population size before the reducing effects of background selection is expected to be larger than the effective population sizes estimated from observed data. These values are larger than effective population sizes traditionally estimated for humans (e.g., 10,000 individuals [Li and Sadler 1991]; 14,600 [Nachman et al. 1998]) and suggest that many Ne values in the literature may be at least partially affected by diversity-reducing selection. However, we make some assumptions in order to account for the expected contribution to divergence of polymorphism in the common ancestor of humans and chimpanzees (equation 6 in Methods), which will affect Ne and N0 estimates.

Identifying locus-specific departures
Estimated gene-specific effective population sizes along the human autosomes (Fig. 2) are, as expected, generally reduced on average near the physical centers of the chromosomes and maximized near the edges. A few of the autosomes contain regions of dramatic expected reductions in effective population size: portions of the interiors of Chromosomes 1, 9, 11, and 16 and the p-edge of Chromosome 13, which tend to be heterochromatic areas near the centromeres. Of interest are loci like ABO (ABO blood group; MIM 110300; OMIM, http://www.ncbi.nlm.nih.gov/omim/), which has also been found to have signals of selection robust to simple demographic assumptions (Akey et al. 2004). ABO has a very high estimated effective population size (91,200) and a large excess of intermediate frequency alleles (Tajima's D = 2.06) consistent with balancing selection and a possible role of this antigen in disease susceptibility (e.g., cholera [Glass et al. 1985]; Norwalk virus [Hutson et al. 2002]). Also of interest, as candidates for hitchhiking events, are loci that deviate below (or have less variation than) predictions based on background selection and divergence (e.g., DCN, FGA, ICAM1, IL5, PPARA, PROS1, PROZ, SMP1, THBD, TNFAIP1). Of these, DCN (decorin; MIM 125255) has also been found to have signals of selection robust to simple demographic assumptions (Akey et al. 2004), and is associated with renal disease (De Cosmo et al. 2002). IL5 (Interleukin 5; MIM 147850) is a member of the T-helper 2 (Th2) interleukin immune defense cluster on Chromosome 5 (e.g., Brombacher 2000), and positive selection has been reported for other members of this group (IL4 [Rockman et al. 2003]; IL13 [Tarazona-Santos and Tishkoff 2004]). ICAM1 (Intercellular adhesion molecule 1; MIM 147840) is a Plasmodium falciparum cell adhesion receptor (Berendt et al. 1989), a rhinovirus receptor (e.g., Bella et al. 1998), and plays a role in both septic shock (Xu et al. 1994) and autoimmunity (Bullard et al. 1997). ICAM1 has also been identified as a gene undergoing significantly accelerated amino acid replacements along the human lineage (Clark et al. 2003). Similarly, DCN and TNFAIP1 (tumor necrosis factor α induced protein 1; MIM 191161) appear to be rapidly evolving along the chimpanzee lineage (Clark et al. 2003). Genes undergoing rapid evolution may deviate below background-selection predictions because they have undergone a recent selective sweep and/or because divergence has been overestimated owing to an excess of selected fixations. Finally, a few gene regions have paradoxical deviations, typified by MC1R (melanocortin 1 receptor; MIM 155555), which has an excess of intermediate frequency alleles but low levels of variation compared to divergence. This pattern is not consistent with a simple model of either balancing selection or a selective sweep, but may reflect a partial selective sweep (see also Harding et al. 2000).



Figure 3. A plot of Tajima's D versus f̂0 predictions. The observed frequency distribution is increasingly skewed toward an excess of rare alleles as predicted variation is reduced (assuming background selection). This could be a result of either positive or negative selection or both. The correlation between Tajima's D and f̂0 remains significant even if the apparent outliers (ABO, F2, and PROS1) are selectively removed (r² = 0.045, P = 0.019). Note, if a linear extrapolation is made from the best-fit regression line to f0 = 1, in an effort to account for the effects of selection, there is little to no negative skew (D ≈ 0) predicted for humans, consistent with a nearly constant ancestral population size.

 
These 126 loci were chosen for the SeattleSNPs resequencing study largely because of their medical interest; many are involved in human disease interactions with clear fitness consequences (e.g., ABO and cholera [Glass et al. 1985]; IL4 and HIV-1 progression [Valentin et al. 1998]; CSF2 and pneumonia [LeVine et al. 1999]). Therefore, it is not difficult to imagine positive selection acting at some fraction of these loci. Individual gene regions affected by forces other than background selection may deviate from the level of variation predicted by fitting the background-selection model to the rest of the data. Inclusion of these positively selected loci in the data set may detract from the proportion of variation accounted for by the background-selection model, as measured by r². As an ad hoc exploration to identify positive selection in the data, we tabulated how often each locus was included in the highest 10% of bootstrapping outcomes when ordered by r² values and asked: Are there significant underrepresentations in the collection of gene regions that seem to best fit background-selection predictions? A goodness-of-fit test between observed counts and expectations from a Poisson distribution of mean 100 rejected the null distribution (Supplemental Fig. 1) (cells were pooled to satisfy Cochran's [1954] guidelines; χ² = 27.75, df = 14, P < 0.025). The most underrepresented loci in the top 10% (ordered by r²) of bootstrapping replicates are VEGF (vascular endothelial growth factor; MIM 192240; P = 1.16 × 10⁻¹⁹ with a Bonferroni correction) and LTA (Lymphotoxin-α; MIM 153440; P = 6.92 × 10⁻¹² with a Bonferroni correction). These two loci are 15.2 cM apart on Chromosome 6 and deviate above the background-selection predictions, and LTA has a positive Tajima's D estimate consistent with balancing selection (Fig. 2; Supplemental Table 1). The genetic grouping of VEGF and LTA suggests that errors in the local recombination rate estimates, and the very large effective population size estimates suggest that errors in the mutation rate estimates, might be responsible for their underrepresentation. However, LTA is found near the major histocompatibility complex (MHC) on Chromosome 6 (Jongeneel et al. 1991) and has an immune regulation function (Chin et al. 2003); thus it may, in fact, be affected by some form of balancing selection. Curiously, in addition to being a vascular growth factor (Ferrara and Henzel 1989), VEGF is inhibited by the dopamine neurotransmitter (Basu et al. 2001); is neurotrophic, neuroprotective, and neurogenic (Jin et al. 2002 and the references therein); may affect neurocognitive function by promoting glucose passage across the blood-brain barrier during acute hypoglycemia (Dantz et al. 2002); and appears to be responsible for increased neurogenesis and improved cognitive response to enriched environments and learning tasks: "VEGF may be a key mediator linking the environment to neurogenesis, learning and memory" (Cao et al. 2004, p. 832). The excess of variation compared to divergence and the excess of rare alleles found at VEGF are not consistent with simple models of balancing selection or selective sweeps, but may be consistent with a model of diversifying selection in humans. One of the loci that we identified above as having less variation than predicted by background selection also has a tendency to be underrepresented in these replicates (ICAM1, P = 0.69 and 0.0055 with and without a Bonferroni correction, respectively), consistent with putative hitchhiking. PROS1 (Protein S; MIM 176880) is a clear outlier that is awkward to explain. When included, PROS1 appears to increase the amount of variation that can be explained by background selection. However, PROS1 was identified above as a gene region that deviated below background-selection predictions and has a strongly negative Tajima's D value (-1.44). Alternatively, the low level of variation and skewed allelic spectra toward an excess of rare alleles at PROS1 may be entirely consistent with the effects of background selection, as described below.
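The relationship between the corrected and uncorrected P-values quoted for ICAM1 can be illustrated with a standard Bonferroni adjustment. The sketch assumes the number of tests equals the 126 loci, which is consistent with the reported pair of values but is not stated explicitly in this section; the function name is illustrative.

```python
def bonferroni(p, n_tests):
    """Bonferroni-adjusted P-value, capped at 1."""
    return min(1.0, p * n_tests)

n_loci = 126                     # assumed number of tests (one per locus)
icam1_uncorrected = 0.0055       # value quoted in the text
icam1_corrected = bonferroni(icam1_uncorrected, n_loci)

print(f"ICAM1 corrected P = {icam1_corrected:.2f}")
```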

Comparing background-selection predictions to the allelic spectra
Simulation studies have reported distortions in the allele frequency distribution associated with the removal of weakly deleterious mutations, particularly in populations with effective sizes as small as that of modern humans (Charlesworth et al. 1993, 1995; Fu 1997; Tachida 2000; Gordo et al. 2002; Williamson and Orive 2002). We find that f̂0 (the fraction of remaining neutral variation after background selection) is positively correlated with Tajima's D values (r² = 0.074, P = 0.002) (Fig. 3; Tajima 1989). This is similar to the correlation between Tajima's D and the recombination rate recently reported by Stajich and Hahn (2005). There is a concern that this correlation may arise from, or be exaggerated by, sharing of the same π values between Tajima's D and f̂0 calculations. To address this, the loci were divided, using a random function, into two subsamples with an equal chance of being included in each subsample. The first subsample was used to reoptimize the model parameters identically to the method used for the full data set. The resulting f̂0 estimates were calculated for all loci. The second sample was used to test for a correlation between D and f̂0. Therefore, in this second treatment, none of the data from genic regions whose π values were used to calculate N̂e and optimize f̂0 were used in the comparison between the model's f̂0 predictions and the Tajima's D estimates. Essentially the same result, a significant positive correlation, was again found (r² = 0.103, P = 0.009). These results indicate that, overall, areas predicted to be affected more by background selection are also increasingly skewed from an expected steady-state allele frequency distribution toward an excess of rare alleles. This is not unexpected given the parameter values we have estimated (sh = 0.018 and N̂0 = 32,800), but the skew can be problematic for discriminating between the genome-wide effects of positive and negative selection in humans.
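For readers less familiar with the statistic, Tajima's D (Tajima 1989) compares two estimates of the scaled mutation rate: the mean number of pairwise differences and the number of segregating sites divided by a harmonic-number constant. The sketch below implements the standard formula; the function name and example inputs are illustrative and are not taken from the SeattleSNPs data.

```python
import math

def tajimas_d(n, S, pi):
    """Tajima's D for n sequences, S segregating sites, and mean number
    of pairwise differences pi (per locus, not per site)."""
    if n < 3 or S < 1:
        raise ValueError("need at least 3 sequences and 1 segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    # Difference of the two estimators, scaled by its estimated std. deviation.
    return (pi - S / a1) / math.sqrt(e1 * S + e2 * S * (S - 1))

# Made-up inputs: an excess of rare alleles lowers pi relative to S / a1.
print(tajimas_d(n=10, S=10, pi=2.0))   # negative D
print(tajimas_d(n=10, S=10, pi=5.0))   # positive D
```

Negative values indicate an excess of rare alleles and positive values an excess of intermediate-frequency alleles, which is how the D estimates are read in this section.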


    Discussion
The process of background selection as modeled can explain 7.9% of the variation in observed nucleotide polymorphism among autosomal gene regions across the human genome, after adjustments for variation in mutation rates. Comparisons of the range of locus-specific effective population sizes to their corresponding expectation suggest that the role of the removal of deleterious alleles in shaping levels of human variation can be subtle over the majority of autosomal loci (Fig. 2), as predicted for mammalian chromosomes (Nordborg et al. 1996). This is in contrast to findings in Drosophila, where a large amount of observed variation is consistent with background selection (Hudson and Kaplan 1995; Charlesworth 1996; Hamblin and Aquadro 1996). However, this does not necessarily mean that other deterministic factors like positive selection have a large effect on standing variation in humans. We are faced with the high evolutionary variance of nucleotide heterozygosity (π) in modern humans (Tajima 1983; Nei 1987), variance of lineage coalescence in the common human-chimpanzee ancestor, and uncertainty of the relative size and age of this common ancestor (e.g., Takahata et al. 1995), as well as the possibility of positive selection affecting divergence estimates, particularly when regulatory elements might be included (e.g., Wray et al. 2003). Thus, inferences about any single locus as an outlier from background-selection predictions should be verified with further data collection and analysis (e.g., by measuring changes in polymorphism at flanking regions, divergence estimates from additional species, and explicit tests of positive selection).

It should also be noted that we assume a uniform distribution of deleterious mutations for each physical segment of the genome. Known genes are nonrandomly distributed along a chromosome; however, the majority of interspecific conserved sequences appear to be nongenic (Shabalina et al. 2001; Dermitzakis et al. 2002; Waterston et al. 2002). The unequal distributions of nongenic and genic conservation appear to largely cancel each other out and yield a relatively uniform (at least as a first approximation) distribution of conserved nucleotides (Dermitzakis et al. 2002). However, considering changes in rates of purifying selection as a function of interspecific nucleotide conservation rather than physical distance should also improve background-selection predictions.

The tolerance of a high deleterious mutation rate
A diploid genomic deleterious mutation rate (Û) of 10.2 represents a very large per generation mutational load for humans (e.g., Haldane 1937; Crow 1958; Kimura and Maruyama 1966; Kondrashov 2001). However, when the proportion of the human genome under functional constraint (estimated as 5% by comparisons to mouse) (Waterston et al. 2002) and the genomic mutation rate (estimated as 175 new mutations per generation) (Nachman and Crowell 2000) are considered, then a high U is predicted (in this example, Û = 0.05 × 175 = 8.75).

Both synergistic epistatic fitness between deleterious alleles (Kimura and Maruyama 1966; Crow and Kimura 1978; Charlesworth 1990; Kondrashov 1994) and inbreeding (Glémin 2003) can more efficiently purge the genome of deleterious alleles by removing multiple alleles at a rate greater than expected under independent fitness effects with random mating and may help resolve this paradox. Furthermore, the removal of gametes carrying deleterious alleles (e.g., during atresia) (Gougeon 1996) may also allow an increased tolerance of a high per generation deleterious mutation rate and is more efficient at purging deleterious alleles because each gamete has only one genomic copy.

Effective population size estimates
To the extent that selection may vary the effective population size along the chromosome, different areas of the genome may contain different information about the history of a species since the time to the coalescence of sample lineages is a function of the effective population size (Kingman 1982). Loci in regions of low predicted Ne, such as the center of Chromosome 1, where the effects of stochastic drift are predicted to be magnified, will be expected to have greater differences in allele frequencies and may provide more information on the recent demographic structure of modern humans. Conversely, regions of high predicted Ne, such as the ends of most of the autosomes, may have maximal segregating lineage depth and provide information on more ancient demographic processes. In this case, gene regions with high f̂0 can have an expected Ne as large as 30,000 diploid individuals and deviate little from neutral predictions (although this prediction is sensitive to some assumptions made in our divergence estimate; see Methods). In a sample of sufficient size, the expected time to a most recent common ancestor is expected to be 4Ne generations (Kingman 1982). With an average generation time of 25 yr, this corresponds to three million years ago (Mya). Therefore, by selecting loci with high f̂0, it should be possible to study lineages that predate the emergence of Homo sapiens within the last 200,000 yr (White et al. 2003; McDougall et al. 2005) and perhaps even predate the origin of the genus Homo ~2.3 Mya (Kimbel et al. 1996).
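The time-depth figure above follows from the 4Ne expectation and the stated generation time. A short check, using only numbers given in the text (variable names are illustrative):

```python
ne = 30_000              # expected Ne in high-f0 regions (diploid individuals)
generation_years = 25    # average generation time assumed in the text

tmrca_generations = 4 * ne
tmrca_years = tmrca_generations * generation_years

print(f"expected TMRCA: {tmrca_generations:,} generations, "
      f"about {tmrca_years / 1e6:.1f} million years")
```

The resulting depth exceeds both the roughly 200,000 yr age of Homo sapiens and the ~2.3 Mya origin of the genus Homo cited above.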

Background selection's effect on divergence
Can the correlation between rates of recombination and divergence observed by Hellmann et al. (2003), and explained as neutral changes in mutation rates, also be explained by background selection acting in the common ancestral species of humans and baboons? Assuming an average generation time of 15 yr and a common ancestor 25 Mya (Goodman et al. 1998; Yoder and Yang 2000), humans and Old World monkeys are separated by 3.3 million generations (2g). Background selection may have had a larger effect in this common ancestor considering that baboons have lower rates of recombination (and thus more linkage between deleterious and neutral alleles) than humans (Rogers et al. 2000). However, even with a 10-fold difference in effective population sizes along the chromosomes due to background selection, an interlocus maximum effective population size of 240,000 diploid individuals (Ne,max) is required in the common ancestor to entirely explain the expected 20% change in divergence associated with recombination rates:

\frac{2g\mu + 4(N_{e,\max}/10)\mu}{2g\mu + 4N_{e,\max}\mu} = 0.8 \qquad (1)
(note that the scaled mutation rate 2µ cancels out, and that the coalescence of two lineages in the common ancestor is expected to add 4Neµ to the divergence, 2gµ, accumulated since the species split). This is larger than the effective population sizes typically estimated in primates (e.g., Chen and Li 2001; Chiarello and de Melo 2001; Storz et al. 2002; Yang 2002; Li et al. 2003; Wall 2003). However, Satta et al. (2004) estimate an Ne of 10^5 over much of primate evolution, and Takahata and Satta (1997) estimate an Ne of 10^6 in the Oligocene. Thus, while background selection may be unlikely to be entirely responsible for the patterns observed by Hellmann et al. (2003), it is also hard to rule out a nontrivial contribution of background selection to changes in divergence between species.
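The 240,000 figure can be reproduced with a small calculation. This is a sketch under one reading of the argument, not the authors' code: expected divergence is taken as 2µ(g + 2Ne), and a "20% change" is read as the low-Ne region having 0.8 times the divergence of the high-Ne region. The function and parameter names are hypothetical.

```python
def ne_max_required(years_to_ancestor, generation_time_years,
                    fold_reduction, divergence_ratio):
    """Solve (g + 2*N/k) / (g + 2*N) = R for N, the maximum ancestral Ne.

    g is the number of generations along one lineage since the split,
    k the fold difference in Ne due to background selection, and
    R the ratio of lowest to highest expected divergence.
    The factor 2*mu cancels from numerator and denominator.
    """
    g = years_to_ancestor / generation_time_years
    k, ratio = fold_reduction, divergence_ratio
    return g * (1 - ratio) / (2 * (ratio - 1 / k))


# Values from the text: 25 My, 15 yr per generation, 10-fold Ne difference.
print(round(ne_max_required(25e6, 15, 10, 0.8)))
```

Under these assumptions the result is close to 238,000, which rounds to the 240,000 quoted in the text.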

Patterns of skewed allele frequency spectra
The traditional treatment of background selection, and the model we use, assumes efficient selection; deleterious mutants are removed fast enough to make no contribution to heterozygosity (Charlesworth et al. 1993). However, this assumption may break down with smaller effective population sizes, weaker selective coefficients, and nonmultiplicative mutant interactions. With human parameters, an Ne on the order of 25,000 diploid individuals and sh values approaching 0.01 or less, analyses and simulations of the process of background selection predict that deleterious mutants can persist and rise to high enough frequencies to begin to contribute to sample heterozygosity and to co-segregate (Charlesworth et al. 1995). Ultimately these mutants will likely be removed, and they tend to be young and to contribute mutations as rare alleles in external lineages (Williamson and Orive 2002). Furthermore, interference among linked mutants can also reduce the efficiency of selection in quickly removing these mutants (Hill and Robertson 1966; Felsenstein 1974). The result is an ongoing population expansion, where a subset of "healthy" chromosomes ultimately contributes to all future chromosomes, and, in a population sample, a class of chromosomes is held at a lower than expected sample frequency due to linked deleterious mutants that are inefficiently purged. This results in an apparent non-neutral pattern, particularly in regions of low recombination (Charlesworth et al. 1995; Tachida 2000; but see also Przeworski et al. 1999). Indeed, we found a significant positive correlation between Tajima's D and levels of variation predicted by background selection for the 126 SeattleSNPs loci (r^2 = 0.068, P = 0.0058) (Fig. 3). This suggests that regional rare-skewed allele frequency distributions alone are not conclusive indicators of positive selection and that the effects of background selection, in humans, cannot be simply thought of as a reduction in the effective population size.
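For readers unfamiliar with the statistic, the following is a minimal sketch of Tajima's D computed from per-site allele counts, using the standard constants of Tajima (1989). The function name and input layout are assumptions for illustration, and it is not the code used to produce Fig. 3.

```python
import math


def tajimas_d(n_chromosomes, allele_counts):
    """Tajima's D from biallelic segregating sites.

    allele_counts holds, for each segregating site, the number of sampled
    chromosomes carrying one of the two alleles (1 .. n-1).
    """
    n = n_chromosomes
    s = len(allele_counts)
    if n < 4 or s == 0:
        raise ValueError("need at least 4 chromosomes and 1 segregating site")
    pairs = n * (n - 1) / 2
    # Mean number of pairwise differences, summed over sites.
    pi = sum(k * (n - k) for k in allele_counts) / pairs
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))
```

An excess of rare alleles (for example, all sites being singletons) gives a negative value, while an excess of intermediate-frequency alleles gives a positive one.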

The selection of candidate loci that have undergone positive selection might be refined in light of the potential effects of background selection on allele frequency distributions (Fig. 3). For example, PROS1 has a low level of variation, compared to divergence, and an excess of rare alleles, but is very close to background-selection predictions (both in terms of levels of variation [Fig. 2] and allele frequency distribution [Fig. 3]). Therefore, PROS1 may not be as promising a candidate region for the presence of a selective sweep as PPARA (peroxisome proliferator-activated receptor-α; MIM 170998) and PROZ (protein Z; MIM 176895), which have slightly higher levels of variation and less skewed allelic distributions (Figs. 2 and 3), but deviate much more from Tajima's D values seen at high f0 values. Similarly, KLKB1 (prekallikrein deficiency; MIM 229000) may not be as likely a candidate for balancing selection as LTA, which deviates more from the negative Tajima's D value expected at lower f0 values (Fig. 3).

The correlation between Tajima's D and f0 could be due to selective sweeps with hitchhiking. However, Kitano et al. (2003) find a correlation in skewed allele frequency distributions from 10 X-linked genes in both humans and chimpanzees. This finding suggests that allele frequency skews are, to some degree, a static regional phenomenon preserved between species. This may be difficult to explain under a hitchhiking model, where individual signals of positive selection are predicted to be intermittent and short-lived (Simonsen et al. 1995; Przeworski 2002). The correlation of allele frequency distortion with f0 could also arise from an interaction of the reduction in effective population size due to background selection and other selective and demographic factors. For example, the human population size is expanding; the coalescent distribution in regions of reduced effective population size is more influenced by recent events; therefore, regions of lower f0 may reflect a greater signal of the expansion of modern humans. Also consider that regions of low f0 are predicted to experience accelerated genetic drift (due to a smaller effective population size) compared to regions of higher f0. If a sample is composed of members from recently structured populations, larger differences in rare allele frequencies among subpopulations may result in a sampled excess of rare alleles in areas of low f0 (Ptak and Przeworski 2002; Hammer et al. 2003).

Conclusions
We predict a null level of DNA sequence polymorphism expected across the human autosomes based on a simple model of background selection. Background-selection predictions may help refine candidate loci influenced by positive selection and identify gene regions informative in studying recent or ancient demographic patterns. The results of Hellmann et al. (2003) indicate that neutral mutation rates may be the main determinant of genome-wide variation, but a genome-wide role of selection in humans is required to explain the correlation between background-selection predictions and skews in the allele frequency spectrum.


    Methods
 
Background-selection model
We use the background selection with recombination derivation of Hudson and Kaplan (1995). Because we correct for differences in mutation rates (see below), we replace average pairwise nucleotide heterozygosity (π) (Nei and Li 1979) in the original Hudson and Kaplan (1995) equation with effective population size, Ne. The expected effective population size under this model is calculated as follows:

E(N_e) = N_0 f_0 = N_0 e^{-G} \qquad (2)

where

(3)

N0 is the effective population size without the effects of background selection, f0 is the probability of not being linked to (and removed by) a deleterious mutant, u is the deleterious mutation rate per physical unit (in this case we used megabase pairs), sh is the selective coefficient against the deleterious mutants (s) multiplied by the degree of dominance (h), x refers to the physical position being considered, and M refers to the genetic map position in morgans of physical position x. For each locus of interest (k) the probability of being linked to, and removed by, a deleterious mutant is the sum of all the probabilities contributed by each physical segment (i to i + 1) along the chromosome. The distribution of linked deleterious mutants is assumed to be Poisson; thus the probability of the mutation-free class is given by e^-G. The model assumes that selection is efficient so that mutations are removed quickly enough not to contribute to observable sample polymorphism and that mutant alleles are at low enough frequency to always be present in the heterozygote form (Hudson and Kaplan 1995).
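The segment-wise sum described above can be sketched as follows. The per-segment term used here, u Δx sh / (2 (sh + |M_i - M_k|)^2), is an assumption: it is a discretized form of the Hudson and Kaplan (1995) integrand chosen so that a uniform region reproduces their closed-form result exp(-U / (2sh + R)), and it may differ in detail from the expression used in this study. The function names and the segment layout are hypothetical.

```python
import math


def predicted_f0(focal_morgans, segments, u_per_mb, sh):
    """Probability of the mutation-free class, f0 = exp(-G), at a focal locus.

    segments: iterable of (start_mb, end_mb, segment_morgans), where
    segment_morgans is the genetic map position assigned to the segment.
    """
    g_total = 0.0
    for start_mb, end_mb, segment_morgans in segments:
        distance = abs(segment_morgans - focal_morgans)  # genetic distance, morgans
        # ASSUMED kernel (see lead-in): contribution of one physical segment.
        g_total += u_per_mb * (end_mb - start_mb) * sh / (2.0 * (sh + distance) ** 2)
    return math.exp(-g_total)


def expected_ne(n0, f0):
    """Expected effective population size under background selection."""
    return n0 * f0
```

Because each segment enters only through its length and its genetic distance from the focal locus, the same routine can be reused for every locus by changing `focal_morgans`.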

Human polymorphism data
Publicly available estimates of DNA sequence variation for 126 autosomal gene regions (SeattleSNPs 2005, http://pga.gs.washington.edu) assayed in combined African-American (24 individuals) and European-American CEPH (23 individuals) population samples were used as a basis for levels of human genetic polymorphism (Supplemental Table 1). Because we are interested in patterns general to humans; because, relative to many other organisms, there is a high degree of shared polymorphism and evolutionary history in humans; and because these samples, in all likelihood, already consist of individuals admixed among different genetically structured human populations (e.g., Parra et al. 1998), we chose to combine the African-American and European-American samples in an effort to reduce sample variance. Only noncoding regions (averaging a total of 16.5 kb in length) were used in the analysis (i.e., 3'- and 5'-flanking and intron sites). We chose to exclude the X- and Y-chromosomes and focus on the autosomes in the present report because (1) more multilocus data from a common set of DNA samples are available for the autosomes; (2) the autosomes make up the majority of the genome; and (3) selection, mutation, and effective population size parameters are expected to be different between the sex chromosomes and the autosomes because of the X-chromosome's hemizygosity in males (coupled with effective sex ratio uncertainty) (Charlesworth 2001), the Y-chromosome's single-copy nature and absence in females, and a higher mutation rate in males (for review, see Li et al. 2002).

Background selection is predicted to have more of an effect on average pairwise nucleotide heterozygosity (π) (Nei and Li 1979) than on the number of segregating sites in a sample (S) (Watterson 1975), particularly when selection is weak and effective population size, Ne, is small (Charlesworth et al. 1993). Therefore we used total π estimates from each gene region (including coding and noncoding sites) as our principal measure of genetic polymorphism.

In an ideal Wright-Fisher population of constant size at mutation-drift equilibrium, under the infinite sites model, π is an unbiased, but inconsistent (i.e., maintains a finite variance under infinite sampling) (Tajima 1983; Nei 1987), estimator of the population mutation parameter θ (Haldane 1939; Kimura 1968). For autosomal genes in an ideal diploid population, θ = 4Neµ, where Ne is the diploid effective population size and µ is the mutation rate per generation.
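As an illustration of the quantity being estimated, the sketch below computes per-site π as the average number of pairwise differences among aligned sequences. It assumes equal-length, gap-free sequences; the SeattleSNPs estimates used in this study were taken as published rather than recomputed this way.

```python
from itertools import combinations


def pairwise_diversity(sequences):
    """Average pairwise nucleotide differences per site (pi)."""
    n = len(sequences)
    if n < 2:
        raise ValueError("need at least two sequences")
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")
    total_differences = sum(
        sum(a != b for a, b in zip(first, second))
        for first, second in combinations(sequences, 2)
    )
    n_pairs = n * (n - 1) // 2
    return total_differences / n_pairs / length
```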

Controlling for mutation rate heterogeneity, estimating Ne
Possible differences in the proportion of nucleotides under functional constraint, as well as increased mutation rates in regions of higher recombination and higher GC content, require differences in effective mutation rates to be accounted for in order to combine gene regions for analysis. We correct for mutation rate differences among gene regions by using average pairwise DNA divergence between the human sample and a chimpanzee (Pan troglodytes) sample (SeattleSNPs 2005, http://pga.gs.washington.edu). Polymorphism in the ancestral population of humans and chimpanzees could have made a substantial contribution to the observed interspecific divergence (d), because the division between the two species is recent (4-6 Mya or ~250,000 generations, g) (e.g., Gavan 1953; Nishida et al. 1990; Hill and Hurtado 1996; Stauffer et al. 2001; Burnet et al. 2002; Wall 2003), and the common ancestor between humans and chimpanzees before this time may have had a large effective population size (see below), resulting in more ancient sequence divergence dates (e.g., 5.5-11 Mya) (Bailey et al. 1991; Goodman et al. 1998; Kumar and Hedges 1998; Huelsenbeck et al. 2000; Arnason and Janke 2002; Hasegawa et al. 2003; Yang and Yoder 2003). Also, the ancestral human-chimpanzee speciation may have occurred over a considerable period of time, adding to the variation in these estimates (Osada and Wu 2005). The ancestral contribution to divergence is expected to be θa = 4Naµ, where θa and Na are the corresponding ancestral values. We correct divergence by using an ancestral constant, a, that is equal to the ratio of ancestral to modern effective population sizes. The constant a is multiplied by the modern human θ estimate, based on π, and this product (aπ) is subtracted from the species divergence to yield a divergence value corrected for ancestral polymorphism (d - aπ). This yields an estimate of the modern effective population size, N̂e, for each gene region based on π:

\hat{N}_e = \frac{\pi}{4\hat{\mu}} \qquad (4)

where

\hat{\mu} = \frac{d - a\pi}{2g} \qquad (5)

g is the number of generations since species division (here we use g = 250,000), and 2g represents the total number of generations between the two species. This simplifies to:

\hat{N}_e = \frac{g\pi}{2(d - a\pi)} \qquad (6)

Estimates of the effective population size of the immediate common ancestor to humans and chimpanzees range from 1-2 (Yang 2002) to 4-10 (Takahata and Satta 1997, estimated separately from the Oligocene value cited above; Chen and Li 2001; Wall 2003) times larger than that of modern humans. These latter values are almost surely overestimates because an ancestral correction as large as four removes all the divergence between the species for some loci and leads to negative modern effective population size estimates (i.e., the human to chimpanzee species divergence is predicted to be entirely contained by lineages in the common ancestor). Because estimating the effective population size of the human-chimpanzee common ancestor is not a central goal of this report, we optimized a values with the full data set at the beginning of the analysis. The best-fit value, a = 2.0, was then used for all subsequent bootstrapping analyses. The particular value of a used should not bias the organization of levels of variation along the chromosome (which could affect û and sh), but a will affect N̂0.
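The divergence correction and the resulting Ne estimate can be sketched as follows, assuming θ = 4Neµ with π as the estimate of θ and a per-generation mutation rate of (d - aπ)/(2g). The function name and the example values of π and d are hypothetical and chosen only to show how a large ancestral correction can produce a negative estimate.

```python
def estimate_ne(pi, divergence, a=2.0, generations_since_split=250_000):
    """Modern effective population size from polymorphism and divergence.

    The ancestral polymorphism a*pi is removed from the divergence before
    converting it to a per-generation mutation rate over 2g generations.
    """
    corrected_divergence = divergence - a * pi
    if corrected_divergence == 0:
        raise ZeroDivisionError("ancestral correction removes all divergence")
    mu_hat = corrected_divergence / (2 * generations_since_split)
    return pi / (4 * mu_hat)


# Hypothetical locus: pi = 0.0008, d = 0.012 per site.
print(round(estimate_ne(0.0008, 0.012)))
# With a = 4 and d < 4*pi the corrected divergence, and hence Ne, is negative.
print(estimate_ne(0.003, 0.011, a=4.0) < 0)
```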

Use of integrated maps
We used the integrated genetic and physical maps reported by deCODE, which result from 1257 meioses, 5136 markers, and 4690 recombination rate interval estimates (Kong et al. 2002). Sex-averaged genetic positions were used for all the autosomes. Each SeattleSNPs (2003, http://pga.gs.washington.edu) locus was located in the April 2003 freeze of the UCSC Human Genome Browser (http://genome.ucsc.edu/; Kent et al. 2002), and assigned the genetic and physical position of the nearest marker reported in deCODE's map.
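Assigning each locus to its nearest map marker amounts to a nearest-neighbor lookup on sorted physical positions. A minimal sketch, assuming markers on one chromosome are held as (physical_bp, genetic_cM) tuples sorted by physical position; the data layout and the example coordinates are hypothetical.

```python
from bisect import bisect_left


def nearest_marker(markers, position_bp):
    """Return the (physical_bp, genetic_cM) marker closest to position_bp.

    markers must be sorted by physical position and come from the same
    chromosome as the locus.
    """
    if not markers:
        raise ValueError("no markers supplied")
    positions = [marker[0] for marker in markers]
    index = bisect_left(positions, position_bp)
    # Only the markers immediately left and right of the locus can be nearest.
    candidates = markers[max(index - 1, 0): index + 1]
    return min(candidates, key=lambda marker: abs(marker[0] - position_bp))


example_markers = [(100, 0.1), (200, 0.3), (400, 0.9)]
print(nearest_marker(example_markers, 260))
```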

Least-squares optimization procedure, comparing estimated and expected Ne
Values of u and sh were systematically varied by computer over a range of 1 × 10^-5 to 1. Each resulting f0 prediction was fit, by the weighted least-squared error described below, to the observed DNA polymorphism data by a one-dimensional optimization of N0. The minimum weighted squared error over this range was found at a particular combination of u, sh, and N0, which is reported as the best fit.
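The search can be sketched as a grid over u and sh with N0 optimized at each grid point. Several details here are assumptions rather than the procedure actually used: the grid is log-spaced, each locus has a single symmetric weight (so the best N0 has a closed form), and `predict_f0` is a hypothetical callable that returns the f0 prediction for every locus. With asymmetric variances, a numerical one-dimensional search over N0 would replace `best_n0`.

```python
import math


def log_grid(low=1e-5, high=1.0, points_per_decade=4):
    """Log-spaced values from low to high, inclusive."""
    steps = round(math.log10(high / low) * points_per_decade)
    return [low * 10 ** (i / points_per_decade) for i in range(steps + 1)]


def best_n0(f0, ne_hat, weights):
    """Weighted least-squares N0 for the model ne_hat ~ N0 * f0."""
    numerator = sum(w * f * y for w, f, y in zip(weights, f0, ne_hat))
    denominator = sum(w * f * f for w, f in zip(weights, f0))
    return numerator / denominator


def grid_search(predict_f0, ne_hat, weights, grid):
    """Return (sse, u, sh, n0) with the smallest weighted squared error."""
    best = None
    for u in grid:
        for sh in grid:
            f0 = predict_f0(u, sh)
            n0 = best_n0(f0, ne_hat, weights)
            sse = sum(w * (y - n0 * f) ** 2
                      for w, f, y in zip(weights, f0, ne_hat))
            if best is None or sse < best[0]:
                best = (sse, u, sh, n0)
    return best
```

A finer grid around the best coarse-grid point is a common refinement, at the cost of more f0 evaluations.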

Although the same chromosomes were sampled across gene regions in the SeattleSNPs data set, the lengths sequenced, local recombination rates, and levels of nucleotide variation differed among loci and affect the expected variance of the sample π estimate. In order to appropriately take these differences into account, the expected variance, V, was estimated for each datum, under its corresponding sampling parameters, by the calculated variance of π values resulting from 10,000 coalescent simulations using the ms program of Hudson (2002). The expected distribution of π is not symmetrical (Tajima 1983; Nei 1987); therefore, separate upper and lower variances were calculated for each datum according to standard statistical methods. The squared errors between the background-selection prediction and the observed value were weighted by each datum's expected upper or lower variance estimate, corresponding to the individual deviation, in order to find the best-fit parameter configuration by weighted regression (e.g., Neter et al. 1996; cf. Andolfatto and Przeworski 2001).

Several of the SeattleSNPs loci shared the same, physically nearest deCODE marker. To keep a single location from being overrepresented in the optimization procedure, the variance of each individual SeattleSNPs locus was multiplied by the number of loci (l) that shared the same deCODE position. This approach proportionally downweights the contribution of clustered loci to the sum of the squared errors (SSE). Thus, for each locus i:

\mathrm{SSE} = \sum_i \frac{\left(\hat{N}_{e,i} - E(N_{e,i})\right)^2}{l_i \hat{V}_i} \qquad (7)

where, for the i-th locus, N̂e,i is the estimated effective population size (equation 6), E(Ne,i) is the expected effective population size according to background selection (equation 2), li is the number of loci that share the same deCODE position, and V̂i is the estimated sample variance. Note that this is a constrained regression, with the slope of the regression line fixed at 1 and the intercept at 0. The r^2 value, the proportion of variance that can be accounted for by the model predictions, is calculated as follows:

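r^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}} \qquad (8)

where

\mathrm{SST} = \sum_i \frac{\left(\hat{N}_{e,i} - \bar{N}_e\right)^2}{l_i \hat{V}_i} \qquad (9)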

and N̄e is the mean N̂e value.
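The weighting scheme can be sketched as below. Two points are assumptions, since the text does not spell them out: the upper variance is applied when the observed value lies above the prediction (and the lower variance otherwise), and r^2 is taken as one minus the ratio of the weighted error about the prediction to the weighted error about the unweighted mean. Function and argument names are hypothetical.

```python
def weighted_sse(observed, expected, var_upper, var_lower, n_shared):
    """Sum of squared errors, each divided by l_i times the relevant variance."""
    total = 0.0
    for obs, exp, v_up, v_low, l in zip(observed, expected,
                                        var_upper, var_lower, n_shared):
        # ASSUMPTION: deviations above the prediction use the upper variance.
        variance = v_up if obs > exp else v_low
        total += (obs - exp) ** 2 / (l * variance)
    return total


def r_squared(observed, expected, var_upper, var_lower, n_shared):
    """Proportion of weighted variance accounted for by the predictions."""
    mean = sum(observed) / len(observed)
    sse = weighted_sse(observed, expected, var_upper, var_lower, n_shared)
    sst = weighted_sse(observed, [mean] * len(observed),
                       var_upper, var_lower, n_shared)
    return 1 - sse / sst
```

Multiplying the variance by l_i means that l loci sharing one marker together contribute about as much as a single locus would.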

Bootstrapping procedure
Nonparametric bootstrapping was used to assess the general robustness of the parameter values (Efron and Tibshirani 1993). Random sampling of the loci with replacement generated pseudo-samples of the same size as the original data set. For each pseudo-sample, the weighted regression optimization described above was repeated and the best-fit parameter values were found and reported. This procedure assumes that each locus is genetically far enough apart to reflect independent outcomes of the evolutionary process. For the majority of loci included in each pseudo-sample, this assumption is easily met. If we consider an expectation of 100 recombination events in the coalescent history between two loci on a single chromosome (assuming an ideal population of constant size) as sufficiently independent,

4N_e c = 100 \qquad (10)

where c is the per generation recombinant fraction, and an effective population size of only 10,000 diploid individuals, then genetic distances as small as one-quarter of a centimorgan are effectively independent:

c = \frac{100}{4 \times 10{,}000} = 0.0025 = 0.25\ \mathrm{cM} \qquad (11)

The average genetic distance between adjacent markers in the deCODE map is 0.7 cM, and the contribution of the subset of loci that map to the same marker is down-weighted so that the total contribution from each marker is equivalent to a single locus, as described above.
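The resampling step and the independence threshold can be sketched together. This is an illustrative outline, not the authors' implementation: `fit` stands for any hypothetical function that returns a best-fit parameter for a list of loci, the number of replicates is arbitrary, and the threshold assumes that the expected number of recombination events between two loci is 4Ne c.

```python
import random


def bootstrap(loci, fit, n_replicates=1000, seed=0):
    """Refit the model on pseudo-samples drawn with replacement."""
    rng = random.Random(seed)
    return [fit(rng.choices(loci, k=len(loci))) for _ in range(n_replicates)]


def min_independent_distance_cm(ne_diploid, recombination_events=100):
    """Genetic distance (cM) giving the stated number of recombination
    events in the coalescent history of two loci, from 4*Ne*c = events."""
    recombinant_fraction = recombination_events / (4 * ne_diploid)
    return recombinant_fraction * 100  # recombinant fraction to centimorgans


# Ne = 10,000 diploid individuals, as in the text.
print(min_independent_distance_cm(10_000))
```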


    Acknowledgements
 
We thank Vanessa Bauer DuMont, Yuseob Kim, Guy Reeves, Rasmus Nielsen, Bret Payseur, Molly Przeworski, Wolfgang Stephan, and three anonymous reviewers for helpful discussion and suggestions. We also thank Deborah Nickerson and the SeattleSNPs project for making these data publicly available. This work was supported in part by grants from Sigma Xi (to F.A.R.), National Institutes of Health grant GM36431 (to C.F.A.), and National Science Foundation grant DMS-0201037 (to Richard Durrett, C.F.A., and Rasmus Nielsen).


    Footnotes
 
3 Present address: Department of Biology, University of Maryland, College Park, MD 20742, USA.

4 Corresponding author.
E-mail freed{at}umd.edu; fax (301) 314-9358.

[Supplemental material is available online at www.genome.org.]

Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.3413205.


    REFERENCES
 

Akey, J.M., Eberle, M.A., Rieder, M.J., Carlson, C.S., Shriver, M.D., Nickerson, D.A., and Kruglyak, L. 2004. Population history and natural selection shape patterns of genetic variation in 132 genes. PLoS Biol. 2: 1591-1599.

Andolfatto, P. 2001. Adaptive hitchhiking effects on genome variability. Curr. Opin. Genet. Dev. 11: 635-641.

Andolfatto, P. and Przeworski, M. 2001. Regions of lower crossing over harbor more rare variants in African populations of Drosophila melanogaster. Genetics 158: 657-665.

Antezana, M.A. and Hudson, R.R. 1997. Before crossing over: The advantages of eukaryotic sex in genomes lacking chiasmatic recombination. Genet. Res. 70: 7-25.

Aquadro, C.F., Bauer DuMont, V., and Reed, F.A. 2001. Genome-wide variation in the human and fruitfly: A comparison. Curr. Opin. Genet. Dev. 11: 627-634.

Arnason, U. and Janke, A. 2002. Mitogenomic analyses of eutherian relationships. Cytogenet. Genome Res. 96: 20-32.

Bailey, W.J., Fitch, D.H.A., Tagle, D.A., Czelusniak, J., Slightom, J.L., and Goodman, M. 1991. Molecular evolution of the ψη-globin gene locus: Gibbon phylogeny and the hominoid slowdown. Mol. Biol. Evol. 8: 155-184.

Bamshad, M. and Wooding, S.P. 2003. Signatures of natural selection in the human genome. Nat. Rev. Genet. 4: 99-111.

Basu, S., Nagy, J.A., Pal, S., Vasile, E., Eckelhoefer, I.A., Bliss, V.S., Manseau, E.J., Dasgupta, P.S., Dvorak, H.F., and Mukhopadhyay, D. 2001. The neurotransmitter dopamine inhibits angiogenesis induced by vascular permeability factor/vascular endothelial growth factor. Nat. Med. 7: 569-574.

Begun, D.J. and Aquadro, C.F. 1992. Levels of naturally occurring DNA polymorphism correlate with recombination rates in D. melanogaster. Nature 356: 519-520.

Bella, J., Kolatkar, P.R., Marlor, C.W., Greve, J.M., and Rossmann, M.G. 1998. The structure of the two amino-terminal domains of human ICAM-1 suggest how it functions as a rhinovirus receptor and as an LFA-1 integrin ligand. Proc. Natl. Acad. Sci. 95: 4140-4145.

Berendt, A.R., Simmons, D.L., Tansey, J., Newbold, C.I., and Marsh, K. 1989. Intercellular adhesion molecule-1 is an endothelial cell adhesion receptor for Plasmodium falciparum. Nature 341: 57-59.

Brombacher, F. 2000. The role of interleukin-13 in infectious diseases and allergy. BioEssays 22: 646-656.

Bullard, D.C., King, P.D., Hicks, M.J., Dupont, B., Beaudet, A.L., and Elkon, K.B. 1997. Intercellular Adhesion Molecule-1 deficiency protects MRL/MpJ-Faslpr mice from early lethality. J. Immunol. 159: 2058-2067.

Burnet, M., Guy, F., Pilbeam, D., Mackaye, H.T., Likius, A., Ahounta, D., Beauvilain, A., Blondel, C., Bocherens, H., Boisserie, J.R., et al. 2002. A new hominid from the Upper Miocene of Chad, Central Africa. Nature 418: 145-151.

Cao, L., Jiao, X., Zuzga, D.S., Liu, Y., Fong, D.M., Young, D., and During, M.J. 2004. VEGF links hippocampal activity with neurogenesis, learning and memory. Nat. Genet. 36: 827-835.

Charlesworth, B. 1990. Mutation-selection balance and the evolutionary advantage of sex and recombination. Genet. Res. 55: 199-221.

____. 1996. Background-selection and patterns of genetic diversity in Drosophila melanogaster. Genet. Res. Camb. 68: 131-149.

____. 2001. The effect of life-history and mode of inheritance on neutral genetic variability. Genet. Res. 77: 153-166.

Charlesworth, B., Morgan, M.T., and Charlesworth, D. 1993. The effect of deleterious mutations on neutral molecular variation. Genetics 134: 1289-1303.

Charlesworth, D., Charlesworth, B., and Morgan, M.T. 1995. The pattern of neutral molecular variation under the background selection model. Genetics 141: 1619-1632.

Chen, F. and Li, W. 2001. Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am. J. Hum. Genet. 68: 444-456.

Chiarello, A.G. and de Melo, F.R. 2001. Primate population densities and sizes in Atlantic forest remnants of northern Espírito Santo, Brazil. Int. J. Primatol. 22: 379-396.

Chin, R.K., Lo, J.C., Kim, O., Blink, S.E., Christiansen, P.A., Peterson, P., Wang, Y., Ware, C., and Fu, Y.-X. 2003. Lymphotoxin pathway directs thymic Aire expression. Nat. Immun. 4: 1121-1127.

Clark, A.G., Glanowski, S., Nielsen, R., Thomas, P.D., Kejariwal, A., Todd, M.A., Tanenbaum, D.M., Civello, D., Lu, F., Murphy, B., et al. 2003. Inferring nonneutral evolution from human-chimp-mouse orthologous gene trios. Science 302: 1960-1963.

Cochran, W.G. 1954. Some methods