|
|
|
|
Published online before print
September 5, 2006, 10.1101/gr.5431206 Genome Res. 16:1320-1327, 2006 ©2006 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/06 $5.00
Methods Inference of population genetic parameters in metagenomics: A clean look at messy data1Biophysics Graduate Group, University of California, Berkeley, California 94720, USA; 2Department of Integrative Biology, University of California, Berkeley, California 94720, USA
Metagenomic projects generate short, overlapping fragments of DNA sequence, each deriving from a different individual. We report a new method for inferring the scaled mutation rate, = 2Neu, and the scaled exponential growth rate, R= Ner, from the site-frequency spectrum of these data while accounting for sequencing error via Phred quality scores. After obtaining maximum likelihood parameter estimates for and R, we calculate empirical Bayes quality scores reflecting the posterior probability that each apparently polymorphic site is truly polymorphic; these scores can then be used for other applications such as SNP discovery. For realistic parameter ranges, analytic and simulation results show our estimates to be essentially unbiased with tight confidence intervals. In contrast, choosing an arbitrary quality score cutoff (e.g., trimming reads) and ignoring further quality information during inference yields biased estimates with greater variance. We illustrate the use of our technique on a new project analyzing activated sludge from a lab-scale bioreactor seeded by a wastewater treatment plant.
Metagenomics applies shotgun-sequencing techniques to DNA extracted from a microbial community with the goal of learning about the ecological dynamics of the constituent microorganisms. Population genetics provides a theoretical basis to make inferences about population structure and evolution given samples of DNA sequences from individuals.
Although recent large-scale metagenomics sequencing projects (Tyson et al. 2004
In addition, researchers cannot afford to throw away experimental data. Even with modern high-throughput capillary sequencing technology, the cost and speed of DNA sequencing remain a limiting constraint on research (Shendure et al. 2004
The key methodological step that makes population metagenomic analysis possible lies in the difference between metagenomic sequencing and traditional sequencing. Instead of sequencing reads from a single isolate organism, metagenomics projects sequence reads from a pool of DNA extracted from all individuals in the sampled community. Considering the large number of individuals in the sampled community relative to the number of reads sequenced, each read derives from a different individual microorganism with probability near one. Thus, we have a population sample equal to the "depth," or number of reads, at every site in the assembly of overlapping reads. If only a few sites were sequenced or read depths were very low, this sample would not allow for meaningful statistical inference; however, with metagenomic assemblies spanning entire genomes and having average depths as high as 10 (Tyson et al. 2004
Estimated values for the apparent scaled mutation rate (
Given a population sample from n individuals (i.e., a constant read depth of n), the site-frequency spectrum represents the distribution of polymorphic sites that have a derived (as opposed to ancestral) nucleotide at a particular frequency in the sample. For example, in general, the most common type of polymorphic site will be one in which exactly one individual in the sample has the derived nucleotide; the next most common type of site will be one in which exactly two individuals in the sample have the derived nucleotide, and so on. Experimental data requires the sequence from an outgroup to distinguish which of the two nucleotides is ancestral and which is derived. If an outgroup is not available, the frequency spectrum must be "folded" to combine the indistinguishable categories (i.e., 1 and n 1, 2 and n 2, etc.).
With some assumptions about how mutations occur and how a population evolves, theory can predict the shape of this spectrum. Given an observed spectrum, maximum likelihood estimation can be used to work backward and infer the most likely parameters of the theoretical model. Relative comparison of these parameter estimates from different regions of the same genome provide useful information even when the underlying assumptions are questionable. This type of "genomic control" has been used in numerous studies when working with genome-scale data including some using the frequency spectrum for inference (Marth et al. 2004
Sawyer and Hartl have characterized the frequency spectrum as an explicit function of the selection coefficient for a constant-size population (Sawyer and Hartl 1992
Instead, this study makes use of frequency spectrum formulas arising from two types of neutrally evolving populations, i.e., one that experiences exponential growth (Polanski and Kimmel 2003 In reality, the frequency spectrum arising from evolutionary processes (e.g., mutation, selection, drift) is obscured by errors stemming from the data collection process. Raw data from automated sequencing machines provides a prior distribution for these error probabilities when using data from metagenomics sequencing projects.
The Sanger method for DNA sequencing depends on chemical reactions that contain stochastic elements, which lead to varying output quality (Ewing et al. 1998
Base-calling software converts the analog fluorescence output of automated sequencers into a string of digital nucleotides and attempts to quantify the probability of a given base being called in error. The widely used base-calling software Phred reports a quality score based on the shape of the peak and the shape of neighboring peaks calibrated to a particular sequencing chemistry (Ewing and Green 1998
A more refined estimate of sequence quality can be obtained by combining information across multiple aligned reads. If the reads all derive from a single haploid individual or are otherwise expected to be nonpolymorphic, then an algorithm need only consider the various error probabilities (Li et al. 2004
Despite these tools for quantifying error rates, to the best of our knowledge, most studies performing population genetic analysis use human visual confirmation of quality in combination with a strict Phred quality score cutoff of 30 or 25, which corresponds to between one and three errors per thousand bases (e.g., Brown et al. 2004
Here, we avoid picking an arbitrary quality threshold and instead explicitly incorporate quality scores into the likelihood function used to estimate the parameters
Below, we demonstrate the significant advantage of our approach over a cutoff-based method through the use of analytic and Monte Carlo techniques. After establishing that the method works in principle, we proceed to apply it to initial experimental data from a recent metagenomics sequencing project of activated sludge from a wastewater treatment plant (Martin et al. 2006
Analytic We first verified analytically that, for the restricted case of constant quality and depth across all sites, our method yields a nearly unbiased estimate for and R over reasonable parameter ranges. These results assume independence of sites and that sequencing errors cause a switch to any other nucleotide with equal probability. With the constant population size model, this result can be proven explicitly (e.g., if n = 5 and Pr(err) = 0.01, then ; further results not shown). For R > 0, we numerically solved for asymptotic parameter estimates using a range of input parameter values ( :0.001,...,0.01; R:1,...,50; n= 5,9; Pr(err) = 0.001, 0.01) covering two orders of magnitude in polymorphism rate (0.00016 to 0.017). As detailed below, the sludge parameter estimates fall within this range. Since metagenomics projects provide the first large population samples of microbial sequence, previous estimates for the range of microbial polymorphism rates are essentially nonexistent. Although mutation rates in Escherichia coli have been estimated via long-term laboratory evolution experiments to be on the order of 1010 per site per generation (Lenski et al. 2003
The numerical solutions consistently recovered the input parameter values with the exception of the case with low quality (Pr(err)
Simulation To test the utility of our method in a more realistic situation with variable depth and quality, we simulated data from populations experiencing a range of scaled mutation and growth rates. Each simulation sampled 100,000 independent sites from the same quality score distribution and a slightly modified version of the depth distribution as found in the largest contiguous sequence (170 kb) from the sludge metagenomics project (Fig. 2). We modified the depth distribution by trun cating the right-hand tail to a maximum of 20 to increase computation speed when performing replicate simulations. When a "sequencing error" occurred, the simulation switched the given nucleotide to one of the other three with probability 1/3. In addition to comparing our method against the true simulated value, we also used a simple cutoff scheme that ignores all bases below a threshold quality and completely trusts all bases above the threshold. This cutoff technique approximates the approach taken by previous studies in which low- quality bases are ignored and the rest are approved by a human (Brown et al. 2004
Using the same parameter ranges as above, we initially vary from 0.001 to 0.01 while keeping R fixed at 1 (Fig. 3). Taking the average sludge depth of nine, this range for corresponds to the rate of polymorphism ranging from 0.0017 to 0.017. Next, we vary R from 0 to 50 while keeping fixed at 0.01 (Fig. 4), which corresponds to a polymorphism rate ranging from 0.027 to 0.0024. In all cases, the cutoff estimates are distinctly biased in addition to having much greater variance than the estimates incorporating quality scores. Given our sample size (100,000 sites) and our depth distribution (Fig. 2), the folded estimates performed nearly as well as the full estimates.
After obtaining quality-based parameter estimates, we can also calculate an empirical Bayes quality score for apparent polymorphic positions (see Methods, equation 8) that much more closely predicts the true quality of those bases than the original quality score. In addition, our model provides a slight advantage over the posterior SNP probability reported by PolyBayes (Marth et al. 1999
Sludge Application of the folded version of our technique to the largest contiguous sequence in the sludge data set yielded
Our estimate for the polymorphism rate in the sludge (i.e., 1 Pr(d= 0|
Figures 3 and 4 ) and, in particular, an excess of sites with a single apparent derived nucleotide (overestimating R for reasons detailed below). If the cutoff were raised to a level sufficient to avoid the error, the increased variance due to lower sample size would eliminate the signal (Fig. 6).
While the results are quite promising, we have some caveats.
First, the analytic work and simulations treat sequencing error as causing a switch to any other nucleotide with uniform probability 1/3, which, given the sludge score and depth distributions, leads to a prediction that:
Second, decreasing the number of sites or depth leads to an increase in the variance of the estimators (particularly Third, when we multiply the per-site likelihoods to form the total likelihood (equation 7), we assume sites segregate independently. Estimation of recombination rate from metagenomic data remains a challenge. Dependence due to low levels of recombination will bias these estimates on an absolute scale, but, as with the other model assumptions, relative comparison of estimates from different regions of the same genome will still provide insight into the evolutionary processes at work.
Finally, the validity of these parameter estimates depends entirely on the accuracy of the given metagenomic assembly on which they are based. The very polymorphism that makes population genetic analysis possible also poses a challenge to the assembly process, making metagenomic assembly algorithms an area of active research (Chen and Pachter 2005
Given a constant-size or exponentially growing panmictic recombining population, our technique provides reliable estimates for In implementation of our Population genetic Inference In Metagenomics (PIIM) technique is freely available for download from http://ib.berkeley.edu/labs/slatkin/software.html
Statistical inference Likelihood calculation We begin with an assembly of metagenomic reads, FASTA sequences for the reads, and quality scores for every base call in each read. More generally, the data consist of a set of aligned reads containing K sites with a vector of quality scores for each site. Since the read depth varies across the assembly, we denote the read depth at a particular site k {1,...,K} to be nk . We assume independence of sites and that, at most, two different nucleotides are observed at any given site. If a given site actually has more than two nucleotides, we group the nucleotides into two classesancestral and nonancestral (if the ancestral is known) or the single most frequent nucleotide and everything else (if the ancestral is not known). For the duration of the Methods section, we will refer to these two classes of nucleotides as "ancestral" and "derived"; if an outgroup is not available, these references should be interpreted as "major" and "minor," along with the adjustments noted below.
First, we establish some notation. Starting with a single site where n reads align (i.e., a depth of n), we need two pieces of informationthe set of error probabilities, Q= {q 1,...,qn } and the subset of these probabilities corresponding to the observed derived nucleotides, Do
Where d is the number of apparent derived nucleotides that are actually ancestral and d + is the number of apparent ancestral nucleotides that are actually derived. For simplicity of notation, since the order of the reads is irrelevant, take the first do reads to be derived and the rest as ancestral (i.e., Do = {q 1,...,qdo }). For example, a site with depth n= 4 might have nucleotides (A, A, A, T) and quality scores (q 1,q 2,q 3,q 4). If A is derived, then Do = (q 1,q 2,q 3), and thus, do = 3.
Now, given these error probabilities and the set of frequency spectrum parameters (denoted by [H9024] and discussed in detail in the next section), we want to know the likelihood of the observed configuration of nucleotides at a single site: Pr(Do | Q,n,
Note that this sum stops at d = n 1 since frequency spectrum theory makes no statement regarding the fixation probability of the derived nucleotide (d = n). When we lack an outgroup to orient the polymorphism, we must fold the spectrum by changing the first term in the above sum to be (Pr(Do | d,Q) + Pr(Do | n d,Q))/2. Looking in more detail at the sequencing error term of equation 2, we condition on D, which we define in a parallel fashion to Do such that it is the assignment of the d true-derived nucleotides to a subset of the Q error probabilities:
Where D
For reasons of computational efficiency outlined later, we calculate this quantity in an indirect fashion. Specifically, the sum in equation 3 is equivalent to Pr(
The probability of a given deviation from the truth is a function of the probability of its d + and d [H11502] components (equation 1). For instance,
Where P and P + give the conditional probability distributions for d and d + , respectively. Note the probability distribution for d (ancestral-to-derived) depends only on the error probabilities for the observed derived nucleotides (Do ) while the distribution for d + (derived-to-ancestral) depends only on the error probabilities for the observed ancestral nucleotides (Q\Do ). Intuitively, to calculate the probability that d = x, we must sum the probabilities of all possible ways of having exactly x errors in the set of observed derived nucleotides. Formally, assuming errors are independent among reads, the random variable d has conditional distribution:
Where
In the n= (A, A, A, T) example above, the probability that d = 2 is the probability that the first two nucleotides were in error and the third was not (q 1 q 2(1 q 3)), plus the probability that the first and third were in error and the second was not (q 1 q 3(1 q 2)), plus the probability that the second and third were in error and the first was not (q 2 q 3(1 q 1)). Here,
Now, to find the total log likelihood across all K sites, we take a composite likelihood approach by assuming independence and summing the logarithm of the individual likelihoods (equation 4, substituting equation 5, equation 6, and the equation for P + analogous to equation 6). Below, we use superscripts to denote observations at a particular site, k
Frequency spectra
The neutral, constant population size frequency spectrum has the form 1/d, where d = 1...n 1. The expected proportion of all sites that are polymorphic, and thus participating in the 1/d formula, is
Since
In analytic expression for the neutral, exponentially growing population frequency spectrum has recently been derived (Polanski and Kimmel 2003
After finding the maximum likelihood parameter estimates for the two models, we compute the likelihood ratio test statistic (2 log[lik const/lik growth]) to determine whether the more general model (population growth) fits significantly better than the more limited model (constant size). If the model assumption of independence is applicable, then this statistic follows a
Revised quality score
Note that, since this equation uses point estimates for the parameters, the resulting probabilities do not incorporate the variance in
Computational complexity
Calculation of the probability distribution for d (equation 6), along with the corresponding distribution for d + , forms the most computationally intensive step of this process by consuming O(2 n ) time, where n is the read depth at a given position. Although metagenomic projects tend to have low-average depths, some sites will range considerably higher, pushing 2 n beyond the range of reasonable calculation. To work around this problem, we took advantage of the fact that this function has only one maximum, which, for realistic quality scores, is found near the low end of the distribution (d small). Our algorithm calculates this function in order of increasing d until reaching the first value past the peak that falls below 1010. From then on, all further values in the distribution are considered to be zero. The dynamic nature of this strategy prevents a rigorous analysis of the time savings; however, in practice, this simple change allows the program to run in a reasonable amount of time on a modern desktop computer (
Sludge metagenome
We thank Phil Hugenholtz and Victor Kunin at JGI for bringing this problem to our attention and sharing the sludge data that inspired this project. The excellent suggestions of three anonymous reviewers, Anna-Sapfo Malaspinas, John Novembre, Clair Null, Rachel Whitaker, and Weiwei Zhai improved earlier versions of this manuscript. This research was supported in part by National Institutes of Health grant R01-GM40282 to M.S.
3 Corresponding author.
E-mail plfjohnson{at}berkeley.edu; fax (510) 643-6264. Article published online before print. Article and publication date are online at http://www.genome.org/cgi/doi/10.1101/gr.5431206.
Brown, G., Gill, G., Kuntz, R., Langley, C., Neale, D. 2004. Nucleotide diversity and linkage disequilibrium in loblolly pine. Proc. Natl. Acad. Sci. 101: 1525515260. Bustamante, C., Wakeley, J., Sawyer, S., Hartl, D. 2001. Directional selection and the site-frequency spectrum. Genetics 159: 17791788. Cargill, M., Altshuler, D., Ireland, J., Sklar, P., Ardlie, K., Patil, N., Shaw, N., Lane, C.R., Lim, E.P., Kalyanaraman, N. et al. 1999. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. 22: 231238.[CrossRef][Medline] Chen, K. and Pachter, L. 2005. Bioinformatics for whole-genome shotgun sequencing of microbial communities. PLoS Comput. Biol. 1: 106112. Crocetti, G., Hugenholtz, P., Bond, P., Schuler, A., Keller, J., Jenkins, D., Blackall, L. 2000. Identification of polyphosphate-accumulating organisms and design of 16S rRNA-directed probes for their detection and quantitation. Appl. Environ. Microbiol. 66: 11751182. DeLong, E., Preston, C., Mincer, T., Rich, V., Hallam, S., Frigaard, N., Martinez, A., Sullivan, M., Edwards, R., Brito, B. et al. 2006. Community genomics among stratified microbial assemblages in the ocean's interior. Science 311: 496503. Ewens, W. In Mathematical population genetics: I. Theoretical introduction .2004. 2nd ed. Springer-Verlag, New York. Ewing, B. and Green, P. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8: 186194. Ewing, B., Hillier, L., Wendl, M.C., Green, P. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8: 175185. Feil, E.J. and Spratt, B.G. 2001. Recombination and the population structures of bacterial pathogens. Annu. Rev. Microbiol. 55: 561590.[CrossRef][Medline] Gajer, P., Schatz, M., Salzberg, S.L. 2004. Automated correction of genome sequence errors. Nucleic Acids Res. 32: 562569. Galassi, M., Davies, J., Theiler, J., Gough, B., Jungman, G., Booth, M., Rossi, F. In GNU Scientific Library Reference Manual .2003. 2nd ed. Network Theory Ltd, Bristol UK. Hudson, R.R. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18: 337338. Irizarry, K., Kustanovich, V., Li, C., Brown, N., Nelson, S., Wong, W., Lee, C.J. 2000. Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences. Nat. Genet. 26: 233236.[CrossRef][Medline] Kimura, M. 1969. The number of heterozygous nucleotide sites maintained in a finite population due to steady flux of mutations. Genetics 61: 893903. Lenski, R.E., Winkworth, C.L., Riley, M.A. 2003. Rates of DNA sequence evolution in experimental populations of Escherichia coli during 20,000 generations. J. Mol. Evol. 56: 498508.[CrossRef][Medline] Li, M., Nordborg, M., Li, L.M. 2004. Adjust quality scores from alignment and improve sequencing accuracy. Nucleic Acids Res. 32: 51835191. Martín, H.G., Ivanova, N., Kunin, V., Warnecke, F., Barry, K., McHardy, A.C., Yeates, C., He, S., Salamov, A., Szeto, E. et al. 2006. Metagenomic analysis of phosphorus removing sludge communities. Nat. Biotechnol. in press. Marth, G., Czabarka, E., Murvai, J., Sherry, S. 2004. The allele frequency spectrum in genome-wide human variation data reveals signals of differential demographic history in three large world populations. Genetics 166: 351372. Marth, G.T., Korf, I., Yandell, M.D., Yeh, R.T., Gu, Z., Zakeri, H., Stitziel, N.O., Hillier, L., Kwok, P.Y., Gish, W.R. et al. 1999. A general approach to single-nucleotide polymorphism discovery. Nat. Genet. 23: 452456.[CrossRef][Medline] Neafsey, D., Blumenstiel, J., Hartl, D. 2004. Different regulatory mechanisms underlie similar transposable element profiles in pufferfish and fruitflies. Mol. Biol. Evol. 21: 23102318. Nelder, J. and Mead, R. 1965. A simplex method for function minimization. Computer Journal 7: 308315. Nielsen, R. 2000. Estimation of population parameters and recombination rates from single nucleotide polymorphisms. Genetics 154: 931942. Nielsen, R., Bustamante, C., Clark, A., Glanowski, S., Sackton, T., Hubisz, M., Fledel-Alon, A., Tanenbaum, D., Civello, D., White, T. et al. 2005a. A scan for positively selected genes in the genomes of humans and chimpanzees. PLoS Biol. 3: e170.[CrossRef][Medline] Nielsen, R., Williamson, S., Kim, Y., Hubisz, M.J., Clark, A.G., Bustamante, C. 2005b. Genomic scans for selective sweeps using SNP data. Genome Res. 15: 15661575. Polanski, A. and Kimmel, M. 2003. New explicit expressions for relative frequencies of single-nucleotide polymorphisms with application to statistical inference on population growth. Genetics 165: 427436. Polanski, A., Bobrowski, A., Kimmel, M. 2003. A note on distributions of times to coalescence, under time-dependent population size. Theor. Popul. Biol. 63: 3340.[CrossRef][Medline] Sawyer, S. and Hartl, D. 1992. Population genetics of polymorphism and divergence. Genetics 132: 11611176.[Abstract] Seviour, R.J., Mino, T., Onuki, M. 2003. The microbiology of biological phosphorus removal in activated sludge systems. FEMS Microbiol. Rev. 27: 99127.[CrossRef][Medline] Shendure, J., Mitra, R.D., Varma, C., Church, G.M. 2004. Advanced sequencing technologies: Methods and goals. Nat. Rev. Genet. 5: 335344.[Medline] Stephens, M., Sloan, J., Robertson, P., Scheet, P., Nickerson, D. 2006. Automating sequence-based detection and genotyping of SNPs from diploid samples. Nat. Genet. 38: 375381.[CrossRef][Medline] Strous, M., Pelletier, E., Mangenot, S., Rattei, T., Lehner, A., Taylor, M., Horn, M., Daims, H., Bartol-Mavel, D., Wincker, P. et al. 2006. Deciphering the evolution and metabolism of an anammox bacterium from a community genome. Nature 440: 790794.[CrossRef][Medline] Tringe, S., von Mering, C., Kobayashi, A., Salamov, A., Chen, K., Chang, H., Podar, M., Short, J., Mathur, E., Detter, J. et al. 2005. Comparative metagenomics of microbial communities. Science 308: 554557. Tyson, G., Chapman, J., Hugenholtz, P., Allen, E., Ram, R., Richardson, P., Solovyev, V., Rubin, E., Rokhsar, D., Banfield, J. et al. 2004. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 428: 3743.[CrossRef][Medline] Venter, J., Remington, K., Heidelberg, J., Halpern, A., Rusch, D., Eisen, J., Wu, D., Paulsen, I., Nelson, K., Nelson, W. et al. 2004. Environmental genome shotgun sequencing of the Sargasso Sea. Science 304: 6674. Whitaker, R.J. and Banfield, J.F. 2006. Population genomics in natural microbial communities. Trends Ecol. Evol. (in press). Williamson, S., Fledel-Alon, A., Bustamante, C.D. 2004. Population genetics of polymorphism and divergence for diploid selection models with arbitrary dominance. Genetics 168: 463475. Williamson, S.H., Hernandez, R., Fledel-Alon, A., Zhu, L., Nielsen, R., Bustamante, C.D. 2005. Simultaneous inference of selection and population growth from patterns of variation in the human genome. Proc. Natl. Acad. Sci. 102: 78827887. Wright, S. 1931. Evolution in Mendelian populations. Genetics 16: 97159. Wright, S. In Evolution and the genetics of populations, Vol. 2: The theory of gene frequencies. .1969. University of Chicago Press, Chicago IL. Zhu, L. and Bustamante, C.D. 2005. A composite-likelihood approach for detecting directional selection from DNA sequence data. Genetics 170: 14111421.
Received April 24, 2006; accepted in revised format July 17, 2006. This article has been cited by other articles:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||