Genome Res. 14:1404-1412, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Methods
SNP Discovery in Pooled Samples With Mismatch Repair Detection
1 ParAllele Bioscience, South San Francisco, California, 94080, USA
2 Stanford Genome Technology Center, Palo Alto, California, 94304, USA
A targeted discovery effort is required to identify low frequency single nucleotide polymorphisms (SNPs) in human coding and regulatory regions. We here describe combining mismatch repair detection (MRD) with dideoxy terminator sequencing to detect SNPs in pooled DNA samples. MRD enriches for variant alleles in the pooled sample, and sequencing determines the nature of the variants. By using a genomic DNA pool as a template, 100 fragments were amplified and subsequently combined and subjected en masse to the MRD procedure. The variant-enriched pool from this one MRD reaction is enriched for the population variants of all the tested fragments. Each fragment was amplified from the variant-enriched pool and sequenced, allowing the discovery of alleles with frequencies as low as 1% in the initial population. Our results support that MRD-based SNP discovery can be used for large-scale discovery of SNPs at low frequencies in a population.
Random sequencing approaches have led to the identification of a tremendous number of single nucleotide polymorphisms (SNPs) in the human genome. Through the work of The SNP Consortium, 1.4 million SNPs have been identified (Sachidanandam et al. 2001 3 million SNPs with some level of validation. These SNPs provide researchers with a wealth of candidate SNPs in their desired candidate regions. Unfortunately, only a fraction of the disease-causing variations in regulatory and coding regions (cSNP) are identified through this approach (Kruglyak and Nickerson 2001
We present our utilization of mismatch repair detection (MRD; Faham et al. 2001 The basic approach is shown schematically in Figure 1. Sanger sequencing does not have sufficient sensitivity to detect rare alleles from genomic pools, as demonstrated in Figure 1, top trace, in which the PCR product from the pooled sample is sequenced directly. Instead, individual PCR reactions using pooled genomic DNA as a template are, in turn, pooled together and hybridized to PCR fragments from a single homozygous source (standard). These heteroduplexes are transformed into the mutation sorter strain, generating a pool of colonies enriched for variant alleles (compared with the standard). One amplification reaction from the variant-enriched pool is done for each amplicon, followed by a sequencing reaction to identify variant alleles in the population examined. The end result of this process is that the necessity of amplifying and sequencing many individuals is replaced with a pooled enrichment process that is carried out for hundreds or thousands of amplicons in a multiplexed fashion. The sequencing effort is thus reduced to the task of sequencing a standard and the variant-enriched pool.
Scheme for SNP Discovery in Pooled Samples
The basic MRD protocol and the mechanism of the fragment sorting based on the presence or absence of a mismatch (variation) by the mutation sorter have been described before (Faham et al. 2001
The enrichment procedure, with the exception of the PCR and sequencing steps, allows simultaneous processing of hundreds or thousands of sequences in one reaction. These multiplex steps replace a large number of PCR and sequencing reactions that would have been required in the traditional targeted SNP discovery procedure.
SNP Discovery in 126 Amplicons on Human Chromosome 21
To construct a set of standards, we performed 126 PCR reactions using as a template genomic DNA purified from a mouse-human hybrid carrying one copy of human chromosome 21. The use of the hybrid ensured the presence of only one allele in the standards. PCR products were pooled, and a library of the cloned standards was produced. The bacterial clones were pooled, and DNA extracted from this pool was used as the standard that was compared with PCR products from the population of interest. We used a pool of genomic DNA from 100 whites or 100 African Americans as a template for the PCR amplification. PCR products for each population were pooled and subjected to an MRD reaction, producing a variant pool of colonies enriched for alleles that differ from the standards. Plasmid DNA was isolated from the variant pool and used as a template for a PCR reaction for each amplicon, which was then sequenced with forward and reverse primers. We obtained sequence information on 105 and 102 of 126 amplicons in the white and African American populations, respectively. No sequence information was obtained for 15 amplicons in either population, and 60% of these failures were due to failure of the PCR amplification from genomic DNA. Seven of the 111 sequenced amplicons showed many "variants" as a result of amplifying paralogous sequences, and they were removed from the subsequent analysis (see footnote 4), reducing the total number of fragments analyzed to 104. For each of these 104 fragments, the sequence traces (forward and reverse) from the enriched pool(s) were compared with the traces of the standard, and SNPs were called. The list of SNPs identified by our method was then compared with the SNPs detected by the wafer technology.
In the 104 products that succeeded in at least one population, 44 SNPs were previously identified by using the wafer technology. We identified 42 of these 44 SNPs. SNPs identified by Patil et al. (2001 One expects that comparing the sequence against itself would generate no variants. So, in an additional experiment, we substituted 49 PCR products from the genomic DNA pool with PCR products from hybrid DNA. We did not detect any variation, suggesting that the false-positive rate for our method is low.
Two-Round MRD Enrichment
SNP Discovery in BRCA1 and BRCA2 Genes
The above experiment with chromosome 21 markers showed that our method has an excellent sensitivity for alleles of 10%. However, it did not clearly define the lower limit of the sensitivity. This is especially true because we implemented multiple significant improvements to the process, including the two rounds of enrichment. We designed an experiment to investigate the sensitivity of the method by testing multiple variants at different frequencies. We used 94 samples that had already been sequenced for all the coding exons of the BRCA1 gene (R. Kroiss, T.M.U. Wagner, D. Muhr, D. Richards, P. Shen, M. Schreiber, E. Fleischmann, G. Longbauer, E. Kubista, M. Kubista, et al., in prep.). There were 10 known SNPs in the BRCA1 exons. To assess our sensitivity at different allele frequency levels, we wanted to construct genomic pools to test each of these SNPs at frequencies ranging from 1% to 30%. We designed 95 amplicons that encompass all the exons of the BRCA1 and BRCA2 genes. We used homozygous DNA from a hydatidiform mole as a template for PCR. The PCR products were pooled and cloned en masse to generate the standard DNA. We constructed five different pools. The first pool was an equimolar mixture of all 94 samples. The other four pools were constructed by using five genomic samples and the homozygous mole DNA. The five genomic samples were very carefully quantitated and mixed in equal amounts. This DNA pool, the five-genome pool, was again carefully quantitated and mixed with four different ratios of excess mole DNA. The four pools had one part of the five-genome pool to 6, 13, 34, or 69 parts of the mole DNA. The mole DNA has no variation relative to itself, and so it effectively acted as a diluent of variant alleles in the other samples. The frequency of an allele in a pool was then the frequency of that allele among the five individuals divided by the dilution factor.
For example, an allele that is present in seven of 10 chromosomes among the five individuals has the final frequency of 10%, 5%, 2%, and 1% in the four pools. We used these four pools as well as the pool of all 94 samples as a template for 95 PCR reactions amplifying BRCA1 and BRCA2. The 95 PCR reactions from each of the five genomic mixtures were pooled and subjected to two rounds of MRD enrichment as described above. Eighty-nine out of 95 fragments yielded sequencing results from at least one of the five MRD reactions, and the failure in four of six cases was in the initial genomic PCR reaction. The sequencing traces from the standard and those from each MRD reaction were independently compared, and SNPs were called. A list of the SNPs detected in each pool was then compiled and compared with the known SNPs.
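The dilution arithmetic behind the four pools can be checked with a short calculation. The following is a minimal sketch, not code from the study: the function name `pool_frequencies` is hypothetical, and it assumes that the mole DNA contributes no variant alleles and that mixing by parts corresponds to mixing by genome copies. One part of the five-genome pool plus n parts of mole DNA gives a dilution factor of n + 1.

```python
from fractions import Fraction

# Parts of mole DNA mixed with one part of the five-genome pool (from the text).
MOLE_PARTS = [6, 13, 34, 69]


def pool_frequencies(variant_chromosomes, total_chromosomes=10, mole_parts=MOLE_PARTS):
    """Expected allele frequency in each diluted pool (hypothetical helper).

    Assumes the mole DNA carries only the standard allele, so it acts
    purely as a diluent of the variant allele.
    """
    base = Fraction(variant_chromosomes, total_chromosomes)
    # One part pool + n parts mole DNA -> dilution factor of n + 1.
    return [base / (n + 1) for n in mole_parts]


# An allele on 7 of the 10 chromosomes in the five individuals.
freqs = pool_frequencies(7)
print([f"{float(f):.0%}" for f in freqs])
```

Under these assumptions the dilution factors are 7, 14, 35, and 70, which reproduces the 10%, 5%, 2%, and 1% frequencies quoted in the example above.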
False Negatives
Sensitivity
False Positives
In addition to the already known SNPs, we found eight new variants (seven of them in BRCA2 and one in BRCA1) in the four pools constructed from pooling the five individuals. We sequenced these eight amplicons in the five individuals that made up the pools and detected seven of eight variants in at least one individual. The frequency of each SNP in the initial genomic mixture was calculated and is depicted in Figure 4B. As seen in Figure 4B, the detection of SNPs at 1% frequency is robust, and some variants were detected at a frequency as low as 0.5%.
The last SNP that was detected in one of the pools could not be seen in any of the individuals. This estimates that this method generates a false positive in
Reproducibility
Extent of Enrichment
Dideoxy terminator sequencing is the standard method to determine the nature of a variation. However, to identify a relatively infrequent allele, a large number of sequencing reactions need to be performed. The premise of our method is to combine a highly multiplexed assay to provide a variation-enriched sample that can be analyzed with a much smaller number of sequencing reactions. At least 1000 fragments can be processed in parallel by MRD without loss of sensitivity (H. Fakhrai-Rad, E. Namsaraev, and M. Faham, unpubl.), making it an ideal method for generating the variant-enriched sample (we have compared the discovery rate when the experiment was done in 200 plex and 950 plex and obtained identical sensitivity). In this proof of principle, we demonstrated that SNPs with frequency as low as 1% can be detected with high sensitivity.
The sensitivity threshold of 1% is largely defined by the PCR error rate. By using pfu ultra, we have determined that
This work describes the first methodology that uses pooled genomic DNA to detect previously unknown variations in many fragments. Methodologies that study genetic variations in pooled genomic DNA can be divided into three classes. The first class of methods is focused on the estimation of the frequency of a known SNP in a specific population (Krook et al. 1992 Multiple applications can be considered for this technology, including the identification of somatic mutations in which only a fraction of the cells carry mutant alleles, or the cataloguing of mutations in many genes in a pool of mutagenized animals. We believe an important application of this SNP discovery platform is the large-scale discovery of coding and regulatory SNPs in human populations.
One limitation of MRD-based SNP discovery is that multiple SNPs can occur on a particular sequencing fragment. If this occurs with the two SNPs having very different frequencies, the SNP with the higher frequency will tend to dominate the enriched pool, suppressing the signal of the rarer SNP. This effect can be mitigated in several ways. The first is to use fairly small PCR fragments to minimize the chances of the presence in the tested population of more than one SNP within a single fragment (we use fragments with average size of This limitation is to be weighed against the high costs of sequencing and analysis of many individuals in the traditional sequencing approach. Reducing the number of individuals sequenced in the classical manner reduces coverage by introducing Poisson noise in the choice of a small population. For example, by sequencing 15 individuals (30 chromosomes) there is a 40% chance of missing a 3% allele and a 37% chance of seeing the allele once. When an allele is seen only once, it is difficult to distinguish this allele from other private alleles that are present in the sequenced individuals. This is a distinct advantage for the MRD-based SNP discovery, which is insensitive to private variations. For example, by using 300 individuals (600 chromosomes), a 2% allele is represented on average 12 times and can be readily enriched, whereas a private allele at a frequency of about 0.17% (1 in 600) would not be sufficiently enriched to be detected. We have modeled the expected performance of MRD-based SNP discovery and compared it to the performance of traditional sequencing. As is shown in Figure 7, the sensitivity of the MRD-based SNP discovery is comparable to that obtained from sequencing 50 individuals. In this model, we assumed that for the MRD-based SNP discovery we would design our amplicons in such a way as to avoid validated SNPs in the public databases (to ameliorate the effect of having two SNPs in one amplicon).
A better performance would be obtained by doing another cycle of SNP discovery, avoiding all SNPs detected in the first cycle.
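The sampling figures quoted in this paragraph can be reproduced with a simple binomial model. This is an illustrative sketch rather than the model used for Figure 7: the helper name `prob_seen` is hypothetical, and it assumes diploid individuals whose 2N chromosomes are drawn independently from a population with the given allele frequency.

```python
from math import comb


def prob_seen(k, n_individuals, allele_freq):
    """Binomial probability of seeing an allele exactly k times (hypothetical helper).

    Assumes diploid individuals, i.e. 2 * n_individuals independent chromosomes.
    """
    n = 2 * n_individuals
    return comb(n, k) * allele_freq**k * (1 - allele_freq) ** (n - k)


# Sequencing 15 individuals, allele at 3% frequency.
p_missed = prob_seen(0, 15, 0.03)  # allele absent from all 30 chromosomes
p_once = prob_seen(1, 15, 0.03)    # allele seen on exactly one chromosome

# Pool of 300 individuals: expected copies of a 2% allele, and the
# frequency of a private allele present on a single chromosome.
expected_copies = 2 * 300 * 0.02
private_freq = 1 / (2 * 300)

print(f"missed: {p_missed:.2f}, seen once: {p_once:.2f}")
print(f"expected copies: {expected_copies:.0f}, private allele: {private_freq:.2%}")
```

With these assumptions the probabilities come out at roughly 0.40 and 0.37, and the 2% allele is expected 12 times among 600 chromosomes, in line with the values stated in the text.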
The use of MRD to enrich for variant alleles can cut the cost and effort involved in sequencing. With MRD enrichment, only two samples need to be sequenced for each amplicon (the variant pool and the standard), compared with 50 individuals in the traditional approach. Therefore, MRD enrichment can lead to a 25-fold reduction in sequencing. The cost and effort of performing the MRD reaction are amortized over hundreds of distinct fragments that can be processed in one reaction. With 1000 fragments processed simultaneously, each MRD reaction replaces 96,000 sequencing reactions (48 forward and 48 reverse sequencing reactions saved per amplicon, multiplied by 1000 amplicons) and the associated trace analysis overhead.
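The savings estimate follows from a few multiplications, sketched below. The function name and parameter names are hypothetical; the default values (50 individuals, two MRD samples, forward and reverse reads) are the ones given in the text.

```python
def sequencing_reactions_saved(n_amplicons, individuals_traditional=50,
                               mrd_samples=2, reads_per_sample=2):
    """Sequencing reactions avoided by one multiplexed MRD reaction (hypothetical helper).

    mrd_samples counts the variant pool and the standard;
    reads_per_sample counts the forward and reverse reads.
    """
    samples_saved = individuals_traditional - mrd_samples
    return samples_saved * reads_per_sample * n_amplicons


fold_reduction = 50 / 2  # samples sequenced per amplicon: traditional vs. MRD
saved = sequencing_reactions_saved(1000)
print(fold_reduction, saved)
```

For 1000 amplicons this gives 48 samples × 2 reads × 1000 = 96,000 reactions, matching the figure in the paragraph above.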
Much effort is being spent to develop linkage disequilibrium maps for the human genome to be used in later association studies (Dawson et al. 2002
Construction of Standards
All enzymes used were from New England Biolabs (NEB) unless otherwise specified. Amplicons were designed to amplify exons and flanking intron sequences. Primer selection was done through a batch version of PRIMER3 (Rozen and Skaletsky 1996 , and selection for transformants was done in liquid by adding 100 µg/mL carbenicillin. DNA was prepared and transformed into GM2929. The two-step transformation is needed because dam strains have a low transformation efficiency. DNA obtained from this transformation was used in later steps.
MRD Protocol
PCR reactions using the genomic pool as a template were performed with pfu turbo hotstart polymerase (or pfu ultra for the BRCA experiment), using a protocol similar to that described above. The PCR products from each population were pooled and purified on a QiaQuick column (Qiagen). Methylation of the PCR products was carried out by addition of Tris (pH 7.6) to a final concentration of 50 mM, SAM (NEB) to a final concentration of 80 µM, and 8 U dam methylase (NEB), followed by incubation at 37°C for 1 or 2 h. The PCR pool was then digested with Cla I and Sac II for 1 to 2 h at 37°C. For each MRD reaction, 2 µg of the above PCR product pool was mixed with 2 µg of the pool of the unmethylated standard DNA and 2 µg of digested vector carrying the inactive Cre gene, pMRD400. pMRD400 is the same as pMRD300 except for a 5-bp deletion in the Cre gene. The three components were concentrated to 10 µL by using a QiaQuick minielute column (Qiagen); 0.5 µL of 0.5 M EDTA, 0.5 µL of 200 mM Tris (pH 7.6), 0.5 µL of 20x SSC, and 1.25 µL of freshly diluted 1 M NaOH were added, and the mixture was incubated for 15 min at room temperature. Then 1.25 µL of 2 M Tris (pH 7.2) and 12.5 µL of formamide were added, and reannealing was allowed to occur overnight at 42°C. The hybridization mixture was desalted by using a column (Edge Biosystems). Three microliters of Taq Ligase buffer and 5 U of Mbo I were added and incubated for 15 min, followed by addition of 40 U of Taq Ligase (NEB) and further incubation for 30 min at 65°C. Fifty units of exonuclease III (USB) and 20 U of T7 exonuclease (USB) were added and incubated for 30 min at 37°C. Ten microliters of SOPE Resin (Edge Biosystems) was added to eliminate single-stranded DNA, and a QiaQuick cleanup (Qiagen) was done before transformation.
Transformation of the MS3 strain was done by electroporation (Micropulser, BioRad). The electrocompetent MS3 cells preparation and the electroporation procedure were done as recommended (Ausubel et al. 1999
Sanger Sequencing and Sequence Analysis
We thank Dr. Martin G. Marinus for the generous gift of the dam strain GM2929 and Dr. Peidong Shen at the Stanford Genome Technology Center for providing the BRCA1 sequence. We also thank the members of the Stanford Genome Technology Center and ParAllele Bioscience for their constant support. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2373904.
3 Corresponding author.
4 In order to ameliorate this problem, amplification primers specific for one member of the family need to be designed; we have implemented software that would accomplish that by performing BLAST on the primer sequences and changing primer pairs that were not unique.
Amos, C.I., Frazier, M.L., and Wang, W. 2000. DNA pooling in mutation detection with reference to sequence analysis. Am. J. Hum. Genet. 66: 1689-1692.
Ausubel, F.M., Brent, R., Kingston, R.E., Moore, D.D., Seidman, J.G., Smith, J.A., and Struhl, K. 1999. Current protocols in molecular biology. John Wiley and Sons, New York.
Cargill, M., Altshuler, D., Ireland, J., Sklar, P., Ardlie, K., Patil, N., Shaw, N., Lane, C.R., Lim, E.P., Kalyanaraman, N., et al. 1999. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. 22: 231-238.
Carlson, C.S., Eberle, M.A., Rieder, M.J., Smith, J.D., Kruglyak, L., and Nickerson, D.A. 2003. Additional SNPs and linkage-disequilibrium analyses are necessary for whole-genome association studies in humans. Nat. Genet. 33: 518-521.
Dawson, E., Abecasis, C.R., Bumpstead, S., Chen, Y., Hunt, S., Beare, D.M., Pabial, J., Dibling, T., Tinsley, E., Kirby, S., et al. 2002. A first-generation linkage disequilibrium map of human chromosome 22. Nature 418: 544-548.
Faham, M., Baharloo, S., Tomitaka, S., DeYoung, J., and Freimer, N. 2001. Mismatch repair detection (MRD): High throughput scanning for DNA variations. Hum. Mol. Genet. 10: 1657-1664.
Gabriel, S., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M., et al. 2002. The structure of haplotype blocks in the human genome. Science 296: 2225-2229.
Haga, H., Yamada, R., Ohnishi, Y., Nakamura, Y., and Tanaka, T. 2002. Gene-based SNP discovery as part of the Japanese Millennium Genome Project 2002: Identification of 190,562 genetic variations in the human genome. J. Hum. Genet. 47: 605-610.
Halushka, M.K., Fan, J.B., Bentley, K., Hsie, L., Shen, N., Weder, A., Cooper, R., Lipshutz, R., and Chakravarti, A. 1999. Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nat. Genet. 22: 239-247.
International HapMap Consortium. 2003. The international HapMap project. Nature 426: 789-796.
Johnson, G.C.L., Esposito, L., Barratt, B.J., Smith, A.N., Heward, J., Di Genova, G., Ueda, H., Cordell, H.J., Eaves, I.A., Dudbridge, F., et al. 2001. Haplotype tagging for the identification of common disease genes. Nat. Genet. 29: 233-237.
Krook, A., Stratton, I.M., and O'Rahilly, S. 1992. Rapid and simultaneous detection of multiple mutations by pooled and multiplex single nucleotide primer extension: Application to the study of insulin-responsive glucose transporter and insulin receptor mutations in non-insulin dependent diabetes. Hum. Mol. Genet. 1: 391-395.
Kruglyak, L. and Nickerson, D.A. 2001. Variation is the spice of life. Nat. Genet. 27: 234-236.
Kwok, P-Y., Carlson, C., Yager, T.D., Ankener, W., and Nickerson, D.A. 1994. Comparative analysis of human DNA variations by fluorescence-based sequencing of PCR products. Genomics 23: 138-144.
McKinzie, P.B. and Parsons, B.L. 2002. Detection of rare K-ras codon 12 mutations using allele-specific competitive blocker PCR. Mutat. Res. 27: 209-220.
Modrich, P. 1991. Mismatch repair. Ann. Rev. Genet. 25: 229-248.
Parsons, B.L. and Heflich, R.H. 1998. Detection of basepair substitution mutation at a frequency of 1 x 10^-7 by combining two genotypic selection methods, MutEx enrichment and allele-specific competitive blocker PCR. Environ. Mol. Mutagen. 32: 200-211.
Patil, N., Berno, A.J., Hinds, D.A., Barrett, W.A., Doshi, J.M., Hacker, C.R., Kautzer, C.R., Lee, D.H., Marjoribanks, C., McDonough, D.P., et al. 2001. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294: 1719-1723.
Reich, D.E., Schaffner, S.F., Daly, M.J., McVean, G., Mullikin, J.C., Higgins, J.M., Richter, D.J., Lander, E.S., and Altshuler, D. 2002. Human genome sequence variation and the influence of gene history, mutation and recombination. Nat. Genet. 32: 135-142.
Rozen, S. and Skaletsky, H.J. 1996, 1997, and 1998. Primer3. Code available at: http://www-genome.wi.mit.edu/genome_software/other/primer3.html.
Sachidanandam, R., Weissman, D., Schmidt, S.C., Kakol, J.M., Stein, L.D., Marth, G., Sherry, S., Mullikin, J.C., Mortimore, B.J., Willey, D.L., et al. 2001. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409: 928-933.
Werner, M., Sych, M., Herbon, N., Illig, T., Konig, I.R., and Wjst, M. 2002. Large-scale determination of SNP allele frequencies in DNA pools using MALDI-TOF mass spectrometry. Hum. Mutat. 20: 57-64.
Wolford, J.K., Blunt, D., Ballecer, C., and Prochazka, M. 2000. High-throughput SNP detection by using DNA pooling and denaturing high performance liquid chromatography (DHPLC). Hum. Genet. 107: 483-487.
http://brie2.cshl.org/; The SNP Consortium Web site.
http://genome.ucsc.edu/cgi-bin/hgBlat?command=start; UCSC Genome Bioinformatics.
Received January 19, 2004; accepted in revised format April 8, 2004.