Genome Res. 14:1664-1668, 2004
©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Resources
Large-Scale Validation of Single Nucleotide Polymorphisms in Gene Regions
Sequenom Inc., San Diego, California 92121 USA
Genome-wide association studies using large numbers of bi-allelic single nucleotide polymorphisms (SNPs) have been proposed as a potentially powerful method for identifying genes involved in common diseases. To assemble a SNP collection appropriate for large-scale association, we designed assays for 226,099 publicly available SNPs located primarily within known and predicted gene regions. Allele frequencies were estimated in a sample of 92 CEPH Caucasians using chip-based MALDI-TOF mass spectrometry with pooled DNA. Of the 204,200 designed assays that were functional, 125,799 SNPs were determined to be polymorphic (minor allele frequency >0.02), of which 101,729 map uniquely to the human genome. Many of the commonly available RefSNP annotations were predictive of polymorphic status and could be used to improve the selection of SNPs from the public domain for genetic research. The set of uniquely mapping, polymorphic SNPs is located within 10 kb of 66% of known and predicted genes annotated in LocusLink, which could prove useful for large-scale disease association studies.
Single nucleotide polymorphisms (SNPs) are the most abundant genetic variations in the human genome. They occur, on average, once every 300 base pairs of sequence with a minor allele frequency (MAF) greater than 1% (Kruglyak and Nickerson 2001 To explore the potential of large-scale association studies, we set out to develop a suitable collection of approximately 100,000 SNPs. With the HapMap project far from completion, the SNPs could not be selected on the basis of LD patterns. Since a collection of 100,000 SNPs would be far too few to provide dense coverage throughout the genome, we primarily focused on SNPs located within and around known and predicted genes. Additionally, we sought SNPs with MAF greater than 5% in Caucasian populations that could be used in case-control type study designs, assuming that relatively common genetic variations are responsible for common diseases.
There are a large number of publicly available SNPs. The number of reported nonredundant SNPs in NCBI's dbSNP database at the time these analyses were initiated (refSNPs) exceeded four million (dbSNP build 114, April 2003, http://www.ncbi.nlm.nih.gov/SNP/). Most recently, the number of SNPs in the public domain stands at over nine million. Originally, these SNPs were primarily putative polymorphisms discovered by in silico datamining algorithms (Buetow et al. 1999
From November 1999 through September 2001, we collected 226,099 putative SNPs, primarily ascertained from in silico expressed sequence tag (EST) comparison projects (Buetow et al. 1999 Of the 226,099 SNP assays designed and tested, 204,200 (90%) were functional, producing at least one of the two expected extension products based on the SNP definition. Out of these functional assays, 126,391 SNPs (62%) were identified as polymorphic in this Caucasian sample. To improve our ability to select additional polymorphic SNPs amenable to assay design from the public domain, we investigated the relationships between the standard RefSNP annotations and functional and polymorphic status (Table 1). For this comparison, we further subdivided the polymorphic SNPs into those with frequencies equal to or less than 0.05 (N = 12,169) and those greater than 0.05 (N = 114,222). While the strengths of associations between polymorphic status and RefSNP attributes vary, all are statistically significant (P-value < 10^-6), owing to the large sample size. For the few SNPs with frequency information (14%), those reporting high heterozygosity were much more likely to have a higher frequency in our sample. The two strongest predictors of polymorphic status available for nearly all SNPs were NCBI validation status and the number of submitters reporting the SNP (Submitter Count). We observed that 85% of SNPs that were reported as "validated" by NCBI were identified as polymorphic in this sample. Polymorphic SNPs were also more likely to have longer sequences for their submission (Length), be drawn from more recent RefSNP submissions (RS Build), be derived from genomic DNA (MolType), be mapped within introns (SNP Type), and map exactly one time to the genome (RS Mapping).
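The frequency classes and rates quoted above can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not the authors' pipeline: the function name `classify_maf` and the class labels are hypothetical, and the 0.02 cutoff is taken from the abstract's definition of polymorphic (minor allele frequency > 0.02).

```python
def classify_maf(maf: float) -> str:
    """Assign a minor allele frequency to one of the classes used in the text.

    Thresholds follow the text: polymorphic means MAF > 0.02, and polymorphic
    SNPs are split at 0.05. The label strings are illustrative only.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("a minor allele frequency must lie in [0, 0.5]")
    if maf <= 0.02:
        return "nonpolymorphic"
    if maf <= 0.05:
        return "polymorphic_low"
    return "polymorphic_high"


# Counts reported in the text.
designed = 226_099
functional = 204_200
polymorphic_low = 12_169     # 0.02 < MAF <= 0.05
polymorphic_high = 114_222   # MAF > 0.05

polymorphic = polymorphic_low + polymorphic_high
functional_rate = functional / designed       # fraction of designed assays that worked
polymorphic_rate = polymorphic / functional   # fraction of functional assays that were polymorphic

print(f"polymorphic SNPs: {polymorphic}")
print(f"functional: {functional_rate:.1%}, polymorphic: {polymorphic_rate:.1%}")
```

The two subclass counts sum to the 126,391 polymorphic SNPs given in the text, and the ratios round to the stated 90% and 62%.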
We developed an algorithm called eXTEND based on NCBI's ePCR program (Schuler 1997
The distribution of MAFs for the 158,295 functional assays that mapped uniquely to the human genome is shown in Figure 1. The shape of this distribution shows a larger proportion of high frequency than low frequency SNPs. This distribution is compared in Figure 1 to the distribution for 61,173 SNPs with Caucasian frequencies available from The SNP Consortium (TSC). Ignoring the excess of SNPs with frequencies at 0.05 intervals due to rounding in the TSC data set, the distribution is more uniform than we observed. The overabundance of high frequency SNPs can be partially explained by the tendency of the pool-based approach used in this study to overestimate the low mass extension product compared to the high mass extension product (Jurinke et al. 2003
Of 226,099 putative SNPs tested, 21,899 reactions (9.7%) did not result in a functional assay. The majority of such reaction failures could be attributed to one of four causes: (1) inaccurate sequence information in the region of the SNP for those that could not be mapped, (2) non-functional PCR and/or Mass EXTEND primers, (3) random processing failures, or (4) genomic regions that are difficult to amplify (e.g., GC-rich). We found that for 29% of the failed assays, the oligos did not map to the genome by eXTEND analysis, compared to 9% of nonpolymorphic assays and 7% of polymorphic assays (Table 1). We also found that failed assays were more likely to have shorter sequences for their submission, be drawn from earlier RefSNP submissions, not have been reported as validated by NCBI, be derived from cDNA submissions, not have a SNP type annotation, and map less often to the genome. Results of a mass spectrometry analysis of oligonucleotides used in failed assays ranged from the expected reagent, to incomplete synthesis, reagent with salt adducts, and in the most extreme cases, no product. Based on these observations, we developed oligonucleotide quality control software (Spectro-CHECK) for mass spectrometric monitoring of reagent quality. With these procedures in place the average failure rate for new SNP assays has been successfully reduced to 6%.
In this study we estimated the allele frequencies in 92 Caucasian subjects for 204,200 SNPs derived from public sources available from 1999 to 2001. Of 158,295 SNPs that map uniquely, 64% were confirmed polymorphic (MAF > 0.02). We compared our confirmation rates of polymorphic SNPs to four published studies using Caucasian samples. The studies by Marth et al. (2001
Apart from measurement methods, there were other notable differences between our study and those cited. Our study tested a larger number of SNPs than the previous studies, the largest of which (Gabriel et al. 2002 An important consideration for all researchers using SNPs for genetic research is the selection of informative SNPs for the study in question. In the absence of thoroughly validated allele frequencies for the ethnicity of interest, we found that the standard NCBI annotations can improve the selection of polymorphic SNPs (Table 1). For example, restricting our data only to those SNPs with Length > 447, RS Build > 100, NCBI Validation = "YES", and Submitter Count > 1 results in 14,640 SNPs, 80% of which have MAF greater than 0.1. This compares to 53% of all SNPs tested (including zero and multiple mapping SNPs), representing a substantial improvement for selecting common SNPs. The single most useful factor in this selection is NCBI Validation. Approximately 75% of SNPs annotated with "YES" in our sample are common in Caucasians. Ignoring NCBI Validation status results in a more modest improvement from 53% to 69% of common SNPs.
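The annotation-based selection described above amounts to a simple record filter. The sketch below illustrates it under stated assumptions: the `SnpRecord` fields are hypothetical names for the Table 1 annotations, and the four toy records are invented for illustration and are not data from this study.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SnpRecord:
    """Hypothetical container for the RefSNP annotations named in the text."""
    rs_id: str
    length: int            # submitted sequence length (Length)
    rs_build: int          # dbSNP build of the RefSNP (RS Build)
    ncbi_validated: bool   # NCBI Validation == "YES"
    submitter_count: int   # number of submitters (Submitter Count)
    maf: float             # minor allele frequency measured in the pool


def passes_annotation_filter(snp: SnpRecord) -> bool:
    """Apply the four cutoffs quoted in the text."""
    return (
        snp.length > 447
        and snp.rs_build > 100
        and snp.ncbi_validated
        and snp.submitter_count > 1
    )


def fraction_common(snps: Iterable[SnpRecord], threshold: float = 0.1) -> float:
    """Fraction of SNPs whose MAF exceeds the threshold (0.1 in the text)."""
    snps = list(snps)
    if not snps:
        return 0.0
    return sum(s.maf > threshold for s in snps) / len(snps)


# Toy records, invented for illustration only.
toy_snps: List[SnpRecord] = [
    SnpRecord("rs_toy1", 500, 110, True, 2, 0.25),
    SnpRecord("rs_toy2", 600, 105, True, 3, 0.04),
    SnpRecord("rs_toy3", 300, 110, True, 2, 0.12),
    SnpRecord("rs_toy4", 500, 110, False, 2, 0.30),
]

selected = [s for s in toy_snps if passes_annotation_filter(s)]
print([s.rs_id for s in selected])
print(fraction_common(toy_snps), fraction_common(selected))
```

Dropping the `ncbi_validated` condition from the filter corresponds to the weaker selection mentioned in the text, where NCBI Validation status is ignored.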
A summary of each of the 226,086 tested SNPs along with the allele frequency estimates is available as Supplemental material (Table S1). Allele frequencies for polymorphic SNPs have been submitted to the NCBI dbSNP repository. Such public information may prove useful to develop SNP maps of various sizes targeting gene regions of the human genome. Until the haplotype map is completed (Gibbs et al. 2003
Construction of DNA Pool and SNP Confirmation
Unrelated Caucasian DNA samples were purchased from Coriell. Ninety-two (92) DNA samples were measured and pooled in equimolar amounts to generate a single DNA pool for SNP confirmation and allele frequency estimation (Buetow et al. 2001
SNP Mapping
Public SNP Annotations
For SNP frequency comparisons we used the data gathered by the allele frequency/genotype project of The SNP Consortium, as provided on their site (http://snp.cshl.org). There were 61,266 SNPs with refSNP identifiers and valid frequency results that we examined in this work. The SNPs were biallelic, and for each SNP we compiled the frequencies of the two alleles, as estimated for the Caucasian samples. A TSC-validated SNP and a Sequenom SNP were matched if both had been matched to the same refSNP. There was not enough information in the TSC downloaded data to unambiguously match alleles in TSC-Sequenom pairs. We only considered pairs for which: (1) TSC allele 1 was identical to Sequenom allele 1 and TSC allele 2 identical to Sequenom allele 2, in which case we matched TSC allele 1 frequency to Sequenom allele 1 frequency; or (2) TSC allele 1 was identical to Sequenom allele 2 and TSC allele 2 identical to Sequenom allele 1, in which case we matched TSC allele 1 frequency to Sequenom allele 2 frequency; or (3) there was only one TSC allele available (which was true for SNPs found to be nonpolymorphic) and this was identical to Sequenom allele 1 or 2. We considered 7,997 such pairs, associated with 7,026 distinct refSNP identifiers (some refSNPs in this set had more than one corresponding TSC-Sequenom pair).
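The three matching rules above can be written as a small decision function. This is a sketch under assumptions, not the code used in the study: the function name `match_tsc_allele` is hypothetical, alleles are compared as plain strings with no strand conversion, and in case (3) the single TSC allele is assumed to be paired with whichever Sequenom allele it equals.

```python
from typing import Optional, Sequence, Tuple


def match_tsc_allele(
    tsc_alleles: Sequence[str],
    sequenom_alleles: Tuple[str, str],
) -> Optional[int]:
    """Return the index (0 or 1) of the Sequenom allele whose frequency is
    compared with the TSC allele 1 frequency, or None if the pair is
    ambiguous and should be dropped.
    """
    seq1, seq2 = sequenom_alleles

    if len(tsc_alleles) == 2:
        tsc1, tsc2 = tsc_alleles
        # Rule (1): same alleles in the same order.
        if (tsc1, tsc2) == (seq1, seq2):
            return 0
        # Rule (2): same alleles in swapped order.
        if (tsc1, tsc2) == (seq2, seq1):
            return 1
        return None

    if len(tsc_alleles) == 1:
        # Rule (3): only one TSC allele was reported (nonpolymorphic in TSC).
        tsc1 = tsc_alleles[0]
        if tsc1 == seq1:
            return 0
        if tsc1 == seq2:
            return 1

    return None


# Example calls with made-up alleles.
print(match_tsc_allele(("A", "G"), ("A", "G")))
print(match_tsc_allele(("G", "A"), ("A", "G")))
print(match_tsc_allele(("T",), ("C", "T")))
print(match_tsc_allele(("A", "C"), ("A", "G")))
```

Returning `None` for any pair that does not fit one of the three rules mirrors the text's decision to consider only unambiguous pairs.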
Statistical Methods
We are grateful to Dr. Eric Lai (GlaxoSmithKline, Inc., USA) and Dr. Ohara Osamu (Kazusa DNA Research Institute, Japan) for contributing SNPs to this project. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2421604.
1 Present address: National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892-2033.
2 Corresponding author. [Supplemental material is available online at www.genome.org. The following individuals kindly provided reagents, samples, or unpublished information as indicated in the paper: E. Lai, and O. Osamu.]
Altshuler, D., Pollara, V.J., Cowles, C.R., Van Etten, W.J., Baldwin, J., Linton, L., and Lander, E.S. 2000. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407: 513-516.
Bansal, A., van den Boom, D., Kammerer, S., Honisch, C., Adam, G., Cantor, C.R., Kleyn, P., and Braun, A. 2002. Association testing by DNA pooling: An effective initial screen. Proc. Natl. Acad. Sci. 99: 16871-16874.
Buetow, K.H., Edmonson, M.N., and Cassidy, A.B. 1999. Reliable identification of large numbers of candidate SNPs from public EST data. Nat. Genet. 21: 323-325.
Buetow, K.H., Edmonson, M., MacDonald, R., Clifford, R., Yip, P., Kelley, J., Little, D.P., Strausberg, R., Koester, H., Cantor, C.R., et al. 2001. High-throughput development and characterization of a genomewide collection of gene-based single nucleotide polymorphism markers by chip-based matrix-assisted laser desorption/ionization time-of-flight mass spectrometry. Proc. Natl. Acad. Sci. 98: 581-584.
Carlson, C.S., Eberle, M.A., Rieder, M.J., Smith, J.D., Kruglyak, L., and Nickerson, D.A. 2003. Additional SNPs and linkage-disequilibrium analyses are necessary for whole-genome association studies in humans. Nat. Genet. 33: 518-521.
Clark, A.G., Nielsen, R., Signorovitch, J., Matise, T.C., Glanowski, S., Heil, J., Winn-Deen, E.S., Holden, A.L., and Lai, E. 2003. Linkage disequilibrium and inference of ancestral recombination in 538 single-nucleotide polymorphism clusters across the human genome. Am. J. Hum. Genet. 73: 285-300.
Daly, M.J., Rioux, J.D., Schaffner, S.F., Hudson, T.J., and Lander, E.S. 2001. High-resolution haplotype structure in the human genome. Nat. Genet. 29: 229-232.
Dawson, E., Abecasis, G.R., Bumpstead, S., Chen, Y., Hunt, S., Beare, D.M., Pabial, J., Dibling, T., Tinsley, E., Kirby, S., et al. 2002. A first-generation linkage disequilibrium map of human chromosome 22. Nature 418: 544-548.
Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M., et al. 2002. The structure of haplotype blocks in the human genome. Science 296: 2225-2229.
Gibbs, R.A., Belmont, J.W., Hardenbol, P., Willis, T.D., Yu, F., Yang, H., Ch'ang, L.Y., Huang, W., Liu, B., Shen, Y., et al. 2003. The international HapMap project. Nature 426: 789-796.
Irizarry, K., Kustanovich, V., Li, C., Brown, N., Nelson, S., Wong, W., and Lee, C.J. 2000. Genome-wide analysis of single-nucleotide polymorphisms in human expressed sequences. Nat. Genet. 26: 233-236.
Jurinke, C., Oeth, P., and van den Boom, D. 2003. MALDI-TOF mass spectrometry: A versatile tool for high-performance DNA analysis. Mol. Biotechnol. 25: 147-164.
Kruglyak, L. and Nickerson, D.A. 2001. Variation is the spice of life. Nat. Genet. 27: 234-236.
Lander, E.S. 1996. The new genomics: Global views of biology. Science 274: 536-539.
Marnellos, G. 2003. High-throughput SNP analysis for genetic association studies. Curr. Opin. Drug. Discov. Devel. 6: 317-321.
Marth, G., Yeh, R., Minton, M., Donaldson, R., Li, Q., Duan, S., Davenport, R., Miller, R.D., and Kwok, P.Y. 2001. Single-nucleotide polymorphisms in the public domain: How useful are they? Nat. Genet. 27: 371-372.
Mohlke, K.L., Erdos, M.R., Scott, L.J., Fingerlin, T.E., Jackson, A.U., Silander, K., Hollstein, P., Boehnke, M., and Collins, F.S. 2002. High-throughput screening for evidence of association by using mass spectrometry genotyping on DNA pools. Proc. Natl. Acad. Sci. 99: 16928-16933.
Pruitt, K.D. and Maglott, D.R. 2001. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29: 137-140.
R Development Core Team. 2004. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Reich, D.E., Cargill, M., Bolk, S., Ireland, J., Sabeti, P.C., Richter, D.J., Lavery, T., Kouyoumjian, R., Farhadian, S.F., Ward, R., et al. 2001. Linkage disequilibrium in the human genome. Nature 411: 199-204.
Reich, D.E., Gabriel, S.B., and Altshuler, D. 2003. Quality and completeness of SNP databases. Nat. Genet. 33: 457-458.
Risch, N. and Merikangas, K. 1996. The future of genetic studies of complex human diseases. Science 273: 1516-1517.
Schuler, G.D. 1997. Sequence mapping by electronic PCR. Genome Res. 7: 541-550.
Sherry, S.T., Ward, M.H., Kholodov, M., Baker, J., Phan, L., Smigielski, E.M., and Sirotkin, K. 2001. dbSNP: The NCBI database of genetic variation. Nucleic Acids Res. 29: 308-311.
Shifman, S., Pisante-Shalom, A., Yakir, B., and Darvasi, A. 2002. Quantitative technologies for allele frequency estimation of SNPs in DNA pools. Mol. Cell. Probes 16: 429-434.
Stephens, J.C., Schneider, J.A., Tanguay, D.A., Choi, J., Acharya, T., Stanley, S.E., Jiang, R., Messer, C.J., Chew, A., Han, J.H., et al. 2001. Haplotype variation and linkage disequilibrium in 313 human genes. Science 293: 489-493.
Venables, W.N. and Ripley, B.D. 2002. Modern applied statistics with S. Springer, New York.
http://www.ncbi.nlm.nih.gov/SNP/; NCBI dbSNP home page.
http://snp.cshl.org; The SNP Consortium home page.
Received February 9, 2004; accepted in revised format June 2, 2004.