Published online before print
September 25, 2007, 10.1101/gr.5996407 Genome Res. 17:1596-1602, 2007 ©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07 $5.00
Letter
Information capture using SNPs from HapMap and whole-genome chips differs in a sample of inflammatory and cardiovascular gene-centric regions from genome-wide estimates
Clinical Pharmacology and the Genome Centre, William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, London EC1M 6BQ, United Kingdom
Large-scale genetic association studies are now widely conducted using SNPs selected from the International HapMap Project or provided on commercial "whole genome" chips. As only a subset of human genetic variation has been identified, it is unknown what proportion of the total genetic variation can be captured in this way, although recent genome-wide estimates of SNP capture rates have been encouraging. We estimated the expected gene-centric information capture for whole-genome chips using sequence data from 306 inflammatory/cardiovascular genes and found SNP capture rates notably lower than previous genome-wide estimates. Further investigation indicates that a major explanation for these lower capture rates is the aggregation of particular sequence features that influence both linkage disequilibrium and the amenability of SNPs for genotyping within the broad class of inflammatory/cardiovascular genes. This suggests that the power of genetic association studies in some complex traits will depend not only upon established factors, such as allele frequency and penetrance, but may also be influenced by the distribution of sequence features in the class of genes expected to underlie the disease of interest.
The HapMap project (International HapMap Consortium 2005
Previous studies have concluded that the entire set of HapMap SNPs can capture, with r2 > 0.8, 94% of common SNP variation genome wide in European populations and 81% in African populations (International HapMap Consortium 2005 It is anticipated that many of the disease-associated variants that will be found in genome-wide studies are likely to be located in or near genes, meaning that it is important to consider coverage in gene-centric regions specifically. We set out to estimate the proportion of common gene-centric SNPs that can be captured using HapMap-derived tag SNP sets and commercial whole-genome SNP chips using public sources of sequence data (SeattleSNPs [http://pga.gs.washington.edu] and PARC [http://droog.mbt.washington.edu/parc]), which cover 306 genes (6.4 Mb) in total.
The 306 SeattleSNPs/PARC genes included in this study contained a total of 31,965 SNPs; their breakdown into rare and common variants and according to population is shown in Table 1. Sequenced length per gene ranged from 3.3 kb to 103 kb (median, 17.5 kb).
We identified a total of 9713 SNPs in HapMap version 21a located within the sequenced regions. Of these, 8904 (92%) were polymorphic in SeattleSNPs/PARC. Among the common HapMap SNPs (MAF > 5%), 4725 of 5011 (94%) and 5425 of 5790 (93%) were polymorphic in SeattleSNPs/PARC in European and African descent populations, respectively. These figures should be weighed against the expected number of common HapMap SNPs that would appear monomorphic in 46 sequenced chromosomes, given the HapMap allele frequency distribution. We calculated this proportion to be about 1%, suggesting that perhaps 5% of HapMap SNPs may have been missed in the resequencing efforts. However, HapMap has recently released an updated data set (version 22), and a number of SNPs that were in version 21a have been excluded. Interestingly, 69 (15%) of the common HapMap SNPs we failed to identify in SeattleSNPs/PARC were among the list of excluded SNPs, compared with only 22 (0.3%) of the SNPs we did identify, suggesting that some of the SNPs we failed to align may have been incorrectly positioned in HapMap version 21a. Further comparison of genotyped SNPs common to both resources for the subset of 78 genes sequenced in a subset of HapMap individuals demonstrated a high genotype call concordance rate (98.3% in a total of 74,892 genotype calls over 3401 SNPs). Thus, our alignment of the SeattleSNPs/PARC resources to HapMap and the quality of the SeattleSNPs/PARC data were validated.
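The expected proportion of common SNPs that would appear monomorphic in 46 sequenced chromosomes can be illustrated with a short calculation. The following is a minimal sketch, not the authors' procedure: it assumes the 46 chromosomes are drawn independently at random, and the function names and example allele frequencies are hypothetical (they are not HapMap data).

```python
def prob_monomorphic(maf: float, n_chromosomes: int = 46) -> float:
    """Probability that a biallelic SNP with minor allele frequency `maf`
    shows only one allele in a random sample of `n_chromosomes`.

    Assumes chromosomes are sampled independently (an illustrative assumption).
    """
    return (1.0 - maf) ** n_chromosomes + maf ** n_chromosomes


def expected_monomorphic_fraction(mafs, n_chromosomes: int = 46) -> float:
    """Average monomorphic probability over a collection of allele frequencies."""
    mafs = list(mafs)
    return sum(prob_monomorphic(p, n_chromosomes) for p in mafs) / len(mafs)


# Hypothetical allele frequencies, for illustration only.
example_mafs = [0.05, 0.10, 0.20, 0.30, 0.40, 0.50]

if __name__ == "__main__":
    for p in example_mafs:
        print(f"MAF {p:.2f}: P(monomorphic in 46 chromosomes) = {prob_monomorphic(p):.4f}")
    print(f"Average over example frequencies: {expected_monomorphic_fraction(example_mafs):.4f}")
```

Under these assumptions, only SNPs near the 5% frequency threshold have an appreciable chance of appearing monomorphic, so the average over a realistic frequency distribution is small.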
Information measures
Capture rates are a useful summary measure, but are based on a dichotomization of a continuous statistic—the maximum r2 between any sequenced polymorphism and a set of tag SNPs. Associations with disease-related polymorphisms will still be detectable at more moderate r2 (0.5–0.8) given sufficiently large sample sizes, but the opposite tail of the distribution contains polymorphisms with which association will not be detectable, no matter how large a sample is available. To examine these we introduce the "noncapture rate"—the proportion of sequenced SNPs that have a maximum r2 < 0.2 with any tag SNP or haplotype of tag SNPs. Jorgenson and Witte (2006)
We calculated information measures for each of our six tag SNP sets, and these are presented in Table 2. SeattleSNPs and PARC included African, African American, and European American samples (see Methods for details). For brevity, we describe the European American samples as "European descent" and both the African American and African samples as "African descent." The tag SNP set composed of all HapMap SNPs captured the majority of the gene-centric sequence SNPs (in European descent samples, capture rates CRE = 77%, mean maximum r2,
Using a subset of HapMap-derived tag SNPs proved to be an efficient strategy: in return for substantial reductions of over 55% in genotyping requirements, information capture fell only moderately (CRE = 66%, mean maximum r2 = 78% in European descent samples; CRA = 43%, mean maximum r2 = 61% in African descent samples) and noncapture rose (nCRE = 8%; nCRA = 18%).
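The three information measures used above (capture rate, noncapture rate, and mean maximum r2) can be sketched in a few lines. This is a simplified illustration rather than the analysis pipeline used in the study: it considers only pairwise r2 with single tag SNPs (not haplotypes of tag SNPs), assumes phased haplotypes coded 0/1, and uses toy data and hypothetical function names.

```python
def r_squared(a, b):
    """Squared correlation between two biallelic SNPs coded 0/1 on phased haplotypes."""
    n = len(a)
    pa = sum(a) / n
    pb = sum(b) / n
    pab = sum(x * y for x, y in zip(a, b)) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    if denom == 0:
        return 0.0  # at least one SNP is monomorphic in this sample
    d = pab - pa * pb
    return d * d / denom


def information_measures(haplotypes, tags, hi=0.8, lo=0.2):
    """Return (capture rate, noncapture rate, mean maximum r2).

    haplotypes: dict mapping SNP name -> list of 0/1 alleles (one per chromosome)
    tags:       names of the SNPs in the tag set (pairwise tagging only)
    """
    max_r2 = {
        snp: max(r_squared(alleles, haplotypes[t]) for t in tags)
        for snp, alleles in haplotypes.items()
    }
    n = len(max_r2)
    capture = sum(1 for v in max_r2.values() if v > hi) / n
    noncapture = sum(1 for v in max_r2.values() if v < lo) / n
    mean_max = sum(max_r2.values()) / n
    return capture, noncapture, mean_max


# Toy data: four SNPs on eight chromosomes (illustrative only).
toy = {
    "A": [0, 0, 0, 0, 1, 1, 1, 1],
    "B": [0, 0, 0, 0, 1, 1, 1, 1],  # perfect LD with A
    "C": [0, 1, 0, 1, 0, 1, 0, 1],  # no LD with A
    "D": [0, 0, 0, 1, 1, 1, 1, 1],  # moderate LD with A
}

if __name__ == "__main__":
    cr, ncr, mean_r2 = information_measures(toy, tags=["A"])
    print(f"capture rate = {cr:.2f}, noncapture rate = {ncr:.2f}, mean max r2 = {mean_r2:.2f}")
```

In this toy example, SNP D is neither captured nor noncaptured: it falls in the moderate r2 range in which association may still be detectable with a sufficiently large sample.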
The commercially available whole-genome chips contained substantially fewer SNPs in the sequenced regions than the number of HapMap-derived tags and, as a result, did not perform as well in comparison. The Illumina HumanHap550 was the strongest performer (CRE = 53%,
Comparison with genome-wide estimates and correction for bias due to short sequenced regions
We attempted to correct for this bias in two ways. First, we estimated the extent of the bias by resampling from the ENCODE data set. This showed that information measures calculated using short sequenced regions were underestimated by a factor that varied according to tagset, but not population (Table 3). If we assume no systematic differences between the regions within which the SeattleSNPs/PARC genes lie and the regions sequenced by ENCODE, we can multiply the information measures from the SeattleSNPs/PARC data by the inverse of these underestimation factors. These corrected information measures remain below previous published estimates (Table 4).
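The correction described above amounts to a simple rescaling. The sketch below assumes the underestimation factor is defined as the ratio of the mean measure from short resampled regions to the measure from the full region; that definition, the cap at 1, the function names, and all numbers are illustrative assumptions and are not taken from Tables 3 or 4.

```python
def underestimation_factor(short_region_values, full_region_value):
    """Ratio of the mean measure in short resampled regions to the full-region measure.

    This definition is an assumption made for illustration.
    """
    if full_region_value <= 0:
        raise ValueError("full_region_value must be positive")
    mean_short = sum(short_region_values) / len(short_region_values)
    return mean_short / full_region_value


def corrected_measure(observed, factor):
    """Multiply an observed measure by the inverse of the underestimation factor.

    The result is capped at 1.0 because these measures are proportions
    (the cap is an added safeguard, not part of the described method).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    return min(1.0, observed / factor)


if __name__ == "__main__":
    # Hypothetical numbers, for illustration only.
    factor = underestimation_factor([0.60, 0.64, 0.68], full_region_value=0.80)
    print(f"underestimation factor = {factor:.2f}")
    print(f"corrected capture rate = {corrected_measure(0.40, factor):.2f}")
```

The rescaling is valid only under the stated assumption that the gene regions and the ENCODE regions do not differ systematically.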
We also attempted to overcome the bias by incorporating HapMap data to extend the length of SeattleSNPs/PARC-sequenced regions in an extended window analysis. A total of 78 of the 306 SeattleSNPs/PARC genes were sequenced in a subset of HapMap individuals. For this subset, we combined sequenced genotypes with HapMap genotypes in successively larger windows, allowing all HapMap SNPs to be potential tags, and thus incorporating long-range LD. Figure 1 shows that information capture increases as window size increases. The effect on HapMap derived tagsets is modest (e.g., CRE = 76% for all HapMap SNPs with no window compared with CRE = 79% with a 200-kb window). However, the underestimation is greater for genome-wide chips, as might be expected given their lower density of SNPs compared with HapMap, and therefore, their greater reliance on long-range LD.
These two methods both lead to increased estimates of information capture, but still substantially below published estimates. For example, for the Affymetrix GeneChip 500k, the corrected CRE = 56%, and the windowed CRE = 45% compared with 64% (Barrett and Cardon 2006
Explaining the residual difference in information capture estimates
Also, ENCODE has sequenced more samples (60 compared with 23 or 24 for SeattleSNPs/PARC). We resampled 23 (CEU) or 24 (YRI) samples from the ENCODE data set and recalculated information measures. After 1000 replications, the mean information measures across the resampled data set suggested little systematic difference compared with those from the entire data set. Finally, we considered whether differences in the allele frequency spectra between SeattleSNPs/PARC and HapMap could explain the differences in estimated capture rates, as HapMap is biased toward common SNPs. However, we estimated capture rates only for SNPs with MAF > 5%, and, although the allele frequency spectra show the expected excess of rare SNPs in the SeattleSNPs/PARC data, the frequency spectra for common SNPs are not dissimilar between HapMap and SeattleSNPs/PARC (Supplemental Fig. 3). In addition, reanalysis of only SNPs with MAF > 10% still showed considerably lower capture rates in SeattleSNPs for the Affymetrix chips than a similar analysis using ENCODE (data not shown). We also compared the distribution of interspersed repeats in SeattleSNPs/PARC and ENCODE (Fig. 2). These are sequence features within which SNP genotyping can be difficult and fall into four classes: long interspersed elements (LINEs), short interspersed elements (SINEs), long terminal repeat (LTR) retrotransposons, and DNA transposons. Their distribution is similar to genome-wide averages in the ENCODE regions and similar in the SeattleSNPs/PARC regions to "gene-centric" ENCODE regions (within 10 kb of a known gene), except for a lower frequency of LINEs in SeattleSNPs/PARC (9.3% vs. 15%) and a slightly higher frequency of SINEs (15.2% vs. 13.9%).
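The sample-size check described above (drawing 23 or 24 of the 60 ENCODE samples, 1000 times, and averaging the recalculated measure) follows a generic subsampling pattern. The sketch below is illustrative only: it substitutes a simple allele-frequency statistic for the information measures, and the genotype data, seed, and function names are hypothetical.

```python
import random


def resampled_mean(samples, k, statistic, n_rep=1000, seed=0):
    """Mean of `statistic` over `n_rep` random subsets of size `k`,
    drawn without replacement from `samples`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    total = 0.0
    for _ in range(n_rep):
        subset = rng.sample(samples, k)
        total += statistic(subset)
    return total / n_rep


def allele_frequency(genotypes):
    """Frequency of the counted allele, given genotypes coded 0/1/2 per individual."""
    return sum(genotypes) / (2 * len(genotypes))


# Hypothetical genotypes for 60 individuals (stand-in for a real statistic's input).
genotypes = [0] * 30 + [1] * 20 + [2] * 10

if __name__ == "__main__":
    full = allele_frequency(genotypes)
    sub = resampled_mean(genotypes, k=23, statistic=allele_frequency)
    print(f"full data set (n = 60): {full:.3f}")
    print(f"mean over 1000 subsets of 23: {sub:.3f}")
```

A large gap between the two printed values would indicate that the smaller sample size biases the statistic; a small gap corresponds to the "little systematic difference" reported in the text.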
An interesting pattern emerges when we compare the above distributions with the proportion of SNPs identified in each sequence feature (Fig. 2). Within SeattleSNPs/PARC, the proportion of SNPs in each feature is similar to the proportion of sequenced region in each feature, with perhaps a small increase in the number of SNPs found in SINEs compared with their sequenced length (18% vs. 15%). This is in keeping with a recent report that SNPs are found more frequently in SINEs than neighboring sequences (Ng and Xue 2006 80% of ENCODE SNPs identified by sequencing efforts are genotyped in HapMap (http://www.hapmap.org/downloads/encode1.html.en), we considered separately all ENCODE SNPs submitted to dbSNP (ENCODE-seq) and those ENCODE SNPs genotyped by HapMap (ENCODE-HapMap). The proportion of SNPs and sequenced regions within each feature are also similar comparing ENCODE-seq SNPs with ENCODE regions. However, for ENCODE-HapMap, SNPs in SINEs appear under-represented (6%) compared with the proportion of ENCODE-sequenced regions in SINEs (12%). While there may be an increase in false positive SNPs in SeattleSNPs/PARC, as it is notoriously difficult to sequence through repeat regions, this low frequency of SNPs in SINEs only in ENCODE-HapMap suggests an undersampling of SNPs in SINEs by ENCODE-HapMap. This is most likely due to difficulties creating unique genotyping primers for such SNPs. This difference in the composition of the resources is important because information capture for SNPs in repeat features tends to be lower compared with an "average" SNP (e.g., Supplemental Fig. 4 shows capture rates for SNPs in each sequence feature, but a similar pattern is also seen for maximum mean r2), but this is most marked for SNPs in SINEs. A data set that under-represents these difficult to capture SNPs, then, could lead to inflated estimates of information capture.
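The comparison made above, between the proportion of sequenced bases in each repeat class and the proportion of SNPs falling in that class, can be sketched as follows. This is a minimal illustration under stated assumptions: repeat annotations are given as non-overlapping half-open intervals (start, end, class), and the coordinates, SNP positions, and function name are invented for the example rather than taken from the rmsk annotation.

```python
from collections import defaultdict


def feature_proportions(region, repeats, snp_positions):
    """For each repeat class, return (fraction of region bases, fraction of region SNPs).

    region:        (start, end), half-open
    repeats:       iterable of (start, end, repeat_class), half-open, assumed non-overlapping
    snp_positions: iterable of SNP coordinates
    """
    r_start, r_end = region
    length = r_end - r_start
    snps_in_region = [p for p in snp_positions if r_start <= p < r_end]
    bases = defaultdict(int)
    snps = defaultdict(int)
    for start, end, cls in repeats:
        s, e = max(start, r_start), min(end, r_end)  # clip the repeat to the region
        if e <= s:
            continue
        bases[cls] += e - s
        snps[cls] += sum(1 for p in snps_in_region if s <= p < e)
    n = len(snps_in_region)
    return {
        cls: (bases[cls] / length, snps[cls] / n if n else 0.0)
        for cls in bases
    }


if __name__ == "__main__":
    # Invented coordinates, for illustration only.
    region = (0, 1000)
    repeats = [(100, 250, "SINE"), (400, 600, "LINE")]
    snp_positions = [10, 50, 120, 130, 240, 450, 500, 700, 900, 990]
    for cls, (base_frac, snp_frac) in feature_proportions(region, repeats, snp_positions).items():
        print(f"{cls}: {base_frac:.0%} of sequence, {snp_frac:.0%} of SNPs")
```

A class whose SNP fraction is well below its sequence fraction, as reported for SINEs in ENCODE-HapMap, would point to undersampling of SNPs in that class.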
Our gene-centric analysis reveals lower information capture for HapMap and whole-genome SNP chips than previously published genome-wide estimates. We believe this difference results from a combination of different study designs and the contrasting resources that have been used to estimate capture rates. Our results tend to underestimate capture due to long-range LD with distant chip SNPs outside the sequenced regions. However, extended-window analysis of 78 genes and resampling of ENCODE demonstrate that this underestimation explains only part of the lower information capture observed here. The other major explanation appears to be the higher proportion of SNPs in SINEs found in and around the SeattleSNPs/PARC genes, combined with lower capture rates in these features.
There are at least two mechanisms by which capture rates may be lower in sequence features. First, because it is more difficult to create unique primers, SNPs are less likely to be captured directly by virtue of their inclusion in the tag set and, indeed, SNPs in all sequence features are less likely to be included on any chip, but the effect is most dramatic for SNPs in SINEs (Supplemental Fig. 5). Second, sequence features may also affect the chance of a SNP being captured through high LD with one of its neighbors. The extent of LD in a region depends, among other things, on the local recombination rate, and this is correlated with proximity to particular sequence features. LINEs and SINEs, in particular, have been associated with decreases and increases in local recombination rates, respectively (Yu et al. 2001
We note that the relatively lower frequency of SNPs in SINEs in the ENCODE-HapMap data, given that such SNPs are harder to capture, could have led to overoptimistic genome-wide estimates of capture rates in a previous ENCODE-based empirical evaluation (Peer et al. 2006b
These findings raise interesting questions about why the SeattleSNPs/PARC genes studied here display frequencies of interspersed repeats so different from genome-wide averages. The frequency of these elements has been shown to be correlated with LD, and LD, in turn, has been shown to be correlated with the broad functional class of a gene, with inflammatory genes displaying the lowest average LD of 35 classes considered (Smith et al. 2005
Thus, this snapshot of sequence data from SeattleSNPs/PARC is not representative of the average genome-wide distribution of interspersed repeats, apparently due to the nature of sequences within inflammatory genes. Had we considered another class of genes with a different set of sequence features (e.g., those involved in DNA metabolism that display higher than average LD; Smith et al. 2005 An important point is that this analysis assumes all SNPs on a given chip genotype successfully, which is in contrast to the experience in real datasets, where a proportion of SNPs fail. Thus, the results here (and in other studies) represent an upper bound for information capture in an ideal world. In summary, our results suggest that, for any particular disease that may result from variation in a particular functional class of genes, SNP chip performance may differ from genome-wide estimates of average performance. Although information capture is generally expected to improve with the new one million SNP chips recently released and in the pipeline, it is likely that additional technological approaches will be required to genotype variants in repeat sequence features, and hence, capture all common variation. These findings may remain particularly important for disease-gene detection in studies of disorders with an inflammatory etiology.
Data sets
We retrieved data from the SeattleSNPs and PARC databases for all 306 genes labeled "complete," and which had been sequenced in human samples by March 20, 2006. Initially, sequencing in both of these projects was conducted in samples from American individuals of European (n = 23) and African (n = 24) descent, distinct from those used by HapMap. In October 2004, these were replaced by a subset of the HapMap CEU samples (n = 23) and YRI (Yoruba in Ibadan, Nigeria) samples (n = 24), so that 228 genes were sequenced in independent samples, and the remaining 78 in a subset of HapMap samples. All 306 genes were included in this study. A summary of the number of polymorphisms studied is given in Table 1. We extracted all SNPs from HapMap release 20 within the sequenced regions and aligned the two resources.
Alignment of SeattleSNPs/PARC with build 35 of the human genome
We placed SNPs from SeattleSNPs and PARC, which were not yet in dbSNP, onto build 35 of the human genome by using MegaBLAST with the same settings as those used by the NCBI for mapping dbSNP entries onto the genome. The flanking sequence was initially cleaned with RepeatMasker (http://www.repeatmasker.org) and then the MegaBLAST was performed with a word size of 28. The position of the SNP in the SeattleSNPs and PARC sequences had to be returned in the alignment for a mapping to be considered valid.
Proportion of HapMap SNPs that would appear monomorphic in SeattleSNPs/PARC
Exclusion of genes
We retrieved the ENCODE data from HapMap release 20 and found the position of all known genes within these regions by manually extracting their coordinates from the HapMap view of each region. There was no overlap between the ENCODE regions and the SeattleSNPs/PARC regions.
The coordinates of sequence features on builds 34 and 35 of the human genome were retrieved from the rmsk table in the Table Browser at the UCSC Human Genome Browser Gateway (http://genome.ucsc.edu/cgi-bin/hgGateway). Genome-wide average rates of sequence features used in Figure 2 were taken from Table 11 in Lander et al. (2001).
Estimation of capture measures
We expected SNPs toward the ends of sequenced regions, not directly typed in the tag set under consideration, might be in high LD, not with any tag SNP within the sequenced region, but with one lying outside that region, causing capture rates to be underestimated. For each sequenced region, we estimated the LD-block structure and considered polymorphisms in the flanking blocks separately. (We allow that a "block" may consist of a single SNP). Capture rates in these flanking blocks were
LD blocks were inferred using all three methods programmed in Haploview. Estimated capture rates were very similar across all three, and we chose to use the method "SPINE" in the final results, as it was the most conservative (selecting the largest LD blocks and resulting in marginally higher estimated capture rates). We used tagger (de Bakker et al. 2005
Resampling from ENCODE to estimate degree of bias due to missed long-range LD
Extended windows analysis
We thank Illumina for sharing the list of SNPs on their HumanHap550 chip, and John Todd, David Clayton, and anonymous reviewers for helpful comments. C.W. is a British Heart Foundation Intermediate Fellow (Grant no. FS/05/061/19501). R.D., M.C., and P.M. are supported by program grants from the Medical Research Council (G9521010D) and the British Heart Foundation (PG02/128). M.C. is a principal investigator on the Wellcome Trust Case Control Consortium (076113/B/04/Z).
1 Corresponding author.
E-mail c.wallace{at}qmul.ac.uk; fax 44-20-7882-3408. [Supplemental material is available online at www.genome.org.] Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.5996407
Barrett, J.C. and Cardon, L.R. 2006. Evaluating coverage of genome-wide association studies. Nat. Genet. 38: 659–662.
Barrett, J.C., Fry, B., Maller, J., and Daly, M.J. 2005. Haploview: Analysis and visualization of LD and haplotype maps. Bioinformatics 21: 263–265.
de Bakker, P.I.W., Yelensky, R., Peer, I., Gabriel, S.B., Daly, M.J., and Altshuler, D. 2005. Efficiency and power in genetic association studies. Nat. Genet. 37: 1217–1223.
ENCODE Project Consortium. 2004. The ENCODE (ENCyclopedia Of DNA Elements) project. Science 306: 636–640.
Hinds, D.A., Kloek, A.P., Jen, M., Chen, X., and Frazer, K.A. 2006. Common deletions and SNPs are in linkage disequilibrium in the human genome. Nat. Genet. 38: 82–85.
International HapMap Consortium. 2005. A haplotype map of the human genome. Nature 437: 1299–1320.
Jorgenson, E. and Witte, J.S. 2006. Coverage and power in genomewide association studies. Am. J. Hum. Genet. 78: 884–888.
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860–921.
McCarroll, S.A., Hadnott, T.N., Perry, G.H., Sabeti, P.C., Zody, M.C., Barrett, J.C., Dallaire, S., Gabriel, S.B., Lee, C., Daly, M.J., et al. 2006. Common deletion polymorphisms in the human genome. Nat. Genet. 38: 86–92.
Ng, S.K. and Xue, H. 2006. Alu-associated enhancement of single nucleotide polymorphisms in the human genome. Gene 368: 110–116.
Peer, I., Chretien, Y.R., de Bakker, P.I.W., Barrett, J.C., Daly, M.J., and Altshuler, D.M. 2006a. Biases and reconciliation in estimates of linkage disequilibrium in the human genome. Am. J. Hum. Genet. 78: 588–603.
Peer, I., de Bakker, P.I.W., Maller, J., Yelensky, R., Altshuler, D., and Daly, M.J. 2006b. Evaluating and improving power in whole-genome association studies using fixed marker sets. Nat. Genet. 38: 663–667.
Sherry, S., Ward, M., Kholodov, M., Baker, J., Phan, L., Smigielski, E., and Sirotkin, K. 2001. dbSNP: The NCBI database of genetic variation. Nucleic Acids Res. 29: 308–311.
Smith, A.V., Thomas, D.J., Munro, H.M., and Abecasis, G.R. 2005. Sequence features in regions of weak and strong linkage disequilibrium. Genome Res. 15: 1519–1534.
Yu, A., Zhao, C., Fan, Y., Jang, W., Mungall, A.J., Deloukas, P., Olsen, A., Doggett, N.A., Ghebranious, N., Broman, K.W., et al. 2001. Comparison of human genetic and sequence-based physical maps. Nature 409: 951–953.
Zhang, Z., Schwartz, S., Wagner, L., and Miller, W. 2000. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7: 203–214.
Received September 26, 2006; accepted in revised form August 15, 2007.