Genome Res. 15:1594-1600, 2005
©2005 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/05 $5.00
Resources
Inference and analysis of haplotypes from combined genotyping studies deposited in dbSNP
1 Bioinformatics Program, University of California, San Diego, La Jolla, California 92093, USA
2 Department of Computer Science, University of California, San Diego, La Jolla, California 92093, USA
3 National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
4 International Computer Science Institute, Berkeley, California 94704, USA
In the attempt to understand human variation and the genetic basis of complex disease, a tremendous number of single nucleotide polymorphisms (SNPs) have been discovered and deposited into NCBI's dbSNP public database. More than 2.7 million SNPs in the database have genotype information. This data provides an invaluable resource for understanding the structure of human variation and the design of genetic association studies. The genotypes deposited to dbSNP are unphased, and thus, the haplotype information is unknown. We applied the phasing method HAP to obtain the haplotype information, block partitions, and tag SNPs for all publicly available genotype data and deposited this information into the dbSNP database. We also deposited the orthologous chimpanzee reference sequence for each predicted haplotype block computed using the UCSC BLASTZ alignments of human and chimpanzee. Using dbSNP, researchers can now easily perform analyses using multiple genotype data sets from the same genomic regions. Dense and sparse genotype data sets from the same region were combined to show that the number of common haplotypes is significantly underestimated in whole genome data sets, while the predicted haplotypes over the common SNPs are consistent between studies. To validate the accuracy of the predictions, we benchmarked HAP's running time and phasing accuracy against PHASE. Although HAP is slightly less accurate than PHASE, HAP is over 1000 times faster than PHASE, making it suitable for application to the entire set of genotypes in dbSNP.
Many risk factors for human disease are accounted for by variation in DNA sequence (Carlson et al. 2004
Alleles of SNPs that are physically located in close proximity to each other on a chromosome are often correlated (i.e., in "linkage disequilibrium") with each other. Thus, within most short regions, there is limited genetic variability, and only a small number of allele sequences (haplotypes) exist in a population. In a typical region or "block of limited diversity," three or four common haplotypes often account for at least 80% of the sequence variation in a population (Daly et al. 2001
Obtaining the haplotypes and partitioning the region into blocks of limited diversity are the first steps for many types of analysis of human variation. However, since humans are diploid, haplotype (or phase) information is not immediately available. Therefore, the construction of haplotypes from the diploid genotype information (i.e., phasing the genotypes) requires statistical inference or the financially prohibitive collection of extended pedigrees. Consider, for example, two SNPs lying on the same chromosome, both with alleles A and G. If both SNPs are observed as heterozygous, it is unclear whether one chromosome contains allele A at both loci and the other chromosome contains allele G at both loci, or whether one chromosome contains allele A at the first locus and allele G at the second locus and the other chromosome contains alleles G and A, respectively (Fig. 1). In order to overcome this problem, many computer programs have been designed to estimate and assign phase from diploid genotype data (Stephens et al. 2001
Since many of the data sets were originally mapped to different human genome builds, reconciling the original data sets and mapping them to a common genome build is a very time-consuming task. One of the main contributions of this study is the organization of the data sets in a way that corrects for errors in the strand and physical location annotations of the SNPs submitted to dbSNP. Through dbSNP, researchers can easily access all public genotype and haplotype data in their regions of interest. For example, researchers interested in the ABO gene can easily obtain haplotype and genotype data from data sets including the HAPMAP, Perlegen, and SeattleSNPs. By comparing multiple data sets, we perform a preliminary analysis to estimate the significance of the effect of SNP density on the inferred haplotype and block structure in a short region.
By combining the high-density SeattleSNPs data with the Perlegen data for the same individuals, we show that the number of haplotypes in the blocks defined by the Perlegen data set is underestimated by a factor of 3.6. These differences illustrate the advantage of examining multiple data sets when inferring human variation structure.
We also infer the chimpanzee reference sequence corresponding to each human haplotype block by mapping all of the SNPs typed to the UCSC BLASTZ alignment of the human and chimpanzee genomes. We use this data to compute how often the reference sequence matches a common haplotype in the Perlegen whole-genome data set. These sequences are also available for download from dbSNP. The haplotype and genotype data in dbSNP is a valuable resource for researchers planning to perform genetic association studies. Using the multiple data sets, the researchers can obtain a clearer picture of the haplotype structure and make more informed choices on which SNPs to genotype in a planned association study. The haplotypes, block partitions, and tag SNPs discussed in this study have been deposited into dbSNP (accession nos. phs3.1, vs:3:4136.1-vs:3:835194.1, sh:3:142355.1-sh:3:5247813.1) and can be accessed at http://www.ncbi.nlm.nih.gov/projects/SNP/.
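The phase ambiguity introduced above (two heterozygous SNPs, each with alleles A and G) can be made concrete with a small enumeration. The following is a minimal sketch, not part of HAP or any other phasing program; the function name `possible_phasings` and the representation of a genotype as a list of two-letter strings are assumptions made for illustration.

```python
from itertools import product


def possible_phasings(genotype):
    """Enumerate the haplotype pairs consistent with an unphased genotype.

    genotype: list of 2-character strings, one per SNP, e.g. ["AG", "AG"].
    Returns a sorted list of (haplotype_1, haplotype_2) tuples. The two
    chromosomes are unlabeled, so each unordered pair is reported once.
    """
    het_sites = [i for i, g in enumerate(genotype) if g[0] != g[1]]
    phasings = set()
    # The orientation of the first heterozygous site is fixed; flipping it
    # would only swap the two haplotypes and not give a new phasing.
    for flips in product([False, True], repeat=max(len(het_sites) - 1, 0)):
        orient = dict(zip(het_sites, (False,) + flips))
        h1 = "".join(g[1] if orient.get(i, False) else g[0]
                     for i, g in enumerate(genotype))
        h2 = "".join(g[0] if orient.get(i, False) else g[1]
                     for i, g in enumerate(genotype))
        phasings.add((h1, h2))
    return sorted(phasings)


# Two heterozygous SNPs: either AA/GG or AG/GA, as in the example in the text.
print(possible_phasings(["AG", "AG"]))
```

A genotype with k heterozygous sites has 2^(k-1) candidate phasings, which is why longer regions require statistical inference rather than enumeration alone.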
Data description
The human portion of the dbSNP database contains 286,757,371 total genotypes from 3285 individuals over 2.7 million SNPs partitioned into 417 data sets. A total of 835 of the individuals have genotypes from two or more data sets. The CEPH families, for example, were used in several different genotyping studies. Two whole-genome data sets compose 94.2% of the genotypes: the HAPMAP data set, which contains 159,862,776 genotypes taken from four populations consisting of a total of 270 individuals over 954,302 SNPs, and the Perlegen data sets, which consist of 110,385,051 genotypes taken from three populations consisting of a total of 71 individuals over 1,576,578 SNPs. The remaining data sets contribute an additional 16,509,544 genotypes. dbSNP contains a significant number of genotypes derived from sequenced data, including the SeattleSNPs (PGA/UW) data and the Environmental Genome Project (EGP) sequenced genes. The SeattleSNPs data set consists of 573,194 genotypes of 48 individuals taken from two populations, in which 15,981 SNPs were genotyped in a total of 177 sequenced genes. The Environmental Genome Project (EGP) sequenced genes contain 3,184,170 genotypes over 37,737 SNPs in a total of 304 sequenced genes in 90 individuals. The 48 individuals in SeattleSNPs are the same individuals as the ones genotyped for the Perlegen data. Some data sets contain a much larger number of individuals, such as the SNP Consortium (TSC) Celera CEPH data set containing 691 individuals and a data set from Perlegen containing 655 individuals from Mexico City. Other data sets contain many populations, such as the TSC data set containing 17 populations. Table 1 summarizes the contents of the largest 10 data sets contained in dbSNP.
Since many of the original data sets were released at different times, the data sets were mapped to different human genome builds, and the genome positions listed for the SNPs are not necessarily compatible between different data sets. In dbSNP, each genotype is mapped to the human genome, consistent with the latest available build providing a common mapping of SNPs across data sets. Each genotype data set in dbSNP contains references to the dbSNP identifier for each genotyped SNP. Any strand or mapping errors corrected for a SNP are propagated to all genotype data sets containing that SNP. Since many of the data sets contain information on the same SNP for the same individual, we can measure the amount of discrepancy in the genotype calls between the data sets. In particular, 996,553 of the recorded SNPs contain information from two individuals or more, corresponding to a total of 19,719,200 specific SNPs in individuals that have information from at least two data sets. We consider the set of SNPs in individuals with information from two or more data sets where at least two of the genotype calls are not missing. Within this set, 33,076 SNPs have at least one individual with different genotype calls from different data sets. A total of 216,625 (1.1%) specific SNPs in individuals contain differing genotype calls.
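The discrepancy count described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the procedure used to produce the reported figures: the function name `discordant_calls`, the nested-dictionary input layout, and the example identifiers are all hypothetical, and the sketch assumes that strand and mapping differences have already been reconciled.

```python
from collections import defaultdict


def discordant_calls(calls_by_dataset):
    """Count (SNP, individual) pairs whose genotype calls disagree.

    calls_by_dataset: dict mapping a data set name to a dict that maps
    (snp_id, individual_id) to an unphased genotype string such as "AG",
    or to None when the call is missing.
    Returns (n_compared, n_discordant), where n_compared counts the pairs
    with at least two non-missing calls across data sets.
    """
    observed = defaultdict(list)
    for calls in calls_by_dataset.values():
        for key, genotype in calls.items():
            if genotype is not None:
                # Allele order carries no meaning in an unphased genotype,
                # so "GA" and "AG" are treated as the same call.
                observed[key].append("".join(sorted(genotype)))
    compared = [v for v in observed.values() if len(v) >= 2]
    discordant = sum(1 for v in compared if len(set(v)) > 1)
    return len(compared), discordant


# Hypothetical calls for one individual in two data sets.
example = {
    "set1": {("rs1", "NA1"): "AG", ("rs2", "NA1"): "CC", ("rs3", "NA1"): None},
    "set2": {("rs1", "NA1"): "GA", ("rs2", "NA1"): "CT", ("rs3", "NA1"): "TT"},
}
print(discordant_calls(example))
```

In this toy input, only the second SNP counts as discordant; the third is excluded because one of its two calls is missing.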
We applied HAP to all of the genotypes in dbSNP by phasing each data set separately. Whenever available in dbSNP, we used the mother-father-child pedigree to increase the accuracy of the phasing. The haplotypes were partitioned into blocks of limited diversity so that five haplotypes covered at least 80% of the total number of haplotypes. A set of tag SNPs was chosen to minimize the number of SNPs needed to distinguish between the common haplotypes of each block (Zhang et al. 2002
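The block criterion and the tag SNP objective described above can be illustrated with a short sketch. It is not the dynamic programming method cited in the text: the exhaustive search below is a swapped-in technique that is practical only for small blocks, and the names `is_limited_diversity` and `minimal_tag_snps` are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def is_limited_diversity(haplotypes, max_common=5, min_coverage=0.80):
    """Check whether the most frequent haplotypes cover enough of the sample.

    haplotypes: list of equal-length strings, one per sampled chromosome.
    Returns True when the `max_common` most frequent haplotypes account for
    at least `min_coverage` of all haplotypes in the block.
    """
    counts = Counter(haplotypes)
    covered = sum(n for _, n in counts.most_common(max_common))
    return covered / len(haplotypes) >= min_coverage


def minimal_tag_snps(common_haplotypes):
    """Find a smallest set of SNP positions that distinguishes the haplotypes.

    Brute force over subsets in order of increasing size (illustration only).
    Returns a tuple of 0-based positions.
    """
    distinct = set(common_haplotypes)
    n_snps = len(common_haplotypes[0])
    for size in range(n_snps + 1):
        for positions in combinations(range(n_snps), size):
            projected = {tuple(h[p] for p in positions) for h in distinct}
            if len(projected) == len(distinct):
                return positions


block = ["0000"] * 4 + ["0011"] * 3 + ["1100"] * 2 + ["1111"]
print(is_limited_diversity(block))            # four haplotypes cover everything
print(minimal_tag_snps(sorted(set(block))))   # one SNP from each correlated pair
```

In the toy block, SNPs 0 and 1 are perfectly correlated, as are SNPs 2 and 3, so two tag SNPs are enough to distinguish the four haplotypes.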
Within dbSNP, the complete set of genotypes mapped to the correct positions in the genome is available for download, along with the haplotypes, block partitions, and tag SNPs resulting from this study. The data is available in multiple formats including XML, allowing the data in dbSNP to be easily integrated into other databases.
Haplotype coverage
The coverage of the HAPMAP and Perlegen data sets, individually and combined, is shown in Table 3. As can be seen from the table, the HAPMAP and Perlegen data sets provide excellent coverage for minimum gap lengths of 10 kb and more, but they give poor coverage for minimum gap lengths of 1 and 5 kb; for a minimum gap length of 5 kb, they cover only about 50% of the genome. When the two data sets are combined with the remaining data sets of dbSNP, the coverage significantly increases for the minimum gap lengths of 1 or 5 kb. In addition, the remaining data in dbSNP provides higher coverage of the genome at higher depths, since the Perlegen data set has 71 individuals and the HAPMAP data has 270 individuals. The coverage of the haplotypes in dbSNP is summarized in Table 4.
Haplotype structure and genotype density
We observe that the number of blocks and tag SNPs in the high-density sequence data is much higher than in the corresponding HAPMAP or Perlegen data sets. This shows that there is a considerable amount of information loss when the data is sampled every 5 kb, such as in the HAPMAP data set. We examined 41 blocks in the Perlegen data set that overlapped with SNPs typed in the Seattle data set. Figure 2 shows an example of such a region. There are 91 common haplotypes over the Seattle individuals on these SNPs. We then added in the additional Seattle SNPs typed on the blocks and re-examined the haplotypes for each individual. From the 91 original common haplotypes, 369 haplotypes were found with 72 common ones. On average, 1.2 common haplotypes were created for every original common haplotype, and 30 of the original haplotypes were split into only rare haplotypes. One may hypothesize that this is due to the rare SNPs in the Seattle data. However, we performed the same analysis using only Seattle SNPs with a minor allele frequency of 10% or greater. The 91 original haplotypes were split into 330 haplotypes with 73 common ones. On average, each original common haplotype was split into 1.16 new common haplotypes, and 28 common haplotypes were split into only rare haplotypes when the Seattle SNPs were added. The haplotype blocks and common haplotypes found by examining only the Perlegen data are significantly different from those found over the same individuals in the Seattle data. This type of analysis allows us to measure how much common variation is missed in the whole-genome data sets and demonstrates the utility of the analysis of multiple genotype data sets.
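The splitting analysis above can be sketched as a projection of dense haplotypes onto the sparse SNP subset. This is an illustrative outline only: the text does not state the frequency threshold that defines a "common" haplotype, so `min_freq` is a hypothetical parameter, as is the function name `split_common_haplotypes`. As a side observation, the reported averages agree with dividing the new common haplotypes by the original common haplotypes that kept at least one common refinement (72/61 ≈ 1.18 and 73/63 ≈ 1.16), although the text does not say that this is how they were computed.

```python
from collections import Counter, defaultdict


def split_common_haplotypes(full_haplotypes, sparse_positions, min_freq=0.05):
    """Show how common sparse haplotypes split when dense SNPs are added.

    full_haplotypes: list of equal-length strings over the dense SNP set.
    sparse_positions: 0-based indices of the SNPs present in the sparse set.
    min_freq: assumed frequency threshold for calling a haplotype common.
    Returns a dict mapping each common sparse haplotype to the sorted list of
    common dense haplotypes that refine it. An empty list means the sparse
    haplotype was split into rare haplotypes only.
    """
    n = len(full_haplotypes)

    def project(h):
        return "".join(h[p] for p in sparse_positions)

    sparse_counts = Counter(project(h) for h in full_haplotypes)
    refinements = defaultdict(list)
    for h, count in Counter(full_haplotypes).items():
        if count / n >= min_freq:
            refinements[project(h)].append(h)
    return {
        s: sorted(refinements.get(s, []))
        for s, count in sparse_counts.items()
        if count / n >= min_freq
    }


dense = ["000"] * 3 + ["001"] * 3 + ["110", "111"] + ["100"] * 2
print(split_common_haplotypes(dense, sparse_positions=(0, 1), min_freq=0.2))
```

In the toy sample, the sparse haplotype "00" splits into two common dense haplotypes, "10" keeps a single one, and "11" splits into rare haplotypes only.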
Chimpanzee reference alleles
Haplotype accuracy benchmarks
HAP error estimation
We evaluated the benchmark on both HAP and the widely used phasing algorithm PHASE (Stephens et al. 2001
Our results show that PHASE and HAP give identical results in 98.6% of the genotypes and 95.0% of heterozygous SNPs. We measured the accuracy of the results using the switch error rate, which is the proportion of heterozygous positions for which the phase is erroneously inferred relative to the previous heterozygous position. PHASE and HAP have switch error rates of 2.38% and 3.70%, respectively. Relative to the total number of genotypes, these switch errors occur in only 0.52% and 0.81% of genotypes, respectively, which is comparable to the rate of missing SNPs in these regions (1.14%). We performed the same benchmark for the African (YRI) population in the HAPMAP data and observed overall error rates of 2.22% and 1.37% for HAP and PHASE, respectively. This increase in error rate in African populations relative to European populations is consistent with the benchmark performed by the HAPMAP analysis group (J. Marchini, D. Cutler, N. Patterson, M. Stephens, E. Eskin, E. Halperin, S. Lin, S. Qin, G. Abecassis, H. Munro, et al., in prep.). In contrast to the phasing accuracy, the running times of HAP and PHASE differ considerably. In Table 5 we provide the summary of the running times of HAP and PHASE on 10 randomly selected regions in chromosome 19 with different numbers of SNPs. From these experiments it is not clear how long it would take for PHASE to predict the haplotypes for the database, because the running time has a high variance and PHASE does not appear to scale linearly with the number of SNPs. As can be seen from Table 5, HAP is several orders of magnitude faster than PHASE in most cases. Extrapolating from these results, and assuming that PHASE is run on 100 SNPs at a time, sequentially on a single CPU, it would take PHASE at least 75,000 h to phase the whole dbSNP database.
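The switch error rate defined above can be computed for a single individual as in the following sketch. It assumes that the true haplotype pair is known, that both pairs are given as equal-length strings, and that the denominator is the number of heterozygous positions that have a preceding heterozygous position; the text does not spell out the denominator, and the function name `switch_error_rate` is hypothetical.

```python
def switch_error_rate(true_pair, inferred_pair):
    """Switch error rate of an inferred haplotype pair for one individual.

    true_pair, inferred_pair: tuples of two equal-length haplotype strings.
    A switch is counted at a heterozygous position when its phase relative
    to the previous heterozygous position differs from the truth.
    """
    t1, t2 = true_pair
    i1, _ = inferred_pair
    het = [k for k in range(len(t1)) if t1[k] != t2[k]]
    if len(het) < 2:
        return 0.0  # no pair of heterozygous sites, so no phase to get wrong
    # For each heterozygous site, does inferred haplotype 1 carry the allele
    # of true haplotype 1? A change in this flag between neighbors is a switch.
    same = [i1[k] == t1[k] for k in het]
    switches = sum(1 for a, b in zip(same, same[1:]) if a != b)
    return switches / (len(het) - 1)


# One switch between the second and third of four heterozygous sites.
print(switch_error_rate(("AAAA", "GGGG"), ("AAGG", "GGAA")))
```

Swapping the labels of the two inferred haplotypes does not change the result, since only the relative phase of neighboring heterozygous sites is compared.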
In the benchmark performed by the HAPMAP analysis group, HAP was able to phase unrelated individuals over 1000 times faster than PHASE (J. Marchini, D. Cutler, N. Patterson, M. Stephens, E. Eskin, E. Halperin, S. Lin, S. Qin, G. Abecassis, H. Munro, et al., in prep.).
Haplotype consistency analysis
We measure the robustness of the haplotype inference by comparing the haplotypes inferred over the same SNPs in the same individuals from different data sets. We considered regions where resequenced genes are available from the SeattleSNPs (Crawford et al. 2004
Understanding the structure of common variation is an important step that will give insights into designing effective strategies for genetic association analysis. Our analyses show that the use of a combination of the various data sets of dbSNP increases the coverage of the genome considerably for high-density markers. Furthermore, we show that when the density of the sampled SNPs increases, the block partition and the set of tag SNPs change considerably, providing evidence that multiple data sets can provide a more accurate picture of the structure of human variation in a region. These findings suggest that the design of genetic association studies in these regions can benefit from analysis of multiple data sets. However, several methodological challenges remain regarding how to most effectively use multiple data sets to understand the structure of human variation and design genetic association studies. dbSNP allows researchers to easily access multiple data sets for a genomic region and provides an invaluable resource for researchers both to address these methodological challenges and to design effective genetic association studies. The haplotype resource of dbSNP will provide immediate access to the haplotypes, block partitions, and tag SNPs for all of the publicly available data sets. In addition, as the amount of data in dbSNP grows, new haplotypes will be computed with every dbSNP build, which will provide haplotype information for newly deposited data shortly after it is deposited. dbSNP can be accessed at http://www.ncbi.nlm.nih.gov/projects/SNP/.
HAP phasing of genome-wide data
We used the HAP algorithm to phase the dbSNP data sets. HAP was run on a 30-CPU cluster consisting of 15 nodes, each with 2 GB of RAM and dual Intel Xeon 3.96 GHz processors. The HAP algorithm assumes that a perfect phylogeny tree can describe the ancestral history of the haplotypes. A perfect phylogeny tree is a genealogy tree with no recombinations and no recurrent mutations (see Fig. 3). HAP considers all phases that result in a set of haplotypes that are almost consistent with a perfect phylogeny. HAP efficiently enumerates all such phases and scores each one according to the likelihood of the solution under the assumption that the haplotypes were randomly picked from the population. HAP then chooses the phase with the highest score. In order to phase a long region, HAP applies the perfect phylogeny model in a sliding window to short overlapping regions. These overlapping predictions are then combined using a dynamic programming-based tiling algorithm that chooses the phase for the long region that is most consistent with the overlapping predictions of phase in the short regions. We considered all tiles of length 10-12 when constructing the haplotypes.
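The perfect phylogeny assumption can be checked on a set of already phased haplotypes with the standard four-gamete test: for biallelic sites, a set of haplotypes fits a tree with no recombination and no recurrent mutation exactly when no pair of sites shows all four allele combinations. The sketch below illustrates only this strict check. It is not the HAP algorithm, which works on unphased genotypes and tolerates haplotypes that are almost consistent with a perfect phylogeny; the function name `admits_perfect_phylogeny` is hypothetical.

```python
from itertools import combinations


def admits_perfect_phylogeny(haplotypes):
    """Four-gamete test on phased haplotypes over biallelic sites.

    haplotypes: list of equal-length strings, one character per SNP.
    Returns False as soon as some pair of sites carries all four allele
    combinations, which cannot arise on a tree without recombination or
    recurrent mutation.
    """
    n_sites = len(haplotypes[0])
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            return False
    return True


print(admits_perfect_phylogeny(["00", "01", "10"]))        # three gametes: compatible
print(admits_perfect_phylogeny(["00", "01", "10", "11"]))  # four gametes: incompatible
```

Applied to the two-SNP ambiguity from the introduction, a phasing that would introduce a fourth gamete into the sample is the one that such a model disfavors.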
HAP is capable of phasing data sets of up to 40,000 SNPs. The computational bottleneck is the size of the data structure necessary to perform the tiling. Since we only phased one chromosome at a time, the vast majority of the data in dbSNP was smaller than this limit. For some of the chromosomes in the HAPMAP and Perlegen data, we had to split the data set into two to four regions in order to perform phasing. We split the data sets only at gaps of at least 50 kb between adjacent SNPs. Similarly, when computing block partitions, we only considered blocks that do not span a gap in SNPs >50 kb.
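One way to split a chromosome under the two constraints above (at most 40,000 SNPs per region, cuts only at gaps of at least 50 kb) is a greedy scan. The sketch below is an assumption about how such a split could be done, not a description of the procedure actually used; the function name `split_at_gaps` and its parameters are hypothetical.

```python
def split_at_gaps(positions, max_snps=40_000, min_gap=50_000):
    """Split sorted SNP positions into regions of at most `max_snps` SNPs.

    positions: SNP coordinates in base pairs, sorted in increasing order.
    Cuts are placed only between adjacent SNPs that are at least `min_gap`
    apart. A region is closed at the most recent such gap once it would
    otherwise exceed the size limit. Raises ValueError if no eligible gap
    exists where a cut is needed.
    """
    regions = []
    start = 0         # index of the first SNP in the current region
    last_gap = None   # index of the SNP just after the latest eligible gap
    for k in range(1, len(positions)):
        if positions[k] - positions[k - 1] >= min_gap:
            last_gap = k
        if k - start + 1 > max_snps:
            if last_gap is None or last_gap <= start:
                raise ValueError("no gap of the required size within the limit")
            regions.append(positions[start:last_gap])
            start = last_gap
    regions.append(positions[start:])
    return regions


# Toy example with a limit of four SNPs per region and one 58-kb gap.
toy = [0, 1_000, 2_000, 60_000, 61_000, 62_000]
print(split_at_gaps(toy, max_snps=4))
```

Cutting at the latest eligible gap keeps each region as large as the limit allows; other choices, such as balancing region sizes, would satisfy the same two constraints.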
Partition into blocks of limited diversity
Extension of HAP to trios
E.E. is supported by the California Institute for Telecommunications and Information Technology, Calit2. Computational resources for performing the phasing and block partition were provided by Calit2 and the National Biomedical Computational Resource, NBCR (Grant no. P41 RR08605 NCRR, NIH). This research was supported in part by the Intramural Research Program of the NIH, National Library of Medicine.
[The sequence data from this study have been submitted to dbSNP under accession nos. phs3.1, vs:3:4136.1-vs:3:835194.1, sh:3:142355.1-sh:3:5247813.1.] Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.4297805. Freely available online through the Genome Research Immediate Open Access option.
5 Corresponding author.
Cargill, M., Altshuler, D., Ireland, J., Sklar, P., Ardlie, K., Patil, N., Shaw, N., Lane, C.R., Lim, E.P., Kalyanaraman, N., et al. 1999. Characterization of single-nucleotide polymorphisms in coding regions of human genes. Nat. Genet. 22: 231-238.
Carlson, C.S., Eberle, M.A., Kruglyak, L., and Nickerson, D.A. 2004. Mapping complex disease loci in whole-genome association studies. Nature 429: 446-452.
Collins, F.S., Brooks, L.D., and Chakravarti, A. 1998. A DNA polymorphism discovery resource for research on human genetic variation. Genome Res. 8: 1229-1231.
Crawford, D.C., Carlson, C.S., Rieder, M.J., Carrington, D.P., Yi, Q., Smith, J.D., Eberle, M.A., Kruglyak, L., and Nickerson, D.A. 2004. Haplotype diversity across 100 candidate genes for inflammation, lipid metabolism, and blood pressure regulation in two populations. Am. J. Hum. Genet. 74: 610-622.
Daly, M.J., Rioux, J.D., Schaffner, S.F., Hudson, T.J., and Lander, E.S. 2001. High-resolution haplotype structure in the human genome. Nat. Genet. 29: 229-232.
Eskin, E., Halperin, E., and Karp, R.M. 2003. Efficient reconstruction of haplotype structure via perfect phylogeny. J. Bioinform. Comput. Biol. 1: 1-20.
Freudenberg-Hua, Y., Freudenberg, J., Kluck, N., Cichon, S., Propping, P., and Nothen, M.M. 2003. Single nucleotide variation analysis in 65 candidate genes for CNS disorders in a representative sample of the European population. Genome Res. 13: 2271-2276.
Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M., et al. 2002. The structure of haplotype blocks in the human genome. Science 296: 2225-2229.
Halperin, E. and Eskin, E. 2004. Haplotype reconstruction from genotype data using imperfect phylogeny. Bioinformatics 20: 1842-1849.
Halushka, M.K., Fan, J.B., Bentley, K., Hsie, L., Shen, N., Weder, A., Cooper, R., Lipshutz, R., and Chakravarti, A. 1999. Patterns of single-nucleotide polymorphisms in candidate genes for blood-pressure homeostasis. Nat. Genet. 22: 239-247.
Hinds, D.A., Stuve, L.L., Nilsen, G.B., Halperin, E., Eskin, E., Ballinger, D.G., Frazer, K.A., and Cox, D.R. 2005. Whole genome patterns of common DNA variation in diverse human populations. Science 307: 1072-1079.
Hudson, R.R. 1991. Gene genealogies and the coalescent process. Oxford Surveys in Evol. Biol. 7: 1-44.
The International HapMap Consortium. 2003. The International HapMap Project. Nature 426: 789-796.
The International SNP Map Working Group. 2001. A map of human genome sequence variation containing 1.4 million SNPs. Nature 409: 928-933.
Kennedy, G.C., Matsuzaki, H., Dong, S., Liu, W.M., Huang, J., Liu, G., Su, X., Cao, M., Chen, W., Zhang, J., et al. 2003. Large-scale genotyping of complex DNA. Nat. Biotechnol. 10: 1233-1237.
Kruglyak, L. and Nickerson, D.A. 2001. Variation is the spice of life. Nat. Genet. 27: 234-236.
Livingston, R.J., von Niederhausern, A., Jegga, A.G., Crawford, D.C., Carlson, C.S., Rieder, M.J., Gowrisankar, S., Aronow, B.J., and Nickerson, D.A. 2004. Patterns of sequence variation across 213 environmental response genes. Genome Res. 14: 1821-1831.
Niu, T., Qin, S., Xu, X., and Liu, J. 2002. Bayesian haplotype inference for multiple linked single nucleotide polymorphisms. Am. J. Hum. Genet. 70: 157-169.
Patil, N., Berno, A.J., Hinds, D.A., Barrett, W.A., Doshi, J.M., Hacker, C.R., Kautzer, C.R., Lee, D.H., Marjoribanks, C., McDonough, D.P., et al. 2001. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294: 1719-1723.
Stephens, M., Smith, N., and Donnelly, P. 2001. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68: 978-989.
Wang, D.G., Fan, J.B., Siao, C.J., Berno, A., Young, P., Sapolsky, R., Ghandour, G., Perkins, N., Winchester, E., Spencer, J., et al. 1998. Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science 280: 1077-1082.
Zhang, K., Deng, M., Chen, T., Waterman, M.S., and Sun, F. 2002. A dynamic programming algorithm for haplotype block partitioning. Proc. Nat. Acad. Sci. 99: 7335-7339.
http://www.ncbi.nlm.nih.gov/projects/SNP; dbSNP.
http://innateimmunity.net/; Innate Immunity PGA. NHLBI program in genomic applications.
Received June 16, 2005; accepted in revised format August 19, 2005.