Genome Res. 15:1511-1518, 2005
©2005 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/05 $5.00
Letter
Genome-wide definitive haplotypes determined using a collection of complete hydatidiform moles
1 Division of Genome Analysis, Research Center for Genetic Information, Medical Institute of Bioregulation, Kyushu University, Fukuoka, Fukuoka 812-8582, Japan
2 Perlegen Sciences Inc., Mountain View, California 94043, USA
3 Division of Molecular and Cell Therapeutics, Medical Institute of Bioregulation, Kyushu University, Beppu, Oita 874-0838, Japan
We present genome-wide definitive haplotypes, determined using a collection of 74 Japanese complete hydatidiform moles, each carrying a genome derived from a single sperm. The haplotypes incorporate 281,439 common SNPs, genotyped with a high throughput array-based oligonucleotide hybridization technique. Comparison of haplotypes inferred from pseudoindividuals (constructed from randomized mole pairs) with those of moles showed some switch errors in resolution of phases by the computational inference method. The effects of these errors on local haplotype structure and selection of tag SNPs are discussed. We also show that definitive haplotypes of moles may be useful for elucidation of long-range haplotype structure, and should be more effective for detecting extended haplotype homozygosity indicative of positive selection.
Recent studies have shown that patterns of linkage disequilibrium (LD) vary across the human genome, with regions of high LD interspersed with regions of low LD (Patil et al. 2001
Several computational methods for large-scale haplotype block partitioning have been developed (Patil et al. 2001
The complete hydatidiform mole (CHM) is a benign tumor, mostly with a karyotype of 46, XX, formed by the fertilization of an empty ovum by a single haploid sperm, which later duplicates its chromosomes to give a diploid (duplicated haploid) cell mass. CHMs offer a unique opportunity for determining long-range definitive haplotypes at a genome-wide level (Taillon-Miller et al. 1997
We genotyped 74 CHM samples that were collected throughout Japan using 281,439 common SNPs to obtain genome-wide definitive haplotypes. Using these data, whole-genome haplotype block maps were constructed. We also used the haplotype data to create diploid "pseudoindividuals" from pairs of randomized moles, to determine the frequency of phasing errors and to assess the effects of these errors on haplotype block estimation. In addition, we examined extended shared haplotypes using the CHM data and compared the results with those constructed from HapMap project genotype data. We found that the latter may fail to capture some extended haplotypes, some of which are expected to be indicative of positive selection.
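The pseudoindividual idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names, the 0/1 allele coding, and the pairing scheme (shuffle, then pair neighbors, dropping an unpaired haplotype) are assumptions made for the example. The switch-error count follows the usual definition, that is, the number of times the inferred phase flips relative to the true phase between consecutive heterozygous sites.

```python
import random
from typing import List, Sequence, Tuple

Haplotype = Sequence[int]  # alleles coded 0/1 (an assumption of this sketch)


def make_pseudoindividuals(
    haplotypes: Sequence[Haplotype], rng: random.Random
) -> List[Tuple[Haplotype, Haplotype]]:
    """Shuffle the haplotypes and pair neighbors; an unpaired one is dropped."""
    order = list(haplotypes)
    rng.shuffle(order)
    return [(order[i], order[i + 1]) for i in range(0, len(order) - 1, 2)]


def unphased_genotype(pair: Tuple[Haplotype, Haplotype]) -> List[int]:
    """Collapse a haplotype pair to allele counts (0, 1, or 2) per SNP."""
    h1, h2 = pair
    return [a + b for a, b in zip(h1, h2)]


def count_switch_errors(
    true_pair: Tuple[Haplotype, Haplotype],
    inferred_pair: Tuple[Haplotype, Haplotype],
) -> Tuple[int, int]:
    """Return (switch errors, switch opportunities) for one pseudoindividual."""
    t1, t2 = true_pair
    i1, _ = inferred_pair
    het_sites = [k for k in range(len(t1)) if t1[k] != t2[k]]
    # At each heterozygous site, record which true haplotype the first
    # inferred haplotype agrees with (0 = first, 1 = second).
    origin = [0 if i1[k] == t1[k] else 1 for k in het_sites]
    switches = sum(1 for x, y in zip(origin, origin[1:]) if x != y)
    return switches, max(len(het_sites) - 1, 0)
```

In use, the unphased genotypes would be passed to a phasing program, and the inferred haplotype pairs compared with the known mole haplotypes via `count_switch_errors`.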
SNPs genotyped in this study
The CHM samples were genotyped using two sets of high-density oligonucleotide arrays. The first set contained 266,722 tag SNPs chosen to cover LD "bins" observed in a population of European ancestry (Hinds et al. 2005
Of the 75 CHMs, one was not included in most of the analysis, since it had a low call rate of 71.6%. For the remaining 74 CHMs, the call rates were >92%, as summarized in Supplemental data S1.
We evaluated the quality of the genotype data using an independent platform, the Affymetrix 100K array, which contained 18,782 SNPs in common with the SNPs described above. We genotyped 10 CHMs using this array, and the concordance rate for the 178,304 genotypes called in both sets was 99.91%, far better than the accuracy required for the analysis of multi-marker haplotypes (Gabriel et al. 2002
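The cross-platform check described above amounts to counting agreeing genotypes among those called on both platforms. The following is a minimal sketch; the function name and the use of `None` for a missing call are assumptions of this example, not details taken from the study.

```python
from typing import Hashable, Optional, Sequence, Tuple


def concordance_rate(
    calls_a: Sequence[Optional[Hashable]],
    calls_b: Sequence[Optional[Hashable]],
) -> Tuple[Optional[float], int]:
    """Return (concordance, number compared) over genotypes called in both sets.

    None marks a no-call; such positions are excluded from the comparison.
    """
    compared = 0
    agree = 0
    for a, b in zip(calls_a, calls_b):
        if a is None or b is None:
            continue
        compared += 1
        if a == b:
            agree += 1
    rate = agree / compared if compared else None
    return rate, compared
```

At the reported concordance of 99.91% over 178,304 jointly called genotypes, roughly 160 calls would be discordant (178,304 × 0.0009 ≈ 160).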
The median physical distance between genotyped SNPs is 5.5 kb and the average distance between SNPs is 10.0 kb, excluding centromeric gaps. More than 90% of the genome is within inter-SNP intervals of
Allele frequencies and linkage disequilibrium
We measured linkage disequilibrium between adjacent SNPs using the r² statistic. The correlation between Han Chinese and CHM r² values was 0.89 (Fig. 2C). For SNPs with an estimated r² > 0.8 in the Han Chinese data, 76% had r² > 0.8 and 96% had r² > 0.5 in the CHM data (Supplemental Fig. S1). Thus, SNPs selected based on the diploid Han Chinese samples generally behave similarly in the CHM samples.
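Because each mole carries a single haplotype, the pairwise LD statistic can be computed directly from observed haplotype counts, with no phase estimation. The sketch below uses the standard definition r² = D² / [pA(1 − pA) pB(1 − pB)], where D = pAB − pA·pB. The 0/1 allele coding, the use of `None` for missing calls, and the function name are assumptions of this example.

```python
from typing import Optional, Sequence


def r_squared(
    snp_a: Sequence[Optional[int]], snp_b: Sequence[Optional[int]]
) -> Optional[float]:
    """r^2 between two biallelic SNPs from haploid (phased) allele calls.

    Each sequence holds one allele (0 or 1) per haplotype; None is a missing
    call. Haplotypes missing at either SNP are skipped. Returns None when
    either SNP is monomorphic among the usable haplotypes.
    """
    pairs = [(a, b) for a, b in zip(snp_a, snp_b) if a is not None and b is not None]
    n = len(pairs)
    if n == 0:
        return None
    p_a = sum(a for a, _ in pairs) / n          # frequency of allele 1 at SNP A
    p_b = sum(b for _, b in pairs) / n          # frequency of allele 1 at SNP B
    p_ab = sum(1 for a, b in pairs if a == 1 and b == 1) / n  # 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    d = p_ab - p_a * p_b
    return d * d / denom
```

Applying `r_squared` to each pair of neighboring SNP columns in a haplotype matrix gives the adjacent-SNP values compared here.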
Definitive haplotypes, block structure, and tag SNPs
A total of 44,939 blocks was defined genome-wide. Of these, 6444 blocks (14%) contained a single SNP, but these isolated SNPs constitute only a small fraction (2%) of all SNPs. The average block size was 51.1 kb (6.3 SNPs per block), which was approximately twice as large as previously reported for Japanese and/or Chinese populations (Hinds et al. 2005 5%) haplotypes per block was 4.1, similar to values observed for other populations (Gabriel et al. 2002
Comparison of block structures of CHMs and HapMap Japanese sets
The haplotype block structures of the present study and of the HapMap Japanese in Tokyo, Japan (JPT) samples represent the genetic diversity of the same underlying Japanese population, although the material for the two studies was collected independently. It is therefore of interest to see how similar the results of the two studies are. Haplotype blocks for the HapMap JPT samples were constructed by HapBlock using the phased (release 16) HapMap genotype data. Since these were mapped on Build 34 of the reference human sequence, we remapped these blocks onto Build 35 for comparison with our CHM-based structures. During this process, a portion of the phased HapMap SNPs (14,966 SNPs) failed to map, or their order relative to surrounding SNPs changed. Taking this into account, we treated the 10,076 blocks (including blocks with a single SNP) that contained those SNPs as possibly problematic. The remaining 50,717 blocks were assumed to be correctly remapped on Build 35. We selected 256 long regions (>1 Mb) without problematic blocks and compared the blocks with our CHM-derived partition results (Supplemental data S4). In these regions, 92,296 SNPs were assigned to 8174 blocks (average block size: 38.4 kb, 11 SNPs) in the HapMap JPT data, and 37,477 SNPs were assigned to 6287 blocks in the CHM data. The numbers of tag SNPs were 12,704 (JPT) and 10,122 (CHM). It is not easy to compare the block structures of the CHM and JPT sets because of the differences in the numbers of chromosomes (74 CHM vs. 90 JPT chromosomes) and in SNP density (2.8-fold more SNPs in the JPT data than in the CHM data). These differences are known to seriously affect block partition (Ke et al. 2004
Assessment of errors in phasing and subsequent block partitioning
We first selected 134 non-overlapping genomic regions (125 autosomal regions and 9 X-chromosome regions [Fig. 4A]), each containing 50 SNPs, to test the accuracy of phase determination by PHASE. The number of SNPs per region was decided based on our computing capacity. Since missing genotypes leave uncertainty in the phasing and subsequent evaluations, we selected subsets of CHMs for which all 50 SNPs were called for each region. As a result, the number of CHMs used in the analysis varied from 56 to 62 (2831 pseudoindividuals), and all 50 SNPs were polymorphic in 122 regions. The remaining 12 regions contained between one and three SNPs that were monomorphic across the selected CHMs. The total size of the analyzed regions was 67 Mb, or
We made 100 sets of pseudoindividuals for each region, as described in the Methods section. Phasing for each set was done using PHASE v2.1.1 (Stephens and Scheet 2005
The 134 haplotype-inferred regions were partitioned into blocks using HapBlock v30 (Fig. 4B). A total of 1048 blocks was defined for true sets (CHM sets), yielding an average block size of 54 kb (6.4 SNPs per block). The average number of tag SNPs per region was 12.8, and the average number of common (
Extended shared haplotype analysis
We were interested in the question of whether genotyping CHM samples offers additional advantages compared with genotyping diploid samples and computationally inferring phase. Therefore, we compared extended shared haplotypes (ESHs) obtained from CHM data and from phased HapMap data to evaluate a possible advantage of CHMs in identifying extended intervals of haplotype homozygosity.
The identification of ESH is sensitive to the choice of SNPs assayed, especially their density (see Supplemental data S6). Therefore, we identified 93,531 SNPs that were genotyped and polymorphic both in the CHM data and in the HapMap JPT data. This shared subset of SNPs represented
Table 1 summarizes the numbers of ESHs and their total coverage for the CHM samples, and for the HapMap JPT and CEU samples, across the shared subset of SNPs. The CHM data contained more ESHs, covering more of the genome, than the two HapMap samples, presumably because inferred haplotypes contained a low frequency of phasing errors, which broke some extended haplotypes. The JPT had more 1-Mb haplotypes than the CEU, but fewer 2-Mb haplotypes. This might reflect generally higher quality phasing in the CEU data, which is based on trios; hence, correct phase is confirmed at most SNPs, and the only ambiguous cases are positions that are heterozygous in all three trio members.
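To make the effect of phasing errors on extended haplotypes concrete, the sketch below scans a pair of haplotypes for maximal runs of identical alleles that span at least a given physical length. It is an illustration only: the pairwise definition, the function name, the default 1-Mb threshold, and the choice to let a missing call (`None`) end a run are assumptions of this example and may differ from the criteria used in the study.

```python
from typing import List, Optional, Sequence, Tuple


def shared_segments(
    hap_a: Sequence[Optional[int]],
    hap_b: Sequence[Optional[int]],
    positions: Sequence[int],
    min_length_bp: int = 1_000_000,
) -> List[Tuple[int, int]]:
    """Return (start_bp, end_bp) of each maximal matching run >= min_length_bp.

    hap_a, hap_b, and positions must have equal length, with positions sorted
    in ascending order. A missing call (None) is treated as a mismatch.
    """
    segments: List[Tuple[int, int]] = []
    run_start: Optional[int] = None
    n = len(positions)
    # Iterate one step past the end so a run reaching the last SNP is closed.
    for i in range(n + 1):
        match = i < n and hap_a[i] is not None and hap_a[i] == hap_b[i]
        if match:
            if run_start is None:
                run_start = i
        elif run_start is not None:
            start_bp, end_bp = positions[run_start], positions[i - 1]
            if end_bp - start_bp >= min_length_bp:
                segments.append((start_bp, end_bp))
            run_start = None
    return segments
```

Under this definition, a single wrongly phased allele in the middle of a long shared run splits it into two shorter runs, either of which may fall below the length threshold. That is consistent with the explanation given above for the lower ESH counts in inferred haplotypes.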
Figure 7 shows an example of a chromosome-wide view of ESH density. Many of the peaks of ESH density are common among different samples. Also evident is the fact that many of the density peaks are observed regardless of the number of SNPs used to detect the homozygosity, demonstrating that the sparse shared SNP subset is sufficient for detection of ESH.
Bersaglieri et al. (2004
In almost all large-scale genome diversity projects, genotypes are determined using diploid samples, and haplotypes are inferred computationally, either using family data or by population genetics-based inference. However, these inference methods do not always produce accurate and definitive haplotype data. Even if family data are available, haplotypes remain ambiguous for markers that are heterozygous in all family members. CHMs are tissues of gestational trophoblastic disease resulting from rare events of abnormal gametogenesis and/or fertilization. Although the exact etiology of CHM is unknown, most of these tissues arise by the fertilization of an anucleate egg by a single sperm. Phenotype-genotype comparison between CHMs indicates that the maternal genomic condition plays a role in the pathophysiology of molar pregnancies, and that paternal genomic contexts, i.e., the genomes of CHMs, do not seem to be involved. Thus, a collection of CHM genomes can be regarded as representing generalized genomes of the population.
Most complete hydatidiform mole samples are homozygous diploids, and genotyping of multiple loci on one chromosome yields a definitive haplotype. Chromosome-wide haplotype analysis using CHMs was pioneered by Kwok's group (Taillon-Miller et al. 1997
The incidence of hydatidiform moles is known to be moderately high, representing 0.5 to one per 1000 pregnancies in Caucasians and one to two per 1000 pregnancies in eastern Asians (Steigrad 2003
We have shown that the allele frequencies of SNPs are highly correlated between Japanese and Chinese samples. Measures of linkage disequilibrium, i.e., r² values, between neighboring SNPs were also similar between the two populations; these facts suggest a close relationship between the two populations. Thus, many of the conclusions drawn here for the Japanese should also apply to the Chinese population.
To estimate the error rate of the phasing process, we simulated diploid genomes using definitive haploid data from 134 genomic regions, where each region contained 50 SNPs with various densities. Of these, 118 regions contained two to 43 genes (or fragments), and 16 regions were nongenic. So, the 134 regions seem to reflect a variety of genomic contexts. Our results are in good agreement with previous evaluations of phasing accuracy, in which several genic regions or synthetic genomes constructed based on a coalescence model were used for diploid reconstruction (Stephens and Donnelly 2003
There is interest in the use of long-range haplotypes to make inferences about natural selection (Sabeti et al. 2002
Recent studies for recombination hot spots as local deficits of LD showed wide divergence between human and chimpanzee genomes (Ptak et al. 2005
It has been reported that rare variants can considerably contribute to common phenotypes of complex diseases (Pritchard 2001
DNA samples
CHM samples were collected on a nationwide scale, and the effort was supported by the Japan Association of Obstetricians & Gynecologists. Both the female donors of the CHM tissues and their male partners were Japanese, and informed consent was obtained from them. The project was approved by the Ethical Committee of Kyushu University. Genomic DNA samples of CHMs were extracted using the QIAamp DNA Blood Mini Kit (Qiagen). To confirm that the CHM DNA samples were homozygous at all loci without significant maternal contamination, we genotyped 17 microsatellite loci (Kondo et al. 2004
Whole genome amplification
In pilot experiments, we evaluated the effects of amplification on genotyping using Affymetrix Mapping 100K arrays. Using four CHM samples, the average call rates were 99.15% for amplified DNA and 99.34% for unamplified DNA. The overall concordance rate was 99.93%. We concluded that using amplified DNA was a reasonable strategy for whole genome analysis by DNA array assays, confirming previous reports (Paez et al. 2004
Genotyping by DNA arrays
For genotyping the first set of SNPs, 169 diploid Caucasian samples were analyzed along with the CHM samples. These Caucasian samples had been independently assayed on the same chip designs, and three clusters per SNP for the reference (r), alternate (a), or heterozygous (h) genotypes were determined. Clustering alongside diploid samples enabled an added layer of checks for genotyping quality. The following quality filters were performed for these SNPs: (1) a call rate for mole samples (r + a)/75
We did not have a large set of diploid sample scans available for the second chip design. In this case, we used a modified haploid clustering algorithm, which allowed a maximum of two genotyping clusters. Our requirements for data quality for these SNPs were: (1) a call rate for mole samples of
Block partition
Phasing of pseudoindividuals
Extended shared haplotype analysis
This work was supported by Grants-in-Aid for Scientific Research and Research Revolution 2002 from the Ministry of Education, Culture, Sports, Science and Technology, Japan to K.H. We thank members of the Japan Association of Obstetricians & Gynecologists for their cooperation in collecting mole samples. Some of the data included in this article are from The International HapMap Project Web sites.
[Supplemental material is available online at www.genome.org.] Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.4371105. Freely available online through the Genome Research Immediate Open Access option.
4 These authors contributed equally to this work.
5 Corresponding author.
Bersaglieri, T., Sabeti, P.C., Patterson, N., Vanderploeg, T., Schaffner, S.F., Drake, J.A., Rhodes, M., Reich, D.E., and Hirschhorn, J.N. 2004. Genetic signatures of strong recent positive selection at the lactase gene. Am. J. Hum. Genet. 74: 1111-1120.
Cohen, J.C., Kiss, R.S., Pertsemlidis, A., Marcel, Y.L., McPherson, R., and Hobbs, H.H. 2004. Multiple rare alleles contribute to low plasma levels of HDL cholesterol. Science 305: 869-872.
Fan, J.B., Surti, U., Taillon-Miller, P., Hsie, L., Kennedy, G.C., Hoffner, L., Ryder, T., Mutch, D.G., and Kwok, P.Y. 2002. Paternal origins of complete hydatidiform moles proven by whole genome single-nucleotide polymorphism haplotyping. Genomics 79: 58-62.
Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M., et al. 2002. The structure of haplotype blocks in the human genome. Science 296: 2225-2229.
Hinds, D.A., Stuve, L.L., Nilsen, G.B., Halperin, E., Eskin, E., Ballinger, D.G., Frazer, K.A., and Cox, D.R. 2005. Whole-genome patterns of common DNA variation in three human populations. Science 307: 1072-1079.
The International HapMap Consortium. 2003. The International HapMap Project. Nature 426: 789-796.
Jeffreys, A.J., Neumann, R., Panayi, M., Myers, S., and Donnelly, P. 2005. Human recombination hot spots hidden in regions of strong marker association. Nat. Genet. 37: 601-606.
Johnson, G.C.L., Esposito, L., Barratt, B.J., Smith, A.N., Heward, J., Genova, G.D., Ueda, H., Cordell, H.J., Eaves, I.A., Dudbridge, F., et al. 2001. Haplotype tagging for the identification of common disease genes. Nat. Genet. 29: 233-237.
Ke, X., Hunt, S., Tapper, W., Lawrence, R., Stavrides, G., Ghori, J., Whittaker, P., Collins, A., Morris, A.P., Bentley, D., et al. 2004. The impact of SNP density on fine-scale patterns of linkage disequilibrium. Hum. Mol. Genet. 13: 577-588.
Kondo, H., Qin, M., Mizota, A., Kondo, M., Hayashi, H., Hayashi, K., Oshima, K., Tahira, T., and Hayashi, K. 2004. A homozygosity-based search for mutations in patients with autosomal recessive retinitis pigmentosa, using microsatellite markers. Invest. Ophthalmol. Vis. Sci. 45: 4433-4439.
Lin, S., Cutler, D.J., Zwick, M.E., and Chakravarti, A. 2002. Haplotype inference in random population samples. Am. J. Hum. Genet. 71: 1129-1137.
Liu, N., Sawyer, S.L., Mukherjee, N., Pakstis, A.J., Kidd, J.R., Kidd, K.K., Brookes, A.J., and Zhao, H. 2004. Haplotype block structures show significant variation among populations. Genet. Epidemiol. 27: 385-400.
Oota, H., Pakstis, A.J., Bonne-Tamir, B., Goldman, D., Grigorenko, E., Kajuna, S.L., Karoma, N.J., Kungulilo, S., Lu, R.B., Odunsi, K., et al. 2004. The evolution and population genetics of the ALDH2 locus: Random genetic drift, selection, and low levels of recombination. Ann. Hum. Genet. 68: 93-109.
Paez, J.G., Lin, M., Beroukhim, R., Lee, J.C., Zhao, X., Richter, D.J., Gabriel, S., Herman, P., Sasaki, H., Altshuler, D., et al. 2004. Genome coverage and sequence fidelity of phi29 polymerase-based multiple strand displacement whole genome amplification. Nucleic Acids Res. 32: e71.
Patil, N., Berno, A.J., Hinds, D.A., Barrett, W.A., Doshi, J.M., Hacker, C.R., Kautzer, C.R., Lee, D.H., Marjoribanks, C., McDonough, D.P., et al. 2001. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294: 1719-1723.
Phillips, M.S., Lawrence, R., Sachidanandam, R., Morris, A.P., Balding, D.J., Donaldson, M.A., Studebaker, J.F., Ankener, W.M., Alfisi, S.V., Kuo, F.S., et al. 2003. Chromosome-wide distribution of haplotype blocks and the role of recombination hot spots. Nat. Genet. 33: 382-387.
Pritchard, J.K. 2001. Are rare variants responsible for susceptibility to complex diseases? Am. J. Hum. Genet. 69: 124-137.
Ptak, S.E., Hinds, D.A., Koehler, K., Nickel, B., Patil, N., Ballinger, D.G., Przeworski, M., Frazer, K.A., and Pääbo, S. 2005. Fine-scale recombination patterns differ between chimpanzees and humans. Nat. Genet. 37: 429-434.
Sabeti, P.C., Reich, D.E., Higgins, J.M., Levine, H.Z., Richter, D.J., Schaffner, S.F., Gabriel, S.B., Platko, J.V., Patterson, N.J., McDonald, G.J., et al. 2002. Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832-837.
Salem, R.M., Wessel, J., and Schork, N.J. 2005. A comprehensive literature review of haplotyping software and methods for use with unrelated individuals. Hum. Genomics 2: 39-66.
Steigrad, S.J. 2003. Epidemiology of gestational trophoblastic diseases. Best Pract. Res. Clin. Obstet. Gynaecol. 17: 837-847.
Stein, L.D., Mungall, C., Shu, S., Caudy, M., Mangone, M., Day, A., Nickerson, E., Stajich, J.E., Harris, T.W., Arva, A., et al. 2002. The generic genome browser: A building block for a model organism system database. Genome Res. 12: 1599-1610.
Stephens, M. and Donnelly, P. 2003. A comparison of bayesian methods for haplotype reconstruction from population genotype data. Am. J. Hum. Genet. 73: 1162-1169.
Stephens, M. and Scheet, P. 2005. Accounting for decay of linkage disequilibrium in haplotype inference and missing-data imputation. Am. J. Hum. Genet. 76: 449-462.
Sun, X., Stephens, J.C., and Zhao, H. 2004. The impact of sample size and marker selection on the study of haplotype structures. Hum. Genom. 1: 179-193.
Taillon-Miller, P., Bauer-Sardina, I., Zakeri, H., Hillier, L., Mutch, D.G., and Kwok, P.Y. 1997. The homozygous complete hydatidiform mole: A unique resource for genome studies. Genomics 46: 307-310.
Winckler, W., Myers, S.R., Richter, D.J., Onofrio, R.C., McDonald, G.J., Bontrop, R.E., McVean, G.A., Gabriel, S.B., Reich, D., Donnelly, P., et al. 2005. Comparison of fine-scale recombination rates in humans and chimpanzees. Science 308: 107-111.
Wong, K.K., Tsang, Y.T., Shen, J., Cheng, R.S., Chang, Y.M., Man, T.K., and Lau, C.C. 2004. Allelic imbalance analysis by high-density single-nucleotide polymorphic allele (SNP) array with whole genome amplified DNA. Nucleic Acids Res. 32: e69.
Zhang, K., Calabrese, P., Nordborg, M., and Sun, F. 2002a. Haplotype block structure and its applications to association studies: Power and study designs. Am. J. Hum. Genet. 71: 1386-1394.
Zhang, K., Deng, M., Chen, T., Waterman, M.S., and Sun, F. 2002b. A dynamic programming algorithm for haplotype block partitioning. Proc. Natl. Acad. Sci. 99: 7335-7339.
Zhang, K., Qin, Z., Chen, T., Liu, J.S., Waterman, M.S., and Sun, F. 2005. HapBlock: Haplotype block partitioning and tag SNP selection software using a set of dynamic programming algorithms. Bioinformatics 21: 131-134.
http://www.hapmap.org; The International HapMap Project Home page.
http://orca.gen.kyushu-u.ac.jp/; Kyushu University Definitive Haplotype Database.
http://www.cmb.usc.edu/msms/HapBlock/; HapBlock program.
http://www.stat.washington.edu/stephens/software.html; PHASE program.
Received July 1, 2005; accepted in revised form August 24, 2005.