Genome Res. 13:2112-2117, 2003 ©2003 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/03 $5.00
Letter
Haplotype Information and Linkage Disequilibrium Mapping for Single Nucleotide Polymorphisms
1 Department of Statistics, Harvard University, Cambridge, Massachusetts 02138, USA
2 Division of Preventive Medicine, Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts 02215, USA
Single nucleotide polymorphisms in the human genome have become an increasingly popular topic because their analysis promises to be a key step toward personalized medicine. We investigate two related questions: how much haplotype information contributes to linkage disequilibrium (LD) mapping, and whether an in silico haplotype construction preceding the LD analysis can help. For disease gene mapping, using both simulated and real data sets on cystic fibrosis and Alzheimer disease, we reached the following conclusions: (1) for simple Mendelian diseases, for which a tractable full statistical model can be developed, the loss of haplotype information for either control or disease data does not have a great impact on LD fine mapping, and haplotype inference should be carried out jointly with LD mapping; (2) for complex diseases, inferring haplotype phases for individuals prior to LD mapping helps achieve better accuracy. An improved version of the linkage disequilibrium mapping program, BLADE v2, is available at http://www.fas.harvard.edu/junliu/TechRept/03folder/bladev2.tgz.
Available data on tightly linked single nucleotide polymorphisms (SNPs) are experiencing a dramatic growth. Because it is commonly believed that haplotypes are essential for disease-gene discovery, genetic demography, and chromosomal evolution studies, as well as linkage disequilibrium (LD) mappings (Fallin and Schork 2000
Although the available in silico haplotyping methods are cost-effective and
have shown considerable power, they are still error-prone
(Fallin and Schork 2000
Realizing that the single- or pair-marker methods are unable to fully
exploit the information of the closely linked markers, researchers have been
interested in truly haplotype-based multi-marker LD fine-mapping methods for
case-control genetic marker data (McPeek
and Strahs 1999
An interesting question is whether the explicit construction of the case or
control haplotypes before the LD mapping is necessary for an efficient use of
the available multi-marker information. Conceptually, haplotype inference and
the location estimation can be achieved at the same time via a joint
statistical model. Because the uncertainty in haplotype phasing is accounted
for in this framework, the resulting location estimation can be more robust.
To test this hypothesis, we conducted a permutation study of the cystic
fibrosis (CF) data set (Kerem et al.
1989
For high-density SNP markers, it is often inappropriate to treat SNPs on
the control haplotypes as in linkage equilibrium, and an inhomogeneous Markov
chain model appears appropriate when the markers are not too closely linked
(Liu et al. 2001
Throughout this section, we compared the following two strategies for fine mapping the disease mutation: (A) a direct analysis by jointly modeling haplotype uncertainty and LD for the unphased data, and (B) inferring the haplotypes first and then applying a fine-mapping algorithm to the ascertained haplotypes. Two LD mapping algorithms were used in this study, BLADE (Liu et al. 2001
CF Data Set
To assess the impact of haplotype information of the disease chromosomes on
LD mapping, we simulated 100 independent diseased group data sets. Each data
set consists of 47 unphased diseased individuals with genotypes produced by
random pairing of the 94 known disease haplotypes in the CF data set,
effectively losing all of the haplotype information.
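The random-pairing step described above can be sketched as follows. This is a minimal illustration, not the authors' simulation code: the function name `pair_haplotypes`, the 0/1 allele coding (1 for the major allele, so that the per-marker sum matches the 0/1/2 genotype coding used in Methods), and the `seed` parameter are all assumptions made for the example.

```python
import random


def pair_haplotypes(haplotypes, seed=None):
    """Randomly pair haplotypes into unphased genotypes.

    Each haplotype is a sequence of 0/1 alleles (assumed coding: 1 = major).
    The genotype at each marker is the sum of the two alleles (0, 1, or 2),
    so the phase of heterozygous markers is discarded.
    """
    if len(haplotypes) % 2 != 0:
        raise ValueError("need an even number of haplotypes")
    rng = random.Random(seed)
    shuffled = list(haplotypes)
    rng.shuffle(shuffled)
    genotypes = []
    for k in range(0, len(shuffled), 2):
        h1, h2 = shuffled[k], shuffled[k + 1]
        genotypes.append([a + b for a, b in zip(h1, h2)])
    return genotypes
```

Applied to 94 haplotypes, a call such as `pair_haplotypes(disease_haplotypes, seed=0)` would return 47 unphased genotype vectors; repeating the call with different seeds would give independent replicate data sets.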
Table 1 shows the comparisons
between strategies A and B for fine-mapping the disease mutation
(
To test the effect of losing control haplotype information, we generated another 100 independent "control group" data sets. In addition to randomly pairing up the disease haplotypes, we also randomly paired up the control haplotypes and estimated the Markov transition matrices from these unphased control genotypes by an EM algorithm (see Methods). The root mean square errors (RMSEs) of strategies A and B in this case, when BLADE was used as the LD mapping tool, were 0.0103 and 0.0339, respectively, leading to the same conclusion as shown in Table 1. In summary, both strategies A and B were reasonably accurate in location estimations for this example, and the loss of control haplotype information did not seem to affect the estimation accuracy. Strategy A performed significantly better than strategy B in terms of both the RMSE of the disease location estimate and the percentage of times at which the 95% PI overlaps with the target region, regardless of the LD mapping method or haplotype phasing algorithm used in the analysis.
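The two summary measures used in this comparison, the RMSE of the location estimates and the fraction of replicates whose 95% PI overlaps the target region, can be computed along the following lines. This is a sketch under stated assumptions: the function names are hypothetical, the RMSE is taken relative to a single known true location, and intervals are represented as `(low, high)` pairs in the same units as the estimates.

```python
import math


def rmse(estimates, true_location):
    """Root mean square error of location estimates around the true location."""
    squared = [(e - true_location) ** 2 for e in estimates]
    return math.sqrt(sum(squared) / len(squared))


def overlap_rate(intervals, target):
    """Fraction of (low, high) intervals that overlap the target (low, high) region."""
    target_low, target_high = target
    hits = sum(1 for low, high in intervals
               if low <= target_high and high >= target_low)
    return hits / len(intervals)
```

With 100 replicate data sets, `rmse` would be called once per strategy on the 100 location estimates, and `overlap_rate` on the 100 corresponding posterior intervals.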
A Simulation for Simple Mendelian Disorders
As a comparison, we also applied BLADE to the 100 sets of simulated disease haplotypes (i.e., the phase information is known). The average of the 100 location estimates was 1.90 cM with the RMSE of 0.095 cM, and 98 out of 100 times, the 95% PI overlapped with the target interval. This example again shows that strategy B is inferior to strategy A in both the RMSE of the location estimate and the percentage of times at which the 95% PI overlaps with the target region.
APOE SNP Data Set for AD
From the original set of 60 SNPs, we used only those 30 SNPs in close
proximity to the APOE-4 locus (Martin et
al. 2000 By modeling the control haplotypes as a Markov chain and assuming k = 1 (i.e., a single founder mutation), we applied strategies A and B on the APOE data set. Because APOE-4 is the most susceptible SNP according to the single-marker LD measurement, we also tested on a modified data set with the APOE-4 marker removed from the original data set. In other words, we compared the performances of the two strategies solely on the basis of the genotype data of the remaining 29 SNPs.
The histograms of the posterior samples of the disease location
We further removed both APOE-4 and its nearest neighboring marker (SNP952, which has the second-highest single-marker LD measurement) from the data set. Now, marker SNP988 and its neighbor have the strongest single-marker association with AD. Because these two markers are 8.6 and 16 kb away from the APOE-4 locus (i.e., the origin of the x-axis in Fig. 1), respectively, the single-marker result under this scenario is misleading. However, as shown in Figure 2, the haplotype-based LD mapping result using strategy B remained robust even though the two SNPs with the strongest associations with AD had been removed. The estimated position by strategy B was 0.4303 cM (width of 95% PI: 0.0129 cM). The best result in 10 independent trials of strategy A was far from the real locus (0.61 cM; almost at the end of the whole region).
This example shows that even when as few as 20% of diseased subjects actually carry the APOE-4 mutation, and the most susceptible markers are not available, BLADE can still accurately map the location of the AD-susceptible mutation. It also shows that, for complex traits, because of their polygenic nature as well as the presence of incomplete penetrance and phenocopy, the association between the founder mutation and the disease manifestation contributes much less information to disease haplotype inference than it does for Mendelian traits. Thus, inferring haplotype phase first using a computational algorithm (e.g., PLEM) and then performing LD mapping with these inferred haplotypes (i.e., strategy B) may have slight advantages over the direct use of BLADE (strategy A).
Simulation Study of a Complex Disease
In our simulation, we call a trial successful if the resulting 95% PI covers the true location and also has a width of no greater than 25% of the whole region (1.73 cM). The results of our analysis are summarized in Table 4, in which Mean(pos), Std(pos), Mean PI width, RMSE, and size of cluster1 were calculated only among such successful trials. Because approach 1 uses the phase information without any uncertainty, it is not surprising that it outperformed the other two approaches. It is a bit surprising, however, that strategy B performed only slightly worse than the case in which one knows the complete phase information.
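The success criterion stated above (the 95% PI covers the true location and is no wider than 25% of the region) can be written as a small predicate. This is only an illustrative sketch: the function and parameter names are hypothetical, and the region length is passed in explicitly rather than assumed.

```python
def is_successful(pi_low, pi_high, true_location, region_length,
                  max_fraction=0.25):
    """Return True if the interval covers the truth and is narrow enough."""
    covers = pi_low <= true_location <= pi_high
    narrow = (pi_high - pi_low) <= max_fraction * region_length
    return covers and narrow


def success_rate(trials, true_location, region_length):
    """Fraction of (pi_low, pi_high) trials that meet the success criterion."""
    wins = sum(1 for low, high in trials
               if is_successful(low, high, true_location, region_length))
    return wins / len(trials)
```

Summary statistics such as the mean position or mean PI width would then be computed only over the trials for which `is_successful` returns `True`.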
The findings from this simulation study agree with those from the APOE data set: when the case haplotypes account for only a small proportion (e.g., 20%-30%) of the diseased group (in the complex disease case), strategy B appeared to perform slightly better than strategy A in fine mapping of the disease mutation (Table 4). In contrast, when the case haplotypes account for a large proportion (e.g., 70%) of the diseased group (in the Mendelian disease case), strategy A, on average, beats strategy B (Tables 1, 2). This simulation study, in conjunction with our analysis of the APOE SNP data set, indicates that it is nontrivial, and may even be impossible, to design an effective model that integrates haplotype inference and disease mutation fine mapping for complex traits. Currently, strategy B is an attractive way to handle unphased diseased individuals for fine mapping of mutations responsible for a complex trait.
Several popular haplotype-frequency estimation and phase-construction methods have been proposed in the past 15 yr, including Clark's algorithm (Clark 1990
For complex diseases, however, the haplotype information for a particular disease locus contributes less to the overall case pool. As a result, jointly modeling haplotype uncertainty and disease location may only add to the model complexity without a commensurate gain. Performing haplotype phasing first with a reasonable computational algorithm, and then feeding the approximately inferred haplotypes into the LD mapping procedure, may thus offer some slight advantages in estimating the position of the founder mutation.
LD Mapping
The location estimation method used in our study employs a statistical model to describe the dependence structure among key variables characterizing the haplotypes and adopts a Markov chain Monte Carlo strategy to draw posterior samples of the location parameter and other variables. The resulting method, implemented in BLADE (Liu et al. 2001
BLADE uses genetic distance (in Morgans or cM) to measure distances
among the markers, and this measure has been used traditionally for
microsatellite or bi-allelic markers. The conversion between the genetic and
physical distances ranges from
Simulation
Estimating Control Haplotype Frequencies
When a Markov chain model is used for the control haplotypes, the haplotype
frequencies cannot be assessed straightforwardly with only unphased data. We
developed an EM algorithm to estimate the transition probabilities from the
mth to the (m+1)th marker. The
genotype of each locus is coded as follows: (0) homozygous minor alleles, (2)
homozygous major alleles, and (1) heterozygous. We let
$N_{i,j}$ be the number of marker pairs with genotypes
i and j, respectively, at the two neighboring loci. Because
only $N_{1,1}$ causes ambiguity, the EM algorithm for
estimating the frequencies
= 1a. These frequencies can be
easily converted to a transition matrix. The algorithm usually converges very
fast.
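Because the update formulas did not survive in the text above, the following is only a sketch of a standard two-locus EM under Hardy-Weinberg equilibrium, not necessarily the authors' exact algorithm. It assumes alleles are coded 0 (minor) and 1 (major), so that a genotype equals the number of major alleles, consistent with the 0/1/2 coding described above. `N[i][j]` is the count of individuals with genotypes i and j at the two neighboring loci; only `N[1][1]` is split between the two possible phase resolutions. All function names are hypothetical.

```python
def em_two_locus(N, tol=1e-10, max_iter=1000):
    """Estimate two-locus haplotype frequencies from a 3x3 genotype count table."""
    keys = [(0, 0), (0, 1), (1, 0), (1, 1)]
    base = {k: 0.0 for k in keys}
    for i in range(3):
        for j in range(3):
            if i == 1 and j == 1:
                continue  # double heterozygotes are handled in the E-step
            # Alleles carried by the two haplotypes at each locus.
            a = (i // 2, (i + 1) // 2)
            b = (j // 2, (j + 1) // 2)
            base[(a[0], b[0])] += N[i][j]
            base[(a[1], b[1])] += N[i][j]
    total = 2.0 * sum(sum(row) for row in N)
    if total == 0:
        raise ValueError("empty count table")
    n11 = N[1][1]
    p = {k: 0.25 for k in keys}
    for _ in range(max_iter):
        # E-step: probability that a double heterozygote is 00/11 (cis).
        cis = p[(0, 0)] * p[(1, 1)]
        trans = p[(0, 1)] * p[(1, 0)]
        w = cis / (cis + trans) if cis + trans > 0 else 0.5
        # M-step: re-estimate frequencies from expected haplotype counts.
        new = {
            (0, 0): (base[(0, 0)] + n11 * w) / total,
            (1, 1): (base[(1, 1)] + n11 * w) / total,
            (0, 1): (base[(0, 1)] + n11 * (1 - w)) / total,
            (1, 0): (base[(1, 0)] + n11 * (1 - w)) / total,
        }
        delta = max(abs(new[k] - p[k]) for k in keys)
        p = new
        if delta < tol:
            break
    return p


def transition_matrix(p):
    """Convert haplotype frequencies to P(allele at m+1 | allele at m)."""
    rows = []
    for a in (0, 1):
        marginal = p[(a, 0)] + p[(a, 1)]
        if marginal == 0:
            rows.append([0.5, 0.5])  # unobserved allele: uniform row (an assumption)
        else:
            rows.append([p[(a, 0)] / marginal, p[(a, 1)] / marginal])
    return rows
```

Running `em_two_locus` on each pair of adjacent markers and passing the result to `transition_matrix` would give one transition matrix per marker interval. The uniform starting point is a simple choice; in perfectly symmetric tables it can leave the phase split at 0.5, so other starting values may be worth trying.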
To test the effect of losing haplotype information among the controls, we
simulated a control data set by randomly pairing up the given haplotypes in
the CF control data set. To infer the location of the disease mutation
(
Determining the Number of Clusters
the location
of the disease mutation. The explicit forms of the likelihood function and
prior distributions are given in Liu et al.
(2001
We thank Dr. Eden R. Martin for kindly providing the APOE SNPs case-control data. This work was supported in part by National Science Foundation grant DMS-0204674 and National Institutes of Health grant R01 HG02518-01. We thank Jeremy Buchman for his English editing. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
[The following individual kindly provided reagents, samples, or unpublished information as indicated in the paper: E.R. Martin.] Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.586803.
3 Corresponding author.
Abecasis, G.R. and Cookson, W.O. 2000. GOLD - graphical overview of linkage disequilibrium. Bioinformatics 16: 182-183.
Akey, J.M., Zhang, K., Xiong, M., Doris, P., and Jin, L. 2001. The effect that genotyping errors have on the robustness of common linkage-disequilibrium measures. Am. J. Hum. Genet. 68: 1447-1456.
Chiano, M.N. and Clayton, D.G. 1998. Fine genetic mapping using haplotype analysis and the missing data problem. Ann. Hum. Genet. 62: 55-60.
Clark, A.G. 1990. Inference of haplotypes from PCR-amplified samples of diploid populations. Mol. Biol. Evol. 7: 111-122.
Clark, V.J., Metheny, N., Dean, M., and Peterson, R.J. 2001. Statistical estimation and pedigree analysis of CCR2-CCR5 haplotypes. Hum. Genet. 108: 484-493.
Excoffier, L. and Slatkin, M. 1995. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol. Biol. Evol. 12: 921-927.
Fallin, D. and Schork, N.J. 2000. Accuracy of haplotype frequency estimation for biallelic loci, via the expectation-maximization algorithm for unphased diploid genotype data. Am. J. Hum. Genet. 67: 947-959.
Fallin, D., Cohen, A., Essioux, L., Chumakov, I., Blumenfeld, M., Cohen, D., and Schork, N.J. 2001. Genetic analysis of case/control data using estimated haplotype frequencies: Application to APOE locus variation and Alzheimer's disease. Genome Res. 11: 143-151.
Gabriel, S.B., Schaffner, S.F., and Nguyen, H. 2002. The structure of haplotype blocks in the human genome. Science 296: 2225-2229.
Gerdes, L.U., Gerdes, C., Hansen, P.S., Klausen, I.C., Færgeman, O., and Dyerberg, J. 1996. The apolipoprotein E polymorphism in Greenland Inuit in its global perspective. Hum. Genet. 98: 546-550.
Hanlon, C.S. and Rubinsztein, D.C. 1995. Arginine residues at codons 112 and 158 in the apolipoprotein E gene correspond to the ancestral state in humans. Atherosclerosis 112: 85-90.
Hawley, M.E. and Kidd, K.K. 1995. HAPLO: A program using the EM algorithm to estimate the frequencies of multi-site haplotypes. J. Hered. 86: 409-411.
Hodge, S.E., Boehnke, M., and Spence, M.A. 1999. Loss of information due to ambiguous haplotyping of SNPs. Nat. Genet. 21: 360-361.
Hoh, J. and Hodge, S.E. 2000. A measure of phase ambiguity in pairs of SNPs in the presence of linkage disequilibrium. Hum. Hered. 50: 359-364.
International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921.
Jeffreys, A.J., Kauppi, L., and Neumann, R. 2001. Intensely punctate meiotic recombination in the class II region of the major histocompatibility complex. Nat. Genet. 29: 217-222.
Kerem, B., Rommens, J.M., Buchanan, J.A., Markiewicz, D., Cox, T.K., Chakravarti, A., Buchwald, M., and Tsui, L.C. 1989. Identification of the cystic fibrosis gene: Genetic analysis. Science 245: 1073-1080.
Liu, J.S., Sabatti, C., Teng, J., Keats, B.J.B., and Risch, N. 2001. Bayesian analysis of haplotypes for linkage disequilibrium mapping. Genome Res. 11: 1716-1724.
Long, J.C., Williams, R.C., and Urbanek, M. 1995. An EM algorithm and testing strategy for multiple-locus haplotypes. Am. J. Hum. Genet. 56: 799-810.
Mahley, R.W. and Rall Jr., S.C. 2000. Apolipoprotein E: Far more than a lipid transport protein. Annu. Rev. Gen. Hum. Genet. 1: 507-537.
Martin, E.R., Lai, E.H., Gilbert, J.R., Rogala, A.R., Afshari, A.J., Riley, J., Finch, K.L., Stevens, J.F., Livak, K.J., Slotterbeck, B.D., et al. 2000. SNPing away at complex diseases: Analysis of single-nucleotide polymorphisms around APOE in Alzheimer disease. Am. J. Hum. Genet. 67: 383-394.
McPeek, M.S. and Strahs, A. 1999. Assessment of linkage disequilibrium by the decay of haplotype sharing, with application to fine-scale genetic mapping. Am. J. Hum. Genet. 65: 858-875.
Morris, A.P., Whittaker, J.C., and Balding, D.J. 2002. Fine-scale mapping of disease loci via shattered coalescent modeling of genealogies. Am. J. Hum. Genet. 70: 686-707.
Niu, T., Qin, Z.S., Xu, X., and Liu, J.S. 2002. Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. Am. J. Hum. Genet. 70: 157-169.
Qin, Z.S., Niu, T., and Liu, J.S. 2002. Partition-ligation EM algorithm for haplotype inference with single nucleotide polymorphisms. Am. J. Hum. Genet. 71: 1242-1247.
Rocchi, A., Pellegrini, S., Siciliano, G., and Murri, L. 2003. Causative and susceptibility genes for Alzheimer's disease: A review. Brain Res. Bull. 61: 1-24.
Stephens, M., Smith, N.J., and Donnelly, P. 2001. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 68: 978-989.
http://www.fas.harvard.edu/junliu/TechRept/03folder/bladev2.tgz; An improved version of the program BLADE v2.
Received July 3, 2002; accepted in revised form July 7, 2003.