|
|
|
|
Published online before print
December 14, 2005, 10.1101/gr.4346306 Genome Res. 16:290-296, 2006 ©2006 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/06 $5.00
Methods Logistic regression protects against population structure in genetic association studies1 Department of Epidemiology and Public Health, Imperial College, St. Mary's Campus, London W2 1PG, United Kingdom 2 Worldwide Epidemiology, GlaxoSmithKline, Harlow CM19 5AW, United Kingdom
We conduct an extensive simulation study to compare the merits of several methods for using null (unlinked) markers to protect against false positives due to cryptic substructure in population-based genetic association studies. The more sophisticated "structured association" methods perform well but are computationally demanding and rely on estimating the correct number of subpopulations. The simple and fast "genomic control" approach can lose power in certain scenarios. We find that procedures based on logistic regression that are flexible, computationally fast, and easy to implement also provide good protection against the effects of cryptic substructure, even though they do not explicitly model the population structure.
Population-based association studies provide an attractive approach to the identification of susceptibility genes underlying complex genetic traits. However, the recent track record of such studies has been mixed: Many reported associations have not been replicated, and the number of confirmed, positive associations to date is less than might have been expected a few years ago. Some of the nonreplicated reported associations might be due to population structure. If a higher proportion of cases than controls is sampled from a particular subpopulation, for example, because of biased ascertainment or higher prevalence of the disease in that subpopulation, then association can potentially be detected between case-control status and any markers having unusual allele frequencies in that subpopulation. Many such associations will be spurious: not due to any causal relationship between phenotype and genomic variants in the vicinity of the marker. If the population structure is recognized, it can be accounted for either at the design or the analysis stage of a study. Thus, the most important potential threat from population structure arises when the structure is unknown, so-called cryptic substructure.
Freedman et al. (2004
Since the mid-1990s, many researchers have protected themselves against spurious associations due to cryptic substructure by implementing family-based designs (Thomson 1995
Perhaps the simplest established method for adjusting for the effects of cryptic substructure is Genomic Control (Devlin and Roeder 1999
Structured association methods are more sophisticated than genomic control, and are computationally more demanding. These methods aim to allocate an individual's genome to one or more subpopulations, and to test for association conditional on this allocation. The most widely used structured association method seems to be that of Pritchard et al. (2000a None of the structured association analyses is straightforward to implement, the most important difficulty being the selection of the number of subpopulations. This must be chosen in each method, yet is difficult to estimate from the data. Since the notion of subpopulation is a theoretical construct that only imperfectly reflects reality, it is clear that the problem of estimating the number of subpopulations will never satisfactorily be resolved and it is preferable, if feasible, to implement a method that does not rely on the number of subpopulations being correctly assessed. Despite concerns being raised over many years about the effects of cryptic substructure, there does not yet seem to be any extensive overview assessing the methods available to protect association studies against its adverse effects. Some simulation studies have been published, and we briefly review the principle studies below. These have been limited in different respects, for example, most consider only genomic control (GC) and not other statistical methods for population association studies; some use only extreme levels of population differentiation that will rarely arise in practice, or small sample sizes or a small number of genetic markers. Most studies do not compare performance with and without including the causal polymorphism among the markers typed. Some studies do not incorporate genealogical effects into their simulation, and hence the validity of their conclusions is severely compromised.
Bacanu et al. (2000
Marchini et al. (2004
Hao et al. (2004
Köhler and Bickeböller (2005
Thus, although some results are available, there is clearly much more to be done to investigate the relative merits of the different statistical analyses in scenarios that reflect real, well-designed studies and actual populations. For example, the degree of structuring that is sufficiently problematic to require a prophylactic statistical method has not been adequately characterized, nor has the potential loss of power from implementing such a method when it is in fact not needed. Finally, the use of logistic regression to account for cryptic substructure has not yet been extensively investigated. Recently, Wang et al. (2005 We present here the most extensive simulation study to date of statistical methods to allow for the effects of population stratification. Our study is unique in several respects: It assesses multiple statistical methods (five that are designed to allow for population stratification, and for comparison one that is not) and different levels of between-population variation (FST) and relative population sizes. We distinguish the situation in which the causal variant itself is included among the markers typed in the study from that in which only noncausal markers tightly linked with the causal variant are included, and we separately test the methods under a scenario of no causal association. Further, we also consider a "biased" ascertainment scenario, as well as variation in the numbers of cases and controls and the disease model, and the number of markers used to adjust for population stratification.
For each of five demographic scenarios, we simulated 50 sets of 500 cases and 500 controls genotyped at 110 SNPs in 51 genes: 10 SNPs in gene 1 (causal), and two SNPs in each of the remaining 50 genes (null). In our simulation the 51 genes are unlinked, but this corresponds in practice to genes that are widely spaced, so that any linkage disequilibrium due to linkage is negligible. Case/control status was assigned according to genotype at a randomly chosen SNP in gene 1, with genotype relative risks 1:2:4 (see Methods for details).
The demographic scenarios vary according to relative population size and between-population variance in allele proportions, measured by FST (for definitions and a review, see Balding 2003
The 110 SNPs were individually tested for association with case/control status applying six different methods: (a) Armitage's trend test (CHISQ); (b) genomic control (GC); (c) stepwise logistic regression (SLR); (d) Bayesian logistic regression (BLR); (e) STRAT assuming K = 2, where K is the assumed number of underlying subpopulations (STR2); and (f) AdmixMap assuming K = 2 (AM2). Methods (a), (c), and (d) do not make any explicit adjustment for population stratification. The adjustment for method (b) is encapsulated in the value of , which is estimated from data via the median of the CHISQ statistics. Average estimates of under each demographic scenario are shown in Table 1. Table 2 shows the total number of false positives over the 50 simulated studies for each of the five demographic scenarios. A false positive is a declaration of significance for an SNP in any of the 50 genes other than gene 1. For CHISQ, GC, STR2, and AM2, the nominal type 1 error rate was set to 2 x 10-4, corresponding to an expected total of one false positive for these methods in each row of Table 2. CHISQ displays few false positives except under our most extreme demographic scenario, (3), suggesting that in many realistic settings the effect of population structure on type 1 error may be small. GC, STR2, and AM2 all appear to be well calibrated, with five, three, and four false positives overall in Table 2, compared with an expectation of five each. The nominal type 1 error rate cannot be directly assigned for SLR and BLR, but the actual rate appears to be extremely low with just one false positive for SLR in the 250 data sets of the simulation study, and none for BLR.
The empirical power for the simulation study (Table 3) is defined as the proportion of data sets for which at least one of the 10 SNPs in gene 1 is significantly associated with case/control status. With increasing FST, all methods showed at least a modest loss of power, and this appears to be more severe for demographic Scenarios 4 and 5, with unequal population sizes. The empirical power is similar across all methods, but CHISQ, SLR, and AM2 display equal or greater power than both GC and STR2 in each of the five scenarios, although none of the individual differences was statistically significant. STR2 shows the lowest power of all six methods in four of the five demographic scenarios, whereas CHISQ and AM2 both achieve the highest power in four scenarios, as does SLR in three scenarios.
All the data sets were then reanalyzed with the causal SNP removed, leaving nine SNPs in gene 1. Again, GC, STR2, and AM2 appear to be well calibrated, with totals of five, three, and five false positives, respectively, but the false-positive totals for SLR and BLR are both now slightly worse: seven and nine (Table 4). Power corresponds to detecting at least one of the nine SNPs in gene 1. CHISQ now displays the highest empirical power in every scenario, equaled in only one case by BLR (Table 5). Again, AM2 is consistently superior to both GC and STR2, but it is now consistently inferior to both SLR and BLR.
In our next analysis, we removed the entire causal gene (gene 1) from each data set, generating a scenario in which there are no true causal associations. For this analysis every significant association is a type 1 error. GC, STR2, and AM2 have totals of six, six, and four false positives (Table 6), respectively, again close to the expectation of five, whereas SLR and BLR have 11 and four, respectively.
The results so far (Tables 2, 3, 4, 5, 6) do not indicate striking differences between the methods, although some trends are apparent. CHISQ tends to achieve the highest power, reflecting the fact that allowing for population structure does imply some cost in terms of power. Moreover, CHISQ produces few false positives except under Scenario 3. These observations invite the conclusion that, when any population stratification is expected to be weak, it may be best to adhere to the simple and familiar CHISQ test and ignore the effects of population structure. However, the power loss relative to CHISQ is small for SLR, BLR, and AM2, and still modest for GC; only STR2 suffers a substantial loss of power. In terms of type 1 error, GC, STR2, and AM2 all perform close to their nominal rates; SLR and BLR show fewer false positives when the causal SNP is included in the analysis, but more when the causal SNP is absent. So far subpopulation bias in case/control assignments was a consequence of allele frequency differences at the causal SNP, with no contribution from differing disease penetrances due, for example, to different environmental exposures in the two subpopulations, or the effects of selection at the causal gene, or ascertainment bias such that cases are preferentially chosen from one subpopulation. It is difficult to devise realistic models for these possibilities. Since, by definition, cryptic population structure has escaped the attention of investigators, it might be argued that any differences between subpopulations in, say, diagnosis patterns or environmental exposures, is unlikely to be large. However, this possibility cannot be ruled out, and we decided to investigate an extreme setting that is the same as demographic Scenario 3 except that all 500 cases were sampled from only one subpopulation, while in the other subpopulation all individuals were treated as controls. The results for this "biased" simulation scenario are shown in Table 7. Of the four methods that have a prespecified type 1 error rate (CHISQ, GC, STR2, AM2), only GC is able to achieve the correct level under this extreme scenario: No false positives were observed, compared with an expected total of one for each of the three analyses. However the price paid by GC for correct type 1 error is zero power at this significance level. When we increased the nominal significance level for GC so that the expected number of false positives per analysis was five, the observed average was around seven, similar to that for SLR and STR2, but the empirical power was 52% (including causal SNP) and 48% (excluding causal SNP), in each case considerably less than for both SLR and STR2. Conversely, AM2 displays the greatest empirical power in Table 7, but at the cost of a high false-positive rate. When the nominal type 1 error rate was reduced to 10-10, the observed average number of false positives was also about seven, but the empirical power values were 80% and 60%, much better than GC but lower than for SLR and STR2. Thus, when tested under this extreme ascertainment bias, CHISQ and AM2 fail because of high type 1 error rates, and GC fails because of low power. SLR and STR2 appear to perform best, displaying good power and a false-positive rate that exceeds the nominal rate for STR2, but is modest relative to other methods.
The GRR considered so far (1:2:4) are relatively high compared with realistic scenarios in which substructure is of most concern, and the numbers of cases and controls (both 500) relatively small. These limitations are dictated by computational constraints in a large simulation study. However, we additionally investigated the effect of sample size under Scenario 2, by increasing the sample size to 2000 cases and 2000 controls and reducing the GRR to 1:1.4:1.6 (Table 8). Even with this larger sample size, the effect of population structure remains modest, with a total of only four false positives for CHISQ over the 50 data sets. CHISQ again displays greater empirical power than all the other methods, and STR2 shows substantially lower power than all the other methods. Once again GC, STR2, and AM2 all appear to be well calibrated, with false-positive totals close to their expected value of one.
Finally, we resampled data under Scenario 2 with the original disease model but now considering 200 rather than 50 null genes in each of the 50 new data sets. With two SNPs per gene, this increases the total number of null SNPs from 100 to 400, which permits improved estimation of in GC, and of subpopulation allocation for STR2 and AM2. The additional null SNPs also provide additional opportunities for false positives, allowing the effect of population stratification on the type 1 error rate to be measured more precisely. GC, STR2, and AM2, are again well calibrated, whereas CHISQ produces nine false positives over the 50 data sets compared with an expectation of four under the nominal type 1 error rate (Table 9). Surprisingly, a comparison with row 2 of Tables 3 and 5 suggests that the empirical power has declined for this simulation for all methods except STR2, but the differences are not statistically significant. In the case of STR2, the additional markers have led to an improved power, so that it here displays about the same power as GC and AM2.
Our simulation study suggests that use of a simple 2 test generates substantial false-positive rates only in the presence of very high levels of population structure or substantial between-subpopulation difference in penetrances. Thus, explicit allowance for cryptic substructure may often be unnecessary provided that good study design principles have been used so that case and control populations are similar. However, methods that do protect against cryptic substructure typically perform well in limiting the number of false positives, and the cost of this protection, in terms of lost power, is often small. Thus researchers may prefer to routinely implement such a prophylactic statistical method even if it is unlikely to be necessary. Our results suggest that 100 randomly-selected null SNPs suffice for GC and AdmixMap, whereas additional markers were required by STRAT to achieve the same empirical power as other methods.
Among the five methods for analyzing population-based genetic association studies in the presence of population structure, no one is uniformly superior to the others, nor is any one method uniformly inferior. GC is computationally very fast, and it performs reasonably well in many settings, but it has low power when cases arise in only one subpopulation. A major drawback of GC is its inflexibility in being directly applicable only to single-point analyses. AdmixMap performs well except under this biased ascertainment scenario, but here it suffers from an inflated false-positive rate. STRAT can be relatively robust even under this biased sampling scenario, but appears to lose power in standard settings unless a large number of null markers is used. However, we assumed the correct number of subpopulations for both STR2 and AM2, which is unrealistic, and their actual performance in practice may be worse than our results suggest. Zhu et al. (2002 Possibly our most important finding is that simple statistical procedures based on logistic regression perform well in all scenarios considered. Our stepwise and Bayesian logistic regression methods (SLR and BLR) both protect against false positives in standard settings and mitigate their effects under extreme ascertainment bias, without significantly compromising power. These methods do not require an estimate of the number of underlying subpopulations; indeed they dispense entirely with the notion of subpopulation. It may seem surprising that these methods are so successful in countering the effects of cryptic substructure, despite the fact that population structure is not explicitly modeled. Indeed, this may explain why these methods have been little studied in the context of protection against population stratification. A possible explanation of their effectiveness is that when null markers are included in a regression analysis, each of them soaks up some of the effect of population stratification, but because this effect is shared across many markers, none of them is individually significant. In the case of SLR, this explanation does not apply to the final steps in the procedure when only few SNPs remain, but here it seems from our results that we can rely on the causal variant having a stronger signal than any of a small number of variants displaying spurious association: Broadly speaking, problems only arise when causal variants have to compete with many spurious variants.
There seems little to choose between SLR and BLR, although SLR was more robust to ascertainment bias. They are both computationally very fast, with SLR requiring only a few minutes per data set, and BLR only a few seconds. SLR and BLR enjoy several other advantages, including the ability to simultaneously incorporate signals from multiple SNPs in the vicinity of a causal locus, without the need to infer phase (Clayton et al. 2004 Overall we conclude that for well-designed studies, population structure might not be a serious cause for concern, even for large sample sizes. Structured association methods protect against false positives but are computationally intensive and lead to some loss of power. Genomic Control is fast and provides the best protection against false positives, but can be overly conservative in complex scenarios, such as differential penetrances or biased ascertainment. We believe that logistic regression using null markers as covariates provides a good solution, in terms of computational speed, flexibility, ease of implementation, statistical power, and robustness. Further work is required to refine the implementation details that provide the best solutions for specific association studies.
Where not otherwise indicated, all analyses were performed using R, a statistical software package freely available at http://www.r-project.org.
Simulation study
Armitage's trend test
and denote the mean genotype among cases and controls, respectively. Since Y2 is assumed to have a distribution, and we used = 2 x 10-4, the critical value was 13.83. See Sasieni (1997
Genomic control (GC)
Stepwise logistic regression (SLR)
Bayesian logistic regression (BLR)
STRUCTURE/STRAT
AdmixMap
E.S. thanks GlaxoSmithKline for funding. We thank David Conti of USC and Members of GSK Worldwide Epidemiology and Genetics Research groups for helpful discussions.
Article published online ahead of print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.4346306.
3 Corresponding author.
Bacanu, S.A., Devlin, B., and Roeder, K. 2000. The power of genomic control. Am. J. Hum. Genet. 66: 1933-1944.[CrossRef][Medline] Balding, D.J. 2003. Likelihood-based inference for genetic correlation coefficients. Theoret. Pop. Biol. 63: 221-230.[CrossRef][Medline] Campbell, C.D., Ogburn, E.L., Lunetta, K.L., Lyon, H.N., Freedman, M.L., Groop, L.C., Altshuler, D., Ardlie, K.G., and Hirschhorn, J.N. 2005. Demonstrating stratification in a European American population. Nat. Genet. 37: 868-872.[CrossRef][Medline] Cardon, L.R. and Palmer, L.J. 2003. Population stratification and spurious allelic association. Lancet 361: 598-604.[CrossRef][Medline] Cavalli-Sforza, L.L., Menozzi, P., and Piazza A. 1996. The history and geography of human genes. Princeton University Press, Princeton, NJ. Clayton, D., Chapman, J., and Cooper J. 2004. The use of unphased multilocus genotype data in indirect association studies. Genet. Epidem. 27: 415-428. Devlin, B. and Roeder, K. 1999. Genomic control for association studies. Biometrics 55: 997-1004.[CrossRef][Medline] Falush, D., Stephens, M., and Pritchard, J.K. 2003. Inference of population structure using multilocus genotype data: Linked loci and correlated allele frequencies. Genetics 164: 1567-1587. Freedman, M.L., Reich, D., Penney, K.L., McDonald, G.J, Mignault, A.A. Patterson, N., Gabriel, S.B., Topol, E.J., Smoller, J.W., Pato, C.N., et al. 2004. Assessing the impact of population stratification on genetic association studies. Nat. Genet. 36: 388-393.[CrossRef][Medline] Hao, K., Li, C., Rosenow, C., and Wong, W.H. 2004. Detect and adjust for population stratification in population-based association study using genomic control markers: An application of Affymetrix Genechip Human Mapping 10K array. Eur. J. Hum. Genet. 12: 1001-1006.[CrossRef][Medline] Helgason, A., Yngvadottir, B., Hrafnkelsson, B., Gulcher, J., and Stefansson, K. 2005. An Icelandic example of the impact of population structure on association studies. Nat. Genet. 37: 90-95.[CrossRef][Medline] Hoggart, C.J., Parra, E.J., Shriver, M.D., Bonilla, C., Kittles, R.A., Clayton, D.G., and McKeigue, P.M. 2003. Control of confounding of genetic associations in stratified populations. Am. J. Hum. Genet. 72: 1492-1504.[CrossRef][Medline] Hudson, R.R. 2002. Generating samples under a Wright-Fisher neutral model of genetic variation. Bioinformatics 18: 337-338. Köhler, K. and Bickeböller, H. 2005. Case-control association tests correcting for population stratification. Ann. Hum. Genet. 69: 1-18.[CrossRef][Medline] Marchini, J., Cardon, L.R., Phillips, M.S., and Donnelly, P. 2004. The effects of human population structure on large genetic association studies. Nat. Genet. 36: 512-517.[CrossRef][Medline] Pritchard, J.K. and Rosenberg, N.A. 1999. Use of unlinked genetic markers to detect population stratification in association studies. Am. J. Hum. Genet. 65: 220-228.[CrossRef][Medline] Pritchard, J.K., Stephens, M., and Donnelly, P. 2000a. Inference of population structure using multilocus genotype data, Genetics 155: 945-957. Pritchard, J.K., Stephens, M., Rosenberg, N.A., and Donnelly, P. 2000b. Association mapping in structured populations. Am. J. Hum. Genet. 67: 170-181.[CrossRef][Medline] Sasieni, P.D. 1997. From genotypes to genes: Doubling the sample size. Biometrics 53: 1253-1261.[CrossRef][Medline] Satten, G.A., Flanders, W.D., and Yang, Q. 2001. Accounting for unmeasured population structure in case-control studies of genetic association using a novel latent-class model. Am. J. Hum. Genet. 68: 466-477.[CrossRef][Medline] Thomson, G. 1995. Mapping disease genes: Family-based association. Am. J. Hum. Genet. 57: 487-498.[Medline] Tibshirani, R. 1996. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. [Ser. B] 58: 267-288. Van Steen, K., McQueen, M.B., Herbert, A., Raby, B., Lyon, H., DeMeo, D.L., Murphy, A., Su, J., Datta, S., Rosenow, C., et al. 2005. Genomic screening and replication using the same data set in family-based association testing. Nat. Genet. 37: 683-691.[CrossRef][Medline] Wang, Y., Localio, R., and Rebbeck, T.R. 2005. Bias correction with a single null marker for population stratification in candidate gene association studies. Hum. Hered. 59: 165-175.[Medline] Zhu, X., Zhang, S.L., Zhao, H., and Cooper, R.S. 2002. Association mapping, using a mixture model for complex traits. Genet. Epidemiol. 23: 181-196.[Medline]
Received June 27, 2005; accepted in revised format October 6, 2005. This article has been cited by other articles:
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||