Genome Res. 13:1952-1960, 2003
©2003 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/03 $5.00
Methods
Automated Detection of Informative Combined Effects in Genetic Association Studies of Complex Traits
1 INSERM U525, Faculté de Médecine, Hôpital Pitié-Salpêtrière, 75634 Paris, France
2 Genset-Serono Group, RN7, 91030 Evry, France
There is a growing body of evidence suggesting that the relationships between gene variability and common disease are more complex than initially thought and require the exploration of the whole polymorphism of candidate genes as well as several genes belonging to biological pathways. When the number of polymorphisms is relatively large and the structure of the relationships among them complex, the use of data mining tools to extract the relevant information is a necessity. Here, we propose an automated method for the detection of informative combined effects (DICE) among several polymorphisms (and nongenetic covariates) within the framework of association studies. The algorithm combines the advantages of the regressive approaches with those of data exploration tools. Importantly, DICE considers the problem of interaction between polymorphisms as an effect of interest and not as a nuisance effect. We illustrate the method with three applications on the relationship between (1) the P-selectin gene and myocardial infarction, (2) the cholesteryl ester transfer protein gene and plasma high-density-lipoprotein cholesterol concentration, and (3) genes of the renin-angiotensin-aldosterone system and myocardial infarction. The applications demonstrated that the method was able to recover results already found using other approaches, but in addition detected biologically sensible effects not previously described.
In contrast to Mendelian diseases, whose genetic deciphering has met with extraordinary success in the past decades, progress in the genetics of complex diseases has been much more tenuous, and strategies aimed at identifying genes underlying these diseases must be reconsidered (Botstein and Risch 2003).
This multidimensional approach requires the development of statistical
methods able to handle multiple variable loci, possibly in several genes, and
the detection, among all measured polymorphisms, of those which, alone or in
combination, may influence the phenotype. Indeed, there is increasing evidence
that even in the absence of significant marginal effects, polymorphisms may
exhibit epistatic effects on complex traits that are detectable only by a
multilocus approach (Templeton 2000).
Neural networks have been recently proposed for investigating the
relationship between complex phenotypes and multilocus genotypes
(Curtis et al. 2001).
Recursive partitioning methods, that is, classification and regression
trees (Breiman et al. 1984
A combinatorial partitioning method (CPM) was recently developed to
identify multilocus genotypic partitions that predict quantitative trait
variation (Nelson et al. 2001).
Inspired by the CPM for quantitative traits, a multifactor-dimensionality
reduction (MDR) method was proposed for exploring high-order interactions
between polymorphisms in the framework of case/control studies (Ritchie et al. 2001).
A stepwise regression procedure was proposed for evaluating the
contribution of several polymorphisms within a small genetic region in a
case/control framework (Cordell and Clayton 2002).
In this report, we propose a fully automated method for exploring the
effects of several polymorphisms (and other nongenetic covariates) in the
framework of association studies involving any kind of phenotype
(quantitative, binary, or censored). This method, called DICE
(Detection of Informative Combined
Effects), combines the advantages of the regressive approaches in
terms of modeling and interpretation of effects, with those of data
exploration tools. Importantly, the approach considers the problem of
interaction between polymorphisms as an effect of interest and not as a
nuisance effect. It is therefore well suited to the exploration of the
spectrum of polymorphisms within candidate genes and more generally, within
biological systems. The forward selection approach is based on the principle
of parsimony, the principle of marginality, and the information theory
paradigm. The algorithm compares at each step a wide variety of models and
chooses the one(s) that provide(s) the best approximation to the data, while
having the least number of parameters. To avoid difficulties related to the
null-hypothesis testing theory (Goodman
1993
The method was applied to several real data samples, and the results are
available at our Web site GeneCanvas
(http://genecanvas.idf.inserm.fr/).
The relationship between the phenotype and the covariates, which can be genotypes as well as nongenetic variables, is modeled using a logistic (binary outcome), linear (quantitative trait), or Cox (censored response) regression model. The algorithm explores by a forward procedure a set of competing models for which an IC is derived. Based on certain modalities developed below, this exploration leads to the selection of a best approximating model (or models). The model space is explored in a systematic way, and the best model(s) can include main effects and interactions of different orders.
Exploration Phase of the Algorithm
If the composite condition has been satisfied at step 1, the algorithm goes to step 2 and replaces model 0 with the model retained at step 1. The procedure continues iteratively until there is no further improvement of the IC value.
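The iterative step just described can be illustrated with a minimal sketch of a greedy forward search. It is not the DICE implementation: the function names, the toy scoring function, and the variable names (`snp_a`, `snp_b`, `country`) are hypothetical, and the stopping rule used here is simply "the IC no longer decreases", a simplification of the composite condition.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

Model = FrozenSet[str]  # a model is represented by the set of terms it contains


def forward_select(candidates: Iterable[str],
                   ic: Callable[[Model], float],
                   forced: Iterable[str] = ()) -> Tuple[Model, List[float]]:
    """Greedy forward search that stops when the IC no longer improves.

    `ic` is any function returning an information criterion for a model
    (lower is better). `forced` holds terms kept in model 0.
    """
    current: Model = frozenset(forced)      # model 0
    remaining = set(candidates) - current
    history = [ic(current)]                 # IC value retained at each step
    while remaining:
        # Score every one-term extension of the current model.
        scored = [(ic(current | {term}), term) for term in sorted(remaining)]
        best_ic, best_term = min(scored)
        if best_ic >= history[-1]:          # no improvement: stop
            break
        current = current | {best_term}     # retained model becomes the new model 0
        remaining.discard(best_term)
        history.append(best_ic)
    return current, history


# Toy criterion (assumption, for illustration only): each term costs 2 units,
# and only "snp_a" improves the fit enough to pay for itself.
TOY_GAIN = {"snp_a": 8, "snp_b": 1}


def toy_ic(model: Model) -> float:
    return 100 - sum(TOY_GAIN.get(t, 0) for t in model) + 2 * len(model)


best_model, ic_path = forward_select(["snp_a", "snp_b"], toy_ic, forced=["country"])
```

In this toy run the search adds `snp_a` at step 1 and stops at step 2, because adding `snp_b` would raise the criterion.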
Information Criterion (IC)
$\mathrm{AIC} = -2\log_e[\mathcal{L}(\hat{\theta} \mid \mathrm{data})] + 2K$
corresponds to an estimator of the expected relative Kullback-Leibler (K-L)
distance. The term
$\log_e[\mathcal{L}(\hat{\theta} \mid \mathrm{data})]$
yields the value of the maximized log-likelihood over the unknown parameters
($\theta$), given the data and the model,
leading to the estimated parameters ($\hat{\theta}$).
K is the number of parameters estimated in that approximating model,
and n is the total sample size. The first term in AIC is a
lack of fit component which decreases as more parameters are fitted in the
model; the second term increases as a penalty for adding extra parameters.
Thus, AIC forces a trade-off between bias and variance as the number
of parameters is increased.
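As an illustration of the trade-off described above, the following sketch computes AIC from a maximized log-likelihood and K. Because the text refers to the sample size n and later to minAICc, the sketch also includes the standard small-sample correction of Hurvich and Tsai (1989), AICc = AIC + 2K(K+1)/(n-K-1); that this exact form is the one used by the method is an assumption, and the function names are hypothetical.

```python
def aic(log_likelihood: float, k: int) -> float:
    """AIC = -2 * maximized log-likelihood + 2 * K."""
    return -2.0 * log_likelihood + 2.0 * k


def aicc(log_likelihood: float, k: int, n: int) -> float:
    """Small-sample corrected AIC (standard form, assumed here)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > K + 1")
    return aic(log_likelihood, k) + (2.0 * k * (k + 1)) / (n - k - 1)


# Example with made-up values: log-likelihood -100, K = 3, n = 104.
example_aic = aic(-100.0, 3)
example_aicc = aicc(-100.0, 3, 104)
```

The correction term vanishes as n grows relative to K, so the two criteria agree in large samples.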
Evaluation of the Composite Condition
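$\{g_i(\mathrm{data} \mid \theta),\ i = 1, \ldots, R\}$. If we denote by f the full reality, with an infinite number of
parameters, such differences estimate the relative expected K-L differences
between f and $g_i(\mathrm{data} \mid \theta)$:
$E_{\hat{\theta}}[\hat{I}(f, g_i)] - \min E_{\hat{\theta}}[\hat{I}(f, g_i)]$,
where $E_{\hat{\theta}}[\hat{I}(f, g_i)]$ is the expected
estimated K-L distance between f and
$g_i(\mathrm{data} \mid \theta)$, and min is over the
set of models explored.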
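The differences discussed here can be computed directly from the IC values of the R candidate models: each model's value minus the minimum over the set. The sketch below shows only that arithmetic; the model labels and IC values are invented for illustration, and the thresholds of the composite condition are deliberately not encoded.

```python
from typing import Dict


def ic_differences(ic_values: Dict[str, float]) -> Dict[str, float]:
    """Return, for each model, its IC value minus the minimum IC in the set."""
    best = min(ic_values.values())
    return {name: value - best for name, value in ic_values.items()}


# Invented IC values for three hypothetical candidate models.
deltas = ic_differences({"m1": 210.0, "m2": 206.5, "m3": 213.0})
```

The model with a difference of zero is the best approximating model in the set; larger differences indicate less support relative to it.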
Following a simple heuristic rule derived by extensive Monte Carlo
simulation (Burnham and Anderson
2002
It may happen that several models fulfill the composite condition, that is
Conditional Exclusion Phase
Coding of Genotypes
Algorithm Implementation
SELP Gene Polymorphisms and Myocardial Infarction
P-selectin is a cellular adhesion molecule which plays a major role in the recruitment of inflammatory cells from the circulation and their transendothelial migration, the critical initial step of atherosclerosis (Price and Loscalzo 1999).
Five polymorphisms were identified in the 5' region (C-2123G, A-1969G, T-1817C, C-1576G, and -485I/D) and eight in the coding region (P98P, S290N, C557C, N562D, N563N, V599L, T715P, and T741T; see our Web site for the description of polymorphisms, their allele frequencies, and the pairwise linkage disequilibrium [LD] coefficients). Due to their low allele frequency (<1%), the C-1576G and P98P polymorphisms were not included in the analysis. In addition, because the C557C, N563N, and V599L polymorphisms were completely concordant, only V599L, which had the least missing data, was selected, leaving nine polymorphisms for the analysis. Two different coding schemes for the genotypes were used. The first one, referred to as `dominant', opposed frequent homozygotes to others (genotype coded as a dichotomous variable 0, 1), whereas the second, referred to as `codominant', assumed for each marker an additive allele effect on a logistic scale (genotype coded as an ordinal variable 0, 1, 2). Because the country of origin (Northern Ireland/France) was a stratification variable of the study, this variable was forced in model 0. Table 1 presents the detailed results obtained with the codominant coding scheme. For each step, the four best models are reported. Figure 2 summarizes the results obtained with this coding scheme.
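The two coding schemes described above can be sketched as follows. The function name and the representation of a genotype as a pair of allele labels are hypothetical, and counting copies of the less frequent allele for the codominant scheme is an assumption consistent with, but not stated in, the text.

```python
from typing import Tuple


def code_genotype(genotype: Tuple[str, str], frequent_allele: str, scheme: str) -> int:
    """Code a biallelic genotype under the 'dominant' or 'codominant' scheme.

    dominant:   0 for frequent homozygotes, 1 for all other genotypes.
    codominant: 0, 1, or 2 copies of the less frequent allele (assumed direction).
    """
    other_count = sum(1 for allele in genotype if allele != frequent_allele)
    if scheme == "codominant":
        return other_count
    if scheme == "dominant":
        return 1 if other_count > 0 else 0
    raise ValueError(f"unknown coding scheme: {scheme!r}")


# Example with hypothetical allele labels "T" (frequent) and "P".
codes = [code_genotype(g, "T", "codominant") for g in [("T", "T"), ("T", "P"), ("P", "P")]]
```

Under the dominant scheme the heterozygote and the rare homozygote receive the same code, so the two schemes differ only in how they treat the rare homozygote.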
At steps 1 and 2, a unique best model was identified, which is indicated in
bold. At step 3, no model met the composite condition (
Detailed results of the exploration with the dominant coding scheme are
available at our Web site. Briefly, the dominant coding scheme led to the same
models as the codominant scheme for the first two steps. At step 3, a final
unique best model was identified including the interaction
S290N*N562D. Actually, the model including the
three-locus combination
S290N*N562D*V599L had the
minAICc at step 4, but was not retained due to a
CETP Gene Polymorphisms and HDL-Cholesterol Levels
Ten polymorphisms had been previously identified (see our Web site for detailed information). Because three pairs of polymorphisms were almost completely concordant (G+279/in1A and C+8/in7T, A373P and R451Q, I405V and G+524T), we excluded the marker of each pair having the most missing data, that is, G+279/in1A, A373P, and I405V, respectively. Variables considered for exploration were therefore the seven remaining polymorphisms and alcohol consumption. All models were systematically adjusted for age and center of recruitment. Table 2 and Figure 3 show the results of the exploration using the dominant coding scheme. Results obtained with the codominant coding scheme are available at our Web site.
At step 1, based on the principle of parsimony, DICE selected the model including alcohol consumption as the main effect. At step 2, two tied models were selected, having the same number of parameters and both satisfying the composite condition: (1) an interaction between alcohol and the C-629A polymorphism, and (2) an interaction between alcohol and the C+8/in7T polymorphism. Note that these two polymorphisms are in strong LD (D' = +0.95) and have similar allele frequencies. The algorithm stopped at the following step for both paths evolving in parallel. With the codominant coding scheme, after inclusion of alcohol consumption at step 1, the interaction between alcohol and the C+8/in7T marker was selected.
Renin-Angiotensin-Aldosterone System Gene Polymorphisms and Myocardial Infarction
Nine polymorphisms were considered: the I/D polymorphism in the ACE gene, the M235T and T174M polymorphisms in the AGT gene, the T-810A, C-521T, T+55/ex4C, L191L, and A+39C polymorphisms in the AGTR1 gene (after exclusion of redundant polymorphisms), and the T-344C polymorphism in the CYP11B2 gene (see our Web site for details). All models were adjusted for country of origin. Table 3 shows the results using the codominant coding scheme. Results obtained with the dominant coding scheme are available at our Web site.
At step 1, considering the individual addition of each polymorphism and
their possible interaction with country, the composite condition was not
verified for any of the candidate models. DICE then considered all two-locus
combinations (Step 1bis) and selected the model including the
interaction between ACE I/D and AGTR1/A+39C
previously described (Tiret et al. 1994).
Preliminary Study of Stability
To evaluate the stability of the effects identified by the proposed
algorithm, we performed a preliminary stability study by the bootstrap method
(Efron and Gong 1983).
Table 4 presents the results of the stability study with the codominant coding scheme, with effects being ranked by frequency of inclusion over the 100 replicates. The main effect of the T715P polymorphism was detected in 61% of replicates and was the first effect selected in 54% of them. Frequency of inclusion of other main effects varied from 1% to 18%, far behind T715P. Concerning the first-order interactions, the two highest frequencies of inclusion (46% and 44%, respectively) corresponded to the interactions detected in the original data set. The interaction between country and T-1817C had the third highest frequency (19%). The most frequently selected effect among second-order interactions (52%) was the one identified in the original data set. Analogous results were obtained with the dominant coding scheme and are available at our Web site.
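The frequency-of-inclusion summary used in this stability study can be sketched as follows: draw bootstrap replicates of the data set with replacement, rerun the selection procedure on each, and count how often each effect is retained. This is a generic illustration, not the authors' code; `select` stands for any selection procedure returning the names of the retained effects, and the toy selector and data below are invented.

```python
import random
from collections import Counter
from typing import Callable, Dict, Iterable, List, Sequence, TypeVar

Row = TypeVar("Row")


def inclusion_frequencies(rows: Sequence[Row],
                          select: Callable[[List[Row]], Iterable[str]],
                          n_replicates: int = 100,
                          seed: int = 0) -> Dict[str, float]:
    """Fraction of bootstrap replicates in which each effect is selected."""
    rng = random.Random(seed)               # fixed seed for reproducibility
    counts: Counter = Counter()
    for _ in range(n_replicates):
        # Resample the same number of rows, with replacement.
        replicate = [rng.choice(rows) for _ in rows]
        counts.update(set(select(replicate)))  # count each effect once per replicate
    return {effect: c / n_replicates for effect, c in counts.items()}


# Toy selector (assumption): always retains a hypothetical effect "effect_a".
def always_a(replicate: List[int]) -> List[str]:
    return ["effect_a"]


freqs = inclusion_frequencies(list(range(10)), always_a, n_replicates=20)
```

An effect that is selected in most replicates is less likely to reflect a peculiarity of the original sample than one selected only occasionally.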
The concomitant availability of an increasing amount of genetic data, large study samples, and computer power offers a new opportunity to assess multilocus associations in a more systematic fashion than ever before and to build models that may reveal hidden association structure. Different methods, reviewed above in the introductory text, have been proposed for the identification of multilocus combinations associated with disease risk or quantitative traits in association studies. Each method has advantages and drawbacks. However, as stressed in a recent editorial (Spence et al. 2003
The method proposed here combines the advantages of exploration tools with
those of the regressive approach, such as easily interpretable modeling and
the possibility of incorporating adjustment covariates, while trying to
overcome some methodological difficulties of parametric methods related to
hypothesis testing. Among other problems of the classical parametric selection
procedures are those of multiple testing correction and the asymptotic
distribution, under the null hypothesis, of the tests performed for each
variable (Derksen and Keselman 1992).
The algorithm is fully automated, making the tool easy to use without any a priori hypothesis. It could be used in different situations, such as the exploration of several polymorphisms within a gene, as we did in the SELP and the CETP applications, or the investigation of several genes belonging to a common biological system, as we did with the RAA system. We note that the main purpose of the method is not to provide estimates of parameters and of their variances, nor to make inferences about the sampled population, but to identify a subset of variables and effects that would deserve further detailed analysis using other complementary methods, such as haplotype analysis or multivariate analysis, or would require further investigation in replication studies. This is a data mining method useful as an exploratory tool for data reduction and variable detection.
Several aspects of the method deserve discussion. We based the model
selection procedure on information theory and not on the classical hypothesis
testing theory for several reasons. First, in a context of data mining and
hypothesis generation, the use of the null-hypothesis testing theory seemed
conceptually counterintuitive (there is no real null-hypothesis to test),
generating practical difficulties related to multiple testing as mentioned
above. Second, when many models are considered, it may happen that several of
them fit the data almost equally well. By selecting a single model, the
null-hypothesis testing theory ignores model uncertainty and potential
ambivalence of the data. Furthermore, one particularity of genetic data is the
correlation between genetic polymorphisms, through LD, that can lead to
collinearity. Multicollinearity does not, in general, affect the overall fit
of the model (i.e., the likelihood), nor does it tend to bias the estimates,
but regression coefficients will tend to have inflated sampling variances,
leading to incorrect statistical tests
(Neter et al. 1996).
Another technical aspect of the algorithm concerns the thresholds adopted
for the
Another important aspect of the algorithm is the principle of parsimony on
which the model selection procedure is based. This principle, widely used in
statistics, states that between two models equivalent in terms of IC, the one
with the fewest parameters is to be preferred
(Forster 2001).
Finally, DICE, as other combinatorial methods
(Nelson et al. 2001
The three applications described here showed that the algorithm was able to
recover the polymorphisms that were previously identified by haplotype
analysis (Tregouet et al. 2002).
Another important issue requiring further research is the handling of
missing data, because this becomes a critical problem as the number of
investigated polymorphisms increases. Variants of AIC have been
proposed for model selection in the presence of incomplete data
(Cavanaugh and Shumway 1998).
We thank all investigators of the ECTIM study for allowing the data to be used for the present study. N.T-D. gratefully acknowledges the support of the Association Nationale de la Recherche Technique (ANRT). The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
[Additional applications on different candidate genes for myocardial infarction are available at our Web site GeneCanvas: http://genecanvas.idf.inserm.fr/.] Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1254203.
3 Corresponding author. E-MAIL laurence.tiret{at}chups.jussieu.fr; FAX 33-1-40-77-9728.
Akaike, H. 1974. A new look at the statistical model identification. IEEE Transactions on Automatic Control 19: 716-723.
Altman, D.G. and Andersen, P.K. 1989. Bootstrap investigation of the stability of a Cox regression model. Stat. Med. 8: 771-783.
Botstein, D. and Risch, N. 2003. Discovering genotypes underlying human phenotypes: Past successes for mendelian disease, future approaches for complex disease. Nat. Genet. 33 Suppl: 228-237.
Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C. 1984. Classification and regression trees. Wadsworth and Brooks, Pacific Grove, CA.
Burnham, K.P. and Anderson, D.R. 2002. Model selection and inference: A practical information-theoretical approach. Springer-Verlag, New York.
Cambien, F., Poirier, O., Lecerf, L., Evans, A., Cambou, J.P., Arveiler, D., Luc, G., Bard, J.M., Bara, L., Ricard, S., et al. 1992. Deletion polymorphism in the gene for angiotensin-converting enzyme is a potent risk factor for myocardial infarction. Nature 359: 641-644.
Cavanaugh, J.E. and Shumway, R.H. 1998. An Akaike information criterion for model selection in the presence of incomplete data. Journal of Statistical Planning and Inference 67: 45-65.
Corbex, M., Poirier, O., Fumeron, F., Betoulle, D., Evans, A., Ruidavets, J.B., Arveiler, D., Luc, G., Tiret, L., and Cambien, F. 2000. Extensive association analysis between the CETP gene and coronary heart disease phenotypes reveals several putative functional polymorphisms and gene-environment interaction. Genet. Epidemiol. 19: 64-80.
Cordell, H.J. and Clayton, D.G. 2002. A unified stepwise regression procedure for evaluating the relative effects of polymorphisms within a gene using case/control or family data: Application to HLA in type 1 diabetes. Am. J. Hum. Genet. 70: 124-141.
Curtis, D., North, B.V., and Sham, P.C. 2001. Use of an artificial neural network to detect association between a disease and multiple marker genotypes. Ann. Hum. Genet. 65: 95-107.
Czika, W.A., Weir, B.S., Edwards, S.R., Thompson, R.W., Nielsen, D.M., Brocklebank, J.C., Zinkus, C., Martin, E.R., and Hobler, K.E. 2001. Applying data mining techniques to the mapping of complex disease genes. Genet. Epidemiol. (Suppl. 1) 21: S435-S440.
Dannegger, F. 2000. Tree stability diagnostics and some remedies for instability. Stat. Med. 19: 475-491.
Derksen, S. and Keselman, H.J. 1992. Backward, forward and stepwise automated subset selection algorithms: Frequency of obtaining authentic and noise variables. Br. J. Math. Stat. Psychol. 45: 265-282.
Draper, D. 1995. Assessment and propagation of model uncertainty (with discussion). J. R. Stat. Soc. Ser. B 56: 45-98.
Efron, B. and Gong, G. 1983. A leisurely look at the bootstrap, the jackknife, and cross-validation. Am. Stat. 37: 36-48.
Forster, M.R. 2001. The new science of simplicity. In Simplicity, inference and modelling (eds. H. Keuzenkamp, M. McAleer, and A. Zellner), pp. 83-119. Cambridge University Press, Cambridge, UK.
Fox, J. 1997. Applied regression analysis, linear models, and related methods, chapter 7. Sage Publications, Newbury Park, CA.
Fumeron, F., Betoulle, D., Luc, G., Behague, I., Ricard, S., Poirier, O., Jemaa, R., Evans, A., Arveiler, D., Marques-Vidal, P., et al. 1995. Alcohol intake modulates the effect of a polymorphism of the cholesteryl ester transfer protein gene on plasma high density lipoprotein and the risk of myocardial infarction. J. Clin. Invest. 96: 1664-1671.
Goodman, S.N. 1993. P-values, hypothesis tests, and likelihood: Implications for epidemiology of a neglected historical debate. Am. J. Epidemiol. 137: 485-496.
Herrmann, S.M., Ricard, S., Nicaud, V., Mallet, C., Evans, A., Ruidavets, J.B., Arveiler, D., Luc, G., and Cambien, F. 1998. The P-selectin gene is highly polymorphic: Reduced frequency of the Pro715 allele carriers in patients with myocardial infarction. Hum. Mol. Genet. 7: 1277-1284.
Hurvich, C.M. and Tsai, C.L. 1989. Regression and time series model selection in small samples. Biometrika 76: 297-307.
Johnson, D.H. 1999. The insignificance of statistical significance testing. J. Wildl. Manage. 63: 763-772.
Nelson, M.R., Kardia, S.L., Ferrell, R.E., and Sing, C.F. 2001. A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res. 11: 458-470.
Neter, J., Kutner, M.H., Nachtsheim, C.J., and Wasserman, W. 1996. Applied linear statistical models, chapter 7. Irwin, Chicago.
Patterson, S.D. and Aebersold, R.H. 2003. Proteomics: The first decade and beyond. Nat. Genet. (Suppl.) 33: 311-323.
Poirier, O., Georges, J.L., Ricard, S., Arveiler, D., Ruidavets, J.B., Luc, G., Evans, A., Cambien, F., and Tiret, L. 1998. New polymorphisms of the angiotensin II type 1 receptor gene and their associations with myocardial infarction and blood pressure: The ECTIM study. J. Hypertens. 16: 1443-1447.
Pojoga, L., Gautier, S., Blanc, H., Guyene, T.T., Poirier, O., Cambien, F., and Benetos, A. 1998. Genetic determination of plasma aldosterone levels in essential hypertension. Am. J. Hypertens. 11: 856-860.
Price, D.T. and Loscalzo, J. 1999. Cellular adhesion molecules and atherogenesis. Am. J. Med. 107: 85-97.
Province, M.A., Shannon, W.D., and Rao, D.C. 2001. Classification methods for confronting heterogeneity. Adv. Genet. 42: 273-286.
Pudil, P., Novovicova, J., and Kittler, J. 1994. Floating search methods in feature selection. Pattern Recognition Lett. 15: 1119-1125.
Ritchie, M.D., Hahn, L.W., Roodi, N., Bailey, L.R., Dupont, W.D., Parl, F.F., and Moore, J.H. 2001. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am. J. Hum. Genet. 69: 138-147.
Ritchie, M.D., Hahn, L.W., and Moore, J.H. 2003. Power of multifactor dimensionality reduction for detecting gene-gene interactions in the presence of genotyping error, missing data, phenocopy, and genetic heterogeneity. Genet. Epidemiol. 24: 150-157.
Royall, R. 1997. Statistical evidence: A likelihood paradigm. Chapman and Hall, London, UK.
Sauerbrei, W. and Schumacher, M. 1992. A bootstrap resampling procedure for model building: Application to the Cox regression model. Stat. Med. 11: 2093-2109.
Sherriff, A. and Ott, J. 2001. Applications of neural networks for gene finding. Adv. Genet. 42: 287-297.
Spence, M.A., Greenberg, D.A., Hodge, S.E., and Vieland, V.J. 2003. The Emperor's new methods. Am. J. Hum. Genet. 72: 1084-1087.
Stengard, J.H., Clark, A.G., Weiss, K.M., Kardia, S., Nickerson, D.A., Salomaa, V., Ehnholm, C., Boerwinkle, E., and Sing, C.F. 2002. Contributions of 18 additional DNA sequence variations in the gene encoding apolipoprotein E to explaining variation in quantitative measures of lipid metabolism. Am. J. Hum. Genet. 71: 501-517.
Stoll, M., Cowley Jr., A.W., Tonellato, P.J., Greene, A.S., Kaldunski, M.L., Roman, R.J., Dumas, P., Schork, N.J., Wang, Z., and Jacob, H.J. 2001. A genomic-systems biology map for cardiovascular function. Science 294: 1723-1726.
Tall, A.R. 1993. Plasma cholesteryl ester transfer protein. J. Lipid Res. 34: 1255-1274.
Templeton, A.R. 2000. Epistasis and complex traits. In Epistasis and evolutionary process (eds. M. Wade, B. Brodie III, J. Wolf), pp. 41-57. Oxford University Press, Oxford, UK.
Tiret, L., Bonnardeaux, A., Poirier, O., Ricard, S., Marques-Vidal, P., Evans, A., Arveiler, D., Luc, G., Kee, F., Ducimetiere, P., et al. 1994. Synergistic effects of angiotensin-converting enzyme and angiotensin-II type 1 receptor gene polymorphisms on risk of myocardial infarction. Lancet 344: 910-913.
Tiret, L., Ricard, S., Poirier, O., Arveiler, D., Cambou, J.P., Luc, G., Evans, A., Nicaud, V., and Cambien, F. 1995. Genetic variation at the angiotensinogen locus in relation to high blood pressure and myocardial infarction: The ECTIM Study. J. Hypertens. 13: 311-317.
Tregouet, D.A., Barbaux, S., Escolano, S., Tahri, N., Goldmard, J.L., Tiret, L., and Cambien, F. 2002. Specific haplotypes of the P-selectin gene are associated with myocardial infarction. Hum. Mol. Genet. 11: 2015-2023.
Zhang, H. and Bonney, G. 2000. Use of classification trees for association studies. Genet. Epidemiol. 19: 323-332.
http://genecanvas.idf.inserm.fr/; GeneCanvas.
Received February 7, 2003; accepted in revised form June 4, 2003.