Published online before print
February 8, 2006, 10.1101/gr.4138406 Genome Res. 16:323-330, 2006 ©2006 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/06 $5.00
Letter
The portability of tagSNPs across populations: A worldwide survey
1 Unitat de Biologia Evolutiva, Departament de Ciències Experimentals i de la Salut, Universitat Pompeu Fabra, 08003 Barcelona, Catalonia, Spain
2 Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, OX3 7BN, United Kingdom
3 Fondation Jean-Dausset, Centre d'Étude du Polymorphisme Humain (CEPH), 75010 Paris, France
4 The Sanger Institute, Wellcome Trust Genome Campus, Hinxton, CB10 1HH, United Kingdom
In the search for common genetic variants that contribute to prevalent human diseases, patterns of linkage disequilibrium (LD) among linked markers should be considered when selecting SNPs. Genotyping efficiency can be increased by choosing tagging SNPs (tagSNPs) in LD with other SNPs. However, it remains to be seen whether tagSNPs defined in one population efficiently capture LD in other populations; that is, how portable tagSNPs are. Indeed, tagSNP portability is a challenge for the applicability of HapMap results. We analyzed 144 SNPs in a 1-Mb region of chromosome 22 in 1055 individuals from 38 worldwide populations, classified into seven continental groups. We measured tagSNP portability by choosing three reference populations (to approximate the three HapMap populations), defining tagSNPs, and applying them to other populations independently of the availability of information on the tagSNPs in the compared population. We found that tagSNPs are highly informative in other populations within each continental group. Moreover, tagSNPs defined in Europeans are often efficient for Middle Eastern and Central/South Asian populations. TagSNPs defined in the three reference populations are also efficient for more distant and differentiated populations (Oceania, Americas), in which the impact of their special demographic history on the genetic structure does not interfere with successfully detecting the most common haplotype variation. This high degree of portability lends promise to the search for disease association in different populations, once tagSNPs are defined in a few reference populations like those analyzed in the HapMap initiative.
It is estimated that the human genome contains >5 million common SNPs with a minor allele frequency of 10% (Kruglyak and Nickerson 2001
Efforts are being made to reduce the number of SNPs that may be required for such studies to
Independent efforts are being undertaken to define both common haplotypes and tagSNPs for gene regions, encompassing both the coding and regulatory regions that will be of special interest in a candidate gene approach (Crawford et al. 2004 Two main questions require answers. First, how well do tagSNPs defined in one population perform in another population from the same or from a different continent? Second, should haplotype maps of the human genome be developed urgently in other populations for tagSNP selection besides the three main groups included in HapMap? Indeed, portability of tagSNPs among populations and continental groups is fundamental for the future application of HapMap-defined tagSNPs into other populations. To address these questions we have analyzed the LD structure and tagSNP portability across a worldwide set of samples in a region of chromosome 22.
In a well-characterized geneless region of chromosome 22, SNPs were selected based on physical distance criteria at a mean distance of 7 kb across 987,872 bp, dense enough to provide a consistent view of general LD patterns in the region (Ke et al. 2004
To assess whether tagSNPs are consistent among populations within regions, we calculated, for each SNP and population, the probability of being selected as a tagSNP (see Methods). As an example, results for the six East Asian populations are shown together in Figure 1 (top), along with the LD structure (bottom). As expected, the probabilities are high in regions with low LD and small in those exhibiting high LD. More interesting are the similar probability values found for the various populations within continents, with highly significant coefficients of multiple correlation (seven populations in Europe, R = 0.606; six in East Asia, R = 0.469; six in Africa, R = 0.576; P-values < 10^-4), indicating a common pattern of LD. Nonetheless, these correlations within continents do not directly translate into high tagSNP portability across populations; they simply show a common LD pattern among populations within each of the three continental groups. A direct insight into the portability of tagSNPs defined in specific populations into others of the same geographic region can be reached by focusing on those populations in our study that may be considered as references for the three continents (and closest to the ones used in HapMap). TagSNPs are defined in these populations, those SNPs are applied to other populations within their regional groups as if they were their own tagSNPs, and their validity as tagSNPs is measured. The populations used as reference include: Yoruba (YOR) as African, French (FRA) as European, and Han Chinese (HAN) as Asian; Japanese was also used as a representative of Asia, with results very similar to those for Han Chinese (results not shown).
For each non-tagSNP in a population being tested, r2 was calculated with every tagSNP selected from a reference population and the maximum value recorded. This maximum r2 value is a measure of the utility of the SNPs that were defined as tags in another population. In the analysis, we considered all the SNPs independently of whether or not they were polymorphic in the compared population. Two approaches have been used: "blind" (in which nothing will be done if a tagSNP has no information in the compared population for being monomorphic or not successfully genotyped) and "ideal" (if a tagSNP has no information in the compared population, a replacement tag is selected in the reference population to ensure all tagSNPs contain information in the compared population); see Methods for more detail. Mean values for those maximum r2 values (both for "blind" and "ideal" analysis) are presented in Figure 2 for the three geographic regions containing one of the three reference populations: Africa, with Yorubas as reference; Europe, with French as reference; and East Asia, with Han Chinese as reference. As expected, the highest r2 value for each non-tagSNP was found with the closest or with a very close SNP to the tagSNPs (defined in the reference population), even for singleton bin tagSNPs. <2% of the non-tagSNPs showing the maximum LD were at a distance further than three SNPs from the tagSNPs defined in the reference population. If those distant SNPs were removed, r2 dropped on average a mere 1.52% in various combinations of populations; thus, the signal of the LD measure comes overwhelmingly from the vicinity of the considered SNP. 
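The maximum-r2 measure described above can be illustrated with a minimal Python sketch. It assumes phased chromosomes with alleles coded 0/1, which may differ from how LD was actually estimated in the study; the function names `r_squared` and `max_r2_per_nontag` are hypothetical. A SNP that is monomorphic in the test population is given r2 = 0, which mimics the "blind" test, where an uninformative tag simply contributes nothing.

```python
from typing import Dict, List, Sequence


def r_squared(a: Sequence[int], b: Sequence[int]) -> float:
    """r^2 between two biallelic SNPs coded 0/1 on the same phased chromosomes.

    r^2 = D^2 / (pA (1 - pA) pB (1 - pB)), with D = pAB - pA * pB.
    Assumption: if either SNP is monomorphic, r^2 is undefined and 0.0 is returned.
    """
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("allele vectors must be non-empty and equally long")
    p_a = sum(a) / n
    p_b = sum(b) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0
    p_ab = sum(x * y for x, y in zip(a, b)) / n
    d = p_ab - p_a * p_b
    return d * d / denom


def max_r2_per_nontag(haps: List[List[int]], tags: List[int]) -> Dict[int, float]:
    """For every non-tag SNP, record the highest r^2 it reaches with any tag.

    haps[i] holds the alleles of SNP i in the test population;
    tags holds the indices of tagSNPs that were chosen in a reference population.
    """
    tag_set = set(tags)
    return {
        i: max((r_squared(haps[i], haps[t]) for t in tags), default=0.0)
        for i in range(len(haps))
        if i not in tag_set
    }


# Toy data: four SNPs typed on eight chromosomes, SNP 0 used as the only tag.
snp0 = [0, 0, 1, 1, 0, 1, 0, 1]
snp1 = [0, 0, 1, 1, 0, 1, 0, 1]   # identical to the tag
snp2 = [0, 1, 0, 1, 0, 1, 0, 1]   # partially correlated with the tag
snp3 = [0, 0, 0, 0, 0, 0, 0, 0]   # monomorphic in this population
result = max_r2_per_nontag([snp0, snp1, snp2, snp3], tags=[0])
```

With these toy vectors, SNP 1 is perfectly tagged, SNP 2 only weakly, and the monomorphic SNP 3 contributes a zero to the average.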
Results obtained using all three r2 values (0.8, 0.64, and 0.5) for both tests are provided in Supplemental Table 1, along with parameters including the number of SNPs, number of tagSNPs, tag efficiency, and proportion of values higher than the three threshold values used (0.5, 0.64, and 0.8) to give a more detailed distribution of maximum r2 values. Robustness of portability of tagSNPs was verified by comparing with the results of random SNP sets, which in all cases showed a strong decrease in average r2 values. The increase of average r2 achieved by using tagSNPs rather than random SNPs is in the order of 30%, with variation depending on the populations being used (results for the three reference populations and four compared populations are given in Supplemental Table 2). Results for the SNPs that are in both HapMap and the present study are very similar for the CEPH sample of HapMap and the French population used here (results not shown), as expected given the strong similarity among European populations in LD patterns.
When a 0.64 r2 threshold is used for selecting tagSNPs, the mean values of the maximum r2 of non-tagSNPs in other populations within each continent are very high, >0.60 in all cases and some of them >0.8, but with differences among continents and among populations in some cases (Fig. 2). The average maximum r2 values are highest in Europe (Fig. 2B); that is, on average, tagSNPs selected in the French population will tag SNPs in other European populations with very high r2 values. Thus, a tagSNP selected in one European population behaves as a good tagSNP in another European population, as previously seen in four gene regions in several European populations (Mueller et al. 2005
The dispersion of mean r2 values obtained when using tagSNPs of a reference population into the compared one can be measured through the 95th percentile, shown as central bars in Figure 2, defined by the value that leaves only 5% of the r2 values below it (other parameters of the distribution are given in Supplemental Table 1). For Europe, 95th percentiles are mostly 0.3, meaning that less than about one in 20 non-tagSNPs will give results worse than r2 = 0.3 by tagSNPs defined in another population. For Asian populations, 95th percentiles are wider and reach smaller values, some <0.2; Africans have heterogeneous intervals, according to the variable r2 values. Thus, although the portability of tagSNPs defined in the reference samples is reasonably high on average, the variability is such that some tagSNPs may not be informative in other populations from the same region. It is also relevant to global association studies to query on the portability of tagSNPs to populations from continents not covered by the three initial reference populations. Beyond the human populations that are represented by the three continental groups discussed here, an interesting question is to what extent human groups from different continents than the populations of reference could be productively analyzed using the initial three populations.
Although the existence of a unique underlying LD map in the human genome has been qualitatively suggested when comparing data from three or four populations (Ke et al. 2004 For all populations of these four regional groups, the same approach has been followed, using the tagSNPs defined in all three reference populations and applying them to each population following the same methods ("blind" and "ideal"). Results using a 0.64 r2 threshold are shown in Figure 3, but similar results were obtained applying 0.5 and 0.8 values (see Supplemental Table 1). Surprisingly, the mean r2 value is moderate to high for most populations, and it is rarely <0.6, even between distant groups. TagSNPs defined in the Yoruba are as portable as those defined in the French or the Han Chinese, although, since overall LD is lower in general in Yoruba, more tagSNPs are needed to represent a specific region (24-49 depending on the region, as compared with 21-38 in French or 17-29 in Han Chinese; Supplemental Table 1). Therefore, for both Middle Eastern/North African and Central/South Asian populations, the utility of tagSNPs defined in Europeans is promising and much better than those defined in Han Chinese. Oceania and the Americas show similar average trends, with most values >0.8, and with the three reference populations providing portable tagSNPs, but the Asian reference has the highest efficiency (fewer markers to achieve a similar power). Populations from Oceania and the Americas have accrued genetic differentiation by drifting from their parental sources; therefore, it may be somewhat surprising that SNPs defined as tagSNPs elsewhere in the world do indeed capture LD patterns in America and Oceania, as well. It should be noted that the SNPs used in the analyses were never ascertained in the Americas or Oceania, and they had non-extreme frequencies; they can thus tag the common haplotypes, which are the same ones found in other places of the world, especially Asia.
We have conducted two different analyses, and it is worth comparing them. The "blind" test is much easier to carry out than the "ideal" test, which is intended to optimize the tagging of the compared population through the information of the reference one. In the majority of within- and across-continent portability analyses, no difference in terms of tagSNP performance is observed between the "ideal" and "blind" tests; even when there is a difference, it is generally very small. This means that even in situations where a tagSNP was found to be monomorphic or failed in genotyping in the compared population, the set of tagSNPs as a whole can still maintain good power (Figs. 2, 3; Supplemental Table 1).
We have studied the portability of tagSNPs across worldwide populations and have found that tagSNPs are often highly portable across human populations, with the partial exception of some populations, mainly African. The tagSNPs defined in the current reference populations used in the HapMap project may be useful not only for other populations of the same geographic regions, but also for populations in the rest of the world. The present results go beyond the expected portability shown in Europe (Mueller et al. 2005 The best portability of tagSNPs is obtained using SNPs that are known to be polymorphic in both the reference populations and all the populations being compared in the same or different continental groups (data not shown). Nonetheless, this is not a real case, and values are artificially inflated. In a realistic situation, SNPs polymorphic in a reference population are not necessarily also polymorphic in a test population, and this is the scenario upon which the present study is based ("blind" and "ideal" tests). In a "blind" test, tagSNPs selected in a reference population are applied to a compared population without regard to whether any of the tags is monomorphic or fails the genotyping, whereas in an "ideal" test, efforts would be made to replace such monomorphic or failed tagSNPs. The results of the two tests are very similar and demonstrate a generally high portability of tags across populations. What is more, compared with the "blind" tests, there is hardly any increase in portability in the "ideal" tests. This further indicates that in a real-world situation, tagSNPs are generally very effective and portable across populations. The observation that tagSNPs are very effective for distant and differentiated populations is an important one and suggests that new haplotype maps in other populations than those included in the current HapMap initiative are not urgently needed. 
We note, however, that the present data cover a small fraction of the genome at a density that is slightly less than that of the HapMap, and some sample sizes are small. Studies in other genomic regions, mainly in specific gene regions, and with higher marker density and also in other specific populations with large sample size would therefore be required, but the results here suggest promise for those panels in providing robust coverage in the genetic search for complex traits. In the present work, there are three populations with a sample size <30 chromosomes; this problem is acute in the San with only 14 chromosomes, but affects also Cambodians and Colombians. It is known that r2 is inflated when estimated from a very small number of chromosomes, and, as a result, the portability of tagSNPs will possibly be overestimated. Results about these populations in the present study, therefore, should be interpreted very carefully. It is interesting to note, however, that their behavior is very similar to other populations of the same geographic area with larger sample size.
Beyond the case of Eurasia and Africa, some other population groups deserve particular discussion. In the populations where drift (mainly through founder effect) has been an important factor in producing genetic differences among humans, portability does not seem to diminish. The main source of variation in those populations is the frequency of common haplotypes rather than their haplotype composition, and thus most of those common haplotypes will be captured by the same tagSNPs as in their source population (see references in Bertranpetit et al. 2003
The present results corresponding to a geneless region of chromosome 22 are relevant and are likely to be applied for the genome in general, and for gene regions in particular. It is known that LD patterns are unpredictable in a given region, and the most detailed studies in specific chromosomes (Patil et al. 2001
Data set
SNPs were selected at 5-kb spacing across a 987,872-bp region of human chromosome 22 (NCBI Build 34; 32600114 bp to 33587986 bp) using dbSNP build 115. To improve experimental success, we applied a hierarchical approach, preferentially selecting SNPs verified in Dawson et al. (2002
Although it is still unclear whether the effects of natural selection can be wholly avoided, a gene-free region was selected in order to minimize the possible confounding effects of selection and hitchhiking. The 1-Mb region begins at the 3' end of the Glycosyltransferase-like protein LARGE, which belongs to the Glycosyltransferase family 8; no other known gene maps to this interval. Different classes of repeats have been found in the region, including SINEs, LINEs, LTRs, STRs, and others (Dunham et al. 1999
The CEPH-HGDP diversity panel contains 1064 individuals representing 51 populations (Cann et al. 2002
Probability of being tags and definition of best tagging SNPs
To apply tagSNPs from one population to another, best tagSNPs were used. They were defined based on the ldSelect algorithm with the following modification: If there were multiple tagSNPs in a bin, the most common tagSNPs (highest value of MAF) were always selected first, because the more common a SNP is in a population, the higher the chance of it being polymorphic in another. If several tagSNPs in a bin shared the same highest MAF value, the average pairwise r2 between each of them and all the rest of the SNPs in the same bin was calculated. TagSNPs with the highest average r2 values were selected from each bin to create the best tagSNP set. For a given population, tagSNPs were selected with a threshold of r2 > 0.5, 0.64 (default value of ldSelect and results given in the main text), and 0.8.
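As an illustration only, the tie-breaking rule for choosing the best tagSNP within one bin might be sketched as follows. This is not the ldSelect code itself: the binning step is assumed to have been done already, and the function name `pick_best_tag`, the SNP labels, and the nested-dictionary layout for r2 are all hypothetical.

```python
from typing import Dict, List


def pick_best_tag(
    candidates: List[str],
    bin_members: List[str],
    maf: Dict[str, float],
    r2: Dict[str, Dict[str, float]],
) -> str:
    """Choose one tagSNP for a bin.

    Rule 1: prefer the candidate with the highest minor allele frequency (MAF).
    Rule 2: among candidates tied on MAF, prefer the one with the highest
            average pairwise r^2 with all other SNPs in the same bin.
    """
    top_maf = max(maf[s] for s in candidates)
    tied = [s for s in candidates if maf[s] == top_maf]
    if len(tied) == 1:
        return tied[0]

    def mean_r2(snp: str) -> float:
        others = [m for m in bin_members if m != snp]
        if not others:
            return 1.0  # a singleton bin trivially tags itself
        return sum(r2[snp][m] for m in others) / len(others)

    return max(tied, key=mean_r2)


# Toy bin with four SNPs; snpA, snpB and snpC are candidate tags.
maf = {"snpA": 0.30, "snpB": 0.30, "snpC": 0.20, "snpD": 0.25}
r2 = {
    "snpA": {"snpB": 0.90, "snpC": 0.80, "snpD": 0.70},
    "snpB": {"snpA": 0.90, "snpC": 0.85, "snpD": 0.80},
    "snpC": {"snpA": 0.80, "snpB": 0.85, "snpD": 0.75},
    "snpD": {"snpA": 0.70, "snpB": 0.80, "snpC": 0.75},
}
best = pick_best_tag(["snpA", "snpB", "snpC"], list(maf), maf, r2)
```

Here snpA and snpB tie on MAF, and snpB is chosen because its average r2 with the rest of the bin is higher. Exact floating-point equality is used to detect MAF ties, which is adequate when frequencies come from counts over the same number of chromosomes.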
It may be stressed that r2 is inversely related to the sample sizes required for a given power in association studies (Weiss and Clark 2002
Applying tagSNPs across populations
For each of the two main types of tests, the following statistics were calculated to evaluate the effectiveness in a test population of tagSNPs selected from a reference population for three values of the r2 threshold (0.5, 0.64, and 0.8). For each of the non-tagSNPs in the test population, the pairwise r2 value between it and each of the tagSNPs was calculated. The maximum of such r2 values was regarded as the measure of how effective the tagSNPs as a whole were for that particular non-tagSNP in the test population. Average values of these maxima over all non-tagSNPs (and the corresponding 95th percentile) were then computed as a measure of the overall effectiveness of a tagSNP set in another population. With each testing threshold of r2 (0.50, 0.64, and 0.80), and to have a better description of the distribution of maximum r2 values, we also computed the percentage of non-tagSNPs in a test population that had a maximum r2 value over a given cut point, using the same three r2 values.
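The summary statistics described here could be computed along the following lines. This is a sketch under assumptions: the function name `summarize_max_r2` is hypothetical, the percentile follows the definition given in the Results (the value that leaves only 5% of the r2 values below it) and is obtained with a simple nearest-rank rule, and "over a given cut point" is read as strictly greater than.

```python
import math
from typing import Dict, Sequence


def summarize_max_r2(
    max_r2: Sequence[float],
    cut_points: Sequence[float] = (0.5, 0.64, 0.8),
) -> Dict[str, object]:
    """Summarize the per-SNP maximum r^2 values for one test population."""
    values = sorted(max_r2)
    n = len(values)
    if n == 0:
        raise ValueError("need at least one non-tagSNP")
    # Nearest-rank lower bound: roughly 5% of the values lie at or below it.
    rank = max(1, math.ceil(0.05 * n))
    lower_bound = values[rank - 1]
    return {
        "mean": sum(values) / n,
        "lower_5pct": lower_bound,
        "frac_above": {c: sum(v > c for v in values) / n for c in cut_points},
    }


# Toy input: 20 non-tagSNPs, one of them poorly tagged.
toy = [0.1] + [0.6] * 9 + [0.9] * 10
summary = summarize_max_r2(toy)
```

In this toy case a single poorly tagged SNP sets the lower bound, while the mean stays fairly high, which mirrors the point made in the Results that a good average can coexist with some uninformative tags.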
This study was supported by the European Project QLG2-CT-2002-00916 and by the Ministerio de Ciencia y Tecnología from the Spanish Government (BMC2001-0772 and BFU2004-02002/BMC) and DURSI, Generalitat de Catalunya (Grup de Recerca Consolidat 2001SGR00285 and Distinció per a la Recerca Universitària to J.B.). Additional support was received from the Wellcome Trust and from the European Science Foundation (ESF) Integrated Approaches for Functional Genomics Program. We thank Mònica Vallés (UPF), and Sobia Raza and Benedict Cross (Sanger) for technical support, and Anthony Boyce for providing the unique environment of St. John's College, Oxford.
[Supplemental material is available online at www.genome.org.] Article published online ahead of print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.4138406.
5 Present address: Human Cancer Genetics Programme, Genotyping Unit, Spanish National Cancer Centre (CNIO) E-28029, Madrid, Spain.
6 Corresponding author.
Ardlie, K.G., Kruglyak, L., and Seielstad, M. 2002. Patterns of linkage disequilibrium in the human genome. Nat. Rev. Genet. 3: 299-309.
Bertranpetit, J., Calafell, F., Comas, D., González-Neira, A., and Navarro, A. 2003. Structure of linkage disequilibrium in humans: Genome factors and population stratification. In Cold Spring Harb. Symp. Quant. Biol., pp. 79-88. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
Cann, H.M., de Toma, C., Cazes, L., Legrand, M.F., Morel, V., Piouffre, L., Bodmer, J., Bonne-Tamir, B., Cambon-Thomsen, A., Chen, Z., et al. 2002. A human genome diversity cell line panel. Science 296: 261-262.
Carlson, C.S., Eberle, M.A., Rieder, M.J., Smith, J.D., Kruglyak, L., and Nickerson, D.A. 2003. Additional SNPs and linkage-disequilibrium analyses are necessary for whole-genome association studies in humans. Nat. Genet. 33: 518-521.
Carlson, C.S., Eberle, M.A., Kruglyak, L., and Nickerson, D.A. 2004a. Mapping complex disease loci in whole-genome association studies. Nature 429: 446-452.
Carlson, C.S., Eberle, M.A., Rieder, M.J., Yi, Q., Kruglyak, L., and Nickerson, D.A. 2004b. Selecting a maximally informative set of single-nucleotide polymorphisms for association analyses using linkage disequilibrium. Am. J. Hum. Genet. 74: 106-120.
Comas, D., Plaza, S., Wells, R.S., Yuldaseva, N., Lao, O., Calafell, F., and Bertranpetit, J. 2004. Admixture, migrations, and dispersals in Central Asia: Evidence from maternal DNA lineages. Eur. J. Hum. Genet. 12: 495-504.
Crawford, D.C., Carlson, C.S., Rieder, M.J., Carrington, D.P., Yi, Q., Smith, J.D., Eberle, M.A., Kruglyak, L., and Nickerson, D.A. 2004. Haplotype diversity across 100 candidate genes for inflammation, lipid metabolism, and blood pressure regulation in two populations. Am. J. Hum. Genet. 74: 610-622.
Dawson, E., Abecasis, G.R., Bumpstead, S., Chen, Y., Hunt, S., Beare, D.M., Pabial, J., Dibling, T., Tinsley, E., Kirby, S., et al. 2002. A first-generation linkage disequilibrium map of human chromosome 22. Nature 418: 544-548.
de la Vega, F.M., Isaac, H., Collins, A., Scafe, C.R., Halldorsson, B.V., Su, X., Lippert, R.A., Wang, Y., Laig-Webster, M., Koehler, R.T., et al. 2005. The linkage disequilibrium maps of three human chromosomes across four populations reflect their demographic history and a common underlying recombination pattern. Genome Res. 15: 454-462.
Dunham, I., Hunt, R., Collins, J.E., Bruskiewich, R., Beare, D.M., Clamp, M., Smink, L.J., Ainscough, R., Almeida, J.P., Babbage, A., et al. 1999. The DNA sequence of human chromosome 22. Nature 402: 489-495.
Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M., et al. 2002. The structure of haplotype blocks in the human genome. Science 296: 2225-2229.
The International HapMap Consortium. 2003. The International HapMap Project. Nature 426: 789-796.
The International HapMap Consortium. 2005. A haplotype map of the human genome. Nature 437: 1299-1320.
Jobling, M.A., Hurles, M.E., and Tyler-Smith, C. 2004. Human evolutionary genetics: Origins, peoples, and disease. Garland Science, Taylor & Francis, New York.
Johnson, G.C., Esposito, L., Barratt, B.J., Smith, A.N., Heward, J., Di Genova, G., Ueda, H., Cordell, H.J., Eaves, I.A., Dudbridge, F., et al. 2001. Haplotype tagging for the identification of common disease genes. Nat. Genet. 29: 233-237.
Ke, X., Hunt, S., Tapper, W., Lawrence, R., Stavrides, G., Ghori, J., Whittaker, P., Collins, A., Morris, A.P., Bentley, D., et al. 2004. The impact of SNP density on fine-scale patterns of linkage disequilibrium. Hum. Mol. Genet. 13: 577-588.
Kruglyak, L. 1999. Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat. Genet. 22: 139-144.
Kruglyak, L. and Nickerson, D.A. 2001. Variation is the spice of life. Nat. Genet. 27: 234-236.
Mateu, E., Pérez-Lezaún, A., Martínez-Arias, R., Andrés, A.M., Vallés, M., Bertranpetit, J., and Calafell, F. 2002. PKLR-GBA region shows almost complete linkage disequilibrium over 70 kb in a set of worldwide populations. Hum. Genet. 110: 532-544.
Mueller, J.C., Lohmussaar, E., Magi, R., Remm, M., Bettecken, T., Lichtner, P., Biskup, S., Illig, T., Pfeufer, A., Luedemann, J., et al. 2005. Linkage disequilibrium patterns and tagSNP transferability among European populations. Am. J. Hum. Genet. 76: 387-398.
Myers, S., Bottolo, L., Freeman, C., McVean, G., and Donnelly, P. 2005. A fine-scale map of recombination rates and hotspots across the human genome. Science 310: 321-324.
Patil, N., Berno, A.J., Hinds, D.A., Barrett, W.A., Doshi, J.M., Hacker, C.R., Kautzer, C.R., Lee, D.H., Marjoribanks, C., McDonough, D.P., et al. 2001. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294: 1719-1723.
Pritchard, J.K. and Przeworski, M. 2001. Linkage disequilibrium in humans: Models and data. Am. J. Hum. Genet. 69: 1-14.
Ramirez-Soriano, A., Lao, O., Soldevila, M., Calafell, F., Bertranpetit, J., and Comas, D. 2005. Haplotype tagging efficiency in worldwide populations in CTLA4 gene. Genes Immun. 6: 646-657.
Risch, N. and Merikangas, K. 1996. The future of genetic studies of complex human diseases. Science 273: 1516-1517.
Rosenberg, N.A., Pritchard, J.K., Weber, J.L., Cann, H.M., Kidd, K.K., Zhivotovsky, L.A., and Feldman, M.W. 2002. Genetic structure of human populations. Science 298: 2381-2385.
Simoni, L., Calafell, F., Pettener, D., Bertranpetit, J., and Barbujani, G. 2000. Geographic patterns of mtDNA diversity in Europe. Am. J. Hum. Genet. 66: 262-278.
Soldevila, M., Calafell, F., Helgason, A., Stefansson, K., and Bertranpetit, J. 2005. Assessing the signatures of selection in PRNP from polymorphism data: Results support Kreitman and Di Rienzo's opinion. Trends Genet. 21: 389-391.
Weiss, K.M. and Clark, A.G. 2002. Linkage disequilibrium and the mapping of complex human traits. Trends Genet. 18: 19-24.
Received May 15, 2005; accepted in revised form December 15, 2005.