Genome Res. 15:67-77, 2005
©2005 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/05 $5.00

Methods

Identification of polymorphic motifs using probabilistic search algorithms

1 Human Genetics Unit, Indian Statistical Institute, Kolkata, 700108 India
2 Theoretical Statistics and Mathematics Unit, Indian Statistical Institute, Kolkata, 700108 India
The problem of identifying motifs comprising nucleotides at a set of polymorphic DNA sites, not necessarily contiguous, arises in many human genetic problems. However, when the sites are not contiguous, no efficient algorithm exists for polymorphic motif identification. A search based on complete enumeration is computationally inefficient. We have developed probabilistic search algorithms to discover motifs of known or unknown lengths. We have developed statistical tests of significance for assessing a motif discovery, and a statistical criterion for simultaneously estimating motif length and discovering it. We have tested these algorithms on various synthetic data sets and have shown that they are very efficient, in the sense that the "true" motifs can be detected in the vast majority of replications and in a small number of iterations. Additionally, we have applied them to some real data sets and have shown that they are able to identify known motifs. In certain applications, it is pertinent to find motifs that contain contrasting nucleotides at the sites included in the motif (e.g., motifs identified in case-control association studies). For this, we have suggested appropriate modifications. Using simulations, we have discovered that the success rate of identification of the correct motif is high in case-control studies except when relative risks are small. Our analyses of evolutionary data sets resulted in the identification of some motifs that appear to have important implications on human evolutionary inference. These algorithms can easily be implemented to discover motifs from multilocus genotype data by simple numerical recoding of genotypes.
Single nucleotide polymorphisms (SNPs) are abundant in the human genome and occur roughly once every 2 kb on average (Balasubramanian et al. 2002). In the context of evolutionary or human genetic studies, there are two related issues. The first is to identify motifs or haplotypes that occur at high frequencies in subsets of a large data set, such as those sampled from specific geographical regions or groups, or from individuals afflicted with a specific disease. Having identified such motifs, the second problem is to decipher the biological or population genetic processes (e.g., linkage, drift, selection, epistasis) that have resulted in the existence of these high-frequency motifs. In this study, we shall only address the first issue, viz., how to identify high-frequency motifs. To address the second issue, collection of further data (e.g., family data), statistical modeling, investigations of metabolic pathways, wet-laboratory experimentation, etc., may be required.
It is theoretically possible to discover polymorphic motifs in a set of N aligned DNA sequences, each of length L nucleotides, by examining frequencies in all possible k x k tables, k = 2,3,...,L. However, this is computationally infeasible. The purpose of this study is to propose a set of computationally fast probabilistic search algorithms that may be used for motif finding, and to evaluate their efficiencies using both synthetic and real data sets. Keeping SNP loci in mind, which are usually biallelic, we formulate, describe, and assess these algorithms using sequences of binary characters. However, nothing in these algorithms restricts the search to binary sequences. They can also be used on multilocus genotype data of diploid individuals. When genotype data are used, the distinct genotypes only need to be numerically recoded, as discussed later. Thus, the proposed algorithms are fairly general in nature, and can be put to diverse uses.
We first propose an algorithm for identifying a motif of a given length. We then extend this algorithm to the case in which the length is unknown. Finally, we propose a modification for identifying "variant" motifs. The problem of identifying a variant motif arises when, given a collection of DNA sequences derived from a set of individuals, it is of interest to identify whether an appropriately defined subset of individuals in this collection possesses a motif that is different from that possessed by the remaining subset of individuals. For example, in a case-control study, it is pertinent to identify whether the cases possess a motif at a certain number of sites that comprises nucleotides, each of which is different from the nucleotide possessed by the controls at the corresponding site. Identification of such a variant motif can help in identifying SNPs associated with the disease in question. The problem of identifying variant motifs in subsets of a collection of sequences at the hypervariable segment-1 (HVS1) of human mtDNA has received a lot of attention (Quintana-Murci et al. 1999).
In each of these problems, a variant motif is defined in relation to another. For example, for case-control data, the variant motif among cases is defined in contrast to the one found among the controls. In the evolutionary analysis of mtDNA HVS1 sequences, search for a motif is made in contrast to the Cambridge Reference Sequence (CRS) (Anderson et al. 1981).
Consider a data matrix $((a_{ij}))_{N \times L}$, where $a_{ij}$ denotes a nucleotide (A, T, G, or C) at the jth polymorphic site (j = 1,2,...,L) for the ith individual (i = 1,2,...,N). The data matrix is generated from aligned DNA sequences of a specific genomic segment of N individuals, from which all monomorphic sites have been removed. We note that if these N individuals belong to a case-control study, then the data matrix needs to be initially created by pooling all cases and controls, and subsequently separated into two matrices, one for cases and another for controls. A similar strategy is also required in evolutionary studies, while simultaneously dealing with two populations. We also note that if disjoint segments of DNA are to be simultaneously examined for motif finding, then appropriate segments may be separately aligned, and the aligned segments concatenated in the data matrix.
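The construction of the data matrix can be illustrated with a minimal Python sketch. It assumes the aligned sequences are given as equal-length strings; the function name `drop_monomorphic` and the toy sequences are hypothetical and serve only as an illustration.

```python
def drop_monomorphic(sequences):
    """Return (kept_site_indices, data_matrix) with monomorphic columns removed.

    `sequences` is a list of aligned, equal-length strings, one per individual.
    """
    if not sequences:
        return [], []
    length = len(sequences[0])
    if any(len(s) != length for s in sequences):
        raise ValueError("sequences must be aligned to the same length")
    # A site is polymorphic if more than one distinct character occurs at it.
    kept = [j for j in range(length) if len({s[j] for s in sequences}) > 1]
    matrix = ["".join(s[j] for j in kept) for s in sequences]
    return kept, matrix


# Toy alignment: N = 4 individuals, 5 aligned positions (0-based indices).
aligned = ["ACGTA", "ACGTT", "ATGTA", "ACGTA"]
kept_sites, data = drop_monomorphic(aligned)
print(kept_sites, data)
```

For a case-control study, the same function would be applied to the pooled sequences first, and the resulting rows then split into a case matrix and a control matrix, so that both matrices share the same set of sites.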
Let V = {1,2,...,L} denote the set of all L polymorphic sites in the data. Let
In general, the problem of finding a motif of length p from an N x L data matrix reduces to identifying, from among the $\binom{L}{p}$ sets of p sites that make up the search space, the set on which the modal sequence is globally modal. Each element of the search space is a string, S, comprising the identities of the specific p sites chosen out of L, so there are $\binom{L}{p}$ such strings. An exhaustive search of this space is computationally very expensive, and perhaps infeasible. We propose a stochastic search method, similar in spirit to the Metropolis-Hastings version (Metropolis et al. 1953). By our definition, maximizing the frequency of the modal sequence over the search space leads to identification of the motif of length p. Thus, the search comprises choosing both sites and characters at these sites, so that the chosen set of characters at the chosen set of sites has the maximum frequency in the data set.
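To make the objective concrete, the following Python sketch computes the modal sequence on a given set of sites and carries out the complete enumeration that becomes infeasible for large L. The function names and the toy binary data are hypothetical; only the definitions (modal sequence, search space of size L choose p) come from the text.

```python
from collections import Counter
from itertools import combinations
from math import comb


def modal_sequence(data, sites):
    """Most frequent sequence, and its frequency, restricted to `sites`."""
    counts = Counter(tuple(row[j] for j in sites) for row in data)
    return counts.most_common(1)[0]


def exhaustive_best(data, p):
    """Complete enumeration over all sets of p sites (small L only)."""
    n_sites = len(data[0])
    best_sites = max(combinations(range(n_sites), p),
                     key=lambda s: modal_sequence(data, s)[1])
    sequence, frequency = modal_sequence(data, best_sites)
    return best_sites, sequence, frequency


# Toy binary data matrix: N = 4 individuals, L = 4 polymorphic sites.
data = ["0110", "0111", "0100", "1011"]
print(modal_sequence(data, (0, 1)))
print(exhaustive_best(data, 2))

# Size of the search space for L = 100 sites and motif length p = 10.
print(comb(100, 10))
```

With L = 100 and p = 10 the search space already holds more than 10^13 candidate sets, which is why a stochastic search is proposed instead of enumeration.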
Algorithm for finding a motif of a given length and assessing its statistical significance

We shall use the following notations:
We initially set
We first calculate
Obviously, the transition probability from one string to another depends only on the outcome of the current string (Markov property). As is easily understood from the above updating rule, at any step of the iteration, a new string that yields a smaller value of H(S) is always accepted, but to avoid being trapped at a local minimum, the new string with higher value of H(S) may also be retained with a small probability (that crucially depends on the preassigned control parameter c and the corresponding sweep step t). It may be noted, however, that as the number of sweeps, t, increases, the process stabilizes. In other words, the probability of accepting a worse string decreases as t increases. The algorithm converges to the global minimum if
After each iteration, we compare We note that, as with all numerical optimization procedures, it is desirable to repeat the procedure a certain number of times from different starting strings, and examine whether convergence to the same optimal value is obtained. The number of repetitions of the procedure that is practically feasible obviously depends on the availability of computing resources. Having discovered a motif of a given length p in a data set, it is important to assess the statistical significance of the discovery. For this, we need to estimate the probability of existence of a motif of length p in a "random" data set of "similar" structure as the real data set in terms of nucleotide composition (as explained in detail in the Results section), that has a frequency higher than the motif discovered in the real data set. If this probability is smaller than a preassigned value (say, 0.05), then the motif that has been discovered can be declared to be statistically significant. To estimate this probability, we created a large number of random data sets, by randomly permuting the elements of each column of the real data set. For each random data set thus created, we used our algorithm to discover the motif of length p with the highest frequency, that is, the "best" motif. The proportion of random data sets in which the best motif had a frequency higher than that of the motif discovered in the real data set provided an empirical estimate of statistical significance. We note that for this purpose, ideally, the best motif in each random data set should be identified by a complete enumeration search, and not by using the algorithm proposed by us. However, this is infeasible unless the real data set is small. (We have actually carried out the complete enumeration search in many small data sets; the results are presented later.)
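The search and the permutation test described above can be sketched together in Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the exact forms of H(S) and of the acceptance probability are not reproduced in this text, so the sketch assumes H(S) = -(modal frequency), a proposal that swaps one site of the current string for an unused site, and acceptance of a worse string at sweep t with probability exp(-log(1 + t) * (H_new - H_old) / c). All function names are hypothetical.

```python
import math
import random
from collections import Counter


def modal_frequency(data, sites):
    """Frequency of the most common sequence restricted to `sites`."""
    counts = Counter(tuple(row[j] for j in sites) for row in data)
    return max(counts.values())


def anneal_motif(data, p, c=1.0, sweeps=100, rng=None):
    """Stochastic search for p sites whose modal sequence is most frequent.

    Assumed (not from the source): H(S) = -modal frequency, and a
    logarithmic schedule beta_t = log(1 + t) / c.
    """
    if rng is None:
        rng = random.Random(0)
    n_sites = len(data[0])
    current = rng.sample(range(n_sites), p)
    h_current = -modal_frequency(data, current)
    best, h_best = sorted(current), h_current
    for t in range(1, sweeps + 1):
        beta = math.log(1 + t) / c  # grows with t, so worse moves get rarer
        for pos in range(p):
            unused = [j for j in range(n_sites) if j not in current]
            if not unused:
                break  # p == L: there is only one possible set of sites
            candidate = list(current)
            candidate[pos] = rng.choice(unused)
            h_new = -modal_frequency(data, candidate)
            delta = h_new - h_current
            # Always accept an improvement; accept a worse string rarely.
            if delta <= 0 or rng.random() < math.exp(-beta * delta):
                current, h_current = candidate, h_new
                if h_current < h_best:
                    best, h_best = sorted(current), h_current
    return best, -h_best


def permute_columns(data, rng):
    """Shuffle each column independently, preserving per-site composition."""
    columns = [list(col) for col in zip(*data)]
    for col in columns:
        rng.shuffle(col)
    return ["".join(row) for row in zip(*columns)]


def empirical_p_value(data, p, observed_freq, n_random=200, seed=1):
    """Share of permuted data sets whose best motif beats observed_freq."""
    rng = random.Random(seed)
    exceed = 0
    for _ in range(n_random):
        shuffled = permute_columns(data, rng)
        _, freq = anneal_motif(shuffled, p, rng=rng)
        exceed += freq > observed_freq
    return exceed / n_random


# Toy data: a motif "101" planted at sites (2, 5, 9) in about 70% of rows.
gen = random.Random(42)
rows = []
for _ in range(60):
    row = [gen.choice("01") for _ in range(12)]
    if gen.random() < 0.7:
        for j, ch in zip((2, 5, 9), "101"):
            row[j] = ch
    rows.append("".join(row))

sites, freq = anneal_motif(rows, p=3, c=1.0, sweeps=50)
print(sites, freq, empirical_p_value(rows, 3, freq, n_random=20))
```

As noted in the text, the best motif in each random data set would ideally be found by complete enumeration; using the stochastic search there is a practical compromise, and the search should be repeated from several starting strings before the result is trusted.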
Extension of the algorithm when the motif length is unknown and assessment of statistical significance
For any given value of the motif length p
To assess the statistical significance of a decrease in G(S|p) as the motif length (p) is increased, we propose the following criterion. Let
, then we declare the decrease from $G(S \mid p_i)$ to $G(S \mid p_{i+1})$ as significant, and stop with the motif length p. The idea underlying this criterion is that we declare a drop in the value of the objective function to be statistically significant if this drop differs from the mean of all previous drops by more than two times the variance of all previous drops. In the rare event that $\sigma^2(p_i) = 0$, we use the stopping criterion $G(S \mid p_{i-1}) > 2\,G(S \mid p_i)$, and declare the length of the motif as $p_i$.
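The stopping rule can be sketched in Python as follows. Several details are assumptions made for illustration: the drop is compared one-sidedly with the mean of earlier drops, the population variance is used, and the motif length returned is the one just before the significant drop. The function name `estimate_motif_length` and the example values of G are hypothetical.

```python
from statistics import mean, pvariance


def estimate_motif_length(lengths, g_values, k=2.0):
    """Return the length just before the first significant drop in G.

    lengths[i] is a candidate motif length and g_values[i] the value of
    the objective function found for it. Returns None if no drop is
    declared significant.
    """
    drops = []
    for i in range(len(g_values) - 1):
        drop = g_values[i] - g_values[i + 1]
        if drops:
            variance = pvariance(drops)
            if variance > 0:
                # Drop exceeds the mean of previous drops by > k * variance.
                significant = drop - mean(drops) > k * variance
            else:
                # Fallback when all previous drops are identical.
                significant = g_values[i] > 2 * g_values[i + 1]
            if significant:
                return lengths[i]
        drops.append(drop)
    return None


# Hypothetical frequencies of the best motif for lengths 2 to 7.
lengths = [2, 3, 4, 5, 6, 7]
g_values = [60, 58, 57, 55, 54, 20]
print(estimate_motif_length(lengths, g_values))
```

The constant k plays the role of the value 2 in the criterion; exposing it as a parameter makes it easy to check how sensitive the estimated length is to that choice.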
Although the above method of assessment of statistical significance is intuitively appealing, the choice of the value of the constant (=2) in the stopping criterion is somewhat arbitrary. Further, in the above search procedure, it is possible that the sets of sites included in motifs of length p and (p + 1) are disjoint. In many practical applications, this may not be desirable. Therefore, after the initial stage, new sites should be added to the set of sites included in the motif discovered thus far. Such an addition is made by searching for a site from among those sites not included in the identified motif. This strategy is not only more meaningful in many practical applications, but is also computationally less expensive. However, there is a trade-off. After convergence of this procedure, it is possible that the identified motif of length q (say) is suboptimal among all motifs of length q. When this procedure is adopted, we suggest the use of the criterion described below to assess statistical significance of increase of motif length from p to (p + 1). Let Starting with a small motif length, one can continue to increase its length until the level of significance falls below a preassigned value (say, 0.05). If the structure of a data set is such that sequential addition of sites leads to the same motif at every stage, compared with the direct procedure of identifying a motif of a certain length, then, as we shall show later, the use of these two procedures of testing statistical significance yield concordant inferences.
Identification and statistical significance of variant motifs
Upon termination of the algorithm, we test whether the odds-ratio estimated from the 2 x 2 table comprising the frequencies of the two motifs identified among cases and controls (or in the two data matrices under consideration) is significantly different from unity (Breslow and Day 1993). Following the same spirit as for a single data set described earlier, one may also assess the statistical significance of the discovered motif in case-control data by using a permutation algorithm to generate a large number of "random" data sets of a structure similar to that of the controls. We have done this. For each case-control data set, synthetic or real, after having identified a motif in the case data by using the variant-motif algorithm, we generated a large number of control data sets by permuting the elements of each column of the control data matrix. We then used the algorithm, and empirically estimated the probability that the odds-ratio obtained for the real data sets of cases and controls is lower than the odds-ratio obtained from the real case data and a randomly generated control data set. We have used this probability as a measure of statistical significance (p-value) of the motif discovered from the real data sets.
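A test of the odds-ratio against unity can be sketched in Python as below. The text cites Breslow and Day (1993) without stating which test was used, so this sketch assumes the standard large-sample test based on the log odds-ratio and its standard error; the function name and the counts are hypothetical.

```python
import math


def odds_ratio_test(a, b, c, d):
    """Odds-ratio and two-sided p-value for H0: odds-ratio = 1.

    a, b: cases with and without the motif;
    c, d: controls with and without the motif.
    Assumes the large-sample normal approximation for log(odds-ratio).
    """
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cell counts must be positive")
    odds_ratio = (a * d) / (b * c)
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = math.log(odds_ratio) / standard_error
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return odds_ratio, p_value


# Hypothetical table: 30 of 100 cases and 15 of 100 controls carry the motif.
print(odds_ratio_test(30, 70, 15, 85))
```

For tables with small or zero cell counts the normal approximation is unreliable, which is one reason the permutation-based assessment described above is a useful complement.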
In data sets pertaining to evolution, the method of finding a variant motif is simpler because a specific reference sequence is generally given. In this setup, given a string, Sp, of length p, we enumerate from the data all possible sequences
p. This indicates that, if the value of ml,p realized at the maximum value of the above objective function is less than p, then there may exist sequences of length p with more than ml,p mismatches with the reference sequence. But the frequency of such a sequence will be much smaller than fl,p, resulting in a drop in the value of G(Sp). One effective strategy that we have used in implementing the above objective function is to start the algorithm with a large value of p. This enables us to find a sequence with a considerably high frequency, where ml,p out of the p sites differ from the reference sequence. By keeping track of the sites at which the sequence differs from the reference sequence, we can find the sequence at the sites constituting the variant motif. Another advantage of using the algorithm is that, even without any prior knowledge of the actual length of the motif (discussed in detail in the previous section), the objective function attains its maximum at some value of ml,p, which enables us to get the motif length, the best estimate of which is ml,p, from a single run. To assess the statistical significance of the discovered motif, we generated a large number (10,000) of "random" data sets of a structure similar to the original. If the length of the motif discovered in the original data set was p, we restricted the search algorithm to maximize only over those sequences for which ml,p was equal to p. That is, in the randomly generated data, given a string Sp, the frequency of a sequence was set to 0 if it had fewer than p mismatches with the reference sequence.
Performance of the algorithms: Assessment using synthetic data sets

Data Set 1

We designed various synthetic data sets, so that the motif in each data set was known, to assess the performance of our algorithm. In our synthetic Data Set 1, a data matrix (N x L) was created, and a known motif of a fixed length (p) was planted in a proportion u of individuals. Data sets were created with different values of relevant parameters; details are given in Supplemental Text 2. The algorithm was applied on each synthetic data matrix, with different values of the control parameter c. As stated earlier, instead of maximizing G(S), we consider an equivalent problem of minimizing a monotonically decreasing function H(S) of G(S). We have taken
For every synthetic data set (for different values of N and L) on which the algorithm was used to discover a motif of length p (= the length of the planted motif), we generated 10,000 random data sets of similar structure to test the statistical significance of the discovered motif, as explained earlier. In every case, the estimated probability that a random data set has a motif of frequency higher than that of the discovered motif was $<10^{-7}$. Thus, in every case, the discovered motif was statistically significant at a level $<10^{-7}$. We have also assessed the levels of significance as the motif length was increased. The significance levels were all <0.005 as the motif length was increased from 2 to 10, but were >0.5 when the motif length was increased from 10 to 11. (Statistical significance was assessed using both the criteria described in an earlier section: assessing the significance of a "drop" in frequency with increase in motif length, and also of the addition of a site. Both criteria yielded concordant inferences in every simulation run.) This indicates that our algorithm was not only able to discover the planted motif of length 10, but that the discovery was statistically significant. Further, an increase of length to 11 was not statistically significant. Detailed results are presented in Supplemental Table 2. Some general results on the validity and good performance of the proposed method of assessing statistical significance of a motif discovered by our algorithm are presented in Supplemental Text 3. To examine the limits to which our algorithm can perform well, we constructed new data sets. The descriptions of the data sets and results are given in Supplemental Texts 4 and 5.
Data Set 2

In creating synthetic data sets, we have used various values of u1 and RR. The algorithm for finding variant motifs was used. Statistical significance was assessed by testing the null hypothesis of the odds-ratio being equal to unity, as described earlier. The values of the parameters used in generating the synthetic data sets were as follows: N = 100; L = 100, 200, 300; p = 4 and 6; u1 = 0.2, 0.4; and RR = 1.2, 1.5, and 2.0. For each combination of L and u1, 1000 synthetic data sets were generated with each of the various combinations of the other parameters. The algorithm was run on each data set for values of the control parameter c = 50, 75, and 100. The results are given in Table 2, for c = 100. (For c = 50 and 75, the results, not shown, were virtually identical.) In general, our algorithm correctly identified the planted motif in a large proportion of simulation runs only when the RR attributable to a single site was high. The probability of correct identification decreased with decrease in RR. Further, for fixed values of the parameters u1 and RR, this probability decreased with increase in the motif length, p, but was found to be not strongly dependent on the value of L. Although, for several combinations of simulation parameter values, the probability of correct identification was small or zero, we note that the number of sites and nucleotides that matched between the planted and identified motifs was large, except for RR = 1.2. This indicates that just by chance there may exist motifs with haplotype (motif) relative risks higher than that of the planted motif. However, it is clear that unless the relative risk is small, the true motif will share many sites and nucleotides with the identified motif.
Whether or not the identified motif matched the planted motif in a synthetic data set, we carried out a test of statistical significance of the identified motif by generating 10,000 random data sets of a structure similar to that of the control data and estimating the odds-ratios, as explained earlier. The p-values corresponding to the identified motif in the real data are given in Table 3. None of the identified motifs for the various combinations of the parameter values (motif length, p; u1; and the number of polymorphic sites) was statistically significant when RR was small (= 1.2). However, when the RR was 1.5 or 2, the identified motifs were all statistically significant at the 5% level.
Data Set 3

This data set was constructed to mimic an evolutionary scenario. When two populations that have diverged from an ancestral population evolve separately, the daughter populations accumulate separate sets of mutations that increase in frequencies because of natural selection or other evolutionary forces. Thus, one may find motifs in the daughter populations, with some motif sites being shared between the two populations, while others are unshared (Schwaiger and Epplen 1995).
We carried out 1000 independent simulation runs using the procedure described above, with c = 200. Detailed results for five runs are provided in Table 3, which show that our probabilistic search algorithm always converged and identified the correct motifs of correct lengths in the parental and in the daughter populations in a small number of sweeps. The final motifs were statistically significant at levels <0.005, as assessed by the procedure in which 10,000 random data sets were generated. As a matter of fact, correct convergence was achieved in every one of the 1000 runs (detailed results not provided) and the convergence using the proposed algorithm was fairly fast (Supplemental Table 6).
Identification of variant motifs: Applications to real data
LDL receptor haplotypes among individuals of European and African descent: The PARC study
Mitochondrial DNA haplogroups M and U
4. For ml = 4, the sites at which nucleotides differed from the CRS were S = (16223, 16270, 16319, 16352). The frequency of this string, fl, was 21 (= 6.21% of the total number of samples), and the nucleotides at the relevant positions were T, T, A, and C, respectively. The next most frequent string was (16223[T], 16274[A], 16319[A], and 16320[C]) with a frequency of 17 (5.03%). These two motifs belong to known subhaplogroups M* (defined by C→T transition at the site 16223) and M2 (defined by C→T transition at the site 16223 and a G→T transition at the site 16319), which are prevalent in Indian populations (Bamshad et al. 2001).
For HG-U also, the objective function, coincidentally, attained a maximum at ml = 4, and the motif identified was (16051[G], 16206[C], 16230[G], 16311[C]), with a frequency of 18 (= 15.65% of the total number of samples). The vast majority of HG-U individuals in India belong to HG-U2i and U7. The U2i is the Indian-specific subcluster of U, as opposed to the Western-Eurasian subcluster U2e (Kivisild et al. 1999).
These examples demonstrate that the proposed algorithm was able to identify previously discovered motifs, and therefore, can be profitably used in evolutionary studies to identify new motifs. The anthropological implications of our findings on HGs M and U presented above have already been described in Basu et al. (2003).
The !Kungs of Botswana, Africa
The problem of identifying motifs in genetic data arises commonly in human genetical research. Such data include DNA sequence data, haplotype data, and genotype data. Motif identification is necessary to draw inferences on evolutionary histories of populations or lineages, to examine associations in case-control studies, etc. More recently, with the initiation of the HapMap project (Couzin 2002 t, H(S)) used by us were chosen not only to satisfy the criteria required for convergence of this class of probabilistic search algorithms (Winkler and Lutz 2003
Through our simulations, we have discovered some limitations of our algorithm as well. In particular, when we assessed (Supplemental Text 4) whether our algorithm converges correctly in a search space that contains exactly one global maximum, and also a large number of local maxima with values not very different from the global maximum, our algorithm failed to converge to the global maximum. This limitation is, of course, inherent to all numerical search procedures that do not use complete enumeration. Further, in simulated case-control data, our algorithm failed to identify the correct motif, especially when the relative risk attributable to a site included in the motif was small (Table 2). For a small relative risk, the identified motif was also statistically nonsignificant (Table 2). However, in most simulation runs, the identified motif shared several sites with the planted motif. Nonconvergence to the correct motif arose because, in realistic case-control data sets, there may be multiple motifs with high haplotype (motif) relative risks just by chance, especially when individual sites (SNPs) do not confer a large relative risk to the disease. This finding is consistent with published observations (e.g., Cardon and Bell 2001).
We would finally like to emphasize that the convergence properties of the proposed algorithms are critically dependent on the control parameter, c. While from the user's point of view it is desirable to be able to prescribe some universal and objective guidelines for the choice of c, this is not possible. In specific applications like those presented here, one can identify a range of values of c that makes the algorithm computationally feasible, with a high probability of convergence to the true optimum. In practice, this range of c needs to be identified by trial and error. We first note that the speed of convergence is directly proportional to the value of c.
Further, the probability of convergence to the true optimum for a specific choice of c is more dependent on the value of L than on N. Using these two facts, the user should try multiple values of c. We strongly recommend some experimentation on the convergence behavior of the algorithm with respect to c in multiparameter settings in order to make a judicious choice. We have found that with N in the range 200 to 500 and L in the range 200 to 500, any value of c in the range 50 to 100 works very well.
Although we have formulated our algorithms keeping haplotype or haploid DNA sequence data in mind, there is no inherent limitation to using these methods on genotype data. Genotype data need only be recoded in order to apply these algorithms. For example, at a biallelic locus, with alleles A and a, the genotypes AA, Aa, and aa may be recoded as 1, 2, and 3. We finally note that there are other classes of probabilistic search algorithms, such as the genetic algorithm (Goldberg 1989).
We have developed a computer program, MOTIFIND, implementing these algorithms. This program is written in C, and can be obtained by writing to the authors. This program can handle both haploid and diploid genotype data.
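The recoding of genotypes mentioned above can be illustrated with a short Python sketch. It is not taken from MOTIFIND; the function names are hypothetical, and the sketch assumes unphased genotypes at biallelic loci written as two-character strings.

```python
def recode_genotype(genotype, major="A", minor="a"):
    """Recode a biallelic genotype: AA -> 1, Aa or aA -> 2, aa -> 3."""
    if len(genotype) != 2 or any(x not in (major, minor) for x in genotype):
        raise ValueError(f"unexpected genotype: {genotype!r}")
    # The code is one plus the number of copies of the minor allele.
    return 1 + sum(allele == minor for allele in genotype)


def recode_matrix(genotypes):
    """Recode an N x L matrix of genotypes (rows are individuals)."""
    return [[recode_genotype(g) for g in row] for row in genotypes]


# Two individuals typed at two loci.
print(recode_matrix([["AA", "Aa"], ["aa", "aA"]]))
```

After recoding, each locus is a character with three states rather than two, and the motif search proceeds on the recoded matrix exactly as it does on nucleotide data.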
This work was partially supported by grants from the Department of Biotechnology and Council for Scientific and Industrial Research, Government of India. We thank Dr. A. Chowdhury for allowing us to include the unpublished data on Gilbert's syndrome. We also thank two anonymous reviewers for comments that have helped to substantially improve an earlier version of this work.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2358005.
3 Corresponding author. [Supplemental material is available online at www.genome.org. The following individuals kindly provided reagents, samples, or unpublished information as indicated in the paper: A. Chowdhury.]
Anderson, S., Bankier, A.T., Barrell, B.G., de Bruijn, M.H., Coulson, A.R., Drouin, J., Eperon, I.C., Nierlich, D.P., Roe, B.A., Sanger, F., et al. 1981. Sequence and organization of the human mitochondrial genome. Nature 290: 457-465.
Balasubramanian, S., Harrison, P., Hegyi, H., Bertone, P., Luscombe, N., Echoles, N., McGarvey, P., Zhang, Z.L., and Gerstein, M. 2002. SNPs on chromosomes 21 and 22 analysis in terms of protein features and pseudogenes. Pharmacogenomics 3: 1-10.
Bamshad, M.J., Kivisild, T., Watkins, W.S., Dixon, M.P., Ricker, L.E., Rao, B.B., Naidu, M., Prasad, B.V.R., Reddy, P.G., Rasanayagam, A., et al. 2001. Genetic evidence on the origins of Indian caste populations. Genome Res. 11: 994-1004.
Basu, A., Mukherjee, N., Roy, S., Sengupta, S., Banerjee, S., Chakraborty, M., Dey, B., Roy, M., Roy, B., Bhattacharyya, N.P., et al. 2003. Ethnic India: A genomic view, with special reference to peopling and structure. Genome Res. 13: 2277-2290.
Bosma, P.J., Chowdhury, J.R., Bakker, C., Gantla, S., deBoer, A., Oostra, B.A., Lindhout, D., Tytgat, G.N., Jansen, P.L., Oude Elferink, R.P., et al. 1995. The genetic basis of the reduced expression of UDP-glucuronosyltransferase 1 in Gilbert's syndrome. New Engl. J. Med. 333: 1171-1175.
Breslow, N.E. and Day, N.E. 1993. Statistical methods in cancer research: The analysis of case-control studies. International Agency for Research on Cancer, Lyon.
Cardon, L.R. and Bell, J. 2001. Association study designs for complex diseases. Nat. Rev. Genet. 2: 91-99.
Collins, F.S., Brooks, L.D., and Chakravarti, A. 1998. A DNA polymorphism discovery resource for research on human genetic variation. Genome Res. 8: 1229-1231.
Couzin, J. 2002. Human genome. HapMap launched with pledges of $100 million. Science 298: 941-942.
Daly, M.J., Rioux, J.D., Schaffner, S.F., Hudson, T.J., and Lander, E.S. 2001. High-resolution haplotype structure in the human genome. Nat. Genet. 29: 229-232.
Goldberg, D.E. 1989. Genetic algorithms in search, optimization, and machine learning. Addison-Wesley Publishing Co., Boston, MA.
Gupta, M. and Liu, J.S. 2003. Discovery of conserved sequence patterns using a stochastic dictionary model. J. Amer. Stat. Assoc. 98: 55-66.
Handt, O., Meyer, S., and Haeseler, A. 1998. Compilation of human mtDNA control region sequences. Nucleic Acids Res. 26: 126-129.
Keiler, K.C. and Shapiro, L. 2001. Conserved promoter motif is required for cell cycle timing of dnaX transcription in Caulobacter. J. Bacteriol. 183: 4860-4865.
Khani-Hanjani, A., Lacaille, D., Horne, C., Chalmers, A., Hoar, D.I., Balshaw, R., and Keown, P.A. 2002. Expression of QK/QR/RRRAA or DERAA motifs at the third hypervariable region of HLA-DRB1 and disease severity in rheumatoid arthritis. J. Rheumatol. 29: 1358-1365.
Kivisild, T., Bamshad, M.J., Kaldma, K., Metspalu, M., Metspalu, E., Reidla, M., Laos, S., Parik, J., Watkins, W.S., Dixon, M.E., et al. 1999. Deep common ancestry of Indian and western-Eurasian mitochondrial DNA lineages. Curr. Biol. 9: 1331-1334.
Liang, F. and Wong, W. 2001. Real-parameter evolutionary Monte Carlo with applications to Bayesian mixture models. J. Amer. Stat. Assoc. 96: 653-666.
Liu, J.S. 2001. Monte Carlo strategies in scientific computing. Springer Series in Statistics, Springer, Heidelberg, Germany.
Macaulay, V., Richards, M., Hickey, E., Vega, E., Cruciani, F., Guida, V., Scozzari, R., Bonne-Tamir, B., Sykes, B., and Torroni, A. 1999. The emerging tree of West Eurasian mtDNAs: A synthesis of control-region sequences and RFLPs. Am. J. Hum. Genet. 64: 232-249.
Metropolis, N., Rosenbluth, A., Rosenbluth, M., Teller, A., and Teller, E. 1953. Equation of state calculations by fast computing machines. J. Chem. Phys. 21: 1087-1091.
Quintana-Murci, L., Semino, O., Bandelt, H.J., Passarino, G., McElreavey, K., and Santachiara-Benerecetti, A.S. 1999. Genetic evidence of an early exit of Homo sapiens sapiens from Africa through eastern Africa. Nat. Genet. 23: 437-441.
Sabeti, P.C., Reich, D.E., Higgins, J.M., Levine, H.Z.P., Richter, D.J., Schaffner, S.F., Gabriel, S.B., Platko, J.V., Patterson, N.J., McDonald, G.J., et al. 2002. Detecting recent positive selection in the human genome from haplotype structure. Nature 419: 832-837.
Schwaiger, F.W. and Epplen, J.T. 1995. Exonic MHC-DRB polymorphisms and intronic simple repeat sequences: Janus' faces of DNA sequence evolution. Immunol. Rev. 143: 199-224.
Stephens, M., Smith, N.J., and Donnelly, P. 2001. A new statistical method for haplotype reconstruction from population data. Am. J. Hum. Genet. 69: 906-914.
Tateno, Y., Ikeo, K., Imanishi, T., Watanabe, H., Endo, T., Yamaguchi, Y., Suzuki, Y., Takahashi, K., Tsunoyama, K., Kawai, M., et al. 1997. Evolutionary motif and its biological and structural significance. J. Mol. Evol. 44: S38-S43.
Wallace, D.C. 1995. Mitochondrial DNA variation in human evolution, degenerative disease and aging. Am. J. Hum. Genet. 57: 201-223.
Winkler, G. and Lutz, G.F.H. 2003. Image analysis, random fields and Markov chain Monte Carlo methods: A mathematical introduction. Applications of Mathematics Series. Springer, Heidelberg, Germany.
http://www.hvrbase.org/; The URL of the mtDNA database. http://droog.gs.washington.edu/parc/data/ldlr/welcome.htm; URL of the LDL receptor.
Received January 15, 2004; accepted in revised format October 21, 2004.