Vol. 12, Issue 4, 602-612, April 2002
LETTER
ABSTRACT
Previous studies have reported that about 85% of human diversity at
Short Tandem Repeat (STR) and Restriction Fragment Length Polymorphism (RFLP) autosomal loci is due to differences between individuals of the same population, whereas differences among continental groups account for only 10% of the overall genetic variance. These findings conflict with popular notions of distinct and
relatively homogeneous human races, and may also call into question the
apparent usefulness of ethnic classification in, for example, medical
diagnostics. Here, we present new data on 21 Alu insertions in 32 populations. We analyze these data along with three other large,
globally dispersed data sets consisting of apparently neutral biallelic
nuclear markers, as well as with a β-globin data set possibly subject
to selection. We confirm the previous results for the autosomal data,
and find a higher diversity among continents for Y-chromosome loci. We
also extend the analyses to address two questions: (1) whether
differences between continental groups, although small, are
nevertheless large enough to confidently assign individuals to their
continent on the basis of their genotypes; (2) whether the observed
genotypes naturally cluster into continental or population groups when
the sample source location is ignored. Using a range of statistical methods, we show that classification errors are at best around 30% for
autosomal biallelic polymorphisms and 27% for the Y chromosome. Two
data sets suggest the existence of three and four major groups of
genotypes worldwide, respectively, and the two groupings are inconsistent. These results suggest that, at random biallelic loci,
there is little evidence, if any, of a clear subdivision of humans into
biologically defined groups.
INTRODUCTION
In various areas of applied genetics, it is
customary to regard the human species as divided into distinct and
objectively recognizable groups. Forensic scientists compare DNA
profiles from the place of a crime with databases from the general
population, usually grouped into broad racial categories (for instance,
African-American, European-American, Asian, and Hispanic), to
estimate the probability that an unrelated individual would have the
identical DNA profile. The markers chosen for DNA profiling are
considered to be essentially uniform across populations of the same
category. Although the existence of problems with group definition has
been acknowledged (e.g., Weir 2001), the fact that some individuals may
not be easy to allocate to any such group is usually regarded as
unimportant (National Research Council 1992; Lander and Budowle 1994;
Morton 1994; Roeder 1994; Gill and Evett 1995). In clinical practice, a
correlation of racial affiliation, as assessed from skin color, facial
characteristics, hair texture, and so forth, with disease pathology and
drug response is widely believed to exist. A PubMed search with the
keywords "human races" (January 10, 2002) yielded 34,143 papers,
including Benar et al. (2001), Estrada and Billett (2001), Hartz et al.
(2001), Hoffman et al. (2001), and Shaw and Krause (2001).
In contrast, population studies have suggested that genetic
variation is essentially continuous through space among humans, and
have failed to identify a set of genetically distinct and internally
homogeneous groups. Regardless of whether estimated at the protein
(Lewontin 1972; Latter 1980), craniometric (Relethford 1994), or DNA
(Barbujani et al. 1997; Jorde et al. 2000) level, individual
differences between members of the same population have been reported
to account for about 85% of the overall genetic diversity, and
differences between populations within the same continent account for a
further 5% to 10%. Only about 10% of variation can be assigned to
differences between continental groups.
The existence of such different views in related areas of science
probably has more than one cause, but a clearer picture of human genetic
diversity is necessary to at least reduce the levels of disagreement.
One open problem has to do with the exact amount of genetic
diversity that can be attributed to the various levels of population
subdivision. The figures mentioned earlier were estimated from
populations that were well separated in space, and therefore they may
exaggerate the between-group component of genetic variance. On the
other hand, most of the markers studied are thought to be approximately
neutral, which may have the opposite effect if between-group variation
reflects adaptation to spatially variable factors such as climate. A
second question is whether the apparently continuous distribution of
genetic variation implies in practice that meaningful groups cannot be
identified, as has been argued by Templeton (1999). In fact, genetic
variances among groups, although small, are significantly greater than
zero at several loci. This suggests that, by jointly considering many loci, distinct groups might emerge, even though those groups cannot be
discriminated at the single-locus level.
We shall start by borrowing a definition from evolutionary and
conservation genetics. In those fields, races or subspecies are defined
as recognizable lineages within a species that have diverged
genetically because mating barriers have separated them for a
sufficiently long time (Templeton 1999; see also Pennock and Dimmick
1997). We shall ask if there is genetic evidence that the human species
is subdivided into groups of that kind. To address that question, we
consider fast-evolving DNA markers as less than optimal.
Indeed, STR and mitochondrial polymorphisms, such as most of
those considered by Barbujani et al. (1997) and Jorde et al. (2000),
have high mutation rates, and hence their patterns of variation are
likely to reflect relatively recent divergence. Evidence of long-term
subdivision among populations, if any, is more likely to be found by
analyzing slow-mutating DNA sites, typically biallelic polymorphisms,
which presumably evolved only once ("unique-event polymorphisms";
see Markovtsova et al. 2000) in human history.
For that purpose, we typed 21 Alu insertion polymorphisms in population samples from five continents, and we analyzed published biallelic DNA polymorphism data at several other nuclear loci, both autosomal and Y linked. On the five data sets thus assembled, we estimated the components of variance that can be attributed to differences between individuals, between populations of the same continent, and between continental groups, and we compared our estimates with previously published values. In addition, we investigated with what degree of accuracy individuals can be attributed to their continent on the basis of their genotypes, and which are the most likely clusters of individuals that can be inferred from multilocus genotypes, regardless of their geographical provenance.
RESULTS AND DISCUSSION
Alu Insertion Frequencies
Table 1 reports the
frequencies of the alternative alleles (presence or absence of the
insertion, the latter representing the likely ancestral state; Watkins
et al. 2001) at the 21 loci typed in this study. The standardized
genetic variance, Fst, summarizes for each locus the global
differentiation among populations of all continents. Values of
Fst close to 15% are often observed in worldwide analyses of
humans (Cavalli-Sforza et al. 1994). By that standard, most Alu loci of this study
display what can be considered "normal" levels of interpopulation diversity.
Genetic Differences among Continental Groups
For both Alu-insertion data sets (Alu8 and Alu21, comprising
information, respectively, on 8 and 21 loci), an analysis of molecular
variance, AMOVA (Excoffier et al. 1992), was run once for each locus,
and once for the multilocus genotypes. The Y-chromosome (Y98, Y99 data
sets) and β-globin (BGL data set) data were each subjected to
independent runs of AMOVA using all the available sequence information.
For most Alu loci, for the compound Alu genotypes, and for the BGL data
set, around 80% of the overall genetic diversity is allocated to
differences among members of the same sample (Table 2). About 10% is attributed to differences among populations within the same continent (less for the BGL data set,
where most continents, however, were represented by only one
population), and the rest, a little over 10%, to differences among
continents. The exceptions are one Alu locus (FXIIIB) and the Y
chromosome, which show a lower component of variance within populations
(between 42 and 46%) and a higher component between continental groups
(close to 40%). Even for these loci, the greatest fraction of genetic
variance occurs within populations.
In two previous Y-chromosome studies based, respectively, on a
combination of SNP and STR markers and on STR markers alone, Hammer et al. (2001) and Jorde et
al. (2000) estimated lower variances among continents and higher
variances within populations. In Jorde et al.'s (2000) STR study, in
particular, variances between continents were practically zero. The
simplest explanation is that different mutational mechanisms generate
diversity at biallelic sites and at STR loci. Because of the higher
mutation rate of the latter (average 2.8 × 10⁻³ per locus
per generation; Kayser and Sajantila 2001) and of probable constraints
on allele size (Deka et al. 1999), it seems that most populations tend
to approach a common allelic distribution for those markers.
Conversely, biallelic polymorphisms mutate more slowly (about
5 × 10⁻⁷ per site per generation; Jobling et al. 1997),
and therefore their distribution reflects the effects of
demographic history more than those of mutation.
The genetic variances among continents inferred in this study from the
Y-chromosome data sets are greater than those observed for autosomal
markers. The same is true, to a lesser extent, of mtDNA, where the
fraction of variance between continents, 12.5%, is still higher than
for the nuclear genes of the same study (Seielstad et al. 1998). These results were expected under a model of neutral evolution, driven by genetic drift and gene flow. If selection is
negligible, the genetic variance among populations (Fst)
tends to reach an equilibrium value, which, in Wright's (1969)
classical model, is inversely related to the product of N, the
population size, and m, the gene flow rate. We do not know
whether our populations are at equilibrium, but N of
mitochondrial and Y-chromosome loci is one-fourth that of autosomal
loci, so that the impact of drift is greater on the former. Therefore,
it seems that the genetic drift that has been going on since the
(apparently recent and incomplete) separation of continental human
groups has so far been able to generate appreciable differences only at
uniparentally transmitted loci.
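The dependence of equilibrium Fst on N and m can be illustrated with a short numerical sketch. It assumes the common textbook island-model approximation Fst ≈ 1 / (1 + 4Nm), which is not written out in the text above, and the values of N and m are hypothetical, chosen only to show the direction of the effect when N is divided by four.

```python
def equilibrium_fst(n_effective: float, m: float) -> float:
    """Island-model approximation (assumed here): Fst ~ 1 / (1 + 4 * N * m)."""
    return 1.0 / (1.0 + 4.0 * n_effective * m)


# Hypothetical parameter values, for illustration only (not estimates from the study).
N = 10_000   # effective population size for autosomal loci
m = 0.0005   # gene flow rate per generation

fst_autosomal = equilibrium_fst(N, m)        # 4 * N * m = 20
fst_uniparental = equilibrium_fst(N / 4, m)  # N is one-fourth, so 4 * (N / 4) * m = 5

print(f"autosomal Fst   ~ {fst_autosomal:.3f}")    # 1/21, about 0.048
print(f"uniparental Fst ~ {fst_uniparental:.3f}")  # 1/6, about 0.167
```

Under these assumed values the same gene flow rate leaves several times more among-population variance at uniparentally transmitted loci, which is the qualitative pattern described above.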
Inferring the Geographic Origin of a Genotype
To test the extent to which continents are associated with specific
sets of alleles, we initially disregarded the geographic information.
We took one genotype at a time, and attributed it to its most likely
continent according to eight methods of discriminant analysis, three of
them parametric and five nonparametric (listed in the caption to Table 3). Then we calculated the rate
of misassignment for each method, that is, the percentage of
individuals who were wrongly allocated (Table 3). Among nuclear loci,
the worst results are obtained for the BGL data set: more than half of
the genotypes are misclassified. The Alu data sets give better results,
as would be anticipated for multilocus data. The NNET and LOG methods
each give error rates of 38% and 32% for Alu8 and Alu21, respectively, whereas with the RM method, 37% and 30% of the genotypes, respectively, were misclassified.
For the Y-chromosome data sets, the parametric methods again perform
poorly, with misclassification rates of at least 70%. However, NNET, 1NN,
and 3NN each give an error rate of 27%, better than the 40% obtained
using the RM method and also better than the error rates obtained using
the same methods for the multilocus data sets. Although only a single
locus, the Y chromosome is relatively powerful for discrimination
because its between-group variance is higher, as revealed by AMOVA
(this study; Hammer et al. 2001) and by previous independent studies
(Underhill et al. 2000).
Table 4 is an example of the so-called
confusion matrix for the classification of individuals from the Alu8
data set using the method that gave the lowest rate of
misclassification, NNET. The entry in row i and column
j gives the number of individuals drawn from continent
i that are classified into continent j (so that the
diagonal entries correspond to correct classifications). Table 5 gives the same information
for the Y99 data set, again using the method giving the most accurate
results for that data set, 1NN. At the bottom of Table 5, Asia was
further subdivided into two subregions, which caused an increase in the
misclassification rate. Overall, the results indicate poor
discrimination, with even the best method and data set leading to
nearly 30% misclassification. Even our relatively large data sets do
not suffice to allow accurate assignment of individuals into their
continent.
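As a minimal sketch of how a misclassification rate is read off a confusion matrix of the kind described above, the snippet below sums the off-diagonal entries and divides by the total. The counts are hypothetical and are not taken from Table 4 or Table 5.

```python
def misclassification_rate(confusion):
    """Fraction of individuals falling off the diagonal of a square confusion matrix.

    confusion[i][j] = number of individuals from group i classified into group j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return 1.0 - correct / total


# Hypothetical counts for three groups (illustration only).
toy_confusion = [
    [80, 15, 5],
    [10, 70, 20],
    [5, 25, 70],
]

rate = misclassification_rate(toy_confusion)
print(f"misclassified: {rate:.1%}")  # 80 of 300 individuals are off the diagonal
```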
To test how the number of loci considered affects these results, we repeatedly (500 times) analyzed subsets of the Alu21 data set, consisting of increasing numbers of loci chosen at random from the 21 available. The rate of misclassification decreased rapidly at the very beginning, but then leveled off (Fig. 1), suggesting that the error rate would not fall to zero even if the number of loci were increased further. This supports the view that the error rate reflects, in part, factors other than the limited number of loci considered, including genetic exchanges leading to extensive allele sharing among populations.
In principle, these results might be explained by the presence of a few hybrid populations at the boundaries between continents, but that proves not to be the case. If one looks at the geographic origin of the misclassified individuals (Table 6 is an example), it is evident that many other populations contain genotypes or haplotypes that discriminant analysis classifies along with those of another continent. For the Y99 data set, there seems to be a tendency to misclassify individuals from populations at nearby latitudes. Seventy-four European haplotypes are wrongly allocated to Africa, and they are mostly Greeks and Italians. Conversely, among the 90 Europeans that are allocated to North-Central Asia, most are British, Russians, and Germans. In the Americas, most of the 152 individuals allocated to North-Central Asia come from Northern and Central America (Tanana, Cheyenne, Pima, Havasupay, and Pueblos), whereas the misclassification rate decreases as one moves southward. The results are less clear for the Alu data sets (data not given), where we could not recognize a clear pattern of misclassification. This could be due to factors such as the lower number of populations available, their spatial distribution, or the already discussed effect of population sizes. However, a crucial factor to consider is the absence of recombination for Y-chromosome markers, so that migrant Y chromosomes may convey evidence of their source population for many generations.
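The locus-resampling experiment described above (random subsets of increasing size, repeated many times) can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used: the classifier is a leave-one-out 1-nearest-neighbor rule, the data are simulated presence/absence scores, and the function names, allele frequencies, and the reduced number of repetitions are all hypothetical.

```python
import random


def loo_1nn_error(genotypes, labels, loci):
    """Leave-one-out 1NN error rate using only the locus indices in `loci`."""
    errors = 0
    for i, g in enumerate(genotypes):
        best_j, best_d = None, None
        for j, h in enumerate(genotypes):
            if i == j:
                continue
            d = sum(g[l] != h[l] for l in loci)  # mismatches over the chosen loci
            if best_d is None or d < best_d:
                best_j, best_d = j, d
        errors += labels[best_j] != labels[i]
    return errors / len(genotypes)


def error_by_locus_count(genotypes, labels, n_reps=500, seed=0):
    """Mean error rate for random subsets of 1, 2, ..., all loci."""
    rng = random.Random(seed)
    n_loci = len(genotypes[0])
    curve = {}
    for k in range(1, n_loci + 1):
        runs = [
            loo_1nn_error(genotypes, labels, rng.sample(range(n_loci), k))
            for _ in range(n_reps)
        ]
        curve[k] = sum(runs) / n_reps
    return curve


# Simulated toy data: two groups with hypothetical insertion frequencies at six loci.
sim = random.Random(1)
freqs = {"A": [0.2, 0.5, 0.7, 0.4, 0.6, 0.3], "B": [0.6, 0.5, 0.3, 0.7, 0.4, 0.5]}
genotypes, labels = [], []
for group, fs in freqs.items():
    for _ in range(30):
        genotypes.append([int(sim.random() < f) for f in fs])
        labels.append(group)

curve = error_by_locus_count(genotypes, labels, n_reps=20)
for k, err in curve.items():
    print(k, round(err, 3))
```

Plotting `curve` against the number of loci gives the kind of error-versus-loci profile that Fig. 1 reports for the real data; the toy numbers themselves carry no biological meaning.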
Still, the greater differences among continents observed for Y-chromosome markers (all of them mapping on the nonrecombining portion of the chromosome) did not lead to a much better allocation of genotypes of unknown origin. Haplotypes are transmitted as a single unit, and can in principle be followed through time and place. On the other hand, each of those haplotypes can be regarded as one variant of a multiallelic locus, and the power of discriminant analysis increases with the number of independent loci considered (compare the Alu and Y-chromosome data sets in Table 3).
The low accuracy of discriminant analysis does not depend on a poor definition of the continental groups considered. In fact, the more groups are considered, the higher the misclassification rate. Compare, for example, the top and the bottom of Table 5; one-fifth of North-Central Asians were classified as South-Eastern Asians and vice versa, when these two groups were separately considered. At a subcontinental level, inferring the geographic origin of a person from her/his genotype becomes more complicated, and therefore it seems unlikely that classification errors would be reduced by choosing among a higher number of potential origins.
Most misclassified individuals were assigned to Europe and Asia in Table 4 (Alu8 data set), and to Africa and Asia in Table 5 (Y99 data set). Many individuals from Australia and the Americas were attributed to Asia, where the first settlers came from, and 13% (Table 4) of Australians, intriguingly, to Africa. These observations suggest that the distribution of misclassified individuals reflects, at least in part, past population movements. We are currently developing a formal model to infer past migrations from the results of discriminant analysis.
Inferring Population Structure from Genotypes
So far we have been trying to assign individuals to groups defined a
priori on the basis of geography. An alternative is to identify groups
a posteriori on the basis of genotypes, namely, to cluster genotypes
until a certain number of genetically homogeneous groups is defined.
The program STRUCTURE (Pritchard et al. 2000)
infers the most likely number of such groups, and assigns individuals
to each of them, on the basis of probabilities estimated from a set of
independently transmitted loci. Because Y-chromosome markers are genetically
linked, this approach was suitable to analyze only the Alu data sets.
The most likely number of groups, k, was estimated as three
and four, respectively, for Alu8 and Alu21 (Table 7). All alternatives could be rejected with
a high level of confidence. Individuals were then associated with
posterior probabilities of belonging to each of the previously identified
groups, and were assigned to the most likely group (Table 8; had we chosen to attribute an individual
to a group only when one of those posterior probabilities is higher
than 50%, 287 genotypes of the Alu8 data set and 162 of the Alu21
data set would have been unclassified). The results inferred from the
two data sets differ markedly. Not only is the number of groups
different, but also the geographical ranges of the groups do not
overlap. For the Alu8 data set, the analysis suggests the existence of
a largely Eurasian group, plus two groups whose distribution is
essentially worldwide. Conversely, for the Alu21 data set, all African
and most Oceanian genotypes fall into the first group, whereas the
other three groups roughly correspond to Asia and the Americas (2), and
Eurasia (3 and 4). Using a set of X-chromosome data and the same
method, Wilson et al. (2001) identified yet another set of groups
(four, roughly corresponding to Europe, New Guinea, Africa, and Asia).
Conclusions
Previous studies have shown that differences among continental
groups represent a rather small fraction of the global STR and RFLP
diversity of our species (Barbujani et al. 1997; Jorde et al. 2000;
Brown and Armelagos 2001). In this study we found that a
between-continent variance accounting for 5% to 20% of the total is
the rule also for numerous nuclear biallelic polymorphisms, on the
basis of independent loci typed in a large number of samples. We
identified an exception, Y-chromosome polymorphisms, and we tried to
better understand the evolutionary meaning of both the rule and the
exception. For that purpose, we asked what is the probability of
allocating an individual to the correct continent, on the basis of her
or his genotype. Different statistical methods gave somewhat different
results, but three conclusions appear justified: (1) most individuals
are allocated correctly, but (2) the rate of misclassification is never below 27%, and (3) the rate of misclassification is roughly the same,
whether allocation is based on autosomal or Y-chromosome polymorphisms,
although for the latter the variance among continents is four times as
large. New Y-chromosome data sets containing many new polymorphisms are being assembled (Underhill et al. 2000; Hammer et al. 2001), and their
analysis may somewhat modify details of this picture.
Continent-specific and population-specific polymorphisms do exist in
humans, and individuals carrying certain, generally pathologic, alleles can be assigned to a specific geographic area with a high
degree of confidence. Well-known examples are the alleles for Tay-Sachs
disease among Ashkenazi Jews, and for thalassemia in the Mediterranean
area. However (with one exception, the Duffy-null alleles in Africa),
very few members of those populations carry those rare alleles, and the
mutations that generated them are recent (Oddoux et al. 1999; Hamblin
and Di Rienzo 2000; Weatherall 2001). Alleles common in a continent,
and absent or nearly so elsewhere, which would support the existence of
a substantial ancestral differentiation among human groups, have been
identified in this or previous studies only on the Y chromosome. Even
the X-chromosome haplotypes that initially appeared to be restricted, respectively, to African and to non-African populations (Harris and Hey
1999) turned out to be shared across continents when sample sizes were
increased (Yu and Li 2000).
In summary, discriminant analysis confirms the existence of some degree
of geographical structuring in humans, contra Templeton (1999). If one
considers a set of biallelic loci from an individual's genome, and
asks which continent that genotype comes from, the answer will be
correct most of the time. However, even when jointly considered, all of
the markers we could use, including those of the Y chromosome, did not
prove able to assign more than 70% of the individuals to their
continent of origin. That is not what one would expect, if the human
species were subdivided, and deep genetic discontinuities existed among
continental groups. We have also shown, albeit in a relatively small
sample, that the genetic variances among continents at a locus
undergoing selection, β-globin, are not greater than those estimated
at neutral loci.
The genetic uniformity of the human species contrasts with what is
observed for other large mammals (reviewed in Templeton 1999), whose
populations tend to be more diverse, even when restricted to a much
narrower geographic range. Groups occupying distinct territories, and
each characterized by peculiar combinations of genes that are absent or
at least rare elsewhere, can be found among gorilla (Ruvolo et al.
1994), chimpanzee and bonobo (Gagneaux et al. 1999), gray wolf and
elephant (Templeton 1999), and gazelle (Arctander et al.
1996), but not, so far, in humans. Two, not necessarily
alternative, explanations seem reasonable, namely: (1) a comparatively
recent common ancestry of all modern humans, so that there has been
little time for groups to diverge, and (2) gene flow rates high enough
to homogenize groups.
Our attempt to identify major human groups by clustering genotypes
yielded contradictory results. Different numbers of groups, and
different distributions of genotypes within such groups, were observed.
Moreover, these results do not overlap with those of another study
(Wilson et al. 2001) based on the same method and different data. These
observations suggest that there is no reason to expect that the same
groups will be identified on the basis of different sets of genes. As a
consequence, both for evolutionary studies and practical applications
(such as predicting liability to certain diseases or response to
certain drugs), what seems to matter is the individual genotype, much
more than the ethnic or geographic affiliation.
This study shows that, by assuming homogeneity of individuals within their continent, one disregards between 8% (as estimated in the Alu21 data set) and 17% (Y99 data set) of the total biallelic human diversity (Table 2). The practical consequences of that depend on the composition of the population studied, and may be trivial in forensic applications, especially if STR markers are used, if the populations are homogeneous. However, the error may not be negligible in metropolitan areas, or where very different communities coexist. In those cases, up to 30% of individuals (Table 3) may carry genotypes that appear so different from the bulk of the others that discriminant analysis would assign them to another continent. As for clinical practice, it seems a clear distinction should be made between population-specific polymorphisms (which exist, albeit rare, and may be a useful diagnostic tool), and continent- or (the term could then be appropriate) race-specific genetic polymorphisms. To the best of our knowledge, no example of the latter has been described in humans. Going back to the evolutionary definition of race that we cited in the Introduction, this study found no evidence suggesting the existence in humans of recognizable lineages that have diverged because they have long been separated by reproductive barriers. The present study of biallelic, presumably ancient, polymorphisms does not suggest that there is a basis for an objective and unequivocal definition of distinct biological groups within the human species.
METHODS
DNA Samples and Alu Genotyping
Twenty-one Alu insertion polymorphisms were typed in 1330 individuals from 32 populations (listed in Table 1 along with the sample sizes). The cell lines used to isolate control DNA samples were
as follows: human (Homo sapiens), HeLa (ATCC CCL2);
chimpanzee (Pan troglodytes), Wes (ATCC CRL1609);
gorilla (Gorilla gorilla), Ggo-1 (primary gorilla
fibroblasts) provided by Dr. Stephen J. O'Brien, National Cancer
Institute, Frederick, MD, USA. Cell lines were maintained as directed
by the source and DNA isolations were performed using Wizard genomic
DNA purification (Promega). Diverse human DNA samples were isolated
from peripheral blood lymphocytes (Ausubel et al. 1996), most of which
had been collected for previous studies (Stoneking et al. 1997; Nasidze
and Stoneking 2001). African-American, Bantu speakers,
Hispanic-American, Hungarian, Syrian, and Yanomamo DNA samples were
available in Batzer's laboratory.
PCR amplification of each Alu insertion polymorphism was performed in
25-µL reactions using 50-100 ng of target DNA, 40 pM of each
oligonucleotide primer, 200 µM dNTPs in 50 mM KCl, 1.5 mM MgCl2, 10 mM Tris-HCl (pH 8.4) and Taq DNA polymerase (1.25 units), as
recommended by the supplier (Life Technologies). Each sample was
subjected to the following amplification cycle: an initial denaturation
of 2 min 30 s at 94°C, followed by 32 cycles of 1 min of denaturation at 94°C, 1 min at the annealing temperature, and 1 min of extension at 72°C,
and then a final extension at 72°C for 10 min. Twenty microliters of each sample was fractionated on a 2%
agarose gel with 0.25 µg/mL ethidium bromide. PCR products were
directly visualized using UV fluorescence. The sequences of the
oligonucleotide primers, annealing temperatures, PCR product sizes, and
chromosomal locations for the loci have been reported previously (Arcot
et al. 1996, 1997, 1998; Stoneking et al. 1997).
Data Sets
Five data sets were considered in this study (Table 9). The Alu21 data set represents the newly
typed, unlinked autosomal Alu insertion polymorphisms in the 32 populations of Table 1.
Four additional data sets were assembled and analyzed. The Alu8 data
set includes Alu insertion genotypes at 8 autosomal loci (TPA25, PV92,
APO, ACE, FXIIIB, D1, A25, B65), with 1500 individuals from 32 worldwide populations (Stoneking et al. 1997), representing 1331 distinct multilocus genotypes. Further populations were incorporated into the Alu8 database, bringing the total number of populations to 46. Although all loci of the Alu8 data set are also found in the Alu21 data set, the populations are not the same (for example, Eastern Asian populations are absent from Alu21), and therefore we
chose to analyze the two databases separately. The Y98 and Y99 data
sets are drawn from two surveys (Hammer et al. 1998; Karafet et al.
1999) of 12 biallelic Y-chromosome polymorphisms, defining respectively
12 and 14 distinct haplotypes in 1544 males from 35 populations (Y98),
and 2198 males from 60 populations (Y99). The final data set comes from
a study of a 3-kb region encompassing the β-globin gene in nine
populations (Harding et al. 1997) (BGL data set). The gene tree
constructed from 326 sequences includes 29 haplotypes, 13 of them
apparently resulting from recombination or gene conversion. The latter
haplotypes are all rare, and to avoid choosing arbitrary weights for
nucleotide substitution, recombination, and gene conversion events, we
chose to consider only the remaining 16 haplotypes, assuming they have
been generated by nucleotide substitution alone.
The geographical distribution of the samples for each data set is shown in Figure 2. The Y99 and Y98 data sets have the widest global coverage. Alu8 has samples almost everywhere apart from North Asia, whereas Alu21 and BGL are smaller data sets; in particular, for most continents, BGL has only one sample.
Statistical Analysis: AMOVA
The genetic differences within and among population samples were
quantified, and their significance was assessed, using AMOVA, a
nonparametric method for the analysis of variance suitable for molecular data (Excoffier et al. 1992). Genetic variances were estimated from allele-frequency differences between populations, and
from measures of molecular difference between alleles. The overall
genetic variance was then subdivided into three hierarchical components: between individuals within populations, between populations of the same group, and between groups. Because morphological studies of
the last two centuries led to lists of human races containing from 3 to
200 items (Armelagos 1994; Barbujani 2001), and in the absence of other
solid criteria for group definition, we decided to use groups
corresponding to continents. The significance of the variance
components was tested by a randomization approach. Each individual, or
population, was reassigned to a random location, according to three
resampling schemes. The molecular variances were recalculated, and the
procedure was repeated 1000 times to obtain empirical null
distributions of all relevant variances.
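The randomization logic can be illustrated with a deliberately simplified sketch. It is not AMOVA itself: it uses a single biallelic locus, a single level of grouping, and the among-group sum of squares as the test statistic, and it follows only one resampling scheme (shuffling individuals among populations). The function names and the toy allele counts are hypothetical.

```python
import random


def among_group_ss(values, groups):
    """Among-group sum of squares for 0/1 allele scores."""
    grand_mean = sum(values) / len(values)
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    return sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2 for vs in by_group.values()
    )


def permutation_p_value(values, groups, n_perm=1000, seed=0):
    """Empirical p-value: share of shuffles with a statistic at least as large as observed."""
    rng = random.Random(seed)
    observed = among_group_ss(values, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign each allele copy to a random population
        if among_group_ss(values, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical data: 20 allele copies per population, insertion present (1) or absent (0).
alleles = [1] * 18 + [0] * 2 + [1] * 3 + [0] * 17
pops = ["P1"] * 20 + ["P2"] * 20

observed, p_value = permutation_p_value(alleles, pops)
print(observed, p_value)
```

A full AMOVA would partition the variance over three hierarchical levels and permute at each level separately; the sketch only shows how an empirical null distribution is built from 1000 reshufflings.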
Statistical Analysis: Discriminant Analysis
In discriminant analysis, also known as supervised classification
(Ripley 1996), variables measured on individuals whose grouping is
known (the training data set) are combined to construct a new variable that can be used to classify individuals (or, in our case,
genotypes) of unknown group (query genotypes). For the
analyses described below, we have used the classification functions implemented for the statistical package Splus that are freely available
on the StatLib server (S Archive: http://lib.stat.cmu.edu/) (Venables
and Ripley 1997). The genotypes were coded as strings of binary digits,
so that distances estimated between pairs of individuals reflected the
minimum (and most likely) number of mutational events separating them.
This coding allows one to use both parametric and nonparametric forms
of discriminant analysis.
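One way such a binary coding could work for Alu insertion genotypes is sketched below. The exact coding used in the study is not given in the text, so the scheme here (two binary digits per locus, one per chromosome copy carrying the insertion) is an assumption, and the function names are hypothetical. With this coding, the Hamming distance between two individuals equals the summed difference in insertion counts across loci.

```python
def encode_genotype(insertion_counts):
    """Code each locus (0, 1, or 2 insertion copies) as two binary digits.

    0 copies -> 0,0   1 copy -> 1,0   2 copies -> 1,1   (assumed coding)
    """
    bits = []
    for count in insertion_counts:
        bits.extend([1 if count >= 1 else 0, 1 if count == 2 else 0])
    return bits


def hamming(a, b):
    """Number of positions at which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


# Two hypothetical individuals typed at three loci.
ind1 = encode_genotype([0, 1, 2])  # [0, 0, 1, 0, 1, 1]
ind2 = encode_genotype([2, 1, 0])  # [1, 1, 1, 0, 0, 0]

distance = hamming(ind1, ind2)
print(distance)  # 4 = |0-2| + |1-1| + |2-0|
```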
We initially considered three standard parametric methods, namely,
linear (LDA), logistic (LOG), and quadratic (QDA) discriminant analysis. All of these assume a specific parametric model for the variables (approximate normality in the case of LDA and QDA), which does not hold for our data. They performed poorly and are not discussed further here, although some results are
reported for comparison. We then resorted to four standard
nonparametric methods, which do not assume a probabilistic model for
the observations, namely, a neural network (NNET), Gaussian kernel
density estimation (KER), and k-nearest neighbor with
k = 1 (1NN) and k = 3 (3NN) (Venables and Ripley 1997).
Neural networks are collections of mathematical models, and related
computer programs, which identify patterns in a data set by emulating
some properties of biological nervous systems and by drawing on the
analogies of adaptive biological learning (Jennions and Brooks 2001).
NNETs are composed of a large number of interconnected processing
elements that are analogous to neurons, and are tied together with
weighted connections that are analogous to synapses. By a process of
trial and error, nonlinear functions are estimated from randomly
sampled subsets of the original data set. These functions are then used
by NNETs to classify the remaining genotypes. The goodness of the
classification obtained is evaluated, and iterations are run until a
desired level of accuracy is obtained.
The KER method starts by estimating the density of the genotype frequencies in each group, and then assigns a new observation (genotype) to the group for which its estimated density is maximal. Lastly, in the simplest method, k-nearest neighbor, each genotype is assigned to the group most represented among its k nearest genotypes in the training set (k = 1 for what we refer to as 1NN, k = 3 for 3NN).
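The two simplest nonparametric rules can be sketched in Python as follows. This is an illustration only, not the Splus functions used in the analyses: it assumes genotypes already coded as 0/1 vectors, a Gaussian kernel with a hypothetical bandwidth of 1.0, and arbitrary resolution of ties in the neighbor vote.

```python
import math
from collections import Counter


def hamming(a, b):
    """Number of differing positions between two 0/1 vectors."""
    return sum(x != y for x, y in zip(a, b))


def knn_classify(query, training, k=3):
    """training: list of (bit_vector, group) pairs.
    Returns the group most represented among the k nearest genotypes."""
    neighbours = sorted(training, key=lambda item: hamming(query, item[0]))[:k]
    votes = Counter(group for _, group in neighbours)
    return votes.most_common(1)[0][0]


def kernel_classify(query, training, bandwidth=1.0):
    """Returns the group whose Gaussian kernel density estimate at the
    query genotype is highest."""
    totals = {}
    counts = Counter()
    for vector, group in training:
        # For 0/1 vectors the squared Euclidean distance equals the
        # Hamming distance, so it can be used directly in the exponent.
        squared_distance = hamming(query, vector)
        weight = math.exp(-squared_distance / (2 * bandwidth ** 2))
        totals[group] = totals.get(group, 0.0) + weight
        counts[group] += 1
    return max(totals, key=lambda g: totals[g] / counts[g])
```

Calling `knn_classify` with `k=1` or `k=3` corresponds to the 1NN and 3NN variants, respectively.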
The previously described classification methods are not specifically
designed for genetic data. We also implemented a method (RM) that, for
the autosomal data sets, exploits the assumptions of Hardy-Weinberg and
linkage equilibria (independence within and between loci) to improve
the estimation of genotype relative frequencies in each group (Rannala
and Mountain 1997). The RM method uses Bayesian posterior expectations
given a symmetric Dirichlet prior distribution to overcome the
potential problem of zero frequency in the training data for an allele
observed in the query genotype. In the case of nonrecombining haploid
data, the equilibrium assumptions are not appropriate, and the natural analog of the RM method reduces to a trivial comparison of (posterior) haplotype frequencies.
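The idea behind this kind of assignment can be sketched in Python for biallelic autosomal loci. The sketch is a simplification and does not reproduce the exact expressions of Rannala and Mountain (1997): it plugs the posterior mean allele frequency into Hardy-Weinberg proportions and multiplies across loci. The prior parameter `alpha = 0.5` and all function names are assumptions made for illustration.

```python
import math


def posterior_allele_freq(count, total, alpha=0.5, n_alleles=2):
    """Posterior mean frequency under a symmetric Dirichlet(alpha) prior.
    With alpha > 0 the estimate is never exactly 0 or 1."""
    return (count + alpha) / (total + n_alleles * alpha)


def log_genotype_prob(genotype, group_genotypes, alpha=0.5):
    """Log probability of a multilocus genotype in one group, assuming
    Hardy-Weinberg and linkage equilibria. Genotypes are lists giving the
    number of copies (0, 1, or 2) of one allele at each locus."""
    n_chromosomes = 2 * len(group_genotypes)
    log_prob = 0.0
    for locus, copies in enumerate(genotype):
        allele_count = sum(g[locus] for g in group_genotypes)
        p = posterior_allele_freq(allele_count, n_chromosomes, alpha)
        hardy_weinberg = {0: (1 - p) ** 2, 1: 2 * p * (1 - p), 2: p ** 2}
        log_prob += math.log(hardy_weinberg[copies])  # independence across loci
    return log_prob


def rm_classify(genotype, training, alpha=0.5):
    """training: dict mapping group name -> list of genotypes.
    Returns the group in which the query genotype is most probable."""
    return max(
        training,
        key=lambda group: log_genotype_prob(genotype, training[group], alpha),
    )
```

Because the prior adds a pseudocount to every allele, an allele absent from a group's training data still receives a small positive frequency, which is the zero-frequency problem the prior is meant to address.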
For the Alu8 and Alu21 data sets, we included only those individuals with complete information (over all 8 or 21 loci), reducing the sample sizes to 1331 and 477 individuals, respectively. For these data sets, the variables involved in the discriminant analysis are the individual genotypes at each locus. For the other data sets (BGL, Y98, and Y99), the entire haplotype is treated as a single variable. One by one, each individual's known source population is temporarily ignored, and each of the classification methods is applied to classify that individual into her or his most likely source population, with all the other individuals serving as the training set. At the end of this cross-validation procedure, the proportion of correct continental allocations was recorded.
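The leave-one-out procedure can be sketched in Python as follows. This is a minimal illustration with hypothetical names: it scores the group label directly rather than first assigning a source population and then checking its continent, and it uses a simple nearest-neighbor rule as a stand-in for any of the classifiers.

```python
def nearest_neighbour(query, training):
    """Stand-in classifier: group of the single closest training genotype."""
    closest = min(
        training,
        key=lambda item: sum(x != y for x, y in zip(query, item[0])),
    )
    return closest[1]


def leave_one_out_accuracy(samples, classify):
    """samples: list of (bit_vector, group) pairs.
    classify: function (query, training) -> predicted group.
    Each sample is held out in turn and classified from all the others."""
    correct = 0
    for i, (features, true_group) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]  # everyone except sample i
        if classify(features, training) == true_group:
            correct += 1
    return correct / len(samples)
```

The classification error rate is then one minus the returned proportion.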
Statistical Analysis: Inference of Population Structure
We estimated the most likely number of genetically homogeneous
groups in the data sets, and assigned each individual to her or his
most likely group by means of an approach implemented in the program
STRUCTURE (Pritchard et al. 2000). Multilocus genotypes
are considered, and no particular mutational model is assumed. Each
individual's genotype is considered to result from a mixture of
contributions originating in k population groups, and
q(i) is the fraction of the genes of that individual that come
from the ith group defined. Under the assumption that each of
the populations is in Hardy-Weinberg equilibrium, k is
estimated by a Markov chain Monte Carlo algorithm. Then, for each
individual, regardless of her or his geographical provenance, the
vector q(1), q(2), ..., q(k) is estimated, and ultimately each
individual can be assigned to one of the inferred groups, that is, the
one with the highest probability.
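The final assignment step amounts to picking the largest component of each individual's membership vector. A minimal Python sketch, assuming the vector has already been estimated (the estimation itself is not shown, and the function name is hypothetical):

```python
def assign_to_group(q):
    """q: list of membership fractions q(1), ..., q(k) for one individual.
    Returns the 1-based index of the group with the largest fraction,
    together with that fraction."""
    if not q:
        raise ValueError("membership vector is empty")
    if abs(sum(q) - 1.0) > 1e-6:
        raise ValueError("membership fractions should sum to 1")
    best = max(range(len(q)), key=lambda i: q[i])
    return best + 1, q[best]
```

For an individual with fractions 0.1, 0.7, and 0.2, the function returns group 2 with a membership of 0.7; when the fractions are nearly equal, the assignment carries correspondingly little support.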
| |
ACKNOWLEDGMENTS |
|---|
We thank Walter Fitch, Giorgio Bertorelle, Ryan Brown, George Armelagos, Joseph Terwilliger, Lorena Madrigal, and two anonymous reviewers for many comments and suggestions. This research was supported by grants from the University of Ferrara, from the Louisiana Board of Regents Millennium Trust Health Excellence Fund HEF (2000-05)-05, (2000-05)-01, and (2001-06)-02 (MAB), and awards 1999-IJ-CX-K009 and 2001-IJ-CX-K004 from the Office of Justice Programs, National Institute of Justice, Department of Justice (MAB). A 9-month stay of C.R. at the University of Reading was partly supported by funds of the Department of Applied Statistics. Points of view in this document are those of the authors and do not necessarily represent the official position of the U.S. Department of Justice.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| |
FOOTNOTES |
|---|
Present addresses: 7 CRIBI, Biotechnology Centre, University of Padua, Padua I-35121 Italy; 8 Department of Epidemiology and Public Health, Imperial College, St. Mary's campus, Norfolk Place, London W2 1PG, United Kingdom.
9 Corresponding author. University of Ferrara, Department of Biology, via L. Borsari 46, I-44100 Ferrara, Italy
E-MAIL bjg{at}unife.it; FAX (+39) 0532 249761.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.214902.
| |
REFERENCES |
|---|
|
|
|---|
Distribution and insertion polymorphism. Genome Res. 6: 1084-1092
-subunit locus. Genetics 155: 1481-1483
Received September 18, 2001; accepted in revised form February 12, 2002.