Vol. 10, Issue 10, 1532-1545, October 2000
LETTER
ABSTRACT
A common strategy for genotyping large samples begins with the
characterization of human single nucleotide polymorphisms (SNPs) by
sequencing candidate regions in a small sample for SNP discovery. This
is usually followed by typing in a large sample those sites observed to
vary in a smaller sample. We present results from a systematic
investigation of variation at the human apolipoprotein E locus
(APOE), as well as the evaluation of the two-tiered sampling strategy based on these data. We sequenced 5.5 kb spanning the entire
APOE genomic region in a core sample of 72 individuals, including 24 each of African-Americans from Jackson, Mississippi; European-Americans from Rochester, Minnesota; and Europeans from North
Karelia, Finland. This sequence survey detected 21 SNPs and 1 multiallelic indel, 14 of which had not been previously reported.
Alleles varied in relative frequency among the populations, and 10 sites were polymorphic in only a single population sample. Oligonucleotide ligation assays (OLA) were developed for 20 of these
sites (omitting the indel and a closely-linked SNP). These were then
scored in 2179 individuals sampled from the same three populations
(n = 843, 884, and 452, respectively). Relative allele frequencies were generally consistent with estimates from the core
sample, although variation was found in some populations in the larger
sample at SNPs that were monomorphic in the corresponding smaller core
sample. Site variation in the larger samples showed no systematic
deviation from Hardy-Weinberg expectation. The large OLA sample clearly
showed that variation in many, but not all, of the OLA-typed SNPs is
significantly correlated with the classical protein-coding variants,
implying that there may be important substructure within the classical
ε2, ε3, and ε4 alleles. Comparison of the levels and patterns of
polymorphism in the core samples with those estimated for the OLA-typed
samples shows how nucleotide diversity is underestimated when only a
subset of sites is typed and underscores the importance of adequate
population sampling at the polymorphism discovery stage.
[The sequence data described in this paper have been submitted to the GenBank data library under accession no. AF261279.]
INTRODUCTION
The human apolipoprotein E gene (APOE) encodes a single
chain polymorphic protein composed of 299 amino acids
that plays a key role in the transport and metabolism of plasma
cholesterol and triglycerides (Mahley and Huang 1999). APOE
harbors a globally distributed polymorphism that influences variation
in disease risk in human populations. There are three common isoforms
of apoE that differ in their amino acid sequence at residues 112 and
158, i.e., apoE2 (cysteine-cysteine), apoE3 (cysteine-arginine), and
apoE4 (arginine-arginine; Weisgraber 1994). These protein variants are
encoded by haplotypes involving two diallelic single nucleotide
polymorphisms (SNPs), located in the 3' exon, that together yield
the ε2, ε3, and ε4 alleles, respectively. Extensive association
studies with disease risk have been performed for these alleles (for
review, see de Knijff et al. 1994). These analyses reveal that the ε4
allele is associated with an increased risk for cardiovascular disease
(CVD; for review, see Davignon et al. 1999) and Alzheimer's disease
(Corder et al. 1993; Strittmatter et al. 1993; Meyer et al. 1998; Tang
et al. 1998).
The success of association studies with APOE has stimulated
the development of systematic approaches to find and type sequence variations on a large scale for use in candidate gene or genome-wide association studies (Lander and Schork 1994; Collins et al. 1997; Lai
et al. 1998; Martin et al. 2000; Przeworski et al. 2000). Although
APOE is often presented as a paradigm for SNP analysis in the
human genome, there has yet to be a systematic survey of sequence
variation within this gene. Furthermore, it has become clear that not
all individuals with the same APOE protein genotype are at
equivalent risk, and variants in the regulatory regions unrelated to
the protein isoforms have been identified that may have functional
relevance and complicate the simple subdivision into three haplotypes
(Mui et al. 1996; Artiga et al. 1998a,b; Bullido et al. 1998; Lambert
et al. 1998a,b). Therefore, we have undertaken a comprehensive analysis
of the genomic sequence of APOE to gain a better understanding
of the natural variation in this gene.
In this report, we present the sequence variation observed in APOE in 72 individuals (144 chromosomes) sampled from three populations (two of European descent and one of African descent) currently engaged in epidemiological studies of environmental and genetic factors that influence the risk of cardiovascular disease. This represents a core data set characterizing the relative allele and genotype frequency distributions of variable sites in this gene. Association studies between disease risk and variation in candidate genes require population samples that have been systematically investigated for both phenotypes and genetic variation. Here we use APOE to illustrate how the variation identified in the core sample compares with variation in a much larger epidemiological sample of 2179 individuals from the same three populations, typing only those sites observed to be variable in the smaller core sample.
This two-step approach to an association study, in which variation is defined by sequencing a small random sample of individuals, followed by the genotyping of these variations in a larger sample sufficient for epidemiological studies, is becoming a common strategy. Within a given candidate gene, this procedure can result in only a sparse set of markers. Even when a high fraction of variation within a gene is identified in a core sample, however, such a two-step approach raises important statistical problems for analysis of the resulting genetic data. The problem centers around the fact that the larger epidemiological sample is scored only at nucleotide sites that were observed to vary in the core sample. This conditional sampling of genetic variation imposes a bias, in that rare variants are likely to be missed in the larger sample. Although time and expense precluded complete sequence analysis of the larger epidemiological sample here, we were able to examine features of the bias described above by comparison of population genetic statistics estimated from the core and larger samples.
RESULTS
APOE Sequence Variants in the Core Sample
Approximately 5500 bp of DNA containing the APOE gene were
amplified and scanned for variation (Fig. 1). The target region contained 1059 bp of
5' flanking sequence, the entire coding sequence and intervening
introns of APOE (four exons and three introns) spanning 3586 bp, as well as 846 bp 3' to the polyadenylation signal (Fig. 1a).
Approximately 20% of the scanned sequence was coding (1156 of 5491 bp), and 80% was noncoding (4335 of 5491 bp). Several putative
regulatory elements (i.e., promoter and enhancer elements, which
contain protein-binding sites) have also been mapped in the 5'
flanking sequence and first intron of this sequence (Fig. 1b; Paik et
al. 1988; Smith et al. 1988). In addition to these elements, the
noncoding regions associated with APOE also contain a number
of common interspersed repeats such as Alu elements. Interspersed repeats comprised nearly half (46%) of the noncoding sequence (1987 of 4335 bp) examined (Fig. 1c).
In all, a core sample of 72 individuals (144 chromosomes) from three
populations was scanned across the target region, and 22 varying sites
were identified by comparing the amplified sequences using the
PolyPhred program (Fig. 1d and Table 1). Of
these, 21 variants (95%) were diallelic single nucleotide
substitutions. Among these, transition type substitutions were more
common (14 of 21, 67%) than transversions (7 of 21, 33%). One
multiallelic insertion/deletion type variant was identified in the
3' end of the APOE gene, resulting from a length change in
a mononucleotide-G tract (5229A). This position, 5229, was a compound
site of variation because a single nucleotide substitution was also
detected at this position (5229B).
Four of the 22 varying sites were located in the coding regions of
APOE (Fig. 1d and Table 1). All four changes lead to
nonsynonymous substitutions in the protein. Alleles defined by amino
acids at two variant sites, positions 3937 (Cys112Arg) and 4075 (Arg158Cys), determine the polypeptide isoforms originally detected by
protein electrophoresis (i.e., apoE2, apoE3, and apoE4)
that are now
routinely typed by PCR (Hixson and Vernier 1990). Two other coding
SNPs, Leu28Pro (3106) and Arg142Cys (4036), were identified in exons 3 and 4, respectively. These also lead to nonconservative amino acid
substitutions in the apoE protein and had been reported previously (Havel et al. 1983; de Knijff et al. 1994). Eighteen of the varying sites were located within the noncoding sequences (Fig. 1 and Table 1).
Of these, seven were found 5' to exon 1. The majority of 5'
sites (73, 471, 545, 560, and 624) were associated with Alu or
Mir repeats (Fig. 1c). Only two of these 5' sites, 308 and
832, were not associated with known repeat elements. Moreover, one of
these variants, 832, is located in a region of known enhancer activity.
An enhancer sequence has also been identified in intron 1 of
APOE, and a variant at position 1163 was found in this region as well (Fig. 1b; Mui et al. 1996). Although both of these sites lie in
regulatory regions, neither is located in one of the mapped protein-binding sequences identified in APOE (Fig. 1b; Paik
et al. 1988; Smith et al. 1988). Comparison of APOE
sequences from human (positions 441 to 4478) and mouse reveals
extended similarity in the coding regions (Fig. 1e). Only two noncoding
regions had similarities >60% across >40 nucleotides, and both
of these fall in the regions of the known enhancer activity described above.
Among the 22 variants, five sites showed only a single copy of the rarer nucleotide in the core sample (i.e., singletons, positions 308, 545, 2907, 3106, and 3673; 3106 leading to a nonsynonymous polymorphism; Table 1). Another five sites had only two copies of the rarer nucleotide (doubletons, sites 73, 471, 1522, 1575, and 4036; 4036 leading to a nonsynonymous polymorphism). Whereas one or two copies of an allele in a sample of 144 chromosomes implies a low overall relative allele frequency (i.e., 0.005 to 0.01), their relative frequency within the population sample in which they were found is substantial (0.02 to 0.04, 2n = 48), i.e., frequencies that would not typically be considered rare in SNP identification searches.
A visual representation of the genotypes determined for the core sample
of 72 individuals reveals several key features of the sequence
variation (Fig. 2). In this representation,
the variable sites are color-coded for each individual, with
homozygotes for the allele with the highest relative frequency across
the samples color-coded blue, homozygotes for the less frequent
allele color-coded yellow, and heterozygotes color-coded red. On
average, each individual differed from the reference sequence at
approximately four positions (range, one to eight positions) either
by being heterozygous or homozygous for the rarer allele. Six
individuals (three in Jackson, two in North Karelia, and one in
Rochester) were homozygous across the entire scanned region and,
not surprisingly, were homozygous for the most common APOE
genotype, ε3/ε3.
APOE Variation in the Larger Population-Based Sample Typed by OLA
Of the 21 SNPs identified in the core sequenced sample, 20 diallelic
variations were amenable to genotyping in larger epidemiological samples from the same three populations (n = 2179 total;
Table 1). Although we attempted to type a compound site of variation at
position 5229, the presence of a large number of highly variable-sized alleles in close proximity to an adjacent SNP made it impossible to
genotype variation at these sites accurately. The associated indel
(5229A) and SNP (5229B) were thus excluded from the analysis of the
larger sample. In certain cases only a subset of the full sample was
typed for a given variant. If no variation was observed in the first
188 individuals (376 chromosomes) surveyed in each population, the site
was regarded as monomorphic and scored accordingly (Table 1). In
addition, although minor technical difficulties precluded complete
genotyping in all individuals, only four of the typed positions in the
three samples had missing genotypes for >5% of the individuals.
Most sites had <1% of the individuals left unscored (Table 1). A
χ² test of homogeneity of the estimates
of relative allele frequencies between the core and epidemiological
samples was not significant (P > 0.05) for each
population (Table 1). All sites that varied in a given core sample also
varied in the larger epidemiological sample from the same population.
However, in six cases, SNPs that showed no variation in a given core
sample were found to vary in the larger OLA-typed sample from the same
population (sites 624, 1998, and 4951 in Jackson; 1575 and 3106 in
North Karelia; and 2907 in Rochester). Therefore, if only the sites
that varied in a given core sample had been typed in the corresponding
epidemiological sample, these sites would have remained undetected.
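The chance that a low-frequency site escapes detection at the discovery stage can be illustrated with a short calculation. The following is a minimal sketch and not part of the original analysis: it assumes that chromosomes are sampled independently and that the allele has a fixed relative frequency in the population. The function name `prob_missed` and the example frequencies are hypothetical.

```python
def prob_missed(freq: float, n_chromosomes: int) -> float:
    """Probability that an allele with population frequency `freq` is
    absent from a random sample of `n_chromosomes` chromosomes,
    assuming independent draws."""
    return (1.0 - freq) ** n_chromosomes


# Each population-specific core sample holds 24 individuals (48 chromosomes).
for freq in (0.01, 0.03, 0.05):
    missed = prob_missed(freq, 48)
    print(f"allele frequency {freq:.2f}: P(absent from 48 chromosomes) = {missed:.3f}")
```

Under these assumptions, an allele at a relative frequency of 0.03 is absent from a 48-chromosome sample roughly one time in four, which is in line with sites being monomorphic in a core sample yet variable in the larger sample.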
The large OLA-typed samples give us the opportunity to detect smaller
deviations from Hardy-Weinberg proportions than would be possible using
the core sequence data alone. As shown in Table 2, there is a good fit with expectation in
most cases. Whereas several sites appear to have an excess of
heterozygotes (i.e., 73, 832, and 3937 in Jackson; 3937 in North
Karelia; and 624, 832, and 3937 in Rochester), others showed a relative
deficit of heterozygosity (1998 in Jackson and 5361 in North Karelia). None of these deviations were large enough to be considered significant at an experiment-wide
α = 0.05, based on a Monte Carlo procedure like that of McIntyre et al. (2000).
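To make the nature of a site-specific test of Hardy-Weinberg proportions concrete, the sketch below applies the asymptotic χ² test with one degree of freedom to a single diallelic site. This is a simpler substitute for, and not a reproduction of, the Monte Carlo procedure used here, and it makes no experiment-wide correction. The function name and the genotype counts are hypothetical and are not taken from Table 2.

```python
import math


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int):
    """Return (chi-square statistic, asymptotic P value, fixation index F)
    for one diallelic site, given the counts of the three genotypes."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)          # relative frequency of allele a
    q = 1.0 - p
    if p in (0.0, 1.0):
        raise ValueError("site is monomorphic; the test is undefined")
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df
    p_value = math.erfc(math.sqrt(stat / 2.0))
    # F < 0 indicates an excess of heterozygotes, F > 0 a deficit
    f = 1.0 - n_ab / expected[1]
    return stat, p_value, f


# Hypothetical genotype counts for illustration only
stat, p_value, f = hwe_chi_square(620, 200, 23)
print(f"chi2 = {stat:.2f}, P = {p_value:.3f}, F = {f:+.3f}")
```

The sign of F distinguishes the two kinds of deviation described above: negative values correspond to an excess of heterozygotes and positive values to a deficit.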
Relationship of the ε2/ε3/ε4 Alleles and Flanking SNPs
Full analysis of linkage disequilibrium among variable sites in the
APOE gene region requires knowledge of the linkage phase of
the individuals who are heterozygous at two or more sites. Lacking this
information for the OLA data, we can still examine the important issue
of the extent to which the flanking SNPs occur homogeneously across the
ε2-ε3-ε4 genotypes. Because of the sequence differences involved
at the two determinative sites, these genotypes can be scored
unambiguously without specific haplotype phasing. The major classical
genotypes are ε3/ε3, ε3/ε2, and ε4/ε3; so for each of these
three genotypes, we tallied the relative frequencies of the rare and
common nucleotide in each population sample (Table 3). In 26 of 37 cases, Fisher's exact
tests showed that SNP frequencies were significantly heterogeneous
(P < 0.05) across the ε2-ε3-ε4 genotypes, and these
inferences were not dependent on the inclusion of the rare genotypes,
i.e., ε2/ε2, ε4/ε2, and ε4/ε4. In many cases, relative SNP
frequencies differ strikingly among ε2-ε3-ε4 genotypes; for
example, the rare allele frequency for site 1998 in Rochester was 0.002, 0.004, and 0.391 in the
ε3/ε3, ε3/ε2, and ε4/ε3 genotypes,
respectively. For such sites, the genotype at the ε2-ε3-ε4 sites
provides information to predict the genotype at the SNPs (or vice
versa).
In addition to revealing marked intergenotypic heterogeneity observed
for nearly all the sites investigated, Table 3 also illustrates the
extent to which the relative frequency of an allele associated with a
particular genotype can vary among population samples. For example, in
those with the ε4/ε3 genotype, the C variant of site 1163 is found
at relative frequencies of 0.082, 0.185, and 0.217 among the Jackson,
North Karelia, and Rochester groups, respectively. This implies that
the degree of association between the ε2-ε3-ε4 sites and the
flanking SNPs varies markedly across populations.
Population Distribution of APOE Diversity
Genetic variation at the APOE locus is clearly not
uniformly distributed among the three populations surveyed. For
example, only nine of the 22 variable sites identified in the core
sample (560, 832, 1163, 2440, 3937, 4075, 5229A, 5229B, and 5361) were found to vary in all three populations (Table 1). The proportion of
shared variation was slightly higher among the OLA-typed samples, with
10 out of the 20 sites showing variation in all three samples (560, 624, 832, 1163, 1998, 2440, 3937, 4075, 4951, and 5361; Table 1).
Fifteen of the 22 core variable sites and 16 of the 20 OLA-typed
variable sites were observed in the Jackson population, 14 of the core
sites and 14 of the OLA-typed sites in the North Karelia population,
and 14 core sites and 13 OLA-typed sites were observed in the Rochester
population (Table 1). The level of sequence polymorphism in the
samples, and its distribution among different subregions of the 5500 bp
surveyed, did not differ among the three populations surveyed as
assessed by the extensive overlap of the confidence intervals of the
estimates of nucleotide diversity (π in Table 4).
Average site heterozygosity within the populations was 0.134, 0.169, and 0.182 for the Jackson, North Karelia, and Rochester core samples and 0.129, 0.138, and 0.140, respectively, for the equivalent OLA-typed samples. For the 22 sites that varied in the core sample, the mean observed heterozygosity across all samples was 0.161; in the sample typed by OLA, the equivalent value was 0.136. Although only the two cSNPs (3937 and 4075) are commonly typed in surveys of APOE variation, nine other sites were also common in our samples, yielding heterozygosities >0.10.
The degree of population subdivision in APOE nucleotide
variation was quantified using classical F statistics (Weir 1996). For
the core set of individuals who were fully sequenced, the estimate of
the proportion of the variation that was attributable to
among-population differences was FST = 0.045. Site-specific estimates of FST for the core sample ranged from 0.014 to
0.156 (Table 1). The equivalent estimate for the OLA-typed data
(n = 2179) was FST = 0.034 (with site-specific
values ranging from 0.002 to 0.060). As the larger sample is expected
to have more unmeasured rare variants that would most likely show
differences among the populations, the estimate of FST for
the samples typed by OLA is likely to be an underestimate of the total
differentiation among the populations; the theoretically expected
degree of underestimation is under investigation. Although the
differences were small, site-specific core FST values were,
in most cases, larger than estimates based on variation in the
OLA-typed samples (Table 1).
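For readers unfamiliar with the statistic, the idea behind a site-specific FST can be sketched as the proportional reduction in expected heterozygosity within populations relative to the pooled total. The code below is a simplified illustration only: it weights populations equally and applies no sample-size correction, so it is not the estimator of Weir (1996) used for the values reported here. The function name and the allele frequencies are hypothetical.

```python
def fst_diallelic(freqs):
    """Simple FST for one diallelic site: (H_T - H_S) / H_T, where H_T is the
    expected heterozygosity at the mean allele frequency and H_S is the mean
    expected heterozygosity within populations (equal weights, no
    sample-size correction)."""
    k = len(freqs)
    p_bar = sum(freqs) / k
    h_total = 2.0 * p_bar * (1.0 - p_bar)
    if h_total == 0.0:
        return 0.0  # site is fixed everywhere; no differentiation to measure
    h_within = sum(2.0 * p * (1.0 - p) for p in freqs) / k
    return (h_total - h_within) / h_total


# Hypothetical relative frequencies of one allele in three populations
print(round(fst_diallelic([0.10, 0.15, 0.20]), 4))
```

Identical frequencies in every population give a value of zero, and alternative fixation gives a value of one; the small values reported for APOE fall near the lower end of this range.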
As shown in Figure 3, sites found in the
3' half of the 5.5-kb sequenced region have, on average, lower
FST estimates than sites found more 5' in the sequence,
and this is true whether estimates based on the core or OLA-typed
samples are considered (only FST values for the OLA-typed
samples are shown in Fig. 3; see Table 1 for corresponding core sample
estimates). The low estimates of FST for sites in and around
the fourth exon (including and surrounding the cSNPs at sites 3937 and
4075) are consistent with previous reports of low global variation in
the frequencies of the sites responsible for the ε2, ε3, and ε4
alleles (e.g., Hallman et al. 1991; Gerdes et al. 1992).
Nucleotide Diversity in the Core Versus OLA-Typed Samples
For each population surveyed, an equal or greater number of
polymorphic variants was observed in the OLA-typed sample than in the
core sample (Table 5). However, the overall
level of nucleotide diversity in the samples typed by OLA, summarized
either as expected per nucleotide heterozygosity (θ; Watterson
1975) or average pairwise sequence difference (π; Tajima 1993), was
consistently lower than in the core sample. θ for the total core
sample was 0.000690 ± 0.000214, whereas it was estimated as
0.000407 ± 0.000107 for the combined OLA-typed sample (Table 5).
Similarly, π for the core data set was estimated as
0.000565 ± 0.000295 and as 0.000492 ± 0.000290 for the OLA-typed
data set. These estimates are all lower than, but not significantly
different from, estimates of diversity reported for other autosomal (Li
and Sadler 1991; Harding et al. 1997; Nickerson et al. 1998; Rana et
al. 1999; Rieder et al. 1999) and X-linked (Zietkiewicz et al. 1997;
Harris and Hey 1999; Jaruzelska et al. 1999; Kaessmann et al. 1999)
human loci.
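The relationship between the number of variable sites and the per-nucleotide estimate of Watterson (1975) can be checked with a few lines of code. The sketch below is illustrative and rests on assumptions that are not stated explicitly in the text: that the estimate counts the diallelic SNPs only (21 in the core sample, 20 typed by OLA), that the scanned length is 5491 bp, and that every one of the 2179 typed individuals contributes two chromosomes at every site. The function names are hypothetical.

```python
def harmonic(n_chromosomes: int) -> float:
    """Sum of 1/i for i = 1 .. n_chromosomes - 1."""
    return sum(1.0 / i for i in range(1, n_chromosomes))


def watterson_theta(seg_sites: int, n_chromosomes: int, length_bp: int) -> float:
    """Per-nucleotide Watterson estimate: S / (a_n * L)."""
    return seg_sites / (harmonic(n_chromosomes) * length_bp)


LENGTH_BP = 5491                                        # assumed scanned length
theta_core = watterson_theta(21, 2 * 72, LENGTH_BP)     # core sample
theta_ola = watterson_theta(20, 2 * 2179, LENGTH_BP)    # OLA-typed sample

print(f"core: {theta_core:.6f}")
print(f"OLA:  {theta_ola:.6f}")
```

Under these assumptions the two values agree with the point estimates quoted above to the reported precision. The calculation also shows why the OLA-typed estimate is lower: the number of sites scored is capped by the discovery sample, whereas the harmonic term in the denominator keeps growing with sample size.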
The two estimates of nucleotide diversity, which are expected to be
equal if the data are drawn from an equilibrium population of fixed
size and constant mutation rate and all mutations are selectively
neutral, may be compared using Tajima's (1989) test statistic
D. For the total core data set, π was slightly lower than
θ, resulting in a negative, but nonsignificant, Tajima's D. For the total OLA-typed sample, π was larger than
θ, resulting in a nonsignificant positive D value. Similar
levels and patterns of polymorphism were observed when data for each of
the populations were considered separately (Table 5). In all but the
Jackson core sample, estimates of D were positive, suggesting
a slightly greater level of nucleotide site heterozygosity than
expected, consistent with previous studies of human nuclear DNA
sequence variation (as discussed in Hey 1997). In all cases, estimates of D were larger for the equivalent OLA-typed sample.
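The comparison of the two diversity estimates can be sketched with the standard formula for Tajima's D. The example below is a rough illustration, not a reproduction of the values in Table 5: the inputs are derived from the rounded point estimates in the text, assuming 21 and 20 diallelic SNPs, a scanned length of 5491 bp, and 144 and 4358 chromosomes for the total core and OLA-typed samples. The function name is hypothetical.

```python
import math


def tajimas_d(n_chromosomes: int, seg_sites: int, mean_pairwise_diff: float) -> float:
    """Tajima's (1989) D from the sample size, the number of segregating
    sites, and the mean number of pairwise differences per sequence."""
    n = n_chromosomes
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n ** 2 + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (mean_pairwise_diff - seg_sites / a1) / math.sqrt(variance)


LENGTH_BP = 5491  # assumed scanned length, used to convert per-site values
d_core = tajimas_d(144, 21, 0.000565 * LENGTH_BP)
d_ola = tajimas_d(4358, 20, 0.000492 * LENGTH_BP)
print(f"core D = {d_core:+.2f}, OLA D = {d_ola:+.2f}")
```

With these inputs the sign of D changes from negative in the core sample to positive in the OLA-typed sample, which is the direction expected when rare variants are systematically left untyped in the larger sample.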
We do not know how many sites this incomplete two-tiered approach
misses, but we can roughly estimate the number of sites
expected to vary in samples the size of our OLA-typed samples, assuming
that the data fit the infinite sites model. (If the data conform more
closely to a finite sites model of mutation, the expected number of
polymorphic sites in the OLA-typed samples would be smaller [Tajima
1996], and the bias caused by two-staged sampling would be less
pronounced.) Under this model, the expected number of polymorphic sites
in a sample is given by the formula E(S) = θ Σ 1/i, in which θ is the
estimate of expected heterozygosity based on variation observed in a
fully sequenced sample, and the sum runs over i from 1 to
2n − 1, with 2n being the sample size (Watterson 1975). Using estimates of θ
for APOE based on variation
observed in each of the separate core samples (Table 5) and given
sample sizes of 1686, 904, and 1768 for the Jackson, North Karelia, and Rochester populations, the expected numbers of polymorphic sites in the
larger population samples (i.e., the number we would expect to observe
if all individuals in the larger samples had been fully sequenced) are
estimated as 25.3, 21.6, and 23.6, respectively. Of the 22 sites
identified in the core sample, we observed variation in 16, 14, and 13 of OLA-typed sites, respectively, for Jackson, North Karelia, and
Rochester (Table 5). Therefore, our calculations suggest that
approximately 9, 8, and 11 additional variable sites exist at
APOE in our epidemiological samples that were missed by
genotyping sites found in the core sample only. Note that these estimates of missing sites are highly model dependent and that selection operating on APOE could drive the true number of
missed SNPs either up or down from these estimates. Furthermore, the relative frequency of the rarer alleles at such sites (not observed in
the core samples) in the equivalent larger samples is expected to be low.
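The projection described above can be reproduced approximately with the short sketch below. It assumes that θ for each population was obtained from the number of diallelic SNPs seen in the corresponding 48-chromosome core sample; the counts of 14, 13, and 13 used here are inferred from the numbers of variable core sites after setting aside the multiallelic indel and are not stated explicitly in the text. The function names are hypothetical.

```python
def harmonic(n_chromosomes: int) -> float:
    """Sum of 1/i for i = 1 .. n_chromosomes - 1."""
    return sum(1.0 / i for i in range(1, n_chromosomes))


def expected_segregating_sites(s_core: int, n_core: int, n_large: int) -> float:
    """E(S) in a sample of n_large chromosomes under the infinite sites
    model, using theta estimated from s_core sites in n_core chromosomes."""
    theta_locus = s_core / harmonic(n_core)   # Watterson estimate for the whole region
    return theta_locus * harmonic(n_large)


# (population, assumed SNP count in core sample, chromosomes in OLA sample)
samples = [("Jackson", 14, 1686), ("North Karelia", 13, 904), ("Rochester", 13, 1768)]
expected = {name: expected_segregating_sites(s, 48, n_large)
            for name, s, n_large in samples}

for name, value in expected.items():
    print(f"{name}: E(S) = {value:.1f}")
```

Subtracting the 16, 14, and 13 sites actually observed to vary in the OLA-typed samples from these expectations gives the approximate numbers of missed sites quoted above.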
DISCUSSION
There has been widespread interest in recent years in the use of
SNPs as markers in the search for candidate loci, and ultimately alleles, underlying complex genetic disorders. APOE is one of the most intensively investigated of all human loci, in large part
because of its role in lipid transport and metabolism, as well as its
involvement in modulating cell growth and differentiation, tissue
repair, and immunoregulation (Davignon et al. 1999; Mahley and Huang
1999). Studies of the three major alleles of APOE (i.e., ε2, ε3, and ε4)
have revealed that the ε2 allele is associated with
higher levels of apoE and lower levels of plasma cholesterol, low-density lipoprotein cholesterol, apoB, and Lp(a) and have suggested that
ε2 plays a protective role against CVD, whereas the ε4 allele is
associated with lower levels of apoE and higher levels of plasma total
cholesterol, low-density lipoprotein cholesterol, apoB, and Lp(a), as
well as increased risk of CVD (de Knijff et al. 1994; Stengård et al.
1995; Davignon et al. 1999) and, in some populations, of Alzheimer's
disease (Corder et al. 1993; Strittmatter et al. 1993; Meyer et al.
1998; Martin et al. 2000). These well-documented associations have
rested almost exclusively on the consideration of the variant protein
isoforms alone, or the two cSNPs that determine them. Despite recent
efforts to characterize polymorphism in the promoter region of the gene
(e.g., Artiga et al. 1998b; Bullido et al. 1998; Lambert et al.
1998a,b) and a broader scan for SNPs in the region (Lai et al. 1998),
no systematic investigation of variation in the locus at the nucleotide level has thus far been reported.
The genomic sequence analysis reported here, of 72 individuals (144 chromosomes) for the APOE locus and its flanking regions, identifies the extent of APOE genetic diversity more
comprehensively than has been performed to date and underscores the
heterogeneity that remained undetected by earlier investigations. In
all, 22 variable sites were observed in 5.5 kb, corresponding to an
overall average level per nucleotide heterozygosity of 0.0007 (Tables 4
and 5). In other words, approximately 1 in every 1400 bp varies on
average between two randomly sampled chromosomes in the core sample.
The four nonsynonymous coding region variants we identified had all
been reported previously: the two most common cSNPs (at sites 3937 and
4075) are those that define the major isoforms of apoE, and the rarer
variant at 4036 (Arg142Cys) has been previously associated with type
III hyperlipoproteinemia in a single family (Havel et al. 1983; Rall et
al. 1989). The other nonsynonymous variant at 3106 (Leu28Pro) is not
associated with any known lipid disorder (de Knijff et al. 1994).
However, 14 of the remaining 18 noncoding variants (at sites 73, 308, 471, 545, 1522, 1575, 1998, 2440, 2907, 3673, 4951, 5229A, 5229B, and
5361) have not been observed previously. Several of these sites have
levels of heterozygosity comparable with the normally-assayed cSNPs at
3937 and 4075, yet their effects, if any, with respect to phenotype variation remain to be investigated (C. Sing, in prep.; J. Stengård, in prep.).
A number of recent studies have focused attention on SNPs in the 5'
flanking region of APOE that could alter gene expression and
be involved in the phenotypic associations with Alzheimer's disease
and CVD risk (Mui et al. 1996; Artiga et al. 1998a,b; Bullido et al.
1998; Lambert et al. 1998a,b; Lambert et al. 2000). Two of these
SNPs, known as -491 A/T (position 560) and -427 C/T (position 624), have
been associated with an increased risk of Alzheimer's disease that is
independent of the ε4 status of the individual (Artiga et al.
1998a,b; Bullido et al. 1998; Martin et al. 2000). Another regulatory
region variant, denoted Th1/E47c or -219 G/T (832), has also been found
to be associated with the risk of Alzheimer's disease and myocardial
infarction (Lambert et al. 1998a,b, 2000). Interestingly, the SNPs at
-491 A/T (560) and -427 C/T (624) are associated with an Alu
sequence. Substitutions in Alu sequences are not usually
involved in gene function or regulation. Nonetheless, site-directed
mutagenesis of the -491 position does lead to changes in APOE
promoter activity and differential binding to nuclear extracts (Bullido
et al. 1998), although the constructs used in these studies included
sequences containing the -427, -219, and 1E1 (1163) sites. Therefore,
the effects ascribed to -491 A/T could be related to a combination of
the alleles at one or more of these sites (e.g., Lambert et al. 1998b).
Because our ultimate aim is to investigate the relationship of variation in measures of lipid metabolism that may play a role in CVD risk to the APOE polymorphism, 20 of the 22 variable sites were subsequently typed by OLA in much larger epidemiological samples from the same populations. Relative allele frequency estimates based on OLA genotyping were consistent with estimates derived from the fully-sequenced core samples. Sites that did not vary in the core sample of a particular population but were subsequently found to be polymorphic in the large epidemiological sample of the same population typically had relative frequencies of the rare allele <0.03. This illustrates one of several dangers inherent to two-tiered strategies. The core sample from a given population will identify some, but not all, of the sites that vary in that population. Further, the characteristics of variation in the larger sample may be well reflected by those of the same sites in the core sample but cannot accurately address the overall variation in the larger sample, because not all sites that vary in the latter will be known. Finally, of course, the potential efficacy of variation identified in the core sample for use in epidemiological association studies cannot fully be assessed, because the sites of etiological importance, the relative frequencies of their alleles, and their association with variation in the core sites will all be unknown.
Tests of site-specific Hardy-Weinberg equilibrium for the data from the
OLA-typed samples suggested no significant departures from expected
proportions. All three populations had a small deficit of homozygotes
for the rare allele at site 3937, which defines in part the ε4
allele, consistent with a moderate deficit of ε4 homozygous genotypes
relative to expectation. Although there is independent evidence for a
decline in the frequency of the ε4 allele with age (Miettinen et al.
1994; Schächter et al. 1994; Haviland et al. 1995; Stengård et
al. 1995), it is not clear what causes the deficit observed here. There
was also weak deviation for the regulatory site, 832, in two
populations, and this has been associated with phenotypic effects in
some studies (Paik et al. 1988; Smith et al. 1988). Although Table 3
reflects the existence of linkage disequilibrium among sites in
APOE, the same data also show that without directly typing the
site(s) of etiological importance, a random SNP cannot be relied on to
predict the ε2/ε3/ε4 genotype. Similar observations have
been reported for a recent study of 1.5 Mb of sequence surrounding the
APOE gene (Martin et al. 2000). In that case, only certain
close and some distant sites could pick up signal due to the two-site
ε4 haplotype.
That the core diversity is representative of the same alleles in the
larger sample is shown by the similarity of the nucleotide diversity
(π) values in both samples. That rare sites are missing in the OLA
sample is reflected in the lower θ values for the OLA data relative to
the core and the consequent inflation of the Tajima's D
values that compare the two statistics. To the extent that the underlying theoretical assumptions hold, our calculations suggest that
approximately ten more sites may vary in each of the large epidemiological samples. Although relative allele frequencies at these
sites are likely to be low, because they were not seen in the rather
substantial core samples, their details remain unknown, and thus their
relevance for predicting phenotypic variability cannot be evaluated.
The nature and size of the core sample can also have important
consequences for subsequent large-scale analyses in the geographic apportionment of the observed diversity. At APOE, estimates of FST were 0.045 for the core and 0.034 for the OLA-typed
samples. Although the second value is underestimated, both estimates
are low compared with an average estimate of 0.139 ± 0.010
reported earlier for a collection of 100 diallelic human genetic
markers (Bowcock et al. 1991). Although some of the apparent difference may be because our study did not sample worldwide geographic variation, our values do agree with the previous suggestion that interpopulation differences in APOE diversity are low relative to other loci
(Gerlenter et al. 1998). Yet, despite this fact, readily apparent
differences in site variability exist among the three populations
surveyed here. Seven of the 22 variable sites were observed to vary in one population only, for example, and all but one of these (site 1522)
were restricted to the Jackson African-American sample. These
population-specific variants attain relative frequencies of as much as
0.09 in the OLA-typed Jackson sample and would generally be considered
common polymorphisms by most human geneticists, hence well worth
consideration with respect to phenotypic variance. Such low-frequency
population-specific variants would have remained uninvestigated,
however, if the core analysis had focused either on a smaller sample
from the same population or, as has been proposed recently (Collins et
al. 1999), relied solely on a small mixed panel of anonymous
individuals from different geographic regions. As such frequency
differences are only likely to be greater at more polymorphic loci, the
implications of poor sample choice at the SNP discovery stage are
considerable. Clearly, small core samples underrepresent the true
number of variable sites, missing even those with an appreciable
relative frequency of the rarer allele.
Genomic sequence analysis is an important prerequisite for designing and implementing large-scale genotyping studies. As our survey of APOE nucleotide diversity shows, however, care must be exercised in the way such core analysis is conducted and interpreted. With adequate population sampling at the sequence level, the likelihood of characterizing sites with high information content increases, and the ability to draw inferences about underlying variation left untyped in large-scale genotyping surveys is enhanced. Even then, it is not clear a priori what sampling design would be optimum for detecting disease association by disequilibrium with untyped sites, because that question involves the unknown number, arrangement, and frequency of the etiologically relevant variations. However, our results suggest that those who rely solely on a small core panel of SNPs, ascertained from a limited number of individuals with poorly defined population affiliation, may miss important underlying variation, with possible adverse consequences for the power of subsequent large-scale genotype-phenotype analyses. It is clear, for example, that there is much more to APOE genetic diversity than two cSNPs and their well-known resulting isoforms; there is considerable haplotypic variation within each of those isoforms as well (Fullerton et al., in press). The extent to which these newly discovered polymorphisms may explain additional variation in phenotypes is currently being investigated.
| |
METHODS |
|---|
|
|
|---|
Population Samples
Individuals from three populations were sampled: (1) Europeans from North Karelia, Finland (n = 24), (2) European-Americans from Rochester, Minnesota (n = 24), and (3) African-Americans from Jackson, Mississippi (n = 24). All subjects were selected for this survey without respect to their disease status or their levels of any risk factor trait. After the variable nucleotide sites were ascertained in this set of 72 individuals (144 chromosomes), a larger sample from North Karelia (n = 452), Rochester (n = 884), and Jackson (n = 843) was scored by OLA.
DNA Amplification
The APOE gene (reference sequence: GenBank AF261279) was amplified from each individual in nine overlapping segments. Either a universal forward (-21M13, TGTAAAACGACGGCCAGT) or reverse (M13reverse, CAGGAAACAGCTATGACC) sequence was added to each APOE-specific primer (forward to forward; reverse to reverse) before synthesis. The following specific primer pairs were used to amplify the APOE gene (listed as the forward and reverse primers with PCR product size and primer annealing temperature in parentheses): (1) CTTGATGCTCAGAGAGGACAAG and GGCATAGAGTCTTTTGACCA (1122 bp, 63°C), (2) GGTCAGGAAAGGAGGACTCT and GTCCCAGTCTCGCATTCCTC (1072 bp, 58°C), (3) GGCAGCGACACGGTAGCTAG and AACCGAGGCCCAGAGAGCGT (672 bp, 61°C), (4) GTTGCTGGTCACATTCCTGG and GAGTCGGTTTAATCACTTG (940 bp, 63°C), (5) AGCCCTGCCTGGGGCACAC and GGACACTCACCTCAGTTCCT (744 bp, 58°C), (6) GAGTGGCAGAGCGGCCAGCG and CCTTCAACTCCTTCATGGTCTC (1143 bp, 63°C), (7) CTAGCTCCTTCTTCGTCTCTG and GCTCGAACCAGCTCTTGAGG (694 bp, 58°C), (8) GCCAGCCGCTACAGGAGCG and CCAGCTACTGAGGCAGCAG (638 bp, 58°C), and (9) GTGTGTATCTTTCTCTCTGCC and GGCAGGCCGCTCGGAGCCCAT (751 bp, 63°C). All amplification reactions were performed in 96-well microtiter plate thermal cyclers (PTC 100, MJ Research). PCRs were assembled in 20-µL total volume containing 50 ng of genomic DNA using the Advantage GC genomic polymerase system (Clontech). Following assembly, thermal cycling was performed with an initial denaturation at 94°C for 1 min followed by 35 cycles of denaturation at 95°C for 20 sec, primer annealing for 30 sec (temperatures above), and primer extension at 72°C for 2 min. After 35 cycles, a final extension was performed at 72°C for 5 min.
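The primer tailing described above can be sketched in a few lines. The two universal sequences are those given in the text; the assumption that each tail is placed at the 5' end of the gene-specific primer is ours (the text says only that the sequence was "added"), and the function name is hypothetical.

```python
# Universal tails as listed in the text (written 5' to 3').
M13_FORWARD = "TGTAAAACGACGGCCAGT"   # -21M13
M13_REVERSE = "CAGGAAACAGCTATGACC"   # M13reverse


def tail_primers(forward_specific, reverse_specific):
    """Return (forward, reverse) primers with the universal tails prepended.

    Assumes the tail sits at the 5' end of each gene-specific primer.
    """
    return M13_FORWARD + forward_specific, M13_REVERSE + reverse_specific


# Primer pair (1) from the list above.
fwd, rev = tail_primers("CTTGATGCTCAGAGAGGACAAG", "GGCATAGAGTCTTTTGACCA")
```

Tailing every amplicon with the same two universal sequences is what allows a single pair of dye-primer sequencing reagents to be used for all nine segments.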
DNA Sequencing
Following DNA amplification, PCR products were purified by cutting
the specific product from a 1% low-melt agarose gel and isolating the
product with the Wizard PCR preps purification system (Promega) as
described previously (Nickerson et al. 1997). Cycle sequencing was
performed according to the manufacturer's instructions using ABI PRISM
Dye Primer Sequencing Kits with Amplitaq FS DNA polymerase (PE
Biosystems). Dye primer sequencing using the universal forward and
reverse primers attached to the gene-specific primers was performed by
assembling four separate reactions as follows: 1 µL each of the
PCR sample mixed with 4 µL of the PRISM ready premix for the A and
C reactions, and 2 µL each of the PCR sample mixed with 8 µL of
the PRISM ready premix for the G and T reactions. Sequencing reactions
were denatured for 1 min at 96°C and subjected to 15 cycles at
96°C for 10 sec, 55°C for 5 sec, and 70°C for 1 min, followed by 15 cycles at 96°C for 10 sec and 70°C for 1 min. Then, the A, C, G,
and T reactions were pooled and subjected to ethanol precipitation,
resuspended in 1.5 µL of loading buffer (5:1, 1% deionized
formamide/50 mM EDTA at pH 8.0), heated for 2 min at 90°C, and
loaded onto an Applied Biosystems 377 sequencer according to the
manufacturer's directions.
Sequence Analysis and Polymorphism Identification
The ABI sequence software (version 2.1.2) was used for lane
tracking and first-pass base-calling (PE Biosystems). Chromatograms were transferred to a Sun UNIX workstation, base-called with Phred (Ewing et al. 1998;
Ewing and Green 1998), assembled with Phrap (Green 1999), and scanned by
PolyPhred (Nickerson et al. 1997). The results
were viewed with the Consed program (Gordon et al. 1998).
Interspersed repeats in the target sequence were identified by the
program RepeatMasker (Smit and Green 1999). Specific descriptions and
documentation for Phred, Phrap, Consed, and RepeatMasker are available
at http://bozeman.mbt.washington.edu/index.html; for PolyPhred,
http://droog.mbt.washington.edu.
DNA polymorphisms were identified using the PolyPhred program (version
3.0; Nickerson et al. 1997). Once identified, the variants were
visually inspected and automatically entered into a database for
subsequent analysis. Each variant position was confirmed by reamplifying and resequencing the variant site from the same or opposite strand. In addition, because of the sequence overlap within
the analyzed regions, more than one call for each genotype was obtained
for each position in a sample. In regard to data quality and accuracy,
it is important to note that (1) the base-calling program we applied,
Phred, has a significantly higher accuracy in calling bases correctly,
i.e., a lower error rate, than even the ABI software (Ewing et al.
1998); (2) the genotype accuracy was estimated to be >99% based on
genotype confirmation obtained from multiple or opposite strand
sequencing (data not shown); and (3) all the identified variants were
confirmed by genotyping the PCR products independently by OLAs.
Information on these SNPs is available in AF261279 and in dbSNP (Sherry
et al. 1999). Sequence comparisons between human (AF261279) and mouse
(D00466) were performed with Advanced Pipmaker
(http://bio.cse.psu.edu/pipmaker/) using the chaining option (Schwartz
et al. 2000).
OLA
A colorimetric single-well OLA was used to type the identified SNPs
as described previously in detail (Tobe et al. 1996). Regions of the
APOE gene containing SNPs were amplified as described above,
and the products (~20 µL) were diluted with 50 µL of
distilled H2O containing 0.1% Triton X-100. A 10-µL
aliquot of the diluted product was then mixed with 10 µL of a
solution containing 2× ligase buffer (40 mM Tris-HCl [pH 8.0]/20
mM MgCl2/2 mM dithiothreitol), 2 mM nicotinamide adenine
dinucleotide, 25 mM KCl, 0.167 U Ampligase DNA Ligase (Epicentre), and
200 fmol of each of the ligation primers (the two allele-specific
primers each labeled at its 5' end with a specific hapten
[digoxigenin or fluorescein] and the joining primer for the SNP being
tested phosphorylated and labeled at its 3' end with biotin).
Ligation reactions were overlaid with mineral oil and placed in a
thermocycler for 20 cycles at 93°C for 30 sec and 58°C for 2 min.
After cycling, the reactions were stopped by the addition of 10 µL
of 0.1 M EDTA in 0.1% Triton H2O and transferred in their
entirety (including the mineral oil) to a 96-well flat bottom
microtiter plate (Falcon) that had been coated with streptavidin
(Sigma; 50 µL of 25 µg/mL incubated 1 hr at 37°C). Ligation
products were allowed to capture on the streptavidin plate at room
temperature (RT) for 1 hr, and the plate was washed twice with an NaOH
buffer (0.01 M NaOH/0.05% Tween 20) followed by two washes with Tris
buffer (100 mM Tris-HCl [pH 7.5]/150 mM NaCl/0.05% Tween 20). An
antibody mixture (40 µL in 1× PBS with 0.5% BSA) consisting of
a 1:1000 dilution of alkaline phosphatase-labeled anti-fluorescein
antibodies and 1:1000 dilution of horseradish peroxidase-labeled
anti-digoxigenin antibodies was added to each well. After 30 min at RT,
plates were washed six times with Tris buffer. After washing, an
alkaline phosphatase substrate (25 µL/well, Bethesda Research Laboratories
enzyme-linked immunosorbent assay amplification system) was added to
the wells, the plates were incubated for an additional 10 min at RT,
and then 25 µL of amplifier were added to each well.
Spectrophotometric absorbances were taken at 490 nm using a microplate reader (Bio-Rad 3550) and saved as optical density (OD) readings in the attached computer. After detection of the fluorescein reporter, the plates were washed again six times with Tris buffer, and 50 µL of the horseradish peroxidase substrate, 3,3',5,5'-tetramethylbenzidine (TMB; Sigma), were added to each well to detect the digoxigenin reporter. Spectrophotometric absorbances were taken at 655 nm for this reporter and saved in the attached computer. Sequences of the OLA primers and a detailed protocol for the assay are available at http://droog.mbt.washington.edu.
Genotypes were automatically derived by applying a simple threshold to
call a positive (OD > 0.150) or negative (OD < 0.150) reaction. Duplicate genotypes were obtained for ~10% of the
individuals assayed at each site, and genotype concordance of
>98.6% was detected, a concordance rate that was similar to prior
studies on other SNPs (Delahunty et al. 1996; Tobe et al. 1996).
Individuals with discordant genotypes were reassayed by OLA or by
sequencing analysis, and the genotypes concordant for two of the three
assays were accepted as the final genotype. Genotyping of site 5229 in
APOE (varying sites 5229a and 5229b) in the larger population
samples showed that the core sequencing sample did not capture the
entire spectrum of allelic variation (number of G's in the tract) at this position. Also, this position could not be accurately interpreted by length because of the difficulties normally encountered for typing
mononucleotide tracts, which are compounded by the presence of
substitution variation. Therefore, this position could only be
accurately called by sequence analysis combined with manual interpretation of the position, precluding its typing on a large scale.
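The threshold-based calling and the two-of-three resolution rule described above can be sketched as follows. The OD cutoff of 0.150 comes from the text; the allele labels, the treatment of a reading exactly at the cutoff as negative, and the function names are assumptions made for illustration only.

```python
from collections import Counter

OD_THRESHOLD = 0.150  # cutoff stated in the text


def call_genotype(od_allele1, od_allele2, threshold=OD_THRESHOLD):
    """Call a genotype from the two reporter readings for one sample.

    od_allele1 and od_allele2 are the optical densities of the two
    allele-specific reporters (one read at 490 nm, one at 655 nm).
    Returns "1/1", "1/2", "2/2", or None when neither reaction is positive.
    A reading exactly equal to the threshold is treated as negative here,
    which is an assumption; the text does not cover that case.
    """
    positive1 = od_allele1 > threshold
    positive2 = od_allele2 > threshold
    if positive1 and positive2:
        return "1/2"
    if positive1:
        return "1/1"
    if positive2:
        return "2/2"
    return None


def resolve_discordant(calls):
    """Accept the genotype seen in at least two of the assays, else None."""
    counts = Counter(call for call in calls if call is not None)
    if not counts:
        return None
    genotype, n_seen = counts.most_common(1)[0]
    return genotype if n_seen >= 2 else None
```

For example, a sample with duplicate calls of "1/1" and "1/2" that is reassayed as "1/1" would be resolved to "1/1" by `resolve_discordant`.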
Statistical Analyses
Allele frequencies for each variable site (with or without regard
to the observed ε2/ε3/ε4 genotype) were estimated by gene counting, because genotypes were scored directly by sequencing or OLA typing.
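Gene counting simply tallies the two alleles carried by each genotyped individual. A minimal sketch is given below; the function name and the genotype counts in the example are hypothetical.

```python
def allele_frequencies(genotype_counts):
    """Estimate allele frequencies by gene counting.

    genotype_counts maps an (allele_a, allele_b) pair to the number of
    individuals with that genotype. Each individual contributes two alleles.
    """
    allele_counts = {}
    for (allele_a, allele_b), n_individuals in genotype_counts.items():
        allele_counts[allele_a] = allele_counts.get(allele_a, 0) + n_individuals
        allele_counts[allele_b] = allele_counts.get(allele_b, 0) + n_individuals
    total_alleles = sum(allele_counts.values())
    return {allele: count / total_alleles for allele, count in allele_counts.items()}


# Hypothetical counts for one site in a sample of 24 individuals.
freqs = allele_frequencies({("A", "A"): 18, ("A", "G"): 5, ("G", "G"): 1})
```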
Several standard statistics were estimated to characterize the amount
and pattern of nucleotide polymorphism in APOE. Nucleotide diversity,
π, was estimated as the average heterozygosity for all
nonindel sites in the sequence (invariant sites counted as having
heterozygosity of zero); standard errors of this estimate included both
stochastic and sampling variance and were calculated with the
conservative assumption of no recombination between sites (Tajima
1993). A related estimator, θ = 4Neµ,
characterizes the variation in terms of that expected in a standing
population in mutation-drift equilibrium, with mutation rate, µ,
per sequence per generation and Ne as the effective
population size, and represents the expected average heterozygosity.
θ was estimated from the observed number of nonindel segregating
sites, S, in a sample of 2n chromosomes,
according to the formula θ = S/Σ(1/i), summed from
i = 1 to 2n − 1 (Watterson 1975). The standard
error of this estimate, also following Watterson (1975), was calculated
assuming no recombination. The equality of the estimates of
π and θ was tested by Tajima's D statistic (Tajima 1989).
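As an illustration of how these three statistics relate, the sketch below computes π, Watterson's θ, and Tajima's D from the minor-allele counts at the segregating sites. It uses the standard constants from Tajima (1989) and assumes biallelic sites with complete data; it does not compute the standard errors mentioned above. The function name and the interface are hypothetical, and `n_chrom` corresponds to 2n in the text.

```python
import math


def diversity_stats(minor_counts, n_chrom, seq_len):
    """Return (pi, theta_w, tajima_d) for one sample.

    minor_counts: minor-allele count at each segregating (nonindel) site.
    n_chrom: number of chromosomes sampled (2n in the text).
    seq_len: number of sites surveyed, so pi and theta_w are per site.
    tajima_d is None when there are no segregating sites.
    """
    n = n_chrom
    s = len(minor_counts)
    # Mean number of pairwise differences: sum of unbiased per-site heterozygosity.
    k = sum(2.0 * c * (n - c) / (n * (n - 1)) for c in minor_counts)
    a1 = sum(1.0 / i for i in range(1, n))
    pi = k / seq_len
    theta_w = s / a1 / seq_len
    if s == 0:
        return pi, theta_w, None
    # Variance constants from Tajima (1989).
    a2 = sum(1.0 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    tajima_d = (k - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))
    return pi, theta_w, tajima_d
```

Removing rare sites from the input lowers θ more than it lowers π, which pushes D upward; this is the direction of the inflation described for the OLA-typed samples.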
Hardy-Weinberg tests for genotype frequency distributions were
performed on the observed genotype frequencies for each site and
population, with significance based on a standard observed-expected
χ² test with 1 df. The degree to which subdivision into
separate populations is reflected in the amount of allelic variation
was measured by the parameter FST, the ratio of the
variance of allele frequencies among the populations to the genetic
variance of the pooled data (Weir 1996). Pairs of sites showing
significant linkage disequilibrium were identified by application of a
likelihood ratio test that compares the likelihood of the data assuming
linkage equilibrium (calculated as the product of the allele
frequencies at each site) with the likelihood of the data assuming
haplotype frequencies estimated by the expectation-maximization
algorithm (Slatkin & Excoffier 1996), implemented by the program
Arlequin v. 2 (http://anthropologie.unige.ch/arlequin/).
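The first two tests in this paragraph can be sketched for a single biallelic site as follows. The Hardy-Weinberg function is the ordinary observed-expected χ² with 1 df. The FST function implements only the simple variance ratio given in the definition above, weighted by sample size, and is not the bias-corrected estimator described by Weir (1996). The function names and the example values are hypothetical.

```python
import math


def hardy_weinberg_chi2(n_aa, n_ab, n_bb):
    """Observed-expected chi-square test of Hardy-Weinberg proportions.

    Returns (chi2, p_value) with 1 df, or None for a monomorphic site.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2.0 * n)
    q = 1.0 - p
    if p == 0.0 or q == 0.0:
        return None  # test is undefined without variation
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


def fst_variance_ratio(freqs, sizes):
    """Among-population variance of allele frequency over pooled variance.

    freqs: allele frequency in each population; sizes: sample size weights.
    Returns None when the pooled sample is monomorphic.
    """
    total = float(sum(sizes))
    p_bar = sum(p * w for p, w in zip(freqs, sizes)) / total
    if p_bar == 0.0 or p_bar == 1.0:
        return None
    var_p = sum(w * (p - p_bar) ** 2 for p, w in zip(freqs, sizes)) / total
    return var_p / (p_bar * (1.0 - p_bar))
```

For instance, `hardy_weinberg_chi2(30, 40, 30)` compares the observed counts with expectations of 25, 50, and 25 and gives a χ² of 4.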
| |
ACKNOWLEDGMENTS |
|---|
We thank Cheryl Thayer, Christa Broers, and Barney Gill for their assistance in obtaining the human APOE sequences and OLA typings. This work was accomplished with support from the National Heart, Lung, and Blood Institute (HL58238, HL58239, HL58240, and HL39107).
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| |
FOOTNOTES |
|---|
7 Corresponding author.
E-MAIL debnick{at}washington.edu; FAX (206) 685-7301
Article and publication are at www.genome.org/cgi/doi/10.1101/gr.146900.
| |
REFERENCES |
|---|
|
|
|---|
3/3.
J. Clin. Invest.
72:
379-387.