Vol. 12, Issue 3, 436-446, March 2002
LETTER
ABSTRACT
A major interest in human genetics is to determine whether a nonsynonymous single-base nucleotide polymorphism (nsSNP) in a gene affects its protein product and, consequently, impacts the carrier's health. We used the SIFT (Sorting Intolerant From Tolerant) program to predict that 25% of 3084 nsSNPs from dbSNP, a public SNP database, would affect protein function. Some of the nsSNPs predicted to affect function were variants known to be associated with disease. Others were artifacts of SNP discovery. Two reports have indicated that there are thousands of damaging nsSNPs in an individual's human genome; we find the number is likely to be much lower.
INTRODUCTION
A major interest in human genetics is to distinguish mutations that
are functionally neutral from those that contribute
to disease. Amino acid substitutions currently account for
approximately half of the known gene lesions responsible for human
inherited disease (Cooper et al. 1998). Therefore, it is important to
determine whether a nonsynonymous single nucleotide polymorphism
(nsSNP) that affects the amino acid sequence of a gene product can
alter protein function and contribute to disease.
The number of potentially damaging nsSNPs in a human individual is also
of major interest because if the number is high, it can affect human
welfare. Two groups, Sunyaev et al. (2001) and Chasman and Adams (2001),
have applied computational tools that predict the effect of an
amino acid substitution on protein function to nsSNPs. These groups
estimated that ~20% and 30%, respectively, of nsSNPs damage
protein function. Based on these estimates, they proposed that each
individual has on average 2000 (Sunyaev et al. 2001) and 9500 nsSNPs
(Chasman and Adams 2001) that affect protein function and may
contribute to health ailments.
Previously, we introduced SIFT, which uses sequence
homology to predict whether an amino acid substitution in a protein will affect protein function (Ng and Henikoff 2001). SIFT is based on the premise that important amino acids will be conserved among sequences in a protein family, so changes at amino acids conserved in the family should affect protein function. Given a protein
sequence, SIFT chooses related proteins, obtains an
alignment of these proteins with the query, and, based on the amino
acids appearing at each position in the alignment, makes a prediction
as to whether a substitution will affect protein function. A position
in the protein query that is conserved in the alignment will be scored
by SIFT as intolerant to most changes; a position that is
poorly conserved will be scored by SIFT as tolerating most
changes. Unlike the tools of Sunyaev et al. (2001) and Chasman and
Adams (2001), SIFT does not require structural information
and therefore can be applied to a much larger number of proteins.
Here we apply SIFT to human disease and polymorphism databases. We find that SIFT's prediction ability is similar to that of tools that require structural information. However, we do not arrive at a similar conclusion concerning the number of damaging nsSNPs in the human genome. Rather, our detailed examination of the source of nsSNPs in current databases reveals biases that inflated the other groups' estimates.
RESULTS
We define a damaging nsSNP as a mutation whose resulting amino acid substitution in the corresponding protein affects protein function. We define an nsSNP as tolerated or neutral if the resulting amino acid substitution in the protein does not detectably alter protein phenotype. These definitions exclude mutations that affect transcription, translation, splicing, and other possible pretranslational alterations. Because SIFT predicts on amino acid substitutions in the protein product, it does not take into account these factors.
SIFT Analysis of Human Variant Databases
SIFT was applied to three different datasets of human
variants, and a summary of the prediction results is shown in Table 1.
The first dataset consisted of
substitutions annotated as involved in disease according to
SWISS-PROT/TrEMBL (Bairoch and Apweiler 2000).
SIFT predicted 69% (3626/5218) of these substitutions as
damaging. Some of these substitutions may be functionally neutral but
incorrectly annotated as causing disease if they were observed in
patients or are in linkage disequilibrium with another mutation that is
causing the disease phenotype. Thus, 69% is a lower bound of
prediction accuracy on damaging substitutions.
A second dataset consisted of nonsynonymous polymorphisms in normal
individuals detected by the Whitehead Institute (Cargill et al. 1999).
These nsSNPs (referred to as WI-nsSNPs) represent an unbiased set of
nsSNPs because they were systematically detected and confirmed across
many genes in control individuals. Some of the WI-nsSNPs may affect
protein function even though they were detected in control individuals,
for example if the altered phenotype was recessive or undiagnosed. Of the
WI-nsSNPs, 19% (22/115) were predicted by SIFT to be
damaging (Table 1). However, these may be neutral because there was no
apparent difference between this value and SIFT's 20%
weighted false positive error. Because SIFT predicted most
(69%) of the substitutions involved in disease as damaging and most
(81%) of the known polymorphisms as neutral, the results from these
two datasets indicate that SIFT can distinguish between
damaging and neutral human nsSNPs.
A third dataset consisted of putative nsSNPs in dbSNP (Sherry et al.
2001), one of the largest public SNP databases. Of the proteins
containing nsSNPs from dbSNP, 60% (1789/3005) had enough homologs for
SIFT prediction (Table 1). For these proteins, 25%
(757/3084) of the substitutions were predicted to be damaging by
SIFT. The weighted false positive error was calculated and
indicated that if all of the nsSNPs from dbSNP were functionally neutral, only 19% should have been predicted as damaging.
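The percentages quoted for the three datasets follow directly from the counts given in the text. The short Python sketch below recomputes them; the dictionary labels and the function name are ours and are used only for illustration.

```python
# Counts quoted in the text: (substitutions predicted damaging, substitutions predicted on).
datasets = {
    "disease substitutions (SWISS-PROT/TrEMBL)": (3626, 5218),
    "WI-nsSNPs": (22, 115),
    "dbSNP nsSNPs": (757, 3084),
}

def percent_damaging(damaging, total):
    """Return the percentage of substitutions predicted to be damaging."""
    return 100.0 * damaging / total

for name, (damaging, total) in datasets.items():
    print(f"{name}: {percent_damaging(damaging, total):.1f}% predicted damaging")

# Coverage: proteins with enough homologs for prediction, out of all dbSNP proteins.
coverage = 100.0 * 1789 / 3005
print(f"dbSNP proteins with a SIFT prediction: {coverage:.1f}%")
```

Rounded to whole percentages, these values match the 69%, 19%, 25%, and 60% reported above.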
We investigated the difference between the percentage predicted to be
damaging for dbSNP variants (25%) and that expected if dbSNP contains
only functionally neutral substitutions (19%). Sixteen genes were
chosen because they had a high fraction of nsSNPs from dbSNP predicted
to affect protein function. In the following sections, we show that the
apparent polymorphisms in these genes could be explained by reasons
other than SIFT prediction error (see Table 2 for a summary).
Substitutions Already Shown to Be Involved in Disease
For 5 of the 16 genes with an excessive number of nsSNPs predicted
to be damaging, most of the nsSNPs came from patients with disease, and
the gene in which the nsSNPs were detected had been shown or suspected
to contribute to the disease. These genes had a high fraction of nsSNPs
predicted to affect protein function because many of their variants in
dbSNP contribute to disease. SIFT correctly predicted
18/22 of the nsSNPs found in disease patients to affect protein
function and 9/10 nsSNPs found in control patients as functionally
neutral (Table 3). This
provides additional evidence that SIFT can distinguish
between nsSNPs involved in disease and those that are functionally
neutral. All predictions for the five genes are shown in Table 3; we
highlight certain aspects of SIFT by discussing several
predictions in detail.
SIFT detects nsSNPs that are damaging to a protein,
although loss of protein function may not cause an obvious phenotype. Although a protein may not play an important role in the organism, if
the amino acid substitution resulting from an nsSNP occurs at a
conserved position, it will be predicted to affect function. For
example, some nsSNPs in the melanocyte stimulating hormone receptor
(MSHR) gene are associated with a twofold risk for cutaneous malignant
melanoma (Palmer et al. 2000). Although MSHR is not under strong selection outside of African populations (Harding et al.
2000) and has a minor role in overall health, SIFT correctly predicted the appropriate nsSNPs as damaging because the
amino acid substitutions occurred at conserved positions in the protein
alignment used for prediction.
Some nsSNPs might be damaging to the protein, but their effects on
health are difficult to ascertain. For example, when a candidate gene
for diabetes, the gene encoding peroxisome proliferator activated
receptor α (PPARα), was screened for polymorphisms in diabetics and
nondiabetics, the nsSNP causing an L162V substitution in PPARα was
found at similar frequencies in both populations. SIFT
predicted this substitution to affect protein function (Table 3). The
prediction might appear incorrect based on the lack of association with
diabetes, but carriers of this nsSNP have higher cholesterol levels and
increased apolipoprotein B concentrations; thus, it has been proposed to
increase the risk of coronary artery disease (Vohl et al. 2000;
Lacquemant et al. 2000). SIFT was sensitive to
this mutation and predicted it to be damaging because the position of
substitution is conserved among orthologous proteins and other nuclear
hormone receptors present in the alignment used for prediction. Because
mutations in proteins can have pleiotropic effects, a mutation that
initially does not appear to have an effect but is predicted to affect
function by SIFT may have an effect that has not yet been
assayed for.
SIFT can detect overdominant nsSNPs in which the heterozygote has a selective advantage. Individuals severely deficient in methylenetetrahydrofolate reductase (MTHFR) activity develop mental retardation and cardiovascular disease (OMIM #236250). However, reduced MTHFR activity can also confer protection against child and adult acute leukemia and colon cancer. SIFT correctly predicted the two common variants of MTHFR with reduced enzymatic activity to affect protein function (Table 3). A lowered risk for some diseases has selected for these variants that reduce enzyme activity, despite other detrimental effects on health. Overdominant nsSNPs can become common in a population although they affect protein function. Common nsSNPs are often expected to be functionally neutral; their identification as damaging to the protein and perhaps maintained by overdominance may lead to the understanding of some common diseases.
nsSNPs Erroneously Mapped from Pseudogenes
SIFT detected two genes for which the changes from
dbSNP were mistakenly mapped from pseudogenes (Table 2). Programs that
identify SNPs by aligning ESTs (expressed sequence tags) or genomic
sequences might detect base differences between the functional gene and
a pseudogene and erroneously report these differences as SNPs in the
functional gene. For example, AGP1, the gene encoding
α1-acid glycoprotein, was annotated to contain six missense
changes in dbSNP, but the source of the differences was ESTs from
AGP2. Although AGP2 is expressed, the protein has been suggested to lack function because it has evolved at an
unconstrained rate (Merritt et al. 1990).
Damaging Mutations in Redundant Motifs
Like the pseudogene examples in the previous section, differences entered as nsSNPs into dbSNP for the gene encoding FLJ20079 actually matched other regions of the genome. However, this example is more complex because the other regions may code for functional genes (Fig. 1). After we inferred the hypothetical protein sequences from these regions, we observed that the amino acids predicted to affect protein function clustered in domains that were ancestrally derived from zinc-finger domains but could no longer function as zinc fingers (Fig. 1, dashed lines). Because these domains aligned to functional zinc-finger domains during prediction, the changes were predicted to affect protein function. These regions would have acquired a substitution that rendered the zinc finger nonfunctional; once the first deleterious substitution was acquired, other substitutions were allowed to accumulate in the nonfunctional domain. Thus, studying the location of amino acids predicted to be damaging in a protein might reveal regions that have lost their function when aligned to related sequences that have retained their function.
Sequencing Errors Mistaken for Polymorphisms
Most of the variation in the remaining eight genes with a high fraction of nsSNPs predicted to be damaging originated from comparison of sequences from ESTs and/or cDNA clones with the reference gene (Table 2). These sequences had multiple base changes with respect to the reference gene (http://blocks.fhcrc.org/~pauline/SIFTing_databases.html). It is doubtful that the observed differences are real SNPs occurring together on a rare allele; it is more likely that errors occurred in the EST sequencing or SNP interpretation procedure.
Among these eight genes, there were six nonsynonymous changes detected
from sequences that were identical to the reference gene except for the
change causing the amino acid substitution. These could be real nsSNPs
found in the population. For example, the nsSNP that causes a V39A
substitution in proteasome subunit
7 was detected in five different
individuals. Multiple independent observations support this as an nsSNP
occurring in the human population and SIFT predicted the
V39A substitution as tolerated. The other five substitutions were
predicted to affect protein function by SIFT. These could
be real nsSNPs rather than errors from SNP detection programs. As these
were detected in single libraries, they may be rare mutations under
negative selection.
DISCUSSION
Identifying Damaging nsSNPs
Currently, there are more than a million SNPs in dbSNP that can be screened for association with diseases. By predicting the nsSNPs most likely to be damaging, the number of SNPs screened for association with disease can be reduced to those that most likely alter gene function. SIFT returned predictions for 3084 of the 5780 nsSNPs in the dbSNP database (Table 1). Of these 3084 substitutions, SIFT identified 757 that are likely to affect protein function; these are of higher interest than nsSNPs predicted to be neutral because they are more likely to contribute to disease. Not all of these variants will be useful for screening for novel contribution to disease because some were already known to be involved in disease. Some mapped to pseudogenes and others were sequencing errors; these were mistakenly interpreted as polymorphisms but have no bearing on health.
If a marker is found to be associated with disease and the marker is an
nsSNP, prediction tools such as SIFT can provide
independent evidence as to whether the nsSNP itself contributes to
disease. A major problem in association studies is the high rate of false
positive signals from markers that appear to be associated with disease
when a large number of markers are tested (Emahazion et al. 2001).
nsSNPs in PPARα, MTHFR, and MSHR have been
shown to be associated with disease, but assays for reduction of
protein function have only been conducted on a fraction of them.
Because carrying out the appropriate assays may be time-consuming,
SIFT can filter out nsSNPs that are unlikely to affect
protein function before experimentation. Users can choose to minimize
either false negative or false positive error, tailoring
SIFT predictions to their needs.
How useful are prediction programs such as SIFT for
detecting damaging nsSNPs in proteins with only subtle effects on
health? A protein may play only a minor or redundant role in the
organism, so that if its function is altered the organism is only
mildly affected. Nevertheless, over the long periods of evolution
represented in an alignment, natural selection will remove damaging
substitutions from such proteins and their homologs. For this reason,
it was possible for SIFT to predict nsSNPs in
PPARα, MTHFR, and MSHR as damaging,
although they have only minor effects on a carrier's health.
SIFT prediction accuracy for a particular protein will depend on the alignment obtained. The sequences in the alignment are restricted to those homologous sequences that are available in the protein database; therefore, the resulting alignment information is expected to vary from protein to protein. The protein alignments constructed by SIFT contain paralogs as well as orthologs; therefore, active-site residues specific to orthologs may not appear conserved. However, a random mutation is more likely to affect structure than activity because relatively few residues are involved at the active site of the protein and many more are necessary for maintaining structure. Thus, reasonable prediction accuracy was obtained on the datasets when paralogs were included in the alignment used for prediction, although the ideal alignment is one composed of a diverse set of orthologs. As protein databases grow with data from sequencing whole genomes, a larger number of orthologs will become available and SIFT prediction should become more accurate.
Despite variation among protein families attributable to different evolutionary pressures and the heterogeneous set of sequence alignments used, our results show that SIFT works sufficiently well on a large scale so that it can be used as a first-pass filter to identify the substitutions worth pursuing. SIFT performance is similar to that of tools that require structure, as described below, so a more refined approach may not necessarily improve performance given the complexity of protein evolution.
Comparison of SIFT with Other Prediction Tools
Approximately 30% of the proteins encoded by the human genome are
likely to be homologous to proteins with known structures (Guex et al.
1999). Therefore, the prediction tools of Sunyaev et al. (2001) and
Chasman and Adams (2001), which require structural information, are
restricted to these proteins. SIFT needs only homologous
sequences for prediction and was able to predict on 60% of the protein
sequences that contained dbSNP nonsynonymous variants (Table 1),
providing twice the coverage of other tools.
Although SIFT does not use structural information, all
three tools appear to perform similarly (Table 4). Sixty-nine percent of amino acid
substitutions annotated to be involved in disease were predicted to be
damaging by SIFT and by Sunyaev et al. (2001).
SIFT (Ng and Henikoff 2001) and Chasman and Adams (2001)
predicted similarly for neutral substitutions that did not alter LacI
function; each had a false positive error of ~30%. It is possible
that SIFT performs similarly to tools that use structural
information because constraints inferred from protein sequence
alignments are based ultimately on structural constraints.
Estimating the Number of Damaging nsSNPs in an Individual
By extrapolating their results to the human genome, Sunyaev et al.
(2001) and Chasman and Adams (2001) have estimated that an individual
would have on average 2000 and 9500 damaging nsSNPs, respectively. Our
results do not support these estimates; the percentage of nsSNPs
predicted to be damaging in dbSNP (25%) was close to the false
positive error expected (19%) if all variants in dbSNP are
functionally neutral (Table 1). Moreover, we found that some of the 6%
difference between these two estimates can be accounted for by database contamination.
To calculate the percentage of nsSNPs that are damaging, ideally one
should use an unbiased set of nsSNPs, estimate the percentage of nsSNPs
predicted to be damaging, and then subtract the false positive error
for functionally neutral substitutions. The WI-nsSNPs dataset is
an unbiased set of nsSNPs, but because the genes screened were
few in number and are candidates for disease, one still should be
cautious in extrapolating from this dataset to the entire human genome.
When SIFT was applied to WI-nsSNPs, there was no
significant difference between the percentage predicted to be damaging
for these SNPs and the false positive error (19% vs. 20%,
respectively), indicating that the number of damaging nsSNPs per
individual falls within our prediction error (Table 5).
What accounts for the difference in results? Chasman and Adams (2001)
estimated 27% of nsSNPs are damaging based on the WI-nsSNPs but did
not take into account their false positive prediction error. Their tool
calculates the probability that a substitution affects function, and if
this is below 0.5, the substitution is predicted to be functionally
neutral. The 27% estimate was obtained by averaging the probabilities
for all WI-nsSNPs. This type of analysis will fail to get a 0%
estimate of damaging nsSNPs even if all substitutions are functionally
neutral. On a set of neutral substitutions, low probabilities will
correctly predict these substitutions as neutral, but when the
probabilities are averaged, a nonzero value will be obtained. Because
their approach cannot be used to estimate the percentage of damaging
nsSNPs, we instead examine the number of WI-nsSNPs that Chasman and
Adams (2001) predicted to be damaging and compare it with their false
positive error for functionally neutral substitutions. They predicted
15% of the WI-nsSNPs as damaging (Table 5). This is lower than their 31% false positive error observed for functionally neutral
substitutions (Table 4); therefore, no extrapolation for the number of
damaging nsSNPs in a human genome can be made.
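The difference between averaging probabilities and counting predictions can be illustrated with a toy calculation. The probabilities below are invented for illustration and are not taken from Chasman and Adams (2001); only the 0.5 threshold comes from the description above, and the function names are ours.

```python
# Hypothetical probabilities that each substitution affects function.
# Every value is below 0.5, so every substitution is predicted to be neutral.
probabilities = [0.10, 0.25, 0.30, 0.40, 0.20]

THRESHOLD = 0.5  # below this value a substitution is predicted neutral

def fraction_by_averaging(probs):
    """Estimate the damaging fraction as the mean of the probabilities."""
    return sum(probs) / len(probs)

def fraction_by_counting(probs, threshold=THRESHOLD):
    """Estimate the damaging fraction as the share of predictions at or above the threshold."""
    return sum(1 for p in probs if p >= threshold) / len(probs)

print(f"averaging: {fraction_by_averaging(probabilities):.2f}")
print(f"counting:  {fraction_by_counting(probabilities):.2f}")
```

In this toy set, averaging yields an estimate of about 25% damaging even though no substitution is predicted to be damaging, whereas counting yields 0%.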
In the case of Sunyaev et al. (2001), we examined the origin of the 79 nsSNPs they predicted to affect protein function and found that some
are biased; therefore, they should not be included in the estimate of
damaging nsSNPs per individual. Eighteen of the 79 nsSNPs are found in
the HLA class I protein, most mapping to the peptide-binding region
that is favored by diversifying selection (Janeway and Travers 1996).
An additional 17 polymorphisms predicted to affect protein function
were first discovered in an individual or population afflicted with
disease in a gene known or suspected to contribute to the disease.
These are far more likely to be involved in disease, and thus predicted
as damaging, than random nsSNPs. Three substitutions from in vitro
mutagenesis studies were also in the dataset. We were unable to account
for the origin of all 79 nsSNPs, but we concluded that at least 38 mutations were biased in the manner discussed above and are not representative of random nsSNPs
(http://blocks.fhcrc.org/~pauline/SIFTing_databases.html). After
removing these mutations, the percentage of polymorphisms predicted to
be damaging decreased to 19% (Table 5). After subtracting the 9%
false positive error they reported, this reduces the proportion of
damaging nsSNPs to 10%.
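The comparisons made in this section amount to simple bookkeeping: tally the biased nsSNPs, then subtract each tool's false positive error from the percentage it predicted to be damaging. The sketch below uses only the figures quoted in the text; the labels and the function name are ours.

```python
# (percent predicted damaging, percent false positive error), as quoted in the text.
estimates = {
    "SIFT, dbSNP": (25, 19),
    "SIFT, WI-nsSNPs": (19, 20),
    "Chasman and Adams, WI-nsSNPs": (15, 31),
    "Sunyaev et al., biased nsSNPs removed": (19, 9),
}

def excess_over_false_positive(predicted_pct, false_positive_pct):
    """Percentage points predicted damaging beyond the false positive error."""
    return predicted_pct - false_positive_pct

# Biased nsSNPs among the 79: HLA class I, disease-ascertained, in vitro mutagenesis.
biased_nsSNPs = 18 + 17 + 3

for label, (predicted, false_positive) in estimates.items():
    excess = excess_over_false_positive(predicted, false_positive)
    verdict = "above" if excess > 0 else "within"
    print(f"{label}: {predicted}% - {false_positive}% = {excess} points "
          f"({verdict} false positive error)")
```

Only the dbSNP and corrected Sunyaev et al. figures exceed the corresponding false positive error; the two WI-nsSNPs figures fall within it.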
The 9% false positive error reported by Sunyaev et al. (2001) was
based on applying their tool to substitutions that have occurred
between human proteins and their orthologs. These substitutions have
undergone millions of years of selection and must have had selection
coefficients very near zero to become fixed (with the exception of
substitutions that have been driven by positive selection). Conditional
mutations, those that affect protein function conditional on an
environment that may no longer exist (Fay et al. 2001), are excluded
from Sunyaev et al.'s control set. Such substitutions will exist as
SNPs that will eventually be culled out over time, but they have
undetectable effects on an individual's health. Thus, Sunyaev et
al.'s control set is the easiest set of substitutions to predict on
because even long evolutionary periods are insufficient for them to be
culled out. Hence, the 9% false positive error is a lower limit for
their prediction method. The 10% difference between Sunyaev et al.'s
9% false positive error and 19% nsSNPs predicted to be damaging
(after correcting for biased nsSNPs) is an estimate of damaging nsSNPs
that severely affect protein function, as well as the slightly
deleterious nsSNPs that might eventually be removed by natural
selection. This latter class may be irrelevant to human disease.
Another study has estimated that ~20% of nsSNPs are selected
against by comparing the frequencies of common and rare nsSNPs (Fay et
al. 2001). This estimate, like Sunyaev et al.'s, includes damaging, as
well as slightly deleterious, mutations. The discrepancy between the
two values may result from differences in the datasets and their small
sample sizes.
Based on the foregoing analysis, we were unable to conclude that the
percentage of damaging nsSNPs that can affect human health is as high
as 20% to 30%. We suggest there is a low number of nsSNPs that affect
protein function in each individual because estimates lie within false
positive error. This low number is supported by a study that examined
the prereproductive mortality in the children of first-cousin marriages
and estimated that the average human is heterozygous for 1.4 lethal
equivalents, or ~0.002% of human genes (Bittles and Neel 1994). We
conclude that there are very few damaging nsSNPs in an individual's
genome that could impact health.
METHODS
Predicting Damaging Amino Acid Substitutions
SIFT uses sequence homology to predict whether an
amino acid substitution affects protein function and has two major steps (Ng and Henikoff 2001).
In the first step, sequences closely related to the protein are chosen and aligned; prediction is based on this alignment. In the second step, a scaled
probability for the substitution of interest is calculated based on the
amino acids observed at the position of substitution in the alignment generated from the first step. The substitution is predicted to affect
protein function if its scaled probability falls below a cutoff. In
SIFT version 2 (available at
http://blocks.fhcrc.org/~pauline/SIFT.html), the method by which
sequences are chosen for the alignment has been changed. The user can
opt for either a low false negative error, which predicts most of the
substitutions that affect protein function, or a low false positive
error, which predicts fewer substitutions that affect function but with
a higher level of certainty.
SIFT version 2 first obtains related sequences, which are
assumed to be functional, by searching SWISS-PROT/TrEMBL (Bairoch and
Apweiler 2000) with PSI-BLAST (Altschul et al. 1997) for
two iterations (-e 0.0001, -h 0.002). The sequences found by
PSI-BLAST that are more than 90% identical to each other
are clumped together and a consensus sequence is obtained for each
clump by choosing the most frequently occurring amino acid for each
position in the sequence. An iterative procedure is then used to choose
the related sequences. The procedure starts by giving the query
sequence to PSI-BLAST to search among the consensus
sequences. The top hit is added and aligned to the query sequence.
Conservation, as measured by information content (Schneider et al.
1986), is calculated for each position in the alignment, and the median
of these values is obtained. The median conservation can range from 4.3 (sequences nearly 100% identical to each other) to 0 (all 20 amino
acids are represented at the majority of positions in the sequence
alignment). If the median conservation over all positions does not fall
below a user-defined cutoff, the hit is retained in the alignment and a
PSI-BLAST checkpoint file is built from the alignment. The
checkpoint file is used as a query for PSI-BLAST to search
among the remaining consensus sequences and the highest-scoring hit is
added to the alignment only if the median conservation does not fall below the cutoff. The process repeats and sequences are continually added to
the growing alignment until the median conservation cutoff is reached.
For efficiency, a new PSI-BLAST search is conducted after
five sequences have been added. Once the process stops and the
consensus sequences to be included are determined, the protein sequences
corresponding to these consensus sequences are obtained and their
PSI-BLAST alignment is used. To prevent the alignment from
being contaminated by pseudogenes or protein sequences containing the
polymorphism, sequences >90% identical to the query sequence are
removed. SIFT allows a range of cutoffs, and similar results are obtained when sequences 95% and 99% identical to the query are removed
(http://blocks.fhcrc.org/~pauline/SIFTing_databases.html). The
alignment is used for the second step of SIFT prediction as described previously with the gap option turned off (Ng and Henikoff 2001).
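The median conservation used to stop sequence selection can be sketched in a few lines of Python. This is a simplified illustration rather than the SIFT implementation: it assumes information content is log2(20) minus the Shannon entropy of the observed residues in a column, and it omits the small-sample correction of Schneider et al. (1986) and any sequence weighting. The function names and the gap handling are our assumptions.

```python
import math
from collections import Counter
from statistics import median

MAX_BITS = math.log2(20)  # about 4.32 bits for a 20-letter amino acid alphabet

def column_information(column):
    """Information content (bits) of one alignment column, ignoring gap characters."""
    residues = [aa for aa in column if aa != "-"]
    if not residues:
        return 0.0  # assumption: an all-gap column carries no information
    counts = Counter(residues)
    total = len(residues)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return MAX_BITS - entropy

def median_conservation(alignment):
    """Median information content over the columns of equal-length aligned sequences."""
    return median(column_information(col) for col in zip(*alignment))

def passes_cutoff(alignment, cutoff=2.75):
    """True if adding a hit has not pushed median conservation below the cutoff."""
    return median_conservation(alignment) >= cutoff

identical = ["MKVLA", "MKVLA", "MKVLA"]
print(f"{median_conservation(identical):.2f}")  # maximum, about 4.32
```

A column containing all 20 amino acids at equal frequency scores 0 bits, and a fully conserved column scores about 4.3 bits, which agrees with the range stated above.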
The user sets the median conservation cutoff to minimize either false
negative error or false positive error. We used the mutation dataset
from Escherichia coli LacI (Pace et al. 1997) to
decide the range of median conservation values that work best. When the
median conservation ranges from 2.25 to 3.25, the total prediction
accuracy (number correctly predicted/number total substitutions assayed) on LacI remains the same (68%). Therefore, for prediction on
the databases described here, we used 2.75 as the median conservation cutoff. If the sequences represented at the position of substitution had median conservation >3.25, this indicated that there were not
enough homologous sequences in the database; therefore, no prediction
was made.
When SIFT returns the prediction for an amino acid substitution, it also returns the median conservation for the sequences used in the alignment. A lower value provides greater confidence that the prediction for a substitution has a low false positive error because a low median conservation value reflects that very diverse sequences were used in the alignment. Then a substitution predicted to be damaging has occurred at a position that has been well conserved among the diverse set of proteins despite the diversity of amino acid compositions at other positions. This indicates that the position of substitution is constantly under negative selection; therefore, it is likely that the change is damaging.
Databases
To identify amino acid substitutions involved in disease, we
searched SWISS-PROT 39.11 and TrEMBL 15.11 (http://www.expasy.ch/sprot, Bairoch and Apweiler 2000) with the keywords disease and
mutation. We found 7397 disease-causing substitutions from 606 proteins after removing any substitution annotated as polymorphism or probable polymorphism.
nsSNPs in normal individuals were detected by the Whitehead Institute
(Cargill et al. 1999). This dataset, downloaded from http://www.genome.wi.mit.edu/cvar_snps, is referred to as WI-nsSNPs.
Amino acid variants from dbSNP (build #95)
(http://www.ncbi.nlm.nih.gov/SNP, Sherry et al. 2001) were found by
searching dbSNP for variants with FXN-"coding nonsynonymous" in the organism Homo sapiens.
Entries that listed the amino acid position affected were retrieved.
For a given substitution, the reference amino acid was checked to match
the amino acid in the protein sequence corresponding to the accession
number referred to in the refSNP file. If the substitution did not
match, it was discarded. If a substitution was referenced to more than
one protein, such as in isoforms, the duplicated substitutions were
removed so that the substitution was represented only once. Only one
substitution per position was predicted on. After applying this filter,
5780 substitutions from 3005 protein sequences remained.
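The filtering steps above can be sketched as follows. The record layout (a dictionary with rs_id, accession, position, ref, and alt keys) and the function name are hypothetical and chosen only for illustration; they are not the dbSNP file format. The sketch also assumes that duplicates across isoforms can be recognized by a shared refSNP identifier.

```python
def filter_substitutions(records, proteins):
    """Apply the three filters described above.

    records:  iterable of dicts with keys rs_id, accession, position (1-based), ref, alt
    proteins: dict mapping accession number to protein sequence
    """
    seen_snps = set()
    seen_positions = set()
    kept = []
    for rec in records:
        seq = proteins.get(rec["accession"])
        pos = rec["position"]
        # Discard if the reference amino acid does not match the protein sequence.
        if seq is None or not 1 <= pos <= len(seq) or seq[pos - 1] != rec["ref"]:
            continue
        # Keep a substitution only once, even if it is referenced to several isoforms.
        if rec["rs_id"] in seen_snps:
            continue
        # Keep only one substitution per position in a given protein.
        key = (rec["accession"], pos)
        if key in seen_positions:
            continue
        seen_snps.add(rec["rs_id"])
        seen_positions.add(key)
        kept.append(rec)
    return kept

# Toy data, invented for illustration.
proteins = {"P1": "MKV", "P1b": "MKVL"}
records = [
    {"rs_id": "rs1", "accession": "P1", "position": 2, "ref": "K", "alt": "R"},
    {"rs_id": "rs1", "accession": "P1b", "position": 2, "ref": "K", "alt": "R"},  # isoform duplicate
    {"rs_id": "rs2", "accession": "P1", "position": 3, "ref": "A", "alt": "T"},   # reference mismatch
    {"rs_id": "rs3", "accession": "P1", "position": 2, "ref": "K", "alt": "E"},   # same position
    {"rs_id": "rs4", "accession": "P1b", "position": 4, "ref": "L", "alt": "P"},
]
kept = filter_substitutions(records, proteins)
print([rec["rs_id"] for rec in kept])
```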
Database predictions are available at http://blocks.fhcrc.org/~pauline/SIFTing_databases.html
Estimation of False Positive Error
To test the hypothesis that all substitutions from a database are
neutral, the percentage predicted to be damaging on the test set was
compared with the percentage predicted to be damaging on a set of
substitutions known to be neutral. More than 4000 single amino acid
substitutions had been introduced into LacI and both neutral and
negative phenotypes were assayed (Pace et al. 1997). In our previous
study, this dataset was used to measure SIFT performance
(Ng and Henikoff 2001). Because the effects of substitutions are known
in this protein, we used this dataset as a standard to calibrate the
expected prediction accuracy. SIFT's prediction accuracy
for LacI is 68% for all substitutions with a median conservation
cutoff of 2.75. However, the mutation data for LacI was generated from
assaying 12 or 13 amino acid substitutions at each position, and some
of the amino acid substitutions tested could not have occurred from a
single base change, which is presumed for substitutions in the
polymorphism test set. Because performance on amino acid substitutions
that require multiple base changes has no relevance for the
substitutions assayed on the databases, and some types of substitutions
will occur more often than others, prediction accuracy must be
calibrated for the composition of the test set being predicted on. The
tolerated prediction accuracy weighted by composition of the test set
was calculated as:
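$$
\text{weighted tolerated prediction accuracy} = \sum_{i,j} \left(\text{tolerated prediction accuracy for } aa_i \leftrightarrow aa_j \text{ in LacI}\right) \times \frac{\text{number of } aa_i \leftrightarrow aa_j \text{ substitutions in the database}}{\text{number of database substitutions for which LacI accuracy is available}} \tag{1}
$$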
For example, SIFT predicted 75% of the Ala → Thr and Thr → Ala neutral
substitutions as tolerated in the LacI dataset. This is the left term
in Equation 1 for i = Ala and j = Thr. Of the 3084 substitutions from
dbSNP, 107 are Ala ↔ Thr. Rather than the right term
simply being 107/3084, the denominator is reduced because not all
combinations of substitutions were assayed in the LacI dataset.
Tolerated prediction accuracy based on the LacI data is available for
2499 of the substitutions from dbSNP; thus, the contribution of the
Ala ↔ Thr substitution to the weighted tolerated accuracy is 0.75 × 107/2499. The weighted tolerated prediction accuracy is the sum over
all substitutions aai and aaj for which LacI
tolerated prediction accuracy can be calculated and is weighted by the
proportion of substitutions of aai and aaj
occurring in the polymorphism database. The weighted false positive
error is the weighted tolerated prediction accuracy subtracted from 100%.
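The weighting can be made concrete with a short sketch. The function and argument names below are hypothetical, and the input values in the usage note are toy numbers rather than data from the study. The only figures taken from the text are the Ala ↔ Thr example, whose contribution is 0.75 × 107/2499, or roughly 0.032.

```python
from collections import Counter
from typing import Dict, FrozenSet, Iterable, Tuple


def weighted_tolerated_accuracy(
    laci_accuracy: Dict[FrozenSet[str], float],
    db_substitutions: Iterable[Tuple[str, str]],
) -> float:
    """Weight per-pair LacI tolerated accuracy by database composition.

    laci_accuracy maps an unordered amino acid pair, e.g.
    frozenset({"A", "T"}), to the fraction of neutral LacI substitutions
    of that type predicted as tolerated (a value between 0 and 1).
    db_substitutions lists (original, variant) amino acids observed in
    the polymorphism database. Pairs without a LacI accuracy are left
    out of both the numerator and the denominator, as in the text.
    """
    # Count substitutions per unordered pair, so A->T and T->A pool.
    counts = Counter(frozenset(pair) for pair in db_substitutions)
    covered = {p: n for p, n in counts.items() if p in laci_accuracy}
    total = sum(covered.values())
    if total == 0:
        raise ValueError("no database substitution has a LacI accuracy")
    return sum(laci_accuracy[p] * n / total for p, n in covered.items())


def weighted_false_positive_error(
    laci_accuracy: Dict[FrozenSet[str], float],
    db_substitutions: Iterable[Tuple[str, str]],
) -> float:
    """Complement of the weighted tolerated accuracy (as a fraction)."""
    return 1.0 - weighted_tolerated_accuracy(laci_accuracy, db_substitutions)
```

With a toy table `{frozenset("AT"): 0.75, frozenset("GS"): 0.5}` and database substitutions `[("A","T"), ("T","A"), ("A","T"), ("G","S"), ("K","R")]`, the Lys/Arg pair is skipped because it has no LacI accuracy, leaving four covered substitutions and a weighted accuracy of 0.75 × 3/4 + 0.5 × 1/4.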
Genes with a High Fraction of nsSNPs Predicted to Affect Protein Function
We approximated the predictions for 217 genes with at least three
nsSNP entries from dbSNP according to a binomial distribution. SIFT, with a median conservation cutoff of 2.75, has a false positive error of 0.30 for the entire LacI dataset. If x is the number of substitutions predicted to be damaging by SIFT and n is the total number of substitutions predicted on for the
protein, the probability that at least x variants are predicted to
affect function is:
$$
P(X \ge x) = \sum_{k=x}^{n} \binom{n}{k} (0.30)^{k} (0.70)^{n-k} \tag{2}
$$
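The binomial tail probability described here can be evaluated directly. The sketch below assumes the per-substitution false positive error of 0.30 quoted in the text; the function name `prob_at_least` is a label chosen for this example.

```python
from math import comb


def prob_at_least(x: int, n: int, p: float = 0.30) -> float:
    """Probability of at least x damaging predictions among n substitutions.

    Each substitution is treated as an independent trial that is predicted
    damaging with probability p (the false positive error), so the count
    of damaging predictions follows a binomial distribution.
    """
    if not 0 <= x <= n:
        raise ValueError("x must satisfy 0 <= x <= n")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability")
    # Sum the binomial probabilities for k = x, x + 1, ..., n.
    return sum(comb(n, k) * p ** k * (1.0 - p) ** (n - k)
               for k in range(x, n + 1))
```

For a protein with three substitutions, `prob_at_least(3, 3)` is 0.30 cubed, or 0.027, so three damaging predictions out of three would rarely arise from false positives alone. A small probability points to genes whose fraction of damaging predictions is higher than the false positive rate would explain.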
WEB SITE REFERENCES
http://blocks.fhcrc.org/~pauline/SIFT.html; site at which SIFT version 2 is available.
http://blocks.fhcrc.org/~pauline/SIFTing_databases.html; site at which database predictions are available.
http://www.expasy.ch/sprot; mutations annotated to be involved in disease were retrieved from SWISS-PROT/TrEMBL.
http://www.genome.wi.mit.edu/cvar_snps; dataset of nsSNPs in normal individuals as detected by the Whitehead Institute and referred to as WI-nsSNPs.
http://www.ncbi.nlm.nih.gov/SNP; dbSNP, a public SNP database.
ACKNOWLEDGMENTS
We thank Harmit Malik and Jorja Henikoff for their support. Kami Ahmad, Leonid Kruglyak, and Wendy Thomas gave thoughtful comments on the manuscript. P. Ng is a Department of Energy Computational Science Graduate Fellow. This work was supported by a grant from NIH (GM20009).
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
4 Corresponding author.
E-MAIL steveh{at}fhcrc.org; FAX (206) 667-5889.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.212802.
REFERENCES
gene in diabetic patients by ABI sequencing and high density oligonucleotide array technology. Am. J. Hum. Genet. 63: abs997 (data on poster, noted by A.J. Brookes and entered into HGBASE).
gene is associated with altered function in vitro and plasma lipid concentrations in Type II diabetic subjects. Diabetologia 43: 673-680.
gene in Japanese subjects with mature-onset diabetes of the young. J. Hum. Genet. 46: 285-288.
gene in type 2 diabetes associated with coronary heart disease. Diabetes Metab. 26: 393-401.
1-acid glycoprotein genes and surrounding Alu repeats. Genomics 6: 659-665.
gene: Association of the L162V mutation with hyperapobetalipoproteinemia. J. Lipid Res. 41: 945-952.
Received August 27, 2001; accepted in revised form December 20, 2001.