|
|
|
|
Vol. 9, Issue 11, 1087-1092, November 1999
LETTER
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |
ABSTRACT |
|---|
|
|
|---|
Using assembled expressed sequence tags (ESTs) from 50 different cDNA libraries, we have identified contigs that represent the complete coding sequences of 850 known human genes, and have scanned these for high quality sequence substitutions. We report the identification and characteristics of 201 candidate single nucleotide polymorphisms found in the coding sequences (cSNPs) of 165 of these genes. Using a conservative calculation, coding region nucleotide diversity (the average number of differences between any pair of chromosomes) was found to be 3 per 10,000 bp based on this data. This analysis reveals that assembled ESTs from multiple libraries may provide a rich source of comparative sequences to search for cSNPs in the human genome.
| |
INTRODUCTION |
|---|
|
|
|---|
There is great interest in developing a third generation
genetic map of the human genome composed of single
nucleotide polymorphism (SNP) markers (Collins et al. 1997
). SNPs are
the most common form of sequence variation (Nickerson et al. 1998
; Wang
et al. 1998
), and it is likely that highly dense human genetic maps
containing more than 100,000 markers can be developed from the human
genome. Since SNPs are found in both coding and non-coding regions of the genome, randomly distributed markers as well as markers clustered in genes can be discovered (Collins et al. 1997
; Harding et al. 1997
;
Nickerson et al. 1998
). The majority of SNPs found in coding regions
(cSNPs) are single base substitutions that may or may not lead to amino
acid substitutions. Some cSNPs alter a functionally important amino
acid residue, and these are of interest for their potential links with
phenotype. Other cSNPs may prove useful for their potential links to
functional cSNPs via linkage disequilibrium mapping (Collins et al. 1997
).
There are many approaches available to find SNPs in the human genome,
but all involve some form of comparative analysis of the same DNA
segment from different individuals or from different haplotypes within
the same individual. For example, DNA segments amplified by PCR can be
compared based on conformation analysis, melting temperature analysis,
the ability to be differentially cut with enzymes or chemicals (Cotton
1993
), or by direct sequence analysis (Harding et al. 1997
; Nickerson
et al. 1998
; Wang et al. 1998
). It is also possible that large numbers
of SNPs could be identified as high quality mismatches between
overlapping clone sequences from the human genome project
(Taillon-Miller et al. 1998
). Most such variants are likely to be
located in noncoding regions that form the bulk of genomic DNA
sequence. Another available resource of sequences from different
individuals is expressed sequence tags (ESTs), single-pass sequence
reads obtained from cDNAs. Large-scale analysis of ESTs has resulted in
a useful gene expression resource of both known and unknown genes
(Adams et al. 1991
; Hillier et al. 1996
; Gerhold and Caskey 1996
).
Since cDNA libraries from multiple sources (representing different
individuals) have been sequenced, there is substantial redundancy in
the EST data that can be exploited to find SNPs (Buetow et al. 1999
;
Picoult-Newberg et al. 1999
). We report the use of assembled ESTs,
representing the complete coding sequences for 850 known human genes,
to scan for coding region SNPs (cSNPs). Our analysis has identified
201candidate cSNPs, of which 87 are predicted to lead to amino acid changes.
| |
RESULTS |
|---|
|
|
|---|
Identification of cSNPs using ESTs
Human ESTs from 50 libraries (containing 574,401 sequence reads)
were base-called with phred (Ewing et al. 1998
) and assembled with phrap (Green 1994
). SNP analysis focused on a subset of
the contigs that matched the complete coding sequences of known human genes (see Methods). A total of 850 full-length coding sequences averaging 748 bp in size (range 204-2433 bp) and spanning a total of
637,497 bp were covered by EST contigs. An approach similar to that
described by Picoult-Newberg et al. (1999)
, but systematically applying
phred quality scores (Ewing and Green 1998
) to help automate the analysis, was then used to identify single nucleotide mismatches in
these genes. After filtering, 223 candidate SNPs were identified. Visual inspection of the traces for these candidate SNPs revealed two
types of errors resulting from base-calling (n = 21) and
misalignment (n = 1) problems. The remaining 201 candidate
cSNPs were verified to have high quality mismatches, with a minimum of
two reads for each of the alternative alleles. Of the 850 genes
examined, 165 genes contained candidate SNPs. Twenty-nine of these
genes contained more than one candidate cSNP, that is, 23 genes with
two candidate cSNPs, 5 genes with 3 candidate cSNPs, and 1 gene with 4 candidate cSNPs.
Using a probabilistic approach, we estimated the nucleotide diversity across the scanned sequences to be 3 polymorphic differences per 10,000 bp between two randomly selected chromosomes. As candidate polymorphisms were required to have two sequences representing each candidate allele, this is likely to be an underestimate because it does not include the more rare alleles represented only once in the set of ESTs. Diversity estimates of 2 and 6 polymorphic differences per 10,000 bp were obtained for positions with the potential to give rise to nonsynonymous and synonymous substitutions, respectively.
Characteristics of cSNPs
Candidate cSNPs were classified according to substitution type (A/C,
A/G, A/T, C/G, C/T, and G/T) as shown in Figure 1a.
As expected, transitions (69%) were more common than transversions (31%) among the sequence changes (Cooper and Krawczak 1990
); 41% of
the substitutions were found to occur at a CpG site, a dinucleotide known for its high mutability (Cooper and Youssoufian 1988
).
|
We also determined the position of each candidate cSNP in the codon, and whether the predicted change was synonymous (silent) or nonsynonymous (replacement). The majority of the candidate cSNPs (59%) were found to occur at the third codon position, and as expected most (92%) of these were synonymous (see Fig. 1b for distribution of candidate SNPs according to codon position). A number of changes were also identified in codon positions 1 and 2 (Fig. 1b) and these accounted for most of the nonsynonymous changes. In all, 87 of the candidate SNPs (43%) are predicted to result in nonsynonymous changes.
Based on sequence information alone it is difficult to predict the
effect of an amino acid substitution on protein function. One potential
way to obtain information about functional constraint is via
evolutionary comparisons, as it is generally believed that conserved
residues are more likely to be functionally significant than
nonconserved residues. We examined sequence conservation at candidate
nonsynonymous cSNP positions using orthologous protein sequences from
mouse (Mus musculus), which were available for 38 of the genes
containing 41 of the 87 nonsynonymous candidate cSNPs. It was found
that candidate cSNPs were disproportionately over-represented among the
nonconserved positions. Although only 17% of the amino acid residues
were nonconserved (not identical) in the 38 mouse-human pairs, these
included 42% (17/41) of the nonsynonymous candidate cSNPs. This is
consistent with the idea that a high fraction of nonsynonymous
polymorphic substitutions occurs at positions that are not functionally
constrained on evolutionary timescales, particularly since a fraction
of the "conserved" positions are presumably not under functional
constraint but have by chance not mutated in the mouse and human
lineages. However, some of the candidate cSNPs may have the potential
to alter function. Although none were predicted to occur in an active
site [as annotated in SwissProt (http://www.expasy.ch/sprot)] in
proteins with enzymatic or receptor activity, eleven candidate cSNPs
were found to lie in SH3 domains, transmembrane regions, binding sites,
signal sequences, or other residues associated with specific functions
(Table 1).
|
Two of the predicted changes among candidate cSNPs would lead to the substitution of an amino acid by a stop codon, and could affect function depending on where the protein is terminated. One predicted termination was located at a Trp residue (position 68 of 134) at the start of the binding domain in the beta-galactoside binding protein (LGALS1: SwissProt database: accession no. P09382). The other nonsense mutation was found at position 219 (Glu) of 245 residues of the 14-3-3 protein (SwissProt database: accession P27348). In each case, the reads containing the termination codon were contributed by only one cDNA library (although the libraries for the two cases were different) and that library also contained the "wild-type" allele.
Confirmation of Nonsynonymous cSNPs
By scanning the existing polymorphism literature, we were able to
verify 17 of the 87 nonsynonymous candidate polymorphisms (Table
2). In some of these cases allele frequencies had
been estimated. Although most of these cSNPs were common (average
heterozygosity was calculated to be 0.33 for 12 sites; Table 2), three
of them, K02215 and 16581 (alternative allele frequency 8%) and APOAI
(alternative allele frequencies 0.05%), were relatively rare among the
tested populations. In addition to the cSNPs identified as such in the literature, four nonsynonymous substitutions were indicated as "conflicts" of unknown origin in the SwissProt database (Table 2).
Thus, nearly a quarter of the nonsynonymous candidate cSNPs (21 of 87)
could be verified with data from other laboratories (Table 2). The
phred quality of the bases for the previously known cSNPs we
identified ranged from 21 to 51 with an average quality of 33 at the
mismatch. This is similar to the average quality at the mismatch for
the entire set of candidate cSNPs, which was 32.
|
| |
DISCUSSION |
|---|
|
|
|---|
ESTs previously have been proven useful for finding SNPs associated
with cDNA sequences (Buetow et al. 1999
; Picoult-Newberg et al. 1999
).
We have shown that SNPs in coding regions (cSNPs) can also be
identified using ESTs. Although our approach is similar to that
described by Picoult-Newberg, we focused entirely on coding sequences.
There was only a small overlap of five SNPs detected by each approach.
Approximately 640,000 bp of coding sequence were examined across 850 genes and 201 sequence-confirmed candidate cSNPs identified. Our
estimate of the number of cSNPs is highly conservative because we
required two representatives for each allele. Despite this stringency,
our probabilistic estimates of the nucleotide diversity
(3 × 10
4) in coding regions is broadly consistent
with estimates obtained by more direct approaches (Cargill et al. 1999
;
Halushka et al. 1999
).
One problem in identifying SNPs using assembled ESTs, is discriminating
mismatches due to base-calling error from those due to variation in the
sequence. This can be problematic for ESTs because of the high
estimated error rate (2%) associated with these single-pass sequences
(Hillier et al. 1996
). For preliminary screening, we applied a
previously developed heuristic approach but systematically used
sequence quality obtained from the phred base-caller to help
sort base-calling errors from SNPs. To avoid missing true
polymorphisms, we found it useful to set a relatively modest quality
threshold (q > 20, corresponding to an error
probability < 1%) but to require 2 reads for each allele. As the
quality threshold is set higher, fewer bases in a read meet the cut-off
and fewer candidate SNPs are identified.
Approximately 52% (n = 105) of our 201 candidate cSNPs have sequence-based confirmation from multiple cDNA libraries, in the sense that each allele is represented in at least two libraries. It is possible that a portion of the remaining 48% could represent somatic rather than germline mutations; distinguishing between these sources of variation would require testing DNA from the contributing individual (currently not available) or systematic genotyping of a large sample. Only 8 of 52, or 15%, of the single-library nonsynonymous candidate cSNPs had been previously identified and confirmed by other approaches, a much smaller fraction than for the multiple-library candidate nonsynonymous cSNPs (13 of 35, or 37%). This could indicate that either the single-library candidates are rarer or that some of them are somatic mutations. However, only one of the single-library cSNPs originated from a tumor cell cDNA library.
SNPs identified in coding regions are useful for association studies
because of their potential linkage disequilibrium with functional
variants in those genes (Collins et al. 1997
). In this respect, several
of the candidate cSNPs we uncovered have been associated with disease
susceptibility (Jeunemaitre et al. 1992
; Arngrimsson et al. 1993
;
Lachman et al. 1996
; Li et al. 1997
; Morgan et al. 1999
). All of the
candidate cSNPs reported here have been deposited in dbSNP
(http://www.ncbi.nlm.nih.gov/SNP) and at our web site
(http://droog.mbt.washington.edu) and are available for further
analysis with regard to their distribution within and among human
populations and their potential functional relevance.
| |
METHODS |
|---|
|
|
|---|
EST Assembly
Chromatogram files for human ESTs (574,401 reads from 50 cDNA
libraries) were downloaded from the Washington University EST web site
(ftp://genome.wustl.edu). Based on the limited descriptive information
available for the source of these libraries, it appears that sequences
from at least 53 different individuals (106 chromosomes) are
represented. Traces were base-called with phred (Ewing et al.
1998
, v. 0.961028) and assembled with phrap (Green 1994
, v. 0.971106). Assembly was carried out in two stages. In the first stage,
the collection of ESTs was divided into subsets of 100,000 to 200,000, and each subset was independently assembled. In the second stage, the
sets of contigs and unincorporated singletons from the first stage were
assembled together, producing the final set of contigs, and all the
ESTs were aligned to the contig sequences (C. Wilson and P. Green, unpubl.).
Identifying EST Contigs Matching Known Coding Sequences
A set of nonredundant full length coding sequences of human genes was prepared as follows (M. Robinson, unpubl.). Entries with an annotated full-length coding sequence (CDS), starting with NTG and ending with a stop codon, were extracted from GenBank (release 103.0, files gbpri1.seq and gbpri2.seq). Entries that contained any of the words "mutant," "variant," "antisense," "rearranged," or "T-cell receptor," or that did not contain the word "RNA" in the locus field were discarded. Redundant sequences were then deleted and the 5' and 3' untranslated regions were removed, leaving a total of 4883 intact coding sequences.
Cross_match (Green 1993
) was then used to compare EST contigs
to the CDSs, and each CDS that aligned >96% of its length to matching EST contig was identified. There were 954 such CDSs. Phrap permits mismatches if similarity between the reads is
high, and may assemble together ESTs from paralogous genes and genes from multigene families. Some of these show such a high degree of
similarity that it is difficult to distinguish between alleles and
paralogs. To minimize the risk of paralogs contributing spuriously to
the candidate cSNPs, we eliminated those CDSs for which the number
of high quality discrepancies (quality
30) between the contig
and the CDSs exceeded 1% of the length of the alignment; this left 896 CDS. We then further eliminated 20 CDSs whose associated contigs had
<3 or >500 reads, 23 CDSs belonging to ribosomal protein families, and 3 CDSs belonging to lipocalin, DEFA, and CGB gene families. This left a total of 850 CDSs that were then scanned for SNPs.
SNP Identification
To identify candidate SNPs in the EST contigs, we reassembled each
contig, including the sequence of the mRNA, as a reference sequence.
The latter was assigned zero quality. Since several chromosomes are
generally represented in the contigs, candidate SNPs can be identified
as base mismatches between reads. Mismatches could also be due to
copying errors introduced during the production or cloning of the
cDNAs. To help minimize this problem, we required that mismatches be
confirmed (
two reads for each alternative base; the reference
sequence was permitted to be one of these). Other sources of mismatches
are base calling errors and misalignments of the sequences. To decrease
this problem, we required all identified "mismatched sites" to pass
a series of filters: (1) phred quality >20; (2) average
phred quality
20 over a window of 5 bases on either side
of the site; (3) an exact sequence match at all the bases in a window
of 5 on either side of suspected site; and (4) manual trace inspection
using consed (Gordon et al. 1998
). The last step was required
to eliminate mismatches due to systematic errors in the base-calling by
phred or alignment by phrap. With current versions of
phred (0.980904) and phrap (0.990319), 21 base-calling errors (all due to the presence of multiple peaks at the
site in question) and 1 misalignment case (due to inclusion of reads
not belonging to the contig) were found. We only considered substitution type polymorphisms in our analysis because of their higher
confirmation rate (Picoult-Newberg et al. 1999
).
There were ~7000 ESTs that matched these 850 genes but did not assemble together with the contigs selected for these genes. These ESTs were directly aligned to these genes and were scanned for the candidate SNPs using all the criteria described above, except for manual trace inspection. This resulted in an additional set of 264 possible candidates, a random subset of which was selected for closer inspection. Most of these turned out to be in ESTs showing a large percentage of high quality mismatches with the CDS, indicating that the EST was likely to be from a paralogous rather than allelic gene. There were also a number of cases in which there were large gaps in the alignment between the CDS and ESTs. Some of the candidate SNPs in such ESTs may be real, if the gaps are due to alternative or incomplete splicing. However, the gaps could also indicate that the EST belongs to a different gene or is chimeric. Due to the high risk of spurious ESTs, the candidate cSNPs from these unincorporated ESTs were not pursued further and are not reported in this paper.
Based on the reference CDS, each substitution mismatch was classified as to whether it leads to a synonymous or nonsynonymous substitution. Information on candidate cSNPs was stored in a relational database and was deposited in dbSNP (ss 5277 to ss 5477). A list of the candidate cSNPs is available at http://droog.mbt.washington.edu. Each candidate also has the following information available: mRNA accession number, contig with assembly information, SNP position in mRNA, codon change, amino acid change, gene name (if known), whether the SNP has single or multilibrary confirmation, the ESTs with their source (library), and 10 bp sequence on either side of the SNP.
Calculation of Sequence Diversity
For the diversity calculation, only reads from those libraries derived from single individuals (excluding the two pooled libraries Gessler Wilms tumor and Morton Fetal Cochlea) were considered. All the libraries except for four (Soares adult brain N2b4HB55Y, Soares adult brain N2b5HB55Y, Soares retina N2b4HR, and Soares retina N2b5HR) were considered to be from different individuals. At each coding sequence position in the 850 genes, all reads meeting the three quality and sequence match filters (see above, SNP identification) were determined. Positions at which there were fewer than four such reads, or where exactly one of the reads was discrepant with the others, were excluded from further consideration, leaving 443,440 positions that could be used for the diversity calculations.
At each position, the selected reads were grouped according to the individual of origin and the number of distinct chromosomes represented in each group was estimated as follows. Since each group derives from a single individual, the number of distinct chromosomes cannot exceed two. If two different bases (alleles) occur within the group, then there are exactly two chromosomes. Otherwise (only one allele present), the probability that there is one chromosome represented (all reads derive from the same chromosome) is 2 * 0.5n and the probability that there are two chromosomes represented is 1-2* 0.5n, where n is the number of reads in the group.
The expected number of pairs of distinct chromosomes for which the two alleles are the same (ns), and the expected number of pairs for which the two alleles are different (nd), can now be estimated within and between groups at each position. For example, if only a single allele is present for a group of n reads, then there is a fractional contribution of 1- 2* 0.5n (the probability that two chromosomes are present) to the value of ns and a contribution of 0 to nd for that group. The within and between group estimates are added at each position, and then added over positions to get overall estimates for ns and nd. The estimated diversity is then given by nd/(ns + nd).
| |
ACKNOWLEDGMENTS |
|---|
We thank C. Wilson for providing the EST assembly, M. Robinson for the coding sequence database, and S. Taylor for helpful discussions. This work was supported by the National Science Foundation (NSFBIR9214821) to K.G. and D.A.N, the National Institutes of Health (HG00774) to P.G., and the Department of Energy (DE-FG03-97ER62385) to D.A.N.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| |
FOOTNOTES |
|---|
2 Corresponding author.
E-MAIL: kavitag{at}u.washington.edu; FAX (206) 685-7301.
| |
REFERENCES |
|---|
|
|
|---|
Received May 19, 1999; accepted in revised form August 20, 1999.
This article has been cited by other articles:
![]() |
D. Savage, J. Batley, T. Erwin, E. Logan, C. G. Love, G. A. C. Lim, E. Mongin, G. Barker, G. C. Spangenberg, and D. Edwards SNPServer: a real-time SNP discovery tool Nucleic Acids Res., July 1, 2005; 33(suppl_2): W493 - W495. [Abstract] [Full Text] [PDF] |
||||
![]() |
The Ludwig-FAPESP Transcript Finishing Initiative, M. C. Sogayar, and A. A. Camargo A Transcript Finishing Initiative for Closing Gaps in the Human Transcriptome Genome Res., July 1, 2004; 14(7): 1413 - 1423. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. L. Vettore, F. R. da Silva, E. L. Kemper, G. M. Souza, A. M. da Silva, M. I. T. Ferro, F. Henrique-Silva, E. A. Giglioti, M. V.F. Lemos, L. L. Coutinho, et al. Analysis and Functional Annotation of an Expressed Sequence Tag Collection for Tropical Crop Sugarcane Genome Res., December 1, 2003; 13(12): 2725 - 2735. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Batley, G. Barker, H. O'Sullivan, K. J. Edwards, and D. Edwards Mining for Single Nucleotide Polymorphisms and Insertions/Deletions in Maize Expressed Sequence Tag Data Plant Physiology, May 1, 2003; 132(1): 84 - 91. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. Sorek and H. M. Safer A novel algorithm for computational identification of contaminated EST libraries Nucleic Acids Res., February 1, 2003; 31(3): 1067 - 1074. [Abstract] [Full Text] [PDF] |
||||
![]() |
Z. Ye and J. M. Parry The discovery and confirmation of single nucleotide polymorphisms in the human p53R2 gene by EST database analysis Mutagenesis, September 1, 2002; 17(5): 361 - 364. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. C. Ng and S. Henikoff Accounting for Human Polymorphisms Predicted to Affect Protein Function Genome Res., March 1, 2002; 12(3): 436 - 446. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Deutsch, C. Iseli, P. Bucher, S. E. Antonarakis, and H. S. Scott A cSNP Map and Database for Human Chromosome 21 Genome Res., February 1, 2001; 11(2): 300 - 307. [Abstract] [Full Text] |
||||
![]() |
E. Dawson, Y. Chen, S. Hunt, L. J. Smink, A. Hunt, K. Rice, S. Livingston, S. Bumpstead, R. Bruskiewich, P. Sham, et al. A SNP Resource for Human Chromosome 22: Extracting Dense Clusters of SNPs From the Genomic Sequence Genome Res., January 1, 2001; 11(1): 170 - 178. [Abstract] [Full Text] |
||||
![]() |
C. M. Ulrich, J. Bigler, C. M. Velicer, E. A. Greene, F. M. Farin, and J. D. Potter Searching Expressed Sequence Tag Databases: Discovery and Confirmation of a Common Polymorphism in the Thymidylate Synthase Gene Cancer Epidemiol. Biomarkers Prev., December 1, 2000; 9(12): 1381 - 1385. [Abstract] [Full Text] |
||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||