Vol. 10, Issue 8, 1259-1265, August 2000
RESOURCE
ABSTRACT
SNPs (Single-Nucleotide Polymorphisms), the most common DNA variant in humans, represent a valuable resource for the genetic analysis of cancer and other illnesses. These markers may be used in a variety of ways to investigate the genetic underpinnings of disease. In gene-based studies, the correlations between allelic variants of genes of interest and particular disease states are assessed. An extensive collection of SNP markers may enable entire molecular pathways regulating cell metabolism, growth, or differentiation to be analyzed by this approach. In addition, high-resolution genetic maps based on SNPs will greatly facilitate linkage analysis and positional cloning. The National Cancer Institute's CGAP-GAI (Cancer Genome Anatomy Project Genetic Annotation Initiative) group has identified 10,243 SNPs by examining publicly available EST (Expressed Sequence Tag) chromatograms. More than 6800 of these polymorphisms have been placed on expression-based integrated genetic/physical maps. In addition to a set of comprehensive SNP maps, we have produced maps containing single nucleotide polymorphisms in genes expressed in breast, colon, kidney, liver, lung, or prostate tissue. The integrated maps, a SNP search engine, and a Java-based tool for viewing candidate SNPs in the context of EST assemblies can be accessed via the CGAP-GAI web site (http://cgap.nci.nih.gov/GAI/). Our SNP detection tools are available to the public for noncommercial use.
[The sequence data described in this paper have been submitted to the dbSNP data library under accession nos. SS8196-SS18418.]
INTRODUCTION
SNPs (Single-Nucleotide Polymorphisms) are the
most common form of DNA variation in humans. These variants occur at an
estimated frequency of one per 1000 to 2000 base pairs (Cooper et al. 1995; Kwok et al. 1996; Wang et al. 1998; Cargill et al. 1999; Halushka et al. 1999), making it possible in principle to identify a genetic marker in every gene.
A collection of tens or hundreds of thousands of SNPs would serve as a
valuable resource for the discovery of genetic factors affecting
disease susceptibility and resistance. These markers can be used in
association studies that assay how alleles of candidate disease loci
correlate with particular diseases (Lander and Schork 1994; Lander 1996; Risch and Merikangas 1996). Likewise, an extensive collection of
SNPs will be useful for identifying genetic variants involved in drug
metabolism (Meyer and Zanger 1997); this information will enable
clinicians to determine which pharmacological agent is most effective
for treating a given patient's condition, as well as which compounds
are least likely to produce an adverse reaction.
Because of their abundance, SNPs are the marker of choice for
constructing high-resolution genetic maps used for linkage analysis (Lander and Schork 1994; Kruglyak 1997; Zhao et al. 1998) and positional cloning (Collins 1995). High-density genetic maps are essential for studying complex traits such as predisposition to hypertension, diabetes, or asthma or susceptibility to infectious diseases such as malaria or acquired immune deficiency syndrome. Dense SNP-based maps also will prove valuable for loss-of-heterozygosity studies (Cavenee et al. 1983), which have played a critical role in
deciphering the genetic changes involved in cancer initiation and
progression. Understanding the genetic events that lead from immortalization to metastasis will improve cancer diagnosis and may
reveal common genetic changes in apparently unrelated tumor types,
thereby suggesting new therapies for certain forms of cancer.
Several large-scale SNP detection projects have been undertaken in
recent years. The first, performed at the Whitehead Institute, was
based on the hybridization of genomic PCR (Polymerase Chain Reaction)
products to DNA oligonucleotide arrays (Wang et al. 1998). The
Whitehead collection contains 3241 putative SNPs, 2227 of which have
been placed on genetic maps. An alternative approach, examining high-throughput genomic sequence for nucleotide variants, was used by Taillon-Miller et al. (1998) to identify 153 potential SNPs in 200.6 kilobases of sequence from chromosomes 5, 7, and 13. More recently, SNP
mining strategies based on the analysis of ESTs (Expressed Sequence
Tags) have been described (Buetow et al. 1999; Picoult-Newberg et al. 1999). Because the high error rate in EST sequences (~1%) makes it
difficult to distinguish true genetic variants from sequencing
artifacts, both Buetow et al. and Picoult-Newberg et al. used the
basecalling program Phred (Ewing and Green 1998; Ewing et al. 1998) and
the sequence assembly program Phrap (http://genome.washington.edu) to
directly analyze EST sequencing traces. The two groups used different
algorithms to filter out false-positives and validate predicted SNPs.
The goal of the National Cancer Institute's Cancer Genome Anatomy
Project (CGAP) is to provide a comprehensive catalog of molecular
differences distinguishing tumorous cells from their normal
counterparts. Within CGAP, the Genetic Annotation Initiative (CGAP-GAI)
group seeks to identify allelic variants of genes involved in cancer
initiation and progression. In our most recent round of SNP discovery,
we used the SNPpipeline, a set of sequence analysis tools described in
Buetow et al. (1999), to identify more than 10,000 high-probability
candidate single nucleotide polymorphisms among publicly available EST
sequences. Information about this collection of SNPs is accessible via
the internet (http://cgap.nci.nih.gov/GAI/). To present these SNPs in a
format useful to the human genetics community, we have placed >6800
predicted variants on integrated genetic/physical maps. We have
produced maps showing the locations of SNPs in genes expressed in the
breast, colon, kidney, liver, lung, or prostate in addition to a
comprehensive integrated map. We provide a Java-based SNP viewer that
displays sequence polymorphisms in the context of DNA sequence
alignments and a search engine that retrieves SNPs by keyword,
description, or gene symbol. Each SNP is linked to the extensive
annotation maintained by the National Center for Biotechnology
Information (NCBI). Our SNP prediction tools are publicly available for
noncommercial use.
RESULTS AND DISCUSSION
SNP Prediction, Validation, and Confirmation
Using the SNPpipeline set of sequence analysis tools (Buetow et al. 1999), we have identified 10,243 high-probability (99% or better)
potential SNPs among sequences contained in the 14 April 1999 release
of UniGene (Schuler et al. 1996). Candidates were derived from 6458 UniGene assemblies, 1862 of which corresponded to named genes. This set
of candidate SNPs has been submitted to dbSNP (Sherry et al. 1999;
http://www.ncbi.nlm.nih.gov/SNP/). We verified predicted SNPs in a
two-step process. To validate a polymorphism, we showed that it is
present in genomic DNA from eight individuals via direct sequencing or
a RFLP (Restriction Fragment Length Polymorphism) assay. If validation
was successful, we confirmed that the variant was transmitted in Centre
d'Etude du Polymorphisme Humain pedigrees as a simple Mendelian trait. Confirmation is essential for distinguishing true allelic variants from
false-positives produced by the assembly of ESTs from members of a gene
family into a single contig and other artifacts.
As part of the confirmation procedure, we genetically mapped SNPs
against a set of reference maps (Murray et al. 1994) by using the
CRI-MAP 2.4 program (Green 1992). The first phase of genetic mapping
involves assigning a polymorphism to a chromosome. This was
accomplished by looking for pairwise linkage of the SNP and markers in
the mapping panel via the CRI-MAP two-point option. In the absence of
physical mapping data, LOD 4 (logarithm of odds) linkage to one genetic
marker or LOD 3 linkage to two reference markers on the same chromosome
is the minimum requirement for placing a SNP on a chromosome. If
physical mapping information was available, LOD 3 linkage to a single
reference marker on the same chromosome was sufficient for chromosome
assignment. Once a SNP was assigned to a chromosome, we used the build
option, which calculated a maximum likelihood score, to determine the "best" and "likely" map intervals. The best interval was that with the highest likelihood score, and the likely interval was the set
of map intervals whose likelihood scores were within three orders of
magnitude of the best score. Note that the best and likely locations
may be identical and may encompass more than one marker interval.
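The chromosome-assignment thresholds and the best/likely interval rule described above can be sketched as follows. This is a minimal illustration, not the CRI-MAP implementation: the function names, the input layout, and the marker names in the usage note are hypothetical, the likelihood scores are assumed to be log10 values (so that "three orders of magnitude" is a difference of 3), and returning no assignment when more than one chromosome qualifies is an assumption of this sketch.

```python
from collections import defaultdict


def assign_chromosome(two_point_lods, physical_chromosome=None):
    """Pick a chromosome from two-point LOD scores (illustrative sketch).

    two_point_lods: iterable of (reference_marker, chromosome, lod) tuples.
    physical_chromosome: chromosome suggested by physical mapping data,
        or None when no physical mapping data are available.
    """
    lods_by_chromosome = defaultdict(list)
    for _marker, chromosome, lod in two_point_lods:
        lods_by_chromosome[chromosome].append(lod)

    qualifying = []
    for chromosome, lods in lods_by_chromosome.items():
        n_lod3 = sum(1 for lod in lods if lod >= 3.0)
        n_lod4 = sum(1 for lod in lods if lod >= 4.0)
        if chromosome == physical_chromosome:
            # With physical mapping data, one LOD 3 linkage suffices.
            ok = n_lod3 >= 1
        else:
            # Otherwise: one LOD 4 linkage, or LOD 3 linkage to two markers.
            ok = n_lod4 >= 1 or n_lod3 >= 2
        if ok:
            qualifying.append(chromosome)

    # Assumption of this sketch: ambiguous or absent evidence means no assignment.
    return qualifying[0] if len(qualifying) == 1 else None


def best_and_likely_intervals(log10_likelihoods, window=3.0):
    """Return (best, likely) interval lists from per-interval log10 likelihoods."""
    best_score = max(log10_likelihoods.values())
    best = [i for i, s in log10_likelihoods.items() if s == best_score]
    likely = [i for i, s in log10_likelihoods.items() if best_score - s <= window]
    return best, likely
```

For example, `assign_chromosome([("M1", "1", 3.2)])` returns `None`, whereas the same single LOD 3 linkage with `physical_chromosome="1"` returns `"1"`. Because the best interval always falls inside the three-log window, the likely set always contains it.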
Construction of Genetic/Physical Maps
Integrated genetic/physical maps for the 22 autosomes and the X
chromosome are based on the CHLC/ABI (Cooperative Human Linkage Center/Applied Biosystems) Prism version 1 marker panel
(http://www.chlc.org/ABI/ABIRefMaps.html) and the GeneMap'98
version of the Genebridge4 radiation hybrid map (Gyapay et al. 1996; Deloukas et al. 1998; http://www.ncbi.nlm.nih.gov/genemap98). Genetic
map distances between reference markers were obtained from the CHLC.
Gender-averaged map positions were used for autosomal markers. Whenever
possible, radiation hybrid map positions for reference markers were
taken from GeneMap'98. Ninety-two of 359 markers, however, have not
been localized on the reference physical map. To complete the linkage
of the genetic and physical maps, we assigned these markers the
radiation hybrid map position of a closely linked proxy marker. With
the exception of D7S513, D11S987, and D12S336, proxies were chosen from
the 1996 Genethon genetic map (Dib et al. 1996). For the above three
markers, which are not on the Genethon map, proxies were chosen from
the Marshfield Medical Center genetic map (Broman et al. 1998). To
select a proxy marker, we chose the closest marker on the genetic map
(either proximal to or distal to the ABI version 1 marker) that has been positioned on the GeneMap'98 Genebridge4 radiation hybrid map. With few exceptions, proxy markers lie within two centimorgans of
the corresponding reference marker. The genetic and physical maps are
colinear, with the exception of inversions on chromosome 16 (markers
D16S420 and D16S401), chromosome 18 (markers D18S462 and D18S70), and
chromosome 20 (markers D20S173 and D20S171).
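The proxy-marker rule described above (take the genetically closest marker, on either side, that has a radiation hybrid position) might be sketched like this. The function name, the tuple layout, and the marker names and positions in the usage note are hypothetical; the source does not say how ties were broken, so this sketch simply keeps the first candidate at the minimum distance.

```python
def choose_proxy(target_cm, candidates):
    """Choose a proxy marker for a reference marker lacking an RH position.

    target_cm: genetic map position (cM) of the unmapped reference marker.
    candidates: list of (name, genetic_position_cm, rh_position) tuples for
        markers on the same chromosome; rh_position is None when the marker
        is not on the radiation hybrid map.
    Returns (name, rh_position, distance_cm), or None if no candidate has
    a radiation hybrid position.
    """
    mapped = [c for c in candidates if c[2] is not None]
    if not mapped:
        return None
    # Closest on the genetic map, whether proximal or distal to the target.
    name, cm, rh = min(mapped, key=lambda c: abs(c[1] - target_cm))
    return name, rh, abs(cm - target_cm)
```

With candidates `[("A", 10.0, None), ("B", 11.5, 230.0), ("C", 8.0, 190.0)]` and a target at 10.4 cM, marker "B" is chosen: "A" is closer genetically but has no radiation hybrid position.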
Placement of SNPs on the Integrated Map
Each SNP is associated with a template mRNA or genomic DNA sequence from one of the 21,993 parental UniGene clusters that served as the starting point of the SNP detection project. If the parental UniGene assembly has been mapped to a single chromosomal region (via one or more STSs within the cluster) on the GeneMap'98 Genebridge4 radiation hybrid map, we assign the SNP to the appropriate marker interval on the integrated genetic/physical map. If sequences within the UniGene cluster have been mapped to multiple locations on a single chromosome, we assign the mean map position to the SNP. If the cluster contains STSs that map to different chromosomes, we use cytogenetic mapping data for the cluster, if available, to resolve the inconsistency. Through this strategy, we positioned 6845 of 10,243 predicted SNPs on the integrated map.
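As a rough illustration of the placement rules in the preceding paragraph, the sketch below derives a map position for a candidate SNP from the STS positions of its parental UniGene cluster and then locates the enclosing marker interval. All names and the data layout are hypothetical, and leaving a SNP unplaced when cytogenetic data cannot resolve a chromosome conflict is an assumption of this sketch rather than something the text states.

```python
from bisect import bisect_right
from statistics import mean


def place_candidate_snp(sts_positions, cytogenetic_chromosome=None):
    """Return (chromosome, rh_position) for a candidate SNP, or None.

    sts_positions: list of (chromosome, rh_position) tuples, one per mapped
        STS in the parental UniGene cluster.
    cytogenetic_chromosome: chromosome from cytogenetic mapping, if available.
    """
    if not sts_positions:
        return None
    chromosomes = {chrom for chrom, _ in sts_positions}
    if len(chromosomes) > 1:
        # STSs disagree: use cytogenetic data to pick a chromosome if possible.
        if cytogenetic_chromosome not in chromosomes:
            return None  # assumption: unresolved conflicts stay unplaced
        chromosome = cytogenetic_chromosome
    else:
        chromosome = next(iter(chromosomes))
    positions = [pos for chrom, pos in sts_positions if chrom == chromosome]
    # A single position is its own mean; multiple positions are averaged.
    return chromosome, mean(positions)


def marker_interval(position, marker_positions):
    """Index of the interval containing position.

    marker_positions must be sorted; interval k lies between marker k-1
    and marker k (0 means before the first marker).
    """
    return bisect_right(marker_positions, position)
```

For instance, a cluster with STSs at 100.0 and 110.0 on chromosome 7 is placed at 105.0, which falls in interval 2 of a chromosome whose reference markers sit at 0.0, 50.0, and 120.0.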
In contrast to candidate and validated SNPs, both genetic and physical map data are used to place confirmed SNPs on the integrated map. If the physical map position (or a physical map position) of the parental UniGene assembly lies within the best or likely genetic map interval, we place the SNP in that marker interval. If no physical mapping data are available for the SNP or if the physical map position of the SNP does not correspond to the likely genetic map interval, we assign it to the lowest numbered marker interval within the best genetic map interval.
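The rule for confirmed SNPs can be expressed compactly if marker intervals are represented by their interval numbers. The following is a sketch under that assumption; the function name and arguments are hypothetical, and choosing the lowest-numbered interval when several physical positions agree with the genetic data is a choice made here for determinism, not a rule taken from the text.

```python
def place_confirmed_snp(best_intervals, likely_intervals, physical_intervals):
    """Choose the marker interval for a confirmed SNP.

    best_intervals: interval numbers making up the best genetic map interval.
    likely_intervals: interval numbers making up the likely genetic map interval.
    physical_intervals: interval numbers implied by the physical map
        position(s) of the parental UniGene assembly (may be empty).
    """
    genetic = set(best_intervals) | set(likely_intervals)
    agreeing = sorted(i for i in physical_intervals if i in genetic)
    if agreeing:
        # Physical and genetic data agree: use the physically mapped interval.
        return agreeing[0]
    # No physical data, or physical data conflict with the genetic placement:
    # fall back to the lowest-numbered interval within the best interval.
    return min(best_intervals)
```

With a best interval of `[4, 5]` and a likely interval of `[3, 4, 5, 6]`, a physical position in interval 6 yields 6, while a physical position in interval 9 (or no physical data at all) yields 4.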
We also constructed expression-based integrated SNP maps. To determine whether a single nucleotide polymorphism lies in a gene expressed in one of the major sites of cancer (breast, colon, kidney, lung, liver, or prostate gland), we used information provided by the National Center for Biotechnology Information to ascertain whether one or more ESTs from the UniGene cluster associated with the SNP were isolated from a cDNA library derived from the tissue of interest. Because the absence of an EST from a tissue's cDNA libraries does not show that the gene is not expressed in that tissue, only positive expression results are meaningful.
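The expression test amounts to a membership check over the tissue of origin of each EST in a cluster. The sketch below assumes the tissue labels have already been normalized to single words; real cDNA library annotation is less tidy, and the function and constant names are hypothetical.

```python
# Tissues for which expression-based maps are described in the text.
TISSUES = ("breast", "colon", "kidney", "liver", "lung", "prostate")


def expressed_tissues(est_library_tissues):
    """Return the tissues of interest with at least one EST in the cluster.

    est_library_tissues: iterable of tissue-of-origin labels, one per EST.
    A tissue missing from the result is NOT evidence of non-expression;
    it only means no EST from that tissue was sampled.
    """
    seen = {label.strip().lower() for label in est_library_tissues}
    return [tissue for tissue in TISSUES if tissue in seen]
```

A cluster whose ESTs come from libraries labeled "Lung", "brain", "lung", and "Colon" would appear on the colon and lung maps only.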
Features of the CGAP-GAI Web Site
SNP maps and related materials are accessible on the CGAP-GAI web site (http://cgap.nci.nih.gov/GAI/). Key features of the site are described below.
SNP Maps
SNP imagemaps (Fig. 1) contain a genetic map, physical map, and chromosome ideogram. A histogram adjacent to the genetic map shows the number of confirmed, validated, and candidate SNPs mapped to each marker interval. To the right of the physical map are PCR primer set identification numbers for confirmed SNPs that have been physically mapped. Each histogram and primer identification number is linked to a genetic map interval summary page (see below). Reference marker names are linked to annotation from the CHLC. In addition to a comprehensive map set containing every mapped SNP, we have generated expression-based sets of SNP maps. Our current set of tissue-specific maps shows the locations of sequence variants identified in genes expressed in the breast, colon, kidney, liver, lung, and prostate.
Linkage Maps
Linkage imagemaps (Fig. 2) show the best genetic mapping interval and the likely genetic mapping interval for each confirmed SNP in relation to the genetic/physical map. Primer set identification numbers identify SNPs. Genetic map interval numbers and primer identification numbers are linked to a genetic map interval summary page (see below).
Physical Maps
We provide a physical map of each interval of the integrated genetic/physical map. Physical maps display radiation hybrid mapped UniGene clusters containing candidate, validated, or confirmed SNPs as well as framework markers from the GeneMap'98 Genebridge4 map. Framework markers are hyperlinked to the corresponding GeneMap'98 chromosome map.
Summary Pages
Summary pages list confirmed, validated, and candidate SNPs. Information about each SNP also is provided; annotation includes the SNP identification number, a short description of the UniGene cluster associated with the SNP, the GenBank accession number of a template sequence from the UniGene assembly, and the gene symbol of the UniGene cluster if it corresponds to a named gene. Summary pages contain links to the SNP viewer (see below) and UniGene annotation, as well as RFLP and genetic mapping reports, where appropriate. Confirmed, validated, and candidate SNPs are listed separately on the summary pages, so SNPs within a single locus may be listed in three different locations on a page. In addition, we maintain a list of validated SNPs that have not been physically mapped.
SNP Viewer
Each SNP on a summary page is linked to a Java-based SNP viewer that displays two windows when launched. The first window shows the SNP in the context of a sequence alignment (Fig. 3), with minority residues at the polymorphic location shaded red. From this window, the user can retrieve additional information about the sequences in the assembly, obtain a list of cDNA libraries from which the sequences were derived, view the sequence traces, and access the PCR primer design program Primer3 (Rozen and Skaletsky 1998). The second window provides an overview of the sequence assembly (Fig. 4), displaying the locations of all SNPs in the assembly, SNP quality, contig depth, and position of the open reading frame in the assembly.
SNP Index
The SNP index search engine allows SNPs to be retrieved by keyword, GenBank accession number, or UniGene accession number. Search results are presented in a table that contains links to either a SNP summary page (for SNPs mapped on the Genebridge4 radiation hybrid map) or an integrated genetic/physical map (for SNPs physically mapped by other means). The results table also is linked to the SNP viewer and the UniGene web site. Because only two-thirds of the candidate SNPs have been placed on the genetic/physical map, information about the remaining SNPs can be accessed via the search engine. The SNP index search engine also allows users to view assemblies that do not contain SNPs, thereby providing a graphical overview of EST coverage for a gene of interest.
SNP Lists
We also maintain information about SNPs as downloadable files. Candidate and confirmed SNPs in named genes are listed in hypertext format tables. The text file "all.fasta" displays each predicted SNP in the context of a published sequence, rather than an EST assembly consensus sequence; the set of published reference sequences is kept in the "summary.fasta" text file. SNP annotation is summarized in the tab-delimited "snps.all" file.
Cooperative Human Linkage Center
The CGAP-GAI home page contains a link to the CHLC web site, a repository of information about human genetic markers and genetic maps. The CHLC site contains information in a variety of formats about SNPs detected by the GAI.
SNP Finder
Access to our SNP detection tools for noncommercial use also is provided through the CGAP-GAI web site. Registered users may upload ABI or SCF format sequencing traces to our server for analysis. Submitted traces can be assembled with UniGene sequences to improve the sensitivity of SNP detection.
Future Directions
We plan to make a number of modifications to the SNP maps, including the incorporation of Stanford_G3 radiation hybrid map (Stewart et al. 1997
METHODS
Platform
CGI (Common Gateway Interface) scripts and scripts that extract
annotation from the NCBI UniGene web site, draw GIF (Graphics Interchange Format) images, and generate the HTML (Hypertext Markup Language) pages were written in Perl 5.005_02 (Wall et al. 1996). The
LWP Perl module was used to access the NCBI web site, and the GD Perl
module (http://stein.cshl.org/WWW/software/GD) was used to generate GIF
images. The SNP viewer was written in Java 1.0 using the Linux port of
Sun's Java Development Kit, version 1.0.2 (http://www.blackdown.org).
Availability
The CGAP-GAI site is accessible to the public at http://cgap.nci.nih.gov/GAI/. Pages displaying the integrated genetic/physical maps and genetic map intervals contain client-side imagemaps that require Netscape 3.0 or Microsoft Internet Explorer 3.0 or higher. The SNP viewer is a Java applet, requiring a Java-capable browser.
Noncommercial use of the SNP finder is available to registered users. To obtain an account, contact K.H.B. (buetowk{at}nih.gov).
Data Storage and Retrieval
Integrated genetic/physical map pages and summary pages are maintained as static files. Linkage map pages and genetic mapping reports are generated via Perl CGI scripts from data in static files. Information in the CGAP-GAI site is indexed and searched using the Center for Networked Information Discovery and Retrieval Isearch-cgi 1.05 software (http://www.cnidr.org). The SNP viewer, SNP index, and RFLP report search engine are run from an Apache web server using the mod_perl extension (http://perl.apache.org). Data used by the SNP viewer and SNP index are retrieved from a Postgres database (http://www.quantum.de/~thh/postgres95/index.html) using the DBI Perl module (http://www.symbolstone.org/technology/perl/DBI/index.html), whereas the RFLP reports are stored as flat files.
WWW RESOURCES
http://www-genome.wi.mit.edu/genome_software/other/primer3.html. Rozen, S. and Skaletsky, H.J. 1998. Primer3.
ACKNOWLEDGMENTS
We thank Valerie Lantz, Amy Voltz, and two anonymous reviewers for helpful comments on this manuscript. Scot Drew provided excellent editorial assistance. We especially thank J. Kelley for her oversight of SNP validation and confirmation and S. Mayer, T. Bandey, T. Pham, C. Tanzola, K. Smith, and other members of the Laboratory of Population Genetics for their superb technical support.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
1 Corresponding author.
E-MAIL buetowk{at}nih.gov; FAX (301) 435-8963.
REFERENCES
Database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome Res. 9: 677-679.
Received October 21, 1999; accepted in revised form June 2, 2000.