|
|
|
|
Vol. 9, Issue 2, 189-194, February 1999
RESOURCE
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |
ABSTRACT |
|---|
|
|
|---|
Ongoing efforts to sequence the human genome are already generating large amounts of data, with substantial increases anticipated over the next few years. In most cases, a shotgun sequencing strategy is being used, which rapidly yields most of the primary sequence in incompletely assembled sequence contigs ("prefinished" sequence) and more slowly produces the final, completely assembled sequence ("finished" sequence). Thus, in general, prefinished sequence is produced in excess of finished sequence, and this trend is certain to continue and even accelerate over the next few years. Even at a prefinished stage, genomic sequence represents a rich source of important biological information that is of great interest to many investigators. However, analyzing such data is a challenging and daunting task, both because of its sheer volume and because it can change on a day-by-day basis. To facilitate the discovery and characterization of genes and other important elements within prefinished sequence, we have developed an analytical strategy and system that uses readily available software tools in new combinations. Implementation of this strategy for the analysis of prefinished sequence data from human chromosome 7 has demonstrated that this is a convenient, inexpensive, and extensible solution to the problem of analyzing the large amounts of preliminary data being produced by large-scale sequencing efforts. Our approach is accessible to any investigator who wishes to assimilate additional information about particular sequence data en route to developing richer annotations of a finished sequence.
[Our software system is available via an extensive web supplement to this article at http://www.ncbi.nlm.nih.gov/Kuehl/prefinished.]
| |
INTRODUCTION |
|---|
|
|
|---|
The systematic sequencing of the human genome has
begun as part of the ongoing Human Genome Project (Olson 1995
; Boguski
et al. 1996
). The dominant strategy being used in this effort is clone-by-clone shotgun sequencing (Wilson and Mardis 1997
). Typically, this process first involves deriving large numbers of sequence reads
from subclones derived from a bacterial artificial chromosome (BAC) or
P1 artificial chromosome (PAC) to provide a high-redundancy sampling of
the starting clone. Preliminary sequence contigs are then assembled in
an automated fashion by use of software tools that are becoming
increasing powerful (Ewing et al. 1998
; Ewing and Green 1998
; Gordon et
al. 1998
). In the second stage of this process, additional sequence
reads are obtained in a highly directed fashion, so as to close the
remaining sequence gaps, to ensure the presence of high-quality data
throughout the assembled sequence, to prevent predictable artifacts
from producing errors in the sequence, and to produce a final finished
product. Assembled sequence at the various intermediate stages (i.e.,
not yet a finished product) is often referred to as "prefinished"
sequence. Because the first stage in shotgun sequencing is more
straightforward, there is often a considerable time lag between the
generation of prefinished sequence data for a clone, which typically
corresponds to nearly all of its sequence (albeit not always properly
assembled or contiguous), and the ultimate production of a finished
sequence. It is important to note that the groups participating in
human genome sequencing within the international Human Genome Project
have agreed to make prefinished sequence data freely available to other
investigators (Statement on the Rapid Release of Genomic DNA Sequence
1998
), either through the public databases (e.g., GenBank) and/or on their individual web sites (Pruitt 1997
). Ultimately, all finished sequence will be submitted to GenBank, EMBL, or DDBJ (Ouellette and
Boguski 1997
).
In addition to the publicly funded project, private efforts to generate
large amounts of human genome sequence over the next few years are
planned (Venter et al. 1998
). In this case, a shotgun sequencing
strategy will be applied to the whole human genome en masse, an
approach whose efficacy has been critically debated in the past (Green
1997
; Weber and Myers 1997
). Regardless of the nature of its ultimate
product(s), such an initiative should, at a minimum, produce large
amounts of prefinished (i.e., partially assembled) human genome sequence.
The combined public and private sequencing efforts are thus certain to
generate large amounts of prefinished sequence data that will be of
great interest to investigators wishing to identify genes,
polymorphisms, and other important sequence elements. Two inherent
features of prefinished sequence data that can make analysis challenging are (1) its sheer size (even for individual BAC or PAC
clones) and (2) its dynamic nature (i.e., prefinished sequence can
change on a regular basis as additional data is generated, as new
assemblies are made, and as problems are resolved). These challenges
prompted us to develop a simple but thorough approach for analyzing and
annotating prefinished genomic sequence, such as that being generated
at the sequencing centers. We were also motivated by the fact that
there is little or no commercial software that can be readily used for
managing, analyzing, and transmitting large amounts of sequence data,
and most research groups lack the resources and infrastructure to
develop their own systems. As a result, investigators typically resort
to the use of ad hoc methods for this purpose, such as manual storage
of the frequently changing data or annotations in spreadsheets or
laboratory notebooks. The latter approaches make long-term maintenance
of the data difficult and hinder the convenient sharing of results with
remote collaborators or other investigators. Here we describe that
three publicly available tools
PowerBLAST (Zhang and Madden 1997
),
Musk (Chao et al. 1995
), and Sequin (Kans and Ouellette 1998
)
can be
used in combination for the systematic analysis of prefinished sequence
data. We illustrate the utility of these tools by analyzing sequences
generated from human chromosome 7.
| |
RESULTS AND DISCUSSION |
|---|
|
|
|---|
We sought to develop and implement a systematic approach for
analyzing genomic sequence data, even when it is at a prefinished stage
(i.e., prior to complete assembly and verification). The approach we
established for this task (Fig. 1 and Table
1) was developed around three
publicly available software tools: PowerBLAST (ftp://ncbi.nlm.nih.gov/pub/sim2/PowerBlast) (Zhang and Madden 1997
),
Musk (Chao et al. 1995
), and Sequin (Kans and Ouellette 1998
).
PowerBLAST is a graphical network client for sequence similarity searching. Musk is a graphical viewer for visualizing ASN.1 objects (Ostell and Kans 1998
), such as PowerBLAST output. Sequin is a software
tool for annotating sequence and submitting it to GenBank (Benson et
al. 1998
), but for this application we have used it as a local database
and visualization system for managing and exchanging sequence data.
Importantly, PowerBLAST, Musk, and Sequin run on computers that are
universally available in typical molecular biology laboratories. These
programs are freely available and already designed for use with large
sequence files. In addition, they have features that make them uniquely
suited for use with genomic sequence data from ongoing shotgun
sequencing projects. These features include
| 1. | Scalability for use with sequences ranging in size from several kilobase pairs to hundreds of kilobase pairs. |
| 2. | The ability of PowerBLAST to output data in a format (ASN.1) that can be directly read by Sequin. |
| 3. | The efficiency with which multiple analyses of the sequence [e.g.,
repeat masking followed by comparison to dbEST, dbSTS, and the
nonredundant nucleotide (nr) database (Zhang and Madden 1997 |
| 4. | The ability to track all the analyses performed on a clone insert within a single database (Sequin) record. |
| 5. | The interactive graphical interface that facilitates the interpretation of the resulting patterns of sequence matches. |
| 6. | The ease with which such analyses can be propagated as sequences progress from prefinished to finished stages. |
|
|
As a demonstration of this approach, Figure
2 depicts an initial analysis performed on a
~45-kb prefinished sequence contig assembled from PAC DJ1099C19, a
clone derived from human chromosome 7 and sequenced by the Washington
University Genome Sequencing Center
(http://genome.wustl.edu/gsc/Search/ftp.shtml). Figure 2A is a global
view of the PowerBLAST analysis, as displayed by Musk (Zhang and Madden
1997
). Nucleotide matches from BLASTN queries against dbEST, dbSTS, and
GenBank are contained within a single boxed region labeled BLASTN.
Although, in this instance, no significant protein matches were found
by BLASTX (Altschul et al. 1994
), any protein matches would normally be
indicated within a separate boxed region distinct from the nucleotide
matches. Such an overview of database matches provides an appreciation
of the larger spatial relationships among the sequence matches. Figure
2A contains an artificially generated cluster of ESTs demonstrating how
multiple ESTs can be compressed into a single feature, thereby reducing the complexity of the display (note that many and varied additional examples are included in the web supplement). The color of each depicted match varies with the number of spatially separate alignments between the matching sequence and the query sequence. This is illustrated more clearly in Figure 2B, a higher resolution view of the
last ~6 kb of the sequence contig. Note the occurrence of single EST
sequences matching multiple sites in the query sequence, suggesting
division of those ESTs into multiple exons. In addition, more than one
EST often aligns under a single region in the query sequence,
indicating a cluster of ESTs. Often, but not always, such sets of ESTs
would already correspond to a single UniGene cluster (Schuler et al. 1996
).
|
The higher-resolution views of the data within Musk (Fig. 2) reveal important additional features of the sequence matches. For example, the GenBank accession number of each matching sequence is provided; each of these is in turn electronically linked to the corresponding GenBank record, allowing quick access to additional information about the matching sequences. Mismatches between the query sequence and a matched sequence are indicated by an orange line in the boxed region. Insertions and deletions are marked as gapped regions and vertical black lines, respectively. This coloring allows a rapid, rough assessment of the degree of identity between the query and aligned database sequences. Finally, additional information about the nature of the matching sequences is indicated in a color-coded fashion as described above. The latter feature is especially useful for characterizing genes of interest and defining their intron-exon boundaries.
A key aspect of our approach for sequence analysis/annotation is its ability to manage the inevitable maturation of sequence contigs, especially when used for the repeated analysis of sequences at various prefinished stages. Figure 3 illustrates how the PowerBLAST/Musk/Sequin combination can deal with such a progression. At an earlier stage of analysis, PAC DJ1099C19 was represented by multiple sequence contigs, with the analysis and annotation performed for two such contigs shown in Figure 3A. As sequencing progressed with this clone, contigs merged to create a single, larger sequence contig. All of the earlier annotations could be simply carried over and used for the larger contig without having to recompute and reanalyze the stable features.
|
In summary, we have devised and implemented a straightforward approach
for analyzing genomic sequence data, either at a prefinished or
finished stage. Such a strategy is useful both for organizing additional studies of regions of particular interest (e.g., searching for disease genes, such as during a positional cloning project) and as
a framework for more detailed or specialized sequence annotation (e.g.,
third party annotation). In either context, the annotation of genomic
sequence is an evolving process that requires both initial analysis
followed by more thorough experimental investigations and refreshed
comparisons with the growing sequence databases. The structured,
computer-readable (ASN.1) product of our approach could also serve as
an ideal electronic supplement to publications that focus on annotation
of sequence data beyond that supplied in the original GenBank sequence
record (Wheelan and Boguski 1998
). In any event, our strategy
represents a particularly practical approach for mining the wealth of
information that will become increasingly available over the next few years as
a working draft of the human genome is assembled (Collins et al. 1998
).
A users guide to the software system reported here is provided on an extensive web supplement to this article at http://www.ncbi.nlm.nih.gov/Kuehl/pre-finished.
| |
METHODS |
|---|
|
|
|---|
Prefinished sequences were retrieved by ftp from the Washington University Genome Sequencing Center (http://genome.wustl.edu/gsc/Search/ftp.shtml). Intermittently, the status of sequence data was checked, with changes identified by a new date for the corresponding file. Following ftp downloading, all sequence files were stored locally and subjected to the analyses described below.
PowerBLAST analyses were performed by use of the following parameters:
BLASTN (M = 1, N =
3, S = 40, S2 = 40) and BLASTX (S = 90, S2 = 90, FILTER = SEG). The results from PowerBLAST were then output as ASN.1 objects for ease of importation into Sequin (Kans
and Ouellette 1998
). Output from PowerBLAST was examined by the Musk
viewer [included as part of the PowerBLAST package (ftp://ncbi.nlm.nih.gov/pub/sim2/PowerBlast)]. For additional details about the PowerBLAST/Musk programs, see
http://www.genome.org/cgi/content/full/7/6/649.
A separate ASN.1 file was generated for the PowerBLAST results from each sequence contig of a BAC/PAC clone. Each ASN.1 file was then examined for sequence matches with >95% identity across 50 nucleotides. Furthermore, sequence matches were identified as ESTs, STSs, or known genes. ESTs were compared against the UniGene database (http://www.ncbi.nlm.nih.gov/UniGene). When multiple ESTs were represented in a UniGene cluster, the ESTs were annotated as single features for both the 5' and 3' ends, thereby reducing the complexity of the Musk views. STSs were annotated as STS features. Finally, nucleotide matches were checked for consistency with protein match (BLASTX) information within the region. In addition to searching the public sequence databases, each sequence contig was compared with a series of local databases with BLASTN. As before, matches of >95% identity across 50 nucleotides were annotated in the Sequin record of that sequence contig.
In a separate step, sequences likely to contain polymorphisms were identified by two approaches. First, individual sequence contigs were compared with a local database of simple nucleotide repeats with BLASTN. Second, the SEG program was used to identify regions of low compositional complexity; such regions often contain simple sequence repeats or pseudorepeats. Regions containing five or more copies of a repeat element were annotated.
As part of our sequence analysis process, several gene prediction
programs [GRAIL (Xu et al. 1994
), GenScan (Burge and Karlin 1997
), and
MZEF (Zhang 1997
)] were used to identify putative genes. Typically,
these programs were used once a sequence contig reached ~50 kb in
size. To reduce the manual effort involved in the visual examination of
these results, a series of Perl scripts (available on request) were
written to convert the output from each gene prediction program to
ASN.1 sequence records that were compatible with Sequin.
| |
ACKNOWLEDGMENTS |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| |
FOOTNOTES |
|---|
4 Corresponding author.
E-MAIL boguski{at}ncbi.nlm.nih.gov; FAX (301) 435-7794.
| |
REFERENCES |
|---|
|
|
|---|
Received August 31, 1998; accepted in revised form December 10, 1998.
This article has been cited by other articles:
![]() |
M. D. Wilson, C. Riemer, D. W. Martindale, P. Schnupf, A. P. Boright, T. L. Cheung, D. M. Hardy, S. Schwartz, S. W. Scherer, L.-C. Tsui, et al. Comparative analysis of the gene-dense ACHE/TFR2 region on human chromosome 7q22 with the orthologous region on mouse chromosome 5 Nucleic Acids Res., March 15, 2001; 29(6): 1352 - 1365. [Abstract] [Full Text] [PDF] |
||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||