|
|
|
|
Vol. 9, Issue 11, 1135-1142, November 1999
METHODS
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |
ABSTRACT |
|---|
|
|
|---|
Several efforts are under way to condense single-read expressed sequence tags (ESTs) and full-length transcript data on a large scale by means of clustering or assembly. One goal of these projects is the construction of gene indices where transcripts are partitioned into index classes (or clusters) such that they are put into the same index class if and only if they represent the same gene. Accurate gene indexing facilitates gene expression studies and inexpensive and early partial gene sequence discovery through the assembly of ESTs that are derived from genes that have yet to be positionally cloned or obtained directly through genomic sequencing. We describe d2_cluster, an agglomerative algorithm for rapidly and accurately partitioning transcript databases into index classes by clustering sequences according to minimal linkage or "transitive closure" rules. We then evaluate the relative efficiency of d2_cluster with respect to other clustering tools. UniGene is chosen for comparison because of its high quality and wide acceptance. It is shown that although d2_cluster and UniGene produce results that are between 83% and 90% identical, the joining rate of d2_cluster is between 8% and 20% greater than UniGene. Finally, we present the first published rigorous evaluation of under and over clustering (in other words, of type I and type II errors) of a sequence clustering algorithm, although the existence of highly identical gene paralogs means that care must be taken in the interpretation of the type II error. Upper bounds for these d2_cluster error rates are estimated at 0.4% and 0.8%, respectively. In other words, the sensitivity and selectivity of d2_cluster are estimated to be >99.6% and 99.2%.
[Supplementary material to this paper may be found online at www.genome.org and at www.pangeasystems.com.]
| |
ARTICLE |
|---|
|
|
|---|
The rapid generation of single-read sequence from the 3' ends
and 5' portions of sufficiently expressed mRNAs (popularly referred to as ESTs) (Adams et al. 1991
; Okubo et al. 1991
; Wilcox et al. 1991
)
has resulted in the discovery of many genes well before the projected
completion of the human genome project and before the completion of
sequencing efforts in other organisms (Adams et al. 1992
; Matsubara and
Okubo 1993
; Venter 1993
). Because the source
information is available for every EST, the intracluster representation
of libraries from discrete disease and developmental states can be
contrasted and hence large-scale expression studies can be performed
(Okubo et al. 1992
, 1994
; Adams et al. 1995
; Vasmatzis et al. 1998
).
EST sequence has enabled the construction of a physical map of the
human genome (Hudson et al. 1995
) as well a gene map that localizes
many genes with respect to the markers of the physical map (Schuler et
al. 1996
). The utility of EST data has also been increased greatly by
the establishment of centralized databases (Boguski et al. 1993
).
The fragmented nature and vast quantity of EST data pose an obstacle to harvesting the full potential from this data source. Hence, several projects are in progress to construct information frameworks, called gene indices, where the fragmented, error-prone EST data and the known gene sequence data can be consolidated and placed in a correct pathologic and mapping context indexed by gene such that all data concerning a single gene is in a single index class and each index class contains the information for only one gene. Algorithmically, these projects all comprise some type of cluster analysis in which sequence similarity and possibly other criteria are used to form the clusters or index classes. Below, we detail some of the clustering methods used in several gene index projects.
The Institute for Genome Research (TIGR) Gene Index (TGI)
(http://www.tigr.org/tdb/hgi/hgi.html; Adams et al. 1995
; Sutton et al.
1995
; White and Kerlavage 1996
) is constructed by assembling full-length sequence [from the Expressed Gene and Anatomy Database (EGAD); White and Kervalage 1996
)] and ESTs to form tentative human
consensi (THCs). The THC_BUILD program (G. Sutton, pers. comm.)
constructs index groups according to the following schedule: (1) BLAST
and FASTA (Pearson 1990
) are used to identify all sequence overlaps,
(2) all detected overlaps are stored in a relational database, (3) the
CLUB program forms transitive closure groups from the overlap database,
and (4) groups are subjected to assembly using TIGR assembler
(Sutton et al. 1995
; http://www.tigr.org/hgi/hgi_info.html). The
assembler also imposes matching constraints on the ends of sequences
and a minimum sequence identity within an index group. Most sequence
assembly programs share similar properties with TIGR assembler;
however, it is worth noting that some assemblers, such as the PHRAP
package, incorporate sequence quality data derived from sequence traces
into the assembly process (P. Green, pers. comm.) allowing for the
incorporation of higher error data.
In UniGene (Boguski and Schuler 1995
; Schuler at al. 1996
), genes are
indexed by forming initial groups with full-length sequences (mRNA and
genomic). Then groups are formed within the EST set and between the
ESTs and the initial groups. EST matches are not allowed to join
distinct initial groups and clusters without a polyadenylation signal
or unless at least two 3' ESTs are discarded. Finally, clone
information was used to further join clusters and singleton clusters,
and unmatched ESTs were added to nonsingleton clusters at lower
stringency levels (http://ncbi.nlm.nih.gov/UniGene/TXT/build.html). To
enhance the speed of the clustering, a two-phase searching process was
used in which two sequences were compared with a constrained Smith-Waterman algorithm only if they shared two common words of
length 13 separated by no more than 2 bases. The STACK gene indexing
system that uses d2_cluster is covered in the discussion section of
this paper and in a companion paper.
In this article we describe d2_cluster, an algorithm for clustering sequences into index classes. First, we describe the basic method (which we call D20 to distinguish it from future variants of the algorithm). We then demonstrate the utility of d2_cluster by performing a data analysis of the results of clustering a moderate sized data set (~43,000 sequences). We characterize the relative performance of d2_cluster and UniGene clustering showing that although the behavior of the two algorithms is very similar, there are measurable differences in the rate of merging sequences into clusters. Finally, we derive estimates of the absolute type I and type II error rates (the probability of under or over clustering) for d2_cluster.
Description of the d2_cluster Method (D20)
d2_cluster is an agglomerative clustering method [every sequence
begins in its own cluster, and the final clustering is constructed through a series of mergers (Johnson and Witchern 1994
)]. d2_cluster can be described in terms of minimal linkage clustering (sometimes called single linkage or transitive closure in the sequence analysis literature). The term transitive closure refers to the property that
any two sequences with a given level of similarity will be in the same
cluster; hence, A and B are in the same cluster even if
they share no similarity but there exists a sequence C with
enough similarity to both A and B. The criterion for
joining clusters is the detection of two sequences that share a window
of (Window_Size) bases that is (Stringency) percent or
more identical. The only criterion for clustering is sequence overlap
and source or annotation information is not used. To detect the overlap
criterion, we use the d2 algorithm and set parameters and threshold
values as described in previous work (Torney et al. 1990
; Hide et al.
1994
; Wu et al. 1997
). The initial and final state of the algorithm is
a partition of the input sequences in which each sequence is in a
cluster and no sequence appears in more than one cluster.
For ease of notation, let the following conventions hold:
| 1. | We signify the d2 distance between two sequences, say A and B, as d2(A,B). |
| 2. | Given two clusters, e.g., clusters i and j, the operation
MERGE(cluster i, cluster j), also denoted
MERGE(cluster i cluster j), means that all sequences in
cluster j are assigned to cluster i.
|
| 3. | The database to be clustered contains N sequences that are
numbered 0 through (N 1). Let sequence (i) be denoted
Si or S(i).
|
| 4. | The membership of sequence Si is denoted Ci. |
The notation d2(A,B) is conveniently used, but, of
course, d2(,) is not a function of only A and B but
also of various parameters (specified in Torney et al. 1990
; Hide et
al. 1994
; Wu et al. 1997
). The MERGE operation can be expressed in terms of convention 4 above: For all sequences, Sr, such
that Cr = j, Cr is reset to be Cr = i.
We describe the progression of d2_cluster inductively in that we first detail what happens in the first two iterations (I0 and I1) and then describe how one performs iteration (i + 1) given that iteration (i) has been completed. Technically speaking, it is sufficient to state only the first step and then to give the step (i) to step (i + 1) instructions, but we detail the first two steps for clarity.
|
The clustering is finished when N iterations are completed. Transitive closure is obtained because clusters are joined if they contain any sequences with sufficient identity. d2_cluster, as described above, can be mapped to the minimal linkage algorithm commonly seen in the statistics and engineering texts, and details of this are given in the online supplement to this paper (http://www.genome.org and www.pangeasystems.com).
Data and Error Analysis
Relative Performance of d2_cluster and UniGene
The first step in evaluating d2_cluster performance was to compare it with other methods. The UniGene data set was chosen for comparison due to its high quality and wide acceptance. Because an implementation of the UniGene clustering suite was not available, it was not possible to actually run the UniGene clustering algorithm. Hence, the base sequences from a version of UniGene (Rat section, UniGene build 19, August 1998) were processed with d2_cluster, and the d2_cluster/UniGene groupings were compared. Unless explicitly stated otherwise, all UniGene clusters referenced in this paper come from build 19. Rat UniGene had 43,612 ESTs and full-length mRNA sequences. Because the screening information for UniGene was unavailable, it was necessary to rescreen the data set against repetitive elements and mitochondrial sequence. The cross_match (P. Green, unpubl.) program was used for this purpose, and an "x" screening marker was left over sequence positions matching repetitive elements. Sequences with <100 bases of nonscreened sequence remaining were dropped from the analysis, and the remaining 42,441 sequences were clustered with d2_cluster (at the parameter settings Window_Size = 100, Stringency = 0.9, Min_Seq = 100, and Rev_Comp = 1. At these settings two sequences are placed into the same cluster if they contained a window of size 100 bases with at least 90% identity. Sequences with <100 bases are disregarded, and the reverse strand is searched). Approximately 31 hr were required to complete the cluster analysis on a SUN machine E450 with a 400-MHz processor. As stated above, the clustering of sequence with d2_cluster was made strictly on the basis of sequence similarity, and annotation information was not used. After this processing, every sequence was a member of exactly one d2 cluster and one UniGene cluster. Table 1 compares the cluster size distribution for the UniGene and d2_cluster groupings. d2_cluster produced ~20% fewer singleton sequences (6463 to 5198) and reduced the overall number of clusters by 10% (15,225-13,756). Generally, the numbers of smaller clusters are reduced, whereas larger clusters appear with slightly higher frequency.
|
|
|
|
Estimating the Absolute Error Rate
Although a comparison of d2_clustering with UniGene can compare the relative efficiency of the two clustering methods and can demonstrate the tendency of d2_cluster to partition sets into fewer classes than UniGene, no information about the absolute correctness of either clustering method can be inferred. To gauge absolute upper bounds for the error rates of d2_cluster, we performed two rigorous analyses of the groups formed by d2_cluster to analyze sensitivity and selectivity (or, more precisely, to estimate upper bounds for the actual type I and type II errors). In hypothesis testing the type II error rate is the probability of incorrectly rejecting the null hypothesis when it is true. We define the null hypothesis to be that any two sequences do not belong in the same cluster. In this case the type II error, in the context of sequence clustering, is the probability of placing two sequences into the same cluster by mistake, or, in other words, the probability of over clustering. We say that the presence of a type II error can be discounted in cases in which a single, high-quality assembly can represent the cluster in which overlap satisfies the matching criterion. At this point it must be noted that there are caveats to this error analysis. For example, there are biological reasons, such as gene paralogs, that ESTs that are actually from distinct genes might be perfectly alignable. Some situations, such as alternative splicing, require that more than a single consensus represent the entire cluster. In such cases, type II error can be excluded if the cluster consensi can be shown to contain sufficiently large domains of identity with each other. The Rat EST clusters formed by d2_cluster were aligned and analyzed with CRAW (Burke et al. 1998DISCUSSION
We have characterized d2_cluster and described the algorithm in terms that should be familiar to statisticians, computer scientists, and biologists alike. It has been shown that d2_cluster performs quite favorably to, and is consistent with, current EST clustering methods. In an empirical study based on >40,000 available rat EST and mRNA sequences, d2_cluster produced results that were between 83% and 90% identical with UniGene although d2_cluster created 10% fewer clusters and 20% fewer singleton clusters than UniGene. Three different measures of joining strength are used to compare the overall sensitivity of d2_cluster and UniGene, and these numbers are averaged to provide an estimate that d2_cluster joins sequences at a rate ~13% higher than UniGene. It remains undetermined, however, if this higher join rate results in more accurate index classes or is simply due to the joining of paralogous genes or other phenomena. Additionally, the absolute correctness of groups formed by d2_cluster has been quantified, and the sensitivity and selectivity are shown to be >99% (i.e., type I and type II error rates are bounded above by 1%).
d2_cluster has found application in the STACK project (Hide et al.
1997
) in which ESTs are hierarchically clustered within tissue and
arbitrary source categories. d2_cluster is set to join ESTs that are
>96% identical over a window of 150 bases. More detail on STACK is
given elsewhere (R.T. Miller, A.G. Christoffels, C. Gopalakrishnan, J. Burke, A.A. Ptitsyn, T.R. Broveak, and W.A. Hide, in prep.).
Unfortunately, space limitations prevent elaboration of the many other
tools for sequence clustering that have been developed to cluster DNA
sequence or to remove redundancies from sequence sets (Houlgatte et al.
1995
; Parsons 1995
; Grillo et al. 1996
; Gill et al. 1997
; Eckman et al.
1998
; Yee and Conklin 1998
; Pietu et al. 1999
). Significant research
has also been put into the grouping of protein sequence and the
determination of domain boundaries (Sonnhammer and Kahn 1994
; Worley et
al. 1995
; Sonnhammer et al. 1997
; Gracy and Argos 1998a
,b
).
Sequence clustering is of little consequence in and of itself, and the
prime motivation is to obtain biological knowledge. Effective sequence
clustering is an organizing principle that serves as a starting point
for discovery. For example, with all sequence information corresponding
to a single gene in a cluster, features such as alternative exons and
aberrant splicing, among others, can be modeled with greater ease. It
is difficult to tune the parameters of primary sequence clustering such
that features like splicing differences and even artifacts, such as
chimerism, are accounted for while simultaneously generating proper
index classes. Instead, decoupling this feature detection and artifact correction from the clustering step allows these problems to be handled
in a more robust fashion. Hence, postprocessing steps, such as CRAW,
are used to contrast gene variants and correct for artifact. Figure
4 shows how ESTs and mRNAs from a cluster of human
sequences corresponding to human RAD1/REC1 cell-cycle control checkpoint protein are placed in the same cluster, whereas CRAW is used
to separate distinct splice variants into separate subclusters. In a
similar fashion, CRAW is also used to isolate and correct for chimeric
sequence and other artifacts. Full details and additional examples are
found in previous work (Burke et al. 1998
; Chow and Burke 1999
). Work
has also been done to associate discovered multiple gene forms with
sequence source information to infer state specificity or associate
novel exon/UTR usage with disease (Burke et al. 1998
; Gautheret et al.
1998
). The d20 algorithm and others specified here are available at no
cost for university researchers, and commercial licenses are available
(details available from authors upon request). The commercial versions
of CRAW and CRAWview (www.pangeasystems.com) were used in the
preparation of this manuscript.
|
| |
FOOTNOTES |
|---|
4 Corresponding author. Present address: Pangea Systems, Oakland, California 94612 USA.
E-MAIL jburke{at}pangeasystems.com; FAX (510) 628-0108.
| |
REFERENCES |
|---|
|
|
|---|
Received June 16, 1999; accepted in revised form August 16, 1999.
This article has been cited by other articles:
![]() |
B. Lee, T. Hong, S. J. Byun, T. Woo, and Y. J. Choi ESTpass: a web-based server for processing and annotating expressed sequence tag (EST) sequences Nucleic Acids Res., July 13, 2007; 35(suppl_2): W159 - W162. [Abstract] [Full Text] [PDF] |
||||
![]() |
K.-S. Chow, K.-L. Wan, Mohd. N. M. Isa, A. Bahari, S.-H. Tan, K Harikrishna, and H.-Y. Yeang Insights into rubber biosynthesis from transcriptome analysis of Hevea brasiliensis latex J. Exp. Bot., July 1, 2007; 58(10): 2429 - 2440. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. P.M. Weber, K. L. Weber, K. Carr, C. Wilkerson, and J. B. Ohlrogge Sampling the Arabidopsis Transcriptome with Massively Parallel Pyrosequencing Plant Physiology, May 1, 2007; 144(1): 32 - 42. [Abstract] [Full Text] [PDF] |
||||
![]() |
T. A. Erwin, E. G. Jewell, C. G. Love, G. A. C. Lim, X. Li, R. Chapman, J. Batley, J. E. Stajich, E. Mongin, E. Stupka, et al. BASC: an integrated bioinformatics system for Brassica research Nucleic Acids Res., January 12, 2007; 35(suppl_1): D870 - D873. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. H. Nagaraj, R. B. Gasser, and S. Ranganathan A hitchhiker's guide to expressed sequence tag (EST) analysis Brief Bioinform, January 1, 2007; 8(1): 6 - 21. [Abstract] [Full Text] [PDF] |
||||
![]() |
T. J. Dumonceaux, J. E. Hill, S. M. Hemmingsen, and A. G. Van Kessel Characterization of Intestinal Microbiota and Response to Dietary Virginiamycin Supplementation in the Broiler Chicken Appl. Envir. Microbiol., April 1, 2006; 72(4): 2815 - 2823. [Abstract] [Full Text] [PDF] |
||||
![]() |
T. Feilner, C. Hultschig, J. Lee, S. Meyer, R. G. H. Immink, A. Koenig, A. Possling, H. Seitz, A. Beveridge, D. Scheel, et al. High Throughput Identification of Potential Arabidopsis Mitogen-activated Protein Kinases Substrates Mol. Cell. Proteomics, October 1, 2005; 4(10): 1558 - 1568. [Abstract] [Full Text] [PDF] |
||||
![]() |
C. Yu, M. Dong, X. Wu, S. Li, S. Huang, J. Su, J. Wei, Y. Shen, C. Mou, X. Xie, et al. Genes "Waiting" for Recruitment by the Adaptive Immune System: The Insights from Amphioxus J. Immunol., March 15, 2005; 174(6): 3493 - 3500. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Kang, X. Chen, Y. Zhou, B. Liu, W. Zheng, R. Li, J. Wang, and J. Yu The analysis of large-scale gene expression correlated to the phase changes of the migratory locust PNAS, December 21, 2004; 101(51): 17611 - 17615. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. H. Mecham, G. T. Klus, J. Strovel, M. Augustus, D. Byrne, P. Bozso, D. Z. Wetmore, T. J. Mariani, I. S. Kohane, and Z. Szallasi Sequence-matched probes produce increased cross-platform consistency and more reproducible biological results in microarray-based gene expression measurements Nucleic Acids Res., May 25, 2004; 32(9): e74 - e74. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Kalyanaraman, S. Aluru, S. Kothari, and V. Brendel Efficient clustering of large EST data sets on parallel computers Nucleic Acids Res., June 1, 2003; 31(11): 2963 - 2974. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Batley, G. Barker, H. O'Sullivan, K. J. Edwards, and D. Edwards Mining for Single Nucleotide Polymorphisms and Insertions/Deletions in Maize Expressed Sequence Tag Data Plant Physiology, May 1, 2003; 132(1): 84 - 91. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. A. Lippert, H. Huang, and M. S. Waterman Inaugural Article: Distributional regimes for the number of k-word matches between two random sequences PNAS, October 29, 2002; 99(22): 13980 - 13989. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Osato, M. Itoh, H. Konno, S. Kondo, K. Shibata, P. Carninci, T. Shiraki, A. Shinagawa, T. Arakawa, S. Kikuchi, et al. A Computer-Based Method of Selecting Clones for a Full-Length cDNA Project: Simultaneous Collection of Negligibly Redundant and Variant cDNAs Genome Res., July 1, 2002; 12(7): 1127 - 1134. [Abstract] [Full Text] [PDF] |
||||
![]() |
W. Zhang, P. M. Laborde, K. R. Coombes, D. A. Berry, and S. R. Hamilton Cancer Genomics: Promises and Complexities Clin. Cancer Res., August 1, 2001; 7(8): 2159 - 2167. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. Christoffels, A. v. Gelder, G. Greyling, R. Miller, T. Hide, and W. Hide STACK: Sequence Tag Alignment and Consensus Knowledgebase Nucleic Acids Res., January 1, 2001; 29(1): 234 - 238. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. Konno, Y. Fukunishi, K. Shibata, M. Itoh, P. Carninci, Y. Sugahara, and Y. Hayashizaki Computer-Based Methods for the Mouse Full-Length cDNA Encyclopedia: Real-Time Sequence Clustering for Construction of a Nonredundant cDNA Library Genome Res., February 1, 2001; 11(2): 281 - 289. [Abstract] [Full Text] |
||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||