Genome Research


Vol. 9, Issue 11, 1135-1142, November 1999

METHODS
d2_cluster: A Validated Method for Clustering EST and Full-Length cDNA Sequences

John Burke,1,4 Dan Davison,2 and Winston Hide3

1 Pangea Systems, Oakland, California 94612 USA; 2 Bioinformatics Department, Bristol-Myers Squibb Pharmaceutical Research Institute, Wallingford, Connecticut 06492-7660 USA; 3 South African National Bioinformatics Institute, Bellville 7535, University of the Western Cape, South Africa

Several efforts are under way to condense single-read expressed sequence tags (ESTs) and full-length transcript data on a large scale by means of clustering or assembly. One goal of these projects is the construction of gene indices in which transcripts are partitioned into index classes (or clusters) such that two transcripts are placed in the same index class if and only if they represent the same gene. Accurate gene indexing facilitates gene expression studies and enables inexpensive, early discovery of partial gene sequences through the assembly of ESTs derived from genes that have yet to be positionally cloned or obtained directly through genomic sequencing. We describe d2_cluster, an agglomerative algorithm for rapidly and accurately partitioning transcript databases into index classes by clustering sequences according to minimal linkage or "transitive closure" rules. We then evaluate the relative efficiency of d2_cluster with respect to other clustering tools. UniGene is chosen for comparison because of its high quality and wide acceptance. Although d2_cluster and UniGene produce results that are between 83% and 90% identical, the joining rate of d2_cluster is between 8% and 20% greater than that of UniGene. Finally, we present the first published rigorous evaluation of under- and over-clustering (in other words, of type I and type II errors) of a sequence clustering algorithm, although the existence of highly identical gene paralogs means that care must be taken in interpreting the type II error. Upper bounds for these d2_cluster error rates are estimated at 0.4% and 0.8%, respectively. In other words, the sensitivity and selectivity of d2_cluster are estimated to be >99.6% and >99.2%, respectively.
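
The minimal linkage ("transitive closure") rule named in the abstract can be illustrated with a small sketch. It is not the d2_cluster implementation: the `shared_kmer` test is a hypothetical stand-in for the real sequence comparison, the word length `k` is an arbitrary choice, and the all-pairs loop is used only for clarity. The point shown is that two sequences end up in the same cluster whenever a chain of pairwise links connects them, even if they are not directly linked to each other.

```python
from itertools import combinations


def shared_kmer(a, b, k=4):
    """Hypothetical link test (an assumption, not the d2 measure):
    True if sequences a and b share at least one word of length k."""
    kmers = {a[i:i + k] for i in range(len(a) - k + 1)}
    return any(b[i:i + k] in kmers for i in range(len(b) - k + 1))


def transitive_closure_clusters(seqs, linked=shared_kmer):
    """Partition sequence indices by single (minimal) linkage.

    Uses a union-find structure: every linked pair is merged, so the
    final clusters are the connected components of the link graph.
    """
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent pointers to the root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Naive all-pairs comparison, quadratic in the number of sequences.
    for i, j in combinations(range(len(seqs)), 2):
        if linked(seqs[i], seqs[j]):
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    toy = ["AAAACCCC", "CCCCGGGG", "GGGGTTTT", "ATATATAT"]
    # Sequences 0 and 2 share no word, but both link to sequence 1.
    print(transitive_closure_clusters(toy))
```

In the toy input, sequences 0 and 2 are joined only through sequence 1, which is the behaviour that distinguishes transitive closure from rules requiring every pair in a cluster to match directly.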

[Supplementary material to this paper may be found online at www.genome.org and at www.pangeasystems.com.]


4 Corresponding author. Present address: Pangea Systems, Oakland, California 94612 USA.


9:1135-1142 ©1999 by Cold Spring Harbor Laboratory Press  ISSN 1088-9051/99 $5.00



This article has been cited by other articles:


B. Lee, T. Hong, S. J. Byun, T. Woo, and Y. J. Choi
ESTpass: a web-based server for processing and annotating expressed sequence tag (EST) sequences
Nucleic Acids Res., July 13, 2007; 35(suppl_2): W159 - W162.

K.-S. Chow, K.-L. Wan, Mohd. N. M. Isa, A. Bahari, S.-H. Tan, K Harikrishna, and H.-Y. Yeang
Insights into rubber biosynthesis from transcriptome analysis of Hevea brasiliensis latex
J. Exp. Bot., July 1, 2007; 58(10): 2429 - 2440.

A. P.M. Weber, K. L. Weber, K. Carr, C. Wilkerson, and J. B. Ohlrogge
Sampling the Arabidopsis Transcriptome with Massively Parallel Pyrosequencing
Plant Physiology, May 1, 2007; 144(1): 32 - 42.

T. A. Erwin, E. G. Jewell, C. G. Love, G. A. C. Lim, X. Li, R. Chapman, J. Batley, J. E. Stajich, E. Mongin, E. Stupka, et al.
BASC: an integrated bioinformatics system for Brassica research
Nucleic Acids Res., January 12, 2007; 35(suppl_1): D870 - D873.

S. H. Nagaraj, R. B. Gasser, and S. Ranganathan
A hitchhiker's guide to expressed sequence tag (EST) analysis
Brief Bioinform, January 1, 2007; 8(1): 6 - 21.

T. J. Dumonceaux, J. E. Hill, S. M. Hemmingsen, and A. G. Van Kessel
Characterization of Intestinal Microbiota and Response to Dietary Virginiamycin Supplementation in the Broiler Chicken
Appl. Envir. Microbiol., April 1, 2006; 72(4): 2815 - 2823.

T. Feilner, C. Hultschig, J. Lee, S. Meyer, R. G. H. Immink, A. Koenig, A. Possling, H. Seitz, A. Beveridge, D. Scheel, et al.
High Throughput Identification of Potential Arabidopsis Mitogen-activated Protein Kinases Substrates
Mol. Cell. Proteomics, October 1, 2005; 4(10): 1558 - 1568.

C. Yu, M. Dong, X. Wu, S. Li, S. Huang, J. Su, J. Wei, Y. Shen, C. Mou, X. Xie, et al.
Genes "Waiting" for Recruitment by the Adaptive Immune System: The Insights from Amphioxus
J. Immunol., March 15, 2005; 174(6): 3493 - 3500.

L. Kang, X. Chen, Y. Zhou, B. Liu, W. Zheng, R. Li, J. Wang, and J. Yu
The analysis of large-scale gene expression correlated to the phase changes of the migratory locust
PNAS, December 21, 2004; 101(51): 17611 - 17615.

B. H. Mecham, G. T. Klus, J. Strovel, M. Augustus, D. Byrne, P. Bozso, D. Z. Wetmore, T. J. Mariani, I. S. Kohane, and Z. Szallasi
Sequence-matched probes produce increased cross-platform consistency and more reproducible biological results in microarray-based gene expression measurements
Nucleic Acids Res., May 25, 2004; 32(9): e74 - e74.

A. Kalyanaraman, S. Aluru, S. Kothari, and V. Brendel
Efficient clustering of large EST data sets on parallel computers
Nucleic Acids Res., June 1, 2003; 31(11): 2963 - 2974.

J. Batley, G. Barker, H. O'Sullivan, K. J. Edwards, and D. Edwards
Mining for Single Nucleotide Polymorphisms and Insertions/Deletions in Maize Expressed Sequence Tag Data
Plant Physiology, May 1, 2003; 132(1): 84 - 91.

R. A. Lippert, H. Huang, and M. S. Waterman
Inaugural Article: Distributional regimes for the number of k-word matches between two random sequences
PNAS, October 29, 2002; 99(22): 13980 - 13989.

N. Osato, M. Itoh, H. Konno, S. Kondo, K. Shibata, P. Carninci, T. Shiraki, A. Shinagawa, T. Arakawa, S. Kikuchi, et al.
A Computer-Based Method of Selecting Clones for a Full-Length cDNA Project: Simultaneous Collection of Negligibly Redundant and Variant cDNAs
Genome Res., July 1, 2002; 12(7): 1127 - 1134.

W. Zhang, P. M. Laborde, K. R. Coombes, D. A. Berry, and S. R. Hamilton
Cancer Genomics: Promises and Complexities
Clin. Cancer Res., August 1, 2001; 7(8): 2159 - 2167.

A. Christoffels, A. v. Gelder, G. Greyling, R. Miller, T. Hide, and W. Hide
STACK: Sequence Tag Alignment and Consensus Knowledgebase
Nucleic Acids Res., January 1, 2001; 29(1): 234 - 238.

H. Konno, Y. Fukunishi, K. Shibata, M. Itoh, P. Carninci, Y. Sugahara, and Y. Hayashizaki
Computer-Based Methods for the Mouse Full-Length cDNA Encyclopedia: Real-Time Sequence Clustering for Construction of a Nonredundant cDNA Library
Genome Res., February 1, 2001; 11(2): 281 - 289.



