Genome Res. 6:829-845, 1996
©1996 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051
Toward the development of a gene index to the human genome: an assessment of the nature of high-throughput EST sequence data.
J S Aaronson, B Eckman, R A Blevins, J A Borkowski, J Myerson, S Imran, and K O Elliston
Merck Research Laboratories, Department of Bioinformatics, Rahway, New Jersey 07065, USA. aaronson@merck.com
Abstract
A rigorous analysis of the Merck-sponsored EST data with respect to known gene sequences increases the utility of the data set and helps refine methods for building a gene index. A highly curated human transcript data base was used as a reference data set of known genes. A detailed analysis of EST sequences derived from known genes was performed to assess the accuracy of EST sequence annotation. The EST data was screened to remove low-quality and low-complexity sequences. A set of high-quality ESTs similar to the transcript data base was identified using BLAST; this subset of ESTs was compared with the set of known genes using the Smith-Waterman algorithm. Error rates of several types were assessed based on a flexible match criterion defining sequence identity. The rate of lane-tracking errors is very low, approximately 0.5%. Insert size data is accurate within approximately 20%. Reversed clone and internal priming error rates are approximately 5% and 2.5%, respectively, contributing to the incorrect identification of reads as 3' ends of genes. Follow-up investigation reveals that a significant number of clones, miscategorized as reversed, represent overlapping genes on the opposite strand of entries in the transcript data base. Relevance of these results to the creation of a high-quality index to the human genome capable of supporting diverse genomic investigations is discussed.
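The comparison step described in the abstract (Smith-Waterman alignment of a BLAST-selected EST against a known transcript, followed by a match criterion on sequence identity) can be illustrated with a minimal sketch. The scoring parameters, the 95% identity threshold, the 20-base minimum length, and the function names `smith_waterman` and `is_match` are all assumptions made for illustration; the abstract does not state the values or the implementation the authors used.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment of a and b with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). Scoring values are illustrative
    assumptions, not the parameters used in the paper.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    # Fill the score matrix; cells never drop below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append('-')
            i -= 1
        else:
            out_a.append('-'); out_b.append(b[j - 1])
            j -= 1
    return best, ''.join(reversed(out_a)), ''.join(reversed(out_b))


def is_match(est, transcript, min_identity=0.95, min_length=20):
    """Hypothetical match criterion: enough aligned columns, enough identity."""
    _, al_a, al_b = smith_waterman(est, transcript)
    if len(al_a) < min_length:
        return False
    identical = sum(x == y for x, y in zip(al_a, al_b))
    return identical / len(al_a) >= min_identity
```

Because the alignment is quadratic in sequence length, a sketch like this is only practical after a fast prefilter such as BLAST has narrowed the candidates, which is the order of operations the abstract describes.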

This article has been cited by other articles:

S. H. Nagaraj, R. B. Gasser, and S. Ranganathan. A hitchhiker's guide to expressed sequence tag (EST) analysis. Brief Bioinform, January 1, 2007; 8(1): 6-21.
R. Sorek and H. M. Safer. A novel algorithm for computational identification of contaminated EST libraries. Nucleic Acids Res., February 1, 2003; 31(3): 1067-1074.
C. Iseli, B. J. Stevenson, S. J. de Souza, H. B. Samaia, A. A. Camargo, K. H. Buetow, R. L. Strausberg, A. J.G. Simpson, P. Bucher, and C. V. Jongeneel. Long-Range Heterogeneity at the 3' Ends of Human mRNAs. Genome Res., July 1, 2002; 12(7): 1068-1074.
D. K. Nam, S. Lee, G. Zhou, X. Cao, C. Wang, T. Clark, J. Chen, J. D. Rowley, and S. M. Wang. Oligo(dT) primer generates a high frequency of truncated cDNAs through internal poly(A) priming during reverse transcription. PNAS, April 30, 2002; 99(9): 6152-6156.
C. Gemund, C. Ramu, B. Altenberg-Greulich, and T. J. Gibson. Gene2EST: a BLAST2 server for searching expressed sequence tag (EST) databases with eukaryotic gene-sized queries. Nucleic Acids Res., March 15, 2001; 29(6): 1272-1277.
O. Ohara and G. Temple. Directional cDNA library construction assisted by the in vitro recombination reaction. Nucleic Acids Res., February 15, 2001; 29(4): e22.
J. Muilu, P. Rodriguez-Tomé, and A. Robinson. GBuilder---An Application for the Visualization and Integration of EST Cluster Data. Genome Res., January 1, 2001; 11(1): 179-184.
L. Huminiecki and R. Bicknell. In Silico Cloning of Novel Endothelial-Specific Genes. Genome Res., November 1, 2000; 10(11): 1796-1806.
S. Kawamoto, J. Yoshii, K. Mizuno, K. Ito, Y. Miyamoto, T. Ohnishi, R. Matoba, N. Hori, Y. Matsumoto, T. Okumura, et al. BodyMap: A Collection of 3' ESTs for Analysis of Human Gene Expression Information. Genome Res., November 1, 2000; 10(11): 1817-1827.
M. Hirosawa, K.-i. Ishikawa, T. Nagase, and O. Ohara. Detection of Spurious Interruptions of Protein-Coding Regions in Cloned cDNA Sequences by GeneMark Analysis. Genome Res., September 1, 2000; 10(9): 1333-1341.
J. Stollberg, J. Urschitz, Z. Urban, and C. D. Boyd. A Quantitative Evaluation of SAGE. Genome Res., August 1, 2000; 10(8): 1241-1248.
A. E. Lash, C. M. Tolstoshev, L. Wagner, G. D. Schuler, R. L. Strausberg, G. J. Riggins, and S. F. Altschul. SAGEmap: A Public Gene Expression Resource. Genome Res., July 1, 2000; 10(7): 1051-1060.
C.-H. Lai, C.-Y. Chou, L.-Y. Ch'ang, C.-S. Liu, and W.-c. Lin. Identification of Novel Human Genes Evolutionarily Conserved in Caenorhabditis elegans by Comparative Proteomics. Genome Res., May 1, 2000; 10(5): 703-713.
E. Dias Neto, R. Garcia Correa, S. Verjovski-Almeida, M. R. S. Briones, M. A. Nagai, W. da Silva Jr., M. A. Zago, S. Bordin, F. F. Costa, G. H. Goldman, et al. Shotgun sequencing of the human transcriptome with ORF expressed sequence tags. PNAS, March 28, 2000; 97(7): 3491-3496.
R. T. Miller, A. G. Christoffels, C. Gopalakrishnan, J. Burke, A. A. Ptitsyn, T. R. Broveak, and W. A. Hide. A Comprehensive Approach to Clustering of Expressed Human Gene Sequence: The Sequence Tag Alignment and Consensus Knowledge Base. Genome Res., November 1, 1999; 9(11): 1143-1155.
R. M. Ewing, A. B. Kahla, O. Poirot, F. Lopez, S. Audic, and J.-M. Claverie. Large-Scale Statistical Analyses of Rice ESTs Reveal Correlated Patterns of Gene Expression. Genome Res., October 1, 1999; 9(10): 950-959.
J.-M. Claverie. Computational methods for the identification of differential and coordinated gene expression. Hum. Mol. Genet., September 1, 1999; 8(10): 1821-1832.
S. Shintani, C. O'hUigin, S. Toyosawa, V. Michalová, and J. Klein. Origin of Gene Overlap: The Case of TCP1 and ACAT2. Genetics, June 1, 1999; 152(2): 743-754.
D. Gautheret, O. Poirot, F. Lopez, S. Audic, and J.-M. Claverie. Alternate Polyadenylation in Human mRNAs: A Large-Scale Analysis by EST Clustering. Genome Res., May 1, 1998; 8(5): 524-530.
L. C. Bailey Jr., D. B. Searls, and G. C. Overton. Analysis of EST-Driven Gene Annotation in Human Genomic Sequence. Genome Res., April 1, 1998; 8(4): 362-376.
J. Jiang and H. J. Jacob. EbEST: An Automated Tool Using Expressed Sequence Tags to Delineate Gene Structure. Genome Res., March 1, 1998; 8(3): 268-275.
J. Burke, H. Wang, W. Hide, and D. B. Davison. Alternative Gene Form Discovery and Candidate Gene Selection from Gene Indexing Projects. Genome Res., March 1, 1998; 8(3): 276-290.
S. Audic and J.-M. Claverie. The Significance of Digital Gene Expression Profiles. Genome Res., October 1, 1997; 7(10): 986-995.
G. Miller, R. Fuchs, and E. Lai. IMAGE cDNA Clones, UniGene Clustering, and ACeDB: An Integrated Resource for Expressed Sequence Information. Genome Res., October 1, 1997; 7(10): 1027-1032.
D. A. Ruddy, G. S. Kronmal, V. K. Lee, G. A. Mintier, L. Quintana, R. Domingo Jr., N. C. Meyer, A. Irrinki, E. E. McClelland, A. Fullan, et al. A 1.1-Mb Transcript Map of the Hereditary Hemochromatosis Locus. Genome Res., May 1, 1997; 7(5): 441-456.
D. Zhuo, W. D. Zhao, F. A. Wright, H.-Y. Yang, J.-P. Wang, R. Sears, T. Baer, D.-H. Kwon, D. Gordon, S. Gibbs, et al. Assembly, Annotation, and Integration of UNIGENE Clusters into the Human Genome Draft. Genome Res., May 1, 2001; 11(5): 904-918.