Published online before print June 18, 2002, 10.1101/gr.75202.
Vol. 12, Issue 7, 1127-1134, July 2002
METHODS
ABSTRACT
We describe a computer-based method that selects representative
clones for full-length sequencing in a full-length cDNA project. Our
method classifies end sequences using two kinds of criteria, grouping
and clustering. Grouping places together variant cDNAs, family genes,
and cDNAs with sequencing errors. Clustering separates those cDNA
clones into distinct clusters. The full-length sequences of the clones
selected by grouping are determined preferentially, and then the
sequences selected by clustering are determined. Grouping reduced the
number of rice cDNA clones for full-length sequencing to 21% and mouse
cDNA clones to 25%. Rice full-length sequences selected by grouping
showed a 1.07-fold redundancy. Mouse full-length sequences showed a
1.04-fold redundancy, a redundancy reduced by ~30% relative to
selection using our previous method. To estimate the coverage of unique
genes, we used FANTOM (Functional Annotation of RIKEN Mouse cDNA
Clones) clusters (the RIKEN Genome Exploration Research Group 2001).
Grouping covered almost all unique genes (93% of FANTOM clusters), and
clustering covered all genes. Therefore, our method is useful for the
selection of appropriate representative clones for full-length
sequencing, thereby greatly reducing the cost, labor, and time
necessary for this process.
[The programs used in this paper are available online at http://genome.gsc.riken.go.jp/software/2C.]
INTRODUCTION
Full-length cDNA projects attempt to collect all mRNAs transcribed
from the genome and determine the full-length cDNA
sequences in their entirety. Considerable effort has been expended to
improve the techniques involved. At the completion of mouse and human full-length cDNA projects, 30,000 to 100,000 full-length cDNA sequences
will have been determined. The determination of full-length sequences
requires large amounts of reagents, manpower, and time, which can be
reduced by removing redundant full-length cDNA clones that potentially
number in the thousands. As an experimental approach, mRNAs transcribed
at high levels can be removed by normalization and subtraction during
the construction of cDNA libraries (Carninci et al. 1996, 1997, 1998, 2000; Carninci and Hayashizaki 1999). However, because experimental
normalization and subtraction are very difficult to apply to similar
clones belonging to the same gene family, redundancies remain in the
normalized cDNA library. Another possibility is to classify the end
sequences computationally after end sequencing of the cDNAs and thereby
identify negligibly redundant clones.
In general, when several full-length cDNA sequences are determined,
redundancy can be reduced using homology search software by aligning
the full-length cDNA sequences with their end sequences. Genomic
sequences are also useful for reducing the redundancy. However, early
in a project or when genome sequences are unavailable, the end
sequences need to be classified. Several programs for clustering
single-read expressed sequence tags (ESTs) have been reported (Adams et al. 1995; Boguski and Schuler 1995; Sutton et al. 1995; Schuler et al. 1996, 1997; Burke et al. 1998, 1999; Miller et al. 1999; Parsons and Rodriguez-Tome 2000; Haas et al. 2000; Christoffels et al. 2001; Quackenbush et al. 2001). However, the results of classification
vary slightly depending on the program used (Bouck et al. 1999). Table
1 shows the characteristics of several gene
indexing databases and the databases used for our full-length cDNA
projects.
In our mouse full-length cDNA project
(http://genome.gsc.riken.go.jp), we classified 1,100,000 end
sequences using BLAST homology search software (Pearson
and Lipman 1988). We focused on the 100-bp 3' end sequences of each
cDNA sequence and placed into the same cluster those end sequences that
were 90% identical over a 90-bp region (Konno et al. 2001). The
accuracy and average read-length of the sequences were not consistent
at the beginning of the project but have improved as the project has
progressed. However, the criteria of classification were determined in
light of the accuracy and average read-length of the early stages of the project and have been applied without further modification.
After determining the full-length sequences of the mouse cDNAs, we analyzed the redundancy of 21,076 mouse full-length sequences selected by the previous classification method. The analysis showed a 1.35-fold redundancy in the sequences. The method selects a single representative clone from each cluster, rearrays it from the master plate onto another plate, and does the sequencing in full. When we determine the full-length sequences of additional clones from the same cluster to analyze variant mRNAs and various members of the same gene family, we have to return to the master plates. However, this step requires much time and the enormous number of master plates increases the possibility of error.
To address these problems, we developed a two-step classification method that applies two distinct criteria for classifying end sequences: grouping and clustering. Grouping places together cDNAs that have similar sequences because they are derived from the same gene family, as well as cDNAs whose differences result from sequencing errors. Clustering segregates variant clones into different clusters. We designed the classification procedure for each process and adjusted the parameters of the homology searches by using the 213,404 3' end sequences of mouse cDNAs determined in our laboratory. We selected the representative clones for full-length sequencing from each group and cluster in light of the results of classifying the end sequences. To ensure the effectiveness of the classification, we calculated the redundancy of the representative clones selected by grouping and estimated the coverage of unique genes. Although grouping does not cover every member of a gene family, clustering can cover family genes derived from several loci on the mouse genome (ftp://ftp.sanger.ac.uk/pub/image/tmp/ssahaAssemble/mouse).
In 2000, we applied our method to the rice full-length cDNA project (S. Kikuchi, K. Satoh, T. Nagata, N. Kawagashira, K. Doi, N. Kishimoto, J. Yazaki, M. Ishikawa, K. Kojima, T. Namiki et al. in prep.) and classified both end sequences of cDNAs. During the project, we plotted a graph showing the increase in the number of novel groups and clusters in proportion to the increasing number of end sequences. We describe these efforts here.
RESULTS
Two-Step Classification
We established grouping and clustering criteria for the classification of end sequences to collect minimally redundant and variant clones in parallel (Fig. 1). Overlap length, percentage identity, direction of strand, and aligned positions extracted from the results of BLAST searches were used, but not other information such as annotations.
For the grouping, the entire set of end sequences was searched, and
transcripts of various lengths but containing a similar region were
placed together. To further consolidate similar sequences into a single
group, those with a common clone were merged, resulting in each clone
belonging to only one group. The grouping procedure is as follows:
1. Vector and poly(A) sequences are eliminated by computational analysis (Konno et al. 2001).
2. Repeat sequences in end sequences are masked using RepeatMasker software (A. Smit, unpubl.; http://www.genome.washington.edu/uwgc/analysistools/repeatmask.htm).
3. An entire end sequence is searched against all other end sequences using BLAST homology search software.
4. Clones containing cDNAs that are 90% identical over 80 bp are placed into the same group in light of the pairwise alignments constructed in Step 3.
5. Steps 3 and 4 are repeated for all remaining end sequences.
6. Groups with a clone in common are merged so that each clone appears in only one group.
7. End sequences in each group are aligned using FASTA homology search software (Pearson and Lipman 1988).
8. The group information is stored in a relational database.
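The merging in Steps 4 to 6 can be illustrated with a small union-find sketch. This is not the program distributed with the paper; the hit tuple layout `(query, subject, identity, overlap)` and the function name `group_clones` are hypothetical, and treating the 90% and 80-bp values as lower limits is an assumption based on the Methods section.

```python
from collections import defaultdict

def group_clones(clones, hits, min_identity=90.0, min_overlap=80):
    """Merge clones linked by a qualifying pairwise hit so that each
    clone ends up in exactly one group (single-linkage merging)."""
    parent = {clone: clone for clone in clones}

    def find(clone):
        # Follow parent links to the group root, shortening the path as we go.
        while parent[clone] != clone:
            parent[clone] = parent[parent[clone]]
            clone = parent[clone]
        return clone

    for query, subject, identity, overlap in hits:
        # Assumed reading of the criteria: both values are lower limits.
        if identity >= min_identity and overlap >= min_overlap:
            root_q, root_s = find(query), find(subject)
            if root_q != root_s:
                parent[root_s] = root_q  # groups sharing a clone are merged

    groups = defaultdict(list)
    for clone in clones:
        groups[find(clone)].append(clone)
    return sorted(sorted(members) for members in groups.values())

# Hypothetical pairwise hits: (query, subject, percent identity, overlap in bp)
example_hits = [
    ("A", "B", 95.0, 120),
    ("B", "C", 91.0, 85),
    ("C", "D", 99.0, 40),   # overlap too short
    ("D", "E", 85.0, 200),  # identity too low
]
print(group_clones(["A", "B", "C", "D", "E"], example_hits))
```

In this toy input, clones A, B, and C fall into one group because A matches B and B matches C, whereas D and E remain in single-member groups.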
In clustering, repeat sequences were not masked, because masked
regions are not regarded as matched regions and therefore cannot satisfy the stringent clustering criteria. Previously unclassified sequences that satisfied the clustering criteria were selected and
allocated to a new cluster. This approach avoids consolidating similar sequences, as happens in
grouping, and prevents the same clone from appearing in more than one
cluster. In summary, the clustering procedure is as follows:
We processed the grouping and clustering of about 100,000 sequences in 2 d using a Compaq AlphaServer ES40 (four 500-MHz Alpha 21264 processors) with 6534 MB memory. Grouping and clustering can be performed faster in parallel, especially when the analyses involve more than 100,000 sequences.
Reduction of the Number of Clones for Full-Length Sequencing
We used our two-step classification method for the rice full-length
cDNA project (S. Kikuchi, K. Satoh, T. Nagata, N. Kawagashira, K. Doi,
N. Kishimoto, J. Yazaki, M. Ishikawa, K. Kojima, T. Namiki et al. in
prep.) in 2000. We used the results of classification of 5'
as well as 3' end sequences to determine the population of
representative clones. When clones could be classified into several
groups according to the 5' end of the clones, we selected representative clones from all separated groups of the 5' end. After
the grouping and clustering of 97,808 cDNA clones, the number of
representative clones for full-length sequencing decreased to 20,806 (21%) by grouping and to 62,888 (64%) by clustering. Here, we also
classified 213,404 mouse 3' end sequences by grouping and clustering
and calculated the numbers of representative clones. The number for
full-length sequencing decreased to 54,371 (25%) by grouping and to
194,977 (91%) by clustering. The clustering of rice end sequences
eliminated 36% of the total cDNA clones, whereas the clustering of
mouse end sequences eliminated only 9%. Because mouse end sequences
used in this study were determined several years ago, they contain
sequences with low accuracy (<98%), and it is difficult to judge
whether these sequences were derived from sequencing errors or gene
variants. Recently, sequencing accuracy has improved to 98%, so that
in the case of the rice full-length cDNA project we could distinguish
between sequencing errors and gene variants using clustering criteria.
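The percentages quoted above follow directly from the clone counts. The short calculation below reproduces them; only the counts come from the text, and the function and dictionary names are illustrative.

```python
def percent_remaining(selected, total):
    """Percentage of clones that remain as representatives after classification."""
    return 100.0 * selected / total

# (representative clones, total clones), as reported in the text
counts = {
    "rice, grouping": (20806, 97808),
    "rice, clustering": (62888, 97808),
    "mouse, grouping": (54371, 213404),
    "mouse, clustering": (194977, 213404),
}

for label, (selected, total) in counts.items():
    remaining = percent_remaining(selected, total)
    print(f"{label}: {remaining:.1f}% remaining, {100.0 - remaining:.1f}% eliminated")
```

Rounded to whole percentages, the results agree with the 21%, 64%, 25%, and 91% given above, and with the 36% and 9% of clones eliminated by clustering.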
The 7802 rice cDNA clones selected on the basis of the grouping results
were sequenced completely, and there was a 1.07-fold redundancy in the
full-length sequences according to the Smith-Waterman algorithm, with
the restriction of 90% base identity over 80% overall length. When
both end sequences were used, the cDNA inserts of the clones selected
as representatives were longer. On the other hand, the mouse
full-length cDNA sequences of FANTOM clones (the RIKEN Genome
Exploration Research Group 2001; Bono et al. 2002) selected by grouping
showed a 1.04-fold redundancy according to the Smith-Waterman algorithm
with the same restriction as for the rice full-length sequences. When we
aligned the 4% redundant full-length sequences with their end
sequences, we found that none of the redundant clones could be removed
from the collection of representative clones by using only the
information in the end sequences because of differences in the
identity, overall length, and overlap length between the end sequences
and full-length sequences (data not shown). In comparison, the
full-length sequences of the FANTOM clones selected by using the
clustering method previously used in the mouse full-length cDNA project
(Konno et al. 2001) showed a 1.35-fold redundancy according to the
Smith-Waterman algorithm with the same restriction. Approximately 30%
of these redundant clones could be removed from the collection of
representative clones for full-length sequencing using our grouping process.
Coverage of Genes by the Two-Step Classification
Because the criteria for selection by grouping are looser than those for clustering, some unique genes may not be selected as representative clones after grouping. Therefore we used the FANTOM clusters to evaluate the coverage of genes by grouping. The representative clones selected by grouping accounted for 93% (13,359 clusters) of the FANTOM clusters, whereas those selected by clustering covered 98% (14,084 clusters). However, the functional annotations and cDNA sequences of the remaining 2% of the FANTOM clusters were the same as those of the already covered clusters (98%), so that clustering covered all the FANTOM clusters. Therefore, when the full-length sequences of the representative clones selected by grouping are determined predominantly in a full-length cDNA project, 93% of the unique genes can be collected with less redundancy. In addition, the remaining unique genes will already be rearrayed to allow easy selection and determination of their full-length sequences without the need to handle a large number of master plates. This combination of grouping and clustering enabled us to efficiently and thoroughly collect unique genes.
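As a sketch of how such coverage can be computed, the function below counts the reference clusters that contain at least one selected representative clone. The data structures and names are hypothetical and the example uses toy identifiers, not FANTOM data.

```python
def cluster_coverage(reference_clusters, selected_clones):
    """Fraction of reference clusters that contain at least one selected clone.

    reference_clusters: dict mapping a cluster ID to the clone IDs it contains.
    selected_clones: iterable of clone IDs chosen as representatives.
    """
    selected = set(selected_clones)
    covered = sum(
        1 for members in reference_clusters.values() if selected & set(members)
    )
    return covered / len(reference_clusters)

# Toy reference clusters (hypothetical identifiers)
reference = {
    "cluster1": ["a", "b"],
    "cluster2": ["c"],
    "cluster3": ["d", "e"],
    "cluster4": ["f"],
}
print(cluster_coverage(reference, ["a", "c", "e"]))
```

Here three of the four toy clusters contain a selected clone, so the coverage is 0.75; the 93% and 98% figures above are the same ratio computed on the FANTOM clusters.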
Clustering separated variant clones into distinct clusters. Figure
2 shows examples of family genes that were
placed into a single group but separated into several clusters. Figure
2A presents the situation of family member genes containing similar and
variable regions. This group includes four clones for histone 4 protein
(AK016310, AK007642, AK011560, and AK010085) that were separated
into four clusters. These full-length cDNA sequences matched other loci
on the mouse draft genome
(ftp://ftp.sanger.ac.uk/pub/image/tmp/ssahaAssemble/mouse). Grouping
led to the selection of a clone as the representative for full-length
sequencing from the family genes, but clustering covered the remaining
genes as representative clones. Figure 2B presents the situation of
family member genes that contain different polyadenylation sites. This
group includes four clones for acidic ribosomal phosphoprotein P0
(AK002315, AK009767, AK010267, and AK012606) that were separated into
three clusters. These full-length cDNA sequences were similar to each
other, but the lengths of their 5' and 3' ends differed; differences in
the length of the 3' end can be caused by differential poly(A) sites
(Gautheret et al. 1998), internal priming, or artifacts incurred during
the construction of the cDNA library. These clones matched one or two
loci on the mouse draft genome.
Number of Groups and Clusters Formed from a cDNA Library
In a full-length cDNA project, it is necessary to estimate the number of clones with unique cDNA inserts that can be derived from a library to avoid redundant full-length sequencing. This can be effectively performed with a graph showing the increase in the number of novel groups and clusters in proportion to the increase in the number of end sequences (Fig. 3). As the number of sequences increases, the cumulative numbers of groups and clusters increase. After 10,000 sequences had been determined, the addition of 10,000 new sequences yielded about 2699 novel groups. After 50,000 sequences had been determined, the addition of 10,000 new sequences yielded only about 1331 novel groups. However, the numbers of groups and clusters had not reached saturation under these circumstances. From the graph in Figure 3, we can use the increase in the number of novel groups to determine whether we can collect almost all cDNA clones in a cDNA library.
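The counting behind such a graph can be sketched as follows. The sketch assumes that every end sequence already carries a final group label, whereas in a running project groups may still merge as sequences are added; the function name and the toy labels are hypothetical.

```python
def novel_groups_per_batch(group_ids, batch_size):
    """Count, for each successive batch of end sequences, how many groups
    appear for the first time.

    group_ids: group label of each end sequence, in the order sequenced.
    """
    seen = set()
    novel_counts = []
    for start in range(0, len(group_ids), batch_size):
        batch = group_ids[start:start + batch_size]
        new_groups = set(batch) - seen
        novel_counts.append(len(new_groups))
        seen |= new_groups
    return novel_counts

# Toy labels for 12 end sequences, counted in batches of 4
labels = [1, 2, 3, 1, 2, 4, 4, 1, 5, 2, 1, 1]
print(novel_groups_per_batch(labels, 4))
```

A falling count from batch to batch, as in the drop from about 2699 to about 1331 novel groups per 10,000 sequences, indicates that the library is approaching, but has not yet reached, saturation.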
DISCUSSION
A full-length cDNA library includes similar sequences that arise from genes duplicated on the genome during evolution or from sequencing errors. Among them, distinct full-length cDNA sequences should be determined first, before the variant sequences, for efficient execution of the full-length cDNA project. In this study, we have shown that grouping can collect negligibly redundant cDNAs and clustering can select variant cDNAs.
Among several gene index projects (Table 1), TGI, UniGene, and
GeneNest adopted single clustering criteria, which allow splice variants to be incorporated into the same cluster or to be separated into distinct clusters. STACK categorizes splice variants according to
the tissue from which they were derived. However, our method can
collect and separate splice variants using two distinct criteria; at
present, our method collects splice variants that occur on end
sequences by computationally comparing their length and identity. However, we cannot select each splice variant by examining their alignments. A visualization tool should be added to our method as in
the mouse full-length cDNA project. In the case of UniGene, a sequence
of a single cluster, which may be derived from a low-quality sequence,
is merged with a cluster at a lower level of stringency, and GeneNest
removes low-quality sequences before clustering. Our method does not
take sequence quality into consideration, so that low-quality sequences
will be separated into distinct clusters according to clustering
criteria. We also compared our method with the d2_cluster
program in STACK and StackPack (Burke et al. 1999). The results of the
classification of 213,404 mouse 3' end cDNA sequences using
d2_cluster were nearly the same as those from
grouping; d2_cluster covered 93% of the FANTOM clusters.
The results of the classification of variant clones selected by
clustering were assessed with CRAW (Burke et al. 1999).
However, neither of these programs can automatically select
representative clones for full-length sequencing. Because a full-length
cDNA project must identify likely candidates for full-length
sequencing, we designed our two-step classification method to
automatically and simultaneously collect representative clones that
were negligibly redundant and variant. Our method also allows the
researcher to change the priority with which clones derived from gene
variants and artifacts are sequenced in full.
Additional methods might facilitate the collection of rare genes or
reduction of artifact products by changing the priority of clone
sequencing. In a full-length cDNA project, the representative clones
selected by grouping and clustering need to be rearrayed from the
master plates before beginning full-length sequencing, and then the
full-length sequences of the representative clones selected by grouping
should be determined by collecting cDNAs that are as distinctive as
possible. Therefore, by aligning the full-length cDNA and end
sequences, we can identify variant sequences with alternative
transcriptional start sites, alternative polyadenylation sites, and
artifacts such as internal priming (Gautheret et al. 1998). We then can
change the priority of the sequencing of these clones.
Here we propose another method for selecting the representative clones. Clones in single-member groups or clusters are likely to be derived from rare genes or to be the result of contamination by genomic sequences. Therefore, after grouping or clustering of the end sequences, we can rearray the representative clones derived from the single-member groups and clusters on the same plates and postpone determining the full-length sequences of these clones. Clones belonging to groups or clusters with two or more members are not likely to be derived from artifacts. Therefore, the full-length sequencing of these clones can be our first priority.
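A minimal sketch of this prioritization is given below. Taking the first listed clone as the representative is a placeholder assumption, because the rule for choosing a representative within a group or cluster is not specified here; the function name is also hypothetical.

```python
def prioritize_representatives(groups):
    """Split representatives into a first-priority list (groups or clusters
    with two or more members) and a postponed list (single-member ones)."""
    first_priority, postponed = [], []
    for members in groups:
        representative = members[0]  # placeholder choice, an assumption
        if len(members) >= 2:
            first_priority.append(representative)
        else:
            postponed.append(representative)
    return first_priority, postponed

# Toy groups (hypothetical clone identifiers)
toy_groups = [["A", "B", "C"], ["D"], ["E", "F"]]
print(prioritize_representatives(toy_groups))
```

The two lists correspond to the two sets of rearrayed plates: one sequenced first, and one holding the single-member representatives whose sequencing is postponed.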
In the rice full-length cDNA project, our two-step classification method yielded many of the same results as for the mouse cDNA sequences. This indicates that our method may be useful for full-length cDNA projects of various species. Clones selected by grouping should also be useful for the construction of a DNA microarray and proteome analysis, because these clones include negligibly redundant and unique genes, and the full-length sequences of these clones are determined preferentially.
METHODS
Criteria for Classification of End Sequences
We established grouping and clustering criteria for the
classification of end sequences to collect minimally redundant and variant clones in parallel. Because BLAST homology search software can process many sequences quickly, we used this software in
the classification of end sequences. For the purpose of grouping, we
determined the criteria of the BLAST homology search by
examining the number of groups obtained when we varied the lower limits
of identity and overlap length between the end sequences under
comparison. We used our laboratory's 213,404 3' end sequences, which
include those from the full-length cDNA sequences of the 21,076 FANTOM
clones (the RIKEN Genome Exploration Research Group 2001), to carry out
this experiment.
First, we set the lower limit for the length of the overlap between two end sequences at 50 to 200 bp and varied the identity threshold for these sequences from 10% to 98%; the expected (E) value threshold was set at 100 (Fig. 4A). The number of groups was relatively constant between 80% and 92% identity but dramatically increased at >94% identity. Gene variants and cDNAs with sequencing errors will increasingly be placed in the same group when the identity threshold increases from 80% to 92%; therefore we adopted a 90% identity threshold as a grouping criterion for this study.
Next, we analyzed the change in the number of groups obtained when the threshold for the length of the overlap between two end sequences varied from 20 to 200 bp (Fig. 4B). The number of groups dramatically increased when the overlap threshold was 20 to 30 bp but only gradually increased when this limit ranged from 150 to 200 bp. Therefore, an overlap threshold of 30 to 150 bp was appropriate for the grouping criterion; here, we adopted a limit of 80 bp.
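The parameter scans described above amount to counting groups repeatedly while one threshold is varied. The sketch below shows the idea on toy data; the hit tuple layout `(i, j, identity, overlap)`, the function name, and the example values are hypothetical, and the thresholds are treated as lower limits.

```python
def count_groups(n_sequences, hits, min_identity, min_overlap):
    """Number of groups obtained when sequences joined by a qualifying hit
    are merged (single-linkage merging with union-find)."""
    parent = list(range(n_sequences))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # shorten the path to the root
            i = parent[i]
        return i

    n_groups = n_sequences
    for i, j, identity, overlap in hits:
        if identity >= min_identity and overlap >= min_overlap:
            root_i, root_j = find(i), find(j)
            if root_i != root_j:
                parent[root_j] = root_i
                n_groups -= 1  # every successful merge removes one group
    return n_groups

# Toy hits between five sequences: (index, index, percent identity, overlap in bp)
toy_hits = [(0, 1, 99.0, 150), (1, 2, 93.0, 90), (2, 3, 88.0, 60), (3, 4, 81.0, 25)]

for identity in (80, 90, 94, 98):
    print("identity", identity, "groups", count_groups(5, toy_hits, identity, 20))
for overlap in (20, 80, 100):
    print("overlap", overlap, "groups", count_groups(5, toy_hits, 80, overlap))
```

Plotting the group count against each threshold, as in Figure 4, shows where the count is stable and where it rises sharply, which is the basis for choosing the working thresholds.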
Because the goal of the clustering process is to separate variant
sequences into distinct clusters, the criteria for the
BLAST homology search needed to be sufficiently stringent
to prevent the merging of collections of similar end sequences, as
happens in grouping. When using the 213,404 mouse cDNAs, we
classified them by evaluating only their 3' end sequences. For the rice
full-length cDNA project, we used 5' and 3' end sequences. The lower
limit for the identity value was set at 98%, because the accuracy of sequencing typically is 98% and because sequences that were 98% identical were regarded as the same gene by human inspection. Because the identity threshold of clustering is high (98%), only a high-quality part of each sequence should be used; end sequences used in clustering are therefore trimmed to the upstream 300 bp. At the 3' end of the cDNAs, the polyadenylation site starts 10 to 30 bp downstream of the poly(A) signal (Wahle and Keller 1992; Wahle 1995; Edwalds-Gilbert et al. 1997). Therefore, we placed in the same cluster clones whose 3' end sequences differed by 20 bp in length. Because the transcriptional start site typically lies 10 to 80 bp from the 5' end of the sequence (Suzuki et al. 2001), we also clustered clones whose 5' ends differed by 10 bp.
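One way to picture the difference between clustering and grouping is the greedy sketch below, in which each unclassified clone seeds a new cluster and clusters are never merged. It is an illustration under stated assumptions, not the published procedure: the hit tuple layout `(query, subject, identity, end_difference)`, the function name, and the reading of the 98% identity value as a lower limit and the 20-bp end difference as an upper limit are all assumptions.

```python
def cluster_clones(clones, hits, min_identity=98.0, max_end_diff=20):
    """Greedy clustering: every clone is assigned exactly once, and
    existing clusters are never merged with one another."""
    partners = {clone: [] for clone in clones}
    for query, subject, identity, end_diff in hits:
        # Assumed criteria: high identity and a small difference in end position.
        if identity >= min_identity and end_diff <= max_end_diff:
            partners[query].append(subject)
            partners[subject].append(query)

    assigned = set()
    clusters = []
    for seed in clones:
        if seed in assigned:
            continue
        members = [seed]
        for partner in partners[seed]:
            # Only clones that are still unclassified may join the new cluster.
            if partner not in assigned and partner not in members:
                members.append(partner)
        assigned.update(members)
        clusters.append(members)
    return clusters

# Hypothetical hits: (query, subject, percent identity, end difference in bp)
toy_cluster_hits = [
    ("A", "B", 99.0, 5),
    ("B", "C", 98.5, 10),
    ("C", "D", 99.5, 35),  # end difference too large
]
print(cluster_clones(["A", "B", "C", "D"], toy_cluster_hits))
```

In this toy input, C matches B but is not drawn into the cluster seeded by A, because clusters are not merged through a shared clone; single-linkage grouping would have placed A, B, and C together.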
WEB SITE REFERENCES
http://genome.gsc.riken.go.jp; Mouse full-length cDNA project.
http://genome.gsc.riken.go.jp/software/2C; Programs and sequences used in this paper.
http://www.genome.washington.edu/uwgc/analysistools/repeatmask.htm; Web site for RepeatMasker software.
ACKNOWLEDGMENTS
This work would not have been possible without the encouragement and support offered by Keiji Kainuma. We thank Hidemasa Bono, Shiro Fukuda, Jun Adachi, Rintaro Saito, Tetsuya Saito, Takeya Kasukawa, Masaaki Furuno, Shigeyasu Yoshida, and Yoshimi Ota for their discussion, encouragement, and technical assistance. We also thank Katsunori Aizawa, Norihito Hayatsu, Tomoko Hirozane, Yoshiyuki Ishii, Ayako Yasunishi, Ayako Hara, Daisuke Sasaki, and the members of the RIKEN Genome Exploration Research Group Science Center for the data preparation.
This study was supported by a research grant for the RIKEN Genome Exploration Research Project from the Ministry of Education, Culture, Sports, Science, and Technology of the Japanese Government to Y.H. and ACT-JST (Research and Development for Applying Advanced Computational Science and Technology) of the Japan Science and Technology Corporation (JST) to Y.H. This work was also supported by a research grant for the Rice Genome Full-Length cDNA Library Construction Project from BRAIN (Bio-oriented Technology Research Advancement Institution) to Y.H.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
4 Corresponding author.
E-MAIL rgscerg{at}gsc.riken.go.jp; FAX: 81 45 503 9222.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.75202. Article published online before print in June 2002.
Received January 9, 2002; accepted in revised form April 12, 2002.