Published online before print
July 19, 2002, 10.1101/gr.220302
Vol. 12, Issue 8, 1286-1293, August 2002
RESOURCES
ABSTRACT
Although many studies on alternative splicing of specific genes have been reported in the literature, the general mechanism that regulates alternative splicing is not yet clearly understood. In this study, we systematically aligned each pair of the 21,076 cDNA sequences of Mus musculus, searched for putative alternative splicing patterns, and constructed a list of potential alternative splicing sites. Two cDNAs are suspected to be alternatively spliced transcripts originating from a common gene if they share most of their region with a high degree of sequence homology, but parts of the sequences are very distinctive or deleted in either cDNA. The list contains the following information: (1) tissue, (2) developmental stage, (3) sequences around splice sites, (4) the length of each gapped region, and (5) other comments. The list is available at http://www.bioinfo.sfc.keio.ac.jp/intron. Our results have predicted a number of unreported alternatively spliced genes, some of which are expressed only in a specific tissue or at a specific developmental stage.
INTRODUCTION
Alternative splicing of pre-mRNA plays an important role in the production of diverse mRNAs from individual genes, and it helps increase the functional range of gene products in higher eukaryotes. In many cases, gene expression is tightly regulated at the splicing level by specific mechanisms to provide suitable proteins for a particular tissue or stage (McKeown 1992; Chabot 1996; Wang and Manley 1997). On the other hand, alternative transcripts are generated in the same tissue, especially in brain or muscle, to supply an extensive number of proteins that have distinct functions, contributing to the plasticity of these tissues (Bernstein et al. 1986; Missler and Südhof 1998).
The total number of genes in the human genome is estimated to range from 28,000 to 120,000 (Crollius et al. 2000; Ewing and Green 2000; Liang et al. 2000; Wright et al. 2001), and at least one-third of them might give rise to alternatively spliced transcripts (Mironov et al. 1999; Brett et al. 2000). Although databases of alternative splicing have been established by collecting alternatively spliced genes from annotated databases (Dralyuk et al. 2000; Ji et al. 2001), the number of alternatively spliced genes cataloged in such databases is small compared with the estimated total number of alternatively spliced human genes (Modrek et al. 2001).
Through single-pass end sequencing of randomly selected cDNA clones, >1 million expressed sequence tags (ESTs) have been submitted to publicly available databases (Adams et al. 1991). The accumulation of ESTs contributes not only to the discovery of new genes (Adams et al. 1995) but also to the detection of new alternatively spliced genes. There are several ways to detect alternatively spliced genes, including (1) mapping EST sequences onto the genome sequence (Wolfsberg and Landsman 1997; Modrek et al. 2001), (2) comparing full-length mRNA sequences from annotated databases against the EST database (Brett et al. 2000), and (3) clustering EST sequences (Burke et al. 1998). Although ESTs are effective material for identifying novel candidate alternatively spliced genes, full-length cDNAs are much more desirable for that purpose because they cover entire coding regions.
In this study, we used 21,076 full-length cDNA clones of Mus musculus derived from numerous tissues or developmental stages (The RIKEN Genome Exploration Research Group Phase II and the FANTOM Consortium 2001) to analyze the extent of alternative splicing. We conducted a systematic analysis to extract putative alternative cDNAs by comprehensive, round-robin comparisons among the 21,076 clone sequences and constructed a list of potential alternatively spliced transcripts. We then analyzed the expression patterns of the clusters using their expression profiles (Miki et al. 2001) and selected the clusters whose cDNAs tended to be expressed in a specific tissue or at a specific developmental stage. It has been reported that 69 out of 1600 rat genes were detected as alternatively spliced genes based on expression data (Hu et al. 2001). Our analysis used a putative alternative splicing data set and a large microarray data set. This method is significant not only because it allows alternatively spliced genes to be identified but also because it can narrow the search to the specific conditions under which alternative splicing occurs, thereby reducing experimental work. It may serve as a model for transcriptome analysis of alternative splicing.
RESULTS
Overview of the Clusters Predicted as Alternatively Spliced Genes
The data set of alternatively spliced cDNAs was constructed from a library of 21,076 cDNAs as described in the previous section. The data set consists of 415 clusters with a total of 1136 cDNAs. In the data set, potentially alternatively spliced cDNAs are listed with the following information: (1) tissue, (2) developmental stage, (3) sequences around splice sites, (4) the length of each gapped region, and (5) other comments. These cDNAs are available at http://www.bioinfo.sfc.keio.ac.jp/intron. Most clusters have only one gapped region (putative alternatively spliced site), as summarized in Table 1.
Various types of alternative splicing patterns have been discussed. Breitbart et al. (1987) suggested five canonical types of alternative splicing (illustrated in Fig. 1): (A) cassette, (B) internal donor site, (C) internal acceptor site, (D) mutually exclusive, and (E) retained intron. We classified each of the 490 gapped regions of the 415 clusters into one of these five categories according to the criteria defined below. For the sake of classification, we consider the consensus nucleotide sequences around the splicing sites (Mount 1982; Padgett et al. 1986), 5'-(a/c)ag|GT(a/g)agt and (c/t)₁₀N(c/t)AG|g-3'. These consensus nucleotides are reflected in Figure 1. For each gapped region to be classified into one of the five categories, the nucleotides represented by capital letters are compulsory, and the nucleotides represented by lower-case letters are preferred. More precisely, we used the following criteria: (A) cassette: GT or AG; (B) internal donor site: GT required, and at least four of the seven preferred nucleotides of the donor site; (C) internal acceptor site: AG required, and at least 8 of the 13 preferred nucleotides of the acceptor site; and (E) retained intron: GT and AG required, and at least four of the seven preferred nucleotides of the donor site and 8 of the 13 preferred nucleotides of the acceptor site. Because category D can be uniquely determined by the pattern of alignment alone, no nucleotides were checked for it. The gapped regions that could not be classified into any category were categorized as Unclassified. The results of this categorization are presented in Table 2. To estimate the tendency of misclassifications, alternative exons of M. musculus known in the literature (Stamm et al. 2000) were used as a sample set and classified according to the same criteria. The result of this classification is presented in Table 3.
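The consensus-based criteria above can be sketched in Python as follows. This is only an illustrative sketch, not the program used in the study: the names `DONOR`, `ACCEPTOR`, `score_site`, `is_internal_donor`, and `is_internal_acceptor` are hypothetical, and the acceptor pattern assumes ten pyrimidine positions before N. Under that reading the pattern has 12 lower-case positions rather than the 13 counted in the text, so the thresholds are left as parameters.

```python
# Each pattern position is (allowed_bases, required).
# allowed_bases=None marks an unconstrained position (N).
DONOR = [("ac", False), ("a", False), ("g", False),      # exon side: (a/c)ag
         ("g", True), ("t", True),                       # compulsory GT
         ("ag", False), ("a", False), ("g", False), ("t", False)]  # (a/g)agt

ACCEPTOR = ([("ct", False)] * 10                          # pyrimidine tract (assumed 10)
            + [(None, False)]                             # N
            + [("ct", False), ("a", True), ("g", True)]   # (c/t) then compulsory AG
            + [("g", False)])                             # exon side: g


def score_site(seq, pattern):
    """Return (required_ok, preferred_matches) for seq against pattern."""
    seq = seq.lower()
    if len(seq) != len(pattern):
        raise ValueError("sequence and pattern must have the same length")
    required_ok = True
    preferred = 0
    for base, (allowed, required) in zip(seq, pattern):
        if allowed is None:
            continue  # N: neither compulsory nor preferred
        hit = base in allowed
        if required:
            required_ok = required_ok and hit
        elif hit:
            preferred += 1
    return required_ok, preferred


def is_internal_donor(seq, min_preferred=4):
    """GT required plus at least min_preferred preferred donor nucleotides."""
    ok, n = score_site(seq, DONOR)
    return ok and n >= min_preferred


def is_internal_acceptor(seq, min_preferred=8):
    """AG required plus at least min_preferred preferred acceptor nucleotides."""
    ok, n = score_site(seq, ACCEPTOR)
    return ok and n >= min_preferred


print(score_site("cagGTaagt", DONOR))
print(is_internal_donor("tttGTcccc"))
print(is_internal_acceptor("tttcccttctacagg"))
```

A retained-intron check in the same spirit would require both `is_internal_donor` and `is_internal_acceptor` to hold for the two ends of the gapped region.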
The majority of the known exons were categorized correctly in accordance with their appropriate splicing patterns, except that many (A) cassette exons were classified as (C) internal acceptor sites. These misclassifications arise from the fact that exonic consensus sequences in the acceptor site are similar to the intronic consensus sequence AG, making it difficult to predict the form of alternative splicing on the basis of sequence data (Thanaraj 2000). From this control study, it can be inferred that a good portion of the 134 gapped regions listed as (C) internal acceptor sites in Table 2 are actually (A) cassettes.
The numbers of spliced and unspliced regions (illustrated in Fig. 2) of putative alternative splicing are summarized in Tables 4 and 5 according to expressed tissue and developmental stage, respectively. No general tendency specific to any tissue or developmental stage was found, suggesting that alternative splicing takes place widely across tissues and developmental stages.
Details of Several Clusters Predicted as Alternatively Spliced Genes
One of the clusters in category D (mutually exclusive) is homologous (96% identity) to the CHIP protein (Ballinger et al. 1999). The form of this protein is shown in Figure 3. Although the CHIP gene has not been reported as an alternatively spliced gene, it is likely that this gene has alternative transcripts.
Figure 4 shows examples of more complicated alternative splicing patterns in which three cDNAs were potentially produced in different forms from a single gene. An open reading frame (ORF) was predicted for each cDNA using the RIKEN DECODER program (Fukunishi and Hayashizaki 2001).
When an alternatively spliced region resides in a predicted ORF, it is likely that the spliced exon increases variation of the protein function. In particular, cDNA Cluster 8 has three splicing patterns, and the second spliced region causes a drastic change of amino acids by a frameshift. Although it is possible that this frameshift is caused by a sequencing error, we think it is not, because the frameshifted region includes a zinc finger motif (Table 6). The variety of zinc finger motifs in the three translation products may contribute to variation in gene regulation by altering their DNA-binding sites.
Besides this case, frameshifts were identified in cDNA Clusters 63 and 3071, but no motif was found in these exons. It has been reported that in the integrin 5 subunit of mouse and major protein zero (MPZ) of human, the occurrence of alternative splicing events in the ORF resulted in open-reading frameshifts (Besancon et al. 1999). Thus, these two clusters may also have distinct gene functions regulated by frameshifts.
Transcriptome Analysis of Mouse DNA Arrays with Our Data Set
Figures 5 and 6 show the transcriptome analyses of mouse DNA arrays with our putative alternative splicing data set. These clusters each have a prominent splicing pattern in specific tissues or at distinct developmental stages. The level of gene expression is presented as a score of signal intensity between cDNAs.
In Cluster 2204, the cDNAs are homologous to prolactin-like peptide. It is known that the prolactin (PRL)/growth hormone (GH) gene is expressed in the pituitary gland, uterus, or placenta (Ishibashi and Imai 1999). Our data show that SeqID 4107 is expressed in the placenta but not in the thymus or uterus. On the other hand, SeqID 3784 shows high expression in the thymus and uterus. The alternative exon may contribute to the construction of this protein in a particular tissue.
In Cluster 3148, the cDNAs are homologous to bisphosphate 3'-nucleotidase (Spiegelberg et al. 1999), which has not been reported to have alternative transcripts. Although the distal start codon may be adopted by both cDNAs, the two start codons may each be used at a specific developmental stage as a result of alternative splicing.
Some alternatively spliced regions are outside of predicted ORFs (Clusters 3082, 3138). The cDNAs of Cluster 3138 are homologous to TIA-1 cytotoxic granule-associated RNA-binding protein-like 1. This gene is expressed in the cells fated to become brain and retina at embryonic day 12.5. Its expression is also found in the lung, kidney, and thymus (Lowin et al. 1996). On the other hand, the gene expression of cDNA Cluster 3082 is likely to be regulated according to the skin developmental stage. The cDNAs of this cluster are homologous to 28S ribosomal protein S17 (Gantt and Thompson 1990). It has been reported that alternative splicing often occurs in 5'-untranslated regions, resulting in alternative regulation of gene expression (Mironov et al. 1999). Therefore, the alternatively spliced regions may contain regulatory elements.
DISCUSSION
We divided 1136 cDNAs into 415 clusters as putative alternatively spliced transcripts. These cDNAs constitute 7.4% of the 15,294 cDNAs (the estimated number of unique sequences). Although it has been reported that ~38% of all human genes undergo alternative splicing (Brett et al. 2000), our number should not be interpreted as the percentage of alternatively spliced genes in general. In the process of constructing the cDNA library, we tried to reduce redundancy by not sequencing cDNAs with the same nucleotide sequence in their 5'- or 3'-untranslated regions (The RIKEN Genome Exploration Research Group Phase II and the FANTOM Consortium 2001). This procedure should have eliminated a large number of alternatively spliced transcripts.
It has been reported that many genes are alternatively spliced at multiple sites (Smith et al. 1989), from which hundreds of alternate transcripts could be produced in theory. One example of this is the lymphocyte homing receptor gene CD44, which can generate enormous molecular diversity, >1000 potential isoforms, by including or excluding each of 10 exons in the gene (Screaton et al. 1992; Tolg et al. 1993). In our results, on the other hand, most of the clusters showed potential alternative splicing at only one site (Table 1); it may be that they have many more splicing variants that we have overlooked. To study this possibility, a greater amount of cDNA sequence data from a given gene will be necessary (Regan et al. 2000).
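The CD44 figure of >1000 potential isoforms follows from simple counting: if each of 10 exons can be independently included or excluded, there are 2^10 combinations. The short Python sketch below illustrates this arithmetic; the function name `count_isoforms` is hypothetical, and the assumption that all inclusion choices are independent is a simplification made only for illustration.

```python
from itertools import product


def count_isoforms(n_optional_exons):
    """Number of include/exclude combinations for independent optional exons."""
    return 2 ** n_optional_exons


# Enumerate a small case explicitly to show what is being counted:
# each tuple marks every optional exon as excluded (0) or included (1).
small = list(product((0, 1), repeat=3))
print(len(small), count_isoforms(3))

# Ten optional exons, as in the CD44 example.
print(count_isoforms(10))
```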
In summary, computational analysis is a powerful means for predicting potential sites of alternative splicing, and we have constructed a list of these sites from the largest available data set of mouse full-length cDNA sequences. Our results have predicted a number of unreported alternatively spliced genes, some of which are expressed only in a specific tissue or at a specific developmental stage.
METHODS
We used a set of 21,076 mouse full-length cDNAs produced by The RIKEN Genome Exploration Research Group Phase II and the FANTOM Consortium (2001). The average length of all the cDNAs was 1257 bp. The number of unique sequences, after eliminating redundant sequences, is presumed to be 15,294. In our work, however, we did not make any attempt to eliminate redundancy and used all of the 21,076 sequences, in order not to miss any potential alternative transcripts.
First, we conducted a round-robin BLAST search (Altschul et al. 1990) of the 21,076 cDNA sequences against each other. The cDNA pairs whose BLAST output met the following criteria were extracted from the data set: (1) >95% of nucleotides were identical for >20 consecutive nucleotides; and (2) more than one such matching region was shared. After these comprehensive pairwise comparisons, the cDNA pairs were merged into clusters if one sequence was paired with two or more different sequences.
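The pair filtering and merging steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the pipeline used in the study: each BLAST matching region is represented as a hypothetical `(percent_identity, aligned_length)` tuple, criterion (1) is read as identity >95% over a region longer than 20 nucleotides, and clusters are taken to be the connected groups of passing pairs. The names `passes_criteria`, `cluster_pairs`, and the sample `hits` are illustrative only.

```python
from collections import defaultdict


def passes_criteria(regions, min_identity=95.0, min_length=20):
    """True if more than one matching region exceeds both thresholds.

    regions: list of (percent_identity, aligned_length) for one cDNA pair.
    """
    good = [r for r in regions if r[0] > min_identity and r[1] > min_length]
    return len(good) > 1


def cluster_pairs(pairs):
    """Merge pairs that share a sequence into clusters (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(set)
    for x in list(parent):
        clusters[find(x)].add(x)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical matching regions for four cDNA pairs.
hits = {
    ("c1", "c2"): [(98.0, 300), (99.0, 150)],
    ("c2", "c3"): [(97.0, 200), (96.5, 80)],
    ("c4", "c5"): [(99.0, 500)],              # only one matching region
    ("c6", "c7"): [(90.0, 400), (99.0, 30)],  # one region below 95% identity
}
pairs = [pair for pair, regions in hits.items() if passes_criteria(regions)]
clusters = cluster_pairs(pairs)
print(clusters)
```

Union-find is used here because it merges overlapping pairs in near-linear time, which matters when every sequence is compared against every other.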
Next, the sequences of these clusters were aligned using the multiple sequence alignment program CLUSTALW (Thompson et al. 1994). The gap penalty parameter was set to 0 to tolerate large gaps. If the aligned sequences shared most of their region with a high degree of sequence homology but parts of the sequences were very distinctive or deleted in either cDNA, the cluster was suspected to consist of alternatively spliced transcripts originating from a common gene. We define such distinctive or deleted regions as gapped regions, and consider them as candidate alternatively spliced exons.
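As an illustration of how deleted regions might be located in an alignment, the Python sketch below scans two aligned rows and reports internal runs of columns in which exactly one row has a gap character. It is a simplified sketch, not the authors' procedure: it handles only the deleted case (not highly divergent segments), it ignores gaps at the alignment ends on the assumption that these reflect differing cDNA lengths rather than splicing, and the function name `gapped_regions` and the sample rows are hypothetical.

```python
def gapped_regions(row_a, row_b, min_len=1):
    """Return internal gap runs as (start, end, row_with_gap) tuples.

    Columns are 0-based and end is exclusive. row_with_gap is "a" or "b",
    naming the row that lacks the segment.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    n = len(row_a)
    regions = []
    start, which = 0, None
    for i in range(n + 1):
        if i < n:
            x, y = row_a[i], row_b[i]
            if x == "-" and y != "-":
                cur = "a"
            elif y == "-" and x != "-":
                cur = "b"
            else:
                cur = None
        else:
            cur = None  # sentinel column closes any run reaching the end
        if cur != which:
            if which is not None:
                internal = start > 0 and i < n  # skip terminal gaps
                if internal and i - start >= min_len:
                    regions.append((start, i, which))
            start, which = i, cur
    return regions


row_a = "ATGGCC---------TTGACA"
row_b = "ATGGCCGTAAGTCAGTTGACA"
regions = gapped_regions(row_a, row_b)
print(regions)
for start, end, which in regions:
    present = row_b if which == "a" else row_a
    print(present[start:end], end - start)
```

The length of each reported run corresponds to item (4) of the list, the length of each gapped region.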
We also used microarray data of expression patterns for 18,816 mouse cDNA sequences (Miki et al. 2001), to extract alternatively spliced genes whose expression pattern is prominent in a specific tissue or at a specific developmental stage. We presented the level of gene expression as a score of signal intensity between cDNAs.
WEB SITE REFERENCES
http://www.bioinfo.sfc.keio.ac.jp/intron; a list of alternative splicing patterns.
ACKNOWLEDGMENTS
We thank Atsushi Sakurai, Shigeo Fujimori, Koya Mori, Hitomi Itoh, and members of the Tomita laboratory for helpful discussions and suggestions during the course of this work. This study was supported in part by a research grant for the RIKEN Genome Exploration Research Project from the Japanese Ministry of Education, Culture, Sports, Science and Technology (MEXT) to Y.H. This work was also supported by a research grant from the Ministry of Agriculture, Forestry and Fisheries of Japan (Rice Genome Project), New Energy and Industrial Technology Development Organization (NEDO) of the Ministry of Economy, Trade and Industry of Japan (Development of a Technological Infrastructure for Industrial Bioprocesses Project), and Japan Science and Technology Agency.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
5 The RIKEN Genome Exploration Research Group Phase II Team: Jun Kawai, Akira Shinagawa, Kazuhiro Shibata, Masayasu Yoshino, Masayoshi Itoh, Yoshiyuki Ishii, Takahiro Arakawa, Ayako Hara, Yoshifumi Fukunishi, Hideaki Konno, Jun Adachi, Shiro Fukuda, Katsunori Aizawa, Izawa Masaki, Katsuo Nishi, Hidenori Kiyosawa, Shinji Kondo, Itaru Yamanaka, and Tetsuya Saito.
6 Corresponding author.
E-MAIL mt{at}sfc.keio.ac.jp; FAX 81 (466) 47-5099.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.220302. Article published online before print in July 2002.
Received October 26, 2001; accepted in revised form May 17, 2002.