Genome Research


Genome Res. 13:1335-1344, 2003
©2003 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/03 $5.00

Letter

Mouse Proteome Analysis

Alexander Kanapin [1,11], Serge Batalov [4], Melissa J. Davis [3], Julian Gough [5], Sean Grimmond [3], Hideya Kawaji [2,8], Michele Magrane [1], Hideo Matsuda [2], Christian Schönbach [6], Rohan D. Teasdale [3], RIKEN GER Group [7], GSL Members [9,10] and Zheng Yuan [3]

[1] EMBL Outstation - European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SD, UK
[2] Department of Bioinformatic Engineering, Graduate School of Information Science and Technology, Osaka University, Toyonaka, Osaka 560-8531, Japan
[3] Institute for Molecular Bioscience and ARC Special Research Centre for Functional and Applied Genomics, University of Queensland, St. Lucia, Queensland 4072, Australia
[4] Genomic Institute of the Novartis Research Foundation (GNF), San Diego, California 92121, USA
[5] Department of Structural Biology, Stanford University, Stanford, California 94305, USA
[6] Knowledge Discovery Team, Bioinformatics Group, RIKEN Genomic Sciences Center, Yokohama 230-0045, Japan
[7] Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center (GSC), RIKEN Yokohama Institute, Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa 230-0045, Japan
[8] NTT Software Corporation, Naka-ku, Yokohama, Kanagawa, 231-8554, Japan
[9] Genome Science Laboratory, RIKEN, Hirosawa, Wako, Saitama 351-0198, Japan


    ABSTRACT
 
A general overview of the protein sequence set for the mouse transcriptome produced during the FANTOM2 sequencing project is presented here. We applied different algorithms to characterize protein sequences derived from a nonredundant representative protein set (RPS) and a variant protein set (VPS) of the mouse transcriptome. The functional characterization and assignment of Gene Ontology terms were done by analysis of the proteome using InterPro. The Superfamily database analyses gave a detailed structural classification according to SCOP and provided additional evidence for the functional characterization of the proteome data. The MDS database analysis revealed new domains that are not present in existing protein domain databases. Thus the transcriptome gives us a unique source of data for the detection of new functional groups. The data obtained for the RPS and VPS sets facilitated the comparison of different patterns of protein expression. A comparison with other existing mouse and human protein sequence sets (e.g., the International Protein Index) demonstrates the common patterns in mammalian proteomes. The analysis of the membrane organization within the transcriptome of multiple eukaryotes provides valuable statistics about the distribution of secretory and transmembrane proteins.


The Mouse Gene Encyclopedia project (FANTOM Consortium 2002) provides a unique opportunity for researchers to investigate a mammalian proteome from its functional perspective. The data provide a snapshot of proteins present in the living cell and can therefore be used for functional analysis and classification.

This paper summarizes a general analysis of the mouse proteome sets deduced from the transcriptome DNA sequences, based on various algorithms and approaches. We used protein domain databases, namely InterPro (Apweiler et al. 2001) and Superfamily (Gough and Chothia 2002), to carry out initial functional annotation of the protein sequences and to classify these sequences according to existing biological resources, such as Gene Ontology (GO). The general coverage of proteins in the representative proteins set is about 92% for InterPro and Superfamily combined, and this provides a comprehensive overview of the proteome. InterPro analysis has also been used for comparison of the different proteomes produced; this analysis highlights interesting differences between various mouse sequencing projects. New domains that are not included in existing resources have been detected using algorithms implemented in the MDS database (Kawaji et al. 2002), and seven new domain candidates have been discovered. Determination of the membrane organization within the secretory pathway, namely whether a protein is secreted into the extracellular medium, spans a membrane (transmembrane protein), or is nonsecretory, is essential for understanding its function. This information complements other computational annotation projects, as it provides the context by determining the membrane topography of predicted functional protein units and is essential for the prediction of subcellular localization, which depends on the class of protein.


    RESULTS AND DISCUSSION
 
Two protein sets have been produced as a result of the FANTOM2 sequencing project. The representative proteins set (RPS) is derived from the representative set of transcriptional units. The variant-based proteins set (VPS) combines RPS and complete protein sequences representing splice variants not included in RPS. The VPS includes variant forms of known genes identified by sequencing of the FANTOM2 clones. We summarized the characterization of the sets in the main FANTOM2 paper (FANTOM Consortium 2002). We describe here the different characteristics of the variants and provide comparisons with other available sequence data for mouse and human.

InterPro Matches Statistics
The major goal of the domain/site/motif composition analysis was to obtain a general functional overview of the proteome and to use these results for initial functional assignments. We used InterPro as a standard tool to determine the domain/site/motif composition of different mouse protein sequence data sets. In addition to the RPS and VPS described earlier, we also analyzed a mouse sequence data set of hypothetical proteins computationally predicted by Celera and the nonredundant mouse protein set produced as part of the International Protein Index (IPI) (http://www.ebi.ac.uk/IPI). The human protein set provided by IPI was also analyzed.

About 72% of the proteins in both FANTOM2 proteome sets have matches to InterPro entries (92% for the combined InterPro and Superfamily databases). This proportion is similar to that of other proteomes analyzed in the Proteome Analysis Database (http://www.ebi.ac.uk/proteome), where complete proteomes show about 60%–75% coverage. This provides some evidence of the high quality of the FANTOM2 data. We also analyzed the amino-acid frequency distribution for the mouse protein sequences (data not shown). The difference in the frequencies between the different mouse datasets is only about 0.3%, which is far less than the difference between various eukaryotic proteomes (about 3%).

Comparative Proteomics
The algorithms implemented in the Proteome Analysis Database also include several InterPro-based statistical analyses, including a list of the top 20 InterPro entries. Table 1 presents statistics for the described mouse proteome data and also includes human IPI statistics. The analyses suggest that the general domain/site/motif composition is similar for all four mouse proteome sets. The statistics of the InterPro entries can be used to infer some functional information about the proteome. The most commonly represented functional groups are nucleic acid binding proteins and proteins belonging to the immunoglobulin family. The other major group of InterPro entries includes serine/threonine and tyrosine protein kinase domains. The RPS and VPS proteome sets have similar statistics for InterPro entry composition, which describes the protein sets from the point of view of functional domains/sites/motifs. This can be considered evidence of the relative stability of the functional potential of the transcriptome, which maintains a constant ratio of proteins of different functions despite the presence of splice variants. The InterPro entry-matches distribution is very similar for the human and mouse proteomes at the level of the top entries and can, therefore, provide valuable information about conserved functional domains/sites/motifs across different mammalian species.


Table 1. Top 20 InterPro Entries for Different Mouse Protein Sequence Sets and Human Proteome

 
Functional Classification and Assignment
InterPro analysis also provides a basis for the functional assignment of proteins to standard biological classification resources, such as Gene Ontology (GO). We used the existing curated mapping of InterPro entries to GO terms to classify the mouse and human proteome sets described above. A modified version of GO called "GO Slim," which is implemented in the Proteome Analysis Database, was used to compare the functional composition of the proteomes. GO Slim comprises a selection of high-level terms from each of the three GO sections (molecular function, biological process, and cellular component), which were chosen to cover most aspects of the three ontologies without overlapping in the GO hierarchy. The molecular function terms of GO Slim were used here to provide an overview of the functional composition of the proteomes. The results are presented in Figure 1. The diagram shows that the FANTOM sets are more similar to each other than to the other sequence data. They differ mainly quantitatively, not in the "pattern" of proteins synthesized. The two genome-derived sequence data sets, Celera and IPI, are likewise more similar to each other than to the FANTOM2 data. The number of G-protein coupled receptor (GPCR) proteins was higher for the Celera set, possibly reflecting incomplete annotation of pseudogenes, which are abundant for this class of proteins. At the time of publication, Celera, through the process of expert curation, retired 20% of their gene predictions and annotated 6% as pseudogenes (Release R13d).
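The classification described above amounts to two lookups followed by a tally. The following is a minimal sketch of that idea, not the Proteome Analysis Database implementation. The two mapping tables use placeholder identifiers (`IPR_EXAMPLE_*`, `GO:EXAMPLE_*`) in place of the curated InterPro-to-GO mapping and the GO Slim selection, the function name `go_slim_counts` is hypothetical, and counting each protein at most once per category is an assumption.

```python
from collections import Counter

# Placeholder tables (assumptions): stand-ins for the curated
# InterPro-to-GO mapping and for the GO Slim term selection.
INTERPRO_TO_GO = {
    "IPR_EXAMPLE_1": {"GO:EXAMPLE_A"},
    "IPR_EXAMPLE_2": {"GO:EXAMPLE_B"},
    "IPR_EXAMPLE_3": {"GO:EXAMPLE_A", "GO:EXAMPLE_B"},
}
GO_TO_SLIM = {
    "GO:EXAMPLE_A": "GO:0003824",  # enzyme (category listed in Figure 1)
    "GO:EXAMPLE_B": "GO:0003676",  # nucleic_acid_binding (Figure 1)
}


def go_slim_counts(protein_hits):
    """Count proteins per GO Slim category.

    protein_hits maps a protein identifier to the InterPro entries it
    matches. A protein is counted once per category, however many of
    its entries map there; proteins with no mapped entry are skipped.
    """
    counts = Counter()
    for entries in protein_hits.values():
        slim_terms = set()
        for entry in entries:
            for go_term in INTERPRO_TO_GO.get(entry, ()):
                slim = GO_TO_SLIM.get(go_term)
                if slim is not None:
                    slim_terms.add(slim)
        counts.update(slim_terms)
    return counts


example_hits = {
    "protein_1": ["IPR_EXAMPLE_1"],
    "protein_2": ["IPR_EXAMPLE_3", "IPR_EXAMPLE_1"],
    "protein_3": ["IPR_UNMAPPED"],
}
example_counts = go_slim_counts(example_hits)
```

Running the same tally over each proteome set gives the per-category protein numbers that a diagram such as Figure 1 plots.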



Figure 1 Comparative diagram of the GO Slim categories for different mouse and human proteomes. The y-axis indicates the number of proteins; the x-axis indicates the following GO categories: GO:0003676 nucleic_acid_binding; GO:0003754 chaperone; GO:0003774 motor; GO:0003793 defense/immunity_protein; GO:0003824 enzyme; GO:0004871 signal_transducer; GO:0005198 structural_molecule; GO:0005215 transporter; GO:0005488 binding; GO:0005554 molecular_function_unknown; GO:0015070 toxin; GO:0030234 enzyme_regulator; GO:0030528 transcription_regulator.

 
We also compared the GO Slim statistics for the mouse and human proteome sets with other less closely related eukaryotes, namely Drosophila melanogaster and Arabidopsis thaliana. The resulting diagram is presented in Figure 2. The general functional overview for the different eukaryotic proteomes is quite similar despite some obvious differences between the mammalian proteomes and those of the plant and insect species, which were used for comparison across a wider range of species. The statistics provide an insight into the conserved functional groups common to all eukaryotic genomes.



Figure 2 Comparative diagram of the GO Slim categories for different eukaryotic proteomes. The y-axis shows the number of proteins; the x-axis shows GO categories (see Fig. 1).

 
Superfamily Domain Analysis
The SUPERFAMILY hidden Markov model library (Gough and Chothia 2002), representing all proteins of known structure, was run against the complete FANTOM2 mouse cDNA collection (FANTOM Consortium 2002). The results in this section correspond to the VPS set of sequences, because this is the closest available representation of the actual mouse proteome. Detailed results are available at the SUPERFAMILY web site (http://supfam.org/SUPERFAMILY/cgi-bin/gen_list.cgi?genome=mr).

The SUPERFAMILY analysis is used to detect and classify evolutionarily related groups of domains for which there is a known structural representative. All assigned domains are classified at the superfamily level. An accurate superfamily-level classification is obtained through detailed manual analysis, by an expert, of the structural, sequence, and functional evidence for a common evolutionary ancestor (Murzin et al. 1995). The superfamily classification used is that defined by the SCOP database.

Functional Annotation
These assignments provide information that has been used as part of the MATRICS annotation of the FANTOM2 mouse cDNA sequences. Because proteins with the same structure usually have the same or a related function, the structural domain assignments are a useful component of the information used in the functional annotation. Furthermore, the SUPERFAMILY analysis provides assignments for many sequences where there is little or no other significant information. At least one domain was assigned to 59% of the sequences, and the domains cover 42% of all residues. This is close to the coverage of other eukaryote proteomes. The rest of the analysis shown in this section pertains to the subset of sequences and domains that were detected. The top 12 superfamilies are shown in Table 2. The list is very similar to that observed in the human proteome based on gene predictions from the genomic sequence (Hubbard et al. 2002).


Table 2. The Top 12 Most Commonly Occurring Superfamily Domains

 
Structural Genomics
There are implications for experimental structural genomics projects (Gough 2002), most notably the discovery of novel domain combinations. Using strict criteria, pairwise structural domain combinations were enumerated and compared to the already solved combinations in the Protein Data Bank (PDB) (Berman et al. 2000). Here, 335 structurally novel pairs were identified, 29 of which had not previously been found in any other proteome (these are listed at http://supfam.org/FANTOM2/domcombs.html). Although three-dimensional structures of the individual domains exist in the PDB, structures of these combinations of domains adjacent to each other on the polypeptide chain have not yet been solved. As well as being unique recombination events in evolution, these domain pairs provide targets for structural genomics projects which are assured to be novel. Solving the structures of novel domain-pair combinations will probably yield new 3D interfaces, which could be essential to or play an active role in the function of the protein as a whole.
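The enumeration of adjacent domain pairs can be illustrated with a short sketch. This is not the SUPERFAMILY pipeline: the "strict criteria" are not specified in the text, so the example simply treats each sequence as an ordered list of superfamily labels, takes ordered neighbouring pairs (an assumption), and subtracts a set of pairs already solved structurally. All names and data below are hypothetical.

```python
def adjacent_pairs(architecture):
    """Return the ordered pairs of superfamilies that are neighbours
    along one polypeptide chain (N- to C-terminal order)."""
    return set(zip(architecture, architecture[1:]))


def novel_pairs(proteome, solved_pairs):
    """Collect adjacent pairs over all sequences and drop those that
    already have a solved structure.

    proteome maps a sequence identifier to its domain architecture;
    solved_pairs is a set of (first, second) superfamily tuples.
    """
    observed = set()
    for architecture in proteome.values():
        observed |= adjacent_pairs(architecture)
    return observed - solved_pairs


# Toy input: superfamily labels are placeholders, not SCOP identifiers.
toy_proteome = {
    "seq_A": ["sf1", "sf2", "sf3"],
    "seq_B": ["sf2", "sf3"],
    "seq_C": ["sf4"],  # single-domain sequence contributes no pair
}
toy_solved = {("sf1", "sf2")}
toy_novel = novel_pairs(toy_proteome, toy_solved)
```

Whether a pair and its reversed order count as the same combination is a design choice; the sketch keeps them distinct because the text refers to domains adjacent on the polypeptide chain.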

Evolutionary Overview
The evolutionary relationships can give us a meaningful overview of a large proportion of the genome. The ancestral domain from each superfamily represents a genetic building block. These building blocks have been duplicated, recombined, and mutated to create the proteins which are currently observed in the genome. The assigned domain architecture for each sequence (available at the URL at the beginning of this section) shows the recombination of ancestral domains which has taken place during evolution. It can be seen that a small number of domains have been duplicated a very large number of times (see Fig. 3), and that a large number of domains have been duplicated very few times (see Fig. 4). In fact, 98% of the identified domains have been produced by duplication from 716 ancestral domains. This is very close to the pattern observed in the human proteome.



Figure 3 Observed evolutionary domains. The ordered sizes of superfamilies with greater than 100 members.

 



Figure 4 Observed evolutionary domains. The distribution of sizes of superfamilies with fewer than 100 members.

 
Novel Domains
We applied the MDS motif discovery method (Kawaji et al. 2002) to the FANTOM2 cDNA sequence set and identified seven new motif candidates that were deposited into the MDS database (http://motif.ics.es.osaka-u.ac.jp/fantom2/). Two candidate motifs (MDS00150, MDS00155) were found among hypothetical proteins, and five new motif candidates have been identified among proteins related to SCML2 (MDS00151), VPARP (MDS00152 and MDS00153), IAN4 (MDS00154), and ADMP (MDS00156). MDS00154 is a new structural GTPase submotif that is specific for proteins of the immune-associated nucleotide (IAN) family, which is conserved in mammals and plants (Poirier et al. 1999). Interestingly, our motif appears to be restricted to mammals (FANTOM Consortium 2002). MDS00151 is a nuclear-localization signal containing a repeat motif for the transcriptional repressor gene sex-comb on midleg-like-2 (Scml2; Montini et al. 1999) and its homologs (AK016533, 6030439N15). The motif spans 24 amino acids and has two to six copies. Closer analysis of the flanking regions revealed that MDS00151 contains the NLS (nuclear localization signal) [KR]{3,5} and is flanked by the NLS KKPx{6,9}KxKR. The flanking NLS region and MDS00151 were not found in any other mammal, suggesting an insertion/duplication event in the mouse lineage and a specific role for the nuclear import of mouse Scml2.
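The two NLS patterns quoted above translate directly into regular expressions, with the wildcard x written as `.`. The sketch below is only an illustration of how such patterns could be scanned for; the test sequence is synthetic and is not an Scml2 sequence, and the function name `find_nls` is hypothetical.

```python
import re

# [KR]{3,5}: a run of three to five lysine/arginine residues.
BASIC_RUN = re.compile(r"[KR]{3,5}")
# KKPx{6,9}KxKR: "x" (any residue) becomes "." in regex syntax.
FLANKING_NLS = re.compile(r"KKP.{6,9}K.KR")


def find_nls(sequence):
    """Return (start, matched_text) tuples for both patterns.

    Starts are 0-based. Matches are non-overlapping, so a basic run
    longer than five residues is reported in consecutive chunks.
    """
    return {
        "basic_run": [(m.start(), m.group()) for m in BASIC_RUN.finditer(sequence)],
        "flanking": [(m.start(), m.group()) for m in FLANKING_NLS.finditer(sequence)],
    }


# Synthetic sequence constructed to contain one instance of each pattern.
synthetic = "MAKKPAAAAAAAKAKRGGKRKRSS"
hits = find_nls(synthetic)
```

For the synthetic sequence, `hits["flanking"]` holds the single match `KKPAAAAAAAKAKR` starting at position 2, and `hits["basic_run"]` holds `KRKR` starting at position 18.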

In addition, we used the MDS motifs that were extracted from the FANTOM1 sequence set (Kawaji et al. 2002) in hidden Markov model (HMM) searches against the FANTOM2 protein sequences and the SWISS-PROT/TrEMBL nonredundant database (SWISS-PROT Release 40.27 of 30-Aug-2002, TrEMBL Release 21.12 of 13-Sep-2002, and TrEMBL_new of 13-Sep-2002). As a result, we found several new members carrying the FANTOM1 MDS motifs (see http://motif.ics.es.osaka-u.ac.jp/fantom2/). Proacrosin binding protein (E130112G13; AK053586) was detected as a new member of motif MDS001052 (the ING1-homolog subfamily motif)-containing proteins. The multiple alignment of the subfamily members (see http://motif.ics.es.osaka-u.ac.jp/fantom2/) shows that the clone is identical to a newly found splicing variant (mINGh-L, ING1-like protein long form; TrEMBL AAK63168) of the mouse ING1-homolog proteins (Ha et al. 2002).

Molecules interacting with CasL (MICALs; Suzuki et al. 2002; Terman et al. 2002) derived from human (TrEMBL Q8TDZ2), fruitfly (TrEMBL AAM55242, AAM55243, AAM55244, and MICAL-like protein, TrEMBL AAM55245), and mouse (TrEMBL AAH34682) were detected as new members of the leucine zipper-like motif MDS00113-containing proteins. Terman and coworkers (2002) showed that the fruitfly MICAL interacts with the neuronal plexin A (PlexA) receptor through its C-terminal region, which includes the MDS00113 motif, confirming our previous prediction that this motif may act as a novel protein–protein interaction site.

Motif MDS00146, which previously comprised only hypothetical proteins, was expanded to the human Cdc42-activating protein zizimin1 (TrEMBL AAM90306; Meller et al. 2002). Zizimin1 contains a new domain named CDM (CED-5, DOCK180, MyoBlast city) zizimin homology domain 2 (CZH2) that mediates direct interaction with the Cdc42 Rho GTPase. Motif MDS00146 is included within the CZH2 domain and appears to be a submotif of CZH2 that is specific for two of the four CZH2 domain-containing protein subfamilies (zizimin, KIAA1395, DOCK180, and KIAA0299). Our HMM search with motif MDS00146 detected only members of the zizimin and KIAA1395 subfamilies.

Membrane Organization
In an attempt to annotate the membrane organization of entire proteomes from a range of species, we developed a computational strategy. Based on the prediction of two features, endoplasmic reticulum signal peptides (used for translocation into the secretory pathway) and membrane-spanning domains (transmembrane domains), the membrane organization of proteins can be classified. We have annotated 10 proteome databases from a range of species (see Table 3). Determination of the membrane organization of an individual protein is dependent on knowing the full-length protein open reading frame (ORF) and cannot be applied to partial protein ORFs. For each proteome dataset, we removed any readily identifiable partial protein sequences (see Table 3). As expected, significant numbers of partial ORFs were present in both the human and mouse Ensembl proteome databases (37% and 44%, respectively). This highlights the high level of partial protein sequences generated from predicted genes. These partial sequences would result in inaccurate predictions of their membrane organization if retained. Surprisingly, the M. musculus IPI proteome contained similar levels of partial sequences (45%), whereas the other proteomes analyzed contained less than 10%.


Table 3. Predicted Signal Peptides and Transmembrane Domains in Eukaryotic Proteomes

 
Prediction of Endoplasmic Reticulum Signal Peptides
We used two independent methods to predict endoplasmic reticulum signal peptides: the neural network (NN) and hidden Markov model (HMM) methods from SignalP V2 (Nielsen and Krogh 1998). These methods were selected because they have low levels of false-negative predictions (1.0% and 1.1%, respectively; Menne et al. 2000). A consensus approach was adopted. Where the two methods agreed, we annotated the protein as containing a signal peptide. When the two methods conflicted, we used a third independent signal peptide prediction method, SPScan (von Heijne 1987), to resolve the conflict. We considered this method suitable because of its lower false-positive rate compared to the other two methods (Menne et al. 2000). Typically less than 6% of the total number of annotated signal peptides required resolution using SPScan.
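The consensus rule described above can be written as a small decision function. This is a sketch under stated assumptions rather than the authors' code: each predictor's output is reduced to a boolean call, agreement on a negative call is taken to mean "no signal peptide", and on conflict the SPScan call is adopted directly. The function name is hypothetical and no real SignalP or SPScan interface is used.

```python
def consensus_signal_peptide(nn_call, hmm_call, spscan_call=None):
    """Combine boolean signal-peptide calls from three predictors.

    nn_call, hmm_call: calls from the two SignalP V2 methods.
    spscan_call: tie-breaker, needed only when the two disagree.
    Assumption: on conflict the SPScan call decides the annotation.
    """
    if nn_call == hmm_call:
        return nn_call
    if spscan_call is None:
        raise ValueError("NN and HMM disagree; an SPScan call is required")
    return spscan_call
```

Keeping the tie-breaker optional means SPScan needs to be run only on the conflicting minority of sequences.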

The results of this analysis using the various proteome datasets are presented in Table 3. The proportion of proteins predicted to contain signal peptides within the RIKEN RPS was 21.1%. Similar levels were annotated in the other higher eukaryotic proteomes (human and mouse). Lower proportions of signal peptides were annotated in the D. melanogaster (18.4%), C. elegans (19.3%), A. thaliana (14.6%), and S. cerevisiae (8.5%) proteome databases.

Prediction of the Membrane-Spanning Regions or Transmembrane Domains
Next we annotated the transmembrane domains for each protein. Although a consensus approach has been proposed (Nilsson et al. 2000), its application to entire genomes was not practical. To analyze all proteins, we selected two prediction methods that could be readily applied to large datasets. TMHMM 2.0, which was clearly the best performer in a recent comparative evaluation (Moller et al. 2001), was selected first. Secondly, SVMtm, a new transmembrane prediction method using a support vector machine (Z. Yuan and R. Teasdale, unpubl., http://microarray.imb.uq.edu.au/predictors/) was selected. When SVMtm was compared to TMHMM 2.0 it showed comparable accuracy levels (specificity 94.0% vs. 95.2% and sensitivity 91.8% vs. 90.8%, respectively).

Each protein within the different proteome databases was analyzed with both TMHMM 2.0 and SVMtm. Transmembrane domains were annotated when both methods positively predicted a membrane-spanning domain. Sequences containing conflicting predictions were further analyzed using three additional transmembrane prediction tools (SOSUI, HMMTOP, and MEMSAT). Only membrane-spanning regions that were positively predicted by more than two of these additional methods were annotated as transmembrane domains. Between 14% and 20% of the total number of transmembrane domains annotated were assigned by this method. In addition, an initial prediction was considered a false-positive prediction when not supported by any of the other transmembrane prediction methods. This approach is similar to the "majority vote" consensus method recently used by others (Nilsson et al. 2000; Ikeda et al. 2002).
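The two-stage voting scheme described above can be sketched as follows. The code is illustrative only and reduces every tool's output for a region to a boolean call; it does not invoke any of the named programs. The threshold is left as a parameter because the wording admits two readings: taken literally, "more than two" of three additional tools means all three (the default here), whereas a simple-majority reading would require two. The function and parameter names are hypothetical.

```python
def consensus_transmembrane(tmhmm_call, svmtm_call, extra_calls=None,
                            min_extra_votes=3):
    """Decide whether a region is annotated as a transmembrane domain.

    tmhmm_call, svmtm_call: boolean calls from the two primary methods.
    extra_calls: dict of boolean calls from the additional tools
        (e.g. SOSUI, HMMTOP, MEMSAT); used only on disagreement.
    min_extra_votes: additional tools that must support the region.
        The default of 3 is an assumption (literal reading of the text).
    """
    if tmhmm_call == svmtm_call:
        # Both positive -> annotated; both negative -> not annotated.
        return tmhmm_call
    if extra_calls is None:
        raise ValueError("primary methods disagree; extra calls are required")
    votes = sum(1 for call in extra_calls.values() if call)
    return votes >= min_extra_votes


all_three = {"SOSUI": True, "HMMTOP": True, "MEMSAT": True}
two_of_three = {"SOSUI": True, "HMMTOP": True, "MEMSAT": False}
unsupported = {"SOSUI": False, "HMMTOP": False, "MEMSAT": False}
```

With the default threshold, `consensus_transmembrane(True, False, two_of_three)` rejects the region, while passing `min_extra_votes=2` accepts it; an initial call with no support from the additional tools is rejected under either reading.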

Transmembrane domain prediction methods are known to incorrectly predict signal peptides as transmembrane domains. Therefore we adopted a filter for predicted N-terminal transmembrane segments: If the predicted transmembrane domain's starting point was within the first 15 residues of the ORF and a signal peptide was predicted, then this region was regarded as a signal peptide instead of a transmembrane domain. This filtering procedure was applied to the results of all transmembrane prediction tools. The results from this analysis using the various proteome datasets are presented in Table 3.
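The N-terminal filter described above can be expressed in a few lines. This sketch assumes each predicted segment is given as a (start, end) pair of 1-based residue positions and that the signal-peptide call is a boolean; the function name and the `window` parameter name are hypothetical, while the value 15 comes from the text.

```python
def filter_nterminal_segments(tm_segments, has_signal_peptide, window=15):
    """Drop predicted transmembrane segments that are really signal peptides.

    tm_segments: list of (start, end) tuples, 1-based residue positions.
    A segment is removed only when a signal peptide was predicted AND
    the segment starts within the first `window` residues of the ORF.
    """
    if not has_signal_peptide:
        return list(tm_segments)
    return [(start, end) for start, end in tm_segments if start > window]


segments = [(5, 27), (60, 82)]
with_sp = filter_nterminal_segments(segments, has_signal_peptide=True)
without_sp = filter_nterminal_segments(segments, has_signal_peptide=False)
```

Here `with_sp` keeps only `(60, 82)`, whereas `without_sp` keeps both segments because no signal peptide was predicted.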

In contrast to the signal peptide analysis, the proportion of proteins with predicted transmembrane domains varied little between proteomes (21.6%–26.6%), with the exception of the C. elegans proteome, where the proportion was higher (31.5%). These results are consistent with similar attempts to annotate membrane-spanning domains in eukaryotes (Wallin and von Heijne 1998; Krogh et al. 2001; Liu and Rost 2001; Ward 2001). For example, using an earlier version of TMHMM, Krogh and others predicted transmembrane domains for S. cerevisiae, D. melanogaster, and C. elegans at 20.7%, 20.1%, and 30.3%, respectively.

Classification of Proteins Into Distinct Classes Based on Their Predicted Membrane Organization
Here we propose an alternative broad classification scheme for protein classes based on their predicted membrane organization. This approach utilizes the combined annotation within individual full-length protein ORFs of both signal peptides and transmembrane domains (see Table 4). Transmembrane-negative soluble proteins are classified as intracellular or extracellular based on the signal peptide predictions. Transmembrane-positive proteins are classified into three groups. The topology of single membrane-spanning proteins, Type I (Nout/Cin), or Type II (Nin/Cout), is assigned based on the presence or absence of a signal peptide. Proteins with more than one membrane-spanning domain are classified as multi-span membrane proteins. Based on the above annotation of signal peptides and membrane-spanning regions, we obtained six groups of proteins (see Table 4).
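The rules stated above map two predicted features onto a class label. The sketch below encodes only the five classes that the prose defines explicitly; Table 4 lists six groups, and how the sixth is delimited cannot be recovered from the text alone, so it is not reproduced here. The function name and the label strings are hypothetical.

```python
def membrane_class(has_signal_peptide, n_tm_domains):
    """Classify a full-length protein by its predicted membrane organization.

    has_signal_peptide: boolean consensus signal-peptide call.
    n_tm_domains: number of transmembrane domains remaining after the
        N-terminal signal-peptide filter has been applied.
    """
    if n_tm_domains < 0:
        raise ValueError("n_tm_domains must be non-negative")
    if n_tm_domains == 0:
        # Soluble proteins: the signal peptide decides the destination.
        return "soluble secreted" if has_signal_peptide else "soluble intracellular"
    if n_tm_domains == 1:
        # Single-span proteins: Type I (Nout/Cin) carries a signal
        # peptide, Type II (Nin/Cout) does not.
        return "Type I membrane" if has_signal_peptide else "Type II membrane"
    return "multi-span membrane"
```

For example, `membrane_class(True, 1)` returns `"Type I membrane"` and `membrane_class(False, 0)` returns `"soluble intracellular"`.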


Table 4. Membrane Organization of Protein Classes Assigned Based on the Prediction of Signal Peptides and Transmembrane Domains

 
In contrast to simply comparing the total proportion of transmembrane domains, which does not vary significantly across eukaryotic genomes, our classification scheme highlighted variation in the membrane organization of the proteome datasets analyzed. Overall, comparison of the results from the predicted membrane organization across the 10 proteome databases revealed that higher eukaryotes have a greater proportion of soluble secreted proteins and Type I membrane proteins, whereas the proportions of Type II membrane and multi-span membrane proteins remained similar. For example, RIKEN RPS compared to S. cerevisiae had 2.8- and 2.7-fold increases in soluble secreted proteins and Type I membrane proteins, respectively, whereas the other classes of membrane proteins remained essentially unchanged. Comparison of the RPS and VPS proteomes revealed no difference in the degree of alternative splicing among the different membrane organization classes. The other result of note from this comparison, as previously observed (Krogh et al. 2001), is the higher proportion of multi-span membrane proteins in C. elegans.


    METHODS
 
Databases
We analyzed the following proteome databases available from EBI on June 15th 2002 (http://www.ebi.ac.uk/proteome/; Apweiler et al. 2001): Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Homo sapiens (International Protein Index, IPI), Mus musculus (IPI), and Saccharomyces cerevisiae. In addition, we analyzed the RIKEN Mouse Representative Transcript and Protein Sets (RIKEN RPS; http://genome.gsc.riken.go.jp/), RIKEN Mouse Variable Protein Set (RIKEN VPS; http://genome.gsc.riken.go.jp/), and the predicted protein ORFs from the Ensembl human genome database (Human Build 29) and mouse genome database (MGSC Mouse Assembly 3; http://www.ensembl.org/; Hubbard et al. 2002). The sequences were filtered so that partial ORFs that did not contain a methionine at position 1 or were clearly annotated as partial or fragments were removed.
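The partial-ORF filter described above can be sketched as a simple predicate. Two points are assumptions: that "clearly annotated as partial or fragments" can be approximated by a case-insensitive keyword search of the description line, and that sequences are given in one-letter amino-acid code. The function names and example records are hypothetical.

```python
def is_full_length(sequence, description=""):
    """Return True if a protein record passes the partial-ORF filter.

    A record is rejected when its description mentions "partial" or
    "fragment" (keyword matching is an assumption) or when the
    sequence does not start with methionine.
    """
    text = description.lower()
    if "partial" in text or "fragment" in text:
        return False
    return sequence.upper().startswith("M")


def keep_full_length(records):
    """Filter (identifier, description, sequence) tuples."""
    return [rec for rec in records if is_full_length(rec[2], rec[1])]


# Made-up records for illustration only.
records = [
    ("p1", "hypothetical protein", "MKTLLVA"),
    ("p2", "hypothetical protein (Fragment)", "MSSGQRA"),
    ("p3", "hypothetical protein", "KTLLVAG"),
]
kept = keep_full_length(records)
```

Only `p1` survives: `p2` is annotated as a fragment and `p3` lacks an initiating methionine.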

The set of predicted proteins produced by Celera (Release R13b, April 2002, http://www.celeradiscoverysystem.com/) was used as one of the whole-genome sets of computational predictions. The Celera sequencing, assembly, and transcript prediction methods are described elsewhere (Mural et al. 2002). The complete Celera protein set contained 47,256 transcripts and protein sequences, corresponding to 46,941 gene predictions. Of this set, 14,994 weak-confidence predictions were excluded for compatibility with the analyses of the Ensembl set and of the Chr.16 (Mural et al. 2002). The remaining 32,262 high- to medium-confidence protein predictions were used in this study; 15,548 of these hypothetical proteins had BLAST hits to the publicly available protein sequences, 7085 were in common with the RefSeq (Pruitt and Maglott 2001) mouse protein set, and 19,089 had InterPro assignments.

InterPro version 5.1 (May 2002) and InterProScan version 3.1 (Zdobnov and Apweiler 2001) were used for the functional sites and domains composition analysis.

Prediction Methods
SignalP V2 (NN and HMM; Nielsen and Krogh 1998), SPScan (von Heijne 1987), TMHMM 2.0 (Krogh et al. 2001), SVMtm (Z. Yuan and R. Teasdale, in prep.; http://microarray.imb.uq.edu.au/predictors/), MEMSAT 1.5 (Jones et al. 1994), HMMTOP (Tusnady and Simon 2001), and SOSUI (Hirokawa et al. 1998) were applied using their default values except for selection of organism group. SPScan analysis was performed using the Genetics Computer Group (GCG) Wisconsin Package (version 8.1) located at the Australian National Genomic Information Service (ANGIS). SOSUI analysis was performed using its Web interface (http://sosui.proteome.bio.tuat.ac.jp/~sosui/proteome/welcomeE.html).


    Acknowledgements
 
We thank N. Mulder and W. Fleischmann (European Bioinformatics Institute) for consultations and support, A. Krogh and S. Brunak (Technical University of Denmark) for providing the TMHMM and SignalP software packages, G.E. Tusnady (Hungarian Academy of Sciences) for providing the HMMTOP software package, D.T. Jones (University College) for providing the MEMSAT software package, and K. Miranda (IMB, University of Queensland) for technical assistance. This work was supported in part by the Australian Research Council and the National Health and Medical Research Council of Australia; also in part by ACT-JST of JST (Japan Science and Technology Corp.), and a Grant-in-Aid for Scientific Research on Priority Areas "Genome Information Science" from the Ministry of Education, Culture, Sports, Science and Technology of Japan to H.M.


    Footnotes
 
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.978703.

11 Corresponding author.
E-MAIL alex{at}ebi.ac.uk; FAX 44 0 1223 494610.

10 Takahiro Arakawa, Piero Carninci, Jun Kawai, and Yoshihide Hayashizaki.


    REFERENCES

Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M.D., et al. 2001. The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29: 37-40.

Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The protein data bank. Nucleic Acids Res. 28: 235-242.

FANTOM Consortium 2002. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420: 563-573.

Gough, J. 2002. The superfamily database in structural genomics. Acta Crystallogr. D Biol. Crystallogr. 58: 1897-1900.

Gough, J. and Chothia, C. 2002. Superfamily: HMMs representing all proteins of known structure. Nucleic Acids Res. 30: 268-272.

Ha, S., Lee, S., Chung, M., and Choi, Y. 2002. Mouse ING1 homologue, a protein interacting with A1, enhances cell death and is inhibited by A1 in mammary epithelial cells. Cancer Res. 62: 1275-1278.

Hirokawa, T., Boon-Chieng, S., and Mitaku, S. 1998. SOSUI: Classification and secondary structure prediction system for membrane proteins. Bioinformatics 14: 378-379.

Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T., et al. 2002. The Ensembl genome database project. Nucleic Acids Res. 30: 38-41.

Ikeda, M., Arai, M., Lao, D.M., and Shimizu, T. 2002. Transmembrane topology prediction methods: A reassessment and improvement by a consensus method using a dataset of experimentally-characterized transmembrane topologies. In Silico Biol. 2: 19-33.

Jones, D.T., Taylor, W.R., and Thornton, J.M. 1994. A model recognition approach to the prediction of all-helical membrane protein structure and topology. Biochemistry 33: 3038-3049.

Kawaji, H., Schönbach, C., Matsuo, Y., Kawai, J., Okazaki, Y., Hayashizaki, Y., and Matsuda, H. 2002. Exploration of novel motifs derived from mouse cDNA sequences. Genome Res. 12: 367-378.

Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E.L.L. 2001. Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. 305: 567-580.

Liu, J. and Rost, B. 2001. Comparing function and structure between entire proteomes. Protein Sci. 10: 1970-1979.

Meller, N., Irani-Tehrani, M., Kiosses, W.B., Del Pozo, M.A., and Schwartz, M.A. 2002. Zizimin1, a novel Cdc42 activator, reveals a new GEF domain for Rho proteins. Nat. Cell Biol. 4: 639-647.

Menne, K.M.L., Hermjakob, H., and Apweiler, R. 2000. A comparison of signal sequence prediction methods using a test set of signal peptides. Bioinformatics 16: 741-742.

Moller, S., Croning, M.D.R., and Apweiler, R. 2001. Evaluation of methods for the prediction of membrane spanning regions. Bioinformatics 17: 646-653.

Montini, E., Buchner, G., Spalluto, C., Andolfi, G., Caruso, A., den Dunnen, J.T., Trump, D., Rocchi, M., Ballabio, A., and Franco, B. 1999. Identification of SCML2, a second human gene homologous to the Drosophila sex comb on midleg (Scm): A new gene cluster on Xp22. Genomics 58: 65-72.

Mural, R.J., Adams, M.D., Myers, E.W., Smith, H.O., Miklos, G.L.G., Wides, R., Halpern, A., Li, P.W., Sutton, G.G., Nadeau, J., et al. 2002. A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science 296: 1661-1671.

Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536-540.

Nielsen, H. and Krogh, A. 1998. Prediction of signal peptides and signal anchors by a hidden Markov model. Proc. Int. Conf. Intell. Syst. Mol. Biol. 6: 122-130.

Nilsson, J., Persson, B., and von Heijne, G. 2000. Consensus predictions of membrane protein topology. FEBS Lett. 486: 267-269.

Poirier, G.M., Anderson, G., Huvar, A., Wagaman, P.C., Shuttleworth, J., Jenkinson, E., Jackson, M.R., Peterson, P.A., and Erlander, M.G. 1999. Immune-associated nucleotide-1 (IAN-1) is a thymic selection marker and defines a novel gene family conserved in plants. J. Immunol. 163: 4960-4969.

Pruitt, K.D. and Maglott, D.R. 2001. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29: 137-140.

Suzuki, T., Nakamoto, T., Ogawa, S., Seo, S., Matsumura, T., Tachibana, K., Morimoto, C., and Hirai, H. 2002. MICAL, a novel CasL interacting molecule, associates with vimentin. J. Biol. Chem. 277: 14933-14941.

Terman, J.R., Mao, T., Pasterkamp, R.J., Yu, H.H., and Kolodkin, A.L. 2002. MICALs, a family of conserved flavoprotein oxidoreductases, function in plexin-mediated axonal repulsion. Cell 109: 887-900.

Tusnady, G.E. and Simon, I. 2001. The HMMTOP transmembrane topology prediction server. Bioinformatics 17: 849-850.

von Heijne, G. 1987. Sequence analysis in molecular biology: Treasure trove or trivial pursuit, pp. 1-188. Academic Press, San Diego, CA.

Wallin, E. and von Heijne, G. 1998. Genome-wide analysis of integral membrane proteins. Protein Sci. 7: 1029-1038.

Ward, J.M. 2001. Identification of novel families of membrane proteins from the model plant Arabidopsis thaliana. Bioinformatics 17: 560-563.

Zdobnov, E.M. and Apweiler, R. 2001. InterProScan: An integration platform for the signature-recognition methods in InterPro. Bioinformatics 17: 847-848.


    WEB SITE REFERENCES

http://www.celeradiscoverysystem.com; Celera Corporation.

http://www.ensembl.org/; ENSEMBL genome databases.

http://www.ebi.ac.uk/IPI; International Protein Index.

http://www.ebi.ac.uk/proteome; Proteome Analysis Database.

http://genome.gsc.riken.go.jp/; RIKEN Mouse Representative Transcript and Protein Sets.

http://sosui.proteome.bio.tuat.ac.jp/~sosui/proteome/welcomeE.html; SOSUI Web Interface.

http://supfam.org/SUPERFAMILY/cgi-bin/gen_list.cgi?genome=mr; SUPERFAMILY Database.

http://supfam.org/FANTOM2/domcombs.html; SUPERFAMILY FANTOM2 data.

http://microarray.imb.uq.edu.au/predictors; SVMtm Support Vector Machines to predict transmembrane domains.

http://motif.ics.es.osaka-u.ac.jp/fantom2/; MDS Motif Database for FANTOM2.

http://microarray.imb.uq.edu.au/predictors/proteome/; SRC microarray facility—Proteome supplementary material.

Received November 11, 2002; accepted in revised form March 5, 2003.


