Genome Res. 14:1957-1966, 2004
©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Methods
Predicting Subcellular Localization via Protein Motif Co-Occurrence
1 McGill Center for Bioinformatics, McGill University, Montreal, Quebec H3A 2B4, Canada
2 Biochemistry Department, Faculty of Medicine, McGill University, Montreal, Quebec H3G 1Y6, Canada
The prediction of subcellular localization of proteins from their primary sequence is a challenging problem in bioinformatics. We have created a Bayesian network localization predictor called PSLT that is based on the combinatorial presence of InterPro motifs and specific membrane domains in human proteins. This probabilistic framework generates a likelihood of localization to all organelles and allows the prediction of multicompartmental proteins. When used to predict localization to nine compartments, PSLT achieves an accuracy of 78%, as estimated by a 10-fold cross-validation test, and a coverage of 74%. When used to predict the localization of proteins from other closely related species, it achieves a prediction accuracy and a coverage >80%. We compared the localization predictions of PSLT to those determined through GFP-tagging and microscopy for a group of human proteins. We found two general classes of proteins that are mislocalized by the GFP-tagging strategy but are correctly localized by PSLT. This suggests that PSLT can be used in combination with experimental approaches for localization to identify proteins for which additional experimental validation is required. We used our predictor to annotate all 9793 human proteins from SWISS-PROT release 41.25, 16% of which are predicted by PSLT to be present in more than one compartment.
Eukaryotic proteins are organized into organelles and suborganelles that generate appropriate environments for their specialized functions. Thus, subcellular localization often offers important clues toward determining the function of an uncharacterized protein. The mechanisms of targeting of proteins to various subcellular localizations have been widely studied, and the predominant mechanisms uncovered so far involve specific amino acid sequence motifs. The consequences of mislocalization and mistargeting are manifested in a number of human genetic diseases, including cystic fibrosis (Skach 2000
There are numerous experimental approaches that attempt to determine both the subcellular localization of a protein and the amino acid motifs responsible for this targeting. Although these methods are capable of determining the linear amino acid motifs that are necessary for targeting, they are generally not able to help determine structural requirements and are generally not suited for use in a high-throughput fashion. The latter point is important because high-throughput proteomic efforts are now able to identify the most abundant proteins of an organelle (Bell et al. 2001 The existing bioinformatics localization predictors in the literature can be broadly grouped into three categories.
Existing predictors have several shortcomings. Most localization prediction methods achieve high accuracy for the most populated compartments, such as the nucleus and cytosol, but are generally less accurate on the numerous compartments containing fewer individual proteins. Many existing predictors use only three or four different subcellular localizations. Moreover, the sets of proteins used to train these methods often do not contain transmembrane proteins because the localization of these proteins is believed to be already elucidated (Huang and Li 2004
This article presents PSLT (Protein Subcellular Localization Tool, pronounced "silt"), a system that addresses the aforementioned issues and problems. PSLT uses the combinatorial presence of InterPro motifs, as well as signal peptides and the number of transmembrane domains in human proteins, to predict the subcellular localization of proteins within a Bayesian framework. InterPro is a database of protein domains, families, functional sites, and posttranslational modifications (Mulder et al. 2003
PSLT uses a Bayesian framework to integrate the presence or absence of combinations of motifs in a statistically coherent manner. The accuracy of PSLT is estimated to be 78% using a 10-fold cross-validation test and >85% using an independent data set test. When used to predict the localization of the independent set of human proteins from the LIFEdb project (Simpson et al. 2000
Statistical Tests of Accuracy
The prediction accuracy of PSLT is assessed by three distinct approaches: a self-consistency test, a 10-fold cross-validation test, and independent data set tests. The different data sets used to train and test PSLT are shown in Table 1. With respect to the self-consistency test (also known as the resubstitution test), the accuracy of the predictor is evaluated by using the same data set used for training. As shown in Table 2, the overall prediction accuracy of PSLT using the self-consistency test on the Hera (Human Endoplasmic Reticulum Aperçu) data set (see Methods) is 90% and the coverage is 88%.
In the 10-fold cross-validation test, the data set is randomly partitioned into 10 distinct nonoverlapping sets of proteins. Nine of these sets are used to train the predictor. The prediction accuracy of the predictor is evaluated on the remaining excluded group. This procedure is repeated 10 times. The cross-validation prediction accuracy shown in Table 2 is the average of the 10 experiments. The overall prediction accuracy using the 10-fold cross-validation test on the Hera human data set is 78% and the coverage is 74%.
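The 10-fold procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: `train_fn` and `predict_fn` are hypothetical stand-ins for training and querying a predictor such as PSLT, and the handling of unpredictable proteins (returned as `None`) is an assumption based on the way the paper reports coverage separately from accuracy.

```python
import random

def make_folds(items, k=10, seed=0):
    """Randomly partition items into k distinct, nonoverlapping sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

def cross_validate(proteins, train_fn, predict_fn, k=10, seed=0):
    """Mean accuracy over k train/test rounds.

    proteins   : list of (motifs, compartment) pairs
    train_fn   : hypothetical callable, training pairs -> model
    predict_fn : hypothetical callable, (model, motifs) -> compartment or None
    """
    folds = make_folds(proteins, k, seed)
    accuracies = []
    for i, held_out in enumerate(folds):
        # Train on the k-1 folds that do not include the held-out fold.
        training = [p for j, fold in enumerate(folds) if j != i for p in fold]
        model = train_fn(training)
        # Assumption: proteins the model cannot predict (None) are left out
        # of the accuracy and would be counted under coverage instead.
        scored = [(predict_fn(model, motifs), truth) for motifs, truth in held_out]
        scored = [(guess, truth) for guess, truth in scored if guess is not None]
        if scored:
            accuracies.append(sum(g == t for g, t in scored) / len(scored))
    if not accuracies:
        raise ValueError("no predictable proteins in any fold")
    return sum(accuracies) / len(accuracies)
```

Fixing the random seed makes the partition reproducible, which is convenient when comparing variants of a predictor on the same folds.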
The third approach used to assess the prediction accuracy of PSLT is an independent data set test. In this test, PSLT is trained by using the entire Hera human data set and tested independently by using the LIFEdb GFP human data set. As shown in Table 3, the overall prediction accuracy of PSLT using this independent test is 55%. The coverage of the GFP data set is 50%. We note that many of the proteins in the LIFEdb GFP data set are hypothetical proteins derived from cDNA sequence data and have not been previously studied. Their subcellular localization was experimentally determined by tagging GFP to their N and C termini (in two separate experiments) and by visualizing the resulting protein localization by microscopy (Simpson et al. 2000
Because of the possibility of false-positive and false-negative localization annotations of the GFP high-throughput localization data set, we also measured the prediction accuracy of PSLT on the subset of proteins of the LIFEdb data set that have been independently studied by other research groups in a non-high-throughput manner and that have localization that is available in the literature. This subset consists of 82 proteins. As shown in Table 3 in the middle pair of columns, the prediction accuracy of PSLT on this subset of the LIFEdb data set is 87% with a coverage of 67%. This improved performance is much closer to the values obtained by the 10-fold cross-validation test on human proteins from the Hera data set and by the independent test on the mouse data set (see "Generalization to Other Organisms"). In contrast, the concordance of the high-throughput experimental GFP-tagging method with the literature was evaluated to be 59% (as shown in the rightmost pair of columns in Table 3). We decided to further investigate the discrepancy between the prediction accuracy results of PSLT on the full LIFEdb data set and on the subset verified by the literature. Of the 197 proteins from the full LIFEdb GFP data set for which PSLT could predict localization, the PSLT prediction disagrees with the GFP localization results for 89 proteins. Although many proteins in the LIFEdb are annotated "unknown" or "hypothetical," we found reports of experimental evidence for the localization of 25 of the 89 proteins in the literature. Table 4 shows the comparisons between the LIFEdb localization annotation and the PSLT localization prediction, as well as the information currently available in the literature for these 25 proteins. 
The available scientific literature confirms the LIFEdb localization annotation of five of the 25 proteins and the PSLT prediction of 20 of the 25 proteins (including two proteins that have been confirmed to be in both the LIFEdb annotated compartment and the compartment predicted by PSLT). Although experimental evidence in the literature could be erroneous or incomplete, these results suggest that the prediction accuracy of PSLT may exceed the prediction accuracy of the LIFEdb GFP data set.
We further studied the cases of proteins with LIFEdb localization annotation that disagrees with the PSLT prediction. We note two general recurring cases: (1) proteins that have been shown in the literature to be plasma membrane or secreted proteins but that are annotated as being localized elsewhere in LIFEdb, and (2) proteins predicted by PSLT as peroxisomal but annotated in LIFEdb as localized to another compartment (usually the mitochondria). With respect to case 1, many proteins predicted to be localized in the plasma membrane or secreted by PSLT are annotated by LIFEdb as being localized elsewhere in the cell, mostly in the ER but also in the cytosol and the nucleus. It is possible that these proteins spend a longer than expected amount of time in the ER (potentially due to an increase in the duration of folding caused by the added GFP tag) or never even succeed in entering the ER. This may explain why such proteins are visualized via microscopy to be localized to the ER or the cytosol when ultimately they are destined for the plasma membrane. This hypothesis may also explain the low proportion of plasma membrane proteins (6% to 7%) in the LIFEdb GFP data set compared with other public localization databases. With respect to case 2, all proteins predicted by PSLT to be peroxisomal are annotated as being localized elsewhere in the cell by the LIFEdb GFP data set. However, we note that several of these proteins have been confirmed to be peroxisomal in the literature. Perhaps the GFP tag systematically targets the peroxisomal proteins to the mitochondria or elsewhere in the cell. It should be noted, however, that there exist several proteins that are actually annotated to be localized in both the peroxisome and the mitochondria in SWISS-PROT. It could be the case that some of the proteins predicted by PSLT to be peroxisomal and by the GFP localization to be mitochondrial are actually multicompartmental proteins. 
In general, disagreements between the prediction of PSLT and one specific experimental approach might warrant further investigation using different experimental techniques to verify the localization (e.g., by immunofluorescence microscopy of normal cells using an antibody specific to the protein of interest). Conversely, when the prediction of PSLT agrees with experimental evidence, our belief that the protein is indeed localized to this compartment should be strengthened.
Comparison Between PSLT and the SMART Domain Projection Method
Generalization to Other Organisms and Multicompartmental Prediction
Because PSLT is based on a probabilistic framework, it can output the probability that a protein is localized to each compartment (not only the most likely compartment). Although for most proteins there is a single compartment that has a high likelihood, there do exist some proteins for which there is an (almost) equally high likelihood for several compartments. If PSLT is allowed to predict the two most likely compartments for each protein (referred to as the second-best test in Tables 2, 3, 5) in the yeast data set, the accuracy of our framework for predicting the correct localization increases to 71% as determined by an independent data set test with a coverage of 54%. As shown in Table 2, when PSLT is allowed to predict the two most likely compartments for each protein in the Hera human data set, the accuracy is 86% by 10-fold cross validation test and 97% by the self-consistency test. When the second-best test is used on the mouse data set, with PSLT trained on human proteins, the prediction accuracy is 92% as shown in Table 5. The Hera data set contains relatively few multicompartmental proteins (shown in Table 1). It is probable that some of the proteins predicted by PSLT to be multicompartmental are in fact multicompartmental proteins even though they are annotated as residing in only one compartment in public databases. Because the mouse data set contains a higher proportion of multicompartmental proteins than does the Hera human data set, we use it to further explore the multicompartment prediction potential of PSLT. Because PSLT outputs the likelihood of localization to all compartments studied, it is possible to define proteins to be predicted as multicompartmental, if the likelihood difference between their two most likely compartments is less than a certain percentage of their most likely compartment. 
If we study the proteins for which the likelihoods of the two highest scoring compartments are within 50% of each other, we identify 20% of the multicompartmental mouse proteins from Table 1 and only 10% of the mouse proteins annotated as being localized to only one compartment in Table 1. Therefore, the true-positive rate is two times greater than the false-positive rate (and could be much greater if some proteins annotated as unicompartmental in SWISS-PROT actually are multicompartmental as predicted by PSLT). This holds true for all thresholds between 25% and 75%. This probably indicates that certain specific combinations of motifs are more frequently used by multicompartmental proteins than unicompartmental proteins.
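The multicompartmental rule described above can be written down directly. The following Python sketch assumes that the per-compartment likelihoods are available as a dictionary; the function name, the example values, and the default threshold of 50% are illustrative choices, not part of PSLT itself.

```python
def predict_compartments(likelihoods, threshold=0.5):
    """Return one compartment, or two if the protein looks multicompartmental.

    likelihoods : dict mapping compartment name -> likelihood of localization
    threshold   : a protein is called multicompartmental when the gap between
                  its two best compartments is below this fraction of the best
    """
    ranked = sorted(likelihoods.items(), key=lambda kv: kv[1], reverse=True)
    best, p_best = ranked[0]
    if len(ranked) < 2:
        return [best]
    second, p_second = ranked[1]
    if p_best > 0 and (p_best - p_second) < threshold * p_best:
        return [best, second]
    return [best]

# Hypothetical likelihoods, for illustration only.
close_call = {"peroxisome": 0.45, "mitochondria": 0.40, "nucleus": 0.15}
clear_call = {"nucleus": 0.9, "cytosol": 0.1}
print(predict_compartments(close_call))  # two compartments
print(predict_compartments(clear_call))  # one compartment
```

Lowering `threshold` makes the multicompartmental call stricter; the text reports that the true-positive rate stays above the false-positive rate for thresholds between 25% and 75%.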
Human Proteome Annotation
Distribution of Motifs in Compartments
The prediction accuracy of PSLT is influenced by how well the different compartments and cellular processes are characterized by InterPro motifs and to what extent the different compartments share motifs. As shown in Table 7, some compartments such as the plasma membrane, nucleus, and extracellular protein group are better covered by InterPro motifs than are other compartments. In fact, >90% of proteins in these organelles contain at least one such motif. Proteins localized to the Golgi apparatus contain the fewest motifs. The average number of motifs per covered protein varies considerably between compartments. The plasma membrane contains by far the most motifs per protein, whereas the lysosome contains the fewest. The high number of motifs per protein in plasma membrane proteins could reflect the fact that this group is involved in many signaling events and proteins localized here are known to interact with many different proteins. It is also possible that some compartments are not as well characterized by InterPro motifs as others.
To determine whether PSLT predicts localization based mostly on the co-occurrence of motifs or on the presence of single motifs in proteins, we counted the number of proteins with localization that was predicted by PSLT using more than one motif. As shown in Table 7, in most compartments, >50% of proteins are predicted by the co-occurrence of motifs (as opposed to single motifs). Notable exceptions to this are the Golgi apparatus and the lysosome. In contrast, the localization of 80% of plasma membrane proteins is predicted by the co-occurrence of motifs, which is not surprising given the high average number of distinct motifs per protein in that compartment. It should be noted that PSLT also uses additional motif information (the presence/absence of signal peptides/anchors as well as the number of transmembrane domains) to predict localization, and as such, all proteins are actually predicted based on more than one motif. As proteins in compartments such as the lysosome and the Golgi apparatus become better characterized as a group and as the data sets of these organelles increase in size, the prediction accuracy of PSLT for these organelles should increase. We also evaluated the extent of motif frequency in compartments. This is defined as the ratio of the total number of occurrences of motifs contained within all proteins in a given compartment to the total number of distinct motifs contained within all proteins in this compartment. Such motif frequency values may give an indication as to the degree of process diversity in the different compartments. As shown in Table 7, the motif frequency of the nucleus and plasma membrane is much higher than is the motif frequency for all other compartments. These two compartments both have large protein families with many members performing similar functions (e.g., the very large receptor families in the plasma membrane or the transcription factor families in the nucleus).
Because PSLT predicts localization based on the combinatorial presence of motifs, the performance of the predictor is influenced by the motif frequency of the different compartments. In fact, the motif frequency correlates well with the sensitivity obtained by PSLT for the different compartments. The extent of motif sharing between the different compartments can also affect the localization prediction accuracy of PSLT. The motif compartment specificity index shown in Table 7 is the ratio of the number of motifs that are unique to a given compartment divided by the total number of motifs in the compartment. The motif compartment specificity index can be an indication of the extent of process sharing as well as protein trafficking and structural motifs shared between the different compartments. The nucleus contains the lowest proportion of compartment-specific motifs. The peroxisome has the highest proportion of compartment-specific motifs; this may indicate that the peroxisome shares few processes with other compartments or that the proteins involved in processes it shares with other compartments are characterized by compartment-specific motifs. The motifs shared by proteins in the largest number of compartments are shown in Table 8. The proline-rich region is present in proteins in all compartments considered. The average number of compartments in which a given motif is present is 1.3.
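The two ratios defined above, motif frequency and the motif compartment specificity index, are simple to compute from a list of annotated proteins. The Python sketch below is illustrative only: the input layout and the motif names are hypothetical, and each protein is assumed to carry a set of distinct motifs.

```python
from collections import Counter

def motif_statistics(proteins):
    """Per-compartment motif frequency and compartment specificity index.

    proteins : list of (compartment, set_of_motifs) pairs
    returns  : dict compartment -> (motif_frequency, specificity_index)
    """
    occurrences = Counter()  # compartment -> total motif occurrences
    motifs_in = {}           # compartment -> set of distinct motifs
    for compartment, motifs in proteins:
        occurrences[compartment] += len(motifs)
        motifs_in.setdefault(compartment, set()).update(motifs)

    # Number of compartments in which each motif is present.
    spread = Counter(m for motifs in motifs_in.values() for m in motifs)

    stats = {}
    for compartment, motifs in motifs_in.items():
        if not motifs:
            continue  # no motifs observed, ratios are undefined
        unique = sum(1 for m in motifs if spread[m] == 1)
        frequency = occurrences[compartment] / len(motifs)
        specificity = unique / len(motifs)
        stats[compartment] = (frequency, specificity)
    return stats

# Hypothetical toy data: "shared" occurs in both compartments.
toy = [
    ("nucleus", {"motif_a", "shared"}),
    ("nucleus", {"motif_a"}),
    ("nucleus", {"motif_a"}),
    ("peroxisome", {"motif_c"}),
    ("peroxisome", {"motif_d", "shared"}),
]
print(motif_statistics(toy))
```

In this toy data the nucleus has a motif frequency of 2.0 (four occurrences over two distinct motifs) and a specificity index of 0.5, because one of its two motifs also appears in the peroxisome.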
Taken together, the ratios in Table 7 are an indication of how InterPro motifs characterize the cell, the various compartments, and processes. The InterPro classification scheme provides a novel way of describing these entities and of evaluating the extent of our knowledge of the different organelles.
We present PSLT, a framework to predict the subcellular localization of proteins based on InterPro motifs and protein membrane domains. PSLT addresses the problems of low prediction accuracy for underrepresented compartments and of the specific organelle prediction of transmembrane proteins and, most importantly, it allows an increased understanding (and prediction) of multicompartmental proteins. PSLT was initially built using only InterPro motifs. The addition of protein membrane domain information, including the presence of signal peptides and the number of transmembrane domains, improved PSLT in every test we performed and can increase the prediction accuracy by up to 10%. This information is especially important to distinguish between plasma membrane and secreted proteins (data not shown). Because PSLT is built by using a Bayesian framework, it could easily be further improved by incorporating other types of relevant information. When tested on human proteins, PSLT (based on InterPro motifs and protein membrane domains) achieves an overall prediction accuracy of 78% (with a coverage of 74%) and sensitivity and positive predictive values between 43% and 93% for all compartments considered, including compartments that contain few proteins. When PSLT is allowed to predict the two most likely compartments, its ability to predict at least one compartment correctly increases to >85% for human proteins. The ability to predict multicompartmental proteins allows us to estimate that at least 16% of human proteins are in fact multicompartmental. When used to predict proteins from closely related species such as the mouse, it achieves high prediction accuracies approaching those of the self-consistency test. This type of predictor achieves a reasonable prediction accuracy because protein families and functional units are often colocalized in the cell.
Even when proteins in different compartments can be characterized by the same InterPro motif and thus share some similar features, it is often the case that they also contain other InterPro motifs capable of differentiating them. Perhaps these additional motifs also modulate their function.
Because PSLT uses InterPro motifs to predict localization, it considers not only known organellar targeting motifs, as several other predictors have done in the past, but also the possible influence on localization of posttranslational modifications and protein-protein interaction domains and their combinations in proteins. Some posttranslational modifications are well known to influence protein localization. The most obvious example of this is probably the addition of lipid anchors to proteins. Phosphorylation has also been shown to cause a change in localization in many proteins, in particular in regulating the nuclear-cytoplasmic shuttling of many proteins (Hood and Silver 1999
Data Sets
The Hera database (www.mcb.mcgill.ca/~hera; Scott et al. 2004
The LIFEdb database contained the experimentally determined subcellular localization information for
The yeast data set was generated by retrieving all proteins annotated with subcellular localization information from the Saccharomyces Genome Database (SGD; Dwight et al. 2002
Generation of Maximal Motif Sets for Each Compartment
We are interested in finding all maximal motif sets for each compartment. Although this is a computationally intractable problem, we use a dynamic programming approach to find these sets. Intuitively, we begin by identifying all motifs present in at least one protein localized to a given compartment. These are simple motif sets consisting of only one motif. Then, in a dynamic programming fashion, we extend each of the candidate motif sets by exhaustively adding all possible motifs one by one. If an extended motif set has the property that all members of the motif set occur in a set of proteins that are colocalized, we keep this motif set. Otherwise, we discard the candidate. The algorithm is guaranteed to find the maximal motif sets (although it might require exponential time, the computation was feasible for our human proteins and the InterPro motifs). Throughout this process, each motif set (and every subset of this motif set) is annotated with the proportion of proteins in the compartment under study that contain the motifs.
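The level-wise extension described above resembles frequent-itemset mining, and can be sketched as follows in Python. This is an interpretation, not the published implementation: it assumes that a motif set is kept when all of its motifs co-occur in at least `min_support` proteins of the compartment (the `min_support` parameter is hypothetical), and it annotates only the maximal sets with their proportion.

```python
def maximal_motif_sets(compartment_proteins, min_support=1):
    """Find maximal co-occurring motif sets for one compartment.

    compartment_proteins : list of motif sets, one per protein in the compartment
    min_support          : assumed minimum number of proteins that must contain
                           every motif of a set for the set to be kept
    returns              : dict frozenset(motifs) -> proportion of the
                           compartment's proteins containing all of them
    """
    proteins = [frozenset(p) for p in compartment_proteins]
    total = len(proteins)

    def support(motif_set):
        return sum(1 for p in proteins if motif_set <= p)

    all_motifs = sorted(set().union(*proteins)) if proteins else []
    # Level 1: single motifs present in enough proteins.
    level = {frozenset([m]) for m in all_motifs
             if support(frozenset([m])) >= min_support}
    maximal = {}
    while level:
        next_level = set()
        for candidate in level:
            # Extend the candidate by every motif it does not yet contain.
            extensions = {candidate | {m} for m in all_motifs if m not in candidate}
            surviving = {e for e in extensions if support(e) >= min_support}
            if surviving:
                next_level |= surviving
            else:
                # No extension co-occurs often enough, so the set is maximal.
                maximal[candidate] = support(candidate) / total
        level = next_level
    return maximal
```

Because support can only shrink as motifs are added, a set with no surviving one-motif extension has no surviving superset at all, which is why the search can stop extending it. The worst-case running time remains exponential in the number of motifs per protein, consistent with the caveat in the text.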
Likelihood of Localization
Here Pr[M| C] is the probability that the protein contains all motifs in set M given that we know the protein is localized to compartment C. This conditional probability is estimated in a straightforward way from the training set. Also, Pr[M] is the prior probability of a protein containing motif set M regardless of the localization of the proteins containing M. This prior probability is estimated by determining the presence of InterPro motifs in all 9793 human proteins from SWISS-PROT release 41.25. Lastly, Pr[C] is the prior probability of a protein localizing to compartment C. These compartment priors were initially evaluated by averaging the number of human proteins annotated with localization information in three public databases: SWISS-PROT (Boeckmann et al. 2003
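The three quantities defined above combine naturally through Bayes' rule, Pr[C| M] = Pr[M| C] Pr[C] / Pr[M]. The Python sketch below assumes that this is the form of the likelihood; the data layout, function name, and toy values are hypothetical and serve only to show how each term could be estimated by counting.

```python
def localization_likelihoods(motif_set, training, priors, background):
    """Estimate Pr[C | M] for every compartment C by Bayes' rule.

    motif_set  : set of motifs M observed in the query protein
    training   : dict compartment -> list of motif sets (one per protein)
    priors     : dict compartment -> Pr[C]
    background : list of motif sets for a reference proteome, used for Pr[M]
    returns    : dict compartment -> likelihood, or None if Pr[M] is zero
    """
    m = frozenset(motif_set)
    # Pr[M]: fraction of background proteins containing all motifs in M.
    pr_m = sum(1 for p in background if m <= set(p)) / len(background)
    if pr_m == 0:
        return None  # motif set never observed: protein is not predictable

    result = {}
    for compartment, proteins in training.items():
        # Pr[M | C]: fraction of the compartment's proteins containing M.
        pr_m_given_c = sum(1 for p in proteins if m <= set(p)) / len(proteins)
        result[compartment] = pr_m_given_c * priors[compartment] / pr_m
    return result

# Hypothetical toy data.
training = {"nucleus": [{"a", "b"}, {"a"}], "cytosol": [{"b"}, {"c"}]}
priors = {"nucleus": 0.5, "cytosol": 0.5}
background = [{"a", "b"}, {"a"}, {"b"}, {"c"}]
print(localization_likelihoods({"a"}, training, priors, background))
```

When the priors and Pr[M] are estimated from different data sources, as in the text, the resulting values need not sum to one across compartments, so they are best read as relative likelihoods.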
Subcellular Localization Prediction
The prediction accuracy of PSLT is evaluated in the Results section by using several different tests. The total prediction accuracy is defined as the number of correctly predicted proteins in the test set divided by the total number of proteins in the test set. Because PSLT predicts the localization of proteins based on protein motifs, it cannot make predictions for proteins that contain no such motifs or that contain only motifs not used in motif sets during the training phase of the algorithm. As a consequence, we exclude such proteins from our prediction accuracy statistics. However, for each reported accuracy estimate, the coverage (proportion of predictable proteins in the data sets) is given.
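The accuracy and coverage definitions above translate into a short calculation. The Python sketch below is a minimal illustration with hypothetical protein names and data layout; unpredictable proteins are represented by `None` and are excluded from the accuracy, as the text describes.

```python
def accuracy_and_coverage(predictions, annotations):
    """Compute prediction accuracy and coverage for a test set.

    predictions : dict protein -> predicted compartment, or None if the
                  protein contains no usable motif set
    annotations : dict protein -> set of annotated compartments
    returns     : (accuracy over predictable proteins, coverage)
    """
    predictable = [p for p in annotations if predictions.get(p) is not None]
    coverage = len(predictable) / len(annotations)
    if not predictable:
        return None, coverage
    correct = sum(1 for p in predictable if predictions[p] in annotations[p])
    return correct / len(predictable), coverage

# Hypothetical example: four proteins, one of them unpredictable.
annotations = {
    "prot1": {"nucleus"},
    "prot2": {"cytosol"},
    "prot3": {"peroxisome", "mitochondria"},
    "prot4": {"golgi"},
}
predictions = {"prot1": "nucleus", "prot2": "nucleus",
               "prot3": "peroxisome", "prot4": None}
print(accuracy_and_coverage(predictions, annotations))
```

As a check against the figures in the Results, PSLT could predict 197 proteins of the full LIFEdb GFP data set and disagreed with the GFP annotation for 89 of them, giving (197 - 89) / 197, or approximately 55%, in line with the reported accuracy.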
Compartment Prior Optimization
We note that other methods are available, such as structural EM learning, that provide alternative computational approaches in which both the compartment prior optimization and the selection of motif sets could potentially be done simultaneously. However, we chose the above techniques given the amount of data available and the computational difficulty of these alternative methods.
Additional Protein Characteristics Considered
Essentially, we are "extending" the Bayesian network to include new information in the initial Bayesian network that uses the presence of motif sets to compute the Pr[C| M] likelihood. However, if the additional information and the motif sets are not independent random variables, the likelihood of localization should be calculated as follows:
which is equivalent to constructing a global Bayesian network that incorporates simultaneously the motif sets and the new additional information.
We investigated the addition of information relating to the presence of signal peptides and the number of transmembrane domains. The presence of a signal peptide and the number of transmembrane domains were respectively evaluated by using SignalP software (Nielsen et al. 1997
We are grateful to Dr. Scott Bunnell for critical reading of this manuscript. We wish to thank François Pepin for logistical support, Dr. Ted Perkins for useful discussions, and Dr. Richard Mott for kindly making his testing data set available. This work was supported by grants to D.Y.T. and M.H. from Genome Quebec/Genome Canada as well as to D.Y.T. from the Canadian Institutes of Health Research (CIHR). M.S.S. is a recipient of a Canada Graduate Scholarship (CGS) from CIHR.
3 Corresponding author. E-MAIL hallett{at}mcb.mcgill.ca; FAX (514) 398-3387. [Supplemental material is available online at www.genome.org and www.mcb.mcgill.ca/~hera/PSLT.] Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2650004.
Bell, A.W., Ward, M.A., Blackstock, W.P., Freeman, H.N., Choudhary, J.S., Lewis, A.P., Chotai, D., Fazel, A., Gushue, J.N., Paiement, J., et al. 2001. Proteomics characterization of abundant Golgi membrane proteins. J. Biol. Chem. 276: 5152-5165.
Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, O., Phan, I., et al. 2003. The SWISS-PROT protein knowledge base and its supplement TrEMBL in 2003. Nucleic Acids Res. 31: 365-370.
Breckenridge, D.G., Germain, M., Mathai, J.P., Nguyen, M., and Shore, G.C. 2003. Regulation of apoptosis by endoplasmic reticulum pathways. Oncogene 22: 8608-8618.
Cai, Y.D., Liu, X.J., Xu, X.B., and Chou, K.C. 2002. Support vector machines for prediction of protein subcellular location by incorporating quasi-sequence-order effect. J. Cell. Biochem. 84: 343-348.
Chou, K.C. 2001. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins 43: 246-255.
Chou, K.C. and Cai, Y.D. 2002. Using functional domain composition and support vector machines for prediction of protein subcellular location. J. Biol. Chem. 277: 45765-45769.
Chou, K.C. and Cai, Y.D. 2003. A new hybrid approach to predict subcellular localization of proteins by incorporating gene ontology. Biochem. Biophys. Res. Commun. 311: 743-747.
Claros, M.G. and Vincens, P. 1996. Computational method to predict mitochondrially imported proteins and their targeting sequences. Eur. J. Biochem. 241: 779-786.
Drawid, A. and Gerstein, M. 2000. A Bayesian system integrating expression data with sequence patterns for localizing proteins: Comprehensive application to the yeast genome. J. Mol. Biol. 301: 1059-1075.
Dwight, S.S., Harris, M.A., Dolinski, K., Ball, C.A., Binkley, G., Christie, K.R., Fisk, D.G., Issel-Tarver, L., Schroeder, M., Sherlock, G., et al. 2002. Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res. 30: 69-72.
Eisenhaber, F. and Bork, P. 1998. Wanted: Subcellular localization of proteins based on sequence. Trends Cell. Biol. 8: 169-170.
____. 1999. Evaluation of human-readable annotation in biomolecular sequence databases with biological rule libraries. Bioinformatics 15: 528-535.
Emanuelsson, O., Nielsen, H., Brunak, S., and von Heijne, G. 2000. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol. 300: 1005-1016.
Hettema, E.H., Distel, B., and Tabak, H.F. 1999. Import of proteins into peroxisomes. Biochim. Biophys. Acta 1451: 17-34.
Hood, J.K. and Silver, P.A. 1999. In or out? Regulating nuclear transport. Curr. Opin. Cell. Biol. 11: 241-247.
Horton, P. and Nakai, K. 1996. A probabilistic classification system for predicting the cellular localization sites of proteins. Proc. Int. Conf. Intell. Syst. Mol. Biol. 4: 109-115.
Hua, S. and Sun, Z. 2001. Support vector machine approach for protein subcellular localization prediction. Bioinformatics 17: 721-728.
Huang, Y. and Li, Y. 2004. Prediction of protein subcellular locations using fuzzy k-NN method. Bioinformatics 20: 21-28.
Huh, W.K., Falvo, J.V., Gerke, L.C., Carroll, A.S., Howson, R.W., Weissman, J.S., and O'Shea, E.K. 2003. Global analysis of protein localization in budding yeast. Nature 425: 686-691.
Karin, M. 1999. The beginning of the end: I
Krogh, A., Larsson, B., von Heijne, G., and Sonnhammer, E.L. 2001. Predicting transmembrane protein topology with a hidden Markov model: Application to complete genomes. J. Mol. Biol. 305: 567-580.
Lu, Z., Szafron, D., Greiner, R., Lu, P., Wishart, D.S., Poulin, B., Anvik, J., Macdonell, C., and Eisner, R. 2004. Predicting subcellular localization of proteins using machine-learned classifiers. Bioinformatics 20: 547-556.
Marcotte, E.M., Xenarios, I., van Der Bliek, A.M., and Eisenberg, D. 2000. Localizing proteins in the cell from their phylogenetic profiles. Proc. Natl. Acad. Sci. 97: 12115-12120.
Michaud, G.A. and Snyder, M. 2002. Proteomic approaches for the global analysis of proteins. Biotechniques 33: 1308-1316.
Mott, R., Schultz, J., Bork, P., and Ponting, C.P. 2002. Predicting protein cellular localization using a domain projection method. Genome Res. 12: 1168-1174.
Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Barrell, D., Bateman, A., Binns, D., Biswas, M., Bradley, P., Bork, P., et al. 2003. The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res. 31: 315-318.
Nakai, K. and Kanehisa, M. 1992. A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics 14: 897-911.
Nielsen, H., Engelbrecht, J., Brunak, S., and von Heijne, G. 1997. Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng. 10: 1-6.
Parfrey, H., Mahadeva, R., and Lomas, D.A. 2003.
Payne, A.S., Kelly, E.J., and Gitlin, J.D. 1998. Functional expression of the Wilson disease protein reveals mislocalization and impaired copper-dependent trafficking of the common H1069Q mutation. Proc. Natl. Acad. Sci. 95: 10854-10859.
Peri, S., Navarro, J.D., Kristiansen, T.Z., Amanchy, R., Surendranath, V., Muthusamy, B., Gandhi, T.K., Chandrika, K.N., Deshpande, N., Suresh, S., et al. 2004. Human protein reference database as a discovery resource for proteomics. Nucleic Acids Res. 32: D497-D501.
Rachubinski, R.A. and Subramani, S. 1995. How proteins penetrate peroxisomes. Cell 83: 525-528.
Rapoport, T.A. 1992. Transport of proteins across the endoplasmic reticulum membrane. Science 258: 931-936.
Reinhardt, A. and Hubbard, T. 1998. Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Res. 26: 2230-2236.
Scott, M., Lu, G., Hallett, M., and Thomas, D.Y. 2004. The Hera database and its use in the characterization of endoplasmic reticulum proteins. Bioinformatics 20: 937-944.
Simpson, J.C., Wellenreuther, R., Poustka, A., Pepperkok, R., and Wiemann, S. 2000. Systematic subcellular localization of novel proteins identified by large-scale cDNA sequencing. EMBO Rep. 1: 287-292.
Skach, W.R. 2000. Defects in processing and trafficking of the cystic fibrosis transmembrane conductance regulator. Kidney Int. 57: 825-831.
Taylor, S.W., Fahy, E., and Ghosh, S.S. 2003. Global organellar proteomics. Trends Biotechnol. 21: 82-88.
von Heijne, G. 1990. The signal peptide. J. Membr. Biol. 115: 195-201.
Xu, L. and Massague, J. 2004. Nucleocytoplasmic shuttling of signal transducers. Nat. Rev. Mol. Cell. Biol. 5: 209-219.
Zdobnov, E.M. and Apweiler, R. 2001. InterProScan: An integration platform for the signature-recognition methods in InterPro. Bioinformatics 17: 847-848.
www.mcb.mcgill.ca/~hera; Human ER Aperçu home page.
www.dkfz.de/LIFEdb/LIFEdb.aspx; LIFEdb database home page.
www.yeastgenome.org/; Saccharomyces Genome Database (SGD).
www.hprd.org/; Human Protein Reference Database home page.
www.mcb.mcgill.ca/~hera/PSLT; Protein subcellular localization tool.
www.inra.fr/predotar/; Home page of Predotar, a prediction service for identifying putative mitochondrial and plastid targeting sequences.
Received April 2, 2004; accepted in revised format July 22, 2004.