Published online before print
January 27, 2006, 10.1101/gr.4526006 Genome Res. 16:428-435, 2006 ©2006 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/06 $5.00
Methods
Systematic identification of functional orthologs based on protein network comparison
1 Program in Bioinformatics, University of California at San Diego, La Jolla, California 92093, USA
2 Department of Bioengineering, University of California at San Diego, La Jolla, California 92093, USA
3 School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel
Annotating protein function across species is an important task that is often complicated by the presence of large paralogous gene families. Here, we report a novel strategy for identifying functionally related proteins that supplements sequence-based comparisons with information on conserved protein-protein interactions. First, the protein interaction networks of two species are aligned by assigning proteins to sequence homology clusters using the Inparanoid algorithm. Next, probabilistic inference is performed on the aligned networks to identify pairs of proteins, one from each species, that are likely to retain the same function based on conservation of their interacting partners. Applying this method to Drosophila melanogaster and Saccharomyces cerevisiae, we analyze 121 cases for which functional orthology assignment is ambiguous when sequence similarity is used alone. In 61 of these cases, the network supports a different protein pair than that favored by sequence comparisons. These results suggest that network analysis can be used to provide a key source of information for refining sequence-based homology searches.
The idea that similar protein sequences imply similar protein functions has long been a central concept in molecular biology. With each new completed genome, an increasingly complex array of sequence alignment and comparative modeling tools are used to annotate functions for the typically thousands of encoded proteins, based largely on similarity to proteins that are well characterized in other species (Brenner 1999
The difficulty of assigning protein orthology depends largely on the evolutionary history. Protein families for which speciation predates gene duplication are particularly challenging; in these cases, every cross-species protein pair is technically orthologous but it is still necessary to distinguish which protein pairs play functionally equivalent roles, that is, which are functional orthologs (Remm et al. 2001
A variety of sequence-based approaches have been proposed to address these challenges. The COGs (Clusters of Orthologous Groups) approach (Tatusov et al. 2000
Other than gene and protein sequences, several large-scale data types have recently become available that provide complementary information on functional conservation. For instance, several groups have used correlated patterns of gene expression across species as evidence for functional relatedness (Stuart et al. 2003
Here, we investigate whether it is possible to use protein network information to predict functionally orthologous proteins across species. While previous tools such as Interolog mapping and PathBLAST have used orthology to identify conserved protein interactions, our approach aims to reverse this logic and use conserved protein interactions to predict functional orthology. It is built on the concept that a protein and its functional ortholog are likely to interact with proteins in their respective networks that are themselves functional orthologs. This type of network-based approach is related to methods for predicting other protein properties based on the interaction network, such as functional annotation of a protein based on the annotations of its neighbors (Letovsky and Kasif 2003
Motivation: Interaction conservation is related to orthology
Protein-protein interaction networks for yeast and fly were obtained from the Database of Interacting Proteins (December 2004 download) (Xenarios et al. 2002
To determine the extent to which proteins and their functional orthologs had conserved protein interactions, we examined the network neighborhoods of definite functional orthologs and compared them to the neighborhoods of less related protein pairs (Fig. 1). As a measure of local network conservation, we computed the conservation index of each protein pair as proportional to the fraction of interactions that were conserved across the two species. For example, in Figure 2b the orthologous pairing B/B' has a higher conservation index (4/9) than the alternative pairing B/B'' (2/9). Figure 1A shows the set of conservation indices for definite functional orthologs versus those of ambiguous functional orthologs, nonorthologous homologs (best cross-species BLAST matches not assigned to the same Inparanoid cluster), and random pairs of proteins chosen independently of sequence similarity. As expected, the set of definite functional orthologs had the highest occurrence of conserved interactions. Moreover, the mean conservation index was related to the stringency of the pairing: Definite functional orthologs tended to have higher conservation indices than did ambiguous functional orthologs, ambiguous functional orthologs had higher indices than did homologs, and homologs had higher indices than did random protein pairs. Beyond the mean conservation index, there were also significant differences among the four distributions (Supplemental Table 1). These findings confirm that yeast/fly proteins classified as definite functional orthologs are more likely to have equivalent functional roles in the protein network and, conversely, that conserved network context could be used to help discriminate functional orthology from general sequence similarity.
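The conservation index described above can be illustrated with a small sketch. The exact formula is not given in this section, so the code assumes the index is twice the number of conserved interactions divided by the summed degrees of the two proteins, which is consistent with the 4/9 and 2/9 values quoted in the text. The toy networks, the homolog set, and the function name `conservation_index` are hypothetical and chosen only to reproduce those two fractions.

```python
from fractions import Fraction


def conservation_index(a, a_prime, net1, net2, homologs):
    """Assumed definition: 2 * (# conserved interactions) / (d(a) + d(a')).

    An interaction a-b in species 1 is counted as conserved when a' has at
    least one neighbor b' in species 2 such that (b, b') is a homologous pair.
    """
    nbrs1 = net1.get(a, set())
    nbrs2 = net2.get(a_prime, set())
    total_degree = len(nbrs1) + len(nbrs2)
    if total_degree == 0:
        return Fraction(0)
    conserved = sum(
        1 for b in nbrs1 if any((b, b2) in homologs for b2 in nbrs2)
    )
    return Fraction(2 * conserved, total_degree)


# Hypothetical toy data: B has 4 interactions, B' and B'' have 5 each.
net1 = {"B": {"A", "C", "D", "E"}}
net2 = {
    "B'": {"A'", "C'", "X1", "X2", "X3"},
    "B''": {"A'", "Y1", "Y2", "Y3", "Y4"},
}
homologs = {("A", "A'"), ("C", "C'")}

index_b_bprime = conservation_index("B", "B'", net1, net2, homologs)
index_b_bdouble = conservation_index("B", "B''", net1, net2, homologs)
print(index_b_bprime, index_b_bdouble)
```

With two conserved interactions out of a summed degree of nine, the first pairing scores 4/9; with one conserved interaction, the second scores 2/9.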
Network-based identification of functional orthologs
Application to yeast and fly identifies new putative functionally orthologous pairs
We applied this approach to resolve ambiguous functional orthology relationships in the yeast and fly protein networks. Of the 692 ambiguous Inparanoid clusters, 121 contained protein pairs for which at least one pair had conserved interactions between networks. Application of our Gibbs sampling procedure yielded estimates of probability of functional orthology for each protein pair in these 121 ambiguous clusters. In 60 of these clusters, the highest probability was assigned to the protein pair that was also the most sequence-similar via BLAST. These cases reinforced the intuition that the best sequence matches are also the most functionally similar. The remaining 61 clusters showed the opposite behavior; that is, the highest probability pair was not the most sequence similar pair. Of these 61 cases, 15 were supported by two or more conserved interactions (Table 1). Because the yeast and fly networks are incomplete (i.e., they contain false negatives), in some of these cases we cannot rule out the possibility that conserved interactions with the best BLAST matches have been missed (see Discussion). A complete listing of the results can be found on the Supplemental Web site (http://www.cellcircuits.org/Bandyopadhyay2006/).
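The split into clusters where the network and sequence rankings agree or disagree can be sketched as follows. This is only an illustration of the bookkeeping: the data layout, the function name `compare_rankings`, and all probabilities and E-values are made up, not taken from the study.

```python
def compare_rankings(clusters):
    """Split clusters by whether the top network pair is also the top BLAST pair.

    clusters: {cluster_id: {(protein_sp1, protein_sp2): {"prob": float,
                                                         "evalue": float}}}
    """
    agree, disagree = [], []
    for cluster_id, pairs in clusters.items():
        # Network-favored pair: highest estimated probability of orthology.
        best_network = max(pairs, key=lambda p: pairs[p]["prob"])
        # Sequence-favored pair: lowest BLAST E-value.
        best_sequence = min(pairs, key=lambda p: pairs[p]["evalue"])
        if best_network == best_sequence:
            agree.append(cluster_id)
        else:
            disagree.append(cluster_id)
    return agree, disagree


# Hypothetical clusters with invented scores.
toy_clusters = {
    "c1": {
        ("Y1", "F1a"): {"prob": 0.8, "evalue": 1e-80},
        ("Y1", "F1b"): {"prob": 0.3, "evalue": 1e-60},
    },
    "c2": {
        ("Y2", "F2a"): {"prob": 0.2, "evalue": 1e-90},
        ("Y2", "F2b"): {"prob": 0.7, "evalue": 1e-50},
    },
}

agree_ids, disagree_ids = compare_rankings(toy_clusters)
print(agree_ids, disagree_ids)
```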
Validation
A straightforward validation of the approach would be to analyze its accuracy in recapitulating a gold standard set of protein functional annotations. However, databases of functional annotations are based directly on sequence similarity, and they typically lack the specificity to discriminate among subtle functional differences across large gene families. As an alternative, we used the technique of cross-validation to test the ability of the approach to reclassify protein pairs in the definite functional ortholog set (positive test data) versus the nonorthologous homolog set (negative test data). In each cross-validation trial, 1% of these assignments were hidden (declassified) and monitored during Gibbs sampling to obtain probabilities of functional orthology for positive and negative examples. Reclassification was judged successful if the probability of functional orthology exceeded a particular cutoff value. These statistics were compiled over 100 trials. Figure 3A charts cross-validation performance over a range of probability cutoffs. At a probability cutoff of 0.5, we observed a 50% true-positive rate and a 15% false-positive rate. This shows marked improvement over a random predictor, where we would expect to see the same true-positive rate as false-positive rate. Declassifying 1% of the known functional orthologous and nonorthologous pairs reduces the amount of information available to the algorithm and, thus, can reduce its predictive ability. To assess the severity of this effect, we repeated the cross-validation analysis at varying percentages of declassification of positive and negative data (ranging from 1% to 100%) (Fig. 3B). For instance, changing the amount of declassification of available training data from 1% to 25% reduced the maximum precision from 83% to 75%. Further declassification yielded more marked reductions in precision and recall.
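The cutoff-based scoring used in the cross-validation can be sketched in a few lines. The probabilities below are invented for illustration and do not reproduce the reported 50% and 15% rates; the function name `rates_at_cutoff` is hypothetical. A hidden pair is called functionally orthologous when its probability strictly exceeds the cutoff, following the wording above.

```python
def rates_at_cutoff(pos_probs, neg_probs, cutoff):
    """Return (true-positive rate, false-positive rate) at a probability cutoff.

    pos_probs: probabilities assigned to hidden definite functional orthologs.
    neg_probs: probabilities assigned to hidden nonorthologous homologs.
    """
    true_pos = sum(p > cutoff for p in pos_probs)
    false_pos = sum(p > cutoff for p in neg_probs)
    return true_pos / len(pos_probs), false_pos / len(neg_probs)


# Hypothetical probabilities for declassified pairs.
pos = [0.9, 0.7, 0.4, 0.2]
neg = [0.6, 0.3, 0.1, 0.05]

tpr, fpr = rates_at_cutoff(pos, neg, 0.5)
print(tpr, fpr)

# Sweeping the cutoff traces out a curve like the one charted in Figure 3A.
curve = [rates_at_cutoff(pos, neg, c / 10) for c in range(11)]
```

A random predictor would give roughly equal values for the two rates at every cutoff, which is the baseline the text compares against.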
Specific examples of yeast/fly functional orthologs resolved by the network-based approach are shown graphically in Figure 4. In Figure 4A, yeast transportin (Kap104) is orthologous to both Trn and CG8219 in fly with highly significant sequence homology (BLAST E-values 9 × 10^-128 and 7 × 10^-96, respectively). Transportin is a member of a complex responsible for the nuclear import of mRNA binding proteins and is known to be highly conserved among diverse organisms (Aitchison et al. 1996
The cluster in Figure 4C contains two alternative catalytic
As a final example, Figure 4D shows evidence that the yeast Calmodulin (Cmd1) protein is functionally orthologous to fly Androcam (And) rather than to the more sequence-similar fly Calmodulin (Cam1; 60% identity vs. 51% for And). The existence of many conserved interactions for the Cmd1/And pair, compared with only one for Cmd1/Cam1, does not appear to be a result of incomplete coverage: Cmd1 has a total of 61 interactions in the yeast network, and Cam1 and And have 19 and 26 interactions, respectively, in the fly network (most of these do not appear in Fig. 4 because the network alignment only shows interactions that are conserved). Furthermore, multiple sequence alignment and phylogenetic analysis of these genes over a larger number of organisms, including worms and mammals, indicates a closer phylogenetic relationship for yeast Cmd1 and fly And, supporting our hypothesis that they are the true functional orthologs (Supplemental Fig. 1). This apparent discrepancy between functional and sequence similarity is probably a result of the large amount of sequence variability among the calmodulin family of proteins (Tombes et al. 2003
In future work, it is possible that incorporating yet other types of conserved linkages, such as transcriptional interactions (Harbison et al. 2004
In summary, we have presented an algorithm that uses protein interaction measurements to achieve more specific discrimination of functional orthologs than is possible with sequence-based methods alone. It is built on the concept that conserved proteins typically do not function independently but rely on interactions with other proteins to form conserved pathways, and that the specific patterns of conservation of these pathways are informative for determining which cross-species protein pairs have similar functional roles. As these methods mature and as ever greater numbers of protein interactions become available across species, comparative network analysis is likely to play an increasingly central role as a bridge among protein sequence, evolution, and function.
Inparanoid clusters generation
The complete sets of 5878 yeast and 18,746 fly protein sequences were downloaded from the Saccharomyces Genome Database (Christie et al. 2004
Network alignment
Each node in the alignment graph is associated with a state z, indicating whether that protein pair represents true functional orthology (z = 1) or not (z = 0). Links between nodes that are each associated with true functional orthology are said to be "strongly conserved." To compute the frequencies shown in Figure 1A, the protein pair in each Inparanoid cluster having the lowest BLAST E-value is set to z = 1; all others are set to z = 0.
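The state assignment and the notion of a strongly conserved link can be sketched as below. The sketch assumes that each alignment-graph node is a cross-species protein pair and that a link joins two nodes whose proteins interact in both species. The cluster contents, E-values, and function names are hypothetical.

```python
def initial_states(clusters):
    """Set z = 1 for the lowest-E-value pair in each cluster, z = 0 otherwise.

    clusters: {cluster_id: {(protein_sp1, protein_sp2): blast_evalue}}
    """
    states = {}
    for pairs in clusters.values():
        best_pair = min(pairs, key=pairs.get)
        for pair in pairs:
            states[pair] = 1 if pair == best_pair else 0
    return states


def strongly_conserved_links(alignment_edges, states):
    """Keep only links whose two endpoint nodes both have z = 1."""
    return [
        (u, v) for u, v in alignment_edges if states[u] == 1 and states[v] == 1
    ]


# Hypothetical clusters and alignment-graph links.
toy_clusters = {
    "c1": {("Y1", "F1a"): 1e-80, ("Y1", "F1b"): 1e-60},
    "c2": {("Y2", "F2"): 1e-40},
}
toy_edges = [
    (("Y1", "F1a"), ("Y2", "F2")),
    (("Y1", "F1b"), ("Y2", "F2")),
]

z = initial_states(toy_clusters)
strong = strongly_conserved_links(toy_edges, z)
print(z, strong)
```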
Conservation index
Probabilistic model
is some assignment to the states of all nodes in the graph, U(·) is an "energy" function that integrates the potentials over all cliques in the graph, and Z is a normalizing constant. It is not necessary to compute the normalization constant, since all that is required are the conditional probabilities for each node given its neighbors (rather than the joint distribution). For computational efficiency, we used the common auto-logistic model (Besag 1974
above, reduces to a logistic function. Based on our initial observation that the functional orthology of a node is a function of its conservation index (well approximated by a logistic function) (see Fig. 1A; Results), we set $\alpha_i = \alpha$ and $\beta_{ij} = \beta_i = 2\beta/[d(a_i) + d(a_i')]$ to obtain the following:
N(i). Note that $\alpha_i$ and $\beta_{ij}$ could be set to accommodate other equations for conservation index, as long as they are linear in the number of strongly conserved neighbors d(i).
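As a hedged reconstruction of the reasoning in this subsection, the following derivation shows how an auto-logistic energy function leads to a logistic conditional probability in the conservation index. The symbol names $\alpha$, $\beta$, $c(i)$ and the sign convention of $U$ are assumptions made here for illustration; $d(i)$ is the number of strongly conserved neighbors of node $i$, and $d(a_i)$, $d(a_i')$ are the degrees of the two proteins forming node $i$.

```latex
% Gibbs form of the joint distribution over node states z
P(z) = \frac{1}{Z}\exp\bigl(-U(z)\bigr)

% Auto-logistic energy: node terms plus pairwise terms over links (i,j)
U(z) = -\sum_i \alpha_i z_i \;-\; \sum_{(i,j)} \beta_{ij}\, z_i z_j

% Conditional for one node given its neighbors N(i); Z cancels in the ratio
P\bigl(z_i = 1 \mid z_{N(i)}\bigr)
  = \frac{1}{1 + \exp\Bigl(-\alpha_i - \sum_{j \in N(i)} \beta_{ij} z_j\Bigr)}

% With \alpha_i = \alpha and \beta_{ij} = 2\beta / [d(a_i) + d(a_i')]:
\sum_{j \in N(i)} \beta_{ij} z_j
  = \beta \cdot \frac{2\, d(i)}{d(a_i) + d(a_i')}
  = \beta\, c(i)

P\bigl(z_i = 1 \mid z_{N(i)}\bigr)
  = \frac{1}{1 + \exp\bigl(-\alpha - \beta\, c(i)\bigr)}
```

Under these assumptions the conditional depends on the neighbors only through the conservation index $c(i)$, which is why the coefficients must be linear in $d(i)$.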
Fitting the logistic function
Orthology inference
The Gibbs sampling procedure was carried out for an initial period of 2 × 10^6 "burn-in" iterations. From this point onward, 2 × 10^7 additional iterations were performed and statistics computed on the fraction of iterations in which each node acquires a "functionally orthologous" z = 1 state. The final probabilities of functional orthology for each node, P(zi), were estimated as this fraction. The above numbers of iterations were chosen to ensure that results were stable across multiple runs of random initialization configurations (standard deviations for each P(zi) are available in the Supplemental material). Compiled results were aggregated over 100 separate runs of the algorithm and mean probabilities reported.
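The sampling loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a logistic conditional in the conservation index with hypothetical parameters `alpha` and `beta`, updates one randomly chosen free node per iteration, holds known assignments fixed, and ignores any within-cluster constraints. Iteration counts are scaled far down from those in the text so the example runs quickly.

```python
import math
import random


def gibbs_orthology(neighbors, degree_sum, alpha, beta, fixed=None,
                    burn_in=2000, samples=20000, seed=0):
    """Estimate P(z_i = 1) as the fraction of post-burn-in iterations with z_i = 1.

    neighbors:  {node: list of neighboring nodes in the alignment graph}
    degree_sum: {node: d(a) + d(a') for the protein pair forming the node}
    fixed:      {node: 0 or 1} for nodes whose state is held constant
    """
    rng = random.Random(seed)
    fixed = fixed or {}
    free = [n for n in neighbors if n not in fixed]
    if not free:
        return {}
    # Random initial configuration for free nodes; fixed nodes keep their state.
    z = {n: fixed.get(n, rng.randint(0, 1)) for n in neighbors}
    counts = dict.fromkeys(free, 0)
    for iteration in range(burn_in + samples):
        i = rng.choice(free)
        strongly_conserved = sum(z[j] for j in neighbors[i])
        c_i = 2.0 * strongly_conserved / degree_sum[i] if degree_sum[i] else 0.0
        p_one = 1.0 / (1.0 + math.exp(-(alpha + beta * c_i)))
        z[i] = 1 if rng.random() < p_one else 0
        if iteration >= burn_in:
            for n in free:
                counts[n] += z[n]
    return {n: counts[n] / samples for n in free}


# Hypothetical toy graph: "amb1" touches two known orthologous pairs,
# "amb2" has no conserved links at all.
toy_neighbors = {"amb1": ["known1", "known2"], "amb2": [],
                 "known1": ["amb1"], "known2": ["amb1"]}
toy_degree_sum = {"amb1": 4, "amb2": 4, "known1": 2, "known2": 2}
probs = gibbs_orthology(toy_neighbors, toy_degree_sum, alpha=-2.0, beta=6.0,
                        fixed={"known1": 1, "known2": 1})
print(probs)
```

In this toy setting the node with two strongly conserved links ends up with a much higher estimated probability than the isolated node, mirroring the intuition that conserved network context supports functional orthology.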
We thank Tomer Shlomi and Ryan Kelley for critical reading of the manuscript, as well as Silpa Suthram for providing interaction data. We gratefully acknowledge the following sources of funding for this project: award R01-GM070743-01 from the NIGMS (T.I.), an Alon fellowship (R.S.), and a Quantitative Systems Biology grant from the NSF (S.B.). T.I. is a fellow of the David and Lucile Packard Foundation.
[Supplemental material is available online at www.genome.org and http://www.cellcircuits.org/Bandyopadhyay2006/.] Article published online ahead of print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.4526006.
4 Corresponding authors.
Aebersold, R. and Mann, M. 2003. Mass spectrometry-based proteomics. Nature 422: 198-207.
Aitchison, J.D., Blobel, G., and Rout, M.P. 1996. Kap104p: A karyopherin involved in the nuclear transport of messenger RNA binding proteins. Science 274: 624-627.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Besag, J. 1974. Spatial interaction and the statistical analysis of lattice systems. J. Roy. Statist. Soc. B 36: 192-236.
Brenner, S.E. 1999. Errors in genome annotation. Trends Genet. 15: 132-133.
Christie, K.R., Weng, S., Balakrishnan, R., Costanzo, M.C., Dolinski, K., Dwight, S.S., Engel, S.R., Feierbach, B., Fisk, D.G., Hirschman, J.E., et al. 2004. Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res. 32: D311-D314.
Drysdale, R.A., Crosby, M.A., Gelbart, W., Campbell, K., Emmert, D., Matthews, B., Russo, S., Schroeder, A., Smutniak, F., Zhang, P., et al. 2005. FlyBase: Genes and gene models. Nucleic Acids Res. 33: D390-D395.
Eisen, J.A. and Wu, M. 2002. Phylogenetic analysis and gene functional predictions: Phylogenomics in action. Theor. Popul. Biol. 61: 481-487.
Espadaler, J., Aragues, R., Eswar, N., Marti-Renom, M.A., Querol, E., Aviles, F.X., Sali, A., and Oliva, B. 2005. Detecting remotely related proteins by their interactions and sequence similarity. Proc. Natl. Acad. Sci. 102: 7151-7156.
Fields, S. and Song, O. 1989. A novel genetic system to detect protein-protein interactions. Nature 340: 245-246.
Guarente, L. 1993. Synthetic enhancement in gene interaction: A genetic tool come of age. Trends Genet. 9: 362-366.
Harbison, C.T., Gordon, D.B., Lee, T.I., Rinaldi, N.J., Macisaac, K.D., Danford, T.W., Hannett, N.M., Tagne, J.B., Reynolds, D.B., Yoo, J., et al. 2004. Transcriptional regulatory code of a eukaryotic genome. Nature 431: 99-104.
Ho, Y., Gruhler, A., Heilbut, A., Bader, G.D., Moore, L., Adams, S.L., Millar, A., Taylor, P., Bennett, K., Boutilier, K., et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415: 180-183.
Kelley, B.P., Sharan, R., Karp, R.M., Sittler, T., Root, D.E., Stockwell, B.R., and Ideker, T. 2003. Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc. Natl. Acad. Sci. 100: 11394-11399.
Kelley, B.P., Yuan, B., Lewitter, F., Sharan, R., Stockwell, B.R., and Ideker, T. 2004. PathBLAST: A tool for alignment of protein interaction networks. Nucleic Acids Res. 32: W83-W88.
Leone, M. and Pagnani, A. 2005. Predicting protein functions with message passing algorithms. Bioinformatics 21: 239-247.
Letovsky, S. and Kasif, S. 2003. Predicting protein function from protein/protein interaction data: A probabilistic approach. Bioinformatics 19(Suppl 1): I197-I204.
Li, L., Stoeckert Jr., C.J., and Roos, D.S. 2003. OrthoMCL: Identification of ortholog groups for eukaryotic genomes. Genome Res. 13: 2178-2189.
Matthews, L.R., Vaglio, P., Reboul, J., Ge, H., Davis, B.P., Garrels, J., Vincent, S., and Vidal, M. 2001. Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or "interologs." Genome Res. 11: 2120-2126.
Pearl, J. 1988. Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann Publishers, San Mateo, CA.
Press, W.H. 1992. Numerical recipes in FORTRAN: The art of scientific computing. Cambridge University Press, Cambridge.
Reese, M.G., Hartzell, G., Harris, N.L., Ohler, U., Abril, J.F., and Lewis, S.E. 2000. Genome annotation assessment in Drosophila melanogaster. Genome Res. 10: 483-501.
Remm, M., Storm, C.E., and Sonnhammer, E.L. 2001. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314: 1041-1052.
Ronne, H., Carlberg, M., Hu, G.Z., and Nehlin, J.O. 1991. Protein phosphatase 2A in Saccharomyces cerevisiae: Effects on cell growth and bud morphogenesis. Mol. Cell Biol. 11: 4876-4884.
Sharan, R., Suthram, S., Kelley, R.M., Kuhn, T., McCuine, S., Uetz, P., Sittler, T., Karp, R.M., and Ideker, T. 2005. Conserved patterns of protein interaction in multiple species. Proc. Natl. Acad. Sci. 102: 1974-1979.
Siomi, M.C., Fromont, M., Rain, J.C., Wan, L., Wang, F., Legrain, P., and Dreyfuss, G. 1998. Functional conservation of the transportin nuclear import pathway in divergent organisms. Mol. Cell Biol. 18: 4141-4148.
Sjolander, K. 2004. Phylogenomic inference of protein molecular function: Advances and challenges. Bioinformatics 20: 170-179.
Smith, A. and Roberts, G. 1993. Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. Roy. Statist. Soc. B 55: 3-23.
Sprinzak, E., Sattath, S., and Margalit, H. 2003. How reliable are experimental protein-protein interaction data? J. Mol. Biol. 327: 919-923.
Stuart, J.M., Segal, E., Koller, D., and Kim, S.K. 2003. A gene-coexpression network for global discovery of conserved genetic modules. Science 302: 249-255.
Tatusov, R.L., Galperin, M.Y., Natale, D.A., and Koonin, E.V. 2000. The COG database: A tool for genome-scale analysis of protein functions and evolution. Nucleic Acids Res. 28: 33-36.
Tombes, R.M., Faison, M.O., and Turbeville, J.M. 2003. Organization and evolution of multifunctional Ca2+/CaM-dependent protein kinase genes. Gene 322: 17-31.
van Noort, V., Snel, B., and Huynen, M.A. 2003. Predicting gene function by conserved co-expression. Trends Genet. 19: 238-242.
Vazquez, A., Flammini, A., Maritan, A., and Vespignani, A. 2003. Global protein function prediction from protein-protein interaction networks. Nat. Biotechnol. 21: 697-700.
Wagner, A. 2003. How the global structure of protein interaction networks evolves. Proc. Roy. Soc. Lond. B Biol. Sci. 270: 457-466.
Xenarios, I., Salwinski, L., Duan, X.J., Higney, P., Kim, S.M., and Eisenberg, D. 2002. DIP, the Database of Interacting Proteins: A research tool for studying cellular networks of protein interactions. Nucleic Acids Res. 30: 303-305.
Yang, H., Jiang, W., Gentry, M., and Hallberg, R.L. 2000. Loss of a protein phosphatase 2A regulatory subunit (Cdc55p) elicits improper regulation of Swe1p degradation. Mol. Cell Biol. 20: 8143-8156.
Received August 3, 2005; accepted in revised form November 21, 2005.