Genome Res. 15:945-953, 2005
©2005 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/05 $5.00
Letter
Assessing the limits of genomic data integration for predicting protein networks
1 Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
2 Department of Computer Science, Yale University, New Haven, Connecticut 06520, USA
3 Program of Computation Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
Genomic data integration, the process of statistically combining diverse sources of information from functional genomics experiments to make large-scale predictions, is becoming increasingly prevalent. One might expect that this process should become progressively more powerful with the integration of more evidence. Here, we explore the limits of genomic data integration, assessing the degree to which predictive power increases with the addition of more features. We focus on a predictive context that has been extensively investigated and benchmarked in the past: the prediction of protein-protein interactions in yeast. We start by using a simple Naive Bayes classifier for integrating diverse sources of genomic evidence, ranging from coexpression relationships to similar phylogenetic profiles. We expand the number of features considered for prediction to 16, significantly more than previous studies. Overall, we observe a small, but measurable improvement in prediction performance over previous benchmarks, based on four strong features. This allows us to identify new yeast interactions with high confidence. It also allows us to quantitatively assess the inter-relations amongst different genomic features. It is known that subtle correlations and dependencies between features can confound the strength of interaction predictions. We investigate this issue in detail through calculating mutual information. To our surprise, we find no appreciable statistical dependence between the many possible pairs of features. We further explore feature dependencies by comparing the performance of our simple Naive Bayes classifier with a boosted version of the same classifier, which is fairly resistant to feature dependence. We find that boosting does not improve performance, indicating that, at least for prediction purposes, our genomic features are essentially independent.
In summary, by integrating a few (i.e., four) good features, we approach the maximal predictive power of current genomic data integration; moreover, this limitation does not reflect (potentially removable) inter-relationships between the features.
A major challenge in post-genomic biology is systematically mapping the interactome, the set of all protein-protein interactions within an organism. Since proteins carry out their functions by interacting with one another and with other biomolecules, reconstructing the interactome of a cell is the important first step toward understanding protein function and cell behavior (Hartwell et al. 1999
Each genomic feature, by itself, is only a weak predictor of protein interactions. However, predictions can be improved by integrating different genomic features (Marcotte et al. 1999b
One might expect genomic data integration to become increasingly powerful with the integration of more evidence. Here, we explore the limits of genomic data integration, assessing the degree to which predictive power increases with the addition of more features. We focus on a predictive context that has been extensively investigated and benchmarked in the past: the prediction of protein-protein interactions in yeast. Previously, we developed a Naive Bayesian classification approach to predict protein-protein interactions in yeast by integrating four genomic features (functional similarity based on MIPS and GO annotations, mRNA expression correlation, and coessentiality) (Jansen et al. 2003). In this study, we expand the list of genomic features to include 16 diverse features that are plausible indicators for protein interactions. These 16 features are assembled based on both protein pair features and single protein features, and they are derived from a wide range of physical, genetic, contextual, and evolutionary properties of yeast genes. We believe that such "feature-richness" is an essential property of genomic data sets; therefore, we would like to test whether protein-interaction predictions can be further improved by exploiting the diversity of the features, and if so, by how much. Naive Bayes classifiers assume conditional independence between features (see Methods). In the following text, when we say (in)dependent, we mean conditionally (in)dependent. We would expect high dependence between a number of genomic features, and we would expect this to become increasingly likely as we try to integrate more features. In this case, Naive Bayes may no longer be the optimal approach, as the dependence among features needs to be taken into account.
In this study, we apply boosting to Naive Bayes classifiers as an automated and efficient way for handling dependent features. Boosting (Schapire 1990
A list of features useful for predicting protein interactions
In addition to the four features in Jansen et al. (2003
Predictive power of individual features
A good feature, i.e., one with high predictive power, simultaneously has a large number of true positives and a small number of false positives. In this case, the ROC curve climbs rapidly away from the origin (lower left-hand corner of the graph). How quickly the ROC curve rises away from the origin can be quantified by measuring the area under the curve: the larger the area, the better the feature. Ranking the features by the area under their ROC curves (easily seen in Fig. 3A), the best feature in the first group is MIP, followed by GOF, COE, EXP, ESS, MES, and APA. All of these features show strong predictive power (i.e., well above the diagonal). The best feature in the second group is INT, followed by PGP, GNN, REG, ROS, and THR, while SYL shows very little predictive power. EVL and GNC are not shown here because they each have only two overlaps with the positive GSTD and are thus unsuitable for this test. Because of the low coverage of these group-two features, the results in Figure 3B may be misleading without careful interpretation. For example, SYL covers only 887 protein pairs in the GSTDs; it is thus unreliable to estimate its overall predictive power based on this 0.04% of the GSTDs when its coverage is likely to increase in the future (Fig. 3B).
Another point that requires attention is that the performance of a feature against the GSTDs should not be taken as indicative of the accuracy or usefulness of the feature in its original context. This is because the performance of a feature against the GSTDs only measures its usefulness in relation to a specific task, i.e., predicting complex membership, which is probably not what the feature was originally designed to do. For example, the multimeric threading method is designed for predicting physical interactions between two proteins. However, because of the way the GSTDs are constructed, the majority of protein pairs in the GSTDs are simply in the same molecular complex without direct contacts. Therefore, when predicting physical interactions, these GSTDs are not a good means of judging the accuracy or usefulness of the multimeric threading method. Quite often, only the TPR at a specific FPR is of interest. For example, COE outperforms MIP until the FPR reaches 5%, even though MIP covers more area over the whole range of FPR. Thus, the features can also be ranked and selected according to the acceptable FPR in prediction.
Feature selection and improvement of performance
The performance of combining new features is presented in Figure 4A as a ROC curve. By integrating the three additional features, we obtain better predictive power (a higher TPR at a given FPR value) over the whole range of FPR values than by integrating the four original features alone. However, the improvement is marginal: although each of the three new features shows fairly strong predictive power, the increase in TPR at any value of FPR is no more than 3%. Because of the dominant performance of the two functional similarity features (MIP and GOF), the improvement accomplished by incorporating new features may not seem obvious. We thus exclude these two functional features and show the improvement obtained by incorporating three additional features over the remaining two original features (i.e., COE and ESS). Including three additional features shows a significant improvement over the original two features (Fig. 4B).
Another benefit of genomic data integration is the improvement in coverage: two predictors with similar ROC curve performance may cover different parts of the system to varying degrees. Note that this refers to the coverage not only of the labeled pairs (GSTDs), but also of the unlabeled (unseen) pairs. So far, our assessments have been done for labeled pairs only; however, if additional features allow the predictor to have a more extensive view of the system despite no significant improvement in the ROC curve, they probably should be considered beneficial, because in this case, the coverage of unlabeled pairs is improved. Here, we find that the coverage is slightly improved by integrating more features. For all possible 21,658,071 protein pairs (6582 ORFs from MIPS), the four original features cover 18,527,741 pairs (85.5%), whereas the seven most populous features cover 18,880,102 (87.2%).
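The coverage figures quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch that assumes only the counts given in the text; the helper name `coverage` is illustrative and not part of the original analysis.

```python
from math import comb

N_ORFS = 6582                   # ORFs from MIPS, as stated in the text
TOTAL_PAIRS = comb(N_ORFS, 2)   # number of unordered protein pairs


def coverage(covered_pairs: int, total: int = TOTAL_PAIRS) -> float:
    """Fraction of all possible protein pairs covered by a feature set."""
    return covered_pairs / total


four_features = coverage(18_527_741)    # four original features
seven_features = coverage(18_880_102)   # seven most populous features

print(TOTAL_PAIRS)                                    # 21658071
print(f"{four_features:.1%}, {seven_features:.1%}")   # 85.5%, 87.2%
```

The total agrees with the 21,658,071 pairs reported above, which confirms that pairs are counted as unordered and without self-pairs.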
Correlations and statistical dependence between features
We first calculate the Pearson correlation coefficients (CCs) between each pair of features. Such correlations between features can often generate useful biological insights. The five highest absolute values are highlighted in bold in Table 1A. None of the feature pairs exhibit significant correlation.
In addition, we calculate mutual information between genomic features as an alternative to CCs. Whereas a CC only measures linear relationships, mutual information is a more general measure of correlation. The results agree with the CCs: the five pairs with the most mutual information are exactly the same as the five pairs with the highest CCs. These correlations between some of the features, albeit not strong, are expected. For example, the correlation between the two functional features (MIP and GOF) is the highest among feature pairs. It is also expected that absolute mRNA expression (EXP) and absolute protein abundance (APA) are somewhat correlated. We next investigate the conditional dependence between features given the positive or negative GSTD by calculating mutual information. In other words, we calculate the mutual information between pairs of features by taking into account only protein pairs that occur in both features and in either set of GSTDs. The small amount of mutual information, given either set of GSTDs, indicates that the features we integrated with the Naive Bayes classifier are largely conditionally independent (Table 1B).
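As an illustration of the calculation described above, the sketch below gives a plug-in estimate of the mutual information between two discretized features, together with a conditional variant restricted to pairs carrying a given gold-standard label. It assumes that feature values have already been binned into discrete categories; the function names and the toy data are hypothetical, and the estimator used in the study itself may differ.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete observations."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("xs and ys must be non-empty and of equal length")
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi


def conditional_mi(xs, ys, labels, given):
    """Mutual information using only pairs whose gold-standard label equals `given`."""
    kept = [(x, y) for x, y, lab in zip(xs, ys, labels) if lab == given]
    return mutual_information([x for x, _ in kept], [y for _, y in kept])


# Toy data: two binned features over four protein pairs.
print(mutual_information([0, 0, 1, 1], [0, 0, 1, 1]))  # identical features: 1.0 bit
print(mutual_information([0, 0, 1, 1], [0, 1, 0, 1]))  # independent features: 0.0 bits
```

Values near zero for both the positive and the negative label correspond to the situation reported in Table 1B, in which the features are largely conditionally independent.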
Simple Naive Bayes classifier vs. boosted Naive Bayes classifier on data sets with or without high dependence
Even though the conditional dependence between our features is not strong, it is possible that the combined weak dependence can still significantly decrease the predictive power of a Naive Bayes classifier. In this section, we address this question by comparing the performance of a simple Naive Bayes classifier (SNB) with that of a boosted Naive Bayes classifier (BNB). Since a BNB is fairly resistant to feature dependence, a significantly worse performance by a SNB on the same data set means that the feature dependence does affect the predictive power of the SNB.
We first conduct a control experiment with highly dependent features to verify the resistance of BNB to feature dependence. To obtain a highly dependent set of features, we used mRNA expression data from microarray experiments conducted by Cho et al. (1998). We then compare a SNB with a BNB on our data set, which has only weak conditional dependence: the original four features plus only one instead of eight sets of expression data. If the BNB significantly outperforms the SNB, it indicates that the SNB is affected by feature dependence, even though the dependence is not strong. The results show that the SNB performs as well as the BNB on this weakly dependent data set (Fig. 5). Clearly, the SNB is hardly affected by this weak feature dependence. The results in Figure 5 also suggest that the SNB performs sufficiently well on our collection of genomic features, while the BNB may be useful for analyzing the potential problem of highly dependent features as more features are considered in the future.
In this study, we quantitatively address the question of how far genomic data integration can be improved by integrating more and more features. We use a SNB for integrating diverse sources of genomic evidence, ranging from coexpression relationships to similar phylogenetic profiles. By integrating three more strong features, a marginal improvement in both accuracy and coverage can be achieved. The calculations of correlation coefficients and mutual information, together with the boosting results, all suggest that the marginality of the improvement in prediction from incorporating more features is unlikely to result from the weak feature dependencies. It is also unlikely to result from an excess of parameters relative to data points (resulting in overfitting), because our Naive Bayes approach involves simple models with only small numbers of free parameters that are fitted against a large number of data points. Rather, this suggests that by integrating a few good features, we approach the maximal predictive power, or limit, of current genomic data integration. Furthermore, this limitation does not reflect (potentially removable) inter-relationships between the features. Unless we obtain features that are stronger in predictive power than MIP and GOF and simultaneously possess a reasonable coverage, it is unlikely that the prediction will be significantly improved by integrating a few more features. It is also possible that a higher coverage of our examined 16 features may allow better predictive power in the future. Our discovery that no strong dependence exists between features is an interesting finding in and of itself. Among as many as seven populous features, one might expect some dependence high enough to significantly decrease the SNB's predictive power. However, our calculations of correlation coefficients and mutual information, as well as our boosting results, suggest otherwise.
One possibility is that the observed lack of dependence among different features may result from differences in coverage, since all of these data sets are essentially incomplete. Specifically, the overlap of proteins or protein pairs represented among the different features is likely to increase with extended coverage and may result in higher feature dependence. In this case, the BNB can be used as an alternative solution.
Finally, SNB is chosen in this study because of its simplicity, as well as the ability to compare with an existing benchmark study using the same technique (Jansen et al. 2003). Other machine-learning techniques could potentially have been used in this study. However, most alternative techniques have issues in their own right, such as suffering from missing value problems or being prohibitively time-consuming. Such problems prevent them from being applied to this problem as readily as a SNB. In addition, since BNB does not improve on SNB for our collection of features, it is probably not the case that the conclusions made here would be significantly different if other machine-learning techniques were used, though, of course, we cannot definitely say this without a comprehensive test.
Naive Bayesian formalism
Inferring protein-protein interactions from genomic features can be formulated as a classification problem, in which we classify a pair of proteins into one of two classes (C1 = interact, C0 = not interact), given an n-dimensional vector of genomic features x = (x1, x2, ..., xn) (see footnote 5).
The Bayesian Decision Rule states that in order to minimize the average probability of a classification error, one must choose the class with the highest posterior probability, i.e., assign a feature vector x to the class Ck such that Ck = argmax_{Ci} P(Ci | x), where Ci ranges over the set of classes (see for example, Bishop 1995
Using Bayes' theorem, the posterior probability can be rewritten as P(Ci | x) = p(x | Ci) P(Ci) / p(x).
The idea behind Naive Bayes is to make the simplifying assumption that the attribute values are conditionally independent, given the target values. The computation of each class-conditional density p(x | Ci) is thus made efficient by approximating it as a product of conditional probabilities, p(x | Ci) ≈ p(x1 | Ci) p(x2 | Ci) ... p(xn | Ci).
In the case of stochastic independence, the covariance between two features is zero. Thus, the covariance between features is a measure of the deviation from the condition of stochastic independence and is indicative of the amount of approximation introduced by the Naive Bayes assumption. For this reason, the next section shall present an analysis of the covariance between the various features, given the class. Alternatively, the Bayesian Decision rule for two classes can be stated thusly:
If we then introduce the Naive Bayes approximation, we can rewrite equation 2 as:
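The two-class rule and its Naive Bayes form referred to in this passage can be written out as follows. This is a sketch in the standard textbook form, using the notation of this section (classes C1 and C0, feature vector x with components x_j); the exact layout and numbering of the original equations are assumptions.

```latex
% Two-class Bayesian decision rule: assign x to C_1 when the posterior odds exceed 1
\frac{P(C_1 \mid \mathbf{x})}{P(C_0 \mid \mathbf{x})}
  = \frac{p(\mathbf{x} \mid C_1)}{p(\mathbf{x} \mid C_0)} \cdot \frac{P(C_1)}{P(C_0)} > 1

% With the Naive Bayes approximation, the likelihood ratio factorizes over features
\frac{P(C_1 \mid \mathbf{x})}{P(C_0 \mid \mathbf{x})}
  \approx \frac{P(C_1)}{P(C_0)} \prod_{j=1}^{n} \frac{p(x_j \mid C_1)}{p(x_j \mid C_0)} > 1
```

In this form, each feature contributes its own likelihood-ratio factor, so the contribution of any single feature to a prediction can be read off directly.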
ROC (receiver operating characteristic) curve
Our earlier discussion on Naive Bayes was motivated by the goal of minimizing the average probability of a classification error; it was aimed at reducing the total number of wrong predictions, regardless of the type of error that was made. This amounts to saying that we were maximizing the number of
A ROC curve graphically depicts the performance of a classification method for different costs. It consists of a set of points, each computed for a different setting of the cost, connected by lines. For each point, the vertical coordinate is a true positive rate (TPR) given by the ratio of the number of true positives to the total number of positives (i.e., TP/[TP+FN]), while the horizontal coordinate is a false positive rate (FPR) given by the ratio of the number of false positives to the total number of negatives (i.e., FP/[FP+TN]). Note that the TPR is equivalent to the commonly used term sensitivity, while the FPR is equivalent to 1 - specificity. Clearly, the ROC curve for a good classifier will be as close as possible to the upper-left corner of the chart; that is where we have the highest number of true positives and at the same time the smallest number of false positives.
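The construction just described can be sketched in a few lines. The example below assumes that each protein pair has a real-valued score (for instance a likelihood ratio) and a gold-standard label of 1 or 0, and it sweeps a threshold over the scores instead of an explicit cost; the function names and toy data are illustrative only.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by sweeping a threshold over the scores.

    labels are 1 (gold-standard positive) or 0 (gold-standard negative).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        # Handle tied scores together so that they yield a single point.
        j = i
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            tp += ranked[j][1]
            fp += 1 - ranked[j][1]
            j += 1
        points.append((fp / n_neg, tp / n_pos))  # FPR = FP/(FP+TN), TPR = TP/(TP+FN)
        i = j
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.6]
print(auc(roc_points(scores, [1, 1, 0, 0])))  # perfect ranking: 1.0
print(auc(roc_points(scores, [1, 0, 1, 0])))  # partly mixed ranking: 0.75
```

The area returned by `auc` is the quantity used to rank features by predictive power; a feature with no predictive power follows the diagonal and gives an area of 0.5.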
Mutual information
Boosting
AdaBoost consists of sequentially applying a weak classification algorithm to modified versions of the data, producing a sequence of weak classifiers. Then, the prediction from each classifier is combined through a weighted majority vote. The data is modified by applying weights to each of the training observations. At each iteration, a weak learner is trained on the weighted set of data and the weights are updated. This operation is repeated until the desired performance for the training data is achieved. The updating rule for these weights is such that training pairs that had been misclassified in the previous step will have their weights increased, while those that were correctly classified will have their weights decreased. At each iteration, then, training pairs that are more difficult to classify have more influence, and classifiers are forced to focus on pairs overlooked by previous classifiers.
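The procedure described above can be sketched as follows. This is a minimal illustration of discrete AdaBoost with a weighted Naive Bayes weak learner, assuming binned (discrete) features and labels coded as +1 and -1; the function names, the smoothing scheme, and the stopping conditions are assumptions made for the example and are not taken from the original implementation.

```python
from collections import defaultdict
from math import exp, log


def fit_weighted_nb(X, y, w, smoothing=1.0):
    """Fit Naive Bayes on discrete features with per-example weights.

    Returns a function mapping a feature vector to +1 or -1.
    """
    n_features = len(X[0])
    scale = len(X) / sum(w)  # rescale so that the weights sum to N
    class_mass = {+1: 0.0, -1: 0.0}
    counts = {c: [defaultdict(float) for _ in range(n_features)] for c in class_mass}
    values = [set() for _ in range(n_features)]
    for xi, yi, wi in zip(X, y, w):
        class_mass[yi] += wi * scale
        for j, v in enumerate(xi):
            counts[yi][j][v] += wi * scale
            values[j].add(v)

    def log_score(x, c):
        s = log(class_mass[c] + smoothing)  # unnormalized class prior
        for j, v in enumerate(x):
            num = counts[c][j].get(v, 0.0) + smoothing
            den = class_mass[c] + smoothing * (len(values[j]) + 1)
            s += log(num / den)             # smoothed p(x_j | c)
        return s

    return lambda x: +1 if log_score(x, +1) >= log_score(x, -1) else -1


def adaboost(X, y, rounds=10):
    """Discrete AdaBoost; returns a function mapping a feature vector to +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []  # (alpha, weak classifier) pairs
    for _ in range(rounds):
        h = fit_weighted_nb(X, y, w)
        miss = [h(xi) != yi for xi, yi in zip(X, y)]
        err = sum(wi for wi, m in zip(w, miss) if m)  # weighted training error
        if err >= 0.5:          # weak learner no better than chance: stop
            if not ensemble:
                ensemble.append((1.0, h))
            break
        alpha = 0.5 * log((1 - err) / max(err, 1e-10))
        ensemble.append((alpha, h))
        if err == 0:            # training data perfectly classified: stop
            break
        # Increase weights of misclassified pairs, decrease the others, renormalize.
        w = [wi * exp(alpha if m else -alpha) for wi, m in zip(w, miss)]
        total = sum(w)
        w = [wi / total for wi in w]

    return lambda x: +1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1


# Toy usage: the first feature determines the class, the second is noise.
X = [(1, 0), (1, 1), (0, 0), (0, 1)]
y = [+1, +1, -1, -1]
predict = adaboost(X, y)
print([predict(x) for x in X])  # [1, 1, -1, -1]
```

With a single round and uniform weights, the ensemble reduces to the simple Naive Bayes classifier, which makes the two classifiers straightforward to compare on the same data.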
Given a data set of N training pairs (xi,yi), i = 1...N, where xi is an input vector of features and yi
Training and testing data sets
We thank Drs. Ronald Jansen, Valery Trifonov, and Haoxin Lu for stimulating discussions and proofreading of this manuscript. Y.X. is a Fellow of the Jane Coffin Childs Memorial Fund for Medical Research. This work is supported by a grant from NIH/NIGMS for work in the PSI.
[All genomic feature data used in this study can be downloaded at http://networks.gersteinlab.org/intint/.] Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.3610305.
4 Corresponding author.
5 Bold letters denote vectors; P(·) denotes probabilities; p(·) denotes probability density functions.
Alberts, B. 2002. Molecular biology of the cell. Garland Science, New York.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. 2000. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25: 25-29.
Baker, D. and Sali, A. 2001. Protein structure prediction and structural genomics. Science 294: 93-96.
Berger, J.M., Gamblin, S.J., Harrison, S.C., and Wang, J.C. 1996. Structure and mechanism of DNA topoisomerase II. Nature 379: 225-232.
Bishop, C.M. 1995. Neural networks for pattern recognition. Clarendon Press, Oxford University Press, Oxford, UK.
Bowers, P.M., Pellegrini, M., Thompson, M.J., Fierro, J., Yeates, T.O., and Eisenberg, D. 2004. Prolinks: A database of protein functional linkages derived from coevolution. Genome Biol. 5: R35.
Brown, M.P., Grundy, W.N., Lin, D., Cristianini, N., Sugnet, C.W., Furey, T.S., Ares Jr., M., and Haussler, D. 2000. Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc. Natl. Acad. Sci. 97: 262-267.
Cho, R.J., Campbell, M.J., Winzeler, E.A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T.G., Gabrielian, A.E., Landsman, D., Lockhart, D.J., et al. 1998. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell 2: 65-73.
Drawid, A., Jansen, R., and Gerstein, M. 2000. Genome-wide analysis relating expression level with protein subcellular localization. Trends Genet. 16: 426-430.
Duda, R.O., Hart, P.E., and Stork, D.G. 2001. Pattern classification. Wiley, New York; Chichester, UK.
Eisenberg, D., Marcotte, E.M., Xenarios, I., and Yeates, T.O. 2000. Protein function in the post-genomic era. Nature 405: 823-826.
Freund, Y. and Schapire, R.E. 1996. Experiments with a new boosting algorithm. In Proceedings of the thirteenth conference on machine learning, pp. 148-156.
Freund, Y. and Schapire, R.E. 1999. A short introduction to boosting. J. Japanese Soc. Artificial Intell. 14: 771-780.
Friedman, N. 2004. Inferring cellular networks using probabilistic graphical models. Science 303: 799-805.
Friedman, J., Hastie, T., and Tibshirani, R. 2000. Additive logistic regression: A statistical view of boosting. Ann. Stat. 28: 337-374.
Gavin, A.C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J.M., Michon, A.M., Cruciat, C.M., et al. 2002. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415: 141-147.
Ge, H., Liu, Z., Church, G.M., and Vidal, M. 2001. Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat. Genet. 29: 482-486.
Gerstein, M., Lan, N., and Jansen, R. 2002. Proteomics. Integrating interactomes. Science 295: 284-287.
Goh, C.S. and Cohen, F.E. 2002. Co-evolutionary analysis reveals insights into protein-protein interactions. J. Mol. Biol. 324: 177-192.
Goh, C.S., Bogan, A.A., Joachimiak, M., Walther, D., and Cohen, F.E. 2000. Co-evolution of proteins with their interaction partners. J. Mol. Biol. 299: 283-293.
Greenbaum, D., Jansen, R., and Gerstein, M. 2002. Analysis of mRNA expression and protein abundance data: An approach for the comparison of the enrichment of features in the cellular population of proteins and transcripts. Bioinformatics 18: 585-596.
Greenbaum, D., Colangelo, C., Williams, K., and Gerstein, M. 2003. Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol. 4: 117.
Hartwell, L.H., Hopfield, J.J., Leibler, S., and Murray, A.W. 1999. From molecular to modular cell biology. Nature 402: C47-C52.
Ho, Y., Gruhler, A., Heilbut, A., Bader, G.D., Moore, L., Adams, S.L., Millar, A., Taylor, P., Bennett, K., Boutilier, K., et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415: 180-183.
Horak, C.E. and Snyder, M. 2002. ChIP-chip: A genomic approach for identifying transcription factor binding sites. Methods Enzymol. 350: 469-483.
Ideker, T., Thorsson, V., Ranish, J.A., Christmas, R., Buhler, J., Eng, J.K., Bumgarner, R., Goodlett, D.R., Aebersold, R., and Hood, L. 2001. Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 292: 929-934.
Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M., and Sakaki, Y. 2001. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. 98: 4569-4574.
Jansen, R., Greenbaum, D., and Gerstein, M. 2002a. Relating whole-genome expression data with protein-protein interactions. Genome Res. 12: 37-46.
Jansen, R., Lan, N., Qian, J., and Gerstein, M. 2002b. Integration of genomic datasets to predict protein complexes in yeast. J. Struct. Funct. Genomics 2: 71-81.
Jansen, R., Yu, H., Greenbaum, D., Kluger, Y., Krogan, N.J., Chung, S., Emili, A., Snyder, M., Greenblatt, J.F., and Gerstein, M. 2003. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science 302: 449-453.
Joachims, T. 1997. A probabilistic analysis of the Rocchio Algorithm with TFIDF for text categorization. 14th International Conference on Machine Learning.
Kemmeren, P., van Berkum, N.L., Vilo, J., Bijma, T., Donders, R., Brazma, A., and Holstege, F.C. 2002. Protein interaction verification and functional annotation by integrated analysis of genome-scale data. Mol. Cell. 9: 1133-1143.
Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett, N.M., Harbison, C.T., Thompson, C.M., Simon, T., et al. 2002. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298: 799-804.
Lee, I., Date, S.V., Adai, A.T., and Marcotte, E.M. 2004. A probabilistic functional network of yeast genes. Science 306: 1555-1558.
Letovsky, S. and Kasif, S. 2003. Predicting protein function from protein/protein interaction data: A probabilistic approach. Bioinformatics 19: 197-204.
Lin, N., Wu, B., Jansen, R., Gerstein, M., and Zhao, H. 2004. Information assessment on predicting protein-protein interactions. BMC Bioinformatics 5: 154.
Lu, L., Lu, H., and Skolnick, J. 2002. MULTIPROSPECTOR: An algorithm for the prediction of protein-protein interactions by multimeric threading. Proteins 49: 350-364.
Lu, L., Arakaki, A.K., Lu, H., and Skolnick, J. 2003. Multimeric threading-based prediction of protein-protein interactions on a genomic scale: Application to the Saccharomyces cerevisiae proteome. Genome Res. 13: 1146-1154.
Marcotte, E.M., Pellegrini, M., Ng, H.L., Rice, D.W., Yeates, T.O., and Eisenberg, D. 1999a. Detecting protein function and protein-protein interactions from genome sequences. Science 285: 751-753.
Marcotte, E.M., Pellegrini, M., Thompson, M.J., Yeates, T.O., and Eisenberg, D. 1999b. A combined algorithm for genome-wide prediction of protein function. Nature 402: 83-86.
Martone, R., Euskirchen, G., Bertone, P., Hartman, S., Royce, T.E., Luscombe, N.M., Rinn, J.L., Nelson, F.K., Miller, P., Gerstein, M., et al. 2003. Distribution of NF-
McCallum, A. and Nigam, K. 1998. A comparison of event models for Naive Bayes text classification. AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.
Mewes, H.W., Frishman, D., Guldener, U., Mannhaupt, G., Mayer, K., Mokrejs, M., Morganstern, B., Munsterkotter, M., Rudd, S., and Weil, B. 2002. MIPS: A database for genomes and protein sequences. Nucleic Acids Res. 30: 31-34.
Pazos, F. and Valencia, A. 2002. In silico two-hybrid system for the selection of physically interacting protein pairs. Proteins 47: 219-227.
Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D., and Yeates, T.O. 1999. Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc. Natl. Acad. Sci. 96: 4285-4288.
Schapire, R.E. 1990. The strength of weak learnability. Machine Learning 5: 197-227.
Schwikowski, B., Uetz, P., and Fields, S. 2000. A network of protein-protein interactions in yeast. Nat. Biotechnol. 18: 1257-1261.
Skolnick, J. and Kolinski, A. 2002. In Computational methods for protein folding. Vol. 120 (ed. R.A. Friesner), pp. 131-192. John Wiley & Sons, New York.
Strong, M., Mallick, P., Pellegrini, M., Thompson, M.J., and Eisenberg, D. 2003. Inference of protein function and protein linkages in Mycobacterium tuberculosis based on prokaryotic genome organization: A combined computational approach. Genome Biol. 4: R59.
Tamames, J., Casari, G., Ouzounis, C., and Valencia, A. 1997. Conserved clusters of functionally related genes in two bacterial genomes. J. Mol. Evol. 44: 66-73.
Thatcher, J.W., Shaw, J.M., and Dickinson, W.J. 1998. Marginal fitness contributions of nonessential genes in yeast. Proc. Natl. Acad. Sci. 95: 253-257.
Tong, A.H., Evangelista, M., Parsons, A.B., Xu, H., Bader, G.D., Page, N., Robinson, M., Raghibizadeh, S., Hogue, C.W., Bussey, H., et al. 2001. Systematic genetic analysis with ordered arrays of yeast deletion mutants. Science 294: 2364-2368.
Tong, A.H., Lesage, G., Bader, G.D., Ding, H., Xu, H., Xin, X., Young, J., Berriz, G.F., Brost, R.L., Chang, M., et al. 2004. Global mapping of the yeast genetic interaction network. Science 303: 808-813.
Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R.B. 2001. Missing value estimation methods for DNA microarrays. Bioinformatics 17: 520-525.
Uetz, P., Giot, L., Cagney, G., Mansfield, T.A., Judson, R.S., Knight, J.R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., et al. 2000. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403: 623-627.
Valencia, A. and Pazos, F. 2002. Computational methods for the prediction of protein interactions. Curr. Opin. Struct. Biol. 12: 368-373.
Vazquez, A., Flammini, A., Maritan, A., and Vespignani, A. 2003. Global protein function prediction from protein-protein interaction networks. Nat. Biotechnol. 21: 697-700.
von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S.G., Fields, S., and Bork, P. 2002. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417: 399-403.
Wong, S.L., Zhang, L.V., Tong, A.H., Li, Z., Goldberg, D.S., King, O.D., Lesage, G., Vidal, M., Andrews, B., Bussey, H., et al. 2004. Combining biological networks to predict genetic interactions. Proc. Natl. Acad. Sci. 101: 15682-15687.
Xia, Y., Yu, H., Jansen, R., Seringhaus, M., Baxter, S., Greenbaum, D., Zhao, H., and Gerstein, M. 2004. Analyzing cellular biochemistry in terms of molecular networks. Annu. Rev. Biochem. 73: 1051-1087.
Yu, H., Luscombe, N.M., Qian, J., and Gerstein, M. 2003. Genomic analysis of gene expression relationships in transcriptional regulatory networks. Trends Genet. 19: 422-427.
Yu, H., Greenbaum, D., Xin Lu, H., Zhu, X., and Gerstein, M. 2004a. Genomic analysis of essentiality within protein networks. Trends Genet. 20: 227-231.
Yu, H., Luscombe, N.M., Lu, H.X., Zhu, X., Xia, Y., Han, J.D., Bertin, N., Chung, S., Vidal, M., and Gerstein, M. 2004b. Annotation transfer between genomes: Protein-protein interologs and protein-DNA regulogs. Genome Res. 14: 1107-1118.
Zhang, L.V., Wong, S.L., King, O.D., and Roth, F.P. 2004. Predicting co-complexed protein pairs using genomic and proteomic data integration. BMC Bioinformatics 5: 38.
Received December 22, 2004; accepted in revised form May 2, 2005.