Genome Res. 13:2568-2576, 2003
©2003 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/03 $5.00
Letter
From Gene Networks to Gene Function
1 European Bioinformatics Institute, EMBL-EBI, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
2 Department of Computer Science, FIN-00014 University of Helsinki, Finland
We propose a novel method to identify functionally related genes based on comparisons of neighborhoods in gene networks. This method does not rely on gene sequence or protein structure homologies, and it can be applied to any organism and a wide variety of experimental data sets. The character of the predicted gene relationships depends on the underlying networks; they concern biological processes rather than molecular function. We used the method to analyze gene networks derived from genome-wide chromatin immunoprecipitation experiments, a large-scale gene deletion study, and the genomic positions of consensus binding sites for transcription factors of the yeast Saccharomyces cerevisiae. We identified 816 functional relationships between 159 genes and show that these relationships correspond to protein-protein interactions, co-occurrence in the same protein complexes, and/or co-occurrence in abstracts of scientific articles. Our results suggest functions for seven previously uncharacterized yeast genes: KIN3 and YMR269W may be involved in biological processes related to cell growth and/or maintenance, whereas IES6, YEL008W, YEL033W, YHL029C, YMR010W, and YMR031W-A are likely to have metabolic functions.
The function of many genes is still unknown; even for the well studied yeast Saccharomyces cerevisiae, about one-third of all genes are still uncharacterized (Ball et al. 2001
Many biological data sets can be represented as gene networks, where nodes represent genes or proteins, and the connections between the nodes represent relationships between these entities. Directed relationships such as "protein A activates gene B" are represented by arcs (A → B), whereas symmetric relationships such as "protein A and protein B bind to each other" are represented by edges (A-B; Schwikowski et al. 2000
We compared the neighborhoods of genes in networks derived from microarray experiments on gene deletion mutants (Hughes et al. 2000
Validation of functional relationships is problematic, because various aspects and meanings are subsumed under the term "function" of a gene or protein. This is mainly due to different experimental approaches that focus either on the effects of mutations or on biochemical activities (Ashburner et al. 2000
We use three approaches to validate the predicted functional relationships: We compare the gene pairs that are predicted to be related (1) with protein-protein interaction data, (2) with protein complexes, and (3) with a literature network. Many biological functions involve protein-protein interactions, and several large protein-protein interaction data sets are available (Uetz et al. 2000

Here we describe how the comparison of gene neighborhoods from different gene networks can be used to identify functionally related genes. We provide evidence that gene pairs with similar network neighborhoods occur more frequently together in article abstracts and more frequently encode proteins that interact physically than do genes with dissimilar neighborhoods. Our method allowed us to identify 816 functional relationships between 159 genes and to assign biological process annotation to seven previously uncharacterized genes. We examine some of the predictions in detail, and show that for the networks studied here the predicted functions concern biological processes rather than biochemical activities.
Our aim was to study the similarity of genes or proteins by assessing the similarity of their neighborhoods in gene networks (Fig. 2). Here we studied relationships between genes/proteins in six different networks of three different types for the yeast Saccharomyces cerevisiae (Table 1):
All networks listed above are represented as directed graphs. In a directed graph, a node can have incoming and outgoing arcs, and thus we can divide the neighborhood of a node depending on the orientation of the arcs. We call the genes with outgoing arcs source genes, and for every source gene s1 we define the target set T1 as the set of genes which have incoming arcs from s1 (see Figs. 1, 2). All of the networks described above are asymmetric: Although source genes are an a priori selected subset of the genome (particular for each network), the whole genome is tested for targets. We call such networks comprehensive target networks.
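The definition of target sets can be illustrated with a minimal sketch. The function name `target_sets` and the toy arcs are hypothetical and are not taken from the networks analyzed here; the sketch only assumes that a network is available as a list of (source gene, target gene) arcs.

```python
from collections import defaultdict


def target_sets(arcs):
    """Map every source gene to the set of genes that receive an arc from it."""
    targets = defaultdict(set)
    for source, target in arcs:
        targets[source].add(target)  # duplicate arcs collapse into one set member
    return dict(targets)


# Hypothetical toy arcs (source gene, target gene), for illustration only.
arcs = [("s1", "g1"), ("s1", "g2"), ("s2", "g2"), ("s2", "g3"), ("s1", "g2")]
sets_by_source = target_sets(arcs)
```

Representing each neighborhood as a set makes the later intersection of T1 and T2 a single set operation.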
For every pair of source genes s1 and s2, we test whether their target sets T1 and T2 intersect more than expected by chance, using the hypergeometric distribution (Sokal and Rohlf 1995
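A sketch of such a test is given below, under stated assumptions: the P-value is taken to be the upper tail of the hypergeometric distribution, that is, the probability of observing an intersection at least as large as the one found when T2 is drawn at random from the genome. The function name `overlap_p_value` and the parameter `genome_size` are hypothetical, and the exact formulation used in this study may differ.

```python
from math import comb


def overlap_p_value(genome_size, t1, t2):
    """Upper-tail hypergeometric probability P(X >= |T1 & T2|).

    genome_size: number of genes tested as possible targets (assumed known).
    t1, t2: target sets of the two source genes.
    """
    n1, n2 = len(t1), len(t2)
    observed = len(t1 & t2)
    total = comb(genome_size, n2)  # all ways to draw a target set of size n2
    tail = sum(
        comb(n1, i) * comb(genome_size - n1, n2 - i)
        for i in range(observed, min(n1, n2) + 1)
    )
    return tail / total


# Toy example: 10 genes, two target sets of size 3 sharing 2 genes.
p = overlap_p_value(10, {1, 2, 3}, {2, 3, 4})
```

Exact integer arithmetic via `math.comb` avoids floating-point underflow for genome-sized inputs; only the final division produces a float.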
We performed 23,758 target-set comparisons for 15,061 source gene pairs within and between the networks (Table 2). For 816 (5.4%) source gene pairs, we found a strong target-set similarity (P
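With this many comparisons, raw P-values need a multiple-testing correction. The text refers to corrected P-values and the reference list includes Holm (1979); assuming that the Holm step-down procedure is the correction meant, adjusted P-values could be sketched as follows. The function name `holm_adjust` is hypothetical.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted P-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The smallest P-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted


# Toy P-values, for illustration only.
adjusted = holm_adjust([0.01, 0.04, 0.03])
```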
When we compared target sets for the same source gene from different networks, we found that 34 out of 80 target-set pairs are highly similar. The similarities occur more frequently between the ChIP networks and between the in silico network and the ChIP networks. According to this comparison, the ChIP networks are similar to each other, and to the in silico network, whereas the mutant network is most different from the others. This is consistent with the small intersection of the mutant network and the ChIP networks: They share 16 source genes, but only 78 connections, although there are on average between 51 and 145 connections per source gene in both networks (Table 1). To test whether the target-set similarity can be used to identify functionally related genes, we used three additional networks as reference networks:
Functionally related genes are connected in these reference networks, and therefore we validated the results of the target-set comparison by comparing them with the connectivity in the reference networks. The proportion of genes with similar target sets increases four- to eightfold if we consider only gene pairs present in the reference networks, instead of all possible source gene pairs (Table 3). This indicates that functionally related genes, that is, genes connected in the reference networks, have similar target sets. In order to test our hypothesis, we ranked all source gene pairs according to their best target-set similarity, that is, from high similarity (low P-values) to low similarity (high P-values). All source gene pairs with a reported interaction were counted as true positive (tp) if their corrected P-value was smaller than a chosen threshold or as false negative (fn) if their corrected P-value was greater than this threshold. Pairs lacking an interaction were counted as false-positive (fp) if their corrected P-value was smaller than the chosen threshold or as true negative (tn) if their corrected P-value was greater than the threshold. We calculated the true-positive rate (sensitivity) as tp/(tp+fn) and the false-positive rate (1 - specificity) as fp/(fp+tn) at each row of the ranking, using the P-value of the respective row as a threshold. An ROC curve displays the true-positive rate versus the false-positive rate in Figure 3. Ideal prediction methods have a high true-positive rate and a low false-positive rate, with ROC curves getting close to the upper left corner of the plot, whereas randomized predictions would produce ROC curves close to the diagonal from the lower left corner to the upper right corner (Witten and Eibe 1999
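The ranking procedure described above can be sketched as follows. The function name `roc_points` and the toy data are hypothetical, and treating a P-value equal to the threshold as a positive prediction is an assumption, because the text only specifies the "smaller" and "greater" cases.

```python
def roc_points(pairs):
    """Compute (false-positive rate, true-positive rate) at each row of the ranking.

    pairs: list of (corrected_p_value, has_reported_interaction) per source gene pair.
    """
    ranked = sorted(pairs, key=lambda item: item[0])  # high similarity first
    positives = sum(1 for _, linked in ranked if linked)
    negatives = len(ranked) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one interacting and one non-interacting pair")
    points = []
    for threshold, _ in ranked:
        # Assumption: P-values equal to the threshold count as predicted.
        tp = sum(1 for p, linked in ranked if linked and p <= threshold)
        fp = sum(1 for p, linked in ranked if not linked and p <= threshold)
        points.append((fp / negatives, tp / positives))  # fp/(fp+tn), tp/(tp+fn)
    return points


# Toy ranking, for illustration only.
toy = [(0.001, True), (0.02, False), (0.005, True), (0.5, False)]
curve = roc_points(toy)
```

In this toy case both interacting pairs rank above both non-interacting pairs, so the curve passes through the upper left corner.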
The ROC curves in Figure 3 show the false-positive rate and the true-positive rate for our prediction method with respect to the different reference networks. A true-positive rate of 82% with a corresponding false-positive rate of 32% is found when using a verification network that is a union of ppi2, mips, and mi3 (Fig. 3C). If we use the more stringent reference sets ppi2 or mi3, the quality of our predictions is better (i.e., the ROC curve is further away from the diagonal). This effect may be due to high error rates in the reference sets; the accuracy of the protein-protein interaction network increases if several methods report the same interactions (Edwards et al. 2002

If we base the predictions on target-set comparisons between different networks, we greatly expand the number of source gene pairs for which we perform target-set comparisons, but the false-positive rate also increases (Fig. 3B). This increase in false positives is higher for protein-protein interactions than for cocitations. The data indicate that for the identification of protein-protein interactions, a comparison of source genes within the ChIP networks yields the best results. However, comparisons of target sets in the mutant network perform best for the identification of interactions in MIPS complexes and literature data. Generally, comparisons between different networks perform worse than comparisons within the same network (see netComparison.pdf in our Supplemental data). It should be noted that there is not enough data available for a reliable analysis of which network combinations yield the best predictions.
The correlation between target-set similarity and functional similarity is evident in the graph representation of the predictions (Fig. 4, fig4.txt in Supplemental data). Genes involved in the same biological processes such as pheromone response or cell-cycle control are linked by several target-set similarities, and are therefore close to each other in the graph. Applying a guilt-by-association approach, we used proximity in the graph to infer gene function (Oliver 2000
It is difficult to find terms describing a set of genes appropriately and objectively; therefore we use the "SGD Gene Ontology Term Mapper" (http://db.yeastgenome.org/cgi-bin/SGD/GO/goTermMapper
Lastly, we examined some source gene pairs with high target-set similarity in detail to illustrate the nature of our predictions: There are 14 source gene pairs for which both genes are present in the protein interaction network ppi2, but no interaction between them is reported in this network, although they are connected in cocitation network mi3. Of these 14 pairs, six have highly similar target sets (P
We conclude that the comparison of target sets in gene networks can be used to find functionally related proteins: We predict 816 relations for 159 genes (P 0.01). The nature of the predicted functional relationships is dependent on the nature of the comprehensive target networks. The Gene Ontology consortium differentiates between three major subcategories "cellular localization," "biological process," and "molecular function" (Ashburner et al. 2000
Our method can be used for the comparison of data from a variety of methods. Large-scale experiments can vary extensively in terms of data quality, as has been described by several groups (Edwards et al. 2002

With the proposed method we did not identify all functional relationships reported in the reference networks. It therefore remains an open question as to how many of the errors are due to limitations of the available data or due to the method. There are several reasons why not all of the target-set pairs derived from the same source gene, or from two genes having a known functional relationship, were highly similar. One reason is that we combined experimental data from different types of experiments, and certain interactions are only observable under very specific conditions not necessarily attained in a given experiment. For example, some transcription factors may bind DNA only if they are phosphorylated.

One advantage of this method is that we can use and integrate a wide variety of different experimental data sets, as long as they can be represented as comprehensive target networks. Even small data sets can be successfully included; unlike clustering of microarray data, there is no need for extensive experiments consisting of tens of microarray hybridizations to provide biologically meaningful results. Our method is versatile; in the present study, for instance, we were able to explore which transcription factor deletions lead to predicted effects on the basis of the localization of its binding sites. We can also look for transcription factors which act in combination with other factors and elucidate possible upstream regulatory mechanisms. Although sequence information may be important for the design of the experiments which underlie the comprehensive target networks, this is not a prerequisite for our method, which is completely independent of sequence or structural homology.
A limitation of this method is that the data sets used for our predictions must be represented as comprehensive target sets. This means that, for example, large-scale protein-interaction networks cannot be used, because of the way these experiments are performed. Only positive interactions are reported, and we do not know which protein interactions do not occur. In contrast, the data sets we included for the predictions always report a signal for all genes in the genome. Therefore, within the limitations of the experimental methods, we always have information regarding the individual behavior of all genes. The possibility of integrating data derived from different experimental methods and conditions allows the exploration of the complexity of cellular regulatory mechanisms. It is feasible to perform repeated analysis of data from different experimental conditions and then use the variations in conditions to explain the changes in interactions predicted. This would lead to a dynamic rather than a static view of protein function.
Construction of the Networks
The mutant network was constructed with data from Hughes et al. (2000
The in silico network was compiled from data reported by Pilpel et al. (2001
The four ChIP networks were constructed from data published by Ren et al. (2000
Experimental data on yeast protein-protein interactions were retrieved from the following databases and publicly available data sets: DIP (Xenarios et al. 2001
The MIPS network was derived from manually annotated complexes at MIPS (Mewes et al. 2002
The cocitation network: Using a synonym dictionary for gene/protein names in yeast, we scanned over 70,000 journal abstracts from Medline for co-occurrences of genes/proteins, using the SRS server (http://srs.ebi.ac.uk
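The co-occurrence scan can be illustrated with a minimal sketch. The function name `cocitation_counts`, the tokenization rule, and the toy abstracts and synonym dictionary are all hypothetical; the sketch does not reproduce the SRS queries used here and assumes that the abstracts are already available as plain strings.

```python
import re
from collections import Counter
from itertools import combinations


def cocitation_counts(abstracts, synonyms):
    """Count, for each gene pair, the abstracts in which both genes are mentioned.

    synonyms: maps an upper-case name or synonym to a canonical gene name.
    """
    counts = Counter()
    for text in abstracts:
        # Simple tokenization assumption: names are runs of letters, digits, hyphens.
        tokens = set(re.findall(r"[A-Za-z0-9-]+", text.upper()))
        genes = {synonyms[token] for token in tokens if token in synonyms}
        for pair in combinations(sorted(genes), 2):
            counts[pair] += 1  # each abstract contributes at most once per pair
    return counts


# Invented toy abstracts and dictionary, for illustration only.
toy_abstracts = [
    "Swi4 and Mbp1 regulate the G1 to S transition.",
    "SWI4 binds promoter elements.",
    "Mbp1 and SWI4 form related complexes.",
]
toy_synonyms = {"SWI4": "SWI4", "MBP1": "MBP1"}
toy_counts = cocitation_counts(toy_abstracts, toy_synonyms)
```

Gene pairs whose count reaches a chosen minimum could then be connected by an edge in the literature network.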
Network Comparison
We thank Michael Ashburner, Aria Baniahmad, Cath Brooksbank, Frank Holstege, Patrick Kemmeren, Helen Parkinson, and Steve Russell for helpful discussion of the manuscript and Christian von Mering for the MIPS reference set. The project is funded by the European Commission as the TEMBLOR, contract no. QLRI-CT-2001-00015 under the RTD programme "Quality of Life and Management of Living Resources." K.P. and E.U. are supported by a grant from the Academy of Finland. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1111403.
3 Corresponding author. [Supplemental material is available online at www.genome.org.]
4 The SGD database has been recently updated and the KIN3 gene is now assigned to the biological process "chromosome segregation" based on an experimental analysis performed by Chen et al. 2002
Alepuz, P.M., Cunningham, K.W., and Estruch, F. 1997. Glucose repression affects ion homeostasis in yeast through the regulation of the stress-activated ENA1 gene. Mol. Microbiol. 26: 91-98.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. 2000. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25: 25-29.
Bader, G.D. and Hogue, C.W. 2002. Analyzing yeast protein-protein interaction data obtained from different sources. Nat. Biotechnol. 20: 991-997.
Ball, C.A., Jin, H., Sherlock, G., Weng, S., Matese, J.C., Andrada, R., Binkley, G., Dolinski, K., Dwight, S.S., Harris, M.A., et al. 2001. Saccharomyces Genome Database provides tools to survey gene expression and functional analysis data. Nucleic Acids Res. 29: 80-81.
Blake, J. and Harris, M. 2003. The Gene Ontology (GO) Project: Structured vocabularies for molecular biology and their application to genome and expression analysis. In Current protocols in bioinformatics (eds. A. Baxevanis, et al.). J. Wiley, New York.
Blaschke, C., Andrade, M.A., Ouzounis, C., and Valencia, A. 1999. Automatic extraction of biological information from scientific text: Protein-protein interactions. Proc. Int. Conf. Intell. Syst. Mol. Biol. 7: 60-67.
Blaschke, C., Hirschman, L., and Valencia, A. 2002. Information extraction in molecular biology. Brief Bioinform. 3: 154-165.
Bork, P. and Koonin, E.V. 1998. Predicting functions from protein sequences - Where are the bottlenecks? Nat. Genet. 18: 313-318.
Chen, Y., Riley, D.J., Zheng, L., Chen, P.L., and Lee, W.H. 2002. Phosphorylation of the mitotic regulator protein Hec1 by Nek2 kinase is essential for faithful chromosome segregation. J. Biol. Chem. 277: 49408-49416.
Dwight, S.S., Harris, M.A., Dolinski, K., Ball, C.A., Binkley, G., Christie, K.R., Fisk, D.G., Issel-Tarver, L., Schroeder, M., Sherlock, G., et al. 2002. Saccharomyces Genome Database (SGD) provides secondary gene annotation using the Gene Ontology (GO). Nucleic Acids Res. 30: 69-72.
Edwards, A.M., Kus, B., Jansen, R., Greenbaum, D., Greenblatt, J., and Gerstein, M. 2002. Bridging structural biology and genomics: Assessing protein interaction data with known complexes. Trends Genet. 18: 529-536.
Enright, A.J., Iliopoulos, I., Kyrpides, N.C., and Ouzounis, C.A. 1999. Protein interaction maps for complete genomes based on gene fusion events. Nature 402: 86-90.
Gavin, A.C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J.M., Michon, A.M., Cruciat, C.M., et al. 2002. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415: 141-147.
Ge, H., Liu, Z., Church, G.M., and Vidal, M. 2001. Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat. Genet. 29: 482-486.
Gerstein, M., Lan, N., and Jansen, R. 2002. Proteomics. Integrating interactomes. Science 295: 284-287.
Ho, Y., Gruhler, A., Heilbut, A., Bader, G.D., Moore, L., Adams, S.L., Millar, A., Taylor, P., Bennett, K., Boutilier, K., et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415: 180-183.
Holm, S. 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics 6: 65-70.
Hughes, T.R., Marton, M.J., Jones, A.R., Roberts, C.J., Stoughton, R., Armour, C.D., Bennett, H.A., Coffey, E., Dai, H., He, Y.D., et al. 2000. Functional discovery via a compendium of expression profiles. Cell 102: 109-126.
Huynen, M.A., Snel, B., Mering, C., and Bork, P. 2003. Function prediction and protein networks. Curr. Opin. Cell Biol. 15: 191-198.
Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M., and Sakaki, Y. 2001. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. 98: 4569-4574.
Iyer, V.R., Horak, C.E., Scafe, C.S., Botstein, D., Snyder, M., and Brown, P.O. 2001. Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 409: 533-538.
Jenssen, T.K., Laegreid, A., Komorowski, J., and Hovig, E. 2001. A literature network of human genes for high-throughput analysis of gene expression. Nat. Genet. 28: 21-28.
Kemmeren, P., van Berkum, N.L., Vilo, J., Bijma, T., Donders, R., Brazma, A., and Holstege, F.C. 2002. Protein interaction verification and functional annotation by integrated analysis of genome-scale data. Mol. Cell 9: 1133-1143.
Koch, C., Moll, T., Neuberg, M., Ahorn, H., and Nasmyth, K. 1993. A role for the transcription factors Mbp1 and Swi4 in progression from G1 to S phase. Science 261: 1551-1557.
Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett, N.M., Harbison, C.T., Thompson, C.M., Simon, I., et al. 2002. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298: 799-804.
Lindgren, A., Bungard, D., Pierce, M., Xie, J., Vershon, A., and Winter, E. 2000. The pachytene checkpoint in Saccharomyces cerevisiae requires the Sum1 transcriptional repressor. EMBO J. 19: 6489-6497.
Manke, T., Bringas, R., and Vingron, M. 2003. Correlating protein-DNA and protein-protein interaction networks. J. Mol. Biol. 333: 75-85.
Marcotte, E.M. 2000. Computational genetics: Finding protein function by nonhomology methods. Curr. Opin. Struct. Biol. 10: 359-365.
Marcotte, E.M., Pellegrini, M., Ng, H.L., Rice, D.W., Yeates, T.O., and Eisenberg, D. 1999. Detecting protein function and protein-protein interactions from genome sequences. Science 285: 751-753.
Measday, V., McBride, H., Moffat, J., Stillman, D., and Andrews, B. 2000. Interactions between Pho85 cyclin-dependent kinase complexes and the Swi5 transcription factor in budding yeast. Mol. Microbiol. 35: 825-834.
Mewes, H.W., Frishman, D., Guldener, U., Mannhaupt, G., Mayer, K., Mokrejs, M., Morgenstern, B., Munsterkotter, M., Rudd, S., and Weil, B. 2002. MIPS: A database for genomes and protein sequences. Nucleic Acids Res. 30: 31-34.
Oliver, S. 2000. Guilt-by-association goes global. Nature 403: 601-603.
Palin, K., Ukkonen, E., Brazma, A., and Vilo, J. 2002. Correlating gene promoters and expression in gene disruption experiments. Bioinformatics 18: 172-180.
Pilpel, Y., Sudarsanam, P., and Church, G.M. 2001. Identifying regulatory networks by combinatorial analysis of promoter elements. Nat. Genet. 29: 153-159.
Ponting, C.P. 2001. Issues in predicting protein function from sequence. Brief Bioinform. 2: 19-29.
Ren, B., Robert, F., Wyrick, J.J., Aparicio, O., Jennings, E.G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E., et al. 2000. Genome-wide location and function of DNA binding proteins. Science 290: 2306-2309.
Rung, J., Schlitt, T., Brazma, A., Freivalds, K., and Vilo, J. 2002. Building and analysing genome-wide gene disruption networks. Bioinformatics 18: 202-210.
Schlitt, T. and Brazma, A. 2002. Learning about gene regulatory networks from gene deletion experiments. Comp. Funct. Genom. 3: 499-503.
Schwikowski, B., Uetz, P., and Fields, S. 2000. A network of protein-protein interactions in yeast. Nat. Biotechnol. 18: 1257-1261.
Simon, I., Barnett, J., Hannett, N., Harbison, C.T., Rinaldi, N.J., Volkert, T.L., Wyrick, J.J., Zeitlinger, J., Gifford, D.K., Jaakkola, T.S., et al. 2001. Serial regulation of transcriptional regulators in the yeast cell cycle. Cell 106: 697-708.
Sokal, R.R. and Rohlf, F.J. 1995. Biometry: The principles and practice of statistics in biological research, 3rd ed. W.H. Freeman and Company, New York.
Sprague, G.F.J. and Thorner, J.W. 1992. Pheromone response and signal transduction during the mating process of Saccharomyces cerevisiae. In The molecular and cellular biology of the yeast Saccharomyces: Gene expression (eds. E.W. Jones, J.R. Pringle and J.R. Broach), pp. 657-744. Cold Spring Harbor Press, Cold Spring Harbor, NY.
Stillman, D.J., Dorland, S., and Yu, Y. 1994. Epistasis analysis of suppressor mutations that allow HO expression in the absence of the yeast SWI5 transcriptional activator. Genetics 136: 781-788.
Uetz, P., Giot, L., Cagney, G., Mansfield, T.A., Judson, R.S., Knight, J.R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., et al. 2000. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403: 623-627.
Valencia, A. and Pazos, F. 2002. Computational methods for the prediction of protein interactions. Curr. Opin. Struct. Biol. 12: 368-373.
von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S.G., Fields, S., and Bork, P. 2002. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417: 399-403.
Walhout, A.J. and Vidal, M. 2001. Protein interaction maps for model organisms. Nat. Rev. Mol. Cell Biol. 2: 55-62.
Witten, I.H. and Eibe, F. 1999. Data mining: Practical machine learning tools and techniques with JAVA implementations. Morgan Kaufman, London.
Wu, L.F., Hughes, T.R., Davierwala, A.P., Robinson, M.D., Stoughton, R., and Altschuler, S.J. 2002. Large-scale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters. Nat. Genet. 31: 255-265.
Xenarios, I., Fernandez, E., Salwinski, L., Duan, X.J., Thompson, M.J., Marcotte, E.M., and Eisenberg, D. 2001. DIP: The Database of Interacting Proteins: 2001 update. Nucleic Acids Res. 29: 239-241.
Xie, J., Pierce, M., Gailus-Durner, V., Wagner, M., Winter, E., and Vershon, A.K. 1999. Sum1 and Hst1 repress middle sporulation-specific gene expression during mitosis in Saccharomyces cerevisiae. EMBO J. 18: 6448-6454.
Zanzoni, A., Montecchi-Palazzi, L., Quondam, M., Ausiello, G., Helmer-Citterich, M., and Cesareni, G. 2002. MINT: A Molecular INTeraction database. FEBS Lett. 513: 135-140.
http://db.yeastgenome.org/cgi-bin/SGD/GO/goTermMapper; goTermMapper from SGD. http://mips.gsf.de/ and http://mips.gsf.de/proj/yeast/; MIPS. http://www.yeastgenome.org/; SGD. http://srs.ebi.ac.uk; SRS server. http://www.ebi.ac.uk/proteome/; EBI Proteome Analysis Database.
Received December 18, 2002; accepted in revised form September 24, 2003.