Published online before print
September 4, 2007, 10.1101/gr.6202607 Genome Res. 17:1537-1545, 2007 ©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07 $5.00
Methods
A systems biology approach for pathway level analysis
1 Karmanos Cancer Institute, Wayne State University, Detroit, Michigan 48202, USA; 2 Department of Computer Science, Wayne State University, Detroit, Michigan 48202, USA; 3 Perinatology Research Branch, NIH/NICHD, Detroit, Michigan 48201, USA
A common challenge in the analysis of genomics data is trying to understand the underlying phenomenon in the context of all complex interactions taking place on various signaling pathways. A statistical approach using various models is universally used to identify the most relevant pathways in a given experiment. Here, we show that the existing pathway analysis methods fail to take into consideration important biological aspects and may provide incorrect results in certain situations. By using a systems biology approach, we developed an impact analysis that includes the classical statistics but also considers other crucial factors such as the magnitude of each gene's expression change, their type and position in the given pathways, their interactions, etc. The impact analysis is an attempt at a deeper level of statistical analysis, informed by more pathway-specific biology than the existing techniques. On several illustrative data sets, the classical analysis produces both false positives and false negatives, while the impact analysis provides biologically meaningful results. This analysis method has been implemented as a Web-based tool, Pathway-Express, freely available as part of the Onto-Tools (http://vortex.cs.wayne.edu).
Along with the ability to generate a large amount of data per experiment, high-throughput technologies also brought the challenge of translating such data into a better understanding of the underlying biological phenomena. Independent of the platform and the analysis methods used, the result of a high-throughput experiment is, in many cases, a list of differentially expressed genes. The common challenge faced by all researchers is to translate such lists of differentially expressed genes into a better understanding of the underlying biological phenomena and, in particular, to put this in the context of the whole organism as a complex system. In 2002, a computerized analysis approach using the Gene Ontology (GO) was proposed to deal with this issue (Khatri et al. 2002).
Both over-representation analysis (ORA) and functional class scoring (FCS) techniques currently used are limited by the fact that each functional category is analyzed independently without a unifying analysis at a pathway or system level (Tian et al. 2005). The approaches currently available for the analysis of gene signaling networks share a number of important limitations.
First, these approaches consider only the set of genes on any given pathway and ignore their position in those pathways. This may be unsatisfactory from a biological point of view. If a pathway is triggered by a single gene product or activated through a single receptor and if that particular protein is not produced, the pathway will be greatly impacted, probably completely shut off. A good example is the insulin pathway (http://www.genome.ac.jp/KEGG/pathway/hsa/hsa04910.html). If the insulin receptor (INSR) is not present, the entire pathway is shut off. Conversely, if several genes are involved in a pathway but they only appear somewhere downstream, changes in their expression levels may not affect the given pathway as much.
Second, some genes have multiple functions and are involved in several pathways but with different roles. For instance, the above INSR is also involved in the adherens junction pathway as one of the many receptor protein tyrosine kinases. However, if the expression of INSR changes, this pathway is not likely to be heavily perturbed because INSR is just one of many receptors on this pathway. Once again, none of these aspects is considered by any of the existing approaches.
Probably the most important challenge today is that the knowledge embedded in these pathways about how various genes interact with each other is not currently exploited. The very purpose of these pathway diagrams is to capture some of our knowledge about how genes interact and regulate each other. However, the existing analysis approaches consider only the sets of genes involved in these pathways, without taking into consideration their topology.
In fact, our understanding of various pathways is expected to improve as more data are gathered. Pathways will be modified by adding, removing or redirecting links on the pathway diagrams. Most existing techniques are completely unable to even sense such changes. Thus, these techniques will provide identical results as long as the pathway diagram involves the same genes, even if the interactions between them are completely redefined over time. Finally, up to now the expression changes measured in these high-throughput experiments have been used only to identify differentially expressed genes (ORA approaches) or to rank the genes (FCS methods), but not to estimate the impact of such changes on specific pathways. Thus, ORA techniques will see no difference between a situation in which a subset of genes is differentially expressed just above the detection threshold (e.g., twofold) and the situation in which the same genes are changing by many orders of magnitude (e.g., 100-fold). Similarly, FCS techniques can provide the same rankings for entire ranges of expression values, if the correlations between the genes and the phenotypes remain similar. Even though analyzing this type of information in a pathway and system context would be extremely meaningful from a biological perspective, currently there is no technique or tool able to do this. We propose a radically different approach for pathway analysis that attempts to capture all aspects above. An impact factor (IF) is calculated for each pathway incorporating parameters such as the normalized fold change of the differentially expressed genes, the statistical significance of the set of pathway genes, and the topology of the signaling pathway. We show on a number of real data sets that the intrinsic limitations of the classical analysis produce both false positives and false negatives while the impact analysis provides biologically meaningful results.
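The insensitivity of ORA to the magnitude of expression changes can be illustrated with a small sketch. It assumes ORA is carried out with a hypergeometric test, as in the hypergeometric analysis mentioned for the lung adenocarcinoma data; the gene names, counts, and fold changes are hypothetical.

```python
from math import comb

def ora_p_value(n_genes_total, n_de_total, pathway_size, n_de_on_pathway):
    """Upper-tail hypergeometric probability P(X >= n_de_on_pathway)."""
    upper = min(pathway_size, n_de_total)
    numerator = sum(
        comb(pathway_size, k) * comb(n_genes_total - pathway_size, n_de_total - k)
        for k in range(n_de_on_pathway, upper + 1)
    )
    return numerator / comb(n_genes_total, n_de_total)

# Hypothetical experiment: 10,000 genes measured, 100 called differentially
# expressed (DE), 5 of them on a 50-gene pathway.
fold_changes_mild = {"g1": 2.0, "g2": 2.1, "g3": -2.0, "g4": 2.2, "g5": -2.3}
fold_changes_strong = {g: fc * 50 for g, fc in fold_changes_mild.items()}

# ORA only counts the genes that pass the threshold, so the fold-change
# values never enter the calculation: both scenarios give the same p-value.
p_mild = ora_p_value(10_000, 100, 50, len(fold_changes_mild))
p_strong = ora_p_value(10_000, 100, 50, len(fold_changes_strong))
```

The same function would also return identical values if the links between the five genes were redrawn, because the pathway diagram enters only through its gene count.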
Our goal is to develop an analysis model that would require both a statistically significant number of differentially expressed genes and biologically meaningful changes on a given pathway. In this model, the IF of a pathway Pi is calculated as the sum of two terms:
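A form of Equation 1 consistent with the description in this section is given below. The notation is assumed here: $p_i$ is the probability of observing, just by chance, at least the number of differentially expressed genes found on $P_i$; $\overline{|\Delta E|}$ is the mean absolute fold change over all differentially expressed genes; and $N_{de}(P_i)$ is the number of differentially expressed genes on the pathway.

```latex
IF(P_i) = \log\!\left(\frac{1}{p_i}\right)
        + \frac{\sum_{g \in P_i} \left| PF(g) \right|}
               {\overline{|\Delta E|} \cdot N_{de}(P_i)}
\qquad (1)
```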
The second term in Equation 1 is a functional term that depends on the identity of the specific genes that are differentially expressed as well as on the interactions described by the pathway (i.e., its topology). In essence, this term sums up the absolute values of the perturbation factors (PFs) for all genes g on the given pathway Pi. The PF of a gene g is calculated as follows:
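A form of Equation 2 consistent with the definitions that follow is given below. Here $\beta_{ug}$ is the sign of the interaction from $u$ to $g$ and $US_g$ is the set of genes directly upstream of $g$. The normalization by $N_{ds}(u)$, the number of genes directly downstream of $u$, is an assumption in the spirit of the PageRank analogy.

```latex
PF(g) = \Delta E(g) + \sum_{u \in US_g} \beta_{ug} \cdot \frac{PF(u)}{N_{ds}(u)}
\qquad (2)
```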
ΔE(g) represents the signed normalized measured expression change of the gene g determined using one of the available methods (Quackenbush 2001). The contribution of each gene u directly upstream of g is weighted by a factor βug, which reflects the type of interaction: βug = 1 for induction, βug = –1 for repression. (In KEGG, which is the source of the pathways used here, this information about the type of interaction is available for every link between two genes in the description of the pathway topology.) USg is the set of all such genes upstream of g. The second term here is similar to the PageRank index used by Google (Page et al. 1998).
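The recursive definition of the perturbation factor can be sketched in Python. This is a minimal illustration under two assumptions: the pathway is acyclic, and the perturbation of each upstream gene is split evenly among its direct downstream genes (the count n_downstream). The gene names A to D and all values are hypothetical.

```python
def perturbation_factors(delta_e, edges):
    """Compute a perturbation factor (PF) for every gene of an acyclic pathway.

    delta_e: gene -> signed normalized expression change (0.0 if unchanged).
    edges:   list of (upstream, downstream, beta), beta = +1 for induction
             and -1 for repression.
    """
    upstream = {g: [] for g in delta_e}
    n_downstream = {g: 0 for g in delta_e}
    for u, g, beta in edges:
        upstream[g].append((u, beta))
        n_downstream[u] += 1

    cache = {}

    def pf(g):
        # PF(g) = measured change + signed, normalized PFs of upstream genes.
        if g not in cache:
            cache[g] = delta_e[g] + sum(
                beta * pf(u) / n_downstream[u] for u, beta in upstream[g]
            )
        return cache[g]

    return {g: pf(g) for g in delta_e}

# Toy pathway: A induces B and C, B represses D, C induces D.
toy_delta_e = {"A": 2.0, "B": 0.0, "C": 1.0, "D": 0.0}
toy_edges = [("A", "B", 1), ("A", "C", 1), ("B", "D", -1), ("C", "D", 1)]
toy_pf = perturbation_factors(toy_delta_e, toy_edges)

# With no links, each PF is just the measured expression change.
isolated_pf = perturbation_factors(toy_delta_e, [])
```

In the toy pathway, B and D are not differentially expressed themselves but still receive a nonzero PF from their upstream genes. A pathway with feedback loops would need an iterative or linear-system solution instead of this simple recursion.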
Under the null hypothesis, which assumes that the list of differentially expressed genes only contains random genes, the likelihood that a pathway has a large IF is proportional to the number of such "differentially expressed" genes that fall on the pathway, which in turn is proportional to the size of the pathway. Thus, we need to normalize with respect to the size of the pathway by dividing the total perturbation by the number of differentially expressed genes on the given pathway, Nde(Pi). Furthermore, various technologies can yield systematically different estimates of the fold changes. For instance, the fold changes reported by microarrays tend to be compressed with respect to those reported by RT-PCR (Canales et al. 2006).
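The two normalizations can be combined in a short sketch of the impact factor. It assumes the IF is the negative natural log of the pathway's over-representation probability plus the total absolute perturbation divided by Nde(Pi) and by the mean absolute fold change of all differentially expressed genes. The mean-fold-change divisor is an assumed way of making platforms comparable, and all numbers are hypothetical.

```python
from math import log

def impact_factor(p_value, pf_values, all_de_fold_changes, n_de_on_pathway):
    """Sketch of an impact factor: -ln(p) plus a normalized total perturbation.

    p_value:             chance probability of seeing this many DE genes
                         on the pathway.
    pf_values:           perturbation factors of the genes on the pathway.
    all_de_fold_changes: fold changes of every DE gene in the experiment.
    n_de_on_pathway:     number of DE genes on the pathway (must be >= 1).
    """
    total_perturbation = sum(abs(pf) for pf in pf_values)
    # Assumed platform normalization: mean absolute fold change of DE genes.
    mean_abs_fc = sum(abs(fc) for fc in all_de_fold_changes) / len(all_de_fold_changes)
    return -log(p_value) + total_perturbation / (mean_abs_fc * n_de_on_pathway)

# Hypothetical pathway: p = 0.01, four genes with the PFs below, two of them DE;
# the experiment as a whole has three DE genes with mean |fold change| = 2.
example_if = impact_factor(0.01, [2.0, 1.0, 2.0, 1.0], [2.0, -1.0, 3.0], 2)
```

Here the statistical term contributes ln(100) and the perturbation term contributes 6 / (2 × 2) = 1.5.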
It can be shown that the IFs correspond to the negative log of the global probability of having both a statistically significant number of differentially expressed genes and a large perturbation in the given pathway. IF values, if, will follow a
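If the two probabilities combined in the IF are assumed to be independent and uniformly distributed under the null hypothesis, and natural logarithms are used, the IF is a sum of two unit-rate exponential variables. Such a sum follows a gamma distribution with shape 2 and scale 1, whose upper tail is (x + 1)·e^(–x). The sketch below turns an IF value into a P-value under exactly those assumptions; the function name is illustrative.

```python
from math import exp

def impact_p_value(impact_factor):
    """Upper-tail probability of a Gamma(shape=2, scale=1) variable.

    Valid only under the stated assumption that the IF is the sum of two
    independent -ln(p) terms, each uniform under the null hypothesis.
    """
    if impact_factor < 0:
        raise ValueError("impact factor must be non-negative")
    return (impact_factor + 1.0) * exp(-impact_factor)

p_at_zero = impact_p_value(0.0)   # no evidence at all
p_at_one = impact_p_value(1.0)    # equals 2 / e
p_at_ten = impact_p_value(10.0)   # a strongly impacted pathway
```

The resulting P-values would still need the usual correction for testing many pathways.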
The impact analysis proposed here extends and enhances the existing statistical approaches by incorporating the novel aspects discussed above. For instance, the second term of the gene perturbation (in Equation 2) increases the PF scores of those genes that are connected through a direct signaling link to other differentially expressed genes (e.g., the PFs of F5 and F11 in Fig. 1 are both increased because of the differentially expressed SERPINC1 and SERPINA1). This will yield a higher overall score for those pathways in which the differentially expressed genes are localized in a connected subgraph, as in this example. Interestingly, when the limitations of the existing approaches are forcefully imposed (e.g., ignoring the magnitude of the measured expression changes or ignoring the regulatory interactions between genes), the impact analysis reduces to the classical statistics and yields the same results. For instance, if there are no perturbations directly upstream of a given gene, the second term in Equation 2 is zero and the PF reduces to the measured expression change.
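This reduction can be checked with a short worked derivation. Assume, as notation for this sketch, that the IF has the form $IF(P_i) = \log(1/p_i) + \sum_{g \in P_i}|PF(g)| \,/\, (\overline{|\Delta E|} \cdot N_{de}(P_i))$. If the topology is ignored, $PF(g) = \Delta E(g)$. If the magnitudes are also ignored, every differentially expressed gene has the same $|\Delta E(g)| = c$ and every other gene has $\Delta E(g) = 0$. Then:

```latex
\frac{\sum_{g \in P_i} |PF(g)|}{\overline{|\Delta E|} \cdot N_{de}(P_i)}
  = \frac{c \cdot N_{de}(P_i)}{c \cdot N_{de}(P_i)} = 1
\quad\Longrightarrow\quad
IF(P_i) = \log\frac{1}{p_i} + 1
```

Under these restrictions the second term is the same constant for every pathway, so the pathways are ranked by $p_i$ alone, exactly as in the classical analysis.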
We have used this pathway analysis approach to analyze several data sets. A first such set includes genes associated with better survival in lung adenocarcinoma (Beer et al. 2002).
From a statistical perspective, the power of both classical techniques appears to be very limited. The corrected P-values do not yield any pathways at the usual 0.01 or 0.05 significance levels, independently of the type of correction. If the significance levels were to be ignored and the techniques used only to rank the pathways, the results would continue to be unsatisfactory. According to the classical ORA analysis, the most significantly affected pathways in this data set are prion disease, focal adhesion, and Parkinson's disease. In reality, both prion and Parkinson's diseases are pathways specifically associated with diseases of the central nervous system and are unlikely to be related to lung adenocarcinomas. In this particular case, prion disease ranks at the top only due to the differential expression of LAMB1. Since this pathway is rather small (14 genes), every time any one gene is differentially expressed, the hypergeometric analysis will rank it highly. A similar phenomenon happens with Parkinson's disease, indicating that this is a problem associated with the method rather than with a specific pathway. At the same time, pathways highly relevant to cancer such as cell cycle and Wnt signaling are ranked in the lower half of the pathway list. The most significant pathways reported as enriched in cancer by GSEA (Subramanian et al. 2005
In contrast, the impact analysis reports cell cycle as the most perturbed pathway in this condition and also as highly significant from a statistical perspective (P = 1.6 × 10^–6). Since early articles on the molecular mechanisms perturbed in lung cancers (Slebos and Rodenhuis 1989
The third pathway as ranked by the impact analysis is Wnt signaling (FDR corrected P = 0.055, significant at 10%). The importance of this pathway is well supported by independent research. At least three mechanisms for the activation of the Wnt signaling pathway in lung cancers have been recently identified: (1) over-expression of Wnt effectors such as Dvl, (2) activation of a non-canonical pathway involving MAPK (previously known as JNK), and (3) repression of Wnt antagonists such as WIF (Mazieres et al. 2005).
In the same data set, Huntington's disease, Parkinson's disease, prion disease, and Alzheimer's disease have low IFs (corrected P-values of >0.20), correctly indicating that they are unlikely to be relevant in lung adenocarcinomas.
A second data set includes genes identified as being associated with poor prognosis in breast cancer (van't Veer et al. 2002).
TGF-beta signaling (P = 0.032) and MAPK (P = 0.064) are also significant. Both fit well with previous research results. TGF-beta1, the main ligand for the TGF-beta signaling pathway, is known as a marker of invasiveness and metastatic capacity of breast cancer cells (Todorovic-Rakovic 2005).
A third data set involves a set of differentially expressed genes obtained by studying the response of a hepatic cell line when treated with palmitate (Swagell et al. 2005).
Conclusions
A statistical approach using various models is commonly used in order to identify the most relevant pathways in a given experiment. This approach is based on the set of genes involved in each pathway. We identified a number of additional factors that may be important in the description and analysis of a given biological pathway. Based on these, we developed a novel impact analysis method that uses a systems biology approach in order to identify pathways that are significantly impacted in any condition monitored through a high-throughput gene expression technique. The impact analysis incorporates the classical probabilistic component but also includes important biological factors that are not captured by the existing techniques: the magnitude of the expression changes of each gene, the position of the differentially expressed genes on the given pathways, the topology of the pathway that describes how these genes interact, and the type of signaling interactions between them. The results obtained on several independent data sets show that the proposed approach is very promising. This analysis method has been implemented as a Web-based tool, Pathway-Express, freely available as part of the Onto-Tools (http://vortex.cs.wayne.edu).
This material is based upon work supported by the following grants: NSF DBI-0234806, CCF-0438970, 1R01HG003491-01A1, 1U01CA117478-01, 1R21CA100740-01, 1R01NS045207-01, 5R21EB000990-03, 2P30 CA022453-24. Onto-Tools currently runs on equipment provided by Sun Microsystems EDU 7824-02344-U and by NIH(NCRR) 1S10 RR017857-01. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the NSF, NIH, DOD, or any other of the funding agencies.
4 Corresponding author.
E-mail sod{at}cs.wayne.edu; fax (313) 577-0868. [Supplemental material is available online at www.genome.org.] Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.6202607
Beer, D.G., Kardia, S.L., Huang, C.-C., Giordano, T.J., Levin, A.M., Misek, D.E., Lin, L., Chen, G., Gharib, T.G., Thomas, D.G., et al. 2002. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat. Med. 8: 816–824.
Beviglia, L., Golubovskaya, V., Xu, L., Yang, X., Craven, R.J., and Cance, W.G. 2003. Focal adhesion kinase N-terminus in breast carcinoma cells induces rounding, detachment and apoptosis. Biochem. J. 373: 201–210.
Breslin, T., Krogh, M., Peterson, C., and Troein, C. 2005. Signal transduction pathway profiling of individual tumor samples. BMC Bioinformatics 6: 163. doi: 10.1186/1471-2105-6-163.
Canales, R.D., Luo, Y., Willey, J.C., Austermiller, B., Barbacioru, C.C., Boysen, C., Hunkapiller, K., Jensen, R.V., Knight, C.R., Lee, K.Y., et al. 2006. Evaluation of DNA microarray results with quantitative gene expression platforms. Nat. Biotechnol. 24: 1115–1122.
Chen, Z., Gibson, T.B., Robinson, F., Silvestro, L., Pearson, G., Xu, B., Wright, A., Vanderbilt, C., and Cobb, M.H. 2001. MAP kinases. Chem. Rev. 101: 2449–2476.
Chung, H.-J., Kim, M., Park, C.H., Kim, J., and Kim, J.H. 2004. ArrayXPath: Mapping and visualizing microarray gene-expression data with integrated biological pathway resources using scalable vector graphics. Nucleic Acids Res. 32: W460–W464. doi: 10.1093/nar/gkh476.
Churchill, G.A. 2002. Fundamentals of experimental design for cDNA microarrays. Nat. Genet. 32: 490–495 (Suppl. S).
Coe, B.P., Lockwood, W.W., Girard, L., Chari, R., Macaulay, C., Lam, S., Gazdar, A.F., Minna, J.D., and Lam, W.L. 2006. Differential disruption of cell cycle pathways in small cell and non-small cell lung cancer. Br. J. Cancer 94: 1927–1935.
Dahlquist, K., Salomonis, N., Vranizan, K., Lawlor, S., and Conklin, B. 2002. GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nat. Genet. 31: 19–20.
Doniger, S.W., Salomonis, N., Dahlquist, K.D., Vranizan, K., Lawlor, S.C., and Conklin, B.R. 2003. MAPPfinder: using Gene Ontology and GenMAPP to create a global gene expression profile from microarray data. Genome Biol. 4: R7. doi: 10.1186/gb-2003-4-1-r7.
Draghici, S. 2002. Statistical intelligence: Effective analysis of high-density microarray data. Drug Discov. Today 7: S55–S63.
Draghici, S., Khatri, P., Martins, R.P., Ostermeier, G.C., and Krawetz, S.A. 2003. Global functional profiling of gene expression. Genomics 81: 98–104.
Draghici, S., Khatri, P., Eklund, A.C., and Szallasi, Z. 2006. Reliability and reproducibility issues in DNA microarray measurements. Trends Genet. 22: 101–109.
Goeman, J.J., van de Geer, S.A., de Kort, F., and van Houwelingen, H.C. 2004. A global test for groups of genes: Testing association with a clinical outcome. Bioinformatics 20: 93–99.
Golubovskaya, V., Beviglia, L., Xu, L.H., Earp, H.S., Craven, R., and Cance, W. 2002. Dual inhibition of focal adhesion kinase and epidermal growth factor receptor pathways cooperatively induces death receptor-mediated apoptosis in human breast cancer cells. J. Biol. Chem. 277: 38978–38987.
Grosu, P., Townsend, J.P., Hartl, D.L., and Cavalieri, D. 2002. Pathway processor: A tool for integrating whole-genome expression results into metabolic networks. Genome Res. 12: 1121–1126.
Holford, M., Li, N., Nadkarni, P., and Zhao, H. 2004. VitaPad: Visualization tools for the analysis of pathway data. Bioinformatics 21: 1596–1602.
Joshi-Tope, G., Gillespie, M., Vastrik, I., D'Eustachio, P., Schmidt, E., de Bono, B., Jassal, B., Gopinath, G.R., Wu, G.R., Matthews, L., et al. 2005. Reactome: A knowledgebase of biological pathways. Nucleic Acids Res. 33: D428–D432. doi: 10.1093/nar/gki072.
Khatri, P. and Draghici, S. 2005. Ontological analysis of gene expression data: Current tools, limitations, and open problems. Bioinformatics 21: 3587–3595.
Khatri, P., Draghici, S., Ostermeier, G.C., and Krawetz, S.A. 2002. Profiling gene expression using Onto-Express. Genomics 79: 266–270.
Mazieres, J., He, B., You, L., Xu, Z., and Jablons, D.M. 2005. Wnt signaling in lung cancer. Cancer Lett. 222: 1–10.
Mootha, V.K., Lindgren, C.M., Eriksson, K.-F., Subramanian, A., Sihag, S., Lehar, J., Puigserver, P., Carlsson, E., Ridderstrale, M., Laurila, E., et al. 2003. Pgc-1
Nau, M.M., Brooks, B.J., Battey, J., Sausville, E., Gazdar, A.F., Kirsch, I.R., McBride, O.W., Bertness, V., Hollis, G.F., Minna, J.D., et al. 1985. L-myc, a new myc-related gene amplified and expressed in human small cell lung cancer. Nature 318: 69–73.
Nikitin, A., Egorov, S., Daraselia, N., and Mazo, I. 2003. Pathway studio—The analysis and navigation of molecular networks. Bioinformatics 19: 2155–2157.
Nikolic-Vukosavljevic, D., Todorovic-Rakovic, N., Demajo, M., Ivanovic, V., Neskovic, B., Markicevic, M., and Neskovic-Konstantinovic, Z. 2004. Plasma TGF-beta1-related survival of postmenopausal metastatic breast cancer patients. Clin. Exp. Metastasis 21: 581–585.
Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., and Kanehisa, M. 1999. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 27: 29–34.
Page, L., Brin, S., Motwani, R., and Winograd, T. 1998. The pagerank citation ranking: Bringing order to the Web. Technical report. Stanford University, Palo Alto, CA.
Pan, D., Sun, N., Cheung, K.-H., Guan, Z., Ma, L., Holford, M., Deng, X., and Zhao, H. 2003. PathMAPA: A tool for displaying gene expression and performing statistical tests on metabolic pathways at multiple levels for Arabidopsis. BMC Bioinformatics 4: 56. doi: 10.1186/1471-2105-4-56.
Panani, A.D. and Roussos, C. 2006. Cytogenetic and molecular aspects of lung cancer. Cancer Lett. 239: 1–9.
Pandey, R., Guru, R.K., and Mount, D.W. 2004. Pathway Miner: Extracting gene association networks from molecular pathways for predicting the biological significance of gene expression microarray data. Bioinformatics 20: 2156–2158.
Pavlidis, P., Qin, J., Arango, V., Mann, J.J., and Sibille, E. 2004. Using the gene ontology for microarray data mining: A comparison of methods and application to age effects in human prefrontal cortex. Neurochem. Res. 29: 1213–1222.
Quackenbush, J. 2001. Computational analysis of microarray data. Nat. Rev. Genet. 2: 418–427.
Rahnenfuhrer, J., Domingues, F.S., Maydt, J., and Lengauer, T. 2004. Calculating the statistical significance of changes in pathway activity from gene expression data. Stat. Appl. Genet. Mol. Biol. 3: article16. http://www.bepress.com/sagmb/vol3/iss1/art16/.
Robinson, P.N., Wollstein, A., Bohme, U., and Beattie, B. 2004. Ontologizing gene-expression microarray data: Characterizing clusters with gene ontology. Bioinformatics 20: 979–981.
Rodrigues, S., Fathers, K., Chan, G., Zuo, D., Halwani, F., Meterissian, S., and Park, M. 2005. CrkI and CrkII function as key signaling integrators for migration and invasion of cancer cells. Mol. Cancer Res. 3: 183–194.
Sanders, T.A., de Grassi, T., Miller, G.J., and Humphries, S.E. 1999. Dietary oleic and palmitic acids and postprandial factor VII in middle-aged men heterozygous and homozygous for factor VII R353Q polymorphism. Am. J. Clin. Nutr. 69: 220–225.
Shannon, P., Markiel, A., Ozier, O., Baliga, N.S., Wang, J.T., Ramage, D., Amin, N., Schwikowski, B., and Ideker, T. 2003. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res. 13: 2498–2504.
Slebos, R.J. and Rodenhuis, S. 1989. The molecular genetics of human lung cancer. Eur. Respir. J. 2: 461–469.
Stelling, J. 2004. Mathematical models in microbial systems biology. Curr. Opin. Microbiol. 7: 513–518.
Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., et al. 2005. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. 102: 15545–15550.
Susztak, K., Ciccone, E., McCue, P., Sharma, K., and Böttinger, E.P. 2005. Multiple metabolic hits converge on CD36 as novel mediator of tubular epithelial apoptosis in diabetic nephropathy. PLoS Med. 2: e45. doi: 10.1371/journal.pmed.0020045.
Swagell, C., Henly, D., and Morris, C.P. 2005. Expression analysis of a human hepatic cell line in response to palmitate. Biochem. Biophys. Res. Commun. 328: 432–441.
Tavazoie, S., Hughes, J.D., Campbell, M.J., Cho, R.J., and Church, G.M. 1999. Systematic determination of genetic network architecture. Nat. Genet. 22: 281–285.
Tian, L., Greenberg, S.A., Kong, S.W., Altschuler, J., Kohane, I.S., and Park, P.J. 2005. Discovering statistically significant pathways in expression profiling studies. Proc. Natl. Acad. Sci. 102: 13544.
Todorovic-Rakovic, N. 2005. TGF-beta 1 could be a missing link in the interplay between ER and HER-2 in breast cancer. Med. Hypotheses 65: 546–551.
van Nimwegen, M.J. and van de Water, B. 2006. Focal adhesion kinase: A potential target in cancer therapy. Biochem. Pharmacol. 73: 597–609.
van Nimwegen, M.J., Huigsloot, M., Camier, A., Tijdens, I.B., and van de Water, B. 2006. Focal adhesion kinase and protein kinase B cooperate to suppress doxorubicin-induced apoptosis of breast tumor cells. Mol. Pharmacol. 70: 1330–1339.
van't Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., et al. 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature 415: 530–536.
Vincenzi, B., Schiavon, G., Silletta, M., Santini, D., Perrone, G., Di Marino, M., Angeletti, S., Baldi, A., and Tonini, G. 2006. Cell cycle alterations and lung cancer. Histol. Histopathol. 21: 423–435.
Yang, Y.H. and Speed, T. 2002. Design issues for cDNA microarray experiments. Nat. Rev. Genet. 3: 579–588.
Received December 11, 2006; accepted in revised form June 28, 2007.