Published online before print
August 10, 2007, 10.1101/gr.6431107 Genome Res. 17:1304-1318, 2007 ©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07 $5.00
Letter
A multidimensional analysis of genes mutated in breast and colorectal cancers
Ludwig Center for Cancer Genetics and Therapeutics, and The Howard Hughes Medical Institute at The Johns Hopkins Kimmel Cancer Center, Baltimore, Maryland 21231, USA
A recent study of a large number of genes in a panel of breast and colorectal cancers identified somatic mutations in 1149 genes. To identify potential biological processes affected by these genes, we examined their putative roles based on sequence similarity, membership in known functional groups and pathways, and predicted interactions with other proteins. These analyses identified functional groups and pathways that were enriched for mutated genes in both tumor types. Additionally, the results pointed to differences in molecular mechanisms that underlie breast and colorectal cancers, including various intracellular signaling and metabolic pathways. These studies provide a multidimensional framework to guide further research and help identify cellular processes critical for malignant progression and therapeutic intervention.
Cancer arises through the gradual accumulation of alterations in oncogenes and tumor suppressor genes. In an effort to identify such genes on a genomic scale, we have recently performed a systematic sequencing study of the majority of human genes in breast and colorectal cancers (Sjöblom et al. 2006). Given this complexity, a systems biological approach could be useful to identify patterns among the mutated genes and to help interpret the genetic landscape of the two tumor types. An optimal approach of this sort would not only examine the individual roles of the mutated gene products, but would also explore their relationships, interactions, and network properties. Understanding this interplay could provide insight into mechanisms of tumorigenesis and prioritize specific pathways and processes for future genetic and biochemical research. In this study, we take advantage of existing genomic and proteomic databases to highlight different aspects of the genes that are mutated in breast and colorectal cancers. Our analysis uses four different system-level perspectives: (1) sequence similarity, (2) functional annotation (including cellular function, biochemical processes, and subcellular localization), (3) protein–protein interactions, and (4) molecular pathways. At each of these levels, we identify specific gene groups that were enriched for genetic alterations, revealing potentially aberrant cellular processes in the tumors.
Protein sequence similarity We first evaluated the proteins encoded by the 1149 mutated genes through sequence-similarity analyses. This approach provides an unbiased means to group proteins based on their encoded information content. Two complementary methods were used: pairwise basic local alignment search tool (BLAST) analysis and comparison of protein domains using information from existing databases. Sequence comparisons via BLAST facilitated examination of entire coding regions, while analyses of protein domains identified motifs and sequence relationships that would not be evident through whole gene comparisons.
To compare entire coding regions, we used BLASTP (Altschul et al. 1990).
Genes that have high sequence identity often participate in similar intracellular roles, either through related biochemical functions, protein dimerization, genetic interactions, or more complex relationships. Within the clusters shown in Figure 1 there were several instances of patterns suggesting common functions during tumorigenesis. For example, mutations in ephrin receptors EPHA3, EPHA4, EPHA7, or EPHB6 affected 10 of the 35 colorectal tumors examined, but no tumor contained mutations in more than a single ephrin receptor, suggesting mutual exclusivity among mutations in these genes. Global analyses of sequence-similarity clusters in both breast and colorectal cancers identified nine and four clusters that showed mutual exclusivity, respectively. While the genes within some pathways act in series, and mutation of one member of the pathway is sufficient to disrupt function, clusters of sequence similarity may also include members that act in parallel pathways. For example, mutations in the TGF-beta pathway mediators SMAD2, SMAD3, or SMAD4 occurred in seven of 35 tumors. While mutations in SMAD4 did not occur in tumors with other SMAD mutations, both SMAD2 and SMAD3 were co-mutated in colorectal tumors Mx30 and Hx5 (Supplemental Fig. 1). Interestingly, SMAD2 or SMAD3 can separately heterodimerize with SMAD4 transcription factors upon pathway activation and mediate transcriptional responses (Jayaraman and Massague 2000).
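The mutual-exclusivity observation above can be illustrated with a minimal sketch. It assumes that the mutation calls are available as a mapping from tumor identifier to the set of mutated genes. The function names and the toy tumor identifiers (`T1` to `T4`) are hypothetical and serve only as an illustration; only Mx30 and Hx5 are taken from the text.

```python
def co_mutated_tumors(tumor_mutations, cluster_genes):
    """Return tumors that carry mutations in more than one gene of the cluster."""
    cluster = set(cluster_genes)
    return sorted(
        tumor
        for tumor, genes in tumor_mutations.items()
        if len(set(genes) & cluster) > 1
    )


def is_mutually_exclusive(tumor_mutations, cluster_genes):
    """A cluster is mutually exclusive if no tumor has two or more of its genes mutated."""
    return not co_mutated_tumors(tumor_mutations, cluster_genes)


# Toy data: T1-T4 are hypothetical tumors; Mx30 and Hx5 mirror the SMAD2/SMAD3
# co-mutation described in the text.
toy_mutations = {
    "T1": {"EPHA3", "TP53"},
    "T2": {"EPHA7"},
    "T3": {"EPHB6", "SMAD4"},
    "T4": {"EPHA4"},
    "Mx30": {"SMAD2", "SMAD3"},
    "Hx5": {"SMAD2", "SMAD3"},
}

ephrin_cluster = ["EPHA3", "EPHA4", "EPHA7", "EPHB6"]
smad_cluster = ["SMAD2", "SMAD3", "SMAD4"]

ephrin_exclusive = is_mutually_exclusive(toy_mutations, ephrin_cluster)
smad_exclusive = is_mutually_exclusive(toy_mutations, smad_cluster)
smad_overlaps = co_mutated_tumors(toy_mutations, smad_cluster)
```

In this toy input the ephrin receptor cluster satisfies the criterion, whereas the SMAD cluster does not because of the two co-mutated tumors.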
A complementary method for analysis of sequence similarity takes advantage of information from existing databases. Instead of determining relatedness solely using BLASTP, other methods such as Hidden Markov Models and consensus sequences have facilitated in-depth comparisons of protein sequences. The Integrated Resource of Protein Families, Domains, and Sites (InterPro) database incorporates information from 16 protein databases, including Pfam, ProDom, PRINTS, PROSITE, and SMART (Apweiler et al. 2001).
We examined these data in two ways to determine whether gene groups containing specific domains were more likely to be mutated than predicted by chance alone. First, we determined whether the number of mutations in gene groups containing specific domains reflected a mutation prevalence that was significantly higher than the passenger mutation prevalence. We performed these calculations for breast and colorectal cancers separately, using the conservative assumption that the observed mutation frequencies of 2.5 and 3.3 mutations per million base pairs, respectively, constituted the passenger rates. Note that this criterion is highly conservative, as the observed mutations actually represent the sum of passenger mutations and those mutations selected for during tumorigenesis (i.e., pathogenic mutations). The resulting Group CaMP score is similar to that used to derive the Cancer Mutation Prevalence (CaMP) score for individual genes. The Group CaMP score incorporated the total number of mutations from all genes within each group, the combined lengths of the genes in each group, and the total number of tumors examined. The P-value of observing at least the number of mutations in a binomial distribution was calculated and corrected with the Benjamini–Hochberg algorithm (Benjamini and Hochberg 1995).
Second, we examined whether the distributions of individual CaMP scores of mutated genes containing domains of interest were different from mutated genes not containing such domains. To compare such distributions, we adapted the Gene Set Enrichment Analysis (GSEA) algorithm, using CaMP scores of individual genes instead of summaries of gene-expression values (Subramanian et al. 2005). After identification of candidate groups that were significantly enriched for mutations using these approaches, we filtered the results to identify those groups that were also enriched for an increased number of mutant genes. Specifically, we determined whether the ratio of the number of mutant genes containing each specific domain to all genes containing that domain was statistically higher than the ratio of the total number of mutant genes (1149) to the number of all the genes (13,023) analyzed. This filtering step ensured that multiple genes within each gene group must be affected in order for the entire group to be considered of interest. A gene group that contained only one highly mutated gene (e.g., mutations only in TP53) would thereby be excluded. Using these two analysis approaches (Group CaMP and CaMP GSEA), a total of 31 and 22 InterPro domains were significantly associated with colorectal and breast cancers, respectively (Table 1; Supplemental Table 1). In colorectal cancers, the majority were determined to be significant by both methods and involved several related protein domains. For example, 14 of the identified domains are in proteins that have extracellular regions or are involved in cell–cell interactions (e.g., four immunoglobulin-related domains, two fibronectin domains, six EGF-related domains, and two cadherin-related domains).
An additional five domains (e.g., pleckstrin-like domain, DH domain, Ephrin receptor ligand-binding domain, Sterile alpha motif homology 2, and receptor tyrosine kinase domain) are known to be involved in protein kinase or G protein signal transduction pathways. Domains identified that were associated with metalloproteases include reprolysin, peptidase M12B propeptide, cysteine-rich ADAM, and disintegrin. Finally, domains present in TGF-beta pathway transcription mediators SMAD (MAD homology 1 and MAD homology 2 domains) were also identified as significantly associated with colorectal cancer. Interestingly, proteins containing MAD homology, ephrin receptor, and Treacher Collins Syndrome protein domains were found to be exclusively mutated in colorectal cancers, while members of the other domains were mutated in both tumor types. Other domains shared by both cancer types include three of the extracellular EGF-related domains, as well as two domains involved in signaling, the DH domain and the pleckstrin-like domain. In breast cancers, two motifs were detected by both the Group CaMP and GSEA methods: one was the spectrin repeat domain that is present in various cytoskeletal proteins, while the second was the relatively nonspecific proline-rich region domain that was also associated with colorectal cancers. Three domains related to ABC transporters and two domains involved in actin binding were preferentially identified in breast tumors.
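The post-test filter described above, which compares the fraction of mutated genes within a group against the study-wide fraction (1149 of 13,023), can be sketched as follows. The expected count follows the description in the text. The one-sided hypergeometric tail is a stand-in chosen for this sketch, not necessarily the statistic used in the study, and all function names are hypothetical.

```python
from math import comb

TOTAL_GENES = 13023    # genes analyzed in the study
TOTAL_MUTATED = 1149   # genes found to be mutated


def expected_mutated(n_group_genes, total=TOTAL_GENES, mutated=TOTAL_MUTATED):
    """Expected number of mutated genes in a group under the study-wide proportion."""
    return n_group_genes * mutated / total


def hypergeom_upper_tail(k, n_group, total=TOTAL_GENES, mutated=TOTAL_MUTATED):
    """P(at least k mutated genes) when n_group genes are drawn without replacement.

    Stand-in test for this sketch (an assumption, not the published method).
    """
    k = max(k, 0)
    upper = min(n_group, mutated)
    numerator = sum(
        comb(mutated, i) * comb(total - mutated, n_group - i)
        for i in range(k, upper + 1)
    )
    return numerator / comb(total, n_group)


def passes_gene_count_filter(k_mutated, n_group, alpha=0.05):
    """Keep a group only if its mutated-gene fraction exceeds the background
    fraction and the stand-in tail probability is below alpha (hypothetical cut-off)."""
    background = TOTAL_MUTATED / TOTAL_GENES
    enriched = k_mutated / n_group > background
    return enriched and hypergeom_upper_tail(k_mutated, n_group) < alpha
```

A group in which a single gene carries all the mutations (for example, one mutated gene among 40 members) does not pass this filter, which is the behavior the filtering step is meant to produce.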
Functional annotation and gene ontology In addition to analyses based on sequence content, the mutated genes were categorized according to their annotated biological roles. The Gene Ontology (GO) Consortium has devised a controlled vocabulary for describing molecular functions and biological processes of genes based on information obtained from the literature and from sequence and biological databases (Ashburner et al. 2000).
Classification of the mutated genes into general functional categories was visualized using OSPREY (Fig. 2) (Breitkreutz et al. 2003). In order to identify more specific molecular functions for the mutant genes, we examined the full set of 18,740 GO groups. Using approaches similar to those used in the analysis of protein domains, we identified GO groups that were enriched for the number of mutations or distribution of CaMP scores using CaMP GSEA and Group CaMP approaches. In colorectal cancer, we identified 11 GO groups to be significant by either method (Table 2; Supplemental Table 2). Groups such as ephrin receptor activity as well as metalloendopeptidase activity corroborated results identified above through the analysis of protein domains. Two of the largest functional groups, cell adhesion and receptor activity, had 24 and 39 mutated genes and 60 and 63 mutations, respectively. More specific subgroups from these groups included insulin receptor binding and homophilic cell adhesion. In breast cancers, 15 functional groups were identified, none of which overlapped precisely with those identified in colorectal cancers. The most closely related ones were functional groups involved in cell adhesion. The largest group identified was calcium ion binding, which included 50 mutated genes and 77 mutations. Five groups were associated with the extracellular matrix, including extracellular matrix organization and biogenesis, extracellular matrix structural constituent, microtubule binding, actin binding, and cell–cell adhesion. Interestingly, two metabolic groups were affected in breast tumors: the overlapping groups of the urea cycle and arginine biosynthesis. Finally, there were three groups related to G protein signaling: GTPase activator activity and two Rho protein modulating groups. These analyses clearly show that while overall functional patterns may be similar between breast and colorectal cancers, the specific group constituents of these general categories are quite different.
Protein interactions
To identify networks of interacting proteins that were preferentially altered in cancers, we analyzed the predicted interactions of mutated proteins in each tumor type (Fig. 3; Supplemental Figs. 3, 4). In breast cancers, over half of the mutated proteins (59 of 83) were predicted to participate in a large interaction cluster driven by links to TP53, BRCA1, PIK3R1, and NFKB. In contrast, the largest interaction cluster in colorectal cancers involved SMAD proteins and contained only 12 proteins, and the only cluster containing more than five proteins included TP53. These analyses emphasize how mutation studies coupled with systems analysis can provide information useful for understanding the pathways through which the mutant proteins function. For example, the mutation interactome highlighted three interacting SMAD proteins in colorectal cancers and a cluster of circadian rhythm proteins (PER1, PER2, and TIMELESS) in breast cancers. The proteins encoded by the latter three genes are thought to control cell cycle progression, and genetic inactivation of one of the genes (Per2) has been shown to lead to tumor predisposition in mice (Fu et al. 2002).
Molecular pathways Pathways can be defined as the stepwise interaction of multiple proteins designed to achieve a defined cellular process. A variety of signaling, metabolic, and other pathways have been cataloged by the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Ogata et al. 1999). In colorectal cancer, 21 pathways were identified to be enriched for mutations from the different pathway databases (Table 3; Supplemental Table 3). Two pathways previously implicated in colorectal tumorigenesis were identified in multiple databases; TGF-beta signaling was identified in three databases and WNT signaling in two. Other signaling pathways contributing to tumorigenesis were also identified, including insulin signaling, JAK/STAT signaling, MAP kinase signaling, and hedgehog signaling pathways. Two pathways identified were related to the cell cycle and the G1/S and G2/M checkpoints. Finally, genes in pathways thought to be important in controlling cell–cell interactions (axon guidance, adherens junctions, and gap junctions) were preferentially mutated in colorectal cancers.
In breast cancer, several known signaling and checkpoint pathways were also identified (Table 3; Supplemental Table 3). These included those involved in AKT signaling, in BRCA1 and BRCA2 repair and cell cycle regulation processes, and in ATM/ATR checkpoint control. Although TP53 was frequently mutated in each of these pathways, many other genes were also implicated, suggesting that multiple mechanisms may exist for dysregulation of these pathways in breast cancer. Additionally, seven members of the RAN regulation pathway were found to be mutated in breast cancers, while none were mutated in colorectal cancers. The RAN pathway members included proteins involved in nuclear transport such as NUP133, NUP214, NUP98, and KPNA5. NUP98 and NUP214 have been shown to be targets of translocation in several human malignancies (Kau et al. 2004).
Integrative analysis
Interpreting the large and complex datasets that arise from genome-wide mutational analyses of cancer is challenging. Given the improvements in bioinformatics and sequencing technologies, we expect that many such projects will come to fruition over the next several years. In the first study of this type, Sjöblom et al. (2006)
The first is that the distribution of mutations observed in the Sjöblom et al. (2006) A second conclusion is that there is substantial value in examining these datasets from different dimensions. Enrichment in protein domains reveals groups of highly related proteins, each of which may be mutated at low levels. Although there is a clear relationship between sequence and function, analysis of enriched functional annotation can allow for abstraction of important biological processes shared by disparate proteins that may not be similar on a sequence level. Examination of protein–protein interactions can provide a more global view of networks that are enriched for mutations. Finally, pathways reveal organizing structures that may not be determined from the other three dimensions. Together, these four complementary views can provide a global view of mutated gene groups and processes.
What are the gene groups and processes that are enriched in these cancer types and what do they tell us about the mechanisms underlying tumorigenesis? For both tumor types, our results pointed to the importance of alterations in intercellular interactions. These included proteins with extracellular domains involved in adhesion (e.g., fibronectin and cadherins), functional groups involved in cell adhesion and extracellular matrix generation and biogenesis, and pathways implicated in cell–cell communication. A multitude of mutated genes are contained in these groups and are delineated in Supplemental Tables 1–3. These observations are generally consistent with the hypothesis that in order for tumor cells to proliferate and invade, they must alter their adhesion dependence to other cells and to the basement membrane and escape control by contact inhibition (Gupta et al. 2007).
The enriched groups and pathways also suggested that certain aspects of intracellular signaling, cell cycle control, and metabolism may be important for tumorigenesis. Two known signaling pathways, involving AKT and ATM/ATR, were enriched in both colorectal and breast tumors, reflecting the important role these play in these tumor types (Vogelstein and Kinzler 2004).
Finally, the results lead to a deeper understanding of the mutational data and its implications for neoplasia. In the Sjöblom et al. (2006)
Genomic sequence similarity calculation The nucleotide and amino acid sequences of all 14,795 CCDS entries were downloaded from NCBI (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/current) according to the March 2, 2005 release based on the 35.1 genome annotation build. Using formatdb, we created a CCDS BLAST database and analyzed each of the CCDS entries using blastp with an E-value cut-off of 0.05 and a minimum score of 100. In total, 1639 and 2118 sequences for breast and colon, respectively, were identified and visualized with Cytoscape 2.3.1.
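As an illustration of how pairwise hits can be turned into sequence-similarity clusters, the following minimal sketch filters hits with the thresholds quoted above (E-value of at most 0.05, score of at least 100) and groups the surviving pairs into connected components. The input format, a list of `(query, subject, evalue, score)` tuples, and the function name are assumptions made for this sketch; the study itself visualized the network in Cytoscape.

```python
from collections import defaultdict


def cluster_hits(hits, max_evalue=0.05, min_score=100.0):
    """Group sequence IDs into connected components of hits that pass the thresholds."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, evalue, score in hits:
        if query == subject:
            continue  # a self-hit carries no clustering information
        if evalue <= max_evalue and score >= min_score:
            parent[find(query)] = find(subject)

    groups = defaultdict(set)
    for node in list(parent):
        groups[find(node)].add(node)
    return sorted((sorted(members) for members in groups.values()),
                  key=lambda members: members[0])


# Hypothetical hit values, for illustration only.
example_hits = [
    ("EPHA3", "EPHA4", 1e-50, 500.0),
    ("EPHA4", "EPHA7", 1e-40, 400.0),
    ("SMAD2", "SMAD3", 1e-60, 600.0),
    ("SMAD2", "TP53", 0.5, 30.0),   # fails both thresholds
    ("TP53", "TP53", 0.0, 900.0),   # self-hit, ignored
]
example_clusters = cluster_hits(example_hits)
```

Sequences with no passing hit to another sequence do not appear in the output, so only multi-member clusters are reported.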
Protein domain comparisons
Significance of gene sets
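The Group CaMP calculation described for the protein domain analysis (a binomial probability of observing at least the number of mutations in a group, followed by Benjamini–Hochberg correction) can be sketched as follows. The sketch assumes that the number of trials is the combined group length multiplied by the number of tumors and that the per-base probability is the passenger rate. This is a simplification for illustration, and the function names and example numbers are hypothetical.

```python
import math


def _log_pmf(i, n, p):
    """Log of the Binomial(n, p) probability mass at i, for 0 < p < 1."""
    return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p))


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    if k <= 0:
        return 1.0
    if k > n:
        return 0.0
    odds = p / (1.0 - p)
    if k > n * p:
        # Above the mean: add terms upward from k until they stop mattering.
        term, total, i = math.exp(_log_pmf(k, n, p)), 0.0, k
        while i <= n and term > 0.0:
            total += term
            if term < total * 1e-17:
                break
            term *= (n - i) / (i + 1) * odds
            i += 1
        return min(total, 1.0)
    # At or below the mean: sum the lower tail downward from k - 1.
    term, lower, i = math.exp(_log_pmf(k - 1, n, p)), 0.0, k - 1
    while i >= 0 and term > 0.0:
        lower += term
        if term < lower * 1e-17:
            break
        term *= i / ((n - i + 1) * odds)
        i -= 1
    return max(0.0, 1.0 - lower)


def group_p_value(n_mutations, total_bases, n_tumors, rate_per_mb):
    """Tail probability for a gene group under an assumed passenger rate."""
    trials = total_bases * n_tumors          # assumption: one trial per base per tumor
    return binomial_upper_tail(n_mutations, trials, rate_per_mb / 1e6)


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

For example, `group_p_value(12, 50_000, 35, 3.3)` evaluates a hypothetical group of 50,000 combined bases with 12 mutations across 35 colorectal tumors, and `benjamini_hochberg` would then be applied to the list of P-values from all groups tested.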
CaMP GSEA
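The idea of applying a GSEA-style statistic to CaMP scores, as described for the domain analysis, can be sketched as follows. This is a simplified running-sum enrichment score in the spirit of Subramanian et al. (2005), with a label-permutation P-value added as an assumption of this sketch. The function names, the weighting exponent, and the permutation scheme are hypothetical and are not taken from the study.

```python
import random


def enrichment_score(scores, gene_set, weight=1.0):
    """Signed maximum deviation of a weighted running sum over genes ranked by score."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    in_set = [gene in gene_set for gene in ranked]
    n_hits = sum(in_set)
    n_miss = len(ranked) - n_hits
    if n_hits == 0 or n_miss == 0:
        return 0.0
    hit_weights = [abs(scores[g]) ** weight if hit else 0.0
                   for g, hit in zip(ranked, in_set)]
    total_hit = sum(hit_weights)
    if total_hit == 0.0:
        return 0.0
    running, best = 0.0, 0.0
    for w, hit in zip(hit_weights, in_set):
        # Step up (weighted) at set members, step down uniformly at non-members.
        running += w / total_hit if hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(scores, gene_set, n_perm=1000, seed=0):
    """Fraction of random gene sets of equal size with an equally extreme score."""
    rng = random.Random(seed)
    genes = list(scores)
    members = set(gene_set) & set(genes)
    observed = abs(enrichment_score(scores, members))
    extreme = 0
    for _ in range(n_perm):
        fake = set(rng.sample(genes, len(members)))
        if abs(enrichment_score(scores, fake)) >= observed:
            extreme += 1
    return (extreme + 1) / (n_perm + 1)


# Hypothetical CaMP-like scores, for illustration only.
toy_scores = {"A": 5.0, "B": 4.0, "C": 1.0, "D": 0.5}
es_top = enrichment_score(toy_scores, {"A", "B"})
es_bottom = enrichment_score(toy_scores, {"C", "D"})
```

A positive score indicates that the set's genes concentrate at the top of the CaMP ranking, and a negative score indicates concentration at the bottom.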
For the groups identified by either the Group CaMP or CaMP GSEA approaches, we further focus on those that were also enriched for an increased number of mutant genes. For each group, we computed the total number of genes observed to be mutated and sequenced, taking into consideration multiple CCDS entries for some genes. For each group, the expected number of mutated genes was calculated to be the product of the number of sequenced genes in the group and the proportion of genes mutated in the entire study. Although this is a post-test filter and not a test in itself, we report a P-value calculated using the Pearson
Functional annotation and Gene Ontology Biological process and molecular function categories were obtained from the GO Consortium website (http://www.geneontology.org). These contained 11,295 biological processes and 7445 molecular functions, as of August 2006. The cross-reference to CCDS entries resulted in 22,705 and 26,430 assignments, respectively. For each GO category, similar calculations were performed for the total number of mutations observed and the total number of base pairs sequenced, as described above for the protein domains. The Group CaMP and GSEA CaMP scores were calculated as described above.
Predicted protein–protein interactions Cellular component data was obtained from Gene Ontology. As of August, 2006, 1802 cellular component terms were available. For network visualization, we first generated the initial network with Cytoscape 2.3.1. Each individual gene was then placed into an appropriate cellular component based on the Gene Ontology data.
Molecular pathways analysis
This study was supported by The Virginia and D.K. Ludwig Fund for Cancer Research, NIH grants CA 121113, CA 43460, CA 57345, CA105090-03 and CA62924, NSF grant DMS034211, The Pew Charitable Trusts, The Clayton Fund, The Blaustein Foundation, and the NCI Division of Cancer Prevention contract HHSN261200433002C.
1 Corresponding author.
E-mail velculescu{at}jhmi.edu; fax (410) 955-0548. [Supplemental material is available online at www.genome.org.] Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.6431107
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403–410.
Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Birney, E., Biswas, M., Bucher, P., Cerutti, L., Corpet, F., Croning, M.D., et al. 2001. The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res. 29: 37–40.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. 2000. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25: 25–29.
Benjamini, Y. and Hochberg, Y. 1995. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. [Ser B] 57: 289–300.
Breitkreutz, B.J., Stark, C., and Tyers, M. 2003. Osprey: A network visualization system. Genome Biol. 4: R22. doi: 10.1186/gb-2003-4-3-r22.
Camon, E., Barrell, D., Lee, V., Dimmer, E., and Apweiler, R. 2004. The Gene Ontology Annotation (GOA) Database–an integrated resource of GO annotations to the UniProt Knowledgebase. In Silico Biol. 4: 5–6.
Casero, R.A. and Marton, L.J. 2007. Targeting polyamine metabolism and function in cancer and other hyperproliferative diseases. Nat. Rev. Drug Discov. 6: 373–390.
Chen, S.T., Choo, K.B., Hou, M.F., Yeh, K.T., Kuo, S.J., and Chang, J.G. 2005. Deregulated expression of the PER1, PER2 and PER3 genes in breast cancers. Carcinogenesis 26: 1241–1246.
Deryugina, E.I. and Quigley, J.P. 2006. Matrix metalloproteinases and tumor metastasis. Cancer Metastasis Rev. 25: 9–34.
Fu, L., Pelicano, H., Liu, J., Huang, P., and Lee, C. 2002. The circadian gene Period2 plays an important role in tumor suppression and DNA damage response in vivo. Cell 111: 41–50.
Giot, L., Bader, J.S., Brouwer, C., Chaudhuri, A., Kuang, B., Li, Y., Hao, Y.L., Ooi, C.E., Godwin, B., Vitols, E., et al. 2003. A protein interaction map of Drosophila melanogaster. Science 302: 1727–1736.
Grünbaum, B. 1975. Venn diagrams and Independent Families of Sets. Mathematics Mag. 48: 12–23.
Gupta, G.P., Nguyen, D.X., Chiang, A.C., Bos, P.D., Kim, J.Y., Nadal, C., Gomis, R.R., Manova-Todorova, K., and Massague, J. 2007. Mediators of vascular remodelling co-opted for sequential steps in lung metastasis. Nature 446: 765–770.
Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M., and Sakaki, Y. 2001. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. 98: 4569–4574.
Jayaraman, L. and Massague, J. 2000. Distinct oligomeric states of SMAD proteins in the transforming growth factor-beta pathway. J. Biol. Chem. 275: 40710–40717.
Jonsson, P.F. and Bates, P.A. 2006. Global topological features of cancer proteins in the human interactome. Bioinformatics 22: 2291–2297.
Kau, T.R., Way, J.C., and Silver, P.A. 2004. Nuclear transport and cancer: From mechanism to intervention. Nat. Rev. Cancer 4: 106–117.
Lee, C.C. 2006. Tumor suppression by the mammalian Period genes. Cancer Causes Control 17: 525–530.
Li, S., Armstrong, C.M., Bertin, N., Ge, H., Milstein, S., Boxem, M., Vidalain, P.O., Han, J.D., Chesneau, A., Hao, T., et al. 2004. A map of the interactome network of the metazoan C. elegans. Science 303: 540–543.
Martin, D., Brun, C., Remy, E., Mouren, P., Thieffry, D., and Jacq, B. 2004. GOToolBox: Functional analysis of gene datasets based on Gene Ontology. Genome Biol. 5: R101. doi: 10.1186/gb-2004-5-12-r101.
Nakamura, T. 2005. NUP98 fusion in human leukemia: Dysregulation of the nuclear pore and homeodomain proteins. Int. J. Hematol. 82: 21–27.
Ogata, H., Goto, S., Sato, K., Fujibuchi, W., Bono, H., and Kanehisa, M. 1999. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 27: 29–34.
Sjöblom, T., Jones, S., Wood, L.D., Parsons, D.W., Lin, J., Barber, T.D., Mandelker, D., Leary, R.J., Ptak, J., Silliman, N., et al. 2006. The consensus coding sequences of human breast and colorectal cancers. Science 314: 268–274.
Smyth, G.K. 2005. limma: Linear models for microarray data. In Bioinformatics and computational biology solutions using R and bioconductor (eds. R. Gentleman et al.), pp. 397–420. Springer, London, UK.
Subramanian, A., Tamayo, P., Mootha, V.K., Mukherjee, S., Ebert, B.L., Gillette, M.A., Paulovich, A., Pomeroy, S.L., Golub, T.R., Lander, E.S., et al. 2005. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. 102: 15545–15550.
Tian, L., Greenberg, S.A., Kong, S.W., Altschuler, J., Kohane, I.S., and Park, P.J. 2005. Discovering statistically significant pathways in expression profiling studies. Proc. Natl. Acad. Sci. 102: 13544–13549.
Uetz, P., Giot, L., Cagney, G., Mansfield, T.A., Judson, R.S., Knight, J.R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., et al. 2000. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403: 623–627.
Vogelstein, B. and Kinzler, K.W. 2004. Cancer genes and the pathways they control. Nat. Med. 10: 789–799.
Received February 23, 2007; accepted in revised form June 28, 2007.