Genome Res. 14:1085-1094, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Letter
Coexpression Analysis of Human Genes Across Many Microarray Data Sets
1 Columbia Genome Center, Columbia University, New York, New York 10032, USA
2 College of Physicians and Surgeons, Columbia University, New York, New York 10032, USA
3 Department of Biomedical Informatics, Columbia University, New York, New York 10032, USA
We present a large-scale analysis of mRNA coexpression based on 60 large human data sets containing a total of 3924 microarrays. We sought pairs of genes that were reliably coexpressed (based on the correlation of their expression profiles) in multiple data sets, establishing a high-confidence network of 8805 genes connected by 220,649 "coexpression links" that are observed in at least three data sets. Confirmed positive correlations between genes were much more common than confirmed negative correlations. We show that confirmation of coexpression in multiple data sets is correlated with functional relatedness, and show how cluster analysis of the network can reveal functionally coherent groups of genes. Our findings demonstrate how the large body of accumulated microarray data can be exploited to increase the reliability of inferences about gene function.
Gene expression microarray data are a form of high-throughput genomics data providing relative measurements of mRNA levels for thousands of genes in a biological sample. In the last few years, hundreds of laboratories have collected and analyzed microarray data, and the data are beginning to appear in public databases or on researchers' Web sites. These resources serve at least two purposes. One is as an archive of the data, which allows other researchers to confirm the results that have been published by the originator of the data. A second use is to permit novel analyses of the data that go beyond what was envisioned or possible at the time of the original study. A novel analysis could involve just a single data set, or a meta-analysis of many data sets (where a "data set" is a group of microarrays that were collected together, and typically described as a group in a single publication). The combined analysis of multiple data sets forms the main topic of this paper.
Most existing studies that have analyzed multiple independently collected microarray data sets have focused on differential expression, comparing two or more similar data sets to look for genes that distinguish different sets of samples (Breitling et al. 2002
Another way of using microarray data is to exploit gene coexpression instead of differential expression. In this approach, genes that have similar expression patterns across a set of samples are hypothesized to have a functional relationship. It has been shown in a number of studies that coexpression is correlated with functional relationships, such as physical interaction between the encoded proteins, though coexpression does not necessarily imply a causal relationship among transcript levels (Eisen et al. 1998 In this paper we describe an analysis of gene coexpression in 60 large human microarray data sets, and we assess the functional relevance and reproducibility of the coexpression patterns we detected. We found that a substantial number of correlated expression patterns occur in multiple independent data sets. This confirmation of correlated expression provides a useful way to improve the confidence in any particular correlated expression pattern. Indeed, we show that coexpression patterns that are confirmed are more likely to be functionally relevant. The database and methods we describe can form the basis for further large-scale exploration of gene coexpression data.
We analyzed pairwise correlation of gene expression in a large corpus of microarray data of 60 diverse data sets (Table 1). This corpus contains a total of 62.2 million expression measurements distributed among 3924 microarrays; all of the data sets have at least 10 samples (microarrays), and the largest contains 255 samples. We analyzed correlation of gene expression profiles within each data set, selecting for further study the "coexpression links" that were deemed to be statistically significant (see Methods). For the analysis presented in this paper, we considered a set of 16,511 human genes from RefSeq, of which 15,700 were detectably expressed in at least one data set.
This analysis yielded 9.7 million different "raw" coexpression links between genes. A total of 11 million occurrences of these links were found, indicating that some links occur in multiple data sets. Of the 9.7 million different links, 5.39 million (56%) had positive correlations, compared to 4.31 million negative correlations. This imbalance apparently occurs because negative correlations tended to be less common than positive correlations in the raw data, and fewer of them reach significance in our primary analysis. Between 673 and 1.5 million correlated gene pairs (raw coexpression links) were stored for each data set (median 56,000; Table 1). Of the genes tested, 15,458 (98%) had at least one coexpression link, with a median of 990 per gene. For the most part, the number of links a data set yielded was proportional to the number of genes represented on the array, but this was also affected by the number of samples in the data set (data not shown). This is because our criteria for link acceptance take into account the number of samples in the calculation of statistical significance.
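The selection of raw links within a single data set can be illustrated with a minimal sketch. It assumes Pearson correlation and a fixed cutoff on its absolute value; the cutoff `min_abs_r`, the function names, and the toy profiles are hypothetical stand-ins, since the actual significance criteria are given in Methods and depend on the number of samples.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat profile carries no correlation signal
    return sxy / sqrt(sxx * syy)

def raw_links(profiles, min_abs_r=0.9):
    """Return {(gene_a, gene_b): sign} for gene pairs passing the cutoff.

    min_abs_r is a hypothetical placeholder for the significance test
    described in Methods, not the criterion used in the study.
    """
    links = {}
    for a, b in combinations(sorted(profiles), 2):
        r = pearson(profiles[a], profiles[b])
        if abs(r) >= min_abs_r:
            links[(a, b)] = 1 if r > 0 else -1
    return links

# Toy data set: four genes measured on four microarrays.
toy = {
    "G1": [1.0, 2.0, 3.0, 4.0],
    "G2": [2.0, 4.1, 5.9, 8.0],
    "G3": [4.0, 3.0, 2.0, 1.0],
    "G4": [1.0, 3.0, 1.0, 3.0],
}
links = raw_links(toy)
```

Storing the sign with each link keeps positive and negative correlations distinguishable in later steps.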
Coexpression Link Confirmation
Figure 2A shows the number of links confirmed in a given number of data sets. This figure shows that whereas most links are not confirmed in our database, many links are confirmed and some links are found in numerous data sets. The largest number of data sets a link was seen in was 31. Of the links in our original selected pool of 9.7 million, none were testable in all 60 data sets (the maximum was 57), because, as mentioned, none of the genes we considered occurred or were considered detectable in all 60 data sets. The wide variety of microarray platforms represented in our database led to most links being tested in far fewer than 60 data sets, and the links in the original pool were tested in a mean of 18 data sets (median 15).
Although confirmation of coexpression suggests greater reliability, we expect some confirmations to occur purely by chance, due to the large number of data sets we tested. To estimate the statistical significance of link confirmation, we created randomized databases where the number of links per gene and per data set had the same distributions as in our real data, but the links were created between genes within a data set at random. These randomized databases produce links confirmed in three or more data sets (hereafter denoted as "3+ confirmed") at a rate of 5.24±0.08% (mean ± standard deviation) of that observed using the original data, and produce very few links confirmed in more than four data sets (less than 0.5% of those found in the unshuffled data). However, for 2+ confirmed links the rate is 34%. We note that these tests examine the random occurrence of confirmed links, not the random occurrence of links within single data sets. When we instead shuffled the microarray expression profiles before raw link determination, we obtained almost no 3+ confirmed links (<10). For much of the remainder of our analysis we focus on 3+ confirmed links. Out of 9.7 million unique coexpression links, 220,649 (2.2%) are seen in at least three data sets (3+ confirmed). In addition, 8805 of the genes tested have at least one 3+ confirmed link, encompassing 60% of the 14,172 genes that were expressed in at least three data sets (and thus capable of having 3+ confirmable links). Not surprisingly, genes with many raw links tended to have more 3+ confirmed links (Spearman's rank correlation 0.81; Fig. 2B).
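Counting confirmations can be sketched as follows. This is an illustration under simple assumptions, not the database implementation: each data set is a hypothetical mapping from a gene pair to the sign of its correlation, and a link is counted as confirmed only when the same pair appears with the same sign.

```python
from collections import Counter

def confirmation_counts(links_by_dataset):
    """Count, for each (gene pair, sign), the data sets that contain it."""
    counts = Counter()
    for links in links_by_dataset.values():
        for pair, sign in links.items():
            counts[(pair, sign)] += 1
    return counts

def confirmed(links_by_dataset, min_datasets=3):
    """Keep links seen with the same sign in at least min_datasets data sets."""
    counts = confirmation_counts(links_by_dataset)
    return {key: n for key, n in counts.items() if n >= min_datasets}

# Toy database of four data sets (names and links are hypothetical).
toy_db = {
    "ds1": {("A", "B"): 1, ("A", "C"): -1},
    "ds2": {("A", "B"): 1, ("A", "C"): 1},
    "ds3": {("A", "B"): 1, ("B", "C"): 1},
    "ds4": {("A", "C"): -1},
}
three_plus = confirmed(toy_db)
```

Here the pair A-B is 3+ confirmed, whereas A-C is not: its negative correlation is seen only twice, and the positive correlation in `ds2` does not count toward it.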
Figure 2C shows the number of 3+ confirmed links per gene. The distribution approximately obeys a power law distribution, as is observed for many biological as well as other types of networks (Barabasi and Albert 1999 Although the numbers of positive and negative correlations we selected were fairly similar, a much larger fraction (88.8%) of confirmed links were for genes that showed positive correlations (a positive correlation in one data set and a negative correlation in another data set was not considered a confirmation). The overall 3+ confirmation rate for negative correlations was 0.5%, over seven times lower than the rate for positive correlations of 3.6%. Very few negative correlations (694) were confirmed at higher levels than 4+, and none were confirmed in more than eight data sets.
Functional Relevance of Link Confirmation
As links are increasingly confirmed, the semantic similarity of the genes also tends to increase (Fig. 3). Importantly, the distribution of GO term overlap for links that are seen only in a single data set is significantly different from randomly generated links (signed-rank test, P < 10^-15). This suggests that our initial link selection procedure is at least somewhat effective in selecting biologically relevant links, even if they are never confirmed in other data sets. Links that are confirmed two or more times have higher GO term overlaps than those seen only once (P < 10^-15), and those 3+ confirmed are significantly more similarly annotated than those at 2+ (P < 10^-15), each confirmation corresponding to about one additional GO term in common, on average. At high levels of confirmation, a high degree of known functional relatedness of the pairs is very likely, as shown by the curve for 15+ confirmations (Fig. 3). These findings were also confirmed using an alternative measure of semantic similarity (Lord et al. 2003
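A GO term overlap of the kind discussed above can be sketched as a count of annotation terms shared by the two genes of a link. This is an assumed, simplified form of the metric: the gene names and term labels below are placeholders rather than real annotations, and the metric used in the study may treat the GO hierarchy differently.

```python
def go_overlap(annotations, gene_a, gene_b):
    """Number of GO terms annotated to both genes of a link."""
    terms_a = annotations.get(gene_a, set())
    terms_b = annotations.get(gene_b, set())
    return len(terms_a & terms_b)

def mean_overlap(annotations, pairs):
    """Average shared-term count over a collection of gene pairs."""
    pairs = list(pairs)
    return sum(go_overlap(annotations, a, b) for a, b in pairs) / len(pairs)

# Hypothetical annotations; the labels are placeholders, not real GO IDs.
annotations = {
    "geneA": {"translation", "ribosome", "cytosol"},
    "geneB": {"translation", "ribosome", "cytosol"},
    "geneC": {"cell junction", "cell adhesion"},
    "geneD": {"cell junction"},
}
seen_once = [("geneA", "geneC"), ("geneB", "geneD")]
confirmed_links = [("geneA", "geneB"), ("geneC", "geneD")]
```

Comparing `mean_overlap` for groups of links at different confirmation levels gives the kind of contrast summarized in Figure 3.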
Cluster Analysis of the Confirmed Coexpression Network
The set of coexpression links forms a network among the genes. The density of the 3+ network (the ratio of links between genes to the number of possible links) is 0.0057, with a diameter of 10 (the longest minimal path between two genes). The network can be broken into just 49 unconnected components, the largest of which contains almost all the genes (8705). The remaining 48 components contain only two or three genes each. We used two clustering approaches to gain further insight into the structure of the gene interaction network predicted from confirmed coexpression. First, we used hierarchical clustering (Methods; Fig. 4). Because of the large size of the 3+ network, for this analysis we used the set of 7+ confirmed links, further limiting the analysis to those genes having at least six 7+ links (720 genes and 10,089 links). By applying hierarchical clustering to a matrix representation of the network, we identified a series of "core clusters" that appear along the diagonal of the matrix (left-hand side of Fig. 4). Interactions between genes in these core clusters appear as spots off the diagonal. The right-hand side of Figure 4 is a visualization of GO categories associated with each gene. The columns of the GO matrix were also clustered to put terms with similar patterns near each other.
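The network density quoted above follows directly from the gene and link counts, and the component structure can be found with a standard breadth-first search. The sketch below is illustrative only; the function names and the toy network are assumptions and do not describe the software used in the study.

```python
from collections import defaultdict, deque

def density(n_genes, n_links):
    """Ratio of observed links to the number of possible gene pairs."""
    possible = n_genes * (n_genes - 1) // 2
    return n_links / possible

def components(links):
    """Connected components of an undirected network, largest first."""
    adjacency = defaultdict(set)
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, found = set(), []
    for start in adjacency:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node] - seen:
                seen.add(neighbor)
                queue.append(neighbor)
        found.append(component)
    return sorted(found, key=len, reverse=True)

# Density of the 3+ network from the counts reported in the text.
d = density(8805, 220649)

# Toy network with one three-gene and one two-gene component.
toy_components = components([("A", "B"), ("B", "C"), ("D", "E")])
```

With 8805 genes there are 38,759,610 possible pairs, so 220,649 links correspond to a density of about 0.0057, in agreement with the value given above.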
A statistical analysis (see Methods) allowed us to associate many of the clusters with specific GO terms, illustrated by the color coding on the right-hand side of Figure 4. For example, a cluster of genes at the upper right is clearly associated with the GO terms related to protein translation including "cytosolic ribosome," and indeed includes many ribosomal proteins and translation initiation and elongation factors. A smaller identifiable cluster is represented by MHC II protein coding genes. The MHC II genes are associated with several other clusters containing many genes related to the immune response (in the lower left of the matrix, orange box). The middle of Figure 4 is dominated by a large, fairly diffuse cluster of about one-third of the genes (indicated by the light blue box) that contains within it several tighter groups of genes associated with GO terms related to RNA processing, DNA replication, and the cell cycle. The many links between these groups of genes (off the diagonal) may represent robust interactions between these processes. We stress that all of the coexpression events in Figure 4 were seen in at least seven different microarray data sets.
Although the hierarchical clustering approach yields a high-level overview, it is difficult to study individual genes in the network in this manner, and it was difficult to analyze larger networks. Therefore to analyze the network of 3+ confirmed genes, we used a second approach based on MCODE, an algorithm designed to identify groups of highly interconnected genes from networks (Bader and Hogue 2003
Two illustrative clusters are shown in Figure 5. Figure 5A shows a cluster of 15 genes, several of which are associated with the GO terms "cell junction" (CLDN3, CLDN4, CLDN7, CDH1) and "epidermal differentiation" (ELF3, CRABP2). Many of the other genes in this cluster have identified or suspected roles in the regulation of cell motility or tumor cell invasiveness (including DDR1, SPINT2, HRIHFB2122, TACSTD1, and WNT5A; Vogel et al. 1997
This study provides information on the structure of correlation-based links between genes in multiple microarray data sets. Our main goal was to establish whether comparing analyses across data sets is relevant to understanding gene function. The primary evidence that this is the case is that many genes show patterns of correlated expression that are reproducible across data sets, and that there is a clear relationship between confirmation of correlated expression and related gene function. Reproducible coexpression links are found for numerous genes. This suggests that this type of analysis can be used rather broadly, and is not confined to use on a small set of genes. On the other hand, only a small fraction of all links were confirmed in at least three data sets. Though this suggests that many links seen only once may not be biologically relevant, our Gene Ontology analysis shows that even links that are never confirmed are substantially more informative than random data (Fig. 3). The obvious difficulty with using results that are never confirmed is identifying the meaningful novel relationships, and therefore focusing on confirmed coexpression seems preferable. In order for a link to be confirmed, several criteria must be met. First, the pairs of genes must be present and detectably expressed in multiple data sets; a gene that is only represented in one data set will never have any confirmed links. In our database, not a single gene was considered detectable in all 60 data sets; the maximum was 57, for seven genes, and 5667 were detectable in 25 or more data sets. We also expect that confirmation of a link might be sample-type specific, even if the genes are expressed in all cases. Thus, two genes might be coexpressed only in leukemia data sets, even though they are expressed in other types of data sets. 
Because we used a fairly wide variety of data sets in our study, the lack of confirmation of many links could be due simply to lack of including appropriate data (there may also be a positive bias to the links discovered due to the particular data sets we studied). Finally, we may miss confirmations if our link selection criteria are too stringent. When a link is seen in many data sets, it is increasingly likely that it represents a known functional relationship. This means that, to a certain extent, it is unlikely that many novel functional relationships will be found by seeking coexpression that is ubiquitous. We believe that confirmation near the 3+ level, or even 2+ for smaller data corpuses, will yield a higher fraction of novel relationships while still having a high enough degree of reliability. The exact level of confirmation required before one is motivated to seek additional evidence or perform follow-up studies is difficult to generalize, and our method provides a high degree of flexibility in how the results are interpreted. For some purposes, a higher level of confirmation may be worth the risks of losing information, whereas in other cases even links seen only a single time can be of value.
Most previous studies of gene networks have used data from unicellular organisms, primarily the budding yeast Saccharomyces cerevisiae. In yeast, it has been estimated that there are at least 30,000 interactions among the
Negative correlations were much less likely to be confirmed in independent data sets. This was counter to our expectation because, in principle, negative correlations seem less likely to be the product of technically induced artifacts. Thus we expected the raw pool to be "cleaner" than for the positive correlations. There are several possible explanations for this result. One is that biologically meaningful negative correlations are harder to detect using microarrays, and our failure to detect them is due to experimental or analytical shortcomings. We may also not have appropriate data sets to confirm negative correlation links. A final explanation is that there may be biological reasons to favor positive coregulation of gene expression. We are unaware of any global analysis of this issue, though it may be relevant that active gene-specific transcriptional repression is a relatively uncommon regulatory mechanism in eukaryotes (Struhl 1999
We envision that databases of correlated expression will have many uses for biologists. One is to discover or confirm functional relationships that could only be made with low confidence from a single data set. Taken as a whole, the database represents a complex network of correlated expression that can be used for the analysis of large-scale properties of biological networks. It will also be of interest to integrate the information from correlated expression with other types of "links," including the GO approach we have taken thus far, as well as information mined from literature databases and other experimental sources such as yeast two-hybrid data. Careful integration of heterogeneous data types will be essential to making full use of the accumulated expression data. Another topic of interest is coexpression that is conserved across species (Stuart et al. 2003
To make our findings and database available for further evaluation and use by the scientific community, we have developed a simple Web interface to the database that can be accessed at http://microarray.cpmc.columbia.edu/tmm. The interface permits simple queries to extract the links for a gene at a desired degree of confirmation stringency. The interface also displays visualizations of the original microarray data that generated the coexpression links, and has hyperlinks to external databases for each set of linked genes to facilitate exploration of the results. We are also making available extracted tables of coexpression links from the entire database that can be used for further bioinformatic analysis.
Data Preparation
Sixty human microarray data sets were included in this study, totaling 3924 arrays. All but one of the data sets are currently publicly available (the exception is the "Sibille-pfc" data set). Major data sources were the Stanford Microarray Database (Sherlock et al. 2001
Coexpression Link Identification
Correction for Multiply Represented Genes
Link Confirmation
GO Similarity Metric
Cluster Analysis
To compare hierarchical clustering with GO annotations, we first identified all GO terms that were associated with at least five genes in the set under consideration, but that did not apply to more than 20% of the genes (to avoid overly general or specific terms). For each term we examined each branch of the hierarchical clustering tree to identify the branch with the highest over-representation of the term relative to the rest of the genes, flagging clusterings with P < 0.05 (cumulative hypergeometric distribution and Bonferroni-corrected for the number of GO terms examined), which contained at least five genes, and had an average pairwise correlation of at least 0.5 (to avoid always detecting the entire data set as the optimal cluster). The GO annotations were then represented as a binary matrix, where each entry indicates whether a gene and GO term were associated. Note that this procedure only analyzes relative GO term enrichment within the genes used for clustering, not the entire database. We performed a similar analysis to help identify MCODE clusters that were enriched in particular GO terms.
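The over-representation test described above can be sketched as an upper-tail cumulative hypergeometric probability followed by a Bonferroni correction. The function names and the example counts are hypothetical, and the sketch covers only the P-value calculation, not the search over branches of the clustering tree.

```python
from math import comb

def hypergeom_upper_tail(n_genes, n_with_term, cluster_size, n_hits):
    """P(at least n_hits term-carrying genes in a random cluster).

    n_genes:      genes used for clustering
    n_with_term:  genes among them annotated with the GO term
    cluster_size: genes in the branch being tested
    n_hits:       genes in the branch annotated with the term
    """
    total = comb(n_genes, cluster_size)
    upper = min(n_with_term, cluster_size)
    tail = sum(
        comb(n_with_term, i) * comb(n_genes - n_with_term, cluster_size - i)
        for i in range(n_hits, upper + 1)
    )
    return tail / total

def bonferroni(p_value, n_tests):
    """Bonferroni-corrected P-value, capped at 1."""
    return min(1.0, p_value * n_tests)

# Hypothetical example: 20 genes, 5 carry the term, and a 5-gene
# branch contains 4 of them; 10 GO terms were examined in total.
p_raw = hypergeom_upper_tail(20, 5, 5, 4)
p_corrected = bonferroni(p_raw, 10)
```

In this toy case the raw probability is 76/15504 (about 0.0049), which stays just below 0.05 after correcting for ten terms.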
We sincerely thank the many groups who generously made their microarray data available, in some cases prior to publication, and the organizers of the public microarray databases that facilitated data acquisition; Etienne Sibille, John Mann, and Victoria Arango for use of the human prefrontal cortex microarray data set; and Andrey Rzhetsky, Etienne Sibille, Gary Bader, Nick Socci, Agnes Viale, Alex Lash, and the anonymous reviewers for helpful suggestions. This work was supported in part by a pilot grant from the Avon Breast Cancer Foundation to P.P. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1910904.
4 Corresponding author. [Supplemental material is available online at www.genome.org and http://microarray.cpmc.columbia.edu/tmm.]
Alizadeh, A.A., Eisen, M.B., Davis, R.E., Ma, C., Lossos, I.S., Rosenwald, A., Boldrick, J.C., Sabet, H., Tran, T., Yu, X., et al. 2000. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403: 503-511.
Allander, S.V., Nupponen, N.N., Ringner, M., Hostetter, G., Maher, G.W., Goldberger, N., Chen, Y., Carpten, J., Elkahloun, A.G., and Meltzer, P.S. 2001. Gastrointestinal stromal tumors with KIT mutations exhibit a remarkably homogeneous gene expression profile. Cancer Res. 61: 8624-8628.
Armstrong, S.A., Staunton, J.E., Silverman, L.B., Pieters, R., den Boer, M.L., Minden, M.D., Sallan, S.E., Lander, E.S., Golub, T.R., and Korsmeyer, S.J. 2002. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat. Genet. 30: 41-47.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. 2000. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25: 25-29.
Bader, G.D. and Hogue, C.W. 2003. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinformatics 4: 2.
Barabasi, A.L. and Albert, R. 1999. Emergence of scaling in random networks. Science 286: 509-512.
Batagelj, V. and Mrvar, A. 1998. Pajek: Program for large network analysis. Connections 21: 47-57.
Bhan, A., Galas, D.J., and Dewey, T.G. 2002. A duplication growth model of gene expression networks. Bioinformatics 18: 1486-1493.
Bhattacharjee, A., Richards, W.G., Staunton, J., Li, C., Monti, S., Vasa, P., Ladd, C., Beheshti, J., Bueno, R., Gillette, M., et al. 2001. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. 98: 13790-13795.
Bittner, M., Meltzer, P., Chen, Y., Jiang, Y., Seftor, E., Hendrix, M., Radmacher, M., Simon, R., Yakhini, Z., Ben-Dor, A., et al. 2000. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406: 536-540.
Breitling, R., Sharif, O., Hartman, M.L., and Krisans, S.K. 2002. Loss of compartmentalization causes misregulation of lysine biosynthesis in peroxisome-deficient yeast cells. Eukaryot. Cell 1: 978-986.
Butte, A.J., Tamayo, P., Slonim, D., Golub, T.R., and Kohane, I.S. 2000. Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc. Natl. Acad. Sci. 97: 12182-12186.
Chang, H.Y., Chi, J.T., Dudoit, S., Bondre, C., van de Rijn, M., Botstein, D., and Brown, P.O. 2002. Diversity, topographic differentiation, and positional memory in human fibroblasts. Proc. Natl. Acad. Sci. 99: 12877-12882.
Chaussabel, D., Semnani, R.T., McDowell, M.A., Sacks, D., Sher, A., and Nutman, T.B. 2003. Unique gene expression profiles of human macrophages and dendritic cells to phylogenetically distinct parasites. Blood 102: 672-681.
Chen, X., Cheung, S.T., So, S., Fan, S.T., Barry, C., Higgins, J., Lai, K.M., Ji, J., Dudoit, S., Ng, I.O., et al. 2002. Gene expression patterns in human liver cancers. Mol. Biol. Cell 13: 1929-1939.
Cheok, M.H., Yang, W., Pui, C.H., Downing, J.R., Cheng, C., Naeve, C.W., Relling, M.V., and Evans, W.E. 2003. Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells. Nat. Genet. 34: 85-90.
Choi, J.K., Yu, U., Kim, S., and Yoo, O.J. 2003. Combining multiple microarray studies and modeling interstudy variation. Bioinformatics (Suppl.) 19: I84-I90.
Dabrowski, M., Aerts, S., Van Hummelen, P., Craessaerts, K., De Moor, B., Annaert, W., Moreau, Y., and De Strooper, B. 2003. Gene profiling of hippocampal neuronal culture. J. Neurochem. 85: 1279-1288.
Detours, V., Dumont, J.E., Bersini, H., and Maenhaut, C. 2003. Integration and cross-validation of high-throughput gene expression data: Comparing heterogeneous data sets. FEBS Lett. 546: 98-102.
Dhanasekaran, S.M., Barrette, T.R., Ghosh, D., Shah, R., Varambally, S., Kurachi, K., Pienta, K.J., Rubin, M.A., and Chinnaiyan, A.M. 2001. Delineation of prognostic biomarkers in prostate cancer. Nature 412: 822-826.
Diehn, M., Alizadeh, A.A., Rando, O.J., Liu, C.L., Stankunas, K., Botstein, D., Crabtree, G.R., and Brown, P.O. 2002. Genomic expression programs and the integration of the CD28 costimulatory signal in T cell activation. Proc. Natl. Acad. Sci. 99: 11796-11801.
Dyrskjot, L., Thykjaer, T., Kruhoffer, M., Jensen, J.L., Marcussen, N., Hamilton-Dutoit, S., Wolf, H., and Orntoft, T.F. 2003. Identifying distinct classes of bladder carcinoma using microarrays. Nat. Genet. 33: 90-96.
Edgar, R., Domrachev, M., and Lash, A.E. 2002. Gene expression omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res. 30: 207-210.
Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 95: 14863-14868.
Featherstone, D.E. and Broadie, K. 2002. Wrestling with pleiotropy: Genomic and topological analysis of the yeast gene expression network. Bioessays 24: 267-274.
Gachotte, D., Eckstein, J., Barbuch, R., Hughes, T., Roberts, C., and Bard, M. 2001. A novel gene conserved from yeast to humans is involved in sterol biosynthesis. J. Lipid Res. 42: 150-154.
Garber, M.E., Troyanskaya, O.G., Schluens, K., Petersen, S., Thaesler, Z., Pacyna-Gengelbach, M., van de Rijn, M., Rosen, G.D., Perou, C.M., Whyte, R.I., et al. 2001. Diversity of gene expression in adenocarcinoma of the lung. Proc. Natl. Acad. Sci. 98: 13784-13789.
Ge, H., Liu, Z., Church, G.M., and Vidal, M. 2001. Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat. Genet. 29: 482-486.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., et al. 1999. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286: 531-537.
Graeber, T.G. and Eisenberg, D. 2001. Bioinformatic identification of potential autocrine signaling loops in cancers from gene expression profiles. Nat. Genet. 29: 295-300.
Greenbaum, D., Luscombe, N.M., Jansen, R., Qian, J., and Gerstein, M. 2001. Interrelating different types of genomic data, from proteome to secretome: 'Oming in on function. Genome Res. 11: 1463-1468.
Greenberg, S.A., Sanoudou, D., Haslett, J.N., Kohane, I.S., Kunkel, L.M., Beggs, A.H., and Amato, A.A. 2002. Molecular profiles of inflammatory myopathies. Neurology 59: 1170-1182.
Gruvberger, S., Ringner, M., Chen, Y., Panavally, S., Saal, L.H., Borg, A., Ferno, M., Peterson, C., and Meltzer, P.S. 2001. Estrogen receptor status in breast cancer is associated with remarkably distinct gene expression patterns. Cancer Res. 61: 5979-5984.
Hedenfalk, I., Duggan, D., Chen, Y., Radmacher, M., Bittner, M., Simon, R., Meltzer, P., Gusterson, B., Esteller, M., Kallioniemi, O.P., et al. 2001. Gene-expression profiles in hereditary breast cancer. N. Engl. J. Med. 344: 539-548.
Hedenfalk, I., Ringner, M., Ben-Dor, A., Yakhini, Z., Chen, Y., Chebil, G., Ach, R., Loman, N., Olsson, H., Meltzer, P., et al. 2003. Molecular classification of familial non-BRCA1/BRCA2 breast cancer. Proc. Natl. Acad. Sci. 100: 2532-2537.
Huang, Y., Prasad, M., Lemon, W.J., Hampel, H., Wright, F.A., Kornacker, K., LiVolsi, V., Frankel, W., Kloos, R.T., Eng, C., et al. 2001. Gene expression in papillary thyroid carcinoma reveals highly consistent profiles. Proc. Natl. Acad. Sci. 98: 15044-15049.
Huang, E., Cheng, S.H., Dressman, H., Pittman, J., Tsou, M.H., Horng, C.F., Bild, A., Iversen, E.S., Liao, M., Chen, C.M., et al. 2003. Gene expression predictors of breast cancer outcomes. Lancet 361: 1590-1596.
Jansen, R., Greenbaum, D., and Gerstein, M. 2002. Relating whole-genome expression data with protein-protein interactions. Genome Res. 12: 37-46.
Jazaeri, A.A., Yee, C.J., Sotiriou, C., Brantley, K.R., Boyd, J., and Liu, E.T. 2002. Gene expression profiles of BRCA1-linked, BRCA2-linked, and sporadic ovarian cancers. J. Natl. Cancer Inst. 94: 990-1000.
Jeong, H., Mason, S.P., Barabasi, A.L., and Oltvai, Z.N. 2001. Lethality and centrality in protein networks. Nature 411: 41-42.
Kemmeren, P., van Berkum, N.L., Vilo, J., Bijma, T., Donders, R., Brazma, A., and Holstege, F.C. 2002. Protein interaction verification and functional annotation by integrated analysis of genome-scale data. Mol. Cell 9: 1133-1143.
Khan, J., Wei, J.S., Ringner, M., Saal, L.H., Ladanyi, M., Westermann, F., Berthold, F., Schwab, M., Antonescu, C.R., Peterson, C., et al. 2001. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat. Med. 7: 673-679.
Khatua, S., Peterson, K.M., Brown, K.M., Lawlor, C., Santi, M.R., LaFleur, B., Dressman, D., Stephan, D.A., and MacDonald, T.J. 2003. Overexpression of the EGFR/FKBP12/HIF-2
Kuo, W.P., Jenssen, T.K., Butte, A.J., Ohno-Machado, L., and Kohane, I.S. 2002. Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics 18: 405-412.
Lee, P.D., Sladek, R., Greenwood, C.M., and Hudson, T.J. 2002. Control genes and variability: Absence of ubiquitous reference transcripts in diverse mammalian expression studies. Genome Res. 12: 292-297.
Leung, S.Y., Chen, X., Chu, K.M., Yuen, S.T., Mathy, J., Ji, J., Chan, A.S., Li, R., Law, S., Troyanskaya, O.G., et al. 2002. Phospholipase A2 group IIA expression in gastric adenocarcinoma is associated with prolonged survival and less frequent metastasis. Proc. Natl. Acad. Sci. 99: 16203-16208.
Lord, P.W., Stevens, R.D., Brass, A., and Goble, C.A. 2003. Investigating semantic similarity measures across the gene ontology: The relationship between sequence and annotation. Bioinformatics 19: 1275-1283.
Luo, J., Duggan, D.J., Chen, Y., Sauvageot, J., Ewing, C.M., Bittner, M.L., Trent, J.M., and Isaacs, W.B. 2001. Human prostate cancer and benign prostatic hyperplasia: Molecular dissection by gene expression profiling. Cancer Res. 61: 4683-4688.
Ma, X.J., Salunga, R., Tuggle, J.T., Gaudet, J., Enright, E., McQuary, P., Payette, T., Pistone, M., Stecker, K., Zhang, B.M., et al. 2003. Gene expression profiles of human breast cancer progression. Proc. Natl. Acad. Sci. 100: 5974-5979.
MacDonald, T.J., Brown, K.M., LaFleur, B., Peterson, K., Lawlor, C., Chen, Y., Packer, R.J., Cogen, P., and Stephan, D.A. 2001. Expression profiling of medulloblastoma: PDGFRA and the RAS/MAPK pathway as therapeutic targets for metastatic disease. Nat. Genet. 29: 143-152.
Marcotte, E.M., Pellegrini, M., Thompson, M.J., Yeates, T.O., and Eisenberg, D. 1999. A combined algorithm for genome-wide prediction of protein function. Nature 402: 83-86.
Nielsen, T.O., West, R.B., Linn, S.C., Alter, O., Knowling, M.A., O'Connell, J.X., Zhu, S., Fero, M., Sherlock, G., Pollack, J.R., et al. 2002. Molecular characterisation of soft tissue tumours: A gene expression study. Lancet 359: 1301-1307.
Pavlidis, P. and Noble, W.S. 2003. Matrix2png: A utility for creating matrix visualizations. Bioinformatics 19: 295-296.
Perou, C.M., Jeffrey, S.S., van de Rijn, M., Rees, C.A., Eisen, M.B., Ross, D.T., Pergamenschikov, A., Williams, C.F., Zhu, S.X., Lee, J.C., et al. 1999. Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc. Natl. Acad. Sci. 96: 9212-9217.
Perou, C.M., Sorlie, T., Eisen, M.B., van de Rijn, M., Jeffrey, S.S., Rees, C.A., Pollack, J.R., Ross, D.T., Johnsen, H., Akslen, L.A., et al. 2000. Molecular portraits of human breast tumours. Nature 406: 747-752.
Pomeroy, S.L., Tamayo, P., Gaasenbeek, M., Sturla, L.M., Angelo, M., McLaughlin, M.E., Kim, J.Y., Goumnerova, L.C., Black, P.M., Lau, C., et al. 2002. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature 415: 436-442.
Pruitt, K.D. and Maglott, D.R. 2001. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29: 137-140.
Ramaswamy, S., Tamayo, P., Rifkin, R., Mukherjee, S., Yeang, C.H., Angelo, M., Ladd, C., Reich, M., Latulippe, E., Mesirov, J.P., et al. 2001. Multiclass cancer diagnosis using tumor gene expression signatures. Proc. Natl. Acad. Sci. 98: 15149-15154.
Ramaswamy, S., Ross, K.N., Lander, E.S., and Golub, T.R. 2003. A molecular signature of metastasis in primary solid tumors. Nat. Genet. 33: 49-54.
Rhodes, D.R., Barrette, T.R., Rubin, M.A., Ghosh, D., and Chinnaiyan, A.M. 2002. Meta-analysis of microarrays: Interstudy validation of gene expression profiles reveals pathway dysregulation in prostate cancer. Cancer Res. 62: 4427-4433.
Rickman, D.S., Bobek, M.P., Misek, D.E., Kuick, R., Blaivas, M., Kurnit, D.M., Taylor, J., and Hanash, S.M. 2001. Distinctive molecular profiles of high-grade and low-grade gliomas based on oligonucleotide microarray analysis. Cancer Res. 61: 6885-6891.
Rosenwald, A., Alizadeh, A.A., Widhopf, G., Simon, R., Davis, R.E., Yu, X., Yang, L., Pickeral, O.K., Rassenti, L.Z., Powell, J., et al. 2001. Relation of gene expression phenotype to immunoglobulin mutation genotype in B cell chronic lymphocytic leukemia. J. Exp. Med. 194: 1639-1647.
Ross, D.T. and Perou, C.M. 2001. A comparison of gene expression signatures from breast tumors and breast tissue derived cell lines. Dis. Markers 17: 99-109.
Ross, D.T., Scherf, U., Eisen, M.B., Perou, C.M., Rees, C., Spellman, P., Iyer, V., Jeffrey, S.S., Van de Rijn, M., Waltham, M., et al. 2000. Systematic variation in gene expression patterns in human cancer cell lines. Nat. Genet. 24: 227-235.
Seipel, K., O'Brien, S.P., Iannotti, E., Medley, Q.G., and Streuli, M. 2001. Tara, a novel F-actin binding protein, associates with the Trio guanine nucleotide exchange factor and regulates actin cytoskeletal organization. J. Cell Sci. 114: 389-399.
Sherlock, G., Hernandez-Boussard, T., Kasarskis, A., Binkley, G., Matese, J.C., Dwight, S.S., Kaloper, M., Weng, S., Jin, H., Ball, C.A., et al. 2001. The Stanford Microarray Database. Nucleic Acids Res. 29: 152-155.
Shipp, M.A., Ross, K.N., Tamayo, P., Weng, A.P., Kutok, J.L., Aguiar, R.C., Gaasenbeek, M., Angelo, M., Reich, M., Pinkus, G.S., et al. 2002. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat. Med. 8: 68-74.
Singh, D., Febbo, P.G., Ross, K., Jackson, D.G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A.A., D'Amico, A.V., Richie, J.P., et al. 2002. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1: 203-209.
Smith, L.L., Coller, H.A., and Roberts, J.M. 2003. Telomerase modulates expression of growth-controlling genes and enhances cell proliferation. Nat. Cell Biol. 5: 474-479.
Sorlie, T., Perou, C.M., Tibshirani, R., Aas, T., Geisler, S., Johnsen, H., Hastie, T., Eisen, M.B., van de Rijn, M., Jeffrey, S.S., et al. 2001. Gene expression patterns of breast carcinomas distinguish tumor subclasses with clinical implications. Proc. Natl. Acad. Sci. 98: 10869-10874.
Sorlie, T., Tibshirani, R., Parker, J., Hastie, T., Marron, J.S., Nobel, A., Deng, S., Johnsen, H., Pesich, R., Geisler, S., et al. 2003. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc. Natl. Acad. Sci. 100: 8418-8423.
Staunton, J.E., Slonim, D.K., Coller, H.A., Tamayo, P., Angelo, M.J., Park, J., Scherf, U., Lee, J.K., Reinhold, W.O., Weinstein, J.N., et al. 2001. Chemosensitivity prediction by transcriptional profiling. Proc. Natl. Acad. Sci. 98: 10787-10792.
Struhl, K. 1999. Fundamentally different logic of gene regulation in eukaryotes and prokaryotes. Cell 98: 1-4.
Stuart, J.M., Segal, E., Koller, D., and Kim, S.K. 2003. A gene-coexpression network for global discovery of conserved genetic modules. Science 302: 249-255.
Su, A.I., Cooke, M.P., Ching, K.A., Hakak, Y., Walker, J.R., Wiltshire, T., Orth, A.P., Vega, R.G., Sapinoso, L.M., Moqrich, A., et al. 2002. Large-scale analysis of the human and mouse transcriptomes. Proc. Natl. Acad. Sci. 99: 4465-4470.
Tezak, Z., Hoffman, E.P., Lutz, J.L., Fedczyna, T.O., Stephan, D., Bremer, E.G., Krasnoselska-Riz, I., Kumar, A., and Pachman, L.M. 2002. Gene expression profiling in DQA1*0501+ children with untreated dermatomyositis: A novel model of pathogenesis. J. Immunol. 168: 4154-4163.
Vahey, M.T., Nau, M.E., Jagodzinski, L.L., Yalley-Ogunro, J., Taubman, M., Michael, N.L., and Lewis, M.G. 2002. Impact of viral infection on the gene expression profiles of proliferating normal human peripheral blood mononuclear cells infected with HIV type 1 RF. AIDS Res. Hum. Retroviruses 18: 179-192.
van't Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., et al. 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature 415: 530-536.
Virtaneva, K., Wright, F.A., Tanner, S.M., Yuan, B., Lemon, W.J., Caligiuri, M.A., Bloomfield, C.D., de La Chapelle, A., and Krahe, R. 2001. Expression profiling reveals fundamental biological differences in acute myeloid leukemia with isolated trisomy 8 and normal cytogenetics. Proc. Natl. Acad. Sci. 98: 1124-1129.
Vogel, W., Gish, G.D., Alves, F., and Pawson, T. 1997. The discoidin domain receptor tyrosine kinases are activated by collagen. Mol. Cell 1: 13-23.
von Mering, C., Krause, R., Snel, B., Cornell, M., Oliver, S.G., Fields, S., and Bork, P. 2002. Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417: 399-403.
Weeraratna, A.T., Jiang, Y., Hostetter, G., Rosenblatt, K., Duray, P., Bittner, M., and Trent, J.M. 2002. Wnt5a signaling directly affects cell motility and invasion of metastatic melanoma. Cancer Cell 1: 279-288.
Welle, S., Brooks, A., and Thornton, C.A. 2001. Senescence-related changes in gene expression in muscle: Similarities and differences between mice and men. Physiol. Genomics 5: 67-73.
Welsh, J.B., Zarrinkar, P.P., Sapinoso, L.M., Kern, S.G., Behling, C.A., Monk, B.J., Lockhart, D.J., Burger, R.A., and Hampton, G.M. 2001. Analysis of gene expression profiles in normal and neoplastic ovarian tissue samples identifies candidate molecular markers of epithelial ovarian cancer. Proc. Natl. Acad. Sci. 98: 1176-1181.
West, M., Blanchette, C., Dressman, H., Huang, E., Ishida, S., Spang, R., Zuzan, H., Olson Jr., J.A., Marks, J.R., and Nevins, J.R. 2001. Predicting the clinical status of human breast cancer by using gene expression profiles. Proc. Natl. Acad. Sci. 98: 11462-11467.
Westfall, P.H. and Young, S.S. 1993. Resampling-based multiple testing. Wiley, New York.
Wheeler, D.L., Church, D.M., Lash, A.E., Leipe, D.D., Madden, T.L., Pontius, J.U., Schuler, G.D., Schriml, L.M., Tatusova, T.A., Wagner, L., et al. 2001. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 29: 11-16.
Whitfield, M.L., Sherlock, G., Saldanha, A.J., Murray, J.I., Ball, C.A., Alexander, K.E., Matese, J.C., Perou, C.M., Hurt, M.M., Brown, P.O., et al. 2002. Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol. Biol. Cell 13: 1977-2000.
Wilson, S.H., Bailey, A.M., Nourse, C.R., Mattei, M.G., and Byrne, J.A. 2001. Identification of MAL2, a novel member of the mal proteolipid family, though interactions with TPD52-like proteins in the yeast two-hy |