Genome Res. 13:1706-1718, 2003 ©2003 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/03 $5.00
Methods
Subsystem Identification Through Dimensionality Reduction of Large-Scale Gene Expression Data
1 Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
2 Biological Engineering Division, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
3 Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
The availability of parallel, high-throughput biological experiments that simultaneously monitor thousands of cellular observables provides an opportunity for investigating cellular behavior in a highly quantitative manner at multiple levels of resolution. One challenge to more fully exploit new experimental advances is the need to develop algorithms to provide an analysis at each of the relevant levels of detail. Here, the data analysis method non-negative matrix factorization (NMF) has been applied to the analysis of gene array experiments. Whereas current algorithms identify relationships on the basis of large-scale similarity between expression patterns, NMF is a recently developed machine learning technique capable of recognizing similarity between subportions of the data corresponding to localized features in expression space. A large data set consisting of 300 genome-wide expression measurements of yeast was used as sample data to illustrate the performance of the new approach. Local features detected are shown to map well to functional cellular subsystems. Functional relationships predicted by the new analysis are compared with those predicted using standard approaches; validation using bioinformatic databases suggests predictions using the new approach may be up to twice as accurate as some conventional approaches.
Gene-expression microarrays are a recently developed technology that allows
genome-wide measurement of RNA expression levels in a highly quantitative
fashion (Fodor et al. 1993
The collection, processing, and analysis of microarray data present many
challenges. Appropriate treatment of noise and systematic error is necessary
to ensure that further analysis is not clouded by data inaccuracy, and some
approaches have been proposed (e.g., Brown
et al. 2001
One productive use of expression data is to propose and to study
relationships between genetic, cellular, or environmental components. Examples
include the elucidation of metabolic
(DeRisi et al. 1997
Here, the potential usefulness of NMF for the analysis of high-dimensional
biological data was evaluated using a publicly available compendium microarray
data set for Saccharomyces cerevisiae, in which 6316 ORFs were
monitored in each of 300 experiments
(Hughes et al. 2000
The compendium data set contained expression patterns monitored for 6316 S. cerevisiae genes in 300 experiments involving a variety of strains and conditions. The expression of each gene in each experiment was represented as a ratio of the expression in the experiment to that in a control experiment of wild type grown under standard conditions. Genes whose expression in the control was not measurable were removed from the data set to prevent division by zero, leaving 5346 genes, and the natural logarithm of each ratio was taken. Data analysis involved using NMF to reduce the dimensionality of the data and to extract common features repeated in correlated fashion throughout the data (see Methods). These common feature elements were represented as basis vectors resulting from the technique. In typical usage, each basis vector represented an experiment, in that it contained a relative expression for each gene comprising the feature represented.
Selection of NMF Dimensionality
Basis vectors (basis experiments) obtained from NMF factorization with a dimensionality of 50 were sparse and reproducible. One measure of sparsity is the fraction of non-zero entries per basis vector, which averaged 5% over the 50 vectors. The factorization produced somewhat different results each time it was started from a different random starting point. When the basis vectors from different factorizations using the same dimensionality were compared, the correlation coefficient was found to be >0.9 between pairs. This indicates that results of NMF are robust with respect to the mathematical procedures used here to perform the calculations. The RMS error of the reconstructed data (through NMF dimensional reduction) compared with the original data was only about 7.8% of the RMS error of a random permutation of the original data, in which the experiments (columns) of the original data matrix were permuted. Figure 2 illustrates six examples of expression experiments in the original gene expression space, in the 50-dimensional NMF space, and reconstructed from the NMF space back into the original space. This shows the ability of the dimensionality reduction to still capture many of the details of the original data. Also, it is demonstrated that most experiments are dominated by the combination of only a few important basis vectors. They correspond to similarities across many, but not all genes.
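The two summary quantities used above, the fraction of non-zero entries per basis vector and the reconstruction error relative to a column-permuted baseline, can be sketched as follows. This is a minimal illustration, not the code used in this work: the function names and the toy matrix are hypothetical, and the baseline is assumed to be the RMS difference between the data and a copy with its experiments (columns) shuffled.

```python
import numpy as np

def nonzero_fraction(w):
    """Fraction of non-zero entries in each column (basis vector) of w."""
    return np.count_nonzero(w, axis=0) / w.shape[0]

def relative_rms_error(v, reconstruction, seed=0):
    """RMS reconstruction error divided by the RMS error of a column-permuted copy of v.

    Assumption: the baseline compares v with a copy whose columns are shuffled.
    """
    rng = np.random.default_rng(seed)
    permuted = v[:, rng.permutation(v.shape[1])]  # shuffle experiments (columns)
    err = np.sqrt(np.mean((v - reconstruction) ** 2))
    baseline = np.sqrt(np.mean((v - permuted) ** 2))
    return err / baseline

# Hypothetical toy basis matrix: 4 genes x 2 basis vectors.
w_toy = np.array([[1.0, 0.0],
                  [0.0, 0.0],
                  [2.0, 3.0],
                  [0.0, 0.0]])
sparsity = nonzero_fraction(w_toy)  # one value per basis vector
```

Averaging `sparsity` over all basis vectors gives a single number comparable to the 5% figure quoted above.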
To examine the robustness of the algorithm to noise, Gaussian noise was added to the original data to produce corrupted data vectors. Table 1 lists the average correlation between results of the analysis performed on the original and corrupted data. Gaussian noise was added in progressively larger increments of the standard deviation of the data. As a recent model that captures the physical processes underlying microarray measurements indicates, the ratio of the cDNA distributions can be approximated with a log-normal distribution (K.H. Duggar, T. Ideker, D.A. Lauffenburger, and P.K. Sorger, in prep.). These results suggest that Gaussian noise is appropriate to apply to the log of the ratios. At low noise (0.2 times the standard deviation), there was very little change in the results. The correlation of NMF vectors was better than 0.90, as was that for the data reconstructed from the dimensionality reduction. This is not surprising, because the original data vectors and corrupted vectors also showed a correlation coefficient of >0.90. However, when more noise was added (equal to the full standard deviation), both the NMF basis vectors and the reconstructed data remained very similar to their noise-free counterparts (correlation better than 0.80), whereas the original data changed substantially more (correlation of 0.57). This indicates that NMF is robust to noise in the data and suggests that NMF might be useful as a noise-reduction filter in certain applications.
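The corruption step can be sketched in a few lines. This is an illustrative sketch under stated assumptions: the noise standard deviation is taken as a multiple of the standard deviation of the whole data matrix, and similarity is summarized as the mean Pearson correlation between matching columns. The function names are hypothetical.

```python
import numpy as np

def add_noise(data, scale, seed=0):
    """Add zero-mean Gaussian noise with s.d. equal to `scale` times the data s.d."""
    rng = np.random.default_rng(seed)
    sigma = scale * data.std()
    return data + rng.normal(0.0, sigma, size=data.shape)

def mean_column_correlation(a, b):
    """Average Pearson correlation between corresponding columns of a and b."""
    corrs = [np.corrcoef(a[:, j], b[:, j])[0, 1] for j in range(a.shape[1])]
    return float(np.mean(corrs))
```

The same `mean_column_correlation` call can be applied to the raw data, to basis vectors, or to reconstructed data obtained from the original and corrupted inputs, which is the comparison summarized in Table 1.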
Annotation of Basis Vectors
Each basis vector appeared to be dominated by only a few functional categories, with some categories showing increased and others decreased expression relative to wild-type, untreated cells. Basis vector 17, for example, showed increased expression of genes associated with amino-acid metabolism and metabolism of energy reserves together with decreased expression of genes involved in rRNA transcription. Basis vector 20 showed increased expression of genes involved in ion transport, homeostasis of cations, ribosomal function, and mitochondrial organization with decreased expression of genes for amino-acid metabolism, (other) ribosomal proteins, translation, and organization of cytoplasm. Basis vector 9 showed increased expression of genes associated with carbon compound (C-compound) and carbohydrate metabolism and transporters as well as metabolism of energy reserves, and at the same time decreased expression of amino-acid metabolism genes. In some cases, specific metabolic pathways could be seen in the basis vectors. For instance, fatty acid oxidation was up-regulated in basis vector 42. Most elements of the TCA-cycle were up-regulated in basis vector 43. Furthermore, this basis vector, which seemed mostly responsible for energy metabolism, contained all but two of the genes involved in the pentose-phosphate shunt. Of these two genes, one is a transketolase that is highly homologous to another transketolase found in basis vector 43, and the other is the ribose-5-phosphate ketol isomerase. In 14 of the basis vectors, no single MIPS category was significantly enriched, which is partly due to the lack of sparsity (i.e., too many genes occur in a basis vector, therefore, no single category was significant), and partly due to an abundance of as yet uncategorized genes.
Independent of the classification scheme proposed by MIPS, the occurrence
of well-characterized gene groups was examined in basis vectors. The processed
data set contained nine histone genes, which were all present together in
basis vector 1. This enrichment was >5
Next, the occurrence of genes in both the GAL4 and the
STE12 pathway was examined. These pathways were recently studied
extensively by Ren et al.
(2000
There was a deletion mutant of STE12 in the data, along with
several mutants of related genes, including FUS3, KSS1, and
STE5. A total of 25 genes, forming a subset of those identified by
Ren et al. (2000
Basis vector 8 (the mating basis vector) was then examined more closely,
and the function of all of its member genes examined using information from
the Yeast Proteome Database (YPD), constructed by Proteome, Inc.
(http://www.incyte.com/
Prediction of Functional Relationships
Predictions of functional relationships were made using the pairwise
correlations between experiments measured in each of six spaces: the
original data space, the 50-dimensional NMF space, and four other
50-dimensional spaces chosen for comparison. The six spaces are (1) the
original space in which the data was collected, corresponding to 5346 genes
used in the analysis, (2) the 50-dimensional space resulting from NMF data
reduction, (3) the 50-dimensional space spanned by 50 genes whose expression
varied the most across the 300 experiments, (4) the 50-dimensional space
explaining the largest variation in the experimental data as found by SVD, (5)
the 50-dimensional space resulting from NMF data reduction without applying
the sparsification procedure, and (6) the space spanned by the eigenvectors
from SVD that have been subjected to the same sparsification procedure as NMF
in (2). In addition, comparison was made to the average validity of
predictions made from k-means clustering with 50 clusters. Note that the
comparison with clustering is not completely fair, as here we are testing for
pairwise relationships between genes, whereas clustering finds groupwise
relationships. To compare k-means clustering, we treat each pair of
experiments within a given cluster as related. For each case, the pairwise
correlations were sorted by magnitude, with the higher magnitude correlations
corresponding to stronger predictions. Predictions were checked against the
MIPS database (see Methods), and the results are shown in
Figure 3. This figure shows,
for each of the methods, the percentage of predictions validated by MIPS as a
function of the number of predictions made (when ordered from strongest to
weakest correlation). In general, the methods exhibited the highest validation
for their strongest predictions. For up to 600 predicted relationships (four
per gene, on average), NMF far outperformed all other methods. For instance,
for the 100 strongest predictions, the reliability in the NMF space was
A second and independent method was used to evaluate the predictions of
functional relationships produced by NMF by comparing with data compiled in
the Yeast Proteome Database (YPD; Costanzo
et al. 2001
Here, NMF, a new machine-learning approach capable of identifying localized features in complex data sets, was applied to the analysis of microarray data from a series of 300 yeast experiments (of which 276 were deletion strains; Hughes et al. 2000). The experimental variation sampled by the 300 experiments could be well represented with just 50 features. Moreover, this set of 50 features encoded in the basis vectors tended to correspond to sets of known functional genetic groupings of genes. Large numbers of genes involved in similar or related cell functions appeared together due to a local similarity in their expression profiles. It should be noted that because of the limited data (i.e., not all yeast deletion strains were sampled), not all cellular functions were identified. Some cellular systems were sampled more in the experiments than others. For example, the mating and pheromone grouping is particularly well identified. Basis vector 8 consisted mostly of genes involved in mating and even contained six verified targets of STE12 that were not identified by previous studies. Conventional clustering techniques focus on elucidating groupwise relationships among genes by sorting them according to a pairwise similarity metric. NMF procedures applied here also identify groups of genes related to one another in expression patterns and form them together into basis vectors. It is clear that genes in the same cluster have similar expression patterns. In the case of NMF basis vectors, the relationship is less clear; treating the contribution of a set of basis vector genes as a group is an efficient representation of the expression data. This may or may not be a good indicator of biological relevance.
Pairwise relationships between experiments were evaluated by locating pairs
of experiments that were constructed from the same NMF building blocks (basis
vectors). With this detection scheme, NMF was superior to any other method
examined, including other sets of 50 basis vectors constructed from other
procedures, as well as standard correlations in the full gene expression space
of the original experiments. The initial analysis of this same compendium data
set reported by Hughes et al.
(2000
Figure 4 illustrates the increased similarity seen in the NMF feature space compared with that seen in the original data space for four pairwise functional relationships from Tables 3 and 4. Two of these (yer084w:SBH2 and ymr025w:ymr029c) were not corroborated by YPD, whereas the other two (STE5:STE11 and RTS1:RTG1) are each known to be functional relationships. As the numerical values in the figure indicate, the correlation in NMF space was significantly higher than in the original gene expression space. Essentially, this stems from the fact that NMF recognized the expression patterns of strains deleted for the genes in question as being constructed of very similar sets of building blocks, and the correlation in the expression pattern was larger for the genes comprising these building blocks. For instance, the expression profiles for strains deleted in STE11 and in STE5 were each dominated by basis vector 8 (the building block consisting largely of mating genes) and had relatively small (but still correlated) contributions from other basis vectors. NMF recognized this local similarity across some genes, whereas most clustering algorithms would focus only on the global similarity of the expression profile. Comparing the same two strains in the original data space shows that their gene expression patterns were highly correlated for some genes but not for others. Therefore, NMF is a way to focus on the functionally important parts of gene expression profiles.
Note that due to the fact that all values in NMF space are non-negative by definition, the distribution of correlations is somewhat shifted toward higher values and it has a longer tail (see Supplemental Material Fig. S-1, available online at www.genome.org). However, this effect alone does not explain the higher correlation coefficients found in NMF versus in the original space, seen in Figure 4. The correlations found in the NMF space occur at a higher percentile than those in the original space. For example, a correlation of 0.8 corresponded to a 99.90 (99.93) percentile in the NMF (original) space; a correlation of 0.4 corresponded to a 95.00 (98.00) percentile in the NMF (original) space. The differences in correlation coefficients observed here correspond to values in the neighborhood of 0.8 (99.90 percentile) in NMF space and only 0.4 (98.00 percentile) in the original space. Thus, even though the distribution is somewhat shifted in the NMF space, the higher correlation coefficients observed do correspond to higher significance, as indicated by percentile ranking.
In Table 4 are listed 42
predictions of functional relationships detected by NMF, but not present in
YPD. Some predicted relationships are between genes classified as
mitochondrial (e.g., AEP2:YER050C and MSU1:MRPL33), just as
some of the verified relationships are between mitochondrial genes (e.g.,
RML2:YMR293C). Moreover, a number of small networks of mitochondrial
genes occur in the strongest 100 NMF relationships; most genes in these
networks were clustered together in the original analysis of the data by
Hughes et al. (2000
While this manuscript was in preparation, two studies
(Gavin et al. 2002 One feature of the approach taken here is that pairwise relationships were only scored for genes that had been directly manipulated in the experiments (deleted or overexpressed). As described in the Methods section, NMF can also be applied to detect relationships between genes that have been monitored using expression arrays, but not directly manipulated experimentally. Preliminary studies using NMF in this mode suggest that it is again superior in detecting functional genetic relationships compared with approaches that apply clustering or correlation directly in the original data space. A further shortcoming that remains, however, is the elimination from analysis of genes whose expression is undetectable in the control experiments (to avoid division by zero). Functional relationships involving such genes (comprising roughly one-sixth of the genome for the current data set) cannot be scored. In future studies, it may be possible to insert a minimal expression level for such genes in the control experiment, although further work is necessary to see whether this introduces other problems, such as feature misscaling. In the current study, no separate attempt was made to smooth or filter the data set to reduce or eliminate the effects of experimental noise or error. In some sense, NMF itself performs a smoothing function on the data through factorization and reconstruction. Features that appear consistently in the data set are selected out to become basis vectors, whereas features that appear inconsistently in the data due to experimental variability or other factors tend to be smoothed. For the results reported here, only genes with no detectable expression in the control experiment were removed. When more stringent significance filters were applied to the data, the results remained similar (data not shown). 
Of the 50 basis vectors resulting from this analysis, many were sparse (that is, they represented features consisting of a relatively small number of genes). However, some basis vectors were not sparse and contained too many genes to be easily annotated as associated with a small number of cellular functions. The NMF algorithm could be modified to enforce sparser basis vectors; alternatively, it is anticipated that larger data sets will result in basis vectors that are more uniformly sparse and may correspond to smaller features. An advantage of NMF is that it is expected to be a better detector of features when confronted with larger data sets.
General Approach
Data from a set of expression array experiments were represented as a single data matrix. Each column
corresponded to the processed intensities from one experiment; each element of
a column was derived from the intensity for one gene probe in the
corresponding experiment. A row of the matrix corresponded to the processed
intensity for a single gene probe across all experiments. An n x m data matrix
corresponded to m arrays (i.e., experiments) in which measurements
were made for the same n genes in each. The major analysis method
applied here, NMF, corresponded to an approximate factorization of the data matrix
into a pair of non-negative matrices: a matrix of basis vectors and a matrix of encodings.
The basis matrix was of dimension n x k and the encoding matrix was
k x m. In the work described here, k was
chosen to be relatively small compared with the dimensions of the original
data (i.e.,
k · (m + n) < n · m),
so the factorization was approximate and corresponded to a compression of the
data. Moreover, the factorization could be viewed as a representation of the
data in a new space of lower dimensionality (k). There are two
equally valid interpretations of the dimensionality reduction. One is that the
columns of
Implementation of NMF
)
between the actual data and the
reduced-dimension reconstruction of the data
( ;
Lee and Seung 2001). The update rules corresponded to a form of gradient descent and thus found only a local minimum. To address this limitation, the procedure was repeated 100 times, starting with different initial matrices. The factorization leading to the lowest RMS error was used in further analysis. Studies were carried out for values of the NMF dimensionality (k) ranging from 10 to 80. The solutions found were reproducible; basis vectors from factorizations that differed in the initial matrices showed correlation coefficients of >0.90.
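The restart strategy can be illustrated with a compact sketch. It assumes the Euclidean-distance form of the multiplicative update rules of Lee and Seung, which may differ in detail from the objective used in this work; the names `nmf`, `best_of_restarts`, `v`, `w`, and `h` are illustrative, and the small constant `eps` is added only to avoid division by zero.

```python
import numpy as np

def nmf(v, k, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative (n x m) matrix v into w (n x k) and h (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = v.shape
    w = rng.random((n, k)) + eps  # random non-negative starting point
    h = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        h *= (w.T @ v) / (w.T @ w @ h + eps)
        w *= (v @ h.T) / (w @ h @ h.T + eps)
    rms = np.sqrt(np.mean((v - w @ h) ** 2))
    return w, h, rms

def best_of_restarts(v, k, n_restarts=5, n_iter=200):
    """Run NMF from several random starts and keep the lowest-RMS factorization."""
    runs = [nmf(v, k, n_iter=n_iter, seed=s) for s in range(n_restarts)]
    return min(runs, key=lambda run: run[2])
```

Because each run converges only to a local minimum, keeping the best of several seeds is a simple safeguard; the text above used 100 restarts, whereas the default here is kept small for illustration.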
A single NMF factorization for a 5346 x 300 data set required
Trial implementations on smaller test problems were also carried out with
nonlinear optimizers CONOPT2 version 2.071G (ARKI Consulting & Development
A/S) and LOQO version 4.01 (Princeton University); the values of matrix
elements in
To ensure sparsity of the resulting basis vectors, the most significant genes for every basis experiment were selected so as to produce an average of 5% of the entries used across all basis vectors. Operationally, this was achieved by allowing a fixed percentage (9.7%) of the maximum gene to be significant (non-zero). After selection of the most significant genes, all other genes were constrained to zero, and the resulting sparsified basis vectors were reoptimized to convergence using the update rules in equations 2 and 3. This sparsification procedure performed best among the many procedures tried, when minimizing both the overall RMSD of the factorization and the number of significant genes for each basis vector. The RMSD of the sparsified factorization was 2132 versus 1702 for the factorization without sparsification (a difference of 25%), whereas only 5% of the entries were used as significant genes. As can be seen in Figure 3, the sparsification had a minor effect on the performance of the algorithm.
In separate calculations, SVD of data matrices was carried out using the built-in functionality in MATLAB. A representation of an SVD factorization of rank k corresponded to using only the k highest eigenvalues. Absolute RMS error values were calculated for the same data set and the same ranks as for NMF. Furthermore, as a control, SVD was carried out on random matrices composed of vectors of the same mean and standard deviation as the sample data.
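One possible reading of the sparsification step described above is sketched here. It is an interpretation rather than the procedure as implemented: entries of each basis vector are assumed to be kept only if they reach a fixed fraction (0.097) of that vector's largest entry, and the reoptimization is assumed to use Euclidean-distance multiplicative updates, under which entries that start at zero remain zero. The function names are hypothetical.

```python
import numpy as np

def sparsify(w, frac=0.097):
    """Zero every entry below `frac` times the largest entry of its column."""
    threshold = frac * w.max(axis=0, keepdims=True)
    return np.where(w >= threshold, w, 0.0)

def reoptimize(v, w, h, n_iter=100, eps=1e-9):
    """Refit w and h with multiplicative updates; zeros in w stay zero."""
    w, h = w.copy(), h.copy()
    for _ in range(n_iter):
        h *= (w.T @ v) / (w.T @ w @ h + eps)
        w *= (v @ h.T) / (w @ h @ h.T + eps)  # 0 times a finite factor stays 0
    return w, h
```

The multiplicative form is convenient here because the zero pattern imposed by `sparsify` is preserved without any explicit masking during reoptimization.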
Annotating Basis Vectors
Each basis vector (basis experiment) was annotated with the MIPS categories
that dominated its makeup by comparing the frequency with which genes from
each category appeared in a basis vector with that expected from a random
distribution. One million genes were selected at random from the same set of
genes present in the experimental data. The corresponding MIPS categories were
identified, and the mean and the standard deviation of occurrence was
calculated for every category. This procedure was carried out twice to ensure
convergence of the random distribution. If the occurrence of a particular MIPS
category in a basis vector exceeded the mean of the random occurrence by more
than five times its standard deviation (a 5
Predicting Functional Relationships
In the data set used, most experiments corresponded to deletion mutants of a specific gene, so that functional relationships between experiments in turn implied functional relationships of the deleted genes. Other experiments corresponded to the overexpression of genes, which again linked the experiments directly to the gene in question. The rest of the experiments corresponded to treatment with a well-characterized drug. Those experiments then linked the response in expression pattern to the functional mechanism of this particular drug.
To judge the predicted functional relationships between genes required some
set of true relationships. For this purpose, existing bioinformatic databases
were used, although clearly such data are largely incomplete and may not be
fully verified. The two databases used were the MIPS categorization
(Mewes et al. 2000
The functional relationships predicted from gene expression data using NMF were compared with functional relationships predicted from other approaches. The same analysis and validation procedure was applied to the correlation score in five other spaces: the original full-dimensionality of the experimental space, reduced dimensionality using SVD with the 50 most significant dimensions, reduced dimensionality using only the 50 most variable genes in the data set, reduced dimensionality using 50 NMF basis vectors that were not sparsified, and reduced dimensionality using 50 eigenvectors from SVD that were sparsified. The value of 50 was chosen to compare different same-sized reduced-dimension representations of the data to that from NMF. Eigenvectors from SVD were sparsified using the above procedure. The encodings were then obtained using the pseudoinverse. As a further comparison, k-means clustering was carried out using the Euclidean distance metric, starting from a random initial seed of cluster centers and iteratively updating the center positions.
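The scoring scheme, in which experiment pairs are ranked by the magnitude of their correlation in a reduced space and then checked against a reference set, can be sketched as follows. This is a simplified illustration: `h` stands for any (k x m) matrix of encodings with one column per experiment, `known` is a hypothetical set of validated pairs, and neither name comes from the original work.

```python
import numpy as np

def ranked_pairs(h, top=None):
    """Rank experiment pairs by |Pearson correlation| of their columns in h."""
    corr = np.corrcoef(h.T)                     # (m, m) correlations between experiments
    i, j = np.triu_indices(corr.shape[0], k=1)  # each unordered pair once
    order = np.argsort(-np.abs(corr[i, j]))     # strongest correlation first
    pairs = [(int(i[t]), int(j[t]), float(corr[i[t], j[t]])) for t in order]
    return pairs[:top]

def validated_fraction(pairs, known):
    """Fraction of the first 1, 2, ..., N predictions found in `known`."""
    hits = np.cumsum([frozenset((a, b)) in known for a, b, _ in pairs])
    return hits / np.arange(1, len(pairs) + 1)

# Hypothetical toy encodings: 3 basis vectors x 3 experiments.
h_toy = np.array([[1.0, 0.0, 2.0],
                  [0.0, 1.0, 0.0],
                  [3.0, 1.0, 6.0]])
pairs_toy = ranked_pairs(h_toy)
curve_toy = validated_fraction(pairs_toy, {frozenset((0, 2))})
```

Plotting `curve_toy` against the number of predictions gives the kind of validation curve described for Figure 3; the same two functions apply unchanged to the original data space or to SVD encodings.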
A second and independent method of scoring predicted functional
relationships used the Yeast Proteome Database (YPD;
Costanzo et al. 2001
Data Source and Preprocessing
Gene expression was measured using spotted microarrays, giving the ratio of
expression in the mutant (or drug-treated) strain relative to the gene
expression in the control (wild-type) experiments. The spotted arrays measured
expression for a total of 6316 ORFs; the data set was 6316 genes by 300
experiments (data available from Rosetta Inpharmatics Inc. at
http://www.rii.com/register/cell2000102Hughes/EULA.htm
The log-transformed ratios were used as input data for our algorithm; the transformed ratios ranged from approximately -3 (1000 times down-regulated with respect to the control experiment) to +3 (1000 times up-regulated). Some genes had no detectable expression in the control experiment and were removed from further analysis to prevent division by zero. The resulting data set contained 5346 genes. To make the data fit the constraint of non-negativity, the data were folded. Every gene was represented in two rows of the matrix, the first occurrence to indicate positive expression relative to wild type, and the second to indicate negative. This effectively doubled the size of the data set (to 10692 genes). In any one experiment, the log-expression ratio for every gene was either positive (i.e., the gene was up-regulated with respect to the control experiment) or negative. The resulting data matrix was of size 10692 x 300, and half of its entries were equal to zero. This procedure was necessary because NMF performs best on sparse data sets. A simple shifting procedure, that is, adding a fixed constant to each matrix element to make all entries positive, would create a positive but very non-sparse matrix and hence was inappropriate. For reconstructing the data, we simply reversed this procedure by subtracting the row corresponding to down-regulation from the row corresponding to up-regulation. Correlations were computed by operating on vectors of length 10692, with no special treatment for paired entries involving the up- and down-regulation of the same gene. Interestingly, in each basis vector, the same gene was never represented as both up- and down-regulated.
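The folding and unfolding steps described above can be written compactly. The sketch below is illustrative: it assumes the up-regulation rows for all genes are stacked above the down-regulation rows (the row ordering is not specified in the text), and the function names and toy values are hypothetical.

```python
import numpy as np

def fold(log_ratios):
    """Split signed log-ratios into non-negative up- and down-regulation rows."""
    up = np.maximum(log_ratios, 0.0)     # positive log-ratios, zero elsewhere
    down = np.maximum(-log_ratios, 0.0)  # magnitudes of negative log-ratios
    return np.vstack([up, down])         # shape (2 * n_genes, n_experiments)

def unfold(folded):
    """Reverse the folding: up-regulation rows minus down-regulation rows."""
    n_genes = folded.shape[0] // 2
    return folded[:n_genes] - folded[n_genes:]

# Hypothetical toy data: 3 genes x 2 experiments.
x = np.array([[ 0.5, -1.0],
              [-0.2,  0.0],
              [ 2.0,  0.3]])
folded = fold(x)
restored = unfold(folded)
```

At least half of the entries in `folded` are zero by construction, which is the sparsity property that motivated folding over adding a constant shift.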
We thank H. Sebastian Seung, Michael D. Altman, Justin A. Caravella, Gerald R. Fink, David F. Green, Chris Kaiser, Sriram Kosuri, Douglas A. Lauffenburger, Robert T. Sauer, Anthony J. Sinskey, Peter K. Sorger, and Shari Spector for helpful discussions and suggestions. We also thank the two anonymous referees for insightful comments. This work was partially supported by the Alfred P. Sloan Foundation and the National Institutes of Health (MH62344). P.M.K. was supported by a Merck/MIT Graduate Fellowship and a Ph.D. Fellowship from the Boehringer Ingelheim Fonds. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.903503.
4 Corresponding author. [Supplemental material is available online at www.genome.org.]
Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack,
D., and Levine, A.J. 1999. Broad patterns of gene expression
revealed by clustering analysis of tumor and normal colon tissues probed by
oligonucleotide arrays. Proc. Natl. Acad. Sci.
96:
6745-6750.
Alter, O., Brown, P.O., and Botstein, D. 2000.
Singular value decomposition for genome-wide expression data processing and
modeling. Proc. Natl. Acad. Sci.
97:
10101-10106.
Bittner, M., Meltzer, P., Chen, Y., Jiang, J., Seftor, E., Hendrix, M., Radmacher, M., Simon, R., Yakhini, Z., Ben-Dor, A., et al. 2000. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature 406: 536-540.
Broet, P., Richardson, S., and Radvanyi, F. 2002. Bayesian hierarchical model for identifying changes in gene expression from microarray experiments. J. Comput. Biol. 9: 671-683.
Brown, C.S., Goodwin, P.C., and Sorger, P.K. 2001.
Image metrics in the statistical analysis of DNA microarray data.
Proc. Natl. Acad. Sci.
98:
8944-8949.
Brown, M.P.S., Grundy, W.N., Lin, D., Christiani, N., Sugnet, C.W.,
Furey, T.S., Ayres Jr., M., and Haussler, D. 2000.
Knowledge-based analysis of microarray gene expression data by using support
vector machines. Proc. Natl. Acad. Sci.
97:
262-267.
Cho, R.J., Campbell, M.J., Winzeler, E.A., Steinmetz, L., Conway, A., Wodicka, L., Wolfsberg, T.G., Gabrielian, A.E., Landsman, D., Lockhart, D.J., et al. 1998. A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell 2: 65-73.
Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D.,
Brown, P.O., and Herskowitz, I. 1998. The transcriptional program
of sporulation in budding yeast. Science
282:
699-705.
Coller, H.A., Grandori, C., Tamayo, P., Colbert, T., Lander, E.S.,
Eisenman, R.N., and Golub, T.R. 2000. Expression analysis with
oligonucleotide microarrays reveals that MYC regulates genes involved in
growth, cell cycle, signaling, and adhesion. Proc. Natl. Acad.
Sci. 97:
3260-3265.
Costanzo, M.C., Crawford, M.E., Hirschman, J.E., Kranz, J.E.,
Olsen, P., Robertson, L.S., Skrzypek, M.S., Braun, B.R., Hopkins, K.L., Kondu,
P., et al. 2001. YPD, PombePD, and WormPD: Model organism volumes
of the BioKnowledge library, an integrated resource for protein information.
Nucleic Acids Res. 29:
75-79.
DeRisi, J.L., Iyer, V.R., and Brown, P.O. 1997.
Exploring the metabolic and genetic control of gene expression on a genomic
scale. Science 278:
680-686.
Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D.
1998. Cluster analysis and display of genome-wide expression
patterns. Proc. Natl. Acad. Sci.
95:
14863-14868.
Ferea, T.L., Botstein, D., Brown, P.O., and Rosenzweig, R.F.
1999. Systematic changes in gene expression patterns following
adaptive evolution in yeast. Proc. Natl. Acad. Sci.
96:
9721-9726.
Fodor, S.P.A., Rava, R.P., Huang, X.H.C., Pease, A.C., Holmes, C.P., and Adams, C.L. 1993. Multiplexed biochemical assays with biological chips. Nature 364: 555-556.
Gasch, A.P. and Eisen, M.B. 2002. Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biol. 3: 1-22.
Gavin, A.C., Boesche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J.M., Michon, A.M., Cruciat, C.M., et al. 2002. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415: 141-147.
Getz, G., Levine, E., and Domany, E. 2000. Coupled
two-way clustering analysis of gene microarray data. Proc. Natl.
Acad. Sci. 97:
12079-12084.
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M.,
Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., et al.
1999. Molecular classification of cancer: Class discovery and
class prediction by gene expression monitoring.
Science 286:
531-537.
Granjeaud, S., Bertucci, F., and Jordan, B.R. 1999. Expression profiling: DNA arrays in many guises. BioEssays 21: 781-790.
Hartemink, A.J., Gifford, D.K., Jaakkola, T.S., and Young, R.A. 2001. Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks. Pac. Symp. Biocomput. 6: 422-433.
Heyer, L.J., Kruglyak, S., and Yooseph, S. 1999.
Exploring expression data: Identification and analysis of coexpressed genes.
Genome Res. 9:
1106-1115.
Ho, Y., Gruhler, A., Heilbut, A., Bader, G.D., Moore, L., Adams, S.L., Millar, A., Taylor, P., Bennett, K., Boutilier, K., Yang, L., et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415: 180-183.
Holstege, F.C., Jennings, E.G., Wyrick, J.J., Lee, T.I., Hengartner, C.J., Green, M.R., Golub, T.R., Lander, E.S., and Young, R.A. 1998. Dissecting the regulatory circuitry of a eukaryotic genome. Cell 95: 717-728.
Hughes, T.R., Marton, M.J., Jones, A.R., Roberts, C.J., Stoughton, R., Armour, C.D., Bennet, H.A., Coffey, E., Dai, H., He, Y.D., et al. 2000. Functional discovery via a compendium of expression profiles. Cell 102: 109-126.
Ideker, T., Thorsson, V., Ranish, J.A., Christmas, R., Buhler, J.,
Eng, J.K., Baumgarner, R., Goodlett, D.R., Aebersold, R., and Hood, L.
2001. Integrated genomic and proteomic analyses of a
systematically perturbed metabolic network. Science
292:
929-934.
Iyer, V.R., Eisen, M.B., Ross, D.T., Schuler, G., Moore, T., Lee,
J.C.F., Trent, J.M., Staudt, L.M., Hudson Jr., J., Boguski, M.S., et al.
1999. The transcriptional program in the response of human
fibroblasts to serum. Science
283: 83-87.
Kim, S.K., Lund, J., Kiraly, M., Duke, K., Jiang, M., Stuart, J.M.,
Eizinger, A., Wylie, B.N., and Davidson, G.S. 2001. A gene
expression map for Caenorhabditis elegans. Science
293:
2087-2092.
Lee, D.D. and Seung, H.S. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401: 788-791.