|
|
|
|
Genome Res. 14:1654-1663, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00 Methods Elucidation of Gene Interaction Networks Through Time-Lagged Correlation Analysis of Transcriptional DataDepartment of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
The photosynthetic cyanobacterium Synechocystis sp. strain PCC 6803 uses a complex genetic program to control its physiological response to alternating light conditions. To study this regulatory program time-series experiments were conducted by exposing Synechocystis sp. to serial perturbations in light intensity. In each experiment whole-genome DNA microarrays were used to monitor gene transcription in 20-min intervals over 8- and 16-h periods. The data was analyzed using time-lagged correlation analysis, which identifies genetic interaction networks by constructing correlations between time-shifted transcription profiles with different levels of statistical confidence. These networks allow inference of putative cause-effect relationships among the organism's genes. Using light intensity as our initial input signal, we identified six groups of genes whose time-lagged profiles possessed significant correlation, or anti-correlation, with the light intensity. We expanded this network by using the average profile from each group of genes as a seed, and searching for other genes whose time-lagged profiles possessed significant correlation, or anti-correlation, with the group's average profile. The final network comprised 50 different groups containing 259 genes. Several of these gene groups possess known light-stimulated gene clusters, such as Synechocystis sp. photosystems I and II and carbon dioxide fixation pathways, while others represent novel findings in this work.
The DNA microarray has become an established tool for the parallel monitoring of gene expression profiles. Most common experimental design strategies observe static gene expression differences between conditions, such as disease versus nondisease case comparisons. While such experiments generate information for diagnostic applications, they are not well suited for uncovering the roles of these genes in the larger context of cellular regulation.
Dynamic transcriptional data allow the formation of gene clusters with similar temporal expression profiles. The various forms of clustering (Eisen et al. 1998 Consider Figure 1 in which an input signal, such as light intensity, affects the transcription of a pair of genes through a cascade from gene 1 to gene 2. In an experiment that only measures static gene expression values at each input signal intensity, the best conclusion that might be drawn from such data is that the genes are somehow related. On the other hand, if dynamic experiments are conducted that allow the observation of delayed responses, then it is possible to extract additional information from these measurements pointing to potential directionality.
A relatively complete picture of transcriptional regulatory behavior should be possible by probing the transcriptional dynamics of carefully designed experiments covering a wide range of conditions. Dynamic experiments that sequentially vary external parameters offer insights into how cellular physiology depends on changing environmental conditions. Time-lagged correlation analysis is one method that can be applied to infer putative causal relationships between system perturbations and system responses.
Linear Pearson correlations have been used to identify genes that are coexpressed or antiexpressed for clustering purposes (D'Haeseleer et al. 1998
i is the expression value of gene i averaged across all time points, and the angled brackets represent the inner product between the time-shifted profiles. The matrix of lagged correlations R( ) can be used to rank the correlation and anticorrelation between genes through conversion to a Euclidean distance metric, dij:
. If the value of that gives the maximum correlation is 0, then the two genes are best correlated with no time lag. The matrix D = (dij) describes the correlation between two genes, i and j, in terms of "distance" by making genes that are least correlated (for any ) the "farthest" apart (Arkin and Ross 1995 , an underlying network of potential cause and effect relationships can be elucidated. Some caution is needed to ensure genes with high correlation have been chosen using enough data points to give statistical significance, otherwise all of the values used will merely overfit the data. Such errors may be obvious if values for are unreasonably long from a biological standpoint.
Arkin, Shen, and Ross (1997
By sequencing the Synechocystis genome (Kaneko et al. 1996
In this work we identified a network of putative directional interactions between cascades of genes by using time-lagged correlation analysis. This network relates the changing light input signal to dynamic gene transcription data obtained using full genome DNA microarrays (Schmitt Jr. and Stephanopoulos 2003
In the first experiment the culture was exposed to a series of three-step changes in light intensity ranging from 0 to 16 to 90 µmol/m2/s, as shown in Figure 2. Forty-seven of the fifty samples from this experiment were successfully hybridized to DNA microarrays and provided sufficient signal intensities for further analysis. Two iterations of the time-lagged correlation implementation algorithm were applied to this data. Figure 3 shows 64 genes that were divided among six groups with high correlation directly to the input signal (|R| > 0.7 for at least one member of the group), while Figure 4 shows the expansion of this original set to 50 groups comprising 259 genes, using an additional iteration of the algorithm. Both time-lagged correlations (bold arrows) and time-lagged anticorrelations (dotted arrows) are shown, along with zero-lagged correlations (straight lines). Note that genes correlated with no lag to other genes need not necessarily belong in the same group, as they do not show the same degree of correlation with other clusters in the diagram. It is possible that further experiments will confirm their inclusion into a single group, or suggest the elimination of one or both of the two groups.
The annotated genes that respond to the input light signal in a time-lag "wave" are listed in Table 1. Among the transcripts in Table 1, many encode proteins found in the Synechocystis photosystem complexes. For example, genes associated with photosystem I (such as psaE or psaK) and photosystem II (such as psbEFLJ) seem to become activated at several different time lags relative to the light intensity. Interestingly, this analysis also found both ycf3 and ycf48 having transcriptional expression coordinated with light exposure with the minimum time lag of 20 min. It has been suggested that these genes contribute to either assembly or stability of photosystem I (Wilde et al. 2001
Other genes that are known to be light-regulated, such as apcF, apcE, and apcABC (Gill et al. 2002
As reported in earlier studies (Watson and Tabita 1996
In addition to validating the response of known light sensitive genes, some information concerning the roles, and potential interactions, of putative genes can be derived from our analysis. Genes slr0581 and slr0582, were inversely correlated to the light intensity. A homology search using BLAST on these open reading frames (ORFs) suggests no strong homologies with known proteins, so assigning a functional role is difficult. However, slr0582 has at least some similarity with putative binding factors, and therefore may play a role in the transcription of genes regulated as a response to light. Furthermore, slr0581, which had a sufficiently strong signal at every time point to be included in the Principle Component Analysis (Raychaudhuri et al. 2000
Although we used levels of R large enough to minimize the chance of randomly observing highly correlated genes, a second experiment, with a different input light signal (see Fig. 2) was conducted to test the correlations observed in the first experiment. Results for the genes shown in Figure 3 are shown in Table 2. Most of these genes have similar correlations with the input signal in both experiments. Exceptions that have different values between the experiments, could then be used to "prune" or adjust the networks shown in Figures 3 and 4. Consider, for example, the cpc genes found in Group 7, which seem to correlate less well in the second experiment. Although the correlation is still significant at this level for a time lag of two units (40 min), we also observe a nearly equal correlation at a lag of three units (60 min), and further experiments are required to accurately plot these genes within the network. However, unless exceedingly low correlation is observed, such connections cannot reasonably be rejected with only this data, and testing the correlations using complementary techniques would be highly beneficial.
Having conducted two sets of experiments under different input light profiles, the consistency of the measured transcriptional profiles was investigated. One way to do so is by projecting the transcriptional state of the cells in a reduced dimensional space, defined by coordinates that are linear combinations of the gene expression data. In such a space we anticipate that cellular samples taken under similar experimental conditions will cluster together. Principal components analysis (PCA) was used as an unsupervised data visualization tool to test whether cells exposed to the same environmental conditions in different experiments clustered together in the space defined using the first two principle components (Misra et al. 2002
Only genes with expression values for all 74 samples, from both experiments, were considered in the PCA analysis. For these 113 genes, the two largest principle components account for approximately 68.7% of the variance. Figure 5 shows the 74 samples in a space defined by the two principal components. It can be seen that samples obtained during the "dark" conditions (with a time lag of 20 min) reside in an area distinct from those obtained during either of the light conditions in both experiments, with the exception of a single, easily identified outlier point from the second experiment. Not surprisingly, examination of the gene loadings (Misra et al. 2002
To further substantiate the results of the analysis two sets of simulations were conducted. In the first set of simulations we reanalyzed the data set using only half the data points. Thus instead of having 47 data points at 20-min intervals, we used 24 data points at 40-min intervals. Genes that possessed a maximum correlation value at a time lag of 20 min in the original data (47 time points) set fell into either the 0- or 40-min time-lagged group when 40-min sampling intervals were used. Likewise, genes that possessed a maximum correlation value at a time lag of 60 min in the original data set were placed in either the 40 or 80 min group. Although some additional genes now pass the |R| > 0.70 threshold and enter the analysis, none of the genes that were identified using all 47 time points are eliminated from the analysis when only 24 time points are used. Thus as more time points are used, only the strongest correlations emerge in the analysis. In the second set of simulations we tested the statistical significance of our correlations. To do this we created 10 million sets of random data in the same format as the original data set (47 time points) and found only two samples that had a correlation value of |R| > 0.70for
The three design criteria that need to be considered in planning similar dynamic experiments are the frequency of sampling, the properties of the induced perturbations, and the dynamics of gene transcription. These factors interact with one another and should be optimized to address a specific condition. In these experiments we wanted to study the transcriptional dynamics of Synechocystis sp. in response to light perturbations. Because changes in light intensity can be introduced instantaneously, we were not limited by diffusional effects that may slow the response, or create varying time lags dependent upon the magnitude of the induced change. The induction and relaxation times of gene transcription also need to be considered in selecting the sampling frequency. Because changes in gene transcription have been observed over 2- to 6-h periods (Hirahara et al. 2001; Gill et al. 2002
We have conducted two time-series DNA microarray experiments, one with 47 measurements (of 50 time points) taken over 17 h, and another of 27 measurements taken over 9 h. Both of these experiments share the same sampling interval (20 min) and are taken from single, homogeneous (well-stirred) cultures. These experiments are especially unique because of the application of an instantaneous forcing environmental input, the intensity of the incident light to the system. Compared to earlier time-series experiments with DNA microarrays (Chu et al. 1998
Time-lagged correlations (Arkin and Ross 1995 Although a substantial effort is required to plan and perform this type of experiment, an enormous amount of information is obtained. This information enables the construction of genetic networks using the system's identification technique of time-lagged correlations. The directionality of the resulting networks provides more information than clustering alone, and therefore allows the researcher to generate hypotheses based on the system structure. Additionally, it is important to consider similarly expressed genes as potential regulon members. Regulons are sets of coregulated genes with common promoter regions differing from operons in that they are not necessarily sequentially oriented in the genome. To this end, genes with the same time-lagged correlation may be considered as good regulon candidates. The 50 groups containing 259 genes uncovered by this analysis technique (Table 1) contained genes that are known to have light-induced regulation, as well as unannotated genes, whose functions have yet to be completely assessed. The former group, containing genes such as apcF, apcE, and apcABC, demonstrate the reliability of the technique, while the latter group can help formulate testable hypotheses for a gene's function. This suggests that dynamic studies of transcriptional behavior with significant numbers of time-points can play a key role in understanding cellular regulation. As other measurements such as protein and metabolite data become available at similar scales and frequencies as DNA microarray data, then time-lagged correlation studies should allow for the creation of hypothetical networks similar to Figures 3 and 4 that will contain greater degrees of mechanistic information. Such approaches will hold new insights into the regulation of Synechocystis and may contribute to its utilization for carbon fixation at a practical scale.
Time-Lagged Correlation Methodology To analyze DNA microarray data using time-lagged correlations, some adjustments were made to the method described by Arkin et al. (Arkin and Ross 1995
Step 1: Filter Low-Signal Genes and Cluster Potential Operon Members
Because the Synechocystis genome has been sequenced, additional information is available about each gene's chromosomal location and relative ordering within the genome (http://www.kazusa.or.jp/cyano/). This information, along with the experimental expression data, may suggest the existence of operons, or coexpressed sets of genes due to a common upstream promoter system. In this algorithm, instead of traditional clustering (Dillon and Goldstein 1984
Step 2: Correlate Gene Expression Profiles With the Input Signal
Recall that the correlation of equations 1 and 2 comparing the input signal in the first iteration (or the average profile of a gene cluster in subsequent iterations), i, to gene j best identifies relationships of the type
= 0. At this point Sij = A · j, where j is the variance of, which gj corresponds to rij = 1. Because gene expression is affected by a variety of variables, some of which are not accounted for, the resulting pair-wise-correlations will often be less than unity. This suggests that key genes will possess strong but imperfect correlation with the input signal. By lowering the threshold values for rij from unity, such imperfect correlations with appropriate time lags can be captured. All genes having at least one r( ) value greater than the preselected cutoff of |R| > 0.7, were set aside for further consideration. In this way, a set of "first-order" interactions was obtained in a computational time increasing linearly with the number of genes included.
Using simulations, we found that the determination of time-lags and correlations is highly dependent on perturbations in the experimental conditions that lead to measurable changes in gene transcription. This has also been noted by Arkin and Ross (1995
Step 3: Assemble Groups Containing Retained Genes From Step 2
Step 4: Expand the Network by Repeating Steps 2 and 3
Step 5: Create a Graphical Representation of the Network
Experimental Conditions
Seed cultures were used to inoculate intermediate cultures, which fed the reactor vessel. Seed cultures were grown in 250-ml flasks with cotton and gauze caps and contained 100 ml of H2O that was heat sterilized. Concentrated (50X) BG-11 medium (2 ml) and 0.38 M Na2CO3 (300 µl) were added to the cooled (30°C maximum) flasks. Intermediate cultures used to inoculate the larger reactor vessels were grown in 1-L flasks with 300 ml of H2O. Concentrated (50X) BG-11 (6 ml) and 0.38M Na2CO3 (300 µl) were added to the flasks, along with sterile-filtered 1 M HEPES (6 ml) to create an environment similar to the sparged reactor vessel (see below). Approximately 10 ml from a 250-ml seed culture in late-exponential or stationary phase was used for inoculation of the intermediate cultures. Intermediate cultures were grown to mid-exponential phase (A730
The layout of the sparged-gas vessel is shown in Figure 2. Six liters of H2O was autoclaved in the 10-L reactor vessel with a large stir-bar placed in the center of the gas-sparging ring. Since CO2 was bubbled through this reactor, no Na2CO3 was added, only BG-11 media (120 ml). Dissolved CO2 gas in the form of (H+)(
All experiments were conducted as described in Schmitt Jr. and Stephanopoulos (2003 To isolate the RNA, pellets were resuspended and processed according to the Qiaquick (Qiagen) RNA extraction kit, with the additional step of grinding the cell pellets in a bead mill for 4 min to break the Synechocystis' outer cell wall. Typical yields were 1050 µg of RNA for 50-ml tubes of samples, which is enough to run between one and four microarrays. Full-length cDNA microarrays (provided by DuPont Co. comprising 3078 unique cDNA sequences were used. These were shipped dry in 384 well (16 x 24) plates (Genetix), resuspended with 5 µl of 50% (vol.) DMSO in H2O, and stored at 80°C until printing. Arrays were printed on a MicroGrid II quill pin microarrayer (BioRobotics) at 3545% relative humidity at room temperature on Corning Gap slides. Sixteen quill pins were used to print with a 0.29 pitch (290 µm spacing) between features. The features were printed in 16 grids, each containing 15 rows and 15 columns of cDNA features. While printing, each pin was blotted three times on each of four spare slides to remove excess liquid from the quills. Then printing was performed at one tap/slide for each of 104 slides. This entire procedure was repeated on the bottom half of each slide. The printing procedure took about 20 h to print the entire cDNA library in duplicate on 104 slides. Slides were then crosslinked in batches of 18 slides using a Stratalinker (Stratagene) set to the "Autocrosslink" option at 1200 µJ, and were stored in the dark until use.
All RNA samples were processed into labeled cDNA, hybridized to arrays, and scanned as described in Schmitt Jr. and Stephanopoulos (2003 Microarray quality, sample processing, and data filters were analyzed in aggregate by hybridizing six samples from cultures grown in parallel and labeled with different dyes to three microarrays. On average, the expression ratio of 8.8% genes differed by greater than 1.75, while only 5.1% of genes differed by more than twofold, consistent with other cDNA microarray experiments. Based on these results, genes were deemed to be differentially expressed if they exhibited an expression ratio of two-fold or greater with respect to the control. A compilation of all duplicate spots within slides gave a within-slide coefficient of variation of 0.18.
Data and computer files are available from the corresponding author. This work was supported by NSF grant number BES-9985421. Additional support was provided by NIH grant number 1-RO1-DK5853301. PCR products for the construction of the whole genome microarrays of Synechocystis were provided by the DuPont Company. Their assistance in this research, in particular Drs. Lisa Hwang and Ethel Jackson, is gratefully acknowledged. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2439804.
1 Corresponding author.
Alter, O., Brown, P.O., and Botstein, D. 2000. Singular value decomposition for genome-wide expression data processing and modeling. Proc. Natl. Acad. Sci. 97: 1010110106. Arkin, A. and Ross, J. 1995. Statistical construction of chemical-reaction mechanisms from measured time-series. J. Phys. Chem. 99: 970979.[CrossRef]
Arkin, A., Shen, P.D., and Ross, J. 1997. A test case of correlation metric construction of a reaction pathway from measurements. Science 277: 12751279. Bryant, D.A., Delorimier, R., Guglielmi, G., and Stevens, S.E. 1990. Structural and compositional analyses of the phycobilisomes of Synechococcus sp Pcc 7002Analyses of the wild-type strain and a phycocyanin-less mutant constructed by interposon mutagenesis. Arch. Microbiol. 153: 550560.[CrossRef][Medline]
Chu, S., DeRisi, J., Eisen, M., Mulholland, J., Botstein, D., Brown, P.O., and Herskowitz, I. 1998. The transcriptional program of sporulation in budding yeast. Science 282: 699705. D'Haeseleer, P., Wen, X., Fuhrman, S., and Somogyi, R. 1998. Mining the gene expression matrix: Inferring gene relationships from large scale gene expression data. In Information processing in cells and tissues (eds. R.C. Paton and M. Holcombe), pp. 203212. Plenum, New York. Dillon, W.R. and Goldstein, M. 1984. Multivariate analysis. Wiley, New York.
Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D. 1998. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 95: 1486314868. Friedman, N., Linial, M., Nachman, I., and Pe'er, D. 2000. Using Bayesian networks to analyze expression data. Fourth Annual International Conference on Computational Molecular Biology. Tokyo.
Gill, R.T., Katsoulakis, E., Schmitt, W., Taroncher-Oldenburg, G., Misra, J., and Stephanopoulos, G. 2002. Genome-wide dynamic transcriptional profiling of the light-to-dark transition in Synechocystis sp strain PCC 6803. J. Bacteriol. 184: 36713681.
Heyer, L.J., Kruglyak, S., and Yooseph, S. 1999. Exploring expression data: Identification and analysis of coexpressed genes. Genome Res. 9: 11061115.
Hihara, Y., Kamei, A., Kanehisa, M., Kaplan, A., and Ikeuchi, M. 2001. DNA microarray analysis of cyanobacterial gene expression during acclimation to high light. Plant Cell 13: 793806.
Holter, N.S., Mitra, M., Maritan, A., Cieplak, M., Banavar, J.R., and Fedoroff, N.V. 2000. Fundamental patterns underlying gene expression profiles: Simplicity from complexity. Proc. Natl. Acad. Sci. 97: 84098414.
Iyer, V.R., Eisen, M.B., Ross, D.T., Schuler, G., Moore, T., Lee, J.C.F., Trent, J.M., Staudt, L.M., Hudson, J., Boguski, M.S., et al. 1999. The transcriptional program in the response of human fibroblasts to serum. Science 283: 8387. Kamimura, R.T. 1997. "Application of multivariate statistics to fermentation database mining." Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA. Kaneko, T., Sato, S., Kotani, H., Tanaka, A., Asamizu, E., Nakamura, Y., Miyajima, N., Hirosawa, M., Sugiura, M., Sasamoto S., et al. 1996. Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. 2. Sequence determination of the entire genome and assignment of potential protein-coding regions. DNA Res 3: 109136.[Abstract] Kuruvilla, F.G., Park, P.J., Schreiber, S.L. 2002. Vector algebra in the analysis of genome-wide expression data. Genome Biol. 3: 0011.10011.11. Meurer, J., Plucken, H., Kowallik, K.V., and Westhoff, P. 1998. A nuclear-encoded protein of prokaryotic origin is essential for the stability of photosystem II in Arabidopsis thaliana. Embo J. 17: 52865297.[CrossRef][Medline]
Misra, J., Schmitt, W., Hwang, D., Hsiao, L.L., Gullans, S., Stephanopoulos, G. 2002. Interactive exploration of microarray gene expression patterns in a reduced dimensional space. Genome Res. 12: 11121120. Raychaudhuri, S., Stuart, J.M., and Altman, R.B. 2000. Principal components analysis to summarize microarray experiments: Application to sporulation time series. Pacific Symposium on Biocomputing. Hawaii. Schmitt Jr., W.A. and Stephanopoulos, G. 2003. Prescription of transcriptional profiles of Synechosystis PCC6803 by dynamic autoregressive modeling of DNA microarray data. Biotechnol. Bioeng. 84: 855863.[CrossRef][Medline] Somogyi, R. and Fuhrman, S. 1997. Distributivity, a general information theoretic network measurement, or why the whole is more than the sum of its parts. The International Workshop on Information Processing in Cells and Tissues. Sheffield, UK.
Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., and Futcher, B. 1998. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 9: 32733297.
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., Lander, E.S., and Golub, T.R. 1999. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. 96: 29072912. Watson, G.M.F. and Tabita, F.R. 1996. Regulation, unique gene organization, and unusual primary structure of carbon fixation genes from a marine phycoerythrin-containing cyanobacterium. Plant Mol. Biol. 32: 11031115.[CrossRef][Medline] Wei, W. Time series analysis. 1990. Addison-Wesley, Redwood City, CA. Wilde, A., Lunser, K., Ossenbuhl, F., Nickelsen, J., and Borner, T. 2001. Characterization of the cyanobacterial ycf37: Mutation decreases the photosystem I content. Biochem. J. 357: 211216.[CrossRef][Medline] Zhu, J. and Zhang, M.Q. 2000. Cluster, function and promoter: Analysis of yeast expression array. Pacific Symposium on Biocomputing. Hawaii.
http://www.research.att.com/sw/tools/graphviz/; AT&T Labs, Graphviz. http://www.kazusa.or.jp/cyano/; CyanoBase: The Genome Database for Cyanobacteria. http://web.mit.edu/cheme/gnswebpage/index.shtml; Bioinformatics and Metabolic Engineering Laboratory at MIT.
Received February 11, 2004; accepted in revised format April 22, 2004. This article has been cited by other articles:
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||