|
|
|
|
Vol. 9, Issue 8, 681-688, August 1999
PERSPECTIVE
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |
ABSTRACT |
|---|
|
|
|---|
The use of high-density DNA arrays to monitor gene expression at a genome-wide scale constitutes a fundamental advance in biology. In particular, the expression pattern of all genes in Saccharomyces cerevisiae can be interrogated using microarray analysis where cDNAs are hybridized to an array of each of the ~6000 genes in the yeast genome. In this survey I review three recent experiments related to transcriptional regulation and discuss the great challenge for computational biologists trying to extract functional information from such large-scale gene expression data.
| |
ARTICLE |
|---|
|
|
|---|
Ever since the theory of genetic regulation of protein synthesis was
first worked out almost 40 years ago (Jacob and Monod 1961
), biologists
have been fascinated by how different genetic programs hard coded in
the DNA are involved in the control and regulation of gene
expression. This is important because different temporal-spatial gene expression patterns relate directly to
developmental control, morphogenesis and cell differentiation, tissue
specificity, hormonal communication, or cellular stress responses. Gene
expression is largely controlled at the transcriptional level, and
transcriptional regulatory elements are located primarily in the
upstream promoter region of each gene; however, the lack of quality
upstream experimental data has made systematic global investigations
very difficult (Zhang 1998a
; for review, see Fickett and Hatzigeorgiou
1997
). In the past, computational genomics has focused mainly on gene finding (Claverie 1997
; Zhang 1997
), namely finding the protein coding
region and extracting functional information about the protein product.
To get equally important functional information about control elements
of a gene, one has to analyze functional motifs, most of which occur in
noncoding regions (Lavorgna et al. 1998
). There have been some
excellent works on genomic identification of individual eukaryotic
transcriptional factor (TF) binding sites (e.g., Fondrat and
Kalogeropoulos 1996
; Tronche et al. 1997
; Wasserman and Fickett 1998
).
The most tedious part of these approaches is collecting enough
experimentally verified cis-acting elements that are shared by
a set of coregulated genes.
Since the complete yeast genome has become available, there have been
several attempts to computationally identifying upstream activation
sequences (UASs) by their clustering properties (Wagner 1997
, 1998
).
This approach suffers from the simple random background (i.e., all
oligonucleotides are Poisson distributed) assumption and will certainly
miss many TF sites that do not cluster or do not cluster in a simple
fashion. Genome-wide monitoring of gene expression has provided a far
more effective way of systematically studying coregulated genes and TF
sites (for the latest reviews, see "The Chipping Forecast" 1999). I
focus on the computational problems of identifying coregulated genes
and their promoter elements and refer people to read elsewhere on other
interesting and equally challenging issues, such as data
requisition/process (Marton et al. 1998
), data presentation/management
(Ermolaeva et al. 1998
), and genetic network dynamics (Altman et al.
1998
, 1999
).
The power of DNA microarray hybridization on a genome scale was
demonstrated early on (Schena et al. 1995
; Lockhart et al. 1996
; for
introduction on the basic concept and protocols, see Stein 1998
). For
example, in a yeast diauxic shift experiment, five groups of distinct
temporal patterns of induction or repression could be recognized
visually, as glucose concentration was increased (Fig.
1). The characterized members of each of these groups
shared important similarities in their functions, and common regulatory mechanisms could be inferred for most sets of genes with similar expression profiles. When searching for known UAS motifs in each group,
many coregulated genes did share common TF sites (DeRisi et al. 1997
).
|
In preparation for regulatory sequence analysis from such expression
data, a statistical method was developed (van Helden et al. 1998
) based
on detection of over-represented oligonucleotides in a target set of
upstream sequences over all noncoding sequences from the genome. It was
applied to 10 families containing from 5 to 38 genes; 2 of the families
were actually built from the DNA microarray expression data of
YAP1 overexpression and TUP1 deletion (DeRisi et
al. 1997
). This method was very useful for identifying short core
motifs, which is equivalent to the oligonucleotide relative information method
(Zhang 1998b
) and other methods used in the following experiments.
I now describe real data analyses in the following three sets of recent
whole-genome expression experiments. The first was two-point
comparisons using oligonucleotide chips, which detected relative mRNA
levels before and after nutrient change, heat shock, or mating-type
switch (Roth et al. 1998
). The second was multipoint (time-course)
comparisons also using oligonucleotide chips, which detected mRNA level
changes (at different time points after cell cycle release) relative to
an arbitrary (but fixed) standard (Cho et al. 1998
) for the purpose of
identifying cell cycle-regulated genes. The third consisted of both
two-point and multipoint comparisons, but using cDNA microarrays, which
detected relative mRNA levels (at different time points after cell
cycle release) in the synchronous cells relative to the control of
asynchronous cells (Spellman et al. 1998
) coupled with separate
experiments of CLN3 and CLB2 induction. In Table 1,
the main features in these experiments are listed for easy
comparison.
|
In a two-point experiment (such as in Roth et al. 1998
and in part of
Spellman et al. 1998
), one measures the relative ratio of mRNA
concentration under two different conditions for all genes. After
sorting these ratios (one ratio per gene) of mRNA levels, one can
identify the most induced or most inhibited genes from the two extreme
ends of the sorted list. Some criteria are needed for the gene
selection. For example, in the promoter analysis of Roth et al. (1998)
,
the upstream sequences (relative to ATG) of the 10 ORFs were taken from
the top and the bottom of the sorted list. Hence a highly induced set,
a highly inhibited set, and a combined set were used in searching by
AlignACE for common UAS motifs. The rationale for examining the
combined set is that a single regulatory motif may serve as either a
positive or a negative element depending on its sequence context or
environment. AlignACE is a modified version of the Gibbs sampler
(Neuwald et al. 1995
) and was optimized for finding multiple motifs
(via an iterative masking procedure) and for aligning DNA sequences on
both strands. It also scores alignments by frequency of occurrence in
the intergenic regions of a given genome (for the algorithmic details,
see Roth et al. 1998
). To suppress false positives, a motif must pass
two criteria: (1) exceeding an alignment threshold, and (2) having an
occurrence score below 1% (i.e., <1% of genes in the genome may
have this motif).
For the galactose versus glucose comparison, UASG motif
t(T/c)CGG(C/A)(G/c)NNcT(g/c)(T/c)NNcCGG, which is known to regulate galactose utilization genes via the Gal4p/Gal80p activation complex (Lohr et al. 1995
) was found successfully, but other expected UASs such
as the Rap1 site, the Gcr1 site, or the Mig1
site were not found. For the 39°C vs. 30°C comparison, the heat
shock element (HSE) and stress response element (STRE) were not found.
Because heat shock is known to have broad effects, including transient cell cycle arrest in G1 (Rowley et al. 1993
), the cell cycle
activation (CCA) motif GCGAA(a/g)ttNT(g/c)(a/g)GAA(C/g) of histones was
found, but other cell cycle UASs, such as the negative regulatory
element of histones (NEG), Swi4/6 cell cycle box (SCB),
MluI cell cycle box (MCB), and early cell cycle box (ECB) (see
below), were not found. For mating-type
versus a
comparison, all four mating type-specific elements (
2 operator,
tCAAtgNcAg; P box, TtCCTAATT(a/g)GgN(c/a)(a/t); pheromone response
element (PRE), aTGAAAC; and Q box, tCAAtgNcAg) were found. Some
putative motifs were also found, but many were suspected to be false
positives. As only one time point (or averaged stationary time points)
is taken for each pair of conditions in a two-point comparison
experiment, dynamic information is totally lost. It is impossible to
separate the primary transcriptional event from the downstream
cascades. It would have been much more informative for detecting
coregulated genes if measurements had been taken at multiple
consecutive time points. (In principle, curve-fitting may result in
more robust time series but currently available points in a particular
experiment were too limited for smoothing.) This is why I discuss the
other two sets of time-course experiments for identifying a complete set of cell cycle-regulated genes and regulatory sequences in yeast.
Figure 2 shows a current model illustrating
interactions that determine cell cycle-regulated transcription in yeast
(Koch and Nasmyth 1994
; McInerny et al. 1997
). Cln3-associated
kinase activates late G1-specific transcription factors [SBF
(SCB binding factor) and MBF (MCB binding factor)] in a cell
size-dependent fashion. SBF and MBF mediate the expression of CLN1,2
and CLB5,6 as well as S-phase proteins leading to budding and S-phase
entry. CLN1,2 activity allows accumulation of Clbs by an
unknown mechanism. Clb1 and Clb2 activate
transcription of G2-specific genes and thereby autoactivate
their own synthesis, possibly via transcription factors Mcm1 and Sff.
At the same time, Clb1,2/cdc28 represses SBF-mediated
transcription. Whereas Clb1,2/cdc28 activates expression of
SWI5 and possibly of ACE2 RNAs via mcm1/Sff, it keeps the gene products in an inactive state by phosphorylation of the nuclear localization signals. Clb proteolysis at the end of mitosis
dramatically changes the situation: Clb-mediated activation of
G2-specific genes is stopped, and Swi5 loses its
inhibitory phosphorylations, leading to its uptake into the nucleus
where it can activate early G1-specific transcripts. At late
M phase, a mcm1-related factor binds to ECB (early cell cycle
box) and initiates M/G1-specific activation of CLN3, SWI4,
and some DNA replication genes; these gene products have critical roles
in promoting the initiation of the next S phase.
|
Since oscillation of histone mRNAs was discovered (Hereford et al.
1981
), 103 cell cycle-regulated messages have been identified using
traditional methods, and it was estimated that some 250 in total might
exist (Price et al. 1991
). To create a comprehensive catalog of yeast
genes whose transcript levels vary periodically with the cell cycle,
two independent sets of genome-wide transcriptional experiments have
been completed recently (Cho et al. 1998
; Spellman et al. 1998
).
Cho et al. (1998)
used the commercially available oligonucleotide
arrays and temperature-sensitive cdc28 mutant synchronization. After normalization of the expression profiles, cell cycle-dependent periodicity was found for 416 of the ~6200 monitored transcripts. These genes were classified into five groups (early G1, late
G1, S, G2, and M) according to their visual peak
positions and the consistency with known genes. Through matches of
known TF sites and visual extension of over-represented short
oligonucleotides, a dozen putative motifs (including MCB, SCB, ECB/MCM1) were
identified in the promoter regions (Table 2 in Cho et al. 1998
).
In contrast, Spellman et al. (1998)
used "home-made" cDNA
microarrays and samples from yeast cultures synchronized by three independent methods:
-factor arrest, elutriation, and blockage of
a cdc15 temperature-sensitive mutant. Using Fourier transform and Pearson-type correlation algorithms, 800 genes were identified that
met an objective minimum criterion for cell cycle regulation. In
separate experiments, designed to examine the effects of inducing either the G1 cyclin Cln3p or the B-type cyclin Clb2p, it was found that mRNA levels of more than half of these 800 genes responded to cyclin induction. These 800 gene expression patterns (sorted by the
phases) are shown in Figure 3. The genes were divided
into the following five groups: M/G1 (113 genes),
G1 (300 genes), S (71 genes), G2 (121 genes), and M
(195 genes). Spellman et al. (1998)
used the following clustering
strategy: Five profiles were built from genes known to be expressed in
each class. The averaged peak correlation score (defined as the highest
correlation value between the log 2 (ratio) data series for each gene
and each profile) of different experiments was used to construct an
objective "aggregate CDC (cell
cycle-dependent clustering) score." Genes
were ranked by their aggregate CDC scores, and the list was examined to
determine a threshold that was exceeded by 91% of known cell
cycle-regulated genes. Altogether, 800 genes met or exceeded the
threshold. Clustering of randomly shuffled data indicated that the
false-positive rate should be <10%. Table 2 provides some examples
of the kinds of scores obtained for several genes (including specific
examples that are and are not cell cycle regulated).
|
|
Using a newly developed Saccharomyces cerevisiae promoter
database and analysis tools (Zhu and Zhang 1999
), the 700-bp upstream regions of the five phase groups were analyzed further.
Spellman et al. (1998)
first computed the relative pentamer information
(an olignucleotide bias measure) of each phase group versus the control
group of non-cell-cycle genes (Zhang 1998b
). They then tried to extend
the informative oligomers or to find other longer motifs by using
GibbsDNA (Z. Ioschikhes and M.Q. Zhang, unpubl.), which is another
modified version of the Gibbs sampler and includes features, such as
double strand, palindrome symmetry, distance constraint and submotif
inclusion/exclusion. As Gibbs sampling is a stochastic process, a
sufficient number of runs had to be carried out for each data set with
various parameters, and the motifs that had higher maximum aposteri
probability (MAP) values were selected. Once motifs were established
for a group, their predictive value was tested by searching for the
motif consensus (with specified mismatches) in the promoter regions of
all groups, as well as for the control group. Figure 4 shows the
selectivity of some consensus motifs with respect to different
regulatory groups and positions.
|
Because the cutoffs for different phase groups were somewhat arbitrary,
to search for better coregulated gene clusters, the Peason-type
correlation clustering algorithm (Eisen et al. 1998
) was used to
identify nine clusters [data from Cho et al. (1998)
was also included
during clustering for completeness]. The dendragram of these clustered
expression profiles is shown in Figure 5. UAS motifs
for each of these clusters were calculated, and results are shown in
Table 3. As the statistics show, many of these motifs contain information predictive of cell cycle regulation (see Fig. 4). A
full description and complete data sets are available at http://cellcycle-www.stanford.edu and at http://www.cshl.org/mzhanglab.
|
|
It is clear that multiple time points are more useful and better for
clustering and promoter analysis. The most obvious difference between
the results of Spellman et al. (1998)
and Cho et al. (1998)
is the
number of cell cycle-regulated genes and promoter elements identified.
With a manual decision process, Cho et al. found 421 genes to be cell
cycle regulated. A set of 800 genes found by Spellman et al. includes
304 of these, but the other 117 do not appear significantly cell cycle
regulated in their experiments. The set of 800 genes therefore contains
496 genes not identified by Cho et al. The main cell cycle control
promoter element SWI5 site (among some others) was not identified in
the analysis of Cho et al. because (according to Cho et al.)
"Swi5 do not have a highly conserved binding sequence,
making it difficult to accurately search genomic sequence for possible
sites of action." Here, the SWI5 site was a good example for which
the factor was well known generally, but the site had not been well
characterized experimentally. Another advantage of the analysis of
Spellman et al. was the diversity of experiments, which allowed them to
distinguish cell cycle regulation from confounding patterns such as
those caused by the heat shock response when a culture is shifted from
one temperature to another. Although transcriptional cascade could be
better resolved by adding more time points, there are certainly
technical limits. Because differential stability could also affect the
transcript level as well as transcription rate, a systematic detection
of the turnover rate for each transcript would be also crucial for more
accurate global picture.
In summary, with the help of genome-wide expression techniques, it is
possible to identify coregulated genes by clustering analysis.
Furthermore, by combination of over-represented oligonucleotide analysis and multiple-sequence alignment programs, it is also possible
to identify upstream regulatory motifs commonly shared by coregulated
genes. Good clustering is better than sophisticated motif-search
algorithms. It would be highly desirable if one could combine motif and
cluster analyses, as good clustering can facilitate motif
identification, and, conversely, conserved motifs (or any other
functional information related to the sequences) can help to improve
clustering. We ought to work toward a self-consistent iteration process
(clustering coregulated genes
detecting common motifs) as used
in all scientific inference (functional groups
conserved
structures). Although I have only discussed promoter motif detection in
the context of array data analysis, it can be equally applied to a
similar search of 3'-untranslated regions (3' UTRs). As the
3' UTR is often important for message stability and transport,
identification of conserved motifs in this region may be also be
instructive for gene regulation.
| |
ACKNOWLEDGMENTS |
|---|
I thank all my collaborators: P.T. Spellman, G. Sherlock, V.R. Iyer, K. Anders, M.B. Eisen, P.O. Brown, D. Botstein, B. Futcher, Z. Ioschikhes, and J. Zhu. I am grateful to F.P. Roth for providing the preprint. I also thank R.W. Davis for helpful discussions during a visit to the Stanford Genome Center and T.G. Wolfsberg for useful references. I further thank Lincoln Stein for reading and commenting on the manuscript. My laboratory is supported by National Institutes of Health/National Human Genome Research Institute (NIH/NHGRI; HG01696), Cold Spring Harbor Laboratory (CSHL) Association, and Merck Genome Research Institute.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| |
NOTE |
|---|
Recently, an interesting new clustering algorithm GENECLUSTER based
on self-organizing maps has been developed (Tamayo et al. 1999
); it was
used to recluster the data of Cho et al. (1998)
and found essentially
similar results. Unfortunately, there was no regulatory sequence analysis.
| |
FOOTNOTES |
|---|
1 E-MAIL mzhang{at}cshl.org; FAX (516) 367-8461.
| |
REFERENCES |
|---|
|
|
|---|


. 1998. Distribution of transcription factor binding sites in the
yeast genome suggests abundance of coordinately regulated genes.
Genomics (in press).This article has been cited by other articles:
![]() |
S. J. Galbraith, L. M. Tran, and J. C. Liao Transcriptome network component analysis with limited microarray data Bioinformatics, August 1, 2006; 22(15): 1886 - 1894. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. B. Fogel, D. G. Weekes, G. Varga, E. R. Dow, H. B. Harlow, J. E. Onyia, and C. Su Discovery of sequence motifs related to coexpression of genes using evolutionary computation Nucleic Acids Res., July 20, 2004; 32(13): 3826 - 3835. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Xiong, J. Li, and X. Fang Identification of Genetic Networks Genetics, February 1, 2004; 166(2): 1037 - 1052. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. E. Boardman, S. G. Oliver, and S. J. Hubbard SiteSeer: visualisation and analysis of transcription factor binding sites in nucleotide sequences Nucleic Acids Res., July 1, 2003; 31(13): 3572 - 3575. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. Farre, R. Roset, M. Huerta, J. E. Adsuara, L. Rosello, M. M. Alba, and X. Messeguer Identification of patterns in biological sequences at the ALGGEN server: PROMO and MALGEN Nucleic Acids Res., July 1, 2003; 31(13): 3651 - 3653. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. Zhang, Y. Ramanathan, P. Soteropoulos, M. L. Recce, and P. P. Tolias EZ-Retrieve: a web-server for batch retrieval of coordinate-specified human DNA sequences and underscoring putative transcription factor-binding sites Nucleic Acids Res., November 1, 2002; 30(21): e121 - e121. [Abstract] [Full Text] [PDF] |
||||
![]() |
P.T. Simmons and C. J. Portier Toxicogenomics: the new frontier in risk analysis Carcinogenesis, June 1, 2002; 23(6): 903 - 905. [Abstract] [Full Text] [PDF] |
||||
![]() |
Q. Zhao, A. Kho, A. M. Kenney, D.-i. Yuk, I. Kohane, and D. H. Rowitch Identification of genes expressed with temporal-spatial restriction to developing cerebellar neuron precursors by a functional genomic approach PNAS, April 16, 2002; 99(8): 5704 - 5709. [Abstract] [Full Text] [PDF] |
||||
![]() |
X. Wang, S. Ghosh, and S.-W. Guo Quantitative quality control in microarray image processing and data acquisition Nucleic Acids Res., August 1, 2001; 29(15): e75 - e75. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. M. Umek, S. W. Lin, J. Vielmetter, R. H. Terbrueggen, B. Irvine, C. J. Yu, J. F. Kayyem, H. Yowanto, G. F. Blackburn, D. H. Farkas, et al. Electronic Detection of Nucleic Acids: A Versatile Platform for Molecular Diagnostics J. Mol. Diagn., May 1, 2001; 3(2): 74 - 84. [Abstract] [Full Text] |
||||
![]() |
M. R. Fielden and T. R. Zacharewski Challenges and Limitations of Gene Expression Profiling in Mechanistic and Predictive Toxicology Toxicol. Sci., March 1, 2001; 60(1): 6 - 10. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Florea, C. Riemer, S. Schwartz, Z. Zhang, N. Stojanovic, W. Miller, and M. McClelland Web-based visualization tools for bacterial genome alignments Nucleic Acids Res., September 15, 2000; 28(18): 3486 - 3496. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. Meldrum Automation for Genomics, Part Two: Sequencers, Microarrays, and Future Trends Genome Res., September 1, 2000; 10(9): 1288 - 1303. [Abstract] [Full Text] |
||||
![]() |
H. J. Bussemaker, H. Li, and E. D. Siggia Building a dictionary for genomes: Identification of presumptive regulatory sites by statistical analysis PNAS, August 10, 2000; (2000) 180265397. [Abstract] [Full Text] |
||||
![]() |
J. S. Jacobs Anderson and R. Parker Computational identification of cis-acting elements affecting post-transcriptional control of gene expression in Saccharomyces cerevisiae Nucleic Acids Res., April 1, 2000; 28(7): 1604 - 1617. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Gécz and J. Mulley Genes for Cognitive Function: Developments on the X Genome Res., February 1, 2000; 10(2): 157 - 163. [Abstract] [Full Text] |
||||
![]() |
H. J. Bussemaker, H. Li, and E. D. Siggia From the Cover: Building a dictionary for genomes: Identification of presumptive regulatory sites by statistical analysis PNAS, August 29, 2000; 97(18): 10096 - 10100. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. Birnbaum, P. N. Benfey, and D. E. Shasha cis Element/Transcription Factor Analysis (cis/TF): A Method for Discovering Transcription Factor/cis Element Relationships Genome Res., September 1, 2001; 11(9): 1567 - 1573. [Abstract] [Full Text] [PDF] |
||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||