|
|
|
|
Vol. 9, Issue 8, 775-792, August 1999
METHODS
|
| |
ABSTRACT |
|---|
|
|
|---|
Recent developments in genome-wide transcript monitoring have led to a rapid accumulation of data from gene expression studies. Such projects highlight the need for methods to predict the molecular basis of transcriptional coregulation. A microarray project identified the 420 yeast transcripts whose synthesis displays cell cycle-dependent periodicity. We present here a statistical technique we developed to identify the sequence elements that may be responsible for this cell cycle regulation. Because most gene regulatory sites contain a short string of highly conserved nucleotides, any such strings that are involved in gene regulation will occur frequently in the upstream regions of the genes that they regulate, and rarely in the upstream regions of other genes. Our strategy therefore utilizes statistical procedures to identify short oligomers, five or six nucleotides in length, that are over-represented in upstream regions of genes whose expression peaks at the same phase of the cell cycle. We report, with a high level of confidence, that 9 hexamers and 12 pentamers are over-represented in the upstream regions of genes whose expression peaks at the early G1, late G1, S, G2, or M phase of the cell cycle. Some of these sequence elements show a preference for a particular orientation, and others, through a separate statistical test, for a particular position upstream of the ATG start codon. The finding that the majority of the statistically significant sequence elements are located in late G1 upstream regions correlates with other experiments that identified the late G1/early S boundary as a vital cell cycle control point. Our results highlight the importance of MCB, an element implicated previously in late G1/early S gene regulation, as most of the late G1 oligomers contain the MCB sequence or variations thereof. It is striking that most MCB-like sequences localize to a specific region upstream of the ATG start codon. Additional sequences that we have identified may be important for regulation at other phases of the cell cycle.
[A companion website to this manuscript is available from http://www.ncbi.nlm.nih.gov/CBBresearch/Landsman/Cell_cycle_data]
| |
INTRODUCTION |
|---|
|
|
|---|
The recent surge in the availability of complete genome sequences,
as well as the development of technologies such as DNA microarrays, is
ushering in a new era in the analysis of gene regulation. The yeast Saccharomyces
cerevisiae, whose genome is the first to be fully sequenced from a
eukaryote, is ideal for these studies. A number of groups have begun to
merge the previously gathered wealth of biochemical information about
yeast with its newly published complete genome sequence. One way to
analyze gene expression is to scan the genome for the consensus-binding
sequence of a known transcription factor to predict additional genes
that may be regulated by that factor. This method has been used to look
for Gcn4p-binding sites (Schuldiner et al. 1998
) and stress-response elements (Moskvina et al. 1998
; Treger et al. 1998
). Another strategy focuses on the identification of upstream gene regulatory sites in
groups of coregulated genes. Van Helden et al. (1998)
and Brazma et al.
(1998)
looked at groups of coregulated genes to find over-represented oligonucleotide sequences. Both groups detected new candidate regulatory sites, as well as sites that had already been characterized. Brazma et al. (1998)
also identified sequence patterns that occur more
frequently in promoter regions than in other regions of the genome.
The availability of the complete yeast genome sequence has also
facilitated whole genome expression analyses in which the expression
levels of the ~6200 yeast genes are assayed in different conditions.
Recent studies have identified the genes that are expressed during
growth on rich and minimal medium (Wodicka et al. 1997
), during the
diauxic shift from anaerobic to aerobic metabolism (DeRisi et al.
1997
), and when the transcriptional corepressor Tup1p is deleted or the
transcriptional activator Yap1p is overexpressed (DeRisi et al. 1997
).
The role of key components of the transcriptional machinery on the
population of yeast genes was assessed by Holstege et al. (1998)
. Chu
et al. (1998)
assayed genes induced during sporulation in yeast. Roth
et al. (1998)
measured transcript abundance in the galactose response,
heat shock, and mating type regulation systems, and took the further step of using a modified Gibbs sampling strategy to identify sequences that may be involved in the gene regulation. Cho et al. (1998)
identified 420 genes whose mRNA expression exhibits cell cycle periodicity. They classified the genes by whether their expression peaked in the early G1, late G1, S, G2,
or M phase of the cell cycle. This analysis paves the way for a new
type of experiment
a computational prediction of the sequence elements
involved in cell cycle regulation. A preliminary analysis of these
elements was described previously (Cho et al. 1998
). In this
manuscript, we present a more detailed interpretation.
We have developed a statistical technique to predict short sequence
elements (i.e., oligomers) that may be involved in the expression of
groups of coregulated genes. Our strategy is to look for pentamers and
hexamers that are over-represented among the upstream regions of genes
whose expression peaks at a particular phase of the cell cycle. We have
identified 9 hexamers and 12 pentamers that may play a role in cell
cycle regulation. Some of these oligomers may function in an
orientation-dependent manner; the function of others may be dependent
on their position within the promoter. The highest scoring hexamer to
come out of this study, by all criteria, is the previously
characterized sequence element MCB, which is known to control the
expression of genes expressed in late G1 (McIntosh 1993
).
Thus, this method is able to select biologically relevant sequence
elements, and we predict that the other elements we identified also
play roles in cell cycle regulation.
A companion website to this manuscript is available from http://www.ncbi.nlm.nih.gov/CBBresearch/Landsman/Cell_cycle_data. This site includes the upstream sequence of each of the cell cycle-regulated yeast genes, electronic versions of Tables 1-4 with links from the oligomer to the names of the genes that contain the oligomer, and the 50-100 top scoring oligomers for each analysis, including some sequences that are not statistically significant.
| |
RESULTS |
|---|
|
|
|---|
Strategy for Finding Novel Candidate Regulatory Elements
Our goal, in general, is to identify novel regulatory elements in
upstream regions of coexpressed yeast genes. In this specific case, we
apply our method to genes that may be involved in the cell
cycle-dependent regulation of transcription. Cho et al. (1998)
have
identified those genes whose transcription exhibits cell cycle-dependent periodicity, and have furthermore classified the genes
into five sets, those expressed during the early G1 (63 genes), late G1 (134 genes), S (74 genes), G2 (56 genes), and M (56 genes) phases of the cell cycle. Our hypothesis is
that sequence elements that are found more frequently (i.e.,
over-represented) in the upstream regions of genes expressed during one
phase of the cell cycle, as compared with genes expressed during other phases of the cell cycle, may play a role in gene expression during that phase. As many transcription factors bind to short, highly conserved stretches of DNA, our analysis centers on short oligomers of
length five or six, pentamers or hexamers. We limited our search to the
sequence 600 nucleotides upstream of the translation start site of each
gene, as most yeast regulatory elements are found within this region
(Struhl 1995
). Many yeast regulatory elements are analogous to
mammalian enhancer sequences, and function in both orientations and at
variable distances upstream of the transcription start site (Struhl
1995
; Kunzler et al. 1996
). Thus, we searched for oligomers whose
representation is statistically significant, independent of their
position and orientation. However, in higher eukaryotes, some
regulatory elements act only when placed in certain locations or
orientations with respect to the transcription start site (see, for
example, Godambe et al. 1995
; Nolan et al. 1996
; Pfaff and Taylor
1998
). To cover all biologically relevant possibilities, we also
searched for potential orientation- and position-dependent elements,
whose distribution is statistically significant on the basis of the
strand or location in which the element is found.
Position-Independent Elements
The first analysis was to identify candidate elements important for cell cycle regulation in a position-independent manner. The statistical procedure is illustrated in Figure 1 with two oligomers, ACGCGT and GATGTA. Details are presented in Methods. The height of the white bar in each graph indicates the number of genes in each of the five cell cycle-regulated data sets. The number of genes whose upstream regions contain one or more copy of the sequence element is shown in purple. Pink squares indicate the number of upstream regions in that data set that would be expected to contain the sequence element, if the element were distributed evenly among the upstream regions of all of the cell cycle-regulated genes. Figure 1A illustrates that there are more ACGCGT elements in late G1 than are expected, and fewer in G2 and M. This observation is confirmed by the score for each data set (light blue), which is close to 0 in early G1 and S, in which the observed number of elements is similar to the expected number, and higher in late G1, G2, and M, in which the observed and expected numbers differ. Conversely, the observed number of GATGTA elements in each of the five data sets is similar to the expected number, so the score for each data set is close to 0 (Fig. 1B). Therefore, this hexamer is not over- or under-represented in any of the five phases.
|
In Table 1 we present a list of predicted
position-independent regulatory elements. The P value
associated with each element provides an estimate of the probability
with which a score of that magnitude occurs by chance alone. Although
we list all elements with P
0.2, (i.e., those whose
score has a
20% chance of occurring by chance), we are most
confident in those with P
0.05. Twelve elements that
could function on either strand, assayed by the presence of a hexamer
and/or its reverse complement on one strand, have
P
0.2 (Table 1A). The distribution of an additional
10 hexamers on one strand also has P
0.2 (Table 1B).
However, only one of these hexamers, CCCTTT, represents a new element; the other nine were also identified in the search for hexamers and/or
their reverse complements.
|
The statistical significance is made on the basis of the sum of the scores across all five data sets, and provides a measure of whether the hexamer is randomly distributed among the five sets of upstream regions. However, it does not identify in which data set the hexamer is over- or under-represented. For each sequence in Table 1, we present the number of upstream regions in each phase in which it occurs, as well as the contribution that the data set makes to the final score. We have also included the number of upstream regions of non-cell cycle-regulated genes in which each oligomer occurs. This number is supplied for reference purposes only; it was not included in any calculations. We classify a hexamer as being over-represented in a particular phase on the basis of the data set that provides the highest contribution to the final score (bold in Table 1). For example, the element defined by the hexamer CCCCGC and its reverse complement GCGGGG, has a total score of 31.37 (Table 1). Its over-representation in early G1 contributes 25.60 units to the score. Some scores are also inflated by under-representation. For example, the sequence ACGCGT has a total score of 60.44. Late G1, in which it is over-represented, contributes 33.34 units to the score, whereas G2 and M, in which it is under-represented, contribute 13.74 and 11.82 units to the score, respectively. The distribution of some hexamers suggests that they may be important in more than one phase. For example, the M and early G1 upstream regions contribute equally to the final score of the hexamer CCCTTT.
This method provides us with nine oligomers in which we have high
confidence (P
0.05), one in early G1, seven
from late G1, and one from G2. With lower
confidence (P
0.2) we present an additional four
oligomers, one in early G1, two in S, and one in both early
G1 and M. In late G1, the highest scoring hexamer is the well-characterized palindrome ACGCGT, also known as the MluI cell cycle box (MCB), which is involved in the
transcriptional regulation of genes at the late G1/S boundary
(McIntosh 1993
). It is notable that all of the hexamers that are
over-represented in late G1 contain some variation of the
sequence ACGCGT. Thus, the seven late G1 hexamers listed in
Table 1 may not be independent, but rather, variants or parts of the
same regulatory site.
Position-Dependent Elements I: Elements Over-Represented in Defined Intervals
Preliminary analysis of the top scoring hexamer from the previous
analysis, ACGCGT, the MCB element, showed that most copies of this
sequence reside within the interval 100-200 nucleotides upstream of
the ATG start codon. Thus, our next step was to calculate whether any
other oligomers are over-represented in a particular interval within
one of the data sets. For statistical reasons discussed in Methods, we
were unable to search for hexamers in this analysis; we instead turned
to pentamers. The method is similar to that which we used to identify
position-independent elements, except that in this experiment, we
identified sequences that are over-represented in a single
99-nucleotide interval within a data set, not in the data set as a
whole. Figure 2 illustrates the method, with the
pentamer ACGCG as an example. The number of genes in each data set is
indicated by the height of the white bar. Purple shading denotes the
number of genes in each data set whose upstream region contains one or
more copy of the sequence element in the 99-nucleotide interval. Pink
squares indicate the number of upstream regions in that interval that
would be expected to contain the sequence element, if the element were
distributed evenly among all intervals in all data sets. The pentamer
ACGCG occurs more frequently than expected in the interval
104 to
202 of late G1, and, to a lesser extent, in the interval
104 to
202 of S. Thus, its score (light blue) in those
intervals is higher than it is in the other intervals.
|
Table 2 is a list of all statistically significant
position-depenent pentamers. With P
0.05, we find
five pentamers and/or their reverse complements over-represented in
late G1 in the interval
104 to
202 from the ATG start
codon, one in M in the interval
203 to
301, and one in
G2 from
401 to
499 and late G1 from
5
to
103 (Table 2A). We also find 13 individual pentamers with a
P
0.05 (Table 2B); 8 of these pentamers represent one
or both strands of the elements listed in Table 2A, whereas one is a
new oligomer for early G1, two for late G1, one for
S, and one for multiple sets. If we raise the threshold to
P
0.2, we add in four additional pentamers in early
G1 and late G1, three in S, and one in many data
sets. Eight of the pentamers identified by this method are contained
within hexamers identified above and may not represent novel regulatory
sites. Once again, the highest scoring pentamers are contained within
the MCB element (ACGCGT) and variations of that sequence (Table 2A).
|
Position-Dependent Elements II: Clustered Within a Set of Upstream Regions
The experiment described above identifies oligomers that are
over-represented in one or more 99-nucleotide intervals within the
upstream regions of one or more of the five sets of cell
cycle-regulated genes. However, it does not identify those elements
that exhibit a preference for a specific location within one set of
upstream regions. The details of this procedure are presented in
Methods, and the basic concept is illustrated in Figure
3. The hexamers ACGCGT and CGACGC are both
over-represented in late G1 (Table 1). Their positioning
along all late G1 upstream regions is shown in the figure.
The light blue lines indicate the number of times that the element is
found at each position upstream of the ATG start codon across all late
G1 genes. We calculated the normalized, cumulative number of
occurrences of each oligomer along the late G1 upstream
regions (light pink histogram), and compared it with the cumulative
number of occurrences of a hypothetical, completely uniformly
positioned sequence (dark pink line). The hexamer ACGCGT is observed
less frequently than the hypothetical sequence from position
~
600 to
125, whereas the hexamer CGACGC behaves similarly to
the hypothetical sequence. The statistic (not illustrated) confirms
that ACGCGT is clustered within late G1 upstream regions, whereas CGACGC is not. We observe empirically that ACGCGT is clustered between ~
100 and
200 from the ATG start codon, whereas the
element CGACGC is fairly evenly positioned (light blue lines).
|
We repeated this test to determine whether any of the other hexamers and pentamers identified above (Tables 1 and 2) are clustered in a particular location within the group of upstream regions in which they were shown to be over-represented. Tables 3 and 4 show the hexamers and pentamers, respectively, whose positioning within a set of upstream regions is statistically significant. Because of space constraints, we do not list the number of elements present at each position in the 600-nucleotide upstream region. Rather, for illustrative purposes, we have tabulated the total number of elements found in 50-nucleotide intervals in a particular data set. The total number of elements found in this method is different from the that presented in Tables 1 and 2; in Tables 1 and 2, the number of upstream regions containing one or more copy of an oligomer are counted, whereas in Tables 3 and 4, each occurrence of an element is counted separately (see Methods).
|
|
We find that a total of six hex- amers and/or their reverse
complements are clustered among a set of upstream regions (Table 3A).
An additional five individual hexamers are also clustered, but these do
not represent new binding sites (Table 3B). All clustered hexamers are
from late G1, and all show a preference for the interval
101 to
200 from the ATG start codon. Seven pentamers and/or
their reverse complements are clustered, as are twelve individual
pentamers (Table 4). Four of the late G1 pentamers also tend
to be located between
101 and
200. Pentamers assessed in other
data sets do not show such an overwhelming preference for one location.
| |
DISCUSSION |
|---|
|
|
|---|
Using high-density oligonucleotide arrays, Cho et al. (1998)
identified 420 transcripts, of the 6220 ORFs in the S. cerevisiae genome, whose expression is cell cycle regulated. They
further divided these transcripts into those whose expression peaks
during the early G1, late G1, S, G2, or M
phase of the cell cycle. We have developed a statistical method to scan
upstream regions for short oligomers that may be involved in the
regulation of coexpressed genes. These elements can be position- and
orientation-independent or dependent. We have applied this technique to
the transcripts identified by Cho et al. (1998)
and have produced a
list of elements, some new, some identified previously, which are
likely responsible for the coordinate regulation of genes during the
yeast cell cycle. Although the oligomers themselves may not represent
the complete protein-binding sites, they will serve as excellent
starting points for future studies of longer, more complex, regulatory sites.
Table 5 presents a composite list
of the elements that may be important for cell cycle-dependent
transcriptional regulation in yeast. The statistically significant
elements with P
0.05 are marked with an asterisk;
unmarked sequences have P
0.2. To make it easier to
identify the potential regulatory sites, we have grouped together pairs
of sequences that are reverse complements. We list the pentamers and
hexamers separately, even though many may be part of the same binding
site, because some pentamers are contained within two or more different
hexamers. A total of nine hexamers have P
0.05, and
another four have P
0.2. Twelve pentamers have
P
0.05; six of these elements are contained within a
hexamer. Another 12 pentamers have P
0.2.
|
The hexamers listed in Table 5 have a statistically significant distribution independent of their position within an upstream region. Thus, these elements could act at any location within the upstream region. However, we have found that six of them are clustered in a particular region within the set of upstream regions in which they are over-represented (Table 3). This positioning may be important for their function. The pentamers, which were identified as being over-represented in a particular 99-nucleotide upstream interval, are likely to be part of binding sites that function in a position-dependent manner. Furthermore, within a group of coregulated upstream regions, some of the pentamers are clustered in a specific region (Table 4). It is not surprising that only a subset of the position-dependent pentamers is also clustered. The method identifies position-dependent pentamers by counting the number of upstream regions that contain one or more copies of the element; the clustering portion of the method counts all occurrences of the element.
One result of our study is that many of the sequence elements likely function in an orientation-independent manner, that is, they could act when placed on either DNA strand. The over-representation of some orientation-independent elements is due to the number of upstream regions that contain either the oligomer or its reverse complement on one strand, such as GAGTCA/TGACTC in G2 (Table 1A). In other orientation-independent elements, the sequence and its reverse complement are both individually over-represented, such as the pair AACGCG and CGCGTT in late G1 (Table 1B). In contrast, elements in which only one of the two elements is statistically significant, such as CCCTT in early G1, may function in an orientation-dependent fashion, with the sequence CCCTT being on the coding strand (Table 1B).
An important validation of our method is the fact that one of the
oligomers on our list has been identified previously as being involved
in cell cycle-dependent gene regulation. The transcription factor
complex MBF, a heterodimeric complex composed of Mbp1p and Swi6p, binds
to the consensus sequence ACGCGT, the MCB (for review, see McIntosh
1993
; Koch and Nasmyth 1994
). It has been shown that this sequence is
present upstream of many genes involved in DNA synthesis and repair
that are regulated at the late G1/early S transition (for
review, see McIntosh 1993
). This palindromic sequence is the highest
scoring hexamer by our methods, and our analysis supports the idea that
it functions during late G1 (Table 1). The element occurs in
49% of late G1-upstream regions, and, in 65% of these
upstream regions, it is located between
101 and
200 from the
start codon. It is also present in 22% of the upstream regions from S
phase. Furthermore, 29% of the late G1 upstream regions that
contain the MCB element contain two or more copies of it (not shown).
This result correlates with the experimental observations that
increasing the number of MCB sequences in a gene upstream region leads
to a higher rate of transcription of the gene (McIntosh 1993
).
It is striking that the highest scoring elements are all variants of
the MCB sequence, ACGCGT, are all over-represented in late
G1, and, to some extent, in S, and most are located between 100 and 200 nucleotides upstream of the ATG start codon. Table 5
documents the amount of similarity that the elements share with MCB.
All seven of the late G1 hexamers with a
P
0.05 are identical to MCB at four of six positions,
and five are identical at five of six positions. Of the seven late
G1 pentamers with P
0.05, four overlap with
MCB at four of five positions. Furthermore, all of the MCB-like
pentamers were identified because they are over-represented in the
interval between
202 and
104 (Table 2), and all of the MCB-like
pentamers and hexamers are clustered in the same region (Tables 3 and
4). MCB and the related sequences are thus different from other yeast
upstream-activating sequences, as most elements function at variable
distances from the mRNA start site (Struhl 1995
; Kunzler et al. 1996
). The
positioning of the MCB-like sequences is likely important for their function.
Some elements, such as AACGCG and CGCGTC clearly
overlap with MCB, but the sequence ACGCGA may represent a novel binding
site. Although it is tempting to create a longer MCB-like consensus sequence by combining overlapping sequences, this procedure is not
sound. The heptamers AACGCGA/TCGCGTT, AACGCGT/ACGCGTT, and ACGCGAA/TTCGCGT can be formed by combining two hexamers listed in Table
1. The over-representation of these heptamers in late G1
upstream regions is statistically significant (not shown). However, the
distribution of the heptamers, ACGCGTA/TACGCGT, GACGCGA/TCGCGTC, and
TACGCGA/TCGCGTA, also created by merging hexamers from Table 1, is not
statistically significant. Thus, it is likely that the sequence context
of the pentamers and hexamers plays a role in protein binding. Two
sequence elements that activate transcription at the G1/S
boundary are SCB (CACGAAA), which is recognized by the Swi4p-Swi6p
complex, and MCB, recognized by a Mbp1p-Swi6p complex. The yeast CLN1
upstream region contains three MCB-like sequence elements, ACGCGT,
TCGCGA, and CCGCGT. Surprisingly, these elements are bound by
Swi4p-Swi6p, not the typical MCB-binding complex of Mbp1p-Swi6p
(Partridge et al. 1997
). Thus, the MCB-like elements may be part of
longer sequences that interact with Mbp1p-Swi6p, Swi4p-Swi6p, or even
a novel combination of factors. We plan to analyze the sequences
flanking the MCB and MCB-like elements in the future to better classify
particular subclasses of binding sites.
One of the major cell cycle control checkpoints in yeast, called Start,
occurs during late G1 phase (for review, see Lew et al.
1997
). Thus, it is interesting that we find a larger collection of
statistically significant sequences upstream of late G1 genes than upstream of other cell cycle-regulated genes. Two-thirds of the
elements that may be important for cell cycle control are over-represented in late G1, and all but two of the late
G1 elements share sequence similarity, and are
over-represented or clustered in the region 100 to 200 nucleotides
upstream of the ATG start codon. This finding leads us to speculate
that control of transcription during late G1, around the time
of Start, is especially important to the cell. However, transcription
during other phases may be controlled by a wide range of activities,
including the regulation of mRNA stability, which do not necessarily
require upstream sequence elements.
Some of the pentamers and hexamers that we have identified are also
contained within sequences shown previously to be important for
transcription in yeast. As a reference for transcriptionally active
yeast sequences, we used the 312 yeast protein-binding sites listed in
the TRANSFAC database, version 3.5 (Heinemeyer et al. 1999
). Of the 21 hexamers and pentamers with P
0.05 (Table 5), 17 are
contained within one or more TRANSFAC-binding sites. These 119 TRANSFAC
entries represent the binding sites for 23 binding factors (Table
6). Two of these binding factor entries, T01094 and
T01120, are for MCB, the late G1 transcription factor described above. Another entry, T00096, is for the binding factor SBF,
a late G1-specific transcription factor (Koch and Nasmyth 1994
). However, this factor binds at least two similar sites, and the
one that contains the pentamers and hexamers is not the one cited as
the consensus in the literature. A third factor, T00798, is TBP, the
TATA-binding protein (Hahn et al. 1989
). The pentamers that overlap
with the TBP-binding site, ATAAA, AATAA, and ACAAA are over-represented
in the interval from
5 to
103 (Table 2), the region in which
TBP is active (Struhl 1995
; Kunzler et al. 1996
). These pentamers
could be a part either of the TBP-binding site, or of a different site.
|
This method clearly identifies sequence elements that are
over-represented in one of the five sets of cell cycle-regulated upstream regions. However, elements that are evenly distributed among
the five data sets will not be highlighted. The SCB element, which has
the consensus sequence CACGAAA, was identified previously as a
regulatory sequence located upstream of genes transcribed in late
G1 and early S. SCB is bound by the SBF transcription factor,
a complex of Swi4p and Swi6p (for review, see Koch and Nasmyth 1994
).
Because the two hexamers that compose SCB, CACGAA and ACGAAA, are
evenly distributed among all cell cycle-regulated upstream regions,
neither is statistically significant by our analysis. A pentamer,
CGAAA, that is part of SCB is over-represented in late G1 in
the interval
104 to
202, but with P
0.15.
However, CACGAAA may be part of a longer sequence that serves as the
SBF-binding site. This longer, yet unidentified sequence may be
over-represented in late G1 upstream regions. Some of the
late G1 oligomers that we have identified in this study may
also be contained within this longer binding sequence.
A second analysis of cell cycle-regulated yeast genes was recently
performed by Spellman et al. (1998)
. These researchers used different
technology and methods from those described in Cho et al. (1998)
, and
identified ~800 genes that are cell cycle regulated. Although this
set of 800 genes includes only 304 of the 420 identified by Cho et al.
(1998)
, the cell cycle periodicity described by Cho et al. (1998)
has
been confirmed in separate clustering experiments (Tamayo et al. 1999
).
Spellman et al. (1998)
searched for potential regulatory elements among
groups of coregulated genes by a Gibbs sampling strategy, a different
statistical procedure from that used here. They describe ~15 motifs,
some known, some novel, that are found upstream of clusters of
coregulated genes. They also identified the MCB site, ACGCGT, upstream
of a set of genes whose expression peaks in G1. In the same
group of genes, their strategy identified SCB, another late
G1 element. One of our late G1 pentamers, GCGAA, is
part of a motif they identified in the histone cluster. And the
pentamer AACAA that we identified in S phase is part of a regulatory
site they predict for their MCM cluster. Other sites do not overlap
with those we identified. The discrepancies between our results
and those of Spellman et al. (1998)
are not surprising. Most
important, because of the techniques used to analyze the gene
expression data, the lists of coregulated genes are not the same.
Furthermore, it appears that many of the motifs they describe are
extended consensus sequences for previously identified regulatory
binding sites, whereas our analysis is performed with no pre-existing
knowledge of the sites. We are currently applying our statistical
methods to the coregulated genes described in Spellman et al. (1998)
,
and plan a detailed comparison with the results presented here.
The sequence elements that we have identified are responsible for much
of the cell cycle regulation described by Cho et al. (1998)
. When the
elements are analyzed independently, it is clear that none is solely
responsible for transcription at a specific time during the cell cycle.
The highest scoring element, ACGCGT (MCB), is present in 49% of late
G1 upstream regions; the hexamers with
P
0.05 over-represented in early G1 and
G2 are found in only 22% and 27% of their respective
upstream regions. These frequencies are in agreement with those found
by Chu et al. (1998)
, who looked for sequence elements controlling
sporulation in yeast. However, previous work on cell cycle regulation
in yeast suggests that more than one sequence element may be
responsible for transcription at the same phase (e.g., SCB and MCB both
regulate late G1 mRNA expression). Thus, we counted the
number of upstream regions that had either one of the statistically
significant hexamers (P
0.05), at any position, or
one of the significant pentamers, in the interval specified in Table 2.
The results are as follows: 35% of early G1 upstream regions
contain one or more of the predicted elements, as do 98% of late
G1 upstream regions, 45% of S upstream regions, 27% of
G2 upstream regions, and 25% of M upstream regions. Thus,
our work predicts that a variety of elements can be responsible for
transcription at each phase of the cell cycle. However, other mechanisms, such as differential mRNA stability, may also play a role.
We anticipate that the oligomers described here will be used by
researchers studying yeast cell cycle gene regulation as a basis for
further investigation.
| |
METHODS |
|---|
|
|
|---|
Creating a Database of Cell Cycle-Regulated Yeast Upstream Regions
We have created a database of all yeast upstream region sequences
(T.G. Wolfsberg and D. Landsman, unpubl.) after extracting information
present in YPD (Hodges et al. 1999
) and MIPS (Mewes et al. 1999
) as of
February, 1997. This database contains the sequence 600 nucleotides
upstream of each ATG start codon. We identified all ORFs that were
annotated in both yeast databases. We then extracted the sequence 600 nucleotides upstream of each of these ORFs from the MIPS version of the
yeast genome sequence. All upstream region sequences are oriented in
the direction in which transcription occurs, with the start of the
sequence at
600 and the end of the sequence at
1 with respect
to the ATG start codon.
Cho et al. (1998)
have carried out a comprehensive analysis of cell
cycle-regulated transcription in yeast. We used the results of their
work to create five data sets that contain the upstream region
sequences of genes whose transcription is induced in one phase of the
cell cycle. The data sets are defined as follows: early G1;
63 upstream regions; late G1; 134 upstream regions; S; 74 upstream regions; G2; 56 upstream regions; M; 56 upstream regions.
Assessing Nonrandom Representation of Short Sequences Within the Cell Cycle-Regulated Yeast Upstream Regions
Our search for oligomers responsible for cell cycle-dependent regulation was made on the basis of the following hypothesis. If a short sequence is responsible for gene regulation in a specific phase of the cell cycle, it should be over-represented among the corresponding upstream regions, relative to other cell cycle-regulated upstream regions. Moreover, any short subsequences contained in it should also be over-represented. As a hypothetical example, if the sequence element AGGCATTC were responsible for gene regulation specifically in late G1, it should be over-represented in late G1 upstream regions, relative to upstream regions specific to the other cell cycle phases. Moreover, AGGCATTC is composed of three shorter hexamers, AGGCAT, GGCATT, and GCATTC, they also should be similarly over-represented in late G1 upstream regions.
Thus, we were interested in identifying short DNA oligomers that are over-represented in upstream regions specific to a particular phase of the cell cycle, relative to other cell cycle-regulated upstream regions. We therefore wished to apply statistical tests for this over-representation. Most available statistical tests, for example, Gibbs sampling, representation ratios, z-scores, etc., assume a statistical parent population in which random DNA bases are generated by a Markov model, or perhaps independently. The model becomes part of what is being tested statistically, which is undesirable. We therefore developed a statistical test whose parent population represents real DNA, so that in principle, before the experiments were performed, samples from the parent population were possible experimental outcomes.
Our actual data set consisted of 383 cell cycle-regulated upstream
regions (nucleotides
1 to
600), divided into five classes corresponding to early G1 (63 genes), late G1 (134 genes), S (74 genes), G2 (56 genes), and M (56 genes). If
only these numbers are known, before the experimental results classify
these 383-upstream regions, every classification into early
G1 (63 genes), late G1 (134 genes), S (74 genes),
G2 (56 genes), and M (56 genes) is equally likely. The random
classifications are the natural parent population for our problem. To
sample the parent population, we used Monte Carlo random sampling to
create 1000 mock data sets by classifying the 383-upstream regions into
the five classes at random. The 1000 mock data sets were an adequate
sample for the over-representation tests described next.
Statistical Tests for Orientation-Dependent, Position-Independent Over-Representation
If an oligomer within an upstream region can regulate regardless of its specific position but only in a particular orientation, we call it orientation dependent, position independent. We can test all DNA oligomers of a fixed length for orientation-dependent, position-independent over-representation as follows. (For orientation independence, etc., see below.) Because the over-representation is orientation dependent, each oligomer is considered different from its reverse complement from the other DNA strand (e.g., ACGCGA separately from TCGCGT).
First, in the real data set, count the number of upstream regions in
each class that contained the oligomer. (Note: not all upstream regions
contain the oligomer, and multiple appearances in the same upstream
region are counted only once.) To rank oligomers by non random
representation in the five classes, calculate a
2 score
for each oligomer. The
2 score is defined as
|
The
2 score for each oligomer can then be used to rank the
oligomers for over-representation. On one hand, oligomers that are represented
in each class according to the expectation have a small
2 score
(i.e., they are randomly represented). On the other hand, oligomers that are
over-represented in a class should have a large
2 score. Over-representation in one class is logically equivalent to under-representation in
some other classes. Accordingly, the ith term in the
2 score for a particular oligomer indicates whether class
i deviates from the oligomer's expected representation; it does not
distinguish directly whether the oligomer was over-represented or under-represented. Once statistical significance has been established, however,
over-represented classes for an oligomer are easily determined by selecting
+he largest terms with oi > ei in
the
2 score.
Each oligomer has now been assigned a
2 score
corresponding to the real data set, so a statistical significance from
the 1000 mock data sets can then be calculated as follows. In each mock data set, calculate the
2 score for each oligomer as
above. Find the largest
2 score from each of the 1000 mock data sets, and then sort these 1000 numbers. For any particular
P value, find the 1000 × Pth largest
2 score among these 1000 numbers. For the standard
P = 0.05, for example, find the fiftieth largest (to avoid
missing biologically interesting oligomers, we also used
P = 0.2, corresponding to the two-hundredth largest). This
cutoff has a direct interpretation: For any oligomer, a
2 score larger than the cutoff occurs only 5% of the
time in a random data set. The cutoff therefore provides a statistical
test at the P = 0.05 level of significance. This statistical
test automatically accounts for multiple testing, as it must, because
all oligomers of a fixed length have been tested.
Under the random classification of the mock data sets, the
2 score for each oligomer approximately follows a
2 distribution with (D = 4 = 5
1)
degrees of freedom, as long as the expected number in each class
satisfies ei
5. Thus, of two oligomers, if
one is overall more frequent within cell cycle-regulated upstream
regions (i.e., in the aggregate of all five classes), the two
2 scores have the same distribution in the absence of
any systematic over-representation. Thus, in examining
over-representation, the
2 score has a very desirable feature:
It is insensitive to the overall frequency of the corresponding oligomer.
As mentioned above, under the assumption that the expected number in
each class satisfies ei
5, the
2 score for individual oligomers follows a
2 distribution with D = 4 degrees of freedom.
By the Bonferroni inequality, because there are
N = 4l oligomers of length l, each with a
2 score, the maximum
2 score satisfies
P(max
2
x)
N
P(
2
x). The right side
P(
2
x) is a standard
2 tail probability and is available from a function
built into Microsoft Excel. Our results indicate that the simple
approximation P(max
2
x)
N P(
2
x) holds for
probabilities up to ~P = 0.05. Above P = 0.05, however, P(max
2
x) and
N P(
2
x) diverge as
they become larger, making the P values from the
2 statistical test too conservative. Thus, in the absence of
any theoretical reassurance, the stringent estimation of P(max
2
x) does require the mock data sets.
Recall that frequency insensitivity requires every class to have an
expected number ei
5. This restriction
places an upper limit on the choice of the length l of the
oligomers. The oligomers should not be so long that they appear
infrequently (ei < 5) in five classes of 383 upstream regions, each of length 600. On the other hand, they should not be so
short that they appear in almost every upstream region of length 600. Because
600
4l
383 × 600/5, giving
5
l
6. We considered using pentamers (l = 5) instead of hexamers (l = 6), but upstream
regions of length 600 contain too large a fraction of the
45 = 1024 possible pentamers. Thus, theory indicates
orientation-dependent, position-independent over-representation can
be tested with hexamers, and we did so. There are
N = 46 = 4096 hexamers.
The theoretical considerations above were borne out empirically by
testing with pentamers and heptamers, which proved less effective than
hexamers (data not shown). Additionally, it is interesting to note that
many statistical surveys on DNA have been made on the basis of
hexamers. Hutchinson (1996)
used hexamer distributions to predict the
location of upstream region and nonpromoter sequences in vertebrate DNA
sequences. Fujibuchi and Kanehisa (1997)
used a hexamer search as a
basis for discriminating upstream regions of housekeeping genes from
those of tissue-specific ones. Recently, van Helden et al. (1998)
identified potential regulatory oligomers from groups of coregulated
yeast genes using the theory of over-represented hexamers.
Many variants on our test are feasible. For example, we considered
sampling our 1000 mock data sets from a sixth class, a control
consisting of upstream regions that are not cell cycle regulated. We
also considered adding an extra term in the
2 score to
correspond to this sixth class. No feasible variant on the test
described above, however, produced significant differences from the
results presented.
Statistical Tests for Orientation-Independent, Position-Independent Over-Representation
If an oligomer within an upstream region can regulate, regardless of its position and orientation, we call it orientation-independent, position-independent. Because the over-representation is now orientation independent, each oligomer is considered the same as its reverse complement from the other DNA strand (e.g., ACGCGA is now the same as TCGCGT). For reasons similar to those given above, we tested orientation-independent, position-independent over-representation with hexamers (l = 6). Because of palindromes, there are N = 1/2 (46 + 43) = 2080 distinct pairs of hexamers and reverse complements. For a given pair, we now count upstream regions that contain either member of the pair. Otherwise, all statistical calculations were carried out as described above for orientation-dependent, position-independent over-representation.
Statistical Tests for Orientation-Dependent, Position-Dependent Over-Representation
If an oligomer within an upstream region can regulate, but only in a particular position and particular orientation, we call it orientation dependent, position dependent. Because the over representation is now orientation dependent, each oligomer is considered different from its reverse complement from the other DNA strand (e.g., ACGCG is different from CGCGT). For the same theoretical reasons that led to hexamers (l = 6) above, this analysis was done with pentamers (l = 5). There are N = 45 = 1024 pentamers.
To detect orientation-dependent, position-dependent
over-representation, we divided each upstream region into six
nonoverlapping subsequences, numbered with respect to the ATG start
codon:
1 to
103,
104 to
202,
203 to
301,
302 to
400,
401 to
499, and
500 to
598. This
division should increase statistical power when searching for
position-dependent upstream region elements, although it does suffer
from inevitable difficulties with edge effects at the ends of the
subsequences. Conceptually, for the purposes of statistical testing,
each of the 6 subsequences is a set of 99 pentamers. The upstream
region subsequences therefore give 383 × 6 = 2298 different sets
of pentamers, falling into 5 × 6 = 30 different classes with
respect to cell cycle regulation and position.
The statistical computations were carried out as described above for
orientation-dependent, position-independent over-representation, with the following exceptions. To sample the parent population, we
created the 1000 mock data sets by reclassifying the 2298 different sets of pentamers at random into the 30 different classes. The corresponding
2 score had 30 terms and d = 29
degrees of freedom. In the Bonferroni inequality,
N = 45 = 1024.
Statistical Tests for Orientation-Independent, Position-Dependent Over-Representation
If an oligomer within an upstream region can regulate, regardless of its orientation but only in a particular position, we call it orientation independent, position dependent. Because the over-representation is now orientation independent, each oligomer is considered the same as its reverse complement from the other DNA strand (e.g., ACGCG is now the same as CGCGT). For reasons given above, we tested orientation-independent, position-dependent over-representation with pentamers (l = 5). Because pentamers cannot have reverse complement palindromes, there are N = 1/2 (45) = 512 distinct pairs of pentamers and reverse complements. For a given pair, we now count upstream regions that contain either member of the pair.
Otherwise, to detect orientation-independent, position-dependent
over-representation, we proceeded as described above for orientation-dependent, position-dependent over-representation, with
upstream region subsequences representing 383 × 6 = 2298 different sets of pentamers, falling into 5 × 6 = 30 different classes with respect to cell cycle regulation and position. We created
the 1000 mock data sets by reclassifying the 2298 different sets of
pentamers at random into the 30 different classes. The corresponding
2 score had 30 terms and d = 29 degrees of
freedom. In the Bonferroni inequality, N = 45 = 512.
The Kolmogorov-Smirnov Test for Assessing Nonuniform Position
We were also interested in determining and/or confirming the
position dependence of any of the over-represented oligomers identified
above, even if the position dependence was not apparent with the
2 score. The null hypothesis here is that if the oligomer
has no position dependence, it will be uniformly positioned along the upstream region. There is a standard method to assess the uniformity of
oligomer placement. We performed this analysis only if an oligomer had
been statistically significant with P
0.2 under the
previous tests, and we tested its position dependence only in the
upstream regions in which it was over-represented. To perform the
Kolmogorov-Smirnov test, at each position from
1 to
600, we
counted the number of occurrences of the oligomer. We then converted
this to a cumulative score at each position giving the number of counts
at the position or prior to it. The cumulative scores were then
normalized by dividing by the total number of counts. The normalized
cumulative score was then compared with the expected cumulative score
corresponding to a uniform distribution of that oligomer across the
range
1 to
600. Thus, we calculated
|
| |
ACKNOWLEDGMENTS |
|---|
We thank Jo McEntyre and Francis Ouellette for a critical reading of the manuscript.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| |
FOOTNOTES |
|---|
4 These authors contributed equally to this work.
5 Corresponding author.
E-MAIL landsman{at}ncbi.nlm.nih.gov; FAX (301)435-7794.
| |
REFERENCES |
|---|
|
|
|---|