Vol. 11, Issue 4, 566-584, April 2001
LETTER
ABSTRACT
Identifying the complete transcriptional regulatory network for an organism is a major challenge. For each regulatory protein, we want to know all the genes it regulates, that is, its regulon. Examples of known binding sites can be used to estimate the binding specificity of the protein and to predict other binding sites. However, binding site predictions can be unreliable because the considerable variability of binding sites makes the true specificity of the protein difficult to determine. Because regulatory systems tend to be conserved through evolution, we can use comparisons between species to increase the reliability of binding site predictions. In this article, an approach is presented to evaluate the computational predictions of regulatory sites. We combine the prediction of transcription units having orthologous genes with the prediction of transcription factor binding sites based on probabilistic models. We augment the sets of genes in Escherichia coli that are expected to be regulated by two transcription factors, the cAMP receptor protein and the fumarate and nitrate reduction regulatory protein, through a comparison with the Haemophilus influenzae genome. At the same time, we learned more about the regulatory networks of H. influenzae, a species with much less experimental knowledge than E. coli. By studying orthologous genes subject to regulation by the same transcription factor, we also gained understanding of the evolution of entire regulatory systems.
INTRODUCTION
The number of complete microbial genome sequences is increasing at an unprecedented rate. To date, 29 bacterial genomes have been determined, 11 more are in annotation stage, and 83 are in progress. This surge of sequence information provides an enormous amount of data for comparative genomics analysis. During the earlier stage of genomic analysis, most of the effort was devoted to analyses of protein-coding regions because, in the course of evolution, protein-coding sequences change much more slowly than noncoding sequences (Koonin et al. 1997, 1998). These comparative genomics studies have proved highly informative, allowing functional assignments for many putative proteins in poorly studied organisms (Overbeek et al. 1999). One surprising result from these analyses was the lack of long-range conservation of gene order in bacterial genomes, with the exception of species within the same genus (Tatusov et al. 1996; Himmelreich et al. 1997). For species of intermediate phylogenetic distance, such as Escherichia coli and Haemophilus influenzae, many clusters are conserved, but their orders are less conserved (Dandekar et al. 1998). However, a more recent study shows a clear conservation of pairs of orthologs to genes within an operon, as opposed to genes at the boundaries of transcription units (TU) (G. Moreno-Hagelsieb et al. 2001).
Besides knowing individual protein functions, knowledge about the transcriptional regulatory network is an indispensable prerequisite for an adequate understanding of cellular functions. The computational identification of regulatory proteins from a bacterial genome sequence is more reliable given the limited number of transcription factor families and the conservation of the helix-turn-helix motif in bacteria (Perez-Rueda and Collado-Vides 2000). For the set of regulatory proteins, we want to know the entire set of genes whose expression is regulated by each of these regulators, its regulon (Salgado et al. 2000a). The first step in this direction is the identification of transcription factor binding sites, which then can help to predict transcription units regulated by these proteins. Although the problem of regulatory site prediction has been studied for >20 years, it is still far from being solved (Gelfand 1995; Thieffry et al. 1998a). The major reasons for this are small training set size (often <20 sequences) and poor understanding of the biophysics of protein-DNA interaction, making it very difficult to deduce a proper set of rules for pattern recognition algorithms.
Besides identifying regulatory sites, the other key to predicting new members of a regulon is to have a good estimate of transcription units in a given genome. However, even for an organism as extensively studied as E. coli, the set of known TUs is far from complete (Salgado et al. 2000a). Also, predicting TUs is a nontrivial problem that has not been studied extensively. Recently, three groups have published new methods to predict TUs (Yada et al. 1999; Craven et al. 2000; Salgado et al. 2000b). These studies represent a promising first step toward a more accurate prediction of TUs. In this article, transcription units are defined as sets of genes (one or more) that are cotranscribed. Operons are defined as the polycistronic subset (more than one gene) of all transcription units.
We have adopted a combined approach to identifying new members of regulons. We find that high scoring matches to binding patterns for transcription factors are likely to represent real regulatory sites based on the distribution of such sites. The predictions of lower scoring sites are less reliable, so we add evidence from a comparative analysis with other species, based on the premise that regulons tend to be conserved. If we find that orthologous genes in two or more species appear to be controlled by the same factor, that provides added confidence in the prediction of even the lower scoring sites. However, because many prokaryotic genes are transcribed as operons, the transcriptional control regions may be far removed from a particular gene. Therefore, the analysis of TUs is essential to the identification of pairs of orthologous genes belonging to common regulons. The overall approach thus combines the prediction of TUs in each species, the identification of orthologous genes, and the prediction of transcription factor binding sites based on probabilistic models, such as weight matrices.
In this article, we predict new members of the cAMP receptor protein (CRP) and fumarate and nitrate reduction regulatory protein (FNR) regulons in E. coli and H. influenzae. We chose these two genomes because E. coli transcription regulation is by far the best understood among all bacterial species, and H. influenzae is the only complete genome (as of this writing) that is close enough so that many TUs are conserved. The CRP and the FNR are two global transcriptional regulators that occur in many bacteria. Genes regulated by them (CRP and FNR regulons) have a wide range of functions. Our overall strategy is shown in Figure 1. Briefly, binding patterns derived from known E. coli CRP- and FNR-binding sequences are used to predict novel binding sites for these two proteins. Predicted binding sites are combined with our knowledge of orthologous genes and predictions of TUs in both genomes. This combined information is used to predict novel members of CRP and FNR regulons.
Other groups previously have used comparative analyses to predict new sets of regulated genes. McGuire et al. (2000) recently examined 17 completely sequenced microbial genomes to identify regulatory sites for groups of related genes. They used a pattern discovery approach to find putative sites and then used various filtering techniques to diminish the number of false predictions. They used known E. coli regulons as positive controls and showed that the method worked well to identify known sites. They even showed that the method could be applied to archaebacterial species, as in another article by Gelfand et al. (2000). However, they did not use the patterns for E. coli regulatory sites to expand the set of genes likely to be regulated by specific factors, which is the main purpose of this article. Mironov et al. (1999) also used a comparative analysis to predict regulatory sites in other species for a few regulons in E. coli. In addition, they did predict a few new sites in E. coli for the PurR and ArgR regulatory proteins. Our approach in this study was similar, but by incorporating TU prediction and using two well-studied regulons, we were able to predict many more new members of the CRP and FNR regulons.
RESULTS
Conserved Recognition Patterns by CRP and FNR in Both Genomes
The cocrystal structure of an E. coli CRP-DNA complex has been solved at 2.5 Å resolution (Parkinson et al. 1996). The principal specificity-conferring interactions are those between the first two residues of the recognition helix, R180 and E181, and the two G:C pairs in the deduced consensus half-site TGTGA (Ebright et al. 1989; Gunasekera et al. 1992). Residue R185 in the recognition helix also contributes to binding specificity, although to a lesser extent (Fig. 2). We aligned 10 CRP orthologs from various bacterial genomes (Fig. 3). The first two residues of the specificity-conferring motif, RE-R, are 100% conserved across the species. The third residue in the motif is conserved except in CRP_MTB, in which the second arginine is replaced by a lysine. This high level of conservation in the DNA-binding domain implies that a CRP-binding pattern similar to that of E. coli CRP also exists in the bacterial genomes that encode a cognate CRP protein.
Both CRP and FNR belong to the CRP/FNR helix-turn-helix transcription factor superfamily. In E. coli, the two proteins are 23% identical and 36% similar, with the conservation concentrated in the domain containing the HTH motif, in which they have 27% identity and 43% similarity (using BLAST and BestFit). The FNR consensus half-site motif, TTGAT (Spiro et al. 1990), is analogous to the CRP half-site (TGTGA). In fact, a common site that can bind both FNR and CRP has been reported (Jennings and Beacham 1993). In E. coli, the proposed specificity-conferring interactions for FNR are those between E209 and the G-C base pair common to both core motifs and a discriminatory interaction between S212 and the first T-A base pair in the FNR site, which replaces that between R180 and the common G-C base pair in the CRP site. Another conserved interaction involves R213 and the common G-C base pair (Fig. 2). From the multiple alignment of eight FNR orthologs (Fig. 3), we see that the first and third residues of the specificity-conferring motif, E-SR, are absolutely conserved across the species whereas the second residue is highly but not absolutely conserved. Again, this high degree of sequence conservation implies a conserved recognition pattern for FNR binding to its operators.
CRP and FNR Weight Matrices Obtained by Aligning Characterized Binding Sequences in E. coli
Using the program CONSENSUS (Hertz and Stormo 1999), we
aligned the training set sequences to generate weight matrices used by
the program PATSER. Specifically, a mononucleotide matrix was used to represent the binding specificity of a
transcription factor. The assumption in using such a matrix is that
contributions to binding specificity are additive across all positions
of the site. We tested this assumption by using the program MIXY (Gutell et al. 1992), which can identify covariation between any two positions across the binding sites.
No significant covariation was observed between positions for CRP and
FNR. Thus, we believe that a mononucleotide matrix is a valid
representation of the binding specificities of CRP and FNR.
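The additive assumption behind a mononucleotide matrix can be illustrated with a minimal sketch. This is not the CONSENSUS or PATSER implementation: the function names, the uniform background frequencies, the pseudocount, and the toy sites below are all assumptions made for illustration. Each column holds a log-odds value (in bits) per base, and the score of a candidate site is simply the sum of the column values for the bases it contains.

```python
import math

def build_weight_matrix(sites, background=None, pseudocount=1.0):
    """Build a log-odds (bits) mononucleotide matrix from equal-length aligned sites.

    Assumptions for this sketch: uniform background unless one is given, and a
    pseudocount distributed according to the background frequencies.
    """
    background = background or {b: 0.25 for b in "ACGT"}
    length = len(sites[0])
    matrix = []
    for pos in range(length):
        counts = {b: pseudocount * background[b] for b in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append(
            {b: math.log2((counts[b] / total) / background[b]) for b in "ACGT"}
        )
    return matrix

def score_site(matrix, seq):
    """Additive score: sum the per-position log-odds of the observed bases."""
    return sum(column[base] for column, base in zip(matrix, seq))

# Hypothetical toy alignment of half-site-like sequences, not the real training set.
toy_sites = ["TGTGA", "TGTGA", "TGCGA", "TGTGC"]
wm = build_weight_matrix(toy_sites)
```

With this toy matrix, `score_site(wm, "TGTGA")` is positive and `score_site(wm, "ACACT")` is negative, which mirrors the way most genomic windows score well below the training sites.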
CONSENSUS calculates a P value for an ungapped multiple alignment, so different alignments can be compared and the most significant one identified (Hertz and Stormo 1999). For each protein, we compared site lengths ranging from 14 to 28 nucleotides and compared symmetric models with asymmetric ones. Symmetric models are clearly more significant than asymmetric ones, with expectation values at least 10^2 times lower at all even lengths tested. The expectation values for different lengths were not very different over the entire range of lengths tested, consistent with the proteins having a core conserved region of 16 or 14 bp, for CRP and FNR, respectively, surrounded by more weakly conserved sequences. We used 22 nucleotides for the length of each protein's binding site based on previous work (Kolb et al. 1993) and for consistency with previous analyses (Salgado et al. 2000a). The CRP protein has the half-site consensus of TGTGA
with a separation of six nucleotides between the two half-sites. The
FNR protein has a half-site consensus of TTGAT with a separation of
four nucleotides between the half-sites (Fig. 4; Table 1).
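The relation between the half-site consensus, the spacer, and the 16 and 14 bp core lengths can be checked with a short sketch. It assumes a perfectly symmetric site, in which the second half-site is the exact reverse complement of the first; the function names are hypothetical and the spacer is written as N.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]

def symmetric_core(half_site, spacer_len):
    """Half-site, spacer of unspecified bases (N), then the reverse-complement half-site."""
    return half_site + "N" * spacer_len + reverse_complement(half_site)

crp_core = symmetric_core("TGTGA", 6)  # 5 + 6 + 5 = 16 bp
fnr_core = symmetric_core("TTGAT", 4)  # 5 + 4 + 5 = 14 bp
print(crp_core, len(crp_core))
print(fnr_core, len(fnr_core))
```

The resulting core lengths agree with the 16 bp (CRP) and 14 bp (FNR) conserved regions mentioned above; the 22-nucleotide model adds flanking, more weakly conserved positions on both sides.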
Determination of Cutoff Scores for CRP and FNR Sites
To determine the appropriate weight matrix for each transcription factor and the cutoff scores to be used for strong and weak predicted sites, we needed to identify a trusted set of example sites and the score distribution for those sites as well as potential sites in the genome. One of the difficulties arises because transcription factors may bind to DNA cooperatively so that a particular experimentally determined site would not, in fact, be a high-affinity site for the factor in another context without a neighboring site. To eliminate such potential artifacts, we picked only the highest scoring site for each transcription unit (see Methods), assuming that at least one of the sites should be high affinity on its own. This still may result in a few intrinsically low affinity sites in our training set but should minimize that number. We then set thresholds for high scoring sites based on the scores of the training sets and taking into account the distribution of scores in the background (i.e., genomic) sequence.
To determine cutoff scores for CRP sites, we used the following
procedure. First, we determined the range and mean score for the
following two sets of sequences: (1) training sequences; and (2) all 22-mers in the E. coli genome. As shown in Table 2, training sequences scored between 8.77 and 20.62 bits with a mean of 14.4 bits and a standard deviation of 3.6 bits. The mean score and the standard deviation of all 22-mers in the E. coli genome were −15.84 and 8.53 bits, respectively. Such a negative mean score is expected because most of the genomic sequence contains no CRP-binding sites. Next, for each site with a score
between 7 and 23 bits from the whole-genome scan, we determined its
location relative to the TUs downstream from or encompassing it.
Functional regulatory sites usually are located upstream of TUs (in the
regulatory region) although there are a few known cases where the sites
are located within TUs (8 of 361 in RegulonDB). Given this observation,
we can approximate the false-positive rate of our site predictions based on the fraction of predicted sites that are located within transcription units. Figure 5a shows the
fraction of CRP-binding sites located either upstream of or within a TU
in the E. coli genome. At low cutoff scores, almost all sites
are located within transcription units, indicating a high
false-positive rate. The size of all upstream regions is ~1.23 Mb,
~27% of the genome size of E. coli (4.63 Mb). Thus, sites
with random localization occur ~73% of the time within TUs. Raising
the cutoff score decreases the fraction of predicted sites located
within TUs and thus decreases false-positive rates. Using a cutoff of
17 bits, only 6% of all sites are located within transcription units,
indicating a low false-positive rate at this cutoff. Thus, we used 17 as the cutoff score for strong sites. To increase the sensitivity of
our search, we also chose a cutoff score for weak sites. We decided to
use a cutoff at which greater than half of all sites are located
upstream of TUs. As shown in Figure 5a, at a cutoff of 10 bits, 56% of all sites are located upstream of rather than within TUs. Thus, we chose 10 as
the cutoff score for weak sites. Using this weak site cutoff, we only missed
two training sequences (glnALG and rpoS; Table 3).
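The cutoff selection described above can be sketched as follows. This is a minimal illustration, not the procedure used to produce Figure 5: it assumes that the fraction is computed over all sites scoring at or above the cutoff, the function names are hypothetical, and the toy site list is invented. Only the 1.23 Mb and 4.63 Mb sizes come from the text.

```python
def fraction_upstream(sites, cutoff):
    """sites: list of (score_bits, is_upstream). Fraction of sites at or above cutoff that are upstream."""
    kept = [is_upstream for score, is_upstream in sites if score >= cutoff]
    if not kept:
        return None
    return sum(kept) / len(kept)

def lowest_cutoff_reaching(sites, cutoffs, min_fraction):
    """Lowest cutoff at which more than min_fraction of the retained sites lie upstream of TUs."""
    for cutoff in sorted(cutoffs):
        frac = fraction_upstream(sites, cutoff)
        if frac is not None and frac > min_fraction:
            return cutoff
    return None

# Expected upstream fraction for randomly placed sites, from the sizes quoted in the text.
random_upstream = 1.23 / 4.63  # about 0.27

# Hypothetical toy scan results: (score in bits, located upstream of a TU?)
toy_scan = [(7, False), (7.5, False), (8, False), (8.5, False), (9, False), (9.5, False),
            (10, True), (11, False), (12, True), (13, False), (17, True), (18, True)]

weak_cutoff = lowest_cutoff_reaching(toy_scan, range(7, 21), 0.5)
strong_cutoff = lowest_cutoff_reaching(toy_scan, range(7, 21), 0.9)
```

The comparison that matters is between the observed upstream fraction and `random_upstream`: a cutoff is informative only when the upstream fraction rises well above the roughly 27% expected by chance.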
We applied the same criteria described above to determine the cutoff
scores for FNR-binding sites. The training sequences had a score range
of 12 to 25.84 bits and a mean of 19.8 bits with a standard deviation
of 4.5 bits (Table 2). Based on Figure 5b, we chose 20 as the cutoff
for strong sites. At this cutoff, only 9% of all sites are located
within TUs. As for the weak site cutoff, we chose 14 because at this
cutoff greater than half of all sites (57%) are located upstream of
rather than within TUs (Fig. 5b). Using the weak site cutoff, we only
missed one training sequence (dmsA; Table 4).
New Members of the CRP Regulon
The sets of upstream sequences from both genomes were scanned by PATSER by using the CRP weight matrix. Putative sites were filtered using the two cutoffs for CRP sites described above. For each CRP site scoring above 10 bits, we predicted the TU downstream from it. Orthologs (if any) to all genes in a predicted TU were identified. Based on the two cutoffs for CRP-binding sites, we first partitioned our predictions into the following two categories: (1) TUs having at least one strong site (category I); and (2) TUs having only weak site(s) (category II). Because the cutoff for strong sites is 2.6 bits higher than the mean score of the training sequences, and sites above it are likely to include few false-positives (Fig. 5a), we were confident of the category I predictions even without orthology information. Predictions in category II have only weak binding sites and are less reliable than those in category I. However, for some category II predictions, additional evidence exists to support them. The first type of evidence is orthology information. If a category II TU shares orthologous member(s) with a TU from the other genome and the latter also has CRP-binding site(s) (either weak or strong), we put such a TU in category IIA. The second type of evidence is the presence of two or more weak binding sites in the regulatory region of a TU. The probability that two or more sites occur in close proximity by chance is fairly low. We examined all weak CRP sites in the E. coli genome. Of all sites located upstream of TUs, 12% are within 100 nucleotides of another site. Conversely, only 2% of all sites located within TUs are within 100 nucleotides of another site. Thus, closely positioned tandem sites in the regulatory region are more likely to be true binding sites than a single weak site in the regulatory region. We put all category II predictions with two or more sites but without orthology information in category IIB.
The rest of the category II predictions, TUs having only one weak binding site and no orthology information, are labeled category IIC. This category has the least supporting evidence. Thus, we expect a high false-positive rate among category IIC predictions.
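The categorization rules above can be summarized in a small sketch. The function name and arguments are hypothetical; the rule for category IIB follows the text literally (two or more weak sites in the regulatory region, without a separate distance test), and the default cutoffs are the CRP values of 17 and 10 bits.

```python
def categorize_tu(site_scores, strong_cutoff=17.0, weak_cutoff=10.0,
                  ortholog_tu_has_site=False):
    """Assign a predicted TU to category I, IIA, IIB, or IIC.

    site_scores: scores (bits) of candidate sites in the TU's upstream region.
    ortholog_tu_has_site: True if a TU in the other genome shares orthologous
        gene(s) with this TU and also has a weak or strong site.
    Returns None when no site reaches the weak cutoff.
    """
    hits = [s for s in site_scores if s >= weak_cutoff]
    if not hits:
        return None
    if any(s >= strong_cutoff for s in hits):
        return "I"
    if ortholog_tu_has_site:
        return "IIA"
    if len(hits) >= 2:
        return "IIB"
    return "IIC"

print(categorize_tu([17.25, 6.2]))                       # one strong site
print(categorize_tu([12.0], ortholog_tu_has_site=True))  # weak site plus orthology support
print(categorize_tu([11.0, 13.0]))                       # two weak sites, no orthology
print(categorize_tu([11.0]))                             # single weak site only
```

The order of the checks encodes the priority used in the text: a strong site is sufficient on its own, orthology support is considered next, and tandem weak sites are used only when orthology information is absent.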
For clarity, the 46 training set TUs and H. influenzae TUs having orthologs to genes in the training set were put in a separate category. For the 46 E. coli TUs, our predictions of CRP-binding sites largely agreed with the data in RegulonDB except for a few cases in which our method predicted extra binding sites (Table 3). We identified 23 H. influenzae TUs that have orthologs to genes in the training set TUs. Of these 23 TUs, only seven contain CRP-binding sites in their upstream regions (Fig. 6; Table 5).
In category I, we predicted 62 and 49 TUs in E. coli and
H. influenzae, respectively. In category IIA, we predicted 30 and 21 TUs in E. coli and H. influenzae,
respectively. For both categories, predicted CRP sites, their scores,
and locations relative to the transcription start are tabulated in
Table 6 (E. coli) and Table 5 (H. influenzae). Category IIB
contains 25 and 12 TUs in E. coli and H. influenzae,
respectively. This is a total of 117 and 82 new TUs in E. coli
and H. influenzae, respectively, that we are reasonably
confident belong to the CRP regulon. Category IIC contains 319 and 150 TUs in E. coli and H. influenzae, respectively. These
predictions are less reliable but probably contain some true regulated
TUs. Because of space limitation, we are unable to display results in
categories IIB and IIC in this article. These data are available as
supplementary material at http://www.genome.org.
In Figure 7, we depict structures of predicted TUs that share orthologous members. They are from categories I and IIA in both genomes. Strong and weak binding sites are represented by black and gray squares, respectively. Thus, one can identify the category to which a TU belongs by the colors of binding site squares.
New Members of the FNR Regulon
The same procedures (FNR-binding site predictions, predictions of
downstream TUs, and categorization) were performed to identify new
members of the FNR regulon. We have nine training set TUs (Table 4) and
four H. influenzae TUs that have orthologs to genes in the
training set. Among these four H. influenzae TUs, only one
still maintains FNR regulation (Fig. 6; Table 7). The other five E. coli TUs do
not have detectable orthologs in H. influenzae.
Category I contains 10 and 8 TUs in E. coli and H. influenzae, respectively, each with at least one strong site.
Category IIA contains 0 and 2 TUs in E. coli and H. influenzae, respectively. For both categories, predicted FNR sites,
their scores, and distances relative to the transcription start are
tabulated in Table 7 (H. influenzae) and Table 8 (E. coli). We did not find any
TU in category IIB in E. coli. In H. influenzae,
category IIB contains 2 TUs. Thus, this is a total of 10 and 12 new TUs
in E. coli and H. influenzae, respectively, that we
are fairly confident belong to the FNR regulon. In category IIC, we
predicted 70 E. coli and 79 H. influenzae TUs, all
of which have only one weak binding site and no orthology information.
Categories IIB and IIC are available as supplementary material at
http://www.genome.org. Again, the structures of predicted TUs that
share orthologous members are depicted in Figure 7.
DISCUSSION
We have described a method to systematically search for additional members of bacterial regulons based on information both intrinsic and extrinsic to a given genome. The intrinsic information consists of transcription factor binding sites and structures of downstream TUs. The extrinsic information is the orthology relationship between TUs obtained by comparing the respective complete sets of gene products. Our comparative approach consists of the following three major steps: (1) obtaining the DNA recognition pattern for a given regulatory protein; in this study, we used weight matrices to represent binding site patterns; (2) prediction of transcription factor binding sites using the recognition pattern obtained in step one; and (3) prediction of TUs downstream from binding sites from step two and identification of any orthologs to members of the predicted TUs. At low thresholds, transcription factor binding site predictions by any present-day computer algorithm are expected to have a relatively high false-positive rate due to small training set size and poor conservation of noncoding sequences. However, incorporation of orthology information in step three increases the reliability of our inferences. Another reinforcement to the prediction of regulatory sites is the use of information on TUs. The correspondence between predicted TUs and the assignment of putative regulatory sites will help to establish other means to score the predictions and make them more reliable. Admittedly, we do not have a statistical model to evaluate how much the probability of a site increases when the site is present in front of orthologous TUs. However, qualitatively, our confidence does increase in the presence of orthology information. In this way, we are at least confident that predictions in categories IIA and IIB have a lower false-positive rate compared with those in category IIC.
The sensitivity and specificity of our predictions are difficult to
determine because we do not know the complete set of genes that are
regulated by CRP and FNR in either species. In E. coli we have
a set of genes that are known to be regulated by each protein, based on
genetic and biochemical criteria, but that set is certainly incomplete.
For most genes, we simply do not know whether they are regulated by CRP
or FNR, and the primary purpose of this article was to identify new TUs
that are likely to be regulated by these factors. We can estimate
sensitivity and specificity measures by using both the known set of
regulated genes and some assumptions about the distribution of sites
with various scores. For example, we know that functional regulatory
sites usually occur in the region we have defined as the upstream
region, between −400 and +50 bp relative to the start of translation of the
first gene in the TU. Occasionally, although rarely, functional sites
occur either farther upstream of or within the TU. We also assume that if a binding site for a regulatory protein occurs within that upstream
region then it is very likely to be involved in the regulation of the
adjacent TU. We set the threshold based on those assumptions for strong
sites to be such that >90% of the sites occur in the upstream
regions, and therefore we expect very few false-positives among category
I predictions. This gives us confidence in the new predicted category I
CRP-regulated TUs, 62 and 49 in E. coli and H. influenzae, respectively, even without additional evidence. The
category I new predictions for FNR are 10 and 8. However, of the known
E. coli TUs regulated by these proteins, only 9 and 6 have
strong sites, so the sensitivity based on strong site cutoffs alone is
only 16.1% and 46.2% for CRP and FNR, respectively.
The threshold for weak sites was chosen such that >50% occur in the upstream regions. Remember that even these weak sites have much higher scores than the average background site (Table 2), and that most upstream regions do not have them, and any randomly chosen sites would occur only 27% of the time in the upstream regions, based on the sizes of the two sequence sets. Therefore, even such weak sites, category II predictions, are likely to contain many functional sites but undoubtedly contain false-positives as well. Therefore, we look for additional evidence before considering them reliable. One type of additional evidence is if TUs in H. influenzae that contain orthologous genes also appear to be regulated by the factor with either a strong or a weak site, which we call category IIA. Another type is if there are two weak sites near each other, which we call category IIB. We know that CRP can bind cooperatively, so nearby weak sites may have a combined affinity comparable to single strong sites. Furthermore, nearby pairs of weak sites occur infrequently within TUs, but relatively frequently in the upstream regions, as is expected of functional regulatory sites. Combining categories IIA and IIB, we predict 55 and 33 new CRP regulated TUs in E. coli and H. influenzae. An additional 319 and 150 TUs are in category IIC, some of which are probably real and some false. For FNR, we predict 0 and 4 new TUs in E. coli and H. influenzae from categories IIA and IIB, and there are an additional 70 and 79 category IIC TUs.
We can estimate the sensitivity of our approach by scoring the known TUs for each regulon. For the 56 TUs in the CRP regulon that we extracted from RegulonDB, only nine of them score in category I. An additional 41 have weak sites and therefore are put in category II, resulting in a combined sensitivity of 89.3% (50/56). However, among the 41 TUs with only weak sites, only 16 are in categories IIA and IIB (six and 10, respectively), with the remaining 25 in category IIC. Therefore, our confident predictions, combining categories I, IIA, and IIB, account for only 25 known TUs, a sensitivity of only 44.6%. If the same proportions exist in the whole genome, then many of the category IIC sites will be functional CRP regulatory sites; however, we cannot determine which are true and which are false from the current data. Similar results are obtained for FNR, in which categories I and II together account for 11 of the 13 known TUs, for a sensitivity of 84.6%, but five of those are in category IIC.
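The sensitivity figures quoted in this section follow directly from the counts of known TUs. The short sketch below recomputes them; the variable and function names are hypothetical, and the counts are those stated in the text (56 known CRP TUs and 13 known FNR TUs).

```python
def sensitivity(found, known):
    """Percentage of known TUs recovered by a given set of predictions."""
    return 100.0 * found / known

CRP_KNOWN, FNR_KNOWN = 56, 13

crp_strong_only = sensitivity(9, CRP_KNOWN)            # category I alone
fnr_strong_only = sensitivity(6, FNR_KNOWN)            # category I alone
crp_any_site = sensitivity(9 + 41, CRP_KNOWN)          # categories I and II
crp_confident = sensitivity(9 + 6 + 10, CRP_KNOWN)     # categories I, IIA, and IIB
fnr_any_site = sensitivity(11, FNR_KNOWN)              # categories I and II

for label, value in [("CRP strong only", crp_strong_only),
                     ("FNR strong only", fnr_strong_only),
                     ("CRP any site", crp_any_site),
                     ("CRP confident", crp_confident),
                     ("FNR any site", fnr_any_site)]:
    print(f"{label}: {value:.1f}%")
```

The gap between the "any site" and "confident" values for CRP (50 versus 25 of 56 TUs) corresponds to the 25 known TUs that fall into category IIC.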
The net result of our analysis is the prediction of 117 and 10 new CRP and FNR TUs in E. coli that we consider highly reliable because they fall into categories I, IIA, and IIB. These are clearly not all of the genes regulated by these factors because some of the known TUs are missing from such predictions. Functional sites may be missed because these factors bind cooperatively with some other factor that is not included in the analysis, or because the weight matrix is not a good enough descriptor of the proteins' binding specificity to capture all of the functional sites. Many of the missing sites can be found in category IIC predictions, but those predictions probably also contain many false predictions, and we do not include them in our reliable set. Nonetheless, the computational approach we have applied in this article has greatly increased the set of TUs likely to be regulated by these factors in E. coli, with high but not perfect sensitivity. In addition, we make 82 and 12 reliable predictions of TUs regulated by CRP and FNR in H. influenzae, most of which had not been previously identified as members of those regulons.
Interestingly, one experimentally verified CRP site exists in RegulonDB for the E. coli operon glpABC. However, this is a weak site (6.2 bits). In this study, we detected another strong CRP site for this operon (17.25 bits; Table 6). Another interesting case is the E. coli operon fucAO. Before this study, only genetic evidence existed to support the regulation of this operon by CRP. In our study, we identified three CRP-binding sites upstream of fucA (Fig. 7; Table 6), providing further evidence for previous observations.
In E. coli, the gene ansB is under the dual regulation of CRP and FNR (Scott et al. 1995). This joint regulation by both transcription factors might be important in achieving optimal gene expression. Based on our analysis, ansB may be regulated only by CRP in H. influenzae because the highest scoring FNR site in the regulatory region of H. influenzae ansB scored only 9.74 bits. This is much lower than the weak site cutoff for FNR, although it might still be a functional site. Two other TUs, ung and yfiD (its ortholog in H. influenzae is HI0017), seem to be dually regulated in both genomes. Interestingly, for both TUs, the CRP site is the same as the FNR site in both genomes, with the E. coli TUs having an additional FNR site. It is possible that those sites are true only for one of the regulators and are false-positives for the other regulator. Conversely, we cannot rule out the possibility that those sites are truly recognized by both regulators, because some sites that can bind CRP also can bind FNR (Sawers et al. 1997).
Negative autoregulation is quite dominant in E. coli, and it
can be viewed as playing a homeostatic role for the regulatory genes
(Thieffry et al. 1998b). Based on our results, CRP seems not to be
autoregulated in H. influenzae (the highest scoring CRP site
had a score of 4.6 bits). Conversely, FNR does seem to be
autoregulated in H. influenzae.
Based on our comparative analysis of the CRP and FNR regulons in the two genomes, we noticed three types of structural changes in operons that are subject to the same mode of regulation. The first type involves insertion or deletion of individual genes in otherwise conserved operons. Examples in E. coli include the operons glpTQ (glpT in H. influenzae, Fig. 6), fnr (fnr-HI1426 in H. influenzae, Fig. 6), and b2736-b2737 (HI1010-HI1011-HI1012-HI1013 in H. influenzae, Fig. 7).
The second type of change involves breakup of an operon in one genome
into several smaller ones in the other genome. Not all of the smaller
operons retain their regulation by the same regulator. For instance,
the E. coli xylFGHR operon is broken in H. influenzae into two operons, xylFGH and xylR (Fig. 7). Only xylFGH maintains CRP regulation in H. influenzae. The protein products of genes xylF, G, and H constitute the high-affinity xylose transport system in
both genomes, and the product of xylR is a regulatory protein (Sumiya et al. 1995). In E. coli, xylR acts as a transcriptional activator for the xylFGHR operon, and its own expression is regulated by CRP (Song and Park 1997). In H. influenzae, the regulation of xylR might be taken over by a different regulator. Alternatively, it could be autoregulated. If this is the case, it is
another example of uncoupled versus coupled transcriptional regulation
in two bacteria, an organization with different dynamic consequences
(Hlavacek and Savageau 1996). Another example of this second type of
change involves the E. coli galETKM operon. The same operon is
broken up into two pieces in H. influenzae: galE and galTKM
(Fig. 6). Again, only galTKM is still regulated by CRP in H. influenzae.
The third, and most common, type of change during regulon evolution is
the loss of E. coli regulon members in the H. influenzae genome. Examples include operons caiTABCDE, malEFG, and narGHJI. Tatusov et al. (1996)
suggested that the common ancestor of E. coli and H. influenzae could have had a genome of intermediate
size. The subsequent evolution may have proceeded in opposite
directions: toward the reduction of the genome size by deletion of
genes and entire transcription units in the Haemophilus
lineage, and toward the diversification of regulatory and transport
functions via gene duplication in the E. coli lineage (Tatusov
et al. 1996). Thus, the decrease in CRP and FNR regulon members
may be the result of degenerative evolution of H. influenzae.
The parasitic lifestyle of H. influenzae might require a less
complicated metabolism to cope with environmental changes. However, as a
fraction of the total number of genes, both species appear to have
similar sized regulons.
The location of regulatory sites along the genome has a clear influence
on how regulation through these sites occurs (Gralla and Collado-Vides
1996). An interesting question to ask is whether regulatory sites of
orthologous genes have identical or close positions, that is, whether
the distance between regulatory sites and their regulated promoters
remains more or less unchanged between bacterial species. To obtain such
information, we would need to have a reasonably accurate method to
predict promoters in those organisms. Unfortunately, current promoter
prediction methods are not satisfactory in this regard. Future work is
needed to address this very interesting question.
We noticed that some of our predicted TUs have quite distal binding site(s). Because we report the position of a binding site relative to the translation start of the first downstream gene, these large distances could simply result from the existence of a long 5' untranslated region. Conversely, they could be true distal sites even if our measurement were based on transcription start. Because of the global nature of regulatory functions, CRP and FNR regulated TUs often have another local, dedicated regulator, such as LacI for the lac operon and GalR for the gal operon. Thus, we suspect TUs predicted here with distal sites will show regulation by additional proteins.
The approach we have used in this article has identified many new genes that we predict are regulated by the CRP and FNR proteins in E. coli and H. influenzae. Combined evidence from site scores and comparative analyses gives us high confidence in many of these predictions. But this is clearly just a first step. More bacterial species can be included and many more regulons can be studied, although regulons with few known members are more problematic because of the small sample size. The accurate prediction of transcription units is critical to the success of such an approach, as operons are often rearranged in evolution and common regulatory sites may be located at long and variable distances from orthologous pairs of genes. In this work, many steps were performed manually, in that careful examination of some results was used to constrain further analyses. Experience gained from this work will allow us to develop more fully automated procedures that can be applied to more regulatory systems in more species in a rapid and reliable manner.
METHODS
Sequence Data and Programs
Experimentally characterized (mostly by DNA footprinting technique) E. coli CRP- and FNR-binding sequences were extracted from the RegulonDB (Salgado et al. 2000a) database. Complete genome sequences of E. coli and H. influenzae were downloaded from GenBank (Benson et al. 1999). Weight matrices were constructed by CONSENSUS (Hertz and Stormo 1999), which generates optimal ungapped multiple sequence alignments with predefined width. In addition, the program reports the statistical significance of the generated multiple sequence alignment. Given a weight matrix, searches for transcription factor binding sites were performed using PATSER (Hertz et al. 1990). PATSER scores each possible binding site position in a sequence by using the designated weight matrix and returns the scores and positions of all sites above a user-defined threshold. Multiple alignments of protein sequences were constructed using the program CLUSTALX (Thompson et al. 1997). Protein sequence database searches were performed using the gapped BLASTP program (Altschul et al. 1997). All searches were performed against the National Center for Biotechnology Information nonredundant protein sequence database. Sequence comparisons between E. coli CRP and FNR were performed using the BestFit program (Wisconsin Package Version 10.0; Genetics Computer Group). Sequence logos were constructed using the web interface (S.E. Brenner, http://www.bio.cam.ac.uk/cgi-bin/seqlogo/logo.cgi) to the MAKELOGO program by Schneider (Schneider and Stephens 1990). The rest of the analysis was performed by using ad hoc PERL scripts (Wall et al. 1996).
Preparation of Training Set Sequences
The current version of the RegulonDB database (version 3.0) contains 80 experimentally verified E. coli CRP-binding sequences from 56 TUs (because some of these 56 TUs have multiple CRP-binding sites, the number of sites exceeds the number of TUs). We expect that some of these 80 sites are weak CRP-binding sites. Presumably, CRP binds these weak sites through cooperativity with other regulatory proteins. Weak sites were filtered out of our training set by the following procedure. In step one, we ran CONSENSUS on the 80 binding sequences and generated an initial weight matrix. PATSER was then used to score the original 80 binding sequences by using the weight matrix generated in step one. After this initial step, we chose only the highest-scoring sequence from each TU for further processing. This gave us 48 sites representing 53 TUs (all sites from three of the 56 TUs were rejected by CONSENSUS and thus not included). Because of the existence of divergent TUs, the number of sites is less than the number of TUs. The mean and standard deviation of the scores of these 48 sites were 13.1 and 3.6 bits, respectively. For our final training set, we excluded, from the 48 sites, any sites with scores more than one standard deviation below the mean, that is, below 9.5 bits. We ended up with 42 sequences in the training set, representing 46 TUs.
The current version of the RegulonDB database contains 17 experimentally verified E. coli FNR-binding sequences from 13 TUs. We applied the same procedures to these 17 sequences to generate the training set. We ended up with nine sequences in our training set, representing nine TUs. The mean and standard deviation of the scores of these nine sequences were 19.8 and 4.5 bits, respectively.
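The filtering procedure can be sketched as follows. With the values reported for CRP, the cutoff is 13.1 − 3.6 = 9.5 bits. The sketch makes several assumptions that the text does not specify: sites shared by divergent TUs are recognized by identical sequence, the sample standard deviation is used, and the function name `filter_training_sites` and its input layout are hypothetical.

```python
from statistics import mean, stdev

def filter_training_sites(scored_sites, n_sd=1.0):
    """Filter weak sites from a candidate training set.

    scored_sites: list of (tu_name, site_sequence, score_in_bits).
    Returns (kept_sites, cutoff), where kept_sites maps sequence -> score.
    """
    # Keep only the highest-scoring site for each TU.
    best_per_tu = {}
    for tu, seq, score in scored_sites:
        if tu not in best_per_tu or score > best_per_tu[tu][1]:
            best_per_tu[tu] = (seq, score)
    # Assumption: a site shared by divergent TUs appears as the same sequence,
    # so identical sequences are collapsed into a single site.
    unique = {}
    for seq, score in best_per_tu.values():
        unique[seq] = max(score, unique.get(seq, score))
    scores = list(unique.values())
    # Exclude sites scoring more than n_sd standard deviations below the mean.
    cutoff = mean(scores) - n_sd * stdev(scores)
    kept = {seq: s for seq, s in unique.items() if s >= cutoff}
    return kept, cutoff
```

In practice, sites would be better identified by genome coordinates than by sequence, because two distinct sites can share the same sequence.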
Prediction of Transcription Factor Binding Sites
During the first step of our analysis, weight matrices for both CRP and FNR binding sites were generated by CONSENSUS by using our training set sequences (42 for CRP and nine for FNR). Subsequently, the published annotations of all the open reading frames (ORFs) in E. coli (Blattner et al. 1997) and H. influenzae (Fleischmann et al. 1995) were used to generate two sets of putative regulatory sequences (one for each genome), covering 400 nt upstream of and 50 nt downstream from the beginning of each ORF. This length was chosen from the known distribution of a large collection of regulatory sites in σ70 promoters (Gralla and Collado-Vides 1996). Then, PATSER was used to scan the sets of regulatory sequences to identify potential binding sites by using the weight matrices generated in step one (Hertz and Stormo 1999; Hertz et al. 1990). Potential binding sites that scored above the chosen cutoffs were reported. Finally, binding site information was combined with orthology relationships between TUs to predict new members of the CRP and FNR regulons. We classified binding sites into two categories based on their locations relative to the TUs downstream from or encompassing them: (1) sites located in the regulatory region of a TU; and (2) sites located within a TU. The latter category includes two cases: within genes of a TU and within the upstream region of an internal gene.
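Extracting the putative regulatory sequence for each ORF (400 nt upstream and 50 nt downstream of its start) might look like the sketch below. It assumes 0-based coordinates, treats the chromosome as linear (windows are truncated at the sequence ends rather than wrapped around), and the function names are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]

def regulatory_region(genome, orf_start, strand, upstream=400, downstream=50):
    """Return the window from -upstream to +downstream around an ORF start.

    genome: chromosome sequence as a string (treated as linear here).
    orf_start: 0-based plus-strand index of the first base of the start codon.
    strand: '+' or '-'.
    The result is always oriented 5' to 3' on the coding strand of the ORF.
    """
    if strand == "+":
        lo = max(0, orf_start - upstream)
        hi = min(len(genome), orf_start + downstream)
        return genome[lo:hi]
    # On the minus strand, "upstream" lies at higher plus-strand coordinates.
    lo = max(0, orf_start - downstream + 1)
    hi = min(len(genome), orf_start + upstream + 1)
    return reverse_complement(genome[lo:hi])
```

Each returned window can then be scanned with a weight matrix, and the position of any hit can be reported relative to the translation start, as is done for the binding sites in this article.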
Determination of Orthology between E. coli and H. influenzae Genes
Fitch first introduced the term ortholog for genes derived from speciation events (Fitch 1970). At present, there is no simple and perfect method for detecting orthology relationships because of complicating events during genome evolution, such as gene duplication, gene loss, and horizontal gene transfer (Huynen and Bork 1998). For our study, we used the minimal definition of orthology described by Huynen and Bork (1998): (1) orthologous ORFs between the two genomes compared must be the most similar ORFs reciprocally; (2) sequence similarity between the ORFs has to be statistically significant (in this article, sequence similarity was calculated by the BLASTP program [version 2.0; Altschul et al. 1997], and any alignment with an E-value of 1e-15 or lower was considered significant for our purpose); and (3) sequence similarity extends to at least 60% of one of the genes.
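The three criteria can be expressed as a reciprocal-best-hit filter, sketched below. The input is assumed to be a list of already parsed BLASTP hits; the dictionary keys, the function names, and the use of bit score to rank hits are assumptions, not details given in the text. The sketch also applies the significance and coverage filters before choosing each best hit, which is one possible reading of the definition.

```python
def best_hits(hits, max_evalue=1e-15, min_coverage=0.6):
    """Return {query: subject} for the best qualifying hit of each query.

    hits: iterable of dicts with keys query, subject, evalue, bitscore,
    aln_len, query_len, subject_len (a hypothetical parsed-BLAST layout).
    """
    best = {}
    for h in hits:
        if h["evalue"] > max_evalue:
            continue  # criterion 2: statistical significance
        # Criterion 3: the alignment covers at least 60% of one of the genes.
        coverage = max(h["aln_len"] / h["query_len"],
                       h["aln_len"] / h["subject_len"])
        if coverage < min_coverage:
            continue
        current = best.get(h["query"])
        if current is None or h["bitscore"] > current["bitscore"]:
            best[h["query"]] = h
    return {q: h["subject"] for q, h in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a, **kwargs):
    """Criterion 1: keep pairs that are each other's best hit."""
    forward = best_hits(a_vs_b, **kwargs)
    backward = best_hits(b_vs_a, **kwargs)
    return {(a, b) for a, b in forward.items() if backward.get(b) == a}
```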
Prediction of Transcription Units
The prediction of TUs was described for E. coli by Salgado et al. (2000b). The method is based on the differences between pairs of adjacent genes in operons and pairs of adjacent genes at the borders of TUs. The differences studied were distances between genes and their functional relationships, the latter being an update of the functional classification described by Monica Riley (Riley 1993; Riley and Labedan 1996). Here, to apply the method to H. influenzae, we inherited the functional classification for E. coli genes and then applied the prediction method to the whole H. influenzae genome, dividing it into putative TUs. In this way, we obtained sets of TUs that can be compared between organisms when a regulatory site is found close to orthologous genes that may in turn lie inside analogous TUs.
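As a rough illustration of dividing a genome into putative TUs, the sketch below joins adjacent genes on the same strand when the intergenic distance is small. This is a deliberate simplification and not the method of Salgado et al. (2000b), which also weighs functional relationships between genes. The 50-bp threshold, the 1-based inclusive coordinates, and the function name `predict_tus` are assumptions chosen only for the example.

```python
def predict_tus(genes, max_gap=50):
    """Group genes into putative transcription units by distance and strand.

    genes: list of (name, start, end, strand) sorted by start, using
    1-based inclusive coordinates with start < end.
    Adjacent genes are joined into one TU when they lie on the same strand
    and the intergenic gap is at most max_gap bases.
    """
    tus = []
    for gene in genes:
        name, start, end, strand = gene
        if tus:
            prev = tus[-1][-1]
            same_strand = prev[3] == strand
            gap = start - prev[2] - 1  # negative when genes overlap
            if same_strand and gap <= max_gap:
                tus[-1].append(gene)
                continue
        tus.append([gene])
    return [[g[0] for g in tu] for tu in tus]
```

For example, four genes named g1 to g4, with g1 and g2 separated by 9 bp on the same strand, g3 located 299 bp further downstream, and g4 on the opposite strand, would be split into three TUs: g1-g2, g3, and g4.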
ACKNOWLEDGMENTS
We thank members of the Stormo and Collado-Vides labs for insightful discussions. We thank three anonymous reviewers for their comments. This work was supported by Grant HG-00249 from National Institutes of Health (G.D.S.), Grant 0028 from Conacyt (J.C.-V.), and Grant DE-FG02-98ER62558 from U.S. Department of Energy (J.C.-V.).
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
3 Corresponding author.
E-MAIL stormo{at}ural.wustl.edu; FAX (314) 362-7855.
Article and publication are at www.genome.org/cgi/doi/10.1101/gr.149301.
REFERENCES