|
|
|
Published online before print
January 13, 2003, 10.1101/gr.207502
Vol. 12, Issue 2, 298-308, February 2002
LETTER
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |
ABSTRACT |
|---|
|
|
|---|
The comparison of homologous noncoding DNA for organisms a suitable evolutionary distance apart is a powerful tool for the identification of cis regulatory elements for transcription and translation and for the study of how they assemble into functional modules. We have fit the three parameters of an affine global probabilistic alignment algorithm to establish the background mutation rate of noncoding seqeunce between E. coli and a series of gamma proteobacteria ranging from Salmonella to Vibrio. The lower bound we find to the neutral mutation rate is sufficiently high, even for Salmonella, that most of the conservation of noncoding sequence is indicative of selective pressures rather than of insufficient time to evolve. We then use a local version of the alignment algorithm combined with our inferred background mutation rate to assign a significance to the degree of local sequence conservation between orthologous genes, and thereby deduce a probability profile for the upstream regulatory region of all E. coli protein-coding genes. We recover 75%-85% (depending on significance level) of all regulatory sites from a standard compilation for E. coli, and 66%-85% of sigma sites.
We also trace the evolution of known regulatory sites
and the groups associated with a given transcription factor.
Furthermore, we find that approximately one-third of paralogous gene
pairs in E. coli have a significant degree of correlation in
their regulatory sequence. Finally, we demonstrate an inverse
correlation between the rate of evolution of transcription factors and
the number of genes they regulate. Our predictions are available at
http://www.physics.rockefeller.edu/~siggia.
[Online
supplemental material available at http://www.genome.org.]
| |
INTRODUCTION |
|---|
|
|
|---|
Functional genomics has made great progress in the prediction of protein coding regions using Markov models whose hidden states encode the components of a gene (promoter, exon, intron, splice sites) and whose parameters are fit to known instances of these states. Annotating the regions of the genome that control transcription and translation has proved more refractory. The binding site for a single protein is much smaller than a typical exon and regulatory proteins work in modules, but we know nothing about the syntax governing the assembly of functional modules, there is no counterpart to cDNA libraries to tell us which bits of sequence belong to a common module, and there is no analogue to the extensive libraries of known proteins to compare against.
Regulatory sequences occur in multiple copies in a single genome, which
is the basis for their detection computationally. Strategies range from
the prediction of a single weight matrix motif for a cluster of genes
(Stormo and Hartzell 1989
; Lawrence et al. 1993
; Bailey and Elkan 1994
)
to string counts with probabilities assigned with reference to genes
not in the cluster (van Helden et al. 1998
; Brazma et al. 1998
), and
finally fits to more elaborate models for all regulatory regions in the
genome and the simultaneous determination of many putative motifs at
once (Bussemaker et al. 2000
). DNA microarray experiments have been a
boon to studies of gene regulation because they provide complete sets
of covarying genes. However, they also quantify how much more remains
to be understood. In yeast, most copies (e.g., 75%) of the
canonical control elements for cell cycle and sporulation occur in the
upstream regions of nonresponding genes (Bussemaker et al. 2001
). Hence there are other sequence signals to be found, but probabilistic methods
on a single genome encounter the fundamental problem that there is
never a single sharp secondary motif that delimits the active from
inactive class, but many marginally significant ones.
The availability of genomic sequence for related species compensates
for the greater plasticity of regulatory sequence modules (compared to
proteins) and makes interspecies comparison a powerful technique for
their identification. There have been studies of the globin locus
across many species (Stojonovic et al. 1998
), comparisons of several
Drosophila species (Blanchette et al. 2000
), and many mouse
comparisons (Hardison et al. 1997
; Loots et al. 2000
; Wasserman et al.
2000
). For prokaryotes, a broad collection of fully sequenced genomes
was examined by McGuire et al. (2000)
, and more limited comparisons
were made by Gelfand et al. (2000)
. A recent study by McCue et al.
(2001)
uses many of the same organisms we do, but a complementary
algorithm. Comparisons with prior work are reserved for the Discussion herein.
In this paper we address the intertwined questions of how rapidly do
gene control regions evolve and what are the most informative species
pairs to study for the elucidation of cis regulatory regions. We work
primarily at the module or locus level and only as a second step
discern individual protein binding sites. We thus impose no
preconceptions about what aspects of the module are most important (as
measured by conservation) to the regulatory net of the cell. Escherichia coli is the most useful species to examine at the moment since it is the best studied prokaryote and has the densest set
of related genomes in various stages of sequencing. Although in the
future, sequencing projects may be undertaken largely to ascertain
regulatory sites, the regulatory system of E. coli is not
simple; there are seven sigma factors and several other regulatory proteins (e.g., crp and lrp) that control many operons, plus several factors with widespread activity that facilitate contacts between other
factors (e.g., fis, ihf, hns) (Lin and Lynch 1995
). There are also cis
elements regulating translation which are much more extensive that the
core Shine Dalgarno sequence (Lin and Lynch 1995
). All are fused
together in the several hundred bases upstream from translation start
(Gralla and Collado-Vides 1996
). In public databases are approximately
800 protein binding sites (including sigma sites) regulating ~400
genes or about 10% of all operons (Robison et al. 1998
; Salgado et
al. 2000b
).
Our alignment algorithm, which is essential to comprehension of our results, is described in the Methods section.
| |
METHODS |
|---|
|
|
|---|
Alignment Algorithm
Sequence alignments (e.g., BLAST) are generally done with predetermined penalties for mutations and gaps, and assign a probability to the best score thus obtained based on a null model of completely uncorrelated sequence pairs. This naturally responds to the question of whether the query sequence, run against the database, has a better score than chance alone would suggest. Frequently a scoring scheme adapted to the evolutionary distance of the match one is exploring will enhance the significance of the relevant matches compared to others, but in all cases significance is assessed relative to the probability that two random sequences (with the residue frequencies of the database) would score as well as the putative functional match.
Our task is more difficult. Since we are comparing organisms which are
manifestly related, we first have to fit an evolutionary model to
determine as best possible the neutral or basal evolution rate (Thorne
et al. 1991
). With respect to this correlated model, we must then
ascertain whether certain regions of sequence are more similar than
expected, and thereby score them as functional. In practice, it is
impossible to know what if any regions of the genome are not subject to
fitness constraints, and for bacteria, lateral gene transfer is so
common that one may question whether the most recent common ancestor is
a well defined concept. (Since we examine relatively close organisms
and look at the entire genome, lateral gene transfer should not
contaminate most of our results.) Thus, operationally we compute the
basal evolution rate as the most rapid evolution we can find for a
substantial block of manifestly homologous sequence, pooled from
multiple regions of the genome. Further details are given in the
Results section.
We rather conventionally describe sequence evolution in terms of three
processes, a single base mutation, a gap opening, and a gap extension.
More precisely,
|
(1) |
|
(2) |
sets the rate of base substitutions (b
b'), and µ,
are the gap opening and extension
parameters. We use these parameters within a probabilistic alignment
algorithm based on that of Yu and Hwa (2001)Our local version of probabilistic alignment is complicated by the fact
that we want to assign significance relative to a neutral model
described by the three parameters above. Since we are looking for
protein binding sites, we assume zero gap parameters and a new
substitution parameter
subject to
<
which we
adjust to optimize significance. (Our scoring formula is given by Eq. A.6.) With
= 0, we would score positively only regions with no mutations but then assign them higher significance than for any
> 0. If
approaches the background level, then all regions would be assigned marginal significance. Thus for each homologous upstream region in each pair of organisms, we scanned over
to maximize the significance of the highest scoring region. On average,
was smaller when comparing E. coli with Salmonella than E. coli with Vibrio, but there was considerable variability among genes.
The local alignment then yields diagonal segments of high significance in the rectangle defined by running the two sequences along the x and y axes. To obtain a conventional graph, we take the maximum significance calculated for each E. coli base ( and any base of the comparison species) and plot it as a function of upstream position in E. coli. Note that the blocks of high significance in this graph need not occur in the same order in the other species as in E. coli, and it should be checked that the blocks indeed correspond to contiguous bases in the other sequence.
Statistical Tests
Based on the size of the typical upstream region (300-500 bp) that we scan over with our local algorithm, a marginally significant log odds score in our units is ~6, i.e., ln(500). A single score can be deduced from a collection of pairwise alignments by either of two methods which make opposite assumptions about the sequences being compared with E. coli; the truth is somewhere in between. Either take an envelope of all log odds profiles (if the comparison sequences are maximally correlated) or take a sum if they are completely uncorrelated. In the latter case, to filter noise we only consider those bases in each pairwise comparison where the log odds score is over 9, otherwise it is omitted from the sum.
In order to extract a series of disjoint high significance intervals from the log odds graph to compare with footprinted factor binding sites, we used an edge detection heuristic defined by computing the derivative with respect to position and thresholding. What fell between successive bands of positive and negative derivatives subject to some plausible length limitations was the prediction.
For measuring the similarity between a set of predicted sequence
intervals and the experimental data base, we define a hit (following McCue et al. 2001
) as any overlap. A single prediction can
hit multiple sites if they are nearby or nested, and the score hs for a given E. coli control region is
just the total number of hits. The score expected by chance is computed
by fixing the predictions and randomizing the positions of the
experimental sites individually, subject only to the constraint that
the distribution of positions for each site matches that for all sites
in the database. The average
h
and variance
(h
h
)2
in the number
of hits are computed separately for each site, and summed over all
sites in the regulatory region to give
hs
and
s =
(hs
hs
)2
, respectively. The
significance is then parameterized by
|
(3) |
Given a set of N aligned sequences with
n
,
|
(4) |
Extracting Homologous Regions From Genomic Data
The key to our method is the selection of pairs of organisms which
give the most informative comparisons. Figure
1 gives the rRNA phylogeny of the species
we have examined. We do not require a fully assembled sequence, merely
large enough contigs to give a good protein match plus ~500 bp
upstream of AUG, which is where almost all control elements are found
(Gralla and Collado-Vides 1996
).
|
We used the program tfastx (Pearson 1999
) and a spectrum
of scoring matrices to match each of the ~4200 orfs in E. coli (represented as a protein sequence) against all other genomic sequences including E. coli itself (to detect paralogous
genes). Since we are examining very similar species we can insist on a stringent match criterion of a probability score <10
25
(10
5 would be marginal in these units), and at least
40% identity as defined by the program. We assembled all valid hits
into disjoint closed intervals on the target genome, which frequently
began with the first amino acid of the query protein. When a given
protein had several distinct hits, we ordered them by probability
score, and then by percent identity (probability scores often
underflowed to 0). We restricted attention to upstream control regions
that did not overlap any annotated coding region on either strand, were
at least 50 bp in length and were cutoff at 500 bp. The minimum length
restricts us to approximately one promoter per operon (Salgado et al.
2000a
). Table 1 gives the statistics of our
matches, and Table 2 a breakdown of the
number of conserved gene pairs between organisms. The subclass of
noncoding regions between two conserved pairs will be useful in what follows.
|
|
As a database of known E. coli protein binding sites, we used the
compilation of Robison et al. (1998)
and a related set from McCue et al.
(2001)
.
| |
RESULTS |
|---|
|
|
|---|
Prediction of Functional Regulatory Sites
For the compact genomes of prokaryotes, it is by no means obvious what regions of the genome are not subject to selective pressure and thus suitable for estimating a mutation rate against which to measure the degree of conservation of the promoters. The most propitious regions to examine are those between conserved gene pairs, because the intervening sequence should all be homologous under our assumptions and we can use fixed boundary conditions on both ends to do the global alignment. In Table 2 we show that the ratio of divergent gene pairs (common upstream region, e.g., 5' 5' pairs) to convergent ones (sharing a noncoding terminal region, e.g., 3'3' pairs), increases significantly as one moves from Salmonella to Vibrio. This finding confirms, without any potential biases as to where experiments looked, the general observation that most cis regulatory elements in bacteria are upstream of the gene, not downstream. We examined the set of convergently transcribed (i.e., 3' 3') gene pairs; a small number of these were dropped which had significant amounts of conservation either because of recent lateral gene transfer, or some functional secondary structure that still retains some primary sequence homology. This is legitimate because we are looking for nonfunctional, neutrally evolving sequence. The homology of the regions being compared is guaranteed by the good match between the bracketing gene pairs. From the remaining set, upwards of a kilobase of sequence from several such pairs was fit with a single set of parameters. These fits were stable for other selections of sequence.
As seen in Table 3 part b, only Salmonella
retains some degree of correlation in minimally functional regions; the
other pairs of species are random (i.e., the optimal fit corresponds to
a point mutation rate of chance). These fits are conservative, since
they are a lower bound to the neutral mutation rate and thus the
statistical significance of any feature we find in the E. coli
control region will be higher than we report, given our model. For
comparison, the same fits were done for gene pairs with a common 5'
control region and show much higher conservation.
|
Our evolution model fits were then used to assign a significance to
matches between ungapped regions for all orthologous upstream regions.
Examples are given in Figures 2 and 3. As noted in the Methods section,
the
parameter can be adjusted to optimize significance. A small
gives sharp delineation but poor overall structure since it only
selects perfectly conserved blocks. The significance generally rises as
increase from ~0 and then declines unless the entire upstream
region merges into one block and our probability calculation ceases to
be valid. For the more distant species, ype and vch, a suitable
automatic way of adjusting
is to maximize the root mean square
fluctuation in the log probability profile. For stm and kpn however, we
imposed in addition an upper bound of 0.004 and 0.006 respectively on
, which aids in the delineation of sites.
In Figure 2, we contrast
= 0 with
what we consider the optimal
for each organism. Notice that the
sum delineates the overlapping annotated sites better than does the
envelope. The maximum for all four organisms falls on top of the
annotated sites in Figure 2b, whereas with
= 0 the maximum for
ype is around i = 25. The second most prominent structure
in Figure 2 does not get any contribution from stm, and broadens when
is fit. The third structure around i = 300 only rises
above the cutoff of 9 in Figure 2b. The genome-wide statistics of the
intervals we flag as significant, such as those shown in Figure 2b, are
discussed below.
|
Figure 3 shows the intergenic region between a pair of divergently transcribed genes which were conserved for all four species. There is only one documented site and it appears as a `hat' on top of the largest block which reflects its presence in vch, which is contributing nothing elsewhere to the summed profile. In the next most significant block around 125, ype doe not follow kpn and stm as is true elsewhere. Perhaps the offset lobe of signal for ype is moderating the left gene rather than the right one.
|
Our complete set of predictions are available on the Web. It remains
true genome-wide that when properly discounted by evolutionary distance, Salmonella (an organism so close to E. coli that
recombination between the genomes is possible; Rayssiguier et al. 1989
)
is both informative yet does not dominate the comparisons.
Though it is not our primary purpose to predict individual
transcription factor binding sites, it is obviously important to show
that the known sites fall within our conserved regions and to put a
significance value on our predictions (e.g., if we claim that most of
the upstream region is conserved, as sometimes occurs when it is short,
our significance is low). To compare with McCue et al. (2001)
, who used
a multiple alignment tool which assumes a null model of mutually
uncorrelated data segments, we took the sum of all our pairwise
alignments over a significance threshold of 9 (cf. Figs. 2, 3) position
by position along the E. coli reference sequence. We then made
two categories of predictions, both genome-wide: a single best
prediction for each gene, and all significant segments. Two data sets
of known regulatory sites were used, those upstream of the 184 `test
set' genes of McCue et al. (2001)
, Table 4
(with and without the sigma sites from Robison et al. 1998
), and simply all sites from Robison et al. (1998)
, Table
5. Any comparisons with the results of
McCue et al. (2001)
are approximate because we focused on different
aspects of the signal; for example, they excluded sigma sites by
fitting to a palindromic model, whereas we do not distinguish them.
Further comparison is deferred until the Discussion. Gene by gene the
statistical significance is low, because as is evident from Figures 2
and 3, a good portion of the upstream region seems functional; we
predict 3141 sites (with an average length of 32 bases) for 2127 upstream regions. In addition, our estimate for the number of sites hit
by chance was high, because we randomized each experimental site
independently even if several of them overlapped.
|
|
Analysis of the Upstream Regions of Paralogous Pairs
Our interspecies comparisons can be trivially extended to paralogous
pairs of genes within E. coli. There are 169 unique gene pairs
which satisfy the stringent cutoffs defined for the last line of Table
1. We ran our local alignment procedure on each pair and assumed a
completely random background evolution model. The parameter
was
optimized with an upper cutoff of 0.1. The maximum log odds score
exceeded 9 for 52 of these pairs which are listed on our website along
with their GENBANK annotation, and alignments. Of the 52, ten are
transposon-related ORFs (using Riley's functional classification
5.1.4; see http://genprotec.mbl.edu:80/start) and are grouped
separately since their upstream conserved regions are presumably not
conventional transcription factor binding sites. There was no
correlation between the maximum score for the upstream region and the
percent identity between the paralogous proteins.
Correlation between Gene Function and Conservation of Upstream Region
We have investigated how a quantitative measure of the conservation of orthologous upstream regions, the maximum log odds score, correlates with the functional class of the respective genes. We restricted the maximum to a region of 300 nucleotides upstream of the translation start of the gene in order to avoid spurious signal from divergently transcribed genes. It is already interesting to simply rank all 2127 genes for which we have orthologs by this score (details on our web site). Perhaps not very surprisingly, all the ribosomal genes get very high scores; all 20 are ranked above 365, and five (rpmB, rplJ, rpsT, rpsF, rplM) are among the top 30 genes. Interestingly, there are seven genes (ORFS) with no known function (Riley category 0.0.0 and "hypothetical protein") among the top 30 (yhbC, yafB, ybeB, yeaA, yaeO, yeaQ, yhdG).
Restricting attention to the 768 genes with orthologues in all four
species, we made a histogram of the maximum log odds score (summed over
the four species) for the 239 genes with no known function and compared
with the corresponding histogram for the functionally annotated genes
(Fig. 4). The latter group gave better scores, and the probability that the histograms came from a common distribution was less than 0.07 as defined by a
2 test.
We examined other functional groupings from Riley's classification (e.g., metabolism or DNA replication/repair) but could find no other correlations as strong as that for the ribosomal genes.
|
Evolution of Known Regulatory Sites
For each site annotated to control a particular gene in E. coli, we mapped it into the upstream region of homologous genes by two methods with different biases. The first scheme simply looked for the minimal number of mutations and in the one orientation of site relative to gene defined by E. coli. We accepted a match in the target species only when the probability of a chance match was small, and the match was unique (e.g., when there are two copies of the same site upstream, each must have a unique match). Under this mapping, the number of sites identified in the target species decreased with increasing evolutionary distance, and the weight matrix computed from all the mapped sites for a given factor was less specific than that in E. coli. (Comparing with ype, the average score of the defining sites against the weight matrix, decreased by 2× for crp and rpoD (sigma 70), was unchanged for lexA, and other factors fell in between.)
An alternative mapping of sites assumes that the regulatory network is preserved, that is, homologous genes must be regulated by the same factors though copy number can change. Therefore, the factor's weight matrix in E. coli was used to find the best match in the homologous upstream region. Thus virtually all sites are matched, and we find the specificity of the weight matrix defined by the mapped sites to be generally the same as in E. coli. This mapping of course contains a bias towards good sites in the target species, and to control for this we randomized the upstream regions and remapped. Now the specificity of the weight matrix defined by the mapped sites decreases by 1/2-1/3, except for less specific factors such as cytR, fis, flhCD, gcvA, hipB, his, lrp, rhoD, and rpoS. On this basis, we can say that the regulatory network is approximately preserved between the organisms we examine. Mapping by weight matrix is similar to what the transcription factor `sees' (assuming the DNA binding domain has not changed) if we can take the weight matrix score as a surrogate for the binding energy.
Under either mapping, the ratio of transversions to transitions was around 1:1 when comparing E. coli and Salmonella and 2:1 for most factors in more distant pairs of species. The 2:1 ratio is expected when all base changes are equally likely. For factors with many known sites, we have adequate statistics to show that the number of mutations per site negatively correlates with the information score or specificity of the site. By this measure, the pattern of change between species is similar to that intraspecies.
In the aggregate, we expect that homologous genes are regulated in
comparable ways within the species we are examining. Thus most
mutations between homologous upstream regions should be neutral. To
project this information onto a plausible subspace to analyze quantitatively, we asked whether mutations in a factor binding site
tend to compensate so as to preserve the weight matrix score (again
taken as a surrogate for the binding energy). We scored each site
against the weight matrix for the respective species, ignored the least
significant bases, and considered only pairs of homologous sites for
which there were two or more mutations (so that compensation is a
possibility). Let P be the set of bases within the site (numbered from
left to right) which change, and m


m


m
For the sigma factors, it is known that sites downstream of the binding
site can significantly affect the rate of transcription (McClure et al.
1983
). We found however that the mutation rate in the 16 sites
downstream of the rpoD17 binding site was nearly as high as in the
middle (i.e., the nonconserved region) of the binding site itself.
Similarly, no meaningful reduction in the transition or transversion
rates upstream of the binding site was observed (Estrem et al. 1998
).
Evolution of Transcription Factors
Clearly the evolution of binding sites is correlated with the
proteins that bind there, so in cases where an E. coli
factor-DNA cocrystal was available (Table
6), we analyzed the conservation of the set
of residues R defined in the structure paper to be in contact
with the DNA. For each species, the protein sequence for the E. coli factor was aligned to the orthologous protein using
tfastx (see the Methods section). Since our proteins are
very well conserved and there were virtually no gaps in the regions of
interest, more elaborate protein family alignments were not necessary.
Even for vch, in half of the factors all residues in R are
conserved. In fact for Crp, 96% of all residues are conserved between E. coli and Vibrio (vs. an average of 0.66 ± 0.15 for these species). However contrary to our expectation, the number of
mutations found among the residues from R was consistent with the percent identity (PID) computed for the entire protein alignment. This observation allows us to use all the known factors and ask whether
the evolution rate of the factor as defined by its PID relative to one,
1-PID, correlates with the number of sites that it regulates. The
number of regulated sites we approximate (Robison et al. 1998
) by the
weight matrix score of the experimental sites relative to background
normalized by the combined variance,
, viz,
|
(5) |
|
|
Of course Figure 5 is subject to whatever biases are implicit in the set of factors and binding sites available from experiment. The situation is not hopeless in that we are only looking for correlation with the specificity of the binding site, and the data provide a decent range of examples. We do not care if experiment has captured all of the specific sites and few of the sloppy ones, but only that there is no bias in the degree of factor preservation as a function of site specificity. We have insulated ourselves from how many sites have been collected for each factor, by using an information-based measure of specificity, x, and not just the number of sites in the database (however, sites with few copies will have larger error bars).
| |
DISCUSSION |
|---|
|
|
|---|
The multiple alignment of correlated data has been an active area of
research for many years (Durbin et al. 1998
), and our algorithm, which
uses only information from pairs, would seem a step backwards. Its
utility in our context is firstly the ability to fit all three mutation
parameters for each species pair; these numbers are intrinsically
interesting and make it feasible to put all sequence pairs on a common
scale. The closest species to E. coli, Salmonella, now does
not give the highest score; on average Klebsiella does. For
the closely related bacteria we study, the neutral evolution rate is
high, and the extensive upstream conservation we see is a consequence
of functional constraints. Our profile plots do not impose assumptions
about the width of the conserved blocks, which frequently are much
broader than a single factor binding site. The evolution of regions
from organism to organism is also apparent on the same plot.
On our website, http://www.physics.rockefeller.edu/~siggia,
we have all our pairwise profiles plotted against the E. coli
upstream region for each gene having one or more homologs. Superimposed are the predictions of McCue et al. (2001)
, the experimental data collected in McGuire et al. (2000)
(plus all matches to the known weight matrices), and our own predictions. The data are sorted in
various ways to facilitate reference. The primary data supporting our
other conclusions are also included. (Information also available online
as Supplementary Table 1 at www.genome.org.
Clearly multiple sequence alignment can in principle detect more subtle signals than pairwise methods. However, when the multiple alignment is restricted to a length smaller than the pairwise conserved blocks and there is a dense enough set of comparison genomes, the situation is more ambiguous. When we use the envelope of the local alignment score to compare multiple species to the same E. coli gene some information may be lost, but our significance underestimates the true one. Using the sum of the scores seems to pick out the strongest sites, since it emphasizes sites with a copy in all species.
We have also experimented with the CLUSTALW program
(Higgins et al. 1996
; Durbin et al. 1998
), which tries to build a
phylogeny and align using only pairwise data. It has difficulty in
selecting the limits of the region to match. We have found many
instances within our sets of pairwise alignments where the strongest
single motif is only present for a subset of the species. The program
is then forced to choose or compromise between a strong motif and a
weaker one present in more species.
For our algorithm, Klebsiella furnished the most informative
comparison, and the more distant species Yersinia and Vibrio typically
did not add much; in contrast, McGuire et al. (2000)
used nothing
closer to E. coli than Haemophilus, which is more removed
(Fig. 1) than any of our examples. Not many E. coli sites were
recovered by comparing just single gene orthologous regions from the
fully sequenced species selected by McGuire et al.
McCue et al. (2001)
used a somewhat broader range of organisms than we
did including Salmonella and Yersinia (but not Klebsiella). They used a
Gibbs algorithm which assumed reverse complement symmetry, with at most
17 functional sites that could span up to 24 bases, and they allowed
0-2 copies of the motif per upstream region. Statistical significance
was assayed relative to a null model where all the upstream regions are
random and uncorrelated. Their scoring function was trained on a
`study set' of known motifs, and perhaps for this reason the typical
maximum a posteriori score against the study set was higher than that
for the remaining genes. We took a single 24 bp interval around our
highest scoring base to compare against their predictions for their
study set, and used their definition of a hit, any overlap. Reverse
complement symmetry clearly excludes sigma sites, whereas our set of
all significant sites hit a respectable percentage of the sigma 70 sites. It is encouraging that such different algorithms which use
complementary parts of the sequence statistics do this well in hitting
known sites. However, it should be borne in mind that the statistical
significance of these predictions for any single gene is not high; a
random 24-base interval placed in the noncoding region upstream of a
gene in the Church set has a 30% probability of hitting a site.
The clearest and least biased way to measure what is of selective
advantage to an organism is by evolutionary conservation. Our
statistically significant (gapless) regions frequently are larger than
any single protein binding site and thus are suggestive of several
factors interacting. For most factors we found no evidence of
compensatory mutations that would be evidence that evolution preserves
the quality (as measured by proximity to the consensus) of the binding
site. Several possibilities are suggested: 1) consensus sequence is a
poor measure of binding affinity; 2) there is neutral evolution within
a sphere of sites, so intraspecies variability is comparable to that
between homologs; or 3) the unit of selection is larger than just one
binding site. We favor the latter possibility, and recall that even the
McCue data (McCue et al. 2001
) with all its assumptions about the sites
misaligned the typical crp (lexA) site by a range of 10 (4) bases. We
take this not as an indictment of their algorithm but rather as an
indication that even with reverse complement symmetry imposed, the most
conserved signal between homologous upstream regions is not the same as
the consensus signal defined intragenome. There is abundant evidence
that the quality of the promoter site sets an upper limit to a gene's
expression level, and for the known sigma 70 sites mapped to homologous
upstream regions by the best weight matrix match, we did find evidence for compensatory mutations. Some of this could be an artefact of the
selection method (though it might be a model for what the protein
itself does), and we suspect that the existing compilation of sigma 70 sites is biased in favor of those close to the consensus. We did not
find statistical evidence of preservation for bases both up- and
downstream of the footprinted (
10,
35) region, which are also
known to influence transcription rate.
The transcription regulatory network is preserved in an average sense for all the bacteria we examined, since when the known binding sites are mapped with their weight matrix to the homogous upstream region, the score of the new weight matrix composed from the mapped sites is as specific as it was in E. coli. For some of the less specific factors such as sigma 70, this is expected by chance, but for most factors, mapping to random upstream regions generates a poor new weight matrix.
In spite of the complexity of our conserved blocks we did observe a
statistically very significant correlation between the DNA binding
specificity (i.e., the number of binding sites) of E. coli
transcription factors and their conservation on the amino acid level.
This is consistent with the natural expectation that on the average,
factors which regulate many genes evolve more slowly than others.
However, it is very surprising that our naive measure of evolutionary
distance, the overall amino acid percentage identity, is a suitable
quantity at all since only a small part of the transcription factor
protein is in direct contact with the DNA. Perhaps the proteins which
regulate many genes are also involved in many protein-protein
interactions. It would be interesting to look at the outliers in Figure
5 in this regard. Futhermore, one can use our results to make a rough
prediction about the DNA binding specificities of the putative ~300
transcription factors in E. coli (Perex-Rueda and
Collado-Vides 2000
) from the conservation of the factor.
Additional statistical information is lost because we have ignored the
one feature that intraspecies algorithms exploit, namely repetition
between different genes. One strategy is to use as input to an
algorithm such as that of Bussemaker et al. (2000)
the portion of the
E. coli control regions whose probability envelope is over
some threshold. Since that code already predicted ~1/4-1/3 of the
known sites using only the complete E. coli genome (H. Li, V. Rhodius, C. Gross, and E.D. Siggia, preprint), the results should improve.
Another interesting task is to cluster sequences which are conserved by means of our analysis, and thus to identify regulons. Preliminary results (E. van Nimwegen, M. Zavolan, N. Rajewsky, and E.D. Siggia, in prep.) indicate that this is indeed possible; the number of statistically significant clusters (i.e., regulons) is roughly 70 (including some of the known regulons).
| |
ACKNOWLEDGMENTS |
|---|
We thank Aaron J. Mackey and William R. Pearson for providing us with tfastx scores of E. coli vs. kpn, stm and ype, as well as for numerous helpful suggestions concerning the use of the tfastx package. The V. cholerae sequence was provided in advance of publication by TIGR. Erik van Nimwegen helped us with issues of clustering sequences.
| |
FOOTNOTES |
|---|
1 Corresponding author.
E-MAIL siggia{at}eds1.rockefeller.edu; FAX (212) 327-8544.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.207502. Article published online before print in January 2002.
| |
REFERENCES |
|---|
|
|
|---|
70 subunit fragment from E. coli rna polymerase.
Cell
87:
127-136.
-strands.
Nature
359:
387-393.| |
APPENDIX |
|---|
|
|
|---|
We work within the parameter space defined by a gap opening
(closing) parameter µ, a gap extention parameter
, and a
substitution matrix w defined in terms of a transition matrix
T for base b to change to b',
|
(A.1) |
3
and off-diagonal
elements
, thus defining a third parameter
. The scale factor
will be defined subsequently. Label the bases in the two
sequences under comparison with i, j running from 1 to
(m, n) and define Z1,2,3(i, j) to
be the cumulative weight for all alignments ending in the
configurations (i, j), (i, gap), and (gap,
j) respectively, and we will use the shorthand of (i,
j) to stand for the ith, jth bases
when there is no possibility of confusion, with the index. The standard
dynamic programming recursions then read for 1
i
m, 1
j
n,
|
|
|
|
(A.2) |
1, j) in the equation for Z2).
These equations must be supplemented by boundary conditions for the
fictious points (i
0, j = 0) and
(i = 0, j
0), in order to initialize (A.2);
and by additional conditions along the lines (i
0, j = n) and (i = m, j
0), to terminate the recursions and extract a single global
alignment score Z from the Z1,2,3 matrices.
Free boundary conditions do not penalize overhangs (i.e., when
the alignment begins as (i, gap) or (gap, j)). For
initialization, they read
|
|