Genome Res. 14:1575-1584, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Letter
Stress-Induced DNA Duplex Destabilization (SIDD) in the E. coli Genome: SIDD Sites Are Closely Associated With Promoters
1 UC Davis Genome Center, University of California, Davis, California 95616, USA
2 Diversa Corporation, San Diego, California 92121, USA
We present the first analysis of stress-induced DNA duplex destabilization (SIDD) in a complete chromosome, the Escherichia coli K12 genome. We used a newly developed method to calculate the locations and extents of stress-induced destabilization to single-base resolution at superhelix density σ = −0.06. We find that SIDD sites in this genome show a statistically highly significant tendency to avoid coding regions. Among intergenic regions, those that either contain documented promoters or occur between divergently transcribing coding regions, and hence may be inferred to contain promoters, are associated with strong SIDD sites in a statistically highly significant manner. Intergenic regions located between convergently transcribing genes, which are inferred not to contain promoters, are not significantly enriched for destabilized sites. Statistical analysis shows that a strongly destabilized intergenic region has an 80% chance of containing a promoter, whereas an intergenic region that does not contain a strong SIDD site has only a 24% chance. We describe how these observations may illuminate specific mechanisms of regulation, and assist in the computational identification of promoter locations in prokaryotes.
Because the initiation of transcription and the initiation of replication both require local separation of the DNA duplex, the locations and occasions where this transition occurs in vivo must be stringently controlled. One biologically important way in which local DNA stability is regulated is through superhelical stresses imposed on the duplex (Benham 1979
Local sites of strand separation (also called local denaturation, duplex opening, or unwinding), the most extreme form of duplex destabilization, can be induced by moderate levels of negative superhelicity (Benham 1980
The relaxation induced by strand separation at any one site is felt by all other base pairs, and their propensities to separate change accordingly. Thus, transitions that involve only near-neighbor interactions when they occur in linear or nicked DNA will in stressed DNA be coupled to the conformational states of all other base pairs that experience the stress. Whether transition occurs at a given site thus depends not just on its local properties, such as thermodynamic stability, but also on how that site competes with all others in the domain. In consequence of this coupling, stress-induced transitions have a large repertoire of highly intricate, nonlinear, and interactive behaviors that far transcend what is possible for thermally driven transitions in unconstrained molecules (Benham 1996
Superhelical stresses are modulated in vivo by a variety of processes, including topoisomerase enzyme activity, translocation of RNA polymerase during transcription, changes of nucleosome binding patterns, histone acetylation, helicase activity, and constraints imposed by other DNA-binding events. In prokaryotes, the basal level of superhelical stress is known to vary with the energy charge, both between stationary and growth phases, and in response to environmental changes such as altered osmolarity, hydrogen peroxide, or thermal stress (Rohde et al. 1994
Strand separation in most DNA regulatory events is not mediated by superhelical stresses alone, but instead usually involves interactions between the DNA and other molecules, commonly proteins. However, stress-induced changes in the local stability of the DNA duplex can strongly affect these events. The ease with which a DNA region can be opened by a reversible intermolecular process depends exponentially on the energy required; destabilizing the duplex at the site by 3 kcal/mole, much less than what is required for its opening, will shift the equilibrium more than 100-fold toward the denatured state, other factors remaining unchanged. Even relatively small amounts of stress-induced duplex destabilization (SIDD), therefore, well below what would be required to drive full strand separation, can greatly facilitate opening reactions that are mediated by other molecules. In this way, even fractional changes of stability at a regulatory site can drastically affect its activity.
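The exponential dependence described above can be checked with a short calculation. The sketch below assumes the standard Boltzmann factor exp(ΔG/RT), with R = 1.987 × 10^-3 kcal/(mol·K) and T = 310.15 K (37°C, the temperature quoted later for the energy parameters). The function name is illustrative and does not come from the paper.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def equilibrium_shift(delta_g_kcal, temp_k=310.15):
    """Fold change in the open/closed equilibrium constant when the duplex
    is destabilized by delta_g_kcal (kcal/mol) at temperature temp_k (K).

    Assumes a simple two-state equilibrium, K = exp(-dG/RT).
    """
    return math.exp(delta_g_kcal / (R_KCAL * temp_k))

if __name__ == "__main__":
    # 3 kcal/mol of destabilization, as in the text
    print(f"{equilibrium_shift(3.0):.0f}-fold")
```

Under these assumptions, 3 kcal/mole gives a factor of roughly 130, consistent with the "more than 100-fold" statement.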
SIDD has been implicated in the mechanisms of activity governing a wide variety of biological processes. One mechanism of transcriptional regulation known to occur in E. coli involves the binding-induced transmission of superhelical destabilization. This mechanism has been implicated in the activation of the ilvPG promoter by IHF protein binding, and in the FIS binding-mediated regulation of the leuV promoter (Sheridan et al. 1998
These and other results show that stress-induced duplex destabilization is an essential component of regulatory mechanisms governing a wide range of normal and pathological events. This makes it essential to have computational methods that accurately analyze the SIDD properties of DNA sequences. This research group has developed three such methods, all based in statistical mechanics, to calculate the pattern of duplex destabilization experienced by a short DNA sequence in response to negative superhelicity (Benham 1990
These methods have recently been extended to enable the analysis of long DNA sequences, including complete genomes (Benham and Bi 2004
Here we report the results of the SIDD analysis of the E. coli K12 chromosomal sequence (version M54, accession number NC_000913; Blattner et al. 1997
The Distribution of SIDD Sites in the E. coli K12 Chromosome
Figure 1 presents three basic properties of the distribution of SIDD sites in the E. coli genome. The top left graph gives the cumulative distribution of destabilization levels expressed in terms of the destabilization energy G(x). For each value G on the horizontal axis, the curve plots the total number of base pairs for which G(x)
The top right graph of Figure 1 shows the number of SIDD regions that are destabilized below a threshold G, that is, G(x) ≤ G for G = 0, 1, 2, 3, 4, 5, and 6. This is the number of runs of contiguous base pairs that satisfy this inequality. Finally, the graph at the bottom of the figure gives the distribution of lengths of the SIDD sites that satisfy this condition for each integer threshold value G = 0,..., 6. Taken together, these last two graphs show that destabilization tends to occur in a relatively small number of reasonably long sites. For example, there are 692 sites of strong destabilization, satisfying G(x) ≤ 0.0, whose average length is 74.0 bp. Similarly, 1448 runs have G(x) ≤ 2.0, with an average length of 106.3 bp. This gives an average density of approximately one site destabilized to this level per 3 kb.
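The run counting described above can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes the profile is a plain list of G(x) values in kcal/mole, that the condition is G(x) ≤ G, and that the sequence is treated as linear (a circular chromosome would need the first and last runs merged). The function names and the toy profile are hypothetical.

```python
def runs_below(profile, threshold):
    """Return (start, end) pairs, end exclusive, for each maximal run of
    consecutive positions x with profile[x] <= threshold."""
    runs = []
    start = None
    for x, g in enumerate(profile):
        if g <= threshold:
            if start is None:
                start = x          # a new run opens here
        elif start is not None:
            runs.append((start, x))  # the run closed at the previous position
            start = None
    if start is not None:            # run extends to the end of the profile
        runs.append((start, len(profile)))
    return runs

def summarize(profile, threshold):
    """Number of runs at or below the threshold and their mean length in bp."""
    runs = runs_below(profile, threshold)
    lengths = [end - start for start, end in runs]
    mean_len = sum(lengths) / len(lengths) if lengths else 0.0
    return len(runs), mean_len

if __name__ == "__main__":
    toy = [9.5, 9.0, 1.5, 0.2, -0.3, 1.8, 9.2, 9.4, 1.9, 9.6]  # made-up values
    print(summarize(toy, 2.0))  # (2, 2.5)
    print(summarize(toy, 0.0))  # (1, 1.0)
```

Applied to a genome of about 4.64 Mb, 1448 runs correspond to roughly one run per 3.2 kb, in line with the density quoted in the text.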
The SIDD profile of a representative 5-kb region of the E. coli genome is shown in Figure 2, annotated with the coding regions and promoters known to be present. In this example, the sites of significant destabilization are largely confined to intergenic regions, whereas coding regions are not significantly destabilized. Indeed, the stabilities within coding regions remain around G(x) = 9–10 kcal/mole, essentially unchanged from what they would be in a relaxed or unconstrained molecule. The strongest destabilization occurs in the intergenic region separating the dnaK gene from the divergently oriented yaa1 ORF. This region contains two documented promoters, both regulating dnaK, whose locations are annotated in the figure. Indeed, the three most strongly destabilized sites in this profile all occur at intergenic positions located in the upstream, 5'-flanks of coding regions. All of these sites are destabilized below G(x)
In summary, this SIDD profile shows a distinctive pattern in which significant destabilization is concentrated at noncoding regions. And among these, strong destabilization appears to occur primarily at those intergenic regions that contain promoters. Visual inspection shows that this pattern of SIDD distribution occurs frequently in the E. coli genome. We next perform a variety of analyses to determine precisely how representative this pattern is of the arrangement of SIDD sites throughout this genome.
Correlations Between SIDD Sites and Promoters, Terminators, and Coding Regions
We also examine the set of intergenic regions that contain experimentally characterized promoters, noting that the number of known promoters is small. For this purpose, we have used two sets of documented promoters: those that are so annotated in the GenBank entry for this sequence, and those compiled in the PromEC database (Hershberg et al. 2001
In the analyses that follow, we define an SIDD site to be a run of consecutive base pairs whose G(x) values are all <8.0 kcal/mole. This energy level is thus taken as the threshold for regarding a region as destabilized. We denote the minimum value of G(x) within such an SIDD site by Gm, and use it to classify the degree of destabilization within the site. (We note that this definition differs from the one used to compile the information in Figure 1. There we counted the numbers of base pairs, and the numbers and lengths of regions, where G(x) falls below each value G. Here an SIDD site is a region where G(x) falls below 8 kcal/mole. Such an SIDD site could contain multiple minima, and hence may have several distinct internal regions where G(x) falls below a given level G.) We sort the SIDD sites according to their levels of destabilization in two ways, either cumulatively or into disjoint bins. The cumulative sets are those satisfying Gm
The Fraction of SIDD Sites That Overlap Intergenic Regions
The most strongly destabilized site in the SIDD profile in Figure 2 coincides with the intergenic region containing the documented promoters dnaKp1 and dnaKp2. The results shown in Figure 3 demonstrate that this pattern is representative of the genome-wide SIDD distribution. Although 89% of the sites having Gm ≤ 0 colocate with noncoding regions, only 3% occur at CON sites that are unlikely to contain promoters. In contrast, 31% occur at DIV sites, which may be inferred to contain promoters. The remaining 55% occur at tandem regions, which either may or may not contain promoters. There are almost four times as many tandem regions as divergent ones, thus this represents a twofold enrichment of SIDD sites in DIV over TAN. This clearly shows that strongly destabilized sites are most closely associated with promoter-containing intergenic regions. Next we examine this association in greater detail.
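The twofold enrichment quoted above follows from normalizing each share of sites by the number of regions in its class. The sketch below makes that step explicit. The ratio of 4 tandem regions per divergent region is the text's approximate figure ("almost four times"), and the function name is illustrative.

```python
def per_region_enrichment(frac_a, frac_b, b_regions_per_a_region):
    """Ratio of sites-per-region in class A to sites-per-region in class B.

    frac_a, frac_b: fractions of all sites that fall in class A and class B.
    b_regions_per_a_region: how many class-B regions exist per class-A region.
    """
    density_a = frac_a / 1.0
    density_b = frac_b / b_regions_per_a_region
    return density_a / density_b

if __name__ == "__main__":
    # 31% of the strongest sites in DIV, 55% in TAN, about 4 TAN regions per DIV region
    print(f"{per_region_enrichment(0.31, 0.55, 4.0):.2f}")
```

With these rounded inputs the ratio comes out slightly above 2, consistent with the twofold enrichment of DIV over TAN.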
The Fractions of Intergenic Regions That Overlap SIDD Sites
These results show that the strongest association with SIDD sites occurs for intergenic regions that are known (DTP) or inferred (DIV) to contain promoters. Tandem intergenic regions (TAN), which may or may not contain promoters, have an intermediate level of SIDD association, whereas CON sites (which are inferred not to contain promoters) and coding regions have the lowest levels of association. The fact that the fractions of SIDD-associated documented promoters (DTP) and divergent intergenic regions (DIV) are closely similar at each destabilization level lends further support to the inference that promoters are present within DIV regions.
The Statistical Significance of These Associations
Next we investigate in greater detail how the statistical significance of each type of association varies with SIDD level. For this purpose, we repeat the Monte Carlo analysis at each SIDD level, with the SIDD sites now partitioned into seven disjoint bins according to their minimum G(x) value, Gm. Sites with Gm ≤ 0 are placed in bin 0, whereas sites with i − 1 < Gm ≤ i are placed in bin i, 1 ≤ i ≤ 6. In all cases, the random distributions are seen by inspection to be approximately normal (data not shown), as was observed above for the cumulative distributions. We calculate the means and standard deviations of these distributions, and from them the z-scores of the actual SIDD associations. (The z-score is the number of standard deviations that the observed value is away from the mean of the random distribution. Positive z-scores correspond to values above the mean, negative z-scores to values below the mean.) The statistical significances of associations between these SIDD sites and the five classes DTP, DIV, CON, TAN, and internal coding regions are plotted in Figure 6 for each destabilization level (i.e., bin) as the z-scores found by this Monte Carlo procedure. A z-score of ±3.5 corresponds to a probability of slightly less than 0.001 that this value occurs by chance. We see that for values Gm ≤ 0, 1, or 2, the SIDD sites are statistically highly significantly associated with both promoter-containing sets DTP and DIV. The association between SIDD sites and CON, which are regions that are inferred not to contain promoters, is never more than marginally significant at the 0.001 level. At all destabilization levels, SIDD sites show a statistically highly significant propensity to avoid coding regions, and to associate with TANs. The statistical significances found for the four categories DTP, DIV, TAN, and coding regions are greatest for the highest levels of destabilization. 
For the most destabilized sites, those satisfying Gm ≤ 0, we find they cluster at DIV regions at rates >27 standard deviations above random, and avoid coding regions at rates that are >30 standard deviations below random.
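The binning rule and the z-score calculation above can be sketched as follows. The details of the paper's randomization are not given in this excerpt, so the null model here is an assumption: each site keeps its length and is placed uniformly at random on a linear sequence, independently of the other sites. All names are hypothetical.

```python
import math
import random
import statistics

def gm_bin(gm):
    """Bin 0 if Gm <= 0; bin i if i-1 < Gm <= i for 1 <= i <= 6; else None."""
    if gm <= 0:
        return 0
    i = math.ceil(gm)
    return i if i <= 6 else None

def count_overlaps(sites, regions):
    """Number of sites (start, end) overlapping at least one region (start, end)."""
    return sum(
        any(s < r_end and r_start < e for r_start, r_end in regions)
        for s, e in sites
    )

def z_score(sites, regions, genome_len, trials=1000, seed=0):
    """z-score of the observed overlap count against randomly placed sites."""
    rng = random.Random(seed)
    observed = count_overlaps(sites, regions)
    null_counts = []
    for _ in range(trials):
        shuffled = []
        for s, e in sites:
            length = e - s
            new_start = rng.randrange(genome_len - length + 1)
            shuffled.append((new_start, new_start + length))
        null_counts.append(count_overlaps(shuffled, regions))
    mean = statistics.mean(null_counts)
    sd = statistics.stdev(null_counts)
    if sd == 0:
        raise ValueError("null distribution has zero variance; use more trials")
    return (observed - mean) / sd

if __name__ == "__main__":
    regions = [(1000, 1200)]                            # toy intergenic region
    sites = [(1010, 1060), (1100, 1150), (1150, 1190)]  # toy SIDD sites
    print(z_score(sites, regions, genome_len=10_000))
```

In practice one would call `z_score` once per bin and per region class (DTP, DIV, CON, TAN, coding), using only the sites whose `gm_bin` value matches the bin.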
A second set of 472 documented promoters has been compiled in the PromEC database (Hershberg et al. 2001 0, whereas 105 of the 285 PromEC-documented sites (36.8%) are destabilized to this level.
SIDD Sites at Promoters in E. coli
We have estimated the probabilities that an intergenic region either is (D) or is not (
2. The probabilities required in these formulas are found as follows. Of the 555 regions in CON, which are inferred not to contain promoters, only 62 are destabilized at this level, so p(D|P̄) = 62/555 = 0.112. We first estimate p(D|P) from the 189 intergenic regions in DTP, which contain experimentally characterized promoters. We find that 107 of these DTP sites are destabilized at this level, giving p(D|P) = 107/189 = 0.566. Finally, of the 3584 intergenic regions, we find that 1032 are destabilized at the level Gm ≤ 2.0, thus the overall probability of destabilization is p(D) = 0.288. Substituting these values into the above formulas, we find that p(P|D) = 0.763, and p(P|D̄) = 0.236. If we use DIV instead of DTP as our promoter-containing data set, we get p(P|D) = 0.81, and p(P|D̄) = 0.249. This analysis indicates that SIDD properties alone are anticipated to be highly accurate predictors of the presence of promoters within strongly destabilized intergenic regions.
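The formulas referred to above are missing from this copy of the text. The sketch below is therefore a reconstruction under stated assumptions: it uses Bayes' rule together with the law of total probability, solved for the prior p(P). The three input probabilities are those quoted in the text, and the function and variable names are illustrative.

```python
def promoter_posteriors(p_d_given_p, p_d_given_not_p, p_d):
    """Return (p(P), p(P|D), p(P|not D)) from three measured quantities.

    Assumed relations (not shown in this excerpt):
      p(D)       = p(D|P) p(P) + p(D|not P) (1 - p(P))
      p(P|D)     = p(D|P) p(P) / p(D)
      p(P|not D) = (1 - p(D|P)) p(P) / (1 - p(D))
    """
    p_p = (p_d - p_d_given_not_p) / (p_d_given_p - p_d_given_not_p)
    p_p_given_d = p_d_given_p * p_p / p_d
    p_p_given_not_d = (1.0 - p_d_given_p) * p_p / (1.0 - p_d)
    return p_p, p_p_given_d, p_p_given_not_d

if __name__ == "__main__":
    prior, post_d, post_not_d = promoter_posteriors(
        p_d_given_p=107 / 189,      # DTP regions destabilized
        p_d_given_not_p=62 / 555,   # CON regions destabilized
        p_d=1032 / 3584,            # all intergenic regions destabilized
    )
    print(f"p(P) = {prior:.3f}, p(P|D) = {post_d:.4f}, p(P|not D) = {post_not_d:.4f}")
```

Under these assumptions the posteriors come out close to the reported 0.763 and 0.236, which suggests the reconstruction matches the calculation in the text. The implied prior is that roughly 39% of intergenic regions contain a promoter.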
We may use these results to make a rough estimate of the number n(P) of promoter-containing intergenic regions in E. coli. Using the set of SIDD sites satisfying Gm
1940 promoters. This database also finds that 15% of the documented promoters overlap or are internal to coding regions. If this is representative of the genome-wide average, there would be an additional 350 promoters, for an estimated total of 2290 promoters in this genome.
This paper presents the results of the first analysis of the stress-induced duplex destabilization properties of a complete genome, that of E. coli K12. This is the first application to a complete chromosome of our newly developed computational method for analyzing the SIDD properties of long DNA sequences (Benham and Bi 2004
The statistics of this association are indeed extreme. The most strongly destabilized sites are present in intergenic regions that contain either documented (DTP) or inferred (DIV) promoters at rates that are >20 standard deviations above random. And they occur within coding regions at rates >30 standard deviations below random. The p-values associated with these z-scores are infinitesimal. Stated otherwise, the density of the strongest SIDD sites is >85 times greater in promoter-containing intergenic regions than in coding regions. We are unaware of any other single attribute that shows such a high degree of clustering at promoters. Such extreme density differences would be very unlikely to persist within a population without strong selection pressure to conserve them.
The results presented here suggest that SIDD attributes may be useful for finding promoter locations in prokaryotes, a problem that has proven surprisingly difficult to resolve using string-based methods alone (Hertz and Stormo 1996
We anticipate that improved promoter prediction algorithms may be achievable by including both the SIDD properties and the known sequence attributes of promoters. (In addition to the sequence attributes used in previous predictors, there is statistical evidence, for example, that TAN regions that do not contain promoters are usually quite short, commonly
One might anticipate that specific structural and physico-chemical attributes could be closely associated with particular types of regulatory regions. After all, the mechanisms by which regulation is effected involve interactions with other molecules that depend on the structures and physical-chemical properties of all the participants, including the DNA. Thus, it should not be surprising that the propensity of sites to become destabilized under the types of stresses that occur in vivo should correlate with the locations of regulatory regions governing processes in which duplex opening is required. Stress-induced destabilization could be expected to be associated specifically with promoters because strand separation is a necessary step in the initiation of transcription. Although this opening is mediated by the polymerase holoenzyme, even fractional destabilization can greatly assist this process. For example, destabilization of the DNA duplex at the −10 region by 2.8 kcal/mole (from 10.2 kcal/mole to 7.4 kcal/mole) would shift the equilibrium of the opening reaction by two orders of magnitude toward the open state. This is a direct mechanism by which SIDD could alter the transcriptional activities of promoters. However, specific cases have shown that destabilization events also can play other specific roles in transcriptional regulatory mechanisms.
A strong SIDD site has been shown to be present 90 bp upstream of the ilvPG promoter of E. coli (Sheridan et al. 1998
We note that the strong SIDD site involved in this mechanism coincides with the binding site for a regulatory molecule, not with the location of the promoter itself. This shows the possible functional importance of SIDD sites located at any positions within intergenic regions, not just at the promoter. It also shows that regulatory mechanisms can involve interactions between destabilization and binding events. It is reasonable to suppose that SIDD sites near other promoters in E. coli also could be involved in the specific mechanisms of their regulation. These issues, and many others arising from the analysis of this data set, will be carefully addressed in future publications.
An initial analysis of several yeast genes has shown that the strongest SIDD sites in that primitive eukaryote are found in the 3'-terminal flanks of genes, not in their 5'-flanks (Benham 1996
We currently are developing a Web site to provide access to the results of our SIDD analyses at the address http://www.genomecenter.ucdavis.edu/benham. There one can get SIDD profiles of any regions of interest in any complete chromosome that has been analyzed. To date this is just E. coli K12, although yeast results will be added soon. We do not at present intend our analysis to exactly reflect in vivo conditions. Indeed, one anticipates these will vary in complex ways according to the precise manner in which stresses are imposed, how the DNA is constrained, binding events, and many other specific effects. The present calculations are intended to illuminate a relatively simple physical-chemical attribute of the DNA duplex: its propensity to become locally destabilized in response to the superhelical stresses that are imposed on it in vivo. Although the assumptions implicit in this approach are much simpler than is the in vivo situation, their results already have illuminated attributes of the DNA that have been implicated in a variety of important regulatory processes. Accordingly, we designed our analysis procedure to most effectively illuminate the SIDD behavior of genomic sequences, and thereby enable correlations of SIDD sites with regulatory and other regions to be evaluated. Correlations that are found can suggest possible roles of SIDD in mechanisms of regulation, and experimental tests thereof. 
But calculations that accurately reflect in vivo conditions (where the basal level of superhelicity can vary in complex ways with growth state and environmental parameters, transient supercoiling is driven by replication and transcription, and both domain boundaries and susceptibilities to transition vary with protein-binding events) must await a fuller understanding of these conditions.
The Analysis of SIDD in Long DNA Sequences
The Equilibrium Thermodynamics of Superhelical DNA Denaturation
In a superhelical domain, the DNA is constrained so that its linking number Lk is fixed. (Lk is the total number of helical turns in the DNA within a domain when its central axis is held planar.) When the domain is relaxed, it has linking number Lk0. But DNA in vivo is commonly constrained in a negatively superhelical state, in which Lk < Lk0. This results in a (negative) linking difference α = Lk − Lk0 < 0, which exerts untwisting torsional stresses on the base pairs within the domain. The superhelix density is σ = α/Lk0.
A given linking difference (i.e., superhelicity)
of available states is indexed by i, and if the free energy of state i is Gi, then the fraction of a population of identical molecules that is in state i at equilibrium will be
has value i in state i, then its population average (i.e., expected) value at equilibrium is
To specify a state of strand separation in a DNA molecule of specified base sequence and imposed superhelicity
Equations 4 and 5 may be used to evaluate the equilibrium value of any property of interest, once the states and their energies have been specified. The method by which this is done has been presented elsewhere (Benham 1992
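The equations themselves are missing from this copy of the text. The description (a fraction of molecules in state i determined by its free energy Gi, and a population average of a property over states) matches the standard Boltzmann forms p_i = exp(−G_i/RT)/Z with Z = Σ_j exp(−G_j/RT), and ⟨ξ⟩ = Σ_i ξ_i p_i. The sketch below assumes those forms. The state list, energies, and names are illustrative and are not the authors' state enumeration.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K)

def boltzmann_populations(energies, temp_k=310.15):
    """Equilibrium fraction of molecules in each state, given free energies
    in kcal/mol. Energies are shifted by their minimum for numerical stability;
    the shift cancels in the ratio."""
    rt = R_KCAL * temp_k
    g_min = min(energies)
    weights = [math.exp(-(g - g_min) / rt) for g in energies]
    z = sum(weights)  # partition function (up to the constant shift)
    return [w / z for w in weights]

def ensemble_average(values, energies, temp_k=310.15):
    """Population-average value of a property that takes values[i] in state i."""
    pops = boltzmann_populations(energies, temp_k)
    return sum(p * v for p, v in zip(pops, values))

if __name__ == "__main__":
    # Toy three-state system: number of open base pairs and free energy per state
    open_bp = [0, 10, 25]
    free_energy = [0.0, 1.5, 2.0]  # kcal/mol, made-up values
    print(boltzmann_populations(free_energy))
    print(ensemble_average(open_bp, free_energy))
```

Any equilibrium property, such as the probability that a given base pair is open, can be written as such an average once the states and their energies are enumerated.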
The Approximate Method of SIDD Analysis
0.04) show that the approximate method achieves four to five significant figures of accuracy in all calculated parameters when a threshold of 12 kcal/mole is used (Fye and Benham 1999
5000 bp under these conditions commonly will have somewhere between 10^6 and 10^9 states that satisfy this threshold, a number that is small enough to execute efficiently.
Long DNA sequences, including complete chromosomes, are analyzed by partitioning them into windows and analyzing each window separately (Benham and Bi 2004
Computational Implementation
All energy parameters used in these calculations are given the values that have been experimentally determined to occur at an ionic strength of 0.01 M and a temperature of 37°C (Benham 1992
Previous calculations performed on a wide variety of short (i.e., 3–8 kb length) DNA sequences have shown that this approach, with these energy parameters, provides highly accurate predictions of stress-induced transition behavior, both in vitro and in vivo. For sequences in which the locations and extents of superhelically driven duplex opening have been measured using the mung bean digestion procedure, our computational predictions of the positions of opening and the relative amounts of opening at each position, both as functions of imposed superhelicity, have been shown to be quantitatively accurate to within experimental precision (Benham 1992
We note, however, that the precise level of superhelicity needed to drive a given extent of transition does depend on environmental conditions of temperature and ionic strength. Our results are quantitatively highly accurate when compared with experiments performed under the environmental conditions of the nuclease digestion procedure of Kowalski, as these are the conditions to which the energy parameters we use pertain. One may modify our calculations to suit other conditions by simply using the energy parameters that are appropriate to those conditions. However, at present the experimental information available regarding superhelical transitions under other conditions is not sufficiently detailed to allow a rigorous assessment of the quantitative accuracy of our predictions of the superhelical dependence of transition properties. We use a window size of 5000 bp for our calculations because previous analyses of sequences of this size have proven to be highly accurate and informative, as described above. As there is no information available regarding the distances over which superhelical stresses propagate in vivo, there is at present no biological basis for selecting any specific length scale. 
Indeed, one may speculate that the sizes and boundaries of topological domains may change dynamically with protein-binding events, translocation of polymerases, reptation through constraints, and perhaps other effects. If experiments suggest that a particular length scale is relevant to a specific phenomenon, the calculations can be easily modified to use that scale.
Analysis of the E. coli Genome
The results of these calculations may be accessed at http://www.genomecenter.ucdavis.edu/benham. There the user may request the G(x) values for any specified region of any chromosome for which the calculation has been performed. The SIDD and probability profiles of regions 5 kb in length can be graphed, or tabulated output for specified regions can be sent by e-mail. The output for the entire chromosome can be provided on request.
We are grateful for the assistance of Peter Morrison in writing programs implementing our early algorithmic strategies for analyzing the SIDD profiles of E. coli. The work reported here was supported in part by grants DBI 99-04549 from the National Science Foundation and RO1-HG01973 from the National Institutes of Health, and by additional support from the Diversa Corporation. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2080004.
3 Corresponding author.
Aranda, A., Perez-Ortin, J., Benham, C.J., and del Olmo, M. 1997. Analysis of the in vivo structure of a natural alternating d(AT)n sequence in yeast. Yeast 13: 313–326.
Benham, C.J. 1979. Torsional stress and local denaturation in supercoiled DNA. Proc. Natl. Acad. Sci. 76: 3870–3874.
Benham, C.J. 1980. The equilibrium statistical mechanics of the helix-coil transition in torsionally stressed DNA. J. Chem. Phys. 72: 3633–3639.
Benham, C.J. 1981. A theoretical analysis of competing conformational transitions in torsionally stressed DNA. J. Mol. Biol. 150: 43–68.
Benham, C.J. 1990. Theoretical analysis of heteropolymeric transitions in superhelical DNA molecules of specified sequence. J. Chem. Phys. 92: 6294–6305.
Benham, C.J. 1992. The energetics of the strand separation transition in superhelical DNA. J. Mol. Biol. 225: 835–847.
Benham, C.J. 1993. Sites of predicted stress-induced DNA duplex destabilization occur preferentially at regulatory regions. Proc. Natl. Acad. Sci. 90: 2999–3003.
Benham, C.J. 1996. Duplex destabilization in superhelical DNA is predicted to occur at specific transcriptional regulatory regions. J. Mol. Biol. 255: 425–434.
Benham, C.J. and Bi, C.-P. 2004. The analysis of stress-induced duplex destabilization in long genomic DNA sequences. J. Comp. Biol. (in press).
Benham, C.J., Kohwi-Shigematsu, T., and Bode, J. 1997. Stress-induced duplex destabilization in chromosomal scaffold/matrix attachment regions. J. Mol. Biol. 274: 181–196.
Blattner, F., Plunkett, G., Bloch, C.A., Perna, N.T., Burland, V., Riley, M., Collado-Vides, J., Glasner, J.D., Rode, C.K., Mayhew, G.F., et al. 1997. The complete genome sequence of Escherichia coli K-12. Science 277: 1453–1474.
Bloomfield, V., Crothers, D., and Tinoco, I. 1974. The physical chemistry of nucleic acids. Harper & Row, New York.
Cheung, K.J., Badarinarayana, V., Selinger, D.W., Janse, D., and Church, G.M. 2003. A microarray-based antibiotic screen identifies a regulatory role for supercoiling in the osmotic stress response of Escherichia coli. Genome Res. 13: 206–215.
Eskin, E., Keich, U., Gelfand, M.S., and Pevzner, P. 2003. Genome-wide analysis of bacterial promoter regions. 2003 Pac. Symp. Biocomp. 29–40.
Fye, R.M. and Benham, C.J. 1999. Exact method for numerically analyzing a model of local denaturation in superhelically stressed DNA. Phys. Rev. E 59: 3408–3426.
Hatfield, G.W. and Benham, C.J. 2002. DNA topology-mediated control of global gene expression in Escherichia coli. Ann. Rev. Genet. 36: 175–203.
He, L., Liu, J., Collins, I., Sanford, S., O'Connell, B., Benham, C.J., and Levens, D. 2000. Loss of FBP function arrests cellular proliferation and extinguishes c-myc expression. EMBO J. 19: 1034–1044.
Hershberg, R., Bejerano, G., Santos-Zavaleta, A., and Margalit, H. 2001. PromEC: An updated database of Escherichia coli mRNA promoters with experimentally identified transcriptional start sites. Nucleic Acids Res.