Genome Res. 17:940-946, 2007 ©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07 $5.00 OPEN ACCESS ARTICLE
Methods
Detection of DNA structural motifs in functional genomic elements
1 Program in Bioinformatics, Boston University, Boston, Massachusetts 02215, USA; 2 Department of Chemistry, Boston University, Boston, Massachusetts 02215, USA
The completion of the human genome project has fueled the search for regulatory elements by a variety of different approaches. Many successful analyses have focused on examining primary DNA sequence and/or chromatin structure. However, it has been difficult to detect common sequence motifs within the feature of chromatin structure most closely associated with regulatory elements, DNase I hypersensitive sites (DHSs). Considering just the nucleotide sequence and/or the chromatin structure of regulatory elements may neglect a critical feature of what is recognized by the regulatory machinery: DNA structure. We introduce a new computational method to detect common DNA structural motifs in a large collection of DHSs that are found in the ENCODE regions of the human genome. We show that DHSs have common DNA structural motifs that show no apparent sequence consensus. One such structural motif is much more highly enriched in experimentally identified DHSs that are in CpG islands and near transcription start sites (TSSs), compared to DHSs not in CpG islands and farther from TSSs, suggesting that DNA structural motifs may participate in the formation of functional regulatory elements. We propose that studies of the conservation of DNA structure, independent of sequence conservation, will provide new information about the link between the nucleotide sequence of a DNA molecule and its experimentally demonstrated function.
Since the completion of the sequence of the human genome (Lander et al. 2001
Regions of the genome that are hypersensitive to digestion by deoxyribonuclease I (called DNase I hypersensitive sites, DHSs) have been shown to be associated with a wide variety of functional genomic elements, including promoters, enhancers, origins of replication, and centromeres (Gross and Garrard 1988
Although DHSs occur nonrandomly in the genome, it has been difficult to detect specific DNA sequence motifs that are held in common by DHSs (Noble et al. 2005
Hydroxyl radical cleavage as a measure of local DNA structure
While there are many algorithms that can find regions in a genome that are similar in nucleotide sequence, locating regions that have similar three-dimensional shape or structure is not as straightforward. In order to identify these regions, some measure of structure must be obtained. Chemical probes are capable of providing such structural information for long stretches of DNA (Nielsen 1990
We have shown previously that the extent of hydroxyl radical cleavage of a given nucleotide in duplex DNA is governed by its exposure to solvent (Balasubramanian et al. 1998
For the purpose of the work presented here, a key feature of the hydroxyl radical cleavage experiment is that different DNA sequences can produce similar cleavage patterns (Price and Tullius 1993
We have determined the hydroxyl radical cleavage patterns of a substantial collection of DNA sequences, constructed a database of these patterns, and then used this database to develop an algorithm to predict the cleavage pattern of any DNA sequence (Greenbaum et al. 2007; see Methods). We used this algorithm to predict the cleavage pattern of the 30 Mb of DNA within the ENCODE regions. These predicted cleavage patterns are available for display and analysis in the UCSC Genome Browser (Karolchik et al. 2003
Detection of segments of common local DNA structure in DNase hypersensitive sites
We used the CORCSScrU program to align DHSs by their predicted hydroxyl radical cleavage patterns. The DHSs we studied were derived from several individual data sets and from the union of some of these data sets (3150 DHSs total). The DHS data sets are publicly available via the UCSC Genome Browser (see Methods). To assess the significance of the cleavage pattern alignment scores, we shuffled the DHS sequences, preserving sequence composition, and then ran the CORCSScrU program on the shuffled sequences. This process was repeated 5000 times. Histograms of alignment scores for the real and shuffled data sets are shown in Figure 1. Visual inspection indicates that these two distributions are clearly different. A Kolmogorov-Smirnov (KS) test confirms this conclusion, with P < 10^-17.
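The shuffle-and-compare control described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the toy sequence, and the toy score lists are hypothetical, and the alignment scores themselves would come from CORCSScrU, which is not reproduced here. The sketch computes only the two-sample KS statistic; a library routine such as `scipy.stats.ks_2samp` would also return a P value.

```python
import random
from bisect import bisect_right

def shuffle_preserving_composition(seq, rng):
    """Return a random permutation of seq; base counts are unchanged."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        # Fraction of each sample that is <= x.
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d

rng = random.Random(0)
seq = "ACGTGGCCAATTCGCG"                       # hypothetical DHS sequence
shuffled = shuffle_preserving_composition(seq, rng)

# Hypothetical alignment scores for real and shuffled sequences.
real_scores = [5.1, 5.4, 6.0, 6.2, 6.8]
null_scores = [1.0, 1.3, 1.9, 2.2, 2.4]
d = ks_statistic(real_scores, null_scores)
```

Because the shuffle permutes whole sequences, mononucleotide composition is preserved exactly, while any higher-order structure (and hence the predicted cleavage pattern) is randomized.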
CORCSScrU identified several common hydroxyl radical cleavage motifs within DHSs. Three representative motifs, CORCS1, CORCS2, and CORCS3, are depicted as heat maps in Figure 2, A–C. Here, the X-axis represents each position in the identified motif and the Y-axis represents cleavage value bins. Dark blue cells in the heat map indicate that no cleavage values fall in bin Y at position X, whereas red cells indicate that a large proportion of the cleavage values in that column fall in that bin. If cleavage values were randomly distributed, each column would be uniformly colored. The motifs illustrated here were discovered in separate runs of the CORCSScrU program. CORCS1 was discovered using CORCSScrU in discretized mode with the smaller MPSS DHS data set, while CORCS2 and CORCS3 were discovered using CORCSScrU in discretized and continuous mode, respectively, with the Union DHS data set (see Methods section for more details about data sets). Close inspection reveals that the two CORCS found by aligning DHSs from the Union data set are similar, but offset by one nucleotide (Fig. 2, cf. B and C). To quantitatively assess the similarity of these two CORCS, we calculated the correlation between the mean predicted hydroxyl radical cleavage intensity for positions 2–8 from CORCS2 and for positions 1–7 from CORCS3. A highly significant correlation (Pearson's r = 0.951; p < 0.0005) confirms that CORCS2 and CORCS3 are, indeed, similar. The stochastic nature of the Gibbs sampling algorithm makes it unlikely that exactly the same motif is converged on in every run. However, we found that a very similar, if slightly offset, signal emerged consistently in repeated runs.
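The offset comparison of two motifs can be illustrated as follows. This is a minimal sketch, assuming each motif is summarized as a list of per-position mean cleavage values; the function names and the toy profiles are hypothetical and are not the published CORCS2 or CORCS3 values.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def offset_correlation(profile_a, profile_b, offset):
    """Correlate profile_a shifted left by `offset` with the start of profile_b."""
    a = profile_a[offset:]
    b = profile_b[:len(a)]
    a = a[:len(b)]          # trim in case profile_b is the shorter one
    return pearson_r(a, b)

# Toy profiles: motif_a is motif_b delayed by one position.
motif_b = [0.2, 0.5, 0.9, 0.4, 0.3, 0.7, 0.6, 0.1]
motif_a = [0.8] + motif_b[:7]

# With 8-position motifs, offset=1 pairs positions 2-8 of motif_a
# with positions 1-7 of motif_b (1-based numbering).
r = offset_correlation(motif_a, motif_b, 1)
```

Scanning `offset` over a small range and keeping the maximum correlation is one simple way to recognize that two runs have converged on the same signal in a shifted register.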
The fact that similar CORCS were recovered from a data set when CORCSScrU was run either in discretized or continuous mode suggests that use of the quicker discretized mode does not result in a significant loss of information. To investigate this point further, we plotted the mean predicted hydroxyl radical cleavage pattern values for each position in each CORCS (Fig. 2D–F). The mean cleavage intensity at any given position mirrors closely what CORCSScrU finds in discretized mode (e.g., Fig. 2, cf. B and E).
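To make the relationship between the discretized view (heat-map columns) and the per-position mean concrete, the sketch below builds both summaries from a set of aligned cleavage windows. The equal-width binning, the value range, the number of bins, and all names are assumptions made for illustration; the binning actually used by CORCSScrU is not specified in this text.

```python
def discretize(value, lo, hi, n_bins):
    """Map a cleavage value in [lo, hi] to an equal-width bin index 0..n_bins-1."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    frac = (value - lo) / (hi - lo)
    return min(n_bins - 1, max(0, int(frac * n_bins)))

def column_summary(windows, lo, hi, n_bins):
    """Return (bin-frequency matrix [bin][position], mean value per position)."""
    width = len(windows[0])
    counts = [[0] * width for _ in range(n_bins)]
    totals = [0.0] * width
    for window in windows:
        for pos, value in enumerate(window):
            counts[discretize(value, lo, hi, n_bins)][pos] += 1
            totals[pos] += value
    n = len(windows)
    freqs = [[c / n for c in row] for row in counts]
    means = [t / n for t in totals]
    return freqs, means

# Three hypothetical aligned windows of width 2.
windows = [[0.10, 0.90], [0.20, 0.80], [0.15, 0.95]]
freqs, means = column_summary(windows, lo=0.0, hi=1.0, n_bins=4)
```

In this toy case every value at the first position falls in the lowest bin and every value at the second position falls in the highest bin, so each heat-map column has a single hot cell and the means (0.15 and about 0.88) tell the same story.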
To determine whether a CORCS arises simply as the result of finding similar DNA sequences in different DHSs, we examined the corresponding nucleotide alignments between human DHSs. We found little similarity between nucleotide patterns within CORCS, which can be summarized in the form of sequence logos (Fig. 2G–I; Schneider and Stephens 1990
A CORCS represents a common DNA structural pattern having little sequence similarity
Enrichment analysis of CORCS1
The three CORCS we show in Figure 2 were found by running CORCSScrU on a set of annotated DHSs found in the ENCODE regions. We next asked what the distribution of one of these CORCS was in the entire set of ENCODE regions, to see how specific the CORCS is for DHSs. To do this, we used a MatInspector-like algorithm (Quandt et al. 1995). We scanned CORCS1 across the predicted hydroxyl radical cleavage patterns of the ENCODE regions and scored each overlapping segment. We found 588 cleavage patterns with a similarity score to CORCS1 above the 99.999th percentile threshold. These sequences were extracted and their coordinates recorded in a browser extensible data (BED) format file, for viewing in the UCSC Genome Browser (Fig. 4A) and for enrichment analysis.
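The scan-score-threshold-export procedure can be sketched as below. The similarity function (negative mean squared difference), the nearest-rank percentile rule, and every name and coordinate are assumptions chosen for illustration; the actual scoring scheme of the MatInspector-like algorithm is not given in this text. The BED lines follow the standard 0-based, half-open convention with tab-separated chrom, start, end, and name fields.

```python
from math import ceil

def similarity(window, motif):
    """Assumed score: negative mean squared difference (higher = more similar)."""
    return -sum((w - m) ** 2 for w, m in zip(window, motif)) / len(motif)

def scan(profile, motif):
    """Score every overlapping window of the cleavage profile against the motif."""
    width = len(motif)
    return [similarity(profile[i:i + width], motif)
            for i in range(len(profile) - width + 1)]

def percentile_threshold(scores, pct):
    """Nearest-rank percentile of the score distribution."""
    ranked = sorted(scores)
    k = ceil(pct / 100 * len(ranked)) - 1
    return ranked[max(0, min(k, len(ranked) - 1))]

def hits_to_bed(chrom, region_start, scores, width, threshold, name):
    """Emit one BED line per window scoring at or above the threshold."""
    lines = []
    for i, score in enumerate(scores):
        if score >= threshold:
            start = region_start + i
            lines.append(f"{chrom}\t{start}\t{start + width}\t{name}")
    return lines

# Toy cleavage profile containing one exact and one near match to the motif.
profile = [0.1, 0.2, 0.9, 0.5, 0.1, 0.3, 0.9, 0.5, 0.2]
motif = [0.9, 0.5, 0.1]
scores = scan(profile, motif)
cutoff = percentile_threshold(scores, 80)   # the paper uses 99.999 on ~30 Mb
bed = hits_to_bed("chr1", 1000, scores, len(motif), cutoff, "CORCS1")
```

On a genome-scale profile the percentile would be set far higher, as in the text, so that only a few hundred windows survive.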
We show in Figure 4A that although not every example of CORCS1 found in the ENCODE regions aligns with an annotated DHS, the majority do. Most striking is the overlap of several of the CORCS with DHSs in data sets other than the training data set. A portion of this figure is enlarged to highlight a few examples. The oval on the right shows a CORCS1 site that aligns with a DHS discovered by three different methods across three different cell lines. The oval on the left shows a CORCS1 site that aligns with a DHS that is not in the training set. We found that CORCS1 is 5.0-fold (Z-score = 18.3) enriched for experimentally identified DNase hypersensitive sites. This enrichment is reinforced by the histogram in Figure 4B, which shows a tighter clustering of CORCS1 sites near annotated DHSs compared to either TSSs or CpG islands.
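A fold enrichment and Z-score of this kind can be estimated against a random-placement null model. The sketch below is one common way to do so and is not necessarily the procedure used here: it assumes motif sites are treated as single coordinates, that the null places the same number of sites uniformly at random in the region, and that intervals are 0-based and half-open. All names and numbers are hypothetical.

```python
import random
from statistics import mean, pstdev

def count_in_intervals(sites, intervals):
    """Number of sites that fall inside at least one [lo, hi) interval."""
    return sum(any(lo <= s < hi for lo, hi in intervals) for s in sites)

def enrichment(sites, intervals, region_len, n_perm, rng):
    """Return (observed overlaps, fold enrichment, Z-score) vs. random placement."""
    observed = count_in_intervals(sites, intervals)
    null = []
    for _ in range(n_perm):
        random_sites = [rng.randrange(region_len) for _ in sites]
        null.append(count_in_intervals(random_sites, intervals))
    mu, sd = mean(null), pstdev(null)
    if mu == 0 or sd == 0:
        raise ValueError("degenerate null distribution; use more permutations")
    return observed, observed / mu, (observed - mu) / sd

# Toy region of 10 kb with two DHS intervals covering 5% of it.
dhs = [(1000, 1200), (5000, 5300)]
sites = [1010, 1100, 5050, 5200, 7000]     # hypothetical motif positions
observed, fold, z = enrichment(sites, dhs, 10_000, 1000, random.Random(0))
```

Here four of five sites land in DHSs that cover only 5% of the region, so the fold enrichment is well above 1 and the Z-score is strongly positive.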
CORCS1 is found preferentially in CpG islands that harbor a DHS
Although similar in sequence composition, by definition, to their TSS-distal counterparts, TSS-proximal CpG islands have distinctly different functional roles (Takai and Jones 2002
CORCS1 is found preferentially in DHSs near TSSs
CORCS2 and CORCS3 are also found preferentially in DHSs near TSSs and DHSs overlapping CpG islands
The observation that all CORCS reported here are more enriched for TSS-proximal CpG islands compared to TSS-distal CpG islands (Table 1) and the minor enrichment for the cytosine nucleotide in the CORCS motifs (Fig. 2G–I) prompted us to investigate G+C% in TSS-proximal and TSS-distal CpG islands. TSS-proximal CpG islands have higher G+C% than TSS-distal CpG islands (means = 63.9% and 60.7%, respectively; p < 1 × 10^-24), which may explain the slight enrichment in cytosine among the identified CORCS motifs.
CORCS are moderately enriched for multispecies conserved sequences
When using computational techniques to search for functional non-coding sequences, considering sequence information alone has the potential to overlook important functional elements that are manifested at the level of DNA structure. This raises the tantalizing possibility that some non-coding functional elements may be under evolutionary selection at the level of structure rather than sequence. This concept accords well with the finding by the ENCODE Consortium that Regulatory Factor Binding Regions (RFBRs) often are only weakly enriched in identifiable transcription factor-binding motifs, and that there is a surprisingly low level of sequence constraint in experimentally identified non-coding elements (The ENCODE Project Consortium 2007).
Here we have used the hydroxyl radical cleavage pattern to identify regions in DNase hypersensitive sites that are more similar in structure than in nucleotide sequence. We identified CORCS (Conserved OH Radical Cleavage Signatures) in a large collection of annotated DHSs from the ENCODE regions of the human genome. The striking correlation of the location of CORCS1 with annotated DHSs (Fig. 4B), combined with the results of enrichment analysis (Fig. 5), makes a compelling argument for the existence of common structures within or near DNase hypersensitive sites. The question of whether such common structural features are responsible for conferring nuclease hypersensitivity on these regions remains open. Observing the effect on DNase I sensitivity of deletion or mutation of a CORCS could test this hypothesis directly.
A striking result of our analysis is the finding that CORCS1 is much more highly enriched in CpG islands that harbor DNase hypersensitive sites (Fig. 5B), suggesting that there may be a structural difference between CpG islands that are in regions of open chromatin compared to those that are not. Further speculation raises the possibility that these differences in structure are the underlying determinant of the functional differences between these elements. The broad implications of verifying this hypothesis, along with the compelling evidence presented in this study, warrant its further investigation.
The methods we describe here can be applied to other types of functional genomic elements and are particularly suited to the analysis of elements that show no apparent sequence consensus. The work presented here suggests that the identification of common DNA structural motifs to distinguish among functional elements may be a plausible and cost-effective initiative.
Data sets used for this work
All data sets we used are freely available for download from the UCSC Genome Browser (http://genome.ucsc.edu/encode/). To discover CORCS1, we used DHS sequences from the NHGRI CD4+ T cell MPSS data set (229 sequences in total) (Crawford et al. 2006b
Gibbs sampling of cleavage patterns in DNase hypersensitive sites: CORCSScrU
The computer program we developed for Gibbs sampling is based on an algorithm previously reported for the identification of protein and DNA sequence motifs (Lawrence et al. 1993
Alternatively, we skipped the discretization step and applied the Gibbs sampling algorithm to continuous-value predicted hydroxyl radical cleavage data. Owing to the nature of these data, we made one modification during sampling: We scored each window of length L in the chosen sequence by calculating the probability that the predicted cleavage intensity at position i, where 1
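For orientation, the discretized-mode idea can be sketched as a generic Gibbs site sampler in the spirit of Lawrence et al. (1993), operating on sequences of cleavage-value bin indices instead of nucleotides. This is not the CORCSScrU implementation: the one-window-per-sequence model, the pseudocounts, the simplified background model (all positions of the other sequences), and every name and parameter below are assumptions made for illustration.

```python
import random

def gibbs_site_sampler(seqs, width, n_bins, iters, rng, pseudo=1.0):
    """Sample one motif start per sequence of bin indices (0..n_bins-1)."""
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iters):
        for held in range(len(seqs)):
            # Position-specific bin counts from every sequence except the held-out one.
            counts = [[pseudo] * n_bins for _ in range(width)]
            background = [pseudo] * n_bins
            for j, s in enumerate(seqs):
                if j == held:
                    continue
                for p in range(width):
                    counts[p][s[starts[j] + p]] += 1
                for b in s:
                    background[b] += 1
            col_total = (len(seqs) - 1) + pseudo * n_bins
            bg_total = sum(background)
            # Weight each candidate window by its motif/background likelihood ratio.
            s = seqs[held]
            weights = []
            for st in range(len(s) - width + 1):
                w = 1.0
                for p in range(width):
                    b = s[st + p]
                    w *= (counts[p][b] / col_total) / (background[b] / bg_total)
                weights.append(w)
            # Draw the new start in proportion to the weights (the Gibbs step).
            starts[held] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts

# Toy data: random bin sequences with a planted pattern (hypothetical values).
rng = random.Random(1)
n_bins, width = 4, 5
planted = [3, 0, 3, 0, 3]
seqs = []
for _ in range(6):
    s = [rng.randrange(n_bins) for _ in range(30)]
    pos = rng.randrange(30 - width + 1)
    s[pos:pos + width] = planted
    seqs.append(s)

starts = gibbs_site_sampler(seqs, width, n_bins, iters=50, rng=rng)
```

Because each start is drawn at random rather than chosen greedily, repeated runs can settle on the same signal in a shifted register, which is consistent with the one-nucleotide offset between motifs noted in the Results.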
Heat maps
Calculation of hydroxyl radical cleavage conservation
Identification of regions in ENCODE having cleavage patterns similar to CORCS
Enrichment analysis
We thank The ENCODE Project Consortium for making their data publicly available, and the ENCODE Chromatin and Replication analysis group for providing the DHS data and for helpful suggestions. We are grateful to John Stamatoyannopoulos and Scott Kuehn for providing us with access to their analysis pipeline for determining the proximity of CORCS to genes, transcripts, and other experimentally annotated features within the ENCODE regions. This work was funded by a grant from the National Human Genome Research Institute of the National Institutes of Health (R01 HG003541).
3 Corresponding author.
E-mail tullius{at}bu.edu; fax (617) 353-6466. Article is online at http://www.genome.org/cgi/doi/10.1101/gr.5602807
Balasubramanian, B., Pogozelski, W.K., and Tullius, T.D. 1998. DNA strand breaking by the hydroxyl radical is governed by the accessible surface areas of the hydrogen atoms of the DNA backbone. Proc. Natl. Acad. Sci. 95: 9738–9743.
Baylin, S.B., Herman, J.G., Graff, J.R., Vertino, P.M., and Issa, J.P. 1998. Alterations in DNA methylation: A fundamental aspect of neoplasia. Adv. Cancer Res. 72: 141–196.
Bird, A. 2002. DNA methylation patterns and epigenetic memory. Genes & Dev. 16: 6–21.
Crawford, G., Holt, I., Mullikin, J., Tai, D., Blakesley, R., Bouffard, G., Young, A., Masiello, C., Green, E., Wolfsberg, T., et al. 2004. Identifying gene regulatory elements by genome-wide recovery of DNase hypersensitive sites. Proc. Natl. Acad. Sci. 101: 992–997.
Crawford, G.E., Davis, S., Scacheri, P.C., Renaud, G., Halawi, M., Erdos, M.R., Green, R., Meltzer, P.S., Wolfsberg, T.G., and Collins, F.S. 2006a. DNase-chip: A high resolution method to identify DNaseI hypersensitive sites using tiled microarrays. Nat. Methods 3: 503–509.
Crawford, G.E., Holt, I.E., Whittle, J., Webb, B.D., Tai, D., Davis, S., Margulies, E.H., Chen, Y., Bernat, J.A., Ginsburg, D., et al. 2006b. Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Res. 16: 123–131.
Dorschner, M.O., Hawrylycz, M., Humbert, R., Wallace, J.C., Shafer, A., Kawamoto, J., Mack, J., Hall, R., Goldy, J., Sabo, P.J., et al. 2004. High-throughput localization of functional elements by quantitative chromatin profiling. Nat. Methods 1: 219–225.
The ENCODE Project Consortium. 2004. The ENCODE (ENCyclopedia of DNA Elements) Project. Science 306: 636–640.
The ENCODE Project Consortium. 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature (in press).
Feil, R. and Khosla, S. 1999. Genomic imprinting in mammals: An interplay between chromatin and DNA methylation? Trends Genet. 15: 431–435.
Felsenfeld, G. 1992. Chromatin as an essential part of the transcriptional mechanism. Nature 355: 219–224.
Felsenfeld, G. and Groudine, M. 2003. Controlling the double helix. Nature 421: 448–453.
Greenbaum, J.A., Pang, B., and Tullius, T.D. 2007. Construction of a genome-scale structural map at single-nucleotide resolution. Genome Res. (this issue) doi: 10.1101/gr.6073107.
Gross, D. and Garrard, W. 1988. Nuclease hypersensitive sites in chromatin. Annu. Rev. Biochem. 57: 159–197.
Jones, P.A. and Laird, P.W. 1999. Cancer epigenetics comes of age. Nat. Genet. 21: 163–167.
Karolchik, D., Baertsch, R., Diekhans, M., Furey, T.S., Hinrichs, A., Lu, Y.T., Roskin, K.M., Schwartz, M., Sugnet, C.W., Thomas, D.J., et al. 2003. The UCSC Genome Browser database. Nucleic Acids Res. 31: 51–54.
Lander, E., Linton, L., Birren, B., Nusbaum, C., Zody, M., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860–921.
Lawrence, C., Altschul, S., Boguski, M., Liu, J., Neuwald, A., and Wootton, J. 1993. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262: 208–214.
Margulies, E.H., Cooper, G.M., Asimenos, G., Thomas, D.J., Dewey, C.N., Siepel, A., Birney, E., Keefe, D., Schwartz, A.S., Hou, M., et al. 2007. Analyses of deep mammalian sequence alignments and constraint predictions for 1% of the human genome. Genome Res. (this issue) doi: 10.1101/gr.6034307.
Maston, G.A., Evans, S.K., and Green, M.R. 2006. Transcriptional regulatory elements in the human genome. Annu. Rev. Genomics Hum. Genet. 7: 29–59.
Nielsen, P.E. 1990. Chemical and photochemical probing of DNA complexes. J. Mol. Recognit. 3: 1–25.
Noble, W.S., Kuehn, S., Thurman, R., and Stamatoyannopoulos, J. 2005. Predicting the in vivo signature of human gene regulatory sequences. Bioinformatics (Suppl. 1) 21: i338–i343.
Panning, B. and Jaenisch, R. 1998. RNA and the epigenetic regulation of X chromosome inactivation. Cell 93: 305–308.
Pavlidis, P. and Noble, W.S. 2003. Matrix2png: A utility for visualizing matrix data. Bioinformatics 19: 295–296.
Pogozelski, W.K. and Tullius, T.D. 1998. Oxidative strand scission of nucleic acids: Routes initiated by hydrogen abstraction from the sugar moiety. Chem. Rev. 98: 1089–1107.
Price, M.A. and Tullius, T.D. 1993. How the structure of an adenine tract depends on sequence context. A new model for the structure of TnAn DNA sequences. Biochemistry 32: 127–136.
Quandt, K., Frech, K., Karas, H., Wingender, E., and Werner, T. 1995. MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res. 23: 4878–4884.
Sabo, P.J., Hawrylycz, M., Wallace, J.C., Humbert, R., Yu, M., Shafer, A., Kawamoto, J., Hall, R., Mack, J., Dorschner, M.O., et al. 2004a. Discovery of functional noncoding elements by digital analysis of chromatin structure. Proc. Natl. Acad. Sci. 101: 16837–16842.
Sabo, P.J., Humbert, R., Hawrylycz, M., Wallace, J.C., Dorschner, M.O., McArthur, M., and Stamatoyannopoulos, J.A. 2004b. Genome-wide identification of DNaseI hypersensitive sites using active chromatin sequence libraries. Proc. Natl. Acad. Sci. 101: 4537–4542.
Sabo, P.J., Kuehn, M.S., Thurman, R., Grant, C., Johnson, B., Johnson, S., Kao, H., Yu, M., Goldy, J., Weaver, M., et al. 2006. Genome-scale mapping of DNase I sensitivity in vivo using tiling DNA microarrays. Nat. Methods 3: 511–518.
Saxonov, S., Berg, P., and Brutlag, D.L. 2006. A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc. Natl. Acad. Sci. 103: 1412–1417.
Schneider, T. and Stephens, R. 1990. Sequence logos: A new way to display consensus sequences. Nucleic Acids Res. 18: 6097–6100.
Shadle, S.E., Allen, D.F., Guo, H., Pogozelski, W.K., Bashkin, J.S., and Tullius, T.D. 1997. Quantitative analysis of electrophoresis data: Novel curve fitting methodology and its application to the determination of a protein–DNA binding constant. Nucleic Acids Res. 25: 850–861.
Takai, D. and Jones, P.A. 2002. Comprehensive analysis of CpG islands in human chromosomes 21 and 22. Proc. Natl. Acad. Sci. 99: 3740–3745.
Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A., et al. 2001. The sequence of the human genome. Science 291: 1304–1351.
Received June 6, 2006; accepted in revised form November 22, 2006.