|
|
|
|
Genome Res. 14:1967-1974, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00 Methods Decoding Human Regulatory Circuits1 Center for Bioinformatics, The Wadsworth Center, New York State Department of Health, Albany, New York 12208, USA 2 Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia V5Z 4H4, Canada 3 Department of Statistics, Harvard University, Cambridge, Massachusetts 02138, USA 4 Computer Science Department, Rensselaer Polytechnic Institute, Troy, New York 12180, USA
Clusters of transcription factor binding sites (TFBSs) which direct gene expression constitute cis-regulatory modules (CRMs). We present a novel algorithm, based on Gibbs sampling, which locates, de novo, the cis features of these CRMs, their component TFBSs, and the properties of their spatial distribution. The algorithm finds 69% of experimentally reported TFBSs and 85% of the CRMs in a reference data set of regions upstream of genes differentially expressed in skeletal muscle cells. A discriminant procedure based on the output of the model specifically discriminated regulatory sequences in muscle-specific genes in an independent test set. Application of the method to the analysis of 2710 10-kb fragments upstream of annotated human genes identified 17 novel candidate modules with a false discovery rate 0.05, demonstrating the applicability of the method to genome-scale data.
Technologies for large-scale assessment of gene expression have become a mainstay of the postgenome era. Such profiling studies in yeast have been analyzed to gain insights into the regulatory program of this organism (Segal et al. 2003
The CRM can be viewed as a circuit translating input signals from diverse pathways into an output, gene activity, through the binding of multiple transcription factors in a combinatorial fashion. Though regulatory circuits can be defined through extensive laboratory effort, most tissues and contexts are insufficiently characterized to allow such approaches. Although pattern discovery techniques have proven effective in the identification of transcription factor binding sites (TFBSs) for several single-celled organisms (McCue et al. 2000
Cross-species comparison of sequences from orthologous genes, or phylogenetic footprinting, shortens the amount of sequence under consideration by focusing attention on conserved regions that are more likely to serve a biological function (Wasserman et al. 2000
Protein interactions provide the mechanistic basis for much of gene regulation in all organisms (Wei et al. 2004 In the present study, we developed a synergy-based de novo algorithm that models neighbor interactions among TFBSs. We also explored the utility of using aligned human-mouse sequences as an input data set for training the algorithm. We found that the use of aligned human-mouse sequences and the use of neighboring interactions both enhance the specificity of site and module predictions. We show that this model can be used to specifically discriminate regulatory sequences from control sequence in an independent test set, and we use the resulting discrimination procedure to predict additional genes that are likely to be regulated in a manner similar to those in the study set.
Positive Training Model To explore the utility of human-rodent sequence comparison (an evolutionary distance of 50-100 million years) for locating CRMs, we selected a study set of 24 3-kb upstream regions of orthologous gene pairs specifically up-regulated in human skeletal muscle tissue. These genes were selected because they contain numerous reported TFBSs as determined from biochemical and genetic studies (Wasserman and Fickett 1998
Within our study set, there are a total of 188 reported sites (94 mouse/human pairs). Because the regulatory mechanisms governing the expression of these genes have not been fully delineated, additional functional sites are likely to be present. After masking repetitive sequences in the human gene sequence with RepeatMasker (A.F.A. Smit and P. Green, unpubl.; http://www.repeatmasker.org/), using BLASTZ (Schwartz et al. 2003
We defined a CRM to be a fragment of sequence in which there are at least two reported TF binding sites with intersite spaces
Similar to the analysis by Wasserman et al. (2000
Sequence logos (Schneider and Stephens 1990
In order to examine the contributions of the various components of the algorithm, we compared its performance to two other modes of Gibbs sampling (Thompson et al. 2003 100 bp apart. As Table 2 shows, the most improvement in site identification emerges with phylogenic footprinting (the addition of the mouse sequences). Table 2 also shows that inferences of neighboring pair relationships that are unique to the module sampler also strongly improves site identification.
To compare the module sampler results with predictions obtained when motif models were known a priori, we obtained the COMET software (Frith et al. 2002 50% overlap). The module sampler correctly located 33 reported Myf, Mef-2, and SRF TFBSs with 19 ambiguous sites in the human sequences and 17 reported modules. (The human sequences represent half the totals in Table 2.) Thus, COMET predictions were somewhat more specific and slightly less sensitive than the module sampler's (see Table 2). These results demonstrate that the incorporation of aligned sequence pairs and neighbor interactions can circumvent the need for large reference collections.
In addition to the muscle-specific sequences, we collected a set of 10 human upstream sequences from genes expressed in liver tissue and their rodent orthologs (Krivan and Wasserman 2001
Finding New Muscle-Specific Genes We found that models built with these data using uninformed prior models contained very few sites, and that the resulting models were so unlike those found in the muscle-specific set that they were of little value for discrimination. To force the negative model to focus on muscle-like features, we used the five predicted motif models shown in Figure 1 as very strongly informed prior motif models, and we set the distribution of the number of sites per sequence to that obtained from the original muscle-specific sequence pairs. Consequently, the discrimination of muscle-specific modules from negative controls stems primarily from the posterior differences in numbers of sites, the frequencies of each predicted type of site, and the neighboring relationships among them. Using these parameters, the algorithm predicted modules in 24 of the 100 negative training sequence pairs.
The ratio of the probability of a given sequence under the positive model to the probability under the negative model defines a Bayes factor ratio (Gelman et al. 1995
Validation For direct validation, we performed an intensive search of the literature and found a set of 13 additional human genes with evidence of specific expression in muscle tissue and reported TFBSs. For each of these positive test sequences, we selected a 10-kb noncoding fragment that included the reported TFBS. We processed these and aligned them with mouse sequences as above. We randomly sampled another 100 from the remaining 2810 10-kb upstream regions as the negative test set. The module sampler was applied to both sets, using the parameters learned in training. Modules were found in nine of the 13 positive sequence pairs and in 22 of the 100 negative sequence pairs. Figure 3A shows a histogram of the Bayes ratios for these 31 positive and negative direct validation sequence pairs. The negative controls contained candidate CRMs with low Bayes ratios; the maximum Bayes ratio was only 90.0 with only two values greater than 10. Among the nine positive sequence pairs, five had ratios greater than 90 and three had ratios much above 90. A Kolmogorov-Smirnov test (Venables and Ripley 1999 0.
To test the algorithm with a larger set of positive sequences, we used cross-validation. In this process one sequence pair at a time, the target pair, was removed from the data sets. For the 24 positive training-sequence pairs, the positive models were rebuilt, as necessary, with the remaining 23 sequence pairs. Bayes ratios for these are shown in Figure 3B. This cross-validation process was also applied to the 100 negative training sequences, and a module was identified in 24 of these. The Bayes ratio was below 10 for 21 of these negative pairs. The remaining three had ratios of 22.5, 35.8, and 130.8. Thirteen of the positive sequences had Bayes ratios above 130.8 and, as shown in Figure 3B, most of them had ratios several orders of magnitude higher.
Sequence Data Mining
We introduce an algorithm, based on a generative statistical model, for finding CRMs in sequence data that requires only aligned human-mouse sequence pairs of likely coregulated genes, and does not require prior knowledge of motif models or other parameters. It makes only minimal assumptions about the sizes and numbers of differing kinds of binding sites. We also introduce a novel discriminant procedure for using the inferred models to search for other genes that might be specifically expressed in a manner similar to those in our study set. Our tests of this procedure, using both cross-validation and direct validation, indicate that a subset of muscle-specific genes can be discriminated from control sequences. Application of this procedure to 2710 upstream sequences from the human genome identifies 17 genes containing a module (q-values 5%).
The algorithm performs well on the positive training set data, a well studied collection of muscle-related genes with known regulatory sites. It locates
The predicted motif models were used to search the JASPAR database of transcription factor binding profiles (http://jaspar.cgb.ki.se; Sandelin et al. 2004
The number of different motif models was initially set to five to facilitate the search for Mef-2, Myf, SRF, SP1, and Tef binding motifs, as suggested by others (Wasserman and Fickett 1998 Our model distinguishes itself from previous de novo models in four ways: it (1) incorporates terms that seek to capture interaction of neighboring TFs, (2) explores variations in the patterns of conserved base pairs in a binding motif via a fragmentation step, (3) searches for sites common to a pair of species, and (4) employs a generative statistical model. Our results show that the first three of these factors improved the performance of the algorithm. (1) The incorporation of neighbor interactions into the CRM pattern discovery process improves performance. In the absence of neighbor interactions, the module sampler produced a solution with a MAP value approximately equal to the MAP value produced by the full module sampler. However, in the absence of neighbor interactions, the number of reported sites correctly predicted for the solution with the highest MAP was lower (56% vs. 69%) and the total number of predicted sites was higher (222 vs. 176). An additional test with a liver-specific data set behaved similarly. Thus, these interactions seem to contribute significantly to specificity and sensitivity of the algorithm. (2) We found that use of the fragmentation algorithm was required to distinguish SRF sites from MEF sites. (3) The use of aligned sequence pairs greatly improved the Gibbs sampler's ability to detect TFBSs and CRMs, as shown in Table 2. In calculating the sampling distribution, we assumed that the individual sequences in the pairs were independent. Although this is almost certainly incorrect, we found that a down-weighting of the mouse sequences, to adjust for phylogenic correlation, adversely affected performance. We prefer the use of generative statistical models, because they provide a facile means for the future incorporation of additional models of biological processes such as selective evolutionary pressures. Our experiences with Gibbs sampling models from this class and the experiences of others with hidden Markov models from this class suggest that taking this path, though somewhat more demanding at the outset, will in the long run promote extensions that model emerging biological findings on the mechanisms of transcription regulation.
We examined three different prior distributions of the total number of sites per sequence. Both a prior that was equal to the reported number of sequences and a Poisson prior with The relatively small proportion of the total number of human genes represented by the orthologous gene pairs in our sequence mining set stems primarily from a requirement to identify unique mouse orthologs for each candidate human gene. We know of no reason why the resulting set should be biased, but we also have no good evidence that the set is not biased. However, it was the data set available from the human genome assembly at the time of the study. The latest revision of the human genome does not substantially increase this proportion of human genes with useful mouse orthologs. We were somewhat surprised that the sampler failed to identify a CRM in three of the sequence pairs in the positive training set. However, this finding is consistent with the fact that the sequences making up the positive training set are regulatory regions of a heterogeneous mixture of regulatory mechanisms. This heterogeneity is further reflected by the findings from the test set, in which the sampler could not identify a module in four of the 13 positive control sequences. We thus suspect that the modules that we do detect are germane to some specifically regulated subset of muscle-specific genes. The use of the FDR approach gave us a means by which to select a critical value cutoff that did not require input from our validation studies. The fact that this FDR-based critical Bayes ratio of 194.5 is somewhat greater than the highest ratio among the negative validations results supports its use. Throughout this study, our focus has been on producing CRM predictions that are unlikely to include false positives, that is, which have low q-values. We thus intentionally tipped the balance toward specificity with a concomitant loss of sensitivity. In our opinion, the confidence gained in the predicted set is well worth the tradeoff of missing some additional muscle-specific genes remaining in the data set that we mined. Coexpression alone does not imply coregulation. Genes may be expressed within a given type of cell through multiple and cascading responses to a single stimulus, to say nothing of the complexity that exists in heterogeneous tissues containing multiple cell types. Thus, even though the module sampler has the capacity to ignore sequences that do not contain a cis module pattern common to the rest, we recommend that this approach be applied only to sets of genes that are likely to be coregulated, for example, to subsets from large-scale expression studies in cell cultures that follow a consistent and common time course. For such sets of coregulated genes, the present results indicate potential to unravel the mechanisms behind their coexpression patterns, through the identification of CRMs and specific TFBSs, and our findings also show that the resulting models may be employed to identify additional genes that are regulated in a similar manner.
We adopted a procedure similar to that of Wasserman et al. (2000
Modules for different genes may vary in the total number of contributing transcription factors, and in both the number and order of binding sites for each type of factor. Some of the sequences in a data set may contain no sites at all, or no sites for one or more of the factors involved in regulating the rest. Because the algorithm is focused exclusively on cis-regulation, it makes no direct inferences regarding the trans components, the transcription factors. Rather, it infers the DNA binding patterns of unspecified transcription factors in the form of p different statistical models or motifs. The algorithm infers the total number, 0 k kmax, of TFBSs in each module, and the overall distribution of the number of sites per module. Because the order of the sites among the CRMs may vary but still reflect ordering preferences arising from protein-protein interactions, the algorithm also infers the frequencies of neighboring pairs of TFBSs. To address our key aim of finding modules without prior information on motif binding patterns, we employed uniform (i.e., uninformed) prior motif models. In addition, modules are kept compact by the requirement that no successive pair of TFBSs within a CRM be separated by more than 100 bp.
The algorithm has two phases. The forward phase, as described in the Supplemental material, uses recursive sums over all possible alignments of 0
This work was supported by NIH grant R01HG01257, and NSF grant NSF-DMS 0204674, and a grant from the Canadian Institutes of Health Research (to W.W.W.).
5 Corresponding author. E-MAIL thompson{at}wadsworth.org; FAX (518) 402-4623. [Supplemental material is available online at www.genome.org.] Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2589004.
Aerts, S., Van Loo, P., Thijs, G., Moreau, Y., and De Moor, B. 2003. Computational detection of cis-regulatory modules. Bioinformatics 19: 5ii-14ii.[Abstract] Aicher, W.K., Sakamoto, K.M., Hack, A., and Eibel, H. 1999. Analysis of functional elements in the human Egr-1 gene promoter. Rheumatol. Int. 18: 207-214.[CrossRef][Medline] Bischoff, C., Kahns, S., Lund, A., Jorgensen, H.F., Praestegaard, M., Clark, B.F.C., and Leffers, H. 2000. The Human Elongation Factor 1 A-2 Gene (EEF1A2): Complete sequence and characterization of gene structure and promoter activity*1. Genomics 68: 63-70.[CrossRef][Medline]
Boffelli, D., McAuliffe, J., Ovcharenko, D., Lewis, K.D., Ovcharenko, I., Pachter, L., and Rubin, E.M. 2003. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299: 1391-1394. Brenner, V. 1998. Von der Sequenz zur Funktion: Genomanalyse einer 102 KB-Region des humanen X-Chromosoms. Friedrich-Schiller University, Jena, Germany.
Carson, J.A., Fillmore, R.A., Schwartz, R.J., and Zimmer, W.E. 2000. The smooth muscle Downes, M., Mynett-Johnson, L., and Muscat, G.E. 1994. The retinoic acid and retinoid X receptors are differentially expressed during myoblast differentiation. Endocrinology 134: 2658-2661.[Abstract]
Fohr, U.G., Weber, B.R., Muntener, M., Staudenmann, W., Hughes, G.J., Frutiger, S., Banville, D., Schafer, B.W., and Heizmann, C.W. 1993. Human
Frith, M.C., Spouge, J.L., Hansen, U., and Weng, Z. 2002. Statistical significance of clusters of motifs represented by position specific scoring matrices in nucleotide sequences. Nucleic Acids Res. 30: 3214-3224. Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. 1995. Bayesian Data Analysis. Chapman and Hall, London, UK.
Georgiades, P. and Brickell, P.M. 1997. Differential expression of the rat retinoid X receptor
Halfon, M.S. and Michelson, A.M. 2002. Exploring genetic regulatory networks in metazoan development: Methods and models. Physiol. Genomics 10: 131-143. Hazama, M., Watanabe, D., Suzuki, M., Mizoguchi, A., Pastan, I., and Nakanishi, S. 2002. Different regulatory sequences are required for parvalbumin gene expression in skeletal muscles and neuronal cells of transgenic mice. Molecular Brain Research 100: 53-66.[Medline]
Knudsen, S.M., Frydenberg, J., Clark, B.F., and Leffers, H. 1993. Tissue-dependent variation in the expression of elongation factor-1
Krivan, W. and Wasserman, W.W. 2001. A predictive model for regulatory sequences directing liver-specific transcription. Genome Res. 11: 1559-1566. Latchman, D.S. 1998. Eukaryotic transcription factors. Academic Press, San Diego, CA. Liu, J., Neuwald, A., and Lawrence, C. 1995. Bayesian models for multiple local sequence alignment and Gibbs sampling strategies. J. Am. Stat. Assoc. 432: 1156-1170.[CrossRef] Liu, J.S., Neuwald, A.F., and Lawrence, C.E. 1999. Markovian structures in biological sequence alignments. J. Amer. Stat. Assoc. 94: 1-15.[CrossRef] Liu, J.S., Gupta, M., Liu, X., Mayerhofer, L., and Lawrence, C.E. 2002. Statistical models for biological sequence motif discovery. In Case studies in Bayesian statistics VI (eds. C. Gatsonis et al.), pp. 3-22. Springer-Verlag, New York. Marsan, L. and Sagot, M.F. 2000. Algorithms for extracting structured motifs using a suffix tree with an application to promoter and regulatory site consensus identification. J. Comput. Biol. 7: 345-362.[CrossRef][Medline]
Matys, V., Fricke, E., Geffers, R., Gossling, E., Haubrock, M., Hehl, R., Hornischer, K., Karas, D., Kel, A.E., Kel-Margoulis, O.V., et al. 2003. TRANSFAC(R): Transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 31: 374-378.
McCue, L.A., McDonough, K., and Lawrence, C.E. 2000. Functional classification of cNMP-binding proteins and nucleotide cyclases with implications for novel regulatory pathways in mycobacterium tuberculosis. Genome Res. 10: 204-219.
McCue, L.A., Thompson, W., Carmack, C.S., and Lawrence, C.E. 2002. Factors influencing the identification of transcription factor binding sites by cross-species comparison. Genome Res. 12: 1523-1532.
Pruitt, K.D. and Maglott, D.R. 2001. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29: 137-140.
Rajewsky, N., Socci, N.D., Zapotocky, M., and Siggia, E.D. 2002a. The Evolution of DNA regulatory regions for proteo- Rajewsky, N., Vergassola, M., Gaul, U., and Siggia, E. 2002b. Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics 3: 30.[CrossRef][Medline] Rebhan, M., Chalifa-Caspi, V., Prilusky, J., and Lancet, D. 1997. GeneCards: Encyclopedia for genes, proteins and diseases. Weizmann Institute of Science, Bioinformatics Unit and Genome Center, Rehovot, Israel.
Sandelin, A., Alkema, W., Engstrom, P., Wasserman, W.W., and Lenhard, B. 2004. JASPAR: An open-access database for eukaryotic transcription factor binding profiles. Nucleic Acids Res. 32: D91-94.
Schneider, T.D. and Stephens, R.M. 1990. Sequence logos: A new way to display consensus sequences. Nucleic Acids Res. 18: 6097-6100. Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D., and Miller, W. 2003. Human-mouse alignments with BLASTZ. Genome Res. 13: 103-107. Segal, E. and Sharan, R. 2004. A discriminative model for identifying spatial cis-regulatory modules. In Proceedings of the 8th annual international conference on computational molecular biology, pp. 141-149. Segal, E., Yelensky, R., and Koller, D. 2003. Genome-wide discovery of transcriptional modules from DNA sequence and gene expression. Bioinformatics 19: 273i-282.[Abstract] Shore, P. and Sharrocks, A.D. 1995. THe MADS-box family of transcription factors. Eur. J. Biochem. 229: 1-13.[Medline] Storey, J.D. 2002. A direct approach to false discovery rates. J. Royal Stat. Soc. B 64: 479-498.[CrossRef]
Storey, J.D. and Tibshirani, R. 2003. Statistical significance for genomewide studies. PNAS 100: 9440-9445.
Thompson, W., Rouchka, E.C., and Lawrence, C.E. 2003. Gibbs recursive sampler: Finding transcription factor binding sites. Nucleic Acids Res. 31: 3580-3585.
Tsai, J.C., Liu, L., Cooley, B.C., Dichiara, M., Topper, J., and Aird, W.C. 2000. The Egr-1 promoter contains information for constitutive and inducible expression in transgenic mice. FASEB J. 14: 1870-1872. Valle, G., Faulkner, G., De Antoni, A., Pacchioni, B., Pallavicini, A., Pandolfo, D., Tiso, N., Toppo, S., Trevisan, S., and Lanfranchi, G. 1997. Telethonin, a novel sarcomeric protein of heart and skeletal muscle. FEBS Lett. 415: 163-168.[CrossRef][Medline] Venables, W.N. and Ripley, B.D. 1999. Modern applied statistics with S-Plus. Springer, New York. Vincent, S.R., Kwasnicka, D.A., and Fretier, P. 2000. A Novel RING finger-b box-coiled-coil protein, GERP. Biochem. Biophys. Res. Comm. 279: 482-486.[CrossRef][Medline] Wasserman, W.W. and Fickett, J.W. 1998. Identification of regulatory regions which confer muscle-specific gene expression. J. Mol. Biol. 278: 167-181.[CrossRef][Medline] Wasserman, W.W. and Krivan, W. 2003. In silico identification of metazoan transcriptional regulatory regions. Naturwissenschaften 90: 156-166.[Medline] Wasserman, W.W., Palumbo, M., Thompson, W., Fickett, J.W., and Lawrence, C.E. 2000. Human-mouse genome comparisons to locate regulatory sites. Nat. Genet. 26: 225-228.[CrossRef][Medline] Wei, G.H., Liu, D.P., and Liang, C.C. 2004. Charting gene regulatory networks: Strategies, challenges and perspectives. Biochem. J. Epub ahead of print. Yuh, C.-H., Dorman, E.R., Howard, M.L., and Davidson, E.H. 2004. An otx cis-regulatory module: A key node in the sea urchin endomesoderm gene regulatory network. Dev. Biol. 269: 536-551.[CrossRef][Medline]
http://zlab.bu.edu/~mfrith/comet/; COMET. http://jaspar.cgb.ki.se; JASPAR database. http://bayesweb.wadsworth.org/gibbs/module; Module sampler results and data. http://www.repeatmasker.org/; RepeatMasker. http://www.gene-regulation.com; TRANSFAC.
Received March 17, 2004; accepted in revised format July 22, 2004. This article has been cited by other articles:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||