Published online before print
July 15, 2004, 10.1101/gr.1953904 Genome Res. 14:1562-1574, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Letter
Clustering of DNA Sequences in Human Promoters
1 Genome Analysis Unit, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
2 Laboratory of Metabolism, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
We have determined the distribution of each of the 65,536 DNA sequences that are eight bases long (8-mer) in a set of 13,010 human genomic promoter sequences aligned relative to the putative transcription start site (TSS). A limited number of 8-mers have peaks in their distribution (cluster), and most cluster within 100 bp of the TSS. The 156 DNA sequences exhibiting the greatest statistically significant clustering near the TSS can be placed into nine groups of related sequences. Each group is defined by a consensus sequence, and seven of these consensus sequences are known binding sites for the transcription factors (TFs) SP1, NF-Y, ETS, CREB, TBP, USF, and NRF-1. One sequence, which we named Clus1, is not a known TF binding site. The ninth sequence group is composed of the strand-specific Kozak sequence that clusters downstream of the TSS. An examination of the co-occurrence of these TF consensus sequences indicates a positive correlation for most of them except for sequences bound by TBP (the TATA box). Human mRNA expression data from 29 tissues indicate that the ETS, NRF-1, and Clus1 sequences that cluster are predominantly found in the promoters of housekeeping genes (e.g., ribosomal genes). In contrast, TATA is more abundant in the promoters of tissue-specific genes. This analysis identified eight DNA sequences in 5082 promoters that we suggest are important for regulating gene expression.
Vertebrate gene expression is often regulated by the basal promoter, which traditionally is defined as being between 200 bp and the transcription start site (TSS). The DNA sequence properties of basal promoters are poorly described because it is difficult to identify the TSS. Two recent results have helped to resolve this problem: (1) RefSeq (Maglott et al. 2000
A fundamental question in gene expression studies is which of the DNA sequences bound by TFs are biologically relevant. Often, the same DNA sequence is functional in one context but not in another. We reasoned that if a DNA sequence clusters relative to the TSS, the DNA sequences that are in the cluster have a high likelihood of being biologically significant. In human promoters the CAAT box, SP1 site, and TATA box are recognized by the constitutive transcription factors NF-Y, SP1, and TBP, respectively, and are thought to be localized near the TSS (Breathnach and Chambon 1981
To identify additional DNA sequences that localize near the TSS and thus may be biologically important, we determined the distribution of each of the 65,536 8-mer DNA sequences in 13,010 human promoter sequences from 2500 to 500 bp relative to the TSS. A detailed analysis of the 8-mers with the most significant clustering indicates that they primarily represent variations of only nine DNA consensus sequences. Eight motifs cluster between 100 and the TSS. They include (1) TF binding sites that have been previously suggested to cluster within the promoter (CAAT, SP1, CREB, and TATA); (2) TF binding sites that were not known to localize in the core promoter region (ETS, NRF-1, and USF); and (3) a single DNA sequence, designated Clus1, that is not a known TF binding site. The ninth motif is the Kozak sequence, which clusters downstream of the TSS. We observe correlations between the presence of DNA sequences that cluster in promoters and the mRNA expression properties and function of genes.
We combined the cDNA data for RefSeq genes (Maglott et al. 2000
Distribution of Dinucleotide Pairs in Promoters
Distribution of All 8-mer DNA Sequences
To identify DNA sequences that cluster relative to the TSS, we determined the distribution of all sequences ranging from 2-mers to 8-mers in this set of 13,010 promoter sequences. As the length of the DNA sequence increased, we identified sequences that clustered more dramatically. This manuscript will focus on the distribution of 8-mers. Because both strands of complementary DNA were examined, the number of independent 8-mers was reduced from 65,536 to 32,896 (32,640 nonpalindromic 8-mers + 256 palindromic 8-mers).
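The reduction from 65,536 to 32,896 independent 8-mers can be checked by direct enumeration. The following is a minimal sketch, not the authors' code: each 8-mer is paired with its reverse complement and the pair is represented by whichever string sorts first. The function names are illustrative assumptions.

```python
from itertools import product

# Translation table mapping each base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence read 5' to 3' on the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Pick one representative for a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def count_strand_independent_kmers(k: int):
    """Return (all k-mers, palindromic k-mers, strand-independent k-mers)."""
    total = 0
    palindromes = 0
    representatives = set()
    for letters in product("ACGT", repeat=k):
        kmer = "".join(letters)
        total += 1
        if kmer == reverse_complement(kmer):
            palindromes += 1
        representatives.add(canonical(kmer))
    return total, palindromes, len(representatives)


if __name__ == "__main__":
    print(count_strand_independent_kmers(8))
```

For k = 8 the three counts are 65,536, 256, and 32,896, matching the figures in the text: (65,536 - 256) / 2 = 32,640 nonpalindromic pairs plus 256 palindromes.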
To determine if a DNA sequence clustered, the mean (
Three controls evaluated the significance of the observed localization of particular 8-mers near the TSS. The distribution plot of a seventh-order Markov random data set (see below) shows a complete lack of clustering for any of the 8-mers (Fig. 2C). Figure 2D presents the CF values between 2500 and 1000 bp for the 13,010 putative promoter sequences in which no preferential localization is observed for the 7471 8-mers that contain at least 20 members in the most abundant bin. To further characterize the unique nature of the distribution of 8-mers around the TSS, we performed an experiment in which we aligned the 13,010 sequences based on a translocation of the putative TSS within a random distance between 0 and 500 bp upstream or downstream (Fig. 2E). The CF distribution for this data set does not identify sequences that cluster.
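A seventh-order Markov control of the kind mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the next base is drawn according to how often it follows the preceding seven bases in the training sequences (equivalent to using observed 8-mer counts), and the helper names `train_markov` and `generate` are hypothetical.

```python
import random
from collections import Counter, defaultdict


def train_markov(seqs, order=7):
    """Count which base follows each `order`-long context in the input."""
    table = defaultdict(Counter)
    for s in seqs:
        for i in range(len(s) - order):
            table[s[i:i + order]][s[i + order]] += 1
    return table


def generate(table, length, order=7, rng=None):
    """Emit one random sequence whose (order+1)-mer usage follows `table`."""
    rng = rng or random.Random()
    contexts = list(table)
    weights = [sum(table[c].values()) for c in contexts]
    # Seed the sequence with a context drawn by its observed frequency.
    seq = list(rng.choices(contexts, weights=weights)[0])
    while len(seq) < length:
        followers = table.get("".join(seq[-order:]))
        if not followers:
            # Assumption: on an unseen context, fall back to a random known one.
            followers = table[rng.choices(contexts, weights=weights)[0]]
        bases, counts = zip(*followers.items())
        seq.append(rng.choices(bases, weights=counts)[0])
    return "".join(seq)[:length]


if __name__ == "__main__":
    rng = random.Random(0)
    toy = ["".join(rng.choice("ACGT") for _ in range(300)) for _ in range(20)]
    model = train_markov(toy)
    print(generate(model, 60, rng=rng))
```

Because such a model preserves local 8-mer composition but carries no positional information, any clustering that survives in the real data and vanishes in the generated data can be attributed to position relative to the TSS.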
To determine the statistical significance of the CF values, we converted the CF into a probability term. One thousand random data sets, each containing 13,010 sequences that are 1500 bp long, were generated by using the 8-mer frequencies observed in the original data set. For each of the 1000 data sets, the distribution and CFexpt value for all 32,896 8-mers were determined. From the 1000 separate CFexpt values for each 8-mer, the frequency distribution was plotted, and then a mean (
To determine if the 8-mers with the highest CF values are also the most abundant sequences, we compared the abundance of these sequences in the 13,010 promoters between 1000 and 500 bp to the abundance of all 32,896 8-mers (Fig. 4). To determine the abundance of a sequence, we counted the total number of occurrences of each 8-mer between 1000 and 500 bp in the set of 13,010 genes. The overall prevalence of the different 8-mers varies widely. On average, we expect 590 occurrences of an 8-mer across the whole promoter region in 13,010 sequences. The observed occurrences, however, span a very wide range: from a minimum of 12 for the palindrome TCGTACGA to a maximum of 43,517 for TTTTTTTT. Although the 156 8-mers that showed the most significant clustering in bins 45 through 56 are not the rarest sequences in promoters, they do not appear to represent only the abundant 8-mers. Although repetitive sequences are frequently ignored (masked) in promoter analyses, we have not excluded them in this study. Our rationale is that such sequences may actually contain active control elements by virtue of their specific location.
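The expected count of about 590 can be reproduced with simple arithmetic. The sketch below assumes a 1500-bp region per promoter and that every 8-bp window contributes exactly one of the 32,896 strand-independent 8-mers; the text does not spell out the derivation, so this is one plausible reading.

```python
N_PROMOTERS = 13_010      # promoter sequences in the data set
REGION_LENGTH = 1_500     # assumption: bp analyzed per promoter
K = 8                     # word length
N_DISTINCT = 32_896       # strand-independent 8-mers

# Number of overlapping 8-bp windows in one 1500-bp sequence.
windows_per_promoter = REGION_LENGTH - K + 1

# Expected occurrences if every strand-independent 8-mer were equally likely.
expected = N_PROMOTERS * windows_per_promoter / N_DISTINCT

print(round(expected))
```

Under these assumptions the uniform expectation is roughly 590 occurrences per 8-mer, against which the observed extremes of 12 and 43,517 can be compared.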
The 159 8-mers with a P value ≥ 7, a one in 10 million single sampling probability of occurring by chance, were examined in more detail. One hundred fifty-six of these sequences had peaks between bin 45 and bin 56. Those sequences were found to be composed of overlapping sequences that could be grouped into nine distinct classes (Table 1). The manual placement of an 8-mer into a particular group was guided by (1) the similarity among DNA sequences, (2) the shapes of the distribution histogram, and (3) the peak position relative to the TSS. Seven of the DNA sequences that cluster are known TF binding sites, listed in the order in which they appear in promoters, starting with the most 5' member: CCAAT, SP1, USF, CREB, TATA, NRF-1, and ETS (Table 1). One sequence, TCTCGCGA, which we name Clus1, did not resemble any known TF DNA-binding site. The final sequence is the Kozak sequence, which is 3' of the TSS and thus is transcribed into mRNA and encodes the initiating methionine of the protein. The distribution of the 8-mer with the greatest probability of having a nonrandom distribution (P = 40.6) is shown in Figure 5A, and the distribution of the 159th 8-mer, with P = 7.0, is shown in Figure 5B.
Extending DNA Sequences That Cluster
For each of the eight groups in Table 1 that are 5' of the TSS, we manually constructed a consensus sequence, some of which extended 5' and 3' beyond the central core of identity that initially guided the formation of the groups. For example, Figure 6A shows the result of expanding the 5-mer CCAAT sequence to the degenerate 9-mer (RRCCAATSR). The background is dramatically reduced, whereas the peak height is not. The longer consensus sequences provide greater confidence that the sequences within the peak may be functional TF binding sites in their promoters because the occurrence of the sequence outside the peak is so low.
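Locating a degenerate consensus such as RRCCAATSR and binning its positions can be sketched as below. This is an illustrative reconstruction, not the authors' pipeline: it uses the standard IUPAC nucleotide codes, scans only the given strand, and the `bin_width` default of 20 bp is an assumption because the bin size is not stated in this text.

```python
import re
from collections import Counter

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus: str):
    """Compile a consensus into a regex that also finds overlapping hits."""
    body = "".join(IUPAC[c] for c in consensus.upper())
    return re.compile("(?=(" + body + "))")


def match_positions(consensus: str, seq: str):
    """Return the 0-based start of every match on the given strand."""
    return [m.start() for m in consensus_to_regex(consensus).finditer(seq.upper())]


def position_histogram(consensus: str, seqs, bin_width: int = 20):
    """Count matches per positional bin across aligned sequences."""
    hist = Counter()
    for s in seqs:
        for pos in match_positions(consensus, s):
            hist[pos // bin_width] += 1
    return hist


if __name__ == "__main__":
    promoter = "TTGACCAATCGTT"  # toy sequence, not real data
    print(match_positions("CCAAT", promoter))
    print(match_positions("RRCCAATSR", promoter))
```

With promoters aligned on the TSS, comparing the histogram of the short core with that of the extended consensus shows whether the extension lowers the background while preserving the peak. Scanning the reverse complement as well would be needed to cover both strands.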
For each consensus, we also varied the identity of each base to determine if the related DNA sequences also clustered. The general result was that related sequences do not cluster.
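The systematic variation of a consensus can be sketched by substituting each of the other three bases at every chosen position. This is a minimal illustration with a hypothetical helper name; varying the five invariant bases of RRCCAATSR yields 5 x 3 = 15 related sequences, the number mentioned later in the text.

```python
def single_base_variants(consensus: str, positions):
    """Return every sequence that differs from `consensus` by one base
    substitution at one of the given 0-based positions."""
    variants = []
    for i in positions:
        for base in "ACGT":
            if base != consensus[i]:
                variants.append(consensus[:i] + base + consensus[i + 1:])
    return variants


if __name__ == "__main__":
    # Positions 2..6 hold the invariant CCAAT core of RRCCAATSR.
    for variant in single_base_variants("RRCCAATSR", range(2, 7)):
        print(variant)
```

Each variant can then be scanned across the aligned promoters in the same way as the original consensus to see whether it also forms a peak.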
Eight Clustering Sequences Upstream of the TSS
CAAT
SP1
Clus1
The 8-mer sequence TCTCGCGA that we termed Clus1 is found in 140, or 1.1%, of promoters. No related sequence is observed to cluster when each of the 8 bp is varied (Fig. 7B). No described TF is known to bind this sequence.
USF
CRE
Eleven 8-mers contain the six-base sequence GTGACG. These appear to be of two classes. One class may be degenerate CRE and/or USF sites or a binding site for an unknown TF. The second class, for example, the complement of GAAGTGAC (GTCACTTC), can be extended to the clustering 11-mer CCGGAAGTGAC, which is the juxtaposition of an ETS sequence and half of a CRE sequence (data not shown).
TATA
NRF-1
ETS
Four 8-mers contain a 1-bp variant of the ETS sequence, the 6-mer GCGGAA. The extension of this sequence is the 9-mer RGCGGAAGY found in 243, or 1.9%, of promoters. DNA-binding site selection experiments indicate that this ETS site variant is bound by the PEA-3 subfamily of ETS proteins (Brown and McKnight 1992
Kozak Sequence
Transcription Factor DNA-Binding Sites That Do Not Cluster
To determine if clustering is a property exhibited by all TF binding sites, we determined the distribution of 193 DNA sequences reported to be TF binding sites found in the TRANSFAC Database, version 3.4 (http://transfac.gbf.de/TRANSFAC; Matys et al. 2003
A couple of DNA sequences have been implicated in the initiation of polymerase II transcription. These include the initiator (YYANWYY; Lo and Smale 1996
DNA Sequences That Cluster Occur Together in Promoters
Clustering DNA Sequences Correlate With Biological Activity
We examined whether the presence of clustered DNA sequences in promoters predicts the mRNA expression properties of the corresponding genes. Initially, we divided genes into two groups depending on whether or not they had a Gene Ontology (GO) annotation that was indicative of some biochemical insight into the function of the gene (Ashburner and Lewis 2002
We next examined whether clustering sequences were found in the promoters of genes with a related function (Table 3). The most general observation is that genes encoding proteins involved in essential cellular functions, for example, translation (ribosome) and degradation (proteasome), often have ETS sequences in their promoters. For example, the ETS sequence clusters in 8% of promoters but is observed in 23% of ribosomal genes, 43% of mitochondrial ribosomal genes, and 42% of proteasomal genes. NRF-1 and Clus1 are preferentially observed in the ribosomal genes, but unlike ETS sequences, they are not observed in proteasomal genes. This suggests that a combinatorial system of DNA sequences is used to regulate expression of functionally related genes. These data are in sharp contrast to the 147 channel-related genes, which do not have a single ETS or Clus1 sequence in their promoters.
We also determined whether clustering of DNA in promoters correlates with tissue-specific mRNA expression (Table 3). The Web site (http://expression.gnf.org) contains mRNA expression data for 29 human tissues derived from microarray data (Su et al. 2002 To ascertain if the sequence within the peak had different properties than the same sequence outside the peak, we examined the mRNA expression profile for promoters that contained ETS, NRF-1, or Clus1 sequences outside the clustering peak. These DNA sequences outside the peak do not correlate with housekeeping genes (Table 4). In addition, promoters containing the ETS sequences in the peak contain 4.5 times more mitochondrial ribosomal genes than expected. In contrast, promoters containing the ETS sequences outside the peak contain 0.8 times the number of mitochondrial ribosomal genes as expected. This type of analysis provides greater assurance that individual ETS sequences that occur in the peak are biologically important. A similar analysis of TATA sequences indicates that the observation that TATA correlates with tissue-specific gene expression is only true for those TATA sequences under the peak.
We determined the distribution of all 65,536 8-mer DNA sequences in 13,010 human promoters relative to the TSS. One hundred fifty-nine sequences clustered relative to the TSS with a random single sampling probability of less than one in 10 million (P ≥ 7). One hundred fifty-six of the 159 sequences clustered near the TSS and were variants of nine sequences: eight were 5' of the TSS, and one (Kozak) was 3'. Seven of the eight DNA sequences that cluster upstream of the TSS are known TF binding sites. The distribution of the TF binding sites relative to the TSS is different for each sequence. The CAAT and SP1 sequences cluster at around 100 bp, whereas the other sequences cluster closer to the TSS. Additional sequences may also cluster but were not identified because our analysis was limited to those 8-mers that occurred frequently enough in 13,010 promoters to allow reliable analysis.
An enigma in eukaryotic promoter analysis is that not all DNA sequences that can be bound by a TF are biologically relevant. We suggest, however, that if a particular DNA sequence is observed in the same position relative to the TSS, it is likely that the individual DNA sequences that make up the cluster are important for regulating gene expression of their promoters. We identify 5082 promoters that contain one or more of eight DNA sequences that cluster. For each of these eight DNA sequence families, we generated a consensus sequence. Although our approach permits us to identify sequences that are likely to be biologically relevant, it does not imply that related DNA sequences are unimportant. It could simply be that the related sequences are not sufficiently abundant to form a peak.
A prevailing theme in gene expression studies is that TFs bind a variety of related DNA sequences to regulate gene expression. To determine if variants of the DNA sequences we identified also clustered, we systematically varied each base in the consensus DNA sequences. Different results are obtained for each consensus sequence. For example, when the five invariant bases of the RRCCAATSR consensus are individually varied, none of the 15 related DNA sequences cluster. However, for the ETS consensus sequences, the variants VCCGGAARY and VGCGGAARY both form peaks. In vitro DNA-binding selection experiments have shown that different ETS family members preferentially bind one or the other of these two sequences (Brown and McKnight 1992
Three of the nine sequences highlighted in this analysis are palindromic (CREB, USF, and NRF-1), although only 0.4% (256/65,536) of all octamers are palindromic. Two properties of palindromes may explain their predominance as important TF binding sites. First, palindromes can be bound on either strand of DNA, thus doubling their concentration and increasing the number of productive encounters between the TF and the DNA. Second, palindromes can be bound by dimeric proteins, as is known to be the case for the CREB and USF sites. Monomers dimerize to double their local concentration and now bind palindromic DNA that, again, is in a higher concentration because it can be "viewed" on both strands of DNA. Both of these effects make palindromic sequences "attractive" structures for TFs to bind.
Coordinate gene expression is a hallmark of genetic regulation and may be mediated by TFs that bind in the promoters of coordinately regulated genes. We have addressed this issue by determining if correlations exist between the presence of a particular DNA sequence in a cluster and the mRNA expression properties and/or function of the gene. In general, at the mRNA expression level, genes whose promoters contain ETS, NRF-1, and Clus1 tend to be housekeeping genes. Looking at gene function, the ancient classes of essential housekeeping genes, for example, the ribosomal and proteasome genes, have the ETS, NRF-1, and Clus1 sequences in their promoters. The significance of these results is bolstered by the observation that the promoters of mitochondrial ribosomal genes have NRF-1 sites (Scarpulla 2002
TATA is often considered the prototypical DNA promoter element; in this analysis, however, TATA is an exception. It is the only TF binding site to show a strand-specific distribution, it has the sharpest peak, it has the highest background, it has the most variant sequences that cluster, and it is the only TF binding site that positively correlates with tissue-specific gene expression. In the Eukaryotic Promoter Database (EPD), 51% of the genes contain a TATA (Davuluri et al. 2000
This study did not identify a single DNA sequence that clustered relative to the TSS for the majority of promoters. Thus, if such a sequence exists, it is sufficiently degenerate to be missed by this analysis. Previous studies of eukaryotic promoters have identified the initiator element as a DNA sequence that can act with or without TATA to direct accurate transcription initiation by RNA polymerase II (Smale 1997 The observation that many clustering sequences positively correlate with the presence of additional clustering sequences, including themselves, suggests that promoters tend to contain multiple TF binding sites. This analysis has identified key DNA sequences in 5082 promoters that cluster relative to the TSS and thus may be important for regulating gene expression. We expect that analyses using less stringent parameters may identify additional DNA sequences that are critical for gene expression.
Data Set Generation
We combined the DNA sequence data for the annotated RefSeq genes in the Golden Path Human Genome Assembly (version December 2001; Kent et al. 2002
8-mer Analysis
CF Calculation
Calculation of P Value for Distribution
The clustering and graphing of the data were performed using Excel (Microsoft) and/or Grace (http://plasma-gate.weizmann.ac.il/Grace/). A collection of 193 TF binding motifs was selected from the TRANSFAC database (version 3.4) for analysis of the distribution of TF binding sites across the 1500 bp.
Tissue Specificity Classification
To classify those genes, we defined two floating cutoffs, the values of which were specific to each gene and depended on the maximum expression value of that gene: (1) the high expression cutoff, the value that is 70% of the maximum expression level for the given gene, always staying within the limits of
Genes whose expression level exceeded the high expression cutoff in one or two samples and, at the same time, exceeded the middle expression cutoff in four or fewer samples were classified as tissue-specific genes (12.6% of the 6744 genes). Genes whose expression level exceeded the middle or high expression cutoff in at least 62 of the 63 samples were classified as housekeeping genes (9.2% of the 6744 genes).
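The two classification rules can be sketched as a small function. Because the exact definitions of the cutoffs are incomplete in this text, the sketch takes both cutoffs as arguments instead of computing them; the function name, the "other" label, and the argument names are assumptions made for illustration.

```python
def classify_gene(expression, high_cutoff, middle_cutoff, n_housekeeping=62):
    """Classify one gene from its per-sample expression values.

    `high_cutoff` and `middle_cutoff` are the gene-specific floating cutoffs;
    how they are derived is left to the caller (assumption).
    """
    n_high = sum(v > high_cutoff for v in expression)
    n_middle = sum(v > middle_cutoff for v in expression)
    n_either = sum(v > middle_cutoff or v > high_cutoff for v in expression)

    # Above the high cutoff in one or two samples, and above the middle
    # cutoff in four or fewer samples.
    if 1 <= n_high <= 2 and n_middle <= 4:
        return "tissue-specific"
    # Above the middle or high cutoff in at least 62 of the 63 samples.
    if n_either >= n_housekeeping:
        return "housekeeping"
    return "other"


if __name__ == "__main__":
    toy = [100.0] + [1.0] * 62  # toy profile over 63 samples, not real data
    print(classify_gene(toy, high_cutoff=70.0, middle_cutoff=30.0))
```

Passing the cutoffs in explicitly keeps the rule logic separate from the cutoff definition, so either can be adjusted without touching the other.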
Calculation of P Value for Subsets in a Set
is combinatorial combination.
Then we calculated the integrated probability that our observed value (m*) occurred at greater than expected frequency, if m* is greater than the most probable value of m
The value of P indicates the statistical probability of numbers being nonrandom: The greater the number, the more statistically nonrandom the result. For instance, in Table 2, in the CCAAT-SP1 intersection: there are 13,010 genes in total, 994 of them have a CCAAT site (7.6%), 2696 have a SP1 site (20.7%), and 288 have both sites. Thus, of CCAAT genes 29.1% have SP1, which is
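The co-occurrence example can be worked through with a hypergeometric tail probability. This is a sketch under assumptions, since the original formula is not recoverable from this text: it treats the 994 CCAAT promoters as a draw from 13,010 promoters of which 2696 carry SP1, sums the probability of 288 or more overlaps, and reports P as the negative base-10 logarithm of that probability (consistent with P = 7 corresponding to one in 10 million).

```python
import math


def log_comb(n: int, k: int) -> float:
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_upper_tail(total, with_b, with_a, both_observed):
    """P(overlap >= both_observed) when `with_a` items are drawn at random
    from `total` items, of which `with_b` carry the second site."""
    log_den = log_comb(total, with_a)
    lo = max(both_observed, with_a - (total - with_b))
    hi = min(with_a, with_b)
    logs = [
        log_comb(with_b, m) + log_comb(total - with_b, with_a - m) - log_den
        for m in range(lo, hi + 1)
    ]
    if not logs:
        return 0.0
    top = max(logs)  # factor out the largest term for numerical stability
    return math.exp(top) * sum(math.exp(x - top) for x in logs)


if __name__ == "__main__":
    total, ccaat, sp1, both = 13_010, 994, 2_696, 288
    expected = ccaat * sp1 / total
    tail = hypergeom_upper_tail(total, sp1, ccaat, both)
    print(f"expected overlap: {expected:.1f}")
    print(f"P = -log10(tail): {-math.log10(tail):.1f}")
```

Under random association about 206 promoters would be expected to carry both sites, so the observed 288 is a clear excess; the printed P quantifies how unlikely that excess is under the stated model and need not match the value in Table 2 exactly.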
We thank Barbara Graves for conversations about ETS DNA binding, Robert Perry for conversations about ribosomal gene promoters, and David FitzGerald for comments on the manuscript. This study used the high-performance computational capabilities of the Biowulf PC/Linux cluster at the National Institutes of Health, Bethesda, Maryland (http://biowulf.nih.gov). The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1953904. Article published online before print in July 2004.
3 Corresponding author.
Ashburner, M. and Lewis, S. 2002. On ontologies for biologists: The Gene Ontology: Untangling the web. Novartis Found. Symp. 247: 66-80.
Bendall, A.J. and Molloy, P.L. 1994. Base preferences for DNA binding by the bHLH-Zip protein USF: Effects of MgCl2 on specificity and comparison with binding of Myc family members. Nucleic Acids Res. 22: 2801-2810.
Boyd, K.E. and Farnham, P.J. 1999. Coexamination of site-specific transcription factor binding and promoter activity in living cells. Mol. Cell. Biol. 19: 8393-8399.
Breathnach, R. and Chambon, P. 1981. Organization and expression of eucaryotic split genes coding for proteins. Annu. Rev. Biochem. 50: 349-383.
Brown, T.A. and McKnight, S.L. 1992. Specificities of protein-protein and protein-DNA interaction of GABP
Conkright, M.D., Guzman, E., Flechner, L., Su, A.I., Hogenesch, J.B., and Montminy, M. 2003. Genome-wide analysis of CREB target genes reveals a core promoter requirement for cAMP responsiveness. Mol. Cell 11: 1101-1108.
Davuluri, R.V., Suzuki, Y., Sugano, S., and Zhang, M.Q. 2000. CART classification of human 5' UTR sequences. Genome Res. 10: 1807-1816.
Dynan, W.S. and Tjian, R. 1985. Control of eukaryotic messenger RNA synthesis by sequence-specific DNA-binding proteins. Nature 316: 774-778.
Ferre-D'Amare, A.R., Prendergast, G.C., Ziff, E.B., and Burley, S.K. 1993. Recognition by Max of its cognate DNA through a dimeric b/HLH/Z domain. Nature 363: 38-45.
Ferreri, K., Gill, G., and Montminy, M. 1994. The cAMP-regulated transcription factor CREB interacts with a component of the TFIID complex. Proc. Natl. Acad. Sci. 91: 1210-1213.
Geiger, J.H., Hahn, S., Lee, S., and Sigler, P.B. 1996. Crystal structure of the yeast TFIIA/TBP/DNA complex. Science 272: 830-836.
Graves, B.J. and Petersen, J.M. 1998. Specificity within the ets family of transcription factors. Adv. Cancer Res. 75: 1-55.
Hapgood, J.P., Riedemann, J., and Scherer, S.D. 2001. Regulation of gene expression by GC-rich DNA cis-elements. Cell. Biol. Int. 25: 17-31.
Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., and Haussler, D. 2002. The human genome browser at UCSC. Genome Res. 12: 996-1006.
Kim, Y., Geiger, J.H., Hahn, S., and Sigler, P.B. 1993. Crystal structure of a yeast TBP/TATA-box complex. Nature 365: 512-520.
Kutach, A.K. and Kadonaga, J.T. 2000. The downstream promoter element DPE appears to be as widely used as the TATA box in Drosophila core promoters. Mol. Cell. Biol. 20: 4754-4764.
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921.
Lo, K. and Smale, S.T. 1996. Generality of a functional initiator consensus sequence. Gene 182: 13-22.
Maglott, D.R., Katz, K.S., Sicotte, H., and Pruitt, K.D. 2000. NCBI's LocusLink and RefSeq. Nucleic Acids Res. 28: 126-128.
Mangalam, H.J. 2002. tacg: A grep for DNA. BMC Bioinformatics 3: 8.
Mantovani, R. 1999. The molecular biology of the CCAAT-binding factor NF-Y. Gene 239: 15-27.
Matys, V., Fricke, E., Geffers, R., Gossling, E., Haubrock, M., Hehl, R., Hornischer, K., Karas, D., Kel, A.E., Kel-Margoulis, O.V., et al. 2003. TRANSFAC: Transcriptional regulation, from patterns to profiles. Nucleic Acids Res. 31: 374-378.
Mayr, B. and Montminy, M. 2001. Transcriptional regulation by the phosphorylation-dependent factor CREB. Nat. Rev. Mol. Cell. Biol. 2: 599-609.
Moll, J.R., Acharya, A., Gal, J., Mir, A.A., and Vinson, C. 2002. Magnesium is required for specific DNA binding of the CREB B-ZIP domain. Nucleic Acids Res. 30: 1240-1246.
Pruitt, K.D. and Maglott, D.R. 2001. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29: 137-140.
Pruitt, K.D., Katz, K.S., Sicotte, H., and Maglott, D.R. 2000. Introducing RefSeq and LocusLink: Curated human genome resources at the NCBI. Trends Genet. 16: 44-47.
Rice, P., Longden, I., and Bleasby, A. 2000. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet. 16: 276-277.
Romier, C., Cocchiarella, F., Mantovani, R., and Moras, D. 2003. The NF-YB/NF-YC structure gives insight into DNA binding and transcription regulation by CCAAT factor NF-Y. J. Biol. Chem. 278: 1336-1345.
Sawadogo, M. and Roeder, R.G. 1985. Interaction of a gene-specific transcription factor with the adenovirus major late promoter upstream of the TATA box region. Cell 43: 165-175.
Scarpulla, R.C. 2002. Transcriptional activators and coactivators in the nuclear control of mitochondrial function in mammalian cells. Gene 286: 81-89.
Sharrocks, A.D. 2001. The ETS-domain transcription factor family. Nat. Rev. Mol. Cell. Biol. 2: 827-837.
Shaywitz, A.J. and Greenberg, M.E. 1999. CREB: A stimulus-induced transcription factor activated by a diverse array of extracellular signals. Annu. Rev. Biochem. 68: 821-861.
Shuman, J.D., Cheong, J., and Coligan, J.E. 1997. ATF-2 and C/EBP
Sinha, S., Maity, S.N., Lu, J., and de Crombrugghe, B. 1995. Recombinant rat CBF-C, the third subunit of CBF/NFY, allows formation of a protein-DNA complex with CBF-A and CBF-B and with yeast HAP2 and HAP3. Proc. Natl. Acad. Sci. 92: 1624-1628.
Smale, S.T. 1997. Transcription initiation from TATA-less promoters within eukaryotic protein-coding genes. Biochim. Biophys. Acta 1351: 73-88.
Su, A.I., Cooke, M.P., Ching, K.A., Hakak, Y., Walker, J.R., Wiltshire, T., Orth, A.P., Vega, R.G., Sapinoso, L.M., Moqrich, A., et al. 2002. Large-scale analysis of the human and mouse transcriptomes. Proc. Natl. Acad. Sci. 99: 4465-4470.
Suzuki, Y., Yamashita, R., Nakai, K., and Sugano, S. 2002. DBTSS: DataBase of human Transcriptional start sites and full-length cDNAs. Nucleic Acids Res. 30: 328-331.
Trinklein, N.D., Aldred, S.J., Saldanha, A.J., and Myers, R.M. 2003. Identification and functional analysis of human transcriptional promoters. Genome Res. 13: 308-312.
Vinson, C.R., Hai, T., and Boyd, S.M. 1993. Dimerization specificity of the leucine zipper-containing bZIP motif on DNA binding: Prediction and rational design. Genes & Dev. 7: 1047-1058.
Vinson, C., Myakishev, M., Acharya, A., Mir, A.A., Moll, J.R., and Bonovich, M. 2002. Classification of human B-ZIP proteins based on dimerization properties. Mol. Cell. Biol. 22: 6321-6335.
http://genome.nci.nih.gov/publications/promoters; Supplemental data for this paper.
http://transfac.gbf.de/TRANSFAC; the Transcription Factor Database.
http://expression.gnf.org; GNF Gene Expression Atlas.
http://genome.ucsc.edu/; UCSC Genome Bioinformatics site.
http://dbtss.hgc.jp/index.html; database of TSS (DBTSS).
http://plasma-gate.weizmann.ac.il/Grace/; Grace Graphing Software.
http://biowulf.nih.gov; NIH Biowulf cluster.
Received September 9, 2003; accepted in revised form May 18, 2004.