Genome Research
INTRODUCTION
Computational analysis of polymerase II (Pol II) promoters may contribute to improved gene identification and to prediction of the expression context of genes. Before assessing the state of computational promoter recognition per se in the main body of this review, we will provide a context by giving a brief overview of these two problems.
Partitioning a Genome into Genes
Only recently has it become common to determine eukaryotic genomic sequences large enough to contain several genes. With these data comes a new problem for gene finding programs: to partition a set of exons correctly among several genes.
One line of development in eukaryotic gene identification begins with
coding region identification by statistical means and adds pattern
recognition for sites of transcriptional, splicing, and translational
control to produce algorithms capable of suggesting overall gene
structure (for review, see Gelfand 1995; Fickett 1996a). To date, most
development effort has focused on integration of the various kinds of
pattern information in the relatively simple case where a single
complete gene is present in the input sequence. In this case, current
algorithms usually suggest a putative protein translation similar to
that in the literature, though there is still significant room for
improvement (Burset and Guigo 1996). The extension of these algorithms
to deal with a sequence containing multiple or partial genes is just
beginning (Burge and Karlin 1997;
http://gnomic.stanford.edu/~chris/GENSCANW.html). Because the
signals that control the start and stop of transcription and
translation, and the location of splicing, are still not very well
understood, it is not uncommon for a gene-finding algorithm to confuse
internal with initial and terminal exons, thus wrongly partitioning the
exons. The problem is compounded by our incomplete understanding of
alternative splicing control elements.
Another line of development in gene identification is based on homology
(e.g., Gish and States 1993; Gelfand et al. 1996). If there is a close
homolog in the databases to one of the genes in the sequence under
analysis, sequence similarity will usually group the exons for this
gene correctly. Still, in many cases there is no close homolog, and
even when some homolog exists there is no guarantee that the encoded
protein lacks insertions/deletions.
Clearly, some means of recognizing the beginnings of genes, probably
via the promoter, or the ends, probably by means of the polyadenylation
signal or translation termination signal (e.g., Kondrakhin et al. 1994;
Wahle and Keller 1996; Dalphin et al. 1997; Solovyev and Salamov 1997),
would enable a major advance. The promoter seems to be a much richer
signal than the 3′ processing signals, though, as we shall see
below, it is not easy to take advantage of the information in the
promoter.
Determining the Correct Protein Translation
Of course, the single most important goal in gene identification
is to correctly deduce the protein product(s) of the gene. After
partitioning the genome into genes, the greatest difficulty in
eukaryotes is correctly determining the splicing structure. Locating
the correct initiation codon is also a difficult and important step in
this case. If the transcription start site (TSS) is known, and there is
no intron interrupting the 5′-untranslated region, Kozak's (1996)
rules can probably locate the correct initiation codon in most cases.
In prokaryotes the problem is of a different nature. Because
splicing is normally absent, dividing the genome into gene units is
ordinarily straightforward. This does not make the correct deduction of
protein product trivial, however, for finding the correct initiation
codon within an open reading frame (ORF) is difficult. In this case,
promoter location, though useful, does not provide the key information
that it does for eukaryotes because of the existence of multicistronic
operons. Rather, for prokaryotes, the key need is reliable localization
of the ribosome binding site (Shine and Dalgarno 1974).
Determination of Expression Context
Many experimental techniques are being developed for cataloging the expression context of genes (e.g., Prashar and Weismann 1996 and references therein). Development of computer algorithms to predict expression context from genomic sequence has received much less attention but may represent an important opportunity.
Gene expression is regulated at many levels, including chromatin
packing (for review, see Kingston et al. 1996), transcription initiation (see below), polyadenylation (for review, see Wahle and
Keller 1996), splicing (for review, see McKeown 1992), mRNA stability
(e.g., Decker and Parker 1994), translation initiation (for review, see
Kozak 1992), and others. But it is generally thought that the single
most important point of regulation is at transcription initiation. The
initiation of transcription seems to be regulated in large part by
coordinate binding of many proteins to the promoter and, for some
genes, to one or more enhancers. Specific combinations of binding
sites, then, may provide the information necessary to suggest a
particular expression context, and it is here that computational work
to date has focused.
In most cases, researchers in this area have taken the locations of
transcriptional regulatory regions (promoters and enhancers) as given
and, in attempting to define those patterns in the DNA (combinations of
binding sites) that determine expression context, have only attempted
to give patterns with sufficient information content to sort regulatory
regions into those that are active in a particular context and those
that are not (e.g., Claverie and Sauvaget 1985; Fondrat and
Kalogeropoulos 1994; Pedersen et al. 1996; Rosenblueth et al. 1996).
For this approach to be successful in the long run, reliable algorithms
must be developed for the recognition of promoters and enhancers in
general. Another approach to the problem is to attempt to define
patterns with very high information content, capable of distinguishing
regulatory regions active in a specific context from all the other DNA
in the genome (e.g., Fickett 1996b; Tronche et al. 1997). With this
approach, one can imagine that general promoter recognition would
eventually consist of separately recognizing a large number of specific
cases. It is too early to clearly define the benefits of either
strategy, and in any case, techniques developed with one approach will
almost certainly transfer in part to the other.
Eukaryotic Promoter Recognition
In the rest of the paper we concentrate on the key problem of general eukaryotic promoter recognition. First, we review a few salient points from recent advances in biochemical understanding of transcription initiation. Next, we discuss the core computational resources and techniques and describe currently available tools. To give some feeling for the current state of the art, we also describe the application of these tools to some recently determined promoter sequences. Finally, we discuss prospects for the future.
Eukaryotic Transcription Initiation
The biochemical mechanisms controlling transcription initiation in
eukaryotes are currently under intense investigation. Recent advances
are reviewed in, for example, Burley and Roeder (1996); Chao and Young
(1996); Kaiser and Meisterernst (1996); Kornberg (1996); Novina and Roy
(1996); Roeder (1996); Stargell and Struhl (1996); Verrijzer and Tjian
(1996); Ptashne and Gann (1997); and Smale (1997). Here we will attempt to
summarize the conclusions most relevant to sequence analysis.
The so-called preinitiation complex (PIC) recognizes the core promoter
and initiates transcription. The PIC includes, besides Pol II, the
general initiation factors (or general transcription factors, GTFs)
TFIIA, TFIIB, TFIID, TFIIE, TFIIF, and TFIIH. Each of these may itself
be a multiprotein complex. TFIID, which consists of TATA-binding
protein [TBP; the so-called TATA box is ~25 bp upstream of the
transcription start site (TSS) in metazoans] and several
TBP-associated factors (TAFs), is the only one of these known to have
site-specific DNA-binding ability (though several other GTFs are known
to be in close contact with the DNA; cf. Coulombe et al. 1994). TBP is
one of the major determinants of this DNA-binding specificity, and the
consensus sequence or position weight matrix (PWM) often used to
recognize the TATA box (Bucher 1990) is probably characterizing the
DNA-binding specificity of TBP (see Singer et al. 1990; Wiley et al.
1992).
Around the TSS there is a loosely conserved initiator region
(abbreviated Inr; for review, see Kaufmann et al. 1996; Smale 1997)
that is one determinant of promoter strength and, in the absence of a
TATA box, can determine the location of the TSS. To some extent, the
TATA box and the Inr are interchangeable. For example, TFIID containing
a mutated TBP defective in DNA binding cannot function on TATA-only
promoters, but supports transcription from Inr-containing promoters
(Martinez et al. 1995). There is evidence that several different
proteins can bind to the Inr. Some of these seem to be capable of
directing the initiation of transcription even in the absence of TBP
(e.g., YY1; cf. Usheva and Schenk 1994). Javahery et al. (1994) (see
also Purnell et al. 1994; Kraus et al. 1996) compare the sequence
requirements for Inr activity in mammals to those for DNA binding of
several proteins and to the initiation site characterization derived by
Bucher (1990) and conclude that in most cases basic Inr activity is
probably mediated by a single protein within the TFIID complex, though
possibly modulated by others. On the other hand, TFIID (via
TAFII150 or TAFII250), TFII-I, and Pol II all seem
to have Inr-specific binding capacity and possible involvement in
mediating Inr specificity of transcription initiation (for review, see
Smale 1997).
Drosophila TAFII150 contacts the DNA as far as 35 bp 3′ of the
transcription start site (Verrijzer et al. 1994) and
could perhaps also be involved in functionally important patterns
downstream of the Inr. Ince and Scotto (1995) identified a conserved
region 20-45 bp downstream of the 3′-most TSS in a set of 14 promoters
lacking both a TATA box and an Inr, and having a similar
pattern of multiple start sites. This site, with consensus GCTCCS, was
found to bind two proteins in a sequence-specific manner and, by
mutation, was found to be essential for the pattern of TSS in at least
one of the genes. Larsen et al. (1995) found a conserved motif, CTNCNG,
at about +8 in a large-scale alignment of mammalian promoters. Burke
and Kadonaga (1996) found an RGWCGTG motif at about +30 in a number
of TATA-less Drosophila promoters. Mutation analysis
demonstrated function, and footprinting showed TFIID binding. At
present, the generality of these patterns is unknown.
To a first approximation, it seems that gene expression is controlled
by a proximal promoter, which with the PIC determines the location of
transcription initiation, together with a number of specific regulatory
regions (often, but not always, 5′ to the proximal promoter), that
specify the tissue, developmental stage, or biochemical context of gene
expression (for an overview, see Tjian 1995). Usually each such
regulatory region contains binding sites for a number of specific
transcription factors, sometimes called activators or repressors, that
seem to act synergistically. There may be many such regions, and they
may either enhance or repress expression of the gene in particular
circumstances (see Yuh and Davidson 1996 for an elegant example). Often
these specific regulatory regions are active even if their location or
orientation is changed, in which case they are termed enhancers.
Enhancers may be located up to tens of thousands of base pairs from
the TSS.
Transcription factor binding sites are typically 5-15 bp long. The
nucleotide specificity at different positions within the site
varies. For a site n bp long, the information content of the binding
specificity is typically much less than the maximal 2n bits (2 bits per
position). Note that if a protein is to be sufficiently
discriminatory to have a binding site only once every N bases,
its binding specificity must have information content at least
log2(N) bits (cf. Schneider et al. 1986).
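The relation between site information content and the log2(N) bound can be illustrated with a minimal sketch. It assumes a uniform background (so each position carries at most 2 bits) and applies no small-sample correction; the function names and the example frequencies are hypothetical and are not taken from any of the cited work.

```python
import math

def position_information(freqs):
    """Information (bits) at one site position, assuming a uniform background.

    freqs maps each base (A, C, G, T) to its observed frequency at the position.
    """
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return 2.0 - entropy  # 2 bits is the maximum for a 4-letter alphabet

def site_information(columns):
    """Total information of a site, treating positions as independent."""
    return sum(position_information(col) for col in columns)

def min_information_bits(n_bases):
    """Information needed for a site to occur about once every n_bases."""
    return math.log2(n_bases)

# Hypothetical 6-bp site: four invariant positions, one two-fold
# degenerate position, and one position with no preference at all.
invariant = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
two_fold = {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0}
any_base = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
example_site = [invariant, invariant, two_fold, any_base, invariant, invariant]

bits = site_information(example_site)
print(f"site information: {bits:.1f} of a maximal {2 * len(example_site)} bits")
print(f"needed for one site per 4096 bases: {min_information_bits(4096):.1f} bits")
```

In this made-up case the site carries less information than would be needed to occur only once per 4096 bases, which is the kind of shortfall the text describes.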
Protein-protein interactions mediating synergistic action of multiple
transcription factors may impose spacing constraints on the
protein-DNA-binding sites. To take one example from among many,
insertion of 5 bp (CCAAC) between a MyoD site and the TATA box in the
desmin promoter was found to reduce myotube expression to 45% of
normal, whereas insertion of 10 bp (CGGAGTGTCG) gave 85% of normal
expression (Li and Capetanaki 1994).
There is also dependence between the DNA sequence at the binding site
of one transcription factor and the ability of that factor to interact
with another. For example, there has been evidence for over a decade
that activator inducibility probably depends on the sequence of the
core promoter (e.g., Struhl 1986). Emami et al. (1995) reviewed the
field and tested various chimeric transcription factors with synthetic
promoters containing a TATA box, an Inr, both, or neither. Among a
number of interesting conclusions, they found that Sp1 contains
multiple activation domains, one of which preferentially interacts with
a core promoter containing an Inr. Another example of Inr/TATA
differences is found in the FcγR1b gene, which contains a canonical
Inr but not a TATA box. FcγR1b is normally expressed only in
myeloid cells, and is γ-interferon (IFN-γ)- but not
IFN-α-inducible. When a 3-bp mutation introduced a TATA box 30 bp
upstream of the transcription initiation site, the altered gene
responded to IFN-α as well as IFN-γ, and cell type specificity
was lost (Eichbaum et al. 1994). In a few cases, detailed studies have
shown that point mutations in the TATA box destroy the ability of an
upstream enhancer binding transcription factor to up-regulate
expression (e.g., Harbury and Struhl 1989; Diagana et al. 1997).
The mechanism by which core promoter sequence differences are
translated into different receptivity to specific transcription factors
remains unclear. In some cases, a conformational change may be
involved. Diagana et al. (1997) showed that when base changes in the
TATA box destroy muscle-specific activation of MyHC, the contacts
between TBP and the TATA box also change. In some cases, the mechanism
may be differing composition of the PIC. Human TAFII30 was
found by Jacq et al. (1994) to be present in only some TFIID complexes
and to be required for activation by the AF-2 containing region E of
the human estrogen receptor. Similarly, some TAFs are almost certainly
subject to alternative splicing (e.g., Weinzierl et al. 1993). It would
be surprising if the core promoter sequence did not influence the
makeup of the PIC and, hence, the possibility of activation by specific
transcription factors.
There are transcription factors not part of, but very frequently acting
in concert with, the PIC. For example, on the order of half of all
vertebrate promoters contain a somewhat conserved sequence element with
a core sequence similar to CCAAT (Benoist et al. 1980; Efstratiadis et
al. 1980). There seem to be a large number of factors that interact
with CCAAT-like sequences, not all of which are known to actually
influence transcription initiation (see Tsutsumi et al. 1993 for a
list). CCAAT box-binding factor (CBF, also called NFY and CP1) is a
trimeric transcription factor that is known to be involved in the
activity of a number of promoters (see Sinha et al. 1996 for an
overview). CBF may recruit other common factors to many promoters as
well (Wright et al. 1994). Consensus sequences for the DNA-binding
sites of CBF match well a mathematical derivation (PWM) of CCAAT
commonality between many promoters, so that CBF may be the major factor
involved in CCAAT-box function (Bucher 1990). The heavily studied
CCAAT/enhancer-binding protein (C/EBP) family (for overviews, see Zhao
et al. 1993; Osada et al. 1996) contains at least six members with very
similar DNA-binding specificity (Osada et al. 1996) and is known to
activate transcription through the CCAAT box of at least some promoters
(Cao et al. 1991). There are also repressors known to act through the
CCAAT box (e.g., Pattison et al. 1997).
CpG islands (also known as HTF islands and MFIs) are regions of
vertebrate genomes defined primarily by the lack of methylation at CpG
doublets (for an overview, see Bird 1987). CpG islands are strongly
associated with TSS, a fact that gives rise to experimental procedures
for isolating promoters (e.g., Shago and Giguere 1996). 5-Methyl-C
often mutates to T, so that in most vertebrate DNA CpG occurs at less
than one-fourth the frequency expected from the C + G content.
However, in CpG islands CpG is much less under-represented. This,
together with a somewhat higher than average C + G-content, may
allow discrimination of CpG islands in typical DNA sequence data, where
the methylation pattern is unknown (e.g., Gardiner-Garden and Frommer
1987).
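The two sequence-only signals just mentioned, C + G content and the degree of CpG under-representation, can be computed as in the sketch below. The expected CpG count is taken here as (number of C × number of G) / length, a commonly used convention that is an assumption of this sketch rather than something stated in the text. The function name is hypothetical, and no island-calling thresholds or windowing are applied.

```python
def cpg_stats(seq):
    """Return (C+G fraction, observed/expected CpG ratio) for a DNA string.

    Expected CpG count is assumed to be (#C * #G) / length; a ratio of 0.0
    is returned when that expectation is zero.
    """
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")       # CpG dinucleotides on the given strand
    gc_fraction = (c + g) / n
    expected = c * g / n
    obs_exp = cpg / expected if expected else 0.0
    return gc_fraction, obs_exp

# Toy sequences, for illustration only.
for name, s in [("CpG-rich", "CGCGCGCG"), ("CpG-poor", "CCCCGGGG")]:
    gc, ratio = cpg_stats(s)
    print(f"{name}: C+G = {gc:.2f}, CpG obs/exp = {ratio:.2f}")
```

In practice such statistics would be computed in a sliding window along a genomic sequence; the window size and cutoffs are choices left open here.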
Any model fully describing determinants of the transcription initiation
site (and rate) will include not only discriminatory patterns in DNA
sequence but also three-dimensional structure. Compare, for example,
the partial explanation of sequence specificity in the TATA box based
on the structure of the DNA-TBP complex (Juo et al. 1996); the
competition between histones and transcription factors in gene
activation/repression (for review, see Kingston et al. 1996); and the
existence of transcription factors whose function seems to be reshaping
the DNA to bring distant sites into proximity (see, e.g., Wolffe 1994).
Unfortunately, the data available on the structural aspects of
transcription initiation, particularly the data of general predictive
value, remains minuscule compared to relevant data on sequence
specificity of protein-DNA contacts, so that transcription factor
binding sites will probably remain the focus of promoter recognition
algorithms for some time.
Techniques and Resources
Because transcription initiation seems to be brought about by the cooperative binding of a number of proteins to the DNA, the primary computational approach to promoter recognition has been to combine modules recognizing individual binding sites, using some overall description of how these sites should be spatially arranged.
Sometimes binding specificity is characterized using consensus
sequences, that is, by giving the most preferred base at each position
within a site. But this approach loses much of the information and is
of marginal utility. For example, the DNA-binding specificity of the
(very large) basic helix-loop-helix family of transcription
factors (e.g., Kadesch 1993) is often specified as CAnnTG.
However, this pattern, with only four specified positions, is expected about once every 256 bp (4^4) in random sequence. If all the
factors of this family really bound so frequently and without differing
specificity, they could certainly not accomplish their role of
controlling terminal differentiation of many different tissue types. In
fact, their binding is more specific and differs from factor to factor
(e.g., cf. Hsu et al. 1994 and Wright et al. 1991).
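The once-per-256-bp figure for CAnnTG can be checked with a short sketch that computes the expected spacing of any IUPAC consensus in random sequence with equal base frequencies (an assumption of this sketch) and scans a sequence for matches. The function names and the test sequence are hypothetical.

```python
import re

# Standard IUPAC nucleotide codes and the bases each one allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def expected_spacing(consensus):
    """Expected number of bases between matches on one strand,
    assuming independent positions and equal base frequencies."""
    p = 1.0
    for symbol in consensus.upper():
        p *= len(IUPAC[symbol]) / 4
    return 1 / p

def find_matches(consensus, seq):
    """Start positions (0-based) of all, possibly overlapping, matches."""
    classes = "".join("[" + IUPAC[s] + "]" for s in consensus.upper())
    pattern = re.compile("(?=" + classes + ")")  # lookahead keeps overlaps
    return [m.start() for m in pattern.finditer(seq.upper())]

print(expected_spacing("CAnnTG"))
print(find_matches("CAnnTG", "ttCAGCTGaaCATATGc"))
```

The same helper applies to the other consensus patterns quoted in this review, such as GCTCCS or RGWCGTG, with the same caveat that real genomic base composition is not uniform.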
A PWM assigns a weight to each possible nucleotide at each position of
a putative binding site and gives as a site score the sum of these
weights. It has been shown that in at least some cases this score
approximates the energy of protein binding (Berg and von Hippel 1988
and references therein; cf. also Barrick et al. 1994). It is widely
recognized that a PWM is a more informative description of a protein's
DNA-binding specificity than is a consensus sequence, and PWMs are
often used where enough information is available to build them. Frech
et al. (1997a,b) have reviewed both tools for building the PWM
(specialized multiple local alignment algorithms) and tools used to
search for putative transcription factor binding sites. The statistical
significance of PWM match scores has been treated by Hofmann and Bucher
(1995) and Claverie and Audic (1996).
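A minimal sketch of PWM scoring as described above: the site score is the sum of one weight per position, and a sequence is scanned by scoring every window against a cutoff. The 4-bp matrix and its weights are invented for illustration; they are not the Bucher (1990) TATA matrix, and the function names are hypothetical.

```python
def pwm_score(pwm, site):
    """Sum of position-specific weights for a site of the matrix's width.

    pwm is a list of dicts, one per position, mapping base -> weight.
    """
    if len(site) != len(pwm):
        raise ValueError("site length must equal PWM width")
    return sum(column[base] for column, base in zip(pwm, site.upper()))

def scan(pwm, seq, cutoff):
    """Return (start, score) for every window scoring at least cutoff.

    Windows containing characters other than A, C, G, T are skipped.
    """
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width].upper()
        if all(base in "ACGT" for base in window):
            score = pwm_score(pwm, window)
            if score >= cutoff:
                hits.append((start, score))
    return hits

# Hypothetical weights for a TATA-like 4-mer (0 = preferred base).
toy_pwm = [
    {"A": -2.0, "C": -3.0, "G": -3.0, "T": 0.0},
    {"A": 0.0, "C": -3.0, "G": -3.0, "T": -1.0},
    {"A": -1.0, "C": -3.0, "G": -3.0, "T": 0.0},
    {"A": 0.0, "C": -3.0, "G": -3.0, "T": -1.0},
]

print(scan(toy_pwm, "GGTATAAGG", cutoff=-1.0))
```

Only one strand is scanned in this sketch; a practical search would also score the reverse complement and would choose the cutoff using a significance treatment such as those cited above.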
The PWM methodology is predicated on the hypothesis that different
positions within the site make independent contributions to binding.
Although a number of cases are known where this approximation seems to
be a reasonable one (e.g., Berg and von Hippel 1988 and references
therein; Fickett 1996c), most who have used PWMs know of cases where
the method gave poor results. This could be attributable to many
reasons, for example, the existence of multiple isoforms of the
protein, leading to different classes of sites (e.g., Andres et al.
1995), or alternative protein conformations induced by the DNA
structure (e.g., Bonven et al. 1995), leading to correlated preferences
at different positions. It will probably be important to apply
nonlinear methods of separation (and perhaps develop new ones) for this
problem. Nonlinear methods have been successfully applied in the
recognition of splicing junctions. Brunak et al. (1991) used multilayer
neural nets; Burge and Karlin (1997) used decision trees; and a number
of investigators have used position-specific oligonucleotide counts
(e.g., Solovyev and Salamov 1997 and references therein).
To build any model of the DNA-binding specificity of a protein, one
needs a number of known sites (it would be valuable to have the
strength of the sites as well, but this information is rarely
available). For core promoter elements the best data source may be the
Eukaryotic Promoter Database (EPD; Bucher and Trifonov 1986), a
collection of experimentally mapped TSSs and surrounding sequences. For
other transcription factors, one traditional data source has been the
Transcription Factor Database (TFD; Ghosh 1990), but this database is
no longer maintained. Currently maintained collections include TRANSFAC
(Wingender et al. 1996) and the Transcription Regulatory Region
Database (TRRD; Kel et al. 1994). If one is interested in a particular
factor, there is no substitute for reading the literature to find both
natural sites and random oligonucleotide selection data (for an
overview, see Wright and Funk 1993), and understanding the degree of
evidence for each putative site. For hundreds of recently discovered
transcription factors, binding site data may be scarce or absent. In
some cases, it may be possible to predict the specificity of a new
factor from that of a closely related factor whose specificity is known
(e.g., Choo and Klug 1994; Suzuki and Yagi 1994).
Bucher (1990) constructed PWMs for several core promoter elements; these
are widely used in promoter recognition algorithms. PWMs for many
specific transcription factors have been collected in TRANSFAC and TRRD
(see also Chen et al. 1995). Because some of the sites used to build
these matrices have questionable experimental support, one should
exercise caution in applying them.
Most of the work in this area has centered around characterizing
transcription factor binding sites and their relative localization. Approaching a different aspect of the problem, Benham (1996) has described methods to predict regions of helix destabilization, likely
to coincide with certain gene features, including transcriptional regulatory regions. Also, the advent of large-scale model organism sequencing allows one to identify functionally important regions of all
kinds (though not to differentiate between the different possible
functions) by means of sequence conservation. The application of this
technique, termed phylogenetic footprinting, to the discovery of gene
regulatory regions has been reviewed by Duret and Bucher (1997).
Available Promoter Prediction Tools
In this section we describe publicly available software tools for locating promoters in DNA sequence. To gain some idea of how the tools perform in practice, we tested them on a small sample of recently determined sequences in which the transcription initiation site has been experimentally mapped. We collected 18 published mammalian sequences containing 24 promoters (Table 1) in a total of 33120 bp. Two of these sequences were not found in GenBank (as of February 20, 1997); the others were dated no earlier than May 16, 1996. None of them matches a sequence in EPD (either at the level of identity or at the level of clear homology). Thus, we believe that these represent an independent test set, not overlapping in any significant way the sequences used in the development of the tools described below.
Each tool was used with the default settings and was tested in early
March 1997 (most of the on-line services do not give version numbers).
The computer predictions are given alongside the mapped TSS in Table 1.
It is difficult to summarize the degree of agreement of the computer
predictions with experimental results, because of ambiguities in the
results on both sides. Experimental accuracy may be impacted by mRNA
degradation, which can lead to the mapped location of the TSS being
3′ to its true location. Some programs aim to locate the TSS
exactly, tolerating a high false-positive rate, with the idea that the
approximate location will already be known. Some are intended to
analyze large genomic sequences and have as their goal the approximate
localization of promoters or gene starts. We evaluated only the ability
to approximately locate the TSS itself. If a program gave a promoter prediction but not an explicit TSS, we took the 3′ end of any predicted promoter window as the predicted TSS. The predicted TSS,
explicit or implicit, was counted as correct if it was within 200 bp
5′, or 100 bp 3′, of any experimentally mapped TSS. Given these
criteria, accuracy results are summarized in Table 2.
Because of the limited sample size and the possibly skewed nature of
the sample (discussed below), results should be taken as provisional and perhaps pessimistic.
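The scoring criterion just described can be written down as a short sketch. The tolerance window (200 bp 5′, 100 bp 3′) is taken from the text; everything else is an assumption made for illustration: the function names, the representation of a site as a (position, strand) pair, the requirement that prediction and mapped TSS lie on the same strand, and counting a promoter as found once regardless of how many predictions hit it.

```python
def is_correct(predicted, mapped, strand="+", upstream=200, downstream=100):
    """True if a predicted TSS lies within the tolerance window of a mapped TSS.

    Positions are coordinates on the forward strand; for "-" strand genes
    the 5' direction corresponds to increasing coordinates.
    """
    offset = predicted - mapped
    if strand == "-":
        offset = -offset
    return -upstream <= offset <= downstream

def evaluate(predictions, mapped_sites, total_bp):
    """Return (promoters found, false positives, bp per false positive).

    predictions and mapped_sites are lists of (position, strand) pairs.
    """
    found = set()
    false_positives = 0
    for pos, strand in predictions:
        hits = [i for i, (m_pos, m_strand) in enumerate(mapped_sites)
                if m_strand == strand and is_correct(pos, m_pos, strand)]
        if hits:
            found.update(hits)
        else:
            false_positives += 1
    bp_per_fp = total_bp / false_positives if false_positives else float("inf")
    return len(found), false_positives, bp_per_fp

# Made-up coordinates, for illustration only.
mapped = [(1000, "+"), (5000, "-")]
predicted = [(850, "+"), (1150, "+"), (5150, "-"), (3000, "+")]
print(evaluate(predicted, mapped, total_bp=10000))
```

The false-positive densities quoted for the individual tools below follow the same arithmetic: for example, 51 false positives in the 33120-bp test set correspond to roughly one per 649 bp.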
Audic/Claverie
Audic and Claverie (1997)
Autogene
Autogene (available by ftp from ftp.bionet.nsc.ru; directory pub/biology/aug) includes a module for promoter recognition (Kondrakhin et al. 1995
GeneID/Promoter1.0
An unpublished promoter-finding algorithm, developed by S. Knudsen (Technical University of Denmark), is included in the GeneID e-mail server (send "help" to geneid{at}darwin.bu.edu). According to the on-line documentation, "Promoters are predicted by a program called promoter1.0. It has been developed as an evolution of simulated transcription factors that interact with sequences in promoter regions." In our tests promoter1.0 identified 10 (42%) of the promoters, and reported 51 false positives (1/649 bp).
NNPP
NNPP (M. Reese, http://www-hgc.lbl.gov/inf/nnpp-abstract.html) combines recognition of the TATA box and the Inr, using the time delay neural net architecture, which allows for variable spacing between the features. We tested the algorithm using the on-line service at http://www-hgc.lbl.gov/projects/promoter.html. When tested on our data set NNPP identified 13 of the 24 promoters (54%) and reported 72 false positives (1/460 bp). [At the optional threshold 0.9, 7 (29%) of the promoters were identified, and 31 false positives (1/1068 bp) were reported.]
PromFind
PromFind (Hutchinson 1996
PromoterScan
PromoterScan (Prestridge 1995
TATA
Because many investigators rely heavily on the TATA box to help locate a possible promoter, we also tested the TATA PWM from Bucher (1990)
−8.16) the TATA PWM
gave 159 predictions in our test set. We used a more restrictive
cutoff, namely −6.5, that gave 54 predictions, more in line with the
other methods. With these parameters the TATA PWM identified 6 (25%)
of the known promoters and predicted 47 apparent false positives (1/705
bp).
TSSG and TSSW
TSSG and TSSW (Solovyev and Salamov 1997
−1 to −100, −101 to −200, and −201 to −300 relative to the TSS, and (4) potential transcription factor binding sites. TSSG is based on the promoter.dat file derived from TFD by Prestridge (1995)
Algorithms Not Included in the Test Results
GRAIL includes promoter recognition as one component of integrated gene structure prediction (Matis et al. 1996
DISCUSSION
The accuracies of the various programs are plotted in Figure 1, where it may be seen that the true positive rate is approximately a constant fraction of the total number of predictions. For comparison we also show a line on which the accuracy rates of completely random predictions would fall.
The results presented here should not be used to compare the various programs among themselves (except perhaps to note that no technique used to date is obviously superior to the others), in part because the test set is small for this purpose. Also, the programs use somewhat different definitions of the problem and are not really directly comparable. Our tests were in some sense unfair for each program, usually in a unique way for each. For example, PromFind is intended to locate the promoter when one already knows the approximate gene location and the coding strand, and so it makes exactly one prediction, on the strand presented, in each sequence it is given to analyze; but we had multiple promoters in some sequences, and we tested both strands of each sequence with each program. An examination of the test results in light of each program's design goals will still show, however, that our conclusions about the general state of the field are not materially affected.
At the default settings, the algorithms we tested found 13%-54% of the true promoters in our test set. However, in the test sets used by the developers the correct prediction rates were higher, and it must be noted that the test set we used was perhaps not representative. It is possible that the way we chose the test set, namely searching recent issues of journals with a focus on transcriptional regulation, retrieved promoters that are active in very specialized contexts. Furthermore, in two cases there are fewer nucleotides upstream of the experimentally mapped TSS than are required for the analysis window of some of the programs. Nevertheless, investigators do need to analyze sequences like the ones in our test set, and the test results do suggest that the challenge of finding all promoters reliably is far from being met.
The programs reported on the order of one false positive per kilobasepair. On the surface, this suggests that if they were applied to a mammalian genome as a whole (with approximately one gene per few tens of kilobases), they would give a few tens of false positives for each real gene. This too may be misleading, however. Because most of the algorithms make use of transcription factor binding site density, they may be expected to give a high signal on enhancers as well as promoters. And although enhancers may be found anywhere up to tens of kilobases away from the TSS, they tend to be more concentrated near the promoter. Thus, it is quite possible that current tools have simply not developed far enough to differentiate reliably between promoters and enhancers and that some of the false positives are in fact true transcriptional regulatory regions. On the other hand, it is also possible that some of the true positives in this set, where the promoter density is high, are attributable to chance and that the false-positive rate would be higher in general genomic DNA.
Although our current knowledge of transcription initiation is still far from complete, it is clear that considerable information is available that has not yet found its way into current algorithms. Given the advances in our understanding of promoters gained from experimental methods in the last few years, there are grounds for cautious optimism that better algorithms can, in fact, be developed.
Wherever a consensus sequence, a PWM, or other recognition module is
built to discern the binding sites of a protein, it is probably worth
taking the time to fully evaluate the experimental data available, as
well as using the latest computational techniques. To quote Frech et
al. (1997b), "perhaps more time and effort should be invested in
improving the quality of matrix libraries rather than in developing new
algorithms to calculate matrix scores."
However, it will be many years before the majority of transcription
factors and their DNA-binding specificities becomes known. One natural
way to try to improve promoter prediction would be to concentrate on
the core promoter elements. For example, (1) an evaluation of the
Bucher TATA matrix on a large number of TATA boxes with proven function
would be valuable. Also, given the dependence of activator function on
TATA sequence, it would be worth attempting nonlinear recognition
methods, such as neural nets or quadratic discriminant analysis. (2)
The very low information content of the overall Inr consensus (Javahery
et al. 1994), together with the evidence for involvement of multiple
protein families and the existence of conserved elements that occur in
some but not all sequences downstream of promoters, suggests that it
might be worthwhile to attempt either cluster analysis or nonlinear discrimination of proven, functional Inr sequences. (3) The CCAAT box
pattern most used in current algorithms, namely that of Bucher (1990),
was derived not from a biological definition, but from a computational
one. Bucher's algorithm was, very roughly, to find a linearly
definable pattern common to many promoters and with a strong similarity
to CCAAT. Now that several proteins are known to recognize a similar
pattern and to be involved in transcription initiation, it seems worth
investigating whether there are different classes of CCAAT boxes
corresponding to the different proteins.
ACKNOWLEDGMENTS
This work was supported by SmithKline Beecham Pharmaceuticals, Synaptic Ltd., and U.S. Public Health Service grant HG00981-01A1 from the National Center for Human Genome Research. We thank P. Agarwal, J.-M. Claverie, M. Gelfand, I. Grosse, R. Guigo, W. Wasserman, and M. Zhang for valuable comments on the work.
FOOTNOTES
3 Corresponding author.
E-MAIL ficketjw{at}molbio.sbphrd.com; FAX (610) 270-5580.
REFERENCES
A new predictor of DNA regulatory regions. Comp. Appl. Biosci. 12: 375-382.
responsiveness and basal expression of the myeloid human FcγR1b gene is mediated by a functional PU.1 site and a transcription initiator consensus. J. Exp. Med. 179: 1985-1986.
Database of transcription regulatory regions on eukaryotic genomes. In Proceedings of the 28th Annual Hawaii International Conference on System Sciences v5, Biotechnology Computing, pp. 42-51. IEEE Computer Society Press, Los Alamitos, CA.
terminal processing sites. Comp. Appl. Biosci. 10: 597-603.
exon usage is subject to transcriptional control by three tandem promoters and alternative splicing. Biochim. Biophys. Acta 1306: 75-92.
-flanking region of the rat PP1 gene. Biochim. Biophys. Acta 1309: 221-225.
end of the human papillomavirus type 6 long control region. J. Virol. 71: 2013-2022.
end restriction fragments of cDNAs. Proc. Natl. Acad. Sci. 93: 659-663.