Published online before print
February 15, 2002, 10.1101/gr.212502. Article published online before print in February 2002
Vol. 12, Issue 3, 470-481, March 2002
METHODS
ABSTRACT
The early developmental enhancers of Drosophila melanogaster comprise one of the most sophisticated regulatory systems in higher eukaryotes. An elaborate code in their DNA sequence translates both maternal and early embryonic regulatory signals into spatial distribution of transcription factors. One of the most striking features of this code is the redundancy of binding sites for these transcription factors (BSTF). Using this redundancy, we explored the possibility of predicting functional binding sites in a single enhancer region without any prior consensus/matrix description or evolutionary sequence comparisons. We developed a conceptually simple algorithm, Scanseq, that employs an original statistical evaluation for identifying the most redundant motifs and locates the position of potential BSTF in a given regulatory region. To estimate the biological relevance of our predictions, we built thorough literature-based annotations for the best-known Drosophila developmental enhancers and we generated detailed distribution maps for the most robust binding sites. The high statistical correlation between the location of BSTF in these experiment-based maps and the location predicted in silico by Scanseq confirmed the relevance of our approach. We also discuss the definition of true binding sites and the possible biological principles that govern patterning of regulatory regions and the distribution of transcriptional signals.
INTRODUCTION
In contrast to coding sequences, where each base pair can be placed in the informational context of protein structure, regulatory DNA of promoters and enhancers has no obvious uniform language, no universal code. However, it is clear that a significant fraction of this regulatory DNA code represents sequences recognized by transcription factors.
Most of the current strategies for identifying binding sites for transcription factors (BSTF) rely on the extraction of binding sites by comparing a set of functionally related regulatory sequences. Algorithms such as MEME (Bailey and Elkan 1995), YEBIS (Yada et al. 1998), CONSENSUS (Hertz et al. 1990), and ANN-Spec (Workman and Stormo 2000) employ various methods based on expectation maximization (EM; Bailey and Elkan 1994) and Gibbs sampling (Lawrence et al. 1993). In addition, several word-counting algorithms have been developed to approach the problem. For instance, the recent Moby Dick program (Bussemaker et al. 2000b) employs a suffix-tree strategy (Apostolico et al. 2000; Marsan and Sagot 2000) to build word dictionaries and then deduce the most significant motifs. Strategies based on extraction from a set often use as an important criterion that a majority of sequences contain the same motif (MEME). For instance, in a typical case in which an unaligned set was represented by a large number (521) of relatively short proximal promoter sequences (−100 to +5; Pesole et al. 1992), this extraction method allowed reliable prediction, mainly of proximal promoter elements (TATA-box) and of ubiquitous binding sites (Bussemaker et al. 2000a). Specific binding motifs that are present in only one or a few members of the set, however, are likely to be lost using this approach.
Until now very few attempts have been made to approach BSTF prediction from another angle, relying for instance on the observation that functional binding sites are often found in clusters within regulatory regions and thus cause a biased word distribution within a given sequence. This bias makes it feasible to extract BSTFs from just a single region. This could be an important achievement as it could identify the transcriptional information specific only to this particular regulatory sequence. The significance of such an extraction from a single sequence is especially important for the analysis of extended and complex regulatory regions found in higher eukaryotes. A promising attempt to predict binding sites in a single wide region was based on measuring hexamer frequencies within the Drosophila Ubx-C region (Lewis et al. 1995).
The fact that many experimentally found BSTF of higher eukaryotes are repeated within a narrow regulatory region allows one to use the same basic principle for the extraction from a single sequence as for the extraction from a set of unaligned sequences (i.e., by exploring motif redundancy). This redundancy is also affected by the presence of accessory (weak, or shadow) sites, which are often found in a regulatory region near the experimentally confirmed strong sites (Kassis et al. 1989; Stanojevic et al. 1991; Small et al. 1992). Although the meaning of these sites is unclear, they have been observed in a wide array of regulatory sequences. Thus, families of related words (motifs) would reliably describe specific BSTF patterns found in a single regulatory region.
One of the differences between extraction from a single sequence and extraction from a set is the higher statistical ambiguity caused by an insufficient length of sequence, by small numbers of repeats, and by the presence of related and overlapping motifs in the same sequence. Moreover, along with multiple BSTF, regulatory regions often contain other statistically significant patterns such as long simple repeats (...CACACA...) or poly(N) tracts (...TTTTT...). The exact function of these sequences is generally not known, but they often interfere with attempts to reveal binding sites. Therefore, special statistics accounting for word overlaps is important when using extraction from a single sequence.
Another known problem related to BSTF extraction without a consensus/matrix description is the lack of biological confirmation for the prediction relevance. Because, in most cases, a typical algorithm requires an estimated BSTF length and number of expected motifs (MEME; Bailey and Elkan 1995), at least some training procedures which are based on a reliable training set appear to be necessary for a given biological system. During the last few years, such training sets have become available for unicellular organisms like Escherichia coli and yeast in the form of annotated promoter databases (van Helden et al. 1998, 2000; Zhu and Zhang 1999). However, the situation evolves much more slowly for higher eukaryotes (Cavin Perier et al. 1998).
To overcome biological ambiguity of such predictions, we focused on a particularly well-known system: the early developmental enhancers from Drosophila. For this system, we developed experimentally based definitions for the most robust binding sites and we built precise maps of their distribution in these enhancer regions. The enhancers of the Drosophila developmental genes have several advantages for our study: (1) Functional similarity: Typically a stripe of expression at the blastoderm stage of embryonic development; (2) Similar regulation: Most enhancers respond to a relatively small number of known maternal or gap genes (Bicoid, Hunchback, Krüppel, etc.) or pair-rule genes (Eve, Ftz, Hairy, etc.); (3) Structural homogeneity: The enhancers typically have a defined length (~1000 bp) and are not located near unique proximal promoter elements such as TATA, DPE, and INR (Weis and Reinberg 1992; Burke et al. 1998; Pedersen et al. 1998); and (4) Level of characterization: The large amount of biochemical, genetic, and evolutionary (comparisons between species) data accumulated in the literature for these enhancers makes them an extremely valuable resource.
Based on the principles described above for the extraction from a single region, we developed a new algorithm, Scanseq, that requires no consensus/matrix description and locates the position of potential binding sites in one given sequence. We then investigated the correlation between the Scanseq predictions and the experimentally verified distribution of binding sites in a set of Drosophila developmental enhancers. We found a high correlation for all enhancers used in our set, using a wide range of algorithm parameters. With the help of a special training procedure, we defined the most effective parameter ranges that can be used in a search for unknown BSTF in this type of complex regulatory regions in Drosophila and likely in other multicellular organisms. We also analyzed the distribution of weak shadow sites and revealed their specific arrangements in several developmental enhancers from our collection.
RESULTS
Developmental Enhancers: Maps of Binding-Site Distribution
We thoroughly annotated a number of Drosophila developmental enhancers and generated maps of BSTF distribution to measure the efficiency/accuracy of the Scanseq predictions. Building such maps required accurate processing of a thorough literature compilation, as well as establishing definitions for BSTF. We designed two strategies for the treatment of the compiled literature data (Fig. 1). The first strategy solved the frequent disagreements in the length and the exact location of BSTF reported by different sources. The second strategy implemented a uniform criterion for the minimal strength of a true binding site. As an indirect measure of this strength, we used the positional weight matrix (PWM) score for this site (Berg and von Hippel 1987).
Our compilation contained footprints and other data for 20 of the best-known early Drosophila developmental enhancers (see appendix 1.3 on the New York University Web site: http://homepages.nyu.edu/~dap5/PSS/appendix1.html). To minimize interference of possible experimental errors, we only included sites for a given transcription factor found in at least two different enhancers (reported by two different research groups) from our collection. We also required that a site be verified by at least two independent methods, including biochemical (footprints), genetic (mutant), or evolutionary (highly conserved blocks) analyses and not simply by a search for a consensus. After such filtering, our set contained binding sites for seven transcription factors: Bicoid (34 sites total), Caudal (15), Ftz (25), Hunchback (43), Knirps (47), Krüppel (21), and Tramtrak (7). We also narrowed down the number of regulatory regions to 10, each containing at least two of the seven types of sites: engrailed intron (enint; Kassis et al. 1989; Florence et al. 1997), even-skipped stripe 2 (eve2; Stanojevic et al. 1991; Small et al. 1992; Arnosti et al. 1996), even-skipped stripe 3+7 (eve3+7; Small et al. 1996), fushi-tarazu proximal enhancer (ftzprox; Han et al. 1993, 1998; Yu et al. 1999), hairy stripe 6 enhancer (hairy6; Langeland et al. 1994), hairy stripe 7 enhancer (hairy7; La Rosee et al. 1997, 1999), abdominal-A enhancer (iab2; Shimell et al. 2000), Krüppel region 730 (kr730; Hoch et al. 1991), spalt early enhancer (sal; Kuhnlein et al. 1997; Barrio et al. 1999; de Celis et al. 1999), and tailless enhancer (tll; Hoch et al. 1992; Liaw et al. 1995).
In the next stage, we built alignments (CLUSTALW, LaserGene) for each type of selected BSTF and outlined a well-defined core made of positions with a high information content (see appendix 1.1 on the Web site). For each type of site, a PWM was built from the core alignment. We used PWMs that were not normalized for the average nucleotide composition (setting $p_\alpha = 0.25$ in equation 6 below) to avoid any possible bias for base composition in a particular sequence.
Searches with these PWMs revealed not only the presence of the experimentally verified BSTF, but also multiple high-scoring matches. Therefore, we generated two alternative types of BSTF maps for each regulatory region. The first map, refined, contained only high-scoring PWM hits that coincided with the experimentally identified sites (footprints). This map served to fix the length and the location of the already-known binding sites. However, it is known that in vitro analyses often reveal only the strongest binding sites (Tronche et al. 1997). Therefore, we also developed a second map, consistent, that was based on the relative PWM scores of the found matches.
To determine the relevant PWM score cutoff, we calculated at each cutoff value the number of hits (H, number of experimentally confirmed sites), the number of false-positive sites (FP), and the number of false-negatives (FN, missing but experimentally confirmed sites) between the refined and the resulting consistent map. This procedure was performed independently for each type of BSTF considered. To give more weight to the experimentally verified BSTFs in the consistent map, we added more penalties to FN than to FP. We built our penalty function by modifying the likelihood ratio criterion (see appendix 1.2 on the Web site)
[Equation 1]
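The cutoff scan described above can be illustrated with a minimal Python sketch. The penalty function actually used (equation 1) is not reproduced in this text, so the sketch substitutes a simple weighted sum, FP + w·FN, purely as a stand-in; the function name `choose_cutoff`, the weight `fn_weight`, and the example scores are all hypothetical.

```python
from typing import Iterable, Tuple


def choose_cutoff(
    scored_hits: Iterable[Tuple[float, bool]],
    n_confirmed: int,
    fn_weight: float = 2.0,  # hypothetical weight: a missed site costs twice a false positive
) -> Tuple[float, int, int, int]:
    """Scan every distinct PWM score as a cutoff; return (cutoff, H, FP, FN).

    scored_hits: (PWM score, True if the match coincides with a confirmed site).
    n_confirmed: number of experimentally confirmed sites in the refined map.
    The penalty FP + fn_weight * FN is a stand-in, not the paper's equation 1.
    """
    hits = list(scored_hits)
    best = None
    for cutoff in sorted({score for score, _ in hits}, reverse=True):
        h = sum(1 for score, ok in hits if score >= cutoff and ok)
        fp = sum(1 for score, ok in hits if score >= cutoff and not ok)
        fn = n_confirmed - h
        penalty = fp + fn_weight * fn
        if best is None or penalty < best[0]:
            best = (penalty, cutoff, h, fp, fn)
    if best is None:
        raise ValueError("no hits supplied")
    return best[1:]


# Made-up PWM scores for one factor in one enhancer (three confirmed sites).
example_hits = [(8.1, True), (7.4, True), (6.9, False),
                (6.2, True), (5.0, False), (4.1, False)]
cutoff, h, fp, fn = choose_cutoff(example_hits, n_confirmed=3)
```

Raising `fn_weight` pushes the chosen cutoff down, so that fewer experimentally confirmed sites are lost at the cost of more unconfirmed matches.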
Possible experimental errors, as well as the specificity of our descriptions (alignments, PWM), probably cause the disagreements found between the refined and the consistent maps. An example of a comparison between these maps is shown in Table 1. We consider our consistent maps (see appendix 1.3 on the Web site) as the closest approximation to the distribution of true BSTF. However, one should not expect better agreement between the Scanseq prediction and either of the two maps than between the two maps themselves.
Formulation of the Scanseq Algorithm
We based our Scanseq algorithm on the assumption that each word recognized by a given transcription factor (BSTF) belongs to its own family of similar words (binding-site motif) found in the same enhancer sequence. Scanseq (Fig. 2) extracts statistically significant motifs from a single sequence and generates a map of potential binding sites for this sequence. The algorithm features special statistics for accounting for word overlaps in the same DNA strand and for correlating word overlaps in the complementary strands of DNA (see Methods and appendix 2.2 on the Web site).
The Scanseq algorithm includes the following basic stages. In the first step, a search is performed with each m-letter word in the sequence (the seed word) for all similar words with no more than k mismatches. The resulting word family forms the initial motif for each seed word. In the second step, the search is performed with the PWM constructed for each of the initial motifs. This matrix is normalized for the average sequence composition and uses pseudocounts to cope with the small-sampling problem. In the third step, the algorithm calculates the expectation and the variance for the number of occurrences in the random sequence for the double-stranded DNA. The Z score of the refined motif is assigned to each corresponding initial seed word. In most cases, the characteristic length of the potential recognition motif and its divergence level are not known. Therefore, the algorithm performs several calculation rounds with different m and k and finds the motif with the highest Z score for a given initial seed word. The selected ranges for the parameters m and k (mmax, mmin, and kmax) and the Z-score cutoff value define the predicted map.
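The first stage (collecting the word family of a seed) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Scanseq implementation: the function names are hypothetical, positions are 0-based, and a word is assigned to the minus strand when only its reverse complement lies within k mismatches of the seed.

```python
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word: str) -> str:
    """Reverse complement of a DNA word written in upper-case ACGT."""
    return word.translate(_COMPLEMENT)[::-1]


def mismatches(a: str, b: str) -> int:
    """Number of substituted positions between two words of equal length."""
    return sum(x != y for x, y in zip(a, b))


def word_family(seq: str, start: int, m: int, k: int) -> List[Tuple[int, str]]:
    """Find all m-letter words within k mismatches of the seed seq[start:start+m].

    Returns (position, strand) pairs; strand is '+' for a direct match and
    '-' when the reverse complement of the word matches the seed.
    """
    seed = seq[start:start + m]
    family = []
    for pos in range(len(seq) - m + 1):
        word = seq[pos:pos + m]
        if mismatches(seed, word) <= k:
            family.append((pos, "+"))
        elif mismatches(seed, reverse_complement(word)) <= k:
            family.append((pos, "-"))
    return family


# Toy sequence: TAATCC at 0, its reverse complement GGATTA at 10,
# and a one-mismatch variant TAATCG at 19.
toy = "TAATCCAAAAGGATTAAAATAATCG"
family = word_family(toy, start=0, m=6, k=1)
```

The size of the returned family plays the role of the observed count for the seed; the later stages (PWM refinement and Z-score evaluation) are not covered by this sketch.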
Parameters and Predictions
Depending on the amount of available information, regulatory regions in general can be divided into three categories: unidentified, identified in the genome but with no further annotation, and well-known regions with at least some maps for BSTF distribution. Typically, the first category requires an independent preliminary analysis (recognition in the genome) before predicting BSTF. The second category requires some a priori (default) parameter settings deduced from an appropriate training set. Individual training of parameters on one sequence might be applied to the third category of sequences to reveal yet unknown BSTF. Most currently available motif-extracting programs usually require custom settings for the number of expected motifs and their approximate length. We introduced a relatively simple parameter, coverage (c), instead of the widely used number of expected motifs.
The distribution of known BSTFs within the developmental enhancers (consistent maps) showed that on average they represent about a quarter of the sequence length (0.24; see Table 2). Therefore, we took this value as the default coverage expectation. Generally, this important parameter must be approached with care as we observed that in several cases significant deviations from this default coverage occur. The extreme examples were the abdominal A enhancer (iab2) and the hairy stripe 6 region, whose coverage in the consistent maps was 7% and 65%, respectively (see Table 2). For the length of the BSTFs, we used a range from 7 bp to 15 bp, which is the size observed for the most robust binding motifs found in the developmental enhancers (see appendix 1.2 on the Web site). We also allowed the maximal divergence of the initial search (which represents the number of mismatches, kmax) to vary in the range of 0%-40% (see also Z-score profiles in Fig. 3).
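As a small worked illustration of the coverage parameter c, the following Python sketch computes the fraction of a sequence covered by a site map. The representation of sites as (start, end) pairs with an exclusive end, the function name, and the example coordinates are assumptions made for illustration only.

```python
from typing import Iterable, Tuple


def coverage(sites: Iterable[Tuple[int, int]], seq_length: int) -> float:
    """Fraction of positions covered by at least one site.

    Each site is (start, end) with 0-based start and exclusive end;
    overlapping sites are counted once per covered position.
    """
    covered = [False] * seq_length
    for start, end in sites:
        for i in range(max(start, 0), min(end, seq_length)):
            covered[i] = True
    return sum(covered) / seq_length


# Made-up map: two overlapping 9-bp sites and one isolated 9-bp site in 200 bp.
c_map = coverage([(10, 19), (15, 24), (100, 109)], seq_length=200)
```

Because overlapping sites are merged before counting, tandem or overlapping motifs do not inflate the coverage value.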
To estimate the accuracy of the prediction with no prior consensus/matrix description, we measured the correlation between the experiment-based consistent maps and the maps of predicted sites generated by the Scanseq algorithm. Three statistical values were monitored: (1) The Matthews correlation coefficient (CC; Matthews 1975),
[Equations 2-5]
−1 to 1.
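For orientation, the Matthews correlation coefficient in its standard form can be computed as in the sketch below. The equations for the three monitored values are not reproduced in this text, so the sketch covers only the standard definition of CC; comparing the maps base by base (rather than site by site) is an assumption, and the function name is hypothetical.

```python
import math
from typing import Sequence


def matthews_cc(predicted: Sequence[int], reference: Sequence[int]) -> float:
    """Standard Matthews correlation coefficient between two binary maps.

    Each map holds one 0/1 flag per base (1 = base lies inside a site).
    Returns a value from -1 to 1; 0.0 is returned when the coefficient is
    undefined (one of the marginal counts is zero).
    """
    tp = fp = fn = tn = 0
    for p, r in zip(predicted, reference):
        if p and r:
            tp += 1
        elif p and not r:
            fp += 1
        elif r:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Toy 10-bp maps: the prediction is shifted by one base relative to the reference.
reference_map = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0]
predicted_map = [0, 1, 1, 1, 1, 0, 0, 0, 0, 0]
cc = matthews_cc(predicted_map, reference_map)
```

A perfect prediction gives 1, a completely inverted map gives −1, and a prediction unrelated to the reference gives a value near 0.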
Training Parameters on an Individual Region
To assess the sequence-to-sequence variations of the best parameters, we first trained Scanseq on each individual enhancer sequence. For each considered combination of parameters, defining minimal and maximal length of the binding motif and its divergence (mmin, mmax, and kmax), we found the optimal coverage c (the fraction of the total sequence length covered with the predicted sites) that produced the highest CC and OQ values (see appendix 3 on the Web site). The optimal individual parameters found for the 10 developmental enhancers are shown in Table 2. Despite the fact that the optimal length/divergence parameter combination differed in most cases, the correlation between the predicted and the consistent maps was positive for virtually all combinations tested (see appendix 3 on the Web site and Fig. 3). In the worst case (hairy stripe 6 enhancer), 65% of which was covered with BSTF, the PQ was still positive. In many cases the optimized coverage c was very close to the observed coverage (c-MAP) for the consistent experiment-based map. The practical advantage of individual training is clear from the example of the eve stripe 2 region (Fig. 4). At the best parameter values found for the consistent map (only Bicoid, Hunchback, and Krüppel sites were included), we also managed to predict another distinct motif at position 510: CATAATAAT. This sequence exactly coincides with the most conserved half of the first Giant site in the eve stripe 2 region: TAAAAACACATAATAAT. The best individual parameter combination, which is often specific for a particular sequence, typically produces minimal statistical noise there (see the difference in Z scores in Table 2).
Training Parameters on the Group-of-10 Regions
To assess the best default parameters for sequence-independent predictions, we trained Scanseq on the entire group-of-10 consistent maps from our enhancer collection. To find the optimal ranges for length and divergence (mmax, mmin, and kmax), we calculated the average CC and PQ values for our 10 enhancers for each tested parameter combination (at the optimal coverage c; see above). Then we sorted the parameter combinations in a descending order of average CC or PQ values (see appendix 3 on the Web site). The combination of 7 bp-9 bp with 0-1 mismatches provided the best scores for both selected measures of statistical correlation. We set the coverage expectation according to the average value we observed in the 10 consistent maps (cav = 0.24). The results summarized in Table 3 indicate that these default parameters still worked well for most examples from the training set.
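The group training step (averaging a correlation measure over all enhancers for each parameter combination and sorting in descending order) can be sketched as follows. The function name, the (mmin, mmax, kmax) tuple layout, and the CC values in the example are hypothetical; they are not results from this study.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple

Params = Tuple[int, int, int]  # (m_min, m_max, k_max), an illustrative layout


def rank_parameter_sets(
    cc_by_params: Dict[Params, Sequence[float]]
) -> List[Tuple[Params, float]]:
    """Average the per-enhancer CC for each parameter combination and
    return (params, mean CC) pairs sorted from best to worst."""
    averaged = {params: mean(values) for params, values in cc_by_params.items()}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)


# Made-up CC values for three enhancers under three parameter combinations.
made_up_scores = {
    (7, 9, 1): [0.4, 0.5, 0.3],
    (7, 15, 3): [0.2, 0.3, 0.1],
    (9, 12, 2): [0.35, 0.25, 0.3],
}
ranking = rank_parameter_sets(made_up_scores)
best_params = ranking[0][0]
```

The same ranking can be repeated with PQ in place of CC to check whether both measures favor the same combination.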
Testing the Default Parameters on the eve stripe 4+6 and runt stripe 5
For two enhancers from our initial collection, even-skipped stripe 4+6 (Fujioka et al. 1999
DISCUSSION
Definition of True Binding Site
The efficiency of the Scanseq program indirectly confirms that the multiple binding motifs in Drosophila developmental enhancers are statistically significant. In these regions, weak and strong sites together form powerful word families. To independently confirm the abundance of weak shadow sites in these enhancers, we searched the even-skipped stripe 2 region (728 bp) with our PWMs for Bicoid, Krüppel, Hunchback, and Giant and built a distribution of PWM scores for all positive matches. Table 4 shows the comparison of such distribution with the expectation in random sequences having the same length and base content.
Most of the experimentally verified true sites generated the highest scores (>6) and their presence in such numbers was statistically unexpected in eve stripe 2. The second score zone (4-6) contained the weak sites with mismatches in the core. Surprisingly, for this score zone, the observed number of sites still exceeded the expected number for all four types of binding motifs. The strong agreement of the data for all four binding motifs suggests that the eve stripe 2 enhancer has at least twice as many sequences related to known BSTF as have been reported experimentally.
This simple test not only confirms the specific presence of accessory shadow sites (not revealed by footprint) around the strong sites, but also provides new grounds for the definition of BSTF. In fact, some of the poorly scoring shadow sites might be considered as true sites, thus changing the initial alignments, as well as the critical cutoff values. Apparently the procedure for the definition of BSTF must be iterative and include likelihood criteria (see equation 1) at the first stage, followed by statistical refinement of the motif at the second stage (Table 4).
BSTF Arrangements and Role of Tandem Clusters
It is still unclear whether the detected weak shadow sites have functional significance and how much they contribute to transcriptional regulation of the enhancers. To shed some light on this problem, we analyzed the distribution of weak sites from the score zone 4-6 (Table 4) and found striking features in their arrangement: The weak sites often formed tandem clusters in the enhancers from our training set (Table 5).
Equally spaced sets of 5-10 repeats of an imperfect site form a highly unusual periodic sequence, with a small period of repeat, often causing overlap of neighboring matches (compare RATCCC to CTAATCCC Bicoid). The fine structure of the most impressive examples and the evolutionary conservation of one of the sequences are presented in Figure 6. The arrangement of the shadow sites in tandem clusters and the striking conservation of these tandems in evolution strongly support their biological significance.
We see two possible roles for the tandem repeats in enhancers. One, they might be directly involved in the tight binding of transcription factors; in this case the multiplication of weak sites into tandem clusters could make such binding highly cooperative and strong (Burz et al. 1998). Two, the tandem clusters may participate in a variety of recruitment mechanisms. In the simplest case, long repetitive sequences may effectively serve to trap a protein from solution and recruit the transcription factor to its strong binding site within the enhancer. This hypothesis assumes that the initial binding to a repeat of shadow sites is weak and that the transcription factor quickly slides or jumps to stronger neighboring sites. The possibility of such lateral diffusion for transcription factors on DNA has been widely discussed in the literature (Berg et al. 1981; Berg and von Hippel 1985; Khory et al. 1990).
Although the exact role of the tandem repeats of shadow sites in enhancers, as well as their precise structure, remain to be explored, they represent a unique opportunity for unveiling the regulatory code of promoters. The unusual structures of periodic sequences might not only assist in identifying true binding sites in promoter and enhancer regions, but they may also serve for the efficient recognition of regulatory sequences. This, however, will require further analysis and classification to distinguish true regulatory tandems from satellite, telomeric, and other repeated sequences.
Strategies for BSTF Prediction
The prediction of Knirps and Hunchback sites in the eve4+6 region shows that several methods can be successfully combined for the mapping of BSTF in defined regulatory regions. Each method, however, has its limitations. For instance, the analysis of the evolutionary conservation of regulatory regions usually does not reveal the binding sites themselves, but only conserved blocks within a regulatory sequence. Due to the possible presence of conserved transcriptional signals other than BSTFs and to the extreme flexibility of the regulatory code, the interpretation of such conserved blocks as candidate binding sites might be incorrect. Another widely used approach requires a prior description of BSTF in the form of a matrix (PWM, hidden Markov model) or consensus. This method is much more reliable, but the description is not always available. The currently existing databases, such as TRANSFAC and TRDD (Heinemeyer et al. 1998), contain only a limited fraction of all transcriptional factors, many of which represent fairly pleiotropic regulators, found in a vast number of regulatory regions. Moreover, as was shown in the current work, the definition of BSTFs must also include a relevant cutoff for the search to distinguish between true sites and false-positives. However, even such consistent cutoff values do not prevent the detection of chance matches at irrelevant places in the genome (Berg and von Hippel 1987).
Methods of the third class use no a priori information and extract sites from the set of unaligned sequences, each of which is believed to contain somewhere the same BSTF, for instance, from regulation experiments. The most powerful techniques in this case are expectation maximization (Bailey and Elkan 1994) and Gibbs sampling (Lawrence et al. 1993). This approach became popular for analysis of microarray experimental data. These methods extract common motifs from a set of sequence data, which might not be sufficient, especially in the case of unique tissue-specific signals, often presented only in one sequence of the set. In this context, the extraction of BSTF from a single region with no assumption of matrix/consensus/conservation takes an important place in the unveiling of the regulatory DNA code. Although this method is currently less precise than the more conventional extraction with the PWM, we have shown that it can be adopted for virtually any regulatory sequence and deliver biologically relevant predictions.
The possibility of BSTF extraction from a single sequence makes application of the technique especially important for genome computational studies and genome annotation projects. However, meaningful predictions can be generated only if clustered BSTFs are present in a promoter region and correct parameter settings are found for a particular biological system. Currently available information about the organization of eukaryotic promoters cannot tell us how common binding-site clustering is. However, in many known cases (B.P. Berman et al., in prep.), this clustering is frequent enough to make our prediction strategy successful.
To investigate the possible application range of our algorithm, we performed similar calculations for rhodopsin promoters of Drosophila, a system that has been studied experimentally by the authors of this paper (see appendix, rhodopsin promoters, on the Web site). The minimal rhodopsin promoters are much shorter than the developmental enhancers (~300 bp versus ~1000 bp); they contain nonredundant elements such as the TATA box, and, in contrast to the developmental enhancers, they are activated at the very end of the developmental cascade. Extraction of known recognition motifs from the Drosophila rhodopsin promoters at the default parameter settings established for the developmental enhancers showed similar performance (maximum observed CC = 0.59). The training of Scanseq on the group of the six best-known Drosophila rhodopsin promoters delivered exactly the same optimal parameter settings (7 bp-9 bp, 1 mismatch). Identical behavior of the Scanseq program in two biologically distinct systems supports its wide application range.
We believe that our algorithm can still be improved by a better definition of the borders of regulatory regions (see appendix, imperfect sequence data, on the Web site), independent estimation of the coverage expectation, and a better comparison of the extracted motifs.
METHODS
Scanseq Algorithm
Search for Redundant Motifs
For each m-letter seed word located in position l of S, all words in S were found that differ from the seed word by no more than k substitutions. This set of ql words found comprised the initial motif Hl for position l. Because the initial motif, Hl, is an alignment, one can build a PWM, Wl, from this alignment using equation 6 (Berg and von Hippel 1987)
[Equation 6]
where $n_\alpha(i)$ is the number of occurrences of letters of type $\alpha$ in the ith column of the alignment. For the null statistical hypothesis, we considered the Bernoulli sequence R, with letter probabilities estimated from S: $p_\alpha = N_\alpha/N$ for each letter type $\alpha$. We scanned the sequence with matrix Wl and selected the PWM threshold in a way such that for each seed word, the total number of high-scoring words is equal to the previously found number ql.
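The motif refinement step can be illustrated with the Python sketch below. Equation 6 is not reproduced in this text, so the weight used here, a log-odds ratio with one pseudocount per letter, is an assumption chosen only because it is normalized for composition and uses pseudocounts as described; the function names are hypothetical, and only one strand is scanned for brevity.

```python
import math
from collections import Counter
from typing import Dict, List

ALPHABET = "ACGT"


def background_from(seq: str) -> Dict[str, float]:
    """Bernoulli letter probabilities N_letter / N estimated from the sequence.
    Every letter must occur at least once, otherwise build_pwm divides by zero."""
    counts = Counter(seq)
    return {a: counts[a] / len(seq) for a in ALPHABET}


def build_pwm(words: List[str], background: Dict[str, float],
              pseudocount: float = 1.0) -> List[Dict[str, float]]:
    """Log-odds PWM from an ungapped alignment of equal-length words.
    Assumed weight: ln[(n + pseudocount) / (q + 4 * pseudocount) / p]."""
    q, m = len(words), len(words[0])
    pwm = []
    for i in range(m):
        counts = Counter(word[i] for word in words)
        pwm.append({a: math.log((counts[a] + pseudocount)
                                / (q + 4 * pseudocount) / background[a])
                    for a in ALPHABET})
    return pwm


def score(pwm: List[Dict[str, float]], word: str) -> float:
    """Sum of column weights for one word."""
    return sum(column[letter] for column, letter in zip(pwm, word))


def threshold_for_count(pwm: List[Dict[str, float]], seq: str, q: int) -> float:
    """q-th highest window score, so that at least q words reach the threshold
    (more than q if scores are tied at the threshold)."""
    m = len(pwm)
    scores = sorted((score(pwm, seq[i:i + m]) for i in range(len(seq) - m + 1)),
                    reverse=True)
    return scores[q - 1]


toy = "TAATCCAAAAGGATTAAAATAATCG"
motif = ["TAATCC", "TAATCG"]  # a two-word initial motif on the forward strand
pwm = build_pwm(motif, background_from(toy))
cutoff = threshold_for_count(pwm, toy, len(motif))
```

With very few aligned words and a skewed base composition, a mismatch to a rare letter can outscore a match to a common one, which is one reason the Z-score evaluation, rather than the raw PWM score, decides the significance of a motif.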
Statistical Evaluation
This ql was our observation for the number of similar words found in the sequence: O = ql. To test whether this number was significantly greater than the number of similar words counted in a random sequence of the same length and composition, we constructed a Z score: $Z = (O - E)/V^{1/2}$. In this formula, E is the expected number of words found in a random sequence, and V is its variance. In our case, E was the expectation of the motif [...] that already had its own inverse complement, notably any palindrome, obtained the weight [...] = 2; whereas all other words obtained the weight [...] = 1. Then for double-stranded counting
[Equation 7]
[...] belonging to the motif [...]), the variance V takes the form
[Equation 8]
[...] reflects possible overlaps with different shifts between words [...] and f, c1 is the linearity constant, the value of which is small as compared to Vdouble for our length range of hundreds of base pairs (Régnier 2000
Evaluation of Motif Length and Divergence
A regulatory region usually contains binding motifs with different characteristics involving site length and divergence. In this case, fixing any particular m and k could result in extraction of only particular types of signals. To bring more flexibility into the procedure, we first ran Scanseq with different m and k values and then compared the Z scores assigned to words seeded with these different parameters (Fig. 3). We considered that word w1 (m1, k1, Z1) dominates word w2 (m2, k2, Z2) if m1 ≥ m2, Z1 > Z2, and w1 covers no less than m2 − 1 letters of w2. Then for each position l there will be a dominant word wl seeded at this position. The Z score of this dominant word was assigned to position l. We scanned over all realistic ranges of signal lengths and maximal mismatch numbers. However, since irrelevant m and k values introduce undesirable noise, it is more practical to train the algorithm for the best minimal mmin and maximal mmax length of the word, as well as for the maximal number of mismatches kmax.
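The dominance rule can be written as a short predicate. The sketch below assumes that the length condition reads m1 ≥ m2 and that "covers" means the number of shared sequence positions; the `Word` class and its field names are hypothetical and serve only to illustrate the comparison.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    start: int  # 0-based position of the first letter in the sequence
    m: int      # word length
    k: int      # mismatches allowed when the word was seeded
    z: float    # Z score assigned to the word


def overlap(a: Word, b: Word) -> int:
    """Number of sequence positions shared by two words."""
    return max(0, min(a.start + a.m, b.start + b.m) - max(a.start, b.start))


def dominates(w1: Word, w2: Word) -> bool:
    """True if w1 is at least as long as w2, has a strictly higher Z score,
    and covers at least m2 - 1 letters of w2 (assumed reading of the rule)."""
    return w1.m >= w2.m and w1.z > w2.z and overlap(w1, w2) >= w2.m - 1


long_word = Word(start=10, m=9, k=1, z=8.2)
short_word = Word(start=11, m=7, k=0, z=5.1)
distant_word = Word(start=17, m=7, k=0, z=3.0)
```

Here `long_word` dominates `short_word` (it is longer, scores higher, and covers all seven of its letters) but not `distant_word`, with which it shares only two positions.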
Construction of Predicted Maps
To generate the predicted maps of BSTF distribution, we selected positions with Z scores higher than a custom cutoff value Zmin. Depending on the chosen Zmin, the dominant selected words cover a certain fraction of the DNA sequence. Due to the dramatic difference in Z-score values generated in different sequences for the same mmax, mmin, and kmax, we found it practical to consider the overall length of sequence covered with the predicted map, or coverage c, as a custom parameter, instead of Z score. Note that for each c, there is a corresponding Zmin. The specification of four parameters mmax, mmin, kmax, and c is sufficient to generate a predicted map. To find the best values for these parameters, we applied explicit training on our set of enhancer sequences.
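The correspondence between a requested coverage c and the cutoff Zmin can be sketched as follows. This is an illustration under assumptions: dominant words are given as (start, length, Z) triples, words whose Z score equals the cutoff are treated as selected, and the function name and example values are hypothetical.

```python
from typing import List, Tuple


def zmin_for_coverage(words: List[Tuple[int, int, float]],
                      seq_length: int, c: float) -> Tuple[float, float]:
    """Lower the Z cutoff step by step until the selected words cover at
    least fraction c of the sequence.

    words: (start, length, z) for each dominant word, 0-based start.
    Returns (z_min, achieved coverage). If c cannot be reached, the lowest
    Z score and the maximal achievable coverage are returned.
    """
    if not words:
        raise ValueError("no scored words supplied")
    covered = [False] * seq_length
    n_covered = 0
    z_min = None
    for z_min in sorted({z for _, _, z in words}, reverse=True):
        for start, length, z in words:
            if z == z_min:  # add every word tied at this Z score
                for i in range(max(start, 0), min(start + length, seq_length)):
                    if not covered[i]:
                        covered[i] = True
                        n_covered += 1
        if n_covered / seq_length >= c:
            break
    return z_min, n_covered / seq_length


# Made-up dominant words in a 100-bp sequence.
toy_words = [(0, 9, 12.0), (5, 9, 10.5), (40, 8, 7.0), (70, 7, 3.2)]
z_min, achieved = zmin_for_coverage(toy_words, seq_length=100, c=0.24)
```

Because coverage grows in steps as whole words are added, the achieved coverage is generally somewhat larger than the requested c.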
ACKNOWLEDGMENTS
We thank Steven Small, Bud Mishra, Michael Gelfand, and Tiffany Cook for helpful discussion and critical reading of manuscript. This work was supported by grants from National Science Foundation (IBN 0002958) and National Institutes of Health/National Eye Institute (EY13010) to C.D. V.M. and A.L. were also supported in part by grants from Ludwig Institute for Cancer Research and Howard Hughes Medical Institute East Europe. Web appendix is available at http://homepages.nyu.edu/~dap5/PSS/appendix1.html.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
5 Corresponding author.
E-MAIL dap5{at}nyu.edu; FAX (212) 995-4710.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.212502.
Received August 27, 2001; accepted in revised form December 14, 2001.