Published online before print
March 12, 2003, 10.1101/gr.911803
METHODS
Computationally Identifying Novel NF-κB-Regulated Immune Genes in the Human Genome
| ABSTRACT |
|---|
Identifying novel NF-κB-regulated immune genes in the human genome
is important to our understanding of immune mechanisms and immune
diseases. We fit logistic regression models to the promoters of 62
known NF-κB-regulated immune genes, to find patterns of transcription
factor binding in the promoters of genes with known immune function.
Using these patterns, we scanned the promoters of additional genes to
find matches to the patterns, selected those with NF-κB binding sites
conserved in the mouse or fly, and then confirmed them as
NF-κB-regulated immune genes based on expression data. Among 6440
previously identified promoters in the human genome, we found 28
predicted immune gene promoters, 19 of which regulate genes with known
function, allowing us to calculate specificity of 93%–100% for the
method. We calculated sensitivity of 42% when searching the 62 known
immune gene promoters. We found nine novel NF-κB-regulated immune
genes that are consistent with available SAGE data. Our method of
predicting gene function, based on characteristic patterns of
transcription factor binding, evolutionary conservation, and expression
studies, would be applicable to finding genes with other functions.
NF-κB is a transcription factor (TF) that is known to be an
important mediator of immune responses (Baeuerle and Baltimore 1996).
NF-κB-regulated immune genes
play fundamental roles in both innate immunity and adaptive immunity. A
draft of the human genome sequence was published recently
(International Human Genome Sequencing Consortium 2001).
Given the importance of NF-κB signaling in the immune
system, identifying novel NF-κB-regulated immune genes will
advance our understanding of human immunity and immune diseases.
Gene promoters are regulatory regions that are integral components of
genes. For the purposes of this study, a promoter is the regulatory
region of the gene that is proximal to the transcription start site
(TSS). Eric Davidson used the patterns of TF binding sites in the
promoters of sea urchin developmental genes to build a computational
model to accurately predict their expression (Yuh et al. 1998). In
Drosophila, TF binding-site patterns in regulatory regions
have been used to search the fly genome to find developmental genes
(Berman et al. 2002; Markstein et al. 2002). In mammals, efforts using
logistic regression analysis (LRA) models of regulatory regions to find
muscle and liver genes have also seen some success (Wasserman and
Fickett 1998; Krivan and Wasserman 2001).
For most genes in the human genome, the precise locations of TSSs and
their proximal promoters are still unknown. As such, finding the
promoters is often a necessary first step in studying gene regulation,
and previous work in our lab has involved finding these sites (Liu et
al. 2001; Liu and States 2002). For genes with known mRNAs, our
prediction method, CONPRO, has been shown to identify promoters with
70% sensitivity and over 90% specificity. Applying CONPRO to the
human genome, we found 6440 promoters for genes with known mRNAs.
To identify novel NF-κB-regulated immune gene promoters among these
6440 promoters, we used the patterns of TF binding sites in the
promoters of known NF-κB-regulated immune genes and evolutionary
conservation of the NF-κB binding sites, then confirmed the
predictions based on available expression data. We retrieved 62 known
NF-κB-regulated immune response genes and their promoters (Baeuerle
and Baichwal 1997), and we found that five TF families (including
NF-κB) have binding sites overrepresented by a factor of at least two
in these immune gene promoters. This overrepresentation suggests that
these TFs are important in coregulating immune genes. We fit two LRA
models, based on the patterns of binding sites for these five TFs and
the positions of the NF-κB binding site within these promoters, then
searched the 6440 promoters to find preliminary candidates for immune
gene promoters. To improve the specificity of our predictions, these
preliminary candidates were checked for NF-κB binding-site
conservation between the human and mouse genome, or the human and
Drosophila genome when mouse genomic data were not available.
Serial analysis of gene expression (SAGE) data on myeloid, lymphoid,
and microvascular endothelial cells are consistent with nine of these
genes being activated by NF-κB in vivo, and we identified them as
putative novel NF-κB-regulated immune response genes. Our method has
sensitivity of 42%, and our two LRA models show specificity of 93%
and 100%, respectively.
| RESULTS |
|---|
NF-κB Binding Sites in Immune Gene Promoters
We examined the NF-κB binding
sites of 62 known NF-κB-regulated mammalian immune gene promoters
(Baeuerle and Baichwal 1997). The density of NF-κB binding sites in these 62
promoters is plotted against the position relative to the TSS (solid
curve, Fig. 1). For reference, among a set of mammalian nonimmune
promoters, NF-κB binding sites appear as random noise without
positional preference (dotted curve, Fig. 1). Among immune genes, the
region immediately upstream of the TSS contains the highest density of
NF-κB sites, whereas the region further upstream is also enriched in
NF-κB sites relative to nonimmune promoters, but not as
strongly as the proximal region. To find the breakpoint between these
two regions, we used piecewise linear regression (S-Plus 6.0), which places
the turning point in the curve at 230 bp (solid lines, Fig. 1). Over
75% of NF-κB binding sites in the immune gene promoters are within
230 bp upstream of the TSS, and NF-κB binding is more significant in
these genes than in immune genes with the NF-κB binding site farther
upstream. We built separate LRA models for these two groups of genes.
NF-κB-Regulated Immune Promoters
In addition to NF-κB, binding sites for other TFs
may be informative in identifying immune genes. We searched for these
other informative TFs by looking for TFs whose binding sites are
overrepresented in NF-κB-regulated immune gene promoters. To find
overrepresented TF
binding sites, we compared the number of promoters with at least one
binding site for a TF between two groups of promoters: the group of 62
known immune gene promoters and a null group that includes four sets of
62 mammalian nonimmune promoters taken from the Eukaryotic Promoter
Database (EPD; Perier et al. 2000). We considered a TF overrepresented
if its binding sites occur at least twice as often in the
NF-κB-regulated group as in the null group. The TFs that meet this
criterion are AP1, IK1, IK3, IRF1, IRF2, ISRE, NF-κB, and STAT (Fig.
2, left). For comparison, the number of
promoters containing binding sites for eight other TFs (Fig. 2, right)
is not significantly different between immune and nonimmune promoters.
This result is consistent with biological observations, because NF-κB
has been shown to interact with AP1 and IRF1 to regulate genes (Thanos
and Maniatis 1995). These TFs fall into five families:
NF-κB, AP1, IK, IRF, and STAT.
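
The overrepresentation screen described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the representation of each promoter as a set of TF names, and the toy data are all hypothetical, and the factor-of-two cutoff is the one quoted in the text. Comparing fractions rather than raw counts lets the 62 immune promoters be compared with a larger null group.

```python
from typing import Dict, Iterable, List, Set


def fraction_with_site(promoters: List[Set[str]], tf: str) -> float:
    """Fraction of promoters containing at least one binding site for tf."""
    if not promoters:
        raise ValueError("promoter group is empty")
    return sum(1 for sites in promoters if tf in sites) / len(promoters)


def overrepresented_tfs(
    immune: List[Set[str]],
    null: List[Set[str]],
    tfs: Iterable[str],
    min_ratio: float = 2.0,
) -> Dict[str, float]:
    """Return TFs whose promoter fraction in the immune group is at least
    min_ratio times the fraction in the null group, mapped to that ratio."""
    selected: Dict[str, float] = {}
    for tf in tfs:
        f_immune = fraction_with_site(immune, tf)
        f_null = fraction_with_site(null, tf)
        if f_null == 0.0:
            # Assumption: a TF never seen in the null group counts as
            # overrepresented as soon as it appears in the immune group.
            ratio = float("inf") if f_immune > 0.0 else 0.0
        else:
            ratio = f_immune / f_null
        if ratio >= min_ratio:
            selected[tf] = ratio
    return selected


# Toy data (hypothetical): each promoter is the set of TFs with >= 1 site.
toy_immune = [{"NFKB", "AP1"}, {"NFKB"}, {"NFKB", "SP1"}, {"SP1"}]
toy_null = [{"SP1"}, {"NFKB", "SP1"}, {"SP1"}, {"AP1", "SP1"}]
toy_result = overrepresented_tfs(toy_immune, toy_null, ["NFKB", "AP1", "SP1"])
```

In the toy data only `NFKB` passes the cutoff, because it occurs in three of four immune promoters but only one of four null promoters.
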
To model NF-κB regulation, we built one LRA
model for immune promoters with NF-κB binding sites within 230 bp
upstream of the TSS (Table 1, left) and one
model for promoters with more distal NF-κB binding sites (Table 1,
right). For both models, the most informative TF is NF-κB, which is
expected for NF-κB-regulated immune genes, but the presence or
absence of the other informative TF binding sites also helps categorize
promoters.
NF-κB-Regulated Immune Genes
The algorithm for identifying NF-κB-regulated immune gene promoters is
shown in Figure 3. For each promoter, we
first check for an NF-κB binding site in the first 600 bp upstream of
the TSS. If none is found, we exit the algorithm. If an NF-κB binding
site is found and it is in the 1 to 230 bp window, we use model I;
otherwise we use model II. If a promoter has a pattern of TF binding
sites yielding a probability, π(x) (the probability of being
from the immune gene group), that exceeds the threshold established
for the LRA model used, we consider the promoter a preliminary
candidate.
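
The decision flow above can be sketched in Python. This is an illustrative sketch rather than the authors' code: the function names and the `score_fn` callback (standing in for the fitted LRA models) are hypothetical, the thresholds of 0.65 and 0.50 are those given in the Discussion, and the rule that any site in the proximal window selects model I when a promoter has several NF-κB sites is an assumption.

```python
from typing import Callable, List, Optional

PROXIMAL_LIMIT = 230  # bp upstream of the TSS; boundary between models I and II
SEARCH_LIMIT = 600    # bp upstream of the TSS scanned for NF-kB sites
THRESHOLDS = {"I": 0.65, "II": 0.50}  # score thresholds quoted in the Discussion


def choose_model(nfkb_positions: List[int]) -> Optional[str]:
    """Pick the LRA model from NF-kB site distances (bp upstream of the TSS).

    Returns None when no site lies within the first SEARCH_LIMIT bp.
    """
    in_window = [p for p in nfkb_positions if 1 <= p <= SEARCH_LIMIT]
    if not in_window:
        return None
    # Assumption: one proximal site is enough to select model I.
    if any(p <= PROXIMAL_LIMIT for p in in_window):
        return "I"
    return "II"


def is_preliminary_candidate(
    nfkb_positions: List[int],
    score_fn: Callable[[str], float],
) -> bool:
    """True when the promoter's score exceeds the threshold of its model.

    score_fn is a hypothetical callback returning pi(x) for the named model.
    """
    model = choose_model(nfkb_positions)
    if model is None:
        return False
    return score_fn(model) > THRESHOLDS[model]
```

A promoter that passes this step is only a preliminary candidate; the conservation check and expression data described in the text still apply.
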
We checked the preliminary candidates for conservation of NF-κB binding sites in the mouse
genome, but we do not have genomic sequences for all mouse genes. In
such cases, we compared the preliminary candidates with the
Drosophila genome. Although flies have only a simple form of
immune response (innate immunity), conservation of regulatory regions
across such evolutionary distance is likely to be functionally
important. On searching the 62 known NF-κB-regulated immune genes,
our method selected 18 promoters by model I and eight promoters by
model II, yielding sensitivity of 26/62, or 42%.
Next, we searched for NF-κB-regulated immune gene promoters among the
6440 human promoters previously identified by CONPRO (Liu and States
2002). CONPRO finds promoters that are associated with mRNA
transcripts, and thus finding an immune promoter is equivalent to
finding an immune gene. Among the 6440 promoters, we found 28 immune
genes, 22 genes by model I and six genes by model II.
Specificity: Predicted Immune Genes With Known Function
Among the 22 predicted immune gene promoters found by model I, 15
regulate well-characterized human genes (Table 2). Eleven of these
promoters have been
cloned, and mutagenesis studies confirmed that the genes are
NF-κB-regulated (Promoter Study, Table 2). Two other genes,
MAP3K8 and MIP-3, have gene expression data
consistent with NF-κB regulation (Gene Array, Table 2):
MAP3K8 is regulated by activation of NF-κB under five
different experimental conditions, and MIP-3 shows
increased expression after NF-κB is activated by seven different
stimulators in different cell lines.
[...] NF-κB, whereas the IP3KB and IP3KA promoters do
not have any of the above sites. Based on these studies, we believe
that IP3KC is the isoform expressed in the immune system, regulating
T-cell and B-cell activation. Model I also predicts p84 as an
immune gene, but we found no experimental evidence to confirm this,
yielding specificity of 14/15, or 93% for model I.
Model II predicts six immune gene promoters, with four of them
regulating well-characterized genes (Table 3). The first three of these
genes have
been shown in promoter studies to have immune functions and NF-κB
regulation. Additionally, MyD118 (gadd45beta) is a
response gene in myeloid differentiation and is upregulated when
NF-κB is activated in both T and B cells. Ecto-ATPase provides
signals for activating cytokine secretion in T cells and antibody
secretion in B cells, as well as signals for B-cell proliferation. As
such, all four of the model II predictions are supported by
experimental data, yielding specificity of 4/4, or 100%.
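
The performance figures quoted in this section follow directly from the counts in the text, as the short Python check below shows. The helper name `percent` is hypothetical. Note that "specificity" is computed here, as in the text, as the fraction of predictions with known function that are supported by experimental evidence, a quantity that other sources often call positive predictive value.

```python
from fractions import Fraction


def percent(numerator: int, denominator: int) -> int:
    """Ratio expressed as a percentage rounded to the nearest integer."""
    if denominator <= 0:
        raise ValueError("denominator must be positive")
    return round(100 * Fraction(numerator, denominator))


# Counts taken from the text.
sensitivity = percent(18 + 8, 62)               # known immune promoters recovered
specificity_model_1 = percent(14, 15)           # model I, genes with known function
specificity_model_2 = percent(4, 4)             # model II, genes with known function
specificity_overall = percent(14 + 4, 15 + 4)   # both models combined (18/19)
```
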
NF-κB-Regulated Immune Genes
Changes in expression level after activation of NF-κB in these
cell lines, along with estimates of the statistical significance
(P-values) of these changes, are summarized in the last column
of Table 4 (Audic and Claverie 1997). Activators of NF-κB include
LPS, M-CSF, or GM-CSF on monocytes; phorbol myristate acetate (PMA) on
T cells; and vascular endothelial growth factor (VEGF) on HMVEC (Kim et
al. 2001). After activation of NF-κB, we observe
that the expression level of these nine genes increases significantly,
and thus SAGE data are consistent with the prediction that these novel
genes are regulated by NF-κB. Given the evidence that our method is
very specific, we are convinced that these nine genes are novel
NF-κB-regulated immune response genes.
| DISCUSSION |
|---|
Given the importance of the NF-κB signaling pathway in the
immune system, identifying novel NF-κB-regulated immune genes in the
human genome will undoubtedly advance the investigation of immune
mechanisms. Although experimental methods such as DNA microarrays are
available to characterize patterns of gene expression, these methods
are expensive and require biological materials which may be difficult
to obtain for some tissues and conditions. Thus, computational methods
can complement experimental methods by identifying candidate genes,
maximizing the effectiveness of expression studies. We built LRA models
for immune promoters with NF-κB binding sites within the 1 to
230-bp window (model I), and for promoters with NF-κB binding sites
between 231 and 600 bp (model II). This method shows sensitivity of
42%, with specificity of 93%–100%.
We found 28 immune gene promoters with an LRA model score above the
threshold and with NF-κB binding sites conserved in the mouse or
Drosophila genome. These 28 genes are predicted to be immune
response genes, with nine genes among them being novel
NF-κB-regulated immune genes. For all of the nine predicted novel
immune genes, SAGE data on microvascular endothelial, myeloid, or
lymphoid cells suggest that they are NF-κB-regulated genes. Because
our search procedure is very specific, we propose that these nine genes
are NF-κB-regulated immune response genes.
One of these novel NF-κB-regulated immune genes (sir2
homolog 2) is notable. The yeast sir2 protein is believed to
be functional in the nucleus and is involved in gene silencing and
aging (Shore 2000). However, the human homolog 2 is located primarily in
the cytoplasm, and overexpression of the gene has no effect on cell growth
or chromosome stability (Afshar and Murnane 1999). Our LRA analysis
suggests that it is an immune gene in humans, and SAGE data further
demonstrate that the gene is expressed in both lymphoid and myeloid
lineages after the cells are stimulated by NF-κB activators. We
anticipate that the gene plays a fundamental role which is common to
both lymphoid and myeloid cells.
The score thresholds (π(x)) for the two models are set so
that for model I (π(x) > 0.65), the promoter must have
binding sites for NF-κB and at least one other informative TF (AP1,
IK, IRF, or STAT) to be considered a preliminary candidate. To meet the
threshold for model II (π(x) > 0.50), the promoter must
have binding sites for NF-κB and either IK or IRF, or else NF-κB and
both AP1 and STAT.
Because our goal is to develop a specific method for immune gene
prediction, we set the position weight matrix (PWM) thresholds high to
reduce false predictions. The high specificities of model I (93%) and
model II (100%) tend to justify expensive and time-consuming
experimental characterization of the nine predicted immune genes.
Because the sensitivity of our method is 42% and we have 28
predictions from 6440 genes, we would expect that there are about 70
NF-κB-regulated immune genes among the 6440 genes and about 400
NF-κB-regulated immune genes among the 35,000 genes in the entire
human genome.
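
As a rough check of this extrapolation, assume that all 28 predictions are true positives and that the 6440 promoters are representative of the whole genome. Under those assumptions the arithmetic works out as follows, and the figures of about 70 and about 400 quoted above are roundings of these values.

$$
\frac{28}{0.42} \approx 67, \qquad \frac{28}{0.42} \times \frac{35{,}000}{6440} \approx 362
$$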
This is the first successful attempt at a genome-scale computational search for immune genes by promoter analysis. As the sequencing of the human genome is finished, assigning functions to novel genes in a high-throughput manner becomes increasingly important. We demonstrate here that regulatory genomics can be applied to genome-scale prediction of the functions for novel genes in specific physiological pathways or biological systems.
| METHODS |
|---|
Piecewise Linear Regression
Piecewise linear regression analysis (S-Plus 6.0) was used to reveal
the turning point in the NF-κB binding-site positional distribution
curve (Fig. 1). In a two-segment curve, piecewise linear regression
analysis fits both sections with lines after a knot is specified. We
chose the knot yielding the minimal sum of squared errors (230 bp).
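
The knot search can be illustrated with a small pure-Python sketch. It is not the S-Plus procedure used in the study: the function names and the synthetic data are hypothetical, and for simplicity the two segments are fitted independently, whereas a constrained fit would force the lines to meet at the knot. The idea is the same, namely to try each candidate knot and keep the one with the smallest total sum of squared errors.

```python
from typing import Optional, Sequence, Tuple


def line_sse(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Sum of squared errors of the least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = mean_y - slope * mean_x
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))


def best_knot(
    xs: Sequence[float], ys: Sequence[float], min_points: int = 3
) -> Tuple[float, float]:
    """Try every split point, fit one line per side, and return
    (knot, total_sse) for the split with the smallest total error.
    The knot is reported as the first x value of the second segment."""
    pairs = sorted(zip(xs, ys))
    sx = [p[0] for p in pairs]
    sy = [p[1] for p in pairs]
    if len(sx) < 2 * min_points:
        raise ValueError("not enough points for two segments")
    best: Optional[Tuple[float, float]] = None
    for k in range(min_points, len(sx) - min_points + 1):
        total = line_sse(sx[:k], sy[:k]) + line_sse(sx[k:], sy[k:])
        if best is None or total < best[1]:
            best = (sx[k], total)
    assert best is not None
    return best


# Synthetic curve (hypothetical): steep decline up to 230 bp, then nearly flat.
demo_x = list(range(0, 600, 10))
demo_y = [100 - 0.4 * x if x < 230 else 5 - 0.01 * (x - 230) for x in demo_x]
demo_knot, demo_sse = best_knot(demo_x, demo_y)
```
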
Logistic Regression Analysis
We also used S-Plus 6.0 to perform a multivariate logistic
regression analysis (LRA). In the LRA, maximum likelihood was used to
estimate the relative effect of each TF binding site on the probability
that a promoter is associated with an immune gene. We first tested for
possible synergistic effects of multiple-copy TF binding by fitting two
linear models: One model uses the count of binding sites for each TF in
each promoter, and the other model uses an indicator variable for the
existence of at least one binding site for each TF in a promoter,
regardless of the TF copy number. The count of TFs was not found to be
significant, so we used indicator variables for the presence of binding
sites for AP1, IK (IK1, IK3), IRF (IRF1, IRF2, ISRE), NF-κB, or STAT.
Modeling was performed on the 62 NF-κB-regulated immune promoters and
the 248 mammalian nonimmune promoters extracted from the EPD. The
probability that a given promoter is from the group of immune genes,
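π(x), is estimated by:
$$
\pi(x) = \frac{\exp\left(\beta_0 + \sum_{i=1}^{5} \beta_i x_i\right)}{1 + \exp\left(\beta_0 + \sum_{i=1}^{5} \beta_i x_i\right)} \tag{1}
$$
where β0 is the intercept, coefficient βi is the effect of TF xi, and
i indexes the five TFs.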
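
A minimal Python sketch of the logistic score with indicator variables is given below. The coefficient values are hypothetical placeholders chosen only to make the example run; they are not the fitted values from Table 1, and the function name is likewise an assumption.

```python
import math
from typing import Dict, Mapping


def immune_probability(
    intercept: float,
    coefficients: Mapping[str, float],
    site_present: Mapping[str, bool],
) -> float:
    """Logistic score pi(x) for a promoter.

    Each TF contributes its coefficient only when at least one binding
    site for it is present (indicator variable x_i = 1).
    """
    z = intercept + sum(
        beta for tf, beta in coefficients.items() if site_present.get(tf, False)
    )
    return 1.0 / (1.0 + math.exp(-z))  # identical to exp(z) / (1 + exp(z))


# Hypothetical coefficients, NOT the values fitted in the study.
HYPOTHETICAL_INTERCEPT = -4.0
HYPOTHETICAL_COEFFS: Dict[str, float] = {
    "NFKB": 3.0, "AP1": 1.0, "IK": 1.5, "IRF": 1.5, "STAT": 1.0,
}

p_none = immune_probability(HYPOTHETICAL_INTERCEPT, HYPOTHETICAL_COEFFS, {})
p_nfkb = immune_probability(
    HYPOTHETICAL_INTERCEPT, HYPOTHETICAL_COEFFS, {"NFKB": True}
)
p_nfkb_irf = immune_probability(
    HYPOTHETICAL_INTERCEPT, HYPOTHETICAL_COEFFS, {"NFKB": True, "IRF": True}
)
```

With positive coefficients, each additional informative binding site raises the score, which is why a threshold on π(x) translates into a required combination of sites.
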
Assessing the Significance of SAGE Data
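Expression levels of the predicted novel NF-κB-regulated genes,
with and without NF-κB activation, were obtained by data mining from
SAGE databases. We measured the significance of expression level
changes after NF-κB is activated by the method of Audic and Claverie
(1997). In this method, the probability of a change of expression level
from x to y is:
$$
p(y \mid x) = \left(\frac{N_2}{N_1}\right)^{y} \frac{(x+y)!}{x!\, y!\, \left(1 + \frac{N_2}{N_1}\right)^{x+y+1}} \tag{2}
$$
where N1 and N2 are the total numbers of tags sequenced in the two SAGE
libraries being compared.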
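
The Audic and Claverie (1997) probability can be evaluated numerically as sketched below. The sketch assumes the general form of the statistic for two libraries of different sizes, with `n1` and `n2` the total tag counts of the two libraries; the function names and the one-sided tail helper are hypothetical and are not taken from the study.

```python
import math


def audic_claverie_prob(x: int, y: int, n1: int, n2: int) -> float:
    """Probability of observing y tags in library 2 given x tags in
    library 1, where n1 and n2 are the total tag counts of the libraries.

    Computed in log space with lgamma to avoid overflowing factorials.
    """
    if x < 0 or y < 0 or n1 <= 0 or n2 <= 0:
        raise ValueError("counts must be non-negative and totals positive")
    ratio = n2 / n1
    log_p = (
        y * math.log(ratio)
        + math.lgamma(x + y + 1)
        - math.lgamma(x + 1)
        - math.lgamma(y + 1)
        - (x + y + 1) * math.log1p(ratio)
    )
    return math.exp(log_p)


def upper_tail(x: int, y: int, n1: int, n2: int) -> float:
    """One-sided probability of seeing y or more tags given x."""
    below = sum(audic_claverie_prob(x, k, n1, n2) for k in range(y))
    return max(0.0, 1.0 - below)
```

For an observed increase from x to y tags, a small `upper_tail(x, y, n1, n2)` indicates that the increase is unlikely to arise from sampling alone.
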
Genomes
Human genomic sequence data were downloaded from the UCSC Genome
Server (http://genome.ucsc.edu, Golden Path assembly April
2001 release). Mouse genomic sequence data were retrieved from the NCBI
mouse trace database. The Drosophila genome was retrieved from
the NCBI genome data set.
Effects of Stringency in Accepting TF Binding Sites
The search described above was designed to emphasize specificity,
so we set the thresholds for accepting TF binding sites accordingly. To
test the effects on specificity and sensitivity when we lower the
stringency in classifying sequences as TF binding sites, both models I
and II were recalculated with lower PWM score cutoffs (Table 5). As
shown in Table 5, the specificity is
very high (18/19 overall) using our high threshold for accepting TF
binding sites, but sensitivity is only 42% (26/62). With medium
stringency (described in Table 5), the sensitivity increases to 61%,
but the specificity among these additional predictions drops to 40%
(4/10 additional predictions with known function being
NF-κB-regulated immune genes). Lowering the stringency further brings
the specificity down dramatically, because only one of the 17
additional predicted genes with known functions is likely to be an
immune gene, and the sensitivity is only slightly improved (71%). We
conclude that the high stringency search has detected most of the
immune genes which could be detected by this method and should be
applied when we take a genomic approach to search for genes with
specific functions.
| WEB SITE REFERENCES |
|---|
www.prevent.m.u-tokyo.ac.jp/SAGE.html; SAGE data.
http://www.ncbi.nlm.nih.gov/SAGE; SAGE data.
| Acknowledgements |
|---|
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| Footnotes |
|---|
E-MAIL dstates{at}umich.edu; FAX (734) 615-6553.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.911803. Article published online before print in March 2003.
| REFERENCES |
|---|
[...] NF-κB as a frequent target for immunosuppressive and anti-inflammatory molecules. Adv. Immunol. 65: 111-137.
[...] NF-κB: Ten years after. Cell 87: 13-20.
[...] NF-κB-inhibitory protein NF-κB1 p105. Nature 397: 363-368.
[...] by NF-κB downregulates proapoptotic JNK signalling. Nature 414: 308-313.
[...] NF-κB and Rel proteins: Evolutionarily conserved mediators of immune responses. Annu. Rev. Immunol. 16: 225-260.
[...] κB. J. Immunol. 161: 2276-2283.
[...]-chemokine. Blood 89: 3315-3322.
[...] κB activation in endothelial cells. J. Biol. Chem. 276: 7614-7620.
[...] and interferon-[...] synergistically activate the RANTES promoter through nuclear factor κB and interferon regulatory factor 1 (IRF-1) transcription factors. Biochem. J. 350: 131-138.
[...] κB kinase and NF-κB target genes at the pre-B to immature B cell transition. J. Biol. Chem. 276: 18579-18590.
[...] pathway. Immunol. Rev. 76: 30-46.
[...] κB regulates inducible CD83 gene expression in activated T lymphocytes. Mol. Immunol. 37: 783-788.
[...] κB potently upregulates the promoter activity of RANTES, a chemokine that blocks HIV infection. J. Immunol. 158: 3483-3491.
[...]+ and TCR[...]+ intraepithelial lymphocytes provided by serial analysis of gene expression (SAGE). Immunity 15: 419-434.
[...] NF-κB: A lesson in family values. Cell 80: 529-532.
Received October 15, 2002; accepted in revised form December 12, 2002.