Vol. 9, Issue 12, 1198-1203, December 1999
LETTER
| |
ABSTRACT |
|---|
We wish to identify genes associated with disease. To do so, we look for novel genes whose expression patterns mimic those of known disease-associated genes, using a method we call Guilt-by-Association (GBA), on the basis of a combinatoric measure of association. Using GBA, we have examined the expression of 40,000 human genes in 522 cDNA libraries, and have discovered several hundred previously unidentified genes associated with cancer, inflammation, steroid-synthesis, insulin-synthesis, neurotransmitter processing, matrix remodeling, and other disease processes. The majority of the genes thus discovered show no sequence similarity to known genes, and thus could not have been identified by homology searches. We present here an example of the discovery of eight genes associated with prostate cancer. Of the 40,000 most-abundant human genes, these 8 are the most closely linked to the known diagnostic genes, and thus are prime targets for pharmaceutical research.
[The sequence data described in this paper have been submitted to the GenBank data library under accession nos. AF109298-AF109303.]
| |
INTRODUCTION |
|---|
Genes that are differentially expressed in disease states are candidates for pharmaceutical intervention. Previous researchers have collected expression data for up to 10,000 genes simultaneously (Lockhart et al. 1996; Lashkari et al. 1997), have identified genes differentially expressed in cancer (DeRisi et al. 1996; Fannon 1996; Zhang et al. 1997; Vasmatzis et al. 1998), and have identified clusters of coexpressed genes (Eisen et al. 1998; Michaels et al. 1998; Wen et al. 1998; Tamayo et al. 1999). Previous work has focused on differential expression, for example, in healthy versus diseased tissue (Greller and Tobin 1999); the joint expression of novel genes with known disease genes has rarely been examined. In addition, previous work has examined a small fraction of the total genome (typically 10,000 genes or fewer) and has used linear or monotonic measures of correlation, which fail to detect many known gene associations.
To identify genes that are candidate therapeutic or diagnostic targets, we look for novel genes whose expression patterns mimic those of known disease-associated genes. For the analyses presented here, we examined the expression of 40,000 human genes in 522 cDNA libraries in Incyte's LifeSeq database. The libraries were prepared from a diverse set of human anatomic and pathologic samples, representing most major tissue categories and many of the major pathologies.
| |
RESULTS |
|---|
We present an example of the application of Guilt by Association (GBA) to identify genes associated with prostate cancer. Each year in the United States, prostate cancer kills >40,000 men, and >200,000 new cases are diagnosed, making it the second most common cancer and the second most common cause of cancer deaths among males (Parker et al. 1996; Presti and Carroll 1996; Foster 1998). Unfortunately, the best available diagnostic tests are substantially <100% sensitive and specific, and many men have incurable prostate cancer at the time of diagnosis (Whittemore et al. 1995; Richie and Kaplan 1996; Stamey 1996).
The standard molecular diagnostic marker for prostate cancer is prostate-specific antigen (PSA), a protease produced in the prostate (Morris et al. 1998); however, ~20% of men who undergo prostatectomy for prostate cancer have normal levels of PSA (Presti et al. 1996). Prostatic acid phosphatase (PAP) was used widely in diagnostic tests for prostate cancer prior to the development of the more reproducible PSA test (Bostwick 1998). Kallikrein is a protease expressed in the prostate that has 80% sequence similarity with PSA (Corey et al. 1997) and is differentially expressed in prostate cancer. Several groups are evaluating kallikrein for use as a diagnostic test for prostate cancer and as a measure of response to therapy (Charlesworth et al. 1997; Eerola et al. 1997; Mikolajczyk et al. 1998). Two other proteins that have been linked to the disease are seminal-plasma protein and prostate-specific transglutaminase. Seminal-plasma protein is a prostate-secreted protein with inhibin-like activity (Mbikay et al. 1987). Inhibins are members of the transforming growth factor β (TGF-β) superfamily of growth factors (Thomas et al. 1998) that modulate prostate tumor growth (Perry et al. 1997; Guo and Kyprianou 1998). Prostate-specific transglutaminase catalyzes post-translational protein cross-linking, and exhibits differential expression in prostate cancer cell lines (Dubbink et al. 1996). Other proteins have been identified as potential markers of prostate cancer (Cramer et al. 1998; Rentzepis et al. 1998), and are candidates for use in the analysis method described here. Despite the current standard treatments for prostate cancer (surgery, radiation, and chemotherapy), men with this disease still have significant mortality and morbidity rates.
By use of GBA as described in the Methods section, we identified eight novel genes (IPCA-1 through IPCA-8) that show a strong association with the known prostate cancer genes (PSA, PAP, kallikrein, seminal-plasma protein, and prostate-specific transglutaminase). Table 1 shows that, for PSA, the most closely coexpressed genes are glandular kallikrein, three novel genes, prostate seminal protein, PAP, a fourth novel gene, prostate transglutaminase, a fifth novel gene, and neuropeptide Y. (IPCA-9, IPCA-10, and IPCA-11 are coexpressed with PSA but appear to be 3' untranslated sequences.) Table 2 summarizes the coexpression of the eight novel genes with the five known prostate cancer genes, and, for comparison, two unrelated genes (myosin light chain and elongation factor 1α). The four values for each gene in Table 1 are the values used in the GBA probability calculations (see Methods; Table 6, below). Each of the eight novel genes is coexpressed with at least one known prostate cancer gene with a P value <10^-6.
By use of 522 libraries from diverse tissues, we increase the sample size available for the statistical tests, but we run the risk of detecting associations that simply indicate that two genes are expressed in the same tissue, rather than being more closely linked in their function. We wish to test that the observed associations are not simply due to coexpression in the same tissue. Hence, we performed the same GBA analysis on a set of 51 male-reproductive tissue libraries, with the results shown in Table 3. The associations detected in the male-reproductive tissue libraries support the conclusions reached with all 522 libraries, specifically, that the same set of genes shows close association even within the tissue type. The P values are not as small as with 522 libraries, because of the smaller sample size, but are still much less than the P values for the unrelated genes (ef1-α and myosin).
Four other known genes (besides the five used in the GBA analysis) are coexpressed with at least one of the known prostate cancer genes and are among the ten genes most closely coexpressed with that prostate cancer gene. MAT8 is coexpressed with prostatic seminal protein. It has been reported to be differentially expressed in breast cancer (Morrison et al. 1995; Schiemann et al. 1998). Neuropeptide Y is coexpressed with PSA. It has been reported to be associated with prostate cancer (Minth et al. 1984; Mack et al. 1997). Sorbitol dehydrogenase is coexpressed with prostate transglutaminase and with kallikrein. Significant alterations in its activity in toxin-damaged male reproductive tissues have been reported (Pant et al. 1995); we found no previous report of an association with cancer. Zn-α-2-glycoprotein is coexpressed with prostate transglutaminase. It has been reported to be associated with prostate cancer (Gagnon et al. 1990) and breast cancer (Hurlimann and van Melle 1991; Freije et al. 1993; Lopez-Boado et al. 1994). Our analysis did not detect genes that are significantly underexpressed when the prostate cancer-associated genes are overexpressed.
Seven of the eight IPCA genes showed distant or no sequence similarity to genes known at the time of the analysis. Gene IPCA-3 exhibits significant sequence similarity to several serine proteases. Subsequent to the identification of these genes by GBA and submission of their sequences to GenBank in November 1998, the sequence of a gene with >99% identity to IPCA-3 was reported in the GenBank database. This gene is prostase, an androgen-regulated serine protease with prostate-restricted expression (Nelson et al. 1999).
| |
DISCUSSION |
|---|
We have analyzed the pairwise coexpression patterns of 40,000 genes in >500 libraries (the largest such expression analysis reported to date), and have identified several hundred disease-associated genes using a novel coexpression algorithm, GBA. We identified eight novel genes associated with prostate cancer. Of the 40,000 most-abundant human genes, these 8 are the most closely linked to the known prostate cancer diagnostic genes, and thus are prime targets for pharmaceutical research.
The method of analysis presented here is complementary to traditional correlation measures in that GBA is better able to detect nonlinear relationships in data with high variability; for genes whose relationships are linear or monotonic, we would expect correlation analysis to provide similar results, possibly even with smaller sample sizes than are required for GBA.
GBA is able to detect potential disease-associated genes that have no sequence similarity to known genes, and does not require that expression be measured in both diseased and healthy tissue (as is required by differential expression analysis); thus, it offers opportunities to discover the functions of genes that cannot be readily identified by other means.
| |
METHODS |
|---|
For the method of analysis described here, we reduce each expression datum to a binary variable (present or absent), rather than analyzing expression as a continuous variable using linear or rank correlation. Before we chose this binary-encoding method to identify coexpressed genes, we evaluated Pearson linear correlation and Spearman rank correlation using continuous values. Although these correlation methods sometimes identified known relationships well, they often performed unsatisfactorily, possibly for several reasons. Many genes that are known to be associated do not exhibit the simple linear or monotonic relationships assumed by these methods. Libraries may in some cases be normalized or subtracted to increase complexity. The quantitative measurement of expression has sufficiently high variability (inability to accurately distinguish two- to threefold changes, particularly at low expression levels) that correlation measures may not reliably distinguish true associations from spurious correlation.
For the purpose of this analysis, we consider a gene to be present
(expressed) if cDNA corresponding to that gene is detected in the
sample from that library. We consider a gene to be absent (not
expressed) when no cDNA for that gene is detected in the library. To
determine whether two genes, A and B, have similar expression patterns,
we examine their occurrences in the 522 cDNA libraries, as shown in
Table 4. A 0 indicates that the gene was not detected
in the library; a 1 indicates that it was detected.
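The binary encoding described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function names, the per-library count vectors, and the cell ordering (rows for gene A, columns for gene B) are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple


def to_presence(cdna_counts: Sequence[int]) -> List[int]:
    """Reduce per-library cDNA counts to 1 (detected) or 0 (not detected)."""
    return [1 if count > 0 else 0 for count in cdna_counts]


def contingency_counts(
    gene_a: Sequence[int], gene_b: Sequence[int]
) -> Tuple[int, int, int, int]:
    """Return (n11, n12, n21, n22) for two 0/1 presence vectors.

    Assumed cell layout (hypothetical, chosen for this sketch):
      n11: A present, B present    n12: A present, B absent
      n21: A absent,  B present    n22: A absent,  B absent
    """
    if len(gene_a) != len(gene_b):
        raise ValueError("both genes must be scored in the same libraries")
    n11 = n12 = n21 = n22 = 0
    for a, b in zip(gene_a, gene_b):
        if a and b:
            n11 += 1
        elif a:
            n12 += 1
        elif b:
            n21 += 1
        else:
            n22 += 1
    return n11, n12, n21, n22


# Hypothetical counts for two genes in six libraries (illustrative data only).
counts_a = [3, 0, 1, 0, 0, 2]
counts_b = [1, 0, 0, 0, 4, 5]
table = contingency_counts(to_presence(counts_a), to_presence(counts_b))
print(table)  # prints (2, 1, 1, 2)
```

Only presence or absence survives the encoding, so a gene detected once and a gene detected fifty times in the same library are treated identically.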
For a given pair of genes, the expression data in Table 4 can be summarized in a 2 by 2 contingency table. Table 5 presents such a coexpression contingency table for the hypothetical genes A and B in a total of 30 libraries; Table 6 presents the same data as variables that we will use shortly. We determine the probability that the coexpression shown in Table 5 occurs by chance with a counting method, as follows. We take as our null hypothesis that there is no association between gene A and gene B. Under the null hypothesis, the marginal counts in Tables 5 and 6 are fixed, the expected count in each cell is a function of the marginals, and deviations from the expected count are random. The number of ways that k occurrences of a gene can be distributed in r libraries is (r C k), that is, the combinatoric choose function. From Table 6, we can calculate the probability of observing n11 counts using the hypergeometric distribution, as in a Fisher Exact test (Agresti 1990). From the hypergeometric distribution, the probability of observing exactly n11 counts is p(n11) = (n1. C n11) × (n2. C n21) / (n.. C n.1). Here a dot denotes summation over that index, so n1. and n2. are the row totals, n.1 is the first column total, and n.. is the total number of libraries.
To determine whether there is association (lack of independence) between the genes, we calculate the sum of all the (hypergeometric) probabilities for outcomes at least as extreme as the observed outcome. As a concrete example, consider the n11 count in the cell (Gene A present and Gene B present) in Table 5. We can calculate the probability of observing a count of exactly 8 using the hypergeometric distribution, that is, p(n11 is 8) = (10 C 8) × (20 C 2)/(30 C 10). To test the null hypothesis, we are interested not only in the case in which we observe a count of exactly 8 in the cell, but also the cases in which we observe more extreme values of n11, subject to the constraints of the marginals. Hence, we sum the probability of the observed count and of the more extreme possible counts (n11 = 8, 9, and 10) to determine the total probability of counts at least as extreme as those observed. In the case of Table 5, the probability that the observed coexpression is due to chance is P = 0.0003.
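The worked example above can be checked with a short Python sketch that uses only the standard library. The function and parameter names are assumptions introduced for illustration; the calculation is the one-sided hypergeometric tail described in the text (the observed count plus all larger counts permitted by the marginals), not necessarily the code the authors ran.

```python
from math import comb


def hypergeom_point(n11: int, row1: int, col1: int, total: int) -> float:
    """Probability of exactly n11 co-occurrences with fixed marginals.

    row1 is n1. (first row total), col1 is n.1 (first column total),
    and total is n.. (number of libraries).
    """
    n21 = col1 - n11     # remaining first-column count falls in row 2
    row2 = total - row1  # n2., the second row total
    return comb(row1, n11) * comb(row2, n21) / comb(total, col1)


def coexpression_p_value(n11: int, row1: int, col1: int, total: int) -> float:
    """Sum the point probabilities for counts at least as large as n11."""
    largest = min(row1, col1)  # n11 cannot exceed either marginal
    return sum(
        hypergeom_point(k, row1, col1, total) for k in range(n11, largest + 1)
    )


# Marginals from the Table 5 example: n11 = 8, n1. = 10, n.1 = 10, n.. = 30.
p = coexpression_p_value(8, 10, 10, 30)
print(f"{p:.4f}")  # prints 0.0003
```

The three terms in the sum correspond to n11 = 8, 9, and 10, and together they reproduce the P = 0.0003 quoted for Table 5.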
This method of estimating the probability for coexpression of two genes makes several assumptions that do not hold strictly. Because more than one library may be obtained from a single patient (for example, both tumor and nontumor tissue), libraries are not completely independent. In addition, because we perform multiple statistical tests on each gene, we must consider the question of statistical significance and interpretation of the P values. One method to correct for multiple comparisons in determining a suitable P value is to apply a Bonferroni correction (dividing the desired α, say, P = 0.01, by the number of comparisons performed). For n genes, we perform n(n - 1)/2 pairwise comparisons; thus 40,000 genes yield 8 × 10^8 pairwise comparisons, requiring a Bonferroni-corrected P value of 0.01/(8 × 10^8) or ~10^-11. With such a correction, some, but not all, of the eight identified genes still show significant association with the known genes. However, the Bonferroni correction is extremely conservative, yielding almost no false positives at the price of failing to detect many real associations. For this reason, we implemented an alternative to the Bonferroni correction, which we describe next.
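The Bonferroni arithmetic in this paragraph can be reproduced with the small Python sketch below. The function names are assumptions made for the example; the inputs (40,000 genes and a desired level of 0.01) come from the text.

```python
def pairwise_comparisons(n_genes: int) -> int:
    """Number of unordered gene pairs, n(n - 1)/2."""
    return n_genes * (n_genes - 1) // 2


def bonferroni_threshold(alpha: float, n_comparisons: int) -> float:
    """Per-comparison P value cutoff after Bonferroni correction."""
    return alpha / n_comparisons


pairs = pairwise_comparisons(40_000)
threshold = bonferroni_threshold(0.01, pairs)
print(pairs)               # prints 799980000, i.e. about 8 x 10^8
print(f"{threshold:.2e}")  # prints 1.25e-11, i.e. on the order of 10^-11
```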
To analyze 40,000 genes, we perform 8 × 10^8 pairwise comparisons. Suppose that the genes in this set are not, in fact, associated. If none of the genes in the 40,000 were associated, we would still expect, by chance, to see 8 × 10^8 × 10^-6 = 800 pairs of genes with a P value <10^-6. Empirically, when we perform the 8 × 10^8 pairwise comparisons, we observe >250,000 pairs with a P value <10^-6, which is consistent with the notion that many pairs of genes do, in fact, have related function and therefore have some similarity in their expression patterns. We expect 800 by chance, but observe >250,000; this result suggests that, at a P value of 10^-6, a very large proportion of all associations are due to true biological relationships.
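The comparison of expected and observed pair counts can be written out as a short Python sketch. The names are assumptions made for illustration, and the resulting fraction is only a rough estimate: it takes the rounded figures from the text at face value and treats 250,000 as the observed count even though the text gives it as a lower bound.

```python
def expected_chance_pairs(n_comparisons: int, p_cutoff: float) -> float:
    """Pairs expected below the cutoff if no genes were truly associated."""
    return n_comparisons * p_cutoff


n_comparisons = 800_000_000  # 8 x 10^8, rounded as in the text
p_cutoff = 1e-6
observed_pairs = 250_000     # the text reports more than this many

expected = expected_chance_pairs(n_comparisons, p_cutoff)
chance_fraction = expected / observed_pairs

print(round(expected))           # prints 800
print(f"{chance_fraction:.4f}")  # prints 0.0032
```

Under these assumptions, roughly 0.3% or fewer of the pairs passing the cutoff would be expected from chance alone, which is the reasoning behind preferring this cutoff to the Bonferroni-corrected one.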
For practical interpretation, we can only claim that, of the 40,000 genes examined, the genes identified here are the most closely associated with the known prostate cancer genes. The observed expression patterns demonstrate association with, but do not prove direct involvement in, prostate cancer. The known genes in this analysis are all far downstream in the process of prostate cancer progression; we speculate that some of the novel genes, whose expression patterns do not exactly match those of the known genes, may be closer to the origin of the disease, and perhaps involved in regulation of the progression of the cancer.
| |
ACKNOWLEDGMENTS |
|---|
We thank our colleagues at Incyte and our editor, Lyn Dupre, for their support and assistance in this research. We thank the reviewers for several suggestions that improved the quality and clarity of the paper.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| |
FOOTNOTES |
|---|
3 Corresponding author.
E-MAIL mwalker{at}incyte.com; FAX (650)855-0572.
| |
REFERENCES |
|---|
identification of therapeutic targets. Trends Biotechnol. 14: 294-298.
Received April 26, 1999; accepted in revised form September 23, 1999.