Vol. 9, Issue 10, 950-959, October 1999
LETTER
ABSTRACT
Large, publicly available collections of expressed sequence tags (ESTs) have been generated from Arabidopsis thaliana and rice (Oryza sativa). A potential, but relatively unexplored, application of this data is in the study of plant gene expression. Other EST data, mainly from human and mouse, have been successfully used to point out genes exhibiting tissue- or disease-specific expression, as well as for identification of alternative transcripts. In this report, we go a step further in showing that computer analyses of plant EST data can be used to generate evidence of correlated expression patterns of genes across various tissues. Furthermore, tissue types and organs can be classified with respect to one another on the basis of their global gene expression patterns. As in previous studies, expression profiles are first estimated from EST counts. By clustering gene expression profiles or whole cDNA library profiles, we show that genes with similar functions, or cDNA libraries expected to share patterns of gene expression, are grouped together. Promising uses of this technique include functional genomics, in which evidence of correlated expression might complement (or substitute for) that of sequence similarity in the annotation of anonymous genes and identification of surrogate markers. The analysis presented here combines the application of a correlation-based clustering method with a graphical color map allowing intuitive visualization of patterns within a large table of expression measurements.
INTRODUCTION
The development of distinct tissues and cell-types is a fundamental characteristic of the growth of higher organisms. Tissue and cellular differentiation, in turn, is highly dependent on specific patterns of gene expression and transcript accumulation.
In higher plants, a large volume of literature exists documenting spatial and temporal regulation of gene expression. It is increasingly clear that developmental pathways can be considered as modular, and that developmental transitions are accompanied by global changes in the expression of specific complements of genes (Doebley and Lukens 1998). For example, the intensively studied transition from etiolated to greening seedling involves coordinate regulation of many light-regulated genes (von Arnim and Deng 1996). It is also increasingly clear that complements of genes are best studied in parallel, which has become feasible with the development of new technologies (Schena et al. 1996, 1998; Wen et al. 1998).
Traditional approaches to the analysis of mRNA abundance, such as Northern blotting, tend to be limited by the number of transcripts that can be simultaneously analyzed. More recent innovations, such as hybridization to arrayed cDNA libraries or oligonucleotide chips, permit simultaneous analysis of the abundance of thousands of transcripts (for review, see Brown and Botstein 1999). These latter approaches can be thought of as analog, because hybridization signal intensity reflects transcript abundance. In plants, arrays of partially sequenced cDNAs have been successfully applied to the analysis of gene expression in light- and dark-grown seedlings of Arabidopsis (Desprez et al. 1998).
Digital analysis of gene expression can be achieved by generating tags to expressed genes and inferring transcript abundance from the frequency of tags. This approach has been used with both conventional ESTs (Okubo et al. 1992; Lee et al. 1995; Takenaka et al. 1998) and in the SAGE technique with much shorter (10 bp) tags (Velculescu et al. 1995, 1997; Zhang et al. 1997). The availability of significant collections of expressed sequence tags from plant genomes presents an opportunity to analyze digital expression profiles for plant tissues and genes. Several studies have observed that the abundance of ESTs for many genes varies according to the tissue of origin of the cDNA library (Uchimiya et al. 1992; Hofte et al. 1993; Umeda et al. 1994; Cooke et al. 1996; Yamamoto and Sasaki 1997). Because EST data is inherently noisy (Aaronson et al. 1996; Hillier et al. 1996; Wolfsberg and Landsman 1997), a rigorous statistical test was derived to assess the reliability of the identification of differentially expressed genes from EST counts sampled from different libraries (Audic and Claverie 1997). EST data has also been used to reveal alternative transcripts of the same gene, as well as their eventual library-specific distribution (Burke et al. 1998; Gautheret et al. 1998).
As of October 1998, there are ~37,000 Arabidopsis and 27,000 rice publicly available EST sequences, as well as smaller collections from other plant species (http://www.ncbi.nlm.nih.gov/dbEST). An important difference between the Arabidopsis and rice ESTs (at least for the purposes described in this report) is that a large proportion of the Arabidopsis ESTs were generated from a single cDNA library, prepared from a mixture of tissues (Newman et al. 1994; Delseny et al. 1997), whereas the rice ESTs are more evenly derived from a set of tissue and organ-specific cDNA libraries, therefore making them a more suitable starting point for gene expression studies (Yamamoto and Sasaki 1997).
A significant proportion of ESTs show no similarity to sequences in existing databases (Adams et al. 1992; Claverie 1996). Ascribing functions to those anonymous sequences has therefore become one of the major bottlenecks in plant and animal genomics. One way of gaining functional information on anonymous genes is by use of the two-hybrid system (for review, see Brent and Finley 1997). According to this approach, direct physical interactions of the product of an unknown gene are used to reveal its relationships with the products of (hopefully) better-characterized genes. Using publicly available rice ESTs as a test set, we show that a multidimensional analysis of EST data can provide similar types of information, albeit based on the concept of statistical rather than physical interactions. Functional relationships between genes may then be inferred from the mathematical identification of significant similarities between their expression patterns.
Using the rice ESTs available in dbEST (Boguski et al. 1993) we have computed an expression profile for each gene represented by at least 5 ESTs in 10 different cDNA libraries. For each of those genes, the expression profile is therefore derived from 10 expression measurements (EST counts). Correlation analysis was then used to point out significant similarities in the expression profiles of genes as well as to generate a graphical representation of gene clusters exhibiting related expression patterns. Our results indicate that genes with similar functions, or tissues expected to share patterns of gene expression, can be recognized by use of this type of analysis. The multidimensional analysis of EST data, in a way quite parallel to microarray experiments (DeRisi et al. 1996; Eisen et al. 1998), may thus constitute a new approach to the functional annotation of anonymous genes and to a more global understanding of plant physiology.
RESULTS
EST Database and Contigs
A breakdown of the rice cDNA libraries represented in dbEST (as of 10/98) is shown in Table 1. Preliminary investigations in which expression profiles were generated from all libraries with >100 ESTs showed that the smaller libraries gave misleading results (data not shown). Therefore, of the 27 cDNA libraries that contribute to the EST set, only the 10 largest (representing 95% of the total ESTs) were used in the analysis presented here. These 10 cDNA libraries contribute varying numbers of ESTs to the dataset used; the difference between the largest and smallest rice cDNA libraries used here is approximately fivefold (library 1073 has 5094 ESTs, library 1009 has 890 ESTs).
Rice ESTs were organized into clusters and contig (consensus) sequences derived by a protocol adapted from Gautheret et al. (1998) (see Methods). Selected statistics of the clustered set of rice ESTs are shown in Table 2.
Derivation of Expression Profiles
Expression profiles were derived for each of the 707 contigs with 5 or more constituent ESTs. The cDNA library of origin was scored for each constituent EST of each contig, producing a two-way, contig versus library, table of raw EST counts (see Fig. 1). The content of this table was the primary data for all subsequent computations.
In a preliminary investigation, an alternative protocol was explored in which the raw EST counts were further reduced to a binary scale, such that simply the presence or absence of a given gene in a given library was recorded (one or more EST=1, none=0). The subsequent statistical analysis of this binary data (with, e.g., Fisher's 2 × 2 exact test) was found to be much less sensitive and meaningful than the analyses performed with the raw EST counts. The remainder of this report, therefore, focuses on the identification of correlated expression patterns with a statistical analysis of actual EST counts.
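The construction of the contig versus library table, and its reduction to a binary scale, can be sketched as follows. This is a minimal illustration only: the function names, the record format (one `(contig, library)` pair per EST), and the toy contig identifiers are assumptions made for the example, not part of the protocol described here.

```python
from collections import Counter

def count_table(est_records, min_ests=5):
    """Build a contig x library table of raw EST counts.

    est_records: iterable of (contig_id, library_id) pairs, one per EST
    (hypothetical input format). Contigs with fewer than min_ests
    constituent ESTs are dropped, mirroring the 5-EST cutoff in the text.
    """
    records = list(est_records)
    pair_counts = Counter(records)                       # (contig, library) -> count
    contig_totals = Counter(contig for contig, _ in records)
    libraries = sorted({lib for _, lib in records})
    contigs = sorted(c for c, n in contig_totals.items() if n >= min_ests)
    table = {c: [pair_counts[(c, lib)] for lib in libraries] for c in contigs}
    return libraries, table

def to_binary(table):
    """Reduce raw counts to presence/absence (one or more EST = 1, none = 0)."""
    return {c: [1 if n > 0 else 0 for n in row] for c, row in table.items()}

# Toy data: contig ids are invented; library ids reuse numbers from the text.
records = ([("c1", "1073")] * 6 + [("c1", "535")] * 2 +
           [("c2", "307")] * 5 + [("c3", "193")] * 2)
libraries, table = count_table(records)
binary = to_binary(table)
```

In this toy run, contig `c3` is excluded because it has only two ESTs, and the binary table keeps only which libraries each remaining contig was seen in, which discards the abundance information that the correlation analysis relies on.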
Assessing the Pairwise Similarity Between Expression Profiles
The first aim of our analysis is to identify pairs of genes (represented here by EST contigs) exhibiting a similar, multicondition (i.e., cDNA library) expression pattern. For each gene, the data consists of 10 numbers (EST counts) defining an expression profile (see Fig. 1 for overview of entire procedure). If two genes are expressed in a coordinated manner, we expect their expression profiles to have similar shapes, that is, the two series of EST counts to follow the same up or down trend. Given that the absolute EST counts vary widely between contigs (the number of constituent ESTs per contig ranges between 308 and 5) and libraries (there is a fivefold difference between the number of ESTs contributed by the largest and by the smallest cDNA libraries; see Table 1), a meaningful measure of expression profile similarity had to be independent of those absolute numbers. Within these constraints, the Pearson linear correlation coefficient (see Methods) represents a natural, easy to compute, similarity measure. The value of this coefficient varies from −1 to 1; a value close to 1 indicates a high similarity of the compared expression profiles (i.e., proportionality between the EST counts of two genes), whereas a value close to zero indicates no coordinated expression. A useful property of this coefficient is its capacity to also point out pairs of genes exhibiting opposite expression behavior (anti-correlated profiles, for example, sequences expressed in mutually exclusive sets of libraries), potentially another form of biologically interesting gene coupling. In this latter case, the Pearson coefficient value approaches −1.
Finally, a significance level (P value) is associated with the computation of this correlation coefficient, allowing the evidence of pairwise coordinated gene expression to be ranked according to reliability [as with BLAST (Altschul et al. 1990) for sequence similarity].
To first confirm that computing Pearson's correlation coefficient is an appropriate way of identifying correlated expression profiles, groups of contigs with highly correlated profiles were analyzed. First, pairs of contigs with high correlation coefficients (in this case, r > 0.94) were identified within the 707 × 707 (symmetrical) matrix of pairwise gene expression profile correlation coefficients. These pairs of contigs were then organized into mutually matching clusters, whereby each profile in a cluster matches all of the others in the same cluster at the required stringency (r > 0.94). Table 3 shows two such clusters of contigs. The expression profile and putative identity are shown for each contig. Profiles in both groups of contigs are characterized principally by expression in libraries 1073 and 535 (immature seed and panicle at ripening stage). However, the contigs form two discrete clusters on the basis of linear correlation. Thus, for the group of contigs in Table 3a, expression is several fold higher in library 1073 than in library 535, whereas the converse is true for the group of contigs in Table 3b. Most of the contigs in the two clusters encode proteins with seed-related functions, in particular storage proteins, concurring with previous observations of over-representation of prolamin and glutelin transcripts in rice seed cDNA libraries (Liu et al. 1995; Yamamoto and Sasaki 1997).
The Pearson correlation coefficient therefore permits fine-scale identification of sequences with correlated expression profiles.
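The text does not state which procedure was used to assemble the mutually matching clusters. One way to read the requirement that every profile matches all others in its cluster is as a search for maximal cliques in a graph whose edges are pairs with r above the threshold. The sketch below takes that reading as an assumption and uses a basic Bron-Kerbosch recursion; the function name and the toy matrix are hypothetical.

```python
def mutually_matching_clusters(r, threshold=0.94):
    """Return maximal groups of indices in which every pair has r > threshold.

    r: square, symmetric list of lists of correlation coefficients.
    Groups of size one (contigs matching nothing) are not reported.
    """
    n = len(r)
    neighbours = [{j for j in range(n) if j != i and r[i][j] > threshold}
                  for i in range(n)]
    clusters = []

    def expand(clique, candidates, excluded):
        # Basic Bron-Kerbosch recursion (no pivoting) for maximal cliques.
        if not candidates and not excluded:
            if len(clique) > 1:
                clusters.append(sorted(clique))
            return
        for v in list(candidates):
            expand(clique | {v}, candidates & neighbours[v], excluded & neighbours[v])
            candidates = candidates - {v}
            excluded = excluded | {v}

    expand(set(), set(range(n)), set())
    return sorted(clusters)

# Toy correlation matrix for four contigs (values invented for illustration).
toy_r = [[1.00, 0.97, 0.95, 0.10],
         [0.97, 1.00, 0.96, 0.20],
         [0.95, 0.96, 1.00, 0.99],
         [0.10, 0.20, 0.99, 1.00]]
clusters = mutually_matching_clusters(toy_r)
```

Here contigs 0, 1, and 2 form one cluster and contigs 2 and 3 another. Contig 3 is kept out of the first cluster because it does not match contigs 0 and 1, even though it matches contig 2 strongly. Exhaustive clique enumeration can be slow on dense graphs, so for a 707 × 707 matrix a stringent threshold such as 0.94 helps keep the graph sparse.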
Assessing the Pairwise Similarity Between cDNA Libraries
The degree of pairwise similarity between whole cDNA libraries can be similarly assessed with the Pearson correlation coefficient. The same table of multi-condition expression data is used, although with rows and columns exchanging roles. For each of the 10 sampled libraries, the profiles now consist of the 707 numbers (EST counts) characterizing the level of expression of each gene. If two tissues express a similar complement of genes, we expect the EST sampling of the corresponding cDNA libraries to exhibit similar profiles, hence, to be characterized by a high pairwise correlation coefficient. The computation of Pearson's coefficient between all cDNA libraries results in a 10 × 10 (symmetrical) matrix that will be used in building the graphical representation of the expression data.
A Two-Dimensional Graphical Representation Revealing Gene Clusters
The second aim of our study is to build a graphical representation of the whole table of multi-condition expression measurements, as a way to visualize clusters of genes obeying similar expression patterns.
To combine the library and contig data into a single representation, we adapted the clustered correlation approach pioneered by Weinstein et al. (1997) (Fig. 1). This technique involves reordering the results of multidimensional assays (in the latter study, N compounds vs. M tumors) so as to reveal discrete islands of regularities (e.g., different compounds affecting a similar subset of tumors, or different tumors affected by a common subset of compounds). This is performed by reordering the rows (and columns) of the data table so that the most similar ones are adjacent to each other. In our case, the data table consists of the expression measurements of 707 genes (rows) in 10 cDNA libraries (columns).
In the first step, an N × N pairwise metric distance matrix between rows is computed (see Methods) and then used to build a dendrogram that assembles all rows into a single tree. The rows are then reordered according to their hierarchical position in this tree. In our case, the contig/gene pairwise distance matrix is derived (see Methods) from the matrix of pairwise correlation coefficients described above. Adjacent genes then have similar expression profiles (Fig. 1). Given the large number of contigs (707), the complete contig dendrogram has not been reproduced here, although fragments are shown in Fig. 3, below. (The complete dendrogram and other data are available from the authors.)
The same procedure is used to assign pairwise distances to cDNA libraries (Fig. 1) and reorder them in the table of EST counts. Adjacent libraries are then those apparently expressing the most similar subsets of genes (Fig. 1). The tree derived from the library correlation analysis is shown in Figure 2. As would be predicted, libraries derived from similar tissue types (callus libraries 1275, 961, and 75) or libraries derived from overlapping tissues (library 535 from panicle at ripening stage and library 1073 from immature seed) cluster together. This supports the validity of our method, and suggests that the cDNA libraries analyzed are reliable sources of expression data.
Other nearest neighbours on the tree include libraries 499 and 307 (panicle at flowering and green shoot at 8 days old, respectively). Interestingly, library 193 (etiolated shoot at 8 days old) and library 307 (green shoot at 8 days old) are not paired, suggesting significant differences in expression patterns between these tissues. This may be explained by the massive induction of light-regulated transcripts that occurs during the greening process, which are present in green but not etiolated tissue. These differences are also illustrated in the clustered correlation map shown in Figure 3.
Once optimally reordered according to both contig and to library similarity, the expression measurement table can be graphically represented as a map, with the color in a given cell reflecting the underlying EST count (Fig. 1). Following the reordering of rows and columns, clusters of genes exhibiting coordinated expression appear as blocks of similar color, and are readily identified either by visual inspection, or automatically via the use of classical image-processing techniques.
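The reordering and mapping step can be illustrated with a small text-based stand-in for the color map. This is only a sketch: the function name, the character shades, and the toy table are assumptions for illustration, and the row and column orders are taken as given rather than computed from dendrograms.

```python
def clustered_map(table, row_order, col_order, shades=" .:*#"):
    """Reorder a count table and render it as a coarse text 'color' map.

    table: dict mapping row label -> list of EST counts (one per column).
    row_order: row labels in dendrogram order.
    col_order: column indices in dendrogram order.
    shades: characters from lowest to highest count (stand-in for colors).
    """
    reordered = [[table[r][c] for c in col_order] for r in row_order]
    peak = max(max(row) for row in reordered) or 1
    top = len(shades) - 1
    lines = []
    for label, row in zip(row_order, reordered):
        # Ceiling division so that any non-zero count gets a visible shade.
        cells = "".join(shades[(n * top + peak - 1) // peak] for n in row)
        lines.append(f"{label:>4} {cells}")
    return reordered, lines

# Toy table: three contigs (rows) by four libraries (columns), invented values.
toy_table = {"a": [0, 8, 0, 4], "b": [5, 0, 1, 0], "c": [0, 6, 0, 2]}
reordered, lines = clustered_map(toy_table, ["a", "c", "b"], [1, 3, 0, 2])
print("\n".join(lines))
```

After reordering, rows `a` and `c` sit next to each other and their non-zero counts fall in adjacent columns, so the shared expression pattern shows up as a contiguous block, which is the effect the clustered correlation map relies on.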
Figure 3 shows the complete clustered correlation map generated from the rice data. To illustrate ways in which the data may be explored, two fractions of the map have been expanded and annotated with contig numbers and putative identities (Fig. 3B, C). These show that contigs in close proximity on the map may represent genes with related functions. In addition, regions indicated by arrows on the map correspond to clusters of contigs expressed more or less specifically in a particular library; for example the green arrow indicates contig sequences expressed at high levels in library 307 (green shoot, 8-days old), many of which encode chloroplast component precursors.
The clustered correlation map therefore enables expression patterns of interest to be selected prior to identification of specific sequences. The clustered correlation map and associated results are available from the authors.
DISCUSSION
This report presents a new protocol for the analysis of EST data aimed at discovering correlated patterns of gene expression between different tissues, with the rice EST database as a test set. Despite the inherent noise of EST data, and the relatively small size of the data set analyzed, our results show that coherent patterns of gene expression can be revealed. The approach permits both the association of tissues via their common patterns of gene expression and the association of genes via their tissue-dependent expression patterns.
The set of cDNA libraries used to generate the rice ESTs is sufficiently varied to cover each of the principal tissues in the plant life cycle (Yamamoto and Sasaki 1997). In addition, groups of libraries representing the same tissues at different developmental stages (e.g., libraries from panicle tissues) or the same tissue type under different growth conditions (e.g., libraries from callus tissue) are present within the 10 libraries analyzed in our study. By use of this data set, our methods show how whole transcriptomes from different tissues can be compared in a statistical manner. Tissues for which gene expression profiles would be expected to overlap, such as 1073 and 535 (immature seed and panicle at ripening stage, respectively), or 961, 75, and 1275 (all from callus), are found to have overlapping profiles. Similarly, genes with inter-related functions, such as those involved in seed physiology shown in Table 3, are found to have correlated expression profiles. The strength of the method lies in the fact that clustering is based on expression profiles; prior knowledge of sequence identity is not required. Furthermore, the anonymous sequences in Table 3 (i.e., contigs 12, 15, 83, 366, 367, and 378 in the first cluster) illustrate how expression profile clustering might aid candidate gene selection; in this particular example, the anonymous sequences in Table 3 would be good candidates for identification of novel genes involved in seed metabolism.
Clustering genes by expression profile may also enable identification of novel regulatory elements, as genes with correlated profiles might be expected to have regulatory elements in common (DeRisi et al. 1997; Brazma et al. 1998). Other possible uses include the identification of surrogate markers (e.g., Figueroa et al. 1998; Johnson et al. 1998), whereby a conveniently assayed biomarker allows monitoring or prediction of a particular condition (e.g., a gene or cluster of genes whose expression profiles consistently correlate with an agricultural trait of interest).
Overall similarities between tissues are clearly revealed by the dendrogram or the two-dimensional clustered correlation map representation of expression profiles. These types of observations may contribute to a new understanding of the interrelationships between different tissues and developmental pathways. For example, it has long been hypothesized that leaves and certain floral organs derive from a common ancestral organ, an idea supported by documented instances of common regulatory processes during leaf and floral morphogenesis (Arber and Parkin 1907; Satina and Blakeslee 1941; Steeves and Sussex 1989; Bowman et al. 1993; Hofer et al. 1997). Large-scale studies of gene expression may support these hypotheses by identifying tissues with similar or overlapping patterns of gene expression.
The value of expression profiles from EST collections, and the potential for functional prediction, are entirely dependent on the available data. In addition, certain assumptions are implicit when using EST collections for transcript profiling. First, to ensure that tag frequency correlates with the actual transcript abundance in a given tissue, the cDNA libraries should have been prepared in a comparable manner. For example, normalized cDNA libraries (e.g., Patanjali et al. 1991), in which the frequencies of clones representing abundant and rare transcripts are normalized with respect to one another, are not suitable for a study of this type (although some large effects might still be detectable by a binary presence/absence coding of the original multi-condition EST counts). In addition, ESTs should be contributed to the databases without prior selection for novel sequences (in some cases redundancy within EST sets is reduced by first screening the existing EST set and then only submitting sequences not already present). Potential errors may also originate from the EST-clustering procedure. For instance, ESTs derived from the 5' and 3' ends of a long transcript may constitute discrete contigs. However, this is not anticipated to be a major problem in the technique presented here.
The potential of large-scale gene expression analysis is most often discussed in the context of hybridization techniques such as cDNA microarrays (see Duggan et al. 1999) or synthetic oligonucleotide arrays (for a recent review, see Lipshutz et al. 1999). These technologies have been applied in several systems including two independent studies of the yeast transcriptome (Wodicka et al. 1997; Eisen et al. 1998), the monitoring of 1000 human genes in activated human T-cells (Schena et al. 1996), and the analysis of the fibroblast transcriptional response to serum (Iyer et al. 1999). Studies have also been performed on subsets of Arabidopsis cDNAs (Schena et al. 1995; Desprez et al. 1998) and on a subset of human genes related to inflammation (Heller et al. 1997). These accomplishments should not hide the fact that the high-density microarray technology is still only marginally accessible to academic laboratories (Cheung et al. 1999). On the other hand, the established EST (Adams et al. 1992; Okubo et al. 1992) or SAGE (Velculescu et al. 1995) approaches have proven their capacity in monitoring gene expression in a large variety of experimental systems (Lee et al. 1995; Anderson and Seilhamer 1997; Madden et al. 1997; Velculescu et al. 1997; Zhang et al. 1997; He et al. 1998; Hibi et al. 1998; Takenaka et al. 1998; de Waard et al. 1999), including plants (Uchimiya et al. 1992; Hofte et al. 1993; Liu et al. 1995; Yamamoto and Sasaki 1997). The EST approach is unique in allowing both expression measurements and the discovery of new genes at the same time, whereas microarray techniques are limited to a repertoire of previously identified sequences. Furthermore, ESTs have a wide range of applications including mapping and studies of colinearity (Sasaki 1996). Several studies have shown that EST/SAGE sampling experiments can reliably identify differentially expressed genes (Lee et al. 1995; Audic and Claverie 1997; He et al. 1998; Greller and Tobin 1999). In more recent work, it has been shown that the analysis of EST data can provide valuable insight into the existence and the expression patterns of alternative transcript forms (Burke et al. 1998; Gautheret et al. 1998). In the present article, we show that the analysis of this data can be extended beyond the simple recognition of differential expression to the identification of gene subsets exhibiting coordinated expression patterns.
From a statistical point of view, multicondition expression data obtained from hybridization arrays or cDNA tag sampling are quite similar. They both result in gene abundance estimates stored in a gene versus cDNA library table. Thus, it is expected that after a first step of signal processing (such as noise filtering, pixel detection, thresholding, and normalization) specific to the microarray technique involved, similar statistical treatment could be applied. In the case of EST or tag data, initial signal processing consists mainly of selecting genes and libraries for which total tag counts are large enough to eventually lead to statistically significant inferences (in our own study, selecting contigs representing five or more ESTs, and those cDNA libraries from which >800 ESTs have been generated). Our analysis is then quite similar to the approach independently followed by Eisen et al. (1998) to identify coordinated gene expression in yeast using cDNA microarrays. For instance, both use the Pearson correlation coefficient as the primary statistical parameter to quantify the similarity of expression profiles. However, slightly different metrics for the subsequent hierarchical clustering of genes were used; whereas Eisen et al. (1998) directly used the pairwise correlation coefficient between genes, we computed a true Euclidean distance from the whole gene versus gene correlation coefficient matrix. The distance between two genes is thus computed from the similarity of their expression with all other genes in the matrix, and not from a single pairwise correlation. This procedure, which minimizes the influence of random fluctuation in tag counting, might also serve in smoothing the noise of microarray pixel data. The sensitivity of expression analysis from EST data depends to an extent on the number of ESTs sequenced. Theoretically, expression profiles could be derived for even very weakly expressed genes if sufficient numbers of ESTs were generated. This contrasts with current limitations of microarray technology, in which sensitivity is limited by the quantity of RNA used per hybridization, making detection of very weakly expressed transcripts difficult (see Duggan et al. 1999).
The nature of our multicondition expression data also allowed us to perform hierarchical clustering of both rows (genes) and columns (cDNA libraries), resulting in a two-dimensional clustering (following Weinstein et al. 1997) indicative of both gene and library expression similarity. Similar genes are thus graphically clustered into islands of simple shape (Fig. 3). In a subsequent development of our display program, the visual recognition of these islands will be supplemented by standard image processing algorithms, an attractive alternative to the complexity of more abstract clustering algorithms.
With increased definition of EST collections (e.g., cDNA libraries prepared from tighter developmental windows, or cDNA libraries prepared from specific cell types), digital expression profiles will become increasingly valuable sources of expression information. This information, alongside expression data from other large-scale approaches, has an important role to play in our efforts to assign function to anonymous sequences.
METHODS
EST Database and Contigs
Rice ESTs were extracted from GenBank version 107 with Batch Entrez at the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/Entrez/batch.html). dbEST (Boguski et al. 1993) reports were obtained with the Sequence Retrieval System (SRS) at the Human Genome Mapping Project (http://iron.hgmp.mrc.ac.uk/).
Rice ESTs were quality controlled and organized into contigs as described elsewhere (Ewing et al. 1999; http://igs-server.cnrs-mrs.fr/ewing). The protocol involved a classical preliminary cleaning of the EST data (vector removal, elimination of low quality sequences), a stringent pairwise comparison of all cleaned EST sequences, followed by the separate contiging of overlapping ESTs. Because our aim is a statistical analysis of gene expression profiles, contigs derived from fewer than five constituent ESTs were excluded from the study. Putative identities (Table 2) were assigned to every resulting contig sequence by querying them against the SWISS-PROT/TrEMBL (36.0) database (Bairoch and Apweiler 1998) with gapped BLASTx (Altschul et al. 1997).
Contig and Library Correlation Analysis
The similarity between contigs (genes) or cDNA library expression profiles was estimated by Pearson's r coefficient, quantifying the degree of linear correlation between two variables, X = (x1,x2,...,xN) and Y = (y1,y2,...,yN).
Given a sample of N pairs of scores, r quantifies the extent to which we can make useful predictions on the value of Y from the knowledge of the corresponding X score. The measure of correlation, r, is computed as

$$ r = \frac{\sum_{i=1}^{N}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i-\bar{x})^2 \, \sum_{i=1}^{N}(y_i-\bar{y})^2}} $$

where $\bar{x}$ and $\bar{y}$ are the means of X and Y. The value of r ranges between −1 and +1. Values of r near to 0 indicate a low degree of correlation. Positive values of r indicate that high values of X are associated with high values of Y. Negative values of r indicate that low values of X are associated with high values of Y (anti-correlation) or vice versa.
The pairwise gene expression correlation coefficients were computed by the repetitive use of the above formula, in which X and Y are different genes associated with their corresponding EST counts (x1,x2, ... ,xN) and (y1,y2, ... ,yN) measured in cDNA libraries 1,2, ... ,N (with N = 10). The result of these computations constitutes a 707 × 707 symmetrical matrix of correlation values, and a matrix of pairwise gene distances was subsequently derived from it as described below.
Similarly, a table of the pairwise library correlation coefficients was computed, now taking X and Y as different libraries associated with the EST counts (x1,x2, ... ,xN) and (y1,y2, ... ,yN) corresponding to the various genes 1,2, ... ,N (with N = 707). The result of these computations constitutes a 10 × 10 symmetrical matrix of correlation values. As for the gene distance values, a matrix of pairwise library distances was derived as described below.
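The two correlation matrices can be sketched in a few lines. This is an illustrative sketch rather than the code used in the study: the function names and the toy counts are hypothetical, and returning 0 for a profile with no variation is an assumption made here so that the example never divides by zero.

```python
from math import sqrt

def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return sxy / sqrt(sxx * syy)

def correlation_matrix(profiles):
    """Symmetric matrix of pairwise Pearson coefficients between rows."""
    n = len(profiles)
    return [[1.0 if i == j else pearson(profiles[i], profiles[j])
             for j in range(n)] for i in range(n)]

def transpose(table):
    """Swap rows and columns so that libraries become the profiles."""
    return [list(column) for column in zip(*table)]

# Toy table: 3 genes (rows) x 4 libraries (columns), invented counts.
counts = [[10, 4, 0, 0],
          [5, 2, 0, 0],
          [0, 0, 3, 6]]
gene_r = correlation_matrix(counts)                # genes x genes
library_r = correlation_matrix(transpose(counts))  # libraries x libraries
```

The first two toy genes have proportional counts, so their coefficient is 1 (up to rounding) even though one is twice as abundant as the other, while the third gene, expressed in a disjoint set of libraries, is anti-correlated with both. The same function applied to the transposed table gives the library by library matrix.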
Hierarchical Classification of Genes and Libraries
The hierarchical classification (dendrogram) of objects requires the calculation of the distance between all pairs of objects. From the gene correlation matrix constructed previously (the elements of which are r values ranging from −1 to 1), a pairwise Euclidean distance matrix was derived as follows. The Euclidean distance d, between two sets, X = (x1,x2, ... ,xN) and Y = (y1,y2, ... ,yN) is simply computed as

$$ d(X,Y) = \sqrt{\sum_{i=1}^{N}(x_i-y_i)^2} $$

Here X and Y are the rows of the correlation matrix for the two genes being compared, so that the distance between two genes reflects the similarity of their correlations with all other genes rather than a single pairwise coefficient.
By the same method, the 10 × 10 matrix of library correlation coefficients was used to derive pairwise distance values between libraries. The gene and library distance matrices were then used to build their associated dendrograms according to the UPGMA algorithm (Sokal and Michener 1958), implemented in the neighbor program (Kuhner and Felsenstein 1994). Dendrograms were plotted with the njplot program (Perriere and Gouy 1996). The order of contigs and libraries in their respective dendrograms was used to reorder the original data table. The reordered data table was then used as the basis for plotting the clustered correlation map, generated with Matlab 5.2 (MathWorks, Inc.).
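The distance and ordering steps can be sketched in pure Python. This is a stand-in written for illustration, not the neighbor program used in the study: the function names and the toy correlation matrix are hypothetical, and the leaf order simply concatenates merged clusters, which may differ from the left-to-right order a tree-drawing program would choose.

```python
from math import sqrt

def euclidean(x, y):
    """Euclidean distance between two rows of a correlation matrix."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def upgma_order(rows):
    """Leaf order from average-linkage (UPGMA) clustering of the rows."""
    members = {i: [i] for i in range(len(rows))}   # cluster id -> ordered leaves
    dist = {frozenset((i, j)): euclidean(rows[i], rows[j])
            for i in members for j in members if i < j}
    next_id = len(rows)
    while len(members) > 1:
        pair = min(dist, key=dist.get)             # closest pair of clusters
        a, b = sorted(pair)
        size_a, size_b = len(members[a]), len(members[b])
        for c in members:
            if c in pair:
                continue
            d_ac = dist.pop(frozenset((a, c)))
            d_bc = dist.pop(frozenset((b, c)))
            # UPGMA update: size-weighted mean of the two old distances.
            dist[frozenset((next_id, c))] = (
                (size_a * d_ac + size_b * d_bc) / (size_a + size_b))
        del dist[pair]
        members[next_id] = members.pop(a) + members.pop(b)
        next_id += 1
    return next(iter(members.values()))

# Toy 3 x 3 gene correlation matrix (invented values).
toy_corr = [[1.0, 0.9, -0.8],
            [0.9, 1.0, -0.7],
            [-0.8, -0.7, 1.0]]
labels = ["contig_A", "contig_B", "contig_C"]
order = upgma_order(toy_corr)
reordered_labels = [labels[i] for i in order]
```

In the toy example the first two contigs have nearly identical rows in the correlation matrix, so they are merged first and end up adjacent in the final order. Applying the same function to the library correlation matrix gives the column order, and the two orders together define the reordered table behind the map.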
ACKNOWLEDGMENTS
The financial support of Novartis Crop Protection, Inc. is gratefully acknowledged. We also thank Dr David Robertson for help with dendrograms and Suzanne Dixon for reading the manuscript.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
1 Corresponding author.
E-MAIL ewing{at}igs.cnrs-mrs.fr; FAX 33 (0)4 91 16 45 49.
REFERENCES
database for "expressed sequence tags". Nat. Genet. 4: 332-333.
analysis of transcripts of genes engaged in ATP-generating pathways. Plant Mol. Biol. 25: 469-478.
Received April 26, 1999; accepted in revised form August 4, 1999.