Vol. 12, Issue 4, 567-583, April 2002
LETTER
ABSTRACT
All protein sequences from 19 complete chloroplast genomes (cpDNA) have been studied using a new computational method able to analyze functional correlations among series of protein sequences contained in complete proteomes. First, all open reading frames (ORFs) from the cpDNAs, comprising a total of 2266 protein sequences, were compared against the 3168 proteins from the Synechocystis PCC6803 complete genome to find functionally related orthologous proteins. Additionally, all cpDNA genomes were compared pairwise to find orthologous groups not present in cyanobacteria. Annotations in the clusters of orthologous proteins (COGs) database and in CyanoBase were used as reference for the functional assignments. Following this protocol, new functional assignments were made for ORFs of unknown function and for ycfs (hypothetical chloroplast frames), which still lacked a functional assignment. Using this information, a matrix of functional relationships was derived from profiles of the presence or absence of orthologous proteins; the matrix included 1837 proteins in 277 orthologous clusters. A factor analysis of this matrix, followed by cluster analysis, allowed us to obtain accurate phylogenetic reconstructions and to detect genes probably involved in speciation as phylogenetic correlates. Finally, by grouping common evolutionary patterns, we show that it is possible to determine functionally linked protein networks. This has allowed us to suggest putative associations for some unknown ORFs.
INTRODUCTION
The so-called postgenomic era is linked to the knowledge of complete
genomes for many organisms. In this context, the design and testing of
new mathematical and computational tools able to assign function to
gene products and to compare complete genomes are becoming crucial. The
use of computational tools to infer, analyze, and compare both the
structure and the function of the complete predicted proteome is
considered an essential new instrument for the progress of biological
research (for reviews, see Andrade and Sander 1997; Bork et al. 1998;
Eisenberg et al. 2000; Pellegrini 2001). Still, the field of
bioinformatics is in its infancy. For example, the fraction of
hypothetical proteins or open reading frames (ORFs) in complete genomes
remains remarkably high. Thus, the first complete plant genome to be
sequenced, that of Arabidopsis thaliana (Arabidopsis genome 2000),
presents 25,498 identified genes, of which ~30% correspond to
hypothetical proteins or proteins of unknown function. Similarly, in
spite of the impressive accumulation of genome information over the
last several years, tools for comparative genome analysis that
establish the implications of differences in gene content between
species from a biomolecular perspective are virtually absent.
The chloroplast is an essential organelle in plants. It performs
photosynthesis and is therefore required for the photoautotrophic plant
growth that drives our biosphere. The generally accepted endosymbiotic
hypothesis states that chloroplasts arose from an internalized
cyanobacterial ancestor (Cavalier-Smith 2000). Chloroplasts have
maintained an independent genome that encodes an important part of the
proteins required for their photosynthetic activity and for different
housekeeping functions. The chloroplast genome (cpDNA) consists of
homogeneous circular double-stranded DNA molecules of 110-200 kb,
containing between 30 and 50 different RNA genes and a number of
protein-coding genes that ranges from about 100 in land plants and
green algae to 150-200 in nongreen algae (Sugiura 1995). These
protein-coding genes can be roughly classified into two main groups:
genes involved in the expression and translation machinery of the
chloroplast and genes related to bioenergetics and photosynthetic
function. The largest known chloroplast genome corresponds to the red
alga Porphyra and has 70-80 additional genes, one-third of which are
related to the biosynthesis of amino acids and other essential
biomolecules. A feature of chloroplast genomes from most plants is the
presence of two large inverted repeats (IRs) of 6-76 kb that divide
the cpDNA into one large and one small single-copy region (called LSC
and SSC, respectively; Sugiura 1995).
The nonrecombinant, uniparentally inherited nature of organelle genomes
makes them potentially useful tools for evolutionary studies. However,
in practice, detecting useful polymorphism at the population level is
often difficult because of the slow substitution rates in plant
chloroplast genomes. Attempts to reconstruct plastid evolution with
traditional biomolecular approaches (i.e., sequence-based analyses of
RNAs or of protein-encoding genes) have proven particularly difficult
(Martin et al. 1998; Sugiura et al. 1998; Adachi et al. 2000). Thus, in
recent studies, Martin and coworkers attempted to build phylogenetic
trees and to obtain evolutionary information by comparing 45 common
chloroplast proteins, concatenated into a single macroprotein with 9957
(Adachi et al. 2000) and 11,039 (Martin et al. 1998) amino acid sites.
The investigators had difficulty discriminating statistically among the
several possible phylogenetic trees obtained. Therefore, alternative
and independent types of evidence that might provide new information
about ancient plastid history are required. Some new alternatives for
phylogenetic assignments involve examination of the arrangement or
order of genes in genomes by gene-cluster analysis (Stoebe and Kowallik
1999). However, such analyses are usually carried out with particular
sets of related proteins or groups of genes and, therefore, they tend
to reflect the partial phylogeny of these genes rather than that of the
whole organisms or genomes.
The knowledge of complete genomes opens up the possibility of developing innovative tools for phylogenetic reconstruction and evolutionary analyses. A promising approach is presented here, based on the quantitative analysis of clusters of orthologous proteins (COGs) and applied to chloroplast history. Because of their small size and the considerable number available, chloroplast genomes are excellent model systems for computational genomics studies. A number of important questions from the bioinformatics perspective can be addressed by experimenting with them. For example, how accurately can phylogenetic reconstructions be made by using the complete genome information? Is it possible to uncover evolutionary forces by comparing complete genomes? Finally, can phylogenetic profiles, derived from the absence or presence of a given gene in the set of genomes under study, be used to determine functional associations? In this study we try to address these questions. First, we apply a new program developed in our laboratory for genome annotation and compare its performance with the well-known approach of using PSI-BLAST. Then, a comparative genomics study is carried out using techniques borrowed from multivariate analysis.
RESULTS AND DISCUSSION
Functional Annotations of cpDNA Proteins
Quality of Annotations
All protein sequences from 19 complete chloroplast genomes (cpDNA) were compared against the 3168 proteins from Synechocystis PCC6803 complete genome. The number of proteins in a chloroplast proteome ranges from 66 in the smallest genome (Euglena gracilis) to 209 proteins in the largest one (Porphyra purpurea). Taking all genomes, the total number of chloroplast proteins compared was 2266. As described in Methods, we compared genomes pairwise, taking all proteins in one genome and comparing them with all proteins in the other, trying to find the most likely ortholog pair. The degree of functional matching was evaluated with the µ-score, a measure of the structurally implied similarity between two sequences (see Methods). Figure 1 presents the values of µ-scores obtained by pairwise comparison of all the chloroplast proteins against the following: (1) the set of Synechocystis proteins annotated in the COGs database (http://www.ncbi.nlm.nih.gov/COG/), which included 2113 proteins; (2) the complete proteome of Synechocystis, which included 3168 proteins and was taken from the CyanoBase database (http://www.kazusa.or.jp/cyano/); and (3) the set of Saccharomyces proteins constituted by its database of COGs, which included 2175 proteins.
The µ-score cutoff of 2.5 was checked by looking at the identity of
function of well-known and well-annotated protein pairs. Among the
first 100 known Synechocystis proteins, all assignments were correct.
Therefore, the cutoff of 2.5 seems a good, conservative threshold for
functional assignment in this case. Below 2.5, most pairs do not have
an assigned function (gray lines). There is an intermediate zone
between 2.5 and 1.9 in which the method still maintains a significant
proportion of correct matches. This region includes 131 matches, of
which 50 are correct, 2 are erroneous, 15 correspond to new
assignments, and 64 correspond to hypothetical proteins on both sides.
The new assignments identified in this region are included in Table 3
(see below). Below 1.9, a significant number of mismatches start to appear.
Fraction of Annotations in Complete Genomes
The number of proteins with µ-scores above the cutoff (µc = 2.5) was 1097 when the Synechocystis COGs database was used and 1696 when the Synechocystis CyanoBase was used. A good linear correlation was found between proteome size and number of assignments to COGs. Table 1 includes this information and some other derived data. In the first three columns we show the size of each cpDNA, the size considering only ORFs (i.e., counting only the nucleotides corresponding to ORFs), and the percentage of nonencoding DNA (No cod DNA, calculated by subtracting column 2 from column 1 and then dividing by total DNA). The next four columns show the total number of proteins in each cpDNA (Prot Total) and the assignments based on Synechocystis COGs (Syn COGs), on the whole Synechocystis proteome (Syn CyanoB), and on the comparison with all other cpDNAs (in other cpDNA). This last pairwise comparison between cpDNAs was done using only proteins not assigned to any Synechocystis orthologs. The final number of cpDNA proteins assigned was 1837 out of 2266, which is 81% of the ORFs in the cpDNAs. Of these proteins, 1696 correspond to Synechocystis orthologies and 141 to orthologous groups specific to the plant chloroplasts. The remaining 429 proteins were left unassigned.
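The totals quoted above can be cross-checked with a few lines of arithmetic. The snippet below is only a consistency check on the numbers given in the text; the variable names are illustrative.

```python
# Counts quoted in the text above.
total_orfs = 2266        # all cpDNA proteins compared
syn_orthologs = 1696     # assigned through Synechocystis (CyanoBase)
cp_specific = 141        # assigned to chloroplast-specific orthologous groups

assigned = syn_orthologs + cp_specific
unassigned = total_orfs - assigned
coverage_pct = 100.0 * assigned / total_orfs

print(assigned, unassigned, round(coverage_pct, 1))  # 1837 429 81.1
```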
Discussion of New Annotations
We have been able to provide new functional assignments for some ycfs (hypothetical chloroplast frames), which correspond to ORFs well conserved in cpDNAs but without a clear functional annotation (Rochaix 1999).
Factor Analysis of the Matrix of Orthologous Genes
Construction of the X-Matrix
During the construction of the X-matrix, it was observed that only five genes were present in all 20 genomes. These genes (not included in the matrix) were rpl2, rps2, rps3, rps4, and rps12. All of them correspond to ribosomal proteins, and they form the most conserved core of the chloroplast ribosome, probably essential for its translation activity. Of the 277 groups of orthologous proteins that form the X-matrix (all lines of the 101 type), 256 correspond to COGs present in Synechocystis and 21 to COGs that are specific to plant chloroplasts, not present in cyanobacteria. Functional annotation was taken from CyanoBase, with some small modifications. Of the 277 COGs, a total of 73 correspond to hypothetical proteins (i.e., groups of proteins not having functional annotations in databases and present in at least two cpDNAs). The matrix will be available at http://alice.usal.es/cpDNA20x277matrix and can also be obtained by contacting the authors.
Loadings Matrix at the Optimal Dimensionality and Phylogenetic Analysis Derived
The X-matrix (a 20 × 277 binary matrix) was analyzed by multivariate analysis as described in Methods. These analyses allow us to obtain important quantitative information about the mutual relationships among genomes, as well as relationships among orthologous proteins. These relationships are mathematically expressed by the loadings (which provide information about the degree of similarity between the genomes) and by the dot product (DP) scores (see equation 6) and the factor scores (which provide information about the proteins and about the relationship between them). Table 4 shows the values of the loadings obtained for each genome. A second X-matrix including only 18 genomes was also analyzed. In this matrix all COGs corresponding to the nonphotosynthetic parasitic genomes (E. virginiana and T. gondii) were excluded. Exclusion of these two genomes was done for the phylogenetic analysis (vide infra) to avoid possible noise introduced by nonphotosynthetic parasitic species. The values of the loadings for the second matrix are also presented in Table 4. A set of 22 chloroplast genes is lost in these excluded genomes: atpA, atpB, atpE, atpF, atpH, petB, psaA, psaB, psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbJ, psbK, psbL, psbT, rbcL, rpoC2, and ycf4. They are mostly photosynthetic subunits that belong to ATPase and photosystem II. A set of 3 proteins is only lost in Epifagus: rpl14, rpoB, and rpoC1; 8 proteins, all ribosomal, are only lost in Toxoplasma: rpl16, rpl20, rps7, rps8, rps11, rps14, rps18, and rps19. The type of proteins lost in Epifagus and Toxoplasma genomes clearly reveals that these organisms have nonphotosynthetic plastids and that they have a very diminished ability for independent translation.
Detection of Specific Genes as Phylogenetic Correlates
One of the most interesting features of a phylogenetic analysis at the genome level is that it not only allows one to obtain phylogenetic relationships between species, but it may also provide a way to identify the specific proteins or genes that can best explain the differences between species or groups of species, according to the derived phylogenetic tree. In Table 5 we list the proteins we find as the most specific or peculiar for the different groups of chloroplasts. The genes are in descending order according to their DP scores (see Methods for an explanation of this derivation). In this way, genes present at the top of each group in Table 5 are the most important to differentiate and define each branch of genomes in the tree. The profiles column marks with a 1 the presence of a certain gene in a specific genome. Each number (1 or 0) corresponds to 1 of the 20 species ordered from left to right as enumerated in Table 4, starting with Synechocystis.
chlB, chlL, and chlN, and two genes involved in sulfate absorption,
cysA and cysT (see Table 5). Many of the green algae are unicellular
organisms that need to take up nutrients from the aquatic environment.
Sulfate is one of these important nutrients. In higher multicellular
plants, most of the absorption and nutritional functions are coded by
nuclear genes. In fact, in land plants the root cells are specialized
to carry out such functions.
The following phylogenetic correlates (see Table 5) were
identified in the case of land plant chloroplast genomes: maturase
(matK), NADH dehydrogenase (ndhA, B, C, D, E, F, G, H, I, J, and K),
one subunit of cytochrome b6/f (petL), and another subunit from
photosystem II (psbM). The presence of maturase is a distinct feature
in higher land plants directly related to the appearance of introns
(see Table 1; Wolfe et al. 1991).
Clustering of Phylogenetic Profiles
The clustering method automatically classifies the genes in the scores
matrix into nine groups. The main results from the clustering analysis
can be found in Table 6, which presents the proportion of a subset of
selected complexes or functional units in each cluster. The complexes
selected to test the ability of the clustering procedure are indicated
in Methods. Figure 4 presents three-dimensional plots showing the
relative positions of the 20 genomes in a three-dimensional projection
of the original Euclidean space (Fig. 4A,B), along with the positions
of the nine gene clusters (Fig. 4C,D). As can be seen in Table 6A, the
method locates in cluster number 9 a hypothetical photosynthetic
chloroplast core, composed of a large number of subunits from ATPase,
photosystem II (PSII), and cytochrome b6f (Cytb6f) and of important
populations of photosystem I (PSI) and ribosomal proteins (RibProt).
The other clusters include proteins or polypeptides characteristic of
specific groups of organisms. For example, cluster number 1 contains a
significant proportion of photosystem I subunits (33.33%) and
ribosomal proteins (46.51%) specific to nongreen algae; cluster number
3 includes a major proportion of phycobilisome (Phyb) proteins
(77.78%), representing mainly red algae; and cluster number 5 includes
all NADH dehydrogenase (NADHase) proteins (100%), representing land
plants. The rest of the clusters are less clearly assigned to one
group of organisms. The method also clearly identifies the functional
units in which a specific loss of proteins or subunits occurs during
evolution. Thus, in cluster number 1 it can be observed that two
main complexes (PSI and ribosome) suffer a strong reduction of genes
when passing from nongreen algae to green plants (see Table 6C).
Finally, in these analyses the quality of the functional recovery and
the purity of the clusters have been measured. The method allows a
recovery of 73.0% of the original functional associations in the
clusters for the statistically significant associations (those with
P-value < 10⁻³). The percentage of proteins belonging to one specific
functional unit is 36.4% within the nine clusters produced. The
specificity level seems to be high enough to allow prediction of
tentative functional associations for hypothetical proteins. For
example, the following proteins in nongreen algae present exactly the
same set of scores in the X-matrix: 30S ribosomal protein S20 (rps20),
50S ribosomal protein L34 (rpl34), 50S ribosomal protein L35 (rpl35),
and two ycf proteins (ycf33 and ycf35). On the basis of our results, it
is tempting to speculate that ycf33 and ycf35 are ribosomal or
translation-related proteins.
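The reasoning in the last example (genes that share an identical presence/absence profile are candidates for a functional link) can be illustrated with a short sketch. This is not the clustering procedure used in the study; it only groups genes whose binary profiles match exactly. The gene names come from the text, but the bit patterns are invented toy values, not the real profiles of Table 5.

```python
from collections import defaultdict

def group_identical_profiles(profiles):
    """Group genes that share exactly the same presence/absence profile.

    profiles: dict mapping gene name -> sequence of 0/1 values, one per genome.
    Returns a list of sorted gene groups, keeping only groups with 2+ genes.
    """
    groups = defaultdict(list)
    for gene, bits in profiles.items():
        groups[tuple(bits)].append(gene)
    return [sorted(genes) for genes in groups.values() if len(genes) > 1]

# Hypothetical toy profiles over five genomes (illustration only).
toy_profiles = {
    "rps20": (1, 0, 0, 1, 1),
    "rpl34": (1, 0, 0, 1, 1),
    "ycf33": (1, 0, 0, 1, 1),
    "psbA":  (1, 1, 1, 1, 0),
}
linked = group_identical_profiles(toy_profiles)
```

In this toy input, ycf33 falls in the same group as the two ribosomal genes, which is the kind of association the text proposes for ycf33 and ycf35.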
Conclusions
A set of recently developed methods for function annotation and genome comparison has been applied to a series of 19 chloroplast genomes. Genome annotation using these methods has proved to be very reliable, providing high-confidence functional assignments for an average of 81% of the proteins in chloroplast genomes. Multivariate analysis of a binary data matrix derived from these genomes has allowed us to derive rather accurate phylogenetic relationships between them at the genome level. One of the most interesting features of such analysis is the possibility of detecting genes acting as phylogenetic correlates, genes critical to the formation of the observed tree topology. These genes are, from a mathematical perspective, responsible for the tree topology and, on the basis of the quality of the tree, possibly related to speciation from a biological viewpoint. Therefore, this type of analysis has the potential to help uncover the evolutionary forces shaping the organisms and their adaptive responses through the modification of their biochemical systems. In the case of chloroplasts, we have found in our analysis that the genes acting as phylogenetic correlates actually form part of important components of the chloroplast biochemical machinery.
It is important to emphasize that the phylogenetic correlates should not be identified with genes that lead the way in evolution. A more plausible mechanism is that gain or loss of function would follow adjustments to new environments, which, by imposing a selective pressure, can select subpopulations generated by a random process. However, by studying them it may be possible to infer what general environmental selective pressure could operate in the different lineages. From our analysis, one of the driving forces in the evolution of green algae and plant chloroplasts appears to be the acquisition of molecular systems providing higher levels of regulation, probably with increasing levels of involvement by the genome of the host cell. This increased level of regulation appears to be reflected at two levels. First, regulatory units appear to be added to the energy-generating complexes, along with specific molecular systems to control photooxidative stress. Second, at the same time a eukaryote-like organization of the chloroplast genome is developed, changing the translation and translocation machinery and incorporating intron-processing enzymes such as maturase, presumably to achieve a higher level of synchronization between the gene expression of the photosynthetic apparatus and the host genes. A better understanding of these processes, in addition to the insight gained into basic biological processes, can have an enormous impact on genetic engineering and biotechnology. We have shown that comparative genomics is a powerful tool toward that goal.
METHODS
Genome Data Set
At the time of conducting this work, 17 cpDNAs had been fully
sequenced. They correspond to eight land plants (Arabidopsis
thaliana, Marchantia polymorpha, Nicotiana tabacum, Oenothera elata,
Oryza sativa, P. thunbergii, Spinacia oleracea, and Zea mays); three
green algae (C. vulgaris, Mesostigma viride, and N. olivacea); one
Euglenophyta (E. gracilis); two Rhodophyta or red algae (C. caldarium
and P. purpurea); one Bacillariophyta (O. sinensis); one Cryptophyta
(G. theta); and one Glaucocystophyceae (C. paradoxa). The complete
proteomes of the plastid genomes from two nonphotosynthetic parasites
were also available and were included in this study: One is from the
protozoan parasite T. gondii and the other from the parasitic flowering
plant E. virginiana. These two genomes were included because of their
functional similarities and evolutionary relationships to cpDNA. The
E. virginiana chloroplast genome lacks the main photosynthetic genes
(Wolfe et al. 1992). The T. gondii genome cannot be considered a true
chloroplast genome, but it can be considered a plastid genome of
probable green algal origin (Kohler et al. 1997). All genome sequences
were downloaded from http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/plastids_tax.html.
Functional Annotations
Functional annotations of ORFs derived from the complete chloroplast
genomes were carried out with a recently developed computer program for
functional annotation (Fabrega et al. 2001). This program finds pairs
of orthologs in two different genomes, A and B. To do so, each of the N
sequences in genome A is scanned against all M sequences in genome B.
Pairs of sequences are aligned using the Needleman and Wunsch algorithm
with zero end gaps and a normalized Gonnet matrix (Gonnet et al. 1992).
After the scanning step, an orthology likelihood score (µ-score) for
ORF i in genome A is defined as
(1)
where the terms indexed by ij are defined as the number of times the
sequence similarity between sequences i and j exceeds the expected
minimum value of the score consistent with a common fold, sc(ni,mj),
where ni is the length of sequence i and mj is the length of sequence
j, as derived from training sets of sequence-structure matches by
Abagyan and Batalov (1997)
(2)
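The alignment step can be illustrated with a minimal sketch of Needleman-Wunsch scoring in which end gaps cost nothing. The scoring scheme here (match, mismatch, and linear gap values) is a placeholder chosen for readability, not the normalized Gonnet matrix used by the annotation program, and the length-dependent threshold sc(ni,mj) is not reproduced because its form is not given in this text.

```python
def semiglobal_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch-style alignment score in which end gaps are free.

    The match/mismatch/gap values are illustrative assumptions only.
    """
    n, m = len(a), len(b)
    prev = [0] * (m + 1)          # row 0: leading gaps cost nothing
    best = 0                      # an empty overlap is always allowed
    for i in range(1, n + 1):
        cur = [0] * (m + 1)       # column 0: leading gaps cost nothing
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s,   # align a[i-1] with b[j-1]
                         prev[j] + gap,     # a[i-1] aligned to a gap
                         cur[j - 1] + gap)  # b[j-1] aligned to a gap
        best = max(best, cur[m])  # alignment may end anywhere in the last column
        prev = cur
    return max(best, max(prev))   # or anywhere in the last row

# A short sequence embedded in a longer one is not penalized for the overhangs.
embedded = semiglobal_score("ACGT", "TTACGTTT")
```

With free end gaps, a short ORF that matches an internal region of a longer protein keeps its full score, which is why this variant suits comparisons between sequences of different lengths.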
Construction of the X-Matrix
A matrix of orthologous chloroplast proteins was then constructed
based on the above results of pairwise genome comparisons, using a
simple binary count for the presence (1) or absence (0) of a given
ortholog. In this way, a matrix was built that had 20 columns
(Synechocystis and 19 cpDNA species) and 277 lines, corresponding to
277 different orthologous groups (i.e., COGs). The matrix was built
first by automatically including all the 1837 proteins with µ-score
≥ 2.5 (this gave a set of 286 lines) and second by manual correction of
some lines to include well-annotated proteins with µ-scores between 2.5
and 1.9 (this reduced the total set to 277 lines). The matrix designed
was of the 101 type, meaning that each group of orthologs should have a
member in at least two cpDNAs or in one cpDNA and in the Synechocystis
genome. In this way, COGs present in only one species (lines of type
100) or COGs present in all species (lines of type 111) were discarded.
COGs of type 100 were not adequate to study pairwise relationships
between genomes. Type 111 corresponds to functions present in all the
cpDNAs and therefore did not contribute any differential information to
the matrix. The 101-type matrix seems to provide the best balance
between variability and conservation.
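The construction described above can be sketched as follows, assuming the ortholog assignments are already available as a mapping from each COG to the set of genomes holding a member. The function name, the input layout, and the toy genome labels are illustrative assumptions; only the filtering rule (drop lines present in a single genome or in all genomes) comes from the text.

```python
import numpy as np

def build_x_matrix(presence, genomes):
    """Build a binary COG-by-genome matrix and keep only 101-type lines.

    presence: dict mapping COG name -> set of genome names with a member.
    genomes:  ordered list of genome names (the matrix columns).
    """
    cogs = sorted(presence)
    x = np.array([[1 if g in presence[c] else 0 for g in genomes]
                  for c in cogs], dtype=int)
    counts = x.sum(axis=1)
    # Discard type 100 (present in one genome) and type 111 (present in all).
    keep = (counts >= 2) & (counts < len(genomes))
    return [c for c, k in zip(cogs, keep) if k], x[keep]

# Hypothetical toy input with one cyanobacterium and three plastid genomes.
toy_genomes = ["Synechocystis", "cpA", "cpB", "cpC"]
toy_presence = {
    "rpl2": {"Synechocystis", "cpA", "cpB", "cpC"},  # type 111, dropped
    "ndhA": {"cpA", "cpB"},                          # kept
    "orfX": {"cpC"},                                 # type 100, dropped
    "psbA": {"Synechocystis", "cpA", "cpB"},         # kept
}
names, x_matrix = build_x_matrix(toy_presence, toy_genomes)
```

The lines are COGs and the columns are genomes, matching the 277-line by 20-column layout described in the text.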
Factor Analysis of the X-Matrix
Factor analysis (FA; Reyment and Joreskog 1996) was used for the
comparative genomics studies. FA seeks to find an underlying orthogonal
factor model of an original X-matrix (in this case our 101 matrix) of
the form
(3)
(4)
(5)
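Because the model equations are not legible in this copy, the following is only a generic sketch of one common way to extract orthogonal factors: a principal-component decomposition of the genome-by-genome correlation matrix, retaining factors with eigenvalue of at least 1. It should not be read as the exact formulation of equations 3 to 5. It assumes that the lines of the input are COGs, the columns are genomes, and no column is constant.

```python
import numpy as np

def principal_factors(x):
    """Principal-component style factor extraction (illustrative sketch).

    x: (n_cogs, n_genomes) array; every column must have nonzero variance.
    Returns (loadings, scores, eigenvalues) with factors kept while the
    eigenvalue is >= 1.
    """
    x = np.asarray(x, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0)        # standardize each genome
    corr = (z.T @ z) / z.shape[0]                   # genome correlation matrix
    eigval, eigvec = np.linalg.eigh(corr)           # returned in ascending order
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    k = max(1, int(np.sum(eigval >= 1.0)))          # eigenvalue-of-1 cutoff
    loadings = eigvec[:, :k] * np.sqrt(eigval[:k])  # one row per genome
    scores = z @ eigvec[:, :k] / np.sqrt(eigval[:k])  # one row per COG
    return loadings, scores, eigval
```

In this sketch the loadings place each genome in the reduced space and the scores place each COG in the same space, which are the two roles the text assigns to them.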
Phylogenetic Reconstruction
Phylogenetic trees of the genomes studied were derived by
clustering genomes in the loadings space with a neighbor-joining method
(Saitou and Nei 1987). To generate the distance matrix needed in the
neighbor-joining algorithm, we used the distribution of points
representing the organisms in the space given by the loadings matrix at
the optimal dimensionality found by FA (i.e., the dimensionality at
which the eigenvalue of 1 is reached). In this space, Euclidean
distances between each pair of genome loadings were calculated. Once
the main tree was built, an estimation of confidence or reliability of
each branch was obtained by means of a jackknife bootstrap analysis
using 1000 replicates. Bootstrap values were computed by selecting
random subsets of 75% of the genes per genome (Durbin et al. 1998),
reanalyzing the new X-matrix by FA, and recalculating the trees.
The distribution of trees and the frequency of each branch in the
original tree were recorded using the CONSENSE program included in the
PHYLIP software package (Felsenstein 1996).
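The two numerical ingredients of this procedure, the distance matrix in loadings space and the 75% resampling, can be sketched as below. The tree building itself is left to an existing neighbor-joining implementation and is not shown. The function names are illustrative, and resampling whole lines of the X-matrix without replacement is an assumption about how the gene subsets are drawn.

```python
import numpy as np

def euclidean_distances(loadings):
    """Pairwise Euclidean distances between the genome rows of a loadings matrix."""
    loadings = np.asarray(loadings, dtype=float)
    diff = loadings[:, None, :] - loadings[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def jackknife_rows(n_rows, fraction=0.75, replicates=1000, seed=0):
    """Yield sorted index arrays, each a random subset of the X-matrix lines."""
    rng = np.random.default_rng(seed)
    size = int(round(fraction * n_rows))
    for _ in range(replicates):
        yield np.sort(rng.choice(n_rows, size=size, replace=False))

# Each replicate would be re-analyzed by FA and converted to a new tree.
toy_distances = euclidean_distances([[0.0, 0.0], [3.0, 4.0]])
subsets = list(jackknife_rows(277, replicates=3))
```

Branch support is then the fraction of replicate trees that contain a given branch of the original tree.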
Detection of Phylogenetic Correlates and Gene Clustering
The FA results can be used to identify the specific COGs or groups
of proteins contributing heavily to the specific character of different
species, as inferred from the phylogenetic analysis. This is done here
as follows: The values of the loadings at each dimension k are
transformed to fingerprints by translating them to a binary form (1
when the value is >0.5 and 0 otherwise), forming a vector. These
patterns can be correlated with the patterns of presence/absence of the
genes used to build up the original X-matrix, so that the specific
genes that best define each dimension can be identified. For each
variable (i.e., each COG in the X-matrix) i, with a profile in the
X-matrix, and at each dimension k, we compute the phylogenetic
correlates (i.e., the variables mainly responsible for discrimination
in that dimension) as the 10% upper-ranking COGs obtained according to
the following dot product (DP):
(6)
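A minimal sketch of this ranking is given below. Since equation 6 is not legible in this copy, the score is assumed here to be the plain dot product between a COG profile and the binary fingerprint of dimension k; the published DP may include a normalization that is not reproduced. The function name and defaults are illustrative.

```python
import numpy as np

def phylogenetic_correlates(x, loadings, k, threshold=0.5, top_fraction=0.10):
    """Rank COGs by the dot product of their profile with a loadings fingerprint.

    x:        (n_cogs, n_genomes) binary presence/absence matrix.
    loadings: (n_genomes, n_factors) loadings matrix.
    k:        index of the factor (dimension) to examine.
    """
    x = np.asarray(x)
    # Binary fingerprint: 1 where the loading exceeds the threshold, else 0.
    fingerprint = (np.asarray(loadings)[:, k] > threshold).astype(int)
    dp = x @ fingerprint                      # assumed form of the DP score
    n_top = max(1, int(np.ceil(top_fraction * x.shape[0])))
    top = np.argsort(-dp, kind="stable")[:n_top]
    return fingerprint, dp, top

toy_x = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 1], [1, 0, 0, 0]]
toy_loadings = [[0.9], [0.8], [0.1], [0.2]]
fp, dp, top = phylogenetic_correlates(toy_x, toy_loadings, k=0)
```

In the toy data the third line, present in every genome, scores as high as the first; a raw dot product does not penalize ubiquitous genes, so the exclusion of 111-type lines from the X-matrix matters for this ranking.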
Study of Functional Linkages
It has been suggested that functionally linked proteins tend to
co-evolve, displaying patterns of correlation according to their
presence or absence in a set of genomes. Consequently, this form of
co-evolution could be detected using a bit-like representation of the
genomes (Pellegrini et al. 1999), in a similar way to the one used here
to create the X-matrix. We have tested the ability of our
method to detect these functional associations using the scores derived
from FA, as the projection into a low dimensional space should provide
a better metric to establish these associations, and the chloroplast
genomes are a good model system for such study.
We have clustered the elements of the scores matrix at the optimal
dimensionality using two clustering methods sequentially. The
procedure includes first the Ward algorithm (Ward 1963) to determine
the set of centroids from the cloud of points and then the derivation
of the members connected to each centroid by using a k-means
algorithm (Johnson and Wichern 1992). As the optimal number of clusters
is unknown, a stopping rule for determining the optimum number of
clusters (in the interval of 5 to 50 clusters) must be employed. The
figure of merit we have used in the stopping rule is the
C-index (Milligan 1980), defined as
(7)
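For orientation, the sketch below computes the C-index in its usual textbook form: the sum of within-cluster distances is rescaled between the sum of the same number of smallest and largest pairwise distances in the data. It is offered as an illustration of the figure of merit, assuming equation 7 follows that standard definition; the Ward and k-means steps are not shown.

```python
import numpy as np

def c_index(points, labels):
    """C-index of a partition; values near 0 indicate compact clusters."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    i, j = np.triu_indices(len(points), k=1)          # all unordered pairs
    dist = np.linalg.norm(points[i] - points[j], axis=1)
    within = labels[i] == labels[j]
    n_w = int(within.sum())                           # number of within-cluster pairs
    if n_w == 0 or n_w == len(dist):
        raise ValueError("C-index is undefined for this partition")
    s_w = dist[within].sum()
    ordered = np.sort(dist)
    s_min = ordered[:n_w].sum()                       # best possible within-sum
    s_max = ordered[-n_w:].sum()                      # worst possible within-sum
    return (s_w - s_min) / (s_max - s_min)

toy_points = [[0, 0], [0, 1], [10, 0], [10, 1]]
good = c_index(toy_points, [0, 0, 1, 1])
bad = c_index(toy_points, [0, 1, 0, 1])
```

A stopping rule built on this index would evaluate each candidate number of clusters in the chosen interval and keep the partition with the lowest value.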
Further, we have checked the significance of the observed clustering.
For that, we have focused our study on the ability of the clustering
procedure to classify well-defined macromolecular functional complexes
present in chloroplasts. These complexes are as follows: photosystem I
(which includes 12 polypeptides, genes psa-); photosystem II (18
polypeptides, genes psb-); ATPase (8 polypeptides, genes atp-);
cytochrome b6/f complex (6 polypeptides, genes pet-); NADH
dehydrogenase (11 polypeptides, genes ndh-); phycobilisome (9
polypeptides, genes apc-, cpc-, and nbl); ribosome (43 ribosomal
proteins, genes rpl- and rps-); RNA polymerase (4 polypeptides, genes
rpo-); and cell division proteins (5 polypeptides, genes fts- and
min-). Thus, we have used a test set of 116 polypeptides divided into
nine different complexes or functional units, which includes 42% of the
full COGs matrix constructed and represents a model plant chloroplast
genome with ~100 proteins. We evaluated the performance of the
clustering procedure by monitoring two parameters: the recovery of a
given complex in a given cluster, expressed as a percentage (%R), and
the purity of a specific cluster with respect to a given functional
complex, expressed as a percentage (%P). Thus, the former reports the
ability of the clustering procedure to concentrate the elements of a
functional complex in a cluster, whereas the latter reports the
specificity of the cluster toward a given complex. The two parameters
are computed as
(8)
(9)
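Following the verbal definitions above, the two percentages can be sketched as below. Recovery is taken as the share of a complex that lands in a cluster, and purity as the share of a cluster that belongs to the complex. Using the full cluster size as the purity denominator is an assumption, since equations 8 and 9 are not legible in this copy; the gene-to-cluster assignments in the example are invented.

```python
def recovery_and_purity(cluster_of, complex_of, complex_name, cluster_id):
    """Return (%R, %P) for one functional complex in one cluster.

    cluster_of: dict gene -> cluster id (every clustered gene).
    complex_of: dict gene -> complex name (test-set genes only).
    """
    in_complex = [g for g, c in complex_of.items() if c == complex_name]
    in_cluster = [g for g, k in cluster_of.items() if k == cluster_id]
    shared = sum(1 for g in in_complex if cluster_of.get(g) == cluster_id)
    recovery = 100.0 * shared / len(in_complex)   # share of the complex recovered
    purity = 100.0 * shared / len(in_cluster)     # share of the cluster it fills
    return recovery, purity

# Hypothetical toy assignment: three ndh genes and one unknown ORF in cluster 5.
toy_clusters = {"ndhA": 5, "ndhB": 5, "ndhC": 5, "psaA": 1, "ycfX": 5}
toy_complexes = {"ndhA": "NADHase", "ndhB": "NADHase",
                 "ndhC": "NADHase", "psaA": "PSI"}
r, p = recovery_and_purity(toy_clusters, toy_complexes, "NADHase", 5)
```

A cluster with high recovery and high purity for one complex is the situation in which an unannotated member, such as the hypothetical ycfX here, becomes a candidate for the same functional unit.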
WEB SITE REFERENCES
http://alice.usal.es/cpDNA20x277matrix
http://www.kazusa.or.jp/cyano/; CyanoBase database.
http://www.ncbi.nlm.nih.gov/COG/; COGs database.
ACKNOWLEDGMENTS
This work was supported by Mount Sinai start-up funds (ARO). J.D.L.R. acknowledges a short-term fellowship from OCDE and support from the Spanish government (grant MCT-DGI-PGC, PB98-0480). J.J.L. is a NATO postdoctoral fellow.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
3 Corresponding author.
E-MAIL ortiz{at}inka.mssm.edu; FAX 212-860-3369.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.209402.
REFERENCES