Vol. 12, Issue 6, 916-929, June 2002
LETTER
ABSTRACT
The comparison of the small molecule metabolism pathways in Escherichia coli and Saccharomyces cerevisiae (yeast) shows that 271 enzymes are common to both organisms. These common enzymes involve 384 gene products in E. coli and 390 in yeast, which are between one half and two thirds of the gene products of small molecule metabolism in E. coli and yeast, respectively. The arrangement and family membership of the domains that form all or part of 374 E. coli sequences and 343 yeast sequences was determined. Of these, 70% consist entirely of homologous domains, and 20% have homologous domains linked to other domains that are unique to E. coli, yeast, or both. Over two thirds of the enzymes common to the two organisms have sequence identities between 30% and 50%. The remaining groups include 13 clear cases of nonorthologous displacement. Our calculations show that at most one half to two thirds of the gene products involved in small molecule metabolism are common to E. coli and yeast. We have shown that the common core of 271 enzymes has been largely conserved since the separation of prokaryotes and eukaryotes, including modifications for regulatory purposes, such as gene fusion and changes in the number of isozymes in one of the two organisms. Only one fifth of the common enzymes have nonhomologous domains between the two organisms. Around the common core very different extensions have been made to small molecule metabolism in the two organisms.
[Online supplementary material available at http://www.genome.org.]
INTRODUCTION
Here we compare the enzymes of small molecule metabolism found in the prokaryote Escherichia coli and the unicellular eukaryote Saccharomyces cerevisiae (yeast). There is evidence for the existence of prokaryotes 3.8 billion years ago (bya) and of eukaryotes 2.7 bya (Mojzsis et al. 1996; Brocks et al. 1999). Endosymbiosis of an α-proteobacterium is widely accepted as the origin of mitochondria, and mitochondrial genes, in the eukaryotes (Margulis 1970). This endosymbiosis event must have occurred before the divergence of plants, 1.6 bya (Lang et al. 1999; Wang et al. 1999), and arguments have been made for it being much earlier (Martin and Müller 1998). Thus, according to these estimates, most of the enzymes of small molecule metabolism in E. coli and yeast have had between 1.6 and 2.7 billion years of separate evolution, depending on whether the yeast enzymes originate from the eukaryotic ancestor or the protomitochondrial genome (Brown and Doolittle 1997).
Regardless of the origin of the enzymes, during this time there have been countless chances for orthologous genes in the two organisms to diverge by mutation, to undergo recombinations resulting in domain loss or accretion, and to change gene structure by gene fusion or fission. New genes for an existing function could be acquired by horizontal transfer or functional displacement of one gene by another within a genome. In addition, many new genes have arisen by duplication and divergence to produce new enzymatic functions and pathways.
Until now, investigations of these evolutionary processes have been
limited to studying one aspect, such as gene fusion (Enright et al. 1999)
or nonorthologous displacement (Koonin et al. 1996; Makarova et al. 1999),
or have focused on differences in pathway topologies rather than the
evolution of common enzymes (Huynen et al. 1999). Here we
investigate, and to some extent quantify, the frequency of all these
evolutionary processes in a large set of enzymes common to the two very
distantly related organisms. The extensive information available on the
enzymes and pathways of small molecule metabolism in E. coli
and yeast allows us to determine the extent to which different
evolutionary processes have taken place since they separated from their
last common ancestor. At present such a comparison would be much less
successful in any other pair of organisms due to the lack of knowledge
of their enzymes and pathways. E. coli and yeast have long
been model organisms and have been the subjects of very extensive
experimental characterization of their genes and proteins, including
the determination of their complete genome sequence.
We show that over half of the gene products involved in small molecule metabolism of E. coli and yeast carry out common reactions in the two organisms. Our approach is to use sequence and structural information to characterise the domain structure and the evolutionary relationships of these shared enzymes. The use of structural information together with powerful multiple sequence comparison methods, as well as assignment to sequence families, provides us with an almost complete picture of the protein families that the enzymes belong to, including very distant evolutionary relationships.
Knowledge of the domain architecture of common enzymes allows us to assess the extent of conservation between enzymes, but also provides insight into aspects of the regulation of enzymes, such as differing numbers of isozymes in E. coli and yeast and instances of gene fusion. As well as affecting regulation of otherwise separate genes, gene fusion serves to co-localize gene products. Protein-protein interactions have the same effect, and we survey and compare protein-protein interactions as well as gene fusions in yeast.
NOMENCLATURE
Genes, Gene Products, Domains, Enzymes, and Proteins
Before describing the pathways and their enzymes it is useful to provide a glossary of terms we use throughout the text.
Genes and Gene Products
These refer to the DNA entity and the polypeptide produced by its expression.
Proteins and Enzymes
These are the functional units. They can consist of one gene product with one or more domains, multiple copies of one gene product, or a combination of gene products.
Common or Equivalent Enzymes
E. coli and yeast enzymes are described as common or equivalent when they play the same role, i.e., catalyze the same reaction step, in the pathways common to the two organisms. The E. coli and yeast enzymes can be, but don't have to be, homologous. For instance, in Figure 1, the E. coli and yeast aspartate semialdehyde dehydrogenases are homologous, whereas the serine/threonine deaminase enzymes are not. For one reaction step, there can be multiple E. coli and yeast gene products that constitute the common enzymes in the two organisms. For instance, in Figure 1 there are three serine/threonine deaminase isozymes in E. coli, and two serine/threonine dehydratases in yeast, so there are five gene products that constitute the common enzymes for this reaction step.
Domain
This is the evolutionary unit in proteins. Small- and most medium-sized proteins consist of a single domain. Large proteins usually consist of two or more domains that have been brought together by recombination. Domains may combine with more than one partner and may also occur in isolation as functional units. Throughout the figures accompanying this text, gene products are represented by black lines, and their domains are represented by colored shapes. For example, in Figure 1 the yeast serine/threonine dehydratases consist of a single domain represented by a light blue rectangular shape, which is a domain of the Tryptophan synthase beta subunit-like PLP-dependent enzyme family. Other gene products consist of multiple domains, as in the five domains of the E. coli aspartate kinase/homoserine dehydrogenases thrA and metL.
Family
Domains that are related, having evolved from a common ancestor by gene duplication, belong to the same family. Family membership can be detected by straightforward sequence comparison, but more distant relationships are only detectable through conservation of the three-dimensional structure of proteins rather than amino acid sequence. Families that correspond to proteins of known three-dimensional structure are sometimes referred to as `structural families', whereas families inferred on the basis of sequence alone are sometimes referred to as `sequence families' in the text. In the figures, all domains belonging to one family are represented by the same colored shape.
SCOP
The Structural Classification of Proteins (SCOP) database (Murzin et al. 1995).
SUPERFAMILY
A database of HMMs of the domains that occur in proteins of known structure (Gough et al. 2001).
HMM
Abbreviation for Hidden Markov Model. In our context, this means a probabilistic model of a set of aligned related protein sequences. Other protein sequences can be matched to this model to see whether they are related to the family in the model (Eddy 1996).
FASTA
A pairwise sequence comparison program (Pearson and Lipman 1988).
METHODS AND RESULTS
Pathways and Enzymes in E. coli and Yeast
To compare the components of small molecule metabolism pathways in E. coli and yeast in a detailed and efficient manner it is necessary to have them in a form that allows the comparison to be made using computational procedures. To establish such a data set of pathways, we made use of four different databases. Though there is considerable overlap in the information they contain, each has features that made significant contributions to this work. We used the information from the databases to compare the sets of common and unique enzymes in E. coli and yeast rather than the pathways themselves. The enzymes are components of pathways of course, and we do mention the extent of shared and unique pathways according to the KEGG pathway definitions (see below), but our main focus is the common enzymes.
The Dataset of Pathways and Enzymes
The four databases used here are KEGG (Kanehisa and Goto 2000), EcoCyc, MetaCyc, and ERGO (or WIT).
KEGG DATABASE
In this database, the pathways of individual organisms are all superimposed on reference or template pathways. This feature makes it easy to draw parallels between pathways in different organisms, and KEGG provides the starting point for constructing the set of small molecule metabolic pathways and enzymes used here. In KEGG, pathways are described purely in terms of Enzyme Commission (EC) numbers (NC-IUBMB 1992).
ECOCYC DATABASE
The E. coli proteins are well understood and well documented compared to those of other organisms, particularly through the work of Monica Riley and her colleagues (Riley 1998).
METACYC DATABASE
This database is related to EcoCyc, but contains the pathways and enzymes of other organisms, including S. cerevisiae. The MetaCyc pathways are constructed by comparing the EC numbers and enzyme names of the organism of interest to pathways already established for E. coli or other organisms (Karp et al. 1999).
ERGO OR WIT METABOLIC PATHWAY DATABASE
The public version of the database is a repository of detailed information for, at present, 39 organisms (Overbeek et al. 2000).
METABOLIC PATHWAYS USED IN THIS WORK
The automatic procedure we used to process the KEGG pathways is similar to that used in the construction of MetaCyc (see above), except that we increased the score if the reactions in a pathway in E. coli or yeast were actually connected to each other, as opposed to being separated by steps that were present in the template pathway but were not identified in the individual organism. In some cases this involved removing some pathways altogether and modifying others. For example, we removed the KEGG photosynthesis and tetracycline biosynthesis pathways from E. coli, as these are clearly not relevant to this organism and were assigned in KEGG as an artefact of the assignment system based on EC number alone. The final number of KEGG pathways in our processed set is 55 in E. coli and 57 in yeast (Supplementary data, including URLs for all the public databases mentioned above, are available at www.genome.org). Of these pathways 48 are shared, meaning that at least a subset of the enzymes catalyzing reactions in these pathways are found in both E. coli and yeast. There are also seven pathways in E. coli not present in yeast, and nine pathways in yeast not present in E. coli. The common enzymes are members of the shared pathways, and these were used in our detailed comparison of the enzymes in the two organisms.
Identification of Common Enzymes
Overall, the comparisons of shared EC numbers and gene products show
that at least half of the enzymes of small molecule metabolism in
E. coli and one third in yeast are not shared between the two organisms, as shown in Table 1. We wanted
to establish the extent and nature of the enzymes shared by E. coli and yeast, so we created groups of gene products that are the
common enzymes.
The common enzymes were identified by matching equivalent positions in pathways in the two organisms and grouping together the enzymes that occur in the two organisms at each position. A simple example is the fructose-1,6-bisphosphatase in E. coli (fbp) and yeast (FBP1), which occur at equivalent positions in the gluconeogenesis pathway.
Several cases are more complicated than this simple example, however, and these were treated according to the following rules: (1) If an enzyme occurred in more than one pathway, we assigned it to a single group. (2) Where reaction steps with the same EC number in different pathways are catalyzed by different sets of gene products, we made separate groups. (In E. coli, there are 11 EC numbers whose reactions are catalyzed by two nonidentical, but possibly homologous, combinations of gene products and two EC numbers [2.7.1.69, 1.1.1.-] catalyzed by three different combinations. In yeast, there are 10 EC numbers with two different combinations of gene products and two EC numbers [1.1.1.37, 2.5.1.-] with three different combinations.) (3) Where enzymes catalyze two or more reaction steps corresponding to two or more EC numbers, there are two or more EC numbers that are associated with exactly the same gene products in both E. coli and yeast. The groups of gene products that are common enzymes were therefore also made nonredundant according to the gene product identifiers, so some of the groups correspond to multiple EC numbers.
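The grouping rules can be illustrated with a short sketch. This is not the procedure actually used in this work: the record layout (`pathway`, `ec`, `ecoli`, `yeast`), the placeholder pathway names, the placeholder EC labels `EC-x` and `EC-y`, and all gene names other than fbp/FBP1 are assumptions made for the example. Keying each group on the exact sets of gene products covers all three rules at once.

```python
from collections import defaultdict

# Hypothetical input: one record per reaction step found in both organisms.
steps = [
    {"pathway": "gluconeogenesis", "ec": "3.1.3.11", "ecoli": {"fbp"}, "yeast": {"FBP1"}},
    # Rule 1: the same enzyme in a second pathway still yields a single group.
    {"pathway": "placeholder pathway 2", "ec": "3.1.3.11", "ecoli": {"fbp"}, "yeast": {"FBP1"}},
    # Rule 2: same EC number, different gene products -> separate groups.
    {"pathway": "placeholder pathway 3", "ec": "1.1.1.-", "ecoli": {"geneA"}, "yeast": {"GENE1"}},
    {"pathway": "placeholder pathway 4", "ec": "1.1.1.-", "ecoli": {"geneB"}, "yeast": {"GENE2"}},
    # Rule 3: two EC numbers, identical gene products -> one group with two EC numbers.
    {"pathway": "placeholder pathway 5", "ec": "EC-x", "ecoli": {"geneC"}, "yeast": {"GENE3"}},
    {"pathway": "placeholder pathway 5", "ec": "EC-y", "ecoli": {"geneC"}, "yeast": {"GENE3"}},
]


def group_common_enzymes(records):
    """Group reaction steps by their exact (E. coli, yeast) gene-product sets."""
    groups = defaultdict(lambda: {"ecs": set(), "pathways": set()})
    for rec in records:
        key = (frozenset(rec["ecoli"]), frozenset(rec["yeast"]))
        groups[key]["ecs"].add(rec["ec"])
        groups[key]["pathways"].add(rec["pathway"])
    return dict(groups)


groups = group_common_enzymes(steps)
for (ecoli, yeast), info in groups.items():
    print(sorted(ecoli), sorted(yeast), sorted(info["ecs"]), len(info["pathways"]))
```

In this toy input the six steps collapse into four groups, one of which carries two EC labels.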
After these filtering processes, we obtained a set of 271 groups that
contain, in all, 384 E. coli and 390 yeast gene products. For
a list of these see Supplementary data at www.genome.org. As
described in Table 2, the contents of these
groups vary. Two hundred and thirty four groups contain the same or
similar small numbers of E. coli and yeast proteins. Five
groups have large numbers of gene products from one organism and few
from the other. Two of these groups are reactions that are described
only to the third level of the EC number (acyltransferases, 2.3.1.-, and
galactosyl/mannosyltransferases, 2.4.1.-) and hence are not well
defined. The other three groups involve large complexes: two
representing NADH dehydrogenases and one for the ATP synthase complex.
The remaining 32 groups have slightly different small numbers of gene
products, such as one in one organism and three or four in the other
and so forth. The cases of gene fusion have as many entries as there
are component enzymes, so there are 37 groups of equivalent enzymes
that correspond to the entries in Table 7 (see below).
The accuracy of the assignments of enzymes to EC numbers and pathways could affect our comparison of common enzymes in the following way. If a yeast protein and an E. coli protein were erroneously assigned as being the same enzyme, our analysis would be affected. This is particularly likely to occur during assignment of putative enzymes, in other words enzymes that are assigned through homology rather than experimental characterization. There are at most 192 such enzymes in our data set, and 87 of these are part of the set of common enzymes we analyze in detail. Excluding these 87 enzymes would not affect our general conclusions, however, and so we retain them in our data set.
Enzymes Unique to E. coli or Yeast
In our dataset, there are 332 gene products constituting 233 enzymes unique to E. coli, and 209 gene products constituting 97 enzymes unique to yeast. These enzymes occur in the small number of pathways that are unique to each organism (seven and nine in E. coli and yeast, respectively), but also across the 48 KEGG pathways that contain common enzymes. The organism-specific extensions to pathways with common enzymes involve mostly one or two reactions that are connected to one or both ends of the common part of the pathway. However there are some cases where several separate organism-specific runs of reactions are added to different parts of the pathway, and not all the reactions in a KEGG pathway are necessarily connected. In the Aminosugars Metabolism pathway, there is a series of E. coli-specific reactions, followed by a few common reactions, which are followed by a series of yeast-specific reactions. This clear linear division between series of E. coli-specific and yeast-specific reactions is unique.
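As a quick check on the fractions quoted in the text, the counts of gene products in common and unique enzymes can be combined directly. The sketch below assumes that the total for each organism is simply the sum of the common and unique gene products in this data set; the variable and function names are illustrative.

```python
# Gene product counts taken from the text.
common = {"E. coli": 384, "yeast": 390}
unique = {"E. coli": 332, "yeast": 209}


def common_fraction(organism):
    """Fraction of an organism's gene products that belong to common enzymes."""
    total = common[organism] + unique[organism]
    return common[organism] / total


for organism in common:
    print(f"{organism}: {common_fraction(organism):.1%} of gene products are in common enzymes")
```

Under this assumption the fractions come to roughly 54% for E. coli and 65% for yeast, in line with the "just over half" and "two thirds" quoted in the text.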
To ensure that the enzymes annotated as unique in KEGG do not have
hitherto unidentified counterparts in the other genome, we compared the
distribution of sequence identities for the 332 and 209 gene products
in E. coli and yeast that represent enzymes unique to each of
these organisms with that of the common enzymes, whose distribution of
sequence identities is discussed below. Only 13% of the 541 unique
enzymes have matches above 30%, as compared to 75% of the common
enzymes. We inspected the 23 matches above 40% sequence identity
because matches at these sequence identities are very likely to have
identical EC numbers, according to Wilson et al. (2000) and Todd et al. (2001),
and found only eight such cases, which are likely candidates
for reclassification. Therefore, it is likely that most of the enzymes
classified as present in one of the two organisms but absent in the
other are classified correctly, as there is no reason why the pattern
of sequence divergence should differ between this set of enzymes and
the common enzymes. There remains the possibility that there are as yet
unidentified enzymes that are unique to one of the two organisms. For
E. coli, this is very unlikely though, as small molecule
metabolism has been experimentally investigated for decades and even
putative enzymes are included in our data set and the above
calculations. Therefore, newly discovered enzymes in small molecule
metabolism are most likely to be yeast enzymes that are not shared with
E. coli. This would decrease the fraction of common enzymes
out of all yeast enzymes, and therefore we view the fraction of common
enzymes of one half to two thirds of all enzymes of small molecule
metabolism as an upper bound.
The Domain Structure and Family Membership of the Common Enzymes
As described above, just over half of the gene products involved in small molecule metabolism in E. coli and two thirds of those in yeast carry out reactions that are common to both organisms. To compare these common enzymes in terms of their evolutionary relationships, we need to define the domain structure and the protein families to which these domains belong.
Identification of Domains in the E. coli and Yeast Enzymes
To identify the nature of domains in the E. coli and yeast gene products we used three sets of calculations. First, gene products were matched to hidden Markov models of the domains that occur in proteins of known structure (SUPERFAMILY HMMs); second, they were matched to the Pfam HMMs; and third, they were matched to each other using FASTA. By these calculations, of the 384 E. coli gene products in shared enzymes, 374 (97%) were matched in nonoverlapping regions by 603 domains using the three methods. Of the 390 yeast gene products in shared enzymes, 343 (88%) were matched in nonoverlapping regions by 607 domains using the three methods (Table 3a).
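The staged assignment can be sketched as follows. This is a simplified illustration rather than the actual pipeline: it assumes that hits are already ranked by method priority (SUPERFAMILY, then Pfam), that an overlapping lower-priority hit is simply discarded, and that coordinates are 1-based and inclusive. The 75-residue cutoff is the one given in the text for regions passed on to pairwise comparison; the function names, family labels, and example coordinates are hypothetical.

```python
def merge_assignments(ranked_hits):
    """Accept hits in priority order, skipping any that overlap an accepted hit.

    ranked_hits: list of (start, end, family, method) tuples, highest priority first.
    """
    accepted = []
    for start, end, family, method in ranked_hits:
        if all(end < a_start or start > a_end for a_start, a_end, _, _ in accepted):
            accepted.append((start, end, family, method))
    return sorted(accepted)


def unassigned_regions(length, accepted, min_len=75):
    """Return (start, end) stretches longer than min_len residues with no domain."""
    gaps = []
    pos = 1  # first residue not yet covered
    for start, end, _, _ in sorted(accepted):
        if start - pos > min_len:
            gaps.append((pos, start - 1))
        pos = max(pos, end + 1)
    if length - pos + 1 > min_len:
        gaps.append((pos, length))
    return gaps


# Toy example: a 500-residue gene product with three candidate hits.
hits = [
    (1, 200, "famA", "SUPERFAMILY"),
    (150, 260, "famB", "Pfam"),  # overlaps famA, so it is skipped
    (210, 300, "famC", "Pfam"),
]
accepted = merge_assignments(hits)
leftover = unassigned_regions(500, accepted)
print(accepted)
print(leftover)  # regions that would go on to pairwise comparison
```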
SUPERFAMILY HMMS
The domains in proteins of known structure (and the superfamilies they belong to) are described in the SCOP database. The SUPERFAMILY HMMs built from these domains are described by Gough et al. (2001).
PFAM HMMS
The gene products, or parts of them, that remained unassigned after the identification of SCOP domains were scanned against the Pfam database (Bateman et al. 2000).
PAIRWISE SEQUENCE COMPARISONS
Even after the Pfam search, some amino acid regions longer than 75 residues remained without a domain assignment. We compared these regions to each other with FASTA (Pearson and Lipman 1988).
Domain Structure of the E. coli and Yeast Enzymes
In total, 603 domains were identified in the 384 E. coli gene products that are part of common enzymes and 607 domains in the 390 yeast gene products, as described in Table 3a. The single-domain gene products consist of one domain that matches the whole sequence. Among the E. coli gene products with assignments there are 202 (53%) such cases, and among the yeast gene products there are 148 (38%) such cases. The other gene products matched between two and six domains, except one sequence in yeast that has 11 domains. Overall, there is a slightly larger fraction of multi-domain gene products in yeast than in E. coli, and the multi-domain gene products tend to have somewhat more domains. Previous work on the EcoCyc list of the gene products that form small molecule metabolism in E. coli showed that about half contain just one domain and half are the product of the recombination of two or more domains (Teichmann et al. 2001).
Protein Families of the E. coli and Yeast Enzymes
The sequences used to build the SUPERFAMILY HMMs are those of the domains in the proteins of known structure. In SCOP, on the basis of an examination of their structures, sequences, and functions, these domains have been clustered into superfamilies whose members can have distant or close evolutionary relationships. We can use this information to cluster into families the domains of the common enzymes matched by the SUPERFAMILY HMMs. Four hundred and eighty one E. coli domains belong to one of 171 different SCOP superfamilies and 522 yeast domains belong to one of 161 superfamilies, as shown in Table 3a. One hundred and forty of the SCOP superfamilies are common to both organisms.
A Comparison of the Sequences and Domain Architectures of Common Enzymes
Above we discussed the general features of the domains and families of the common enzymes. Now we turn our attention to the similarity in sequence and domain architecture of the proteins within groups of common enzymes.
Sequence Identity Among Common Enzymes
To find the distribution of sequence identities between the E. coli and yeast proteins in the 271 groups of common enzymes, a FASTA search was done between the proteins of the two organisms. The matches at an expectation value threshold of 0.01 or lower were accepted as significant, and the sequence identities for these matches were extracted. The resulting distribution of sequence identities is shown in Figure 3. The distribution of sequence identities is drawn from the best match between an E. coli and yeast sequence in 229 of the 271 common enzyme groups.
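The filtering step described here can be sketched in a few lines. The input format, a list of (group, E-value, percent identity) tuples parsed from the FASTA output, is an assumption, as is the choice to define the "best match" in a group as the significant hit with the lowest E-value. The function names and the 10% bin width are illustrative.

```python
from collections import Counter

E_VALUE_CUTOFF = 0.01  # significance threshold given in the text


def best_identity_per_group(hits, cutoff=E_VALUE_CUTOFF):
    """Keep, for each group, the identity of the significant hit with the lowest E-value.

    hits: iterable of (group_id, e_value, percent_identity) tuples.
    """
    best = {}  # group_id -> (e_value, percent_identity)
    for group_id, e_value, identity in hits:
        if e_value > cutoff:
            continue  # not significant
        if group_id not in best or e_value < best[group_id][0]:
            best[group_id] = (e_value, identity)
    return {group_id: identity for group_id, (_, identity) in best.items()}


def identity_histogram(best, bin_width=10):
    """Count groups per identity bin; keys are the lower edge of each bin."""
    return Counter(int(identity // bin_width) * bin_width for identity in best.values())


# Toy hits for three hypothetical groups.
hits = [
    ("g1", 1e-50, 45.0),
    ("g1", 1e-5, 60.0),   # significant, but not the best match for g1
    ("g2", 0.5, 28.0),    # above the cutoff, so g2 has no significant match
    ("g3", 0.01, 31.0),
]
best = best_identity_per_group(hits)
print(best)
print(identity_histogram(best))
```

Groups with no hit at or below the cutoff drop out entirely, which is how 229 of the 271 groups end up contributing to the distribution.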

Identity and Divergence of Domain Architectures
As mentioned above, significant sequence identity is detected in 229 of the 271 groups of common enzymes, but in 88 of these the matching region covers only part of the two most similar gene products in the group. Therefore, to obtain more information about the extent of homology between E. coli and yeast enzymes, we compared domain architectures, i.e., the order and family identity of the domains of enzymes within each group of gene products. The results are described in Table 5 and below.
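A domain architecture, as defined here, can be represented as an ordered tuple of family names, which makes the comparison easy to sketch. The three category labels below are a simplification chosen for illustration and are not necessarily the categories of Table 5; the family names are hypothetical.

```python
def compare_architectures(arch_a, arch_b):
    """Compare two domain architectures given as N- to C-terminal tuples of family names."""
    if tuple(arch_a) == tuple(arch_b):
        return "identical"          # same families in the same order
    if set(arch_a) & set(arch_b):
        return "partly shared"      # at least one family in common
    return "no shared domains"      # candidate for nonorthologous displacement


# Toy examples with hypothetical family names.
print(compare_architectures(("famA", "famB"), ("famA", "famB")))
print(compare_architectures(("famA", "famB"), ("famA", "famC")))
print(compare_architectures(("famA",), ("famD",)))
```

Because order matters in this representation, two enzymes with the same families in a different order count as partly shared rather than identical.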
Nonorthologous Displacement
There are 19 cases where pairs of common enzymes share no structural or sequence domains (Table 5). These 19 cases were investigated for evidence of nonorthologous displacement. This involved the retrieval of extra information from resources such as EcoCyc, MetaCyc, and MIPS (Mewes et al. 2000).
Gene Fusions and Protein-Protein Interactions
In the previous sections, we have seen that there is extensive conservation of domain architecture in the set of shared enzymes in small molecule metabolism of E. coli and yeast. In common enzymes that have identical domain architectures in E. coli and yeast, there can be a difference between the enzymes at the level of gene structure. Whether the domains belonging to enzymes come from one or several genes affects both the regulation and localization of enzymes, as the parts of a single gene will, by definition, be completely coregulated and colocalized. Colocalization, and potentially regulation, can also be achieved through protein-protein interactions, and we investigate the protein-protein interactions among yeast enzymes in the second part of this section.
Gene Fusions or Fissions
We identified 20 cases of gene fusion or fission with this system, listed in Table 7. The first five cases involve a single E. coli protein and pairs of S. cerevisiae proteins. In the other 15 cases, the yeast enzyme consists of a single gene product and the E. coli proteins are pairs (10 cases), triplets (two cases), quadruplets (two cases), or five different gene products (one case). In 10 of these cases, the E. coli enzymes are adjacent or close to each other on the bacterial chromosome, suggesting that they are coregulated in E. coli in some way as well. In the five cases where the E. coli enzymes are far apart on the chromosome, the fingerprint of gene fusion is lost and it is not clear to what extent the individual enzymes are coregulated.
Protein-Protein Interactions in Yeast
Gene fusions provide a means of coregulation and colocalization of enzymes. We found that fusions occur in subunits of enzymes as well as in separate enzymes that are at most two steps apart within a pathway. Protein-protein interactions can also serve to colocalize consecutive enzymes to improve flux by minimizing diffusion of the substrate. At the same time, protein-protein interactions between enzymes further apart in the metabolic network can occur for regulatory reasons. An example is the regulation of isoleucyl- and valyl-tRNA synthetase by threonine deaminase, two enzymes that are four steps apart in leucine and valine biosynthesis (Savageau and Jacknow 1979).
DISCUSSION
Our comparison of yeast and E. coli small molecule metabolic pathways and enzymes shows that over half of the proteins in this central set of pathways are present in both of these two distantly related organisms. This means that almost as many enzymes of small molecule metabolism are unique to each of the two organisms as are common. Of the sets of enzymes common to both organisms, over two thirds have very closely conserved domain architecture. Just under one quarter of the common enzymes have domain architectures that are partly shared and partly unique to one or both organisms. Among the enzymes that have some similarity in domain architecture, almost all have <50% sequence identity between the E. coli and yeast enzymes, and about a quarter have <30% sequence identity. There are only 13 cases of clear nonorthologous displacement where there is no homology whatsoever between the yeast and E. coli enzyme.
In one seventh of the sets of common enzymes, there are differing numbers of isozymes in E. coli and yeast. There are a few groups of common enzymes with identical numbers of isozymes in the two organisms, and analysis of all sets of isozymes suggests that they occurred after the last common ancestor of E. coli and yeast. The isozymes indicate that even if domain architecture is conserved, regulation of an enzymatic step may be different between the two organisms.
This is also evident in the cases of gene fusion. Fifteen of the 20 cases of gene fusion or fission involve a single yeast enzyme and several individual E. coli enzymes. The balance may be tilted towards the eukaryote due to the absence of operons, but the five cases of fusion in E. coli suggest that fusion may be more than just a means of coregulation and may also be a way of colocalizing otherwise separate gene products.
Colocalization through gene fusions is observed between enzymes at most two steps apart in pathways. A survey of the protein-protein interactions between yeast enzymes shows that a large fraction of these is also between enzymes that are either consecutive or very close to each other in a pathway in terms of reaction steps. Although there is this tendency for physical association of enzymes close to each other in the reaction network, this is by no means the general rule for all consecutive reactions. From our analysis, the frequency of both gene fusions and protein-protein interactions in metabolic pathways appears to be limited, with 15 cases of gene fusions and a small number of protein-protein interactions between separate enzymes among the 368 yeast enzymes considered here.
WEB SITE REFERENCES
http://stash.mrc-lmb.cam.ac.uk/SUPERFAMILY/; The database of HMMs.
ACKNOWLEDGMENTS
We are grateful to Adrian Shepherd and Stuart Rison for help when we set up our relational database. S.A.T. has a Beit Memorial Fellowship for Medical Research and O.J. has a BBSRC studentship.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
5 Present address: MRC Laboratory of Molecular Biology, Cambridge CB2 2QH, UK.
4 Corresponding author.
E-MAIL sat{at}mrc-lmb.cam.ac.uk; FAX +44-(0)1223-213556.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.228002.