Vol. 9, Issue 7, 608-628, July 1999
RESEARCH
ABSTRACT
Comparative analysis of the protein sequences encoded in the four euryarchaeal species whose genomes have been sequenced completely (Methanococcus jannaschii, Methanobacterium thermoautotrophicum, Archaeoglobus fulgidus, and Pyrococcus horikoshii) revealed 1326 orthologous sets, of which 543 are represented in all four species. The proteins that belong to these conserved euryarchaeal families comprise 31%-35% of the gene complement and may be considered the evolutionarily stable core of the archaeal genomes. The core gene set includes the great majority of genes coding for proteins involved in genome replication and expression, but only a relatively small subset of metabolic functions. For many gene families that are conserved in all euryarchaea, previously undetected orthologs in bacteria and eukaryotes were identified. A number of euryarchaeal synapomorphies (unique shared characters) were identified; these are protein families that possess sequence signatures or domain architectures that are conserved in all euryarchaea but are not found in bacteria or eukaryotes. In addition, euryarchaea-specific expansions of several protein and domain families were detected. In terms of their apparent phylogenetic affinities, the archaeal protein families split into bacterial and eukaryotic families. The majority of the proteins that have only eukaryotic orthologs or show the greatest similarity to their eukaryotic counterparts belong to the core set. The families of euryarchaeal genes that are conserved in only two or three species constitute a relatively mobile component of the genomes whose evolution should have involved multiple events of lineage-specific gene loss and horizontal gene transfer. Frequently these proteins have detectable orthologs only in bacteria or show the greatest similarity to the bacterial homologs, which might suggest a significant role of horizontal gene transfer from bacteria in the evolution of the euryarchaeota.
INTRODUCTION
Phylogenetic analysis of rRNA and a set of proteins involved in
translation, transcription, and replication has led to the concept of
archaea as a third division of life, distinct from either bacteria or
eukaryotes (Woese et al. 1978, 1990; Woese and Gupta 1981; Pace et al. 1986; Zillig 1991). Furthermore, rooting of
paralogous trees for translation elongation factors and proton ATPases
suggested that archaea are a sister group of eukaryotes (Gogarten et al. 1989a,b; Iwabe et al. 1989; Gribaldo and Cammarano 1998). This
concept appears to be gaining further support from the generally
eukaryotic layout of the genome expression systems, particularly the
system of DNA replication whose principal components are orthologous to
the respective replication proteins of eukaryotes but apparently do not
have counterparts in bacteria (Mushegian and Koonin 1996; Brown and Doolittle 1997; Edgell and Doolittle 1997). However, it has been aptly
noted that archaea have a "eubacterial form and eukaryotic content"
(Keeling et al. 1994). Indeed, beyond the common "negative" traits,
namely the small cell size and the absence of a nucleus, archaea and
bacteria share major aspects of genome organization and expression
strategy. The most important of these common features include the
(typically) single circular chromosome, the absence of introns in
protein-coding genes, the operonic organization of many genes, and the
absence of a 5'-terminal cap and the presence of a
ribosomal-binding (Shine-Dalgarno) site in archaeal mRNAs (Brown and Doolittle 1997). Furthermore, several operons, particularly those
encoding ribosomal proteins, are conserved in archaea and bacteria
(Brown and Doolittle 1997; Koonin and Galperin 1997).
The analysis of the first two completely sequenced archaeal genomes,
those of Methanococcus jannaschii (Bult et al. 1996) and Methanobacterium thermoautotrophicum (Smith et al. 1997), showed, somewhat unexpectedly given the already established
archaeal-eukaryotic clade, that the bacterial form of archaea is
complemented by considerable bacterial content. It has become clear
that the majority of archaeal proteins show the greatest similarity to
their bacterial homologs, which is likely to indicate bacterial origin,
and only a minority look "eukaryotic" (Koonin et al. 1997; Smith et al. 1997). In functional terms, there is a clear split between the
bacterial and eukaryotic components of the archaeal genomes: the
eukaryotic genes are primarily those coding for components of the
translation, transcription, and replication machineries, whereas the
bacterial ones typically encode metabolic enzymes and proteins involved in cell division and cell wall biogenesis (Koonin et al. 1997; Smith et al. 1997). These findings raised the issue of possible extensive gene
exchange between bacteria and archaea (Feng et al. 1997; Koonin et al. 1997; Doolittle and Logsdon 1998).
Subsequently, the complete genome sequences of two additional archaeal
species, namely Archaeoglobus fulgidus (Klenk et al. 1997) and
Pyrococcus horikoshii (Kawarabayasi et al. 1998a,b), have been
reported. All four available complete archaeal genomes represent only
one of the two (or possibly three) main archaeal subdivisions, the
Euryarchaeota (Olsen et al. 1994; Pace 1997). Nevertheless, they show
sufficient diversity to allow us, for the first time, to embark on a
systematic comparative analysis of archaeal genomes. We describe here
the results of a detailed comparative analysis of the four complete
euryarchaeal protein sets. Our principal approach included the
delineation of sets of orthologous genes and examination of
phylogenetic patterns in these families (Tatusov et al. 1997; Koonin et al. 1998).
RESULTS AND DISCUSSION
Orthologous Families Delineated by Comparison of Four Euryarchaeal Genomes and the Principal Types of Events in Archaeal Evolution
The proteins encoded in the genomes of the four euryarchaeal
species comprise a very good set for the delineation of families of
likely orthologs [designated clusters of orthologous groups (of
proteins), COGs; Tatusov et al. 1997]. In the original COG analysis,
we emphasized that to use consistency between different genomes to
support the derivation of COGs, the sequences of the compared proteins
should be maximally independent; therefore, this criterion works best
with phylogenetically distant genomes. At large phylogenetic distances,
however, correct identification of COGs may be hampered by other
problems, such as difficulty in distinguishing orthologs from paralogs,
and in some cases, very low similarity between orthologs that precludes
their detection altogether. As a result, the final step in the
construction of the original collection of COGs involved considerable
manual correction. The distances separating the four archaeal species
are intermediate between those that are seen among close bacterial
species such as Escherichia coli and Haemophilus
influenzae (in the original COG analysis, these species were not
considered independently) and those between phylogenetically remote
species such as bacteria and eukaryotes. In quantitative terms, the
mean percent identity of the best hits in all-against-all interspecies
comparisons of protein sequences is in the range of 41%-46% for the
archaea, 57% for E. coli versus H. influenzae, and
30%-35% for the most distant bacterial lineages and for bacteria
versus eukaryotes or archaea (N.V. Grishin, unpubl.;
ftp://ncbi.nlm.nih.gov/pub/koonin/gen2gen). It appears that the
intermediate level of sequence conservation seen among the archaea is
high enough to prevent most, if not all, artificial lumping of COGs
attributable to paralogous families, but low enough for the consistency
criterion to be valid and useful. For these reasons, most of the
archaeal COGs delineated by the automatic procedure were corroborated
by subsequent case-by-case evaluation. Furthermore, given the typically
highly significant similarity between archaeal orthologs, it is most
unlikely that any significant number of them have been missed as a
result of low sequence conservation.
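The mean best-hit identity quoted above can be illustrated with a short Python sketch. It assumes that pairwise hits between two proteomes are already available as tuples of query, subject, percent identity, and score. The function name, the tuple layout, and the identifiers in the toy data are hypothetical and are not taken from the original analysis.

```python
from statistics import mean


def mean_best_hit_identity(hits):
    """Return the mean percent identity over the best hit of each query.

    hits: iterable of (query_id, subject_id, percent_identity, score).
    The best hit of a query is taken to be the one with the highest score.
    """
    best = {}  # query_id -> (percent_identity, score) of the best hit so far
    for query, _subject, identity, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (identity, score)
    return mean(identity for identity, _ in best.values())


# Toy data with made-up identifiers and values, for illustration only.
toy_hits = [
    ("query_1", "subject_a", 45.0, 210.0),
    ("query_1", "subject_b", 30.0, 90.0),
    ("query_2", "subject_c", 41.0, 150.0),
]
print(mean_best_hit_identity(toy_hits))
```

Only the top-scoring hit per query contributes to the mean, so weaker paralogous hits (such as `subject_b` here) do not pull the estimate down.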
Figure 1 shows the breakdown of the archaeal protein set in terms of
their conservation in the four complete genomes. The majority of the proteins in each species
(from 58% for P. horikoshii to 71% for M. jannaschii)
belong to the
archaeal families of likely orthologs (COGs), and another sizable
fraction (from 7% for M. jannaschii to 11% for A. fulgidus) were identified as distant homologs of the COGs. Among
the remaining proteins that had no archaeal homologs, for a relatively
small fraction (from 1% in M. jannaschii to 4% in A. fulgidus), homologs were detected in other taxa (primarily
bacteria), and the rest (~20%) had no detectable homologs. This
distribution suggests that a conserved archaeal gene set does exist.
This core gene set, however, includes a minority of the archaeal genes
as indicated by the fact that only 543 of the 1326 identified COGs
(40%) are represented in all four archaeal species; the remaining COGs
are roughly equally divided between those that include three and two
species (Fig. 2). The universal archaeal COGs
encompass 31%-35% of the proteins encoded in each of the individual
genomes. This number appears to be an important measure of the
evolutionary stability of the genomes: the rest of the gene complement
in each of the archaea must have been subject to evolutionary events
other than vertical inheritance, such as duplication with subsequent
rapid divergence, horizontal gene transfer, and lineage-specific gene loss.
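The breakdown of COGs by the number of species represented can be sketched in Python as follows. This is a minimal illustration that assumes each COG is stored as a set of species codes; the species abbreviations, COG identifiers, and toy memberships are hypothetical placeholders rather than the actual data.

```python
from collections import Counter

# Hypothetical short codes for the four euryarchaeal species.
SPECIES = ("Mj", "Mt", "Af", "Ph")


def pattern_breakdown(cogs):
    """Count COGs by the number of species they include.

    cogs: dict mapping a COG identifier to the set of species represented.
    Returns (counts, universal_fraction), where counts[n] is the number of
    COGs with n species and universal_fraction is the share of COGs that
    include every species in SPECIES.
    """
    counts = Counter(len(members) for members in cogs.values())
    total = len(cogs)
    universal_fraction = counts[len(SPECIES)] / total if total else 0.0
    return counts, universal_fraction


# Toy memberships, for illustration only.
toy_cogs = {
    "cog_a": {"Mj", "Mt", "Af", "Ph"},
    "cog_b": {"Mj", "Mt", "Af", "Ph"},
    "cog_c": {"Mj", "Mt", "Af"},
    "cog_d": {"Mt", "Af", "Ph"},
    "cog_e": {"Mj", "Mt"},
}
counts, universal_fraction = pattern_breakdown(toy_cogs)
print(dict(counts), universal_fraction)
```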
These results provide at least a rough estimate of the likely amount of
gene loss in each species, as well as the number of COGs represented in
the ancestral euryarchaeon. A conservative estimate of the number of
genes that might have been lost in each genome is provided by the
number of COGs that include three archaeal species other than the given
one. This number is in the range of 50 to 70 for M. jannaschii, M. thermoautotrophicum, and A. fulgidus, as opposed to 206 in P. horikoshii (Fig. 2). The
greatest number of COGs that are not represented in P. horikoshii is not surprising as it is a heterotrophic organism that
lacks a number of biosynthetic capabilities (Gonzalez et al. 1998). The
majority of the archaea are autotrophs and it seems most likely that
the ancestral form also had been autotrophic; thus, the absence of the
representatives of many COGs in P. horikoshii is best
explained by lineage-specific gene elimination. At least some of the
archaeal COGs with two members are also likely to reflect gene loss.
Thus, a higher estimate for the number of ancestral genes lost in each genome can be obtained by adding up all COGs with three or two members
that do not include the given species. The result varies from a total of 220 genes for M. jannaschii to 451 genes for P. horikoshii.
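The two gene-loss estimates described above reduce to simple counts over COG membership. The Python sketch below assumes the same set-based representation of COGs; the function name, species codes, and toy data are hypothetical and serve only to show how the conservative and the higher estimate differ.

```python
def loss_estimates(cogs, species, all_species):
    """Return (conservative, higher) estimates of genes lost in one species.

    cogs: dict mapping a COG identifier to the set of species represented.
    conservative: COGs that include every species except the given one.
    higher: all COGs with at least two members that lack the given species.
    """
    others = set(all_species) - {species}
    conservative = sum(1 for members in cogs.values() if members == others)
    higher = sum(
        1
        for members in cogs.values()
        if species not in members and len(members) >= 2
    )
    return conservative, higher


# Toy memberships with hypothetical species codes, for illustration only.
all_species = ("Mj", "Mt", "Af", "Ph")
toy_cogs = {
    "cog_a": {"Mj", "Mt", "Af", "Ph"},
    "cog_b": {"Mj", "Mt", "Af"},
    "cog_c": {"Mj", "Af"},
    "cog_d": {"Mt", "Ph"},
    "cog_e": {"Mj", "Mt", "Af"},
}
print(loss_estimates(toy_cogs, "Ph", all_species))
print(loss_estimates(toy_cogs, "Mj", all_species))
```

The higher estimate treats every two-member COG that lacks the species as a loss, which the text notes is likely for only some of them, so it is best read as an upper bound.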
Thus, the analysis of the conserved archaeal families reveals major
genome plasticity, with only a minority of families represented in all
genomes. These observations make all the more pertinent the question:
which essential cellular functions are provided by the set of 543 universal archaeal COGs and which are not represented by it and,
accordingly, are performed by nonorthologous (unrelated or paralogous)
proteins in different species, the phenomenon described as nonorthologous gene
displacement (Koonin et al. 1996a; Mushegian and Koonin 1996)?
The Core Set of Conserved Euryarchaeal Genes, Lineage-Specific Gene Loss, and Nonorthologous Gene Displacement
The COGs represented in all four euryarchaeal species are significantly enriched in proteins that are involved in genome expression, compared to the entire collection of the archaeal COGs. In particular, most of the basic components of the translation, transcription, and replication systems are conserved consistently in all four species; the same is true of a number of proteins implicated in repair and recombination (Fig. 3).
In other functional categories of genes, the genome plasticity revealed
by COG analysis is more pronounced. Because of the apparent loss of a
number of biosynthetic pathways in the heterotrophic P. horikoshii, there are relatively few metabolic enzymes among the
all-archaeal COGs, and in fact, it does not seem possible to delineate
even a single metabolic pathway that would be completely orthologous in
all four archaea (Table 1).
Among the three autotrophic species, most of the steps of the central
pathways are represented by orthologs; nevertheless, almost each
pathway has at least one step where nonorthologous displacement is
likely (Table 1). The biosynthesis of branched chain aliphatic amino
acids (leucine, isoleucine, valine) is an example of a complex pathway
that is, in its entirety, represented by orthologs in the three
autotrophic archaea as well as in most bacteria. This is, however, an
exception rather than the rule among the archaeal metabolic
pathways: few of them consist exclusively of orthologs of bacterial
enzymes. In most pathways, at least one or two reactions are predicted to be catalyzed either by known archaea-specific enzymes or by yet
uncharacterized ones (Table 2). In the readily
detectable cases of nonorthologous gene displacement, one of the
alternative solutions is frequently based on orthologs of the
respective bacterial enzymes, whereas the other one seems to be unique
for archaea and is not always identifiable. This is, for example, the
situation with a critical reaction in glycolysis, namely the formation
of pyruvate from phosphoenolpyruvate. M. jannaschii and
P. horikoshii encode an ortholog of the bacterial pyruvate
kinase that is predicted to catalyze this reaction. Pyruvate kinase,
however, is not detectable in the other two archaea. Given that the
other components of the trunk portion of the glycolytic pathway are
present and that the reaction catalyzed by pyruvate kinase is
indispensable for the completion of glycolysis, nonorthologous
displacement must be invoked. The most likely displacing enzyme is
phosphoenolpyruvate synthase, which is conserved in all archaea and
might produce pyruvate by reversing its typical reaction.
Nonorthologous gene displacement is notable also in the archaeal amino
acid metabolism. For example, different archaeal species apparently use
radically different pathways to synthesize proline. In M. thermoautotrophicum and A. fulgidus, proline can be formed from ornithine in a single reaction catalyzed by ornithine
cyclodeaminase (Sans et al. 1988). M. jannaschii and P. horikoshii lack this enzyme, and while the latter is expected to be
a proline auxotroph, the only possible route for proline biosynthesis
in M. jannaschii appears to be through the deacetylation of
N-acetylglutamate γ-semialdehyde into γ-glutamic
semialdehyde, followed by its conversion into pyrroline-5-carboxylate
and then to proline as shown for bacteria and yeast (Adams and Frank 1980). M. jannaschii encodes an ortholog of the
N-acetylornithine deacetylase (ArgE) that catalyzes the first
step of this pathway. The second step of the pathway, conversion of
γ-glutamic semialdehyde to pyrroline-5-carboxylate, occurs spontaneously. However, the ortholog of the bacterial enzyme for the
last step of proline biosynthesis, namely pyrroline-5-carboxylate reductase (ProC), is not encoded in the M. jannaschii genome
and should have been displaced by another dehydrogenase that remains to
be identified experimentally. Remarkably, A. fulgidus encodes only the ArgE ortholog and M. thermoautotrophicum only the
ProC ortholog. It appears that in this case, we observe nonorthologous displacement of an entire (albeit short) pathway whereby acquisition of
the ornithine cyclodeaminase gene by A. fulgidus and M. thermoautotrophicum has made the enzymes of the original pathway of
proline biosynthesis dispensable.
In addition to the cases of apparent nonorthologous displacement, there
are several important gaps in our understanding of metabolic pathways
in all euryarchaeota. The archaeal version of sugar metabolism
is particularly puzzling. There is no doubt that autotrophic archaea
possess the capabilities to synthesize ribose, deoxyribose, and the
sugar components of the cell envelope. It is unclear, however, how they
accomplish this in the absence of aldolase, fructose bisphosphatase,
transaldolase, transketolase, and pentose-5-phosphate 3-epimerase (see
Table 1). Genes for all these enzymes are missing in M. thermoautotrophicum and A. fulgidus, whereas M. jannaschii has genes coding for the three latter enzymes but not
the former two. It appears that compared to bacteria, the archaeal
sugar metabolism shows systematic nonorthologous displacement of
enzymes. Interestingly, one of the archaeal COGs includes predicted
aldolases that are highly conserved in all four archaea and are
orthologous to the recently identified class I fructose-bisphosphate
aldolase from E. coli (Thomson et al. 1998). There are two
paralogous representatives of this family of aldolases in M. jannaschii and A. fulgidus and only one member in M. thermoautotrophicum and P. horikoshii (Table 1). These
enzymes are likely to catalyze key reactions both in pentose and in
hexose biosynthesis; the exact pathways remain to be studied experimentally.
Archaeal COGs that contain four or three members account for the majority of known housekeeping functions, with several notable exceptions (e.g., those in the translation machinery discussed above), and in a sense, may be considered an idealized minimal archaeal gene complement. The COGs with two members appear to account for more specific functions linked to the organism's particular life style, for example, a number of COGs that include enzymes involved in methanogenesis in M. jannaschii and M. thermoautotrophicum.
Relationships Between Euryarchaeal Protein Families and Their Bacterial and Eukaryotic Homologs
The majority of the archaeal COGs have homologs in other taxa. In the present analysis, we attempted to distinguish carefully between true orthology (see Methods) and other homologous relationships that typically include weak sequence conservation or differences in domain architectures. There are notable differences in the distribution of the apparent phylogenetic affinities for the COGs represented in all archaea (universal) and those that include only three or two archaeal species. For >50% of the universal archaeal COGs, orthologs were identified in both bacteria and eukaryotes, in sharp contrast to the nonuniversal COGs, for which this fraction was only 28% (Fig. 4A,B). A significant majority of the COGs that have only bacterial orthologs are not conserved in all archaea, whereas most of the COGs that have only eukaryotic orthologs belong to the universal subset (Fig. 4A,B). Furthermore, those COGs that do not have any homologs outside the archaea are poorly represented in the universal subset.
A complementary, quantitative analysis of the distribution of sequence similarities supports these observations. Archaeal proteins from the COGs that include only two or three species typically show the greatest similarity to bacterial homologs, in contrast to the universal COGs that are significantly enriched in proteins most similar to the eukaryotic homologs (Fig. 5). This difference might reflect true phylogenetic affinities, differences in evolutionary rates in different functional categories of proteins, or both. However, the finding that COGs consisting of two to three euryarchaeal members typically show a greater similarity to bacterial homologs might suggest a significant contribution of horizontal transfer of bacterial genes into archaea.
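The comparison of bacterial and eukaryotic affinities between universal and nonuniversal COGs can be illustrated with a small Python sketch. It assumes that each archaeal protein already has a best similarity score against bacterial and against eukaryotic homologs (or none); the labels, the tie handling, and the toy scores are assumptions made for illustration and do not reproduce the procedure used in the original analysis.

```python
from collections import Counter


def affinity(bacterial_score, eukaryotic_score):
    """Label a protein by the taxon of its best-scoring non-archaeal homolog.

    A score of None means that no homolog was detected in that taxon.
    """
    if bacterial_score is None and eukaryotic_score is None:
        return "none"
    if eukaryotic_score is None:
        return "bacterial"
    if bacterial_score is None:
        return "eukaryotic"
    if bacterial_score == eukaryotic_score:
        return "tie"
    return "bacterial" if bacterial_score > eukaryotic_score else "eukaryotic"


def tally(records, n_genomes=4):
    """Count affinity labels separately for universal and nonuniversal COGs.

    records: iterable of (species_in_cog, bacterial_score, eukaryotic_score).
    """
    counts = Counter()
    for n_species, bacterial_score, eukaryotic_score in records:
        subset = "universal" if n_species == n_genomes else "nonuniversal"
        counts[(subset, affinity(bacterial_score, eukaryotic_score))] += 1
    return counts


# Toy records with made-up scores, for illustration only.
toy_records = [
    (4, 120.0, 310.0),
    (4, None, 95.0),
    (3, 240.0, 80.0),
    (2, 150.0, None),
    (2, None, None),
]
result = tally(toy_records)
print(dict(result))
```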
The functional distinction between bacterial and eukaryotic COGs in archaea is clear-cut and is related to the functional difference between the universal and specialized subsets discussed above (see Fig. 3). The bacterial COGs within the universal subset comprise primarily proteins involved in energy production (e.g., ferredoxins and numerous components of hydrogenase complexes), certain metabolic functions, such as coenzyme biosynthesis, and transport system components. Interestingly, this bacterial set also includes enzymes involved in protein degradation and potentially in chaperone-like functions, such as three families of previously undetected predicted zinc-dependent proteases (K.S. Makarova, L. Aravind, and E.V. Koonin, unpubl.). Furthermore, the bacterial component of the universal COG subset includes several repair enzymes, proteins involved in cell division (for example, chromosome partitioning ATPases), and stress response proteins, such as the homologs of the bacterial universal stress protein UspA.
The UspA homologs are an example of a protein superfamily that
was not originally recognized in archaeal genome analyses but, in
fact, is conserved in all archaea, most bacteria, plants, and fungi;
all archaea and many bacteria encode multiple, paralogous members of
this superfamily (Fig. 6). Most of the proteins in the superfamily consist of one or more copies of the UspA domain, but
in the A. fulgidus protein AF1612 and a Synechocystis
protein, the UspA domain is fused to a cation transporter. In addition, fusions of the UspA domain to bacterial sensor proteins (e.g., KdpD)
and to plant protein kinases were detected. The E. coli UspA
protein has been reported to possess autophosphorylation activity
(Freestone et al. 1997). Very recently, the x-ray structure of the
M. jannaschii protein MJ0577 that we identified as a UspA homolog has been determined and the protein has been shown to tightly
bind ATP (Zarembinski et al. 1998). It appears likely that the UspA
superfamily proteins and domains are nucleotide-binding signal
transducers that play a central regulatory role in both archaeal and
bacterial cells.
Within the bacterial component of the euryarchaeal core gene set, 8 COGs with 4 members and 13 COGs with 3 members include archaeal
proteins that contain the helix-turn-helix (HTH) domain and are
predicted to function as transcription regulators. The conservation
of these families in all or all but one of the archaea whose genomes
have been sequenced, along with the existence of a number of more
specific HTH protein families, emphasizes the combination of bacterial
and eukaryotic features in the archaeal transcription machinery.
Indeed, all archaeal RNA polymerase subunits and several basal
transcription factors are most closely related to their eukaryotic
counterparts, and some of them have no detectable orthologs in bacteria
(Leffers et al. 1989; Puhler et al. 1989; Zillig et al. 1989; Langer et al. 1995; Bell and Jackson 1998; Bell et al. 1998). This is in stark
contrast with the bacterial affinities of the predicted transcriptional
regulators; a detailed analysis of the archaeal transcription machinery
and its evolutionary implications will be presented elsewhere (L. Aravind and E.V. Koonin, unpubl.).
Nearly all of the eukaryotic COGs in archaea, with only a few
exceptions, consist of proteins involved in translation, modification of translation machinery components, transcription, replication, and
repair. The present analysis resulted in the identification of
previously undetected archaeal orthologs for several characteristically eukaryotic proteins that function in transcription and replication. Three such findings are the orthologs of the large subunit of DNA
primase, the P30 subunit of RNAse P, and the nascent
polypeptide-associated complex (NAC) α-subunit. The detection of
the second eukaryotic-type primase subunit further supports the concept
of a eukaryotic-type replication machinery in archaea but, in addition,
is of particular interest given the existence of archaeal homologs of
bacterial DnaG-type primases (Aravind et al. 1998).
The NACα family seems to be of special interest and we present this
case in some detail. NACα is a multifunctional eukaryotic protein
that is involved in translation and subcellular targeting of nascent
polypeptides (Wang et al. 1995; Wickner 1995; Powers and Walter 1996),
but it has been shown to function also as a transcription coactivator
(Yotov et al. 1998). All archaea encode an apparent ortholog of
NACα with a conserved domain organization; a further detailed
sequence analysis showed that the amino-terminal domain of these
proteins is distantly related to the general transcription factor BTF3
(Fig. 7A,B). Unexpectedly, we found that the small, carboxy-terminal domain of NACα
and its archaeal counterparts, which is missing in BTF3, showed significant similarity to the distinct
amino-terminal domain of the bacterial translation elongation factor
EF-Ts and is likely to adopt the same structure (Fig. 7A,C,D). The
amino-terminal domain of EF-Ts has been implicated in its interaction
with EF-Tu (Zhang et al. 1997); a similar interaction with the archaeal
and eukaryotic elongation factors might be involved in the
translational function of NACα. It appears likely that the
ancestral form of NACα already performed a dual role in
transcription and translation; as the result of our present analysis,
each of these functions was mapped tentatively to a distinct domain.
As reported previously, bacterial homologs of some of the protein
families that appeared to be confined to archaea and eukaryotes could
be identified by structural comparison or through sequence searches
using sensitive methods. An example of a structural comparison that has
convincingly demonstrated the existence of a bacterial homolog
(probably a highly diverged ortholog) of an archaeal-eukaryotic protein
family is the relationship between the clamp subunits of DNA
polymerases, that is, the eukaryotic proliferating cell nuclear antigen
(PCNA), its highly conserved archaeal orthologs, and bacterial DNA
polymerase β subunit (Krishna et al. 1994). More recently,
bacterial homologs were detected by detailed sequence analyses for
several translation factors that appeared to be exclusively archaeal-eukaryotic, such as eIF-5A whose highly diverged ortholog in
bacteria is the elongation factor P (Tatusov et al. 1997; Kyrpides and Woese 1998). In the same vein, we observed that eukaryotic-archaeal initiation factor eIF6 contains a diverged ribosomal protein S1-type RNA-binding domain and thus, has homologs, although apparently not true
orthologs, among bacterial proteins (data not shown). Other examples of
eukaryotic-archaeal families, for which distant bacterial homologs
become detectable as a result of detailed sequence analysis, are the
transcription factors TFIIE and MBF1 (multiprotein bridging factor 1),
in which we identified HTH domains (L. Aravind and E.V. Koonin,
unpubl.). A number of other families, however, remained refractory to
the detection of bacterial homologs despite extensive searches [e.g.,
several families of ribosomal proteins, translation initiation factor
eIF-1, three subunits (K, L, and N) of DNA-dependent RNA
polymerase, and two DNA primase subunits].
Synapomorphies (shared-derived characters) Among Archaeal Protein Families and Archaea-Specific Family Expansions
Shared-derived characters present in the members of the given
lineage to the exclusion of all other taxa under comparison (synapomorphies) are perhaps the most reliable indicators of monophyly that are free of the uncertainties that plague conventional methods of
tree analysis, particularly when ancient evolutionary events are
involved. At the level of conserved proteins, it is natural to define a
synapomorphy as a family (COG) that does not have orthologs in other
taxa. Typically, this conclusion can be reached either when
there are no detectable homologs for a given family outside a particular clade, or when it has a unique
domain architecture, with homologs found only for
individual domains. According to these criteria, the 71 COGs that are represented in all four archaeal genomes
but do not have detectable orthologs outside archaea (see Fig.
4B) should be considered archaeal synapomorphies (Table 2). The most
obvious of these are the 32 universal archaeal COGs that do not have
any detectable nonarchaeal homologs. Unfortunately, the information on
the functions of these proteins is scant. A striking exception is the
recently discovered archaeal DNA polymerase II (Uemori et al. 1997; Cann et al. 1998; Ishino et al. 1998) that is one of the most highly
conserved proteins among the four archaea, but does not show any
detectable similarity to other known polymerases (or any other
proteins) except for a zinc finger domain.
In fact, however, the 71 COGs that have no obvious nonarchaeal
orthologs mark only the lower bound of the number of synapomorphies. There is a considerable number of COGs that show readily definable unique features, although a traceable line of vertical descent seems to
exist, suggesting orthologous relationships with bacterial or
eukaryotic genes. Three examples in this category are translation elongation factor EF-1, the small subunit of archaeal DNA
polymerase II, and the archaeal ortholog of the eukaryotic repair
protein ERCC4. The eukaryotic EF-1 all contain an additional
domain that is homologous to glutathione S-transferases
(Koonin et al. 1994) and is fused to the main domain that is conserved
in the archaeal counterparts (Table 2; Fig. 8). In
the case of the polymerase subunit and the ERCC4 protein, the archaeal
counterparts contain the conserved sequence motifs that strongly
suggest, respectively, a phosphohydrolase and a helicase activity; in
eukaryotes, these motifs are disrupted, indicating that the respective
enzymatic activities are abolished (Aravind and Koonin 1998; Aravind et al. 1999).
The most interesting synapomorphies are those COGs that consist of
proteins whose individual domains are conserved in other taxa but the
domain architecture is unique (Table 2; Fig. 8). The recently described
archaeal homologs of bacterial DnaG-type primases represent one such
example where the primase domain is highly conserved in archaea and
bacteria but the domains implicated in DNA binding are unrelated
(Aravind et al. 1998). Table 2 and Figure 8 show additional instances
of unique domain architectures in archaea. These include both
archaea-specific domain fusions, as in the archaeal counterpart of the
eukaryotic multiprotein bridging factor MBF1 (a transcriptional
coactivator), and splitting of multidomain proteins into subunits
encoded by distinct genes, as in the cases of the largest subunit of
DNA-directed RNA polymerase and GMP synthetase. Interestingly, in
M. thermoautotrophicum and P. horikoshii (but not in
the other two archaeal species) the genes for the two GMP synthetase
subunits are adjacent (Table 2), which strongly suggests that an
ancestral gene that encoded the two-domain enzyme had been split early
in archaeal evolution.
In addition to the protein families that are genuine synapomorphies,
the uniqueness of a clade is defined by significant expansions of gene
families that are less abundantly represented in other lineages.
Several archaea-specific gene family expansions were detected as well
as gene expansions confined to one or two archaeal species (Fig. 9). In only one case, that of ferredoxins, a
correlation between a protein superfamily expansion and distinct
features of archaeal physiology, such as iron-dependent respiration
(Schafer et al. 1996a,b) and methanogenesis, seems obvious. Some of the other expanded families [e.g., metal-dependent
β-lactamase-like hydrolases (Aravind 1998)] include enzymes with versatile functions whose connection with the specifics of the archaeal lifestyle (if any)
remains unclear.
Three expanded archaeal families include P-loop-containing ATPases,
namely the RecA/RadA superfamily and two archaea-specific groups that
have undergone species-specific amplification in M. jannaschii
and P. horikoshii, respectively (Mj-type and Ph-type predicted
ATPases). In the present analysis, the RecA/RadA ATPases formed two
distinct COGs. One of these is represented by a single member in each
of the four archaea and is orthologous to eukaryotic RadA-type ATPases.
The second COG consists of different numbers of paralogs from each of
the archaeal species and includes, in addition to typical RecA-like
ATPases, forms with a duplicated ATPase domain, inactivated forms and
fusions with other domains (e.g., GTPases; Aravind et al. 1999;
L. Aravind, unpubl.). Interestingly, the members of this COG that contain
the duplication of the ATPase domain are highly similar and apparently
orthologous to a family of cyanobacterial RecA-like ATPases at least
one of which is involved in circadian clock regulation (Ishiura et al.
1998) (Fig. 10). Taken together with the observed
inactivation and fusion with other domains, this functional connection
may suggest that this second type of archaeal RecA-like ATPase is
involved in signal transduction rather than repair. It appears likely
that the duplication of the ATPase domain, which is unique within the
RecA/RadA family of ATPases, occurred in one of the two lineages
(euryarchaeota or cyanobacteria), with a subsequent horizontal
gene transfer; the direction of transfer in this case is uncertain.
The archaea-specific family of Ph-type ATPases contains, in addition to
the ATPase domain proper, a predicted HTH domain, whereas the distinct,
although distantly related, Mj-type family contains a putative
metal-binding motif (Koonin 1997; data not shown). Given the presence
of an HTH, the Ph-type family is most likely involved in ATP-dependent
transcription regulation; by analogy, a similar role may be proposed
for the Mj-type ATPases, the conserved metal-binding site being
involved in DNA binding.
Other proteins and domains that are unusually abundant in archaea,
such as the CBS domain (Bateman 1997; Ponting 1997) and the newly
identified PIN domain (Figs. 9 and 11), probably perform regulatory and
signaling functions, although their functions are not
understood in detail. The PIN (PilT amino terminus) domain is of
particular interest. It is a compact domain that consists of ~100
amino acids, with the sequence conservation centered at two nearly
invariant aspartates that cap predicted β-strands and two
additional acidic residues found in the majority of PIN domains (Fig.
11). Each of the archaeal species encodes multiple stand-alone versions
of the PIN domain as well as fusions with other domains; two of these fusions, namely those with the PilT-type ATPase domain and a C4 zinc
finger, are archaeal synapomorphies (Figs. 9 and 11). PIN domains are
sporadic and much less common in bacteria and eukaryotes except for the
major expansion in Mycobacteria that appears to be independent
of the archaeal expansion (Figs. 9, 11; L. Aravind, unpubl.). The
function of the PIN domain is not known but a role in signaling appears
likely given the presence of this domain in the plasmid-encoded
transcriptional repressor StbB (Tabuchi et al. 1992) and the DIS3
family of eukaryotic proteins that are involved in mitosis regulation
(Kinoshita et al. 1991; Noguchi et al. 1996; Shiomi et al. 1998). The
yeast Dis3P is a 3'-5' exonuclease, which is a subunit of the
exosome (Mitchell et al. 1997), and consists of the PIN domain fused to
an RNase II domain and a dsRNA-binding domain. The DIS3 proteins appear
to perform a regulatory function mediated by their binding to the
GTP-Ran and RCC1 proteins (Noguchi et al. 1996). Given the
conservation of the PIN domain in DIS3 proteins from yeast to mammals
(Fig. 11), it is likely to perform an important signaling function in
all eukaryotes and, by implication, in archaea and bacteria.
Concluding Remarks
The analysis of the orthologous gene families (COGs) among the four completely sequenced archaeal genomes resulted in the delineation of the core gene set that is conserved in euryarchaeota. This core set includes only 31%-35% of the genes from each of the genomes but seems to account for most of the principal functions in genome replication, expression, and repair, as well as the majority of the reactions in several central metabolic pathways. This core gene set appears to have been relatively stable throughout the evolution of euryarchaeota. It defines the euryarchaeal clade through a number of synapomorphies (unique features, such as specific domain architectures of proteins, that are conserved among the members of archaeal COGs but are not found outside the euryarchaea).
The evolution of the variable "shell" of the euryarchaeal genomes
should have included multiple events other than vertical inheritance,
namely horizontal gene exchange and lineage-specific gene
loss. Likely horizontal gene transfer may be
manifest as nonorthologous gene displacement, that is, the apparent
substitution of an unrelated or distantly related but functionally
equivalent gene for the ancestral archaeal gene.
Generally, the comparison of the four archaeal genomes confirms the observations first made for M. jannaschii and M. thermoautotrophicum: the majority of archaeal proteins, particularly the metabolic enzymes and proteins involved in cell division and cell wall biogenesis, are most similar to their bacterial counterparts, and a minority, primarily proteins involved in genome replication and expression, most closely resemble their eukaryotic orthologs. The comparative analysis made it clear that the eukaryotic component belongs almost entirely to the families that are conserved in all four genomes, whereas much of the bacterial component comprises more variable families and species-specific genes. This might suggest a significant role of horizontal gene transfer from bacteria in the evolution of the euryarchaeota.
Comparative analysis of the four available genomes of euryarchaeota, aided by the availability of a number of complete bacterial genome sequences and one complete eukaryotic genome, provides some glimpses of archaeal evolution and the relationships between the three divisions of life. Once complete genomes of at least one crenarchaeon and some early-branching eukaryotes arrive, it will become possible to strive for a more coherent picture.
METHODS
Databases
The databases used in this study were the nonredundant (NR)
database and a separate database containing the protein sequences
encoded in the complete genomes of four archaea, namely M. jannaschii
(Bult et al. 1996), M. thermoautotrophicum (Smith et al. 1997),
A. fulgidus (Klenk et al. 1997), and P. horikoshii (Kawarabayasi et al.
1998a,b). The archaeal
protein complements and the complete nucleotide sequences of the
archaeal genomes were extracted from the Genomes division of Entrez.
Database Searches
The protein sequence database searches were performed using the
gapped BLAST program and the PSI-BLAST program (Altschul et al. 1997).
The PSI-BLAST program constructs a position-specific scoring matrix
(PSSM) from a multiple alignment generated from the BLAST hits that
pass a chosen expectation value (e-value) cutoff and carries out
iterative database searches using the PSSM as the query
(Altschul et al. 1997; Altschul and Koonin 1998).
PSI-BLAST also has the capability to save the PSSM after a
user-defined number of iterations or at convergence and to
reuse it for searching another database (Wolf et al. 1999).
The estimates of statistical significance of the
PSI-BLAST results are based on the extreme value
distribution statistics originally developed by Karlin and
Altschul for local alignments without gaps (Karlin and Altschul 1990;
Karlin et al. 1991) and subsequently shown to apply to
gapped alignments as well (Altschul and Gish 1996; Altschul et al.
1997). There is no analytical proof of the applicability of the
Karlin-Altschul statistics to searches that use a PSSM as the query, but
extensive computer simulations showed a nearly perfect fit of the score
distribution produced by such searches to the extreme value distribution
(Altschul et al. 1997). Therefore, the e-value reported for each retrieved
sequence at the point when its alignment with the query passes the
cutoff for the first time should be considered a reliable estimate of
the statistical significance of the observed similarity. Clearly, after
a sequence is included in the model, e-values reported for it (and its
closely related homologs) in subsequent iterations become inflated and
do not accurately represent the statistical significance (Altschul and
Koonin 1998). All reported e-values are for the first appearance of the
given sequence above the cutoff.
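The first-appearance rule described above can be illustrated with a minimal Python sketch. It is not part of PSI-BLAST; the function name, the input layout (one dictionary of hit e-values per iteration), and the example values are assumptions made purely for illustration.

```python
def first_appearance_evalues(iterations, cutoff=0.01):
    """Return {sequence_id: e-value} taken at the first iteration in which
    the hit passes the cutoff.

    `iterations` is a list with one dict per search iteration, each mapping
    a sequence id to the e-value reported in that iteration (assumed layout).
    """
    first_seen = {}
    for hits in iterations:
        for seq_id, evalue in hits.items():
            # Keep only the first crossing; values from later iterations,
            # computed after the sequence entered the model, are ignored.
            if evalue < cutoff and seq_id not in first_seen:
                first_seen[seq_id] = evalue
    return first_seen


# Hypothetical e-values for two hits over three iterations.
iterations = [
    {"protA": 1e-5, "protB": 0.5},
    {"protA": 1e-40, "protB": 0.004},
    {"protA": 1e-60, "protB": 1e-20},
]
reported = first_appearance_evalues(iterations)
```

In this toy input, protB would be reported with the e-value from the second iteration, where it first passes the cutoff, rather than with the much lower value it receives once it is part of the profile.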
The main source of artifacts that arise in database searches and are
inevitably amplified in PSI-BLAST iterations are regions of low
compositional complexity in protein sequences that typically correspond
to nonglobular domains (Wootton 1994). To avoid such artifacts,
database searches were routinely run after masking the low complexity
regions in the query sequences using the SEG program with default
parameters (Wootton and Federhen 1996). However, because masking may
also prevent the detection of subtle but functionally and
evolutionarily important sequence similarities, filtering for low
complexity was omitted in case-by-case analyses aimed at the detection
of distant homologs.
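To make the idea of masking concrete, the sketch below replaces residues in low-entropy windows with "X". It is a deliberately simplified stand-in and not the SEG algorithm; the window length, the entropy threshold, and the function names are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the composition of a segment."""
    counts = Counter(segment)
    total = len(segment)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def mask_low_complexity(seq, window=12, threshold=2.2):
    """Replace every residue that falls in a low-entropy window with 'X'.

    The window and threshold values are illustrative choices only.
    """
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) <= threshold:
            for i in range(start, start + window):
                masked[i] = "X"
    return "".join(masked)


# A compositionally diverse stretch flanking a hypothetical poly-Q tract.
diverse = "ACDEFGHIKLMNPQRSTVWY"
example = mask_low_complexity(diverse + "Q" * 20 + diverse)
```

Because every window that overlaps the biased tract strongly enough is masked in full, some flanking residues are lost as well, which illustrates why masking can hide subtle similarities near such regions.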
The current default e-value cutoff for PSI-BLAST to include a sequence
in the PSSM for use in the next iteration is 0.001. However, the
original evaluation of the accuracy of PSI-BLAST and a number of
subsequent analyses, including both large-scale benchmarking
experiments and detailed case studies, have shown that an e-value of
0.01 (and in some cases, even higher e-values) is an appropriate cutoff
for PSI-BLAST provided that (1) regions of low complexity in the query
are masked before the search, and (2) the search results are
subsequently examined for the conservation of sequence motifs that are
typical of the particular protein superfamily. Accordingly, the cutoff
of 0.01 was used as the default for PSI-BLAST searches in this work.
The outcome of the analysis performed using PSI-BLAST critically
depends on the optimal choice of the queries used to seed the iterative
search (Aravind and Koonin 1999a). Therefore, all protein families that
were analyzed in detail were investigated using multiple starting
points. All PSI-BLAST outputs were manually examined for the
conservation of characteristic sequence motifs to corroborate the
relevance of the results and facilitate the prediction of protein functions.
Construction and Analysis of COGs of Proteins
After comparing the archaeal protein set to itself using the gapped
BLAST program, conserved archaeal families that consist of likely
orthologs, termed COGs, were delineated using the previously described
approach (Tatusov et al. 1997; Koonin et al. 1998). Briefly, this
procedure first identifies and clusters obvious paralogs within each
proteome; that is, those proteins that show a greater similarity to
each other than to any protein from the other proteomes. At the next
step, for each protein or group of paralogs, the most similar protein
in each of the other proteomes is found, consistent triangles of such
intergenomic best hits are identified, and triangles with a common side
are merged to form COGs.
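The triangle-merging step can be sketched in Python as follows. This is a simplified illustration rather than the published COG procedure: paralog clustering is omitted, a best hit in either direction is treated as an undirected link, and the function name, input layout, and protein identifiers are all hypothetical.

```python
from itertools import combinations


def build_cogs(best_hits, genome_of):
    """Cluster proteins into COG-like groups from intergenomic best hits.

    best_hits: dict mapping (protein, other_genome) -> best-hit protein there.
    genome_of: dict mapping protein -> genome name.
    """
    # Undirected link if either protein is the other's best hit (simplification).
    edges = {frozenset((query, hit)) for (query, _), hit in best_hits.items()}

    # A triangle: three proteins from three different genomes, all pairwise linked.
    triangles = [
        frozenset(trio)
        for trio in combinations(sorted(genome_of), 3)
        if len({genome_of[p] for p in trio}) == 3
        and all(frozenset(pair) in edges for pair in combinations(trio, 2))
    ]

    # Union-find over triangles: merge any two that share a side (two proteins).
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) == 2:
            parent[find(i)] = find(j)

    cogs = {}
    for i, tri in enumerate(triangles):
        cogs.setdefault(find(i), set()).update(tri)
    return sorted(sorted(members) for members in cogs.values())


# Hypothetical best hits among four genomes (Mj, Mt, Af, Ph).
best_hits = {
    ("mj1", "Mt"): "mt1", ("mt1", "Af"): "af1", ("af1", "Mj"): "mj1",
    ("mt1", "Ph"): "ph1", ("ph1", "Af"): "af1",
    ("mj2", "Mt"): "mt2",
}
genome_of = {"mj1": "Mj", "mj2": "Mj", "mt1": "Mt",
             "mt2": "Mt", "af1": "Af", "ph1": "Ph"}
cogs = build_cogs(best_hits, genome_of)
```

Here the triangles (mj1, mt1, af1) and (mt1, af1, ph1) share the side mt1-af1 and are merged into one four-species group, whereas the isolated pair mj2-mt2 forms no triangle and is left out. The exhaustive enumeration of triples is adequate only for toy inputs.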
Multiple alignments were constructed for each potential COG using the
ClustalW program (Thompson et al. 1994); the default parameters for
ClustalW, namely the BLOSUM62 matrix for amino acid residue comparison,
gap opening penalty 10, and gap extension penalty 0.1, were used. The
resulting multiple alignments were examined, in conjunction with the
BLAST search outputs, to identify proteins that contain two or more
distinct, independently evolving regions. The distinguishing feature of
such independently evolving units in proteins is that they are fused in
some species to form a single protein, but in other species are encoded
by two distinct genes, resulting in independent proteins (Doolittle and
Bork 1993; Doolittle 1995; Riley and Labedan 1997).
Typically, when the respective three-dimensional structures are
available, the independently evolving regions are recognized as
sequence cognates of compact structural units, and therefore, these
regions are frequently called domains, whereas proteins containing more
than one such region are called multidomain proteins. However, a
one-to-one correspondence between independently evolving regions of
proteins and domains defined as fundamental units of three-dimensional
structure (Branden and Tooze 1991) may or may not exist, as a single
independently evolving region may contain more than one domain. In our
analysis, independently evolving regions of proteins were recognized on
the basis of statistically significant sequence similarity (typically,
e-value below 0.01) detected using the BLAST or PSI-BLAST programs; the
recognition of such regions is facilitated by use of the graphical
output of the database search implemented in WWW-BLAST
(http://www.ncbi.nlm.nih.gov/BLAST). Multidomain proteins may
artificially connect unrelated single-domain proteins into a cluster
(Watanabe and Otsuka 1995; Koonin et al. 1996b; Riley and Labedan
1997). Clusters that appeared to contain two COGs artificially merged,
because of the presence of multidomain proteins, were manually split
into single-domain COGs.
These procedures resulted in the identification of COGs that included
at least three archaeal species. In addition, all symmetrical
intergenomic best hits (Tatusov et al. 1997) between proteins not
included in this set of COGs were analyzed to identify COGs that
contained only two species. The protein sequences from each COG were
compared to the rest of the archaeal proteins using the PSI-BLAST
program, which was run for four iterations, to detect possible distant,
nonorthologous homologs of the COGs encoded in the archaeal genomes. In
addition, the protein sequences from the COGs including three or two
archaeal species were compared to the complete sequences of the
remaining archaeal genomes translated in all six reading frames using
the gapped version of the TBLASTN program (Altschul et al. 1990, 1997),
to detect possible orthologs that might have been missed in the
original translation of the genome sequences.
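The six-frame translation that underlies a TBLASTN-style comparison can be sketched as below. This is an illustration only and not the TBLASTN implementation; it assumes the standard genetic code, ignores alternative start codons, and uses hypothetical function names and a made-up input sequence.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, with codons enumerated in TCAG order.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate a DNA string codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return {frame: protein} for frames +1..+3 and -1..-3."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
```

Searching protein queries against all six conceptual translations makes it possible to find coding regions that were not annotated as genes, which is the purpose of this step in the text above.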
The archaeal protein sequences included in the COGs were compared to
the NR database using the PSI-BLAST program (four iterations), to
detect orthologs and nonorthologous homologs in other taxa, even in
cases of low sequence conservation. The search outputs were analyzed
using the Tax_Break and Tax_Collector programs of the SEALS package
(Walker and Koonin 1997), to evaluate the phylogenetic distribution of
homologs for each COG. The Tax_Break program outputs the complete
taxonomic breakdown of database hits above the chosen cutoff (e-value
of 0.01 in this work) and the Tax_Collector program outputs the
lineage-specific best hits using the taxonomy tree structure embedded
in the Entrez system. The alignments of archaeal proteins with most
similar proteins from different taxa were examined manually to assess
the orthologous relationships (or lack thereof). The assignment of
likely orthologs was based on a combination of statistical significance
of the best lineage-specific hits and the conservation of domain
architecture (Tatusov et al. 1996, 1997).
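The two kinds of summary described above, a taxonomic breakdown of hits and a list of lineage-specific best hits, can be approximated with the following Python sketch. It does not reproduce the SEALS programs; the function names, the flat (sequence id, taxon, e-value) input format, and the example hits are assumptions for illustration, and the real programs work with the Entrez taxonomy tree rather than flat labels.

```python
from collections import Counter


def taxonomic_breakdown(hits, cutoff=0.01):
    """Count the hits with an e-value below the cutoff in each taxon."""
    return Counter(taxon for _, taxon, evalue in hits if evalue < cutoff)


def lineage_best_hits(hits, cutoff=0.01):
    """For each taxon, keep the hit with the lowest e-value below the cutoff."""
    best = {}
    for seq_id, taxon, evalue in hits:
        if evalue < cutoff and (taxon not in best or evalue < best[taxon][1]):
            best[taxon] = (seq_id, evalue)
    return best


# Hypothetical search output: (sequence id, taxon, e-value).
hits = [
    ("b1", "Bacteria", 1e-30),
    ("b2", "Bacteria", 1e-12),
    ("e1", "Eukaryota", 1e-8),
    ("e2", "Eukaryota", 0.5),
    ("a1", "Archaea", 1e-50),
]
breakdown = taxonomic_breakdown(hits)
best = lineage_best_hits(hits)
```

Comparing the best e-values across lineages gives only a first indication of affinity; as stated above, the orthology assignments in this work also relied on manual examination of alignments and domain architectures.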
The PSI-BLAST searches with the same settings were performed for the
archaeal proteins not included in the COGs. To enumerate the members of
large protein or domain families encoded in the archaeal genomes, a
profile for each family was developed using the PSI-BLAST program and
run as a query against the archaeal protein sequence database using the
e-value of 0.01 (adjusted to the size of the NR database) as the cutoff
(Aravind et al. 1998; Chervitz et al. 1998; Wolf et al. 1999).
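The adjustment of an e-value to the size of a different database follows from the fact that, under Karlin-Altschul statistics, the expected number of chance hits is proportional to the size of the search space. The sketch below shows the resulting rescaling; it ignores edge-effect corrections to the effective search space, and the function name and database sizes are hypothetical.

```python
def rescale_evalue(evalue, searched_db_size, reference_db_size):
    """Restate an e-value obtained in one database for a database of another size.

    Sizes are total residue counts; the expected number of chance hits is
    taken to scale linearly with database size (an approximation).
    """
    return evalue * reference_db_size / searched_db_size


# Hypothetical sizes: a small archaeal protein set and a much larger NR.
archaeal_size = 2_000_000
nr_size = 100_000_000
adjusted = rescale_evalue(1e-5, archaeal_size, nr_size)
```

With these made-up sizes, a hit with an e-value of 1e-5 in the small database corresponds to roughly 5e-4 at the scale of the larger one, which is the value that would be compared with the 0.01 cutoff.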
Other Methods for Protein Sequence and Structure Analysis
Protein secondary structure prediction on the basis of a multiple
sequence alignment was carried out using the PHD program (Rost and
Sander 1994). Homology modeling of protein structures was performed
using the ProMod program (Peitsch 1996). Protein Data Bank (PDB) files
were visualized using SWISS-PDB viewer version 2.6 (Peitsch 1996).
Availability of the Complete Results
The complete, annotated list of archaeal COGs is available at http://www.ncbi.nlm.nih.gov/CBBresearch/Koonin/COGS/Archaea. This list is also available, together with multiple alignments for each of the COGs at ftp://www.ncbi.nlm.nih.gov/pub/koonin/Archaea.
ACKNOWLEDGMENTS
K.M. is supported by U.S. Department of Energy OBER grant DE-FG02-98ER62583.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
4 Present address: Institute of Cytology and Genetics, Russian Academy of Sciences, Novosibirsk 630090, Russia.
5 Corresponding author.
E-MAIL koonin{at}ncbi.nlm.nih.gov; FAX (301) 480-9241.
REFERENCES
A tool for making discoveries in sequence databases. Trends Biochem. Sci. 23: 444-447.
A conserved catalytic domain in type IA and II topoisomerases, DnaG-type primases, OLD family nucleases and RecR proteins. Nucleic Acids Res. 26: 4205-4213