Vol. 10, Issue 2, 204-219, February 2000

Functional Classification of cNMP-binding Proteins and Nucleotide Cyclases with Implications for Novel Regulatory Pathways in Mycobacterium tuberculosis

The Wadsworth Center for Laboratories and Research, New York State Department of Health, Albany, New York 12201-0509 USA
We have analyzed the cyclic nucleotide (cNMP)-binding protein and nucleotide cyclase superfamilies using Bayesian computational methods of protein family identification and classification. In addition to the known cNMP-binding proteins (cNMP-dependent kinases, cNMP-gated channels, cAMP-guanine nucleotide exchange factors, and bacterial cAMP-dependent transcription factors), new functional groups of cNMP-binding proteins were identified, including putative ABC-transporter subunits, translocases, and esterases. Classification of the nucleotide cyclases revealed subtle differences in sequence conservation of the active site that distinguish the five classes of cyclases: the multicellular eukaryotic adenylyl cyclases, the eukaryotic receptor-type guanylyl cyclases, the eukaryotic soluble guanylyl cyclases, the unicellular eukaryotic and prokaryotic adenylyl cyclases, and the putative prokaryotic guanylyl cyclases. Phylogenetic distribution of the cNMP-binding proteins and cyclases was analyzed, with particular attention to the 22 complete archaeal and eubacterial genome sequences. Mycobacterium tuberculosis H37Rv and Synechocystis PCC6803 were each found to encode several more putative cNMP-binding proteins than other prokaryotes; many of these proteins are of unknown function. M. tuberculosis also encodes several more putative nucleotide cyclases than other prokaryotic species.
Signal transduction pathways control many critical cellular processes, including chemotaxis, differentiation, proliferation, and apoptosis. For example, signal transduction pathways are necessary for bacterial pathogens to sense and respond to host environments, and for cellular differentiation during embryogenesis, conductance of nerve impulses, and cell cycle control. Disruption of these pathways can result in neoplasia, arteriosclerosis, neurological and developmental abnormalities, and cell death. The most common mechanisms of signal transduction include the phosphorylation or dephosphorylation of effector proteins by kinases and phosphatases, respectively, and the production of second messengers. Cyclic nucleotides were first recognized as second messengers 40 years ago. Such diverse molecules as (p)ppGpp, Ca2+, inositol triphosphate, and diacylglycerol have also been recognized as second messengers since then. The cyclic nucleotides adenosine 3',5'-cyclic monophosphate
(cAMP) and guanosine 3',5'-cyclic monophosphate (cGMP) are key universal second messengers, mediating cellular functions in organisms as phylogenetically diverse as Escherichia coli and Homo
sapiens. Intracellular concentrations of cyclic nucleotides (cNMPs)
are controlled by regulation of their relative rates of synthesis, excretion, and degradation (Botsford and Harman 1992

cNMP-Binding Proteins

The cyclic nucleotide-binding proteins identified in prokaryotes
consist of a small group of orthologous cAMP
receptor proteins (CRP) present only in
gram-negative bacteria of the gamma subdivision of the Proteobacteria
(Botsford and Harman 1992

Three functional classes of cyclic nucleotide-binding proteins have
been described in eukaryotes: kinases, channels, and guanine nucleotide
exchange factors (GEFs). Cyclic nucleotide-dependent kinases have long
been considered the primary effectors that mediate cellular responses
to changes in intracellular cNMP concentrations (for review, see
Francis and Corbin 1994

All of these cNMP-binding proteins (cAK and cGK, cNMP-regulated
channels, cAMP-GEFs, and CRP) share sequence homology in their cyclic
nucleotide-binding domains, suggesting that they share structural
similarity (Shabb and Corbin 1992

Nucleotide Cyclases

Bârzu and Danchin (1994)

We chose to examine the cNMP-binding proteins and nucleotide cyclases
using recently developed Bayesian algorithms for multiple sequence
alignment and database searching (PROBE; Neuwald et al. 1997

cNMP-Binding Proteins

The seed sequence used with PROBE to identify the cNMP-binding
protein superfamily was Streptomyces griseus P3
(gi|1196910), a sporulation-specific, putative cNMP-binding
protein (J. Kwak, L.A. McCue, K. Trczianka, and K.E. Kendrick, in
prep.). After partial and duplicate sequences were removed, 207 sequences were left in the superfamily sequence set. The superfamily
model consisted of three motifs, shown as sequence logos in Figure
1. Included in the model were five strongly conserved
glycines (Fig. 1: motif 1, positions 16, 23, and 35; and motif 2, positions 12 and 16) believed to be important for integrity of the

This superfamily consisted of many known cNMP-binding proteins, including eukaryotic cAKs and cGKs, cNMP-gated and cNMP-modulated channels, prokaryotic CRP proteins, and a putative cAMP-GEF. Also included in this superfamily were several prokaryotic transcription regulatory proteins (e.g., FNR, nitrogen fixation regulatory (FIXK), and nitrogen control regulatory (NTCA) proteins) that are probable paralogs of CRP that do not bind a cNMP, and several hypothetical sequences from the deduced proteomes of Mycobacterium tuberculosis, Synechocystis PCC6803, C. elegans, and others.

Classification was started by first randomly dividing the cNMP-binding protein superfamily into seven classes, allowing for the six known types of cNMP-binding proteins described in the previous paragraph plus an extra class. PROBE was then used to multiply align the sequences in each class at a purge cutoff of 200. Classifier was applied to these classes and their models for a total of 16 sampling iterations, during which PROBE was called seven times (after every two sampling iterations of Classifier). With each call to PROBE, the purge cutoff value was incremented by 50, up to a maximum of 500. At convergence, seven classes remained with seven distinct models; the final models for the classes were made using PROBE at a purge cutoff of 500.

Classification of this superfamily identified the similarities and differences between classes, identifying motifs unique to individual classes. Figure 2 is a schematic representation of the motifs present in each class. A cNMP-binding domain or
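
The alternation between Classifier sampling and PROBE realignment described above can be illustrated with a short scheduling sketch. This is not the authors' code: the function name `purge_schedule` and its parameters are hypothetical, and it assumes that the cutoff is incremented before each PROBE call and that the calls fall after sampling iterations 2, 4, ..., 14. The text does not state either detail explicitly.

```python
def purge_schedule(start=200, step=50, maximum=500,
                   sampling_iterations=16, probe_every=2, probe_calls=7):
    """Return (sampling_iteration, purge_cutoff) for each realignment call.

    Assumption: the cutoff is raised by `step` before every call and is
    capped at `maximum`; the defaults mirror the numbers quoted in the text.
    """
    schedule = []
    cutoff = start
    for iteration in range(1, sampling_iterations + 1):
        due = iteration % probe_every == 0
        if due and len(schedule) < probe_calls:
            cutoff = min(cutoff + step, maximum)
            schedule.append((iteration, cutoff))
    return schedule


if __name__ == "__main__":
    for iteration, cutoff in purge_schedule():
        print(f"after sampling iteration {iteration}: realign at purge cutoff {cutoff}")
```

Under these assumptions the cutoff reaches its maximum of 500 on the sixth call and stays there, which is consistent with the final class models being built at a purge cutoff of 500.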
Class 1 consisted of the cNMP-dependent kinases from many eukaryotic
species, with the class model including regions spanning two
cNMP-binding domains. Class 2 contained known cNMP-gated and cNMP-modulated channels, as well as several putative channels of
unknown function and regulation from a variety of eukaryotes. The
channel class model includes the cNMP-binding domain and a region
believed to form the channel pore (Finn et al. 1996

Class 4 consisted of sequences from gram-positive as well as
gram-negative eubacteria and contained the FNR, FIXK, and NTCA-type prokaryotic transcription regulators. The model for this class spans
the regions important for

The last two classes contained many hypothetical protein sequences that
have entered the database as the result of genome-sequencing projects.
The models for these classes included only their cNMP-binding domains
and encompassed several subtle variations in sequence conservation in
this region between the classes. Class 6 consisted of 23 sequences from
prokaryotes, mainly cyanobacteria, as well as lower eukaryotes (C. elegans and fungi). Included in this class was a C. elegans protein that is a putative cAMP-GEF (Kawasaki et al.
1998

Nucleotide Cyclases

The seed sequence for the cyclase superfamily was M. tuberculosis Rv1625c (gi|2113909), a putative adenylyl cyclase
(Cole et al. 1998
This superfamily included nucleotide cyclases from class III as
described by Bârzu and Danchin (1994)

The classification was started by first randomly dividing the cyclase
superfamily into six classes, allowing for the five known types of
class III cyclases described in the previous paragraph plus an extra
class. PROBE was then used to multiply align each class at a purge
cutoff of 200. Classifier was applied to these classes and their models
as above

Class 1 consisted primarily of the eukaryotic integral membrane adenylyl cyclases, but also included two M. tuberculosis sequences, Rv1625c and Rv2435c. The model for this class included only regions from the cytoplasmic domains (C1 and C2) that form the catalytic region of the mammalian adenylyl cyclases. The catalytic asparagine and arginine were well conserved in this class, as was the aspartate that interacts with adenine (Fig. 4A, positions 30, 34, and 23, respectively).

Class 2 contained both the

Class 3 was the largest class and included known and predicted
receptor-type guanylyl cyclases from a variety of eukaryotes. Twenty-two proteins encoded by the C. elegans genome belong to this class, in agreement with the results of Yu et al. (1997)

The majority of the prokaryotic cyclases belonged to the remaining two
classes. Class 4 contained sequences from Treponema, Stigmatella, mycobacterial, and cyanobacterial species. The
class model consisted of motifs spanning only the cyclase catalytic domain. Figure 4D shows that the catalytic asparagine and arginine (positions 12 and 16) were well conserved in this class. Interestingly, class 4 exhibited strong conservation of a threonine residue at position 5 in Figure 4D

Class 5 contained several M. tuberculosis sequences, cyclases from several other eubacteria, the receptor-type adenylyl cyclases from protozoa, and the fungal adenylyl cyclases. This class model also consisted of motifs spanning only the cyclase catalytic domain, with conservation of the catalytic asparagine and arginine (Fig. 4E, positions 11 and 15), but a somewhat reduced conservation at the position corresponding to the residue presumed to interact with the substrate purine (Fig. 4E, position 4).

The inclusion of M. tuberculosis Rv1625c and Rv2435c in the
eukaryotic adenylyl cyclase class (class 1) prompted us to analyze these protein sequences further. BLAST results of Rv1625c against the
SwissProtPlus database using the PAM70 matrix revealed that the most
significant hits were to eukaryotic adenylyl and guanylyl cyclases
(Fig. 5A). The proteins with
alignments having the highest reported bit value scores (soluble
guanylyl cyclase subunits from Manduca sexta and Rattus
norvegicus) had only a single block of homology with Rv1625c. The
human adenylyl cyclase type VIII (CYA8), however, had two separate
blocks of homology with Rv1625c (Fig. 5A), making it a highly
significant hit when the combined bit value score of the two regions of
homology is considered. Alignment of these two sequences (Rv1625c and
human CYA8) using the Bayes aligner (Zhu et al. 1998
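
The point about combined bit scores can be made concrete with a small sketch that groups the blocks of homology by subject sequence and sums their bit scores. The function name `combined_bit_scores`, the subject labels, and every number below are invented for illustration and are not taken from Figure 5A. Simple addition is also only an approximation: BLAST's own sum statistics for multiple blocks are more involved.

```python
from collections import defaultdict


def combined_bit_scores(blocks):
    """Rank subjects by the sum of bit scores over their blocks of homology.

    `blocks` is an iterable of (subject_id, bit_score) pairs, one pair per
    separate block of homology with the query.
    Returns (subject_id, combined_score, block_count) tuples, best first.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for subject, bit_score in blocks:
        totals[subject] += bit_score
        counts[subject] += 1
    ranked = [(subject, totals[subject], counts[subject]) for subject in totals]
    return sorted(ranked, key=lambda row: row[1], reverse=True)


# Hypothetical values: one subject with a single strong block,
# and one subject with two weaker blocks.
example_blocks = [
    ("soluble_GC_subunit", 60.0),
    ("CYA8", 45.0),
    ("CYA8", 40.0),
]

if __name__ == "__main__":
    for subject, score, n_blocks in combined_bit_scores(example_blocks):
        print(f"{subject}: combined bit score {score} over {n_blocks} block(s)")
```

With these made-up values, the subject with two blocks outranks the subject with the single highest-scoring block, which is the situation the text describes for human CYA8.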

Phylogenetic Distribution

In the process of identifying and classifying the cNMP-binding
protein and nucleotide cyclase superfamilies, we observed that a number
of the cNMP-binding proteins did not belong to the known functional
classes, and formed new classes with only the cNMP-binding motif in
common (classes 6 and 7). We also noted that some species had a large
number of nucleotide cyclases. To further investigate the potential
functions of these proteins and their phylogenetic distribution, we
examined the cNMP-binding proteins and cyclases in our superfamilies,
with respect to predicted function, cellular localization, and species
(Table 1).

cNMP-BINDING PROTEINS

The proteins from our cNMP-binding protein superfamily were tabulated according to known or predicted function. The majority of the eukaryotic proteins were proteins of known function or shared clear homology to the cAKs and cGKs or the cNMP-regulated channels. The majority of the prokaryotic proteins were also proteins of known function or with clear homology to transcriptional regulatory proteins of the CRP/FNR family. As expected during classification, these proteins formed classes 1-5. We performed BLAST searches and Pfam domain searches to determine putative functions for the several hypothetical proteins that are members of classes 6 and 7, to reveal whether there may be additional functional classes of cNMP-binding proteins for which there were too few members to form a separate class during our classification procedure. Putative cNMP-regulated functions that were identified were cAMP-GEF, ABC-transporter subunits, antibiotic efflux translocases, and esterases. We also identified protein sequences in eubacteria and Arabidopsis thaliana of <200 amino acids that each contain a single cNMP-binding domain spanning virtually the entire protein sequence.

NUCLEOTIDE CYCLASES

The proteins from our nucleotide cyclase superfamily were tabulated according to known or predicted cellular localization and nucleotide specificity, because there exists a considerable amount of data concerning these characteristics for the eukaryotic cyclases. When unknown, the nucleotide specificity of eukaryotic cyclases was predicted based on the data of Tucker et al. (1998)

M. TUBERCULOSIS AND SYNECHOCYSTIS

Among the prokaryotes, M. tuberculosis H37Rv and Synechocystis PCC6803 each seemed to encode a relatively large number of cNMP-binding proteins and nucleotide cyclases. To compare these species with other prokaryotes using an unbiased sample set, we examined the prokaryotes (eubacteria and archaea) with completed genome sequences.
There are currently 22 completely sequenced prokaryotic genomes, and the predicted proteome of each of these is available from the National Center for Biotechnology Information (NCBI). We constructed a sequence set of these 22 proteomes (41,908 total sequences), and scanned (Neuwald et al. 1995

cNMP-Binding Proteins Superfamily

The majority of sequences in the cNMP-binding protein superfamily
classified with one of the known functional classes (kinases, channels,
and transcriptional regulators). These functional classes had motifs
common to the superfamily (cNMP binding pocket or

First, class 1 contained both cAKs and cGKs. cAKs are heterotetramers composed of regulatory subunits (represented in class 1) and catalytic subunits, whereas the cGKs are homodimers that have the kinase active site and cGMP regulatory sites on the same polypeptide. Therefore, we expected that the cAK regulatory subunits would form a separate class from the cGKs, which would contain a kinase domain. This did not happen, however, likely due to the relatively few cGK sequences, which represent only a few species, currently in the database. Such a small group of sequences does not provide enough data to form a separate class using our methods, particularly when they are as highly homologous as the available cGK sequences. We also expected that the cNMP-gated channels and the cNMP-modulated channels could form separate classes. Instead, the two channel classes that formed separated a small group of closely related plant sequences (class 3) from other eukaryotic sequences (class 2). It is likely that the sequence signals that distinguish the gated channels from the modulated channels are very subtle, and require more data for identification.

Classes 4 and 5 separated the FNR-type from the CRP-type transcription
regulators. These proteins apparently have a similar

Of particular interest were the putative cNMP-binding proteins detected in several species that did not classify with any of the known functional classes, and for which no function has yet been predicted. These proteins formed classes 6 and 7, and the proteins within these classes shared only the cNMP-binding domain. One protein was a likely cAMP-GEF from C. elegans. Because this one cAMP-GEF was the only entry in the database at the time, there were not enough data for a class of cAMP-GEFs to have formed during our classification. Using database searches, we determined that there are likely to be additional functional classes of cNMP-binding proteins that have not yet been described, and had too few entries in the database to form a class using our methods. Among these were sequences that appear to contain only the cNMP-binding motif, which spans the majority of the protein sequence (Table 1). It is possible that these proteins are prokaryotic and plant regulatory subunits of cNMP-dependent kinases, regulatory subunits of some other protein complex, or that they function to sequester cNMPs. Also, among "other" functions of prokaryotes in Table 1 were proteins with close homologs in several species (members of the conserved hypotheticals), indicating that there are conserved functions in prokaryotes, yet to be elucidated, that are likely regulated by cNMPs.

Nucleotide Cyclase Superfamily

Whereas previous classifications of nucleotide cyclases have focused
on protein topology, cellular localization, and substrate specificity,
the classification presented here relied on subtle differences in the
residues surrounding the cyclase active site, as well as the presence
of unique motifs for the two classes of eukaryotic guanylyl cyclases.
The differences in residue conservation of classes 1-3, illustrated in
Figure 4A-C, reflected what is currently known about substrate
specificity and the catalytic mechanism of the mammalian cyclases,
specifically that (1) an aspartate or cysteine residue (marked with + in Fig. 4) contributes to specificity for adenine or guanine,
respectively, and (2) that the soluble guanylyl cyclases act as
heterodimers, requiring the presence of the catalytic asparagine and
arginine (marked with * in Fig. 4) on only the

Our classification identified two classes of prokaryotic nucleotide cyclases: a class that we hypothesize may represent prokaryotic guanylyl cyclases (class 4), and a class that likely represents prokaryotic and unicellular eukaryotic adenylyl cyclases (class 5). The class models for classes 4 and 5 spanned only the cyclase active site; therefore, the simplest explanation for the separation of prokaryotic cyclases into two classes is that the adenylyl and guanylyl cyclases formed separate classes due to differing conservation of the residues conferring substrate specificity. This view may be overly simplistic, however, as it is based on a relatively small number of prokaryotic cyclase sequences available during this study. When compared to the Pfam database, many of the sequences belonging to class 4 also contained various signal transduction-type domains, including GAF (cGMP phosphodiesterase, adenylyl cyclase, and FhlA), PAS (per, arnt, and sim), FHA (forkhead-associated), and response regulator receiver domains (http://pfam.wustl.edu/; data not shown), suggesting novel modes for regulating the activity of these prokaryotic cyclases.

When tabulating this superfamily, the predicted cellular localization
and nucleotide specificity of the eukaryotic cyclases in Table 1
conformed to experimental observations. All of the cyclases from
multicellular eukaryotes belonged to one of the previously identified
groups: (1) integral membrane adenylyl cyclases with 12 transmembrane
helices and 2 cytoplasmic domains, (2) receptor-type guanylyl
cyclases, and (3) cytoplasmic guanylyl cyclases. The sequences from
single-celled eukaryotes were previously identified or predicted adenylyl cyclases (see Table 1) that have been described (Tang and
Hurley 1998

However, a significant number of proteins among the prokaryotic
cyclases were predicted to be integral membrane cyclases, with topology
similar to the mammalian adenylyl cyclases (six transmembrane helices
and a single cytoplasmic domain), and receptor-type cyclases.
Prokaryotic receptor-type nucleotide cyclases have been identified
previously only in cyanobacteria (Katayama and Ohmori 1997

Archaea

The phylogenetic distribution of both the cNMP-binding protein and cyclase superfamilies indicates an early origin for these proteins, perhaps before the evolutionary separation of the eubacteria from the eukaryotes. Also, the absence of archaeal proteins in our superfamilies suggests either that these proteins were lost from the archaea or evolved after the separation of the archaea from the eubacteria and eukaryotes.

The lack of nucleotide cyclases (class I and class III) and
cNMP-binding proteins in the archaea suggests that the archaea either
do not use cNMPs as second messengers or produce and bind cNMPs by
mechanisms different from those described here. Mechanisms for the
production of cNMPs that are unrelated to the class III cyclases have
been described. The class I cyclases of the gamma Proteobacteria
(Bârzu and Danchin 1994

Given the results of Sismeiro et al. (1998)

M. tuberculosis

M. tuberculosis and Synechocystis both appear to have an unparalleled number of putative cNMP-binding proteins, although Synechocystis encodes relatively few cyclases. The functions of only half of these cNMP-binding proteins could be predicted by homology (Table 1). The large number of cNMP-binding proteins in both M. tuberculosis and Synechocystis suggests a previously unappreciated importance of cNMPs to these species and perhaps to other eubacteria.

The M. tuberculosis proteins in the cyclase superfamily were of particular interest for several reasons: the large number (15) of M. tuberculosis proteins in this superfamily, the presence of predicted cytoplasmic (9), receptor-type membrane bound (1), and integral membrane (5) cyclases, and two M. tuberculosis proteins (Rv1625c and Rv2435c) classified with the multicellular eukaryotic adenylyl cyclases during our classification of this superfamily. The large number of putative cyclases in M. tuberculosis
implies that this organism may have the ability to sense and respond to
many intracellular and extracellular signals through the cNMP second
messenger system, perhaps in a manner similar to eukaryotic cyclases.
M. tuberculosis encodes a number of putative cytoplasmic cyclases, which could respond to intracellular signals in a manner similar to the eukaryotic soluble guanylyl cyclases (nitric oxide) or
the class I cyclases (nutrient availability). M. tuberculosis also encodes a putative receptor-type cyclase, similar in topology to
the eukaryotic receptor guanylyl cyclases, implying the ability to
sense an extracellular signal. The extracellular domain of this protein
(Rv2435c) has homology to a chemotaxis receptor in Desulfovibrio
vulgaris for which the ligand is unknown (Deckers and Voordouw
1996

Although it is not known whether an M. tuberculosis cyclase
activity is necessary for pathogenesis, it has been reported that macrophages with ingested mycobacteria have increased levels of cAMP
and that phagosome-lysosome fusion is impaired (Lowrie et al. 1975

Phylogenetic Analysis of the Cyclases

Phylogenetic analysis of the nucleotide cyclases gave results similar to those of our classification, showing five major groups of cyclases and, in particular, placing the M. tuberculosis proteins Rv1625c and Rv2435c on a branch with the eukaryotic adenylyl cyclases. Our alignment of Rv1625c with a human cyclase further illustrates the surprising degree of sequence similarity between these proteins from such distant organisms. These results suggest a possible horizontal transfer event, an event that could have given an ancient mycobacterium a survival advantage as mycobacterial species were becoming pathogens of eukaryotes. Preliminary sequence results indicate that an Rv1625c ortholog is present in other pathogenic mycobacterial species, supporting the notion that a common ancestor of pathogenic mycobacteria acquired the gene. Unfortunately, there are no current genome sequencing projects for nonpathogenic mycobacterial species; therefore, we could not confirm the absence of an Rv1625c ortholog in any of these species to support our hypothesis.

The sequence databases have been expanding rapidly in recent years, and
with the currently ongoing genome sequencing
projects
Database mining and multiple sequence alignments were performed
with PROBE (Liu et al. 1999

Classification of the superfamily sequences was performed with the
Bayesian sequence Classifier (Qu et al. 1998

A jackknife procedure was used to detect false-positive members of the
superfamilies. After classification of a superfamily, the members of a
class were removed from the superfamily, the remaining members of the
superfamily aligned using PROBE (with a purge value of 150), and the
resulting model used to scan the nr database. A class was considered to
contain false positives and discarded if at least one of the class
members was not detected by this reduced model (E-value

The cellular localization of hypothetical proteins was predicted using
TMHMM, which uses hidden Markov models to predict transmembrane helices
(Sonnhammer et al. 1998

Pairwise sequence alignment was performed with the Bayes aligner (Zhu
et al. 1998

Phylogenetic trees were constructed using PHYLIP (Felsenstein 1993
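
The jackknife test for false-positive class members described above can be outlined in a few lines. This is a sketch of the logic only, not the software used in the study: `jackknife_filter`, `build_model`, and `scan_database` are hypothetical names standing in for the PROBE alignment (purge value 150) and the scan of the nr database. The E-value threshold is left as a required parameter because its value is not recoverable from the text, and the toy data and the cutoff used in the demonstration are invented.

```python
def jackknife_filter(classes, build_model, scan_database, e_cutoff):
    """Keep only classes whose members are all found by a reduced model.

    classes:       dict mapping class name -> list of sequence identifiers
    build_model:   callable(list of identifiers) -> model (stand-in for alignment)
    scan_database: callable(model) -> dict of identifier -> E-value for hits
    e_cutoff:      detection threshold (assumed; not given in the text)
    """
    kept = {}
    for name, members in classes.items():
        # Remove the class under test and model the rest of the superfamily.
        remaining = [seq for other, seqs in classes.items()
                     if other != name for seq in seqs]
        e_values = scan_database(build_model(remaining))
        # A single undetected member marks the whole class as false positive.
        if all(e_values.get(m, float("inf")) <= e_cutoff for m in members):
            kept[name] = members
    return kept


# Toy stand-ins so the sketch runs end to end (all values invented).
toy_classes = {"class_a": ["a1", "a2"], "class_b": ["b1", "b2"], "class_c": ["x1"]}


def toy_build_model(sequences):
    return frozenset(sequences)


def toy_scan(model):
    # Pretend every reduced model detects the a* and b* sequences but never x1.
    return {seq: 1e-10 for seq in ("a1", "a2", "b1", "b2")}


kept_classes = jackknife_filter(toy_classes, toy_build_model, toy_scan, e_cutoff=1e-3)
```

In the toy run, `class_c` is discarded because its only member is not detected once that class is left out of the model, while the other two classes are retained.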
We thank J. Kwak and K.E. Kendrick for bringing our attention to S. griseus P3 and cNMP-binding proteins, and for helpful discussions. We also thank the Computational Molecular Biology and Statistics Core at the Wadsworth Center for assistance throughout this project and Michael Palumbo for assistance with the jackknife test. This research was supported by National Institutes of Health grant 5RO1-HG0125703 to C.E.L. Data sets and sequence alignments of the superfamilies and classes are available at http://www.wadsworth.org/resnres/bioinfo/. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
1 Corresponding author.
E-MAIL lawrence{at}wadsworth.org; FAX (518) 473-2900.