|
|
|
|
|
Vol. 10, Issue 8, 1074-1077, August 2000
INSIGHT/OUTLOOK
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |
ARTICLE |
|---|
|
|
|---|
The genome sequences churned out by the "genomic
revolution" have challenged both computational and experimental
biologists to come up with new methods to decipher the secrets of the
encoded proteins. The experimental biologists have largely concentrated on a variety of large-scale methods to assay gene expression and protein-protein interactions (Brown and Botstein 1999
; Uetz et al.
2000
; Walhout et al. 2000
). The computational biologists, however, have
deeply mined the genomes for evolutionary information in the form of
homology between genes (Tatusov et al. 1997
; Koonin et al. 2000
;
Ponting et al. 2000
,). Over the past few years there has been
increasing interest in the kinds of information that exist in the
context in which a protein or a domain thereof is encoded in the genome
(Mushegian and Koonin 1996
; Dandekar et al. 1998
). Recently, contextual
information has been offered as a strong handle on the problem of in
silico inference of protein function (Enright et al. 1999
; Marcotte et
al. 1999a
,b
; Overbeek et al. 1999
; Pellegrini et al. 1999
; Huynen et
al. 2000
). Understanding the scope and limitations of the use of these
methods may be critical for the experimental biologists seeking to use
computational guidelines for large-scale investigations of protein
function. Here, we outline the recent advances in this direction and briefly
illustrate the new leads they provide in understanding protein function.
Contextual information comes in several overlapping grades, each with a
different degree of specificity with regards to a particular protein's
role (Fig. 1). The most general form of contextual information is a phyletic profile, that is, the pattern of occurrence of orthologs of a particular gene in a set of genomes under comparison (Pellegrini et al. 1999
; Tatusov et al. 2000
). In this setup, the null
hypothesis would be that genes that functionally interact in a
particular pathway or complex would share a similar phyletic profile.
This hypothesis is supported by the phyletic distribution of components
of the core cellular machinery
the translation, transcription, and
replication complexes, which interact very tightly
as well as those of
metabolic pathways. For example, most of the proteins with a shared
phyletic pattern between the archaea and the eukaryotes are components
of one of the many protein complexes that have a role in the
above-stated three-core cellular processes. Thus, the detection of
uncharacterized proteins, such as the family typified by MJ0586 from
Methanococcus jannaschii, with a similar phyletic profile
would implicate them in one of the core functions (Fig. 1). When the
information from sequence homology is applied to these proteins, one
can often arrive at rather precise functions for these proteins. In the
case of the MJ0586-like proteins, sequence comparisons reveal that they
have a DNA binding helix-turn-helix domain (Aravind and Koonin 1999
),
suggesting that it is a component of the basal transcription machinery
similar to TFIIB or TBP, which share the same phyletic profile (Fig.
1). This inference is compatible with the recent implication of the
eukaryotic representative of this family MBF1 in transcriptional
regulation (Kabe et al. 1999
). When a rare shared phyletic pattern is
seen for certain proteins whose sequence affinities suggest a related
function, a strong case can be made for their interaction. One example
of this is the typically eukaryotic chromatin protein
methyltransferase
the SET domain that is seen in the bacteria
Chlamydiae and Bordatella pertussis, along with another
eukaryotic chromatin protein domain, SWIB (Stephens et al. 1998
). The
rare phyletic profile of these proteins in bacteria suggests that the
SET and SWIB domains probably interact not only in these bacteria but
possibly also in other organisms.
|
Since the pioneering works of Jacob and Monod, scientists have realized
that functionally linked genes are coregulated and occur in proximity
to each other on the chromosome. Genome comparisons have supported this
and show that, in prokaryotes, functionally interacting genes are in
clusters that range from the giant ribosomal operons to gene pairs that
survive over large phylogenetic distances (Dandekar et al. 1998
;
Overbeek et al. 1999
). Thus, the occurrence of an uncharacterized gene
in the neighborhood (the same operon) of genes with known functions
could potentially betray its function. However, the variability of
operons and the inability to predict them with a high level of
certainty causes a loss of specificity of this form of contextual
information. On this issue, Huynen et al. argue that greater stringency
in the criteria for gene neighborhood reduces false positives in these
inferences, with physical interactions between gene products strongly
predicted by conservation of gene order in an operon. Furthermore, as
the number of gene neighborhood combinations are far from exhausted with the currently available genomes, this method is likely to improve
in its scope and confidence with the availability of more genomes in
the future. While establishing functional interactions between the
proteins, the operon conservation approach has a deep predictive value
only for protein complexes that function in metabolism and core
processes. This is mainly because the functions of the uncharacterized
proteins can be interpreted in light of a fairly well-known biochemical
pathway. There are several conserved operons that encode proteins such
as histidine kinases and chemotactic receptors and
regulatory
kinases and phosphatases that are likely to physically interact in
distinct signaling pathways. Contrary to the metabolic- and
core-function proteins, in most such cases no specific biological role
can be assigned for these predicted signaling pathways (Fig. 1).
Protein domain fusions provide another form of information that could help in unraveling uncharacterized proteins or domains. Here the ability to extract useful information is very contingent on the proper classification of domain fusions. In evolutionary terms domain fusions can be classified as follows.
First, there are the ancient fusions that have occurred in the common
ancestor of a major lineage of organisms, such as the domains
comprising the
-subunit of the bacterial DNA polymerase III. In this
case the protein is made up of a phosphoesterase domain, a DNA
polymerase region, and two DNA-binding modules
a Helix-hairpin-Helix
(HhH) and an OB domain (Fig. 1). These domains by themselves provide
information only in the context of the DNA polymerase, rather than
automatically extending to the other proteins in which they are found,
such as the histidinol phosphatase that also has the same
phosphoesterase domain. Thus, they are of limited value in predicting
potential interaction partners of these proteins.
Second are the recent fusions that have occurred in the more terminal
branches of the tree of life (though they may show a secondary
dispersion through horizontal transfer). The second class of recent
fusions may further be divided into those driven by evolutionarily
mobile domains (`promiscuous domains'; Marcotte et al. 1999a
) and
gene fusions that arise from fusions of genes that typically occur
separately in several other genomes. A good number of the commonly
encountered mobile domains, which include repetitive modules such as
-propellers and
-
superhelices, function in numerous
distinct protein-protein interactions in different cellular
compartments and are of minimal value for predicting specific
interactions. Other fusions between mobile domains such as helicases
and nucleases illustrate a general tendency for these domains to
physically interact (Aravind et al. 1999
), but this does not imply that
all related nucleases and helicases interact. Some of the mobile
domains such as the Death domains in animal apoptotic signaling mediate
homophilic interaction, while cysteine knot domains often interact with
extracellular leucine-rich repeat domains. Hence, the presence of such
domains in proteins can often narrow down, if not specify, the set of
interaction partners in a given genome. Other mobile domains such as
the HhH module or the ACT domain act as tags specifying interactions
with nucleic acids and small molecules, respectively, and predict the
protein's functional milieu to some extent. Unlike the wide spectrum
of relatively imprecise predictive information offered by the mobile domain fusions, the relatively recent gene fusions are generally direct
indicators of physical interactions between the proteins encoded by
separate versions of these genes (Marcotte et al. 1999a
). Such fusions
are typical in proteins that are parts of the same complex or catalysts
of adjacent reactions in pathways (Fig. 1) and that are, in
prokaryotes, extreme cases of gene order conservation in operons. While
this form of fusion has a high predictive value distinguishing it from
other the forms, fusions may not necessarily be straightforward in the
absence of a robust list of orthologs (Galperin and Koonin 2000
).
What is the total amount of contextual information with all its
overlapping grades that is available for a particular genome? Huynen et
al. (2000)
determine this for the smallest cellular genome available to
us, that of Mycoplasma genitalium. They show that such
information can be gleaned for up to 50% of the genes with gene order
conservation contributing the maximum information and gene fusions,
phyletic profiling, and co-occurrence in operons contributing much
smaller, roughly equivalent amounts of information. The observation
that gene order conservation provides most of the contextual
information has serious implications for the generality of these
methods because the operonic organization of genes is a hallmark of the
prokaryotes. A large part of the critical regulatory interactions of
eukaryotic proteins, especially typical of multicellular forms, cannot
avail itself of this form of contextual information. A partial solution
to this problem is the use of coexpression, as determined through
transcription profiling, as an additional form of nongenomic contextual
information in the eukaryotes (Marcotte et al. 1999b
). While expression
profiles are far from being comparable to operons in their predictive
specificity, multiple combined approaches that additionally use the
results of large-scale genetic screens and mass spectroscopy-based
characterization of protein complexes could increase the amount of
context information in eukaryotic systems.
Finally, it can be seen that leads into fundamental biological
processes such as translation may be derived from contextual methods if
they are suitably combined with robust homology-based characterization
of the concerned proteins. Different forms of contextual information
have pointed out the role of at least two GTPases, one of the TdhF
family (Huynen et al. 2000
) and another of the OBG1 family fused with a
predicted RNA binding TGS domain (Galperin and Koonin 2000
). They are
likely to participate in translation, suggesting as yet undiscovered
GTP-dependent steps. Using similar techniques, different enzymes,
namely at least two methyltransferases (MG259 and MG347; Huynen et al.
2000
) and one Rossmann fold oxidoreductase of the GidA family (Marcotte
et al. 1999b
), have been implicated in translation-related functions in
bacteria and mitochondria. Both MG259 and MG347 are similar to RNA
methylases, and least one of them (MG259) has signatures typical of
nucleic acid adenine methylases. This, taken together with the link to
translation, implies that they are involved in methylation (MG259 and
MG347) and base modification (GidA) of either tRNA or rRNA. Thus, in
conjunction with appropriate homology studies, `guilt by association'
can be converted to clear-cut hypotheses that are testable with
specific experiments.
Experimental methods such as high-throughput two-hybrid screens (Uetz
et al. 2000
; Walhout et al. 2000
) suffer from false positives arising
because of interactions among proteins from different subcellular
locations. In such cases, the contextual information could effectively
cut down the search space for interacting partners and thereby improve
the chances of finding biologically relevant interactions. With genome
sequence data pouring in, these computational approaches are likely to
go hand in hand with large-scale experimental efforts in the conquest
of unchartered biological systems.
| |
FOOTNOTES |
|---|
1 E-MAIL aravind{at}ncbi.nlm.nih.gov; FAX (301) 480-9241.
| |
REFERENCES |
|---|
|
|
|---|
This article has been cited by other articles:
![]() |
K. Pfluger and V. de Lorenzo Evidence of In Vivo Cross Talk between the Nitrogen-Related and Fructose-Related Branches of the Carbohydrate Phosphotransferase System of Pseudomonas putida J. Bacteriol., May 1, 2008; 190(9): 3374 - 3380. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Y. Han, C. Z. Cai, Z. L. Ji, Z. W. Cao, J. Cui, and Y. Z. Chen Predicting functional family of novel enzymes irrespective of sequence similarity: a statistical learning approach Nucleic Acids Res., December 7, 2004; 32(21): 6437 - 6444. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. M. Iyer, K. S. Makarova, E. V. Koonin, and L. Aravind Comparative genomics of the FtsK-HerA superfamily of pumping ATPases: implications for the origins of chromosome segregation, cell division and viral capsid packaging Nucleic Acids Res., October 5, 2004; 32(17): 5260 - 5279. [Abstract] [Full Text] [PDF] |
||||
![]() |
C.Z. Cai, L.Y. Han, Z.L. Ji, X. Chen, and Y.Z. Chen SVM-Prot: web-based support vector machine software for functional classification of a protein from its primary sequence Nucleic Acids Res., July 1, 2003; 31(13): 3692 - 3697. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y. Liu, J. Zhou, M. V. Omelchenko, A. S. Beliaev, A. Venkateswaran, J. Stair, L. Wu, D. K. Thompson, D. Xu, I. B. Rogozin, et al. Transcriptome dynamics of Deinococcus radiodurans recovering from ionizing radiation PNAS, April 1, 2003; 100(7): 4191 - 4196. [Abstract] [Full Text] [PDF] |
||||
![]() |
F. D. Urnov Methylation and the Genome: the Power of a Small Amendment J. Nutr., August 1, 2002; 132(8): 2450S - 2456. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. S. Makarova, L. Aravind, N. V. Grishin, I. B. Rogozin, and E. V. Koonin A DNA repair system specific for thermophilic Archaea and bacteria predicted by genomic context analysis Nucleic Acids Res., January 15, 2002; 30(2): 482 - 496. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. E. Bean, W. H. Dvorachek , Jr., E. L. Braun, A. Errett, G. S. Saenz, M. D. Giles, M. Werner-Washburne, M. A. Nelson, and D. O. Natvig Analysis of the pdx-1 (snz-1/sno-1) Region of the Neurospora crassa Genome: Correlation of Pyridoxine-Requiring Phenotypes With Mutations in Two Structural Genes Genetics, March 1, 2001; 157(3): 1067 - 1075. [Abstract] [Full Text] |
||||
![]() |
E. V. Koonin, Y. I. Wolf, and L. Aravind Prediction of the Archaeal Exosome and Its Connections with the Proteasome and the Translation and Transcription Machineries by a Comparative-Genomic Approach Genome Res., February 1, 2001; 11(2): 240 - 252. [Abstract] [Full Text] |
||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||