Vol. 8, Issue 3, 163-167, March 1998
INSIGHT/OUTLOOK
ARTICLE
The ability to accurately predict gene function based on gene sequence
is an important tool in many areas of biological research. Such predictions have become particularly
important in the genomics age in which numerous gene sequences are
generated with little or no accompanying experimentally determined
functional information. Almost all functional prediction methods rely
on the identification, characterization, and quantification of sequence similarity between the gene of interest and genes for which functional information is available. Because sequence is the prime determining factor of function, sequence similarity is taken to imply similarity of
function. There is no doubt that this assumption is valid in most
cases. However, sequence similarity does not ensure identical functions, and it is common for groups of genes that are similar in
sequence to have diverse (although usually related) functions. Therefore, the identification of sequence similarity is frequently not
enough to assign a predicted function to an uncharacterized gene; one
must have a method of choosing among similar genes with different
functions. In such cases, most functional prediction methods assign
likely functions by quantifying the levels of similarity among genes. I
suggest that functional predictions can be greatly improved by focusing
on how the genes became similar in sequence (i.e., evolution)
rather than on the sequence similarity itself. It is well established
that many aspects of comparative biology can benefit from evolutionary
studies (Felsenstein 1985), and comparative molecular biology is no
exception (e.g., Altschul et al. 1989; Goldman et al. 1996). In this
commentary, I discuss the use of evolutionary information in the
prediction of gene function. To appreciate the potential of a
phylogenomic approach to the prediction of gene function, it
is necessary to first discuss how gene sequence is commonly used to
predict gene function and some general features about gene evolution.
Sequence Similarity, Homology, and Functional Predictions
To make use of the identification of sequence similarity between
genes, it is helpful to understand how such similarity arises. Genes
can become similar in sequence either as a result of
convergence (similarities that have arisen without a common
evolutionary history) or descent with modification from a common
ancestor (also known as homology). It is imperative to
recognize that sequence similarity and homology are not interchangeable
terms. Not all homologs are similar in sequence (i.e., homologous genes
can diverge so much that similarities are difficult or impossible to
detect) and not all similarities are due to homology (Reeck et al.
1987; Hillis 1994). Similarity due to convergence, which is likely
limited to small regions of genes, can be useful for some functional
predictions (Henikoff et al. 1997). However, most sequence-based
functional predictions are based on the identification (and subsequent
analysis) of similarities that are thought to be due to homology.
Because homology is a statement about common ancestry, it cannot be
proven directly from sequence similarity. In these cases, the inference of homology is made based on finding levels of sequence similarity that
are thought to be too high to be due to convergence (the exact
threshold for such an inference is not well established).
Improvements in database search programs have made the identification
of likely homologs much faster, easier, and more reliable (Altschul et
al. 1997; Henikoff et al. 1998). However, as discussed above, in many
cases the identification of homologs is not sufficient to make specific
functional predictions because not all homologs have the same function.
The available similarity-based functional prediction methods can be
distinguished by how they choose the homolog whose function is most
relevant to a particular uncharacterized gene (Table 1). Some methods are relatively simple: many
researchers use the highest scoring homolog (as determined by programs
like BLAST or BLAZE) as the basis for assigning function. While highest hit methods are very fast, can be automated readily, and are likely accurate in many instances, they do not take advantage of any information about how genes and gene functions evolve. For example, gene duplication and subsequent divergence of function of the duplicates can result in homologs with different functions being present within one species. Specific terms have been created to distinguish homologs in these cases (Table 2): genes
of the same duplicate group are called orthologs (e.g., β-globin from mouse and humans), and different duplicates are
called paralogs (e.g., α- and β-globin) (Fitch 1970).
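As a minimal sketch of the highest hit approach described above, the following assigns a function from the single best-scoring annotated homolog. The `Hit` record, the gene names, and the scores are hypothetical placeholders rather than output from BLAST or any other program.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    gene: str                  # hypothetical homolog identifier
    score: float               # similarity score from a database search
    function: Optional[str]    # known function, or None if uncharacterized


def highest_hit_function(hits):
    """Return the function of the top-scoring hit that has a known function."""
    annotated = [h for h in hits if h.function is not None]
    if not annotated:
        return None
    return max(annotated, key=lambda h: h.score).function


# Hypothetical search results for one uncharacterized gene.
hits = [
    Hit("geneA_species1", 310.0, "function X"),
    Hit("geneB_species2", 295.0, "function Y"),
    Hit("geneC_species3", 120.0, None),
]
print(highest_hit_function(hits))
```

The rule looks only at the ranking of scores, so a paralog that happens to score slightly higher than the true ortholog determines the prediction.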
Because gene duplications are frequently accompanied by functional
divergence, dividing genes into groups of orthologs and paralogs can
improve the accuracy of functional predictions. Recognizing that the
one-to-one sequence comparisons used by most methods do not reliably
distinguish orthologs from paralogs, Tatusov et al. (1997) developed
the COG clustering method (see Table 1). Although the COG method is
clearly a major advance in identifying orthologous groups of genes, it
is limited in its power because clustering is a way of classifying
levels of similarity and is not an accurate method of inferring
evolutionary relationships (Swofford et al. 1996). Thus, as sequence
similarity and clustering are not reliable estimators of evolutionary
relatedness, and as the incorporation of such phylogenetic information
has been so useful to other areas of biology, evolutionary techniques
should be useful for improving the accuracy of predicting function
based on sequence similarity.
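The ortholog and paralog definitions can be illustrated with a small sketch, assuming a gene tree whose internal nodes have already been labeled as duplication or speciation events. The nested-tuple tree encoding and the gene names are illustrative assumptions, not data from the article.

```python
def leaves(node):
    """Return the gene names at the tips of a subtree."""
    if isinstance(node, str):
        return [node]
    _event, left, right = node
    return leaves(left) + leaves(right)


def classify_pairs(node, out=None):
    """Label each gene pair by the event at its most recent common ancestor.

    Pairs that split at a duplication node are paralogs; pairs that split
    at a speciation node are orthologs.
    """
    out = {} if out is None else out
    if isinstance(node, str):
        return out
    event, left, right = node
    relation = "paralogs" if event == "duplication" else "orthologs"
    for a in leaves(left):
        for b in leaves(right):
            out[frozenset((a, b))] = relation
    classify_pairs(left, out)
    classify_pairs(right, out)
    return out


# Internal nodes are (event, left_subtree, right_subtree); tips are names.
tree = (
    "duplication",
    ("speciation", "alpha_human", "alpha_mouse"),
    ("speciation", "beta_human", "beta_mouse"),
)
relations = classify_pairs(tree)
print(relations[frozenset(("beta_human", "beta_mouse"))])
print(relations[frozenset(("alpha_human", "beta_human"))])
```

The classification depends entirely on the tree and its event labels, which is why a method that infers the tree poorly will also separate orthologs from paralogs poorly.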
Phylogenomics
There are many ways in which evolutionary information can be used
to improve functional predictions. Below, I present an outline of one
such phylogenomic method (see Fig. 1), and I
compare this method to nonevolutionary functional prediction methods.
This method is based on a relatively simple assumption: because gene functions change as a result of evolution, reconstructing the evolutionary history of genes should help predict the functions of
uncharacterized genes. The first step is the generation of a
phylogenetic tree representing the evolutionary history of the gene of
interest and its homologs. Such trees are distinct from clusters and
other means of characterizing sequence similarity because they are
inferred by special techniques that help convert patterns of similarity
into evolutionary relationships (see Swofford et al. 1996). After the
gene tree is inferred, biologically determined functions of the various
homologs are overlaid onto the tree. Finally, the structure of the tree
and the relative phylogenetic positions of genes of different functions
are used to trace the history of functional changes, which is then used
to predict functions of uncharacterized genes. More detail of this
method is provided below.
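The overlay-and-trace steps outlined above can be sketched with a simplified Fitch-style parsimony pass. This is one possible way to trace functional changes on a tree, chosen here for brevity; it is not presented as the specific procedure used in this commentary. The tree encoding (nested pairs, strictly binary), the gene names, and the function labels are all hypothetical.

```python
def bottom_up(node, known):
    """Candidate functions for a subtree, or None if it has no known function."""
    if isinstance(node, str):                    # tip of the tree
        return {known[node]} if node in known else None
    left, right = (bottom_up(child, known) for child in node)
    if left is None:
        return right
    if right is None:
        return left
    # Parsimony rule: keep shared states, otherwise keep all candidates.
    return (left & right) or (left | right)


def predict(node, known, inherited=None, out=None):
    """Top-down pass: uncharacterized tips inherit the state of their ancestors."""
    out = {} if out is None else out
    own = bottom_up(node, known)
    if own is None:
        final = inherited
    elif inherited and (own & inherited):
        final = own & inherited
    else:
        final = own
    if isinstance(node, str):
        if node not in known:
            out[node] = final or set()
    else:
        for child in node:
            predict(child, known, final, out)
    return out


# Gene tree as nested pairs; "u" is the uncharacterized gene.
tree = (("a1", "a2"), ("b1", ("b2", "u")))
known = {"a1": "function X", "a2": "function X",
         "b1": "function Y", "b2": "function Y"}
print(predict(tree, known))
```

The prediction for `u` comes from its position in the tree, not from which characterized gene it resembles most, which is the point of the method.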
Identification of Homologs
The first step in studying the evolution of a particular gene is the identification of homologs. As with similarity-based functional prediction methods, likely homologs of a particular gene are identified through database searches. Because phylogenetic methods benefit greatly from more data, it is useful to augment this initial list by using identified homologs as queries for further database searches or using automatic iterated search methods such as PSI-BLAST (Altschul et al. 1997).
Alignment and Masking
Sequence alignment for phylogenetic analysis has a particular purpose:
it is the assignment of positional homology. Each
column in a multiple sequence alignment is assumed to include amino
acids or nucleotides that have a common evolutionary history, and each column is treated separately in the phylogenetic analysis. Therefore, regions in which the assignment of positional homology is ambiguous should be excluded (Gatesy et al. 1993).
Phylogenetic Trees
For extensive information about generating phylogenetic trees from sequence alignments, see Swofford et al. (1996).
Functional Predictions
To make functional predictions based on the phylogenetic tree, it is necessary to first overlay any known functions onto the tree. There are many ways this "map" can then be used to make functional predictions, but I recommend splitting the task into two steps. First, the tree can be used to identify likely gene duplication events in the past. This allows the division of the genes into groups of orthologs and paralogs (e.g., Eisen et al. 1995).
Is the Phylogenomic Method Worth the Trouble?
Phylogenomic methods require many more steps and usually much
more manual labor than similarity-based functional prediction methods.
Is the phylogenomic approach worth the trouble? Many specific examples
exist in which gene function has been shown to correlate well with gene
phylogeny (Eisen et al. 1995; Atchley and Fitch 1997). Although no
systematic comparisons of phylogenetic versus similarity-based
functional prediction methods have been done, there are a variety of
reasons to believe that the phylogenomic method should produce more
accurate predictions than similarity-based methods. In particular,
there are many conditions in which similarity-based methods are likely
to make inaccurate predictions but which can be dealt with well by
phylogenetic methods (see Table 4).
A specific example helps illustrate a potential problem with
similarity-based methods. Molecular phylogenetic methods show conclusively that mycoplasmas share a common ancestor with low-GC Gram-positive bacteria (Weisburg et al. 1989). However, examination of the percent similarity between
mycoplasmal genes and their homologs in bacteria does not clearly show
this relationship. This is because mycoplasmas have undergone an
accelerated rate of molecular evolution relative to other bacteria.
Thus, a BLAST search with a gene from Bacillus subtilis (a low-GC
Gram-positive species) will result in a list in which the mycoplasma
homologs (if they exist) score lower than genes from many species of
bacteria less closely related to B. subtilis. When amounts or
rates of change vary between lineages, phylogenetic methods are better able to infer evolutionary relationships than similarity methods (including clustering) because they allow for evolutionary branches to
have different lengths. Thus, in those cases in which gene function
correlates with gene phylogeny and in which amounts or rates of change
vary between lineages, similarity-based methods will be more likely
than phylogenomic methods to make inaccurate functional predictions
(see Table 4).
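The effect of unequal rates can be shown with a toy calculation. The distances below are hypothetical and were chosen to fit a tree in which `query` and `fast_sister` are closest relatives but `fast_sister` sits on a long branch. The tree-aware criterion used here is the neighbor-joining Q value, picked as one familiar example of a method that allows branches of different lengths; it is an illustrative choice, not a method prescribed by this commentary.

```python
def make_matrix(pairs):
    """Build a symmetric distance lookup from {(a, b): distance}."""
    dist = {}
    for (a, b), d in pairs.items():
        dist.setdefault(a, {})[b] = d
        dist.setdefault(b, {})[a] = d
    return dist


def closest_by_distance(dist, query):
    """Similarity-style answer: the gene with the smallest raw distance."""
    others = sorted(g for g in dist if g != query)
    return min(others, key=lambda g: dist[query][g])


def nj_partner(dist, query):
    """Tree-style answer: the gene minimizing the neighbor-joining Q value."""
    n = len(dist)
    total = {g: sum(dist[g].values()) for g in dist}
    others = sorted(g for g in dist if g != query)
    return min(
        others,
        key=lambda g: (n - 2) * dist[query][g] - total[query] - total[g],
    )


# Hypothetical distances: branch lengths of 1 everywhere except an
# accelerated branch of length 8 leading to fast_sister.
dist = make_matrix({
    ("query", "fast_sister"): 9,
    ("query", "slow_1"): 3,
    ("query", "slow_2"): 3,
    ("fast_sister", "slow_1"): 10,
    ("fast_sister", "slow_2"): 10,
    ("slow_1", "slow_2"): 2,
})
print(closest_by_distance(dist, "query"))
print(nj_partner(dist, "query"))
```

Raw distance ranks a slowly evolving, more distantly related gene first, while the Q value corrects for each gene's overall divergence and pairs the query with its true sister.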
Another major advantage of phylogenetic methods over most similarity methods comes from the process of masking (see above). For example, a deletion of a large section of a gene in one species will greatly affect similarity measures but may not affect the function of that gene. A phylogenetic analysis including these genes could exclude the region of the deletion from the analysis by masking. In addition, regions of genes that are highly variable between species are more likely to undergo convergence and such regions can be excluded from phylogenetic analysis by masking. Masking thus allows the exclusion of regions of genes in which sequence similarity is likely to be "noisy" or misleading rather than a biologically important signal. The pairwise sequence comparisons used by most similarity-based functional prediction methods do not allow such masking. Phylogenetic methods have been criticized because of their dependence (for most methods) on multiple sequence alignments that are not always reliable and unbiased. However, multiple sequence alignments also allow for masking, which is probably more valuable than the cost of depending on alignments.
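A minimal sketch of masking, assuming a deliberately simple rule: drop every alignment column in which the fraction of gapped sequences exceeds a threshold. Real masking also removes columns whose positional homology is ambiguous for other reasons. The function name, the threshold parameter, and the toy sequences are hypothetical.

```python
def mask_alignment(seqs, max_gap_fraction=0.0, gap="-"):
    """Remove columns whose fraction of gap characters exceeds the threshold.

    seqs maps gene names to aligned sequences of equal length.
    Returns the masked alignment and the indices of the columns kept.
    """
    lengths = {len(s) for s in seqs.values()}
    if len(lengths) != 1:
        raise ValueError("all aligned sequences must have the same length")
    length = lengths.pop()
    keep = [
        i for i in range(length)
        if sum(s[i] == gap for s in seqs.values()) / len(seqs) <= max_gap_fraction
    ]
    masked = {name: "".join(s[i] for i in keep) for name, s in seqs.items()}
    return masked, keep


# Toy alignment: gene2 carries a three-residue deletion.
alignment = {
    "gene1": "MKTAYLLV",
    "gene2": "MKT---LV",
    "gene3": "MRTAYLIV",
}
masked, kept = mask_alignment(alignment)
print(kept)
print(masked)
```

After masking, the deletion in `gene2` no longer contributes to the comparison, so the remaining columns reflect substitutions only.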
The conditions described above and highlighted in Table 4 are just some
examples of conditions in which evolutionary methods are more likely to
make accurate functional predictions than similarity-based methods.
Phylogenetic methods are particularly useful when the history of a gene
family includes many of these conditions (e.g., multiple gene
duplications plus rate variation) or when the gene family is very
large. The principle is simple: the more complicated the history of a
gene family, the more useful it is to try to infer that history. Thus,
although the phylogenomic method is slow and labor intensive, I believe
it is worth using if accuracy is the main objective. In addition,
information about the evolutionary relationships among gene homologs is
useful for summarizing relationships among genes and for putting
functional information into a useful context.
Despite the evolution of these methods, and likely continued
improvements in functional predictions, it must be remembered that the
key word is prediction. All methods are going to make inaccurate predictions of functions. For example, none of the methods
described can perform well when gene functions can change with little
sequence change as has been seen in proteins like opsins (Yokoyama
1997). Thus, sequence databases and genome researchers should make
clear which functions assigned to genes are based on predictions and
which are based on experiments. In addition, all prediction methods
should use only experimentally determined functions as their grist for
predictions. This will hopefully limit error propagation that can
happen by using an inaccurate prediction of function to then predict
the function of a new gene, which is a particular problem for the
highest hit methods, as they rely on the function of only one gene at a
time to make predictions (Eisen et al. 1997). Despite these and other
potential problems, functional predictions are of great value in
guiding research and in sorting through huge amounts of data. I believe
that the increased use of phylogenetic methods can only serve to
improve the accuracy of such functional predictions.
FOOTNOTES
1 E-MAIL jeisen{at}leland.stanford.edu; FAX (650) 725-1848.
WWW: http://www-leland.stanford.edu/~jeisen.