Genome Res. 13:2353-2362, 2003
©2003 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/03 $5.00
Methods
Comprehensive Analysis of Orthologous Protein Domains Using the HOPS Database
Center for Genomics and Bioinformatics, Karolinska Institutet, S-17177 Stockholm, Sweden
One of the most reliable methods for protein function annotation is to transfer experimentally known functions from orthologous proteins in other organisms. Most methods for identifying orthologs operate on a subset of organisms with a completely sequenced genome, and treat proteins as single-domain units. However, it is well known that proteins are often made up of several independent domains, and there is a wealth of protein sequences from genomes that are not completely sequenced. A comprehensive set of protein domain families is found in the Pfam database. We wanted to apply orthology detection to Pfam families, but first some issues needed to be addressed. First, orthology detection becomes impractical and unreliable when too many species are included. Second, shorter domains contain less information. It is therefore important to assess the quality of the orthology assignment and avoid very short domains altogether. We present a database of orthologous protein domains in Pfam called HOPS: Hierarchical grouping of Orthologous and Paralogous Sequences. Orthology is inferred in a hierarchic system of phylogenetic subgroups using ortholog bootstrapping. To avoid the frequent errors stemming from horizontally transferred genes in bacteria, the analysis is presently limited to eukaryotic genes. The results are accessible in the graphical browser NIFAS, a Java tool originally developed for analyzing phylogenetic relations within Pfam families. The method was tested on a set of curated orthologs with experimentally verified function. In comparison to tree reconciliation with a complete species tree, our approach finds significantly more orthologs in the test set. Examples for investigating gene fusions and domain recombination using HOPS are given.
The concepts of orthology and paralogy (Fitch 1970
A standard approach for assigning orthology in a phylogenetic tree is tree reconciliation (Goodman et al. 1979
However, one drawback of tree reconciliation is that it uses a given, fixed species tree: For some species the evolutionary history is still controversial, for example, the phylogenetic relationship of Homo sapiens, Caenorhabditis elegans, and Drosophila melanogaster (Mushegian et al. 1998
Here we present an approach to resolve these problems by organizing the sequences into evolutionarily distinct subgroups. Orthology is then inferred between these subgroups using ortholog bootstrapping (Storm and Sonnhammer 2002
Recent studies indicate a high rate of horizontal transfer for bacteria (Doolittle 1999
Data
This paper is based on the 3735 protein families in Pfam 7.2 (Bateman et al. 2002
Each level is divided into evolutionarily distinct species groups built around completely sequenced genomes. This minimizes the chance that paralogs are mistaken for orthologs because the true orthologous sequence is not yet known. These species groups, or lineages, are each treated as one "pseudospecies" in the orthology analysis, which is carried out only between species groups at the same level. Orthologous relations within a species group are not analyzed. The eukaryotic level is split up into the clades of Metazoa, Viridiplantae, and Fungi. The metazoan level is divided into Chordata, Nematoda, and Arthropoda. Sequences from species that do not belong to any of those clades are not analyzed. For instance, a sequence from an Echinodermata species (sea urchins, starfish, etc.) would not be taken into account on the metazoan level, because it does not belong to the Chordata, Nematoda, or Arthropoda groups. But it would be analyzed on the eukaryotic level, because it is part of the metazoan group. Therefore, orthologs to this sequence may be assigned in the Viridiplantae and Fungi groups, but not on the metazoan level.
To improve the quality of ortholog assignments, we can use an outgroup criterion at the metazoan level. All sequences from the eukaryotic groups that are not part of the metazoan group are treated as outgroup sequences. If an outgroup sequence is found between two candidate orthologs in the tree, this means that they are probably not true orthologs but were clustered together because the true ortholog was lost in one of the species. The outgroup species did not lose the ortholog and is hence more closely related than the false ortholog. We have not used the outgroup criterion at the eukaryotic level, however, because of the frequent horizontal transfers from eukaryotes to prokaryotes.
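The grouping rules above can be illustrated with a short Python sketch. It is not taken from the HOPS code; the function names and the representation of a species as a root-to-leaf list of clade names are assumptions made for illustration only.

```python
# Species groups per HOPS level, as named in the text.
LEVELS = {
    "Eukaryota": ("Metazoa", "Viridiplantae", "Fungi"),
    "Metazoa": ("Chordata", "Nematoda", "Arthropoda"),
}


def assign_groups(lineage):
    """Map a lineage (hypothetical root-to-leaf list of clade names) to its
    species group at each level. None means the sequence is not analyzed
    at that level."""
    groups = {}
    for level, members in LEVELS.items():
        group = None
        if level in lineage:
            # Pick the species group of this level that the lineage passes through.
            group = next((m for m in members if m in lineage), None)
        groups[level] = group
    return groups


def is_metazoan_outgroup(lineage):
    """Outgroup sequences for the metazoan level: members of the eukaryotic
    species groups other than Metazoa."""
    group = assign_groups(lineage)["Eukaryota"]
    return group is not None and group != "Metazoa"


sea_urchin = ["Eukaryota", "Metazoa", "Echinodermata", "Strongylocentrotus purpuratus"]
yeast = ["Eukaryota", "Fungi", "Saccharomyces cerevisiae"]
print(assign_groups(sea_urchin))
print(is_metazoan_outgroup(yeast))
```

For the sea urchin lineage the sketch reports the Metazoa group at the eukaryotic level and no group at the metazoan level, which mirrors the Echinodermata example in the text.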
Orthostrapper
Here Orthostrapper is run on pairs of species groups at the same level. Each Pfam alignment is split according to the phylogenetic groups mentioned above, and six pairwise comparisons are carried out: on the eukaryotic level, Metazoa-Viridiplantae, Metazoa-Fungi, and Viridiplantae-Fungi; on the metazoan level, Chordata-Nematoda, Chordata-Arthropoda, and Nematoda-Arthropoda. Ortholog groups between more than two species groups are generated by merging the pairwise results. This is only allowed between species groups at the same HOPS level. If there are incongruencies between the pairwise results, this needs to be flagged. Partial sequences/fragments shorter than 50% of the alignment length were removed to improve the tree quality. Ortholog bootstrapping with 200 pseudosamples was then used to analyze the alignments. Alignments with more than 1200 sequences were not analyzed. Because of their arbitrary length, Pfam families of type "repeats" were excluded from the analysis.
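The preprocessing described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the Orthostrapper or HOPS implementation: the alignment is assumed to be a dictionary of equal-length gapped strings, fragment length is assumed to mean the number of non-gap characters, the size limit is assumed to apply to the alignment as passed in, and the type label "Repeat" is an assumed spelling of the Pfam type.

```python
from itertools import combinations

LEVELS = {
    "Eukaryota": ("Metazoa", "Viridiplantae", "Fungi"),
    "Metazoa": ("Chordata", "Nematoda", "Arthropoda"),
}
MAX_SEQUENCES = 1200        # larger alignments are skipped
MIN_LENGTH_FRACTION = 0.5   # fragments below this fraction are removed
GAP_CHARS = "-."            # assumed gap symbols


def pairwise_comparisons(levels=LEVELS):
    """List every pair of species groups within the same level."""
    return [(level, a, b)
            for level, groups in levels.items()
            for a, b in combinations(groups, 2)]


def drop_fragments(alignment):
    """Keep sequences whose residue count is at least half the alignment length."""
    if not alignment:
        return {}
    length = len(next(iter(alignment.values())))
    kept = {}
    for name, seq in alignment.items():
        residues = sum(1 for c in seq if c not in GAP_CHARS)
        if residues >= MIN_LENGTH_FRACTION * length:
            kept[name] = seq
    return kept


def should_analyze(alignment, family_type, excluded_types=("Repeat",)):
    """Apply the size limit and the exclusion of repeat-type families."""
    return family_type not in excluded_types and len(alignment) <= MAX_SEQUENCES


for comparison in pairwise_comparisons():
    print(comparison)
```

Generating the pairs with `combinations` per level, rather than over all groups at once, reflects the rule that comparisons are only made between species groups at the same level, and yields the six comparisons listed in the text.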
Test Set
To avoid any ambiguity, the sequences in the test set should fulfill the following criteria:
To narrow the set of proteins, all single-domain proteins (according to Pfam-A) in one species were scored with BLAST to all single-domain proteins in the second species. From the alignments, best-best hit pairs (sequences that rank each other as best match when searching both ways) were extracted. To remove many-to-many orthologous relations, each sequence of the best-best hit pairs was then aligned to all sequences within its species. If any sequence within the same species had a lower e-value than the corresponding ortholog, the pair was excluded from the test set. From the remaining pairs, sequences with functional annotation by homology were excluded. The functional descriptions of each sequence pair were then compared. Only if both proteins use the same substrate and show the same activity were they included in the final test set. This procedure resulted in a set of 102 human-yeast, 19 human-worm, and 47 human-fly putative orthologs, shown in Tables 1, 2, 3. It should be pointed out that the main criteria for the test set are "experimentally verified function" and "same substrate and activity," which is why the test set is relatively small. The reciprocal BLAST scoring was done to limit the necessary number of (manual) comparisons of sequence annotations.
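The automated part of this procedure, best-best hit extraction followed by the within-species filter, can be sketched in Python. The data layout is an assumption: `evalues` is a hypothetical dictionary mapping (query, subject) pairs to BLAST e-values, and all function names are illustrative. The manual comparison of functional annotations is not modeled.

```python
def best_hit(evalues, query, candidates):
    """Return the candidate with the lowest e-value for this query, or None."""
    scored = [(evalues[(query, c)], c) for c in candidates if (query, c) in evalues]
    return min(scored)[1] if scored else None


def best_best_pairs(evalues, species_a, species_b):
    """Pairs that rank each other as best match when searching both ways."""
    pairs = []
    for a in species_a:
        b = best_hit(evalues, a, species_b)
        if b is not None and best_hit(evalues, b, species_a) == a:
            pairs.append((a, b))
    return pairs


def has_closer_paralog(evalues, seq, partner, same_species):
    """True if a sequence from the same species scores a lower e-value than the partner."""
    cutoff = evalues[(seq, partner)]
    return any(evalues.get((seq, other), float("inf")) < cutoff
               for other in same_species if other != seq)


def candidate_orthologs(evalues, species_a, species_b):
    """Best-best pairs with no closer within-species hit on either side."""
    kept = []
    for a, b in best_best_pairs(evalues, species_a, species_b):
        if has_closer_paralog(evalues, a, b, species_a):
            continue
        if has_closer_paralog(evalues, b, a, species_b):
            continue
        kept.append((a, b))
    return kept


# Toy e-values: h2 has a within-species hit (h1) that is better than its best yeast hit.
human, yeast = ["h1", "h2"], ["y1", "y2"]
evalues = {
    ("h1", "y1"): 1e-50, ("y1", "h1"): 1e-50,
    ("h1", "y2"): 1e-5,  ("y2", "h1"): 1e-5,
    ("h2", "y2"): 1e-20, ("y2", "h2"): 1e-20,
    ("h2", "y1"): 1e-3,  ("y1", "h2"): 1e-3,
    ("h1", "h2"): 1e-40, ("h2", "h1"): 1e-40,
}
print(candidate_orthologs(evalues, human, yeast))
```

In the toy data both h1-y1 and h2-y2 are best-best pairs, but h2-y2 is dropped because h2 matches h1 with a lower e-value than it matches y2.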
Phylogenetic Analysis
Access to the Data
Comparison With Tree Reconciliation
Accessing the HOPS Database
The HOPS database can be accessed either via the NIFAS browser or a standard Internet browser. NIFAS is a Java applet for viewing phylogenetic trees of domains, connected to schematic graphical representations of the proteins' domain structure (Storm and Sonnhammer 2001 The increase in computing power in recent years makes it possible to provide precalculated neighbor-joining trees with bootstrap support for families of up to 250 sequences and, without bootstrap support, for families of up to 1500 sequences. Although the information content of phylogenetic trees of this size with no bootstrap support is at best questionable, it allows viewing the ortholog bootstrap values calculated for the larger families within NIFAS.
NIFAS
While the orthology navigation panel is shown, a click with the left mouse button on any domain schematic will open a new window. In this window, all proteins are displayed that contain at least one domain with an orthology score above the Orthology value. A right click on a domain schematic will show the same information in the browser window in text format. The species group of a sequence is shown by color-coding the background of the species name next to the sequence identifier. Colored boxes around the sequence identifier highlight orthologous relations. Each group of orthologs and paralogs has a different color assigned (Fig. 3). Sometimes the results of the clustering can be ambiguous, especially if clustering sequences from more than two species. In these cases, the boxes for sequences that are grouped in more than one cluster are drawn with multiple colors, one for each cluster to which the sequence is assigned.
Access Through an Internet Browser
Accuracy of the Test Set
Comparing HOPS With RIO
Results for the H. sapiens-D. melanogaster Test Set
Results for the H. sapiens-C. elegans Test Set
An analysis of the corresponding sequence trees shows why orthology inference with a complete species tree as done by RIO fails to find most of the orthologs (Fig. 5B). In these cases, the trees do not follow the "Ecdysozoa" phylogeny in the species tree used by RIO, which places C. elegans in a clade with arthropods. The fly sequences in these trees are in the same clade as the H. sapiens sequences, and the nematode sequences are basal to both, representing the "Coelomata" hypothesis (Fig. 3). In this combination of gene and species tree, the tree reconciliation algorithm will assign the human and the worm sequence to be paralogous. In HOPS, the fly sequences are not included for the human-worm analysis; therefore, the orthologous sequence is correctly assigned.
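The effect described above can be reproduced with a toy reconciliation in Python. The sketch uses the standard last-common-ancestor mapping, in which a gene-tree node is called a duplication when it maps to the same species-tree node as one of its children. It is not RIO's code; the tree encodings (nested tuples for the gene tree, a child-to-parent dictionary for the species tree), the internal node names, and the restriction to one gene per species are simplifying assumptions.

```python
# Species tree under the Ecdysozoa hypothesis, as child -> parent.
ECDYSOZOA = {
    "human": "Bilateria",
    "fly": "Ecdysozoa",
    "worm": "Ecdysozoa",
    "Ecdysozoa": "Bilateria",
    "Bilateria": None,
}


def lca(species_parent, a, b):
    """Last common ancestor of two nodes in the species tree."""
    ancestors = set()
    node = a
    while node is not None:
        ancestors.add(node)
        node = species_parent.get(node)
    node = b
    while node not in ancestors:
        node = species_parent[node]
    return node


def classify(gene_tree, species_parent, a, b):
    """Return 'ortholog' or 'paralog' for leaves a and b of a binary gene tree.

    Leaves are species names (one gene per species in this toy example)."""
    def walk(node):
        # Returns (mapped species node, leaves below this node, verdict or None).
        if isinstance(node, str):
            return node, {node}, None
        left, right = node
        lmap, lleaves, lverdict = walk(left)
        rmap, rleaves, rverdict = walk(right)
        verdict = lverdict or rverdict
        here = lca(species_parent, lmap, rmap)
        separates = (a in lleaves and b in rleaves) or (b in lleaves and a in rleaves)
        if verdict is None and separates:
            # Duplication if this node maps to the same species node as a child.
            verdict = "paralog" if here in (lmap, rmap) else "ortholog"
        return here, lleaves | rleaves, verdict

    return walk(gene_tree)[2]


coelomata_like_gene_tree = (("human", "fly"), "worm")
print(classify(coelomata_like_gene_tree, ECDYSOZOA, "human", "worm"))
print(classify(("human", "worm"), ECDYSOZOA, "human", "worm"))
```

With the fly sequence present, the root of the gene tree maps to the same species node as the human-fly clade, so human and worm are called paralogs. With only human and worm in the tree, as in the pairwise HOPS comparison, the same pair is called orthologous.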
Results for the H. sapiens-S. cerevisiae Test Set
False Negatives in HOPS
Nearly all (12 out of 13) of the cases in which the HOPS clustering scheme is not reflected by the gene tree come from a misplaced Schizosaccharomyces pombe sequence. More than 60% (eight out of 13) of these cases are observed in a single domain family (the SM domain). For instance, in the phylogenetic tree for the SM family, the S. cerevisiae LSM6_YEAST sequence appears as an outgroup to an S. pombe + metazoan clade. Therefore, both methods assign orthology between the S. pombe and the human LSM6 but not between S. cerevisiae and human. Trees reconstructed with MrBayes (data not shown) give the same results. This indicates that this behavior does not originate from an error specific to the tree reconstruction method used in HOPS. For the SM family, the S. cerevisiae sequences show an elevated rate of evolution. In combination with the relative shortness of the alignment (111 residues), this might explain why none of the tree methods reconstructs the correct tree: a monophyletic S. pombe and S. cerevisiae branch basal to the metazoan clade. In the remaining cases, orthology was not assigned because of generally low bootstrap support for the whole family. The alignment simply does not contain enough phylogenetic information to reconstruct a reliable phylogenetic tree. In other words, the ratio (informative sites)/(sequence number) is too small. This results in low ortholog bootstrap values.
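As a rough illustration of the ratio mentioned above, the following Python sketch counts informative columns in an alignment and divides by the number of sequences. The text does not define "informative sites"; the sketch assumes the common parsimony criterion (a column is informative if at least two different residues each occur in at least two sequences), and the function names are illustrative.

```python
from collections import Counter

GAP_CHARS = "-."  # assumed gap symbols


def informative_sites(alignment):
    """Count parsimony-informative columns in a list of equal-length aligned strings."""
    count = 0
    for column in zip(*alignment):
        freq = Counter(c for c in column if c not in GAP_CHARS)
        # At least two residue types, each present in at least two sequences.
        if sum(1 for n in freq.values() if n >= 2) >= 2:
            count += 1
    return count


def information_ratio(alignment):
    """Ratio (informative sites)/(sequence number) for an alignment."""
    return informative_sites(alignment) / len(alignment)


toy = ["AAGT", "AAGC", "ACTT", "ACTG"]
print(informative_sites(toy), information_ratio(toy))
```

A short alignment with many sequences gives a small ratio under this definition, which is the situation the text associates with low ortholog bootstrap values.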
Investigating Gene Fusion
Orthology Between Multidomain Proteins
There could be several reasons for this observation. Assuming that horizontal transfer between chordates and plants is ruled out, it can be explained by independent recombination events of paralogous copies of the PF00994 domain in each of the arthropod, chordate, and plant lineages. Given that the plant protein has a different domain order, it must have happened independently in the plant lineage. A scenario in which the arthropod recombination selected a paralogous copy of the PF00994 domain present early in the metazoan lineage, whereas the plant and chordate recombinations both selected the same (orthologous) copy, and the other copy was lost in all lineages, would be consistent with the present tree. On the other hand, if horizontal transfer of a single domain from chordates to plants were biologically feasible, this would be an alternative explanation. It would be interesting to know if the chordate and plant proteins have evolved into identical functions despite the different domain architecture.
Conclusion
The HOPS/RIO comparison shows that the simplified HOPS approach has a higher sensitivity for finding orthologs than classical tree reconciliation with a complete species tree. Orthology assignments are done on the domain level by HOPS. The orthologous relations can be viewed graphically in NIFAS. This combination provides a comprehensive and user-friendly analysis not only of orthologous relations but also of gene fusion and domain rearrangements. The two examples of orthology between genes with different domain architecture (Figs. 7, 8, 9) demonstrate the potential of HOPS to study the mechanism of domain rearrangements. Additionally, these examples show that with increasing evolutionary distance, domain rearrangements and gene fusion can become an issue for assigning orthology. Any approach not taking into account the modular architecture of the proteins would fail to extract all orthologous information in the given examples.
The results of this study indicate that there is a limit to how much phylogenetic information can be included sensibly for finding orthologs. Up to a certain point, including additional information will improve the results. For instance, if one only analyzes sequences from two species for orthology, this can lead to a situation in which a paralog is incorrectly taken for an ortholog. This will happen if the true ortholog was lost or is not sequenced yet. Including sequences from additional species in the analysis would increase the chance that at least one true ortholog is present in the same clade. Tree-based methods like HOPS and RIO will then assign orthology and paralogy correctly. But the lower success rate of RIO for finding orthologs between H. sapiens and C. elegans shows that the inclusion of additional sequences can have the opposite effect. Here the inclusion of all species in the analysis, especially arthropod sequences, prevents the tree-reconciling algorithm from finding the correct orthologs.
Even if the Coelomata hypothesis were used instead of the Ecdysozoa hypothesis in combination with tree reconciliation, this would not solve the problem. The investigated sequences are rather short compared with full-length proteins. Therefore a tree reconstructed from them is more prone to "statistical fluxes" in evolution. Assigning orthology based on speciation events that followed one another as closely in time as those of D. melanogaster, H. sapiens, and C. elegans is bound to fail for a high fraction of sequences. The only viable approach is to say that the exact grouping of these three species is unknown. For the assignment of orthology in HOPS, we try to find the right balance between including too little and too much phylogenetic information. This is not always possible. If there is only one sequence for each lineage present in a family and no sequence from an outgroup species is available, HOPS would assign these sequences as orthologous. However, it is possible that these sequences are not orthologous, but paralogous. This would be the case if the true orthologs were lost.
We would like to point out that although RIO was used for the comparison with tree reconciliation, the program was not specifically designed to solve the problem of finding all orthologs between two species. Rather, the idea of RIO is to find orthologs in any species to a given query sequence. In most of the examples in which RIO fails to find the ortholog pair on which we have focused, it typically does report an ortholog in some other species. We expect RIO to have a lower rate of false positives than HOPS. This expectation is based on RIO's stringent application of all available phylogenetic information from the species tree. However, estimating the false-positive rate is very hard, if not impossible. In the absence of the true ortholog, one cannot reliably say that two sequences that appear orthologous in a tree are not. In cases in which the true ortholog exists, both RIO and HOPS have very low false-positive rates, especially compared with cases in which many different species are clustered together (Remm et al. 2001 Applying more advanced phylogenetic methods will allow including additional phylogenetic information in the analysis. The HOPS clustering scheme is set up in a way that it can handle ambiguity in the species tree and some errors encountered in neighbor-joining tree reconstruction. But the clustering scheme could easily be adjusted to include more phylogenetic information from the species tree in case a more advanced method for the tree reconstruction would be used. At the moment, a large-scale analysis using, for instance, trees constructed in a Bayesian framework would not be practical. Calculating the tree for only a small protein family can easily take more than a day with MrBayes. In contrast, the computation of all trees and inference of orthology for all 3735 families done in this paper took less than a week on a 6 CPU UltraSPARC III system. 
With the increase of computational processing power, especially the increase due to parallel systems like Beowulf clusters, it will be possible to use advanced algorithms and methods for finding orthologs in the near future.
We thank Lars Arvestad for helpful comments on this work and Maido Remm, Jennifer Lee, and Alistair Chalk for their suggestions on the clustering scheme of HOPS. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
1 Corresponding author. E-MAIL Erik.Sonnhammer@cgb.ki.se; FAX 46-8-337983.
[The NIFAS viewer is integrated in the Stockholm Pfam site (http://Pfam.cgb.ki.se Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr1305203.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Bateman, A., Birney, E., Cerruti, L., Durbin, R., Etwiller, L., Eddy, S.R., Griffiths-Jones, S., Howe, K.L., Marshall, M., and Sonnhammer, E.L.L. 2002. The Pfam protein families database. Nucleic Acids Res. 30: 276-280.
Blair, J.E., Ikeo, K., Gojobori, T., and Hedges, S.B. 2002. The evolutionary position of nematodes. BMC Evol. Biol. 2: 7.
Blanchette, M., Schwikowski, B., and Tompa, M. 2002. Algorithms for phylogenetic footprinting. J. Comput. Biol. 9: 211-223.
Doolittle, W.F. 1999. Lateral genomics. Trends Cell Biol. 9: M5-M8.
Fitch, W.M. 1970. Distinguishing homologous from analogous proteins. Syst. Zool. 19: 99-113.
____. 2000. Homology: A personal view on some of the problems. Trends Genet. 16: 227-231.
Gogarten, J.P. and Olendzenski, L. 1999. Orthologs, paralogs and genome comparisons. Curr. Opin. Genet. Dev. 9: 630-636.
Goodman, M., Czelusniak, J., Moore, G.W., Romero-Herrera, A.E., and Matsuda, G. 1979. Fitting the gene lineage into its species lineage: A parsimony strategy illustrated by cladograms constructed from globin sequences. Syst. Zool. 28: 132-168.
Hollich, V., Storm, C.E., and Sonnhammer, E.L.L. 2002. OrthoGUI: Graphical presentation of Orthostrapper results. Bioinformatics 18: 1272-1273.
Huelsenbeck, J.P. and Ronquist, F. 2001. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics 17: 754-755.
Koonin, E.V., Makarova, K.S., and Aravind, L. 2001. Horizontal gene transfer in prokaryotes: Quantification and classification. Annu. Rev. Microbiol. 55: 709-742.
Makalowski, W., Zhang, J., and Boguski, M.S. 1996. Comparative analysis of 1196 orthologous mouse and human full-length mRNA and protein sequences. Genome Res. 6: 846-857.
Mushegian, A.R., Garey, J.R., Martin, J., and Liu, L.X. 1998. Large-scale taxonomic profiling of eukaryotic model organisms: A comparison of orthologous proteins encoded by the human, fly, nematode, and yeast genomes. Genome Res. 8: 590-598.
Page, R.D.M. 1994. Maps between trees and cladistic analysis of historical associations among genes, organisms, and areas. Syst. Biol. 43: 58-77.
Remm, M., Storm, C.E., and Sonnhammer, E.L.L. 2001. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314: 1041-1052.
Snel, B., Bork, P., and Huynen, M.A. 2002. Genomes in flux: The evolution of archaeal and proteobacterial gene content. Genome Res. 12: 17-25.
Sonnhammer, E.L.L. and Koonin, E.V. 2002. Orthology, paralogy and proposed classification for paralog subtypes. Trends Genet. 18: 619-620.
Stein, L. 2001. Genome annotation: From sequence to biology. Nat. Rev. Genet. 2: 493-503.
Storm, C.E. and Sonnhammer, E.L.L. 2001. NIFAS: Visual analysis of domain evolution in proteins. Bioinformatics 17: 343-348.
____. 2002. Automated ortholog inference from phylogenetic trees and calculation of orthology reliability. Bioinformatics 18: 92-99.
Tatusov, R.L., Koonin, E.V., and Lipman, D.J. 1997. A genomic perspective on protein families. Science 278: 631-637.
Xie, T. and Ding, D. 2000. Investigating 42 candidate orthologous protein groups by molecular evolutionary analysis on genome scale. Gene 261: 305-310.
Zharkikh, A. and Li, W.H. 1995. Estimation of confidence in phylogeny: The complete-and-partial bootstrap technique. Mol. Phylogenet. Evol. 4: 44-63.
Zmasek, C.M. and Eddy, S.R. 2002. RIO: Analyzing proteomes by automated phylogenomics using resampled inference of orthologs. BMC Bioinformatics 3: 14.
ftp://ftp.cgb.ki.se/pub/HOPS/; HOPS.
http://Pfam.cgb.ki.se; Stockholm Pfam site.
http://www.rio.wustl.edu; RIO.
Received February 26, 2003; accepted in revised form August 8, 2003.