Genome Res. 14:29-36, 2004
©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Letter
Coelomata and Not Ecdysozoa: Evidence From Genome-Wide Phylogenetic Analysis
National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
The relative positions of nematodes, arthropods, and chordates in animal phylogeny remain uncertain. The traditional tree topology joins arthropods with chordates in a coelomate clade, whereas nematodes, which lack a coelom, occupy a basal position. However, the current leading hypothesis, based on phylogenetic trees for 18S ribosomal RNA and several proteins, joins nematodes with arthropods in a clade of molting animals, Ecdysozoa. We performed a phylogenetic analysis of over 500 sets of orthologous proteins, which are represented in plants, animals, and fungi, using maximum likelihood, maximum parsimony, and distance methods. Additionally, to increase the statistical power of topology tests, the same methods were applied to concatenated alignments of subunits of eight conserved macromolecular complexes. The majority of the methods, when applied to most of the orthologous clusters, both concatenated and individual, grouped the fly with humans to the exclusion of the nematode, in support of the coelomate phylogeny. Trees were also constructed using information on insertions and deletions in orthologous proteins, combinations of domains in multidomain proteins, and presence-absence of species in clusters of orthologs. All of these approaches supported the coelomate clade and showed concordance between evolution of protein sequences and higher-level evolutionary events, such as domain fusion or gene loss.
Despite more than a century of extensive phylogenetic studies, major issues in the evolution of the metazoa (animals) remain unresolved (for review, see Hedges 2002 [...] β-thymosin (Manuel et al. 2000
The ecdysozoan topology gained rapid recognition in the "evo-devo" community thanks to its apparent biological plausibility (e.g., Adoutte et al. 2000 [...] 100 orthologous nuclear proteins using several phylogenetic methods. Both groups found that the majority of trees supported the coelomate topology. Given the multiple lines of support for each of the alternative tree topologies, the issue is considered unresolved, and the metazoan phylogenetic tree is often cautiously presented as multifurcations (e.g., Hedges 2002
The principal interest of the coelomate-ecdysozoa conundrum lies in the relationship between phylogeny and biological organization, at both the organismal and molecular levels. The coelomate topology resonates with the straightforward notions of the hierarchy of morphological and physiological complexity among the considered organisms, which is the main reason why this phylogeny had been accepted since the time of Ernst Haeckel and until the 18S rRNA analysis by Lake and coworkers (Aguinaldo et al. 1997
Large-scale phylogenetic analysis inevitably involves a trade-off between taxon sampling and gene (or, more generally, character) sampling. The relative importance of increasing the number of analyzed taxa and the number of characters for the accuracy of phylogenetic inferences remains an issue of debate (Hillis 1998
Phylogenetic Analysis of Concatenated Protein Sequence Alignments
To increase the statistical power of phylogenetic analysis, we constructed concatenated alignments of the subunits of eight macromolecular complexes (hereinafter, the con8 set), under the premise that these proteins are likely to evolve in the same mode and can be legitimately analyzed as a single entity (Table 1). The conserved blocks from each of the concatenated alignments (see Methods) were employed to construct distance matrix trees using the neighbor-joining and least-squares methods as well as parsimony and maximum-likelihood (ML) trees. Both distance-based methods showed a strong preference for the coelomate topology, with bootstrap probabilities >80% (Table 1; data not shown). Both maximum parsimony methods also assigned the coelomate topology to the majority of the analyzed systems, with two exceptions, namely, a weak support for the ecdysozoan topology for the RNA polymerase subunits, and a strong preference for the ecdysozoan topology for the proteasome subunits (Table 2). In contrast, all three ML methods divided the systems between the two competing topologies, with five alignments (chaperonins, clathrins, DNA polymerase subunits, licensing factors, and translation factors) showing preference for the coelomate model (with varying degrees of confidence), and three (proteasome subunits, ribosomal proteins, and RNA polymerase subunits) displaying a strong preference for the ecdysozoan model (Table 2).
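The concatenation step described above can be illustrated with a minimal Python sketch. It assumes each subunit alignment is already available as a mapping from species name to an aligned sequence of equal length; the function name `concatenate_alignments` and the toy sequences are hypothetical and are not taken from this work. The selection of conserved blocks is not shown.

```python
from typing import Dict, List


def concatenate_alignments(alignments: List[Dict[str, str]]) -> Dict[str, str]:
    """Join per-protein alignments column-wise, giving one long row per species."""
    species = sorted(alignments[0])
    for aln in alignments:
        # Every alignment must contain exactly the same set of species.
        if sorted(aln) != species:
            raise ValueError("every alignment must cover the same species")
        # All rows of one alignment must have the same number of columns.
        if len({len(row) for row in aln.values()}) != 1:
            raise ValueError("rows within one alignment must have equal length")
    return {sp: "".join(aln[sp] for aln in alignments) for sp in species}


# Toy data (hypothetical sequences, for illustration only).
subunit_a = {"human": "MKV-L", "fly": "MKIAL", "nematode": "MRV-L"}
subunit_b = {"human": "GGT", "fly": "GAT", "nematode": "GGS"}
concatenated = concatenate_alignments([subunit_a, subunit_b])
print(concatenated["human"])
```

Treating the subunits of one complex as a single entity in this way lengthens the alignment, which is what gives the topology tests more statistical power.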
Phylogenetic Analysis of Individual KOGs
Each of the 507 KOGs containing representatives from six eukaryotic species (six507 set) and selected as described in the Methods section was subjected to phylogenetic analysis using the least-squares and ML methods, under the gap exclusion and block site selection schemes (see Methods). For each scheme, 35%-44% of the trees failed to recover the monophyly of the metazoa or that of the two yeast species, or to cluster paralogs from the same species (when present) into the same lineage. These obvious artifacts were attributed to errors in automatically produced alignments, compositional bias of the sequences, or misidentification of orthologs, and the respective trees were discarded. The remaining 285 to 328 trees (depending on the method) assign one of the three possible topologies to the metazoa, with plants and fungi considered outgroups (Fig. 1). A relatively small minority (12%-14%) of the trees placed the fly in the metazoan root, whereas the rest were divided between the coelomate (53%-67%) and ecdysozoan (21%-35%) topologies (Table 3). The gap exclusion mode (i.e., use of a greater number of relatively variable positions in the phylogenetic analysis) and the least-squares tree reconstruction method favored the coelomate topology. In contrast, the block site selection mode (use of only highly conserved, slow-evolving positions) and the ML method made the split between the two topologies more even (Table 3). Considering only the cases where at least three of the four tree construction schemes agreed on the topology, the distribution shifted even further in favor of the coelomate model, with 70% of the robust trees pointing this way (Table 3).
Altogether, 202 of the 507 analyzed KOGs (40%) showed complete agreement on the reconstructed topology, which, in itself, is a considerable amount of apparently phylogenetically coherent data; however, this result also points to a notable variability in the outcomes of different analysis schemes.
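The "at least three of four schemes agree" criterion can be sketched in a few lines of Python. This is an illustration only: it assumes each KOG has already been assigned a topology label by each of the four schemes, and the labels, the KOG identifiers (`kog_a` and so on), and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional


def robust_topology(calls: List[str], min_agree: int = 3) -> Optional[str]:
    """Return the topology supported by at least `min_agree` schemes, else None."""
    topology, count = Counter(calls).most_common(1)[0]
    return topology if count >= min_agree else None


def tally_robust(kog_calls: Dict[str, List[str]], min_agree: int = 3) -> Counter:
    """Count robust topologies across KOGs; KOGs without agreement are skipped."""
    counts: Counter = Counter()
    for calls in kog_calls.values():
        topology = robust_topology(calls, min_agree)
        if topology is not None:
            counts[topology] += 1
    return counts


# Toy input: one list of four scheme-level calls per (hypothetical) KOG.
toy_calls = {
    "kog_a": ["coelomate"] * 4,
    "kog_b": ["coelomate", "coelomate", "coelomate", "ecdysozoa"],
    "kog_c": ["ecdysozoa", "ecdysozoa", "coelomate", "coelomate"],
    "kog_d": ["ecdysozoa", "ecdysozoa", "ecdysozoa", "fly_basal"],
}
robust_counts = tally_robust(toy_calls)
print(robust_counts)
```

With four schemes and a threshold of three, at most one topology can reach the threshold, so the result for each KOG is unambiguous.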
Branch Length Effects
It has been claimed that the coelomate topology is an artifact of the high evolutionary rate in some species of nematodes, particularly Caenorhabditis elegans, which results in long branches that are pushed to a basal position in trees (Aguinaldo et al. 1997
The fraction of trees with the coelomate topology monotonically increases over the range of the relative lengths of the C. elegans branch (Fig. 2B). Under the hypothesis that the ecdysozoan topology is the correct one and the coelomate topology appears because of long-branch attraction, this corresponds to an increasing rate of erroneous topology assignment with the increase of the relative length of the nematode branch. Whether the above hypothesis is realistic can be tested by measuring the rate of false topology assignment in model trees with varying relative branch lengths produced from simulated multiple sequence alignments. The alternative hypothesis that the coelomate topology is the correct one can be similarly tested. As shown in Figure 2B, the ML trees reconstructed by using ProtML are remarkably robust to long-branch attraction artifacts. For the alignments simulated with the ecdysozoan tree, even a 64:1 ratio of the nematode to human branch lengths yields an error rate of 60%. In the range of branch length ratios where most of the actual data fall (1:1 to 3:1), ProtML correctly reconstructs 70%-95% of trees for simulated alignments. In contrast, in the trees constructed for the real KOGs, the coelomate trees significantly outnumber the ecdysozoan ones; that is, the "error rate" is >>50% (Fig. 2B). Conversely, the fraction of trees with the erroneous ecdysozoan topology reconstructed from alignments simulated with the coelomate model increases with the decrease of the relative length of the nematode branch; however, even when the nematode branch was half as long as the human branch, the error rate was only 18% (Fig. 2B). Thus, the results of the tests with simulated alignments and model trees indicate that the presence of both coelomate and ecdysozoan topologies among the trees for the six384 KOG set cannot be attributed solely or even largely to the long (short) branch attraction artifacts.
Trees Built Using the Median Similarity Between Orthologs as a Measure of Evolutionary Distance
Indels as Evolutionary Markers
Insertions and deletions (indels) in proteins are often considered to be suitable characters for inferring evolutionary relationships, under the assumption that independent insertion or deletion in the exact same position of a protein in different lineages (homoplasy) is unlikely (Rokas and Holland 2000
Trees Based on Gene Content and Domain Co-Occurrence in Multidomain Proteins
Using patterns of gene presence-absence in orthologous sets for tree construction is one of the straightforward genome-tree approaches (Fitz-Gibbon and House 1999
The rooted tree produced using the Dollo method confidently supported the coelomate topology (Fig. 4). Otherwise, however, this tree was at odds with the prevalent taxonomic view (Hedges 2002
All eukaryotes have numerous multidomain proteins, which allows one to use the data on domain co-occurrence to construct trees. For this purpose, each pair of co-occurring domains was treated as a binary character, and the Dollo parsimony method was applied to the resulting character table with the same rationale as for the gene presence-absence data; that is, under the assumption that independent origin of the same domain combination is unlikely. The topology of the resulting tree was identical to that of the gene content tree, with an equally strong support for each internal branch (Fig. 5). Thus, evolution of domain fusions seems to follow the pattern of gene emergence and loss. The strong support for the coelomate topology seen in this tree reflects the previously noted higher similarity between the architectures of human and fly multidomain proteins compared to those of the nematode (Koonin et al. 2000
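The encoding of domain co-occurrence as binary characters can be sketched as follows. This is a minimal illustration that assumes each protein is given as the set of domains it contains; the domain names, the toy proteomes, and the function name `domain_pair_characters` are hypothetical. The Dollo parsimony step itself is not implemented here; the sketch only builds the character table that such a method would take as input.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

DomainPair = Tuple[str, str]


def domain_pair_characters(
    proteomes: Dict[str, List[Set[str]]]
) -> Tuple[List[DomainPair], Dict[str, List[int]]]:
    """Build a species-by-character table; each character is a co-occurring domain pair."""
    pairs_by_species: Dict[str, Set[DomainPair]] = {}
    for species, proteins in proteomes.items():
        pairs: Set[DomainPair] = set()
        for domains in proteins:
            # Sorting makes (A, B) and (B, A) the same character.
            pairs.update(combinations(sorted(domains), 2))
        pairs_by_species[species] = pairs
    characters = sorted(set().union(*pairs_by_species.values()))
    table = {
        species: [int(pair in pairs) for pair in characters]
        for species, pairs in pairs_by_species.items()
    }
    return characters, table


# Toy proteomes (hypothetical domain architectures, for illustration only).
toy_proteomes = {
    "human": [{"SH2", "SH3", "kinase"}, {"PDZ", "SH3"}],
    "fly": [{"SH2", "kinase"}, {"PDZ", "SH3"}],
    "nematode": [{"SH2", "kinase"}],
}
characters, table = domain_pair_characters(toy_proteomes)
for species, row in table.items():
    print(species, row)
```

The same table layout would also fit gene presence-absence data, with one column per orthologous set instead of one column per domain pair.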
In this work, we explored metazoan phylogeny by analyzing a large set of orthologous clusters with several widely different approaches for tree construction. Quantitatively at least, there seems to be a clear convergence on the coelomate topology. This topology was supported by both sequence-dependent phylogenetic methods and sequence-independent approaches, such as the analysis of gene content of KOGs and protein domain architectures. This demonstrates the apparent concordance between different types of evolutionary events in animals; that is, gene loss and domain fusions and fissions seem to occur more or less in parallel with the decay of sequence similarity. This is not necessarily the case for the deeper branches in the eukaryotic tree, where the analysis of gene content and domain architectures supported the animal-plant grouping, in contrast to most phylogenetic analyses, including our own reported herein, which suggested the existence of an animal-fungi clade. Similarly, genome-wide phylogenetic studies on the evolution of prokaryotes revealed major differences between trees based on sequence divergence and those constructed on the basis of gene content or gene order data (Wolf et al. 2002
The phylogenetic approach employed here is one of the "genome-tree" approaches (Wolf et al. 2002 [...] Thus, the coexistence of the two incompatible topologies among the KOGs emerges as a major outstanding issue. One possible explanation is that the models of amino acid substitutions employed in distance calculations and in maximum likelihood estimates (see Methods) are not necessarily adequate approximations of evolution for all genes. Different biases in substitution probabilities might have differential effects on tree topology. This interpretation of the differences in tree topologies for different orthologous sets assumes that there is a single true topology (conceivably, the one that is observed most frequently, that is, the coelomate topology) and all deviations from it are caused by artifacts of varying nature. However, an alternative hypothesis based on the assumption that different topologies reflect evolutionary realities also could be considered. Specifically, the different topologies could ensue from a duplication of multiple genes (perhaps large parts of the genome) preceding the divergence of the analyzed lineages; in the present case, vertebrates, arthropods, and nematodes. Under this scenario, the most common tree topology still reflects the actual order of lineage divergence, but alternative topologies result from lineage-specific, differential loss of paralogs. Taken together, the results of the genome-wide phylogenetic analysis described here indicate that the available data support the coelomate topology for animal evolution. To reach a new level of confidence in this solution, representative samples of genome sequences from the relevant taxa and more adequate models of evolution are required.
Selection of Sets of Orthologous Proteins (KOGs) for Phylogenetic Analysis
Orthologous sets of eukaryotic proteins (KOGs; Tatusov et al. 2003
Concatenated Alignments of Subunits of Macromolecular Complexes
Site Selection for Phylogenetic Analysis
Sequence-Based Phylogeny
Analysis of the Effect of Relative Branch Lengths on Tree Topology
Four KOGs that stably produced the ecdysozoan topology (KOG1159, KOG1337, KOG1687, and KOG2041) and four KOGs with reliable coelomate topology (KOG0323, KOG1107, KOG2235, and KOG3061) were selected to generate "model" trees required for alignment simulation; each of these KOGs had an estimated evolutionary rate close to the median across the KOGs. Branch lengths from the corresponding ML trees were averaged to produce the ecdysozoan and coelomate model trees. On the basis of each model tree, a series of trees with varying ratios of the nematode to human branch lengths was created. In each of these trees, three branch lengths, nematode, human, and ecdysozoan (or coelomate in the simulations of coelomate topology reconstruction), were changed in such a way that: (1) the sum of branch lengths in the nematode-Drosophila-human star tree remained constant; (2) the relative position of the metazoan root on the human or the nematode branch remained the same; and (3) the ratio of the nematode to human branch lengths in the nematode-Drosophila-human star tree was set to the desired value.
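Conditions (1) and (3) above can be illustrated with a deliberately simplified Python sketch. It assumes a three-branch star tree in which the Drosophila branch is left untouched and only the nematode and human branches are redistributed; condition (2), the position of the metazoan root, and the internal ecdysozoan or coelomate branch are not modeled. The function name `rescale_star` and the branch lengths are hypothetical.

```python
from typing import List, Tuple


def rescale_star(
    nematode: float, fly: float, human: float, ratio: float
) -> Tuple[float, float, float]:
    """Set nematode/human to `ratio` while keeping the total star-tree length fixed.

    Simplifying assumption: the fly branch is unchanged, so the combined
    nematode + human length is redistributed between the two branches.
    """
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    pool = nematode + human
    new_human = pool / (1.0 + ratio)
    new_nematode = pool - new_human
    return new_nematode, fly, new_human


# A series of model trees with doubling nematode:human ratios (toy branch lengths).
ratios = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
series: List[Tuple[float, float, float]] = [
    rescale_star(0.6, 0.3, 0.2, r) for r in ratios
]
for r, (n, f, h) in zip(ratios, series):
    print(f"ratio {r:>4.0f}:1  nematode={n:.4f}  fly={f:.4f}  human={h:.4f}")
```

Each tree in such a series could then serve as the guide tree for simulating alignments, so that the rate of incorrect topology recovery can be plotted against the branch length ratio.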
Simulation of multiple sequence alignments corresponding to the evolution of a sequence family according to a given tree was performed using the PSeq-Gen tool (Grassly et al. 1997
Construction of Phylogenetic Trees by Using Distributions of Pairwise Distances Between Orthologs
Using Indels as Phylogenetic Markers
Trees Based on Presence-Absence of Species in KOGs
Analysis of Domain Architectures of Multidomain Proteins
Availability of Data and Results
We thank John Spouge and Eva Czabarka for help in estimating the confidence intervals for the relative branch length medians, Anna Panchenko and Siqian He for help with the use of the CDD library, and Alexei Kondrashov and Kira Makarova for useful discussions. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1347404.
1 Corresponding author.
Adachi, J. and Hasegawa, M. 1992. MOLPHY: Programs for Molecular Phylogenetics. Institute of Statistical Mathematics, Tokyo.
Adoutte, A., Balavoine, G., Lartillot, N., Lespinet, O., Prud'homme, B., and de Rosa, R. 2000. The new animal phylogeny: Reliability and implications. Proc. Natl. Acad. Sci. 97: 4453-4456.
Aguinaldo, A.M., Turbeville, J.M., Linford, L.S., Rivera, M.C., Garey, J.R., Raff, R.A., and Lake, J.A. 1997. Evidence for a clade of nematodes, arthropods and other moulting animals. Nature 387: 489-493.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Bapteste, E. and Philippe, H. 2002. The potential value of indels as phylogenetic markers: Position of trichomonads as a case study. Mol. Biol. Evol. 19: 972-977.
Blair, J.E., Ikeo, K., Gojobori, T., and Hedges, S.B. 2002. The evolutionary position of nematodes. BMC Evol. Biol. 2: 7.
Collins, A.G. and Valentine, J.W. 2001. Defining phyla: Evolutionary pathways to metazoan body plans. Evol. Dev. 3: 432-442.
de Rosa, R., Grenier, J.K., Andreeva, T., Cook, C.E., Adoutte, A., Akam, M., Carroll, S.B., and Balavoine, G. 1999. Hox genes in brachiopods and priapulids and protostome evolution. Nature 399: 772-776.
Felsenstein, J. 1996. Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods. Methods Enzymol. 266: 418-427.
Fitch, W.M. and Margoliash, E. 1967. Construction of phylogenetic trees. Science 155: 279-284.
Fitz-Gibbon, S.T. and House, C.H. 1999. Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res. 27: 4218-4222.
Giribet, G., Distel, D.L., Polz, M., Sterrer, W., and Wheeler, W.C. 2000. Triploblastic relationships with emphasis on the acoelomates and the position of Gnathostomulida, Cycliophora, Plathelminthes, and Chaetognatha: A combined approach of 18S rDNA sequences and morphology. Syst. Biol. 49: 539-562.
Grassly, N.C., Adachi, J., and Rambaut, A. 1997. PSeq-Gen: An application for the Monte Carlo simulation of protein sequence evolution along phylogenetic trees. Comput. Appl. Biosci. 13: 559-560.
Grishin, N.V., Wolf, Y.I., and Koonin, E.V. 2000. From complete genomes to measures of substitution rate variability within and between proteins. Genome Res. 10: 991-1000.
Hedges, S.B. 2002. The origin and evolution of model organisms. Nat. Rev. Genet. 3: 838-849.
Hillis, D.M. 1998. Taxonomic sampling, phylogenetic accuracy, and investigator bias. Syst. Biol. 47: 3-8.
Hillis, D.M., Pollock, D.D., McGuire, J.A., and Zwickl, D.J. 2003. Is sparse taxon sampling a problem for phylogenetic inference? Syst. Biol. 52: 124-126.
Kishino, H., Miyata, T., and Hasegawa, M. 1990. Maximum likelihood inference of protein phylogeny and the origin of chloroplasts. J. Mol. Evol. 31: 151-160.
Koonin, E.V., Aravind, L., and Kondrashov, A.S. 2000. The impact of comparative genomics on our understanding of evolution. Cell 101: 573-576.
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921.
Mallatt, J. and Winchell, C.J. 2002. Testing the new animal phylogeny: First use of combined large-subunit and small-subunit rRNA gene sequences to classify the protostomes. Mol. Biol. Evol. 19: 289-301.
Manuel, M., Kruse, M., Muller, W.E., and Le Parco, Y. 2000. The comparison of
Marchler-Bauer, A., Anderson, J.B., DeWeese-Scott, C., Fedorova, N.D., Geer, L.Y., He, S., Hurwitz, D.I., Jackson, J.D., Jacobs, A.R., Lanczycki, C.J., et al. 2003. CDD: A curated Entrez database of conserved domain alignments. Nucleic Acids Res. 31: 383-387.
Mirkin, B.G., Fenner, T.I., Galperin, M.Y., and Koonin, E.V. 2003. Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evol. Biol. 3: 2.
Mitchell, A., Mitter, C., and Regier, J.C. 2000. More taxa or more characters revisited: Combining data from nuclear protein-encoding genes for phylogenetic analyses of Noctuoidea (Insecta: Lepidoptera). Syst. Biol. 49: 202-224.
Mushegian, A.R., Garey, J.R., Martin, J., and Liu, L.X. 1998. Large-scale taxonomic profiling of eukaryotic model organisms: A comparison of orthologous proteins encoded by the human, fly, nematode, and yeast genomes. Genome Res. 8: 590-598.
Notredame, C., Higgins, D.G., and Heringa, J. 2000. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J. Mol. Biol. 302: 205-217.
Peterson, K.J. and Eernisse, D.J. 2001. Animal phylogeny and the ancestry of bilaterians: Inferences from morphology and 18S rDNA gene sequences. Evol. Dev. 3: 170-205.
Raff, R.A. 1996. The shape of life: Genes, development, and the evolution of animal form. University of Chicago Press, Chicago, IL.
Rokas, A. and Holland, P.W. 2000. Rare genomic changes as a tool for phylogenetics. Trends Ecol. Evol. 15: 454-459.
Rosenberg, M.S. and Kumar, S. 2001. Incomplete taxon sampling is not a problem for phylogenetic inference. Proc. Natl. Acad. Sci. 98: 10751-10756.
Rosenberg, M.S. and Kumar, S. 2003. Taxon sampling, bioinformatics, and phylogenomics. Syst. Biol. 52: 119-124.
Saitou, N. and Nei, M. 1987. The neighbor-joining method: A new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 4: 406-425.
Schmidt, H.A., Strimmer, K., Vingron, M., and von Haeseler, A. 2002. TREE-PUZZLE: Maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics 18: 502-504.
Shimodaira, H. and Hasegawa, M. 2001. CONSEL: For assessing the confidence of phylogenetic tree selection. Bioinformatics 17: 1246-1247.
Snel, B., Bork, P., and Huynen, M.A. 1999. Genome phylogeny based on gene content. Nat. Genet. 21: 108-110.
Snel, B., Bork, P., and Huynen, M.A. 2002. Genomes in flux: The evolution of archaeal and proteobacterial gene content. Genome Res. 12: 17-25.
Strimmer, K. and Rambaut, A. 2002. Inferring confidence sets of possibly misspecified gene trees. Proc. R. Soc. Lond. B Biol. Sci. 269: 137-142.
Swofford, D.L. 2000. PAUP*. Phylogenetic Analysis Using Parsimony (* and Other Methods). Version 4. Sinauer Associates, Sunderland, MA.
Tatusov, R.L., Fedorova, N.D., Jackson, J.D., Jacobs, A.R., Kiryutin, B., Koonin, E.V., Krylov, D.M., Mazumder, R., Mekhedov, S.L., Nikolskaya, A.N., et al. 2003. The COG database: An updated version includes eukaryotes. BMC Bioinformatics 4: 41.
Valentine, J.W. and Collins, A.G. 2000. The significance of moulting in Ecdysozoan evolution. Evol. Dev. 2: 152-156.
Wolf, Y.I., Rogozin, I.B., Grishin, N.V., Tatusov, R.L., and Koonin, E.V. 2001. Genome trees constructed using five different approaches suggest new major bacterial clades. BMC Evol. Biol. 1: 8.
Wolf, Y.I., Rogozin, I.B., Grishin, N.V., and Koonin, E.V. 2002. Genome trees and the tree of life. Trends Genet. 18: 472-479.
Yang, Z. 1997. PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13: 555-556.
ftp://ftp.ncbi.nih.gov/pub/koonin/EUK_PHYLOGENY/; complete set of alignments and trees for this work.
http://www.ncbi.nlm.nih.gov/COG/new/shokog.cgi; clusters of orthologs from eukaryotic genomes (KOGs).
Received March 19, 2003; accepted in revised form October 20, 2003.