Genome Res. 13:1589-1594, 2003
©2003 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/03 $5.00
Letter
The Balance of Driving Forces During Genome Evolution in Prokaryotes
Computational Genomics Group, The European Bioinformatics Institute, EMBL Cambridge Outstation, Cambridge CB10 1SD, UK
Genomes are shaped by evolutionary processes such as gene genesis, horizontal gene transfer (HGT), and gene loss. To quantify the relative contributions of these processes, we analyze the distribution of 12,762 protein families on a phylogenetic tree, derived from entire genomes of 41 Bacteria and 10 Archaea. We show that gene loss is the most important factor in shaping genome content, being up to three times more frequent than HGT, followed by gene genesis, which may contribute up to twice as many genes as HGT. We suggest that gene gain and gene loss in prokaryotes are balanced; thus, on average, prokaryotic genome size is kept constant. Despite the importance of HGT, our results indicate that the majority of protein families have only been transmitted by vertical inheritance. To test our method, we present a study of strain-specific genes of Helicobacter pylori, and demonstrate correct predictions of gene loss and HGT for at least 81% of validated cases. This approach indicates that it is possible to trace genome content history and quantify the factors that shape contemporary prokaryotic genomes.
The principal driving forces that shape prokaryotic genomes and influence
gene content are gene genesis, horizontal gene transfer (HGT), and gene loss.
Gene content was first thought to be affected by gene genesis, in particular,
duplication and divergence of single genes
(Ohno 1970
To quantify the evolutionary processes that shape genome content, we have
used an approach that takes into account the presence or absence of a gene (or
gene family) on a phylogenetic tree. Consistent gene presence in a clade
indicates that the corresponding gene was present in the ancestor of that
clade, whereas occasional absence of a gene might result from gene loss.
Finally, fragmented distribution of a gene family across very distantly
related species is indicative of horizontal gene transfer (HGT) events
(Ragan 2001
The decision as to whether the observed distribution pattern of a gene is
the product of HGT or multiple gene loss requires the estimation of the
likelihood of these events (Ochman and
Jones 2000
We attempt to explain the present phylogenetic distribution of 12,762 protein families from 51 entire genome sequences by minimizing the number of potential gene gain and loss events. We approach the problem using phylogenetic profiles (Pellegrini et al. 1999
Parameter Optimization
To develop a realistic model for protein phylogeny using gene gain and gene
loss events, we first need to estimate their relative occurrence. The optimal
HGT penalty, previously proposed to correspond to the "expected relative
frequency" of HGT versus gene loss
(Snel et al. 2002
We experimented with HGT penalty values ranging from 1 to 5, counting all reported evolutionary events (Table 1). At low HGT penalty values (<2), gene loss is slightly overpredicted, whereas with higher HGT penalties (>3), gene loss predominates (Fig. 1). The shuffled tree overpredicts HGT at any tested threshold, because protein families are not meaningfully grouped on the tree. Interpolated curves of expected and observed ratio values for both the 16S rRNA and the gene-content-derived trees intersect at HGT penalties between 2 and 3, indicating that the optimal HGT threshold is between these two values.
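To make the role of the HGT penalty concrete, the following minimal sketch counts gain and loss events for one protein family on a small tree. It is a generic weighted-parsimony (Sankoff-style) dynamic program, not the GeneTRACE procedure used in this study. The function name `reconstruct`, the nested-tuple tree encoding, and the species labels are assumptions made purely for illustration. Each loss costs 1 and each gain costs the HGT penalty; the first gain of a family can be read as gene genesis and any further gains as HGT.

```python
from math import inf

def reconstruct(tree, present, hgt_penalty):
    """Return (gains, losses) of a minimum-cost scenario for one family.

    tree        -- nested tuples; leaves are species names (strings)
    present     -- set of species in which the family is found
    hgt_penalty -- cost of one gain relative to one loss (loss cost = 1)
    """
    def solve(node):
        # Maps state (0 = absent, 1 = present) to (cost, gains, losses).
        if isinstance(node, str):
            observed = 1 if node in present else 0
            return {observed: (0.0, 0, 0), 1 - observed: (inf, 0, 0)}
        totals = {0: (0.0, 0, 0), 1: (0.0, 0, 0)}
        for child in node:
            sub = solve(child)
            for parent_state in (0, 1):
                options = []
                for child_state in (0, 1):
                    cost, gains, losses = sub[child_state]
                    if parent_state == 0 and child_state == 1:
                        cost, gains = cost + hgt_penalty, gains + 1  # gain on this branch
                    elif parent_state == 1 and child_state == 0:
                        cost, losses = cost + 1, losses + 1  # loss on this branch
                    options.append((cost, gains, losses))
                best = min(options)
                acc = totals[parent_state]
                totals[parent_state] = (acc[0] + best[0],
                                        acc[1] + best[1],
                                        acc[2] + best[2])
        return totals

    root = solve(tree)
    absent_at_root = root[0]
    cost, gains, losses = root[1]
    # Presence at the root implies one gain (the family's origin) above it.
    present_at_root = (cost + hgt_penalty, gains + 1, losses)
    _, gains, losses = min(absent_at_root, present_at_root)
    return gains, losses

# Toy four-species tree and profile (hypothetical, not data from this study).
toy_tree = (("A", "B"), ("C", "D"))
toy_profile = {"A", "C"}
low = reconstruct(toy_tree, toy_profile, 1)   # cheap gains: two independent gains
high = reconstruct(toy_tree, toy_profile, 3)  # costly gains: one origin, two losses
```

With a low penalty the patchy toy profile is explained by two gains and no losses; with a high penalty the same profile is explained by a single origin followed by two losses. This is the trade-off that the penalty scan in this section explores.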
Stability of Average Prokaryotic Genome Size
To identify the scenario with the most stable family content, we investigated the predicted balance between gene gain and loss at various HGT penalties (Fig. 2). On both 16S-rRNA- and gene-content-derived trees, gene gain prevails at HGT penalties lower than 2, and gene loss prevails at HGT penalties higher than 3, again indicating that the optimal threshold is between these two values (Fig. 2).
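The balance criterion can be sketched as a search for the penalty at which total gains minus total losses changes sign. The snippet below is an illustrative assumption about how such a crossing could be located by linear interpolation between tested penalties; the function name `balance_point` and all counts are invented toy values, not the counts behind Fig. 2.

```python
def balance_point(penalties, gains, losses):
    """Interpolate the penalty at which gains - losses crosses zero.

    Returns None if the difference never changes sign over the tested range.
    """
    diffs = [g - l for g, l in zip(gains, losses)]
    points = list(zip(penalties, diffs))
    for (p0, d0), (p1, d1) in zip(points, points[1:]):
        if d0 == 0:
            return float(p0)
        if d0 * d1 < 0:
            # Linear interpolation between the two bracketing penalties.
            return p0 + (p1 - p0) * d0 / (d0 - d1)
    if points and points[-1][1] == 0:
        return float(points[-1][0])
    return None

# Invented toy counts, for illustration only (NOT results from this study).
toy_penalties = [1, 2, 3, 4, 5]
toy_gains = [900, 700, 500, 400, 350]
toy_losses = [300, 600, 700, 800, 850]
toy_crossing = balance_point(toy_penalties, toy_gains, toy_losses)
```

The same helper could be applied to the other criterion by passing the observed loss-to-HGT ratio and the expected ratio (the penalty itself) in place of gains and losses.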
Thus, two measures (the correspondence of the expected and observed ratios between gene loss and HGT, and the balance between gene loss and gene gain) indicate that an optimal threshold value for the HGT penalty lies between 2 and 3. These values not only correspond to optimal parameters for this analysis, but may also reflect a genuine biological effect, indicating that gene loss is between two and three times more frequent than HGT. As gene genesis must compensate for the remainder of gene loss, we estimate that its contribution should be up to twofold the amount of HGT.
The fraction of families involved in HGT can be estimated once the HGT penalty is known (Fig. 3). Although on the shuffled tree most of the families are unrealistically indicated to be involved in HGT, the genuine trees (both 16S rRNA and gene content) imply that most protein families are gained exactly once and never transferred horizontally. We estimate that the fraction of protein families involved in horizontal transfer in the genomes under consideration is between 25% and 39% (Fig. 3).
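The twofold figure for gene genesis follows from two statements made in this section: gene loss is p times as frequent as HGT, where p is the HGT penalty, and gene gain balances gene loss. A short derivation, in which the symbols L (loss events), H (HGT events), G (gene genesis events), and p are introduced here for illustration only:

\[
L = pH, \quad 2 \le p \le 3; \qquad G + H = L \;\Longrightarrow\; G = (p - 1)H, \quad 1 \le \frac{G}{H} \le 2 .
\]

At the upper bound p = 3, gene genesis contributes twice as many events as HGT, and at the lower bound p = 2 the two contribute equally.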
Evolution of Individual Species
Evidently, the present set contains a multitude of pathogenic bacteria, and as such may not sufficiently represent the bacterial world. Yet obligatory parasitic bacteria were consistently reported to be derived by regressive evolution, and there is an overall agreement of the described evolutionary scenarios with present knowledge, indicating the robustness of our approach.
Model Validation With the Strain-Specific Genes of Helicobacter pylori
We have analyzed the genome of Helicobacter pylori, for strains
J99 (Alm et al. 1999
The analysis of protein families containing strain-specific genes of the two H. pylori strains indicates that the presence of 13 of these genes can be attributed to either gene gain or gene loss (9 and 4, respectively). This result is also supported by detailed manual analysis, including the generation and bootstrapping of dendrograms (Table 2).
In virtually all cases (Table
2), there is total agreement between the gene content and 16S rRNA
trees. An estimate of precision for the method would be 81% (13 out of 16
cases), with three undecided and no false positive cases. It is encouraging
that in some cases, our predictions are better than anomalous nucleotide
composition, for example, in the case of genes HP0447 and HP1045
(Table 2). Although the
detection of closest homologs by BLAST
(Altschul et al. 1997
We have attempted to quantify the major events during the evolution of gene families, namely, gene genesis, loss, and horizontal gene transfer (HGT). Evolutionary scenarios for individual protein families were generated, with gain and loss events reported. The relative frequencies of the events shaping genome content were estimated by two methods: the correspondence between the observed and expected ratio of gene loss and HGT and the assessment of the balance between gene gain and loss. Both methods indicate that loss is up to threefold more frequent than HGT, and gene genesis contributes up to twofold as many genes as HGT.
Although our approach provides the very first attempt to estimate the ratio of processes shaping gene content, this type of analysis is dependent on the availability of genome sequences. It is possible that with wider representation of more species in the phylogenetic tree, some of the events presently interpreted as gene genesis in sparsely sampled clades may turn out to represent HGT events. Also, HGT from extinct clades may result in assignment of gene genesis, although this would require all the descendants of the clade generating the gene to be extinct. On the other hand, our analysis refers to protein families, rather than individual genes, and thus gene loss may be underestimated. A single gene genesis or HGT event introducing a member of a new family into a clade will be detected, whereas multiple gene loss events may be needed to eliminate all members of a multigene family. A future approach may quantify the processes discussed for individual genes, rather than protein families, as well as quantify the amount of gene duplication.
The number of families involved in horizontal transfer is estimated between 25% and 39% of the total number of families examined.
Thus, although HGT can be considered as a significant factor that shapes prokaryotic genome sequences, it is remarkable that phylogenetic distributions of at least 60% of protein families can be explained merely by vertical inheritance. Although on average gene gain and loss were assumed to be balanced, it is evident that evolution of individual lineages might significantly deviate from this balance, consistent with present knowledge. A case study of strain-specific genes of H. pylori strains implies that the precision of the method is at least 81%. With a multitude of yeast, plant, and animal genomes becoming available, a similar analysis could reveal how the contribution of the processes shaping genome content differs in eukaryotes. This approach has the potential to provide insights into the emergence of complex cellular processes and potentially restore the complete gene content of ancestral species.
Protein families were derived using an all-against-all clustering of complete genome sequences with the TRIBE-MCL algorithm (inflation value 2; Enright et al. 2002
To eliminate bias toward a particular type of phylogenetic tree, we have
used two independently derived trees. First, the 16S rRNA tree, derived from
multiple alignments of the 16S rRNA gene sequences (downloaded from the RDP;
Maidak et al. 2001
Phylogenetic profiles (Pellegrini et al. 1999
The initial input for GeneTRACE consists of phylogenetic profiles of protein families and an evolutionary tree spanning all species involved. Inner nodes on this tree represent ancestral species. Two types of events are considered: protein family gain and loss; gene gain can be further classified as gene genesis or HGT. The algorithm consists of the following stages:
This two-pass procedure is an improvement over the original approach
suggested by Snel et al.
(2002
We thank members of the Computational Genomics Group, especially Anton Enright, for discussions and Santiago Garcia-Vallvé (University Rovira i Virgili, Tarragona, Spain) for sharing data on genes with anomalous nucleotide composition. This work was supported by the European Molecular Biology Laboratory. C.O. thanks the UK Medical Research Council, EMBO, and IBM Research for additional support. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1092603.
1 Corresponding author. [Supplemental material is available online at www.genome.org.]
Alm, R.A., Ling, L.S., Moir, D.T., King, B.L., Brown, E.D., Doig, P.C., Smith, D.R., Noonan, B., Guild, B.C., deJonge, B.L., et al. 1999. Genomic-sequence comparison of two unrelated isolates of the human gastric pathogen Helicobacter pylori. Nature 397: 176-180.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Andersson, J.O. 2000. Evolutionary genomics: Is Buchnera a bacterium or an organelle? Curr. Biol. 10: R866-R868.
Andersson, J.O. and Andersson, S.G. 1999. Insights into the evolutionary process of genome degradation. Curr. Opin. Genet. Dev. 9: 664-671.
Bernal, A., Ear, U., and Kyrpides, N. 2001. Genomes OnLine Database GOLD: A monitor of genome projects world-wide. Nucleic Acids Res. 29: 126-127.
Cavalier-Smith, T. 1985. The evolution of genome size. John Wiley & Sons, Chichester, UK.
Cole, S.T., Eiglmeier, K., Parkhill, J., James, K.D., Thomson, N.R., Wheeler, P.R., Honore, N., Garnier, T., Churcher, C., Harris, D., et al. 2001. Massive gene decay in the leprosy bacillus. Nature 409: 1007-1011.
Doolittle, R.F. 2002. Biodiversity: Microbial genomes multiply. Nature 416: 697-700.
Eisen, J.A. 2000. Horizontal gene transfer among microbial genomes: New insights from complete genome analysis. Curr. Opin. Genet. Dev. 10: 606-611.
Enright, A.J., Van Dongen, S., and Ouzounis, C.A. 2002. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30: 1575-1584.
Garcia-Vallvé, S., Romeu, A., and Palau, J. 2000. Horizontal gene transfer in bacterial and archaeal complete genomes. Genome Res. 10: 1719-1725.
Janssen, P.J., Audit, B., and Ouzounis, C.A. 2001. Strain-specific genes of Helicobacter pylori: Distribution, function and dynamics. Nucleic Acids Res. 29: 4395-4404.
Koski, L.B. and Golding, G.B. 2001. The closest BLAST hit is often not the nearest neighbor. J. Mol. Evol. 52: 540-542.
Kunin, V. and Ouzounis, C.A. 2003. GeneTRACE: Reconstruction of gene content of ancestral species. Bioinformatics (in press).
Maidak, B.L., Cole, J.R., Lilburn, T.G., Parker Jr., C.T., Saxman, P.R., Farris, R.J., Garrity, G.M., Olsen, G.J., Schmidt, T.M., and Tiedje, J.M. 2001. The RDP-II Ribosomal Database Project. Nucleic Acids Res. 29: 173-174.
Mira, A., Ochman, H., and Moran, N.A. 2001. Deletional bias and the evolution of bacterial genomes. Trends Genet. 17: 589-596.
Moran, N.A. 2002. Microbial minimalism: Genome reduction in bacterial pathogens. Cell 108: 583-586.
Ochman, H. and Jones, I.B. 2000. Evolutionary dynamics of full genome content in Escherichia coli. EMBO J. 19: 6637-6643.
Ochman, H., Lawrence, J.G., and Groisman, E.A. 2000. Lateral gene transfer and the nature of bacterial innovation. Nature 405: 299-304.
Ohno, S. 1970. Evolution by gene duplication. Springer-Verlag, New York.
Ouzounis, C. 1999. Orthology: Another terminology muddle. Trends Genet. 15: 445.
Ouzounis, C. and Kyrpides, N. 1996. The emergence of major cellular processes in evolution. FEBS Lett. 390: 119-123.
Pellegrini, M., Marcotte, E.M., Thompson, M.J., Eisenberg, D., and Yeates, T.O. 1999. Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc. Natl. Acad. Sci. 96: 4285-4288.
Ragan, M.A. 2001. Detection of lateral gene transfer among microbial genomes. Curr. Opin. Genet. Dev. 11: 620-626.
Snel, B., Bork, P., and Huynen, M.A. 1999. Genome phylogeny based on gene content. Nat. Genet. 21: 108-110.
Snel, B., Bork, P., and Huynen, M.A. 2002. Genomes in flux: The evolution of archaeal and proteobacterial gene content. Genome Res. 12: 17-25.
Tomb, J.F., White, O., Kerlavage, A.R., Clayton, R.A., Sutton, G.G., Fleischmann, R.D., Ketchum, K.A., Klenk, H.P., Gill, S., Dougherty, B.A., et al. 1997. The complete genome sequence of the gastric pathogen Helicobacter pylori. Nature 388: 539-547.
Wallace, D.C. and Morowitz, H.J. 1973. Genome size and evolution. Chromosoma 40: 121-126.
Wolfe, K.H. and Shields, D.C. 1997. Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387: 708-713.
Zipkas, D. and Riley, M. 1975. Proposal concerning mechanism of evolution of the genome of Escherichia coli. Proc. Natl. Acad. Sci. 72: 1354-1358.
Received January 2, 2003; accepted in revised form April 22, 2003.