Genome Res. 14:1036-1042, 2004
©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Letter
Bacterial Genomes as New Gene Homes: The Genealogy of ORFans in E. coli
Department of Biochemistry & Molecular Biophysics, University of Arizona, Tucson, Arizona 85721, USA
Differences in gene repertoire among bacterial genomes are usually ascribed to gene loss or to lateral gene transfer from unrelated cellular organisms. However, most bacteria contain large numbers of ORFans, that is, annotated genes that are restricted to a particular genome and that possess no known homologs. The uniqueness of ORFans within a genome has precluded the use of a comparative approach to examine their function and evolution. However, by identifying sequences unique to monophyletic groups at increasing phylogenetic depths, we can make direct comparisons of the characteristics of ORFans of different ages in the Escherichia coli genome, and establish their functional status and evolutionary rates. Relative to the genes ancestral to γ-Proteobacteria and to those genes distributed sporadically in other prokaryotic species, ORFans in the E. coli lineage are short, A+T rich, and evolve quickly. Moreover, most encode functional proteins. Based on these features, ORFans are not attributable to errors in gene annotation, limitations of current databases, or to failure of methods for detecting homology. Rather, ORFans in the genomes of free-living microorganisms apparently derive from bacteriophage and occasionally become established by assuming roles in key cellular functions.
Bacterial genomes display variation in size, even among strains of the same species. And because these microorganisms have very little noncoding or repetitive DNA, the variation in genome size usually reflects differences in gene repertoire. Some species, particularly bacterial parasites and symbionts, have undergone massive genome reduction and simply contain a subset of the genes present in their ancestors (Moran 1996
The high frequencies of ORFans detected in bacterial genomes were originally attributed to the limited set of sequenced genomes then available for comparison, and it was predicted that this category of genes would dwindle as databases expanded. Nevertheless, the number of ORFans in databases has grown despite an increase in the number and diversity of complete genome sequences. A recent survey estimated their frequency to be 14% of the total genes from 60 completely sequenced genomes (Siew and Fischer 2003b
The existence of ORFans in virtually every genome has been termed a "mystery" (Dujon 1996
Previous analyses of the species- or strain-specific genes in bacteria showed that such sequences tend to have lower G+C contents than genes with a wider distribution among species. Charlebois et al. (2003
Although long ORFans are likely to be actual coding sequences, short hypothetical ORFs must be viewed with caution (Ochman 2002
The phylogenetic distributions of annotated genes in the Escherichia coli MG1655 genome are highly variable. At each node of the phylogeny, two classes of clade-specific genes are evident: those with sporadic matches in distantly related prokaryotic species (HOPs, for heterogeneous occurrence in prokaryotes) and those with no detectable match to any sequences in the databases (ORFans). Based on this distinction, the E. coli MG1655 genome contains >500 genes that have no homologs outside of the γ-Proteobacteria, with 64 ORFans that are unique to this genome. The close relationship and relatively recent divergence of the sequenced E. coli strains considered in this analysis imply a rapid mechanism for the generation of ORFans in a genome. In addition to the ORFans found only in the E. coli MG1655 genome (n0), 162 ORFans are restricted to the clade including all sequenced representatives of E. coli (n1), an additional 113 ORFans are confined to the E. coli-Salmonella enterica clade (n2), and 85 ORFans are specific to the enteric bacteria (n3). Moreover, there are numerous HOPs that are both confined to particular clades and detected in some distantly related prokaryotic genome, as well as >2000 native genes ancestral to all γ-Proteobacteria (Fig. 1). Given the criteria that we applied for classifying homologs, including the requirement of conserved gene context, these are conservative estimates of the numbers of genes specific to each clade.
Features of ORFans
ORFans have some peculiar characteristics when compared with other genes. Within all clades, the ORFans spanned a similar size distribution and were significantly shorter than either HOPs or native genes (Fig. 2A). In addition, the ORFans from each clade are A+T rich, with those restricted to younger clades (n0 and n1) showing the most extreme biases in base composition (Fig. 2B). It is interesting to note that, within each clade, the G+C contents of HOPs and ORFans, although biased toward A+T relative to native genes, are distinct, suggesting separate histories for these two classes of genes.
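As a rough illustration of the two properties compared here, gene length and G+C content can be tabulated per gene class with a few lines of Python. This is only a sketch: the function names and the toy sequences are hypothetical and are not taken from the study, which may have computed base composition differently (for example, at particular codon positions).

```python
def gc_content(seq):
    """Fraction of G+C among the unambiguous bases (A, C, G, T) of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return (seq.count("G") + seq.count("C")) / acgt


def summarize(genes):
    """Mean length and mean G+C content for a dict of {gene_name: sequence}."""
    n = len(genes)
    if n == 0:
        raise ValueError("no genes supplied")
    return {
        "mean_length": sum(len(s) for s in genes.values()) / n,
        "mean_gc": sum(gc_content(s) for s in genes.values()) / n,
    }


# Toy input only; real use would pass the ORFan, HOP, and native gene sets.
toy_genes = {"geneA": "ATGC", "geneB": "AATT"}
summary = summarize(toy_genes)
```

Running `summarize` separately on each class within a clade would give the kind of per-class comparison described in the text.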
ORFans Encode Functional Proteins
ORFans, particularly those that are short, have been attributed to errors in gene annotation or possibly pseudogenes (Fischer and Eisenberg 1999
The Recognition and Origination of Fast-Evolving Sequences
The small size and rapid substitution rates of ORFans suggest that the lack of detectable homologs might result from artifacts inherent to the methods used to infer similarity among sequences. However, the presence of genes unique to E. coli MG1655 (and absent from very closely related strains) indicates that these ORFans did not originate from ancestral genes with enhanced evolutionary rates. Similarly, a re-examination of genes restricted to the E. coli-S. enterica clade (n2) reveals that heightened rates of evolution do not affect our ability to recognize orthologs and identify true ORFans. Although ORFans evolve, on average, more quickly than native genes, nearly 80% of the ORFans shared by E. coli and S. enterica have a Ka < 0.2. When native genes having similar or greater levels of divergence (i.e., equal or faster evolutionary rates) were subjected to BLAST similarity searches, homologs could be detected in all γ-Proteobacteria as well as several more distantly related genomes at our E-value threshold. Although the number of genes incorrectly assigned as ORFans is expected to be higher in deeper clades, Fischer and Eisenberg (1999
Genomic Context of ORFans and HOPs
ORFans are annotated open reading frames with no homologs in current databases. This might suggest that many of them are attributable to errors in annotation, to the failure of methods for detecting homology, or to inadequacies of the databases. The first two factors apparently play little role in the generation of ORFans in bacteria. Our analyses indicate that the majority of ORFans confined to γ-Proteobacterial clades are functional proteins (rather than annotation artifacts), and that sequence alignment algorithms combined with the analysis of genome context would likely recognize homologs of ORFans, if present, in other prokaryotes. With regard to the contents of current databases, recent increases in the quantity and diversity of sequenced genomes have not reduced the total number of documented ORFans (Siew and Fischer 2003b
Because species sampling can influence the recognition of the ORFans unique to a genome, we analyzed ORFans appearing over the evolutionary history of a lineage by identifying sequences unique to monophyletic groups containing E. coli MG1655 at increasing phylogenetic depths (Fig. 1). Whereas most previous studies have focused on the ORFans confined to individual genomes, our approach allows direct comparisons of the numbers and characteristics of ORFans of different ages in the E. coli genome, and yields information about their functional status and substitution rates.
Taken together, ORFans in the E. coli lineage are short, functional, A+T rich, and quickly evolving, and can be differentiated based on their sequence properties both from those laterally acquired genes that are distributed in other bacteria (HOPs) and from those genes ancestral to all γ-Proteobacteria. Despite their distinguishing features, ORFans in E. coli do not comprise a static group: the older ORFans (i.e., those present in deeper clades) approach characteristics of native genes in terms of base composition and evolutionary rates, whereas the younger ORFans tend to be clustered and adjacent to laterally transferred sequences. Together with their fast rate of origination in bacterial genomes, this chromosomal distribution suggests that, in bacteria such as E. coli, ORFans do not arise from the degradation of ancestral coding regions or from intergenic sequences, but rather by lateral gene transfer. Moreover, genes that originate together might be expected to become dispersed over time owing to rearrangements, insertions, and deletions, which accounts for the fact that ORFans restricted to shallower clades are more typically found in larger gene clusters.
Similar to what has been observed for bacteria (Charlebois et al. 2003
A Role for Phage in Generating ORFans
Many phages are known to encode short A+T-rich genes, a high proportion of which are ORFans (Pedulla et al. 2003
The introduction of ORFans by an A+T-rich donor population has been occurring throughout the evolutionary history of the
If ORFans originate in phages, it is anticipated that their sequences will harbor additional characteristics of bacteriophage genes. Because dinucleotide frequencies can provide signatures that discriminate among sequences from different organisms and have been used to identify alien genes within genomes (Karlin 1998
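To make the notion of a dinucleotide signature concrete, the sketch below computes one common formulation, the relative abundance of each dinucleotide (observed frequency divided by the product of its component base frequencies, pooled over both strands), and an average absolute difference between two such profiles. It is offered as an illustration of the general idea only; it is not claimed to be the exact computation used in this study, and the function names are hypothetical.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T only are translated)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def dinucleotide_signature(seq):
    """Relative abundance f(XY) / (f(X) * f(Y)) for all 16 dinucleotides,
    with counts pooled over the sequence and its reverse complement."""
    seq = seq.upper()
    mono = {b: 0 for b in "ACGT"}
    di = {a + b: 0 for a, b in product("ACGT", repeat=2)}
    for strand in (seq, reverse_complement(seq)):
        for base in strand:
            if base in mono:
                mono[base] += 1
        for i in range(len(strand) - 1):
            pair = strand[i:i + 2]
            if pair in di:  # skips pairs containing ambiguous bases
                di[pair] += 1
    n_mono, n_di = sum(mono.values()), sum(di.values())
    if n_di == 0:
        raise ValueError("need at least two adjacent unambiguous bases")
    signature = {}
    for pair, count in di.items():
        expected = (mono[pair[0]] / n_mono) * (mono[pair[1]] / n_mono)
        # Assumption: report 0.0 when a component base is absent altogether.
        signature[pair] = (count / n_di) / expected if expected > 0 else 0.0
    return signature


def signature_distance(seq_a, seq_b):
    """Mean absolute difference between two dinucleotide signatures."""
    sig_a = dinucleotide_signature(seq_a)
    sig_b = dinucleotide_signature(seq_b)
    return sum(abs(sig_a[p] - sig_b[p]) for p in sig_a) / 16
```

In practice, such a distance would be computed between a set of candidate genes and reference sets (for example, host genes versus phage genes); short sequences give noisy estimates, so genes are usually pooled before comparison.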
The Function of ORFans
When they originate, ORFans are unlikely to encode essential functions; but if maintained, ORFans can become incorporated into cellular processes and take on roles more crucial to cell survival. This could occur either by assuming the function of an ancestral gene or by conferring a new property that is integral to the host cell. As evidence of these processes, some of the ORFans restricted to the n3 clade are conserved in the highly reduced genome of the aphid symbiont Buchnera aphidicola. Because Buchnera has eliminated the vast majority of genes that were present in its free-living ancestor, its genome is thought to encode very few accessory functions and to have retained a minimal set of required genes. Among the ORFans conserved by E. coli and Buchnera is dnaT, which encodes a primosome assembly protein responsible for loading the replicative helicase DnaB onto DNA. Its immediate neighbor in both Escherichia and Buchnera is the gene specifying DnaC, another primosome assembly protein, which was classified as an HOP specific to the n3 clade because of its weak similarity to a protein in Gram-positive bacteria. Therefore, both dnaT and dnaC were likely acquired together in the ancestor of enteric bacteria and have since taken on a role in DNA replication previously performed by nonorthologous genes. This hypothesis is further supported by the detection of genes with similarity to dnaC and dnaT in two bacteriophages (Epsilon NC_004775 and P27 NC_003356, respectively). An additional 46 ORFans (n0-n4) show significant similarity to genes in sequenced phage genomes (Supplemental Table 1). Although the contribution of bacteriophage to the evolution of pathogenicity in bacteria has been well documented, these results suggest a more profound role of phage in bacterial evolution.
Two mechanisms, based on separate findings and having different evolutionary implications, could account for the existence of ORFans in bacterial genomes. In the first, ORFans are the remnants of ancestral sequences that result from the erosion and degradation of previously functional genes, and their presence is viewed as an indicator of genome dynamics (Amiri et al. 2003
Delimiting ORFans in Clades of Different Phylogenetic Depths
We initially queried all completely sequenced prokaryotic genomes (n = 94) available in the EMGLib database (Perriere et al. 2000 γ-Proteobacteria, including Vibrio cholerae (Schoolnik and Yildiz 2000 γ-Proteobacteria based on their genetic relationships, as established by phylogenetic analysis of the 200 genes conserved among all taxa considered (Lerat et al. 2003
We first considered the ORFs restricted to the E. coli MG1655 genome (n0) and then to four key clades (Fig. 1), corresponding to the E. coli species (n1), the Escherichia-Salmonella group (n2), the enterics (n3), and the group including Vibrio spp., Haemophilus, Pasteurella, and the enterics (n4). To minimize the inclusion of genes expected to have sporadic distributions among bacteria, we removed all recognized IS elements as well as sequences associated with known prophages from the E. coli MG1655 gene set. Then, we defined clade-specific ORFans as those having no detectable homologs outside of a specific clade. Because of deletion events (particularly in reduced genomes), it is possible that some clade-specific ORFs are missing in certain members of a clade. Thus, for ORFans specific to n0, we retain all ORFs from E. coli MG1655 that have no match (E-value < 0.01) in any other genome considered. Similarly, the ORFans specific to clade n1 are those present in E. coli MG1655 and at least one of the other sequenced E. coli strains (E-value < 10⁻⁵) but absent (E-value > 0.01) from the other genomes. The ORFans specific to the deeper clades n2, n3, and n4 were similarly defined, but with the additional requirement that ORFs be present in at least one genome from each clade subsumed by the deeper clade. Those ORFs present in each clade and in at least one of the other
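The assignment rules for n0 through n4 can be sketched as a small decision procedure. The Python below is a simplified reading of those rules, not the pipeline used in the study: it takes, for one E. coli MG1655 ORF, the best E-value against each other genome, and uses cutoffs of 0.01 (absence) and 1e-5 (presence). The genome labels, the integer "level" encoding (1 to 4 for the shallowest clade n1 to n4 containing a genome, 5 for genomes outside n4), the handling of E-values exactly at a cutoff, and the interpretation of "each clade subsumed by the deeper clade" as "each level up to and including the target clade" are all assumptions made for illustration. The conserved gene context requirement is not modeled.

```python
ABSENT_E = 0.01    # hits at or above this E-value are treated as "no match"
PRESENT_E = 1e-5   # hits below this E-value are treated as "present"
MAX_LEVEL = 4      # deepest clade considered (n4)


def classify_orfan(best_evalue, genome_level):
    """Return 'n0'..'n4' for a clade-specific ORFan candidate, else None.

    best_evalue:  {genome: best E-value of the MG1655 ORF against that genome};
                  genomes with no hit may simply be omitted.
    genome_level: {genome: 1..4 for clades n1..n4, or 5 for genomes outside n4}.
    """
    def evalue(genome):
        return best_evalue.get(genome, float("inf"))

    # n0: no match in any other genome considered.
    if all(evalue(g) >= ABSENT_E for g in genome_level):
        return "n0"

    for k in range(1, MAX_LEVEL + 1):
        # The ORF must be absent from every genome outside clade n_k.
        outside_clean = all(
            evalue(g) >= ABSENT_E for g, lvl in genome_level.items() if lvl > k
        )
        if not outside_clean:
            continue
        # Assumed reading: a confident hit is required at each level 1..k.
        present_at_each_level = all(
            any(evalue(g) < PRESENT_E
                for g, lvl in genome_level.items() if lvl == sub)
            for sub in range(1, k + 1)
        )
        return f"n{k}" if present_at_each_level else None

    # Matches outside n4: not an ORFan under these rules (HOP or native gene).
    return None


# Hypothetical genome labels, for illustration only.
levels = {"Ecoli_O157": 1, "Ecoli_CFT073": 1, "S_enterica": 2,
          "Y_pestis": 3, "V_cholerae": 4, "P_aeruginosa": 5}
```

For example, `classify_orfan({"Ecoli_O157": 1e-30, "S_enterica": 1e-20}, levels)` returns `"n2"`, whereas an ORF whose only hit is a marginal one (say 1e-3) in another E. coli strain is left unclassified, since it is neither clearly absent nor clearly present.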
Genes With Heterogeneous Occurrence in Prokaryotes (HOPs)
Confirmation of ORFans and HOPs
Compositional Features and Substitution Rates
This work was funded by grants from the NIH (GM56120) and the DOE (DEFG0301ER63147). The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2231904.
1 Corresponding author. [Supplemental material is available online at www.genome.org.]
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Amiri, H., Davids, W., and Andersson, S.G. 2003. Birth and death of orphan genes in rickettsia. Mol. Biol. Evol. 20: 1575-1587.
Blattner, F.R., Plunkett III, G., Bloch, C.A., Perna, N.T., Burland, V., Riley, M., Collado-Vides, J., Glasner, J.D., Rode, C.K., Mayhew, G.F., et al. 1997. The complete genome sequence of Escherichia coli K-12. Science 277: 1453-1474.
Bubunenko, M.G. and Subramanian, A.R. 1994. Recognition of novel and divergent higher plant chloroplast ribosomal proteins by Escherichia coli ribosome during in vivo assembly. J. Biol. Chem. 269: 18223-18231.
Charlebois, R.L., Clarke, G.D., Beiko, R.G., and St Jean, A. 2003. Characterization of species-specific genes using a flexible, web-based querying system. FEMS Microbiol. Lett. 225: 213-220.
da Silva, A.C., Ferro, J.A., Reinach, F.C., Farah, C.S., Furlan, L.R., Quaggio, R.B., Monteiro-Vitorello, C.B., Van Sluys, M.A., Almeida, N.F., Alves, L.M., et al. 2002. Comparison of the genomes of two Xanthomonas pathogens with differing host specificities. Nature 417: 459-463.
Daubin, V., Lerat, E., and Perriere, G. 2003. The source of laterally transferred genes in bacterial genomes. Genome Biol. 4: R57.
Deng, W., Burland, V., Plunkett III, G., Boutin, A., Mayhew, G.F., Liss, P., Perna, N.T., Rose, D.J., Mau, B., Zhou, S., et al. 2002. Genome sequence of Yersinia pestis KIM. J. Bacteriol. 184: 4601-4611.
Domazet-Loso, T. and Tautz, D. 2003. An evolutionary analysis of orphan genes in Drosophila. Genome Res. 13: 2213-2219.
Dujon, B. 1996. The yeast genome project: What did we learn? Trends Genet. 12: 263-270.
Fischer, D. and Eisenberg, D. 1999. Finding families for genomic ORFans. Bioinformatics 15: 759-762.
Fleischmann, R.D., Adams, M.D., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage, A.R., Bult, C.J., Tomb, J.F., Dougherty, B.A., Merrick, J.M., et al. 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496-512.
Florea, L., McClelland, M., Riemer, C., Schwartz, S., and Miller, W. 2003. EnteriX 2003: Visualization tools for genome alignments of Enterobacteriaceae. Nucleic Acids Res. 31: 3527-3532.
Hayashi, T., Makino, K., Ohnishi, M., Kurokawa, K., Ishii, K., Yokoyama, K., Han, C.-G., Ohtsubo, E., Nakayama, K., and Murata, T. 2001. Complete genome sequence of enterohemorrhagic Escherichia coli O157:H7 and genomic comparison with a laboratory strain K-12. DNA Res. 8: 11-22.
Hendrix, R.W., Lawrence, J.G., Hatfull, G.F., and Casjens, S. 2000. The origins and ongoing evolution of viruses. Trends Microbiol. 8: 504-508.
Juhala, R.J., Ford, M.E., Duda, R.L., Youlton, A., Hatfull, G.F., and Hendrix, R.W. 2000. Genomic sequences of bacteriophages HK97 and HK022: Pervasive genetic mosaicism in the lambdoid bacteriophages. J. Mol. Biol. 299: 27-51.
Karlin, S. 1998. Global dinucleotide signatures and analysis of genomic heterogeneity. Curr. Opin. Microbiol. 1: 598-610.
Keiper, B.D. and Wormington, W.M. 1990. Nucleotide sequence and 40 S subunit assembly of Xenopus laevis ribosomal protein S22. J. Biol. Chem. 265: 19397-19400.
Lawrence, J.G. and Ochman, H. 1997. Amelioration of bacterial genomes: Rates of change and exchange. J. Mol. Evol. 44: 383-397.
Lerat, E., Daubin, V., and Moran, N.A. 2003. From gene trees to organismal phylogeny in prokaryotes: The case of the
Li, W.-H. 1997. Molecular evolution. In Molecular evolution (ed. W.-H. Li). Sinauer Associates, Inc., Sunderland, MA.
Makino, K., Oshima, K., Kurokawa, K., Yokoyama, K., Uda, T., Tagomori, K., Iijima, Y., Najima, M., Nakano, M., Yamashita, A., et al. 2003. Genome sequence of Vibrio parahaemolyticus: A pathogenic mechanism distinct from that of V. cholerae. Lancet 361: 743-749.
May, B.J., Zhang, Q., Li, L.L., Paustian, M.L., Whittam, T.S., and Kapur, V. 2001. Complete genomic sequence of Pasteurella multocida, Pm70. Proc. Natl. Acad. Sci. 98: 3460-3465.
McClelland, M., Sanderson, K.E., Spieth, J., Clifton, S.W., Latreille, P., Courtney, L., Porwollik, S., Ali, J., Dante, M., Du, F., et al. 2001. Complete genome sequence of Salmonella enterica serovar Typhimurium LT2. Nature 413: 852-856.
Moran, N.A. 1996. Accelerated evolution and Muller's rachet in endosymbiotic bacteria. Proc. Natl. Acad. Sci. 93: 2873-2878.
Ochman, H. 2002. Distinguishing the ORFs from the ELFs: Short bacterial genes and the annotation of genomes. Trends Genet. 18: 335-337.
Parkhill, J., Dougan, G., James, K., Thomson, N., Pickard, D., Wain, J., Churcher, C., Mungall, K., Bentley, S., and Holden, M. 2001a. Complete genome sequence of a multiple drug resistant Salmonella enterica serovar Typhi CT18. Nature 413: 848-852.
Parkhill, J., Wren, B.W., Thomson, N.R., Titball, R.W., Holden, M.T., Prentice, M.B., Sebaihia, M., James, K.D., Churcher, C., Mungall, K.L., et al. 2001b. Genome sequence of Yersinia pestis, the causative agent of plague. Nature 413: 523-527.
Pedulla, M.L., Ford, M.E., Houtz, J.M., Karthikeyan, T., Wadsworth, C., Lewis, J.A., Jacobs-Sera, D., Falbo, J., Gross, J., Pannunzio, N.R., et al. 2003. Origins of highly mosaic mycobacteriophage genomes. Cell 113: 171-182.
Perna, N., Plunkett, G., Burland, V., Mau, B., Glasner, J., Rose, D., Mayhew, G., Evans, P., Gregor, J., and Kirkpatrick, H. 2001. Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature 409: 529-533.
Perriere, G., Bessieres, P., and Labedan, B. 2000. EMGLib: The enhanced microbial genomes library (update 2000). Nucleic Acids Res. 28: 68-71.
Rocha, E.P. and Danchin, A. 2002. Base composition bias might result from competition for metabolic resources. Trends Genet. 18: 291-294.
Schoolnik, G.K. and Yildiz, F.H. 2000. The complete genome sequence of Vibrio cholerae: A tale of two chromosomes and of two lifestyles. Genome Biol. 1: REVIEWS1016.
Selinger, D.W., Cheung, K.J., Mei, R., Johansson, E.M., Richmond, C.S., Blattner, F.R., Lockhart, D.J., and Church, G.M. 2000. RNA expression analysis using a 30 base pair resolution Escherichia coli genome array. Nat. Biotechnol. 18: 1262-1268.
Shigenobu, S., Watanabe, H., Hattori, M., Sakaki, Y., and Ishikawa, H. 2000. Genome sequence of the endocellular bacterial symbiont of aphids Buchnera sp. APS. Nature 407: 81-86.
Siew, N. and Fischer, D. 2003a. Analysis of singleton ORFans in fully sequenced microbial genomes. Proteins 53: 241-251.
Siew, N. and Fischer, D. 2003b. Twenty thousand ORFan microbial protein families for the biologist? Structure (Camb.) 11: 7-9.
Simpson, A.J., Reinach, F.C., Arruda, P., Abreu, F.A., Acencio, M., Alvarenga, R., Alves, L.M., Araya, J.E., Baia, G.S., Baptista, C.S., et al. 2000. The genome sequence of the plant pathogen Xylella fastidiosa. The Xylella fastidiosa Consortium of the Organization for Nucleotide Sequencing and Analysis. Nature 406: 151-157.
Stover, C.K., Pham, X.Q., Erwin, A.L., Mizoguchi, S.D., Warrener, P., Hickey, M.J., Brinkman, F.S., Hufnagle, W.O., Kowalik, D.J., Lagrou, M., et al. 2000. Complete genome sequence of Pseudomonas aeruginosa PAO1, an opportunistic pathogen. Nature 406: 959-964.
Sueoka, N. 1988. Directional mutation pressure and neutral molecular evolution. Proc. Natl. Acad. Sci. 85: 2653-2657.
Tamas, I., Klasson, L., Canback, B., Naslund, A.K., Eriksson, A.S., Wernegreen, J.J., Sandstrom, J.P., Moran, N.A., and Andersson, S.G. 2002. 50 million years of genomic stasis in endosymbiotic bacteria. Science 296: 2376-2379.
Wada, A. 1998. Growth phase coupled modulation of Escherichia coli ribosomes. Genes Cells 3: 203-208.
Wei, J., Goldberg, M.B., Burland, V., Venkatesan, M.M., Deng, W., Fournier, G., Mayhew, G.F., Plunkett III, G., Rose, D.J., Darling, A., et al. 2003. Complete genome sequence and comparative genomics of Shigella flexneri serotype 2a strain 2457T. Infect. Immun. 71: 2775-2786.
Welch, R.A., Burland, V., Plunkett III, G., Redford, P., Roesch, P., Rasko, D., Buckles, E.L., Liou, S.R., Boutin, A., Hackett, J., et al. 2002. Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. Proc. Natl. Acad. Sci. 99: 17020-17024.
http://globin.cse.psu.edu/enterix; Percent Identity Plots on the EnteriX server.
Received December 2, 2003; accepted in revised form February 24, 2004.