Published online before print
May 17, 2005, 10.1101/gr.3638405 Genome Res. 15:867-874, 2005 ©2005 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/05 $5.00
Methods
Identification of genomic features using microsyntenies of domains: Domain teams
1 Laboratoire Génome et Informatique, CNRS/UEVE, 91034 Evry cedex, France
2 Infobiogen, 91034 Evry cedex, France
3 LacIM, Université du Québec à Montréal, Montréal, Québec, Canada
4 Soluscience, Biopôle Clermont-Limagne, 63360 Saint-Beauzire, France
The detection, across several genomes, of local conservation of gene content and proximity considerably helps the prediction of features of interest, such as gene fusions or physical and functional interactions. Here, we want to process realistic models of chromosomes, in which genes (or genomic segments of several genes) can be duplicated within a chromosome, or be absent from some other chromosome(s). Our approach adopts the technique of temporarily forgetting genes and working directly with protein "domains" such as those found in Pfam. This allows the detection of strings of domains that are conserved in their content, but not necessarily in their order, which we refer to as domain teams. The prominent feature of the method is that it relaxes the rigidity of the orthology criterion and avoids many of the pitfalls of gene-family identification methods, which are often hampered by multidomain proteins or low levels of sequence similarity. This approach, which allows both inter- and intrachromosomal comparisons, proves to be more sensitive than the classical methods based on pairwise sequence comparisons, particularly in the simultaneous treatment of many species. The automated and fast detection of domain teams, together with its increased sensitivity at identifying segments of identical (protein-coding) gene contents as well as gene fusions, should prove a useful complement to other existing methods.
Protein structures and sequences can often be split up into "domains." Databases such as SCOP for the structures (Andreeva et al. 2004
Although the term "synteny" originally referred to gene loci on the same chromosome, it is now widely used to refer to gene loci in different organisms, located on a chromosomal region of common evolutionary ancestry (Passarge et al. 1999
Syntenic regions in eukaryotic genomes are generally defined as groups of two or more genes in one species that possess an ortholog on the same chromosome in another species, irrespective of their orientation or order (Pevzner and Tesler 2003
In this study, we reinvestigate the search for microsyntenies by temporarily forgetting genes and working directly with protein domains, such as those found in Pfam (Bateman et al. 2004 We implemented this concept in a program named DomainTeam, freely available on request for academic purposes. The strengths and limitations of this approach are discussed in detail in this work.
For reasons that will be made clear in the Results section, we consider only prokaryotic organisms here. From a computational point of view, a chromosome can be defined as a collection of genes. Focusing on protein-coding genes, we define a chromosome as an ordered sequence of genes, where a unique coding sequence is associated with the nucleic acid sequence of each gene. In addition, we divide each gene into one or more consecutive domains, each domain having a label. In the present case, the domains are the Pfam domains of the encoded proteins (Pfam imposes a nonoverlapping rule on domains). In those few cases where a domain is inserted within another one (Bateman et al. 2004
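The chromosome model described in this paragraph can be sketched as a small data structure. This is only an illustration under stated assumptions, not the DomainTeam implementation: the `Gene` class, the `domain_positions` and `distance` helpers, and the gene names and labels are hypothetical. Giving an unlabeled gene a single placeholder position is also an assumption, chosen so that unlabeled genes still count as insertions between labeled domains.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Gene:
    """A protein-coding gene with its ordered, non-overlapping domain labels."""
    name: str
    domains: List[str] = field(default_factory=list)


def domain_positions(chromosome: List[Gene]) -> List[Tuple[int, Optional[str], str]]:
    """Flatten an ordered list of genes into (position, label, gene name) triples.

    Positions count domains in chromosomal order, whatever the strand.
    Assumption: a gene without any label occupies one position with label None.
    """
    flat: List[Tuple[int, Optional[str], str]] = []
    for gene in chromosome:
        labels: List[Optional[str]] = list(gene.domains) if gene.domains else [None]
        for label in labels:
            flat.append((len(flat), label, gene.name))
    return flat


def distance(position_a: int, position_b: int) -> int:
    """Distance between two domains: the difference between their positions."""
    return abs(position_a - position_b)


# Hypothetical chromosome: a two-domain gene, an unlabeled gene, a one-domain gene.
example = [Gene("g1", ["A", "B"]), Gene("g2"), Gene("g3", ["C"])]
flat_example = domain_positions(example)
```

Here the domain labeled A sits at position 0 and the domain labeled C at position 3, so their distance is 3.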
The distance between two domains on the same chromosome is the difference between their positions. The position of a domain is defined using the order in which the domains appear on the chromosome (considering both DNA strands). Given a set S of domain labels, and a fixed distance C = ABD EFBCAGH IJAKBCLM NOPCAQARS
With
The content of a
Definition 1
For example, in the above set C of chromosomes, the set S = {A, B, C} is a
Figure 1 shows an example of a domain team found in four different organisms, exhibiting significant rearrangements. The five domains present in Yersinia pestis are transposed, reversed, and duplicated in Salmonella typhi, Escherichia coli, and Vibrio cholerae. Another example is shown in the Supplemental material (part 2), depicting a team found in a set of 10 pathogenic bacteria.
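The notion of distance-bounded chains can be illustrated on the four example chromosomes listed above (ABD, EFBCAGH, IJAKBCLM, NOPCAQARS). The sketch below rests on assumptions, because the exact wording of the definitions is not reproduced in this text: following the gene-team formalism this work builds on, a chain is taken to be a run of domains labeled in S whose consecutive members lie at most `delta` positions apart, and a chain whose content equals S is treated as an occurrence of S. The function names and the value `delta = 3` (borrowed from the parameter used in the Results) are illustrative only.

```python
from typing import List, Set, Tuple

Chain = Tuple[int, int, Set[str]]  # (first position, last position, content)


def maximal_chains(chromosome: str, labels: Set[str], delta: int) -> List[Chain]:
    """Split the domains labeled in `labels` into maximal runs whose
    consecutive members are at most `delta` positions apart."""
    positions = [i for i, label in enumerate(chromosome) if label in labels]
    chains: List[Chain] = []
    start = previous = None
    for position in positions:
        if previous is not None and position - previous > delta:
            # Gap too large: close the current chain and open a new one.
            chains.append((start, previous, set(chromosome[start:previous + 1]) & labels))
            start = position
        if start is None:
            start = position
        previous = position
    if previous is not None:
        chains.append((start, previous, set(chromosome[start:previous + 1]) & labels))
    return chains


def occurrences(chromosomes: List[str], labels: Set[str], delta: int) -> List[Tuple[int, int, int]]:
    """Return (chromosome index, first, last) for chains whose content is exactly `labels`."""
    found = []
    for index, chromosome in enumerate(chromosomes):
        for first, last, content in maximal_chains(chromosome, labels, delta):
            if content == labels:
                found.append((index, first, last))
    return found


chromosomes = ["ABD", "EFBCAGH", "IJAKBCLM", "NOPCAQARS"]
hits = occurrences(chromosomes, {"A", "B", "C"}, delta=3)
```

Under these assumptions, the set {A, B, C} occurs in the second and third chromosomes only: the first chromosome lacks C and the fourth lacks B.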
Without additional constraints, Definition 1 also leads to theoretically exponential algorithms, since the number of domain teams can be exponential in the number of labels. However, as shown in the next sections, real-life examples involving thousands of genes can be computed efficiently or at least in a reasonable time.
In order to show the exponential nature of Definition 1, consider a set L of n labels. Construct n chromosomes, each containing n-1 different labels obtained by removing one different label from L. Then, for ABCD ABCE ABDE ACDE BCDE
Each proper subset S of L has at least one occurrence, since S is contained in at least one chromosome, and the distance between two labels in a chromosome is always less than
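The construction above can be checked by brute force for small n. The sketch below is an illustration with hypothetical helper names: it builds the n chromosomes obtained by removing one label at a time from L and counts the nonempty proper subsets of L that are contained in at least one chromosome. It only tests containment, not the distance condition, and it is practical only for small n.

```python
from itertools import combinations
from typing import List


def build_chromosomes(labels: str) -> List[str]:
    """Build one chromosome per label by removing that label from `labels`."""
    return [
        "".join(label for label in labels if label != removed)
        for removed in reversed(labels)
    ]


def count_contained_subsets(labels: str, chromosomes: List[str]) -> int:
    """Count nonempty proper subsets of `labels` contained in some chromosome."""
    count = 0
    for size in range(1, len(labels)):
        for subset in combinations(labels, size):
            if any(set(subset) <= set(chromosome) for chromosome in chromosomes):
                count += 1
    return count


example_chromosomes = build_chromosomes("ABCDE")
subset_count = count_contained_subsets("ABCDE", example_chromosomes)
```

For n = 5 this yields the five chromosomes of the example and 2^5 - 2 = 30 contained subsets, that is, every nonempty proper subset of L, which is the source of the exponential growth.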
Sensitivity of DomainTeam as viewed from three closely related genomes
As a way to test the sensitivity of our approach, we compared the results obtained by GeneTeam (Luc et al. 2003 parameter was set to 3 (allowing gaps of two consecutive genes or domains). The results are summarized in Figure 2. The first obvious observation is that, for both programs, there are no huge teams that would encompass almost all of the genome. Rather, these three closely related species share a lot of microsyntenic regions (red color in Fig. 2). As expected, the teams obtained by DomainTeam (inner circle) and GeneTeam (outer circle) most often coincide. However, DomainTeam identifies larger and more numerous microsyntenies, as large nonsyntenic regions reported by GeneTeam are broken into several domain teams. The largest teams (green in Fig. 2) contain 31 and 26 genes for DomainTeam and GeneTeam, respectively. On the whole, the domain teams harbor 2207 genes (52% of the E. coli genes) and the gene teams 1662 (40%). This difference can be explained by at least three reasons: the use of the domain criterion (1) relaxes the need for strict homology, (2) permits various rearrangements of domains such as duplications or fusions, and (3) allows one to take paralogs into account, and thus to identify duplicated regions. These three points are discussed in the next sections.
The use of domains bypasses the rigidity of pairwise sequence comparisons
As already stated, multiple-sequence alignment profiles make protein sequence comparisons more sensitive than classical pairwise alignments. Homology inference based on pairwise alignments will inevitably fail when sequences have diverged too much, whereas two highly divergent homologous (protein) sequences may well continue to possess a common Pfam domain.
Figure 3 displays a schematic representation of a conserved team between E. coli and S. typhi, in which the proteins share five domains. The proteins encoded by pgtA and pgtB in S. typhi are known to be the members of a two-component regulatory system (Kadner 1996
Using domains instead of genes as an atomic unit allows us to detect domain rearrangements such as fusions
The detection of gene fusion events can be used to predict functional associations of proteins, such as functional interaction or complex formation (Enright et al. 1999
An example is given in Figure 4, which results from the search for conserved teams across five bacteria. This team is part of the tryptophan operon. While trpC is a stand-alone gene in Bacteroides thetaiotaomicron and Anabaena, it is fused with trpF in E. coli, S. typhi, and Y. pestis. As to trpG, it is fused with trpD in E. coli and S. typhi, but with trpE in Anabaena. These fusions are also detected by other methods based on sequence comparisons and are reported in FusionDB (Suhre and Claverie 2004
Since DomainTeam detects only the fusions between adjacent genes, it will not replace other methods that rely basically on sequence comparisons, irrespective of the distance between the fusion components. However, the increased sensitivity afforded by the Pfam domains enables us to find otherwise undetected fusions. We examined the fusions concerning adjacent genes in the pairs E. coli/Haemophilus influenzae and E. coli/Helicobacter pylori reported by FusionDB, AllFuse, and DomainTeam. A total of 39 such (predicted) fusions were found, only two of them being reported by the three methods, eight by two methods, and 29 by one method, among which five were predicted by DomainTeam only. As shown in Table 1, in all of these last five cases, one of the fusion (protein) components did not match the fused protein sufficiently to be detected by a similarity search. Conversely, eight fusions predicted by FusionDB or AllFuse were not detected by DomainTeam, because one of their components did not possess a Pfam label. It is therefore clear that, while DomainTeam cannot by itself replace other published methods, it can serve as a complementary tool to detect otherwise unpredicted fusions.
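The idea of reading fusions between adjacent genes directly from domain labels can be sketched as follows. This is a simplified illustration, not the procedure used by DomainTeam: the function name, the gene names, and the domain labels are hypothetical placeholders (not real Pfam identifiers), and the matching rule (two adjacent genes whose disjoint domain sets together equal the domain set of a single gene on the other chromosome) is an assumption made for the example.

```python
from typing import List, Tuple

Gene = Tuple[str, List[str]]  # (gene name, ordered domain labels)


def fusion_candidates(fused_side: List[Gene], split_side: List[Gene]) -> List[Tuple[str, str, str]]:
    """Return (fused gene, neighbour 1, neighbour 2) triples.

    A candidate is reported when two adjacent, labeled genes of `split_side`
    carry disjoint domain sets whose union equals the domain set of one
    multidomain gene of `fused_side`.
    """
    candidates = []
    for fused_name, fused_domains in fused_side:
        fused_set = set(fused_domains)
        if len(fused_set) < 2:
            continue  # a single-domain gene cannot be the fused form
        for (name_a, domains_a), (name_b, domains_b) in zip(split_side, split_side[1:]):
            set_a, set_b = set(domains_a), set(domains_b)
            if not set_a or not set_b:
                continue  # unlabeled genes cannot be matched
            if set_a.isdisjoint(set_b) and (set_a | set_b) == fused_set:
                candidates.append((fused_name, name_a, name_b))
    return candidates


# Hypothetical chromosomes: g2 carries domains C and F, which are split
# between the adjacent genes h1 and h2 on the other chromosome.
chromosome_1 = [("g1", ["X"]), ("g2", ["C", "F"])]
chromosome_2 = [("h1", ["C"]), ("h2", ["F"]), ("h3", ["X"])]
found = fusion_candidates(chromosome_1, chromosome_2)
```

Because only neighbouring genes are paired, components that lie far apart on the chromosome are never reported, which mirrors the limitation stated above; a component without any label is likewise missed.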
Duplications are detected by intrachromosomal comparisons
The classical step of finding orthologous genes before searching for syntenies prevents the detection of intrachromosomal duplications. We have already shown in Figure 3 that the use of domains and intrachromosomal comparisons not only enables one to find duplications, but also to detect duplications where the sequence similarities are weak. Another example, containing a duplication of a whole syntenic region, can be found in the Supplemental material (part 2), showing a team found in a set of 10 pathogenic bacteria.
Sensitivity of DomainTeam in massive comparisons
Each fully recovered operon was classified according to the number of chromosomes the team was found in, from two to 16 (the set of 15 Gram-negative bacteria comprised 16 chromosomes, since the genome of V. cholerae consists of two chromosomes; see Methods). Each class was then divided into three groups in the following way: (1) group 1, containing the teams found only in two or more of the eight gammaproteobacteria chromosomes; (2) group 2, containing the teams found in both gammaproteobacteria and other proteobacteria (comprising two epsilonproteobacteria and one alphaproteobacterium); (3) group 3, containing the teams found simultaneously in gammaproteobacteria, other proteobacteria, and more distant taxa (the set included one cyanobacterium, one bacteroidete, one spirochete, one chlamydiae, and one thermotogae). Figure 6 illustrates the phylogenetic distribution of the 245 fully recovered operons. While 14 operons are specific to E. coli, 96 operons were recovered only within the gammaproteobacteria (group 1), and 33 extra operons were also found in other proteobacteria (group 2). Surprisingly enough, the 116 remaining operons were also fully recovered within at least one of the more distant species (group 3). See Supplemental material, part 3, for the list of operons and their phylogenetic distribution.
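The three-group classification can be written down as a small function. This is a sketch with hypothetical names; it follows the group definitions literally, and returns None for any other combination of taxa (for instance gammaproteobacteria plus distant taxa without other proteobacteria), since the text does not say how such cases were handled.

```python
from typing import Optional, Set

GAMMA = "gammaproteobacteria"
OTHER_PROTEO = "other proteobacteria"
DISTANT = "more distant taxa"


def classify_team(taxa_present: Set[str]) -> Optional[int]:
    """Map the set of taxonomic classes in which a team occurs to group 1, 2, or 3."""
    if taxa_present == {GAMMA}:
        return 1
    if taxa_present == {GAMMA, OTHER_PROTEO}:
        return 2
    if taxa_present == {GAMMA, OTHER_PROTEO, DISTANT}:
        return 3
    return None  # combination not covered by the stated definitions


# Group sizes reported in the text; together they account for all recovered operons.
reported_counts = {1: 96, 2: 33, 3: 116}
total_operons = sum(reported_counts.values())
```

The reported group sizes add up to 96 + 33 + 116 = 245, the number of fully recovered operons.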
Limitations of domain teams identification
However sensitive the method is, DomainTeam may report false negatives in those cases where adjacent protein-coding genes are not labeled with a Pfam domain. Conversely, DomainTeam may result in false positives due to "promiscuous domains" of broad specificity (Marcotte et al. 1999b
The DomainTeam algorithm relies on pre-existing Pfam annotations of proteomes. As of December 2004, the Pfam library covers 74% of the proteins in SWISS-PROT/TrEMBL. This means that, on average, one protein in four is not (so far) labeled with a Pfam domain. As shown in Table 2, the Pfam coverage of complete proteomes is heterogeneous and varies from 96% for Buchnera aphidicola (a symbiotic bacterium endowed with a small genome) down to 40% for the archaebacterium Aeropyrum pernix. Obviously, DomainTeam will inevitably miss these unlabeled proteins and their corresponding genes. Most of the time, however, they will simply be considered as insertions within the teams (a false negative will be obtained when n consecutive genes are unlabeled, with n
Although microsyntenic regions can be found across eukaryotic genomes (e.g., Oh et al. 2002
Some "promiscuous domains," such as DNA-binding domains, increase the number of small uninteresting teams. We addressed this problem through the use of a simple and empirical score, aimed at ranking the observed sets of teams as a function of the number of different domains they contain and the number of different chromosomes they belong to. For one set of a given
The best ranks go to those teams having a high number of proteins per chromosome (np/no) with a high number of different domains (nd) and a low number of promiscuous domains (1/m). In our experience, teams with S > 90 are potentially interesting. See Supplemental material, part 4, for an example of the average number of proteins per occurrence in those teams having a score
Practical computing considerations
Conclusions
Chromosome tables and Pfam annotations
The chromosomal ordered lists (chromosome tables) of the bacterial genes and their products (together with their UniProt IDs) were downloaded from the EBI "proteome" site (http://www.ebi.ac.uk/integr8/EBI-Integr8-HomePage.do). The Pfam annotations pertaining to the above-mentioned proteomes were downloaded from ftp://ftp.sanger.ac.uk/pub/databases/Pfam/database-files.
Bacterial sets
Set of 15 Gram-negative bacteria: Anabaena sp, Bacteroides thetaiotaomicron, Borrelia burgdorferi, Campylobacter jejuni NCTC 11168, Chlamydia muridarum, Escherichia coli K12, Haemophilus influenzae, Helicobacter pylori ATCC 700392, Pseudomonas aeruginosa, Rhizobium loti, Salmonella typhi, Thermotoga maritima, Vibrio cholerae, Xylella fastidiosa, Yersinia pestis CO-92.
Set of 13 Gram-positive bacteria: Bacillus subtilis, Bifidobacterium longum, Clostridium perfringens, Corynebacterium efficiens, Deinococcus radiodurans, Enterococcus faecalis, Lactococcus lactis, Lactobacillus plantarum, Listeria monocytogenes, Mycobacterium leprae, Oceanobacillus iheyensis, Staphylococcus aureus N315, Streptococcus agalactiae serotype V.
Set of 16 archaebacteria: Aeropyrum pernix, Archaeoglobus fulgidus, Halobacterium sp, Methanobacterium thermoautotrophicum, Methanococcus jannaschii, Methanopyrus kandleri, Methanosarcina acetivorans, Methanosarcina mazei, Pyrococcus abyssi, Pyrobaculum aerophilum, Pyrococcus furiosus, Pyrococcus horikoshii, Sulfolobus solfataricus, Sulfolobus tokodaii, Thermoplasma acidophilum, Thermoplasma volcanium.
DomainTeam
We thank the Infobiogen team for their patience and understanding during very long runs and M. Marshall from the Pfam team for her help in retrieving the proper annotation files.
5 Corresponding author. E-mail pasek{at}genopole.cnrs.fr; fax 33-1-60-87-38-97. [Supplemental material is available online at www.genome.org.] Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.3638405. Article published online before print in May 2005.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Andreeva, A., Howorth, D., Brenner, S.E., Hubbard, T.J., Chothia, C., and Murzin, A.G. 2004. SCOP database in 2004: Refinements integrate structure and sequence family data. Nucleic Acids Res. 32: D226-D229.
Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., et al. 2004. The Pfam protein families database. Nucleic Acids Res. 32: D138-D141.
Bergeron, A., Corteel, S., and Raffinot, M. 2002. The algorithmic of gene teams. Lecture Notes Comput. Sci. 2452: 464-476.
Calabrese, P.P., Chakravarty, S., and Vision, T.J. 2003. Fast identification and statistical evaluation of segmental homologies in comparative maps. Bioinformatics 19: i74-i80.
Durand, D. and Sankoff, D. 2003. Tests for gene clustering. J. Comput. Biol. 10: 453-482.
Eddy, S.R. 1998. Profile hidden Markov models. Bioinformatics 14: 755-763.
Enright, A.J. and Ouzounis, C.A. 2001. Functional associations of proteins in entire genomes by means of exhaustive detection of gene fusions. Genome Biol. 2: research0034.1-0034.7.
Enright, A.J., Iliopoulos, I., Kyrpides, N.C., and Ouzounis, C.A. 1999. Protein interaction maps for complete genomes based on gene fusion events. Nature 402: 86-90.
Fujibuchi, W., Ogata, H., Matsuda, H., and Kanehisa, M. 2000. A heuristic graph comparison algorithm and its application to detect functionally related enzyme clusters. Nucleic Acids Res. 28: 4021-4028.
Fukuda, Y., Washio, T., and Tomita, M. 1999. Comparative study of overlapping genes in the genomes of Mycoplasma genitalium and Mycoplasma pneumoniae. Nucleic Acids Res. 27: 1847-1853.
Galperin, M.Y. and Koonin, E.V. 2000. Who's your neighbor? New computational approaches for functional genomics. Nat. Biotech. 18: 609-613.
Ghai, R., Hain, T., and Chakraborty, T. 2004. GenomeViz: Visualizing microbial genomes. BMC Bioinformatics 5: 198.
Gribskov, M., McLachlan, A.D., and Eisenberg, D. 1987. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. 84: 4355-4358.
Harlow, T.J., Gogarten, J.P., and Ragan, M.A. 2004. A hybrid clustering approach to recognition of protein families in 114 microbial genomes. BMC Bioinformatics 5: 45.
He, X. and Goldwasser, M. 2004. Identifying conserved gene clusters in the presence of orthologous groups. In Proceedings of the Eighth Annual International Conference on Computational Molecular Biology (RECOMB) 2004 (eds. P.E. Bourne and D. Gusfield), pp. 272-280. ACM, New York.
Jaillon, O., Aury, J-M., Brunet, F., Petit, J-L., Stange-Thomann, N., Mauceli, E., Bouneau, L., Fischer, C., Ozouf-Costaz, C., Bernot, A., et al. 2004. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate proto-karyotype. Nature 431: 946-957.
Kadner, R.J. 1996. Cytoplasmic membrane. In Escherichia coli and Salmonella typhimurium, cellular and molecular biology (eds. F.C. Neidhardt et al.), pp. 58-87. ASM Press, Washington, DC.
Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., and Hattori, M. 2004. The KEGG resource for deciphering the genome. Nucleic Acids Res. 32: D277-D280.
Koonin, E.V., Aravind, L., and Kondrashov, A.S. 2000. The impact of comparative genomics on our understanding of evolution. Cell 101: 573-576.
Korbel, J.O., Jensen, L.J., von Mering, C., and Bork, P. 2004. Analysis of genomic context: Prediction of functional associations from conserved bidirectionally transcribed gene pairs. Nat. Biotech. 22: 911-917.
Luc, N., Risler, J-L., Bergeron, A., and Raffinot, M. 2003. Gene teams: A new formalization of gene clusters for comparative genomics. Comput. Biol. Chem. 27: 59-67.
Marcotte, E.M., Pellegrini, M., Thompson, M.J., Yeates, T.O., and Eisenberg, D. 1999a. A combined algorithm for genome-wide prediction of protein function. Nature 402: 83-86.
Marcotte, E.M., Pellegrini, M., Ho-Leung, N., Rice, D.W., Yeates, T.O., and Eisenberg, D. 1999b. Detecting protein function and protein-protein interactions from genome sequences. Science 285: 751-753.
Nye, T.M., Berzuini, C., Gilks, W.R., Babu, M.M., and Teichmann, S.A. 2004. Statistical analysis of domains in interacting protein pairs. Bioinformatics 21: 993-1001.
Oh, K.C., Hardeman, C., Ivanchenko, M.G., Ellard-Ivet, M., Nebenführ, A., White, T.J., and Lomax, T.L. 2002. Fine mapping in tomato using microsynteny with the Arabidopsis genome: The Diageotropica (Dgt) locus. Genome Biol. 3: research0049.1-0049.11.
Overbeek, R., Fonstein, M., D'Souza, M., Pusch, G.D., and Maltsev, N. 1999. The use of gene clusters to infer functional coupling. Proc. Natl. Acad. Sci. 96: 2896-2901.
Passarge, E., Horsthemke, B., and Farber, R.A. 1999. Incorrect use of the term synteny. Nat. Genet. 23: 387.
Patthy, L. 2003. Modular assembly of genes and the evolution of new functions. Genetica 118: 217-231.
Pevzner, P. and Tesler, G. 2003. Genome rearrangements in mammalian evolution: Lessons from human and mouse genomes. Genome Res. 13: 37-45.
Salgado, H., Gama-Castro, S., Martinez-Antonio, A., Diaz-Peredo, E., Sanchez-Solano, F., Peralta-Gil, M., Garcia-Alonso, D., Jimenez-Jacinto, V., Santos-Zavaleta, A., Bonavides-Martinez, C., et al. 2004. RegulonDB (version 4.0): Transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res. 32: D303-D306.
Sali, A. 1999. Functional links between proteins. Nature 402: 23-26.
Sankoff, D. 2003. Rearrangements and genome evolution. Curr. Opin. Gen. Dev. 13: 583-587.
Suhre, K. and Claverie, J-M. 2004. FusionDB: A database for in-depth analysis of prokaryotic gene fusion events. Nucleic Acids Res. 32: D273-D276.
Suyama, M. and Bork, P. 2001. Evolution of prokaryotic gene order: Genome rearrangements in closely related species. Trends Genet. 17: 10-13.
Tamames, J. 2001. Evolution of gene order conservation in prokaryotes. Genome Biol. 2: 0020.1-0020.11.
Tang, J. and Moret, B.M. 2003. Scaling up accurate phylogenetic reconstruction from gene-order data. Bioinformatics 19: i305-i312.
Vogel, C., Bashton, M., Kerrison, N.D., Chothia, C., and Teichmann, S.A. 2004. Structure, function and evolution of multidomains proteins. Curr. Opin. Struct. Biol. 14: 208-216.
von Mering, C., Huynen, M., Jaeggi, D., Schmidt, S., Bork, P., and Snel, B. 2003. STRING: A database of predicted functional associations between proteins. Nucleic Acids Res. 31: 258-261.
Yanai, I., Derti, A., and DeLisi, C. 2001. Genes linked by fusion events are generally of the same functional category: A systematic analysis of 30 microbial genomes. Proc. Natl. Acad. Sci. 98: 7940-7945.
Yanai, I., Wolf, Y.I., and Koonin, E.V. 2002. Evolution of gene fusions: Horizontal transfer versus independent events. Genome Biol. 3: research0024.1-0024.13.
Yona, G., Linial, N., and Linial, M. 1999. Protomap: Automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space. Proteins 37: 360-378.
ftp://ftp.sanger.ac.uk/pub/databases/Pfam/database-files; The directory of the Pfam ftp server that contains the Pfam annotations of the proteins in UniProt.
http://hmmer.wustl.edu/; HMMER series of programs.
http://www.ebi.ac.uk/integr8/EBI-Integr8-HomePage.do; The proteome Home Page at EBI.
http://lgi.infobiogen.fr/DomainTeams; DomainTeams full results and code downloads.
Received January 3, 2005; accepted in revised form March 28, 2005.