Genome Res. 14:1669-1675, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Resources
cpnDB: A Chaperonin Sequence Database
1 National Research Council Plant Biotechnology Institute, Saskatoon, Saskatchewan S7N 0W9, Canada
2 National Research Council Institute for Marine Biosciences & Canadian Bioinformatics Resource, Halifax, Nova Scotia B3H 3Z1, Canada
3 Department of Pathology and Laboratory Medicine and UBC Centre for Disease Control, University of British Columbia, Vancouver, British Columbia V5Z 4R4, Canada
Type I chaperonins are molecular chaperones present in virtually all bacteria, some archaea, and the plastids and mitochondria of eukaryotes. Sequences of cpn60 genes, encoding 60-kDa chaperonin protein subunits (CPN60, also known as GroEL or HSP60), are useful for phylogenetic studies and as targets for detection and identification of organisms. Conveniently, a 549- to 567-bp segment of the cpn60 coding region can be amplified with universal PCR primers. Here, we introduce cpnDB, a curated collection of cpn60 sequence data collected from public databases or generated by a network of collaborators exploiting the cpn60 target in clinical, phylogenetic, and microbial ecology studies. The growing database currently contains 2000 records covering over 240 genera of bacteria, eukaryotes, and archaea. The database also contains over 60 sequences for the archaeal Type II chaperonin (thermosome, a homolog of eukaryotic cytoplasmic chaperonin) from 19 archaeal genera. As the largest curated collection of sequences available for a protein-encoding gene, cpnDB provides a resource for researchers interested in exploiting the power of cpn60 as a diagnostic or as a target for phylogenetic or microbial ecology studies, as well as those interested in broader subjects such as lateral gene transfer and codon usage. We built cpnDB from open source tools, and it is available at http://cpndb.cbr.nrc.ca.
The advent of genome-scale sequencing projects has led to the availability of full-genome sequences for a variety of organisms including eukaryotes, bacteria, and archaea. This data is an invaluable resource for studies of genome-scale evolutionary processes such as lateral gene transfer and organelle evolution, as well as providing fuel for debates surrounding topics such as the definition of "species," particularly in the microbial world (Perna et al. 2001
A comparison of the sequences of the Escherichia coli groEL gene, which encodes a protein identified as being essential for the posttranslational assembly of bacteriophage particles and the Rubisco subunit-binding protein of higher plant chloroplasts, led to the discovery that these two proteins represent a ubiquitous protein family now known as the type I chaperonins (CPN60; Hemmingsen et al. 1988
Multiple functions have been ascribed to CPN60. Whereas the primary intracellular role of CPN60 is thought to be as a molecular chaperone in the processes of posttranslational protein folding and assembly of protein complexes (for review, see Saibil and Ranson 2002
The universal nature of cpn60 genes makes them attractive targets for phylogenetic studies (Viale and Arakaki 1994
The ability to amplify the cpn60 UT from any genomic template has also facilitated the study of complex microbial communities, in which the UT region is amplified from a complex template and libraries of cloned UT sequences are created and sequenced (Hill et al. 2002
Our ongoing efforts to exploit cpn60 as a target for phylogenetic studies, microbial detection and identification, and microbial ecology have led us to gather and curate a large collection of cpn60 (Type I chaperonin) sequence data, as well as sequence from the archaeal thermosome (Type II chaperonin), a homolog of cpn60. To share this resource with the scientific community, we have designed and implemented a Web interface for cpnDB, a curated collection of cpn60 sequence data that is available at http://cpndb.cbr.nrc.ca.
cpnDB Contents
cpnDB currently contains 2000 records, approximately one-third of which have full-length cpn60 gene sequence data associated with them. The remaining two-thirds of the records contain exclusively UT sequence data. Organisms represented in cpnDB include eukaryotes, bacteria, and archaea, and most of the major taxonomic groups defined by the 16S rRNA "backbone tree" are represented. Table 1 summarizes the database contents by major taxonomic group and number of genera in each group. Taxonomic lineages associated with each record are derived from the full lineages provided by the NCBI taxonomy database for each organism. Currently, the primary focus of cpnDB is on cpn60 sequences (type I chaperonin), although it may be expanded in the future to include eukaryotic type II chaperonins in addition to the archaeal type II chaperonins currently included. The cpn60 universal PCR primers do not amplify any part of the type II chaperonin genes.
Although multiple cpn60 genes are common in complex eukaryotes, we have found only a few examples of multiple cpn60 genes in bacteria. Table 2 lists the 16 bacterial species with multiple cpn60 genes for which complete genome sequences are available to date.
cpn60 UT Versus Full-Length Gene Sequence
To determine whether relationships between cpn60 genes for any two organisms are reliably reflected in the UT region, we examined the pairwise percent identities between UT nucleotide sequences and full-length cpn60 gene sequences, and between UT peptide sequences and full-length CPN60 protein sequences, for 97 Gammaproteobacteria sequences representing 32 genera (Fig. 1A). For each pair of sequences (4656 pairs), we determined the ratio of the UT nucleotide sequence identity to the full-length nucleotide sequence identity and the ratio of the UT peptide sequence identity to the full-length peptide sequence identity. For example, if the UT nucleotide sequence identity between two sequences is 78%, and the full-length nucleotide sequence identity for the same pair is 76%, then the ratio would be 1.03. We found that the majority of ratios for the Gammaproteobacteria (3569 of 4656 peptide ratios; 4161 of 4656 nucleotide ratios) are between 0.96 and 1.10, indicating that the level of difference between the UT regions of any two organisms in this group is representative of the level of difference between the full-length cpn60 sequences (Fig. 1B).
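The ratio calculation described above can be sketched in a few lines of Python. This is only an illustration, not the procedure used to produce Figure 1: it assumes the sequences have already been aligned to equal length, it skips gapped columns when counting identities, and all function names (`percent_identity`, `identity_ratio`, `pairwise_ratios`, `fraction_in_range`, `count_pairs`) are hypothetical.

```python
from itertools import combinations
from typing import Dict, Iterable, Tuple


def percent_identity(a: str, b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Assumption: columns where either sequence has a gap ('-') are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    columns = [(x, y) for x, y in zip(a.upper(), b.upper()) if x != "-" and y != "-"]
    if not columns:
        raise ValueError("no comparable columns")
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def identity_ratio(ut_identity: float, full_identity: float) -> float:
    """Ratio of UT identity to full-length identity for one sequence pair."""
    if full_identity == 0:
        raise ValueError("full-length identity is zero; ratio undefined")
    return ut_identity / full_identity


def count_pairs(n: int) -> int:
    """Number of unordered pairs among n sequences."""
    return n * (n - 1) // 2


def pairwise_ratios(ut: Dict[str, str], full: Dict[str, str]) -> Dict[Tuple[str, str], float]:
    """UT/full-length identity ratio for every unordered pair of record names.

    Both dictionaries must be keyed by the same (hypothetical) record names.
    """
    ratios = {}
    for a, b in combinations(sorted(ut), 2):
        ratios[(a, b)] = identity_ratio(
            percent_identity(ut[a], ut[b]),
            percent_identity(full[a], full[b]),
        )
    return ratios


def fraction_in_range(ratios: Iterable[float], low: float = 0.96, high: float = 1.10) -> float:
    """Fraction of ratios that fall within [low, high]."""
    values = list(ratios)
    return sum(low <= r <= high for r in values) / len(values)
```

With the worked example from the text, `identity_ratio(78.0, 76.0)` rounds to 1.03, and `count_pairs(97)` gives the 4656 pairs examined.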
Sequence Diversity Within cpnDB
Figure 2 shows the cumulative frequency distribution for pairwise percent nucleotide and peptide identities and pairwise peptide similarities between cpn60 UT sequences across a data set composed of one representative of each of 247 eukaryotic and bacterial genera represented in cpnDB. Pairwise UT nucleotide identities in this data set range from 26.4% to 100%, with a median value of 55.4% identity and a mean of 56.5% identity. The peptide identity range extends from 15.4% to 100%, with a median of 53.5% and a mean of 54.6%. Pairwise peptide percent similarity ranges from 33.6% to 100%, with a median value of 74.3% similarity and a mean of 74.5%.
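For readers who want to produce the same kind of summary from their own list of pairwise identities, the range, median, mean, and cumulative frequency can be computed as in the following sketch. It assumes the pairwise percent identities are already available as a list of numbers; the function names `summarize` and `cumulative_frequency` are hypothetical and are not part of cpnDB.

```python
from bisect import bisect_right
from statistics import mean, median
from typing import Dict, List, Sequence, Tuple


def summarize(identities: Sequence[float]) -> Dict[str, float]:
    """Range, median, and mean of a collection of pairwise percent identities."""
    if not identities:
        raise ValueError("at least one identity value is required")
    values = sorted(identities)
    return {
        "min": values[0],
        "max": values[-1],
        "median": median(values),
        "mean": mean(values),
    }


def cumulative_frequency(
    identities: Sequence[float], thresholds: Sequence[float]
) -> List[Tuple[float, float]]:
    """For each threshold t, the fraction of values less than or equal to t."""
    if not identities:
        raise ValueError("at least one identity value is required")
    values = sorted(identities)
    n = len(values)
    # bisect_right returns how many sorted values are <= t
    return [(t, bisect_right(values, t) / n) for t in thresholds]
```

Plotting the pairs returned by `cumulative_frequency` against their thresholds gives a curve of the kind described for Figure 2.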
Web Interface
cpnDB was constructed with MySQL, and the Web interface was implemented with PHP. Database contents can be searched using text or sequence queries. The text search window (Fig. 3) allows searching by elements of the organism name as well as specific strain identifiers such as American Type Culture Collection numbers. A keyword query searches all text fields. The number of records retrieved can be restricted and records can be sorted in a number of ways. It is also possible to view only recent deposits to the database or to browse the entire selection by general taxonomic class and alphabetical listing of organisms.
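The text search described above (match on organism name or strain identifier, restrict the number of records, choose a sort order) can be illustrated with a small query sketch. Everything here is an assumption made for illustration: the actual cpnDB schema and PHP code are not described in this article, Python's built-in `sqlite3` module stands in for MySQL, and the `records` table, its columns, and the demonstration rows are hypothetical placeholders rather than database contents.

```python
import sqlite3

# Hypothetical whitelist of sortable columns; column names cannot be bound as
# query parameters, so the sort key is validated before being interpolated.
SORTABLE = {"id", "organism", "deposit_date"}


def search_records(conn, term, limit=25, order_by="organism"):
    """Return records whose organism or strain field contains `term`."""
    if order_by not in SORTABLE:
        raise ValueError("unsupported sort column: %r" % order_by)
    pattern = "%" + term + "%"
    sql = (
        "SELECT id, organism, strain, deposit_date FROM records "
        "WHERE organism LIKE ? OR strain LIKE ? "
        "ORDER BY " + order_by + " LIMIT ?"
    )
    return conn.execute(sql, (pattern, pattern, limit)).fetchall()


def demo_connection():
    """In-memory database filled with made-up placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records ("
        "id INTEGER PRIMARY KEY, organism TEXT, strain TEXT, deposit_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO records (organism, strain, deposit_date) VALUES (?, ?, ?)",
        [
            ("Achromobacter placeholder-one", "ATCC-PLACEHOLDER-1", "2004-01-01"),
            ("Achromobacter placeholder-two", "ATCC-PLACEHOLDER-2", "2004-02-01"),
            ("Bacteroides placeholder", "ATCC-PLACEHOLDER-3", "2004-03-01"),
        ],
    )
    return conn
```

Binding the search term and the limit as parameters, and checking the sort column against a fixed set, keeps user input out of the SQL text, which is a reasonable default for any Web-facing search form.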
Search results are presented in table form as shown in Figure 4. In this case, a search for genus Achromobacter yielded two records. Full records can be viewed by selecting the database ID number, or added to a download cart for later retrieval. Individual records (Fig. 5) include deposit date, organism identification, taxonomic lineage, unique GenBank identifiers, source of the data, and available nucleotide and peptide sequence data. Links to the corresponding NCBI taxonomy, nucleotide, and peptide database entries are provided for retrieval of full GenBank records and of the associated resources provided by NCBI. For nonreference sequences, which include clinical isolates, field isolates, and sequences derived from environmental samples in microbial population studies, the record also includes a description of the nearest reference sequence neighbor in the database. For example, Figure 5 shows the record for a cloned cpn60 sequence derived from a study of pig feces microbial flora. In this case, the library sequence (001_g11, GenBank accession AF436914) is shown to be 72.939% identical to the reference nucleotide sequence from Bacteroides uniformis. The date of the most recent search of the reference database is indicated and links to the FASTA and BLASTp results are provided. All nonreference sequences are searched against the reference data set following each deposit of new reference data.
Sequence-based searches can be conducted using nucleotide or peptide queries. In addition to standard FASTA and BLASTp searches, implementation of a modified version of BIBI (http://pbil.univ-lyon1.fr/bibi/; Devulder et al. 2003
Documentation for all applications implemented is available through the cpnDB Web interface. Descriptions and protocols for the application of the universal cpn60 primers and related literature references are also presented.
16S rRNA genes have long been the standard for molecular systematic studies as well as for the rapidly expanding fields of molecular diagnostics and microbial ecology. These genes are universal, the multiple copies of 16S rRNA genes per genome make them an abundant, easily detectable target, and universal primers have been developed to amplify specific fragments for sequence analysis. The potential pitfalls of 16S rRNA as a target for organism detection and identification stem from the fact that there is often insufficient discriminating sequence information within the 16S rRNA target to distinguish between closely related species and strains. Also, the variable number of copies per genome complicates quantitative assays based on this target. The structure of the gene, with alternating regions of variable and conserved sequence, facilitates the formation of chimeric PCR products, especially when amplifying from a complex template (Wang and Wang 1997
Universal, protein-encoding genes offer alternatives to 16S rRNA with some particular advantages. As a consequence of the degeneracy of the genetic code, protein-encoding genes can diverge and evolve more rapidly than genes encoding structural RNA molecules, where even minor changes to the nucleotide sequence can have catastrophic effects. As a result, protein-encoding gene sequences can be used to discriminate between species, or even strains within a species. Genes present in a single copy in microbial genomes, although more difficult to detect, offer superior targets for the application of quantitative methods such as quantitative real-time PCR, while eliminating potential sequencing artifacts produced when multiple gene copies are not identical.
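The point about degeneracy can be made concrete with a toy example: two coding sequences that differ at several nucleotide positions yet encode the same peptide. The sketch below uses the standard genetic code; the two nine-base sequences are invented for illustration and are not taken from any cpn60 gene, and the function names are hypothetical.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = dict(
    zip((a + b + c for a in BASES for b in BASES for c in BASES), AMINO_ACIDS)
)


def translate(dna: str) -> str:
    """Translate a coding sequence whose length is a multiple of three."""
    dna = dna.upper()
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def count_matches(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(1 for x, y in zip(a, b) if x == y)


# Invented example sequences: both encode Ala-Gly-Leu.
SEQ_A = "GCTGGTCTG"
SEQ_B = "GCAGGCTTA"
```

Here `translate(SEQ_A)` and `translate(SEQ_B)` both give `AGL`, while `count_matches(SEQ_A, SEQ_B)` is 5 of 9 positions: the nucleotide sequences have diverged even though the encoded peptide is unchanged.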
Other protein-encoding genes beside cpn60 have proven useful for phylogenetic and diagnostic purposes, including rpoB (Adekambi et al. 2003
We have found that the cpn60 UT region is reasonably representative of the entire open reading frame in terms of phylogenetically informative sequence variation (Fig. 1), and that it usually provides more discriminating information than corresponding 16S rRNA sequences (Brousseau et al. 2001
The sequence data accumulated in cpnDB is derived from public repositories or has been generated by a collaborative network of clinicians, phylogeneticists, and microbial ecologists exploiting the cpn60 and archaeal chaperonin targets in their work. All of the sequence data in cpnDB is also present in the public sequence repositories. However, the advantages to a curated collection of sequences are obvious. We do not have any immediate plans to accept direct sequence submissions to cpnDB, but instead encourage interested parties to submit their data to the public sequence databases (EMBL, GenBank, or DDBJ), as surveillance of these resources for new cpn60 sequence data is ongoing. To the best of our knowledge, cpnDB is the largest curated collection of gene-specific sequence data for a protein-encoding gene, and as such, it is a valuable resource for phylogenetic studies, clinical applications, and microbial ecology investigations.
Data Collection
Surveillance of the NCBI GenBank sequence database is managed with the Pubcrawler service (http://www.pubcrawler.ie/; Hokamp and Wolfe 1999
Curation and Annotation
Updates
Funding for the development of cpnDB was provided by the Canadian Biotechnology Strategy and the National Research Council Genomics and Health Initiative. The authors gratefully acknowledge the support and contributions of the Canadian Bioinformatics Resource (http://cbr-rbc.nrc-cnrc.gc.ca/). The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2649204.
4 Corresponding author.
Adekambi, T., Colson, P., and Drancourt, M. 2003. rpoB-based identification of nonpigmented and late-pigmenting rapidly growing mycobacteria. J. Clin. Microbiol. 41: 5699-5708.
Boucher, Y., Douady, C.J., Papke, R.T., Walsh, D.A., Boudreau, M.E., Nesbo, C.L., Case, R.J., and Doolittle, W.F. 2003. Lateral gene transfer and the origins of prokaryotic groups. Annu. Rev. Genet. 37: 283-328.
Bourne, D.G., McDonald, I.R., and Murrell, J.C. 2001. Comparison of pmoA PCR primer sets as tools for investigating methanotroph diversity in three Danish soils. Appl. Environ. Microbiol. 67: 3802-3809.
Brousseau, R., Hill, J.E., Prefontaine, G., Goh, S.H., Harel, J., and Hemmingsen, S.M. 2001. Streptococcus suis serotypes characterized by analysis of chaperonin 60 gene sequences. Appl. Environ. Microbiol. 67: 4828-4833.
Bush, R.M. and Everett, K.D. 2001. Molecular evolution of the Chlamydiaceae. Int. J. Syst. Evol. Microbiol. 51: 203-220.
Chambaud, I., Heilig, R., Ferris, S., Barbe, V., Samson, D., Galisson, F., Moszer, I., Dybvig, K., Wroblewski, H., Viari, A., et al. 2001. The complete genome sequence of the murine respiratory pathogen Mycoplasma pulmonis. Nucleic Acids Res. 29: 2145-2153.
Cole, J.R., Chai, B., Marsh, T.L., Farris, R.J., Wang, Q., Kulam, S.A., Chandra, S., McGarrell, D.M., Schmidt, T.M., Garrity, G.M., et al. 2003. The Ribosomal Database Project (RDP-II): Previewing a new autoaligner that allows regular updates and the new prokaryotic taxonomy. Nucleic Acids Res. 31: 442-443.
Dale, C.J., Moses, E.K., Ong, C.C., Morrow, C.J., Reed, M.B., Hasse, D., and Strugnell, R.A. 1998. Identification and sequencing of the groE operon and flanking genes of Lawsonia intracellularis: Use in phylogeny. Microbiology 144: 2073-2084.
Devulder, G., Perriere, G., Baty, F., and Flandrois, J.P. 2003. BIBI, a bioinformatics bacterial identification tool. J. Clin. Microbiol. 41: 1785-1787.
Glass, J.I., Lefkowitz, E.J., Glass, J.S., Heiner, C.R., Chen, E.Y., and Cassell, G.H. 2000. The complete sequence of the mucosal pathogen Ureaplasma urealyticum. Nature 407: 757-762.
Goh, S.H., Potter, S., Wood, J.O., Hemmingsen, S.M., Reynolds, R.P., and Chow, A.W. 1996. HSP60 gene sequences as universal targets for microbial species identification: Studies with coagulase-negative staphylococci. J. Clin. Microbiol. 34: 818-823.
Goh, S.H., Santucci, Z., Kloos, W.E., Faltyn, M., George, C.G., Driedger, D., and Hemmingsen, S.M. 1997. Identification of Staphylococcus species and subspecies using the Chaperonin-60 gene identification method and reverse checkerboard hybridization. J. Clin. Microbiol. 35: 3116-3121.
Goh, S.H., Driedger, D., Gillett, S., Low, D.E., Hemmingsen, S.M., Amos, M., Chan, D., Lovgren, M., Willey, B.M., Shaw, C., et al. 1998. Streptococcus iniae, a human and animal pathogen: Specific identification by the Chaperonin-60 gene identification method. J. Clin. Microbiol. 36: 2164-2166.
Goh, S.H., Facklam, R.R., Chang, M., Hill, J.E., Tyrrell, G.J., Burns, E.C., Chan, D., He, C., Rahim, T., Shaw, C., et al. 2000. Identification of enterococcus species and phenotypically similar lactococcus and vagococcus species by reverse checkerboard hybridization to chaperonin 60 gene sequences. J. Clin. Microbiol. 38: 3953-3959.
Hemmingsen, S.M., Woolford, C., van der Vies, S.M., Tilly, K., Dennis, D.T., Georgopoulos, C.P., Hendrix, R.W., and Ellis, R.J. 1988. Homologous plant and bacterial proteins chaperone oligomeric protein assembly. Nature 333: 330-334.
Hill, J.E. and Hemmingsen, S.M. 2001. Arabidopsis thaliana type I and II chaperonins. Cell Stress & Chaperones 6: 190-200.
Hill, J.E., Seipp, R.P., Betts, M., Hawkins, L., Van Kessel, A.G., Crosby, W.L., and Hemmingsen, S.M. 2002. Extensive profiling of a complex microbial community by high-throughput sequencing. Appl. Environ. Microbiol. 68: 3055-3066.
Hokamp, K. and Wolfe, K. 1999. What's new in the library? What's new in GenBank? Let PubCrawler tell you. Trends Genet. 15: 471-472.
Jian, W., Zhu, L., and Dong, X. 2001. New approach to phylogenetic analysis of the genus Bifidobacterium based on partial HSP60 gene sequences. Int. J. Syst. Evol. Microbiol. 51: 1633-1638.
Kasai, H., Watanabe, K., Gasteiger, E., Bairoch, A., Isono, K., Yamamoto, S., and Harayama, S. 1998. Construction of the gyrB database for the identification and classification of bacteria. Genome Inform. Ser. Workshop Genome Inform. 9: 13-21.
Kasai, H., Tamura, T., and Harayama, S. 2000. Intrageneric relationships among Micromonospora species deduced from gyrB-based phylogeny and DNA relatedness. Int. J. Syst. Evol. Microbiol. 50: 127-134.
Khamis, A., Colson, P., Raoult, D., and Scola, B.L. 2003. Usefulness of rpoB gene sequencing for identification of Afipia and Bosea species, including a strategy for choosing discriminative partial sequences. Appl. Environ. Microbiol. 69: 6740-6749.
Klunker, D., Haas, B., Hirtreiter, A., Figueiredo, L., Naylor, D.J., Pfeifer, G., Muller, V., Deppenmeier, U., Gottschalk, G., Hartl, F.U., et al. 2003. Coexistence of group I and group II chaperonins in the archaeon Methanosarcina mazei. J. Biol. Chem. 278: 33256-33267.
Kwok, A.Y. and Chow, A.W. 2003. Phylogenetic study of Staphylococcus and Macrococcus species based on partial hsp60 gene sequences. Int. J. Syst. Evol. Microbiol. 53: 87-92.
Kwok, A.Y., Wilson, J.T., Coulthart, M., Ng, L.K., Mutharia, L., and Chow, A.W. 2002. Phylogenetic study and identification of human pathogenic Vibrio species based on partial hsp60 gene sequences. Can. J. Microbiol. 48: 903-910.
Lew, A.E., Gale, K.R., Minchin, C.M., Shkap, V., and de Waal, D.T. 2003. Phylogenetic analysis of the erythrocytic Anaplasma species based on 16S rDNA and GroEL (HSP60) sequences of A. marginale, A. centrale, and A. ovis and the specific detection of A. centrale vaccine strain. Vet. Microbiol. 92: 145-160.
Maguire, M., Coates, A.R.M., and Henderson, B. 2002. Chaperonin 60 unfolds its secrets of cellular communication. Cell Stress & Chaperones 7: 317-329.
Marston, E.L., Sumner, J.W., and Regnery, R.L. 1999. Evaluation of intraspecies genetic variation within the 60 kDa heat-shock protein gene (groEL) of Bartonella species. Int. J. Syst. Bacteriol. 49: 1015-1023.
Olsen, G.J., Lane, D.J., Giovannoni, S.J., Pace, N.R., and Stahl, D.A. 1986. Microbial ecology and evolution: A ribosomal RNA approach. Annu. Rev. Microbiol. 40: 337-365.
Perna, N.T., Plunkett, G., Burland, V., Mau, B., Glasner, J.D., Rose, D.J., Mayhew, G.F., Evans, P.S., Gregor, J., Kirkpatrick, H.A., et al. 2001. Genome sequence of enterohaemorrhagic Escherichia coli O157:H7. Nature 409: 529-533.
Saibil, H.R. and Ranson, N.A. 2002. The chaperonin folding machine. Trends Biochem. Sci. 27: 627-632.
Viale, A. 1995. GroEL (Hsp60)-based bacterial and organellar phylogenies. Mol. Microbiol. 17: 1013.
Viale, A.M. and Arakaki, A.K. 1994. The chaperone connection to the origins of the eukaryotic organelles. FEBS Lett. 341: 146-151.
Viale, A.M., Arakaki, A.K., Soncini, F.C., and Ferreyra, R.G. 1994. Evolutionary relationships among bacterial groups as inferred from GroEL (Chaperonin) sequence comparisons. Int. J. Syst. Bacteriol. 44: 527-533.
Wang, G.C. and Wang, Y. 1997. Frequency of formation of chimeric molecules as a consequence of PCR coamplification of 16S rRNA genes from mixed bacterial genomes. Appl. Environ. Microbiol. 63: 4645-4650.
Woese, C.R., Kandler, O., and Wheelis, M.L. 1990. Towards a natural system of organisms: Proposal for the domains Archaea, Bacteria, and Eucarya. Proc. Natl. Acad. Sci. 87: 4576-4579.
Yamamoto, S. and Harayama, S. 1995. PCR amplification and direct sequencing of gyrB genes with universal primers and their application to the detection and taxonomic analysis of Pseudomonas putida strains. Appl. Environ. Microbiol. 61: 1104-1109.
Zmasek, C.M. and Eddy, S.R. 2001. ATV: Display and manipulation of annotated phylogenetic trees. Bioinformatics 17: 383-384.
http://cpndb.cbr.nrc.ca/; cpnDB homepage.
http://www.ncbi.nlm.nih.gov/; National Center for Biotechnology Information.
http://www.pubcrawler.ie/; Pubcrawler homepage.
http://cbr-rbc.nrc-cnrc.gc.ca/; Canadian Bioinformatics Resource.
http://www.atcc.org/; American Type Culture Collection.
http://www.bacterio.cict.fr/; List of Bacterial Names with Standing in Nomenclature.
http://www.jalview.org/; Jalview homepage.
http://pbil.univ-lyon1.fr/bibi/; BIBI homepage.
http://www.genetics.wustl.edu/eddy/atv/; ATV (A Tree Viewer).
Received April 1, 2004; accepted in revised format May 28, 2004.