Published online before print
October 25, 2006, 10.1101/gr.4916306 Genome Res. 16:1431-1438, 2006 ©2006 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/06 $5.00 OPEN ACCESS ARTICLE
Resource
ProtoBee: Hierarchical classification and annotation of the honey bee proteome
Department of Biological Chemistry, Life Science Institute, The Hebrew University, Jerusalem 91904, Israel
The recently sequenced genome of the honey bee (Apis mellifera) has produced 10,157 predicted protein sequences, calling for a computational effort to extract biological insights from them. We have applied an unsupervised hierarchical protein-clustering method, which was previously used in the ProtoNet system, to nearly 200,000 proteins consisting of the predicted honey bee proteins, the SWISS-PROT protein database, and the complete set of proteins of the mouse (Mus musculus) and the fruit fly (Drosophila melanogaster). The hierarchy produced by this method has been entitled ProtoBee. In ProtoBee, the proteins are hierarchically organized into 18,936 separate tree hierarchies, each representing a protein functional family. By using the mouse and Drosophila complete proteomes as reference, we are able to highlight functional groups of putative gene-loss events, putative novel proteins of unique functionality, and bee-specific paralogs. We have studied some of the ProtoBee findings and suggest their biological relevance. Examples include novel opsin genes and intriguing nuclear matches of mitochondrial genes. The organization of bee sequences into functional clusters suggests a natural way of automatically inferring functional annotation. Following this notion, we were able to assign functional annotation to about 70% of the sequences. ProtoBee is available at http://www.protobee.cs.huji.ac.il
Comparative genomics is heavily based on computational methods. These methods not only automate the handling of the immense amount of data held within whole genomes, but are also a means of highlighting biologically interesting differences between genomes. The recently sequenced genome of the honey bee Apis mellifera (The Honey Bee Genome Sequencing Consortium 2006
ProtoNet is a hierarchical organization of over 1,000,000 protein sequences (Kaplan et al. 2005
We have applied the method used in ProtoNet to 199,343 proteins consisting of the GLEAN3 set of predicted bee proteins (The Honey Bee Genome Sequencing Consortium 2006
One key computational task for a newly sequenced genome is the automatic assignment of functional annotation to its predicted coding sequences (see Discussion in Sasson et al. 2006 A Web site that enables downloading, browsing, and analysis of the ProtoBee hierarchy and classification is available at http://www.protobee.cs.huji.ac.il.
ProtoBee hierarchy
The resulting hierarchy of the 200,000 protein sequences contains 85,579 clusters that are organized into 18,936 separate trees. Each such tree is conjectured to represent a family of proteins that are functionally related. The proteins of each tree are all contained in its root cluster; therefore, the terms "tree" and "root" will be used interchangeably.

Before proceeding, it is crucial to stress that the bee sequences are based on computational prediction. This means that some of the predicted coding sequences may be partially or even fully incorrect. Furthermore, it is plausible that some proteins are missing from the predicted set. In addition, the clustering and annotation methods are expected to possess some degree of error, as is any automatic computational method. Although each cluster has to be inspected manually in order to properly distinguish between these possibilities, in some instances it is possible to systematically pinpoint clusters that are more likely to possess unique bee features. This is the approach by which we proceed.

In order to gain a global taxonomic view of the bee proteome, we look at two different perspectives. (1) Protein-based view. Each one of the 10,157 predicted bee sequences belongs to one of the roots. Other proteins assigned to the same root are considered to be putative homologs, belonging to the same functional family. For each protein, we check whether it has homologs from the mouse, fly, or other organisms. (2) Root-based view. There are 5095 roots that contain at least one of the 10,157 bee proteins. For each such root, we check whether it contains proteins from the mouse, fly, or other organisms in addition to the bee proteins.

Figure 1 shows the summary of these results in a Venn diagram. As expected, a large majority (67%) of proteins have putative homologs in mouse, fly, and other organisms alike. However, in terms of roots, these proteins are contained in 2539 roots, which represent only 50% of the total number of roots. This suggests that several of these roots represent families that possess some functional divergence in the form of paralogs. A total of 87% of the proteins have putative fly homologs and 82% of the proteins have putative mouse homologs.
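The two perspectives described above can be illustrated with a small sketch. It is not the ProtoBee implementation: the function name, the two input dictionaries, and the toy protein identifiers are all hypothetical, and the taxon labels ("bee", "fly", "mouse", "other") are simply the four groups named in the text.

```python
from collections import Counter, defaultdict


def taxonomic_views(protein_root, protein_taxon):
    """Count bee proteins and bee-containing roots by the other taxa present.

    protein_root:  hypothetical mapping protein id -> root (tree) id
    protein_taxon: hypothetical mapping protein id -> "bee", "fly", "mouse" or "other"
    """
    # Collect the set of taxa that occur in each root.
    root_taxa = defaultdict(set)
    for protein, root in protein_root.items():
        root_taxa[root].add(protein_taxon[protein])

    # Protein-based view: one count per bee protein.
    protein_view = Counter()
    for protein, root in protein_root.items():
        if protein_taxon[protein] == "bee":
            protein_view[frozenset(root_taxa[root] - {"bee"})] += 1

    # Root-based view: one count per root that holds at least one bee protein.
    root_view = Counter()
    for root, taxa in root_taxa.items():
        if "bee" in taxa:
            root_view[frozenset(taxa - {"bee"})] += 1

    return protein_view, root_view


# Toy data (invented for illustration only).
protein_root = {
    "bee1": "rootA", "bee2": "rootA", "fly1": "rootA",
    "mouse1": "rootA", "other1": "rootA",
    "bee3": "rootB",   # bee-only root
    "fly2": "rootC",   # root without bee proteins, ignored by both views
}
protein_taxon = {
    "bee1": "bee", "bee2": "bee", "bee3": "bee",
    "fly1": "fly", "fly2": "fly", "mouse1": "mouse", "other1": "other",
}
protein_view, root_view = taxonomic_views(protein_root, protein_taxon)
```

In this toy input, two bee proteins share a single root with fly, mouse, and other proteins, so the protein-based count for that region is larger than the root-based count. This is the same kind of gap that the text interprets as a sign of paralogs within a family.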
One of the most interesting subsets of proteins is the group of 159 proteins that do not have homologs from any organism in our database. Since these proteins appear in 143 roots, most of these roots consist of only one bee protein. We expect these to be either bee proteins that have a unique functionality, highly diverged bee orthologs, gene-prediction mistakes, or sequences that could not be properly classified by ProtoBee. An interesting subset of these 159 proteins is the subset of proteins that belong to nonsingleton clusters (i.e., consisting of more than one protein). These are especially interesting because the chance of them being gene-prediction mistakes is significantly reduced. Such clusters are conjectured to consist of unique bee paralogs, created by gene-duplication events that are unique to the bee. Table 1 shows a list of the nine nonsingleton clusters that contain only bee proteins.
Following this comparative overview and the identification of putative bee sequences that possess a unique functionality, we would like to focus on gene-loss events in the bee. A careful testing of individual genes has previously shown cases of possible gene loss in the bee genome (Whitfield et al. 2002
So far we have focused on two sets of proteins that are of special interest in a comparative study of the bee genome: proteins whose function is bee specific and proteins that are missing in the bee due to gene-loss events. One other interesting case is that of paralog enrichment. In the case of paralogs, we would like to focus on protein families that are taxonomically imbalanced. Specifically, roots that contain a high ratio of bee:fly and bee:mouse proteins may suggest that there exist several paralogs in the bee that do not exist in the fly and mouse. In order to highlight taxonomically imbalanced clusters, we use a taxonomical balance score (TB score):

\[ \mathrm{TB}(C) = \frac{\mathrm{bee}(C)}{\mathrm{bee}(C) + \mathrm{fly}(C)} \]
where bee(C) is the number of bee proteins in cluster C and fly(C) is the number of fly proteins in C. The score ranges from 1 (only bee proteins, no fly proteins) to 0 (no bee proteins, only fly proteins), with 0.5 indicating an equal number of fly and bee proteins. A score for the bee:mouse ratio is derived in a similar manner. The TB score for each cluster is available through the ProtoBee Web site.

Following the procedure described in Methods, 7131 of 10,157 (70%) bee sequences were assigned annotation. While in terms of coverage this is comparable with supervised methods, the fact that the annotation sources used (see Methods) are varied in terms of scope provides viewpoints at several levels of functionality. Namely, we are able to assign a wide range of annotations from very general properties (e.g., signal transduction, metabolism) to very specific properties (e.g., glucose-6-phosphate isomerase). However, the main goal of the automatic annotation effort is to complement the view of the individual protein families. For example, suppose that by using the comparative approach previously described we find that a protein cluster of polymerases does not contain bee orthologs. A natural question would be whether bee polymerases can be found in other clusters. This can be easily examined by checking which bee proteins were annotated as polymerases.

Figure 2 shows the distribution of annotated proteins into GO functional categories. A list summarizing the number of proteins per annotation is available on the ProtoBee Web site.
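As a worked illustration of the taxonomical balance score described earlier in this section, the sketch below reads the score as the fraction of bee proteins among the bee and fly (or bee and mouse) proteins of a cluster. That reading is an assumption chosen because it reproduces the stated values of 1, 0, and 0.5; the function name and the handling of clusters with neither taxon are hypothetical.

```python
def tb_score(bee_count, other_count):
    """Taxonomical balance of a cluster, assuming TB = bee / (bee + other).

    bee_count:   number of bee proteins in the cluster
    other_count: number of fly proteins (or mouse proteins for the bee:mouse score)
    """
    total = bee_count + other_count
    if total == 0:
        # The text does not say how such clusters are scored; refusing is an assumption.
        raise ValueError("cluster contains neither bee nor reference proteins")
    return bee_count / total


only_bee = tb_score(3, 0)   # only bee proteins
only_fly = tb_score(0, 4)   # only fly proteins
balanced = tb_score(2, 2)   # equal numbers of bee and fly proteins
```

Clusters with a score close to 1 for both the bee:fly and the bee:mouse comparison would be the candidates for bee-specific paralog enrichment.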
Manual evaluation of the results
The full biological relevance of the results produced by this computational approach cannot be assessed without manual inspection of each and every prediction. Therefore, we proceed by providing an in-depth biological analysis of only some of the results. We start by examining the set of fly+/insect+/bee− clusters (Table 2). Fifteen of these clusters contain multiple biological groups that are apparently unrelated (note that according to our annotation-inference method, such clusters will not be used to infer annotations). However, it is apparent that in some instances these predictions are meaningful, considering the fact that they suggest specific functionally related groups of proteins to be missing (such as mitochondrial proteins, chorion proteins, vision proteins, and developmental proteins). Still, it is crucial to note that not all biologically coherent clusters necessarily indicate gene-loss events. For example, in the case of glucose-6-phosphate isomerase (G6PI), the protein seems to be missing, as it does not appear clustered with the Drosophila protein. G6PI is conserved amongst several species and is crucial for glycolysis, so it is highly unlikely that it does not have a bee homolog. Browsing the ProtoBee annotations, we find that ProtoBee annotates one of the bee proteins as G6PI, and this protein indeed seems to show very high similarity to G6PI proteins. Thus, we suggest that in order to determine whether a fly+/insect+/bee− cluster is indicative of a putative gene loss, one should complement the study of each such cluster with an examination of the corresponding annotations.
Mitochondrial proteins
In the case of COX1, ProtoBee is able to correctly cluster the COX1 protein family into a unique tree. However, one of the clusters in this tree also contains GB17755, a 30 amino acid bee protein. GB17755 shows a high level of similarity (59% identity spanning all 30 amino acids) to COX1 from various organisms. Returning to the genome, we find that the sequence of GB17755 is not part of a full-length COX1 homolog in the genome. No evidence of expression or mitochondrial targeting signal was found.

In the case of ND1, we find that the bee sequence GB12194 was clustered in a cluster of ND1 orthologs. A BLAST search using GB12194 as the query shows that the best matching sequence in UniProt is the sequence of the bee mitochondrial ND1 (84% identity on a region of over 60 amino acids). Searching the genome showed that this sequence is not part of a full-length nuclear homolog of ND1. Therefore, we do not expect this to be an instance in which a high-similarity homolog was missed in the gene prediction process. Once again, no evidence of EST expression or mitochondrial targeting was found. In light of this, the most likely explanation for the appearance of these sequences in the nuclear DNA is the migration of the mitochondrial sequence to the nuclear DNA, creating NUMTs (nuclear mitochondrial DNA).
In order to further investigate whether these sequences are indeed NUMTs (Richly and Leister 2004
In the case of COX3, we find that the bee sequence GB11138 has been clustered in a cluster with bacterial COX3 and ubiquinol oxidase subunit 3 (UOX3) proteins. GB11138 shows the highest level of similarity (54% identity spanning 90% of the protein) to UOX3 from Escherichia coli. Furthermore, the length of the proteins (206 amino acids) matches that of prokaryotic COX3 and UOX3 proteins rather than that of eukaryotic COX3. The high similarity of this protein to prokaryotic COX3 and UOX3 suggested that this sequence may be of bacterial origin. Examining the contig in which this sequence appears, we have identified two adjacent sequences with high similarity to bacterial UOX and ferredoxin proteins. The contig is currently unlocalized within the genome. No evidence of expression or mitochondrial targeting signal was found. We suggest that GB11138 and its contig are either the result of a recent lateral gene-transfer event or of a contamination within the genome sequence. In the cases of all three sequences it is apparent that although these sequences probably do not code for proteins, the classifications made by ProtoBee in each of these instances were justifiable.
Opsins
Pigment Dispersal Hormone
Another protein that surprisingly seems to be missing is the Pigment Dispersal Hormone (PDH). PDH has been suggested to be involved both in vision and the circadian rhythm (Park and Hall 1998
Unique bee paralogs
Cluster 391,502 consists of Apamin and Mast Cell Degranulating Protein (MCDP), both constituents of the bee venom. Apamin and MCDP were previously shown to share their 3' exon (Gmachl and Kreil 1995
Once a new genome is sequenced, there are several computational tasks that may be performed on it in order to learn about its biology. These include gene prediction, automatic annotation, and comparative analyses. For each of these tasks there are several different approaches. In this work we present a novel method that combines both the tasks of comparative analysis and automatic annotation. One unique aspect of the clustering method used by ProtoBee is the fact that it is an unsupervised method. In the supervised approach, the algorithm is typically provided with a training set of proteins known to belong to the same family, and then learns common features in order to detect new members of this family. This is the most commonly used approach for machine learning of protein families. While this approach delivers extremely high performance in terms of sensitivity and specificity, it creates a heavy bias toward the detection of only that which is known and cannot detect novel protein families. In the unsupervised approach, on the other hand, the method looks for intrinsic features of the data in order to organize it, rather than being guided externally. Using an unsupervised clustering method, ProtoBee is expected to be inferior to supervised methods such as InterProScan in terms of sensitivity/specificity. Thus, we suggest using our annotation method in conjunction with supervised methods in order to provide maximal coverage and specificity. However, the method makes up for this inferiority by its ability to detect novel protein families (e.g., nonsingleton clusters that are unique to bee) and provide a hierarchical comparative view. A genomic view that is based on the comparison of a genome to only two other genomes may be somewhat biased. However, since the computation required by this method is demanding (nearly 4 × 10^10 sequence comparisons), a three-way comparison seems to be a reasonable compromise between biological accuracy and computational feasibility.
Testing our method, we have discovered that the phenomenon of NUMTs is extensive in the honey bee genome. The significant appearance of NUMTs in the bee genome is quite surprising considering that this phenomenon is nearly absent both in Anopheles gambiae and in Drosophila melanogaster (Richly and Leister 2004 In contrast to the previous application of this method in ProtoNet, the focus in ProtoBee is on a whole-genome comparative view. The ability to divide the proteins into functional groups and view each group in light of three whole proteomes provides a unique view of the functional organization of the bee proteome in light of two other metazoan proteomes. This led us to highlight interesting groups of proteins that may be able to account for unique biological characteristics of the bee. It is important to recognize that the predictions made by this method may be, in some cases, lacking or mistaken. However, our goal in highlighting potentially interesting clusters is not to provide a finalized comprehensive list of gene-loss and function-gain events, but merely to select a subset of clusters that suggest further in-depth examination. By studying a few examples of such clusters it is evident that some of these are genuinely interesting. The purpose of the examples that we provide is to demonstrate the ability of the ProtoBee method to pinpoint interesting and often surprising biology in the genome. Obviously, these biological findings require further research in order to evaluate their significance. We expect the lists of putative gene losses, unique function proteins, and bee-specific paralogs to conceal within them many more exciting biological stories.
Sources and tools
The protein database that was clustered consisted of the SWISS-PROT database version 41.21 (133,312 proteins), additional mouse and fly proteins from TrEMBL version 24.8 (20,730 Drosophila proteins and 35,199 mouse proteins), and the GLEAN3 set of predicted proteins (http://www.protobee.cs.huji.il and http://www.hgsc.bcm.tmc.edu/projects/honeybee) from release v3.0 of the Apis mellifera genome (10,157 proteins). Fifty-five previously known bee proteins that appeared in SWISS-PROT were removed from the database in order to avoid duplicate instances of the proteins, leaving our protein database at a total size of 199,343 protein sequences.
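The database size quoted above can be checked with a few lines of arithmetic. The sketch also relates the total to the figure of nearly 4 × 10^10 sequence comparisons mentioned in the Discussion, under the assumption (not stated explicitly in the text) that this figure counts an all-against-all comparison of every ordered pair of sequences. Variable names are illustrative.

```python
# Component counts as given in the text.
swissprot = 133_312        # SWISS-PROT version 41.21
trembl_fly = 20_730        # additional Drosophila proteins from TrEMBL 24.8
trembl_mouse = 35_199      # additional mouse proteins from TrEMBL 24.8
glean3_bee = 10_157        # GLEAN3 predicted bee proteins
removed_known_bee = 55     # known bee proteins removed from SWISS-PROT

total = swissprot + trembl_fly + trembl_mouse + glean3_bee - removed_known_bee

# Assumption: all-against-all comparison counted over ordered pairs.
all_against_all = total * total

print(f"database size: {total:,}")
print(f"ordered pairwise comparisons: {all_against_all:.2e}")
```

The sum reproduces the stated total of 199,343 sequences, and the square of that number is roughly 3.97 × 10^10, which agrees with "nearly 4 × 10^10".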
For sequence comparison, NCBI BLAST (Altschul et al. 1997
Protein clustering
Protein annotation
where A is the set of all proteins in the database that have annotation a. These relatively strict requirements ensure that clusters that are biologically incoherent do not affect the process of assigning annotations and that uninformative annotations are avoided. The annotations that are assigned to the clusters are taken from the following sources: UniProt keywords, InterPro, GO "molecular function" and "biological process" terms (including the GOA mapping), and E.C. (Enzyme Classification) numbers. Finally, each bee protein is assigned the annotations that were given to the cluster to which it belongs and the annotations that were assigned to all of the cluster's parents in the hierarchy.
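The final step described above, in which a bee protein receives the annotations of its own cluster together with those of every ancestor cluster, can be sketched as a walk from the cluster to the root of its tree. This is an illustration only: the `parent` and `cluster_annotations` dictionaries, the cluster names, and the function name are hypothetical, and the scoring that decides which annotations a cluster receives is not reproduced here.

```python
def inherited_annotations(cluster, parent, cluster_annotations):
    """Collect annotations from a cluster and all of its ancestors.

    cluster:             id of the cluster that directly contains the protein
    parent:              hypothetical mapping cluster id -> parent cluster id
                         (root clusters are absent from the mapping)
    cluster_annotations: hypothetical mapping cluster id -> set of annotations
    """
    collected = set()
    node = cluster
    while node is not None:
        collected |= cluster_annotations.get(node, set())
        node = parent.get(node)  # returns None once the root has been processed
    return collected


# Toy hierarchy (invented): leaf -> middle -> root.
parent = {"leaf": "middle", "middle": "root"}
cluster_annotations = {
    "root": {"metabolism"},
    "leaf": {"glucose-6-phosphate isomerase"},
}
protein_annotations = inherited_annotations("leaf", parent, cluster_annotations)
```

Walking up the tree in this way is what lets a single protein carry both a very general annotation from a large ancestor cluster and a very specific one from a small cluster near the leaves.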
We thank Ori Sasson for his effort in preparing the data for ProtoBee and Alex Savenok for development of the ProtoBee Web site. We thank the Baylor College of Medicine Human Genome Sequencing Center for making the Apis mellifera genome sequence publicly available prior to publication. This work is supported in part by the EU Framework VI, NoE BioSapiens consortium for Genome Annotation. N.K. is a fellow of the Sudarsky Center for Computational Biology of the Hebrew University.
1 Corresponding author.
E-mail michall{at}cc.huji.ac.il; fax 972-2-6586448. Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.4916306.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389–3402.
Bairoch, A., Apweiler, R., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., and Magrane, M., et al. 2005. The universal protein resource (UniProt). Nucleic Acids Res. 33: D154–D159.
Beye, M., Hasselmann, M., Fondrk, M.K., Page, R.E., and Omholt, S.W. 2003. The gene csd is the primary signal for sexual development in the honeybee and encodes an SR-type protein. Cell 114: 397–398.
Blackshaw, S. and Snyder, S.H. 1999. Encephalopsin: A novel mammalian extraretinal opsin discretely localized in the brain. J. Neurosci. 19: 3681–3690.
Bloch, G., Solomon, S.M., Robinson, G.E., and Fahrbach, S.E. 2003. Patterns of PERIOD and pigment-dispersing hormone immunoreactivity in the brain of the European honeybee (Apis mellifera): Age- and time-related plasticity. J. Comp. Neurol. 464: 269–284.
Boguski, M.S., Lowe, T.M., and Tolstoshev, C.M. 1993. dbEST: database for "expressed sequence tags". Nat. Genet. 4: 332–333.
Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E., Maslen, J., Binns, D., Harte, N., Lopez, R., and Apweiler, R. 2004. The Gene Ontology Annotation (GOA) Database: Sharing knowledge in UniProt with Gene Ontology. Nucleic Acids Res. 32: D262–D266.
Crozier, R.H. and Crozier, Y.C. 1993. The mitochondrial genome of the honeybee Apis mellifera: Complete sequence and genome organization. Genetics 133: 97–117.
Emanuelsson, O., Nielsen, H., Brunak, S., and von Heijne, G. 2000. Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J. Mol. Biol. 300: 1005–1016.
Felsenstein, J. 1988. Phylogenies from molecular sequences: Inference and reliability. Annu. Rev. Genet. 22: 521–565.
Gmachl, M. and Kreil, G. 1995. The precursors of the bee venom constituents apamin and MCD peptide are encoded by two genes in tandem which share the same 3'-exon. J. Biol. Chem. 270: 12704–12708.
Harris, M.A., Clark, J., Ireland, A., Lomax, J., Ashburner, M., Foulger, R., Eilbeck, K., Lewis, S., Marshall, B., and Mungall, C., et al. 2004. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 32: D258–D261.
Hasselmann, M. and Beye, M. 2004. Signatures of selection among sex-determining alleles of the honey bee. Proc. Natl. Acad. Sci. 101: 4888–4893.
Holt, R.A., Subramanian, G.M., Halpern, A., Sutton, G.G., Charlab, R., Nusskern, D.R., Wincker, P., Clark, A.G., Ribeiro, J.M., and Wides, R., et al. 2002. The genome sequence of the malaria mosquito Anopheles gambiae. Science 298: 129–149.
The Honey Bee Genome Sequencing Consortium 2006. Insights into social insects from the genome of the honey bee Apis mellifera. Nature (in press).
Hubbard, T., Andrews, D., Caccamo, M., Cameron, G., Chen, Y., Clamp, M., Clarke, L., Coates, G., Cox, T., and Cunningham, F., et al. 2005. Ensembl 2005. Nucleic Acids Res. 33: D447–D453.
Kaplan, N., Friedlich, M., Fromer, M., and Linial, M. 2004. A functional hierarchical organization of the protein sequence space. BMC Bioinformatics 5: 196.
Kaplan, N., Sasson, O., Inbar, U., Friedlich, M., Fromer, M., Fleischer, H., Portugaly, E., Linial, N., and Linial, M. 2005. ProtoNet 4.0: A hierarchical classification of one million protein sequences. Nucleic Acids Res. 33: D216–D218.
Max, M., McKinnon, P.J., Seidenman, K.J., Barrett, R.K., Applebury, M.L., Takahashi, J.S., and Margolskee, R.F. 1995. Pineal opsin: A nonvisual opsin expressed in chick pineal. Science 267: 1502–1506.
Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Bateman, A., Binns, D., Bradley, P., Bork, P., Bucher, P., and Cerutti, L., et al. 2005. InterPro, progress and status in 2005. Nucleic Acids Res. 33: D201–D205.
Nakai, K. and Horton, P. 1999. PSORT: A program for detecting sorting signals in proteins and predicting their subcellular localization. Trends Biochem. Sci. 24: 34–36.
Needleman, S.B. and Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443–453.
Olson, S.A. 2002. EMBOSS opens up sequence analysis. European Molecular Biology Open Software Suite. Brief. Bioinform. 3: 87–91.
Park, J.H. and Hall, J.C. 1998. Isolation and chronobiological analysis of a neuropeptide pigment-dispersing factor gene in Drosophila melanogaster. J. Biol. Rhythms 13: 219–228.
Quevillon, E., Silventoinen, V., Pillai, S., Harte, N., Mulder, N., Apweiler, R., and Lopez, R. 2005. InterProScan: Protein domains identifier. Nucleic Acids Res. 33: W116–W120.
Richly, E. and Leister, D. 2004. NUMTs in sequenced eukaryotic genomes. Mol. Biol. Evol. 21: 1081–1084.
Sasson, O., Kaplan, N., and Linial, M. 2006. Functional annotation prediction: All for one and one for all. Protein Sci. 15: (in press).
Spaethe, J. and Briscoe, A.D. 2004. Early duplication and functional diversification of the opsin gene family in insects. Mol. Biol. Evol. 21: 1583–1594.
Thompson, J.D., Higgins, D.G., and Gibson, T.J. 1994. ClustalW: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22: 4673–4680.
Thresher, R.J., Vitaterna, M.H., Miyamoto, Y., Kazantsev, A., Hsu, D.S., Petit, C., Selby, C.P., Dawut, L., Smithies, O., and Takahashi, J.S., et al. 1998. Role of mouse cryptochrome blue-light photoreceptor in circadian photoresponses. Science 282: 1490–1494.
Velarde, R.A., Sauer, C.D., Walden, K.K., Fahrbach, S.E., and Robertson, H.M. 2005. Pteropsin: A vertebrate-like non-visual opsin expressed in the honey bee brain. Insect Biochem. Mol. Biol. 35: 1367–1377.
Whitfield, C.W., Band, M.R., Bonaldo, M.F., Kumar, C.G., Liu, L., Pardinas, J.R., Robertson, H.M., Soares, M.B., and Robinson, G.E. 2002. Annotated expressed sequence tags and cDNA microarrays for studies of brain and behavior in the honey bee. Genome Res. 2: 555–566.
Received November 11, 2005; accepted in revised format June 1, 2006.