Published online before print
October 25, 2006, 10.1101/gr.5580606 Genome Res. 16:1329-1333, 2006 ©2006 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/06 $5.00 OPEN ACCESS ARTICLE
Commentary
Community annotation: Procedures, protocols, and supporting tools
1Department of Animal Science, Texas A&M University, College Station, Texas 77843, USA; 2Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030, USA
Investigators at the Baylor College of Medicine Human Genome Sequencing Center (BCMHGSC) and BeeBase organized a community-wide effort to manually annotate the honey bee (Apis mellifera) genome. Although various strategies for manual annotation have been used in the past, the value of dispersed community annotation has not yet been demonstrated. Here we make a case for the merit of dispersed community annotation. We present annotation procedures, standard protocols, and tools used for sequence analysis, data submission, and data management. We also report lessons learned from this dispersed community annotation effort for a metazoan genome.
Annotation is one of the most difficult tasks in genome sequencing projects, yet it is essential for connecting genome sequence to biology. Lincoln Stein (2001)
Strategies for annotation were discussed after completion of the human and fly draft genomes (Claverie 2000).
The community-wide effort to annotate the honey bee genome was unusual in that it was a decentralized open annotation project for a metazoan genome. It seems appropriate that the honey bee research community, working on the first social insect genome to be sequenced, should embark on an annotation sociology experiment. Honey bee investigators around the world cooperated over a three-month period to manually annotate >3000 gene models for genes of particular interest in honey bee research. Here we present the approaches taken by BCMHGSC, BeeBase, and the Honey Bee Genome Sequencing Consortium and demonstrate the value of a decentralized community annotation effort.
Our approach to successful community annotation
To avoid the potential problems associated with open annotation, we combined communication via Listserv and conference calls, standard operating procedures (SOPs), a central source of annotation data sets, an annotation submission Web site, and expert review. To avoid duplication, the community (of 177 registered individuals) was divided into groups based on themes of biological interest, each with a group leader, and BCMHGSC provided an annotation Web site that allowed registered users to view submissions. To optimize consistency, (1) the community developed SOPs, (2) requirements for data submission were established and enforced at the submission Web site, and (3) all submissions were reviewed by an expert at BeeBase prior to assignment of identifiers and incorporation into the honey bee Official Gene Set (OGS).
Community organization and communication support
Standard operating procedures
Annotation Web site and database
Analysis tools
Gene model review and ID assignment
Outcome of community annotation
The number of genes and gene models is summarized in Table 1. The number of gene models in the OGS increased from 10,157 to 10,314, despite 302 OGS models being dropped due to splits and merges. Over 25% of the OGS was touched by manual gene model revision, confirmation, or functional annotation. Of these annotations, about half of the gene model coordinates were unchanged, and 12% were revised by the BeeBase curator. The BeeBase curator revised the models to match the genome assembly; some revisions were due to splice site differences or coding sequences that were not extended fully to the start and stop codons, but many represented additional information from PCR experiments or other nongenome data that could not be represented as coordinates in the genome sequence. These sequences are a benefit of this annotation process and are available from BeeBase in their original annotated form.
Duplicated annotation to some extent could not be avoided, because some genes could be grouped into multiple community themes. However, in most cases gene models were accepted for alternative splice forms of the same gene and treated as different annotations. In the few cases of conflicts regarding splitting or merging gene models, submitters were notified and the conflicts were resolved.
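Identifying duplicated or conflicting submissions of this kind largely comes down to finding gene models whose exons overlap on the same scaffold and strand. The following is a minimal sketch of such a check, not the BeeBase or BCMHGSC implementation: the `GeneModel` structure, its field names, and the example identifiers and coordinates are all hypothetical, and coordinates are assumed to be 1-based and inclusive.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class GeneModel:
    """A submitted gene model (hypothetical structure for illustration)."""
    gene_id: str
    scaffold: str
    strand: str    # "+" or "-"
    exons: tuple   # ((start, end), ...), 1-based inclusive coordinates


def exons_overlap(a, b):
    """Return True if any exon of a shares at least one base with an exon of b."""
    if a.scaffold != b.scaffold or a.strand != b.strand:
        return False
    return any(s1 <= e2 and s2 <= e1
               for s1, e1 in a.exons
               for s2, e2 in b.exons)


def find_overlaps(models):
    """Return sorted pairs of gene IDs whose exons overlap."""
    return sorted((a.gene_id, b.gene_id)
                  for a, b in combinations(models, 2)
                  if exons_overlap(a, b))


# Made-up example submissions; none of these are real honey bee gene models.
submissions = [
    GeneModel("modelA", "Group1.1", "+", ((100, 250), (400, 520))),
    GeneModel("modelB", "Group1.1", "+", ((500, 640), (800, 910))),
    GeneModel("modelC", "Group1.1", "-", ((120, 230),)),
    GeneModel("modelD", "Group2.3", "+", ((100, 250),)),
]

print(find_overlaps(submissions))  # [('modelA', 'modelB')]
```

A flagged pair is only a candidate for review: as noted above, overlapping models may be legitimate alternative splice forms of the same gene rather than conflicts, so a curator would still need to decide how each pair is handled.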
Lessons learned
The issue of data consistency is addressed by collecting annotations in a central database with appropriate constraints. Strict constraints on the Gene IDs make redundant annotations easier to identify and reduce the number of conflicting annotations. Requiring exon features (coordinates, sequence) to be included in a gene annotation makes identification of overlapping annotations and consistency checking easier. Methods to import OGS sequence and feature information into the submission interface speed the annotation process and reduce data entry errors, which otherwise generate minor alignment inconsistencies that take effort to follow up. For the few sequences with major differences between the annotated gene and the OGS gene, importing the OGS sequence is not beneficial.
Allowing the annotators flexibility in genomic sequence data sources is a mixed blessing. Most of the annotators used assembly2, the release available at NCBI and Ensembl at that time. BeeBase provided assembly2 as well as newer assemblies and unassembled sequences, so that community members could annotate sequences that were not represented in assembly2. As a result of access to additional sequence data, a number of gene models were improved by extending fragments and identifying missing exons. However, the need to map all the gene models to the same assembly meant that some submitted models had to be revised. The original submitted gene models are available at BeeBase, with the expectation that they will map to future assemblies.
A related issue is that migrating annotations from one assembly to another is not a solved problem. In the honey bee, we used sequence alignments of gene sequences to the genome assembly to map the gene features to the new assembly. Methods that convert coordinates based on known mapping of genome assembly contigs in the two assemblies have the potential to be more reliable, although contigs also may change between assembly versions and hence may not map cleanly from one assembly to the next.
A noteworthy outcome of the honey bee annotation effort was its community-building effect. The experience provided a valuable learning opportunity for community members who had not previously annotated gene models, including graduate students and postdoctoral researchers. In addition, the BeeBase staff became well acquainted with the community and gained exposure to important areas in honey bee biology. We anticipate a long-lasting synergy between community members and BeeBase, which will prove to be especially helpful in the development of a new model organism. This model is being modified and expanded for other ongoing BCMHGSC sequencing projects, including the sea urchin (Strongylocentrotus purpuratus) and the red flour beetle (Tribolium castaneum).
Two issues arise with the prospect of exploiting the newly developed community-BeeBase synergy by continuing community annotation at BeeBase. One is the submission of annotations to NCBI. BCMHGSC and BeeBase have developed an agreement with NCBI, by which BCMHGSC would grant permission to BeeBase to submit new gene set releases as annotations on assembly files. This mechanism was used for the D. pseudoobscura annotation in the BCMHGSC collaboration with FlyBase. The other issue is that of funding to allow the continued management of community annotation at BeeBase. To date, community members have rallied to help raise support for BeeBase, which resulted in several funding sources acknowledged below. Although we do not anticipate funding at a "museum model" level, we do expect funding to continue supporting community participation and submission of annotations to NCBI.
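To make the contig-based alternative to alignment-based migration discussed in this section concrete, the sketch below converts a position from one assembly to another through the known placement of contigs in each. It is an illustration under stated assumptions, not the method used for the honey bee: the `Placement` structure, the contig names, and all coordinates are hypothetical, positions are 1-based, and a contig is treated as unmappable if it is absent from the new assembly or its length has changed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Placement:
    """Where a contig sits in one assembly (hypothetical structure)."""
    scaffold: str
    start: int      # 1-based scaffold position of the contig's first placed base
    length: int
    reverse: bool   # True if the contig is placed reverse-complemented


def to_contig_offset(pos: int, p: Placement) -> int:
    """Convert a scaffold position into a 0-based offset within the contig."""
    rel = pos - p.start
    return p.length - 1 - rel if p.reverse else rel


def from_contig_offset(offset: int, p: Placement) -> int:
    """Convert a 0-based contig offset back into a scaffold position."""
    rel = p.length - 1 - offset if p.reverse else offset
    return p.start + rel


def convert(scaffold: str, pos: int, old: dict, new: dict) -> Optional[tuple]:
    """Map (scaffold, pos) from the old assembly to the new one.

    Returns None when the position falls in a gap between contigs, or when
    its contig was dropped or changed length in the new assembly.
    """
    for contig, p in old.items():
        if p.scaffold == scaffold and p.start <= pos < p.start + p.length:
            q = new.get(contig)
            if q is None or q.length != p.length:
                return None
            return q.scaffold, from_contig_offset(to_contig_offset(pos, p), q)
    return None


# Made-up contig placements in two assembly versions.
old = {
    "c1": Placement("Group1", 1, 1000, False),
    "c2": Placement("Group1", 1501, 800, False),
    "c3": Placement("Group1", 3000, 500, False),
}
new = {
    "c1": Placement("Group1", 1, 1000, False),
    "c2": Placement("Group1", 1201, 800, True),   # moved and flipped
    # c3 is absent from the new assembly
}

print(convert("Group1", 1600, old, new))  # ('Group1', 1901)
print(convert("Group1", 3100, old, new))  # None
```

The `None` results correspond to the caveat above that contigs may change between assembly versions; features in such regions would still need alignment-based mapping or manual revision. A complete converter would also flip the strand of any feature whose contig changes orientation, which this sketch omits.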
We thank Hugh Robertson, Gene Robinson, Erica Sodergren, and Jay Evans for helpful discussions about the annotation process and the software tools needed to support it. This work was funded by grants from USDA-ARS and NHGRI, NIH. Funding for BeeBase (CGE) includes USDA ARS Special Cooperative Agreement 58-6413-6-034, supplement to NIH 5-P41-HG000739-13, the Texas Agricultural Experiment Station, and gifts from Golden Heritage Foods and Sioux Honey Association.
3 These authors contributed equally to this work.
E-mail kworley{at}bcm.edu; fax (713) 798-6977. [Supplemental material is available online at www.genome.org. The genome sequence is available under the accession numbers CM000054-CM000069 (for chromosome linkage groups) and AADG05* for contigs.] Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.5580606.
Aguero, F., Zheng, W., Weatherly, D.B., Mendes, P., and Kissinger, J.C. 2006. TcruziDB: An integrated, post-genomics community resource for Trypanosoma cruzi. Nucleic Acids Res. 34: D428-D431.
Braun, B.R., van Het Hoog, M., d'Enfert, C., Martchenko, M., Dungan, J., Kuo, A., Inglis, D.O., Uhl, M.A., Hogues, H., and Berriman, M., et al. 2005. A human-curated annotation of the Candida albicans genome. PLoS Genet. 1: 36-57.
The Chimpanzee Sequencing and Analysis Consortium 2005. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437: 69-87.
Cho, S., Huang, Z.Y., Green, D.R., Smith, D.R., and Zhang, J. 2006. Evolution of the complementary sex-determination gene of honey bees: Balancing selection and trans-species polymorphisms. Genome Res. (this issue).
Claverie, J.M. 2000. Do we need a huge new centre to annotate the human genome? Nature 403: 12.
Collins, A.M., Caperna, T.J., Williams, V., Garrett, W.M., and Evans, J.D. 2006. Proteomics and genomics of honey bee seminal vesicles and semen. Insect Mol. Biol. (in press).
Cunningham, W. and Leuf, B. 2001. The Wiki way: Quick collaboration on the Web. Addison-Wesley, New York.
D'Ascenzo, M.D., Collmer, A., and Martin, G.B. 2004. PeerGAD: A peer-review-based and community-centric web application for viewing and annotating prokaryotic genome sequences. Nucleic Acids Res. 32: 3124-3135.
Dearden, P.K., Wilson, M.J., Sablan, L., Osborne, P.W., Havler, M., McNaughton, E., Kimura, K., Milshina, N.V., Hasselman, M., and Gempe, T., et al. 2006. Patterns of conservation and change in honey bee developmental genes. Genome Res. (this issue).
Dehal, P., Satou, Y., Campbell, R.K., Chapman, J., Degnan, B., De Tomaso, A., Davidson, B., Di Gregorio, A., Gelpke, M., and Goodstein, D.M., et al. 2002. The draft genome of Ciona intestinalis: Insights into chordate and vertebrate origins. Science 298: 2157-2167.
Dowell, R.D., Jokerst, R.M., Day, A., Eddy, S.R., and Stein, L. 2001. The distributed annotation system. BMC Bioinformatics 2: 7.
Evans, J.D., Aronstein, K., Chen, Y.P., Hetru, C., Imler, J.-L., Jiang, H., Kanost, M., Thompson, G., Zou, Z., and Hultmark, D. 2006. Immune-related genes and honey bee disease responses. Insect Mol. Biol. (in press).
Forêt, S. and Maleszka, R. 2006. Function and evolution of odorant binding protein gene family in a social insect, the honey bee (Apis mellifera). Genome Res. (this issue).
Galagan, J.E., Nusbaum, C., Roy, A., Endrizzi, M.G., Macdonald, P., FitzHugh, W., Calvo, S., Engels, R., Smirnov, S., and Atnoor, D., et al. 2002. The genome of M. acetivorans reveals extensive metabolic and physiological diversity. Genome Res. 12: 532-542.
Glasner, J.D., Liss, P., Plunkett III, G., Darling, A., Prasad, T., Rusch, M., Byrnes, A., Gilson, M., Biehl, B., and Blattner, F.R., et al. 2003. ASAP, a systematic annotation package for community analysis of genomes. Nucleic Acids Res. 31: 147-151.
Hillier, L.W., Miller, W., Birney, E., Warren, W., Hardison, R.C., Ponting, C.P., Bork, P., Burt, D.W., Groenen, M.A., and Delany, M.E., et al. 2004. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432: 695-716.
The Honey Bee Genome Sequencing Consortium 2006. Insights into social insects from the genome of the honey bee Apis mellifera. Nature (in press).
Hubbard, T. and Birney, E. 2000. Open annotation offers a democratic solution to genome sequencing. Nature 403: 825.
Jones, A.K., Raymond-Delpech, V., Thany, S.H., Gauthier, M., and Sattelle, D.B. 2006. The nicotinic acetylcholine receptor gene family of the honey bee, Apis mellifera. Genome Res. (this issue).
Kapustin, Y., Souvorov, A., and Tatusova, T. 2004. Splign: A hybrid approach to spliced alignments. In Proceedings of RECOMB 2004 - Research in computational molecular biology, p. 741.
Kawai, J., Shinagawa, A., Shibata, K., Yoshino, M., Itoh, M., Ishii, Y., Arakawa, T., Hara, A., Fukunishi, Y., and Konno, H., et al. 2001. Functional annotation of a full-length mouse cDNA collection. Nature 409: 685-690.
Kunieda, T., Fujiyuki, T., Kucharski, R., Forêt, S., Ohashi, K., Takeuchi, H., Kamicouchi, A., Kage, E., Morioka, M., and Ament, S., et al. 2006. Unique characteristics of the honeybee genes for carbohydrate-metabolizing enzymes as revealed by the genome annotation. Insect Mol. Biol. (in press).
Lewis, S.E., Searle, S.M., Harris, N., Gibson, M., Iyer, V., Richter, J., Wiel, C., Bayraktaroglir, L., Birney, E., and Crosby, M.A., et al. 2002. Apollo: A sequence annotation editor. Genome Biol. 3: research0082.
Lindblad-Toh, K., Wade, C.M., Mikkelsen, T.S., Karlsson, E.K., Jaffe, D.B., Kamal, M., Clamp, M., Chang, J.L., Kulbokas III, E.J., and Zody, M.C., et al. 2005. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature 438: 803-819.
McLeod, M.P., Qin, X., Karpathy, S.E., Gioia, J., Highlander, S.K., Fox, G.E., McNeill, T.Z., Jiang, H., Muzny, D., and Jacob, L.S., et al. 2004. Complete genome sequence of Rickettsia typhi and comparison with sequences of other rickettsiae. J. Bacteriol. 186: 5842-5855.
Ohyanagi, H., Tanaka, T., Sakai, H., Shigemoto, Y., Yamaguchi, K., Habara, T., Fujii, Y., Antonio, B.A., Nagamura, Y., and Imanishi, T., et al. 2006. The Rice Annotation Project Database (RAP-DB): Hub for Oryza sativa ssp. japonica genome information. Nucleic Acids Res. 34: D741-D744.
Pennisi, E. 2000. Ideas fly at gene-finding jamboree. Science 287: 2182-2184.
Postel, J. and Reynolds, J. 1985. File Transfer Protocol (FTP). In RFC 959, Network Working Group.
The Rat Genome Sequencing Consortium 2004. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428: 493-521.
Riley, M., Abe, T., Arnaud, M.B., Berlyn, M.K., Blattner, F.R., Chaudhuri, R.R., Glasner, J.D., Horiuchi, T., Keseler, I.M., and Kosuge, T., et al. 2006. Escherichia coli K-12: A cooperatively developed annotation snapshot-2005. Nucleic Acids Res. 34: 1-9.
Robertson, H.M. and Wanner, K.W. 2006. The chemoreceptor superfamily in the honey bee Apis mellifera: Expansion of the odorant, but not gustatory, receptor family. Genome Res. (in press).
Slater, G.S. and Birney, E. 2005. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6: 31.
Stein, L. 2001. Genome annotation: From sequence to biology. Nat. Rev. Genet. 2: 493-503.
Stein, L.D., Mungall, C., Shu, S., Caudy, M., Mangone, M., Day, A., Nickerson, E., Stajich, J.E., Harris, T.W., and Arva, A., et al. 2002. The generic genome browser: A building block for a model organism system database. Genome Res. 12: 1599-1610.
Stover, C.K., Pham, X.Q., Erwin, A.L., Mizoguchi, S.D., Warrener, P., Hickey, M.J., Brinkman, F.S., Hufnagle, W.O., Kowalik, D.J., and Lagrou, M., et al. 2000. Complete genome sequence of Pseudomonas aeruginosa PA01, an opportunistic pathogen. Nature 406: 959-964.
Sutherland, T.D., Weisman, S., Trueman, H., and Haritos, V.S. 2006. Honey bee silk genes encoding novel coiled coil proteins have evolved independently of other insect silk genes. Genome Res. (this issue).
Thomas, E. 1986. LISTSERV. L-Soft International Inc, Landover, MD.
Tripathy, S., Pandey, V.N., Fang, B., Salas, F., and Tyler, B.M. 2006. VMD: A community annotation database for oomycetes and microbial genomes. Nucleic Acids Res. 34: D379-D381.
Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., and An, P., et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520-562.
Winsor, G.L., Lo, R., Sui, S.J., Ung, K.S., Huang, S., Cheng, D., Ching, W.K., Hancock, R.E., and Brinkman, F.S. 2005. Pseudomonas aeruginosa Genome Database and PseudoCAP: Facilitating community-based, continually updated, genome annotation. Nucleic Acids Res. 33: D338-D343.