Published online before print
May 17, 2005, 10.1101/gr.3756405 Genome Res. 15:893-899, 2005 ©2005 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/05 $5.00 OPEN ACCESS ARTICLE
Resources
AnoEST: Toward A. gambiae functional genomics
European Molecular Biology Laboratory, D69117 Heidelberg, Germany
Here, we present an analysis of 215,634 EST and cDNA sequences of a major vector of human malaria Anopheles gambiae structured into the AnoEST database. The expressed sequences are grouped into clusters using genomic sequence as template and associated with inferred functional annotation, including the following: corresponding Ensembl gene prediction, putative orthologous genes in other species, homology to known proteins, protein domains, associated Gene Ontology terms, and corresponding classification into broad GO-slim functional groups. AnoEST is a vital resource for interpretation of expression profiles derived using recently developed A. gambiae cDNA microarrays. Using these cDNA microarrays, we have experimentally confirmed the expression of 7961 clusters during mosquito development. Of these, 3100 are not associated with currently predicted genes. Moreover, we found that clusters with confirmed expression are nonbiased with respect to the current gene annotation or homology to known proteins. Consequently, we expect that many as yet unconfirmed clusters are likely to be actual A. gambiae genes. [AnoEST is publicly available at http://komar.embl.de, and is also accessible as a Distributed Annotation Service (DAS).]
Blood-feeding anopheline mosquitoes are obligatory vectors for the transmission of the malaria parasites of the genus Plasmodium. The parasites undergo asexual development within mammalian hosts and produce gametocytes which, when ingested by the mosquito, initiate the sexual cycle that culminates with production of sporozoites. In turn, an infected mosquito takes another bloodmeal and sporozoites are released into the circulation of a naive host, thus completing the transmission cycle. Human malaria causes over 1 million deaths every year in the developing world. Recently, in recognition of the great importance of Anopheles gambiae in global health, its genome has been sequenced by an international scientific consortium (Holt et al. 2002
Here, we report a large-scale study of malaria mosquito A. gambiae EST and cDNA sequences structured into the newly developed AnoEST database. Using these cDNA microarray data in conjunction with AnoEST, we have experimentally confirmed expression of 7961 clusters during mosquito development. Of these, 3100 are not associated with currently predicted genes (Holt et al. 2002
A. gambiae EST classification
We collected from public sequence databases (Benson et al. 2004
The descriptive statistics of the AnoEST data are provided in Table 1, which includes the numbers of different types of clusters and their annotation with respect to the Ensembl, UniProt/SWISS-PROT (Apweiler et al. 2004
In total, 11,608 T-clusters overlap with 10,726 Ensembl gene models (of 14,364 Ensembl predictions as of Aug. 10, 2004, v23.2b.1), indicating that, despite very strict clustering criteria, the analysis probably produced only a small number of fragmentation artifacts. On average, the derived EST clusters overlap with Ensembl gene models by about 920 nt, corresponding to 70% of the shorter loci; 2695 clusters overlap Ensembl gene models by >90%. Only 452 EST clusters have overlaps shorter than 20%; these probably derive from UTRs. Interestingly, 9870 T-clusters (4789 of which are supported by two or more EST/cDNAs) have no associated Ensembl gene predictions.
N-type clusters are quite different; they are twice as numerous, but have only one-sixth as many Ensembl overlaps as do the T-type clusters (Table 1). Moreover, 35,660 (77%) of the N-type clusters are formed by only 863 ESTs, each of which is aligned to at least 50 distinct genomic loci. These likely represent transposable elements in A. gambiae, as 24,984 N-type clusters show significant homology to known transposable elements in RepBase9.12 (Jurka 2000
Analysis of T-clusters
First, we explored the question of whether clusters with verified expression but without annotation represent low-level transcriptional leakage or whether they are expressed at levels comparable to those of recognized genes. For this purpose, we compared the distribution of log2-transformed values of expression for T-clusters with and without Ensembl gene prediction and for the fraction of clusters with and without SWISS-PROT homologs. As shown in Figure 2, in both cases, genes with and without annotation showed rather similar distributions, with only a small shift toward lower expression values in the absence of annotation, which was slightly more pronounced for clusters with SWISS-PROT homologs. Only 61 T-type clusters with confirmed expression, 20 of which have a corresponding gene model, show significant homology to A. gambiae transposable elements. This comparison suggested that most of the 3100 EST clusters that are currently lacking a predicted gene model have detectable expression and are likely to be actual genes.
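The comparison of log2-transformed expression values described above can be illustrated with a minimal Python sketch. It is not the authors' analysis: the function name `log2_median_shift` and the input values are hypothetical, and the median is used here only as one simple summary of the shift between two distributions.

```python
import math
from statistics import median


def log2_median_shift(annotated, unannotated):
    """Difference between the median log2 expression of unannotated and
    annotated clusters; a negative value means lower expression in the
    absence of annotation. Inputs are positive raw intensities."""
    annotated_log2 = [math.log2(v) for v in annotated]
    unannotated_log2 = [math.log2(v) for v in unannotated]
    return median(unannotated_log2) - median(annotated_log2)


# Hypothetical intensities, for illustration only (not data from the paper).
with_gene_model = [8.0, 16.0, 32.0]
without_gene_model = [4.0, 8.0, 16.0]
shift = log2_median_shift(with_gene_model, without_gene_model)
```

A small negative shift with largely overlapping distributions would correspond to the pattern reported for Figure 2.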
We then compared the T-cluster subsets with verified expression with those lacking microarray data (mostly not represented on the microarrays). These subsets were reasonably similar in terms of presence or absence of corresponding Ensembl predictions, SWISS-PROT homologs, or both (Table 2). As expected, the microarray expressed subset was substantially (fivefold) smaller than the subset lacking microarray data in the case of singletons, whereas the subsets were of equal size for
Based on the analysis summarized in Figures 1 and 2 and in Table 2, our working hypothesis is that a substantial fraction of EST singletons represents actual genes, as do most of the clusters supported by two or more ESTs. These data suggest that the number of genes in the A. gambiae genome may be substantially higher than currently predicted. A similar conclusion has been drawn recently for the Drosophila melanogaster genome using a combined bioinformatics and expression profiling approach (Hild et al. 2003
Interface to the AnoEST database
Examples of the available interactive searches are shown in Figure 3. By default, the information on queried sequences is returned in a condensed format showing data corresponding to the best-matching EST cluster (Fig. 3A). The "Sequences" tab at the top of the interface allows retrieval of the sequences in FASTA format and, if required, generates reverse-complemented sequences, e.g., for 3'-sequenced clones. The "Details" tab makes available more extensive information on similarity to known proteins and protein domains, orthology, GO, and "GO-slim" categories (Fig. 3B). The annotation available for each corresponding genomic region in Ensembl can also be explored through a direct link to the genome browser. The "Homology" tab refers to the full records of a similarity search of the EST cluster consensus sequence against the UniProt/SWISS-PROT protein database. The records allow manual inspection of the alignments and provide HTML links to the corresponding entries in the UniProt/SWISS-PROT database.
When exploring all expressed sequences assigned to one cluster (Fig. 3C), the visualization of EST alignments to the genome allows a quick grasp of the gene organization, EST coverage, and quality of the clustering. Sequences derived from 5'- and 3'-ends are colored differently. The scale bar indicates the real cluster length over the genomic alignment, sized to fit the image. The EST cluster image is mapped by HTML links to EST records for exploring cases of interest. To make the results more broadly accessible and integrated with the Ensembl genome browser, the data are also available through the DAS protocol (http://komar.embl.de:9000/das). A dump of the data in relational MySQL format is available upon request.
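As an illustration of what reverse complementing a 3'-sequenced clone and writing it in FASTA format involves, here is a minimal Python sketch. It does not reproduce the AnoEST interface code; the function names `reverse_complement` and `to_fasta` and the record name are hypothetical.

```python
# Translation table for the four bases plus N, in upper and lower case.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def to_fasta(name, seq, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# A 3'-end read is stored as the reverse complement of the transcript,
# so it is flipped back before display (hypothetical record name).
record = to_fasta("clone_3prime_rc", reverse_complement("ATGCN"))
```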
AnoEST utility for microarray analysis
Future developments
Together with the Ensembl team, we are planning to use the obtained results for refinement of current gene predictions in the Anopheles genome. This would complement the approach of another Anopheles database, AnoBase (http://www.anobase.org/), which is oriented toward manual refinement of automatically predicted gene models. The functional and expression data available through AnoEST are also being used for the discovery and annotation of alternative splicing events. In the future, we plan to extend AnoEST for use with the previously mentioned new generation of single-exon amplicon microarrays that will permit coupling of transcription profiling of the whole mosquito genome with other high-throughput functional assays, such as the production and use of specific double-stranded RNAs for RNAi gene silencing, and the production of peptides to develop antibody panels. This new microarray platform (MMC2) is designed in the context of an informal Mosquito Microarray Consortium (MMC) that emerged as an initiative to coordinate and standardize global transcriptional studies in A. gambiae. The current AnoEST data on clusters of expressed sequences that are not matched with current Ensembl gene models, as well as on alternatively spliced transcripts, are used to design additional features of MMC2. Although AnoEST was initiated as an independent database, it will be adapted to serve as one of the functional genomics modules of a new integrated genomic data resource for multiple vectors of disease, VectorBase (http://www.vectorbase.org).
EST clustering
The analysis begins with the collection and processing of all available A. gambiae EST and cDNA sequences, linked with their GenBank/EMBL-Bank/DDBJ accession number, clone name identifier, cDNA strand information, and nucleotide sequence. All sequences are then aligned to the unmasked reference genome using the BLAT algorithm (Kent 2002). ESTs are then clustered (assigned into groups) on the basis of their genomic overlap. For example, two sequences are assigned to the same cluster if their overlap over the aligned regions (exons) is greater than a certain threshold (30 nt in the current version of AnoEST). To avoid CPU-intensive all-against-all EST comparisons, which would be computationally challenging when considering potential alignment of over 200,000 EST sequences with nearly 500,000 genomic loci, we compare ESTs only with the cluster's projection on the genome. DNA strands are considered independently. EST sequences originating from the 3'-end of a clone are deposited in public repositories as reverse complements; therefore, we alter their alignment strand information prior to clustering.
In many cases, an expressed sequence can be aligned to more than one place in the genome (paralogs, transposable elements), making it difficult to identify reliably which genomic locus is actually represented by the EST. To address this, we rank EST-to-genome alignments using a number-of-matches minus number-of-mismatches scoring scheme, similar to BLAT. The matches with the highest score are then marked as "best", or as "unique best" when the second-best score is significantly lower (e.g., by more than 15, to reflect the EST sequence error rate and weak support from the data distribution).
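The two steps described above, overlap-based clustering against a cluster's genomic projection and ranking of alternative alignments, can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the AnoEST implementation: the data layout (dictionaries with `est`, `chrom`, `strand`, and `exons` keys, 0-based half-open coordinates), the greedy single-pass strategy, and all function names are hypothetical. Only the 30-nt overlap threshold, the matches-minus-mismatches score, and the score margin of 15 come from the text.

```python
from collections import defaultdict

MIN_OVERLAP_NT = 30      # overlap threshold quoted in the text
UNIQUE_BEST_MARGIN = 15  # score gap quoted in the text


def merge_blocks(blocks):
    """Merge overlapping (start, end) blocks into a non-redundant projection."""
    merged = []
    for start, end in sorted(blocks):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def exon_overlap(blocks_a, blocks_b):
    """Total nucleotides shared by two lists of non-overlapping blocks."""
    return sum(max(0, min(ae, be) - max(a, b))
               for a, ae in blocks_a for b, be in blocks_b)


def cluster_alignments(alignments):
    """Greedy clustering: each alignment is compared only with the genomic
    projection of existing clusters on the same chromosome and strand."""
    clusters = defaultdict(list)  # (chrom, strand) -> list of clusters
    for aln in alignments:
        key = (aln["chrom"], aln["strand"])
        exons = merge_blocks(aln["exons"])
        new = {"members": [aln["est"]], "projection": exons}
        kept = []
        for cluster in clusters[key]:
            if exon_overlap(cluster["projection"], exons) > MIN_OVERLAP_NT:
                # An alignment bridging several clusters joins them.
                new["members"] += cluster["members"]
                new["projection"] = merge_blocks(
                    new["projection"] + cluster["projection"])
            else:
                kept.append(cluster)
        clusters[key] = kept + [new]
    return [c for group in clusters.values() for c in group]


def rank_alignments(hits):
    """hits: non-empty list of (locus, matches, mismatches) for one EST.
    Returns {locus: 'unique best' | 'best' | 'other'}."""
    scored = sorted(((m - mm, locus) for locus, m, mm in hits), reverse=True)
    top = scored[0][0]
    second = scored[1][0] if len(scored) > 1 else None
    unique = second is None or top - second > UNIQUE_BEST_MARGIN
    return {locus: ("other" if score < top
                    else "unique best" if unique else "best")
            for score, locus in scored}
```

Comparing each alignment with merged cluster projections, rather than with every other EST, is what avoids the all-against-all comparison mentioned in the text; keying clusters by chromosome and strand keeps the two DNA strands independent.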
Clusters including at least one "unique best" EST are identified as TCLAG (for Transcribed CLuster of Anopheles Gambiae, also referred to as T-clusters above), whereas clusters that share regions of high sequence identity with EST/cDNA sequences but have no sequence aligned to the locus as "unique best" are identified as NCLAG clusters (with No uniquely matched ESTs). The third type of cluster identifier, UCLAG, corresponds to ESTs that failed to align (Unaligned) to the A. gambiae nuclear or mitochondrial genome. In the final step of our clustering procedure, we join clusters that contain ESTs originating from the 5'- and 3'-ends of the same clone, provided that they map as "unique best" to the corresponding EST clusters, and they are on the same chromosome, the same strand, and <30 kb apart. The choice of many of the above-described parameters reflects a conservative approach that attempts to minimize errors of joining independent expressed loci at the expense of allowing some fragmentation errors, e.g., one gene could be represented by two EST clusters if we do not have sufficient information to link these clusters together. The observed representation of 10,726 Ensembl gene models by 11,608 T-clusters suggests only a small number of fragmentation artifacts. Use of strand-specific clustering avoids the severe problems of erroneous joining of distinct genes (data not shown). However, some sequences inserted into the plasmid in the wrong orientation form erroneous clusters on the strand opposite the actual genes. An upper estimate of such errors is about 11%, counting the number of T-clusters overlapping annotated genes with respect to T-clusters on the opposite strand without annotation (counting overlaps over an average 70%).
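The naming rule and the final joining step can be summarized in a short Python sketch. It is an illustration only: the function names, the dictionary layout, and the interpretation of "<30 kb apart" as the genomic gap between the two clusters are assumptions, not details taken from the AnoEST code.

```python
MAX_JOIN_DISTANCE = 30_000  # "<30 kb apart" in the text


def cluster_type(est_labels, aligned=True):
    """Classify a cluster from the alignment labels of its ESTs.
    est_labels: iterable of 'unique best', 'best', or 'other'."""
    if not aligned:
        return "UCLAG"  # ESTs that failed to align to the genome
    return "TCLAG" if "unique best" in est_labels else "NCLAG"


def can_join(c5, c3):
    """Decide whether two clusters holding the 5'- and 3'-end reads of the
    same clone may be joined. Each cluster is a dict with keys 'chrom',
    'strand', 'start', 'end', and 'unique_best' (True if the clone read
    maps to that cluster as "unique best")."""
    if not (c5["unique_best"] and c3["unique_best"]):
        return False
    if c5["chrom"] != c3["chrom"] or c5["strand"] != c3["strand"]:
        return False
    # Assumption: distance is the gap between cluster ends (negative if
    # the clusters overlap).
    gap = max(c5["start"], c3["start"]) - min(c5["end"], c3["end"])
    return gap < MAX_JOIN_DISTANCE
```

Requiring "unique best" support on both ends reflects the conservative stance described above: when in doubt, the sketch leaves two clusters separate rather than risk merging independent loci.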
Automatic annotation
The derived clusters of expressed sequences are identified with gene models predicted by the Ensembl annotation pipeline, noting the fraction of genomic overlap over all predicted exons and allowing ±150 nt to capture EST clusters derived from UTRs.
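One way to read the matching rule above is sketched below in Python. The interpretation is an assumption: the ±150 nt allowance is applied here only to the outer ends of the gene model, and the reported fraction is the share of predicted exon sequence covered by the cluster. The function names and the coordinate convention (0-based, half-open, non-overlapping blocks) are hypothetical.

```python
UTR_MARGIN = 150  # "±150 nt" allowance quoted in the text


def overlap_len(blocks_a, blocks_b):
    """Total nucleotides shared by two lists of non-overlapping blocks."""
    return sum(max(0, min(ae, be) - max(a, b))
               for a, ae in blocks_a for b, be in blocks_b)


def match_gene(cluster_blocks, gene_exons):
    """Return (assigned, fraction): whether the cluster is associated with
    the gene model, and the fraction of exon sequence it overlaps."""
    exons = sorted(gene_exons)
    exon_total = sum(end - start for start, end in exons)
    fraction = overlap_len(cluster_blocks, exons) / exon_total
    # Pad only the first and last exon so that clusters lying just outside
    # the predicted coding region (likely UTRs) are still captured.
    padded = list(exons)
    padded[0] = (padded[0][0] - UTR_MARGIN, padded[0][1])
    padded[-1] = (padded[-1][0], padded[-1][1] + UTR_MARGIN)
    assigned = overlap_len(cluster_blocks, padded) > 0
    return assigned, fraction
```

With this reading, a cluster lying entirely within 150 nt downstream of the last exon is assigned to the gene with an exon overlap fraction of zero, which would be consistent with the short-overlap clusters attributed to UTRs elsewhere in the text.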
We showed previously that genes recognized as 1:1 orthologs in the genomes of A. gambiae and D. melanogaster code on average for proteins with 56% sequence identity (Zdobnov et al. 2002
We identified groups of orthologous genes between the predicted full proteomes of A. gambiae and D. melanogaster, and broader orthologous groups, including other animal genomes with full genome coverage using an Inparanoid-like (Remm et al. 2001
Implementation
Microarray assessment of EST cluster expression
This work was partially supported by the NIAID/NIH VectorBase contract (NIAID-DMID-04-34, coordinated by F.H. Collins), NIAID/NIH U01 AI48846, and EMBL. We acknowledge annotation support from Ensembl (a joint Sanger Institute/EMBL-EBI project funded by the Wellcome Trust) and DNA sequence information from Celera Genomics, Genoscope, Pasteur Institute, EMBL, and the University of Notre Dame (the major contributors to A. gambiae genomic, EST, and cDNA sequence data). We are also grateful to S. Meister and other members of the F.C. Kafatos group, and to members of the P. Bork group for helpful discussions.
1 These authors contributed equally to this work.
2 Corresponding author. Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.3756405. Article published online ahead of print in May 2005. Freely available online through the Genome Research Immediate Open Access option.
Apweiler, R., Bairoch, A., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., et al. 2004. UniProt: The Universal Protein knowledgebase. Nucleic Acids Res. 32: D115-D119.
Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al. 2000. Gene ontology: Tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25: 25-29.
Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., et al. 2004. The Pfam protein families database. Nucleic Acids Res. 32: D138-D141.
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., and Wheeler, D.L. 2004. GenBank: Update. Nucleic Acids Res. 32: D23-D26.
Birney, E., Andrews, T.D., Bevan, P., Caccamo, M., Chen, Y., Clarke, L., Coates, G., Cuff, J., Curwen, V., Cutts, T., et al. 2004. An overview of Ensembl. Genome Res. 14: 925-928.
Camon, E., Magrane, M., Barrell, D., Lee, V., Dimmer, E., Maslen, J., Binns, D., Harte, N., Lopez, R., and Apweiler, R. 2004. The Gene Ontology Annotation (GOA) Database: Sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res. 32: D262-D266.
Dimopoulos, G., Casavant, T.L., Chang, S., Scheetz, T., Roberts, C., Donohue, M., Schultz, J., Benes, V., Bork, P., Ansorge, W., et al. 2000. Anopheles gambiae pilot gene discovery project: Identification of mosquito innate immunity genes from expressed sequence tags generated from immune-competent cell lines. Proc. Natl. Acad. Sci. 97: 6619-6624.
Dimopoulos, G., Christophides, G.K., Meister, S., Schultz, J., White, K.P., Barillas-Mury, C., and Kafatos, F.C. 2002. Genome expression analysis of Anopheles gambiae: Responses to injury, bacterial challenge, and malaria infection. Proc. Natl. Acad. Sci. 99: 8814-8819.
Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M., and Miller, W. 1998. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8: 967-974.
Hild, M., Beckmann, B., Haas, S.A., Koch, B., Solovyev, V., Busold, C., Fellenberg, K., Boutros, M., Vingron, M., Sauer, F., et al. 2003. An integrated gene annotation and transcriptional profiling approach towards the full gene content of the Drosophila genome. Genome Biol. 5: R3.
Holt, R.A., Subramanian, G.M., Halpern, A., Sutton, G.G., Charlab, R., Nusskern, D.R., Wincker, P., Clark, A.G., Ribeiro, J.M., Wides, R., et al. 2002. The genome sequence of the malaria mosquito Anopheles gambiae. Science 298: 129-149.
Hughes, T.R., Mao, M., Jones, A.R., Burchard, J., Marton, M.J., Shannon, K.W., Lefkowitz, S.M., Ziman, M., Schelter, J.M., Meyer, M.R., et al. 2001. Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nat. Biotechnol. 19: 342-347.
Jurka, J. 2000. Repbase Update: A database and an electronic journal of repetitive elements. Trends Genet. 9: 418-420.
Kent, W.J. 2002. BLAT: The BLAST-like alignment tool. Genome Res. 12: 656-664.
Kulikova, T., Aldebert, P., Althorpe, N., Baker, W., Bates, K., Browne, P., van den Broek, A., Cochrane, G., Duggan, K., Eberhardt, R., et al. 2004. The EMBL Nucleotide Sequence Database. Nucleic Acids Res. 32: D27-D30.
Kumar, S., Christophides, G.K., Cantera, R., Charles, B., Han, Y.S., Meister, S., Dimopoulos, G., Kafatos, F.C., and Barillas-Mury, C. 2003. The role of reactive oxygen species on Plasmodium melanotic encapsulation in Anopheles gambiae. Proc. Natl. Acad. Sci. 100: 14139-14144.
Letunic, I., Copley, R.R., Schmidt, S., Ciccarelli, F.D., Doerks, T., Schultz, J., Ponting, C.P., and Bork, P. 2004. SMART 4.0: Towards genomic data integration. Nucleic Acids Res. 32: D142-D144.
Miyazaki, S., Sugawara, H., Ikeo, K., Gojobori, T., and Tateno, Y. 2004. DDBJ in the stream of various biological data. Nucleic Acids Res. 32: D31-D34.
Mulder, N.J., Apweiler, R., Attwood, T.K., Bairoch, A., Barrell, D., Bateman, A., Binns, D., Biswas, M., Bradley, P., Bork, P., et al. 2003. The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res. 31: 315-318.
Okazaki, Y., Furuno, M., Kasukawa, T., Adachi, J., Bono, H., Kondo, S., Nikaido, I., Osato, N., Saito, R., Suzuki, H., et al. 2002. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420: 563-573.
Remm, M., Storm, C.E., and Sonnhammer, E.L. 2001. Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314: 1041-1052.
Ribeiro, J.M., Topalis, P., and Louis, C. 2004. AnoXcel: An Anopheles gambiae protein database. Insect Mol. Biol. 13: 449-457.
Smith, T.F. and Waterman, M.S. 1981. Identification of common molecular subsequences. J. Mol. Biol. 147: 195-197.
Zdobnov, E.M., von Mering, C., Letunic, I., Torrents, D., Suyama, M., Copley, R.R., Christophides, G.K., Thomasova, D., Holt, R.A., Subramanian, G.M., et al. 2002. Comparative genome and proteome analysis of Anopheles gambiae and Drosophila melanogaster. Science 298: 149-159.
http://komar.embl.de; AnoEST database.
http://komar.embl.de:9000/das; AnoEST DAS server.
http://www.genoscope.org/; Genoscope, Centre National de Séquençage.
http://www.girinst.org/; Genetic Information Research Institute.
http://www.anobase.org/; AnoBase database.
http://www.vectorbase.org; VectorBase database.
http://www.mysql.com/; MySQL relational database engine.
http://www.php.net/; PHP scripting language.
http://www.biodas.org; Distributed Annotation System (DAS).
http://www.sanger.ac.uk/Software/analysis/proserver/; Perl-based DAS server.
Received January 26, 2005; accepted in revised format April 13, 2005.