Published online before print
November 7, 2007, 10.1101/gr.6679507 Genome Res. 17:1823-1836, 2007 ©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07 $5.00 OPEN ACCESS ARTICLE
12 Drosophila Genomes/Letter
Revisiting the protein-coding gene catalog of Drosophila melanogaster using 12 fly genomes
1 Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02139, USA; 2 Berkeley Drosophila Genome Project, Department of Genome Biology, Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California 94720, USA; 3 FlyBase, The Biological Laboratories, Harvard University, Cambridge, Massachusetts 02138, USA; 4 Department of Molecular and Cellular Biology, Harvard University, Cambridge, Massachusetts 02138, USA; 5 MIT Computer Science and Artificial Intelligence Laboratory, Cambridge, Massachusetts 02139, USA
The availability of sequenced genomes from 12 Drosophila species has enabled the use of comparative genomics for the systematic discovery of functional elements conserved within this genus. We have developed quantitative metrics for the evolutionary signatures specific to protein-coding regions and applied them genome-wide, resulting in 1193 candidate new protein-coding exons in the D. melanogaster genome. We have reviewed these predictions by manual curation and validated a subset by directed cDNA screening and sequencing, revealing both new genes and new alternative splice forms of known genes. We also used these evolutionary signatures to evaluate existing gene annotations, resulting in the validation of 87% of genes lacking descriptive names and identifying 414 poorly conserved genes that are likely to be spurious predictions, noncoding, or species-specific genes. Furthermore, our methods suggest a variety of refinements to hundreds of existing gene models, such as modifications to translation start codons and exon splice boundaries. Finally, we performed directed genome-wide searches for unusual protein-coding structures, discovering 149 possible examples of stop codon readthrough, 125 new candidate ORFs of polycistronic mRNAs, and several candidate translational frameshifts. These results affect >10% of annotated fly genes and demonstrate the power of comparative genomics to enhance our understanding of genome organization, even in a model organism as intensively studied as Drosophila melanogaster.
The compilation of a complete and accurate catalog of all protein-coding genes is a critical step in fully understanding the functional elements in any genome. In Drosophila melanogaster, a century of classical genetics, large-scale EST and cDNA sequencing (Rubin et al. 2000
It is unclear how close to completion the current gene set may be, or what fraction of the current annotations may be inaccurate. On one hand, numerous genes and alternative splice forms may be still missing from the current annotation, and indeed a pilot study suggests an additional 700 genes may lie amidst 10,000 existing de novo and microarray-based predictions (Yandell et al. 2005
Comparative genomic analysis is a powerful approach to the discovery of protein-coding genes. Comparative data have been used to significantly revise the established annotations of the yeast Saccharomyces cerevisiae genome (Cliften et al. 2003
In this study, we use whole-genome alignments of 12 Drosophila genomes to systematically review the protein-coding gene annotations of D. melanogaster. By studying the conservation properties of known genes, we identify recurrent patterns of evolutionary change that are hallmarks of purifying selection operating upon protein-coding sequences. We use these evolutionary signatures to examine the entire genome and identify conserved protein-coding regions with high accuracy. These signatures confirm the protein-coding function of the vast majority of hypothetical genes and identify more than a thousand new exons. In contrast, these signatures strongly reject several hundred genes, most of which are likely to be spurious predictions or noncoding genes. We also used these signatures to refine the annotation and boundaries of existing genes, including translation initiation sites, splice sites, and functional reading frame of translation. Finally, our methods identify candidates for a variety of exceptional gene structures such as translational readthrough, dicistronic genes, and conserved reading frameshifts in the middle of protein-coding exons. We evaluated many of these proposed changes through manual curation and directed sequencing efforts. Overall, we used comparative data to propose revisions for >10% of D. melanogaster protein-coding gene models. While many extensions and future directions remain, this work is a substantial step toward achieving the best possible gene annotations for D. melanogaster. It also serves as a model for similar efforts to improve the annotation of other important target genomes, including the human.
Protein-coding DNA sequences evolve under distinctive evolutionary constraints since selective forces at the nucleotide level reflect constraints operating on the encoded protein. Thus, mutations to the DNA that preserve properties of the amino acid translation (e.g., synonymous substitutions) tend to be tolerated, while mutations that disrupt the translation (e.g., frame-shifting insertions or deletions or nonsense mutations) tend to be excluded by natural selection. In DNA sequence alignments of closely related species, these constraints manifest themselves as "evolutionary signatures", recurrent patterns of evolutionary change that we can use to uniquely identify protein-coding sequences (Fig. 1).
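The distinction drawn above between tolerated and excluded mutations can be illustrated with a minimal Python sketch. It classifies each aligned codon pair between a target and one informant sequence as identical, synonymous, nonsynonymous, nonsense, or gapped. This is only an illustration of the idea, not the scoring used in the study; the function names and the assumption of uppercase A/C/G/T input with "-" for gaps are ours.

```python
from collections import Counter
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def classify_codon_pair(target_codon, informant_codon):
    """Label one aligned codon pair by its effect on the translation."""
    if "-" in target_codon or "-" in informant_codon:
        return "gapped"
    if target_codon == informant_codon:
        return "identical"
    target_aa = CODON_TABLE[target_codon]
    informant_aa = CODON_TABLE[informant_codon]
    if informant_aa == "*" and target_aa != "*":
        return "nonsense"  # the informant introduces a stop codon
    return "synonymous" if target_aa == informant_aa else "nonsynonymous"


def tally_substitutions(target, informant, frame=0):
    """Count codon-pair classes along a pairwise alignment read in `frame`."""
    counts = Counter()
    usable = min(len(target), len(informant))
    for i in range(frame, usable - 2, 3):
        counts[classify_codon_pair(target[i:i + 3], informant[i:i + 3])] += 1
    return counts


if __name__ == "__main__":
    print(tally_substitutions("ATGGCTAAATTT", "ATGGCCAGATAA"))
```

In a genuinely coding region read in the correct frame, one would expect the synonymous and identical classes to dominate; reading the same alignment in a shifted frame typically inflates the nonsynonymous and nonsense counts.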
We applied two independent quantitative metrics that use evidence from multiple informant sequences to distinguish regions under protein-coding selection. The first metric observes reading frame conservation (RFC) and quantifies the strong tendency of insertions or deletions (indels) within coding regions to preserve the reading frame of translation. We have previously applied RFC in yeast species (Kellis et al. 2003
In contrast to methodologies that focus primarily on high sequence conservation to identify candidate genes, the RFC and CSF metrics focus on distinctive patterns of divergence in protein-coding genes, specific to their unique selective pressures. Therefore, functional RNA-level or DNA-level elements (such as RNA genes and structures, developmental enhancers, or other regulatory regions), which often exhibit high nucleotide conservation (Fig. 2), are very unlikely to show high RFC or CSF scores, enabling these metrics to distinguish coding and noncoding regions with higher accuracy. For example, when used to discriminate between exons of well-studied genes and random noncoding regions with the same length distribution, the CSF metric alone accepts 94% of coding exons while rejecting >99% of the control regions (Supplemental Table 1). This discriminatory power allowed us to systematically review the D. melanogaster genome annotation for protein-coding genes. We present detailed benchmarks of these and several other metrics elsewhere (M.F. Lin, A. Deoras, M. Rasmussen, and M. Kellis, in prep.).
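To make the idea behind reading frame conservation concrete, the following Python sketch scores a single pairwise alignment. It tracks the relative frame offset between target and informant at every aligned column and reports the fraction of aligned nucleotides that share the most common offset. This is a deliberately simplified, hypothetical variant: the metric used in the study draws on multiple informants and is not reproduced here, and the function name and normalization are our assumptions.

```python
from collections import Counter


def reading_frame_conservation(target_aln, informant_aln):
    """Fraction of aligned columns whose target/informant frame offset
    equals the dominant offset. Gaps are written as '-'."""
    if len(target_aln) != len(informant_aln):
        raise ValueError("aligned sequences must have equal length")
    target_pos = informant_pos = 0
    offsets = Counter()
    for t, i in zip(target_aln, informant_aln):
        if t != "-" and i != "-":
            # Offset 0, 1, or 2: how the two reading frames line up here.
            offsets[(target_pos - informant_pos) % 3] += 1
        if t != "-":
            target_pos += 1
        if i != "-":
            informant_pos += 1
    aligned = sum(offsets.values())
    if aligned == 0:
        return 0.0
    return max(offsets.values()) / aligned


if __name__ == "__main__":
    # A 3-nt deletion keeps the frame; a 1-nt deletion shifts it.
    print(reading_frame_conservation("ATGGCTAAATTT", "ATG---AAATTT"))
    print(reading_frame_conservation("ATGGCTAAATTT", "ATGGCT-AATTT"))
```

Indels whose lengths are multiples of three leave the score at 1.0, while a frame-shifting indel in the middle of the region pulls it down toward one half, which is the behavior the RFC description above relies on.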
Benchmarking the RFC and CSF evolutionary signatures
Our first goal was to evaluate how well our approach worked on test data sets of well-annotated genes. For this purpose, we used the classes of "named" and "well-studied" genes defined earlier. We scored every gene model covered by whole-genome sequence alignments according to the RFC and CSF metrics. By studying the score distributions for known genes and noncoding control regions, we chose RFC and CSF cutoffs above which a given gene annotation is nearly certain to represent protein-coding sequence, and used these as a test to determine whether the comparative evidence confirms that a candidate gene is indeed protein-coding (although this test does not verify that the annotated gene structure is correct in every detail).
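One simple way to derive a cutoff from two score distributions is sketched below in Python. It picks the lowest score threshold at which the fraction of noncoding controls that pass stays within a chosen limit, then reports how many known genes pass at that threshold. The text does not specify the procedure actually used to choose the RFC and CSF cutoffs, so the rule, the function name, and the default limit of 1% are assumptions for illustration only.

```python
def choose_cutoff(coding_scores, control_scores, max_false_positive_rate=0.01):
    """Return (cutoff, sensitivity, false_positive_rate) for the lowest
    cutoff whose control acceptance rate is within the limit, or None."""
    if not coding_scores or not control_scores:
        raise ValueError("both score lists must be non-empty")
    candidates = sorted(set(coding_scores) | set(control_scores))
    for cutoff in candidates:
        # A region is accepted when its score is at or above the cutoff.
        false_positive_rate = (
            sum(s >= cutoff for s in control_scores) / len(control_scores)
        )
        if false_positive_rate <= max_false_positive_rate:
            sensitivity = (
                sum(s >= cutoff for s in coding_scores) / len(coding_scores)
            )
            return cutoff, sensitivity, false_positive_rate
    return None


if __name__ == "__main__":
    known = [5, 6, 7, 8, 9]
    controls = [1, 2, 3, 4, 6]
    print(choose_cutoff(known, controls, max_false_positive_rate=0.2))
```

Choosing the lowest admissible cutoff maximizes the share of known genes accepted for a given tolerance of false positives; a stricter tolerance moves the cutoff up and lowers sensitivity.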
We first scored the 893 well-studied genes. Our test accepts 882 (99%) of these gene models. Only 11 of these genes did not pass our thresholds. Two of these (y and bw) are well-conserved genes that failed due to previously known strain-specific disrupting mutations in the sequenced strain of D. melanogaster. The remainder may represent fast-evolving genes or genes recently evolved from previously noncoding regions. We also applied the same test to the remaining 3818 named genes with <50 citations and found that it accepts 97% (3684). Overall, the comparative evidence confirms that 4566 of 4711 "named" genes (97%) show the evolutionary signatures of protein-coding genes. We also evaluated 15,564 noncoding regions
Evolutionary confirmation of uncharacterized genes
We then turned our attention to the 9022 CGid-only genes in the Release 4.3 annotation set, which lack a descriptive gene name (including 4373 GO-annotated genes and 4649 uncharacterized genes). The evidence for these gene models varies widely and may include de novo gene model prediction, long open reading frames (ORFs), cDNA sequences, mRNA expression evidence, or homology with genes in other species. Since our evolutionary signatures are specific to protein-coding function, they can provide a powerful additional line of evidence indicating that these genes encode proteins, based on their alignments across Drosophila genomes.
Our test accepts 7879 of the 9022 CGid-only genes (87%), confirming that the vast majority of these annotations show the evolutionary signatures of protein-coding genes, and are therefore very likely to encode proteins. (Again, passing our test does not imply that all details of these gene structures are correctly annotated; we also note that it is possible that ancestral genes that have been very recently deactivated in D. melanogaster, have not yet acquired many disrupting mutations, and are still annotated as genes may pass our test.) The fraction of accepted CGid-only genes was only slightly higher for the "GO-annotated" subset than for uncharacterized (89% vs. 86%). It is not surprising that the proportion of accepted models for CGid-only genes (87%) is lower than for the named genes (97%): Some uncharacterized genes may be erroneous or spurious annotations (we consider this possibility further below), while others are likely to be under less stringent selective pressure than most named genes, many of which are conserved across very large evolutionary distances (Bergman et al. 2002
New genes and exons
Manual curation incorporates most predicted exons into gene annotations
Of the 928 assessed exons, 562 (61%) were incorporated into existing genes, leading to the revision of 438 gene models. The new exons most often led to the creation of alternative transcripts and, less frequently, to the modification of the intron/exon structure of an existing transcript isoform. Many of these changes (58%) were supported by additional evidence such as previously unincorporated BDGP cDNA sequences and/or sequence similarity to known proteins. Some revisions were complex, including 65 merges of two or more Release 4.3 gene models, 10 splits of Release 4.3 gene models, and four new dicistronic transcript models. An additional 192 (21%) curated exons were incorporated in 142 newly created gene models. Of these, 39% were supported by EST/cDNA and/or protein sequence similarity. Twenty-four of the new gene models (12%) lie within an intron of another gene on the same strand. The remaining 174 curated exons (19%) were not incorporated into any gene models. Most of these are either small exon predictions, with a median length of 21 amino acids, or encode low-complexity sequence. Typically, these were unsupported by experimental data that would indicate inclusion in a gene model. These 174 exon predictions should be viewed as unresolved with regard to their validity, since future data may provide such experimental support.
Directed cDNA sequencing confirms predicted exons, reveals new genes and splice forms
Of the 126 tested predictions within intergenic regions, we obtained a full-length cDNA for 88 exons (70%). The resulting cDNAs provide evidence for 50 new genes, including 10 single-exon genes and 40 multi-exon genes (which incorporate 43 predicted exons, and additional flanking exons that were not predicted by our algorithm). In addition, these cDNAs provided evidence for the modification of 39 existing Release 4.3 annotations: 11 new 5' extensions or splice variants, 13 new 3' extensions or splice variants (14 exons), two dicistronic transcripts (three exons), six transcripts merging multiple Release 4.3 gene models, and one internal splice variant.
Of the 58 tested predictions within introns of existing annotations, we obtained a full-length cDNA for 32 (55%). Only 18 of these represent new internal splice variants of the surrounding gene while the remaining 14 appeared independent of the surrounding gene. These 14 include eight alternative splice forms of previously annotated genes (five 5' exons and two 3' exons), two new single-exon genes, two new multi-exon genes, and two gene merges. Most surprising were data supporting an apparent example of overlapping coding sequence on opposite strands (Fig. 3D).
Overall, the cDNA data validated 120 of the 184 targeted predictions (65%). The recovered cDNA sequences also indirectly validated 42 predicted new exons that were not purposely targeted, as they were contained within the transcripts recovered from the 120 targeted predictions, leading to a total of 162 cDNA-validated predictions. The recovered cDNAs also captured additional translated and untranslated exons that were not predicted by our algorithm (see examples in Fig. 3). Finally, we note that the remaining 64 targeted predictions for which we did not obtain a high-quality, full-length cDNA sequence are not necessarily false predictions, since we only screened libraries derived from certain tissues and developmental stages (Hoskins et al. 2005
Using TBLASTX, we searched other genomes for homologs of the new genes we recovered through cDNA sequencing. We found that many appear to be specific to the Drosophila or insect lineages (Supplemental Table 2). For example, 37% have a significant hit in the mosquito (Anopheles gambiae) or honeybee (Apis mellifera) genome assemblies, compared to 50% of randomly selected genes of comparable length; similarly, only 12% have significant hits to worm, yeast, or vertebrates, compared to 32% of random genes. Because gene annotation often relies on homology with known genes in other species, this might explain in part why these genes have not previously been identified.
An alternative strategy identifies relatively few additional exons
We selected 193 "consensus" exons that are predicted by at least five of these algorithms, do not overlap annotated exons, transposable elements, or our predictions, and are at least 100 nt in length. After manual curation, 98 (51%) were incorporated into a gene model: 15 were incorporated into gene models that included exons identified by our algorithm, 63 were incorporated into existing gene models, and 20 were annotated as new or reinstated gene models. To test the validity of this approach, eight of the affected gene models were selected for evaluation by RT-PCR. Seven of the eight newly annotated "consensus" exons were validated. In several cases, additional newly annotated exons based on evolutionary signatures were also validated. Overall, 852 new exons were annotated by manual curation using both analyses, of which 88% were predicted by our algorithm based on evolutionary signatures.
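The three selection criteria for "consensus" exons (support from at least five predictors, no overlap with excluded features, and a minimum length of 100 nt) can be sketched as a small interval filter in Python. The data layout, the `Interval` class, and in particular the simplification that two predictors agree only when their exon coordinates are identical are assumptions made for this illustration; the text does not describe how agreement between predictors was defined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # exclusive


def overlaps(a, b):
    """True when two intervals on the same sequence share any base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def consensus_exons(predictions, excluded, min_support=5, min_length=100):
    """Keep exons predicted by enough algorithms, long enough, and clear of
    every excluded interval (annotated exons, transposable elements, ...).

    `predictions` maps a predictor name to its predicted exon intervals.
    """
    support = {}
    for predictor, exons in predictions.items():
        for exon in set(exons):
            support.setdefault(exon, set()).add(predictor)
    kept = []
    for exon, predictors in support.items():
        if len(predictors) < min_support:
            continue
        if exon.end - exon.start < min_length:
            continue
        if any(overlaps(exon, feature) for feature in excluded):
            continue
        kept.append(exon)
    return sorted(kept, key=lambda e: (e.chrom, e.start, e.end))
```

For genome-scale inputs the linear scan over `excluded` would normally be replaced by an interval index, but the filtering logic stays the same.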
Conclusion: New exons and genes
Although the subsets of the predicted exons that we subjected to curation and sequencing were not selected entirely at random, neither were they selected in a way that would strongly bias them toward the highest-quality predictions. We conclude that our approach was able to identify new exons with very high predictive value, even when all existing gene annotations were excluded. Moreover, the results of an alternative strategy based on a variety of de novo and evidence-based predictions suggest that relatively few protein-coding exons remain unidentified in the euchromatin, or at least few that can be found at a reasonable false discovery rate using existing computational methods.
Many poorly conserved gene annotations are dubious
While our previous analysis evaluated each candidate gene over its entire length, here, we searched for any evidence of protein-coding selection. We allowed for fast-evolving domains or partially incorrect annotations by evaluating overlapping windows of 30 amino acids for evidence of protein-coding evolution. We also allowed for lineage-specific genes by searching for evolutionary evidence in groups of species at three different phylogenetic distances from D. melanogaster. Moreover, we tested three different genome alignment sets, to allow for potential misalignments (see Methods). Finally, we note that, if a gene is recently gained and its orthologous region is simply absent in the informant genomes, our methods make no statement about its veracity. Instead, we only evaluated regions that do align to putatively orthologous sequences in other species.
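The windowed search described above can be sketched in Python as follows. The code cuts a coding sequence into overlapping windows of 30 codons and declares evidence of protein-coding selection if any single window scores at or above a cutoff. The step of 10 codons between windows, the cutoff, and the `score_window` callback are hypothetical placeholders, since the text gives only the window size.

```python
def codon_windows(cds_length_nt, window_codons=30, step_codons=10):
    """Return (start, end) nucleotide coordinates of overlapping windows."""
    n_codons = cds_length_nt // 3
    if n_codons <= window_codons:
        return [(0, n_codons * 3)]  # short genes form a single window
    last_start = n_codons - window_codons
    starts = list(range(0, last_start + 1, step_codons))
    if starts[-1] != last_start:
        starts.append(last_start)  # make sure the 3' end is covered
    return [(s * 3, (s + window_codons) * 3) for s in starts]


def has_coding_evidence(cds_length_nt, score_window, cutoff,
                        window_codons=30, step_codons=10):
    """True if any window reaches the cutoff.

    `score_window(start, end)` is a caller-supplied scoring function,
    for example a per-window coding score computed from an alignment.
    """
    windows = codon_windows(cds_length_nt, window_codons, step_codons)
    return any(score_window(start, end) >= cutoff for start, end in windows)


if __name__ == "__main__":
    print(codon_windows(315))
```

Because a gene is rejected only when every window fails, this test is intentionally lenient: one well-conserved domain is enough to keep a gene out of the rejected set.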
We found that 414 CGid-only genes (4.6% of 9022) are rejected even by these very lenient criteria. By comparison, only three of 893 well-studied genes (0.3%) are rejected and only 40 of all 4711 named genes (0.8%). If all rejected well-studied genes are false rejections, we would expect <30 of the 414 rejected CGid-only genes to be false rejections (95% confidence, binomial distribution). Based on named genes, we would expect that <91 of the 414 rejections are false rejections, and that at least 323 of the 414 rejected genes (78%) are indeed spurious. On one hand, this may be an overestimate, as the named and well-studied genes may be biased toward deeply conserved functions with vertebrate orthologs (Bergman et al. 2002
Several statistics suggest that most of the genes rejected by our test are likely to be spurious predictions. As a group, they closely resemble random noncoding regions (Supplemental Fig. 1). The majority consist of relatively short, single-exon ORFs, many of which are likely to occur by chance across the whole genome. Their median coding sequence length is 381 nt, considerably shorter than the median length of all genes (1179 nt), and 63% are single-exon.
We manually examined each of the 414 CGid-only genes that were rejected by our test and all evidence supporting them, and we concluded that 222 (54%) can be immediately deleted from the annotations or recategorized as nonprotein-coding genes. These include 55 genes previously annotated as supported by cDNA sequences, which in fact turned out to be due to genomically primed clones. An additional 73 of the rejected genes (18%) had unclear or conflicting evidence and have been flagged as being of uncertain quality in the annotation comments, although they were not immediately deleted. Finally, the remaining 119 (29%) are adequately supported by existing evidence and were kept unchanged in the current database. A subset of these is likely to be rapidly evolving genes, while others may prove to be RNA-coding genes with no protein function. We also manually examined the 40 named genes that were rejected by our test, and found that six of these should also be deleted or changed to nonprotein-coding annotations. The remaining 34 contain several genes known to be rapidly evolving, including seven male accessory gland peptides or other male-specific genes.
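The false-rejection estimate quoted above can be explored with a short Python sketch. Under one plausible reading, which is our assumption and not a statement of how the figures were derived, the rejection rate observed in a trusted reference set is applied to the 9022 CGid-only genes, and the 95th percentile of the resulting binomial distribution is taken as an upper bound on the number of false rejections. With the named-gene rate (40 of 4711) the expected count is about 77 and the bound should land close to the quoted 91. All function names are hypothetical.

```python
import math


def log_binom_pmf(x, n, p):
    """Log of the binomial probability of x successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
            + x * math.log(p) + (n - x) * math.log1p(-p))


def binom_quantile(n, p, q=0.95):
    """Smallest x whose cumulative binomial probability reaches q."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    cdf = 0.0
    for x in range(n + 1):
        cdf += math.exp(log_binom_pmf(x, n, p))
        if cdf >= q:
            return x
    return n


def false_rejection_bound(rejected_ref, total_ref, total_target, q=0.95):
    """Expected count and q-quantile of false rejections in the target set,
    assuming the reference rejection rate applies to every target gene."""
    rate = rejected_ref / total_ref
    return rate * total_target, binom_quantile(total_target, rate, q)


if __name__ == "__main__":
    expected, bound = false_rejection_bound(40, 4711, 9022)
    print(f"expected {expected:.1f}, 95th percentile {bound}")
    # Share of the 414 rejections that would remain if 91 were false.
    print(f"{100 * (414 - 91) / 414:.0f}% likely spurious")
```

The last line repeats the arithmetic in the text: subtracting 91 possible false rejections from 414 leaves 323, or 78%, as the minimum share of rejected genes expected to be spurious.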
Last, we found that transcript evidence for at least some of the rejected genes may be explained by nonprotein-coding function. In particular, there is strong evidence that the transcripts for CG33311 and CG31044 are in fact precursor RNAs of microRNA genes rather than protein-coding mRNAs (Stark et al. 2007
We conclude that most of the genes rejected by our test in fact do not represent genuine protein-coding genes, and the existence of many of these annotations is due to genomically primed cDNAs, erroneous de novo gene predictions, and sometimes functional RNA genes. A minority is likely to represent fast-evolving or species-specific genes that are not under purifying selection over the evolutionary distances we examined.
Overall, our tests based on evolutionary signatures confirmed 7879 of 9022 CGid-only genes (87%) as clearly under protein-coding selection and rejected 414 (4.6%), most of which are likely to be spurious annotations (Table 1). We abstained from making a decision based on comparative evidence for the remaining 729 CGid-only genes (8.1%), which either could not be aligned or were supported by evolutionary signatures weakly or only over a fraction of their length. These results can help guide directed experimentation to resolve the function of all genes and transcripts, and also help focus curation efforts on a relatively small number of problem cases.
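As a quick consistency check on the summary figures above, the short Python sketch below verifies that the three outcomes (confirmed, rejected, abstained) account for all 9022 CGid-only genes and recomputes their shares. The function name is ours; the counts are taken directly from the text.

```python
def outcome_shares(outcomes, total):
    """Check that outcome counts partition `total`; return percent shares."""
    if sum(outcomes.values()) != total:
        raise ValueError("outcome counts do not add up to the total")
    return {name: round(100 * count / total, 1) for name, count in outcomes.items()}


if __name__ == "__main__":
    cgid_only = {"confirmed": 7879, "rejected": 414, "abstained": 729}
    print(outcome_shares(cgid_only, total=9022))
```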
Refining existing gene annotations
Translation start sites
Reading frame of translation
In addition to locating protein-coding regions, the comparative information reveals the reading frame of translation under purifying selection, since the signature of codon substitution frequencies is specific to the reading frame. This has allowed us to distinguish between overlapping ORFs, and reveal the one under selection when multiple ORFs of comparable length are all open (Fig. 4B). Such overlapping ORFs are sometimes found in short single-exon genes, where the systematic annotation has typically selected the longest, while it may in fact be a shorter ORF that is translated. We found five cases (CG15281, CG13244, CG7738, CG18358, and CG12656) where a shorter ORF is clearly under selection, to the exclusion of the annotated ORF. While this is a small number of cases, we note that this change leads to a completely different protein translation.
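The contrast between choosing the longest ORF and choosing the ORF under selection can be sketched in Python. The code enumerates forward-strand ORFs in all three frames and then selects one using a caller-supplied, frame-specific score instead of length. The scoring function is a hypothetical stand-in for a comparative metric; nothing here reproduces the scoring used in the study.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=1):
    """List forward-strand ORFs as (start, end, frame), ATG through stop."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


def longest_orf(orfs):
    """The conventional choice: the longest open reading frame."""
    return max(orfs, key=lambda orf: orf[1] - orf[0])


def selected_orf(orfs, coding_score):
    """Choose the ORF with the best comparative score.

    `coding_score(start, end, frame)` is a hypothetical callback that
    would evaluate the alignment in the reading frame of that ORF.
    """
    return max(orfs, key=lambda orf: coding_score(*orf))


if __name__ == "__main__":
    sequence = "ATGAAATGCCCTAGGTAA"
    candidates = find_orfs(sequence)
    toy_scores = {0: -1.0, 2: 3.5}  # made-up per-frame scores
    print(longest_orf(candidates))
    print(selected_orf(candidates, lambda s, e, f: toy_scores[f]))
```

In the toy example the two choices disagree, which mirrors the situation described above where a shorter ORF, and not the annotated longest one, carries the signature of selection.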
Adjustments to existing exons
We also identified many existing exons that appear to be incompletely annotated, as the evolutionary signatures of protein-coding selection extend beyond their present splice boundaries, including 912 by at least 30 nt and 600 by at least 45 nt (see Supplemental materials). This may indicate either an alternative splice site or a simple mistaken annotation. When we considered the position of the likely corrected (or alternative) splice site, we found that the "extensions" of at least 30 nt are enriched for lengths divisible by three (P < 2.2 × 10^-16,
Recent nonsense and frameshift mutations
We also identified locations in the D. melanogaster genome where protein-coding evolutionary selection abruptly shifts from one reading frame to another. In five cases, these coincide with a short frame-shifting indel, specific to the sequence of D. melanogaster, and absent from all of the other genomes. One of these (within sdk) was due to a previously known erroneous genomic sequence on chromosome arm 3L in D. melanogaster, while another (within CG33294, currently known as CR33294) may be a pseudogene. The remaining three cases (within Ugt86Dd, Dscam, and CG34143) are apparently recent frameshift mutations.
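A frame-shifting indel that is specific to one species can be located in a multiple alignment with a simple column scan, sketched here in Python. The code reports runs of columns where only the target sequence is gapped (a target-specific deletion) or only the target has bases (a target-specific insertion), keeping those whose length is not a multiple of three. This is an illustrative sketch under the assumption of a gapped alignment with "-" for gaps; it does not reproduce the frame-shift detection used in the study, which relied on shifts in the coding signature itself.

```python
def target_specific_frameshifts(alignment, target):
    """Return (kind, start_col, end_col) for indels private to `target`
    whose length is not divisible by three.

    `alignment` maps species names to equal-length aligned sequences.
    """
    target_seq = alignment[target]
    others = [seq for name, seq in alignment.items() if name != target]
    if not others:
        raise ValueError("at least one informant sequence is required")
    if any(len(seq) != len(target_seq) for seq in others):
        raise ValueError("aligned sequences must have equal length")

    events = []
    run_kind, run_start = None, 0
    # One extra iteration (col == len) closes a run that reaches the end.
    for col in range(len(target_seq) + 1):
        kind = None
        if col < len(target_seq):
            target_gap = target_seq[col] == "-"
            other_gaps = [seq[col] == "-" for seq in others]
            if target_gap and not any(other_gaps):
                kind = "deletion"
            elif not target_gap and all(other_gaps):
                kind = "insertion"
        if kind != run_kind:
            if run_kind is not None and (col - run_start) % 3 != 0:
                events.append((run_kind, run_start, col))
            run_kind, run_start = kind, col
    return events


if __name__ == "__main__":
    toy = {
        "mel": "ATGGC-AAATTT",
        "sim": "ATGGCTAAATTT",
        "yak": "ATGGCTAAATTT",
    }
    print(target_specific_frameshifts(toy, "mel"))
```

Requiring the indel to be absent from every informant is what separates a recent, lineage-specific event (or a sequencing error in the target assembly) from an older change shared across species.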
Identifying unusual protein-coding structures
Stop codon readthrough
Translational readthrough of stop codons can occur through several mechanisms, among which our approach does not distinguish. However, it does not appear that many of these genes represent new selenoproteins, because many (37%) of the putatively readthrough stop codons are not UGA and we were unable to identify convincing examples of the related SECIS elements according to previously established criteria (Kryukov et al. 1999 I RNA editing by ADAR, which is most active in the nervous system (Bass 2002
Polycistronic messenger RNAs
"Programmed" translational frameshifts
A revised fly gene catalog
The availability of whole-genome alignments of the 12 Drosophila genomes allowed us to measure evolutionary signatures unique to protein-coding regions. In conjunction with manual curation and large-scale sequencing experimentation, these signatures enabled us to systematically revisit the fly genome annotation, with proposed changes affecting >10% of all genes. (1) We identified 1193 new exons with high predictive value, most of which were integrated into FlyBase gene annotations and many of which were validated by cDNA sequencing experiments, revealing many surprising new gene models and alternative splice forms. (2) In addition to discovering new genes, we used evolutionary signatures to revisit existing gene annotations. This led to confirmation that 87% of CGid-only annotations show evolutionary signatures of protein-coding genes and, conversely, to the identification of 3%–4% of CGid-only annotations that are likely to be spurious predictions or noncoding genes. (3) At a finer-grain level, evolutionary signatures allowed us to propose detailed refinements to hundreds of existing annotations, adjusting the translation start codon, correcting splice boundaries, resolving the functional reading frame in short single-exon transcripts, and identifying strain-specific disrupting mutations. (4) Lastly, the power of evolutionary signatures enabled us to recognize unusual gene structures, which challenge the current assumptions of gene annotation efforts: We found abundant evidence of stop codon readthrough, polycistronic transcripts, and several candidates for conserved translational frameshifts.
Challenges for computational prediction of complete gene models
Our results revealed important insights relevant to full gene model prediction. We obtained full-length cDNA clones for 162 of our predicted new exons, many of which fell into surprising gene models, reinforcing the difficulty of de novo gene model prediction. For example, when new exons were discovered within introns of existing genes on the same strand, the simplest expectation would be that they form alternatively spliced transcripts of the surrounding gene. In contrast to this expectation, however, only 56% were alternative transcripts, and the remaining 44% linked to other genes or formed independent transcription units. Such nested and interdigitated genes, as well as mutually exclusive exons within single genes, are refractory to most de novo gene structure predictors.
A further challenge to computational gene structure prediction is presented by exceptional biological phenomena, such as stop codon readthrough, polycistronic transcripts, and translational frameshifts. These are generally assumed to be rare and eukaryotic gene predictors are not built to recognize them. However, 115 dicistronic genes are currently annotated in FlyBase, and our results suggest that the true number may be substantially larger. Similarly, while only one functional translational frameshift has been described in Drosophila (Ivanov et al. 1998
The next major advances in de novo gene prediction methods are likely to come from continued advances in our understanding of the sequence signals governing transcription, splicing, and translation regulation, as well as the advent of more flexible algorithmic frameworks that are well-suited to take advantage of such unconventional signals (Lafferty et al. 2001
Applying the evolutionary signature approach to other target genomes
More generally, the preexisting, high-quality annotations for D. melanogaster allowed us to demonstrate the high sensitivity and specificity of the RFC and CSF tests based on evolutionary signatures. Since these signatures are universal consequences of natural selection and the genetic code, our results suggest that they can provide a strong foundation for the identification of protein-coding genes within any group of closely related species, even when cDNA library sequences are not immediately available or when no genomes with high-quality annotations exist in closely related taxa. Furthermore, it may also be possible to define specific evolutionary signatures (beyond mere sequence conservation) for other classes of functional elements, which suggests a general approach for the identification of functional elements in any genome. The derivation of reliable gene models for protein-coding genes remains a challenge, especially given the abundance of complex gene structures in metazoan genomes. It is also inherently difficult for comparative genomic methods to identify very fast-evolving, species-specific genes, which are centrally important to the study of evolution, speciation, and immunity. Thus, the complete genome annotation of any species will continue to be most effectively pursued through the concerted efforts of computational predictions, manual curation, and large-scale cDNA sequencing.
Genome alignments
We used several different sets of multiple sequence alignments of the 12 Drosophila genomes in this study. Two were based on a synteny map generated by Mercator (C. Dewey [University of Wisconsin, Madison] and L. Pachter [University of California at Berkeley]), with sequence alignments generated by MAVID (Bray and Pachter 2004
Reading frame conservation (RFC)
Codon substitution frequencies (CSF)
Thorough benchmarks of the RFC and CSF metrics, as well as various other discriminative metrics for protein-coding gene identification, with different alignments and different combinations of informant species, are presented elsewhere (M.F. Lin, A. Deoras, M. Rasmussen, and M. Kellis, in prep.).
"Confirming" genes
"Rejecting" genes
Predicting new exons
Selection of exon candidates for cDNA isolation
RT-PCR
Refinements to existing annotations and unusual gene structures
We are indebted to the community effort for sequencing, assembly, and alignment of the 12 Drosophila genome sequences without which this project would not have been possible, and for the early release and collaborative data sharing. We thank Andy Clark, Tim Sackton, and Tony Greenberg for helpful discussions on lineage-specific genes; Gene Yeo and Jade Vinson for sharing code for a splice site discriminator; and Alex Stark, Pouya Kheradpour, Matt Rasmussen, Ameya Deoras, Josh Grochow, Erez Lieberman, and Aviva Presser for invaluable discussions.
6 Corresponding author.
E-mail manoli{at}mit.edu; fax (617) 262-6121.
[Supplemental material is available online at www.genome.org. Additional supplemental materials are available online at http://compbio.mit.edu/fly/genes/. Full-length cDNA sequence data from this study have been submitted to GenBank under accession nos. BT029554–BT029635, BT029637–BT029727, BT029940–BT029957, BT030133–BT030144, BT030416–BT030421, and BT030448–BT030452. RT-PCR amplicon and primer sequence data have been submitted to GenBank under accession nos. ES439769–ES439782.]
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.6679507
Adams, M.D., Celniker, S.E., Holt, R.A., Evans, C.A., Gocayne, J.D., Amanatides, P.G., Scherer, S.E., Li, P.W., Hoskins, R.A., Galle, R.F., et al. 2000. The genome sequence of Drosophila melanogaster. Science 287: 2185–2195.
Andrews, J., Smith, M., Merakovsky, J., Coulson, M., Hannan, F., and Kelly, L.E. 1996. The stoned locus of Drosophila melanogaster produces a dicistronic transcript and encodes two distinct polypeptides. Genetics 143: 1699–1711.
Bass, B.L. 2002. RNA editing by adenosine deaminases that act on RNA. Annu. Rev. Biochem. 71: 817–846.
Bergman, C.M., Pfeiffer, B.D., Rincon-Limas, D.E., Hoskins, R.A., Gnirke, A., Mungall, C.J., Wang, A.M., Kronmiller, B., Pacleb, J., Park, S., et al. 2002. Assessing the impact of comparative genomic sequence data on the functional annotation of the Drosophila genome. Genome Biol. 3: doi: 10.1186/gb-2002-3-12-research0086.
Bergstrom, D.E., Merli, C.A., Cygan, J.A., Shelby, R., and Blackman, R.K. 1995. Regulatory autonomy and molecular characterization of the Drosophila out at first gene. Genetics 139: 1331–1346.
Bernal, A.E., Crammer, K., Hatzigeorgiou, A., and Pereira, F.C.N. 2007. Global discriminative learning for higher-accuracy computational gene prediction. PLoS Comput. Biol. 3: e54. doi: 10.1371/journal.pcbi.0030054.
Blanchette, M., Kent, W.J., Riemer, C., Elnitski, L., Smit, A.F., Roskin, K.M., Baertsch, R., Rosenbloom, K., Clawson, H., Green, E.D., et al. 2004. Aligning multiple genomic sequences with the threaded blockset aligner. Genome Res. 14: 708–715.
Bray, N. and Pachter, L. 2004. MAVID: Constrained ancestral alignment of multiple sequences. Genome Res. 14: 693–699.
Brent, M.R. 2005. Genome annotation past, present, and future: How to define an ORF at each locus. Genome Res. 15: 1777–1786.
Brogna, S. and Ashburner, M. 1997. The Adh-related gene of Drosophila melanogaster is expressed as a functional dicistronic messenger RNA: Multigenic transcription in higher organisms. EMBO J. 16: 2023–2031.
Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78–94.
Casey, J.L. and Gerin, J.L. 1995. Hepatitis D virus RNA editing: Specific modification of adenosine in the antigenomic RNA. J. Virol. 69: 7593–7600.
Castellano, S., Morozova, N., Morey, M., Berry, M.J., Serras, F., Corominas, M., and Guigo, R. 2001. In silico identification of novel selenoproteins in the Drosophila melanogaster genome. EMBO Rep. 2: 697–702.
Celniker, S.E., Wheeler, D.A., Kronmiller, B., Carlson, J.W., Halpern, A., Patel, S., Adams, M., Champe, M., Dugan, S.P., Frise, E., et al. 2002. Finishing a whole-genome shotgun: Release 3 of the Drosophila melanogaster euchromatic genome sequence. Genome Biol. 3: doi: 10.1186/gb-2002-3-12-research0079.