Genome Res. 14:406-413, 2004
©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Methods
Whole Genome Sequence Comparisons and "Full-Length" cDNA Sequences: A Combined Approach to Evaluate and Improve Arabidopsis Genome Annotation
1 Genoscope-Centre National de Séquençage and Centre National de la Recherche Scientifique Unité Mixte de Recherche-3080, 91000 Evry, France
2 Institut National de la Recherche Agronomique, Unité de Recherche en Génomique Végétale, 91000 Evry, France
3 Life Technologies, a Division of Invitrogen, Carlsbad, California 92008 USA
To evaluate the existing annotation of the Arabidopsis genome further, we generated a collection of evolutionarily conserved regions (ecores) between Arabidopsis and rice. The ecore analysis provides evidence that the gene catalog of Arabidopsis is not yet complete, and that a number of these annotations require re-examination. To improve the Arabidopsis genome annotation further, we used a novel "full-length" enriched cDNA collection prepared from several tissues. An additional 1931 genes were covered by new "full-length" cDNA sequences, raising the number of annotated genes with a corresponding "full-length" cDNA sequence to about 14,000. Detailed comparisons between these "full-length" cDNA sequences and annotated genes show that this resource is very helpful in determining the correct structure of genes, in particular, those not yet supported by "full-length" cDNAs. In addition, a total of 326 genomic regions not included previously in the Arabidopsis genome annotation were detected by this cDNA resource, providing clues for new gene discovery. Because, as expected, the two data sets only partially overlap, their combination produces very useful information for improving the Arabidopsis genome annotation.
The sequence of the Arabidopsis thaliana genome was completed in 2000 by the Arabidopsis Genome Initiative (AGI; Lin et al. 1999
To evaluate and further improve these annotations, we used two different types of data as follows. (1) Whole genome sequence comparisons between Arabidopsis and rice. In this strategy, we detected evolutionarily conserved regions (ecores) between Arabidopsis and the available rice sequence draft, as was done between the human and pufferfish genome using Exofish (Roest Crollius et al. 2000
In the most recent version of the Arabidopsis genome annotation (MIPS, June 2003), 26,446 annotated genes were identified. Because for many annotated genes the UTR regions are not available, the CDS only will be used and referred to hereafter as annotations or annotated features. Of these, 12,165 annotated gene models were supported by "full-length" cDNAs, including most of the "full-length" cDNA analyzed in Yamada et al. (2003 To estimate the level of completion of the annotation of the Arabidopsis genome, independent of existing annotation resources, we performed genome-wide sequence comparisons between Arabidopsis and rice genomic sequences.
Whole Genome Sequence Comparisons
To further analyze the ecores lying in the intergenic regions, we have constructed models based on ecotigs (ECOre conTIG). Such models are constructed by linking in the same model two or more ecores that are located in the same relative position on both genomes (see Methods). A fraction of the ecotigs are composed of more than one gene, reflecting the microsynteny existing between Arabidopsis and rice (Salse et al. 2002 This analysis strongly suggests that the gene catalog of Arabidopsis is not yet complete, and that a number of existing annotations require re-examination. To address these issues, we made use of a novel collection of full-length cDNAs.
Analysis of the cDNA Collection
Because the GSLT sequences are based on the assembly of single pass reads and may contain sequencing errors, for CDS determination, we generated a cDNA sequence (virtual cDNA) using the matching Arabidopsis genomic sequence. CDSs for 21,572 cDNA clones for which a full-insert sequence was available were determined (see Methods). Additional information on the GSLT resource is available as Supplemental information. We compared the length of the sequences from the GSLT resource with the 21,797 publicly available mRNA sequences with complete CDSs (GenBank, PLN section release 133) referred to hereafter as E-A-mRNA (Existing Arabidopsis mRNAs). A small subset of these E-A-mRNAs were not in the MIPS June 2003 annotation. The most 5' and 3' sequences from both data sets were selected, and their size differences calculated for 4841 and 4836 pairs, respectively. The results are shown in Figure 2. In 26% of the cases (1244), sequences from the GSLT resource extend the 5' end sequence of the E-A-mRNA resource, and for 61 of these, at least one novel 5' exon was detected. A list of 5' and 3' extensions with novel exon(s) is available at http://www.genoscope.cns.fr/Arabidopsis/file1. In some cases, the coding region was also extended. An example is shown in Figure 3. When this analysis was not restricted to the most 5' sequence from the GSLT resource, the GSLT sequences extended the 5'-end sequence from the E-A-mRNA resource in 22% of the cases. Furthermore, 77% of the GSLT CDS sequences started either at the same ATG or at an upstream ATG, compared with the E-A-mRNA CDS sequences.
GSLT Models and Annotated Gene Structure
Of the 18,025 GSLT clones for which a validated gene model was available, 17,159 overlapped 9297 annotated genes, at least partially (see Fig. 1, Case 4), 326 overlapped 251 annotated genes, but on the opposite strand, and 540 are located in regions with no gene annotation. Additional information on nonvalidated gene models and unassembled sequences is available as Supplemental information. Of the 9297 annotated genes overlapped by the GSLT validated gene models, 6429 were already supported by a "full-length" cDNA, 1967 by ESTs, and 901 were not supported by expression data. To evaluate the impact of the GSLT resource on the genome annotation, we used a suitable subset of these 18,025 GSLT clones. This subset (13,031 clones) is restricted to cDNA sequences covering the totality of an annotated gene, and matching this gene solely (Fig. 1, Case 5; Table 4). We then compared the CDS deduced from the GSLT models with the annotated CDSs from groups A and B defined above (results in Table 5).
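The three-way partition of clones described above (same-strand overlap, opposite-strand overlap, no annotation) can be illustrated with a minimal sketch. The `Feature` record, the function names, and the coordinate convention are assumptions made for illustration only; the actual comparison was made on validated gene models and CDSs, not on bare intervals.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    """Hypothetical record for a clone alignment or an annotated gene."""
    chrom: str
    start: int   # 1-based, inclusive (assumed convention)
    end: int     # inclusive
    strand: str  # "+" or "-"


def overlaps(a: Feature, b: Feature) -> bool:
    """True when the two features share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start <= b.end and b.start <= a.end


def classify_clone(clone: Feature, genes: List[Feature]) -> str:
    """Return 'sense', 'antisense', or 'unannotated' for one clone."""
    hits = [g for g in genes if overlaps(clone, g)]
    if any(g.strand == clone.strand for g in hits):
        return "sense"        # overlaps an annotated gene, at least partially
    if hits:
        return "antisense"    # overlaps annotated genes only on the opposite strand
    return "unannotated"      # region with no gene annotation


def covers_single_gene(clone: Feature, genes: List[Feature]) -> bool:
    """True when the clone spans exactly one same-strand gene over its whole length."""
    hits = [g for g in genes if g.strand == clone.strand and overlaps(clone, g)]
    return (len(hits) == 1
            and clone.start <= hits[0].start
            and hits[0].end <= clone.end)
```

`covers_single_gene` mirrors, in simplified form, the restriction to clones covering the totality of one annotated gene and matching that gene only.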
As expected, the vast majority (95%) of the annotated gene models for Group A were confirmed by the GSLT clone analysis, validating these annotations and our analysis simultaneously. Conversely, 45% of annotated gene models not supported by "full-length" cDNAs needed to be inspected for extensions, missing exons, and incorrect splice sites. Lists corresponding to dubious annotations can be found at http://www.genoscope.cns.fr/Arabidopsis/file 2 to 5 and used to explore the supporting evidence on a browser (http://www.genoscope.cns.fr/cgi-bin/ggb/ggb?source=Arabidopsis). Existing CDS annotation goes beyond the GSLT CDS (Fig. 1, Case 6) for 388 (7%) genes from group A and 306 (15.8%) from group B, although the sequence of the clone covers the totality of the annotated gene. Manual inspection of a 10-Mb region shows that these mostly correspond to GSLT cDNAs that are probably derived from immature mRNAs. This was further confirmed by the sequence of a publicly available mRNA (E-A-mRNA) in 70% of the 388 annotated genes from group A, suggesting that these gene models are accurately annotated. However, in 36 and 43 annotated genes from the A and B groups, respectively, GSLT models from at least two independent clones disagree with the proposed annotated gene (http://www.genoscope.cns.fr/Arabidopsis/file 6 and 7). The most frequent cause (21/43) of the difference between the annotated gene and the GSLT model is an unspliced intron located in the same position in at least two GSLT sequences. Alternative splicing is the second most frequent explanation (6/43); one example is shown in Figure 4.
Novel Genes
The GSLT cDNA resource was used to detect new Arabidopsis genes that were overlooked during previous annotation processes. Using an automated analysis, we detected 326 genomic regions not overlapped by an annotated gene, but covered by at least one GSLT cDNA sequence. For each region, the cDNA clone with the longest CDS was selected. These unannotated regions were classified according to the relative size of the CDS and exon number (Table 6). Of the 326 classified regions, 96 show evolutionarily conserved regions (see below) (http://www.genoscope.cns.fr/Arabidopsis/file 8).
Additional Features
One of the difficulties encountered during the annotation process is to define the correct beginning and end of a gene (Fig. 1, Case 7). In some cases, erroneous predictions lead to a gene model that merges or splits real genes. We searched for GSLT cDNA sequences bridging two or more consecutive annotated genes, and found 93 regions (186 annotated genes) in which two genes could potentially be merged. Conversely, we found 35 cases in which two nonoverlapping GSLT sequences were included in the same gene annotation, raising the possibility that the annotation had merged two real genes (http://www.genoscope.cns.fr/Arabidopsis/file 9 and 10).
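The two searches described above can be sketched as simple interval tests. This is only an illustration under stated assumptions: intervals are hypothetical `(start, end)` pairs on one chromosome and one strand, and the function names are not taken from the original pipeline.

```python
from typing import List, Tuple

# (start, end), inclusive; all intervals assumed to lie on one chromosome and strand
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True when the two intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def merge_candidates(cdnas: List[Interval],
                     genes: List[Interval]) -> List[Tuple[Interval, List[Interval]]]:
    """cDNAs bridging two or more annotated genes (the genes may need merging)."""
    found = []
    for cdna in cdnas:
        bridged = sorted(g for g in genes if overlaps(cdna, g))
        if len(bridged) >= 2:
            found.append((cdna, bridged))
    return found


def split_candidates(cdnas: List[Interval], genes: List[Interval]) -> List[Interval]:
    """Annotated genes that contain two mutually nonoverlapping cDNAs."""
    found = []
    for gene in genes:
        inside = [c for c in cdnas if gene[0] <= c[0] and c[1] <= gene[1]]
        # Two nonoverlapping intervals exist exactly when the smallest end
        # lies before the largest start.
        if inside and min(c[1] for c in inside) < max(c[0] for c in inside):
            found.append(gene)
    return found
```

For example, with genes `[(100, 200), (300, 400), (1000, 2000)]` and cDNAs `[(150, 350), (1000, 1300), (1500, 1900)]`, the first cDNA bridges the first two genes, and the third gene contains two nonoverlapping cDNAs.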
Alternative splicing is thought to be rare in plants as compared with mammals, although the number of known cases is increasing (Jordan et al. 2002
In addition, we found 326 GSLT sequences that overlapped 251 annotated genes, but on the opposite strand, and could correspond to antisense RNAs. In most of the cases, visual inspection did not reveal cloning artifacts. Of the 251 annotated genes, 166 were covered by both a cDNA supporting the annotated gene and an antisense cDNA. In 141 cases, the antisense cDNA was unspliced and did not permit the exclusion of possible genomic DNA contamination. Interestingly, in 12 cases, more than one unspliced antisense cDNA (GSLT and E-A-mRNA) was found for a given gene, increasing the possibility that they may correspond to antisense RNAs. In 25 cases, the antisense cDNA contained splicing events, and in all of these cases, the GT-AG splice site consensus sequence was found.
Combining Comparative Genomics and cDNA Data
Additional internal exons have been suggested by the observation of a total of 562 ecores mapping within annotated genes, but which do not match annotated exons (Fig. 1, Case 1). Most of these ecores (522) map in 380 group B gene annotations (not supported by a full-length cDNA). Fifty of these genes, as well as all of the 66 ecores that reside in these models but outside annotated exons, are matched by GSLT cDNAs. This suggests also that a vast majority of the remaining 456 ecores of group B annotations represent true exons. By combining the two data sets, potential missing exons were detected for 456 annotated genes. Among the 424 annotated genes potentially extended by ecotigs, 192 could be matched by GSLT or E-A-mRNA. In 153 cases (80%), one ecore localized in the extension is matched by an exon of a cDNA, and in 118 cases, this extending exon is part of the CDS deduced from the cDNA (Fig. 6). A total of 947 potential gene extensions were found when systematic intergenome comparisons and full-length GSLT cDNA were used.
In addition, 403 ecotigs were composed exclusively of ecores mapping outside an annotation. A total of 87 of these ecotigs (22%) were colocalized with at least one cDNA (GSLT and/or E-A-mRNA). An example is presented in Figure 7. Some of these could correspond to novel genes (75), whereas others (12) appear to be extensions of annotated models (cDNAs that also match an annotated gene). The rest of these ecotigs (316) represent additional potential novel genes or gene extensions, and could be targeted for biological validation using reverse transcribed PCR with primers designed from the corresponding ecore sequences (http://www.genoscope.cns.fr/Arabidopsis/file 15 and 16).
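The ecotig counts in this paragraph can be checked with a few lines of arithmetic. The variable names are illustrative; the numbers are those quoted in the text.

```python
# Counts quoted in the paragraph above
total_ecotigs = 403        # ecotigs made only of ecores outside any annotation
with_cdna = 87             # colocalized with at least one cDNA (GSLT and/or E-A-mRNA)
novel_gene_candidates = 75
model_extensions = 12

# The two cDNA-supported classes should add up to the colocalized total.
assert novel_gene_candidates + model_extensions == with_cdna

remaining = total_ecotigs - with_cdna                  # ecotigs without cDNA support
percent_with_cdna = round(100 * with_cdna / total_ecotigs)

print(remaining)           # 316
print(percent_with_cdna)   # 22
```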
In this study, we attempted to estimate the degree of completion and accuracy of the existing annotation of the Arabidopsis genome and to improve it by using novel data sets on the basis of systematic intergenome comparisons and full-length enriched cDNA libraries. A systematic intergenome comparison was performed between rice and Arabidopsis genomes. The ecore analysis provides (1) a way to monitor the degree of completion of genome annotation, (2) a method to refine the proposed gene models, and (3) a resource for novel candidate gene models. About half of the 8% of ecores that fell outside gene annotations could be ascribed to background, suggesting that the fraction of coding features that remains unannotated is very low (4%-5%). This fraction corresponds either to parts of existing genes or to novel genes.
As an example, we estimate, on the basis of the analysis of a subset of 66 nonexon-matching ecores, that most of the 562 nonexon-matching ecores, detected within the boundary of an annotated gene, correspond to missing or alternative internal exons. Of these, 40 are found in the group of annotated genes supported by "full-length" cDNAs and 522 are in the group not supported by a "full-length" cDNA ( Of the 21,572 complete GSLT cDNA sequences that we produced, 20,407 correspond to 10,512 annotated genes. A detailed comparison between the annotated genes and the GSLT-based models shows a very high contrast between the annotated genes supported by "full-length" cDNAs and those that are not supported. Identical gene models are found in 95% and 55% of the cases, respectively. The discrepant 5% and 45% displayed either splice-site differences, exon skipping, or 5' and/or 3' extensions. Given the limitations in the cDNA approach, such discrepancies do not necessarily invalidate the annotated gene models. In some instances, alternative gene models were found to be due to alternatively spliced isoforms, showing the usefulness of the GSLT resource to further document genes already supported by cDNA sequences. In addition, a small number of discrepancies between annotated gene models and GSLT models could be explained by errors either in the genomic sequence or in splice-site determination in the GSLT models.
The GSLT resource also permitted discovery of yet undetected genes. We found 326 genomic regions covered by a cDNA sequence with no corresponding annotated genes, pseudogenes, or transposons. Of these regions, 147 correspond to a spliced gene model, excluding the possibility that the cDNA sequence results from the cloning of genomic DNA. Furthermore, 73 of these regions have corresponding ecores and represent good candidates for novel genes (see Table 6). The situation of the remaining regions requires further investigation, as the size of the CDS of true genes can be very short, 80 amino acids in plants (Cock and McCormick 2001 The GSLT cDNA sequences are being incorporated into the next release of the Arabidopsis genome annotation, which is in progress at TIGR and MIPS. As shown here, it will allow the validation, updating, or discovery of thousands of gene models, as well as the recognition of alternate splice sites. The results obtained from comparative genomics are also valuable in improving the Arabidopsis genome annotation. Unfortunately, the incorporation of the ecores or ecotigs in an annotation process is not as straightforward as for the cDNA sequences. Although the specificity is high, the sensitivity is estimated to be close to 50%, so Exofish is not sufficient per se to refine the gene models, and gene modeling has to incorporate additional data. For instance, ab initio gene prediction programs could be improved significantly if they could take into account the presence of ecores when constructing new gene models. Nevertheless, with the increasing number of genomic sequences generated, it is obvious that comparative genomics will be an important tool for genome annotation in the coming years.
Library Construction
cDNA libraries were constructed with mRNA extracted from four different tissues (accession Col 0): leaf and stem, hormone-treated callus, flower buds, and flowers at various developmental stages and forming siliques in the developing embryo.
Four normalized libraries were prepared at Invitrogen Corp. as follows: First and second-strand cDNA were synthesized from poly(A)+ mRNA, using Superscript II RT (Invitrogen) and an oligo-dT primer containing a NotI site, following the protocols described in the Invitrogen Manual: SuperScript Plasmid System with Gateway Technology for cDNA Synthesis & Cloning (Cat 18248013) http://www.invitrogen.com/content/sfs/manuals/18248.pdf. The cDNA was polished with T4 polymerase, digested with NotI to create 5'-blunt/3'-Not I cDNA, then size-fractionated on a gel, purified, and ligated into the pCMV-Sport6.1 vector. The libraries were normalized to Cot-10, essentially following method 2-1 of Bonaldo et al. (1996
Sequencing Procedure
Alignment of cDNA Sequences on the Arabidopsis Genome and CDS Construction
Exofish Procedure
We also assembled Arabidopsis ecores to create ecotigs. These ecotigs group the ecores together as long as they are colinear on the two genomes. Two consecutive ecores on the Arabidopsis genome are in the same model if these two ecores are composed of at least two consecutive HSPs, or if they are separated at most by one HSP on the rice genome (Jaillon et al. 2004
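The chaining rule can be sketched as follows. This is a deliberately simplified reading, not the Exofish implementation: each ecore is reduced to its Arabidopsis position plus the rank of its matching HSP along one rice sequence, and two consecutive ecores are chained when their rice ranks are adjacent or separated by at most one intervening HSP. The `Ecore` record, the field names, and the `max_skipped` parameter are all assumptions, and strand consistency along the chain is not checked.

```python
from typing import List, NamedTuple


class Ecore(NamedTuple):
    """Hypothetical, simplified ecore record."""
    chrom: str      # Arabidopsis chromosome
    start: int      # start position on the Arabidopsis genome
    rice_seq: str   # rice sequence carrying the matching HSP
    rice_rank: int  # rank of that HSP along the rice sequence


def chain_ecores(ecores: List[Ecore], max_skipped: int = 1) -> List[List[Ecore]]:
    """Group consecutive Arabidopsis ecores whose rice HSPs stay close in rank."""
    chains: List[List[Ecore]] = []
    for ecore in sorted(ecores, key=lambda e: (e.chrom, e.start)):
        if chains:
            prev = chains[-1][-1]
            same_sequences = (prev.chrom == ecore.chrom
                              and prev.rice_seq == ecore.rice_seq)
            rank_gap = abs(ecore.rice_rank - prev.rice_rank)
            # Adjacent HSPs have a gap of 1; one skipped HSP gives a gap of 2.
            if same_sequences and 0 < rank_gap <= max_skipped + 1:
                chains[-1].append(ecore)
                continue
        chains.append([ecore])
    return chains


def ecotigs_from(chains: List[List[Ecore]]) -> List[List[Ecore]]:
    """Keep only chains linking two or more ecores."""
    return [chain for chain in chains if len(chain) >= 2]
```

Under this reading, ecores with rice ranks 1, 2, 4, and 9 along the same pair of sequences would yield one ecotig (ranks 1, 2, 4) and one unchained ecore (rank 9).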
We thank Chris Gruber and Mark Smith for library construction, Sébastien Aubourg and Pierre Rouzé for providing the set of Arabidopsis manually annotated genes prior to publication. We thank Nathalie Choisne, Sylvie Samain, Nadia Demange, Agnes Violet, and Susan Cure for help during the course of the project. We also thank Franck Aniere and the entire system network team. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1515604.
4 Present address: Intronn, Inc. Gaithersburg, MD 20878, USA.
5 Corresponding author. [Supplemental material is available online at www.genome.org. The cDNA sequences have been released to the EMBL. The data produced during this analysis and accession nos. are available at http://www.genoscope.cns.fr/Arabidopsis/. The GSLT cDNA clones are available at Genoscope. The results can be visualized at http://www.genoscope.cns.fr/cgi-bin/ggb/ggb?source=Arabidopsis/.]
Arabidopsis Genome Initiative (AGI) 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796-815.
Bonaldo, M.F., Lennon, G., and Soares, M.B. 1996. Normalization and subtraction: Two approaches to facilitate gene discovery. Genome Res. 6: 791-806.
Cock, J.M. and McCormick, S. 2001. A large family of genes that share homology with CLAVATA3. Plant Physiol. 126: 939-942.
Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M., and Miller, W. 1998. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8: 967-974.
Goff, S.A., Ricke, D., Lan, T.H., Presting, G., Wang, R., Dunn, M., Glazebrook, J., Sessions, A., Oeller, P., Varma, H., et al. 2002. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296: 92-100.
Haas, B.J., Volfovsky, N., Town, C.D., Troukhan, M., Alexandrov, N., Feldmann, K.A., Flavell, R.B., White, O., and Salzberg, S.L. 2002. Full-length messenger RNA sequences greatly improve genome annotation. Genome Biol. 3: RESEARCH0029.
Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith Jr., R.K., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., Town, C.D., et al. 2003. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31: 5654-5666.
Hebsgaard, S.M., Korning, P.G., Tolstrup, N., Engelbrecht, J., Rouze, P., and Brunak, S. 1996. Splice site prediction in Arabidopsis thaliana pre-mRNA by combining local and global sequence information. Nucleic Acids Res. 24: 3439-3452.
Jaillon, O., Dossat, C., Eckenberg, R., Eiglmeier, K., Segurens, B., Aury, J.M., Roth, C.W., Scarpelli, C., Brey, P.T., Weissenbach, J., et al. 2003. Assessing the Drosophila melanogaster and Anopheles gambiae genome annotations using genome-wide sequence comparisons. Genome Res. 13: 1595-1599.
Jaillon, O., Aury, J.-M., Roest Crollius, H., Salanoubat, M., Wincker, P., Dossat, C., Castelli, V., Boudet, N., Samair, S., Eckenberg, R., et al. 2004. Genome-wide analyses based on comparative genomics. In Cold Spring Harbor Symposia on Quantitative Biology, Vol. LXVIII. Cold Spring Harbor Laboratory Press, New York, (in press).
Jordan, T., Schornack, S., and Lahaye, T. 2002. Alternative splicing of transcripts encoding Toll-like plant resistance proteins - What's the functional relevance to innate immunity? Trends Plant Sci. 7: 392-398.
Kazan, K. 2003. Alternative splicing and proteome diversity in plants: The tip of the iceberg has just emerged. Trends Plant Sci. 8: 468-471.
Kessler, M.M., Zeng, Q., Hogan, S., Cook, R., Morales, A.J., and Cottarel, G. 2003. Systematic discovery of new genes in the Saccharomyces cerevisiae genome. Genome Res. 13: 264-271.
Kong, J., Gong, J.M., Zhang, Z.G., Zhang, J.S., and Chen, S.Y. 2003. A new AOX homologous gene OsIM1 from rice (Oryza sativa L.) with an alternative splicing mechanism under salt stress. Theor. Appl. Genet.
Lin, X., Kaul, S., Rounsley, S., Shea, T.P., Benito, M.I., Town, C.D., Fujii, C.Y., Mason, T., Bowman, C.L., Barnstead, M., et al. 1999. Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana. Nature 402: 761-768.
Mayer, K., Schuller, C., Wambutt, R., Murphy, G., Volckaert, G., Pohl, T., Dusterhoft, A., Stiekema, W., Entian, K.D., Terryn, N., et al. 1999. Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana. Nature 402: 769-777.
Mott, R. 1997. EST_GENOME: A program to align spliced DNA sequences to unspliced genomic DNA. Comput. Appl. Biosci. 13: 477-478.
Roest Crollius, H., Jaillon, O., Bernot, A., Dasilva, C., Bouneau, L., Fischer, C., Fizames, C., Wincker, P., Brottier, P., Quetier, F., et al. 2000. Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat. Genet. 25: 235-238.
Salanoubat, M., Lemcke, K., Rieger, M., Ansorge, W., Unseld, M., Fartmann, B., Valle, G., Blocker, H., Perez-Alonso, M., Obermaier, B., et al. 2000. Sequence and analysis of chromosome 3 of the plant Arabidopsis thaliana. Nature 408: 820-822.
Salse, J., Piegu, B., Cooke, R., and Delseny, M. 2002. Synteny between Arabidopsis thaliana and rice at the genome level: A tool to identify conservation in the ongoing rice genome sequencing project. Nucleic Acids Res. 30: 2316-2328.
Schoof, H. and Karlowski, W.M. 2003. Comparison of rice and Arabidopsis annotation. Curr. Opin. Plant Biol. 6: 106-112.
Seki, M., Narusaka, M., Kamiya, A., Ishida, J., Satou, M., Sakurai, T., Nakajima, M., Enju, A., Akiyama, K., Oono, Y., et al. 2002a. Functional annotation of a full-length Arabidopsis cDNA collection. Science 296: 141-145.
Seki, M., Narusaka, M., Kamiya, A., Ishida, J., Satou, M., Sakurai, T., Nakajima, M., Enju, A., Akiyama, K., Oono, Y., et al. 2002b. RIKEN Arabidopsis full-length cDNA database. Trends Plant Sci. 7: 562-563.
Sun, G., Dilcher, D.L., Zheng, S., and Zhou, Z. 1998. In search of the first flower: A jurassic angiosperm, archaefructus, from Northeast China. Science 282: 1692-1695.
Tabata, S., Kaneko, T., Nakamura, Y., Kotani, H., Kato, T., Asamizu, E., Miyajima, N., Sasamoto, S., Kimura, T., Hosouchi, T., et al. 2000. Sequence and analysis of chromosome 5 of the plant Arabidopsis thaliana. Nature 408: 823-826.
Theologis, A., Ecker, J.R., Curtis, J.P., Federspeil, N.A., Kaul, S., and Venter, C. 2000. Sequence and analysis of chromosome 1 of the plant Arabidopsis thaliana. Nature 408: 816-820.
Vandepoele, K., Simillion, C., and Van de Peer, Y. 2002. Detecting the undetectable: Uncovering duplicated segments in Arabidopsis by comparison with rice. Trends Genet. 18: 606-608.
Yamada, K., Lim, J., Dale, J.M., Chen, H., Shinn, P., Palm, C.J., Southwick, A.M., Wu, H.C., Kim, C., Nguyen, M., et al. 2003. Empirical analysis of transcriptional activity in the Arabidopsis genome. Science 302: 842-846.
Yang, Y.W., Lai, K.N., Tai, P.Y., and Li, W.H. 1999. Rates of nucleotide substitution in angiosperm mitochondrial DNA sequences and dates of divergence between Brassica and other angiosperm lineages. J. Mol. Evol. 48: 597-604.
http://www.genoscope.cns.fr/; gives direct access to the browser.
http://www.invitrogen.com/content/sfs/manuals/18248.pdf; contains the protocol used for library construction.
http://www.genoscope.cns.fr/Arabidopsis; permits access to files listed in the text, with links to the browser.
http://rgp.dna.affrc.go.jp/IRGSP/; The International Rice Genome Sequencing Project home page.
Received May 7, 2003; accepted in revised form December 27, 2003.