Published online before print
February 6, 2007, 10.1101/gr.5881807 Genome Res. 17:299-310, 2007 ©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07 $5.00 OPEN ACCESS ARTICLE
Letter
Improvement of whole-genome annotation of cereals through comparative analyses
The Institute for Genomic Research, Rockville, Maryland 20850, USA
Rice is an important model species for the Poaceae and other monocotyledonous plants. With the availability of a near-complete, finished, and annotated rice genome, we performed genome-level comparisons between rice and all plant species for which large genomic or transcriptomic data sets are available to determine the utility of cross-species sequence for structural and functional annotation of the rice genome. Through comparative analyses with four plant genome sequence data sets and transcript assemblies from 185 plant species, we were able to confirm and improve the structural annotation of the rice genome. Support in the form of a rice expressed sequence tag, full-length cDNA, or plant homolog from our comparative analyses could be found for 38,109 (89.3%) of the total 42,653 nontransposable element-related genes in the rice genome. Although the majority of the putative homologs were obtained from Poaceae species, putative homologs were also identified in dicotyledonous angiosperms, gymnosperms, and other plants such as algae, moss, and fern. A set of 7669 rice genes lacking a putative homolog was identified; these may be lineage-specific genes that evolved after speciation and have a role in species diversity. Our comparative alignments also pointed to improvements in the current rice gene structural annotation: we identified 487 genes that were most likely missed in the current rice genome annotation and another 500 genes for structural annotation review. We also demonstrate the utility of cross-species comparative alignments in the identification of noncoding sequences and in confirmation of gene nesting in rice.
The Poaceae (or grass family) is the most economically important family of plants, as the majority of food for the human diet and of animal feed is obtained from species within the family, including rice (Oryza sativa), maize (Zea mays), wheat (Triticum aestivum), barley (Hordeum vulgare), sorghum (Sorghum bicolor), oats (Avena sativa), millet (Eleusine coracana), and rye (Secale cereale) (http://faostat.fao.org). At the genome level, gene content and gene order are well conserved among the Poaceae (Gale and Devos 1998
The map-based sequence of the rice genome (O. sativa ssp. japonica var. Nipponbare) was completed in 2005 (International Rice Genome Sequencing Project 2005
As rice is the first finished cereal genome, the rice genome annotation will be used extensively in the annotation of genes in other cereals and grass species. As with other eukaryotic species, annotation of the rice genome was initiated using gene predictions from ab initio gene finders and further improved by using cDNA and expressed sequence tags (ESTs) (Yuan et al. 2005
In this study, we performed large-scale comparative genome analyses with rice using all available major plant sequence data sets. Genome sequence data sets include the finished sequence of the model dicotyledonous (dicot) plant, Arabidopsis thaliana (Arabidopsis Genome Initiative 2000
Support for current rice genome annotation
Support in the form of rice transcripts or putative homologs for the 55,890 total rice genes was identified by searching against sequence data sets from 185 plant species, which collectively represent 2670 Mb of sequence. The sequence data included (1) genomic sequences from A. thaliana, P. trichocarpa, Z. mays, and S. bicolor, (2) the Arabidopsis proteome, and (3) 185 plant transcript data sets which are clustered assemblies of ESTs, mRNAs, and full-length cDNAs (Table 1; http://plantta.tigr.org; Childs et al. 2007 16% of all the plant transcripts (Table 1; Supplemental Fig. 1). Clearly, the number of putative rice homologs within the Plant TAs will vary based on both the representation of the transcriptome and the evolutionary distance between rice and each species.
In this study, the TA and genomic sequences were placed into 10 groupings based on the type of data source and taxonomic distance relative to rice: (1) rice TA, (2) Other Poaceae TAs (excluding Oryza sativa; 23 species), (3) Other Monocot TAs (excluding Poaceae species; eight species), (4) Eudicotyledons (Eudicot) TAs (121 species), (5) Other Plant TAs (32 species, such as basal angiosperms, algae, mosses, and ferns), (6) Assembled Zea mays (AZMs) genomic sequences, which are assembled methylation filtration and high C0t reads from the pilot maize gene enrichment sequencing project (Whitelaw et al. 2003 780,000 ESTs have been released since functional annotation of Release 4 of our annotation and Massively Parallel Signature Sequencing (MPSS), Serial Analysis of Gene Expression (SAGE), and proteomic data were utilized in functional annotation of Release 4 models (http://rice.tigr.org), there are some inconsistencies between the function assignment of the gene models in Release 4 and the data presented in this study. For example, hypothetical genes should lack transcript support. However, in this study, we identified 520 (3.6%) hypothetical genes with cognate transcripts due to the recent rice EST release (Table 1), which should be promoted in their annotation to "expressed gene". In this study, only 83.3% of the rice genes annotated with expression support in Release 4 have cognate EST and/or full-length cDNA transcript support, indicating that the remaining 16.7% genes annotated with expression support in Release 4 were obtained through MPSS, SAGE, and peptide evidence data types.
Among the 10 groupings, homologs for rice genes in all gene categories (i.e., known/putative, expressed, hypothetical, and TE-related; Table 1) were most frequently identified within the Other Poaceae TA data sets, which is consistent with previous reports of high sequence identity among the Poaceae (Ware and Stein 2003 The prevalence of homologs from diverse clades of the plant kingdom suggests that most of these "core plant genes" may be important housekeeping genes that are not only constitutively expressed and detectable through EST sampling methods but also conserved in function. In contrast, the inability to detect a homolog for 7669 rice genes (including 30 known genes, 895 expressed genes, and 6744 hypothetical genes) in the 2512 Mb of non-rice genomic and transcriptomic sequence available to date suggests the presence of lineage specific genes in rice, which may have evolved after speciation and have a role in species diversity. Alternatively, these, or a subset of these genes, may be artifacts of our annotation methods or encode pseudogenes or transposable elements that we have failed to identify properly.
Distribution of support for rice gene models throughout the plant kingdom
As shown in Table 2, the majority of the rice genes have Poaceae evidence support and only a very small number (i.e., P⁻M⁻O⁺ + P⁻M⁺O⁻ + P⁻M⁺O⁺ = 43 + 0 + 0 = 43) of rice genes are supported solely by non-Poaceae sequence data. Notably, a mere 4544 (10.6%) of the total 42,653 non-TE-related loci have no evidence support at the significance level (E-value cutoff <1 × 10⁻⁵) used in this study, and 4384 of these unsupported loci are hypothetical genes. Overall, evidence support could be identified for 69.4% (Total − P⁻M⁻O⁻ = 14,337 − 4384 = 9953; Table 2) of the 14,337 hypothetical genes using an E-value cutoff of <1 × 10⁻⁵, and 2116 loci (= 14,337 − 12,221) had distinct support under the more stringent E-value cutoff of <1 × 10⁻⁵⁰. All of the hypothetical genes are the result of the prediction of the program FGENESH (Salamov and Solovyev 2000
Frequency of homologs in gene-enriched genomic versus transcript Poaceae sequences
The above analyses indicated that Poaceae sequence data are a valuable resource for annotating the rice genome. The Poaceae data set contains 24 TAs, AZMs, and ASBs. Although it is well known that transcript sequence data are the most important resource in gene identification, we were interested in ascertaining the contribution of the Poaceae (excluding rice) TAs and maize/sorghum genomic sequences relative to the rice TAs in providing support for genome annotation. Not surprisingly, the rice TA data set yielded the best contribution among the three major Poaceae data resources (Table 3). Interestingly, the non-rice Poaceae TAs had a comparable number of rice homologs as the maize and sorghum genome assemblies, suggesting broad representation of the Poaceae transcriptome in the collective Poaceae TA data set. Indeed, 93.8% [(19,699 + 369)/21,403] of the known/putative rice genes have a potential homolog in both the non-rice Poaceae TAs and AZM/ASB sequences at a high significance level (E-value cutoff of <1 × 10⁻²⁰). AZMs and ASBs have homologs in 92.4% (19,780) and 89.1% (19,070) of the known/putative rice genes, respectively. Over 98% (21,071) coverage would be reached if the significance was lowered to 1 × 10⁻⁵, consistent with reports that the gene-rich sequencing strategy provides significant coverage of the maize and sorghum gene space (Palmer et al. 2003
Overall, 90,039 (32.6%) of the total 275,904 AZMs had BLASTZ alignments with the rice genome. Of these, 51,403 AZMs have representative alignments that cover 54.5 Mb of the rice genome. Similarly, 65,233 (39.8%) of the total 163,908 ASBs had BLASTZ alignments to the rice genome with representative alignments from 39,885 ASBs spanning 69 Mb of the rice genome. In total, the genic regions of over two thirds of the total rice genes were covered at least partially by the AZM and ASB genomic alignments, including 31,690 (74%) non-TE-related genes.
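The coverage figures above (megabases of rice genome spanned by alignments, and genes covered at least partially) rest on collapsing overlapping alignment coordinates before counting. The following is a minimal sketch of that bookkeeping, not the pipeline used in the study; the half-open coordinate convention and the function names `merge_intervals`, `covered_bases`, and `gene_is_covered` are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation: half-open [start, end) coordinates on one chromosome.
Interval = Tuple[int, int]


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Collapse overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def covered_bases(intervals: Iterable[Interval]) -> int:
    """Count bases covered by at least one alignment, counting each base once."""
    return sum(end - start for start, end in merge_intervals(intervals))


def gene_is_covered(gene: Interval, merged: List[Interval]) -> bool:
    """True if the gene overlaps any merged alignment block, even partially."""
    g_start, g_end = gene
    return any(start < g_end and g_start < end for start, end in merged)


if __name__ == "__main__":
    alignments = [(0, 100), (50, 150), (200, 250)]  # toy coordinates
    blocks = merge_intervals(alignments)
    print(blocks, covered_bases(alignments), gene_is_covered((140, 160), blocks))
```

Merging first matters because several AZMs or ASBs can align to the same rice region; summing raw alignment lengths would overstate the covered fraction.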
Noncognate transcripts
Improvement of gene prediction using comparative alignments
Cross-species spliced alignments can be used to corroborate predicted gene structures (Fig. 1) and amend gene predictions (Brendel et al. 2004
As exon-intron boundaries defined by spliced alignments of heterologous transcripts are not as reliable as those defined by cognate transcripts, more stringent criteria were employed to refine our analysis. First, a putative exon had to be supported by at least three alignments. As a result, 75,111 putative exons were predicted. Second, we compared these cross-species putative exons with existing exons within our annotation. To avoid confounding our results due to alternative splicing and UTR exons, we focused on "novel" exons in genic regions that did not overlap with existing exons in Release 4, i.e., "novel" exons in annotated intronic regions in which the annotated intron is not supported by any transcript (rice or heterologous). A total of 500 genes (including 395 known genes, 66 expressed genes, and 39 hypothetical genes) with potential new exons were identified through cross-species alignments in which the exon had to be supported by at least three independent alignments. For 477 of the 500 genes with new exons, cross-species alignments were from more than one species. Manual inspection showed that most of these genes have incorrect gene structures (see Figs. 2, 3; Supplemental Figs. 2, 3), suggesting that known genes and expressed genes can be improved through comparative analyses. In addition to refinement of gene structures, comparison using PASA2 identified 1854 assemblies that supported unannotated or "missed" genes. Using a stringent set of criteria (≥300 bp in length and ≥3 exons), we conservatively identified 388 assemblies located in 255 distinct intergenic regions as candidate unannotated genes.
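The "novel exon" criteria described above can be sketched as a simple filter. This is an illustration under stated assumptions rather than the procedure actually run with PASA2: the data layout (a mapping from exon coordinates to the number of supporting alignments), the half-open coordinates, and the names `overlaps` and `novel_exons` are all hypothetical. Only the three-alignment threshold comes from the text.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # hypothetical half-open [start, end) coordinates

MIN_SUPPORT = 3  # "at least three alignments", as stated in the text


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two half-open intervals share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def novel_exons(
    putative: Dict[Interval, int],
    annotated_exons: List[Interval],
    unsupported_introns: List[Interval],
) -> List[Interval]:
    """Keep putative exons that (1) have enough alignment support,
    (2) overlap no annotated exon, and (3) lie entirely inside an annotated
    intron that itself has no transcript support."""
    kept = []
    for exon, support in putative.items():
        if support < MIN_SUPPORT:
            continue
        if any(overlaps(exon, known) for known in annotated_exons):
            continue
        if any(i_start <= exon[0] and exon[1] <= i_end
               for i_start, i_end in unsupported_introns):
            kept.append(exon)
    return sorted(kept)


if __name__ == "__main__":
    putative = {(150, 180): 4, (150, 170): 2, (90, 120): 5, (400, 450): 3}
    print(novel_exons(putative, [(0, 100), (200, 300)], [(100, 200)]))
```

In the toy input, only the exon at (150, 180) survives: one candidate has too little support, one overlaps an annotated exon, and one falls outside any unsupported intron.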
Using BLASTZ alignments of the AZMs and ASBs to the rice genome, many alignments between rice genes and putative homologs were identified that spanned multiple exons. Not surprisingly, the identity of the alignment in intronic regions was significantly lower than the flanking exonic regions, appearing as a banded pattern in the genome browser display (Fig. 1). For example, maize, sorghum, and even Arabidopsis genomic comparisons indicated a potential gene upstream of LOC_Os04g45820 that was not predicted by FGENESH and in which only short cognate rice EST sequences and two cross-species spliced alignments are available (Supplemental Fig. 4). By combining the partial gene structure provided by rice EST spliced alignments, exon patterns in the genomic alignments with AZM5_17958, AZM_5_84956, ASB44489, ASB71162 and ASB45539, and cross-species spliced alignments from wheat and maize, we can construct a gene model consistent with the gene prediction from TWINSCAN (Korf et al. 2001 Genomic comparisons can also indicate the existence of novel genes. Using the AZMs and ASBs, numerous BLASTZ alignments were located in "intergenic" regions, which may lead to the identification of the unannotated genes. Each continuous intergenic region was regarded as one unit in the analysis to simplify the computation (which may contain more than one gene). In total, there were 1145 and 830 intergenic regions over 1000 bp length containing alignments with AZMs and ASBs, respectively. Overall, 1614 distinct intergenic regions were covered and 361 of them were covered by matches from both an AZM and an ASB sequence. The conserved regions were then searched against the TIGR Oryza Repeat and the UniProt databases, resulting in 493 and 339 non-TE-related conserved intergenic sequences identified from maize and sorghum, respectively. 
Even when the significance was increased to an E-value cutoff of <1 × 10⁻⁵⁰, there were still 291 and 175 potentially new genes identified from maize and sorghum, respectively, which could be merged into 324 distinct intergenic regions. Further analyses showed that many of those regions encode genes not contained in the current rice genome annotation (data not shown). As our filtering criteria were stringent in that they required similarity to annotated proteins, other conserved regions may also encode genes which have not been previously identified. Indeed, by removing the filter of UniProt similarity yet retaining the repetitive sequence filter, we identified 800 additional candidate new genes. Some conserved regions may contain multiple genes (Supplemental Fig. 5), while others may contain coding regions of the neighboring genes missed in the annotation process and not new genes. Nevertheless, these conserved "intergenic regions" can be used to improve the current rice genome annotation.
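The intergenic-region counts reported for the AZM and ASB alignments (1145 regions with AZM matches, 830 with ASB matches, 361 with both, 1614 distinct) follow the usual inclusion-exclusion relation for two overlapping sets. The sketch below checks that relation and shows the equivalent set operation on toy region identifiers; the function name `distinct_regions` and the identifiers are hypothetical.

```python
def distinct_regions(n_first: int, n_second: int, n_both: int) -> int:
    """Size of the union of two sets, given each size and the size of the overlap."""
    return n_first + n_second - n_both


# Counts quoted in the text for intergenic regions longer than 1000 bp.
azm_regions, asb_regions, shared_regions = 1145, 830, 361
print(distinct_regions(azm_regions, asb_regions, shared_regions))

# The same bookkeeping with explicit (toy) region identifiers.
azm_hits = {"ig_001", "ig_002", "ig_003"}
asb_hits = {"ig_003", "ig_004"}
assert len(azm_hits | asb_hits) == distinct_regions(
    len(azm_hits), len(asb_hits), len(azm_hits & asb_hits)
)
```

Treating each continuous intergenic region as one unit, as the text does, is what makes this set arithmetic valid; a region holding several genes is still counted once.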
Conserved noncoding regions
Although our study utilized simple genomic alignments and did not employ algorithms dedicated to finding miRNAs (Adai et al. 2005
Our analyses show that comparative analyses are extremely useful in the annotation of the rice genome even when more than one million rice transcript sequences are available. Furthermore, we show that the completed rice genome sequence and its annotation provide a valuable data resource for genomic research in other grass species and will certainly facilitate the ongoing maize genome annotation or play an even more important role for those cereal species with only limited sequence data such as oat and rye. Through our comparative analyses, we were able to identify 255 and 324 unannotated candidate genes that were missed in Release 4 by cross-species spliced alignments and genomic comparison, respectively, of which 92 were found by both methods. In total, 487 distinct candidates were identified. Further analysis showed that, although there are FGENESH predictions in 350 (72%) of these conserved "intergenic regions", in most cases the FGENESH algorithm predicted a single, long gene model that spanned two valid neighboring genes with an intron in a relatively short intergenic region, i.e., a merged model. As full-length cDNAs are available to support one gene in the merged FGENESH model, the long FGENESH prediction is truncated by the PASA2 program, which heavily weights full-length cDNA evidence over ab initio gene finder output. Consequently, the other exons within the merged FGENESH model that lack cDNA support are deleted and not included in the final model or gene set. Of the remaining 137 unannotated gene candidates, 43 (31%) likely originate from organellar insertions (data not shown; Supplemental Fig. 5). This analysis suggests that a modified update strategy for the PASA2 program to capture the deprecated portion of merged FGENESH models, coupled with integration of an organellar gene finder into our annotation pipeline, should uncover a majority (80.7%) of these two classes of missed genes.
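The candidate counts in this paragraph can be reproduced from the figures quoted in the text. The short check below is only a worked restatement of that arithmetic; the variable names are illustrative.

```python
# Figures quoted in the text.
by_spliced_alignment = 255
by_genomic_comparison = 324
found_by_both = 92

# Distinct candidates: union of the two methods.
total_candidates = by_spliced_alignment + by_genomic_comparison - found_by_both

with_fgenesh_prediction = 350
remaining = total_candidates - with_fgenesh_prediction
organellar_insertions = 43

# Candidates that the two proposed pipeline changes would be expected to recover.
recoverable = with_fgenesh_prediction + organellar_insertions


def percent(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


print(total_candidates, remaining)
print(percent(with_fgenesh_prediction, total_candidates))
print(percent(organellar_insertions, remaining))
print(percent(recoverable, total_candidates))
```

The 80.7% figure is therefore (350 + 43) of the 487 distinct candidates, that is, the merged-model class plus the organellar class taken together.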
The BLASTZ alignments between rice and maize or sorghum were able to span short introns, and clear, distinct alignments were apparent; however, these alignments might be split by long introns. Clearly, these genomic comparisons are able to reveal gene structures, although it may still be difficult for a curator to determine the exact exon-intron boundaries without additional information. To address this problem, spliced alignments from paralogous and heterologous transcripts could be employed to identify the exact exon-intron boundaries. It was shown that 25,258 (61.7%) of the cross-species spliced alignment assemblies can be incorporated into the genes annotated in Release 4. Many assemblies may reveal the right gene structure (Figs. 2, 3; Supplemental Figs. 2, 3); however, most of them are problematic due to low sequence similarity or gene structure alteration subsequent to speciation. Therefore, additional filters are needed to improve the quality of the spliced alignments. For example, establishing a requirement that each exon-intron boundary in cross-species spliced alignment assemblies be supported by at least three alignments may permit more automated incorporation of cross-species alignment data into an annotation pipeline.
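The boundary-support filter proposed above could look roughly like the following. This is a sketch of the idea only, not an existing PASA2 option: the representation of an alignment as a list of (donor, acceptor) intron coordinates and the names `boundaries`, `boundary_support`, and `assembly_passes` are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

Intron = Tuple[int, int]    # hypothetical (donor position, acceptor position)
Boundary = Tuple[str, int]  # ("donor" | "acceptor", genomic position)


def boundaries(introns: Iterable[Intron]) -> Set[Boundary]:
    """Split introns into their individual exon-intron boundaries."""
    sites: Set[Boundary] = set()
    for donor, acceptor in introns:
        sites.add(("donor", donor))
        sites.add(("acceptor", acceptor))
    return sites


def boundary_support(alignments: Iterable[List[Intron]]) -> Counter:
    """Count how many alignments support each boundary (once per alignment)."""
    counts: Counter = Counter()
    for introns in alignments:
        counts.update(boundaries(introns))
    return counts


def assembly_passes(
    assembly_introns: List[Intron],
    alignments: List[List[Intron]],
    min_support: int = 3,
) -> bool:
    """True if every boundary in the assembly is seen in >= min_support alignments."""
    counts = boundary_support(alignments)
    return all(counts[site] >= min_support for site in boundaries(assembly_introns))


if __name__ == "__main__":
    alns = [[(100, 200), (300, 400)], [(100, 200), (300, 410)],
            [(100, 200), (300, 400)], [(100, 200)]]
    print(assembly_passes([(100, 200)], alns))
    print(assembly_passes([(100, 200), (300, 400)], alns))
```

Counting donor and acceptor sites separately, rather than whole introns, means a well-supported donor paired with a poorly supported acceptor still fails the filter, which matches the per-boundary wording of the proposal.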
We also show that genomic comparisons can shed light on the evolution of gene structure and organization. Some rice genes are separated by a short intergenic region, and synteny of not only gene order but also intergenic regions can be seen among rice, maize, and sorghum (data not shown). However, it is unclear whether the conservation of short intergenic regions has a biological role. Alternative splicing is a common feature in plants (Wang and Brendel 2006
Comparative analyses can also be applied to the study of transposable elements. Some mutator-like transposable elements (MULEs), for example, can capture fragments from host genes and are referred to as Pack-MULEs (Jiang et al. 2004 In this study, we have shown the value of comparative alignments in improving structural and functional annotation of the rice genome, which can be attributed in large part to the deep representation of genomic and transcriptomic sequence for the Poaceae. Clearly, the depth of sequence data is not evenly distributed among taxa in the plant kingdom, and increased efforts in sequencing non-Poaceae monocots may shed light not only on the evolution of the Poaceae genome but also on the divergence of monocots from eudicots.
TIGR rice genome annotation
Release 4.0 of the TIGR rice genome annotation (available at http://rice.tigr.org/; Yuan et al. 2005
Other plant genomes
Plant transcript assemblies
Cross-species spliced alignments
Genomic comparisons
Unannotated genes
Data availability
We thank members of the rice annotation team at TIGR for critical comments on the manuscript, and B. Haas for technical assistance on the configuration of the PASA2 pipeline. This work was supported by a National Science Foundation Plant Genome Research Program grant to C.R.B. (DBI-0321538).
1 Corresponding author.
E-mail rbuell@tigr.org; fax: (301) 838-0208. [Supplemental material is available online at www.genome.org.] Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.5881807
Adai, A., Johnson, C., Mlotshwa, S., Archer-Evans, S., Manocha, V., Vance, V., and Sundaresan, V. 2005. Computational prediction of miRNAs in Arabidopsis thaliana. Genome Res. 15: 78–91.
Arabidopsis Genome Initiative 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796–815.
Bartel, D.P. 2004. MicroRNAs: Genomics, biogenesis, mechanism, and function. Cell 116: 281–297.
Bedell, J.A., Budiman, M.A., Nunberg, A., Citek, R.W., Robbins, D., Jones, J., Flick, E., Rholfing, T., Fries, J., and Bradford, K., et al. 2005. Sorghum genome sequencing by methylation filtration. PLoS Biol. 3: e13.
Bennetzen, J.L. 2000. Comparative sequence analysis of plant nuclear genomes: Microcolinearity and its many exceptions. Plant Cell 12: 1021–1029.
Berezikov, E., Cuppen, E., and Plasterk, R.H. 2006. Approaches to microRNA discovery. Nat. Genet. 38 (Suppl. 1): S2–S7.
Brendel, V., Xing, L., and Zhu, W. 2004. Gene structure prediction from consensus spliced alignment of multiple ESTs matching the same genomic locus. Bioinformatics 20: 1157–1169.
Chan, A.P., Pertea, G., Cheung, F., Lee, D., Zheng, L., Whitelaw, C., Pontaroli, A.C., SanMiguel, P., Yuan, Y., and Bennetzen, J., et al. 2006. The TIGR Maize Database. Nucleic Acids Res. 34: D771–D776.
Childs, K.L., Hamilton, J., Zhu, W., Ly, E., Cheung, F., Hank, W., Rabinowicz, P.D., Town, C.D., Buell, C.R., and Chan, A.P. 2007. The TIGR Plant Transcript Assemblies Database. Nucleic Acids Res. 35: D846–D851 (Database issue).
Dennis, P.P. and Omer, A. 2005. Small non-coding RNAs in Archaea. Curr. Opin. Microbiol. 8: 685–694.
Eddy, S.R. 2001. Non-coding RNA genes and the modern RNA world. Nat. Rev. Genet. 2: 919–929.
Foissac, S., Bardou, P., Moisan, A., Cros, M.J., and Schiex, T. 2003. EUGENE'HOM: A generic similarity-based gene finder using multiple homologous sequences. Nucleic Acids Res. 31: 3742–3745.
Gale, M.D. and Devos, K.M. 1998. Comparative genetics in the grasses. Proc. Natl. Acad. Sci. 95: 1971–1974.
Goff, S.A., Ricke, D., Lan, T.H., Presting, G., Wang, R., Dunn, M., Glazebrook, J., Sessions, A., Oeller, P., and Varma, H., et al. 2002. A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296: 92–100.
Gross, S.S. and Brent, M.R. 2006. Using multiple alignments to improve gene prediction. J. Comput. Biol. 13: 379–393.
Haas, B.J., Delcher, A.L., Mount, S.M., Wortman, J.R., Smith Jr., R.K., Hannick, L.I., Maiti, R., Ronning, C.M., Rusch, D.B., and Town, C.D., et al. 2003. Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res. 31: 5654–5666.
International Rice Genome Sequencing Project 2005. The map-based sequence of the rice genome. Nature 436: 793–800.
Ito, Y., Arikawa, K., Antonio, B.A., Ohta, I., Naito, S., Mukai, Y., Shimano, A., Masukawa, M., Shibata, M., and Yamamoto, M., et al. 2005. Rice Annotation Database (RAD): A contig-oriented database for map-based rice genomics. Nucleic Acids Res. 33: D651–D655.
Jiang, N., Bao, Z., Zhang, X., Eddy, S.R., and Wessler, S.R. 2004. Pack-MULE transposable elements mediate gene evolution in plants. Nature 431: 569–573.
Jones-Rhoades, M.W., Bartel, D.P., and Bartel, B. 2006. MicroRNAs and their regulatory roles in plants. Annu. Rev. Plant Biol. 57: 19–53.
Juretic, N., Hoen, D.R., Huynh, M.L., Harrison, P.M., and Bureau, T.E. 2005. The evolutionary fate of MULE-mediated duplications of host gene fragments in rice. Genome Res. 15: 1292–1297.
Kikuchi, S., Satoh, K., Nagata, T., Kawagashira, N., Doi, K., Kishimoto, N., Yazaki, J., Ishikawa, M., Yamada, H., and Ooka, H., et al. 2003. Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice. Science 301: 376–379.
Korf, I., Flicek, P., Duan, D., and Brent, M.R. 2001. Integrating genomic homology into gene structure prediction. Bioinformatics 17 (Suppl. 1): S140–S148.
Lin, H., Zhu, W., Silva, J.C., Gu, X., and Buell, C.R. 2006. Intron gain and loss in segmentally duplicated genes in rice. Genome Biol. 7: R41.
Liu, C., Bai, B., Skogerbo, G., Cai, L., Deng, W., Zhang, Y., Bu, D., Zhao, Y., and Chen, R. 2005. NONCODE: An integrated knowledge database of non-coding RNAs. Nucleic Acids Res. 33: D112–D115.
Mathe, C., Sagot, M.F., Schiex, T., and Rouze, P. 2002. Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res. 30: 4103–4117.
Ohyanagi, H., Tanaka, T., Sakai, H., Shigemoto, Y., Yamaguchi, K., Habara, T., Fujii, Y., Antonio, B.A., Nagamura, Y., and Imanishi, T., et al. 2006. The Rice Annotation Project Database (RAP-DB): Hub for Oryza sativa ssp. japonica genome information. Nucleic Acids Res. 34: D741–D744.
Ouyang, S. and Buell, C.R. 2004. The TIGR Plant Repeat Databases: A collective resource for the identification of repetitive sequences in plants. Nucleic Acids Res. 32: D360–D363.
Palmer, L.E., Rabinowicz, P.D., O'Shaughnessy, A.L., Balija, V.S., Nascimento, L.U., Dike, S., de la Bastide, M., Martienssen, R.A., and McCombie, W.R. 2003. Maize genome sequencing by methylation filtration. Science 302: 2115–2117.
Paterson, A.H., Bowers, J.E., and Chapman, B.A. 2004. Ancient polyploidization predating divergence of the cereals, and its consequences for comparative genomics. Proc. Natl. Acad. Sci. 101: 9903–9908.
Peterson, D.G., Schulze, S.R., Sciara, E.B., Lee, S.A., Bowers, J.E., Nagel, A., Jiang, N., Tibbitts, D.C., Wessler, S.R., and Paterson, A.H. 2002. Integration of Cot analysis, DNA cloning, and high-throughput sequencing facilitates genome characterization and gene discovery. Genome Res. 12: 795–807.
Rabinowicz, P.D. and Bennetzen, J.L. 2006. The maize genome as a model for efficient sequence analysis of large plant genomes. Curr. Opin. Plant Biol. 9: 149–156.
Rabinowicz, P.D., Schutz, K., Dedhia, N., Yordan, C., Parnell, L.D., Stein, L., McCombie, W.R., and Martienssen, R.A. 1999. Differential methylation of genes and retrotransposons facilitates shotgun sequencing of the maize genome. Nat. Genet. 23: 305–308.
Reinhart, B.J., Weinstein, E.G., Rhoades, M.W., Bartel, B., and Bartel, D.P. 2002. MicroRNAs in plants. Genes & Dev. 16: 1616–1626.
The Rice Chromosome 3 Sequencing Consortium 2005. Sequence, annotation, and analysis of synteny between rice chromosome 3 and diverged grass species. Genome Res. 15: 1284–1291.
Sakata, K., Nagamura, Y., Numa, H., Antonio, B.A., Nagasaki, H., Idonuma, A., Watanabe, W., Shimizu, Y., Horiuchi, I., and Matsumoto, T., et al. 2002. RiceGAAS: An automated annotation system and database for rice genome sequence. Nucleic Acids Res. 30: 98–102.
Salamov, A.A. and Solovyev, V.V. 2000. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10: 516–522.
Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D., and Miller, W. 2003. Human-mouse alignments with BLASTZ. Genome Res. 13: 103–107.
Sorek, R., Shamir, R., and Ast, G. 2004. How prevalent is functional alternative splicing in the human genome? Trends Genet. 20: 68–71.
Sorrells, M.E., La Rota, M., Bermudez-Kandianis, C.E., Greene, R.A., Kantety, R., Munkvold, J.D., Miftahudin, Mahmoud, A., Ma, X., and Gustafson, P.J., et al. 2003. Comparative DNA sequence analysis of wheat and rice genomes. Genome Res. 13: 1818–1827.
Stanke, M., Schoffmann, O., Morgenstern, B., and Waack, S. 2006. Gene prediction in eukaryotes with a generalized hidden Markov model that uses hints from external sources. BMC Bioinformatics 7: 62.
Sunkar, R. and Zhu, J.K. 2004. Novel and stress-regulated microRNAs and other small RNAs from Arabidopsis. Plant Cell 16: 2001–2019.
Sunkar, R., Girke, T., Jain, P.K., and Zhu, J.K. 2005. Cloning and characterization of microRNAs from rice. Plant Cell 17: 1397–1411.
Tuskan, G.A., Difazio, S., Jansson, S., Bohlmann, J., Grigoriev, I., Hellsten, U., Putnam, N., Ralph, S., Rombauts, S., and Salamov, A., et al. 2006. The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science 313: 1596–1604.
Ureta-Vidal, A., Ettwiller, L., and Birney, E. 2003. Comparative genomics: Genome-wide analysis in metazoan eukaryotes. Nat. Rev. Genet. 4: 251–262.
Usuka, J., Zhu, W., and Brendel, V. 2000. Optimal spliced alignment of homologous cDNA to a genomic DNA template. Bioinformatics 16: 203–211.
Wang, B.B. and Brendel, V. 2006. Genomewide comparative analysis of alternative splicing in plants. Proc. Natl. Acad. Sci. 103: 7175–7180.
Wang, X., Shi, X., Hao, B., Ge, S., and Luo, J. 2005a. Duplication and DNA segmental loss in the rice genome: Implications for diploidization. New Phytol. 165: 937–946.
Wang, X., Zhang, J., Li, F., Gu, J., He, T., Zhang, X., and Li, Y. 2005b. MicroRNA identification based on sequence and structure alignment. Bioinformatics 21: 3610–3614.
Ware, D. and Stein, L. 2003. Comparison of genes among cereals. Curr. Opin. Plant Biol. 6: 121–127.
Whitelaw, C.A., Barbazuk, W.B., Pertea, G., Chan, A.P., Cheung, F., Lee, Y., Zheng, L., van Heeringen, S., Karamycheva, S., and Bennetzen, J.L., et al. 2003. Enrichment of gene-coding sequences in maize by genome filtration. Science 302: 2118–2120.
Xie, K., Zhang, J., Xiang, Y., Feng, Q., Han, B., Chu, Z., Wang, S., Zhang, Q., and Xiong, L. 2005. Isolation and annotation of 10828 putative full length cDNAs from indica rice. Sci. China C Life Sci. 48: 445–451.
Yu, J., Hu, S., Wang, J., Wong, G.K., Li, S., Liu, B., Deng, Y., Dai, L., Zhou, Y., and Zhang, X., et al. 2002. A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296: 79–92.
Yu, J., Wang, J., Lin, W., Li, S., Li, H., Zhou, J., Ni, P., Dong, W., Hu, S., and Zeng, C., et al. 2005. The Genomes of Oryza sativa: A history of duplications. PLoS Biol. 3: e38.
Yuan, Y., SanMiguel, P.J., and Bennetzen, J.L. 2003. High-Cot sequence analysis of the maize genome. Plant J. 34: 249–255.
Yuan, Q., Ouyang, S., Wang, A., Zhu, W., Maiti, R., Lin, H., Hamilton, J., Haas, B., Sultana, R., and Cheung, F., et al. 2005. The Institute for Genomic Research Osa1 rice genome annotation database. Plant Physiol. 138: 18–26.
Zhang, B., Pan, X., Cannon, C.H., Cobb, G.P., and Anderson, T.A. 2006. Conservation and divergence of plant microRNA genes. Plant J. 46: 243–259.
Zhao, W., Wang, J., He, X., Huang, X., Jiao, Y., Dai, M., Wei, S., Fu, J., Chen, Y., and Ren, X., et al. 2004. BGI-RIS: An integrated information resource and comparative analysis workbench for rice genomics. Nucleic Acids Res. 32: D377–D382.
Received August 23, 2006; accepted in revised form December 20, 2006.