Published online before print
June 12, 2001, 10.1101/gr.GR-1617R
Vol. 11, Issue 7, 1167-1174, July 2001
REPORTS
ABSTRACT
The nucleotide sequence was determined for a 340-kb segment of rice chromosome 2, revealing 56 putative protein-coding genes. This represents a density of one gene per 6.1 kb, which is higher than was reported for a previously sequenced segment of the rice genome. Sixteen of the putative genes were supported by matches to ESTs. The predicted products of 29 of the putative genes showed similarity to known proteins, and a further 17 genes showed similarity only to predicted or hypothetical proteins identified in genome sequence data. The region contains a few transposable elements: one retrotransposon, and one transposon. The segment of the rice genome studied had previously been identified as representing a part of rice chromosome 2 that may be homologous to a segment of Arabidopsis chromosome 4. We confirmed the conservation of gene content and order between the two genome segments. In addition, we identified a further four segments of the Arabidopsis genome that contain conserved gene content and order. In total, 22 of the 56 genes identified in the rice genome segment were represented in this set of Arabidopsis genome segments, with at least five genes present, in conserved order, in each segment. These data are consistent with the hypothesis that the Arabidopsis genome has undergone multiple duplication events. Our results demonstrate that conservation of the genome microstructure can be identified even between monocot and dicot species. However, the frequent occurrence of duplication, and subsequent microstructure divergence, within plant genomes may necessitate the integration of subsets of genes present in multiple redundant segments to deduce evolutionary relationships and identify orthologous genes.
INTRODUCTION
Rice (Oryza sativa) is a widely grown crop, and is the staple food for over one-half of the world's population. Extensive classical and molecular genetic maps have been constructed to assist biological analyses and plant breeding applications (Kinoshita 1995; Kurata et al. 1994). The genome size of rice, ~440 Mb (Arumuganathan and Earle 1991), is one of the smallest of the cereals. It has been postulated that the genes within the genome of rice, as with the genes of other Gramineae, are clustered in gene-rich regions separated by gene-poor DNA (Barakat et al. 1997).
A high degree of conservation of the order of gene-specific markers (conserved synteny) has been observed between the genomes of most cereals, including rice (Moore et al. 1995). The sequences of exons and exon-intron structures of orthologous genes in the sh2/a1-homologous regions of rice and sorghum have been shown to be conserved (Chen et al. 1998). However, more divergence of gene content has been found in the Adh1 regions of the genomes of maize and sorghum (Tikhonov et al. 1999). Nevertheless, rice is being developed as the key model monocot species for molecular genetic investigations, with the expectation that, by exploiting conserved synteny, the identification and functional assignment of genes in rice will lead to the identification of the equivalent genes in other cereal species. These applications are being supported by the Rice Genome Project, which commenced in 1991. The main aim of this project is the determination of the complete nucleotide sequence of the rice genome.
A 340-kb segment spanning the rice Adh1-Adh2 region has been sequenced (Tarchini et al. 2000), and is predicted to contain 33 protein-coding genes. Fourteen of these predicted genes were supported by the identification of corresponding transcripts, and 15 genes were similar in structure to genes with known functions. Nineteen of the 33 genes were members of gene families within the sequenced region, although some copies were predicted to be nonfunctional pseudogenes.
The key model dicot plant species is Arabidopsis thaliana (Arabidopsis). Extensive classical genetic, molecular genetic, and physical maps have been developed, along with numerous genome analysis and gene cloning strategies (Koornneef 1990; Lister and Dean 1993; Feldmann et al. 1989; Giraudat et al. 1992; Bancroft et al. 1993; Schmidt et al. 1995; http://nasc.nott.ac.uk/new_ri_map.html). The Arabidopsis genome has been completely sequenced (The Arabidopsis Genome Initiative 2000). It is very gene-rich, containing 25,498 genes, with an average density of one gene per 4.5 kb. Conservation of gene order has been observed between segments of the genome of Arabidopsis and those of its closest relatives among crops, the cultivated Brassica species (Kowalski et al. 1994; Cavell et al. 1998; Lagercrantz 1998).
It had been predicted that conserved synteny between the genomes of Arabidopsis and cereals, which diverged ca. 200 million years ago (Wolfe et al. 1989), would be detectable for segments of ~3 cM (Paterson et al. 1996). Such conservation could lead to the use of positional approaches to integrate functional genomics information from both monocot and dicot species. However, the results of comparative genetic mapping efforts have provided little evidence for conserved gene organization (Gale and Devos 1998). Although some conserved synteny can be detected between the genomes of rice and Arabidopsis using physical mapping and sequence analysis approaches, the extent of conservation appears low (Devos et al. 1999; Han et al. 1999; van Dodeweerd et al. 1999). In the present study we report the results of a pilot-scale rice genome sequencing project and the use of the data to further study aspects of genome organization in Arabidopsis.
RESULTS
Gene Prediction
Four overlapping BACs representing the 340-kb region to be sequenced had been identified previously (van Dodeweerd et al. 1999). A shotgun sequencing strategy was used and annotation performed on a 339,972-bp contiguous assembly as submitted to EMBL (accession no. AJ307662). Four gene prediction programs were used for modeling exon structure: Genemark.hmm, FGENESH, Genscan, and GeneFinder. Comparisons of the outputs from these programs with gene structures determined using EST matches and protein homologies for three genes are shown in Figure 1. Although all programs correctly predicted the presence of a gene, none of the predictions accurately identified the exon-intron structures of the genes.
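The distinction drawn above, between detecting that a gene is present and reproducing its exon-intron structure, can be illustrated with a minimal Python sketch. It is not the comparison procedure used in this study; the function name `exon_agreement` and all coordinates are hypothetical and are not taken from Figure 1.

```python
from typing import Dict, List, Tuple

Exon = Tuple[int, int]  # (start, end), inclusive coordinates on the same strand


def exon_agreement(predicted: List[Exon], reference: List[Exon]) -> Dict[str, object]:
    """Compare predicted exons with a reference structure (e.g. one derived from ESTs)."""
    pred, ref = set(predicted), set(reference)
    exact = pred & ref  # exons whose two boundaries both match exactly
    return {
        # A gene counts as "found" if any predicted exon overlaps any reference exon.
        "gene_found": any(p[0] <= r[1] and r[0] <= p[1] for p in pred for r in ref),
        "exact_exons": len(exact),
        # Fraction of reference exons recovered exactly.
        "sensitivity": len(exact) / len(ref) if ref else 0.0,
        # Fraction of predicted exons that are exactly right.
        "specificity": len(exact) / len(pred) if pred else 0.0,
        # True only if the whole exon-intron structure is reproduced.
        "structure_correct": sorted(pred) == sorted(ref),
    }


# Hypothetical coordinates, for illustration only.
reference = [(100, 250), (400, 520), (700, 900)]
predicted = [(100, 250), (410, 520), (700, 900), (1000, 1100)]
result = exon_agreement(predicted, reference)
```

In this invented case the gene is found and two of three reference exons are exact, yet the structure as a whole is wrong, which is the kind of outcome described for all four programs.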
Gene prediction in rice is complicated by the fact that a rice species setting is available only for Genemark.hmm. Nevertheless, our data suggest that even Genemark.hmm output is not reliable enough to perform an in silico whole-genome analysis in rice. Further adjustment and refinement of gene prediction programs is necessary for large-scale automated genome analysis. Similarities of genomic sequences with EST sequences, matches of predicted protein products with known proteins, and matches with transposable elements were also used to derive the final gene modeling, as shown in Figure 2. In total, 56 potential protein-coding genes were identified, along with a region containing a retrotransposon and a region showing homology to transposon Tnr1. Two tRNAs were also identified. A summary of the positions of the identified genes and other features is presented in Table 1.
Identification of Homologous ESTs and Proteins
The rice genomic nucleotide sequence was used to query a rice EST database to identify ESTs corresponding to modeled genes. A threshold of at least 90% sequence identity over at least 150 bp was applied. Each predicted gene was used to query all available nucleotide and protein databases for homologous identified or predicted protein sequences. The results of both analyses are summarized in Table 1. Sixteen of the 56 modeled genes match ESTs, supporting the prediction of the presence of a gene. The predicted proteins of 29 putative genes (52%) match known proteins, and the predicted proteins of 17 putative genes (30%) match proteins predicted from genome sequence data. The predicted proteins of the 10 remaining putative genes (18%) show no similarity to known proteins, so may either represent new types of proteins or be the result of false gene predictions.
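The EST-support criterion stated above (at least 90% identity over at least 150 bp) amounts to a simple filter on alignment records. The following is a minimal sketch under assumed inputs: the tuple layout, the function name `supported_genes`, and every gene and EST identifier are hypothetical and do not reproduce the database search actually performed.

```python
from collections import defaultdict

MIN_IDENTITY = 90.0  # percent identity threshold stated in the text
MIN_LENGTH = 150     # minimum aligned length in bp stated in the text


def supported_genes(hits):
    """Return {gene_id: [est_id, ...]} for hits that pass both thresholds.

    hits: iterable of (gene_id, est_id, percent_identity, aligned_length).
    """
    support = defaultdict(list)
    for gene_id, est_id, identity, length in hits:
        if identity >= MIN_IDENTITY and length >= MIN_LENGTH:
            support[gene_id].append(est_id)
    return dict(support)


# Invented records, for illustration only.
hits = [
    ("geneA", "est_1", 98.2, 310),
    ("geneA", "est_2", 91.0, 120),  # rejected: alignment too short
    ("geneB", "est_3", 87.5, 400),  # rejected: identity too low
    ("geneC", "est_4", 90.0, 150),  # accepted: exactly at both thresholds
]
support = supported_genes(hits)
supported_fraction = len(support) / 3  # 3 modeled genes in this toy example
```

Applied to the real data, the same ratio is 16 supported genes out of 56 modeled genes, that is, about 29%.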
The predicted proteins for all 56 putative genes were analyzed for the presence of functional domains using InterPro (The InterPro Consortium 2000). The results are shown in Table 2. Characterized functional domains of protein products were identified for 33 of the 56 putative genes (59%). These allowed us to identify a four-member gene family (C635, W700, W940, and C1190) encoding protein products with AP2 domain/ethylene responsive element binding protein functional domains, which was the largest gene family we identified in the sequenced region.
Analysis of Genome Organization in Rice and Arabidopsis
The extent of conservation of both the presence and the position of genes in the sequenced segment of the rice genome and the corresponding segments of the genome of Arabidopsis was analyzed. BLASTP analyses were performed using the extracted amino acid sequences of the annotated rice genes to query a database of all predicted Arabidopsis protein sequences, using a P-value of e-5 as a cutoff. The results were then filtered to remove adjacent matches (indicative of tandem duplications), and clusters of three or more nearby matches were recorded. The results for this analysis of the 340-kb region are summarized in Table 3. The relative coding strand for each gene model is denoted by W or C. Five segments of the Arabidopsis genome contained conserved subsets of the rice genes, as shown in Figure 3. These segments represented regions of the Arabidopsis genome containing approximately 22, 27, 20, 15, and 23 genes, for the chromosome 4(a), 5, 2, 4(b), and 3 segments, respectively, shown in Figure 3. Overall, 22 of the 56 rice genes are represented in the five Arabidopsis chromosome segments, counting both copies of three pairs of related genes (W495/W505, C635/W700, and W940/C1190) that show homology to common Arabidopsis genes. The most highly conserved segment, chromosome 4(a), contains eight conserved genes, with one pair (W950-AT4g17340 and W1050-AT4g17350) reversed. The relative coding strand orientation of the genes is also conserved, except for the reversal of the final pair, which is consistent with the inversion of the segment containing the genes. This region of the Arabidopsis genome had been shown previously to be related to the sequenced segment of the rice genome (van Dodeweerd et al. 1999).
The Arabidopsis chromosome 5 segment contains seven conserved genes. These are also in conserved order and orientation, except for the same reversed pair of genes. This region of the Arabidopsis genome had been shown previously to be related to the chromosome 4(a) segment (Bancroft 2000). The remaining segments contain seven, five, and five conserved genes for the chromosome 2, 4(b), and 3 segments, respectively, all in conserved order. However, the orientation of several of the individual genes is reversed, indicating possible small-scale inversion events.
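The filtering and clustering steps used to find these segments (apply a P-value cutoff, discard adjacent matches as tandem duplications, record clusters of three or more nearby matches) can be sketched in Python as follows. This is an illustrative reconstruction, not the procedure as implemented in this study: the cutoff is taken to be 1e-5, "adjacent" is taken to mean neighbouring positions in the Arabidopsis gene order, and the `MAX_GAP` window defining "nearby" is purely an assumption, since no distance is stated in the text. All identifiers in the example are hypothetical.

```python
from collections import defaultdict

P_CUTOFF = 1e-5   # assumed reading of the cutoff given in the text
MAX_GAP = 10      # assumption: largest gap (in gene positions) between "nearby" matches
MIN_CLUSTER = 3   # clusters of three or more matches are recorded


def find_clusters(hits):
    """hits: iterable of (rice_gene, chromosome, gene_index, p_value).

    gene_index is the position of the matched gene along the Arabidopsis chromosome.
    Returns a list of (chromosome, [(gene_index, rice_gene), ...]).
    """
    by_chrom = defaultdict(list)
    for rice, chrom, idx, p in hits:
        if p <= P_CUTOFF:
            by_chrom[chrom].append((idx, rice))

    clusters = []
    for chrom, matches in by_chrom.items():
        matches.sort()
        # Collapse runs of neighbouring matches to the same rice gene,
        # treating them as tandem duplications and keeping the first copy.
        kept = []
        prev_idx, prev_rice = None, None
        for idx, rice in matches:
            tandem = rice == prev_rice and idx - prev_idx <= 1
            prev_idx, prev_rice = idx, rice
            if not tandem:
                kept.append((idx, rice))
        # Group the remaining matches into runs with no gap larger than MAX_GAP.
        current = []
        for idx, rice in kept:
            if current and idx - current[-1][0] > MAX_GAP:
                if len({r for _, r in current}) >= MIN_CLUSTER:
                    clusters.append((chrom, current))
                current = []
            current.append((idx, rice))
        if len({r for _, r in current}) >= MIN_CLUSTER:
            clusters.append((chrom, current))
    return clusters


# Invented hits, for illustration only.
hits = [
    ("r1", "chr4", 100, 1e-30),
    ("r1", "chr4", 101, 1e-28),  # neighbouring copy, dropped as a tandem duplicate
    ("r2", "chr4", 104, 1e-12),
    ("r3", "chr4", 109, 1e-8),
    ("r4", "chr4", 300, 1e-20),  # too far from the other matches to join the cluster
    ("r5", "chr2", 50, 1e-3),    # fails the P-value cutoff
]
clusters = find_clusters(hits)
```

A cluster found this way says nothing yet about gene order or strand; those would need to be compared separately, as was done for the segments in Figure 3.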
Analysis of Additional Regions of the Rice Genome
To assess the generality of our findings of gene density in the rice genome and the conservation of microstructure with the genome of Arabidopsis, we selected for analysis two further BACs that had been sequenced and submitted to public databases. One of these, P0436E04 (accession no. ap002818), was selected as the sequenced clone nearest to a telomere (map position 0.3 cM on chromosome 1). The other, P0406H10 (accession no. ap002524), was near the middle of a chromosome arm (20.2 cM on chromosome 1). We implemented our annotation protocols using these data and compared the putative genes derived with those accompanying the database submission. For P0436E04, 26 genes and three transposons were identified in 145 kb of sequence, compared with 24 genes and two transposons recorded with the submission. For P0406H10, 25 genes and one transposon were identified in 156 kb of sequence, compared with 26 genes and one transposon recorded with the submission. The densities of putative genes identified, one per 5.6 kb and one per 6.2 kb for P0436E04 and P0406H10, respectively, are very similar to the density found in the 340-kb region analyzed (one per 6.1 kb). Although the annotation accompanying database submissions of rice genome sequence suggested significantly different gene structures to those predicted by our protocols, the overall gene density predicted is very similar. The gene densities of these clones are typical of those accompanying the rice BAC sequences presently in the public databases. These results suggest that a typical gene density for the rice genome is around one gene per 6 kb.
Searches were conducted for segments of the Arabidopsis genome that contain conserved gene content and order for each of BAC clones P0436E04 and P0406H10. The same methods and recording criteria were used. The results are summarized in Table 4 and Figure 4 for P0406H10, and Table 5 and Figure 5 for P0436E04. In both cases multiple conserved segments were identified. Only three or four conserved genes were identified in each segment; there was one reversal of gene order (W1600-AT5g07380 and W3350-AT5g07690), and the strand orientation of several of the genes was not conserved (i.e., C3852-AT1g80360, C2000-AT4g32610, W399-AT5g63880, C3900-AT5g07080). However, these results indicate that it may be feasible to align much of the rice genome with duplicated segments of the genome of Arabidopsis.
DISCUSSION
Using a combination of approaches, 56 genes were predicted in the 340 kb of rice genome sequence data we generated and analyzed, indicating a density of one gene per 6.1 kb. This density is close to that found for the genome of Arabidopsis (one gene per 4.76 kb; Bancroft 2000), but higher than that found near the ADH1 locus of rice (one gene per 10.3 kb; Tarchini et al. 2000). Extrapolation to the 440-Mb genome of rice, using the gene densities of one per 6.1 kb or 10.3 kb, would predict the presence of ~72,000 or ~43,000 genes in the rice genome, respectively.
However, there is evidence of gene-rich and gene-poor isochores in the rice genome based on bulk sequence composition (Barakat et al. 1997), and both regions analyzed are likely to be characteristic of the gene-rich regions. If we estimate that the rice ESTs presently in dbEST represent ~10,000 nonredundant genes, our observation that 16 of the 56 predicted genes identified (29%) have EST matches leads us to predict a total gene number for rice of ~35,000. This would be consistent with a model in which the majority of the rice genes are contained in gene-rich regions comprising 50% of the genomic DNA of rice (220 Mb), with these gene-rich regions typically containing a gene density of one per ~6 kb, as we have observed.
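The arithmetic behind these gene-number estimates is easy to reproduce. The short Python sketch below uses only figures quoted in the text; the function and variable names are ours, and the calculation assumes uniform density within whatever portion of the genome is considered.

```python
GENOME_KB = 440_000  # ~440-Mb rice genome, expressed in kb


def genes_from_density(kb_per_gene, genome_kb=GENOME_KB):
    """Extrapolate a gene count from a uniform gene density."""
    return genome_kb / kb_per_gene


def genes_from_est_fraction(nonredundant_est_genes, matched, predicted):
    """Scale the EST-represented gene count by the fraction of genes with EST matches."""
    return nonredundant_est_genes * predicted / matched


density_this_study = 340 / 56                        # kb per gene in the 340-kb segment
whole_genome_6_1 = genes_from_density(6.1)           # density found in this study
whole_genome_10_3 = genes_from_density(10.3)         # density near the ADH1 locus
est_based = genes_from_est_fraction(10_000, 16, 56)  # 16 of 56 genes have EST matches
gene_rich_half = genes_from_density(6.0, genome_kb=220_000)  # 50% of genome at one per ~6 kb

print(f"{density_this_study:.1f} kb per gene")
print(f"{whole_genome_6_1:,.0f} / {whole_genome_10_3:,.0f} genes by extrapolation")
print(f"{est_based:,.0f} genes from EST representation")
print(f"{gene_rich_half:,.0f} genes if half the genome is gene-rich")
```

The EST-based figure and the gene-rich-half figure fall close together (about 35,000 and about 37,000), which is the agreement the model above relies on.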
The composition of the region we have analyzed differs significantly from that near the ADH1 locus (Tarchini et al. 2000). In addition to more predicted genes in a region of almost identical size (56, compared with 33), we identified fewer transposons and retrotransposons (2, compared with 15). There are smaller gene/pseudogene families; for example, the largest gene family we identified contained four members, compared to 13 members. The region around the ADH1 locus contains several genes with homology to genes known to be involved in plant disease resistance. It has a complex structure, including a large family of genes, some of which do not encode a full and functional protein, and several retrotransposons. This resembles the structure of the Arabidopsis ecotype Columbia allele of the RPP5 disease-resistance locus identified on chromosome 4 (Bevan et al. 1998). However, this is an unusual genome organization, and is not representative of the genome structure as a whole (Lin et al. 1999; Mayer et al. 1999).
Sixteen of the 56 modeled genes (29%) match EST sequences, supporting the predicted gene models. Further support for the authenticity of our predicted genes comes from the highly significant homology that the predicted products of many of them show to known or predicted proteins in other species. Forty-six of the 56 predicted genes (82%) show such homology. These data suggest that the majority of our gene models correctly indicate the presence of a gene. They also suggest that the EST representation in rice may be relatively low, which in turn might indicate that many of the genes of rice are expressed at low levels generally, only in specific cells, or only in response to specific conditions. The 10 gene models for which no homology has been identified may be false gene predictions or genes unique to rice.
The framework of conserved genes preserved between segments of the genomes of Arabidopsis and rice suggests that mechanisms of genome evolution have been operating to delete, rearrange, and disperse single or small groups of genes, resulting in extensive genome reshuffling during plant evolution. This is inconsistent with the suggestion that plant genome organization might have evolved primarily by gross rearrangements, permitting the construction of unified genetic maps (Paterson et al. 1996). Mechanisms that might achieve the observed divergence of genome fine structure may involve mobile genetic elements, as has been found to contribute to "exon shuffling" in mammalian systems (Boeke and Pickeral 1999). It is also likely that unequal crossing over contributes to both tandem duplications of genes, and deletion of single or small groups of genes (Bancroft 2001).
Many duplicated regions have been identified within the genome of Arabidopsis (Lin et al. 1999; Mayer et al. 1999; Bancroft 2000). It has been suggested that these may have been the result of an ancestral tetraploidy event (Blanc et al. 2000; The Arabidopsis Genome Initiative 2000), or multiple duplication events (Vision et al. 2000). Our data support the hypothesis that there have been multiple duplication events during the evolution of the genome of Arabidopsis. These duplicated segments appear to have diverged extensively by the loss of different subsets of interspersed genes. The relationships of such highly diverged duplicated segments are revealed most clearly by comparative sequence analysis with relatively distantly related species, such as tomato (Ku et al. 2000) or rice. By integrating the data from multiple duplicated segments of the Arabidopsis genome we have been able to align segments of the rice and Arabidopsis genomes and deduce the ancestral relationships of sets of genes. It is not known whether the 340-kb rice genome segment studied is also the product of genome duplication events during the ancestry of rice. When the rice genome sequence data become available, it should be possible to analyze complex relationships within the rice genome by extensive analysis using the Arabidopsis genome sequence. By taking due account of the mechanisms of the evolution of plant genome structure, it may be possible to make extensive use of comparative genome analysis to integrate structural and functional genomics of dicot and monocot species.
METHODS
Sequencing of BAC Clones
Individual BAC clones were sequenced by standard methods using a shotgun approach (Bodenteich et al. 1993). Cesium chloride-purified BAC DNA was sheared by nebulization (Roe et al. 1996). After end-filling, DNA fragments were size fractionated and cloned into the SmaI site of pUC18 or HincII site of pUC19 (Amersham Pharmacia Biotech). Clones were sequenced using the ABI PRISM Dye Terminator Cycle Sequencing Ready Reaction Kit with FS AmpliTaq DNA polymerase (PE Applied Biosystems) and analyzed on ABI 377 (PE Applied Biosystems) sequencing gels. The sequence data were assembled using PHRED/PHRAP software (Green 1996).
Analysis of Sequence Data
The sequence was subjected to a modified analysis procedure based on that established for genome analysis of Arabidopsis thaliana (Mayer et al. 1999). BLAST (Altschul et al. 1997) analysis of the sequence against the EMBL nucleotide database and MIPS in-house databases (a nonredundant protein database, a plant transposon database, a rice EST database, and an all-plant EST database) was performed. Gene predictions were performed using Genscan (Burge and Karlin 1997), GeneFinder (P. Green and L. Hillier, unpubl. software), FGENESH (A.A. Salamov and V.V. Solovyev, unpubl. software; http://genomic.sanger.ac.uk/gf/gf.shtml), and Genemark.hmm (Lukashin and Borodovsky 1998). An Oryza sativa setting is available only for Genemark.hmm. For GeneFinder as well as Genscan the Arabidopsis setting was used. The Zea mays setting available for Genscan yielded less reliable results. Splice-site predictions using Netplantgene2 (Tolstrup et al. 1997) (Arabidopsis setting) gave unreliable results, and were not used for gene modeling.
Gene modeling was performed by combining intrinsic data (gene predictions) with extrinsic data (database matches). Gene models were adjusted to fit EST data from rice and other plants as well as to homologous protein matches where available. For genes not supported by any database matches the FGENESH prediction was generally used.
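The precedence between extrinsic and intrinsic evidence described above can be summarized in a deliberately simplified Python sketch. It is only a schematic: in practice models were adjusted to fit the evidence rather than replaced wholesale, the ranking of EST evidence above protein evidence is our assumption (the text does not rank them), and the function name and data layout are hypothetical.

```python
def choose_gene_model(predictions, est_model=None, protein_model=None):
    """Pick the evidence source for one locus and return (source, exons).

    predictions: dict mapping program name -> list of (start, end) exons.
    est_model / protein_model: exon lists derived from database matches, if any.
    """
    if est_model is not None:      # assumption: transcript evidence is consulted first
        return "EST", est_model
    if protein_model is not None:  # then homologous protein matches
        return "protein", protein_model
    # No database support: fall back on the FGENESH prediction, as was generally done.
    return "FGENESH", predictions["FGENESH"]


# Invented locus, for illustration only.
predictions = {
    "FGENESH": [(100, 250), (400, 520)],
    "Genscan": [(100, 260), (400, 520)],
}
unsupported = choose_gene_model(predictions)
with_est = choose_gene_model(predictions, est_model=[(100, 250), (410, 520)])
```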
Protein domain characterization was performed using the InterPro software (The InterPro Consortium 2000), and similarity analysis of extracted proteins was performed by BLASTP comparison to a nonredundant protein database.
ACKNOWLEDGMENTS
This work was funded under the BBSRC GAIT Initiative (grant 208/GAT09069) and the EU Arabidopsis Genome Sequencing Project (CT97-0274).
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
11 Present address: Plant Research International, Droevendaalsesteeg 1, 6708 PB, Wageningen, The Netherlands.
12 Corresponding author.
E-MAIL ian.bancroft{at}bbsrc.ac.uk; FAX: 44 1603 259882.
Article published on-line before print: Genome Res., 10.1101/gr.161701.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.161701.
REFERENCES
The InterPro Consortium. 2000. An integrated documentation resource for protein families, domains and functional sites. Bioinformatics 16: 1145-1150.
Received August 23, 2000; accepted in revised form April 3, 2001.