Published online before print
February 12, 2004, 10.1101/gr.1481104 Genome Res. 14:463-471, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Resources
Numerous Novel Annotations of the Human Genome Sequence Supported by a 5'-End-Enriched cDNA Collection
1 Genoscope-Centre National de Séquençage and CNRS UMR-8030, 91000 Evry, France
A collection of 90,000 human cDNA clones, generated to increase the fraction of "full-length" cDNAs available, was analyzed by sequence alignment on the human genome assembly. Five hundred fifty-two gene models not found in LocusLink, with coding regions of at least 300 bp, were defined by using this collection. The exon composition proposed for novel genes showed an average of 4.7 exons per gene. In 20% of the cases, at least half of the exons predicted for new genes coincided with evolutionarily conserved regions defined by sequence comparisons with the pufferfish Tetraodon nigroviridis. Among this subset, CpG islands were observed at the 5' end of 75%. In-frame stop codons upstream of the initiator ATG were present in 49% of the new genes, and 16% contained a coding region comprising at least 50% of the cDNA sequence. This cDNA resource also provided candidate small protein-coding genes, which are usually not included in genome annotations. In addition, analysis of a sample from this cDNA collection indicates that 380 gene models described in LocusLink could be extended at their 5' end by at least one new exon. Finally, this cDNA resource provided experimental support for annotations based exclusively on predictions, and thus substantially improves the human genome annotation.
The draft sequences of the human genome (Lander et al. 2001
High-throughput systematic sequencing of EST libraries has provided a wealth of information on many human gene transcripts. However, these EST sequences are often partial and, hence, insufficient to define the structure of the entire genes and encoded proteins. A number of extensive cDNA programs have been initiated since then to supply sequence, mapping, and expression data on the corresponding genes (Strausberg et al. 1999
Sequencing of full-length transcripts followed by sequence alignment on a genomic reference sequence has recently been used successfully to identify the exon structures of both human and other eukaryotic genes (Haas et al. 2002
The cDNA Collection and Its Analysis
mRNAs from nine human tissues (namely, neuroblastoma, placenta, fetal and adult brain, fetal liver, thymus, T- and B-cell lines, and the HeLa cell line) were used to prepare cDNA libraries enriched for full-length inserts (see Methods). This set of cDNA libraries is hereafter referred to as the CNSLT cDNA resource. A total of 91,813 CNSLT cDNA clones were subjected, for the most part, to single-pass sequencing from both ends, producing >200,000 sequence reads (Table 1). The 5' and 3' sequence reads from each CNSLT cDNA clone were initially aligned on the repeat-masked human genome assembly (NCBI build 30) by using BLAST (see Methods). Ninety-six percent of the CNSLT cDNA clones (Table 2) yielded alignments that were used to define preliminary transcript models as described in Methods. A total of 1397 cDNA clones were considered to be putative chimeras because of discrepant alignments of their 5' and 3' sequences on the genome and were excluded from the analysis. No alignment was observed between 3293 cDNA clones and the reference genomic sequence. The alignment was repeated for these 3293 nonmatching clones on NCBI build 33, which became available during revision of this article. A significant fraction of the clones (1369/3293, 42%) could be matched to the nearly finished human reference genomic sequence (for further details, see Methods). Furthermore, 228 of the clones that still did not match (the remaining 58%) yielded an alignment with the mouse genome assembly.
The exon boundaries and the structure of each of the transcript models were further defined by using the sim4 algorithm (see Methods). Such transcript models, supported by the cDNA clones, were obtained for 97% (84,865 out of 87,123 clones) of the CNSLT cDNA resource.
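The clone counts quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only one reading of the text: it assumes that the 96% figure counts every clone with an alignment on build 30, and that the denominator of 87,123 is the total minus the chimeric and nonmatching clones. The variable names are illustrative and do not come from the original analysis.

```python
# Counts reported in the text for the CNSLT cDNA resource.
TOTAL_CLONES = 91_813
CHIMERAS = 1_397          # discrepant 5'/3' alignments, excluded from analysis
NO_ALIGNMENT = 3_293      # no alignment on NCBI build 30
WITH_MODEL = 84_865       # clones supporting a sim4-refined transcript model
RESCUED_BUILD33 = 1_369   # build-30 nonmatching clones that align on build 33


def percent(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Assumed reading: "aligned" means every clone except those with no alignment.
aligned = TOTAL_CLONES - NO_ALIGNMENT
# Assumed reading: clones kept for modelling exclude chimeras and nonmatching clones.
usable = TOTAL_CLONES - CHIMERAS - NO_ALIGNMENT
still_unmatched = NO_ALIGNMENT - RESCUED_BUILD33

print(f"aligned on build 30: {aligned} ({percent(aligned, TOTAL_CLONES):.1f}%)")
print(f"usable for modelling: {usable}")
print(f"with transcript model: {percent(WITH_MODEL, usable):.1f}% of usable")
print(f"rescued on build 33: {percent(RESCUED_BUILD33, NO_ALIGNMENT):.1f}%")
print(f"still unmatched: {still_unmatched} ({percent(still_unmatched, NO_ALIGNMENT):.1f}%)")
```

Under these assumptions the subtraction reproduces the 87,123 clones used as denominator, and the rounded percentages agree with the 96%, 97%, 42%, and 58% given in the text.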
The LocusLink resource serves as a central source to integrate sequence, standard nomenclature, gene, and protein descriptions (Pruitt et al. 2000
Transcript models derived from the CNSLT cDNA clones and RefSeq or GenBank transcripts were merged by clustering, resulting in a total of 11,124 forward and 10,817 reverse clusters considered as gene models (see Methods). Of these gene models, 4241 genomic regions supported by the CNSLT cDNA resource were not documented by a LocusID (see Methods). A total of 2041 out of these 4241 regions, which corresponded to the CNSLT clones with overlapping 5' and 3' sequence reads (46%), were submitted for further analysis.
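The clustering rule itself is described in Methods and is not reproduced in this section. As a rough illustration of how forward and reverse clusters can arise, the following sketch groups transcript models whose genomic spans overlap on the same chromosome and strand. The overlap criterion, the `TranscriptModel` class, and the coordinates are all assumptions made for the example, not the procedure actually used.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TranscriptModel:
    """Hypothetical container for one aligned transcript model."""
    name: str
    chrom: str
    strand: str   # "+" (forward) or "-" (reverse)
    start: int    # 1-based inclusive genomic start
    end: int      # 1-based inclusive genomic end


def cluster_models(models):
    """Group models whose spans overlap on the same chromosome and strand."""
    by_key = defaultdict(list)
    for m in models:
        by_key[(m.chrom, m.strand)].append(m)

    clusters = []
    for key in sorted(by_key):
        current, current_end = [], None
        # Sweep left to right, extending the open cluster while spans overlap.
        for m in sorted(by_key[key], key=lambda x: (x.start, x.end)):
            if current and m.start <= current_end:
                current.append(m)
                current_end = max(current_end, m.end)
            else:
                if current:
                    clusters.append(current)
                current, current_end = [m], m.end
        if current:
            clusters.append(current)
    return clusters


models = [
    TranscriptModel("A", "chr22", "-", 100, 500),
    TranscriptModel("B", "chr22", "-", 400, 900),
    TranscriptModel("C", "chr22", "+", 450, 800),   # overlaps A and B, opposite strand
    TranscriptModel("D", "chr22", "-", 2000, 2500),
]
cluster_names = [[m.name for m in c] for c in cluster_models(models)]
print(cluster_names)
```

In this toy input, A and B merge into one reverse-strand cluster, whereas C stays separate because it lies on the forward strand. A real pipeline would more likely compare exon coordinates than whole spans, so that a gene nested within an intron of another gene is not merged with it.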
Manual Curation of Gene Models
The search for coding regions was performed on the reconstructed sequence of the genes mapping to regions not annotated in the LocusLink resource (see Methods). In-frame stop codons upstream of the initiator ATG were present in 49% (273/552) of the new genes supported by the CNSLT cDNA resource. This number was consistent with that observed by using cDNA resources containing the 5' ends of the transcripts (Suzuki et al. 2000
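An in-frame stop codon upstream of the initiator ATG indicates that the open reading frame cannot extend further 5', which is why it is used as evidence for a complete 5' end of the coding region. A minimal sketch of such a test is given below, assuming that the cDNA is provided in the sense orientation and that the position of the initiator ATG is already known. The function name and the example sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_upstream_inframe_stop(cdna: str, atg_pos: int) -> bool:
    """Return True if a stop codon lies upstream of, and in frame with,
    the ATG located at 0-based position atg_pos."""
    seq = cdna.upper()
    if seq[atg_pos:atg_pos + 3] != "ATG":
        raise ValueError("atg_pos does not point at an ATG")
    # Step backwards from the ATG one codon at a time, staying in frame.
    for i in range(atg_pos - 3, -1, -3):
        if seq[i:i + 3] in STOP_CODONS:
            return True
    return False


# TAG at position 2 is in frame with the ATG at position 8.
in_frame = has_upstream_inframe_stop("CCTAGGCCATGGCT", 8)
# TAG at position 2 is out of frame with the ATG at position 6.
out_of_frame = has_upstream_inframe_stop("GCTAGCATGGCT", 6)
print(in_frame, out_of_frame)
```

The check says nothing about whether the chosen ATG is the true start codon; it only reports whether the frame is closed upstream of it.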
After this curation procedure, the remaining 1072 proposed gene models (see Methods) were classified into three different categories: (1) 226 known genes, corresponding to genes included in LocusLink in the course of the analysis; (2) 552 novel genes, which have a CDS of at least 300 bp without a LocusID identifier (single-exon genes accounted for 31% of the total number of novel genes); and (3) 294 putative genes, corresponding to gene models with no LocusID identifier and no significant CDS. A summary of the number of novel, putative, and known genes supported by the CNSLT resource in each chromosome is given in Table 3. Novel genes were found even for chromosomes with annotation that has recently been updated (Reymond et al. 2002
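The three categories described above can be summarized as a simple decision rule. The sketch below assumes that "no significant CDS" means a CDS shorter than the 300-bp threshold used for novel genes; the text does not state this explicitly, and the function and argument names are illustrative only.

```python
from typing import Optional

MIN_CDS_BP = 300  # CDS threshold stated in the text for novel genes


def classify_gene_model(locus_id: Optional[int], cds_length_bp: int) -> str:
    """Assign a curated gene model to one of the three categories."""
    if locus_id is not None:
        return "known"      # category 1: included in LocusLink
    if cds_length_bp >= MIN_CDS_BP:
        return "novel"      # category 2: no LocusID, CDS of at least 300 bp
    return "putative"       # category 3: no LocusID, no significant CDS (assumed < 300 bp)


# Category sizes reported in the text.
counts = {"known": 226, "novel": 552, "putative": 294}
print(classify_gene_model(54507, 1200))
print(classify_gene_model(None, 816))
print(classify_gene_model(None, 120))
print(sum(counts.values()))
```

The three reported category sizes add up to the 1072 gene models retained after curation.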
The novel and putative genes, which were supported only by the CNSLT cDNA resource, were finally realigned to the NCBI build 33 in order to ascertain their position on the nearly "finished" version of the human reference genomic sequence (Table 4; for details, see Methods). A total of 548 out of the 552 proposed new gene models yielded alignments on the new release of the reference human genome sequence.
Only four of the novel genes, previously located on chromosomes 1, 8, and 17, did not align on the current human reference sequence. Likewise, of the 294 putative genes defined on the build 30, only one, previously located on chromosome 3, did not match the current release of the human genome sequence. In some instances, novel and putative genes were fused in a single genomic region.
To evaluate the impact on the annotation of the "essentially complete" human genome sequence, we later re-examined the novelty of the gene models defined by the CNSLT resource using the Ensembl genes as the set of reference annotation (Hubbard et al. 2002
A total of 210 Ensembl genes were covered by a novel gene defined by the CNSLT resource, which corresponded to 209 genomic regions. Fifty-two of these Ensembl annotations were tagged as novel Ensembl genes; these are now also confirmed by the CNSLT cDNA resource. To complete the analysis, we compared the CNSLT cDNAs now matching the NCBI build 33 and the Ensembl genes. In 74 cases, these cDNAs were located in genomic regions devoid of Ensembl annotation, indicating the existence of other new genes not identified by using the previous release of the human genome sequence.
A survey of the exon composition of the spliced models proposed for the new genes showed an average of 4.7 exons per gene, which was lower than reported mean values for extensively annotated genes (Heilig et al. 2003
More than one transcript variant was observed for
To evaluate the proportion of complete 5' ends in models proposed for the new genes after human curation, the presence of CpG islands in the 5' regions was investigated (see Methods). A total of 344 CpG islands (60%) appeared in the vicinity of the 5' end of the proposed novel models. Previous estimates have shown that between 60% and 67% of genes are associated with a CpG island at their 5' end (Antequera and Bird 1999
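The criteria used here to call CpG islands are given in Methods and are not reproduced in this section. For orientation only, the sketch below applies the widely used Gardiner-Garden and Frommer thresholds (length of at least 200 bp, G+C content above 50%, observed/expected CpG ratio above 0.6) to a single window. Whether these exact thresholds were used in the present analysis is an assumption, and the function names are hypothetical.

```python
def cpg_stats(window: str):
    """Return (G+C fraction, observed/expected CpG ratio) for a sequence window."""
    seq = window.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (C count * G count) / length.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return gc_fraction, obs_exp


def looks_like_cpg_island(window: str, min_len: int = 200,
                          min_gc: float = 0.5, min_obs_exp: float = 0.6) -> bool:
    """Apply the assumed length, G+C, and observed/expected thresholds."""
    gc_fraction, obs_exp = cpg_stats(window)
    return len(window) >= min_len and gc_fraction > min_gc and obs_exp > min_obs_exp


cpg_rich = "CG" * 100                 # GC-rich and CpG-rich
cpg_depleted = "C" * 100 + "G" * 100  # GC-rich but with a single CpG
at_rich = "AT" * 100                  # no G or C at all
print(looks_like_cpg_island(cpg_rich),
      looks_like_cpg_island(cpg_depleted),
      looks_like_cpg_island(at_rich))
```

The second test sequence shows why G+C content alone is not sufficient: it is entirely G and C but contains only one CpG dinucleotide, so its observed/expected ratio falls far below the threshold.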
Sequence comparisons with the pufferfish Tetraodon nigroviridis using Exofish (Roest Crollius et al. 2000
An example of a novel gene supported by the CNSLT cDNA resource is shown in Figure 2A. This gene corresponds to a cluster of nine cDNA clones that match the reverse strand of the human genome assembly for chromosome 22. Five of these cDNA clones could be assembled on the genome sequence. Five different alternative transcripts were found for this gene, with the most abundant transcript variant chosen as representative for the gene. The model proposed corresponded to a structure of 18 exons confirmed by identifiable splice junctions. Thirteen out of the 18 exons were supported by linked Tetraodon ecores. A coding region of 272 amino acids was identified in the model proposed for this gene, comprising <50% of the sequence reconstructed on the genome. Moreover, a CpG island was found in the vicinity of the 5' end of the gene.
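The proportion of exons supported by Tetraodon ecores, as in the 13 out of 18 exons of the example gene, can be computed by intersecting exon intervals with ecore intervals. The sketch below assumes that any overlap of at least one base counts as support and uses made-up coordinates; the criterion actually applied with Exofish may differ.

```python
def overlaps(a, b) -> bool:
    """True if two half-open intervals (start, end) share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def fraction_exons_supported(exons, ecores) -> float:
    """Fraction of exons overlapped by at least one conserved region (ecore)."""
    if not exons:
        return 0.0
    supported = sum(1 for exon in exons if any(overlaps(exon, e) for e in ecores))
    return supported / len(exons)


# Hypothetical coordinates, for illustration only.
exons = [(0, 100), (200, 300), (400, 500), (600, 700)]
ecores = [(50, 80), (290, 310), (1000, 1100)]
support = fraction_exons_supported(exons, ecores)
print(support, support >= 0.5)
```

With 13 of 18 exons supported, the example gene described above has a supported fraction of about 0.72 and therefore belongs to the subset in which at least half of the exons coincide with conserved regions.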
Comparison of the virtual cDNA (see Methods) with the human genome assembly build 31 by means of MegaBLAST allowed us to map the proposed gene model on chromosome 22 (NT_011520 contig); it overlaps a hypothetical gene model (locus LOC220686) for which no experimental transcript evidence had been available to date.
Small Open Reading Frames
For the category 3 models, we observed an average of 3.4 exons per gene and the presence of a CpG island in the vicinity of the 5' end for 47% (139/294) of the cases in this category. Because in most of the cases, human genes have a mouse counterpart with highly conserved exonic structure (Waterston et al. 2002
TBLASTX searches were also performed against the mouse genome sequence for a sample of 126 monoexonic smORFs, and matches were found in 75 of the cases (60%), 19 of which had a CpG island in the vicinity of the 5' end. Although it is likely that a fraction of the monoexonic smORFs may correspond to pseudogenes, this strongly indicates that unspliced smORFs may encode a true protein or be part of larger gene models.
Extension of the 5' End of Annotated Genes
An example of the extension of an annotated gene provided by the CNSLT resource is shown in Figure 2C. This gene is supported by the DKFZP434K1772 cDNA (LocusID 54507), a 12-exon structure located on the forward strand of chromosome 1. The human CNSLT cDNA clone (accession nos. BX329090, BX370116, BX399403, and BX399404; Fig. 2C) extends the annotated gene by seven exons, with a canonical GT-AG dinucleotide pattern for all donor and acceptor sites. Two of these exons are supported by human-Tetraodon ecores. Furthermore, this extension allowed the anchoring of the gene in the vicinity of a CpG island. The CDS from this annotated gene model is also extended by using this CNSLT cDNA. We estimate that the new model, initially based on the DKFZP434K1772 hypothetical protein, should now be complete.
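The canonical GT-AG pattern mentioned above refers to the first two and last two bases of each intron. A minimal sketch of such a check for a forward-strand gene is given below; the toy sequence and exon coordinates are invented for illustration, and a reverse-strand gene would first require reverse-complementing the genomic sequence.

```python
def intron_boundaries(genome: str, exons):
    """Yield (donor, acceptor) dinucleotides for each intron between
    consecutive exons, given as sorted 0-based half-open (start, end) pairs."""
    for (_, left_end), (right_start, _) in zip(exons, exons[1:]):
        donor = genome[left_end:left_end + 2].upper()
        acceptor = genome[right_start - 2:right_start].upper()
        yield donor, acceptor


def all_canonical(genome: str, exons) -> bool:
    """True if every intron starts with GT and ends with AG."""
    return all(d == "GT" and a == "AG" for d, a in intron_boundaries(genome, exons))


# Toy locus: exon, 12-bp intron, exon (forward strand).
genome = "ATGGCC" + "GTAAGTTTTCAG" + "GGATAA"
exons = [(0, 6), (18, 24)]
canonical = all_canonical(genome, exons)

# Same locus with a GC donor site instead of GT.
variant = "ATGGCC" + "GCAAGTTTTCAG" + "GGATAA"
noncanonical = all_canonical(variant, exons)
print(canonical, noncanonical)
```

Spliced-alignment programs such as sim4 already favor canonical junctions when placing exon boundaries, so a check of this kind mainly serves as a consistency test during curation.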
Additional Comparisons to Mouse Genome and Transcripts
In addition, 160 of the novel genes (29%) were covered by at least one FANTOM2 cDNA clone on the same genomic region of the human genome assembly (Table 4). The same procedure was applied by using the mouse genome sequence assembly, resulting in 172 novel genes clustered with at least one FANTOM2 cDNA clone (see Methods).
As it has already been shown, 40% of the human genome can be aligned with the mouse genome at the nucleotide level. Moreover, 99% of mouse genes have a homolog in the human genome, with highly conserved exonic structure (Waterston et al. 2002
Integration of information on CpG islands and human-pufferfish ecores into the models proposed for each transcript followed by manual curation allowed us to group 1072 gene models into three main categories: novel, known, and putative genes. Novel genes were identified on all chromosomes except for chromosome Y. Although eight new genes were located on chromosome 21, 16 mapped to chromosome 22, possibly as a reflection of the higher gene content of chromosome 22 (Dunham et al. 1999
Recently small protein-coding genes (smORFs) were found in Saccharomyces cerevisiae by searching potential budding yeast ORFs against sequences from other related and nonfungal species (Kessler et al. 2003
First exons, which are usually partially or completely non-coding, are overlooked by most gene-finding algorithms used for prediction of protein-coding genes. Although new methods are improving the accuracy of first exon predictions (Brent 2002
Sequence comparison to the mouse FANTOM2 cDNAs, together with later alignment with the mouse genome, gave further support to the novel genes defined by the CNSLT cDNA resource. Furthermore, the CNSLT resource provides a number of additional candidate genes on the mouse genome, to be confirmed later by experimental approaches.
Even though several groups have focused on the generation of full-length cDNA libraries, there is still a lack of resources for identifying a significant fraction of genes and transcripts in their entirety (Kristiansen and Pandey 2002
Library Construction
The cDNA libraries were generated on the pCMVSPORT6 vector by Life Technologies, a division of Invitrogen Corporation. Briefly, first-strand cDNA was synthesized from poly(A)+ mRNA by using Invitrogen Superscript II RT and an oligo-(dT) primer containing a NotI site. The 5' end was enriched, and double-stranded cDNA was digested with NotI and cloned into the NotI and EcoRV sites of the pCMVSPORT6 vector.
Alignment of cDNA Sequences to the Human Genome
Human Curation
Comparative sequence analysis versus the T. nigroviridis genome using Exofish (Roest Crollius et al. 2000
CDS Determination
Search for Homologous Mouse Genes
A comparison against the mouse FANTOM2 cDNA clones was performed for the identification of possible mouse counterparts: The mouse FANTOM2 cDNAs were filtered by establishing the association to a novel and/or putative human gene; in a second step, the selected FANTOM2 cDNAs were mapped and clustered together with the novel and/or putative human genes on the human and mouse genome. Briefly, for each human novel and putative gene model (846 genes), we identified those FANTOM2 mouse cDNAs (data set of 60,770 mouse full-length cDNA clones; Okazaki et al. 2002
The subset of CNSLT cDNA clones with no match to the build 30 release of the human genome assembly was subjected to the same process.
Comparison of Novel and Putative Genes Defined by the CNSLT Resource With the Ensembl Annotation
This work was supported by the French Ministry of Research (grant no. 9950275). We thank Carole Dossat and Olivier Jaillon for support on Exofish, Ralph Eckenberg on CDS identification, and Sumitta Samair and Eric Pelletier for support with the presentation of the data. We thank Chris Gruber, Wu Bo Li, and Joel Jessee for cDNA library construction. We wish to thank as well François Sigaux, Philippe Dessen, and Jacques Haiech for helpful discussions at various stages of the project; Susan Cure for her help in writing the manuscript; and the technical staff of Genoscope for its essential contribution to the experimental part of the work. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1481104. Article published online before print in February 2004.
5 Corresponding author.
2 Present address: LGI-BioInformatic, Aventis Pharma S.A., 94400, Vitry-Sur-Seine, France
3 Present address: European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK
4 Present address: Genomining, 92120, Montrouge, France.
[The sequence data from this study have been submitted to EMBL under accession nos. BX323813, BX323814, BX324295-BX465182, AL513551-AL583711.]
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403-410.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Antequera, F. and Bird, A. 1999. CpG islands as genomic footprints of promoters that are associated with replication origins. Curr. Biol. 9: R661-667.
Brent, M.R. 2002. Predicting full-length transcripts. Trends Biotechnol. 20: 273-275.
Collins, J.E., Goward, M.E., Cole, C.G., Smink, L.J., Huckle, E.J., Knowles, S., Bye, J.M., Beare, D.M., and Dunham, I. 2003. Reevaluating human gene annotation: A second-generation analysis of chromosome 22. Genome Res. 13: 27-36.
Deloukas, P., Matthews, L.H., Ashurst, J., Burton, J., Gilbert, J.G., Jones, M., Stavrides, G., Almeida, J.P., Babbage, A.K., Bagguley, C.L., et al. 2001. The DNA sequence and comparative analysis of human chromosome 20. Nature 414: 865-871.
Dunham, I., Shimizu, N., Roe, B.A., Chissoe, S., Hunt, A.R., Collins, J.E., Bruskiewich, R., Beare, D.M., Clamp, M., Smink, L.J., et al. 1999. The DNA sequence of human chromosome 22. Nature 402: 489-495.
Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M., and Miller, W. 1998. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8: 967-974.
Guigo, R., Dermitzakis, E.T., Agarwal, P., Ponting, C.P., Parra, G., Reymond, A., Abril, J.F., Keibler, E., Lyle, R., Ucla, C., et al. 2003. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1019 additional genes. Proc. Natl. Acad. Sci. 100: 1140-1145.
Haas, B.J., Volfovsky, N., Town, C.D., Troukhan, M., Alexandrov, N., Feldmann, K.A., Flavell, R.B., White, O., and Salzberg, S.L. 2002. Full-length messenger RNA sequences greatly improve genome annotation. Genome Biol. 3: research0029.0021-research0029.0012.
Hattori, M., Fujiyama, A., Taylor, T.D., Watanabe, H., Yada, T., Park, H.S., Toyoda, A., Ishii, K., Totoki, Y., Choi, D.K., et al. 2000. The DNA sequence of human chromosome 21. Nature 405: 311-319.
Heilig, R., Eckenberg, R., Petit, J.L., Fonknechten, N., Da Silva, C., Cattolico, L., Levy, M., Barbe, V., De Berardinis, V., Ureta-Vidal, A., et al. 2003. The DNA sequence and analysis of human chromosome 14. Nature 421: 601-607.
Hogenesch, J.B., Ching, K.A., Batalov, S., Su, A.I., Walker, J.R., Zhou, Y., Kay, S.A., Schultz, P.G., and Cooke, M.P. 2001. A comparison of the Celera and Ensembl predicted gene sets reveals little overlap in novel genes. Cell 106: 413-415.
Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T., et al. 2002. The Ensembl genome database project. Nucleic Acids Res. 30: 38-41.
Kessler, M.M., Zeng, Q., Hogan, S., Cook, R., Morales, A.J., and Cottarel, G. 2003. Systematic discovery of new genes in the Saccharomyces cerevisiae genome. Genome Res. 13: 264-271.
Kikuno, R., Nagase, T., Waki, M., and Ohara, O. 2002. HUGE: A database for human large proteins identified in the Kazusa cDNA sequencing project. Nucleic Acids Res. 30: 166-168.
Kristiansen, T.Z. and Pandey, A. 2002. Resources for full-length cDNAs. Trends Biochem. Sci. 27: 266-267.
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921.
Mott, R. 1997. EST_GENOME: A program to align spliced DNA sequences to unspliced genomic DNA. Comput. Appl. Biosci. 13: 477-478.
Okazaki, Y., Furuno, M., Kasukawa, T., Adachi, J., Bono, H., Kondo, S., Nikaido, I., Osato, N., Saito, R., Suzuki, H., et al. 2002. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 420: 563-573.
Pruitt, K.D., Katz, K.S., Sicotte, H., and Maglott, D.R. 2000. Introducing RefSeq and LocusLink: Curated human genome resources at the NCBI. Trends Genet. 16: 44-47.
Reymond, A., Camargo, A.A., Deutsch, S., Stevenson, B.J., Parmigiani, R.B., Ucla, C., Bettoni, F., Rossier, C., Lyle, R., Guipponi, M., et al. 2002. Nineteen additional unpredicted transcripts from human chromosome 21. Genomics 79: 824-832.
Roest Crollius, H., Jaillon, O., Bernot, A., Dasilva, C., Bouneau, L., Fischer, C., Fizames, C., Wincker, P., Brottier, P., Quetier, F., et al. 2000. Estimate of human gene number provided by genome-wide analysis using Tetraodon nigroviridis DNA sequence. Nat. Genet. 25: 235-238.
Skaletsky, H., Kuroda-Kawaguchi, T., Minx, P.J., Cordum, H.S., Hillier, L., Brown, L.G., Repping, S., Pyntikova, T., Ali, J., Bieri, T., et al. 2003. The male-specific region of the human Y chromosome is a mosaic of discrete sequence classes. Nature 423: 825-837.
Strausberg, R.L., Feingold, E.A., Klausner, R.D., and Collins, F.S. 1999. The mammalian gene collection. Science 286: 455-457.
Strausberg, R.L., Feingold, E.A., Grouse, L.H., Derge, J.G., Klausner, R.D., Collins, F.S., Wagner, L., Shenmen, C.M., Schuler, G.D., Altschul, S.F., et al. 2002. Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences. Proc. Natl. Acad. Sci. 99: 16899-16903.
Suzuki, Y., Ishihara, D., Sasaki, M., Nakagawa, H., Hata, H., Tsunoda, T., Watanabe, M., Komatsu, T., Ota, T., Isogai, T., et al. 2000. Statistical analysis of the 5' untranslated region of human mRNA using "Oligo-Capped" cDNA libraries. Genomics 64: 286-297.
Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A., et al. 2001. The sequence of the human genome. Science 291: 1304-1351.
Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520-562.
Wheeler, D.L., Church, D.M., Federhen, S., Lash, A.E., Madden, T.L., Pontius, J.U., Schuler, G.D., Schriml, L.M., Sequeira, E., Tatusova, T.A., et al. 2003. Database resources of the National Center for Biotechnology. Nucleic Acids Res. 31: 28-33.
Wiemann, S., Weil, B., Wellenreuther, R., Gassenhuber, J., Glassl, S., Ansorge, W., Bocher, M., Blocker, H., Bauersachs, S., Blum, H., et al. 2001. Toward a catalog of human genes and proteins: Sequencing and analysis of 500 novel complete protein coding human cDNAs. Genome Res. 11: 422-435.
Zhang, Z., Schwartz, S., Wagner, L., and Miller, W. 2000. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7: 203-214.
http://compbio.ornl.gov/grailexp; Grail Experimental Gene Discovery Suite Web site.
http://www.ensembl.org/EnsMart/; EnsMart data mining toolset retrieval of annotated genomes.
Received April 30, 2003; accepted in revised form December 2, 2003.