Genome Res. 14:665-671, 2004
©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00

Methods

Identification of Rat Genes by TWINSCAN Gene Prediction, RT-PCR, and Direct Sequencing

1 Human Genome Sequencing Center, Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, Texas 77030, USA
2 Laboratory for Computational Genomics, Washington University, St. Louis, Missouri 63130, USA
The publication of a draft sequence of a third mammalian genome, that of the rat, suggests a need to rethink genome annotation. New mammalian sequences will not receive the kind of labor-intensive annotation efforts that are currently being devoted to the human genome. In this paper, we demonstrate an alternative approach: reverse transcription-polymerase chain reaction (RT-PCR) and direct sequencing based on dual-genome de novo predictions from TWINSCAN. We tested 444 TWINSCAN-predicted rat genes that showed significant homology to known human genes implicated in disease but that were partially or completely missed by methods based on protein-to-genome mapping. Using primers in exons flanking a single predicted intron, we were able to verify the existence of 59% of these predicted genes. We then attempted to amplify the complete predicted open reading frames of 136 genes that were verified in the single-intron experiment. Spliced sequences were amplified in 46 cases (34%). We conclude that this procedure for elucidating gene structures with native cDNA sequences is cost-effective and will become even more so as it is further optimized.
The publication of a draft sequence of the rat genome provides an exciting opportunity to find orthologs and novel paralogs of known human genes that can be studied in a well-established physiological and pharmacological model (Rat Genome Sequencing Project Consortium 2004).
Systematic RT-PCR and direct sequencing of many novel gene predictions was first reported three years ago (Miyajima et al. 2000).
Summary of RT-PCR Results

We obtained spliced sequence from RT-PCR products for 59% of predictions for which we designed primers in adjacent exons flanking a single intron. For a subset of the single-intron successes, we designed two further primer pairs: one in exons at or near the ends of the predicted ORF, to amplify most of the ORF, and one in the presumed untranslated regions (UTRs), to amplify the complete ORF. The success rates were 70% and 34%, respectively (Table 1).
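The summary percentages can be checked against the counts reported in the Single-Intron and Full ORF sections of the Results. The sketch below simply redoes that arithmetic; the helper name `pct` and the variable names are illustrative and not part of the original analysis.

```python
def pct(successes, attempts):
    """Success rate as a whole-number percentage."""
    return round(100 * successes / attempts)

# Single-intron experiment: 444 primer pairs, 260 verified introns after
# combining both runs, 189 with boundaries exactly as predicted.
single_intron_rate = pct(260, 444)
exact_boundary_rate = pct(189, 260)

# Terminal-exon (TE) primers: first plate, retested failures, new targets.
te_hits = 64 + 6 + 36
te_tested = 93 + 59
te_rate = pct(te_hits, te_tested)

# UTR primers: first plate, retested failures, new targets.
utr_hits = 26 + 5 + 15
utr_tested = 89 + 47
utr_rate = pct(utr_hits, utr_tested)

print(single_intron_rate, exact_boundary_rate, te_rate, utr_rate)
```

The combined TE and UTR denominators count distinct exon pairs, so retested targets contribute to the hit counts but not again to the number tested.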
Predictions

TWINSCAN predicted 24,490 genes comprising 182,013 exons on build 2.1 of the rat genome (January 2003). These and other statistics were similar to those obtained from the TWINSCAN annotation of human (Flicek et al. 2003).
Single-Intron Experiments

Of the 554 remaining exon pairs, we were able to design primers that met our criteria for 444 (see Methods for details). All 444 primer pairs were synthesized and used in RT-PCR with pooled rat RNA from diverse tissues and developmental stages. PCR products were purified and sequenced using both forward and reverse primers (see http://genes.cse.wustl.edu/rat-data-03/). The resulting sequences were then analyzed to determine whether the primers amplified a spliced product from the target region. If so, the experiment was categorized as successful.

From the 444 experimental wells, the analysis revealed 196 successful amplifications of spliced sequences (44%). Products from another 142 experiments failed to yield 20 consecutive bases of high-quality sequence, either because nothing was amplified or because sequencing failed. The remaining 107 primer pairs yielded sequences that did not provide a reliable indication of whether the product was spliced. Repetition of this experiment using the same primers but adding DMSO and glycerol to the PCR mixture yielded an overlapping set of 166 successes. Combining the two experiments, the total number of verified introns was 260 (59%). Among these successes, the intron boundaries were exactly as predicted by TWINSCAN in 189 (73%).

Each 96-well plate contained two tests with primers in exons 3 and 4 of rat p53, an intron splicing test with primers in exon 3 and intron 3, and an intron splicing test with primers in introns 7 and 8. In the initial experiment, tests with primers in exons 3 and 4 of rat p53 yielded spliced sequence in eight cases and failures in two cases on different plates. In one failure there was probably no amplification, as both the forward and reverse reads, which were produced in different sequencing runs, contained no high-quality sequence. The other failure yielded a short sequence that matched one exon but did not cross the splice boundary.
During the repetition of the experiment, three of the five positive controls did not yield sequences. This could be due to failure during high-throughput PCR amplification, purification, or sequencing. In both experiments, the intron splicing tests with primers located in intron-intron or intron-exon pairs amplified and produced sequence that was not spliced. This is likely due to unspliced pre-mRNA. Therefore, amplification alone should not be interpreted as implying that the primers annealed to exonic sequence (see Discussion).
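The "no sequence" category in the single-intron experiments rests on whether a read contains 20 consecutive bases of high-quality sequence. A minimal sketch of such a check is given below. The text does not state the quality cutoff, so the per-base threshold `min_q=20` (a phred-style score) and the function name are assumptions made only for illustration.

```python
def has_high_quality_run(quals, min_run=20, min_q=20):
    """Return True if `quals` holds at least `min_run` consecutive
    per-base quality scores that are >= `min_q`.

    min_run=20 follows the text; min_q=20 is an assumed cutoff.
    """
    run = 0
    for q in quals:
        # Extend the current run on a good base, reset it on a poor one.
        run = run + 1 if q >= min_q else 0
        if run >= min_run:
            return True
    return False
```

Under this sketch, an experiment would fall into the no-sequence category when neither the forward nor the reverse read passes the check.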
Full ORF Experiments

We attempted to design 93 UTR and 93 terminal-exon (TE) primer pairs for predictions that yielded successful results on the first set of single-intron experiments. Product size was limited to 4 Kb, because the efficiency of the PCR methods used was expected to decline rapidly at greater product lengths. Because product length was limited and primer design is not always successful, we ended up with 93 TE primer pairs and 89 UTR primer pairs. After the experimental procedure described above, sequence analysis revealed amplification of a spliced product overlapping the predicted gene from 64 TE experiments (69%); 13 yielded no high-quality sequence from either primer, and 16 yielded sequence that did not produce a reliable spliced alignment to the targeted region. The corresponding numbers for the UTR experiments were 26 successful (29%), 22 with no high-quality sequence, and 41 with no spliced product from the predicted region.

After the completion of this experiment, a new assembly of the rat genome was released in June 2003 (release 3.1). This assembly included 1500 new BACs, yielding a net increase of 70 million bases in the assembly. To determine the effect of this improvement on the results described above, the 22 UTR primer pairs that yielded no high-quality sequence were remapped to the new genome sequence. This revealed that 19 of the 22 lay in regions that had been revised so extensively that the original primers would not be expected to amplify any product in the new genome sequence. We then remapped all the exon pairs that yielded positive results in the single-intron experiment to the new genome build and extracted the full-ORF predictions containing them. A new set of TE and UTR targets was selected, including 22 TE and 46 UTR targets that failed on the first UTR plate, as well as 59 new TE and 47 new UTR targets.
Six of the TE targets that had been negative before came out positive (27%), as did 36 of those that had not been previously tested (61%). Five of the UTR targets that had been negative before came out positive (10%), as did 15 of those that had not been previously tested (32%). Combining both experiments, ORFs containing 152 distinct RT-PCR-positive exon pairs were tested with TE primers, of which 106 were successfully amplified and end sequenced (70%). For the UTR primers, ORFs containing 136 distinct RT-PCR-positive exon pairs were tested, of which 46 were successfully amplified and end sequenced (34%). All but five of the UTR successes were also TE successes.

In order to compare our sequence-based analysis to a different assay for RT-PCR success, we ran the products on a 1% agarose gel and estimated the product sizes. Products were classified according to whether they yielded a single band within 400 nucleotides (nt) of the expected size, multiple bands at least one of which was within 400 nt of the expected size, or no band within 400 nt of the expected size. Each group was divided into those that yielded a "hit" by our sequencing and analysis method and those that did not (Table 2). Most of the products that yielded a single band of the right size had been classified as hits (93%), and most of the hits yielded at least one band of the right size (65%). However, 35% of the hits yielded no visible band of the expected size, despite yielding high-quality spliced sequence that matched the expected genomic location. RT-PCR successes may have been missed by gel analysis because sequencing is more sensitive at low template concentrations, because the accuracy of gel sizing is limited, or because of discrepancies between the predicted size and the actual size. Most of the products that were not classified as hits by sequence analysis yielded no band of the expected size (80%), but 18% yielded multiple bands including one of the expected size.
Examination of sample traces from the products that yielded a band of the expected size along with other bands revealed that multiple templates were sequenced in every case, regardless of whether the product was classified as a hit by sequence analysis. This could have been the result of alternative splicing or mispriming. Five of the products that were not classified as hits by sequence analysis (2%) yielded a single band of the expected size. All of these turned out to be cases in which multiple templates were visible in the sequencing trace but not in the gel analysis.
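The three gel categories used in this comparison can be written down as a small classifier. This is only a sketch of the rule as described: the function name, the category labels, and the decision to treat "within 400 nt" as inclusive are assumptions, and band sizes are taken to be estimates in nucleotides.

```python
def classify_gel(band_sizes, expected_size, tolerance=400):
    """Assign a gel lane to one of the three categories in the text.

    band_sizes: estimated sizes (nt) of all visible bands in the lane.
    expected_size: product size (nt) implied by the gene prediction.
    """
    near = [b for b in band_sizes if abs(b - expected_size) <= tolerance]
    if not near:
        return "no_expected_band"
    if len(band_sizes) == 1:
        return "single_expected_band"
    return "multiple_with_expected_band"
```

Cross-tabulating these labels against the sequence-based hit or miss call gives a table of the kind summarized in Table 2.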
Comparison of a Sample of Confirmed Rat Genes to Known Human Genes

Querying NCBI's nonredundant protein database (nr) with the UTR-confirmed gene predictions revealed that eight of them had become provisional or curated rat RefSeqs since we picked them as targets on March 28, 2003. In all eight cases, the name of the rat RefSeq matched that of an HGMD human gene. To get a sense of the kinds of genes we had verified, we investigated the 38 that had not become rat RefSeqs (Table 3). In 33 cases the rat prediction appeared likely to be the true ortholog of the top human hit in nr. Of these, 31 were known genes and two were predictions. Of the 31 known genes, 17 were in HGMD. Most of the rat predictions were about the same length as their putative human orthologs, although alignments sometimes suggested a missed splice site or exon or an alternative splice. In six cases, however, the prediction was more than 100 amino acids shorter than the putative ortholog. Because the high-quality portions of the reads did not include both the predicted start and stop codons in these cases, the primer may have annealed to a coding exon or an intron, rather than a UTR. In five cases, the verified rat prediction did not appear to be orthologous to a known human gene, and may therefore have been a novel paralog.
In these experiments we used TWINSCAN gene prediction followed by RT-PCR and direct sequencing to confirm partial gene structures in the newly sequenced rat genome. All 444 of the targets had significant similarity to known human genes, but they were at least partially missed by the April 2003 Ensembl annotation of the rat genome. Among these, 260 were confirmed by primers spanning a single intron, 106 of the single-intron positives were also confirmed using primers in the outermost predicted exons, and 46 of the single-intron positives were also confirmed using primers in the predicted UTRs. Among the single-intron positives we tested, 70% were positive in the terminal-exon experiments and 34% in the UTR experiments. The higher success rate with TE primers relative to UTR primers probably results from a combination of factors. First, terminal exon primers were designed in the outermost exons with sufficient coding sequence; the shortest exons, which are most likely to be mispredicted, were omitted from the TE experiment for reasons of primer design. Second, we designed UTR primers within 300 bp of the predicted ORF; many of these 300-bp regions may have included sequence outside the UTRs or in UTR introns. Finally, gene prediction algorithms are more accurate in identifying exons than in grouping them into transcripts; this makes it likely that some of the predicted initial and terminal exons were actually internal exons, and hence the predicted UTR regions flanking them were introns.
To the best of our knowledge, this is the first reported attempt to amplify complete mammalian open reading frames from de novo gene predictions on a significant scale. Previous efforts to verify many mammalian gene predictions by RT-PCR and sequencing have targeted a single intron (Miyajima et al. 2000).
An unexpected outcome of our experiments was the ease with which pre-mRNAs can be amplified from primers in introns. In particular, primers complementary to introns of p53 yielded amplification and high-quality sequence that, when aligned to the genome, revealed no splicing. Although pre-mRNA intermediates of p53 are probably particularly abundant (Khochbin et al. 1992 Much more is learned when alignment of the amplicon sequence to the genome reveals the locations of one or more introns. Furthermore, when neither primer anneals to a particular intron, cDNAs in which it is retained are not selectively amplified. Thus, when alignment of the amplicon sequence to the genome reveals an aligned sequence bounded by two introns that have been spliced out, that sequence is almost certainly an exon of the mature mRNA. Using this analysis, the complete set of TE and UTR experiments determined the locations of 598 complete exons (both splice sites found), 743 introns, and 1486 splice boundaries. Because each complete exon is bounded by two inferred introns, the number of introns identified is necessarily greater than the number of complete exons. TWINSCAN correctly predicted 87% of the complete exons, 83% of the introns, and 91% of the splice boundaries. The exon pairs we targeted were selected for both similarity to a human disease gene from the HGMD database and lack of similarity to an Ensembl prediction on rat. We expected that Ensembl and its protein mapping engine GeneWise would map known human genes to all locations on the rat genome that could produce highly similar proteins. Therefore, we expected that the TWINSCAN predictions that did not match Ensembl's rat predictions would be distant paralogs of known human genes. However, most of the predicted ORFs that were amplified and sequenced turned out to be orthologs of known human genes (see Table 3). 
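The counting rule described above, in which an aligned segment bounded by two spliced-out introns is taken as a complete exon, can be sketched for a single amplicon as follows. The representation of an alignment as sorted, half-open genomic blocks is an assumption made for illustration, as is the premise that every gap between blocks is an intron rather than an assembly gap or alignment artifact.

```python
def summarize_spliced_alignment(blocks):
    """Derive introns, complete exons, and splice boundaries from the
    aligned blocks of one amplicon.

    blocks: sorted list of (start, end) genomic intervals, half-open,
            one per aligned segment (assumed representation).
    """
    # Each gap between consecutive aligned blocks is an inferred intron.
    introns = [(blocks[i][1], blocks[i + 1][0]) for i in range(len(blocks) - 1)]
    # Only internal blocks have both splice sites observed; the first and
    # last blocks end at the primers, so their outer boundaries are unknown.
    complete_exons = blocks[1:-1]
    # Every intron contributes a donor and an acceptor boundary.
    splice_boundaries = 2 * len(introns)
    return introns, complete_exons, splice_boundaries
```

For an amplicon with n aligned blocks this yields n - 1 introns, n - 2 complete exons, and 2(n - 1) splice boundaries, which is consistent with the reported totals having exactly twice as many splice boundaries as introns and fewer complete exons than introns.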
Of the predictions that were verified by TE primers, 71% were partially missed by Ensembl whereas 29% were missed completely (no stretch of 100 consecutive nt identical to any Ensembl predicted mRNA). For example, TWINSCAN correctly predicted the structure of the rat aspartylglucosaminidase (AGA; Fig. 1), deficiency of which causes the lysosomal storage disease aspartylglycosaminuria. The rat and human proteins align over their entire lengths with 84% amino acid identity. Ensembl only predicts two exons from this nine-exon gene, apparently because GeneWise aligns a fragmentary 40-amino acid rat protein (SWISS-PROT P30919 [GenBank] ) in preference to the complete human protein. Future versions of Ensembl will likely overcome the blocking of complete xeno-proteins (e.g., human AGA) by fragments of native proteins (e.g., rat AGA), but this example highlights the importance of comparing systems for genome-wide annotation to one another, not to the structures that could be derived by an expert focusing on a small number of genes. One key to the scalability of our approach is the automation of the sequence analysis. In principle, aligning the experimental sequence to the genome and checking for introns ought to be relatively straightforward. When we implemented this procedure and checked the automated analyses manually, we discovered a number of special cases that needed to be incorporated in the program. For example, in several of the experiments that our program classified as failures, further inspection revealed that we had discovered an exon within a gap in the early draft assembly of the genome. Optimizing the protocol for amplifying full ORFs from UTR primers will almost certainly improve the yield. For example, the likelihood that primers anneal to UTR exons can be increased by optimizing the simple primer placement approach used here and, ultimately, by incorporating UTR prediction into algorithms such as TWINSCAN. 
Sequencing the entire PCR product would eliminate failures due to end sequences that do not reach a splice site. Finally, the success rate for products under 2.5 Kb (55%) was much higher than for those over 2.5 Kb (23%). Thus, amplifying longer ORFs in two overlapping segments would likely give a much higher yield. Even without optimization, our approach is attractive compared to obtaining new clones by traditional library construction, screening ESTs, and sequencing promising clones to determine whether they contain a complete ORF. The number of ESTs required to identify each new, full ORF clone has recently increased dramatically, indicating that the traditional approach is reaching saturation. On the other hand, the declining cost of primer synthesis will likely make RT-PCR even more attractive in the future. At the same time, the sequencing of more mammalian genomes will improve the accuracy of comparative de novo prediction methods such as TWINSCAN. We have shown for the first time that high success rates can be obtained in RT-PCR and sequencing of many predicted mammalian ORFs. There is ample opportunity to improve this method, and trends in traditional cDNA library sequencing, primer cost, and gene prediction accuracy will inevitably increase its role in genome annotation. Ultimately, this will lead to a fundamental improvement in annotation: the most reliable annotations will be those that are completely supported by native cDNA sequence obtained from an experiment designed to address a specific hypothesis, namely the gene prediction. The road forward in annotating open reading frames is clear; the challenges for the future are alternative splices and 5' UTRs.
Predictions

Initial gene predictions were made by running TWINSCAN version 1.1 on release 2.1 of the rat genome sequence (January 2003) as repeat-masked by the University of California Santa Cruz (UCSC; http://genome.ucsc.edu/goldenPath/rnJan2003/chromosomes/).
Primer Design The goal of the single-intron experiment was to obtain high-quality sequence around the target intron. To achieve this, primers were designed in the exons surrounding the target intron but no less than 30 bp away from it. Primer3 parameters were as follows: primer optimal size 27, primer optimal Tm 70, primer minimum Tm 67, primer maximum Tm 73, single primer maximum mispriming score of 27, primer pair maximum mispriming score of 51.3. All other parameters were left at default values. The mispriming library consisted of all Provisional and Reviewed Rat RefSeqs (NM accessions) available on March 28, 2003. The aim of the full-ORF experiment was to produce high-quality sequence from both ends of the predictions that covered as much of the gene structure as possible. Predictions with more than 4 Kb of coding sequence were not considered for this experiment, because the high-throughput PCR methods used here have significantly reduced efficiency beyond this length (though we can amplify much longer transcripts using different conditions for different lengths). Two primer pairs were generated for each targeted prediction: one pair in the outermost exons among those with at least 75 coding bases (100 in the second TE experiment), and one pair in the putative UTRs just outside the ORF. The UTR primers were between 10 and 300 bases of genomic sequence from the predicted start and end of translation. Primer3 parameters were as above except for: primer optimal Tm 72.5, primer minimum Tm 65, primer maximum Tm 80, and primer pair maximum mispriming score of 45.9. The mispriming library was as above except that 300 bp of genomic sequence surrounding each RefSeq (analogous to the primer-design region of the targeted predictions) was also included.
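The placement constraints described above can be expressed as candidate regions handed to a primer-design program. The sketch below is one reading of those constraints, not the authors' code: it assumes plus-strand, half-open genomic coordinates, interprets "no less than 30 bp away" as the whole primer lying at least 30 bp from the intron, and uses hypothetical function names.

```python
def single_intron_windows(up_exon, down_exon, margin=30):
    """Regions in which primers may lie for a single-intron test.

    up_exon, down_exon: (start, end) of the exons flanking the target
    intron. Returns (left_window, right_window); a window is None when
    the exon is too short to leave any room.
    """
    up_start, up_end = up_exon
    down_start, down_end = down_exon
    left = (up_start, up_end - margin)        # stay clear of the donor site
    right = (down_start + margin, down_end)   # stay clear of the acceptor site
    return (left if left[0] < left[1] else None,
            right if right[0] < right[1] else None)


def utr_windows(orf_start, orf_end, near=10, far=300):
    """Regions for UTR primers: 10 to 300 bases of genomic sequence
    outside the predicted start and end of translation."""
    upstream = (max(0, orf_start - far), max(0, orf_start - near))
    downstream = (orf_end + near, orf_end + far)
    return upstream, downstream
```

A window only bounds where a primer may fall; the melting-temperature, size, and mispriming criteria listed above would still be applied by Primer3 within it.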
Primer sequences for all experiments can be found at http://genes.cse.wustl.edu/rat-data-03/
PCR and Sequencing
All sequencing traces were submitted to the NCBI trace archive (http://www.ncbi.nlm.nih.gov/Traces/).
Sequences are also available on the auxiliary data Web site: http://genes.cse.wustl.edu/rat-data-03/
Classification of Sequencing Results
When EST_GENOME produced a spliced alignment that failed criterion 1 or 4, the alignment and sometimes the trace were inspected manually. If a convincing explanation for the poor alignment could be found, the experiment was classified as positive. Examples of convincing explanations include an experimental exon that falls into a gap in the genome assembly or a miscalled base near a splice site.
We thank the Baylor College of Medicine Human Genome Sequencing Center and the Rat Genome Sequencing Project for the genome sequence, without which this project would not have been possible. Thanks to Jeltje van Baren and James Kent for useful discussions, and to Chris Ponting and Mikhail Velikanov for suggestions about ortholog determination. D.S., M.A., and M.R.B. were partially supported by grants HG02278 from the National Human Genome Research Institute and DBI-0091270 from the National Science Foundation. R.G. and J.Q.W. were partially supported by grants from the NHGRI/NHLBI (1 U54 HG02345) and the NCI/SAIC (20XS182A). The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1959604.
3 Corresponding author.
Bairoch, A. and Apweiler, R. 2000. The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res. 28: 45-48.
Birney, E. and Durbin, R. 2000. Using GeneWise in the Drosophila annotation experiment. Genome Res. 10: 547-548.
Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78-94.
Das, M., Burge, C.B., Park, E., Colinas, J., and Pelletier, J. 2001. Assessment of the total number of human transcription units. Genomics 77: 71-78.
Don, R.H., Cox, P.T., Wainwright, B.J., Baker, K., and Mattick, J.S. 1991. "Touchdown" PCR to circumvent spurious priming during gene amplification. Nucleic Acids Res. 19: 4008.
Flicek, P., Keibler, E., Hu, P., Korf, I., and Brent, M.R. 2003. Leveraging the mouse genome for gene prediction in human: From whole-genome shotgun reads to a global synteny map. Genome Res. 13: 46-54.
Gasteiger, E., Jung, E., and Bairoch, A. 2001. SWISS-PROT: Connecting biomolecular knowledge via a protein database. Curr. Issues Mol. Biol. 3: 47-55.
Guigó, R., Dermitzakis, E.T., Agarwal, P., Ponting, C., Parra, G., Reymond, A., Abril, J.F., Keibler, E., Lyle, R., Ucla, C., et al. 2003. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc. Natl. Acad. Sci. 100: 1140-1145.
Hubbard, T., Barker, D., Birney, E., Cameron, G., Chen, Y., Clark, L., Cox, T., Cuff, J., Curwen, V., Down, T., et al. 2002. The Ensembl genome database project. Nucleic Acids Res. 30: 38-41.
Khochbin, S., Brocard, M.P., Grunwald, D., and Lawrence, J.J. 1992. Antisense RNA and p53 regulation in induced murine cell differentiation. Ann. NY Acad. Sci. 660: 77-87.
Korf, I., Flicek, P., Duan, D., and Brent, M.R. 2001. Integrating genomic homology into gene structure prediction. Bioinformatics (Suppl. 1) 17: S140-S148.
Miyajima, N., Burge, C.B., and Saito, T. 2000. Computational and experimental analysis identifies many novel human genes. Biochem. Biophys. Res. Commun. 272: 801-807.
Mott, R. 1997. EST_GENOME: A program to align spliced DNA sequences to unspliced genomic DNA. Comput. Appl. Biosci. 13: 477-478.
Parra, G., Agarwal, P., Abril, J.F., Wiehe, T., Fickett, J.W., and Guigo, R. 2003. Comparative gene prediction in human and mouse. Genome Res. 13: 108-117.
Pruitt, K.D. and Maglott, D.R. 2001. RefSeq and LocusLink: NCBI gene-centered resources. Nucleic Acids Res. 29: 137-140.
Rat Genome Sequencing Project Consortium. 2004. Genome sequence of the Brown Norway Rat yields insights into mammalian evolution. Nature (in press).
Reboul, J., Vaglio, P., Rual, J.F., Lamesch, P., Martinez, M., Armstrong, C.M., Li, S., Jacotot, L., Bertin, N., Janky, R., et al. 2003. C. elegans ORFeome version 1.1: Experimental verification of the genome annotation and resource for proteome-scale protein expression. Nat. Genet. 34: 35-41.
Rozen, S. and Skaletsky, H. 2000. Primer3 on the WWW for general users and for biologist programmers. Methods Mol. Biol. 132: 365-386.
Stenson, P.D., Ball, E.V., Mort, M., Phillips, A.D., Shiel, J.A., Thomas, N.S., Abeysinghe, S., Krawczak, M., and Cooper, D.N. 2003. Human Gene Mutation Database (HGMD(R)): 2003 update. Hum. Mutat. 21: 577-581.
http://genome.ucsc.edu/goldenPath/rnJan2003/chromosomes/; Source for rat genome build 2.1.
http://genome.ucsc.edu/goldenPath/rnJun2003/chromosomes/; Source for rat genome build 3.1.
http://www.well.ox.ac.uk/~rmott/est_genome.shtml; EST_GENOME alignment program.
http://www.ncbi.nlm.nih.gov/Traces/; NCBI Trace Archive.
http://genes.cse.wustl.edu/; TWINSCAN Web site.
http://genes.cse.wustl.edu/rat-data-03/; Auxiliary data for this paper, including sequences and accessions.
Received September 11, 2003; accepted in revised form November 17, 2003.