Published online before print
January 8, 2007, 10.1101/gr.5661407 Genome Res. 17:212-218, 2007 ©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07 $5.00
Methods
A tale of two templates: Automatically resolving double traces has many applications, including efficient PCR-based elucidation of alternative splices
1 Laboratory for Computational Genomics and Department of Computer Science, Washington University, St. Louis, Missouri 63130, USA; 2 Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, Connecticut 06620-8103, USA; 3 Department of Biology, Washington University, St. Louis, Missouri 63130, USA
Trace Recalling is a novel method for deconvoluting double traces that result from simultaneously sequencing two DNA templates. Trace Recalling identifies up to two bases at each position of such a trace. The resulting ambiguity sequence is aligned to the genome, identifying one template sequence. A second template sequence is then inferred from this alignment. This technique makes possible many biological applications. Here we present two of them: finding alternate splices and elucidating multiple insertion sites in a random insertional mutagenesis library. Our results demonstrate that RT-PCR followed by Trace Recalling is a more efficient and cost-effective way to find alternate splices than traditional methods. We also present a method for mapping double-insertion events in a random insertional mutagenesis library.
During normal Sanger sequencing (Sanger et al. 1977
Double traces occur in a number of biological and biotechnological applications and have been observed since the early days of fluorescent dye sequencing (Gibbs et al. 1990
Double traces have also been observed in random insertional mutagenesis experiments. Recently a collection of 127,760 knockout Arabidopsis thaliana lines was created by using Agrobacterium T-DNA (Alonso et al. 2003
We have developed a method to analyze double traces that we call Trace Recalling. Trace Recalling works by recasting the de novo base-calling problem as a database search and alignment problem. Current base-calling programs are designed for the de novo sequencing of completely unknown sequences such as are encountered in a whole-genome sequencing project. Since the completion of sequencing projects for Homo sapiens (Lander et al. 2001
Trace Recalling algorithm The inputs to Trace Recalling are a chromatogram and a reference genomic sequence. The chromatogram is processed with the base caller PHRED (Ewing et al. 1998
Automatic detection of alternate splice double traces
One application of Trace Recalling is screening traces from direct sequencing of RT-PCR products for evidence of alternative splicing. Prior knowledge of the boundaries of alternate splices is not required for this application; they are discovered through the primary and secondary alignments. Our software compares the primary and secondary alignments obtained from each trace. If the primary and secondary alignments are identical, there is no evidence of alternative splicing. If there is no positive-scoring secondary alignment, the recalled sequence is most likely noise and there is no evidence of alternative splicing. If the two alignments overlap but differ in their internal exon-intron structure, they are further analyzed for several types of alternative splices: alternate exon, clean alternate exon or cassette exon (an alternate exon in which the boundaries of the adjacent exons are identical in both alignments), alternate 5' or 3' splice sites, or retained intron. Details of the method used to compare the two alignments are presented in the Methods section. If none of these alternate splice forms is found and the alignments are not identical, there is no evidence of an alternate splice.
The peak area ratio threshold is used to discriminate between peaks from a second template and noise peaks. If an alternate splice is present and the threshold is set too high, the alternate splice is not detected: too few peaks from the secondary template pass the threshold, so the recalled sequence does not align properly to the secondary isoform of the gene. If the threshold is set too low, noise peaks obscure the portion of the trace that is the same in both templates, including the portion between the sequencing primer and the locus where the two templates diverge. As a result, that portion of the trace fails to align, and the alternate splice is again not detected.
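The role of the peak area ratio threshold can be illustrated with a small sketch. This is not the published implementation: the input layout (one dictionary of per-base peak areas per trace position) and the function names `recall_position` and `recall_sequence` are assumptions made for illustration, and only the two strongest peaks at each position are considered. A secondary peak is kept when its area is at least the threshold fraction of the primary peak's area, and the two bases are then written as a standard IUPAC ambiguity code.

```python
from fractions import Fraction

# Standard IUPAC codes for unordered pairs of distinct bases.
IUPAC_PAIR = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}

def recall_position(peak_areas, threshold):
    """Call one position from a dict mapping base -> peak area (hypothetical input)."""
    ranked = sorted(peak_areas.items(), key=lambda item: item[1], reverse=True)
    (primary, primary_area), (secondary, secondary_area) = ranked[0], ranked[1]
    # Keep the secondary peak only if it reaches the threshold fraction
    # of the primary peak area; otherwise treat it as noise.
    if secondary_area > 0 and secondary_area >= threshold * primary_area:
        return IUPAC_PAIR[frozenset((primary, secondary))]
    return primary

def recall_sequence(positions, threshold):
    """Build the ambiguity sequence for a whole trace."""
    return "".join(recall_position(p, threshold) for p in positions)

# Toy trace: position 2 carries a strong second peak, position 3 a weak one.
trace = [
    {"A": 100, "C": 3, "G": 0, "T": 0},
    {"A": 0, "C": 80, "G": 0, "T": 40},
    {"A": 10, "C": 0, "G": 90, "T": 0},
]
print(recall_sequence(trace, Fraction(1, 2)))   # weak peak at position 3 dropped
print(recall_sequence(trace, Fraction(1, 20)))  # weak peak at position 3 kept
```

In this toy trace, a high threshold discards the weak second peak at the third position, while a low threshold keeps it, which is the trade-off described above.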
Because setting the threshold either too high or too low results in failure to detect true alternate splices rather than erroneous splice detection, there is no harm in trying several thresholds. Thus, to find alternate splices, we run Trace Recalling with several different thresholds (1/2, 1/3, ..., 1/20) and analyze the results as a group. If the results are consistent with the above-described pattern, the gene is marked as alternately spliced. A visualization of this procedure is presented in Supplemental Figure 8 for several traces in the experiment described below.
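A minimal sketch of such a threshold sweep follows. The exact rule for deciding that a group of results is "consistent with the pattern" is not spelled out in this text, so the rule used here is an assumption: detections must form one unbroken window of thresholds, with failures allowed only at the high and low ends. The callback `detect_at_threshold` is a hypothetical stand-in for a full Trace Recalling run at one threshold.

```python
from fractions import Fraction

# The thresholds named in the text: 1/2, 1/3, ..., 1/20 (high to low).
THRESHOLDS = [Fraction(1, n) for n in range(2, 21)]

def detections_form_one_window(detected):
    """True if the True entries of `detected` are non-empty and contiguous.

    Assumed reading of the expected pattern: misses at thresholds that are
    too high, hits in between, misses at thresholds that are too low.
    """
    hits = [i for i, flag in enumerate(detected) if flag]
    if not hits:
        return False
    return hits[-1] - hits[0] + 1 == len(hits)

def call_alternate_splice(detect_at_threshold):
    """Run the (hypothetical) per-threshold detector over all thresholds."""
    detected = [bool(detect_at_threshold(t)) for t in THRESHOLDS]
    return detections_form_one_window(detected)

# Toy detector: the splice is found only for thresholds between 1/10 and 1/4.
found = call_alternate_splice(lambda t: Fraction(1, 10) <= t <= Fraction(1, 4))
print(found)
```

Because a wrong threshold produces a miss rather than a false call, a sweep of this kind costs only extra computation.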
Alternate splicing experiment
We analyzed another set of possible alternate splice-derived double traces that were generated as part of the Mammalian Gene Collection (MGC) project (Gerhard et al. 2004
Random insertional mutagenesis experiment
We carried out an experiment to examine a subset of the 1609 Arabidopsis lines, in which we predict two T-DNA insertion events on different chromosomes. Of these, 66 were selected for validation. In all but one of these, one of the two predicted insertion sites coincided with the insertion site predicted by the previously published method of BLASTing the original read (without ambiguity codes) against the genome (Alonso et al. 2003 Amplification of a predicted insertion site alone provides evidence that the prediction was correct, since the primer in the flanking genomic sequence was chosen to be unique in the Arabidopsis genome. Sequencing of the PCR product and alignment of that sequence to the predicted locus provides stronger evidence that the prediction was correct. Therefore, in our analysis we define two types of confirmation, confirmation by amplification and confirmation by alignment. In the former, a T-DNA/genome junction is amplified, but the genomic sequence does not align well to the genome. In the latter, the genomic sequence does align well to the predicted genome location. Of the 132 individual insertion events tested, 59 were confirmed by alignment, 18 were confirmed by amplification, 14 aligned to an unexpected locus, and another 41 showed no evidence of amplification. In 17 of the plant lines tested, both predicted insertion events were verified by alignment. In another seven lines, the primary insertion event was verified by alignment and the secondary insertion event was verified by amplification. Thus, we were able to verify 36% of the tested double-insertion events. An analysis of predicted multiple insertions that were not verified is presented in Table 1.
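The counts reported above can be checked with a few lines of arithmetic. The variable names are illustrative; every number is taken from the paragraph above.

```python
# Outcomes for the individual insertion events (two per tested line).
event_outcomes = {
    "confirmed by alignment": 59,
    "confirmed by amplification": 18,
    "aligned to an unexpected locus": 14,
    "no evidence of amplification": 41,
}

lines_tested = 66
events_tested = 2 * lines_tested  # two predicted insertions per line

# The four outcome categories should account for every tested event.
assert sum(event_outcomes.values()) == events_tested

# Lines counted as verified double insertions.
both_by_alignment = 17
alignment_plus_amplification = 7
verified_lines = both_by_alignment + alignment_plus_amplification
verified_fraction = verified_lines / lines_tested

print(f"{verified_lines}/{lines_tested} lines verified = {verified_fraction:.0%}")
```

The four categories sum to the 132 events tested, and 24 of 66 lines gives the 36% quoted in the text.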
We have demonstrated that Trace Recalling can find alternate splices in single-pass reads of RT-PCR products. In a controlled experiment, we tested the ability of Trace Recalling to deconvolute double traces created by sequencing the RT-PCR products of two isoforms of a gene simultaneously. By examining only two reads per target, Trace Recalling was able to correctly identify both isoforms in a majority (25) of the targets that were previously known to be alternately spliced. In contrast, the cloning and sequencing strategy required much greater effort (cloning and sequencing multiple inserts per target), yet it identified fewer than half as many known alternately spliced targets. Trace Recalling also predicts a potentially novel optional exon in this experiment.
RT-PCR products were also run on a gel in an attempt to quantify the number of templates present after amplification (Supplemental Fig. 2). In six of the missed targets, only one band appeared on the gel. These likely represented cases in which the second isoform was not expressed in the tissues tested. In seven of the missed targets, there were three or more bands. The excess bands may be nonspecific amplification or additional isoforms; in either case, it is likely that Trace Recalling was confused by the extra peaks. Finally, there were 10 targets in which exactly two bands were present and Trace Recalling failed to detect alternate isoforms. Examination of the traces indicates that in six cases there is a very low concentration of the secondary template. In another two cases, two sets of peaks are visible but they are shifted with respect to each other, and so the secondary template is lost. Finally, in two other cases, three sets of peaks are clearly present even though only two bands are visible on the gel. The most serious limitation of Trace Recalling appears to be secondary template peaks getting lost in noisy traces.
Trace Recalling makes possible a protocol in which most alternate splices are elucidated by one or two sequencing reactions followed by a targeted experiment to confirm the alternate splice. Our experiments suggest that getting similar results by blindly cloning and sequencing from a pool of RT-PCR products would require many more sequencing reactions. Sequencing full-length cDNA clones is still required for determining the global structures of the isoforms when more than one region shows alternative splicing, but Trace Recalling can determine the local structure with greater sensitivity and at a much lower cost. We also demonstrated the application of this method to a high-throughput RT-PCR project. We have identified a large set of MGC targets that are likely to be alternately spliced, enhancing the value of those experiments.
We have also demonstrated that Trace Recalling can be used to screen for multiple insertion sites in a random insertional mutagenesis library. We found that a substantial fraction of the hypotheses generated for double-insertion sites can be verified by PCR and sequencing. Trace Recalling was used to screen a set of 38,033 traces generated as part of an Arabidopsis mutagenesis library, yielding 1609 traces thought to represent double-insertion events. We experimentally tested 66 of these lines, and by our most stringent classification (confirmation by alignment), 17 lines were shown to contain inserts in the predicted locations. By a less stringent classification (confirmation by amplification), another seven lines were shown to contain inserts at the predicted locations. Experimental verification of predicted multiple insertion sites is required, but Trace Recalling provides a method to design these verification experiments. The only other available method for identifying multiple insertion sites is the one used by the authors of the Arabidopsis study: blindly resequencing the tag sequences used for mapping.
Sometimes this results in the secondary tag being more pronounced in a different trace, allowing the identification of the secondary insertion site. Trace Recalling has difficulty with several types of traces, including those in which the secondary template signal is near the level of the noise in the trace. Trace Recalling ignores secondary peaks <1/20th the area of the primary peak, since in practice these very small secondary peaks are almost always noise. While this threshold makes it impossible to detect very weak secondary templates, it significantly reduces problems associated with the background noise present in all traces. This does not appear to be a limiting factor in the analysis of the MGC traces (Supplemental Fig. 9). Another difficulty occurs when slight differences in mobility rates of DNA molecules cause the peaks to become off register or out of phase. This interferes with Trace Recalling in two ways. First, it becomes difficult for PHRED to correctly identify secondary peaks. Second, it causes many single base-pair gaps in both alignment stages. Such gaps are heavily penalized to reduce spurious alignments. This often results in the loss or premature truncation of the alignment. Another problem arises when more than two templates of nearly equal concentration are simultaneously sequenced. If this occurs, peaks from the secondary and tertiary templates become interleaved in the recalled sequence, interfering with the secondary alignment step. In addition to the applications explored here, Trace Recalling may be useful in other applications where double traces are encountered. For example, direct sequencing of genomic amplicons from an individual heterozygous for an indel polymorphism results in a double trace. This is because the chromosome sequences are shifted with respect to each other by the length of the indel. 
The ABI KB base caller is designed to handle this circumstance for short indels by attempting to shift the sequence represented in the trace with respect to itself looking for matches between the shifted sequences. This strategy, however, is limited to analyzing indels no longer than 15 bp (ABI representative, pers. comm.). Trace Recalling has no length restriction, since the reference genome sequence could be used to deconvolute such traces. As evidence of this, the alternate splicing work presented here demonstrates that Trace Recalling can handle gaps the size of introns that are often many kilobases in length. Another potential future application for Trace Recalling involves whole-genome shotgun resequencing. Double traces could be generated by sequencing two pooled clones in the same reaction, and the reference genome sequence could be used to deconvolute them, yielding two reads per sequencing reaction.
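The heterozygous indel case described above can be made concrete with a toy simulation. It is a sketch under simplifying assumptions that do not come from the source: both alleles give equally strong peaks at every position, there is no noise, and the sequences are invented. The function name `double_trace` is hypothetical.

```python
# Standard IUPAC codes: single bases map to themselves, pairs to ambiguity codes.
IUPAC = {frozenset(base): base for base in "ACGT"}
IUPAC.update({
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
})

def double_trace(allele_a, allele_b):
    """Superimpose two template sequences position by position."""
    length = min(len(allele_a), len(allele_b))
    return "".join(
        IUPAC[frozenset((allele_a[i], allele_b[i]))] for i in range(length)
    )

# Invented reference allele and an allele carrying a 2-bp deletion.
reference_allele = "ACGTACGGTCA"
deletion_allele = reference_allele[:4] + reference_allele[6:]

# Upstream of the indel the alleles agree; downstream they are out of
# register by the indel length, so most positions show two bases.
print(double_trace(reference_allele, deletion_allele))  # ACGTRSKSW
```

The clean prefix followed by a run of ambiguity codes is the signature that a reference-guided method can exploit, whatever the length of the indel.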
Computational methods
Automatic identification of traces resulting from alternate splicing
Method for detecting alternate splices at a single threshold
GTF (a standard format for recording genome annotations) files are created from the primary and secondary alignments of Trace Recalling, representing the coordinates of exons in these alignments in the coordinate system of the reference genomic sequence. These GTF files are used to create an "indicator string" for the pair of alignments. For each base in the genomic reference sequence, if it is contained in an exon from both alignments, the corresponding position of the indicator string is set to "2". If the position is covered by an exon in only one alignment, the indicator string position is set to "1". Finally, if the position is not covered by exons in either alignment, the indicator string position is set to "0". Alternate splices appear as matches to certain regular expressions in the indicator string. The Perl regular expressions used are:
Clean alternate exon: 2+0+1+0+2+
Alternate exon: 2+1+0+1+0+2+ or 2+0+1+0+1+2+
Alternate splice site: 2+1+0+2+ or 2+0+1+2+
Retained intron: 0+2+1+2+1*$ or ^1*2+1+2+0+ or ^1*2+1+2+1*$ or 0+2+1+2+0+
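The indicator-string method can be sketched in a few lines. This is an illustration rather than the authors' Perl code: exon coordinates are passed directly as 1-based inclusive (start, end) pairs instead of being read from GTF files, the function names are hypothetical, and the `^` start-of-string anchor in two of the retained-intron patterns is an assumed reading of a character that did not survive extraction cleanly. The patterns themselves behave the same way in Python's `re` module as in Perl.

```python
import re

# Regular expressions from the text, grouped by splice type.
SPLICE_PATTERNS = {
    "clean alternate exon": [r"2+0+1+0+2+"],
    "alternate exon": [r"2+1+0+1+0+2+", r"2+0+1+0+1+2+"],
    "alternate splice site": [r"2+1+0+2+", r"2+0+1+2+"],
    "retained intron": [
        r"0+2+1+2+1*$", r"^1*2+1+2+0+", r"^1*2+1+2+1*$", r"0+2+1+2+0+",
    ],
}

def indicator_string(genome_length, primary_exons, secondary_exons):
    """Count, per genomic base, how many of the two alignments cover it."""
    coverage = [0] * genome_length
    for exons in (primary_exons, secondary_exons):
        covered = set()
        for start, end in exons:  # 1-based, inclusive, as in GTF
            covered.update(range(start, end + 1))
        for position in covered:
            coverage[position - 1] += 1
    return "".join(str(count) for count in coverage)

def classify(indicator):
    """Return the splice types whose patterns match the indicator string."""
    return [
        name
        for name, patterns in SPLICE_PATTERNS.items()
        if any(re.search(pattern, indicator) for pattern in patterns)
    ]

# Toy example: the secondary alignment skips the middle exon.
primary = [(1, 10), (21, 30), (41, 50)]
secondary = [(1, 10), (41, 50)]
indicator = indicator_string(50, primary, secondary)
print(indicator)
print(classify(indicator))  # ['clean alternate exon']
```

Reducing the two alignments to a string over three symbols turns splice classification into plain pattern matching, so supporting a new splice type only requires adding a pattern.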
Method for calling alternate splices using different thresholds
Controlled alternate splice experiment
Alignment parameters used for Trace Recalling
Secondary alignment parameters: Match score, 1; Mismatch penalty, 1; Gap penalty, 2; Splice penalty, 20; Intron penalty, 40.
MGC experiment
Random insertional mutagenesis experiment
Second BLAST the cleaned recalled sequence against whole genome using the parameter string: cpus 1 kap wink 1 hspmax 0 W 6. Default match, mismatch, and gap penalties were used.
EST_GENOME alignment parameters used to clean recalled sequence
Alignment parameters used for Trace Recalling
Secondary alignment parameters: Match score, 1; Mismatch penalty, 1; Gap penalty, 2; Splice penalty, 100,000; Intron penalty, 100,000.
Biological methods
For each sample, After the 48 h, the plates were placed into a 25°C, 16-h light incubator for 5 d. After 5 d, a single leaf was cut from each surviving seedling. Leaves from each plate were pooled into a single tube (one per plate) and immediately immersed in liquid nitrogen. Using a small pestle, each collection was ground in liquid nitrogen and, after evaporation, further ground in the genomic filter extraction buffer (0.2 M Tris-HCl at pH 9.0, 0.4 LiCl, 25 mM EDTA, and 1% SDS). The samples were then centrifuged at high speed in a tabletop centrifuge, and the supernatant was added to an equal part isopropanol, mixed, and again centrifuged at high speed to pellet the DNA. The liquid was decanted and the tubes were allowed to dry. DNA was resuspended in 400 µL of TE buffer, and 2 µL was used for the initial PCR.
PCR and sequencing for alternate splice and MGC experiments
We thank Joseph Ecker for providing traces for the Arabidopsis random insertional mutagenesis experiment. We also thank Randall Brown and Jeremy Buhler for many helpful discussions about the Trace Recalling algorithm and Beth Frazier for her valuable biological insights. We are grateful to Richard Gibbs and John McPherson for encouragement, interesting discussions, and supervision of the alternate splice experiment. M.R.B. was supported in part by R01 HG02278 and in part by the National Cancer Institute for the Mammalian Gene Collection project under Contract No. N01-CO-12400. A.T. was supported in part by T32 grant HG000045. P.K. was supported in part by funds provided by Washington University, Department of Biology. The methods here are covered under patent application PCT/US05/003681.
5 Corresponding author.
E-mail brent{at}cse.wustl.edu; fax (314) 935-7302.
4 These two authors contributed equally to this work.
[Supplemental material is available online at www.genome.org. Alternate splice forms discovered during this work have been deposited in GenBank under accession nos. EB71062-EB710342.]
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.5661407
Alonso, J.M., Stepanova, A.N., Leisse, T.J., Kim, C.J., Chen, H., Shinn, P., Stevenson, D.K., Zimmerman, J., Barajas, P., and Cheuk, R., et al. 2003. Genome-wide insertional mutagenesis of Arabidopsis thaliana. Science 301: 653-657.
Don, R.H., Cox, P.T., Wainwright, B.J., Baker, K., and Mattick, J.S. 1991. Touchdown PCR to circumvent spurious priming during gene amplification. Nucleic Acids Res. 19: 4008.
Ewing, B., Hillier, L., Wendl, M.C., and Green, P. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8: 175-185.
Garsin, D.A., Urbach, J., Huguet-Tapia, J.C., Peters, J.E., and Ausubel, F.M. 2004. Construction of an Enterococcus faecalis Tn917-mediated-gene-disruption library offers insight into Tn917 insertion patterns. J. Bacteriol. 186: 7280-7289.
Gerhard, D.S.L., Wagner, E.A., Feingold, C.M., Shenmen, L.H., Grouse, G., Schuler, S.L., Klein, S., Old, R., Rasooly, P., and Good, M., et al. 2004. The status, quality, and expansion of the NIH full-length cDNA project: the Mammalian Gene Collection (MGC). Genome Res. 14: 2121-2127.
Gibbs, R.A., Nguyen, P.N., Edwards, A., Civitello, A.B., and Caskey, C.T. 1990. Multiplex DNA deletion detection and exon sequencing of the hypoxanthine phosphoribosyltransferase gene in Lesch-Nyhan families. Genomics 7: 235-244.
Kent, W.J. 2002. BLAT: the BLAST-like alignment tool. Genome Res. 12: 656-664.
Lander, E.S.L.M., Linton, B., Birren, C., Nusbaum, M.C., Zody, J., Baldwin, K., Devon, K., Dewar, M., Doyle, W., and FitzHugh, R., et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921.
Lee, M.S., Dougherty, B.A., Madeo, A.C., and Morrison, D.A. 1999. Construction and analysis of a library for random insertional mutagenesis in Streptococcus pneumoniae: Use for recovery of mutants defective in genetic transformation and for identification of essential genes. Appl. Environ. Microbiol. 65: 1883-1890.
Mikkelsen, T.S., Hillier, L.W., Eichler, E.E., Zody, M.C., Jaffe, D.B., Yang, S.-P., Enard, W., Hellmann, I., Lindblad-Toh, K., and Altheide, T.K., et al. 2005. Initial sequence of the chimpanzee genome and comparison with the human genome. Nature 437: 69-87.
Mott, R. 1997. EST_GENOME: A program to align spliced DNA sequences to unspliced genomic DNA. Comput. Appl. Biosci. 13: 477-478.
Sanger, F., Nicklen, S., and Coulson, A.R. 1977. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. 74: 5463-5467.
Waterston, R.H.K., Lindblad-Toh, E., Birney, J., Rogers, J.F., Abril, P., Agarwal, R., Agarwala, R., Ainscough, M., Alexandersson, P., and An, S.E., et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520-562.
Wu, J.Q., Shteynberg, D., Arumugam, M., Gibbs, R.A., and Brent, M.R. 2004. Identification of rat genes by TWINSCAN gene prediction, RT-PCR, and direct sequencing. Genome Res. 14: 665-671.
Received June 19, 2006; accepted in revised form November 22, 2006.