Published online before print
November 29, 2006, 10.1101/gr.5488207 Genome Res. 17:108-116, 2007 ©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07 $5.00
Methods
Large-scale production of SAGE libraries from microdissected tissues, flow-sorted cells, and cell lines
1 Canada's Michael Smith Genome Sciences Centre, BC Cancer Research Centre, BC Cancer Agency, Vancouver, British Columbia V5Z 4S6, Canada; 2 Genome British Columbia, Vancouver, British Columbia V5Z 1C6, Canada; 3 Terry Fox Laboratory, BC Cancer Research Centre, BC Cancer Agency, Vancouver, British Columbia V5Z 1L3, Canada
We describe the details of a serial analysis of gene expression (SAGE) library construction and analysis platform that has enabled the generation of >298 high-quality SAGE libraries and >30 million SAGE tags primarily from sub-microgram amounts of total RNA purified from samples acquired by microdissection. Several RNA isolation methods were used to handle the diversity of samples processed, and various measures were applied to minimize ditag PCR carryover contamination. Modifications in the SAGE protocol resulted in improved cloning and DNA sequencing efficiencies. Bioinformatic measures to automatically assess DNA sequencing results were implemented to analyze the integrity of ditag structure, linker or cross-species ditag contamination, and yield of high-quality tags per sequence read. Our analysis of singleton tag errors resulted in a method for correcting such errors to statistically determine tag accuracy. From the libraries generated, we produced an essentially complete mapping of reliable 21-base-pair tags to the mouse reference genome sequence for a meta-library of 5 million tags. Our analyses led us to reject the commonly held notion that duplicate ditags are artifacts. Rather than the usual practice of discarding such tags, we conclude that they should be retained to avoid introducing bias into the results and thereby maintain the quantitative nature of the data, which is a major theoretical advantage of SAGE as a tool for global transcriptional profiling.
Serial analysis of gene expression (SAGE) offers a particularly attractive technology for profiling eukaryotic transcriptomes (Velculescu et al. 1995).
SAGE is among the few relatively accessible digital gene expression profiling technologies capable of generating comprehensive transcriptome profiles. Nevertheless, challenges associated with laborious library construction and generally limited access to inexpensive automated DNA sequencing have restricted its application to large-scale initiatives. Experiments at our Genome Center (Smailus et al. 2005
Laboratory design
Laboratory space and workflow were designed to limit the potential for PCR cross-contamination. In particular, we segregated pre-ditag work from ditag and post-ditag work. We adopted a policy of single-use aliquots for reagents, exclusive reliance on disposable plasticware and protective apparel, daily decontamination routines, and species-specific work areas. An effective biochemical measure for limiting cross-species contamination was the design and implementation of species-specific LongSAGE adapters. PCR primers corresponding to each adapter pair were designed to be incapable of amplifying ditags from any other adapter pair. We relied on the I-SAGE Long kit (Invitrogen) as the primary source of quality-tested and modularized reagents. These were supplemented as required with materials from suppliers of standard molecular biology equipment and reagents.
Tissue collection and RNA extraction
Removal of contaminating genomic DNA was performed with Ambion's DNA-free reagent and protocol, a method that does not require subsequent organic extraction, alcohol precipitation, heating, or the addition of EDTA to the DNase-treated RNA sample. Protocols requiring the latter conditions sometimes yielded degraded RNA following DNase treatment, possibly due to the activation of residual ribonucleases still present after RNA extraction.
Assessment of RNA quality
LongSAGE library construction
Construction of "standard" libraries initially required 250 µg of DNase-treated total RNA. mRNA was captured using oligo(dT) magnetic beads followed by synthesis of double-stranded cDNA using SuperScript II reverse transcriptase (Invitrogen), RNase H, and Escherichia coli DNA polymerase (Invitrogen). The resulting bead-bound cDNA was digested with the anchoring enzyme NlaIII (Invitrogen), and the product was divided into two fractions for separate ligation of two adapters with 4-bp overhangs complementary to NlaIII digestion products. The adapter-cDNA ligation products were digested with the type IIS tagging enzyme MmeI (NEB), releasing adapter-tag products with 2-nucleotide (nt) overhangs. The two adapter-tag fractions were then ligated to form
Colony picking was performed using a Q-Pix robot (Genetix), and inoculations were made into 2x YT media with 50 µg/mL Zeocin and 7.5% glycerol. Following overnight culture, glycerol stocks were used to inoculate larger-volume cultures for plasmid preparation using a standard alkaline lysis procedure adapted for high-throughput processing with microtiter plates (Yang et al. 2005
Incorporation of published SAGE protocol modifications
Sub-microgram LongSAGE
Construction of SAGE-Lite libraries
Detection of cross-species ditag contamination
Measuring duplicate ditag frequency
Error analysis and correction
The error rate was equivalent to the frequency at which off-by-one tag sequences of highly abundant tags occurred in a library (where off-by-one tag sequences are defined as containing a single-base substitution, insertion, or deletion relative to a highly abundant tag sequence). Using the QF, we were able to identify tag sequences likely to originate from a sequencing error. By removing those tag sequences and measuring the frequency of single-base errors in the remaining tag sequences, we were able to determine the frequency of single-base errors introduced prior to sequencing (the "library construction" error rate).
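The off-by-one definition above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function names, the `min_abundance` cutoff, the length-preserving treatment of insertions and deletions, and the toy counts are all assumptions made for illustration, and the quality-factor (QF) filtering step is not modeled.

```python
from collections import Counter

BASES = "ACGT"

def off_by_one(tag):
    """Same-length sequences one substitution, insertion, or deletion from tag.

    Assumption: tags are fixed-length, so an insertion pushes the last base
    out of the tag and a deletion pulls one unknown base in at the 3' end.
    """
    n = len(tag)
    variants = set()
    for i in range(n):
        for b in BASES:
            variants.add(tag[:i] + b + tag[i + 1:])        # substitution at i
            variants.add((tag[:i] + b + tag[i:])[:n])      # insertion before i
            variants.add(tag[:i] + tag[i + 1:] + b)        # deletion of base i
    variants.discard(tag)
    return variants

def off_by_one_error_rate(counts, min_abundance=100):
    """Fraction of counts around abundant tags that sit in off-by-one neighbors."""
    abundant = {t for t, c in counts.items() if c >= min_abundance}
    neighbors = set()
    for tag in abundant:
        neighbors |= off_by_one(tag)
    # Neighbors that are themselves abundant are treated as real transcripts.
    neighbors = {t for t in neighbors if t in counts and t not in abundant}
    parent_total = sum(counts[t] for t in abundant)
    neighbor_total = sum(counts[t] for t in neighbors)
    denom = parent_total + neighbor_total
    return neighbor_total / denom if denom else 0.0

# Toy library (hypothetical 8-base tags for readability).
toy_counts = Counter({"CATGAAAA": 98, "CATGAAAT": 1, "CATGAAAC": 1, "CATGTTTT": 5})
print(off_by_one_error_rate(toy_counts, min_abundance=50))
```

Running the same calculation before and after removing low-quality tag sequences would separate a sequencing component from a library-construction component, which is the comparison the text describes.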
Tag-to-gene mapping
Figure 1 provides an overview of the libraries constructed using the approaches described here. As of January 2006, our platform had generated 298 libraries from four species (Supplemental Table 1), achieving a throughput of up to 12 libraries constructed per month. More than 30 million SAGE tags have been sequenced from these libraries. Fifty-eight libraries were constructed from human embryonic stem cell lines (www.transcriptomES.org) and cancer-related cell lines; 206 libraries were constructed from developing mouse tissues and cells (www.mouseatlas.org; Siddiqui et al. 2005
Among the more significant adjustments to our pipeline was the incorporation of a brief NlaIII digestion prior to size fractionation and concatemer cloning (Supplemental Fig. 1). This step, suggested by Gowda et al. (2004) 20 ShortSAGE (14-bp) tags per clone initially to the routine generation of an average of 35-40 LongSAGE (21-bp) tags per clone, or 15,000 tags per 384-well plate sequenced (Supplemental Fig. 2). Higher colony titers can thus routinely yield tens of millions of tags per library. Library construction timelines have also been improved such that most libraries can now be constructed within 11 d. Most significant from the perspective of future application of LongSAGE to characterize the transcriptomes of rare cell populations (e.g., cancer stem cells, fine-needle aspirate samples, etc.) was the reduction in the requirements of total RNA to 50 ng for the construction of regular LongSAGE libraries from non-amplified starting material. For experiments involving tissue samples where accumulation of even this small amount of RNA is not possible, we have relied primarily on SAGE-Lite (Peters et al. 1999
RNA isolation and analysis for LongSAGE library construction was successfully performed with a variety of cell and tissue types from multiple species (Supplemental Table 1), including mouse spleen and pancreas for which purification of high-quality RNA was problematic, presumably because of the presence of elevated RNase levels in these tissues. For degraded samples, subsequent repurification of RNA from tissues and the addition of broad-spectrum ribonuclease inhibitors such as SUPERase•In (Ambion) were effective in producing RNA of sufficient quality. The quality of all purified RNAs was assessed using an Agilent Bioanalyzer 2100. Even so, electropherograms indicative of a high proportion of intact RNA were insufficient to guarantee that the RNA could successfully be used to generate a SAGE library.
On occasion (for example, in the case of mouse spleen tissue), the RNA appeared intact, but analysis of the sample using a biochemical RNase assay (RNaseALERT) yielded a ribonuclease-positive score. Such samples usually degraded when subjected to incubation with DNase, indicating the need for an RNA re-extraction step and the use of ribonuclease inhibitors.
Continuous optimization of our SAGE library construction pipeline resulted in numerous incremental improvements affecting the rate of library construction. During the project, nearly a fourfold increase in libraries constructed per quarter was achieved during the peak library construction period (Supplemental Fig. 2), but not at the expense of library quality. For example, early efforts only rarely produced libraries yielding
We constructed LongSAGE libraries using RNA purified from human, mouse, zebrafish, and nematode worm cells. The diversity of species analyzed using our pipeline and the PCR-intensive nature of SAGE library construction resulted in the potential for undesirable interlibrary cross-species contamination. Unless such contamination was detected, it could yield tag sequences that would fail to match sequence resources from the species under study, leading to the erroneous conclusion that previously undiscovered "novel" transcripts had been detected in the LongSAGE analysis. To address this potential problem, we sought to design a computational screen (see Methods) that could be used to analyze an initial "quality control" 384-well plate of LongSAGE sequences prior to more extensive library sequencing. In the event that the screen detected cross-species contamination, library sequencing could be aborted at an early stage. A distribution of contamination levels for all libraries analyzed is shown in Figure 2.
An analysis of ditags was undertaken to confirm the source and nature of cross-species tag contamination. The most prevalent contaminants were ditags from previous library preparations. In such cases, the majority of ditags in a library exhibited the property that both tags in the ditag could be mapped to the correct species, while a few ditags contained tags that could both be mapped to the "contaminating" species. We were able to distinguish such cases from those in which the contamination event occurred at an early stage prior to ditag generation and PCR amplification. In these cases, ditags should be a mixture of the three possible combinations of tags (species A/A, species B/B, and species A/B). We detected only one instance in which the majority of ditags were a mixture of two species. This was subsequently traced back to a tube mix-up in the laboratory in which two different species tag-adapter solutions were accidentally combined.
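The reasoning above, that carryover of finished ditags yields only same-species ditags whereas mixing before ditag ligation also yields A/B ditags, can be sketched as follows. This is an illustrative sketch rather than the screen described in Methods: the `tag_species` lookup, the function names, and the `mixed_threshold` value are hypothetical, and each tag is assumed to map unambiguously to one species.

```python
from collections import Counter

def classify_ditags(ditags, tag_species, expected, contaminant):
    """Tally ditags by the species assignment of their two tags."""
    summary = Counter()
    for left, right in ditags:
        s1 = tag_species.get(left)
        s2 = tag_species.get(right)
        if s1 is None or s2 is None:
            summary["unmapped"] += 1
        elif s1 == s2 == expected:
            summary["expected/expected"] += 1
        elif s1 == s2 == contaminant:
            summary["contaminant/contaminant"] += 1
        elif {s1, s2} == {expected, contaminant}:
            summary["mixed"] += 1
        else:
            summary["other"] += 1
    return summary

def likely_contamination_stage(summary, mixed_threshold=0.05):
    """Guess when contamination occurred from the ditag class proportions."""
    total = sum(summary.values())
    if total == 0:
        return "no data"
    if summary["mixed"] / total >= mixed_threshold:
        # Tags from both species were present when ditags were ligated.
        return "pre-ditag"
    if summary["contaminant/contaminant"] > 0:
        # Only intact foreign ditags: carryover from another library.
        return "post-ditag"
    return "none"

# Toy example with hypothetical tag identifiers.
species_of = {"m1": "mouse", "m2": "mouse", "h1": "human", "h2": "human"}
carryover = [("m1", "m2")] * 9 + [("h1", "h2")]
mixup = [("m1", "h1")] * 5 + [("m1", "m2")] * 3 + [("h1", "h2")] * 2

for name, ditags in (("carryover", carryover), ("mixup", mixup)):
    tally = classify_ditags(ditags, species_of, "mouse", "human")
    print(name, dict(tally), likely_contamination_stage(tally))
```

In practice the threshold would need tuning, since tags shared between species and unmapped tags blur the three classes.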
Many published SAGE analyses report that duplicate ditags are discarded prior to analysis under the assumption that they represent experimental (e.g., PCR) artifacts (Dinel et al. 2005
Inspection of the cloned ditag concatemer sequences revealed that in all six cases duplicate cloned concatemers, arising from process errors in the lab or in the computational pipeline, had resulted in either the resequencing or the computational re-counting of the same cloned ditag concatemer multiple times. Such multiple concatemer sequences, in which the complete sequences of the cloned ditag inserts are identical, are now removed computationally. Given our findings, we recommend that a ditag frequency diagnostic be performed as a key QC step, but we do not recommend the routine elimination or subtraction of duplicate ditags as suggested by Dinel et al. (2005)
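A minimal sketch of this distinction follows, assuming each sequence read has already been parsed into an ordered tuple of ditags. The data layout and function names are hypothetical. Reads whose complete ditag content is identical are collapsed, duplicate ditags that occur in different concatemers are retained, and the duplicate-ditag fraction is reported as a QC diagnostic.

```python
from collections import Counter

def dedupe_concatemers(reads):
    """Keep the first read for each distinct complete concatemer.

    reads: iterable of (read_id, ditags), where ditags is an ordered tuple
    of ditag sequences parsed from one cloned insert.
    """
    seen = set()
    kept = []
    for read_id, ditags in reads:
        key = tuple(ditags)
        if key not in seen:
            seen.add(key)
            kept.append((read_id, key))
    return kept

def ditag_counts(reads):
    """Count every ditag occurrence; duplicates across reads are retained."""
    counts = Counter()
    for _, ditags in reads:
        counts.update(ditags)
    return counts

def duplicate_ditag_fraction(counts):
    """Fraction of ditag observations that repeat an already-seen ditag."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    repeats = sum(c - 1 for c in counts.values() if c > 1)
    return repeats / total

# Toy reads: r2 is a re-sequenced copy of r1; r3 shares one ditag with r1.
toy_reads = [
    ("r1", ("AAAA", "CCCC")),
    ("r2", ("AAAA", "CCCC")),
    ("r3", ("AAAA", "GGGG")),
]
unique_reads = dedupe_concatemers(toy_reads)
print([rid for rid, _ in unique_reads])
print(duplicate_ditag_fraction(ditag_counts(unique_reads)))
```

An unusually high duplicate fraction after concatemer de-duplication would flag a library for inspection rather than trigger automatic removal of the duplicated ditags.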
With increasing depth of SAGE library sequencing, we observed a monotonic increase in the number of tag sequences observed only once (singletons). The majority of these singletons are likely the result of errors arising from library construction (e.g., RT and PCR errors) and DNA sequencing. However, we anticipated that the singleton class would be enriched in rare and novel transcripts as well as artifacts, and hence sought to devise a method to distinguish between these classes of singletons. We reasoned that a high-quality singleton was more likely to represent a rare transcript than to represent an artifact. We clustered our tags in a manner similar to that of Akmaev and Wang (2004)
The DNA sequencing process yields phred (Ewing and Green 1998
Tag-to-genome mapping
A key aspect of SAGE-based gene expression analysis is the reliable mapping of tag sequences. If a tag maps to a transcript resource or to the genome, there is increased confidence that the tag is not an artifact. The P-value of a tag type in our mouse meta-library was observed to be improved if the tag mapped to a sequence resource. We found that >96% of error-corrected tag sequences observed more frequently than doubletons mapped to a known resource. Analysis of a sample of the remaining 4% that did not map showed that these generally corresponded to unannotated transcripts, tag sequences interrupted by splice junctions, and errors related to the evolving state of the genome sequence assembly. We noted that the probability of a tag mapping increases for highly expressed genes, perhaps reflecting a more complete state of annotation for such genes.
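The mapping statistics above can be computed along the following lines. This is a sketch under stated assumptions: `counts` holds error-corrected tag counts, `mapped` is the set of tag sequences that hit any transcript or genome resource, and the function names and bin edges are hypothetical choices rather than those used in the study.

```python
from bisect import bisect_right

def mapping_rate(counts, mapped, min_count=3):
    """Fraction of tag types seen at least min_count times that map.

    min_count=3 corresponds to "more frequently than doubletons".
    Returns None when no tag type meets the cutoff.
    """
    eligible = [t for t, c in counts.items() if c >= min_count]
    if not eligible:
        return None
    return sum(t in mapped for t in eligible) / len(eligible)

def mapping_rate_by_abundance(counts, mapped, edges=(1, 2, 3, 10, 100)):
    """Mapping rate per abundance bin, keyed by the bin's lower edge."""
    totals = {e: 0 for e in edges}
    hits = {e: 0 for e in edges}
    for tag, c in counts.items():
        if c < edges[0]:
            continue
        lower = edges[bisect_right(edges, c) - 1]
        totals[lower] += 1
        hits[lower] += tag in mapped
    return {e: hits[e] / totals[e] for e in edges if totals[e]}

# Toy data with hypothetical tag names.
toy_counts = {"a": 1, "b": 2, "c": 3, "d": 50, "e": 500}
toy_mapped = {"a", "d", "e"}
print(mapping_rate(toy_counts, toy_mapped))
print(mapping_rate_by_abundance(toy_counts, toy_mapped))
```

Binning by abundance makes the trend noted in the text visible directly: the per-bin rate should rise toward the high-count bins if highly expressed genes are better annotated.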
In a previous study (Siddiqui et al. 2005
Sequencing depth
To explore the relationship between sampling depth and gene discovery, we examined the extent to which sequences in the mouse Reference Sequence database (http://www.ncbi.nlm.nih.gov/RefSeq/), a well characterized transcript database, were represented by LongSAGE tags in a 12-million-tag meta-library and in a kidney library as a function of increasing tag numbers (Fig. 4). In both libraries, we observed an initial rapid increase in RefSeq coverage with increasing tag numbers. The rate of coverage then decreased. However, for all sampling depths we analyzed (Fig. 4), the kidney LongSAGE data provided less coverage of RefSeq than the meta-library. This result was consistent with intuition and the notion that the repertoire of transcripts represented in the kidney LongSAGE data was reduced compared with the diversity of transcripts represented in the meta-library. In the case of the kidney library, Figure 4B shows that the most rapid increase in the rate of RefSeq coverage occurred in the first 100,000 tags sampled. The first 400,000 tags sampled from the kidney library represented 40% of RefSeq. The next 400,000 tags sampled provided only
Conclusions
We describe a SAGE library construction pipeline that has been devised and implemented to generate high-quality digital gene expression profiling data. The pipeline incorporates many previously published improvements and synthesizes these into a standard operating procedure (SOP) suitable for use by entry-level technical staff, following a 2-wk training period. We developed and validated our SOP by using it to generate 298 libraries yielding >30 million tags over a 4-yr period with a small group of library construction technicians (five to eight people) and variable adjustments in the scale of activity to match changing demands for libraries during that period. All data and software tools for data analysis are available at www.transcriptomES.org (embryonic stem cell data), www.mouseatlas.org (mouse data), or http://elegans.bcgsc.bc.ca (C. elegans data). These protocols should be applicable to other academic centers and facilitate the exploitation of SAGE for additional gene expression analyses and gene discovery.
This work was supported by funding from Genome Canada and the National Cancer Institute (USA). We are indebted to numerous groups at Canada's Michael Smith Genome Sciences Centre, including the Administration, Projects, Operations, LIMS, and IT Systems teams, who have provided expert assistance in constructing and maintaining a large-scale SAGE library construction effort at our Genome Centre. We gratefully acknowledge the following individuals for providing RNA and tissue for library construction: Elizabeth M. Simpson, Cheryl Helgason, James Thomson, Martin Pera, Meri Firpo, Catherine Verfaille, Donald Riddle, Jim McGhee, Isabella Tai, Ralph Durand, Andrew Van Kessel, David Baillie, and Donald Moerman. M.M., P.H., R.H., and S.J. are scholars of the Michael Smith Foundation for Health Research. M.M. is a Terry Fox/NCIC Young Investigator. P.H. is a Canadian Institutes of Health Research New Investigator.
4 Present address: Department of Cancer Genetics, BC Cancer Research Centre, BC Cancer Agency, Vancouver, British Columbia V5Z 1L3, Canada
E-mail mmarra{at}bcgsc.ca; fax (604) 877-6085. [Supplemental material is available online at www.genome.org.] Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.5488207
Akmaev, V.R. and Wang, C.J. 2004. Correction of sequence-based artifacts in serial analysis of gene expression. Bioinformatics 20: 1254-1263.
Angelastro, J.M., Klimaschewski, L.P., and Vitolo, O.V. 2000. Improved NlaIII digestion of PAGE-purified 102 bp ditags by addition of a single purification step in both the SAGE and microSAGE protocols. Nucleic Acids Res. 28: E62.
Beissbarth, T., Hyde, L., Smyth, G.K., Job, C., Boon, W.M., Tan, S.S., Scott, H.S., and Speed, T.P. 2004. Statistical modeling of sequencing errors in SAGE libraries. Bioinformatics 20: i31-i39.
Bennett, S.T., Barnes, C., Cox, A., Davies, L., and Brown, C. 2005. Toward the $1000 human genome. Pharmacogenomics 6: 373-382.
Chen, J. and Sadowski, I. 2005. Identification of the mismatch repair genes PMS2 and MLH1 as p53 target genes by using serial analysis of binding elements. Proc. Natl. Acad. Sci. 102: 4813-4818.
Colinge, J. and Feger, G. 2001. Detecting the impact of sequencing errors on SAGE data. Bioinformatics 17: 840-842.
Dinel, S., Bolduc, C., Belleau, P., Boivin, A., Yoshioka, M., Calvo, E., Piedboeuf, B., Snyder, E.E., Labrie, F., and St-Amand, J. 2005. Reproducibility, bioinformatic analysis and power of the SAGE method to evaluate changes in transcriptome. Nucleic Acids Res. 33: e26.
Ewing, B. and Green, P. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8: 186-194.
Ewing, B., Hillier, L., Wendl, M.C., and Green, P. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8: 175-185.
Gowda, M., Jantasuriyarat, C., Dean, R.A., and Wang, G.L. 2004. Robust-LongSAGE (RL-SAGE): A substantially improved LongSAGE method for gene discovery and transcriptome analysis. Plant Physiol. 134: 890-897.
Halaschek-Wiener, J., Khattra, J.S., McKay, S., Pouzyrev, A., Stott, J.M., Yang, G.S., Holt, R.A., Jones, S.J., Marra, M.A., and Brooks-Wilson, A.R., et al. 2005. Analysis of long-lived C. elegans daf-2 mutants using serial analysis of gene expression. Genome Res. 15: 603-615.
Heidenblut, A.M., Luttges, J., Buchholz, M., Heinitz, C., Emmersen, J., Nielsen, K.L., Schreiter, P., Souquet, M., Nowacki, S., and Herbrand, U., et al. 2004. aRNA-longSAGE: A new approach to generate SAGE libraries from microdissected cells. Nucleic Acids Res. 32: e131.
Impey, S., McCorkle, S.R., Cha-Molstad, H., Dwyer, J.M., Yochum, G.S., Boss, J.M., McWeeney, S., Dunn, J.J., Mandel, G., and Goodman, R.H. 2004. Defining the CREB regulon: A genome-wide analysis of transcription factor regulatory regions. Cell 119: 1041-1054.
Kenzelmann, M. and Muhlemann, K. 1999. Substantially enhanced cloning efficiency of SAGE (Serial Analysis of Gene Expression) by adding a heating step to the original protocol. Nucleic Acids Res. 27: 917-918.
Kim, J., Bhinge, A.A., Morgan, X.C., and Iyer, V.R. 2005. Mapping DNA-protein interactions in large genomes by sequence tag analysis of genomic enrichment. Nat. Methods 2: 47-53.
Kirschman, J.A. and Cramer, J.H. 1988. Two new tools: Multi-purpose cloning vectors that carry kanamycin or spectinomycin/streptomycin resistance markers. Gene 68: 163-165.
Kodzius, R., Kojima, M., Nishiyori, H., Nakamura, M., Fukuda, S., Tagami, M., Sasaki, D., Imamura, K., Kai, C., and Harbers, M., et al. 2006. CAGE: Cap analysis of gene expression. Nat. Methods 3: 211-222.
Loh, Y.H., Wu, Q., Chew, J.L., Vega, V.B., Zhang, W., Chen, X., Bourque, G., George, J., Leong, B., and Liu, J., et al. 2006. The Oct4 and Nanog transcription network regulates pluripotency in mouse embryonic stem cells. Nat. Genet. 38: 431-440.
Margulies, E.H., Kardia, S.L., and Innis, J.W. 2001. Identification and prevention of a GC content bias in SAGE libraries. Nucleic Acids Res. 29: e60.
Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., Berka, J., Braverman, M.S., Chen, Y.J., and Chen, Z., et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376-380.
Matsumura, H., Reich, S., Ito, A., Saitoh, H., Kamoun, S., Winter, P., Kahl, G., Reuter, M., Kruger, D.H., and Terauchi, R. 2003. Gene expression analysis of plant host-pathogen interactions by SuperSAGE. Proc. Natl. Acad. Sci. 100: 15718-15723.
McKay, S.J., Johnsen, R., Khattra, J., Asano, J., Baillie, D.L., Chan, S., Dube, N., Fang, L., Goszczynski, B., and Ha, E., et al. 2003. Gene expression profiling of cells, tissues, and developmental stages of the nematode C. elegans. Cold Spring Harb. Symp. Quant. Biol. 68: 159-169.
Neilson, L., Andalibi, A., Kang, D., Coutifaris, C., Strauss III, J.F., Stanton, J.A., and Green, D.P. 2000. Molecular phenotype of the human oocyte by PCR-SAGE. Genomics 63: 13-24.
Peters, D.G., Kassam, A.B., Yonas, H., O'Hare, E.H., Ferrell, R.E., and Brufsky, A.M. 1999. Comprehensive transcript analysis in small quantities of mRNA by SAGE-lite. Nucleic Acids Res. 27: e39.
Powell, J. 1998. Enhanced concatemer cloning-a modification to the SAGE (Serial Analysis of Gene Expression) technique. Nucleic Acids Res. 26: 3445-3446.
Ren, B., Robert, F., Wyrick, J.J., Aparicio, O., Jennings, E.G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., and Kanin, E., et al. 2000. Genome-wide location and function of DNA binding proteins. Science 290: 2306-2309.
Saha, S., Sparks, A.B., Rago, C., Akmaev, V., Wang, C.J., Vogelstein, B., Kinzler, K.W., and Velculescu, V.E. 2002. Using the transcriptome to annotate the genome. Nat. Biotechnol. 20: 508-512.
Schroeder, A., Mueller, O., Stocker, S., Salowsky, R., Leiber, M., Gassmann, M., Lightfoot, S., Menzel, W., Granzow, M., and Ragg, T. 2006. The RIN: An RNA integrity number for assigning integrity values to RNA measurements. BMC Mol. Biol. 7: 3.
Shendure, J., Mitra, R.D., Varma, C., and Church, G.M. 2004. Advanced sequencing technologies: Methods and goals. Nat. Rev. Genet. 5: 335-344.
Siddiqui, A.S., Khattra, J., Delaney, A.D., Zhao, Y., Astell, C., Asano, J., Babakaiff, R., Barber, S., Beland, J., and Bohacec, S., et al. 2005. A mouse atlas of gene expression: Large-scale digital gene-expression profiles from precisely defined developing C57BL/6J mouse tissues and cells. Proc. Natl. Acad. Sci. 102: 18485-18490.
Smailus, D.E., Marziali, A., Dextras, P., Marra, M.A., and Holt, R.A. 2005. Simple, robust methods for high-throughput nanoliter-scale DNA sequencing. Genome Res. 15: 1447-1450.
Velculescu, V.E., Zhang, L., Vogelstein, B., and Kinzler, K.W. 1995. Serial analysis of gene expression. Science 270: 484-487.
Wei, C.L., Ng, P., Chiu, K.P., Wong, C.H., Ang, C.C., Lipovich, L., Liu, E.T., and Ruan, Y. 2004. 5' Long serial analysis of gene expression (LongSAGE) and 3' LongSAGE for transcriptome characterization and genome annotation. Proc. Natl. Acad. Sci. 101: 11701-11706.
Wei, C.L., Wu, Q., Vega, V.B., Chiu, K.P., Ng, P., Zhang, T., Shahab, A., Yong, H.C., Fu, Y., and Weng, Z., et al. 2006. A global map of p53 transcription-factor binding sites in the human genome. Cell 124: 207-219.
Yang, G.S., Stott, J.M., Smailus, D., Barber, S.A., Balasundaram, M., Marra, M.A., and Holt, R.A. 2005. High-throughput sequencing: A failure mode analysis. BMC Genomics 6: 2.
Received May 12, 2006; accepted in revised format October 3, 2006.