|
|
|
|
Published online before print
November 21, 2007, 10.1101/gr.6984908 Genome Res. 18:172-177, 2008 ©2008 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/08 $5.00 OPEN ACCESS ARTICLE
Methods Gene expression profiling by massively parallel sequencing1 Institut für Tierzucht und Genetik, Veterinärmedizinische Universität Wien, 1210 Vienna, Austria; 2 Eurofins Medigenomix GmbH, 82152 Martinsried, Germany
Massively parallel sequencing holds great promise for expression profiling, as it combines the high throughput of SAGE with the accuracy of EST sequencing. Nevertheless, until now only very limited information had been available on the suitability of the current technology to meet the requirements. Here, we evaluate the potential of 454 sequencing technology for expression profiling using Drosophila melanogaster. We show that short (< 80 bp) and long (> 300–400 bp) cDNA fragments are under-represented in 454 sequence reads. Nevertheless, sequencing of 3' cDNA fragments generated by nebulization could be used to overcome the length bias of the 454 sequencing technology. Gene expression measurements generated by restriction analysis and nebulization for fragments within the 80- to 300-bp range showed correlations similar to those reported for replicated microarray experiments (0.83–0.91); 97% of the cDNA fragments could be unambiguously mapped to the genomic DNA, demonstrating the advantage of longer sequence reads. Our analyses suggest that the 454 technology has a large potential for expression profiling, and the high mapping accuracy indicates that it should be possible to compare expression profiles across species.
Gene expression technologies have greatly matured over the past years, but it has become clear that hybridization-based approaches have obvious limitations in cross-species comparisons (Gilad et al. 2005 In this study, we evaluate the potential of 454 sequencing technology to serve as a reliable tool for expression profiling. We show that 454 sequencing technology has a biased representation of cDNA fragments with different length. However, in combination with random breakage of the cDNAs by nebulization, 454 sequencing provides an excellent tool for expression profiling. The high accuracy with which we could map the sequenced fragments onto the Drosophila melanogaster genome suggests that 454 sequencing has great potential for interspecific expression profiling.
Conceptual design Measuring gene expression by sequencing requires only that a proportion of the transcript be analyzed. We sequenced a 3' region of the cDNA to avoid potential bias due to incomplete reverse transcription of the mRNAs. We used two different approaches to evaluate the potential of 454 sequencing for expression profiling. First, we generated well-defined 3' cDNA fragments by restriction enzyme treatment (Fig. 1). Within the limitations of the available genome annotation, we could predict the expected 3' cDNA fragments, as we used a D. melanogaster strain with a fully sequenced genome. This strategy allowed us to evaluate whether fragment size affected 454 sequencing efficiency. As a second strategy, we sequenced 3' cDNA ends that were generated by random shearing of the cDNA (Fig. 1). Use of the same mRNA for both approaches allowed comparison between the two different strategies and thus a measurement for the reliability of 454 sequencing-based expression profiling.
Sequencing and mapping efficiency The first prerequisite for a reliable measure of gene expression based on absolute counts is accurate identification of the transcript corresponding to the sequence read. Totals of 11,477 and 14,570 reads with average read lengths of 114 ± 20 and 116 ± 23 bp were collected from digested (DIG) and nebulized (NEB) samples, respectively. After raw data were processed to eliminate low-quality sequences and poly(A) tails, we obtained 11,437 (DIG) and 14,512 (NEB) high-quality short expressed sequence tags (ESTs) (Table 1).
The 454-ESTs were mapped to transcripts or genomic DNA of D. melanogaster (release 5.1) using BLAST with highly stringent criteria (E < 10–10, >90% identity, >50% of read length included in the high-scoring segment pair [HSP]). About two-thirds of the ESTs could be unambiguously mapped to the database of protein-coding transcript: 97% were unambiguously mapped to the genome and 90% to annotated genes (Table 1). Thus, 7% of the ESTs were not mapped to annotated gene sequences in D. melanogaster collection despite having an unambiguous match in the genomic sequence. To test if this discrepancy could be explained by incomplete annotation of the transcripts in D. melanogaster (i.e., lack of 3' UTR, or missing isoforms), we performed a BLAST search against portions of 3' flanking sequences of 500 and 2000 bp. This improved the mapping efficiency to 95%, indicating that information on 3' UTRs is still missing or incomplete even for the well-characterized D. melanogaster genome. Further support for incomplete 3' UTR information is provided by the higher mapping efficiency of TaiI library, which harbored, on average, longer cDNA fragments and had a higher probability to overlap with the coding part of the transcript. Five percent of the ESTs that were not mapped to the transcript database could be located on intronic regions, suggesting the presence of new isoforms. We also found that 3%–6% of the hits to the transcript database consisted of antisense transcripts from 5%–6% of the genes sampled (Table 1).
Assessment of biases in transcript representation
As 454 sequencing reads are frequently too short to cover the entire 3'cDNA fragment, we estimated the fragment length by matching the sequence read to the transcripts and determining the number of bases between the first base of the alignment and the 3' end of the reference transcript. While the predicted 3' cDNA fragment-length distribution differed among the three restriction enzymes used, we consistently observed a striking difference between the expected and observed distributions. For all enzymes tested, ESTs shorter than
Nebulization success The undesirable effect of this apparent size bias in the 454 sequencing could be overcome if every transcript had a similar distribution of fragment sizes. Thus, randomly breaking cDNA fragments should overcome the size bias, as it affects all transcripts similarly. Shearing of DNA fragments by high-pressure nitrogen (nebulization) is frequently used to produce short DNA fragments for sequencing (Surzycki 2000 We tested for a potential effect of cDNA length on nebulization efficiency. As for the DIG library, we estimated the 3' cDNA fragment size by extending the aligned 454 sequencing ESTs to the 3' end of the transcript and compared the distribution of the inferred fragment sizes among different cDNA length classes. Despite covering a wide range of size classes, we found the mean size of the nebulized cDNA fragments to be very similar among cDNAs of different length (Fig. 3). We further scrutinized the nebulization pattern by analyzing highly expressed genes for which at least 30 sequences were available. Genes that are not spliced and that have similar transcript lengths were found to have similar fragment sizes (data not shown). This observation suggests that there is no apparent effect of the DNA sequence on the nebulization procedure, but more data are required to corroborate this. Nevertheless, even if nebulization were to cause some differences between cDNA fragments, they may not translate into a biased measurement of gene expression due to the relatively broad size range for which 454 sequencing quantitatively operates.
Cross-method consistency The above analyses suggested that 454 sequencing could be used for expression profiling when cDNAs are nebulized. To further validate this approach, we compared the results of the nebulized cDNAs to those obtained from cDNAs treated with restriction enzymes. When the expression levels of nebulized library were compared with the different digested libraries, the correlation coefficients ranged from 0.71 to 0.77 (Table 2). A recent study comparing the reproducibility for different microarray platforms reported intra-platform correlation coefficients ranging from 0.68 to 0.95 (Kuo et al. 2006
Despite the large potential of 454 sequencing technology for transcriptome analysis, so far only a limited number of approaches using this technique have been published. One approach involved a modification of the paired-end ditagging (PET) technique (Ng et al. 2006 20 bp of each full-length transcript are simultaneously extracted and covalently-linked into the paired-end ditag. In the second approach, DeepSAGE (Nielsen et al. 2006
In our proof-of-principle study using D. melanogaster, we showed that the sequencing of randomly sheared 3' cDNA provides a very good alternative to the previously suggested approaches for expression profiling using 454 sequencing technology. While it would be also possible to sequence full-length cDNAs (Bainbridge et al. 2006
As in a previous report using another technique of massively parallel sequencing (Chen and Rattray 2006
One difference of the 454 sequencing technology to other massively parallel sequencing techniques is the generation of longer read lengths. We evaluated the effect of read length on the mapping efficiency by truncating the obtained 454 reads to 20, 50, and 100 bp. Short read lengths result in many HSPs with scores very similar to the best one. About 20% of the 20-bp fragments had at least two perfect matches in the D. melanogaster genome (Fig. 4), whereas 50- and 100-bp fragments had substantially increased mapping accuracies, resulting in only 3% and 0.5% ambiguously mapped fragments, respectively. Furthermore, the difference in bit scores between the best and second-best hits is much more pronounced for longer fragments. Hence, as expected, longer fragments result in a higher proportion of unambiguously mapped sequences. In the presence of sequence polymorphism, the mapping of short sequence reads to the corresponding genes becomes an even more challenging task and introduces considerable uncertainty. For well-annotated genomes, restricting the analysis to the transcriptome rather than the genome can compensate to some extent for the low mapping accuracy of shorter sequence reads. This is a widely used approach for SAGE analyses, but for poorly annotated genomes this strategy is not efficient. For example, a recent gene expression study in D. pseudoobscura could map only 27% of the SAGE tags to transcripts (Metta et al. 2006
RNA isolation and cDNA synthesis The D. melanogaster genome strain (y; cn bw sp) was obtained from the Tucson Stock Center (stock no. 14021-0231.36). Flies were grown at 20°C in standard cornmeal medium. Total RNA was extracted with TRIzol (Invitrogen) from 30 virgin females aged 3–7 d (three replicates of 10 flies each). RNA samples were treated with DNase I (10 units/50 µg of total RNA), and absence of contaminating genomic DNA was confirmed by PCR of two low-expressed genes CG11053 and CG13272 (Supplemental Table 1) from total RNA. Furthermore, the primers were chosen such that the amplicon includes an intronic sequence, which permits the identification of genomic DNA contamination.
First-strand cDNA was generated from
Enzyme library preparation
Shotgun library preparation
454 sequencing
Bioinformatics
We thank three anonymous reviewers for insightful comments. We also thank the C.S. laboratory members for helpful comments on earlier versions of this manuscript and S. Glinka for 454 sequencing. This work was supported by Fonds zur Förderung der wissenschaftlichen Forschung grants to C.S. and a fellowship of the Brazilian National Council for Scientific and Technological Development (CNPq) to T.T.T.
3 Corresponding author.
E-mail christian.schloetterer@vu-wien.ac.at; fax 43-1-250775693. [Supplemental material is available online at www.genome.org. The EST sequences have been deposited in GenBank under accession nos. EV574767–EV600806.] Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.6984908
Bainbridge, M.N., Warren, R.L., Hirst, M., Romanuik, T., Zeng, T., Go, A., Delaney, A., Griffith, M., Hickenbotham, M., Magrini, V., et al. 2006. Analysis of the prostate cancer cell line LNCaP transcriptome using a sequencing-by-synthesis approach. BMC Genomics 7: 246. doi: 10.1186/1471-2164-7-246.[CrossRef][Medline] Becker, R.A., Chambers, J.M., and Wilks, A.R. 1988. The new S language: A programming environment for data analysis and graphics. Wadsworth & Brooks/Cole, Pacific Grove, CA. Chen, J., Lee, S., Zhou, G., and Wang, S.M. 2002. High-throughput GLGI procedure for converting a large number of serial analysis of gene expression tag sequences into 3' complementary DNAs. Genes Chromosomes Cancer 33: 252–261.[CrossRef][Medline] Chen, J. and Rattray, M. 2006. Analysis of tag-position bias in MPSS technology. BMC Genomics 7: 77. doi: 10.1186/1471-2164-7-77.[CrossRef][Medline] Emrich, S.J., Barbazuk, W.B., Li, L., and Schnable, P.S. 2007. Gene discovery and annotation using LCM-454 transcriptome sequencing. Genome Res. 17: 69–73. Gilad, Y., Oshlack, A., Smyth, G.K., Speed, T.P., and White, K.P. 2006. Expression profiling in primates reveals a rapid evolution of human transcription factors. Nature 440: 242–245.[CrossRef][Medline] Gilad, Y., Rifkin, S.A., Bertone, P., Gerstein, M., and White, K.P. 2005. Multi-species microarrays reveal the effect of sequence divergence on gene expression profiles. Genome Res. 15: 674–680. Kuo, W.P., Liu, F., Trimarchi, J., Punzo, C., Lombardi, M., Sarang, J., Whipple, M.E., Maysuria, M., Serikawa, K., Lee, S.Y., et al. 2006. A sequence-oriented comparison of gene expression measurements across different hybridization-based technologies. Nat. Biotechnol. 24: 832–840.[CrossRef][Medline] Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z.T., et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376–380.[Medline] Metta, M., Gudavalli, R., Gibert, J.M., and Schlotterer, C. 2006. No accelerated rate of protein evolution in male-biased Drosophila pseudoobscura genes. Genetics 174: 411–420. Ng, P., Tan, J.J.S., Ooi, H.S., Lee, Y.L., Chiu, K.P., Fullwood, M.J., Srinivasan, K.G., Perbost, C., Du, L., Sung, W.K., et al. 2006. Multiplex sequencing of paired-end ditags (MS-PET): a strategy for the ultra-high-throughput analysis of transcriptomes and genomes. Nucleic Acids Res. 34: e84. doi: 10.1093/nar/gkl444. Nielsen, K.L., Hogh, A.L., and Emmersen, J. 2006. DeepSAGE—Digital transcriptomics with high sensitivity, simple experimental protocol and multiplexing of samples. Nucleic Acids Res. 34: e133. doi: 10.1093/nar/gkl714. R Development Core Team. 2007. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. Surzycki, S.J. 2000. Basic methods in molecular biology. Springer-Verlag, New York. Velculescu, V.E., Zhang, L., Vogelstein, B., and Kinzler, K.W. 1995. Serial analysis of gene expression. Science 270: 484–487. Weber, A.P., Weber, K.L., Carr, K., Wilkerson, C., and Ohlrogge, J.B. 2007. Sampling the Arabidopsis transcriptome with massively parallel pyrosequencing. Plant Physiol. 144: 32–42.
Received August 1, 2007; accepted in revised format October 3, 2007. This article has been cited by other articles:
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||