|
|
|
|
Published online before print
October 1, 2007, 10.1101/gr.6435207 Genome Res. 17:1697-1706, 2007 ©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07 $5.00
Resource SHARCGS, a fast and highly accurate short-read assembly algorithm for de novo genomic sequencing1 Max-Planck-Institute for Molecular Genetics, 14195 Berlin-Dahlem, Germany; 2 Institute for Functional Genomics, Computational Diagnostics, University of Regensburg, 93053 Regensburg, Germany
The latest revolution in the DNA sequencing field has been brought about by the development of automated sequencers that are capable of generating giga base pair data sets quickly and at low cost. Applications of such technologies seem to be limited to resequencing and transcript discovery, due to the shortness of the generated reads. In order to extend the fields of application to de novo sequencing, we developed the SHARCGS algorithm to assemble short-read (25–40-mer) data with high accuracy and speed. The efficiency of SHARCGS was tested on BAC inserts from three eukaryotic species, on two yeast chromosomes, and on two bacterial genomes (Haemophilus influenzae, Escherichia coli). We show that 30-mer-based BAC assemblies have N50 sizes >20 kbp for Drosophila and Arabidopsis and >4 kbp for human in simulations taking missing reads and wrong base calls into account. We assembled 949,974 contigs with length >50 bp, and only one single contig could not be aligned error-free against the reference sequences. We generated 36-mer reads for the genome of Helicobacter acinonychis on the Illumina 1G sequencing instrument and assembled 937 contigs covering 98% of the genome with an N50 size of 3.7 kbp. With the exception of five contigs that differ in 1–4 positions relative to the reference sequence, all contigs matched the genome error-free. Thus, SHARCGS is a suitable tool for fully exploiting novel sequencing technologies by assembling sequence contigs de novo with high confidence and by outperforming existing assembly algorithms in terms of speed and accuracy.
For almost 20 years, the Sanger technique (Sanger et al. 1977
For the assembly of sequence fragments of
Given the low cost of sequence data generated by second-generation sequencing instruments, sequence fragments can be provided to cover a target sequence in excess of 100-fold. When the resulting huge numbers of very short reads are assembled, simplicity becomes a virtue for an assembly algorithm. One approach to assemble reads as short as 25 bases was reported by Warren et al. (2007) In this paper, we first demonstrate the feasibility of assemblies using 25–40 base reads under idealized conditions, assuming that all possible reads of a given length are available and no sequencing errors are permitted. Next, we focus on a realistic scenario, showing how to cope with missing reads and how to detect reads that contain sequencing errors out of a large pool of reads. We calculate assemblies for simulated reads of sequenced BAC clones from Arabidopsis, Drosophila, and human, as well as simulated reads of microbial sequences. Finally, we generate a 36-mer read data set from Helicobacter acinonychis (genome size 1.55 Mbp) on the Illumina 1G sequencing instrument and use SHARCGS to assemble the genome based on these data.
Assembling simulated short read (25–40-mer) data under idealized assumptions Assuming that all possible reads are present in the data set, no sequencing errors have occurred, and read length is fixed, the quality of the assembly solely depends on intrinsic properties of the sequence. We assembled simulated ideal 30-mer reads from a set of 60 BAC insert sequences with an average length of 110 kbp each, derived from Arabidopsis thaliana, Drosophila melanogaster, and Homo sapiens (Table 1 and Supplementary Table 1). We found between 1 (Drosophila, BAC supp20) and 520 (Human, BAC supp26) SHARCGS contigs per BAC. As a quality measure for the assembled SHARCGS contigs, we compared the N50 sizes of the assemblies from different BACs. Both Arabidopsis and Drosophila BACs had an average N50 length of >30 kbp, and human BACs were assembled to an average N50 length of 5–6 kbp.
The results in Table 1 and Supplementary Table 1 highlight the quality optimum that can be achieved in 30-mer-based assemblies. However, the assembly can be improved by using longer reads. Figure 1 summarizes the N50 changes when contig assembly is carried out with 25–40-mer reads, respectively, calculated from assemblies with 20 BACs per species. There is a clear tendency toward larger N50. For instance, an increase of read length from 30 to 40 bases almost doubled N50 in all three species. Longer reads have a better performance in bridging regions of ambiguity or low complexity, resulting in larger sequence contigs.
Repeat content and assembly quality To determine the influence of the repeat content on the assembly quality, we analyzed the GenBank BAC sequences with RepeatMasker (www.repeatmasker.org). While human BACs, with the exception of Y chromosome derived clones and a few outliers, showed uniform repeat content (40%–55%), the Drosophila set showed high variance with some BACs being almost repeat-free (1% repeat content) and some containing >50% repetitive DNA. The data show that a higher fraction of repetitive regions results in shorter N50 lengths. We got the best SHARCGS assemblies in Drosophila for BACs with a repeat content of 5% or lower, resulting in N50 of >57 kbp. Repeats that occur only once per BAC, or which are very diverged, will essentially behave like single-copy sequences and will not affect assembly quality. To assess whether the number of low-complexity (LC) or simple repeats (SR), as identified by RepeatMasker, have an effect on the assembly quality, we counted the number of such sequence segments per BAC. Arabidopsis BACs contained 40 LC/SRs on average, BACs from Drosophila contained 44 LC/SRs, and human BACs contained 37 LC/SRs (numbers based on clones in Table 1). Since the N50 for human BAC assemblies are the lowest in our sample, we conclude that LC/SRs in this data set have no effect onto SHARCGS assembly. This is supported by closer inspection of the data: For instance, Drosophila BAC 13 (N50 of 6 kbp) contained 19 LC/SRs, while BAC 18 contained 61 LC/SRs and had an N50 of 87 kbp. With very few exceptions, SHARCGS contigs covered the inserts of BACs > 99%. Two outliers were BAC 13 and BAC 8. BAC 13 was exceptionally repeat-rich (79% repetitive DNA), while the insert of BAC 8 (repeat content 32%) contained 38 kbp of highly similar tandem repeats (Supplementary Fig. 1).
Assembling simulated short-read data with missing reads and sequencing errors Missed positions (or "gaps") result in decreased contig lengths, as one can expect breaks to occur whenever contig extension fails due to missing reads. The more serious problem might be wrong assembly if ambiguities (or "forks") are not detected because of missing reads. This problem and its solution implemented in SHARCGS are illustrated in Figure 2. For each read R that is to be assembled onto a contig, SHARCGS inspects both the forward and reverse strand for ambiguities. Read R is only used to extend the contig, if other reads matching the contig with shorter overlaps do not lead to ambiguities. In this manner, gaps can be checked safely for ambiguities (see Methods for details).
In a BAC-based sequencing project, sequencing from BAC pools increases efficiency in terms of time and money. The resulting data sets contain reads from more than a single clone. The sequencing technology itself can be used to discriminate between reads derived from individual clones by using indices. We assumed distinguishable, barcoded BACs, and we simulated reads for 1–16 different BACs per pool. Increasing the pool size implies that fewer reads are available per BAC, and thus the coverage per BAC is decreasing. Assuming 5 million 30-mer reads per sample, the coverage is 1363/n, with n indicating the number of BACs per pool (insert size 110 kbp per BAC).
To avoid assembling faulty reads, we assumed that it is unlikely that the same error occurs multiple times in a data set of limited size. Thus, the correct reads for a certain base must outnumber reads with a wrong base call at this position. By comparing the read counts, we are able to determine a read data set that can be used for the assembly, e.g., reads that are present at least three times. The number of correct reads per BAC decreases with increasing complexity of the BAC pool, since fewer reads are available for the individual BAC (Fig. 3A). For example, assuming a sequencing error rate of 0.6%, threefold confirmation of a 30-mer read results in 94% of reads retained in a pool of 5 BACs (coverage
We applied all the features above and assumed a sequencing error rate of 0.6% to the assembly of the set of 60 BACs described earlier (Fig. 4A–C). The SHARCGS assembly took 15 min per BAC on a Dual Xeon 2.8-GHz 32-bit Linux machine with 4 GB of RAM. In the context of these assemblies we calculated 942,380 contigs > 50 bp, corresponding to 489.59 Mbp. All these contigs, with one single exception, could be aligned perfectly without mismatches or gaps against the BAC reference sequences. The only falsely assembled contig we observed was 57 bp in size and was derived from reads from a conserved 30-bp inverted repeat located within Drosophila BAC supp20 (GenBank AC099006). From these data, we can conclude that our algorithm is perfectly reliable with respect to assembling contigs de novo from short reads and that our filtering to remove reads containing sequencing errors is efficient. The box plots in Figure 4A–C illustrate the feasibility to assemble BACs from indexed pools. Noteworthy is the relatively constant average N50 that is observed for pooled clones, even though the variance is high because of sequence differences inherent in BACs (see Table 1, e.g., repeat content). In both Arabidopsis and Drosophila, 5–6 BACs can be pooled to still achieve an N50 at 20 kbp, and for human BACs we find an N50 of 4–5 kbp for pools of up to 9 BACs. With increased pool sizes, contigs become progressively shorter. This can be explained by redundancy checks becoming so stringent that sufficient reads are not retained for the assembly. Figure 5 shows this effect in comparison to the N50 sizes of assemblies based on simulated reads under ideal assumptions for different pool sizes.
Assembling simulated short-read data of microbial sequences To further evaluate the prospects of SHARCGS, we assembled yeast chromosomes and bacterial genomes based on simulated 30- or 32-mer data sets under realistic assumptions as described above (missing reads, 0.6% sequencing error rate; Table 2). We chose Saccharomyces cerevisiae chromosomes V (0.58 Mbp) and VII (1.09 Mbp), as well as the genomes of Haemophilus influenzae (1.94 Mbp) and E. coli (4.71 Mbp). Depending on sequencing depth (coverage range 130- to 270-fold), we obtained an N50 of up to 28 kbp for yeast chromosome V and up to 18 kbp for yeast chromosome VII. The bacterial genomes (coverage range 95- to 164-fold) could be assembled with an N50 of 22 kbp (H. influenzae) and 16 kbp (E. coli). In all these assemblies the sequence coverage exceeded 97% and all contigs were error-free, as determined by matching them back against the reference sequences.
Assembling Illumina short-read data of the H. acinonychis genome In order to evaluate SHARCGS performance on real data, we chose to sequence and assemble the genome of H. acinonychis, a bacterial species isolated from the stomach of large felines. The sequences of the H. acinonychis genome (1.55 Mbp) and of the H. acinonychis plasmid pHac1 (3661 bp) were determined recently with Sanger technology (Eppinger et al. 2006 We removed reads containing unspecified bases from the complete data set of 12.3 million reads and trimmed the last four bases of each read. We assembled the remaining 11.8 million reads without taking ELAND results into account, thus relying exclusively on read filtering and assembly procedures implemented within SHARCGS. Different from the assemblies based on simulated reads described above, the filtering of Illumina read data relied on the utilization of cumulative base quality values (see Methods). We used four different base quality thresholds (Q > 20, Q > 25, Q > 30, and Q > 35) to generate four read data sets (Table 3a). For instance, the read data sets Q > 20 and Q > 35 contained 1,174,569 and 978,900 unique reads, respectively, to be left for contig assembly. Each read data set was assembled separately. Thereafter, we took advantage of SHARCGS feature to automatically merge contigs generated from different filter settings (Table 3b). The inclusion of reads that contain sequencing errors (reads are retained when applying weak filtering criteria) result in contig breaks, and missing reads (reads are removed when applying strong filtering criteria) result in contig breaks, too. However, runs with different filtering criteria result in contig breaks at different positions. Merging overlapping contigs generated in different assembly runs therefore improves the assembly quality (see Methods for details). In our case, assemblies from single runs contained 1303–1948 contigs with N50 between 1558 and 2257 bp. For the final H. acinonychis assembly we merged the contigs from these four assembly runs, resulting in an increased N50 size by more than 50% relative to the best single assembly. The final assembly of H. acinonychis contains 937 contigs that cover 98% of the genome with an N50 of 3659 bp: 932 of these contigs are error-free, and 5 contigs contain mismatches at 1–4 positions. In three of these five contigs, discrepancies between the SHARCGS assembly and the reference sequence are supported by 103, 117, or 198 Illumina reads, respectively. These high numbers of supporting reads suggest that the sequence based on Illumina reads is correct. The remaining two discrepancies are located at the ends of SHARCGS contigs.
The assembly described here was carried out entirely de novo, i.e., solely depending upon the read sequences and base quality values. We positioned the contigs on the reference genome and found that many contigs overlapped. By merging contigs overlapping more than five bases, our assembly collapsed into 228 contigs, increasing the N50 length 5.4-fold (Table 3d). This suggests that microread-based assemblies can be improved with modest additional effort, by carrying out a hybrid assembly together with low, e.g., one- or twofold coverage of the target genome in 454 or Sanger reads.
Comparative evaluation of SHARCGS
Even though SSAKE does have a "stringent" mode implemented that should in principle avoid the extension of contigs across regions of ambiguity, in practice a large number of false joins were observed. In realistic simulations with missing reads and sequencing errors, 25% (on average, range 0%–58%) of SSAKE contigs could not be aligned to the BAC reference sequence. This indicates that the performance of SSAKE is particularly vulnerable to the presence of sequencing errors in a read data set. Considering only correct SSAKE contigs, an average BAC insert coverage of 76% was determined. In contrast, all contigs generated by SHARCGS could be perfectly aligned to the source sequences. The fraction of each BAC insert sequence covered with SHARCGS contigs was on average 93% (Supplementary Table 2). We conclude that the large proportion of unmatched contigs will seriously limit the utility of the Warren et al. (2007)
We have attempted a similar comparison with the Euler2 algorithm from Pevzner et al. (2001)
There are multiple applications for microread-sequencing technologies, including genome resequencing, genotyping, transcript identification and profiling, transcription factor binding site analysis, and methylome characterization. SHARCGS extends the portfolio of applications to comprise de novo sequence assembly. Our evaluation of SHARCGS was based both on simulated reads and on reads generated on the Illumina 1G sequencer. We have shown that SHARCGS generates error-free contigs from short-read data containing as much as 1.5% of sequencing errors, as long as high coverage of the target genome is provided (200-fold or more). With these conditions, we assembled a bacterial genome with a size of 1.6 Mpb (H. acinonychis). Assuming that the accuracy of short-read sequencing technologies will improve, larger bacterial genomes are realistic targets, as shown for the 4.4-Mbp genome of E. coli on simulated reads with an error rate of 0.6%. To assemble more complex eukaryotic genomes, a BAC-wise approach may be considered. The choice of the BAC pool size depends on the repeat content of the genome. In the present implementation of our algorithm, single-nucleotide polymorphisms would cause spurious contig breaks. If BACs from the same chromosomal region derived from different haplotypes were sequenced in one single pool, sequence variants may be discarded from the data set, unless indices were used. Microread-based assemblies can easily be improved, e.g., by integration of reads from HTP pyrosequencing or conventional Sanger reads at low coverage. These modifications consolidate the assembly by bridging gaps and regions of ambiguity.
The SHARCGS algorithm consists of a filtering step to avoid reads with sequencing errors, an assembly step to generate contigs, and a final contig merging step including computation of quality measures.
Filtering for confirmed reads
In order to assemble reliable contigs, we suggest removing unconfirmed low-quality reads from the assembly. We consider a read as being confirmed when the following two conditions hold: Reads are generated multiple times, and overlapping partners exist. Without quality scores, SHARCGS simply filters for reads generated at least n times, n being a parameter of this filtering step. When quality measures are available, the first filtering step is modified: SHARCGS then filters for reads having high minimal quality values. Quality values of all occurrences of a read on the same or the opposite strand are combined, similar to Phrap (Ewing and Green 1994 In simulations, the second criterion allows the removal of virtually all wrong reads left over after the first filtering step, at the expense of discarding very few correct ones. In this second filtering step, our algorithm removes reads that lack partners with perfectly matching overlap. The minimal overlap used to search for matching partners is a parameter of this filtering step, to be chosen larger than half of the read length. We consider reads to be confirmed, if at least one matching partner exists for each of its ends, thus if it is covered entirely at least twice. Reads containing sequencing errors are very unlikely to find partners on both sides. After the filtering steps, SHARCGS generates reverse complements for all confirmed reads and keeps only one copy of each read, before starting the core assembly algorithm.
The core assembly algorithm
Merging contigs from several assembly runs and generation of quality measures Setting the filtering parameters to very stringent levels discards large amounts of data from the input. The resulting assembly has very short contigs, because reads from too many positions in the target sequence are unavailable. When including weakly confirmed reads in the assembly, it is more difficult to filter reads containing sequencing errors. Such reads, however, would stop the algorithm from assembling long contigs, since reads containing sequencing errors cause spurious ambiguities. The contig breaks due to confirmed reads with sequencing errors tolerated by weak filtering are unlikely to occur at the same positions as the contig breaks caused by missing reads after application of strong filtering criteria. This is why we run the core assembly algorithm automatically for weak, medium, and strong filter parameter settings. For each parameter setting, the core algorithm only generates contigs contained in the target sequence. SHARCGS attempts to merge contigs from different core assembly runs by finding exact overlaps at least as long as the read length.
If quality measures are supplied with the raw reads, SHARCGS computes quality measures for any position in all contigs. It adopts the paradigm described in the Phrap documentation (Ewing and Green 1994
Fine-tuning parameters The optimal setting of both parameters depends on the length of the target sequence. We use the abundance of reads generated several times as a fingerprint of the target sequence length. We expect the number of reads generated x times to be distributed according to a binomial distribution with parameters k, the number of reads generated, and P, the probability to read correctly at a given position in the target sequence. The probability P is given by Pcorrect/n, where n is the length of the target sequence. In the input data, we can count the number of reads generated x times for any x. In a maximum likelihood approach, we determine the combination of error rate and target sequence length, which most probably causes the distribution of read confirmations observed in the input data. This estimation is accurate in simulated data (within 5% of the real sequence length). According to the target sequence length n, we determine the confirmation levels for which n/2 (strong filtering) to n (weak filtering) reads pass the filter. At most we expect to observe one read per position in the target sequence, so we do not expect to have more than n correct reads. Moreover, we do not expect to be able to assemble input data with decent quality, if more than half of the reads are missing.
The minimal overlap with which the core assembly algorithm should check for ambiguities, as well as the minimal overlap used in the second filtering step to detect unreliable reads, depends on the lengths of gaps expected in the input data. In this context, we call gaps runs of consecutive positions for which no confirmed read is available. The probability Pg that a gap at least of length g starts at a given position depends on the number of available reads navail and the length of the target sequence n as follows:
Implementation
Evaluation
Short-read generation in silico
Preparation of fragment libraries and Illumina sequencing
Program and data availability
We thank Mark Achtman and Giovanna Morelli for providing DNA from H. acinonychis and Andrew Hufton for comments on the manuscript.
3 Corresponding author.
E-mail himmelbauer{at}molgen.mpg.de; fax 49-30-8413-1380. [Supplemental material is available online at www.genome.org.] Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.6435207
Adams, M.D., Celniker, S.E., Holt, R.A., Evans, C.A., Gocayne, J.D., Amanatides, P.G., Scherer, S.E., Li, P.W., Hoskins, R.A., and Galle, R.F. 2000. The genome sequence of Drosophila melanogaster. Science 287: 2185–2195. Aparicio, S., Chapman, J., Stupka, E., Putnam, N., Chia, J.M., Dehal, P., Christoffels, A., Rash, S., Hoon, S., Smit, A., et al. 2002. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297: 1301–1310. Batzoglou, S., Jaffe, D.B., Stanley, K., Butler, J., Gnerre, S., Mauceli, E., Berger, B., Mesirov, J.P., and Lander, E.S. 2002. ARACHNE: A whole-genome shotgun assembler. Genome Res. 12: 177–189. doi: 10.1101/gr.208902. Bentley, D.R. 2006. Whole-genome re-sequencing. Curr. Opin. Genet. Dev. 16: 545–552.[CrossRef][Medline] Choi, S.D., Creelman, R., Mullet, J., and Wing, R.A. 1995. Construction and characterization of a bacterial artificial chromosome library from Arabidopsis thaliana. Weeds World 2: 17–20. Eppinger, M., Baar, C., Linz, B., Raddatz, G., Lanz, C., Keller, H., Morelli, G., Gressmann, H., Achtman, M., and Schuster, S.C. 2006. Who ate whom? Adaptive Helicobacter genomic changes that accompanied a host jump from early humans to large felines. PLoS Genet. 2: e120. doi: 10.1371/journal.pgen.0020120.eor.[CrossRef][Medline] Ewing, B. and Green, P. 1994. Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res. 8: 186–194. Hoskins, R.A., Nelson, C.R., Berman, B.P., Laverty, T.R., George, R.A., Ciesiolka, L., Naeemuddin, M., Arenson, A.D., Durbin, J., David, R.G., et al. 2000. A BAC-based physical map of the major autosomes of Drosophila melanogaster. Science 287: 2271–2274. Huang, X. and Madan, A. 1999. CAP3: A DNA sequence assembly program. Genome Res. 9: 868–877. Jaffe, D.B., Butler, J., Gnerre, S., Mauceli, E., Lindblad-Toh, K., Mesirov, J.P., Zody, M.C., and Lander, E.S. 2003. Whole-genome sequence assembly for mammalian genomes: ARACHNE 2. Genome Res. 13: 91–96. doi: 10.1101/gr.828403. Margulies, M., Egholm, M., Altman, W.E., Attiya, S., Bader, J.S., Bemben, L.A., Berka, J., Braverman, M.S., Chen, Y.J., Chen, Z., et al. 2005. Genome sequencing in microfabricated high-density picolitre reactors. Nature 437: 376–380.[Medline] Mullikin, J.C. and Ning, Z. 2003. The Phusion assembler. Genome Res. 13: 81–90. doi: 10.1101/gr.731003. Myers, E.W., Sutton, G.G., Delche, A.L., Dew, I.M., Fasulo, D.P., Flanigan, M.J., Kravitz, S.A., Mobarry, C.M., Reinert, K.H., Remington, K.A., et al. 2000. A whole-genome assembly of Drosophila. Science 287: 2196–2204. Osoegawa, K., Mammoser, A.G., Wu, C., Frengen, E., Zeng, C., and de Jong, P.J. 2001. A bacterial artificial chromosome library for sequencing the complete human genome. Genome Res. 11: 483–496. Pevzner, P.A., Tang, H., and Waterman, M.S. 2001. An Eulerian path approach to DNA fragment assembly. Proc. Natl. Acad. Sci. 98: 9748–9753. Sanger, F., Nicklen, S., and Coulson, A.R. 1977. DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. 74: 5463–5467. Sutton, G.G., White, O., Adams, M.D., and Kerlavage, A.R. 1995. TIGR Assembler: A new tool for assembling large shotgun sequencing projects. Genome Sci. Technol. 1: 9–19. Warren, R.L., Sutton, G.G., Jones, S.J., and Holt, R.A. 2007. Assembling millions of short DNA sequences using SSAKE. Bioinformatics 23: 500–501. Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520–562.[CrossRef][Medline]
Received February 27, 2007; accepted in revised format August 28, 2007. Related Articles
This article has been cited by other articles:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||