|
|
|
|
Genome Res. 15:1127-1135, 2005 ©2005 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/05 $5.00 Methods Assembly of polymorphic genomes: Algorithms and application to Ciona savignyi1 Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02141-2023, USA 2 Department of Zoology, Graduate School of Science, Kyoto University, Kyoto 606-8502, Japan
Whole-genome assembly is now used routinely to obtain high-quality draft sequence for the genomes of species with low levels of polymorphism. However, genome assembly remains extremely challenging for highly polymorphic species. The difficulty arises because two divergent haplotypes are sequenced together, making it difficult to distinguish alleles at the same locus from paralogs at different loci. We present here a method for assembling highly polymorphic diploid genomes that involves assembling the two haplotypes separately and then merging them to obtain a reference sequence. Our method was developed to assemble the genome of the sea squirt Ciona savignyi, which was sequenced to a depth of 12.7x from a single wild individual. By comparing finished clones of the two haplotypes we determined that the sequenced individual had an extremely high heterozygosity rate, averaging 4.6% with significant regional variation and rearrangements at all physical scales. Applied to these data, our method produced a reference assembly covering 157 Mb, with N50 contig and scaffold sizes of 47 kb and 989 kb, respectively. Alignment of ESTs indicates that 88% of loci are present at least once and 81% exactly once in the reference assembly. Our method represented loci in a single copy more reliably and achieved greater contiguity than a conventional whole-genome assembly method.
Genome sequencing is undergoing explosive growth, driven by improvements in both laboratory methods and computational analysis. A key aspect has been the continuing development of algorithms for assembling large and complex genomes from whole-genome shotgun (WGS) data (Myers et al. 2000
Many diploid organisms (probably most) have polymorphism rates that are much higher than the
Several authors have already noted that polymorphism within a WGS data set complicates the assembly process. For example, the sequenced strain of the mosquito (Anopheles gambiae) is inbred along 92% of its genome (one haplotype) and outbred in localized islands comprising the last 8% (two haplotypes). Assemblies of this genome display lower quality precisely in the islands of heterozygosity (Holt et al. 2002
We present here a method for the assembly of diploid genomes with very high heterozygosity, 1 substitution per 20 bases and 1 deletion/insertion event per 100 bases. In our approach, we assemble the two haplotypes separately, and then merge the haplotype assemblies to choose a single representative at each locus. The method is implemented as extensions to the Arachne assembly program (Batzoglou et al. 2002
We developed this method to assemble the diploid genome of the sea squirt Ciona savignyi, a model system for vertebrate development (Johnson et al. 2004
Analysis using default algorithm WGS data and default assembly Whole-genome assembly is a computational method for obtaining the nucleotide sequence of an entire genome by piecing together sequencing reads obtained by whole-genome shotgun (WGS) sequencing. Typically, WGS data are obtained by randomly shearing genomic DNA, selecting for fragments of a certain size, inserting these fragments in an appropriate cloning vector (e.g., plasmids for 4-kb inserts or Fosmids for 40-kb inserts), then sequencing both ends of the insert. Thus, WGS reads typically come in pairs that represent nucleotide sequences whose relative orientation and approximate distance are known. A now-standard approach to whole-genome assembly is to use overlaps between sequencing reads to computationally merge these sequencing reads into contigs, i.e., contiguous stretches of DNA with no gaps. When two reads from a pair lie in different contigs, the pairing of these two reads implies a relative orientation and position for the two contigs. This leads to the formation of larger structures called scaffolds, in which one or more contigs are separated by gaps of approximately known length.
The 190-Mb genome of Ciona savignyi was sequenced from two WGS libraries with 5-kb and 40-kb insert sizes, constructed from the DNA of a single wild, diploid individual from the San Francisco Bay (see Supplemental material). We generated 12.7x sequence coverage and 58x in physical coverage, of which 17x is from long inserts. We then performed an assembly using the Arachne assembler (Batzoglou et al. 2002 In principle, these results could have been caused by either widespread paralogy (such as a recent whole-genome duplication) or a high level of heterozygosity. We resolved this ambiguity by aligning 350 kb of finished sequence, representing seven random clones, obtained from a pool of other Ciona savignyi individuals from the same geographic source (Fig. 1). If the two related sites were paralogs representing different loci, the finished sequence from a different individual would be expected to closely match one of the two related sites. Instead, we observed in each case that the finished sequence differed significantly from both related sites: the divergences between the three sequences formed a star phylogeny with three edges of comparable length.
These data support the conclusion that the assembled variants represent the two haplotypes of the individual from which the WGS libraries were constructed, and that the finished sequence from a different individual represents a third haplotype.
Heterozygosity within the sequenced individual The two haplotypes frequently differ by large indels, i.e., sequence regions from hundreds to thousands of base pairs long that are present in only one of the two haplotypes (Fig. 2). In many cases, these differences can be identified as the recent insertion of transposable elements. In one case the difference appears to be due to a deletion mediated by recombination of nearby SINE elements, but in other cases the cause of the indel was not immediately evident. In total, the large indels between the two haplotypes comprise about one-sixth of the total number of aligning bases (28 kb out of 158 kb).
The heterozygosity rate was not only high, but also highly variable across regions. Regional variability has also been noted in Ciona intestinalis (Aparicio et al. 2002
Results with algorithm for polymorphic genomes More specifically, we introduce a new method called "the splitting rule" that causes overlaps between reads of opposite haplotype to be ignored during the formation of contigs and scaffolds. The goal is to force the two haplotypes to assemble separately in part 1.
We then accomplish the merger of the separately assembled haplotypes in part 2 in three steps. First, we use dynamic programming to select long collinear blocks of alignments between a haploid scaffold and its partner of the opposite haplotype. Second, we use collinear blocks as the atomic units and form larger structures called "diploid scaffolds." For this, we use haploid scaffolds to relay from one collinear block to the next. Third, we select a representative path through each diploid scaffold. These paths through the diploid scaffolds compose the reference assembly. When the haplotype partner of a haploid scaffold cannot be determined, we set it aside as an unpaired scaffold. Further details are provided below.
Assembly results The haplotype assemblies are more complete than the reference assembly, with 95% of ESTs aligning (the remaining 5% may be due to strain differences), and therefore the haplotype assemblies are also available for download. Ideally, all ESTs aligning to the haplotype assemblies would also be present in the reference assembly. However, among ESTs not aligning to the reference assembly, 4% align to the unpaired scaffolds and 3% align to the mirror image of the reference assembly (the unchosen haplotype). See Supplemental material for a more in-depth analysis of EST alignments.
The effect of the splitting rule can be seen by comparing the default assembly to the haplotype assemblies. As expected, the fraction of loci represented exactly twice in the assembly (measured by the number of times an EST aligns) has increased, indicating that the splitting rule caused the haplotypes to separate more cleanly in the assembly. We also observed an increase in contig sizes, scaffold sizes, and WGS read usage. These improvements appear to be side effects of the cleaner haplotype separation. Between the haplotype assemblies and the reference assembly were three intermediate stages (Figs. 3, 4). First, we formed collinear blocks of alignments relating the vast majority of sequence in the haplotype assemblies with its partner of the opposite haplotype. At this point we set aside the unpaired scaffolds that contained no collinear block of alignments. Second, we formed diploid scaffolds by using the haploid scaffolds to relay between collinear blocks of alignments; the increase in scaffold size between the haplotype assemblies and the reference assembly is attributable to this step. Third, we chose a reference path through each diploid scaffold. The choice of reference path gave us the freedom to choose the better-assembled haplotype at each position of a diploid scaffold, leading to the increase in contig size between the haplotype assemblies and the reference assembly. The N50 size of haplotype segments (sequences uninterrupted by either a contig gap or a switch between haplotypes) is 21.4 kb. Choosing a reference path also collapsed the twofold redundancy; for example, in the reference assembly, 81% of ESTs align to exactly one location, indicating that the vast majority of loci are represented uniquely. We then characterized the properties of the unpaired scaffolds, which total 99 Mb (Fig. 4). We found that these unpaired scaffolds are depleted for coding regions: only 5% of ESTs have an alignment to the unpaired scaffolds but not to the reference assembly. In addition, the unpaired scaffolds are typically small, having an N50 scaffold size of only 15 kb, and are highly enriched for repetitive sequence (Fig. 4). An independent analysis confirms that the unpaired scaffolds are highly enriched for repetitive sequence (see Supplemental material). These two properties of the unpaired scaffoldssmall size and high repeat contentappear to be the reason we could not identify their haplotype partner.
Assembly validation
Validation with finished sequence At the nucleotide level, we obtained alignments between the haplotype assembly and 522 kb out of the 542 kb of finished sequence (96%). These alignments are highly accurate, with almost all discrepancies occurring at regions of the assembly that have been assigned low quality scores. Specifically, we defined a position to be of low quality if it or any of the five flanking bases in either direction are of quality Q <45 as assigned by Arachne (assembly quality files are also available for download). By this definition, 18% of the haplotype assemblies and 13% of the reference assembly are of low quality. In the remaining high-quality regions, the discrepancies were 15 isolated single-nucleotide differences (of which four are at mononucleotide runs); a 2-base pair indel; two islands of sequence mismatches (four differences in 20 base pairs, seven differences in 120 base pairs); a"recombination" event in which the terminal 600 base pairs of a contig match the opposite haplotype perfectly; and the terminal 1800 base pairs of a contig not aligning to the finished sequence. [Also, in an ambiguous case, a variable number of tandem repeats (VNTR) region of period 45 is repeated 36 times in the finished sequence and only 27 times in the assembly]. The overall rate of mismatches or indels is 5 x 105 per base (28/542 kb), and discrepant regions comprise 0.4% of the sequence (2.4 kb/542 kb).
At regions marked as low quality, the error rate was higher. We found 108 isolated single-base differences (a rate of 1 per 5 kb) and 44 clusters of several mismatches with median length of
PCR verification of large-scale heterozygosity Specifically, at each critical junction we used the placement of end reads in the haploid scaffolds to select four 40-kb Fosmid clones (two from haplotype A and two from haplotype B) that span the critical junction (Fig. 5). If the assembly is correct, the sequence of each clone should be identical to the sequence of its matching haploid scaffold. Thus, all four clones at the critical junction should correspond in the region where the haploid scaffolds correspond, and diverge where the haploid scaffolds diverge. To test this, we designed six PCR assays: two assays within the region of correspondence that should amplify in all four clones, two assays within the region of divergence that should amplify only in the haplotype A clones, and two assays in the region of divergence that should amplify only in the haplotype B clones. All four clones spanning the critical junction were then interrogated with all six PCR assays. If the sequence of the clones supports the assembly, we would expect a product (of the predicted length) in 16 of the 24 reactions, and the absence of a product in eight of the 24 reactions. In addition, we sequenced all PCR products and checked that PCR products from a clone align best (or at least as well) to the scaffold of the same haplotype.
We interrogated all 14 critical junctions by PCR and sequenced the PCR products at 10 of the 14 critical junctions (Fig. 5). We found that in all but two cases, the presence or absence of a PCR product agreed with the prediction, and in all but these same two cases the sequence of the PCR product aligned best to the scaffold of the predicted haplotype. The overwhelming agreement between prediction and experiment indicates that the large-scale haplotype differences are real and that we assembled them correctly.
Assessment of assembly quality
Description of assembly algorithm
The splitting rule
If the two haplotypes are identical in a region I, and two reads A1 and B1 of opposite haplotype overlap solely within I, then there is no way to determine that these two reads are of opposite haplotype without considering a wider context (Fig. 6). The splitting rule gains this wider context by using other sequencing reads overlapping A1 and B1 to infer a local partition of the sequencing reads into two haplotypes. Specifically, we define (X Y) to mean that X and Y overlap, O(X,Y) to refer to the interval of their overlap, and (X! Y) to mean that they do not overlap. We consider all read-read overlaps and reject any overlaps (A1 B1) for which there are two additional reads, A2 and B2, establishing a local two-haplotype structure (A1 and A2 as one haplotype and B1 and B2 as the other) that is consistent with all other contextual information. Algebraically, we erase (A1 B1) if the following two conditions are satisfied:
Merging the haplotypes into a reference assembly
Detecting correspondences between haploid scaffolds To cluster the filtered alignments into runs of collinear alignments, we used dynamic programming to select, for each haploid scaffold S, the sequence of nonoverlapping alignments on S that is maximally collinear. The objective function for quantifying collinearity is the sum of the scores of alignments as assigned by Megablast, minus a switching penalty of 12,000 for pairs of adjacent alignments that disrupt strict collinearity. Such disruptions include adjacent pairs of alignments to different target scaffolds, adjacent pairs of alignments with different orientations relative to the same target scaffold, or adjacent alignments on the query scaffold that are overlapping or in a different order on the target scaffold. We divide this optimal sequence of alignments at each disruption of collinearity, resulting in (n + 1) nonoverlapping runs of collinear anchors, where n is the number of disruptions in the optimal path across S. The runs of collinear alignments found by dynamic programming are not necessarily reciprocal, and we filter to retain those that are. That is, we retain a run of collinear anchors from A to B only if there is also a run of collinear anchors from B to A that overlaps the original over more than half its length. We then intersect the ranges of reciprocal runs of alignments, retaining the corresponding alignments to create a collinear block. As a result of requiring all collinear blocks to be reciprocal, no part of a haploid scaffold can be part of more than one collinear block. Collinear blocks relate corresponding genomic regions in the two haplotypes, which are represented by haploid scaffolds.
Constructing diploid scaffolds
Logically, the application of these local rules could lead to a diploid scaffold forming a closed loop; we explicitly check for this condition but have never encountered it.
Collapsing diploid scaffolds into reference scaffolds
Whole-genome assembly from multiple-haplotype WGS data is fundamentally different from, and more difficult than, the assembly of single-haplotype WGS data. We have described here a method for the case of exactly two haplotypes, and applied this method to assemble the Ciona savignyi genome from 12.7x WGS sequence. Because our method first assembles the two haplotypes separately, it requires about twice the sequencing depth that is typical for WGS assembly. Using the full 12.7 WGS sequence available for Ciona savignyi, our method provides a substantial improvement over the default assembly using Arachne, yielding an assembly with improved long-range continuity and in which the vast majority of loci are represented uniquely. The improvement results primarily from exploiting the fact that there are exactly two haplotypes present in the data. From an algorithmic perspective, the knowledge that exactly two haplotypes are present is an enormous advantage. For example, half of the overlaps between sequencing reads are between two reads of the same haplotype, and by using sequencing quality scores we only need to consider overlaps with near-perfect sequence identity. In addition, each sequence is confounded by at most one other variant; once the two haplotypes at a locus have been identified, a sequencing read from that locus must exactly match sequence from one of the two haplotypes.
One alternative method for polymorphic assembly is to assemble all haplotypes in a consensus sequence from the start. This is the approach of the JAZZ assembler, which was used to assemble the moderately polymorphic genomes of Ciona intestinalis and Fugu rubripes (Aparicio et al. 2002
A second alternative method for polymorphic assembly is the approach developed by Jones and colleagues (2004 Ideally, it would be desirable to have methods for assembling polymorphic diploid genomes that can be applied regardless of the level and variability of heterozygosity. We are optimistic that this is achievable, but believe that it will require more than cosmetic refinements of traditional algorithms that were designed for genomes with little or no polymorphism. In such traditional algorithms, even the most basic data structures of "contigs" and "scaffolds" presume that a single haplotype is present. Instead, it would be better to introduce more natural data structures at the earliest stages of assembly, which allow the two haplotypes to assemble together at regions of low or zero heterozygosity and bifurcate at regions of high heterozygosity, and only project to contigs and scaffolds as the final step.
At present, polymorphic WGS assemblies cannot be produced at the same accuracy and completeness that is now routine for haploid WGS data sets, even in the case of only two haplotypes at high coverage. In addition, not all polymorphic species can be sequenced from a single individual containing only two haplotypes, e.g., polyploid plants and species for which several individuals are required to gather sufficient DNA for genome sequencing. Thus, alternative sequencing strategies for polymorphic species need to be explored. One alternative is BAC-by-BAC sequencing, which should cost We are optimistic that with continued algorithmic and sequencing advances, diploid genome assembly will become as routine as haploid assembly now is. Our results represent initial steps in that direction.
Finished sequence from the same individual The 14 finished and nearly finished clones from the same Ciona savignyi individual are from the 40-kb Fosmid WGS library. These clones were selected manually in seven pairs so that, within each pair, the two Fosmids overlap by at least 20 kb and represent opposite haplotypes. Nine of the 14 clones are finished (GenBank phase 3) and five are nearly finished (GenBank phase 2). All 14 clones ( 560 kb) were used for assembly validation. Accession numbers, phase information (phase 2 or 3), pairing at loci, and usage in figures are shown in Supplemental Table 1.
Measurement of substitution rates
Validation using finished sequence from the same individual
Finished sequence from different individuals
EST sequencing
Identifying repeats by alignment depth
This work was supported by a grant from NHGRI to E.S.L. N.S. and Y.S. were supported by a Grant-in-Aid (12202001) from MEXT, Japan. We thank Arend Sidow, Kerrin Small, and Matthew Hill for background conversations about Ciona savignyi, discussions of assembly methods, and representations of diploid assemblies, and the use of the preliminary genetic map information, Daniel Rokhsar for discussions of related issues in Ciona intestinalis, Jonathan Butler for assistance using Arachne and preparing source code for release, Kerstin Lindblad-Toh for comments and suggestions, Nick Patterson for discussions of population genetics, and Leslie Gaffney for assistance with illustrations and copy editing.
[Supplemental material is available online at www.genome.org. The sequence data from this study have been submitted to GenBank under accession nos. AACT01000000, AC092520 [GenBank] , AC092560 [GenBank] , AC092561 [GenBank] , AC102146 [GenBank] , AC117993 [GenBank] , AC117995 [GenBank] , AC153355 [GenBank] , AC126540 [GenBank] , AC126602 [GenBank] , AC129896 [GenBank] , AC129897 [GenBank] , AC129899 [GenBank] , AC129900AC129904, AC130812 [GenBank] , AC130813 [GenBank] , AC131244 [GenBank] and AC131245 [GenBank] and to DDBJ under accession nos. BW509979BW594280.] Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.3722605.
3 Corresponding author.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 33893402.
Aparicio, S., Chapman, J., Stupka, E., Putnam, N., Chia, J.M., Dehal, P., Christoffels, A., Rash, S., Hoon, S., Smit, A., et al. 2002. Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297: 13011310.
Batzoglou, S., Jaffe, D.B., Stanley, K., Butler, J., Gnerre, S., Mauceli, E., Berger, B., Mesirov, J.P., and Lander, E.S. 2002. ARACHNE: A whole-genome shotgun assembler. Genome Res. 12: 177189.
Dehal, P., Satou, Y., Campbell, R.K., Chapman, J., Degnan, B., De Tomaso, A., Davidson, B., Di Gregorio, A., Gelpke, M., Goodstein, D.M., et al. 2002. The draft genome of Ciona intestinalis: Insights into chordate and vertebrate origins. Science 298: 21572167.
Fleischmann, R.D., Adams, M.D., White, O., Clayton, R.A., Kirkness, E.F., Kerlavage, A.R., Bult, C.J., Tomb, J.F., Dougherty, B.A., Merrick, J.M., et al. 1995. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496512. Gibbs, R.A., Weinstock, G.M., Metzker, M.L., Muzny, D.M., Sodergren, E.J., Scherer, S., Scott, G., Steffen, D., Worley, K.C., Burch, P.E., et al. 2004. Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428: 493521.[CrossRef][Medline]
Holt, R.A., Subramanian, G.M., Halpern, A., Sutton, G.G., Charlab, R., Nusskern, D.R., Wincker, P., Clark, A.G., Ribeiro, J.M., Wides, R., et al. 2002. The genome sequence of the malaria mosquito Anopheles gambiae. Science 298: 129149.
Huang, X., Wang, J., Aluru, S., Yang, S.P., and Hillier, L. 2003. PCAP: A whole-genome assembly program. Genome Res. 13: 21642170.
Jaffe, D.B., Butler, J., Gnerre, S., Mauceli, E., Lindblad-Toh, K., Mesirov, J.P., Zody, M.C., and Lander, E.S. 2003. Whole-genome sequence assembly for mammalian genomes: Arachne 2. Genome Res. 13: 9196.
Johnson, D.S., Davidson, B., Brown, C.D., Smith, W.C., and Sidow, A. 2004. Noncoding regulatory sequences of Ciona exhibit strong correspondence between evolutionary constraint and functional importance. Genome Res. 14: 24482456.
Jones, T., Federspiel, N.A., Chibana, H., Dungan, J., Kalman, S., Magee, B.B., Newport, G., Thorstenson, Y.R., Agabian, N., Magee, P.T., et al. 2004. The diploid genome sequence of Candida albicans. Proc. Natl. Acad. Sci. 101: 73297334.
Kent, W.J. 2002. BLATThe BLAST-like alignment tool. Genome Res. 12: 656664.
Mullikin, J.C. and Ning, Z. 2003. The phusion assembler. Genome Res. 13: 8190.
Mural, R.J., Adams, M.D., Myers, E.W., Smith, H.O., Miklos, G.L., Wides, R., Halpern, A., Li, P.W., Sutton, G.G., Nadeau, J., et al. 2002. A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science 296: 16611671.
Myers, E.W., Sutton, G.G., Delcher, A.L., Dew, I.M., Fasulo, D.P., Flanigan, M.J., Kravitz, S.A., Mobarry, C.M., Reinert, K.H., Remington, K.A., et al. 2000. A whole-genome assembly of Drosophila. Science 287: 21962204. Nordborg, M. 2003. Coalescent theory. In Handbook of statistical genetics (eds. D.J. Balding et al.). Wiley, Hoboken, NJ.
Pop, M., Kosack, D.S., and Salzberg, S.L. 2004. Hierarchical scaffolding with Bambus. Genome Res. 14: 149159. Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D., and Miller, W. 2003. Humanmouse alignments with BLASTZ. Genome Res. 13: 103107.
Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A., et al. 2001. The sequence of the human genome. Science 291: 13041351. Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520562.[CrossRef][Medline] Zhang, Z., Schwartz, S., Wagner, L., and Miller, W. 2000. A greedy algorithm for aligning DNA sequences. J. Comput. Biol. 7: 203214.[CrossRef][Medline]
Received January 24, 2005; accepted in revised format May 24, 2005. This article has been cited by other articles:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||