|
|
|
Published online before print
January 18, 2001, 10.1101/gr.GR-1339R
Vol. 11, Issue 2, 274-280, February 2001
METHODS
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |
ABSTRACT |
|---|
|
|
|---|
Theory is developed for the process of sequencing randomly selected large-insert clones. Genome size, library depth, clone size, and clone distribution are considered relevant properties and perfect overlap detection for contig assembly is assumed. Genome-specific and nonrandom effects are neglected. Order of magnitude analysis indicates library depth is of secondary importance compared to the other variables, especially as clone size diminishes. In such cases, the well-known Poisson coverage law is a good approximation. Parameters derived from these models are used to examine performance for the specific case of sequencing random human BAC clones. We compare coverage and redundancy rates for libraries possessing uniform and nonuniform clone distributions. Results are measured against data from map-based human-chromosome-2 sequencing. We conclude that the map-based approach outperforms random clone sequencing, except early in a project. However, simultaneous use of both strategies can be beneficial if a performance-based estimate for halting random clone sequencing is made. Results further show that the random approach yields maximum effectiveness using nonbiased rather than biased libraries.
| |
INTRODUCTION |
|---|
|
|
|---|
Genomic mapping and sequencing have benefited
greatly from the development of stable large-insert cloning platforms,
such as the bacterial artificial chromosome (BAC) clone (Shizuya et al.
1992
). Sequencing random clones has recently been discussed, both as a
primary strategy and/or in the role of augmenting map-based sequencing.
By random clone sequencing, we mean that large-insert clones are
randomly selected from a library without a priori knowledge of their
positions in the genome. They are then sequenced by standard shotgun
techniques and assembled into contigs (i.e., sets of contiguous clones)
using sequence comparison methods. Because positional information is
not known in advance, there is no guarantee that a clone will extend a
contig or will not be fully redundant with respect to clones already
sequenced. This random-clone approach contrasts with map-based
sequencing, by which clone selection is guided by a preexisting
physical map. Here, redundancy and contig information are known at the
outset, enabling some optimization of the selection procedure. For
example, if constant minimal redundancy can be maintained, map-based
sequencing is essentially a linear coverage process.
As with map-based sequencing, issues for random clone sequencing
revolve largely around performance, as measured by such parameters as
rates of progress and redundancy accumulation. These parameters have
not been empirically quantified for random sequencing because there
remains a lack of substantial data for large genomes. Conversely, map-based sequencing is better understood from an empirical standpoint. For example, a map-based approach was employed for the
Caenorhabditis elegans sequencing project (C. elegans
Sequencing Consortium 1998
) and has figured prominently in the Human
Genome Project (Sanger Centre and Washington University Genome
Sequencing Center 1998
). These projects provide a good benchmark of the
map-based approach for large genomes.
As for mathematical modeling, the random-sequencing scenario described
above has not been specifically addressed. Theoretical developments
have concentrated mainly on mapping techniques, for example, the
seminal work of Lander and Waterman (1988)
for the fingerprint method
and later models for other procedures (Arratia et al. 1991
; Barillot et
al. 1991
; Zhang and Marr 1993
; Port et al. 1995
; Schbath 1997
). Owing
to considerations of similarity, it has been postulated that mapping
models could be directly applied to random clone sequencing (Lander and
Waterman 1988
; Roach 1995
). However, a subtle issue related to clone
overlaps arises. Mapping models are necessarily based on the ability of
the method to detect overlaps for contig assembly. In particular, a
prescribed clone-length fraction necessary for detection
0 > 0 accounts for the fact that some
percentage of overlaps will go undetected (the full description of
nomenclature is given in Table 1). A
measure of overlap itself, for example, as characterized by library
depth
is not considered. However, for random clone
sequencing, contig assembly relies on ex post facto sequence
comparison, which can be assumed to detect all overlaps
(
0
0). Application of mapping theory, for
instance, the Lander and Waterman model with
0 = 0, would not explicitly account for
library depth. Modeling liability of this idealization, if any, has not
been established.
|
In this report, we formulate a theory for the random-clone-sequencing procedure described above. Our primary purpose is to provide a mechanism to estimate performance of an actual sequencing project based on this paradigm. Moreover, we also evaluate the importance of library depth as a modeling factor and, thus, the applicability of mapping models to this problem. We focus on two types of libraries: one generated by random means, for example, mechanical shearing, and one created by partial digest using a restriction enzyme. The former is expected to have essentially a uniform clone distribution because there is no preference for cut sites, whereas the latter depends upon the inherent nonuniformity of restriction sites throughout a genome and could have appreciable bias in the distribution of clones. We apply the theory to the particular case of human BAC sequencing using a randomly generated library and a library created by partial digest with HindIII. We refer to these as the randomly generated and HindIII libraries, respectively. Map-based results from human-chromosome-2 BAC data are used for comparison. No account is made of genome-specific or nonrandom phenomena; therefore, theory and results are considered to be first-order approximations.
| |
RESULTS |
|---|
|
|
|---|
Assessment of the Importance of Library Depth
If we set
0 = 0 in the Lander and Waterman
(1988)
model to simulate perfect overlap detection, the main difference
between it and our model in equation 3 is consideration of library
depth. For perfect detection, Lander-Waterman coverage reduces to the well-known Poisson coverage expression
L(i) = G
Ge
L0i/G.
Equation 3 yields the same result if we allow clone size to vanish
(L0
0). Thus, the importance of considering
library depth diminishes with clone size, at least in the approximate
context of these models.
The physical interpretation is that L0
0 tends
toward an idealized point model, for which partial overlap of one clone
with another is not possible. For the point model, clones only overlap completely or not at all. As would be expected, one identically recovers the Poisson coverage law if our model is derived without accounting for partial overlap. Because clone size is usually small
relative to genome size, it appears from equation 3 that library depth
is actually a secondary consideration compared to the other variables.
A corollary is that random clone sequencing can often be reasonably
approximated by the Poisson coverage expression.
Assessment of the Importance of Clone Size Variation
Actual clone sizes in any library can be expected to vary somewhat;
however, uniform clone size has been a standard theoretical assumption
(e.g., Lander and Waterman 1988
; Barillot et al. 1991
; Zhang and Marr
1993
; Port et al. 1995
; Roach 1995
). We use data from chromosome-2 BAC
clones to evaluate the effect upon the present model. Table
2 indicates an average clone length of
181.8 kb, with a 19-kb standard deviation. A rudimentary test is to
determine difference in coverage over one standard deviation in clone
size, that is, for 181.8 ± 9.5 kb (the test is more
conservative than it may initially seem because the size distribution
is entirely concentrated at a single point for each case). Equation 3 yields a maximum difference below 10% when evaluated over the
whole sequencing project. As with library depth, variation of clone
size does not appear to be a primary factor in modeling random clone
sequencing.
|
| |
DISCUSSION |
|---|
|
|
|---|
Performance Early in a Project
Results from 7000 randomly sequenced human BAC clones are used to
evaluate performance early in a project. Modeling constants are genome
length G = 3 Gb, nominal clone size
L0 = 170 kb (derived from a genomewide sampling of
Genome Sequencing Center BAC clones), and library depth
= 10×. Other parameters are as discussed in Methods; in particular, the average value of redundancy for map-based sequencing is
exp = 14.4%. Figures
1 and 2 show
redundancy and coverage results, respectively, for each approach. They
confirm that randomly sequenced BACs, regardless of library type,
initially yield comparable rates of coverage and lower redundancy
compared to map-based sequencing. After some number of clones, the
advantage shifts to the map-based approach. From a redundancy
standpoint, this crossover occurs at approximately 5600 clones for the
randomly generated library and between 2900 and 4800 clones for the
HindIII library. In terms of coverage, the numbers are
comparable: 5600 nonbiased clones and between 3500 and 5400 HindIII clones. The upper limit for performance crossover is
about 5600 BAC clones, using a randomly generated library. At this
point, a total of ~one-fourth of the genome has been covered. This
value is also a reasonable estimate for the mouse genome because its
parameters are nominally the same as for the human genome. Figure 2
also indicates that performance is essentially independent of library type and bias level for the first several thousand clones. In fact,
rate of progress for the first 10%-15% of the genome is essentially
fixed. This observation suggests that conventional restriction digest
libraries could effectively be used for random clone sequencing while
mapping work is in progress.
|
|
Simultaneous Random, Map-Based Sequencing
In a general scenario, random and map-based sequencing could be conducted simultaneously. At the time of this writing, ~50% of map-based human sequencing was complete, well past the 5600 clone crossover juncture. However, Figure 3 (inset) suggests that sequencing from the random library would still be justified because its coverage rate, as deduced from the slope, remains comparable to the map-based approach. Conversely, suitability of the HindIII library depends strongly on bias. For low bias, the coverage rate is slightly less than that for a random library; however, the rate is much decreased at higher bias.
|
To determine when random clone sequencing should be halted, the
following procedure can be applied. First, specify minimum acceptable
performance in terms of model parameters; then evaluate the resulting
number of clones obtained for each parameter. Choose the lowest value
as the stopping point. If finished sequence is already available, as in
this case, subtract the equivalent number of already-sequenced clones
to obtain the number of clones that can still be sequenced within the
original constraints. Consider the following example: Random sequencing
is performed while its rate of coverage is at least half that of
map-based sequencing and its redundancy is <50%. We work the
problem for a random library because it is already clear that this will
yield a higher number of clones. Using
exp = 0.144 from Table 2, rate of progress
for map-based sequencing is computed as
R = 0.856L0 per BAC clone. Taking half
this value indicates that random clone sequencing should be continued
until R falls to 0.428L0 per BAC. Solving
equation 4 for i and substituting
R = 0.428L0 yields ~15,000 clones as
one possible stopping point. The other possibility is given by the
number of clones for 50% redundancy, which can be read directly from
Figure 4 as ~28,000. Our prescribed
coverage rate governs the problem; so we choose 15,000 as the maximum
number of randomly sequenceable BAC clones. At 50% coverage, the
equivalent number of already-sequenced BAC clones is about 12,000, which means 15,000-12,000 or 3000 BACs could still be sequenced within our original performance constraints.
|
Extrapolation to Higher Coverage
Figures 3 and 4 extend the simulation to 10 genome equivalents
(176,000 BACs) to examine results at higher coverage levels. There has
been considerable debate on the efficacy of a clone map for sequencing
(Green 1997
; Weber and Myers 1997
); however, Figures 3 and 4 clearly
show its superior performance at higher coverage levels. Specifically,
map-based clone sequencing behaves linearly, whereas random clone
sequencing is an asymptotic process. The random library displays better
coverage and lower redundancy than the best HindIII library
and converges to full coverage. All bias estimates predict the
HindIII library falling short of 100% coverage due to
unrepresented regions of the genome.
Summary
Performance of map-based sequencing is superior to random clone sequencing, except early in a project. Moreover, a randomly generated library can be expected to outperform a restriction digest library, perhaps by a significant margin. These conclusions are likely true for many combinations of organism, clone type, enzyme, etc. Results suggest that sequencing via map-based and random strategies simultaneously is reasonable, especially if a suitable performance-based estimate for halting random clone sequencing is made.
Other cases of interest can be treated by applying the same procedures
shown here for human BAC clones. Of special interest is the
whole-genome shotgun method, which is now being used for large genomes
(Adams et al. 2000
). The procedure relies on random fragmentation so
that a uniform subclone distribution can reasonably be assumed. Here,
L0 is the subclone length. Because this value is
considerably small compared to genome size, a point model can readily
be assumed, permitting usage of the Poisson coverage approximation discussed above.
| |
METHODS |
|---|
|
|
|---|
Models for map-based and random clone sequencing are derived here.
Solomon (1978)
and Hall (1988)
are good introductions to the general
topic of coverage theory, which encompasses these processes.
Empirical Model of Map-Based Sequencing
A clone library supported by a physical map can be characterized by
the average overlap of the sequenced clones
exp. We estimate this parameter
empirically using 43 human-chromosome-2 BAC clones sequenced at the
Genome Sequencing Center. Table 2 shows derived overlap and size data
for these clones, which originate from the RPCI-11 library (Osoegawa et
al. 1998
). In the present context, our documentation system for clone
lengths and finishing boundaries has the effect of tallying overlaps
twice. Therefore, the average overlap of 28.9% calculated from Table 2
must be divided in half, yielding
exp = 14.4%. This group of clones
is taken as representative of the overall map-based sequencing process
for human BAC clones. In other words, overlap is assumed to be a
constant given by
(i) =
exp.
Therefore, the coverage added by each sequenced clone, that is, the
rate of progress is
R(i) = (1
exp)L0 and the total coverage L(i) is simply
iR(i).
Random-Sequencing Model for a Randomly Generated Library
Random fragmentation implies that all base positions are equally
likely to be breakage sites, which results in a uniform clone distribution. We assume a constant clone size of L0
bases, a genome length of G bases, and a library depth of
. The standard equation of expected value is employed
for estimating coverage
L =
p
L
, where L
is the coverage contributed by a
particular event
and p
is its probability of
occurrence. Three events are considered: the new event, in which the
clone does not overlap any established sequence (all added sequence is
new); the partial event, in which part of the clone overlaps
established sequence; and the buried event, in which the entire clone
is buried in previously generated sequence. The coverage after randomly
sequencing a clone is then
(pL)new +
(pL)partial +
(pL)buried. Then define a segment length b, the average number of bases
from one breakage site to the next. Because clone ends are synonymous with breakage sites, clones must overlap in units that are multiples of
b. The number of clones in a library
N =
G/L0, equals
the number of right (or left) clone ends, and therefore, the number of
uniformly distributed breakage sites. By definition, b can be
computed by dividing the genome length by the number of breakage sites,
yielding
b = G/N = L0/
.
A clone end can only occupy discrete positions given by multiples of
b; so the total number of possible positions is
K = N
+ 1 and the
probability of any given position is
p = K
1. This expression accounts for
so-called end effects (Port et al. 1995
; Roach 1995
) and prevents
clones from running off the end of the genome. No such constraint
exists for circular genomes (Parke 1997
).
The length of unique sequence after i clones have been
sequenced is Li, where
Li = lib.
For the new event, the length after sequencing another clone is
Lnew = Li + L0
and the number of ways to obtain this event is
K
(li +
1).
For the buried event, the length remains the same as it was before the
clone was sequenced, that is,
Lburied = Li,
and the number of ways this could occur is
li
+ 1. These
combinations yield
pnew = (K
li
+ 1)/K
and
pburied = (li
+ 1)/K.
Length for the partial event is simply Lnew
minus the amount of overlap, that is,
Lpartial = Li + L0
h b, where h is the number of b-length
segments of overlap. Because the new clone is, at most, equal in length
to the established sequence, this can happen a total of
h =
1 ways on either side of the
established sequence. Therefore, the partial event can be written as
(pL)partial = 2

1h=1(Li + L0
h · b)/K. Using a summation identity, this expression
reduces to
(pL)partial = 2(
1)(Li + L0)/K
(
1)L0/K. Taking all of these
events and simplifying, the final recursion for coverage is
Li+1 = Li + [1
Li/(G
L0 + L0/
]L0. Rearranging, we
obtain
|
(1) |
|
(2) |
L0(1
1/
)]
as an integrating factor (Martin and Reissner 1956
|
(3) |
L0(1
1/
) and c1 = c0
L0. Various performance
parameters can be derived from this equation. For example, in the ideal
case of end-to-end clone placement, coverage is simply L0i, which allows redundancy to be
defined in a normalized sense as
(i) = 1
L(i)/(L0i). Rate
of progress, R(i), can be calculated by
differentiation as
|
(4) |
Random-Sequencing Model for a Restriction Digest Library
Modeling the sequencing process for a restriction digest library
must account for nonuniform clone distribution, which arises from the
natural restriction-site bias of a genome. Rather than attempt to
derive an exact solution for this case, it is more expedient to employ
Monte-Carlo simulation (Press et al. 1991
). A set of restriction sites
is first established according to an appropriate
nonuniform-probability-density function. Sequence coverage is then
simulated by randomly selecting a site as a left clone end. Based upon
local site distribution and nominal clone length, the right end is then
determined, after which cumulative coverage is recomputed. Iteration is
continued until a user-specified stopping point is encountered.
Poisson probability density functions provide a suitable model for
restriction sites in the sense that a large segment size can be
specified, for example, 4s where s is
enzyme specificity, while the probability of a site occurring in any
neighborhood of a segment is small. Because site bias cannot be known a
priori (Green 1997
), we follow the methodology of Port et al. (1995)
and Schbath (1997)
in using simple functions to model bias. We select
(m + 1)xm, where m is
the user-specified bias level and x is the distance along the
linearly arranged genome. This function conserves the total number of
sites in a genome to
µavgG/4s,
where µavg is an empirically sampled
value of the number of restriction sites in a segment. A general
variable-rate Poisson process for restriction site distribution is then
given by
|
(5) |
G using equation 5; however, the
local distribution in each segment is taken to be uniform. No coverage
is allowed between base position 1 and the first restriction site
x1. Thus, bias yields a segment
x1
1, which cannot be covered by sequence due
to lack of representation in the library.
As with the uniform distribution model, other performance parameters
can be obtained using coverage results. Redundancy is computed exactly
as defined for the uniform model. For rate of progress, no closed-form
expression is available. However, it can be calculated by finite
difference approximation (Tannehill et al. 1997
). Due to the nonsmooth
nature of Monte-Carlo simulation, it may be necessary to average out
local fluctuations by computing each difference over a large set of
clones. We note that this introduces an additional component of
numerical error (Tannehill et al. 1997
).
As applied specifically to a HindIII human BAC library, the
parameters are s = 6 (because this enzyme is a 6-cutter
recognizing AAGCTT) and a segment size of 4096 bases. To estimate
µavg, 868 finished sequences from the
Genome Sequencing Center encompassing ~105 megabases were analyzed
for AAGCTT. We found µavg
1.25,
implying ~916,000 total sites for a 3-Gb genome size. A coefficient
of dispersion of 1.046 indicates that Poisson modeling is acceptable
(Sokal and Rohlf 1981
). We assume lower- and upper-bound functions for
bias as 3
x/2 and 3x2, respectively.
A code using this model was tested with m = 0 and results compared well to equation 3. This method is computationally intensive because each restriction site and its coverage status must be stored explicitly and these arrays are traversed for each succeeding clone. Placing a clone in the genome and computing coverage require approximately L0/4s and G/4s operations, respectively. Because clone ends must be restriction sites, the simulated length of a clone will vary in a range of ~4s around the nominal value of L0. Exceptions are that near the end of the genome, a clone may be much smaller because it cannot run off the end, and in restriction-site-poor areas, a clone can be significantly longer because it must extend to the next restriction site.
| |
ACKNOWLEDGMENTS |
|---|
This work was supported by a grant from the National Human Genome Research Institute (HG02042). We thank J. Wallis of the Genome Sequencing Center for useful discussions and anonymous reviewers for insightful critiques.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| |
FOOTNOTES |
|---|
3 Corresponding author.
E-MAIL mwendl{at}watson.wustl.edu; FAX (314) 286-1810.
Article published online before print: Genome Res., 10.1101.gr.133901.
Article and publication are at www.genome.org/cgi/doi/10.1101/gr.133901.
| |
REFERENCES |
|---|
|
|
|---|
Received February 2, 2000; accepted in revised form November 21, 2000.
This article has been cited by other articles:
![]() |
M. C. Wendl and R. H. Waterston Generalized Gap Model for Bacterial Artificial Chromosome Clone Fingerprint Mapping and Shotgun Sequencing Genome Res., December 1, 2002; 12(12): 1943 - 1949. [Abstract] [Full Text] [PDF] |
||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||