|
|
|
|
Vol. 10, Issue 3, 350-364, March 2000
METHODS
|
| |
ABSTRACT |
|---|
|
|
|---|
This paper describes a fast and scalable strategy for constructing a radiation hybrid (RH) map from data on different RH panels. The maps on each panel are then integrated to produce a single RH map for the genome. Recurring problems in using maps from several sources are that the maps use different markers, the maps do not place the overlapping markers in same order, and the objective functions for map quality are incomparable. We use methods from combinatorial optimization to develop a strategy that addresses these issues. We show that by the standard objective functions of obligate chromosome breaks and maximum likelihood, software for the traveling salesman problem produces RH maps with better quality much more quickly than using software specifically tailored for RH mapping. We use known algorithms for the longest common subsequence problem as part of our map integration strategy. We demonstrate our methods by reconstructing and integrating maps for markers typed on the Genebridge 4 (GB4) and the Stanford G3 panels publicly available from the RH database. We compare map quality of our integrated map with published maps for GB4 panel and G3 panel by considering whether markers occur in the same order on a map and in DNA sequence contigs submitted to GenBank. We find that all of the maps are inconsistent with the sequence data for at least 50% of the contigs, but our integrated maps are more consistent. The map integration strategy not only scales to multiple RH maps but also to any maps that have comparable criteria for measuring map quality. Our software improves on current technology for doing RH mapping in areas of computation time and algorithms for considering a large number of markers for mapping. The essential impediments to producing dense high-quality RH maps are data quality and panel size, not computation.
| |
INTRODUCTION |
|---|
|
|
|---|
Many genome-wide maps have been constructed as
part of the Human Genome Project. A current widely used technique is
radiation hybrid (RH) mapping (Goss and Harris 1975
; Cox et al. 1990
;
Walter et al. 1994
). One purpose of constructing maps is to provide
landmarks along each chromosome to guide sequencing of the DNA. To
date, most of the mapping effort has been put into iteratively
constructing denser and denser maps rather than integrating new maps
with old maps. Recurring problems in using maps from several sources
are that the maps use different markers, the maps do not place the overlapping markers in same order, and the objective functions for map
quality are incomparable. Because many large contigs of human DNA
sequence are now finished and submitted to GenBank, it would be
desirable to integrate maps of markers with the DNA sequence so that
the maps can continue to be used to fill in the rest of the sequence
and to identify genes in regions bounded by well-mapped markers.
In this paper we propose and evaluate new strategies for reconstructing RH maps and integrating those maps as well as others that have comparable objective functions for map quality. We also evaluate whether the current maps and the new maps we compute are consistent with human DNA sequence contigs in GenBank.
It is possible to reconstruct maps of previously mapped markers because
the RH database (RHdb, http://www.ebi.ac.uk/RHdb/index.html) contains
publicly submitted RH vectors (rhvectors) for sequence-tagged site
(STS) markers. An rhvector for an STS x is a vector
(x1, x2, ... ,
xn), where n is the number of hybrids (or
cell lines) in the RH panel and each xi = 0, 1, 2, depending on whether hybrid i is typed and retains
x, typed and does not retain x, or not typed
and/or ambiguous, respectively (Cox et al. 1990
; Boehnke et al. 1991
;
Matise et al. 1998
).
The rhvectors in RHdb are generated from multiple mapping panels; those
reviewed in this paper are from the Genebridge 4 (GB4) panel (Gyapay et
al. 1996
) and the Stanford G3 panel (Stewart et al. 1997
). Previously
published maps used the GB4 and G3 panels independently and used
independent resources such as YAC contig data to build their maps
(Hudson et al. 1995
; Deloukas et al. 1998
). We decided to reconstruct
the RH maps to take advantage of the fact that some markers were typed
on both panels. The concatenation of rhvectors for the same marker from
both panels makes the resulting rhvectors longer, which Ben-Dor and
Chor (1997)
showed is essential to compute more accurate RH maps.
RH mapping is based on the hypothesis that the closer the loci are on a
chromosome, the more likely they are to be retained or lost together in
a hybrid. That is, their rhvectors will have few differences. The two
criteria typically used for assessing the closeness of rhvectors are
the number of obligate chromosome breaks (OCB) and maximum likelihood
estimate (MLE). Other criteria like Bayesian posterior probabilities
involve more modeling assumptions (Lange et al. 1995
) and have not been
used in developing software for computing RH maps. It is known that OCB
and MLE are not identical, but to our knowledge, Ben-Dor and Chor
(1997)
are the first to show that OCB and MLE are equivalent under
conditions of equally spaced markers and 50% retention of markers on
hybrids. However, these conditions are not satisfied by data on current
panels. We verify the incomparability of the two objective functions.
The number of OCB for a marker order on a RH map with markers typed on the same panel is the number of times a 1 is followed by a 0 or vice versa, ignoring intervening 2s (unknown), between consecutive markers at all vector positions. The OCB objective for creating a map from rhvectors, then, is to find the marker order that implies the minimum number of OCB among all possible marker orders. For the MLE objective, the breakage probability and retention probability are calculated from rhvectors that are then used for estimating the distance between markers and the likelihood of a map. The order of the markers that maximizes the likelihood of the map is considered the true order of markers on the map.
Current RH maps are produced with specially tailored software packages
such as RHMAP (Boehnke et al. 1991
), RHMAPPER (Slonim et al. 1997
), and
MultiMap (Matise and Chakravarti 1995
). The packages currently in use
choose either OCB or MLE as the objective function and use statistical
parameters and/or heuristics to produce a map. When using MLE, Lange et
al. (1995)
proposed a way of constructing a model that specifically
incorporates the possibility of typing error and presence of unknowns,
and Lunetta et al. (1996)
specifically allowed for multiple panels. We
propose extensions to the OCB and MLE objective functions, different
from those in previous papers such as (Lange et al. 1995
), to
incorporate the presence of unknowns and present a strategy that
identifies markers with the same map order independent of which
extended version of objective functions is used.
We borrow several tools and techniques from domains of computer science
and combinatorial optimization (Papadimitriou and Steiglitz 1982
) to
design and implement our strategy. It has been known for several years
that for haploid error-free data, the problem of computing a RH map for
either the OCB or MLE criterion can be mathematically transformed into
an instance of a much studied optimization problem called the traveling
salesman problem (TSP; Karp et al. 1996
; Ben-Dor and Chor 1997
). The
transformation employs an approach using multiple pairwise comparisons
between markers rather than the more commonly used multipoint
comparisons. The transformation is exact when there are no unknown
entries in the data and approximate otherwise. The TSP has been the
subject of intense research for decades (Papadimitriou and Steiglitz
1982
; Lawler et al. 1985
; Reinelt 1994
), and there is now a superb
software package called CONCORDE (combinatorial
optimization and networked combinatorial optimization research
and development environment; Applegate et
al. 1998
) for solving large instances. We decided to test
CONCORDE for RH mapping as part of our effort to reconstruct maps. An
unintended result of our experiments is that CONCORDE consistently
computes maps with lower OCB and higher MLE than those computed by
RHMAPPER. Moreover, CONCORDE is much faster on large data sets than
RHMAPPER when RHMAPPER is required to compute its initial framework
internally de novo. In the past, the users of RHMAPPER have constructed
an initial framework map, in part, by relying on information from other
sources such as genetic map and YAC contig data (Slonim et al. 1997
).
Ben-Dor and Chor (1997)
showed that with the current number of hybrids,
the probability of getting "the correct order" for all the markers
is very low (<0.01). Even for only 20 markers, the success
probability is <0.5, so any strategy that is pinned to framework
maps of >20 markers is likely to produce maps with serious
large-scale errors. The attempts made to model errors in data by hidden
Markov models (Heath 1997
; Slonim et al. 1997
) have been successful in
placing a few hundred markers but cannot be used for placing the
thousands of markers that are becoming available without starting from
a fairly dense initial framework map.
For consistency, we compare previous maps and our integrated map with
large sequence contigs submitted to GenBank. The maps are consistent
with the sequence if markers are placed in the correct sequence order
on the map. We choose this objective function for map quality because
there is currently no good way to assess how much better one map is
compared with another one in terms of the number of markers actually
ordered correctly except for chromosome 22 for which the completed
sequence is available. We find that all of the maps are inconsistent
with the sequence data for at least 50% of the contigs, but our
integrated maps are more consistent. We provide some evidence that the
inconsistencies are in large part due to data quality or panel sizes
and not as much due to mapping strategy. We also list the number of
markers in the same order between every pair of Généthon
(Dib et al. 1996
), RH Consortium (Deloukas et al. 1998
), Stanford
(Stewart et al. 1997
), and our integrated map.
The next section of this paper presents definitions and theoretical background on RH mapping and its relationship to problems in combinatorial optimization. (More background material that is relevant to the rest of the paper, but less essential, can be found in the Appendix.) This is followed by a section in Results describing our map reconstruction strategy, our map integration strategy, and our computational experiments with these strategies. We conclude with a short Discussion and a short section on Methods summarizing availability of our software and data.
Definitions and Theoretical Background
Our methods rely on known algorithms for two problems widely studied
in computer science and combinatorial optimization: the longest common
subsequence problem (LCSP, sometimes also called the longest common
substring problem) and the TSP. Both LCSP and TSP have many
applications to problems in computational biology (Gusfield 1997
) but
may be unfamiliar to practitioners of RH mapping. Therefore, we
summarize the most essential background material in this section. More
background material including a brief history of the TSP can be found
in the Appendix.
LCSP
Given two sequences A = a1, a2, ... , an and B = b1, b2, ... , bm, find a longest sequence C = c1, c2, ... , ck such that C is a subsequence of both A and B. For example, if A = a, l, g, o, r, i, t, h, m and B = l, o, g, a, r, i, t, h, m, then longest common subsequences (LCS) are l, g, r, i, t, h, m and l, o, r, i, t, h, m, both of length 7. In the weighted version of the problem, we look for common subsequence that has maximum weight. In the previous example, if the weights are a = 3 and for other letters 1, then the weighted common subsequence for A and B is a, r, i, t, h, m that has weight 8 and not l, g, r, i, t, h, m or l, o, r, i, t, h, m that have weight 7. The LCSP and its weighted version can both be solved using dynamic programming (Gusfield 1997Maximum Likelihood Computation
The steps for doing data analysis using maximum likelihood are as follows (Boehnke et al. 1991| 1. | The retention probability p of the data set is estimated by the ratio of the total number of 1s to the total number of 1s and 0s. | ||
| 2. | The likelihood of observing rhvector (x1,
x2, ... , xn) for a single
marker x is
|
where c is 1 for haploid, 2 for diploid, q = 1 p, and nj is the number of positions
i such that xi = j.
|
|||||||||
| 3. | The likelihood of observing rhvectors for a pair of markers
x and y is
|
where x,y is the breakage probability between
markers x and y, and nij is
the number of positions r such that xr = i and yr = j.
|
|||
| 4. | L(x,y) is maximized when
x,y is the smaller root of the equation
obtained by setting the derivative of L(x,y) with respect to x,y to 0. For the diploid case, the
equation to be solved is a degree five polynomial, and for the haploid case, we get a degree two polynomial whose solution gives the following:
|
The root of the quadratic equation chosen for is the smaller root
to satisfy the constraint that x,y = 0 when n10 + n01 = 0.
|
|||||
| 5. | The maximum likelihood (M) of marker order
x1, x2, ... ,
xm on a map M, also known as likelihood
of M, is
|
We use to denote multipoint likelihood and L
to denote two-point likelihood. By considering conditioning events as
independent and removing the conditioning on independent events, the
multipoint maximum likelihood of map x1,
x2, ... , xm is
approximated by several two-point likelihood estimates as
|
TSP
Given a finite number of cities and the cost of travel between each pair of them, find the cheapest way of visiting all of the cities and returning to the starting point. As explained in the Appendix, the TSP is intractable in a formal sense, but much research has gone into methods for solving specific instances either approximately or optimally. A well-known software package for the TSP, namely CONCORDE (Applegate et al. 1998Reducing OCB to a Distance Measure for TSP
The distance between two rhvectors (x1, x2, ... , xn) and (y1, y2, ... , yn) is the number of positions at which xi = 1 and yi = 0 or vice versa with distance from dummy marker to any other marker being any constant. If there are no unknowns in the given RH data, then the marker order produced by TSP achieves minimum OCB.Reducing MLE to a Distance Measure for TSP
Define the transition probability for marker x as
|
(7) |
|
(8) |
|
(9) |
|
(10) |
log(tx,y).
Computing retention frequency and breakage probabilities for diploid
data with errors results in Markov and hidden Markov models that can be
used for estimating the likelihoods by techniques such as the
estimation-maximization (EM) algorithm. These methods are thus limited
in the number of markers they can map reliably and are not suitable for
translation to TSP. Ben-Dor and Chor (1997)Normalized TSP+OCB
The distance (n10 + n01) as computed in the reduction of OCB to TSP is normalized by n/(n00 + n01 + n10 + n11) under the assumption that the positions with unknowns in them have the same distribution of differences as the positions in which both xi and yi are known. The distance according to this objective function is, then,
|
(11) |
Weighted TSP+OCB
In this objective function, all six combinations for a pair from {0, 1, 2} are assigned a weight. We did several experiments with different weighting schemes. Each experiment has three steps: (1) compute edge weights between every pair of markers, including the dummy marker, according to the weighting scheme, (2) solve TSP by using the part of CONCORDE that guarantees an optimal order for given distances (see Appendix for details), and (3) compute OCB for the marker order M obtained by TSP; compute the sum of (n10 + n01) for consecutive markers on map M. Among the edge weights that we tested, the scheme that results in a map with lowest OCB is
|
(12) |
Base TSP+MLE
Same as reduction from MLE to TSP.Extended TSP+MLE
Same as reduction from MLE to TSP except that in equation 5, n is replaced by (n00 + n01 + n10 + n11).Normalized TSP+MLE
The breakage probabilities are computed as in Extended TSP+MLE. The transition probabilities are normalized to reflect that compution of breakage probabilities ignores positions contributing to n22. The distance between x and y resulting from this normalization is
|
x,yp),
B = log(1
x,yq), and C = log(
x,y
). The distance
between dummy marker and x is given by
|
| |
RESULTS |
|---|
|
|
|---|
We first present a RH map construction strategy with the goal of producing maps that can be integrated. The emphasis is on striking a balance between the reliability of the map produced and the number of markers that get placed on the map. Second, we present a map integration strategy. The map integration procedure is not specific to RH maps and can be used for any maps that have the same objective criteria. Third, we present comparisons of our new maps with previously published maps, maps reconstructed with RHMAPPER, and sequence data submitted to GenBank.
Map Construction
The steps are as follows:
Step 1: Compute Framework Markers
The candidates C for framework markers are the markers typed on all panels. For each candidate framework marker in C, its rhvectors from different panels are concatenated to produce a virtual rhvector for the marker. The set of framework markers F is a subset of framework candidate markers C such that no marker pair in F is "very close" or "too ambiguous" to another marker in F where closeness and ambiguity are determined by cutoffs for break count B, negative logarithm of transition probability LL, and percentage of unknowns U. If a marker x
C
has more unknowns with respect to the length of its rhvector than
U, then x is not present in F. If a pair of
markers x,y
C have a break count
<B or have
log(tx,y)>LL,
at least one of x,y is not present in F. The
breakage probability for tx,y was
computed as in Extended TSP+MLE. The cutoffs are determined
experimentally and necessarily depend on the data. We look for cutoffs
that give a non-negligible set F of framework markers such
that the maps for markers in F computed in step 2 and step 3 are mostly consistent. For all maps described here, we used
B = LL = 3 but did not use any cutoff for
percentage of unknowns.
Step 2: Compute Maps
Reduce the problem of computing a map to that of TSP using each of the five reductions described in the previous section. Use CONCORDE to solve each instance of TSP and transform the solution to a map. This results in five maps for framework markers corresponding to five reductions.Step 3: Compute a Framework Map
We compute a framework map as the map with only those framework markers whose order is consistent with all the maps computed in step 2. In practice, we find that in step 1, deleting markers that have rhvectors with more unknowns than those conflicting with them is effective.Step 4: Compute Maps for Each Panel
Same as step 2 but with all markers for the panel and not just the framework markers.Step 5: Reorder Maps
If there are m markers on the framework map, say f1, f2, ..., fm, and two terminals (f0 for p-terminal and fm+1 for q-terminal), then there are m + 1 intervals on the framework map into which each remaining marker on the panel can be placed. For each marker x, we find the interval fi, fi+1 such that the likelihood of fi, x, fi+1 is the maximum among all the intervals. We compute the lod score of placing x as the logarithm of the likelihood ratios of placing x in the best interval to placing x in the next best interval. Then, each map computed in step 4 is globally reordered as follows:| 1. | For each fi, find the consecutive set of markers on
map M including fi that have the interval
i, i 1 or i + 1 assigned to them.
This piece of the map is called an extendible piece of M for
fi.
|
| 2. | Consider f0, ... ,
fm+1 in order of their increasing index.
For each fi, find the set of markers X in
the extendible piece for fi such that each marker in
X is also present in a previously considered extendible piece
for f0, f1, ... ,
fi 1. Delete markers in
X from the extendible piece for fi. In
practice, we do not see markers in extendible pieces overlapping with
markers in extendible pieces of more than one or two previous framework markers.
|
| 3. | Determine if the assignment of interval i 1 or
i + 1 orients the piece with respect to the framework map.
If the interval assigned to the first marker of an extendible piece is
less (greater) than the interval to the last marker of the extendible
piece, then the piece is oriented from p-terminal to q-terminal
(q-terminal to p-terminal). If the first and last markers are assigned
to the same interval, then the piece is unoriented.
|
| 4. | If an extendible piece of M for fi can be oriented, the piece replaces fi, and relative ordering of markers in the piece is preserved; otherwise, all the markers in the piece are collapsed at the position for fi. |
Map Integration
In this subsection we describe how to integrate two or more maps that have the same criteria for measuring the score of placing a marker on a map. The core of the integration strategy is to use the algorithm for the weighted LCSP for finding a set of markers that are common and have same relative order in a pair of maps.
Merging Maps
To merge two maps, we first compute their weighted LCS. The markers common in both maps but not present in the LCS are deleted from both maps. The markers that are not common between the two maps are interleaved by interpolation between markers that are in the LCS. For more than two maps, the number of common markers among all of them may be considerably less than the number of common markers for any pair of maps. Our algorithm for merging more than two maps is to first merge maps for all pairs and then iteratively merge the results of those pairwise merged maps. There is no fixed order in which pairwise maps are merged. For RH maps produced using the strategy in the previous subsection, the weight of a marker is its lod score that is computed in step 5. The steps for producing an integrated RH map use the merge procedure described in the previous paragraph. The steps are as follows:Step 6
Merge reordered maps to produce one map per panel.Step 7
Merge maps for each panel to produce an integrated map.New Maps and Quality Assessment
We have presented a method of constructing RH maps from data on
various panels with the aim of integrating them to produce a single RH
map. We use our algorithm on G3 panel and GB4 panel data downloaded
from RHdb to construct an integrated G3/GB4 panel RH map. We seek to
balance the quality of the map and the number of markers that get
placed on the map. Because the objective functions for RH mapping
cannot be directly used to evaluate the quality of the maps, we check
our maps using segments of contiguous genomic sequence (contigs)
reconstructed from individual clone sequences (Jang et al. 1999
), from
chromosome 22 sequence (Dunham et al. 1999
), and with already published
maps. We compare our software with RHMAPPER.
New Maps
We obtained the rhvector inputs for our experiments from the RH database (RHdb, http://www.ebi.ac.uk/RHdb/index.html). Before using the data from RHdb, we first assign a unique identifier to each pair of forward and reverse primers for STS markers. Two markers with identical primer sequences are, in reality, the same STS marker and are assigned the same identifier. If an identifier has more than one rhvector, we pick an rhvector with the fewest unknowns. We reconstructed maps for the GB4 and G3 panel data using CONCORDE and the five transformations to TSP (two variants of OCB and three variants of MLE) described earlier. These maps were then integrated to produce a single RH map. We have used the chained Lin-Kernighan (Lin and Kernighan 1973Software Quality
As described above, we have constructed an integrated G3/GB4 RH map. We also attempted to produce maps with RHMAPPER using the same data on both panels, to compare RHMAPPER with our software that uses CONCORDE. We chose RHMAPPER because it was used to construct the Whitehead Institute map (Hudson et al. 1995RUNNING TIME
To compute all the single-chromosome maps described above with CONCORDE took <2 weeks total on a Sun Ultra10 workstation. We even tested CONCORDE with input consisting of all the markers on all the chromosomes together, and that computation took 3 days. Because we used the chained Lin-Kernighan heuristic from CONCORDE whose running time is dependent on the number of iterations as well as the size of data, computing a map of all the markers together with (500,000 kicks, two runs) takes less time than the computation of chromosome-specific map where each map is run for (250,000 kicks, two runs). In contrast, RHMAPPER could not finish a chromosome 1 map within 3 weeks, and took >2 months to compute all the remaining single-chromosome maps. Constraining to an initial framework would reduce the running time for RHMAPPER considerably but may impact the quality of the map and make the quality comparison done below invalid as the errors in maps computed by RHMAPPER may be attributed to the initial framework.MAP COMPARISON IN TERMS OF OCB AND MLE
We consider Whitehead Institute map, maps computed by us using RHMAPPER for both G3 and GB4 panel data, and maps computed using CONCORDE for both G3 and GB4 panel data. For each map and each chromosome, we compute the average number of chromosome breaks observed between consecutive markers and the average of the logarithm of two-point likelihood for the maps with breakage probabilities computed as in Extended TSP+MLE. We could not use RHMAPPER for evaluating the multipoint likelihood of the maps because RHMAPPER suffers from underflow for the number of markers we have on our maps. We compared maps computed using CONCORDE for both G3 and GB4 panel data and not our integrated map because we cannot compute OCB or MLE when consecutive markers on a map are from different panels. Furthermore, the maps produced by RHMAPPER are for one panel. The results are summarized in Tables 1 and 2. RHMAPPER runs for chromosome 1 were aborted after 3 weeks of computation and is reflected by a "?" in Tables 1 and 2. There, we have not attempted to produce an "optimal" order for the markers that are binned to the same position on the maps because the rhvectors for markers binned to the same position, in principle, should be similar.
|
|
Quality of Integrated Map
We compare the quality of our integrated map with that of previously published maps by looking at consistency with sequence data and by looking at the maximum number of markers placed in same relative order between pairs of maps.CONSISTENCY WITH SEQUENCE CONTIGS
To test the correctness of a map, we can check the order of markers that are on the contigs that have been sequenced. We place markers on contigs using the e-PCR program (Schuler 1997CONSISTENCY WITH CHROMOSOME 22 SEQUENCE
The completed sequence for chromosome 22 is available from http://www.sanger.ac.uk/HGP/Chr22/ (Dunham et al. 1999
|
|
|
MAP COMPARISON IN TERMS OF LCS
We consider every pair of Généthon, RH Consortium, Stanford, and our integrated map. Table 6 lists the number of markers that are common between a pair of maps and the number of markers in their LCS. A LCS between a pair of maps gives the largest subset of markers whose relative order on both maps is the same. As expected, the number of markers common between the integrated map and G3/GB4 is more than the number of markers common between G3 and GB4 maps. The integrated map looks more consistent with the G3 map than with the GB4 map, when consistency is measured in terms of the length of the LCS of markers. It is interesting to note that 82.42% of markers are in LCS between our integrated map and Généthon's genetic map that was not constrained by any initial framework map as against 95.76% and 90.49% of markers for GB4 and G3 maps, respectively, which used information from Généthon's genetic map for constructing their initial framework map.
|
| |
DISCUSSION |
|---|
|
|
|---|
We presented a method for producing RH maps that robustly treats the data currently available. Some steps in the process can be further optimized. In particular, one would like to have a mechanism in which vectors with errors can be detected before the TSP is used to construct a map. This would decrease the number of markers that are thrown out in step 5 of our algorithm.
We demonstrated with markers from the two largest human RH maps
currently available (Stewart et al. 1997
; Deloukas et al. 1998
) that
our map integration strategy produces maps that are more consistent
with sequence data in GenBank than either map alone. This validates the
hypothesis that integrated maps can add value over nonintegrated maps.
However, our integrated map is still quite inconsistent with sequence
data, and we showed that this is largely due to poor data quality that
cannot be easily overcome by better mapping algorithms. The
inconsistency between rhvectors and DNA sequence contigs casts doubt on
the hypothesis that adding more markers to current RH maps can guide
future DNA sequencing effectively.
To compute our integrated chromosome maps, we found it necessary to
first recompute maps based on the previously used markers, so as to
take advantage of some markers that were typed on multiple panels.
Recomputing the initial maps was practical only because the genomics
research community has had the foresight to insist on making sequence,
marker, and rhvector data freely available. The recomputation process
confirmed serious concerns raised by Ben-Dor and Chor (1997)
about how
RH mapping is being performed in practice.
Ben-Dor and Chor (1997)
presented both theoretical and practical
assessments of RH mapping methods. On the practical side they tested
the usage of TSP to construct maps. They suggested that computing RH
maps via the reduction to TSP could produce maps of comparable quality
to RHMAPPER. However, they used smaller data sets than ours, and used
only three simple heuristics for TSP. We pushed their suggestion much
further by using much larger data sets and using the CONCORDE software
package for TSP. The results both in terms of map quality (Tables 1 and
2), and running time are striking. CONCORDE consistently produces maps
that have fewer OCB and higher maximum likelihood than published maps
and maps recomputed with RHMAPPER. Moreover, CONCORDE could easily handle all the data for each chromosome, computed all our single chromosome maps in under 2 weeks, and was even able to compute a map
using all markers from all chromosomes together in 3 days. In
contrast, RHMAPPER without a precomputed initial framework did not
finish the chromosome 1 map within 3 weeks. Thus, our map construction
strategy is the first one than can be scaled up to handle many more
markers than are currently available without being pinned to a possibly
erroneous framework. We show that RH mapping can be done efficiently by
taking advantage of the theoretical work and software package developed
for solving a general combinatorial optimization problem.
On the theoretical side, Ben-Dor and Chor (1997)
raised serious doubts
about the ability of the RH approach to produce good maps with current
panel sizes. We rewrite theorem 3 of Ben-Dor and Chor (1997)
as follows:
Theorem
The success probability s of correctly ordering n
uniformly distributed markers is bounded by
|
(13) |
p, and
is the
intensity of the breakage process.
Using the parameter values of n = 200, m = 83
(for G3 panel), p = 0.3,
= 10 (Ben-Dor and Chor
1997
), the above inequality shows one has a <1% chance of finding
the correct marker order. The maximum and minimum number of markers on
our framework maps are 103 for chromosome 4 and 17 for chromosome 21, respectively. Using the larger m = 93 (GB4 panel) and 103 and 17 markers, the chance of ordering 103 and 17 randomly chosen
markers correctly is <3.6% and 58%, respectively. Although the
theorem as stated does not apply when one selects a subset of markers,
which may be easier to order correctly, it does suggest that sticking
to a rigid framework is unlikely to work well. Stringham et al. (1999)
propose a way of not relying completely on a fixed framework map but do
not produce a whole genome map based on that method. To avoid the
Ben-Dor and Chor lower bound, one should choose the framework markers
carefully and allow for the possibility of rearranging or changing the
framework markers in light of the other data. Our method is not pinned
to a framework map and allows for the possibility of framework markers
to be rearranged locally in step 5 followed by possible removal of
framework markers during merge in step 6 and step 7.
Moreover, from inequality 13, it follows that panel sizes must be 2 orders of magnitude larger than currently used to boost the success
probability significantly. It is not the case that using too small
panel sizes simply causes local rearrangements that can be ignored by
the commonly used practice of binning nearby markers. Rather, we find
that marker pairs that belong close together according to the DNA
sequence often have a very large number of OCB in their rhvectors. This
indicates that either the data quality is poor or the panel sizes are
too small, so that the essential assumption that nearby markers have
nearby rhvectors does not hold with high enough probability.
Furthermore, at present there is no good way to assess the fraction of
markers correctly ordered on a map. This confirms the theoretical
evidence by Ben-Dor and Chor (1997)
that adding more markers without
increasing the panel size is not a fruitful strategy to obtain maps
with better quality.
In sum, there is a map integration and reconstruction strategy that can produce maps with better quality. Our software improves current technology for doing the RH mapping in areas of computation time and algorithms for considering large number of markers for mapping. The essential impediments to producing dense high-quality RH maps are data quality and panel size, not computation.
| |
METHODS |
|---|
|
|
|---|
The maps are stored in SQL Server Release 11.0.x of the Sybase
database management software. The functions are implemented using
Transact-SQL and C version of Open Client DB-Library. The algorithm was
developed on Unix System V release 4.0 running under SunOS 5.5.1 using
Sun WorkShop Compiler C 4.2, but it is compatible with other Unix
computers. The mapping software and a copy of this paper are available
via an electronic mail request to richa{at}helix.nih.gov. The
integrated G3/GB4 marker map is available at
http://www.ncbi.nlm.nih.gov/genome/rhmap. For each chromosome, the
recomputed integrated map has columns for (1) marker name, (2) position
(cR) on recomputed GB4 map, (3) odds on recomputed GB4 map, (4)
position (cR) on recomputed G3 map, and (5) odds on recomputed G3 map.
In our computation, we assigned a unique identifier to each marker. For
each marker its unique identifier and the following information, when
available, can be obtained by clicking on the marker name: (1) primer
information, (2) aliases for marker name, (3) mapping information with
respect to GeneMap'99, and (4) e-PCR results on genomic contigs and
cDNAs as appropriate. Information on CONCORDE is available at
http://www.caam.rice.edu/keck/concorde.html. Finished sequences of
individual clones produced by the Human Genome Project have been merged
into contiguous sequence segments (contigs) as previously described
(Jang et al. 1999
). The positions of markers within these sequences
were determined by the e-PCR program (Schuler 1997
), using a word size
of six (W = 6), a variability of up to 10 bases in the PCR
product size (M = 10), and up to 1 mismatching base allowed
(N = 1).
| |
ACKNOWLEDGMENTS |
|---|
We thank David Lipman and James Ostell for helpful discussions. We are indebted to the reviewers of this manuscript for carefully reading it and making several extremely useful suggestions that have improved the exposition of our work.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| |
FOOTNOTES |
|---|
3 Corresponding author.
E-MAIL schaffer{at}helix.nih.gov; FAX (301) 480-9241.
| |
REFERENCES |
|---|
|
|
|---|