|
|
|
|
METHODS Software for Automated Analysis of DNA Fingerprinting Gels1Department of Electrical Engineering, Washington University, St. Louis, Missouri 63130, USA; 2Genome Sciences Centre, British Columbia Cancer Agency, Vancouver, British Columbia V5Z 4E6, Canada; 3Genome Sequencing Center, Washington University School of Medicine, St. Louis, Missouri 63108, USA
Here we describe software tools for the automated detection of DNA restriction fragments resolved on agarose fingerprinting gels. We present a mathematical model for the location and shape of the restriction fragments as a function of fragment size, with model parameters determined empirically from "marker" lanes containing molecular size standards. Automated identification of restriction fragments involves several steps, including: image preprocessing, to put the data in a form consistent with a linear model; marker lane analysis, for determination of the model parameters; and data lane analysis, a procedure for detecting restriction fragment multiplets while simultaneously determining the amplitude curve that describes restriction fragment amplitude as a function of mobility. In validation experiments conducted on fingerprinted and sequenced Bacterial Artificial Chromosome (BAC) clones, sensitivity and specificity of restriction fragment identification exceeded 96% on restriction fragments ranging in size from 600 base pairs (bp) to 30,000 bp. The integrated suite of software tools, written in MATLAB and collectively called BandLeader, is in use at the BC Cancer Agency Genome Sciences Centre (GSC) and the Washington University Genome Sequencing Center, and has been provided to the Wellcome Trust Sanger Institute and the Whitehead Institute. Employed in a production mode at the GSC, BandLeader has been used to perform automated restriction fragment identification for more than 850,000 BAC clones for mouse, rat, bovine, and poplar fingerprint mapping projects.
Maps constructed from fingerprinted large-insert bacterial clones (Marra et al. 1997
Sulston et al. (1988
Our software tools, collectively called BandLeader, consist of a set of
MATLAB routines that are capable of automatically identifying and
locating marker lanes and data lanes and the restriction fragments
contained therein. Gel images, collected during the fingerprinting
procedure, are subjected first to "lane-tracking", a semi-automated
procedure that identifies the location of the lanes on the digital gel
image. This step is performed using the excellent image manipulation
tools that are part of the IMAGE package. IMAGE-extracted gel lanes are
then passed automatically to BandLeader for bandcalling. Currently the
BandLeader software is completely dependent upon the gel and data
format presently in use at the British Columbia Cancer Agency (BCCA)
Genome Sciences Centre. Detailed protocols for the production of
fingerprints suitable for analysis by BandLeader have been described
(Schein et al. 2003 BandLeader has been under development at both Washington University (St. Louis) and the Genome Sciences Centre (Vancouver) for more than four years, and during this time several versions were produced and used. The earliest versions, developed during the height of activity on the Human Genome Project, were based on full two-dimensional image processing techniques and were too slow to be of practical use. The basic methodology as presented here was in place by June 2000 (Version 2.0). At that time, the software could be used as a first step, but manual postprocessing was required to correct for certain artifacts, most notably overcalls, which resulted from a mismatch between a narrowly defined data model and the actual image data. Most of the development effort of the past two years was carried out while the mouse mapping effort at the Genome Sciences Centre was underway, and concentrated on making the software more reliable and the results more robust with respect to variations in the data away from the nominal model. A full set of exception-handling routines, to flag data that the software tools were unable to interpret, were also implemented. The most recent version (Version 2.3.3) is described here.
An electronic image of a typical agarose fingerprinting gel is shown in Figure 1A. Figure 1B shows a marker lane, with each enumerated DNA fragment annotated with the corresponding size of the fragment. Gel (TIFF) images are collected on a Molecular Dynamics Fluorimager. Although the Fluorimager is capable of different settings, BandLeader has been tuned to use 200-micron-square pixels. Each gel image is 1000 x 1200 pixels. The gel image is partitioned into 121 single-lane images, each 1000 pixels long x 9 pixels wide, as a result of the lane tracking process in IMAGE, but is otherwise unprocessed prior to our analysis. The lanes on the gel image in Figure 1A are typical of the input provided to BandLeader, and illustrate the problem BandLeader has been designed to address; namely, the automated identification of all DNA fragments in both the marker lanes and the data lanes. Below we describe the considerations and approaches that we devised to automatically identify fragments on gels of this type, and quantify BandLeader's performance on fingerprints corresponding to a test set of fully sequenced and finished BAC clones.
Forward Synthesis Model The methodology we adopt for the analysis of fingerprinting gels is consistent in many respects with the general model-based image analysis paradigm of O'Sullivan, Blahut, and Snyder (1998) In brief, our model for the production of electrophoretic gel data is as follows. Under the influence of an electric field, molecular fragments of a certain size fk migrate to a particular location on a fingerprinting gel and form a diffuse band. The relationship between the distance traveled, or mobility, and the fragment size is given by a curve which is known approximately but which must be refined empirically. A typical size/mobility curve is shown in Figure 2.
In our model, the shape of the band is a rectangle subjected to Gaussian diffusion, and the diffusion parameters are size-dependent. The brightness, or amplitude of each band is also size-dependent. Qualitatively speaking, bands due to smaller fragments travel further, are dimmer, and are more diffuse, as is evident in Figure 1. In an idealized linear model, one data lane contains the superposition of bands consistent with this model at locations determined by the fragment sizes. This is described by the model equation
Our model also accounts for the various deleterious effects that cause the image data to depart from the idealized linear model. These include: (1) a pointwise nonlinearity of fluorescent signal intensity, deliberately introduced by the Molecular Dynamics Fluorimager to compress the visual dynamic range, (2) an additive background function which appears data-dependent in an unknown way but which is smoothly-varying, (3) impulsive noise due to dust specks and other gel impurities, and (4) saturation in regions of high signal intensity. Background instrumentation noise is not included in the model, as the signal-to-noise ratio (SNR) is high and we see little to be gained with a Poisson or Gaussian "statistical inverse problem" approach. Based on the model described, we have developed a processing strategy which comprises the following elements: (1) an image preprocessing step for compensating for the deleterious effects in the image and putting it in a form consistent with a one-dimensional linear model, (2) marker lane analysis, for determining the quantitative aspects of the model, particularly the size/mobility curve and band shape parameters, and (3) data lane analysis, for the detection and sizing of bands in data lanes.
Image Preprocessing
Correction for Optical Nonlinearity
Background Subtraction
Impulsive Noise Filter Outlier pixels are identified on a row-by-row basis. In each row, every pixel is tested to see whether or not it exceeds twice the median value for that row, and if so it is considered an outlier. Pixels so identified are deleted and replaced by values obtained by cubic spline interpolation from the remaining pixels in the row. In the event that the outlier pixel lies on the edge of the lane, it can happen that the result of cubic spline extrapolation can be negative. Any negative values obtained are set to 0.
Symmetry and Monotonicity Constraints
Conversion to One-Dimensional Trace
All of the computations implied in Equation 3 are carried out on a discrete grid, with the integrations being replaced by a summation over nine pixels, corresponding to the width of the lane extracted by IMAGE from the fingerprinting gel. All of the shape functions Bh(x,y), or templates are stored in a look-up table which is computed prior to analysis.
Gain Correction Examples of the results of the preprocessing steps described above are illustrated graphically in Figure 3. In this figure, the various stages of the preprocessing are shown as a sequence of panels. The top panel shows the raw image data displayed in false color using the MATLAB "jet" colormap. The second panel shows this image after background subtraction and correction for optical nonlinearity. In the third panel, we see the image after the output of the dust-speck filter. This image also has a line running down the middle, indicating the estimated center of the lane. The fourth panel shows the result of additional processing to center the bands in the lane and enforce constraints of monotonicity away from the center. The result of gain correction to normalize the bands is shown in the fifth panel. Also in the fifth panel the image data have been shifted to the left to account for a bulk mobility shift (relative to a fixed standard) that is determined during marker lane analysis. Finally, in the sixth panel is shown the 1-D trace obtained after the application of the across-lane matched filter.
Marker Lane Analysis The purpose of the marker lane analysis is to determine the exact mobilities of 37 different fragments (Fig. 1B) that have exactly known sizes. The nominal mobilities of these fragments are known fairly accurately, for a given experimental protocol, but vary slightly even within one gel due to subtle variation in the gel conditions and nonuniformities in the electric field. Once the marker band locations have been determined, the fragment size/mobility relation can be determined for every data lane by interpolation across the gel and down each lane. After the locations of the bands have been identified, the shapes of the bands are also analyzed to develop the templates needed for a complete linear model for the data lanes in a gel.
Marker Band Detection
In a marker lane there are 37 bands (Fig. 1B), most of which are
distinct and easily identified, except for the pair numbered 1819,
and the group of seven at high mobility, numbers 3137. All of the
marker lanes appear similar, differing only in some translation and
distortion of the mobility axis, and in the overall lane amplitude.
Thus, the primary task in marker lane analysis is to fit a distorted
version of a standard template to the marker trace. In this respect,
the analysis has much in common with algorithms in pattern matching or
pattern recognition using deformable templates (Grenander and Miller
1994 In the marker template, the first 17 bands form a distinctive and easily recognized pattern. This pattern is approximated by a translated and dilated version of a standard template, with all the band peak amplitudes equal. The top section of the trace is matched to a set of 4000 versions of the template (100 translations times 40 dilations) until a best fit is found. As the distortion of the mobility axis may be something other than a simple translation and dilation, each marker band must be individually isolated. This is accomplished by sequentially finding each band using a prediction based on previously identified bands. This sequential procedure is carried out beginning at marker band 9 and operates in both directions, up and down the trace, from this point. Quadratic peak-finding is used to identify peak locations to subpixel accuracy for known singlets, whereas a slightly different version of the previously described pattern-matching procedure is used for bands that do not have clearly identifiable peaks.
Marker Band Verification Additional steps are taken to verify the marker lane analysis, once all the individual lanes have been called. The 25 marker lanes across the gel are examined for any discrepancies. For each marker band k, the 25 called mobilities across the gel, mk(I), I = 1···25, should form a smooth curve. Each function mk(I), I = 1···25, is fit to a low-order polynomial. Any called locations that deviate significantly from this curve are replaced by an estimated mobility found by polynomial interpolation. The same interpolation procedure can be used to replace data from "bad" marker lanes discarded by the correlation analysis. Figure 4 shows an image depicting the raw data from all 25 marker lanes in a typical gel, with results of the marker lane analysis superimposed. In this figure we have shown only the high-molecular-weight bands at low mobility (roughly the top half of the gel) to better illustrate the performance. Note the subtle but significant variation in the marker band location from lane to lane, the very reason that accurate marker lane analysis is critical.
Band Shape Analysis The marker bands, once identified, can be used to develop a complete band shape model for a fingerprinting gel. This is done by an empirical analysis of the second moments of the bands, and fitting these to a sequence of second moments consistent with the model. Because the model band shape is separable, we can analyze the horizontal and vertical moments separately. A horizontal band is found by summing pixels in the vertical direction, and vice versa. Furthermore, for shape analysis, the horizontal and vertical bands are easily normalized to unit area.
The analysis of the band shapes proceeds by computing the horizontal
and vertical second moments of the 28 well-resolved singlets in the
marker lanes. According to our model, the second moment of each band
can be attributed to three sources: (1) a fixed rectangular pulse, (2)
a fixed Gaussian pulse with different horizontal and vertical widths,
and (3) a variable-width Gaussian pulse with circular symmetry and
width increasing with mobility. We have found that a useful model
describing the diffusion is that the standard deviation (square root of
the second moment) of the variable Gaussian pulse grows quadratically
with mobility. A complete description for the band shapes is found by
fitting the two sequences
Because of the computational impracticality of building a separate model for each lane in the gel, the results of the analysis for all 25 marker lanes are combined to give an "average" band shape model for the gel. From the band shapes and the known fragment sizes, a nominal model for the amplitude curve can be generated as well. All of this information is combined to generate a set of templates and other data structures used in the data lane analysis, which we call the complete gel model.
Data Lane Analysis
The basic data model, after the preprocessing described previously, is
given by
We adopt a discrete implementation of the model, in which the possible mobilities mk are quantized onto a grid of 1500 possible values, logarithmically spaced between a minimum and maximum mobility determined by the modeling step. Typically this leads to step sizes on the "mobility grid", as it is called, of approximately 0.2 pixels at low mobilities and 1 pixel at high mobility. This corresponds roughly to the resolution available from the band shapes, which decreases with increasing mobility. The typical quantization error in mobility leads to errors on the order of 0.25% in fragment size, ignoring other bandcalling errors. The reason for the discretization of the mobilities onto the mobility grid is that it simplifies the search procedure. We use a search algorithm that shares properties of both a gradient algorithm and exhaustive search. As the band shape model is stored in a look-up table, it is not possible to compute gradients analytically; a numerical approach is required.
Cluster Analysis
Grid Search
The knowledge of the amplitudes of the bands, or equivalently the fact that the entries of the solution vector x are integers, eliminates the model-order problem which often plagues model-fitting procedures. There is no risk of overfitting the data with too many bands. Increasing the number of bands over that which gives the optimal fit will simply increase the error between the data and linear combination; thus the fitting procedure is in a sense self-limiting. We have crafted a hybrid numerical gradient search to solve the model-fitting problem for one cluster. We adopt a cost function h(s,Ax), and seek the value of the vector s which minimizes this cost function. For simplicity, the details of the search algorithm are omitted here. The cost function is a modified least-squares function, where the modifications address the uncertainty in the amplitude curve. The modified cost function places more emphasis on the shape of the target function, and less on its amplitude.
Amplitude Curve Estimation
Pass 1. The amplitude curve is found by scaling the normalized
amplitude curve by a factor
Pass 2. The normalized amplitude curve is again scaled by a
factor Pass 3. The amplitude curve is multiplied pointwise by a cubic polynomial. The coefficients of this polynomial are chosen to minimize the squared error between a synthetic trace and the data. The results of the analysis of a typical data lane are summarized graphically in Figure 5. The top panel shows the image of the data lane in false color, after preprocessing. The second panel shows the one-dimensional trace and the nominal amplitude curve based on the 15% rule plus the Pass 1 bandcalls indicated as small black circles. The fourth panel shows the same trace with the Pass 3 amplitude curve and the final bandcalls, indicated with small red circles. The bottom panel contains a synthetic trace, generated according to our forward synthesis model using all the results of the data lane analysis. The agreement between the model and the preprocessed data is evident here; the correlation between the actual trace and the synthetic trace is 0.98 in this example.
Exception Handling Several heuristic safeguards have been built into BandLeader to detect data lanes that are defective in some sense, and also to recognize when there has been an error in processing and thus the results cannot be used with confidence. Specifically, there are eight conditions that will generate errors and four conditions that will generate warnings. An error will cause the lane data and any bandcalling results to be discarded, while a warning is recorded in a log file for further manual inspection if desired. Most conditions are tested on the preprocessed one-dimensional trace signal. Different tests occur at different points in the processing. The conditions that generate errors are as follows:
The conditions that generate warnings are as follows
All of the errors and warnings are recorded in a separate log file for each gel analyzed. In our experience it is rare that all 96 data lanes are analyzed without error, with an average of nine lanes per gel generating a warning or error record in the log files.
Assessment of BandLeader Performance The results of our analyses are shown in Figure 6A. For comparison, we have included the results of a similar analysis performed using automatically generated IMAGE bandcalls (Fig. 6B). Of the 140 fingerprinted human BACs and 185 fingerprinted mouse BACs considered for this analysis, BandLeader accepted 139 and 183, respectively. Analysis of the sequences corresponding to these clones revealed that they contain 16,782 HindIII restriction fragments. Of these, BandLeader correctly identified 16,134, corresponding to a sensitivity measure of 96.13%. Of the 16,736 fragments identified by BandLeader, 16,138 correctly identified a sequence-predicted fragment, for a specificity measure of 96.42%. For comparison, automated IMAGE bandcalls are only 60.99% sensitive and 88.65% specific. Hence, although BandLeader is not perfect, it offers a remarkable improvement over IMAGE. Further, BandLeader outperforms the manual efforts of even our most experienced technical staff, at throughputs far exceeding those possible by manual analysis of the fingerprinting gels.
A major goal of the BandLeader project was to produce software that would reliably detect multiplets, which we defined as restriction fragments that comigrated within the same data lane on a fingerprinting gel. We assessed the performance of BandLeader in multiplet identification as follows. First, we identified as multiplets all fragments in BAC sequence data falling within a restriction fragment size window of +/ 2%. A similar grouping was done for the BandLeader bandcalls corresponding to these clones. A total of 322 human and mouse BAC sequences were analyzed and 3431 multiplets found in the sequence, for an average of approximately 10 multiplets per sequenced BAC. BandLeader correctly predicted band multiplicity in 96.0 % of these cases (see Fig. 7 and Methods). BandLeader's performance, even on the larger fragment clusters, is striking. For example, BandLeader correctly identifies a cluster of 11 bands. Again, although BandLeader is not perfect, it is distinctly superior to any other bandcalling option we are aware of, manual or automatic.
We have described a set of software tools for the automated analysis of agarose gel images acquired during the DNA fingerprinting process. The various steps involved are common to many image analysis and related engineering problems. First, a model was established for the process of electrophoresis and the digital image acquisition. The model included sufficient detail to allow for variations in the model parameters, but was not so complex as to lead to unwieldy image analysis tasks. Based on this model, procedures were derived for determining the model parameters (through the use of marker lanes) and for solving the inverse problem on the data lanes. The integrated suite of tools, written in MATLAB and collectively called BandLeader, has been used for the mouse fingerprint mapping project at the BCCA (Gregory et al. 2002 The use of automated image analysis for high-throughput fingerprint mapping projects has important advantages. Chief among these are the increases in both the rate and accuracy of data analysis and the opportunity to reanalyze the very large fingerprint data sets if more suitable parameters are found. Further, the opportunity exists to repeatedly analyze the gel images to collect statistics. For example, our entire set of mouse (C57BL/6) fingerprints (3500 gels containing more than 330,000 fingerprints) can be reanalyzed in 600 CPU hours. Since each gel analysis is independent, the process is amenable to parallelization, such that only about 24 processors would be needed to reanalyze the 3500-gel mouse set in one day. The BandLeader software has already proven of enormous value in completing mapping projects that would otherwise be unfeasible given time and budgetary constraints. We have used versions of BandLeader to analyze 13,629 fingerprinting gels, generated in fingerprint mapping efforts aimed at bacterial, fungal, plant, and animal genomes. Although BandLeader's performance is excellent, there is room for incremental improvement. With continued careful modeling and algorithm improvement, we see the potential for increased bandcalling performance, with gains in sensitivity, specificity, and sizing accuracy. One of the more challenging aspects of the automated trace analysis has been the estimation of the amplitude curve, which facilitates detection of multiple bands. The human eye adapts quite easily to model variations and aberrations, and in all but the most pathological cases it is a simple matter for our technical staff to identify singlets visually and hypothesize a smooth curve connecting the peaks. In our automated analysis, it has proven difficult to sort out the singlets, multiplets, and clusters, and derive a reliable amplitude curve. The current version of the software is doing a satisfactory job with this particular task, but on rare occasions, errors may cause a lane to be failed. There are several safeguards built into the software to recognize bad data and to abort the processing for a given lane. Causes of unusable data include: (1) an empty lane, (2) contamination, due to traces of a second clone or other genomic material in the sample, and (3) nonrecombinant BACs, which yield only one or two large bands in the lane. On very rare occasions, there are good lanes for which BandLeader analysis fails, presumably because the fit of the data to the model is poor. One way to recognize this is to compute the estimated clone size by summing all of the detected fragment sizes; if this sum is outside of acceptable limits, the lane can be rejected. In addition, any unexpected software errors (such as a divide by zero) are "trapped" and allow for the analysis of subsequent lanes to continue without the entire process halting. Currently, BandLeader relies on the data collection format used in our laboratories, and there is no flexibility in gel format or choice of marker DNA. As the fingerprinting data generation protocols are published and the marker DNA is commercially available, this inflexibility is not a major obstacle in the use of BandLeader to support fingerprinting activities in other laboratories. However, we recognize that there are several applications for a more flexible version of BandLeader, including restriction analysis of plasmid and other clones, and also genotyping. Hence, near-term future research and development will focus on methods for the generalization of our techniques to other protocols. For example, we intend to work towards the substitution of restriction digested, sequenced BAC clones in place of the commercially prepared markers. This will permit BandLeader to generate data models from marker lanes that are equivalent to the data lanes, and this in turn is anticipated to positively impact bandcalling accuracy, especially for comigrating restriction fragments. The extent to which accuracy can be improved is limited however, as BandLeader already is capable of 96.42% specificity and 96.13% sensitivity when used in a fully automated mode. This performance, and the robustness and reliability of the code, have made BandLeader the only restriction fragment identification system used in all of the large-scale, high-throughput fingerprinting activities at the GSC.
To evaluate the performance of BandLeader, we compared the restriction fragment sizes determined by BandLeader analysis of fingerprinted clones to those generated by in-silico digests of the sequences of the same clones. Three hundred and twenty-five fully sequenced and finished BACs residing in GenBank were identified and the sequences downloaded from the National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov). All clone sequences were analyzed computationally to generate "in silico" fingerprints. The actual clones were recovered from our local copy of the RPCI-11 library (Osoegawa et al. 2001
http://www.sanger.ac/Software/Image; IMAGE software is available at this Sanger Institute site. http://www.ncbi.nlm.nih.gov; NCBI home page. Access to GenBank database.
This work was supported in part by NIH grants 1-U01-HG02042, Sequencing the Human Genome, and 1-U01-HG02155, Sequencing the Mouse Genome. We gratefully acknowledge the support of the British Columbia Cancer Foundation, the British Columbia Cancer Agency (BCCA), and all members of the Mapping Group at the BCCA Genome Sciences Centre. M.M. is a Michael Smith Foundation for Health Research Scholar. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
4 Corresponding author. E-MAIL mmarra{at}bcgsc.ca; FAX (604) 877-6085. Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.904303.
Received October 11, 2002; accepted in revised format February 26, 2003. 13:940-953 © by 2003 Cold Spring Harbor Laboratory Press ISSN 1088-9051/03 $5.00 This article has been cited by other articles:
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||