Vol. 9, Issue 10, 1002-1012, October 1999
RESOURCE
ABSTRACT
Currently, the main limitation in high-throughput microsatellite genotyping is the required manual editing of allele calls. Even though programs for automated allele calling have been available for several years, they have limited capability because accurate data could only be assured by manual inspection of the electropherograms for confirmation. Here we describe the development of a parametric approach to allele call quality control that eliminates much of the time required for manual editing of the data. This approach was implemented in an editing tool, Decode-GT, that works downstream of the allele calling program, TrueAllele (TA). Decode-GT reads the output data from TA, displays the underlying electropherograms for the genotypes, and sorts the allele calls into three categories: good, bad, and ambiguous. It discards the bad calls, accepts the good calls, and suggests that the user inspect the ambiguous calls, thereby reducing dependence on manual editing. For the categorization we use the following parameters: (1) the quality value for each allele call from TrueAllele; (2) the peak height of the alleles; and (3) the size of the peak shift needed to move peaks into the nearest bin. Here we report how we optimized the parameters such that the size of the ambiguous category was minimized, and both the number of miscalled genotypes in the good category and the useable genotypes in the bad category were negligible. This approach reduces the manual editing time and results in <1% miscalls.
INTRODUCTION
Dissection of the major and minor genetic factors
important in complex genetic diseases requires the ability to generate
an enormous amount of genotypic information. Because these diseases tend to skip one or more generations, one can choose for study either
many large extended families with multiple patients separated by many
meiotic events or an even greater number of sib-pairs (Lander and
Schork 1994). Regardless of the approach used, at least a half million
microsatellite genotypes may be necessary for any given project. For
example, when using 1000 microsatellite markers to type 1000 DNA
samples, a total of 1 million genotypes must be determined.
SNP (single nucleotide
polymorphism) genotyping may be used in the future for such
studies, but higher density SNP maps and cheaper genotyping platforms
are prerequisites. In addition, because the heterozygosity rates of
SNPs are so low compared with microsatellites, at least 2 to 5 times
more SNPs will be required to achieve the same power as microsatellites
in pedigree-based studies (Kruglyak 1997). Another disadvantage is that
the accuracy of SNP genotyping is less easily determined through
inheritance checking than that of microsatellites. Furthermore, this higher
density of markers will also require a very high resolution physical
map to assure proper order of markers and will probably need to await
the full sequence of the human genome. Although some have hoped that
genome-wide SNP association studies may replace family-based linkage
studies, the required number of SNP markers has been estimated to be
about 500,000 (Kruglyak 1999). For these reasons, microsatellite
genotyping will probably continue to be the method of choice for
genome-wide linkage studies in the near future. To achieve this
scale of genotyping within a reasonable time, a high-throughput
approach is needed at every step (Hall et al. 1996).
As robot technology and more sophisticated sequencers have increased the throughput in microsatellite genotyping dramatically, the editing of the data has become a bottleneck, limiting throughput. The software used for allele calling has not evolved at the same pace as the robotic and sequencing technologies, and manual editing of the data is both costly and time consuming. The main limitation of software that is currently available has been the lack of quality measures for the allele calls made by the automated programs, which could help sort out accurate calls from inaccurate ones. Hence, if accurate allele calling is desired, a human eye must check all the automated calls by inspecting the electropherograms. Furthermore, many programs have not been tailored for high-throughput genotyping, lacking features such as batch processing of gel files.
We hypothesized that there must be a set of parameters that could be
used to fractionate the allele calls from an allele-calling program
according to quality in a manner that decreases the user's editing
time without compromising accuracy. The use of such quality measures in
genotyping would perhaps be analogous to their use in sequencing (Ewing
and Green 1998; Ewing et al. 1998). Here we describe a set of
parameters that we optimized according to efficiency and accuracy of
allele calls.
TrueAllele
We chose to work with an allele-calling program called TrueAllele
(TA), commercially available from Cybergenetics, Inc. It uses
quantitation and deconvolution algorithms for allele calling. TA is
written in Matlab and currently runs under MacOS, Windows NT/95/98, and
Unix-based systems (Perlin et al. 1994, 1995). At deCODE genetics we
run TA 1.02b.1 on 400 MHz Pentium II work stations, running the Linux
operating system accessed via ReflectionX from PC/NT computers.
Compared to Genotyper 2.0 (GT) from Applied Biosystems (1994), TA has
three main advantages. It provides a quality measure for every allele
call; it allows for batch processing of gel files; and it performs an
efficient tracking of the gel files. The main limitation of the program
is that the interface is not as user friendly as that of GT, and the manual
editing of the allele calls can be just as time consuming in a
high-throughput project.
To streamline the process of genotype analysis and to make it as user friendly as possible, we developed two programs that handle the preparation and management of the batch runs. One program gathers all files that are required for a given batch run on TA into a single folder. The other program extracts and prepares the files after each batch run, preparing the results and quality measures for allele calls, as well as the electropherograms for our editing program called Decode-GT.
Decode-GT
Next, we created a program, Decode-GT 1.0, that incorporates a set
of parameters that can be optimized for the most efficient and accurate
allele calls. It is a PC program that runs under Windows NT and has
three main functions. First, it sorts the allele calls according to
quality measures and can display the electropherograms on which they
are based. Second, it checks the allele calls of CEPH control samples
to ensure that the gel is properly calibrated. Third, it performs an
inheritance check on the results using pedigree information. Decode-GT
reads the combined results file from TA and sorts the data into three
categories: bad allele calls, good allele calls, and ambiguous allele
calls. Sorting is based on a TA quality measure, the peak heights, and
the peak shifts. The aim is that only calls in the ambiguous category
need be inspected by the user.
Defining Criteria for Categorization
Our goal was to set the criteria used by Decode-GT such that the ambiguous category would include relatively few allele calls, without discarding too many usable allele calls in the bad category or including false calls in the good category. To find the optimal settings, we performed a study in which we compared TA results with results of manual editing using GT. We systematically examined how various settings affected (1) the number of miscalls that were captured in the ambiguous category and (2) the size of the ambiguous category. We then incorporated the criteria found to be optimal into Decode-GT and tested them on a new data set, examining (1) the number of miscalls in the good category prior to editing (i.e., prior to inspection of the CEPH control and ambiguous genotypes and the inheritance check); (2) the remaining miscalls after editing; and (3) the average size of the ambiguous category.
In our first study, we independently processed 7595 genotypes from 80 markers using both TA and GT. There were 719 discrepancies between the two methods; we refer to these as miscalls by TA, because all genotypes from GT had been manually inspected and edited as necessary. The main reasons for miscalls were the following: (1) the signal (peak height) was very low; (2) contamination or PCR artifacts gave additional peaks; (3) TA had shifted the size calibration to fit the peaks into the binning library; (4) heterozygous genotypes were called as homozygous because of insufficient amplification of the larger allele and therefore low peak height; (5) TA called a stutter peak as an allele; and (6) TA called a homozygous genotype as heterozygous by assigning an allele to a small peak in the electropherogram noise.
Bad Calls
Allele calls that fall into this category are discarded, and the electropherograms for the alleles are not inspected. We used the peak height of allele 1 (the smaller fragment by molecular weight) to find a threshold value that would discard as many unusable allele calls as possible without discarding a large fraction of usable allele calls (that is, calls that were used when inspected by a user in GT). TA assigns the peak height value on a scale similar to that used in GT. Figure 1 shows the effect of increasing the height threshold from 0 to 100 on the total number of discarded genotypes and on the fraction of the discarded genotypes that is usable (i.e., called by a user in GT). The number of discarded genotypes increases rapidly as the height threshold rises from 0 to 45; after that, the rate of increase lessens. The number of potentially usable genotypes that are discarded starts rising at a height of ~35 and rises steadily thereafter. At a height threshold <40, the discarded allele calls are therefore primarily unreadable calls. At a threshold of 50, only 0.3% of the potentially usable data are discarded, whereas 403 discrepancies are moved to the bad category. We therefore chose a height threshold of 50 as optimal.
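The threshold scan described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the `Call` record, its field names, and the choice to treat "below the threshold" as a strict inequality are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Call:
    """One automated genotype call (hypothetical record layout)."""
    peak_height_allele1: float   # peak height of the smaller allele
    usable_in_manual_edit: bool  # True if a user kept this genotype when editing by hand


def sweep_discard_threshold(
    calls: List[Call], thresholds: List[int]
) -> List[Tuple[int, int, int, float]]:
    """For each threshold return
    (threshold, n_discarded, n_usable_discarded, fraction_of_usable_lost)."""
    total_usable = sum(c.usable_in_manual_edit for c in calls)
    rows = []
    for t in thresholds:
        # Assumption: a call is discarded when its height is strictly below t.
        discarded = [c for c in calls if c.peak_height_allele1 < t]
        usable_lost = sum(c.usable_in_manual_edit for c in discarded)
        lost_fraction = usable_lost / total_usable if total_usable else 0.0
        rows.append((t, len(discarded), usable_lost, lost_fraction))
    return rows


if __name__ == "__main__":
    # Toy data, not from the study.
    toy = [Call(10, False), Call(20, False), Call(40, True),
           Call(60, True), Call(120, True)]
    for row in sweep_discard_threshold(toy, list(range(0, 101, 10))):
        print(row)
```

Plotting the second and fourth columns against the threshold gives curves of the kind the text attributes to Figure 1; a suitable threshold is one where the first curve has flattened and the second is still close to zero.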
Ambiguous Calls
As the peak height of allele 1 decreases, the risk of a miscall tends to increase. To determine the optimum height threshold for the ambiguous category, we inspected the effect of increasing the peak height threshold from 50 to 150 on the total number of genotypes placed in the ambiguous category and the number of miscalls included in the good category (Fig. 2). As the peak height threshold reaches 100, the decrease in miscalls in the good category levels off. The size of the ambiguous category reaches 10% at that value, which is acceptable. Therefore, genotypes with peak heights between 50 and 100 are placed in the ambiguous category. Using just this criterion, the fraction of miscalled genotypes that remains in the good category is ~2.75%.
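As a rough illustration, the peak-height part of the categorization can be written as a small Python function. This is a sketch under stated assumptions, not Decode-GT's implementation: the function and enum names are hypothetical, the handling of values exactly at 50 or 100 is a guess, and the TA quality value and peak-shift criteria are left out because their thresholds are not given in this section.

```python
from enum import Enum


class Category(Enum):
    BAD = "bad"
    AMBIGUOUS = "ambiguous"
    GOOD = "good"


def categorize_by_height(
    peak_height_allele1: float,
    bad_below: float = 50.0,   # threshold for the bad category, from the text
    good_from: float = 100.0,  # upper edge of the ambiguous band, from the text
) -> Category:
    """Sort a call by the peak height of allele 1 only.

    Assumption: heights below `bad_below` are bad, heights from `bad_below`
    up to (but not including) `good_from` are ambiguous, the rest are good.
    """
    if peak_height_allele1 < bad_below:
        return Category.BAD
    if peak_height_allele1 < good_from:
        return Category.AMBIGUOUS
    return Category.GOOD


if __name__ == "__main__":
    for height in (12, 50, 75, 100, 400):
        print(height, categorize_by_height(height).value)
```

Keeping the two thresholds as parameters makes it easy to rerun the optimization on a different instrument or dye set, where the peak height scale may differ.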
Using Decode-GT
To assist the user in editing and evaluating the quality of data, Decode-GT has six view modes: main-view, CEPH-view, inheritance check, ladder plots, allele histograms, and report. Figure 6 shows the Decode-GT main window and explains some features.
In the main view, the called genotypes are listed and the electropherogram for each selected genotype is shown. That graph can be expanded to allow the user to check for alleles outside the defined marker size window. The user can select to have all genotypes, the ambiguous genotypes, or homozygous genotypes displayed in the list box. In a separate graph the user can select to view the electropherograms of all colors for the selected genotype to detect spectral overlap, or have some or all electropherograms for that marker plotted simultaneously in one graph for inspection of the allelic ladder. There is also a window that allows the user to type in comments that will be incorporated into the report. The user can edit the selected genotype, discard it, or discard a whole marker.
CEPH View
In CEPH-view, the electropherograms for the CEPH-control samples are shown simultaneously in separate graphs (Fig. 7). The user can select a marker for which the electropherograms are to be inspected. The known genotypes for the selected marker are also shown. The user can then shift the alleles for the entire gel or marker to normalize according to the CEPH reference genotypes.
Inheritance Check View
This view shows the results from the inheritance check (Fig. 8). There are two list boxes:
one that shows the families that had inheritance errors and another that lists the
members of the selected family. Each family member can be successively
selected and each corresponding electropherogram can be immediately
inspected to resolve discrepancies. As in the main view, the user can
edit the selected genotype, view the allelic ladder, and check for spectral overlap.
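The text does not describe how Decode-GT performs the inheritance check internally. A minimal sketch of the standard Mendelian consistency test for one child and two typed parents is shown below; the function name and the representation of a genotype as a pair of allele sizes are assumptions for illustration.

```python
from typing import Tuple

Genotype = Tuple[int, int]  # two allele sizes (in base pairs) at one marker


def is_mendelian_consistent(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """Return True if the child could have received one allele from each parent."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


if __name__ == "__main__":
    mother, father = (120, 122), (124, 126)
    print(is_mendelian_consistent((120, 124), mother, father))  # consistent
    print(is_mendelian_consistent((120, 120), mother, father))  # flagged as an error
```

A family would be listed as having an inheritance error when this test fails for at least one child; a real implementation would also need to handle untyped parents and larger pedigrees.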
Ladder Plot View
The ladder plot view shows the superposition of the electropherograms from all samples genotyped with the selected marker (Fig. 9). When more than one gel file is loaded into the program, this view allows comparison of allelic ladders if the same marker is on both gels.
Allele Histogram View
The allele histogram view shows the number of occurrences for each allele for a selected marker (Fig. 10). This can be useful to compare allele frequencies for markers between gel files or sets of individuals.
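The counts behind such a histogram can be sketched in a few lines of Python. The function name and the genotype representation (a pair of allele sizes per sample) are assumptions for illustration; a homozygous genotype is counted as two occurrences of the same allele.

```python
from collections import Counter
from typing import Iterable, Tuple


def allele_histogram(genotypes: Iterable[Tuple[int, int]]) -> Counter:
    """Count how often each allele occurs among the genotypes of one marker."""
    counts: Counter = Counter()
    for allele1, allele2 in genotypes:
        counts[allele1] += 1
        counts[allele2] += 1
    return counts


if __name__ == "__main__":
    marker_genotypes = [(120, 124), (120, 120), (124, 126)]  # toy data
    for allele, n in sorted(allele_histogram(marker_genotypes).items()):
        print(allele, "#" * n)
```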
Report View
The report view shows the name of the user, the date, and the name of the gel file (Fig. 11). It also presents statistical information about the data, such as the number and percentage of discarded genotypes, ambiguous genotypes, and edited genotypes along with heterozygosity rate and inheritance errors for each marker.
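The text does not define how the report computes the heterozygosity rate. One plausible reading, sketched here as an assumption, is the observed heterozygosity: the fraction of called genotypes at a marker whose two alleles differ. The function name is hypothetical.

```python
from typing import Iterable, Tuple


def observed_heterozygosity(genotypes: Iterable[Tuple[int, int]]) -> float:
    """Fraction of genotypes at one marker that are heterozygous."""
    called = list(genotypes)
    if not called:
        raise ValueError("no genotypes for this marker")
    n_het = sum(1 for allele1, allele2 in called if allele1 != allele2)
    return n_het / len(called)


if __name__ == "__main__":
    toy = [(120, 124), (120, 120), (124, 126), (122, 122)]  # toy data
    print(observed_heterozygosity(toy))
```

A rate far below the value expected for a marker can hint at systematic miscalls of heterozygotes as homozygotes, which is one of the error types listed earlier.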
Using Decode-GT
After the data have been loaded into the program, the user performs these tasks successively:
1. Inspects the CEPH-control samples to see if they match each other and the known genotypes.
2. Inspects the genotypes listed as ambiguous.
3. Inspects the allelic ladder plots of the good genotype category to look for unexpected peaks.
4. Performs an inheritance check and inspects the mismatches (if any).
5. Inspects all allele calls for that marker if the inspection reveals any errors made by TA that were not included in the ambiguous category.
6. Saves the edited results table and the report file.
DISCUSSION
We have described how an allele-calling program combined with quality measures and empirically derived criteria results in very accurate genotyping while limiting the user's effort to inspection of ambiguous calls. Discarding allele calls that do not meet the given criteria for quality value and peak height means that some allele calls that could be used if inspected by eye are lost. However, our tests showed that <0.5% of automatically discarded genotypes had been used when edited with GT. Prior to editing, the fraction of miscalled alleles falling into the good category was <1%. With our defined inspection protocol, this fraction dropped to <0.4% in our study.
The total error rate in genotyping is composed of calling errors and
other processing errors, such as those arising in PCR, DNA isolation, and
electrophoresis. In this paper we address only the issue of calling
errors, and how we tolerate a slight increase in error rate to increase
throughput. Using this approach, the total error rate in our genotyping
data is <1% and within acceptable limits. We believe that an
unacceptable genotyping error rate for multipoint linkage studies is
>4%. A calling error rate of 0.5% while inspecting <15% of the
genotypes is then quite acceptable. Therefore, the main advantages of
this approach are the batch-run feature and the dramatically reduced
manual editing time. Our approach is similar to work that has been done
to enhance the editing of sequences by using quality values with
Phred/Phrap/Consed (Ewing et al. 1998; Ewing and Green 1998).
The hands-on time in preparing a TA run for a gel file is 5-15
min, and the editing of the results in Decode-GT takes 10-20 min, depending on the quality of the data,
for a total of 15-35 min per gel file,
averaging ~25 min. When using Genescan and GT for processing gel
files, the hands-on time averaged 2-3 h. The reduction in hands-on
time compared to the previous method, when all allele calls were
confirmed by inspection, is 80%-90%.
Another time-saving feature of TA is the automatic binning of all markers that are processed. When using GT, the binning information must be typed manually into a template document when a marker is processed for the first time. Automatic binning allows rapid (even daily) alterations in marker panels without having to manually reset or redefine the expected bins. We routinely custom design panels to rerun markers that have failed in the multiplex runs on a particular set of samples.
At deCODE Genetics, we currently process ~400,000 microsatellite
genotypes per week using Perkin-Elmer-ABI 877 PCR robots and 377 XL
Sequencers with 96 lane upgrades and are currently doing three- to
fourfold multiplexing with 80%-85% efficiency. For our initial
genome-wide screens we use the ABI Linkage Marker Set (v. 2) and the
ABI intercalating set, for a total of 870 markers, along with
additional sets to fill in the gaps. These are all dinucleotide markers
that have been PIG-tailed to eliminate the plus A artifact (Brownstein
et al. 1996).
The dream of modern human genetics is that we will soon be able to
solve the common complex genetic diseases. This may come from the use
of the most informative markers (microsatellites) applied to the most
informative families or populations with extensive genealogy
spanning centuries (Gulcher and Stefánsson 1998). But because
several genes may together or in part contribute to each disease,
the power to detect linkage must be further increased through the
use of higher density marker sets, larger numbers of patients linked
together over generations within a population, and robust
multipoint identity by reliable statistical methods (Kruglyak et
al. 1996; Kong and Cox 1997). The use of allele calling software
together with optimized parameters that fractionate the data according
to quality, as described here, may advance human genetics toward its destiny.
Availability of Programs
TA is available from Cybergenetics, Inc. (Pittsburgh, PA; www.cybgen.com). Decode-GT is free of charge and available to academic groups upon request from deCODE Genetics. To obtain a copy of the program, contact Birgir Pálsson, e-mail birgir{at}decode.is. A demonstration version is available at www.decode.is/company/index.html.
ACKNOWLEDGMENTS
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
3 These authors contributed equally to this work.
4 Corresponding author.
E-MAIL jgulcher{at}decode.is; FAX 354 570 1903.
Received April 7, 1997; accepted in revised form August 9, 1999.