Vol. 12, Issue 3, 424-429, March 2002
LETTER
Computational Comparison of Human Genomic Sequence Assemblies for a Region of Chromosome 4
Colin A.M. Semple,1,2 Stewart W. Morris, David J. Porteous, and Kathryn L. Evans
Medical Genetics Section, Department of Medical Sciences, The
University of Edinburgh, Molecular Medicine Centre, Western General
Hospital, Edinburgh EH4 2XU, United Kingdom
Much of the available human genomic sequence data exist in a
fragmentary draft state following the completion of the initial high-volume sequencing performed by the International Human Genome Sequencing Consortium (IHGSC) and Celera Genomics (CG). We compared six
draft genome assemblies over a region of chromosome 4p
(D4S394-D4S403): two consecutive releases by the IHGSC at the University
of California, Santa Cruz (UCSC), two consecutive releases from the
National Center for Biotechnology Information (NCBI), the public
release from CG, and a hybrid assembly we produced using IHGSC and CG sequence data. This region presents particular problems for genomic
sequence assembly algorithms as it contains a large tandem repeat and
is sparsely covered by draft sequences. The six assemblies differed
both in terms of their relative coverage of sequence data from the
region and in their estimated rates of misassembly. The CG assembly
method attained the lowest level of misassembly, whereas NCBI and UCSC
assemblies had the highest levels of coverage. All assemblies examined
included <60% of the publicly available sequence from the region. At
least 6% of the sequence data within the CG assembly for the
D4S394-D4S403 region was not present in publicly available sequence
data. We also show that even in a problematic region, existing software
tools can be used with high-quality mapping data to produce genomic
sequence contigs with a low rate of rearrangements.
[All sequence accessions for the genomic sequence assemblies analyzed
and the data sets used to assess coverage and rates of misassembly are
available from http://www.ed.ac.uk/~csemple.]
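The abstract reports coverage as the proportion of publicly available sequence that each assembly includes. The procedure the authors used is not described here, so the following is only a minimal sketch of one common way to compute such a figure: merge the intervals of reference sequence that align to an assembly and divide the covered length by the total length. The function names, the half-open coordinate convention, and the example intervals are all assumptions made for illustration and are not data from the study.

```python
def merge_intervals(intervals):
    """Merge overlapping or adjacent half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps or touches the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def coverage_fraction(aligned, total_length):
    """Fraction of total_length covered by the union of aligned intervals."""
    covered = sum(end - start for start, end in merge_intervals(aligned))
    return covered / total_length


# Hypothetical alignments of reference sequence to one assembly
# (coordinates in bases; invented for illustration only).
aligned = [(0, 300), (250, 500), (800, 900)]
fraction = coverage_fraction(aligned, 1000)
print(f"coverage: {fraction:.0%}")
```

Merging before summing matters because overlapping draft sequences would otherwise be counted twice and inflate the apparent coverage.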
1 Present address: Bioinformatics, MRC Human Genetics Unit, Edinburgh EH4 2XU, UK.
2 Corresponding author.
12:424-429 ©2002 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/02 $5.00