Vol. 12, Issue 1, 3-15, January 2002
REPORTS
ABSTRACT
Williams syndrome is a complex developmental disorder that results from the heterozygous deletion of a ~1.6-Mb segment of human chromosome 7q11.23. These deletions are mediated by large (~300 kb) duplicated blocks of DNA of near-identical sequence. Previously, we showed that the orthologous region of the mouse genome is devoid of such duplicated segments. Here, we extend our studies to include the generation of ~3.3 Mb of genomic sequence from the mouse Williams syndrome region, of which just over 1.4 Mb is finished to high accuracy. Comparative analyses of the mouse and human sequences within and immediately flanking the interval commonly deleted in Williams syndrome have facilitated the identification of nine previously unreported genes, provided detailed sequence-based information regarding 30 genes residing in the region, and revealed a number of potentially interesting conserved noncoding sequences. Finally, to facilitate comparative sequence analysis, we implemented several enhancements to the program PipMaker, including the addition of links from annotated features within a generated percent-identity plot to specific records in public databases. Taken together, the results reported here provide an important comparative sequence resource that should catalyze additional studies of Williams syndrome, including those that aim to characterize genes within the commonly deleted interval and to develop mouse models of the disorder.
[The sequence data described in this paper have been submitted to GenBank under accession nos. AF267747, AF289666, AF289667, AF289664, AF289665, AC091250, AC079938, AC084109, AC024607, AC074359, AC024608, AC083858, AC083948, AC084162, AC087420, AC083890, AC080158, AC084402, AC083889, AC083857, and AC079872.]
INTRODUCTION
The past decade has brought spectacular advances in our
understanding of the contiguous gene deletion
disorder Williams syndrome (WS, also known as Williams-Beuren
syndrome; OMIM 194050 [see http://www.ncbi.nlm.nih.gov/Omim]). This
complex and intriguing developmental disorder is associated with
defects in multiple physiological systems, with the classic phenotypic
features including cardiovascular disease, dysmorphic facial
characteristics, infantile hypercalcemia, and unique cognitive and
personality components (Burn 1986; Morris et al. 1988; Bellugi et al. 1990, 1999; Lashkari et al. 1999; Mervis et al. 1999; Donnai and Karmiloff-Smith 2000; Mervis and Klein-Tasman 2000; Morris and Mervis 2000).
A key turning point in elucidating the genetic basis of WS came in 1993 with the discovery that the disorder is associated with hemizygous
microdeletions within human chromosome 7q11.23 that include the elastin
gene (ELN; Ewart et al. 1993). Since that time, there have
been numerous studies aiming to map this region of chromosome 7, identify the genes residing within the commonly deleted interval, and
associate the phenotypic features of the disorder with the
haploinsufficiency of specific genes. These efforts have been aided by
a joint effort between our group and the Washington University Genome
Sequencing Center (http://genome.wustl.edu/gsc) to map and sequence
the human WS region. However, significant challenges have been
encountered. For example, attempts to establish contiguous and accurate
long-range physical maps of the human WS region have been hampered by a
number of problems, including unstable yeast artificial chromosome
(YAC) clones derived from the region (which are most likely a
consequence of the notably high density of repetitive sequences) and
the presence of several large (~300 kb), closely spaced blocks of DNA
with near-identical sequence (Gorlach et al. 1997; Osborne et al. 1997a; Hockenhull et al. 1999; Korenberg et al. 2000; Peoples et al. 2000; Valero et al. 2000). The latter genomic segments, which greatly
confound conventional mapping and sequencing strategies, are
particularly important, both because they contain gene and pseudogene
sequences (Gorlach et al. 1997; Osborne et al. 1997a; Perez Jurado et al. 1998) and because they appear to play a central role in mediating the inter- and intrachromosomal recombination events that lead to the
WS-associated deletions (Perez Jurado et al. 1996; Robinson et al. 1996; Baumer et al. 1998).
Despite the challenges associated with mapping and sequencing the human
WS region, numerous genes residing within the commonly deleted interval
and the flanking duplicated segments have been identified (Fig. 1; Table 1; Francke 1999; Osborne 1999; Osborne and Pober 2001). The diverse
phenotypic features associated with WS likely result from
haploinsufficiency of these and/or yet-to-be-identified genes that
reside within the deleted interval. However, with the exception of
ELN and cardiovascular/connective tissue disease, correlating
individual genes with specific phenotypic features has proven
difficult.
As a complement to the above efforts, our interests have focused on the
comparative mapping and sequencing of the WS region in the human and
mouse genomes. Previously, we established a bacterial clone-based
contig map of the mouse genomic region encompassing the Eln
and Ncf1 (p47-phox) genes (DeSilva et al. 1999); note that NCF1 gene/pseudogene sequences reside within the
duplicated blocks in the human WS region (Fig. 1; Table 1).
Interestingly, we discovered that the mouse WS region is devoid of the
large duplicated segments that are characteristic of its human
counterpart. To acquire a more detailed view of this important genomic
interval, we have now extended our mouse physical mapping efforts as
well as sequenced the entire mouse WS region. Here, we report the
generation of ~3.3 Mb of mouse genomic sequence and the results of
detailed computational analyses, which included extensive comparisons
with the available sequence of the human WS region.
RESULTS
Physical Mapping of the Mouse WS Region
The segment of the mouse genome corresponding to the human WS region
resides on distal mouse chromosome 5. Our previous clone-based physical
mapping efforts resulted in the construction of a bacterial artificial
chromosome (BAC)/P1-derived artificial chromosome (PAC) contig spanning
a large portion of this genomic region, including the entire interval
flanked by the Eln and Ncf1 genes (DeSilva et al. 1999). As part of a broader effort to generate BAC-based physical maps
of the portions of the mouse genome orthologous to human chromosome 7 (Thomas et al. 2000), we extended this contig map to encompass the
entire WS region (including the interval commonly deleted in WS, the
segment that is duplicated in human, and additional flanking DNA). The
complete contig map is available as part of an electronic supplement
accompanying this paper (at http://bio.cse.psu.edu/publications/desilva). Based on our earlier (DeSilva et al. 1999) and expanded physical mapping efforts, a set of
21 clones, which together fully encompass the mouse WS region, was
selected for systematic sequencing (Fig. 1).
Consistent with our previous mapping studies (DeSilva et al. 1999), we
encountered no evidence for the presence of large, duplicated blocks of
DNA within the mouse WS region, such as those residing in the
orthologous segment on human chromosome 7q11.23. Indeed, the
clone-based physical mapping of the mouse WS region proceeded smoothly,
in striking contrast to our efforts and those of others (Osborne et al. 1996; Hockenhull et al. 1999; Korenberg et al. 2000; Peoples et al. 2000; Valero et al. 2000) in mapping the human WS region.
The long-range organization of the mouse and human WS regions is also
different in other ways. Specifically, a single contiguous block of
mouse chromosome 5 encompassing the WS region is orthologous to two
discontiguous segments of human chromosome 7, one on 7q11.23 and one on
7q22. The former segment contains the interval commonly deleted in WS
and the flanking duplicated blocks; interestingly, the orientation of
the central portion of this region is inverted in mouse versus human
(Fig. 1). The inverted orientation of the mouse WS region (compared to
the human WS region) was confirmed by two-color fluorescent in situ
hybridization (FISH) studies with Ncf1- and
Fkbp6-containing BACs; the results clearly showed that
Ncf1 is at the centromeric end and Fkbp6 at the
telomeric end of the WS region on mouse chromosome 5 (data not shown).
These physical mapping studies are consistent with the BSS
JAX panel genetic mapping data
(http://www.jax.org/resources/documents/cmdata/bkmap/BSS.html). Importantly, the breakpoints associated with this evolutionary inversion correspond to the locations of the duplicated blocks in the
human WS region, which are also the most common sites of deletion
breakpoints seen in WS (Fig. 1). Our finding of an inverted orientation
of the mouse versus human WS region is consistent with data generated
by others (Peoples et al. 2000; Valero et al. 2000).
Immediately telomeric to the interval commonly deleted in WS is a genomic segment encompassing the HIP1/Hip1, MDH2/Mdh2, POR/Por, and ZP3/Zp3 genes; this region is oriented the same in mouse and human. However, in mouse, this segment is contiguous (in the telomeric direction) with a region that is orthologous to human 7q22 and that contains the Cutl1 and Pai genes. In human, this segment is not contiguous with the WS region and, in fact, is inverted in orientation (relative to the mouse segment; see Fig. 1).
Sequencing of the Mouse WS Region
The 21 overlapping mouse clones depicted in Figure 1 were sequenced
by a shotgun sequencing strategy. The GenBank accession number for each
resulting sequence is provided in Table 2.
Note that the first five clones (391O16, 92N10, P510M19, 303E12, and 42J20) were isolated from libraries derived from the 129SV mouse strain
and sequenced prior to the decision to use the C57BL/6J mouse strain
(with an emphasis on the RPCI-23 mouse BAC library) for sequencing the
mouse genome as part of the Human Genome Project (Battey et al. 1999; Denny and Justice 2000). The remaining 16 clones were isolated from the
RPCI-23 library. Taken together, a total of ~3.3 Mb of nonredundant
mouse genomic sequence was generated, of which a single contiguous
block of just over 1.4 Mb is finished, high-accuracy sequence (i.e.,
with an error rate of <1 in 10,000 bp), another ~1.4 Mb is at a
full-shotgun stage (with ~11-fold average coverage in Phred Q20
bases; Ewing et al. 1998; Ewing and Green 1998) and is currently being
finished, and the remaining ~0.5 Mb is at a working-draft stage (with
~5-fold average coverage in Phred Q20 bases), as indicated in Figure 1 and Table 2.
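The quality thresholds quoted above can be related to one another through the standard Phred definition, Q = -10 log10(P), where P is the estimated per-base error probability (Ewing and Green 1998). The following Python sketch is illustrative only; the function names are hypothetical and are not part of the Phred software. Under this definition, a Q20 base has an estimated error probability of 1 in 100, and an error rate of 1 in 10,000 bp corresponds to Q40.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Convert a Phred quality score Q to an estimated per-base error probability."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Convert a per-base error probability to the equivalent Phred quality score."""
    if p <= 0:
        raise ValueError("error probability must be positive")
    return -10 * math.log10(p)


# Q20 bases (used above to express shotgun coverage): about 1 error in 100 bases.
q20_error = phred_to_error_probability(20)

# Finished sequence (<1 error in 10,000 bp) corresponds to a score above Q40.
finished_threshold = error_probability_to_phred(1 / 10_000)

print(f"Q20 error probability: {q20_error:.4f}")
print(f"Phred score for 1 error in 10,000 bp: {finished_threshold:.1f}")
```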
Mouse-Human Comparative Sequence Analysis
The resulting mouse genomic sequence was subjected to rigorous computational analyses. Emphasis was placed on studying the large (~1.4 Mb), contiguous block of finished sequence, which included the entire region orthologous to the interval commonly deleted in WS. For comparison to the finished mouse sequence, we were able to identify finished or draft-level human sequence in GenBank for all but ~200 kb of the corresponding region on human chromosome 7q11.23 (with the notable segments unavailable for comparative analyses being ~40 kb encompassing the gene represented by AK005040, ~100 kb at the 5' end of ELN, and ~20 kb just 5' to CLDN3).
The central analytical and organizational tool for our comparative
sequence analyses was the program PipMaker (Hardison et al. 1997; Ellsworth et al. 2000; Schwartz et al. 2000). The core
function of this program is to perform direct comparisons between large
blocks of orthologous sequences. In addition, though, PipMaker provides an effective and convenient mechanism for assimilating and displaying relevant annotations about large segments of genomic sequence, including the location of repetitive elements and CpG islands, the intron-exon organization of genes, and,
most importantly, the areas (both coding and noncoding) found to be
highly conserved between two orthologous sequences. To enhance the
utility of PipMaker, we recently added a feature that
incorporates hyperlinks from annotated regions of the resulting percent-identity plot (PIP) to relevant Internet sites. This allows the
creation of an informative and dynamic electronic supplement that
captures the key elements of each comparative analysis. An illustration
of this new PipMaker feature is provided in Figure 2, which shows a small portion of the PIP
generated by comparing the sequences of the mouse and human WS regions
(note that the entire PDF-formatted PIP is available at
http://bio.cse.psu.edu/publications/desilva).
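To make the notion of a percent-identity plot concrete, the sketch below computes the quantity that a PIP displays: the position, length, and percent identity of each gap-free segment of a pairwise alignment. This is a simplified illustration in Python and is not PipMaker's implementation; the function name and the input convention (two alignment rows of equal length, with '-' marking gaps) are assumptions made for the example.

```python
def gap_free_segments(aligned_a: str, aligned_b: str):
    """Return (start_in_a, length, percent_identity) for each gap-free run.

    aligned_a and aligned_b are the two rows of one pairwise alignment,
    with '-' marking gaps. start_in_a is the 0-based position of the run
    in the ungapped first sequence.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("alignment rows must have equal length")

    segments = []
    pos_a = 0        # current position in the ungapped first sequence
    run_start = 0    # start of the current gap-free run (in sequence a)
    run_len = 0      # number of aligned columns in the current run
    matches = 0      # identical columns in the current run

    for ca, cb in zip(aligned_a, aligned_b):
        if ca == "-" or cb == "-":
            # A gap in either row closes the current gap-free run.
            if run_len:
                segments.append((run_start, run_len, 100.0 * matches / run_len))
            run_len, matches = 0, 0
        else:
            if run_len == 0:
                run_start = pos_a
            run_len += 1
            if ca.upper() == cb.upper():
                matches += 1
        if ca != "-":
            pos_a += 1

    if run_len:
        segments.append((run_start, run_len, 100.0 * matches / run_len))
    return segments


if __name__ == "__main__":
    for start, length, identity in gap_free_segments("ACGT-ACGT", "ACGAAAC-T"):
        print(start, length, f"{identity:.1f}%")
```

In a PIP, each such segment would be drawn as a horizontal line at its position in the first sequence, at a height given by its percent identity.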
Our comparative analyses revealed a number of interesting general
features of the WS region. First, the GC content of the mouse and human
WS regions is similar, in both the overall level (48.8% and 49.2%,
respectively) and the relative uniformity across the region (ranging
from 41.7% to 51.7% in mouse and 40.2% to 55.5% in human when
calculated in 50-kb windows). In contrast, the mouse and human WS
regions differ substantially in their repeat content, consisting of 35.9% and 54.2% interspersed repetitive elements
(mostly SINEs and LINEs), respectively. In addition, there is a notable
lack of uniformity of repeat content across the region, ranging from
30.6% to 62.7% in mouse and 27.9% to 84.3% in human (when
calculated in 50-kb windows). The difference in the amount of
repetitive sequences largely accounts for the slight compression of the
mouse WS region compared to its human counterpart. For example, this is
clearly evident in the interval encompassing the genes
GTF2IRD2/Gtf2ird2, NCF1/Ncf1, and
GTF2I/Gtf2i, with finished sequence being available
for both the mouse and human regions; the size of the same genomic
segment is ~124 kb and ~169 kb in mouse and human, respectively
(consisting of 34.3% and 50.0% interspersed repeats, respectively).
Finally, PipMaker analysis revealed numerous segments that
are highly conserved between the mouse and human WS regions. Most of
these correspond to exons within known and newly identified genes (see
below); however, many others appear to be conserved noncoding
sequences. Specifically, within the ~1.4 Mb of finished mouse
sequence, 55 gap-free alignments of ≥100 bp in length and with ≥70%
mouse-human sequence identity were identified that do not overlap any
of the identified exons. Two of these are shown in Figure 2, with the complete list available at http://bio.cse.psu.edu/publications/desilva.
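The windowed GC-content ranges reported in this section can be reproduced in outline with a calculation such as the following Python sketch. It is illustrative only: the function name is hypothetical, the windows are assumed to be consecutive and non-overlapping (the text specifies only their 50-kb size), and ambiguous bases such as N are assumed to be excluded from the denominator.

```python
def gc_content_by_window(sequence: str, window: int = 50_000):
    """Return (window_start, percent_GC) for consecutive non-overlapping windows.

    Only unambiguous bases (A, C, G, T) are counted; windows containing
    no unambiguous bases are skipped.
    """
    if window <= 0:
        raise ValueError("window size must be positive")

    results = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window].upper()
        called = sum(chunk.count(base) for base in "ACGT")
        if called == 0:
            continue  # e.g., a window consisting entirely of N
        gc = chunk.count("G") + chunk.count("C")
        results.append((start, 100.0 * gc / called))
    return results


if __name__ == "__main__":
    # Toy sequence with a small window so the output is easy to check by eye.
    for start, pct in gc_content_by_window("GGCCAATTGCNNAT", window=4):
        print(start, f"{pct:.1f}%")
```

The minimum and maximum of the returned percentages give a range of the kind quoted above; the same windowing could be applied to repeat content by counting masked rather than G+C bases.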
PipMaker analysis also revealed that mouse-human sequence
conservation across the WS region is relatively low compared to other
genomic regions examined to date, both in terms of the total amount of
noncoding, nonrepetitive sequence that is at least moderately conserved
(i.e., can be reliably aligned between mouse and human) and the amount
that is highly conserved. To quantify this, we focused attention on the
finished sequence from the mouse WS region. Following removal of
segments for which the orthologous human sequence was not available and
the masking of both repeats and annotated coding regions, the remaining
mouse sequence was aligned with its human counterpart. Only 20.3% of
the nonexonic, nonrepetitive sequence could be aligned between mouse
and human, providing a benchmark for the overall level of conservation
(Table 3). Only 1.1% of the sequence was
found to be highly conserved (i.e., resided within a gap-free alignment
of ≥100 bp in length and ≥70% mouse-human sequence identity). For
comparison, we performed the same analysis on 12 other genomic regions
for which large blocks of finished sequence were available for both
mouse and human. For these other regions, we first masked repeats and
annotated exons in the human (rather than mouse) sequence. In all but
two cases, there is a greater degree of total mouse-human sequence conservation than that encountered with the WS region (Table 3), with a
greater percentage of highly conserved sequence seen in all but three
cases. In addition, the data presented in Table 3 suggest a potential
correlation between mouse-human sequence divergence and the content of
G+C nucleotides and/or interspersed repetitive elements; note that the
latter is consistent with the findings of Chiaromonte et al. (2001).
However, a more systematic study is certainly required before firm
conclusions can be reached.
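The two conservation measures discussed above (the percentage of nonexonic, nonrepetitive sequence that aligns, and the percentage residing in gap-free alignments of at least 100 bp with at least 70% identity) can be expressed as a simple calculation. The Python sketch below is a minimal illustration under stated assumptions: the function name is hypothetical, the input is assumed to be a list of gap-free aligned segments given as (length, percent identity), and the denominator is assumed to be the length of sequence remaining after masking. The exact counting procedure used for Table 3 is not specified in the text.

```python
def conservation_summary(segments, unmasked_length,
                         min_length=100, min_identity=70.0):
    """Summarize conservation over masked sequence.

    segments: iterable of (length_bp, percent_identity), one entry per
        gap-free aligned segment in the unmasked sequence.
    unmasked_length: bases remaining after masking repeats and exons.

    Returns (percent_aligned, percent_highly_conserved).
    """
    if unmasked_length <= 0:
        raise ValueError("unmasked_length must be positive")

    segments = list(segments)  # allow generators; we iterate twice
    aligned_bp = sum(length for length, _ in segments)
    highly_conserved_bp = sum(
        length
        for length, identity in segments
        if length >= min_length and identity >= min_identity
    )
    return (100.0 * aligned_bp / unmasked_length,
            100.0 * highly_conserved_bp / unmasked_length)


if __name__ == "__main__":
    # Toy values chosen for illustration; they are not data from this study.
    toy_segments = [(150, 80.0), (50, 90.0), (200, 65.0)]
    aligned, highly = conservation_summary(toy_segments, unmasked_length=2000)
    print(f"aligned: {aligned:.1f}%  highly conserved: {highly:.1f}%")
```

Only the first toy segment satisfies both thresholds: the second is too short and the third falls below 70% identity.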
Significant effort was also focused on the computational detection and
annotation of genes residing in the WS region. The availability of both
mouse and human genomic sequences greatly enhanced the ability to
detect genes and to define their long-range organization. Table 4 provides a summary of the 30 genes
identified within the ~1.4 Mb of finished mouse sequence, with
additional details (e.g., deduced coding sequences, predicted amino
acid sequences of the corresponding proteins, and presence of conserved domains) available at http://bio.cse.psu.edu/publications/desilva. Of
these 30 genes, 20 have been assigned names and reported previously as
residing within the WS region (see Table 1), while one
(Gtf2ird2) is associated with an annotated GenBank record
(AY014963) indicating its presence in the WS region. Importantly, the
remaining 9 (in each case indicated in Table 4 by a representative
GenBank record containing a corresponding full-length cDNA sequence or
an associated expressed-sequence tag [EST]) represent newly
identified genes with respect to their presence in the WS region. The
evidence that these are authentic genes includes the identification of cDNA sequences matching the mouse genomic sequence, their overlap with
Genscan-predicted gene models (in all but one case), and
the presence of strong mouse-human sequence conservation; these
features are detailed in Figure 3.
Remarkably, 6 of these newly identified genes (AK017044, AK004244,
AK008014, AK003386, AK019256, and BE290321) clearly reside within the
genomic interval commonly deleted in WS. Additional features of the
newly identified genes are summarized in an electronic table at
http://bio.cse.psu.edu/publications/desilva.
The 30 identified genes are associated with a number of other
interesting features. First, all but 4 (87%) have a CpG island at
their 5' end (Table 4); this is a considerably higher fraction than
that reported previously for mouse genes (Antequera and Bird 1993; Jareborg et al. 1999). Second, the splice sites and intron-exon organization of the genes are the same in mouse and human (at least for
the genes for which genomic sequence was available in both species)
except for Eln/ELN, which has 81% amino acid
identity between mouse and human but shows a lack of conservation at
the splice junctions. Third, the coding-sequence conservation between the mouse-human orthologous gene pairs (Table 4) falls within the
typical range established previously (Makalowski et al. 1996; Makalowski and Boguski 1998), with the exceptions being the less conserved Wbscr15/WBSCR15 (as we reported previously [Doyle et al. 2000]) and perhaps Pom121/POM121.
Finally, with the exception of the changes associated with the
evolutionary inversions depicted in Figure 1, gene order is the same in
the mouse and human WS regions.
The ~1.9-Mb segment of draft-level mouse sequence that we generated (corresponding to the seven clones taken to full-shotgun and three clones taken to working-draft levels of redundancy; see Table 2) is orthologous to a region of human chromosome 7 that is telomeric to the interval commonly deleted in WS (Fig. 1). As such, less rigorous computational analyses have thus far been performed with this mouse sequence. However, since human sequence is available for virtually all of this segment, a routine set of comparative analyses was performed using PipMaker, with the resulting PIPs available at http://bio.cse.psu.edu/publications/desilva.
DISCUSSION
It is now well-established that the comparative analysis of genomic
sequence from different organisms represents a powerful means for
identifying conserved coding and noncoding regions, including
regulatory elements (Duret and Bucher 1997; Hardison et al. 1997; Hardison 2000; Miller 2000; Wasserman et al. 2000; Cliften et al. 2001; Pennacchio and Rubin 2001; Touchman et al. 2001). With the recent
completion of a working-draft sequence of the human genome
(International Human Genome Sequencing Consortium 2001; Venter et al. 2001), increasing attention is being given to the sequencing of other
organisms (Green 2001). In particular, the sequencing of the mouse
genome is now taking center stage (Battey et al. 1999; Denny and Justice 2000), with the recognition that the resulting data will
provide both an invaluable infrastructure for performing research with
this important experimental animal and the ability to more rigorously
annotate the human sequence by comparative analyses (Batzoglou et al. 2000; Bouck et al. 2000).
Indeed, the past few years have brought a marked increase in the
generation of mouse genomic sequence, allowing insightful comparisons
to be made with the orthologous human sequence. Notable examples of
large (e.g., >300 kb) blocks of generated mouse sequence include that
from the velocardiofacial syndrome region (~634 kb; Lund et al. 2000), the Cftr region (~358 kb; Ellsworth et al. 2000), the Bpa/Str region (~430 kb; Mallon et al. 2000), the
region on chromosome 7 containing an imprinted genomic domain (~1 Mb; Onyango et al. 2000), the region on chromosome 11 containing a cluster
of interleukin genes (~1100 kb; Loots et al. 2000), the region
containing the protocadherin gene cluster (~900 kb; Wu et al. 2001),
the cat eye syndrome region (~450 kb; Footz et al. 2001), the region
on chromosome 17 containing a cluster of olfactory receptor genes
(~330 kb; Younger et al. 2001), a segment on mouse chromosome 16 orthologous to the Down's syndrome critical region (~470 kb;
Pletcher et al. 2001), the Fra14A2/Fhit region
(~600 kb; Shiraishi et al. 2001), and the 15 mouse genomic segments orthologous to human chromosome 19 (totaling ~42 Mb; Dehal et al. 2001); note that a handful of other examples are also cataloged at www.ncbi.nlm.nih.gov/genome/seq/MmProgress.shtml. Together, the
generated mouse sequence has played a key role in the establishment and
refinement of computational approaches for systematic comparative sequence analysis (Mallon and Strivens 1998; Stojanovic et al. 1999;
Batzoglou et al. 2000), with the emergence of tools such as
PipMaker (http://bio.cse.psu.edu; Schwartz et al. 2000),
VISTA (http://sichuan.lbl.gov/vista; Mayor et al. 2000),
and Alfresco (http://www.sanger.ac.uk/Software/Alfresco; Jareborg and Durbin 2000).
The ~3.3 Mb of sequence reported here for the mouse WS region
represents one of the largest and most complete blocks of mouse sequence reported to date. This is particularly the case with respect
to the ~1.4-Mb contiguous segment of finished, high-accuracy sequence. Indeed, in many of the cases listed above, only draft-level mouse sequence has thus far been generated. Our extensive and high-quality data set provided the opportunity to perform detailed computational analyses, with particular emphasis on mouse-human sequence comparisons. Several general findings deserve special mention.
First, the order and structure of genes in the mouse and human WS
regions are well conserved, with the only exceptions relating to the
two large evolutionary inversions illustrated in Figure 1. Second,
comparative sequence analysis in conjunction with cDNA/EST comparisons
and Genscan predictions has provided strong evidence for
the presence of at least nine previously unreported genes within the WS
region (see Fig. 3 and below). Finally, numerous conserved noncoding
sequences can be readily identified within the human and mouse WS
regions; these represent viable candidates for regulatory elements
associated with the numerous genes residing in the region or perhaps
serve some other biologically important function(s). Of note, during
the generation of our mouse sequence data, Martindale et al. (2000)
reported the elucidation and analysis of ~115 kb of sequence from the
mouse WS region, specifically a segment encompassing the genes
Limk1, Eif4h, Wbscr15, and Rfc2.
Their analyses of this portion of the mouse WS region are concordant
with the results presented here.
Our experience in analyzing the sequence of the mouse WS region once again illustrates the tremendous value of mouse-human sequence comparisons for annotating genes. Simple comparisons of genomic sequences and collections of cDNA-derived (e.g., EST) sequences often fail to detect certain mRNAs (e.g., those expressed at low levels or in a tissue-restricted fashion). In addition, false-positive results are common, typically due to contaminating genomic sequences amongst the ESTs. However, a combined strategy employing both mouse-human genomic sequence comparisons and genomic-cDNA sequence comparisons provides an efficient and effective path toward the construction of accurate gene models. For example, such a combined approach led to our identification of a previously undetected 5' terminal exon of HIP1/Hip1, leading to refined information about the structure of this gene beyond that available in RefSeq. In addition, evidence of mouse-human sequence conservation provided critical clues that directly led to the identification of the nine previously unreported genes in the WS region. Once detected, the conserved regions were more carefully compared to available sequence databases, resulting in the identification of matching full-length cDNA sequences in a majority of cases.
PipMaker is now a well-established program for performing
the types of routine comparative sequence analyses mentioned above. The
new enhancements to PipMaker reported here should further
increase the utility of this tool. In particular, PipMaker
can now be used to capture and disseminate the large amount of
ancillary information that is routinely generated during the
comparative analysis of large blocks of genomic sequence, in essence
providing an archive of both the underlying data and a detailed account
of any analyses performed with it. This is accomplished through the
creation of a PDF-based file that contains both the PIP and links from
relevant features of the PIP to specific Internet sites. Such a PDF
file can serve as an electronic supplement to a publication, which
inevitably can only provide highlights of the comparative analyses
being reported (e.g., Figs. 2, 3). Indeed, this is just one facet of
the expanding synergy between traditional scientific publishing and the
Internet. An alternate approach to this problem was recently described
(Wilson et al. 2001), which involves the use of a sequence-alignment
viewer that is provided as part of the electronic supplement and
downloaded automatically by the Web browser when viewing alignments. An
advantage of the Wilson et al. strategy is that it provides greater
interactivity to the end-user, for example, allowing access to
alignments with nucleotide-level resolution. An advantage of
PipMaker is that it only utilizes features of the PDF
language, making the supplemental archive much easier to create and to access.
The region of human chromosome 7q11.23 commonly deleted in WS is of
great medical and biological interest because of the relative frequency
of the disease (~1:20,000), the complex and intriguing phenotypic
features of WS (Burn 1986; Morris et al. 1988; Bellugi et al. 1990, 1999; Lashkari et al. 1999; Mervis et al. 1999; Donnai and Karmiloff-Smith 2000; Mervis and Klein-Tasman 2000; Morris and Mervis 2000), and the involvement of large, duplicated blocks of DNA in the
deletional events leading to the syndrome (Perez Jurado et al. 1996; Robinson et al. 1996; Baumer et al. 1998). The mouse sequencing efforts
reported here should accelerate research aiming to better understand
the genetic basis of WS. First, our data provide a comprehensive
resource for characterizing the genes residing within and around the
interval commonly deleted in WS. This includes information about gene
structure as well as valuable clues about potential regulatory regions.
The value of this mouse sequence deserves highlighting in light of the
difficult-to-generate and, at present, fragmentary nature of the human
sequence for the WS region. Second, our comparative analyses have
revealed the presence of at least nine genes that were not previously
known to reside within the WS region. Importantly, six of these genes are located within the interval commonly deleted in WS, making each an
important candidate to evaluate for its possible role in the disorder.
Finally, the mouse sequence we generated should aid the creation of
mouse models of WS. Specifically, significant efforts are currently
ongoing to create mouse strains completely deleted or hemizygous for
one or more genes within the WS region. Our efforts have provided a key
infrastructure (i.e., complete genomic sequence) that should greatly
facilitate the design of appropriate knockout constructs as well as a
set of additional gene targets. In light of the difficulty to date in
assigning specific genes to WS-associated phenotypic features, the
ability to generate mouse models is regarded as key for untangling the complex genetics of WS.
In a slightly different context, our studies provide insight about the
evolution of the WS region and the genes residing therein. Based on our
comparative mapping and sequence data, this region has undergone
extensive evolutionary changes in the human and/or mouse lineages since
their last common ancestor. For example, the genomic complexities (with
respect to large, closely spaced duplicated segments) encountered in
the human and other great apes are not present in more distantly
related mammals, such as the mouse (DeSilva et al. 1999).
Interestingly, these duplicated segments reside at the breakpoints
associated with an evolutionary inversion, such that the interval
commonly deleted in WS has an inverted orientation in the human versus
the mouse genome. In addition, there is a second evolutionary inversion
associated with a genomic segment residing just telomeric to the WS
region; this segment is contiguous with the rest of the WS region in
mouse but discontiguous in human. It is interesting to contemplate the steps that produced two evolutionary inversions and one breakpoint within the human and mouse lineages, as discussed by Valero et al. (2000). At a sequence level, there is also evidence for significant divergence between the mouse and human WS regions. Indeed, the overall
level of mouse-human sequence conservation across the WS region is
atypically low; this is particularly the case for the noncoding (and
nonrepetitive) sequence (Table 3), but is also evident for some genes
(e.g., Wbscr15/WBSCR15 [Doyle et al. 2000; Martindale et al. 2000] and Pom121/POM121; see Table 4).
In summary, our studies show how comparative sequence analysis can simultaneously provide valuable data for addressing problems in both human genetics and genome evolution. Based on this experience and the anticipated surge in the acquisition of genomic sequence for numerous other organisms, one can now readily envision a new era of scientific inquiry, in which sequence-based comparisons drive the study of genome structure, function, and evolution.
METHODS
Mouse Genomic Sequencing
The overlapping set of mouse BAC (Shizuya et al. 1992) and PAC
(Ioannou et al. 1994) clones shown in Figure 1 and listed in Table 2
were selected from either the contig reported previously (DeSilva et
al. 1999; specifically, clones 391O16, 92N10, 303E12, and 42J20
isolated from the Research Genetics CITB-CJ7-B [strain 129SV] mouse
BAC library [http://www.resgen.com] and clone P510M19 isolated from
the RPCI-21 [strain 129SV] mouse PAC library
[http://www.chori.org/bacpac]) or one more recently constructed as
part of a larger mouse mapping effort (Thomas et al. 2000;
specifically, clones with the prefix 'RP23' that were isolated from
the RPCI-23 [strain C57BL/6J] mouse BAC library
[http://www.chori.org/bacpac; Osoegawa et al. 2000]). Colony-pure
clone isolates were subjected to restriction enzyme digest-based
fingerprint analysis (Marra et al. 1997), and the resulting data were
analyzed with the programs Image and FPC
(http://www.sanger.ac.uk/Software; Soderlund et al. 1997, 2000) to
assemble BAC/PAC contig maps, which in turn were used to guide the
selection of overlapping clones for sequencing. Each selected clone was
subjected to shotgun sequencing (Wilson and Mardis 1997; Green 2001),
essentially as described previously (DeSilva et al. 2000; Ellsworth et
al. 2000; Touchman et al. 2000). Sequences were edited and assembled
with the Phred/Phrap/Consed suite of programs (Ewing et al. 1998;
Ewing and Green 1998; Gordon et al. 1998).
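The idea behind fingerprint-based overlap detection can be illustrated with a minimal Python sketch. It is not the algorithm implemented in Image or FPC; it only assumes that each clone's fingerprint is a list of restriction-fragment sizes and that two bands count as shared when their sizes agree within a tolerance. The function name `shared_bands`, the tolerance value, and the example fragment sizes are all hypothetical.

```python
from typing import Sequence


def shared_bands(fp_a: Sequence[int], fp_b: Sequence[int], tolerance: int = 5) -> int:
    """Count bands shared by two fingerprints.

    Each fingerprint is a collection of fragment sizes. Two bands are
    treated as shared when their sizes differ by at most `tolerance`
    (an arbitrary illustrative value). Each band is matched at most once.
    """
    a, b = sorted(fp_a), sorted(fp_b)
    i = j = shared = 0
    # Walk both sorted lists in parallel, pairing bands greedily.
    while i < len(a) and j < len(b):
        diff = a[i] - b[j]
        if abs(diff) <= tolerance:
            shared += 1
            i += 1
            j += 1
        elif diff < 0:
            i += 1  # band in a is too small to match the current band in b
        else:
            j += 1  # band in b is too small to match the current band in a
    return shared


if __name__ == "__main__":
    # Hypothetical fragment sizes for two clones.
    clone_1 = [100, 250, 400, 900]
    clone_2 = [103, 260, 398, 1500]
    print(shared_bands(clone_1, clone_2, tolerance=5))
```

A higher shared-band count between two clones suggests a larger overlap, which is the kind of evidence a contig map uses when choosing a minimally overlapping set of clones for sequencing.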
Comparative Analyses of Mouse and Human Sequences
The generated mouse sequence reported here was subjected to
detailed computational analyses, including comparisons with the orthologous human sequence (when available). Genomic sequence from the
human WS region was obtained as follows. The available sequence
encompassing the LIMK1-RFC2 interval (Martindale et al. 2000) was
supplemented with individual sequence records found by
searching the NCBI databases (nr and htgs); most often,
these records contained draft-level (as opposed to finished) sequence. In some cases, only small sequence contigs were available. For example,
the CLDN3 gene could only be found on a ~1.6-kb stretch of
sequence, with the regions immediately flanking the gene not available
for comparison with the mouse sequence.
Mouse and human genomic sequences were compared by constructing a
percent-identity plot (Hardison et al. 1997; Ellsworth et al. 2000;
Schwartz et al. 2000). Specifically, the generated mouse sequence and
available human sequence were subjected to repeat masking with the
RepeatMasker program (A.F.A. Smit and P. Green, unpubl. data; see
http://www.genome.washington.edu/UWGC/analysistools/repeatmask.htm). The human sequence was then aligned relative to the mouse sequence using the BLASTZ component of the PipMaker
program (http://bio.cse.psu.edu; Schwartz et al. 2000). In the
resulting PIP, segments that were ≥50% identical between mouse and
human were plotted, with other regions appearing blank. Gaps within an
alignment appear as discontinuities between adjacent horizontal lines.
Representative portions of the PIP generated with the sequences from
the mouse and human WS regions are shown in Figures 2 and 3, with a
more complete summary of the PipMaker results available at
http://bio.cse.psu.edu/publications/desilva. Additional information
about the range of computational analyses performed is also detailed in
Tables 3 and 4.
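The plotting rule described above (gap-free aligned segments drawn as horizontal lines at their percent identity, with segments below the 50% cutoff left blank) can be sketched in Python as follows. This is a minimal illustration and not PipMaker's implementation: it assumes the alignment is already available as two equal-length gapped strings with `-` marking gaps, that the first string is the reference (mouse, in this study), and that the 50% cutoff is inclusive. The function names and the example alignment are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int, float]  # (ref_start, ref_end, percent_identity)


def gap_free_segments(ref_aln: str, other_aln: str) -> List[Segment]:
    """Split a pairwise alignment into gap-free segments.

    Coordinates are 0-based, end-exclusive positions in the ungapped
    reference sequence. A gap in either sequence ends the current segment,
    which is why gaps show up as breaks between horizontal lines in a plot.
    """
    if len(ref_aln) != len(other_aln):
        raise ValueError("aligned strings must have equal length")
    segments: List[Segment] = []
    ref_pos = 0        # current position in the ungapped reference
    seg_start = None   # reference coordinate where the current run began
    matches = length = 0
    for a, b in zip(ref_aln, other_aln):
        if a == "-" or b == "-":
            if length:
                segments.append((seg_start, seg_start + length,
                                 100.0 * matches / length))
            seg_start, matches, length = None, 0, 0
        else:
            if seg_start is None:
                seg_start = ref_pos
            length += 1
            # Compare case-insensitively so soft-masked bases still count.
            matches += a.upper() == b.upper()
        if a != "-":
            ref_pos += 1
    if length:
        segments.append((seg_start, seg_start + length,
                         100.0 * matches / length))
    return segments


def pip_segments(ref_aln: str, other_aln: str,
                 min_identity: float = 50.0) -> List[Segment]:
    """Keep only the segments that would be drawn at the given cutoff."""
    return [s for s in gap_free_segments(ref_aln, other_aln)
            if s[2] >= min_identity]


if __name__ == "__main__":
    # Hypothetical aligned fragments; the first string is the reference.
    for start, end, pid in pip_segments("ACGTAC--GTTT", "ACGAACGGG-AA"):
        print(f"{start}-{end}: {pid:.1f}% identity")
```

Each returned tuple corresponds to one horizontal line in the plot: its x-extent is the reference interval and its y-position is the percent identity.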
| |
ACKNOWLEDGMENTS |
|---|
We thank the staff of the NIH Intramural Sequencing Center (NISC) for their dedicated work in generating the mouse sequence reported here, with special thanks to Michelle Walker, Jyoti Gupta, Sirintorn Stantripop, and Quino Maduro for their efforts in sequence finishing. We also thank the Washington University Genome Sequencing Center for generating the human sequence; Amalia Dutra for FISH studies; Jennifer Munsterteiger for editorial assistance; and Elliott Margulies, Matthew Portnoy, and Arjun Prasad for critical review of the manuscript. This work was supported in part by grant HG02238 (W.M.), grant HG02325-01 (L.E.), and funds for mouse sequencing (E.D.G.) from the National Human Genome Research Institute (NIH).
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| |
FOOTNOTES |
|---|
4 These authors contributed equally to this work.
Present addresses: 5 Department of Animal Science, Oklahoma State University, Stillwater, OK 74078, USA; 6 Celera Genomics, Rockville, MD 20850, USA.
7 Corresponding author.
E-MAIL egreen@nhgri.nih.gov; FAX 301-402-4735.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.214802.
| |
REFERENCES |
|---|
|
|
|---|
globin cluster. Hum. Mol. Genet. 10: 371-382
a workbench for comparative genomic sequence analysis. Genome Res. 10: 1148-1157