Vol. 11, Issue 6, 1018-1033, June 2001
LETTER
ABSTRACT
Duplication and deletion of the 1.4-Mb region in 17p12 that is delimited by two 24-kb low copy number repeats (CMT1A-REPs) represent frequent genomic rearrangements resulting in two common inherited peripheral neuropathies, Charcot-Marie-Tooth disease type 1A (CMT1A) and hereditary neuropathy with liability to pressure palsy (HNPP). CMT1A and HNPP exemplify a paradigm for genomic disorders wherein unique genome architectural features result in susceptibility to DNA rearrangements that cause disease. A gene within the 1.4-Mb region, PMP22, is responsible for these disorders through a gene-dosage effect in the heterozygous duplication or deletion. However, the genomic structure of the 1.4-Mb region, including other genes contained within the rearranged genomic segment, remains essentially uncharacterized. To delineate genomic structural features, investigate higher-order genomic architecture, and identify genes in this region, we constructed PAC and BAC contigs and determined the complete nucleotide sequence. This CMT1A/HNPP genomic segment contains 1,421,129 bp of DNA. A low copy number repeat (LCR) was identified, with one copy inside and two copies outside of the 1.4-Mb region. Comparison between physical and genetic maps revealed a striking difference in recombination rates between the sexes with a lower recombination frequency in males (0.67 cM/Mb) versus females (5.5 cM/Mb). Hypothetically, this low recombination frequency in males may enable a chromosomal misalignment at proximal and distal CMT1A-REPs and promote unequal crossing over, which occurs 10 times more frequently in male meiosis. In addition to three previously described genes, five new genes (TEKT3, HS3ST3B1, NPD008/CGI-148, CDRT1, and CDRT15) and 13 predicted genes were identified. Most of these predicted genes are expressed only in embryonic stages. 
Analyses of the genomic region adjacent to proximal CMT1A-REP indicated an evolutionary mechanism for the formation of proximal CMT1A-REP and the creation of novel genes by DNA rearrangement during primate speciation.
INTRODUCTION
Submicroscopic duplications/deletions represent genomic rearrangements that can be responsible for inherited diseases. These are not visible by conventional karyotype assays and are thus likely to involve rearranged fragments smaller than 1-2 Mb. Disorders with these types of rearrangements may be caused by dosage effects of a single gene or multiple genes. Inherited diseases resulting from such genomic rearrangements may be categorized as genomic disorders, in contrast to classic Mendelian diseases caused by point mutations in the causative genes (for review, see Lupski 1998b; Shaffer and Lupski 2000).
Charcot-Marie-Tooth disease type 1A (CMT1A) is one of the first and best-characterized examples of a submicroscopic genomic disorder. CMT1A is the most common inherited peripheral neuropathy and accounts for 70% of CMT type 1 inherited demyelinating neuropathy (for review, see Lupski and Garcia 2001). Molecular genetic approaches have identified a submicroscopic duplication of the 1.4-Mb genomic region in chromosome band 17p12 in the majority of the CMT1A cases (Lupski et al. 1991; Raeymaekers et al. 1991; Wise et al. 1993; Nelis et al. 1996; Roa et al. 1996). A submicroscopic deletion of the same region results in hereditary neuropathy with liability to pressure palsy (HNPP), a distinct form of inherited peripheral neuropathy with episodic and milder manifestations (Chance et al. 1993, 1994). The CMT1A duplication and HNPP deletion represent products of unequal crossing over and a reciprocal recombination between flanking 24-kb homologous sequences termed CMT1A-REPs (Lupski 1998a). Subsequently, a gene encoding PMP22, a major component of the peripheral nervous system myelin, was mapped in the middle of this 1.4-Mb region (Matsunami et al. 1992; Patel et al. 1992; Timmerman et al. 1992; Valentijn et al. 1992). Several lines of evidence indicate that gain of one copy of PMP22 is responsible for CMT1A, whereas loss of one copy of PMP22 results in HNPP, with a PMP22 gene dosage effect as the mechanism for these disorders (Lupski et al. 1992).
Although duplication and deletion of PMP22 are the events responsible for CMT1A and HNPP, respectively, as many as 30 to 50 other genes may be contained in this 1.4-Mb region on the basis of its genomic size (Murakami et al. 1997b). A question remains as to why only PMP22 is dosage sensitive, whereas other genes in the region are apparently not. In addition, the clinical phenotypes of patients having the same 1.4-Mb duplication are quite variable. A formal possibility exists that minor dosage effects of genes other than PMP22 in this 1.4-Mb region contribute to the variability of phenotypic manifestations or to a combination of phenotypes (e.g., CMT + connective tissue disorder). Furthermore, there are rare case reports of smaller duplications (Ionasescu et al. 1993; Palau et al. 1993; Valentijn et al. 1993) or deletion (Chapon et al. 1996), raising the question as to whether such rare recombination events are mediated by other repeat units in this region.
To characterize the genomic architecture of this region, we constructed PAC and BAC contigs and produced a finished sequence across this 1.4-Mb interval. We defined a 1,421,129-bp genomic interval as the CMT1A duplication/HNPP deletion region. Here we report the identification of low-copy number repeats (LCRs), the comparison of genetic and physical maps, the identification and characterization of genes, and a mechanism for the evolution of new mammalian genes by DNA rearrangements.
RESULTS
Sequencing the 1.4-Mb CMT1A Duplication/HNPP Deletion Region
A contig of overlapping bacterial clones was constructed on the basis of marker content by use of pre-existing and newly generated STSs. Restriction fragment fingerprinting (Marra et al. 1997) verified the order of clones within the contig and identified a minimally overlapping tiling path of BAC and PAC clones for genomic characterization. Individual clones were subjected to shotgun sequencing, assembly, and finishing. A path of 12 overlapping clones contains the complete region bounded by the CMT1A-REPs, and this is part of a larger 15-clone path analyzed in this study (Fig. 1). We previously predicted the size of this genomic region to be 1.5 Mb on the basis of physical mapping data obtained by pulsed-field gel electrophoresis (PFGE) and Southern blotting analyses (Pentao et al. 1992). Our completed sequence indicates that the entire region from the first nucleotide of the proximal CMT1A-REP to the last nucleotide of the distal CMT1A-REP is 1,421,129 bp.
Repetitive Elements
RepeatMasker indicates that high copy number retrotransposable elements and simple tandem repeats (STRs) account for 43.37% of the entire CMT1A/HNPP region (Table 1). The repetitive elements include 9.97% Alu sequences and 13.43% LINE1 elements, a distribution comparable with that of chromosome 21 but in contrast to that of chromosome 22, which contains 16.8% Alu sequences and 9.73% LINE1 elements (Dunham et al. 1999; Hattori et al. 2000).
There is a mariner insect transposon-like element 140 kb centromeric to PMP22, termed HSMAR2-PMP22 (Fig. 1). This mariner element is interrupted by an insertion of an Alu element, indicating that it is no longer active. However, we observed both 5' and 3' inverted terminal repeats (ITRs), suggesting that this mariner element has the potential to act as a cis-acting substrate to promote double-strand DNA breakage (Reiter et al. 1996, 1999).
We identified 53 STRs with >11 repeating units. Nine STRs (D17S793, D17S261, D17S122, D17S1357, D17S1356, D17S839, D17S1358, D17S955, and D17S921) were mapped previously to this region, two (D17S918 and D17S900) were mapped to the region but not known to be within the CMT1A/HNPP interval, and 42 represent newly identified potential polymorphic markers. The new STRs include 26 dinucleotide [21 (CA)n, 2 (GA)n, 1 (TA)n, 1 (TA)n(CA)n, and 1 (TG)n(GA)n], 2 trinucleotide [2 (CAA)n], 10 tetranucleotide [6 (TTTA)n and 4 (TTTC)n], and 4 pentanucleotide [1 (TTTTC)n, 1 (CAATA)n, 1 (CGATA)n, and 1 (TTTTA)n] elements. Fifteen of these STRs have been shown to reveal significant polymorphic variation in different ethnic populations (Badano et al. 2001).
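The STR criterion above (more than 11 repeating units) can be illustrated with a minimal Python sketch. This is not the method used in the study, which does not describe its STR-detection procedure here; the function name `find_strs`, the restriction to perfect (uninterrupted) repeats, and the 2-5-bp unit range are assumptions made for illustration only.

```python
import re

def find_strs(seq, min_unit=2, max_unit=5, min_repeats=12):
    """Return (start, unit, copies) for perfect tandem repeats in seq.

    Hypothetical helper: only uninterrupted repeats with at least
    min_repeats copies (i.e. >11 units) of a 2-5-bp motif are reported.
    """
    hits = []
    for k in range(min_unit, max_unit + 1):
        # One k-bp unit followed by at least (min_repeats - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (k, min_repeats - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            # Skip units that are themselves repeats of a shorter motif
            # (e.g. "CACA"), which are already found at the smaller k.
            if any(unit == unit[:d] * (k // d) for d in range(1, k) if k % d == 0):
                continue
            hits.append((m.start(), unit, len(m.group(0)) // k))
    return sorted(hits)

# Toy sequence: a (CA)15 and a (TTTA)12 repeat embedded in unique sequence.
toy = "GGATCC" + "CA" * 15 + "GGT" + "TTTA" * 12 + "C"
strs = find_strs(toy)
```

On the toy sequence, `strs` holds one dinucleotide and one tetranucleotide repeat; a run of only 11 copies, such as `"CA" * 11`, falls below the threshold and is not reported.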
Low Copy Repeats: An 11-kb Element
In addition to the previously defined CMT1A-REPs (24,011 bp of 98.7% nucleotide identity, Reiter et al. 1997), other low copy repeats were identified (Fig. 1). LCRA1 and LCRA2, located 32 kb centromeric and 140 kb telomeric to the distal CMT1A-REP in inverted orientation, are highly similar 11-kb low copy number repeat segments. We also found a 4-kb truncated copy of this repeat, termed LCRB, ~180 kb centromeric to the proximal CMT1A-REP (Fig. 1). Therefore, one copy of this repeat is located within the 1.4-Mb region and the other two are located outside of this region. LCRA1 and LCRA2 are highly similar throughout the 11 kb (97% identity), whereas LCRB aligns only with a 4-kb interior portion (95% identity to LCRA1) (Fig. 2A). Further sequence comparisons revealed one small region (132 bp) that represents DNA rearrangements between these LCRs (Fig. 2A). LCRA1 contains three contiguous fragments (25, 89, and 18 bp) that involve small tandem repeat units (14- and 9-bp monomers). The corresponding region in LCRA2 contains a duplication of the 25-bp monomer as well as a deletion of the 18-bp fragment, probably resulting from polymerase slippage at the 14- and 9-bp repeat units flanking these 25- and 18-bp fragments, respectively, in LCRA1. Furthermore, the recombination breakpoint of LCRB is located in this small region between the 14- and 9-bp repeat units, resulting in truncation of the 89-bp fragment and loss of the 25-bp fragment. No 18-bp deletion was found in LCRB. This genomic evidence indicates that LCRA1 is likely the progenitor and the other two LCRs are derivatives of LCRA1. A duplication event that resulted in LCRB may have been followed by another duplication that generated LCRA2.
Searches of the high throughput human genome sequence revealed the presence of multiple copies of this LCR in the genome. After elimination of the highly repetitive 4.4-kb flanking sequences from this 11-kb fragment, BLAST searches with the 6.6-kb region identified 29 BAC clones assigned to 9 different chromosomes: 1, 4, 8, 9, 11, 13, 16, 17, and 22 (data not shown). Electronic PCR analyses (Schuler 1998) of each BAC clone showed STSs from multiple chromosomes matching a single BAC sequence, whereas the 11-kb LCRA1 only contains a chromosome 17-specific STS, suggesting the repeat structures involving these loci in the genome are complex. Further mapping and characterization are required to elucidate the nature of these repeat structures involving multiple loci in the genome.
BLAST searches of this 6.6-kb region against the human EST database revealed a number of clones homologous to this portion of the LCRA1 low copy repeat. There are two different genes or groups of genes: one homologous to the 3- to 4-kb region from the centromeric side (named CDRT15, see details in the following section) and the other to the 4.5- to 6.3-kb region. Further database searching revealed that the latter is a processed pseudogene of KIAA1511, which encodes a protein of unknown function and maps to chromosome 1 (GenBank accession no. AB040944). Interestingly, ESTs belonging to the former group have various levels of homology, suggesting that these ESTs may be transcribed from multiple loci in the genome. Further sequence comparison of these EST clones to the genomic sequence database mapped them to at least nine different genomic loci.
Comparison between the Physical and Genetic Maps
In previous efforts to identify the CMT1A gene by linkage analysis, the CMT1A region was estimated to be much larger than 1.4 Mb on the basis of the genetic distance between linked markers (Patel et al. 1990; Timmerman et al. 1990). However, subsequent physical mapping with PFGE and YAC-based STS content mapping revealed a physical size of 1.5 Mb (Pentao et al. 1992). One hypothesis to explain the observed discrepancy between genetic and physical distances has been that a potential recombination hotspot exists within the CMT1A genomic region in addition to the positional recombination hotspot located within CMT1A-REP (Reiter et al. 1996). To evaluate the actual recombination frequency, we systematically compared the genetic map and genome sequence-based physical map of the CMT1A duplication/HNPP deletion region by integrating the Marshfield genetic mapping data into our physical map. Eight polymorphic microsatellite markers (D17S900, D17S921, D17S955, D17S839, D17S918, D17S122, D17S261, and D17S793) were found in both the Marshfield genetic map and genomic sequence from the 1.4-Mb region (Broman et al. 1998). Of these, two markers (D17S900 and D17S918) were not mapped inside this region in the previous physical maps (Murakami and Lupski 1996; Boerkoel et al. 1999). Three markers identified previously in the CMT1A region (D17S1356, D17S1357, and D17S1358) were not included in the Marshfield study (Blair et al. 1995).
We generated a genetic/physical map correlation (Fig. 3A) and compared it with the flanking 1.5-Mb regions. Physical distances in the proximal regions include estimates based on BAC physical mapping data at 100-kb resolution on the centromeric side (J.R. Lupski and B. Birren, unpubl.) and fully finished sequence on the telomeric side. These genetic/physical map comparisons indicate that the recombination frequency across a region of at least 4.5 Mb, which includes the CMT1A duplication/HNPP deletion region, is low in males. In sharp contrast, this region recombines frequently in females. The cM/Mb ratio of the entire 4.5-Mb region is 5.5 for females, 0.67 for males, and 3.3 for the sex-averaged map. As a result of this contrast, this region has a high female/male recombination frequency ratio, which increases steeply toward the centromere (Fig. 3B). Neither the CMT1A-REP regions nor the entire CMT1A/HNPP region has a higher recombination frequency than flanking regions. The 820-kb region between D17S1843 and D17S918, which spans the proximal CMT1A-REP, revealed no recombination in the families examined in either male or female meiosis (Broman et al. 1998). There is also a low recombination region in both sexes telomeric to distal CMT1A-REP for >1 Mb.
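The cM/Mb comparison described above is a simple ratio of genetic to physical distance. The following Python sketch shows the arithmetic under stated assumptions: the function names are hypothetical, the 3-Mb and 20-cM inputs are the approximate D17S921 to D17S620 interval values quoted in the Discussion, and the 5.5 and 0.67 cM/Mb values are the region-wide rates reported in the text.

```python
def cm_per_mb(genetic_cm, physical_bp):
    """Recombination rate in cM/Mb for one marker interval."""
    return genetic_cm / (physical_bp / 1e6)

def female_male_ratio(female_rate, male_rate):
    """Female/male ratio of recombination; infinite if males show none."""
    return float("inf") if male_rate == 0 else female_rate / male_rate

# Approximate D17S921-D17S620 interval (~3 Mb): 20 cM in females, 0 cM in males.
female_interval = cm_per_mb(20.0, 3_000_000)
male_interval = cm_per_mb(0.0, 3_000_000)

# Region-wide rates reported for the 4.5-Mb region.
region_ratio = female_male_ratio(5.5, 0.67)
```

With these inputs the female rate for the interval is about 6.7 cM/Mb, the male rate is zero, and the region-wide female/male ratio is about 8.2.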
Genes in the 1.4-Mb Region
Sequence analysis was performed by the use of NIX (nucleotide identification of unknown sequences), which incorporates a number of independent gene prediction tools (Fig. 1; Table 2). Each gene was further characterized by additional database searches and expression analyses. We categorized the genes into three groups: (I) genes for which we have biological evidence including cDNA sequences, gene structures, similarity to other genes, or multiple spliced ESTs, matching gene predictions with complete gene structure; (II) predicted genes with limited information such as multiple EST matches and/or predicted exonic structures, but for which complete gene structural information is not available; and (III) pseudogenes. Overall, we identified 21 genes or predicted genes (Groups I and II) in this region.
Genes
Of the eight genes in this group, four are known: HREP, PMP22, HS3ST3B1, and COX10. Of these, HS3ST3B1 is the only gene newly mapped to this region. COX10 and HREP are located in the CMT1A-REP regions in which complete sequence data were available previously (Reiter et al. 1997; Kennerson et al. 1997, 1998; Murakami et al. 1997a). We thus describe the genomic structures of PMP22 and HS3ST3B1 in further detail. Four previously unknown genes were also identified: NPD008/CGI-148, tektin3 (TEKT3), CDRT1 (CMT1A duplicated region transcript 1), and CDRT15.
PMP22
PMP22, the gene responsible for CMT1A and HNPP, has four coding exons and two alternatively utilized exons I (Suter et al. 1994
HS3ST3B1
The cDNA sequence for HS3ST3B1 (heparan sulfate [D-glucosaminyl] 3-O-sulfotransferase 3B1) was described previously, but the genomic structure was unknown (Shworak et al. 1999
NPD008/CGI-148
This transcript has a 615-bp ORF, encoding a predicted 205 amino acid protein. The structure of this gene is shown in Figure 4C. The cDNA sequence reveals an almost complete match with two genes in the database, NPD008 (GenBank accession no. AF223467) and CGI-148 (GenBank accession no. AF151906). NPD008 was isolated from pituitary glands, whereas CGI-148 was reconstructed by a comparative EST database search between human and Caenorhabditis elegans (Lai et al. 2000
TEKT3
TEKT3 (Tektin3), located 50 kb centromeric to PMP22, spans 37.7 kb (Fig. 4D). Its eight exons encode a 490 amino acid protein with significant homology to the tektin protein families. The closest homology was to the sea urchin protein, tektin A1, suggesting that this gene is likely to encode a human ortholog for tektin A1, termed TEKT3. As observed in other members of the tektin family, TEKT3 also has a highly conserved tektin domain, RSNVELCRD (underlined residues were conserved in TEKT3) (Norrander et al. 1998
CDRT1
CDRT1 is located 1.3 kb telomeric to proximal CMT1A-REP (Fig. 4C). Multiple human and mouse EST alignments reveal a single exon gene encoding a 243 amino acid protein with unknown function. The upstream 1.3-kb region has weak but potential promoter sequence motifs estimated by the promoter prediction programs TSSW and NNPP. Northern blotting identified a major 2-kb and a minor 1-kb transcript in the pancreas and a faint 2-kb transcript in the heart (Fig. 5A). Further evolutionary analysis of this gene is described in a subsequent section.
CDRT15
CDRT15 is located within LCRA1. The 778-bp cDNA sequence is divided into three exons, encoding a 188 amino acid protein of unknown function (Fig. 4E). As mentioned above, there are at least eight paralogous copies of this gene in the human genome. Submitted sequences include one full-length cDNA clone encoding an unknown protein (GenBank accession no. AF038169) and numerous partial sequences. We reconstructed complete coding cDNA sequences by aligning these ESTs with each other. At least three cDNA clones were found to contain ORFs with possible exon/intron structures. Interestingly, some have insertion/deletion mutations that result in frameshifts of the ORF, thus encoding totally different proteins; others have insertions/deletions that appear to result in early termination. It is not clear which gene copies are producing functional proteins and which are transcribed pseudogenes.
Predicted Genes
We identified 13 predicted genes (Fig. 1; Table 2). For each of these, the available information is insufficient to determine the full-length cDNA sequence. However, substantive evidence, including matching UniGene clusters, matching ESTs with intron structure, and significant scores by gene prediction programs, suggests that these represent bona fide genes. Interestingly, Northern blotting analyses of these genes by use of an adult tissue panel revealed minimal expression, whereas RT-PCR analysis indicated substantial expression in embryonic tissues (Fig. 5). Results of the database and expression analyses for these 13 genes are summarized in Table 2.
Pseudogenes
Six pseudogenes were identified in the CMT1A/HNPP region (Fig. 1; Table 2). Each locus reveals evidence for absent introns and coding sequence disrupted by mutations, suggesting that they are processed pseudogenes. The pseudogene for cyclophilin A (CYPAP) revealed deletion of a region corresponding to the first 180 bp of cDNA sequence. The pseudogene for KIAA1164 showed deletion of the first 2 kb of the original 4-kb cDNA, inversion of a 1-kb region, and insertion of an L1 element.
Evolution of New Genes by DNA Rearrangement During Speciation: Origin of HREP and CDRT1
Database searches to identify mouse orthologs of human genes in this region provided evidence of an additional ancestral rearrangement with functional consequences. Searches with human CDRT1 sequences identified mouse ESTs with coding sequences extending 5' upstream from the initiation site for the human gene (Fig. 6A). Human sequence corresponding to this 5' extension is not found in the genomic sequence from the CDRT1 region. In fact, the mouse EST sequences that extend 298 bp 5' from the start of the human CDRT1 gene do not match any sequence in the human genome. However, additional sequences further 5' in the mouse EST contig show similarity to the human HREP gene. The human HREP gene is located centromeric to the proximal CMT1A-REP and, like the human CDRT1 gene, is transcribed in the telomeric direction, ending within the proximal CMT1A-REP (Kennerson et al. 1997, 1998) (Fig. 1). In searching for a mouse ortholog for HREP, we identified a 759-bp continuous fragment of mouse HREP partial mRNA sequence. The first 269 bp of this sequence aligns with the human cDNA and corresponds to human exons IV and V. However, the remainder of the mouse mRNA does not align with human HREP exon VI; instead, the sequences at the 3' end of this mouse HREP EST contig contain CDRT1 sequences. Exon VI of human HREP is located inside the proximal CMT1A-REP and utilizes complementary sequence of COX10 pseudoexon VI. Mice do not have the proximal CMT1A-REP; the proximal CMT1A-REP appeared during primate speciation between gorilla and chimpanzee (Kiyosawa and Chance 1996; Reiter et al. 1997; Boerkoel et al. 1999; Keller et al. 1999). These data suggest that in the mouse, sequences corresponding to human HREP and CDRT1 are part of a single gene. The fact that 298 bp from within the mouse ortholog of HREP does not match genomic sequence on either side of the proximal CMT1A-REP suggests that the primate progenitor to human lost some genome sequence when the proximal CMT1A-REP integrated into this region (Fig. 6B).
DISCUSSION
Human 17p12 is a genomic region prone to DNA rearrangement (the CMT1A duplication and HNPP deletion) and has undergone relatively recent evolutionary changes during primate speciation (the 24-kb duplicated CMT1A-REPs). Although extensive studies have been performed to elucidate the molecular mechanism for the CMT1A duplication and HNPP deletion, an unequal crossing-over event via homologous recombination utilizing the flanking CMT1A-REPs as substrates, less information has been available for the 1.4-Mb CMT1A/HNPP genomic region between the CMT1A-REPs (Murakami and Lupski 1996; Murakami et al. 1997b; Boerkoel et al. 1999). The finished genomic sequence of this 1.4-Mb region has allowed the elucidation of the genes within the genomic interval and has provided information regarding the genomic architecture of the CMT1A/HNPP region. Our analyses uncovered new LCRs, revealed male-specific reduced recombination, identified novel genes, and showed a mechanism for the evolution of new genes through DNA rearrangement. Our findings suggest that the human genome is in a state of flux, with DNA rearrangements apparently responsible for a significant amount of genomic evolution.
LCRs
Large genomic rearrangements mediated by LCR units are associated with a number of human genomic disorders (Lupski 1998b; Shaffer and Lupski 2000). In the CMT1A/HNPP region, in addition to the previously reported CMT1A-REP (Pentao et al. 1992; Reiter et al. 1996, 1997), we have identified three copies of a novel LCR: LCRA1, LCRA2, and LCRB. Interestingly, the genomic organization of LCRA1 and LCRA2 consists of inverted repeats flanking the 200-kb region containing the distal CMT1A-REP (Fig. 1). This genomic structure may allow flipping or inversion of the 200-kb genomic fragment in between, thus resulting in the CMT1A-REPs having an inverted orientation (Fig. 2B). Such a genomic arrangement may prevent the interchromosomal unequal crossing over that results in CMT1A duplication and HNPP deletion, making such individuals less susceptible to de novo duplication/deletion. This hypothesis is directly testable by determining the CMT1A-REP orientation in the parent of origin for the de novo rearrangement.
A nucleotide sequence comparison between these LCRs revealed that the LCRA1 is likely a progenitor and the other two arose from subsequent duplication events. Two features indicate that the LCRB was probably generated first by local duplication followed by another duplication event to generate LCRA2 from LCRA1. First, the 18-bp deletion only exists in LCRA2 and the sequence homology between LCRA1/LCRB is lower than that between LCRA1/LCRA2. Secondly, a corresponding copy of CDRT15 in LCRA2 has premature termination and thus is likely a pseudogene of CDRT15.
Multiple copies of LCRs are distributed throughout the human genome. Some BAC clones containing these LCRs map to the Smith-Magenis syndrome (SMS) region on 17p11.2. SMS-REP is a large (>200 kb) low copy region-specific repeat that acts as a homologous recombination substrate and is responsible for a large (~4 Mb) genomic deletion and duplication associated with human disorders (Chen et al. 1997; Potocki et al. 2000). Six copies of the LCRs were also mapped in 22q11.2, but not in the chromosome 22-specific LCRs (Dunham et al. 1999). Therefore, this LCR family manifests complex divergence throughout the human genome. Because copies of this LCR family are located close to the recombination breakpoints of SMS in 17p11.2, this LCR family may potentially be involved in the mechanism generating other genomic disorders.
Furthermore, these genome-wide repeat units also involve a gene family that reveals multiple transcripts from different loci. At least three copies of the transcript with no premature termination have been isolated. Further characterization of the sequences of these genomic loci as well as determination of the function of CDRT15 and its paralogs will clarify the complicated structure of these LCRs.
Comparison of Genetic and Physical Maps of the CMT1A Duplication/HNPP Deletion Region
We hypothesized previously that the mariner transposon-like element MITE, which is located ~500 bp proximal to the preferential region for strand exchange or hotspot for unequal crossing over in the CMT1A-REPs, may promote double-strand DNA breaks and stimulate the homologous recombination (Reiter et al. 1996, 1998). Multiple studies from CMT1A duplication and HNPP deletion patients in different world populations confirm a positional hotspot for recombination within an ~500-bp region of the 24,011-bp homologous CMT1A-REPs (Kiyosawa et al. 1995; Lopes et al. 1996; Reiter et al. 1996; Timmerman et al. 1997; Yamamoto et al. 1997; Chang et al. 1998). It has been suggested that CMT1A-REPs may also mediate high-frequency homologous recombination of this region at a genomic level.
To investigate this latter hypothesis, we examined the relationship between genetic and physical distances using 21 known STS markers that span this portion of the genome (Fig. 3A). Although we expected increased recombination frequency at some specific cis-acting sequence, such as CMT1A-REPs or HSMAR2-PMP22, there is no significant change in the recombination frequency throughout the region. Instead, we observed evidence for reduced recombination in the 820-kb region between D17S1843 and D17S918 that contains the proximal CMT1A-REP and two of three HSMAR2 elements. These data indicate that the HSMAR2 elements may not increase the frequency of the recombination in the germ line, or the resolution and sensitivity to detect their effect on recombination ratio may be below the lower limit of detection in this study.
Interestingly, in male meiosis, the genomic region with low recombination frequency extended beyond the CMT1A region in both the proximal and distal directions. As shown in chromosome 7, high female/male distance ratio in the genetic versus physical map is likely the result of reduced recombination in males, not of enhanced recombination in females (Broman et al. 1998). There was no recombination identified in the male meiotic map between D17S921 and D17S620 (~3 Mb), whereas in females this same physical distance revealed a 20-cM genetic distance. This reduced male recombination frequency may result in an extended region of two allelic chromosomes without crossing over or synapse formation in meiosis. Such an absence of synapse formation could in turn allow the chromosomes to slip on each other, thus enabling an unequal crossover involving the tandem repeat units, CMT1A-REPs. On the other hand, frequent interchromosomal equal crossovers may provide anchors to prevent chromosomal slipping and reduce the chance of unequal crossovers between the proximal and distal CMT1A-REPs. In support of this hypothesis, de novo CMT1A duplication events occur 10 times more frequently in males than females (Palau et al. 1993; Lopes et al. 1997). Therefore, we hypothesize that one of the mechanisms for the male sex preference in de novo CMT1A duplication may result from the male sex-specific low recombination frequency throughout the region. Interestingly, in the studies of human trisomies, significant reduction of genetic recombination was observed in the trisomy-generating meiosis, and it was suggested that absence of pairing and/or recombination contributes to nondisjunction (Lamb et al. 1996). In the context of the hypothesis that decreased recombination may increase the unequal crossover at the proximal and distal CMT1A-REPs, individuals with reduced meiotic recombination may have an increased propensity to generate unequal reciprocal recombination products.
Han et al. (2000) reported recently, by use of sperm DNA analysis, that the frequency of unequal crossover between the proximal and distal CMT1A-REPs is almost identical to that of the average equal crossover in the human genome. This result also indicates that the CMT1A-REPs do not contain a genomic recombination hotspot for the unequal crossover. In the same study, Han et al. (2000) localized the recombination breakpoint in the same hotspot identified previously by the analysis of patient DNA. Together with the fact that the CMT1A-REPs do not contain a genomic hotspot for equal crossover according to the comparison of the genetic and physical maps in this study, the hotspot in the CMT1A-REP should be defined as a hotspot for the position preference, not for recombination frequency (Han et al. 2000).
Genes in the CMT1A Duplication/HNPP Deletion Region
In the 1.4-Mb CMT1A duplication/HNPP deletion region, we identified five genes and 13 predicted genes in addition to three previously mapped genes. The current estimated average number of human genes per Mb is between 9.6 and 12.9 (International Human Genome Sequence Consortium 2001). Previous studies suggested that chromosome 17 is gene-rich by a factor of 1.44 (Deloukas et al. 1998), which increases the estimated number of genes on chromosome 17 to between 13.8 and 18.6 per Mb. The combination of the eight confirmed and 13 predicted genes within this 1.4-Mb region yields a density of 15 genes/Mb, well within this estimate.
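The gene-density comparison above can be reproduced with a few lines of Python. This is only a worked check of the arithmetic using the figures quoted in the text (1,421,129 bp, eight confirmed and 13 predicted genes, 9.6 to 12.9 genes/Mb genome-wide, and a 1.44-fold enrichment for chromosome 17); the variable and function names are illustrative.

```python
REGION_BP = 1_421_129            # CMT1A duplication/HNPP deletion region
CONFIRMED, PREDICTED = 8, 13     # Group I and Group II genes
GENOME_GENES_PER_MB = (9.6, 12.9)
CHR17_ENRICHMENT = 1.44

def gene_density(n_genes, region_bp):
    """Genes per Mb for a region of the given size."""
    return n_genes / (region_bp / 1e6)

# Observed density in the sequenced region.
density = gene_density(CONFIRMED + PREDICTED, REGION_BP)

# Expected range on chromosome 17 after applying the enrichment factor.
expected = tuple(round(x * CHR17_ENRICHMENT, 1) for x in GENOME_GENES_PER_MB)
```

Here `density` is about 14.8 genes/Mb, which rounds to the 15 genes/Mb quoted above and lies within the expected range of 13.8 to 18.6 genes/Mb.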
In addition to PMP22, we mapped one previously characterized
and two uncharacterized genes to this region, HS3ST3B1,
NPD008/CGI-148, and TEKT3. HS3ST3B1 is one
of the five isoforms of genes encoding heparan sulphate biosynthesizing
enzymes, heparan sulphate sulphotransferases (HS3STs). Heparan sulphate
binds to specific proteins such as antithrombin and several growth
factors, and thereby regulates various biological processes including
anticoagulation and angiogenesis (Rosenberg et al. 1997
). HS3STs
catalyze sulfation of monosaccharide sequences of heparan sulphate,
which is believed to be critical for binding to the target proteins.
HS3ST3B1 has a closely related isoform, HS3ST3A1,
which also has similar patterns of tissue expression and encodes a
protein with similar enzymatic activity. Together with the nature of
this type of catalytic enzyme, wherein changes in dosage usually do not
affect the system, the existence of a paralog with similar enzymatic
properties suggests that duplication or deletion of one allele of
HS3ST3B1 may not affect heparan sulphate biosynthesis.
Tektins constitute a family of proteins and represent one of the
components of motile and primary cilia, associating with the major
structural component of cilia, microtubules (Linck and Langevin 1982; Linck et al. 1985; Steffen and Linck 1988). Tektins have been best
studied in sea urchins, a species in which three isoforms have been
isolated: tektin A1, tektin B1, and tektin C1. Mammalian homologs
for tektin B1 and tektin C1 have been isolated (GenBank accession nos. NM_014466, NM_011902, and NM_011569) (Norrander et al. 1998; Iguchi et al. 1999). In the CMT1A/HNPP region, we identified TEKT3 as the first homolog for tektin A1 in mammals.
Like other tektin homologs, it is preferentially expressed in testis.
Tektin A1 and tektin B1 are thought to be assembled as heterodimers
to comprise the tektin filament, and interact with tubulins to form the
basis of the high degree of stability of doublet microtubules (Pirner and Linck 1994). In the mouse sperm, the tektin B1 homologous protein
tekt2 is localized in flagella, strongly suggesting that tektins may
play essential roles in formation of sperm and in sperm motility
(Iguchi et al. 1999). Loss of TEKT3 may reduce the motility of
the sperm of HNPP patients because sperm are haploid.
Relevance to CMT1A/HNPP Genomic Disorders
Of the new LCRs found in the CMT1A/HNPP region, LCRA2 and LCRB are
present in a tandem orientation and flank PMP22, suggesting that they have the potential to be substrates for unequal homologous recombination leading to duplication or deletion of PMP22.
Four families with alternate size duplication or deletion were reported previously (Ionasescu et al. 1993; Palau et al. 1993; Valentijn et al. 1993; Chapon et al. 1996). Genetic studies with a few markers showed
that the proximal break points of these cases are located close to or
within the proximal CMT1A-REP, and the distal break points mapped
between PMP22 and D17S125 (Ionasescu et al. 1993; Palau et al. 1993; Valentijn et al. 1993; Chapon et al. 1996
). Therefore, at least in these cases, recombination between the LCRs
found in this study is unlikely to be involved in the small duplication or deletion. Additional analyses for LCRs in this region failed to identify any significant stretches of homologous sequence (>1 kb) that may serve as substrates for such alternative homologous recombination events.
Most of the genes identified in this study revealed extremely low expression in adult tissues but obvious expression in fetal tissues. It is surprising that these embryonic genes have no developmental effect on the individuals with duplication or deletion of the 1.4-Mb region. The observation that to date PMP22 is the only gene responsible for CMT1A/HNPP due to the mechanism of gene dosage accompanied by duplication or deletion of this region suggests that dosage sensitivity may be a unique property of PMP22 but not of the other genes in the 1.4-Mb region. The sequence of most of these genes contains insufficient information to estimate their function. However, the cumulative data suggest that only 1 in 21 genes, at least in this portion of the human genome, is sensitive to dosage effects.
Evolution of New Genes, HREP and CDRT1, by DNA Rearrangement
Identification of the COX10 gene spanning the distal
CMT1A-REP and only one exon (pseudoexon VI) in the proximal CMT1A-REP indicates that the distal copy is the original and the proximal CMT1A-REP represents a duplicated copy (Murakami et al. 1997a; Reiter et al. 1997). Evolutionary studies reveal that this
insertional event occurred between gorilla and chimpanzee (Kiyosawa and Chance 1996; Reiter et al. 1997; Boerkoel et al. 1999; Keller et al. 1999). Subsequently, another gene, HREP, was identified close to the proximal CMT1A-REP (Kennerson et al. 1997, 1998). HREP is transcribed toward the telomere from outside the proximal CMT1A-REP and terminates within the proximal CMT1A-REP. The last exon of HREP occurs at the same position, but on the complementary
strand of COX10 pseudoexon VI (Kennerson et al. 1997).
Interestingly, we found that a mouse gene homologous to human HREP does not share the region after exon V with human HREP, but instead matches CDRT1, which is adjacent to the proximal CMT1A-REP on the telomeric side. Therefore, CDRT1 and HREP are likely to be parts of an Ancestral Gene before the Integration of Proximal CMT1A-REP (AGIP) (Fig. 6). The CMT1A-REP insertional event, which is estimated to have occurred during primate speciation between gorilla and chimpanzee, divided AGIP into two genes, HREP and CDRT1. These findings show an example of evolution of new genes by DNA rearrangement during mammalian genome evolution. The first half of AGIP became HREP utilizing a part of CMT1A-REP as a new terminating exon, whereas the last exon of AGIP became a single exon gene CDRT1. Interestingly, expression profiles of these two genes are different; HREP is expressed in heart and skeletal muscle, whereas the major expression of CDRT1 is observed in pancreas. Furthermore, a region in AGIP between the HREP syntenic portion and CDRT1 syntenic portion was likely to be lost during the CMT1A-REP integration, suggesting that this insertional genomic rearrangement was accompanied by loss of a genomic fragment. Further evolutionary analysis of the genomic region surrounding proximal CMT1A-REP in chimpanzee and gorilla may elucidate the mechanism of integration of the CMT1A-REP.
In conclusion, we have evaluated the 1.4-Mb finished genomic sequence of the CMT1A/HNPP region. Data obtained from this genome-sequencing study enable new insights into human genome architecture and mammalian genome evolution, show evolution of new genes by genome rearrangements during primate speciation, and add to the plethora of information being created by the complete nucleotide sequencing of the human genome.
METHODS
Construction of Physical Maps of the 1.4-Mb CMT1A/HNPP Region
We implemented two independent approaches to construct the physical
map of the CMT1A/HNPP genomic region. The first approach utilized
STS content-based mapping performed at Baylor College of Medicine. We
used the end sequences of the multiple cosmid clones from a
previously constructed cosmid contig of this region (Murakami and Lupski 1996) to screen PAC (P1 artificial chromosome; RPCI-1, Roswell Park Cancer Institute, Buffalo, NY) and BAC (bacterial artificial chromosome; CITB, California Institute of Technology) libraries by PCR on DNA pools and/or by filter hybridization. Eight
known genetic markers and the PMP22 gene were also used as
probes. Overlaps of each large insert genomic clone were evaluated by
EcoRI fingerprinting by use of a FluorImager (Molecular
Dynamics), as described elsewhere (Marra et al. 1997).
A parallel and alternative approach used YAC-based mapping conducted at
the Whitehead Institute Center for Genome Research as a part of the
effort to sequence the entire human chromosome 17. To create reliable
physical maps despite significant amounts of low-copy repetitive
sequence, we used a high density of unique markers. In addition to
pre-existing markers, new markers were generated from shotgun sequences
derived from pulsed-field gel-purified YACs. Overlapping YACs from the
CEPH Mega-YAC library (Chumakov et al. 1995) that were not known to be
chimeric based on STS content (Hudson et al. 1995) were selected from
the CMT1A region (Pentao et al. 1992). Each YAC was fractionated and
subcloned separately into M13. Single-sequencing reactions were
performed on several hundred subclones from each YAC and the resulting
sequences contained 20%-60% yeast DNA, depending on the YAC.
Thirty-eight base pair overgos were designed (Ross et al. 1999) and
further tested by hybridization to eliminate probes that contained
highly or moderately repetitive sequences that escaped detection during their design. BAC library (RPCI-11) screening was performed by hybridization with pools of up to 40 overgos derived from a single YAC, with an
average density of 30 overgos per Mb of genomic region. Positive clones
from the library screen were streaked on agar plates to obtain single
colonies and one clone from each positive address was rearrayed into
new 96-well plates. To generate marker content maps, replica filters
made from the 96-well plates were hybridized individually with each of
the overgos used in the library screen, as well as overgos derived from
overlapping YACs, and overgos representing other markers mapped in the
region. Markers that hybridized to greater than the expected number of
clones were not included in the final map, nor were markers that were
not linked by at least two clones. Clones that did not share at least two markers with an overlapping clone were not included in the map. The
final density of markers in the BAC map of the region was ~1 marker
every 10 kb. This high-density physical mapping generated an
overlapping contig with 8- to 10-fold coverage. Combining these two
physical maps, clones with a minimal tiling path were selected for
sequencing (Fig. 1).
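The three exclusion rules applied to the marker content map can be sketched as a simple filter. This is an illustrative single-pass reading of the rules, not the procedure actually used: the function name, the input layout, and the `max_clones_per_marker` cutoff (standing in for the "expected number of clones", for which no value is given) are all assumptions.

```python
from itertools import combinations


def filter_map(hits, max_clones_per_marker):
    """Apply the three exclusion rules to marker-content data.

    hits: dict mapping marker name -> set of clone names it hybridized to.
    max_clones_per_marker: hypothetical cutoff for the "expected number
        of clones"; the text does not state a numeric value.
    Returns (kept_markers, kept_clones).
    """
    # Rule 1: drop markers that hit more clones than expected.
    # Rule 2: drop markers that are not present in at least two clones.
    markers = {
        m: clones
        for m, clones in hits.items()
        if 2 <= len(clones) <= max_clones_per_marker
    }

    # Invert to clone -> set of retained markers.
    clone_markers = {}
    for m, clones in markers.items():
        for c in clones:
            clone_markers.setdefault(c, set()).add(m)

    # Rule 3: keep a clone only if it shares >= 2 markers with another clone.
    kept_clones = set()
    for a, b in combinations(sorted(clone_markers), 2):
        if len(clone_markers[a] & clone_markers[b]) >= 2:
            kept_clones.update((a, b))

    # Single pass only: markers are not re-checked after clones are removed.
    kept_markers = {m: clones & kept_clones for m, clones in markers.items()}
    return kept_markers, kept_clones


# Toy data with made-up marker and clone names.
example_hits = {
    "m1": {"A", "B"},
    "m2": {"A", "B", "C"},
    "m3": {"B", "C"},
    "m4": {"D"},                       # seen in one clone only
    "rep": {"A", "B", "C", "D", "E"},  # hits too many clones
    "m5": {"C", "E"},
}
kept_markers, kept_clones = filter_map(example_hits, max_clones_per_marker=4)
print(sorted(kept_clones))
print(sorted(kept_markers))
```

A stricter version would repeat the filter until no further markers or clones are removed, since dropping a clone can leave a marker supported by a single clone.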
Shotgun Library Construction, DNA Sequencing, and Sequence Data Analyses
Subclone libraries were constructed for each bacterial clone
containing human genomic DNA, and shotgun sequencing, assembly, and finishing were performed as described (International Human Genome Sequencing Consortium 2001). A single annotated gap remains in the
sequence of RP11-726O12 (AC005517). PCR amplification of template DNA
from the corresponding large-insert genomic clone followed by
sequencing revealed that the gap contains 439 bp with an extremely
high content of GA repeats. The repeat content is probably
responsible for the difficulties encountered in cloning and sequencing
this gap region. The sequence from each BAC/PAC clone was assembled
into a larger sequence contig by use of Sequencher (Gene
Codes). These data were analyzed by the NIX analysis program (Nucleotide Identification of unknown sequences, UK MRC Human
Genome Mapping Project; http://www.hgmp.mrc.ac.uk), a Web-based package
of gene analysis software (including GRAIL, Fex, Hexon, MZEF,
Genemark, Genefinder, FGene, BLAST, Polyah, RepeatMasker and
TRNAscan). Each region that contained a potential gene
was individually analyzed by additional gene prediction and protein
analysis programs, by use of the ExPASy proteomics server (Expert
Protein Analysis System; http://www.expasy.ch). Putative core
promoter and transcription factor-binding sites were analyzed by TESS (http://www.cbil.upenn.edu/tess/index.html),
Human Core-Promoter Finder
(http://sciclio.cshl.org/genefinder/CPROMOTER/human.htm), TSSG, and TSSW (BCM GeneFinder;
http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html). RepeatMasker was independently run to identify
interspersed repeat sequences. A genetic map of chromosome 17 with raw
data from polymorphic genetic markers within this region was obtained from the Marshfield Web site (http://www.marshmed.org/genetics) to
evaluate genetic/physical map correlations (Broman et al. 1998).
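The genetic/physical map correlation reduces to dividing a genetic distance (cM) by the corresponding physical distance (Mb). The sketch below shows that calculation for sex-specific maps; the function name and the interval values are hypothetical placeholders and are not data from this study, whose reported rates are 0.67 cM/Mb in males and 5.5 cM/Mb in females.

```python
def recombination_rate(genetic_cm, physical_bp):
    """Return the recombination rate in cM/Mb for one marker interval."""
    if physical_bp <= 0:
        raise ValueError("physical distance must be positive")
    return genetic_cm / (physical_bp / 1_000_000)


# Hypothetical interval: two markers 2,000,000 bp apart on the physical map,
# separated by 1.0 cM on the male map and 9.0 cM on the female map.
# These numbers are placeholders chosen for illustration only.
physical_bp = 2_000_000
male_rate = recombination_rate(1.0, physical_bp)
female_rate = recombination_rate(9.0, physical_bp)

print(f"male:   {male_rate:.2f} cM/Mb")
print(f"female: {female_rate:.2f} cM/Mb")
print(f"female/male ratio: {female_rate / male_rate:.1f}")
```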
Northern Blotting and RT-PCR Analyses
Expression profiles and the size of each transcript were determined by multiple tissue Northern blotting (Clontech). Primers from the unique 3' untranslated region of each isolated gene were designed by use of the Web-based software Primer3 (http://www-genome.wi.mit.edu/genome_software/other/primer3.html). Corresponding BAC/PAC clones were used as template DNA for PCR to generate probes, to minimize the chance of amplification of gene family members and pseudogenes mapping elsewhere in the genome. RT-PCR was performed for some of the predicted genes by use of first-strand cDNA from various adult and fetal tissues (Clontech).
ACKNOWLEDGMENTS
We thank Yi-Mieng Chang, Thearith Koeuth, and Stephen Ansley (Baylor College of Medicine) for their technical assistance. We also thank Will FitzHugh, George Grant, Rob Nahf, Diane Gilbert, and Boris Pavlin for their technical support of the WIBR mapping activities and all members of the WI/MIT Center for Genome Research Sequencing Group. K.I. and L.T.R. are supported by postdoctoral fellowships from the Charcot-Marie-Tooth Association. This research was supported in part by grants from the National Human Genome Research Institute to E.S.L., the National Eye Institute to N.K. (R01 EY12666), and the National Institute of Neurological Disorders and Stroke (R01 NS27042) and the Muscular Dystrophy Association to J.R.L.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
4 Present address: Department of Biology, University of California San Diego, La Jolla, CA 92093, USA.
5 Corresponding author.
E-MAIL jlupski{at}bcm.tmc.edu; FAX (713) 798-5073.
Article and publication are at www.genome.org/cgi/doi/10.1101/gr.180401.
REFERENCES