Genome Res. 15:1535-1546, 2005 ©2005 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/05 $5.00 Letter
Segmental duplications and gene conversion: Human luteinizing hormone/chorionic gonadotropin
|
| ABSTRACT |
|---|
β-globin), the diversity and linkage disequilibrium (LD) patterns of duplicons and the role of gene conversion in shaping them have been poorly studied. To shed light on these issues, we have re-sequenced the human Luteinizing Hormone/Chorionic Gonadotropin β
(LHB/CGB) cluster (19q13.32) of three population samples (Estonians, Mandenka, and Han). The LHB/CGB cluster consists of seven duplicated genes critical in human reproduction. In the LHB/CGB region, high sequence diversity, concentration of gene-conversion acceptor sites, and strong LD colocalize with peripheral genes, whereas central loci are characterized by lower variation, gene-conversion donor activity, and breakdown of LD between close markers. The data highlight an important role of gene conversion in spreading polymorphisms among duplicon copies and generating LD around them. The directionality of gene-conversion events seems to be determined by the localization of a predicted recombination "hotspot" and "warm spot" in the vicinity of the most active acceptor genes at the periphery of the cluster. The data suggest that enriched crossover activity in direct and inverted segmental repeats is in accordance with the formation of palindromic secondary structures promoting double-strand breaks rather than fixed DNA sequence motifs. Also, this first detailed coverage of sequence diversity and structure of the LHB/CGB gene cluster will pave the way for studying the identified polymorphisms as well as potential genomic rearrangements in association with an individual's reproductive success.
β-globin (Papadakis and Patrinos 1999)
Interestingly, there is a nonrandom distribution of the functions of human segmentally duplicated genes within the proteome (Bailey et al. 2002). Several genes associated with female or male reproduction have been shown to have duplicated during primate evolution and to have evolved under positive Darwinian selection (Wyckoff et al. 2000; Bailey et al. 2002; Nahon 2003). One of the gene families that has evolved in the primate lineage is the gonadotropin hormone β-subunit (GtHB) family, represented in human by seven duplicated Luteinizing Hormone β (LHB)/Chorionic Gonadotropin β (CGB) genes located at 19q13.32 and two single-copy genes, FSHB at 11p13-p14 and TSHB at 1p13.2. Consistent with an essential role in reproduction, all of the few described nonsynonymous mutations lead to either infertility or reduced gonadal function (Themmen and Huhtaniemi 2000). The ancestral LHB gene has duplicated several times during primate evolution, giving rise to a new gene, CGB, differing from LHB both in the time (pregnancy vs. adult lifetime) and tissue (placenta vs. pituitary) of expression as well as mRNA stability (Policastro et al. 1986; Maston and Ruvolo 2002).
In order to study fine-scale sequence variation and LD structure in duplicated regions, we have applied the LHB/CGB gene cluster as a model and re-sequenced population samples from three continents. We explore the following questions: (1) What is the role of gene conversion in shaping the diversity and LD patterns in duplicons? (2) What potentially determines the distribution of crossovers and gene conversion in duplicated regions?
In addition, as an important contribution to human reproductive genetics, this is the first survey of sequence variation in human LHB/CGB genes, essential for successful fertilization and pregnancy. The detailed knowledge of the structure and diversity of the LHB/CGB gene cluster paves the way for studying an individual's general or reduced (e.g., susceptibility to spontaneous abortions) reproductive success in association with identified sequence variants of LHB/CGB genes as well as potential genomic rearrangement patterns (insertions, deletions, duplications, etc.) between homologous regions within the cluster.
| Results |
|---|
χ-site (GCTGGTGG) (Smith 1988
χ-site could have been stimulators of direct and inverted duplications within the cluster (Bailey et al. 2003
|
subunit of hCG hormone (CGB, CGB5, CGB7, and CGB8), there is 97%-99% DNA sequence identity, whereas their identity with the functionally divergent LHB gene is 92%-93% and with the CGB1 and CGB2 genes, 85% (Supplemental Fig. S1). Despite primary DNA sequence homology with the rest of the genes, CGB1 and CGB2 have possibly diverged in function (although a protein is still uncharacterized) through the use of an alternative exon 1 as well as a shifted open reading frame (Fig. 2A; Supplemental Figs. S1, S2). Whereas among the hCG hormone β-subunit-coding genes the protein identity is 98%-100%, and to LHβ 85%, there is no significant amino acid sequence similarity between hCGβ coded by CGB, CGB5, CGB7, CGB8, and LHβ on the one side, and the predicted protein for CGB1 and CGB2 on the other side (Supplemental Fig. S2). In addition to the high DNA sequence homology among LHB/CGB genes, the intergenic regions within the cluster also reveal pairwise sequence similarity from 81% up to 97% (Fig. 1A). The whole genomic region is extremely high in G+C content (~55%), with individual genes exceeding 60% (Table 1).
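The two sequence summaries quoted in this paragraph, pairwise identity and G+C content, can be illustrated with a minimal Python sketch. How gaps and ambiguous bases were treated in the original analysis is not stated here, so skipping gapped columns and counting only A/C/G/T are assumptions, and the function names are hypothetical.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous A/C/G/T bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no A/C/G/T bases")
    return (seq.count("G") + seq.count("C")) / acgt


def pairwise_identity(aligned_a: str, aligned_b: str) -> float:
    """Fraction of identical bases over alignment columns without a gap.

    Assumption: columns with '-' in either sequence are ignored.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return matches / compared
```

Applied to an aligned gene pair, `pairwise_identity` would return a value comparable to the percentages above only if the same gap convention was used.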
|
π = 0.0010 for African Americans and π = 0.0008 for European Americans, LHB/CGB genes exhibit high diversity. Interestingly, there is a clear decrease in diversity levels toward the center of the cluster: the peripheral loci CGB7 (mean π = 0.0055), LHB (0.0038), and CGB (0.0040) being highly diverse compared to the central loci CGB2 (0.0015), CGB1 (0.0012), and CGB5 (0.0017) (Table 1). Positive Tajima D-values (Table 1) point out the excess of high-frequency SNPs for CGB7, CGB, and LHB. In contrast, for CGB1, CGB2, and CGB5, Tajima D-values are mostly negative, and the frequency distributions tend to be skewed toward rare variants. However, as theoretical simulations have shown that statistical tests of neutrality based on the standard coalescent theory for a single-copy gene may not be appropriate for duplicated genes (Innan 2003
|
For LHB and four hCG β-subunit coding genes, only a few nonsynonymous mutations were identified: two signal peptide (LHB) and eight mature peptide variants (two for LHB, one for CGB5, four for CGB7, and one shared by CGB, CGB5, and CGB7) (Supplemental Table S3). Two of the variants overlap with the previously characterized mutations: (1) Ala-3Thr in the signal peptide of LHβ (Jiang et al. 2002) identified with a low frequency (2.2%) in Mandenka; and (2) a worldwide-spread Trp8Arg variant (Estonians 12%, Mandenka 7%, and Han 4%) in LHβ mature protein (Pettersson et al. 1994; Nilsson et al. 1997).
Traces of gene conversion among LHB/CGB genes
When polymorphism data are obtained from duplicate loci, gene conversion between copies is visually detectable if polymorphisms are shared between the loci. We aligned complete nucleotide sequence variants for all possible gene pairs to identify clustered polymorphism motifs potentially generated by gene conversion between duplicate genes (Fig. 2B,C; Table 2). Altogether 27 gene-conversion sites were identified (each might be a target for 1 + n gene-conversion events) with a minimum observed tract of 2-387 bp (mean 57 bp, median 23 bp) and maximum extension up to 796 bp (mean and median for maximum tract length across sites 229 and 138 bp, respectively). The highest number of acceptor sites was identified for CGB (eight sites) and CGB7 (seven sites); fewer converted segments were determined within LHB (two sites) and CGB2 (two sites), CGB1 (three sites) and CGB5 (three sites) (Fig. 2B). For some acceptor sites, there were multiple potential donor genes; for other sites, the donor gene could be unequivocally determined (Table 2; Fig. 2C). Gene conversion events involving the 5'-UTR up to +60 were identified only between CGB, CGB5, and CGB7, all coding the hCG β-subunit (Table 2). In the case of the functionally divergent LHB (specificity defined by exon 3) and CGB1/2 (different ORF), detectable acceptor sites are clustered mostly in the middle of the gene sequence. Of 29 "shared" polymorphic sites or MSVs, several were identified as part of minimum gene-conversion tracts, supporting the hypothesis that they were derived from gene-conversion events rather than being just highly mutable positions (Fig. 2B,C).
|
Estimation of population crossing-over and linkage disequilibrium parameters
Two approaches were used to characterize the decay of linkage disequilibrium between SNPs across the LHB/CGB region. First, we quantified the levels of LD by estimating the population crossing-over parameter ρ/bp = 4Ne·cbp, where Ne is the effective population size and cbp is the crossing-over rate per generation between adjacent nucleotide positions. The parameter ρ is a key determinant of LD patterns, with the strength of LD decreasing when ρ increases. We used two alternative algorithms: (1) the Li and Stephens (2003) "product of approximate conditionals" (PAC) likelihood method based on simultaneous analysis of all loci; and (2) Hudson's (2001) "composite likelihood" (CL) method based on multiplying together the likelihoods for every pair of sites. The first method has the advantage of allowing variation of recombination rate across the region of interest (Li and Stephens 2003). The extension of the second approach has the advantage of allowing simultaneous estimation of ρCL and f, where f is the ratio of gene-conversion to crossing-over events (Frisse et al. 2001). The average recombination rate calculated across the studied region for SNPs with MAF > 10% (Estonians, ρPAC = 4.43 × 10⁻⁴, ρCL = 6.13-7.40 × 10⁻⁴; Han, ρPAC = 2.34 × 10⁻⁴, ρCL = 6.42-7.06 × 10⁻⁴; Mandenka, ρPAC = 8.93 × 10⁻⁴, ρCL = 1.011-1.492 × 10⁻³) falls in the range published for a large set of 74 genes (Table 3A; Crawford et al. 2004). Higher ρ values for Africans are consistent with the idea that African populations maintained a larger long-term effective population size Ne than did non-African ones (Frisse et al. 2001). When the two ρ estimates are compared, ρCL provides generally higher estimates than ρPAC. Also, ρCL based on two-locus sampling distribution seems to be somewhat less influenced by demography (closer estimates for different population samples) and less biased by SNP frequencies (less variation in estimates using all or only common, >10% MAF, SNPs) than multiloci-based ρPAC (Table 3A).

When gene conversion was incorporated into the model, ρCL estimates for the LHB/CGB region decreased, but were independent of the length of assumed conversion tract (L) for any given sample. In contrast, the estimated ratio of gene-conversion to crossing-over rate depended inversely on the conversion tract length: As the length of the tract decreases, the estimated rate of gene conversion increases. Including only common SNPs, the maximum likelihood estimate of f for L = 30 bp ranged from 6 (Han) to 16 (Mandenka), whereas for L = 500 bp, f is 0.5 (Han) to 1.5 (Mandenka). This difference in estimates of f could result from the fact that effects of high gene-conversion rates with small tracts are similar to the effects of lower conversion rates with longer tracts. Recently, Ptak et al. (2004a) reported similar estimates of f for L = 500 (African-Americans, f ≈ 1; CEPH, f ≈ 0.25) obtained from joint analysis of 84 genomic regions. Reports from single sperm analysis seem to support much shorter conversion tract lengths, 54-132 bp for HLA-DPB1 (Zangenberg et al. 1995) and 55-290 bp for DNA3 loci (Jeffreys and May 2004). As the tract lengths identified for LHB/CGB genes are consistent with the single sperm data, we suggest f to range from 2 to 16.
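To make the definition ρ = 4Ne·cbp concrete, the sketch below inverts it to obtain a per-base-pair crossover rate from the ρPAC values quoted above (read as multiples of 10⁻⁴). The effective population size of 10,000 is an illustrative assumption, not a value from this study, and using one Ne for all three samples ignores the larger long-term Ne suggested for African populations.

```python
def crossover_rate_per_bp(rho_per_bp: float, effective_size: float) -> float:
    """Invert rho = 4 * Ne * c to get c, the crossover rate per bp per generation."""
    if effective_size <= 0:
        raise ValueError("effective population size must be positive")
    return rho_per_bp / (4.0 * effective_size)


# Assumption for illustration only: Ne = 10,000 for every sample.
ASSUMED_NE = 10_000

# rho_PAC per bp for common SNPs, as quoted in the text.
rho_pac = {"Estonians": 4.43e-4, "Han": 2.34e-4, "Mandenka": 8.93e-4}

for population, rho in rho_pac.items():
    c = crossover_rate_per_bp(rho, ASSUMED_NE)
    print(f"{population}: c = {c:.3e} per bp per generation")
```

Because only the product Ne·cbp is estimated, any change in the assumed Ne rescales c proportionally.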
|
estimation, we used a descriptive approach relying on summarizing LD by a pairwise summary, r², which measures the correlation between alleles. In order to overcome the sensitivity to allele frequencies, we included only SNPs with MAF > 10%. Consistent with previous studies (for review, see Tishkoff and Verrelli 2003
= 57.1 for Estonians, 11.6 for Han, and 13.6 for Mandenka) (Fig. 3A,B,C). Another, "warm spot" was identified between CGB and CGB2 (
= 2.36 for Estonians, 5.47 for Han, and 4.17 for Mandenka). Consistently, the predicted 8.3-kb hotspot is colocalized with the strongest LD-breakdown region within the cluster (Fig. 3A,B,C). No recombination hotspots were predicted within the LHB/CGB genes, shown above as active in gene conversion.
|
Structural analysis of the potential recombination hotspot
In order to narrow down the potential hotspot region, we resequenced the 8.3-kb region from CGB5 up to CGB7, using an Estonian sample (n = 11) as a model. The sequence diversity parameters as well as the gene-conversion activity of the included CGB8 gene were similar to neighboring CGB5 (Table 1; Fig. 2B). Apparently owing to the small sample size and fewer SNPs, both estimates of recombination rate parameters ρPAC and ρCL exhibit more variation compared to the analysis of the whole LHB/CGB region, depending on SNP allele frequencies as well as the assumed length of the gene-conversion tract (for ρCL) (Table 3B). However, despite both approaches' struggle to provide accurate estimates of ρ, they are consistent in that the average recombination rate of the potential hotspot region is ~10 times higher compared to the rest of the LHB/CGB cluster (Table 3B). The estimates of f (ranging from 1 to 6.5 compared to the ρCL estimate across the 8.3-kb region) suggest that this region also has gene-conversion activity as high as or even higher than the LHB/CGB genes (Table 3B). When pairwise LD patterns were studied, strong associations were detected only between a few scattered loci, except in the region adjacent to CGB7 (Fig. 3D). The potential recombination hotspot was narrowed within a <1-kb region colocalizing with LD breakdown between CGB8 and CGB7, embedded within an Alu-rich (~75% Alu-sequences) segment and 90-100 bp from a recombination-associated χ-sequence (Fig. 3D). The hotspot exhibited 6.9 times higher ρPAC compared to an average of the 8.3-kb region. As the recombination rate between CGB5-CGB7 was estimated to be ~10 times higher compared to the whole LHB/CGB region, the crossing-over rate of the hotspot exceeds ~70 times the background rate in the gene cluster.
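The roughly 70-fold figure follows from multiplying the two relative rates given in this paragraph. A minimal sketch of that arithmetic, with hypothetical variable names and the approximate 10-fold ratio taken at face value:

```python
from math import prod

# Relative rates quoted in the text; the 10-fold ratio is approximate.
hotspot_vs_region = 6.9    # hotspot rho_PAC relative to the 8.3-kb region average
region_vs_cluster = 10.0   # 8.3-kb region relative to the whole LHB/CGB cluster

# Nested fold differences compound multiplicatively.
hotspot_vs_cluster = prod([hotspot_vs_region, region_vs_cluster])
print(f"hotspot vs. cluster background: about {hotspot_vs_cluster:.0f}-fold")
```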
Further sequence analysis revealed inverted Alu repeats (625 bp) that could give rise to stem-loop secondary structure formation (Fig. 1B). The single-stranded loop segment (222 bp), located exactly at the center of the predicted hotspot, could be sensitive to chromatin-altering factors and thus promote double-strand breaks (DSBs) and recombination/gene conversion (Akgün et al. 1997). The stem-loop structure might also disrupt DNA synthesis during replication, generating an unbound 3'-tail of the nascent strand that could invade homologous regions in the center of the cluster (Lobachev et al. 1998). The hypotheses of chromatin-altering factors and stalled replication are both sufficient to explain the direction of gene conversion from the center toward the periphery of the cluster, the invading strand always acting as the recipient during a gene-conversion event (Akgün et al. 1997). A region between CGB and CGB2 predicted as a warm spot in recombination analysis also involves inverted repeats (2094 bp), but with a much longer spacer between them (2788 bp), which might hinder the formation of a stem-loop secondary structure as stable as predicted for the hotspot (Lobachev et al. 1998). A segment homologous to hot and warm spots, located between CGB1 and CGB5, has undergone an inversion resulting in the partial loss of a palindromic DNA fragment preceding an NTF6G' pseudogene (Fig. 1A), thus prohibiting the formation of a proper stem-loop structure.
| Discussion |
|---|
π > 0.002: Examples are ABO and KNG (Crawford et al. 2004
30-40 million years ago (Mya) among Old World primates and shaped by gene conversion between OPN1LW and OPN1MW (Verrelli and Tishkoff 2004
π = 0.00550): for the distal Y-chromosomal direct HERV repeats (π = 0.00544), shaped by directional gene conversion from proximal (π = 0.0016) to distal repeat. In LHB/CGB genes, part of the high diversity is due to multisite variations (MSVs): SNPs located at the same position in several genes or represented also as paralogous sequence variants. Although parallel de novo mutations cannot be ruled out as the source for MSVs, gene conversion between highly homologous LHB/CGB genes is a more likely scenario. As a support to this scenario, we identified MSVs within multiple gene-conversion tracts between gene pairs, detected by alignment of gene variants as well as by computational analysis using the GENECONV algorithm. Directional gene conversion has shaped the diversity patterns of the LHB/CGB region. Central genes of the cluster were characterized with mainly gene donor activity and lower variation, in contrast to highly diverse peripheral genes rich in acceptor sites.
It has been suggested that over short distances, gene conversion, rather than crossing over, is likely to be the dominant force that breaks up associations among sites (Andolfatto and Nordborg 1998; Ardlie et al. 2001; Frisse et al. 2001). Results from the LHB/CGB cluster did not support this hypothesis. A majority of gene-conversion recipient sites colocalized with high-LD regions in the peripheral loci of the region, whereas the middle of the cluster was characterized by LD breakdown and lower gene-conversion acceptor activity.
These observations are consistent with theoretical simulations showing that relative to a single-gene model, polymorphism may be elevated and positive LD created at duplicated genes due to gene conversion (Innan 2002, 2003).
Functional consequences of gene conversion in LHB/CGB genes
For duplicated genes, gene conversion has been shown to be an essential source for spreading disease mutations. For example, Boocock et al. (2003) showed that in the case of Shwachman-Diamond syndrome (SDS), 85% of patients carried a mutation in the SBDS gene originating from a neighboring pseudogene copy, SBDSP. To what extent has gene conversion shaped variation of coding sequences for LHB/CGB genes? The worldwide-spread Trp8Arg variant (Nilsson et al. 1997) as well as the neighboring His10Arg change (Supplemental Table S3) in exon 2 of LHB most probably originate from one of the hCG β-subunit-coding genes having in these positions conserved arginines. Another example is a widely spread variant of CGB7 (41.5% in Estonians, 19.6% in Mandenka, and 28% in Han), consisting of three polymorphisms in exon 2 (+417, +422, and +423) and coding two associated amino acid changes: Arg2Lys and Met4Pro (Supplemental Table S3). Arg2Met4 combination is unique to the CGB7 gene, whereas CGB, CGB5, and CGB8 carry the Lys2Pro4 variant. As this segment in the CGB7 gene has also been found to be within a gene-conversion acceptor site (Fig. 2B; Table 2), we interpret that the Lys2Pro4 variant in CGB7 originates from another CGB gene. Possible sources for novel variants in LHB and CGB, CGB5, CGB7, and CGB8 are CGB1/CGB2, which have a 1-bp-shifted ORF and divergence in 5'- and 3'-UTRs. Thus, a neutral polymorphism of CGB1/CGB2 could cause an amino acid change in a duplicate gene if spread by gene conversion. As an example, we have identified a rare Asp117Ala change in exon 3 of CGB and CGB5, potentially originating from a common variant in the 3'-UTR region of CGB2 (SNP at position +1087) (Supplemental Table S2).
Variation patterns in LHB/CGB region and population demography
Current patterns of human genetic variation reflect not only crossover history, but also past demographic processes as well as possible selective pressures on studied genes. In the LHB/CGB region, higher diversity and number of SNPs, shorter range of LD and less pronounced LD structure, as well as higher estimated recombination rates (measured by ρ = 4Ne·cbp) were detected for the Mandenka compared to the Estonian and the Han samples. These differences are likely to be explained by the distinct demographic histories of African and non-African populations: The former are older and have maintained larger Ne, and the latter have experienced a bottleneck event during the expansion of modern humans out of Africa within the past 100,000 years (for review, see Tishkoff and Verrelli 2003). As a result of the bottleneck, non-African populations represent only a subset of African diversity and exhibit longer LD created during the founding event and maintained in rapidly expanding populations. It is noteworthy that despite high diversity as well as differences in variation and LD levels across the genes and populations, each LHB/CGB gene has only two to four major haplotypes in a population (Table 1). The joint bottleneck in the history of non-African populations is supported by mostly shared common gene variants between Estonians and Han, whereas Mandenka have several population-specific high-frequency haplotypes (data not shown). This observation has an important implication for LD-based mapping, suggesting that core variants in highly diverse regions may also be "tagged" by a few SNPs when an appropriate marker density is chosen.
Although putative hotspots and warm spots of recombination were predicted within the same regions in all studied samples, the estimated crossover intensity at these spots differs severalfold among populations (Fig. 3). If the background recombination rate (ρPAC) is taken into account (Table 3A), the hotspot crossover intensity for the Estonians is approximately twice the estimate for the Mandenka and 10-fold that for the Han. In contrast, fourfold higher activity is predicted for the warm spot in the Mandenka compared to the other populations. This is consistent with a suggestion that local recombination rates can vary among human populations because of differences in their allele frequencies or in historical factors affecting Ne in local regions of the genome (Jeffreys and Neumann 2002; Crawford et al. 2004; Ptak et al. 2004b; Evans and Cardon 2005).
Recombination and gene-conversion activity are potentially associated with palindrome sequences
In yeast, recombination activity has been associated with high G+C content, nuclease-sensitive chromatin, and transcription factor binding sites. Although no sequence motifs are known to predict recombination hotspots in humans, putative crossover-initiating motifs have been identified in other species (Petes 2001; De Massy 2003). The LHB/CGB gene cluster has all the properties described for a recombination-active region: an extremely high G+C content, Alu-richness, and the presence of several χ-sequence motifs, associated with crossover activity in several species. Despite high gene-conversion activity among the LHB/CGB genes, we estimated only one recombination hotspot within an intergenic region. Apparently, gene conversion between duplicons in the human genome also occurs without crossovers, consistent with the synthesis-dependent strand-annealing (SDSA) pathway described for yeast (Allers and Lichten 2001). What could be the determinants of the estimated potential recombination hotspot within the LHB/CGB cluster? It is probably not defined by primary DNA sequence, as there are two other highly homologous segments (Fig. 1A) located within the cluster. It has been suggested that double-stranded breaks, which are prerequisites for crossover initiation, are stimulated by the formation of palindromic secondary structures (Krawinkel et al. 1986; Akgün et al. 1997; Lobachev et al. 1998). Indeed, a stable stem-loop is formed around the center of the predicted hotspot (Fig. 1B), which is not the case for the two homologous regions because of minor rearrangements of these DNA segments. The hypothesis of palindrome sequences stimulating DSBs and crossover activity is supported by direct sperm analysis: a structurally similar recombination hotspot, bordered by inverted Alu-motifs and characterized by crossover asymmetry, has been identified for the MHC hotspot DNA2 (Jeffreys and Neumann 2002). We suggest that a high recombination rate and low LD, but also high gene-conversion activity in segmental duplications, could be favored by secondary structures formed by palindrome sequences. The abundance of direct and inverted repeats common in segmental duplications provides the basis for DNA secondary structure formations, initiating DSBs.
| Methods |
|---|
Population samples
The study has been approved by the Ethics Committee of Human Research of the University Clinic of Tartu, Estonia (permission no. 117/9, 16.06.03). For re-sequencing of six genes (LHB, CGB, CGB1, CGB2, CGB5, and CGB7), in total 95 DNA samples from three continents were used: 47 Estonian (Europe), 23 Mandenka (Africa), and 25 Chinese Han (Asia) individuals. The Estonian sample represents a typical European population (Dawson et al. 2002). Mandenka and Han samples were obtained from the HGDP-CEPH Human Genome Diversity Cell Line Panel (http://www.cephb.fr/HGDP-CEPH-Panel/; Cann et al. 2002). The detailed analysis of the predicted recombination hotspot was conducted using a sample of 11 Estonian individuals.
Gene-specific and long-range PCR
A total of 12 PCR primers for the LHB, CGB, CGB1, CGB2, CGB5, and CGB7 genes were designed based on the human chorionic gonadotropin β region sequence (NCBI RefSeq NG_000019) using the Web-based version of the Primer3 software (http://frodo.wi.mit.edu/cgi-bin/primer3/primer3_www.cgi). While designing PCR primers aiming to result in gene-specific amplification products, we relied on the detailed structure of the LHB/CGB region (Fig. 1). The uniqueness of all PCR primers was checked using BLAST, and only primer pairs with at least one primer being unique in the human genome were regarded as suitable for amplification. The six genes were amplified to cover the entire coding sequence and part of flanking regions; amplified fragments were 1599-2364 bp long. Specificity of the PCR products was controlled in three steps: (1) design of unique primer pairs capable of amplifying only one of the duplicated genes; (2) verification of monomorphic status for gene-specific positions used as markers for each individual gene (Supplemental Fig. S1); (3) test for Hardy-Weinberg Equilibrium for each identified SNP.
Amplification of genomic DNA (100 ng) was performed using the Long PCR Enzyme Mix (MBI Fermentas) by the standard protocol recommended by the manufacturer. Amplifications were performed in a PTC-200 thermal cycler (MJ Research). The reactions were initiated with a denaturation at 95°C for 5 min, followed by 10 cycles of denaturation at 95°C for 20 sec, annealing at 68°C for 30 sec (decrease of temperature 1°C per cycle), elongation at 68°C for 2 min; 10 cycles of 95°C (20 sec), 56°C (30 sec), 68°C (2 min); 10 cycles of 95°C (20 sec), 54°C (30 sec), 68°C (2 min); 10 cycles of 95°C (20 sec), 51°C (30 sec), and 68°C (2 min). A final extension step was performed at 68°C for 10 min.
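The gene-specific cycling program described above can be written out cycle by cycle, which makes the touchdown phase explicit. This is a sketch under one assumption: the first touchdown cycle anneals at 68°C and each following cycle anneals 1°C lower, so the tenth anneals at 59°C. The initial denaturation and final extension are left out, and the function name is hypothetical.

```python
from typing import List, Tuple

Step = Tuple[int, int]            # (temperature in deg C, duration in seconds)
Cycle = Tuple[Step, Step, Step]   # (denaturation, annealing, elongation)


def gene_specific_pcr_cycles() -> List[Cycle]:
    """Expand the 40 amplification cycles of the gene-specific protocol."""
    cycles: List[Cycle] = []
    # Touchdown phase: 10 cycles, annealing temperature drops 1 deg C per cycle.
    # Assumption: the first cycle anneals at 68 deg C.
    for i in range(10):
        cycles.append(((95, 20), (68 - i, 30), (68, 120)))
    # Three fixed-temperature phases of 10 cycles each.
    for anneal_temp in (56, 54, 51):
        for _ in range(10):
            cycles.append(((95, 20), (anneal_temp, 30), (68, 120)))
    return cycles


cycles = gene_specific_pcr_cycles()
# Hold time summed over all steps of all cycles (ramping time not included).
hold_seconds = sum(duration for cycle in cycles for _, duration in cycle)
print(len(cycles), "cycles,", hold_seconds, "s of programmed hold time")
```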
The potential hotspot region between CGB5 and CGB7 was amplified in two stages. First, a long-range PCR was conducted that yielded a product of 8.3 kb. Second, six inner fragments (1193-1675 bp) were reamplified by nested PCR. Amplifications of 100 ng of genomic DNA (Long PCR Enzyme Mix; MBI Fermentas) were performed in a GeneAmp PCR System 2700 thermal cycler (Applied Biosystems). The reactions were initiated with a denaturation at 94°C for 5 min, followed by four cycles of denaturation at 94°C for 20 sec, annealing at 68°C for 30 sec (decrease of temperature 1°C per cycle), elongation at 68°C for 8 min, 11 cycles at 94°C (20 sec), 64°C (30 sec), 68°C (8 min); 25 cycles at 94°C (20 sec), 64°C (30 sec), and 68°C (8 min + 5 sec per cycle). A final extension step was performed at 68°C for 10 min. All primer sequences are available upon request.
Re-sequencing
To remove unincorporated PCR primers and mononucleotides, PCR products were treated with exonuclease I (1 U; MBI Fermentas) and shrimp alkaline phosphatase (1.5 U; USB) and incubated in a GeneAmp PCR System 2700 thermal cycler (Applied Biosystems) at 37°C for 20 min followed by enzyme inactivation at 80°C for 15 min. Purified PCR product (1.5-3 µL) served as a template in sequencing reactions (10 µL) with sequencing primer (2 pmol) and DYEnamic ET Terminator Cycle Sequencing Kit reagent premix (Amersham Biosciences Inc.) as recommended by the supplier. LHB, CGB, CGB1, CGB2, CGB5, and CGB7 genes were sequenced from both strands using six different sequencing primers. Altogether 20 sequencing primers for LHB, CGB, CGB1, CGB2, CGB5, and CGB7 genes (a set of six primers for every gene) and 36 primers for 8.3-kb hotspot region (six for each nested PCR product) were designed as described above for resequencing of both strands. Sequencing reactions (1.5 µL) were run on an ABI 377 Prism automated DNA sequencer (Applied Biosystems) using ReproGel 377 gels (Amersham Biosciences Inc.).
For each gene and each population, the sequence data were assembled into a contig using phred and phrap software (Ewing et al. 1998), and the contig was edited in the consed package (Gordon et al. 1998) to ensure that the assembly was accurate (http://www.phrap.org/phredphrapconsed.html). Polymorphisms were identified using the polyphred program (Version 4.2) (Nickerson et al. 1997) and confirmed by manual checking. A genetic variant was verified only if it was observed in both the forward and the reverse orientations. Allele frequencies were estimated and conformance with HWE was computed by an exact test (α = 0.05) using the Haploview program (http://www.broad.mit.edu/mpg/haploview/index.php; Barrett et al. 2005). In total, six rare SNPs for Mandenka or Han were found to be deviating from HWE, apparently because of small sample size.
Statistical analysis
Sequence diversity parameters were calculated by DnaSP software (Version 4.0) (http://www.ub.es/dnasp/; Rozas and Rozas 1999). The direct estimate of per-site heterozygosity (π) was derived from the average pairwise sequence difference, and Watterson's θ (Watterson 1975) represents an estimate of the expected per-site heterozygosity based on the number of segregating sites (S). Tajima's D (DT) statistic (Tajima 1989) was calculated to determine if the observed patterns of diversity in the three studied population samples are consistent with the standard neutral model. Significant positive DT values may indicate an excess of intermediate-frequency SNPs consistent with balancing selection as well as population bottlenecks or subdivision, whereas significant negative DT values indicate an excess of low-frequency SNPs consistent with recent directional selection or population expansion. Haplotypes were inferred from unphased genotype data using the Bayesian statistical method in the program PHASE 2.1 (http://www.stat.washington.edu/stephens/; Stephens et al. 2001). For haplotype reconstruction, the model allowing recombination was used. Running parameters for PHASE are described below.
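The three diversity statistics named above can be computed from a set of aligned haplotypes as in the following sketch. It uses the standard formulas for π, Watterson's θ, and Tajima's D, and assumes complete data (no gaps or missing bases) with every variable column counted as one segregating site; DnaSP's handling of such cases may differ. The function name is hypothetical.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List, Optional


def diversity_summary(haplotypes: List[str]) -> Dict[str, Optional[float]]:
    """Per-site pi, per-site Watterson's theta, S, and Tajima's D."""
    n = len(haplotypes)
    length = len(haplotypes[0]) if haplotypes else 0
    if n < 2 or length == 0 or any(len(h) != length for h in haplotypes):
        raise ValueError("need at least two aligned haplotypes of equal length")

    segregating = 0
    differing_pairs = 0.0
    for column in zip(*haplotypes):
        counts = Counter(column)
        if len(counts) > 1:
            segregating += 1
            # Number of haplotype pairs that differ at this column.
            differing_pairs += (n * n - sum(c * c for c in counts.values())) / 2
    k = differing_pairs / (n * (n - 1) / 2)   # mean pairwise differences

    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    variance = e1 * segregating + e2 * segregating * (segregating - 1)
    tajima_d = None  # undefined without segregating sites or with zero variance
    if segregating > 0 and variance > 0:
        tajima_d = (k - segregating / a1) / sqrt(variance)

    return {
        "pi": k / length,
        "theta_w": segregating / (a1 * length),
        "S": float(segregating),
        "tajima_d": tajima_d,
    }
```

A positive D arises when π exceeds θ, that is, when variants sit at intermediate frequencies; an excess of rare variants pushes D below zero.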
Detection of gene-conversion events
Gene sequence variants derived from estimated haplotypes were used for gene-conversion analysis. For manual detection of gene-conversion sites between a pair of LHB/CGB genes, the derived complete sequence variants were aligned using Web-based ClustalW. A minimum gene-conversion site was defined as a region within an acceptor gene with ≥2 associated, motif-forming polymorphisms for which a potential donor gene could be defined. The maximum possible gene-conversion tract covers the identical sequence between two compared genes on both sides of the minimum gene-conversion tract. Alternatively, the aligned sequences of all possible gene pairs were analyzed for evidence of gene conversion using Stanley Sawyer's gene-conversion detection method as implemented in his GENECONV program (Version 1.81) (http://www.math.wustl.edu/~sawyer/geneconv/; Sawyer 1989). Sawyer's algorithm tests whether pairs of sequences share unusually long stretches of similarity given their overall similarity. The GENECONV program computes global and pairwise p-values and allows mismatches within converted regions. Global and pairwise p-values are calculated using two methods. The first method is based on 10,000 permutations of the original data, and the second is based on a method similar to that used by the BLAST database-searching algorithm. Here, we used only p-values from permutations (simulations) because they are more conservative and accurate. We also considered only global fragments (p < 0.05), because their p-values are corrected for multiple comparisons, whereas the p-values of pairwise fragments are not. Alignments were analyzed using the most stringent "g0" parameter, meaning that mismatches within fragments are not allowed.
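The relation between the minimum and the maximum possible gene-conversion tract can be sketched as follows. This is only an illustration of the manual definition given above, not part of GENECONV; the function name `max_conversion_tract` and the 0-based inclusive coordinates are assumptions, and the two sequences are assumed to be already aligned to equal length.

```python
def max_conversion_tract(acceptor, donor, min_start, min_end):
    """Extend a minimum tract outward while acceptor and donor stay identical.

    min_start and min_end are 0-based, inclusive alignment coordinates of the
    minimum tract (the span of the motif-forming polymorphisms).
    Returns (start, end) of the maximum possible tract, also inclusive.
    """
    if len(acceptor) != len(donor):
        raise ValueError("sequences must be aligned to equal length")
    if acceptor[min_start:min_end + 1] != donor[min_start:min_end + 1]:
        raise ValueError("minimum tract must be identical in both sequences")

    start, end = min_start, min_end
    # Walk left until the first mismatch or the alignment start.
    while start > 0 and acceptor[start - 1] == donor[start - 1]:
        start -= 1
    # Walk right until the first mismatch or the alignment end.
    while end < len(acceptor) - 1 and acceptor[end + 1] == donor[end + 1]:
        end += 1
    return start, end
```

Because extension stops at the first mismatch on either side, the returned span contains no mismatches, which is the same stringency as the "g0" setting described above.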
Measures of linkage disequilibrium
The descriptive statistic of linkage disequilibrium (LD), r² (Hill and Robertson 1968), was calculated for pairs of markers and summarized using Haploview software (Barrett et al. 2005). Reliable LD patterns were obtained by including only common SNPs with minor allele frequency (MAF) >10%. To locate gene-conversion acceptor sites on the LD landscape of the LHB/CGB cluster, we calculated pairwise LD for all identified SNPs, because several converted SNPs have low allele frequencies.
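As an illustration of the pairwise statistic, the sketch below computes r² for biallelic sites from phased haplotypes and applies the MAF > 10% filter. It is not Haploview's implementation; the function names and the string representation of haplotypes are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations


def minor_allele_frequency(haplotypes, i):
    """Frequency of the second most common allele at site i (0.0 if monomorphic)."""
    counts = sorted(Counter(h[i] for h in haplotypes).values(), reverse=True)
    return counts[1] / len(haplotypes) if len(counts) > 1 else 0.0


def r_squared(haplotypes, i, j):
    """r^2 between biallelic sites i and j of phased haplotype strings."""
    n = len(haplotypes)
    a, b = haplotypes[0][i], haplotypes[0][j]  # arbitrary reference alleles
    p_a = sum(h[i] == a for h in haplotypes) / n
    p_b = sum(h[j] == b for h in haplotypes) / n
    p_ab = sum(h[i] == a and h[j] == b for h in haplotypes) / n
    d = p_ab - p_a * p_b  # coefficient of disequilibrium D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom > 0 else float("nan")


def pairwise_ld(haplotypes, maf_min=0.10):
    """r^2 for every pair of sites whose MAF exceeds maf_min."""
    n_sites = len(haplotypes[0])
    common = [i for i in range(n_sites)
              if minor_allele_frequency(haplotypes, i) > maf_min]
    return {(i, j): r_squared(haplotypes, i, j)
            for i, j in combinations(common, 2)}
```

Calling `pairwise_ld(haplotypes, maf_min=0.0)` keeps every polymorphic site, which corresponds to the all-SNP analysis used to place low-frequency converted SNPs on the LD landscape.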
Another way to quantify levels of LD is to estimate the population crossing-over parameter ρ = 4N_e c_bp, where N_e is the effective population size and c_bp is the crossing-over rate per base pair per generation. We estimated ρ using two alternative algorithms. The Li and Stephens (2003) method is based on a "Product of Approximate Conditionals" (PAC) model that considers all loci simultaneously and allows the recombination rate to vary across the region of interest, thus permitting estimation of putative recombination hotspots. The average background recombination rate (ρPAC) and the factor (λ) by which the recombination rate between loci exceeds the average background rate were estimated from unphased genotype data using the PHASE 2.1 software (http://www.stat.washington.edu/stephens/; Stephens et al. 2001; Li and Stephens 2003). Within this model, a λ value of 1 corresponds to an absence of recombination rate variation, while values of λ > 1 indicate increased crossover activity. A value of 1 < λ < 10 is considered a recombination "warm spot," and a value of λ > 10 is considered a recombination "hotspot" (Crawford et al. 2004). For hotspot estimation, only common SNPs (MAF > 10%) were included in the analysis. The running parameters were number of iterations = 1000, thinning interval = 1, and burn-in = 100; to increase the number of iterations of the final run of the algorithm, the -X10 parameter was used, making the final run 10× longer than the other runs. To relax the assumption of a stepwise mutation mechanism, which is inappropriate for triallelic SNPs, the -d1 option was used. For each sample set, we ran the algorithm 10 times, resulting in identical outputs of the parallel analysis; thus we used the median of the values obtained from one of the runs.
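The definition of the population crossing-over parameter and the thresholds used to label intervals can be written out as a small sketch. The function names are hypothetical, the example values of the effective population size and crossover rate are illustrative only, and the treatment of the boundary value of exactly 10 (not specified in the text) is an assumption.

```python
def population_rho(effective_size, crossover_rate_per_bp):
    """Population crossing-over parameter: 4 * N_e * c_bp (per base pair)."""
    return 4 * effective_size * crossover_rate_per_bp


def classify_interval(relative_rate):
    """Label an interval by the factor by which its recombination rate
    exceeds the background rate (thresholds as in Crawford et al. 2004)."""
    if relative_rate > 10:
        return "hotspot"
    if relative_rate > 1:
        # Assumption: a value of exactly 10 is grouped with warm spots.
        return "warm spot"
    return "background"


# Illustrative values only, not estimates from this study.
example_rho = population_rho(10_000, 1e-8)  # 4 * 10,000 * 1e-8 per bp
example_labels = [classify_interval(x) for x in (1.0, 4.2, 37.0)]
```

Here `example_labels` evaluates to `["background", "warm spot", "hotspot"]`.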
Alternatively, we used the "composite likelihood" (CL) method of Hudson (2001) to estimate simultaneously the population recombination parameter ρCL and f, where f is the ratio of gene-conversion to crossing-over events (Frisse et al. 2001). Hudson's method multiplies together the likelihoods for every pair of genotyped sites; these pairwise likelihoods are computed via simulation, assuming an "infinite-sites" model. The method assumes that gene conversion and crossing-over are alternative resolutions of a Holliday junction and that the conversion-tract length is geometrically distributed with mean length L. We obtained maximum likelihood estimates for ρCL and f from unphased data using MAXDIP (http://genapps.uchicago.edu/maxdip/index.html) with the following running parameters: starting value of ρ = 0.0002; f ranging from 0 to 30 in intervals of 0.5. The analysis was run for gene-conversion-tract lengths L = 30, 50, 100, 250, and 500. The choice of the L values was based on reports from human single-sperm analysis (Zangenberg et al. 1995; Jeffreys and May 2004) and on the lengths of gene-conversion tracts identified for LHB/CGB genes.
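The parameter grid and the tract-length assumption can be made concrete with a short sketch. It does not reproduce MAXDIP or the composite-likelihood calculation; it only enumerates the f values and L values stated above and shows a geometric distribution with mean L. The parameterization on lengths 1, 2, 3, ... bp with success probability 1/L is an assumption, as are all names.

```python
# Grid of f values (gene conversion : crossing-over ratio): 0, 0.5, ..., 30.
F_GRID = [0.5 * i for i in range(61)]
# Mean gene-conversion-tract lengths L (bp) used for the separate runs.
TRACT_LENGTHS = [30, 50, 100, 250, 500]


def tract_length_pmf(length_bp, mean_length):
    """P(tract length = length_bp) for a geometric distribution with the
    given mean, defined on lengths 1, 2, 3, ... bp."""
    if length_bp < 1:
        return 0.0
    p = 1.0 / mean_length
    return p * (1.0 - p) ** (length_bp - 1)


def prob_tract_at_least(length_bp, mean_length):
    """P(tract length >= length_bp) under the same distribution."""
    if length_bp <= 1:
        return 1.0
    return (1.0 - 1.0 / mean_length) ** (length_bp - 1)
```

Under this distribution most tracts are shorter than the mean: for L = 100, about 37% of tracts reach 100 bp or more, so a larger L mainly lengthens the tail of long tracts.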
| Acknowledgements |
|---|
| Footnotes |
|---|
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.4270505. Freely available online through the Genome Research Immediate Open Access option.
1 Corresponding author.
E-mail maris{at}ebc.ee; fax +372-7420286.
| REFERENCES |
|---|
Akgün, E., Zahn, J., Baumes, S., Brown, G., Liang, F., Romanienko, P.J., Lewis, S., and Jasin, M. 1997. Palindrome resolution and recombination in the mammalian germ line. Mol. Cell. Biol. 17: 5559-5570.
Allers, T. and Lichten, M. 2001. Differential timing and control of noncrossover and crossover recombination during meiosis. Cell 106: 47-57.
Andolfatto, P. and Nordborg, M. 1998. The effect of gene conversion on intralocus association. Genetics 148: 1397-1399.
Ardlie, K., Liu-Cordero, S.N., Eberle, M.A., Daly, M., Barrett, J., Winchester, E., Lander, E.S., and Kruglyak, L. 2001. Lower-than-expected linkage disequilibrium between tightly linked markers in humans suggests a role for gene conversion. Am. J. Hum. Genet. 69: 582-589.
Bailey, J.A., Gu, Z., Clark, R.A., Reinert, K., Samonte, R.V., Schwartz, S., Adams, M.D., Myers, E.W., Li, P.W., and Eichler, E.E. 2002. Recent segmental duplications in the human genome. Science 297: 1003-1007.
Bailey, J.A., Liu, G., and Eichler, E.E. 2003. An Alu transposition model for the origin and expansion of human segmental duplications. Am. J. Hum. Genet. 73: 823-834.
Barrett, J.C., Fry, B., Maller, J., and Daly, M.J. 2005. Haploview: Analysis and visualization of LD and haplotype maps. Bioinformatics 21: 263-265.
Bettencourt, B.R. and Feder, M.E. 2002. Rapid concerted evolution via gene conversion at the Drosophila hsp70 genes. J. Mol. Evol. 54: 569-586.
Boocock, G.R., Morrison, J.A., Popovic, M., Richards, N., Ellis, L., Durie, P.R., and Rommens, J.M. 2003. Mutations in SBDS are associated with Shwachman-Diamond syndrome. Nat. Genet. 33: 97-101.
Bosch, E., Hurles, M.E., Navarro, A., and Jobling, M.A. 2004. Dynamics of a human interparalog gene conversion hotspot. Genome Res. 14: 835-844.
Cann, H.M., de Toma, C., Cazes, L., Legrand, M.F., Morel, V., Piouffre, L., Bodmer, J., Bodmer, W.F., Bonne-Tamir, B., Cambon-Thomsen, A., et al. 2002. Human genome diversity cell line panel. Science 296: 261-262.
Crawford, D.C., Bhangale, T., Li, N., Hellenthal, G., Rieder, M.J., Nickerson, D.A., and Stephens, M. 2004. Evidence for substantial fine-scale variation in recombination rates across the human genome. Nat. Genet. 36: 700-706.
Dawson, E., Abecasis, G.R., Bumpstead, S., Chen, Y., Hunt, S., Beare, D.M., Pabial, J., Dibling, T., Tinsley, E., Kirby, S., et al. 2002. A first-generation linkage disequilibrium map of human chromosome 22. Nature 418: 544-548.
De Massy, B. 2003. Distribution of meiotic recombination sites. Trends Genet. 19: 514-522.
Estivill, X., Cheung, J., Pujana, M.A., Nakabayashi, K., Scherer, S.W., and Tsui, L.-C. 2002. Chromosomal regions containing high-density and ambiguously mapped putative single nucleotide polymorphisms (SNPs) correlate with segmental duplications in the human genome. Hum. Mol. Genet. 11: 1987-1995.
Evans, D.M. and Cardon, L.R. 2005. A comparison of linkage disequilibrium patterns and estimated population recombination rates across multiple populations. Am. J. Hum. Genet. 76: 681-687.
Ewing, B., Hillier, L., Wendl, M., and Green, P. 1998. Basecalling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8: 175-185.
Fredman, D., White, S.J., Potter, S., Eichler, E.E., Den Dunnen, J.T., and Brookes, A.J. 2004. Complex SNP-related sequence variation in segmental genome duplications. Nat. Genet. 36: 861-866.
Frisse, L., Hudson, R.R., Bartoszewicz, A., Wall, J.D., Donfack, J., and Di Rienzo, A. 2001. Gene conversion and different population histories may explain the contrast between polymorphism and linkage disequilibrium levels. Am. J. Hum. Genet. 69: 831-843.
Gordon, D., Abajian, C., and Green, P. 1998. Consed: A graphical tool for sequence finishing. Genome Res. 8: 195-202.
Hill, W.G. and Robertson, A. 1968. The effects of inbreeding at loci with heterozygote advantage. Genetics 60: 615-628.
Horton, R., Wilming, L., Rand, V., Lovering, R.C., Bruford, E.A., Khodiyar, V.K., Lush, M.J., Povey, S., Talbot Jr., C.C., Wright, M.W., et al. 2004. Gene map of the extended human MHC. Nat. Rev. Genet. 5: 889-899.
Hudson, R.R. 2001. Two-locus sampling distributions and their application. Genetics 159: 1805-1817.
Hurles, M.E. 2001. Gene conversion homogenizes the CMT1A paralogous repeats. BMC Genomics 2: 11.
Iafrate, A.J., Feuk, L., Rivera, M.N., Listewnik, M.L., Donahoe, P.K., Qi, Y., Scherer, S.W., and Lee, C. 2004. Detection of large-scale variation in the human genome. Nat. Genet. 36: 949-951.
Innan, H. 2002. A method for estimating the mutation, gene conversion and recombination parameters in small multigene families. Genetics 161: 865-872.
Innan, H. 2003. The coalescent and infinite-site model of a small multigene family. Genetics 163: 803-810.
Jeffreys, A.J. and May, C.A. 2004. Intense and highly localized gene conversion activity in human meiotic crossover hot spots. Nat. Genet. 36: 151-156.
Jeffreys, A.J. and Neumann, R. 2002. Reciprocal crossover asymmetry and meiotic drive in a human recombination hot spot. Nat. Genet. 31: 267-271.
Jiang, M., Lamminen, T., Pakarinen, P., Hellman, J., Manna, P., Herrerra, R.J., and Huhtaniemi, I. 2002. A novel Ala3Thr