Genome Res. 15:241-249, 2005
©2005 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/05 $5.00
Letter
A high-resolution multistrain haplotype analysis of laboratory mouse genome reveals three distinctive genetic variation patterns
Laboratory of Population Genetics, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892-8302, USA
Understanding of the structure and the origin of genetic variation patterns in the laboratory inbred mouse provides insight into the utility of the mouse model for studying human complex diseases and strategies for disease gene mapping. In order to address this issue, we have constructed a multistrain, high-resolution haplotype map for the 99-Mb mouse Chromosome 16 using 70,000 single nucleotide polymorphism (SNP) markers derived from whole-genome shotgun sequencing of five laboratory inbred strains. We discovered that large polymorphic blocks (i.e., regions where only two haplotypes, thus one SNP conformation, are found in the five strains), large monomorphic blocks (i.e., regions where the five strains share the same haplotype), and fragmented blocks (i.e., regions of greater complexity not resembling at all the first two categories) span 50%, 18%, and 32% of the chromosome, respectively. The haplotype map has 98% accuracy in predicting mouse genotypes in two other studies. Its predictions are also confirmed by experimental results obtained from resequencing of 40-kb genomic sequences at 21 distinct genomic loci in 13 laboratory inbred strains and 12 wild-derived strains. We demonstrate that historic recombination, intra-subspecies variations and inter-subspecies variations have all contributed to the formation of the three distinctive genetic signatures. The results suggest that the controlled complexity of the laboratory inbred strains may provide a means for uncovering the biological factors that have shaped genetic variation patterns.
The laboratory inbred mouse is the primary mammalian model organism for human disease research owing in part to its utility in identifying genetic components underlying complex common diseases in humans. However, identifying candidate genes by conventional mapping techniques after initial low-resolution mapping is a labor-intensive and time-consuming process. An accurate high-resolution multistrain haplotype map of the mouse genome may accelerate the discovery of causative variants in the initially large candidate regions. By making possible the comparison of haplotype structure across different inbred strains used in the crosses, researchers will be able to identify haplotype blocks that segregate in concert with phenotypic differences, thereby reducing the number of potential candidates for further analysis from hundreds to a more manageable number (Grupe et al. 2001
Two previous studies, one by the Whitehead Institute (WI) (Wade et al. 2002
To gain insight into the high-resolution haplotype structure of the common laboratory inbred mouse at the genome scale, we analyzed genetic variation patterns of 70,795 SNPs discovered from high-quality variations among the five laboratory strains in the Celera whole-genomic shotgun sequence of the 99-Mb mouse Chromosome 16 (Mural et al. 2002
Construction of a multistrain haplotype map of mouse Chromosome 16
The Celera data set consists of high-density SNPs from five laboratory inbred strains, from which SNP genotypes can readily be assembled into haplotypes across the entire chromosome. By comparing haplotypes from multiple strains, it is possible to identify haplotype blocks by patterns of linkage disequilibrium between the SNPs, a procedure that has been used to define haplotype blocks for the human genome (Patil et al. 2001
A close look at the haplotype structure reveals that, in some regions, adjacent haplotype blocks with the same allelic variation pattern are interrupted by small haplotype blocks (Fig. 1). Some of these small blocks make only a minor contribution to the local genetic diversity but can disrupt the contiguity of the global haplotype structure. Therefore, we implemented a procedure to meld neighboring blocks with the same haplotype variation pattern if the number of inconsistent SNPs (i.e., the SNPs whose allelic variation patterns are inconsistent with the haplotype variation pattern) is <5% of the total SNPs in the melded block. In the example in Figure 1, we were able to meld the two haplotype blocks into a single 2.4-Mb block because only three out of 352 SNPs in this region are inconsistent with the dominant haplotype variation pattern. In all, 301 blocks were merged into 36 "melded" blocks. The average haplotype block size after melding is 44.6 kb.
Three major patterns of genetic variation emerged from the global haplotype structure. Large blocks with few polymorphisms ("monomorphic blocks"), defined as large regions (>1 Mb) with extremely low SNP density (<0.5 SNPs per 10 kb) and inconsistent haplotype variation patterns (each haplotype variation pattern spans <10 SNPs), cover 18% of the chromosome. The remaining regions consist of large polymorphic haplotype blocks ( 200 kb) that span 50% of the chromosome (Fig. 1) and fragmented haplotype blocks that span 32% of the chromosome and include 42% of SNPs (Fig. 2). The fragmented blocks take four forms: (1) erosion of a major haplotype pattern by a variety of small haplotype blocks (Fig. 2A); (2) segmentation of two or three haplotype patterns over a long range (Fig. 2B); (3) segmentation coupled with erosion (Fig. 2C); and (4) random scrambling (Fig. 2D). There is an inverse relationship between gene density and SNP density in the three major genetic variation patterns. Large monomorphic blocks have the highest gene density and the lowest SNP density, while the fragmented blocks have the lowest gene density but the highest SNP density (Fig. 3).
Assessing the accuracy of the Chromosome 16 haplotype map constructed in this study
A haplotype block is expected to capture the allelic variation pattern of a genomic region. Therefore, the accuracy of a haplotype map can be measured by how well its blocks predict the allelic variation of polymorphic markers that were not included in haplotype block construction. To assess the accuracy of the haplotype map constructed in this study, we measured the consistency between the allelic variation patterns depicted in the haplotype blocks derived from our study and the genotypes of SNPs that were assayed in the WI and GNF studies. In this analysis, only SNPs unique to the WI and GNF studies were included, and only the strains common between the studies were analyzed. More specifically, C57BL/6J and 129S1/SvImJ are the strains common to the WI and Celera data sets; C57BL/6J, 129S1/SvImJ, DBA/2J, and A/J are common to GNF and Celera. We found that the haplotype blocks defined in this study from the Celera genotype data were consistent with 98% and 91% of the genotype data in the WI and GNF studies, respectively. In addition, the distribution of WI SNPs across large polymorphic blocks, large monomorphic blocks, and fragmented haplotype blocks is similar to that of the Celera SNPs, even when including SNPs derived from the two strains unique to the WI study, C3H/He and BALB/cByJ (Table 2).
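The consistency measure described above can be illustrated with a minimal sketch. It is not the program used in this study; it assumes that a block is summarized by one haplotype label per strain, that a held-out SNP is given as one allele call per strain, and that a SNP counts as consistent only when it partitions the shared strains exactly as the block does. The function names, the `N` code for missing data, and the toy genotypes are all hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple

MISSING = "N"  # assumed code for a missing call


def canonical(labels: Iterable[str]) -> Tuple[int, ...]:
    """Relabel values by order of first appearance, so that the
    allele strings A,G,G and T,C,C both map to (0, 1, 1)."""
    seen: Dict[str, int] = {}
    return tuple(seen.setdefault(x, len(seen)) for x in labels)


def snp_agrees_with_block(block_hap: Dict[str, str],
                          snp_calls: Dict[str, str]) -> Optional[bool]:
    """Compare a held-out SNP with a block on the strains they share.

    Returns None when fewer than two shared strains have data,
    because such a SNP is not informative."""
    shared = [s for s in block_hap
              if s in snp_calls
              and block_hap[s] != MISSING
              and snp_calls[s] != MISSING]
    if len(shared) < 2:
        return None
    return (canonical(block_hap[s] for s in shared)
            == canonical(snp_calls[s] for s in shared))


def consistency_rate(pairs: Iterable[Tuple[Dict[str, str], Dict[str, str]]]) -> float:
    """Fraction of informative (block, SNP) pairs that agree."""
    results = [snp_agrees_with_block(b, s) for b, s in pairs]
    informative = [r for r in results if r is not None]
    if not informative:
        raise ValueError("no informative SNPs")
    return sum(informative) / len(informative)
```

Under this strict definition, a held-out SNP that is monomorphic among the shared strains inside a polymorphic block is scored as inconsistent; a more lenient rule would treat it as uninformative.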
Experimental validation by resequencing
To validate the predicted haplotype structure, to extend our knowledge of the genetic diversity in the common laboratory inbred strains and to better understand the origins of the patterns of genetic variation, we resequenced 39,495 bp of genomic sequence. The regions selected for resequencing comprised 44 internally contiguous genomic segments at 21 distinct genomic loci found in our initial analysis to consist of 10 large blocks and 11 fragmented blocks (Fig. 4; Supplemental Table S4). The target regions were selected to validate or investigate the following: (1) regions defined as SNP-poor across all strains in the WI study but found to be SNP-rich in our analysis; (2) large haplotype blocks with varying SNP density but the same allelic variation pattern; (3) the discrepancy between genotypes in the WI and GNF studies and the haplotype blocks constructed in this study; (4) the validity of "orphan" SNPs that break a haplotype block; (5) fragmented haplotype blocks (including segmentation and erosion); (6) the validity of SNP-poor regions discovered in this study; (7) the validity of SNPs in SNP-poor regions found in the current analysis; (8) genetic variations between the two 129 strains (129S1/SvImJ, 129X1/SvJ); and (9) the relationship between gene structure and SNP density. Additional information about target region selection can be found in the Supplemental material and Supplemental Table S4. A total of 25 inbred mice were assayed: 13 common laboratory strains and 12 wild-derived strains from four subspecies of the species Mus musculus: domesticus, musculus, molossinus, and castaneus as well as the species Mus spretus. In what follows, we use the mouse strain nomenclature developed by the Jackson Laboratory, in which the classical laboratory mouse stocks are referred to as "laboratory inbred strains" and descendants of recently captured wild mice as "wild-derived inbred strains."
A total of 1004 substitution SNPs (ss32467313-ss32468316 in the NCBI dbSNP database), including six triallelic markers, were found across all strains, 225 of which were polymorphic among the laboratory inbred strains. There were a total of 132 Celera SNPs, 22 WI SNPs, and eight GNF SNPs in the regions selected for resequencing. The validation rates were 99%, 95%, and 63% for the SNPs previously described in the Celera, the WI, and the GNF data sets in these regions, respectively. In each case, the resequencing results are in agreement with the haplotype block structure defined in this study (details are in the Supplemental material and http://lpg.nci.nih.gov/mulan/).
Variations of SNP density within a haplotype block
In the multistrain haplotype map that we constructed using the Celera data, the vast majority (>95%) of the SNPs within a block have the same allelic variation pattern, but the SNP density can vary considerably. The number of SNPs per segment in successive 10-kb intervals ranges from 0 to more than 20. For example, in the largest melded block (5.2 Mb) located at 45.9-51.1 Mb on Chromosome 16, in which the haplotype of C57BL/6J is different from the one shared by the four other strains, 97% of the 4119 SNPs are consistent with the dominant haplotype variation. However, the SNPs are not evenly distributed over this large physical region. SNP-poor segments (0 SNPs per 10 kb), which cover 15% of the block, are interspersed with SNP-rich segments (>20 SNPs per 10 kb) (Fig. 5A). Such a pattern of varying SNP rate across a large haplotype block is common on Chromosome 16. Of the nine largest (>1 Mb) melded blocks in which C57BL/6J and 129S1/SvImJ strains have different haplotypes, eight show considerable variation in SNP density (details in the Supplemental material). Six of the eight melded blocks with differing SNP density were split into multiple haplotype blocks on the WI map of C57BL/6J and 129S1/SvImJ (Fig. 5B). The one exception is a 2.4-Mb block located at 82.88-85.26 Mb in which 85% of the genomic region has low SNP density ( 1 SNPs per 10 kb). However, even in this region, two subregions, one 90 kb and the other 60 kb long, are SNP-rich and contain 50% of the SNPs of the entire block (Fig. 1).
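The SNP density profile discussed above (SNP counts in successive 10-kb intervals across a block) can be computed along the following lines. This is an illustrative sketch under our own assumptions, not the analysis code of this study: SNP positions are taken to be base-pair coordinates, the block is treated as the half-open interval from `start` to `end`, and the function names are hypothetical.

```python
from typing import List, Sequence


def snp_counts_per_window(positions: Sequence[int], start: int, end: int,
                          window: int = 10_000) -> List[int]:
    """Count SNPs in successive windows covering [start, end)."""
    n_windows = -(-(end - start) // window)  # ceiling division
    counts = [0] * n_windows
    for pos in positions:
        if start <= pos < end:
            counts[(pos - start) // window] += 1
    return counts


def fraction_of_windows(counts: Sequence[int], low: int, high: int) -> float:
    """Fraction of windows whose SNP count lies in [low, high]."""
    return sum(low <= c <= high for c in counts) / len(counts)
```

With such a profile, the share of a block made up of SNP-poor segments (0 SNPs per 10 kb) would be `fraction_of_windows(counts, 0, 0)`, and SNP-rich segments can be picked out by choosing a lower bound above 20.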
We were interested in verifying the high variability of SNP density within a haplotype block and investigating the cause of sharp transitions in SNP density. In the resequencing experiment, we selected juxtaposed regions of consistently high and low SNP density in the 2.4-Mb large block described above (Fig. 1) and in a 77-kb small block located at 35.61-35.69 Mb (Fig. 6). The second, smaller block was particularly interesting because it includes a 12-kb genomic sequence that encodes a protein with unknown function (GenBank accession NM_145481). As shown in Figure 6, the genomic sequence that encompasses protein-coding exons 1-4, including the introns, is SNP-poor (1 SNP per 10 kb), while the mostly noncoding exon 5 (86% of which is 3'-UTR) and its 3' downstream region are SNP-rich (10-29 SNPs per 10 kb).
We resequenced 2.2 kb of the SNP-rich region and 2.0 kb of the SNP-poor region. Among the 13 laboratory inbred strains, 27 and three SNPs were found in the SNP-rich and SNP-poor regions, respectively. The SNP-poor region contains the only missense variation (Phe200Leu in the protein sequence NP_663456.1) in the laboratory inbred strains. In both SNP-poor and SNP-rich regions, the haplotypes of the laboratory inbred strains can be found in the wild-derived inbred strains of domesticus but in no other subspecies (Fig. 6C). There is no transition from inter-subspecies to intra-subspecies in the ancestral origin of genetic variations in the laboratory inbred strains even though there is a sharp transition from high SNP rate to low SNP rate at this locus. Similar results were obtained in the resequencing of one SNP-rich segment and three SNP-poor segments in the 2.4-Mb large haplotype block (details in the Supplemental material).
Analysis of haplotypes and SNPs in laboratory inbred strains and wild-derived inbred strains
Of the 110 haplotypes in the laboratory inbred strains, 92 (84%) were also found in the wild inbred strains of domesticus subspecies, while none matched exclusively to the European musculus strains (CZECHII/Ei and SKIVE/Ei). Another 5% and 1% of the haplotypes in the laboratory strains were found in the Asian mice molossinus and castaneus, respectively. The remaining 10% of haplotypes could only be found in the laboratory inbred strains. The 44 contiguous genomic segments were sampled from 21 distinct genomic loci. In 17 out of the 21 genomic loci, SNPs in the laboratory inbred strains arise from intra-subspecies variations in the wild-derived inbred strains of domesticus subspecies (Fig. 4). These 17 loci include 80% of all SNPs in the laboratory inbred strains discovered by resequencing. In contrast, inter-subspecies variations between molossinus and domesticus contribute to SNPs in the laboratory inbred strains at only two loci. At the remaining two loci, SNPs in laboratory strains arise from variations between haplotypes of domesticus subspecies and haplotypes found only in the laboratory inbred strains.
Analysis of the origin of segmentation and erosion blocks
Unlike segmentation blocks in which several haplotype variation patterns alternate, erosion blocks are characterized by a single, predominant variation pattern frequently interrupted (eroded) by other variation patterns. The genomic span of predominant variation patterns varies from 10 kb to 1.2 Mb; the erosions usually contain a significant proportion (17%-35%) of the SNPs. One of the loci selected for resequencing is a 220-kb region in which the erosion blocks contain 32% of SNPs in the region (Fig. 2A). The five segments selected for resequencing span 14,777 bp. Three haplotypes were found in the 13 laboratory inbred strains; each can be directly and exclusively attributed to the domesticus subspecies (Fig. 7B). SNP pairs within a segment or between adjacent segments all fail the FGT and there are seven interruptions to the predominant variation pattern. The frequent interruptions to the dominant variation pattern coupled with lack of evidence for historic recombination indicate that it is unlikely the region is a recombination hotspot. Rather, three out of the four haplotypes found in the wild domesticus population are present in the laboratory inbred strains. The predominant variation pattern shows the divergence between MOR/Rk and LEWES/Ei, while the erosions arise because of the minor differences between MOR/Rk and WSB/Ei in this region. Another region of erosion blocks surveyed by resequencing is located at 45.1-45.9 Mb on Chromosome 16 (locus 13 in Fig. 4). There, the predominant variation pattern is generated by differences between C57BL/6J and the other Celera strains; four additional variation patterns form the erosions in this region; they contain 30% of the SNPs. The resequencing results show that C57BL/6J and C58/J have the same haplotype as molossinus, while the four haplotypes found in the other strains are likely to be of domesticus origin (Fig. 7C). Thus, this region has a mixture of inter- and intra-subspecies variations.
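The FGT invoked above is taken here to be the four-gamete test, which treats the presence of all four allele combinations at a pair of biallelic SNPs as evidence of historic recombination (assuming no recurrent mutation). The sketch below illustrates the test on strain haplotypes and is not the implementation used in this study. It assumes that each SNP is a string with one allele code per strain in a fixed strain order, with `N` for missing data; the function name is hypothetical.

```python
from typing import Sequence


def four_gamete_test(snp_a: Sequence[str], snp_b: Sequence[str],
                     missing: str = "N") -> bool:
    """Return True if all four two-SNP haplotypes are observed.

    snp_a and snp_b hold one allele per strain, in the same strain
    order. Both SNPs are assumed to be biallelic; strains with a
    missing call at either SNP are ignored."""
    gametes = {(a, b) for a, b in zip(snp_a, snp_b)
               if a != missing and b != missing}
    return len(gametes) == 4
```

In this sketch a return value of False corresponds to a SNP pair that fails the test, that is, one that gives no evidence of historic recombination.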
Phylogenetic analysis
Using high-resolution, multiple-strain SNP data of the 99-Mb mouse Chromosome 16, three major genetic variation patterns emerged from the global haplotype structure of the laboratory mouse genome: large monomorphic blocks, large haplotype blocks, and fragmented blocks. This structure is more complex than the mosaic structure of alternating large segments of low or high SNP density reported in a previous study (Wade et al. 2002 40 kb of genomic sequence in 13 laboratory inbred strains and 12 wild-derived inbred strains. Variation in haplotype block size has also been reported in two recent studies (Frazer et al 2004 5 Mb of fine-resolution haplotype structure across multiple strains. These findings suggest that a high-resolution SNP map is required to obtain an accurate description of the genetic variations in the laboratory mouse genome.
In a previous study, Wade et al. (2002
Inter-subspecies variations have been postulated to be the main contributor to the genetic diversity in the laboratory mouse genome based on the observation that most of these strains carry a M. m. domesticus mitochondrial DNA (Ferris et al. 1982
In regions of fragmented blocks, a pattern of erosion or segmentation usually spans multiple blocks. In some cases, all haplotypes in the region defined by an erosion or a segmentation pattern originated from the same subspecies, and the complexity of the haplotype structure might be comparable to that found in an outbred population with limited founders. Erosion blocks indicate that additional haplotypes similar to the two predominant haplotypes are present in the region. Although our sample of seven strains of wild-derived domesticus subspecies is insufficient to determine SNP population frequency, we suspect that the erosions arose as a consequence of low-frequency SNPs in the wild population because such a pattern of two predominant haplotypes with completely different alleles (dubbed "yin yang" haplotypes) was also found in humans when low-frequency SNPs (<10%) were filtered (Zhang et al. 2003
Compute haplotype blocks using the Celera SNP data
Genotype data were converted to the integers 1 or 2 to represent alleles that are the same as or different from the allele in the C57BL/6J strain, respectively. We found that 34% of the genotypes in the Celera data were missing and represented these as Ns. A haplotype block was initiated with a seed of two adjacent SNPs; we required the two "seed SNPs" to have a minimum of two unambiguous haplotypes so that a haplotype block would not be initiated with phase-ambiguous haplotypes. For example, two adjacent SNPs with genotypes 1N, N1, 2N, N2 for each of four strains would not qualify as a seed. Haplotypes for all strains were collected to assemble a set of nonredundant reference haplotypes for each block. If a block had only two reference haplotypes, the current block was considered valid and was extended by 1 SNP for the next iteration of reference haplotype assembly. To assemble the reference haplotypes, the mouse strains were processed in ascending order of their number of missing genotypes. A new reference haplotype was created if the haplotype of the current strain did not match any of the existing reference haplotypes. Otherwise, the matching reference haplotype was updated if the current haplotype could replace missing alleles in the reference haplotype. The program was executed in multiple iterations; after each iteration, "orphan" SNPs that formed 1-SNP blocks were removed, and the program ended when no "orphan" SNPs remained. The program was iterated three times to process the Celera SNP data, generating a total of 2083 blocks with 65,068 SNPs. A melding program was developed to merge adjacent haplotype blocks with the same allelic variation pattern. Only blocks that contain >40 SNPs were considered for merging.
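The reference haplotype assembly step of the block-building procedure can be sketched as follows. This is an illustration written for this text, not the original program. It assumes that each strain's haplotype over the candidate block is a string of `1`, `2`, and `N`, that two haplotypes match when they agree at every position where both have a call, and that the first matching reference is the one updated; the function names and these tie-breaking choices are our assumptions.

```python
from typing import Dict, List

MISSING = "N"  # missing genotype, as in the encoding described above


def compatible(h1: str, h2: str) -> bool:
    """True if the haplotypes agree wherever both have a call."""
    return all(a == b or MISSING in (a, b) for a, b in zip(h1, h2))


def fill_missing(ref: str, hap: str) -> str:
    """Replace missing alleles in ref with the calls from hap."""
    return "".join(b if a == MISSING else a for a, b in zip(ref, hap))


def assemble_reference_haplotypes(haps: Dict[str, str]) -> List[str]:
    """Build the nonredundant reference haplotypes for one block.

    Strains are processed in ascending order of their number of
    missing genotypes within the block."""
    refs: List[str] = []
    for strain in sorted(haps, key=lambda s: haps[s].count(MISSING)):
        hap = haps[strain]
        for i, ref in enumerate(refs):
            if compatible(ref, hap):
                refs[i] = fill_missing(ref, hap)  # update matching reference
                break
        else:
            refs.append(hap)  # no match: create a new reference
    return refs
```

A caller would then extend the block by one SNP while the assembly yields only two reference haplotypes, for example by checking `len(assemble_reference_haplotypes(haps)) == 2`, and would close the block otherwise.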
For a pair of adjacent blocks, if the two have the same haplotype variation pattern and if >95% of all SNPs in the region spanned by the two blocks (including the orphan SNPs and those that belong to other smaller blocks in the region) are consistent with the haplotype variation pattern, then the two will be merged into one block.
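The melding criterion stated above can be written as a small predicate. The sketch is illustrative only and rests on several assumptions of ours: a haplotype variation pattern is encoded as one character per strain (`1`, `2`, or `N`), the >40-SNP requirement is applied to each of the two blocks, and a SNP with a missing call is treated as consistent when its remaining calls agree with the pattern. The `Block` class and the function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

MISSING = "N"


@dataclass
class Block:
    pattern: str  # haplotype variation pattern, one character per strain
    n_snps: int   # number of SNPs assigned to the block


def snp_consistent(snp: str, pattern: str) -> bool:
    """True if the SNP agrees with the pattern wherever both have a call."""
    return all(a == b or MISSING in (a, b) for a, b in zip(snp, pattern))


def can_meld(left: Block, right: Block, region_snps: Sequence[str],
             min_block_snps: int = 40, min_fraction: float = 0.95) -> bool:
    """Decide whether two adjacent blocks may be merged.

    region_snps lists the patterns of all SNPs in the region spanned
    by the two blocks, including orphan SNPs and SNPs that belong to
    smaller intervening blocks."""
    if left.pattern != right.pattern:
        return False
    if left.n_snps <= min_block_snps or right.n_snps <= min_block_snps:
        return False
    if not region_snps:
        return False
    n_ok = sum(snp_consistent(s, left.pattern) for s in region_snps)
    return n_ok / len(region_snps) > min_fraction
```

For the Figure 1 example, in which three of 352 SNPs are inconsistent with the dominant pattern, the consistent fraction is 349/352, or about 99.1%, which exceeds the 95% threshold.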
SNP validation and discovery by resequencing
SNP loci were amplified with forward primers containing the -21M13 primer site (5'-TGTAAAACGACGGCCAGT-3') and reverse primers containing a -48M13 primer site (5'-AGCGGATAACAATTTCACAC-3'). Cleaned PCR products were sequenced using fluorescent BigDye Terminator v2 Ready Reaction Kits (Applied Biosystems cat # 4314416) on an ABI 3100 Sequencer.
Assemble haplotypes in the 44 genomic regions
We thank David Kaufman, David Lipman, Andy Clark, Robert Clifford, Richard Finney, Maxwell Lee, and the three anonymous reviewers for critical review of the manuscript.
1 Corresponding author. E-mail jinghuiz@mail.nih.gov; fax (301) 402-9325. [Supplemental material is available online at www.genome.org.] Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2901705.
Bishop, C.E., Boursot, P., Baron, B., Bonhomme, F., and Hatat, D. 1985. Most classical Mus musculus domesticus laboratory mouse strains carry a Mus musculus musculus Y chromosome. Nature 315: 70-72.
Ferris, S.D., Sage, R.D., and Wilson, A.C. 1982. Evidence from mtDNA sequences that common laboratory strains of inbred mice are descended from a single female. Nature 295: 163-165.
Frazer, K.A., Wade, C.M., Hinds, D.A., Patil, N., Cox, D.R., and Daly, M.J. 2004. Segmental phylogenetic relationships of inbred mouse strains revealed by fine-scale analysis of sequence variation across 4.6 mb of mouse genome. Genome Res. 14: 1493-1500.
Gabriel, S.B., Schaffner, S.F., Nguyen, H., Moore, J.M., Roy, J., Blumenstiel, B., Higgins, J., DeFelice, M., Lochner, A., Faggart, M., et al. 2002. The structure of haplotype blocks in the human genome. Science 296: 2225-2229.
Grupe, A., Germer, S., Usuka, J., Aud, D., Belknap, J.K., Klein, R.F., Ahluwalia, M.K., Higuchi, R., and Peltz, G. 2001. In silico mapping of complex disease-related traits in mice. Science 292: 1915-1918.
Mural, R.J., Adams, M.D., Myers, E.W., Smith, H.O., Miklos, G.L., Wides, R., Halpern, A., Li, P.W., Sutton, G.G., Nadeau, J., et al. 2002. A comparison of whole-genome shotgun-derived mouse chromosome 16 and the human genome. Science 296: 1661-1671.
Nagamine, C.M., Nishioka, Y., Moriwaki, K., Boursot, P., Bonhomme, F., and Lau, Y.F. 1992. The musculus-type Y chromosome of the laboratory mouse is of Asian origin. Mamm. Genome 3: 84-91.
Park, Y.G., Clifford, R., Buetow, K.H., and Hunter, K.W. 2003. Multiple cross and inbred strain haplotype mapping of complex-trait candidate genes. Genome Res. 13: 118-121.
Patil, N., Berno, A.J., Hinds, D.A., Barrett, W.A., Doshi, J.M., Hacker, C.R., Kautzer, C.R., Lee, D.H., Marjoribanks, C., McDonough, D.P., et al. 2001. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294: 1719-1723.
Rowe, L.B., Barter, M.E., Kelmenson, J.A., and Eppig, J.T. 2003. The comprehensive mouse radiation hybrid map densely cross-referenced to the recombination map: A tool to support the sequence assemblies. Genome Res. 13: 122-133.
Tucker, P.K., Lee, B.K., Lundrigan, B.L., and Eicher, E.M. 1992. Geographic origin of the Y chromosomes in "old" inbred strains of mice. Mamm. Genome 3: 254-261.
Wade, C.M., Kulbokas III, E.J., Kirby, A.W., Zody, M.C., Mullikin, J.C., Lander, E.S., Lindblad-Toh, K., and Daly, M.J. 2002. The mosaic structure of variation in the laboratory mouse genome. Nature 420: 574-578.
Wiltshire, T., Pletcher, M.T., Batalov, S., Barnes, S.W., Tarantino, L.M., Cooke, M.P., Wu, H., Smylie, K., Santrosyan, A., Copeland, N.G., et al. 2003. Genome-wide single-nucleotide polymorphism analysis defines haplotype patterns in mouse. Proc. Natl. Acad. Sci. 100: 3380-3385.
Witmer, P.D., Doheny, K.F., Adams, M.K., Boehm, C.D., Dizon, J.S., Goldstein, J.L., Templeton, T.M., Wheaton, A.M., Dong, P.N., Pugh, E.W., et al. 2003. The development of a highly informative mouse Simple Sequence Length Polymorphism (SSLP) marker set and construction of a mouse family tree using parsimony analysis. Genome Res. 13: 485-491.
Yalcin, B., Fullerton, J., Miller, S., Keays, D.A., Brady, S., Bhomra, A., Jefferson, A., Volpi, E., Copley, R.R., Flint, J., et al. 2004. Unexpected complexity in the haplotypes of commonly used inbred strains of laboratory mice. Proc. Natl. Acad. Sci. 101: 9734-9739.
Zhang, J., Rowe, W.L., Clark, A.G., and Buetow, K.H. 2003. Genomewide distribution of high-frequency, completely mismatching SNP haplotype pairs observed to be common across human populations. Am. J. Hum. Genet. 73: 1073-1081.
http://evolution.genetics.washington.edu/phylip.html; PHYLIP.
http://jaxmice.jax.org; The Jackson Laboratory.
http://lpg.nci.nih.gov/mulan/; LPG/NCI Mouse Haplotype Block.
http://www.broad.mit.edu; Broad Institute.
http://www.broad.mit.edu/personal/claire/strainsnplist_all.xls; WI haplotype map.
Received June 17, 2004; accepted in revised form December 13, 2004.