Published online before print
March 13, 2006, 10.1101/gr.4791006 Genome Res. 16:491-497, 2006 ©2006 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/06 $5.00
Letter
Genetic variation in the zebrafish
1 Hubrecht Laboratory, Netherlands Institute for Developmental Biology, 3584CT, Utrecht, The Netherlands; 2 Department of Genetics, Washington University Medical School, St. Louis, Missouri 63130, USA
Although zebrafish was introduced as a laboratory model organism several decades ago and now serves as a primary model for developmental biology, there are only limited data on its genetic variation. Establishing a dense polymorphism map has become a requirement for effective linkage analysis and cloning approaches in zebrafish. By comparing ESTs to whole-genome shotgun data, we predicted >50,000 high-quality candidate SNPs covering the zebrafish genome with an average resolution of 41 kbp. We experimentally validated 65% of a randomly sampled subset by genotyping 16 samples from seven commonly used zebrafish strains. The analysis reveals very high nucleotide diversity between zebrafish isolates. Even with the limited number of samples that we genotyped, zebrafish isolates revealed considerable interstrain variation, ranging from 7% (inbred) to 37% (wild-derived) of polymorphic sites being heterozygous. The increased proportion of polymorphic over monomorphic sites results in three-allelic variants being observed five times more frequently than in human or mouse. Phylogenetic analysis shows that comparisons between even the least divergent strains used in our analysis may provide one informative marker approximately every 500 nucleotides. Furthermore, the number of haplotypes per locus is relatively large, reflecting independent establishment of the different lines from wild isolates. Finally, our results suggest the presence of prominent C-to-U and A-to-I RNA editing events in zebrafish. Overall, the levels and organization of genetic variation between and within commonly used zebrafish strains are markedly different from those in other laboratory model organisms, which may affect experimental design and interpretation.
The zebrafish (Danio rerio) serves as a unique model for vertebrate development and pharmacological studies (Zon and Peterson 2005
There are several key advantages that distinguish another type of marker, single nucleotide polymorphism (SNP), as a marker of choice for many genetic studies. To mention a few, SNPs are the most common type of variation in genomes, allowing the generation of ultra-dense genetic maps, and there are efficient low- and high-throughput typing procedures for SNPs currently available (for review, see Vignal et al. 2002
In addition to simplifying genetic mapping experiments, studies on genetic variation in model organisms can clarify rate and composition as well as distribution and organization of polymorphic loci in the genome. In particular, it is not clear how much variation still persists in zebrafish laboratory inbred and outbred strains and how it compares to that present in wild isolates. The discovered variation at 9% of tested polymorphic loci in initially homozygous zebrafish C32 strain (Streisinger et al. 1981
Finally, the analysis of genotype data contributes to better understanding of strain history and the degree of interstrain variation. The variety of methods used to generate inbred lines, e.g., gynogenetic diploids and half-tetrad diploids, inbreeding (for review, see Beier 1998
Candidate SNP discovery
We have developed a computational SNP discovery pipeline and candidate SNP database named CASCAD (Cascad Snp Candidate Database, http://cascad.niob.knaw.nl) (Guryev et al. 2004
The resulting raw data set (>1 × 10^6 mismatches) was filtered for high-ranking candidate SNPs based on a variety of parameters, including masking out repetitive sequences and the presence of high PHRED quality score (>20) in each of the candidate SNP alleles. After clustering, 51,769 unique candidate SNPs were obtained. Although a similar amount of input EST sequences was used in this study compared with our previous analysis for the rat (Guryev et al. 2004 The average frequency of candidate SNP is 1 per 41 kbp, and the largest gap between two adjacent markers is 2 Mb on linkage group 14. About one-third of the candidates reside in genomic regions that are annotated as protein coding, including 9111 synonymous and 6375 nonsynonymous changes. The candidate SNPs cover 13,016 of 31,219 UniGene clusters and 7841 of 22,877 predicted Ensembl genes, or approximately one-third of zebrafish genes.
Over 66% of the candidate SNPs could be assigned to unique positions in the current zebrafish genome build (Zv5; http://www.sanger.ac.uk/Projects/D_rerio/Zv5_assembly_information.shtml). We failed to place 4% of the candidate SNPs at any location on the assembly, and a further 19% of the candidate SNPs mapped to multiple locations. Presumably, the major part of the nonunique fraction was assigned to fragments that are present redundantly in Zv5 as an artifact of the assembly process in its intermediate stage, although a small fraction may result from sequence differences between otherwise highly similar paralogs. We should mention here that Zv5 is a draft assembly and, in addition to false duplications, also contains other misassemblies and dropouts, meaning that all interpretations based on it should be treated with caution. Our analysis indicates that 73% of candidate SNPs map to the same linkage group in Zv5 as they would be placed on in the gene-based meiotic map of Woods and coworkers (2005)
Validation of SNPs
A consequence of validating SNPs by resequencing from genomic amplicons (average, 300 bp) was the opportunity to identify and analyze additional variation. Thus, in addition to the 256 confirmed candidate SNPs, we found as many as 1942 additional variable positions. Only 155 of these were present in our database of 51,769 computationally derived SNP candidates. The high fraction of new SNPs discovered in our validation stage is accounted for by the presence of intronic and intergenic regions in our validation assay that could not be scored for polymorphisms by our EST and mRNA-centered computational approach. More than 96% of all polymorphic loci were diallelic (2118/2198), and the remainder consisted predominantly of short SSLPs. One-tenth of the variants observed (228) were due to small insertions or deletions (indels), displaying an intermediate indel frequency if compared to human and chicken (6.6% and 13.9%, respectively; source, dbSNP build 124). Only a small fraction of polymorphisms identified in this study was observed within coding sequence as annotated in the Ensembl database, with 178 of them being silent, 85 missense, and two frameshift mutations. We have designed a Web interface (http://cascad.niob.knaw.nl/snpview) that facilitates the selection and use of the validated SNPs in genetic experiments. This tool allows the interactive retrieval and visual representation of validated SNPs for arbitrary combinations of strains.
Candidate SNP characteristics and validation
As expected, candidate SNP verification in both rat and zebrafish is sensitive to the functional context of the polymorphism; silent substitutions are more often verified than are missense. Some trends were found to be species specific: Unlike that in laboratory rat, positive correlation was not found for SNP confirmation at CpG positions in zebrafish. Comparative analysis of methylation and dinucleotide frequencies in different organisms revealed that in spite of the higher methylation level in fish, compared to mammals, CpG depletion is clearly lower in fish (Jabbari and Bernardi 2004
Surprisingly, transition substitutions were less frequently confirmed in zebrafish, in contrast to rat, for which they had a higher verification level. As the ratio between transitions and transversions is similar for both organisms, an organism-specific mechanism is suspected. Interestingly, we found two classes of frequently nonconfirmed transition variants in our verification set, and these correspond to the most frequent types of vertebrate RNA editing events: A(DNA) to G(cDNA) (P < 0.1) and C(DNA) to T(cDNA) (P < 0.01), due to A-to-I editing and C-to-U editing, respectively. As editing events usually affect multiple consecutively located sites, many of these events may easily be filtered out by our stringent filtering for candidates. Therefore, we performed a computational whole-genome screen for individual mismatches between EST sequences and the zebrafish genome assembly. The search was restricted to the sense strand as annotated in Ensembl build 31 and showed an 8% overrepresentation of A-to-G over G-to-A substitutions and an 11% excess of C-to-T versus T-to-C substitutions. From the analysis of this limited set of ESTs, we estimate that there are at least 2600 editing sites (C-to-U and A-to-I). As in primates, RNA editing may be very abundant in zebrafish, with a frequency of A-to-I editing one order of magnitude larger than that of mouse, rat, chicken, or fly (Eisenberg et al. 2005
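The directional comparison described above can be illustrated with a minimal Python sketch. It assumes that "overrepresentation" means the relative excess of one substitution class over its reverse, (forward - reverse) / reverse; the text does not spell out the exact definition. The function names and the input layout (pairs of genomic base and cDNA base on the annotated sense strand) are hypothetical and are not taken from the authors' scripts.

```python
from collections import Counter


def substitution_counts(pairs):
    """Count mismatch classes from (genomic_base, cdna_base) pairs.

    Pairs are assumed to be reported on the annotated sense strand;
    identical bases are ignored.
    """
    return Counter(
        (g.upper(), c.upper()) for g, c in pairs if g.upper() != c.upper()
    )


def percent_excess(counts, forward, reverse):
    """Relative excess (in percent) of `forward` over `reverse` substitutions.

    Assumed definition: 100 * (forward - reverse) / reverse.
    """
    fwd, rev = counts[forward], counts[reverse]
    if rev == 0:
        raise ValueError("reverse substitution class has no observations")
    return 100.0 * (fwd - rev) / rev


# Made-up counts, chosen only to illustrate the arithmetic.
example = [("A", "G")] * 108 + [("G", "A")] * 100
example_counts = substitution_counts(example)
a_to_g_excess = percent_excess(example_counts, ("A", "G"), ("G", "A"))
```

An excess of A-to-G over G-to-A on the sense strand is what A-to-I editing would be expected to produce, because inosine is read as G during cDNA synthesis; the same comparison with C-to-T versus T-to-C corresponds to C-to-U editing.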
When we eliminate from our confirmation experiment all nonconfirmed polymorphisms that may have been due to RNA-editing events, we observe a positive, although not significant (possibly due to the lower sample size, n = 339), verification correlation with both CpG sites and transition mutations, similar to that found for the rat. These results strongly suggest that high rates of RNA editing events in zebrafish account for the observed relatively low confirmation rate of transition candidate SNPs. We need to note that, in the absence of solid experimental data, one cannot exclude an alternative explanation for this apparent bias between ESTs and genomic sequence, namely, the occurrence of cytosine deamination during sample preparation and library construction, but this seems unlikely as the bias is observed for two independent EST data sets (Washington University EST project, http://genome.wustl.edu/est) (Lo et al. 2003
Nucleotide diversity
Similarly, the estimated average nucleotide diversity (Table 3) is likely to be an underestimate as there is a bias toward functionally constrained expressed sequences in our verification set. Strikingly, even this value is about one order of magnitude higher than that observed in human populations (Deutsch et al. 2001 The high nucleotide diversity in zebrafish also results in more frequent occurrence of three alleles at a single locus. About 1% of single nucleotide variants had three alleles, which is significantly higher than observed in mouse, human, or chicken (0.19%, 0.22%, and 0.28%, respectively, as calculated from NCBI dbSNP build 124). The observed number of triallelic SNPs in zebrafish is close to an estimate based on diallelic SNP frequency (1821), suggesting that most triallelic SNPs result from independent, unselected mutations rather than marking sites of strong positive selection. Although such triallelic SNPs are mostly neglected in genetic studies in other vertebrates where they are rare, in zebrafish they may prove advantageous in designing mapping probe sets useful for a greater fraction of loci tested across a wider variety of genetic backgrounds.
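For readers unfamiliar with the quantity, nucleotide diversity is commonly defined as the average number of pairwise differences per site among a set of aligned sequences. The Python sketch below implements that textbook definition on toy data; it is an illustration only and is not necessarily the procedure used to produce the values in Table 3. The function name and the example sequences are hypothetical.

```python
from itertools import combinations


def nucleotide_diversity(seqs):
    """Average pairwise differences per site for equal-length aligned sequences."""
    if len(seqs) < 2:
        raise ValueError("need at least two sequences")
    length = len(seqs[0])
    if length == 0 or any(len(s) != length for s in seqs):
        raise ValueError("sequences must be non-empty and of equal length")
    pairs = list(combinations(seqs, 2))
    # Total number of differing sites summed over all sequence pairs.
    differences = sum(
        sum(a != b for a, b in zip(s1, s2)) for s1, s2 in pairs
    )
    return differences / (len(pairs) * length)


# Toy alignment: 4 differences over 3 pairs and 4 sites.
toy_pi = nucleotide_diversity(["AAAA", "AAAT", "AATT"])
```

Because the validation amplicons were chosen around expressed sequences, any estimate of this kind from the present data would inherit the bias toward constrained regions noted above.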
Phylogenetic relationships
The SJD and C32 strains are the most polymorphic with respect to any of the other strains that we analyzed, due in part to fixation to homozygosity of many unique alleles in these inbred strains. Nevertheless, the closest relationship of our C32 isolate with SJD contradicts the previously observed lowest divergence between C32 and AB (Nechiporuk et al. 1999 5%–10% of SSLP markers from the reciprocal line (Rawls et al. 2003). Although this phylogenetic tree can be used to choose optimal pairs of strains for setting up genetic mapping experiments, the diversity between any of the lines will in most cases already be ample for the selection of sufficient SNP markers. For example, the rate of polymorphisms homozygous in both of the closely related AB and Tu strains is estimated to be about one per 500 bp.
Intrastrain variation
Structure of genetic variation
Conclusions
SNP discovery
The mRNA and EST sequence data used in this study were downloaded from NCBI GenBank (http://www.ncbi.nlm.nih.gov/Genbank) and Ensembl trace repository (http://trace.ensembl.org). EST sequences and quality data from Singapore isolate were provided by Dr. Jinrong Peng (Institute of Molecular and Cell Biology, Singapore). We used Ensembl trace archive (http://trace.ensembl.org) as a source of genomic traces. EST and mRNA sequences were masked for zebrafish-specific repeats, low-complexity regions, and zebrafish mitochondrial DNA by using RepeatMasker. Local SSAHA searches were performed to collect hits with nearly exact homology containing a single mismatch in mRNA/EST subset and remote searches (using Ensembl SSAHA search server) in case of mRNA/EST versus WGS comparison. Only hits with a high-quality mismatch (phred score >20 for both reads) within a sequence stretch of >80-bp identity were retained. The mRNA subset that is not annotated for base-calling quality data was treated as having a reliable overall quality. Hits were clustered to represent unique variations and stored in a MySQL database. Candidate SNPs were annotated and placed on the Zv5 genome assembly by using methods reported previously (Guryev et al. 2004 Predicted and discovered SNPs as well as genotype data obtained in this study were submitted to dbSNP under the following accession numbers: ss49785942–ss49839678. The CASCAD database of candidate SNPs and underlying supporting information is publicly available at http://cascad.niob.knaw.nl. All scripts are freely available upon request.
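The retention rule stated above (phred score >20 for both reads, within a stretch of >80 bp identity, with mRNA entries lacking quality data treated as reliable) can be sketched as follows. This is a minimal illustration in Python and not the authors' pipeline; the record layout, field names, and function name are hypothetical, and only the two thresholds come from the text.

```python
from dataclasses import dataclass
from typing import Optional

MIN_PHRED = 20        # from the text: phred score >20 for both reads
MIN_IDENTITY_BP = 80  # from the text: sequence stretch of >80-bp identity


@dataclass(frozen=True)
class MismatchHit:
    """One expressed-sequence versus WGS mismatch (hypothetical layout)."""
    expressed_phred: Optional[int]  # None: mRNA entry without quality data
    wgs_phred: int
    identity_stretch_bp: int


def passes_filter(hit: MismatchHit) -> bool:
    """Return True if the mismatch satisfies the stated retention rule."""
    # mRNA entries without base-calling quality are treated as reliable.
    expressed_ok = hit.expressed_phred is None or hit.expressed_phred > MIN_PHRED
    return (
        expressed_ok
        and hit.wgs_phred > MIN_PHRED
        and hit.identity_stretch_bp > MIN_IDENTITY_BP
    )


hits = [
    MismatchHit(35, 40, 120),   # retained
    MismatchHit(20, 40, 120),   # rejected: quality must exceed 20
    MismatchHit(None, 30, 81),  # retained: mRNA without quality data
]
retained = [h for h in hits if passes_filter(h)]
```

Both thresholds are strict inequalities, so a base with a phred score of exactly 20 or a flanking identity of exactly 80 bp is rejected under this reading of the text.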
SNP validation
We have semirandomly sampled candidate SNPs to generate a set of markers with even distribution throughout the zebrafish linkage groups. For this purpose we have divided the assembled zebrafish genome into equally sized bins and randomly selected a candidate from every bin. Primers for PCR amplification and sequencing of the genomic region were designed by using a customized Web interface (http://primers.niob.knaw.nl) to the Primer3 program (http://www-genome.wi.mit.edu/genome_software/other/primer3.html). Primer sequences can be obtained upon request or retrieved interactively from the Web interface (http://cascad.niob.knaw.nl/snpview) that allows the retrieval and visual representation of validated SNPs between arbitrary combinations of strains.
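The semirandom sampling scheme described above (equal-sized bins, one random candidate per bin) could be sketched in Python as below. The bin size is not given in the text, so the value used here is an arbitrary placeholder; the function name, tuple layout, and seed handling are likewise hypothetical and serve only to make the example reproducible.

```python
import random
from collections import defaultdict


def sample_one_per_bin(candidates, bin_size_bp, seed=0):
    """Pick one candidate SNP at random from every occupied genome bin.

    candidates: iterable of (linkage_group, position_bp, snp_id) tuples.
    Returns a dict mapping (linkage_group, bin_index) to the chosen snp_id.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for linkage_group, position_bp, snp_id in candidates:
        bins[(linkage_group, position_bp // bin_size_bp)].append(snp_id)
    # Sort the bins so the result does not depend on input order of bins.
    return {key: rng.choice(ids) for key, ids in sorted(bins.items())}


# Toy candidates; the 1-Mb bin size is a placeholder, not a value from the text.
toy_candidates = [
    (1, 100, "snp_a"),
    (1, 200, "snp_b"),
    (1, 1_500_000, "snp_c"),
    (2, 50, "snp_d"),
]
picked = sample_one_per_bin(toy_candidates, bin_size_bp=1_000_000)
```

Bins without any candidate simply yield no marker, which is one reason the resulting set is only approximately evenly distributed.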
PCRs were carried out by using a touchdown thermocycling program (60 sec at 92°C; 30 cycles for 20 sec at 92°C, 20 sec at 65°C with a decrement of 0.4°C per cycle, and 30 sec at 72°C; followed by 10 cycles of 20 sec at 92°C, 20 sec at 58°C, and 30 sec at 72°C; and 18 sec at 72°C; GeneAmp9700, Applied Biosystems) and contained 30–50 ng genomic DNA, 0.2 µM of each forward primer and 0.2 µM of each reverse primer, 400 µM of each dNTP, 25 mM Tricine, 7.0% glycerol (w/v), 1.6% DMSO (w/v), 2 mM MgCl2, 85 mM ammonium acetate (pH 8.7), and 0.2 U Taq polymerase in a total volume of 10 µL. After thermocycling, the PCR reactions were diluted with 25 µL water and mixed by pipetting, and 1 µL was used as template for dideoxy cycle sequencing, as recommended by the manufacturer (BigDye v3.1, Applied Biosystems) using one of the primers used for the PCR amplification. Sequencing reactions were analyzed on an ABI3730XL capillary sequencer (Applied Biosystems), and the obtained sequences were scored for polymorphic positions by using the PolyPhred program (Nickerson et al. 1997
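To make the touchdown program easier to follow, the annealing temperature per cycle can be tabulated with a few lines of Python. The sketch assumes that the first touchdown cycle anneals at 65°C and that the 0.4°C decrement applies from the second cycle onward; the text does not state this explicitly, and the function name is hypothetical.

```python
def touchdown_annealing_temps(start_c=65.0, step_c=0.4, touchdown_cycles=30,
                              final_c=58.0, final_cycles=10):
    """Annealing temperature (°C) for each cycle of the program in the text.

    Assumption: cycle 1 anneals at start_c, and each later touchdown cycle
    is step_c lower than the one before it.
    """
    temps = [round(start_c - step_c * i, 1) for i in range(touchdown_cycles)]
    temps += [final_c] * final_cycles
    return temps


schedule = touchdown_annealing_temps()
```

Under this assumption the schedule covers 40 cycles in total, and the last touchdown cycle anneals at 53.4°C before the program returns to a constant 58°C for the final 10 cycles.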
Phylogenetic reconstruction
Whole-genome mutation-type screen
We thank Dr. Jinrong Peng (Institute of Molecular and Cell Biology, Singapore), Washington University St. Louis, and Agencourt Bioscience Corporation for providing zebrafish EST sequence and/or quality data, and the Zebrafish Sequencing Group at the Wellcome Trust Sanger Institute for making the WGS trace data and zebrafish genome assemblies publicly available before publication. This work was supported by NWO genomics grant 050-10-024.
3 Corresponding author.
E-mail ecuppen{at}niob.knaw.nl; fax 31-30-251-6554. [The polymorphism and genotype data from this study have been submitted to dbSNP under accession nos. ss49785942–ss49839678.] Article published online ahead of print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.4791006
Beck J.A., Lloyd S., Hafezparast M., Lennon-Pierce M., Eppig J.T., Festing M.F., Fisher E.M. 2000. Genealogies of mouse inbred strains. Nat. Genet. 24: 23–25.
Beier D.R. 1998. Zebrafish: Genomics on the fast track. Genome Res. 8: 9–17.
Buth D.G., Gordon M.S., Plaut I., Drill S.L., Adams L.G. 1995. Genetic heterogeneity in isogenic homozygous clonal zebrafish. Proc. Natl. Acad. Sci. 92: 12367–12369.
Deutsch S., Iseli C., Bucher P., Antonarakis S.E., Scott H.S. 2001. A cSNP map and database for human chromosome 21. Genome Res. 11: 300–307.
Eisenberg E., Nemzer S., Kinar Y., Sorek R., Rechavi G., Levanon E. 2005. Is abundant A-to-I editing primate-specific? Trends Genet. 21: 77–81.
Granato M. and Nusslein-Volhard C. 1996. Fishing for genes controlling development. Curr. Opin. Genet. Dev. 6: 461–468.
Guryev V., Berezikov E., Malik E., Plasterk R.H.A., Cuppen E. 2004. Single nucleotide polymorphisms associated with rat expressed sequences. Genome Res. 14: 1438–1443.
Guryev V., Berezikov E., Cuppen E. 2005. CASCAD: A database of annotated single nucleotide polymorphisms associated with expressed sequences. BMC Genomics 6: 10.
Hedrich H.J. 2000. History, strains and models. In The laboratory rat: The handbook of experimental animals (ed. G.J. Krinke), pp. 3–16. Academic Press, NY.
Helfman G.S., Colette B.B., Facey D.E. 1997. The diversity of fishes. Blackwell Science, Malden, MA.
Jabbari K. and Bernardi G. 2004. Cytosine methylation and CpG, TpG (CpA) and TpA frequencies. Gene 333: 143–149.
Knapik E.W., Goodman A., Ekker M., Chevrette M., Delgado J., Neuhauss S., Shimoda N., Driever W., Fishman M.C., Jacob H.J. 1998. A microsatellite genetic linkage map for zebrafish (Danio rerio). Nat. Genet. 18: 338–343.
Kumar S., Tamura K., Nei M. 2004. MEGA3: Integrated software for molecular evolutionary genetics analysis and sequence alignment. Brief Bioinform. 5: 150–163.
Lo J., Lee S., Xu M., Liu F., Ruan H., Eun A., He Y., Ma W., Wang W., Wen Z., et al. 2003. 15000 unique zebrafish EST clusters and their future use in microarray for profiling gene expression patterns during embryogenesis. Genome Res. 13: 455–466.
Moriyama E.N. and Powell F.R. 1996. Intraspecific nuclear DNA variation in Drosophila. Mol. Biol. Evol. 13: 261–277.
Nechiporuk A., Finney J.E., Keating M.T., Johnson S.L. 1999. Assessment of polymorphism in zebrafish mapping strains. Genome Res. 9: 1231–1238.
Nickerson D.A., Tobe V.O., Taylor S.L. 1997. Polyphred: Automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. Nucleic Acids Res. 25: 2745–2751.
Postlethwait J., Johnson S., Midson C., Talbot W., Gates M., Ballinger E., Africa D., Andrews R., Carl T., Eisen J. 1994. A genetic linkage map for the zebrafish. Science 264: 699–703.
Rawls J.F., Frieda M.R., McAdow A.R., Gross J.P., Clayton C.M., Heyen C.K., Johnson S.L. 2003. Coupled mutagenesis screens and genetic mapping in zebrafish. Genetics 163: 997–1009.
Shimoda N., Knapik E.W., Ziniti J., Sim C., Yamada E., Kaplan S., Jackson D., de Sauvage F., Jacob H., Fishman M.C. 1999. Zebrafish genetic map with 2000 microsatellite markers. Genomics 58: 219–232.
Stickney H.L., Schmutz J., Woods I.G., Holtzer C.C., Dickson M.C., Kelly P.D., Myers R.M., Talbot W.S. 2002. Rapid mapping of zebrafish mutations with SNPs and oligonucleotide microarrays. Genome Res. 12: 1929–1934.
Streisinger G., Walker C., Dower N., Knauber D., Singer F. 1981. Production of clones of homozygous diploid zebra fish (Brachydanio rerio). Nature 291: 293–296.
Trevarrow B. and Robison B. 2004. Genetic backgrounds, standard lines, and their husbandry. In The zebrafish: Cellular and developmental biology, genetics, genomics and informatics (eds. H.W. Detrich III et al.), pp. 599–616. Academic Press, NY.
Vignal A., Milan D., SanCristobal M., Eggen A. 2002. A review on SNP and other types of molecular markers and their use in animal genetics. Genet. Sel. Evol. 34: 275–305.
Wade C.M., Kulbokas III E.J., Kirby A.W., Zody M.C., Mullikin J.C., Lander E.S., Lindblad-Toh K., Daly M.J. 2002. The mosaic structure of variation in the laboratory mouse genome. Nature 420: 574–578.
Westerfield M. 2000. The zebrafish book: A guide for the laboratory use of zebrafish (Danio rerio), 4th ed. University of Oregon Press, Eugene, OR.
Woods I.G., Wilson C., Friedlander B., Chang P., Reyes D.K., Nix R., Kelly P.D., Chu F., Postlethwait J.H., Talbot W.S. 2005. The zebrafish gene map defines ancestral vertebrate chromosomes. Genome Res. 15: 1307–1314.
Wu T.D. and Watanabe C.K. 2005. GMAP: A genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21: 1859–1875.
Yalcin B., Fullerton S., Miller S., Keays D.A., Brady S., Bhorma A., Jefferson A., Volpi E., Copley R.R., Flint J. 2004. Unexpected complexity in the haplotypes of commonly used inbred strains of laboratory mice. Proc. Natl. Acad. Sci. 101: 9734–9739.
Zon L.I. and Peterson R.T. 2005. In vivo drug discovery in the zebrafish. Nat. Rev. Drug Discov. 4: 35–44.
Received October 10, 2005; accepted in revised format January 18, 2006.