Genome Res. 13:1250-1257, 2003 ©2003 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/03 $5.00
Resources
Large-Scale Identification and Analysis of Genome-Wide Single-Nucleotide Polymorphisms for Mapping in Arabidopsis thaliana
1 Max-Planck-Institute of Chemical Ecology, Jena, Germany
2 Max-Planck-Institute for Plant Breeding Research, Germany
3 Max-Planck-Institute of Molecular Plant Physiology, Golm, Germany
Genetic markers such as single nucleotide polymorphisms (SNPs) are essential tools for positional cloning, association, or quantitative trait locus mapping and the determination of genetic relationships between individuals. We identified and characterized a genome-wide set of SNP markers by generating 10,706 expressed sequence tags (ESTs) from cDNA libraries derived from 6 different accessions, and by analysis of 606 sequence tagged sites (STS) from up to 12 accessions of the model flowering plant Arabidopsis thaliana. The cDNA libraries for EST sequencing were made from individuals that were stressed by various means to enrich for transcripts from genes expressed under such conditions. SNPs discovered in these sequences may be useful markers for mapping genes involved in interactions with the biotic and abiotic environment. The STS loci are distributed randomly over the genome. By comparison with the Col-0 genome sequence, we identified a total of 8051 SNPs and 637 insertion/deletion polymorphisms (InDels). Analysis of STS-derived SNPs shows that most SNPs are rare, but that it is possible to identify intermediate-frequency framework markers that can be used for genetic mapping in many different combinations of accessions. A substantial proportion of SNPs located in ORFs caused a change of the encoded amino acid. A comparison of the density of our SNP markers among accessions in both the EST and STS datasets revealed that Cvi-0 is the most divergent accession from Col-0 among the 12 accessions studied. All of these markers are freely available via the internet.
Arabidopsis thaliana is currently the most important model organism for plant research (Meinke et al. 1998
Traditionally, most genetic studies in A. thaliana involved mutants, crosses, and mapping populations derived from the genetically distinct Columbia (Col-0) and Landsberg erecta (Ler) accessions (e.g., Lister and Dean 1993
Such genetic mapping approaches and, in addition, the map-based positional cloning of chemically induced mutants (Lukowitz et al. 2000
One limitation of SNP markers identified in pairwise comparisons is their unknown frequency in a population. Information on SNP frequency is very useful for two reasons. First, most current SNP genotyping methods require the synthesis of a primer, or a pair of primers, to be used in primer extension or PCR amplification reactions. In contrast to rare SNPs, intermediate-frequency SNPs are polymorphic in many different combinations of accessions. Thus, in large-scale QTL mapping or positional-cloning projects that involve hundreds or thousands of markers or other additional accessions, it will be cost-effective to synthesize primers only for those SNPs that are likely to be polymorphic between different combinations of parental lines. Second, if a SNP is to be used as a marker in LD or association mapping projects, its frequency carries important information, because the frequency of a given SNP in the population is correlated with the expected size of a region that is in LD with this marker (for review, see Nordborg and Tavaré 2002
There have been some studies in A. thaliana to obtain information on population structure (Bergelson et al. 1998 Another limitation of currently available SNP sets is that they have not been enriched for markers in genes related to interactions with the environment. Such genes may contribute to quantitative traits and thus may be the main targets of mapping studies directed at uncovering the genetic architecture of naturally occurring phenotypic variation.
Here, we describe a genome-wide set of SNP markers that attempts to overcome these limitations. The markers were generated from more than 10,000 expressed sequence tags (ESTs) from 6 accessions of A. thaliana and
SNPs Derived From ESTs
A total of 10,706 ESTs were generated from cDNA libraries that were derived from 6 different accessions of A. thaliana (Table 1). The sequences were subjected to stringent quality filtering, including vector clipping and quality trimming (see Methods), and 7465 high-quality reads were clustered separately for every accession. This resulted in 5289 distinct clusters. Of these, 4240 (80%) consist of only one sequence read (singlets) and 1049 (20%) contain at least 2 reads. The high proportion of singlet reads shows that the libraries are of low redundancy. In the following, the 5289 clusters or singlets will be commonly referred to as clustered ESTs.
We compared all clustered ESTs against the Col-0 genome sequence with BLASTN and subsequently pairwise aligned them with their best hits. A total of 1108 (21%) clustered ESTs were excluded from SNP detection. The reasons were that the BLAST matches (1) were too short (<80 bp, n = 11 clustered ESTs), (2) showed a high sequence divergence (>3%) to the best-matching Col-0 sequence (n = 76), or (3) hit unannotated or incompletely annotated (e.g., no ATG start codon defined) genome regions (n = 309). In addition, in 712 cases, the sim4 and/or Wise2 alignments could not be interpreted because of questionable exon-intron structure, incorrect EST assembly, chimeric clones, or incorrect gene prediction. The remaining 4176 clustered ESTs could be mapped to 2907 different annotated genes, which represent about 12% of all currently annotated genes of A. thaliana (as of November 2002). To obtain the number of annotated genes that were tagged for the first time by sequences in our EST set, we analyzed the stringently filtered EST sequences (n = 7465). Of these, 574 sequences (479 clustered ESTs) did not match any of the currently annotated genes, and 201 sequences matched a total of 177 genes that had no EST hit before. We concluded that the majority of the genes tagged by the new EST sequences have been tagged before (n = 3229), but that there is a significant number of newly tagged genes. Possibly, gene expression profiles differ among the accessions, which would make this EST data set useful for gene annotation in A. thaliana. We evaluated the pairwise alignments (clustered EST against genome sequence) for mismatches and were able to identify 4327 SNPs and 18 insertion/deletion polymorphisms (InDels). InDels in noncoding regions were not included because the sim4 alignment program does not have gap penalties and tends to produce inaccurate alignments around InDels.
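The three BLAST-based exclusion criteria described above can be sketched as a small filter. This is a minimal illustration in Python, not the pipeline used in the study: the `Hit` record, its field names, and the definition of divergence as mismatches divided by match length are assumptions made for this example. Only the thresholds (80 bp, 3%) and the annotation requirement come from the text.

```python
from dataclasses import dataclass
from typing import Optional

MIN_MATCH_LENGTH_BP = 80   # matches shorter than this were excluded
MAX_DIVERGENCE = 0.03      # matches diverging by more than 3% were excluded


@dataclass
class Hit:
    """Best BLASTN match of one clustered EST (hypothetical record layout)."""
    match_length_bp: int
    mismatches: int
    gene_fully_annotated: bool  # False for unannotated or incompletely annotated regions


def exclusion_reason(hit: Hit) -> Optional[str]:
    """Return why a clustered EST is excluded from SNP detection, or None to keep it."""
    if hit.match_length_bp < MIN_MATCH_LENGTH_BP:
        return "match too short"
    # Assumption: divergence is the fraction of mismatched positions in the match.
    if hit.mismatches / hit.match_length_bp > MAX_DIVERGENCE:
        return "divergence above 3%"
    if not hit.gene_fully_annotated:
        return "unannotated or incompletely annotated region"
    return None
```

The fourth exclusion class (uninterpretable sim4 or Wise2 alignments) required inspection of the alignments themselves and is not modeled here.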
The distribution of SNPs on the five chromosomes shows that the whole genome is well covered by EST-derived SNPs with the exception of the centromeric regions that contain few transcribed genes (AGI 2000). Due to the low redundancy of the cDNA libraries, 2621 (60%) of SNPs are derived from only one EST sequence. The majority of SNPs are located in coding regions (3432; 79%), and among coding SNPs, a significant proportion (1101 of 3432; 32%) caused an amino acid replacement. Two SNPs lead to a nonsense codon. Among InDel polymorphisms, 10 are in-frame and 8 out-of-frame. Because most EST-derived SNPs are derived from only a single sequencing reaction, they need to be considered as hypothetical. To estimate the proportion of false positives among these SNPs, we designed primers for 96 amplicons covering genomic regions with EST-derived SNPs and used them to generate and sequence PCR products similar to STS generation (see below). Of the 96 polymorphic sites that were analyzed, 92 were confirmed to match the expected Col-0 sequence from MAtDB, 2 displayed a difference, and 2 analyses failed due to PCR or sequencing problems. The two cases with a difference may reflect rare differences in the Col-0 stocks used for genome sequencing (both are from IGF BACs; therefore, our Col-0 stock might be more similar to the TAMU Col-0). The PCR failure rate for the other accessions was higher than for Col-0 (on which the primers were designed), but in 81 cases, data for both the targeted SNP and the Col-0 sequence were available. Of these, only three turned out to be incorrect (96% confirmation rate). In addition, among 1858 SNPs that are located in EST clusters of at least 2 sequence reads, only 8 differ between the individual sequence reads and appear to be sequencing errors or reverse transcriptase-induced mutations. We therefore conclude that due to our stringent quality criteria, a very high proportion of EST-derived SNPs are true polymorphisms.
SNPs Derived From STS Sequences
Most STS sequences were highly similar to the Col-0 sequence, but 103 consensus sequences showed an overall divergence of >3% and were excluded from SNP detection. The comparison of the remaining 4955 consensus sequences led to the identification of 3773 SNPs and 619 InDel polymorphisms (Table 3). Due to PCR failure or low quality sequence, some SNP positions could not be genotyped in all 12 accessions (Fig. 1A). Among SNPs, 2922 (77%) are located in regions of the genome annotated as noncoding and 869 (23%) in coding regions. Among InDels, 617 were noncoding and only 2 were coding. We were able to determine the coding status of 857 SNPs in coding regions with the Wise2 program and found 410 (48%) replacement SNPs and 447 (52%) silent SNPs. Seven polymorphisms were nonsense mutations that lead to a premature stop codon.
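The distinction between silent, replacement, and nonsense SNPs can be illustrated with a short Python sketch. It is not the Wise2-based procedure used in the study: it assumes the reading frame is already known and that the standard genetic code applies, and the function name and arguments are hypothetical. Whether the study counted nonsense changes within the replacement class is not stated, so they are reported separately here.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def classify_coding_snp(ref_codon: str, position: int, alt_base: str) -> str:
    """Classify a single-base change at `position` (0, 1, or 2) of `ref_codon`."""
    ref_codon = ref_codon.upper()
    alt_codon = ref_codon[:position] + alt_base.upper() + ref_codon[position + 1:]
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if alt_aa == ref_aa:
        return "silent"        # same amino acid encoded
    if alt_aa == "*":
        return "nonsense"      # premature stop codon
    return "replacement"       # different amino acid encoded
```

For example, changing the third base of GAA to G gives GAG, which still encodes glutamate and is therefore silent, whereas changing the third base of TGG to A gives the stop codon TGA.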
Polymorphisms based on two sequence reads from both strands of the PCR products can be considered to be confirmed and the remaining ones as hypothetical. Using this criterion, 2331 (62%) of the SNPs and 343 (55%) of all InDels are confirmed polymorphisms for at least 1 accession. To test the reliability of our automated SNP detection, we generated 355 STS from the Col-0 accession. A total of 160,708 nonredundant high-quality base pairs was obtained from this accession and 8 (SNPs and InDels) sequence differences to the genome sequence were observed, which leads to an estimated proportion of 0.043% false-positive polymorphisms that likely result either from rare differences in the Col-0 stocks used for genome sequencing, or from errors in our sequence data. Only 1 of these differences had the status of a confirmed polymorphism among a total of 79,362 confirmed base calls from the Col-0 accession (estimated proportion of false-positive SNPs, 0.0012%). This demonstrates that the confirmed SNPs are of high reliability. To estimate the proportion of SNPs that are rare polymorphisms, we calculated the relative frequencies of all SNPs in the STS sample, whose allelic states have been determined from at least 8 of 12 accessions (n = 2640), including the reference genome sequence. Figure 1B shows that most SNPs in our sample are rare and segregate at low frequencies. A total of n = 1344 (51%) SNPs with a sampling depth of at least 8 accessions occur as singletons.
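The frequency classification described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: genotype calls are given as a dictionary from accession name to base (with `None` for a missing call), the helper names are hypothetical, and the only value taken from the text is the sampling depth of at least 8 of 12 accessions.

```python
from collections import Counter
from itertools import combinations

MIN_ACCESSIONS = 8  # sampling depth required in the text (at least 8 of 12)


def minor_allele_count(calls):
    """Return the count of the rarer allele, or None if the SNP is not usable.

    A SNP is skipped when fewer than MIN_ACCESSIONS calls are available or
    when it is not biallelic in the sample. A count of 1 marks a singleton.
    """
    observed = [base for base in calls.values() if base is not None]
    if len(observed) < MIN_ACCESSIONS:
        return None
    counts = Counter(observed).most_common()
    if len(counts) != 2:
        return None
    return counts[1][1]


def polymorphic_pair_fraction(calls):
    """Fraction of accession pairs (with calls) in which the SNP is polymorphic."""
    observed = [base for base in calls.values() if base is not None]
    pairs = list(combinations(observed, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)
```

With 12 genotyped accessions, a singleton is polymorphic in only 11 of the 66 possible pairs, whereas a SNP with 6 accessions carrying each allele is polymorphic in 36 of 66 pairs. This is the reason intermediate-frequency SNPs are more useful as framework markers across many crosses.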
Divergence Relationships Among Accessions
A distance tree of all 12 accessions based on STS-derived SNPs that have been genotyped in all accessions confirms that Cvi-0 is the most divergent and Gü-0 the least divergent accession to Col-0 (Fig. 2A). A consensus tree resulting from a bootstrap analysis of the same data shows that the topology is essentially identical to the distance-based tree, but individual nodes (except the node connecting Col-0 with Gü-0) are only weakly supported, and Cvi-0 no longer appears to be the most divergent accession (Fig. 2B), suggesting that the overall large genetic distance between Cvi-0 and Col-0 is due to a limited number of more (still <3%) divergent loci.
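As an illustration of the kind of input a distance tree is built from, the following Python sketch computes pairwise distances between accessions from SNPs genotyped in all of them. The distance measure used here (the proportion of SNP sites at which two accessions differ) is an assumption for this example, the function name is hypothetical, and the tree construction itself is not shown.

```python
from itertools import combinations


def pairwise_distances(genotypes):
    """Proportion of differing SNP sites for every pair of accessions.

    `genotypes` maps an accession name to a string with one allele per SNP;
    all strings must have the same, non-zero length (complete genotyping).
    """
    lengths = {len(alleles) for alleles in genotypes.values()}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("all accessions need the same non-zero number of SNP calls")
    n_sites = lengths.pop()
    distances = {}
    for a, b in combinations(sorted(genotypes), 2):
        differences = sum(x != y for x, y in zip(genotypes[a], genotypes[b]))
        distances[(a, b)] = differences / n_sites
    return distances
```

Such a matrix is the usual input for distance-based tree methods. A few highly divergent loci can dominate the summed distance for one accession, which is consistent with the weak bootstrap support reported above.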
By use of the STS-derived SNPs, it is also possible to calculate the proportion of the observed genetic variation that segregates among the widely used Col-0, Ler, and Cvi-0 accessions (e.g., Lister and Dean 1993
GABI-MASC SNP Database
The availability of the genome sequence greatly facilitates the discovery of genetic polymorphisms in model species, because the sequencing of additional individuals and comparison with the genome sequence can be automated and performed on a large scale. We used EST and STS sequence data to identify a total of 8051 unique SNP and 637 InDel polymorphisms, and thus significantly increased the number of available genetic markers for A. thaliana. The inclusion of new accessions into the SNP discovery process and the genotyping of SNPs in up to 12 accessions provide genome-wide sets of SNPs for many pairwise combinations. These sets can be used for QTL mapping and as SNPs of different frequencies and types in association and LD mapping projects. The use of cDNA libraries from stressed individuals should enrich our marker collection for genes that interact with the environment and control quantitative traits.
Overall, the data confirm the observation made in earlier genome-wide and multilocus surveys of genetic variation (Bergelson et al. 1998
The EST and STS datasets can be considered to be independent samples of genome sequence diversity. The main differences between both sets of SNPs are (1) a larger number of base pairs that are necessary to find a SNP in EST than in STS sequence data, and (2) a higher ratio of silent to replacement polymorphisms in EST (
Our data support the existence of a certain degree of population structure in A. thaliana, for which evidence was also found in a genome-wide survey of AFLP markers (Sharbel et al. 2000
Our SNP markers are of immediate use for QTL mapping and positional cloning experiments with the accessions used for this study. We have extracted several subsets of
A long-term goal of this project is to apply SNP markers as tools for association and LD mapping approaches to identify genes that underlie naturally occurring genetic variation. Population genetics theory predicts that in a predominantly selfing organism like A. thaliana, large genomic regions can be expected to be in LD, suggesting that a limited set of markers may be sufficient for fine-scale mapping of interesting genes (Nordborg 2000
Selection of A. thaliana Accessions
A total of 12 accessions was chosen for the survey. These include six accessions used previously for genetic mapping (Col-0, Cvi-0, Ler, Nd-1, C24, Ws-0) and six additional accessions (Ei-2, CS22491, Gü-0, Lz-0, Wei-0, Yo-0) with a high average genetic distance to other accessions as determined from AFLP data (Sharbel et al. 2000
Construction of cDNA Libraries
A total of 9696 ordered library clones were used for EST generation (the numbers of clones subjected to EST sequencing per accession were as follows: C24, 2688; Ei-2 and Nd-1, 1632 each; Ler, Cvi-0, and Ak-2, 1248 each; in most cases, the first plate contained only 96 clones). All cDNA clones are available from the RZPD (see www.rzpd.de). The number of clones analyzed is smaller than the number of EST sequence reads produced because some cDNAs were sequenced not only from the 5'-end but also from the 3'-end.
EST Sequencing
STS Selection
With the exception of CAPS primers, primer pairs were designed automatically with the Primer3 program (S. Rozen and H.J. Skaletsky, unpubl.) by use of the Col-0 genome sequence as a template. Parameters were chosen such that the primers had an annealing temperature of 60°C and resulted in amplicons of
Genomic DNA Preparation, STS PCR Amplification, and Sequencing
Sequence Analysis
SNP Detection
Polymorphisms were mapped physically onto the pseudochromosome and given a unique identifier (MASC number) after merging the EST and STS datasets. SNPs that were sequenced from one side only were scored as hypothetical and SNPs sequenced from both directions in the same accession (STS only) as confirmed. All SNP markers generated by this project are available from www.mpiz-koeln.mpg.de/masc/. Data will also be available from TAIR.
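The scoring rule for hypothetical and confirmed SNPs can be written down in a few lines of Python. This is a sketch of the rule as stated in the text, not the project's software; the function name, the strand symbols "+" and "-", and the dataset labels are assumptions made for the example.

```python
def snp_status(read_strands, dataset):
    """Score one SNP in one accession as 'confirmed' or 'hypothetical'.

    `read_strands` lists the strand ("+" or "-") of every read covering the
    SNP in that accession; `dataset` is "STS" or "EST". Only STS-derived
    SNPs covered from both directions count as confirmed.
    """
    strands = set(read_strands)
    if dataset == "STS" and {"+", "-"} <= strands:
        return "confirmed"
    return "hypothetical"
```

For instance, `snp_status(["+", "-"], "STS")` yields "confirmed", whereas a SNP seen only on one strand, or any EST-derived SNP, remains "hypothetical".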
Phylogenetic Analysis
We thank Henriette Ringys-Beckstein and the DNA core facility teams at MPI-Z and MPI-CE for excellent technical assistance, and Jürgen Kroymann and Dan Kliebenstein for primer design and discussions. Many thanks to Svenja Meyer (GABI Primary Database at RZPD) for handling the data submission to EMBL/GenBank, to Heiko Schoof (GABI-Info, MIPS) for providing MAtDB annotation data, and to Dieter Berger who provided the C24 cDNA library. This work was funded by an Emmy-Noether-Fellowship of the Deutsche Forschungsgemeinschaft to K.J.S., by grants of the BMBF-GABI project to T.A., T.M.-O., and B.W., and by the Max-Planck-Society. Partial support was also provided by grant number QLG2-CT-2001-01097 NATURaL from the European Union. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.728603.
4 Corresponding author. [EST sequences longer than 50 bp have been submitted to EMBL/GenBank under accession numbers CB255604-CB265223. STS sequences longer than 50 bp have been submitted to EMBL/GenBank under accession numbers BV007447-BV012320.]
Alonso-Blanco, C. and Koornneef, M. 2000. Naturally occurring variation in Arabidopsis: An underexploited resource for plant genetics. Trends Plant Sci. 5: 22-29.
Alonso-Blanco, C., El-Assal, S.E.-D., Coupland, G., and Koornneef, M. 1998a. Analysis of natural allelic variation at flowering time loci in the Landsberg erecta and Cape Verde islands ecotypes of Arabidopsis thaliana. Genetics 149: 749-764.
Alonso-Blanco, C., Peeters, A., Koornneef, M., Lister, C., Dean, C., van den Bosch, N., Pot, J., and Kuiper, M. 1998b. Development of an AFLP based linkage map of Ler, Col and Cvi Arabidopsis thaliana ecotypes and construction of a Ler/Cvi recombinant inbred line population. Plant J. 14: 259-271.
The Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796-815.
Altshuler, D., Pollara, V., Van Etten, C.C.W., Baldwin, J., Linton, L., and Lander, E. 2000. An SNP map of the human genome generated by reduced representation sequencing. Nature 407: 513-516.
Bergelson, J., Stahl, E., Dudeck, S., and Kreitman, M. 1998. Genetic variation between and within populations of Arabidopsis thaliana. Genetics 148: 1311-1323.
Birney, E., Thompson, J., and Gibson, T. 1996. PairWise and SearchWise: Finding the optimal alignment in a simultaneous comparison of a protein profile against all DNA translation frames. Nucleic Acids Res. 24: 2730-2739.
Breyne, P., Rombaut, D., Van Gysel, A., Van Montagu, M., and Gerats, T. 1999. AFLP analysis of genetic diversity within and between Arabidopsis thaliana ecotypes. Mol. Gen. Genet. 261: 627-634.
Cardon, L. and Bell, J. 2001. Association study designs for complex diseases. Nat. Rev. Genet. 2: 91-98.
Cho, R., Mindrinos, M., Richards, D., Sapolsky, R., Anderson, M., Drenkard, E., Dewdney, J., Reuber, T., Stammers, M., Federspiel, N., et al. 1999. Genome-wide mapping with biallelic markers in Arabidopsis thaliana. Nat. Genet. 23: 203-207.
Ewing, B., Hillier, L., Wendl, M., and Green, P. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8: 175-185.
Felsenstein, J. 1989. PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics 5: 164-166.
Florea, L., Hartzell, G., Zhang, Z., Rubin, G., and Miller, W. 1998. A computer program for aligning a cDNA sequence with a genomic sequence. Genome Res. 8: 967-974.
Hagenblad, J. and Nordborg, M. 2002. Sequence variation and haplotype structure surrounding the flowering time locus FRI in Arabidopsis thaliana. Genetics 161: 289-298.
Haubold, B., Kroymann, J., Ratzka, A., Mitchell-Olds, T., and Wiehe, T. 2002. Recombination and gene conversion in a 170 kb genomic region of Arabidopsis thaliana. Genetics 161: 1269-1278.
Jander, G., Norris, S., Rounsley, S., Bush, D., Levin, I., and Last, R. 2002. Arabidopsis map-based cloning in the post-genome era. Plant Phys. 129: 440-450.
Kuittinen, H. and Aguadé, M. 2000. Nucleotide variation at the CHALCONE ISOMERASE locus in Arabidopsis thaliana. Genetics 155: 863-872.
Lister, C. and Dean, C. 1993. Recombinant inbred lines for mapping RFLP and phenotypic markers in Arabidopsis thaliana. Plant J. 4: 745-750.
Lukowitz, W., Gillmor, C., and Scheible, W.-R. 2000. Positional cloning in Arabidopsis. Why it feels good to have a genome initiative working for you. Plant Phys. 123: 795-805.
Meinke, D.W., Cherry, J.M., Dean, C., Rounsley, S.D., and Koornneef, M. 1998. Arabidopsis thaliana: A model plant for genome analysis. Science 282: 662-682.
Miyashita, N.T., Kawabe, A., Innan, H., and Terauchi, R. 1998. Intra- and interspecific DNA variation and codon bias of the alcohol dehydrogenase (Adh) locus in Arabis and Arabidopsis species. Mol. Biol. Evol. 15: 1420-1429.
Miyashita, N.T., Kawabe, A., and Innan, H. 1999. DNA variation in the wild plant Arabidopsis thaliana revealed by amplified random fragment length polymorphism analysis. Genetics 152: 1723-1731.
Nordborg, M. 2000. Linkage disequilibrium, gene trees and selfing: An ancestral recombination graph with partial self-fertilization. Genetics 154: 923-929.
Nordborg, M. and Tavaré, S. 2002. Linkage disequilibrium: What history has to tell us. Trends Genet. 18: 83-90.
Nordborg, M., Borevitz, J., Bergelson, J., Berry, C., Chory, J., Hagenblad, J., Kreitman, M., Maloof, J., Noyes, T., Oefner, P., et al. 2002. The extent of linkage disequilibrium in Arabidopsis thaliana. Nat. Genet. 30: 190-193.
Rozas, J. and Rozas, R. 1999. DnaSP version 3: An integrated program for molecular population genetics and molecular evolution analysis. Bioinformatics 15: 174-175.
Schoof, H., Zaccaria, P., Gundlach, H., Lemcke, K., Rudd, S., Kolesov, G., Arnold, R., Mewes, H.W., and Mayer, K.F. 2002. MIPS Arabidopsis thaliana Database (MAtDB): An integrated biological knowledge resource based on the first complete plant genome. Nucleic Acids Res. 30: 91-93.
Sharbel, T., Haubold, B., and Mitchell-Olds, T. 2000. Genetic isolation by distance in Arabidopsis thaliana: Biogeography and postglacial colonization of Europe. Mol. Ecol. 9: 2109-2118.
Somerville, C. and Dangl, J. 2000. Genomics: Plant biology in 2010. Science 290: 2077-2078.
Steinmetz, L., Mindrinos, M., and Oefner, P. 2000. Combining genome sequences and new technologies for dissecting the genetics of complex phenotypes. Trends Plant Sci. 5: 397-401.
Tabor, H., Risch, N., and Myers, R. 2002. Candidate-gene approaches for studying complex genetic traits: Practical considerations. Nat. Genet. Rev. 3: 1-7.
Telles, G. and da Silva, F. 2001. Trimming and clustering sugarcane ESTs. Gen. Mol. Biol. 24: 17-23.
ftp://ftpmips.gsf.de/cress; MIPS Arabidopsis thaliana annotation files (download).
http://www.mpiz-koeln.mpg.de/masc/; GABI-MASC SNP database.
www.arabidopsis.org/aboutcaps.html; Arabidopsis CAPS marker table.
www.arabidopsis.org/Cereon/index.html; Cereon Arabidopsis polymorphism and Ler sequence collection.
www.rzpd.de; German Resource center/Primary Database for genomic research.
www.tigr.org/tdb/at/atgenome/Ler.html; Ler Sequence Database.
Received August 22, 2002; accepted in revised form March 19, 2003.