Published online before print
August 2, 2007, 10.1101/gr.6223207 Genome Res. 17:1414-1419, 2007 ©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07 $5.00
Letter
A periodic pattern of SNPs in the human genome
1 Bioinformatics Research Center (BiRC), University of Aarhus, Hoegh-Guldbergs Gade 10, DK-8000 Aarhus C, Denmark; 2 Molecular Diagnostic Laboratory, Aarhus University Hospital, Brendstrupgaardsvej 90, DK-8200 Aarhus N, Denmark
By surveying a filtered, high-quality set of SNPs in the human genome, we have found that SNPs positioned 1, 2, 4, 6, or 8 bp apart are more frequent than SNPs positioned 3, 5, 7, or 9 bp apart. The observed pattern is not restricted to genomic regions that are known to cause sequencing or alignment errors, for example, transposable elements (SINE, LINE, and LTR), tandem repeats, and large duplicated regions. However, we found that the pattern is almost entirely confined to what we define as "periodic DNA," that is, genomic regions with a high degree of periodicity in nucleotide usage. Periodic DNA consists mainly of small regions (average length 16.9 bp) that are widely distributed in the genome. Furthermore, periodic DNA has a 1.8 times higher SNP density than the rest of the genome, and SNPs inside periodic DNA have a significantly higher genotyping error rate than SNPs outside periodic DNA. Our results suggest that not all SNPs in the human genome are created by independent single nucleotide mutations, and that care should be taken in the analysis of SNPs from periodic DNA. The latter may have important consequences for SNP and association studies.
More than 11.5 million single nucleotide polymorphisms (SNPs) are reported in the human genome (dbSNP build 125). These are spread throughout the genome and are not restricted to certain genomic regions or genetic elements such as exons, introns, transposons, or tandem repeat sequences. Most SNPs are believed to be the product of independent single mutational events in the past, or occasionally due to multiple recurrent mutations in the same nucleotide position (Stoneking 2001
SNPs are not the only widespread variation in the genome. Insertions and deletions (indels) occur throughout the genome, giving rise to local structural polymorphisms (Tuzun et al. 2005
In this study, we report on a systematic small-scale pattern of SNPs that adds to the complexity of the genome and that cannot be explained by viewing all SNPs as the result of independent single nucleotide mutations. We filtered all known SNPs in the human genome by stringent criteria to obtain a highly reliable set of SNPs, excluding SNPs with ambiguous positions or validation problems. By examining the filtered SNPs, we observed that SNPs positioned 1, 2, 4, 6, or 8 bp apart are more frequent than SNPs positioned 3, 5, 7, or 9 bp apart (see Fig. 1). This holds even when we correct for nucleotide frequencies and site dependencies in nucleotide usage in the genome. If all positions in the genome had the same probability of being an SNP, we would expect equal numbers of SNP pairs in all distances >1. For SNP pairs in distance 1 (direct neighbor SNPs), the high CpG mutation rate is expected to lead to an over-representation compared to distances >1 (Hwang and Green 2004
One possible and obvious explanation of this 1, 2, 4, 6, 8 pattern is systematic sequencing and/or alignment errors. We ruled out this possibility by using only filtered SNPs (as defined in Methods), and by observing that the pattern is far from restricted to genomic regions associated with sequencing and alignment errors, for example, transposable elements (SINE, LINE, and LTR), tandem repeats, and large duplicated regions. Moreover, the pattern is highly abundant in transcripts. To further scrutinize the observation, we defined "periodic DNA." Periodic DNA consists of (small) sequences of DNA with a high degree of periodicity in nucleotide usage (defined rigorously in Methods), and periodic DNA is thus expected to contain the pattern systematically. Surprisingly, we found that by excluding SNPs in periodic DNA, the pattern virtually disappears. Hence the structure of periodic DNA may hint at the origin of the pattern. The fundamental observation is that in a segment of periodic DNA, for example, ATATATATAT, a base change, say, A to G, may be observed in several of the A positions, and more frequently than expected by chance. This pattern could be created by copy number alterations in the AT repeat, but we find that the pattern persists even when the flanking regions of the SNPs align perfectly to the reference genome sequence and there are no gaps in the alignment. Hence, length polymorphism/variation cannot explain the pattern. This implies that even in a short segment of periodic DNA with period p (in the above example, p = 2), the presence of one SNP increases the probability of a second identical SNP in distances 1p, 2p, ... bp in the same segment. This is visible as an excess of identical SNPs in certain distances. For example, periodic DNA with periods 1, 2, or 4 is expected to have an over-representation of identical SNPs in distance 4 bp, whereas only periodic DNA with periods 1 or 5 is expected to have an over-representation of identical SNPs in distance 5 bp.
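The relation between period and distance in the example above can be made concrete with a minimal sketch. It assumes only what the paragraph states: a segment with period p can produce an excess of identical SNPs in distances that are multiples of p. The function name `contributing_periods` and the cap of nine periods are illustrative choices, not part of the original analysis.

```python
def contributing_periods(d, max_period=9):
    """Return the periods p (1..max_period) for which distance d is a multiple of p.

    Periodic DNA with any of these periods may contribute an excess of
    identical SNP pairs in distance d (illustrative helper, not from the paper).
    """
    return [p for p in range(1, max_period + 1) if d % p == 0]


if __name__ == "__main__":
    for d in range(1, 10):
        print(d, contributing_periods(d))
```

For distance 4 this lists periods 1, 2, and 4, and for distance 5 it lists periods 1 and 5, in agreement with the example in the text.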
In this study, we document this pattern in detail.
General pattern
When surveying the frequency spectrum of all pairs of SNPs in various distances (d), we found that pairs of identical SNPs generally follow a 1, 2, 4, 6, 8 pattern, whereas pairs of different SNPs are almost uniformly distributed for d > 1 (Fig. 2; Supplemental Fig. S1). The CpG effect (Hwang and Green 2004 9 bp.
Using only SNPs outside transposable elements (SINE, LINE, and LTR), tandem repeats (as defined by RepeatMasker) and large duplicated regions (>1 kb), respectively, did not remove the pattern (Supplemental Fig. S3A–C).
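The survey of SNP pairs by distance described in this section can be sketched as follows. This is a minimal illustration, not the authors' implementation: the input format (a list of `(position, alleles)` tuples for one chromosome) is hypothetical, and "identical" is assumed here to mean that both SNPs carry the same allele pair.

```python
from collections import Counter


def pair_spectrum(snps, max_d=9):
    """Count SNP pairs in each distance 1..max_d, split into identical and different pairs.

    snps: list of (position, alleles) tuples on one chromosome, e.g. (100, "A/G").
    The tuple layout and the allele-string comparison are assumptions made for
    this sketch.
    """
    snps = sorted(snps)
    identical, different = Counter(), Counter()
    n = len(snps)
    for i in range(n):
        pos_i, alleles_i = snps[i]
        j = i + 1
        # Positions are sorted, so we can stop once the distance exceeds max_d.
        while j < n and snps[j][0] - pos_i <= max_d:
            d = snps[j][0] - pos_i
            if d > 0:
                if snps[j][1] == alleles_i:
                    identical[d] += 1
                else:
                    different[d] += 1
            j += 1
    return identical, different


if __name__ == "__main__":
    toy = [(100, "A/G"), (102, "A/G"), (104, "A/G"), (105, "C/T"), (120, "A/G")]
    same, diff = pair_spectrum(toy)
    print(dict(same), dict(diff))
```

Because the positions are sorted and the inner loop stops at `max_d`, the cost is roughly linear in the number of SNPs for sparse data.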
To further validate the pattern, we analyzed only the random HapMap-ENCODE regions (The ENCODE Project Consortium 2004
Periodic DNA
The density of SNPs is higher in periodic DNA than in the rest of the genome. Thus, 7.4% of the SNPs are located in periodic DNA (4.3% of the genome), which is a 1.8 times higher SNP density than in the rest of the genome. Pairs of identical SNPs show the most significant discrepancy, with 28.1% of all pairs of identical SNPs located in periodic DNA. Pairs of different SNPs are less over-represented, with 10.0% of all pairs of different SNPs located in periodic DNA.
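The 1.8-fold figure follows from the two percentages quoted above. The short calculation below is a sketch of that arithmetic; the function name `relative_density` is illustrative.

```python
def relative_density(frac_snps_inside, frac_genome_inside):
    """Ratio of SNP density inside a region class to SNP density outside it.

    Both arguments are fractions between 0 and 1 (illustrative helper).
    """
    density_inside = frac_snps_inside / frac_genome_inside
    density_outside = (1 - frac_snps_inside) / (1 - frac_genome_inside)
    return density_inside / density_outside


if __name__ == "__main__":
    # 7.4% of the SNPs fall in periodic DNA, which covers 4.3% of the genome.
    print(round(relative_density(0.074, 0.043), 1))
```

The density is compared with the rest of the genome rather than with the genome-wide average, which is why the denominator uses the complementary fractions.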
The distribution of periodic DNA on the nine different periods is shown in Figure 3. It is seen that sequences with periods 1, 2, or 4 are over-represented compared to the other periods. This implies that we expect pairs of identical SNPs in distances 1, 2, 4, 6, or 8 bp to be more frequent than identical SNP pairs in distances 3, 5, 7, or 9 bp. This is in good concordance with the observed frequency spectrum for pairs of identical SNPs (Fig. 4; Supplemental Fig. S1). Furthermore, Figure 4, A and B, shows that in the entire genome, as well as in periodic DNA, identical SNP pairs in distances d = 2, 4, 6, or 8 are highly over-represented compared to the expected frequency, whereas SNP pairs in distance 3 are less over-represented, and SNP pairs in distances 5, 7, or 9 are only slightly over-represented. The expected frequency of identical SNP pairs cannot be estimated for d = 1 in this way, because of the CpG mutational bias (Hwang and Green 2004
SNPs in periodic DNA have more genotyping problems than SNPs outside periodic DNA. Across all genotyped SNPs in all individuals from the HapMap project, genotyping failed in 41.1% of the cases for SNPs inside periodic DNA, but only in 19.9% of the cases for SNPs outside periodic DNA. This difference is highly significant (P-value < 10^-13). If we omit SNPs that failed to be genotyped in any individuals, the error rates are 21.4% inside periodic DNA and 12.3% outside periodic DNA, which is also highly significant (P-value < 10^-13).
Location of periodic DNA
Periodic DNA is under-represented in exons. Exons make up 2.1% of the entire genome, but only 1.4% of the periodic DNA is located in exons. The frequency pattern of pairs of identical SNPs in the overlap shows a damped version of the 2, 4, 6, 8 pattern (Fig. 4C), but the pairs of identical SNPs are not significantly over-represented (P = 0.38).
Periodic DNA does not correlate with transcripts. Transcripts (exons + introns) make up 37.5% of the genome, and 36.9% of the periodic DNA is located in transcripts. The 2, 4, 6, 8 pattern is highly abundant in transcripts (Fig. 4D) with an over-representation of pairs of identical SNPs (P < 10^-100).
Periodic DNA does not correlate with tandem repeats. Tandem repeats make up 2.80% of the genome, and 2.83% of the periodic DNA is located in tandem repeats. As expected from the periodic nature of tandem repeats, the 2, 4, 6, 8 pattern is abundant in the overlap of the two (Fig. 4E), and pairs of identical SNPs are highly over-represented compared to the expected level (P = 1.7 × 10^-27). Periodic DNA found in tandem repeats is longer (mean length 36.1 bp) than generally in the genome (mean 16.9 bp). The overlap contains 9.3% of all identical SNP pairs and 12.4% of all different SNP pairs found in periodic DNA. A possible explanation is that more SNP pairs are cut by the edges of short sequences.
Periodic DNA does not correlate with transposable elements. Transposable elements make up 46.4% of the genome, and 43.2% of periodic DNA is located in transposable elements.
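The comparisons in this section amount to a simple fold-enrichment calculation: the share of periodic DNA inside an element class divided by the share of the genome that the class occupies. The sketch below uses only the percentages quoted above; the dictionary layout and function name are illustrative, and no significance test is attempted.

```python
# (percent of genome covered by the element class, percent of periodic DNA located in it)
ELEMENTS = {
    "exons": (2.1, 1.4),
    "transcripts": (37.5, 36.9),
    "tandem repeats": (2.80, 2.83),
    "transposable elements": (46.4, 43.2),
}


def fold_enrichment(genome_pct, periodic_pct):
    """Values near 1 indicate no correlation; values below 1, under-representation."""
    return periodic_pct / genome_pct


if __name__ == "__main__":
    for name, (genome_pct, periodic_pct) in ELEMENTS.items():
        print(f"{name}: {fold_enrichment(genome_pct, periodic_pct):.2f}")
```

Exons stand out with a ratio of about two-thirds, whereas the other three classes stay close to 1.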
We have observed that identical pairs of SNPs in the human genome are more frequent in distances 2, 4, 6, and 8 bp than in distances 3, 5, 7, and 9 bp. The most immediate explanation of this observation would be sequencing errors and/or alignment errors. To rule out this possibility, we first compiled a set of high-quality SNPs, that is, SNPs that map to a unique position in the genome, and with an exact match between the flanking regions and the reference genome sequence. In this way, all SNPs that might be wrongly placed in the genome are excluded. Furthermore, to avoid study-specific ascertainment biases, we used all SNPs reported to dbSNP as a starting point. For this set of filtered SNPs, we observed that the pattern is highly pronounced. Furthermore, we observed that the pattern is persistent even when we ignore SNPs in genomic regions that may cause sequencing and/or alignment problems, for example, transposable elements, tandem repeats, and large duplicated regions (Bailey et al. 2001 Interestingly, the entire pattern is virtually embedded in periodic DNA, which makes up only 4.3% of the genome and has 1.8 times higher SNP density than the rest of the genome. Furthermore, periodic DNA is not correlated with tandem repeats or other repetitive elements, indicating that periodic DNA is different from these types of genomic elements. In the overlap of periodic DNA and exons, the 2, 4, 6, 8 pattern is damped, which may be because of selective constraints on exons. In contrast, the pattern is preserved in periodic DNA overlapping with transcripts (exons + introns), suggesting fewer (or no) selective constraints on introns.
Our results indicate that a proportion of all SNPs in the human genome is not created by independent single nucleotide mutations. We speculate that many different mechanisms such as polymerase slippage (Weber and Wong 1993 Alternatively, a complex process of context-dependent mutations could potentially create a similar pattern, although such a process may be difficult to envisage. We note, however, that the CpG mutation bias is caused by a context-dependent mutation process, and the possibility of a more elaborate process accounting for the observed pattern is difficult to rule out per se. The exact nature of the molecular mechanism(s) is to be revealed in future studies. In conclusion, our results show that periodic DNA has some distinctive genomic features: (1) there is an excess of SNPs in periodic DNA compared to non-periodic DNA; (2) SNPs in periodic DNA are distributed according to a 2, 4, 6, 8 pattern; (3) care should be taken in analysis of SNPs from periodic DNA since SNPs in periodic DNA have a higher genotyping error rate than SNPs outside periodic DNA. The latter may have important consequences for SNP and association studies.
Reference sequence
Reference sequence hg17 (NCBI build 35) was used (2001) (International Human Genome Sequencing Consortium 2004
Genomic elements
Tandem repeat regions
Transposable elements
Large duplicated regions
SNP data
We only selected unambiguously mapped SNPs, where the flanking sequences surrounding a SNP had exactly one hit to the human genome (weight = 1). To avoid SNPs with potential alignment problems on the local scale (<10 bp, e.g., due to indels), we only selected SNPs that were perfectly mapped on the local scale, i.e., where the alignment of the flanking sequences and the reference genome were exactly 1 bp apart (location type = exact). To ensure that our automated filtering process removed all alignment problems, we manually evaluated 17 random pairs of identical SNPs from periodic DNA in the UCSC Genome Browser (Kent et al. 2002
By applying the above filtering criteria, we ended up with 4,576,203 SNPs out of a total of 10,430,753 SNPs in dbSNP125 (Sherry et al. 1999
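The two mapping criteria named above (weight = 1 and location type = exact) can be expressed as a simple record filter. The sketch below is an assumption-laden illustration: the `SnpRecord` class and its field names are hypothetical and do not reflect the actual dbSNP table schema, and any further filtering criteria used in the study are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpRecord:
    # Hypothetical record layout; field names are illustrative only.
    rs_id: str
    weight: int      # number of genome hits for the flanking sequences
    loc_type: str    # e.g. "exact" when flanks align exactly 1 bp apart


def passes_filter(snp):
    """Keep SNPs that map uniquely (weight = 1) and exactly (location type = exact)."""
    return snp.weight == 1 and snp.loc_type == "exact"


if __name__ == "__main__":
    records = [
        SnpRecord("rs_a", 1, "exact"),
        SnpRecord("rs_b", 2, "exact"),   # multiple genome hits: rejected
        SnpRecord("rs_c", 1, "range"),   # not an exact location: rejected
    ]
    print([r.rs_id for r in records if passes_filter(r)])

    # Fraction of dbSNP125 retained, from the counts reported in the text.
    retained = 4_576_203 / 10_430_753
    print(f"{retained:.1%}")
```

With the counts reported in the text, a little under half of the SNPs in dbSNP125 survive the filtering.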
The data set containing all genotyped HapMap SNPs was downloaded from the HapMap site (http://www.hapmap.org/genotypes/; build 21a, NCBI 35), including all redundant, unfiltered SNPs and all individuals from all populations (The International HapMap Consortium 2003
HapMap-ENCODE regions
Periodic DNA
This is implemented by looking at one period (p) at a time. For each p, a window of 3p bp (or 9 bp if p = 1, 2) is moved over the entire marked reference sequence (criteria a and b), and the window is marked as periodic DNA if the pattern meets criterion c. Finally, all marked windows are collapsed into regions of periodic DNA, and the smallest possible period is assigned to each region. The criterion of at most p/4 mismatches ensures that short segments of periodic DNA (9–12 bp) have a perfect periodic pattern, whereas the longer segments are allowed to have a few mismatches.
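The sliding-window procedure described above can be sketched for a single period p as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: criteria a and b (the marking of the reference sequence) are not reproduced, a mismatch is assumed to be a position i in the window where the base differs from the base at i + p, and the mismatch allowance is left as an explicit parameter (`max_mismatches`, default 0) because the exact counting rule cannot be recovered from the text.

```python
def window_length(p):
    """Window of 3p bp, or 9 bp for periods 1 and 2, as stated in the text."""
    return 9 if p in (1, 2) else 3 * p


def count_mismatches(window, p):
    """Number of positions whose base differs from the base p positions later (assumed definition)."""
    return sum(1 for i in range(len(window) - p) if window[i] != window[i + p])


def periodic_regions(seq, p, max_mismatches=0):
    """Return half-open (start, end) regions of seq that are periodic with period p.

    Each window is tested separately; overlapping or adjacent marked windows
    are collapsed into one region.
    """
    w = window_length(p)
    regions = []
    for start in range(len(seq) - w + 1):
        if count_mismatches(seq[start:start + w], p) <= max_mismatches:
            end = start + w
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend the previous region
            else:
                regions.append((start, end))
    return regions


if __name__ == "__main__":
    seq = "GGC" + "AT" * 6 + "CCG"
    for start, end in periodic_regions(seq, p=2):
        print(start, end, seq[start:end])
```

The sketch handles one period at a time and does not assign the smallest possible period to a region; a full version would run all periods and merge the results.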
Estimation of expected frequencies
Test for over-representation of pairs of identical SNPs
$\sum_{g,h} p_d(g,h)$ is the probability of obtaining any SNP pair in distance $d$.
Software
We thank Frank Grønlund Jørgensen and Mikkel Heide Schierup for helpful discussions, and Enette Berndt Knudsen for excellent technical assistance. C.W. is supported by the Danish Cancer Society. P.V. is supported by the Lundbeck Foundation, Denmark.
3 Corresponding author.
E-mail wiuf{at}birc.au.dk; fax 45-89423077. [Supplemental material is available online at www.genome.org.] Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.6223207
Bailey, J.A., Yavor, A.M., Massa, H.F., Trask, B.J., and Eichler, E.E. 2001. Segmental duplications: Organization and impact within the current human genome project assembly. Genome Res. 11: 1005–1017.
Bailey, J.A., Gu, Z., Clark, R.A., Reinert, K., Samonte, R.V., Schwartz, S., Adams, M.D., Myers, E.W., Li, P.W., and Eichler, E.E. 2002. Recent segmental duplications in the human genome. Science 297: 1003–1007.
Benson, G. 1999. Tandem repeats finder: A program to analyze DNA sequences. Nucleic Acids Res. 27: 573–580.
Clark, A.G., Nielsen, R., Signorovitch, J., Matise, T.C., Glanowski, S., Heil, J., Winn-Deen, E.S., Holden, A.L., and Lai, E. 2003. Linkage disequilibrium and inference of ancestral recombination in 538 single-nucleotide polymorphism clusters across the human genome. Am. J. Hum. Genet. 73: 285–300.
Clark, A.G., Hubisz, M.J., Bustamante, C.D., Williamson, S.H., and Nielsen, R. 2005. Ascertainment bias in studies of human genome-wide polymorphism. Genome Res. 15: 1496–1502.
Conrad, D.F., Andrews, T.D., Carter, N.P., Hurles, M.E., and Pritchard, J.K. 2006. A high-resolution survey of deletion polymorphism in the human genome. Nat. Genet. 38: 75–81.
The ENCODE Project Consortium. 2004. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science 306: 636–640.
The ENCODE Project Consortium. 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 447: 799–816.
Fredman, D., White, S.J., Potter, S., Eichler, E.E., Dunnen, J.T.D., and Brookes, A.J. 2004. Complex SNP-related sequence variation in segmental genome duplications. Nat. Genet. 36: 861–866.
Freeman, J.L., Perry, G.H., Feuk, L., Redon, R., McCarroll, S.A., Altshuler, D.M., Aburatani, H., Jones, K.W., Tyler-Smith, C., Hurles, M.E., et al. 2006. Copy number variation: New insights in genome diversity. Genome Res. 16: 949–961.
Gore, J.M., Ran, F.A., and Ornston, L.N. 2006. Deletion mutations caused by DNA strand slippage in Acinetobacter baylyi. Appl. Environ. Microbiol. 72: 5239–5245.
Holliday, R. 1964. A mechanism for gene conversion in fungi. Genet. Res. 5: 282–304.
Hwang, D.G. and Green, P. 2004. Inaugural article: Bayesian Markov chain Monte Carlo sequence analysis reveals varying neutral substitution patterns in mammalian evolution. Proc. Natl. Acad. Sci. 101: 13994–14001.
The International HapMap Consortium. 2003. The International HapMap Project. Nature 426: 789–796.
International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860–921.
International Human Genome Sequencing Consortium. 2004. Finishing the euchromatic sequence of the human genome. Nature 431: 931–945.
Jeffreys, A.J., Barber, R., Bois, P., Buard, J., Dubrova, Y.E., Grant, G., Hollies, C.R.H., May, C.A., Neumann, R., Panayi, M., et al. 1999. Human minisatellites, repeat DNA instability and meiotic recombination. Electrophoresis 20: 1665–1675.
Jurka, J. 2000. Repbase Update: A database and an electronic journal of repetitive elements. Trends Genet. 16: 418–420.
Karolchik, D., Hinrichs, A.S., Furey, T.S., Roskin, K.M., Sugnet, C.W., Haussler, D., and Kent, W.J. 2004. The UCSC Table Browser data retrieval tool. Nucleic Acids Res. 32: D493–D496.
Kent, W.J., Sugnet, C.W., Furey, T.S., Roskin, K.M., Pringle, T.H., Zahler, A.M., and Haussler, A.D. 2002. The Human Genome Browser at UCSC. Genome Res. 12: 996–1006.
Koboldt, D.C., Raymond, M.D., and Kwok, P.-Y. 2006. Distribution of human SNPs and its effect on high-throughput genotyping. Hum. Mutat. 27: 249–254.
Lewin, B. 2004. Genes VIII. Prentice-Hall, Upper Saddle River, NJ.
Peer, I., Chretien, Y.R., de Bakker, P.I.W., Barrett, J.C., Daly, M.J., and Altshuler, D.M. 2006. Biases and reconciliation in estimates of linkage disequilibrium in the human genome. Am. J. Hum. Genet. 78: 588–603.
R Development Core Team. 2006. R: A language and environment for statistical computing. Foundation for Statistical Computing, Vienna, Austria.
Redon, R., Ishikawa, S., Fitch, K.R., Feuk, L., Perry, G.H., Andrews, T.D., Fiegler, H., Shapero, M.H., Carson, A.R., Chen, W., et al. 2006. Global variation in copy number in the human genome. Nature 444: 444–454.
Sherry, S.T., Ward, M., and Sirotkin, K. 1999. dbSNP—Database for single nucleotide polymorphisms and other classes of minor genetic variation. Genome Res. 9: 677–679.
Stoneking, M. 2001. Single nucleotide polymorphisms. From the evolutionary past. Nature 409: 821–822.
Tuzun, E., Sharp, A.J., Bailey, J.A., Kaul, R., Morrison, V.A., Pertz, L.M., Haugen, E., Hayden, H., Albertson, D., Pinkel, D., et al. 2005. Fine-scale structural variation of the human genome. Nat. Genet. 37: 727–732.
Walsh, P.S., Fildes, N.J., and Reynolds, R. 1996. Sequence analysis and characterization of stutter products at the tetranucleotide repeat locus vWA. Nucleic Acids Res. 24: 2807–2812.
Weber, J.L. and Wong, C. 1993. Mutation of human short tandem repeats. Hum. Mol. Genet. 2: 1123–1128.
Received December 20, 2006; accepted in revised format June 18, 2007.