Published online before print March 6, 2008
Genome Research, DOI: 10.1101/gr.070169.107
Accepted Preprint; Open Access Article
Methods and Resources
Detecting Polymorphic Regions in the Arabidopsis thaliana Genome with Resequencing Microarrays
1 Friedrich Miescher Laboratory of the Max Planck Society; 2 Max Planck Institute for Developmental Biology
Whole-genome oligonucleotide resequencing arrays have allowed the comprehensive discovery of single nucleotide polymorphisms (SNPs) in eukaryotic genomes of moderate to large size. With this technology, the detection rate for isolated SNPs is typically high. However, it is greatly reduced when other polymorphisms are located near a SNP, because multiple mismatches inhibit hybridization to arrayed oligonucleotides. Contiguous tracts of suppressed hybridization therefore typify polymorphic regions such as clusters of SNPs or deletions. We developed a machine learning method, designated margin-based Prediction of Polymorphic Regions (mPPR), to predict Polymorphic Regions (PRs) from resequencing array data. Conceptually similar to Hidden Markov Models, the method is trained with discriminative learning techniques related to Support Vector Machines, and accurately identifies even very short polymorphic tracts (<10 bp). We applied this method to resequencing array data previously generated for the euchromatic genomes of 20 strains (accessions) of the best-characterized plant, Arabidopsis thaliana. Non-redundantly, 27% of the genome was included within the boundaries of PRs predicted at high specificity (
Correspondence: 3 E-mail: gunnar.raetsch{at}tuebingen.mpg.de
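
The idea that contiguous tracts of suppressed hybridization mark polymorphic regions can be illustrated with a small two-state segmentation. The sketch below is not mPPR: the abstract states that the method's parameters are learned with discriminative, margin-based training, whereas here the scores are set by hand. It only shows the Hidden-Markov-style decoding step (Viterbi) that such a model shares conceptually. The function names, the `threshold` and `switch_penalty` parameters, and the assumption that intensities are normalized to roughly the range 0 to 1 are all hypothetical and chosen for illustration.

```python
def viterbi_two_state(intensities, threshold=0.5, switch_penalty=2.0):
    """Label each position 0 (conserved) or 1 (polymorphic region).

    Hand-set scores stand in for learned parameters (an assumption):
    low intensity supports state 1, high intensity supports state 0,
    and every change of state costs `switch_penalty`.
    """
    n = len(intensities)
    if n == 0:
        return []

    def emit(x, state):
        # Positive when the observed intensity agrees with the state.
        return (threshold - x) if state == 1 else (x - threshold)

    score = [[0.0, 0.0] for _ in range(n)]  # best path score ending in each state
    back = [[0, 0] for _ in range(n)]       # predecessor state on that best path
    score[0] = [emit(intensities[0], 0), emit(intensities[0], 1)]

    for i in range(1, n):
        for s in (0, 1):
            stay = score[i - 1][s]
            switch = score[i - 1][1 - s] - switch_penalty
            if stay >= switch:
                best, prev = stay, s
            else:
                best, prev = switch, 1 - s
            score[i][s] = best + emit(intensities[i], s)
            back[i][s] = prev

    # Trace back from the better-scoring final state.
    state = 0 if score[-1][0] >= score[-1][1] else 1
    path = [state]
    for i in range(n - 1, 0, -1):
        state = back[i][state]
        path.append(state)
    path.reverse()
    return path


def regions(labels):
    """Return half-open (start, end) intervals covering runs of label 1."""
    out = []
    start = None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i
        elif label == 0 and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(labels)))
    return out
```

With these hand-set scores, a sustained dip in intensity is reported as one region, while a single low probe is absorbed into the surrounding conserved state whenever the switching penalty outweighs the evidence. A trained model would replace both the emission scores and the penalty with learned values.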