Published online before print
April 12, 2004, 10.1101/gr.2255804 Genome Res. 14:870-877, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Methods
Decoding Randomly Ordered DNA Arrays
1 Illumina, Inc., San Diego, California 92121, USA
2 Genomics Institute of the Novartis Research Foundation, San Diego, California 92121, USA
We have developed a simple and efficient algorithm to identify each member of a large collection of DNA-linked objects through the use of hybridization, and have applied it to the manufacture of randomly assembled arrays of beads in wells. Once the algorithm has been used to determine the identity of each bead, the microarray can be used in a wide variety of applications, including single nucleotide polymorphism genotyping and gene expression profiling. The algorithm requires only a few labels and several sequential hybridizations to identify thousands of different DNA sequences with great accuracy. We have decoded tens of thousands of arrays, each with 1520 sequences represented at 30-fold redundancy by up to 50,000 beads, with a median error rate of <1 × 10^-4 per bead. The approach makes use of error checking codes and provides, for the first time, a direct functional quality control of every element of each array that is manufactured. The algorithm can be applied to any spatially fixed collection of objects or molecules that are associated with specific DNA sequences.
Microarray technology, devised for the analysis of complex biological systems, uses the ability of a DNA strand to hybridize specifically to its complement to extract thousands of measurements at a time from a single sample (Watson and Crick 1953
Conventional microarrays are manufactured by spotting or synthesizing probes at known locations on a two-dimensional substrate (Fodor et al. 1991
Although randomly assembled arrays were recognized from the outset as a potentially revolutionary approach to microarray technology, the initial attempts to determine the location and identity of beads could only distinguish a few codes, limiting the usefulness of the approach (Michael et al. 1998
Design of DNA-Based Decoding
Our algorithm uses sequential hybridizations of dye-labeled oligonucleotides, or decoders, complementary to bead sequences to create a combinatorial decoding scheme for arrays. It is distinct from sequencing by hybridization (SBH), which has been used successfully to characterize sequences de novo by hybridization to all n-mers or a well-chosen subset, typically in the range of 4- to 10-mers (Drmanac et al. 1996
To illustrate, we show an example of decoding eight different bead types. We use two fluorescent labels, or states (green and red) in combination with three sequential hybridizations, or stages. The sequential hybridization process is illustrated for a single bead (bead type 2 of eight) in Figure 2A. In each of the three stages, the bead is "colored" by hybridization to a fluorescently labeled decoder oligonucleotide (Fig. 2A,B). In practice, all beads in the array are labeled simultaneously at each stage, by exposure to a pooled set of decoders, so that the process is intrinsically parallel and efficient.
The combinatorial assignment of green and red within each pool of eight decoders is shown in Figure 2C. There are three decoder pools in total, one for each stage. The stage 1 pool has the first four decoders colored green and the last four colored red. The decoders in subsequent stages are labeled so that after three stages each bead type is assigned a unique three-bit color code. Note that the sequences of the decoders are unchanged from stage to stage; only the fluorescent labels are varied.
The bead circled in Figure 2B has the color signature (GRG), or 010 code in binary representation, in which G = 0 and R = 1. Its sequence can be identified as sequence 2 by referring to the color-lookup table in Figure 2C. Although assignment of codes to sequences is unambiguous after three stages, additional stages can be added for error checking purposes (last column of Fig. 2C) to be described below.
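The eight-type example can be sketched in a few lines of Python. This is only an illustration of the combinatorial idea, not the authors' software: it assumes bead types are numbered 0 to 7 by the binary value of their code (Figure 2C is not reproduced here), and the function names are hypothetical.

```python
COLORS = "GR"  # G = 0 (green), R = 1 (red), as in the text

def decoder_pools(n_types, n_stages):
    """For each stage, list the color assigned to each decoder (bead type).

    Assumption: stage 1 carries the most significant bit of the type number.
    """
    pools = []
    for stage in range(n_stages):
        shift = n_stages - 1 - stage
        pools.append([COLORS[(t >> shift) & 1] for t in range(n_types)])
    return pools

def decode(signature):
    """Map an observed color signature such as 'GRG' back to a bead type."""
    bits = "".join(str(COLORS.index(color)) for color in signature)
    return int(bits, 2)

pools = decoder_pools(8, 3)
print(pools[0])  # stage 1: first four decoders green, last four red

# Colors seen on a bead of type 2 across the three stages.
signature = "".join(pool[2] for pool in pools)
print(signature, decode(signature))  # GRG 2
```

Because the decoder sequences stay the same and only the labels change, each stage is just a relabeling of the same pool, which is what `decoder_pools` returns.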
In this simple fashion, the eight bead types are decoded with three stages and two color labels. The approach scales exponentially. If there are N bead types and k distinguishable labels, or states, then the number of stages required is $\lceil \log_k N \rceil$.
There are a number of ways of creating different states. Distinct fluorescent labels can be used as described in Figure 2, or the intensity levels of fluorescent labels can be varied to create grayscale states. We use a process that decodes 1520 different bead types by using three states: two fluorescent "ON" states (FAM and CY3 fluorescent labels) and one nonfluorescent "OFF" state. The logarithmic relationship between the number of bead types and the number of decode stages shows that the 1520 bead types can be decoded in only
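As a quick check of the logarithmic relationship, the sketch below computes the smallest number of stages n for which k^n is at least N. The function name is hypothetical, and integer arithmetic is used to avoid floating-point rounding in the logarithm.

```python
def stages_required(n_types, n_states):
    """Smallest number of stages n such that n_states ** n >= n_types."""
    stages = 0
    capacity = 1
    while capacity < n_types:
        capacity *= n_states
        stages += 1
    return stages

print(stages_required(8, 2))     # 3 (two colors, eight bead types)
print(stages_required(1520, 3))  # 7 (three states, 1520 bead types)
```

Three states give 3^7 = 2187 distinct codes, enough for 1520 bead types. The eight-stage design described in the error checking discussion exceeds this minimum, which is consistent with spending the extra capacity on error detection.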
Error Checking
The simplest form of error checking we have used has a parity bit and is illustrated in Figure 2C. The assigned four-bit codes all have an even parity bit sum and are termed valid codes. An error in a single decode stage is a one-bit error that creates an odd parity, or invalid code.
Although the parity-based approach is effective, we have optimized error checking by designing a more advanced scheme that assigns codes in a way that takes into account inherent biases in error rates and enables estimation of the misclassification rate. In our implementation of three-state decoding, the most common errors are transitions from an ON state, one or two, to the OFF state, zero. Less common are transitions from OFF to ON. Transitions from one ON state to the other ON state are extremely rare. This can be explained by the fact that such transitions require two simultaneous classification errors: a mistaken ON in one color channel and a mistaken OFF call in the other color channel.
We used these biases in error rates by designing the decode stages so that every valid code has a fixed number of OFF states. For example, if there are 1520 bead types and we use three-state eight-stage decoding, then each valid code would be designed to be OFF in exactly two stages and ON in exactly six stages (the actual scheme used is a slight variant on this design; see Methods). An example of a valid code would be 21110210. With this scheme, it is theoretically impossible to misclassify a code through any number of occurrences of the most common error type: a transition from an ON state to an OFF state. The following events can lead to misclassification: a transition from one ON state to the other, or multiple stage errors with at least one ON-to-OFF transition and at least one OFF-to-ON transition. Both events are extremely rare.
The space of all possible color codes is divided into three categories: used valid, unused valid, and invalid (Table 1).
The unused valid codes represent codes that could be assigned to bead types but are not currently in use. Monitoring the number of beads decoding to this category allows an estimation of the true misclassification rate, as described in the Methods section. With little extra cost to the overall process, this error checking scheme monitors the rate of single state errors, minimizes the number of misclassified beads, and permits the number of misclassified beads, one of the key determinants of array quality, to be estimated.
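The illustrative design (eight stages, three states, exactly two OFF stages per valid code) is small enough to enumerate. The sketch below counts the three code categories for that design and checks that ON-to-OFF errors alone cannot turn a valid code into another valid code. Because the text notes that the production scheme is a slight variant, these counts describe the illustrative design only and need not match Table 1; all names are hypothetical.

```python
from itertools import product

N_STAGES = 8
N_OFF = 2      # every valid code is OFF (0) in exactly two stages
N_USED = 1520  # bead types that are actually assigned a code

def is_valid(code):
    """A code is valid when it has exactly N_OFF zeros."""
    return code.count(0) == N_OFF

all_codes = list(product((0, 1, 2), repeat=N_STAGES))
valid = [code for code in all_codes if is_valid(code)]

def on_to_off_errors(code):
    """Yield every code reachable by switching one or more ON stages to OFF."""
    on_positions = [i for i, state in enumerate(code) if state != 0]
    for mask in range(1, 2 ** len(on_positions)):
        corrupted = list(code)
        for bit, pos in enumerate(on_positions):
            if (mask >> bit) & 1:
                corrupted[pos] = 0
        yield tuple(corrupted)

example = (2, 1, 1, 1, 0, 2, 1, 0)  # the code 21110210 from the text

print(len(all_codes))                    # 6561 possible codes (3 ** 8)
print(len(valid), len(valid) - N_USED)   # 1792 valid, 272 unused valid
print(any(is_valid(c) for c in on_to_off_errors(example)))  # False
```

Every ON-to-OFF error adds a zero, so a corrupted code always has more than two OFF stages and lands in the invalid category rather than on another bead type's code.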
Decoding of Randomly Assembled Arrays
By using this approach, arrays of 1520 different bead types (approximately 50,000 total beads) were decoded 96 at a time in the Sentrix array matrix format. Representative examples from the decoding are shown in Figure 3, and summary statistics are presented in Table 2
By using the core-by-core algorithm, we have decoded tens of thousands of arrays with a median random error rate of <1 × 10^-4 per bead (Table 2). The rate was estimated by using the error checking scheme summarized in Table 1. To get a more direct measure of random error rates, we decoded a matrix of 96 arrays twice and considered beads that were decoded to used codes in both decoding events (average of 44,912 ± 1098 beads/array). The misclassification rate was then estimated by dividing the number of discrepant calls by the total number of calls. This is an upper bound on the rate, as the errors are distributed among the two decoding events. The mean misclassification rate obtained in this way was 3.8 × 10^-5 with a 95% confidence interval of zero to 1.8 × 10^-4. The results are consistent with the estimates obtained in Table 2. This analysis does not account for any systematic misclassification errors, but functional tests (e.g., genotyping comparison studies with other technologies) have not identified any systematic misclassification (data not shown).
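The repeat-decoding estimate can be expressed as a short function. This is a minimal sketch under stated assumptions: each decoding run is represented as a list of bead-type calls in the same bead order, `None` marks a bead that did not decode to a used code, and "total number of calls" is taken to mean the number of beads compared. The data are invented for illustration.

```python
def discrepancy_rate(calls_a, calls_b):
    """Fraction of beads whose two decode calls disagree.

    Only beads decoded to a used code in both runs are compared.
    """
    both = [(a, b) for a, b in zip(calls_a, calls_b)
            if a is not None and b is not None]
    if not both:
        raise ValueError("no beads decoded to used codes in both runs")
    discrepant = sum(1 for a, b in both if a != b)
    return discrepant / len(both)

# Toy example: six beads, one undecoded in run 1, one discrepant call.
run1 = [17, 17, 902, None, 45, 45]
run2 = [17, 17, 902, 300, 45, 46]
print(discrepancy_rate(run1, run2))  # 0.2
```

A discrepancy only shows that at least one of the two runs was wrong, which is why the text treats the resulting rate as an upper bound on the per-run error rate.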
Error Rate Impact
A fundamental difference between randomly assembled arrays and conventional ordered arrays is that the number of beads (or probes) of each type is intrinsically a random variable with a Poisson sampling distribution for the former (Fig. 4) and is fixed and defined for the latter. Each randomly assembled array is effectively unique, having different numbers and arrangements of beads from array to array yet decoded by a single universal process. This notion is accepted for "liquid arrays" (Fulton et al. 1997; P. Ng, pers. comm.). At the same time, the random distribution of beads minimizes the chance of any local problem affecting the overall result, increasing robustness of the system. The only added requirement of using "unique" arrays is that the analytical data extraction step must use a uniquely defined template for each particular array. This was easy to implement as part of our data extraction software.
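To give a feel for what Poisson sampling implies, the sketch below estimates how many of the 1520 bead types would be expected to be absent, or present in fewer than ten copies, on a single array. It assumes a mean of 30 beads per type, taken from the 30-fold redundancy quoted in the abstract; the cutoff of ten copies is an arbitrary illustration and not a threshold from the paper.

```python
import math

def poisson_cdf(k, lam):
    """P(X <= k) for X ~ Poisson(lam), summed term by term."""
    term = math.exp(-lam)
    total = term
    for i in range(1, k + 1):
        term *= lam / i
        total += term
    return total

lam = 30.0      # assumed mean beads per type (30-fold redundancy)
n_types = 1520

p_zero = math.exp(-lam)           # probability a given type is missing
p_under_10 = poisson_cdf(9, lam)  # probability of fewer than ten copies

print("expected missing types per array:", p_zero * n_types)
print("expected types with < 10 beads:", p_under_10 * n_types)
```

Under these assumptions both expectations are far below one bead type per array, which illustrates why a random number of beads per type is compatible with reliable coverage at this redundancy.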
Our work provides a high-performance alternative to conventional microarrays. It also expands the reach of microarray assays. For example, we have used a highly miniaturized array format to construct a 96-array matrix for processing many microarray experiments cost-effectively, for 1500 assays at a time. This provides much needed statistical power that is difficult and prohibitively expensive to obtain by using conventional microarrays, and has the potential to speed the transition of microarray-based assays to large-scale clinical application.
We have used decoded arrays to create new assays for large-scale genotyping (Fan et al. 2003
Importantly, our technology has been proven in highly demanding and competitive large-scale genomics applications, in particular SNP genotyping (Fan et al. 2003
Random arrays have been particularly useful for the accurate, high-throughput, and cost-effective analysis of large numbers of samples for
Finally, the decoding algorithm is general and can, in principle, be applied to any spatially fixed collection of objects or molecules that are associated with specific DNA sequences. In genomics, the classification and characterization of large collections of sequences is often a key step in the analysis of complex biological systems. For example, a library of DNA clones is traditionally searched for a single gene of interest by hybridization to a labeled DNA probe (Sambrook 1989
In conclusion, we have developed a new and scalable way to make a novel type of microarray. The highly miniaturized arrays have performed well in a variety of applications. The decoding approach used to make them is both accurate and robust and can, in principle, be used to identify not only DNA sequences on beads but also other collections of DNA sequences.
Preparation of Oligonucleotide-Linked Beads and Bead Pools
A set of 1536 universal capture oligonucleotide sequences was synthesized. Each sequence was individually immobilized on activated beads, as follows, to create 1536 bead types. Silica beads 3 µm in diameter (Bangs Laboratories, Inc.) were amino functionalized by incubation in 2.5% 3-amino-propyl-trimethoxysilane (Aldrich) in ethanol for 1 h at room temperature and then activated by reaction with 2% 2,4,6-trichloro-1,3,5-triazine (Aldrich) in acetonitrile for 2 h at room temperature. Synthetic oligonucleotides labeled at the 5' terminus with a primary amine were covalently attached to the activated beads by overnight reaction at 50°C in a solution of 3 M NaCl and 100 mM sodium carbonate (pH 11). All reagents were of highest purity grade. Empirical measurements indicated that, on average, each bead carried on the order of 10^6 oligonucleotides. Following quality assessment, 16 of the bead types were discarded due to low signal-to-noise ratios. The remaining bead types were combined to create a pool containing 1520 functional bead types, each representing a unique capture sequence.
The sequences were designed to serve as noninteracting decodable address sequences in addition to their function as probes that capture assay products from solution. They were selected to be 22 to 24 bases long with minimal cross-complementarity, similar GC content and Tm, no runs of a single base longer than five, and low similarity to human genomic sequences.
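Some of the stated selection criteria are simple enough to express as a filter. The sketch below covers only length, homopolymer runs, and GC content. The GC window of 40% to 60% is a hypothetical placeholder, since the text says only that GC content was "similar", and the cross-complementarity, Tm, and human-genome similarity checks are not attempted.

```python
import re

def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def passes_filters(seq, gc_min=0.40, gc_max=0.60, max_run=5):
    """Apply the length, homopolymer-run, and GC filters to one candidate.

    gc_min and gc_max are illustrative assumptions, not values from the paper.
    """
    seq = seq.upper()
    if not 22 <= len(seq) <= 24:
        return False
    # A base followed by max_run or more copies of itself is a run
    # longer than max_run.
    if re.search(r"(.)\1{%d,}" % max_run, seq):
        return False
    return gc_min <= gc_fraction(seq) <= gc_max

print(passes_filters("ACGTACGTACGTACGTACGTAC"))    # True: 22 bases, 50% GC
print(passes_filters("ACGTAAAAAAGTACGTACGTACGT"))  # False: run of six A
print(passes_filters("ACGT"))                      # False: too short
```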
Preparation of Oligonucleotides and Pools Used in the Decode Process
Design of Decoder Pools
Assembly of Array Matrices
Decoding
Image Processing and Data Extraction
Decoding Algorithms
Pseudocode for Stage-by-Stage Decoding
For each color channel do
Tabulate the zero and one stage and color information into the decode signatures.
The second algorithm is called core-by-core decoding. For each bead and color channel, we consider the eight intensity values across decode stages. The values are sorted, and the greatest relative intensity increase is determined. This is the separation between the ON and OFF states for the core. The same procedure is repeated for all beads in both color channels. The results are combined to give the decode signatures.
Pseudocode for Core-by-Core Decoding
For each color channel do
Tabulate the zero and one core and color information into the decode signatures.
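The core-by-core description can be turned into a small runnable sketch. It follows the stated steps (sort the eight intensities for one bead in one channel, find the greatest relative increase, split ON from OFF there), but several details are assumptions because the full pseudocode is not available here: the midpoint threshold, the hypothetical `min_ratio` guard for a channel that is never ON, the mapping FAM = 1 and CY3 = 2, and the handling of a stage that is ON in both channels. Intensities are invented and assumed positive.

```python
def split_on_off(intensities, min_ratio=2.0):
    """Call each stage ON (1) or OFF (0) for one bead in one color channel."""
    ordered = sorted(intensities)
    best_ratio, threshold = 0.0, None
    for low, high in zip(ordered, ordered[1:]):
        ratio = high / max(low, 1e-9)  # guard against a zero intensity
        if ratio > best_ratio:
            best_ratio, threshold = ratio, (low + high) / 2.0
    # Assumption: without a clear jump, the channel is OFF in every stage.
    if threshold is None or best_ratio < min_ratio:
        return [0] * len(intensities)
    return [1 if value > threshold else 0 for value in intensities]

def decode_bead(fam, cy3, min_ratio=2.0):
    """Combine both channels into a three-state signature (0, 1, or 2)."""
    fam_on = split_on_off(fam, min_ratio)
    cy3_on = split_on_off(cy3, min_ratio)
    signature = []
    for f, c in zip(fam_on, cy3_on):
        if f and c:
            return None  # ON in both channels: treated as undecodable here
        signature.append(1 if f else 2 if c else 0)
    return tuple(signature)

fam = [210, 5200, 4900, 5100, 190, 230, 4800, 200]
cy3 = [3900, 150, 170, 160, 140, 4100, 155, 145]
print(decode_bead(fam, cy3))  # (2, 1, 1, 1, 0, 2, 1, 0)
```

Because the split is found separately for every bead, the method adapts to bead-to-bead differences in overall brightness, at the cost of needing a rule such as `min_ratio` for channels with no ON stage.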
The two methods give virtually identical results. In practice, we use the core-by-core method to decode the arrays, and obtain quality control information from the histograms. As part of the decoding process, quantitative metrics for array quality are output automatically and can be stored in a database. All processing and quality metric generation takes
Estimation of Misclassification Rate
We are grateful to our Illumina colleagues in array manufacturing, process development, and engineering for invaluable technical assistance, and to Bob Kain and David Barker for numerous helpful discussions and insights. This work was supported in part by National Institutes of Health grants R44 HG02003-01, R21 HG01911, and R43 CA81952 to M.S.C. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2255804. Article published online before print in April 2004.
3 These authors contributed equally to this work.
4 Corresponding author.
Alwine, J.C., Kemp, D.J., Parker, B.A., Reiser, J., Renart, J., Stark, G.R., and Wahl, G.M. 1979. Detection of specific RNAs or specific fragments of DNA by fractionation in gels and transfer to diazobenzyloxymethyl paper. Methods Enzymol. 68: 220-242.
Barker, D.L., Therault, G., Che, D., Dickinson, T., Shen, R., and Kain, R. 2003. Self-assembled random arrays: High-performance imaging and genomics applications on a high-density microarray platform. Proc. SPIE 4966: 1-11.
Battaglia, C., Salani, G., Consolandi, C., Bernardi, L.R., and De Bellis, G. 2000. Analysis of DNA microarrays by non-destructive fluorescent staining using SYBR green II. Biotechniques 29: 78-81.
Braeckmans, K., De Smedt, S.C., Leblans, M., Pauwels, R., and Demeester, J. 2002. Encoding microcarriers: Present and future technologies. Nat. Rev. Drug Discov. 1: 447-456.
Brenner, S., Johnson, M., Bridgham, J., Golda, G., Lloyd, D.H., Johnson, D., Luo, S., McCurdy, S., Foy, M., Ewan, M., et al. 2000. Gene expression analysis by massively parallel signature sequencing (MPSS) on microbead arrays. Nat. Biotechnol. 18: 630-634.
Chan, W.C., Maxwell, D.J., Gao, X., Bailey, R.E., Han, M., and Nie, S. 2002. Luminescent quantum dots for multiplexed biological detection and imaging. Curr. Opin. Biotechnol. 13: 40-46.
Chee, M., Yang, R., Hubbell, E., Berno, A., Huang, X.C., Stern, D., Winkler, J., Lockhart, D.J., Morris, M.S., and Fodor, S.P. 1996. Accessing genetic information with high-density DNA arrays. Science 274: 610-614.
Drmanac, S., Stavropoulos, N.A., Labat, I., Vonau, J., Hauser, B., Soares, M.B., and Drmanac, R. 1996. Gene-representing cDNA clusters defined by hybridization of 57,419 clones from infant brain libraries with short oligonucleotide probes. Genomics 37: 29-40.
Drmanac, S., Kita, D., Labat, I., Hauser, B., Schmidt, C., Burczak, J.D., and Drmanac, R. 1998. Accurate sequencing by hybridization for DNA diagnostics and individual genomics. Nat. Biotechnol. 16: 54-58.
Fan, J.-B., Oliphant, A., Shen, R., Kermani, B.G., Garcia, F., Gunderson, K.L., Hansen, M., Steemers, F., Butler, S.L., Deloukas, P., et al. 2003. Highly parallel SNP genotyping. Cold Spring Harbor Symp. Biol. 68: (in press).
Fan, J.-B., Yeakley, J.M., Bibikova, M., Chudin, E., Wickham, E., Chen, J., Doucet, D., Rigault, P., Zhang, B., Shen, R., et al. 2004. A versatile assay for high-throughput gene expression profiling on universal array matrices. Genome Res. (this issue).
Fodor, S.P.A., Read, J.L., Pirrung, M.C., Stryer, L., Lu, A.T., and Solas, D. 1991. Light-directed, spatially addressable parallel chemical synthesis. Science 251: 767-773.
Fulton, R.J., McDade, R.L., Smith, P.L., Kienker, L.J., and Kettman Jr., J.R. 1997. Advanced multiplexed analysis with the FlowMetrix system. Clin. Chem. 43: 1749-1756.
Galinsky, V.L. 2003a. Automatic registration of microarray images, I: Rectangular grid. Bioinformatics 19: 1824-1831.
Galinsky, V.L. 2003b. Automatic registration of microarray images, II: Hexagonal grid. Bioinformatics 19: 1832-1836.
Gerry, N.P., Witowski, N.E., Day, J., Hammer, R.P., Barany, G., and Barany, F. 1999. Universal DNA microarray method for multiplex detection of low abundance point mutations. J. Mol. Biol. 292: 251-262.
Gunderson, K.L., Huang, X.C., Morris, M.S., Lipshutz, R.J., Lockhart, D.J., and Chee, M.S. 1998. Mutation detection by ligation to complete n-mer DNA arrays. Genome Res. 8: 1142-1153.
Hamming, R.W. 1986. Coding and information theory. Prentice-Hall, Inc., Englewood Cliffs, NJ.
Han, M., Gao, X., Su, J.Z., and Nie, S. 2001. Quantum-dot-tagged microbeads for multiplexed optical coding of biomolecules. Nat. Biotechnol. 19: 631-635.
Hardenbol, P., Baner, J., Jain, M., Nilsson, M., Namsaraev, E.A., Karlin-Neumann, G.A., Fakhrai-Rad, H., Ronaghi, M., Willis, T.D., Landegren, U., et al. 2003. Multiplexed genotyping with sequence-tagged molecular inversion probes. Nat. Biotechnol. 21: 673-678.
Hessner, M.J., Wang, X., Khan, S., Meyer, L., Schlicht, M., Tackes, J., Datta, M.W., Jacob, H.J., and Ghosh, S. 2003. Use of a three-color cDNA microarray platform to measure and control support-bound probe for improved data quality and reproducibility. Nucleic Acids Res. 31: e60.
Holloway, A.J., van Laar, R.K., Tothill, R.W., and Bowtell, D.D. 2002. Options available - from start to finish - for obtaining data from DNA microarrays II. Nat. Genet. 32(Suppl): 481-489.
Hubbell, E. and Pevzner, P.A. 1999. Fidelity probes for DNA arrays. Proc. Int. Conf. Intell. Syst. Mol. Biol. 113-117.
Johnson, P.H., Walker, R.P., Jones, S.W., Stephens, K., Meurer, J., Zajchowski, D.A., Luke, M.M., Eeckman, F., Tan, Y., Wong, L., et al. 2002. Multiplex gene expression analysis for high-throughput drug discovery: Screening and analysis of compounds affecting genes overexpressed in cancer cells. Mol. Cancer Ther. 1: 1293-1304.
Kennedy, G.C., Matsuzaki, H., Dong, S., Liu, W.M., Huang, J., Liu, G., Su, X., Cao, M., Chen, W., Zhang, J., et al. 2003. Large-scale genotyping of complex DNA. Nat. Biotechnol. 21: 1233-1237.
Levsky, J.M., Shenoy, S.M., Pezo, R.C., and Singer, R.H. 2002. Single-cell gene expression profiling. Science 297: 836-840.
Liang, P. and Pardee, A.B. 1992. Differential display of eukaryotic messenger RNA by means of the polymerase chain reaction. Science 257: 967-971.
Lockhart, D.J. and Trulson, M.O. 2001. Multiplex metallica. Nat. Biotechnol. 19: 1122-1123.
Lockhart, D.J. and Winzeler, E.A. 2000. Genomics, gene expression and DNA arrays. Nature 405: 827-836.
Lockhart, D.J., Dong, H., Byrne, M.C., Follettie, M.T., Gallo, M.V., Chee, M.S., Mittmann, M., Wang, C., Kobayashi, M., Horton, H., et al. 1996. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat. Biotechnol. 14: 1675-1680.
Michael, K.L., Taylor, L.C., Schultz, S.L., and Walt, D.R. 1998. Randomly ordered addressable high-density optical sensor arrays. Anal. Chem. 70: 1242-1248.
Nicewarner-Pena, S.R., Freeman, R.G., Reiss, B.D., He, L., Pena, D.J., Walton, I.D., Cromer, R., Keating, C.D., and Natan, M.J. 2001. Submicrometer metallic barcodes. Science 294: 137-141.
Nuwaysir, E.F., Huang, W., Albert, T.J., Singh, J., Nuwaysir, K., Pitas, A., Richmond, T., Gorski, T., Berg, J.P., Ballin, J., et al. 2002. Gene expression analysis using oligonucleotide arrays produced by maskless photolithography. Genome Res. 12: 1749-1755.
Pastinen, T., Raitio, M., Lindroos, K., Tainola, P., Peltonen, L., and Syvanen, A.C. 2000. A system for specific, high-throughput genotyping by allele-specific primer extension on microarrays. Genome Res. 10: 1031-1042.
Patil, N., Berno, A.J., Hinds, D.A., Barrett, W.A., Doshi, J.M., Hacker, C.R., Kautzer, C.R., Lee, D.H., Marjoribanks, C., McDonough, D.P., et al. 2001. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294: 1719-1723.
Pease, A.C., Solas, D., Sullivan, E.J., Cronin, M.T., Holmes, C.P., and Fodor, S.P. 1994. Light-generated oligonucleotide arrays for rapid DNA sequence analysis. Proc. Natl. Acad. Sci. 91: 5022-5026.
Sambrook, J.E.A. 1989. Molecular cloning: A laboratory manual. Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY.
Schena, M., Shalon, D., Davis, R.W., and Brown, P.O. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270: 467-470.
Sengupta, R. and Tompa, M. 2002. Quality control in manufacturing oligo arrays: A combinatorial design approach. J. Comput. Biol. 9: 1-22.
Shannon, C.E. 1948a. A mathematical theory of communication. Bell System Technical J. 27: 379-423.
Shannon, C.E. 1948b. A mathematical theory of communication. Bell System Technical J. 27: 623-656.
Shearstone, J.R., Allaire, N.E., Getman, M.E., and Perrin, S. 2002. Nondestructive quality control for microarray production. Biotechniques 32: 1051-1052, 1054, 1056-1057.
Southern, E., Maskos, U., and Elder, R. 1992. Analyzing and comparing nucleic acid sequences by hybridization to arrays of oligonucleotides: Evaluation using experimental models. Genomics 13: 1008-1017.
Southern, E.M. 1975. Detection of specific sequences among DNA fragments separated by gel electrophoresis. J. Mol. Biol. 98: 503-517.
Taylor, E., Cogdell, D., Coombes, K., Hu, L., Ramdas, L., Tabor, A., Hamilton, S., and Zhang, W. 2001. Sequence verification as quality-control step for production of cDNA microarrays. Biotechniques 31: 62-65.
Van't Veer, L.J., Dai, H., van de Vijver, M.J., He, Y.D., Hart, A.A., Mao, M., Peterse, H.L., van der Kooy, K., Marton, M.J., Witteveen, A.T., et al. 2002. Gene expression profiling predicts clinical outcome of breast cancer. Nature 415: 530-536.
Wang, D.G., Fan, J.-B., Siao, C.-J., Berno, A., Young, P., Sapolsky, R., Ghandour, G., Perkins, N., Winchester, E., Spencer, J., et al. 1998. Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science 280: 1077-1082.
Watson, J.D. and Crick, F.H.C. 1953. Molecular structure of nucleic acid: A structure for deoxyribose nucleic acid. Nature 171: 737-738.
Yeakley, J.M., Fan, J.B., Doucet, D., Luo, L., Wickham, E., Ye, Z., Chee, M.S., and Fu, X.D. 2002. Profiling alternative splicing on fiber-optic arrays. Nat. Biotechnol. 20: 353-358.
Yvert, G., Brem, R.B., Whittle, J., Akey, J.M., Foss, E., Smith, E.N., Mackelprang, R., and Kruglyak, L. 2003. Trans-acting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors. Nat. Genet. 35: 57-64.
www.hapmap.org; International HapMap Project.
Received December 8, 2003;
accepted in revised format January 29, 2004.