Published online before print
December 19, 2005, 10.1101/gr.4452906 Genome Res. 16:271-281, 2006 ©2006 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/06 $5.00
Methods
Design optimization methods for genomic DNA tiling arrays
1 Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, Connecticut 06520, USA
2 Department of Computer Science, Yale University, New Haven, Connecticut 06520, USA
3 Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
4 Department of Computer Science, Northwestern University, Evanston, Illinois 60201, USA
A recent development in microarray research entails the unbiased coverage, or tiling, of genomic DNA for the large-scale identification of transcribed sequences and regulatory elements. A central issue in designing tiling arrays is that of arriving at a single-copy tile path, as significant sequence cross-hybridization can result from the presence of non-unique probes on the array. Due to the fragmentation of genomic DNA caused by the widespread distribution of repetitive elements, the problem of obtaining adequate sequence coverage increases with the sizes of subsequence tiles that are to be included in the design. This becomes increasingly problematic when considering complex eukaryotic genomes that contain many thousands of interspersed repeats. The general problem of sequence tiling can be framed as finding an optimal partitioning of non-repetitive subsequences over a prescribed range of tile sizes, on a DNA sequence comprising repetitive and non-repetitive regions. Exact solutions to the tiling problem become computationally infeasible when applied to large genomes, but successive optimizations are developed that allow their practical implementation. These include an efficient method for determining the degree of similarity of many oligonucleotide sequences over large genomes, and two algorithms for finding an optimal tile path composed of longer sequence tiles. The first algorithm, a dynamic programming approach, finds an optimal tiling in linear time and space; the second applies a heuristic search to reduce the space complexity to a constant requirement. A Web resource has also been developed, accessible at http://tiling.gersteinlab.org, to generate optimal tile paths from user-provided DNA sequences.
DNA microarrays have become ubiquitous in genomic research as tools for the large-scale analysis of gene expression. The design of DNA microarrays has generally focused on the measurement of mRNA transcript levels from annotated genes, represented either by PCR products comprising entire cDNA sequences (Schena et al. 1995
A recent trend in genomics has involved the development of "tiling" arrays: microarrays that represent a complete non-repetitive tile path over a chromosome or locus, irrespective of any genes that may be annotated in that region (Fig. 1, left). This unbiased representation of genomic DNA has enabled the discovery of many novel transcribed sequences (Kapranov et al. 2002
In addition to coding and regulatory sequences, genomes contain repetitive elements that have been introduced and replicated in high copy number over evolutionary time. The frequency and diversity of these repeat sequences increases with the size and complexity of higher eukaryotic chromosomes, accounting for
When representing genomic DNA with short oligonucleotides, near-optimal coverage of the non-repetitive sequence can be achieved in a relatively straightforward manner (Fig. 2B), although a number of important factors should be considered for probe selection. A more problematic situation arises when tile sizes increase, such as when selecting suitable targets for PCR amplification. For this application it is necessary to derive a tile path of larger sequence fragments from the non-repetitive component of the genome (Fig. 2C). Small PCR products can be difficult to resolve in a high-throughput setting, while fragments of several kilobases (kb) in length can limit the precise identification of hybridizing sequences. Balancing these criteria to select appropriate target sequences while avoiding repetitive elements presents a challenging optimization problem. Here we discuss a number of issues central to the problem of partitioning non-repetitive sequences using a range of tile sizes, and present several optimization approaches suitable for both oligonucleotide- and amplicon-based microarray applications.
Tiling discontiguous genome sequences
Genomic tiling arrays are intended to maximally cover a span of non-repetitive DNA with representative sequence fragments, or tiles, whose sizes fall within a prescribed range. The number of repetitive elements included in the tile path should be minimized, while the sequence is partitioned into the fewest number of tiles that can maximally cover the non-repetitive DNA. The sequences included on the array are either PCR-amplified and deposited mechanically onto glass slides via contact printing (Schena et al. 1995
Sequence similarity and single-copy tiling
Genome sequences are not random and therefore contain many redundant subsequences. For the design of tiling arrays it is essential to identify and eliminate non-unique sequences in order to reduce the potential for cross-hybridization of sequences originating from elsewhere in the genome. For shorter sequence tiles, robust methods to address the problem of sequence similarity can be developed, thereby generating a single-copy tile path to be represented on the array (Fig. 2A). To implement this approach we compute the degree of similarity of any given oligonucleotide sequence in a large genome.
This problem can be stated as follows: Given a genomic sequence and an oligomer of length n, find all oligomers in the sequence differing from the input in no more than m places. In theory, we need only create a direct hash table mapping each sequence to a list of all subsequence occurrences. However, the space required to implement the hash quickly becomes impractical. With 4^n possible oligomer sequences, a hash over oligomers of length 14 requires 1 gigabyte of storage, in addition to the space needed to store each of the possible index coordinates of the input sequence (another gigabyte for large chromosomes). These requirements impose a practical limitation on the size of hash tables such that n ≤ 14. This is insufficient for most microarray applications, where oligonucleotide sizes are typically 25 nucleotides (nt).
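The problem statement and the storage estimate can be made concrete with a short Python sketch. The function names and the 4-bytes-per-entry figure are our own assumptions for illustration; the brute-force scan only restates the problem and is not the indexing scheme used for large genomes.

```python
def direct_hash_bytes(n, bytes_per_entry=4):
    """Storage for a direct table with one slot per possible n-mer.

    Assumes 4 bytes per slot (hypothetical); there are 4**n possible n-mers.
    """
    return (4 ** n) * bytes_per_entry


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def find_similar(genome, oligo, m):
    """Start positions of all windows differing from `oligo` in at most m places.

    Brute force: O(len(genome) * len(oligo)) time, no index required.
    """
    n = len(oligo)
    return [
        i
        for i in range(len(genome) - n + 1)
        if hamming(genome[i:i + n], oligo) <= m
    ]


if __name__ == "__main__":
    print(direct_hash_bytes(14))                  # slots for every 14-mer
    print(find_similar("ACGTACGAACGT", "ACGT", 1))
```

Under these assumptions a table over all 14-mers occupies 2^30 bytes, which matches the 1-gigabyte figure quoted above; each additional nucleotide multiplies the requirement by four.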
To work around these memory constraints and account for sequence mismatches, we adopt a BLAST-like scheme similar to that described in Wang and Seed (2003
Repeat identification and low-complexity filtering
In addition to identifying instances of canonical repeat families, it is often desirable to screen genomic DNA for low-complexity sequences: stretches of polypurine/polypyrimidine bases, or regions of extremely high A/T or G/C content. RepeatMasker is able to filter some low-complexity DNA by default; more extensive filtering is often performed using programs such as DUST (R. Tatusov and D. Lipman, unpubl.) and NSEG (Wootton and Federhen 1993). DUST is included as a component of the NCBI BLAST distribution; NSEG is a member of the SEG family of programs and affords more flexible control over low-complexity filtering by using an information entropy-based model of sequence analysis.
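To illustrate what an entropy-based complexity measure looks like, the following Python sketch flags windows whose Shannon entropy falls below a threshold. It is a simplified stand-in and does not reproduce the DUST or NSEG algorithms; the window size and threshold are arbitrary values chosen for the example.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy of the base composition of `seq`, in bits (0 to 2 for DNA)."""
    total = len(seq)
    counts = Counter(seq)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def low_complexity_windows(seq, window=12, min_bits=1.0):
    """Start positions of windows whose composition entropy is below `min_bits`."""
    return [
        i
        for i in range(len(seq) - window + 1)
        if shannon_entropy(seq[i:i + window]) < min_bits
    ]


if __name__ == "__main__":
    seq = "ACGTACGT" + "A" * 12
    print(low_complexity_windows(seq, window=8, min_bits=1.0))
```

A window of a single repeated base has entropy 0, while a window with equal numbers of all four bases has entropy 2 bits, so the threshold controls how compositionally biased a region must be before it is masked.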
Genomic DNA representation with short sequence tiles
Designing oligonucleotide tiling arrays constitutes a different problem from selecting oligonucleotides for gene-based arrays, primarily because end-to-end or overlapping tile layouts present fewer options with regard to sequence selection. A number of factors should be considered when tiling genomic DNA with oligonucleotides, including tiling resolution, consideration of non-unique subsequences, and hybridization affinity. Subdividing contiguous genomic DNA in a naive, end-to-end fashion offers little opportunity to select optimal probe sequences because the aim is to cover the non-repetitive regions using predetermined spacing constraints. However, several strategies can be used to improve both the annealing specificity and thermodynamic properties of oligonucleotides selected for tiling arrays.
Tiling resolution
In the case of ChIP-chip experiments, chromatin-immunoprecipitated DNA is hybridized to an intergenic microarray to locate transcription factor binding sites (Horak and Snyder 2002
For the fine-resolution mapping of transcribed sequences, much closer probe spacing is required. Because a large fraction of the coding sequences in many eukaryotes span only tens of nucleotides, most of these would elude detection if the genomic sequence is tiled with significant gaps. Further, if the experiment is intended to measure exon-intron boundaries, it may be desirable to cover the genomic DNA with multiple oligonucleotides such that the starting position of each probe is shifted by several nucleotides in order to overlap the previous oligonucleotide's coordinates (Fig. 1, lower right). Although this strategy increases the tiling resolution, the number of probes required will eventually occupy many more features on the array. It is therefore important to select the desired tiling resolution in a manner that considers the intended microarray platform and optimizes the use of the available array elements.
Oligonucleotide probes that are selected for microarray applications are typically short (25-80 nt) and uniform in length. These assumptions allow the non-repetitive regions to be tiled by adopting a naive approach in which the sequences are subdivided into fixed-size partitions. There will naturally be many cases in which the oligomer length does not divide evenly into the size of a non-repetitive sequence fragment, and the remainder is therefore omitted from the tile path. However, the resulting loss in sequence coverage is inconsequential given the typically short length of the oligonucleotides. In these situations, it is desirable to adjust the placement of oligonucleotides in order to bias the sequence selection toward the optimal criteria, thereby reducing the potential for cross-hybridization to sequences elsewhere in the genome.
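The relationship between oligomer length, probe spacing, and the number of array features can be sketched in a few lines of Python. The function name and the half-open coordinate convention are assumptions made for this example.

```python
def tile_starts(region_start, region_end, oligo_len, step):
    """Start coordinates of fixed-length oligos within [region_start, region_end).

    step == oligo_len gives end-to-end tiling, step < oligo_len gives
    overlapping tiling, and step > oligo_len leaves gaps between probes.
    Any remainder at the end of the region is left untiled.
    """
    last_start = region_end - oligo_len
    return list(range(region_start, last_start + 1, step))


if __name__ == "__main__":
    end_to_end = tile_starts(0, 110, 25, 25)
    overlapping = tile_starts(0, 110, 25, 5)
    print(end_to_end, len(overlapping))
    print("untiled remainder:", 110 - (end_to_end[-1] + 25))
```

For a 110-nt non-repetitive fragment and 25-nt probes, end-to-end tiling uses 4 features and omits the final 10 nt, whereas shifting each probe by 5 nt uses 18 features for the same fragment.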
Thermodynamic properties of oligonucleotide probes
ΔH is the enthalpy of base stacking interactions, ΔS is the entropy of base stacking, [oligo] indicates the oligonucleotide concentration, and R is the universal gas constant, 1.987 cal/(°C·mol).
Considering these criteria, it is useful to shift the placement of oligonucleotides within each region of non-repetitive DNA in order to reduce the variability of the melting temperatures associated with each probe sequence. In the case of spaced oligo tiling, an individual probe is selected from within each available region such that the calculated Tm is closest to the optimal temperature. For overlapping tiling designs, either the entire set of oligos can be shifted together such that their aggregate Tm is optimized, or the previous approach can be taken and the available regions for oligo placement simply overlap with adjacent regions instead of considering gaps between them.
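The probe-shifting idea for spaced tiling can be sketched as follows. The Tm expression used in `melting_temp_c` is one commonly used two-state form, Tm = ΔH / (ΔS + R ln[oligo]) - 273.15, and is an assumption here, since it omits any salt correction and may differ from the exact expression intended above. The `wallace_tm` rule (2 °C per A/T, 4 °C per G/C) is only a simple placeholder so that the selection step can be run without a nearest-neighbor parameter table.

```python
import math

R = 1.987  # gas constant in cal/(K*mol)


def melting_temp_c(delta_h, delta_s, oligo_conc):
    """Tm in degrees C from dH (cal/mol), dS (cal/(K*mol)) and [oligo] (mol/L).

    Assumed two-state form without salt correction.
    """
    return delta_h / (delta_s + R * math.log(oligo_conc)) - 273.15


def wallace_tm(seq):
    """Placeholder Tm estimate: 2 C per A/T plus 4 C per G/C."""
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def pick_probe(region, oligo_len, target_tm, tm_func=wallace_tm):
    """Return (offset, probe) for the window whose Tm is closest to target_tm."""
    if len(region) < oligo_len:
        raise ValueError("region is shorter than the oligo length")
    offsets = range(len(region) - oligo_len + 1)
    best = min(offsets, key=lambda i: abs(tm_func(region[i:i + oligo_len]) - target_tm))
    return best, region[best:best + oligo_len]


if __name__ == "__main__":
    print(round(melting_temp_c(-80000.0, -220.0, 1e-6), 2))
    print(pick_probe("AAAAGCGCAAAA", 4, 16))
```

Any Tm estimator with the same signature can be passed as `tm_func`, so a nearest-neighbor model can replace the placeholder without changing the selection logic.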
Optimizing sequence coverage with longer tiles
With regard to sequence tiling, a repeat-masked genome sequence can be viewed as containing two categories of nucleotide information: (1) that which comprises the coding, regulatory, and intergenic sequences located in euchromatic regions, together viewed as non-repetitive DNA (nrDNA), and (2) that which belongs to repetitive elements and low-complexity regions (rpDNA). Tiling of repeat-masked sequences can therefore be viewed as a two-class partitioning problem: Given a sequence with some subwords identified as repeat nucleotides and the remaining subwords composed of non-repetitive nucleotides, the sequence is partitioned into non-overlapping tiles of either type such that the total amount of non-repetitive sequence covered is maximized, while the number of repetitive nucleotides included in the resulting tile path is minimized.
The repetitive elements present in most eukaryotic genomes introduce a high degree of fragmentation of the non-repetitive DNA. Avoiding repeats and targeting only the remaining sequence fragments >300 bp in size results in suboptimal coverage of the non-repetitive DNA (Fig. 3). In order to improve the sequence coverage, strategies must be devised to recover some of the non-repetitive fragments that are too small to be efficiently amplified (Berman et al. 2004
Algorithms for optimal sequence tiling
We can also use the scoring function V to evaluate the score of either an individual tile Ti...j or an excluded region Xi...j,
Therefore, the scoring function evaluated over an entire tile path is the sum of all scores for individual tiles and excluded regions,
A dynamic programming solution
The main iteration of the algorithm can be described as follows: At an intermediate step in the computation we have evaluated the optimal tile paths and their associated scores for all subsequences S1...1 to S1...(k - 1). In order to find an optimal tile path for the subsequence S1...k, for each i ∈ [max(1, k - u), max(1, k - l)] we compute the score for the tile path consisting of the optimal tile path from 1...i and the tile T(i + 1)...k using the score of the optimal tile path from 1...i and V[T(i + 1)...k]. Similarly, we also evaluate the score of the tile path consisting of the optimal solution from 1...(k - 1) and the excluded region Xk...k (the kth nucleotide). The optimal tile path for S1...k is then one of the preceding tile paths having the maximal score. This tile path and its associated score are then stored, and the algorithm proceeds to the next nucleotide in the sequence, k + 1. A schematic of the algorithm appears below.
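Given optimal tile paths for all subsequences S1...1 to S1...(k - 1) and their associated scores:
STEP 1: For each i ∈ [max(1, k - u), max(1, k - l)], we construct the tile path consisting of OptimalTilePath{S1...i} followed by the tile T(i + 1)...k; its score is the score of OptimalTilePath{S1...i} plus V[T(i + 1)...k]. We also construct an additional tile path consisting of OptimalTilePath{S1...(k - 1)} followed by the excluded region Xk...k; its score is the score of OptimalTilePath{S1...(k - 1)} plus V[Xk...k].
STEP 2: From the preceding tile paths computed in Step 1, we select one having the maximal score and store it as OptimalTilePath{S1...k}, along with its associated score.
STEP 3: Repeat for subsequence S1...(k + 1).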
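The three steps above can be sketched in Python as follows. The scoring used here is a hypothetical stand-in for V, not the authors' definition: a tiled non-repetitive nucleotide scores +1, a tiled repeat nucleotide scores `-repeat_penalty`, and excluded nucleotides score 0. Indices are 0-based and tiles are half-open intervals.

```python
def optimal_tiling(is_repeat, min_len, max_len, repeat_penalty=1.0):
    """Return (score, tiles) for a repeat mask, with tile sizes in [min_len, max_len].

    Assumed scoring (illustrative only): +1 per tiled non-repeat nucleotide,
    -repeat_penalty per tiled repeat nucleotide, 0 per excluded nucleotide.
    """
    n = len(is_repeat)
    # prefix[k] = score of the first k nucleotides if they were all tiled
    prefix = [0.0] * (n + 1)
    for k, rep in enumerate(is_repeat):
        prefix[k + 1] = prefix[k] + (-repeat_penalty if rep else 1.0)

    best = [0.0] * (n + 1)   # best[k]: optimal score for the prefix of length k
    back = [None] * (n + 1)  # back[k]: start of the tile ending at k, else None
    for k in range(1, n + 1):
        best[k] = best[k - 1]  # option: exclude nucleotide k-1
        # option: end a tile S[i:k] with min_len <= k - i <= max_len (Step 1)
        for i in range(max(0, k - max_len), k - min_len + 1):
            candidate = best[i] + prefix[k] - prefix[i]
            if candidate > best[k]:  # keep the maximal score (Step 2)
                best[k], back[k] = candidate, i

    # Trace back from the end of the sequence to recover the tile path
    tiles, k = [], n
    while k > 0:
        if back[k] is None:
            k -= 1
        else:
            tiles.append((back[k], k))
            k = back[k]
    tiles.reverse()
    return best[n], tiles


if __name__ == "__main__":
    mask = [c == "R" for c in "...." + "R" + "...." + "R" * 6 + "...."]
    print(optimal_tiling(mask, min_len=5, max_len=10))
```

In the example, neither 4-nt fragment alone reaches the 5-nt minimum, so the optimal path bridges the single repeat nucleotide to form one 9-nt tile and absorbs one repeat nucleotide to recover the final fragment. The sketch stores one score and one back-pointer per position, and the inner loop examines u - l + 1 candidate tiles per nucleotide.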
We used the algebraic dynamic programming (ADP) framework (Giegerich 2000
A linear-time, constant-space solution
The scores for included and excluded regions are given by
. Note that the score for included regions does not account for the tile cost C.
The algorithm partitions the sequence and outputs the region boundaries as processing continues. The sequence is scanned one nucleotide at a time, with the current position denoted by i. During the main iteration we keep track of an earlier position k, up to which an optimal partitioning has been determined. At each step, the algorithm attempts to determine if the window S(k + 1)...i should be classified either as an extension of the last known region R (currently extending up to k), or as the prefix of a new region starting at k + 1. Depending on the type of region R (included or excluded) and the difference D = V[I(k + 1)...i] - V[X(k + 1)...i] between the values of the scoring function for the two potential classifications of the window S(k + 1)...i, the algorithm selects one of three possible options:
Following this decision, the next nucleotide in the sequence is processed (i.e., i is incremented). The classification of the first and the last regions in the sequence is determined similarly, effectively assuming that the start of the sequence follows an excluded region, and only inspecting the sign of D if R is an included region at the end of the sequence (i.e., when i = n - 1). Since the number of times each nucleotide is examined is bounded by a constant, the overall time complexity is linear with respect to the size of the input sequence. The algorithm runs in constant space, as we need only keep a running value of D, the values of i and k, and the type of region R. A proof of optimality for this algorithm is presented in the Appendix.
This algorithm imposes no implicit upper bound on the size of nrDNA partitions, although C is effectively a lower bound on tile sizes. Therefore, included regions must be subdivided into smaller tiles whose sizes reflect the desired upper limit for PCR products. In terms of experimental preparation and subsequent microarray data analysis, it is preferable to create roughly equal-sized fragments whenever possible. Therefore, the most straightforward tiling of long nrDNA partitions involves (1) taking the ceiling of the length of the partition divided by the maximum tile size, then (2) subdividing the partition into equal-sized fragments of this number.
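The scan-and-commit procedure and the subsequent subdivision step can be sketched in Python. This is one possible reading rather than the authors' implementation: it assumes that the score difference D is a sum of per-nucleotide differences (`deltas`), that the tile cost C is charged once per included region, and that the decision thresholds are those used in the Appendix proof (an included region is extended when D ≥ 0 and closed when D < -C; symmetrically, an excluded region is extended when D ≤ 0 and an included region is opened when D > C; otherwise the decision is postponed). All function and variable names are hypothetical.

```python
import math


def greedy_partition(deltas, tile_cost):
    """Yield half-open (start, end) included regions in a single left-to-right scan.

    deltas[j] is the assumed per-nucleotide difference V[I] - V[X] at position j.
    Only a constant number of variables is kept between iterations.
    """
    included = False  # type of the last committed region R
    start = 0         # start of the current included region, if any
    k = 0             # the partitioning of deltas[:k] has been committed
    d = 0.0           # running D over the undecided window deltas[k:i+1]
    for i, delta in enumerate(deltas):
        d += delta
        if included:
            if d >= 0:                # extend the included region through i
                k, d = i + 1, 0.0
            elif d < -tile_cost:      # close the region at k; window is excluded
                yield (start, k)
                included, k, d = False, i + 1, 0.0
        else:
            if d <= 0:                # extend the excluded region through i
                k, d = i + 1, 0.0
            elif d > tile_cost:       # window pays for a new included region
                start, included = k, True
                k, d = i + 1, 0.0
    if included:                      # a pending window at the end is excluded
        yield (start, k)


def subdivide(start, end, max_tile):
    """Split [start, end) into ceil(length / max_tile) roughly equal tiles."""
    length = end - start
    if length <= 0:
        return []
    pieces = math.ceil(length / max_tile)
    bounds = [start + (length * j) // pieces for j in range(pieces + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


if __name__ == "__main__":
    deltas = [1] * 5 + [-1] * 2 + [1] * 3 + [-1] * 5 + [1] * 2
    print(list(greedy_partition(deltas, tile_cost=3)))
    print(subdivide(0, 3200, 1500))
```

In the example, the two-nucleotide repeat is bridged because excluding it would require paying the tile cost a second time, the five-nucleotide repeat closes the region, and the final two non-repetitive nucleotides are too few to justify a new tile. A 3200-bp included region with a 1500-bp upper limit is then split into three tiles of about 1067 bp each.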
A further improvement in the time complexity is possible when u ≥ 2l.
In other words, the best path is the path ending with a tile of minimal length, or ending with a gap, or the one-nucleotide extension of the last tile in the one step shorter best path ending with a tile. Note that the third choice is valid only when k - i < u, and indeed it can be shown that when u
As a consequence of this observation, the algorithm can select the best of the above three choices at each nucleotide, instead of comparing u - l + 2 alternatives. This reduces the time complexity to O(n). In practice, a straightforward implementation of the algorithm finds the optimal tiling of the largest sequenced chromosomes in <15 sec when u = 1500 and u
Although the above improvement in time complexity does not hold in general when u < 2l, in practice the algorithm performs better when applying a similar optimization in the general case: Select the best path among the above three choices, unless the third choice is invalid due to the last tile being of length u; in that case, fall back to the previously described algorithm and compute the maximal score among all u - l + 1 paths ending with a tile. Experiments with various genome sequences show that the fallback procedure is invoked very rarely, and this optimization makes a significant difference in the running time.
Tiling statistics for eukaryotic genomes
A summary of tiling genome sequences of various sizes and repeat densities is presented in Table 2. Genomes with relatively few repeats were included from several model organisms, as well as the highly repetitive genomes of more recently sequenced rodents and primates. The sequences were tiled first using a naive approach in which the non-repetitive DNA was subdivided into tiles having lengths equal to the lower size bound (in this case 300 bp). The linear-time tiling method was then applied to the sequences to derive an optimal tile path for each. Table 3 includes a summary of two additional metrics that apply simple tiling schemes to each sequence. Each of the latter methods allows some inclusion of repetitive nucleotides in order to recover a higher percentage of non-repetitive DNA.
In comparing these results, a number of observations become apparent. When the sequences are tiled in a naive fashion, the coverage of non-repetitive DNA decreases dramatically as we progress from the relatively repeat-free Arabidopsis sequence to the larger mammalian genomes. This reflects the higher levels of genomic sequence fragmentation due to increased repeat content, a condition that clearly inhibits the optimal tiling of the sequence.
Applying the optimal tiling algorithm to more complex genomes improves the non-repetitive sequence coverage significantly, while the percentage of included repeats remains very low. The optimal tiling algorithm greatly outperforms the other methods in higher eukaryotes, achieving maximal coverage of non-repetitive DNA with a relatively small increase in repeat nucleotide inclusion. In terms of microarray analysis, it has been empirically shown that the number of repeats included in such an optimal tile path can be effectively blocked through the inclusion of unlabeled low-complexity or repetitive DNA (e.g., Cot-1) in hybridization samples (Rinn et al. 2003
Tiling arrays are becoming an important tool for empirical genome annotation, making available the maximum amount of non-repetitive genomic DNA for microarray interrogation. In designing an optimal tile path for microarray applications, the identification and reduction of similar sequences constitutes a fundamental issue and can significantly reduce artifacts associated with cross-hybridization (Royce et al. 2005
Numerous options exist for tiling genomic sequences with oligonucleotides, leading to microarray designs of various sequence resolutions and feature densities. Biasing the selection of probes toward uniform thermal properties and eliminating non-unique sequences across the genome can improve the annealing characteristics and hybridization specificity. Although these issues become non-trivial for large genomes, we describe an efficient solution for determining sequence similarity and rejecting non-unique probes that is appropriate for microarray applications.
As sequence tiles increase in size, the sequence fragmentation introduced by repetitive elements reduces the coverage of non-repetitive DNA. For higher eukaryotes, this precludes the use of trivial partitioning strategies where maximal coverage of the non-repetitive sequence is desired. To address this problem, we present space- and time-efficient algorithms for generating optimal tile paths to improve the coverage of non-repetitive sequences while minimizing the number of repetitive nucleotides included. In this manner, a greater number of fragments of sufficient size is recovered for amplification, and a higher percentage of non-repetitive DNA is represented on the array. These approaches enable the construction of tiling arrays that maximize the amount of non-repetitive DNA for the discovery of novel functional elements in eukaryotic genomes.
Proof of optimality for the linear-time, constant-space algorithm
To see why this algorithm produces an optimal partitioning, we proceed by induction on the length of the inspected sequence and assume that the algorithm has been correct prior to the ith element (i.e., the partitioning up to k is optimal, and no decision can be made so far on the window between k + 1 and i). We will show only one case of the proof; the rest is very similar. Without loss of generality, assume that the last known region R, currently extending up to k, is an included region. Consider the case when D < -C, in which the algorithm will terminate R at i and start an excluded region at i + 1. Suppose, however, that there is an optimal partitioning P with score sP that extends R at least up to position i, contrary to what the algorithm yields. Define a new partitioning N, identical to P except for the window between k + 1 and i, which in N is part of an excluded region, and let us compute its score sN. There are two possibilities: If in P the included region ends at i and an excluded region starts at i + 1, then N has the same number of partitions as P, but one region boundary has been shifted from i in P to k in N. Hence, sN is equal to sP plus the difference in the scores on the window between k + 1 and i; these scores are exactly V[I(k + 1)...i] under the partitioning P and V[X(k + 1)...i] under N, therefore: sN = sP - V[I(k + 1)...i] + V[X(k + 1)...i] = sP - D > sP + C ≥ sP, which contradicts the optimality of P.
Other partitionings that terminate the included region earlier than i can be shown similarly suboptimal by the following observation. Since by assumption the algorithm postponed the decision until i, the difference D must be between -C and 0 at all intermediate points. For the case when the algorithm postpones the partitioning decision, the proof of correctness is to construct two sequences sharing the same prefix up to i but requiring different optimal partitionings of the window from k + 1 to i, which shows that indeed no decision guaranteeing optimality can be made at i. In other words, a partitioning solution not satisfying the tests in the algorithm cannot be optimal.
The correctness of the optimization is a consequence of the following propositions, which we prove under the assumptions that u
Tiling sequences using morphological operations
A number of established methods can be applied to repeat-masked DNA sequences to approximate an optimal tiling solution. By treating a genome sequence as a vector of nucleotide "pixels," we can use image segmentation techniques such as region growing and other relaxation processes to close small repetitive elements in the genomic sequence, thereby merging the adjacent high-complexity sequences into contiguous tiles. This approach can be expressed using standard binary morphological algebra (Serra 1980
The converted bilevel image can now be processed in several ways to yield an expansion of the nrDNA regions into the rpDNA regions. The nrDNA elements in set A can be transformed depending on how they relate to the "background" component of the sequence, comprising the rpDNA elements in B and referred to as the structuring element. The dilation of an input image A by a structuring element B is then described by:
In this manner, rpDNA regions whose lengths are less than the number of dilation cycles are closed, and the adjacent nrDNA fragments are effectively merged into larger tiles. Although this approach describes a simple approximation to the tiling problem, it is dependent on the use of a threshold constant for dilation-erosion cycles which corresponds to a fixed maximum number of nucleotides that each repetitive element can span.
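A one-dimensional morphological closing of this kind can be sketched in Python. The sketch assumes a three-nucleotide structuring element (each position and its two neighbors), so each dilation cycle grows every nrDNA run by one nucleotide on each side; with this particular choice, repeat gaps of up to twice the number of cycles are bridged. Function names and the boundary handling are our own assumptions.

```python
def dilate(mask):
    """One dilation cycle: a position becomes nrDNA if it or a neighbor is nrDNA."""
    n = len(mask)
    return [
        mask[i] or (i > 0 and mask[i - 1]) or (i < n - 1 and mask[i + 1])
        for i in range(n)
    ]


def erode(mask):
    """One erosion cycle: a position stays nrDNA only if both neighbors are nrDNA.

    Positions beyond the sequence ends are treated as nrDNA so that
    closing never removes nucleotides from the original nrDNA set.
    """
    n = len(mask)
    return [
        mask[i] and (i == 0 or mask[i - 1]) and (i == n - 1 or mask[i + 1])
        for i in range(n)
    ]


def close_repeats(is_nr, cycles):
    """Morphological closing: `cycles` dilations followed by `cycles` erosions."""
    out = list(is_nr)
    for _ in range(cycles):
        out = dilate(out)
    for _ in range(cycles):
        out = erode(out)
    return out


if __name__ == "__main__":
    seq = "NNNNrrNNNNrrrrrrNNNN"  # N = nrDNA, r = rpDNA
    closed = close_repeats([c == "N" for c in seq], cycles=1)
    print("".join("N" if x else "r" for x in closed))
```

With a single cycle the 2-nt repeat is closed and its flanking fragments merge into one 10-nt tile, while the 6-nt repeat is left intact. This illustrates the dependence on a fixed threshold noted above.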
This work was supported by National Institutes of Health grant P50 HG02357.
Article published online ahead of print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.4452906.
5 Corresponding author.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J. 1990. Basic local alignment search tool. J. Mol. Biol. 215: 403-410.
Bao, Z. and Eddy, S.R. 2002. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res. 12: 1269-1276.
Benson, G. 1999. Tandem repeats finder: A program to analyze DNA sequences. Nucleic Acids Res. 27: 573-580.
Berman, P., Bertone, P., Dasgupta, B., Gerstein, M., Kao, M.Y., and Snyder, M. 2004. Fast optimal genome tiling with applications to microarray design and homology search. J. Comput. Biol. 11: 766-785.
Bertone, P., Stolc, V., Royce, T.E., Rozowsky, J.S., Urban, A.E., Zhu, X., Rinn, J.L., Tongprasit, W., Samanta, M., Weissman, S., et al. 2004. Global identification of human transcribed sequences with genome tiling arrays. Science 306: 2242-2246.
Bertone, P., Gerstein, M., and Snyder, M. 2005. Applications of DNA tiling arrays to experimental genome annotation and regulatory pathway discovery. Chromosome Res. 13: 259-274.
Buck, M.J. and Lieb, J.D. 2004. ChIP-chip: Considerations for the design, analysis, and application of genome-wide chromatin immunoprecipitation experiments. Genomics 83: 349-360.
Cawley, S., Bekiranov, S., Ng, H.H., Kapranov, P., Sekinger, E.A., Kampa, D., Piccolboni, A., Sementchenko, V., Cheng, J., Williams, A.J., et al. 2004. Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116: 499-509.
Cheng, J., Kapranov, P., Drenkow, J., Dike, S., Brubaker, S., Patel, S., Long, J., Stern, D., Tammana, H., Helt, G., et al. 2005. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308: 1149-1154.
Euskirchen, G., Royce, T.E., Bertone, P., Martone, R., Rinn, J.L., Nelson, F.K., Sayward, F., Luscombe, N.M., Miller, P., Gerstein, M., et al. 2004. CREB binds to multiple loci on human chromosome 22. Mol. Cell. Biol. 24: 3804-3814.
Gelfand, M.S. and Roytberg, M.A. 1993. A dynamic programming approach for predicting the exon-intron structure. Biosystems 30: 173-182.
Giegerich, R. 2000. A systematic approach to dynamic programming in bioinformatics. Bioinformatics 16: 665-667.
Gotoh, O. 1982. An improved algorithm for matching biological sequences. J. Mol. Biol. 162: 705-708.
Hanlon, S.E. and Lieb, J.D. 2004. Progress and challenges in profiling the dynamics of chromatin and transcription factor binding with DNA microarrays. Curr. Opin. Genet. Dev. 14: 697-705.
Horak, C.E. and Snyder, M. 2002. ChIP-chip: A genomic approach for identifying transcription factor binding sites. Methods Enzymol. 350: 469-483.
Horak, C.E., Luscombe, N.M., Qian, J., Bertone, P., Piccirrillo, S., Gerstein, M., and Snyder, M. 2002a. Complex transcriptional circuitry at the G1/S transition in Saccharomyces cerevisiae. Genes & Dev. 16: 3017-3033.
Horak, C.E., Mahajan, M.C., Luscombe, N.M., Gerstein, M., Weissman, S.M., and Snyder, M. 2002b. GATA-1 binding sites mapped in the
Hughes, T.R., Mao, M., Jones, A.R., Burchard, J., Marton, M.J., Shannon, K.W., Lefkowitz, S.M., Ziman, M., Schelter, J.M., Meyer, M.R., et al. 2001. Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nat. Biotechnol. 19: 342-347.
Iyer, V.R., Horak, C.E., Scafe, C.S., Botstein, D., Snyder, M., and Brown, P.O. 2001. Genomic binding sites of the yeast cell-cycle transcription factors SBF and MBF. Nature 409: 533-538.
Jurka, J. 2000. Repbase update: A database and an electronic journal of repetitive elements. Trends Genet. 9: 418-420.
Jurka, J., Klonowski, P., Dagman, V., and Pelton, P. 1996. CENSOR: A program for identification and elimination of repetitive elements from DNA sequences. Comput. Chem. 20: 119-121.
Kampa, D., Cheng, J., Kapranov, P., Yamanaka, M., Brubaker, S., Cawley, S., Drenkow, J., Piccolboni, A., Bekiranov, S., Helt, G., et al. 2004. Novel RNAs identified from an in-depth analysis of the transcriptome of human chromosomes 21 and 22. Genome Res. 14: 331-342.
Kapranov, P., Cawley, S.E., Drenkow, J., Bekiranov, S., Strausberg, R.L., Fodor, S.P., and Gingeras, T.R. 2002. Large-scale transcriptional activity in chromosomes 21 and 22. Science 296: 916-919.
Kim, T.H., Barrera, L.O., Zheng, M., Qu, C., Singer, M.A., Richmond, T.A., Wu, Y., Green, R.D., and Ren, B. 2005. A high-resolution map of active promoters in the human genome. Nature 436: 876-880.
Kirmizis, A. and Farnham, P.J. 2004. Genomic approaches that aid in the identification of transcription factor target genes. Exp. Biol. Med. 229: 705-721.
Lee, T.I., Rinaldi, N.J., Robert, F., Odom, D.T., Bar-Joseph, Z., Gerber, G.K., Hannett, N.M., Harbison, C.R., Thompson, C.M., Simon, I., et al. 2002. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298: 799-804.
Lipshutz, R.J., Fodor, S.P., Gingeras, T.R., and Lockhart, D.J. 1999. High density synthetic oligonucleotide arrays. Nat. Genet. 21: 20-24.
Martone, R., Euskirchen, G., Bertone, P., Hartman, S., Royce, T.E., Luscombe, N.M., Rinn, J.L., Nelson, F.K., Miller, P., Gerstein, M., et al. 2003. Distribution of NF-
Needleman, S.B. and Wunsch, C.D. 1970. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48: 443-453.
Nuwaysir, E.F., Huang, W., Albert, T.J., Singh, J., Nuwaysir, K., Pitas, A., Richmond, T., Gorski, T., Berg, J.P., Ballin, J., et al. 2002. Gene expression analysis using oligonucleotide arrays produced by maskless photolithography. Genome Res. 12: 1749-1755.
Odom, D.T., Zizlsperger, N., Gordon, D.B., Bell, G.W., Rinaldi, N.J., Murray, H.L., Volkert, T.L., Schreiber, J., Rolfe, P.A., Gifford, D.K., et al. 2004. Control of pancreas and liver gene expression by HNF transcription factors. Science 303: 1378-1381.
Ren, B., Robert, F., Wyrick, J.J., Aparicio, O., Jennings, E.G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E., et al. 2000. Genome-wide location and function of DNA binding proteins. Science 290: 2306-2309.
Rinn, J.L., Euskirchen, G., Bertone, P., Martone, R., Luscombe, N.M., Hartman, S., Harrison, P.M., Nelson, F.K., Miller, P., Gerstein, M., et al. 2003. The transcriptional activity of human chromosome 22. Genes & Dev. 17: 529-540.
Rodriguez, B.A. and Huang, T.H. 2005. Tilling the chromatin landscape: Emerging methods for the discovery and profiling of protein-DNA interactions. Biochem. Cell Biol. 83: 525-534.
Royce, T.E., Rozowsky, J.S., Bertone, P., Samanta, M., Stolc, V., Weissman, S., Snyder, M., and Gerstein, M. 2005. Issues in the analysis of oligonucleotide tiling microarrays for transcript mapping. Trends Genet. 21: 466-475.
Rychlik, W. and Rhoads, R.E. 1989. A computer program for choosing optimal oligonucleotides for filter hybridization, sequencing and in vitro amplification of DNA. Nucleic Acids Res. 17: 8543-8551.
SantaLucia, J. 1998. A unified view of polymer, dumbbell, and oligonucleotide DNA nearest-neighbor thermodynamics. Proc. Natl. Acad. Sci. 95: 1460-1465.
Schena, M., Shalon, D., Davis, R.W., and Brown, P.O. 1995. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270: 467-470.