Genome Res. 17:720-731, 2007
©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07 $5.00
OPEN ACCESS ARTICLE
Integrated analysis of experimental data sets reveals many novel promoters in 1% of the human genome
Karaöz 2,6
1 Department of Genetics, Stanford University School of Medicine, Stanford, California 94305, USA; 2 Bioinformatics Program, Boston University, Boston, Massachusetts 02215, USA; 3 Department of Molecular, Cellular and Developmental Biology, Yale University, New Haven, Connecticut 06520, USA; 4 Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA; 5 Biomedical Engineering Department, Boston University, Boston, Massachusetts 02215, USA
The regulation of transcriptional initiation in the human genome is a critical component of global gene regulation, but a complete catalog of human promoters currently does not exist. In order to identify regulatory regions, we developed four computational methods to integrate 129 sets of ENCODE-wide chromatin immunoprecipitation data. They collectively predicted 1393 regions. Roughly 47% of the regions were unique to one method, as each method makes different assumptions about the data. Overall, predicted regions tend to localize to highly conserved, DNase I hypersensitive, and actively transcribed regions in the genome. Interestingly, a significant portion of the regions overlaps with annotated 3'-UTRs, suggesting that some of them might regulate anti-sense transcription. The majority of the predicted regions are >2 kb away from the 5'-ends of previously annotated human cDNAs and hence are novel. These novel regions may regulate unannotated transcripts or may represent new alternative transcription start sites of known genes. We tested 163 such regions for promoter activity in four cell lines using transient transfection assays, and 25% of them showed transcriptional activity above background in at least one cell line. We also performed 5'-RACE experiments on 62 novel regions, and 76% of the regions were associated with the 5'-ends of at least two RACE products. Our results suggest that there are at least 35% more functional promoters in the human genome than currently annotated.
The pilot phase of The ENCODE Project Consortium has generated a large volume and variety of functional genomics data (The ENCODE Project Consortium 2004). With this set of transcriptional regulatory element data, we aimed to map transcriptional promoters and regulatory regions throughout the ENCODE-defined regions independent of mRNA to genomic DNA sequence alignments. We used an integrated approach that evaluated the data as a whole in a quantitative manner rather than studying each data set individually. One of the most significant analytical challenges with microarray-based functional genomics is the continuous nature of the data. Specifically in the case of ChIP-chip, a discrete biochemical event (e.g., histone modification) is usually not reflected as a binary experimental output. Therefore, invoking a threshold for calling a site bound or unbound by a transcription factor in an individual data set is often arbitrary, and individual data points near the threshold can be easily misclassified depending on whether the emphasis is placed on specificity or sensitivity. These shortcomings can be overcome when a number of experiments are analyzed together, as a modest signal that is reproduced across a number of experiments can become much more significant than it would be in a single experiment. To this end, we have implemented four complementary methods to integrate the compendium of ENCODE transcriptional regulatory element data. First, a "naïve Bayes" method computes a score that combines the ChIP signals in different experiments, which are thresholded and weighted according to how well they perform on a set of known promoters. Second, we developed a "tree-weighting" (TW) method that computes a weighted sum of counts for a given region, where the weights account for both the TSS enrichments of individual experiments and the correlation between experiments.
Third, a "majority-voting" method determines the level of experimental support for each genomic position, defined by the number of cross-laboratory, cross-platform, or cross-factor experiments that designate that position above some statistical threshold. Last, we developed a "Z-score method" that generates a cumulative score by summing over the Z-scores of a genomic interval across multiple experiments. These methods predict regions of 0.6- to 1.5-kb sizes, dictated by the resolution of the underlying ChIP data sets. The regions do not indicate the direction of transcription or connectivity of exons in the vicinity, because the methods do not use sequence as input. Our main goal is to identify novel sites of transcription initiation from evidence other than existing cDNA sequences. We, therefore, take a promoter-centric approach in designing validation experiments. To evaluate the effectiveness of these different methods, we compared their predictions with TSSs identified by other independent experiments and genome annotations, many of which have been produced by the ENCODE project. We also conducted extensive experimental validation of novel regions that were not part of existing promoter annotation. We experimentally validated 85 novel promoters with transient transfection assays and rapid amplification of cDNA ends (5'-RACE) experiments, and demonstrated the power of an analytical approach that integrates the data from many genome-scale experiments. Extrapolating from these results, we estimate that there are at least 35% more novel promoters than currently annotated.
Promoter regions predicted by the four methods
The four complementary approaches make different assumptions and therefore have unique advantages and disadvantages. For example, the Z-score assumes that each experiment has the same predictive power for promoters, but it makes no assumption on how a promoter should look. In contrast, naïve Bayes uses a training set of known promoters to determine which experiments have the highest predictive power and weighs the experiments accordingly. Voting explicitly takes into account the finding that experiments performed by the same laboratory or on the same microarray platform tend to identify similar genomic regions as significant. TW determines this laboratory or platform bias automatically via correlating the data sets. The number of regions predicted by each method and the agreement between them are shown in Figure 1 (for a full listing, see also Supplemental Table 1). Z-score identified the smallest number of regions (580), followed by naïve Bayes (689), TW (714), and voting (985). There are 340 regions that are predicted by all four methods, and these are likely the highest confidence promoter regions. Interestingly, Z-score, naïve Bayes, and voting had a similar percentage of unique regions (26%, 28%, and 28%, respectively); however, TW had only 5% unique regions, with 92% of its regions included in the voting list. These comparisons indicate that all four methods are identifying a significant number of the same regions but also many regions unique to that particular method, and that TW and voting perform more similarly to each other than the others. In addition, the near twofold variation in the absolute number of regions identified by the four different methods (from 580 to 985) suggests that some of the approaches may be more specific than others.
The different methods also tend to predict regions of varying length (Supplemental Fig. 1). Z-score and TW predict regions that are on average 1.5 ± 0.8 kb long, while naïve Bayes and voting predict regions roughly half the size (0.8 ± 0.3 kb and 0.6 ± 0.3 kb, respectively). The resolution of our predictions is limited by the underlying data sets: the genomic DNA produced in the fragmentation process of ChIP is 0.05–1 kb long. Regions that are predicted by all methods are longest (3.8 ± 2 kb; called Common4) as we merge the overlapping predictions by the four methods together. Shared regions (predicted by two or three methods) are affected by merging in the same way (1.6 ± 0.9 kb). The difference in length distribution impacts the region-based accounting of validation rate described below, as longer regions have a higher chance of being validated.
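The effect of merging on region length can be illustrated with a minimal sketch. It assumes that predictions are half-open (start, end) intervals on one chromosome and that any overlap triggers a merge; the function name `merge_predictions`, the method labels, and the coordinates are hypothetical and are not taken from the authors' pipeline.

```python
from typing import Dict, FrozenSet, List, Tuple

Interval = Tuple[int, int]  # (start, end), half-open, single chromosome (assumption)


def merge_predictions(
    preds: Dict[str, List[Interval]]
) -> List[Tuple[int, int, FrozenSet[str]]]:
    """Merge overlapping intervals across methods.

    Returns (start, end, supporting_methods) for each merged region.
    """
    # Tag every interval with its method and sort by start coordinate.
    tagged = sorted((s, e, m) for m, ivs in preds.items() for s, e in ivs)
    merged: List[Tuple[int, int, FrozenSet[str]]] = []
    for s, e, m in tagged:
        if merged and s < merged[-1][1]:
            # Overlaps the region being built: extend it and record the method.
            ps, pe, pm = merged[-1]
            merged[-1] = (ps, max(pe, e), pm | {m})
        else:
            merged.append((s, e, frozenset({m})))
    return merged


# Hypothetical coordinates, for illustration only.
preds = {
    "zscore": [(100, 1600)],
    "tw": [(200, 1900)],
    "bayes": [(900, 1700)],
    "voting": [(1500, 2100), (5000, 5600)],
}
for start, end, methods in merge_predictions(preds):
    print(start, end, len(methods), sorted(methods))
```

In this toy input the first merged region spans all four predictions and is longer than any single one of them, which mirrors why the Common4 regions are the longest.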
Comparison of predicted promoter regions with other data sets and annotations
As shown in Figure 2F, the intersection of all four methods shows the highest degree of overlap with all markers, supporting the hypothesis that these regions are more likely to be promoters than those identified by any of the individual methods alone. Not surprisingly, GT-TSSs and 5'-UTRs were two of the top three categories that showed the highest degree of overlap with the intersection of the four lists. Interestingly, regions of DNase I hypersensitivity have the second highest degree of overlap, perhaps because the ChIP-chip and the DNase I hypersensitivity experiments both identify the most active promoters in the cell lines tested. Further support for the regulatory potential of the predicted regions comes from the significant enrichment with data sets of active transcription (TARs/transfrags and RACEfrags) (The ENCODE Project Consortium 2007). Panels A–D of Figure 2 show the degree of overlap of the same categories with the regions unique to each of the four methods. The regions unique to Z-score (Fig. 2D) and unique to naïve Bayes (Fig. 2A) show the highest degree of overlap with GT-TSSs, suggesting that these two approaches are more specific than TW and voting. TW shows the least significant overlap with the other categories but also has the smallest number (38) of unique regions. Naïve Bayes and voting show the most overlap with categories that potentially indicate novel regulatory regions (DNase I hypersensitivity and FAIRE). Figure 2E shows the results for regions predicted by two or three methods, with significant overlaps with GT-TSS, 5'-UTR, DNase I hypersensitivity, and FAIRE. The significant overlaps with independent data sets are highly encouraging and indicate that we are indeed identifying promoters with an integrated analysis of ENCODE ChIP-chip data. Interestingly, some of the regions that we identified do not overlap with known promoters and are thus putative novel promoters.
When we began this project, the GENCODE annotation was not fully developed, and we defined a novel promoter as one that was >2 kb away from the TSS of a GenBank cDNA. All of the promoters that we chose for experimental validation were novel based on that definition. Upon completion of the GENCODE annotations, we revised our definition of novel promoters to those that were ±2 kb surrounding GENCODE-annotated TSSs. Consequently, some of the regions we previously designated "novel promoters" are now part of the GENCODE annotation and are thus categorized as "known" below. Ninety of the 340 regions (26%) predicted by all four methods and 861 (62%) of the 1393 regions predicted by at least one method were thus deemed novel based upon the GENCODE criteria. Of the predicted regions, a significant proportion is localized to the boundaries of GENCODE-annotated transcripts (Fig. 3 shows the distance distribution in comparison to randomly placed regions of equal sizes). Yet 319 regions are >20 kb away from the 5'-end of an annotated transcript.
In order to assess whether some of the predicted regions >2 kb away from the 5'-end of a cDNA were indeed active promoters, we tested 163 regions (126 novel regions based on the GENCODE definition) by transient transfection reporter assays and 62 regions (28 remain novel) by 5'-RACE experiments.
Transient transfection assays validated 41 of 163 predicted regions
Overall, 41 of the 163 tested putative promoters were functional, corresponding to a validation rate of 25%. Encouragingly, the validation rate for the novel promoters was only 2% lower than that for the known promoters, suggesting that a similar validation rate would be observed for the remaining novel predictions if they were also tested. Regions predicted by multiple methods clearly had the highest validation rate. Specifically, predictions common to all four methods had a validation rate of 39%, followed by predictions made by two or three methods (20%), and only 13% of regions unique to one method were validated.
We compared sequence features of the predicted regions that were validated and the ones that were not. The former had a higher tendency of overlapping with a CpG island (36% versus 9%) or containing a TATA-box (12% versus 9%). This is in agreement with our previous study, which showed that promoter fragments active in transient transfection assays tended to be GC rich (Cooper et al. 2006).
5'-RACE validated 47 of 62 predicted regions
Transposable elements have been suggested to play a role in the evolution of regulatory regions by dispersing novel promoters throughout the genome (Jordan et al. 2003). We also analyzed whether segmental duplications affected the validation by 5'-RACE. Out of 47 promoters that were validated by 5'-RACE, five of them (three novel) overlapped segmental duplications. For all these promoters, we examined BLAT alignments of the RACE fragments to the vicinity of the tested promoters and to the duplicated regions. In every case, there were at least two RACE products with better alignments to the tested region than to the duplication (Supplemental Table 5). The number of promoters validated by 5'-RACE generally correlated with the number of methods used to predict the promoter. Regions predicted by all four methods had a validation rate of 85%, while the ones predicted by only one method had a validation rate of 67%, and the ones predicted by two or three methods had an intermediate rate of 74%. Among the 15 tested predictions made by only one method, 10 were by the TW method and seven were validated by the RACE experiment. Unfortunately there are not enough RACE data on regions unique to other methods. The validation rate was not correlated with whether or not a CAGE/GIS-PET tag was present near the predicted promoter (77% for tag absent and 72% for tag present; the overall rate was 75%). We manually inspected the promoters validated by 5'-RACE with respect to GENCODE-annotated transcripts. Most of them are associated with existing genes. Only two did not overlap known transcripts; nevertheless, they seemed to interact with yet unannotated transcripts, as they fell within the boundaries of novel transcripts defined by a GIS-PET cluster.
Some of them initiate transcription of products that are embedded in an intron (as sense or anti-sense), others provide an alternative TSS (and hence a new variant), and the remaining are anti-sense to an exon (typically the 5'-UTR or 3'-UTR and less frequently an internal exon) of the associated gene. Figure 4 and Supplemental Figure 4 show three examples of anti-sense transcripts represented by our RACE products. Interestingly, in many of the intron embedded and alternative TSS cases, a SINE or LINE (indicated by RepeatMasker; http://ftp.genome.washington.edu/RM/RepeatMasker.html) was found at or near the promoter region. Additionally, in two of the 3'-UTR anti-sense cases, the transcripts appeared to be spliced.
We systematically classified the transcripts associated with the 41 promoters validated by transient transfection assays and the 47 promoters validated by 5'-RACE experiments (inferred for the former and the RACE products for the latter) into 11 categories, depending upon the relative positions of the transcripts with respect to the nearest GENCODE-annotated gene (Fig. 5). The total number of cases is summed to 48 for transfection and 59 for RACE, as some classes (notably intron embedded) can be interpreted as other classes (e.g., new TSS or anti-sense). The two sets both have large representations of 5'-exon anti-sense, 3'-exon anti-sense, and intron embedded; however, the transfection set has 10 intergenic regions, while the RACE set has 11 known promoters and four pseudogenes. The discrepancy could be due to different criteria for region selection. Such classification should be helpful for inferring the biological functions of newly validated promoters.
In this study we have identified 1393 putative promoter regions in 1% of the human genome (44 ENCODE regions totaling 30 Mb) by integrating the results of many transcription-factor binding and histone modification ChIP-chip data sets. The results of this analysis provide an alternative way to map TSSs and promoters independent of aligning cDNA sequences to the genome. Approximately 52% of the promoters annotated by GENCODE in ENCODE regions were identified by our approach. Because the ChIP experiments were carried out in a limited number of cell lines under only a few conditions, we did not expect all GENCODE promoters to be identified. The observed overlap was highly significant and gave us confidence that we were able to identify many of the previously known promoters. Of the regions we identified without cDNA support, we experimentally validated 85 novel promoters from a total of 205 tested (41.5%), with 41 of 163 validated by transient transfection reporter assays and 47 of 62 by 5'-RACE experiments. Twenty regions were tested by both methods, and 18 (90%) were validated by one or both of the methods (13 were validated by 5'-RACE uniquely, two by transfection uniquely, and three by both methods). If we extrapolate the validation rate of 41.5% (85 of 205) to 861 novel regions, we estimate that there are 357 functional novel promoters in the ENCODE regions. By extrapolation, we conclude that there are at least 35% more functional promoters than those currently annotated in the human genome. Because a limited number of cell lines were used for the experimental validation and because of other inherent limitations of these experiments, this is likely an underestimate. By examining these validated promoters individually, we observed that 13% of the novel promoters are alternative promoters that start downstream of the most 5' TSS of previously characterized genes, or extend the 5'-end of previously known genes. 
Approximately 11% of the novel promoters are in intergenic regions and may represent the TSSs of new genes. A reason that the intergenic class may be underrepresented in the RACE-validated set is likely due to the requirement of an index exon for RACE experiments. It would be difficult to design index primers to an exon of a new gene associated with a novel promoter. Meanwhile, a surprisingly high proportion (23%) of the novel promoters are on the anti-sense strand of previously identified transcripts (mostly terminal exons), potentially driving transcription of an anti-sense transcript (Fig. 5).
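The extrapolation above can be reproduced with a few lines of arithmetic. This is only a worked check of the counts quoted in the text (regions tested or validated by both assays are counted once); the variable names are illustrative.

```python
# Counts quoted in the text.
tested_transfection, tested_race, tested_both = 163, 62, 20
validated_transfection, validated_race, validated_both = 41, 47, 3

# Regions examined by both assays are counted only once.
tested_total = tested_transfection + tested_race - tested_both
validated_total = validated_transfection + validated_race - validated_both

rate = validated_total / tested_total

# Apply the combined validation rate to all novel predicted regions.
novel_regions = 861
expected_functional = round(rate * novel_regions)

print(tested_total, validated_total)              # 205 85
print(f"{rate:.1%}")                              # 41.5%
print(expected_functional)                        # 357
```

The step from 357 regions to "at least 35% more" additionally depends on the number of currently annotated promoters in the ENCODE regions, which is not restated here and is therefore left out of the sketch.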
It will require additional experimental work to determine the structure of the transcripts originating at these functional promoters and, consequently, whether these are alternative promoters of existing genes or promoters of new genes yet to be identified. Deep sequencing efforts (Carninci et al. 2005 While we are confident stating that the validated novel promoters are bound by proteins frequently associated with active transcription and are able to drive transcription in transient transfection assays or produce a transcript detectable by 5'-RACE, the biological relevance of these sequences remains to be determined. In vivo experiments such as targeted knockout of these sequences or in vivo reporter assays need to be performed to further characterize the roles of these sequences in living organisms. While these sequences may indeed promote transcription, the possibility exists that this may represent inconsequential transcriptional activity that has neither a positive nor a detrimental effect on the organism. In this capacity, these sequences may serve as reservoirs of regulatory potential that may be utilized in the course of evolution to positively select new genes or regulate existing genes in different ways. Thus, some or all of the novel regulatory sequences we have identified in this project may represent a snapshot of the equilibrium that has been reached between the creation and erosion of regulatory sequences in the evolving human genome. Four integrative methods were applied in this study to identify promoters because promoter-related factors were the focus of the available experimental data sets. There is no reason, however, why these approaches could not be applied to other sets of functional data to identify other types of functional genomic elements. Specifically, identifying long-range transcriptional regulatory elements such as enhancers and insulators has proven to be very difficult. 
With appropriate types of experimental data, a similar analysis as was conducted here could be applied to identify certain classes of long-range elements. In fact, some of the data sets we used were not restricted to promoters, e.g., mono-methylation of the lysine 4 residue on histone H3 and the binding of sequence specific factors such as TP53 and STAT1. Thus some of our predicted regions may be functional long-range elements. The major strength of our approach is that sensitivity can be improved by integration without sacrificing specificity, as integrating weak scores in multiple data sets can lead to a reliable prediction by our approach. It was clear that regions predicted by multiple methods had a higher validation rate than regions predicted by a single method, and this was seen for both experimental validation approaches. This highlights the value of using multiple methods. It would also be important to compare the performances of the different methods. The experimental results for regions predicted by only one method (Supplemental Table 1) do not support a statistically robust comparison in this work. This particular aspect of our study is an important future direction. Certainly, these analyses will become more powerful as more genome-wide functional data become available. Another potential future direction of this work would be to combine the unique advantages that the different methods afford to create a hybrid method that eliminates the shortcomings of the individual methods. For example, the experimental weightings derived by the Bayesian approach could be used to weight the contribution of the different experiments in the Z-score approach. Then, the regions identified by the Z-score approach could be added to the Bayesian training set to refine the weights of the individual experiments, and an iterative process could be invoked by this cycle.
ChIP-chip data sets
Among the data generated by the ENCODE consortium, the genomic regions targeted by 18 sequence-specific transcription factors, six histone modifications, POLR2A, TAF1, and GTF2B (formerly TFIIB) were determined by ChIP using antibodies to these components and either genomic tiling array (high-density oligonucleotide or PCR products) or sequencing-based analyses (ChIP-PET and STAGE). In total there are 129 data sets on 11 different cell lines. Some of these experiments were performed at four time points after retinoic acid stimulation, and some were performed before and 30 min after interferon treatment. The raw data of these experiments were obtained from the UCSC genome browser (the ENCODE consortium; http://genome.ucsc.edu/ENCODE/).
In addition, thresholded target lists (or hits) reported for each data set at several false discovery rate (FDR) cutoffs (1%, 5%, and 10% FDR) were obtained from the Transcriptional Regulation Analysis Group (The ENCODE Project Consortium 2007).
The naïve Bayes method
Training of the Bayesian model
Scanning of ENCODE regions with the Bayesian model
The TW method
The Z-score method
With all the data sets aligned to one reference interval set, we did a Z-score transformation (number of standard deviations away from the mean) of each individual data set to normalize for variation between data sets. This is appropriate because each experimental data set is dominated by negative results; therefore, the distribution of each data set is approximately normal. The normalized scores allow comparing the same genomic interval between data sets in a consistent framework.
For each interval, the score assigned is simply the sum of all the normalized scores of the different data sets at that interval. To determine the significance of the score, we produced a background distribution of score sums by shuffling the values of each individual data set over the
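A minimal sketch of this scoring scheme follows, assuming the data sets are already aligned to a common interval set and stored as one list of values per data set. The shuffling scope (each data set permuted independently over all intervals), the empirical P-value formula, and all function names are assumptions made for illustration; the text does not specify them.

```python
import random
from bisect import bisect_left
from statistics import mean, pstdev
from typing import List, Tuple


def zscore_rows(data: List[List[float]]) -> List[List[float]]:
    """Z-transform each data set (row) independently."""
    out = []
    for row in data:
        mu, sigma = mean(row), pstdev(row)
        out.append([(x - mu) / sigma for x in row])
    return out


def summed_scores(z: List[List[float]]) -> List[float]:
    """Sum the normalized scores of all data sets at each interval."""
    return [sum(col) for col in zip(*z)]


def empirical_pvalues(
    data: List[List[float]], n_shuffles: int = 20, seed: int = 0
) -> Tuple[List[float], List[float]]:
    """Return (summed scores, empirical P-values) per interval.

    Background: each data set is shuffled independently across intervals
    (an assumption) and the score sums are recomputed.
    """
    rng = random.Random(seed)
    z = zscore_rows(data)
    observed = summed_scores(z)
    null: List[float] = []
    for _ in range(n_shuffles):
        shuffled = [rng.sample(row, len(row)) for row in z]
        null.extend(summed_scores(shuffled))
    null.sort()
    n = len(null)
    # Fraction of background sums at least as large as the observed sum.
    pvals = [(n - bisect_left(null, s) + 1) / (n + 1) for s in observed]
    return observed, pvals


# Toy data: 20 data sets, 200 intervals, a modest shared signal at interval 42.
rng = random.Random(1)
data = [[rng.gauss(0.0, 1.0) for _ in range(200)] for _ in range(20)]
for row in data:
    row[42] += 2.0
scores, pvals = empirical_pvalues(data)
print(max(range(len(scores)), key=scores.__getitem__))
```

The toy data illustrate the point made in the introduction: a signal that is unremarkable in any single data set becomes the clear top-scoring interval once the normalized scores are summed.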
The voting method
Merging of the predicted regions by the four methods
Overlap of the predicted regions with genomic annotations (Fig. 2)
Distance distributions of predicted regions with respect to transcript boundaries (Fig. 3)
Sequence analysis of the validated and unvalidated regions
Fragment cloning for testing promoter activity using transfection assays
Cell Culture, transient transfection, and reporter gene activity assays
Identification of active promoters
Selection of putative promoters for RACE validation
In all cases, only promoters with some evidence of transcriptional activity nearby (such as a TAR, a CAGE tag, or a GIS-PET) were selected, and one active region was used as the index for the 5'-RACE design. In cases where the transcriptional activity was based only on TARs, two indices were selected: one upstream and one downstream of the promoter. To determine the design basis, we constructed a matrix for describing all the putative promoter regions. It summarized the relationship between each promoter and various transcription data. A promoter was considered to be putatively novel if it was not near (from 2 kb to 200 bp) the 5'-end of a gene in the known genes track on the UCSC genome browser. We also computationally assessed each promoter's functional potential based on its distance to nearby transcriptional activity as detected by transfrags/TARs, CAGE tags, and GIS-PET. A promoter was considered to be functional if a transfrag, a CAGE tag, or the 5' tag of a GIS-PET was detected within this promoter region or in its close proximity (±1.5 kb). This comparison clearly separated our predicted promoters into lists with or without transcriptional support. Some of the putative promoters were then chosen for experimental validation based on the above matrix describing an individual promoter's relationship with transcriptional data (including known TSS) and the number of methods predicting it. Whenever possible, the candidates from each group were selected randomly with one half predicted to be highly novel (i.e., not near GENCODE TSS).
5'-RACE experiments
Total RNA from human NB4 cell line was used in cDNA amplification by SMART RACE kit (Clontech). First-strand cDNA was synthesized using PowerScript Reverse Transcriptase. A total of 1 µg RNA was used in a final volume of 10 µL of reverse transcription (RT) reaction (100 ng/µL). RACE was followed by PCR amplification using Advantage 2 PCR Enzyme System (Clontech); 0.5 µL RT reaction from the above was used in 50 µL of PCR reaction. Nested PCRs were performed using 1 µL of RACE PCR product in 50 µL reaction. The PCR program was 30 sec at 94°C and 3 min at 72°C for five cycles; then 30 sec at 94°C, 30 sec at 70°C, and 3 min at 72°C for five cycles; followed by 25 cycles of 30 sec at 94°C and 30 sec at 68°C; concluded by an extension cycle of 3 min at 72°C. PCR products were gel-purified with QIAquick 96-well PCR purification kit (Qiagen) and subsequently treated with Taq polymerase to add "A" overhang. These PCR products were then cloned into TOPO XL PCR cloning vectors (Invitrogen). Transformation was performed with One Shot Top10 ultracompetent cells (Invitrogen) in 96-well format. Five to six subclones were produced for each specific RACE PCR product. The DNA of each subclone was prepared and digested with EcoRI. The digestions were analyzed by agarose gel electrophoresis in order to determine the approximate size of the insert. All subclones were end-sequenced using M13 forward and reverse primers. Supplemental Figure 2 shows examples of RACE PCR products. All the sequenced RACE PCR products are available as Supplemental Materials.
Assignment of RACE products to putative promoters
To evaluate the activity success rate of the predicted promoters, we first constructed a genomic promoter-vicinity library by extracting the genomic DNA sequence from 5 kb upstream to 5 kb downstream around each of the 62 promoters, from the hg17 release of the human genome (NCBI build 35). All further mapping used BLAT (Kent 2002). We then mapped all the RACE-cDNA sequences against the library and also confirmed the position and orientation of the primers by mapping them to the library. In addition, we mapped three essential features of the RACE product onto the cDNA sequence itself: the linker/adaptor and the two regions of the TOPO XL cloning vector immediately upstream and downstream of the insert. Finally, we applied a filtering algorithm to validate the association between a RACE-cDNA sequence and a promoter by requiring that the mapped part of the sequence start at the primer site and extend toward the promoter. The algorithm also ensures that the mapped part of the sequence was the full length of the insert by requiring that the TOPO XL sequences be immediately adjacent to the portion of the sequence that BLAT could map to the genomic region and in the correct orientation relative to each other and to the primer site. This end is taken as the TSS of the transcript. The filtering algorithm utilizes the presence of the forward and reverse reads and combines them to reconstruct the RACE insert. This is important since the insert can be long and the two complementary reads might not overlap but only cover the two ends of the insert, leaving the actual length of the insert unknown without using additional cues. A clone is considered positive evidence for promoter activity if the TSS falls within the region of the predicted promoter plus 1 kb on either end. All the processed sequences were deposited in GenBank under accession nos. EL582345–EL585325.
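The final acceptance criterion can be sketched as a simple interval test. The sketch assumes that each clone has already passed the BLAT-based filtering and been reduced to a single inferred TSS coordinate; the function names and coordinates are hypothetical, and the two-clone requirement is an assumption borrowed from the abstract's statement that validated regions were associated with at least two RACE products.

```python
from typing import Iterable, Tuple

Region = Tuple[int, int]  # (start, end) of a predicted promoter, inclusive (assumption)


def clone_supports_promoter(tss: int, region: Region, flank: int = 1000) -> bool:
    """True if the inferred TSS lies within the predicted region plus 1 kb on either end."""
    start, end = region
    return start - flank <= tss <= end + flank


def promoter_validated(
    tss_positions: Iterable[int], region: Region, min_clones: int = 2
) -> bool:
    """True if at least `min_clones` clones support the region.

    The default of two clones is an assumption based on the abstract,
    not a rule stated in this section.
    """
    supporting = sum(clone_supports_promoter(t, region) for t in tss_positions)
    return supporting >= min_clones


# Hypothetical coordinates, for illustration only.
region = (12_000, 12_800)
print(clone_supports_promoter(10_900, region))                 # False (100 bp outside the flank)
print(promoter_validated([11_200, 13_500, 20_000], region))    # True (two supporting clones)
```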
We thank The ENCODE Project Consortium for making their data publicly available, the Genes and Transcripts Analysis Group for providing transcription data sets, the Transcriptional Regulatory Analysis Group for providing ChIP-chip and ChIP-sequencing data sets, and the ENCODE Chromatin/Replication Analysis Group for providing the DHS and FAIRE data set. This work was funded by ENCODE grant R01HG03110 from NHGRI, NIH to Z.W.; ENCODE grant U01HG003162 from NHGRI, NIH to R.M.M.; and ENCODE grant U01HG003156 from NHGRI, NIH to M.S. P.J.C. was supported by Stanford Genome Training Program grant T32HG00044 from NHGRI.
6 These authors contributed equally to this work.
7 Presently at SwitchGear Genomics, Menlo Park, CA 94025, USA. E-mail zhiping{at}bu.edu; fax (617) 353-6766.
E-mail myers{at}shgc.stanford.edu; fax (650) 725-9689. [Supplemental material is available online at www.genome.org.] Article is online at http://www.genome.org/cgi/doi/10.1101/gr.5716607
Balakirev, E.S. and Ayala, F.J. 2003. Pseudogenes: Are they "junk" or functional DNA? Annu. Rev. Genet. 37: 123–151.
Bertone, P., Stolc, V., Royce, T.E., Rozowsky, J.S., Urban, A.E., Zhu, X., Rinn, J.L., Tongprasit, W., Samanta, M., Weissman, S., et al. 2004. Global identification of human transcribed sequences with genome tiling arrays. Science 306: 2242–2246.
Carninci, P., Kasukawa, T., Katayama, S., Gough, J., Frith, M.C., Maeda, N., Oyama, R., Ravasi, T., Lenhard, B., Wells, C., et al. 2005. The transcriptional landscape of the mammalian genome. Science 309: 1559–1563.
Cooper, S.J., Trinklein, N.D., Anton, E.D., Nguyen, L., and Myers, R.M. 2006. Comprehensive analysis of transcriptional promoter structure and function in 1% of the human genome. Genome Res. 16: 1–10.
The ENCODE Project Consortium. 2004. The ENCODE (ENCyclopedia of DNA Elements) Project. Science 306: 636–640.
The ENCODE Project Consortium. 2007. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature (in press).
Gerstein, M., Sonnhammer, E.L., and Chothia, C. 1994. Volume changes in protein evolution. J. Mol. Biol. 236: 1067–1078.
Giardine, B., Riemer, C., Hardison, R.C., Burhans, R., Elnitski, L., Shah, P., Zhang, Y., Blankenberg, D., Albert, I., Taylor, J., et al. 2005. Galaxy: A platform for interactive large-scale genome analysis. Genome Res. 15: 1451–1455.
Giresi, P.G., Kim, J., McDaniell, R.M., Iyer, V.R., and Lieb, J.D. 2007. FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human chromatin. Genome Res. (this issue) doi: 10.1101/gr.5716607.
Harrow, J., Denoeud, F., Frankish, A., Reymond, A., Chen, C.K., Chrast, J., Lagarde, J., Gilbert, J.G., Storey, R., Swarbreck, D., et al. 2006. GENCODE: producing a reference annotation for ENCODE. Genome Biol. (Suppl 1) 7: S4.1–S4.9.
Jordan, I.K., Rogozin, I.B., Glazko, G.V., and Koonin, E.V. 2003. Origin of a substantial fraction of human regulatory sequences from transposable elements. Trends Genet. 19: 68–72.
Karolchik, D., Baertsch, R., Diekhans, M., Furey, T.S., Hinrichs, A., Lu, Y.T., Roskin, K.M., Schwartz, M., Sugnet, C.W., Thomas, D.J., et al. 2003. The UCSC Genome Browser Database. Nucleic Acids Res. 31: 51–54.
Kent, W.J. 2002. BLAT: the BLAST-like alignment tool. Genome Res. 12: 656–664.
Lee, C.K., Shibata, Y., Rao, B., Strahl, B.D., and Lieb, J.D. 2004. Evidence for nucleosome depletion at active regulatory regions genome-wide. Nat. Genet. 36: 900–905.
Ng, P., Wei, C.L., Sung, W.K., Chiu, K.P., Lipovich, L., Ang, C.C., Gupta, S., Shahab, A., Ridwan, A., Wong, C.H., et al. 2005. Gene identification signature (GIS) analysis for transcriptome characterization and genome annotation. Nat. Methods 2: 105–111.
Sabo, P.J., Kuehn, M.S., Thurman, R., Johnson, B.E., Johnson, E.M., Cao, H., Yu, M., Rosenzweig, E., Goldy, J., Haydock, A., et al. 2006. Genome-scale mapping of DNase I sensitivity in vivo using tiling DNA microarrays. Nat. Methods 3: 511–518.
Shiraki, T., Kondo, S., Katayama, S., Waki, K., Kasukawa, T., Kawaji, H., Kodzius, R., Watahiki, A., Nakamura, M., Arakawa, T., et al. 2003. Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc. Natl. Acad. Sci. 100: 15776–15781.
Trinklein, N.D., Aldred, S.J., Saldanha, A.J., and Myers, R.M. 2003. Identification and functional analysis of human transcriptional promoters. Genome Res. 13: 308–312.
Wingender, E., Chen, X., Hehl, R., Karas, H., Liebich, I., Matys, V., Meinhardt, T., Pruss, M., Reuter, I., and Schacherer, F. 2000. TRANSFAC: An integrated system for gene expression regulation. Nucleic Acids Res. 28: 316–319.
Zheng, D., Zhang, Z., Harrison, P.M., Karro, J., Carriero, N., and Gerstein, M. 2005. Integrated pseudogene annotation for human chromosome 22: Evidence for transcription. J. Mol. Biol. 349: 27–45.
Received July 3, 2006; accepted in revised form February 5, 2007.