Published online before print
August 3, 2007, 10.1101/gr.6533407 Genome Res. 17:1278-1285, 2007 ©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07 $5.00
Letter
Human gene organization driven by the coordination of replication and transcription
1 Centre de Génétique Moléculaire (CNRS), 91198 Gif-sur-Yvette, France; 2 Laboratoire Joliot Curie et Laboratoire de Physique, Ecole Normale Supérieure de Lyon, CNRS, 69364 Lyon, France; 3 Génétique des Génomes Bactériens, CNRS URA2171, Institut Pasteur, 75015 Paris, France; 4 Atelier de Bioinformatique, Université Pierre et Marie Curie-Paris 6, 75005 Paris, France
In this work, we investigated a large-scale organization of the human genes with respect to putative replication origins. We developed an appropriate multiscale method to analyze the nucleotide compositional skew along the genome and found that in more than one-quarter of the genome, the skew profile presents characteristic patterns consisting of successions of N-shaped structures, designated here N-domains, bordered by putative replication origins. Our analysis of recent experimental timing data confirmed that, in a number of cases, domain borders coincide with replication initiation zones active in the early S phase, whereas the central regions replicate in the late S phase. Around the putative origins, genes are abundant and broadly expressed, and their transcription is co-oriented with replication fork progression. These features weaken progressively with the distance from putative replication origins. At the center of domains, genes are rare and expressed in few tissues. We propose that this specific organization could result from the constraints of accommodating the replication and transcription initiation processes at chromatin level, and reducing head-on collisions between the two machineries. Our findings provide a new model of gene organization in the human genome, which integrates transcription, replication, and chromatin structure as coordinated determinants of genome architecture.
It has long been known that genes are nonrandomly distributed in eukaryote genomes (gene-dense regions alternating with gene deserts) (Mouchiroud et al. 1991
Here, we address the question of gene organization with respect to replication. Previous studies have shown that the human genome displays nucleotide compositional strand asymmetries that probably result from asymmetric mutation and repair processes associated with replication and transcription (Green et al. 2003
In order to predict putative replication origins, in previous studies we explored the large-scale behavior of nucleotide strand compositional asymmetries defined as the skew S = (G – C)/(G + C) + (T – A)/(T + A) (i.e., the relative excess of G over C and T over A) (Brodie of Brodie et al. 2005 1000 putative replication origins in the human genome. Remarkably, successive transitions are connected to each other by DNA segments in which the S values decrease in the 5' to 3' direction, thereby displaying a characteristic serrated pattern reminiscent of factory roofs (Fig. 1A). We propose as a working model that this N-like shape results from the superimposition of two patterns. One decreases steadily in the 5' to 3' direction and would be attributable to replication initiating at two fixed adjacent origins (associated with two upward transitions) and terminating during the successive germ-line cell divisions at various positions randomly dispersed between these origins (Fig. 1B,C). The other pattern would result from transcription-associated strand asymmetries that generate step-like blocks corresponding to (+) and (–) genes (Touchon et al. 2003
Detecting the N-domains
To extract the N-domains from the noisy S profile of the genome, we developed an adapted wavelet-based multiscale methodology to identify segments of variable length and position displaying a factory-roof pattern (Methods; Supplemental Figs. S1, S2). According to the model, the selection involves (1) searching for segments that decrease between two large upward jumps, and (2) retaining those containing both intergenic regions with a linearly decreasing S profile (possibly induced by replication) and genes associated with step-like blocks (possibly induced by transcription) superimposed over this linearly decreasing profile. This amounts to disentangling the components of the skew attributed to replication and to transcription (Methods; Supplemental Fig. S3). When applied to the human genome, the method detected 678 N-domains bordered by 1060 putative replication origins. These domains are evenly distributed in most chromosomes, spanning 28.3% of the genome with a mean length L = 1.2 ± 0.6 Mbp (Fig. 2A; Methods; Supplemental Fig. S4).
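As an illustration of the quantity on which this detection operates, the following minimal sketch computes the skew S = (G – C)/(G + C) + (T – A)/(T + A) in nonoverlapping windows (1 kbp, as stated in Methods). The function names, the handling of windows in which a ratio is undefined (returned as None), and the dropping of a trailing partial window are assumptions made for this example, not details taken from the study.

```python
from typing import List, Optional


def skew(window: str) -> Optional[float]:
    """Return S = (G - C)/(G + C) + (T - A)/(T + A) for one window.

    Returns None when either ratio is undefined (for example a fully
    masked window); this convention is an assumption of the sketch.
    """
    w = window.upper()
    g, c, t, a = (w.count(base) for base in "GCTA")
    if g + c == 0 or t + a == 0:
        return None
    return (g - c) / (g + c) + (t - a) / (t + a)


def skew_profile(seq: str, window: int = 1000) -> List[Optional[float]]:
    """Skew in consecutive nonoverlapping windows.

    A trailing partial window is dropped (assumption of the sketch).
    """
    return [
        skew(seq[i:i + window])
        for i in range(0, len(seq) - window + 1, window)
    ]


# Toy sequence: one window rich in G and T, then one rich in C and A.
toy = "GGTT" * 250 + "CCAA" * 250
profile = skew_profile(toy)
```

On the toy sequence the first window has the maximal skew and the second the minimal one, which mimics in miniature the sign change seen across a sharp transition of the S profile.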
During the selection process, a number of candidate structures were examined that were not finally retained as N-domains since they display some departure from symmetry of the skew with respect to the center of the domain. However, these structures, which span approximately another 30% of the genome, can be considered as N-domain-like structures, and do display a type of gene organization reminiscent of that observed in the bona-fide N-domains (described below). In most of the remaining genome regions, two types of S profile were observed. The first type, observed in regions with high gene density, small gene size, and high GC content (Lander et al. 2001 20% of the genome). These complex S profiles hampered the detection of the N-domains. For example, both small domain density and small chromosome coverage were observed in chromosome 19, which contains a high proportion of gene-rich and GC-rich regions (Supplemental Fig. S5c,d). The second type, observed in gene-poor regions with low GC content did not display large upward jumps, but rather flat patterns, suggesting that replication origins are not fixed. These regions span 15% of the genome and correspond to gene deserts (Lander et al. 2001
Analysis of the S profile of the N-domains
To what extent could the specific S profile of the N-domains be expected to result from chance? We first examined the human S profile, looking for structures presenting an inverted factory-roof pattern, i.e., two downward jumps separated by a steadily increasing skew. We adapted our method to detect such structures (the method is the same as that described above apart from the analyzing wavelet; Supplemental Fig. S1b). When this method was applied to human autosomes, it detected no more than 27 inverted structures (vs. 678 N-domains) spanning only 0.6% of the genome. N-domains therefore very significantly outnumber inverted structures (P < 10⁻¹⁵). Secondly, we looked for N-domains in sequences obtained after shuffling the order of genes and intergenic regions (Methods), and found that they were significantly less frequent than in the native sequences (P < 10⁻¹⁵). This observation also provides the first indication that the existence of N-domains does indeed reflect some specific gene organization.
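The text does not state which test underlies the comparison of 678 N-domains with 27 inverted structures. As a rough plausibility check only, the sketch below assumes a sign-test null hypothesis in which each detected structure is equally likely to be N-shaped or inverted, and computes the exact one-sided binomial tail. The function name and the choice of test are assumptions of this example.

```python
from fractions import Fraction
from math import comb


def binomial_upper_tail(k: int, n: int) -> Fraction:
    """Exact P(X >= k) for X ~ Binomial(n, 1/2)."""
    favourable = sum(comb(n, i) for i in range(k, n + 1))
    return Fraction(favourable, 2 ** n)


n_domains = 678   # N-domains reported in the text
n_inverted = 27   # inverted structures reported in the text

# Probability of at least 678 N-shaped structures out of 705 under the
# assumed 50/50 null hypothesis.
p_value = binomial_upper_tail(n_domains, n_domains + n_inverted)
below_threshold = float(p_value) < 1e-15
```

Under this assumed null hypothesis the tail probability falls far below the 10⁻¹⁵ bound quoted above, so the reported significance level is at least consistent with a simple sign test.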
Replication timing profile of the N-domains
Gene organization in the N-domains
Gene shuffling experiments revealed an underlying gene organization in the N-domains (see above). In order to decipher this organization, we analyzed the gene patterns. Most putative origins (domain borders) are intergenic (77%) and located near a gene promoter more often than would be expected by chance (Supplemental Fig. S6a,b). The N-domains contain approximately equal numbers of genes oriented in each direction (1511 + genes and 1507 – genes). Gene distributions in the 5' halves of domains contain more + genes than – genes, regardless of the total number of genes located in the half-domains (Supplemental Fig. S6c). Symmetrically, the 3' halves contain more – genes than + genes (Supplemental Fig. S6d). A total of 32.7% of half-domains contain one gene, and 50.9% contain more than one gene. For convenience, + genes in the 5' halves and – genes in the 3' halves are defined as R+ genes (Fig. 4A): their transcription is, in most cases, oriented in the same direction as the putative replication fork progression (genes transcribed in the opposite direction are defined as R– genes). The 678 N-domains contain significantly more R+ genes (2041) than R– genes (977) (χ² = 375, P < 10⁻¹⁵, Supplemental Table S2a). Within 50 kbp of putative replication origins, the mean density of R+ genes is 8.2 times greater than that of R– genes. This asymmetry weakens progressively with the distance from the putative origins, up to 250 kbp (Fig. 4B). A similar asymmetric pattern is observed when the domains containing duplicated genes are eliminated from the analysis, whereas control domains obtained after randomization of domain positions (Methods) present similar R+ and R– gene density distributions (Supplemental Fig. S7a–a''). The mean length of the R+ genes near the putative origins is significantly greater (~160 kbp) than that of the R– genes (~50 kbp); however, both tend toward similar values (~70 kbp) at the center of the domain (Fig. 4C). A similar pattern is observed after eliminating duplicated genes, whereas, in contrast, the control domains display fairly constant gene length (Supplemental Fig. S7b–b''). Within 50 kbp of the putative origins, the ratio between the numbers of base pairs transcribed in the R+ and R– directions is 23.7; this ratio falls to 1 at the domain centers (Fig. 4D). A similar pattern is observed after eliminating duplicated genes; this ratio is constant in the control domains (Supplemental Fig. S7c–c''). This strong transcriptional polarity could be mainly attributable to the preferential R+ orientation of the first gene (closest to the extremity). However, polarity is still observed for half-domains harboring various gene numbers even after the first gene has been eliminated (Supplemental Fig. S8).
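The reported statistic for the R+ versus R– gene counts can be checked with a chi-square goodness-of-fit computation. The sketch below assumes a null hypothesis of equal numbers of R+ and R– genes (an assumption of this example; the exact design is in Supplemental Table S2a, which is not reproduced here) and one degree of freedom. The function name is hypothetical.

```python
import math


def chi_square_equal_split(observed):
    """Chi-square statistic against equal expected counts in each class."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)


r_plus, r_minus = 2041, 977  # gene counts reported in the text

chi2 = chi_square_equal_split([r_plus, r_minus])  # about 375.1

# Survival function of a chi-square variable with 1 degree of freedom:
# P(X > x) = erfc(sqrt(x / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2))
```

Under this assumption the statistic rounds to the 375 quoted in the text, and the corresponding P-value lies far below 10⁻¹⁵.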
Gene expression in the N-domains
We analyzed the breadth of expression, Nt (number of tissues in which a gene is expressed), of genes located within the N-domains. We found that it significantly decreases from the extremities to the center, regardless of whether it is measured by EST, SAGE, or microarray data (χ² = 29, P = 10⁻⁸ for EST data). The distribution is symmetrical in the 5' and 3' half-domains (Fig. 5A,B). Significantly decreasing mean Nt values are also observed after eliminating duplicated genes, whereas they remain constant within the control domains obtained after randomizing the domain positions (Methods; Supplemental Fig. S7d–f''). The distribution of Nt values (determined using ESTs) displays a bimodal pattern for the genes located in the domains (Fig. 5C), with one mode (peak at Nt < 5) corresponding to the genes expressed in only a few tissues, and a second mode (a wide bump centered at Nt ~15) corresponding to widely expressed genes. It is noteworthy that this distribution is similar to that found for the complete set of human genes (Supplemental Fig. S9d). Genes located near the putative replication origins tend to be widely expressed (Fig. 5D), whereas those located far from them are mostly tissue specific (Fig. 5E). We checked that the decrease in both Nt values and gene length L, from the N-domain border to its center (Fig. 4C), did not reflect a correlation between these factors: no correlation was observed between gene length and expression breadth measured using EST, SAGE, or microarray data (Supplemental Fig. S9a–c). In addition, no significant correlation was observed between the transcription rate of a gene and its position within an N-domain, whether or not duplicated genes are eliminated from the analysis (data not shown).
This study shows that some features of human genome organization can be unraveled by examining the properties of the nucleotide compositional skew. The S profile exhibits a highly significant number of occurrences of so-called N-domains, specific structures consisting of two sharp upward transitions connected by a downward-sloping segment. These large structures are recognizable along all chromosomes. They are unambiguously detected by our methodology in more than one-quarter of the genome. Could these structures be generated solely by transcription? Transcription generates strand asymmetries along gene sequences, leading to step-like blocks in the S profile (Green et al. 2003
The replication timing profile of the N-domains shows that, on average, the extremities replicate earlier than the neighboring regions, which is consistent with these regions being true replication origins, active early in the S phase. This profile was established using replication timing data obtained from lymphoblastoid cells (Woodfine et al. 2005
According to our model, replication units would be better described by N-domains, as defined in this study, than by the usual replicons. Indeed, the fixed terminators of the replicons would not be suitable for describing the putative variable termination sites within the N-domains. The length of the N-domains matches the large,
We then asked whether these N-domains correspond to a specific gene organization pattern. Most putative replication origins located at domain extremities are intergenic and located close to promoters of widely expressed genes (housekeeping genes) oriented toward the domain center. Gene density, breadth of expression, and transcription polarity all tend to decrease progressively from the extremities of the domain toward its center. In the central region, genes are few in number, tissue specific, and have no preferential orientation (Fig. 6). We propose that coordination between replication and transcription is the key to this complex architecture. The putative replication origins would mostly be active early in the S phase in most tissues. Their activity could result from particular genomic context involving transcription-factor binding sites and/or from the transcription of their neighboring housekeeping genes. This activity could also be associated with an open chromatin structure, permissive to early replication and gene expression in most tissues (Gilbert et al. 2004
Near the putative origins bordering the N-domains, transcription is preferentially oriented in the same direction as replication fork progression. We propose that this co-orientation would reduce head-on collisions between the replication and transcription machineries, which could induce deleterious recombination events either directly or via stalling of the replication fork (Deshpande and Newlon 1996 The data presented here strongly suggest the existence in the human genome of regions bordered by putative early replication origins in which gene position, orientation, and expression breadth present a high level of organization, possibly mediated by the chromatin structure. This allows us to propose a model of gene order that relates transcription and replication as coordinated determinants of genome organization.
Sequence and expression data
Sequence and annotation data were retrieved from the Genome Browser of the University of California Santa Cruz (UCSC, hg17). To obtain gene sequences, we used the RefSeq annotation (containing only protein-coding transcripts). When two genes with the same orientation overlapped, the longer one was retained. For the detection of N-domains, sequences masked with REPEATMASKER were retrieved from the UCSC browser to avoid the biases intrinsic to repeated elements. In all other analyses, sequences were not masked. The skew, S, was computed in nonoverlapping, 1-kbp windows. EST, SAGE, and microarray data were provided by M. Sémon and L. Duret (Sémon et al. 2005
Detection of N-domains using the wavelet transform
δ > 0 is the replication bias. If we now take into account the contribution of transcription ST to the bias in a gene-containing domain, the asymmetry profiles can be written as:
χg is the characteristic function for the gth gene (1 when the point t lies within the gene, and 0 elsewhere), and cg is its transcriptional bias calculated on the Watson strand (likely to be positive for + genes and negative for – genes). For each domain identified in the previous step, we used a least-square fitting procedure to estimate the replication bias, δ, and each value of the gene transcription bias, cg. The resulting χ² value was used to select the domains where the noisy S profile is well described by Equation 2. As illustrated in Supplemental Figure S3 and Supplemental Table S1 for a fragment of human chromosome 6 that contains three adjacent N-domains (Supplemental Fig. S3a), this method provides a very efficient way of disentangling the step-like component of strand asymmetry associated with transcription (Supplemental Fig. S3b) from the jagged component associated with replication (Supplemental Fig. S3c).
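The fitting step described above can be sketched as an ordinary linear least-squares problem. Because the equations are not reproduced in this text, the sketch makes explicit assumptions: the replication component is taken to be linear in the rescaled position t in [0, 1], written here as –2δ(t – 1/2) with δ the replication bias, and the transcription component is a sum of per-gene indicator functions weighted by the transcriptional biases cg. The function name `fit_domain`, the use of NumPy, and the residual sum of squares returned in place of the χ² selection value are all choices made for this illustration.

```python
import numpy as np


def fit_domain(t, s, genes):
    """Least-squares estimate of the replication bias and gene biases.

    t     : rescaled positions in [0, 1], shape (n,)
    s     : skew values at those positions, shape (n,)
    genes : list of (start, end) intervals in the same rescaled units
    Returns (delta, c, rss) with c the array of per-gene biases.
    """
    t = np.asarray(t, dtype=float)
    s = np.asarray(s, dtype=float)
    # Column 0: assumed linear replication term, coefficient delta.
    columns = [-2.0 * (t - 0.5)]
    # One indicator column per gene, coefficient c_g.
    for start, end in genes:
        columns.append(((t >= start) & (t < end)).astype(float))
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, s, rcond=None)
    residual = s - design @ coef
    return coef[0], coef[1:], float(residual @ residual)


# Synthetic, noise-free domain with one + gene and one - gene.
t = (np.arange(100) + 0.5) / 100
genes = [(0.1, 0.3), (0.6, 0.8)]
s = (-2 * 0.05 * (t - 0.5)
     + 0.04 * ((t >= 0.1) & (t < 0.3))
     - 0.03 * ((t >= 0.6) & (t < 0.8)))
delta, c, rss = fit_domain(t, s, genes)
```

On this synthetic profile the fit recovers the values used to build it, which shows how a linear trend and step-like blocks can be separated when the gene boundaries are known from the annotation.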
Applying this procedure to the 22 human autosomes, we detected 678 N-domains and predicted 1060 putative origins of replication (in 296 cases, the right origin of a domain is also the left origin of the following domain). Examples of such N-domains are illustrated in Figure 2A and Supplemental Fig. S4. The domain length ranges between
Randomization of gene order
Randomization of N-domain positions
Detection of duplicated genes
Statistics
The positions of the N-domains and of the inverted N-domains are available as Supplemental material.
We thank M. Sémon and L. Duret for providing the expression data for human genes, S. Camier, L. Duquenne, and M. Ghosh for their careful reading of the manuscript, and O. Hyrien and B. Michel for helpful discussions. This work was supported by the Centre National de la Recherche Scientifique (CNRS), the Agence Nationale de la Recherche (NT05-3_41825), the ACI IMPBIO 2004, the French Ministère de l'Education et de la Recherche, and the PAI Tournesol. B.A. acknowledges support from the European Commission Marie Curie action (MERG-CT-2004-511923).
5 Present addresses: Département de Mathématique, Université de Liège, 12 Grande Traverse, 4000 Liège, Belgium.
E-mail thermes{at}cgm.cnrs-gif.fr; fax 33-1-69-82-38-77. [Supplemental material is available online at www.genome.org.] Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.6533407
Berezney, R., Dubey, D.D., and Huberman, J.A. 2000. Heterogeneity of eukaryotic replicons, replicon clusters, and replication foci. Chromosoma 108: 471–484.
Brodie of Brodie, E.B., Nicolay, S., Touchon, M., Audit, B., d'Aubenton-Carafa, Y., Thermes, C., and Arneodo, A. 2005. From DNA sequence analysis to modeling replication in the human genome. Phys. Rev. Lett. 94: 248103.
Cajiao, I., Zhang, A., Yoo, E.J., Cooke, N.E., and Liebhaber, S.A. 2004. Bystander gene activation by a locus control region. EMBO J. 23: 3854–3863.
Callan, H.G. 1972. Replication of DNA in the chromosomes of eukaryotes. Proc. R. Soc. Lond. 181: 19–41.
Caron, H., van Schaik, B., van der Mee, M., Baas, F., Riggins, G., van Sluis, P., Hermus, M.C., van Asperen, R., Boon, K., Voute, P.A., et al. 2001. The human transcriptome map: Clustering of highly expressed genes in chromosomal domains. Science 291: 1289–1292.
Chakalova, L., Debrand, E., Mitchell, J.A., Osborne, C.S., and Fraser, P. 2005. Replication and transcription: Shaping the landscape of the genome. Nat. Rev. Genet. 6: 669–677.
Cheng, J., Kapranov, P., Drenkow, J., Dike, S., Brubaker, S., Patel, S., Long, J., Stern, D., Tammana, H., Helt, G., et al. 2005. Transcriptional maps of 10 human chromosomes at 5-nucleotide resolution. Science 308: 1149–1154.
Danis, E., Brodolin, K., Menut, S., Maiorano, D., Girard-Reydet, C., and Mechali, M. 2004. Specification of a DNA replication origin by a transcription complex. Nat. Cell Biol. 6: 721–730.
DePamphilis, M.L. 2005. Cell cycle dependent regulation of the origin recognition complex. Cell Cycle 4: 70–79.
Deshpande, A.M. and Newlon, C.S. 1996. DNA replication fork pause sites dependent on transcription. Science 272: 1030–1033.
Edenberg, H.J. and Huberman, J.A. 1975. Eukaryotic chromosome replication. Annu. Rev. Genet. 9: 245–284.
Ghosh, M., Liu, G., Randall, G., Bevington, J., and Leffak, M. 2004. Transcription factor binding and induced transcription alter chromosomal c-myc replicator activity. Mol. Cell. Biol. 24: 10193–10207.
Gilbert, N., Boyle, S., Fiegler, H., Woodfine, K., Carter, N.P., and Bickmore, W.A. 2004. Chromatin architecture of the human genome: Gene-rich domains are enriched in open chromatin fibers. Cell 118: 555–566.
Green, P., Ewing, B., Miller, W., Thomas, P.J., and Green, E.D. 2003. Transcription-associated mutational asymmetry in mammalian evolution. Nat. Genet. 33: 514–517.
Hurst, L.D., Pal, C., and Lercher, M.J. 2004. The evolutionary dynamics of eukaryotic gene order. Nat. Rev. Genet. 5: 299–310.
Jeon, Y., Bekiranov, S., Karnani, N., Kapranov, P., Ghosh, S., Macalpine, D., Lee, C., Hwang, D.S., Gingeras, T.R., and Dutta, A. 2005. Temporal profile of replication of human chromosomes. Proc. Natl. Acad. Sci. 102: 6419–6424.
Kapranov, P., Drenkow, J., Cheng, J., Long, J., Helt, G., Dike, S., and Gingeras, T.R. 2005. Examples of the complex architecture of the human transcriptome revealed by RACE and high-density tiling arrays. Genome Res. 15: 987–997.
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860–921.
Lercher, M.J., Urrutia, A.O., and Hurst, L.D. 2002. Clustering of housekeeping genes provides a unified model of gene order in the human genome. Nat. Genet. 31: 180–183.
Li, W.H., Gu, Z., Wang, H., and Nekrutenko, A. 2001. Evolutionary analyses of the human genome. Nature 409: 847–849.
Lin, C.M., Fu, H., Martinovsky, M., Bouhassira, E., and Aladjem, M.I. 2003. Dynamic alterations of replication timing in mammalian cells. Curr. Biol. 13: 1019–1028.
MacAlpine, D.M., Rodriguez, H.K., and Bell, S.P. 2004. Coordination of replication and transcription along a Drosophila chromosome. Genes & Dev. 18: 3094–3105.
Mouchiroud, D., D'Onofrio, G., Aissani, B., Macaya, G., Gautier, C., and Bernardi, G. 1991. The distribution of genes in the human genome. Gene 100: 181–187.
Ovcharenko, I., Loots, G.G., Nobrega, M.A., Hardison, R.C., Miller, W., and Stubbs, L. 2005. Evolution and functional classification of vertebrate gene deserts. Genome Res. 15: 137–145.
Rocha, E.P. and Danchin, A. 2003. Essentiality, not expressiveness, drives gene-strand bias in bacteria. Nat. Genet. 34: 377–378.
Schubeler, D., Scalzo, D., Kooperberg, C., van Steensel, B., Delrow, J., and Groudine, M. 2002. Genome-wide DNA replication profile for Drosophila melanogaster: A link between transcription and replication timing. Nat. Genet. 32: 438–442.
Sémon, M. and Duret, L. 2006. Evolutionary origin and maintenance of coexpressed gene clusters in mammals. Mol. Biol. Evol. 23: 1715–1723.
Sémon, M., Mouchiroud, D., and Duret, L. 2005. Relationship between gene expression and GC-content in mammals: Statistical significance and biological relevance. Hum. Mol. Genet. 14: 421–427.
Singer, G.A., Lloyd, A.T., Huminiecki, L.B., and Wolfe, K.H. 2005. Clusters of co-expressed genes in mammalian genomes are conserved by natural selection. Mol. Biol. Evol. 22: 767–775.
Spellman, P.T. and Rubin, G.M. 2002. Evidence for large domains of similarly expressed genes in the Drosophila genome. J. Biol. 1: 5. doi: 10.1186/1475-4924-1-5.
Sproul, D., Gilbert, N., and Bickmore, W.A. 2005. The role of chromatin structure in regulating the expression of clustered genes. Nat. Rev. Genet. 6: 775–781.
Takeuchi, Y., Horiuchi, T., and Kobayashi, T. 2003. Transcription-dependent recombination and the role of fork collision in yeast rDNA. Genes & Dev. 17: 1497–1506.
Touchon, M., Nicolay, S., Arneodo, A., d'Aubenton-Carafa, Y., and Thermes, C. 2003. Transcription-coupled TA and GC strand asymmetries in the human genome. FEBS Lett. 555: 579–582.
Touchon, M., Arneodo, A., d'Aubenton-Carafa, Y., and Thermes, C. 2004. Transcription-coupled and splicing-coupled strand asymmetries in eukaryotic genomes. Nucleic Acids Res. 32: 4969–4978.
Touchon, M., Nicolay, S., Audit, B., Brodie of Brodie, E.B., d'Aubenton-Carafa, Y., Arneodo, A., and Thermes, C. 2005. Replication-associated strand asymmetries in mammalian genomes: Toward detection of replication origins. Proc. Natl. Acad. Sci. USA 102: 9836–9841.
Vassilev, L. and Johnson, E.M. 1990. An initiation zone of chromosomal DNA replication located upstream of the c-myc gene in proliferating HeLa cells. Mol. Cell. Biol. 10: 4899–4904.
Versteeg, R., van Schaik, B.D., van Batenburg, M.F., Roos, M., Monajemi, R., Caron, H., Bussemaker, H.J., and van Kampen, A.H. 2003. The human transcriptome map reveals extremes in gene density, intron length, GC content, and repeat pattern for domains of highly and weakly expressed genes. Genome Res. 13: 1998–2004.
Wang, J.D., Berkmen, M.B., and Grossman, A.D. 2007. Genome-wide coorientation of replication and transcription reduces adverse effects on replication in Bacillus subtilis. Proc. Natl. Acad. Sci. 104: 5608–5613.
White, E.J., Emanuelsson, O., Scalzo, D., Royce, T., Kosak, S., Oakeley, E.J., Weissman, S., Gerstein, M., Groudine, M., Snyder, M., et al. 2004. DNA replication-timing analysis of human chromosome 22 at high resolution and different developmental states. Proc. Natl. Acad. Sci. 101: 17771–17776.
Woodfine, K., Beare, D.M., Ichimura, K., Debernardi, S., Mungall, A.J., Fiegler, H., Collins, V.P., Carter, N.P., and Dunham, I. 2005. Replication timing of human chromosome 6. Cell Cycle 4: 172–176.
Yurov, Y.B. and Liapunova, N.A. 1977. The units of DNA replication in the mammalian chromosomes: evidence for a large size of replication units. Chromosoma 60: 253–267.
Zoubak, S., Clay, O., and Bernardi, G. 1996. The gene distribution of the human genome. Gene 174: 95–102.
Received March 22, 2007; accepted in revised form June 10, 2007.