Published online before print
April 12, 2004, 10.1101/gr.1934904 Genome Res. 14:852-859, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Letter
Comparison of Human Chromosome 21 Conserved Nongenic Sequences (CNGs) With the Mouse and Dog Genomes Shows That Their Selective Constraint Is Independent of Their Genic Environment
1 Division of Medical Genetics, University of Geneva Medical School, CH-1211 Geneva, Switzerland
2 The Institute for Genomic Research (TIGR), Rockville, Maryland 20850, USA
3 Department of Computer Sciences and Engineering, Pennsylvania State University, Pennsylvania 16802, USA
4 European Bioinformatics Institute, Hinxton CB10 1SD, UK
The analysis of conservation between the human and mouse genomes resulted in the identification of a large number of conserved nongenic sequences (CNGs). The functional significance of this nongenic conservation remains unknown, however. The availability of the sequence of a third mammalian genome, the dog, allows for a large-scale analysis of evolutionary attributes of CNGs in mammals. We have aligned 1638 previously identified CNGs and 976 conserved exons (CODs) from human chromosome 21 (Hsa21) with their orthologous sequences in mouse and dog. Attributes of selective constraint, such as sequence conservation, clustering, and direction of substitutions were compared between CNGs and CODs, showing a clear distinction between the two classes. We subsequently performed a chromosome-wide analysis of CNGs by correlating selective constraint metrics with their position on the chromosome and relative to their distance from genes. We found that CNGs appear to be randomly arranged in intergenic regions, with no bias to be closer or farther from genes. Moreover, conservation and clustering of substitutions of CNGs appear to be completely independent of their distance from genes. These results suggest that the majority of CNGs are not typical of previously described regulatory elements in terms of their location. We propose models for a global role of CNGs in genome function and regulation, through long-distance cis or trans chromosomal interactions.
Comparative genomics analysis with other vertebrate genomes promises the identification of some of the functional elements (Hardison et al. 1997
A few of the CNGs have been functionally characterized, and some of these sequences are most likely regulatory sequences of nearby genes (Loots et al. 2000
In a previous study we performed a systematic analysis of conservation of a sample of 191 human chromosome 21 (Hsa21) CNGs and their characteristics across the whole mammalian phylogenetic tree (Dermitzakis et al. 2003
In the present study we took advantage of the recent availability of the 1.5x coverage of the whole-genome shotgun sequence of the dog (Canis familiaris; Kirkness et al. 2003
Identification of Orthologous Dog Sequences
The quality of the dog 1.5x assembly was assessed by comparing the sequence with two other sources of dog sequence: high-quality BAC sequencing from Eric Green (NHGRI) and colleagues (115,479 bp; GenBank accession AC144643), and shotgun reads from the NIH-funded dog genome. The levels of similarity at the nucleotide level (excluding indels) in these two comparisons were 99.9% and 99.8%, respectively (see Methods). This high level of similarity suggests that the quality of the assembly used in the present study is high enough to produce reliable conclusions regarding the levels of conservation of the chosen sequences.
The 2262 unknown conserved nongenic sequences (herein named CNGs) and 1229 known (exonic) conserved sequences (herein named CODs) between human and mouse from Dermitzakis et al. (2002
Levels of Divergence of Intergenic CNGs, Intronic CNGs, and CODs
Direction of Substitutions in CNGs and CODs
One of the interesting properties in sequence evolution is the direction of substitutions, which can reveal patterns of selective constraint associated with local nucleotide composition. We previously described a pronounced bias in substitutions of CNGs from AT to GC in mouse and rabbit (Dermitzakis et al. 2002
We also computed the same values for the CODs, in order to determine a particular bias in those versus the CNGs. Figure 2B,C,D shows the relative rates of GC to AT nucleotide changes for intronic and intergenic CNGs and CODs in all three species. It is clear that the AT to GC bias is pronounced in CNGs in mouse but absent from CODs. The mouse and dog CODs do not show any strong GC to AT effect. One of the intriguing patterns is that there is a GC to AT bias in the CODs in humans. This is contrary to expectation, because the codon bias of the human genome is GC-biased in degenerate codon positions and is associated with expression levels (Iida and Akashi 2000
Organization of CNGs and Distance From Annotated Human Genes
Relationship of Nucleotide Sequence Divergence With Distance From Genes
If the majority of CNGs in intergenic regions were mainly regulatory regions of the adjacent genes, one would expect that CNGs closer to genes would have a higher or at least different constraint from those far from genes. We computed the correlation coefficient for the degree of divergence in the human lineage relative to the distance from the closest gene according to the current annotation of the human genome (NCBI build 33). Surprisingly, we observed no correlation, showing that the selective constraint is independent of the genic environment of the chromosomal regions under study (Fig. 4A). This is consistent with the observation above that the density of intergenic CNGs is independent of the distance from genes.
In addition, we performed a correlation analysis of the human branch length (divergence) with the length of the intergenic region in which the CNGs reside. We also observed no correlation between the divergence values of CNGs in short versus long intergenic regions (Fig. 4B). This suggests that the nature of selective constraint of CNGs does not differ between those residing in long and short intergenic regions. It is possible that more genes will be identified on Hsa21 (e.g., keratin-associated protein gene clusters). However, given the fact that Hsa21 has been exhaustively studied, we do not expect a large increase in the number of genes. Nevertheless, even if we would introduce genes in the currently large intergenic regions, the pattern is not expected to change, because by "inserting" a gene in a uniform distribution of CNGs the pattern will remain independent of gene distance.
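The two correlation analyses described above can be sketched as follows. The text does not state which correlation coefficient was used, so this example assumes a Pearson coefficient; the helper names, the coordinate convention (gene intervals as inclusive start and end positions), and the toy values are all hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def distance_to_nearest_gene(position, genes):
    """Distance in bp from a position to the closest gene interval.

    genes is a list of (start, end) pairs; a position inside a gene gives 0.
    """
    return min(max(start - position, position - end, 0) for start, end in genes)


# Toy inputs (hypothetical, for illustration only).
genes = [(100, 200), (300, 400)]
cng_positions = [50, 250, 450, 700]
human_branch_lengths = [0.02, 0.03, 0.02, 0.03]

distances = [distance_to_nearest_gene(p, genes) for p in cng_positions]
r = pearson_r(distances, human_branch_lengths)
```

The same function applies to the second analysis by replacing the distances with the length of the intergenic region containing each CNG. A rank-based coefficient would be a reasonable alternative if the distance distribution is strongly skewed.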
Clustering of Substitutions
As we have previously observed, there is more pronounced clustering in intronic and intergenic CNGs than in the CODs (Fig. 5A). Interestingly, the clustering significance is not different between intronic and intergenic CNGs, suggesting that these two types of CNGs may not have different function.
The data presented above strongly suggest that attributes of selective constraint are not correlated with distance from the adjacent genes. We therefore tested whether clustering P-values show any correlation with distance from genes. We observed no correlation of P-values with distance from the nearest gene (Fig. 5B), or with the length of the intergenic region (Fig. 5C), further supporting the conclusions that the selective constraint of CNGs is independent of the distance from the adjacent genes, and likely independent of the genic environment of their genomic location.
One of the most exciting observations following the comparison of the human and mouse genomes is the discovery of the likely functional conserved nongenic (CNGs) sequences (Dermitzakis et al. 2002 The levels of divergence of the tested CNGs in dog appear to be low, and we were able to find about 73% of the CNGs and 79% of the CODs in the 1.5x sequence coverage of the dog genome. It is expected that with such coverage we should be able to detect approximately 78% of all previously identified CNGs and CODs, if all are present and conserved. Given the cloning bias of some regions and the statistical error, we conclude that almost all of the CNGs identified between human and mouse will be present and highly conserved in dog. In addition, the patterns of substitution of CNGs and CODs suggest unique properties for the genomic regions in which they are found. Specifically, we observed a direction of the substitutions in mouse and dog which leads to an increase of the GC content in their CNG sequences. Remarkably, we observed the opposite substitution bias for CODs in humans, turning coding sequences richer in AT. This is contrary to the codon bias of the human genome, which is GC-biased, suggesting possible models for chromosome-wide patterns of substitution that could alter the rate of translation and protein levels and patterns in the course of evolution. The most intriguing observation of this analysis is that the selective constraint is independent of the genic environment of the CNGs. Neither divergence nor clustering of substitutions is correlated with distance from genes or size of intergenic regions. Moreover, there is no difference in divergence or clustering between intronic and intergenic CNGs, further supporting that notion. These data suggest that the position of CNGs is independent of currently known functional elements of the genome. 
Therefore, the evolutionary characteristics of a CNG are not influenced by how far or close it maps to a gene, or whether it is situated 5' or 3' or in an intron of a gene.
This observation raises interesting models regarding the potential function of CNGs in the human and other mammalian genomes. It has been suggested that some of the CNGs are likely regulatory regions (Hardison 2000 One other possible model is their involvement as structural components of chromatin. The fact that they are not evenly distributed on the chromosome but have an inversely proportional density with genes shows that they are found wherever there is genomic "space", and thus their main role cannot be only the structure of the chromatin. In addition, they are single-copy sequences in the human genome, which makes them unlikely to be evenly spaced repeated structural elements.
Another intriguing but speculative hypothesis is that CNGs participate in direct regulation of gene expression through interactions in trans (Muller and Schaffner 1990
What have we learned with the use of the dog genome as a third species in the analysis? It has been shown that the dog is closer in terms of nucleotide identity to the human than the mouse, so one could argue that we have more power to detect evolutionary conserved sequences with the mouse than with the dog. However, even lower amounts of divergence can be very useful. The first contribution of the dog genome is that it shows that the selected sequences based on the human-mouse comparison are also conserved in the dog, further supporting that they are selectively constrained. In addition, with the use of a third species we are able to root the direction of substitutions, which would have been impossible with two genomes only. See for example in Dermitzakis et al. (2003 In conclusion, CNGs most likely constitute a new and heterogeneous class of functional genomic elements that are highly conserved and shared across multiple mammals. The function of some CNGs may be a local regulation of gene expression in cis; however, our results presented here favor the implication of CNGs in regulation and chromosomal interactions in a distance-independent (cis or trans) manner. From the extensive evolutionary analysis it is obvious that we need to entertain many alternative hypotheses to be tested with experimental strategies to elucidate the function of CNGs. The next challenge will be to design and interpret informative experiments to investigate the function of CNGs and assess their contribution to phenotypic variation within and between species and to complex and mendelian disorders.
The Dog Genome Sequence
Sequence data, representing 1.5x coverage of the dog genome, were derived from 6.22 million sequencing reads. The end-sequencing of 2-kb and 10-kb clones was conducted under contract at Celera Genomics as described previously for human, and reads were assembled with Celera Assembler (Venter et al. 2001
To assess the quality of the 1.5x assembly, two comparisons were carried out.
1. Comparison of the 1.5x Data to a 'Finished' BAC Sequence
Sixty-one assemblies aligned with 89,220 bp of unique BAC sequence. This proportion of the BAC sequence (77.3%) is close to the theoretical value for 1.5x coverage (77.6%). End-to-end alignment of each assembly with the BAC sequence revealed a total of 143 indels and 128 mismatched bases. Overall, there was 99.5% identity between the BAC sequence and the 1.5x assemblies. However, many indels are located within tandem repeats, and encompass multiple bases. If indels are ignored, there was 99.9% identity between the BAC sequence and the 1.5x data. It should be noted that many of these indels and mismatches will represent genuine polymorphisms (the BAC library was derived from a Doberman, the 1.5x data from a poodle). For comparison, one might consider the variation (most of which is allelic variation) of two overlapping BACs from the same dog (AC114890, AC113573; each assembled after >9x coverage of >Q20 bases). The overlap of 94,162 bp contains a total of 103 indels and 271 mismatched bases.
2. Comparison of 1.5x Sequences With Boxer Shotgun Reads
All 2661 dog sequences were sorted by % identity to the human genome, and the 100 with the lowest values (60%-74%) were searched against all available whole-genome shotgun reads from the boxer genome. Each 1.5x sequence was aligned end-to-end with homologous boxer reads in order to quantify indels and mismatched bases. For 98 of the 1.5x dog sequences, it was possible to identify one or more boxer reads that aligned from end-to-end. These 98 sequences had a combined length of 16,293 bp. The alignments revealed a total of 14 indels and 28 mismatched bases. Overall, there was 99.8% identity between the 1.5x sequences and the boxer reads. Again, it should be noted that many of these indels and mismatches will represent genuine polymorphisms between the poodle and boxer genomes. However, even if the error rate is as high as 0.2%, this would have a negligible effect on the analyses described herein.
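The arithmetic behind the two quality checks can be reproduced with a short sketch. Two assumptions are made here that the text does not state: the theoretical covered fraction is taken from a Poisson model of random shotgun sampling, 1 - e^(-c) for coverage c, and the indel-free identity is computed as one minus mismatches divided by aligned length. The function names are hypothetical.

```python
import math


def expected_covered_fraction(coverage):
    """Fraction of a genome expected to be sampled at a given fold coverage,
    assuming reads land uniformly at random (Poisson model)."""
    return 1.0 - math.exp(-coverage)


def percent_identity(aligned_bp, mismatches):
    """Percent identity over aligned bases, ignoring indels."""
    if aligned_bp <= 0:
        raise ValueError("aligned_bp must be positive")
    return 100.0 * (1.0 - mismatches / aligned_bp)


# Values taken from the two comparisons described above.
covered = expected_covered_fraction(1.5)   # theoretical fraction at 1.5x
bac_identity = percent_identity(89_220, 128)    # 1.5x data vs. BAC sequence
boxer_identity = percent_identity(16_293, 28)   # 1.5x data vs. boxer reads

print(f"expected covered fraction at 1.5x: {covered:.3f}")
print(f"BAC comparison, indels ignored: {bac_identity:.1f}%")
print(f"boxer comparison, indels ignored: {boxer_identity:.1f}%")
```

Under the Poisson assumption the covered fraction comes out near 0.777, close to the theoretical value quoted above, and the mismatch-only identities round to the reported 99.9% and 99.8%. The overall identity that includes indels cannot be recomputed from the counts alone, because many indels span multiple bases.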
Identification of Orthologous Dog Sequences
Annotation of CNGs and Exons on the Current Version of Human Chromosome 21
Estimation of Number of Substitutions, Divergence, and Branch Lengths
Periodicity
Clustering
We thank Roderic Guigo and Enrique Blanco for helpful comments and suggestions in the course of this study and for critically reading the manuscript. Support for this work was provided by the Swiss National Science Foundation, NCCR "Frontiers in Genetics," ChildCare Foundation, European Union FP5 to S.E.A. and Lejeune foundation to A.R. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1934904. Article published online before print in April 2004.
5 Corresponding authors.
E-MAIL Stylianos.Antonarakis{at}medecine.unige.ch; FAX 0041-22-379-5706. [Supplemental material is available online at www.genome.org.]
Alexandersson, M., Cawley, S., and Pachter, L. 2003. SLAM: Cross-species gene finding and alignment with a generalized pair Hidden Markov Model. Genome Res. 13: 496-502.
Boffelli, D., McAuliffe, J., Ovcharenko, D., Lewis, K.D., Ovcharenko, I., Pachter, L., and Rubin, E.M. 2003. Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299: 1391-1394.
Croft, J.A., Bridger, J.M., Boyle, S., Perry, P., Teague, P., and Bickmore, W.A. 1999. Differences in the localization and morphology of chromosomes in the human nucleus. J. Cell Biol. 145: 1119-1131.
Dermitzakis, E.T., Reymond, A., Lyle, R., Scamuffa, N., Ucla, C., Deutsch, S., Stevenson, B.J., Flegel, V., Bucher, P., Jongeneel, C.V., et al. 2002. Numerous potentially functional but nongenic conserved sequences on human chromosome 21. Nature 420: 578-582.
Dermitzakis, E.T., Reymond, A., Scamuffa, N., Ucla, C., Kirkness, E., Rossier, C., and Antonarakis, S.E. 2003. Evolutionary discrimination of mammalian conserved nongenic sequences (CNGs). Science 302: 1033-1035.
Dubchak, I., Brudno, M., Loots, G.G., Pachter, L., Mayor, C., Rubin, E.M., and Frazer, K.A. 2000. Active conservation of noncoding sequences revealed by three-way species comparisons. Genome Res. 10: 1304-1306.
Duncan, I.W. 2002. Transvection effects in Drosophila. Annu. Rev. Genet. 36: 521-556.
Frazer, K.A., Sheehan, J.B., Stokowski, R.P., Chen, X., Hosseini, R., Cheng, J.F., Fodor, S.P., Cox, D.R., and Patil, N. 2001. Evolutionarily conserved sequences on human chromosome 21. Genome Res. 11: 1651-1659.
Guigo, R., Dermitzakis, E.T., Agarwal, P., Ponting, C.P., Parra, G., Reymond, A., Abril, J.F., Keibler, E., Lyle, R., Ucla, C., et al. 2003. Comparison of mouse and human genomes followed by experimental verification yields an estimated 1,019 additional genes. Proc. Natl. Acad. Sci. 100: 1140-1145.
Hardison, R.C. 2000. Conserved noncoding sequences are reliable guides to regulatory elements. Trends Genet. 16: 369-372.
Hardison, R.C., Oeltjen, J., and Miller, W. 1997. Long human-mouse sequence alignments reveal novel regulatory elements: A reason to sequence the mouse genome. Genome Res. 7: 959-966.
Iida, K. and Akashi, H. 2000. A test of translational selection at "silent" sites in the human genome: Base composition comparisons in alternatively spliced genes. Gene 261: 93-105.
Kimura, M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol. 16: 111-120.
Kirkness, E.F., Bafna, V., Halpern, A.L., Levy, S., Remington, K., Rusch, D.B., Delcher, A.L., Pop, M., Wang, W., Fraser, C.M., et al. 2003. The dog genome: Survey sequencing and comparative analysis. Science 301: 1898-1903.
Kleinjan, D.J. and van Heyningen, V. 1998. Position effect in human genetic disease. Hum. Mol. Genet. 7: 1611-1618.
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921.
Loots, G.G., Locksley, R.M., Blankespoor, C.M., Wang, Z.E., Miller, W., Rubin, E.M., and Frazer, K.A. 2000. Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science 288: 136-140.
Muller, H.P. and Schaffner, W. 1990. Transcriptional enhancers can act in trans. Trends Genet. 6: 300-304.
Nielsen, J.A., Hudson, L.D., and Armstrong, R.C. 2002. Nuclear organization in differentiating oligodendrocytes. J. Cell Sci. 115: 4071-4079.
O'Brien, S.J., Menotti-Raymond, M., Murphy, W.J., Nash, W.G., Wienberg, J., Stanyon, R., Copeland, N.G., Jenkins, N.A., Womack, J.E., and Marshall Graves, J.A. 1999. The promise of comparative genomics in mammals. Science 286: 458-462, 479-481.
Pennacchio, L.A. and Rubin, E.M. 2001. Genomic strategies to identify mammalian regulatory sequences. Nat. Rev. Genet. 2: 100-109.
Schwartz, S., Elnitski, L., Li, M., Weirauch, M., Riemer, C., Smit, A., Green, E.D., Hardison, R.C., and Miller, W. 2003. MultiPipMaker and supporting tools: Alignments and analysis of multiple genomic DNA sequences. Nucleic Acids Res. 31: 3518-3524.
Sorek, R. and Ast, G. 2003. Intronic sequences flanking alternatively spliced exons are conserved between human and mouse. Genome Res. 13: 1631-1637.
Spitz, F., Gonzalez, F., and Duboule, D. 2003. A global control region defines a chromosomal regulatory landscape containing the HoxD cluster. Cell 113: 405-417.
Stojanovic, N., Florea, L., Riemer, C., Gumucio, D., Slightom, J., Goodman, M., Miller, W., and Hardison, R. 1999. Comparison of five methods for finding conserved sequences in multiple alignments of gene regulatory regions. Nucleic Acids Res. 27: 3899-3910.
Tanabe, H., Muller, S., Neusser, M., von Hase, J., Calcagno, E., Cremer, M., Solovei, I., Cremer, C., and Cremer, T. 2002. Evolutionary conservation of chromosome territory arrangements in cell nuclei from higher primates. Proc. Natl. Acad. Sci. 99: 4424-4429.
Tang, H. and Lewontin, R.C. 1999. Locating regions of differential variability in DNA and protein sequences. Genetics 153: 485-495.
Thomas, J.W., Touchman, J.W., Blakesley, R.W., Bouffard, G.G., Beckstrom-Sternberg, S.M., Margulies, E.H., Blanchette, M., Siepel, A.C., Thomas, P.J., McDowell, J.C., et al. 2003. Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424: 788-793.
Ureta-Vidal, A., Ettwiller, L., and Birney, E. 2003. Comparative genomics: Genome-wide analysis in metazoan eukaryotes. Nat. Rev. Genet. 4: 251-262.
Venter, J.C., Adams, M.D., Myers, E.W., Li, P.W., Mural, R.J., Sutton, G.G., Smith, H.O., Yandell, M., Evans, C.A., Holt, R.A., et al. 2001. The sequence of the human genome. Science 291: 1304-1351.
Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520-562.
Yang, Z. 1997. PAML: A program package for phylogenetic analysis by maximum likelihood. Comput. Appl. Biosci. 13: 555-556.
Received September 3, 2003;
accepted in revised format December 28, 2003.