Genome Res. 14:1060-1067, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Letter
Coexpression of Neighboring Genes in the Genome of Arabidopsis thaliana
CNAP, Department of Biology, University of York, York YO10 5YW, United Kingdom
Large-scale analyses of expression data of eukaryotic organisms are now becoming increasingly routine. The data sets are revealing interesting and novel patterns of genomic organization, which provide insight both into molecular evolution and how structure and function of a genome interrelate. Our study investigates, for the first time, how genome organization affects expression of a gene in the Arabidopsis genome. The analyses show that neighboring genes are coexpressed. This pattern has been found for all eukaryotic genomes studied so far, but as yet, it remains unclear whether it is due to selective or nonselective influences. We have investigated reasons for coexpression of neighboring genes in Arabidopsis, and our evidence suggests that orientation of gene pairs plays a significant role, with potential sharing of regulatory elements in divergently transcribed genes. Using the data available in the KEGG database, we find evidence that genes in the same pathway are coexpressed, although this is not a major cause for the coexpression of neighboring genes.
Several large-scale analyses of expression data in higher eukaryotes have shown that neighboring genes tend to have similar expression patterns. Regional similarity in expression has been found in humans (Caron et al. 1995
There are a number of potential causes for neighboring genes in a genome to have similar expression patterns. First, duplicated genes often remain neighbors for significant periods of evolutionary time, and given their common ancestry, are likely to have similar expression patterns. Second, neighboring genes in prokaryotic genomes, particularly those that are functionally related, are often found in operons. To date, operons have been found in Caenorhabditis elegans (Blumenthal et al. 2002
The observations on coexpression of neighboring genes have been based on data gained from a variety of experimental techniques. These have included Serial Analysis of Gene Expression (SAGE; Lercher et al. 2002
Increasingly, data sets from DNA microarrays, which enable large numbers of genes to be analyzed simultaneously in a single experiment, are used for bioinformatics analysis. However, there are several different microarray technologies currently in use, including cDNA, oligo, and Affymetrix arrays. It is unclear as yet whether quantitative comparison of data sets from these different technologies is feasible. An example of this difficulty is illustrated in Kuo et al. (2002
This study describes the first analysis of the Arabidopsis genome to determine whether neighboring genes are coexpressed. Gene expression in Arabidopsis has been studied in depth worldwide, and there are publicly available data sets for both cDNA and Affymetrix microarrays. This gives the added opportunity to directly compare the impact of these two technologies on the analysis. Our results from a pairwise comparison show that coexpression of neighboring genes does exist in the Arabidopsis genome. There is significant disparity in the conclusions that can be drawn from data derived from the two different microarray technologies. The causes of coexpression have been explored, and evidence is provided to suggest that neither gene duplication nor common functionality is the main cause for coexpression of neighboring genes in the Arabidopsis genome.
Neighboring Genes Are Coexpressed
The data sets used for this analysis were derived from cDNA and Affymetrix microarrays. For each data set, as shown in Figure 1, the mean Pearson's correlation coefficient (R) of all pairs of neighboring genes was calculated to give a measure of the similarity in their expression pattern. The significance of this value was confirmed using a Monte-Carlo simulation, which compares the value obtained to a distribution of random mean R-values derived from the same set of data. Surprisingly, the mean R from the random distribution was positive rather than being zero, as would be expected by chance. A possible explanation for this effect may be the influence of housekeeping genes showing common patterns of expression in many different tissues and experimental conditions, thereby shifting the mean value into the positive. There was clear evidence for significant coexpression of neighboring genes across the genome. This was obtained for data sets from both cDNA and Affymetrix microarrays (cDNA array: P < 0.0001, +4.99 standard deviations from the random mean; Affymetrix array: +23.1 standard deviations; Fig. 1A,C). Tandem duplicates, defined as gene pairs with a BLAST e-value <0.2 and within 10 genes of one another on the chromosome, were found to have a higher degree of coexpression than that of neighboring genes that were not tandem duplicates. This was obtained using a Mann-Whitney U-test (both data sets: P < 0.0001, Table 1). The result suggested that tandem duplicates could be a significant cause of coexpression of neighboring genes. Therefore, to determine the extent of this effect, one member of each pair of tandem duplicates was removed, and the mean coexpression was recalculated and again compared with randomized data sets.
The results of these analyses are shown in Figure 1, B and D, and clearly demonstrate that the impact of tandem duplicates on the coexpression of neighboring genes is different between data obtained from the two technologies. The cDNA array data set free of tandem duplicates showed no evidence of coexpression of neighboring genes (n = 2109; P > 0.10, +1.08 standard deviations; Fig. 1B), whereas the Affymetrix data set continued to show a significant pattern (n = 1367; P < 0.0001, +18.6 standard deviations; Fig. 1D).
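The Monte-Carlo comparison described above can be illustrated with a minimal Python sketch. It is not the authors' PERL implementation; the function names, the dictionary layout of `profiles`, the handling of flat profiles, and the empirical P-value formula are all assumptions made for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def mean_neighbor_r(order, profiles):
    """Mean R over all adjacent gene pairs in the given gene order."""
    rs = [pearson_r(profiles[a], profiles[b]) for a, b in zip(order, order[1:])]
    return sum(rs) / len(rs)


def monte_carlo(order, profiles, n_random=1000, seed=0):
    """Compare the observed mean R with means from randomized gene orders.

    Returns (observed mean R, distance from the random mean in standard
    deviations, empirical one-sided P-value).
    """
    rng = random.Random(seed)
    observed = mean_neighbor_r(order, profiles)
    shuffled = list(order)
    null = []
    for _ in range(n_random):
        rng.shuffle(shuffled)
        null.append(mean_neighbor_r(shuffled, profiles))
    mu = sum(null) / n_random
    sd = math.sqrt(sum((v - mu) ** 2 for v in null) / (n_random - 1))
    p = (sum(v >= observed for v in null) + 1) / (n_random + 1)
    return observed, (observed - mu) / sd, p
```

Here `order` is a list of gene identifiers in chromosomal order and `profiles` maps each identifier to its list of expression values. Because the null distribution is built from the same genes, a positive random mean (as reported above) is absorbed into the comparison rather than being assumed to be zero.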
To investigate whether the correlation continues beyond neighboring gene pairs into clusters of increasing size, nonoverlapping blocks of three to 20 genes were compared, and the results are shown in Figure 2. Previous analyses in Drosophila suggested that blocks of genes up to 20 in size showed significant clustering of coexpressed genes. Data from only the Affymetrix arrays minus tandem duplicates are shown. The difference in degree of coexpression between real and randomized data sets remained significant for all block sizes. For nonoverlapping blocks of three to 10 genes, there is a clear, gradual decrease in coexpression. Beyond this, there is no further decrease in coexpression, and this continued for block sizes of up to 20 genes. This implies that in the Arabidopsis genome, there may be clusters of up to 20 genes that are coexpressed, with an overall median cluster size of 100 kb. It was possible that the statistical significance of these results was inflated by genes that are only one, two, or three genes apart. To investigate this possibility, the randomizations were repeated, but rather than randomizing single genes in each block, groups of three genes were used. When these additional analyses were carried out, the mean R for the randomized data sets increased, but as shown in Figure 3, no random data set produced a higher mean R value than the real data set. This confirmed the significance of the finding that blocks of genes are coexpressed in the genome.
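The block analysis and the sub-block randomization can be sketched as follows. This is an illustrative Python outline under stated assumptions, not the original scripts: the function names are hypothetical, incomplete trailing blocks are simply dropped, and flat profiles are given R = 0.

```python
import math
import random
from itertools import combinations


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return sxy / math.sqrt(sxx * syy)


def block_coexpression(order, profiles, block_size):
    """Mean of all pairwise R values within each nonoverlapping block.

    A block of five genes yields 10 pairwise comparisons.
    Assumption: a trailing incomplete block is ignored.
    """
    scores = []
    for start in range(0, len(order) - block_size + 1, block_size):
        block = order[start:start + block_size]
        rs = [pearson_r(profiles[a], profiles[b])
              for a, b in combinations(block, 2)]
        scores.append(sum(rs) / len(rs))
    return scores


def shuffle_sub_blocks(order, sub_size, rng):
    """Randomize gene order in units of sub_size ordered neighbors."""
    units = [order[i:i + sub_size] for i in range(0, len(order), sub_size)]
    rng.shuffle(units)
    return [gene for unit in units for gene in unit]


def null_block_means(order, profiles, block_size, sub_size, n_random, seed=0):
    """Mean block coexpression for n_random sub-block randomizations."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_random):
        scores = block_coexpression(
            shuffle_sub_blocks(order, sub_size, rng), profiles, block_size)
        means.append(sum(scores) / len(scores))
    return means
```

Setting `sub_size=1` reproduces the single-gene randomization, while `sub_size=3` keeps each triplet of neighbors together, so any coexpression among very close genes is retained in the randomized data and cannot inflate the significance of larger blocks.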
It was of interest to determine whether there was a direct correlation between distance and degree of coexpression. Thus, each pair of genes was placed in bins according to their intergenic distance (0-1 kb, 1-2 kb, 2-3 kb, etc.). If there is a relationship between proximity and degree of coexpression, then it could be expected that genes that are closer together would have a greater degree of coexpression than genes that are further away. For the Affymetrix data set, as shown in Figure 3, a significant correlation was observed between coexpression and intergenic distance of gene pairs up to 12 kb apart (with tandem duplicates: R² = 0.73; P < 0.005, without tandem duplicates: R² = 0.69; P < 0.005). Interestingly, when gene pairs in intergenic blocks >12 kb were considered, the correlation between coexpression and gene distance was no longer found to be significant. No correlation was observed for the cDNA array data sets, with or without tandem duplicates. Given this lack of correlation, it is unclear whether the quantitative results from cDNA microarrays are useful for bioinformatic analysis, and therefore, further work focused only on the Affymetrix data sets.
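A minimal sketch of the distance binning and the R² calculation is given below. The function names are hypothetical, and two details are assumptions not stated in the text: pairs with a negative (overlapping) distance are skipped, and R² is computed from a simple linear regression of the mean R per bin on the bin index.

```python
from collections import defaultdict


def intergenic_distance(last_coding_a, first_coding_b):
    """Distance in bp from the last coding position of the first gene
    to the first coding position of the second gene."""
    return first_coding_b - last_coding_a


def bin_by_distance(pairs, bin_kb=1.0, max_kb=12.0):
    """Mean R per distance bin (0-1 kb, 1-2 kb, ...).

    pairs: iterable of (intergenic_distance_bp, r).
    Assumption: negative distances and distances >= max_kb are skipped.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for dist_bp, r in pairs:
        kb = dist_bp / 1000.0
        if kb < 0 or kb >= max_kb:
            continue
        idx = int(kb // bin_kb)
        sums[idx][0] += r
        sums[idx][1] += 1
    return {i: s / n for i, (s, n) in sorted(sums.items())}


def r_squared(xs, ys):
    """Coefficient of determination for a simple linear fit of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return (sxy * sxy) / (sxx * syy)
```

With the binned means in hand, `r_squared(list(binned), list(binned.values()))` gives the quantity reported as R²; the significance test on that value is not included in this sketch.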
Genes Thought to Be Involved in the Same Biological Process Are Coexpressed
The mean R value, that is, degree of coexpression, was calculated for genes in each pathway listed in the KEGG database. The results are shown in Table 3, and illustrate several interesting features. First, the degree of coexpression shows considerable variation between different pathways. Second, the degree of coexpression is extremely high for some pathways, particularly those in which there is a known molecular interaction between gene products, such as components of the proteasome, ribosome, and replicon. Third, genes encoding enzymes of metabolic pathways are not so highly coexpressed, with some exceptions, such as those involved in the TCA cycle and fatty acid biosynthesis.
The Effect of Gene Orientation on Coexpression of Neighboring Genes
Genes in a genome can be transcribed in one of two directions and therefore pairs of genes can be orientated in three alternative combinations as follows: divergent transcription (←→), convergent transcription (→←), or parallel transcription (→→/←←). Using the Affymetrix data set minus tandem duplicates, those pairs of genes with divergent (←→) or parallel (→→/←←) orientation were found to have a higher degree of coexpression than those genes with convergent (→←) orientation of transcription (Table 4; Kruskal-Wallis, P < 0.0001). Interestingly, the pairs of genes with convergent orientation were found to have shorter intergenic distance than those with divergent or parallel orientation (Table 4; Kruskal-Wallis, P < 0.0001).
The above analysis excluded tandem duplicates. The same analysis was performed on a data set of neighboring genes that consisted only of tandem duplicates. As a basis for this analysis, the transcriptional orientation of tandem duplicates was first investigated, and as predicted, most were found to be in the parallel (→→/←←) orientation (χ² test, P < 0.0001). However, it was the tandem duplicates existing in the divergent (←→) orientation of transcription that showed the greatest degree of coexpression (Table 4; Kruskal-Wallis, P < 0.05).
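The three orientation classes can be derived from strand annotations alone. The sketch below is a hypothetical helper, assuming each gene carries a "+" or "-" strand label and that the first gene of a pair lies upstream of the second in chromosome coordinates; it groups R values by class and reports medians, and does not implement the Kruskal-Wallis test itself.

```python
from statistics import median


def pair_orientation(strand_a, strand_b):
    """Classify an adjacent gene pair, with gene a to the left of gene b.

    "-" then "+" : divergent  (transcribed away from each other)
    "+" then "-" : convergent (transcribed toward each other)
    same strand  : parallel
    """
    if strand_a not in "+-" or strand_b not in "+-" or not strand_a or not strand_b:
        raise ValueError("strand must be '+' or '-'")
    if strand_a == strand_b:
        return "parallel"
    if strand_a == "-" and strand_b == "+":
        return "divergent"
    return "convergent"


def median_r_by_orientation(pairs):
    """Median R per orientation class.

    pairs: iterable of (strand_a, strand_b, r) for adjacent gene pairs.
    """
    groups = {"divergent": [], "convergent": [], "parallel": []}
    for strand_a, strand_b, r in pairs:
        groups[pair_orientation(strand_a, strand_b)].append(r)
    return {name: median(values) for name, values in groups.items() if values}
```

The grouped R values returned per class are the natural input to a rank-based test such as Kruskal-Wallis, which would be run with a statistics package of choice.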
Many technologies are now available to determine the different patterns of gene expression exhibited in cells and tissues of an organism. Often, the entire genomes of these organisms have also been sequenced. This provides the opportunity to analyze gene expression in the context of genome organization. For A. thaliana, the genome sequencing program was completed in 2000 (The Arabidopsis Genome Initiative 2000 Our results show that neighboring genes in the Arabidopsis genome are indeed coexpressed. We have observed this coexpression from two different sources of data for the statistical analysis, Affymetrix and cDNA microarray technologies. Tandem duplicates were found to have a higher degree of coexpression than other neighboring genes in our analysis, but interestingly, the impact of their removal was found to be different when the data from the two technologies were compared. Only the Affymetrix data set continued to show a significant pattern of coexpression. The loss of significance from the cDNA microarray data sets can readily be understood given the known problem of cross-hybridization arising from highly homologous genes such as tandem duplicates. This leads to a higher overall level of noise and unreliability when using cDNA arrays. In contrast, the Affymetrix technology bypasses this problem by using multiple oligonucleotides unique for each gene.
A further difference shown by the analyses of the data sets from the two technologies relates to the effect of intergenic distance, as one could predict that genes closer together would have a greater degree of coexpression than those that are more distant in the genome. A significant correlation between distance and coexpression was only found for the Affymetrix data set, either with or without the inclusion of tandem duplicates. This finding also questions the general utility of cDNA microarrays for this type of quantitative analysis. Some discrepancies have been found previously between cDNA and Affymetrix data sets, such as, for example, in the study of gene expression patterns in 56 cell lines from the National Cancer Institute (Kuo et al 2002
We have addressed several possible explanations for the observed coexpression of neighboring genes. For example, MARS are thought to influence gene expression through changing chromatin conformation patterns (Mishra and Karch 1999
Gene orientation has been examined in a number of studies for its relationship to degree of coexpression. Studies on yeast have shown that divergently transcribed genes have a higher degree of coexpression than genes in convergent orientation (Kruglyak and Tang 2000
Coexpression of neighboring genes could arise through the genes sharing a common function. For example, one could readily predict that genes encoding enzymes in a common metabolic pathway may be coordinately regulated and therefore coexpressed, particularly if the entire pathway is responsive to environmental or developmental cues. To gain an insight into the role of shared function in coexpression, we used the KEGG database to analyze gene expression in the context of gene function (Kanehisa 2002
Interestingly, when coexpression of genes across the entire genome was analyzed in the context of the KEGG database, particularly high degrees of correlation were observed for genes encoding proteins that are known to function in multicomponent complexes, such as the proteasome, ribosome, and replicon. Often, these complexes contain a high level of protein-protein interactions and our conclusions from the Arabidopsis data are supported by studies in yeast, in which genes encoding interacting proteins tend to be coexpressed (Ge et al. 2001
Data Sources
Microarray Data
Data was collected from two sources. The Stanford data set is a collection of microarray experiments using cDNA microarrays. The data was downloaded from the Stanford Web site (ftp://genome-ftp.stanford.edu/pub/smd/organisms/AT). A total of 233 experiments were used and the total number of genes across all experiments was 7627 genes. Not all genes were present in each array. As an indicator of the expression level, the normalized ratio was used (channel 1/channel 2 ratio normalized). The Affymetrix data was obtained using the Nottingham Arabidopsis Stock Centre (NASC) Affywatch service (http://arabidopsis.info/prototype/; Craigon et al 2004
Detecting Local Similarity in Expression
To test for pairwise local similarity in expression in the Arabidopsis genome, the mean R (Pearson's correlation coefficient) of the expression profiles for neighboring pairs of genes was calculated for both the Affymetrix and cDNA data sets. Neighbors were defined as genes that were immediately adjacent in the Arabidopsis genome according to each gene's AGI name, that is, gene pairs with an AGI name (of the form At[chr]g[xxxxx]) differing by 10 or less (e.g., At1g10020 and At1g10030 are defined as neighbors). The mean R calculated from the real data set was then compared with the mean R calculated from 10,000 data sets, in which the order of genes in the Arabidopsis genome was randomized. To ensure that the R-value calculated was statistically valid for each pairwise comparison, there had to be at least 10 experiments in which both genes had valid values. For the Affymetrix data in particular, this resulted in many comparisons being rejected, due to an insufficient number of experiments in which the transcript was identified. The number of gene pair comparisons was conserved between the randomized and the real data sets (Stanford n = 2498; NASC n = 7388).
When analyzing blocks of genes, the mean of all possible comparisons within the block was used as the level of coexpression for that block. Therefore, for a block of five genes, 10 different correlations were carried out, and the mean R was used as a measure of the level of coexpression for that particular block. The mean R was then compared with means calculated from randomized data sets. One hundred randomizations were carried out for each simulation. Where sub-blocks were used, the number of genes in a randomized block was varied. For example, when there were three genes in a sub-block, the Arabidopsis genome was split into blocks of three ordered neighboring genes. These blocks were then randomized.
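The neighbor definition and the minimum-experiment filter described above can be expressed as a short Python sketch. The helper names, the regular expression, and the use of `None` to mark a missing value are assumptions for illustration and are not taken from the original scripts.

```python
import re

# Assumed AGI pattern: "At", a chromosome label, "g", and five digits.
AGI_PATTERN = re.compile(r"^At([1-5CM])g(\d{5})$", re.IGNORECASE)


def are_neighbors(agi_a, agi_b, max_diff=10):
    """True if two AGI names are on the same chromosome and their
    numeric parts differ by max_diff or less (but are not identical)."""
    match_a = AGI_PATTERN.match(agi_a)
    match_b = AGI_PATTERN.match(agi_b)
    if not match_a or not match_b:
        return False
    if match_a.group(1).upper() != match_b.group(1).upper():
        return False
    diff = abs(int(match_a.group(2)) - int(match_b.group(2)))
    return 0 < diff <= max_diff


def shared_valid(profile_a, profile_b, min_shared=10):
    """Keep only experiments in which both genes have a value.

    Missing values are represented as None (an assumption). Returns a
    list of (value_a, value_b) tuples, or None if fewer than min_shared
    experiments remain, in which case the comparison is rejected.
    """
    kept = [(a, b) for a, b in zip(profile_a, profile_b)
            if a is not None and b is not None]
    return kept if len(kept) >= min_shared else None
```

For example, `are_neighbors("At1g10020", "At1g10030")` holds under this definition, whereas genes on different chromosomes, or with numbers more than 10 apart, are not treated as neighbors.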
For each random distribution, the genes were split into blocks of 15 genes, from which the mean Pearson correlation coefficient was calculated using the Affymetrix array data. Tandem duplicates were excluded. Distance between genes was defined as the distance in base-pairs between the last coding position, on either strand, of the first gene to the first coding position of the second gene.
Removal of Tandem Duplicates
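The criteria given in the Results (a BLAST e-value <0.2 and a separation of no more than 10 genes on the chromosome, with one member of each pair removed) can be sketched as below. This is an illustrative Python outline only: the function name and input layout are hypothetical, and the choice of which member to remove (here, the downstream gene) is an assumption, because the text does not specify it.

```python
def remove_tandem_duplicates(order, blast_hits, max_evalue=0.2, max_gap=10):
    """Remove one member of each tandem duplicate pair.

    order: gene identifiers in chromosomal order (one chromosome).
    blast_hits: iterable of (gene_a, gene_b, evalue).
    A pair counts as a tandem duplicate when evalue < max_evalue and the
    genes are within max_gap positions of one another.
    """
    position = {gene: i for i, gene in enumerate(order)}
    dropped = set()
    for gene_a, gene_b, evalue in blast_hits:
        if gene_a == gene_b:
            continue  # ignore self-hits
        if gene_a not in position or gene_b not in position:
            continue
        if evalue >= max_evalue:
            continue
        if abs(position[gene_a] - position[gene_b]) > max_gap:
            continue
        if gene_a in dropped or gene_b in dropped:
            continue  # this pair is already broken up
        # Assumption: the downstream member of the pair is removed.
        downstream = gene_a if position[gene_a] > position[gene_b] else gene_b
        dropped.add(downstream)
    return [gene for gene in order if gene not in dropped]
```

The filtered gene order can then be used to recompute coexpression of the remaining neighbors.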
Identification of Genes in the Same Metabolic Pathway
PERL scripts that carry out the methods described in this work are available from the authors on request.
We thank Yi Li, Kathryn Madagan, Fabian Vaistij, Eng-Kiat Lim, and Chris Winefield for their helpful comments and discussion. E.J.B.W. is funded by the BBSRC Exploiting Genomics Initiative (grant no. EGA16205). The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2131104.
1 Corresponding author. [Supplemental material is available online at www.genome.org. All of the raw microarray data and metabolic pathway data will be made available as additional information. Also, all programs used to analyze data will be made available on request as well as any other data used in the analyses.]
Adachi, N. and Lieber, M.R. 2002. Bidirectional gene organization: A common architectural feature of the human genome. Cell 109: 807-809.
The Arabidopsis Genome Initiative. 2000. Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796-815.
Birnbaum, K., Shasha, D.E., Wang, J.Y., Jung, J.W., Lambert, G.M., Galbraith, D.W., and Benfey, P.N. 2003. A gene expression map of the Arabidopsis root. Science 302: 1956-1960.
Blumenthal, T., Evans, D., Link, C.D., Guffanti, A., Lawson, D., Thierry-Mieg, J., Thierry-Mieg, D., Chiu, W.L., Duke, K., Kiraly, M., et al. 2002. A global analysis of Caenorhabditis elegans operons. Nature 417: 851-854.
Boutanaev, A.M., Kalmykova, A.I., Shevelyov, Y.Y., and Nurminsky, D.I. 2002. Large clusters of co-expressed genes in the Drosophila genome. Nature 420: 666-669.
Caron, H., Peter, M., Vansluis, P., Speleman, F., Dekraker, J., Laureys, G., Michon, J., Brugieres, L., Voute, P.A., Westerveld, A., et al. 1995. Evidence for 2 tumor-suppressor loci on chromosomal bands 1p35-36 involved in neuroblastoma: one probably imprinted, another associated with n-myc amplification. Hum. Mol. Genet. 4: 535-539.
Cohen, B.A., Mitra, R.D., Hughes, J.D., and Church, G.M. 2000. A computational analysis of whole-genome expression data reveals chromosomal domains of gene expression. Nat. Genet. 26: 183-186.
Craigon, D.J., James, N., Okyere, J., Higgins, J., Jotham, J., and May, S. 2004. NASCArrays: A repository for microarray data generated by NASC's transcriptomics service. Nucleic Acids Res. 32: D575-D577.
Elo, A., Lyznik, A., Gonzalez, D.O., Kachman, S.D., and Mackenzie, S.A. 2003. Nuclear genes that encode mitochondrial proteins for DNA and RNA metabolism are clustered in the Arabidopsis genome. Plant Cell 15: 1619-1631.
Ge, H., Liu, Z., Church, G.M., and Vidal, M. 2001. Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat. Genet. 29: 482-486.
Gerasimova, T.I. and Corces, V.G. 2001. Chromatin insulators and boundaries: Effects on transcription and nuclear organization. Annu. Rev. Genet. 35: 193-208.
Glazko, G.V., Rogozin, I.B., and Glazkov, M.V. 2000. Computer prediction of DNA sites of attachment to different nuclear matrix elements. Mol. Biol. 34: 15.
Gray, T.A., Saitoh, S., and Nicholls, R.D. 1999. An imprinted, mammalian bicistronic transcript encodes two independent proteins. Proc. Natl. Acad. Sci. 96: 5616-5621.
Grigoriev, A. 2001. A relationship between gene expression and protein interactions on the proteome scale: Analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae. Nucleic Acids Res. 29: 3513-3519.
Jansen, R., Greenbaum, D., and Gerstein, M. 2002. Relating whole-genome expression data with protein-protein interactions. Genome Res. 12: 37-46.
Kanehisa, M. 2002. The KEGG database. Novartis Found. Symp. 247: 91-101.
Kruglyak, S. and Tang, S. 2000. Regulation of adjacent yeast genes. Trends Genet. 16: 109-111.
Kuo, W.P., Jenssen, T.K., Butte, A.J., Ohno-Machado, L., and Kohane, I.S. 2002. Analysis of matched mRNA measurements from two different microarray technologies. Bioinformatics 18: 405-412.
Lee, J.M. and Sonnhammer, E.L. 2003. Genomic gene clustering analysis of pathways in eukaryotes. Genome Res. 13: 875-882.
Lercher, M.J., Urrutia, A.O., and Hurst, L.D. 2002. Clustering of housekeeping genes provides a unified model of gene order in the human genome. Nat. Genet. 31: 180-183.
Lercher, M.J., Blumenthal, T., and Hurst, L.D. 2003. Coexpression of neighboring genes in Caenorhabditis elegans is mostly due to operons and duplicate genes. Genome Res. 13: 238-243.
Li, J., Pankratz, M., and Johnson, J.A. 2002. Differential gene expression patterns revealed by oligonucleotide versus long cDNA arrays. Toxicol. Sci. 69: 383-390.
Mishra, R.K. and Karch, F. 1999. Boundaries that demarcate structural and functional domains of chromatin. J. Biosci. 24: 377-399.
Reiss, J., Cohen, N., Dorche, C., Mandel, H., Mendel, R.R., Stallmeyer, B., Zabot, M.T., and Dierks, T. 1998. Mutations in a polycistronic nuclear gene associated with molybdenum cofactor deficiency. Nat. Genet. 20: 51-53.
Shin, R., Kim, M.J., and Paek, K.H. 2003. The CaTin1 (Capsicum annuum TMV-induced Clone 1) and CaTin1-2 genes are linked head-to-head and share a bidirectional promoter. Plant Cell Physiol. 44: 549-554.
Spellman, P.T. and Rubin, G.M. 2002. Evidence for large domains of similarly expressed genes in the Drosophila genome. J. Biol. 1: 5.
ftp://genome-ftp.stanford.edu/pub/smd/organisms/AT; Stanford database.
http://arabidopsis.info/prototype; NASC Affymetrix database.
http://mips.gsf.de/proj/thal/db/index.html; MIPS Web site.
http://www.genome.ad.jp/kegg/; KEGG database.
Received October 31, 2003; accepted in revised form February 18, 2004.