Published online before print
December 12, 2005, 10.1101/gr.4137606 Genome Res. 16:30-36, 2006 ©2006 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/06 $5.00
Letter
Transcription-mediated gene fusion in the human genome
1 Compugen Ltd., Tel Aviv 69512, Israel
2 Faculty of Life Sciences, Bar Ilan University, Ramat Gan 52900, Israel
Transcription of a gene usually ends at a regulated termination point, preventing the RNA-polymerase from reading through the next gene. However, sporadic reports suggest that chimeric transcripts, formed by transcription of two consecutive genes into one RNA, can occur in human. The splicing and translation of such RNAs can lead to a new, fused protein, having domains from both original proteins. Here, we systematically identified over 200 cases of intergenic splicing in the human genome (involving 421 genes), and experimentally demonstrated that at least half of these fusions exist in human tissues. We showed that unique splicing patterns dominate the functional and regulatory nature of the resulting transcripts, and found intergenic distance bias in fused compared with nonfused genes. We demonstrate that the hundreds of fused genes we identified are only a subset of the actual number of fused genes in human. We describe a novel evolutionary mechanism where transcription-induced chimerism followed by retroposition results in a new, active fused gene. Finally, we provide evidence that transcription-induced chimerism can be a mechanism contributing to the evolution of protein complexes.
Eukaryotic genes are generally well defined on the genome. Transcription usually begins from a transcription start site, which is guided by the promoter, and ends at a regulated termination point (Zhao et al. 1999
Fused RNAs were shown to be regulated and to have unique expression patterns. For example, the HHLA1-OC90 fusion transcript is restricted to teratocarcinoma cell lines and is absent from normal cells (Kowalski et al. 1999).
As no systematic effort has been made to detect transcription-induced chimerism (TIC) events across the human genome, the extent of this phenomenon and its implications are unknown. In this study, we systematically identified TICs in human and discovered unique features that characterize them. We describe previously unappreciated roles of TICs in the evolution of proteins and protein complexes, and demonstrate novel implications for the regulation and function of genes.
Computational identification of transcription-induced chimerism events
To characterize the phenomenon of TIC in a genome-wide manner, we first clustered ESTs and cDNAs from GenBank version 136 onto the human genome sequence (build 33) using the LEADS software platform (Sorek et al. 2002
To isolate reliable TIC events, we looked for clusters that contain at least two nonoverlapping cDNA sequences annotated as "complete CDS." To avoid contamination, we required that the sequences connecting these two cDNAs be canonically spliced and share at least one splice site with each of the two separate genes. This requirement also screens out cases of naturally overlapping genes (Veeramachaneni et al. 2004
After removing the above-described artifacts, we acquired a reliable data set of 212 TIC events (Supplemental Table S1). The data set contained 421 genes, with four of them participating in more than one fusion (i.e., there was evidence for their fusion both with their upstream and downstream neighboring genes). Of the 212 fusion events, 54 (25%) were supported by more than one expressed sequence, suggesting that in most cases the fusion event is relatively rare or confined to a certain tissue/condition.
To understand the extent to which these events are conserved between species, we searched for evidence of our 212 TICs in ESTs of other mammals (see Methods). In 22 cases (10%), we were able to identify an EST from another species supporting the human TIC (Supplemental Table S1). This rate of conservation is similar to the reported 11% conservation of alternative splicing events between human and mouse (Yeo et al. 2005).
As mentioned above, 13 human TICs were reported previously (Table 1). Of these, five were supported by ESTs, with three of the five supported by a single EST, indicating that even one supporting spliced EST can reliably report a true transcriptional fusion. For the remaining eight reported events, no EST showed their existence. We therefore conclude that the 421 genes we detected are only a subset of the actual number of fused genes in human.
To test the possibility that transcription-induced chimerism is cancer induced, we used EST library annotations to extract the histological origin (cancer/normal) of each EST. We then compared the histology distribution of the fusing ESTs with the general distribution in all ESTs. Of the fusing sequences, 51% originate from normal tissues, compared with 46% in the entire EST population. These results indicate that the transcription-induced chimeras present in our data are not the outcome of a cancerous condition.
Unique intergenic splicing patterns and intergenic distance bias in fusion events
Figure 2 lists the distribution of the fusion-generating splicing patterns we observed. The most abundant donor site in the upstream gene is located in the last intron (55%), and the most abundant acceptor site in the downstream gene is located in the first intron (80%). This indicates a strong tendency to separately drop the last and first exons of the upstream and downstream genes, respectively, and is consistent with the 44% of cases indicated above that drop both first and last exons (0.55 × 0.80 = 0.44, if the two choices are independent). In 12% of the events, a novel exon, residing between the two fused genes, appears in the fusion transcript (and not in any of the nonconnecting sequences). The fact that such novel exons originate from the intergenic region rules out the possibility that the fusion is mediated by trans-splicing, because under trans-splicing the intergenic region would not have been transcribed.
To test whether there is a preference for specific intergenic distance between fused genes, we calculated the intergenic distance distribution of the 212 fusion events, and compared it with the distance distribution of 12,395 human adjacent genes (see Methods). As shown in Figure 3, fused genes tend to reside closer on the genome than the entire human gene pair population; the median distance between fused genes was 8.5 kb, compared with a median of 48 kb for the entire gene population. This indicates that the mechanism involved in TIC generation strongly prefers shorter distances between the genes. However, our data show that gene fusions can also occur over large gene distances: In 5% of the fusion events, the distance between fused genes exceeded 50 kb.
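The distance comparison described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the study: the function names, the coordinate convention (end of the upstream gene and start of the downstream gene, in bp on the same chromosome), and the toy coordinates are all hypothetical.

```python
from statistics import median
from typing import List, Tuple

# Hypothetical representation: (upstream_gene_end, downstream_gene_start) in bp.
GenePair = Tuple[int, int]


def intergenic_distance(upstream_end: int, downstream_start: int) -> int:
    """Distance between two adjacent genes; overlapping genes count as 0."""
    return max(0, downstream_start - upstream_end)


def summarize(pairs: List[GenePair], threshold: int = 50_000) -> Tuple[float, float]:
    """Return (median distance, fraction of pairs farther apart than threshold)."""
    distances = [intergenic_distance(up, down) for up, down in pairs]
    far = sum(d > threshold for d in distances)
    return median(distances), far / len(distances)


# Toy data only; these are not coordinates from the study.
fused_pairs = [(1_000, 9_500), (0, 2_000), (5_000, 65_000)]
med, frac_far = summarize(fused_pairs)
print(med, frac_far)
```

Running the same summary on the fused pairs and on all adjacent gene pairs would give the two medians being compared; a formal test of the difference between the distributions is outside the scope of this sketch.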
RT-PCR experimental validation
Functions of fusion products
What are the possible functions of transcription-induced chimeras? To understand this, we examined the fusion patterns with respect to the resulting ORF. In 53 events in our set (25%), a fusion protein containing coding sequences of both genes (without a premature stop codon) was created. This kind of fusion might generate a bifunctional protein having properties from both original proteins, as happens in the known cases of TWE-PRIL (Pradet-Balade et al. 2002
Another functional impact could be at the transcriptional regulation level. This will occur when the fusion involves only the first exon of the upstream gene, so that the upstream gene mainly contributes its 5'UTR to the fused transcript. Indeed, 26 (12%) of our events correspond to this type of fusion. This will potentially cause the downstream gene to be regulated as the upstream one, both transcriptionally (promoter) and translationally (5'UTR; see Fig. 4B). Multiple variable first exons were described in many human genes, functioning in alternating gene regulation (Zhang et al. 2004
TIC may also serve to suppress the expression of the upstream gene through the nonsense-mediated decay (NMD) mechanism (Hillman et al. 2004).
Finally, some of the events seen in our database could represent transcriptional "leakage," where the transcriptional machinery accidentally ignores the termination signal of the upstream gene and transcribes through the downstream one. Such leakage can be a rare, nonregulated stochastic event that does not contribute to the fitness of the organism. Indeed, only 33% of the cases in our database were supported by multiple-sequence evidence, demonstrating the low frequency of the majority of our events. In addition, only 10% of TICs were found to be conserved between species. It could also be argued that the low frequency of protein fusion events (25% of the total) is indicative of the stochastic nature of the TIC phenomenon; however, a similar frequency of fusion proteins (23%) was also detected in the subset of events that are conserved between mammals, indicating that nonfusion-protein events can be under selective pressure as well and are hence possibly functional. Overall, although we cannot determine the actual fraction of TICs that is functional, our results suggest that at least a subset of these events have a biological role.
Currently, regulation of TIC is generally uncharacterized. Models for transcription termination indicate that both cis-acting sequence elements, as well as trans-acting termination factors that belong both to the transcriptional and the splicing machineries, act together to generate an accurate 3' end (Zhao et al. 1999
What is the proportion of TIC events in the genome? Our data suggest that
Evolution of protein complexes
Intriguingly, we were able to identify a processed pseudogene indicating fusion of the genes PIP5K1A and PSD4. Although no EST supported this fusion event, we verified experimentally the existence of a PIP5K1A-PSD4 fusion transcript in human RNA (Fig. 4C). PIP5K1A and PSD4 reside consecutively on chromosome 1, while the processed pseudogene is found on chromosome 10 as a single exon chain. Moreover, we detected a GenBank cDNA (BC068549) indicating active transcription from the pseudogene itself, suggesting that it is actually an active retrogene. Indeed, we were able to verify experimentally through RT-PCR and sequencing that this retrogene is transcribed in human tissues (Fig. 4C). This retrogene underwent considerable evolution relative to the original fused transcript (79 mismatches between the sequence of BC068549 and the PIP5K1A-PSD4 locus). Still, translation of BC068549 results in an uninterrupted PIP5K1A-PSD4 fusion protein, suggesting that the new retrogene is under selective pressure and is hence functional.
This unique pseudogene example establishes transcription-induced chimerism followed by retroposition as a novel molecular evolutionary mechanism enabling the creation of new, fused "Rosetta Stone" sequences. This mechanism is expected to affect mainly eukaryotes, where the splicing machinery can efficiently remove the intergenic region. Presumably, additional fused genes were created through this mechanism during the evolutionary history of metazoa.
Conclusions
Computational search
Human ESTs and cDNAs were obtained from NCBI GenBank version 136 (June 2003; http://www.ncbi.nlm.nih.gov/Genbank/). "Complete CDS" annotation of RNA sequences was obtained from the "DEFINITION" field in the GenBank sequence records. In each cluster, overlapping complete CDS cDNA sequences that aligned fully to the genome were grouped together. Each group was referred to as a gene. Gene boundaries were extended in cases where ESTs suggested longer UTRs than present in the RNA. In clusters containing more than one gene, connecting sequences were identified. Connecting sequences were required to have canonical splice sites at the fusion junction, and to share at least one splice site with each of the two separate genes. For the "alignment artifacts" filtering, each exon in the connecting sequences was aligned to the cDNA sequences of both connected genes. Sequences with exons aligned to both genes were discarded. For the manual filtration of fusion events, we used the following information from the UCSC genome browser: (1) occurrence of CpG islands before both genes; (2) existence of SWISS-PROT annotations for both genes; (3) existence of ORF, 5' and 3' UTRs for both genes.
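The core requirement on connecting sequences (canonical splicing at the fusion junction, plus at least one splice site shared with each of the two genes) can be illustrated with a small Python sketch. It is not the LEADS implementation. The assumptions introduced here are: transcripts are plus-strand exon chains given as 0-based half-open genomic intervals, "canonical" is taken to mean a GT...AG intron only, splice sites must match at exactly the same coordinate, and all function names are hypothetical.

```python
from typing import List, Set, Tuple

Exon = Tuple[int, int]  # (start, end), 0-based half-open, plus strand (assumption)


def splice_sites(exons: List[Exon]) -> Set[Tuple[str, int]]:
    """Donor and acceptor coordinates implied by an ordered exon chain."""
    sites = set()
    for i, (start, end) in enumerate(exons):
        if i > 0:
            sites.add(("acceptor", start))  # exon start follows an intron
        if i < len(exons) - 1:
            sites.add(("donor", end))       # exon end precedes an intron
    return sites


def is_canonical(intron_seq: str) -> bool:
    """Assumption: only GT...AG introns are treated as canonical."""
    seq = intron_seq.upper()
    return len(seq) >= 4 and seq.startswith("GT") and seq.endswith("AG")


def is_connecting(est_exons: List[Exon],
                  upstream_exons: List[Exon],
                  downstream_exons: List[Exon],
                  junction_intron_seq: str) -> bool:
    """True if the EST satisfies both requirements stated in the text."""
    est_sites = splice_sites(est_exons)
    shares_upstream = bool(est_sites & splice_sites(upstream_exons))
    shares_downstream = bool(est_sites & splice_sites(downstream_exons))
    return is_canonical(junction_intron_seq) and shares_upstream and shares_downstream


# Toy example: the EST reuses the donor at 400 and the acceptor at 1200.
upstream = [(100, 200), (300, 400), (500, 600)]
downstream = [(1000, 1100), (1200, 1300)]
est = [(350, 400), (1200, 1250)]
print(is_connecting(est, upstream, downstream, "GTAAGTCCCTTTCAG"))
```

Exact coordinate matching is a deliberately strict choice for the sketch; a real pipeline would also need to handle the minus strand and alignment ambiguity around splice junctions.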
For gene distance calculation, known RefSeqs were localized to the genome using the UCSC genome browser annotations (Karolchik et al. 2003).
To calculate possible NMD of transcripts, the fused transcript was first assembled using the upstream and downstream RefSeqs connected by the fusing EST. In this transcript, a premature stop codon was searched for according to the rule of 55 nucleotides or more upstream of the last exon-exon junction.
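The 55-nucleotide rule can be sketched as follows. This is an illustration only, under assumptions that the text does not specify: the assembled transcript is supplied as a spliced sequence with its exon lengths in order, the reading frame starts at a known `cds_start` offset, and the distance is measured from the last base of the stop codon to the last exon-exon junction. The function names are hypothetical.

```python
from typing import List, Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_end(seq: str, cds_start: int) -> Optional[int]:
    """Index just past the first in-frame stop codon, or None if there is none."""
    for i in range(cds_start, len(seq) - 2, 3):
        if seq[i:i + 3].upper() in STOP_CODONS:
            return i + 3
    return None


def is_nmd_candidate(seq: str, exon_lengths: List[int], cds_start: int,
                     min_distance: int = 55) -> bool:
    """Apply the '55 nt or more upstream of the last junction' rule."""
    if len(exon_lengths) < 2:
        return False  # no exon-exon junction, so the rule cannot apply
    stop_end = first_stop_end(seq, cds_start)
    if stop_end is None:
        return False
    last_junction = sum(exon_lengths[:-1])  # transcript offset of the last junction
    return last_junction - stop_end >= min_distance


# Toy transcript: stop codon ends at offset 12, last junction at offset 72.
toy = "ATG" + "GCT" * 2 + "TAA" + "G" * 60 + "C" * 20
print(is_nmd_candidate(toy, exon_lengths=[72, 20], cds_start=0))
```

A stop codon located in the last exon, or closer than 55 nucleotides to the last junction, is not flagged by this rule.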
For the "evolution of protein complexes" analysis, annotations of genes were downloaded from the "RefSeq Summary" field in the UCSC genome browser and from the comments fields in SWISS-PROT. Processed pseudogenes indicating TIC were systematically searched for in the database of >8000 processed pseudogenes compiled by Zhang et al. (2003
Experimental validation of fusion events
We thank D. Schaffer, A. Golubev, A. Haviv, and N. Keren for biocomputational assistance; M. Oz for providing critical resources; G. Naveh for literature assistance; and Z. Levine, U. Nir, K. Savitsky, E. Levanon, D. Milo, S. Pollock, G. Cojocaru, E. Eisenberg, and D. Dahary for fruitful discussions.
Article published online ahead of print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.4137606.
3 These two authors contributed equally to this work.
4 Present address: Tel Aviv University, Department of Human Genetics, Sackler Faculty of Medicine, Tel Aviv, 69978 Israel.
5 Corresponding author. [Supplemental material is available online at www.genome.org.]
Communi, D., Suarez-Huerta, N., Dussossoy, D., Savi, P., and Boeynaems, J.M. 2001. Cotranscription and intergenic splicing of human P2Y11 and SSF1 genes. J. Biol. Chem. 276: 16561-16566.
Cox, P.R., Siddique, T., and Zoghbi, H.Y. 2001. Genomic organization of Tropomodulins 2 and 4 and unusual intergenic and intraexonic splicing of YL-1 and Tropomodulin 4. BMC Genomics 2: 7.
Enright, A.J. and Ouzounis, C.A. 2001. Functional associations of proteins in entire genomes by means of exhaustive detection of gene fusions. Genome Biol. 2: research0034.
Fears, S., Mathieu, C., Zeleznik-Le, N., Huang, S., Rowley, J.D., and Nucifora, G. 1996. Intergenic splicing of MDS1 and EVI1 occurs in normal tissues as well as in myeloid leukemia and produces a new member of the PR domain family. Proc. Natl. Acad. Sci. 93: 1642-1647.
Gardiner-Garden, M. and Frommer, M. 1987. CpG islands in vertebrate genomes. J. Mol. Biol. 196: 261-282.
Gilles, A.M., Presecan, E., Vonica, A., and Lascu, I. 1991. Nucleoside diphosphate kinase from human erythrocytes. Structural characterization of the two polypeptide chains responsible for heterogeneity of the hexameric enzyme. J. Biol. Chem. 266: 8784-8789.
Hardy, R.W. and Wertz, G.W. 1998. The product of the respiratory syncytial virus M2 gene ORF1 enhances readthrough of intergenic junctions during viral transcription. J. Virol. 72: 520-526.
Hillman, R.T., Green, R.E., and Brenner, S.E. 2004. An unappreciated role for RNA surveillance. Genome Biol. 5: R8.
Karolchik, D., Baertsch, R., Diekhans, M., Furey, T.S., Hinrichs, A., Lu, Y.T., Roskin, K.M., Schwartz, M., Sugnet, C.W., Thomas, D.J., et al. 2003. The UCSC Genome Browser Database. Nucleic Acids Res. 31: 51-54.
Kato, M., Khan, S., Gonzalez, N., O'Neill, B.P., McDonald, K.J., Cooper, B.J., Angel, N.Z., and Hart, D.N. 2003. Hodgkin's lymphoma cell lines express a fusion protein encoded by intergenically spliced mRNA for the multilectin receptor DEC-205 (CD205) and a novel C-type lectin receptor DCL-1. J. Biol. Chem. 278: 34035-34041.
Kowalski, P.E., Freeman, J.D., and Mager, D.L. 1999. Intergenic splicing between a HERV-H endogenous retrovirus and two adjacent human genes. Genomics 57: 371-379.
Lander, E.S., Linton, L.M., Birren, B., Nusbaum, C., Zody, M.C., Baldwin, J., Devon, K., Dewar, K., Doyle, M., FitzHugh, W., et al. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921.
Magrangeas, F., Pitiot, G., Dubois, S., Bragado-Nilsson, E., Cherel, M., Jobert, S., Lebeau, B., Boisteau, O., Lethe, B., Mallet, J., et al. 1998. Cotranscription and intergenic splicing of human galactose-1-phosphate uridylyltransferase and interleukin-11 receptor
Marcotte, E.M., Pellegrini, M., Ng, H.L., Rice, D.W., Yeates, T.O., and Eisenberg, D. 1999. Detecting protein function and protein-protein interactions from genome sequences. Science 285: 751-753.
Millar, J.K., Christie, S., Semple, C.A., and Porteous, D.J. 2000. Chromosomal location and genomic structure of the human translin-associated factor X gene (TRAX; TSNAX) revealed by intergenic splicing to DISC1, a gene disrupted by a translocation segregating with schizophrenia. Genomics 67: 69-77.
Pardigol, A., Forssmann, U., Zucht, H.D., Loetscher, P., Schulz-Knappe, P., Baggiolini, M., Forssmann, W.G., and Magert, H.J. 1998. HCC-2, a human chemokine: Gene structure, expression pattern, and biological activity. Proc. Natl. Acad. Sci. 95: 6308-6313.
Parra, G., Reymond, A., Dabbouseh, N., Dermitzakis, E.T., Castelo, R., Thomson, T.M., Antonarakis, S.E., and Guigó, R. 2006. Tandem chimerism as a means to increase protein complexity in the human genome. Genome Res. (this issue).
Poulin, F., Brueschke, A., and Sonenberg, N. 2003. Gene fusion and overlapping reading frames in the mammalian genes for 4E-BP3 and MASK. J. Biol. Chem. 278: 52290-52297.
Pradet-Balade, B., Medema, J.P., Lopez-Fraga, M., Lozano, J.C., Kolfschoten, G.M., Picard, A., Martinez, A.C., Garcia-Sanz, J.A., and Hahne, M. 2002. An endogenous hybrid mRNA encodes TWE-PRIL, a functional cell surface TWEAK-APRIL fusion protein. EMBO J. 21: 5711-5720.
Proudfoot, N.J., Furger, A., and Dye, M.J. 2002. Integrating mRNA processing with transcription. Cell 108: 501-512.
Sorek, R. and Safer, H.M. 2003. A novel algorithm for computational identification of contaminated EST libraries. Nucleic Acids Res. 31: 1067-1074.
Sorek, R., Ast, G., and Graur, D. 2002. Alu-containing exons are alternatively spliced. Genome Res. 12: 1060-1067.
Sorek, R., Shemesh, R., Cohen, Y., Basechess, O., Ast, G., and Shamir, R. 2004. A non-EST-based method for exon-skipping prediction. Genome Res. 14: 1617-1623.
Thomson, T.M., Lozano, J.J., Loukili, N., Carrio, R., Serras, F., Cormand, B., Valeri, M., Diaz, V.M., Abril, J., Burset, M., et al. 2000. Fusion of the human gene for the polyubiquitination coeffector UEV1 with Kua, a newly identified gene. Genome Res. 10: 1743-1756.
Upadhyaya, A.B., Lee, S.H., and DeJong, J. 1999. Identification of a general transcription factor TFIIA
Veeramachaneni, V., Makalowski, W., Galdzicki, M., Sood, R., and Makalowska, I. 2004. Mammalian overlapping genes: The comparative perspective. Genome Res. 14: 280-286.
Yelin, R., Dahary, D., Sorek, R., Levanon, E.Y., Goldstein, O., Shoshan, A., Diber, A., Biton, S., Tamir, Y., Khosravi, R., et al. 2003. Widespread occurrence of antisense transcription in the human genome. Nat. Biotechnol. 21: 379-386.
Yeo, G.W., Van Nostrand, E., Holste, D., Poggio, T., and Burge, C.B. 2005. Identification and analysis of alternative splicing events conserved in human and mouse. Proc. Natl. Acad. Sci. 102: 2850-2855.
Zaphiropoulos, P.G. 1999. RNA molecules containing exons originating from different members of the cytochrome P450 2C gene subfamily (CYP2C) in human epidermis and liver. Nucleic Acids Res. 27: 2585-2590.
Zhang, Z., Harrison, P.M., Liu, Y., and Gerstein, M. 2003. Millions of years of evolution preserved: A comprehensive catalog of the processed pseudogenes in the human genome. Genome Res. 13: 2541-2558.
Zhang, T., Haws, P., and Wu, Q. 2004. Multiple variable first exons: A mechanism for cell- and tissue-specific gene regulation. Genome Res. 14: 79-89.
Zhao, J., Hyman, L., and Moore, C. 1999. Formation of mRNA 3' ends in eukaryotes: Mechanism, regulation, and interrelationships with other steps in mRNA synthesis. Microbiol. Mol. Biol. Rev. 63: 405-445.
http://www.ncbi.nlm.nih.gov/Genbank/; NCBI GenBank version 136 (June 2003).
http://www.ncbi.nlm.nih.gov/genome/guide/human/; human genome build 33 (April 2003).
http://genome.ucsc.edu/; This site contains the reference sequence and working draft assemblies for a large collection of genomes. It also provides a portal to the ENCODE project.
Received May 16, 2005; accepted in revised format September 13, 2005.