Genome Res. 15:566-576, 2005
©2005 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/05 $5.00
Methods
ECgene: Genome-based EST clustering and gene modeling for alternative splicing
1 Division of Molecular Life Sciences, Ewha Womans University, Seoul 120-750, Korea
2 School of Chemistry, Seoul National University, Seoul 151-747, Korea
With the availability of the human genome map and fast algorithms for sequence alignment, genome-based EST clustering became a viable method for gene modeling. We developed a novel gene-modeling method, ECgene (Gene modeling by EST Clustering), which combines genome-based EST clustering and the transcript assembly procedure in a coherent and consistent fashion. Specifically, ECgene takes alternative splicing events into consideration. The position of splice sites (i.e., exon-intron boundaries) in the genome map is utilized as the critical information in the whole procedure. Sequences that share any splice sites are grouped together to define an EST cluster in a manner similar to that of the genome-based version of the UniGene algorithm. Transcript assembly is achieved using graph theory that represents the exon connectivity in each cluster as a directed acyclic graph (DAG). Distinct paths along exons correspond to possible gene models encompassing all alternative splicing events. EST sequences in each cluster are subclustered further according to the compatibility with gene structure of each splice variant, and they can be regarded as clone evidence for the corresponding isoform. The reliability of each isoform is assessed from the nature of cluster members and from the minimum number of clones required to reconstruct all exons in the transcript.
Expressed sequence tag (EST) clustering has played a central role in finding unknown genes, as evidenced by the widespread use of NCBI's UniGene (Schuler et al. 1996
Several well-known EST clustering algorithms exist, most of which depend on pairwise alignment of ESTs. NCBI's UniGene is the most widely used; its original version was "transcript-based," examining all pairwise alignments of mRNA and EST sequences. As of build 162, UniGene switched to a "genome-based" algorithm for the human genome, which is quite similar to our algorithm (http://www.ncbi.nlm.nih.gov/UniGene). Its major strengths are the rapid update (releases generally take less than one month) and the extensive annotation providing ample external links to important resources such as LocusLink, OMIM, MapViewer, etc. TIGR Gene Indices (TGI) is another well-known EST clustering procedure that combines EST clustering based on sequence similarity and the transcript assembly procedure (Quackenbush et al. 2001
Alternative splicing (AS) is an important mechanism of modulating gene expression and function. Recent studies on AS estimated that 40%-70% of human genes have alternatively spliced transcripts, and AS thus is established as a major mechanism of expanding proteome diversity (Graveley 2001
Now that the genome map is available, it is possible to exploit the information hidden in the intronic part of the genome. Lee and coworkers at UCLA developed an algorithm that took advantage of the intronic information to detect splice variants (Modrek et al. 2001
There have been some efforts to combine sequence clustering and transcript assembly procedures. Eyras and coworkers (2004
Here we present a novel gene-modeling method, ECgene (gene modeling by EST Clustering), which combines genome-based EST clustering and transcript assembly procedure in a coherent and consistent fashion, taking alternative splicing events into account. Algorithmic details are described with genome-wide analyses of the human, mouse, and rat genomes.
Algorithm overview
An overview of the ECgene algorithm is shown in Figure 1 as a flow chart. The ECgene algorithm was already implemented as a Web server application in ASmodeler (Kim et al. 2004
Transcript assembly using graph theory: Splice variant analysis
The most critical part of the ECgene algorithm is the transcript assembly encompassing the analysis of splice variants. Other algorithmic details are given in the Methods section. The hypothetical cluster in Figure 2A illustrates the principle. We have a cluster of 16 spliced sequences that share at least one splice site. This cluster includes examples of the most frequently seen AS types: alternative promoters, exon skipping, alternative splice sites (5' and 3'), and alternative polyA sites.
Our gene model is based on the graph-theoretic analysis of exon connectivity. The exon connectivity can be represented as a DAG as shown in Figure 2B. Nodes and edges correspond to exons and introns, respectively. For example, connectivity in sequence #4 is A → C → D → E, and sequence #6 has a connectivity of C → D → G with the exon E skipped. We repeated the same procedure for all 16 sequences to construct a DAG for this cluster, as shown in Figure 2B.
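As a minimal sketch of this construction (not the ECgene implementation itself), the exon chains of the aligned sequences can be folded into a single adjacency map. The exon chains for sequences #4 and #6 are taken from the text; the function name `build_exon_dag` and the dictionary layout are assumptions made for illustration.

```python
from collections import defaultdict

def build_exon_dag(exon_chains):
    """Merge per-sequence exon chains into one directed graph.

    exon_chains maps a sequence id to its ordered list of exon labels.
    Each consecutive exon pair contributes one edge (an intron).
    """
    dag = defaultdict(set)
    for chain in exon_chains.values():
        for upstream, downstream in zip(chain, chain[1:]):
            dag[upstream].add(downstream)
        if chain:
            # Register the last exon as a node even if it has no outgoing edge.
            dag.setdefault(chain[-1], set())
    return {exon: sorted(targets) for exon, targets in dag.items()}

# Only the two chains quoted in the text; the full cluster has 16 sequences.
chains = {
    "seq4": ["A", "C", "D", "E"],
    "seq6": ["C", "D", "G"],
}
dag = build_exon_dag(chains)
```

With these two sequences alone, exon D gains two outgoing edges (to E and to G), which is how an exon-skipping event shows up as a branch in the graph.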
Various ways of generating graphs have been reported. In the altSplice work by Sugnet et al. (2004
Next, we search for all possible paths along the exons in the DAG, each path representing an inferred transcript model. The path should start from the source nodes (exons A and B) and end with the terminal nodes (exons H and I). The polyA tail in the middle of exon D is ignored for the moment. All possible paths along exons are found using the standard depth-first-search (DFS) method. Sixteen paths exist in this case, as shown in Figure 2C. For each DFS solution, we look for compatible sequences in the cluster. For example, the second path A
However, not all exons are necessarily covered by member sequences in every transcript model at this stage. For example, the path A The presence of a polyA tail is definite proof of the transcript end. Five sequences (2, 11, 13, 15, 16) show evidence of polyA tails. PolyA detection in the ECgene algorithm is carried out based on conservative criteria for examining the genomic DNA sequences, as described in the Methods section. Four of the five sequences with the identical genomic locus have polyA tails at the terminal exon. However, sequence #2 has a valid polyA tail in the middle of exon D. The transcript models with sequence #2 as a member should terminate at this site, and it cannot be part of longer transcript models. The program examines all transcript models with intermediate polyA tails, and creates separate shorter transcript models with an alternative polyA tail. Detailed descriptions regarding detection of polyA tails and criteria for termination of transcripts based on the presence of polyA tails are given in the Methods section. In the end, we have nine transcript models for this example cluster, as shown in Figure 2C. Each gene model has cluster members and information on polyA tails as supporting evidence.
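The path search and the compatibility check described above can be sketched as follows. The toy graph is hypothetical and much smaller than the cluster in Figure 2; the names `enumerate_paths` and `is_compatible` are assumptions, and the splitting of transcripts at intermediate polyA sites is deliberately left out.

```python
def enumerate_paths(dag):
    """Return every source-to-sink path in a DAG via depth-first search.

    dag maps each exon to the list of exons it connects to; every exon
    must appear as a key, sinks with an empty list.
    """
    has_incoming = {t for targets in dag.values() for t in targets}
    sources = sorted(n for n in dag if n not in has_incoming)
    paths = []

    def dfs(node, trail):
        trail = trail + [node]
        if not dag[node]:          # sink: no outgoing edge
            paths.append(trail)
            return
        for nxt in dag[node]:
            dfs(nxt, trail)

    for source in sources:
        dfs(source, [])
    return paths

def is_compatible(chain, path):
    """True if a sequence's exon chain is a contiguous run of the path."""
    n = len(chain)
    return any(path[i:i + n] == chain for i in range(len(path) - n + 1))

# Hypothetical graph: two alternative first exons and one skippable exon E.
toy_dag = {"A": ["C"], "B": ["C"], "C": ["D"], "D": ["E", "G"], "E": ["G"], "G": []}
paths = enumerate_paths(toy_dag)
```

In this toy case a sequence with chain C → D → G supports the paths that skip exon E but not the paths that include it, which mirrors how cluster members are assigned to individual transcript models.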
Minimal set of representative clones and ECgene reliability
The only direct evidence for the validity of a gene model is experimental verification of the full-length clone. As an aid to judging the reliability of a gene model, we calculated the minimal set of clones ("MinClones") that are required to cover all exons in each transcript model. Alignments within a single exon (i.e., unspliced alignments) are not included in the calculation. Therefore, they represent a set of clones that reproduces the exon-intron structure of the transcript model. Transcripts whose number of MinClones = 1 can be regarded as gene models with experimentally verified full-length clones, being close to the RefSeq quality. Those transcripts are classified as the ECgene Part A, and transcripts with slightly lower quality (but still highly probable) are included in Part B. Figure 2C shows MinClones for each transcript model. The first transcript has sequence #2 as the full-length clone. The second transcript needs two sequences (#4 and #7) to cover all exons. Those two sequences may be from different cDNA libraries and may not coexist in a single type of cell. This makes the second transcript less reliable than the first one. Isoforms that are possible only by joining two sequences are regarded as having less than sufficient evidence and are classified as belonging to ECgene Part B. Note that we have two transcript models with MinClones = 3, and that they belong to the ECgene Part C. Transcript structure in Part C may be questionable, since it requires concatenation of more than two clones. However, individual events of alternative splicing implied in the transcripts should be real unless the genomic alignment of mRNA/EST sequences is erroneous. There is a good chance that some of them will turn out to be real transcripts with more sequence data available in the future.
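Finding MinClones amounts to a small set-cover problem. The sketch below uses an exhaustive search, which is an assumption about how one might compute it and is only practical for small clusters; the paper does not state the search strategy. The clone names and exon sets are hypothetical, and `reliability_part` is a simplified mapping that ignores the additional Part A criteria mentioned elsewhere in the text.

```python
from itertools import combinations

def min_clones(transcript_exons, clones):
    """Smallest set of clones whose exons jointly cover the transcript.

    clones maps clone id -> set of exons it covers. Only spliced clones
    compatible with the transcript should be passed in.
    """
    target = set(transcript_exons)
    ids = sorted(clones)
    for size in range(1, len(ids) + 1):
        for combo in combinations(ids, size):
            covered = set().union(*(clones[c] for c in combo))
            if target <= covered:
                return list(combo)
    return None  # the transcript cannot be covered by the given clones

def reliability_part(n_min_clones):
    """Simplified mapping: 1 clone -> Part A, 2 -> Part B, more -> Part C."""
    return {1: "A", 2: "B"}.get(n_min_clones, "C")

# Hypothetical transcript and clone evidence.
transcript = ["A", "C", "D", "E", "G"]
evidence = {
    "seq4": {"A", "C", "D", "E"},
    "seq7": {"E", "G"},
    "seq9": {"D", "E"},
}
chosen = min_clones(transcript, evidence)
part = reliability_part(len(chosen))
```

Here no single clone spans all five exons, so two clones are needed and the transcript would fall into Part B under the simplified mapping.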
ECgene genome browser
To provide more detailed information on the ECgene models, we created a utility program that shows ECgene models as custom tracks in the UCSC genome browser. Figure 3 shows the transcript structure of the BRCA2 gene using the ECgene genome browser available at http://genome.ewha.ac.kr/ECgene/gbr/. The GUI design is almost identical to that of the UCSC genome browser as shown in Figure 3A. The most useful feature would be the option of showing EST alignment that adds each transcript model and member sequences as a separate custom track. The title line includes a brief summary of the transcript model and clones. Clicking on the transcript or the title line expands the picture to show the alignment as in Figure 3B. We also provide the option of hiding unspliced alignments, since many of them are likely to be incomplete or artifactual. Furthermore, one can see the result of UniGene clustering for comparison, even though it is just a collection of alignments without transcript models. For the BRCA2 gene, all 56 sequences in the UniGene cluster align in this genomic region. However, we often find that our clusters are substantially different from the UniGene clusters.
Analysis of human, mouse, and rat transcriptomes
We applied the ECgene algorithm to the human, mouse, and rat genomes. The total number of input sequences is summarized in Table 1; 92%-96% of RefSeq sequences align onto the genome with good quality. The aligned percentage decreases slightly for mRNA and EST sequences. The percentage of spliced sequences reflects the nature of EST clones, which are short sequences from single-pass reads. Note that the number of mRNA and EST sequences for rat is about one-tenth of those of the human and mouse genomes.
Table 2 is a summary of the application of the ECgene algorithm on the human genome. Part A contains 57,172 genes of almost RefSeq quality, 37,497 (66%) of which are multi-exon genes. The portion of single-exon genes is rather high compared to the input RefSeq in Table 1, since our criterion requires an mRNA and the number of sequences ≥ 8 regardless of the availability of full-length clones. The percentage of alternatively spliced genes among multi-exon genes varies from 25% to 43% depending on the transcript reliability. The average numbers of isoforms for multi-exon genes range from 4.1 to 7.9. Approximately 80% of alternatively spliced genes have at least one splice variant supported by EST sequences only. All of these numbers are in good agreement with previous reports.
The total number of clusters is 311,252, a rather large number, but 55% (171,755 clusters) of those contain only one EST. Statistics regarding cluster size versus number of ECgene clusters are available in Supplemental Table S1. Clusters with an unusually large number of sequences are from the mitochondrial genome, except in one case. The statistics for coding versus noncoding transcripts are rather interesting. Whereas the number of coding transcripts shows a steady increase in the three groups in Table 2, the number of noncoding transcripts increases dramatically in Part C. Furthermore, we find that only 27 of the 79,153 noncoding transcripts in Part C have polyA tails. This strongly suggests that a substantial portion of noncoding transcripts in Part C might be artifacts, although we cannot rule out the possibility of noncoding RNAs being transcribed by different classes of RNA polymerase. Summaries for the mouse and rat genomes are given in Tables 3 and 4, respectively. The trends in mouse are almost the same as in human. The extent of alternative splicing is slightly decreased, probably due to the smaller size of the EST database. The rat genome shows far fewer AS events, and the average number of isoforms does not increase when less reliable transcripts are included. This is due to the limited number of EST sequences for rat, about one-tenth of those of the human and mouse genomes. We expect to observe more AS events as more data become available.
Classification by AS type
Even though AS has been extensively studied in recent years, a truly genome-wide analysis of AS types is not available except for the mouse work using the FANTOM2 clones (Zavolan et al. 2003
Table 5 is the summary of the results. Of all 21,266 alternatively spliced genes in human, 13,175 (62%) genes show an exon-skipping event. Genes with a variation of donor (acceptor) splice site are
Alternative promoter usage is an important part of transcriptional regulation. A recent study by Landry et al. (2003 18% of all human genes (~18,000 loci) show evidence of alternative promoter usage. Our analysis shows that 6473 genes have multiple transcription start sites (TSSs), which is almost twice as many. This increase seems to be due to the usage of all mRNA and EST sequences. However, the result should be examined critically for two reasons. First, the BLAT/SIM4 alignment for the first exon may not be correct, since detection of the small first exon is not a routine task, especially when the sequence quality is low. Second, the real transcript can turn out to be longer when more sequence data are available. The ECgene algorithm does not extend the transcript if it finds any exon-intron mismatches. For example, the first exon (node A in the graph) in the final gene models #7 and #8 in Figure 2C will disappear if sequence #1 is missing from the example cluster. Even if exon A is present in sequences #2-#4, the final gene model will end up excluding exon A, since they would retain exon D. Without sequence #1, these two gene models would start at exon C, which may not be the genuine first exon. This is an inherent problem in concatenating fragmented sequences to build the full-length model. To avoid this kind of pitfall, one should look for additional evidence of TSSs such as a CpG island, promoter site signatures, etc.
Alternative transcription termination and polyadenylation, producing mature transcripts with a 3' end of variable length, are another important regulatory factor that affects mRNA stability and post-transcriptional regulation. It is rather striking that
The numbers of each type of AS are slightly smaller in mouse and even smaller in rat than in human. This should be due to coverage of the EST database. Whereas we have 5.4 and 4 million ESTs for human and mouse, respectively, just 0.54 million ESTs are available for rat.
The ECgene algorithm has many distinctive characteristics. Here we describe useful features other than the obvious merit: facile gene modeling of alternative splicing events. Our genome-based clustering has advantages and disadvantages. The major weakness is that the algorithm can be applied only for organisms with a genome map. This may not be a major limitation, because at present there are completed genome sequences of most important organisms such as human, worm, fruit fly, and mouse, with many more genomes to be completed in the near future. Once the genome map is available, the genome-based approach has several significant advantages over the conventional EST clustering methods based on pairwise alignments.
First of all, the genomic alignment for each gene model is precisely defined in the genome-based approach. Therefore we can readily utilize ample information from the genome annotation, which includes promoter, transcription regulatory elements, intron sequences, sequence variations (e.g., single nucleotide polymorphisms), gene expression data from microarray and SAGE experiments, conserved regions across species, repeat sequences, and so on. The UCSC genome browser database is an excellent example of integrating genomic resources from the public sector (Karolchik et al. 2003
Another major advantage is that genome-wide identification of AS events is naturally incorporated in our EST clustering algorithm. Most genome-based methods for detecting splice variants depend on the UniGene EST clusters (Modrek et al. 2001
Third, the UTRs are substantially longer since the terminal exons are extended by overlapping ESTs with correct orientation. Our analysis on
Fourth, gene expression pattern can be inferred at the individual isoform level either by examining the cDNA library source of ESTs or by extracting SAGE tags from the transcript models. For example, Xu and Lee found many tissue-specific (Xu et al. 2002
Data sets
Genome-based EST clustering requires the genome map and transcript sequences. We used the July 2003 human reference sequence (UCSC version hg16) that is based on NCBI Build 34. The genome sequence was downloaded from the UCSC Genome Center (ftp://hgdownload.cse.ucsc.edu/goldenPath/
Mapping sequences against the genome map
Sequences of poor quality were filtered out based on several criteria. Minimum percent identities were 93% for ESTs, 96% for mRNA and RefSeq. Aligned parts should be over half of the sequence length (i.e., minimum alignment coverage = 50%). Many mRNA and EST sequences have long polyA tails, which affect the percent identity and alignment coverage values. Putative polyA tails were identified by the TRIMEST program in the EMBOSS package (http://www.emboss.org
BLAT alignments include many defects and are not quite ready for genome-based clustering. Erroneous alignments were corrected in several steps. BLAT alignment tends to create many small gaps due to the low sequence quality of ESTs (up to 3% of sequencing error). Therefore, we joined adjacent exons separated by very small introns that are shorter than 32 base pairs. Furthermore, if an alignment contained introns that did not satisfy the intron consensus signature (GT Recent versions of the BLAT program are designed to identify additional exons at both ends of the transcript. However, we found that many small initial and terminal exons from the EST sequences were not reliable, probably due to low sequence quality near the sequence ends. In an effort to avoid improper extension of transcripts, we removed any initial and terminal exons from the alignment if the exons were smaller than 20 base pairs and if the connecting introns were not canonical.
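The quality filter and the joining of exons across very small gaps can be sketched as below. The thresholds (93% and 96% identity, 50% coverage, 32 base pairs) come from the text; the function names, the half-open interval convention, and the use of "greater than or equal" at the thresholds are assumptions. PolyA trimming and the removal of unreliable terminal exons are not shown.

```python
MIN_IDENTITY = {"EST": 93.0, "mRNA": 96.0, "RefSeq": 96.0}  # percent
MIN_COVERAGE = 50.0                                         # percent
MIN_INTRON = 32                                             # base pairs

def passes_quality(seq_type, identity, coverage):
    """Keep an alignment only if identity and coverage reach the minimums."""
    return identity >= MIN_IDENTITY[seq_type] and coverage >= MIN_COVERAGE

def merge_small_gaps(exons, min_intron=MIN_INTRON):
    """Join adjacent exon blocks separated by gaps shorter than min_intron.

    exons is a list of (start, end) half-open genomic intervals sorted
    by start position.
    """
    if not exons:
        return []
    merged = [exons[0]]
    for start, end in exons[1:]:
        prev_start, prev_end = merged[-1]
        if start - prev_end < min_intron:
            merged[-1] = (prev_start, end)   # gap too small to be an intron
        else:
            merged.append((start, end))
    return merged

blocks = merge_small_gaps([(100, 200), (210, 300), (1000, 1100)])
```

In the example, the 10-bp gap between the first two blocks is treated as an alignment artifact and closed, whereas the 700-bp gap is kept as an intron.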
Primary EST clustering
Primary clustering is based on the assumption that sequences from the same gene should share at least one splice site. Such sequences were grouped together to generate the primary clusters. However, exact determination of the exon-intron boundary was often problematic, since the splice sites in EST sequences often differ by a few nucleotides from the true sites due to the low sequence quality of ESTs. Based on several numerical experiments, we assumed that neighboring splice sites are identical if they are within ±16 base pairs. Because of this allowance, our gene model cannot distinguish splice variants whose splice sites are less than 16 base pairs apart. Among the splice sites within the 32-base pair range, the site supported by most sequences was chosen to be the representative splice site of the group, and was used in subsequent steps of gene modeling. At this point, primary clusters were equivalent to the genome-based UniGene clusters except that unspliced sequences were missing and that clones with the same library ID were not joined. These steps were postponed until the transcript assembly procedure.
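A minimal sketch of this clustering step is given below, assuming all splice sites lie on one chromosome and are given as integer positions. The greedy choice of representative sites and the union-find grouping are illustrative assumptions, not the published implementation; donor and acceptor sites are not distinguished here, and all names are hypothetical.

```python
from collections import Counter

def representative_sites(site_counts, tolerance=16):
    """Map each observed splice site to a representative site.

    Greedy: the best-supported remaining site absorbs every site within
    +/- tolerance base pairs (ties go to the smaller coordinate).
    """
    rep = {}
    remaining = dict(site_counts)
    while remaining:
        best = max(remaining, key=lambda s: (remaining[s], -s))
        for site in [s for s in remaining if abs(s - best) <= tolerance]:
            rep[site] = best
            del remaining[site]
    return rep

def cluster_by_shared_sites(seq_sites, tolerance=16):
    """Group sequences that share at least one representative splice site."""
    counts = Counter(s for sites in seq_sites.values() for s in sites)
    rep = representative_sites(counts, tolerance)
    parent = {seq: seq for seq in seq_sites}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    first_owner = {}
    for seq, sites in seq_sites.items():
        for site in sites:
            owner = first_owner.setdefault(rep[site], seq)
            parent[find(seq)] = find(owner)

    clusters = {}
    for seq in seq_sites:
        clusters.setdefault(find(seq), []).append(seq)
    return sorted(sorted(members) for members in clusters.values())

# Hypothetical splice-site coordinates for four ESTs.
ests = {"est1": [1000, 2000], "est2": [1005, 3000], "est3": [3010], "est4": [9000]}
clusters = cluster_by_shared_sites(ests)
```

In this example est1 and est2 are joined through sites 1000 and 1005, and est3 joins them through 3000 and 3010, so three ESTs end up in one primary cluster although est1 and est3 share no site directly.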
Determination of polyA tail and transcript ends
We developed our own empirical rules for identifying polyA tails; the rules depend on the length of polyA sequence and the quality of alignment onto the genomic DNA. The shortest polyA is five consecutive A's that do not align onto the genome. As the trimmed sequence gets longer, we gradually allowed matches between the putative polyA sequence and the genomic sequence. For 3' EST sequences, polyT was assumed to be present instead of polyA tail. The presence of polyA/T sites plays an important role in our gene modeling procedure, as described in the previous section. As a conservative approach, we acknowledged the transcript ends inferred from polyA tails only when we found polyA tails in one mRNA sequence or two spliced EST sequences or four unspliced EST sequences. The polyA attachment site should be identical in EST sequences. It should be noted that this conservative criterion could miss potentially genuine polyA tails. Transcript models with multiple polyA sites were split into many transcripts with only one polyA tail as described earlier.
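The evidence rule for accepting a transcript end can be written compactly. The thresholds (one mRNA, two spliced ESTs, or four unspliced ESTs at the same attachment site) are from the text; the function name, the evidence labels, and the choice not to combine different evidence kinds are assumptions, since the text does not say how mixed evidence is treated.

```python
from collections import defaultdict

# Minimum number of sequences of one kind needed at a single polyA site.
THRESHOLDS = {"mRNA": 1, "spliced_EST": 2, "unspliced_EST": 4}

def supported_polya_sites(observations):
    """Return polyA attachment sites with sufficient evidence.

    observations is an iterable of (genomic_position, evidence_kind).
    Evidence kinds are counted separately and never pooled.
    """
    tally = defaultdict(lambda: defaultdict(int))
    for position, kind in observations:
        tally[position][kind] += 1
    return sorted(
        position
        for position, kinds in tally.items()
        if any(kinds[kind] >= needed for kind, needed in THRESHOLDS.items())
    )

# Hypothetical observations at three candidate sites.
observed = (
    [(500, "spliced_EST")] * 2
    + [(800, "unspliced_EST")] * 3
    + [(900, "mRNA")]
)
accepted = supported_polya_sites(observed)
```

Site 800 is rejected here because three unspliced ESTs fall short of the required four, which illustrates how the conservative rule can miss genuine polyA tails.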
Determination of gene direction
For each spliced sequence, we assigned the direction by counting the number of introns with GT The next step was deciding the direction of gene models. We took a stepwise approach with different weighting factors for mRNA and ESTs, namely 3 and 1, respectively. Initially, we collected mRNA and EST sequences with polyA/T tails. Sums of weighting factors in the sense and antisense strands were compared to decide the gene direction. If the gene direction was still ambiguous, we examined sequences without polyA/T tails. Up to this stage, we used ESTs with known read directions only. If the gene direction was still undecided, ESTs without known read directions were collected. Weighting factors for ESTs with and without polyA/T tails were 2 and 1, respectively. Once the gene direction was decided, we assigned the result to all sequence members and inconsistent polyA/T tails were discarded. The gene direction of clusters for single-exon genes was decided in a similar fashion. Sequence direction was decided from the read direction and polyA/T tails only, since no information from intron sequence was available.
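The stepwise weighted vote can be sketched as follows. The weights (3 for mRNA and 1 for EST, then 2 and 1 for ESTs of unknown read direction with and without polyA/T tails) are from the text. Evaluating each stage on its own subset and stopping at the first non-tied stage is an assumption, as the text does not say whether sums accumulate across stages; the record layout is hypothetical.

```python
def vote(records, weight_of):
    """Return '+' or '-' for the heavier strand, or None on a tie."""
    plus = sum(weight_of(r) for r in records if r["strand"] == "+")
    minus = sum(weight_of(r) for r in records if r["strand"] == "-")
    if plus == minus:
        return None
    return "+" if plus > minus else "-"

def gene_direction(records):
    """Decide gene direction in three stages of decreasing reliability.

    Each record is a dict with keys: kind ('mRNA' or 'EST'), strand
    ('+' or '-'), polya (bool), read_dir_known (bool).
    """
    known = [r for r in records if r["kind"] == "mRNA" or r["read_dir_known"]]
    unknown = [r for r in records if r["kind"] == "EST" and not r["read_dir_known"]]

    def by_kind(r):
        return 3 if r["kind"] == "mRNA" else 1

    def by_tail(r):
        return 2 if r["polya"] else 1

    stages = [
        ([r for r in known if r["polya"]], by_kind),      # with polyA/T tails
        ([r for r in known if not r["polya"]], by_kind),  # without tails
        (unknown, by_tail),                               # unknown read direction
    ]
    for subset, weight_of in stages:
        result = vote(subset, weight_of)
        if result is not None:
            return result
    return None  # direction remains undecided

def rec(kind, strand, polya, known=True):
    return {"kind": kind, "strand": strand, "polya": polya, "read_dir_known": known}

direction = gene_direction([
    rec("EST", "+", True), rec("EST", "+", True), rec("mRNA", "-", True),
])
```

In the example a single mRNA with a polyA tail (weight 3) outvotes two ESTs with tails (weight 1 each), so the gene is assigned to the minus strand at the first stage.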
Genes spanning multiple genomic loci
ORF and CDS determination
Our rule for ORF and the coding sequence (CDS) determination considered the number of exons, the ORF length, presence of the start codon (Met), and the CDS length. We classified ORFs (defined as the region between two adjacent stop codons) into four groups: (1) spliced ORFs with Met, (2) spliced ORFs without Met, (3) single-exon ORFs with Met, and (4) single-exon ORFs without Met. Initially, we searched the first group for the ORF with the longest CDS. We accepted the coding sequences that are longer than 30 amino acids (93 base pairs) or identical to one of the SWISS-PROT proteins excluding fragmented entries. If we could not find such an ORF in the first group, other groups were examined sequentially for the presence of ORFs using the same criteria. Genes lacking apparent ORFs were defined as non-coding RNA genes.
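A rough sketch of the four-group ranking is shown below, assuming the transcript is given as a plus-strand DNA string and exon-exon junctions as positions in transcript coordinates. The function names, the treatment of an ORF as "spliced" when it contains a junction, the strict "more than 30 codons" comparison, and the inclusion of ORFs that run to the sequence ends are all assumptions; the SWISS-PROT identity check is omitted.

```python
STOPS = {"TAA", "TAG", "TGA"}

def orfs_in_frame(seq, frame):
    """Yield (start, end) of stop-free codon runs in one reading frame."""
    start = frame
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            if i > start:
                yield start, i
            start = i + 3
    last = frame + ((len(seq) - frame) // 3) * 3   # end of last full codon
    if last > start:
        yield start, last

def classify_orfs(seq, junctions, min_aa=30):
    """Return candidate CDSs as (group, -length_in_codons, cds_start, end).

    Groups follow the text: 1 spliced with Met, 2 spliced without Met,
    3 single-exon with Met, 4 single-exon without Met. Sorting puts the
    preferred candidate (lowest group, longest CDS) first.
    """
    candidates = []
    for frame in range(3):
        for start, end in orfs_in_frame(seq, frame):
            met = next((i for i in range(start, end, 3) if seq[i:i + 3] == "ATG"), None)
            cds_start = met if met is not None else start
            cds_aa = (end - cds_start) // 3
            if cds_aa <= min_aa:
                continue
            spliced = any(start < j < end for j in junctions)
            if spliced:
                group = 1 if met is not None else 2
            else:
                group = 3 if met is not None else 4
            candidates.append((group, -cds_aa, cds_start, end))
    return sorted(candidates)

# Hypothetical transcript: Met, 40 alanine codons, stop; one junction at 60.
toy = "ATG" + "GCT" * 40 + "TAA"
best = classify_orfs(toy, junctions=[60])[0]
```

A transcript for which `classify_orfs` returns an empty list would correspond to a gene lacking an apparent ORF under these simplified criteria.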
We thank the UCSC Genome Center for making such a wonderful resource available to the public. We would also like to thank Prof. Jaesang Kim for helpful comments and editing of the manuscript. This work was supported by the Ministry of Science and Technology of Korea through the bioinformatics research program of MOST NRDP (M1-0217-00-0027) and the Korea Science and Engineering Foundation through the center for cell signaling research at Ewha Womans University.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.3030405.
3 Corresponding author. [Supplemental material is available online at www.genome.org. Gene models from genome-wide analyses for the human, mouse, and rat genomes are available at the ECgene Web site (http://genome.ewha.ac.kr/ECgene) or may be viewed through the UCSC genome browser.]
Adams, M.D., Kelley, J.M., Gocayne, J.D., Dubnick, M., Polymeropoulos, M.H., Xiao, H., Merril, C.R., Wu, A., Olde, B., Moreno, R.F., et al. 1991. Complementary DNA sequencing: Expressed sequence tags and human genome project. Science 252: 1651-1656.
Beaudoing, E. and Gautheret, D. 2001. Identification of alternate polyadenylation sites and analysis of their tissue distribution using EST data. Genome Res. 11: 1520-1526.
Black, D.L. 2003. Mechanisms of alternative pre-messenger RNA splicing. Annu. Rev. Biochem. 72: 291-336.
Burge, C. and Karlin, S. 1997. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268: 78-94.
Caceres, J.F. and Kornblihtt, A.R. 2002. Alternative splicing: Multiple control mechanisms and involvement in human disease. Trends Genet. 18: 186-193.
Christoffels, A., van Gelder, A., Greyling, G., Miller, R., Hide, T., and Hide, W. 2001. STACK: Sequence Tag Alignment and Consensus Knowledgebase. Nucleic Acids Res. 29: 234-238.
Eyras, E., Caccamo, M., Curwen, V., and Clamp, M. 2004. ESTGenes: Alternative splicing from ESTs in Ensembl. Genome Res. 14: 976-987.
Florea, L., Hartzell, G., Zhang, Z., Rubin, G.M., and Miller, W. 1998. A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res. 8: 967-974.
Gopalan, V., Tan, T.W., Lee, B.T., and Ranganathan, S. 2004. Xpro: Database of eukaryotic protein-encoding genes. Nucleic Acids Res. 32: D59-D63.
Graveley, B.R. 2001. Alternative splicing: Increasing diversity in the proteomic world. Trends Genet. 17: 100-107.
Heber, S., Alekseyev, M., Sze, S.H., Tang, H., and Pevzner, P.A. 2002. Splicing graphs and EST assembly problem. Bioinformatics (Suppl.) 18: S181-S188.
Kan, Z., Rouchka, E.C., Gish, W.R., and States, D.J. 2001. Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res. 11: 889-900.
Kan, Z., States, D., and Gish, W. 2002. Selecting for functional alternative splices in ESTs. Genome Res. 12: 1837-1845.
Karolchik, D., Baertsch, R., Diekhans, M., Furey, T.S., Hinrichs, A., Lu, Y.T., Roskin, K.M., Schwartz, M., Sugnet, C.W., Thomas, D.J., et al. 2003. The UCSC Genome Browser Database. Nucleic Acids Res. 31: 51-54.
Kawamoto, S., Yoshii, J., Mizuno, K., Ito, K., Miyamoto, Y., Ohnishi, T., Matoba, R., Hori, N., Matsumoto, Y., Okumura, T., et al. 2000. BodyMap: A collection of 3' ESTs for analysis of human gene expression information. Genome Res. 10: 1817-1827.
Kent, W.J. 2002. BLAT: The BLAST-like alignment tool. Genome Res. 12: 656-664.
Kim, N., Shin, S., and Lee, S. 2004. ASmodeler: Gene modeling of alternative splicing from genomic alignment of mRNA, EST and protein sequences. Nucleic Acids Res. 32: W181-W186.
Kim, P., Kim, N., Lee, Y., Kim, B., Shin, Y., and Lee, S. 2005. ECgene: Genome annotation for alternative splicing. Nucleic Acids Res. 33: D75-D79.
Krause, A., Haas, S.A., Coward, E., and Vingron, M. 2002. SYSTERS, GeneNest, SpliceNest: Exploring sequence space from genome to protein. Nucleic Acids Res. 30: 299-300.
Landry, J.R., Mager, D.L., and Wilhelm, B.T. 2003. Complex controls: The role of alternative promoters in mammalian genomes. Trends Genet. 19: 640-648.
Lash, A.E., Tolstoshev, C.M., Wagner, L., Schuler, G.D., Strausberg, R.L., Riggins, G.J., and Altschul, S.F. 2000. SAGEmap: A public gene expression resource. Genome Res. 10: 1051-1060.
Lavorgna, G., Dahary, D., Lehner, B., Sorek, R., Sanderson, C.M., and Casari, G. 2004. In search of antisense. Trends Biochem. Sci. 29: 88-94.
Lee, C., Atanelov, L., Modrek, B., and Xing, Y. 2003. ASAP: The Alternative Splicing Annotation Project. Nucleic Acids Res. 31: 101-105.
Levanon, E.Y. and Sorek, R. 2003. The importance of alternative splicing in the drug discovery process. Targets 2: 109-114.
Lewis, B.P., Shih, I.H., Jones-Rhoades, M.W., Bartel, D.P., and Burge, C.B. 2003. Prediction of mammalian microRNA targets. Cell 115: 787-798.
Maniatis, T. and Tasic, B. 2002. Alternative pre-mRNA splicing and proteome expansion in metazoans. Nature 418: 236-243.
Mironov, A.A., Fickett, J.W., and Gelfand, M.S. 1999. Frequent alternative splicing of human genes. Genome Res. 9: 1288-1293.
Modrek, B., Resch, A., Grasso, C., and Lee, C. 2001. Genome-wide detection of alternative splicing in expressed sequences of human genes. Nucleic Acids Res. 29: 2850-2859.
Pesole, G., Liuni, S., Grillo, G., Licciulli, F., Mignone, F., Gissi, C., and Saccone, C. 2002. UTRdb and UTRsite: Specialized databases of sequences and functional elements of 5' and 3' untranslated regions of eukaryotic mRNAs. Update 2002. Nucleic Acids Res. 30: 335-340.
Pospisil, H., Herrmann, A., Bortfeldt, R.H., and Reich, J.G. 2004. EASED: Extended Alternatively Spliced EST Database. Nucleic Acids Res. 32: D70-D74.
Quackenbush, J., Cho, J., Lee, D., Liang, F., Holt, I., Karamycheva, S., Parvizi, B., Pertea, G., Sultana, R., and White, J. 2001. The TIGR Gene Indices: Analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Res. 29: 159-164.
Salamov, A.A. and Solovyev, V.V. 2000. Ab initio gene finding in Drosophila genomic DNA. Genome Res. 10: 516-522.
Schuler, G.D., Boguski, M.S., Stewart, E.A., Stein, L.D., Gyapay, G., Rice, K., White, R.E., Rodriguez-Tome, P., Aggarwal, A., Bajorek, E., et al. 1996. A gene map of the human genome. Science 274: 540-546.
Sorek, R. and Safer, H.M. 2003. A novel algorithm for computational identification of contaminated EST libraries. Nucleic Acids Res. 31: 1067-1074.
Sugnet, C.W., Kent, W.J., Ares Jr., M., and Haussler, D. 2004. Transcriptome and genome conservation of alternative splicing events in humans and mice. Pac. Symp. Biocomput. 66-77.
Tabaska, J.E. and Zhang, M.Q. 1999. Detection of polyadenylation signals in human DNA sequences. Gene 231: 77-86.
Xing, Y., Resch, A., and Lee, C. 2004. The multiassembly problem: Reconstructing multiple transcript isoforms from EST fragment mixtures. Genome Res. 14: 426-441.
Xu, Q. and Lee, C. 2003. Discovery of novel splice forms and functional analysis of cancer-specific alternative splicing in human expressed sequences. Nucleic Acids Res. 31: 5635-5643.
Xu, Q., Modrek, B., and Lee, C. 2002. Genome-wide detection of tissue-specific alternative splicing in the human transcriptome. Nucleic Acids Res. 30: 3754-3766.
Zavolan, M., van Nimwegen, E., and Gaasterland, T. 2002. Splice variation in mouse full-length cDNAs identified by mapping to the mouse genome. Genome Res. 12: 1377-1385.
Zavolan, M., Kondo, S., Schonbach, C., Adachi, J., Hume, D.A., Hayashizaki, Y., and Gaasterland, T. 2003. Impact of alternative initiation, splicing, and termination on the diversity of the mRNA transcripts encoded by the mouse transcriptome. Genome Res. 13: 1290-1300.
http://genome.ewha.ac.kr/ECgene; ECgene Web site.
ftp://ftp.ncbi.nlm.nih.gov/genbank/; GenBank FTP site.
ftp://hgdownload.cse.ucsc.edu/goldenPath/; Genome Browser FTP site at the UCSC Genome Center.
ftp://ftp.ncbi.nlm.nih.gov/refseq/release/; RefSeq FTP site.
http://genome.ucsc.edu; UCSC Genome Bioinformatics Home.
http://www.emboss.org; EMBOSS: The European Molecular Biology Open Software Suite.
http://www.aceview.org; Identification and functional annotation of cDNA-supported genes in higher organisms using AceView.
http://www.ncbi.nlm.nih.gov/UniGene; UniGene.
http://www.soe.ucsc.edu/
Received July 20, 2004; accepted in revised form January 11, 2005.