|
|
|
|
Published online before print
July 17, 2003, 10.1101/gr.1261703 Genome Res. 13:1930-1937, 2003 ©2003 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/03 $5.00
Methods Gene Prediction by Spectral Rotation Measure: A New Method for Identifying Protein-Coding RegionsDepartment of Computer Science, Tel-Hai Academic College, Upper Galilee 12210, Israel
A new measure for gene prediction in eukaryotes is presented. The measure is based on the Discrete Fourier Transform (DFT) phase at a frequency of 1/3, computed for the four binary sequences for A, T, C, and G. Analysis of all the experimental genes of S. cerevisiae revealed distribution of the phase in a bell-like curve around a central value, in all four nucleotides, whereas the distribution of the phase in the noncoding regions was found to be close to uniform. Similar findings were obtained for other organisms. Several measures based on the phase property are proposed. The measures are computed by clockwise rotation of the vectors, obtained by DFT for each analysis frame, by an angle equal to the corresponding central value. In protein coding regions, this rotation is assumed to closely align all vectors in the complex plane, thereby amplifying the magnitude of the vector sum. In noncoding regions, this operation does not significantly change this magnitude. Computing the measures with one chromosome and applying them on sequences of others reveals improved performance compared with other algorithms that use the 1/3 frequency feature, especially in short exons. The phase property is also used to find the reading frame of the sequence.
Gene prediction analysis, and specifically, the computational methods for finding the location of protein-coding regions in uncharacterized genomic DNA sequences, is one of the central issues in bioinformatics (Fickett 1996 Despite the extensive research in the area of gene prediction, current predictors do not provide a complete solution to the problem of gene identification. Short exons are difficult to locate, because discriminative statistical characteristics are less likely to appear in short strands. Furthermore, some genes do not possess the characteristic features that identify most genes, and hence it is not possible to track them using gene predictors that rely on these features. In this paper a new discriminating feature for gene prediction is proposed. This measure is based on the arguments of the Discrete Fourier Transform (DFT), and is shown to be a potential candidate for locating short genes and exons. The paper is organized as follows: In the Methods section, the first part deals with the frequency analysis of DNA sequence; the second part details the Fourier analysis at a frequency of 1/3, and discusses the relationship between spectral arguments and the position frequencies. The first part of the Results section introduces the distribution of the arguments of the Fourier spectra; the last two parts describe the applications for gene prediction.
Periodicities in DNA Sequences and DFT Analysis The importance of measuring different periodicities for a given DNA in order to determine the locations of protein-coding regions has already been addressed by Fickett (1992), and these periodicities have been used as discriminant features in several studies of gene prediction (Silverman and Linsker 1986
The DFT of a given numeric sequence x(n) of length N is defined by
/N)k.
Because the DNA sequence is a character string, numerical values must be assigned to each character: A, T, C, and G. One possible way of performing this conversion is to assign a binary sequence to each of the four bases (Voss 1992
A distinctive feature of protein-coding regions in DNA is the existence of short-range correlations in the nucleotide arrangement, especially a 1/3-periodicity (Fickett 1982
, and f (b,i) is the frequency of b in the codon position i, i = 1,2,3.
Anastassiou (2000
In the following sections we show how signal processing measures for gene prediction can be improved by considering a new feature of protein-coding DNA regions, which can be measured by DFT, namely, the arguments of the Fourier spectra at N/3. We show that the arguments of the Fourier spectra in coding regions are narrowly distributed around corresponding central values.
Relationship Between Spectral Arguments and Position Frequencies
/3), and ei(2 /3), and equation 3.1 takes the form:
/3) + ei(2 /3) = 0 equation 3.2 can be expressed as follows
. If all f(b,i), i = 1,2,3, are equal, then Ub(N/3) = 0. Figure 1 illustrates the case where fmin = f(b,1). A simple trigonometric computation yields:
is the argument of the vector corresponding to f1(0, 2 /3, or -2 /3).
It can be seen that arg[Ub(N/3)] will shift by -2 /3 or 2 /3 for reading frames 2 and 3, respectively.
Since it was assumed above that the ratios between each pair of the counters
The Distribution of the Arguments of the Fourier Spectra Let s be a DNA strand, and for each base b, denote b(s) = Ub(N/3). We calculated the values of arg(A[s]), arg(T[s]), arg(C[s]), and arg(G[s]) in coding and noncoding regions, for different organisms. The histograms describing these distributions for all experimental genes in the 16 chromosomes of S. cerevisiae (multiple-exon genes were concatenated to single strands; GenBank acc. nos. NC001133NC001148, at http://www.ncbi.nlm.nih.gov) are shown in Figure 2. (Because the arguments are originally in principal values [between - and ], a 2 shift was applied to part of the data so that the histograms are plotted around the angular mean.) As the figure reveals, in all four nucleotides the distributions of the arguments taper around a central value, with the distributions of arg(G[s]) and arg(T[s]) being much narrower than the other two. Similar results, in both shape and statistics, were obtained for each of the 16 chromosomes of S. cerevisiae.
The corresponding histograms for noncoding regions (intergenic spacers and introns in all genes [experimental and not experimental]) in the 16 chromosomes appear in Figure 3. The distributions for noncoding regions seem to be close to uniform, and very different from the distributions that were obtained for coding regions. A similar pattern was observed for each separate chromosome.
To make sure that the former results are not unique for S. cerevisiae, the same analysis was performed on other organisms. The resulting histograms (Fig. 4A,B,C) show the argument distributions for S. cerevisiae, S. pombe, (chromosomes 2 and 3; acc. nos. NC003423 and NC003421, respectively), and Guillardia theta (chromosome I; acc. no. AF165818 [GenBank] ), respectively. It is readily evident that the three histograms greatly resemble each other, although the exact statistical values differ somewhat. In particular, the central value of arg(G), in all three organisms, is located somewhere in the vicinity of 0. This means that the base G appears in the first codon position much more often than in the second and third positions. This is consistent with the findings of Trifonov (1987
In the following section we show how the difference between coding and noncoding regions in terms of argument distribution can be applied to gene prediction.
Rotational Measures for Gene Prediction
Assume we have an organism for which arg(A[s]), arg(T[s]), arg(C[s]), and arg(G[s]) are distributed in a similar manner to that observed in the organisms described above (i.e., bell-shaped in coding regions, and close to uniform in noncoding regions). Let µA, µT, µC, and µG, be the approximated average values, in coding regions, of arg(A[s]), arg(T[s]), arg(C[s]), and arg(G[s]), respectively. Since for a typical coding sequence s it is expected that arg(A[s])
Dividing each term in equation 5.1 by the corresponding angular deviation ( A, T, C, and G of A[s], T[s], C[s], and G[s], and respectively) will give more weight to narrower distributions, yielding the measure
Table 1 compares the performance of four measures: two introduced here, namely the SR measure and the G Rotation measure (described below), and two known measures based on Fourier spectrum: the Spectral Content measure (Tiwari et al. 1997
Figure 6 compares the probability density functions of the Spectral Content measure and the SR measure. The values of the Spectral Content measure were scaled so that intersection points of the two curves of each color are vertically aligned. The better performance of the SR measure is illustrated by the fact that the distance between the bold curves is greater than the distance between the fine curves. Detection of short exons may be rendered more effective by using one statistical parameter that is narrowly distributed when calculated over short strands. As shown in Figure 7, the distribution of arg(G[s]) in the genes of S. cerevisiae is narrow when calculated over coding strands of length 120 bp.
The following measure uses only arg(G[s]). Since only the vector G(s) is rotated in this measure, we need a fixed reference vector to maximize the vector sum. Suppose R is a real number. If s belongs to a coding region, and is in reading frame 1, the vector G(s), rotated clockwise by the approximated average argument µG will most likely be directed towards the positive real axis, that is, in the same direction as the vector R. On the other hand, if s does not belong to a coding region, the rotated vector may point in any direction. The vector sum of R and the rotated G(s) will discriminate best between coding and noncoding regions, when R is of the same order of magnitude as |G(s)|. Thus, we choose |G(s)| as the reference vector. In order to identify genes in all three reading frames, we define the G Rotation measure as
G is chosen from the set {µG, µG + (2 /3), µG - (2 /3)}, so that the value of the measure in equation 5.3 is maximal. Table 1 compares the performance of the G Rotation measure with that of other measures, on the experimental genes and exons of S. cerevisiae.
When using a measure calculated from data of one organism to predict genes in another organism, it may be preferable to use a subset of the vectors A(s), T(s), C(s), and G(s). For example, the vectors T(s) and G(s), which have narrowly distributed arguments, can be aligned to yield the TG-Rotation measure |VTG|2 = |(e-iµT/ Figure 8A shows the curve of the TG-Rotation measure, constructed from data in chromosome 16 of S. cerevisiae, on a typical split gene of S. pombe (gene SPBC582.08 in chromosome 2). For comparison, Figure 8B shows the graph of the Codon Usage measure on the same gene. The horizontal lines represent the actual location of the three exons. Note the short intron between the second and third exons. In general, one should be wary about using data from one organism to predict genes in another organism, because the respective central argument values in different organisms may not be similar. For example, note that in Figure 4A,B,C the central values of arg(A) and arg(C) are very different in the three organisms.
Table 2 summarizes the data on gene SPBC582.08.
Complementary Sequences and Reading Frame Identification Genes on the complementary strand can be detected using the following transformation from Anastassiou (2000
If V = aA(s) + tT(s) + cC(s) + gG(s), then the predictor for the complementary strand is
An nonannotated strand is examined using both measures |V|2 and |
As mentioned in the Methods section, arg[Ub(N/3)] will shift by -2 Figure 9A depicts the graph of the SR measure on the gene SPBC1685.08 in chromosome 2 of S. pombe (acc. no. NC003423). The measure's parameters were calculated from the genes in the smaller chromosome 3 (acc. no. NC003421). Figure 9B depicts the graph of arg(V). The graphs were obtained by calculating the measure with a sliding window of 351 bp, using a step of 3 bp. The gene has three exons, in reading frames 1, 3, and 2 respectively. Table 3 summarizes the data on the gene. Note the short intron between the second and third exons.
Figure 9 illustrates how the curve of arg(V) can be used to identify the exact boundaries of an exon. It is expected that along an exon the value of arg(V) will remain in the vicinity of one of the values 0, -2 /3, or 2 /3, while outside the exon, the value of arg(V) will change to some "random" value.
Figure 10A depicts the graphs of the SR measure on the gene SPBC1709.08 in chromosome 2 of S. pombe (acc. no. NC003423). The measure's parameters were calculated as in the previous example. Figure 10B depicts the graph of arg(V). In this example, the gene has one exon, between nucleotides 1033037 and 1037362, in reading frame 2. The fact that the curve of arg(V) is constant at around the value of -2
In this paper a new method for gene prediction is proposed, based on several measures of protein coding regions. The measures are derived from a regularity of the spectral phase within coding regions. In this study we found that the phase of the DFT at a frequency of 1/3 is distributed with a bell-shaped curve around a central value in coding regions, whereas in noncoding regions, the distribution was close to uniform. This behavior was shown to exist in all chromosomes of S. cerevisiae, and also in two other organisms, S. pombe and Guillardia theta. This regularity was used for the construction of measures for discriminating between coding and noncoding regions in a given nonannotated DNA sequence. The measures are constructed by clockwise rotation of the vectors, which are the values obtained by DFT analysis for the four binary sequences of each nucleotide, with the corresponding central values. After such rotation, the four vectors in coding regions tend to be aligned close to each other, whereas the arrangement of vectors in noncoding regions is random. Earlier studies described proposed measures for gene prediction based on Fourier transform at a frequency of 1/3 or at other frequencies (Trifonov and Sussman 1980
The attempt to use parameters derived from one organism to recognize genes in another organism is based on an implicit assumption of the universality of genes, at least with regard to the structure that elicits the above spectral features. However, as this study (and also previous ones) show, the peak at a frequency of 1/3 is attributed to position asymmetry of the nucleotide within the three possible locations in the codon. This asymmetry was shown to be the result of codon usage (Tsonis et al. 1991
Considering the argument distributions obtained in this study, it was predicted that wherever an analysis frame slides within a protein-coding region, the value of arg(V) (the vector sum of the rotated spectra) will be close to one of three possible values (0, -2 Last, a comment about the length of the analysis frame. A short analysis frame (less than 180 bp) may detect short exons and short introns, whereas frames of over 300 bp may miss them. However, there is a tradeoff, because the use of shorter analysis frames causes more statistical flunctuations, resulting in more false negatives and false positives. Hence, it is important to have a measure that still performs reasonably with short frames. In summary, we suggest that considering the arguments of the Fourier spectra at k=N/3 yields more information about a DNA sequence than the corresponding magnitudes alone. However, it should be noted that these two values (namely, the magnitude and the argument) are not independent. A large magnitude of a Fourier spectrum at k=N/3 is a result of a sharp position asymmetry in the corresponding base. If a sharp position asymmetry is characteristic of the coding regions of an organism, then the value of arg[Ub(N/3)] will be more stable; that is, its distribution over the genes of the organism will have low variance. However, as shown in this work, incorporating data about the distribution of the arguments of the Fourier spectra at k=N/3, along with their magnitudes, into a measure, yields a measure that is more sensitive to exon-intron transition than a measure that uses the magnitudes alone.
We thank Mr. Efim Yakir for preparing part of the figures and for technical support. We also thank Dr. Dorit Shweiki for reading the manuscript and for valuable discussions. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1261703.
1 Corresponding author. E-MAIL yizhar_l{at}kyiftah.org.il; FAX972-4-6952899. Article published online before print in July 2003.
Almagor, H. 1985. Nucleotide distribution and the recognition of coding regions in DNA sequences: An information theory approach. J. Theor. Biol. 117: 127-136.[Medline]
Anastassiou, D. 2000. Frequency-domain analysis of biomolecular sequences. Bioinformatics 16: 1073-1082. Baldi, P. and Brunak S. 2001. Bioinformatics: The machine learning approach 2nd ed., chapter 7. MIT Press, Cambridge, MA. Borodovsky, M.Y., Sprizhitsky, Y.A., Golovanov, E.I., and Alexandrov, A.A. 1986. Statistical patterns in the primary structure of the functional regions of the Escherichia coli genome. II. Nonuniform Markov models. Mol. Biol. 20: 833-840. Borodovsky, M.Y., Koonin, E.V., and Rudd, K.E. 1994. New genes in old sequence: A strategy for finding genes in the bacterial genome. Trends Biochem. Sci. 19: 309-313.[CrossRef][Medline] Chechetkin, V.R. and Turygin, A.Y. 1995. Size-dependence of three-periodicity and long-range correlations in DNA sequences. Phys. Lett. A. 199: 75-80.[CrossRef]
Claverie, J.-M. 1997. Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet. 6: 1735-1744.
Claverie, J.M. and Bougueleret, L. 1986. Heuristic informational analysis of sequences. Nucleic. Acids Res. 14: 179-196. Dong, S. and Searls, D.B. 1994. Gene structure prediction by linguistic methods. Genomics 23: 540-551.[CrossRef][Medline] Farber, R.B., Lapedes, A.S., and Sirotkin, K.M. 1992. Determination of eukaryotic protein coding regions using neural networks and information theory. J. Mol. Biol. 226: 471-479.[CrossRef][Medline]
Fickett, J.W. 1982. Recognition of protein coding regions in DNA sequences. Nucleic Acids Res. 10: 5303-5318. Fickett, J.W. 1996. The gene identification problem: An overview for developers. Comput. Chem. 20: 103-118.
Fickett, J.W. and Tung, C.S. 1992. Assessment of protein coding measures. Nucleic Acids Res. 20: 6441-6450.
Herzel, H., Weiss, O., and Trifonov, E.N. 1999. 1011 bp periodicities in complete genomes reflect protein structure and protein folding. Bioinformatics 15: 187-193.
Krogh, A., Mian, I.S., and Haussler, D. 1994. A hidden Markov model that finds genes in E. coli DNA. Nucleic Acids Res. 22: 4768-4778. Lapedes, A.S., Barnes, C., Burks, C., Farber, R.M., and Sirotkin, K.M. 1990. Application of neural networks and other machine learning algorithms to DNA sequence analysis. In Computers and DNA (eds. G. Bell. and T. Marr), pp. 157-182. Addison-Wesley, Redwood City, CA. Mantegna, R.N., Buldyrev, S.V., Goldberger, A.L., Havlin, S., Peng, C.-K., Simmons, M., and Stanley, H.E. 1994. Linguistic features of noncoding DNA sequences. Phys. Rev. Lett. 73: 3169-3172.[CrossRef][Medline]
Mathé, C., Sagot, M-F., Schiex, T., and Rouzé, P. 2002. Current methods of gene prediction, their strength and weaknesses. Nucleic Acids Res. 30: 4103-4117. Oppenheim, A.V. and Schafer, R.W. 1999. Discrete-time signal processing, chapter 8 Prentice Hall, Upper Saddle River, NJ. Salzberg, S.L., Searls, D.B., and Kasif, S., eds. 1998. Computational methods in molecular biology, chapter 1. Elsevier, Amsterdam. Searls, D.B. 1992. The Linguistics of DNA. Amer. Scientist 80: 579-591. Shulman, M.J., Steiberg, C.M., and Westmoreland, B. 1981. The coding function of nucleotide sequences can be discerned by statistical analysis J. Theor. Biol. 88: 409-420.[Medline] Silverman, B.D. and Linsker, R. 1986. A measure of DNA periodicity. J. Theor. Biol. 118: 295-300.[CrossRef][Medline] Snyder, E.E. and Stormo, G.D. 1995. Identification of protein coding regions in genomic DNA. J. Mol. Biol. 258: 1-18.
Staden, R. and McLachlan, A.D. 1982. Codon preference and its use in identifying protein coding regions in long DNA sequences. Nucleic Acids Res. 10: 141-156. Tiwari, S., Ramachandran, S., Bhattacharya., A., Bhattacharya, S., and Ramaswamy, R. 1997. Prediction of probable genes by Fourier analysis of genomic sequences. Comput. Appl. Biosci. 113: 263-270. Trifonov, E.N. 1987. Translation framing code and framing-monitoring mechanism as suggested by the analysis of mRNA and 16S rRNA nucleotide sequences. J. Mol. Biol. 194: 643-652.[CrossRef][Medline] Trifonov, E.N. 1998. 3-, 10.5-, 200- and 400-base periodicities in genome sequences. Phys. A. 249: 511-516.
Trifonov, E.N. and Sussman, J.L. 1980. The pitch of chromatin DNA is reflected in its nucleotide sequence. Proc. Natl. Acad. Sci. 77: 3816-3820. Tsonis, A.A., Elsner, J.B., and Tsonis, P.A., 1991: Periodicity in DNA coding sequences: Implications in gene evolution. J. Theor. Biol. 151: 323-331.[Medline]
Uberbacher, E.C. and Mural, R.J. 1991. Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proc. Natl. Acad. Sci. 88: 11261-11265. Voss, R. 1992. Evolution of long-range fractal correlations and 1/f noise in DNA base sequences. Phys. Rev. Lett. 68: 3805-3808.[CrossRef][Medline]
Xu, Y., Mural, R.J., and Uberbacher, E.C. 1994. Constructing gene models from accurately predicted exons: An application of dynamic programming. Comput. Appl. Biosci. 10: 613-623.
http://www.ncbi.nlm.nih.gov/GenBank; National Center for Biotechnology Information.
Received February 12, 2003;
accepted in revised format May 21, 2003.
This article has been cited by other articles:
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||