Genome Research



Published online before print May 12, 2003, 10.1101/gr.703903
Genome Res. 13:1190-1202, 2003
©2003 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/03 $5.00

Methods

Human–Mouse Gene Identification by Comparative Evidence Integration and Evolutionary Analysis

Lingang Zhang1,3, Vladimir Pavlovic2,3,5,6, Charles R Cantor1,3,4 and Simon Kasif2,3,5

1 Center for Advanced Biotechnology; 2 Bioinformatics Program; 3 Department of Biomedical Engineering, Boston University, Boston, Massachusetts 02215, USA; 4 Sequenom Inc., San Diego, California 92121, USA

The identification of genes in the human genome remains a challenge: predictions disagree substantially and vary with the specific gene-finding methodology used. Because the pattern of conservation in coding regions is expected to differ from that in intronic or intergenic regions, a comparative computational analysis can, in principle, improve the computational identification of genes in the human genome by using a reference, such as the mouse genome. However, this comparative methodology critically depends on three factors: (1) the selection of the most appropriate reference genome (in particular, it is not clear whether the mouse is at the correct evolutionary distance from the human to provide sufficiently distinctive conservation levels in different genomic regions); (2) the selection of comparative features that provide the most benefit to gene recognition; and (3) the selection of an evidence integration architecture that effectively interprets the comparative features. We address the first question with a novel evolutionary analysis that allows us to explicitly correlate the performance of the gene recognition system with the evolutionary distance (time) between the two genomes. Our simulation results indicate that a wide range of reference genomes at different evolutionary time points appear to deliver reasonable comparative prediction of human genes. In particular, the evolutionary time between human and mouse generally falls in the region of good performance; however, better accuracy might be achieved with a reference genome more distant from human than mouse. To address the second question, we propose several natural comparative measures of conservation for identifying exons and exon boundaries. Finally, we experiment with Bayesian networks for the integration of comparative and compositional evidence.


[Software is available on request from the authors.]

Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.703903. Article published online before print in May 2003.

5 Corresponding authors.
E-MAIL kasif{at}bu.edu; FAX (617) 353-6766.
E-MAIL vladimir{at}cs.rutgers.edu

6 Present address: Dept. of Computer Science, Rutgers University, Piscataway, NJ 08854.

6 Although BLOSUM80 is expected to better characterize the divergence between human and mouse proteins, we experimented with both BLOSUM62 and BLOSUM80 and found that BLOSUM62 was slightly better than, although very similar to, BLOSUM80 at identifying protein-coding regions. Hence, we used BLOSUM62 in our experiments. All other BLAST parameters were left at their default values.

7 Kullback-Leibler or KL divergence (Cover and Thomas 1991). For two distributions, p and q, the KL divergence is defined as

$$\mathrm{KL}[p\,\|\,q] = \sum_{x} p(x)\,\log\frac{p(x)}{q(x)}$$

It can be shown that one type of annotation error depends on the KL divergence as error ~ exp(-KL). See V. Pavlovic, L. Zhang, and S. Kasif (in prep.) for more details.

8 Stated more precisely, results of maximum likelihood estimation are not the substitution matrices Q themselves, but rather the estimates of products Q' = Q · t', in which t' is the distance between human and mouse. Hence, substituting Q' and t = 1 in, for instance, (2) yields the exponent Q' · t = Q' · 1 = Q · t'· 1 = Q · t', and, thus, the probability of substitutions at the evolutionary distance between human and mouse.
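The rescaling in footnote 8 can be illustrated numerically. The sketch below assumes that the referenced expression has the usual continuous-time Markov form P(t) = exp(Q · t); this page does not show that equation, so the form is an assumption. It also uses a Jukes-Cantor rate matrix purely as a stand-in for the estimated matrices. The values `ALPHA` and `T_PRIME` and all function names are hypothetical and are not taken from the article.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(m, terms=40):
    """Matrix exponential by truncated Taylor series (fine for small norms)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]  # holds m^k / k!, starting at k = 0
    for k in range(1, terms):
        term = [[x / k for x in row] for row in mat_mul(term, m)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result

def substitution_probs(q_prime, t=1.0):
    """P(t) = exp(Q' * t), with t in units of the human-mouse distance."""
    return mat_exp([[x * t for x in row] for row in q_prime])

def jc_same(alpha, t):
    """Closed-form Jukes-Cantor probability that a site is unchanged."""
    return 0.25 + 0.75 * math.exp(-4.0 * alpha * t)

# Hypothetical values: an unknown rate ALPHA and human-mouse distance T_PRIME.
ALPHA, T_PRIME = 0.1, 0.5
Q = [[-3 * ALPHA if i == j else ALPHA for j in range(4)] for i in range(4)]
# Estimation only recovers the product Q' = Q * t', not Q and t' separately.
Q_PRIME = [[x * T_PRIME for x in row] for row in Q]

P_HM = substitution_probs(Q_PRIME, t=1.0)  # human-mouse distance
P_2X = substitution_probs(Q_PRIME, t=2.0)  # twice that distance

print(round(P_HM[0][0], 6), round(jc_same(ALPHA, T_PRIME), 6))
```

Under these assumptions, setting t = 1 with Q' reproduces the substitution probabilities at the human-mouse distance, and other values of t express a reference genome's distance as a multiple of it.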

9 Error = exp(-KL[Pc||Pn]). Precision = 1 - error.
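The relationships in footnotes 7 and 9 can be sketched in a few lines. This is a minimal illustration, assuming discrete distributions over the same bins, the natural logarithm, and that Pc and Pn denote feature distributions in coding and noncoding regions; that reading is inferred from the notation and is not confirmed on this page. The example distributions and function names are hypothetical.

```python
import math

def kl_divergence(p, q):
    """KL[p||q] = sum_x p(x) * log(p(x) / q(x)) for discrete distributions."""
    if len(p) != len(q):
        raise ValueError("p and q must be defined over the same bins")
    total = 0.0
    for p_x, q_x in zip(p, q):
        if p_x == 0.0:
            continue            # 0 * log(0 / q) is taken as 0
        if q_x == 0.0:
            return math.inf     # p puts mass where q has none
        total += p_x * math.log(p_x / q_x)
    return total

def annotation_error(p_c, p_n):
    """Error = exp(-KL[Pc||Pn]), as stated in footnote 9."""
    return math.exp(-kl_divergence(p_c, p_n))

def precision(p_c, p_n):
    """Precision = 1 - error, as stated in footnote 9."""
    return 1.0 - annotation_error(p_c, p_n)

# Hypothetical distributions of a conservation score over three bins
# (low, medium, high) in coding and noncoding regions.
P_CODING = [0.1, 0.3, 0.6]
P_NONCODING = [0.6, 0.3, 0.1]

print(kl_divergence(P_CODING, P_NONCODING))
print(annotation_error(P_CODING, P_NONCODING))
print(precision(P_CODING, P_NONCODING))
```

When the two distributions are identical the divergence is 0, so the error is 1 and the precision is 0; the more the distributions differ, the closer the precision gets to 1.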








Copyright © 2003 by Cold Spring Harbor Laboratory Press.