Genome Research

Vol. 10, Issue 4, 398-400, April 2000

INSIGHT/OUTLOOK
Powers and Pitfalls in Sequence Analysis: The 70% Hurdle

Peer Bork1

European Molecular Biology Laboratory (EMBL), 69012 Heidelberg, Germany; and Max-Delbrück-Centrum, D-13122 Berlin-Buch, Germany

High-throughput technologies impress us almost every week with novel global results and big numbers. They often reveal important general trends that are impossible to realize with classical, low-throughput experimental methods, yet (so far) they provide fewer insights into specific, molecular detail. Because of the amount of data involved, high-throughput technologies imply the use of bioinformatics methods that deal with information transformation, storage, and analysis. By necessity, most of these processes are automated.

Partly because of the nature of current publication schemes, the accuracy and error margins of a given method are often only found in small print. It is obvious that each method has its limits and also that during data processing, some information will be lost or diluted. Because of the current need to integrate and add value to data, results from high-throughput experiments (if made publicly accessible) are often taken further by third-party research that relies on the quality of these data. Thus, I believe that public awareness of error margins for high-throughput experimental and computational methods should be increased; the incredibly valuable data accumulating in various heterogeneous databases permit powerful analyses but should not be overinterpreted. In the following discussion, I will concentrate on limits in computational sequence analysis, which is far from being perfect (Table 1), despite the fact that sequencing itself is highly automated and accurate, and despite the fact that sequence information is described in simple linear terms (using a four-letter alphabet). On average, a 70% accuracy just to predict functional and structural features has to be considered a success (Table 1).

                              
Table 1.   Selected Examples of Prediction Accuracy in Different Areas of Sequence Analysis

Limitations in the Total Knowledge Base of Protein Function

As these analysis methods are knowledge based, one of the reasons for the inaccuracy is that the quality of data in public sequence databases is still insufficient (e.g., Bork and Bairoch 1996; Bhatia et al. 1997; Pennisi 1999). This is particularly true for data on protein function. Protein function is loosely defined; cellular function is more than the very complicated network of individual molecular interactions on which it is based (Bork et al. 1998). Furthermore, the semantics for functional features are not always established. For instance, the notion of a "protein complex" depends heavily not only on detection and purification methods (which, in turn, are constantly evolving) but also on environmental conditions. Protein function is context dependent, and both molecular and cellular aspects have to be considered (for review, see Bork et al. 1998).

Lactate dehydrogenase illustrates some of this complexity: this gene product can act both as a dehydrogenase and as an eye lens structural protein, depending on its context (for review, see Piatigorsky and Wistow 1991). Even without the complication of a second, unrelated role for the same gene product, do we know enough about the function of lactate dehydrogenase, one of the best-studied proteins? We know its biochemical pathway (at least in human and some model organisms), its different isoenzymes (in organisms) with different context-dependent properties, its regulation, and the organization of its quaternary structure. However, we are probably still missing much information, even on crucial molecular features: Are we sure about alternative splice variants? Can we exclude age-dependent post-translational modifications in some tissues? Our knowledge is even more limited regarding higher order functions that involve concentration, compartmental organization, dynamics, regulation, and perhaps even the impact of external environment. Often, the available data give at best some reliable qualitative results on functional features but fall far short of a complete understanding of functionality. Yet our ability to annotate genome sequences and translate the information therein relies heavily on the summaries of features attached to each sequence in the respective public databases.

Limitations of Gene Expression Data Extrapolations

As more high-throughput technologies follow, the data will become more complicated than sequences. Novel complementary data types such as gene expression arrays will generate more functional information, but conclusions from these data are often stretched with regard to protein products. The expression of genes and their corresponding proteins seems to correlate weakly, with a correlation coefficient of 0.48 (Anderson and Seilhammer 1997). Furthermore, recent studies (Hanke et al. 1999; Mironov et al. 1999) show that alternative splicing might affect >30% of human genes, although measurements at the protein level have yet to confirm this. Finally, the number of known post-translational modifications of gene products is increasing constantly, so that the complexity at the protein level is enormous. Each of these modifications may change the function of the respective gene products drastically. (The entire aspect of context-dependent gene regulation is excluded from current discussions as we are only beginning to understand the complex underlying genetic machinery. For example, promoter prediction in eukaryotes has a success rate of only ~35% (Table 1), and there are many other regulatory elements that we cannot predict at all.)
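
To put the correlation coefficient quoted above in perspective, recall that the square of a Pearson correlation gives the fraction of variance in one variable that a linear fit on the other accounts for. The following minimal sketch assumes that the 0.48 is a Pearson coefficient (the text does not say which kind); the function and variable names are illustrative only.

```python
def variance_explained(r: float) -> float:
    """Fraction of variance accounted for by a linear fit, given Pearson r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation coefficient must lie in [-1, 1]")
    return r * r


# Value quoted in the text (Anderson and Seilhammer 1997); treating it as a
# Pearson coefficient is an assumption of this sketch.
r_mrna_protein = 0.48

explained = variance_explained(r_mrna_protein)
print(f"variance explained:   {explained:.2f}")
print(f"variance unexplained: {1 - explained:.2f}")
```

Under that assumption, less than a quarter of the variation in protein abundance would be accounted for by mRNA abundance, which is why extrapolating from expression data to protein products calls for caution.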

Limitations Created by Third-Party Analyses

Public releases of completely sequenced genomes exceed a rate of one per month, with thousands of function predictions therein. Gene annotation via sequence database searches is already a routine job, but even here the error rate is considerable (Table 1). The lower limit of errors in current functional annotation of large-scale sequencing projects is 8% (Brenner 1999). As errors accumulate and propagate (Bork and Bairoch 1996; Bhatia et al. 1997; Smith and Zhang 1997; Bork and Koonin 1998; Pennisi 1999), it becomes more difficult to infer correct function from the many possibilities revealed by a database search. Compounding these complications is the fact that computer programs often cannot even retrieve the source of the stored information (Doerks et al. 1998).
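
How quickly propagated errors can add up may be illustrated with a deliberately simple model. The sketch below assumes that every transfer of an annotation from one database entry to the next introduces an error independently with the same probability, and that errors are never corrected along the way. Neither assumption comes from the text; only the 8% figure does, and the function name is hypothetical.

```python
def fraction_correct(per_step_error: float, steps: int) -> float:
    """Expected fraction of annotations still correct after `steps` transfers.

    Assumes each transfer fails independently with probability
    `per_step_error` and that a wrong annotation is never repaired.
    """
    if not 0.0 <= per_step_error <= 1.0:
        raise ValueError("per_step_error must lie in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return (1.0 - per_step_error) ** steps


# Lower limit of annotation errors quoted in the text (Brenner 1999),
# used here as an illustrative per-transfer error rate.
PER_STEP_ERROR = 0.08

for steps in range(1, 6):
    correct = fraction_correct(PER_STEP_ERROR, steps)
    print(f"{steps} transfer(s): {correct:.1%} still correct")
```

Under these assumptions, roughly a third of the annotations would be wrong after five successive transfers, even though each individual step looks fairly reliable.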

Use of Complementary Information to Limit Errors in Function Prediction

Some new information can be retrieved from completely sequenced genomes; for example, function can be predicted by exploitation of genomic context. Based on the observation that interacting proteins in one organism sometimes have homologs in other organisms fused together in a single gene, Marcotte et al. (1999a) predicted novel interactions for 50% of yeast proteins using gene fusion information. However, they noted an overlap with classical methods and an error rate of 82%. To see a signal they had to correct for domains present in many proteins (Marcotte et al. 1999a). By considering only orthologs with fission and fusion events (Enright et al. 1999; Snel et al. 2000), the signal-to-noise ratio increases and the number of predictions drops dramatically (7% of Escherichia coli proteins; Enright et al. 1999). With a particular question in mind (e.g., "Does protein X have interaction partners?"), the generation of hypotheses is extremely useful; yet to provide a general overview of protein function, it is advisable to keep the errors small. Further information can be added later, which is easier than retracting stored information. But how do we incorporate the information on error margins? Such estimates (and sometimes even the sources of the annotation) are not visible in current databases that store the results of computational approaches.
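
One conceivable way to make error margins and sources visible is to store them alongside each prediction. The record layout below is purely a sketch: the class, its field names, and the example identifiers are hypothetical and do not describe any existing database schema. The two error rates are taken from the figures quoted in this article and are used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Annotation:
    """One predicted feature of a sequence, stored with its provenance."""

    sequence_id: str
    feature: str
    method: str                        # e.g. "gene fusion", "database search"
    source: Optional[str]              # where the annotation came from, if known
    estimated_error: Optional[float]   # expected error rate of the method, 0..1

    def __post_init__(self) -> None:
        err = self.estimated_error
        if err is not None and not 0.0 <= err <= 1.0:
            raise ValueError("estimated_error must lie in [0, 1]")

    def is_reliable(self, max_error: float) -> bool:
        """True only if an error estimate exists and does not exceed the limit."""
        return self.estimated_error is not None and self.estimated_error <= max_error


# Hypothetical records; identifiers are made up for illustration.
hypothesis = Annotation("protein_X", "interacts with protein_Y",
                        method="gene fusion", source=None, estimated_error=0.82)
routine = Annotation("protein_Z", "dehydrogenase",
                     method="database search", source="homolog entry",
                     estimated_error=0.08)

# A general overview would keep only low-error entries; a targeted question
# about protein_X could still use the high-error hypothesis.
overview = [a for a in (hypothesis, routine) if a.is_reliable(max_error=0.30)]
print([a.sequence_id for a in overview])
```

Treating a missing error estimate as "not reliable" is a design choice of this sketch: it keeps unvetted predictions out of a general overview while leaving them available as hypotheses.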

Taking the 70% Hurdle

As noted above, most prediction schemes extrapolate from current knowledge, and many bioinformatics methods have difficulty exceeding a 70% prediction accuracy (numbers in Table 1 are often overestimates because the test sets used are usually not representative of all sequences). On one hand, current methods seem to capture important features and explain general trends; on the other hand, 30% of the features are missing or predicted wrongly. This has to be kept in mind when processing the results further. Also, the 70% accuracy figure mostly applies to methods that deal with discrete objects such as sequences; making estimates about the prediction of cellular features is much more difficult, as one first has to agree on semantics (or ontology in a database sense) to describe complex processes in a comparable way.

All of the above focuses on limitations in the computational prediction of qualitative features. There remains a long way to go until we are able to describe molecular processes quantitatively; current simulations of complex systems are still very rough and simplistic. However, there is still no doubt that sequence analysis is extremely powerful and that the generation of hypotheses derived by computational methods will be more and more often the first successful step in the design of experiments. If 70% of such experiments were successful, the speed of scientific discoveries would grow exponentially.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.

    FOOTNOTES

1 E-MAIL bork@embl-heidelberg.de; FAX 11-49-6221-387517.

    REFERENCES

  • Anderson, L. and J. Seilhammer. 1997. Electrophoresis 18: 533-537.
  • Andrade, M., S.I. O'Donoghue, and B. Rost. 1998. J. Mol. Biol. 276: 517-525.
  • Bhatia, U., K. Robison, and W. Gilbert. 1997. Science 276: 1724-1725.
  • Bork, P. and A. Bairoch. 1996. Trends Genet. 12: 425-427.
  • Bork, P. and E.V. Koonin. 1998. Nat. Genet. 18: 313-318.
  • Bork, P., T. Dandekar, Y. Diaz-Lazcoz, F. Eisenhaber, M. Huynen, and Y. Yuan. 1998. J. Mol. Biol. 283: 707-725.
  • Brenner, S. 1999. Trends Genet. 15: 132-133.
  • Buetow, K.H., M.N. Edmonson, and A.B. Cassidy. 1999. Nat. Genet. 21: 323-325.
  • Dandekar, T. and K. Sharma. 1998. Regulatory RNA. Springer Verlag, Heidelberg, Germany.
  • Doerks, T., A. Bairoch, and P. Bork. 1998. Trends Genet. 14: 248-250.
  • Dunham, I., N. Shimizu, B.A. Roe, S. Chissoe, J.E. Collins, R. Bruskiewich, M. Clamp, L.J. Smink, R. Ainscough, and J.P. Almeida. 1999. Nature 402: 489-495.
  • Ehrlich, L., M. Reczko, H. Bohr, and R.C. Wade. 1998. Protein Eng. 11: 11-19.
  • Eisenhaber, B., P. Bork, and F. Eisenhaber. 1999. J. Mol. Biol. 292: 741-758.
  • Enright, A.J., I. Iliopoulos, N.C. Kyrpides, and C.A. Ouzounis. 1999. Nature 402: 86-90.
  • Hanke, J., I. Zastrow, A. Aydin, G. Lehmann, S. Luft, J.G. Reich, and P. Bork. 1999. Trends Genet. 15: 389-390.
  • Jones, D.T. 1999. J. Mol. Biol. 292: 195-202.
  • Lupas, A. 1996. Methods Enzymol. 266: 513-525.
  • Marcotte, E.M., M. Pellegrini, H.L. Ng, D.W. Rice, T.O. Yeates, and D. Eisenberg. 1999a. Science 285: 751-753.
  • Marcotte, E.M., M. Pellegrini, M.J. Thompson, T.O. Yeates, and D. Eisenberg. 1999b. Nature 402: 83-86.
  • Mironov, A.A., J.W. Fickett, and M.S. Gelfand. 1999. Genome Res. 9: 755-771.
  • Muller, A., R.M. MacCallum, and M.J.E. Sternberg. 1999. J. Mol. Biol. 293: 1257-1271.
  • Nielsen, H., S. Brunak, and G. von Heijne. 1999. Protein Eng. 12: 3-9.
  • Pennisi, E. 1999. Science 286: 447-450.
  • Piatigorsky, J. and G.J. Wistow. 1991. Science 252: 1078-1079.
  • Prestidge, D.S. 1995. J. Mol. Biol. 249: 923-932.
  • Rost, B. 1996. Methods Enzymol. 266: 525-539.
  • Smith, T.F. and X. Zhang. 1997. Nat. Biotechnol. 15: 1222-1223.
  • Snel, B., P. Bork, and M. Huynen. 2000. Trends Genet. 16: 9-11.
  • Sunyaev, S., J. Hanke, D. Brett, A. Aydin, I. Zastrow, W. Lathe, P. Bork, and J. Reich. 2000. Adv. Protein Chem. 54: (in press).
  • Teichmann, S., C. Chothia, and M. Gerstein. 1999. Curr. Opin. Struct. Biol. 9: 390-399.
  • Tusnady, G.E. and I. Simon. 1998. J. Mol. Biol. 283: 489-506.


10:398-400 ©2000 by Cold Spring Harbor Laboratory Press  ISSN 1088-9051/00 $5.00
