Vol. 10, Issue 4, 398-400, April 2000
INSIGHT/OUTLOOK
High-throughput technologies impress us almost every week with novel global results and big numbers. They often reveal important general trends that are impossible to realize with classical, low-throughput experimental methods, yet (so far) they provide fewer insights into specific, molecular detail. Because of the amount of data involved, high-throughput technologies imply the use of bioinformatics methods that deal with information transformation, storage, and analysis. By necessity, most of these processes are automated.
Partly because of the nature of current publication schemes, the
accuracy and error margins of a given method are often only found in
small print. It is obvious that each method has its limits and also
that during data processing, some information will be lost or diluted.
Because of the current need to integrate and add value to data, results
from high-throughput experiments (if made publicly accessible) are
often taken further by third-party research that relies on the quality
of these data. Thus, I believe that public awareness of error margins
for high-throughput experimental and computational methods should be
increased; the incredibly valuable data accumulating in various
heterogeneous databases permit powerful analyses but should not be
overinterpreted. In the following discussion, I will concentrate on
limits in computational sequence analysis, which is far from being
perfect (Table 1), despite the fact that sequencing
itself is highly automated and accurate, and despite the fact that
sequence information is described in simple linear terms (using a
four-letter alphabet). On average, a 70% accuracy just to predict
functional and structural features has to be considered a success
(Table 1).
Limitations in the Total Knowledge Base of Protein Function
As these analysis methods are knowledge based, one of the reasons
for the inaccuracy is that the quality of data in public sequence
databases is still insufficient (e.g., Bork and Bairoch 1996; Bhatia et
al. 1997; Pennisi 1999). This is particularly true for data on protein
function. Protein function is loosely defined; cellular function is
more than the very complicated network of individual molecular
interactions on which it is based (Bork et al. 1998). Furthermore, the
semantics for functional features are not always established. For
instance, the notion of a "protein complex" depends heavily not only
on detection and purification methods (which, in turn, are constantly
evolving) but also on environmental conditions. Protein
function is context dependent, and both molecular and cellular aspects
have to be considered (for review, see Bork et al. 1998).
To illustrate some of this complexity, a good example is lactate
dehydrogenase: This gene product can act both as a dehydrogenase and an
eye lens structural protein, depending on its context (for review, see
Piatigorsky and Wistow 1991). Even without the complication of a
second, unrelated role for the same gene product, do we know enough
about the function of lactate dehydrogenase, one of the best-studied
proteins? We know its biochemical pathway (at least in human and some
model organisms), its different isoenzymes (in organisms) with
different context-dependent properties, its regulation, and the
organization of its quaternary structure. However, we are probably
still missing much information, even on crucial molecular features: Are
we sure about alternative splice variants? Can we exclude age-dependent
post-translational modifications in some tissues? Our knowledge is even
more limited regarding higher order functions that involve
concentration, compartmental organization, dynamics, regulation, and
perhaps even the impact of external environment. Often, the available
data give at best some reliable qualitative results on functional
features but far from a complete understanding of functionality. Yet
our ability to annotate genome sequences and translate information
therein relies heavily on the summaries of features attached to each
sequence in the respective public databases.
Limitations of Gene Expression Data Extrapolations
As more high-throughput technologies follow, the data will become
more complicated than sequences. Novel complementary data types such as
gene expression arrays will generate more functional information, but
conclusions from these data are often stretched with regard to protein
products. The expression levels of genes and of their corresponding
proteins seem to correlate only weakly, with a correlation coefficient
of 0.48 (Anderson and Seilhamer 1997). Furthermore, recent studies
(Hanke et al. 1999; Mironov et al. 1999) show that alternative splicing
might affect >30% of the human genes, although measurements at the
protein level have yet to confirm this. Finally, the number of known
post-translational modifications of gene products is increasing
constantly, so that the complexity at the protein level is enormous.
Each of these modifications may change the function of the respective
gene products drastically. (The entire aspect of context-dependent gene
regulation is excluded from the current discussion, as we are only
beginning to understand the complex underlying genetic machinery. For
example, promoter prediction in eukaryotes has a success rate of only
~35% (Table 1), and there are many other regulatory elements that
we cannot predict at all.)
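To put the correlation coefficient quoted above in perspective, the following Python sketch computes a Pearson coefficient and the share of variance it would account for under a simple linear model. It is illustrative only: the function names and the toy abundance values are hypothetical and are not taken from the cited study, and treating the quoted 0.48 as a Pearson coefficient is an assumption.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def variance_explained(r):
    """Fraction of variance accounted for by a linear fit with coefficient r."""
    return r * r


# Hypothetical toy abundances (arbitrary units), NOT data from the cited study.
mrna = [1.0, 2.0, 3.0, 4.0, 5.0]
protein = [2.1, 1.9, 4.2, 3.1, 5.0]
r_toy = pearson_r(mrna, protein)

# Coefficient quoted in the text, assumed here to be a Pearson r.
share = variance_explained(0.48)
print(f"toy r = {r_toy:.2f}; variance explained at r = 0.48: {share:.2%}")
```

Read this way, a coefficient of 0.48 would leave roughly three quarters of the variation in protein abundance unaccounted for by mRNA abundance alone.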
Limitations Created by Third-Party Analyses
Public releases of completely sequenced genomes exceed a rate of
one per month, with thousands of function predictions therein. Gene
annotation via sequence database searches is already a routine job, but
even here the error rate is considerable (Table 1). The lower limit of
errors in current functional annotation of large-scale sequencing
projects is 8% (Brenner 1999). As errors accumulate and propagate
(Bork and Bairoch 1996; Bhatia et al. 1997; Smith and Zhang 1997; Bork
and Koonin 1998; Pennisi 1999), it becomes more difficult to infer
correct function from the many possibilities revealed by a database
search. Compounding these complications is the fact that computer
programs often cannot even retrieve the source of the stored
information (Doerks et al. 1998).
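How quickly errors can accumulate when annotations are copied from one database entry to the next can be illustrated with a deliberately simple model. The sketch below assumes that every transfer step independently introduces an error with the same probability; the independence assumption, the function name, and the use of the 8% figure as a per-step rate are illustrative choices, not results from the cited studies.

```python
def fraction_correct(per_step_error, steps):
    """Expected fraction of annotations still correct after a chain of
    transfers, assuming independent errors with a constant per-step rate."""
    if not 0.0 <= per_step_error <= 1.0:
        raise ValueError("per_step_error must lie between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return (1.0 - per_step_error) ** steps


# Assumed per-step error rate of 8%, borrowed from the lower limit quoted above.
for steps in range(1, 6):
    print(steps, round(fraction_correct(0.08, steps), 3))
```

Under these assumptions, roughly 78% of annotations would still be correct after three successive transfers. Real propagation is unlikely to be this regular, so the numbers indicate a trend rather than an estimate.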
Use of Complementary Information to Limit Errors in Function Prediction
Some new information can be retrieved from completely sequenced
genomes; for example, function can be predicted by exploitation of
genomic context. Based on the observation that interacting proteins in
one organism sometimes have homologs in other organisms fused together
in a single gene, Marcotte et al. (1999a) predicted novel interactions
for 50% of yeast proteins using gene fusion information. However, they
noted an overlap with classical methods and an error rate of 82%. To
see a signal they had to correct for domains present in many proteins
(Marcotte et al. 1999a). By considering only orthologs with fission and
fusion events (Enright et al. 1999; Snel et al. 2000), the
signal-to-noise ratio increases and the number of predictions drops
dramatically (7% of Escherichia coli proteins; Enright et al. 1999).
With a particular question in mind (e.g., "Does protein X have
interaction partners?"), the generation of hypotheses is extremely
useful; yet to provide a general overview of protein function, it is
advisable to keep the errors small. Further information can be added
later, which is easier than retracting stored information. But how do
we incorporate the information on error margins? Such estimates
(and sometimes even the sources of the annotation) are not visible in
current databases that store the results of computational approaches.
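One conceivable way to keep error margins and sources attached to a stored prediction is sketched below in Python. The record layout, field names, identifiers, and function labels are hypothetical and do not describe any existing database schema; the two error rates reuse figures quoted in this article purely as placeholder values.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Annotation:
    """Hypothetical record that stores a prediction with its provenance."""
    sequence_id: str
    function: str
    method: str                        # e.g. "homology search", "gene fusion"
    source: Optional[str]              # entry the annotation came from, if known
    estimated_error: Optional[float]   # expected error rate of the method (0..1)

    def __post_init__(self):
        err = self.estimated_error
        if err is not None and not 0.0 <= err <= 1.0:
            raise ValueError("estimated_error must lie between 0 and 1")

    def is_traceable(self) -> bool:
        """True if both the source and an error estimate are recorded."""
        return self.source is not None and self.estimated_error is not None


def filter_reliable(annotations: List[Annotation], max_error: float) -> List[Annotation]:
    """Keep only annotations with a recorded error estimate at or below max_error."""
    return [a for a in annotations
            if a.estimated_error is not None and a.estimated_error <= max_error]


# Placeholder records; identifiers and function labels are invented.
records = [
    Annotation("seqA", "dehydrogenase", "homology search", "entryX", 0.08),
    Annotation("seqB", "interaction partner of seqA", "gene fusion", "entryY", 0.82),
    Annotation("seqC", "kinase", "homology search", None, None),
]
overview = filter_reliable(records, max_error=0.30)
print([a.sequence_id for a in overview])
```

With such a record, a conservative overview could be built from low-error entries only, while high-error hypotheses remain available for targeted questions instead of being silently mixed in.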
Taking the 70% Hurdle
As noted above, most prediction schemes extrapolate from current knowledge, and many bioinformatics methods have difficulty exceeding a 70% prediction accuracy (numbers in Table 1 are often overestimates because the test sets used are usually not representative of all sequences). On one hand, current methods seem to capture important features and explain general trends; on the other hand, 30% of the features are missing or predicted wrongly. This has to be kept in mind when processing the results further. Also, the 70% accuracy figure often applies to methods that deal with discrete objects such as sequences; making estimates about the prediction of cellular features is much more difficult, as one first has to agree on semantics (or ontology in a database sense) to describe complex processes in a comparable way.
All of the above focuses on limitations in the computational prediction of qualitative features. There remains a long way to go until we are able to describe molecular processes quantitatively; current simulations of complex systems are still very rough and simplistic. However, there is still no doubt that sequence analysis is extremely powerful and that the generation of hypotheses derived by computational methods will be more and more often the first successful step in the design of experiments. If 70% of such experiments were successful, the speed of scientific discoveries would grow exponentially.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
1 E-MAIL bork@embl-heidelberg.de; FAX 11-49-6221-387517.