|
|
|
|
Vol. 11, Issue 2, 290-299, February 2001
METHODS
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |
ABSTRACT |
|---|
|
|
|---|
Although protein identification by matching tandem mass spectra (MS/MS) against protein databases is a widespread tool in mass spectrometry, the question about reliability of such searches remains open. Absence of rigorous significance scores in MS/MS database search makes it difficult to discard random database hits and may lead to erroneous protein identification, particularly in the case of mutated or post-translationally modified peptides. This problem is especially important for high-throughput MS/MS projects when the possibility of expert analysis is limited. Thus, algorithms that sort out reliable database hits from unreliable ones and identify mutated and modified peptides are sought. Most MS/MS database search algorithms rely on variations of the Shared Peaks Count approach that scores pairs of spectra by the peaks (masses) they have in common. Although this approach proved to be useful, it has a high error rate in identification of mutated and modified peptides. We describe new MS/MS database search tools, MS-CONVOLUTION and MS-ALIGNMENT, which implement the spectral convolution and spectral alignment approaches to peptide identification. We further analyze these approaches to identification of modified peptides and demonstrate their advantages over the Shared Peaks Count. We also use the spectral alignment approach as a filter in a new database search algorithm that reliably identifies peptides differing by up to two mutations/modifications from a peptide in a database.
| |
INTRODUCTION |
|---|
|
|
|---|
Database search in mass spectrometry has been very
successful in protein identification. The experimental spectrum can be compared with typical spectra for each peptide in a database and the
peptide with the best fit usually provides the sequence of the
experimental peptide (Eng et al. 1994
; Mann and Wilm 1994
; Taylor and
Johnson 1997
; Fenyo et al. 1998
; Clauser et al. 1999
). However, these
methods are not mutation tolerant and are not effective for detecting
types and sites of sequence variations (Gatlin et al. 2000
).
Identification of modified peptides is an even more challenging
problem. Almost all protein sequences are post-translationally modified, and as many as 200 types of covalent modifications of amino
acid residues are known (Gooley and Packer 1997
). Although the sites of
some post-translational modifications can be predicted from DNA
sequences (Blom et al. 1999
), experimental verification of
post-translational modifications will remain an open problem even after
the human genome is completely sequenced. It raises a challenging
computational problem for the post-genomic era: Given a very large
collection of spectra representing the human proteome, which of 200 types of modifications are present in each human gene?
The computational analysis of modified peptides was pioneered by Mann
and Wilm (1995)
and Yates et al. (1995a)
.
The Peptide Sequence Tag approach (Mann and Wilm 1994
) was successful
in many applications (Shevchenko et al. 1997
), but no information about
the limitations and error rates of this approach for mutation-tolerant
MS/MS search is available. Yates et al. (1995a)
suggested an exhaustive
search approach, that is, to (implicitly) generate a virtual database
of all modified peptides for a small set of modifications and to match
the spectrum against this virtual database. This method was recently
applied to the identification of sequence variations in human
hemoglobins using SEQUEST-SNP software (Gatlin et al.
2000
). However, Yates et al. (1995a)
noted that it leads to a large
combinatorial problem, even for a small set of modification types, and
they indicated that extending this approach to a larger set of
modifications is an open problem. Pevzner et al. (2000)
proposed a new
approach to modification-tolerant database search that automatically
reveals peptide modifications without the need to generate all possible
modified peptides and compare them with the spectrum in a case-by-case
fashion. In fact, this approach does not need any prior knowledge of
the types of modifications under study.
Given a MS/MS database search algorithm, how could we estimate its error rates? It is clear that the error rates of existing MS/MS database search algorithms are small for good spectra but grow fast with diminishing spectrum quality. This is indirectly confirmed by the fact that up to 60% of spectra acquired in an automated regime cannot be matched against databases even in the case of completely sequenced genomes like yeast. The difficulties in interpreting these spectra may come either from the fact that a significant portion of peptides have poor fragmentation or from the fact that a large portion of peptides are modified and, thus, are missed by conventional database search algorithms.
Despite widespread use of MS/MS database search algorithms, we are unaware of any attempts to estimate their error rates depending on the quality of analyzed spectra. Moreover, it is not clear how to make such an estimate as there is no simple experimental method to confirm that an analyzed peptide corresponds to a peptide from the database. To estimate the error rate of an MS/MS database search one should have a collection of peptides and their mass spectra of a given quality, analyze the mass spectra via database search, and find the portion of incorrect predictions. This is an unrealistic experiment because the only reliable method to infer the sequences of peptides is MS/MS database search itself, a "Catch 22". In addition, it is not feasible to generate experimental spectra of a given quality.
We have designed a computational protocol to estimate the error rates
in MS/MS database search and to study the problem: What is the
threshold for spectrum quality that leads to erroneous peptide
identification by database search algorithms? We further compare the
efficiency of the Shared Peaks Count, spectral convolution, and
spectral alignment for MS/MS database search for both experimental and
simulated spectra. In a difference from the Shared Peaks Count, spectral convolution and spectral alignment do not require generation of virtual database of all modifications while comparing a spectrum of
a modified peptide against a database. This advantage of spectral alignment and spectral convolution approaches comes with a tradeoff in
the accuracy of its scoring function that is somewhat lower than the
accuracy of advance scoring functions like ones in SEQUEST (Eng et al. 1994
), MS-Tag (Clauser et al. 1999
),
and SHERENGA (Dancik et al. 1999
). Below, we describe how to combine the advantages of the spectral alignment with advantages of
the algorithms using advanced scoring functions.
For a large peptide database, MS/MS search algorithms produce some random hits while matching spectra of modified peptides. These random hits disguise the real similarities and increase the error rate of the database search.
However, our tests revealed that even in the case when the correct
solution is not the one with the highest score, it is among very few
high-scoring peptides. This suggests a new two-stage approach to MS/MS
database search. At the first filtration stage, the spectral alignment
is used as a filter to identify t top-scoring peptides in the
database, where t is chosen in such a way that it is almost
guaranteed that a correct hit is present among the top t hits.
These top t hits form a small database of candidate peptides
subject to further analysis at the second stage. Although the spectral
alignment is sometimes unable to distinguish which among t
top-scoring peptides is the correct one, more accurate scoring
functions (like scoring functions in SEQUEST,
MS-Tag, and SHERENGA) can be used at the
second verification stage to find the correct hit. At the verification
stage each of these t peptides can be mutated (as suggested by
spectral alignment) and compared against the experimental spectrum by
an accurate scoring scheme. This approach is conceptually similar to
the Yates et al. (1995a)
"virtual database" approach. However,
instead of exhaustive generation of all possible mutations and
modifications which often makes the virtual database approach
infeasible, our filtration procedure reduces the size of the database
to a few hundred candidate peptides.
Estimating the Error Rates in MS/MS Database Search
Let A be an algorithm that scores a spectrum S
against a peptide P from a database by assigning scores
A(S; P ). How can we estimate the error rate
of A while searching a database for high-scoring peptides?
Given a database of peptides {P1, . . . , Pk} and their corresponding spectra
{S1, . . . ,Sk} we say
that the algorithm A correctly reconstructs
Pi by spectrum Si if
|
To test our algorithms, we simulate a database of peptides, induce k mutations in each peptide in this database, simulate typical tandem mass spectra for the mutated peptides, and search these spectra against the original (nonmutated) database. The percentage of correctly matched peptides in this search characterizes the efficiency of k-mutation-tolerant MS/MS database search.
This approach requires simulations of typical (theoretical) spectra.
The offset frequency function, introduced in Dancik et al. (1999)
,
allows one to simulate realistic spectra according to probabilities of
different ion types. For testing purposes we restrict the number of
masses in the spectra and limit our analysis to b- and y-ions only
(minor ions have only a minor effect in our simulations). Some MS/MS
database search applications take into account as many as 200 of the
highest intensity masses in MS/MS spectra. However, Dancik et al.
(1999)
demonstrated that taking into account >5n
(n is the peptide length) highest intensity masses hardly
makes sense, because the signal in the remaining low intensity masses
is indistinguishable from noise. Moreover, the signal corresponding to
b- and y-ions is mainly limited to the first 2n high intensity
masses (Dancik et al. 1999
). Following these results, we generate
theoretical spectra with the number of masses equal to twice the
peptide length. Among them, 2np masses correspond to randomly
chosen b- and y-ions whereas the remaining 2n(1
p) masses are chosen randomly to simulate
noise. The spectrum with quality p = 1 contains no noise
(Fig. 1), whereas the spectrum with quality
p = 0 is made up of noise entirely. For simplicity, we
also ignore the intensities associated with these masses (although the
intensities can be easily incorporated in our algorithms).
|
Although mass spectrometrists routinely use the term "spectrum
quality," there is no standard agreement on how to define this notion. Define
pb = qb/n and
py = qy/n as
the frequencies of b-ions and y-ions correspondingly
(qb and qy are the numbers of b-
and y-ions of peptide P in spectrum S, and n
is the length of the peptide P ). We choose
p = (pb + py) / 2
to represent the spectrum quality in our simulations, and we often
assume pb = py. This parameter does not reflect the presence of minor ions (like
b
H2O). To account for minor ions
one can use the correlation between experimental spectrum S
and theoretical spectrum S(P) of peptide P
and define the spectrum quality as
. The Backbone Cleavage Score
(BCS) is another spectrum quality parameter defined as
|
y
is the number of positions i in a peptide for which either
bi or yi ion is present).
Although BCS sometimes is used to represent spectrum quality, we are
not satisfied with BCS score as it does not discriminate between the
case when both bi and yi ions are
present and the case when only one of them is present.
Algorithms for Peptide Identification Problem
Shared Peaks Count
A match between spectrum S and peptide P is the number of masses that the experimental spectrum S and the theoretical spectrum of peptide P have in common. Let D be a database of peptides and k be a parameter (number of mutations/modifications). Let Dk be a database of all peptides that are at most k mutations/modifications apart from the peptides in D. We view Dk as a "virtual" database because it is usually so large that generation of Dk is an infeasible task. We study the following "peptide identification problem": Given a database of peptides D, spectrum S, and parameter k, find a peptide in Dk whose theoretical spectrum has a maximal match to spectrum S. For the sake of simplicity, we represent a spectrum S as a set of integers, corresponding to masses of fragment ions and ignore the intensities of the fragment ions (see above). Of course, real spectra are not integer, but we assume that scaling (e.g., multiplying all masses by 10) and rounding has been done already. Mass spectrometrists usually refer to masses as peaks, because every mass corresponds to an intensity peak in the experimental spectrum. Following this terminology we denote the number of masses that two spectra have in common as the Shared Peaks Count or SPC. Figure 1 presents three spectra S1, S2, and S3 with SPC(S1, S2) = 5; SPC(S2, S3) = 6 and SPC(S1, S3) = 2. Most existing database search programs are based on the SPC between the experimental and theoretical spectra. SPC is, of course, an intuitive measure of spectral similarity. However, this measure diminishes very quickly as the number of mutations increases thus leading to limitations in detecting similarities in MS/MS database search.Spectral Convolution
Let S1 and S2 be two spectra. Following Pevzner et al. (2000)
S1 =
{s2
S1 : s1
S1,s2
S2}.
Figure 2, a and b, shows the elements of
this multiset in the form of a difference matrix. We define
(S2
S1)(x) as a
multiplicity of an integer x in the set
S2
S1, that is, the
number of pairs s1
S1,
s2
; S2 such that
s2
S1 = x. We
view spectral convolution as both a multiset S2
S1 and a function
S2
S1(x). The
elements of S2
S1 with high multiplicity correspond to peaks in the spectral convolution (S2
S1)(x). If
M(P) is the parent mass of peptide P with
the spectrum S, then SR =
M(P)
S = {M(P)
s
: s
S} is the reverse spectrum of
S.The reverse spectral convolution
(S2
S1R)(x)
is the number of pairs
s1
S1,s2
S2
such that s2 + s1
M(P) = x.
|
, then the
spectral convolution of their spectra
(S2
S1)(x) is
expected to have two approximately equal peaks at x = 0 and x =
. The other set of correlations between the spectra
of mutated peptides is captured by the reverse spectral convolution
S2
S1R,
reflecting the pairings of N-terminal and C-terminal ions.
S2
S1R(x)
is expected to have two peaks at the same positions 0 and
. The
spectral and the reverse spectral convolutions can be combined by
introducing a multiset S:
|
|
|
|
Spectral Alignment
Define a spectral product A × B of spectra A = {a1, . . . , an}and B = {b1, . . . , bm} as an an × bm two-dimensional matrix with nm 1s corresponding to all pairs of indices (ai, bj) and remaining elements being zeroes. A × B is a sparse matrix and the number of 1s at the main diagonal of this matrix describes the Shared Peaks Count between spectra A and B. The
-shifted Peaks Count [corresponding to
(A × B)(
) in spectral convolution] is the
number of 1s on the diagonal (i, +
). The limitation of the spectral convolution is that it considers diagonals separately without combining them into feasible mutation scenarios. Following Pevzner et al. (2000)
ai' = bi
bi'
and that (i', j') if
i' < i and
j' < j. To take care of the initial
conditions, we introduce a fictitious element (0,0) with
D0,0(k) = 0 and assume that (0,0) is
codiagonal with any other (i,j). The dynamic
programming recurrency for Dij(k) is
|
Branch-and-Bound Algorithm for Mutation-Tolerant Peptide Identification
The spectral alignment approach is conceptually different from the virtual database approach because it does not rely on a prior knowledge of all possible types of modifications. If the number of mutations and modifications of interest is limited, we propose a branch-and-bound algorithm (Bushnell and Chen 1996
1,
and
= M2
M1, where
1 is unknown. Although our algorithm exhaustively
searches for
1, preprocessing, restrictions on mass
differences of amino acids, and the branch-and-bound approach
significantly reduce its running time.
Consider a transformation of peptide
P = p1, . . ., pn
into a peptide
|
1 steps, switches to
diagonal
1 = m(
)
m(pi) for the intermediate j
i steps and then switches to diagonal
=
1 + m(
)
m(pj) for the last n
j + 1 steps. The score for this path is given by Score = Prefix(i
1) + Middle(i,j,
1) + Suffix(j + 1) where Prefix(i
1) is the precomputed score for the first (i
1) steps on
the 0-diagonal, Suffix(j + 1) is the precomputed
score for the last (n
j + 1) steps on the
-diagonal, and
Middle(i,j,
1) is the score
for steps from i to j on the
1 diagonal.
Middle(i,j,
1) depends on
(unknown)
1 and can be bounded by (precomputed)
Bound(
1) equal to the total number of 1s on the diagonal
1. The idea of using Bound is to cut
corners in the virtual database by estimating scores and not exploring
variants with low estimated scores in the branch-and-bound algorithm:
BRANCH-AND-BOUND ALGORITHM
|
) leads
to a significantly faster program than exhaustive generation of virtual
database of modified peptides (for 200 types of modifications, the
virtual database contains 106-107 entries per each
peptide in the real database).
| |
RESULTS |
|---|
|
|
|---|
To estimate the efficiency of MS/MS database search on experimental
spectra we used the sample of 36 annotated spectra of yeast tryptic
peptides from Dancik et al. (1999)
(Table
1). This sample contains 10 high quality
spectra (p
0.4), 14 average quality spectra
(0.3 < p< 0.4), and 12 poor quality spectra
(p
0.3). These spectra were matched against the yeast
protein database (
120,000 peptides of 14 amino acids average
length) in which every peptide was changed by k = 1 or
k = 2 mutations. We also simulated a database of 10,000 peptides with typical frequencies of amino acids in human peptides,
mutated every peptide in this database, and generated a typical
spectrum for every peptide in the database of mutated peptides. We then
searched every spectrum (of mutated peptide) against the database of
(nonmutated) peptides. The length of peptides was fixed to 15 amino
acids, which is close to the average length of tryptic peptides.
|
The efficiency of our mutation-tolerant database search was tested for
simulated spectra at different values of the spectral quality
p and the number of mutations k. The results for
spectral convolution and spectral alignment are presented in Figures
5 and 6,
respectively (MS-CONVOLUTION and MS-ALIGNMENT are available by contacting Z.M.). The first plot in Figure 5 demonstrates that even for k = 0 the spectral convolution
(in this case it is equivalent to the Shared Peaks Count) leads to errors for poor quality spectra (p < 0.3), and the error
rate grows very fast as the spectrum quality falls below
p = 0.2. Because many such spectra may be present in
high-throughput MS/MS projects, the results provide an explanation as
to why many spectra in such projects are hard to interpret. They also
indicate that interpretations of low quality spectra should be taken
with caution even for the no mutations/modifications' case. For
p
0.5, the spectral convolution approach leads to
nearly perfect predictions for k = 1 and provides 70%-80%
accurate peptide identification for k = 2. The efficiency of
the spectral convolution approach falls significantly for
k > 2 and remains below 70% even for ideal
(p = 1) spectra. The spectral alignment approach further
improves the accuracy of protein identification (Figure 6). For
k = 1, spectral alignment leads to nearly perfect
predictions as soon as the quality of spectra exceeds
p > 0.4. For p
0.5, the spectral
alignment provides 80%-90% accurate peptide identification for
k = 2. The accuracy of spectral alignment for
k = 3 improves significantly as compared to spectral
convolution but remains below 70% even for high quality spectra.
|
|
Table 1 shows the results of the spectral alignment approach for the sample of experimental spectra and reveals that most errors are associated either with short peptides or with low quality spectra. It also confirms that mutation-tolerant database search is an easier problem than modification-tolerant database search. For k = 1, the mutation-tolerant branch-and-bound algorithm makes no errors for high- and average-quality spectra, whereas the modification-tolerant branch-and-bound algorithm and spectral alignment make two errors for high and average quality spectra. We emphasize that the running time of the spectral convolution and the spectral alignment is not affected by considering modifications instead of mutations. In contrast, the running time of the branch-and-bound algorithm increases in case of modifications, because the alphabet of modifications is larger than the amino acid alphabet.
Let S be a spectrum of quality p and let P be a random (unrelated) peptide from a database. The spectrum S can match the peptide P just by chance and we are interested in the probability that the score of this match is above a threshold. If S has quality above p for a random peptide P, then P causes an error in our search algorithm. These random hits disguise the real similarities and lead to an increase in the error rate of the database search. The question then arises as to how many random hits have a higher score than the correct hit? In other words, what is the rank of the real hit in the ranked list of all hits?
Figures 5 and 6 (bottom) suggest that even in the case when the correct solution is not the highest scoring one, it remains at the top of the list of high scoring peptides. Figure 7 answers the question: "What is the rank of the correct hit in the ranked list of top-scoring database hits?"
|
Figure 8 studies the same question for experimental spectra and demonstrates that the spectral alignment places correct peptides among the 500 top-scoring peptides in most cases. The only exceptions are spectra of very short peptides and low quality spectra.
|
Considering the mutations-only case reduces the number of random hits and significantly improves the accuracy of the mutation-tolerant algorithm as compared with the modification-tolerant algorithm (Fig. 8). It justifies the two-stage filtration-and-verification approach to mutation-tolerant protein identification. At the first stage, the spectral alignment is used as a filter to identify t top-scoring peptides in the database. At the second stage each of these top-scoring peptides is verified by a comparison against the spectrum using a more accurate scoring function.
Conclusion
We described mutation-tolerant and modification-tolerant database
search approaches that are based on spectral convolution, spectral
alignment, and branch-and-bound algorithms. The algorithms have been
tested on both experimental and simulated data and proved to be
efficient for identification of mutated and modified peptides with up
to two mutations/modifications. An alternative to this method is de
novo interpretation followed by a BLAST-like database
similarity search as proposed by Taylor and Johnson (1997)
and Clauser
et al. (1999)
. This approach gives hope for mutation-tolerant searches
but is unlikely to succeed for modification-tolerant searches since de
novo reconstruction of modified peptides remains an open problem.
A number of questions related to modification-tolerant MS/MS database
searches remain open. One of them is the choice of parameter k
(number of mutations) that is not known in advance. We propose to run
MS-CONVOLUTION and MS-ALIGNMENT with a range
of k and analyze top-scoring peptides for each k. Our tests indicate that the difference between the score of the top-scoring and the second-scoring peptides may provide an insight for the choice
of k. The correct k often corresponds to the case
when the gap in the scores of two top-scoring peptides is relatively large (compare with hikes in energy landscapes; Tiana et al. 2000
). However, this is an empirical rule and further statistical analysis of
MS/MS database search is needed.
| |
ACKNOWLEDGMENTS |
|---|
The authors thank Karl Clauser for many critical comments.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| |
FOOTNOTES |
|---|
3 Corresponding author.
E-MAIL ppevzner{at}hto.usc.edu; FAX (213) 740-2424.
Article and publication are at www.genome.org/cgi/doi/10.1101/gr.154101.
| |
REFERENCES |
|---|
|
|
|---|
Received June 30, 2000; accepted in revised form November 22, 2000.
This article has been cited by other articles:
![]() |
P. J. Ulintz, B. Bodenmiller, P. C. Andrews, R. Aebersold, and A. I. Nesvizhskii Investigating MS2/MS3 Matching Statistics: A Model For Coupling Consecutive Stage Mass Spectrometry Data For Increased Peptide Identification Confidence Mol. Cell. Proteomics, January 1, 2008; 7(1): 71 - 87. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Rexroth, C. C. L. Wong, J. H. Park, J. R. Yates III, and B. A. Barry An Activated Glutamate Residue Identified in Photosystem II at the Interface between the Manganese-stabilizing Subunit and the D2 Polypeptide J. Biol. Chem., September 21, 2007; 282(38): 27802 - 27809. [Abstract] [Full Text] [PDF] |
||||
![]() |
B.-J. M. Webb-Robertson and W. R. Cannon Current trends in computational inference from mass spectrometry-based proteomics Brief Bioinform, September 1, 2007; 8(5): 304 - 317. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. R. Ramakrishnan, R. Mao, A. A. Nakorchevskiy, J. T. Prince, W. S. Willard, W. Xu, E. M. Marcotte, and D. P. Miranker A fast coarse filtering method for peptide identification by mass spectrometry Bioinformatics, June 15, 2006; 22(12): 1524 - 1531. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. D. Halligan, V. Ruotti, S. N. Twigger, and A. S. Greene DeNovoID: a web-based tool for identifying peptides from sequence and mass tags deduced from de novo peptide sequencing by mass spectroscopy Nucleic Acids Res., July 1, 2005; 33(suppl_2): W376 - W381. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. W.H. Wong, G. Cagney, and H. M. Cartwright SpecAlign--processing and alignment of mass spectra datasets Bioinformatics, May 1, 2005; 21(9): 2088 - 2090. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. Habermann, J. Oegema, S. Sunyaev, and A. Shevchenko The Power and the Limitations of Cross-Species Protein Identification by Mass Spectrometry-driven Sequence Similarity Searches Mol. Cell. Proteomics, March 1, 2004; 3(3): 238 - 249. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. C. Giddings, A. A. Shah, R. Gesteland, and B. Moore Genome-based peptide fingerprint scanning PNAS, January 7, 2003; 100(1): 20 - 25. [Abstract] [Full Text] [PDF] |
||||
![]() |
A. J. Mackey, T. A. J. Haystead, and W. R. Pearson Getting More from Less: Algorithms for Rapid Protein Identification with Multiple Short Peptide Sequences Mol. Cell. Proteomics, February 1, 2002; 1(2): 139 - 147. [Abstract] [Full Text] [PDF] |
||||
![]() |
E. M. Marcotte Measuring the Dynamics of the Proteome Genome Res., February 1, 2001; 11(2): 191 - 193. [Full Text] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||