Vol. 12, Issue 9, 1305-1315, September 2002
LETTER
ABSTRACT
Plant disease resistance genes have been shown to be
subject to positive selection, particularly in the leucine rich repeat (LRR) region that may determine resistance specificity. We performed a
genome-wide analysis of positive selection in members of the nucleotide
binding site (NBS)-LRR gene family of Arabidopsis thaliana. Analyses were possible for 103 of 163 NBS-LRR nucleotide sequences in
the genome, and the analyses uncovered substantial evidence of positive
selection. Sites under positive selection were detected and identified
for 10 sequence groups representing 53 NBS-LRR sequences. Functionally
characterized Arabidopsis resistance genes were in these 10 groups, but several groups with extensive evidence of positive
selection contained no previously characterized resistance genes. Amino
acid residues under positive selection were identified, and these
residues were mapped onto protein secondary structure. Positively
selected positions were disproportionately located in the LRR domain
(P < 0.001), particularly a nine-amino acid β-strand
submotif that is likely to be solvent exposed. However, a substantial
proportion (30%) of positively selected sites were located outside
LRRs, suggesting that regions other than the LRR may function in
determining resistance specificity. Because of the unusual sequence
variability in the LRRs of this class of proteins, secondary-structure
analysis identifies LRRs that are not identified by similarity analyses
alone. LRRs also contain substantial indel variation, suggesting
elasticity in LRR length could also influence resistance specificity.
INTRODUCTION
Disease resistance genes (R genes) are crucial components
of the hypersensitive response (HR), a plant defense
mechanism that results in localized cell death. The HR is triggered
when pathogen molecules, possibly virulence factors, are detected by
plant receptors; genetic analysis of the HR has led to the cloning of
R genes, many of which encode receptor-like proteins. Based on their
predicted domain structure, R proteins have been classified into four groups: intracellular kinases, extracellular receptors, extracellular receptors coupled to kinases, and
intracellular receptors (Bent 1996).
Most characterized R genes encode putative intracellular receptors
(Dangl and Jones 2001), which contain either a coiled-coil (CC) or a
Toll/Interleukin-1 receptor (TIR) domain at their N-terminal end, followed by a nucleotide binding site (NBS). At the C-terminal end, these proteins contain a series of leucine rich repeats (LRRs).
The functions of the CC, TIR, and NBS domains are not known fully, but
all similar proteins identified in animal systems play roles in
protein-protein interactions and signal transduction (Srinivasula et
al. 1998; Kopp and Medzhitov 1999; Inohara et al. 1999; Burkhard et al.
2001). The function of LRR domains is clearer because recent data
suggest that LRRs in R proteins mediate direct or indirect interaction
with pathogen molecules (Jia et al. 2000; Dangl and Jones 2001). The
tertiary structure of LRRs has been experimentally determined for a
diverse group of proteins (Price et al. 1998; Marino et al. 1999; Liker
et al. 2000; Zhang et al. 2000), most notably porcine ribonuclease
inhibitor (PRI; Kobe and Deisenhofer 1993, 1995b). Generally,
individual LRRs form repeats of β-strand-loop and α-helix-loop
units with nonleucine residues exposed, and together they compose a binding surface
predicted to be involved in protein recognition (Kobe and Kajava 2001). In R
proteins, putatively solvent-exposed residues in β-sheets may
interact with pathogen ligands and hence determine specificity for
pathogen elicitors (Thomas et al. 1997; Ellis et al. 1999, 2000).
Comparative analyses of R genes from tomato, lettuce, rice, flax, and
Arabidopsis have revealed that solvent-exposed positions of
the LRRs are hypervariable and subject to positive natural selection
(Parniske et al. 1997; Meyers et al. 1998; Wang et al. 1998; Noel et
al. 1999; Ellis et al. 2000). Evidence for positive selection is
consistent with host-pathogen coevolution (Endo et al. 1996) and
selection for new resistance specificities. A corollary of this
observation, and the underlying basis of our work, is that positive
selection may be used as an evolutionary profile that identifies
NBS-LRR-encoding genes that are likely to function in disease resistance.
The genome sequence of Arabidopsis provides an opportunity to
investigate genomic patterns of positive selection in the NBS-LRR gene
family. To detect positive selection, we estimate ratios of
nonsynonymous to synonymous nucleotide substitutions, also known as
ω, on NBS-LRR gene family members. ω
is a molecular evolutionary measure of selection (Kimura and Ohta 1974). When ω is equal to 1, a
gene is evolving without constraint on nonsynonymous substitutions relative to synonymous substitutions, a condition interpreted as
neutral evolution. In contrast, an ω > 1 is strong evidence of
positive selection (Hughes and Nei 1988) and an ω < 1 is
consistent with purifying selection, although the possibility of
positive selection cannot be excluded. To calculate ω, we have
employed a maximum likelihood (ML) method that identifies the specific amino acid residues on which positive selection has acted (Nielsen and
Yang 1998; Yang and Bielawski 2000). The ML approach differs substantially from approaches used in previous studies of positive selection in NBS-LRR genes because previous studies partitioned codons into predicted solvent-exposed regions and the remainder of the LRR (e.g., Parker et al. 1997; Botella et al. 1998;
Warren et al. 1998; Bittner-Eddy et al. 2000). Such a priori partitioning does not permit identification of individual amino acids
under positive selection, and it does not provide an accurate picture
of the extent and genic location of positive selection.
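
As a rough illustration of the nonsynonymous/synonymous distinction that underlies ω, the sketch below classifies codon differences between two aligned coding sequences. It is not the ML method used in this study: it only counts differing codons and does not normalize by the numbers of synonymous and nonsynonymous sites, so it does not estimate ω. The function name and the example sequences are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def count_codon_differences(seq_a: str, seq_b: str) -> tuple[int, int]:
    """Return (synonymous, nonsynonymous) codon differences between two
    aligned, gap-free coding sequences of equal length (hypothetical helper)."""
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3].upper(), seq_b[i:i + 3].upper()
        if codon_a == codon_b:
            continue  # identical codons carry no substitution
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1      # same amino acid encoded
        else:
            nonsynonymous += 1   # amino acid (or stop) changed
    return synonymous, nonsynonymous


# Invented example: AAA->AAG keeps lysine; CTT->CCT changes leucine to proline.
print(count_codon_differences("ATGAAACTT", "ATGAAGCCT"))
```

A proper estimate of ω would additionally require counting synonymous and nonsynonymous sites and correcting for multiple substitutions, which the codon substitution models cited above handle within a likelihood framework.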
The location of positively selected residues is important for inferring
gene function. For example, previous studies have shown that
solvent-exposed regions of the LRR are subject to positive selection; these results have been interpreted as evidence that solvent-exposed regions mediate pathogen recognition (Parniske et al.
1997). Here we map the position of all positively selected amino
acid residues onto NBS-LRR genes and find that most but not all are
found within the LRR region. In addition, we apply secondary-structure
prediction methods to LRR regions to characterize structural motifs and
also to determine whether positively selected sites fall predominantly
in solvent-exposed residues. Altogether, this study of
Arabidopsis NBS-LRR genes has two main goals. First, we use
selection as an evolutionary profile, hypothesizing that positive
selection may help identify the subset of NBS-LRR genes that are
most likely to function in plant defense. Second, we elucidate the
relationship among structure, function, and evolution by mapping
positively selected sites onto NBS-LRR gene secondary structure.
RESULTS
Sequence Groups
We retrieved complete amino acid sequences of 163 genes from the Arabidopsis Resistance Genes database (At-RGenes), aligned the sequences, and reconstructed a neighbor-joining phylogeny. The phylogeny based on 163 NBS-LRR genes was similar to the At-RGenes database phylogeny in that there was a clear separation between the TIR-NBS-LRR and CC-NBS-LRR sequences, and there were also some similarities in the grouping of sequences within clades (data not shown). However, the At-RGenes phylogeny was based only on the NBS region, while our analyses were based on complete sequences of NBS-LRR proteins.
The aligned genes were too divergent for analysis of positive
selection, and thus we partitioned sequences into individual groups.
Based on the initial phylogeny and sequence characteristics (see
Methods), we pared the data to 103 sequences and assigned them to 22 phylogenetically clustered groups (Table 1). During grouping, some sequences that
could not be assigned to groups were discarded as orphans, including
the characterized resistance genes RPM1 and RPS2,
which have been noted as orphans previously (Richly et al. 2002). After
grouping, the average group size was 4.7 sequences, with the largest
containing 11 sequences and the smallest containing 2. Eight of 22 groups contained only 2 sequences, and for these groups we could not
apply the full range of ML analyses (Table 1). Group alignments ranged
in length from 822 to 1596 amino acid positions, with an average length
of 1108 positions (Table 1).
Detection of Positive Selection
We applied likelihood ratio (LR) tests of positive selection based
on the ML methods and codon substitution models (M) of Yang, Nielsen,
and colleagues (Yang 1997; Nielsen and Yang 1998; Yang et al. 2000). We
applied two tests. The first LR test compared M1, the free-ratio model
that assumes independent ω values for every branch of the phylogeny,
versus M0, the null one-ratio model that constrains ω to be equal on
all phylogenetic branches (Yang 1998; Yang and Nielsen 1998). This test
was applied to all 22 sequence groups. After Bonferroni correction for
22 tests, six groups both fit M1 significantly better than M0 and also
had at least one lineage with ω > 1 (Table 2). Thus, the comparison of M1 and M0
detected positive selection in six sequence groups.
Comparison between M1 and M0 yields an average ω value among codons
along a phylogenetic branch. If positive selection took place in only a
few codons, the effect of positive selection on nucleotide substitution
may not be detected (Anisimova et al. 2001). We therefore employed a
second, more specific test that examines variation in ω among sites
by comparing models M7 and M8 (Yang et al. 2000). This test could only
be applied to the 14 sequence groups with more than two sequences
(Table 2). For complete sequence alignments, LR tests with M7 and M8
identified 10 groups that fit the selective model better than the null
model and also had an ω > 1. Results remained significant after
Bonferroni correction for 14 tests and an experiment-wide error of 5%
(Table 2). When the LR test suggested positive selection had occurred, positively selected sites were identified under M8 using a
Bayesian method (Nielsen and Yang 1998; Yang et al. 2000). The number of inferred positively selected sites varied among the 10 groups
in which positive selection was detected. For example, only one site
was identified from group 18, but group 14 had 26 positively selected
sites (Fig. 1; Table 2).
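
The M7 versus M8 comparison with Bonferroni correction can be sketched as follows. This is a minimal illustration, not the software used in the study. It assumes the test statistic, twice the log-likelihood difference, is compared with a χ² distribution with 2 degrees of freedom (M8 adds two parameters to M7), for which the upper-tail probability is simply exp(-x/2). The function name and the log-likelihood values are hypothetical.

```python
import math


def lrt_m7_vs_m8(lnl_m7: float, lnl_m8: float,
                 n_tests: int = 14, alpha: float = 0.05):
    """Likelihood ratio test of M8 (selection) against M7 (null).

    Returns (statistic, p_value, significant_after_bonferroni).
    Assumes a chi-square reference distribution with 2 degrees of freedom.
    """
    # Test statistic: twice the gain in log-likelihood from the richer model.
    statistic = max(2.0 * (lnl_m8 - lnl_m7), 0.0)
    # Upper tail of chi-square with df = 2 has the closed form exp(-x / 2).
    p_value = math.exp(-statistic / 2.0)
    # Bonferroni: divide the experiment-wide error by the number of tests.
    significant = p_value < alpha / n_tests
    return statistic, p_value, significant


# Invented log-likelihoods for one sequence group.
stat, p, sig = lrt_m7_vs_m8(lnl_m7=-5230.4, lnl_m8=-5215.9)
print(f"2*dlnL = {stat:.2f}, P = {p:.2e}, significant = {sig}")
```

With 14 tests and an experiment-wide error of 5%, the per-test threshold in this sketch is 0.05/14, or about 0.0036.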
We also applied the two LR tests separately to the TIR, NBS, and LRR
regions. CC domains were analyzed together with the NBS because the
short length of CC domains made separate tests impractical, and groups
with two sequences were not divided into domains because of low
information content. Of the 14 groups divided into domains, 8 groups
contained at least one domain that had (1) an estimate of ω > 1
under M8, (2) sites identified to be under positive selection, and (3)
a significant LR test (Table 3). For 7 of
these 8 groups, positive selection was also detected with
whole-sequence analysis (Table 2); the lone exception was group 7, which contained positively selected sites in the LRR region alone but
no positively selected sites with complete data (Table 2). Six of the 8 domains that exhibited evidence of positive selection were LRRs, and
they included 102 of the 105 positively selected sites detected in
domain analyses (Table 3).
Location of Positively Selected Sites and Sequence Variation
We plotted the genic location of positively selected sites for the
10 groups that had sites detected from whole-sequence analysis (Fig. 1).
Positively selected sites were not homogeneously distributed among
regions; 70% (82 of 116) of sites were located in LRRs. The
heterogeneous distribution of positively selected sites was clear from
the comparison of the proportion of sites under selection within the
NBS and LRR regions, the two domains that occur in all proteins.
Two-by-two contingency tests revealed that sites under positive
selection occur significantly more frequently in the LRR domains
(P < 0.001; χ² = 56.13). Nonetheless, 34 positively selected sites were located in non-LRR regions.
We also studied the distribution of indels across regions in the groups
in which positive selection was detected. In a two-by-two contingency
test, LRRs had a significantly larger proportion of indels than non-LRR
domains (P < 0.001; χ² = 145.14). However,
the high incidence of indels in the LRR was not unique to proteins
under positive selection. Positively selected sites were not detected
in groups 1, 5, 7, and 16, but indels were also more frequent in the
LRR than the NBS for these sequence groups (P < 0.001;
χ² = 91.36). These observations are important for two
reasons. First, the high incidence of gaps in the LRR regions provides
additional evidence that LRRs are more labile than other domains.
Second, gaps alone do not account for the high incidence of positively selected sites in LRRs.
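
The two-by-two contingency tests used in this section can be sketched as below. This is a minimal illustration that assumes a Pearson χ² statistic without continuity correction and 1 degree of freedom; the exact variant used in the study is not stated here. The function name and the counts in the example are hypothetical, because the per-domain totals of aligned sites are not given in the text.

```python
import math


def chi_square_2x2(a: int, b: int, c: int, d: int):
    """Pearson chi-square test for the 2x2 table [[a, b], [c, d]].

    Rows could be domains (e.g., LRR vs. NBS) and columns site classes
    (e.g., positively selected vs. not). Returns (statistic, p_value).
    No continuity correction is applied (an assumption of this sketch).
    """
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        raise ValueError("every row and column must have a nonzero total")
    statistic = n * (a * d - b * c) ** 2 / denominator
    # Upper tail of chi-square with df = 1 equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Invented counts: selected and unselected sites in two domains.
stat, p = chi_square_2x2(60, 2940, 10, 3990)
print(f"chi-square = {stat:.2f}, P = {p:.3g}")
```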
LRR Secondary Structure and Residues Under Positive Selection
Non-leucine residues in the β-sheet of LRRs can be exposed to the
solvent phase (Kobe and Deisenhofer 1994) and may interact with
pathogen ligands (Jones and Jones 1997; Ellis et al. 2000), suggesting
that the structural arrangement of variable sites in the LRR is
important. To investigate this, we analyzed the predicted secondary
structure of aligned LRR motifs and then mapped the distribution of
sites under positive selection onto these structural predictions.
Prior analyses of plant NBS-LRR R proteins indicate that the LRR
typically has a consensus sequence similar to
LXXLXXLXXLXLXX(N/C/T)X(X)LXXIPXX, where X represents any amino acid and
the other letters denote specific amino acid residues (Hammond-Kosack
and Jones 1997; Jones and Jones 1997). Our secondary-structure analyses
of these repeats revealed that LRR structure is in general
characterized by a coil (C) structure up to the third leucine (L) of
this consensus, followed by 3 to 6 residues that have a β-strand (E)
structure. The XXLXLXX motif within the LRR has been predicted to form
a solvent-exposed β-sheet (Jones and Jones 1997). In our analyses,
the first 3 to 6 of these residues consistently adopted a β-strand
structure, followed by 3 to 6 residues in a coil (Fig. 2); we refer to this ~9-residue region as
E4C5. The LRR consensus of the sequences we
analyzed (Fig. 2) starts at the sixth residue of the consensus cited
above. The β strand predicted for this consensus is centered on the
second conserved L and the remaining E4C5 residues adopt a coil
structure. In roughly one-third of the LRRs, the basic secondary
structure of the remaining LRR is modified by 3 or 4 residues that
adopt an α-helix configuration (H); these residues tend to be
aliphatic (L, V, and I) and are located ~9-11 residues after the last
residue in the β strand (Fig. 2). The resulting secondary-structure
pattern of CCEEEECCCCCCCCCCHHHHCC recurs throughout predicted LRR
regions and, in some cases, within protein regions not predicted to
contain LRR domains. The latter occurred in groups 11 and 18, in which
six and eight LRR motifs were detected by Pfam analysis and an
additional 6 and 3 regions, respectively, had a secondary structure
consistent with an LRR (Fig. 2).
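
To illustrate how such a pattern could be located in a predicted secondary-structure string, the sketch below scans for a run of 3 to 6 strand residues (E) followed by 3 to 6 coil residues (C), following the description of the E4C5 region above. It is only a hypothetical post-processing step, not the prediction procedure used in the study; the regular expression, function name, and example string are assumptions.

```python
import re

# 3-6 strand residues followed by 3-6 coil residues, per the E4C5 description.
STRAND_COIL = re.compile(r"E{3,6}C{3,6}")


def find_strand_coil_regions(secondary_structure: str):
    """Return (start, end, match) for each strand-then-coil region.

    `secondary_structure` uses one letter per residue: C = coil,
    E = beta strand, H = alpha helix. Indices are 0-based and `end`
    is exclusive, as in Python slices.
    """
    return [(m.start(), m.end(), m.group())
            for m in STRAND_COIL.finditer(secondary_structure.upper())]


# Two copies of the recurring pattern reported for predicted LRR regions.
prediction = "CCEEEECCCCCCCCCCHHHHCC" * 2
for start, end, region in find_strand_coil_regions(prediction):
    print(start, end, region)
```

Because the quantifiers are greedy, each match extends to at most six coil residues even when the coil continues, so the reported span marks where the region begins rather than the full loop length.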
When the sites under positive selection were plotted onto the predicted
secondary structure of each protein, we found that most sites fell into
E4C5. A two-by-two contingency test comparing the
E4C5 with the rest of the LRR showed that
this motif contained a significantly higher proportion of positively
selected sites (χ² = 48.60), suggesting that these sites
are evolutionarily, and perhaps functionally, distinct.
Contrasts Between Groups With and Without Positively Selected Sites
Some groups did not have identifiable positively selected sites, and it is useful to explore potential differences between groups that did and did not have positively selected sites. Because the M8 test could identify selected sites only in groups with more than two sequences, this section considers only those groups.
Differences between groups with and without positively selected sites
could be a consequence of sampling, so we contrasted four statistics
between groups: (1) the number of sequences in the groups, (2) the
length of the alignments, (3) the mean sequence identities in groups,
and (4) tree length, which reflects the amount of sequence evolution
among the sequences in a group. Each of these statistics contributes to
the statistical power of the LR tests (Anisimova et al. 2001), and
significant differences in any of these characters between positively
and nonpositively selected groups could indicate that sampling
properties (like sequence length or sequence identity) are primarily
responsible for our results. However, we detected no significant
difference between positively selected and nonpositively selected
groups in any of the four statistics (alignment length:
t = 1.43, P = 0.17; number of sequences:
t = 1.61, P = 0.13; average sequence identity:
t = 0.69, P = 0.5; tree length:
t = 0.87, P = 0.4). These results suggest that
sampling characteristics alone do not play the primary role in
discriminating between groups with and without positively selected sites.
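
A comparison of this kind can be sketched with a two-sample t statistic. The variant of the t-test used in the study is not stated, so the pooled-variance (Student) form below is an assumption, as are the function name and the example values; only the statistic and its degrees of freedom are computed.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(sample_a, sample_b):
    """Two-sample Student t statistic with pooled variance.

    Returns (t, degrees_of_freedom). Each sample needs at least two values.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    if n_a < 2 or n_b < 2:
        raise ValueError("each sample needs at least two observations")
    df = n_a + n_b - 2
    # Weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * variance(sample_a)
                  + (n_b - 1) * variance(sample_b)) / df
    if pooled_var == 0:
        raise ValueError("pooled variance is zero; t is undefined")
    t = (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df


# Invented alignment lengths for groups with and without selected sites.
with_sites = [1010, 1150, 980, 1320, 1105, 1240, 1075, 1190, 1030, 1280]
without_sites = [995, 1060, 1120, 940]
t, df = pooled_t_statistic(with_sites, without_sites)
print(f"t = {t:.2f}, df = {df}")
```

With 10 groups in one class and 4 in the other, this form of the test has 12 degrees of freedom.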
Although there are no obvious sampling differences between groups with
and without positively selected sites, there could be biological
differences that do not relate to resistance function. We examined two
potential biological differences. The first was gene expression, using
EST hits as a proxy for expression. This proxy is particularly rough
given the diverse origin and preparation methods of cDNA libraries used
for EST sequencing, but we used ESTs to determine if there were
substantial differences in expression between the two sequence classes.
EST information was retrieved from the Munich Information Center for
Protein Sequences (MIPS) Arabidopsis thaliana database
(http://mips.gsf.de/proj/thal/) and the RIKEN cDNA collection (Seki et
al. 2002). We summed the number of EST hits for each sequence group
(data not shown) and contrasted the total number of hits between
positively selected and nonpositively selected groups. Although the
groups with positively selected sites had a slightly higher average
number of EST hits (12 hits vs. 8.1 hits), the difference in hits was
not significant (t = 0.91; P = 0.37). Thus, by
this method there is no detectable difference in gene expression
between the two classes of sequences.
Ectopic recombination and gene conversion among sequences is a second
biological parameter that could conceivably affect tests for positive
selection. Recombination and gene conversion among sequences could
affect test statistics because the ML test assumes a single phylogeny
adequately represents the evolution of a group of sequences. If ectopic
exchange (or gene conversion) occurs in only one genic region (for
example, the LRR), it is possible that different genic regions have
different evolutionary histories, so that the sequence did not evolve
with a single phylogenetic pattern. Although the ML approach is known
to be reasonably robust to incorrect phylogenetic assumptions (Yang et
al. 2000), the effects of gene conversion and ectopic exchange on test
statistics are not known. Nonetheless, we tested for gene conversion and
ectopic recombination with Sawyer's (1989) test. Of the 14 groups with more than two sequences, only two groups (groups 20 and 7) showed significant evidence of ectopic exchange at the 5% significance level,
and there is evidence for only one exchange event in each group (data
not shown). Of these two, group 20 showed evidence for positive
selection. In contrast, group 7 contains no evidence of positively
selected sites based on whole-sequence data but some evidence when the
LRR is examined separately (Tables 2 and 3; also see Discussion).
Overall, however, Sawyer's test provided little evidence of ectopic
exchange within groups, and there is no indication that ectopic events
contribute substantially to differences between groups with and without
positively selected sites.
DISCUSSION
Positive selection has been documented in genes that encode pathogen
surface proteins (Bush 2001; Peek et al. 2001), reproductive proteins
(Swanson et al. 2001a,b), and host defense systems like the human major
histocompatibility complex (Hughes and Nei 1988), plant chitinases
(Bishop et al. 2000), and NBS-LRR R genes (Meyers et al. 1998; Wang et
al. 1998; Bergelson et al. 2001). The correlation between positive
selection and host-pathogen interactions is particularly strong. For
example, a GenBank survey uncovered remarkably few sequences (0.45%)
evolving under positive selection, but more than half of these
sequences were involved in host-pathogen interaction (Endo et al.
1996).
The close relationship between host-pathogen interactions and positive
selection suggests that positive selection can form the basis for an
evolutionary profile to identify NBS-LRR genes that are likely to be
involved in Arabidopsis disease resistance. However, there are
at least two caveats to evolutionary profiling in this gene family. The
first caveat is that it is difficult to determine whether the detection
of positive selection is a predictor of function. If profiling is
accurate, characterized resistance genes should fall into groups for
which positive selection is consistently inferred. Our Arabidopsis
data include sequences for R genes RPS4 and RPS5,
as well as the Col-0 sequences that are most similar to the defense
genes RPP1, RPP5, RPP8/HRT, and RPP13 (Table
1), and these R genes have been inferred to be subject to positive
selection (Parker et al. 1997; Botella et al. 1998; McDowell et al.
1998; Warren et al. 1998; Bittner-Eddy et al. 2000). Five of these six
characterized R genes, or their likely orthologs, are in groups that
have positively selected amino acid sites under M8 (Tables 1 and 2);
hence, profiling correctly identifies these groups as containing
functional resistance genes. Positive selection in these groups was
also detected after known defense genes were removed from analysis
(data not shown), raising the possibility that the groups contain
multiple functional R genes.
The sixth characterized R gene, RPP1, has a putative Col-0 ortholog in group 5 (Table 1), in which positively selected sites were not detected (Table 2). There is, however, evidence for positive selection in this group based on the test of M0 versus M1 (Table 2). More interestingly, there is no evidence for positive selection in this group, by either the test of M7 versus M8 or the test of M0 versus M1, when the RPP1 ortholog is removed from analysis (data not shown). Thus, group 5 appears to consist primarily of sequences that lack a signature of positive selection. One cannot ascribe function to sequences based solely on tests for positive selection, but it is tempting to speculate that most genes in group 5 either do not play a role in defense or do not directly mediate pathogen interaction.
The most similar Col-0 homolog to the characterized resistance gene
RRS1 is in group 21. To date, there has been no evidence of
positive selection in RRS1 (Deslandes et al. 2002), and we detect no evidence for positive selection in group 21 (Table 2). We
should note, however, that group 21 consists of only two sequences, and
the power to detect positive selection in groups with two sequences
appears to be low (see below). We should also note that the Col-0
homolog of RRS1 contains a WRKY domain but unlike
RRS1 is not predicted to contain a nuclear localization signal
downstream of the LRR region (data not shown). It thus is unclear to
what extent RRS1 and its putative Col-0 homolog share functions.
Altogether, there is a strong correspondence between characterized resistance genes and positive selection. Six characterized resistance genes or their putative Col-0 orthologs fall into groups in which positive selection was detected. The putative Col-0 ortholog of a seventh gene, RRS1, does not belong to a group in which we detected positive selection, but the Col-0 ortholog appears to differ in domain structure relative to RRS1, suggesting it may not be functionally equivalent. More importantly, we have also identified sequence groups with extensive evidence of positive selection (e.g., groups 2, 12, and 13) that do not contain known R genes. These groups may contain uncharacterized, functionally active R genes.
The second caveat to evolutionary profiling is that it is subject to
analytical limitations. For example, positive selection was not
detected in groups with two sequences (Table 2), probably reflecting
low statistical power in tests with few sequences and also in tests
that average ω among codons (Anisimova et al. 2001). The statistical
power of the test is also sensitive to factors such as sequence length
and identity (Anisimova et al. 2001). To determine whether these
factors underlie our results, we contrasted four characteristics (the
number of sequences, sequence length, sequence identity, and tree
length) among groups. None of these four factors differed significantly
between groups with and without positively selected sites, suggesting
that sampling biases do not underlie detection of positively selected
sites. A final analytical consideration is that LR tests with M8 tend
to be conservative when the LR statistic is assumed to be
χ² distributed (Anisimova et al. 2001). The limitations of
the ML method, as applied here, tend to make it conservative, and it therefore is likely that some positively selected sites were not detected in our analyses (a Type II error). On the other
hand, this conservative bias suggests Type I errors may be rare.
Groups with and without positively selected sites could vary in biological factors other than their function (or lack thereof) in disease resistance. One such difference is ectopic recombination, or gene conversion, which could affect test statistics. We analyzed sequence groups for evidence of ectopic recombination. Only two groups (groups 7 and 20) contained evidence of gene conversion, suggesting both that gene conversion is not widespread within groups and that gene conversion does not contribute substantially to differences between groups with and without positively selected sites. We should note, however, that gene conversion may help explain some of the inconsistent results based on group 7 (see below). A second potential biological difference is that there could be differences in gene activity between the positively selected groups and the groups without positively selected sites. To investigate this possibility, we measured EST hits as a proxy for gene expression. There was no overall difference between groups with and without positively selected sites.
Genic Location of Positively Selected Sites
The ML method is thought to be more effective when applied to entire
genes, as opposed to separate sequence domains (Swanson and Yang 2002).
It therefore is not surprising that analyses on separate domains
identified fewer positively selected sites (105 vs. 116) in fewer groups
(8 vs. 10) than whole-sequence analysis. Nonetheless, the results of
domain and whole-sequence analyses were fairly consistent except for
group 7, in which 31 positively selected sites were identified in the
LRR domain analysis, but no positively selected sites were detected
with whole-sequence analysis (Tables 2 and 3). At present, the reasons
for this discrepancy are unclear, but it is possible that gene
conversion in this group (see Results) contributes to differences
between whole-sequence and domain analyses. Positive selection for
group 7 was detected with the test of M1 versus M0. With the exception of group 7, results were consistent between domain and whole-sequence analyses in two ways. First, the same amino acid sites were identified as positively selected. For example, without group 7, 75% of the 74 sites identified in domain analyses were also identified in whole-sequence analyses. Second, both analyses detected positive selection primarily in LRR domains (Table 3). The proportion of LRR
sites identified with domain analyses (97%) was greater than that
detected with whole-sequence analyses (70%), but this difference may
reflect relatively low statistical power in relatively short CC, TIR,
and NBS domains.
One important characteristic of the ML approach is that it identifies
positively selected sites without a priori delimitation of regions.
Thus, our approach differs substantially from previous analyses of
NBS-LRR sequences, because most previous analyses first targeted
subsequences within LRRs before applying tests for selection (Parniske
et al. 1997; Wang et al. 1998; Bergelson et al. 2001). Nonetheless,
prior studies have documented that positive selection is present in
particular subdomains of the LRR (Parniske et al. 1997; Meyers et al.
1998; Wang et al. 1998). Our results corroborate these earlier findings
and indicate that positive selection is predominantly targeted on the
LRR region, particularly the E4C5 submotif.
Unfortunately, we cannot make direct comparisons between our results
and previous results because previous papers calculated ω by
averaging among codons and thus did not identify individual codons in
which positive selection has occurred.
We mapped the location of positively selected sites from whole-sequence
analysis onto LRR secondary structure. One intriguing result from this
exercise is that secondary-structure prediction identified LRR regions
that were not detected by Pfam. Although this discrepancy occurred in
only two sequence groups (groups 11 and 18; Fig. 2), these results
suggest that secondary-structure analyses could be employed for
prediction and delineation of LRR domains. In PRI, for which the
tertiary structure has been solved, each LRR forms a short region of
β strand, followed by a loop, a region of α-helix, and another loop
that leads to a second LRR (Kobe and Deisenhofer 1994). The LRRs are
arranged so that the molecule resembles a horseshoe, with the
β-sheets lining the inner face and the α-helices lining the outer
face. It is hypothesized that the interactions of PRI with its ligands
are mediated primarily by β-sheets (Kobe and Deisenhofer 1995b). In
Arabidopsis NBS-LRR proteins, the E4C5
regions adopt a β-strand-loop structure and are frequent targets of
positive selection. Amino acid residues outside the
E4C5 are often conserved among the repeats of a
single sequence group, consistent with the hypothesis that
non-E4C5 residues are involved in interactions
between consecutive motifs (Fig. 2; Kobe and Deisenhofer 1995a; Jones
and Jones 1997).
One surprising aspect of our study is that 30% (34 of 116) of
positively selected sites identified in whole-sequence analyses were
not in LRRs. Based on whole-sequence analysis, 4 sites were in either
CC or TIR domains, 7 sites were in the NBS domain, and 23 sites were in
domain regions not identified by Pfam. Some of the latter may actually
be located in poorly defined LRR domains, but there remains an
appreciable number of positively selected sites outside the LRR.
Positive selection in non-LRR regions has been documented previously.
For example, a study of the flax L locus showed that the TIR
region contributes to resistance specificity and may be under positive
selection (Luck et al. 2000). We cannot assess directly the functional
importance of the positively selected sites in non-LRR regions, but it
is possible that these sites also play a role in intra- or
intermolecular interactions in protein complexes during recognition and signaling.
We found a high incidence of alignment gaps (or indels) in LRR regions.
This LRR elasticity was found in groups both with and without evidence
of positive selection, but the consensus LRRs contained few gaps in the
E4C5 region and predicted β-sheets (data not
shown). These observations may have structural implications; indels in
the predicted coil-helix-coil region may confer additional conformational variability that could lead to altered recognition specificities. Furthermore, predicted
α-helices do not appear in a regular pattern through all groups (Fig. 2);
α-helices occur in
alternate repeats, consecutive repeats, or not at all. The distribution
of α-helices in the secondary structure may also have an effect on
the tertiary structure of the LRR domain. Taken together, our results
suggest that several factors may both individually and
collectively influence the evolution of new resistance specificities
(Fig. 3). These factors include variation
in the position of indels in the backbone of the LRR domain,
hypervariability in the E4C5 region, changes in
secondary structure resulting from amino acid substitutions in the
backbone, and expansion/contraction in the overall number of LRR units.
|
| |
METHODS |
|---|
|
|
|---|
Sequences and Alignment
Complete amino acid sequences of 163 genes were retrieved from
At-RGenes (http://www.niblrrs.ucdavis.edu/At_RGenes/) in January 2001. The 163 sequences were aligned with CLUSTALW (Thompson et al. 1994), using default settings. Because the genes were too divergent
for analysis of positive selection, we partitioned sequences into
individual groups in two steps. First, we obtained a neighbor-joining phylogeny of the 163 genes using PAUP* version 4.0b6 (Swofford 2000). Second, based on this phylogeny and an identity matrix, we partitioned 22 phylogenetic clades into sequence groups by
three criteria: (1) the average identity in a group was >50% at the
amino acid level (Bergelson et al. 2001), (2) the proportion of
conservative amino acid substitutions was >50%, and (3) the percentage of gapped residues was <25%. The average identity within a
group was calculated with PAUP*; other statistics were calculated with GeneDoc version 2.5 (Nicholas et al. 1997).
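The first and third grouping criteria can be illustrated with a minimal Python sketch. It is not the PAUP* or GeneDoc computation: the function names, the toy sequences, and the choice to score identity only over columns where neither sequence has a gap are assumptions made for illustration. The second criterion (conservative substitutions) is omitted because it depends on a residue-similarity scheme that is not specified here.

```python
from itertools import combinations

def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def mean_identity(aligned):
    """Average pairwise identity across all sequence pairs in an alignment."""
    pairs = list(combinations(aligned, 2))
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)

def gap_fraction(aligned):
    """Fraction of all alignment cells that are gap characters."""
    total = sum(len(s) for s in aligned)
    return sum(s.count("-") for s in aligned) / total

def passes_criteria(aligned, min_identity=0.50, max_gaps=0.25):
    """Apply criteria (1) and (3) from the text; thresholds follow the stated cutoffs."""
    return mean_identity(aligned) > min_identity and gap_fraction(aligned) < max_gaps

# Hypothetical toy alignment, not real NBS-LRR data.
group = ["MKT-LLA", "MKTSLLA", "MRT-LIA"]
print(round(mean_identity(group), 3), round(gap_fraction(group), 3), passes_criteria(group))
```

A candidate group that fails either test would be split or its outlying sequences dropped before realignment.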
After grouping, we iteratively realigned sequences within each group using CLUSTALW and reestimated sequence identities. During the grouping process we eliminated 60 sequences that either did not fall into groups with the criteria outlined above or were lacking NBS-LRR protein domains. In some cases, we also trimmed the C-terminal ends of sequences that could not be aligned reliably (details available from authors). The remaining sequences contained NBS and LRR regions, and most contained either a TIR or a CC region (Table 1). Amino acid alignments were converted back into nucleotide sequence alignments, which were used in analyses. Alignments are available at http://bgbox.bio.uci.edu; gapped regions of alignments were not considered in subsequent positive selection analysis.
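Converting an amino acid alignment back into a nucleotide alignment amounts to threading each coding sequence onto its aligned protein, three nucleotides per residue. The sketch below shows one way this could be done, together with removal of gapped codon columns before selection analysis. The function names and toy sequences are hypothetical, and the coding sequences are assumed to be in frame with no terminal stop codon.

```python
def backtranslate(aligned_protein, cds):
    """Thread an unaligned coding sequence onto its gapped protein sequence."""
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    n_residues = sum(ch != "-" for ch in aligned_protein)
    if len(cds) % 3 != 0 or len(codons) != n_residues:
        raise ValueError("coding sequence length does not match protein")
    it = iter(codons)
    # Each protein gap becomes a three-nucleotide gap; each residue takes the next codon.
    return "".join("---" if ch == "-" else next(it) for ch in aligned_protein)

def drop_gapped_codons(codon_alignment):
    """Remove every codon column in which any sequence contains a gap."""
    n = len(codon_alignment[0]) // 3
    keep = [i for i in range(n)
            if all("-" not in s[3 * i:3 * i + 3] for s in codon_alignment)]
    return ["".join(s[3 * i:3 * i + 3] for i in keep) for s in codon_alignment]

# Hypothetical toy example.
aln = [backtranslate("MK-T", "ATGAAAACC"),
       backtranslate("MKST", "ATGAAGTCAACT")]
print(aln)
print(drop_gapped_codons(aln))
```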
For some analyses, we divided amino acid and nucleotide alignments from
groups with more than two sequences into putative TIR, NBS, CC-NBS, and
LRR domains. For the groups that have sequences similar to known R
genes, domain boundaries were determined from characterization of
the Arabidopsis R genes RPP5 (Parker et al. 1997), RPS5 (Warren et al. 1998), RPP1 (Botella et al. 1998), RPP8 (McDowell et al. 1998), RPS4 (Gassmann et al. 1999), and RPP13 (Bittner-Eddy et al. 2000) (Table 1). For groups that had no published information, domains were
determined for each sequence with Pfam (Sonnhammer et al. 1998) and
consensus domain regions were identified within each group.
Sequence Analyses
The ω ratio was calculated with the computer program
Codeml from PAML (Yang 1997; Yang et al. 2000). The
relative fit of codon substitution models was evaluated with likelihood ratio (LR) statistics, which are assumed to be
χ² distributed with degrees of freedom equal to the difference in the
number of parameters between models. LR tests for positive selection
compare a model in which there is a class of sites with ω > 1
against a model that does not allow for this class. We employed two LR
tests to compare codon substitution models (M). Yang, Nielsen, and
colleagues described the substitution models in detail (Nielsen and
Yang 1998; Yang et al. 2000), and here we use their notation. The first
compared M1 and M0; comparison between the two models identified
phylogenetic branches with ω > 1 in which positive selection had acted.
A second, more specific approach to detect positive selection is to
study variation in ω among sites. This variation is tested with an
additional LR test between M7 and M8. This test has been applied widely
(Yang et al. 2000; Swanson et al. 2001b), but for this study it is
important to note three test characteristics. First, detection of
positive selection requires significant differences between M7 and M8
and estimates of ω that exceed 1. Second, under M8 it is possible to
estimate the proportion of sites that are under positive selection, and
this proportion is denoted P1. Third, the
application of these models requires a topological, or phylogenetic, assumption. For each sequence group, PAML analyses were applied assuming the maximum parsimony (MP) tree obtained from
PAUP* branch-and-bound searches. For groups in which there
was no single MP tree, the neighbor-joining (NJ) tree was assumed. It
should be noted, however, that the ML approach is relatively
insensitive to topological assumptions (Yang et al. 2000).
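The LR test described above can be sketched in a few lines of Python. The statistic is twice the difference in log-likelihood between the nested models, compared against a χ² distribution. The log-likelihood values below are hypothetical, the function names are illustrative, and the χ² tail probability is computed with a standard recurrence for integer degrees of freedom rather than with any routine from PAML. For M7 versus M8 the models differ by two parameters, so df = 2.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)               # Q(1, h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))    # Q(1/2, h)
    # Recurrence Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
    while a < df / 2.0:
        q += h ** a * math.exp(-h) / math.gamma(a + 1.0)
        a += 1.0
    return q

def likelihood_ratio_test(lnl_null, lnl_alt, df):
    """Return the LR statistic 2 * (lnL_alt - lnL_null) and its chi-square P-value."""
    stat = 2.0 * (lnl_alt - lnl_null)
    return stat, chi2_sf(stat, df)

# Hypothetical log-likelihoods for one sequence group (not values from this study).
lnl_m7, lnl_m8 = -1234.56, -1225.10
stat, p = likelihood_ratio_test(lnl_m7, lnl_m8, df=2)
print(f"LR statistic = {stat:.2f}, P = {p:.2g}")
```

A significant result alone is not sufficient evidence of positive selection; as noted above, the M8 estimate of ω must also exceed 1.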
Positively selected sites were identified under M8 with the Bayesian
approach implemented in PAML (Nielsen and Yang 1998; Yang et al. 2000).
From groups with evidence of positive selection, based on the LR test,
we further examined sites that had a >90% posterior probability of
being in the ω > 1 class. It is also important to note that for
groups of two sequences, the only appropriate LR test is that between
M1 and M0. In these cases, ω was fixed at 1 for M0, whereas ω was
estimated for M1.
We mapped positively selected sites onto the secondary structure of
each protein. The protein secondary structure was predicted on the
complete amino acid sequences of each group with SSPro (Baldi et al. 1999), using default settings. This program assigns the
highest probability secondary structure, either α-helix (H), β-strand (E), or coil (C), to each amino acid residue. Results were
mapped onto the amino acid alignment with GeneDoc (Nicholas et al. 1997).
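As a rough illustration of the mapping step, the sketch below filters sites by the >90% posterior probability cutoff and tallies them by predicted secondary-structure state. The posterior probabilities and the H/E/C string are invented, the function names are hypothetical, and both inputs are assumed to share one coordinate system (one entry per ungapped alignment column).

```python
from collections import Counter

def selected_sites(posteriors, cutoff=0.90):
    """Return 0-based positions whose posterior probability of omega > 1 exceeds the cutoff."""
    return [i for i, p in enumerate(posteriors) if p > cutoff]

def tally_by_structure(sites, structure):
    """Count selected sites per predicted state: H (helix), E (strand), C (coil)."""
    return Counter(structure[i] for i in sites)

# Hypothetical per-site values for a 16-residue stretch.
structure = "CCEEEECCCCCHHHHC"
posteriors = [0.10, 0.20, 0.95, 0.40, 0.99, 0.91, 0.30, 0.92,
              0.50, 0.10, 0.05, 0.60, 0.93, 0.20, 0.10, 0.05]

sites = selected_sites(posteriors)
print(sites, dict(tally_by_structure(sites, structure)))
```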
Gene conversion was assessed by the method of Sawyer (1989), as
implemented in the program Geneconv
(http://www.math.wustl.edu/~sawyer/geneconv/). The test was applied
to the nucleotide alignments of the 14 groups that contained three or
more sequences and considered only synonymous sites. Amino acid
differences among sequences were not appropriate for this test, for two
reasons. First, the amino acid differences may be driven by positive
selection, but the test assumes sequence differences are selectively
neutral. Second, amino acid differences among sequences are clustered
in LRR regions, thus potentially causing spurious results.
Geneconv reports global P-values based on an
entire alignment; significance was based on these P-values
after correction for multiple tests.
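The text does not name the multiple-test correction that was applied. As an assumption for illustration only, the sketch below uses a Bonferroni adjustment, in which each global P-value is multiplied by the number of groups tested (14 in this study; three hypothetical groups are shown here). The group labels and P-values are invented.

```python
def bonferroni(p_values):
    """Multiply each P-value by the number of tests, capping the result at 1."""
    m = len(p_values)
    return {name: min(1.0, p * m) for name, p in p_values.items()}

# Hypothetical global P-values for three sequence groups.
raw = {"group_a": 0.001, "group_b": 0.02, "group_c": 0.5}
adjusted = bonferroni(raw)
significant = [name for name, p in adjusted.items() if p < 0.05]
print(adjusted, significant)
```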
| |
WEB SITE REFERENCES |
|---|
|
|
|---|
http://bgbox.bio.uci.edu; The Web site from which aligned LRR sequences from this study can be downloaded.
http://mips.gsf.de/proj/thal/; The Munich Information Center for Protein Sequences contains Arabidopsis thaliana EST information.
http://www.math.wustl.edu/~sawyer/geneconv/; The location of Geneconv, a program that tests for gene conversion.
http://www.niblrrs.ucdavis.edu/At_RGenes/; The database of Arabidopsis NBS-LRR encoding disease resistance gene homologs.
| |
ACKNOWLEDGMENTS |
|---|
We thank Z. Yang and anonymous reviewers for useful advice and suggestions, and P. Tiffin and R. Michelmore for discussion. This work was supported by a UC-MEXUS Scholarship and a fellowship from the School of Biological Sciences, University of California Irvine to M.M.P., by NSF grants 98-15855 and 01-13498 to B.S.G. B.C.M. is supported by NSF grant 99-75971.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| |
FOOTNOTES |
|---|
3 Present address: Department of Plant and Soil Sciences, University of Delaware, Newark, Delaware 19711, USA.
4 Corresponding author.
E-MAIL bgaut{at}uci.edu; FAX (949) 824-2181.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.159402.
| |
REFERENCES |
|---|
|
|
|---|
B.
J. Biol. Chem.
274:
14560-14567
a versatile binding motif.
TIBS
19:
415-421.