Vol 13, Issue 4, 673-692, April 2003
METHODS
Ran's C-terminal, Basic Patch, and Nucleotide Exchange Mechanisms in Light of a Canonical Structure for Rab, Rho, Ras, and Ran GTPases
Andrew F. Neuwald1,3,
Natarajan Kannan1,
Aleksandar Poleksic1,
Naoya Hata1 and
Jun S. Liu2
1Cold Spring Harbor Laboratory, Cold Spring Harbor, New
York 11724, USA; 2Department of Statistics, Harvard
University, Cambridge, Massachusetts 02138, USA
 |
ABSTRACT
|
|---|
Proteins comprising the core of the eukaryotic cellular machinery
are often highly conserved, presumably due to selective constraints
maintaining important structural features. We have developed
statistical procedures to decompose these constraints into distinct
categories and to pinpoint critical structural features within each
category. When applied to P-loop GTPases, this revealed
within Rab, Rho, Ras, and Ran a canonical network of molecular
interactions centered on bound nucleotide. This network presumably
performs a crucial structural and/or mechanistic role considering that
it has persisted for more than a billion years after the divergence of
these families. We call these FY-pivot GTPases after their most
distinguishing feature, a phenylalanine or tyrosine that functions as a
pivot within this network. Specific families deviate somewhat from
canonical features in interesting ways, presumably reflecting their
functional specialization during evolution. We illustrate this here for
Ran GTPases, within which two highly conserved histidines, His30 and
His139, strikingly diverge from their canonical counterparts. These,
along with other residues specifically conserved in Ran, such as Tyr98,
Lys99, and Phe138, appear to work in conjunction with FY-pivot
canonical residues to facilitate alternative conformations in which
these histidines are strategically positioned to couple Ran's basic
patch and C-terminal switch to nucleotide exchange and effector
binding. Other core components of the cellular machinery are likewise
amenable to this approach, which we term Contrast
Hierarchical Alignment and
Interaction Network (CHAIN)
analysis.
[Supplemental material is available online at
www.genome.org.]
Certain proteins constitute the core of the
cellular machinery inasmuch as they mediate essential processes, such
as DNA replication and repair, transcription, translation, transport,
and basic metabolism. In eukaryotes, such core proteins also
include the motor proteins kinesin and myosin, structural proteins,
such as histones, actin, and tubulin, and regulatory and signaling
factors, such as certain protein kinases and Ras-like GTPases. Because
of their critical cellular roles, these proteins are conserved across
major eukaryotic taxa and, in some cases, across all divisions of life.
Our knowledge of the structure, function, and mechanism of some core
proteins seems quite extensive. Consider, for example, signaling P-loop
GTPases related to Ran, a component of the nuclear transport machinery
examined here in detail. There are currently dozens of structures of
these GTPases (Vetter and Wittinghofer 2001 ), either alone or in
complex with various regulators and effectors, and thousands of related
sequences available. Moreover, published experimental studies on these
proteins number in the tens of thousands. Our knowledge of certain
other eukaryotic core proteins is similarly extensive.
Nevertheless, there is reason to believe that we may have barely
scratched the surface in understanding these proteins. In particular,
strong functional constraints must be preserving the high degree of
sequence conservation seen across major taxonomic categories for many
core protein families and subfamilies. This is seen, for instance, in
the Ran alignment of Figure
1A. Although some of these
conserved residues are involved in substrate binding and catalysis,
many others have no known functionyet their persistence over 12
billion years of evolution implies that they are functionally quite
important. Indeed, when even minor side-chain modifications, such as
attachment or removal of an OH or CH3 group, are
consistently eliminated by natural selection, such residues are likely
to establish critical interactions similar to those established by key
catalytic residues. If so, then the patterns of conservation present in
these and related proteins may contain implicit information regarding
important but unknown structural mechanisms.


View larger version (138K):
[in this window]
[in a new window]
|
Figure 1. Ran family alignments generated by CHAIN analysis procedures.
(A) Conventional multiple alignment. The leftmostcolumn specifies each sequence's phylum; these are colored by
major eukaryotic taxa as follows: metazoans, red; fungi, dark yellow;
plants, green; protozoans, cyan. The top sequence is the
query. The NCBI sequence identifiers are: 5453555, 17553976, 3113905,
6323324, 11067497, 1710007, 13812290, 4881271, 585782, 1172840,
15691764, 606985, 8593487, 14089387, 585780, and 14581093.
(BD) Contrast hierarchical alignment. CHAIN
analysis applies three different sequence highlighting schemes to the
Ran family alignment in A to reveal the selective constraints
most characteristic of each of three hierarchical categories, which
here correspond to P-loop GTPases in B; FY-pivot GTPases in
C; and the Ran family in D. Organism descriptions
(leftmost column) are colored by category as specified in
Figure 2; sequences obtained from ESTs are indicated. Note that for the
Basidiomycota protein, which was predicted from an EST, the
replacement of histidine (H) at position 30 by aspartate (D) is likely
due to a sequencing error. Chemically similar highlighted residues are
colored similarly. Histograms above the alignments display the relative
strength of the inferred selective constraint acting at each position
within that category (quasi-logarithmic scaling is used; see Methods).
This and other aspects of this representational scheme are explained in
Figure 2. Dots below the histograms (and directly above the alignments)
indicate those residues specifically assigned to each category. Gray
dots in B and C indicate positions for which Ran
deviates from the canonical residues for that category. Note that the
conventional alignment in A helps identify residues associated
with intermediate categories, which correspond to conserved positions
in Ran that are inconsistently conserved within the three categories of
this hierarchy. A few residue positions (such as T42Ran) are
misclassified in this analysis due to alignment errors; these were
detected and addressed in our analysis through structural studies and
CHAIN analysis of related GTPases.
|
|
To access this information, we devised a statistically based approach
called CHAIN (Contrast Hierarchical
Alignment and Interaction Network)
analysis, which decomposes the sequence constraints associated with
conserved patterns in a multiple alignment into distinct categories,
while pinpointing critical features within each category (as, e.g., in
Fig.1BD). These categories presumably correspond to a series of
evolutionary adaptations leading to functional specialization and
divergence of related sequences into superfamilies, superfamilies into
families, and so forth (Fig. 2A). We
applied this approach to various protein families within the large and
diverse class of P-loop GTPases (Hall 2000 ; Leipe et al. 2002 ). This
revealed that the Ras, Rab, Rho, and Ran families share sequence
constraints (Fig. 3B) corresponding to a
network of generally conserved structural interactions (displayed in
Fig. 4 and described below). Furthermore,
detailed analysis of Ran GTPases, within the context of these canonical
features, leads to striking observations relevant to Ran's C-terminal,
basic patch, and nucleotide exchange mechanisms.

View larger version (40K):
[in this window]
[in a new window]
|
Figure 2. CHAIN analysis representational schemes. The examples shown in
AD correspond to the hierarchical alignment in
Figure 1. (A) Venn diagram representing hierarchical
relationships between aligned sequence sets. The dotted oval
corresponds to a hypothetical intermediate category consisting of
conserved residues in Ran that fall outside the categories of this
particular hierarchy. (B) Notation used within hierarchical
alignments. Position 139 of Ran is shown. The main and superfamily
sets, which contain too many sequences to display directly, are
represented in the alignment as position-specific conserved patterns.
(The total number of sequences for these two categories is shown in
parentheses.) The corresponding residue frequencies (res_freq) are
given in integer tenths below conserved residues. For example, a 5
in integer tenths indicates that the corresponding residue directly
above it occurs in 50%60% of the (weighted) sequences. Insertion
and deletion frequencies are similarly given in integer tenths (black;
range 10%100%) or hundredths (gray; range 1%9%) as indicated.
Ran family aligned residues are displayed directly. Histogram bar
heights are approximately logarithmically proportional to the measure
of selective constraint (see Methods), as defined by the following urn
model. (C) Urn model for measuring the selective constraint
acting on a specific position. The residues observed in the main set at
this position are modeled as distinctly colored balls in an urn. Some
of the colors are similar (representing biochemically similar amino
acids). The selective constraint is then defined as the difficulty of
drawing by chance at least as many of the same- or similarly-colored
balls from the urn as are observed in the subalignments (in this case,
alanine for the FY-pivot superfamily or histidine for the Ran family).
Note that our analysis of the Ran family uses the main set as the
superalignment urn (see Methods); alternatively, the FY-pivot
GTPases may also be used as the superalignment urn, though the sparser
data set would yield less accurate background frequency estimates. Note
that the alignments in Figures 1A and 1B measure sequence constraints
using a standard background model (see Methods). (D) Color
scheme used for residue side-chains in Figures 49. (E) Color
scheme for structural regions described in the text and figures. The
structure of Sec4p (pdb code: 1G17) is shown in complex with a GTP
analog (cyan) and magnesium (dark green).
|
|

View larger version (49K):
[in this window]
[in a new window]
|
Figure 4. Sec4p as a structural prototype of FY-pivot GTPases. The structure of
Sec4p is shown in complex with GDP (pdb code: 1G16). The corresponding
hierarchical alignment is given in Figure 3. Hydrogen bonds are
depicted as dotted lines, and aromatic-aromatic and van der Waals
interactions as dot clouds. Dotted lines into clouds depict CH- or
NH- interactions (Weiss et al. 2001 ). Color scheme: GDP (cyan);
main-chain traces and residue designations (colored by regions as
indicated in Fig. 2E); residue side-chains and canonical
glycine main-chains (color scheme of Fig. 2D); oxygen,
nitrogen, and hydrogen atoms establishing hydrogen bonds (red, blue,
and white, respectively); hydrogen bonding carbons (colored as their
corresponding side-chains). Figures were generated using RasMol (Sayle
and Milner-White 1995 ). (A) Canonical interactions between the
LV.D and NK.D regions. These include a perpendicular aromatic-aromatic
interaction (F108-Y100), four CH- interactions (F108-Y100,
G132-Y100, V142-F108, and I102-R140), and main-chain hydrogen bonds to
two side-chains (T107 and R140). Y100 is the FY-pivot residue. The
inset highlights interactions between the LV.D and P-loop regions.
(B) Canonical interactions within and between the NK.D and SA
regions. These include packing of a phenylalanine or tyrosine (F158)
against two small residues (G147 and A151) in the helix following the
NK.D loop, a salt bridge (R140 and E160), and several main-chain
hydrogen bonds. Residues within the LV.D region are shown for
comparison with A. (C) Canonical interactions between
the SA and Switch I regions. Note that, unlike Ran (Figs. 7D,
8), there are no major structural differences between the GDP and
GTP-forms of Sec4p in these regions. Also shown is a previously noted
(Hall 2000 ) canonical aromatic-aromatic interaction between a
phenylalanine (F45) and bound guanine. Residues homologous to R39
(mainly arginine, glutamine, or serine), though inconsistently
conserved at the sequence level, conserve hydrogen bonding interactions
with the SA region.
|
|
 |
RESULTS AND DISCUSSION
|
|---|
Identification of a Distinct Class of P-loop GTPases
Our analysis of P-loop GTPases involved the construction of contrast
hierarchical alignments (described in Methods) for about 35 distinct
families and subfamilies, many of which correspond to core components
of the translational and eukaryotic signaling machinery (Leipe et al.
2002 ). Among the various categories of GTPases analyzed in this way, we
focus here on one category primarily consisting of the Ras (Bar-Sagi
and Hall 2000 ; Crespo and Leon 2000 ), Rho (Bar-Sagi and Hall 2000 ;
Symons and Settleman 2000 ), Rab (Pfeffer 2001 ; Stenmark and Olkkonen
2001 ; Tuvim et al. 2001 ), and Ran (Clarke and Zhang 2001 ; Moore 2001 )
families, but also including a few less characterized GTPases, such as
Pike (Ye et al. 2000 ). For these GTPases, we examined corresponding
molecular interaction networks (see Methods) within currently available
structures, of which there are around 70. This revealed that
the canonical residues distinguishing these from other
GTPases (Fig. 3B) correspond to a network of structural
interactions within and between six regions surrounding the
nucleotide-binding site (Fig. 2E). These regions include the classical
P loop, Switch I, and Switch II regions and three others, which we term
the LV.D, NK.D, and SA regions after their characteristic
motifs (Fig. 3A; Valencia et al. 1991 ; Wittinghofer 1999 ; Hall 2000 ;
Vetter and Wittinghofer 2001 ). We call these proteins the FY-pivot
GTPases after their most distinguishing feature, a phenylalanine or
tyrosine that resembles a pivot near the center of this network (see
below).
CHAIN analysis of these GTPases is robust. Using various FY-pivot
GTPases as the query, we consistently find this networkthough the
characteristic residues highlighted within distinct hierarchical
alignments can vary somewhat, mainly due to occasional misaligned
regions. Close examination of these alignments and of the corresponding
structures, however, readily allowed us to detect and account for these
discrepancies. The canonical FY-pivot features found in this way are
best seen in the hierarchical alignment of Figure 3, which was created
using yeast Sec4p as the query and a structural alignment of FY-pivot
GTPases as the query family (see Methods).
We chose Sec4p as the query for this alignment because it retains all
or nearly all of the most characteristic FY-pivot canonical features
(as defined by the alignment in Fig. 3B) whereas members of other
families sometimes diverge from these features in various ways.
Divergent residues are often invariant within the family in which they
occur, suggesting that they perform a family-specific function (this
phenomenon is explored further for Ran below). Assuming that the most
recent common ancestor of these GTPases possessed all of the canonical
features, it would thus have more closely resembled Sec4p than these
other GTPases and may therefore have functioned in vesicle transport
prior to its duplication and divergence into these various families. A
high degree of divergence is found in Pike GTPases, which nevertheless
retain many of the canonical FY-pivot features. Notably, our analysis
clearly places the Ras-like GTPases Arf and Sar outside of this class.
FY-pivot GTPases function as binary molecular switches that amplify and
relay signals controlling various cellular pathways (for review, see
Hall 2000 ; Takai et al. 2001 ). They are associated with regulators that
modulate their on/off status and with effectors that mediate their
downstream effects when in the on or GTP-bound state (Hall 2000 ).
Regulators include guanine nucleotide exchange factors (GEFs), which
turn the switch on by exchanging bound GDP with GTP, and
GTPase-activating proteins (GAPs), which turn the switch off by
stimulating GTP hydrolysis. Effectors for Ran include Ran binding
proteins (RanBPs; Dingwall et al. 1995 ) and nuclear transport receptors
belonging to the importin- family (Strom and Weis 2001 ).
FY-Pivot Canonical Features
Figure 4 portrays the Rab-family GTPase Sec4p as a structural
prototype of FY-pivot GTPases with an emphasis on canonical features
relevant to our analysis of Ran below. It displays these features from
three perspectives (Fig. 4AC) that are also used for subsequent
figures (Figs. 57, respectively) in order to facilitate comparison
between families. For nearly all of these canonical residues, no
functional roles have previously been proposed.
Figure 4A represents the canonical interactions between the LV.D region
and the NK.D and P-loop regions. These interactions are well conserved
within the Ras, Rho, and Ran families (Fig.
5). Near the center of these interactions
is the most distinctive featurethe phenylalanine or tyrosine
(Y100Sec4 in Fig. 4A) immediately preceding the aspartate
(D101Sec4) of the LV.D motif (Fig. 3A). We refer to this
residue as the FY-pivot because, within distinct structures and
conformational states, it consistently maintains two CH- hydrogen
bonds with neighboring canonical residues through rotation or pivoting
of its aromatic ring. The formation of these CH- bonds involves
interaction of the electron clouds on either side of the FY-pivot
aromatic ring with two hydrogens: A C hydrogen donated by
a canonical phenylalanine or tyrosine residue (F108Sec4),
which also forms a perpendicular aromatic-aromatic interaction with the
FY-pivot, and a C hydrogen donated by a canonical glycine
residue (G132Sec4).
This glycine's and the FY-pivot's main-chains together form a
parallel sheet (not shown), within which the glycine occurs at a
non-hydrogen bonded position. This type of CH- interaction is
predicted to stabilize the -stranded conformation of glycine (Merkel
and Regan 1998 )an otherwise intrinsically destabilizing residue in
sheets. Because this glycine immediately precedes the NK.D motif,
its -stranded conformation may be important for proper positioning
of motif residues when they are bound to guanine and to the main-chain
of the P loop, which, in turn, directly binds nucleotide phosphates.
Conversely, disruption of this CH- interaction may destabilize
nucleotide binding.
Two other FY-pivot canonical residues (W114Sec4 and
D28Sec4; Fig. 4A inset) indirectly link the LV.D region to
the GTP- or GDP-bound state of the protein via hydrogen bonds to P-loop
residues, the main-chain oxygens of which hydrogen bond to the P-loop
lysine directly interacting with GTP or GDP. Below we explore how these
interactions may facilitate Ran nucleotide exchange.
Figure 4B represents the canonical interactions between the SA and NK.D
regions. One of these consists of a salt bridge between an arginine
(R140Sec4) and a glutamate (E160Sec4) that is well
conserved in FY-pivot GTPases (Fig.
6)though sequence
variability often obscures the canonical arginine within multiple
sequence alignments (Fig. 3B). Another canonical interaction involves
two small residues (G147Sec4, A151Sec4) that
facilitate packing of the helix following the NK.D motif against a
canonical phenylalanine or tyrosine (F158Sec4). Although
these small residues are well conserved within the Rab, Ras, and Rho
families (Figs. 4B and 6A,B), Ran strikingly diverges from these
canonical residues (I136Ran and H139Ran in Fig. 6C)
in a manner that presumably reflects its specific function, as we
discuss further below.

View larger version (27K):
[in this window]
[in a new window]
|
Figure 6. Canonical interactions between the NK.D and SA regions within the Ras,
Rho, and Ran families. The perspective is as shown in Figure
4B. Representations and coloring are as described in Figure 4.
(A) Human Ras-GTP (pdb code: 1QRA). (B) The Rho
family human Rac1 in complex with a GTP analog (pdb code: 1MH1).
(C) Canine Ran-GDP (pdb code: 1BYU).
|
|
Figure 4C represents three canonical interactions associated with the
Switch I and SA regions: (1) an interaction between two aromatic
side-chains (F40Sec4 and F173Sec4), (2) a hydrogen
bond between the backbone of the SA loop and a weakly conserved residue
(R39Sec4p) near the Switch I region, and (3) a (well known)
coordination between a phenylalanine (F45Sec4) and bound
guanine. Given their locations, these interactions may help coordinate
the binding or release of guanine nucleotide with conformational
changes in the Switch I region.
Figure 7 shows the Figure 4C canonical
interactions within other FY-pivot GTPases. In about 12% of these
GTPases, a leucine residue (e.g., L23Ras in Fig. 7A) replaces
the canonical Switch I aromatic residue (F40Sec4). This
common deviation may be viewed as a reversion back to a noncanonical
state, as roughly a third of all P-loop GTPases contain
leucine at this position. In Ran, however, this canonical residue is
replaced by a histidine (H30Ran in Fig. 7D). CHAIN analysis
clearly reveals that this histidine, although aromatic, is a highly
unusual replacement at this position (as indicated by the histogram in
Fig. 1D). Indeed, among all P-loop GTPases, histidine
appears to be tolerated in Ran alone: Only three out of over 5000
P-loop GTPases outside of the Ran family contained histidine at this
position, and these may be due to single base sequencing errors. The
possible role of this histidine is discussed below.
Ran GTPases
Ran controls mitotic spindle assembly and nuclear envelope formation
and directs nuclear transport of proteins bound to importin- via the
adaptor protein importin- , which recognizes classical nuclear
localization signals (Clarke and Zhang 2001 ; Moore 2001 ). When a
complex of importin- / and cargo protein enters the nucleus,
binding of Ran-GTP to importin- induces release of the cargo. Levels
of Ran-GTP are elevated in the nucleus due to chromatin-bound RCC1, the
Ran nucleotide exchange factor. When the Ran-GTP-importin- complex
exits the nucleus, importin- is released via interaction of Ran with
RanBP and Ran-GAP, which together activate GTP hydrolysis to generate
Ran-GDP. Ran-GDP is recycled back into the nucleus via a nuclear
transport factor.
Ran differs from other Ras-like GTPases in that it has a long
C-terminal extension consisting of a linker, an -helix, and an
acidic tail (DEDDDL). The acidic tail stabilizes Ran's GDP-bound state
and mediates interactions with RCC1, Ran-GAP, and RanBP (Richards et
al. 1995 ). In Ran-GDP, this C-terminal extension packs up against the
GTPase domain such that the acidic tail is located near a basic patch
region (139HRKK142) of Ran. In other Ran complexes
this C-terminal extension often undergoes a dramatic conformational
change or C-terminal switch.
Ran's Basic Patch and C-Terminal Switch
Our analysis points to H30Ran and H139 Ran as
the two residues that most distinguish the Ran family from other P-loop
GTPases (see Fig. 1D) and links these residues to Ran's basic patch
and C-terminal switching mechanisms. These residues, though highly
conserved in Ran, strikingly diverge from the FY-pivot canonical
residues at those positions.
H30Ran (Fig. 7D) corresponds to and diverges from the
canonical phenylalanine or tyrosine (F40Sec4 in Fig. 4C)
within the P-loop/Switch I region that establishes an aromatic stacking
interaction with a canonical phenylalanine (F173Sec4) within
the SA region. H30Ran establishes a similar aromatic-aromatic
interaction in those Ran complexes for which the C-terminal extension
dissociates from the GTPase domain (Fig.
8A). In the GDP-bound form, however, this
interaction is disrupted, and instead H30Ran hydrogen bonds
with the backbone nitrogen of residue 33 (Fig. 8B), which is within a
region of Switch I (3234Ran) that would otherwise clash
with the C-terminal linker (182184Ran; Vetter et al.
1999a ,b ). This bond also serves as an N-cap (Presta and Rose 1988 ;
Richardson and Richardson 1988 ) favoring formation of a short -helix
involving residues 3135, thatalong with other stabilizing
interactions by Ran-conserved residues (Fig. 8B)moves a canonical
phenylalanine (F35Ran) out of coordination with guanine.
Moreover, upon formation of this bond, nearby main-chain atoms and the
sequence-adjacent conserved arginine (R29Ran) are
repositioned for hydrogen bonding to the C-terminal linker (Fig. 8B
inset). Thus, H30Ran appears to play a central role in
establishing Ran-GDP's C-terminal conformation.
H139Ran (Fig. 6C) likewise corresponds to and diverges from a
canonical alanine typically found at this position (A151Sec4
in Fig. 4B). Notably, histidine is a particularly bulky and polar
substitute for alanine. A nearby isoleucine residue (I136Ran
in Fig. 6C) is a similarly bulky substitute for the canonical alanine
or glycine at that position (G147Sec4 in Fig. 4B). Moreover,
there is a single residue deletion between I136Ran and
H139Ran relative to the canonical residues. (This deletion is
evident in the structure-based alignment of Figure 3, but in the
sequence-based alignment of Figure 1C this deletion and
I136Ran are misaligned relative to the FY-pivot consensus
pattern below the Ran family alignment.) Due to their small size, the
canonical residues at these positions facilitate packing of the helix
in which they occur against a canonical phenylalanine or tyrosine
(F158Sec4 in Fig. 4B). In Ran this helix contains the basic
patch (139HRKK142) that, in the GDP-bound state, is
believed to interact with the C-terminal acidic tail (Fig.
9A). We propose that H139Ran,
I136Ran, and the deletion between them destabilize packing of
the basic patch helix against the canonical Y146Ran and
thereby facilitate distinct conformation states, which are stabilized
through alternative interactions involving these and other residues. In
agreement with this notion, the conformation of the basic patch varies
between different Ran-GDP structures (Scheffzek et al. 1995 ; Stewart et
al. 1998a ,b ).
Y98Ran appears to help mediate conformational changes of the
basic patch helix (Fig. 9). This tyrosine residue is invariant in Ran,
but in other FY-pivot GTPases phenylalanine typically occurs at this
position (Figs. 1, 3). This suggests that the tyrosine OH group,
though apparently selected against in other GTPases, is critical to
Ran's function. Consistent with this notion, the Y98RanOH
group hydrogen bonds to the basic patch helix in various ways depending
on Ran's conformational state. In GTP-bound forms complexed with
either importin- (Chook and Blobel 1999 ; Vetter et al. 1999a ) or
RanBD1 (Chook and Blobel 1999 ; Vetter et al. 1999a ,b ; the structure of
Ran-GTP by itself is not yet available), this OH group hydrogen bonds
with two main-chain nitrogens (Fig. 9C). In Ran-RCC1 (Renault et al.
2001 ), the OH group hydrogen bonds to the side-chain nitrogen of the
basic patch residue K141Ran (Fig. 9B), whereas in Ran-GDP it
typically hydrogen bonds to a main-chain oxygen (Fig. 9A; the latter is
found in nine out of 12 of the available structures of Ran-GDP; in two
of the three remaining structures the OH group is slightly shifted
away from this configuration, and in one it binds to the same
main-chain nitrogens as in the Ran-GTP complexes).
In all available forms of Ran, the orientation of Y98Ran is
stabilized through canonical CH- and aromatic-aromatic interactions
with the FY-pivot residue (F90Ran in Fig. 9AC). In
nucleotide-bound forms, further stabilization of the basic patch helix
occurs through interaction of I136Ran with canonical
Y146Ran and through the canonical salt bridge between the
NK.D and SA regions (Figs. 9A,C) (these interactions are maintained in
all of the 17 currently available structures of nucleotide-bound Ran).
These observations, together with others described below, suggest that
I136Ran and H139Ran destabilize the basic patch
helix and thereby facilitate distinct conformation states, which are
stabilized through alternative interactions involving these residues,
Y98Ran, and other residues depending on the particular
conformational state. At the same time, nearby canonical residues
provide a structural context for these conformational changes, which we
believe to be important for the following nucleotide exchange
mechanism.
RCC1-Mediated Nucleotide Exchange
The main contributors to guanine nucleotide-binding affinity, and
thus the main inhibitors of nucleotide exchange, are the interaction of
the P-loop lysine (Sigal et al. 1986 ; John et al. 1988 ; Klebe et al.
1993 ) and main-chain (Rensland et al. 1995 ) with phosphate, and the
coordination of Mg2+ (Hall and Self 1986 ; Klebe et al. 1995 ).
Based on these facts and on various GTPase-GEF structures, Renault et
al. (2001) proposed a general nucleotide exchange mechanism, a key
component of which is disruption of the interaction of nucleotide
phosphates with the invariant P-loop lysine and with Mg2+.
Renault et al. (2001) also proposed a specific Ran-RCC1 exchange
mechanism, a key component of which is insertion of a surface-exposed
-wedge on RCC1 between the Ran Switch II and P-loop regions.
This displaces the P loop, thereby changing the orientation of Ran's
GDP binding sites relative to the guanine base and the phosphate
groups. This favors nucleotide release because, based on studies with
Ras (Rensland et al. 1995 ), the relative orientation of these sites is
critical to nucleotide binding. Another component of this Ran-specific
mechanism is that, upon nucleotide release, the base leaves first and
the phosphate last, whereas, upon reentry, the phosphate comes in first
and the base last. Our analysis suggests additional features of this
exchange mechanism where, in particular, H139Ran and
Y98Ran play important roles.
Upon binding to Ran, RCC1 interacts most extensively with the LV.D
region (Fig. 9; Renault et al. 2001 ) and appears to pull it away from
its indirect link to the P-loop lysine (cf. W104Ran in Figs.
9D,E), perhaps thereby transiently tugging on the interaction of this
lysine with the phosphate of GDP. The RCC1-Ran interaction may also
alter binding to the guanine base, because the LV.D region is connected
via canonical interactions (Fig. 5C) to the NK.D loop, which binds both
to guanine and the P loop.
To envision precisely what may occur, consider that upon binding to
RCC1 a threefold CH- interaction (Fig. 9B), which we term a CH-
triad, forms between the FY-pivot residue, F90Ran, and two
other aromatic residues, Y98Ran and F138Ran.
Notably, in Ran the FY-pivot is always phenylalanine rather than
tyrosine, which is more commonly found at this position in other
FY-pivot GTPases and which cannot participate in such a triad due to
its OH group. This CH- triad may thus perform a specific function
associated with the Ran-RCC1 complex. Further support for this notion
is provided by the sequence adjacency of Y98Ran to
K99Ran, a key residue that inserts into and hydrogen bonds to
the central hole of the RCC1 -propeller structure (Fig. 9E; Renault
et al. 2001 ). Indeed, among all the residues of Ran, K99 establishes
the greatest surface contact (164 Å2) with RCC1.
In forming the CH- triad from Ran-GDP, Y98Ran loses a
CH- interaction with canonical V131Ran (cf. Fig. 9A,B) and
gains one with V101Ran (cf. Fig. 9D,E), whereas the OH
group of Y98Ran loses its hydrogen bond with a
main-chain C=O group and gains one with K141Ran (cf. Fig.
9A,B). This allows this C=O group and the main-chain nitrogen of
F138Ran to form a hydrogen bond (not shown), thereby
establishing a typical main-chain -helical conformation that appears
to help position F138Ran within the CH- triad.
At the same time, H139Ran flips to the other side of the
basic patch helix (Fig. 9A,B), a conformational change facilitated by
participation of the sequence-adjacent F138Ran in the CH-
triad and by the establishment of two interactions that explain the
absolute requirement for histidine at position 139: an
aromatic-aromatic interaction with canonical Y146Ran and a
hydrogen bond to a side-chain oxygen of canonical D148Ran. In
Ran, aspartate may occur at position 148 more often than glutamate,
which is prevalent in other FY-pivot GTPases, to better facilitate this
hydrogen bondthe formation of which also disrupts the canonical salt
bridge.
This H139Ran flip also brings it closer to the canonical
location for this residue relative to Y146Ran, whereas
I136Ran swings away from its interaction with
Y146Ran. These coordinated changes may be aided by the single
residue deletion between I136Ran and H139Ran,
inasmuch as the deletion disfavors concurrent interaction of both of
these residues with Y146Ran. Moreover, recall that
F90Ran of the CH- triad also forms a CH- interaction
with the glycine residue directly preceding the NK.D motif
(G115Ran in Fig. 5C). Thus one may easily imagine these
atomic rearrangements twisting the NK.D loop (cf. Fig. 9A,B) in
coordination with the previously mentioned movement of the P loop. This
may lead to opposing movements of the guanine and phosphate binding
sites, thereby leveraging separation of the P loop from the
phosphate or separation of the guanine base from the NK.D loop (Fig.
9F). The latter is consistent with the guanine base first
nucleotide release mechanism proposed for Ran-RCC1 (Renault et al.
2001 ).
These events may also help dislodge Ran's C-terminal extension from
the GTPase domain, given the proposed contact between the basic patch
and the C-terminal acidic tail. Indeed, an active role for RCC1 in
inducing the C-terminal switch has been suggested based on structural
considerations (Renault et al. 2001 ) and in vitro experiments showing
that deletion of the acidic tail speeds up RCC1-catalyzed GDP release
(Richards et al. 1995 ). Thus, these basic patch conformational changes
may link nucleotide exchange with Ran's C-terminal switch, which, in
turn, could help orchestrate other events. For example, dissociation of
the C-terminus would allow it to wrap around the Ran binding domains
(RanBDs) of RanBPs (Vetter et al. 1999b ), which play important roles in
Ran nuclear transport. Indeed, RanBD1 and RCC1 bind to distinct regions
of Ran (Renault et al. 2001 ), and RanBPs interact with Ran-RCC1 to form
trimeric complexes (Bischoff et al. 1995 ; Yokoyama et al. 1995 ; Noguchi
et al. 1997 ; Mueller et al. 1998 ) that modulate RCC1 nucleotide
exchange activity (Bischoff et al. 1995 ). Furthermore, recent studies
suggest that RanBP1 and another protein that binds to Ran, Mog1, act as
cofactors ensuring that RCC1-catalyzed exchange promotes the generation
of Ran-GTP (Nicolas et al. 2001 ). Thus the basic patch mechanism we
propose may facilitate efficient ordering of the molecular events
mediated by these proteins.
Ran Binding to Importin-
Other aspects of this H139-mediated basic patch mechanism are
suggested by Ran's interaction with importin- . The three residues
directly following H139Ran within Ran's basic patch
establish key contacts with acidic residues in importin- (Chook and
Blobel 1999 ; Vetter et al. 1999a ). In particular, the residue directly
following H139Ran, R140Ran, establishes a 152
Å2 contact surface with importin- ; this is the second
most extensive contact among all the residues in Ran. However, a
comparison of the contacting acidic residues within distinct
importin- families (Chook and Blobel 1999 ; Vetter et al. 1999a )
reveals that these residues are not homologous and, therefore, are
nonconserved. In contrast, a tryptophan (W342imp in human
importin- 1) that establishes an aromatic-aromatic interaction with
H139Ran (Fig. 9C) is conserved across diverse organisms and
distinct familiesas is an acidic loop directly preceding this
tryptophan. This interaction, along with electrostatic repulsion by
importin- 's acidic loop and the above-mentioned interactions of the
other basic patch residues, may help release Ran's C-terminal acidic
tail for binding to RanBP-related proteins, which, in this context, act
as Ran-importin- release factors and as coactivators of RanGAP
(Bischoff et al. 1995 ; Bischoff and Gorlich 1997 ) upon exit from the
nucleus (Melchior and Gerace 1998 ). The previously mentioned
repositioning of the Y98Ran OH group hydrogen bond may also
assist the conformational change required for these interactions (Fig.
9C). Incidentally, the transcriptional enhancer TIP120, which like
importin- is composed of HEAT repeats (Andrade and Bork 1995 ),
contains a similar acidic loop and adjacent tryptophan within one of
these repeats (Neuwald and Hirano 2000 )suggesting a possible role for
Ran in the regulation of this protein.
Summary and Conclusion
The cell has been likened to a collection of protein machines, each
with highly coordinated moving parts driven by energy-dependent
conformational changes (Alberts 1998 ). If so, then discovering the
underlying mechanisms at the core of this machinery is critical to
understanding cellular activities. CHAIN analysis addresses this
problem by capitalizing on two statistically useful characteristics of
many core components: High sequence conservation across diverse taxa
(implying an intricate mechanism performing an essential function) and
the existence of many homologs (reflecting the notion that Nature
seldom uses a good idea only once). Furthermore, we focused on a set of
related proteins (i.e., P-loop GTPases) for which there is extensive
sequence, structural, and other experimental dataa situation where
the sheer volume of data is likely to obscure pertinent information.
Fortunately, such an overabundance of data lends itself favorably to
statistical analysis, which can detect subtle clues pointing to the
most significant features of these proteins. Taking this approach has
led to our findings here, which demonstrate that mechanistic aspects of
core components of the cellular machinery can indeed be explored in
this way.
We found among FY-pivot GTPases a canonical network of atomic
interactions, the evolutionary persistence of which implies a critical
cellular function. What might this be? It seems unlikely to involve
binding to a specific type of domain because regulators and effectors
differ in structure from family to family. On the other hand, we note
that all of these GTPases must be stably maintained in either a GDP-
or GTP-bound state, yet, when necessary, readily undergo conformational
switching between states. One way to achieve this may be by embedding
bound guanine nucleotide within a sufficiently stable, yet flexible
network of molecular interactions that incorporates a pivoting
mechanism for coordinating conformational changes needed for nucleotide
release (and possibly other molecular events).
More specifically, our analysis here provides useful clues regarding
Ran's C-terminal, basic patch, and nucleotide exchange mechanisms. In
particular, we find that H30Ran, T98Ran, and
H139Ran, which deviate from the corresponding FY-pivot
canonical residues (though in the case of Y98Ran, only
slightly), appear to destabilize the canonical conformations in these
regions and, at the same time, help stabilize alternative conformations
unique to Ran. These deviant residues may thus facilitate additional
Ran-specific conformational states. Such innovative deviations from
canonical residues may be a common evolutionary strategy for functional
divergence and specialization, considering that other FY-pivot families
display similar variations on the canonical theme.
There is likely much more to this canonical network, however, than is
evident from the limited perspective of Ran GTPases discussed here.
Similar in-depth analysis of other FY-pivot families should help
provide a more complete picture. On the other hand, many statistically
surprising structural features found through CHAIN analysis may be
difficult to interpret in the absence of sufficient biochemical and
structural data. In these instances, CHAIN analysis nevertheless
provides a stimulus for first formulating and then experimentally
testing hypotheses regarding canonical features and strikingly
conserved deviations from those features observed for some families. In
this context, we note that experimental studies of H30F, F90Y, and Y98F
mutants of Ran are likely to be particularly informative.
Finally, we note that CHAIN analysis has other applications. For
example, it can help assess the biological relevance of specific types
of atomic interactions, as we found regarding the critical role in
FY-pivot GTPase function of CH- hydrogen bondswhose importance to
structural biology has only rather recently been appreciated (Weiss et
al. 2001 ). By enhancing our understanding of structural principles,
this could, in turn, lead to improved structural refinement and
homology modeling methods. Yet, even in the absence of structural data,
contrast hierarchical alignments can identify structurally relevant
signature patterns useful for functional classification of genomic
sequences and for evolutionary studiesas in the analysis here, which
indicates that Ras, Rho, Rab, and Ran may have evolved from a
predecessor resembling Sec4p. Thus, the strategies applied here serve
as a starting point for further analyses along various lines.
 |
METHODS
|
|---|
CHAIN Analysis
CHAIN analysis first classifies the residues conserved in a protein
family into distinct categories based on their patterns of conservation
within related sequences. To ensure that conservation within that
family reflects functionally imposed constraints rather than recent
common descent, representative family members are selected from
distinct phyla that have diverged at least half a billion years ago.
This is sufficient time for random mutations to have eliminated
sequence similarity merely due to recent descent.
CHAIN analysis is initiated using a particular query sequence of
interest. Using the gapped BLAST/PSI-BLAST procedure (Altschul et al.
1997 ), the query, which is typically a protein of known structure, is
aligned against two sets of sequences, the family set consisting of
the representatives of the query family, and the main set
consisting of these and other related sequences. Note that, when
aligning the query against the main set, a family consensus sequence
(of the same length as the query) is used rather than the query itself,
as we found that this generally improves the main set alignment.
Because the PSI-BLAST alignment still tends to misalign sequences
distantly related to the query, however, sequences in the main set are
subsequently realigned using an optimization procedure (Neuwald and
Poleksic 2000 ) based on Gibbs sampling (Lawrence et al. 1993 ; Liu et
al. 1995 , 1999 ).
Next a superfamily set that contains the family set is derived from
the main set (Fig. 2A). This is done using a procedure, called Bayesian
partitioning with pattern selection (BPPS), that optimally partitions
the main set into two distinct sets, one of which (the superfamily set)
consists of sequences sharing conserved patterns with the query family
and with each other, but not with sequences in the other set. This
yields a series of telescoping aligned sequence sets, which we term a
hierarchical alignment. We will limit our discussion here to three
sets, though this approach easily generalizes to more sets.
BPPS Procedure
The BPPS procedure is a Markov chain Monte Carlo method for
sampling random variables from a probability distribution (Liu 2001 ),
the details of which are described in the Appendix herein. Here there
are two variables: a conserved pattern and a set of indicators for
assigning each sequence to either the query or the nonquery partition.
The procedure explores possible combinations of pattern-partition
pairs, searching for one where the pattern maximally distinguishes the
sequences in the query partition from those in the nonquery partition.
This corresponds to an optimum point in the probability distribution
defined by our statistical model.
In order to examine all the categories to which the query family
belongs, the BPPS procedure is called multiple times, each time
initialized with a distinct seed pattern characteristic of a
particular query-related subgroup of the main set. For each seed
pattern, the procedure converges on a corresponding (possibly locally)
optimum pattern-partition pair that defines a superfamily set. The
final pattern found should be viewed as the characteristic signature
for this superfamily. When the superfamily and family sets are
distinct, these, together with the main set, define a three-level
hierarchical alignment. (The BPPS procedure can be used to define the
family set, but this is usually unnecessary for highly conserved
families.) Based on this alignment, the selective constraints acting on
the query are decomposed into four distinct categories, namely the
main, superfamily, and family categories as well as an intermediate
category (see below). We will refer to a set lower in the hierarchy as
a subalignment and a set higher up as a superalignment.
Measuring Selective Constraints
We identify the most significant structural features within a
contrast hierarchical alignment by determining which conserved residue
positions are most characteristic of each category. From an
evolutionary perspective this involves measuring the selective
constraints shifting residues in a specific column of a subalignment
away from (and thus in contrast to) the composition observed in
that column of the superalignment. This measure, which is defined by
the urn model in Figure 2C, is displayed graphically in the highlighted
alignments of Figures 1BD and 3.
More precisely, the selective constraint acting at position j
in a subalignment is expressed in terms of the number of random trials
needed to draw from among the residues in a superalignment (with
replacement) at least as many conserved residues as are observed in the
subalignment at that position. The likelihood of this event is given by
the cumulative binomial probability,
 | (1) |
where c and
N are the number of
conserved residues and the total number of residues, respectively, in
the j-th column of subalignment L, and
p is the frequency of
the conserved residues observed at that position for superalignment
B, which serves as the background model. The corresponding
selective constraint acting on subalignment L is then
defined as
the expected number of random trials needed to observe this event. This
measure helps decompose the total constraints acting on the family into
specific categories, because it essentially subtracts the
constraints acting on a superalignment from those acting on the
subalignment.
Histogram bar heights in alignment figures are set proportional to the
number of random trials implied by
K . If displayed
directly, however, a column with P = 0.01 (100 trials), for
example, would appear insignificant relative to a column with
P = 0.00001 (100,000 trials). On the other hand, direct
logarithmic scaling may cause the bars for significant columns to
disappear depending on the value of P for the most significant
column. We get around these problems by setting the bar height
where t is the number of random trials and 0 < 1
is a scaling parameter for adjusting the relative bar heights so as to
converge to linear scaling at = 0 and to logarithmic scaling as
1. An automated routine chooses the appropriate values of
for display (as in Fig. 1) based on the pattern positions selected by
the BPPS procedure. The order-of-magnitude-increase in t as a
function of , when the relative bar height increases by twofold, is
given by
 |
Amino Acid Distributions
Several points regarding amino acid distributions should be noted.
First, for the alignments in Figures 1A,B and 3A selective constraints
are based on a standard background distribution, which we define as the
overall amino acid frequencies observed for sequences in the main set.
Second, when estimating amino acid distributions from observed counts,
we adjust for small sample size by adding one pseudo-count for each
type of amino acid at each position. In a Bayesian statistical context,
this procedure corresponds to an uninformed prior probability. Third,
because misaligned positions corrupt position-specific background
frequency estimates, marginal alignment probabilities may be computed
(Yu and Smith 1999 ) and, when there is significant alignment
uncertainty, background distributions can be based on the weighted
average of all superalignment positions likely to be aligned with that
position in the subalignment. (This weighting procedure is akin to the
use of Dirichlet mixture priors; Brown et al. 1993 .) Note, however,
that currently marginal probabilities are computed solely to identify
regions of uncertainty, rather than to modify background distributions
in this way.
Finally, when counting observed residues (for measuring selective
constraints or for the BPPS procedure), sequences are down-weighted to
adjust for correlations between them, using the PSI-BLAST weighting
scheme (Henikoff and Henikoff 1994 ; Altschul et al. 1997 ). Note,
however, that PSI-BLAST weights sum to 1, whereas we require weights
that sum to the effective number of observed sequences. Calculation of
such weights is an unsolved problem, but, as a rough estimate, one
approach is to normalize the weights so that the maximum weight
(corresponding to the least correlated sequence) is one. This may yield
erratic results, however, because the effective number of sequences
then depends on a single outlying sequence. As a robust alternative, we
normalize weights such that the expected weight is one-half, with
occasional normalized weights greater than one being truncated to one.
Though admittedly ad hoc, this approach is adequate because we use
these as relative measures rather than as precise probabilities. Note
that weights are not computed for the query family alignment, because
these sequences are selected from distinct phyla or kingdoms and,
therefore, are treated as statistically independent.
Categorization of Residue Positions
We would like to classify residue positions within a three-level
hierarchical alignment into four functional categories, namely the
Main, Superfamily, and Family categories (Fig. 1BD, respectively),
and an Intermediate category. Although the BPPS procedure assigns
certain query residue positions to the Superfamily, it does not
explicitly assign residue positions to other categories. To
do this, CHAIN analysis uses the following (admittedly ad
hoc) procedures based on our measure of category-specific selective
constraints. In describing these procedures, we will use the following
alignment notation: Main alignment, M; Superfamily alignment,
S; Family alignment, F; null alignment (standard
amino acid background model), . Note that the positions assigned by
the following procedures are indicated by the dots directly above the
alignments in Figures 1AD and, with respect to the residue
side-chains in Figures 49, by the color scheme of Figure 2D.
Residue positions in the Superfamily category are defined by the
pattern obtained by the BPPS procedure. We also include in this
category any other positions j for which the
Superfamily-specific selective constraint
K is at least as
high as the K s for
any of the pattern positions selected by the BPPS procedure. The subset
of related residues yielding the maximum
K is chosen as the
conserved residue set at each of these additional positions. (Note that
residues in this set are typically unrelated to the query residue.)
These additional positions correspond to conserved residues in the
superfamily that are either nonconserved or divergently conserved
within the query family and thus were not selected by the BPPS
procedure. In other words, this procedure identifies positions in the
superfamily that (based on their
K ) would have been
detected by the BPPS procedure, if the query family had contained the
conserved residue at that position.
Residue positions assigned to the Main category are defined as those
that are absent from the Superfamily category and for which the
Main-set-specific selective constraint
K is greater than
the average K
obtained for those positions assigned to the Superfamily category. Note
that all of these
K are computed
from equation (1) based on standard background frequencies.
Conceptually, this identifies previously uncategorized residue
positions that are more highly conserved in the Main set than the
average Superfamily position.
Residue positions in the Family category are defined as those that are
absent from the Superfamily category and for which the Family-specific
selective constraint
K is greater than
one standard deviation above the average
K for those
positions assigned to the Superfamily category. Conceptually, this
identifies those residue positions outside of the Superfamily category
with Family constraints greater than most of the Superfamily-assigned
positions. We also assign to this category any Superfamily-assigned
positions with K s
at least as high as any of these positions and for which the Family
residue set (as defined in the Appendix) is a proper subset of the
Superfamily residue set at that position.
Finally, we classify into an Intermediate category those previously
uncategorized residue positions that have a standard
background-based Family-specific selective constraint
K greater than
one standard deviation below the mean
K for all
previously categorized positions. Conceptually, this represents
residues that are roughly just as conserved in the query family as many
of the previously classified columns but that are inconsistently
conserved across these other categories, and thus cannot otherwise be
definitively classified under the current hierarchical scheme.
Interaction Network Analysis
Interaction network analysis involves the identification of
molecular interactions associated with category-specific selective
constraints within available structures. The interactions considered
include both classical (Baker and Hubbard 1984 ) and weak (Wahl and
Sundaralingam 1997 ; Toth et al. 2001 ; Weiss et al. 2001 ) hydrogen
bonds, aromatic-aromatic interactions (Burley and Petsko 1985 ), and van
der Waals contacts (determined based on a standard distance of 4.5 Å).
The REDUCE program (Word et al. 1999 ) was used to attach hydrogen
atoms prior to hydrogen bond determination. Residue interactions
between subunits were assessed based on buried surface area (Lee and
Richards 1971 ). Other procedures generated RasMol (Sayle and
Milner-White 1995 ) scripts for displaying interactions (as in Figs.
49).
 |
APPENDIX
|
|---|
Bayesian Partitioning With Pattern Selection
Pattern Definition
A pattern is defined as sets of amino acids, one set (termed a
query residue set) for each position in the query sequence. Each
query residue set must contain either no residues, in which case the
position is ignored, or the query residue at that position along with
zero or more related residues, which ensures that these patterns
capture features of the query. Related residues are defined as those
amino acids with a positive BLOSUM62 (Henikoff and Henikoff 1992 )
score against the query residue. For example, if the query sequence
contains F at a certain position, then the possible residue sets at
that position would be {F}, {F,Y}, {F,W}, and
{F,Y,W} because the distinct, positive scoring residues against
F are Y, and W.
Bayesian Partitioning With Pattern Selection (BPPS)
Our statistical model treats the aligned sequences in the nonquery
partition as randomly and independently generated using the
position-specific residue frequencies (PSFs) observed for those
sequences. It treats sequences in the query partition as having two
types of positions: nonpattern and pattern positions. Residues at
nonpattern positions are generated from the same PSF as the nonquery
sequences. Residues at pattern positions are generated from a mixture
model at each position that is based both on the nonquery PSF (which
models nonconserved residues) and on the query residues matching the
conserved pattern at that position. For these mixture models a
parameter , the value of which is inferred from the observed
sequences, specifies for the query partition the relative fraction of
the observed residues derived from the conserved pattern model. Note
that, to ensure that the query partition contains the query family
itself, assignment of sequences in the family alignment to the nonquery
partition is disallowed by the BPPS procedure. A complicating aspect of
this model is that we do not know which sequences outside of the query
family belong to the query partition nor which positions correspond to
pattern positions nor exactly which residues are conserved at each
pattern position. To solve this problem we set up the following
statistical model and corresponding sampling strategy.
The Basic Model
To begin with, suppose we have vector observations (corresponding to
n multiply aligned protein sequences)
xi = (xi1,...,xik)
for i = 1,...,n, where k is the number of
columns (or aligned positions) and where each xij
takes one of the residues in an alphabet of size m = 20. A
more convenient method is to represent xij as a
20-dimensional vector (0,...,0,1,0,...,0), where the lone 1
indicates the observed residue type.
Suppose that each xi falls into one of two
clusters (the query or the nonquery partition), but that only a
selected number of columns (the pattern positions) are differentiating
these clusters and the remaining columns (the nonpattern positions) are
just "dummies." We introduce a row indicator vector
R = (Rl,...,Rn)
and a column indicator vector
C = (Cl,...,Ck),
where Ri = 1 if sequence i belongs to the
query partition and 0 otherwise, and Cj = 1 if
column j is differentiating (i.e., a pattern position) and 0
otherwise.
For a differentiating column, we require two 20-dimensional frequency
vectors, < |