Vol. 12, Issue 3, 503-514, March 2002
RESOURCES
ABSTRACT
We present a novel web-based resource, Gene3D, of precalculated structural assignments to gene sequences and whole genomes. This resource assigns structural domains from the CATH database to whole genes and links these to their curated functional and structural annotations within the CATH domain structure database, the functional Dictionary of Homologous Superfamilies (DHS), and PDBsum. Currently Gene3D provides annotation for 36 complete genomes (two eukaryotes, six archaea, and 28 bacteria). On average, between 30% and 40% of the genes of a given genome can be structurally annotated. Matches to structural domains are found using the profile-based method (PSI-BLAST), and a novel protocol, DRange, is used to resolve conflicts in matches involving different homologous superfamilies.
INTRODUCTION
A protein performs its function through the specific tertiary
structure it adopts, which is a consequence of its
amino acid sequence. To date, in silico biology has largely attempted
to assign functions to protein sequences solely by sequence similarity to proteins in the sequence database. Many resources exist which group
proteins into families [e.g., PROSITE (Hofmann et al. 1999), PRINTS
(Apweiler et al. 2001b), and Pfam (Bateman et al. 2000)] and provide
facilities for searching with a new sequence to
determine functional properties by inheritance from a putative relative.
On a genome-wide basis, GeneQuiz (Iliopoulos et al. 2000)
was one of the first resources which attempted to provide functional
annotations for a complete genome, Saccharomyces cerevisiae, by assigning functions from related sequences in the sequence databases
(Holm and Sander 1994). Approximately 60% of the genes could initially
be annotated in this way, and for about 20% of the genes, structures
could also be assigned. Among the most powerful methods currently
available for assigning distantly related sequences to sequence
families are the profile-based methods (e.g., PSI-BLAST; Altschul et al. 1997)
and Hidden Markov models, particularly SamT (Karplus et al. 1998).
Various studies (Park et al. 1998; Salamov et al. 1999) have demonstrated
that they are more sensitive than other
methods (e.g., BLAST, FASTA) for remote
homolog detection. Muller et al. (1999) showed that approximately
one-third of a set of very distant homologs from the SCOP database,
previously identified through similarities in their structures, could
be matched using PSI-BLAST. Using these techniques,
GeneQuiz is currently able to assign functions for between
30% and 80% of genes in any given genome.
The Proteome database at the EBI (Apweiler et al. 2001a)
also provides a wide-ranging sequence-based analysis of the genes in many complete and partially completed genomes. This system attempts to assign genes to their related InterPro/CluSTr families and to store all available information; it also provides a range of comparative genomics tools for
the analyzed genomes.
However, in addition to inheriting functions for genome sequences,
further significant benefits can be obtained by identifying the
structural family to which the sequences belong. Knowledge of the
structure allows the mapping of functionally important residues
identified experimentally or from sequence alignments to their physical
locations, thus providing important insights into functional mechanisms
and the impact of single nucleotide polymorphisms (SNPs).
Furthermore, because structure is much more conserved than sequence,
multiple alignments generated from structural comparisons are much more
accurate than those generated from sequence alone, particularly for
distant homologs. Thus, multiple structure alignments and the profiles
derived from them can often improve the detection of conserved residues
(e.g., catalytic residues), or sites associated with function (Valdar
and Thornton 2001).
Because several recent analyses have demonstrated the need to be
cautious when inheriting functional information between distant homologs (<30% sequence identity; see Todd et al. 2001), structural information can often help to validate putative functions. Knowledge of
the structural family allows 3D models to be built for the sequence
from which active sites can be predicted (Laskowski et al. 1996;
Luscombe et al. 1997) and the effects of mutations on functional
properties can be assessed. Models also allow further structural
studies such as docking of putative ligands and simulation of
protein-protein interactions.
Considerable progress has been made in providing structural annotation
for genes and whole genomes. The most powerful methodologies, which
employ sequence profiles (e.g., PSI-BLAST) or fold recognition methods (e.g., GenThreader,
3D-PSSM), can provide some structural annotation for up to
50% of small microbial genomes, for example, Mycoplasma
genitalium (Huynen and Bork 1998; Muller et al. 1999; Salamov et al. 1999).
Profile-based methods generally assign about 40%
of the proteins in M. genitalium (Muller et al. 1999), whereas
threading algorithms currently provide annotations for nearly 50% of
this genome (Jones 1999). Teichmann et al. (1999) give a full review of
the state of the art in structure annotation of genomes.
However, most of the publicly available resources developed using these
approaches simply provide links from the gene sequence to the
structural relatives in the Protein Data Bank (PDB; Berman et al. 2000)
with no direct information on structural family. For example, although
the genome annotation resource GeneQuiz lists structural
relatives for about 10% of the genes in the yeast genome, there are no
direct links to structural families. Another more recently established
genome resource linked to the Molecular Modeling Database
(MMDB) (Wang et al. 2000) provides links from genes in
genomes to proteins of known structure as a list of structural relatives for each gene. Those regions of genes which can be assigned unambiguously using
BLAST are presented, and
those authors demonstrated how 3D structure can be used to inform
functional predictions. Again, no information on structural family is provided.
Conversely, although many of the structural databases have now set up
sequence libraries which list the sequence relatives identified for
proteins of known structure, there is no direct link to the genome nor
means of browsing structural assignments for other genes from the same
genome. For example, Park et al. (1997) recently developed
the Protein DataBank Intermediate Sequence Library (PDB-ISL), which
contains sequence relatives to structural domains in the SCOP database
(Lo Conte et al. 2000). Sequence libraries of this sort allow for more
sensitive sequence searching when using profile-based methods such as
PSI-BLAST. They extend sequence diversity in the family so
that further searches identify more distant relatives as well as the
initial family members.
The Superfamily database (Gough et al. 2001) uses Hidden Markov
models (HMMs) to represent each family in the SCOP database. These HMMs
are then used to identify sequence relatives to each SCOP family in a
library of genomic sequences.
Sequence relatives have also been recruited into the CATH domain
structure database using a protocol based on PSI-BLAST and
a consensus approach (DomainFinder) for assigning a domain
structure to a specific region of the gene sequence (Pearl et al. 2001).
However, there are no direct links from the sequence back to the
genome. The Gene3D resource has been set up to address
that need, and to provide links between the structural annotations for
genes in completed genomes. In addition, unlike other available
resources (e.g., GeneQuiz; Wang et al. 2000), which often
link genes to whole PDB structures, Gene3D clearly
identifies the domain regions for which structural annotation can be provided.
In one of the earlier comparative genome analyses involving structural
data, Gerstein (1997) used FASTA (Pearson and Lipman 1988)
to assign folds and assess their distribution in different organisms.
Interestingly, the data indicated that most organisms' complement of
folds is highly enriched in mixed alpha/beta type folds, much more so
than the current structural databases. This may reflect the tendency
for enzymes to adopt predominantly alpha/beta folds. To facilitate this
type of analysis, Gene3D also provides statistics on the
distribution of fold groups and structural families within each genome.
These data can be used to perform comparative genome analyses and
determine any differential fold usage which may be associated with
differences in phenotypes.
METHODS AND RESULTS
Structural assignments in Gene3D are based on the CATH
domain structure classification system. This is a hierarchical system
which at the lower levels groups structures and sequences together that
have a common ancestor, based on structural similarity, sequence
identity, and common functional features (Pearl et al. 2001). The
initial assignments are made using a combination of PSI-BLAST (Altschul et al. 1997) and IMPALA (Schaffer et al. 1999). Initial processing is performed by
DomainFinder (Pearl et al. 2002), an algorithm which
identifies clear matches of gene sequences to protein domains in CATH,
and final processing is accomplished by the genome-wide annotation
method, DRange (see Methods). Using this method we have
provided structural annotation for between 30% and 40% of the genes in each of 36
complete genomes in GenBank. The use of structural domains allows great
confidence in the domain boundary assignments generated by
PSI-BLAST; structural domains are complete domains,
whereas sequence domains, which can be small (fewer than 50 residues),
may only represent motifs and not complete structural domains. A web
server has been set up to retrieve these assignment data and to provide
tools for cross-genome analysis.
Gene3D is a web-based resource of structural assignments to whole genes available on the World Wide Web at http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D.
Resource Description
Gene3D provides the biologist with structural
assignments which link directly to functional and structural
information maintained within the CATH database (Pearl et al. 2001), a
dictionary of functional information for homologous structural
superfamilies [the Dictionary of Homologous Superfamilies (DHS); Bray
et al. 2000], and a resource providing derived structural
and functional data with additional functional links (PDBsum; Laskowski
2001). Importantly, the DHS contains multiple structural alignments
annotated in various ways, for example with PROSITE motifs indicating
functionally important positions.
Unlike other resources, which simply provide listings of structural
domains matched to gene regions, Gene3D employs a suite of
programs (DRange) to remove conflicting assignments and
provides the biologist with curated, confident, nonconflicting assignments for the genes in whole genomes. Each genome also has brief
summary statistics presented which indicate the distribution of fold
types and protein structural families. The current content of
Gene3D comprises the genes of 36 complete genomes
(Table 1) and the associated structural
assignments. This will be updated on a regular basis as new genomes are
released to the public gene databanks.
The server is made up of interlinked web pages, which allow the user to browse the structural assignments made to those complete genomes that are publicly available at the NCBI (currently 36 genomes). These consist of two components: a series of help files including a brief tutorial, and the genomic structural assignments. Access to both of these is through the main interface. Upon selecting the `browse genomes' option, the user is presented with a list of the available genomes. This list is updated with new genomes upon their release, and with every update of the CATH database (Fig. 1a). Each genome has a page summarizing the assignment statistics with an available option for listing a summary of all of the structural domains assigned to the genome (Fig. 1bi). The complete domain assignment data for the whole genome are also available for download.
From each genome page, the user can elect to either choose a gene of interest (Fig. 1bi) or use the search engine to find genes of interest within their chosen genome. A list of genes with structural assignments will be returned and the user may select a specific gene (Fig. 1c). Once a gene is selected (either by searching the genome or by selecting from the initial assignment list), a diagram of the gene and the placement of domains along the gene is presented (Fig. 1d). This is accompanied by the PSI-BLAST data that matched the domain with the gene region. Importantly, this page serves as the portal that links the structural assignments to the functional data within the CATH database. On the right of this page is the menu (Fig. 1, "Menus"), which allows the user to choose a structural domain within the gene and go to the appropriate entry in CATH, the DHS, or PDBsum.
CATH is a structural classification database which provides details
regarding the interrelationships between differing structures and
structural families (see Methods). CATH is further linked to the DHS,
which provides both functional and structural information about the
features common between proteins within a given superfamily in the CATH
database. Recent research into enzyme superfamilies in CATH (Todd et
al. 2001) has suggested that, provided relatives have 40% or more
sequence identity, there is considerable similarity
in function, although substrates may vary. The DHS provides information
that allows the user to assess the extent to which function varies
within a superfamily. PDBsum is a resource of processed and analyzed
PDB files providing a wealth of structural data and links to other
protein databases on the web (e.g., SWISS-PROT, KEGG, SCOP, PROCHECK).
Each level of the Gene3D database presents the user with
the option to download any applicable data files.
Statistics for the Genes in Whole Genomes
Basic statistics are presented for each genome (Fig. 1bi, Table 1). These give an indication of the quality and level of coverage attained for each genome. The total number of genes and the total number of residues in each genome are quoted alongside the number of domains assigned. Also calculated is the number of genes with at least one domain assigned, alongside the percentage of the organism's genes this represents. A coverage score (i.e., the number of residues, out of the genome's total, that are part of a domain assignment) is also presented. With the summary statistics is a pie chart showing the diversity of domains in an organism compared to the diversity of domains in the CATH structural database. The colored segments represent the four CATH classes (yellow, all-alpha domains; red, all-beta domains; green, alpha/beta domains; blue, domains with little secondary structure). The inner circle is divided so that each segment indicates a different architecture, and the outer circle is divided so that each segment represents a different fold (topology). The size of each segment indicates the proportion of the CATH structural database represented by that class, architecture, or topology. For each organism, those folds that have been assigned are left colored and those folds that have not been assigned to the organism are colored black. The pie chart gives a quick visual indication of how many of the folds present in the CATH structural database have been identified within an organism, which in turn indicates the structural diversity within an organism. Visual inspection of any two pie charts allows rapid identification of the organism that appears to be the more structurally diverse.
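As a rough illustration of how the per-genome figures described above could be derived, the following Python sketch computes them from a list of domain assignments. The `DomainAssignment` record, the `genome_summary` function, and the toy data are hypothetical and are not taken from the Gene3D code; reporting the coverage score as a percentage as well as a residue count is an added convenience.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DomainAssignment:
    """Hypothetical record: one structural domain placed on one gene."""
    gene_id: str
    start: int  # first residue of the assigned region (1-based, inclusive)
    end: int    # last residue of the assigned region (1-based, inclusive)


def genome_summary(gene_lengths, assignments):
    """Return summary statistics for one genome.

    gene_lengths maps gene id -> protein length in residues.
    Residues covered by more than one assignment are counted once.
    """
    covered = {}  # gene id -> set of residue positions inside any domain
    for a in assignments:
        covered.setdefault(a.gene_id, set()).update(range(a.start, a.end + 1))

    total_genes = len(gene_lengths)
    total_residues = sum(gene_lengths.values())
    covered_residues = sum(len(pos) for pos in covered.values())
    return {
        "total_genes": total_genes,
        "total_residues": total_residues,
        "domains_assigned": len(assignments),
        "genes_with_domain": len(covered),
        "percent_genes_with_domain": 100.0 * len(covered) / total_genes,
        "covered_residues": covered_residues,
        "percent_residues_covered": 100.0 * covered_residues / total_residues,
    }


# Toy genome of three genes; the numbers are illustrative only.
toy_lengths = {"geneA": 100, "geneB": 200, "geneC": 100}
toy_assignments = [
    DomainAssignment("geneA", 1, 50),
    DomainAssignment("geneB", 11, 110),
    DomainAssignment("geneB", 101, 150),
]
summary = genome_summary(toy_lengths, toy_assignments)
print(summary)
```

Storing covered positions as sets keeps the sketch short and avoids double counting where two assignments overlap; interval merging would be the more economical choice for whole genomes.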
An example of this is the comparison of the M. genitalium genome with the genome of Caenorhabditis elegans. The pie chart for M. genitalium indicates that very few of the all-beta folds within the CATH structural database have been found in its genome. However, the pie chart for C. elegans indicates that approximately half of the all-beta folds have been identified in its genome. Inspection of the structural assignment data reveals that the superfamilies of immunoglobulin-like proteins are expanded within the C. elegans genome. C. elegans is a multicellular organism which requires complex cell-cell interaction. Many of the cell-surface functions responsible for mediation of cell-cell interactions are performed by proteins that are part of the immunoglobulin superfamilies, which are all-beta folds. M. genitalium, a single-celled organism, does not require the many forms of cell-cell interaction required by C. elegans and does not display the use of many of the superfamilies of all-beta immunoglobulin-like folds. The pie charts give an indication of some underlying biological differences between organisms which can be elucidated by close inspection of the assignment data.
Application of Gene3D in Genome Analysis
An example of the use of Gene3D to mine this
information is the E. coli gene yaaF. This gene
(GenBank ID:140159 or 1786213) is listed by GenBank as being a
hypothetical gene and part of the hypothetical operon of unknown
function, yaa. When the E. coli genome is searched for
`hypothetical' genes, a list of predicted genes is presented, one of
which is yaaF. Selecting this gene from the list presents a diagram of
the CATH homologous superfamilies that match this gene's product. A
single homologous superfamily (CATH ID: 3.90.245.10) matches nearly the
complete length of the gene (304 residues). The closest structural
match is the only domain (domain 0) from PDB structure 1mas chain A. To
get further information, 1masA0 is selected from the menu in the
`Goto' box; from here, the CATH database, the DHS, or PDBsum may be
selected. Selecting the CATH database takes the user to its entry
within the CATH database, which shows that this structure is a mixed
alpha/beta domain of the `Inosine-uridine Nucleoside N-ribohydrolase'
fold and that the homologous superfamily is a family of hydrolases. If
the DHS is selected, a page of curated functional data is presented to
the user. This adds further SWISS-PROT (Bairoch and
Apweiler 2000), PROSITE, and ligand data. These functional data
indicate that the members of this homologous superfamily are purine
nucleoside hydrolases (Enzyme Commission number: 3.2.2.1) that also
possess PROSITE pattern PS01247 (Inosine uridine-preferring nucleoside
hydrolase family signature). The yaaF gene product also contains this
PROSITE motif, in the same position, with a cysteine-to-threonine
substitution at the second position. The length and the high
statistical significance of the match between yaaF and homologous
superfamily 3.90.245.10 suggest that yaaF is a genuine gene and that its product, as a member of
CATH homologous superfamily 3.90.245.10, is a purine nucleoside
hydrolase. This information could now be used to design the experiments
to confirm this and discover the role of this gene within E. coli.
This may also assist in the elucidation of the role of the yaa operon.
We can also use the data in Gene3D to examine the
functions of homologous superfamilies that are multiply expanded within
genomes or sets of genomes. Such superfamilies, it is postulated, are
likely to be involved in adaptations specific to that organism/group of
organisms. We have putatively identified 204 homologous superfamilies whose complement within specific genomes has been expanded relative to the other genomes in our set. Many of these homologous superfamilies have no known function or are labeled as putative genes
by the genome sequencing projects. Where there are functional data, it
can often be shown that a homologous superfamily does display a
function which is specific to the organism/group of organisms. An
example is the CATH homologous superfamily 1.10.101.10. We identified 11 homologs of this domain in Bacillus subtilis spread across nine genes (Table 2); in all cases, the best matched known
structure is 1lbu01. The average prevalence of this domain across all of
our organisms is 0.514 domains per organism; thus, these 11 domains
represent an approximately 20-fold increase in the relative number of
these domains present within the B. subtilis genome. Four of
these genes have unknown functions (GenBank annotation), although three
genes (ykuG, yqeE, and yvjB) do have
recognized similarities with other genes. Where the genes containing these 11 domains were present in SWISS-PROT, they were all part of the N-acetylmuramoyl-L-alanine amidase
family 3. A search of the literature for these genes revealed that yqeE had been experimentally determined as a sigma-K-dependent peptidoglycan hydrolase. Further inspection of the alignment of these domains shows
that they all share only four common conserved residues (three glycines
at positions 41, 65, and 71 and a glutamine at position 61). Glycine
residues rarely take part in catalysis, so it seems likely that these
residues play a structural role. The functional annotations of these
proteins suggest that these genes are involved in the turnover and
lysing of the bacterial cell wall in B. subtilis, and this
should inform experimental design in establishing the role of three
genes with an unknown function. B. subtilis is a sporulating
bacterium that would have need of a series of complex cell
wall/spore coat metabolizing enzymes for moving from the spore state to
the vegetative state. Therefore it is possible that these domains and
proteins have differing specificities and take part in different steps
in cell wall or spore coat metabolism.
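The relative expansion quoted above for superfamily 1.10.101.10 in B. subtilis is a simple ratio of the observed domain count to the cross-genome average. The short Python sketch below reproduces that arithmetic; the function name `fold_enrichment` is hypothetical, and only the two figures quoted in the text (11 domains and 0.514 domains per organism) are used.

```python
def fold_enrichment(observed_count, mean_count_per_genome):
    """Ratio of the domain count in one genome to the cross-genome mean."""
    if mean_count_per_genome <= 0:
        raise ValueError("mean count per genome must be positive")
    return observed_count / mean_count_per_genome


# Figures quoted in the text for CATH superfamily 1.10.101.10.
enrichment = fold_enrichment(observed_count=11, mean_count_per_genome=0.514)
print(f"{enrichment:.1f}-fold")  # 21.4-fold
```

The result, about 21.4, is consistent with the approximately 20-fold increase stated in the text.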
Comparative Analysis of Fold Usage across the Genome
Figure 2 shows that the distribution of
fold classes within the clades approximates that which can be found in
the structural databases (CATH, SCOP), as reported (Gerstein 1998). All
of the genomes are greatly enriched in the alpha/beta folds and as such show a depleted complement of mainly alpha and mainly beta folds in
relation to the structure databases. The archaea and bacteria are
depleted in all-beta folds; this is a result of not possessing the
families of cell/cell signaling receptors that make wide use of the
immunoglobulin-like folds. The observed depletion in mainly alpha folds
may be due to underrepresentation of mainly alpha folds in the
structural databases. It has been shown that 20% to 30% of a
genome's proteins are likely to have a transmembrane helical domain
(Wallin and von Heijne 1998; Krogh et al. 2001); such domains are
greatly depleted within the structural databases.
There are many folds that are only used once by any given clade,
whereas there are a few folds that are multiply reused by the organisms
in a given clade (Fig. 3). It is
interesting to note that these top five folds have also been described
as superfolds (Pearl et al. 2001), as they have been found to recur most
frequently within the CATH database. They are also known as frequently
occurring domains in SCOP (FODS-SCOP). These recurrent folds make up
around 20% of the structural databases. That these folds are seen to be the most used by the three clades may be the result of two differing
effects. The first of these is that these folds are truly the most used
folds in modern organisms. On the other hand, because we have the
greatest number of homologous superfamilies for these folds in the
structural databases, we correspondingly have a greater number of
sequence families and are better able to recognize members of these
folds' sequence families within the genomes. Circularly, it seems
likely that we have found so many examples of these structures because
they are disproportionately more common in these organisms.
The frequency distribution of the five superfolds is shown in Figure 4. This illustrates the frequency of a
given fold per gene within one of the three major kingdoms. Illustrated
alongside these is the frequency of occurrence of a superfold among all of the organisms. The bacteria most closely match the frequency distribution seen across all the clades. This is hardly surprising, because the overall distribution is skewed towards the bacteria:
there are more bacterial genes in the set of organisms. Notable is the
archaea's use of the superfolds. Many of the archaea in this set of
genomes are extremophiles, and one may expect they require very stable
proteins in order to survive. It is possible that superfolds are stable
folds (Orengo et al. 1994), and it would follow that the
archaea may make great use of them; certainly more study is required to
confirm this. Another feature of this graph is that the eukaryotes make
much greater use of the immunoglobulin-type folds compared to the other
clades. This largely comes from the input of the genes from
Caenorhabditis elegans, which is the only multicellular
organism, and is a consequence of the use of such domains in cell
signaling pathways.
DISCUSSION
Gene3D provides a resource for the biochemist and biologist alike. It can simply be used as a tool to find structural assignments for individual genes. More usefully, querying the database allows the examination of gene families of interest within an organism based on possession of common traits (e.g., common functions). Future additions to the server will include the ability to query the underlying Oracle relational database. This will include the ability to perform comparative queries, returning datasets compiled from multiple genomes. The compilation of such value-added databases represents some of the first steps required to fully integrate large quantities of data from the genomic data resources, which will aid differential genome analysis and the study of protein structure/function evolution and genome evolution. Furthermore, identification of those gene sequence families for which we can already provide accurate structural assignments can be used to aid the identification of those sequence families for which representative structures are still needed, and as such will aid today's structural genomics initiatives.
The database can be accessed via the World Wide Web (http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D). This server allows the user to search the preprocessed assignment data for structural assignments stored for any gene in the GenBank NRDB100 list. A further part of the server allows access to the statistics for each genome and the ability to search for any gene in that organism for which DRange-processed assignments have been made (http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D/Genome.html). Preprepared downloads of all the NCBI's genomes can also be obtained via our ftp server (ftp://ftp.biochem.ucl.ac.uk/pub/cathdata/Gene3D/).
METHODS
Dataset Selection
A library of sequences was set up containing gene sequences from
GenBank and representative sequences from the CATH database. The
nonredundant database from GenBank (at 100% identity) was used
(NRDB100) (Benson et al. 2000). Genomic sequence data for complete
genomes are also gathered from GenBank. Only those genomes published as
complete are selected, and draft genome sequences are not used.
The CATH database is a hierarchical database of protein domains split
into four main levels (Class, Architecture, Topology, and Homologous superfamily). At the Class level, proteins are divided up based on their secondary structure content (Table 3). The next level,
Architecture, describes the positions of the secondary structure
elements in space. The third level, Topology, describes the fold of the
domain and indicates how the secondary structure elements are joined
together in space. Finally, the Homologous superfamily level groups
those domains which have a clear evolutionary relationship. Each
homologous superfamily is further subdivided into families based on
sequence similarity at 35%, 60%, 95%, and 100% sequence identities.
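Because the four levels are nested, a CATH code such as 3.90.245.10 can be read level by level, and two domains can be compared by finding the deepest level at which their codes still agree. The Python sketch below illustrates this under stated assumptions: the names `CathCode`, `parse_cath_code`, and `deepest_shared_level` are hypothetical, and the code 3.90.245.20 in the usage lines is invented purely for illustration.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    cath_class: int              # C: secondary structure content
    architecture: int            # A: arrangement of secondary structures in space
    topology: int                # T: fold (how the elements are joined together)
    homologous_superfamily: int  # H: clear evolutionary relationship


LEVEL_NAMES = ("Class", "Architecture", "Topology", "Homologous superfamily")


def parse_cath_code(code):
    """Split a dotted four-level code such as '3.90.245.10' into its levels."""
    parts = code.strip().split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return CathCode(*(int(p) for p in parts))


def deepest_shared_level(a, b):
    """Name of the deepest level shared by two codes, or None if the classes differ."""
    shared = 0
    for x, y in zip(parse_cath_code(a), parse_cath_code(b)):
        if x != y:
            break  # lower levels are only meaningful within the same parent
        shared += 1
    return LEVEL_NAMES[shared - 1] if shared else None


print(parse_cath_code("3.90.245.10"))
print(deepest_shared_level("3.90.245.10", "3.90.245.20"))  # Topology
print(deepest_shared_level("3.90.245.10", "1.10.101.10"))  # None
```

Comparison stops at the first mismatch because an architecture or topology number only has meaning within its parent level.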
Representative structures/sequences were selected for each S95 sequence
family in the CATH Protein Family Database (the CATH PFDB, where each
sequence family contains members that are 95% sequence identical or
higher) (Pearl et al. 2001). Each of these protein sequence families
falls into one of five main categories (CATH classes 1 to 5) or one of
two additional categories (CATH classes six and seven) which refer to
proteins currently being integrated into the CATH classification (see
Table 3).
For testing the Collapse module (see below), which resolves overlaps between the same homologous superfamilies on the same gene region, a test set of 200 nonredundant genes displaying various forms of overlapping assignments was selected and used for empirical cutoff assignment. Domains were selected because they displayed the types of overlap found in the assignment data.
Identification of Sequence Relatives to Proteins in the CATH Database Using PSI-BLAST and DomainFinder
In the first step, CATH S95reps are matched to sequences within the
NRDB100 from GenBank (Pearl et al. 2001). Sequence matching is
performed using PSI-BLAST, and only matches with an expectation value (E-value) of less than or equal to 5 × 10^-4
are included in the profile for the next iteration. This parameter is
recommended by Brenner et al. (1998) and validated by Pearl et al. (2002).
PSI-BLAST was benchmarked to derive conservative thresholds for reliably predicting sequence domains for inclusion as
input for the DomainFinder and DRange
algorithms. A dataset of 1351 representative sequences (CATH S35Reps)
was derived from the single-segment domains in the CATH structural domain database. These are derived from the majority of homologous superfamilies in CATH (773 families from the April 5, 2000 release of
CATH). Sequences with less than 35% sequence identity to their other
selected relatives were included, thus ensuring that the dataset
contained only remote homologs. Remote homologs were chosen so that the
performance in recognizing distant relatives could be assessed. This is
necessary because homologs with sequence identities >35% are easily
identified by pairwise sequence comparison methods (Pearl et al. 2001).
The 1351 single-segment homologs give a total of 911,925 (1351 × 1350/2) pairwise relationships (false + true). Optimally the
PSI-BLAST algorithm should detect all of the true pairwise
relationships within a homologous superfamily (H-family, 2478 in total)
without any false positives.
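The total number of pairwise relationships is the number of unordered pairs among the 1351 sequences, and the true relationships are the pairs that fall within the same homologous superfamily. The Python sketch below checks the first figure and shows how both counts could be obtained from per-sequence superfamily labels; `pair_counts` and the toy labels are hypothetical.

```python
from collections import Counter
from math import comb


def pair_counts(superfamily_labels):
    """Return (all_pairs, true_pairs) for a list of per-sequence superfamily labels."""
    n = len(superfamily_labels)
    all_pairs = comb(n, 2)
    # A pair is "true" when both sequences carry the same superfamily label.
    true_pairs = sum(comb(size, 2) for size in Counter(superfamily_labels).values())
    return all_pairs, true_pairs


print(comb(1351, 2))  # 911925, the total quoted in the text

toy_labels = ["H1", "H1", "H1", "H2", "H2", "H3"]
print(pair_counts(toy_labels))  # (15, 4)
```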
PSI-BLAST was run for a range of E-values. Hits were recorded and scored when an S35Rep matched another S35Rep from the same homologous superfamily. Matches between S35Reps in different H-families with the same fold (same T-level) were not counted. The H-families in CATH are assigned very conservatively. Matches having the same fold group but differing homologous superfamilies suggest putative evolutionary relationships, for which we have no strong functional evidence. An overlap measure of 50% was also introduced, which was calculated as the percent of the query sequence that aligned with the target.
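A minimal sketch of how the two acceptance criteria might be applied to a single match is given below, assuming the overlap is measured exactly as described above (aligned query residues as a percentage of the query length). The `Hit` record and the function names are hypothetical; only the two threshold values come from the text.

```python
from dataclasses import dataclass

E_VALUE_CUTOFF = 5e-4       # E-value threshold quoted in the text
MIN_OVERLAP_PERCENT = 50.0  # overlap threshold quoted in the text


@dataclass(frozen=True)
class Hit:
    """Hypothetical summary of one PSI-BLAST match."""
    query_id: str
    target_id: str
    e_value: float
    query_length: int      # residues in the query sequence
    aligned_residues: int  # query residues aligned with the target


def overlap_percent(hit):
    """Percent of the query sequence that aligned with the target."""
    return 100.0 * hit.aligned_residues / hit.query_length


def passes_thresholds(hit, e_cutoff=E_VALUE_CUTOFF, min_overlap=MIN_OVERLAP_PERCENT):
    """True when the hit meets both the E-value and the overlap criteria."""
    return hit.e_value <= e_cutoff and overlap_percent(hit) >= min_overlap


# Illustrative hits only; identifiers and values are made up.
toy_hits = [
    Hit("q1", "t1", 1e-10, 200, 150),  # significant, 75% overlap
    Hit("q1", "t2", 1e-10, 200, 60),   # significant, but only 30% overlap
    Hit("q1", "t3", 1e-2, 200, 190),   # long alignment, but E-value too high
]
accepted = [h.target_id for h in toy_hits if passes_thresholds(h)]
print(accepted)  # ['t1']
```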
To annotate the genome for the purposes of reliable analysis, we wanted to maximize the coverage yet minimize the error rate. Figure 5 shows coverage plotted against error per query (EPQ) for differing overlap thresholds from 0% to 100% in steps of 10%. Selecting an overlap threshold of 50% with an E-value of 5.0 × 10^-4 in a one-to-one relationship, half (50%) of the target is identified in 32% of the cases, with an EPQ of 0.22%. These values were used to recruit putative homologs using PSI-BLAST. However, this is the error rate of the raw data, and postprocessing (DomainFinder and DRange) of the data subsequent to this reduces the error rate further.
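The exact definitions behind Figure 5 are not spelled out here, so the following Python sketch assumes the usual benchmarking convention: coverage is the fraction of true pairwise relationships recovered, and errors per query is the number of false matches divided by the number of query sequences. The function `coverage_and_epq` and the toy data are hypothetical.

```python
def coverage_and_epq(accepted_hits, n_true_pairs, n_queries):
    """Compute (coverage, errors per query) for one threshold setting.

    accepted_hits: iterable of (query_id, target_id, is_true_homolog) tuples
    that passed the chosen E-value and overlap thresholds.
    A true relationship found in both directions is counted once.
    """
    hits = list(accepted_hits)
    true_found = {frozenset((q, t)) for q, t, is_true in hits if is_true}
    false_hits = sum(1 for _, _, is_true in hits if not is_true)
    return len(true_found) / n_true_pairs, false_hits / n_queries


# Illustrative data only: five queries, four true relationships in total.
toy_accepted = [
    ("a", "b", True),
    ("b", "a", True),   # same relationship as above, found in reverse
    ("a", "c", False),  # false positive
    ("d", "e", True),
]
coverage, epq = coverage_and_epq(toy_accepted, n_true_pairs=4, n_queries=5)
print(coverage, epq)  # 0.5 0.2
```

Repeating this calculation over a grid of overlap thresholds would give one (coverage, EPQ) point per setting, which is the kind of curve the text describes.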
The PSI-BLAST matches are compiled into a list of CATH
superfamily assignments for various regions in each gene sequence. By
applying a clustering algorithm (DomainFinder; Pearl et al. 2002),
S95Rep assignments for each region on the gene are
converted into a consensus description. Where two S95Reps with the same
CATH code are assigned to the same region of a gene, boundary data are
recorded both for the region where the S95Reps overlap (the consensus
region) and for the regions on either side, where they do not overlap
(the extremes), as illustrated in Figure 6. All
downstream processing is then performed by DRange, a suite
of code that attempts to resolve any clashes between two different
homologous superfamilies (H families) that have been assigned to the
same gene region.
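The consensus region and the extremes can be illustrated with a small interval calculation. This is only a sketch of the boundary bookkeeping for two S95Rep matches carrying the same CATH code; the function name and the inclusive `(start, end)` residue coordinates are assumptions, and the clustering performed by DomainFinder itself is not reproduced here.

```python
def consensus_and_extremes(a, b):
    """Split two overlapping matches into a consensus region and extremes.

    a, b: (start, end) inclusive residue ranges on the gene, both assigned
    the same CATH code. Returns (consensus, extremes), or None when the
    two ranges do not overlap at all.
    """
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    if start > end:
        return None  # no shared residues, so no consensus region

    lowest = min(a[0], b[0])
    highest = max(a[1], b[1])
    extremes = []
    if lowest < start:
        extremes.append((lowest, start - 1))   # N-terminal extreme
    if highest > end:
        extremes.append((end + 1, highest))    # C-terminal extreme
    return (start, end), extremes
```

For example, matches covering residues 10 to 120 and 25 to 140 share a consensus region of 25 to 120, with extremes at 10 to 24 and 121 to 140.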
Clashes in the DomainFinder assignments may arise from the way in which the CATH database is necessarily compiled. CATH is cautious in its assignment of homologous superfamilies. Proteins which have diverged to an extent that their sequence and/or structural similarity falls below the cutoffs used to assign homologs are placed in separate homologous superfamilies unless there is sufficient additional functional evidence to merge the families. Problems for any database of domain families arise when there is not enough functional evidence available at the time of classification. In these cases, proteins with clear structural similarity but no clear sequence similarity will be assigned to the same fold group but not the same homologous superfamily. This ensures that homologous superfamilies remain self-consistent and that they do not include evolutionarily unrelated proteins. However, when distant sequences from the same protein family are placed in different H families (due to lack of functional evidence), they may match the same region of a gene of unknown structure. It will then appear that two different H families have been assigned to the same region of a gene even though the two superfamilies may actually be evolutionarily related.
Domain clashes may also arise when the N terminus of one assigned domain overlaps with the C terminus of an adjacent assigned domain on a gene. These clashes arise because domains within homologous superfamilies may contain additional residues (extensions) at their C or N terminus. Such extensions are part of the natural variability within homologous superfamilies. When domains are aligned to genes, their extensions may extend along the gene and overlap with adjacent domain assignments, causing a clash.
DRange: A Suite of Modules to Verify Domain Assignments
The DRange suite, described below, contains four modules for cleaning the data and resolving clashes where domains from two different homologous superfamilies have been assigned to the same region of a gene. Decisions made are based on reasonable biological criteria for determining whether the overlapping regions are evolutionarily related or whether the overlapping regions fall within a tolerable level of overlap. When overlapping, clashing assignments are found, the DRange process accepts those assignments that are from different homologous superfamilies but from the same fold group and only assigns a fold to that region of the gene. In cases where the fold is different, the assignment that has the greatest sequence evidence in support is kept (Multiparse module). Finally, where there is insufficient sequence evidence, both domains are kept if the overlap is small; otherwise, both are excluded (CleanAssign module).
Collapse Module
The first step in DRange is a module called Collapse, which clears up any "noise" in the data (amounting to around 3% of the assignments). The strict cutoffs in the DomainFinder algorithm can lead to an over-cautious assignment of consensus regions. This problem, illustrated in Figure 7, arises when a homologous superfamily matches a distantly related gene and does not achieve a global alignment with the gene. The DomainFinder algorithm will not merge the smaller assignment with the others, as it does not overlap to a great enough extent. Collapse looks for consensus regions of the same homologous superfamily that overlap enough to be merged together.
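The merging step can be sketched as a simple interval merge. The text does not state how much overlap counts as "enough", so the `min_fraction` parameter (overlap as a fraction of the shorter region) and its default value are purely illustrative assumptions, as are the function name and the `(cath_code, start, end)` tuple layout.

```python
def collapse(regions, min_fraction=0.5):
    """Merge consensus regions of the same superfamily that overlap enough.

    regions: list of (cath_code, start, end) with inclusive coordinates.
    Two regions with the same code are merged when their shared residues
    cover at least min_fraction of the shorter region (hypothetical rule).
    """
    merged = []
    # Sorting groups identical codes together and orders them by start.
    for code, start, end in sorted(regions):
        if merged:
            p_code, p_start, p_end = merged[-1]
            shared = min(end, p_end) - max(start, p_start) + 1
            shorter = min(end - start + 1, p_end - p_start + 1)
            if code == p_code and shared >= min_fraction * shorter:
                merged[-1] = (code, p_start, max(end, p_end))
                continue
        merged.append((code, start, end))
    return merged
```

With this rule, two regions of superfamily 3.40.50.300 at residues 10 to 100 and 60 to 110 collapse into a single region at 10 to 110, while a third region of the same superfamily at 300 to 350 is left separate.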
Multiparse Module
Resolving clashes between different homologous superfamilies starts with the Multiparse module. This uses the domain boundaries within CATH-classified multidomain proteins to verify which domains should be accepted and which rejected when two domains from differing CATH superfamilies clash. The module does not resolve clashes where the gene will only have a single domain assigned; these are resolved by the CleanAssign module (see below). The clash of three domain assignments (labeled homologous superfamilies H1, H2, and H3) and the resolution process are illustrated in Figure 9. In the example, a gene is hit by a multidomain protein which comprises two domains belonging to homologous superfamilies H1 and H2, whose domain boundaries have already been determined. Because the multidomain sequence matches the full gene, the gene is presumed to contain the same domains as the multidomain protein from CATH. Those domain assignments that match the multidomain protein and its domain boundaries are allowed (from H families 1 and 2), and the data for the third domain assignment (H family 3) are removed from the list of consensus matches.
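The H1/H2/H3 example can be expressed as a filter over the consensus matches. The sketch below is a simplified reading of the text, not the Multiparse implementation: it assumes that the domains of the matching multidomain protein have already been mapped onto gene coordinates, and it treats any shared residue as a match or a clash. The function names and tuple layout are hypothetical.

```python
def overlaps(a, b):
    """True when two (code, start, end) regions share at least one residue."""
    return a[1] <= b[2] and b[1] <= a[2]


def multiparse(assignments, template_domains):
    """Keep assignments consistent with a matching multidomain protein.

    assignments: consensus matches on the gene as (cath_code, start, end).
    template_domains: domains of the CATH multidomain protein that matches
    the full gene, in gene coordinates (assumed to be precomputed).
    """
    kept = []
    for a in assignments:
        agrees = any(a[0] == t[0] and overlaps(a, t) for t in template_domains)
        clashes = any(a[0] != t[0] and overlaps(a, t) for t in template_domains)
        if agrees or not clashes:
            kept.append(a)  # supported by the template, or outside its span
        # otherwise the assignment clashes with a template domain from a
        # different superfamily and is removed
    return kept
```

Applied to a gene whose template domains are H1 at residues 1 to 100 and H2 at 101 to 220, an H3 assignment at 60 to 150 is removed, whereas the H1 and H2 assignments are retained.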
CleanAssign
The next module (CleanAssign) combines a simple overlap detection algorithm and a simple decision tree to decide whether the overlaps represent a cross assignment (i.e., a gene region where two different CATH fold groups/homologous superfamilies have been assigned) or an acceptable overlapping of domains from different superfamilies. In the case of a cross assignment, no reliable annotation of that sequence can be made and these data are removed from the process of genome annotation. On the other hand, if two separate regions of the gene are assigned different H families but only their ends overlap, this may constitute an acceptable overlap. An acceptable overlap is either not more than 30 residues or, in the case of larger domains, not more than 10% of the residues of the larger domain and 30% of the residues of the smaller. Figure 10 shows the decision tree with the overlap limits. Those overlapping domains which are accepted are used for genome assignment. Where the cross hits share the same fold but belong to different homologous superfamilies, data are retained for both assignments but the assignment for that region of the gene can only be made at the fold level (although the significance of the PSI-BLAST match suggests that these proteins are homologs which were undetected at the time of classification in the CATH database).
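The overlap limits quoted above translate into a short decision function. This sketch uses the 30-residue, 10%, and 30% limits from the text, but the order in which the tests are applied is an assumption (the decision tree of Figure 10 is not reproduced here), and the function names and return labels are hypothetical. It assumes the two assignments come from different homologous superfamilies and that the fold is given by the first three fields of the CATH code.

```python
def fold_of(cath_code):
    """Fold (T-level) of a CATH code, e.g. '3.40.50.300' -> '3.40.50'."""
    return ".".join(cath_code.split(".")[:3])


def clean_assign(a, b, max_residues=30, max_frac_larger=0.10,
                 max_frac_smaller=0.30):
    """Classify the overlap between two (cath_code, start, end) assignments.

    Returns 'no_overlap', 'accept', 'fold_only', or 'reject'.
    """
    (code_a, start_a, end_a), (code_b, start_b, end_b) = a, b
    shared = min(end_a, end_b) - max(start_a, start_b) + 1
    if shared <= 0:
        return "no_overlap"

    len_a = end_a - start_a + 1
    len_b = end_b - start_b + 1
    larger, smaller = max(len_a, len_b), min(len_a, len_b)

    # Acceptable overlap: a short absolute overlap, or a small fraction
    # of both the larger and the smaller domain.
    if shared <= max_residues:
        return "accept"
    if shared <= max_frac_larger * larger and shared <= max_frac_smaller * smaller:
        return "accept"

    # Cross assignment: retained only at the fold level when both
    # superfamilies share the same fold, otherwise removed.
    if fold_of(code_a) == fold_of(code_b):
        return "fold_only"
    return "reject"
```

For instance, a 40-residue overlap between a 500-residue domain and a 240-residue domain is accepted under the percentage rule, whereas a 151-residue overlap between two superfamilies of the same fold yields a fold-level assignment only.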
Genome Annotation and the Gene3D Web Server
Lastly, the structurally annotated genes are matched to the genes within the whole genomes, and all assignment data and statistics are stored in the CATH Oracle database. For a genome to be eligible for inclusion in Gene3D, the sequence must be regarded as complete and not a draft sequence; this increases the reliability of the results but necessarily rules out many of the eukaryotic genomes currently available. Assignment statistics are generated to assess coverage and PSI-BLAST performance between each new round of annotation. Figure 11 illustrates this whole process with typical assignment figures for the Escherichia coli genome. Table 1 shows the assignment statistics for all of the genomes.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| |
FOOTNOTES |
|---|
3 Corresponding author.
E-MAIL orengo{at}biochem.ucl.ac.uk; FAX 44-207-7679-7193.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.213802.
Received September 5, 2001; accepted in revised form January 11, 2002.