Vol. 11, Issue 5, 754-770, May 2001
LETTER
ABSTRACT
The basic Helix-Loop-Helix (bHLH) proteins are transcription factors that play important roles during the development of various metazoans including fly, nematode, and vertebrates. They are also involved in human diseases, particularly in oncogenesis. We made an extensive search for bHLH sequences in the completely sequenced genomes of Caenorhabditis elegans and of Drosophila melanogaster. We found 35 and 56 different genes, respectively, which may represent the complete set of bHLH genes in these organisms. A phylogenetic analysis of these genes, together with a large number (>350) of bHLH from other sources, led us to define 44 orthologous families, among which 36 include bHLH from animals only and two have representatives in both yeasts and animals. In addition, we identified two bHLH motifs present only in yeast, and four that are present only in plants; however, the latter number is certainly an underestimate. Most animal families (35/38) comprise fly, nematode, and vertebrate genes, suggesting that their common ancestor, which lived in pre-Cambrian times (600 million years ago), already possessed as many as 35 different bHLH genes.
INTRODUCTION
Transcription factors of the basic Helix-Loop-Helix (bHLH) family
play a central role in cell proliferation,
determination, and differentiation (Jan and Jan 1993; Weintraub 1993;
Hassan and Bellen 2000). The bHLH domain is ~60 amino acids long and comprises a DNA-binding basic region (b) followed by two
α-helices separated by a variable loop region (HLH) (Ferré-D'Amaré et al. 1993).
The HLH domain promotes dimerization, allowing the formation of
homodimeric or heterodimeric complexes between different family members
(Murre et al. 1989a; Kadesch 1993). The two basic domains brought
together through dimerization bind specific hexanucleotide sequences
(Murre et al. 1989a; Van Doren et al. 1991, 1994; Ohsako et al. 1994).
The bHLH motif was first identified in the murine transcription factors
E12 and E47 (Murre et al. 1989b). Numerous bHLH proteins have since
been identified in animals, plants, and fungi. A phylogenetic analysis
based on a sample of 122 bHLH sequences has led to a subdivision into
four monophyletic groups of proteins named A, B, C, and D (Atchley and
Fitch 1997).
Groups A and B include bHLH proteins that bind hexameric DNA sequences
referred to as "E-boxes" (CANNTG): CACCTG or CAGCTG for group A, and
CACGTG or CATGTG for group B (Murre et al. 1989a; Van Doren et al. 1991;
Dang et al. 1992).
Group A includes several tissue-specific bHLH proteins (e.g., MyoD,
Twist, Achaete-Scute proteins; for a recent review, see Hassan and
Bellen 2000) as well as the ubiquitously distributed E12/Daughterless-type bHLH proteins (Murre et al. 1989b). In many instances, the tissue-specific proteins form inactive homodimers and
require the presence of an E12/Daughterless partner to form active
heterodimers (Cabrera and Alonso 1991; Lassar et al. 1991; Van Doren et
al. 1992). Binding of the heterodimers to an E-box usually leads to
transcriptional activation of the target gene (Cabrera and Alonso 1991;
Van Doren et al. 1992).
Group B includes a large number of functionally unrelated proteins
(e.g., Myc, Max, USF, SREBP, MITF) involved in various developmental
and cellular processes (Henriksson and Luscher 1996; Facchini and Penn
1998; Goding 2000). Some group B proteins contain an additional motif,
known as a Leucine Zipper (LZ), which is also involved in protein
dimerization. Dang et al. (1992) and Atchley and Fitch (1997) included
in the same group B several proteins related to the Drosophila
Hairy and Enhancer of split bHLH (HER proteins; Fisher and Caudy 1998).
These proteins are characterized by the presence of a proline instead
of an arginine at a crucial position in the basic domain. DNA-binding
site selection and in vivo studies have shown that these proteins bind
preferentially to sequences referred to as "N-boxes" (CACGCG or
CACGAG) and have only a low affinity for "E-boxes" (Ohsako et
al. 1994; Van Doren et al. 1994). The HER proteins are characterized
further by the presence of an additional motif, the 4-amino acid WRPW domain, which allows the interaction with the Groucho repressor protein
(Fisher and Caudy 1998). Accordingly, the HER proteins have been shown
to act as transcriptional repressors during nervous system development
and segmentation (Kageyama and Nakanishi 1997; Fisher and Caudy 1998).
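The binding-site classes described above can be expressed as a small lookup. The following is only an illustrative sketch: the function name and the returned labels are hypothetical, and the hexamer lists assume that the second group B site is CATGTG and the second N-box is CACGAG.

```python
import re

# Generic E-box consensus CANNTG (N = any nucleotide).
E_BOX = re.compile(r"CA[ACGT]{2}TG")

# Hexamers quoted in the text; CATGTG and CACGAG are assumed readings.
GROUP_A_SITES = {"CACCTG", "CAGCTG"}
GROUP_B_SITES = {"CACGTG", "CATGTG"}
N_BOX_SITES = {"CACGCG", "CACGAG"}


def classify_hexamer(site: str) -> str:
    """Return the binding-site class of a DNA hexamer (hypothetical labels)."""
    site = site.upper()
    if len(site) != 6 or set(site) - set("ACGT"):
        raise ValueError("expected a hexamer over the alphabet A, C, G, T")
    if site in N_BOX_SITES:
        return "N-box"
    if site in GROUP_A_SITES:
        return "E-box (group A)"
    if site in GROUP_B_SITES:
        return "E-box (group B)"
    if E_BOX.fullmatch(site):
        return "E-box (other)"
    return "other"
```

Note that neither N-box hexamer matches the CANNTG consensus, which is in line with the low affinity of HER proteins for E-boxes.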
Group C corresponds to the family of bHLH proteins known as bHLH-PAS
(Crews 1998). The characteristic feature of bHLH-PAS proteins is the
PAS domain, named after the first three proteins identified with this
motif: Drosophila Period (Per), human ARNT, and
Drosophila Single-minded (Sim). The PAS domain found in
bHLH-PAS proteins is ~260-310 amino acids long and allows the
dimerization between PAS proteins, the binding of small molecules
(e.g., dioxin), and interactions with non-PAS proteins (Crews 1998).
bHLH-PAS proteins control a variety of developmental and physiological events including neurogenesis, tracheal and salivary duct formation, toxin metabolism, circadian rhythms, and response to hypoxia (Crews 1998). bHLH-PAS proteins bind to ACGTG or GCGTG core sequences.
Group D corresponds to HLH proteins that lack a basic domain and are
hence unable to bind DNA. This group includes the Id and
Extramacrochaete (Emc) proteins (Benezra et al. 1990; Ellis et al.
1990; Garrell and Modolell 1990), which act as antagonists of group A
bHLH proteins (Van Doren et al. 1991, 1992).
An additional group of putative HLH proteins has been described more
recently, the COE family (for Collier/Olf-1/EBF). This group is
characterized by the presence of an additional domain involved both in
dimerization and in DNA binding, the COE domain (Crozatier et al.
1996). The HLH sequences of this group are highly divergent from the
other bHLH motifs, making their phyletic analysis difficult.
Other than this subdivision in a few large groups, however, little is
known of the evolution and diversification of the bHLH domain. Yet,
given the importance of the bHLH genes in development, it would be
desirable to have a more refined classification scheme of the various
types of bHLH motifs, as well as a better understanding of their
evolutionary relationships both within and between organisms. We have
taken advantage of the complete sequencing of the nematode (Caenorhabditis elegans Sequencing Consortium 1998)
and fly (Adams et al. 2000) genomes to extract a large, and possibly
complete, set of bHLH genes from these two organisms. We also have used the large number of bHLH genes that now have been identified in vertebrates, as well as the smaller number available in plants and
fungi, to assess the evolutionary relationships within this family.
RESULTS AND DISCUSSION
Derivation of Comprehensive Sets of bHLH Sequences from Existing Databases
The completion of the nematode and fly sequencing projects provided
us with an opportunity to screen whole genomes for bHLH coding regions.
To collect as many such sequences as possible (hopefully, all of them)
we first retrieved a large number of bHLH sequences available from the
nonredundant NCBI (http://www.ncbi.nlm.nih.gov) and Sanger protein
databases (http://www.sanger.ac.uk), as described in the Methods
section. We used the most divergent among these sequences (as
determined by preliminary phylogenetic reconstructions) to screen by
BLASTP (Altschul et al. 1990) the complete genomic
sequences of C. elegans and D. melanogaster. We used
the retrieved sequences that were not present in our initial collection to make new BLASTP searches in both genome databases as
well as in the nonredundant NCBI protein database. We also used yeast
and plant sequences retrieved from our original screen to make
BLASTP searches against the Saccharomyces
cerevisiae and the Arabidopsis thaliana genome databases,
in order to isolate additional bHLH sequences from these organisms.
These various searches generated a set of more than 350 different bHLH
sequences. We did not systematically retrieve the large number of bHLH
sequences from various mammals (other than mouse) that are available.
Thus, it is clear that our set is far from including all the available bHLH sequences. We believe, however, that it provides a very extensive coverage of fly, nematode, and mouse genes, and a fair representation of the plant and fungal types.
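The search strategy described above (query with divergent seeds, then query again with every newly retrieved sequence) can be sketched as a simple closure computation. This is a minimal illustration, not the procedure actually used: `iterative_search`, `toy_search`, and the toy identifiers are hypothetical, the real searches were BLASTP runs against genome and protein databases, and running the loop until no new hit appears is an assumption.

```python
from typing import Callable, Iterable, Set


def iterative_search(seeds: Iterable[str],
                     search: Callable[[str], Set[str]]) -> Set[str]:
    """Re-query with every newly found sequence until no new hits appear."""
    found = set(seeds)
    queue = list(found)
    while queue:
        query = queue.pop()
        for hit in search(query):
            if hit not in found:      # only new sequences trigger new searches
                found.add(hit)
                queue.append(hit)
    return found


# Toy stand-in for a similarity search: each ID maps to the IDs it retrieves.
TOY_HITS = {
    "seed1": {"geneA", "geneB"},
    "geneA": {"seed1", "geneC"},
    "geneB": set(),
    "geneC": {"geneA"},
    "unrelated": {"unrelated2"},
}


def toy_search(query: str) -> Set[str]:
    return TOY_HITS.get(query, set())


result = iterative_search(["seed1"], toy_search)
```

In this toy example, `geneC` is reached only through `geneA`, which mirrors how a divergent sequence may be detected only after a closer relative has been retrieved.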
We aligned these sequences using the multiple-alignment software
CLUSTALW (Thompson et al. 1994) and checked each alignment
by hand. The verified alignments then were used to construct phylogenetic trees as described in the Methods section. The resultant trees were bootstrapped to provide information about their statistical reliability. We used these trees to define groups of orthologous sequences.
Identification of Orthologous Families
Orthologous genes in two or more organisms are homologs that evolved
from the same gene in the last common ancestor (Fitch 1970). Paralogous
genes are those that have resulted from within-species duplication
(Fitch 1970). Unfortunately, there is no absolute criterion that can be
used to decide if two genes are orthologous. The criterion we used to
define orthologous families was that the grouping of bHLH sequences
from at least two species into one monophyletic family should be
supported by different methods of analysis with bootstrap values
>50%. A similar criterion has been used in other analyses of protein
families (Galliot et al. 1999). This criterion was relaxed for a few
families, as will be discussed later (lower bootstrap values; Table
1). The fact that
congruence was observed between trees constructed by different methods
suggests that our reconstruction of the bHLH phylogeny is essentially correct.
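The family criterion stated above can be written as a small predicate. This is a hedged sketch: the function name, the argument layout, and the requirement of at least two methods are illustrative assumptions; only the two-species rule and the 50% bootstrap threshold come from the text.

```python
from typing import Dict, Iterable

BOOTSTRAP_THRESHOLD = 50.0  # percent, as stated in the text


def meets_family_criterion(species: Iterable[str],
                           bootstrap_by_method: Dict[str, float],
                           threshold: float = BOOTSTRAP_THRESHOLD) -> bool:
    """True if a candidate clade satisfies the orthologous-family criterion."""
    enough_species = len(set(species)) >= 2
    # "Different methods" is read here as at least two methods (assumption).
    enough_methods = len(bootstrap_by_method) >= 2
    well_supported = all(v > threshold for v in bootstrap_by_method.values())
    return enough_species and enough_methods and well_supported
```

For example, a clade with fly and mouse members and bootstrap values of 72% and 64% under two methods would pass, whereas a clade containing sequences from a single species would not, whatever its support.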
Our analysis led us to define 44 orthologous families (i.e., 44 ancestral types of bHLH domains). Table 1 summarizes the 44 families and some of their properties. We named each family according to its first discovered member or, in a few cases, according to its best-characterized member. The complete list of all members of every family, together with database accession numbers, can be found as supplementary material at http://www.genome.org. Two types of bHLH motifs presented special problems. First, the HLH domains of COE family proteins could not easily be aligned with other bHLH proteins. Hence, the phylogenetic analysis of this family was mostly done without other types of bHLH sequences and using the well-conserved COE domain in addition to the bHLH. Second, although Hairy/E(spl)-related (HER) proteins appear consistently monophyletic, the resolution within the group was very poor and we were unable to identify orthologous families with any confidence. Because many amino acids flanking the bHLH motif are conserved in this group, we used a larger domain for phyletic comparisons to obtain better (but still low) phylogenetic resolution (Table 1).
Figure 1 shows an alignment of all 44 bHLH types, based on one representative of each family. Thirty-six families comprise only animal members, four families are specific to plants, two are found only in yeasts, and two have both yeast and animal representatives. Thus, the bHLH motif appeared very early in eukaryotic history, but its expansion occurred almost entirely after the divergence between plants, fungi, and animals. The presence of only four plant families in our set is most likely a result of the fact that there were no extensive searches for bHLH genes in plants. As a consequence, most plant sequences come from one species, A. thaliana, for which an extensive genome project is under way. Indications that more plant families are to be identified come from preliminary BLASTP searches, which revealed 30 different A. thaliana bHLH sequences, most of which are unrelated to other plant bHLH sequences. We found these sequences to form four additional "families" comprising A. thaliana sequences only. These "families" are not reported in Table 1 because we chose to consider, as significant families, only groups that contain sequences from at least two different species. Hence, most A. thaliana bHLH are considered, in our work, as orphan genes (i.e., sequences that cannot be assigned to any family).
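As a quick consistency check on the family counts given above, the tally can be reproduced in a few lines. The dictionary keys are illustrative labels; the numbers are those stated in the text.

```python
# Number of orthologous families by taxonomic distribution (from the text).
families = {
    "animal only": 36,
    "plant only": 4,
    "yeast only": 2,
    "yeast and animal": 2,
}

total_families = sum(families.values())
# Families with at least one animal member.
animal_families = families["animal only"] + families["yeast and animal"]

print(total_families, animal_families)
```

The totals agree with the 44 families and the 38 animal families referred to elsewhere in the text.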
Drosophila Genes
We found 56 bHLH sequences in D. melanogaster. Table 2 lists
these sequences, the family to
which they belong, their chromosomal localization, their
characterization status, and their accession number. A version of this
table with links to Flybase (http://flybase.bio.indiana.edu; The
Flybase Consortium 1999) is available as supplementary material at
http://www.genome.org.
We believe that these sequences represent, if not the full set, at least a large proportion of the bHLH domains present in the fly genome. The repeated BLASTP searches that were used to build our original set of genes were meant to detect even very divergent types of bHLH domains. Furthermore, after we determined the 44 types of bHLH domains, we made new BLASTP screens of the complete sequence of D. melanogaster with one member of each family, without finding any new genes. On the other hand, none of these searches revealed Collier, the fly COE family representative. Therefore it is conceivable that one or more highly divergent HLH families may have escaped our screens.
The BLASTP searches detected additional sequences that we
did not use in our analyses, as they did not correspond to complete HLH
motifs. Such sequences were identified because they present a marked
similarity with a small region of the bHLH domain, 20-30 amino acids
long, often including the basic region. In all cases we checked the
sequence by hand, and the decision as to whether a sequence did or did
not correspond to a bona fide bHLH domain was always clear-cut. We also
checked the 61 "HLH DNA-binding domain" and 69 "Myc-type HLH
dimerization domain" sequences recently identified in the
Drosophila genome (Rubin et al. 2000), and found that only the
56 sequences listed in Table 2 correspond to complete bHLH domains. Our
analysis is completely consistent with and extends that of Moore et al.
(2000), who analyzed 12 previously uncharacterized bHLH from the
Drosophila genome project. We also retrieved these 12 genes in our
screen and our family assignment coincides with that of Moore et al. (2000).
C. elegans Genes
We found 35 bHLH sequences in C. elegans. Table 3 shows
these sequences; a version of this
table with links to Wormbase (http://www.wormbase.org) is available as
supplementary material at http://www.genome.org. A previous report
(Rubin et al. 2000) mentioned 38 "HLH DNA-binding domain" sequences
and 8 "Myc-type HLH dimerization domain" sequences in the C. elegans genome. Prior analysis of the C. elegans genome revealed only 24 putative bHLH proteins (Ruvkun and Hobert 1998). Here
again we checked the discrepancies between our results and the previous
ones, and found that only the 35 sequences listed in Table 3 correspond
to complete bHLH domains. These 35 sequences are likely to represent
the full set of C. elegans bHLH.
In contrast to the sequences from Drosophila, most of which
can easily be assigned to one of the 38 animal bHLH families, 17% of
the C. elegans sequences (6/35) cannot be confidently assigned to a specific family, and are therefore called "orphan".
Furthermore, several C. elegans bHLH included in families are
only loosely linked to the other members (their inclusion is supported
by low bootstrap values). Conversely, 40% of the animal families do
not contain C. elegans members. These results are consistent
with the traditional view of metazoan phylogeny, which held nematodes as very distantly related to both arthropods and vertebrates. Recent
molecular phylogenies indicate that, on the contrary, arthropods and
nematodes are relatives (i.e., they group into one of the three clades
of bilaterians, the ecdysozoa) (Aguinaldo et al. 1997; Adoutte et al.
2000). Many nematodes, including C. elegans, have higher
mutation rates than other metazoans, not only in their rRNA genes
(Aguinaldo et al. 1997), but also throughout their genome (Mushegian et
al. 1998). Therefore, nematode sequences tend to be artifactually
displaced to a wrong position because they appear as being very distant
from all others, and to end up at the base of the tree or even
associated with the outgroup (because of chance convergence at some
nucleotide positions). This phenomenon, known as "long branch
attraction" (for a recent review, see Philippe and Laurent
1998), presumably explains why our analysis led to the clustering of
several C. elegans sequences at the base of the group A
family, or as orphan group B genes (Table 3). Accordingly, we found
that the worm bHLH sequences diverge more rapidly than those of fly and
mouse (data not shown; a detailed analysis can be found on our Web
site, http://www.cnrs-gif.fr/cgm/evodevo/bhlh/index.html).
Interestingly, some nematode sequences have diverged very little from
their fly or mouse counterparts. These include the few functionally
characterized C. elegans bHLH genes, which show overall functional conservation with their vertebrate and/or fly orthologs; for example, the C. elegans orthologs of twist and
myoD are involved in muscle formation (Harfe et al. 1998a,b),
and the orthologs of atonal and NeuroD (lin-32
and cnd-1) play a role in nervous system development (Zhao
and Emmons 1995; Hallam et al. 2000). The genetic control of
developmental processes such as neurogenesis and myogenesis relies on
small sets of interacting genes (syntagms; Garcia-Bellido 1981). The
function of syntagms crucially relies on specific molecular
interactions among their members, hence imposing strong structural
constraints on them and preventing structural diversification (for
discussion on syntagms and evolution, see Huang 1998). This may explain
why such networks are strongly conserved throughout metazoan evolution
(Baylies et al. 1998; Arendt and Nübler-Jung 1999) and why
nematode genes involved in such networks have been subject to special constraints.
Mouse Genes
We found a total of 90 different bHLH sequences in mouse (and related mammals). This large set of genes is the result of the extensive molecular analyses of processes such as neurogenesis, myogenesis, or oncogenesis in which bHLH are crucially involved. Therefore, it might be that in the absence of systematic bHLH searches or genome sequencing projects, only a small subset of vertebrate bHLH genes have been identified so far. Indeed, our initial searches showed that the same vertebrate bHLH genes may be reported under up to seven different names, suggesting the convergence of many research groups on small numbers of crucial genes.
However, our results show that at least 35 of the 38 vertebrate bHLH types have protostomian (fly and/or worm) orthologs (90%), and reciprocally, that all fly genes have mouse counterparts. Because we believe that our set of fly genes is close to complete, the fact that mouse counterparts have been identified for all fly genes suggests that our sample of bHLH genes in mouse is in fact quite extensive. Needless to say, a definite answer to the question of how complete is our knowledge of vertebrate bHLH genes will have to await the results of the various vertebrate sequencing projects that currently are under way.
Assessing Orthologies
The assessment of orthologies must necessarily be based on
phylogenetic reconstructions. Thus, although orthology is a very useful
concept, there is no foolproof way of deciding whether two similar
sequences are indeed orthologous. We will illustrate this difficult
question in the case of two closely related fly genes, Delilah
(dei) and CG11450 (Figs. 1 and 3). CG11450
has recently been described, based on overall similarity, as the
Drosophila ortholog of the vertebrate NeuroD gene
(Hassan and Bellen 2000). We similarly retrieved CG11450 as
the closest Drosophila relative of NeuroD when making
BLASTP searches. However, the inclusion of both genes in
the NeuroD family is not supported by the phylogenetic analyses. While
both genes clearly belong to the Atonal superfamily, they cannot be
associated unequivocally with any of the NeuroD, Ngn,
or Ato families (Figs. 2 and 3). Nevertheless, CG11450 and Dei
may represent divergent NeuroD proteins as they show several residues
in their bHLH typical of this family (Fig. 1; Hassan and Bellen 2000).
We examined whether what is known of the function of these various
genes might help us elucidate the origin of dei and
CG11450. The vertebrate representatives of the Ngn, NeuroD and
Ato families are mainly involved in the determination and the
differentiation of neural cells (Kageyama and Nakanishi 1997; Hassan
and Bellen 2000). In Drosophila, the Ato representatives
ato, amos and cato are all involved in
neural development (Jarman et al. 1993; Goulding et al. 2000a,b; Huang
et al. 2000). The function of the neurogenin ortholog
target of poxn (tap) is not known, but the gene is
exclusively expressed at late stages of neural development (Bush et al.
1996; Gautier et al. 1997; Ledent et al. 1998). In contrast,
dei and CG11450 are not involved in neurogenesis.
dei is required for the differentiation of specific epidermal
cells as muscle attachment sites (Armand et al. 1994). CG11450
is expressed in the embryonic mesoderm in a pattern that overlaps that
of twist (Moore et al. 2000). During postembryonic
development, CG11450 is involved in wing vein formation
(CG11450 corresponds to the net locus; Brentrup et
al. 2000). Thus, one plausible interpretation of the data is that
dei and CG11450 are bona fide orthologs of the
NeuroD genes, and that their phylogenetic relationships have been
blurred by a rapid divergence associated with the acquisition of new functions.
Comparison of Fly, Nematode and Vertebrate Families
Most families comprise one protostome (fly and/or nematode) and
several (often two) vertebrate genes. The fact that most families contain both fly and vertebrate genes suggests that there was no
addition of new bHLH types in the corresponding lineages, and therefore
no important diversification of the ancestral repertoire. Among the few
families that lack fly genes, most also lack nematode genes. These may
represent the emergence of new bHLH types in the vertebrate lineage, or
alternatively a loss of ancestral types in both fly and nematode. The
analysis of bHLH genes from molluscs or annelids might help settle this
question. It is now widely believed that bilateria (triploblastic
metazoans) are composed of three main lineages: deuterostomes (which
include vertebrates and echinoderms) and two large protostome groups,
the ecdysozoans (e.g., arthropods and
nematodes) and the lophotrochozoans (e.g., annelids, molluscs,
flatworms) (e.g., Aguinaldo et al. 1997; de Rosa et al. 1999; Adoutte
et al. 2000). Therefore, the finding of orthologous genes in vertebrates
and lophotrochozoans but not in fly and nematode would strongly suggest
that gene loss has occurred in the ecdysozoan lineage. Similarly,
the case of families that contain vertebrate and either worm or fly
genes is explained best by gene losses that occurred, inside the
ecdysozoan clade, in either lineage after the arthropod/nematode
divergence. This occurred in the fly lineage for only one family, MITF,
which contains vertebrate and worm but no fly genes (the case of the NeuroD family has been discussed above). The much larger number of
families that have vertebrate and fly members but no nematode representative, as well as the large number of nematode genes that
cannot be clearly assigned to specific families (orphan genes), are
likely due to the high divergence rate reported for nematode genes
in general (Aguinaldo et al. 1997; Mushegian et al. 1998) and that we
found within our specific data set (data not shown; for details, see
our Web site at http://www.cnrs-gif.fr/cgm/evodevo/bhlh/index.html).
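The reasoning above, which maps the presence or absence of a family in each lineage to an evolutionary interpretation, can be summarized as a decision function. This is only a sketch of the argument as stated in the text: the function name and the returned strings are hypothetical, and the lophotrochozoan argument is left optional because such data were not available in this study.

```python
from typing import Optional


def interpret_family(vertebrate: bool, fly: bool, nematode: bool,
                     lophotrochozoan: Optional[bool] = None) -> str:
    """Interpret a family's distribution for a family with vertebrate members."""
    if not vertebrate:
        return "not covered by the cases discussed"
    if fly and nematode:
        return "ancestral type retained in both ecdysozoan lineages"
    if nematode and not fly:
        return "loss in the fly lineage"
    if fly and not nematode:
        # The text attributes most of these cases to the high divergence
        # rate of nematode genes rather than to true loss.
        return "loss or unrecognized divergence in the nematode lineage"
    # Absent from both fly and nematode.
    if lophotrochozoan:
        return "loss in the ecdysozoan lineage"
    return "unresolved: new vertebrate type or loss in both fly and nematode"
```

For the MITF family (vertebrate and worm members, no fly member), the function returns "loss in the fly lineage", as in the text.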
Gene and Genome Duplications
Most bHLH families, like other gene families, comprise more members in
vertebrates than in other phyla (Table 1). It has been proposed that
this may reflect the occurrence of two rounds of genome duplication
during early vertebrate evolution (Sidow 1996; Meyer and Schartl
1999), but this idea, mainly based on mapping of gene clusters, remains
controversial (Skrabanek and Wolfe 1998; Hughes 1999; Smith et al.
1999; Martin 2001). Many gene families in vertebrates have fewer than
four genes (Skrabanek and Wolfe 1998; Smith et al. 1999). However, this
might result from gene loss during or after the rounds of duplication
(Meyer and Schartl 1999). Within our set of bHLH genes, the most usual case was two mouse genes per family, but we know this set is likely to be incomplete because the entire genomic sequence of the
mouse is not available. Even within this incomplete set, we observe that up to one-fourth of the families comprise four or more members (Table 1). As pointed out by Hughes (1999), the presence of four vertebrate members, by itself, does not support the genome duplication hypothesis. Support may come only from families whose phylogenetic tree shows a topology of the (AB) (CD) form (i.e., two pairs of
closely related paralogs) (Hughes 1999). Hughes (1999) discussed the
phylogenies of 13 protein families important in development and found
that only one of them shows an (AB) (CD) topology. We constructed
individual trees for each bHLH family (available at http://www.cnrs-gif.fr/cgm/evodevo/bhlh/index.html) and often found one
or two duplication(s) during the vertebrate radiation (e.g., the
Achaete-Scute family; Fig. 4). We checked
the topology of the trees of families with four or more vertebrate
members (nine families, see Table 1) and observed that none of the five
families showing a reliable phylogeny has an (AB) (CD) topology (data
not shown; see http://www.cnrs-gif.fr/cgm/evodevo/bhlh/index.html). Hence, our data set does not support the hypothesis of two rounds of
genome duplication. Figure 4 also shows a feature we observed in
several families: the existence of extra closely related genes in the
tetraploid Xenopus and in ray-finned fishes (actinopterygians) such as the zebrafish Brachydanio rerio. The latter
observation is consistent with the hypothesis that the actinopterygian
genome underwent a duplication, which took place after the
actinopterygian-sarcopterygian lineage divergence (the sarcopterygian
lineage includes coelacanths, lungfishes, and all tetrapods) (reviewed
in Wittbrodt et al. 1998; Meyer and Schartl 1999).
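The (AB) (CD) test discussed above can be illustrated on a rooted tree of four paralogs written as nested tuples. This is a minimal sketch under simplifying assumptions: the function names are hypothetical, the tree is assumed to be rooted and strictly binary, and only the four vertebrate paralogs are included (outgroup sequences removed).

```python
def is_leaf(node) -> bool:
    """A leaf is a gene name; an internal node is a tuple of children."""
    return isinstance(node, str)


def is_cherry(node) -> bool:
    """True if the node joins exactly two leaves (a pair of close paralogs)."""
    return (not is_leaf(node)) and len(node) == 2 and all(is_leaf(c) for c in node)


def has_ab_cd_topology(tree) -> bool:
    """True if the root splits the four genes into two pairs: (AB) (CD)."""
    return (not is_leaf(tree)) and len(tree) == 2 and all(is_cherry(c) for c in tree)


balanced = (("A", "B"), ("C", "D"))   # expected after two rounds of duplication
ladder = ("A", ("B", ("C", "D")))     # sequential, independent duplications
```

Only the balanced tree has the topology expected under two successive genome duplications; the ladder-like tree is the kind of result that does not support that hypothesis.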
Duplications of Ecdysozoan bHLH
A few families contain more than one gene in fly and/or nematode,
and in some cases, more genes than in vertebrates: the Achaete-Scute, Atonal, PTF1, Enhancer of split, Hairy, AHR, and TF4 families in
Drosophila and AHR, Enhancer of split, and Max families in C. elegans (Table 1). The different protostome members of
these families arose by duplications that occurred after the
arthropod/nematode split within the ecdysozoan clade. For example, the
four Drosophila achaete-scute genes are collectively
orthologous to the two vertebrate genes and to the four nematode genes
(Fig. 4); the three Drosophila Atonal genes are collectively
orthologous to the two vertebrate genes and the single nematode gene (Fig.
3). We retrieved the chromosomal localizations of these genes from
Flybase and Wormbase and observed that, in most cases, the members of a
given family have very different localizations, often on different
chromosomes (Tables 2 and 3). The three Drosophila Atonal
genes, for example, are found on three different chromosome arms (2L,
3L and 3R; see Table 2). These localizations suggest that the
duplications that gave rise to the paralogs are rather ancient events.
However, in some cases, the duplications might have occurred more
recently, as the paralogs are localized close to each other in the
genome: this is, for example, the case of the four achaete-scute
genes and the seven Enhancer of split genes, which have
long been known to form gene complexes in Drosophila. We
found one similar case in C. elegans: C17C3.10
and C17C3.8 are adjacent genes and are on the same DNA
strand. In addition, two worm members of the Achaete-Scute family are
found at a similar chromosomal localization (Table 3), although
separated by several unrelated genes. Information about the timing of
duplication events may come from evolutionary comparisons with
increasingly distantly related species. For example, clear orthologs of
three of the four achaete-scute genes have been found in
another dipteran, Ceratitis capitata (Wülbeck and Simpson 2000), while a single ortholog to the four
achaete-scute genes is found in the buckeye butterfly,
Junonia coenia, a lepidopteran, and in the flour beetle,
Tribolium castaneum, a coleopteran (Figure 4; Galant et al.
1998). Duplication, in this case, probably occurred after the
divergence of diptera from other insects.
Phylogenetic Relationships of bHLH Families: A Reappraisal of High-Order Groups
Although the bHLH motif has good resolving power to delimit families
of proteins and describe their evolutionary relationships at the tips
of the clades, the very early evolutionary history of the motif is more
problematic (Atchley and Fitch 1997). Deep nodes usually have low
statistical support (small bootstrap values). This is mainly a result
of the small size of the conserved sequence and the existence of
numerous ancient paralogs. Nevertheless, we found recurrent topologies
when constructing trees with different sequence sets and different
tree reconstruction procedures [maximum parsimony (MP), distance, and
maximum likelihood (ML)]. The congruence between trees obtained with
different methods and different data sets is usually considered in
phylogenetic reconstructions as a good argument in favor of the
validity of a given phylogeny (Adoutte et al. 2000); however, it is not
a demonstration of its reliability. A representative tree of the
different bHLH families is shown in Figure 2. Our results agree largely
with those of Atchley and Fitch (1997), who described the four
high-order groups (A-D) found in a neighbor-joining (NJ) tree, and with
subsequent work of Atchley and collaborators (Atchley et al. 1999;
Morgenstern and Atchley 1999). Although the high-order groups were
supported only by low bootstrap values, their validity was confirmed by MP analyses of particular sites at different positions in the bHLH
(Atchley and Fitch 1997), analyses of bHLH flanking regions (Morgenstern and Atchley 1999), and mathematical modeling (Atchley et
al. 1999). The inclusion of the 44 orthologous families in the
high-order groups is shown in Table 1.
Our results diverge from the previous analyses in a few points, however. First, we have had to revise the relationship between groups A and D, and to include group D within group A. Second, our analysis suggests that group B is paraphyletic and closest to the ancestral bHLH motif. Third, we have evidence that group C is not monophyletic but includes several independent occurrences of the bHLH-PAS association. Finally, the more extensive data set used in the present study led us to define two additional groups, E and F.
Our phylogenetic analysis (Fig. 2) reveals a large monophyletic group
that corresponds to the group A defined by Atchley and Fitch (1997).
This group includes the E12/E47 family genes and several other
families whose members are able to heterodimerize with the E12/E47
proteins (Cabrera and Alonso 1991; Lassar et al. 1991; Van Doren et al.
1992). The phylogenetic analysis (Fig. 2) clearly shows that the Emc
family is deeply embedded within group A. Furthermore,
although group D proteins lack the DNA-binding motif, they are able to
dimerize with several group A proteins (Benezra et al. 1990; Ellis et
al. 1990; Garrell and Modolell 1990; Van Doren et al. 1991, 1992), but
not with other types of bHLH motifs. Therefore, our results indicate
that the Emc family, previously considered to define group D, should
also be considered as belonging to group A.
We believe that group B is paraphyletic rather than monophyletic (Fig. 2). This group is probably closest to the ancestral bHLH type from which groups A, C, D, E, and F bHLH arose. The distribution of these proteins in various groups of organisms strongly supports this suggestion: Group B proteins are found in plants, yeast, and animals, whereas the other groups (A, C, D, E, and F) are found only in animals.
Likewise, we did not find the group C of Atchley and Fitch (1997) to form a monophyletic group (Fig. 2). As this group comprises the bHLH-PAS genes, one obvious explanation for its paraphyly is that the association between the bHLH and the PAS domains occurred several times independently, consistent with the hypothesis of a modular evolution of the bHLH proteins by domain shuffling (for discussion, see Morgenstern and Atchley 1999).
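The distinction between monophyletic and non-monophyletic groups used in the preceding paragraphs can be made concrete with a small check. The following Python sketch is illustrative only and is not part of the analyses reported here: it assumes a rooted tree written as nested tuples, and the tree and leaf names (a1, b1, and so on) are hypothetical. A set of leaves is treated as monophyletic when some node of the tree has exactly that set of leaves as its descendants.

```python
def leaves(tree):
    """Return the set of leaf names below a nested-tuple tree node."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def clades(tree):
    """Yield the leaf set of every node (leaves included) of a rooted tree."""
    yield frozenset(leaves(tree))
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def is_monophyletic(tree, group):
    """True if some node of the tree has exactly `group` as its leaf set."""
    group = frozenset(group)
    return any(clade == group for clade in clades(tree))


# Hypothetical rooted tree: the "b" leaves branch off one after the other
# at the base, and the "a" leaves form a clade nested inside.
TOY_TREE = ("b1", ("b2", (("a1", "a2"), "c1")))

print(is_monophyletic(TOY_TREE, {"a1", "a2"}))  # the two "a" leaves form a clade
print(is_monophyletic(TOY_TREE, {"b1", "b2"}))  # the two "b" leaves do not
```

In this toy tree the two "b" leaves do not form a clade by themselves because the smallest clade that contains both also contains the "a" and "c" leaves, which is the pattern described above for group B.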
We found that all Hairy and Enhancer of split-related proteins form a well-supported monophyletic group that we named group E in accordance with the nomenclature of Atchley and Fitch (Fig. 2). The monophyly of this group is confirmed by the presence of several conserved amino acids flanking the bHLH and the presence of the WRPW peptide (Fisher and Caudy 1998).
Similarly, the HLH domain of the COE proteins appears well conserved among these proteins and highly divergent from that of other bHLH families. Furthermore, all COE proteins contain a highly conserved domain, the COE domain, not found in any other proteins. Taken together, these observations strongly indicate that the COE proteins form a clearly distinct monophyletic group, which we named group F.
Conclusions: An Overview of bHLH Evolution
We have not been able to identify prokaryotic genes that would match our bHLH sequences. Therefore, it seems that the bHLH motif was established early in eukaryote evolution. The bHLH genes of yeast are involved in general transcriptional enhancement and cell cycle control, suggesting that this may have been the original function of the bHLH genes in primitive eukaryotes. An important diversification occurred independently in the animal and plant lineages, as seen by the 36 different families found exclusively in animals and 30 different bHLH genes found in A. thaliana, compared to the five genes found in yeast.
In animals, bHLH genes generally are involved in development and in tissue-specific gene regulation. The 38 families have representatives in the two major subdivisions of the animal kingdom, protostomes and deuterostomes, and must therefore have been represented in their common ancestor prior to the Cambrian radiation, which saw the emergence of all present-day phyla and many extinct ones. Morphologically, these ancestors (also called Urbilateria; De Robertis and Sasai 1996) probably were coelomates with antero-posterior and dorso-ventral polarity, rudimentary appendages, some form of metamerism, a heart, sense organs such as photoreceptors, and a complex nervous system (Knoll and Carroll 1999). Genetically, they possessed numerous Hox genes (at least seven; de Rosa et al. 1999) as well as other homeobox genes, several intercellular signaling pathways (TGF-β, Hedgehog, Notch, EGF), and several Pax genes (Galliot et al. 1999). Our analysis indicates that their genome contained at least 35 different bHLH genes. The functional conservation that often is observed between fly and vertebrate bHLH orthologs indicates that some of the developmental functions associated with present-day bHLH genes already were established in these ancestral organisms, further underscoring the genomic and developmental complexity of this ancient ancestor.
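The inference about the ancestral gene set rests on a simple presence/absence argument: a family with members in both protostomes and deuterostomes is assumed to have been present in their last common ancestor. A minimal Python sketch of that reasoning is given here. The family names and taxon sets are placeholders rather than data from Table 1, and the sketch assumes that no family arose independently or was transferred horizontally between lineages.

```python
# Taxon labels are illustrative; fly and nematode stand for protostomes,
# vertebrate for deuterostomes.
PROTOSTOMES = {"fly", "nematode"}
DEUTEROSTOMES = {"vertebrate"}


def inferred_ancestral_families(family_to_taxa):
    """Return families with at least one protostome and one deuterostome member."""
    ancestral = []
    for family, taxa in family_to_taxa.items():
        if taxa & PROTOSTOMES and taxa & DEUTEROSTOMES:
            ancestral.append(family)
    return sorted(ancestral)


# Placeholder presence/absence data (hypothetical, not taken from the paper).
TOY_FAMILIES = {
    "family_1": {"fly", "nematode", "vertebrate"},
    "family_2": {"fly", "vertebrate"},
    "family_3": {"vertebrate"},
    "family_4": {"fly", "nematode"},
}

print(inferred_ancestral_families(TOY_FAMILIES))
```

Under these assumptions, only the first two placeholder families would be assigned to the common ancestor; a family restricted to one subdivision gives no evidence either way, because it may have been lost in the other lineage.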
METHODS
Protein sequences were obtained mostly by BLASTP search (Altschul et al. 1990) at the National Center for Biotechnology Information (NCBI) and the Sanger Centre, as well as from Swissprot, GenPept, and TrEMBL through SRS (LION Bioscience AG) and Nentrez (NCBI) software. A table containing all sequences and their accession numbers is available on our Web site (http://www.cnrs-gif.fr/cgm/evodevo/bhlh/index.html). Protein alignments were carried out using CLUSTALW (Thompson et al. 1994) with the default parameters, and were subsequently edited and manually improved in Genedoc Multiple Sequence Alignment Editor and Shading Utility, Version 2.6.001 (Nicholas et al. 1997). The percentage conservation of residues in multiple sequence alignments was evaluated using the Blosum62 Similarity Scoring Table (Henikoff and Henikoff 1992). Only the bHLH motif (determined as in Ferré-D'Amaré et al. 1993), plus a few flanking amino acids, was used in most of our analyses because the remaining parts of proteins from independent clades either are not homologous or have diverged so much that the alignments are meaningless. The facilities of the Belgian EMBnet Node (http://be.embnet.org) were used for all database searches through SRS and sequence analysis using Genedoc software, and for most of the protein alignments using CLUSTALW. Trees were built using unweighted maximum parsimony (MP) and neighbor-joining (NJ) algorithms with the PAUP 4.0 program (Swofford 1993). The MP analysis was performed with the following settings: heuristic search over 100 bootstrap replicates, MAXTREES set to 1000 because of computer limitations, other parameters set to default values. When large numbers of sequences (>150) were handled, bootstraps were made by "fast" stepwise addition (1000 replicates) in PAUP 4.0, again because of computer limitations. Extensive computer simulations have shown that such fast algorithms are as efficient as more extensive search algorithms when a large number of sequences is used (Takahashi and Nei 2000). Distance trees were constructed with the NJ algorithm (Saitou and Nei 1987) using PAUP 4.0 based on a Dayhoff PAM 250 distance matrix (Dayhoff et al. 1978). Bootstrap replicates of the NJ trees (1000) also were made with PAUP 4.0, with parameters set to default values.
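As an illustration of the bootstrap procedure mentioned above, the following Python sketch resamples alignment columns with replacement and counts how often a given pair of sequences remains each other's closest neighbor. It is a simplified stand-in, not the PAUP procedure: it uses the uncorrected proportion of differing sites instead of a PAM 250 distance, it scores only the closest pair rather than building an NJ tree, and the function names and toy sequences are assumptions made for the example.

```python
import random
from itertools import combinations


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns drawn with replacement."""
    n_cols = len(alignment[0])
    if n_cols == 0 or any(len(seq) != n_cols for seq in alignment):
        raise ValueError("sequences must be aligned, non-empty, equal length")
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return ["".join(seq[i] for i in picks) for seq in alignment]


def p_distance(seq_a, seq_b):
    """Uncorrected proportion of sites at which two aligned sequences differ."""
    return sum(a != b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def closest_pair(alignment):
    """Indices (i, j), i < j, of the two sequences with the smallest distance."""
    return min(
        combinations(range(len(alignment)), 2),
        key=lambda ij: p_distance(alignment[ij[0]], alignment[ij[1]]),
    )


def bootstrap_support(alignment, pair, replicates=1000, seed=0):
    """Fraction of replicates in which `pair` is the closest pair."""
    rng = random.Random(seed)
    target = tuple(sorted(pair))
    hits = 0
    for _ in range(replicates):
        if closest_pair(bootstrap_alignment(alignment, rng)) == target:
            hits += 1
    return hits / replicates


# Toy alignment (arbitrary strings, not real bHLH sequences).
TOY_ALIGNMENT = ["ACDEFGHIKL", "ACDEFGHIKM", "ACDQFGWIRL", "TCDQYGWIRV"]
support = bootstrap_support(TOY_ALIGNMENT, (0, 1), replicates=1000)
print(f"support for grouping sequences 0 and 1: {support:.2f}")
```

With a short motif such as the bHLH, each replicate draws from few columns, which is one reason why deep groupings can receive low bootstrap values even when other lines of evidence support them.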
Some alignments also were analyzed by maximum likelihood (ML) using Puzzle 4.0.2 (Strimmer and Von Haeseler 1996). The ML analysis was performed using the quartet puzzling tree search procedure with 10,000 puzzling steps, using the Jones-Taylor-Thornton (JTT) model of substitution (Jones et al. 1992), the frequencies of amino acids being estimated from the data set (Strimmer and Von Haeseler 1996).
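Quartet puzzling starts by choosing, for every group of four sequences, the best of the three possible unrooted topologies; it then assembles these quartets into intermediate trees over many puzzling steps and summarizes them in a consensus tree. The Python sketch below illustrates only the first step, under a loud simplification: Puzzle evaluates each quartet by maximum likelihood under the chosen substitution model, whereas the sketch swaps in the distance-based four-point criterion so that it stays self-contained. The distance matrix and function names are hypothetical.

```python
from itertools import combinations


def quartet_topology(dist, a, b, c, d):
    """Choose among ab|cd, ac|bd, ad|bc the split with the smallest
    sum of within-pair distances (four-point criterion, not ML)."""
    options = {
        ((a, b), (c, d)): dist[a][b] + dist[c][d],
        ((a, c), (b, d)): dist[a][c] + dist[b][d],
        ((a, d), (b, c)): dist[a][d] + dist[b][c],
    }
    return min(options, key=options.get)


def all_quartet_topologies(dist):
    """Map every quartet of sequence indices to its preferred split."""
    n = len(dist)
    return {q: quartet_topology(dist, *q) for q in combinations(range(n), 4)}


# Hypothetical additive distances for the tree ((0,1),2,(3,4))
# with all branch lengths equal to 1.
TOY_DIST = [
    [0, 2, 3, 4, 4],
    [2, 0, 3, 4, 4],
    [3, 3, 0, 3, 3],
    [4, 4, 3, 0, 2],
    [4, 4, 3, 2, 0],
]

for quartet, split in all_quartet_topologies(TOY_DIST).items():
    print(quartet, "->", split)
```

The number of quartets grows as the fourth power of the number of sequences, which gives a sense of why the ML analysis was applied only to some alignments.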
The trees were displayed with TreeView (Version 1.5) (Page 1996), saved as PICT files, converted into JPEG files using Graphic Converter, and then annotated using Adobe Photoshop.
ACKNOWLEDGMENTS
We thank Robert Herzog, Marc Colet, and André Adoutte for support. We are especially grateful to Alain Ghysen for his help in the writing of this article. We thank Daniel Van Belle for comments on protein structure. We also thank André Adoutte, Robert Herzog, Nicolas Lartillot, Michel Milinkovitch, and two anonymous referees for helpful comments on the manuscript. This work has been supported by the Federal Office for Scientific, Technical, and Cultural Affairs (V.L.) and Centre National de la Recherche Scientifique and Université de Paris-Sud (M.V.).
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
FOOTNOTES
3 Corresponding author.
E-MAIL vervoort{at}cgm.cnrs-gif.fr; FAX 33 169 823160.
Article and publication are at www.genome.org/cgi/doi/10.1101/gr.177001.
REFERENCES