Genome Res. 14:354-366, 2004
©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Letter
Gene Conversion and the Evolution of Protocadherin Gene Cluster Diversity
James P. Noonan1,
Jane Grimwood2,
Jeremy Schmutz2,
Mark Dickson2 and
Richard M. Myers1,2,3
1 Department of Genetics, Stanford University School of Medicine, Stanford, California 94305-5120, USA
2 Stanford Human Genome Center, Stanford University School of Medicine, Palo Alto, California 94304, USA
ABSTRACT
The synaptic cell adhesion molecules encoded by the protocadherin gene cluster are hypothesized to provide a molecular code involved in the generation of synaptic complexity in the developing brain. Variation in copy number and sequence content of protocadherin cluster genes among vertebrate species could reflect adaptive differences in protocadherin function. We have completed an analysis of zebrafish protocadherin cluster genes. Zebrafish have two unlinked protocadherin clusters, DrPcdh1 and DrPcdh2. Like mammalian protocadherin clusters, DrPcdh1 has both variable and constant region exons. A consensus protocadherin promoter motif sequence identified in mammals is also conserved in zebrafish. Few orthologous relationships, however, are apparent between zebrafish and mammalian protocadherin proteins. Here we show that protocadherin cluster genes in human, mouse, rat, and zebrafish are subject to striking gene conversion events. These events are restricted to regions of the coding sequence, particularly the coding sequences of ectodomain 6 and the cytoplasmic domain. Diversity among paralogs is restricted to particular ectodomains that are excluded from conversion events. Conversion events are also strongly correlated with an increase in third-position GC content. We propose that the combination of lineage-specific duplication, restricted gene conversion, and adaptive variation in diversified ectodomains drives vertebrate protocadherin cluster evolution.
The evolution of the vertebrate brain from its invertebrate ancestors required a vast increase in the number and complexity of neuronal subtypes, synaptic connections, and the neural networks they comprise (Nieuwenhuys et al. 1998). These advances likely involved the expansion of particular classes of genes involved in brain development as well as the emergence of novel genes. Differences in brain development and function among vertebrates may be attributable to adaptive differences in particular genes shared among many vertebrate species. The members of the protocadherin gene cluster are compelling candidates to provide the molecular code required for the generation and maintenance of synaptic specificity in brain development (Kohmura et al. 1998; Wu and Maniatis 1999). The human protocadherin (Pcdh) gene cluster consists of 53 tandemly arrayed, single-exon paralogous genes organized into three subclusters, designated α, β, and γ, on chromosome 5 (Wu and Maniatis 1999). Each large, "variable" exon encodes an extracellular domain consisting of six cadherin-like ectodomain repeats, a transmembrane domain, and a short cytoplasmic tail. At the 3' end of both the α and γ subclusters are an additional three short exons that are alternatively cis-spliced to each α and γ variable exon, providing a "constant" cytoplasmic region (Wu and Maniatis 1999; Tasic et al. 2002; Wang et al. 2002a). Each variable exon is transcribed from its own promoter, and all protocadherin cluster promoters share a highly conserved core motif (Wu et al. 2001; Noonan et al. 2003). The organization and gene content of the mouse and human protocadherin clusters are similar, indicating that the function of protocadherin cluster genes in brain development is conserved among all mammals (Wu et al. 2001).
Protocadherin proteins are thought to form homophilic interactions at synapses, providing a molecular means to distinguish subsets of neurons based on the combinations of protocadherins they express (Obata et al. 1995; Kohmura et al. 1998). Recent advances in the understanding of protocadherin function evoke a more sophisticated version of this hypothesis (Wang et al. 2002b; Phillips et al. 2003). Mice bearing a homozygous deletion of the Pcdhγ cluster show normal brain development and synaptogenesis until late in the embryonic stage (Wang et al. 2002b). We recently determined that 21% of Europeans carry a deletion of three protocadherin genes, Pcdhα8, α9, and α10, with no apparent phenotypic effect (Noonan et al. 2003). Late-embryonic-stage mice that are null for all of the Pcdhγ genes, however, suffer massive apoptosis of spinal interneurons and show some evidence of neurodegeneration in the brain (Wang et al. 2002b). Spinal interneurons, but not hippocampal neurons, lacking the Pcdhγ cluster die in culture, indicating a direct requirement for Pcdhγ proteins in neuronal survival (Wang et al. 2002b). A significant proportion of Pcdhγ expression is nonsynaptic and intracellular, indicating that these proteins have other functions besides synaptic cell adhesion (Wang et al. 2002b; Phillips et al. 2003). Studies of hippocampal neurons in culture, however, show Pcdhγ proteins localized at subsets of excitatory synapses (Phillips et al. 2003). Pcdhγ proteins are also expressed on presynaptic and postsynaptic membranes at a subset of excitatory synapses in the hippocampus. In these cases, Pcdhγ proteins may mediate cell-cell interactions (Phillips et al. 2003). Protocadherin α and γ transcripts and proteins are expressed in overlapping but distinct patterns in the brain, and individual neurons have been shown to express multiple Pcdh proteins (Kohmura et al. 1998; Tasic et al. 2002; Wang et al. 2002b).
These results suggest a new model in which Pcdh proteins are required for neuronal maturation and synaptic modification rather than synaptogenesis (Wang et al. 2002b ; Kallenbach et al. 2003 ; Phillips et al. 2003 ). In this model, protocadherin cluster proteins are not structural components of synapses, but serve to identify and modify an enormous number of neuronal and synaptic subpopulations in the developing and adult brain (Phillips et al. 2003 ).
The protocadherin cluster is one example of a striking feature of vertebrate genomes: the presence of tandem arrays of paralogous genes encoding proteins of similar function that provide combinatorial complexity to some biological system. The classic examples, immunoglobulin and T-cell receptor gene clusters, generate the enormous range of antigen recognition molecules required to mount an effective immune response in an environment of multiple, rapidly evolving pathogens (for review, see Flajnik 2002 ). Olfactory receptor gene clusters provide the molecular means to detect minute amounts of numerous odorants, such as chemoattractants and toxins (Buck and Axel 1991 ). Each of these tandem arrays encode proteins involved in highly specific interactions such as receptor-ligand, antibody-antigen, and, for protocadherins, cell adhesion and possibly ligand recognition (Senzaki et al. 1999 ). Specificity in these interactions requires sequence diversity among paralogs, which generates the molecular and biological diversity the cluster provides. Diversity among protocadherin cluster paralogs provides the information comprising the hypothesized protocadherin synaptic code. Mechanisms that generate diversity among paralogs in a tandem array, increasing or decreasing the information content in that array, include gene duplication, diversification among duplicated paralogs, and gene conversion (Ohno 1970 ; Ohta 1980 ; Slightom et al. 1980 ; Nei et al. 1997 ). These processes are often lineage-specific, resulting in differences in gene number and sequence content even among closely related species. Differences in olfactory receptor gene number are common even among primate species, and these genes have undergone considerable gene conversion events in humans (Sharon et al. 1999 ; Newman and Trask 2003 ). 
Protocadherin cluster genes show variable copy number among mouse, rat, and human, with an expansion of the Pcdh cluster in mouse and rat and an additional Pcdh gene in rat (Wu et al. 2001 ; our present results). Differences in gene number and sequence diversity may be greater among more distantly related vertebrate species, and these differences may reflect adaptive differences in protocadherin function.
To investigate the diversity of vertebrate protocadherin cluster genes and the mechanisms that drive protocadherin cluster evolution, we are determining the structure of the protocadherin clusters in multiple vertebrate species. Protocadherin clusters are absent from the genomes of invertebrate model organisms such as Drosophila melanogaster and Caenorhabditis elegans, and also from the genomes of invertebrate chordates such as Ciona savignyi (Hill et al. 2001 ). Therefore, the protocadherin cluster may be a vertebrate innovation, driving the substantial increase in central nervous system complexity in vertebrates relative to other species. We have completed sequencing and assembly of a cluster of protocadherin genes in zebrafish, and have constructed maximum likelihood phylogenies of zebrafish, human, mouse, and rat protocadherin cluster proteins. We find that some general features of mammalian protocadherin clusters, including variable and constant region exon structure, are conserved in zebrafish. However, our results also indicate that lineage-specific adaptive variation, gene duplication, and especially gene conversion all contribute to extensive differences in protocadherin cluster variable exon sequence content among diverse vertebrate species.
RESULTS
Characterization of Zebrafish Protocadherin Cluster Genes
Teleosts have undergone a whole-genome duplication event in their evolution (Amores et al. 1998). This raises the possibility that zebrafish and other teleosts possess two divergent protocadherin clusters, as opposed to the single cluster found in terrestrial vertebrates. Consistent with this hypothesis, we have identified 66 predicted protocadherin variable exon sequences arrayed in two unlinked clusters, DrPcdh1 and DrPcdh2, in the zebrafish genome. Here we provide the general organization of DrPcdh1 and DrPcdh2 and their relationship to mammalian protocadherin clusters. Both zebrafish Pcdh clusters are similar to protocadherin clusters in mammals, including, in the case of DrPcdh1, the characteristic arrangement of variable and constant exons. The DrPcdh1 cluster maps to linkage group 10 (LG10; Zebrafish Genome Fingerprinting Project; see Methods) and consists of 38 predicted variable exon sequences distributed in two subclusters, DrPcdh1α and DrPcdh1γ (Fig. 1A). By complete sequencing, we found 10 variable DrPcdh1α exons and one pseudogene followed by three short constant region exons. These constant exons encode a predicted polypeptide that is 60% identical to the human Pcdhα constant region. DrPcdh1γ is located directly 3' of DrPcdh1α and consists of 28 variable exons and three predicted pseudogenes, followed by three short exons encoding a predicted polypeptide 53% identical to the human Pcdhγ constant region. We also completed sequencing and assembly of a BAC clone that maps to LG14 and contains an additional 28 predicted protocadherin variable exons (Fig. 1B). These exons are clearly part of a distinct second protocadherin cluster in zebrafish, DrPcdh2.
We searched GenBank and the zebrafish EST assemblies at the Washington University Zebrafish Genome Resources Project (see Methods) for ESTs corresponding to our gene predictions. We found ESTs corresponding to DrPcdh1 2, 1 7, 1 10, DrPcdh1 19, 1 20, 1 21, DrPcdh21, 9, and 26. We also searched for ESTs containing and constant region sequences. We found multiple ESTs containing the DrPcdh1 constant region, as well as ESTs containing an constant region that is 82% identical on the amino acid level to the DrPcdh1 constant region. These ESTs include transcripts in which DrPcdh2 variable exons are spliced to this second constant region. Based on this result, these exons appear to comprise a second Pcdh cluster, which we designate DrPcdh2 . The DrPcdh2 constant region is likely located 3' of our DrPcdh2 BAC sequence. These results clearly indicate that there has been an expansion in the number of protocadherin genes in zebrafish relative to mammals.
In mammals, each protocadherin variable exon has its own promoter (Wu et al. 2001 ; Tasic et al. 2002 ; Wang et al. 2002a ; Noonan et al. 2003 ). Each promoter and variable exon together form the fundamental protocadherin cluster repeat unit, which duplicates and subsequently diversifies in protocadherin cluster evolution. All mammalian protocadherin promoters are therefore paralogs, and, with few exceptions, share a common CGCT motif upstream of the transcription start site (Wu et al. 2001 ; Tasic et al. 2002 ). We selected 351 bp directly upstream of the predicted translation start of each zebrafish protocadherin variable exon and used MEME to search for shared motifs (Bailey and Elkan 1994 ). Figure 2 shows the results of MEME searches on human Pcdh (Fig. 2A), DrPcdh1 (Fig. 2B), and DrPcdh2 (Fig. 2C) proximal upstream sequences. Our results demonstrate that the CGCT motif in mammals is part of a larger 15-bp promoter element that is conserved among zebrafish and human protocadherin promoters. Comparing all three motifs, it is clear that particular bases are completely conserved at various positions in this element among almost all promoters examined. The CNC configuration at the 5' end of the motif is almost universally conserved, as are the G two base pairs upstream of the CGCT element. Core promoter motifs in DrPcdh2 show some divergence from core motifs in mammalian and DrPcdh1 promoters (Fig. 2C). However, the overall level of conservation, spanning over 400 million years of evolution, indicates that the basic mechanism of protocadherin cluster regulation is highly conserved, even though the exons themselves are diverged.
Phylogenies of mammalian protocadherin cluster genes usually reflect the evolutionary relationships among the species involved. For example, mouse and rat Pcdh 1 are more similar to each other than they are to human Pcdh 1, and all three are more similar to each other than they are to any other Pcdh paralog (Wu et al. 2001 ; our present results). Orthologs are therefore identifiable across species. To determine the evolutionary relationship of zebrafish protocadherin cluster genes both to each other and to their mammalian counterparts, we translated each predicted variable exon in silico, removed the signal sequence, and made two large CLUSTALW alignments: one of DrPcdh1 , DrPcdh2 , and human Pcdh proteins, and one of DrPcdh1 , human Pcdh , and Pcdh proteins. Human and zebrafish ectodomain and cytoplasmic sequences align well, with many substitutions but few gaps. We input these alignments into SEMPHY (Friedman et al. 2002 ; see Methods) and obtained the maximum likelihood (ML) phylogeny for each, shown in Figure 3. Zebrafish protocadherin variable exons encode proteins with six cadherin-like ectodomains and a cytoplasmic domain of a size nearly identical to that of the comparable domains of mammalian protocadherins (data not shown). In most cases, however, zebrafish orthologs cannot be assigned to human protocadherin cluster proteins. In the ML tree of human and zebrafish Pcdh proteins, human Pcdh 113 are grouped on their own branch separate from all DrPcdh1 and DrPcdh2 protocadherins (Fig. 3A). This topology could reflect independent expansions of protocadherin variable exons in each lineage, or lineage-specific adaptive differences in protocadherin ectodomain sequences. In contrast, human Pcdh C2 appears to have a definable zebrafish ortholog, DrPcdh1 10 (Fig. 3A). DrPcdh1 1 is also similar to human C1 and C2. 
There are five C-type protocadherins in humans and mice: αC1 and αC2 are located 5' of the Pcdhα constant region, and γC3, γC4, and γC5 are located 5' of the Pcdhγ constant region. These C-type variable exons are considerably diverged from other Pcdhα and Pcdhγ genes. DrPcdh1α10 is adjacent to the DrPcdh1α constant region, and its similarity to human αC2 indicates that the C-type protocadherins as a class are as ancient as α and γ protocadherins, and may be more highly conserved. However, the positioning of a C-type protocadherin, DrPcdh1α1, at the 5' end of the DrPcdh1 cluster is a departure from the organization of C-type protocadherins in mammals.
DrPcdh1 and DrPcdh2 descend from an ancestral tandem array duplicated in the teleost whole-genome duplication (WGD) event. Thus, we expected to see some interleaving of DrPcdh1 and DrPcdh2 proteins in this tree, reflecting their common ancestry. The tree topology, however, shows that either rapid diversification of genes between the duplicated clusters or, more likely, post-WGD tandem duplications and deletions occurred in each cluster. DrPcdh2α proteins consist of three paralog subclasses, 2α1-2α7, 2α8-2α25, and 2α26-2α28 (Fig. 3A). DrPcdh2α1-2α7 are most closely related to DrPcdh1α2, and clearly derive from subsequent duplications of the ancestral WGD-derived paralog. DrPcdh2α26-28 are closely related to DrPcdh1α3-1α9, and both groups also have arisen from additional cluster-specific expansions. DrPcdh2α8-2α25 are more distantly related to DrPcdh1α and human Pcdhα proteins. There has been a substantial expansion and diversification of Pcdhα protocadherins in zebrafish relative to humans, who have only 12 or 15 Pcdhα genes (Noonan et al. 2003).
In the ML tree of DrPcdh1γ, hPcdhβ, and hPcdhγ protein sequences, zebrafish and human protocadherins are for the most part on separate nodes (Fig. 3B). Human Pcdhβ and Pcdhγ proteins are more similar to each other than they are to any zebrafish protocadherin, indicating that β protocadherins are a mammalian, or at least terrestrial vertebrate, innovation. This is also true of the division of γ protocadherins into A and B subtypes, which are not present in zebrafish. DrPcdh1γ proteins fall into three separate paralogous subclasses: 1γ1-1γ3, 1γ4-1γ18, and 1γ19-1γ28. 1γ4-1γ18 are all highly similar to each other. Although they are related to mammalian Pcdhβ and Pcdhγ proteins, they have undergone considerable diversification, and many of these genes, such as 1γ16-1γ18, appear to be recent duplicates. 1γ19-1γ28 are related in the ML tree to human γC4 and γC5 (Fig. 3B). These genes may be the highly divergent descendants of an ancestral C-type protocadherin, or may comprise a distinct class of protocadherins not observed in mammals (Fig. 3B).
Distribution of Gene Conversion Events in Mammalian and Zebrafish Protocadherin Cluster Genes
Tandem gene arrays are subject to gene conversion, often as part of a process of concerted evolution in which paralogs in each species become more similar to each other than to their orthologs in related species (Smith 1974 ; Ohta 1980 ; Slightom et al. 1980 ; Fitch et al. 1990 ; Drouin et al. 1999 ). To determine the extent to which gene conversion contributes to the evolution of protocadherin cluster genes, we used GeneConv to search for shared identical elements among the members of the DrPcdh1 , DrPcdh1 , DrPcdh2 , and mammalian Pcdh , Pcdh , and Pcdh paralog subclasses (Sawyer 1989 ). We found a very large number of elements greater than 95 nucleotides in length shared among paralogs within each group, with the same sequence element often shared among multiple paralogs (data not shown). Surprisingly, these elements are not randomly distributed throughout the coding regions of these genes, but are clustered at the 3' end, involving sequences coding for ectodomains 5 and 6 as well as the cytoplasmic tail.
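The idea of a "shared identical element" can be illustrated with a naive scan over two aligned paralog sequences. This is only a sketch under stated assumptions: it is not the GeneConv procedure and performs no significance testing. The function name `shared_identical_tracts` and the parameter `longer_than` are hypothetical, and the input is assumed to be a pairwise nucleotide alignment with `-` as the gap character.

```python
def shared_identical_tracts(seq_a, seq_b, longer_than=95):
    """Return (start, end) column spans, end exclusive, where two aligned
    sequences are identical and ungapped for more than `longer_than` columns.

    Illustrative only: a plain run-length scan, not GeneConv's method.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    a, b = seq_a.upper(), seq_b.upper()
    tracts = []
    start = None  # column where the current identical run began
    for i, (x, y) in enumerate(zip(a, b)):
        same = (x == y) and x != "-"
        if same and start is None:
            start = i
        elif not same and start is not None:
            if i - start > longer_than:
                tracts.append((start, i))
            start = None
    # Close a run that extends to the end of the alignment.
    if start is not None and len(a) - start > longer_than:
        tracts.append((start, len(a)))
    return tracts
```

Applied to every pair of paralogs in a subgroup, the start columns of the returned spans would show whether long identical tracts cluster toward the 3' end of the coding sequence, as described above.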
The number and distribution of these shared identical elements could be due to greater functional constraint on particular regions of each protein, residual similarity among very recent duplicates, or frequent gene conversion. Each of these possibilities yields testable predictions. Functional constraint at the protein level does not act on synonymous sites, which will therefore show a greater number of substitutions relative to nonsynonymous sites (Miyata et al. 1980 ). Constrained regions are also likely to be functionally orthologous, and therefore highly conserved, among closely related species. Recent duplicates are more similar across their entire length, not in particular regions (Nei et al. 1997 ). Gene conversion occurs within individuals and is more likely to happen between similar paralogs, as increased sequence similarity facilitates ectopic strand invasion (Ahn et al. 1988 ; Elliott et al. 1998 ). Therefore, orthology breaks down in converted regions, the opposite outcome of functional constraint.
We built CLUSTALW protein alignments of each ectodomain (EC) and cytoplasmic domain from human, mouse, and rat Pcdh , , A, and B (excluding C-type protocadherins), DrPcdh1 31 9, DrPcdh1 41 18, DrPcdh1 191 27, and DrPcdh2 82 25. We built nucleotide alignments in RevTrans (see Methods) using each protein alignment as a template, and we generated ML gene trees for each ectodomain in SEMPHY. We observed substantial differences in the number of substitutions per site, depicted as branch lengths in each tree, among the ectodomains in each subgroup. Within each species, human, mouse, and rat Pcdh ectodomains 1 and 5 are nearly identical, as are ectodomain 6 sequences among DrPcdh1 31 9 and DrPcdh2 8 25. Human, mouse, and rat Pcdh A and Pcdh B paralogs also have homogenized sixth ectodomains. The ML gene trees for ectodomains 3 and 6 for human Pcdh and DrPcdh1 41 18 are shown in Figure 4. Both human Pcdh (Fig. 4A) and DrPcdh1 (Fig. 4C) paralogs are diverse in their third ectodomain, as expected from independent substitution in ancient duplicates. In contrast, ectodomain 6 is homogeneous within each group. Human Pcdh EC6 sequences are nearly identical (Fig. 4B); the branch lengths in this tree for 3, 4, 9, and 14 EC6 are zero, indicating a complete absence of substitutions among these sequences. This homogenization is even more extreme among DrPcdh1 4- 18 EC6 sequences (Fig. 4D). In both cases, paralogs that are considerably diverged in their third ectodomain, such as 4 and 9 (Fig. 4A,B) and DrPcdh1 14 and DrPcdh1 16 (Fig. 4C,D), have identical sixth ectodomains.
Our results demonstrate that ectodomain-specific sequence homogenization is a common feature of all protocadherin cluster genes. This sequence homogenization is also lineage-specific, resulting in nearly identical ectodomains among paralogs in the same species and divergence among homogenized ectodomains between orthologs. The orthologous relationships observed among full-length human, mouse, and rat Pcdh genes are largely recapitulated in the EC3 gene tree (Fig. 5A). Orthology breaks down entirely, however, in ectodomain 5 (Fig. 5B), even between mouse and rat. For the most part, mouse and rat Pcdh EC6 domains are more similar within each species, with complete homogenization evident among some paralogs (Fig. 5B). Breakdown of orthology is also evident in human, mouse, and rat Pcdh ectodomain 6 (Fig. 5C,D).
To determine whether the ectodomain- and lineage-specific homogenization we observe is due to functional constraint on protein sequences, we estimated the number of synonymous (dS) and nonsynonymous (dN) substitutions per site for each ectodomain gene tree and alignment by using codeml (Yang 1997 ). We then calculated the total sequence diversity at synonymous sites for each ectodomain in each paralog subgroup, expressed as the total length of the synonymous-site gene tree. Our results are shown in Figure 6. Within each subgroup in each species, neutral diversity varies substantially across domains. Different protocadherin subgroups also show different patterns of sequence homogenization. Human, mouse, and rat Pcdh paralogs show very little neutral diversity in ectodomains 1, 4, and 5 (Fig. 6A), whereas Pcdh paralogs have divergent first ectodomains. In all cases, however, there is a strong trend toward reduced neutral diversity in 3' ectodomain and cytoplasmic coding sequences. This is especially true for zebrafish protocadherins, which have almost completely homogenized EC6 and cytoplasmic sequences within each subgroup. In the most extreme case, the neutral diversity among DrPcdh2 82 25 ectodomain 6 sequences is zero (Fig. 6F). Ectodomain 3, however, shows no evidence of homogenization in any subgroup. EC3 provides most of the total neutral diversity among human, mouse, and rat Pcdh paralogs (Fig. 6A), and appears to provide much of the phylogenetic signal between orthologs as well (Fig. 5A). The second and third ectodomains of DrPcdh2 82 25 are very divergent (Fig. 6F), in some cases only 50% identical on the protein level. The absence of a clear orthologous relationship between most mammalian and zebrafish protocadherins is not due to lineage-specific homogenization, as even the most divergent zebrafish ectodomains are more similar to each other than they are to any mammalian ectodomain (data not shown). 
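The diversity measure used above, the total length of the synonymous-site gene tree, is the sum of all branch lengths in that tree. A minimal sketch follows, assuming the tree is available as a Newick string whose branch lengths follow colons and whose labels contain no colons; the function name `total_tree_length` is hypothetical and is not part of codeml.

```python
import re

# A branch length is the number following a colon in Newick notation.
_BRANCH_LENGTH = re.compile(r":\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")

def total_tree_length(newick):
    """Sum every branch length found in a Newick tree string.

    Assumes unquoted labels without colons; illustrative only.
    """
    return sum(float(value) for value in _BRANCH_LENGTH.findall(newick))
```

A subgroup whose ectodomain sequences are completely homogenized at synonymous sites would give a tree with all-zero branches and therefore a total length of zero.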
Mammalian C-type protocadherins do not appear to be as subject to homogenization as other variable exons, but the limited number of C-type protocadherins in each species limits our ability to reliably detect conversion events in these genes (data not shown).
Sequence homogenization at neutral sites strongly suggests that the patterns of homogenization we observe are the result of repeated gene conversion events, rather than functional constraint on protein sequence content. Gene conversion has been detected in genes coding for other cell adhesion molecules (Gally and Edelman 1992 ; Gallin 1998 ). Gene conversion is also indicated by the fact that homogenization is occurring among closely related paralogs, which are more likely to participate in ectopic conversion events due to their increased sequence similarity (Ahn et al. 1988 ; Elliott et al. 1998 ). In zebrafish, homogenization is also occurring among paralogs in close physical proximity (Figs. 1A,B; 6E,F), and gene conversion events between two paralogs appear to become more frequent as physical distance between the paralogs decreases (Galtier 2003 ). The localization of conversion events into discrete regions, however, suggests that a strong constraint exists on their distribution in protocadherin cluster genes.
Increased GC Content at Third Positions Accompanies Gene Conversion Events in Protocadherin Cluster Genes
There is substantial evidence that gene conversion events lead to increased GC content at codon third positions in the converted regions (Eyre-Walker 1993; Galtier et al. 2001; Smith and Eyre-Walker 2001; Birdsell 2002; Galtier 2003; Marais 2003). The molecular mechanism is unknown, but may be due to a GC bias in mismatch repair, which is required to resolve allelic and ectopic conversion events (Brown and Jiricny 1988; Sugawara et al. 1997; Galtier et al. 2001). Third positions are under little selective constraint and will therefore reflect this bias. We calculated average third-position GC content (GC3) for each ectodomain and cytoplasmic domain in each paralog subgroup and plotted it against neutral paralog sequence diversity, as shown in Figure 6. In each case we see the same trend: As diversity decreases, third-position GC content increases, such that there is an extreme bias in third-position GC content in homogenized regions. The average GC3 in the highly homogenized human Pcdh ectodomain 6 is 94%, compared to 38% in the divergent ectodomain 3 (Fig. 6B). This trend is consistent across species, with human, mouse, and rat protocadherin paralogs showing a nearly identical distribution of diversity and GC3 content (Fig. 6A-D). GC3 content is also elevated in homogenized zebrafish ectodomains (Fig. 6E,F), although the effect is apparently tempered by an overall AT bias in zebrafish (data not shown).
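For concreteness, GC3 for one in-frame coding sequence is the fraction of codons whose third base is G or C, and the per-domain value is the mean over paralogs. The sketch below assumes each input is an in-frame nucleotide string; gap or ambiguity characters at third positions are skipped. The function names `gc3_fraction` and `mean_gc3` are hypothetical, and this is not necessarily how the values in Figure 6 were computed.

```python
def gc3_fraction(cds):
    """Fraction of G or C at third codon positions of an in-frame sequence.

    Third positions holding anything other than A, C, G, or T are ignored.
    """
    seq = cds.upper()
    usable = len(seq) - len(seq) % 3  # drop a trailing partial codon
    thirds = [seq[i] for i in range(2, usable, 3) if seq[i] in "ACGT"]
    if not thirds:
        raise ValueError("no codons with an unambiguous third position")
    return sum(base in "GC" for base in thirds) / len(thirds)

def mean_gc3(cds_list):
    """Average GC3 across the paralog sequences of one domain."""
    values = [gc3_fraction(cds) for cds in cds_list]
    return sum(values) / len(values)
```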
We calculated Pearson correlation coefficients between GC3 content and paralog ectodomain sequence diversity. Overall, we find a very strong negative correlation between paralogous neutral sequence diversity and third-position GC content in mammalian protocadherin cluster domains (Table 1). The correlation between GC3 and neutral sequence diversity for the human Pcdh paralogs is -0.98 (one-tailed P < 3e-6). There is a similar, albeit weaker, negative correlation in zebrafish protocadherin genes. The effect of this correlation is seen in Figure 6. In all mammalian paralog subgroups, ectodomains with high neutral sequence diversity have a GC3 value between 40% and 50%. Invariably, as neutral sequence diversity decreases, GC3 content increases. This is true even in ectodomains with intermediate neutral sequence diversity relative to the most divergent and most homogenized ectodomains. In these moderately homogenized domains, GC3 content increases, but is not as high as it is in completely homogenized ectodomains. These data strongly indicate that the level of neutral sequence diversity in a particular ectodomain is determined by the degree to which gene conversion events are maintained in that ectodomain, which also determines the level of bias in GC3 content. In zebrafish, GC3 content is not as tightly correlated with neutral sequence diversity. GC3 content is invariant in some cases even among ectodomains with varying neutral sequence diversity (Fig. 6E). In completely homogenized ectodomains, however, GC3 content increases substantially (Fig. 6E,F).
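The Pearson coefficient reported here can be computed directly from paired per-domain values of synonymous diversity and GC3. The sketch below is a plain-Python illustration with a hypothetical function name, `pearson_r`; it does not compute the one-tailed P value, and the numbers in the usage note are invented for illustration rather than taken from Table 1.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)
```

With made-up inputs such as diversity `[1, 2, 3]` and GC3 `[6, 4, 2]`, the function returns a coefficient of -1, the limiting case of the negative relationship described above.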
Table 1. Pearson Correlation (r) Between Synonymous Paralogous Sequence Diversity and Third Position GC Content (GC3) for Protocadherin Cluster Paralog Subgroups
DISCUSSION
In this study, we identified 66 protocadherin cluster genes arrayed into two clusters in the zebrafish genome (Fig. 1). These zebrafish genes show limited orthology with mammalian protocadherin cluster genes. Nevertheless, it appears that some general features of protocadherin cluster organization are conserved among all vertebrates. For instance, zebrafish and mammalian protocadherin promoters share a highly conserved core motif (Fig. 2). We demonstrated earlier that mammalian protocadherin promoters show an increase in transcriptional activity upon neuronal differentiation in a cell-based reporter assay (Noonan et al. 2003 ). Because neuron-specific expression is likely to be a common feature of all protocadherin cluster genes in all vertebrates, it is not surprising that the regulatory element that confers this specificity is tightly constrained. The variable and constant exon structure and the alternatively spliced forms of each protocadherin transcript that result from this organization are apparently highly conserved as well. The mouse Pcdh constant region has been shown to interact with the Fyn tyrosine kinase (Kohmura et al. 1998 ), and the Pcdh and Pcdh constant region protein sequences are well conserved between zebrafish and humans. Therefore, constant region-mediated interactions with cytoplasmic signaling and structural proteins may be fundamental to protocadherin function and common to all vertebrates.
The most mutable components of protocadherin clusters among vertebrate species are the number and sequence composition of the variable exons. If protocadherin cluster proteins provide the molecular code in the development and maintenance of synaptic connections through putatively homophilic interactions across the synaptic cleft, it is the diversity among variable exon paralogs that generates combinatorial complexity in this code. In this regard, the zebrafish and mammalian protocadherin molecular codes are radically different. Although Pcdhα and Pcdhγ genes exist in both lineages and are orthologous as groups (Fig. 3; data not shown), as individual genes they are divergent. Mammals also have entire classes of protocadherin cluster genes that zebrafish apparently lack (Pcdhβ and the division of Pcdhγ into A and B; Wu and Maniatis 1999). The information content provided by each paralog, therefore, is lineage-specific, and the divergence of mammalian and zebrafish protocadherins could be the result of adaptive specialization of mammalian versus nonmammalian protocadherin cluster genes. This specialization would manifest in protocadherin proteins with different homophilic or heterophilic adhesive properties, resulting in lineage-specific protocadherin molecular codes in brain development.
Sequence differences among orthologous protocadherins in multiple vertebrate lineages may reflect adaptive differences in protocadherin function that contribute to lineage-specific structural and functional specializations in the brain. Although many of the fundamental regions of the brain appear to be present in all vertebrates, these regions are highly specialized within each vertebrate lineage (Striedter 1998). The result of this specialization is that homologous relationships between brain structures in divergent vertebrate species are sometimes difficult or impossible to establish. For example, the mammalian brain has a large, laminar isocortex with no obvious homolog in the brains of nonmammalian vertebrates (Nieuwenhuys et al. 1998). Protocadherins could play a role in the development of this unique structure. There are species differences in the development of particular brain structures as well. Zebrafish have a nonlaminar telencephalon that develops through a process of eversion, rather than evagination as in terrestrial vertebrates.
The divergence of zebrafish and mammalian variable exons that we observe could be due to adaptive variation erasing the phylogenetic signal between orthologs. There have also been multiple independent expansions of protocadherin genes in each lineage. Some DrPcdh1 variable exons are clearly recent duplicates: Although they show the same patterns of homogenization as all other protocadherins, they show much less neutral paralog diversity overall (Figs. 3B,4C,6E). Because multiple, highly similar duplicates are likely to form "heterophilic" associations with each other, recent duplicates provide no additional homophilic interactions to distinguish subsets of neurons or synapses. The larger number of recent duplicates in zebrafish results in a less diverse set of protocadherins relative to mammals. This redundancy, coupled with a limited requirement for uniquely homophilic protocadherins relative to mammals, may allow zebrafish to tolerate substantial changes in protocadherin cluster gene number with no effect on fitness. However, outside of their homogenized regions, some DrPcdh1 and DrPcdh2 variable exons are divergent (Figs. 3A,6F; data not shown), indicating that they are ancient duplicates with diversified functions. Therefore, some of the zebrafish-specific expansion must provide an adaptive benefit, possibly by driving the development and function of teleost-specific neuronal pathways. Additional duplication events provide more raw material for adaptive evolution of protocadherin function, including the emergence of novel diversified paralogs, in each species.
Adaptive variation interacts with gene conversion in the process of protocadherin evolution. We have discovered clear evidence that gene conversion is common in protocadherin cluster genes. It is equally clear that some regions of protocadherin cluster genes, such as ectodomain 3, are excluded from conversion events. These regions provide most of the phylogenetic signal among mammals. Two possibilities present themselves. The first is that gene conversion events are needed to homogenize regions of each paralog class that have identical, essential yet lineage-specific functions among paralogs. In this model, ectodomain 6 mediates adhesion within each paralog subgroup, for example Pcdhα to Pcdhα, whereas ectodomain 3 mediates specific homophilic interactions. This is in contrast to the adhesion mechanism of classical cadherins, in which ectodomain 1 mediates homophilic interactions (Boggon et al. 2003). Gene conversion, therefore, acts to constrain protein sequence diversity. It does not matter that homogenized regions are lineage-specific, because protocadherin paralogs have to associate only with other paralogs in the same organism. In a strictly homophilic adhesion model, however, it seems that specific interactions could best be achieved by diversification among paralogs. It is possible that heterophilic associations occur in some circumstances among the members of each paralog subgroup. A definitive answer to this question awaits a rigorous examination of protocadherin adhesion.
The second possibility, which is not exclusive of the first, is that duplicated genes are prone to frequent gene conversion events, which, all else being equal, will be distributed among the duplicates according to their relative sequence similarity. Genes are homogenized within each paralog subgroup, which are more similar to each other than they are to the members of other subgroups. In zebrafish protocadherins, homogenization events also occur among genes in close physical proximity, possibly independent of any functional similarity among the proteins. Conversion events are selected against in diversified ectodomains because diversity is required in these domains for proper homophilic adhesion and to maintain the combinatorial complexity in brain development the proteins provide. These diversified domains show adaptive differences among distantly related vertebrate lineages. In this model, gene conversion events are very frequent, but are restricted to regions where protein sequence diversity is not functionally relevant. In our opinion, the data strongly support this conclusion. Diversified ectodomains are always orthologous among mouse, rat, and human, indicating that they have a precisely defined and well conserved function. A gene conversion tract that extended into a diversified ectodomain would effectively knock out that paralog and result in a duplication of the paralog contributing the donor sequence to the conversion event. A limited number of such events would probably have little deleterious effect, but their accumulation would limit the number of unique homophilic interactions available to the organism, which ultimately is disadvantageous.
In conclusion, our results indicate that protocadherin cluster genes undergo concerted evolution due to gene conversion in some parts of their coding sequences. These conversions are most likely attributable to the density of related paralogs in each cluster. Other regions, such as the core ectodomains of each molecule that provide the unique information specifying homophilic adhesion, are insulated from conversion events by selection. These regions are subject to adaptive variation among lineages, generating lineage-specific protocadherin molecular codes that contribute to differences in brain development, structure, and function among species. Lineage-specific gene duplications provide additional material for diversification of protocadherin cluster information content among vertebrate species. Therefore, the combination of lineage-specific duplication, restricted gene conversion, and adaptive variation in diversified ectodomains drives vertebrate protocadherin evolution. The expansion and adaptive evolution of large gene families, arranged both in tandem arrays and scattered throughout the genome, is a driving force in the evolution of vertebrate diversity (Ohno 1970). Several of these families, such as G-protein coupled receptors and Krüppel-type zinc finger proteins, have hundreds of genes that provide an ample reservoir for generating within- and between-species differences. Understanding the mechanisms that generate functional diversity among gene family members involved in complex systems is an essential first step toward an understanding of the molecular basis of speciation in vertebrates.
METHODS
BAC Isolation and Sequencing
We used a local implementation of TBLASTN v.2.2.5 (Altschul et al. 1997) to search the Ensembl zebrafish whole-genome shotgun trace database (http://trace.ensembl.org) for sequencing reads predicted to encode protein fragments similar to human protocadherin cluster proteins. We designed overgo probes from a subset of the high-scoring sequences we obtained and screened the CHORI-211 zebrafish BAC library (Children's Hospital of Oakland Research Institute) using standard protocols (McPherson et al. 2001). We obtained 52 BAC clones and verified these by using PCR with primers designed against the high-scoring zebrafish protocadherin trace sequences. We chose a large-insert clone with many positive PCR hits, CH211-201d21, for our initial round of sequencing. We used STS content mapping based on draft sequence from this clone and XhoI restriction digests of all positive clones to assemble a minimum tiling path. For sequencing, BAC DNA was hydrodynamically sheared with a Hydroshear Instrument (GeneMachines), size selected (3-4 kb), and subcloned into the plasmid pIK96 (Stanford Human Genome Center, http://www-shgc.stanford.edu). Randomly selected plasmid subclones were sequenced in both directions with universal primers and BigDye Terminator chemistry (Applied Biosystems) to an average sequence depth of 10X. Sequences were then assembled and edited using the Phred/Phrap/Consed suite of programs (Ewing and Green 1998; Ewing et al. 1998; Gordon et al. 1998). Following manual inspection of the assembled sequences, finishing was performed by resequencing plasmid subclones and by walking on plasmid subclones or the large insert clone using custom primers. All finishing reactions were performed with dGTP BigDye Terminator chemistry (Applied Biosystems). Finished clones contain no gaps and are estimated to contain less than one error per 100,000 bp. The four BAC clones we sequenced for this study are CH211-150p19 (GenBank AC144823), CH211-201d21 (GenBank AC144828), CH211-20c7 (GenBank AC144826), and CH211-40o9 (GenBank AC146480). CH211-150p19, 201d21, and 20c7 form one contig that contains all or nearly all of one zebrafish protocadherin cluster, DrPcdh1. These clones have since been mapped to linkage group 10 by the Zebrafish Genome Fingerprinting Project (http://www.sanger.ac.uk/Projects/D_rerio/WebFPC/zebra/large.shtml). CH211-40o9 maps to linkage group 14 and contains 28 additional predicted protocadherin variable exons.
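The depth and accuracy figures quoted above can be related by simple arithmetic. The sketch below is only an illustration: the BAC insert size (170 kb) and read length (600 bp) are assumed values that are not reported in this study, and the function names are hypothetical. It estimates how many reads a given average depth implies, and converts an error rate into the Phred-scaled quality Q = -10 log10(P).

```python
import math


def reads_needed(insert_bp: int, depth: float, read_bp: int) -> int:
    """Reads required for an average depth: depth * insert length / read length."""
    return math.ceil(depth * insert_bp / read_bp)


def phred_from_error_rate(errors_per_base: float) -> float:
    """Phred-scaled quality for a per-base error probability."""
    return -10.0 * math.log10(errors_per_base)


if __name__ == "__main__":
    # Assumed, illustrative values: 170 kb insert, 600 bp reads, 10X depth.
    print(reads_needed(170_000, 10, 600))
    # One error per 100,000 bp expressed on the Phred scale.
    print(phred_from_error_rate(1 / 100_000))
```

Under these assumptions, one error per 100,000 bp corresponds to a consensus quality of about Q50.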
Protocadherin Cluster Gene Prediction and Sequence Annotation
We used TBLASTN to identify large single-exon genes encoding proteins similar to human Pcdh variable and constant region protein sequences in our assembled BAC sequence. We also searched our assembled sequence for large open reading frames using OrfFinder (http://www.ncbi.nlm.nih.gov/) and compared the results of both methods. For the most part, the TBLASTN results were unambiguous. In cases of multiple in-frame translation start sites, we chose the translation start that resulted in a signal sequence similar in length and sequence composition to those of nearby paralogs. We used our predicted exons in TBLASTN and BLASTN searches of zebrafish ESTs from the Washington University Zebrafish Genome Resources Project (http://zfish.wustl.edu/) and from GenBank (http://www.ncbi.nlm.nih.gov). We obtained rat genomic sequence from the UCSC Genome Browser (Kent et al. 2002; http://genome.ucsc.edu/, June 2003 freeze) and searched for protocadherins with TBLASTN as above. We extracted, managed, and translated protocadherin exon sequences and annotated genomic sequence using custom Perl scripts.
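The open reading frame search described above can be sketched as a six-frame scan. This is a minimal illustration in Python, not OrfFinder itself and not the custom Perl scripts used in this study; the function names and the `min_codons` threshold are hypothetical. It reports the first ATG after each stop codon, so it does not reproduce the signal-sequence comparison we used to choose among multiple in-frame start sites.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq: str, min_codons: int):
    """Yield (start, end) for ATG..stop ORFs in the three frames of one strand.

    `end` is exclusive and includes the stop codon; ORF length is counted
    in codons from the ATG up to, but not including, the stop.
    """
    n = len(seq)
    for frame in range(3):
        start = None
        for i in range(frame, n - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    yield start, i + 3
                start = None


def find_orfs(seq: str, min_codons: int = 100):
    """Return (strand, start, end) ORFs in forward coordinates, longest first."""
    seq = seq.upper()
    n = len(seq)
    hits = [("+", s, e) for s, e in orfs_in_strand(seq, min_codons)]
    for s, e in orfs_in_strand(reverse_complement(seq), min_codons):
        hits.append(("-", n - e, n - s))
    return sorted(hits, key=lambda h: h[2] - h[1], reverse=True)
```

For large single-exon genes such as the protocadherin variable exons, a high `min_codons` value would restrict the output to long candidate ORFs, which could then be compared with the TBLASTN hits.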
Sequence Analysis and Phylogenetic Tree Construction
We used 351 base pairs of sequence upstream from each known and predicted translation start site of human, mouse, rat, and zebrafish Pcdh variable exons to search for conserved motifs using MEME (Bailey and Elkan 1994; http://www.meme.sdsc.edu/meme/website/intro.html). We input the core motif MEME identified into WebLogo (Schneider and Stephens 1990; http://weblogo.berkeley.edu/logo.cgi) to generate logograms for mammalian Pcdh, DrPcdh1, and DrPcdh2 promoters. To determine the phylogeny of full-length zebrafish and human Pcdh proteins, we removed the signal sequence from each protein and built CLUSTALW alignments (Chenna et al. 2003; http://www.ebi.ac.uk/clustalw/), which we then used to estimate the maximum likelihood (ML) phylogeny in SEMPHY with default parameters (Friedman et al. 2002). We used TreeEdit to draw and root each tree by midpoint (http://evolve.zoo.ox.ac.uk/). We initially identified shared identical sequences among protocadherin cluster genes using GeneConv (Sawyer 1989; http://www.math.wustl.edu/~sawyer/geneconv). We identified protocadherin domains in each protein with HMMER2.2 (Durbin et al. 1998; http://hmmer.wustl.edu/) using pfamA hidden Markov model PF00028 (cadherin domain) and extracted each domain by using custom Perl scripts. We built nucleotide alignments of each ectodomain with RevTrans, a Python application that aligns coding sequences based on the protein alignment (Wernersson and Pedersen 2003; http://www.cbs.dtu.dk/services/RevTrans/). We estimated ML gene trees in SEMPHY using the Kimura 2-parameter model of nucleotide substitution with a transition-transversion ratio of 2. We estimated synonymous and nonsynonymous substitution rates for each tree and average third-position GC content for each alignment by using CODEML (Yang 1997; http://abacus.gene.ucl.ac.uk/software/paml.html).
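Two of the quantities named above have simple closed forms that can be illustrated directly. The sketch below is not the CODEML or SEMPHY procedure; it is a hypothetical, minimal Python illustration that assumes in-frame, codon-aligned coding sequences (as a RevTrans-style alignment would provide). `gc3` computes the fraction of G or C at third codon positions, and `k2p_distance` computes the pairwise Kimura 2-parameter distance d = -1/2 ln(1 - 2P - Q) - 1/4 ln(1 - 2Q), where P and Q are the proportions of transitions and transversions.

```python
import math

PURINES = {"A", "G"}


def gc3(aligned_cds) -> float:
    """Fraction of G/C at third codon positions; gaps and ambiguous bases are skipped."""
    gc = total = 0
    for seq in aligned_cds:
        for base in seq.upper()[2::3]:  # every third column, starting at codon position 3
            if base in "ACGT":
                total += 1
                gc += base in "GC"
    return gc / total if total else float("nan")


def k2p_distance(a: str, b: str) -> float:
    """Pairwise Kimura 2-parameter distance between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for x, y in zip(a.upper(), b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue  # skip gaps and ambiguous bases
        sites += 1
        if x == y:
            continue
        if (x in PURINES) == (y in PURINES):
            transitions += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p, q = transitions / sites, transversions / sites
    return -0.5 * math.log(1 - 2 * p - q) - 0.25 * math.log(1 - 2 * q)
```

Comparing `gc3` between alignments of converted and unconverted regions is one simple way to examine the relationship between conversion and third-position GC content; the values reported in this study were obtained with CODEML.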
Acknowledgements
We thank the members of the Stanford Human Genome Center Sequencing Group for their outstanding technical contributions to this project. We thank Dr. Arend Sidow for helpful comments on the manuscript, Drs. William Talbot and Marcus Feldman for providing insight and support, Christopher Brown and Gregory Cooper for excellent discussions and technical assistance, and the members of the Myers lab for discussions and support. This work was supported by the Stanford Genome Training Program (NIH training grant 5 T32 HG00044 to J.P.N.) and the NIH Centers for Excellence in Genomic Science initiative (1 P50 HG 02568-01 to R.M.M.).
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
Footnotes
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2133704.
3 Corresponding author. E-MAIL myers{at}shgc.stanford.edu; FAX (650) 725-9689. 
[The BAC sequence data from this study have been submitted to GenBank under accession nos. AC144823, AC144828, AC144826, and AC146480. Predicted gene sequences are available on the Myers Lab Web site, http://www-shgc.stanford.edu/myerslab/.]
REFERENCES
Ahn, B.-Y., Dornfeld, K.J., Fagrelius, T.J., and Livingston, D.M. 1988. Effect of limited homology on gene conversion in a Saccharomyces cerevisiae plasmid recombination system. Mol. Cell. Biol. 8: 2442-2448.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Amores, A., Force, A., Yan, Y.L., Joly, L., Amemiya, C., Fritz, A., Ho, R.K., Langeland, J., Prince, V., Wang, Y.L., et al. 1998. Zebrafish hox clusters and vertebrate genome evolution. Science 282: 1711-1714.
Bailey, T. and Elkan, C. 1994. Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, CA.
Birdsell, J.A. 2002. Integrating genomics, bioinformatics, and classical genetics to study the effects of recombination on genome evolution. Mol. Biol. Evol. 19: 1181-1197.
Boggon, T.J., Murray, J., Chappuis-Flament, S., Wong, E., Gumbiner, B.M., and Shapiro, L. 2003. C-cadherin ectodomain structure and implications for cell adhesion mechanisms. Science 296: 1308-1313.
Brown, T.C. and Jiricny, J. 1988. Different base/base mispairs are corrected with different efficiencies and specificities in monkey kidney cells. Cell 54: 705-711.
Buck, L. and Axel, R. 1991. A novel multigene family may encode odorant receptors: A molecular basis for odor recognition. Cell 65: 175-187.
Chenna, R., Sugawara, H., Koike, T., Lopez, R., Gibson, T.J., Higgins, D.G., and Thompson, J.D. 2003. Multiple sequence alignment with the Clustal series of programs. Nucleic Acids Res. 31: 3497-3500.
Drouin, G., Prat, F., Ell, M., and Clarke, G.D.P. 1999. Detecting and characterizing gene conversions between multigene family members. Mol. Biol. Evol. 16: 1369-1390.
Durbin, R., Eddy, S., Krogh, A., and Mitchison, G. 1998. Biological sequence analysis: Probabilistic models of proteins and nucleic acids, Chapter 5. Cambridge Univ. Press, Cambridge, UK.
Elliott, B., Richardson, C., Winderbaum, J., Nickoloff, J.A., and Jasin, M. 1998. Gene conversion tracts from double-strand break repair in mammalian cells. Mol. Cell. Biol. 18: 93-101.
Ewing, B. and Green, P. 1998. Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res. 8: 186-194.
Ewing, B., Hillier, L., Wendl, M.C., and Green, P. 1998. Base-calling of automated sequencer traces using Phred. I. Accuracy assessment. Genome Res. 8: 175-185.
Eyre-Walker, A. 1993. Recombination and mammalian genome evolution. Proc. R. Soc. Lond. Ser. B 252: 237-243.
Fitch, D.H.A., Mainone, C., Goodman, M., and Slightom, J.L. 1990. Molecular history of gene conversions in the primate fetal γ-globin genes. J. Biol. Chem. 265: 781-793.
Flajnik, M.F. 2002. Comparative analyses of immunoglobulin genes: Surprises and portents. Nat. Rev. Immunol. 2: 688-698.
Friedman, N., Ninio, M., Pe'er, I., and Pupko, T. 2002. A structural EM algorithm for phylogenetic inference. J. Comput. Biol. 9: 331-353.
Gallin, W.J. 1998. Evolution of the "classical" cadherin family of cell adhesion molecules in vertebrates. Mol. Biol. Evol. 15: 1099-1107.
Gally, J.A. and Edelman, G.M. 1992. Evidence for gene conversion in genes for cell-adhesion molecules. Proc. Natl. Acad. Sci. 89: 3276-3279.
Galtier, N. 2003. Gene conversion drives GC content evolution in mammalian histones. Trends Genet. 19: 65-68.
Galtier, N., Piganeau, G., Mouchiroud, D., and Duret, L. 2001. GC-content evolution in mammalian genomes: The biased gene conversion hypothesis. Genetics 159: 907-911.
Gordon, D., Abajian, C., and Green, P. 1998. Consed: A graphical tool for sequence finishing. Genome Res. 8: 195-202.
Hill, E., Broadbent, I.D., Chothia, C., and Pettitt, J. 2001. Cadherin superfamily proteins in Caenorhabditis elegans and Drosophila melanogaster. J. Mol. Biol. 305: 1011-1024.
Kallenbach, S., Khantane, S., Carrol, P., Gayet, O. |