Vol. 11, Issue 5, 685-702, May 2001
The Complete Human Olfactory Subgenome
1 Department of Molecular Genetics and the Crown Human Genome Center, The Weizmann Institute of Science, Rehovot 76100, Israel; 2 Bioinformatics Graduate Program, Boston University, Boston, Massachusetts 02215, USA; 3 Department of Biological Regulation, The Weizmann Institute of Science, Rehovot 76100, Israel
Olfactory receptors likely constitute the largest gene superfamily in the vertebrate genome. Here we present the nearly complete human olfactory subgenome elucidated by mining the genome draft with gene discovery algorithms. Over 900 olfactory receptor genes and pseudogenes (ORs) were identified, two-thirds of which were not annotated previously. The number of extrapolated ORs is in good agreement with previous theoretical predictions. The sequence of at least 63% of the ORs is disrupted by what appears to be a random process of pseudogene formation. ORs constitute 17 gene families, 4 of which contain more than 100 members each. "Fish-like" Class I ORs, previously considered a relic in higher tetrapods, constitute as much as 10% of the human repertoire, all in one large cluster on chromosome 11. Their lower pseudogene fraction suggests a functional significance. ORs are disposed on all human chromosomes except 20 and Y, and nearly 80% are found in clusters of 6-138 genes. A novel comparative cluster analysis was used to trace the evolutionary path that may have led to OR proliferation and diversification throughout the genome. The results of this analysis suggest the following genome expansion history: first, the generation of a "tetrapod-specific" Class II OR cluster on chromosome 11 by local duplication, then a single-step duplication of this cluster to chromosome 1, and finally an avalanche of duplication events out of chromosome 1 to most other chromosomes. The results of the data mining and characterization of ORs can be accessed at the Human Olfactory Receptor Data Exploratorium Web site (http://bioinfo.weizmann.ac.il/HORDE).
The vertebrate olfactory system can differentiate among millions of chemicals, which are detected by olfactory receptor (OR) proteins. These are encoded by the largest gene superfamily known to date, itself part of the G-protein coupled receptor (GPCR) hyperfamily. ORs were first characterized in rat (Buck and Axel 1991).
To date, >300 human OR genes and pseudogenes have been reported (Parmentier et al. 1992).
Detailed analyses of large-scale genomic sequences of human OR clusters provided the first direct understanding of the genomic structure of OR genes and of their organization into clusters (Glusman et al. 1996).
By integrating such genomic information with the phylogenetic analysis of ORs, we could reconstruct the putative evolutionary history of the first completely sequenced OR gene cluster, on human chromosome 17 (Glusman et al. 2000b).
The recently announced first draft of the human genome (International Human Genome Sequencing Consortium 2001
The Human Olfactory Receptor Repertoire
We have performed a comprehensive data mining effort for OR genes in several data sources that together constitute the first draft of the human genome (International Human Genome Sequencing Consortium 2001).
Nearly 90% of the ORs were found in genomic sequences (Table 1), half of which were confirmed by an additional independent sequence of any type (genomic, mRNA, etc.). A significant majority of the ORs (681, 75%) spanned an interval corresponding to a full-length coding region (Fig. 2a). Of these, at least 322 ORs had intact open reading frames and are predicted to be functional. Data suggesting the presence of an expressed transcript are currently available for only a small fraction of these (<10%, Table 1), and from various tissue sources. The possibility of genomic contamination cannot be excluded. On the other hand, when predicted functional OR genes are studied in detail (Sosinsky et al. 2000
Isochore Distribution
In a gene cluster previously characterized by us, ORs were found to be located in a G + C poor, L isochore (Glusman et al. 2000b).
Chromosomal Distribution
We have assigned approximate chromosomal megabase (Mb) coordinates to most OR-containing genomic clones by integrating information from the University of California at Santa Cruz's genome draft (http://genome.ucsc.edu) with mapping information from UDB, the Unified DataBase (Chalifa-Caspi et al. 1998).
The chromosomal distribution of ORs is extremely biased, with six chromosomes (1, 6, 9, 11, 14, and 19) accounting for 73% of the repertoire. The remaining 27% are scattered on most other chromosomes, down to a single OR gene on chromosome 22 (Figs. 1, 3a). Most strikingly, chromosome 11 alone has nearly half (42%) of all of the localized ORs. This observation, together with the unique genomic organization and diversity of ORs in this chromosome (see following) suggests a central role for chromosome 11 in the evolutionary history of the olfactory subgenome.
Olfactory Receptor Clusters
We analyzed the entire genome for the occurrence of OR clusters. The definition used was that two consecutive ORs >1 Mb apart belong to different clusters. The nomenclature proposed here for naming clusters is in the form of "chromosome@coordinate", for example, "11@52" is the cluster on chromosome 11 at a position 52 Mb from the p telomere. Simulation experiments indicate that if the genes were distributed randomly, no clusters would be expected to include more than five genes (Fig. 3b). In contrast, we observed 24 clusters with six ORs or more, and these clusters include 78% of the ORs for which a coordinate is available. Two clusters, both on chromosome 11, include more than 100 members each (Fig. 1). The number of ORs in a cluster appears to follow an inverse power-law distribution (Fig. 3b), in analogy to that demonstrated for compositional correlations in long DNA sequences (Bernaola-Galván et al. 1996).
The olfactory subgenome occupies nearly 1% of the human genome. The mean cluster size was ~300 kb (excluding singletons), and 90% of the clusters had a size in the range 100 kb to 1 Mb. OR clusters were exhaustively searched for non-OR genes. Only the large clusters on chromosome 11 included long segments (up to 800 kb long) devoid of ORs but including other, non-OR genes. These may therefore be referred to as "super-clusters". Excluding these non-OR segments, the two largest OR clusters (11@52 and 11@4) span 3.25 Mb and 1.4 Mb of sequence, respectively. By summing up the Mb spans of all observed clusters that belong to the OR subgenome, and assuming that singleton ORs occupy 10 kb each (likely an underestimate), the total amount of sequence occupied by all ORs (genome-wide) is computed to be ~30 Mb.
Olfactory Receptor Pseudogenes
Of the 681 full-length ORs identified, 359 (53%) have one or more frame disruptions (frameshifts, in-frame stop codons, or disrupting interspersed repeats; Fig. 2b) and are considered to be pseudogenes. The pseudogene fraction is somewhat larger (63%) if partial sequences are included (i.e., those for which the full sequence will be available in the future). The fraction of pseudogenes observed in draft genomic sequences is higher than that from finished sequence, suggesting that the current results may somewhat overestimate the real pseudogene fraction. We asked whether pseudogene formation tends to be a cluster-wide phenomenon. For this, an analysis was performed for each cluster, whereby the deviation from the genomic average pseudogene fraction was computed and a probability was calculated by assuming a binomial distribution. None of the clusters showed a significant deviation from the expected pseudogene composition, except for the 9@106 cluster, in which only 1 of 15 ORs is a pseudogene. It may be concluded that OR disruption is a random process targeted at individual genes.
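The cluster definition and the per-cluster pseudogene test described above can be sketched in Python. This is a minimal illustration, not the pipeline used in the study: the function names, the toy coordinates, the use of the 63% genome-wide pseudogene fraction as the binomial parameter, and the choice of a one-sided tail in the direction of the observed deviation are all assumptions.

```python
from math import comb

def cluster_by_gap(positions_mb, max_gap_mb=1.0):
    """Group gene coordinates (in Mb, on one chromosome) into clusters.

    Consecutive genes more than max_gap_mb apart start a new cluster.
    """
    clusters = []
    for pos in sorted(positions_mb):
        if clusters and pos - clusters[-1][-1] <= max_gap_mb:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters

def binomial_tail(k, n, p):
    """One-sided binomial probability of a count at least as extreme as k.

    The tail is taken on the side of the expected count n * p where k lies
    (an assumption; the text does not say which tail was used).
    """
    def pmf(i):
        return comb(n, i) * p**i * (1 - p) ** (n - i)

    if k <= n * p:
        return sum(pmf(i) for i in range(0, k + 1))
    return sum(pmf(i) for i in range(k, n + 1))

# Hypothetical coordinates (Mb) for ORs on a single chromosome.
toy_positions = [4.1, 4.6, 52.0, 52.9, 55.0]
toy_clusters = cluster_by_gap(toy_positions)

# 9@106 cluster from the text: 1 pseudogene among 15 ORs, tested against an
# assumed genome-wide pseudogene fraction of 0.63.
p_9at106 = binomial_tail(1, 15, 0.63)

print(toy_clusters)
print(f"{p_9at106:.2e}")
```

Under these assumptions the 9@106 observation gives a tail probability well below 0.001, which is in line with the statement that this cluster is the one clear exception.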
The OR Sequence Space
We analyzed the identity score between each of the ORs and a representative data set of 55 non-OR GPCRs. The level of protein identity (PID) of ORs to their respective nearest GPCRs was 27.6% ± 2%, and none of the ORs showed more than 36% PID to any of the GPCRs studied (Fig. 4a). On the other hand, 96% of all ORs show >40% PID to their nearest neighbor. This suggests that the 40% cutoff efficiently discriminates between members of the OR superfamily and other GPCRs (Z-score for an OR to be 40% identical to a non-OR GPCR: 6.3). The relatively few OR sequences that show <40% identity to the most similar OR gene still show higher similarity to ORs than to non-OR GPCRs. Therefore, we assume that the OR data set is probably mostly free of contamination from irrelevant sequences.
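As a rough check of the cutoff argument, the Z-score can be recomputed from the rounded figures quoted above (mean 27.6%, standard deviation 2%). The helper names below are hypothetical, and the calculation is only a sketch of the reasoning, not the authors' procedure.

```python
def z_score(value, mean, sd):
    """Number of standard deviations separating value from mean."""
    return (value - mean) / sd

def passes_or_cutoff(nearest_or_pid, cutoff=40.0):
    """True if identity (percent) to the nearest OR exceeds the cutoff."""
    return nearest_or_pid > cutoff

# How unusual would 40% identity be for an OR compared with non-OR GPCRs?
z_at_cutoff = z_score(40.0, mean=27.6, sd=2.0)
print(round(z_at_cutoff, 1))  # 6.2
```

The result is close to the reported value of 6.3; the small difference presumably reflects rounding of the published mean and standard deviation.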
We classified ORs into families and subfamilies on the basis of evolutionary divergence as published (Dayhoff 1976).
We used principal components analysis (PCA) (de Leeuw 1988).
Human Class I ORs Are Abundant and Probably Functional
Class I ORs were originally identified in fish (Ngai et al. 1993).
Evolution of Sister Clusters
We have recently reported the analysis of a cluster (17@3) on chromosome 17 (Glusman et al. 2000b).
One of the surprising paralogous pairwise relations links the only ORs
in these two clusters that do not belong to family 1. Thus,
OR5C1, an OR of family 5 in the 9@106 cluster shows
similarity to several ORs of family 3 in the 17@3 cluster. This is
consistent with the phylogenetic tree observed for OR family consensi
(Fig. 4d). It may be hypothesized that this cluster duplication marked the establishment of family 3, which evolved out of family 5, with
OR5C1 representing their "evolutionary link". The two
families separate well in PCA (Fig. 5). Interestingly, the region
surrounding OR5C1 shows a higher G + C content (Fig. 6), a
relative rarity among ORs. Most likely, a family 5 OR from a G + C
rich isochore invaded the 9@106 cluster before cluster duplication
and was retained and expanded in the 17@3 cluster. This is in line
with our previous observation that family 3 ORs have a higher G + C content
than does the family 1 cluster surrounding them (Glusman et al. 2000b).
Global Cluster Evolution
To better understand the evolutionary pathways that led to the present human OR repertoire, we have performed a comprehensive comparison of the 24 clusters that contain six ORs or more. This was done by a novel, automated comparative cluster analysis (CCA), which formalizes the pairwise cluster comparison exemplified in the previous section. In brief, each pair of clusters was characterized by a metric that embodies the similarity of ORs within them (cluster identity level, CIL) and the probability that one of them arose from the other by partial or complete cluster duplication. Subsequently, a dendrogram was constructed on the basis of such pairwise comparisons (Fig. 7). The results are consistent with two ancestral gene clusters, each containing solely members of one class: Class I on 11@4 and Class II on 11@52. The latter appears to have given rise to all other clusters by way of sequential cluster duplication, and it probably included at least one founder gene for each of families 1, 2, 4-6, and 8-10.
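One step of the pairwise comparison, as detailed in the Methods (grouping genes into later duplication groups at a candidate cluster identity level, averaging identities between groups, and keeping mutual best hits), can be sketched as follows. This is an illustrative sketch only: the single-linkage grouping rule, the dictionary-based identity table, the gene names, and the PID values are assumptions, and the score function used to choose the final identity level is not reproduced here.

```python
from itertools import product
from statistics import mean

def make_pid_table(entries):
    """Store pairwise percent identities under order-independent keys."""
    return {frozenset(pair): value for pair, value in entries.items()}

def later_duplication_groups(genes, pid, ic):
    """Group genes of one cluster whose identity is >= ic + 5 (single linkage, assumed)."""
    threshold = ic + 5.0
    groups = []
    for gene in genes:
        linked = [g for g in groups
                  if any(pid[frozenset((gene, other))] >= threshold for other in g)]
        merged = [gene]
        for g in linked:
            merged.extend(g)
            groups.remove(g)
        groups.append(merged)
    return groups

def group_identity(ldg_a, ldg_b, pid):
    """Average PID over all gene pairs drawn from two groups of different clusters."""
    return mean(pid[frozenset((a, b))] for a, b in product(ldg_a, ldg_b))

def mutual_best_pairs(ldgs_a, ldgs_b, pid):
    """Return (i, j, identity) for groups that are each other's best hit."""
    ig = [[group_identity(a, b, pid) for b in ldgs_b] for a in ldgs_a]
    pairs = []
    for i, row in enumerate(ig):
        j = max(range(len(row)), key=row.__getitem__)
        best_i = max(range(len(ldgs_a)), key=lambda k: ig[k][j])
        if best_i == i:
            pairs.append((i, j, ig[i][j]))
    return pairs

# Hypothetical clusters A (a1-a3) and B (b1, b2) with made-up identities.
toy_pid = make_pid_table({
    ("a1", "a2"): 90.0, ("a1", "a3"): 45.0, ("a2", "a3"): 46.0,
    ("b1", "b2"): 44.0,
    ("a1", "b1"): 60.0, ("a2", "b1"): 62.0, ("a3", "b1"): 40.0,
    ("a1", "b2"): 41.0, ("a2", "b2"): 42.0, ("a3", "b2"): 58.0,
})
ldgs_a = later_duplication_groups(["a1", "a2", "a3"], toy_pid, ic=50.0)
ldgs_b = later_duplication_groups(["b1", "b2"], toy_pid, ic=50.0)
pairs = mutual_best_pairs(ldgs_a, ldgs_b, toy_pid)
print(ldgs_a, ldgs_b, pairs)
```

In this toy case a1 and a2 collapse into one group (a presumed recent local duplication), so the comparison is made between two ancestral units per cluster rather than between individual genes.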
This analysis also suggests that an early major event in the evolution of Class II ORs was the duplication of the almost complete ancestral Class II cluster, into what is now the q-arm sub-telomeric region of chromosome 1 (1@255; Figs. 1, 7), with a CIL of 47%. From this point in evolutionary history, the two clusters apparently had rather different fates: the original one (11@52) expanded within chromosome 11 by growing in size and by duplicating into new locations, including to the vicinity of the Class I cluster (11@6). In contrast, the new cluster (1@255) proceeded along the path of interchromosomal migration. It seems to have multiplied directly into at least six locations in the genome, and many of these propagated further into additional chromosomal neighborhoods.
Potential Orthologous Assignments
The assignment of orthologous pairs can be difficult for several reasons, including gene duplication events that occurred subsequent to speciation and unequal rates of evolution in different species and gene lineages. Such difficulty is compounded by the fact that usually the data sets are incomplete: the true ortholog of a gene in a given species may not have been observed yet. We took advantage of having the human genome sequence with almost complete coverage to detect the most similar human gene (best hit, or BeT) for each nonhuman OR sequence detected in the present data mining effort. We then calculated for each represented species an average identity level for all of its BeTs (Table 2). To avoid underestimates due to still-undetected human ORs and, conversely, overestimates due to contamination of nonhuman data sets with human sequence, we ignored BeTs with PIDs over 2 standard deviations away from the observed mean for each species. Translation into the million year (Myr) timescale for vertebrate evolution (Kumar and Hedges 1998
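The best-hit averaging with outlier exclusion described in this section can be illustrated with a short Python sketch. The function names and the toy identity values are hypothetical, and the use of the sample standard deviation is an assumption; the text does not specify which estimator was used.

```python
from statistics import mean, stdev

def best_hit_pid(pids_vs_human):
    """PID of the most similar human OR (the BeT) for one nonhuman OR."""
    return max(pids_vs_human)

def species_identity(bet_pids, n_sd=2.0):
    """Average BeT identity for a species, ignoring values beyond n_sd SDs of the mean."""
    mu = mean(bet_pids)
    sd = stdev(bet_pids)  # sample standard deviation (assumed)
    kept = [p for p in bet_pids if abs(p - mu) <= n_sd * sd]
    return mean(kept)

# Hypothetical BeT identities (percent) for one species; the 100.0 entry
# mimics a nonhuman data set contaminated with human sequence.
toy_bets = [80.0, 81.0, 79.0, 80.0, 82.0, 78.0, 81.0,
            80.0, 79.0, 80.0, 81.0, 79.0, 100.0]
toy_identity = species_identity(toy_bets)
print(toy_identity)
```

With only a handful of values, a single outlier inflates the standard deviation enough to escape a 2 SD filter, so a sketch like this behaves as intended only when each species contributes a reasonable number of sequences.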
Subfamily-Specific Expansions
For most of the families, the average number of ORs per subfamily is surprisingly constant. This is manifested in a linear relationship between the number of genes and the number of subfamilies (Fig. 4c). The slope indicates an average of two ORs per subfamily, and a global calculation (including all families) shows an average of three ORs per subfamily, or over five ORs per subfamily if singleton subfamilies are excluded. Families that obey this "two ORs per subfamily" rule are likely to represent ancient divergence events, in which gene duplication took place for a certain period of time and then stopped. Thus, for a typical gene in such families there is only one other gene with an identity score higher than 60%. There are, however, three families that show significant deviation from the slope of 2. These are families 2, 4, and, to a much higher extent, 7 (Fig. 4c). The simplest interpretation is that certain subfamilies within such families have undergone a recent flurry of gene duplication and hence have many more ORs. Most subfamilies (>85%) are chromosome- and cluster-specific. On the other hand, some specific OR subfamilies have undergone a striking scattering phenomenon. One subfamily of family 7 (7E) dispersed to at least 35 genomic locations on almost all chromosomes (Fig. 1), in what is probably a primate-specific evolutionary trait. Likewise, some subfamilies of family 4 together expanded into over 15 locations throughout the genome.
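The ORs-per-subfamily ratio discussed above can be computed directly from gene symbols. The sketch below assumes symbols of the form OR, family number, subfamily letters, member number, with an optional trailing P for pseudogenes (as in OR5C1 from the text); the parsing pattern, function names, and the toy symbol list are illustrative assumptions, not a real data set.

```python
import re
from collections import defaultdict

# Assumed symbol layout: OR + family number + subfamily letters + member number (+ optional P).
SYMBOL = re.compile(r"OR(\d+)([A-Z]+)(\d+)P?")

def parse_symbol(symbol):
    """Split an OR symbol into (family, subfamily, member)."""
    match = SYMBOL.fullmatch(symbol)
    if match is None:
        raise ValueError(f"unrecognized OR symbol: {symbol}")
    family, subfamily, member = match.groups()
    return int(family), subfamily, int(member)

def ors_per_subfamily(symbols):
    """Average number of ORs per subfamily, reported for each family."""
    counts = defaultdict(lambda: defaultdict(int))
    for symbol in symbols:
        family, subfamily, _ = parse_symbol(symbol)
        counts[family][subfamily] += 1
    return {family: sum(subs.values()) / len(subs)
            for family, subs in counts.items()}

# Illustrative symbols only.
toy_symbols = ["OR5C1", "OR7E1", "OR7E2", "OR7E3P", "OR7A1",
               "OR1A1", "OR1A2", "OR1B1", "OR1B2"]
toy_ratios = ors_per_subfamily(toy_symbols)
print(toy_ratios)
```

A family whose ratio sits well above the others in such a table would correspond to the kind of recent subfamily expansion described for families 2, 4, and 7.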
The Structure of the Human OR Subgenome
The full characterization of the human olfactory subgenome is significant for a number of reasons. By identifying the number of functional human olfactory receptors, we have provided crucial information for understanding the genomic basis of combinatorial information encoding in this pathway (Lancet 1986).
Previous estimates of the size of the human OR repertoire have ranged widely, from a rough extrapolation of 130 ORs (Ben-Arie et al. 1994
The present report provides a first global genomic map at <1-Mb resolution of the olfactory subgenome. This is made possible by the recent availability of large-scale human genomic sequence (International Human Genome Sequencing Consortium 2001).
The overall localization of ORs on all chromosomes except 20 and Y is in agreement with previous work based on fluorescence in situ hybridization (FISH) (Rouquier et al. 1998).
A major outcome of the OR mapping results is the discovery of a disproportionately large OR count on chromosome 11. There were previous hints for OR-rich regions on this chromosome, suggesting seven distinct OR clusters (Buettner et al. 1998).
The Genome-Wide Evolution of OR Genes
Our CCA provided a powerful tool to analyze the evolutionary history
that has led to the present genome-wide disposition of ORs. It is a
unique case in which a very large, well-defined vertebrate gene
superfamily is subjected to a systematic formal scrutiny, by using the
availability of the first nearly complete vertebrate genome. It appears
that genome-wide expansion was initiated from chromosome 11 (Fig.
8) but went indirectly through an early
duplication to chromosome 1 (1@255). Interestingly, the timeframe
for this initial cluster duplication is compatible with that of the
second tetraploidization event of vertebrate evolution. Indeed, the
q-telomeric region of chromosome 1 and the centromeric region of
chromosome 11, where the ancestral OR clusters reside, have been shown
to be paralogous since those ancient times (Jekely and Friedrich 1999).
We hypothesize that the major driving force for the multichromosomal proliferation resides in the properties of this chromosome 1 cluster. It is not unlikely that two different mechanisms have been at work during the major steps in the radiation of OR clusters. The chromosome 11 repertoire expansion may have been the result of an intrachromosomal duplication mechanism, leading, among others, to the formation of two "super-clusters". On other chromosomes, a second process that enabled the copying of ORs among different chromosomes has led to further repertoire augmentation. Many of the inferred cluster duplications have very similar CIL values,
lowering the confidence in the specific evolutionary pathway described in Figure 7. Nevertheless, the idiosyncratic gene contents of each cluster allow the reconstruction of several directional events of partial cluster duplication. The main potential pitfalls of this analysis include the assumption of nearly constant evolutionary rates on all clusters, and the possibility of gene conversion, which could lead to erroneous cluster lineage reconstruction. Although this has been demonstrated in the primate lineage (Sharon et al. 1999
Another important result of the CCA, coupled with the detection of
potential orthologs for nonhuman ORs, is the delineation of the
potential timeframe for OR cluster evolution. In the earliest stage,
presumably before the emergence of amphibians (>400 Myr ago),
precursors of most of the extant OR families appeared by local gene
duplication. Next followed the radiation to multiple chromosomes,
around the era of amphibians (300-400 Myr ago). Subsequently, a
lengthy period of relative quiescence took place, lasting perhaps 150 Myr, with only minor further local duplication and diversification. This is manifested in the fact that most subfamilies are
cluster-specific and in the significantly reduced evolutionary rate
observed. Finally, in the last 10 Myr, the primate repertoire was
subject to the combined effects of many functional genes turning into
pseudogenes (Sharon et al. 1999
Class I ORs
The most ancient event that can be inferred in the evolution of ORs is the initial split between Class I (fish-like) and Class II (tetrapod-specific). In the clawed frog Xenopus these were shown to be expressed differentially in water- or air-accessible cavities, respectively (Freitag et al. 1995).
Class II families are all present in more than one chromosome each,
except for the very small family 12. In sharp contrast, all human Class
I ORs for which a coordinate could be ascertained are located in a
single large cluster (11@4). Why did they not migrate to other
chromosomes when interchromosomal duplications appear to be the rule?
Most such duplications appear to have followed the invasion of
chromosome 1, and OR gene duplications from chromosome 11 into other
chromosomes appear to have been rather infrequent events. Therefore,
Class I ORs may have remained clustered by chance alone. Alternatively,
Class I expression might depend on the presence of regional control
sequences, similar to the locus control region of
Principal Components Analysis of ORs
The complete depiction of the mutual relationships between n sequences requires their visualization in an (n
PCA results are not based on the family classification scheme used (Glusman et al. 2000a).
A potential difficulty of the PCA method is sensitivity to sampling biases. Because PCA aims to account for a maximum amount of the variability in the data, overrepresented families could appear to be more dissimilar from the rest of the ORs than they are in reality. This effect does not appear to cause any major distortions in the analysis of ORs presented here. For example, the non-OR GPCRs we used segregate from all ORs in the first principal component of the initial round of PCA, despite their small numbers. Similarly, the Class I ORs segregate first in the next round of PCA, even though they are a minority.
The Pseudogenization of the OR Repertoire
We have observed nearly 1000 ORs in the human genome (a microsmatic
species), not unlike the number expected in macrosmatic species (e.g.,
rodent, canine). What brings about the suggested difference in odor
perception capabilities between macrosmatic and microsmatic mammals?
This likely reflects the fact that only about one-third of the human ORs appear to be functional, consistent with previous reports showing a large proportion of pseudogenes (Rouquier et al. 1998).
We found pseudogenes to be intermingled with apparently functional genes. The distribution of pseudogenes is consistent with a scenario in which genes have become pseudogenes at random, potentially because of reduced purifying selective pressure on the whole repertoire (Rouquier et al. 1998).
The observed OR clusters lack a clear internal structure as observed for homeobox (Deschamps et al. 1999
Data Mining
A data mining pipeline was constructed to detect all available OR-like sequences in the public databases and to update the results as new database versions are released. TBLASTN (Altschul et al. 1997
Localization of ORs
Genomic localization was done on the basis of the July 17 freeze of
the University of California at Santa Cruz's Working Draft Sequence
(UCSC "GoldenPath"; http://genome.ucsc.edu), which presents a
tentative assembly of the finished and draft human genomic sequence based on the Washington University-Saint Louis clone map
(http://genome.wustl.edu/gsc). A coordinate was assigned to each
finished or unfinished genomic clone, in megabases (Mb) from the p
telomere of the given chromosome. In parallel, we used the Unified
DataBase (UDB; http://bioinformatics.weizmann.ac.il/udb) to assign
similar Mb coordinates to the clones, on the basis of their marker
contents (Chalifa-Caspi et al. 1998).
Detection and Classification of OR Sequences
Each subcontig was compared by using FASTY (Pearson et al. 1997
A given gene could be represented in more than one overlapping sequenced clone. We removed such redundancy by considering two sequences as representing the same gene if they are on the same chromosome, located in clones <300 kb apart, and at least 99% identical at the nucleotide level. An exception to this rule occurred when two genes coappeared in the same clone, in which case they were considered to be distinct (only three such cases were encountered). It is possible that some very similar and neighboring genes could be misclassified as being the same, but we estimate this circumstance to be rare on the basis of our GESTALT analysis of complete clusters, which included >75% of all ORs detected. Sequences localized to a chromosome but without a coordinate were only compared to other sequences within that chromosome, and for those sequences lacking a chromosomal assignment the criterion of chromosome location was not applied.
For each resulting gene with more than one constituent sequence, a
weighted consensus nucleotide sequence was created after multiple
alignment by CLUSTALW (Higgins et al. 1996).
ORs with a length of at least 275 amino acids without frame disruptions (frameshifts, in-frame stop codons, or disrupting interspersed repeats) were considered to be full length and apparently intact. Partial sequences without internal frame disruptions but disrupted by virtue of being embedded in non-OR sequence were defined as pseudogenes. Apparently intact ORs that were incompletely sequenced were excluded from the computations.
Each OR gene was assigned a family and subfamily by amino acid sequence similarity to previously classified OR genes, as described (Glusman et al. 2000a).
Isochore Analysis
To study the G + C content of an OR's environment, we used the
unmasked sequences within 5, 10, 20, and 30 kb surrounding (but
excluding) its coding sequence. The four resulting data sets yielded
almost identical results (average difference in G + C content between
sets was <1.5%). We therefore used the 5-kb environment range. Such
G + C content values could be computed for 77% of ORs. Whole genome
values were taken from the genome draft (International Human Genome
Sequencing Consortium 2001).
Detection of Potential Orthologs
In addition to the human ORs, the data mining procedure yielded 851 OR sequences from 31 additional species. A BeT was determined for each
of the nonhuman ORs detected as a result of the data mining effort, by
comparing its conceptually translated sequence to the final set of
human ORs and taking the hit with the highest PID. For each species, a
divergence level from human was computed as the average PID of its ORs,
excluding those more than two standard deviations away from the mean. A
published molecular timescale (Kumar and Hedges 1998
Principal Components Analysis
A distance matrix was constructed representing the pairwise PID scores of ORs to each other and to 55 non-OR GPCRs. Each column in the matrix was normalized by dividing it by its standard deviation. Principal components were computed by using the Matlab package (MathWorks). Only apparently functional genes were used in this computation. The first three principal components were then used to map all genes, as well as pseudogenes, onto a three-dimensional space. Rendering of graphics and visualization were performed by using Spotfire.net Desktop 5.0 (Spotfire Inc., http://www.spotfire.com). In the sequential PCA method, clearly segregating groups of sequences are removed, and rounds of PCA are performed iteratively on the reduced data sets to visualize further sequence variability information.
Comparative Cluster Analysis
OR clusters were defined as the maximal groups of OR genes along
one chromosome, such that the distance between two consecutive genes
does not exceed 1 Mb. This cutoff was taken because of the low
resolution of the mapping information used for assembling the genome
draft. A tentative DNA sequence was built for each cluster by
assembling all relevant finished and unfinished clones in
Sequencher (GeneCodes Corp). Although some uncertainty remains in contig orientation and order, this did not affect the analysis, which was based on the gene content of the clusters. To validate the correctness of the OR detection pipeline, we analyzed and visualized all cluster sequences by using the GESTALT Workbench (Glusman and Lancet 2000).
For each possible pair of OR clusters, we tested the hypothesis that one of them arose as a partial or full duplication of the other, and we determined the PID cutoff that best describes the divergence between the genes composing them. A cluster identity value I_C was used in this analysis in the following way: For every I_C value in the range 20%-100%, with an increment of 1%, the structure of both clusters was reconstructed by identifying later duplications. OR genes in each cluster showing a mutual identity of I_C + 5% or more were defined as a "later duplication group" (LDG), potentially formed by local duplication after cluster divergence. Subsequently, the clusters were subjected to a pairwise comparison among all possible pairs of LDGs. The identity value I_G between two LDGs A and B (of different clusters) was defined as the average PID between all pairs of genes a_i and b_j, where gene a_i is located in LDG A, and gene b_j is located in LDG B. This analysis yielded a matrix of I_G scores between LDGs of both clusters. We then identified those pairs of LDGs that represent mutual BeTs, that is, that show higher identity to each other than does each of them to the other LDGs.
These LDG pairs represent putative gene units at the time of cluster divergence, and their I_G should be compatible with the postulated cluster identity level I_C. This was tested by defining a score function f(I_C) as
We thank Eitan Rubin and Compugen Ltd. for providing olfactory receptor sequence information from the LEADS database. D.L. holds the Ralph and Lois Silver Chair in Human Genetics. This work was supported by the Crown Human Genome Center, a Ministry of Science grant to the National Laboratory for Genome Infrastructure, the National Institutes of Health (DC00305), the Krupp foundation, the German-Israeli Foundation for scientific research and development and the Weizmann Institute Glasberg, Levy, Nathan Brunschwig, and Levine funds. The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
4 Present address: The Institute for Systems Biology, 4225 Roosevelt Way NE, Seattle, WA 98105, USA.
5 Corresponding author.
E-MAIL doron.lancet{at}weizmann.ac.il; FAX 972-8-9344487.
Article and publication are at www.genome.org/cgi/doi/10.1101/gr.171001.