Vol. 11, Issue 5, 645-651, May 2001
COMMENTARY
The past year has brought unprecedented public attention to
biomedical research, with a particularly intense
focus on the Human Genome Project and the completion
of a first-generation ~3-billion-basepair human genome sequence. Much
of this attention related to the competition between the two parallel,
yet separate, efforts of the publicly-funded International Human Genome
Sequencing Consortium and the private company Celera Genomics. Despite
the apparent rancor between the groups, two celebratory events notably punctuate the past year: the joint media announcement in late June 2000 that both groups had generated a "working draft" sequence of the
human genome and the two landmark scientific publications in February
2001 that describe the efforts of each project (International Human
Genome Sequencing Consortium 2001; Venter et al. 2001).
Numerous grandiose clichés and metaphors have been used to convey the
magnitude of these accomplishments and their associated implications
for biomedical research and clinical medicine. Here we add one more to
this list. Our choice for capturing the essence of contemporary human
genome analysis is an analogy to a mountain climbing expedition, one
where significant progress has been made to provide a spectacular view
of the genetic landscape. But this expedition is not yet complete;
uncertain yet exciting genomic terrain lies ahead. Indeed,
the Human Genome Project is now firmly at the "base camp" of the
expedition to elucidate the human genetic blueprint and to begin to
understand its content. Nevertheless, this is a milestone of tremendous
significance and excitement.
Here we outline some of the key lessons learned during the initial analysis of the human genome sequence. We highlight a few of the many remaining questions in understanding the genome's structure and function, with most answers likely becoming available later in the expedition. Finally, we preview the anticipated climb to the final summit and the ascent to a complete and finished sequence.
Detailed Reports from the Base Camp
Two papers (International Human Genome Sequencing Consortium 2001;
Venter et al. 2001) report on the draft human genome sequence and
provide the first detailed views from the base camp. The rigor and
presentation of both papers are outstanding, and the respective groups
should be commended for their scholarly contributions to the scientific
literature. Of course, these papers and the related companion articles
in the corresponding issues of Nature and Science represent the tip of a literature iceberg that will be revealed over time, as more complete sequence becomes available and other investigators develop more insightful ways of studying the genome's structure and function. Indeed, much of this can be followed daily on
various Web sites that provide browser-based views of the human genome
(e.g., see genome.cse.ucsc.edu, www.ensembl.org,
www.ncbi.nlm.nih.gov/genome/guide/human). Even keeping track of the
most informative Web sites can be challenging; toward that end,
electronic "hubs" have been created that provide
pointers to the most relevant and, in some cases, new sites (e.g., see
www.nhgri.nih.gov/genome_hub).
We make no attempt to summarize all of the analyses presented in the
two major and other companion papers. Rather, we provide some clues
about the major findings uncovered and actively encourage all readers
to carefully read through these important publications.
Views from the Initial Ascent
The Human Gene Inventory
A very small fraction of the human genome sequence encodes protein. Perhaps a bit surprising to some, this figure is 1-2%. Generating a complete inventory of protein-coding sequences represents a high priority in the analysis of the human genome. Even a year ago, there remained significant debate about the total number of human genes, with estimates differing by threefold or more (Ewing and Green 2000
Tandem Versus Dispersed Gene Duplications
If new proteins have arisen from the old, then how did the new biological functions arise and evolve? Susumu Ohno (Ohno 1970
Horizontal Gene Transfer
One major surprise from the human sequence (International Human Genome Sequencing Consortium 2001
Nature of Non-Coding DNA
More than 98% of the human genome does not encode protein. Of this, just under one-half corresponds to interspersed repetitive DNA, reflecting remnants of transposons. The human genome contains the following major classes of interspersed repeats (in the indicated proportions of the total draft sequence): SINEs (13%), LINEs (20%), LTR retrotransposons (8%), and DNA transposons (3%) (International Human Genome Sequencing Consortium 2001
Long-Range Distribution of Repetitive Sequences
Consistent with the general nonhomogeneous nature of features across the human genome, the distribution of repetitive elements varies greatly across each human chromosome. Some regions contain upwards of 90% repetitive sequences in intervals >500 kb, whereas others (e.g., the HOX gene clusters) are nearly devoid of repeats (<2% of the sequence). The possible functional significance of both extremes is of great interest: Do they arise from variation in the invasion or the retention of repeat sequences? Similarly, certain repetitive elements tend to cluster within specific genomic neighborhoods, such as Alu elements within the GC-rich regions of the genome.
Specialized Chromosomal Regions
There is now increased acceptance of the view that repeated sequences can impart biological functions. Nowhere is this more evident than in the structures of centromeres and telomeres as well as their neighboring locations. The ~5% of the human genome recognized as containing
α-satellite sequences is now widely thought
to be associated with centromere function, while the
telomere-associated repeat motif TTAGGG is critical for telomere
structure and function (Riethman et al. 2001
Patterns of Polymorphism across the Human Genome
Most sequence differences between individuals arise from allelic sequence changes, and understanding the patterns of this genomic variation is crucial for the study of molecular alterations in human disease. Simple sequence repeat (SSR) motifs are highly polymorphic and have been the workhorses of genetic disease mapping. The draft human genome sequence reveals an SSR every 2 kb (on average), accounting for ~3% of the human genome. Of these, dinucleotide repeats are the most abundant (with AC and AT repeats being the most common), but trinucleotide repeats, including those directly involved in human disease, are less frequent than expected. The source of this intriguing difference is unknown. Previous mapping indicated the presence of fewer X chromosome SSRs than expected, either because they are less frequent or because they are less variable. The draft sequence shows an incidence of SSRs on the sex chromosomes equal to that on autosomes, indicating that X chromosome SSRs are less polymorphic. This can be explained by the unique genetic behavior of the sex chromosomes. The draft human sequence has also driven the discovery of the most common type of human sequence variant, the single nucleotide polymorphism (SNP). To date, >2.3 million SNPs have been identified (The International SNP Map Working Group 2001
4. This value appears to hold for entire autosomal
chromosomes but is severely reduced on sex chromosomes. The latter
finding is expected because of the special transmission biology of sex chromosomes. The most important finding is the highly significant local
genomic variation in SNP diversity; thus, when 200-kb segments across
the genome are compared, the nucleotide diversity varies from 0 to 60 × 10⁻⁴.
These data are as expected, given the current thinking
of human genetic and genome history (i.e., the founding of the human
population from ~10,000 founders ~150,000 years ago). Importantly,
the SNP data show that there is greater variation within a given
genome than between given human genomes. With the majority of SNPs in
non-coding DNA and unlikely to affect biological function, the
challenge remains to identify the ~1% of SNPs that affect protein
function, so that they can be used as direct probes of common human
disease. Nonetheless, the majority of all known SNPs remain crucial
tools as linkage and association markers, helping to identify the
subset of common SNPs that are functionally relevant.
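The windowed comparison described above can be sketched in a few lines. This is a minimal illustration, assuming just two aligned sequences, so that diversity reduces to the fraction of mismatched sites per window; the function name `windowed_diversity` and the toy sequences are hypothetical and are not taken from the cited SNP analyses, which used many chromosomes and 200-kb windows.

```python
from typing import List


def windowed_diversity(seq_a: str, seq_b: str, window: int) -> List[float]:
    """Fraction of differing sites between two aligned sequences, per window.

    Assumption: with only two sequences, nucleotide diversity in a window
    is simply (number of mismatched sites) / (number of sites compared).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if window <= 0:
        raise ValueError("window must be positive")

    values = []
    for start in range(0, len(seq_a), window):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        mismatches = sum(1 for x, y in zip(a, b) if x != y)
        values.append(mismatches / len(a))
    return values


# Toy data (hypothetical): two 20-bp sequences, compared in 10-bp windows.
toy_a = "ACGTACGTACGTACGTACGT"
toy_b = "ACGTACGAACGTACGTTCGA"
print(windowed_diversity(toy_a, toy_b, window=10))
```

On real data the windows would be far larger and the per-window values far smaller, but the same loop shows how local variation in diversity becomes visible once the genome is cut into equal segments.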
Patterns of Recombination across the Genome
One major factor affecting SNP frequency across the genome is the variation in recombination frequency within the human genome. As with other organisms, the frequency of meiotic recombination shows variation at several levels. The mapping of >8000 SSRs, and comparison of the resulting meiotic map to the physical map, allow for a detailed accounting of this variation (Yu et al. 2001
Genome-Wide Mutation Rates
SNP patterns also depend on local mutation frequencies. Evolution occurs through the selection of individual mutations, and so it is natural to ask whether the mutation rate is the same across the compositionally heterogeneous human genome. The draft human sequence shows remarkable regional bias in substitution patterns based on GC content. Specifically, GC/CG pairs mutate to AT/TA pairs at a higher rate in AT-rich regions compared to GC-rich regions. This bias appears to arise from the earlier replication of GC-rich regions and the corresponding depletion of guanine pools. It is also likely that differences in DNA repair associated with transcriptional activity (e.g., gene-rich regions) contribute to the variation. Consequently, the human genome is not at equilibrium and has ~7% greater GC-content than expected. This feature may be due to natural selection or the invasion of transposable elements that prefer GC-rich regions. It has been suggested that much of non-repeated DNA may be remnants of ancient repeats no longer recognizable through mutation. Two other major features emerge: the youth of the Y chromosome, where DNA can be both gained and lost with little functional consequence (as assessed by younger-than-average LINE and other repeat elements) and the 2:1 excess of mutations in the male versus female germline. This last feature could either reflect inefficient repair in the male germline, analogous to mtDNA, or the greater number of cell divisions in male meioses.
Strategies for Genome Exploration
Significant attention has been given, in the scientific and popular press, to the two different strategies adopted by the Human Genome Project and Celera Genomics for their respective human genome expeditions. The main questions revolved around which was the more efficient method, whether repeats compromised the assembly of whole-genome shotgun data, and the accuracy of the assembly prior to finishing. While perhaps not yielding a particularly good "story," the assessment made at the base camp reveals remarkable convergence in these respective approaches, especially relevant for the planning of future ascents and expeditions of the human and other genomes. In their paper (International Human Genome Sequencing Consortium 2001
To the Summit of the Human Genome ... and Beyond
The human genome has surely not been the first genome to be sequenced; nevertheless, it has been the most remarkable. This is not an anthropocentric view. The interest in this genome, from a strictly scientific view, stems from what it is teaching us about how information is retained and modified, as well as how it evolves within a genome. Among all genomes sequenced, the human genome has the lowest gene density; in other words, it has the largest noncoding-to-coding ratio. Only a tiny fraction of the genome is gene coding, including the surprising fistful of genes that we "inherited" from bacteria. Although no new protein families are evident, the genome has been remarkably adept at increasing protein diversity both by combining new domains and by invoking alternative splicing. Surprisingly, beyond this stable set of DNA, the rest, and majority, of the human genome is a tempest. The genome appears routinely both to duplicate segments and haul them away to neighboring and distant parts, and to be a "rooming house" for transposons of various sorts and for various lengths, although it has been quiet lately. Much of the transposon-derived sequence changes quickly, and thus is now beyond recognition and appears as unique non-coding DNA. When these elements go for a ride, so do genes in the neighborhood. These features explain both how new functions usually emerge and how the C-value paradox arises. The C-value paradox arose from an attempt to explain the lack of correlation between organismal complexity and DNA content: Many salamanders have DNA contents that would put the human to shame. Thus, saltational DNA expansion may indicate such transposon invasions, which themselves can mould the genome in the future but not necessarily lead to new gene functions or proteins. Although non-genic, these elements can lead to human disease. Thus, the non-homogeneous nature of the human genome may arise from very different and distinct processes shaping the evolution of the genome.
Finishing the human genome and comparing it to other vertebrate genomes should clarify whether or not the above is a realistic scenario (see below).
As with any expedition, arrival at the base camp also intensifies the
preparation for the final ascent. Although the initial view of the
human genome is fascinating, the scientific data from the completed and
finished sequence will be even more so. Thus, the highest current
priority is to finish the human genome sequence to a high accuracy and
as completely as possible. Currently, greater than one-third (>1 Gb)
of the human genome is finished to an accuracy of <1 error per 10,000 bp, with the goal of finishing the remaining <2 Gb within the next two
years. Among the many tasks are the following: Numerous gaps, both in
the clone map and in the sequence of individual clones, must be filled;
minor (although irritating) instances of clone-to-clone contamination
within the draft sequence must be rectified; and long-range
ambiguities, especially with regard to large segmental duplications
(see Eichler 2001), need to be resolved.
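The scale of the finishing task can be made concrete with a little arithmetic. The sketch below is only illustrative: it takes the figures quoted above at their bounds (a genome of ~3 billion bp, 1 Gb finished, and exactly 1 error per 10,000 bp), and the constant and function names are hypothetical.

```python
GENOME_BP = 3_000_000_000    # ~3 billion bp, as quoted for the draft sequence
FINISHED_BP = 1_000_000_000  # ">1 Gb" finished, taken at its lower bound
BASES_PER_ERROR = 10_000     # "<1 error per 10,000 bp", taken at its upper bound


def error_ceiling(bases: int, bases_per_error: int = BASES_PER_ERROR) -> int:
    """Upper bound on the number of errors tolerated in `bases` of finished sequence."""
    return bases // bases_per_error


remaining_bp = GENOME_BP - FINISHED_BP

print(error_ceiling(FINISHED_BP))  # ceiling for the sequence already finished
print(error_ceiling(GENOME_BP))    # ceiling for a fully finished genome
print(remaining_bp)                # bases still to be finished under these assumptions
```

Under these assumptions, even a genome finished to the stated standard could still carry on the order of a few hundred thousand residual errors, which is one reason the finishing standard is expressed as an upper bound rather than a target.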
With a finished human genome sequence in hand, the view from the summit
will be even more spectacular because some of the current views will
undergo revision. Activities at the summit will aggressively focus on
further refinement of the human gene inventory. These will be aided by
ever-improving computational tools for gene prediction (e.g., Kan et
al. 2001; Rojic et al. 2001; Yeh et al. 2001; Zhou et al. 2001),
comparative analyses with other vertebrate genome sequences (see below),
and the generation of complete sets of mammalian full-length cDNA
sequences (Strausberg et al. 1999; The RIKEN Genome Exploration
Research Group Phase II Team and The FANTOM Consortium 2001) (e.g., see
mgc.nci.nih.gov). In parallel, a more complete understanding of the
corresponding and inevitably more complex human protein inventory will
be pursued. Almost certainly this will involve new initiatives to study
protein structure and function on a large scale, similar to the
industrialization of DNA sequencing that has occurred over the past
five years. In addition, systematic efforts to identify regulatory
elements that orchestrate the complex expression of genes will begin
(Pennacchio and Rubin 2001). Finally, genetic variation of the sequence
will teach us to what extent protein and regulatory functions are
impacted by inter-individual sequence differences (Chakravarti 2001).
A key component of the above activities, especially with respect to the
cataloging of genes and their regulatory elements, will be the
comparative sequencing of multiple other vertebrate genomes by the
Human Genome Project. In contrast to the sequencing of the human
genome, smaller groups of sequencing centers will come together to
elucidate the sequence of other vertebrates. For example, hybrid
sequencing strategies are being actively used to sequence the mouse,
rat, and zebrafish genomes. Sequencing efforts involving the generation
of whole-genome shotgun data are also ongoing for two pufferfish
species. The selection and prioritization of other vertebrate genomes
for sequencing by the Human Genome Project represents an active and
often lively topic of discussion [e.g., in contemplating the
sequencing of the chimpanzee genome (VandeBerg et al. 2000; Varki 2000;
McConkey and Varki 2000)]. A critical issue is the desired level of
completeness and accuracy for comparative sequencing, although it has
been the experience to date that the most definitive and compelling conclusions emerge only from the analysis of highly accurate sequence data. With the available sequencing capacity, the continued decline in
the costs of sequencing, and the increasing recognition of the value of
comparative sequence data, it can be confidently anticipated that
numerous other parallel expeditions will be initiated for sequencing
myriad vertebrate genomes in the years ahead.
FOOTNOTES
3 Corresponding author.
E-MAIL egreen{at}nhgri.nih.gov; FAX (301) 402-4735.
Article and publication are at www.genome.org/cgi/doi/10.1101/gr.188701.