Published online before print
September 20, 2001, 10.1101/gr.179201
Vol. 11, Issue 10, 1736-1745, October 2001
METHODS
ABSTRACT
A large-scale BAC end-sequencing project at The Institute for
Genomic Research (TIGR) has generated one of the most extensive sets of
sequence markers for the mouse genome to date. With a sequencing
success rate of >80%, an average read length of 485 bp, and ABI3700
capillary sequencers, we have generated 449,234 nonredundant mouse BAC
end sequences (mBESs) with 218 Mb total from 257,318 clones from
libraries RPCI-23 and RPCI-24, representing 15× clone coverage, 7%
sequence coverage, and a marker every 7 kb across the genome. A total
of 191,916 BACs have sequences from both ends providing 12× genome
coverage. The average Q20 length is 406 bp and 84% of the bases have
phred quality scores ≥20. RPCI-24 mBESs have more Q20
bases and longer reads on average than RPCI-23 sequences. ABI3700
sequencers and the sample tracking system ensure that > 95% of mBESs
are associated with the right clone identifiers. We have found that a
significant fraction of mBESs contains L1 repeats and ~48% of the
clones have both ends with ≥100 bp contiguous unique Q20 bases.
About 3% of mBESs match ESTs and > 70% of matches were conserved
between the mouse and the human or the rat. Approximately 0.1% of mBESs
contain STSs. About 0.2% of mBESs match human finished sequences and
> 70% of these sequences have EST hits. The analyses indicate that our
high-quality mouse BAC end sequences will be a valuable resource to the community.
INTRODUCTION
Because of the high stability (Shizuya et al. 1992; Kim et al. 1996a,b), libraries constructed in bacterial artificial chromosome (BAC) vectors have become the standard clone sets in high-throughput genomic sequencing projects of organisms with large genomes. End sequences from BACs provide highly specific markers. A genome sequencing approach (Venter et al. 1996) has been described, in which a clone contig is extended by selecting the minimally overlapping clones in each direction by searching the finished BAC sequence against a BAC end sequence (BES) database. Because BACs (an average insert size of 150 kb) are sufficiently large to traverse most tandem arrays of homology units and repeats, BESs are useful in genome assembly and chromosome walking and have been used extensively to confirm, join, and order existing contigs (International Human Genome Sequencing Consortium 2001a). The whole-genome shotgun sequencing strategy relies on BESs as the primary scaffold onto which the end sequences from the smaller clones are assembled (Venter et al. 1998, 2001).
The mouse and the human share many fundamental biological processes. Consequently, the mouse has been used frequently in medical research and is the best model system for studying human disease. Additionally, the mouse genome sequence facilitates the accurate annotation of the human genome. As such, the National Institutes of Health (NIH) launched a mouse genome-sequencing project in October 1999 (http://www.nhgri.nih.gov/NEWS/MouseRelease.htm).
Compared with the human, significantly fewer large-scale mapping efforts have been conducted for the mouse and much less data are available to the community (Hudson et al. 1995; Dietrich et al. 1996; Schuler et al. 1996; McCarthy et al. 1997; Stewart et al. 1997; Deloukas et al. 1998; Van Etten et al. 1999; International Human Genome Mapping Consortium 2001a; Olivier et al. 2001). A large-scale BAC end-sequencing project generates an extensive set of random markers across the genome in an inexpensive and rapid fashion, and will be crucial to the success of the combined strategy of BAC-based sequencing and a moderate level of whole-genome shotgun sequencing that is being used for the mouse genome. The Institute for Genomic Research (TIGR) is the only center conducting large-scale BAC end sequencing for the mouse; the aim of the project is to generate accurate BES pairs from 170,000 RPCI-23 clones (Osoegawa et al. 2000) and 130,000 RPCI-24 clones to support the mouse genome sequencing project. The same set of clones has been fingerprinted at the Genome Sequencing Centre of the British Columbia Cancer Research Centre in Vancouver, Canada (http://www.bcgsc.bc.ca/projects/mouse_mapping/). We have approached the goal of the project and have generated ~450,000 sequences (http://www.tigr.org/tdb/bac_ends/mouse/bac_end_intro.html). To provide a better characterization of this valuable resource, we conducted comprehensive quality assessment and sequence analyses as described below.
RESULTS
Quantities
TIGR has been generating mBESs from the EcoRI-based library RPCI-23 (Osoegawa et al. 2000) and the MboI-based library RPCI-24 (Table 1). As of June 27, 2001, we have generated 449,234 nonredundant mBESs from 257,318 BACs, of which 274,277 were from 154,795 RPCI-23 clones and 174,957 were from 102,524 RPCI-24 clones. With a pair rate of 75%, a total of 191,916 clones (119,483 from RPCI-23 and 72,433 from RPCI-24) had both T7 and Sp6 ends, providing 11.6× genome coverage by paired-end clones, assuming an average BAC insert size of 197.5 kb for RPCI-23 (Osoegawa et al. 2000) and 155 kb for RPCI-24. The average edited read length was 485 bp, representing a total of 218 Mb or 7% of the mouse genome. The basic sequencing process consisted of template preparation, reaction, clean up, electrophoresis, and sequence trimming as described (Kelley et al. 1999; http://www.tigr.org/tdb/bac_ends/mouse/bac_end_intro.html), and ABI3700 capillary sequencers were used. Base calls were performed with phred (Ewing and Green 1998; Ewing et al. 1998) and the quality scores were further adjusted with Paracel TraceTuner. Sequence trimming was conducted with the program lucy (http://www.tigr.org/softlab/) with criteria of (1) a < 2.5% overall base-call error rate; (2) a read length of > 100 bp; and (3) no vector and Escherichia coli sequences. With this standard, ~81% of sequencing attempts yielded useful reads ranging from 101 to 984 bp. The success rate with the RPCI-23 library was slightly higher than with the RPCI-24 library (83% vs. 79%) and T7 ends of RPCI-23 clones yielded a higher success rate than Sp6 ends (84% vs. 81%). Statistics indicated that most of the sequencing failures were due to sequences whose quality did not meet the trimming standard (14% for RPCI-23 and 15.6% for RPCI-24), most likely because of inadequate BAC template purity, which can have many causes. Although these RPCI libraries had higher success rates than other libraries that we have sequenced with the same protocols, it would still be useful if the percentage of the wells in which more than one type of clone population dominates were reported for the libraries. An insufficient amount of template also resulted in sequencing failures (3% for RPCI-23 and 4% for RPCI-24), some of which were caused by empty wells (2.1% were reported for RPCI-23 at http://www.chori.org/bacpac/23framefmouse.htm and 3.35% were reported at http://www.chori.org/bacpac/mmouse24.htm). We have found < 1% vector sequence contamination and a negligible amount of E. coli sequence contamination for both libraries. Excluding low-quality, vector, and E. coli sequences, for a total of 272,573 BACs that were attempted from both ends, the entire process yielded 70% of clones having both ends (72% for RPCI-23 and 69% for RPCI-24), 20% having only one end (19% for RPCI-23 and 21% for RPCI-24), and 10% having no ends (9% for RPCI-23 and 10% for RPCI-24). Although the overall performance of RPCI-24 was not nearly as good as that of RPCI-23, the current success rate of this library has increased and is comparable with that of RPCI-23, with longer reads (see below).
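The coverage figures quoted in this section follow from simple arithmetic on the clone counts and insert sizes. The sketch below reproduces them under one loud assumption: a mouse genome size of 3,000 Mb, which is not stated in the text and is chosen here only because it is consistent with the reported 11.6× paired-end coverage. The function and variable names are illustrative.

```python
# Clone counts and mean insert sizes (kb) as reported in the text.
LIBRARIES = {
    "RPCI-23": {"clones": 154_795, "paired": 119_483, "insert_kb": 197.5},
    "RPCI-24": {"clones": 102_524, "paired": 72_433, "insert_kb": 155.0},
}

# ASSUMPTION: the genome size is not given in the text; 3,000 Mb is illustrative.
GENOME_KB = 3_000_000


def clone_coverage(key: str, genome_kb: float = GENOME_KB) -> float:
    """Fold coverage of the genome by the inserts of the clones counted under `key`."""
    total_kb = sum(lib[key] * lib["insert_kb"] for lib in LIBRARIES.values())
    return total_kb / genome_kb


paired_cov = clone_coverage("paired")  # clones with both ends sequenced
all_cov = clone_coverage("clones")     # clones with at least one end sequenced

total_bes = 449_234
marker_spacing_kb = GENOME_KB / total_bes  # mean distance between end sequences
sequence_fraction = 218_000 / GENOME_KB    # 218 Mb of edited sequence

print(f"paired-end clone coverage: {paired_cov:.1f}x")
print(f"all-clone coverage:        {all_cov:.1f}x")
print(f"marker spacing:            {marker_spacing_kb:.1f} kb")
print(f"sequence coverage:         {sequence_fraction:.1%}")
```

With a different assumed genome size, all four values scale accordingly, so the sketch is best read as a consistency check on the reported numbers rather than as an independent estimate.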
Quality
Q20 Length Distributions
The base-call program phred (Ewing and Green 1998) assigns a quality value (QV) to each base call, defined as QV = -10 × log10(P), where P is the estimated probability that the call is wrong. When the QV is ≥20, the error rate is ≤1% and the accuracy is ≥99%. This base is called a Q20 base or a high-quality base. It is a common practice to assess the sequence quality by the number of Q20 bases in each sequence (Q20 length). We therefore examined the phred QV of each base of mBES reads before and after trimming. For a total of 453,137 mBES traces (277,490 from RPCI-23 and 175,647 from RPCI-24) with a 388-Mb total, the Q20 length ranged from 10 to 910 bp with an average of 438 bp and a SD of 159 bp before trimming, and ranged from 3 to 839 bp with an average of 406 bp and a SD of 166 bp after trimming (Fig. 1A). The edited read length ranged from 100 to 940 bp with an average of 485 bp and a SD of 174 bp. RPCI-24 had a higher average Q20 length than RPCI-23, 461 versus 409 bp before trimming, and 438 versus 387 bp after trimming, and therefore had a longer average edited read length (515 bp versus 466 bp) (Table 1). The total bases after trimming were 219 Mb, of which 10% had phred QV < 15 and 84% had phred QV ≥20 (Fig. 1B). Again, RPCI-24 had a higher Q20 base fraction than RPCI-23, 86% versus 83%. Our results indicated that ~84% of the bases had base-call error rates of ≤1% in each sequence on average and in the overall dataset.
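The Q20 measures used here can be illustrated with a short sketch. It relies on the standard phred relationship between a quality value and the estimated error probability (QV = -10 log10 P); the helper names and the toy quality arrays are invented for illustration and are not part of the TIGR pipeline.

```python
import math


def error_probability(qv: float) -> float:
    """Convert a phred quality value to the estimated base-call error probability."""
    return 10 ** (-qv / 10)


def quality_value(p_error: float) -> float:
    """Convert an error probability to a phred quality value."""
    return -10 * math.log10(p_error)


def q20_length(quals: list[int], threshold: int = 20) -> int:
    """Number of bases in one read whose quality value meets the threshold."""
    return sum(1 for q in quals if q >= threshold)


def q20_fraction(reads: list[list[int]], threshold: int = 20) -> float:
    """Fraction of all bases in a set of reads that meet the threshold."""
    total = sum(len(r) for r in reads)
    good = sum(q20_length(r, threshold) for r in reads)
    return good / total if total else 0.0


# Toy per-base quality values, invented for illustration only.
reads = [[8, 15, 22, 30, 41, 35, 19], [25, 25, 12, 40]]
print([q20_length(r) for r in reads])
print(round(q20_fraction(reads), 3))
print(error_probability(20))
```

The Q20 length is a per-read count, whereas the Q20 base fraction pools bases over the whole dataset, which is why the text reports both.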
Repetitive DNA
Unique sequences are most useful for genome assembly and it is therefore desirable to know the repetitive DNA content in the mBES dataset. We analyzed 453,137 mBESs that consisted of 277,490 (129 Mb) RPCI-23 sequences and 175,647 (90 Mb) RPCI-24 sequences by RepeatMasker (http://ftp.genome.washington.edu/RM/RepeatMasker.html) (Smit 1996) and found that 64% of the sequences contained repeats and 37% of the bases were repeats. RPCI-24 mBESs had slightly more repeats than RPCI-23 mBESs (Table 1). The repeat sequences ranged from 20 to 925 bp with an average of 276 bp and a SD of 194 bp.
We examined the repeat composition of mBESs (Fig. 2) and found that a significant fraction were LINE1 (L1) repeats: 27.6% of the sequences and 21% of the bases for both libraries. The L1 repeats ranged from 11 to 925 bp with an average of 370 bp and accounted for 2%-100% (80% on average) of the total bases in L1-containing mBESs, making a significant portion of mBESs less useful. To find out whether this high percentage of L1 represented the true repeat composition of the mouse genome, we examined 258 Mb of mouse phase 1-3 genomic sequences and found that 30%-36% of the bases were repeats and 7%-15% were L1, depending on the sequencing phase (the lowest for phase 1 and the highest for phase 3, probably because of the sequence accuracy) and the GC content. This indicated that mBESs from both libraries contained more L1 repeats than sequences that were obtained by complete sequencing of discrete mouse genomic regions up to 300 kb. To find out whether this was related to the cloning sites (EcoRI for RPCI-23 and MboI for RPCI-24) of the BAC libraries, we studied the frequency of EcoRI and MboI sites in L1 and L1-free sequences separated from the mouse phase 1-3 sequences. EcoRI occurred once every 3 kb in L1 repeats and once every 4 kb in L1-free sequences, a 33% increase in L1. Similarly, MboI occurred once every 0.3 kb in L1 and once every 0.4 kb in L1-free sequences, also a 33% increase. Because mBESs were similar to the phase 3 sequences in GC content, the increased occurrence of these restriction sites in L1 repeats would mostly explain the ~33% more L1 in mBESs. To further test this hypothesis, we studied the repeat composition of 50,000 (24 Mb) plasmid ends from a 10-kb sheared DNA library submitted to dbGSS by the Utah genome center and found that the overall repeat content was similar to that of mBESs; however, L1 only accounted for 20% of the sequences and 15% of the bases.
Similar results were obtained for 50,000 trimmed mouse whole-genome shotgun reads that were randomly chosen from TraceArchive (http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?). These studies further supported the hypothesis that L1 contents increased in the end sequences of clones from EcoRI and MboI partial digest libraries.
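The restriction-site comparison can be sketched as follows. Site frequency is the reciprocal of the mean spacing, so a site found every 3 kb rather than every 4 kb is about 33% more frequent. The recognition sequences (GAATTC for EcoRI, GATC for MboI) are standard; the function names and the toy sequence are hypothetical. Because both sites are palindromic, counting on one strand is sufficient.

```python
ECORI = "GAATTC"  # EcoRI recognition sequence
MBOI = "GATC"     # MboI recognition sequence


def kb_per_site(seq: str, site: str) -> float:
    """Mean spacing in kb between occurrences of `site` in `seq`."""
    n = seq.upper().count(site)
    return float("inf") if n == 0 else len(seq) / 1000 / n


def enrichment(kb_per_site_a: float, kb_per_site_b: float) -> float:
    """Relative increase in site frequency in A versus B.

    Frequencies are reciprocals of the spacings, so the frequency ratio
    is spacing_b / spacing_a.
    """
    return kb_per_site_b / kb_per_site_a - 1


# Spacings reported in the text: L1 versus L1-free sequence.
print(f"EcoRI: {enrichment(3.0, 4.0):.0%} more frequent in L1")
print(f"MboI:  {enrichment(0.3, 0.4):.0%} more frequent in L1")

# Toy 40-bp sequence, invented for illustration: 2 EcoRI sites, 3 MboI sites.
toy = "AAGAATTCTTGATCAAGAATTCTTGATCAAAAGATCAAAA"
print(kb_per_site(toy, ECORI), kb_per_site(toy, MBOI))
```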
Unique Q20 Bases
Genome assembly requires high-quality unique sequences; we therefore studied the phred QVs of the unique sequences of mBESs. After 219 Mb of the same mBES dataset described above were repeat masked by RepeatMasker, we identified 124 Mb (57%) of unique sequences, which showed a phred QV composition similar to that of the entire dataset shown in Figure 1B, and a total of 105 Mb had QVs ≥20 (the unique Q20 bases). The RPCI-24 dataset had a ~1% lower unique-base fraction but a similar unique Q20 base fraction compared with the RPCI-23 dataset, and T7 ends had 2% lower unique-base and 1% lower unique Q20 base fractions than Sp6 ends, possibly because of their higher average Q20 length. Clones with paired ends (two-end clones) were more useful than clones with only one end. Therefore, we examined the distribution of various types of bases for two-end clones (119,483 from RPCI-23 and 72,433 from RPCI-24) (Fig. 3). By use of 100 bp as the cutoff length, all clones had both ends above 100 bp; 71% of the clones had at least one end, and 54% had both ends, with ≥100 bp of contiguous unique sequence; 66% had at least one end, and 48% had both ends, with ≥100 bp of contiguous unique Q20 bases. The last fraction (48%) of the clones was most useful in genome assembly. Similar distributions were observed for both libraries, but RPCI-24 had on average ~20 bp more unique Q20 bases than RPCI-23.
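The "contiguous unique Q20 bases" criterion can be expressed as the longest run of bases that are both unmasked and of quality ≥20. The following is a minimal sketch, assuming that repeat-masked positions are replaced by N or X in the masked sequence; that masking convention, the function names, and the toy data are assumptions made for illustration.

```python
MASK_CHARS = {"N", "X"}  # ASSUMPTION: repeats were hard-masked with N or X


def longest_unique_q20_run(masked_seq: str, quals: list[int], min_qv: int = 20) -> int:
    """Length of the longest run of unmasked bases with QV >= min_qv."""
    if len(masked_seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    best = run = 0
    for base, q in zip(masked_seq.upper(), quals):
        if base not in MASK_CHARS and q >= min_qv:
            run += 1
            best = max(best, run)
        else:
            run = 0  # a masked or low-quality base breaks the run
    return best


def classify_clone(end_runs: tuple[int, int], cutoff: int = 100) -> str:
    """Label a two-end clone by how many of its ends reach the cutoff."""
    n = sum(r >= cutoff for r in end_runs)
    return ("neither end", "one end", "both ends")[n]


# Toy read: 120 unique bases, a 20-bp masked repeat, then 40 unique bases,
# the last 5 of which are low quality. Invented for illustration only.
seq = "ACGT" * 30 + "N" * 20 + "ACGT" * 10
quals = [30] * (len(seq) - 5) + [10] * 5
print(longest_unique_q20_run(seq, quals))
print(classify_clone((longest_unique_q20_run(seq, quals), 35)))
```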
We examined the phred QV composition at each base position
of mBESs because this information is useful for applications such as
primer design. Our analyses indicated two high-quality regions with Q20
base fractions above 90%, bases 63-151 and 208-367 for RPCI-23, and
54-162 and 201-430 for RPCI-24 (Fig. 4).
Lower quality regions included the first 30-40 bases and those after base 450 (RPCI-23) or 530 (RPCI-24), as well as a middle region (bases
168-184) that was possibly caused by a dye blob resulting from
increased reaction cycles required to sequence BAC ends. RPCI-24 reads
had higher quality than those of RPCI-23 at most positions, especially
toward the 3' end. For the same reasons discussed above, it is useful
to associate the sequence quality with the repeat information. We
therefore studied the repeat content at each base position and have
observed similar profiles for both ends and for both libraries, except
that repeat contents at most positions were higher in T7 ends than Sp6
ends, and for RPCI-24 than RPCI-23. The total repeat content increased
from 19% at base 1 to above 36% at base 40 (RPCI-23) or 35 (RPCI-24),
and remained above 36% until base 500 (RPCI-23) or 550 (RPCI-24). Unique
base profiles were the reverse of those of repeats and were in an
inverse relationship with those of Q20 bases (Fig. 4), indicating that more repeats can be identified from more accurate reads. Therefore, for
a large dataset of random genome-wide sequences, a higher repeat
content generally indicates a higher sequence quality. This was also
supported by the bigger difference between RPCI-23 and RPCI-24 in both
Q20 base and unique-base contents toward the 3' end (Fig. 4). On the
basis of our analyses, for studies such as primer design for RH mapping
(Olivier et al. 2001) using mBESs, unique sequences from bases 63-151
and 208-367 for RPCI-23, and 54-162 and 201-430 for RPCI-24, were the
best because of the high quality.
Self-Comparison
To study the randomness of the mBES dataset, we searched each of 274,285 repeat-masked RPCI-23 mBESs against the database containing the same sequences and selected matches having identity ≥90%, length ≥100 bp, and unmatched bases at either 5' or 3' end (overhang) < 50 bp. We identified 19% of mBESs matching other mBESs with an average identity of 96.6% and an average length of 283 bp. Among these mBESs, > 90% hit < 5 other sequences and under 1% hit > 50 other sequences. We did the same analyses with 171,858 mBESs from RPCI-24 and found that 9% had hits with similar parameters. Because the average sequence identity was only 96.6%, the higher self-hit rate obtained for RPCI-23 was likely due to more false positives caused by repeat matches and might indicate that RPCI-23 mBESs contained more repeats that have yet to be identified. Only a small fraction of mBESs had such matches, indicating that mBESs were fairly randomly distributed on the genome.
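The match-selection criteria used in the self-comparison can be written as a simple filter. The sketch below reads the identity and length cutoffs as lower bounds (≥90% and ≥100 bp) and requires the overhang at both the 5' and the 3' end of the query to be under 50 bp; that reading of the overhang rule, the `Hit` record, and its field names are assumptions, not the format of any particular search program.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query_len: int   # length of the query mBES in bp
    q_start: int     # 1-based start of the alignment on the query
    q_end: int       # 1-based end of the alignment on the query
    identity: float  # percent identity over the aligned region


def passes(hit: Hit, min_identity: float = 90.0, min_length: int = 100,
           max_overhang: int = 50) -> bool:
    """Apply the identity, length, and overhang thresholds to one hit."""
    aligned = hit.q_end - hit.q_start + 1
    overhang_5 = hit.q_start - 1            # unmatched bases before the alignment
    overhang_3 = hit.query_len - hit.q_end  # unmatched bases after the alignment
    return (
        hit.identity >= min_identity
        and aligned >= min_length
        and overhang_5 < max_overhang
        and overhang_3 < max_overhang
    )


# Toy hits, invented for illustration only.
hits = [
    Hit(480, 10, 460, 96.6),   # long, near full-length match
    Hit(480, 120, 400, 99.0),  # large 5' overhang
    Hit(480, 1, 90, 100.0),    # too short
    Hit(480, 5, 470, 85.0),    # identity too low
]
print([passes(h) for h in hits])
```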
Comparison with Finished Mouse Sequences
Paired-End Coverage
To study the effective coverage of the genome by paired-end clones, we matched a total of 191,916 mBES pairs (119,483 from RPCI-23 and 72,433 from RPCI-24) to two mouse contigs: the 1.5-Mb NT_002588 of chromosome 17 (C17) and the 0.6-Mb NT_026540 of chromosome 5 (C5). We chose these contigs because they were large and seemed to be more representative of the genome than other contigs, as their overall repeat content was ~30%, closer to that of the whole-genome shotgun reads (34%). Performing searches as described in the legend to Figure 5, we identified 69 pairs with 7.9× coverage and one 14-kb gap on C17 (Fig. 5) and 26 pairs with 7.9× coverage and one 4-kb gap on C5. The average insert size of matched pairs on the two contigs was calculated to be 193 kb with a SD of 29.7 kb for RPCI-23 and 162 kb with a SD of 36.7 kb for RPCI-24, similar to the reported values (http://www.chori.org/bacpac/23framefmouse.htm and http://www.chori.org/bacpac/mmouse24.htm; Osoegawa et al. 2000).
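Once both ends of a clone are placed on a contig, the clone spans the interval between them, and the paired-end coverage and any gaps follow from those intervals. A minimal sketch is given below; it assumes 0-based, half-open (start, end) spans that lie entirely within the contig, and the function names and toy coordinates are hypothetical.

```python
def mean_coverage(spans: list[tuple[int, int]], contig_len: int) -> float:
    """Mean fold coverage of a contig by clone inserts given as (start, end)."""
    return sum(end - start for start, end in spans) / contig_len


def find_gaps(spans: list[tuple[int, int]], contig_len: int) -> list[tuple[int, int]]:
    """Regions of the contig that are not spanned by any clone insert."""
    gaps, cursor = [], 0
    for start, end in sorted(spans):
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < contig_len:
        gaps.append((cursor, contig_len))
    return gaps


# Toy spans in kb on a 1,000-kb contig, invented for illustration only.
spans = [(0, 200), (150, 340), (360, 560), (500, 1000)]
print(mean_coverage(spans, 1000))
print(find_gaps(spans, 1000))
```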
Sequence Identity
Together with Q20 length, mBES sequence accuracy can also be assessed by examining their identities to finished sequences. We therefore studied the identities of mBESs to the mouse contigs and found an average of 99.4% for RPCI-24 and 99.2% for RPCI-23. RPCI-24 mBESs were more similar to finished sequences than RPCI-23 mBESs (Fig. 6), consistent with the Q20-length results. As a comparison, Figure 6 also shows the identities of TIGR human BES pairs from library RPCI-11 to finished human chromosome 21 sequences; the distribution was almost identical to that of RPCI-24 mBESs. The high sequence identities to finished sequences showed that mBESs from both RPCI-23 and RPCI-24 were sufficiently accurate for the applications discussed here.
Sequence Coverage
Upon examining the match length of mBESs to the mouse contigs C5 and C17, we found a 482-bp average or 2.2% sequence coverage for RPCI-23, and a 475-bp average or 2.1% coverage for RPCI-24. These were lower than the expected 3.7% coverage for RPCI-23 and similar to the expected 2.5% for RPCI-24, possibly because repeat masking obscured more of the real matches for RPCI-23 and RPCI-24 was more random.
Clone Tracking
Along with a high sequencing success rate, a good clone-tracking
system is required for large-scale BAC end sequencing. The process
involves hundreds of thousands of clones and mistakes can occur at any
step, such as library replication, clone picking, cell growth,
template preparation, cycling reaction, electrophoresis, and database
loading. TIGR is one of the two centers that conducted large-scale
human BAC end sequencing and our tracking analyses showed that > 90%
of the human BESs were associated with the right clones (Zhao et al. 2000).
The analyses also indicated that > 90% of tracking errors
originated from lane mistracking with the ABI377 sequencers (the other
10% were caused by errors such as plate mishandling, primer
mislabeling, etc.). We switched to the ABI3700 machines for the mouse
project. Taking advantage of these capillary sequencers, we have
incorporated into the process a complete tracking check that, for each
96-well plate, involved sequencing two ends of a clone directly picked
from the 384-well master plate and matching the sequences to the
corresponding BESs obtained from the large-scale process. More
automation has also been introduced to the process. To find out how
much improvement has been achieved, we conducted a comprehensive
tracking analysis involving both external and internal data, which not
only evaluated the effectiveness of the modified process and helped to
identify places in which further improvement is required, but also
provided BES users information on clone fidelity in the dataset.
Comparing BES to the External Data
Overlapping BESs Between Laboratories
One way to assess the clone-tracking accuracy is to examine the discrepancy of BESs from a statistically significant number of clones that were independently sequenced by TIGR and another center. This analysis was reported for the human BAC end-sequencing project (Zhao et al. 2000).
Matching to BAC Sequences of the Same Clones
Another possible method for tracking assessment is to match the BESs to the BAC sequences of the same clones whose entire inserts have been sequenced as a result of the Human Genome Project (BAC Resource Consortium 2001).
Matching to Large Contigs
Another way to assess clone tracking is to compare BES pairs with a large contig and select those having both ends matching the contig with correct orientation and reasonable insert sizes. Excluding false positives that arise from repeat matches and false negatives that are caused by low-quality sequences, a higher fraction of pairs with such matches generally indicates more accurate clone tracking. We compared ~100,000 mBES paired-end clones to the mouse chromosome 17 contig (1.5 Mb) and found that 85% of the identified clones had two ends matching the contig. The same analyses were conducted by matching ~100,000 human BES pairs to chromosome 21 (Hattori et al. 2000).
Internal Sequence Comparisons
In the course of the project, a small fraction of clones were additionally end sequenced by processes that were independent of the large-scale operation and included the following: (1) each corner clone of 384-well plates was sequenced twice to ensure that the re-arraying from 384-well plates to 96-well plates was carried out accurately before clones entered the large-scale pipeline (re-array check sequencing); (2) for each 96-well plate, a clone picked directly from the 384-well master plate was end sequenced and matched to the corresponding large-scale BESs to verify the accuracy of the entire operation (final tracking check sequencing); (3) a number of clones were resequenced for reasons such as a BES user's request, suspicious tracking, an unusually low sequencing success rate, or testing of new protocols (priority clone sequencing). The results of comparing BESs from these processes to the corresponding large-scale BESs follow. Of a total of 1033 two-end clones, 93% had both-ends-match and 5% had one-end-match; of 2321 total one-end clones, 95% matched. Overall, 95% of BESs matched. The same analysis was conducted for the human RPCI-11 clones for comparison purposes: of 1092 total two-end clones, 88% had both-ends-match and 9% had one-end-match; of 4711 total one-end clones, 90% matched; and 91% of BESs matched overall.
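The overall per-sequence match rates quoted above can be recovered from the per-clone figures by pooling ends: a two-end clone contributes two BESs, of which two or one may match, and a one-end clone contributes one. The sketch below is a consistency check on the rounded percentages in the text; the function and argument names are illustrative.

```python
def overall_match_rate(two_end_clones: int, both_match: float, one_match: float,
                       one_end_clones: int, single_match: float) -> float:
    """Fraction of BESs that match, pooled over two-end and one-end clones.

    both_match / one_match: fractions of two-end clones with 2 or 1 matching ends.
    single_match: fraction of one-end clones whose single end matches.
    """
    matched = (two_end_clones * (2 * both_match + one_match)
               + one_end_clones * single_match)
    total = 2 * two_end_clones + one_end_clones
    return matched / total


mouse = overall_match_rate(1033, 0.93, 0.05, 2321, 0.95)
human = overall_match_rate(1092, 0.88, 0.09, 4711, 0.90)
print(f"mouse: {mouse:.1%}, human RPCI-11: {human:.1%}")
```

Because the inputs are rounded percentages, the pooled values agree with the reported 95% and 91% only to within rounding.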
Another set of data to assess the tracking came from sequencing clones that were provided to us by a collaborator. The clones were from a different copy of the same library (RPCI-23) and therefore provided a good source to evaluate the entire process from library replicating to sequence generation. We compared these mBESs with the corresponding large-scale sequences and obtained the following results: for 418 two-end clones, 91% had both-ends-match and 5% had one-end-match; for 216 total one-end clones, 89% matched, and the overall matching rate was 92% of mBESs. The rates here were somewhat lower than the values obtained above, possibly because these clones have been through more growth and selection cycles that potentially could introduce more problems such as human errors.
The clone tracking for our large-scale process should be more accurate than the apparent sequence match rate of 95% obtained here. This is because, unlike the large-scale operation in which clones were picked by a robot, the clones in the processes described above were generally picked by hand and more human errors could be introduced. Besides the match rates, it would be useful to know the percentage of cases in which clear mistracking happened. However, such analyses were more complicated. The phred quality scores of the external sequences used in the analyses were not available and therefore we could not distinguish false negatives due to low-quality data. For internal sequence comparisons, we found only very few cases (< 10) in which mismatches were clearly due to mishandling of the large-scale process (clone mispicking, plates misaligned, primer mislabeled, etc.). More studies need to be done to find out other reasons such as more than one type of BAC population in a well.
Sequence Analyses
EST Contents
A total of 453,137 repeat-masked mBESs (277,490 RPCI-23 and 175,647 RPCI-24) were searched against human, mouse, and rat EST databases at TIGR (Adams et al. 1991), and matches with identity ≥95% and score ≥300 were selected. The results indicated that a larger fraction of mBESs matched mouse ESTs, with a higher identity and a longer match length on average: 3% of mBESs matched mouse ESTs with a 98% identity and a 200-bp length, whereas 0.1% of mBESs matched human ESTs with a 97% identity and a 182-bp length, and 0.2% of mBESs matched rat ESTs with a 97% identity and a 167-bp length. A higher percentage of RPCI-24 mBESs was found to match ESTs (Table 1). Matched ESTs were involved in signal transduction, cell defense, gene expression, structure, metabolism, and other functions. A significant fraction of the matches were found to be conserved between the species: 71% between the human and the mouse, 70% between the mouse and the rat, 36% between the human and the rat, and 27% among the three species. Most of the conserved matches were not classified (80%), whereas others included heat shock proteins, transcription factors, and ribosomal proteins. We did the same analyses with the Unigene database and obtained similar results.
STS Contents
By running e-PCR (Schuler 1998
Comparison with Human Finished Sequences
tBLASTX compares query nucleotide sequences with a nucleotide database on the protein level by six-frame translation. Using tBLASTX, we compared 197,099 repeat-masked RPCI-23 mBESs with 2355 human contigs (544 Mb total), ranging from 1.8 kb to 34 Mb, from GenBank and selected matches with identity ≥90% and match length ≥50 bp. We found 922 matches by 350 mBESs with a 95% identity and a 269-bp length on average. The match frequency was one every 590 kb overall and one every 1-1.5 Mb on finished chromosomes 20, 21, and 22. The higher frequency on other chromosomes might be due to non-randomness of the finished sequences or the selection of gene-dense regions for sequencing. We found that 71% of the identified mBESs had hits to ESTs of the mouse (67%), human (37%), or rat (33%). The majority of these ESTs were not categorized, and those characterized included ribosomal proteins, ubiquitin, and other abundant proteins. We currently are repeating the analyses to map all mBESs from libraries RPCI-23 and RPCI-24 to the assembled human genomes, both the public version (GoldenPath, NCBI) and the private version (www.celera.com).
One concern with tBLASTX was speed; it took months to finish the analyses. We therefore looked for alternatives such as BLASTN. With appropriate parameters and matrix (http://sapiens.wustl.edu/~ikorf/mmhs/index.html), BLASTN can achieve a high sensitivity with tremendous gains in speed, as it compares sequences on the nucleotide level. We compared matches identified by tBLASTX using BLASTN and found that all had nucleotide identities above 76% and a majority (95%) were above 80% with an average of 90%. The lower identity on the nucleotide level was due to third-base wobble. An effective approach to place mBESs onto the human genome might be a two-step comparison: searching the entire database first with liberal criteria by BLASTN and then using tBLASTX to search the smaller dataset of potential candidates identified by BLASTN.
DISCUSSION
The goal of a large-scale BAC end-sequencing project is to generate dense and accurate end-sequence pairs that are randomly distributed across the genome. With this goal in mind, we have been end sequencing mouse BACs on a large scale to support the mouse genome project and have generated ~450,000 mBESs from ~260,000 RPCI-23 and RPCI-24 BACs with 75% of the clones having paired ends. We have improved the protocol used for sequencing our human BAC ends and increased the success rate to > 80%. As a result, for the 273,000 clones that were attempted at both ends, 70% have two ends, 20% have one end, and 10% have no ends. The sequencing is performed with ABI3700 capillary sequencers that have greatly improved the sample tracking accuracy, and > 95% of the sequences are from the right clones (below). With an average read length of 485 bp, the sequences are slightly longer than our human BAC ends (Zhao et al. 2000) and add up to 219 Mb or 7% of the genome. With a 406-bp average Q20 length and 84% of the bases having phred QV ≥20, mBESs match mouse finished sequences with an average identity of > 99%. The project therefore generates accurate genome-wide sequence pairs and provides the dense markers supporting the mouse genome project.
We have end sequenced the entire RPCI-23 library (170,000 clones) and are approaching our goal of 130,000 clones for the RPCI-24 library. Although the current success rate with both libraries is similar, both the average Q20 length and the average read length are longer with RPCI-24. Consequently, a higher percentage of RPCI-24 mBESs hit the repeat database, mouse finished sequences, ESTs, and STSs with a longer length and a higher identity on average. Although mBESs from both libraries seem to be fairly randomly distributed on the mouse genome based on the analyses, RPCI-24 sequences seem to be more random and unique because the observed pair and sequence coverage are closer to the expected values, and fewer sequences have hits in the mBES database itself. Therefore, it seems to be more useful to end sequence more clones from BAC libraries that were made with the 4-base cutter MboI.
Repeats present problems in genome assembly and we have found that 65% of mBESs and 36% of the bases contain known genome-wide repeats. The most useful end-sequenced clones are those having both ends with ≥100 bp of unique Q20 bases, which account for 48% of the clones in the database. Compared with the mouse phase 1-3 genomic sequences, as well as the end sequences from a sheared DNA plasmid library and the mouse whole-genome shotgun reads, mBESs contain significantly more L1 repeats as a result of the more frequent occurrence of EcoRI and MboI sites in this type of repeat. We believe that end sequencing multiple BAC libraries with different cloning sites and with sheared DNA will make the resource more useful, as the sequences will better represent the genome.
The success of large-scale BAC end sequencing requires not only a high sequencing success rate but also accurate clone tracking. The linkage between the sequence data and the clones is critical because the data are only as useful as the corresponding clone is retrievable. BAC end sequencing involves hundreds of thousands of clones and mistakes can happen at many stages of the process; a good tracking system is therefore required. Our sample tracking involves both automation and a laboratory information management system (LIMS) that is based around a set of databases implemented in Sybase and uses bar codes at several stages of the sequencing process. Prior to the mouse project, TIGR had generated 300,000 BESs from 180,000 human clones and the evaluation indicated that > 90% of human BESs are associated with the right clone identifiers (Zhao et al. 2000). The analyses revealed several sources of tracking error; however, > 90% of the errors were found to originate from lane mistracking with ABI377 sequencers. We have switched to the ABI3700 machines and, taking advantage of the capillary sequencers, introduced a complete control step into the process: one clone's correct tracking ensures the correct tracking of the entire plate of clones. In addition, more automation has been introduced into the process, which has potentially reduced human errors. All tracking analyses with both the external and internal data indicate a more accurate performance for the mouse. We are confident that at least 95% of mouse ends are associated with the right clone identifiers.
To better characterize this valuable resource, we compared mBESs with
the finished human sequences at the protein level by tBLASTX
and found that ~70% of the mBESs identified have EST hits,
indicating that the majority of the conserved regions are transcribed.
Gene densities vary with chromosomes (Crollius et al. 2000; Ewing and
Green 2000; Liang et al. 2000): chromosomes 17, 19, and 22 are gene
rich, whereas chromosomes 4, 18, 21, and X are gene poor. Our match
frequency also varies with chromosomes and is ~1 every 1-1.5 Mb on
chromosomes 21 and 22. The mBESs dataset used in the study accounts for
3% of the genome, which would indicate 1 hit per 30-45 kb of human
sequence by mBESs with 100% sequence coverage. This lends some support
to the estimate of 1 gene per 85 kb for the human genome (Ewing and
Green 2000). We found
that 3%, 0.1%, and 0.2% of mBESs match mouse, human, and rat ESTs,
respectively, and a majority of the matches are conserved between these
species, further supporting the hypothesis that transcribed sequences
are more conserved. Approximately 0.1% of mBESs contain STS markers,
which allowed chromosome assignment of > 900 BACs. Our analyses indicate that the mouse BAC ends resource will be even more useful than the
human BAC ends resource for many research fields.
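The extrapolation from the observed match frequency to full sequence coverage is simple arithmetic, sketched below in Python. The sketch assumes that the number of hits scales linearly with sequence coverage, which is the assumption implicit in the estimate above; the function name is hypothetical.

```python
def spacing_at_full_coverage_kb(observed_spacing_mb, sequence_coverage):
    """Scale the observed spacing between hits (in Mb) to 100% coverage (in kb).

    With a fraction `sequence_coverage` of the genome sampled, hits are
    expected to be 1/sequence_coverage times denser at full coverage.
    """
    return observed_spacing_mb * 1000.0 * sequence_coverage


# Observed: ~1 hit every 1-1.5 Mb with a dataset covering 3% of the genome.
low_kb = spacing_at_full_coverage_kb(1.0, 0.03)
high_kb = spacing_at_full_coverage_kb(1.5, 0.03)
print(f"about 1 hit per {low_kb:.0f}-{high_kb:.0f} kb at full coverage")
```

The two bounds correspond to the 30-45 kb range quoted in the text.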
| |
METHODS |
|---|
BAC Libraries
Mouse BAC libraries RPCI-23 and RPCI-24 were purchased
from BACPAC RESOURCES at Children's Hospital Oakland Research
Institute (http://www.chori.org/bacpac/orderingframe.htm). RPCI-23
was made by cloning EcoRI/EcoRI methylase
partially digested female C57BL/6J DNA into the pBACe3.6 cloning vector
at the EcoRI site (Osoegawa et al. 2000). RPCI-24 was made by
cloning MboI partially digested male C57BL/6J DNA into the
pTARBAC1 cloning vector at the BamHI site.
BAC End Sequencing
BAC end sequencing and trimming were performed following the basic
procedure described previously (Kelley et al. 1999) with a few modifications.
The sequencing was conducted on ABI3700 capillary sequencers. The
process included BAC template preparation, cycling reactions,
electrophoresis, and sequence trimming. Detailed protocols can be found
at http://www.tigr.org/tdb/bac_ends/mouse/bac_end_intro.html. Briefly,
BAC template preparation was based on the 96-well format. BAC libraries
were received in 384-well plates and were arrayed to 96-well plates
with a robot (Flexys) before sequencing. BAC clones were inoculated
into one 96-deep well block containing 1.5 mL/well medium and grown
overnight in an oxygenated growth chamber (HI-GRO from GeneMachines)
at 37°C. The cells were harvested by centrifugation, and the BACs were
purified by use of a 96-well BAC purification kit from QIAGEN. The
cells were lysed by a standard alkaline lysis protocol, and the lysate
was passed over a QIAGEN Turbo filter plate using vacuum filtration. BAC DNA was
precipitated with isopropanol and washed with 70% ethanol. The DNA was
then resuspended in 35 µL of resuspension buffer (1 mM Tris at pH
8.0), and 10 µL was used per sequencing reaction once the quality of
the DNA had been checked on agarose gels. Sequencing reactions were performed
using 0.7-strength Big Dye terminator chemistry (ABI PN 4303154),
0.1-0.5 µg of template DNA, 4 pmoles of primer, 1× CSA sequencing
buffer (ABI PN 361028C), and MgCl2 added to 1.4 mM. The
cycling reactions were conducted with MJ Tetrad Thermal Cyclers under
the following conditions: 96°C for 2 min; 74 cycles of 96°C for 10 sec,
54°C for 10 sec, and 60°C for 4 min; hold at 4°C. The standard T7
primer and a custom-designed primer (CTGGCCGTCGACATTTAGG) at the SP6
end were used. The reaction mixture was then cleaned up by isopropanol
precipitation followed by a 70% ethanol wash. Electrophoresis of the
reaction mix was carried out with the ABI 3700 Automated DNA Sequencers
using POP5 polymer. For sequence trimming, the traces were processed
with the base-calling software phred (Ewing and Green 1998; Ewing et al. 1998),
and the quality scores were further adjusted
with TraceTuner from Paracel, which was specifically trained for our ABI3700 data with POP5. The sequences were then trimmed by
the locally written program lucy (http://www.tigr.org/softlab/) with the criteria of overall base-call error < 2.5%, reads > 100 bp, and free
of vector and E. coli sequences.
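The relation between phred quality scores and the trimming criteria can be sketched in Python as follows. This is an illustration only, not the algorithm implemented in lucy. It assumes the standard phred definition Q = -10 log10(p), takes the overall base-call error to be the mean per-base error probability, and leaves out vector and E. coli screening; the function names are hypothetical.

```python
def error_probability(q):
    """Convert a phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def summarize_read(quals):
    """Return read length, mean error probability, and number of Q20 bases."""
    length = len(quals)
    mean_error = sum(error_probability(q) for q in quals) / length
    q20_bases = sum(1 for q in quals if q >= 20)
    return {"length": length, "mean_error": mean_error, "q20_bases": q20_bases}


def passes_trim_criteria(quals, max_error=0.025, min_length=100):
    """Check overall error < 2.5% and read length > 100 bp (assumed reading)."""
    if not quals:
        return False
    summary = summarize_read(quals)
    return summary["length"] > min_length and summary["mean_error"] < max_error
```

For example, a 150-base read in which every base has Q20 has a mean error probability of about 1% and would pass both thresholds under these assumptions.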
Clone Tracking
The linkage between the sequences and the clones was tracked using a laboratory information management system (LIMS) that was based around a set of databases implemented in Sybase and used barcodes at several stages of the sequencing process. In addition, two more tracking control steps were built into the process. (1) For template preparation, a re-array check was conducted to ensure that 96-well plates were arrayed (by a robot) correctly from their 384-well master plates. This involved picking clones from the corners of 384-well plates and their corresponding 96-well plates, and sequencing to verify that the clones were identical. (2) A final complete clone tracking check was conducted to verify the accuracy of the entire process. For each 96-well plate, one clone was picked directly from the 384-well master plate, sequenced from both ends, and compared with its corresponding sequences from the large-scale process.
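The re-array check depends on knowing which 96-well position each 384-well clone should occupy. The Python sketch below shows one common convention, in which the four 96-well plates are interleaved across the 384-well plate. This layout is an assumption made for illustration; the mapping actually used by the arraying robot is not described here, and the function names are hypothetical.

```python
import string

ROWS_384 = string.ascii_uppercase[:16]  # rows A-P of a 384-well plate
ROWS_96 = string.ascii_uppercase[:8]    # rows A-H of a 96-well plate
CORNERS_384 = ("A1", "A24", "P1", "P24")


def well_384_to_96(well):
    """Map a 384-well position to (quadrant, 96-well position).

    Assumes interleaved quadrants: odd/even rows and columns of the
    384-well plate go to four separate 96-well plates numbered 1-4.
    """
    row = ROWS_384.index(well[0].upper())  # raises ValueError for a bad row
    col = int(well[1:]) - 1
    if not 0 <= col < 24:
        raise ValueError(f"column out of range in {well!r}")
    quadrant = 2 * (row % 2) + (col % 2) + 1
    return quadrant, f"{ROWS_96[row // 2]}{col // 2 + 1}"


def corner_checks():
    """Expected 96-well destinations of the four 384-well corner clones."""
    return {well: well_384_to_96(well) for well in CORNERS_384}
```

Sequencing the clones at these expected positions and comparing them with the corner clones of the master plate would reveal a mis-arrayed plate.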
Data Source
Sequences other than BAC ends used in the analyses were downloaded from GenBank. BAC end sequences can be searched by clone or by sequence at http://www.tigr.org/tdb/humgen/bac_end_search/bac_end_search.html, and the entire database can be downloaded by FTP from ftp://ftp.tigr.org/pub/data/m_musculus/bac_end_sequences/.
Sequence Searches
Sequence comparisons were performed using programs of the
BLAST series (Altschul et al. 1990; Altschul and Gish
1996), and the BLAST version used for all searches was
WU-BLAST2.0 from Washington University
(http://blast.wustl.edu/). BLAST outputs were parsed
into a tab-delimited format using the program btab
(http://www.tigr.org/softlab/). The sequences in a multiple FASTA file were
searched against a database one at a time using the program sx from NCBI. The searches were performed with a parallel virtual machine Linux cluster consisting of 14 nodes, each with a 450 MHz Pentium II, 512 MB RAM, and 18 GB storage space.
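Searching a multiple FASTA file one sequence at a time, and spreading the work over a cluster, can be sketched in Python as below. This is not how sx or the parallel virtual machine scheduled jobs in this study; the round-robin assignment and the function names are assumptions made for illustration.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from the lines of a multiple FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def assign_to_nodes(records, n_nodes=14):
    """Distribute records round-robin over n_nodes batches (one per node)."""
    batches = [[] for _ in range(n_nodes)]
    for index, record in enumerate(records):
        batches[index % n_nodes].append(record)
    return batches
```

Each batch could then be passed to a separate search process, with every record in the batch queried against the database individually.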
Sequence Repeats Masking
Sequences were repeat masked by either RepeatMasker (http://ftp.genome.washington.edu/RM/RepeatMasker.html) or its faster version MaskerAid (http://sapiens.wustl.edu/maskeraid/), which replaces crossmatch with BLASTN.
Sequence Analyses and Graphical Presentations
Perl and shell scripts were used extensively for the analyses. Graphical plots were generated using either a Perl module GD.pm, Microsoft Excel, or PowerPoint.
| |
ACKNOWLEDGMENTS |
|---|
We are grateful for the excellent sequencing work provided by all present and past members of the TIGR BAC Ends Team; to Michael Heaney, Michael Holmes, Susan Lo, Eddy Arnold, Mark Sengamalay, Billy Lee, and other informatics members at TIGR for their database support; to all the genome centers producing the human and mouse sequence data used in the analyses; and to Mark Adams, Warren Gish, Allan Bradley, Richard Gibbs, Maja Bucan, Kristi Berry, John Gill, Kazu Osoegawa, Pieter de Jong, Ken Dewar, Marvin Stodolsky, Adam Felsenfeld, Doug Smith, John Quackenbush, Bruce Roe, Arian Smit, and Jerzy Jurka for their critical comments and useful discussion. This work was supported by Grant U01-HG02137 from NIH to S.Z. and W.C.N.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked "advertisement" in accordance with 18 USC section 1734 solely to indicate this fact.
| |
FOOTNOTES |
|---|
1 Corresponding author.
E-MAIL szhao{at}tigr.org; FAX (301) 838-0208.
Article published on-line before print: Genome Res., 10.1101/gr.179201.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.179201.
| |
REFERENCES |
|---|
Received January 8, 2001; accepted in revised form July 25, 2001.