Published online before print
February 15, 2006, 10.1101/gr.4431306 Genome Res. 16:550-556, 2006 ©2006 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/06 $5.00 OPEN ACCESS ARTICLE
Resource
GeneDesign: Rapid, automated design of multikilobase synthetic genes
High Throughput Biology Center, The Johns Hopkins University School of Medicine, Baltimore, Maryland 21205, USA
Modern molecular biology has brought many new tools to the geneticist as well as an exponentially expanding database of genomes and new genes for study. Of particular use in the analysis of these genes is the synthetic gene, a nucleotide sequence designed to the specifications of the investigator. Typically, synthetic genes encode the same product as the gene of interest, but the synthetic nucleotide sequence for that protein may contain modifications affecting expression or base composition. Other desirable changes typically involve the revision of restriction sites. Designing synthetic genes by hand is a time-consuming and error-prone process that may involve several computer programs. We have developed a tools environment that combines many modules to provide a platform for rapid synthetic gene design for multikilobase sequences. We have used GeneDesign to successfully design a synthetic Ty1 element and a large variety of other synthetic sequences. GeneDesign has been implemented as a publicly accessible Web-based resource and can be found at http://slam.bs.jhmi.edu/gd.
The power and flexibility of gene synthesis is increasingly being recognized (Han and Boeke 2004
The theory of gene design when high expression levels are desired is relatively uncomplicated. First, the desired protein sequence should be reverse translated into a nucleotide sequence. This step allows codon usage to be optimized for the host organism to be used for expression, or changed completely to accommodate a variety of constraints (Fig. 1). While there are enormous numbers of possible synthetic sequences that can be made, and could in principle lead to increased expression, we have used the highly simplifying method of choosing the single most abundant codon specifying each amino acid in highly expressed genes for the host organism of choice. Codon optimization can be an important factor in establishing gene expression, although generally it is less significant than are promoter strength, position in the genome, etc. Second, the new nucleotide sequence may be analyzed for the strategic introduction and removal of restriction sites (Fig. 2). A useful strategy is to space sites evenly throughout the gene. Both introduction and removal of sites are done without altering the amino acid sequence. Finally, the sequence to be made should be minced into small oligonucleotides for assembly by PCR as described by Stemmer and others (Fig. 3; Stemmer et al. 1995
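The first step above, reverse translation with a single preferred codon per amino acid, can be sketched in a few lines. This is an illustration only and is not taken from the GeneDesign source: the function name `reverse_translate` and the table `PREFERRED_CODON` are hypothetical, and the codon choices are assumed placeholder values for a yeast-like host that should be replaced with a table derived from real usage data for highly expressed genes.

```python
# Illustrative table only: one assumed "preferred" codon per amino acid.
# These are placeholder values, not data from the paper; substitute a
# table built from highly expressed genes of the chosen host organism.
PREFERRED_CODON = {
    "A": "GCT", "R": "AGA", "N": "AAC", "D": "GAC", "C": "TGT",
    "Q": "CAA", "E": "GAA", "G": "GGT", "H": "CAC", "I": "ATC",
    "L": "TTG", "K": "AAG", "M": "ATG", "F": "TTC", "P": "CCA",
    "S": "TCT", "T": "ACC", "W": "TGG", "Y": "TAC", "V": "GTT",
    "*": "TAA",
}


def reverse_translate(protein: str) -> str:
    """Map every residue to its single preferred codon."""
    try:
        return "".join(PREFERRED_CODON[aa] for aa in protein.upper())
    except KeyError as err:
        raise ValueError(f"unknown residue: {err.args[0]}") from None


if __name__ == "__main__":
    print(reverse_translate("MKWL*"))  # ATGAAGTGGTTGTAA
```

Because every amino acid maps to exactly one codon, the output is deterministic and is always three times the length of the input, which is why a 500-amino-acid protein yields a 1500-bp sequence.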
While the above theory is indeed relatively uncomplicated, manual design is a complex, tedious, and error-prone process. In the past, researchers used many different programs to address the requirements of the separate steps of synthetic gene design. Alternatively, they sent off their requirements to a black box provided by a gene synthesis company and let it use its proprietary programs to design genes. Today there are two publicly accessible computer programs that perform synthesis-related oligonucleotide design: Gene2Oligo (Rouillard et al. 2004
In this article we describe a suite of Web-based programs that is able to perform all of the functions outlined above for gene design in a directed, step-wise manner. It accepts as input either amino acid or nucleotide sequences and allows users to move through the process of design in a series of modules that address practical issues surrounding cloning vector sequences, restriction site placement, and oligonucleotide design. Users can follow the main "design a gene" path or use the modules individually as needed. We have tested this program with the 5.2-kb gag-pol gene on the yeast retrotransposon Ty1 and a 0.6-kb nucleotide fragment of the human retrotransposon L1 that was difficult to break into oligos manually (Han and Boeke 2004
Workflow with GeneDesign
GeneDesign consists of six modules that may be used individually or in series to automate the tasks associated with the design and manipulation of synthetic sequences (Fig. 4). The modules are, in typical order of use: reverse translation, codon juggling, silent restriction site insertion, silent restriction site removal, oligo design, and sequence analysis. Although the modular design allows any number of permutations, we anticipate that most users will be interested in the "design a gene" path. For instance, an investigator with a 500-amino-acid human gene to be expressed in yeast for modular mutagenesis would use the "design a gene" path. She would begin with the reverse translation module, yielding a 1500-bp nucleotide sequence that is optimized for expression in yeast. She would then take the sequence to the silent site insertion module, where she can define the qualities of the landmark sites to be used for modular mutagenesis and select their locations. Finally, she would take the sequence to the oligonucleotide design module, which breaks the synthetic sequence into three 500-bp "chunks" (separated by unique restriction sites) and each of those three chunks into 12 overlapping 60mers for PCR assembly and amplification. Another researcher with a 600-bp nucleotide sequence from yeast that is to be cloned into bacteria would begin with the codon juggling module to optimize the nucleotide sequence for expression in Escherichia coli, and then take the sequence to silent site removal to knock out any internal restriction sites that conflict with his choice of cloning vector. Finally, he would use the oligonucleotide design module, which would leave him two 300-bp chunks containing a total of 16 60mers encoding his new synthetic gene.
The sequence analysis module is accessible from all of the other modules and is designed to provide useful information about the nascent synthetic sequence during the design process. A manual describing each module in detail is available online and as a PDF, and the user interface includes guidelines for use as well.
Reverse translation module
Codon juggling module
Silent restriction site insertion module
Because no changes are ever considered that alter the first-frame amino acid sequence in any way, the encoding of second- and third-frame ORFs is not preserved. GeneDesign will check that the sequence submitted for silent site insertion is a simple coding sequence in the first frame of translation; it is recommended (but not required) that landmark sites be inserted only into ORFs because the effects of inserting restriction sites into noncoding sequences are difficult to predict. GeneDesign consults a list containing every possible amino acid permutation that could be encoded by each frame of each restriction enzyme recognition sequence, searches the translated nucleotide sequence for these short amino acid sequences, and presents the results as a display of all possible silent site introductions. Sites that are defined as interesting by the user or are absent from the user's vector are presented in red, and all other sites are presented in black. At this point the user may go through the display and prepare a solution manually, or give the program an amino acid interval at which sites are desired and have the program select the enzyme sites automatically. Before entering automatic design mode, the user is given the opportunity to rank enzymes by overhang, recognition site length, recognition sequence, and price. Only enzymes that fit the provided criteria will be considered. By default the program will not consider enzymes that leave blunt ends or single base pair overhangs, as these are more difficult to ligate in the assembly of the synthetic gene.
In automatic design mode, GeneDesign breaks the nucleotide sequence into pieces according to the user-defined interval and then ranks each piece by the number of possible restriction site introductions. The pieces with the fewest possible introductions are processed first, and the highest-ranking enzyme present is chosen for a landmark. This enzyme is added to the list of used cutters so that any enzyme with an ambiguous site that could resolve to the same sequence will not be considered for the rest of the sequence. The program processes each piece this way, attempting to space consecutive landmark choices by at least half the interval length to avoid an unnecessary clustering of sites that would only remove more sites from consideration. The solution is presented to the user with the same graphic that was used to list all possible introductions, with the program's landmark selections presented in blue. The user can make changes to the program's choices or have the program re-evaluate the sequence completely.
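The lookup described above, in which each recognition sequence is converted into the set of short peptides that could encode it in each frame and the protein is then searched for those peptides, might look roughly like the following. It is a minimal sketch under stated assumptions rather than GeneDesign's implementation: the names `peptides_for_site` and `silent_site_candidates` are hypothetical, only the standard genetic code is used, and recognition sites are assumed to contain no ambiguous bases.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def peptides_for_site(site, offset):
    """Peptides that can encode `site` when it starts at codon position `offset` (0-2)."""
    # Pad with N so the site occupies whole codons in this frame.
    padded = "N" * offset + site
    padded += "N" * (-len(padded) % 3)
    per_codon = []
    for i in range(0, len(padded), 3):
        pattern = padded[i:i + 3]
        residues = {
            aa for codon, aa in CODON_TABLE.items()
            if all(p in ("N", c) for p, c in zip(pattern, codon))
        }
        residues.discard("*")  # a landmark site should not require a stop codon
        per_codon.append(sorted(residues))
    return {"".join(combo) for combo in product(*per_codon)}


def silent_site_candidates(protein, site):
    """List (residue index, frame offset, peptide) for every possible silent introduction."""
    hits = []
    for offset in range(3):
        for peptide in peptides_for_site(site, offset):
            start = protein.find(peptide)
            while start != -1:
                hits.append((start, offset, peptide))
                start = protein.find(peptide, start + 1)
    return sorted(hits)


if __name__ == "__main__":
    # EcoRI (GAATTC) in frame 0 reads GAA TTC, i.e. Glu-Phe.
    print(silent_site_candidates("MKEFLG", "GAATTC"))  # [(2, 0, 'EF')]
```

Searching at the protein level is what guarantees that every reported position can be realized without changing the first-frame amino acid sequence.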
After a solution is reached that is satisfactory to the user, the sites are processed for introduction. The nucleotide sequence is compared to the sequence needed to introduce each enzyme and changed accordingly, with care taken to preserve the amino acid sequence. Important practical considerations at this step are the length of the segment to be synthesized and the type of vector to be used. The larger the insert and vector sizes, the fewer sites will be available. Small vectors with very few sites have been described (Mandecki et al. 1990). The solution is summarized for the user in the final screen, where he or she has the opportunity to check the properties of the introduced enzymes. If undesirable enzymes have been included, the user is able to select those enzymes and begin the process over again with them automatically added to the list of banned sites.
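The introduction step, changing the nucleotide sequence to carry a chosen site while preserving the amino acid sequence, can be illustrated as follows. This is a hedged sketch, not the program's code: `introduce_site` and `translate` are hypothetical names, the site is assumed to be unambiguous, and picking the candidate with the fewest base changes is a design choice made here for illustration.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(cds):
    """Translate a coding sequence in the first frame."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds) - 2, 3))


def introduce_site(cds, site, nt_pos):
    """Return `cds` with `site` at `nt_pos` and an unchanged protein, or None if impossible."""
    first = nt_pos // 3
    last = (nt_pos + len(site) - 1) // 3
    if nt_pos < 0 or 3 * (last + 1) > len(cds):
        return None  # the site would not fit inside whole codons
    window = [cds[3 * i:3 * i + 3] for i in range(first, last + 1)]
    original = "".join(window)
    # All synonymous codons for each codon that the site touches.
    choices = [
        [c for c, aa in CODON_TABLE.items() if aa == CODON_TABLE[w]]
        for w in window
    ]
    rel = nt_pos - 3 * first
    best = None
    for combo in product(*choices):
        candidate = "".join(combo)
        if candidate[rel:rel + len(site)] == site:
            changes = sum(a != b for a, b in zip(candidate, original))
            if best is None or changes < best[0]:
                best = (changes, candidate)
    if best is None:
        return None
    return cds[:3 * first] + best[1] + cds[3 * (last + 1):]


if __name__ == "__main__":
    before = "ATGAAAGAGTTTCTG"  # M K E F L
    after = introduce_site(before, "GAATTC", 6)
    print(after, translate(before) == translate(after))
```

Enumerating synonymous codons for the whole window, instead of overwriting only the bases under the site, also covers cases in which a flanking base inside a partially covered codon has to change to keep the residue the same.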
Silent restriction site removal module
Oligonucleotide design module
Within each of these chunks, the oligos are then designed. The user defines an oligo length and a target annealing temperature for the oligo overlaps. The defaults are 60-bp oligos with 56°C overlaps, which works well for us with yeast and mammalian sequences that are
GeneDesign uses the formula (chunk length − overlap length)/(oligo length − overlap length) to determine the number of oligos of the requested size that will actually fit in the chunk. Chunk length is always 500, and overlap length is always 20. Only a few oligo lengths are suggested to the user because only a few lengths will, in this formula, result in an even number of oligos (Table 1). Oligo lengths of
If oligos 60 bp are requested, GeneDesign first breaks the chunk into an even number of oligos of the requested length with 20-bp overlaps. After adjusting every oligo in length to evenly make up the difference between 500-bp and the actual chunk length (thus ensuring that no oligo is a grossly different length), it analyzes the average melting temperature of the overlaps and adjusts the target melting temperature for that chunk. This on-the-fly adjustment allows every chunk to have an internally consistent Tm for assembly and prevents the program from stalling because of an impossible design requirement. Once the target Tm for the chunk has been determined, the oligo lengths are adjusted so that the Tms of each overlap are consistent with the target.
The Tms of the oligo overlaps are calculated as an average of three formulas: two salt-adjusted equations (Baldino Jr. et al. 1989 Every oligo is displayed for the user's approval, and when ready, the user can export them as a tab-delimited text file for ordering.
Sequence analysis module
Program output
Oligo design and A+T content
Oligo assembly and amplification
To evaluate the robustness of the PCR assembly technique, we asked to what degree fluctuations in annealing temperature affect the efficiency and accuracy of gene synthesis. To do this, we chose three 600-bp chunks of synthetic human L1 retrotransposon (58% GC) and eight 500-bp chunks of synthetic yeast Ty1 retrotransposon (44% GC). The L1 chunks were designed to have mean annealing temperatures of 50°C, 53°C, or 56°C, and the Ty1 chunks were designed to have an annealing temperature of 56°C. Each chunk was assembled and amplified across a 20°C gradient of annealing temperatures centered on the mean annealing temperature, and we were able to obtain a band of the proper size in every case (Fig. 6). Thus we conclude that the process is remarkably robust and that small variations in PCR machines or oligonucleotide melting temperatures are unlikely to create problems. We were able to transform plasmids containing the amplified DNA into competent cells and obtain DNA sequences from both high and low temperature endpoints.
Rate of mutation
In GeneDesign's default oligonucleotide design strategy, oligos are overlapped, leaving 125 base gaps throughout each 500-bp chunk. Because annealing in double-stranded regions is expected to reduce mutation frequencies by selecting against incorrectly base-paired molecules (i.e., molecules containing incorrect bases), we performed an experiment to evaluate the mutation frequencies in the single- and double-stranded segments of a chunk. We determined the mutation rate of synthetic sequence from double-stranded and single-stranded oligo coverage by aligning each sequenced clone with the set of oligos from which it was assembled and locating the mutations. We found that the number of mutations per kilobase was, as expected, higher in single-stranded than in double-stranded regions. However, the mutation frequency in single-stranded regions was only 44% higher than that in double-stranded regions on average. The total number of mutations per kilobase was <5 using our conditions and this particular preparation of oligonucleotides (Table 2). In practice, we sequenced 24 clones per 500-bp chunk; on average, four of these were perfect (no substitutions), and nearly all chunks yielded at least one perfect clone.
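The comparison above amounts to classifying each mutation by whether it falls in a region covered by one oligo or by two, and then normalizing by the number of bases of each kind. The sketch below shows that bookkeeping only; it is not the analysis code used in the study. The names `coverage` and `rates_per_kb` are hypothetical, the geometry assumes 60-base oligos with 20-bp overlaps as in the default design discussed earlier, and the mutation positions in the demonstration are made up purely to exercise the functions.

```python
def coverage(chunk_len, oligo_len=60, overlap=20):
    """Number of oligos covering each position of a chunk (1 = gap, 2 = overlap)."""
    step = oligo_len - overlap
    cov = [0] * chunk_len
    for start in range(0, chunk_len - overlap, step):
        for pos in range(start, min(start + oligo_len, chunk_len)):
            cov[pos] += 1
    return cov


def rates_per_kb(mutation_positions, cov, n_clones):
    """Mutations per kilobase in single- and double-covered regions.

    `mutation_positions` pools the chunk coordinates of every mutation
    found across all `n_clones` sequenced clones.
    """
    ss_bases = sum(1 for c in cov if c == 1) * n_clones
    ds_bases = sum(1 for c in cov if c >= 2) * n_clones
    ss_mut = sum(1 for p in mutation_positions if cov[p] == 1)
    ds_mut = len(mutation_positions) - ss_mut
    return 1000 * ss_mut / ss_bases, 1000 * ds_mut / ds_bases


if __name__ == "__main__":
    cov = coverage(500)
    # Hypothetical mutation positions, NOT data from the paper.
    ss_rate, ds_rate = rates_per_kb([5, 45, 70], cov, n_clones=1)
    print(round(ss_rate, 2), round(ds_rate, 2))
```

Normalizing by bases sequenced in each class matters because, under this assumed geometry, the single- and double-covered regions of a chunk are not the same total length.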
Not all combinations of oligo length, annealing temperature, and base composition are possible in gapped oligo design. In gapless design, conflicts of annealing temperature and oligo length do not usually arise because temperature optimization is not carried out by oligo length adjustment. In order to ensure that GeneDesign could still perform well on sequences with unusually high (or low) A+T content, the oligonucleotide design algorithm is designed to sample A+T content and readjust the design parameters accordingly. Especially when designing oligos for larger genes, this allows the program to come as close as possible to the annealing temperature the user requested and still find an oligo design solution, no matter the base composition of the sequence. Individual 500-bp pieces of the gene can have their constituent oligo annealing temperature adjusted specifically to their A+T content.
In terms of fidelity, we have noted only a marginally significant benefit to gapless oligo design. As has been noted before (Hoover and Lubkowski 2002
Comparison with existing oligo design programs
Applications of GeneDesign
Next steps
GeneDesign is written in Perl and C. The source code is available as a link from the GeneDesign home page. All output is displayed in HTML compatible with JavaScript-enabled browsers. Safari 1.3 and Firefox are the recommended browsers for Macintosh and Windows platforms, respectively. All oligos in this study were synthesized by Integrated DNA Technologies on a 100-nmol scale with standard desalting purification.
A+T content and oligo design
Assembly and amplification PCR
We performed "amplification PCR" (Stemmer et al. 1995
Cloning and sequencing
We thank Brian Greenlee for helpful graphic advice, Brian Olson and Mark Forrer for programming advice, and Daniel Yuan for help with Web serving. Supported in part by NIH grants CA16519, GM36481, and Roadmap grant RR020839 to J.D.B.
1 Corresponding author.
E-mail jboeke{at}jhmi.edu; fax (410) 502-1872. Article published online ahead of print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.4431306 Freely available online through the Genome Research Open Access option.
Baldino Jr., F., Chesselet M.F., Lewis M.E. 1989. High-resolution in situ hybridization histochemistry. Methods Enzymol. 168: 761–777.
Borer P.N., Dengler B., Tinoco Jr., I., Uhlenbeck O.C. 1974. Stability of ribonucleic acid double-stranded helices. J. Mol. Biol. 86: 843–853.
Cello J., Paul A.V., Wimmer E. 2002. Chemical synthesis of poliovirus cDNA: Generation of infectious virus in the absence of natural template. Science 297: 1016–1018.
Han J.S. and Boeke J.D. 2004. A highly active synthetic mammalian retrotransposon. Nature 429: 314–318.
Hoover D.M. and Lubkowski J. 2002. DNAWorks: An automated method for designing oligonucleotides for PCR-based gene synthesis. Nucleic Acids Res. 30: e43.
Itakura K., Hirose T., Crea R., Riggs A.D., Heyneker H.L., Bolivar F., Boyer H.W. 1977. Expression in Escherichia coli of a chemically synthesized gene for the hormone somatostatin. Science 198: 1056–1063.
Jay E., MacKnight D., Lutze-Wallace C., Harrison D., Wishart P., Liu W.Y., Asundi V., Pomeroy-Cloney L., Rommens J., Eglington L., et al. 1984. Chemical synthesis of a biologically active gene for human immune interferon-
Jayaraj S., Reid R., Santi D.V. 2005. GeMS: An advanced software package for designing synthetic genes. Nucleic Acids Res. 33: 3011–3016.
Krieg R., Stucka R., Clark S., Feldmann H. 1991. The use of a synthetic tRNA gene as a novel approach to study in vivo transcription and chromatin structure in yeast. Nucleic Acids Res. 19: 3849–3855.
Mandecki W., Hayden M.A., Shallcross M.A., Stotland E. 1990. A totally synthetic plasmid for general cloning, gene expression and mutagenesis in Escherichia coli. Gene 94: 103–107.
Nambiar K.P., Stackhouse J., Stauffer D.M., Kennedy W.P., Eldredge J.K., Benner S.A. 1984. Total synthesis and cloning of a gene coding for the ribonuclease S protein. Science 223: 1299–1301.
Neves F.O., Ho P.L., Raw I., Pereira C.A., Moreira C., Nascimento A.L. 2004. Overexpression of a synthetic gene encoding human
Patterson S.S., Dionisi H.M., Gupta R.K., Sayler G.S. 2005. Codon optimization of bacterial luciferase (lux) for expression in mammalian cells. J. Ind. Microbiol. Biotechnol. 32: 115–123.
Quinn T.P., Tweedy N.B., Williams R.W., Richardson J.S., Richardson D.C. 1994. Betadoublet: De novo design, synthesis, and characterization of a
Rouillard J.M., Lee W., Truan G., Gao X., Zhou X., Gulari E. 2004. Gene2Oligo: Oligonucleotide design for in vitro gene synthesis. Nucleic Acids Res. 32: W176–W180.
Rychlik W., Spencer W.J., Rhoads R.E. 1990. Optimization of the annealing temperature for DNA amplification in vitro. Nucleic Acids Res. 18: 6409–6412.
Sharp P.M., Cowe E., Higgins D.G., Shields D.C., Wolfe K.H., Wright F. 1988. Codon usage patterns in Escherichia coli, Bacillus subtilis, Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster and Homo sapiens: A review of the considerable within-species diversity. Nucleic Acids Res. 16: 8207–8211.
Smith H.O., Hutchison III C.A., Pfannkoch C., Venter J.C. 2003. Generating a synthetic genome by whole genome assembly: phiX174 bacteriophage from synthetic oligonucleotides. Proc. Natl. Acad. Sci. 100: 15440–15445.
Stemmer W.P., Crameri A., Ha K.D., Brennan T.M., Heyneker H.L. 1995. Single-step assembly of a gene and entire plasmid from large numbers of oligodeoxyribonucleotides. Gene 164: 49–53.
Sugimoto N., Nakano S., Yoneyama M., Honda K. 1996. Improved thermodynamic parameters and helix initiation factor to predict stability of DNA duplexes. Nucleic Acids Res. 24: 4501–4505.
Tian J., Gong H., Sheng N., Zhou X., Gulari E., Gao X., Church G. 2004. Accurate multiplex gene synthesis from programmable DNA microchips. Nature 432: 1050–1054.
Received July 14, 2005; accepted in revised format November 9, 2005.