Published online before print
February 23, 2007, 10.1101/gr.6037607 Genome Res. 17:536-543, 2007 ©2007 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/07 $5.00
Resource
Approaching a complete repository of sequence-verified protein-encoding clones for Saccharomyces cerevisiae
1 Harvard Institute of Proteomics, Harvard Medical School, Cambridge, Massachusetts 02141, USA; 2 Division of Genetics, Department of Medicine, Brigham & Women's Hospital, Harvard Medical School, Boston, Massachusetts 02115, USA; 3 Harvard University Graduate Biophysics Program, Cambridge, Massachusetts 02138, USA; 4 Ludwig Institute for Cancer Research, Sao Paulo SP Brazil 01509-010; 5 DF/HCC DNA Resource Core, Harvard Medical School, Cambridge, Massachusetts 02141, USA; 6 Ludwig Institute for Cancer Research, University of California San Diego, School of Medicine, La Jolla, California 92093, USA; 7 Ludwig Institute for Cancer Research, New York, New York 10158, USA; 8 Department of Pathology, Brigham & Women's Hospital and Harvard Medical School, Boston, Massachusetts 02115, USA; 9 Harvard-MIT Division of Health Sciences & Technology (HST), Harvard Medical School, Boston, Massachusetts 02115, USA
The availability of an annotated genome sequence for the yeast Saccharomyces cerevisiae has made possible the proteome-scale study of protein function and protein-protein interactions. These studies rely on the availability of cloned open reading frame (ORF) collections that can be used for cell-free or cell-based protein expression. Several yeast ORF collections are available, but their use and data interpretation can be hindered by reliance on now out-of-date annotations, the inflexible presence of N- or C-terminal tags, and/or the unknown presence of mutations introduced during the cloning process. High-throughput biochemical and genetic analyses would benefit from a "gold standard" (fully sequence-verified, high-quality) ORF collection, which allows for high confidence in and reproducibility of experimental results. Here, we describe Yeast FLEXGene, an S. cerevisiae protein-coding clone collection that covers over 5000 predicted protein-coding sequences. The clone set covers 87% of the current S. cerevisiae genome annotation and includes full sequencing of each ORF insert. Availability of this collection makes possible a wide variety of studies from purified proteins to mutation suppression analysis, which should contribute to a global understanding of yeast protein function.
The budding yeast Saccharomyces cerevisiae is one of the most studied eukaryotes at the genetic, molecular, and cellular levels. Many of the mechanisms that control molecular and cell biology of the yeast are conserved in other eukaryotes, including mechanisms of such basic functions as DNA replication, progression through the cell cycle, and transcriptional regulation. Together with rapid growth and genetic tractability, this feature makes yeast particularly valuable for biological research.
Sequencing of the S. cerevisiae genome began as a worldwide collaboration and was completed in 1996, providing the first example of a fully sequenced eukaryotic genome. The 12,068 kilobase-pair sequence defined 5885 potential protein-encoding genes on 16 chromosomes (Goffeau et al. 1996).
Annotation of protein-coding genes in the S. cerevisiae genome has changed over time as new experimental data and advanced sequence analyses led to improved annotation. In 2003, a comparative analysis of S. cerevisiae with three related species led to the proposed elimination of about 500 previously annotated ORFs and redefinition of start and/or stop codons for at least 300 ORFs (Kellis et al. 2003).
The knowledge gained from extensive annotation of the S. cerevisiae genome over the past decade has made it possible for researchers to take a genome- and proteome-wide view of yeast gene function. The earliest genome-scale ORF collections for S. cerevisiae were constructed using a gap-repair cloning approach (Hudson et al. 1997).
Although these ORF collections have proved useful for specific proteomic studies, the ORF inserts are basically locked into the original vector and cannot be moved to another vector without a PCR amplification step (Marsischky and LaBaer 2004). Among the limitations of end-read sequencing is that many clones do not end up with full sequence coverage and are effectively unfinished. Here, we describe a new collection of yeast ORF clones, Yeast FLEXGene (Full Length EXpression-ready), in which all of the clones were full-length sequence verified and contain minimal differences between the clone and reference sequences at the amino acid level. This collection is based on the best available gene annotation, constructed in a recombinational cloning vector that enables high-throughput transfer into a wide variety of vectors, and produced with a stop codon at its native location, allowing for the production of either native or N-terminally tagged protein. The majority of clones (68%) have a normalized stop codon, potentially enabling some suppression strategies. Our goal was to obtain at least 5000 completed clones. The current collection includes clones for 5003 genes and covers 87% of the predicted protein-coding sequences for S. cerevisiae, and preliminary evidence suggests that the collection will be useful for a variety of genomic and proteomic-based approaches.
Identification of an ORF target set from the annotated S. cerevisiae genome sequence
To create an initial reference set of target ORFs, the genomic sequences of the 6277 predicted S. cerevisiae ORFs annotated at the time we initiated our study (2000) were downloaded from the Saccharomyces Genome Database (SGD). In addition, the first phase of our cloning effort (Phase One) relied on a pre-existing set of gene-specific primers from Research Genetics that were based on an earlier annotation of the S. cerevisiae genome. Our target set of reference ORFs was not static, however. We adapted to major revisions, and the analysis presented here is based on the major revision released in 2004. Thus, our final target set comprises 5774 ORFs (215 additional ORFs and 252 modified ORFs relative to the 1999 set). About 500 initially targeted ORFs were dubious ORFs, pseudogenes, or Ty elements, and were not attempted at later stages.
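As a quick arithmetic check, the 87% coverage quoted for the collection is consistent with dividing the 5003 cloned genes by the 5774-ORF final target set. The sketch below assumes that this ratio is how the percentage was derived; the function name is illustrative and does not come from the study.

```python
def coverage_percent(cloned: int, targets: int) -> float:
    """Return the share of target ORFs that have an accepted clone, in percent."""
    if targets <= 0:
        raise ValueError("targets must be positive")
    return 100.0 * cloned / targets


# Figures quoted in the text: 5003 cloned genes, 5774 target ORFs.
# Assumption: the reported coverage is simply cloned / targets.
print(round(coverage_percent(5003, 5774)))  # 87
```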
Amplification of ORFs by PCR from a normalized genomic template
The overall failure rate during the first two phases was 70%, and failures were primarily due to quality and design issues pertaining to the Research Genetics (RG) primers and to polymerase choice. For Phases Three and Four, we used higher-fidelity polymerases than those used in the previous phases. Together with the inclusion of newly designed ORF-specific primers, both polymerases improved overall cloning success (Table 1).
Capture of PCR products in a vector compatible with cloning via enzyme-mediated, site-specific recombination
PCR products were initially captured in the Gateway entry vector pDONR201 and later in pDONR221. Using this system, ORFs are captured in the correct orientation via subtle but noncompatible differences between the 5' and 3'-flanking att site sequences. Capture was initially done using the BP reaction, a method well suited to high-throughput cloning. However, we found that this approach did not efficiently capture fragments 2.5 kb or longer. For this reason, in our last cloning phase, a linearized derivative of pDONR221 was used in conjunction with Clontech's In-Fusion method for ORFs longer than 2.5 kb. Capture of the PCR-amplified fragment in the vector was defined as positive when colonies were detected after transformation, thus allowing single-colony isolation on solid agar (Fig. 1). Initially, we selected four colonies per ORF and maintained them separately, as we expected that this would increase the likelihood of obtaining at least one mutation-free clone. With experience, our methods have improved such that the benefits of choosing multiple isolates no longer outweigh the costs, as 80% of ORFs can be accepted based on a single isolate. Thus, by Phase Four, we revised our strategy to isolate one colony per ORF. After capture into the vector, half of the capture reaction was plated on solid agar and the remaining transformation mix was stored at -80°C, allowing us to return to the frozen transformation mix without the need to repeat the entire cloning procedure.
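The size-dependent choice of capture method described above can be written as a small decision rule. This is only an illustrative sketch: the function and constant names are hypothetical, the cutoff is read as "2.5 kb or longer", and In-Fusion is assumed to apply only in the last phase (Phase Four), as the text states.

```python
IN_FUSION_MIN_BP = 2500  # assumed cutoff, read from "2.5 kb or longer"
LAST_PHASE = 4           # the text reports In-Fusion only for the last cloning phase


def capture_method(orf_length_bp: int, phase: int) -> str:
    """Pick a capture method for one ORF (illustrative rule, not the authors' code)."""
    if orf_length_bp <= 0:
        raise ValueError("ORF length must be positive")
    if phase == LAST_PHASE and orf_length_bp >= IN_FUSION_MIN_BP:
        return "In-Fusion"  # linearized pDONR221 derivative
    return "BP"             # Gateway BP reaction


for length in (900, 2500, 4200):
    print(length, capture_method(length, phase=4))
```

Under these assumptions a 900-bp ORF is routed to the BP reaction in every phase, while a 4200-bp ORF is routed to In-Fusion only in Phase Four.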
DNA sequencing reveals high-fidelity capture of 87% of known and predicted yeast ORFs
ORF size, PCR primer attributes, and GC content contribute to cloning failure
In our amplification strategy, primers target gene-specific regions of 20-30 nucleotides in length that correspond to the extreme 5' and 3' ends of the coding sequence. We were surprised to find that many different genes share identical 5' and 3' ends, making it difficult to amplify all desired ORFs. To determine the extent to which primer specificity contributed to cloning failure, we compared primer sequences for all ORFs with one another. Matching primer pairs typically cause favored amplification of the shortest gene sharing the primer sequences. We also examined whether primer sequences could bind elsewhere in the genome (not necessarily at the ends of other genes). This situation leads to failed amplification or to amplification of unintended sequence. We found that high primer sequence similarity with other ORFs, as well as high primer stickiness to genomic DNA, reduced the cloning success rate from 87% to 70%. Details are listed in Supplemental Table 2 and Supplemental Figure 1.
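The all-against-all primer comparison described above can be approximated by grouping ORFs on their terminal sequences. The sketch below is a simplified stand-in, not the analysis used in the study: it treats the first and last `primer_len` nucleotides of each ORF as the primer-binding regions, looks only for exact matches, and uses toy sequences and hypothetical ORF names.

```python
from collections import defaultdict


def end_signature(orf_seq: str, primer_len: int = 20) -> tuple[str, str]:
    """Return the 5' and 3' terminal regions that gene-specific primers would target.

    The reverse primer is the reverse complement of the 3' region, so two ORFs
    share a primer pair exactly when both terminal regions are identical.
    """
    seq = orf_seq.upper()
    return seq[:primer_len], seq[-primer_len:]


def shared_primer_groups(orfs: dict[str, str], primer_len: int = 20) -> list[list[str]]:
    """Group ORF names whose terminal regions are identical.

    Within each group the shortest ORF is listed first, because the text notes
    that amplification tends to favor the shortest gene sharing the primers.
    """
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for name, seq in orfs.items():
        groups[end_signature(seq, primer_len)].append(name)
    return [
        sorted(names, key=lambda n: len(orfs[n]))
        for names in groups.values()
        if len(names) > 1
    ]


# Toy sequences (hypothetical); a 6-nt "primer" keeps the example readable.
toy_orfs = {
    "ORF_A": "ATGAAA" + "CCCCCC" + "GGGTAA",
    "ORF_B": "ATGAAA" + "CCCCCCTTTTTT" + "GGGTAA",
    "ORF_C": "ATGCCC" + "AAAAAA" + "TTTTAA",
}
print(shared_primer_groups(toy_orfs, primer_len=6))  # [['ORF_A', 'ORF_B']]
```

Detecting primer binding elsewhere in the genome would need an approximate-match search against the full genomic sequence, which this exact-match grouping does not attempt.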
Failure to clone some sequences could reflect errors in the target ORF sequences
Yeast ORF clones are useful for protein expression and analysis
Our rationale for sequence verification of all clones in the yeast collection was to ensure that the clones are useful for protein expression-based assays. To further test the utility of the clones in protein-based assays, we transferred a functionally related set of clones from the entry vector to a bacterial expression vector, induced expression, and purified the proteins. In total, we selected 257 clones that encode known and predicted transcription factors for transfer into the protein expression vector pDEST-GST (LaBaer et al. 2004).
In a pilot study, we applied the purified proteins to a protein-binding microarray (PBM) to identify DNA sequence motifs bound by the query protein(s) (Bulyk et al. 1999).
Genome-sequencing projects have produced an immense amount of information regarding the organization, evolution, and coding capacity of genomes. Availability of this information has propelled biological research in the direction of genome- or proteome-scale approaches. The need to develop tools and resources to facilitate this type of research is ever increasing. Large-scale functional proteomics studies, for example, rely on the availability of cloned copies of DNA encoding the proteins, which make it possible to express proteins in vivo or in vitro and use them in a wide variety of assays (Uetz et al. 2000). An ideal collection of protein-encoding clones would embody the virtues of comprehensive coverage of all ORFs, simplified transfer of ORFs to any protein expression vector, and full-length sequence validation of all ORFs. In this report, we have described the cloning and verification of yeast FLEXGene ORF clones that meet this "gold standard" for clone quality. In our vector choice, we exploited the availability of recombination-based cloning technology, making it possible for the ORFs in our collection to be easily moved from one vector to another, facilitating the widest possible range of functional experimentation. Importantly, the clones in the collection we describe here were clonally isolated and full-length sequence verified. The collection covers 87% of S. cerevisiae protein-coding sequences (Supplemental Table 1), and 82% of the clones in this collection match perfectly to the reference peptide sequence from current ORF annotation (18% of clones carry one or two amino acid changes). These clones all have GenBank listings and can be searched and requested at http://plasmid.hms.harvard.edu. The effort to build this ORF collection was carried out in four distinct phases, in which clones that failed in a prior phase were carried forward to the next phase.
Despite the fact that failed clones were carried forward, we found that several major factors contributed to a much higher failure rate in earlier phases than in later phases. These included: (1) incorrectly designed and/or synthesized PCR primers; (2) the use of PCR enzymes with low fidelity; (3) difficulties sequencing inserts in the entry vector pDONR201, which made it impossible to achieve full-sequence assemblies for many clones; and (4) erroneous genome annotation. We addressed each of these issues in the subsequent cloning phases and achieved a twofold higher success rate in later versus initial phases in terms of obtaining full-length verified clones (Table 1).
Despite multiple attempts using different primers and cloning strategies, however, the collection still lacks a qualified clone for some ORFs. Examining the factors contributing to lost ORFs will inform future projects, particularly those involving eukaryotic genes. Factors such as ORF size (Fig. 2), GC content, primer similarity, and primer stickiness to other genes or to genomic DNA (Supplemental Table 2; Supplemental Fig. 1) make some ORFs more difficult to clone from a genomic template than others. In total, 427 ORFs in the target list were more difficult to clone due to one or more of these factors (i.e., ORFs size
Our aim was to create a high-quality clone collection useful for the broadest possible variety of functional studies of yeast proteins. We used the protein-binding microarray (PBM) approach to identify the DNA sequence motif of Rap1 to demonstrate the use of the clones described here in protein-based assays (Fig. 4). The usefulness of the resource was also demonstrated in a high-throughput screen to identify the cellular targets of a small-molecule inhibitor of the TOR pathway (Butcher et al. 2006).
Preparation of genomic DNA
Genomic DNA was purified from S. cerevisiae strain S288C, identical to the strain used for the initial published genome sequence (Goffeau et al. 1996).
Primer design, synthesis, and PCR amplification
For the first-step PCR in Phases Three and Four, optimal gene-specific primers were designed using a modified nearest-neighbor algorithm (Sugimoto et al. 1995).
Capture of amplified ORFs
To obtain a linear pDONR221 version that would allow for directional cloning, we introduced unique restriction sites in the 5' (NcoI) and 3' (XhoI) att sequences by site-directed mutagenesis, and sequence verified the correct insertion after In-Fusion reactions.
Identification of known or predicted transcription factor genes
Subcloning of ORFs into expression vector, bacterial expression, and purification of proteins
Immunoblotting
Protein-binding microarray experiments and data analysis
Informatics
Sequence analysis and verification were performed using ACE, a web-based automatic sequence validation package developed in our group for high-throughput clone sequence validation. The features and implementation of this system will be described elsewhere (E. Taycher, A. Rolfs, Y. Hu, D. Zuo, S. Mohr, J. Williamson, and J. LaBaer, in prep).
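ACE itself is not described in this report, so the following is only a loose illustration of peptide-level clone checking, not its implementation. It translates a clone insert and the reference ORF with the standard genetic code, counts amino acid substitutions, and sorts the clone into one of three bins. The two-change tolerance mirrors the "one or two amino acid changes" reported for accepted clones; all function names and the rule that rejects length changes and premature stops are assumptions made for the example.

```python
from itertools import product
from typing import Optional

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(orf: str) -> str:
    """Translate an in-frame ORF (A/C/G/T only) and drop trailing stop codons."""
    orf = orf.upper()
    if len(orf) % 3:
        raise ValueError("ORF length is not a multiple of three")
    peptide = "".join(CODON_TABLE[orf[i:i + 3]] for i in range(0, len(orf), 3))
    return peptide.rstrip("*")


def amino_acid_changes(clone_orf: str, reference_orf: str) -> Optional[int]:
    """Count substitutions; return None for length changes or premature stops."""
    clone, ref = translate(clone_orf), translate(reference_orf)
    if len(clone) != len(ref) or "*" in clone:
        return None
    return sum(a != b for a, b in zip(clone, ref))


def classify(clone_orf: str, reference_orf: str, max_changes: int = 2) -> str:
    """Assumed acceptance rule: a perfect peptide match, or at most max_changes substitutions."""
    n = amino_acid_changes(clone_orf, reference_orf)
    if n is None or n > max_changes:
        return "reject"
    return "perfect" if n == 0 else "accepted_with_changes"


reference = "ATGGCTAAATAA"                # Met-Ala-Lys-stop
print(classify("ATGGCCAAATAA", reference))  # silent change    -> perfect
print(classify("ATGGATAAATAA", reference))  # Ala to Asp       -> accepted_with_changes
print(classify("ATGTAAAAATAA", reference))  # premature stop   -> reject
```

Comparing at the peptide level means that silent nucleotide differences still count as a perfect match, which is consistent with the collection being scored against the reference peptide sequence.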
Accepted clones were imported into the Plasmid Information Database (PlasmID; http://plasmid.hms.harvard.edu), where they can be publicly searched and requested. The features and implementation of this system are described elsewhere (Zuo et al. 2007).
We thank all past and present members of the Harvard Institute of Proteomics who have contributed to the development of the techniques that made this work possible. Special thanks to Dr. Leonardo Brizuela and Dr. Joseph Pearlberg for helpful discussions and advice, and to Stephanie Ness (LICR, San Diego) for her efforts in performing DNA sequencing. We thank Katrina Saulrieta and Zachary Smith for technical assistance. This work has been supported by grants R01 HG002923 (J.L.) and R01 HG003420 (M.L.B.) from the National Institutes of Health, and funds from the Ludwig Institute for Cancer Research.
10 Corresponding author.
E-mail Joshua_labaer@hms.harvard.edu; fax (617) 324-0824. [Supplemental material is available online at www.genome.org.] Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.6037607
Bateman, A., Coin, L., Durbin, R., Finn, R.D., Hollich, V., Griffiths-Jones, S., Khanna, A., Marshall, M., Moxon, S., Sonnhammer, E.L., et al. 2004. The Pfam protein families database. Nucleic Acids Res. 32: D138-D141.
Berger, M.F. and Bulyk, M.L. 2006. Protein binding microarrays (PBMs) for the rapid, high-throughput characterization of the sequence specificities of DNA binding proteins. In Gene mapping, discovery, and expression (ed. M. Bina), pp. 245-260. The Humana Press, Inc., Totowa, NJ.
Bulyk, M.L., Gentalen, E., Lockhart, D.J., and Church, G.M. 1999. Quantifying DNA-protein interactions by double-stranded DNA arrays. Nat. Biotechnol. 17: 573.
Bulyk, M.L., Huang, X., Choo, Y., and Church, G.M. 2001. Exploring the DNA-binding specificities of zinc fingers with DNA microarrays. Proc. Natl. Acad. Sci. 98: 7158-7163.
Butcher, R.A., Bhullar, B.S., Perlstein, E.O., Marsischky, G., LaBaer, J., and Schreiber, S.L. 2006. Microarray-based method for monitoring yeast overexpression strains reveals small-molecule targets in TOR pathway. Nat. Chem. Biol. 2: 103-109.
Costanzo, M.C., Hogan, J.D., Cusick, M.E., Davis, B.P., Fancher, A.M., Hodges, P.E., Kondu, P., Lengieza, C., Lew-Smith, J.E., Lingner, C., et al. 2000. The yeast proteome database (YPD) and Caenorhabditis elegans proteome database (WormPD): Comprehensive resources for the organization and comparison of model organism protein information. Nucleic Acids Res. 28: 73-76.
Gelperin, D.M., White, M.A., Wilkinson, M.L., Kon, Y., Kung, L.A., Wise, K.J., Lopez-Hoyo, N., Jiang, L., Piccirillo, S., Yu, H., et al. 2005. Biochemical and genetic analysis of the yeast proteome with a movable ORF collection. Genes & Dev. 19: 2816-2826.
Goffeau, A., Barrell, B.G., Bussey, H., Davis, R.W., Dujon, B., Feldmann, H., Galibert, F., Hoheisel, J.D., Jacq, C., Johnston, M., et al. 1996. Life with 6000 genes. Science 274: 563-567.
Hall, D.A., Zhu, H., Zhu, X., Royce, T., Gerstein, M., and Snyder, M. 2004. Regulation of gene expression by a metabolic enzyme. Science 306: 482-484.
Harbison, C.T., Gordon, D.B., Lee, T.I., Rinaldi, N.J., Macisaac, K.D., Danford, T.W., Hannett, N.M., Tagne, J.B., Reynolds, D.B., Yoo, J., et al. 2004. Transcriptional regulatory code of a eukaryotic genome. Nature 431: 99-104.
Huang, J., Zhu, H., Haggarty, S.J., Spring, D.R., Hwang, H., Jin, F., Snyder, M., and Schreiber, S.L. 2004. Finding new components of the target of rapamycin (TOR) signaling network through chemical genetics and proteome chips. Proc. Natl. Acad. Sci. 101: 16594-16599.
Huber, B. and Bulyk, M. 2006. Meta-analysis discovery of tissue-specific DNA sequence motifs from mammalian gene expression data. BMC Bioinformatics 7: 7.
Hudson Jr., J.R., Dawson, E.P., Rushing, K.L., Jackson, C.H., Lockshon, D., Conover, D., Lanciault, C., Harris, J.R., Simmons, S.J., Rothstein, R., et al. 1997. The complete set of predicted genes from Saccharomyces cerevisiae in a readily usable form. Genome Res. 7: 1169-1173.
Hughes, J.D., Estep, P.W., Tavazoie, S., and Church, G.M. 2000. Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae. J. Mol. Biol. 296: 1205-1214.
Ito, T., Chiba, T., Ozawa, R., Yoshida, M., Hattori, M., and Sakaki, Y. 2001. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc. Natl. Acad. Sci. 98: 4569-4574.
Kellis, M., Patterson, N., Endrizzi, M., Birren, B., and Lander, E. 2003. Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423: 241-254.
Kumar, A., Agarwal, S., Heyman, J.A., Matson, S., Heidtman, M., Piccirillo, S., Umansky, L., Drawid, A., Jansen, R., Liu, Y., et al. 2002. Subcellular localization of the yeast proteome. Genes & Dev. 16: 707-719.
LaBaer, J., Qiu, Q., Anumanthan, A., Mar, W., Zuo, D., Murthy, T.V., Taycher, H., Halleck, A., Hainsworth, E., Lory, S., et al. 2004. The Pseudomonas aeruginosa PA01 gene collection. Genome Res. 14: 2190-2200.
Lee, T., Rinaldi, N., Robert, R., Odom, D., Bar-Joseph, Z., Gerber, G., Hannett, N., Harbison, C., Thompson, C., Simon, I., et al. 2002. Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298: 799-804.
Liu, X., Brutlag, D., and Liu, J. 2001. BioProspector: Discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput. 6: 127-138.
Marsischky, G. and LaBaer, J. 2004. Many paths to many clones: A comparative look at high-throughput cloning methods. Genome Res. 14: 2020-2028.
Mewes, H., Frishman, D., Guldener, U., Mannhaupt, G., Mayer, K., Mokrejs, M., Morgenstern, B., Munsterkotter, M., Rudd, S., and Weil, B. 2002. MIPS: A database for genomes and protein sequences. Nucleic Acids Res. 30: 31-34.
Mukherjee, S., Berger, M.F., Jona, G., Wang, X.S., Muzzey, D., Snyder, M., Young, R.A., and Bulyk, M.L. 2004. Rapid analysis of the DNA-binding specificities of transcription factors with DNA microarrays. Nat. Genet. 36: 1331-1339.
Ptacek, J., Devgan, G., Michaud, G., Zhu, H., Zhu, X., Fasolo, J., Guo, H., Jona, G., Breitkreutz, A., Sopko, R., et al. 2005. Global analysis of protein phosphorylation in yeast. Nature 438: 679-684.
Ren, B., Robert, F., Wyrick, J.J., Aparicio, O., Jennings, E.G., Simon, I., Zeitlinger, J., Schreiber, J., Hannett, N., Kanin, E., et al. 2000. Genome-wide location and function of DNA binding proteins. Science 290: 2306-2309.
Sopko, R., Huang, D., Preston, N., Chua, G., Papp, B., Kafadar, K., Snyder, M., Oliver, S.G., Cyert, M., Hughes, T.R., et al. 2006. Mapping pathways and phenotypes by systematic gene overexpression. Mol. Cell 21: 319-330.
Sugimoto, N., Nakano, S., Katoh, M., Matsumura, A., Nakamuta, H., Ohmichi, T., Yoneyama, M., and Sasaki, M. 1995. Thermodynamic parameters to predict stability of RNA/DNA hybrid duplexes. Biochemistry 34: 11211-11216.
Uetz, P., Giot, L., Cagney, G., Mansfield, T.A., Judson, R.S., Knight, J.R., Lockshon, D., Narayan, V., Srinivasan, M., Pochart, P., et al. 2000. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403: 623-627.
Zhu, H., Klemic, J.F., Chang, S., Bertone, P., Casamayor, A., Klemic, K.G., Smith, D., Gerstein, M., Reed, M.A., and Snyder, M. 2000. Analysis of yeast protein kinases using protein chips. Nat. Genet. 26: 283-289.
Zhu, H., Bilgin, M., Bangham, R., Hall, D., Casamayor, A., Bertone, P., Lan, N., Jansen, R., Bidlingmaier, S., Houfek, T., et al. 2001. Global analysis of protein activities using proteome chips. Science 293: 2101-2105.
Zuo, D., Mohr, S.E., Hu, Y., Taycher, E., Rolfs, A., Kramer, J., Williamson, J., and LaBaer, J. 2007. PlasmID: A centralized repository for plasmid clone information and distribution. Nucleic Acids Res. 35: D680-D684.
Received October 13, 2006; accepted in revised form January 3, 2007.