Genome Res. 14:2145-2154, 2004 ©2004 by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/04 $5.00
Resources
High-Throughput Computational and Experimental Techniques in Structural Genomics
1 New York Structural Genomics Research Consortium, Albert Einstein College of Medicine, Bronx, New York 10461, USA
2 Department of Physiology and Biophysics, Albert Einstein College of Medicine, Bronx, New York 10461, USA
3 Department of Biochemistry, Albert Einstein College of Medicine, Bronx, New York 10461, USA
4 Center for Synchrotron Biosciences, Albert Einstein College of Medicine, Bronx, New York 10461, USA
5 Departments of Biopharmaceutical Sciences and Pharmaceutical Chemistry and California Institute for Quantitative Biomedical Research, University of California San Francisco, San Francisco, California 94143, USA
Structural genomics has as its goal the provision of structural information for all possible ORF sequences through a combination of experimental and computational approaches. The access to genome sequences and cloning resources from an ever-widening array of organisms is driving high-throughput structural studies by the New York Structural Genomics Research Consortium. In this report, we outline the progress of the Consortium in establishing its pipeline for structural genomics, and some of the experimental and bioinformatics efforts leading to structural annotation of proteins. The Consortium has established a pipeline for structural biology studies, automated modeling of ORF sequences using solved (template) structures, and a novel high-throughput approach (metallomics) to examining the metal binding to purified protein targets. The Consortium has so far produced 493 purified proteins from >1077 expression vectors. A total of 95 have resulted in crystal structures, and 81 are deposited in the Protein Data Bank (PDB). Comparative modeling of these structures has generated >40,000 structural models. We also initiated a high-throughput metal analysis of the purified proteins; this has determined that 10%-15% of the targets contain a stoichiometric structural or catalytic transition metal atom. The progress of the structural genomics centers in the U.S. and around the world suggests that the goal of providing useful structural information on most all ORF domains will be realized. This projected resource will provide structural biology information important to understanding the function of most proteins of the cell.
The complete genomes of a number of organisms have been sequenced, and many more are underway. This progress in gene sequencing has shifted the landscape of biology, such that goals related to understanding the structure and function of each gene product, as well as the interactions within the cellular environment that lead to the behavior of complex systems, are within reach, or can at least be contemplated. The sequencing of model organisms from bacterial species to human has allowed the identification of genes essential to function, as well as genes that give rise to the diversity of life forms. Although the exact numbers and natures of the genes are still open to question, recent estimates place the numbers at <20,000 for Caenorhabditis elegans and Caenorhabditis briggsae and
The Protein Structure Initiative (PSI) funded by the National Institute of General Medical Sciences (www.nigms.nih.gov/psi) includes the structural genomics efforts of nine centers in the United States. In the so-called phase 1 of the PSI (Editorial 2004
As a rule, the ORF targets selected for the structural genomics efforts are <30% identical (across a reasonable length) to proteins already deposited in the PDB (Sali 1998
The major benefit from structural genomics efforts is the provision of structural models for biologists to understand gene function. In addition, the wealth of structural information will be used to address issues of protein folding, protein structure prediction, and protein evolution. In terms of biomedical impact, the structural data will facilitate design of therapeutic agents by comparing functionally similar protein structures of pathogens and hosts, or proteins in diseased and normal tissues. The structural genomics efforts have facilitated technical developments in structure determination and the establishment of high-throughput facilities for the use of a wide community of scientists. Also, the structural genomics projects are providing reagents and materials for spin-off projects that examine function in vivo and in vitro. Lastly, retrospective analyses using the unprecedented volume of high-throughput experiments are helping to establish methods to predict experimental outcomes for protein production and crystallization. In this report, we outline the progress of the New York Structural Genomics Research Consortium (NYSGXRC, www.nysgxrc.org) in implementing and developing its structural genomics pipeline. We emphasize the coordination of bioinformatics efforts with the experimental methods of the consortium, including the development of an integrated consortium database to manage the workflow, the overall progress from cloning to modeling, the impact of the modeling of NYSGXRC structures, and novel experimental and bioinformatics approaches to examining the structure of metalloproteins, termed metallomics (Hasnain 2004
Design and Use of an Online Experimental Database
One of the key features in the successful internal functioning of the NYSGXRC (and any large multi-task project) has been the development of a database for effective communication among the participants. The Integrated Consortium Experimental Database (IceDB) has been set up to facilitate data management among the various research groups in the NYSGXRC. IceDB fulfills several roles: it serves as a Laboratory Information Management System (LIMS) for exchanging, querying, displaying, and archiving experimental and bioinformatics data; it is used as an automated and versatile tool for bioinformatics screening and analysis; and finally, it is an interface and data exchange platform for users, other centers, and external resources. Technically, the system is a MySQL relational database that is interconnected with a series of locally implemented bioinformatics programs and external databases. The relational database can be accessed through a Web interface at www.nysgxrc.org. It is coded in HTML and Perl CGI. IceDB is composed of two main parts, Target List and Progress Report. Target List contains the potential targets and their annotations in order to aid target selection. Several bioinformatics programs have been implemented for screening, such as calculating peptide statistics and predicting secondary structure, membrane-immersed regions, and disordered regions from the sequences. Progress Report collects and displays the experimental data, and tracks the progress for all the selected targets. The collected experimental data include fields such as cloning, expression, biophysical characterization, crystallization, X_Ray data collection, X_Ray refinement, X_Ray structure, and PDB deposition. Users can insert comments and actual data (graphs, images) as appropriate for each class of field.
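The two-part organization described above (a Target List plus a Progress Report keyed by target) can be illustrated with a small relational sketch. This is not the IceDB schema, which is not given in the text: the table names, column names, and the target identifier `T0001` are hypothetical, and SQLite stands in for MySQL only so that the example runs without a server. The stage names are taken from the Progress Report fields listed in the text.

```python
import sqlite3

# Stage names follow the Progress Report fields named in the text.
# Table and column names below are hypothetical, for illustration only.
STAGES = [
    "cloning", "expression", "biophysical characterization",
    "crystallization", "X_Ray data collection", "X_Ray refinement",
    "X_Ray structure", "PDB deposition",
]


def create_schema(conn):
    """Create a minimal Target List table and Progress Report table."""
    conn.executescript("""
        CREATE TABLE target (
            target_id  TEXT PRIMARY KEY,
            sequence   TEXT NOT NULL,
            annotation TEXT
        );
        CREATE TABLE progress (
            target_id TEXT NOT NULL REFERENCES target(target_id),
            stage     TEXT NOT NULL,
            comment   TEXT,
            PRIMARY KEY (target_id, stage)
        );
    """)


def record_stage(conn, target_id, stage, comment=None):
    """Record that a target has reached a given pipeline stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    conn.execute(
        "INSERT OR REPLACE INTO progress (target_id, stage, comment) "
        "VALUES (?, ?, ?)",
        (target_id, stage, comment),
    )


def latest_stage(conn, target_id):
    """Return the furthest stage reached by a target, or None."""
    rows = conn.execute(
        "SELECT stage FROM progress WHERE target_id = ?", (target_id,)
    ).fetchall()
    reached = {row[0] for row in rows}
    latest = None
    for stage in STAGES:  # STAGES is ordered from first to last step
        if stage in reached:
            latest = stage
    return latest
```

Keeping one row per target and stage makes progress queries a simple lookup, at the cost of storing no history of repeated attempts at the same stage; a production LIMS would likely need a richer model.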
IceDB automatically generates weekly progress statistics and XML-formatted progress reports for TargetDB, the centralized database of the PSI. IceDB also compares regularly and systematically all the active targets in the internal pipeline with the ones in TargetDB, and identifies potential overlapping cases, the extent of their sequential overlap and similarity, and the stages of experimental progress toward these structures. IceDB interfaces with three major external resources and several public databases. This cross-linking is essential to consortium communication, as specific tasks in the structural genomics pipeline are distributed among various independent laboratories. For example, ORF target-selection bioinformatics tasks are primarily carried out at UCSF in the Sali laboratory. A list of curated ORF targets is then transmitted to the large-scale cloning and protein production facilities at Structural Genomix (SGX) in San Diego, where the overall Consortium's effort is directed by Stephen Burley. IceDB regularly exchanges data with the LIMS of SGX. Thus, data generated at SGX on cloning, expression, solubility, and purification of protein targets is automatically uploaded. Purified ORF targets are shipped from SGX to the four crystallographic laboratories in New York for automated crystallization, and these labs use IceDB to track progress in generating crystals and assessing diffraction quality upon preliminary synchrotron data collection. In this way, the crystallography laboratories receive necessary information on the targets from SGX, and SGX can determine which ORF targets are showing progress through the pipeline.
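The automated XML progress reporting mentioned above can be sketched with the Python standard library. The element and attribute names here (`progressReport`, `summary`, `statusCount`, `target`, `status`) are hypothetical and do not reproduce the actual TargetDB exchange format, which is not described in the text; the target identifiers are placeholders.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def build_progress_report(center, targets):
    """Serialize (target_id, status) pairs into an XML progress report.

    The layout is an illustrative assumption, not the TargetDB schema.
    """
    targets = list(targets)  # allow any iterable, since we pass over it twice
    root = ET.Element("progressReport", {"center": center})

    # Summary block: number of targets at each status.
    summary = ET.SubElement(root, "summary")
    counts = Counter(status for _, status in targets)
    for status, n in sorted(counts.items()):
        ET.SubElement(summary, "statusCount", {"status": status, "n": str(n)})

    # One element per target with its current status.
    for target_id, status in targets:
        node = ET.SubElement(root, "target", {"id": target_id})
        ET.SubElement(node, "status").text = status

    return ET.tostring(root, encoding="unicode")
```

A report built this way can be parsed back with `ET.fromstring` by the receiving side, which is one reason a structured format is convenient for exchange between independent laboratories.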
To keep track of structure solution activity at the National Synchrotron Light Source, IceDB automatically communicates with the Automated Structure Determination Platform (ASDP). ASDP is used for high-throughput X-ray structure determination subsequent to data collection (Chance et al. 2002
Output of NYSGXRC Pipeline to Date and Worldwide Progress in Structural Genomics
Among the first 65 NYSGXRC target structures solved, 53 have been classified by SCOP (Murzin et al. 1995 and structure, 23 have alternating and , 11 are all- , and six are all- protein classes. At the fold level, the 53 structures are distributed among 36 fold types. The solved targets were also compared with already known structures using the DALI program (Holm and Sander 1995
On the basis of our current protein production rates, we now have sufficient statistics to reliably estimate the NYSGXRC output in the immediate future. The above statistics argue that
The production statistics for the 15 structural genomics centers located around the world as of May 2004 include 28,293 proteins cloned with expression observed in 16,468 of the vector targets (or 58%). A total of 6177 targets have been seen to produce soluble protein, from which 5924 proteins have been purified. Thus, the overall experience is that purified protein has been obtained from 36% of the vectors for which expression has been observed. A total of 2162 of the purified proteins formed crystalline material, and 1034 (17% of the purified target set) resulted in diffraction quality crystals, whereas 715 structures have been deposited to the PDB. These outcomes are expected to improve, as some of the proteins are still at some intermediate stage in the various pipelines. Compared with the goal of producing 10,000-15,000 new structures to provide completeness in structural genomics (Vitkup et al. 2001
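The percentages quoted in the worldwide production statistics can be re-derived from the raw counts. The short sketch below uses only the figures given in the paragraph above; the function and variable names are illustrative.

```python
# Worldwide pipeline counts as of May 2004, taken from the text.
FUNNEL = [
    ("cloned", 28293),
    ("expressed", 16468),
    ("soluble", 6177),
    ("purified", 5924),
    ("crystallized", 2162),
    ("diffraction-quality crystals", 1034),
    ("deposited in PDB", 715),
]


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


def step_yields(funnel):
    """Yield of each stage relative to the stage immediately before it."""
    return [
        (funnel[i + 1][0], percent(funnel[i + 1][1], funnel[i][1]))
        for i in range(len(funnel) - 1)
    ]


counts = dict(FUNNEL)
# The three ratios quoted in the text:
expressed_of_cloned = percent(counts["expressed"], counts["cloned"])
purified_of_expressed = percent(counts["purified"], counts["expressed"])
diffracting_of_purified = percent(
    counts["diffraction-quality crystals"], counts["purified"]
)
```

Rounded to whole percentages, these three values reproduce the 58%, 36%, and 17% reported above.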
Modeling NYSGXRC Sequences: How Structural Models Are Informing New Biology
A suite of bioinformatics programs and databases is at the foundation of the NYSGXRC's computational efforts. MODBASE (http://salilab.org/modbase) is a comprehensive database of annotated comparative protein structure models (Pieper et al. 2004
MODBASE is organized into several model data sets. The largest contains models for domains in 659,495 sequences of 1,182,126 unique protein sequences in the complete SWISS-PROT/TrEMBL (Boeckmann et al. 2003
Relying on the first 63 unique NYSGXRC solved structures, MODPIPE produced models for domains in 33,340 sequences in SWISS-PROT/TrEMBL (Table 2). The modeled sequences come from 2676 different organisms, with a kingdom distribution of 41% Prokaryota, 2% Archaea, and 57% Eukaryota. This organism classification has been derived from the NCBI taxonomy database, where all protein sequences are matched with a taxonomy id (Wheeler et al. 2000
Considering that the target sequences for NYSGXRC were selected to have <30% sequence identity to a known experimental structure, most of the modeled ORF sequences have been characterized structurally for the first time. Thus, these data sets indicate the increased coverage of the sequence-structure space by the NYSGXRC structures. In fact, the experience so far for the U.S. centers is that 70% of their PDB deposits in 2002-2003 are for proteins containing unique sequences (i.e., sequences with <30% sequence identity to the closest known structure), compared with only 10% of the deposits overall during the same time period (Editorial 2004
The most interesting cases for functional analysis would be proteins for which sequence-based methods failed to establish a meaningful connection to a protein of known function or structure. On the basis of our current experience, every third target solved in the NYSGXRC pipeline remains functionally uncharacterized. These proteins are ripe for experimental investigation using biochemical or genetic approaches. Although funds are available from the NIH for the study of functionally characterized structures solved by the PSI centers, no mechanism exists to systematically study the uncharacterized proteins (Editorial 2004
Another way to glean functional insight for unannotated protein structures is through the comparative modeling pipeline. Structure-based search and confirmation of protein relationship is usually more reliable and sensitive than sequence-only based approaches. Such structural (and potentially functional) assignments are called "nontrivial hits" (summarized in Table 2 in the M column), and are usually based on very low (<20%) sequence identity between aligned regions of the target and template sequences. An example is the model of a protein sequence annotated in the TrEMBL database (Boeckmann et al. 2003
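The sequence identity thresholds used throughout this section (<30% for target selection, <20% for typical nontrivial hits) depend on how identity is computed over the aligned regions. The sketch below shows one common convention, counting identical residues over columns where both sequences have a residue; this is an assumption for illustration, not necessarily the definition used by MODPIPE. The function name and gap symbol are hypothetical choices.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity between two rows of a pairwise alignment.

    Only columns where neither sequence has a gap are counted.
    Other conventions (for example, dividing by the shorter sequence
    length) give different values for the same alignment.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip gapped columns
        aligned += 1
        if a.upper() == b.upper():
            matches += 1
    if aligned == 0:
        return 0.0
    return 100.0 * matches / aligned


def is_below_threshold(aligned_a, aligned_b, threshold=30.0):
    """True if the pair falls below an identity threshold (default 30%)."""
    return percent_identity(aligned_a, aligned_b) < threshold
```

Because the denominator strongly affects the result for short or gappy alignments, the qualifier "across a reasonable length" matters when applying such cutoffs.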
High-Throughput Annotation of Metal-Binding Targets
Up to one-third of proteins contain metal atoms (Hasnain 2004
We analyzed 143 proteins from prokaryotic sources recently delivered by SGX to the crystallography laboratories for crystallization testing. For each protein, 200 µg of sample were loaded onto the sample plates and dried under controlled conditions. The results in terms of corrected counts for T834, which was annotated as a hypothetical protein (Table 3), are shown in Figure 1. The sample showed significant nickel fluorescence counts, but minimal amounts of the other metals were detected. Of the 143 samples examined, >20 indicated some transition metal content (data not shown). To limit the analysis to likely cases of structural or functional metal atoms, the metal-to-protein stoichiometry was determined by comparison of the corrected counts with an appropriately chosen set of standards for each metal at the same experimental conditions; thus, the number of moles of each metal was accurately measured. The results for the 16 proteins that showed a metal/protein ratio of 0.7 or greater are shown in Table 3; the error in this analysis is ±0.2, such that we report data only for metal binding that is likely to be stoichiometric, and therefore relevant. Of these, two proteins contain two or more metal atoms per protein molecule, and 14 proteins contain one or more metal per molecule (metal/protein ratios 0.7-1.6), including T834. Zinc was observed in eight cases, copper and nickel in three each, and iron and manganese once.
In the following section, we examine the known annotations for these 16 proteins. Our analysis is likely to emphasize false negatives, as some metalloproteins may lose a metal atom in the purification step. We have already excluded one false positive, where a stoichiometry of 0.5 Zn/protein was observed for T1429 (data not shown, AC:Q57549). This target was solved by the NYSGXRC (1q98 [PDB] ). An anomalous difference Fourier analysis showed no evidence of a metal atom signature. However, this protein does have exposed Cys residues that may be able to coordinate adventitious Zn during the purification. This is one factor leading to the choice of a cutoff of 0.7 metal/protein for the annotation of metalloprotein identity.
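The stoichiometry calculation described above (corrected counts compared with metal standards, then divided by the amount of protein loaded) can be sketched as follows. The linear, through-the-origin calibration against a single standard is a simplifying assumption, and the function names are hypothetical. Only the 200 µg load and the 0.7 cutoff come from the text.

```python
CUTOFF = 0.7            # metal/protein ratio cutoff stated in the text
SAMPLE_MASS_UG = 200.0  # protein loaded per well, from the text


def metal_nmol(corrected_counts, standard_counts, standard_nmol):
    """Convert fluorescence counts to nanomoles of metal.

    Assumes counts scale linearly with metal amount and that the
    standard was measured under the same experimental conditions.
    """
    return corrected_counts * standard_nmol / standard_counts


def protein_nmol(mass_ug, mol_weight_da):
    """Nanomoles of protein: µg / (g/mol) gives µmol, times 1000 gives nmol."""
    return mass_ug / mol_weight_da * 1000.0


def metal_per_protein(corrected_counts, standard_counts, standard_nmol,
                      mol_weight_da, mass_ug=SAMPLE_MASS_UG):
    """Metal/protein molar ratio for one sample well."""
    metal = metal_nmol(corrected_counts, standard_counts, standard_nmol)
    return metal / protein_nmol(mass_ug, mol_weight_da)


def is_stoichiometric(ratio, cutoff=CUTOFF):
    """Apply the cutoff used to annotate a target as a metalloprotein."""
    return ratio >= cutoff
```

With made-up numbers, 200 µg of a 25,000 Da protein is 8 nmol; if a 10 nmol standard gives 5000 counts and the sample gives 4400, the metal amount is 8.8 nmol and the ratio is 1.1, which passes the cutoff. A ratio of 0.5, as observed for T1429, does not.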
Functional Annotation of Metal-Binding Proteins
For T763, a zinc/protein stoichiometry of 1.3 was measured; the protein was annotated as a putative amidohydrolase. The BLAST search indicated a close relationship with a zinc-containing carboxypeptidase and an overall similarity with the M40 peptidase family. The COG analysis indicated that the target belongs to a metal-dependent amidase family. A search against PDB found no significant homologies. The annotation of this protein as a metalloprotein is very strongly confirmed by the bioinformatics analysis, although the crystal structure of this protein remains unsolved. T830 is annotated as a hypothetical protein with similarity to an ADP-ribose pyrophosphatase, which is indicated to have a magnesium cofactor. The COG database also indicates that this target belongs to the same enzyme family. No similarity to any structure in the PDB was found. The annotation of T830 as a manganese-containing enzyme is reasonable, as active sites that bind magnesium generally can be exchanged for manganese. Thus, the metal analysis provides evidence that this target is a metal-dependent hydrolase. For T1407, T797, T1403, and T1404, the identification as metalloproteins is well supported by bioinformatics, which, in each case, provides a functional annotation (in terms of enzyme activity) consistent with metal binding by the target. T1407 (binding Ni) is annotated in the alcohol dehydrogenase family (the presence in this target of a metal-binding motif was also seen in PROSITE). A related structure in the PDB is seen to contain Fe. Zinc-containing T797 is a DNA-glycosylase closely related to PDB entry 1nku [PDB], which also contains zinc. T1404, the MazG protein, and the related T1403 are indicated to have pyrophosphatase motifs consistent with zinc binding. In several cases, the metal binding provides a new annotation for proteins of unknown or poorly understood function.
T790, indicated to contain copper, was annotated as a hypothetical protein, and COG indicated an uncharacterized enzyme. A related structure in the PDB is seen to contain Zn. T1405 also is listed as a hypothetical protein predicted to be related to glutamine amidotransferases; the metalloprotein annotation may assist in better understanding its function. In other cases, the proteins have good annotations, but no indication of metal binding, and the metal content may suggest important structural or functional information. For example, T773 is annotated as a monooxygenase and the zinc ion may be related to the protein's catalytic function, or may serve as a structural metal. T788 has over 4 Zn/protein indicated; it is unclear how this may be related to its annotated enzyme function. However, T824, which has over 2 Cu/protein and is annotated as a type-I restriction enzyme, may have metal functions directly related to the DNA cleavage mechanism of this protein. In the case of T818, the indicated zinc atom may represent a false positive, in that the structure of a nearly identical sequence shows no metal atom or indication of a likely metal-binding site. Overall, the metallomics analysis found many metalloproteins among the 143 proteins examined so far. On the basis of the observed annotations, the metal content was, in most cases, very reasonable, and in other cases, potentially informative with respect to protein function. Using the cutoff of measured metal/protein stoichiometry of 0.7, the rate of false positives may be in the range of 5% to 10%. The rate of false negatives cannot be estimated yet without more data. Over the next 18 mo, we expect to screen over 900 additional proteins provided by SGX, such that we can better refine these numbers.
Conclusion: Opportunities and Limitations of the Protein Structure Initiative and the Next Challenge for Structural Biology
Although this sequence coverage and the number of modeled proteins may look impressive, usually only one domain within the ORF sequence of each protein is modeled. On average, proteins have two or three domains. That is, an average yeast ORF codes for 472 amino acid residues, whereas the average size of domains in CATH (Orengo et al. 1997
The next challenge involves understanding the domain interactions and the assembly of proteins into complexes, Figure 2 (Gavin et al. 2002
Metallomics Analysis
We irradiated samples with synchrotron X-rays produced by the NSLS X-ray ring (the ring operates at a constant energy of 2.8 GeV, with the current decaying over time from 280 to 200 mA). The beamline configuration is similar to that used for focused beam X-ray absorption spectroscopy measurements (Chance et al. 1996). A total of 16 sample wells were bored in a Teflon plate, and three plates can be simultaneously loaded onto a multiplate rail. The synchrotron beam is shaped by slits to match the size of the sample well (2.5 x 6.5 mm). After loading samples in sample wells and drying them in a controlled manner, the plates are placed into the rail. The first run consists of selecting the characteristic energies for three metals using the detector software and starting an automated program that positions sample wells in front of the beam and collects the data. A total of 60 1-sec-long counting intervals are summed. The second run screens the same set of 48 samples for another three metals. The total time to complete both runs is about 4 h, or about 4 min/sample.
The validity of the metal determinations was evaluated as follows. We have previously published methods of quantitation for metal atoms in biological samples using X-ray absorption spectroscopy (Chance et al. 1992
Target sequences were retrieved from IceDB in the NYSGXRC Web site (www.nysgxrc.org/nysgxrc-cgi/search_progress_report.cgi), and were analyzed by PSI-BLAST searches against SWISS-PROT (Altschul et al. 1997
We thank Stephen Burley and Steve Almo for advice on this project and Jeff Bonanno for coordinating sample delivery from SGX. Chris Lima kindly analyzed T1429 for the presence of metal atoms by anomalous difference Fourier. This research is supported primarily by a grant from the National Institute of General Medical Sciences under the PSI Program (P50-GM-62529). Additional funding is provided under R01-GM-54762 (A.S.), R33-CA-84699 (A.S.), and the National Institute for Biomedical Imaging and Bioengineering and its Biomedical Technology Centers Program under P41-EB-01979 (M.R.C.). Support from the Sander Family Supporting Foundation, Sun Academic Equipment Grant EDUD-7824-020257-US, an IBM SUR grant, and an Intel computer hardware gift is also acknowledged (A.S.).
6 Corresponding author. E-MAIL mrc{at}aecom.yu.edu; FAX (718) 430-8587. Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.2537904.
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. 1997. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Andreeva, A., Howorth, D., Brenner, S.E., Hubbard, T.J., Chothia, C., and Murzin, A.G. 2004. SCOP database in 2004: Refinements integrate structure and sequence family data. Nucleic Acids Res. 32: D226-D229.
Baker, D. and Sali, A. 2001. Protein structure prediction and structural genomics. Science 294: 93-96.
Benson, D.A., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Rapp, B.A., and Wheeler, D.L. 2002. GenBank. Nucleic Acids Res. 30: 17-20.
Bentley, S.D., Chater, K.F., Cerdeno-Tarraga, A.M., Challis, G.L., Thomson, N.R., James, K.D., Harris, D.E., Quail, M.A., Kieser, H., Harper, D., et al. 2002. Complete genome sequence of the model actinomycete Streptomyces coelicolor A3(2). Nature 417: 141-147.
Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., and Bourne, P.E. 2000. The Protein Data Bank. Nucleic Acids Res. 28: 235-242.
Boeckmann, B., Bairoch, A., Apweiler, R., Blatter, M.C., Estreicher, A., Gasteiger, E., Martin, M.J., Michoud, K., O'Donovan, C., Phan, I., et al. 2003. The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res. 31: 365-370.
Burley, S.K. and Bonanno, J.B. 2003. Structural genomics. Methods Biochem. Anal. 44: 591-612.
Burley, S.K., Almo, S.C., Bonanno, J.B., Capel, M., Chance, M.R., Gaasterland, T., Lin, D., Sali, A., Studier, F.W., and Swaminathan, S. 1999. Structural genomics: Beyond the human genome project. Nat. Genet. 23: 151-157.
Chance, M.R., Sagi, I., Wirt, M.D., Frisbie, S.M., Scheuring, E., Chen, E., Bess Jr., J.W., Henderson, L.E., Arthur, L.O., South, T.L., et al. 1992. Extended x-ray absorption fine structure studies of a retrovirus: Equine infectious anemia virus cysteine arrays are coordinated to zinc. Proc. Natl. Acad. Sci. 89: 10041-10045.
Chance, M.R., Miller, L.M., Fischetti, R.F., Scheuring, E., Huang, W.X., Sclavi, B., Hai, Y., and Sullivan, M. 1996. Global mapping of structural solutions provided by the extended X-ray absorption fine structure ab initio code FEFF 6.01: Structure of the cryogenic photoproduct of the myoglobin-carbon monoxide complex. Biochemistry 35: 9014-9023.
Chance, M.R., Bresnick, A.R., Burley, S.K., Jiang, J.S., Lima, C.D., Sali, A., Almo, S.C., Bonanno, J.B., Buglino, J.A., Boulton, S., et al. 2002. Structural genomics: A pipeline for providing structures for the biologist. Protein Sci. 11: 723-738.
Editorial. 2004. PSI-phase 1 and beyond. Nat. Struct. Mol. Biol. 11: 201.
Eswar, N., John, B., Mirkovic, N., Fiser, A., Ilyin, V., Pieper, U., Stuart, A.C., Marti-Renom, M.A., Madhusudhan, M.S., Yerkovich, B., et al. 2003. Tools for comparative protein structure modeling and analysis. Nucleic Acids Res. 31: 3375-3380.
Fiser, A., Sanchez, R., Melo, F., and Sali, A. 2001. Comparative protein structure modeling. In Computational biochemistry and biophysics (eds. M. Watanabe et al.), pp. 275-312. Marcel Dekker, NY.
Gavin, A.C., Bosche, M., Krause, R., Grandi, P., Marzioch, M., Bauer, A., Schultz, J., Rick, J.M., Michon, A.M., Cruciat, C.M., et al. 2002. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415: 141-147.
Gerstein, M., Edwards, A., Arrowsmith, C.H., and Montelione, G.T. 2003. Structural genomics: Current progress. Science 299: 1663.
Goulding, C.W., Perry, L.J., Anderson, D., Sawaya, M.R., Cascio, D., Apostol, M.I., Chan, S., Parseghian, A., Wang, S.S., Wu, Y., et al. 2003. Structural genomics of Mycobacterium tuberculosis: A preliminary report of progress at UCLA. Biophys. Chem. 105: 361-370.
Guan, J., Almo, S.C., and Chance, M.R. 2004. Synchrotron radiolysis and mass spectrometry: A probe of the actin cytoskeleton. Acc. Chem. Res. 37: 221-229.
Hasnain, S.S. 2004. Synchrotron techniques for metalloproteins and human disease in post genome era. J. Synchrotron. Radiat. 11: 7-11.
Hendrickson, W.A. 1991. Determination of macromolecular structures from anomalous diffraction of synchrotron radiation. Science 254: 51-58.
Holm, L. and Sander, C. 1995. Dali: A network tool for protein structure comparison. Trends Biochem. Sci. 20: 478-480.
____. 1996. Mapping the protein universe. Science 273: 595-603.
John, B. and Sali, A. 2003. Comparative protein structure modeling by iterative alignment, model building, and model assessment. Nucleic Acids Res. 31: 3982-3992.
Lesley, S.A., Kuhn, P., Godzik, A., Deacon, A.M., Mathews, I., Kreusch, A., Spraggon, G., Klock, H.E., McMullan, D., Shin, T., et al. 2002. Structural genomics of the Thermotoga maritima proteome implemented in a high-throughput structure determination pipeline. Proc. Natl. Acad. Sci. 99: 11664-11669.
Liu, J. and Rost, B. 2001. Comparing function and structure between entire proteomes. Protein Sci. 10: 1970-1979.
Lujan, H.D., Mowatt, M.R., Wu, J.J., Lu, Y., Lees, A., Chance, M.R., and Nash, T.E. 1995. Purification of a variant-specific surface protein of Giardia lamblia and characterization of its metal-binding properties. J. Biol. Chem. 270: 13807-13813.
Murzin, A.G., Brenner, S.E., Hubbard, T., and Chothia, C. 1995. SCOP: A structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247: 536-540.
Orengo, C.A., Michie, A.D., Jones, S., Jones, D.T., Swindells, M.B., and Thornton, J.M. 1997. CATH: A hierarchic classification of protein domain structures. Structure 5: 1093-1108.
Pieper, U., Eswar, N., Stuart, A.C., Ilyin, V.A., and Sali, A. 2002. MODBASE, a database of annotated comparative protein structure models. Nucleic Acids Res. 30: 255-259.
Pieper, U., Eswar, N., Braberg, H., Madhusudhan, M.S., Davis, F., Stuart, A.C., Mirkovic, N., Rossi, A., Marti-Renom, M.A., Fiser, A., et al. 2004. MODBASE, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Res. 32: D217-D222.
Rajashankar, K., Chance, M.R., Burley, S.K., Jiang, J.S., Almo, S.C., Bresnick, A.R., Hunag, R., He, G., Chen, H., Sullivan, M., et al. 2001. Structural genomics at the National Synchrotron Light Source. NSLS Activity Report 2002: 2-28 to 2-32.
Reboul, J., Vaglio, P., Rual, J.F., Lamesch, P., Martinez, M., Armstrong, C.M., Li, S., Jacotot, L., Bertin, N., Janky, R., et al. 2003. C. elegans ORFeome version 1.1: Experimental verification of the genome annotation and resource for proteome-scale protein expression. Nat. Genet. 34: 35-41.
Sali, A. 1995. Comparative protein modeling by satisfaction of spatial restraints. Mol. Med. Today 1: 270-277.
____. 1998. 100,000 protein structures for the biologist. Nat. Struct. Biol. 5: 1029-1032.
Sali, A., Glaeser, R., Earnest, T., and Baumeister, W. 2003. From words to literature in structural proteomics. Nature Insight 422: 216-225.
Sanchez, R. and Sali, A. 1998. Large-scale protein structure modeling of the Saccharomyces cerevisiae genome. Proc. Natl. Acad. Sci. 95: 13597-13602.
Sanchez, R., Pieper, U., Mirkovic, N., de Bakker, P.I., Wittenstein, E., and Sali, A. 2000. MODBASE, a database of annotated comparative protein structure models. Nucleic Acids Res. 28: 250-253.
Schaffer, A.A., Aravind, L., Madden, T.L., Shavirin, S., Spouge, J.L., Wolf, Y.I., Koonin, E.V., and Altschul, S.F. 2001. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 29: 2994-3005.
Shi, W., Ostrov, D., Gerchman, S., Kycia, H., Studier, W., Edstrom, W., Bresnick, A.R., Ehrlich, J., Blanchard, J., Almo, S.C., et al. 2003. High-throughput structural biology and proteomics. In Protein chips, biochips, and proteomics: The next phase of genomics discovery, Chapter 12, pp. 299-324. Marcel Dekker, NY.
Stein, L.D., Bao, Z., Blasiar, D., Blumenthal, T., Brent, M.R., Chen, N., Chinwalla, A., Clarke, L., Clee, C., Coghlan, A., et al. 2003. The genome sequence of Caenorhabditis briggsae: A platform for comparative genomics. PLoS Biol. 1: E45.
Summers, M.F., Henderson, L.E., Chance, M.R., Bess Jr., J.W., South, T.L., Blake, P.R., Sagi, I., Perez-Alvarado, G., Sowder III, R.C., Hare, D.R., et al. 1992. Nucleocapsid zinc fingers detected in retroviruses: EXAFS studies of intact viruses and the solution-state structure of the nucleocapsid protein from HIV-1. Protein Sci. 1: 563-574.
Szpunar, J. 2004. Metallomics: A new frontier in analytical chemistry. Anal. Bioanal. Chem. 378: 54-56.
Terwilliger, T.C., Park, M.S., Waldo, G.S., Berendzen, J., Hung, L.W., Kim, C.Y., Smith, C.V., Sacchettini, J.C., Bellinzoni, M., Bossi, R., et al. 2003. The TB structural genomics consortium: A resource for Mycobacterium tuberculosis biology. Tuberculosis (Edinb) 83: 223-249.
Tompa, P. 2002. Intrinsically unstructured proteins. Trends Biochem. Sci. 27: 527-533.
Tong, A.H., Lesage, G., Bader, G.D., Ding, H., Xu, H., Xin, X., Young, J., Berriz, G.F., Brost, R.L., Chang, M., et al. 2004. Global mapping of the yeast genetic interaction network. Science 303: 808-813.
Vitkup, D., Melamud, E., Moult, J., and Sander, C. 2001. Completeness in structural genomics. Nat. Struct. Biol. 8: 559-566.
Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. 2002. Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520-562.
Westbrook, J., Feng, Z., Jain, S., Bhat, T.N., Thanki, N., Ravichandran, V., Gilliland, G.L., Bluhm, W., Weissig, H., Greer, D.S., et al. 2002. The Protein Data Bank: Unifying the archive. Nucleic Acids Res. 30: 245-248.
Westbrook, J., Feng, Z., Chen, L., Yang, H., and Berman, H.M. 2003. The Protein Data Bank and structural genomics. Nucleic Acids Res. 31: 489-491.
Wheeler, D.L., Chappey, C., Lash, A.E., Leipe, D.D., Madden, T.L., Schuler, G.D., Tatusova, T.A., and Rapp, B.A. 2000. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 28: 10-14.
Zhang, C. and Kim, S.H. 2003. Overview of structural genomics: From structure to function. Curr. Opin. Chem. Biol. 7: 28-32.
www.nigms.nih.gov/psi; NIH Web site providing information and relevant links for the Protein Structure Initiative.
http://targetdb.pdb.org; Web site operated by the Protein Data Bank to allow searching of targets from the structural genomics centers.
www.nysgxrc.org; Web site operated by the NYSGXRC. Its functions are to provide a public target list and progress as well as to allow consortium members to enter target data.
http://salilab.org/modbase; MODBASE, a comprehensive database of comparative protein structure models.
www-archbac.u-psud.fr/genomics/COG_Guess.html; Clusters of Orthologous Groups Database Query Page to perform similarity search in COG database. This provides a function and COG category guess for input sequence.
http://salilab.org/modbase/models_nysgxrc.html; Summary and statistics of homology modeling results using the NYSGXRC PDB structures as templates.
Received March 3, 2004; accepted in revised format May 12, 2004.