Interdisciplinary workshop promoting collaboration
Summary of the impact of the workshop
Web site: http://solve.lanl.gov/santa_fe_workshop.html

This
workshop was a meeting to foster advances in technology for macromolecular
structure determination by encouraging and enabling developers (of
software, methods and automation) in the field of structural genomics and
theorists to work together. The
participants were expected to focus their presentations and discussion on
working together in order to foster collaboration and innovation.
Additionally, attendees included structural biologists as well as
theorists interested in the analysis of protein structure. The meeting was
well-attended (35 attendees) and there was very strong enthusiasm about
the level of cooperation and collaboration that was developed at the
meeting. The participants drew up a list of areas where collaboration was
either begun during the meeting or could clearly begin after it; these
included coordination in the development of crystallographic
infrastructure (several projects are
devoted to automation; their libraries and definitions can be
coordinated), automated
identification of symmetries in “heavy-atom” positions in crystals (a
collaboration was begun at the meeting in this area), expert systems for automatic data collection and iterative
model-building at moderate resolution (being developed by several groups),
among other areas. In the spirit of collaboration, most of the participants have agreed to place their talks on the web, including unpublished data, and these can be viewed along with summaries of all the talks by anyone at the site http://solve.lanl.gov/santa_fe_workshop.html.
March 22-23, 2002
La Fonda Hotel
Sponsored by the Institute for Complex Adaptive Matter

Areas of Collaboration identified during the workshop
Program, presentations, links, and summaries

Collaboration and Automation of Structure Determination (Chair: Samar Hasnain)
Eleanor
Dodson, University of York
CCP4 Philosophy and Success in Cooperation

Dr. Dodson discussed the long-running CCP4 collaborative project, which has been promoting macromolecular crystallography and collaboration for over 20 years. The project was designed as a resource for the community, with goals including software development, teaching, and workshops. The early emphasis was on workshops, with the development of a comprehensive software package occurring over the years.

The CCP4 philosophy of software is modularity; this allows seamless replacement of one module with another as methods change. Very important additional features of CCP4 include common data structures and library routines.

Dr. Dodson pointed out that collaboration has benefits and complications. Benefits include common systems and a reduction of duplication, but complications include the need to really discuss things, which is time consuming. Additionally, the allocation of credit is more difficult if many people are involved. To move into a modern
computing environment, Dr. Dodson points out, there has to be a symbiosis
between those who really have the problems and computer scientists and
other developers. Data
management is urgently needed, so
information can be passed from one stage to another in such a way that it
doesn’t have to be entered more than once. Paul
Adams, Lawrence Berkeley Laboratory
PHENIX:
A New Framework for Collaborative Software Development (download
presentation [PDF] 3 MB) Dr. Adams discussed the
PHENIX project, a collaborative project with a goal of developing a
comprehensive system for crystallographic structure determination. The
strategy is to use a scripting language (python) to link core routines and
data objects in a way that is easy to manipulate. This allows rapid
development of ideas, with optimization coming later. From the point of
view of the user, the system will provide automated structure solution,
but also the flexibility for individual approaches.
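To make the scripting idea concrete, here is a minimal sketch of linking crystallographic tasks and data objects through a Python layer; the task and object names are hypothetical illustrations, not the actual PHENIX interfaces.

    # Hypothetical sketch of a Python scripting layer that chains tasks;
    # the names below are illustrative, not the real PHENIX API.
    class DataObject(dict):
        """Carries reflections, sites, models, etc. between tasks."""

    def scale_data(data):
        data["scaled"] = True                 # placeholder for a real scaling step
        return data

    def locate_heavy_atoms(data):
        data["sites"] = [(0.1, 0.2, 0.3)]     # placeholder substructure search
        return data

    def phase_and_build(data):
        data["model"] = "initial model"       # placeholder phasing and building
        return data

    def run_pipeline(data, tasks):
        for task in tasks:                    # tasks are linked at run time, so
            data = task(data)                 # strategies are easy to rearrange
        return data

    result = run_pipeline(DataObject(raw="reflections.mtz"),
                          [scale_data, locate_heavy_atoms, phase_and_build])
    print(result["model"])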
The core of the system is a low-level framework for
crystallographic programming. Dr. Adams talked about how
the PHENIX infrastructure can be shared. Source code will be distributed
to academic groups. Other developers can “plug in” their work to the
PHENIX infrastructure and thereby make use of it.
For example, the PyMOL visualization software developed by Warren
DeLano was made a plug-in to PHENIX and can be used by PHENIX users. The high-level programming
environment in PHENIX is designed to let users (or developers) link tasks
easily and allow the creation of complex algorithms. It allows the incorporation of external programs as well. There was discussion of how
important it may be not to duplicate too much effort on low-level routines; in
particular, the CCP4 project and the PHENIX project already have some
duplication at this level. Ralf Grosse-Kunstleve,
Lawrence Berkeley National Laboratory The Computational Crystallography
Toolbox: Open Source Tools for Crystallographic Software Development (download
presentation [PPT] - 0.3 MB) Dr. Grosse-Kunstleve
talked about a set of building blocks for PHENIX that he emphasized are
really building blocks for everyone’s benefit and use. They are available as open-source code. The cctbx toolbox has
fundamental algorithms for computational crystallography, including space
group (symmetry), crystallographic, and standard array calculations. The code for cctbx can be
obtained over the web through the SourceForge project, an open-source
organization with some 380,000 users.
Ralf described how SourceForge is very nicely set up to allow one
developer to put in software, then allow another developer to make changes
and allow both to make use of each other’s work. Ralf discussed the
trade-off between high-level languages (easy to develop and debug, but
slow) and lower-level languages (harder to develop, faster). The balance
he chose is using a very high-level scripting language (Python) whenever
practical, and a moderate-level programming language (C++) for routines
requiring speed. These two
languages are quite similar, particularly in the use of classes, memory
management and exception handling, and are complementary in their
flexibility and speed. The two languages can be integrated nicely using the
Boost.Python library system (also open-source).
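As a flavor of what the Python side of this mixed-language design looks like, here is a small example of space-group calculations with the open-source cctbx; the calls shown are from memory and exact names may vary between cctbx versions.

    # Illustrative use of the cctbx from Python; the C++ core is exposed
    # through Boost.Python wrappers, so these objects are thin Python views.
    from cctbx import crystal, sgtbx

    # Crystal symmetry from a unit cell and space-group symbol.
    symmetry = crystal.symmetry(
        unit_cell=(78.0, 78.0, 37.0, 90, 90, 90),
        space_group_symbol="P 43 21 2")
    print(symmetry.unit_cell())

    sg_info = sgtbx.space_group_info("P 43 21 2")
    print(sg_info.type().number())     # space-group number (96 for P43212)
    print(sg_info.group().order_z())   # number of symmetry operations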
The documentation in the cctbx toolbox is embedded directly in the
C++ source code, supplemented with external web documentation
(automatically produced). To generate versions for
as many platforms as possible, the cctbx toolbox uses Python to
automatically generate Makefiles using a set of dependencies that are
defined in simple Python scripts. The
expectation is that end-users will usually get binaries.
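The build idea can be pictured with a small sketch of Python writing a Makefile from dependencies declared as plain data; this is illustrative only, not the actual cctbx build scripts.

    # Illustrative generation of a Makefile from dependencies held in Python.
    targets = {
        "sgtbx.o":  ["sgtbx.cpp", "sgtbx.h"],
        "uctbx.o":  ["uctbx.cpp", "uctbx.h"],
        "cctbx.so": ["sgtbx.o", "uctbx.o"],
    }
    rules = {".o": "\t$(CXX) -c $< -o $@",
             ".so": "\t$(CXX) -shared $^ -o $@"}

    with open("Makefile", "w") as mk:
        for target, deps in targets.items():
            mk.write("%s: %s\n" % (target, " ".join(deps)))
            mk.write(rules[target[target.rfind("."):]] + "\n\n")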
The cctbx toolbox can also be used as a pure C++ library by other developers. There was discussion of how to control the quality of an open-source project; the answer is that normally many people can contribute, but someone acts as the curator.

Wladek Minor, University of Virginia

Critical
Components of Automatic Determination of Protein Structures - What we can
share and coordinate

Dr. Minor started out by discussing bottlenecks in structural biology, pointing out that many of the individual steps are now fairly well under control, but information flow is often limiting. Automation needs to work about 70% of the time for certain classes of cases, or it isn’t that useful, he emphasized. In practice, solving structures with X-ray crystallography can sometimes take as little as an hour or so, but other times as long as years. It is key to identify early on whether the data are going to be useful; this is an information flow (and analysis) problem.

Dr. Minor talked about databases in structural genomics, particularly at the Midwest Center for Structural Genomics, and how to make them real-time and self-consistent. HKL2000 is software for processing diffraction data and for managing part of the data flow; for example, HKL2000 can search to see whether a sequence similar to that of a target protein belongs to a protein already in the PDB. A big question is what information is shared. Dr. Minor presented a system that provides automatic reports for everyone and at the same time allows members of the project to share all data comfortably among the different groups involved. The constant change of data formats, even for well-established detectors, is a big issue (there are 170 image formats that denzo can recognize). Dr. Minor emphasized that
the flow of information among groups requires not only standardization of the information transfer (such as XML definitions) but also self-consistent information. This information has to be derived from an authoritative source, the project database. There was some discussion about the sharing of information by structural genomics centers, as the data in these projects should be very open. The challenge is to create a system that is not only open but also provides all information in real time. Such a system is necessary to avoid duplication of effort.
There was also discussion of how to deal with the many data formats. One suggestion was that it makes the most sense to have code that recognizes all formats and returns the information in a standard way; a rough sketch of this idea appears below. Some people felt that implementation of the Image_cif format could be the solution for the future.
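The sketch below assumes invented format names and header fields; it is not code from HKL2000 or any other package.

    # Sketch of a dispatch-based image reader that hides detector format
    # differences behind one standard result; formats and fields are invented.
    from dataclasses import dataclass

    @dataclass
    class ImageHeader:
        detector: str
        wavelength: float
        oscillation: float

    def read_adsc(path):
        return ImageHeader("ADSC", 0.9793, 1.0)   # placeholder vendor parser

    def read_mar(path):
        return ImageHeader("MAR", 0.9793, 0.5)    # placeholder vendor parser

    READERS = {".img": read_adsc, ".mar2300": read_mar}

    def read_image(path):
        """Recognize the format from the file name and return a standard header."""
        for suffix, reader in READERS.items():
            if path.endswith(suffix):
                return reader(path)
        raise ValueError("unrecognized image format: %s" % path)

    print(read_image("lysozyme_001.img"))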
There was also some discussion of how others can work with HKL2000; the source code is available for inspection and discussion in the program authors' laboratories.

General discussion

General discussion included
questions of credit; allocating credit is difficult in a collaborative
environment. For CCP4, the programs identify publications to cite as a way
around this. An idea was suggested that databases of structure
determination should keep track of credit.
Another discussion concerned how to collaborate with each of these groups.

Beamline Automation I.
(Chair: Wladek Minor) Masashi
Miyano and Takashi Kumasaka, Riken Harima Institute at Spring-8
Full
Automation Approach of Crystallization Screening and its Observation Dr. Miyano described the
RIKEN structural genomics/proteomics
initiative (RSGI) and the Spring8 beamlines that are contributing
to it. These beamlines (and
NMR facilities) plan to solve 1500 structures in 5 years, including
Thermus thermophilus and M. tuberculosis proteins, as well as mouse, human,
and plant proteins from cDNAs. He described a high-throughput protein production system applied to hyperthermophilic protein production.
This system feeds into a robot-based crystallization setup called “TERA”, with automated dispensing of
precipitant solutions, automated visualization, and storage for a large number of
crystallization trials. So far the scoring of crystallization results is manual.
Crystals are stored on bar-coded pins and are to be automatically
mounted at the beamlines. Prototype sample changers were described. Thomas
Earnest, Lawrence Berkeley National Laboratory
Development
of Robotic Crystal Mounting and Alignment Systems for Biological
Crystallography at the Advanced Light Source Dr. Earnest talked about 5
fully operational beamlines at the ALS and several new superbend beamlines
and their capabilities. One particularly important feature of some of
these is exceptional optics for visualizing crystals. He emphasized that
high-throughput structure determination is incompatible with human
intervention. Manual intervention is too slow, too tedious, and wastes time. Robotic systems
have been developed to do what the person might do for most of the steps
in collecting data, including sample mounting, crystal centering and data
collection. An important strategy is to have the capability to
automatically run through a set of many crystals, then to choose the best
ones and collect data on them. The plan is to develop a 3-D representation
of the crystal, to automatically identify the location of the crystal and
to center it and screen or collect data on it.
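The run-through-then-choose strategy can be pictured with a short sketch; the mounting, centering, and scoring calls here are simple placeholders, not a real ALS control interface.

    # Hypothetical screen-then-collect loop; beamline calls are placeholders.
    def mount(sample):        print("mounting", sample)
    def center(sample):       print("centering", sample)
    def quick_shot(sample):   return {"resolution": 2.0 + hash(sample) % 3}
    def score(image):         return -image["resolution"]       # higher is better
    def collect_full(sample): print("collecting full data set on", sample)

    def screen_and_collect(samples, n_best=3):
        ranked = []
        for sample in samples:            # run quickly through every crystal
            mount(sample)
            center(sample)
            ranked.append((score(quick_shot(sample)), sample))
        ranked.sort(reverse=True)         # then collect only on the best ones
        for _, sample in ranked[:n_best]:
            collect_full(sample)

    screen_and_collect(["pin-%02d" % i for i in range(8)], n_best=2)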
Dr. Earnest emphasized that “smart” software is the key to the
next stages in automation of beamlines. The strategy at the ALS for
development of hardware is to build it from scratch rather than trying to
link existing hardware that is not designed specifically for
crystallography. Their
overall plan is to link data collection, processing, and analysis with
high-level software. The
hardware and details of construction have been shared with several other
facilities. An item of discussion was
the compatibility of pins for sample mounting.
Dr. Earnest described how the two California groups developing
sample mounting have now agreed on compatible pins.

Liz Duke, Daresbury Laboratory

Towards Automated Beamlines at Daresbury

Dr. Duke described the
efforts at Daresbury SRS to automate x-ray crystallographic efforts. There
are 5 operational beamlines at the SRS and one more planned.
Current efforts on automation are beginning with identification of
all motions necessary in the hutch (which would be carried out with
outside control) and building robotics to carry these out. The Daresbury
group plans to use an unscheduled beamline to replicate a highly-used
beamline and to develop all the hardware.
They have a constraint on automation; all operations have to be
possible both with and without automation so that certain types of
crystals (viruses) can be done in capillaries. Dr. Duke emphasized that
collaboration between instrumentation engineers and software engineers is
required, and that the automation project involves a very large amount of work.
They are weighing the possibilities of collaboration against
purchasing commercial equipment. Dr. Duke also mentioned that the Diamond
synchrotron protein crystallography beamlines are being planned from the
start with automation in mind. Ab-initio and Direct Methods
Phasing I. (Chair:
Alexandre Urzhumtsev) Charles
Weeks, Hauptman-Woodward Med. Res. Inst. Integrating
Direct Methods into Automated Protein-Phasing Packages (download
presentation [PPT] - 0.7 MB) Dr. Weeks
described how to go from solving a heavy-atom substructure with direct
methods to phasing a full protein structure.
This is a problem of integrating a method for substructure solution
with methods for substructure refinement, density modification, and model
building. He discussed the
BnP (Buffalo and Pittsburgh) system that combines the SnB direct-methods
program with normalization and substructure-comparison routines as well as
components from the macromolecular crystallography package PHASES. All of these programs are linked together with a Java-based
GUI. Dr. Weeks
emphasized that the timely identification of which direct-methods trial
substructures are correct is critical.
He discussed methods for automating this decision step using
different scoring methods. The
best included consideration of the improvement in various figures of merit
obtained in going from the starting values to the final refined values. A second important step in the automation process is
identifying how many of the top peaks in the direct-methods solution
should be included in subsequent protein phasing.
Correct sites tend to occur reproducibly in different (independent) solutions, but false sites do not; a rough sketch of such a reproducibility check appears below.
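The toy version below uses invented fractional coordinates and a simple distance tolerance; a real comparison would also need to handle space-group symmetry, origin shifts, and the enantiomorph ambiguity.

    # Count how often each candidate site from the first trial recurs in the
    # other independent trials; recurring sites rank highest.
    def site_close(a, b, tol=0.02):
        return all(abs(x - y) < tol for x, y in zip(a, b))

    def reproducibility(trials):
        counts = []
        for site in trials[0]:
            n = sum(any(site_close(site, other) for other in trial)
                    for trial in trials)
            counts.append((n, site))
        return sorted(counts, reverse=True)

    trials = [
        [(0.12, 0.40, 0.33), (0.80, 0.05, 0.61), (0.55, 0.72, 0.09)],
        [(0.12, 0.41, 0.33), (0.80, 0.05, 0.60), (0.21, 0.90, 0.44)],
        [(0.13, 0.40, 0.32), (0.79, 0.06, 0.61), (0.67, 0.18, 0.95)],
    ]
    for count, site in reproducibility(trials):
        print(count, site)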
Better communication between BnP and other software would increase the
possibilities for automation, and Dr. Weeks requested help in achieving
this. For example, the BnP
control file (known as a “configuration” file) could be preloaded at
the beamline where the X-ray data is collected.
Collaborations with the authors of complementary procedures (e.g.
the identification of NCS symmetry involving heavy-atom sites) are also
sought. Quan
Hao, Cornell University Direct
Methods and SAD Phasing Dr. Hao described how
single-wavelength phasing and direct methods can be integrated. The
information necessary for phasing can be obtained from either
high-resolution data or anomalous diffraction (or SIR). The protocol used
included CCP4-based programs SAPI (tangent-formula refinement), ABS
(identification of enantiomer), and OASIS (single-wavelength phasing using
direct methods and Sim weighting). He described possible collaborations
involving maximum-likelihood and other statistical methods in SAPI and
OASIS. Angel
Garcia, LANL All-atom Studies of the
Folding/Unfolding Transition Dr. Garcia described an
area quite different than many of the previous speakers, simulations of
protein folding. He emphasized new simulation methods that are applicable
to the folding of proteins and peptides, and that indirectly are quite
relevant to crystallography. He described the method of replica exchange
for simulation of protein folding over long (microsecond) time scales. The method involves Monte Carlo-based exchanges of
configurations among a large set of parallel simulations run at different temperatures. Whenever a
particular simulation finds itself in a bad configuration (high enthalpy)
it heats up (exchanging with a replica at higher temperature), and whenever
it is in a good configuration, it cools off further.
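A toy sketch of the exchange rule, using a one-dimensional energy function in place of a protein force field, may help make the mechanism concrete; the potential, temperatures, and step sizes are invented for the example.

    # Toy replica-exchange (parallel tempering) on a 1-D "enthalpy" landscape.
    import math, random

    def energy(x):
        return (x * x - 1.0) ** 2            # toy double-well potential

    temps = [1.0, 2.0, 4.0, 8.0]              # replicas at increasing temperature
    replicas = [random.uniform(-2, 2) for _ in temps]

    for step in range(10000):
        # ordinary Metropolis move within each replica
        for i, T in enumerate(temps):
            trial = replicas[i] + random.gauss(0, 0.1)
            dE = energy(trial) - energy(replicas[i])
            if dE < 0 or random.random() < math.exp(-dE / T):
                replicas[i] = trial
        # attempt to swap neighbouring replicas: high-enthalpy states drift
        # toward high temperature, low-enthalpy states cool off further
        i = random.randrange(len(temps) - 1)
        dBeta = 1.0 / temps[i] - 1.0 / temps[i + 1]
        dE = energy(replicas[i]) - energy(replicas[i + 1])
        if random.random() < math.exp(min(0.0, dBeta * dE)):
            replicas[i], replicas[i + 1] = replicas[i + 1], replicas[i]

    print("coldest replica position:", round(replicas[0], 2))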
This method is able to simulate the folding/unfolding transition
with temperature, showing in detail the distribution of native-like
contacts as a function of temperature.
A surprise was that states exist where the main-chain rmsd to the
native structure is low (2 A) but a contact map of all atoms within 6 A of
each other is almost completely non-native-like. The method can
characterize thermodynamics such as enthalpy, entropy and specific heat. Beamline automation II.
(Chair: Thomas Earnest) Steve
Kinder, Daresbury Laboratory Data
Acquisition Developments and Collaborations at Daresbury Laboratory Dr. Kinder discussed
experiences in collaboration at Daresbury, including PXGEN++ and DNA. He
emphasized that collaboration is not so easy, and it is a good idea to
practice with friends first. One
project he described was an update to software (PXGEN) for x-ray data
collection. The objectives in the project were well-defined, and the
project was largely within Daresbury laboratory, making collaboration a
little easier than it might otherwise have been.
The PXGEN++ software was designed to have a common GUI independent
of the detector, and standard controls for various beamlines.
The second project is DNA, a much larger collaborative effort to
automate collection and processing of protein crystallography data,
involving several laboratories (LMB Cambridge, ESRF, Daresbury). The DNA
system involves an expert system that controls data collection and the
MOSFLM data processing software. Dr. Kinder emphasized that this group is very interested in
collaboration, particularly in ideas and in sharing code. Joseph
Brunzelle, Northwestern University Automated
Structure Determination: Database Driven and Ant-Controlled Dr.
Brunzelle described a Python-based system for linking together existing
algorithms for structure determination in a flexible way that allows
crossover among methods. The
general approach he used is a Python and web-based system.
It is suitable for incorporating multiple approaches and for
including decision-making (AI). A web-based database interface using
“Slither” is used. The
mySQL database itself is a relational database including the raw data and
the analysis, as it proceeds. The “ants” correspond to existing
software, wrapped in a standard way so that the data format is not an
issue. The “queen ant” controls the worker ants who carry out standard
operations on crystallographic data and return the results; a minimal sketch of such a wrapper appears below.
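In the sketch below the program name, flags, and output parsing are placeholders, not the actual system described in the talk.

    # Hypothetical "worker ant": each external program is wrapped so that the
    # controller sees one uniform interface regardless of the program's own
    # input and output conventions.
    import subprocess

    class Ant:
        def __init__(self, executable, build_args, parse_output):
            self.executable = executable
            self.build_args = build_args      # task dict -> command-line args
            self.parse_output = parse_output  # raw stdout -> standard result dict

        def run(self, task):
            cmd = [self.executable] + self.build_args(task)
            out = subprocess.run(cmd, capture_output=True, text=True)
            return self.parse_output(out.stdout)

    # Example registration of one wrapped program (purely illustrative).
    scale_ant = Ant("scale_program",
                    lambda task: ["--input", task["hkl_file"]],
                    lambda text: {"status": "ok", "log": text})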
A discussion question was how to maintain a software suite that depends on
many other pieces of independent software. The answer was that it would be
a lot easier if everyone could use xml tags.
In practice it is not so hard to keep up with software changes, as
template files are used to control input. Model-building. (Chair: Todd
Yeates) Tom
Ioerger, Texas A&M University The
TEXTAL System for Automated Model Building (download
presentation [PPT]) Dr. Ioerger described a
pattern-recognition approach to interpreting electron density maps and
building an atomic model. The
TEXTAL approach is to examine regions of density in a map and to ask
whether similar patterns have been seen in other maps. The implementation
involves feature extraction from previously-interpreted maps to yield a
database of patterns. The
TEXTAL system has three main steps: identification of Cα positions, identification of side chains, and then sequence and real-space refinement. The recognition of Cα positions is not as simple as it might seem, and Dr. Ioerger focused his discussion on this step. The approach involves normalization of the map, skeletonization, calculation of features, use of a neural network to identify likely Cα positions from these features, and then selection of which ones to actually link together; a small sketch of the feature-scoring idea appears below.
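In the sketch below, shell averages stand in for the real features and a single logistic unit stands in for the trained neural network; none of it is TEXTAL code, and the map is an invented array.

    # Rotation-invariant shell features around a candidate point, scored by a
    # stand-in "network"; map, features, and weights are invented.
    import numpy as np

    def shell_means(grid, center, radii=(2, 4, 6)):
        """Mean density in concentric spherical shells (radii in voxels)."""
        i, j, k = np.indices(grid.shape)
        dist = np.sqrt((i - center[0])**2 + (j - center[1])**2 + (k - center[2])**2)
        feats, lo = [], 0
        for r in radii:
            shell = (dist >= lo) & (dist < r)
            feats.append(grid[shell].mean())
            lo = r
        return np.array(feats)

    def score_candidate(features, weights, bias):
        """Single logistic unit as a stand-in for the trained neural network."""
        return 1.0 / (1.0 + np.exp(-(features @ weights + bias)))

    rng = np.random.default_rng(0)
    density_map = rng.random((32, 32, 32))        # fake normalized map
    feats = shell_means(density_map, (16, 16, 16))
    print(score_candidate(feats, np.array([0.8, -0.2, 0.1]), -0.3))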
Dr. Ioerger discussed areas of collaboration that he is interested in, including
making the tracing tools available, incorporating real-space refinement
and iteration with density modification. David
Levitt, University of Minnesota MAID:
An Automated Electron Density Fitting Program Applicable to Low Resolution
Maps (download
presentation [PPT] - 1.3 MB) Dr. Levitt talked about how
he has automated what a skilled crystallographer does when they build a
model of a protein into an electron density map.
The MAID software was developed using a 292-residue dehydrogenase
SAD map at 2.5 A, which it can fit 80% of with about 0.53 A rms for main
chain atoms. The overall approach used
is to first find the segments of secondary structure (helix and sheet),
then to extend them. During the building procedure real-space molecular
dynamics are used to optimize geometry and the fit.
This is all done in several iterative steps. First the helices and
sheets are found, then connections are found and several connected helices
are obtained. Particular attention is paid to being conservative so as not
to introduce errors that may be difficult to remove later.
These are then used to identify the fit to the amino acid sequence.
Then, using the amino acid sequence assignment, the extension and building
of loops is made much easier. The sequence assignment is done by testing each possible alignment at each position in the partially built model; a small sketch of this idea appears below.
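The toy docking below uses invented per-residue scores in place of real side-chain density fits; it simply slides the sequence along the built chain and keeps the best-matching offset.

    # Toy sequence-to-chain docking with invented "bulkiness" scores.
    fit = {"A": 0.2, "G": 0.1, "L": 0.6, "W": 0.9, "F": 0.8, "S": 0.3}

    def dock_sequence(chain_scores, sequence):
        """chain_scores: per-residue density scores for the built chain (invented)."""
        best = None
        for offset in range(len(sequence) - len(chain_scores) + 1):
            window = sequence[offset:offset + len(chain_scores)]
            score = -sum((fit[aa] - obs) ** 2 for aa, obs in zip(window, chain_scores))
            if best is None or score > best[0]:
                best = (score, offset)
        return best

    observed = [0.85, 0.25, 0.65, 0.15]           # densities for four built residues
    print(dock_sequence(observed, "GAWSLGFLAG"))  # (best score, best offset)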
Dr. Levitt talked about the idea of iterating the model-building process. It seemed that using the
initial model built by MAID, combining σA-weighted phases with the initial
phases, the next cycle of model-building would be much better.
Initial tests didn’t show improvement, but the discussion suggested a
number of additional ideas that might be used to make iterative
model-building very useful. Phase combination, MR, and
Refinement I. (Chair: Eleanor Dodson) Brent
Segelke, Lawrence Livermore Shake&wARP
as a Local Real Space Validation Tool

In addition to Shake&wARP itself, Dr. Segelke described Autosolve, another tool developed at Livermore: an automated
procedure for molecular replacement. It includes a search of the PDB and
automated initial modeling of the structure of interest based on
structures with similar sequences from the PDB, a series of molecular
replacement attempts, and output of a map, a partially refined structure
and annotation of quality. Michael
Chapman, Florida State University Real-space
Simulated Annealing Refinement, a Tool in Model-Building, and a Paradigm
for Holistic Refinement (download
presentation - 1.5 MB) Dr. Chapman described the
use of real-space refinement in building structures and why it helps. He
started with several examples, including one of an SAD/molecular
replacement case where real-space refinement followed by reciprocal-space
refinement gave an improvement over just reciprocal-space refinement
comparable to that obtained with manual intervention, but automatically
and much more rapidly. Dr. Chapman described
another application of real-space refinement, including electron
microscopy and solid-state NMR. A
new direction in the work is to re-examine force fields used for bonded
and non-bonded geometries. For example hydrogen-bonding force-fields and
restraints including directionality at the acceptor and stringent criteria
for identifying hydrogen bonds improved the free R-factor for structures
at moderate resolution (3A). Dr. Chapman noted that this procedure should
only be used towards the end of refinement at moderate resolution. Dr. Chapman emphasized that
the real-space procedure is dependent on many other packages and that he
is very interested in combining the methods with others.
In particular, the methods have clear application to automated
model-building. The programs are available at http://www.sb.fsu.edu/~chapman.

Ab-initio and Direct Methods
Phasing II. (Chair: Charles Weeks) Alexandre
Urzhumtsev, Universite Henri Poincare-Nancy Improved
and Alternative Means for Phasing (download
presentation [PPT] - 3 MB) Dr. Urzhumtsev pointed out
that there are many proteins that do not yield great crystals, but that
there is a lot of information that we have in advance that can help in
solving the structure. In the
case of molecular replacement, he described how to modify scoring
procedures to evaluate possible rotation function solutions. The method is
based on the clustering of solutions, with a hypothesis that the most
densely-populated clusters are most likely to be correct.
For the translation function in molecular replacement, a search
with low-resolution data including a solvent model greatly improved the
capability of identifying the correct solution. Another method described by
Dr. Urzhumtsev was to use low-resolution direct phasing to identify
positions of the molecules in the cell.
The approach requires a way to separate good maps from poor ones.
In particular, the number, position, and shape of connected regions
are very useful. The method
includes starting from random phase sets, selecting those that are connected (many connected regions of the same size; few isolated regions of density), averaging the best, and then iterating the process; a rough sketch of such a connectivity criterion appears below.
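The sketch uses scipy's connected-component labelling on an invented smoothed map; the threshold and the scoring formula are arbitrary illustrative choices, not the actual method described in the talk.

    # Score a low-resolution map by the connectivity of its high density:
    # prefer a few large connected regions over many isolated fragments.
    import numpy as np
    from scipy import ndimage

    def connectivity_score(rho, cutoff_sigma=1.0):
        mask = rho > rho.mean() + cutoff_sigma * rho.std()
        labels, n_regions = ndimage.label(mask)
        if n_regions == 0:
            return 0.0
        sizes = np.bincount(labels.ravel())[1:]          # voxels per region
        return sizes.max() / (sizes.sum() * n_regions)   # big blob, few fragments

    rng = np.random.default_rng(1)
    rho = ndimage.gaussian_filter(rng.standard_normal((24, 24, 24)), sigma=2)
    print(connectivity_score(rho))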
Dr. Urzhumtsev described the application of the method to LDL,
which yields only low-resolution data (27 A).
Another example was a lectin that diffracts to about 6 A, where
three clearly distinguished molecules were found. Another method was
binary integer analysis of the phase problem. Charlie
Strauss, LANL De
Novo Structure Prediction Using Rosetta (download
presentation [PPT]-6 MB) Dr. Strauss described an ab
initio approach to predicting protein structures he developed with Dr.
David Baker and colleagues. The approach begins with a fragment library
(from the PDB) consistent with local sequence preferences in the target
sequence. Then the fragments are assembled into models with plausible
global properties. Finally
these are clustered based on structural similarity, with the number in a
cluster being an important factor for ranking.
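The cluster-size ranking can be illustrated with a toy example in which one-dimensional numbers stand in for models and a simple distance cutoff stands in for structural similarity; this is not the Rosetta clustering code.

    # Group models greedily by pairwise distance and rank clusters by size.
    def cluster_by_size(models, cutoff=1.0):
        clusters = []
        for m in sorted(models):
            for c in clusters:
                if abs(m - c[0]) < cutoff:     # stand-in for structural RMSD
                    c.append(m)
                    break
            else:
                clusters.append([m])
        return sorted(clusters, key=len, reverse=True)

    models = [0.1, 0.2, 0.15, 3.0, 3.1, 0.05, 7.2]
    best_cluster = cluster_by_size(models)[0]
    print(len(best_cluster), best_cluster)     # the most populated cluster ranks first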
The method has been quite successful in low-resolution structure
prediction. The approach is limited to relatively small proteins, up to
about 150 residues. Dr. Strauss emphasized the importance of the CASP blind
prediction competition for invigorating the field (and encouraged the
crystallographers to continue providing blind tests). Dr. Strauss continued by
describing how the ab initio approach could be modified to include
experimental restraints (such as limited NMR restraints) to greatly
decrease the number of plausible
solutions. He showed that a
very small amount of experimental information could dramatically improve
the quality of structure predictions. Dr. Strauss also gave several
examples of how a coarse model predicted using Rosetta could be used to
provide functional annotations, both by structural similarity to proteins
with known structure and function and by mapping residues known to be
involved in function onto the predicted structures. Chang-Shung
Tung, LANL From
Low-resolution Data to Atomic Structure

Dr. Tung described his model-building methods for extending low-resolution or partial models to all-atom models. He showed that structural regularities in the conformation of 4-Cα segments of structure allow a reliable inference of all main-chain atoms from Cα coordinates. The method reduces the phi-psi angles to just one parameter, which both simplifies and speeds up modeling. Dr. Tung also described a loop-generation algorithm that is useful for modeling unobserved loops in proteins. He described a related approach for generating all-atom nucleic acid structures from coordinates of the phosphorus atoms, and how it can be applied to building models as large as the ribosomal RNA from phosphorus atoms alone.

Phase Combination, MR, and
Refinement II. (Chair: Paul Adams) Garib
Murshudov, University of York Various
Levels of Collaboration Between REFMAC and ARP/wARP Developers In keeping with the theme
of the meeting, Dr. Murshudov described current collaborations that exist
and that might be desirable. Desirable collaborations included those in
the area of automation of structure determination, moving from
crystallographic to post-crystallographic analysis, and analysis of
biomolecules. Collaboration can occur at several levels: data structures
and functions, programs, algorithms in crystallography, general
algorithms, as well as discussions of results and problems.
Post crystallographic problems include adding hydrogens, ligand
detection, rotamer libraries. Areas
that Dr. Murshudov is working on include full use of experimental data in
MAD/MIR, use of all information available throughout the process,
calculation of uncertainties, and automatic ligand detection. Dr. Murshudov reviewed the
use of Bayesian statistics in crystallography, including the use of prior
information (e.g., probabilities of bond distances).
He gave the specific example of SAD data and its analysis. A second
area he discussed is the uncertainty in parameters. To estimate them, the
second derivatives of the likelihood function are needed; these are
computationally expensive. The calculation can, however, be done more rapidly in reciprocal space. These second derivatives can also be used for other things, including probability distributions for the parameters, for improving maps, for sampling, and for evaluating the significance of observations (distances, for example); a toy numerical illustration of uncertainties from likelihood curvature appears below.
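The example below uses a simple Gaussian likelihood with invented observations; the numerical second derivative of the log-likelihood reproduces the familiar sigma/sqrt(N) uncertainty.

    # Parameter uncertainty from the curvature of the log-likelihood:
    # sigma(x) = 1 / sqrt(-d2(logL)/dx2) at the maximum-likelihood estimate.
    import numpy as np

    data = np.array([1.52, 1.49, 1.55, 1.50, 1.48])   # invented repeated estimates
    sigma_obs = 0.03                                   # assumed observation error

    def log_likelihood(x):
        return -0.5 * np.sum((data - x) ** 2) / sigma_obs ** 2

    x_hat = data.mean()                                # maximum-likelihood estimate
    h = 1e-4                                           # numerical second derivative
    d2 = (log_likelihood(x_hat + h) - 2 * log_likelihood(x_hat)
          + log_likelihood(x_hat - h)) / h ** 2
    print("estimate", round(x_hat, 3), "uncertainty", round(1 / np.sqrt(-d2), 4))
    print("analytic check", round(sigma_obs / np.sqrt(len(data)), 4))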
ARP/wARP developers have been collaborating with Dr. Murshudov for some time. This includes
consistent and synchronized releases, development of algorithms, and
testing of programs. Other collaborations include dictionary and
deposition for the EBI, CCP4 support, and others. Tom
Terwilliger, LANL Maximum-likelihood Density
Modification and Automated Model-Building
(http://solve.lanl.gov/) Dr. Terwilliger described
the automated model-building capabilities of the RESOLVE software and how
these could be integrated with maximum-likelihood density modification. He
discussed the model-building process, which begins by identifying helices
and sheets in an electron density map using an FFT-based convolution
search similar to one developed earlier by Kevin Cowtan; a toy illustration of such an FFT-based matching search appears below.
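The illustration is one-dimensional, with an invented density trace and a made-up template standing in for a helix; a real search works on a 3-D map and also searches over template orientations.

    # Cross-correlation of a "map" with a template at every offset via FFTs.
    import numpy as np

    def fft_cross_correlation(map_values, template):
        n = len(map_values)
        padded = np.zeros(n)
        padded[:len(template)] = template
        # correlation theorem: corr = IFFT( FFT(map) * conj(FFT(template)) )
        corr = np.fft.ifft(np.fft.fft(map_values) * np.conj(np.fft.fft(padded)))
        return corr.real

    rng = np.random.default_rng(2)
    density = rng.standard_normal(256)
    template = np.array([1.0, 2.0, 3.0, 2.0, 1.0])     # fake helix-like motif
    density[100:105] += 5 * template                    # plant the motif at offset 100
    scores = fft_cross_correlation(density, template)
    print("best match near offset", int(np.argmax(scores)))   # expect ~100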
Then fragments of secondary structure from refined protein structures are fitted to these
locations as a starting point for model-building. These fragments are then
extended using libraries of tripeptides, initially without regard to
overlaps. Then the fragments are assembled and a set of non-overlapping
fragments is obtained. Side chains are fitted using a rotamer library and
correlations of local density to average density from the library. The model-building works
well for structures with resolution of about 3 A or better. It can then be
combined with maximum-likelihood density modification, using calculated
electron density from the model as a target for “expected” electron
density in the map, in an iterative fashion. Dr. Terwilliger showed that
in the case of gene 5 protein (a small 87-residue protein at a resolution
of 2.6 A), iterative application of model-building and maximum-likelihood
density modification resulted in most of the model being built. Todd
Yeates, University of California Checking
for Problems in Structures and Diffraction Data, with an Update on
Twinning Dr. Yeates discussed how to
check for errors and problems in structures, including unit cell
measurement errors, model-building errors, and merohedral twinning. He pointed out an early type of error involving unit cell
lengths (now very rare) that his group had detected by noticing
“stretching” of proteins due to the atoms going to the correct
fractional positions in a cell with incorrect cell dimensions. A second
approach (ERRAT) examined the statistics of non-bonded interactions in a
model and compared it with model distributions. This algorithm provides a
local measure of model quality. There
remain crystal structures reported recently that are improbable based on
these statistics, but they represent a small percentage of the total (<
1%). Dr. Yeates suggested
that the use of the structure factors to check model quality remains a
good idea. A long-standing problem has
been merohedral twinning. Dr.
Yeates has a web site that helps identify twinning from intensity data (http://www.doe-mbi.ucla.edu/Services/Twinning).
In this situation of merohedral twinning, the lattice of the crystal has a
higher-order symmetry than the space group (e.g., P4 with alternating
regions reversed in orientation). In this case intensities of each
diffraction spot are the weighted sum of intensities of two reflections. In the worst case, the symmetry appears to be higher than it
really is (i.e., P4 appears to be P422). The twin server checks for all
these scenarios. Perfect twinning gives rise to non-Wilson intensity
distributions. This can be confused by anisotropic diffraction, however.
A new local statistic based on local relationships between
reflections can overcome this problem. Partial twinning is usually
detected by an unexpected similarity between reflections related by a
twinning operation. However this can be mimicked by NCS if it is nearly
crystallographic. Dr. Yeates is developing methods to identify partial twinning in the presence of NCS; a toy numerical illustration of the partial-twinning statistic appears below.
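The example simulates Wilson-distributed intensities with a known twin fraction and recovers it from the mean of |I1 - I2| / (I1 + I2) over twin-related pairs; real data, measurement errors, and NCS complicate this considerably.

    # Estimate a partial twin fraction alpha from twin-related intensity pairs:
    # for acentric data without NCS, <|I1-I2|/(I1+I2)> = 1/2 - alpha.
    import numpy as np

    rng = np.random.default_rng(3)
    alpha = 0.3                                   # simulated twin fraction
    i1_true = rng.exponential(1.0, 20000)         # Wilson (exponential) intensities
    i2_true = rng.exponential(1.0, 20000)
    i1_obs = (1 - alpha) * i1_true + alpha * i2_true   # twinned observations
    i2_obs = (1 - alpha) * i2_true + alpha * i1_true

    H = np.abs(i1_obs - i2_obs) / (i1_obs + i2_obs)
    print("estimated twin fraction:", round(0.5 - H.mean(), 3))   # expect ~0.3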
Dr. Yeates emphasized that crystallographers must remain vigilant, and verification programs should
be run more frequently. Expert systems should be designed to identify
twinning and other pitfalls. Data
Harvesting and Deposition and Meeting Discussion (Chair: Tom Terwilliger) John
Westbrook, Rutgers University (download
presentation [PPT] - 2.5 MB) and Kim Henrick, European Bioinformatics Institute (download presentation [PPT] - 1.5 MB)

Collecting Data for the PDB

Dr. Westbrook discussed how
to facilitate seamless data exchange and deposition. He emphasized the
need for data specifications and software that implements these
specifications. The situation that needs to be avoided is one where many
different groups use different definitions and analysis of results is
impeded. The web site at http://deposit.pdb.org/mmcif/
describes a large set of dictionaries including NMR, modeling,
crystallization, symmetry, image data, extensions for structural genomics,
properties of beamlines. The
data definition project has a long history beginning with projects of the
IUCr, and is now being driven largely by the needs of structural genomics. The data dictionary for
X-ray data is in final review; others in progress include NMR and protein
production. The PDB has spent significant effort to define a CORBA API for communication of data items; this is described at http://openmms.sdsc.edu/. The current PDB strategy for data integration is to collect experimental information as mmCIF (or otherwise electronically parseable) output, combined with information from the ADIT deposition tool, and then to make all data available in the exchange dictionary format (http://beta.pdb.org).
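As a tiny illustration of emitting harvest-style data items from Python, the snippet below writes a few items in mmCIF syntax; the item names follow common mmCIF categories but should be checked against the current exchange dictionary before real use.

    # Write a handful of data items as mmCIF-style text.
    items = {
        "_cell.length_a":                 "78.0",
        "_cell.length_b":                 "78.0",
        "_cell.length_c":                 "37.0",
        "_symmetry.space_group_name_H-M": "'P 43 21 2'",
        "_refine.ls_R_factor_R_work":     "0.213",
    }

    with open("harvest.cif", "w") as cif:
        cif.write("data_harvest_example\n")
        for name, value in items.items():
            cif.write("%-36s %s\n" % (name, value))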
Data harvesting is currently implemented from several software
programs including many CCP4 programs, HKL2000, and others. Software
integration tools are available at http://deposit.pdb.org/software
and http://deposit.pdb.org/mmcif. To participate in the process, anyone can comment on the data items at http://deposit.pdb.org/mmcif; help in identifying what should be captured for deposition is also still useful. There is a
workshop scheduled May 24-25, 2002 on
“Structural genomics informatics and software integration” as
well.

Dr. Henrick described how the EBI hosts a number of databases including SWISS-PROT, TrEMBL, ArrayExpress, and others. The EBI is also a host for deposition to the PDB. The EBI and the PDB are a good example of using agreed common data items and an exchange mechanism (they don’t use the same software, but they communicate seamlessly). The common data representation includes an abstract data model and data definitions; harvesting, exchange, and storage follow. The strategy is then to create a pipeline for the data, with individuals defining their required inputs and outputs and mapping them to the data model.

There are many reasons to collaborate on a data model for crystallography, NMR, and other structures. The main one is that the original PDB representation isn’t rich enough for all the data that are useful to save. The good news is that data dictionaries and methods for extending them exist, as does a data model. There are several European efforts to define the process: one is an e-Science resource for structural genomics, another is the SPINE structural genomics project, and another is the CCP4 coordinate library project. The UML (Unified Modeling Language) approach is useful for defining relationships in a project and for generating code for the classes describing these relationships; this has been applied in the CCPN project for NMR data storage and in another project for electron microscopy. Open issues at this time include elements of the data models, API specification, and migration of existing software.

Areas of Collaboration identified during the workshop:

Use of crystallographic
infrastructure (e.g., PHENIX/CCP4
libraries and platform) |