Interdisciplinary workshop promoting collaboration
Summary of the impact of the workshop
Web site: http://solve.lanl.gov/santa_fe_workshop.html

This
workshop was a meeting to foster advances in technology for macromolecular
structure determination by encouraging and enabling developers (of
software, methods and automation) in the field of structural genomics and
theorists to work together. The
participants were expected to focus their presentations and discussion on
working together in order to foster collaboration and innovation.
Additionally, attendees included structural biologists as well as
theorists interested in the analysis of protein structure. The meeting was
well-attended (35 attendees) and there was very strong enthusiasm about
the level of cooperation and collaboration that was developed at the
meeting. The participants drew up a list of areas where collaboration was
either begun during the meeting or could clearly begin after it; these
included coordination in the development of crystallographic
infrastructure (several projects are
devoted to automation; their libraries and definitions can be
coordinated), automated
identification of symmetries in “heavy-atom” positions in crystals (a
collaboration was begun at the meeting in this area), expert systems for automatic data collection and iterative
model-building at moderate resolution (being developed by several groups),
among other areas. In the spirit of collaboration, most of the participants have agreed to place their talks on the web, including unpublished data, and these can be viewed along with summaries of all the talks by anyone at the site http://solve.lanl.gov/santa_fe_workshop.html.
March 22-23, 2002
La Fonda Hotel
Sponsored by the Institute for Complex Adaptive Matter

Areas of Collaboration identified during the workshop
Program, presentations, links, and summaries

Collaboration and Automation of Structure Determination (Chair: Samar Hasnain)
Eleanor
Dodson, University of York
CCP4 Philosophy and Success in Cooperation

Dr. Dodson discussed the long-running CCP4 collaborative project, which has been promoting macromolecular crystallography and collaboration for over 20 years. The project was designed as a resource for the community, with goals including software development, teaching, and workshops. The early emphasis was on workshops, with the development of a comprehensive software package occurring over the years.

The CCP4 philosophy of software is modularity; this allows seamless replacement of one module with another as methods change. Very important additional features of CCP4 include common data structures and library routines.

Dr. Dodson pointed out that collaboration has benefits and complications. Benefits include common systems and a reduction of duplication, but complications include the need to really discuss things, which is time consuming. Additionally, the allocation of credit is more difficult if many people are involved. To move into a modern
computing environment, Dr. Dodson points out, there has to be a symbiosis
between those who really have the problems and computer scientists and
other developers. Data
management is urgently needed, so
information can be passed from one stage to another in such a way that it
doesn’t have to be entered more than once. Paul
Adams, Lawrence Berkeley Laboratory
PHENIX:
A New Framework for Collaborative Software Development (download
presentation [PDF] 3 MB) Dr. Adams discussed the
PHENIX project, a collaborative project with a goal of developing a
comprehensive system for crystallographic structure determination. The
strategy is to use a scripting language (python) to link core routines and
data objects in a way that is easy to manipulate. This allows rapid
development of ideas, with optimization coming later. From the point of
view of the user, the system will provide automated structure solution,
but also the flexibility for individual approaches.
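To make the scripting idea concrete, here is a minimal sketch of linking crystallographic tasks and data objects through a Python layer; the task and object names are hypothetical illustrations, not the actual PHENIX interfaces.

    # Hypothetical sketch of a Python scripting layer that chains tasks;
    # the names below are illustrative, not the real PHENIX API.
    class DataObject(dict):
        """Carries reflections, sites, models, etc. between tasks."""

    def scale_data(data):
        data["scaled"] = True                 # placeholder for a real scaling step
        return data

    def locate_heavy_atoms(data):
        data["sites"] = [(0.1, 0.2, 0.3)]     # placeholder substructure search
        return data

    def phase_and_build(data):
        data["model"] = "initial model"       # placeholder phasing and building
        return data

    def run_pipeline(data, tasks):
        for task in tasks:                    # tasks are linked at run time, so
            data = task(data)                 # strategies are easy to rearrange
        return data

    result = run_pipeline(DataObject(raw="reflections.mtz"),
                          [scale_data, locate_heavy_atoms, phase_and_build])
    print(result["model"])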
The core of the system is a low-level framework for
crystallographic programming. Dr. Adams talked about how
the PHENIX infrastructure can be shared. Source code will be distributed
to academic groups. Other developers can “plug in” their work to the
PHENIX infrastructure and thereby make use of it.
For example, the PyMOL visualization software developed by Warren
DeLano was made a plug-in to PHENIX and can be used by PHENIX users. The high-level programming
environment in PHENIX is designed to let users (or developers) link tasks
easily and allow the creation of complex algorithms. It allows the incorporation of external programs as well. There was discussion of how
important it may be not to duplicate too much effort on low-level routines; in
particular, the CCP4 project and the PHENIX project already have some
duplication at this level. Ralf Grosse-Kunstleve,
Lawrence Berkeley National Laboratory The Computational Crystallography
Toolbox: Open Source Tools for Crystallographic Software Development (download
presentation [PPT] - 0.3 MB) Dr. Grosse-Kunstleve
talked about a set of building blocks for PHENIX that he emphasized are
really building blocks for everyone’s benefit and use. They are available as open-source code. The cctbx toolbox has
fundamental algorithms for computational crystallography, including space
group (symmetry), crystallographic, and standard array calculations. The code for cctbx can be
obtained over the web through the SourceForge project, an open-source
organization with some 380,000 users.
Ralf described how SourceForge is very nicely set up to allow one
developer to put in software, then allow another developer to make changes
and allow both to make use of each other’s work. Ralf discussed the
trade-off between high-level languages (easy to develop and debug, but
slow) and lower-level languages (harder to develop, faster). The balance
he chose is using a very high-level scripting language (Python) whenever
practical, and a moderate-level programming language (C++) for routines
requiring speed. These two
languages are quite similar, particularly in the use of classes, memory
management and exception handling, and are complementary in their
flexibility and speed. The two languages can be integrated nicely using the
Boost.Python library system (also open-source).
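As a flavor of what the Python side of this mixed-language design looks like, here is a small example of space-group calculations with the open-source cctbx; the calls shown are from memory and exact names may vary between cctbx versions.

    # Illustrative use of the cctbx from Python; the C++ core is exposed
    # through Boost.Python wrappers, so these objects are thin Python views.
    from cctbx import crystal, sgtbx

    # Crystal symmetry from a unit cell and space-group symbol.
    symmetry = crystal.symmetry(
        unit_cell=(78.0, 78.0, 37.0, 90, 90, 90),
        space_group_symbol="P 43 21 2")
    print(symmetry.unit_cell())

    sg_info = sgtbx.space_group_info("P 43 21 2")
    print(sg_info.type().number())     # space-group number (96 for P43212)
    print(sg_info.group().order_z())   # number of symmetry operations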
The documentation in the cctbx toolbox is embedded directly in the
C++ source code, supplemented with external web documentation
(automatically produced). To generate versions for
as many platforms as possible, the cctbx toolbox uses Python to
automatically generate Makefiles using a set of dependencies that are
defined in simple Python scripts. The
expectation is that end-users will usually get binaries.
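The build idea can be pictured with a small sketch of Python writing a Makefile from dependencies declared as plain data; this is illustrative only, not the actual cctbx build scripts.

    # Illustrative generation of a Makefile from dependencies held in Python.
    targets = {
        "sgtbx.o":  ["sgtbx.cpp", "sgtbx.h"],
        "uctbx.o":  ["uctbx.cpp", "uctbx.h"],
        "cctbx.so": ["sgtbx.o", "uctbx.o"],
    }
    rules = {".o": "\t$(CXX) -c $< -o $@",
             ".so": "\t$(CXX) -shared $^ -o $@"}

    with open("Makefile", "w") as mk:
        for target, deps in targets.items():
            mk.write("%s: %s\n" % (target, " ".join(deps)))
            mk.write(rules[target[target.rfind("."):]] + "\n\n")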
The cctbx toolbox can also be used as a pure C++ library by other developers. There was discussion of how to control the quality of an open-source project; the answer is that normally many people can contribute, but someone acts as the curator.

Wladek Minor, University of Virginia

Critical
Components of Automatic Determination of Protein Structures - What we can
share and coordinate

Dr. Minor started out by discussing bottlenecks in structural biology, pointing out that many of the individual steps are now fairly well under control, but information flow is often limiting. Automation needs to work about 70% of the time for certain classes of cases, or it isn’t that useful, he emphasized. In practice, solving structures with X-ray crystallography can sometimes take as little as an hour or so, but other times as long as years. It is key to identify early on whether the data are going to be useful; this is an information flow (and analysis) problem.

Dr. Minor talked about databases in structural genomics, particularly at the Midwest Center for Structural Genomics, and how to make them real-time and self-consistent. HKL2000 is software for processing diffraction data and for managing part of the data flow; for example, HKL2000 can search to see whether a sequence similar to that of a target protein belongs to a protein already in the PDB. A big question is what information is shared. Dr. Minor presented a system that provides automatic reports for everyone and at the same time allows members of the project to share all data comfortably among the different groups involved. The constant change of data formats, even for well-established detectors, is a big issue (there are 170 image formats that denzo can recognize). Dr. Minor emphasized that
the flow of information among groups requires not only standardization of the information transfer (such as XML definitions) but also self-consistent information. This information has to be derived from an authoritative source, the project database. There was some discussion about the sharing of information by structural genomics centers, as the data in these projects should be very open. The challenge is to create a system that is not only open but also provides all information in real time. Such a system is necessary to avoid duplication of effort.
There was also discussion of how to deal with the many data formats. One suggestion was that it makes the most sense to have code that recognizes all formats and returns the information in a standard way; a rough sketch of this idea appears below. Some people felt that implementation of the Image_cif format could be the solution for the future.
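The sketch below assumes invented format names and header fields; it is not code from HKL2000 or any other package.

    # Sketch of a dispatch-based image reader that hides detector format
    # differences behind one standard result; formats and fields are invented.
    from dataclasses import dataclass

    @dataclass
    class ImageHeader:
        detector: str
        wavelength: float
        oscillation: float

    def read_adsc(path):
        return ImageHeader("ADSC", 0.9793, 1.0)   # placeholder vendor parser

    def read_mar(path):
        return ImageHeader("MAR", 0.9793, 0.5)    # placeholder vendor parser

    READERS = {".img": read_adsc, ".mar2300": read_mar}

    def read_image(path):
        """Recognize the format from the file name and return a standard header."""
        for suffix, reader in READERS.items():
            if path.endswith(suffix):
                return reader(path)
        raise ValueError("unrecognized image format: %s" % path)

    print(read_image("lysozyme_001.img"))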
There was also some discussion of how others can work with HKL2000; the source code is available for inspection and discussion in the program authors' laboratories.

General discussion

General discussion included
questions of credit; allocating credit is difficult in a collaborative
environment. For CCP4, the programs identify publications to cite as a way
around this. An idea was suggested that databases of structure
determination should keep track of credit.
Another discussion concerned how to collaborate with each of these groups.

Beamline Automation I.
(Chair: Wladek Minor) Masashi
Miyano and Takashi Kumasaka, Riken Harima Institute at Spring-8
Full
Automation Approach of Crystallization Screening and its Observation Dr. Miyano described the
RIKEN structural genomics/proteomics
initiative (RSGI) and the Spring8 beamlines that are contributing
to it. These beamlines (and
NMR facilities) plan to solve 1500 structures in 5 years, including
Thermus thermophilus and M. tuberculosis proteins, as well as mouse, human,
and plant proteins from cDNAs. He described a high-throughput protein production system applied to hyperthermophilic protein production.
This system feeds into a robot-based crystallization setup called “TERA”, with automated dispensing of
precipitant solutions, automated visualization, and storage for a large number of
crystallization trials. So far the scoring of crystallization results is manual.
Crystals are stored on bar-coded pins and are to be automatically
mounted at the beamlines. Prototype sample changers were described. Thomas
Earnest, Lawrence Berkeley National Laboratory
Development
of Robotic Crystal Mounting and Alignment Systems for Biological
Crystallography at the Advanced Light Source Dr. Earnest talked about 5
fully operational beamlines at the ALS and several new superbend beamlines
and their capabilities. One particularly important feature of some of
these is exceptional optics for visualizing crystals. He emphasized that
high-throughput structure determination is incompatible with human
intervention. Manual intervention is too slow, too tedious, and wastes time. Robotic systems
have been developed to do what the person might do for most of the steps
in collecting data, including sample mounting, crystal centering and data
collection. An important strategy is to have the capability to
automatically run through a set of many crystals, then to choose the best
ones and collect data on them. The plan is to develop a 3-D representation
of the crystal, to automatically identify the location of the crystal and
to center it and screen or collect data on it.
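The run-through-then-choose strategy can be pictured with a short sketch; the mounting, centering, and scoring calls here are simple placeholders, not a real ALS control interface.

    # Hypothetical screen-then-collect loop; beamline calls are placeholders.
    def mount(sample):        print("mounting", sample)
    def center(sample):       print("centering", sample)
    def quick_shot(sample):   return {"resolution": 2.0 + hash(sample) % 3}
    def score(image):         return -image["resolution"]       # higher is better
    def collect_full(sample): print("collecting full data set on", sample)

    def screen_and_collect(samples, n_best=3):
        ranked = []
        for sample in samples:            # run quickly through every crystal
            mount(sample)
            center(sample)
            ranked.append((score(quick_shot(sample)), sample))
        ranked.sort(reverse=True)         # then collect only on the best ones
        for _, sample in ranked[:n_best]:
            collect_full(sample)

    screen_and_collect(["pin-%02d" % i for i in range(8)], n_best=2)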
Dr. Earnest emphasized that “smart” software is the key to the
next stages in automation of beamlines. The strategy at the ALS for
development of hardware is to build it from scratch rather than trying to
link existing hardware that is not designed specifically for
crystallography. Their
overall plan is to link data collection, processing, and analysis with
high-level software. The
hardware and details of construction have been shared with several other
facilities. An item of discussion was
the compatibility of pins for sample mounting.
Dr. Earnest described how the two California groups developing
sample mounting have now agreed on compatible pins.

Liz Duke, Daresbury Laboratory

Towards Automated Beamlines at Daresbury

Dr. Duke described the
efforts at Daresbury SRS to automate x-ray crystallographic efforts. There
are 5 operational beamlines at the SRS and one more planned.
Current efforts on automation are beginning with identification of
all motions necessary in the hutch (which would be carried out with
outside control) and building robotics to carry these out. The Daresbury
group plans to use an unscheduled beamline to replicate a highly-used
beamline and to develop all the hardware.
They have a constraint on automation; all operations have to be
possible both with and without automation so that certain types of
crystals (viruses) can be done in capillaries. Dr. Duke emphasized that
collaboration between instrumentation engineers and software engineers is
required, and that the automation project involves a very large amount of work.
They are weighing the possibilities of collaboration against
purchasing commercial equipment. Dr. Duke also mentioned that the Diamond
synchrotron protein crystallography beamlines are being planned from the
start with automation in mind. Ab-initio and Direct Methods
Phasing I. (Chair:
Alexandre Urzhumtsev) Charles
Weeks, Hauptman-Woodward Med. Res. Inst. Integrating
Direct Methods into Automated Protein-Phasing Packages (download
presentation [PPT] - 0.7 MB) Dr. Weeks
described how to go from solving a heavy-atom substructure with direct
methods to phasing a full protein structure.
This is a problem of integrating a method for substructure solution
with methods for substructure refinement, density modification, and model
building. He discussed the
BnP (Buffalo and Pittsburgh) system that combines the SnB direct-methods
program with normalization and substructure-comparison routines as well as
components from the macromolecular crystallography package PHASES. All of these programs are linked together with a Java-based
GUI. Dr. Weeks
emphasized that the timely identification of which direct-methods trial
substructures are correct is critical.
He discussed methods for automating this decision step using
different scoring methods. The
best included consideration of the improvement in various figures of merit
obtained in going from the starting values to the final refined values. A second important step in the automation process is
identifying how many of the top peaks in the direct-methods solution
should be included in subsequent protein phasing.
Correct sites tend to occur reproducibly in different (independent) solutions, but false sites do not; a rough sketch of such a reproducibility check appears below.
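The toy version below uses invented fractional coordinates and a simple distance tolerance; a real comparison would also need to handle space-group symmetry, origin shifts, and the enantiomorph ambiguity.

    # Count how often each candidate site from the first trial recurs in the
    # other independent trials; recurring sites rank highest.
    def site_close(a, b, tol=0.02):
        return all(abs(x - y) < tol for x, y in zip(a, b))

    def reproducibility(trials):
        counts = []
        for site in trials[0]:
            n = sum(any(site_close(site, other) for other in trial)
                    for trial in trials)
            counts.append((n, site))
        return sorted(counts, reverse=True)

    trials = [
        [(0.12, 0.40, 0.33), (0.80, 0.05, 0.61), (0.55, 0.72, 0.09)],
        [(0.12, 0.41, 0.33), (0.80, 0.05, 0.60), (0.21, 0.90, 0.44)],
        [(0.13, 0.40, 0.32), (0.79, 0.06, 0.61), (0.67, 0.18, 0.95)],
    ]
    for count, site in reproducibility(trials):
        print(count, site)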
Better communication between BnP and other software would increase the
possibilities for automation, and Dr. Weeks requested help in achieving
this. For example, the BnP
control file (known as a “configuration” file) could be preloaded at
the beamline where the X-ray data is collected.
Collaborations with the authors of complementary procedures (e.g.
the identification of NCS symmetry involving heavy-atom sites) are also
sought. Quan
Hao, Cornell University Direct
Methods and SAD Phasing Dr. Hao described how
single-wavelength phasing and direct methods can be integrated. The
information necessary for phasing can be obtained from either
high-resolution data or anomalous diffraction (or SIR). The protocol used
included CCP4-based programs SAPI (tangent-formula refinement), ABS
(identification of enantiomer), and OASIS (single-wavelength phasing using
direct methods and Sim weighting). He described possible collaborations
involving maximum-likelihood and other statistical methods in SAPI and
OASIS. Angel
Garcia, LANL All-atom Studies of the
Folding/Unfolding Transition Dr. Garcia described an
area quite different than many of the previous speakers, simulations of
protein folding. He emphasized new simulation methods that are applicable
to the folding of proteins and peptides, and that indirectly are quite
relevant to crystallography. He described the method of replica exchange
for simulation of protein folding over long (microsecond) time scales. The method involves Monte Carlo-based exchanges of
configurations among a large set of parallel simulations run at different temperatures. Whenever a
particular simulation finds itself in a bad configuration (high enthalpy)
it heats up (exchanging with a replica at higher temperature), and whenever
it is in a good configuration, it cools off further.
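A toy sketch of the exchange rule, using a one-dimensional energy function in place of a protein force field, may help make the mechanism concrete; the potential, temperatures, and step sizes are invented for the example.

    # Toy replica-exchange (parallel tempering) on a 1-D "enthalpy" landscape.
    import math, random

    def energy(x):
        return (x * x - 1.0) ** 2            # toy double-well potential

    temps = [1.0, 2.0, 4.0, 8.0]              # replicas at increasing temperature
    replicas = [random.uniform(-2, 2) for _ in temps]

    for step in range(10000):
        # ordinary Metropolis move within each replica
        for i, T in enumerate(temps):
            trial = replicas[i] + random.gauss(0, 0.1)
            dE = energy(trial) - energy(replicas[i])
            if dE < 0 or random.random() < math.exp(-dE / T):
                replicas[i] = trial
        # attempt to swap neighbouring replicas: high-enthalpy states drift
        # toward high temperature, low-enthalpy states cool off further
        i = random.randrange(len(temps) - 1)
        dBeta = 1.0 / temps[i] - 1.0 / temps[i + 1]
        dE = energy(replicas[i]) - energy(replicas[i + 1])
        if random.random() < math.exp(min(0.0, dBeta * dE)):
            replicas[i], replicas[i + 1] = replicas[i + 1], replicas[i]

    print("coldest replica position:", round(replicas[0], 2))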
This method is able to simulate the folding/unfolding transition
with temperature, showing in detail the distribution of native-like
contacts as a function of temperature.
A surprise was that states exist where the main-chain rmsd to the
native structure is low (2 A) but a contact map of all atoms within 6 A of
each other is almost completely non-native-like. The method can
characterize thermodynamics such as enthalpy, entropy and specific heat. Beamline automation II.
(Chair: Thomas Earnest) Steve
Kinder, Daresbury Laboratory Data
Acquisition Developments and Collaborations at Daresbury Laboratory Dr. Kinder discussed
experiences in collaboration at Daresbury, including PXGEN++ and DNA. He
emphasized that collaboration is not so easy, and it is a good idea to
practice with friends first. One
project he described was an update to software (PXGEN) for x-ray data
collection. The objectives in the project were well-defined, and the
project was largely within Daresbury laboratory, making collaboration a
little easier than it might otherwise have been.
The PXGEN++ software was designed to have a common GUI independent
of the detector, and standard controls for various beamlines.
The second project is DNA, a much larger collaborative effort to
automate collection and processing of protein crystallography data,
involving several laboratories (LMB Cambridge, ESRF, Daresbury). The DNA
system involves an expert system that controls data collection and the
MOSFLM data processing software. Dr. Kinder emphasized that this group is very interested in
collaboration, particularly in ideas and in sharing code. Joseph
Brunzelle, Northwestern University Automated
Structure Determination: Database Driven and Ant-Controlled Dr.
Brunzelle described a Python-based system for linking together existing
algorithms for structure determination in a flexible way that allows
crossover among methods. The
general approach he used is a Python and web-based system.
It is suitable for incorporating multiple approaches and for
including decision-making (AI). A web-based database interface using
“Slither” is used. The
mySQL database itself is a relational database including the raw data and
the analysis, as it proceeds. The “ants” correspond to existing
software, wrapped in a standard way so that the data format is not an
issue. The “queen ant” controls the worker ants who carry out standard
operations on crystallographic data and return the results; a minimal sketch of such a wrapper appears below.
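In the sketch below the program name, flags, and output parsing are placeholders, not the actual system described in the talk.

    # Hypothetical "worker ant": each external program is wrapped so that the
    # controller sees one uniform interface regardless of the program's own
    # input and output conventions.
    import subprocess

    class Ant:
        def __init__(self, executable, build_args, parse_output):
            self.executable = executable
            self.build_args = build_args      # task dict -> command-line args
            self.parse_output = parse_output  # raw stdout -> standard result dict

        def run(self, task):
            cmd = [self.executable] + self.build_args(task)
            out = subprocess.run(cmd, capture_output=True, text=True)
            return self.parse_output(out.stdout)

    # Example registration of one wrapped program (purely illustrative).
    scale_ant = Ant("scale_program",
                    lambda task: ["--input", task["hkl_file"]],
                    lambda text: {"status": "ok", "log": text})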
A discussion question was how to maintain a software suite that depends on
many other pieces of independent software. The answer was that it would be
a lot easier if everyone could use xml tags.
In practice it is not so hard to keep up with software changes, as
template files are used to control input. Model-building. (Chair: Todd
Yeates) Tom
Ioerger, Texas A&M University The
TEXTAL System for Automated Model Building (download
presentation [PPT]) Dr. Ioerger described a
pattern-recognition approach to interpreting electron density maps and
building an atomic model. The
TEXTAL approach is to examine regions of density in a map and to ask
whether similar patterns have been seen in other maps. The implementation
involves feature extraction from previously-interpreted maps to yield a
database of patterns. The
TEXTAL system has three main steps: identification of Cα positions, identification of side chains, and then sequence and real-space refinement. The recognition of Cα positions is not as simple as it might seem, and Dr. Ioerger focused his discussion on this step. The approach involves normalization of the map, skeletonization, calculation of features, use of a neural network to identify likely Cα positions from these features, and then selection of which ones to actually link together; a small sketch of the feature-scoring idea appears below.
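In the sketch below, shell averages stand in for the real features and a single logistic unit stands in for the trained neural network; none of it is TEXTAL code, and the map is an invented array.

    # Rotation-invariant shell features around a candidate point, scored by a
    # stand-in "network"; map, features, and weights are invented.
    import numpy as np

    def shell_means(grid, center, radii=(2, 4, 6)):
        """Mean density in concentric spherical shells (radii in voxels)."""
        i, j, k = np.indices(grid.shape)
        dist = np.sqrt((i - center[0])**2 + (j - center[1])**2 + (k - center[2])**2)
        feats, lo = [], 0
        for r in radii:
            shell = (dist >= lo) & (dist < r)
            feats.append(grid[shell].mean())
            lo = r
        return np.array(feats)

    def score_candidate(features, weights, bias):
        """Single logistic unit as a stand-in for the trained neural network."""
        return 1.0 / (1.0 + np.exp(-(features @ weights + bias)))

    rng = np.random.default_rng(0)
    density_map = rng.random((32, 32, 32))        # fake normalized map
    feats = shell_means(density_map, (16, 16, 16))
    print(score_candidate(feats, np.array([0.8, -0.2, 0.1]), -0.3))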
Dr. Ioerger discussed areas of collaboration that he is interested in, including
making the tracing tools available, incorporating real-space refinement
and iteration with density modification. David
Levitt, University of Minnesota MAID:
An Automated Electron Density Fitting Program Applicable to Low Resolution
Maps (download
presentation [PPT] - 1.3 MB) Dr. Levitt talked about how
he has automated what a skilled crystallographer does when they build a
model of a protein into an electron density map.
The MAID software was developed using a 292-residue dehydrogenase
SAD map at 2.5 A, which it can fit 80% of with about 0.53 A rms for main
chain atoms. The overall approach used
is to first find the segments of secondary structure (helix and sheet),
then to extend them. During the building procedure real-space molecular
dynamics are used to optimize geometry and the fit.
This is all done in several iterative steps. First the helices and
sheets are found, then connections are found and several connected helices
are obtained. Particular attention is paid to being conservative so as not
to introduce errors that may be difficult to remove later.
These are then used to identify the fit to the amino acid sequence.
Then, using the amino acid sequence assignment, the extension and building
of loops is made much easier. The sequence assignment is done by testing each possible alignment at each position in the partially built model; a small sketch of this idea appears below.
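The toy docking below uses invented per-residue scores in place of real side-chain density fits; it simply slides the sequence along the built chain and keeps the best-matching offset.

    # Toy sequence-to-chain docking with invented "bulkiness" scores.
    fit = {"A": 0.2, "G": 0.1, "L": 0.6, "W": 0.9, "F": 0.8, "S": 0.3}

    def dock_sequence(chain_scores, sequence):
        """chain_scores: per-residue density scores for the built chain (invented)."""
        best = None
        for offset in range(len(sequence) - len(chain_scores) + 1):
            window = sequence[offset:offset + len(chain_scores)]
            score = -sum((fit[aa] - obs) ** 2 for aa, obs in zip(window, chain_scores))
            if best is None or score > best[0]:
                best = (score, offset)
        return best

    observed = [0.85, 0.25, 0.65, 0.15]           # densities for four built residues
    print(dock_sequence(observed, "GAWSLGFLAG"))  # (best score, best offset)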
Dr. Levitt talked about the idea of iterating the model-building process. It seemed that using the
initial model built by MAID, combining σA-weighted phases with the initial
phases, the next cycle of model-building would be much better.
Initial tests didn’t show improvement, but the discussion suggested a
number of additional ideas that might be used to make iterative
model-building very useful. Phase combination, MR, and
Refinement I. (Chair: Eleanor Dodson) Brent
Segelke, Lawrence Livermore Shake&wARP
as a Local Real Space Validation Tool

In addition to Shake&wARP itself, Dr. Segelke described Autosolve, another tool developed at Livermore: an automated
procedure for molecular replacement. It includes a search of the PDB and
automated initial modeling of the structure of interest based on
structures with similar sequences from the PDB, a series of molecular
replacement attempts, and output of a map, a partially refined structure
and annotation of quality. Michael
Chapman, Florida State University Real-space
Simulated Annealing Refinement, a Tool in Model-Building, and a Paradigm
for Holistic Refinement (download
presentation - 1.5 MB) Dr. Chapman described the
use of real-space refinement in building structures and why it helps. He
started with several examples, including one of an SAD/molecular
replacement case where real-space refinement followed by reciprocal-space
refinement gave an improvement over just reciprocal-space refinement
comparable to that obtained with manual intervention, but automatically
and much more rapidly. Dr. Chapman described
another application of real-space refinement, including electron
microscopy and solid-state NMR. A
new direction in the work is to re-examine force fields used for bonded
and non-bonded geometries. For example hydrogen-bonding force-fields and
restraints including directionality at the acceptor and stringent criteria
for identifying hydrogen bonds improved the free R-factor for structures
at moderate resolution (3A). Dr. Chapman noted that this procedure should
only be used towards the end of refinement at moderate resolution. Dr. Chapman emphasized that
the real-space procedure is dependent on many other packages and that he
is very interested in combining the methods with others.
In particular, the methods have clear application to automated
model-building. The programs are available at http://www.sb.fsu.edu/~chapman.

Ab-initio and Direct Methods
Phasing II. (Chair: Charles Weeks) Alexandre
Urzhumtsev, Universite Henri Poincare-Nancy Improved
and Alternative Means for Phasing (download
presentation [PPT] - 3 MB) Dr. Urzhumtsev pointed out
that there are many proteins that do not yield great crystals, but that
there is a lot of information that we have in advance that can help in
solving the structure. In the
case of molecular replacement, he described how to modify scoring
procedures to evaluate possible rotation function solutions. The method is
based on the clustering of solutions, with a hypothesis that the most
densely-populated clusters are most likely to be correct.
For the translation function in molecular replacement, a search
with low-resolution data including a solvent model greatly improved the
capability of identifying the correct solution. Another method described by
Dr. Urzhumtsev was to use low-resolution direct phasing to identify
positions of the molecules in the cell.
The approach requires a way to separate good maps from poor ones.
In particular, the number, position, and shape of connected regions
are very useful. The method
includes starting from random phase sets, selecting those that are connected (many connected regions of the same size; few isolated regions of density), averaging the best, and then iterating the process; a rough sketch of such a connectivity criterion appears below.
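The sketch uses scipy's connected-component labelling on an invented smoothed map; the threshold and the scoring formula are arbitrary illustrative choices, not the actual method described in the talk.

    # Score a low-resolution map by the connectivity of its high density:
    # prefer a few large connected regions over many isolated fragments.
    import numpy as np
    from scipy import ndimage

    def connectivity_score(rho, cutoff_sigma=1.0):
        mask = rho > rho.mean() + cutoff_sigma * rho.std()
        labels, n_regions = ndimage.label(mask)
        if n_regions == 0:
            return 0.0
        sizes = np.bincount(labels.ravel())[1:]          # voxels per region
        return sizes.max() / (sizes.sum() * n_regions)   # big blob, few fragments

    rng = np.random.default_rng(1)
    rho = ndimage.gaussian_filter(rng.standard_normal((24, 24, 24)), sigma=2)
    print(connectivity_score(rho))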
Dr. Urzhumtsev described the application of the method to LDL,
which yields only low-resolution data (27 A).
Another example was a lectin that diffracts to about 6 A, where
three clearly distinguished molecules were found. Another method was
binary integer analysis of the phase problem. Charlie
Strauss, LANL De
Novo Structure Prediction Using Rosetta (download
presentation [PPT]-6 MB) Dr. Strauss described an ab
initio approach to predicting protein structures he developed with Dr.
David Baker and colleagues. The approach begins with a fragment library
(from the PDB) consistent with local sequence preferences in the target
sequence. Then the fragments are assembled into models with plausible
global properties. Finally
these are clustered based on structural similarity, with the number in a
cluster being an important factor for ranking.
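The cluster-size ranking can be illustrated with a toy example in which one-dimensional numbers stand in for models and a simple distance cutoff stands in for structural similarity; this is not the Rosetta clustering code.

    # Group models greedily by pairwise distance and rank clusters by size.
    def cluster_by_size(models, cutoff=1.0):
        clusters = []
        for m in sorted(models):
            for c in clusters:
                if abs(m - c[0]) < cutoff:     # stand-in for structural RMSD
                    c.append(m)
                    break
            else:
                clusters.append([m])
        return sorted(clusters, key=len, reverse=True)

    models = [0.1, 0.2, 0.15, 3.0, 3.1, 0.05, 7.2]
    best_cluster = cluster_by_size(models)[0]
    print(len(best_cluster), best_cluster)     # the most populated cluster ranks first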
The method has been quite successful in low-resolution structure
prediction. The approach is limited to relatively small proteins, up to
about 150 residues. Dr. Strauss emphasized the importance of the CASP blind
prediction competition for invigorating the field (and encouraged the
crystallographers to continue providing blind tests). Dr. Strauss continued by
describing how the ab initio approach could be modified to include
experimental restraints (such as limited NMR restraints) to greatly
decrease the number of plausible
solutions. He showed that a
very small amount of experimental information could dramatically improve
the quality of structure predictions. Dr. Strauss also gave several
examples of how a coarse model predicted using Rosetta could be used to
provide functional annotations, both by structural similarity to proteins
with known structure and function and by mapping residues known to be
involved in function onto the predicted structures. Chang-Shung
Tung, LANL From
Low-resolution Data to Atomic Structure

Dr. Tung described his model-building methods for extending low-resolution or partial models to all-atom models. He showed that structural regularities in the conformation of 4-Cα segments of structure allow a reliable inference of all main-chain atoms from Cα coordinates. The method reduces the phi-psi angles to just one parameter, which both simplifies and speeds up modeling. Dr. Tung also described a loop-generation algorithm that is useful for modeling unobserved loops in proteins. He described a related approach for generating all-atom nucleic acid structures from coordinates of the phosphorus atoms, and how it can be applied to building models as large as the ribosomal RNA from phosphorus atoms alone.

Phase Combination, MR, and
Refinement II. (Chair: Paul Adams) Garib
Murshudov, University of York Various
Levels of Collaboration Between REFMAC and ARP/wARP Developers In keeping with the theme
of the meeting, Dr. Murshudov described current collaborations that exist
and that might be desirable. Desirable collaborations included those in
the area of automation of structure determination, moving from
crystallographic to post-crystallographic analysis, and analysis of
biomolecules. Collaboration can occur at several levels: data structures
and functions, programs, algorithms in crystallography, general
algorithms, as well as discussions of results and problems.
Post crystallographic problems include adding hydrogens, ligand
detection, rotamer libraries. Areas
that Dr. Murshudov is working on include full use of experimental data in
MAD/MIR, use of all information available throughout the process,
calculation of uncertainties, and automatic ligand detection. Dr. Murshudov reviewed the
use of Bayesian statistics in crystallography, including the use of prior
information (e.g., probabilities of bond distances).
He gave the specific example of SAD data and its analysis. A second
area he discussed is the uncertainty in parameters. To estimate them, the
second derivatives of the likelihood function are needed; these are
computationally expensive. The calculation can, however, be done more rapidly in reciprocal space. These second derivatives can also be used for other things, including probability distributions for the parameters, for improving maps, for sampling, and for evaluating the significance of observations (distances, for example); a toy numerical illustration of uncertainties from likelihood curvature appears below.
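The example below uses a simple Gaussian likelihood with invented observations; the numerical second derivative of the log-likelihood reproduces the familiar sigma/sqrt(N) uncertainty.

    # Parameter uncertainty from the curvature of the log-likelihood:
    # sigma(x) = 1 / sqrt(-d2(logL)/dx2) at the maximum-likelihood estimate.
    import numpy as np

    data = np.array([1.52, 1.49, 1.55, 1.50, 1.48])   # invented repeated estimates
    sigma_obs = 0.03                                   # assumed observation error

    def log_likelihood(x):
        return -0.5 * np.sum((data - x) ** 2) / sigma_obs ** 2

    x_hat = data.mean()                                # maximum-likelihood estimate
    h = 1e-4                                           # numerical second derivative
    d2 = (log_likelihood(x_hat + h) - 2 * log_likelihood(x_hat)
          + log_likelihood(x_hat - h)) / h ** 2
    print("estimate", round(x_hat, 3), "uncertainty", round(1 / np.sqrt(-d2), 4))
    print("analytic check", round(sigma_obs / np.sqrt(len(data)), 4))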
ARP/wARP developers have been collaborating with Dr. Murshudov for some time. This includes
consistent and synchronized releases, development of algorithms, and
testing of programs. Other collaborations include dictionary and
deposition for the EBI, CCP4 support, and others. Tom
Terwilliger, LANL Maximum-likelihood Density
Modification and Automated Model-Building
(http://solve.lanl.gov/) Dr. Terwilliger described
the automated model-building capabilities of the RESOLVE software and how
these could be integrated with maximum-likelihood density modification. He
discussed the model-building process, which begins by identifying helices
and sheets in an electron density map using an FFT-based convolution
search similar to one developed earlier by Kevin Cowtan; a toy illustration of such an FFT-based matching search appears below.
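The illustration is one-dimensional, with an invented density trace and a made-up template standing in for a helix; a real search works on a 3-D map and also searches over template orientations.

    # Cross-correlation of a "map" with a template at every offset via FFTs.
    import numpy as np

    def fft_cross_correlation(map_values, template):
        n = len(map_values)
        padded = np.zeros(n)
        padded[:len(template)] = template
        # correlation theorem: corr = IFFT( FFT(map) * conj(FFT(template)) )
        corr = np.fft.ifft(np.fft.fft(map_values) * np.conj(np.fft.fft(padded)))
        return corr.real

    rng = np.random.default_rng(2)
    density = rng.standard_normal(256)
    template = np.array([1.0, 2.0, 3.0, 2.0, 1.0])     # fake helix-like motif
    density[100:105] += 5 * template                    # plant the motif at offset 100
    scores = fft_cross_correlation(density, template)
    print("best match near offset", int(np.argmax(scores)))   # expect ~100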
Then fragments of secondary structure from refined protein structures are fitted to these
locations as a starting point for model-building. These fragments are then
extended using libraries of tripeptides, initially without regard to
overlaps. Then the fragments are assembled and a set of non-overlapping
fragments is obtained. Side chains are fitted using a rotamer library and
correlations of local density to average density from the library. The model-building works
well for structures with resolution of about 3 A or better. It can then be
combined with maximum-likelihood density modification, using calculated
electron density from the model as a target for “expected” electron
density in the map, in an iterative fashion. Dr. Terwilliger showed that
in the case of gene 5 protein (a small 87-residue protein at a resolution
of 2.6 A), iterative application of model-building and maximum-likelihood
density modification resulted in most of the model being built. Todd
Yeates, University of California Checking
for Problems in Structures and Diffraction Data, with an Update on
Twinning Dr. Yeates discussed how to
check for errors and problems in structures, including unit cell
measurement errors, model-building errors, and merohedral twinning. He pointed out an early type of error involving unit cell
lengths (now very rare) that his group had detected by noticing
“stretching” of proteins due to the atoms going to the correct
fractional positions in a cell with incorrect cell dimensions. A second
approach (ERRAT) examined the statistics of non-bonded interactions in a
model and compared it with model distributions. This algorithm provides a
local measure of model quality. There
remain crystal structures reported recently that are improbable based on
these statistics, but they represent a small percentage of the total (<
1%). Dr. Yeates suggested
that the use of the structure factors to check model quality remains a
good idea. A long-standing problem has
been merohedral twinning. Dr.
Yeates has a web site that helps identify twinning from intensity data (http://www.doe-mbi.ucla.edu/Services/Twinning).
In this situation of merohedral twinning, the lattice of the crystal has a
higher-order symmetry than the space group (e.g., P4 with alternating
regions reversed in orientation). In this case intensities of each
diffraction spot are the weighted sum of intensities of two reflections. In the worst case, the symmetry appears to be higher than it
really is (i.e., P4 appears to be P422). The twin server checks for all
these scenarios. Perfect twinning gives rise to non-Wilson intensity
distributions. This can be confused by anisotropic diffraction, however.
A new local statistic based on local relationships between
reflections can overcome this problem. Partial twinning is usually
detected by an unexpected similarity between reflections related by a
twinning operation. However this can be mimicked by NCS if it is nearly
crystallographic. Dr. Yeates is developing methods to identify partial twinning in the presence of NCS; a toy numerical illustration of the partial-twinning statistic appears below.
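The example simulates Wilson-distributed intensities with a known twin fraction and recovers it from the mean of |I1 - I2| / (I1 + I2) over twin-related pairs; real data, measurement errors, and NCS complicate this considerably.

    # Estimate a partial twin fraction alpha from twin-related intensity pairs:
    # for acentric data without NCS, <|I1-I2|/(I1+I2)> = 1/2 - alpha.
    import numpy as np

    rng = np.random.default_rng(3)
    alpha = 0.3                                   # simulated twin fraction
    i1_true = rng.exponential(1.0, 20000)         # Wilson (exponential) intensities
    i2_true = rng.exponential(1.0, 20000)
    i1_obs = (1 - alpha) * i1_true + alpha * i2_true   # twinned observations
    i2_obs = (1 - alpha) * i2_true + alpha * i1_true

    H = np.abs(i1_obs - i2_obs) / (i1_obs + i2_obs)
    print("estimated twin fraction:", round(0.5 - H.mean(), 3))   # expect ~0.3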
Dr. Yeates emphasized that crystallographers must remain vigilant, and verification programs should
be run more frequently. Expert systems should be designed to identify
twinning and other pitfalls. Data
Harvesting and Deposition and Meeting Discussion (Chair: Tom Terwilliger) John
Westbrook, Rutgers University (download
presentation [PPT] - 2.5 MB) and Kim Henrick, European Bioinformatics Institute (download presentation [PPT] - 1.5 MB)

Collecting Data for the PDB

Dr. Westbrook discussed how
to facilitate seamless data exchange and deposition. He emphasized the
need for data specifications and software that implements these
specifications. The situation that needs to be avoided is one where many
different groups use different definitions and analysis of results is
impeded. The web site at http://deposit.pdb.org/mmcif/
describes a large set of dictionaries including NMR, modeling,
crystallization, symmetry, image data, extensions for structural genomics,
properties of beamlines. The
data definition project has a long history beginning with projects of the
IUCr, and is now being driven largely by the needs of structural genomics. The data dictionary for
X-ray data is in final review; others in progress include NMR and protein
production. The PDB has spent significant effort to define a CORBA API for communication of data items; this is described at http://openmms.sdsc.edu/. The current PDB strategy for data integration is to collect experimental information as mmCIF (or otherwise electronically parseable) output, combined with information from the ADIT deposition tool, and then to make all data available in the exchange dictionary format (http://beta.pdb.org).
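As a tiny illustration of emitting harvest-style data items from Python, the snippet below writes a few items in mmCIF syntax; the item names follow common mmCIF categories but should be checked against the current exchange dictionary before real use.

    # Write a handful of data items as mmCIF-style text.
    items = {
        "_cell.length_a":                 "78.0",
        "_cell.length_b":                 "78.0",
        "_cell.length_c":                 "37.0",
        "_symmetry.space_group_name_H-M": "'P 43 21 2'",
        "_refine.ls_R_factor_R_work":     "0.213",
    }

    with open("harvest.cif", "w") as cif:
        cif.write("data_harvest_example\n")
        for name, value in items.items():
            cif.write("%-36s %s\n" % (name, value))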
Data harvesting is currently implemented from several software
programs including many CCP4 programs, HKL2000, and others. Software
integration tools are available at http://deposit.pdb.org/software
and http://deposit.pdb.org/mmcif. To participate in the process, anyone can comment on the data items at http://deposit.pdb.org/mmcif; help in identifying what should be captured for deposition is also still useful. There is a
workshop scheduled May 24-25, 2002 on
“Structural genomics informatics and software integration” as
well.

Dr. Henrick described how the EBI hosts a number of databases including SWISS-PROT, TrEMBL, ArrayExpress, and others. The EBI is also a host for deposition to the PDB. The EBI and the PDB are a good example of using agreed common data items and an exchange mechanism (they don’t use the same software, but they communicate seamlessly). The common data representation includes an abstract data model and data definitions; harvesting, exchange, and storage follow. The strategy is then to create a pipeline for the data, with individuals defining their required inputs and outputs and mapping them to the data model.

There are many reasons to collaborate on a data model for crystallography, NMR, and other structures. The main one is that the original PDB representation isn’t rich enough for all the data that are useful to save. The good news is that data dictionaries and methods for extending them exist, as does a data model. There are several European efforts to define the process: one is an e-Science resource for structural genomics, another is the SPINE structural genomics project, and another is the CCP4 coordinate library project. The UML (Unified Modeling Language) approach is useful for defining relationships in a project and for generating code for the classes describing these relationships; this has been applied in the CCPN project for NMR data storage and in another project for electron microscopy. Open issues at this time include elements of the data models, API specification, and migration of existing software.

Areas of Collaboration identified during the workshop:

Use of crystallographic
infrastructure (e.g., PHENIX/CCP4
libraries and platform) |