Protein Engineering: Bioinformatic Catalyst Design

Putting engineering back into protein engineering:
bioinformatic approaches to catalyst design
Claes Gustafsson, Sridhar Govindarajan and Jeremy Minshull
Complex multivariate engineering problems are commonplace
and not unique to protein engineering. Mathematical and data-mining tools developed in other fields of engineering have now
been applied to analyze sequence–activity relationships of
peptides and proteins and to assist in the design of proteins
and peptides with specified properties. Decreasing costs of
DNA sequencing in conjunction with methods to quickly
synthesize statistically representative sets of proteins allow
modern heuristic statistics to be applied to protein
engineering. This provides an alternative approach to
expensive assays or unreliable high-throughput surrogate
This review comes from a themed issue on
Protein technologies and commercial enzymes
Edited by Gjalt Huisman and Stephen Sligar
Protein engineering has classically been approached from
two diametrically opposed directions: rational design and
directed evolution. Rational design, in the tradition of
Descartes and Leibniz, attempts to understand protein
structure and function at a complete mechanistic level so
that any desired change can be effected by calculation
from first principles. Directed evolution, in the tradition
of John Locke and other empiricists, attempts to find
a desired solution by testing many different variants,
typically using various evolutionary based algorithms.
Both rational design and directed evolution in their
many alternative formats have shortcomings and advantages that have been discussed and compared elsewhere [1–3].
Modern heuristics applied to protein engineering is a
synthesis of empirical data and a rational analysis of that
information. The very first paper describing chemical
synthesis of a gene proposed that systematic variation
of amino acids would enable an understanding of the
relationships between the sequence of a protein and its
structure, physical behavior and activity [4]. Soon after
that, Svante Wold’s group developed and applied multivariate data analysis techniques to peptide design and
suggested that ‘the rapid development of protein engineering may then make it possible to produce designed
sets of mature proteins and enzymes for QSAR studies’
[5,6]. This review summarizes recent publications in
which modern heuristics have been applied to protein
engineering and describes technological advances that are
enabling Wold’s vision.
Protein optimization from an engineering perspective
When faced with solving a difficult problem it can be
enlightening to see if a similar type of problem has been
solved before. Many disciplines and industries face the
same challenges of high system complexity and abundant variables that confront protein engineering [7]. In
some industries increasing complexity is intentional, as
in the addition of new control parameters for a car’s
combustion engine. Sometimes it is inherent to the
system itself, for example, in clinical drug trials. The
common challenge in car manufacturing, clinical trials
and protein engineering is to account for as much of this
complexity as possible when describing the relationship
between input variables (e.g. piston angle and temperature for car engines, age and medical history for patients
or amino acid residues available at each position for protein engineering [8]) and output variables (e.g. exhaust
levels and fuel efficiency for cars, side effects and survival rate for patients or the desired commercial properties such as catalytic activity, thermostability, substrate
specificity and immunogenicity for protein engineering).
Measured output variables may in turn result from combinations of properties that are not explicitly measured;
for protein engineering, these may include expression
levels and protein solubility [9]. Like small-molecule
quantitative structure–activity relationships (QSAR),
which have enjoyed much success in pharmaceutical
development, heuristic protein engineering aims to
identify the relationship between input and output variables to create biological macromolecules with defined
properties. For reasons described below, more work
has been published optimizing peptides than proteins
using engineering concepts. We therefore use peptide
examples to describe some of the principles before
describing how the same engineering tools are used to
optimize proteins.
Bioinformatic approaches to catalyst design Gustafsson, Govindarajan and Minshull 367
Navigating in protein sequence space
Protein engineering can be divided into two subtasks:
defining the solution space and defining the search
algorithm.
Multivariate design of improved polypeptides
Figure 1 shows a procedure for peptide optimization
derived from the one used by Norinder et al. [32] to
design analogs of the neuropeptide substance P with
increased affinity for the neurokinin 1 (NK1) receptor.
These authors used partial least squares (PLS) regression
[33,34] to correlate the sequences of 36 substance P
analogs with their activities. They used this model to
identify the positions and amino acid properties in substance P that had the largest effects on NK1 binding. The
authors designed, synthesized and tested six new peptides that the model predicted to be improved NK1
binders. All six were shown to be highly active. Their
sequence–activity data were added to those of the first 36 peptides
to build a second generation PLS model, which was used
to design a further three variants. One of these had an
IC50 of 5 pM, 300-fold better than the wild-type peptide
and 45-fold better than the best of the original 36 variants
[32]. It is striking that extremely small numbers of variants (45) were made and tested to achieve very significant improvements in the desired function.
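The modeling step of such a cycle can be sketched as follows. This is a minimal one-component PLS1 regression in pure Python, not the authors' implementation: the three numeric descriptors per variant, the activity values and the candidate are all hypothetical, and a real study would use physicochemical descriptors for each residue and usually more than one latent component.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def fit_pls1_one_component(X, y):
    """Fit a one-component PLS1 model. X is a list of descriptor rows, y a list of activities."""
    n, p = len(X), len(X[0])
    x_means = [mean([row[j] for row in X]) for j in range(p)]
    y_mean = mean(y)
    # Center descriptors and activities.
    Xc = [[row[j] - x_means[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: direction in descriptor space with maximal covariance with activity.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    # Scores of each variant on the latent component, and the regression of y on the scores.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    q = sum(t[i] * yc[i] for i in range(n)) / sum(v * v for v in t)
    return {"x_means": x_means, "y_mean": y_mean, "w": w, "q": q}

def predict(model, row):
    """Predict the activity of a variant from its descriptor row."""
    score = sum((row[j] - model["x_means"][j]) * model["w"][j] for j in range(len(row)))
    return model["y_mean"] + model["q"] * score

# Hypothetical training set: every combination of three binary descriptors.
X_train = [[a, b, c] for a in (0, 1) for b in (0, 1) for c in (0, 1)]
# Hypothetical activities: descriptor 0 helps, descriptor 1 hurts, descriptor 2 is neutral.
y_train = [2 * a - b for a, b, c in X_train]

model = fit_pls1_one_component(X_train, y_train)
candidate = [1.5, -0.5, 0.2]          # hypothetical untested variant
predicted_activity = predict(model, candidate)
print("weights:", [round(v, 3) for v in model["w"]])
print("predicted activity of candidate:", round(predicted_activity, 3))
```

The weight vector plays the role described in the text: descriptors with large absolute weights are the ones the model associates most strongly with activity, and new variants are chosen by ranking candidates on their predicted activity.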
Define the solution space
The total possible number of proteins encoded by a 1 kb
gene is 20^333 (20 alternative amino acids at each position
in a string of 333 residues), or roughly 10^430. This is an unfeasibly
large number of variants to screen. Fortunately, not all
possible sequences need be considered as naturally occurring proteins can usually be relied on to provide a starting
point for engineering efforts. Active point-mutants [10],
phylogenetic substitutions [11], structural modeling
[12,13] and known immunogenic constraints [14] are
well-explored methods of targeting specific regions of a
protein for change.
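The size of the unrestricted sequence space is easy to check directly. The short sketch below assumes only the figures given in the text (20 amino acids, 333 residues); computed exactly, the exponent comes out near 433, the same order as the rounded figure quoted above.

```python
from math import log10

AMINO_ACIDS = 20        # standard amino acid alphabet
PROTEIN_LENGTH = 333    # residues encoded by a gene of about 1 kb

# Python integers are arbitrary precision, so the count is exact.
sequence_count = AMINO_ACIDS ** PROTEIN_LENGTH
exponent = PROTEIN_LENGTH * log10(AMINO_ACIDS)

print(f"20^333 has {len(str(sequence_count))} decimal digits")
print(f"log10(20^333) = {exponent:.1f}")
```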
Define the search algorithm
Protein engineering has been described as a nondeterministic
polynomial-time (NP)-complete problem [15,16]: no known algorithm
solves it in time that grows only polynomially with problem size,
so none can guarantee finding the optimal solution without, in the
worst case, evaluating all possible solutions. Empirical protein engineers have largely limited themselves to addressing this problem with exhaustive searches
using ultra-high-throughput phage and ribosome display
screens [17,18] or evolutionary methods [1–3,19]. By
contrast, the wider engineering community has exploited
genetic algorithms as well as regression-based algorithms,
neural nets, clustering, and several other tools as alternative techniques to address NP-complete problems [20].
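As an illustration of one of these search tools, the sketch below runs a minimal genetic algorithm over short peptide sequences. It is a toy under stated assumptions: the fitness function simply counts matches to a hypothetical optimal sequence (`TARGET`), standing in for a measured or modeled activity, and the population size, mutation rate and selection scheme are arbitrary choices.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids
TARGET = "MKTAYIAK"                 # hypothetical optimum, used only to define a toy fitness

def fitness(seq):
    """Toy fitness: number of positions matching the hypothetical optimum."""
    return sum(a == b for a, b in zip(seq, TARGET))

def mutate(seq, rng, rate=0.1):
    """Replace each residue with a random one with probability `rate`."""
    return "".join(rng.choice(ALPHABET) if rng.random() < rate else aa for aa in seq)

def crossover(parent_a, parent_b, rng):
    """Single-point crossover between two equal-length sequences."""
    cut = rng.randrange(1, len(parent_a))
    return parent_a[:cut] + parent_b[cut:]

def evolve(pop_size=50, generations=40, seed=0):
    rng = random.Random(seed)       # seeded so the run is reproducible
    population = ["".join(rng.choice(ALPHABET) for _ in TARGET) for _ in range(pop_size)]
    history = []                    # best fitness seen in each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 5]   # the top 20% survive unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    population.sort(key=fitness, reverse=True)
    history.append(fitness(population[0]))
    return population[0], history

best, history = evolve()
print("best sequence:", best, "fitness:", fitness(best))
```

Because the best sequences are carried over unchanged, the best fitness never decreases from one generation to the next, but nothing guarantees that the global optimum is reached.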
Statistical targeting of amino acid changes
Comparisons of natural protein and DNA sequences,
particularly those using the powerful technique of principal component analysis, can be used to identify residues
that are important for specific functionality within a
protein [21–25]. Natural substitution patterns
can also be used to infer which changes are likely to be
acceptable within functional proteins. For example, a
recent study of subtilisin variants found that all 52 of
the amino acid variations found in 15 homologs were
active within the context of at least one backbone; their
incorporation produced proteases with varying catalytic
properties [26]. In another set of experiments, all of the
active-site residues from one fungal phytase were
replaced with those from another; again, the result was
an active protein with altered catalytic properties [11].
By incorporating small numbers of changes identified
from alignments of naturally occurring sequences, it
has also been possible to increase the thermostability
of a fungal phytase by over 30 °C [27]. Substitution
matrices derived from synonymous and non-synonymous
substitution rates can also be used to choose reasonable
amino acid changes if there is insufficient phylogenetic
data to use sequence alignments [28–31].
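Collecting candidate substitutions from an alignment of homologs can be sketched in a few lines. The alignment below is a hypothetical toy, and the function simply lists residues that occur in homologs but not in the chosen reference; it makes no attempt to weight substitutions by phylogeny or by a substitution matrix.

```python
def observed_substitutions(alignment, reference_index=0):
    """Map each 1-based position to the residues seen in homologs but not in the reference."""
    reference = alignment[reference_index]
    if any(len(seq) != len(reference) for seq in alignment):
        raise ValueError("all aligned sequences must have equal length")
    candidates = {}
    for pos, ref_aa in enumerate(reference):
        # Ignore the reference residue itself and alignment gaps.
        seen = {seq[pos] for seq in alignment} - {ref_aa, "-"}
        if seen:
            candidates[pos + 1] = sorted(seen)
    return candidates

# Hypothetical alignment; the first sequence is the reference backbone.
toy_alignment = [
    "MKVLAT",
    "MKILAT",
    "MRVLST",
    "MKVL-T",
]

subs = observed_substitutions(toy_alignment)

# Number of variants if every observed substitution is allowed independently.
library_size = 1
for residues in subs.values():
    library_size *= 1 + len(residues)

print(subs)
print("combinatorial library size:", library_size)
```

Restricting the search to substitutions already observed in nature shrinks the solution space from every possible sequence to a combinatorial library of manageable size.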
Figure 1. Polypeptide optimization using mathematical models. The
process is that used by Norinder et al. [32] for the optimization of
the neuropeptide substance P. Flowchart steps: (1) create an initial
set of variants and measure the desired phenotype; (2) build a
sequence–activity model; (3) design new variants based on model
predictions for high-performing sequences; (4) synthesize and test
the new variants; (5) add the new data to refine the
sequence–activity model, then return to step 3.
The same modeling techniques have also been applied to proteins.
In one particularly informative example, Bucht and colleagues
optimized a complex protein phenotype: the activity of human
acetylcholinesterase expressed on the surface of COS-1 cells.
Display of acetylcholinesterase on the cell surface occurs as a
result of glycosyl phosphatidylinositol modification at the
C terminus of the protein. The authors identified two amino acids
in the signal peptide region of the protein, the identity of which
affected cell-surface localization of the protein. They
synthesized eight variant genes, tested the surface expression of
the eight encoded proteins and used PLS to model the
sequence–activity relationship. The authors then constructed an
additional 27 variants in this same region of the protein, using
them to test and refine the model, thereby identifying the optimal
sequence for cell-surface expression of acetylcholinesterase [35].
Modeling sequence–activity relationships to identify optimal
protein variants has not been limited to amino acids
localized to a small region of a protein. Statistical analysis
of mutations distributed throughout several enzymes has
been used to identify the contributions of those changes
to the function of the protein [36] and to predict the sequence
with the best function [37]. Mathematical sequence–activity
modeling has thus been validated at many scales
of complexity: from small molecules to peptides to localized
regions of proteins to changes spread throughout
entire proteins.
Although there is a growing body of work in which
sequence–activity relationships are used to design improved peptides [5,6,38,39], application of the same
methods to protein/biocatalyst engineering is still in its
infancy. One reason for this has been the difficulty in
producing large numbers of modified molecules [40]; in
contrast to peptides, proteins cannot easily be synthesized
directly. As technology improves, the synthesis of individually designed genes becomes increasingly cost-effective [41,42]. Testing variants taken from libraries
that are even cheaper to produce is also likely to produce
useful sequence–activity relationships [43].
Experimental design of maximally
informative datasets
Another useful statistical tool with its origins in other
engineering disciplines is that of experimental design.
This is a technique by which a variant set is designed to
contain the maximum amount of information for subsequent analysis of sequence–activity data [44]. Using
D-optimal design, Mee et al. [45] designed, synthesized
and tested a training set of 60 analogs of a 15 amino acid
antibacterial peptide. A regression-based model derived
from the sequence–activity correlation of the 60 data points was used to design and synthesize 39 new peptides
predicted to have improved activity. The best designed
peptide was twice as potent as the best one in the training
set. In their selection of acetylcholinesterase variants,
Bucht et al. [35] also used experimental design to choose
the eight gene variants that would best represent the
sequence variation they were exploring.
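The D-optimality criterion can be illustrated on a very small problem. The sketch below assumes three positions with two alternative residues each (coded -1/+1) and a model with an intercept plus three main effects, and it picks the four variants that maximize det(X^T X) by exhaustive search. Real design software uses exchange algorithms rather than enumeration, and none of the numbers come from the studies cited.

```python
from itertools import combinations, product

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [list(row) for row in matrix]
    n = len(m)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0                      # singular matrix
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det                      # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det

def information_matrix(rows):
    """X^T X for model rows that already include the intercept column."""
    p = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]

def d_optimal_subset(candidates, n_runs):
    """Exhaustively pick the n_runs candidates that maximize det(X^T X)."""
    return max(
        combinations(candidates, n_runs),
        key=lambda subset: determinant(information_matrix(subset)),
    )

# All 8 combinations of two residue choices at three positions, with a leading intercept of 1.
candidates = [(1,) + levels for levels in product((-1, 1), repeat=3)]
design = d_optimal_subset(candidates, 4)
best_det = determinant(information_matrix(design))

print("selected variants:", design)
print("det(X^T X):", best_det)
```

The selected subset is a balanced half-fraction of the full factorial: every model column is orthogonal to every other, which is why four variants suffice to estimate four parameters with the least possible uncertainty.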
Accounting for amino acid interactions
If an amino acid change at one position affects the
functional consequences of changing other amino acids
in a protein, predictive sequence–function models must
account for this. To achieve the same quality of model, one
that incorporates amino acid interactions requires more data
than one that assumes that the amino acids act independently
[40,46]. In studies of antigen–antibody binding [40]
and ligand–receptor binding [47], researchers found
that very few interaction terms (and thus very little
additional data) were needed to produce accurate descriptions of the sequence–activity relationship.
Recent work from Husimi’s group suggests that this result
is also true for proteins. Individual amino acid changes
contributing to specific properties of dihydrofolate reductase [36], thermolysin and prolyl endopeptidase [37] are
approximately independent. Of particular interest is a
recent study in which only two of 14 randomly generated
mutations that increased prolyl endopeptidase thermostability appeared to be interdependent. The authors’
model contained a single interaction term to account for
this residue pair. A gene variant containing the pair predicted to interact was synthesized and tested; its activity
was shown to be as predicted by the model. Only 45 gene
variants were needed to accurately model the activities of
16 384 possible sequence combinations [46].
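The arithmetic behind this result can be sketched as follows: 14 mutations that are each present or absent give 2^14 combinations, yet a model with independent contributions plus one pairwise interaction term has only 16 parameters. The coefficients and the indices of the interacting pair below are hypothetical, chosen only to show how such a model scores every combination; they are not values from [46].

```python
from itertools import product

N_MUTATIONS = 14
all_combinations = 2 ** N_MUTATIONS        # each mutation is either present or absent

# Hypothetical coefficients, for illustration only.
intercept = 1.0
main_effects = [0.1 * (i + 1) for i in range(N_MUTATIONS)]
interaction_pair = (2, 7)                  # hypothetical indices of the interacting mutations
interaction_effect = -0.5                  # hypothetical penalty when both are present

def predict(variant):
    """Predict the property of a variant given as a tuple of 0/1 flags, one per mutation."""
    value = intercept + sum(b * x for b, x in zip(main_effects, variant))
    i, j = interaction_pair
    return value + interaction_effect * variant[i] * variant[j]

# Intercept + 14 independent contributions + 1 interaction term.
n_parameters = 1 + N_MUTATIONS + 1

# Score every possible combination and keep the best one.
best_variant = max(product((0, 1), repeat=N_MUTATIONS), key=predict)

print("possible combinations:", all_combinations)
print("model parameters:", n_parameters)
print("best predicted variant:", best_variant)
```

With so few parameters, a few dozen well-chosen variants can in principle determine the model, which is consistent with the small number of gene variants reported as sufficient in the study above. In this toy, the interaction term causes the best variant to omit one mutation of the interacting pair even though each is beneficial on its own.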
Heuristic methods are becoming more
Other successful examples of heuristic approaches to
analyze and optimize biological systems include the
optimization of peptidase I using neural networks [48],
calculations of individual amino acid contributions to
serine protease inhibitor activity [49], PLS-based prediction of the determinants of protein localization [50,51],
and protein contact map and interaction site prediction
using neural networks [52]. In work complementing
modeling to assess the contributions of small numbers
of changes at many positions, sequence–activity relationships have been derived using PLS to quantitate the
effects of multiple amino acid substitutions at single
positions in haloalkane dehalogenase, T4 lysozyme, subtilisin and tryptophan synthase. These methods have also
been used to determine the physicochemical properties
required at identified positions to confer specific enzyme
properties [53]. Furthermore, the same tools have been
used to systematically characterize the substrates for a set
of haloalkane dehalogenase variants to determine the
effects of amino acid changes on substrate specificity of
the enzyme [54].
Conclusions: drivers for change
By casting the protein engineering problem as an optimization problem common to other engineering disciplines, we are able to exploit many different problem-solving
algorithms. Gone are the technological barriers
to synthesizing statistically representative datasets. As
Wold predicted in 1986, the capture of protein sequence–
activity relationships now permits the design of optimized
There are several drivers for applying modern engineering tools to protein engineering. Firstly, the human genome project, microarrays and other recent large scientific
endeavours have changed biology from a ‘one variable at a
time’ science to a science engulfed in variables. Secondly,
statistical tools developed and deployed in a variety of
engineering areas can now be operated by non-statisticians
from any desktop computer. Finally, the cost of generating
and sequencing statistically representative sets of genes is
continuously decreasing.