Tutorial: running a simple optimization with GROMACS topologies
To recap from the platform flow page, locuaz iteratively performs mutations on the selected residues of the binder, followed by minimization and MD simulation. The trajectory is then scored with various scorers, to assess whether the mutation results in higher or lower affinity. If the mutation results in better affinity, the mutation is accepted, and the process is repeated. If the mutation does not significantly improve the affinity, then the mutants are discarded and a new set of mutants are generated, based on the original complex(es).
In this tutorial, we are going to optimize the 3 CDRs from a nanobody against a p53 protein. This p53 has a TWIST box where the TWIST protein attaches and promotes its degradation. To prevent this, we’re going to optimize a nanobody that will compete for the binding of the TWIST box. For more info, check the reference.
Assuming we have a conda environment named locuaz, let’s start by activating it:
mamba activate locuaz
Setting up the system
It is advisable to use a pre-equilibrated system as the starting complex for the optimization protocol. In general, the following guidelines can be followed for the systems preparation:
The binder and target are docked with any docking program of choice.
The selected initial complex structures is then equilibrated in a box of water.
The PDB file of the equilibrated structure can then be used for the protocol.
Necessary files
This tutorial will provide a step-by-step guidance to the use of the locuaz protocol to optimize a nanobody
(VHH) as a p53 binder (figure shown above). The bulk of it will concentrate on the writing of the
configuration yaml file. A more detailed explanation of the available options, can be found in the
YAML configuration file. The materials for this tutorial are located in
the examples/simple_tutorial
folder. You can dowload them from this link
d11.pdb
: the PDB file of the pre-equilibrated complex. Target chains go first and then the binders.config_simple.yaml
: the configuration file; all options go in here.mdp
directory: minimization, NVT and NPT GROMACS input files.gmxmmpbsa
: the input file for the gmxmmpbsa scorer
The configuration file
Options don’t go all bunched up together, there are classified in sections. We will review each one of them.
paths
paths:
gmxrc: /usr/local/gromacs/bin
scorers: /home/pbarletta/labo/22/locuaz/bin
mutator: /home/pbarletta/labo/22/locuaz/sample_bin/dlpacker
mdp: /home/pbarletta/labo/22/locuaz/daux/mdp
input: [ /home/pbarletta/labo/22/locuaz/daux/d11_oct ]
work: /home/pbarletta/labo/22/locuaz/daux/rosa
- If you are running this example, you will have to change every field on this section, since these are system dependent.
gmxrc
: is the location of GROMACSGMXRC
script and binaries.scorers
: all scorers have to be in this folder, with a folder for each one and all its necessary files inside. So, for example, if you are using bluues, evoef2 and gmxmmpbsa, inside thescorers
directory you will need thebluues
folder with the bluues binary inside, theevoef2
with the evoef2 binary inside and thegmxmmpbsa
folder with the gmxmmpbsa input script inside.mutator
: the necessary files for the mutator have to be here. We are using the dlpr mutator in this example, so the DLPacker files have to be here: charges.rtp, DLPacker_weights.h5 and library.npz.mdp
: GROMACS mdp input files have to be here.input
: the starting PDB goes here.work
: protocol generated files go here.
main
main:
name: d11
starting_epoch: 0
We are going to leave most values to their defaults and only set the name of our run.
name
: the name of the run. The input PDB located in[paths][input]
has to be named after this name ({name}.pdb
), we are naming this d11.
starting_epoch
: we are going to start on the default value of0
, but if your protocol run is a continuation of a previous one, you can set this value to other number in order to facilitate your posterior analysis.
protocol
protocol:
epochs: 10
branches: 2
memory_size: 4
failed_memory_size: 8
Global options of the protocol run go here.
epochs
: the number of epochs we want to run. Remember that a failed epoch, that is, an epoch that fails to generate at least 1 branch that improves the binding score is backed up (its folder is prefixed withbu_
) and is not included in the total number. So this will be the total number of successful epochs.
branches
: in principle, the number of branches that will be created at each epoch. If we look at the workflow from Introduction, it would be the ‘width’. This option depends on other option calledconstant_width
, see below for more info.
memory_size
: we want to prevent locuaz from mutating a position that was recently mutated, so we set this number to4
, this means that if position, say,128
is mutated on epoch12
, then it won’t be mutated again at least until epoch17
.
failed_memory_size
: similar tomemory_size
but it’s only filled if the mutation of the position failed to improve affinity. Useful when we don’t want the protocol to go back to a failed position for a long time, but at the same time we don’t want to increase thememory_size
too much, which would eliminate a lot of randomness from out run. We will set it to8
.
constant_width
is defaulted to True
, this means that each epoch will have 4 branches so if,
for example, 1 complex moves on to the next epoch, then 4 mutations will be performed on this complex
but if it were 3 complexes then 2 of them would be mutated just once, and only one of them (chosen randomly),
will be mutated twice; thus giving 4 new branches.
If constant_width
was False
, then branches
is the number of mutations performed on each complex from the
previous step. So if 2 complexes move on to the next epoch and branches=4
, then the next epoch will run 8
branches, since 4 new complexes were obtained from each surviving complex.
generation
Warning
Mutation Generators have been deprecated since version 0.7.0 and are slated for removal in 0.8.0. Use a Mutation Creator instead.
generation:
generator: SPM4gmxmmpbsa
probe_radius: 3
Now we begin to deal with a locuaz concept, Blocks. These are the moving parts of locuaz. The first one is the mutation generator, the block that is in charge of taking the sequence of the current complex and generating a new sequence from it.
generator
: we are using the SPM4gmxmmpbsa generator, so later we will have to include gmxmmpbsa as a scorer, so this generator can read the energy decomposition file from gmxmmpbsa and choose the position with the lowest contribution to the affinity as the position to mutate.
probe_radius
: this parameter is only used when the generator includes interface information, which is the case for SPM4gmxmmpbsa and others (eg: SPM4i). The generator uses freesasa to determine the CDR residues that form part of the interface and only considers those as potentials candidates for mutation. Since freesasa is a rolling-probe method,probe_radius
allows the user to set the size of this probe. In this example we are using a radius of3
, a rather large probe, so more residues end up being classified as part of the interface.
In the included gmxmmpbsa file you will find a dedicated section to perform this decomposition:
/
&decomp
idecomp=2, dec_verbose=0,
print_res="within 4"
/
The &idecomp
section needs to be present when using the SPM4gmxmmpbsa
generator.
Check Mutation Generators for a
reference of the implementation,
and YAML configuration file page for more details.
creation
creation:
sites: 1
sites_interfacing: true
sites_interfacing_probe_radius: 3.0
sites_probability: mmpbsa
aa_bins: ["CDEST", "AGIMLV", "PFWY", "RNQHK"]
aa_bins_criteria: without
aa_probability: ReisBarletta
Now we begin to deal with a locuaz concept, Blocks. These are the moving parts of locuaz. The first one is the Mutation Creator the block that is in charge of taking the sequence of the current complex and generating a new sequence from it.
sites
: we’re only going to mutate 1 site at a time. Since each new branch will only have 1 Single Point Mutation with respect to its parent branch, increasing this number will effectively increase the number of new branches the Generator will try to generate, though the number of branches we set above in the protocol section (2) will ultimately decide the number of generated branches at each epoch.
sites_interfacing
: use FreeSASA to determine the binder residues that lie on the interface and only consider those for a mutation.
sites_interfacing_probe_radius
: this parameter is only used when the Creator includes interface information. Since FreeSASA is a rolling-probe method, probe_radius allows the user to set the size of this probe. In this example we are using a radius of3
, a rather large probe, so more residues end up being classified as part of the interface.
sites_probability
: the creator can read the energy decomposition file from gmxmmpbsa and choose the position with the lowest contribution to the affinity as the position to mutate. This eliminates all randomness from the site choosing process.
aa_bins
: bins allow us to classify the current amino acid (AA) at the chosen site, so we can later choose an AA of a different or similar kind. In this case we’re classifying AAs as negatively polarized or charged, hydrophobic, cyclic and positively polarized or charged.
aa_bins_criteria
: we’re asking locuaz to determine the bin of the current AA and choose another one from any other bin.
aa_probability
: once the bin is chosen, each AA will have a probability of being selected according to the probability of finding them in antigen-antibody interface according to the Reis & Barletta paper.
Another word on sites_probability: mmpbsa
. This option forces us to choose 2
other options, the first one being sites_interfacing=true
, or else the residues
that are not part of the interface will always show up as the least contributing.
Also, we will have to include gmxmmpbsa as a scorer later, or else we’ll get an error.
In the included gmxmmpbsa file you will find a dedicated section to perform this decomposition:
/
&decomp
idecomp=2, dec_verbose=0,
print_res="within 4"
/
The &idecomp
section needs to be present when setting sites_probability: mmpbsa
.
Check the YAML configuration file page for more details.
mutation
mutation:
mutator: dlpr
reconstruct_radius: 5
This is another block, the one that is in charge of performing the actual mutation.
mutator
: the external program to mutate the complex and find a suitable side-chain orientation. We are usingdlpr
since it depends on the DLPacker program which comes built-in with locuaz and also performs a nice reconstruction of the surrounding side-chains.
reconstruct_radius
: residues below this distance from the mutated position will also get their side-chains reoriented.
Check Mutators for a reference of the implementation and YAML configuration file for more details.
pruning
pruning:
pruner: consensus
threshold: 2
In this block, you can set how the top branches from an epoch will be selected to pass onto the next one.
pruner
: the consensus pruner is the default one.
threshold
: the minimum number of scorers that have to improve for a branch to be considered better than its parents. Check Pruners for more info.
md
md:
gmx_mdrun: gmx mdrun
mdp_names:
min_mdp: min.mdp
nvt_mdp: short_nvt.mdp
npt_mdp: short_npt.mdp
ngpus: 1
mpi_procs: 1
omp_procs: 4
pinoffsets: [0]
water_type: tip3p
force_field: amber99sb-ildn
box_type: octahedron
Options related to the molecular dynamics run go in here.
gmx_mdrun
: some systems compile the gmx mdrun binary with a different name. The usual default ofgmx mdrun
works for most cases.
mdp_names
: these files should be inconfig['paths']['mdp']
. We set the names of min_mdp, nvt_mdp and npt_mdp.
ngpus
: number of available gpus. These would determine the number of parallel complexes that will be run.
mpi_procs
: number of available MPI processors.
omp_procs
: number of available OMP threads.
pinoffsets
: a list with the offsets for each system being run in parallel.
water_type
: water model.
force_field
: force field used to build the topology of the system.
box_type
: locuaz actually doesn’t change the box (no calls to editconf), but it if this option is set to triclinic, it will check that the system doesn’t go out of the box after the MD run. If it does, the branch will be discarded by assigning +inf values to each score value.
target
target:
chainID: [A]
Target options go here.
chainID
: list with the chainIDs of the target.
binder
binder:
chainID: [B]
mutating_chainID: [B, B, B]
mutating_resSeq: [[27, 28, 29, 30, 31, 32, 33], [53, 54, 55, 56, 57, 58, 59],
[99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116]]
mutating_resname: [ [A, P, W, E, N, T, L], ['Y', 'V', 'F', 'I', 'Y', 'H', 'A'],
['T', 'K', 'W', 'R', 'N', 'Q', 'R', 'E', 'G', 'R', 'G', 'G', 'K', 'S', 'D', 'S', 'P', 'T'] ]
Binder options go here.
chainID
: list with the chainIDs of the binder.
mutating_chainID
: list with the chainIDs of the target that will be mutated. In this example we are repeating the chainIDB
3 times. You are allowed to do this to clearly separate CDRs.
mutating_resSeq
: list of lists with the positions you want to mutate. We are typing 3 lists, one for each CDR.
mutating_resname
: these are the one-letter code of the amino acids that correspond to the mutating_resSeq from above. You can type them straight away or wrap them in''
. This is a mandatory field, and is used to check that the user typed the right positions on the field above. It’s only checked when the protocol runs for the first time.
scoring
scoring:
functions: [evoef2, bluues, piepisa, gmxmmpbsa]
nthreads: 6
mpi_procs: 4
Finally, we have the options related to scoring.
functions
: list of scorers to use. Check schema.yaml for a reference of all the currently available ones.
nthreads
: number of processes used for all the scorers but gmxmmpbsa.
mpi_procs
: number of MPI processors used for gmxmmpbsa. If set to1
, gmxmmpbsa will not use its MPI capabilites. Useful if you are having issues with MPI, though it will be slow.
start
: Useful if you want to skip a few frames before starting to score. 0-indexed.
end
: Also 0-indexed. Defaults to-1
, which means all remaining frames.
Check Scorers for more info on each scorer.
Running the protocol
Once the input file has been specified, and all the files are gathered, the protocol can now be run by firstly activating the environment, if you haven’t already.
mamba activate locuaz
python /home/user/locuaz/locuaz/protocol.py config_tleap.yaml
Now the protocol will create the working directory folder. In this folder, the progress of the protocol will be written on the nb.log file and folders corresponding to each epochs and branches will be created in this directory.