PBT Workflow¶
PBT is an asynchronous optimization algorithm for jointly optimizing a population of models and their hyperparameters while effectively using a fixed computational budget. Like a simple parallel grid search, PBT begins by randomly sampling selected hyperparameters and initial weights and training multiple models in parallel using these hyperparameters and weights. However, unlike a parallel search, each training run periodically and asynchronously runs an evaluate method when a model is considered ready, comparing its performance against that of other models. If it is under-performing, PBT uses two additional methods to improve performance: exploit and explore. Exploit leverages the work of the population as a whole by replacing an underperforming model with a better one, i.e., by replacing a model’s current weights with those of the better performing model. Explore attempts to find new better performing hyperparameters by perturbing those of the better performing model. Training then continues with the new weights and the new hyperparameters. Evaluate, exploit, and explore are performed asynchronously and independently by each model for some specified number of steps. In this way the hyperparameters are optimized online and computational resources are focused on better performing hyperparameters and weights, quickly discarding unpromising solutions.
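The evaluate/exploit/explore cycle described above can be sketched in a few lines of Python. This is a self-contained illustration of the idea, not the workflow's actual code: the population here is just a list of dictionaries, and the truncation fraction and perturbation factors are assumptions.

```python
import random

def pbt_step(population, truncation=0.2):
    """One synchronous sketch of a PBT step. Each model is a dict with
    'weights', 'lr', and 'score' (e.g. validation loss, lower is better)."""
    ranked = sorted(population, key=lambda m: m["score"])  # best first
    cutoff = max(1, int(len(ranked) * truncation))
    top, bottom = ranked[:cutoff], ranked[-cutoff:]
    for model in bottom:
        better = random.choice(top)
        # exploit: replace the underperformer's weights with the better model's
        model["weights"] = list(better["weights"])
        # explore: perturb the better model's hyperparameter (factors assumed)
        model["lr"] = better["lr"] * random.choice([0.8, 1.2])
    return population
```

In the real workflow these steps run asynchronously per model via a Keras callback, with weights exchanged through files rather than in memory.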
This PBT example is written in Python using the MPI for Python (mpi4py) package. It consists of model-agnostic framework code for creating PBT workflows (`python/pbt.py`) and an example workflow (`python/tc1_pbt.py`). The example workflow trains a variant of our tc1 benchmark (`models/tc1`). In this example, a tc1 model run is considered underperforming if its validation loss is in the bottom 20% of the population, at which point it performs an exploit and an explore. During exploit, a model loads the weights of a model randomly selected from the top 20%. (Loading and storing of weights is file-based: weights are serialized every epoch and then loaded as necessary.) During explore, the model perturbs the learning rate of the selected better performing model, and then continues training with the new weights and learning rate.
Requirements¶
This workflow: git@github.com:ECP-CANDLE/Supervisor.git. Clone and cd to workflows/pbt (the directory containing this README).
Python: the PBT workflow has been tested under Python 2.7.
MPI for Python (mpi4py): http://mpi4py.scipy.org/docs/
Keras: https://keras.io
CANDLE Benchmark Code: git@github.com:ECP-CANDLE/Benchmarks.git. Clone and switch to the frameworks branch.
TC1 benchmark data:
ftp://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/type-class/type_18_300_test.csv
ftp://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/type-class/type_18_300_train.csv
`type_18_300_train.csv` and `type_18_300_test.csv` should be copied into `X/Benchmarks/Data/Pilot1`, where X is wherever you cloned the Benchmarks repository. For example, from within `X/Benchmarks`:

```
mkdir -p Data/Pilot1
cd Data/Pilot1
wget ftp://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/type-class/type_18_300_test.csv
wget ftp://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/type-class/type_18_300_train.csv
```
Running the Workflow¶
The PBT workflow is an MPI application that, when given N processes, runs N - 1 tc1 models and uses the remaining process to run a datastore into which the models can put, and from which they can get, model performance data.

The workflow can be run using the scripts in the `scripts` directory. Two scripts are provided: `local_run_pbt.sh` and `sbatch_run_pbt.sh`. The former can be used to run on a local desktop or laptop; the latter can be used to submit the PBT workflow on HPC resources that use the slurm scheduler. In either case, the main application file is `python/tc1_pbt.py`.
When run, the PBT workflow will create an experiments directory in which the output will be written. The output consists of a `weights` directory into which each tc1 instance writes its model weights every epoch, and an `output.csv` file that records the accuracy, loss, learning rate, validation accuracy, and validation loss for each model (identified by MPI rank) each epoch. Additionally, each tc1 model run will execute within its own `run_N` instance directory (e.g. `run_1`, `run_2`, and so forth) within the output directory.
local_run_pbt.sh¶
`local_run_pbt.sh` takes 3 arguments:

1. The number of processes to use
2. An experiment id
3. The path to a pbt parameter file (see below) that defines the tc1 hyperparameters
The experiment id is used as the name of the experiments directory into which the model output will be written, as mentioned above. For example, given the location of the `scripts` directory as `workflows/pbt/scripts` and an experiment id of `r1`, the experiments directory will be `workflows/pbt/experiments/r1`.
sbatch_run_pbt.sh¶
`sbatch_run_pbt.sh` takes 2 arguments:

1. An experiment id
2. The path to a pbt parameter file (see below) that defines the tc1 hyperparameters
The experiment id is again used as the name of the experiments directory into which the model output will be written, as mentioned above. For example, given the location of the `scripts` directory as `workflows/pbt/scripts` and an experiment id of `r1`, the experiments directory will be `workflows/pbt/experiments/r1`.
`sbatch_run_pbt.sh` ultimately calls `sbatch` to submit the job defined in `scripts/pbt.sbatch`. That file can be copied and edited as appropriate, setting the queue, walltime, python, etc. for your HPC machine. It is currently configured for NERSC’s Cori system.
Hyperparameter Configuration File¶
The PBT workflow uses a JSON-format file for defining the hyperparameter space used by the PBT algorithm. The PBT workflow includes 2 sample hyperparameter configuration files for the tc1 model:

- `data/tc1_params_full.json`: runs the full tc1 model, including the default convolution layer and no feature subsampling.
- `data/tc1_params_small.json`: runs a faster version of the tc1 model by omitting the convolution layer and subsampling the features.
The hyperparameter configuration file consists of a list of JSON dictionaries, each of which defines a hyperparameter. Each dictionary has the following required keys:

- `name`: the name of the hyperparameter (e.g. `epochs`)
- `type`: determines how the models are initialized from the named parameter; one of `constant`, `int`, `float`, `logical`, or `categorical`.
  - `constant`: all the tc1 models are initialized with the specified value
  - `int`: each tc1 model is initialized with an int randomly drawn from the range defined by the `lower` and `upper` bounds
  - `float`: each tc1 model is initialized with a float randomly drawn from the range defined by the `lower` and `upper` bounds
  - `logical`: each tc1 model is initialized with a random boolean
  - `categorical`: each tc1 model is initialized with an element chosen at random from the list of elements in `values`
The following keys are required depending on the value of the `type` key:

- If the `type` is `constant`:
  - `value`: the constant value
- If the `type` is `int` or `float`:
  - `lower`: the lower bound of the range to randomly draw from
  - `upper`: the upper bound of the range to randomly draw from
- If the `type` is `categorical`:
  - `values`: the list of elements to randomly choose from
  - `element_type`: the type of the elements to choose from; one of `int`, `float`, `string`, or `logical`
A sample hyperparameter definition file:
```json
[
  {
    "name": "epochs",
    "type": "constant",
    "value": 5
  },
  {
    "name": "activation",
    "type": "categorical",
    "element_type": "string",
    "values": ["softmax", "elu", "softplus", "softsign", "relu", "tanh", "sigmoid", "hard_sigmoid", "linear"]
  },
  {
    "name": "batch_size",
    "type": "categorical",
    "element_type": "int",
    "values": [32, 64]
  },
  {
    "name": "lr",
    "type": "float",
    "lower": 0.0001,
    "upper": 0.01
  }
]
```
Note that any other keys are ignored by the workflow but can be used to add additional information about the hyperparameter. For example, the sample files contain a `comment` entry with additional information about that hyperparameter.
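As an illustration of the semantics above, a stand-alone sampler for such a definition file might look like the following. The real workflow uses the ga_utils package for this; this sketch only mirrors the documented behavior of each `type` and is not the actual implementation.

```python
import random

def draw_params(definitions):
    """Draw one set of hyperparameters from a list of definition dicts,
    following the documented semantics of each 'type'."""
    params = {}
    for d in definitions:
        t = d["type"]
        if t == "constant":
            params[d["name"]] = d["value"]
        elif t == "int":
            params[d["name"]] = random.randint(d["lower"], d["upper"])
        elif t == "float":
            params[d["name"]] = random.uniform(d["lower"], d["upper"])
        elif t == "logical":
            params[d["name"]] = random.random() < 0.5
        elif t == "categorical":
            params[d["name"]] = random.choice(d["values"])
    return params
```

In the workflow, rank 0 would draw one such set per model and distribute them over MPI.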
Workflow Explained¶
The workflow consists of 3 parts: the DNN tc1 model in `models/tc1`, the PBT python code in `python/pbt.py`, and the python code that runs the tc1 model using PBT, `python/tc1_pbt.py`.
tc1¶
The tc1 model is a lightly modified version of the CANDLE tc1 benchmark. The code has been updated so that an external Keras callback can be passed through `models/tc1/tc1_runner.run()` and attached to the model. The PBT algorithm is run via this callback.
python/pbt.py¶
`pbt.py` provides the model-agnostic framework code for implementing a PBT workflow. It has 4 main components.
A PBTMetaDataStore class. This implements an in-memory datastore for the model run performance and hyperparameter data. It also manages a locking scheme for model weight file IO in order to prevent issues with concurrent file access.
A PBTClient class. This allows an individual instance of a model to communicate with the PBTMetaDataStore: sending it performance data, querying performance data for a better performing model, and requesting read and write locks for reading other models’ weights and writing its own. The PBTClient and PBTMetaDataStore communicate via MPI.
A PBTCallback class. This is a Keras callback that, given model-specific ready, exploit, and explore implementations, will pass its current performance data to the datastore and write its model’s weights every epoch. Then, when ready, it will perform an evaluate to find a better performing model. Assuming one is found, an exploit and explore will be performed to update its model’s weights and hyperparameters appropriately. A PBTCallback uses a PBTClient to communicate with a PBTMetaDataStore.
A PBTWorker interface. This interface defines the API for PBT’s ready, exploit and explore steps. Client code implements this interface, supplying implementations appropriate to that particular workflow.
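A minimal PBTWorker implementation might look like the sketch below. The method names follow the interface described in this README, but the exact signatures in `pbt.py` may differ; treat the argument shapes (plain dicts) and the perturbation factors as assumptions.

```python
import random

class ExampleWorker:
    """Illustrative PBTWorker-style class; not the actual pbt.py interface."""

    def ready(self, model):
        # ready for a potential exploit/explore every 5 epochs
        return model["epoch"] > 0 and model["epoch"] % 5 == 0

    def pack_data(self, model):
        # metadata for the datastore, plus the 'score' used by evaluate
        # (here, validation loss: lower is better)
        return {"lr": model["lr"], "score": model["val_loss"]}

    def update(self, model, better):
        # explore: continue with a perturbed version of the better
        # model's learning rate (weight loading is handled elsewhere)
        model["lr"] = better["lr"] * random.choice([0.8, 1.2])
```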
python/tc1_pbt.py¶
`tc1_pbt.py` implements PBT for the tc1 model using the classes and functions in `pbt.py`. In `tc1_pbt.py`, rank 0 first generates and distributes the hyperparameters to the models running on the other ranks. The ga_utils package is used to read the hyperparameter definition file (see above) and generate a set of hyperparameters for each model. Once the hyperparameters are distributed, a PBTMetaDataStore is started, also on rank 0.
PBTMetaDataStore’s constructor is passed the path of the output directory where the `output.csv` file will be written, together with the path to a log file in which user-customizable log messages are written. PBTMetaDataStore also takes a reference to an evaluate function that is used to evaluate a model’s current performance and select a better performing model. That function must have the following arguments: a list of dictionaries that contains the metadata for all the models, and a score against which model performance is determined. Exactly what the score represents (e.g. the validation loss) is domain specific and is provided in the `PBTWorker.pack_data` method described below.
In `tc1_pbt.py`, `truncation_select` implements this evaluate function and is passed to the PBTMetaDataStore. In `truncation_select`, if the specified score is in the top 80% of scores, then an empty dictionary is returned. This empty dictionary indicates that a better performing model was not found and thus exploit and explore should not occur. If the specified score is in the bottom 20%, then the data for a model in the top 20% is randomly selected and returned in a python dictionary. The data in this dictionary, the rank of the better performing model and its relevant hyperparameters, can then be used in exploit and explore.
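The selection logic described above can be sketched as follows, assuming each model's metadata dictionary carries a `score` key (lower is better, e.g. validation loss). This is a stand-alone illustration, not the actual `truncation_select` code.

```python
import random

def truncation_select(all_data, score):
    """Return {} if `score` is in the top 80%; otherwise return the
    metadata of a randomly chosen model from the best 20%."""
    ranked = sorted(all_data, key=lambda m: m["score"])  # best first
    n = len(ranked)
    cutoff = max(1, int(n * 0.2))
    # not among the worst 20%: no better model is returned
    if score < ranked[n - cutoff]["score"]:
        return {}
    # otherwise hand back a random model from the best 20%
    return random.choice(ranked[:cutoff])
```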
With the PBTMetaDataStore initialized on rank 0, all the remaining processes run the tc1 model. A PBTCallback is added to each one of these models. The PBTCallback constructor requires an instance of a class that implements the PBTWorker interface. A PBTCallback calls the 3 methods of a PBTWorker to:

- Retrieve a model’s metadata and hyperparameters in order to put them in the PBTMetaDataStore (`PBTWorker.pack_data`),
- Specify which performance metric to use as the ‘score’ for model performance (also in `PBTWorker.pack_data`) in an evaluate,
- Determine when a model is ready for a potential exploit and explore (`PBTWorker.ready`),
- Perform the exploit and explore update (`PBTWorker.update`).
In the tc1 PBT workflow, `tc1_pbt.TC1PBTWorker` implements the `PBTWorker` interface. `TC1PBTWorker.pack_data` retrieves a model’s current learning rate, and specifies the validation loss as the performance score. `TC1PBTWorker.ready` specifies that the model is ready every 5 epochs. (5 is too soon to begin sharing weights, but it serves as an example and does exercise the workflow code within a reasonable amount of time.) `TC1PBTWorker.update` updates the model with a better performing learning rate after having perturbed it. Note that `update` does not need to load the better performing model’s weights. That is done automatically in PBTCallback.
In sum then, in a PBTCallback at the end of every epoch:

1. `pack_data` is called to put every model’s performance data and selected hyperparameters into the PBTMetaDataStore.
2. `ready` is called to determine if a model is ready for an exploit / explore update.
3. If `ready` returns true, then the PBTCallback queries the PBTMetaDataStore for a better performing model using the supplied evaluate function (e.g. `truncation_select`).
4. If the selection function returns data from a better performing model, then `update` is called to update the under-performing model with the better performing hyperparameters, and the PBTCallback loads the better performing model’s weights into the under-performing model.
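The per-epoch sequence above can be condensed into a short Python sketch. The `client` and `worker` method names used here (`put_data`, `write_weights`, `evaluate`, `load_weights`) are illustrative assumptions; the actual PBTClient and PBTCallback APIs in `pbt.py` may differ.

```python
def on_epoch_end(client, worker, model):
    """Illustrative end-of-epoch logic for a PBT callback."""
    data = worker.pack_data(model)            # metadata plus the 'score'
    client.put_data(data)                     # 1. performance data -> datastore
    client.write_weights(model)               #    weights serialized every epoch
    if worker.ready(model):                   # 2. ready for exploit/explore?
        better = client.evaluate(data["score"])   # 3. e.g. truncation_select
        if better:                            # 4. empty dict -> keep training
            worker.update(model, better)          # explore: new hyperparameters
            client.load_weights(model, better)    # exploit: better weights
```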
Adapting the Workflow to a Different Model¶
`tc1_pbt.py` can easily be adapted to work with a different model. The following changes will need to be made:

1. A new hyperparameter definition file. The rank 0 code that reads this file can be re-used.
2. A new evaluate function. This can be passed to the PBTMetaDataStore constructor in place of `truncation_select`.
3. A new PBTWorker implementation, implementing `ready`, `pack_data`, and `update` as appropriate for the new model and workflow. This can be passed to the PBTCallback in place of `TC1PBTWorker`.