# GA (genetic algorithm) based hyperparameter optimization on CANDLE Benchmarks
The GA workflow uses the Python deap package (http://deap.readthedocs.io/en/master) to optimize hyperparameters using a genetic algorithm.
## Running
1. cd into the `Supervisor/workflows/GA/test` directory.
2. Specify the GA parameters in the `cfg-prm-1.sh` file (see below for more information on the GA parameters).
3. Specify the `PROCS`, `QUEUE`, etc. in the `cfg-sys-1.sh` file.
4. Launch by passing the MODEL_NAME, SITE, and an optional experiment id to `test-1.sh`:

   ./test-1.sh <model_name> <machine_name> [expid]

   where `model_name` can be tc1 etc., and `machine_name` can be local, cori, theta, titan, etc. (see NOTE below on creating new SITE files).
5. Update the parameter space json file if necessary. The parameter space is defined in a json file (see `workflows/GA/data/tc1_param_space_ga.json` for an example with tc1). The `cfg-prm-1.sh` script will attempt to select the correct json file given the model name; edit that file as appropriate. The parameter space json file format is further described in the Hyperparameter Configuration File section below.
6. The benchmark will be run for the number of processors specified.
7. Final objective function values, along with their parameters, will be available in the experiments directory in a `final_results` file and will also be printed to standard out.
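For example, a hypothetical invocation running the tc1 benchmark on a local machine:

./test-1.sh tc1 local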
## User requirements
What you need to install to run the workflow:
- This workflow: `git@github.com:ECP-CANDLE/Supervisor.git`. Clone and switch to the `master` branch, then `cd` to `workflows/GA` (the directory containing this README).
- TC1 or other benchmark: `git@github.com:ECP-CANDLE/Benchmarks.git`. Clone and switch to the `frameworks` branch.
- Benchmark data: see the individual benchmark's README for obtaining the initial data.
Python-specific installation requirements:

- pandas
- deap
These may already be part of the existing python installation. If not, they can be installed using `conda` or `pip`. They must be installed into the same python installation used by swift-t; running `swift-t -v` will print the python that swift-t has embedded.
If any required python packages must be installed locally, then you will probably need to add your local site-packages directory to the PYTHONPATH specified in cfg-sys-1.sh. For example,
export PYTHONPATH=/global/u1/n/ncollier/.local/cori/deeplearning2.7/lib/python2.7/site-packages
## Calling sequence
Function calls:
test-1.sh -> swift/workflow.sh ->
(GA via EQPy)
swift/workflow.swift -> common/python/deap_ga.py
(Benchmark)
swift/workflow.swift -> common/swift/obj_app.swift ->
common/sh/model.sh -> common/python/model_runner.py -> 'calls Benchmark'
(Results from Benchmark returned to the GA via EQPy)
common/swift/obj_app.swift -> swift/workflow.swift ->
common/python/deap_ga.py
Scheduling scripts:
test-1.sh -> cfg-sys-1.sh ->
common/sh/<machine_name> - module, scheduling, langs .sh files
## Making Changes
To create your own SITE files in `workflows/common/sh/` (`langs-SITE.sh`, `langs-app-SITE.sh`, `modules-SITE.sh`, and `sched-SITE.sh` config files), copy the existing ones, but modify the `langs-SITE.sh` file to define the EQPy location (see `workflows/common/sh/langs-local-as.sh` for an example).
## Structure
The point of the script structure is that it is easy to copy and modify the `test-*.sh` and `cfg-*.sh` scripts, which can then be checked back into the repo for use by others. The `test-*.sh` script and the `cfg-*.sh` scripts should simply contain environment variables that control how `workflow.sh` and `workflow.swift` operate. `test-1.sh` and `cfg-{sys,prm}-1.sh` should be left unmodified for simple testing.
The relevant parameters for the GA algorithm are defined in the `cfg-prm-*.sh` scripts (see `cfg-prm-1.sh` for an example). These are:

- SEED: the random seed used by deap in the GA.
- NUM_ITERATIONS: the number of iterations the GA should perform.
- POPULATION_SIZE: the maximum number of hyperparameter sets to evaluate in each iteration.
- GA_STRATEGY: the algorithm used by the GA. Can be one of "simple" or "mu_plus_lambda". See eaSimple and eaMuPlusLambda at https://deap.readthedocs.io/en/master/api/algo.html?highlight=eaSimple#module-deap.algorithms for more information.
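For orientation, below is a minimal, self-contained deap sketch of what the two GA_STRATEGY values correspond to. This is not the workflow's deap_ga.py; the genes, operators, probabilities, and sizes are illustrative assumptions.

import random
from deap import algorithms, base, creator, tools

# Minimize a single objective (e.g. validation loss).
creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
creator.create("Individual", list, fitness=creator.FitnessMin)

toolbox = base.Toolbox()
# Hypothetical two-gene individuals (e.g. two float hyperparameters).
toolbox.register("attr", random.uniform, 0.0001, 0.01)
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr, 2)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("evaluate", lambda ind: (sum(ind),))  # stand-in for a model run
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutGaussian, mu=0, sigma=0.000495, indpb=0.5)
toolbox.register("select", tools.selTournament, tournsize=3)

pop = toolbox.population(n=16)  # POPULATION_SIZE
# GA_STRATEGY="simple" corresponds to deap's eaSimple loop:
pop, log = algorithms.eaSimple(pop, toolbox, cxpb=0.5, mutpb=0.2, ngen=5)  # ngen ~ NUM_ITERATIONS
# GA_STRATEGY="mu_plus_lambda" would instead use eaMuPlusLambda:
# algorithms.eaMuPlusLambda(pop, toolbox, mu=16, lambda_=16,
#                           cxpb=0.5, mutpb=0.2, ngen=5)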
## Hyperparameter Configuration File
The GA workflow uses a json format file for defining the hyperparameter space. The GA workflow comes with 4 sample hyperparameter spaces in the `GA/data` directory, one each for the combo, nt3, p1b1, and tc1 benchmarks.
The hyperparameter configuration file has a json format consisting of a list of json dictionaries, each one of which defines a hyperparameter. Each dictionary has the following required keys:
- name: the name of the hyperparameter (e.g. epochs)
- type: determines how the initial population (i.e. the hyperparameter sets) is initialized from the named parameter and how those values are subsequently mutated by the GA. Type is one of `constant`, `int`, `float`, `logical`, `categorical`, or `ordered`.
  - `constant`:
    - each model is initialized with the same specified value
    - mutation always returns the same specified value
  - `int`:
    - each model is initialized with an int randomly drawn from the range defined by the `lower` and `upper` bounds
    - mutation is performed by adding the result of a random draw from a gaussian distribution to the current value, where the gaussian distribution's mu is 0 and its sigma is specified by the `sigma` entry
  - `float`:
    - each model is initialized with a float randomly drawn from the range defined by the `lower` and `upper` bounds
    - mutation is performed by adding the result of a random draw from a gaussian distribution to the current value, where the gaussian distribution's mu is 0 and its sigma is specified by the `sigma` entry
  - `logical`:
    - each model is initialized with a random boolean
    - mutation flips the logical value
  - `categorical`:
    - each model is initialized with an element chosen at random from the list of elements in `values`
    - mutation chooses an element from the `values` list at random
  - `ordered`:
    - each model is initialized with an element chosen at random from the list of elements in `values`
    - given the index of the current value in the list of `values`, mutation selects the element n indices away, where n is the result of a random draw between 1 and `sigma` and is then negated with a 0.5 probability
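To make the mutation rules concrete, here is a small sketch of the numeric and ordered mutations described above. These are hypothetical helpers, not the workflow's implementation (which lives in common/python/deap_ga.py); the clamping behavior at the bounds and list ends is an assumption of this sketch.

import random

def mutate_numeric(value, sigma, lower, upper):
    # int/float: add a draw from a gaussian with mu=0 and the given sigma;
    # clamping to [lower, upper] is an assumption of this sketch, and an
    # int-typed parameter would additionally round the result.
    new_value = value + random.gauss(0, sigma)
    return min(max(new_value, lower), upper)

def mutate_ordered(value, values, sigma):
    # ordered: move n indices away, where n is drawn from 1..sigma and
    # negated with probability 0.5; clamping to the ends of the list
    # is an assumption of this sketch.
    n = random.randint(1, sigma)
    if random.random() < 0.5:
        n = -n
    idx = min(max(values.index(value) + n, 0), len(values) - 1)
    return values[idx]

print(mutate_numeric(0.001, 0.000495, 0.0001, 0.01))
print(mutate_ordered(64, [16, 32, 64, 128, 256], 1))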
The following keys are required depending on the value of the `type` key.

If the type is `constant`:

- `value`: the constant value

If the type is `int` or `float`:

- `lower`: the lower bound of the range to draw from
- `upper`: the upper bound of the range to draw from
- `sigma`: the sigma value used by the mutation operator (see above)

If the type is `categorical`:

- `values`: the list of elements to choose from
- `element_type`: the type of the elements to choose from. One of `int`, `float`, `string`, or `logical`

If the type is `ordered`:

- `values`: the list of elements to choose from
- `element_type`: the type of the elements to choose from. One of `int`, `float`, `string`, or `logical`
- `sigma`: the sigma value used by the mutation operator (see above)
A sample hyperparameter definition file:
[
{
"name": "activation",
"type": "categorical",
"element_type": "string",
"values": ["softmax", "elu", "softplus", "softsign", "relu", "tanh", "sigmoid", "hard_sigmoid", "linear"]
},
{
"name": "optimizer",
"type": "categorical",
"element_type": "string",
"values": ["adam", "rmsprop"]
},
{
"name": "lr",
"type": "float",
"lower": 0.0001,
"upper": 0.01,
"sigma": 0.000495
},
{
"name": "batch_size",
"type": "ordered",
"element_type": "int",
"values": [16, 32, 64, 128, 256],
"sigma": 1
}
]
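As a convenience, the sketch below loads a hyperparameter space file and checks that each entry carries the keys required for its type, per the rules above. It is not part of the workflow, and the file path is an assumption.

import json

REQUIRED_BY_TYPE = {
    "constant": {"value"},
    "int": {"lower", "upper", "sigma"},
    "float": {"lower", "upper", "sigma"},
    "logical": set(),
    "categorical": {"values", "element_type"},
    "ordered": {"values", "element_type", "sigma"},
}

# Hypothetical path; point this at your own parameter space file.
with open("workflows/GA/data/tc1_param_space_ga.json") as f:
    params = json.load(f)

for param in params:
    missing = REQUIRED_BY_TYPE[param["type"]] - set(param)
    if missing:
        print(f"{param['name']}: missing keys {missing}")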
Note that any other keys are ignored by the workflow but can be used to add additional information about the hyperparameter. For example, the sample files contain a `comment` entry that provides additional context for each hyperparameter.
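For instance, a hypothetical entry carrying a `comment` key (the values shown are illustrative only):

{
"name": "epochs",
"type": "int",
"lower": 10,
"upper": 50,
"sigma": 5,
"comment": "hypothetical entry; this key is ignored by the workflow"
}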
## Where to check for output
This includes error output.
When you run the test script, you will get a message about
TURBINE_OUTPUT
. This will be the main output directory for your
run.
On a local system, stdout/stderr for the workflow will go to your terminal.
On a scheduled system, stdout/stderr for the workflow will go to
TURBINE_OUTPUT/output.txt
Stdout/stderr for the individual objective function (model) runs go into files of the form:
TURBINE_OUTPUT/EXPID/run/RUNID/model.log
where `EXPID` is the user-provided experiment ID, and `RUNID` identifies one of the model runs generated by the GA, one per parameter set, of the form `R_I_J`, where `R` is the restart number, `I` is the iteration number, and `J` is the sample within the iteration.
Each successful run of the workflow will produce a `final_results_2` file. The first line of the file contains the GA's final population, that is, the final hyperparameter sets. The second line contains the final score (e.g. val loss) for each parameter set. The remainder of the file reports the GA's per-iteration statistics. The columns are:
- gen: the generation / iteration
- nevals: the number of evaluations performed in this generation. In generations after the first, this may be less than the total population size, as some combinations will already have been evaluated.
- avg: the average score
- std: the standard deviation
- min: the minimum score
- max: the maximum score
- ts: a timestamp recording when this generation finished. The value is the number of seconds since the epoch, in floating point format.
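Since pandas is already a requirement of the workflow, the per-iteration statistics can be pulled into a DataFrame with a few lines. This sketch assumes the statistics table is whitespace-separated with a header row starting on the third line of final_results_2; the actual layout of your file may differ.

import pandas as pd

# Skip the first two lines (final population and final scores),
# then read the whitespace-separated stats table.
stats = pd.read_csv("final_results_2", sep=r"\s+", skiprows=2)
print(stats[["gen", "nevals", "avg", "std", "min", "max", "ts"]])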