GA (genetic algorithm) based hyperparameter optimization on CANDLE Benchmarks
================================================================================

The GA workflow uses the Python deap package
(http://deap.readthedocs.io/en/master) to optimize hyperparameters using a
genetic algorithm.

Running
-------

1. cd into the **Supervisor/workflows/GA/test** directory
2. Specify the GA parameters in the **cfg-prm-1.sh** file (see
   `below <#structure>`__ for more information on the GA parameters)
3. Specify PROCS, QUEUE, etc. in the **cfg-sys-1.sh** file
4. Pass the MODEL_NAME, SITE, and an optional experiment id to
   **test-1.sh** when launching:
   ``./test-1.sh model_name machine_name [expid]``, where ``model_name``
   can be tc1 etc., and ``machine_name`` can be local, cori, theta, titan
   etc. (see the `NOTE <#making_changes>`__ below on creating new SITE
   files.)
5. Update the parameter space json file if necessary. The parameter space
   is defined in a json file (see
   **workflows/GA/data/tc1_param_space_ga.json** for an example with tc1).
   The **cfg-prm-1.sh** script will attempt to select the correct json file
   given the model name. Edit that file as appropriate. The parameter space
   json file is further described `here <#config>`__.
6. The benchmark will be run for the number of processors specified.
7. Final objective function values, along with their parameters, will be
   available in a **final_results_2** file in the experiments directory and
   will also be printed to standard out.

User requirements
-----------------

What you need to install to run the workflow:

- This workflow - ``git@github.com:ECP-CANDLE/Supervisor.git``. Clone and
  switch to the ``master`` branch. Then ``cd`` to ``workflows/GA`` (the
  directory containing this README).
- TC1 or other benchmark - ``git@github.com:ECP-CANDLE/Benchmarks.git``.
  Clone and switch to the ``frameworks`` branch.
- Benchmark data - see the individual benchmark's README for obtaining the
  initial data.

Python specific installation requirements:

1. pandas
2. deap

These may already be part of the existing python installation. If not, they
can be installed using ``conda`` or ``pip``. They must be installed into
the same python installation used by swift-t. Running ``swift-t -v`` will
print the python that swift-t has embedded. If any required python packages
must be installed locally, then you will probably need to add your local
site-packages directory to the PYTHONPATH specified in **cfg-sys-1.sh**.
For example,
``export PYTHONPATH=/global/u1/n/ncollier/.local/cori/deeplearning2.7/lib/python2.7/site-packages``

Calling sequence
----------------

Function calls:

::

   test-1.sh -> swift/workflow.sh ->

   (GA via EQPy)
   swift/workflow.swift -> common/python/deap_ga.py

   (Benchmark)
   swift/workflow.swift -> common/swift/obj_app.swift ->
   common/sh/model.sh -> common/python/model_runner.py -> 'calls Benchmark'

   (Results from Benchmark returned to the GA via EQPy)
   common/swift/obj_app.swift -> swift/workflow.swift ->
   common/python/deap_ga.py

Scheduling scripts:

::

   test-1.sh -> cfg-sys-1.sh ->
   common/sh/ - module, scheduling, langs .sh files

Making Changes
---------------

To create your own SITE files in workflows/common/sh/:

- langs-SITE.sh
- langs-app-SITE.sh
- modules-SITE.sh
- sched-SITE.sh

copy the existing ones, but modify the langs-SITE.sh file to define the
EQPy location (see workflows/common/sh/langs-local-as.sh for an example).

Structure
~~~~~~~~~~

The point of the script structure is that it is easy to copy and modify the
``test-*.sh`` and ``cfg-*.sh`` scripts. These can be checked back into the
repo for use by others. The ``test-*.sh`` and ``cfg-*.sh`` scripts should
simply contain environment variables that control how ``workflow.sh`` and
``workflow.swift`` operate. ``test-1.sh`` and ``cfg-{sys,prm}-1.sh`` should
be left unmodified for simple testing.

The relevant parameters for the GA algorithm are defined in ``cfg-prm-*.sh``
scripts (see the example in ``cfg-prm-1.sh``). These are:

- SEED: The random seed used by deap in the GA.
- NUM_ITERATIONS: The number of iterations the GA should perform.
- POPULATION_SIZE: The maximum number of hyperparameter sets to evaluate in
  each iteration.
- GA_STRATEGY: The algorithm used by the GA. Can be one of "simple" or
  "mu_plus_lambda". See eaSimple and eaMuPlusLambda at
  https://deap.readthedocs.io/en/master/api/algo.html?highlight=eaSimple#module-deap.algorithms
  for more information. A minimal deap illustration follows this list.
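The GA strategies themselves come from deap. For orientation, here is a
minimal, self-contained Python sketch of what the two GA_STRATEGY values
select between. It is not the workflow's own ``deap_ga.py``: the toy
objective function, operator choices, and all numeric settings below are
placeholders.

.. code:: python

   import random

   from deap import algorithms, base, creator, tools

   # Minimize a single objective; deap fitnesses are tuples.
   creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
   creator.create("Individual", list, fitness=creator.FitnessMin)

   def evaluate(ind):
       # Toy stand-in for a benchmark run returning, e.g., a val loss.
       return (sum(x * x for x in ind),)

   toolbox = base.Toolbox()
   toolbox.register("attr", random.uniform, -1.0, 1.0)
   toolbox.register("individual", tools.initRepeat, creator.Individual,
                    toolbox.attr, 3)
   toolbox.register("population", tools.initRepeat, list, toolbox.individual)
   toolbox.register("evaluate", evaluate)
   toolbox.register("mate", tools.cxTwoPoint)
   toolbox.register("mutate", tools.mutGaussian, mu=0.0, sigma=0.1, indpb=0.3)
   toolbox.register("select", tools.selTournament, tournsize=3)

   random.seed(42)                  # cf. SEED
   pop = toolbox.population(n=20)   # cf. POPULATION_SIZE

   # GA_STRATEGY="simple" corresponds to deap's eaSimple ...
   final_pop, log = algorithms.eaSimple(
       pop, toolbox, cxpb=0.5, mutpb=0.2, ngen=5, verbose=False)

   # ... and GA_STRATEGY="mu_plus_lambda" to eaMuPlusLambda.
   pop = toolbox.population(n=20)
   final_pop, log = algorithms.eaMuPlusLambda(
       pop, toolbox, mu=20, lambda_=20, cxpb=0.5, mutpb=0.2, ngen=5,
       verbose=False)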
Hyperparameter Configuration File
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The GA workflow uses a json format file for defining the hyperparameter
space. The GA workflow comes with 4 sample hyperparameter spaces in the
``GA/data`` directory, one each for the combo, nt3, p1b1 and tc1
benchmarks.

The hyperparameter configuration file has a json format consisting of a
list of json dictionaries, each one of which defines a hyperparameter. Each
dictionary has the following required keys:

- name: the name of the hyperparameter (e.g. *epochs*)
- type: determines how the initial population (i.e. the hyperparameter
  sets) is initialized from the named parameter and how those values are
  subsequently mutated by the GA. Type is one of ``constant``, ``int``,
  ``float``, ``logical``, ``categorical``, or ``ordered``.

  - ``constant``:

    - each model is initialized with the same specified value
    - mutation always returns the same specified value

  - ``int``:

    - each model is initialized with an int randomly drawn from the range
      defined by the ``lower`` and ``upper`` bounds
    - mutation is performed by adding the result of a random draw from a
      gaussian distribution to the current value, where the gaussian
      distribution's mu is 0 and its sigma is specified by the ``sigma``
      entry

  - ``float``:

    - each model is initialized with a float randomly drawn from the range
      defined by the ``lower`` and ``upper`` bounds
    - mutation is performed by adding the result of a random draw from a
      gaussian distribution to the current value, where the gaussian
      distribution's mu is 0 and its sigma is specified by the ``sigma``
      entry

  - ``logical``:

    - each model is initialized with a random boolean
    - mutation flips the logical value

  - ``categorical``:

    - each model is initialized with an element chosen at random from the
      list of elements in ``values``
    - mutation chooses an element from the ``values`` list at random

  - ``ordered``:

    - each model is initialized with an element chosen at random from the
      list of elements in ``values``
    - given the index of the current value in the list of ``values``,
      mutation selects the element *n* indices away, where *n* is the
      result of a random draw between 1 and ``sigma`` and is then negated
      with a 0.5 probability

The following keys are required depending on the value of the ``type`` key.
A short sketch of the mutation rules follows these listings.

If the ``type`` is ``constant``:

- ``value``: the constant value

If the ``type`` is ``int`` or ``float``:

- ``lower``: the lower bound of the range to draw from
- ``upper``: the upper bound of the range to draw from
- ``sigma``: the sigma value used by the mutation operator (see above)

If the ``type`` is ``categorical``:

- ``values``: the list of elements to choose from
- ``element_type``: the type of the elements to choose from. One of
  ``int``, ``float``, ``string``, or ``logical``

If the ``type`` is ``ordered``:

- ``values``: the list of elements to choose from
- ``element_type``: the type of the elements to choose from. One of
  ``int``, ``float``, ``string``, or ``logical``
- ``sigma``: the sigma value used by the mutation operator (see above)
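To make the mutation rules concrete, here is a small Python sketch of the
``float`` and ``ordered`` mutations as described above. It is an
illustration, not the workflow's actual implementation; the function names
are hypothetical, and clamping the result to the legal range is an
assumption of this sketch.

.. code:: python

   import random

   def mutate_float(value, lower, upper, sigma):
       # Add a draw from a gaussian with mu=0 and the given sigma (the
       # ``int`` rule is the same, plus rounding). Clamping the result to
       # [lower, upper] is an assumption of this sketch.
       return min(max(value + random.gauss(0.0, sigma), lower), upper)

   def mutate_ordered(value, values, sigma):
       # Step n indices away, where n is drawn from 1..sigma and negated
       # with probability 0.5. Clamping the index to the ends of the list
       # is an assumption of this sketch.
       n = random.randint(1, sigma)
       if random.random() < 0.5:
           n = -n
       idx = min(max(values.index(value) + n, 0), len(values) - 1)
       return values[idx]

   print(mutate_float(0.001, lower=0.0001, upper=0.01, sigma=0.000495))
   print(mutate_ordered(64, values=[16, 32, 64, 128, 256], sigma=1))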
A sample hyperparameter definition file:

.. code:: javascript

   [
     {
       "name": "activation",
       "type": "categorical",
       "element_type": "string",
       "values": ["softmax", "elu", "softplus", "softsign", "relu",
                  "tanh", "sigmoid", "hard_sigmoid", "linear"]
     },
     {
       "name": "optimizer",
       "type": "categorical",
       "element_type": "string",
       "values": ["adam", "rmsprop"]
     },
     {
       "name": "lr",
       "type": "float",
       "lower": 0.0001,
       "upper": 0.01,
       "sigma": 0.000495
     },
     {
       "name": "batch_size",
       "type": "ordered",
       "element_type": "int",
       "values": [16, 32, 64, 128, 256],
       "sigma": 1
     }
   ]

Note that any other keys are ignored by the workflow but can be used to add
additional information about the hyperparameter. For example, the sample
files contain a ``comment`` entry with additional information about that
hyperparameter.

Where to check for output
~~~~~~~~~~~~~~~~~~~~~~~~~

This includes error output.

When you run the test script, you will get a message about
``TURBINE_OUTPUT``. This will be the main output directory for your run.

- On a local system, stdout/stderr for the workflow will go to your
  terminal.
- On a scheduled system, stdout/stderr for the workflow will go to
  ``TURBINE_OUTPUT/output.txt``

Stdout/stderr for the individual objective function (model) runs goes to
files of the form ``TURBINE_OUTPUT/EXPID/run/RUNID/model.log``, where
``EXPID`` is the user-provided experiment ID, and ``RUNID`` identifies the
various model runs generated by the GA, one per parameter set, of the form
``R_I_J`` where ``R`` is the restart number, ``I`` is the iteration number,
and ``J`` is the sample within the iteration.

Each successful run of the workflow will produce a ``final_results_2``
file. The first line of the file contains the GA's final population, that
is, the final hyperparameter sets. The second line contains the final score
(e.g. validation loss) for each parameter set. The remainder of the file
reports the GA's per-iteration statistics. A small parsing sketch follows
the column listing. The columns are:

- gen: the generation / iteration
- nevals: the number of evaluations performed in this generation. In
  generations after the first, this may be less than the total population
  size, as some combinations will already have been evaluated.
- avg: the average score
- std: the standard deviation
- min: the minimum score
- max: the maximum score
- ts: a timestamp recording when this generation finished. The value is
  the number of seconds since the epoch, in floating point format.
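Since the exact on-disk formatting of ``final_results_2`` is not specified
here, the following is only a hedged sketch for inspecting one. It assumes
the layout described above (final population on the first line, final
scores on the second, then a whitespace-separated statistics table with a
header row) and uses pandas, which the workflow already requires; adjust it
to match your actual file.

.. code:: python

   import io

   import pandas as pd

   with open("final_results_2") as f:
       final_population = f.readline().rstrip("\n")  # final hyperparameter sets
       final_scores = f.readline().rstrip("\n")      # one score per set
       # Assumed: the remaining lines form a whitespace-separated table
       # with columns gen, nevals, avg, std, min, max, ts.
       stats = pd.read_csv(io.StringIO(f.read()), sep=r"\s+")

   print(final_population)
   print(final_scores)
   print(stats.head())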