Run Asynchronous Search based hyperparameter optimization on CANDLE Benchmarks
================================================================================

async-search is an asynchronous, iterative optimizer written in Python. It
searches for the best hyperparameter values for the CANDLE "Benchmarks"
available here: ``git@github.com:ECP-CANDLE/Benchmarks.git``

Running
-------

1. cd into the **Supervisor/workflows/async-search/test** directory.
2. Specify the async-search parameters in the **cfg-prm-1.sh** file
   (INIT_SIZE, etc.).
3. Specify PROCS, the queue, etc. in the **cfg-sys-1.sh** file.
   (NOTE: currently INIT_SIZE must be at least PROCS-2.)
4. Pass the MODEL_NAME, SITE, and an optional experiment id as arguments to
   **test-1.sh** when launching: ``./test-1.sh model_name machine_name [expid]``
   where ``model_name`` can be tc1 etc. and ``machine_name`` can be local,
   cori, theta, titan etc. (See the `NOTE <#making_changes>`__ below on
   creating new SITE files.)
5. The parameter space is defined in a **problem*.py** file (see
   **workflows/async-search/python/problem_tc1.py** for an example with tc1).
   This is imported as ``problem`` in **async-search.py**.
6. The benchmark will be run on the number of processors specified.
7. Final objective function values, along with their parameters, will be
   available in the experiments directory and will also be printed.

User requirements
-----------------

What you need to install to run the workflow:

- This workflow - ``git@github.com:ECP-CANDLE/Supervisor.git``. Clone and
  switch to the ``master`` branch, then ``cd`` to ``workflows/async-search``
  (the directory containing this README).
- TC1 benchmark - ``git@github.com:ECP-CANDLE/Benchmarks.git``. Clone and
  switch to the ``frameworks`` branch.
- Benchmark data - see the individual benchmark's README for how to obtain
  the initial data.

Python-specific installation needed:

::

   conda install h5py
   conda install scikit-learn
   conda install pandas
   conda install mpi4py
   conda install -c conda-forge keras
   conda install -c conda-forge scikit-optimize

Calling sequence
----------------

Function calls:

::

   test-1.sh -> swift/workflow.sh ->

     (Async-search via EQPy)
     swift/workflow.swift <-> python/async-search.py

     (Benchmark)
     swift/workflow.swift -> obj_folder/obj_app.swift ->
       common/sh/model.sh -> common/python/model_runner.py -> 'calls Benchmark'

     (Results from Benchmark returned directly to Async-search)
     obj_folder/obj_app.swift -> python/async-search.py

Scheduling scripts:

::

   test-1.sh -> cfg-sys-1.sh ->
     common/sh/ - module, scheduling, langs .sh files

Making Changes
--------------

To create your own SITE configuration files in ``workflows/common/sh/``:

- langs-SITE.sh
- langs-app-SITE.sh
- modules-SITE.sh
- sched-SITE.sh

copy the existing ones, but modify the ``langs-SITE.sh`` file to define the
EQPy location (see ``workflows/common/sh/langs-local.sh`` for an example).

Structure
~~~~~~~~~

The point of the script structure is that it is easy to copy and modify the
``test-*.sh`` and ``cfg-*.sh`` scripts, and these can be checked back into the
repo for use by others. The ``test-*.sh`` and ``cfg-*.sh`` scripts should
simply contain environment variables that control how ``workflow.sh`` and
``workflow.swift`` operate. ``test-1.sh`` and ``cfg-{sys,prm}-1.sh`` should be
left unmodified for simple testing.

The relevant parameters for the asynchronous search algorithm are defined in
the ``cfg-*.sh`` scripts (see ``cfg-prm-1.sh`` for an example, and the sketch
after this list). They are:

- INIT_SIZE: The number of initial random samples. (Note: INIT_SIZE needs to
  be larger than PROCS-2 for now.)
- MAX_EVALS: The maximum number of evaluations/tasks to perform.
- NUM_BUFFER: The number of tasks to keep buffered above the number of
  available workers (num_workers): if the number of currently outstanding
  tasks drops below (num_workers + NUM_BUFFER), more tasks are generated.
- MAX_THRESHOLD: Under normal circumstances, when a single model evaluation
  finishes, a new hyperparameter set is produced for evaluation. If model
  evaluations finish within 15 seconds of each other, MAX_THRESHOLD
  evaluations must complete before the corresponding number of new values is
  produced for evaluation. This can help performance when many models finish
  within a few seconds of each other.
- N_JOBS: The number of jobs to run in parallel when producing points (i.e.,
  hyperparameter values) for evaluation. -1 sets this to the number of cores.
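
As a rough illustration only, a minimal parameter configuration could look
like the sketch below. The specific values are placeholders, not
recommendations, and whether the shipped ``cfg-prm-1.sh`` exports these
variables or simply assigns them may differ; check that file for the
authoritative defaults.

::

   # Illustrative sketch only -- placeholder values, not recommendations.
   export INIT_SIZE=32        # initial random samples (see the PROCS note above)
   export MAX_EVALS=200       # maximum number of evaluations/tasks
   export NUM_BUFFER=4        # extra tasks kept queued beyond num_workers
   export MAX_THRESHOLD=10    # batching when evaluations finish in rapid succession
   export N_JOBS=-1           # -1 = use all cores when proposing new points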
Where to check for output
~~~~~~~~~~~~~~~~~~~~~~~~~

This includes error output.

When you run the test script, you will get a message about
``TURBINE_OUTPUT``. This is the main output directory for your run.

- On a local system, stdout/stderr for the workflow will go to your terminal.
- On a scheduled system, stdout/stderr for the workflow will go to
  ``TURBINE_OUTPUT/output.txt``.

Stdout/stderr for the individual objective function (model) runs goes into
files of the form ``TURBINE_OUTPUT/EXPID/run/RUNID/model.log``, where
``EXPID`` is the user-provided experiment ID and ``RUNID`` identifies one of
the model runs generated by async-search, one per parameter set. RUNIDs have
the form ``R_I_J``, where ``R`` is the restart number, ``I`` is the iteration
number, and ``J`` is the sample within the iteration.
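
For example, to skim the model logs after a run finishes, something along
these lines should work. This is a sketch only: the glob assumes the
``TURBINE_OUTPUT/EXPID/run/RUNID/model.log`` layout described above and may
need adjusting for your site.

::

   # Sketch: list per-run model logs and scan them for Python tracebacks.
   # Assumes $TURBINE_OUTPUT is set to the directory reported by the test
   # script and the EXPID/run/RUNID layout described above.
   ls "$TURBINE_OUTPUT"/*/run/*/model.log
   grep -l "Traceback" "$TURBINE_OUTPUT"/*/run/*/model.log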