GA (genetic algorithm) based hyperparameter optimization on CANDLE Benchmarks
================================================================================

The GA workflow uses the Python deap package
(http://deap.readthedocs.io/en/master) to optimize hyperparameters using a
genetic algorithm.

Running
-------

1. cd into the **Supervisor/workflows/GA/test** directory
2. Specify the GA parameters in the **cfg-prm-1.sh** file (see
   `below <#structure>`__ for more information on the GA parameters)
3. Specify PROCS, QUEUE, etc. in the **cfg-sys-1.sh** file
4. Pass the MODEL_NAME, SITE, and an optional experiment id to
   **test-1.sh** when launching:
   ``./test-1.sh model_name machine_name [expid]``, where ``model_name``
   can be tc1 etc., and ``machine_name`` can be local, cori, theta, titan
   etc. (see the `NOTE <#making_changes>`__ below on creating new SITE
   files.)
5. Update the parameter space json file if necessary. The parameter space
   is defined in a json file (see
   **workflows/GA/data/tc1_param_space_ga.json** for an example with tc1).
   The **cfg-prm-1.sh** script will attempt to select the correct json file
   given the model name. Edit that file as appropriate. The parameter space
   json file is further described `here <#config>`__.
6. The benchmark will be run for the number of processors specified.
7. Final objective function values, along with their parameters, will be
   available in a **final_results_2** file in the experiments directory and
   will also be printed to standard out.

User requirements
-----------------

What you need to install to run the workflow:

- This workflow - ``git@github.com:ECP-CANDLE/Supervisor.git``. Clone and
  switch to the ``master`` branch. Then ``cd`` to ``workflows/GA`` (the
  directory containing this README).
- TC1 or other benchmark - ``git@github.com:ECP-CANDLE/Benchmarks.git``.
  Clone and switch to the ``frameworks`` branch.
- Benchmark data - see the individual benchmark's README for obtaining the
  initial data.

Python specific installation requirements:

1. pandas
2. deap

These may already be part of the existing python installation. If not, they
can be installed using ``conda`` or ``pip``. They must be installed into
the same python installation used by swift-t. Running ``swift-t -v`` will
print the python that swift-t has embedded. If any required python packages
must be installed locally, then you will probably need to add your local
site-packages directory to the PYTHONPATH specified in **cfg-sys-1.sh**.
For example,
``export PYTHONPATH=/global/u1/n/ncollier/.local/cori/deeplearning2.7/lib/python2.7/site-packages``

Calling sequence
----------------

Function calls:

::

   test-1.sh -> swift/workflow.sh ->

   (GA via EQPy)
   swift/workflow.swift -> common/python/deap_ga.py

   (Benchmark)
   swift/workflow.swift -> common/swift/obj_app.swift ->
   common/sh/model.sh -> common/python/model_runner.py -> 'calls Benchmark'

   (Results from Benchmark returned to the GA via EQPy)
   common/swift/obj_app.swift -> swift/workflow.swift ->
   common/python/deap_ga.py

Scheduling scripts:

::

   test-1.sh -> cfg-sys-1.sh ->
   common/sh/ - module, scheduling, langs .sh files

Making Changes
---------------

To create your own SITE files in workflows/common/sh/:

- langs-SITE.sh
- langs-app-SITE.sh
- modules-SITE.sh
- sched-SITE.sh

copy the existing ones, but modify the langs-SITE.sh file to define the
EQPy location (see workflows/common/sh/langs-local-as.sh for an example).

Structure
~~~~~~~~~~

The point of the script structure is that it is easy to copy and modify the
``test-*.sh`` and ``cfg-*.sh`` scripts. These can be checked back into the
repo for use by others. The ``test-*.sh`` and ``cfg-*.sh`` scripts should
simply contain environment variables that control how ``workflow.sh`` and
``workflow.swift`` operate. ``test-1.sh`` and ``cfg-{sys,prm}-1.sh`` should
be left unmodified for simple testing.

The relevant parameters for the GA algorithm are defined in ``cfg-prm-*.sh``
scripts (see the example in ``cfg-prm-1.sh``). These are:

- SEED: The random seed used by deap in the GA.
- NUM_ITERATIONS: The number of iterations the GA should perform.
- POPULATION_SIZE: The maximum number of hyperparameter sets to evaluate in
  each iteration.
- GA_STRATEGY: The algorithm used by the GA. Can be one of "simple" or
  "mu_plus_lambda". See eaSimple and eaMuPlusLambda at
  https://deap.readthedocs.io/en/master/api/algo.html?highlight=eaSimple#module-deap.algorithms
  for more information. A minimal deap illustration follows this list.
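The GA strategies themselves come from deap. For orientation, here is a
minimal, self-contained Python sketch of what the two GA_STRATEGY values
select between. It is not the workflow's own ``deap_ga.py``: the toy
objective function, operator choices, and all numeric settings below are
placeholders.

.. code:: python

   import random

   from deap import algorithms, base, creator, tools

   # Minimize a single objective; deap fitnesses are tuples.
   creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
   creator.create("Individual", list, fitness=creator.FitnessMin)

   def evaluate(ind):
       # Toy stand-in for a benchmark run returning, e.g., a val loss.
       return (sum(x * x for x in ind),)

   toolbox = base.Toolbox()
   toolbox.register("attr", random.uniform, -1.0, 1.0)
   toolbox.register("individual", tools.initRepeat, creator.Individual,
                    toolbox.attr, 3)
   toolbox.register("population", tools.initRepeat, list, toolbox.individual)
   toolbox.register("evaluate", evaluate)
   toolbox.register("mate", tools.cxTwoPoint)
   toolbox.register("mutate", tools.mutGaussian, mu=0.0, sigma=0.1, indpb=0.3)
   toolbox.register("select", tools.selTournament, tournsize=3)

   random.seed(42)                  # cf. SEED
   pop = toolbox.population(n=20)   # cf. POPULATION_SIZE

   # GA_STRATEGY="simple" corresponds to deap's eaSimple ...
   final_pop, log = algorithms.eaSimple(
       pop, toolbox, cxpb=0.5, mutpb=0.2, ngen=5, verbose=False)

   # ... and GA_STRATEGY="mu_plus_lambda" to eaMuPlusLambda.
   pop = toolbox.population(n=20)
   final_pop, log = algorithms.eaMuPlusLambda(
       pop, toolbox, mu=20, lambda_=20, cxpb=0.5, mutpb=0.2, ngen=5,
       verbose=False)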
Hyperparameter Configuration File
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The GA workflow uses a json format file for defining the hyperparameter
space. The GA workflow comes with 4 sample hyperparameter spaces in the
``GA/data`` directory, one each for the combo, nt3, p1b1 and tc1
benchmarks.

The hyperparameter configuration file has a json format consisting of a
list of json dictionaries, each one of which defines a hyperparameter. Each
dictionary has the following required keys:

- name: the name of the hyperparameter (e.g. *epochs*)
- type: determines how the initial population (i.e. the hyperparameter
  sets) is initialized from the named parameter and how those values are
  subsequently mutated by the GA. Type is one of ``constant``, ``int``,
  ``float``, ``logical``, ``categorical``, or ``ordered``.

  - ``constant``:

    - each model is initialized with the same specified value
    - mutation always returns the same specified value

  - ``int``:

    - each model is initialized with an int randomly drawn from the range
      defined by the ``lower`` and ``upper`` bounds
    - mutation is performed by adding the result of a random draw from a
      gaussian distribution to the current value, where the gaussian
      distribution's mu is 0 and its sigma is specified by the ``sigma``
      entry

  - ``float``:

    - each model is initialized with a float randomly drawn from the range
      defined by the ``lower`` and ``upper`` bounds
    - mutation is performed by adding the result of a random draw from a
      gaussian distribution to the current value, where the gaussian
      distribution's mu is 0 and its sigma is specified by the ``sigma``
      entry

  - ``logical``:

    - each model is initialized with a random boolean
    - mutation flips the logical value

  - ``categorical``:

    - each model is initialized with an element chosen at random from the
      list of elements in ``values``
    - mutation chooses an element from the ``values`` list at random

  - ``ordered``:

    - each model is initialized with an element chosen at random from the
      list of elements in ``values``
    - given the index of the current value in the list of ``values``,
      mutation selects the element *n* indices away, where *n* is the
      result of a random draw between 1 and ``sigma`` and is then negated
      with a 0.5 probability

The following keys are required depending on the value of the ``type`` key.
A short sketch of the mutation rules follows these listings.

If the ``type`` is ``constant``:

- ``value``: the constant value

If the ``type`` is ``int`` or ``float``:

- ``lower``: the lower bound of the range to draw from
- ``upper``: the upper bound of the range to draw from
- ``sigma``: the sigma value used by the mutation operator (see above)

If the ``type`` is ``categorical``:

- ``values``: the list of elements to choose from
- ``element_type``: the type of the elements to choose from. One of
  ``int``, ``float``, ``string``, or ``logical``

If the ``type`` is ``ordered``:

- ``values``: the list of elements to choose from
- ``element_type``: the type of the elements to choose from. One of
  ``int``, ``float``, ``string``, or ``logical``
- ``sigma``: the sigma value used by the mutation operator (see above)
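To make the mutation rules concrete, here is a small Python sketch of the
``float`` and ``ordered`` mutations as described above. It is an
illustration, not the workflow's actual implementation; the function names
are hypothetical, and clamping the result to the legal range is an
assumption of this sketch.

.. code:: python

   import random

   def mutate_float(value, lower, upper, sigma):
       # Add a draw from a gaussian with mu=0 and the given sigma (the
       # ``int`` rule is the same, plus rounding). Clamping the result to
       # [lower, upper] is an assumption of this sketch.
       return min(max(value + random.gauss(0.0, sigma), lower), upper)

   def mutate_ordered(value, values, sigma):
       # Step n indices away, where n is drawn from 1..sigma and negated
       # with probability 0.5. Clamping the index to the ends of the list
       # is an assumption of this sketch.
       n = random.randint(1, sigma)
       if random.random() < 0.5:
           n = -n
       idx = min(max(values.index(value) + n, 0), len(values) - 1)
       return values[idx]

   print(mutate_float(0.001, lower=0.0001, upper=0.01, sigma=0.000495))
   print(mutate_ordered(64, values=[16, 32, 64, 128, 256], sigma=1))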
A sample hyperparameter definition file:

.. code:: javascript

   [
     {
       "name": "activation",
       "type": "categorical",
       "element_type": "string",
       "values": ["softmax", "elu", "softplus", "softsign", "relu",
                  "tanh", "sigmoid", "hard_sigmoid", "linear"]
     },
     {
       "name": "optimizer",
       "type": "categorical",
       "element_type": "string",
       "values": ["adam", "rmsprop"]
     },
     {
       "name": "lr",
       "type": "float",
       "lower": 0.0001,
       "upper": 0.01,
       "sigma": 0.000495
     },
     {
       "name": "batch_size",
       "type": "ordered",
       "element_type": "int",
       "values": [16, 32, 64, 128, 256],
       "sigma": 1
     }
   ]

Note that any other keys are ignored by the workflow but can be used to add
additional information about the hyperparameter. For example, the sample
files contain a ``comment`` entry with additional information about that
hyperparameter.

Where to check for output
~~~~~~~~~~~~~~~~~~~~~~~~~

This includes error output.

When you run the test script, you will get a message about
``TURBINE_OUTPUT``. This will be the main output directory for your run.

- On a local system, stdout/stderr for the workflow will go to your
  terminal.
- On a scheduled system, stdout/stderr for the workflow will go to
  ``TURBINE_OUTPUT/output.txt``

Stdout/stderr for the individual objective function (model) runs goes to
files of the form ``TURBINE_OUTPUT/EXPID/run/RUNID/model.log``, where
``EXPID`` is the user-provided experiment ID, and ``RUNID`` identifies the
various model runs generated by the GA, one per parameter set, of the form
``R_I_J`` where ``R`` is the restart number, ``I`` is the iteration number,
and ``J`` is the sample within the iteration.

Each successful run of the workflow will produce a ``final_results_2``
file. The first line of the file contains the GA's final population, that
is, the final hyperparameter sets. The second line contains the final score
(e.g. validation loss) for each parameter set. The remainder of the file
reports the GA's per-iteration statistics. A small parsing sketch follows
the column listing. The columns are:

- gen: the generation / iteration
- nevals: the number of evaluations performed in this generation. In
  generations after the first, this may be less than the total population
  size, as some combinations will already have been evaluated.
- avg: the average score
- std: the standard deviation
- min: the minimum score
- max: the maximum score
- ts: a timestamp recording when this generation finished. The value is
  the number of seconds since the epoch, in floating point format.
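Since the exact on-disk formatting of ``final_results_2`` is not specified
here, the following is only a hedged sketch for inspecting one. It assumes
the layout described above (final population on the first line, final
scores on the second, then a whitespace-separated statistics table with a
header row) and uses pandas, which the workflow already requires; adjust it
to match your actual file.

.. code:: python

   import io

   import pandas as pd

   with open("final_results_2") as f:
       final_population = f.readline().rstrip("\n")  # final hyperparameter sets
       final_scores = f.readline().rstrip("\n")      # one score per set
       # Assumed: the remaining lines form a whitespace-separated table
       # with columns gen, nevals, avg, std, min, max, ts.
       stats = pd.read_csv(io.StringIO(f.read()), sep=r"\s+")

   print(final_population)
   print(final_scores)
   print(stats.head())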