CANDLE Shared Installation¶
Terminology and scope¶
The CANDLE shared installation is both a ready-to-use, central installation of CANDLE and a set of scripts that extends CANDLE's functionality and makes it easier for new users. For brevity, these scripts are called the "wrapper scripts" or "wrappers" below, as they are essentially wrappers around the Supervisor/Benchmarks codebase, aiming to improve functionality while leaving the codebase as untouched as possible. These scripts should not interfere in any way with how CANDLE is currently being run; they are enhancements only. They are currently set up and tested on Biowulf and Summit.
The source code of the wrapper scripts is currently located here. This contains code that (1) helps to set up and test these scripts alongside new clones of the Supervisor and Benchmarks repositories, and (2) adds various features to CANDLE. The documentation for setup (#1) can be found here; the documentation for usage (#2) is below.
Overview of wrapper scripts functionality¶
For users¶
- Run CANDLE as a central installation. E.g., instead of cloning the Supervisor and Benchmarks repos as usual and then running a Supervisor workflow directly in the `Supervisor/workflows/<WORKFLOW>/test` directory, you go to any arbitrary directory on the filesystem, create or edit a single text file (the "input file"), and call CANDLE with the input file as an argument. This is similar to how other large HPC-enabled software packages are run, e.g., software for calculating electronic structure
- Edit only a single text input file to modify everything you would need to set in order to run a job, e.g., workflow type, hyperparameter space, number of workers, walltime, "default model" settings, etc.
- Minimally modify a bare model script: e.g., there is no need to add `initialize_parameters()` and `run()` functions (whose content occasionally changes) to a new model that you'd like to run using CANDLE. The wrapper scripts still work for canonically CANDLE-compliant (i.e., "candelized") model scripts such as the already-written `main.py` files used to run the benchmarks. Additional benefits of only minimally modifying a bare model script:
  - The output of the model using each hyperparameter set is put in its own file, `subprocess_out_and_err.txt`
  - Custom environments can be automatically defined for running the model script using, e.g., the keywords `supp_modules`, `python_bin_path`, `exec_python_module`, and `supp_pythonpath`, described further in the section on input file keywords below (see the keywords noted to apply for "minimal CANDLE-compliance only")
- Run model scripts written in other languages such as `R` and `bash` (tested on Biowulf but not yet tested on Summit); only minimal additions to the wrapper scripts are needed for adding additional language support
- Perform a consistent workflow for testing and production jobs (see the sketch after this list), i.e.:
  - Testing: Run `candle submit-job <INPUT-FILE>` with the input file keyword setting `run_workflow=0` on an interactive node (e.g., `bsub -W 1:00 -nnodes 1 -P med106 -q debug -Is /bin/bash`) for testing modifications to a model script. Note: See here if you encounter an issue with the CUDA driver when testing a model in this interactive mode.
  - Production: Run `candle submit-job <INPUT-FILE>`, this time with the default keyword setting `run_workflow=1`, on a login node for submitting a CANDLE job as usual.

  As long as the wrapper scripts are set up properly and your model script runs successfully using `run_workflow=0`, you can be reasonably confident that a job submitted using `run_workflow=1` will pick up and run without dying.
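A typical test-then-submit cycle therefore looks like the following (a sketch; the input file name my_input_file.in is hypothetical):

# On an interactive node (e.g., obtained via
# "bsub -W 1:00 -nnodes 1 -P med106 -q debug -Is /bin/bash"),
# with run_workflow=0 set in the input file:
candle submit-job my_input_file.in
# Then, on a login node, with the default run_workflow=1 restored:
candle submit-job my_input_file.in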
For developers¶
- Modify only a single file (`candle_compliant_wrapper.py`) whenever the CANDLE-compliance procedure changes. E.g., if the benchmarks used the minimal modification to the `main.py` files rather than the traditional CANDLE-compliance procedure, there would be no need to update every benchmark whenever the CANDLE-compliance procedure changed
- Edit only a single file (`preprocess.py`) in order to make system-specific changes, such as a custom modification to the `$TURBINE_LAUNCH_OPTIONS` variable; there is no need to edit each Supervisor workflow's `workflow.sh` file
Loading the candle module¶
We are currently getting CANDLE approved as user-managed software on Summit. Once it is approved, we will be able to load the `candle` module via `module load candle`. In the interim, do this instead:
source /gpfs/alpine/med106/world-shared/candle/env_for_lmod-tf2.sh
Both methods primarily do the following:

- Sets the `$CANDLE` variable to `/gpfs/alpine/med106/world-shared/candle/tf2` in order to set the top-level directory of the entire CANDLE file tree, including the `Supervisor`, `Benchmarks`, and `wrappers` GitHub repositories
- Appends `$CANDLE/wrappers/bin` to `$PATH` in order to be able to run `candle` from the command line
- Sets the `$SITE` variable to `summit-tf2` in order to specify the HPC system and environment
- Appends `$CANDLE/Benchmarks/common` to `$PYTHONPATH` to allow one to write a Python model script in an arbitrary directory and be able to run `import candle` in the script
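Either way, you can quickly confirm the environment is set up as described (a quick check; the expected values are those listed above):

echo $CANDLE  # /gpfs/alpine/med106/world-shared/candle/tf2
echo $SITE    # summit-tf2
which candle  # should resolve to $CANDLE/wrappers/bin/candle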
Quick-start examples (for Summit)¶
Step 1: Setup¶
# Load the CANDLE module; do the following for the time being in lieu of "module load candle", as we are currently getting CANDLE approved as user-managed software
source /gpfs/alpine/med106/world-shared/candle/env_for_lmod-tf2.sh
# Enter a possibly empty directory that is completely outside of the Supervisor/Benchmarks repositories on the Alpine filesystem, such as $MEMBERWORK
cd /gpfs/alpine/med106/scratch/weismana/notebook/2020-11-13/testing_candle_installation
Step 2: Run sample CANDLE-compliant model scripts¶
This refers to model scripts that the developers refer to as "CANDLE-compliant" or "candelized."
NT3 using UPF (CANDLE-compliant model scripts)¶
# Import the UPF example (one file will be copied over)
candle import-template upf
# Submit the job to the queue
candle submit-job upf_example.in
NT3 using mlrMBO (CANDLE-compliant model scripts)¶
# Import the mlrMBO example (two files will be copied over)
candle import-template mlrmbo
# Submit the job to the queue
candle submit-job mlrmbo_example.in
Step 3: Run sample non-CANDLE-compliant model scripts¶
This refers to model scripts that have gone from “bare” (e.g., one downloaded directly from the Internet) to “minimally modified,” a process described below.
MNIST using UPF (non-CANDLE-compliant model scripts)¶
# Pre-fetch the MNIST data since Summit compute nodes can't access the Internet (this has nothing to do with the wrapper scripts)
mkdir candle_generated_files
/gpfs/alpine/world-shared/med106/sw/condaenv-200408/bin/python -c "from keras.datasets import mnist; import os; (x_train, y_train), (x_test, y_test) = mnist.load_data(os.path.join(os.getcwd(), 'candle_generated_files', 'mnist.npz'))"
# Import the grid example (two files will be copied over)
candle import-template grid
# Submit the job to the queue
candle submit-job grid_example.in
NT3 using mlrMBO (non-CANDLE-compliant model scripts)¶
# Import the bayesian example (two files will be copied over)
candle import-template bayesian
# Submit the job to the queue
candle submit-job bayesian_example.in
How to minimally modify a bare model script for use with the wrapper scripts¶
- Set the hyperparameters in the model script using a dictionary called `candle_params`
- Ensure that somewhere near the end of the script either the normal `history` object is defined or a metric of how well the hyperparameter set performed (a value you want to minimize, such as the loss evaluated on a test set) is returned as a number in the `candle_value_to_return` variable
This is demonstrated in $CANDLE/wrappers/examples/summit-tf2/grid/mnist_mlp.py.
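As an illustration, a minimally modified Keras script might look like the following (a sketch: the model architecture and data loading are placeholders, and only the use of `candle_params` and of `history`/`candle_value_to_return` is what the wrappers require; `candle_params` is defined by the wrapper scripts before the script body runs):

# Hyperparameters come from the candle_params dictionary, which the
# wrapper scripts define before this script runs
batch_size = candle_params['batch_size']
epochs = candle_params['epochs']

from tensorflow import keras
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784) / 255.0
x_test = x_test.reshape(-1, 784) / 255.0

model = keras.Sequential([
    keras.layers.Dense(512, activation='relu', input_shape=(784,)),
    keras.layers.Dense(10, activation='softmax'),
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

# Either define the usual Keras History object, named "history"...
history = model.fit(x_train, y_train, batch_size=batch_size, epochs=epochs,
                    validation_data=(x_test, y_test))

# ...or return a number to minimize in candle_value_to_return, e.g., the
# loss evaluated on the test set
candle_value_to_return = model.evaluate(x_test, y_test)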
Running a non-CANDLE-compliant model on its own, outside of Supervisor¶
One drawback to minimally modifying a bare model script, as opposed to making it fully CANDLE-compliant, is that the former cannot generally be run standalone (which you should only do on an interactive node anyway), e.g., via `python my_model_script.py`. There are two simple ways to handle this:

- Use the recommended workflow of setting `run_workflow=0` and then running the model script using `candle submit-job my_input_file.in`
- Run `bash run_candle_model_standalone.sh`. Explanation: The first time a minimally CANDLE-compliant model script is run, using either setting of `run_workflow`, a file called `run_candle_model_standalone.sh` is created, which runs `candle_compliant_wrapper.py` using Python, just as you would run a fully CANDLE-compliant model script using Python in this situation. (As some environment variables are required to be set in `candle_compliant_wrapper.py` and the files it calls, `run_candle_model_standalone.sh` also sets some environment variables.)
Aside from not needing to make a model script fully CANDLE-compliant, the usual advantages of running minimally CANDLE-compliant scripts like this apply here, e.g., model scripts can be written in other languages, and a custom environment can be automatically defined via, e.g., `supp_modules`, `python_bin_path`, `exec_python_module`, and `supp_pythonpath`.

As usual for minimally CANDLE-compliant model scripts, the output of the script is placed in `subprocess_out_and_err.txt`.
Input file format¶
The input file should contain three sections: `&control`, `&default_model`, and `&param_space`. Each section should start with this header on its own line and end with `/` on its own line. (This input file format is based on the Quantum Espresso electronic structure software.) Four sample input files, corresponding to the four examples in the quick-start examples above, are here: upf, mlrmbo, grid, bayesian.

Spaces at the beginnings of the content-containing lines are optional but are recommended for readability.
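Putting this together, an input file has the following overall shape (a sketch based on the rules above; the `&control` contents are abbreviated, and a real file would also set the required keywords listed in the `&control` section below):

&control
  run_workflow = 1
/

&default_model
  candle_default_model_file = $CANDLE/Benchmarks/Pilot1/NT3/nt3_default_model.txt
/

&param_space
  makeDiscreteParam("batch_size", values = c(16, 32))
  makeIntegerParam("epochs", lower = 2, upper = 5)
/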
&control section¶
The `&control` section contains all settings aside from those specified in the `&default_model` and `&param_space` sections (detailed below), in the format `keyword = value`. Spaces around the `=` sign are optional, and each keyword setting should be on its own line. Each `value` ultimately gets interpreted by `bash` and hence is taken to be a string by default; thus, quotes are not necessary for string `value`s.

Here is a list of possible `keyword`s and their default `value`s (if `None`, then the keyword is required), as specified in `$CANDLE/wrappers/site-specific_settings.sh`:
| Keyword | Default | Notes |
|---|---|---|
|  |  | Full path to the model script |
|  |  | Currently only … |
|  |  | OLCF project to use, e.g., … |
|  |  | In … |
|  |  | workers=GPUs. The number of nodes used on Summit will be ceil((… |
|  |  | Valid backends are … |
| `supp_modules` | Empty string | Supplementary … |
| `python_bin_path` | Empty string | Actual Python version to use if not the one set in … |
| `exec_python_module` | Empty string | Actual Python … |
| `supp_pythonpath` | Empty string | … |
|  | Empty string | Extra arguments to the … |
|  | Empty string | Actual R … |
|  | Empty string | Full path to a supplementary … |
| `run_workflow` | 1 | 0 will run your model script once using the default model parameters on the current node (so only use this on an interactive node); 1 will run the actual Supervisor workflow, submitting the job to the queue as usual |
|  | 0 | 1 will set up the job but not execute it so that you can examine the settings files generated in the submission directory; 0 will run the job as usual |
|  |  | Partition to use for the CANDLE job |
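As an illustration, a `&control` section might look like the following (a sketch: only keywords documented elsewhere on this page are shown, and the module name given to `supp_modules` is illustrative; required keywords whose names are not recoverable from the table above are omitted):

&control
  run_workflow = 1
  supp_modules = cuda/10.2.89
/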
&default_model section¶
This can contain either a single keyword/value line containing the `candle_default_model_file` keyword pointing to the full path of the default model text file to use, e.g.,

candle_default_model_file = $CANDLE/Benchmarks/Pilot1/NT3/nt3_default_model.txt

or the contents of such a default model file, as, e.g., in the `grid` or `bayesian` examples in the quick-start section above.
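Written inline instead, the section might look like this (a sketch; the parameter names and values are illustrative and should match the hyperparameters your model script actually reads):

&default_model
  batch_size = 16
  epochs = 2
  optimizer = 'adam'
  dropout = 0.1
  learning_rate = 0.001
/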
&param_space section¶
This can contain either a single keyword/value line containing the `candle_param_space_file` keyword pointing to the full path of the file specifying the hyperparameter space to use, e.g.,

candle_param_space_file = $CANDLE/Supervisor/workflows/mlrMBO/data/nt3_nightly.R

or the contents of such a parameter space file, as, e.g., in the `grid` or `upf` examples in the quick-start section above or here:

&param_space
makeDiscreteParam("batch_size", values = c(16, 32))
makeIntegerParam("epochs", lower = 2, upper = 5)
makeDiscreteParam("optimizer", values = c("adam", "sgd", "rmsprop", "adagrad", "adadelta"))
makeNumericParam("dropout", lower = 0, upper = 0.9)
makeNumericParam("learning_rate", lower = 0.00001, upper = 0.1)
/
Note there are no commas at the end of each line in the example above.
Code organization¶
A description of what every file does in the `wrappers` repository, which is cloned to `$CANDLE/wrappers`, can be found here. Some particular notes:

- In addition to the page you are reading, all documentation is currently in the top-level directory: `README.md` (see this file for additional notes), `repository_organization.md`, `setup-biowulf.md`, and `setup-summit.md`
- Directories pertaining to the setup of the wrappers repository, and in general of CANDLE on a new HPC system (involved in the setup documentation), are `log_files`, `swift-t_setup`, and `test_files`
- Directories pertaining to the usage of the wrapper scripts (involved in the usage documentation that you are currently reading) are:
  - `lmod_modules`: contains `.lua` files used by the `lmod` system for loading `module`s, enabling one to run, e.g., `module load candle`
  - `bin`: contains a single script called `candle` that can be accessed by typing `candle` on the command line once the CANDLE module has been loaded. You can generate a usage message by simply typing `candle` or `candle help` on the command line and hitting Enter
  - `examples`: contains sample/template input files and model scripts for different `$SITE`s
  - `commands`: contains one directory so-named for each command to the `candle` program, each containing all files related to that command. Possible commands are `import-template`, `generate-grid`, `submit-job`, and `aggregate-results`. The file called `command_script.sh` in each command's directory is the main file called when the command is run using `candle <COMMAND> ...`. The only command not currently tested on Summit is `aggregate-results`. The bulk of the files involved in the functionality described in this document correspond to the `submit-job` command, i.e., are located in the `submit-job` subdirectory
Recommendations for particular use cases¶
Run `grid` or `bayesian` hyperparameter searches on an already CANDLE-compliant model script such as a benchmark¶
Note that you can copy a benchmark to your working directory and make the modifications there, as the templates show.

1. Enter a directory on Summit's Alpine filesystem such as `$MEMBERWORK`
2. Load the `candle` module via `source /gpfs/alpine/med106/world-shared/candle/env_for_lmod-tf2.sh`
3. Import one of the templates for running canonically CANDLE-compliant models using `candle import-template {upf|mlrmbo}` and delete all but the copied-over input file
4. Rename and tweak the input file to your liking using the documentation for input files above
5. Ensure your model runs on an interactive node (e.g., `bsub -W 1:00 -nnodes 1 -P med106 -q debug -Is /bin/bash`) by setting `run_workflow=0` in the `&control` section of the input file and running `candle submit-job <INPUT-FILE>`
6. Submit your job from a login node by restoring the default setting of `run_workflow=1` in the `&control` section of the input file and running `candle submit-job <INPUT-FILE>`
Create a new model script on which you want to run `grid` or `bayesian` hyperparameter searches¶
1. Enter a directory on Summit's Alpine filesystem such as `$MEMBERWORK`
2. Load the `candle` module via `source /gpfs/alpine/med106/world-shared/candle/env_for_lmod-tf2.sh`
3. Create a bare model script as usual (e.g., download a model from the Internet, tweak it, and apply it to your data)
4. Make the model script minimally CANDLE-compliant as described above
5. Import one of the templates for running minimally CANDLE-compliant models using `candle import-template {grid|bayesian}` and delete all but the copied-over input file
6. Rename and tweak the input file to your liking using the documentation for input files above
7. Ensure your model runs on an interactive node (e.g., `bsub -W 1:00 -nnodes 1 -P med106 -q debug -Is /bin/bash`) by setting `run_workflow=0` in the `&control` section of the input file and running `candle submit-job <INPUT-FILE>`
8. Submit your job from a login node by restoring the default setting of `run_workflow=1` in the `&control` section of the input file and running `candle submit-job <INPUT-FILE>`
Run a model script written in another language such as `R` or `bash`¶
Ask Andrew Weisman to test this first because he hasn’t tested it on Summit yet.
Pull updates to the central installation of CANDLE that have already been pulled into the main Supervisor/Benchmarks repositories¶
1. Load the `candle` module via `source /gpfs/alpine/med106/world-shared/candle/env_for_lmod-tf2.sh`
2. Enter the clone you'd like to update via `cd $CANDLE/Supervisor` or `cd $CANDLE/Benchmarks`
3. Run `git pull`, adjusting the permissions if necessary the very first time (or ask Andrew to do this)
Commit changes to the wrapper scripts or to the Supervisor or Benchmarks clones in the central installation¶
1. Load the `candle` module via `source /gpfs/alpine/med106/world-shared/candle/env_for_lmod-tf2.sh`
2. Enter the clone you'd like to update via `cd $CANDLE/{wrappers|Supervisor|Benchmarks}`
3. Make your modifications to the code and commit your changes, adjusting the permissions if necessary the very first time (or ask Andrew to do this)
4. Ask Andrew to push the changes to newly forked versions of the corresponding repositories and submit pull requests into the main versions of the repositories
Contribution ideas¶
Feel free to make any changes you’d like to the code and commit them via the preliminary workflow above. Below are some ideas for particular ways to contribute:
- Implement workflows other than `grid` and `bayesian` (UQ would be great!) by following the instructions here
- If this is something you personally want, allow for command-line arguments to the `candle` command, such as `run_workflow` or any other input file keywords
- Check/preprocess the four mlrMBO keywords (`design_size`, `propose_points`, `max_iterations`, `max_budget`) by following the instructions here and seeing their usage here (a good exercise for getting familiar with the wrappers code)
- Anything else!
Known issues¶
- CUDA driver. If, when running on an interactive node (using `run_workflow=0` in the input file), you get an error like `tensorflow.python.framework.errors_impl.InternalError: cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version`, then you likely need to load the CUDA module corresponding to the one automatically loaded in batch mode, based on the contents of `$CANDLE/Supervisor/workflows/common/sh/env-summit-tf2.sh`; currently, this means you need to run `module load cuda/10.2.89`. Explanation: when following the interactive protocol for testing, only the default version of Python is loaded prior to running the model using the default model settings; the corresponding CUDA module is not loaded as well. Note: this is a relatively new issue.
- InvalidArgumentError. You may need to add `K.clear_session()` prior to, say, `model = Sequential()` in a Keras-based model (see the sketch after this list). Otherwise, once the same rank runs a model script a second time, a strange `InvalidArgumentError` kills Supervisor (see the comments in `$CANDLE/Benchmarks/Pilot1/NT3/nt3_candle_wrappers_baseline_keras2.py` for more details). It is wholly possible that this is a bug that has been fixed in subsequent versions of Keras/TensorFlow.
- Path to CANDLE library. If you, say, pull a Benchmark model script out of the `Benchmarks` repository into your own separate directory, you may need to add a line like `sys.path.append(os.path.join(os.getenv('CANDLE'), 'Benchmarks', 'Pilot1', 'NT3'))`. This is demonstrated in `$CANDLE/wrappers/examples/summit-tf2/mlrmbo/nt3_baseline_keras2.py`.
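The last two fixes are one-liners; in a Keras-based model script they might be applied as follows (a sketch; the imports and the placeholder layer are illustrative):

import os
import sys

# Path-to-CANDLE-library fix: make the benchmark's modules importable when
# the model script lives outside the Benchmarks repository
sys.path.append(os.path.join(os.getenv('CANDLE'), 'Benchmarks', 'Pilot1', 'NT3'))

from keras import backend as K
from keras.models import Sequential
from keras.layers import Dense

# InvalidArgumentError fix: clear state left over from a previous run on the
# same rank before building a new model
K.clear_session()
model = Sequential()
model.add(Dense(10, activation='softmax', input_shape=(784,)))  # placeholder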