ckpt_keras_utils module
CKPT KERAS UTILS
CANDLE checkpoint/restart utilities for Keras
Hyperparameters that affect CANDLE checkpoint/restart:
- ckpt_restart_mode: "off" | "auto" | "required"
  If "auto" or "required", automatically try to restart from the most recent (highest-epoch) model.h5. "required" will fail if a model cannot be found. Default: "auto"
- ckpt_save_best: boolean
  If True, save whenever ckpt_save_best_metric has improved. Default: True
- ckpt_save_best_metric: string
  Required when ckpt_save_best=True, else unused. The metric in logs.model to track for improvement. Default: "val_loss"
- ckpt_skip_epochs: integer
  Number of initial epochs to skip before writing checkpoints. Default: 0
- ckpt_save_interval: integer
  Save whenever epoch % ckpt_save_interval == 0. Set ckpt_save_interval=0 to disable interval checkpoints (save nothing on this basis). Default: 1 (save every epoch)
- ckpt_checksum: boolean
  If True, compute a checksum for the model and store it in the JSON. Also confirm the checksum at restart time. Default: False
- ckpt_keep_mode: string
  "linear" or "exponential" (the latter is not yet implemented). Default: "linear"
- ckpt_keep_limit: integer (greater than zero)
  Maximum number of checkpoints to keep. This can be set lower to reduce disk usage. Default: 1000000
- ckpt_directory: string
  The top directory to use. Default: "./save". Typical user values: "/tmp/user/ckpts" (i.e., "I am going to move these myself") or "/other-fs/user/ckpts" (i.e., "my working FS is different from the FS I want to use for checkpoints").
- ckpt_metadata: string
  Arbitrary string to add to the JSON file regarding job ID, hardware location, etc. May be None or an empty string. Default: None
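Taken together, a gParameters configuration using these keys might look like the following sketch (the values shown are illustrative examples, not recommendations):

    # Illustrative checkpoint-related entries in gParameters.
    gParameters = {
        'ckpt_restart_mode': 'auto',       # "off" | "auto" | "required"
        'ckpt_save_best': True,
        'ckpt_save_best_metric': 'val_loss',
        'ckpt_skip_epochs': 0,
        'ckpt_save_interval': 1,           # 0 disables interval checkpoints
        'ckpt_checksum': False,
        'ckpt_keep_mode': 'linear',
        'ckpt_keep_limit': 1000000,
        'ckpt_directory': './save',
        'ckpt_metadata': 'job=1234 host=nodeA',  # or None
        # ... plus the usual model/training hyperparameters ...
    }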
Usage:
Add before training:
    initial_epoch = 0
    J = candle.restart(gParameters, model)
    if J is not None:
        initial_epoch = J['epoch']
Set up a callback for checkpoints:
    ckpt = candle.CandleCheckpointCallback(gParameters)
    history = model.fit(epochs=gParameters['epochs'],
                        initial_epoch=initial_epoch,
                        ...,
                        callbacks=[..., ckpt])
Optionally, log a final report:
ckpt.report_final()
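Putting these pieces together, a minimal end-to-end sketch (the toy model, data, and bare-bones gParameters dict are illustrative placeholders; candle is assumed to be importable per the CANDLE Benchmarks setup):

    import numpy as np
    from tensorflow import keras
    import candle  # CANDLE library; assumed importable in your environment

    # Toy model and data; placeholders for a real benchmark.
    model = keras.Sequential([keras.layers.Dense(1, input_shape=(8,))])
    model.compile(optimizer='sgd', loss='mse')
    x = np.random.rand(64, 8)
    y = np.random.rand(64, 1)

    gParameters = {'epochs': 5}  # plus the ckpt_* keys shown above, as needed

    # Restart from the most recent checkpoint, if one exists.
    initial_epoch = 0
    J = candle.restart(gParameters, model)
    if J is not None:
        initial_epoch = J['epoch']

    ckpt = candle.CandleCheckpointCallback(gParameters)
    model.fit(x, y,
              epochs=gParameters['epochs'],
              initial_epoch=initial_epoch,
              callbacks=[ckpt])
    ckpt.report_final()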
Controlling restart:
Most restart control options are in gParameters.
Normally, restart() looks at the soft link ckpts/last, which should point to a good checkpoint directory number under epochs/*, and restarts from there.
To roll back, simply re-link ckpts/last to point to a different directory number under epochs/* (see the sketch below). Any later epochs will simply be overwritten (and a debug message will be reported).
If ckpts/last is missing or a broken link, restart() will start from scratch.
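For example, re-pointing ckpts/last could look like this sketch (the paths and epoch-directory naming are illustrative; adjust to your ckpt_directory and layout):

    import os

    # Roll back to epoch 42 by re-pointing the ckpts/last soft link.
    ckpt_dir = './save/ckpts'              # under ckpt_directory (illustrative)
    last = os.path.join(ckpt_dir, 'last')
    target = 'epochs/042'                  # link target relative to ckpt_dir

    if os.path.islink(last):
        os.unlink(last)                    # drop the old link
    os.symlink(target, last)               # restart() will now use epoch 42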
Keep policy:
The ckpt_keep settings only apply to the current run. Checkpoints from prior runs will never be deleted by clean(). You may simply remove any of them. Normally you will not want to remove the one pointed to by ckpts/last, but if you do, restart() will simply start from scratch.
Logging:
A log of ckpt operations is in ckpt_directory/ckpt.log
class ckpt_keras_utils.CandleCheckpointCallback(gParameters, logger='DEFAULT', verbose=True)
    Bases: keras.callbacks.Callback
    Keras Callback for CANDLE-compliant Benchmarks to use for checkpointing. Creates a JSON file alongside the weights and optimizer checkpoints that includes important metadata, particularly for restarting and tracking complex workflows.
    clean(epoch_now)
        Clean old epoch directories in accordance with the ckpt_keep policies. Returns the number of checkpoints kept and deleted.
    keep(epoch, epoch_now, kept)
        kept: number of epochs already kept.
        Returns True if we are keeping this epoch, else False.
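As an illustration only (not the library's exact logic), a "linear" keep decision could combine the skip, interval, and limit settings like this:

    def keep_sketch(epoch, epoch_now, kept,
                    skip_epochs=0, save_interval=1, keep_limit=1000000):
        # Hypothetical 'linear' keep policy combining the ckpt_* settings.
        if epoch == epoch_now:
            return True                   # always keep the current epoch
        if epoch <= skip_epochs:
            return False                  # initial epochs are skipped
        if save_interval == 0 or epoch % save_interval != 0:
            return False                  # off-interval epochs are dropped
        return kept < keep_limit          # enforce ckpt_keep_limit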
    on_epoch_end(epoch, logs=None)
        Note: the epoch is immediately incremented from 0-indexed to 1-indexed to match the TensorFlow output. Normally, ckpts/best is the best saved state and ckpts/last is the last saved state.
        Procedure:
        1. Write current state to ckpts/work
        2. Rename ckpts/work to ckpts/epoch/NNN
        3. If best, link ckpts/best to ckpts/epoch/NNN
        4. Link ckpts/last to ckpts/epoch/NNN
        5. Clean up old checkpoints according to the keep policy
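The write-to-work-then-rename pattern is a standard way to make checkpoints atomic: a partially written checkpoint never appears under the epoch directories. A generic sketch of steps 1-4 (not the library's code; write_state is a caller-supplied placeholder that writes the model state into a directory):

    import os
    import shutil

    def atomic_checkpoint_sketch(ckpt_dir, epoch, write_state, is_best):
        # Generic write-to-work, rename, relink sequence (steps 1-4 above).
        os.makedirs(os.path.join(ckpt_dir, 'epochs'), exist_ok=True)
        work = os.path.join(ckpt_dir, 'work')
        final = os.path.join(ckpt_dir, 'epochs', '%03i' % epoch)
        if os.path.exists(work):
            shutil.rmtree(work)           # clear any stale partial write
        os.makedirs(work)
        write_state(work)                 # 1. write current state to work/
        if os.path.exists(final):
            shutil.rmtree(final)          # a rolled-back epoch is overwritten
        os.rename(work, final)            # 2. atomic rename to epochs/NNN
        links = ['last'] + (['best'] if is_best else [])
        for name in links:                # 3-4. relink best/last
            link = os.path.join(ckpt_dir, name)
            if os.path.islink(link):
                os.unlink(link)
            os.symlink(os.path.join('epochs', '%03i' % epoch), link)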
    on_train_end(logs=None)
        Called at the end of training. Subclasses should override for any actions to run.
        Parameters:
            logs: Dict. Currently the output of the last call to on_epoch_end() is passed to this argument for this method, but that may change in the future.
class ckpt_keras_utils.MultiGPUCheckpoint(filepath, monitor='val_loss', verbose=0, save_best_only=False, save_weights_only=False, mode='auto', save_freq='epoch', options=None, **kwargs)
    Bases: keras.callbacks.ModelCheckpoint
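The constructor arguments mirror keras.callbacks.ModelCheckpoint, so usage could look like this sketch (filename and settings are illustrative):

    from ckpt_keras_utils import MultiGPUCheckpoint

    # Same constructor arguments as keras.callbacks.ModelCheckpoint.
    mgc = MultiGPUCheckpoint('model.h5',
                             monitor='val_loss',
                             save_best_only=True)
    # history = model.fit(..., callbacks=[mgc])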
class ckpt_keras_utils.ParamRequired
    Bases: object
    Indicates that the user params must contain this key.
class ckpt_keras_utils.ParamType
    Bases: enum.Enum
    Possible gParameters types.
    STRING = 1
    BOOLEAN = 2
    INTEGER = 3
    INTEGER_NN = 4
    INTEGER_GZ = 5
    FLOAT = 6
    FLOAT_NN = 7
    (NN: non-negative; GZ: greater than zero)
ckpt_keras_utils.checksum_file(logger, filename)
    Read a file, compute its checksum, and return it as a string.
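Conceptually this is a hash over the file contents; a generic sketch (the hash algorithm here is an assumption, not necessarily the one the library uses):

    import hashlib

    def checksum_file_sketch(filename, chunk_size=1 << 20):
        # Read the file in chunks and return its hash as a hex string.
        # MD5 is illustrative; the library may use a different algorithm.
        h = hashlib.md5()
        with open(filename, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        return h.hexdigest()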
ckpt_keras_utils.param(gParameters, key, dflt, type_=ParamType.STRING, allowed=None)
    Pull key from parameters, with type checks and conversions.
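Hypothetical lookups (the keys match the hyperparameters documented above; the defaults shown are illustrative):

    from ckpt_keras_utils import param, ParamType

    # Pull checkpoint settings out of gParameters with type checking.
    save_best = param(gParameters, 'ckpt_save_best', True, ParamType.BOOLEAN)
    keep_limit = param(gParameters, 'ckpt_keep_limit', 1000000,
                       ParamType.INTEGER_GZ)
    mode = param(gParameters, 'ckpt_restart_mode', 'auto',
                 allowed=['off', 'auto', 'required'])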
ckpt_keras_utils.param_allowed(key, value, allowed)
    Check that the value is in the list of allowed values. If allowed is None, there is no check and the call simply succeeds.
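Illustrative calls (the exact exception raised on failure is not specified here):

    from ckpt_keras_utils import param_allowed

    param_allowed('ckpt_restart_mode', 'auto',
                  ['off', 'auto', 'required'])       # in list: succeeds
    param_allowed('ckpt_restart_mode', 'auto', None)  # allowed=None: no check
    # param_allowed('ckpt_restart_mode', 'bogus',
    #               ['off', 'auto', 'required'])      # -> raises an error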
ckpt_keras_utils.param_type_check(key, value, type_)
    Check that value is convertible to the given type; if not, raise TypeError.
    Returns the value converted to the given type.
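Illustrative behavior, assuming string inputs as commonly found in parsed parameter files:

    from ckpt_keras_utils import param_type_check, ParamType

    epochs = param_type_check('epochs', '100', ParamType.INTEGER)  # -> 100 (int)
    rate = param_type_check('learning_rate', '0.01',
                            ParamType.FLOAT)                       # -> 0.01 (float)
    # param_type_check('epochs', 'ten', ParamType.INTEGER)  # -> raises TypeError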