ckpt_keras_utils module
CKPT KERAS UTILS
CANDLE checkpoint/restart utilities for Keras
Hyperparameters that affect CANDLE checkpoint/restart:
- ckpt_restart_mode: "off" | "auto" | "required"
  If "auto" or "required", automatically try to restart from the most recent (highest-epoch) model.h5. "required" will fail if a model cannot be found. Default: "auto"
- ckpt_save_best: boolean
  If True, save whenever ckpt_save_best_metric has improved. Default: True
- ckpt_save_best_metric: string
  Required when ckpt_save_best=True, else unused. The metric in logs.model to track for improvement. Default: "val_loss"
- ckpt_skip_epochs: integer
  Number of initial epochs to skip before writing checkpoints. Default: 0
- ckpt_save_interval: integer
  Save whenever epoch % ckpt_save_interval == 0. Set ckpt_save_interval=0 to disable interval checkpoints (save nothing on this basis). Default: 1 (save every epoch)
- ckpt_checksum: boolean
  If True, compute a checksum for the model and store it in the JSON. Also confirm the checksum at restart time. Default: False
- ckpt_keep_mode: string
  "linear" or "exponential" (the latter is not yet implemented). Default: "linear"
- ckpt_keep_limit: integer (greater than zero)
  Maximum number of checkpoints to keep. This can be set lower to reduce disk usage. Default: 1000000
- ckpt_directory: string
  The top directory to use. Default: "./save". Typical user values: "/tmp/user/ckpts" (i.e., "I am going to move these myself") or "/other-fs/user/ckpts" (i.e., "my working FS is different from the FS I want to use for checkpoints").
- ckpt_metadata: string
  Arbitrary string to add to the JSON file regarding job ID, hardware location, etc. May be None or an empty string. Default: None
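Taken together, a gParameters configuration using these keys might look like the following sketch (the values shown are illustrative examples, not recommendations):

    # Illustrative checkpoint-related entries in gParameters.
    gParameters = {
        'ckpt_restart_mode': 'auto',       # "off" | "auto" | "required"
        'ckpt_save_best': True,
        'ckpt_save_best_metric': 'val_loss',
        'ckpt_skip_epochs': 0,
        'ckpt_save_interval': 1,           # 0 disables interval checkpoints
        'ckpt_checksum': False,
        'ckpt_keep_mode': 'linear',
        'ckpt_keep_limit': 1000000,
        'ckpt_directory': './save',
        'ckpt_metadata': 'job=1234 host=nodeA',  # or None
        # ... plus the usual model/training hyperparameters ...
    }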
Usage:
Add before training:
    initial_epoch = 0
    J = candle.restart(gParameters, model)
    if J is not None:
        initial_epoch = J['epoch']
Set up a callback for checkpoints:
    ckpt = candle.CandleCheckpointCallback(gParameters)
    history = model.fit(epochs=gParameters['epochs'],
                        initial_epoch=initial_epoch,
                        ...,
                        callbacks=[..., ckpt])
Optionally, log a final report:
ckpt.report_final()
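Putting these pieces together, a minimal end-to-end sketch (the toy model, data, and bare-bones gParameters dict are illustrative placeholders; candle is assumed to be importable per the CANDLE Benchmarks setup):

    import numpy as np
    from tensorflow import keras
    import candle  # CANDLE library; assumed importable in your environment

    # Toy model and data; placeholders for a real benchmark.
    model = keras.Sequential([keras.layers.Dense(1, input_shape=(8,))])
    model.compile(optimizer='sgd', loss='mse')
    x = np.random.rand(64, 8)
    y = np.random.rand(64, 1)

    gParameters = {'epochs': 5}  # plus the ckpt_* keys shown above, as needed

    # Restart from the most recent checkpoint, if one exists.
    initial_epoch = 0
    J = candle.restart(gParameters, model)
    if J is not None:
        initial_epoch = J['epoch']

    ckpt = candle.CandleCheckpointCallback(gParameters)
    model.fit(x, y,
              epochs=gParameters['epochs'],
              initial_epoch=initial_epoch,
              callbacks=[ckpt])
    ckpt.report_final()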
Controlling restart:
Most restart control options are in gParameters.
Normally, restart() looks at the soft link ckpts/last, which should point to a good checkpoint directory number under epochs/*, and restarts from there.
To roll back, simply re-link ckpts/last to point to a different directory number under epochs/* (see the sketch below). Any later epochs will simply be overwritten (and a debug message will be reported).
If ckpts/last is missing or a broken link, restart() will start from scratch.
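For example, re-pointing ckpts/last could look like this sketch (the paths and epoch-directory naming are illustrative; adjust to your ckpt_directory and layout):

    import os

    # Roll back to epoch 42 by re-pointing the ckpts/last soft link.
    ckpt_dir = './save/ckpts'              # under ckpt_directory (illustrative)
    last = os.path.join(ckpt_dir, 'last')
    target = 'epochs/042'                  # link target relative to ckpt_dir

    if os.path.islink(last):
        os.unlink(last)                    # drop the old link
    os.symlink(target, last)               # restart() will now use epoch 42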
Keep policy:
The ckpt_keep settings only apply to the current run. Checkpoints from prior runs will never be deleted by clean(). You may simply remove any of them. Normally you will not want to remove the one pointed to by ckpts/last, but if you do, restart() will simply start from scratch.
Logging:
A log of ckpt operations is in ckpt_directory/ckpt.log
class ckpt_keras_utils.CandleCheckpointCallback(gParameters, logger='DEFAULT', verbose=True)
    Bases: keras.callbacks.Callback
    Keras Callback for CANDLE-compliant Benchmarks to use for checkpointing. Creates a JSON file alongside the weights and optimizer checkpoints that includes important metadata, particularly for restarting and tracking complex workflows.
    clean(epoch_now)
        Clean old epoch directories in accordance with the ckpt_keep policies. Returns the number of checkpoints kept and deleted.
    keep(epoch, epoch_now, kept)
        kept: number of epochs already kept.
        Returns True if we are keeping this epoch, else False.
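As an illustration only (not the library's exact logic), a "linear" keep decision could combine the skip, interval, and limit settings like this:

    def keep_sketch(epoch, epoch_now, kept,
                    skip_epochs=0, save_interval=1, keep_limit=1000000):
        # Hypothetical 'linear' keep policy combining the ckpt_* settings.
        if epoch == epoch_now:
            return True                   # always keep the current epoch
        if epoch <= skip_epochs:
            return False                  # initial epochs are skipped
        if save_interval == 0 or epoch % save_interval != 0:
            return False                  # off-interval epochs are dropped
        return kept < keep_limit          # enforce ckpt_keep_limit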
    on_epoch_end(epoch, logs=None)
        Note: the epoch is immediately incremented from 0-indexed to 1-indexed to match the TensorFlow output. Normally, ckpts/best is the best saved state and ckpts/last is the last saved state.
        Procedure:
        1. Write current state to ckpts/work
        2. Rename ckpts/work to ckpts/epoch/NNN
        3. If best, link ckpts/best to ckpts/epoch/NNN
        4. Link ckpts/last to ckpts/epoch/NNN
        5. Clean up old checkpoints according to the keep policy
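The write-to-work-then-rename pattern is a standard way to make checkpoints atomic: a partially written checkpoint never appears under the epoch directories. A generic sketch of steps 1-4 (not the library's code; write_state is a caller-supplied placeholder that writes the model state into a directory):

    import os
    import shutil

    def atomic_checkpoint_sketch(ckpt_dir, epoch, write_state, is_best):
        # Generic write-to-work, rename, relink sequence (steps 1-4 above).
        os.makedirs(os.path.join(ckpt_dir, 'epochs'), exist_ok=True)
        work = os.path.join(ckpt_dir, 'work')
        final = os.path.join(ckpt_dir, 'epochs', '%03i' % epoch)
        if os.path.exists(work):
            shutil.rmtree(work)           # clear any stale partial write
        os.makedirs(work)
        write_state(work)                 # 1. write current state to work/
        if os.path.exists(final):
            shutil.rmtree(final)          # a rolled-back epoch is overwritten
        os.rename(work, final)            # 2. atomic rename to epochs/NNN
        links = ['last'] + (['best'] if is_best else [])
        for name in links:                # 3-4. relink best/last
            link = os.path.join(ckpt_dir, name)
            if os.path.islink(link):
                os.unlink(link)
            os.symlink(os.path.join('epochs', '%03i' % epoch), link)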
    on_train_end(logs=None)
        Called at the end of training. Subclasses should override for any actions to run.
        Parameters:
            logs: Dict. Currently the output of the last call to on_epoch_end() is passed to this argument for this method, but that may change in the future.
class ckpt_keras_utils.MultiGPUCheckpoint(filepath, monitor='val_loss', verbose=0, save_best_only=False, save_weights_only=False, mode='auto', save_freq='epoch', options=None, **kwargs)
    Bases: keras.callbacks.ModelCheckpoint
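The constructor arguments mirror keras.callbacks.ModelCheckpoint, so usage could look like this sketch (filename and settings are illustrative):

    from ckpt_keras_utils import MultiGPUCheckpoint

    # Same constructor arguments as keras.callbacks.ModelCheckpoint.
    mgc = MultiGPUCheckpoint('model.h5',
                             monitor='val_loss',
                             save_best_only=True)
    # history = model.fit(..., callbacks=[mgc])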
class ckpt_keras_utils.ParamRequired
    Bases: object
    Indicates that the user params must contain this key.
class ckpt_keras_utils.ParamType
    Bases: enum.Enum
    Possible gParameters types.
    STRING = 1
    BOOLEAN = 2
    INTEGER = 3
    INTEGER_NN = 4
    INTEGER_GZ = 5
    FLOAT = 6
    FLOAT_NN = 7
    (NN: non-negative; GZ: greater than zero)
ckpt_keras_utils.checksum_file(logger, filename)
    Read a file, compute its checksum, and return it as a string.
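Conceptually this is a hash over the file contents; a generic sketch (the hash algorithm here is an assumption, not necessarily the one the library uses):

    import hashlib

    def checksum_file_sketch(filename, chunk_size=1 << 20):
        # Read the file in chunks and return its hash as a hex string.
        # MD5 is illustrative; the library may use a different algorithm.
        h = hashlib.md5()
        with open(filename, 'rb') as f:
            for chunk in iter(lambda: f.read(chunk_size), b''):
                h.update(chunk)
        return h.hexdigest()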
ckpt_keras_utils.param(gParameters, key, dflt, type_=ParamType.STRING, allowed=None)
    Pull key from parameters, with type checks and conversions.
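Hypothetical lookups (the keys match the hyperparameters documented above; the defaults shown are illustrative):

    from ckpt_keras_utils import param, ParamType

    # Pull checkpoint settings out of gParameters with type checking.
    save_best = param(gParameters, 'ckpt_save_best', True, ParamType.BOOLEAN)
    keep_limit = param(gParameters, 'ckpt_keep_limit', 1000000,
                       ParamType.INTEGER_GZ)
    mode = param(gParameters, 'ckpt_restart_mode', 'auto',
                 allowed=['off', 'auto', 'required'])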
ckpt_keras_utils.param_allowed(key, value, allowed)
    Check that the value is in the list of allowed values. If allowed is None, there is no check and the call simply succeeds.
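Illustrative calls (the exact exception raised on failure is not specified here):

    from ckpt_keras_utils import param_allowed

    param_allowed('ckpt_restart_mode', 'auto',
                  ['off', 'auto', 'required'])       # in list: succeeds
    param_allowed('ckpt_restart_mode', 'auto', None)  # allowed=None: no check
    # param_allowed('ckpt_restart_mode', 'bogus',
    #               ['off', 'auto', 'required'])      # -> raises an error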
ckpt_keras_utils.param_type_check(key, value, type_)
    Check that value is convertible to the given type; if not, raise TypeError.
    Returns the value converted to the given type.
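Illustrative behavior, assuming string inputs as commonly found in parsed parameter files:

    from ckpt_keras_utils import param_type_check, ParamType

    epochs = param_type_check('epochs', '100', ParamType.INTEGER)  # -> 100 (int)
    rate = param_type_check('learning_rate', '0.01',
                            ParamType.FLOAT)                       # -> 0.01 (float)
    # param_type_check('epochs', 'ten', ParamType.INTEGER)  # -> raises TypeError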