candle

Functions

candle.generate_cross_validation_partition(group_label, n_folds=5, n_repeats=1, portions=None, random_seed=None)[source]

This function generates partition indices of samples for cross-validation analysis.

Parameters
  • group_label – 1-D array or list of group labels of samples. If there are no groups in samples, a list of sample indices can be supplied for generating partitions based on individual samples rather than sample groups.

  • n_folds (int) – positive integer larger than 1, indicating the number of folds for cross-validation. Default is 5.

  • n_repeats (int) – positive integer, indicating how many times the n_folds cross-validation should be repeated. So the total number of cross-validation trials is n_folds * n_repeats. Default is 1.

  • portions – 1-D array or list of positive integers, indicating the number of data folds in each set (e.g. training set, testing set, or validation set) after partitioning. The summation of elements in portions must be equal to n_folds. Default is [1, n_folds - 1].

  • random_seed (int) – positive integer, the seed for random generator. Default is None.

Returns

list of n_folds * n_repeats lists, each of which contains len(portions) sample index lists for a cross-validation trial.
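
A minimal usage sketch (the group labels below are made up; which index list plays the role of training or testing follows the order given in portions):

    import candle

    # Hypothetical group labels for 8 samples drawn from 4 groups.
    groups = [0, 0, 1, 1, 2, 2, 3, 3]

    # 4-fold CV repeated twice gives 4 * 2 = 8 trials; with portions=[1, 3]
    # each trial contains two index lists (1 fold vs. the remaining 3 folds).
    partitions = candle.generate_cross_validation_partition(
        groups, n_folds=4, n_repeats=2, portions=[1, 3], random_seed=1
    )

    for trial in partitions:
        first_set, second_set = trial  # len(portions) == 2 index lists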

candle.quantile_normalization(data)[source]

This function applies quantile normalization to the input data. After normalization, the samples (rows) in the output data follow the same distribution, which is the average distribution calculated based on all samples. This function allows missing values, and assumes missing values occur at random.

Parameters

data – numpy array or pandas data frame of numeric values, with a shape of [n_samples, n_features].

Returns

numpy array or pandas data frame containing the data after quantile normalization.
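
A small sketch on a toy matrix with one missing value (missing entries are tolerated and assumed to occur at random):

    import numpy as np

    import candle

    # Toy matrix of shape [n_samples, n_features]; NaN marks a missing value.
    data = np.array([[5.0, 2.0, 3.0],
                     [4.0, 1.0, np.nan],
                     [3.0, 4.0, 6.0]])

    normalized = candle.quantile_normalization(data)
    # The rows (samples) of `normalized` now follow the same average
    # distribution computed across all samples.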

candle.load_csv_data(train_path: str, test_path: Optional[str] = None, sep: str = ',', nrows: Optional[int] = None, x_cols: Optional[List] = None, y_cols: Optional[List] = None, drop_cols: Optional[List] = None, onehot_cols: Optional[List] = None, n_cols: Optional[int] = None, random_cols: bool = False, shuffle: bool = False, scaling: Optional[str] = None, dtype=None, validation_split: Optional[float] = None, return_dataframe: bool = True, return_header: bool = False, seed: int = 7102)[source]

Load data from the files specified. Columns corresponding to data features and labels can be specified. A one-hot encoding can be used for either features or labels. If validation_split is specified, training data is further split into training and validation partitions. pandas DataFrames are used to load and pre-process the data. If requested, those DataFrames are returned; otherwise just the values are returned. Labels to output can be integer labels (for classification) or continuous labels (for regression). Columns to load can be specified, randomly selected, or a subset can be dropped. The order of rows can be shuffled. Data can be rescaled. This function assumes that the files contain a header with column names.

Parameters
  • train_path – Name of the file to load the training data.

  • test_path – Name of the file to load the testing data. (Optional).

  • sep – Character used as column separator. (Default: ‘,’, comma separated values).

  • nrows (int) – Number of rows to load from the files. (Default: None, all the rows are used).

  • x_cols – List of columns to use as features. (Default: None).

  • y_cols – List of columns to use as labels. (Default: None).

  • drop_cols – List of columns to drop from the files being loaded. (Default: None, all the columns are used).

  • onehot_cols – List of columns to one-hot encode. (Default: None).

  • n_cols (int) – Number of columns to load from the files. (Default: None).

  • random_cols (boolean) – Boolean flag to indicate random selection of columns. If True, n_cols columns are selected at random; if False, the specified columns are used. (Default: False).

  • shuffle (boolean) – Boolean flag to indicate row shuffling. If True the rows are re-ordered, if False the order in which rows are read is preserved. (Default: False, no permutation of the loading row order).

  • scaling (string) –

    String describing type of scaling to apply. Options recognized: ‘maxabs’, ‘minmax’, ‘std’.

    • maxabs: scales data to the range [-1, 1].

    • minmax: scales data to the range [0, 1].

    • std: scales data to a standard normal variable with mean 0 and standard deviation 1. (Default: None, no scaling).

  • dtype – Data type to use for the output pandas DataFrames. (Default: None).

  • validation_split (float) – Fraction of training data to set aside for validation. (Default: None, no validation partition is constructed).

  • return_dataframe (boolean) – Boolean flag to indicate that the pandas DataFrames used for data pre-processing are to be returned. (Default: True, pandas DataFrames are returned).

  • return_header (boolean) – Boolean flag to indicate if the column headers are to be returned. (Default: False, no column headers are separately returned).

  • seed (int) – Value to initialize or re-seed the generator. (Default: DEFAULT_SEED defined in default_utils).

Returns

Tuples of data features and labels are returned, for train, validation and testing partitions, together with the column names (headers). The specific objects to return depend on the options selected.
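
A hedged sketch (the file and column names are hypothetical, and the composition of the returned tuple depends on the options selected):

    import candle

    result = candle.load_csv_data(
        train_path="train.csv",          # hypothetical headered CSV
        y_cols=["label"],                # label column
        drop_cols=["sample_id"],         # columns to discard
        scaling="std",                   # standardize features
        shuffle=True,
        validation_split=0.2,            # carve a validation set from train
        return_dataframe=False,          # return plain values, not DataFrames
    )
    # `result` holds (features, labels) tuples for the train and validation
    # partitions (plus test data and headers, when those options are enabled).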

candle.load_Xy_data_noheader(train_file: str, test_file: str, classes: int, usecols: Optional[List] = None, scaling: Optional[str] = None, dtype=<class 'numpy.float32'>)[source]

Load training and testing data from the files specified, using the first column as the label. Construct corresponding training and testing pandas DataFrames, separated into data (i.e. features) and labels. Labels to output are one-hot encoded (categorical). Columns to load can be selected. Data can be rescaled. Training and testing partitions (coming from the respective files) are preserved. This function assumes that the files do not contain a header.

Parameters
  • train_file (string) – Name of the file to load the training data.

  • test_file (string) – Name of the file to load the testing data.

  • classes (int) – Number of total classes to consider when building the categorical (one-hot) label encoding.

  • usecols – List of column indices to load from the files. (Default: None, all the columns are used).

  • scaling (string) –

    String describing type of scaling to apply. Options recognized: ‘maxabs’, ‘minmax’, ‘std’.

    • maxabs: scales data to the range [-1, 1].

    • minmax: scales data to the range [0, 1].

    • std: scales data to a standard normal variable with mean 0 and standard deviation 1. (Default: None, no scaling).

  • dtype – Data type to use for the output pandas DataFrames. (Default: DEFAULT_DATATYPE defined in default_utils).

Returns

Tuple of pandas DataFrames where

  • X_train - Data features for training loaded in a pandas DataFrame and pre-processed as specified.

  • Y_train - Data labels for training loaded in a pandas DataFrame. One-hot encoding (categorical) is used.

  • X_test - Data features for testing loaded in a pandas DataFrame and pre-processed as specified.

  • Y_test - Data labels for testing loaded in a pandas DataFrame. One-hot encoding (categorical) is used.
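
A sketch assuming the four DataFrames unpack in the order listed above (the headerless file names are hypothetical):

    import numpy as np

    import candle

    x_train, y_train, x_test, y_test = candle.load_Xy_data_noheader(
        "train_noheader.csv",   # first column holds the integer label
        "test_noheader.csv",
        classes=10,             # labels one-hot encoded over 10 classes
        scaling="maxabs",
        dtype=np.float32,
    )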

candle.load_Xy_one_hot_data2(train_file: str, test_file: str, class_col: Optional[int] = None, drop_cols: Optional[List] = None, n_cols: Optional[int] = None, shuffle: bool = False, scaling: Optional[str] = None, validation_split: float = 0.1, dtype=<class 'numpy.float32'>, seed: int = 7102)[source]

Load training and testing data from the files specified, with an indicated column to use as the label. Further split training data into training and validation partitions, and construct corresponding training, validation and testing pandas DataFrames, separated into data (i.e. features) and labels. Labels to output are one-hot encoded (categorical). Columns to load can be selected or dropped. The order of rows can be shuffled. Data can be rescaled. Training and testing partitions (coming from the respective files) are preserved, with training further split into training and validation partitions. This function assumes that the files contain a header with column names.

Parameters
  • train_file (string) – Name of the file to load the training data.

  • test_file (string) – Name of the file to load the testing data.

  • class_col (int) – Index of the column to use as the label. (Default: None; a label column must be specified, otherwise the function fails).

  • drop_cols (List) – List of column names to drop from the files being loaded. (Default: None, all the columns are used).

  • n_cols (int) – Number of columns to load from the files. (Default: None, all the columns are used).

  • shuffle (boolean) – Boolean flag to indicate row shuffling. If True the rows are re-ordered, if False the order in which rows are loaded is preserved. (Default: False, no permutation of the loading row order).

  • scaling (string) –

    String describing type of scaling to apply. Options recognized: ‘maxabs’, ‘minmax’, ‘std’.

    • maxabs: scales data to the range [-1, 1].

    • minmax: scales data to the range [0, 1].

    • std: scales data to a standard normal variable with mean 0 and standard deviation 1. (Default: None, no scaling).

  • validation_split (float) – Fraction of training data to set aside for validation. (Default: 0.1, ten percent of the training data is used for the validation partition).

  • dtype – Data type to use for the output pandas DataFrames. (Default: DEFAULT_DATATYPE defined in default_utils).

  • seed (int) – Value to initialize or re-seed the generator. (Default: DEFAULT_SEED defined in default_utils).

Returns

Tuple of pandas DataFrames where

  • X_train: Data features for training loaded in a pandas DataFrame and pre-processed as specified.

  • y_train: Data labels for training loaded in a pandas DataFrame. One-hot encoding (categorical) is used.

  • X_val: Data features for validation loaded in a pandas DataFrame and pre-processed as specified.

  • y_val: Data labels for validation loaded in a pandas DataFrame. One-hot encoding (categorical) is used.

  • X_test: Data features for testing loaded in a pandas DataFrame and pre-processed as specified.

  • y_test: Data labels for testing loaded in a pandas DataFrame. One-hot encoding (categorical) is used.

candle.select_decorrelated_features(data, method='pearson', threshold=None, random_seed=None)[source]

This function selects features whose mutual absolute correlation coefficients are smaller than a threshold. It allows missing values in data. The correlation coefficient of two features is calculated from the observations present in both features. Features with only one or no value present and features with a zero standard deviation are not considered for selection.

Parameters
  • data – numpy array or pandas data frame of numeric values, with a shape of [n_samples, n_features].

  • method (string) –

    Method used for calculating the correlation coefficient. Default is ‘pearson’.

    • pearson: Pearson correlation coefficient

    • kendall: Kendall Tau correlation coefficient

    • spearman: Spearman rank correlation coefficient

  • threshold (float) – If two features have an absolute correlation coefficient higher than threshold, one of the features is removed. If threshold is None, a feature is removed only when the two features are exactly identical. Default is None.

  • random_seed (int) – seed of random generator for ordering the features. If it is None, features are not re-ordered before feature selection and thus the first feature is always selected. Default is None.

Returns

1-D numpy array containing the indices of selected features.
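
A small sketch on synthetic data with one nearly duplicated feature:

    import numpy as np

    import candle

    rng = np.random.default_rng(0)
    base = rng.normal(size=100)
    # Feature 1 is a scaled copy of feature 0; feature 2 is independent.
    data = np.column_stack([base, 2.0 * base + 0.01, rng.normal(size=100)])

    kept = candle.select_decorrelated_features(
        data, method="pearson", threshold=0.9, random_seed=None
    )
    # With random_seed=None the first feature of a correlated pair is kept,
    # so `kept` should contain the indices of features 0 and 2.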

candle.select_features_by_missing_values(data, threshold=0.1)[source]

This function returns the indices of the features whose missing rates are smaller than the threshold.

Parameters
  • data – numpy array or pandas data frame of numeric values, with a shape of [n_samples, n_features]

  • threshold (float) – range of [0, 1]. Features with a missing rate smaller than threshold will be selected. Default is 0.1

Returns

1-D numpy array containing the indices of selected features
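
For example:

    import numpy as np

    import candle

    data = np.array([[1.0, np.nan, 3.0],
                     [4.0, np.nan, np.nan],
                     [7.0, 8.0, 9.0]])

    # Keep features missing in fewer than 40% of samples: column 0
    # (0% missing) and column 2 (~33% missing) are selected; column 1
    # (~67% missing) is dropped.
    kept = candle.select_features_by_missing_values(data, threshold=0.4)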

candle.select_features_by_variation(data, variation_measure='var', threshold=None, portion=None, draw_histogram=False, bins=100, log=False)[source]

This function evaluates the variations of individual features and returns the indices of features with large variations. Missing values are ignored in evaluating variation.

Parameters
  • data – numpy array or pandas data frame of numeric values, with a shape of [n_samples, n_features].

  • variation_measure (string) – string indicating the metric used for evaluating feature variation. ‘var’ indicates variance; ‘std’ indicates standard deviation; ‘mad’ indicates median absolute deviation. Default is ‘var’.

  • threshold (float) – Features with a variation larger than threshold will be selected. Default is None.

  • portion (float) – float in the range of [0, 1]. It is the portion of features to be selected based on variation. The number of selected features will be the smaller of int(portion * n_features) and the total number of features with non-missing variations. Default is None. threshold and portion cannot both be specified at the same time.

  • draw_histogram (bool) – whether to draw a histogram of feature variations. Default is False.

  • bins (int) – positive integer, the number of bins in the histogram. Default is 100.

  • log (bool) – whether the histogram should be drawn on a log scale. Default is False.

Returns

1-D numpy array containing the indices of selected features. If both threshold and portion are None, indices will be an empty array.
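
A small sketch selecting the top quarter of features by variance:

    import numpy as np

    import candle

    rng = np.random.default_rng(0)
    # 20 features with increasing spread, so variances differ.
    data = rng.normal(size=(50, 20)) * np.linspace(0.1, 2.0, 20)

    kept = candle.select_features_by_variation(
        data, variation_measure="var", portion=0.25
    )
    # `kept` holds the indices of the int(0.25 * 20) = 5 most variable features.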

candle.get_file(fname: str, origin: str, unpack: bool = False, md5_hash: Optional[str] = None, cache_subdir: str = 'common', datadir: Optional[str] = None) → str[source]

Downloads a file from a URL if it is not already in the cache. Passing the MD5 hash will verify the file after download, as well as when it is already present in the cache.

Parameters
  • fname (string) – name of the file

  • origin (string) – original URL of the file

  • unpack (bool) – whether the file should be decompressed

  • md5_hash (string) – MD5 hash of the file for verification

  • cache_subdir (string) – directory being used as the cache

  • datadir (string) – if set, the file is cached under datadir (which can be, e.g., an absolute path) and cache_subdir is ignored

Returns

Path to the downloaded file

Return type

string
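
A hedged sketch (the URL and archive name below are placeholders, not a real dataset):

    import candle

    local_path = candle.get_file(
        fname="example_data.tar.gz",
        origin="https://example.org/data/example_data.tar.gz",
        unpack=True,        # decompress after download
        md5_hash=None,      # supply a hash to enable integrity checks
        cache_subdir="common",
    )
    # `local_path` points into the cache; subsequent calls reuse the copy.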

candle.validate_file(fpath: str, md5_hash: str) → bool[source]

Validates a file against a MD5 hash.

Parameters
  • fpath (string) – path to the file being validated

  • md5_hash (string) – the MD5 hash being validated against

Returns

Whether the file is valid

Return type

boolean

candle.fetch_file(link: str, subdir: str, unpack: bool = False, md5_hash: Optional[str] = None) → str[source]

Convert a URL to a file path and download the file if it is not already present in the specified cache.

Parameters
  • link (string) – URL of the file to download

  • subdir (string) – Local path to check for cached file.

  • unpack (bool) – Flag to specify if the file to download should be decompressed too. (default: False, no decompression)

  • md5_hash (string) – MD5 hash used as a checksum to verify data integrity. Verification is carried out if a hash is provided. (default: None, no verification)

Returns

local path to the downloaded, or cached, file.

Return type

string

candle.keras_default_config() → Dict[source]

Defines parameters that intervene in different functions, using the Keras defaults.

This helps to keep consistency in parameters between frameworks.

candle.set_up_logger(logfile: str, logger: Logger, verbose: bool = False, fmt_line: str = '[%(asctime)s %(process)d] %(message)s', fmt_date: str = '%Y-%m-%d %H:%M:%S') → None[source]

Set up the event logging system. Two handlers are created: one to send log records to a specified file and one to send log records to the (default) sys.stderr stream. The logger and the file handler are set to DEBUG logging level. The stream handler is set to INFO logging level, or to DEBUG logging level if the verbose flag is specified. Logging messages which are less severe than the level set will be ignored.

Parameters
  • logfile (string) – File to store the log records

  • logger (Logger) – Python object for the logging interface

  • verbose (boolean) – Flag to increase the logging level from INFO to DEBUG. It only applies to the stream handler.
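
A short sketch of typical use (the logger and file names are arbitrary):

    import logging

    import candle

    logger = logging.getLogger("my_benchmark")
    candle.set_up_logger("run.log", logger, verbose=True)

    logger.info("recorded in run.log and echoed to stderr")
    logger.debug("also echoed to stderr because verbose=True")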

candle.str2bool(v: str) → bool[source]

This is taken from https://stackoverflow.com/questions/15008758/parsing-boolean-values-with-argparse, because type=bool is not interpreted as a bool and action=‘store_true’ cannot be undone.

Parameters

v (string) – String to interpret

Returns

Boolean value. It raises an exception if the provided string cannot be interpreted as a boolean type.

  • Strings recognized as boolean True : ‘yes’, ‘true’, ‘t’, ‘y’, ‘1’ and uppercase versions (where applicable).

  • Strings recognized as boolean False : ‘no’, ‘false’, ‘f’, ‘n’, ‘0’ and uppercase versions (where applicable).

Return type

boolean
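
A typical argparse sketch:

    import argparse

    import candle

    parser = argparse.ArgumentParser()
    # Using str2bool as the type lets users write --use_gpu false, etc.
    parser.add_argument("--use_gpu", type=candle.str2bool, default=True)

    args = parser.parse_args(["--use_gpu", "no"])
    assert args.use_gpu is False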

candle.verify_path(path: str) → None[source]

Verify whether a directory path exists locally. If the path does not exist, but is a valid path, it recursively creates the specified directory path structure.

Parameters

path (string) – Local directory path to verify

candle.add_cluster_noise(x_data, loc=0.0, scale=0.5, col_ids=[0], noise_type='gaussian', row_ids=[0], y_noise_level=0.0)[source]
candle.add_column_noise(x_data, loc=0.0, scale=0.5, col_ids=[0], noise_type='gaussian')[source]
candle.add_gaussian_noise(x_data, loc=0.0, scale=0.5)[source]
candle.add_noise(data, labels, params)[source]
candle.label_flip(y_data_categorical, y_noise_level)[source]
candle.label_flip_correlated(y_data_categorical, y_noise_level, x_data, col_ids, threshold)[source]
candle.combat_batch_effect_removal(data, batch_labels, model=None, numerical_covariates=None)[source]

This function corrects for batch effect in data.

Parameters
  • data – pandas data frame of numeric values, with a size of (n_features, n_samples)

  • batch_labels – pandas series, with a length of n_samples. It should provide the batch labels of samples. Its indices are the same as the column names (sample names) in “data”.

  • model – an object of patsy.design_info.DesignMatrix. It is a design matrix describing the covariate information on the samples that could cause batch effects. If not provided, this function will attempt a coarse correction based only on the information provided in batch_labels.

  • numerical_covariates – a list of the names of covariates in “model” that are numerical rather than categorical.

Returns

pandas data frame of numeric values, with a size of (n_features, n_samples). It is the data with batch effects corrected.
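
A sketch with toy data; note that, unlike most functions in this module, samples are columns here:

    import numpy as np
    import pandas as pd

    import candle

    # Toy expression matrix of size (n_features, n_samples).
    samples = ["s0", "s1", "s2", "s3", "s4", "s5"]
    data = pd.DataFrame(np.random.rand(4, 6), columns=samples)

    # One batch label per sample, indexed by sample name.
    batch_labels = pd.Series(["A", "A", "A", "B", "B", "B"], index=samples)

    corrected = candle.combat_batch_effect_removal(data, batch_labels)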

candle.coxen_multi_drug_gene_selection(source_data, target_data, drug_response_data, drug_response_col, tumor_col, drug_col, prediction_power_measure='lm', num_predictive_gene=100, generalization_power_measure='ccc', num_generalizable_gene=50, union_of_single_drug_selection=False)[source]

This function uses the COXEN approach to select genes for predicting the response of multiple drugs. It assumes no missing data exist. It works in three modes.

(1) If union_of_single_drug_selection is True, prediction_power_measure must be either ‘pearson’ or ‘mutual_info’. This function runs coxen_single_drug_gene_selection for every drug with the given parameter setting and takes the union of the genes selected for every drug as the output. The size of the selected gene set may be larger than num_generalizable_gene.

(2) If union_of_single_drug_selection is False and prediction_power_measure is ‘lm’, this function uses a linear model to fit the response of multiple drugs using the expression of a gene, with the drugs one-hot encoded. The p-value associated with the coefficient of gene expression is used as the prediction power measure, according to which num_predictive_gene genes are selected. Then, among the predictive genes, num_generalizable_gene generalizable genes are selected.

(3) If union_of_single_drug_selection is False and prediction_power_measure is ‘pearson’ or ‘mutual_info’, for each drug this function ranks the genes according to their power of predicting the response of that drug. The union of an equal number of predictive genes for every drug is generated, and its size must be at least num_predictive_gene. Then, num_generalizable_gene generalizable genes are selected.

Parameters
  • source_data – pandas data frame of gene expressions of tumors, for which drug response is known. Its size is [n_source_samples, n_features].

  • target_data – pandas data frame of gene expressions of tumors, for which drug response needs to be predicted. Its size is [n_target_samples, n_features]. source_data and target_data have the same set of features and the orders of features must match.

  • drug_response_data – pandas data frame of drug response that must include a column of drug response values, a column of tumor IDs, and a column of drug IDs.

  • drug_response_col – non-negative integer or string. If integer, it is the column index of drug response in drug_response_data. If string, it is the column name of drug response.

  • tumor_col – non-negative integer or string. If integer, it is the column index of tumor IDs in drug_response_data. If string, it is the column name of tumor IDs.

  • drug_col – non-negative integer or string. If integer, it is the column index of drugs in drug_response_data. If string, it is the column name of drugs.

  • prediction_power_measure (string) – ‘pearson’ uses the absolute value of Pearson correlation coefficient to measure prediction power of a gene; ‘mutual_info’ uses the mutual information to measure prediction power of a gene; ‘lm’ uses the linear regression model to select predictive genes for multiple drugs. Default is ‘lm’.

  • num_predictive_gene (int) – the number of predictive genes to be selected.

  • generalization_power_measure (string) – ‘pearson’ indicates the Pearson correlation coefficient; ‘ccc’ indicates the concordance correlation coefficient. Default is ‘ccc’.

  • num_generalizable_gene (int) – the number of generalizable genes to be selected.

  • union_of_single_drug_selection (bool) – whether the final gene set should be the union of genes selected for every drug.

Returns

1-D numpy array containing the indices of selected genes.

candle.coxen_single_drug_gene_selection(source_data, target_data, drug_response_data, drug_response_col, tumor_col, prediction_power_measure='pearson', num_predictive_gene=100, generalization_power_measure='ccc', num_generalizable_gene=50, multi_drug_mode=False)[source]

This function selects genes for drug response prediction using the COXEN approach. The COXEN approach is designed for selecting genes to predict the response of tumor cells to a specific drug. This function assumes no missing data exist.

Parameters
  • source_data – pandas data frame of gene expressions of tumors, for which drug response is known. Its size is [n_source_samples, n_features].

  • target_data – pandas data frame of gene expressions of tumors, for which drug response needs to be predicted. Its size is [n_target_samples, n_features]. source_data and target_data have the same set of features and the orders of features must match.

  • drug_response_data – pandas data frame of drug response values for a drug. It must include a column of drug response values and a column of tumor IDs.

  • drug_response_col – non-negative integer or string. If integer, it is the column index of drug response in drug_response_data. If string, it is the column name of drug response.

  • tumor_col – non-negative integer or string. If integer, it is the column index of tumor IDs in drug_response_data. If string, it is the column name of tumor IDs.

  • prediction_power_measure (string) – ‘pearson’ uses the absolute value of Pearson correlation coefficient to measure prediction power of gene; ‘mutual_info’ uses the mutual information to measure prediction power of gene. Default is ‘pearson’.

  • num_predictive_gene (int) – the number of predictive genes to be selected.

  • generalization_power_measure (string) – ‘pearson’ indicates the Pearson correlation coefficient; ‘ccc’ indicates the concordance correlation coefficient. Default is ‘ccc’.

  • num_generalizable_gene (int) – the number of generalizable genes to be selected.

  • multi_drug_mode (bool) – indicating whether the function runs as an auxiliary function of COXEN gene selection for multiple drugs. Default is False.

Returns

1-D numpy array containing the indices of selected genes, if multi_drug_mode is False; 1-D numpy array of indices of sorting all genes according to their prediction power, if multi_drug_mode is True.
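
A sketch with synthetic data, assuming the tumor IDs in the response table correspond to the row indices of source_data:

    import numpy as np
    import pandas as pd

    import candle

    rng = np.random.default_rng(0)
    n_genes = 200

    # Expression for tumors with known (source) and unknown (target)
    # drug response; both share the same feature columns.
    source_data = pd.DataFrame(rng.random((30, n_genes)))
    target_data = pd.DataFrame(rng.random((20, n_genes)))

    # One response value per source tumor.
    drug_response_data = pd.DataFrame(
        {"tumor": source_data.index, "auc": rng.random(30)}
    )

    genes = candle.coxen_single_drug_gene_selection(
        source_data, target_data, drug_response_data,
        drug_response_col="auc", tumor_col="tumor",
        prediction_power_measure="pearson",
        num_predictive_gene=50,
        generalization_power_measure="ccc",
        num_generalizable_gene=20,
    )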

candle.generate_gene_set_data(data, genes, gene_name_type='entrez', gene_set_category='c6.all', metric='mean', standardize=False, data_dir='../../Data/examples/Gene_Sets/MSigDB.v7.0/')[source]

This function generates genomic data summarized at the gene set level.

Parameters
  • data – numpy array or pandas data frame of numeric values, with a shape of [n_samples, n_features].

  • genes – 1-D array or list of gene names with a length of n_features. It indicates which gene a genomic feature belongs to.

  • gene_name_type (string) – the type of gene name used in genes. ‘entrez’ indicates Entrez gene ID and ‘symbols’ indicates HGNC gene symbol. Default is ‘entrez’.

  • gene_set_category (string) – the gene sets for which data will be calculated. ‘c2.cgp’ indicates gene sets affected by chemical and genetic perturbations; ‘c2.cp.biocarta’ indicates BioCarta gene sets; ‘c2.cp.kegg’ indicates KEGG gene sets; ‘c2.cp.pid’ indicates PID gene sets; ‘c2.cp.reactome’ indicates Reactome gene sets; ‘c5.bp’ indicates GO biological processes; ‘c5.cc’ indicates GO cellular components; ‘c5.mf’ indicates GO molecular functions; ‘c6.all’ indicates oncogenic signatures. Default is ‘c6.all’.

  • metric (string) – the way to calculate gene-set-level data. ‘mean’ calculates the mean of gene features belonging to the same gene set. ‘sum’ calculates the summation of gene features belonging to the same gene set. ‘max’ calculates the maximum of gene features. ‘min’ calculates the minimum of gene features. ‘abs_mean’ calculates the mean of absolute values. ‘abs_maximum’ calculates the maximum of absolute values. Default is ‘mean’.

  • standardize (bool) – whether to standardize features before calculation. Standardization transforms each feature to have a zero mean and a unit standard deviation.

Returns

a data frame of calculated gene-set-level data. Column names are the gene set names.

candle.check_flag_conflicts(params: ConfigDict)[source]

Check whether parameters that must be exclusive are used in conjunction. The check is made against CONFLICT_LIST, a global list that describes parameter pairs that should be exclusive. Raises an exception if pairs of parameters in CONFLICT_LIST are specified simultaneously.

Parameters

params (Dict) – dictionary of parameters to check for conflicts

candle.finalize_parameters(bmk)[source]

Utility to parse parameters in common as well as parameters particular to each benchmark.

Parameters

bmk (Benchmark) – Object that has benchmark filepaths and specifications

Returns

Dictionary with all the parameters necessary to run the benchmark. Command line overwrites config file specifications

candle.parse_from_dictlist(dictlist: List[ParseDict], parser)[source]

Functionality to parse options.

Parameters
  • dictlist (List) – Specification of parameters

  • parser (ArgumentParser) – Current parser

Returns

consolidated parameters

Return type

ArgumentParser

candle.compute_empirical_calibration_interpolation(pSigma_cal: Type[ndarray], pPred_cal: Type[ndarray], true_cal: Type[ndarray], cv: int = 10)[source]

Use the arrays provided to estimate an empirical mapping between standard deviation and absolute value of error, both of which have been observed during inference. Since most of the time the prediction statistics are very noisy, two smoothing steps (based on scipy’s savgol filter) are performed. Cubic Hermite splines (PchipInterpolator) are constructed for interpolation. This type of spline preserves the monotonicity of the interpolation data and does not overshoot if the data is not smooth. The overall process of constructing a spline to express the mapping from standard deviation to error is composed of smoothing-interpolation-smoothing-interpolation.

Parameters
  • pSigma_cal (numpy array) – Part of the standard deviations array to use for calibration.

  • pPred_cal (numpy array) – Part of the predictions array to use for calibration.

  • true_cal (numpy array) – Part of the true (observed) values array to use for calibration.

  • cv (int) – Number of cross-validation folds to run to determine a ‘good’ fit.

Returns

Tuple of python objects

  • splineobj_best : scipy.interpolate python object A python object from scipy.interpolate that computes a cubic Hermite spline (PchipInterpolator) constructed to express the mapping from standard deviation to error after a ‘drastic’ smoothing of the predictions. A ‘good’ fit is determined by taking the spline for the fold that produces the smallest mean absolute error on testing data (not used for the smoothing / interpolation).

  • splineobj2 : scipy.interpolate python object A python object from scipy.interpolate that computes a cubic Hermite spline (PchipInterpolator) constructed to express the mapping from standard deviation to error. This spline is generated for interpolating the samples produced after the smoothing of the first interpolation spline (i.e. splineobj_best).

candle.compute_statistics_heteroscedastic(df_data: DataFrame, col_true: int = 4, col_pred_start: int = 6, col_std_pred_start: int = 7) → Tuple[Type[ndarray], ...][source]

Extracts ground truth, mean prediction, error, standard deviation of prediction and predicted (learned) standard deviation from inference data frame. The latter includes all the individual inference realizations.

Parameters
  • df_data (pandas dataframe) – Data frame generated by current heteroscedastic inference experiments. Indices are hard coded to agree with current version. (The inference file usually has the name: <model>.predicted_INFER_HET.tsv).

  • col_true (int) – Index of the column in the data frame where the true value is stored (Default: 4, index in current HET format).

  • col_pred_start (int) – Index of the column in the data frame where the first predicted value is stored. All the predicted values during inference are stored and are interspaced with standard deviation predictions (Default: 6 index, step 2, in current HET format).

  • col_std_pred_start (int) – Index of the column in the data frame where the first predicted standard deviation value is stored. All the predicted values during inference are stored and are interspaced with predictions (Default: 7 index, step 2, in current HET format).

Returns

Tuple of numpy arrays

  • Ytrue (numpy array): Array with true (observed) values

  • Ypred_mean (numpy array): Array with predicted values (mean of predictions).

  • yerror (numpy array): Array with errors computed (observed - predicted).

  • sigma (numpy array): Array with standard deviations learned with deep learning model. For heteroscedastic inference this corresponds to the sqrt(exp(s^2)) with s^2 predicted value.

  • Ypred_std (numpy array): Array with standard deviations computed from regular (homoscedastic) inference.

  • pred_name (string): Name of data column or quantity predicted (as extracted from the data frame using the col_true index).

candle.compute_statistics_homoscedastic(df_data: DataFrame, col_true: int = 4, col_pred_start: int = 6) → Tuple[Type[ndarray], ...][source]

Extracts ground truth, mean prediction, error and standard deviation of prediction from inference data frame. The latter includes all the individual inference realizations.

Parameters
  • df_data (pandas dataframe) – Data frame generated by current CANDLE inference experiments. Indices are hard coded to agree with current CANDLE version. (The inference file usually has the name: <model>.predicted_INFER.tsv).

  • col_true (int) – Index of the column in the data frame where the true value is stored (Default: 4, index in current HOM format).

  • col_pred_start (int) – Index of the column in the data frame where the first predicted value is stored. All the predicted values during inference are stored (Default: 6 index, in current HOM format).

Returns

Tuple of numpy arrays

  • Ytrue (numpy array): Array with true (observed) values

  • Ypred_mean (numpy array): Array with predicted values (mean of predictions).

  • yerror (numpy array): Array with errors computed (observed - predicted).

  • sigma (numpy array): Array with standard deviations learned with deep learning model. For homoscedastic inference this corresponds to the std value computed from prediction (and is equal to the following returned variable).

  • Ypred_std (numpy array): Array with standard deviations computed from regular (homoscedastic) inference.

  • pred_name (string): Name of data column or quantity predicted (as extracted from the data frame using the col_true index).

candle.compute_statistics_homoscedastic_summary(df_data: DataFrame, col_true: int = 0, col_pred: int = 6, col_std_pred: int = 7) → Tuple[Type[ndarray], ...][source]

Extracts ground truth, mean prediction, error and standard deviation of prediction from inference data frame. The latter includes the statistics over all the inference realizations.

Parameters
  • df_data (pandas dataframe) – Data frame generated by current CANDLE inference experiments. Indices are hard coded to agree with current CANDLE version. (The inference file usually has the name: <model>_pred.tsv).

  • col_true (int) – Index of the column in the data frame where the true value is stored (Default: 0, index in current CANDLE format).

  • col_pred (int) – Index of the column in the data frame where the predicted value is stored (Default: 6, index in current CANDLE format).

  • col_std_pred (int) – Index of the column in the data frame where the standard deviation of the predicted values is stored (Default: 7, index in current CANDLE format).

Returns

Tuple of numpy arrays

  • Ytrue (numpy array): Array with true (observed) values

  • Ypred_mean (numpy array): Array with predicted values (mean from summary).

  • yerror (numpy array): Array with errors computed (observed - predicted).

  • sigma (numpy array): Array with standard deviations learned with deep learning model. For homoscedastic inference this corresponds to the std value computed from prediction (and is equal to the following returned variable).

  • Ypred_std (numpy array): Array with standard deviations computed from regular (homoscedastic) inference.

  • pred_name (string): Name of data column or quantity predicted (as extracted from the data frame using the col_true index).

candle.compute_statistics_quantile(df_data: DataFrame, sigma_divisor: float = 2.56, col_true: int = 4, col_pred_start: int = 6) → Tuple[Type[ndarray], ...][source]

Extracts ground truth, 50th percentile mean prediction, low percentile and high percentile mean prediction (usually 1st decile and 9th decile respectively), error (using 5th decile), standard deviation of prediction (using 5th decile) and predicted (learned) standard deviation from interdecile range in inference data frame. The latter includes all the individual inference realizations.

Parameters
  • df_data (pandas dataframe) – Data frame generated by current quantile inference experiments. Indices are hard coded to agree with current version. (The inference file usually has the name: <model>.predicted_INFER_QTL.tsv).

  • sigma_divisor (float) – Divisor to convert from the interdecile range to the corresponding standard deviation for a Gaussian distribution. (Default: 2.56, consistent with an interdecile range computed from the difference between the 9th and 1st deciles).

  • col_true (int) – Index of the column in the data frame where the true value is stored (Default: 4, index in current QTL format).

  • col_pred_start (int) – Index of the column in the data frame where the first predicted value is stored. All the predicted values during inference are stored and are interspaced with other percentile predictions (Default: 6 index, step 3, in current QTL format).

Returns

Tuple of numpy arrays

  • Ytrue (numpy array): Array with true (observed) values

  • Ypred (numpy array): Array with predicted values (based on the 50th percentile).

  • yerror (numpy array): Array with errors computed (observed - predicted).

  • sigma (numpy array): Array with standard deviations learned with deep learning model. This corresponds to the interdecile range divided by the sigma divisor.

  • Ypred_std (numpy array): Array with standard deviations computed from regular (homoscedastic) inference.

  • pred_name (string): Name of data column or quantity predicted (as extracted from the data frame using the col_true index).

  • Ypred_Lp_mean (numpy array): Array with predicted values of the lower percentile (usually the 1st decile).

  • Ypred_Hp_mean (numpy array): Array with predicted values of the higher percentile (usually the 9th decile).

candle.generate_index_distribution(numTrain: int, numTest: int, numValidation: int, params: UQDistDict) → Tuple[Any, ...][source]

Generates a vector of indices to partition the data for training. NO CHECKING IS DONE: it is assumed that the data could be partitioned in the specified blocks and that the block indices describe a coherent partition.

Parameters
  • numTrain (int) – Number of training data points

  • numTest (int) – Number of testing data points

  • numValidation (int) – Number of validation data points (may be zero)

  • params (Dict) – Contains the keywords that control the behavior of the function (uq_train_fr, uq_valid_fr, uq_test_fr for fraction specification, uq_train_vec, uq_valid_vec, uq_test_vec for block list specification, and uq_train_bks, uq_valid_bks, uq_test_bks for block number specification)

Returns

Tuple of numpy arrays

  • indexTrain (int numpy array): Indices for data in training

  • indexValidation (int numpy array): Indices for data in validation (if any)

  • indexTest (int numpy array): Indices for data in testing (if merging)
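
A sketch using the fraction-based specification; unpacking into three arrays assumes a validation partition is requested:

    import candle

    # One of the three supported specification modes (uq_*_fr fractions;
    # uq_*_vec block lists and uq_*_bks block counts are the alternatives).
    params = {"uq_train_fr": 0.7, "uq_valid_fr": 0.2, "uq_test_fr": 0.1}

    index_train, index_validation, index_test = candle.generate_index_distribution(
        numTrain=700, numTest=100, numValidation=200, params=params
    )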

candle.split_data_for_empirical_calibration(Ytrue: Type[ndarray], Ypred: Type[ndarray], sigma: Type[ndarray], cal_split: float = 0.8) → Tuple[Type[ndarray], ...][source]

Extracts a portion of the arrays provided for the computation of the calibration and reserves the remaining portion for testing.

Parameters
  • Ytrue (numpy array) – Array with true (observed) values

  • Ypred (numpy array) – Array with predicted values.

  • sigma (numpy array) – Array with standard deviations learned with deep learning model (or std value computed from prediction if homoscedastic inference).

  • cal_split (float) – Split of data to use for estimating the calibration relationship. It is assumed that it will be a value in (0, 1). (Default: use 80% of predictions to generate empirical calibration).

Returns

Tuple of numpy arrays

  • index_perm_total (numpy array): Random permutation of the array indices. The first ‘num_cal’ of the indices correspond to the samples that are used for calibration, while the remainder are the samples reserved for calibration testing.

  • pSigma_cal (numpy array): Part of the input sigma array to use for calibration.

  • pSigma_test (numpy array): Part of the input sigma array to reserve for testing.

  • pPred_cal (numpy array): Part of the input Ypred array to use for calibration.

  • pPred_test (numpy array): Part of the input Ypred array to reserve for testing.

  • true_cal (numpy array): Part of the input Ytrue array to use for calibration.

  • true_test (numpy array): Part of the input Ytrue array to reserve for testing.
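
A sketch chaining this split with the empirical calibration fit described above, on synthetic inference results:

    import numpy as np

    import candle

    rng = np.random.default_rng(0)
    Ytrue = rng.random(1000)
    Ypred = Ytrue + rng.normal(scale=0.1, size=1000)
    sigma = np.abs(rng.normal(loc=0.1, scale=0.02, size=1000))

    (index_perm_total, pSigma_cal, pSigma_test, pPred_cal, pPred_test,
     true_cal, true_test) = candle.split_data_for_empirical_calibration(
        Ytrue, Ypred, sigma, cal_split=0.8)

    # Fit the sigma -> |error| mapping on the calibration portion only.
    splineobj_best, splineobj2 = candle.compute_empirical_calibration_interpolation(
        pSigma_cal, pPred_cal, true_cal, cv=10)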

candle.plot_2d_density_sigma_vs_error(sigma, yerror, method=None, figprefix=None)[source]

Functionality to plot a 2D histogram of the distribution of the standard deviations computed for the predictions vs. the computed errors (i.e. values of observed - predicted). The plot generated is stored in a png file.

Parameters
  • sigma (numpy array) – Array with standard deviations computed.

  • yerror (numpy array) – Array with errors computed (observed - predicted).

  • method (string) – Method used to compute the standard deviations (i.e. dropout, heteroscedastic, etc.).

  • figprefix (string) – String to prefix the filename to store the figure generated. A ‘_density_sigma_error.png’ string will be appended to the figprefix given.

candle.plot_array(nparray, xlabel, ylabel, title, fname)[source]
candle.plot_calibrated_std(y_test, y_pred, std_calibrated, thresC, pred_name=None, figprefix=None)[source]

Functionality to plot values in testing set after calibration. An estimation of the lower-confidence samples is made. The plot generated is stored in a png file.

Parameters
  • y_test (numpy array) – Array with (true) observed values.

  • y_pred (numpy array) – Array with predicted values.

  • std_calibrated (numpy array) – Array with standard deviation values after calibration.

  • thresC (float) – Threshold to label low confidence predictions (low confidence predictions are the ones with std > thresC).

  • pred_name (string) – Name of data column or quantity predicted (e.g. growth, AUC, etc.).

  • figprefix (string) – String to prefix the filename to store the figure generated. A ‘_calibrated.png’ string will be appended to the figprefix given.

candle.plot_calibration_interpolation(mean_sigma, error, splineobj1, splineobj2, method='', figprefix=None, steps=False)[source]

Functionality to plot empirical calibration curves estimated by interpolation of the computed standard deviations and errors. Since the estimations are very noisy, two levels of smoothing are used. Both can be plotted independently, if requested. The plots generated are stored in png files.

Parameters
  • mean_sigma (numpy array) – Array with the mean standard deviations computed in inference.

  • error (numpy array) – Array with the errors computed from the means predicted in inference.

  • splineobj1 (scipy.interpolate python object) – A python object from scipy.interpolate that computes a cubic Hermite spline (PchipInterpolator) to express the interpolation after the first smoothing. This spline is a partial result generated during the empirical calibration procedure.

  • splineobj2 (scipy.interpolate python object) – A python object from scipy.interpolate that computes a cubic Hermite spline (PchipInterpolator) to express the mapping from standard deviation to error. This spline is generated for interpolating the predictions after a process of smoothing-interpolation-smoothing computed during the empirical calibration procedure.

  • method (string) – Method used to compute the standard deviations (i.e. dropout, heteroscedastic, etc.).

  • figprefix (string) – String to prefix the filename to store the figure generated. A ‘_empirical_calibration_interpolation.png’ string will be appended to the figprefix given.

  • steps (bool) – Besides the complete empirical calibration (including the interpolating spline), also generate partial plots showing only the interpolating spline after the first smoothing level (smooth1).

candle.plot_contamination(y_true, y_pred, sigma, T=None, thresC=0.1, pred_name=None, figprefix=None)[source]

Functionality to plot results for the contamination model. This includes the latent variables T if they are given (i.e. if the results provided correspond to training results). Global parameters of the normal distribution are used for shading the 80% confidence interval. If results are for training (i.e. T is available), samples determined to be outliers (i.e. samples whose probability of membership to the heavy-tailed (Cauchy) distribution is greater than the threshold given) are highlighted. The plots generated are stored in png files.

Parameters
  • y_true (numpy array) – Array with observed values.

  • y_pred (numpy array) – Array with predicted values.

  • sigma (float) – Standard deviation of the normal distribution.

  • T (numpy array) – Array with latent variables (i.e. membership to the normal and heavy-tailed distributions). If T is not available (e.g. in testing), pass None.

  • thresC (float) – Threshold to label outliers (outliers are the ones with probability of membership to heavy-tailed distribution, i.e. T[:,1] > thresC).

  • pred_name (string) – Name of data column or quantity predicted (e.g. growth, AUC, etc.).

  • figprefix (string) – String to prefix the filename to store the figures generated. A ‘_contamination.png’ string will be appended to the figprefix given.

candle.plot_decile_predictions(Ypred, Ypred_Lp, Ypred_Hp, decile_list, pred_name=None, figprefix=None)[source]

Functionality to plot the mean of the deciles predicted. The plot generated is stored in a png file.

Parameters
  • Ypred (numpy array) – Array with median predicted values.

  • Ypred_Lp (numpy array) – Array with low decile predicted values.

  • Ypred_Hp (numpy array) – Array with high decile predicted values.

  • decile_list (List) – List of deciles predicted (e.g. ‘1st’, ‘9th’, etc.)

  • pred_name (string) – Name of data column or quantity predicted (e.g. growth, AUC, etc.)

  • figprefix (string) – String to prefix the filename to store the figure generated. A ‘_decile_predictions.png’ string will be appended to the figprefix given.

candle.plot_density_observed_vs_predicted(Ytest, Ypred, pred_name=None, figprefix=None)[source]

Functionality to plot a 2D histogram of the distribution of observed (ground truth) values vs. predicted values. The plot generated is stored in a png file.

Parameters
  • Ytest (numpy array) – Array with (true) observed values

  • Ypred (numpy array) – Array with predicted values.

  • pred_name (string) – Name of data column or quantity predicted (e.g. growth, AUC, etc.)

  • figprefix (string) – String to prefix the filename to store the figure generated. A ‘_density_predictions.png’ string will be appended to the figprefix given.

candle.plot_histogram_error_per_sigma(sigma, yerror, method=None, figprefix=None)[source]

Functionality to plot a 1D histogram of the distribution of computed errors (i.e. values of observed - predicted) observed for specific values of standard deviations computed. The range of standard deviations computed is split in xbins values and the 1D histograms of error distributions for the smallest six standard deviations are plotted. The plot generated is stored in a png file.

Parameters
  • sigma (numpy array) – Array with standard deviations computed.

  • yerror (numpy array) – Array with errors computed (observed - predicted).

  • method (string) – Method used to compute the standard deviations (i.e. dropout, heteroscedastic, etc.).

  • figprefix (string) – String to prefix the filename to store the figure generated. A ‘_histogram_error_per_sigma.png’ string will be appended to the figprefix given.

candle.plot_history(out, history, metric='loss', val=True, title=None, width=8, height=6)[source]
candle.plot_scatter(data, classes, out, width=10, height=8)[source]
candle.clr_callback(mode: Optional[str] = None, base_lr: float = 0.0001, max_lr: float = 0.001, gamma: float = 0.999994) → Callback[source]

Creates keras callback for cyclical learning rate.
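
A hedged sketch; the mode string ‘trng1’ is an assumption about the recognized mode names (clr_check_args below validates the accepted values):

    import candle

    # 'trng1' (a triangular cycle) is assumed here, not confirmed above.
    clr = candle.clr_callback(mode="trng1", base_lr=1e-4, max_lr=1e-3)

    # model.fit(x_train, y_train, epochs=20, callbacks=[clr])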

candle.clr_check_args(args: Dict) → bool[source]

Checks if the arguments for cyclical learning rate are valid.

candle.clr_set_args(args: Dict) → Dict[source]
candle.build_initializer(initializer: str, kerasDefaults: Dict, seed: Optional[int] = None, constant: float = 0.0)[source]

Set the initializer to the appropriate Keras initializer function based on the input string and learning rate. Other required values are set to the Keras default values.

Parameters
  • initializer (string) – String to choose the initializer. Options recognized: ‘constant’, ‘uniform’, ‘normal’, ‘glorot_uniform’, ‘lecun_uniform’, ‘he_normal’. See the Keras documentation for a full description of the options.

  • kerasDefaults (Dict) – Dictionary of default parameter values to ensure consistency between frameworks

  • seed (int) – Random number seed

  • constant (float) – Constant value (for the constant initializer only)

Returns

The appropriate Keras initializer function

candle.build_optimizer(optimizer, lr, kerasDefaults)[source]

Set the optimizer to the appropriate Keras optimizer function based on the input string and learning rate. Other required values are set to the Keras default values.

Parameters
  • optimizer (string) – String to choose the optimizer. Options recognized: ‘sgd’, ‘rmsprop’, ‘adagrad’, ‘adadelta’, ‘adam’. See the Keras documentation for a full description of the options.

  • lr (float) – Learning rate

  • kerasDefaults (Dict) – Dictionary of default parameter values to ensure consistency between frameworks

Returns

The appropriate Keras optimizer function
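
A sketch combining keras_default_config with the two builders:

    import candle

    keras_defaults = candle.keras_default_config()

    optimizer = candle.build_optimizer("sgd", 0.01, keras_defaults)
    initializer = candle.build_initializer("glorot_uniform", keras_defaults,
                                           seed=7102)

    # model.compile(optimizer=optimizer, loss="mse")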

candle.compute_trainable_params(model)[source]

Extract the number of parameters from the given Keras model.

Parameters

model – Keras model

Returns

Python dictionary that contains trainable_params, non_trainable_params and total_params
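
For example, on a toy model:

    from tensorflow import keras

    import candle

    model = keras.Sequential([keras.layers.Dense(4, input_shape=(2,))])

    counts = candle.compute_trainable_params(model)
    # A dictionary with trainable_params (here 2 * 4 + 4 = 12),
    # non_trainable_params, and total_params.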

candle.get_function(name: str)[source]
candle.mae(y_true, y_pred)[source]
candle.mse(y_true, y_pred)[source]
candle.r2(y_true, y_pred)[source]
candle.register_permanent_dropout()[source]
candle.set_parallelism_threads()[source]

Set the number of parallel threads according to the number available on the hardware.

candle.set_seed(seed: int)[source]

Set the random number seed to the desired value.

Parameters

seed (int) – Random number seed.

candle.abstention_acc_class_i_metric(nb_classes: Union[int, Type[ndarray]], class_i: int)[source]

Function to estimate accuracy over the class i prediction after removing the samples where the model is abstaining.

Parameters
  • nb_classes (int or ndarray) – Integer or numpy array defining indices of the abstention class

  • class_i (int) – Index of the class to estimate accuracy after removing abstention samples

candle.abstention_acc_metric(nb_classes: Union[int, Type[ndarray]])[source]

Abstained accuracy: Function to estimate accuracy over the predicted samples after removing the samples where the model is abstaining.

Parameters

nb_classes (int or ndarray) – Integer or numpy array defining indices of the abstention class

candle.abstention_class_i_metric(nb_classes: Union[int, Type[ndarray]], class_i: int)[source]

Function to estimate fraction of the samples where the model is abstaining in class i.

Parameters
  • nb_classes (int or ndarray) – Integer or numpy array defining indices of the abstention class

  • class_i (int) – Index of the class to estimate accuracy

candle.abstention_loss(alpha, mask: Type[ndarray])[source]

Function to compute the abstention loss. It is composed of two terms: (i) the original loss of the multiclass classification problem, and (ii) the cost associated with the abstaining samples.

Parameters
  • alpha – Keras variable. Weight of abstention term in cost function

  • mask (ndarray) – Numpy array to use as mask for abstention: it is 1 on the output associated to the abstention class and 0 otherwise
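
A sketch of wiring the abstention loss into a Keras model, assuming the abstention unit is the last output (see modify_labels and add_model_output for preparing the labels and the model):

    import numpy as np
    import tensorflow as tf

    import candle

    num_classes = 3            # original classes; one abstention unit appended
    mask = np.zeros(num_classes + 1)
    mask[-1] = 1.0             # 1 on the abstention output, 0 elsewhere

    alpha = tf.Variable(0.5)   # weight of the abstention term
    loss = candle.abstention_loss(alpha, mask)

    # model.compile(optimizer="sgd", loss=loss,
    #               metrics=[candle.abstention_acc_metric(num_classes)])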

candle.abstention_metric(nb_classes: Union[int, Type[ndarray]])[source]

Function to estimate fraction of the samples where the model is abstaining.

Parameters

nb_classes (int or ndarray) – Integer or numpy array defining indices of the abstention class

candle.acc_class_i_metric(class_i: int)[source]

Function to estimate accuracy over the ith class prediction. This estimation is global (i.e. abstaining samples are not removed).

Parameters

class_i (int) – Index of the class to estimate accuracy

candle.add_index_to_output(y_train: Type[ndarray]) → Type[ndarray][source]

This function adds a column to the training output to store the indices of the corresponding samples in the training set.

Parameters

y_train (ndarray) – Numpy array of the output in the training set

candle.add_model_output(modelIn, mode: Optional[str] = None, num_add: Optional[int] = None, activation: Optional[str] = None)[source]

This function modifies the last dense layer in the passed keras model. The modification includes adding units and optionally changing the activation function.

Parameters
  • modelIn – keras model. Keras model to be modified.

  • mode (string) –

    Mode to modify the layer. It could be:

    • ‘abstain’ for adding an arbitrary number of units for the abstention optimization strategy.

    • ‘qtl’ for quantile regression which needs the outputs to be tripled.

    • ‘het’ for heteroscedastic regression which needs the outputs to be doubled.

  • num_add (int) – Number of units to add. This only applies to the ‘abstain’ mode.

  • activation (string) – String with the keras specification of an activation function (e.g. ‘relu’, ‘sigmoid’, ‘softmax’, etc.)

Returns

Keras model after last dense layer has been modified as specified. If there is no mode specified it returns the same model. If the mode is not one of ‘abstain’, ‘qtl’ or ‘het’ an exception is raised.

Return type

Keras model
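
A sketch on a toy classifier:

    from tensorflow import keras

    import candle

    # A 3-class classifier ending in a dense softmax layer.
    model = keras.Sequential([
        keras.layers.Dense(16, activation="relu", input_shape=(8,)),
        keras.layers.Dense(3, activation="softmax"),
    ])

    # Add one abstention unit (3 -> 4 outputs), keeping a softmax activation.
    abstaining_model = candle.add_model_output(
        model, mode="abstain", num_add=1, activation="softmax"
    )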

candle.contamination_loss(nout: int, T_k, a, sigmaSQ, gammaSQ)[source]

Function to compute the contamination loss. It is composed of two terms: (i) the loss with respect to the normal distribution that models the distribution of the training data samples, and (ii) the loss with respect to the Cauchy distribution that models the distribution of the outlier samples. Note that evaluating this contamination loss does not make sense for any data other than the training set, because the latent variables are only defined for samples in the training set.

Parameters
  • nout (int) – Number of outputs without uq augmentation (in the contamination model the augmentation corresponds to the data index in training).

  • T_k – Keras tensor. Tensor containing the latent variables (probability of membership to the normal and Cauchy distributions) for each of the samples in the training set. (Validation data is usually augmented too, to be able to run training with a validation set; however, the validation loss should not be used as a criterion for early stopping, since the latent variables are defined for the training set only and are not valid for any other data.)

  • a – Keras variable. Probability of belonging to the normal distribution

  • sigmaSQ – Keras variable. Variance estimated for the normal distribution

  • gammaSQ – Keras variable. Scale estimated for the Cauchy distribution

candle.heteroscedastic_loss(nout: int)[source]

This function computes the heteroscedastic loss for the heteroscedastic model. Both mean and standard deviation predictions are taken into account.

Parameters

nout (int) – Number of outputs without uq augmentation

candle.mae_contamination_metric(nout: int)[source]

This function computes the mean absolute error (mae) for the contamination model. The mae is computed over the prediction. Therefore, the augmentation for the index variable is ignored.

Parameters

nout (int) – Number of outputs without uq augmentation (in the contamination model the augmentation corresponds to the data index in training).

candle.mae_heteroscedastic_metric(nout: int)[source]

This function computes the mean absolute error (mae) for the heteroscedastic model. The mae is computed over the prediction of the mean and the standard deviation prediction is not taken into account.

Parameters

nout (int) – Number of outputs without uq augmentation

candle.meanS_heteroscedastic_metric(nout: int)[source]

This function computes the mean log of the variance (log S) for the heteroscedastic model. The mean log is computed over the standard deviation prediction and the mean prediction is not taken into account.

Parameters

nout (int) – Number of outputs without uq augmentation

candle.modify_labels(numclasses_out: int, ytrain: Type[ndarray], ytest: Type[ndarray], yval: Optional[Type[ndarray]] = None) → Tuple[Type[ndarray], ...][source]

This function generates a categorical representation with a class added for indicating abstention.

Parameters
  • numclasses_out (int) – Original number of classes + 1 abstention class

  • ytrain (ndarray) – Numpy array of the classes (labels) in the training set

  • ytest (ndarray) – Numpy array of the classes (labels) in the testing set

  • yval (ndarray) – Numpy array of the classes (labels) in the validation set
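
A short sketch, under the assumption that ytrain and ytest hold integer class labels for nb_classes original classes and that one array is returned per input:

nb_classes = 10                                   # original classes
ytrain_abs, ytest_abs = candle.modify_labels(nb_classes + 1, ytrain, ytest)
# With a validation set, a third array is returned:
# ytrain_abs, ytest_abs, yval_abs = candle.modify_labels(nb_classes + 1,
#                                                        ytrain, ytest, yval)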

candle.mse_contamination_metric(nout: int)[source]

This function computes the mean squared error (mse) for the contamination model. The mse is computed over the prediction. Therefore, the augmentation for the index variable is ignored.

Parameters

nout (int) – Number of outputs without uq augmentation (in the contamination model the augmentation corresponds to the data index in training).

candle.mse_heteroscedastic_metric(nout: int)[source]

This function computes the mean squared error (mse) for the heteroscedastic model. The mse is computed over the prediction of the mean and the standard deviation prediction is not taken into account.

Parameters

nout (int) – Number of outputs without uq augmentation

candle.quantile_loss(quantile: float, y_true, y_pred)[source]

This function computes the quantile loss for a given quantile fraction.

Parameters
  • quantile – float in (0, 1). Quantile fraction to compute the loss.

  • y_true – Keras tensor. Keras tensor including the ground truth

  • y_pred – Keras tensor. Keras tensor including the predictions of a quantile model.
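
For orientation, the usual pinball-loss definition behind quantile regression, written as a NumPy sketch (this restates the textbook formula; the exact reduction used inside candle.quantile_loss is not shown here):

import numpy as np

def pinball_loss(quantile, y_true, y_pred):
    # mean of max(q * e, (q - 1) * e), where e = y_true - y_pred
    e = y_true - y_pred
    return np.mean(np.maximum(quantile * e, (quantile - 1.0) * e))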

candle.quantile_metric(nout: int, index: int, quantile: float)[source]

This function computes the quantile metric for a given quantile and corresponding output index. This is provided as a metric to track evolution while training.

Parameters
  • nout (int) – Number of outputs without uq augmentation

  • index (int) – Index of output corresponding to the given quantile.

  • quantile – float in (0, 1). Fraction corresponding to the quantile

candle.r2_contamination_metric(nout: int)[source]

This function computes the r2 for the contamination model. The r2 is computed over the prediction. Therefore, the augmentation for the index variable is ignored.

Parameters

nout (int) – Number of outputs without uq augmentation (in the contamination model the augmentation corresponds to the data index in training).

candle.r2_heteroscedastic_metric(nout: int)[source]

This function computes the r2 for the heteroscedastic model. The r2 is computed over the prediction of the mean and the standard deviation prediction is not taken into account.

Parameters

nout (int) – Number of outputs without uq augmentation

candle.sparse_abstention_acc_metric(nb_classes: Union[int, Type[ndarray]])[source]

Abstained accuracy: Function to estimate accuracy over the predicted samples after removing the samples where the model is abstaining. Assumes y_true is not one-hot encoded.

Parameters

nb_classes (int or ndarray) – Integer or numpy array defining indices of the abstention class

candle.sparse_abstention_loss(alpha, mask: Type[ndarray])[source]

Function to compute the abstention loss. It is composed of two terms: (i) the original loss of the multiclass classification problem, and (ii) the cost associated with the abstaining samples. Assumes y_true is not one-hot encoded.

Parameters
  • alpha – Keras variable. Weight of abstention term in cost function

  • mask (ndarray) – Numpy array to use as mask for abstention: it is 1 on the output associated to the abstention class and 0 otherwise
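
A hedged sketch of building the mask and compiling. It assumes the abstention class occupies the last output unit and that alpha may be supplied as a Keras backend variable (as adapted by AbstentionAdapt_Callback below); model is presumed defined:

import numpy as np
from tensorflow.keras import backend as K

nb_classes = 10                       # original classes; unit 10 is abstention
mask = np.zeros(nb_classes + 1)
mask[-1] = 1                          # 1 only on the abstention output
alpha = K.variable(0.1)               # weight of the abstention term
model.compile(optimizer='adam',
              loss=candle.sparse_abstention_loss(alpha, mask),
              metrics=[candle.sparse_abstention_acc_metric(nb_classes)])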

candle.triple_quantile_loss(nout: int, lowquantile: float, highquantile: float)[source]

This function computes the quantile loss for the median and low and high quantiles. The median is given twice the weight of the other components.

Parameters
  • nout (int) – Number of outputs without uq augmentation

  • lowquantile – float in (0, 1). Fraction corresponding to the low quantile

  • highquantile – float in (0, 1). Fraction corresponding to the high quantile
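
A compile sketch for a three-quantile model; the index convention passed to quantile_metric (which output slot corresponds to which quantile) is an assumption for illustration:

nout = 1
model.compile(optimizer='adam',
              loss=candle.triple_quantile_loss(nout, lowquantile=0.1,
                                               highquantile=0.9),
              metrics=[candle.quantile_metric(nout, index=0, quantile=0.5),
                       candle.quantile_metric(nout, index=1, quantile=0.1),
                       candle.quantile_metric(nout, index=2, quantile=0.9)])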

candle.plot_metrics(history, title=None, skip_ep=0, outdir='.', add_lr=False)[source]

Plots Keras training curves from a history object.

Args:

  • skip_ep: number of epochs to skip when plotting metrics
  • add_lr: add curve of learning rate progression over epochs
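
All arguments below are taken from the signature above; a typical call might look like (model and data presumed defined):

history = model.fit(X_train, Y_train, validation_split=0.1, epochs=50)
candle.plot_metrics(history, title='baseline run', skip_ep=5,
                    outdir='./plots', add_lr=True)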

candle.build_pytorch_activation(activation: str)[source]
candle.build_pytorch_optimizer(model, optimizer: str, lr: float, kerasDefaults: Dict, trainable_only: bool = True)[source]
candle.get_pytorch_function(name: str)[source]
candle.pytorch_initialize(weights, initializer, kerasDefaults, seed=None, constant=0.0)[source]
candle.pytorch_mse(y_true, y_pred)[source]
candle.pytorch_xent(y_true, y_pred)[source]
candle.set_pytorch_seed(seed)[source]

Set the random number seed to the desired value.

Parameters

seed (int) – Random number seed.

candle.set_pytorch_threads()[source]
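
These helpers are listed without docstrings; a hedged sketch of combining them, where model is a torch module presumed defined and keras_defaults stands in for the dictionary of Keras-style default hyperparameters expected by kerasDefaults (its exact keys are an assumption):

candle.set_pytorch_seed(7102)        # reproducible runs
candle.set_pytorch_threads()         # configure torch threading
activation = candle.build_pytorch_activation('relu')
optimizer = candle.build_pytorch_optimizer(model, 'sgd', lr=0.01,
                                           kerasDefaults=keras_defaults)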

Classes

class candle.Benchmark(filepath: str, defmodel: str, framework: str, prog: Optional[str] = None, desc: Optional[str] = None, parser=None, additional_definitions=None, required=None)[source]

Class that implements an interface to handle configuration options for the different CANDLE benchmarks.

It provides access to all the common configuration options and configuration options particular to each individual benchmark. It describes what minimum requirements should be specified to instantiate the corresponding benchmark. It interacts with the argparser to extract command-line options and arguments from the benchmark’s configuration files.

Inheritance

check_required_exists(gparam: ConfigDict) None[source]

Functionality to verify that the required model parameters have been specified.

format_benchmark_config_arguments(dictfileparam: ConfigDict) ConfigDict[source]

Functionality to format the particular parameters of the benchmark.

Parameters

dictfileparam (ConfigDict) – parameters read from configuration file

Return args

parameters read from the command line. Most of the time the command line overwrites the configuration file, except when the command line is using default values and the configuration file defines those values.

Return type

ConfigDict

get_parameter_from_file(absfname, param)[source]

Functionality to extract the value of one parameter from the configuration file given. Execution is terminated if the parameter specified is not found in the configuration file.

Parameters
  • absfname (string) – filename of the configuration file, including absolute path.

  • param (string) – parameter to extract from configuration file.

Returns

a string with the value of the parameter read from the configuration file.

Return type

string

parse_parameters() None[source]

Functionality to parse options common for all benchmarks.

This functionality is based on the methods ‘get_default_neon_parser’ and ‘get_common_parser’, which are defined previously (above). If their order changes or they are moved, the calls here have to be updated.

read_config_file(file: str) ConfigDict[source]

Functionality to read the configuration file specific to each benchmark.

Parameters

file (string) – path to the configuration file

Returns

parameters read from configuration file

Return type

ConfigDict

set_locals()[source]

Functionality to set variables specific to the benchmark.

  • required: set of required parameters for the benchmark.

  • additional_definitions: list of dictionaries describing the additional parameters for the benchmark.
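
A minimal subclassing sketch; the required set and additional_definitions entries shown are hypothetical placeholders for benchmark-specific parameters:

class MyBenchmark(candle.Benchmark):
    def set_locals(self):
        # Hypothetical benchmark-specific settings.
        self.required = {'learning_rate', 'batch_size', 'epochs'}
        self.additional_definitions = [
            {'name': 'latent_dim', 'type': int, 'default': 8,
             'help': 'dimension of the latent space'},
        ]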

class candle.Progbar(target, width=30, verbose=1, interval=0.01)[source]

Progress bar

Inheritance

update(current, values=[], force=False)[source]
Parameters
  • current (int) – index of current step

  • values (List) – list of tuples (name, value_for_last_step). The progress bar will display averages for these values.

  • force (bool) – force visual progress update
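
A self-contained usage sketch based on the signature above:

import time

bar = candle.Progbar(target=100)
for step in range(1, 101):
    time.sleep(0.01)                            # stand-in for real work
    bar.update(step, values=[('loss', 1.0 / step)])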

class candle.ArgumentStruct(**entries)[source]

Class that converts a python dictionary into an object with named entries given by the dictionary keys.

This structure simplifies the calling convention for accessing the dictionary values (corresponding to problem parameters). After the object instantiation both modes of access (dictionary or object entries) can be used.

Inheritance
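
A short sketch of both access modes:

params = {'epochs': 20, 'learning_rate': 0.001}
args = candle.ArgumentStruct(**params)
print(args.epochs)                   # object-style access: 20
print(vars(args)['learning_rate'])   # dictionary-style access: 0.001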

class candle.MultiGPUCheckpoint(filepath, monitor: str = 'val_loss', verbose: int = 0, save_best_only: bool = False, save_weights_only: bool = False, mode: str = 'auto', save_freq='epoch', options=None, initial_value_threshold=None, **kwargs)[source]

Inheritance
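
The constructor mirrors keras.callbacks.ModelCheckpoint, so usage is expected to be analogous (a sketch, with model and data presumed defined):

ckpt = candle.MultiGPUCheckpoint('best_model.h5', monitor='val_loss',
                                 save_best_only=True, save_weights_only=True)
model.fit(X_train, Y_train, validation_split=0.1, callbacks=[ckpt])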

class candle.CyclicLR(base_lr=0.001, max_lr=0.006, step_size=2000.0, mode='triangular', gamma=1.0, scale_fn=None, scale_mode='cycle')[source]

This callback implements a cyclical learning rate policy (CLR). The method cycles the learning rate between two boundaries with some constant frequency.

  • base_lr: initial learning rate which is the lower boundary in the cycle.

  • max_lr: upper boundary in the cycle. Functionally, it defines the cycle amplitude (max_lr - base_lr). The lr at any cycle is the sum of base_lr and some scaling of the amplitude; therefore max_lr may not actually be reached depending on scaling function.

  • step_size: number of training iterations per half cycle. The authors suggest setting step_size to 2–8x the number of training iterations in an epoch.

  • mode: one of {triangular, triangular2, exp_range}. Default ‘triangular’. Values correspond to the policies detailed below. If scale_fn is not None, this argument is ignored.

  • gamma: constant in ‘exp_range’ scaling function: gamma**(cycle iterations)

  • scale_fn: custom scaling policy defined by a single-argument lambda function, where 0 <= scale_fn(x) <= 1 for all x >= 0. If set, the mode parameter is ignored.

  • scale_mode: {‘cycle’, ‘iterations’}. Defines whether scale_fn is evaluated on cycle number or cycle iterations (training iterations since start of cycle). Default is ‘cycle’.

The amplitude of the cycle can be scaled on a per-iteration or per-cycle basis. This class has three built-in policies, as put forth in the paper.

  • “triangular”: A basic triangular cycle w/ no amplitude scaling.

  • “triangular2”: A basic triangular cycle that scales initial amplitude by half each cycle.

  • “exp_range”: A cycle that scales initial amplitude by gamma**(cycle iterations) at each cycle iteration.

For more detail, please see the paper.

# Example for CIFAR-10 w/ batch size 100:

clr = CyclicLR(base_lr=0.001, max_lr=0.006,
               step_size=2000., mode='triangular')
model.fit(X_train, Y_train, callbacks=[clr])

Class also supports custom scaling functions:

clr_fn = lambda x: 0.5*(1+np.sin(x*np.pi/2.))
clr = CyclicLR(base_lr=0.001, max_lr=0.006,
               step_size=2000., scale_fn=clr_fn,
               scale_mode='cycle')
model.fit(X_train, Y_train, callbacks=[clr])

# References

  • Leslie N. Smith, “Cyclical Learning Rates for Training Neural Networks,” arXiv:1506.01186.

Inheritance

on_batch_end(epoch, logs=None)[source]

A backwards compatibility alias for on_train_batch_end.

on_epoch_end(epoch, logs=None)[source]

Called at the end of an epoch.

Subclasses should override for any actions to run. This function should only be called during TRAIN mode.

Args:
  • epoch: Integer, index of epoch.
  • logs: Dict, metric results for this training epoch, and for the validation epoch if validation is performed. Validation result keys are prefixed with val_. For the training epoch, the values of the Model’s metrics are returned. Example: {‘loss’: 0.2, ‘accuracy’: 0.7}.

on_train_begin(logs={})[source]

Called at the beginning of training.

Subclasses should override for any actions to run.

Args:
  • logs: Dict. Currently no data is passed to this argument for this method, but that may change in the future.

class candle.CandleRemoteMonitor(params=None)[source]

Capture Run level output and store/send for monitoring.

Inheritance

on_epoch_begin(epoch, logs=None)[source]

Called at the start of an epoch.

Subclasses should override for any actions to run. This function should only be called during TRAIN mode.

Args:
  • epoch: Integer, index of epoch.
  • logs: Dict. Currently no data is passed to this argument for this method, but that may change in the future.

on_epoch_end(epoch, logs=None)[source]

Called at the end of an epoch.

Subclasses should override for any actions to run. This function should only be called during TRAIN mode.

Args:
  • epoch: Integer, index of epoch.
  • logs: Dict, metric results for this training epoch, and for the validation epoch if validation is performed. Validation result keys are prefixed with val_. For the training epoch, the values of the Model’s metrics are returned. Example: {‘loss’: 0.2, ‘accuracy’: 0.7}.

on_train_begin(logs=None)[source]

Called at the beginning of training.

Subclasses should override for any actions to run.

Args:
  • logs: Dict. Currently no data is passed to this argument for this method, but that may change in the future.

on_train_end(logs=None)[source]

Called at the end of training.

Subclasses should override for any actions to run.

Args:
  • logs: Dict. Currently the output of the last call to on_epoch_end() is passed to this argument for this method, but that may change in the future.

save()[source]

Save log_messages to file.
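
A usage sketch; run_params stands in for the hyperparameter dictionary a benchmark would normally pass, and model and data are presumed defined:

monitor = candle.CandleRemoteMonitor(params=run_params)
model.fit(X_train, Y_train, epochs=10, callbacks=[monitor])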

class candle.LoggingCallback(print_fcn=<built-in function print>)[source]

Inheritance

on_epoch_end(epoch, logs={})[source]

Called at the end of an epoch.

Subclasses should override for any actions to run. This function should only be called during TRAIN mode.

Args:
  • epoch: Integer, index of epoch.
  • logs: Dict, metric results for this training epoch, and for the validation epoch if validation is performed. Validation result keys are prefixed with val_. For the training epoch, the values of the Model’s metrics are returned. Example: {‘loss’: 0.2, ‘accuracy’: 0.7}.
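
A sketch routing per-epoch logs through the standard logging module instead of print (model and data presumed defined):

import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger('candle_run')
log_cb = candle.LoggingCallback(print_fcn=logger.info)
model.fit(X_train, Y_train, epochs=10, callbacks=[log_cb])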

class candle.PermanentDropout(*args, **kwargs)[source]

Inheritance

call(x, mask=None)[source]

This is where the layer’s logic lives.

The call() method may not create state (except in its first invocation, wrapping the creation of variables or other resources in tf.init_scope()). It is recommended to create state, including tf.Variable instances and nested Layer instances, in __init__(), or in the build() method that is called automatically before call() executes for the first time.

Args:
  • inputs: Input tensor, or dict/list/tuple of input tensors. The first positional inputs argument is subject to special rules:
    - inputs must be explicitly passed. A layer cannot have zero arguments, and inputs cannot be provided via the default value of a keyword argument.
    - NumPy array or Python scalar values in inputs get cast as tensors.
    - Keras mask metadata is only collected from inputs.
    - Layers are built (build(input_shape) method) using shape info from inputs only.
    - input_spec compatibility is only checked against inputs.
    - Mixed precision input casting is only applied to inputs. If a layer has tensor arguments in *args or **kwargs, their casting behavior in mixed precision should be handled manually.
    - The SavedModel input specification is generated using inputs only.
    - Integration with various ecosystem packages like TFMOT, TFLite, TF.js, etc. is only supported for inputs and not for tensors in positional and keyword arguments.
  • *args: Additional positional arguments. May contain tensors, although this is not recommended, for the reasons above.
  • **kwargs: Additional keyword arguments. May contain tensors, although this is not recommended, for the reasons above. The following optional keyword arguments are reserved:
    - training: Boolean scalar tensor or Python boolean indicating whether the call is meant for training or inference.
    - mask: Boolean input mask. If the layer’s call() method takes a mask argument, its default value will be set to the mask generated for inputs by the previous layer (if input did come from a layer that generated a corresponding mask, i.e. if it came from a Keras layer with masking support).

Returns:

A tensor or list/tuple of tensors.
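
Because the dropout mask stays active at inference time, repeated predictions yield Monte-Carlo style uncertainty estimates. A sketch, assuming the constructor accepts the same rate argument as keras.layers.Dropout:

from tensorflow import keras

inputs = keras.Input(shape=(64,))
x = keras.layers.Dense(32, activation='relu')(inputs)
x = candle.PermanentDropout(0.2)(x)   # dropout remains on at inference
outputs = keras.layers.Dense(1)(x)
model = keras.Model(inputs, outputs)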

class candle.TerminateOnTimeOut(timeout_in_sec=10)[source]

This class implements timeout on model training.

When the script reaches the timeout, this class sets model.stop_training = True.

Inheritance

on_epoch_end(epoch, logs={})[source]

On every epoch end, check whether the timeout has been exceeded and terminate training if necessary.

on_train_begin(logs={})[source]

Start clock to calculate timeout.
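
Usage follows directly from the signature above (model and data presumed defined):

timeout_cb = candle.TerminateOnTimeOut(timeout_in_sec=3600)  # stop after 1 hour
model.fit(X_train, Y_train, epochs=1000, callbacks=[timeout_cb])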

class candle.AbstentionAdapt_Callback(acc_monitor, abs_monitor, alpha0: float, init_abs_epoch: int = 4, alpha_scale_factor: float = 0.8, min_abs_acc: float = 0.9, max_abs_frac: float = 0.4, acc_gain: float = 5.0, abs_gain: float = 1.0)[source]

This callback is used to adapt the parameter alpha in the abstention loss.

The parameter alpha (the weight of the abstention term in the abstention loss) is increased or decreased adaptively during the training run: it is decreased if the current abstention accuracy falls below the minimum accuracy set, and increased if the current abstention fraction exceeds the maximum fraction set. The abstention accuracy metric to use must be specified as the ‘acc_monitor’ argument when initializing the callback; it could be the global abstention accuracy (abstention_acc), the abstention accuracy over the ith class (acc_class_i), etc. The abstention metric to use must be specified as the ‘abs_monitor’ argument; it should be the metric that computes the fraction of samples for which the model is abstaining (abstention). Thresholds for the minimum and maximum correction factors are computed, and the correction applied to alpha is not allowed to fall below or exceed them, respectively, to avoid large swings in the abstention loss evolution.

Inheritance

on_epoch_end(epoch: int, logs=None)[source]

Updates the weight of abstention term on epoch end.

Parameters
  • epoch (int) – Current epoch in training.

  • logs – Metrics stored during current keras training.
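
An instantiation sketch; the metric names ‘abstention_acc’ and ‘abstention’ are assumptions standing in for whatever abstention metrics are registered with the compiled model:

abs_cb = candle.AbstentionAdapt_Callback(acc_monitor='abstention_acc',
                                         abs_monitor='abstention',
                                         alpha0=0.1,
                                         min_abs_acc=0.95,
                                         max_abs_frac=0.3)
model.fit(X_train, Y_train, epochs=50, callbacks=[abs_cb])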

class candle.Contamination_Callback(x, y, a_max=0.99)[source]

This callback is used to update the parameters of the contamination model.

This functionality follows the EM algorithm: in the E-step the latent variables are updated, and in the M-step the global variables are updated. The global variables are ‘a’ (probability of membership to the normal class), ‘sigmaSQ’ (variance of the normal class) and ‘gammaSQ’ (scale of the Cauchy class, which models outliers). The latent variables are ‘T_k’: the first column holds the probability of membership to the normal distribution, while the second column holds the probability of membership to the Cauchy distribution, i.e. of being an outlier.

Inheritance

on_epoch_end(epoch: int, logs={})[source]

Updates the parameters of the distributions in the contamination model at epoch end. The parameters updated are: ‘a’, the global weight of membership to the normal distribution; ‘sigmaSQ’, the variance of the normal distribution; and ‘gammaSQ’, the scale of the Cauchy distribution of outliers. The latent variables ‘T_k’ are updated as well, with the first column describing the probability of membership to the normal distribution and the second column the probability of membership to the Cauchy distribution, i.e. outlier. The evolution of the global parameters (a, sigmaSQ and gammaSQ) is stored.

Parameters
  • epoch (int) – Current epoch in training.

  • logs – keras logs. Metrics stored during current keras training.
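
A fit-time sketch pairing the callback with the contamination loss defined earlier; y_train_augmented is a hypothetical name for the targets after the caller has appended the training-index column assumed by the loss:

cont_cb = candle.Contamination_Callback(x_train, y_train, a_max=0.99)
model.fit(x_train, y_train_augmented, epochs=100, callbacks=[cont_cb])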