uq_utils module
uq_utils.compute_empirical_calibration_interpolation(pSigma_cal, pPred_cal, true_cal, cv=10)

Use the arrays provided to estimate an empirical mapping between the standard deviation and the absolute value of the error, both of which have been observed during inference. Since the prediction statistics are usually very noisy, two smoothing steps (based on scipy's Savitzky-Golay (savgol) filter) are performed. Cubic Hermite splines (PchipInterpolator) are constructed for interpolation. This type of spline preserves the monotonicity of the interpolation data and does not overshoot if the data is not smooth. The overall process of constructing a spline to express the mapping from standard deviation to error is smoothing-interpolation-smoothing-interpolation.
- Parameters
pSigma_cal (numpy array) – Part of the standard deviations array to use for calibration.
pPred_cal (numpy array) – Part of the predictions array to use for calibration.
true_cal (numpy array) – Part of the true (observed) values array to use for calibration.
cv (int) – Number of cross-validation folds to run to determine a ‘good’ fit.
- Returns
splineobj_best (scipy.interpolate python object) – A python object from scipy.interpolate that evaluates a cubic Hermite spline (PchipInterpolator) constructed to express the mapping from standard deviation to error after a ‘drastic’ smoothing of the predictions. A ‘good’ fit is determined by taking the spline for the fold that produces the smallest mean absolute error on the testing data (not used for the smoothing / interpolation).
splineobj2 (scipy.interpolate python object) – A python object from scipy.interpolate that evaluates a cubic Hermite spline (PchipInterpolator) constructed to express the mapping from standard deviation to error. This spline interpolates the samples produced after smoothing the output of the first interpolation spline (i.e. splineobj_best).
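A minimal usage sketch follows, assuming the module is importable as uq_utils, that the two splines are returned in the order listed above, and that the calibration arrays come from split_data_for_empirical_calibration (documented further below); the arrays here are synthetic placeholders.

```python
import numpy as np
import uq_utils  # assumes the module is importable under this name

# Synthetic placeholders standing in for the outputs of
# split_data_for_empirical_calibration (see below).
rng = np.random.default_rng(0)
pSigma_cal = rng.uniform(0.1, 1.0, size=500)        # predicted std (calibration)
true_cal = rng.normal(0.0, 1.0, size=500)            # observed values (calibration)
pPred_cal = true_cal + rng.normal(0.0, pSigma_cal)   # predictions (calibration)
pSigma_test = rng.uniform(0.1, 1.0, size=100)        # predicted std (held out)

splineobj_best, splineobj2 = uq_utils.compute_empirical_calibration_interpolation(
    pSigma_cal, pPred_cal, true_cal, cv=10
)

# PchipInterpolator objects are callable: map held-out standard deviations
# to the empirically expected absolute error.
expected_abs_error = splineobj2(pSigma_test)
```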
uq_utils.compute_limits(numdata, numblocks, blocksize, blockn)

Generates the limits of indices corresponding to a specific block. It takes into account the non-exact divisibility of numdata into numblocks, letting the last block take the extra chunk.
- Parameters
numdata (int) – Total number of data points to distribute
numblocks (int) – Total number of blocks to distribute into
blocksize (int) – Size of data per block
blockn (int) – Index of block, from 0 to numblocks-1
- Returns
start (int) – Position to start assigning indices
end (int) – One beyond position to stop assigning indices
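The index arithmetic described above amounts to the following sketch (an illustrative reimplementation, not the library source):

```python
def compute_limits_sketch(numdata, numblocks, blocksize, blockn):
    """Illustrative sketch of the block limits described above."""
    start = blockn * blocksize
    end = start + blocksize
    if blockn == numblocks - 1:
        # non-exact divisibility: the last block takes the extra chunk
        end = numdata
    return start, end

# 10 data points split into 3 blocks of size 3: the last block gets 4 points.
assert compute_limits_sketch(10, 3, 3, 0) == (0, 3)
assert compute_limits_sketch(10, 3, 3, 2) == (6, 10)
```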
uq_utils.compute_statistics_heteroscedastic(df_data, col_true=4, col_pred_start=6, col_std_pred_start=7)

Extracts the ground truth, mean prediction, error, standard deviation of the prediction and predicted (learned) standard deviation from the inference data frame. The latter includes all the individual inference realizations.
- Parameters
df_data (pandas data frame) – Data frame generated by current heteroscedastic inference experiments. Indices are hard coded to agree with current version. (The inference file usually has the name: <model>.predicted_INFER_HET.tsv).
col_true (integer) – Index of the column in the data frame where the true value is stored (Default: index 4 in the current HET format).
col_pred_start (integer) – Index of the column in the data frame where the first predicted value is stored. All the predicted values during inference are stored and are interspaced with the standard deviation predictions (Default: index 6, step 2, in the current HET format).
col_std_pred_start (integer) – Index of the column in the data frame where the first predicted standard deviation value is stored. All the predicted standard deviation values during inference are stored and are interspaced with the predictions (Default: index 7, step 2, in the current HET format).
- Returns
Ytrue (numpy array) – Array with true (observed) values
Ypred_mean (numpy array) – Array with predicted values (mean of predictions).
yerror (numpy array) – Array with errors computed (observed - predicted).
sigma (numpy array) – Array with standard deviations learned with the deep learning model. For heteroscedastic inference this corresponds to sqrt(exp(s^2)), where s^2 is the predicted value.
Ypred_std (numpy array) – Array with standard deviations computed from regular (homoscedastic) inference.
pred_name (string) – Name of the data column or quantity predicted (as extracted from the data frame using the col_true index).
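A hedged usage sketch, assuming a tab-separated inference file following the naming pattern above (the path is a placeholder), the default HET column layout, and that the six arrays are returned in the order listed above:

```python
import pandas as pd
import uq_utils  # assumes the module is importable under this name

# Placeholder path; substitute an actual heteroscedastic inference file.
df_data = pd.read_csv("mymodel.predicted_INFER_HET.tsv", sep="\t")

(Ytrue, Ypred_mean, yerror,
 sigma, Ypred_std, pred_name) = uq_utils.compute_statistics_heteroscedastic(df_data)
```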
uq_utils.compute_statistics_homoscedastic(df_data, col_true=4, col_pred_start=6)

Extracts the ground truth, mean prediction, error and standard deviation of the prediction from the inference data frame. The latter includes all the individual inference realizations.
- Parameters
df_data (pandas data frame) – Data frame generated by current CANDLE inference experiments. Indices are hard coded to agree with current CANDLE version. (The inference file usually has the name: <model>.predicted_INFER.tsv).
col_true (integer) – Index of the column in the data frame where the true value is stored (Default: index 4 in the current HOM format).
col_pred_start (integer) – Index of the column in the data frame where the first predicted value is stored. All the predicted values during inference are stored (Default: index 6 in the current HOM format).
- Returns
Ytrue (numpy array) – Array with true (observed) values
Ypred_mean (numpy array) – Array with predicted values (mean of predictions).
yerror (numpy array) – Array with errors computed (observed - predicted).
sigma (numpy array) – Array with standard deviations learned with the deep learning model. For homoscedastic inference this corresponds to the std value computed from the predictions (and is equal to the following returned variable).
Ypred_std (numpy array) – Array with standard deviations computed from regular (homoscedastic) inference.
pred_name (string) – Name of the data column or quantity predicted (as extracted from the data frame using the col_true index).
uq_utils.compute_statistics_homoscedastic_summary(df_data, col_true=0, col_pred=6, col_std_pred=7)

Extracts the ground truth, mean prediction, error and standard deviation of the prediction from the inference data frame. The latter includes the statistics over all the inference realizations.
- Parameters
df_data (pandas data frame) – Data frame generated by current CANDLE inference experiments. Indices are hard coded to agree with current CANDLE version. (The inference file usually has the name: <model>_pred.tsv).
col_true (integer) – Index of the column in the data frame where the true value is stored (Default: index 0 in the current CANDLE format).
col_pred (integer) – Index of the column in the data frame where the predicted value is stored (Default: index 6 in the current CANDLE format).
col_std_pred (integer) – Index of the column in the data frame where the standard deviation of the predicted values is stored (Default: index 7 in the current CANDLE format).
- Returns
Ytrue (numpy array) – Array with true (observed) values
Ypred_mean (numpy array) – Array with predicted values (mean from summary).
yerror (numpy array) – Array with errors computed (observed - predicted).
sigma (numpy array) – Array with standard deviations learned with the deep learning model. For homoscedastic inference this corresponds to the std value computed from the predictions (and is equal to the following returned variable).
Ypred_std (numpy array) – Array with standard deviations computed from regular (homoscedastic) inference.
pred_name (string) – Name of the data column or quantity predicted (as extracted from the data frame using the col_true index).
uq_utils.compute_statistics_quantile(df_data, sigma_divisor=2.56, col_true=4, col_pred_start=6)

Extracts the ground truth, the 50th percentile mean prediction, the low and high percentile mean predictions (usually the 1st and 9th deciles, respectively), the error (using the 5th decile), the standard deviation of the prediction (using the 5th decile) and the predicted (learned) standard deviation from the interdecile range in the inference data frame. The latter includes all the individual inference realizations.
- Parameters
df_data (pandas data frame) – Data frame generated by current quantile inference experiments. Indices are hard coded to agree with current version. (The inference file usually has the name: <model>.predicted_INFER_QTL.tsv).
sigma_divisor (float) – Divisor to convert from the interdecile range to the corresponding standard deviation for a Gaussian distribution. (Default: 2.56, consistent with an interdecile range computed from the difference between the 9th and 1st deciles).
col_true (integer) – Index of the column in the data frame where the true value is stored (Default: index 4 in the current QTL format).
col_pred_start (integer) – Index of the column in the data frame where the first predicted value is stored. All the predicted values during inference are stored and are interspaced with the other percentile predictions (Default: index 6, step 3, in the current QTL format).
- Returns
Ytrue (numpy array) – Array with true (observed) values
Ypred (numpy array) – Array with predicted values (based on the 50th percentile).
yerror (numpy array) – Array with errors computed (observed - predicted).
sigma (numpy array) – Array with standard deviations learned with deep learning model. This corresponds to the interdecile range divided by the sigma divisor.
Ypred_std (numpy array) – Array with standard deviations computed from regular (homoscedastic) inference.
pred_name (string) – Name of the data column or quantity predicted (as extracted from the data frame using the col_true index).
Ypred_Lp_mean (numpy array) – Array with predicted values of the lower percentile (usually the 1st decile).
Ypred_Hp_mean (numpy array) – Array with predicted values of the higher percentile (usually the 9th decile).
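The relationship between the returned sigma and the decile predictions can be sketched as follows (synthetic arrays, not library code); for a Gaussian, the 1st-to-9th decile span is about 2 × 1.2816 ≈ 2.56 standard deviations, which is where the default sigma_divisor comes from.

```python
import numpy as np

sigma_divisor = 2.56                       # default from the signature above

# Hypothetical decile predictions (placeholders for Ypred_Lp_mean / Ypred_Hp_mean).
Ypred_Lp_mean = np.array([0.8, 1.1, 0.9])  # ~1st decile
Ypred_Hp_mean = np.array([1.6, 2.3, 1.7])  # ~9th decile

# Interdecile range converted to an (approximate) Gaussian standard deviation.
sigma = (Ypred_Hp_mean - Ypred_Lp_mean) / sigma_divisor
```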
uq_utils.fill_array(blocklist, maxsize, numdata, numblocks, blocksize)

Fills a new array of integers with the indices corresponding to the specified block structure.
- Parameters
blocklist (list) – List of integers describing the block indices that go into the array.
maxsize (int) – Maximum possible length for the partition (the common block size plus the remainder, if any).
numdata (int) – Total number of data points to distribute
numblocks (int) – Total number of blocks to distribute into
blocksize (int) – Size of data per block
- Returns
indexArray (int numpy array) – Indices for the specific data partition, resized to the correct length.
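Conceptually, the filling step walks the block list, computes each block's limits (see compute_limits above) and concatenates the corresponding index ranges; the sketch below is illustrative, not the library source.

```python
import numpy as np

def fill_array_sketch(blocklist, maxsize, numdata, numblocks, blocksize):
    """Illustrative sketch: gather the indices of the listed blocks."""
    indexArray = np.zeros(maxsize, dtype=np.int64)
    offset = 0
    for blockn in blocklist:
        # same limits as compute_limits(numdata, numblocks, blocksize, blockn)
        start = blockn * blocksize
        end = numdata if blockn == numblocks - 1 else start + blocksize
        indexArray[offset:offset + (end - start)] = np.arange(start, end)
        offset += end - start
    return indexArray[:offset]  # resize to the number of indices actually filled

# e.g. the partition made of blocks 0 and 2 out of 3 blocks over 10 points:
print(fill_array_sketch([0, 2], maxsize=10, numdata=10, numblocks=3, blocksize=3))
# -> [0 1 2 6 7 8 9]
```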
uq_utils.generate_index_distribution(numTrain, numTest, numValidation, params)

Generates a vector of indices to partition the data for training. NO CHECKING IS DONE: it is assumed that the data can be partitioned into the specified blocks and that the block indices describe a coherent partition.
- Parameters
numTrain (int) – Number of training data points
numTest (int) – Number of testing data points
numValidation (int) – Number of validation data points (may be zero)
params (dictionary with parameters) – Contains the keywords that control the behavior of the function (uq_train_fr, uq_valid_fr, uq_test_fr for fraction specification, uq_train_vec, uq_valid_vec, uq_test_vec for block list specification, and uq_train_bks, uq_valid_bks, uq_test_bks for block number specification)
- Returns
indexTrain (int numpy array) – Indices for data in training
indexValidation (int numpy array) – Indices for data in validation (if any)
indexTest (int numpy array) – Indices for data in testing (if merging)
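A hedged usage sketch, assuming the fraction keywords are used, the sizes are made up, and all three index arrays are returned (whether validation and testing indices are present depends on the parameters):

```python
import uq_utils  # assumes the module is importable under this name

# Fraction-based specification; block lists (uq_*_vec) or block counts
# (uq_*_bks) could be supplied instead, per the parameter description above.
params = {"uq_train_fr": 0.8, "uq_valid_fr": 0.1, "uq_test_fr": 0.1}

indexTrain, indexValidation, indexTest = uq_utils.generate_index_distribution(
    numTrain=8000, numTest=1000, numValidation=1000, params=params
)
```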
uq_utils.generate_index_distribution_from_block_list(numTrain, numTest, numValidation, params)

Generates a vector of indices to partition the data for training. NO CHECKING IS DONE: it is assumed that the data can be partitioned into the specified list of blocks and that the block indices describe a coherent partition.
- Parameters
numTrain (int) – Number of training data points
numTest (int) – Number of testing data points
numValidation (int) – Number of validation data points (may be zero)
params (dictionary with parameters) – Contains the keywords that control the behavior of the function (uq_train_vec, uq_valid_vec, uq_test_vec)
- Returns
indexTrain (int numpy array) – Indices for data in training
indexValidation (int numpy array) – Indices for data in validation (if any)
indexTest (int numpy array) – Indices for data in testing (if merging)
uq_utils.generate_index_distribution_from_blocks(numTrain, numTest, numValidation, params)

Generates a vector of indices to partition the data for training. NO CHECKING IS DONE: it is assumed that the data can be partitioned into the specified block quantities and that the block quantities describe a coherent partition.
- Parameters
numTrain (int) – Number of training data points
numTest (int) – Number of testing data points
numValidation (int) – Number of validation data points (may be zero)
params (dictionary with parameters) – Contains the keywords that control the behavior of the function (uq_train_bks, uq_valid_bks, uq_test_bks)
- Returns
indexTrain (int numpy array) – Indices for data in training
indexValidation (int numpy array) – Indices for data in validation (if any)
indexTest (int numpy array) – Indices for data in testing (if merging)
uq_utils.generate_index_distribution_from_fraction(numTrain, numTest, numValidation, params)

Generates a vector of indices to partition the data for training. It checks that the fractions provided are in (0, 1) and add up to 1.
- Parameters
numTrain (int) – Number of training data points
numTest (int) – Number of testing data points
numValidation (int) – Number of validation data points (may be zero)
params (dictionary with parameters) – Contains the keywords that control the behavior of the function (uq_train_fr, uq_valid_fr, uq_test_fr)
- Returns
indexTrain (int numpy array) – Indices for data in training
indexValidation (int numpy array) – Indices for data in validation (if any)
indexTest (int numpy array) – Indices for data in testing (if merging)
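The fraction check described above can be sketched as follows (an illustrative check, not the library source): each fraction must lie in (0, 1) and the three fractions must sum to 1 within a small tolerance.

```python
def check_fractions_sketch(fr_train, fr_valid, fr_test, tol=1e-6):
    """Illustrative validation of the uq_train_fr / uq_valid_fr / uq_test_fr values."""
    fractions = (fr_train, fr_valid, fr_test)
    if any(not (0.0 < f < 1.0) for f in fractions):
        raise ValueError("each fraction must lie in the open interval (0, 1)")
    if abs(sum(fractions) - 1.0) > tol:
        raise ValueError("fractions must add up to 1")

check_fractions_sketch(0.8, 0.1, 0.1)  # passes silently
```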
uq_utils.split_data_for_empirical_calibration(Ytrue, Ypred, sigma, cal_split=0.8)

Extracts a portion of the arrays provided for the computation of the calibration and reserves the remaining portion for testing.
- Parameters
Ytrue (numpy array) – Array with true (observed) values
Ypred (numpy array) – Array with predicted values.
sigma (numpy array) – Array with standard deviations learned with deep learning model (or std value computed from prediction if homoscedastic inference).
cal_split (float) – Split of the data to use for estimating the calibration relationship. It is assumed to be a value in (0, 1). (Default: use 80% of the predictions to generate the empirical calibration).
- Returns
index_perm_total (numpy array) – Random permutation of the array indices. The first ‘num_cal’ indices correspond to the samples used for calibration, while the remainder are the samples reserved for calibration testing.
pSigma_cal (numpy array) – Part of the input sigma array to use for calibration.
pSigma_test (numpy array) – Part of the input sigma array to reserve for testing.
pPred_cal (numpy array) – Part of the input Ypred array to use for calibration.
pPred_test (numpy array) – Part of the input Ypred array to reserve for testing.
true_cal (numpy array) – Part of the input Ytrue array to use for calibration.
true_test (numpy array) – Part of the input Ytrue array to reserve for testing.
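A hedged usage sketch tying this function to the calibration routine above; Ytrue, Ypred and sigma are synthetic placeholders standing for the arrays returned by one of the compute_statistics_* functions, and the seven values are assumed to be returned in the order listed above.

```python
import numpy as np
import uq_utils  # assumes the module is importable under this name

# Synthetic placeholders for the outputs of a compute_statistics_* call.
rng = np.random.default_rng(0)
sigma = rng.uniform(0.1, 1.0, size=1000)
Ytrue = rng.normal(0.0, 1.0, size=1000)
Ypred = Ytrue + rng.normal(0.0, sigma)

(index_perm_total,
 pSigma_cal, pSigma_test,
 pPred_cal, pPred_test,
 true_cal, true_test) = uq_utils.split_data_for_empirical_calibration(
    Ytrue, Ypred, sigma, cal_split=0.8
)

# The *_cal arrays feed compute_empirical_calibration_interpolation, while
# the *_test arrays are reserved for evaluating the resulting calibration.
```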