data_utils module

data_utils.convert_to_class(y_one_hot, dtype=<class 'int'>)[source]

Converts a one-hot class encoding (array with as many positions as total classes, with 1 in the corresponding class position, 0 in the other positions), or soft-max class encoding (array with as many positions as total classes, whose largest valued position is used as class membership) to an integer class encoding.

Parameters
  • y_one_hot (numpy array) – Input array with one-hot or soft-max class encoding.

  • dtype (data type) – Data type to use for the output numpy array. (Default: int, integer data is used to represent the class membership).

Returns

Returns a numpy array with an integer class encoding.
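
The conversion described above can be illustrated with a minimal sketch. It assumes the class is taken as the index of the largest value in each row; `convert_to_class_sketch` is a hypothetical stand-in written for illustration, not the module's own code.

```python
import numpy as np

def convert_to_class_sketch(y_one_hot, dtype=int):
    """Hypothetical stand-in: index of the largest value in each row."""
    return np.argmax(np.asarray(y_one_hot), axis=1).astype(dtype)

# One row per sample, one column per class.
one_hot = np.array([[0, 0, 1], [1, 0, 0]])
soft_max = np.array([[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]])

labels_from_one_hot = convert_to_class_sketch(one_hot)
labels_from_soft_max = convert_to_class_sketch(soft_max)
```

Both encodings reduce to the same operation, which is why a single function can accept either.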

data_utils.discretize_array(y, bins=5)[source]

Discretize values of given array.

Parameters
  • y (numpy array) – array to discretize.

  • bins (int) – Number of bins for distributing column values.

Returns

Returns an array with the bin number associated with each value in the original array.
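
One plausible way to bin an array is sketched below. It assumes the bin limits are equally spaced percentiles of the data, which this entry does not state explicitly; `discretize_array_sketch` is a hypothetical name used only for illustration.

```python
import numpy as np

def discretize_array_sketch(y, bins=5):
    # Assumption: bin limits are equally spaced percentiles of y.
    percentiles = [100.0 / bins * (i + 1) for i in range(bins - 1)]
    thresholds = [np.percentile(y, p) for p in percentiles]
    # np.digitize returns, for each value, the index of the bin it falls in.
    return np.digitize(y, thresholds)

y = np.arange(10, dtype=float)
binned = discretize_array_sketch(y, bins=2)
```

With two bins the single threshold is the median, so the lower half of the values maps to bin 0 and the upper half to bin 1.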

data_utils.discretize_dataframe(df, col, bins=2, cutoffs=None)[source]

Discretize values of given column in pandas dataframe.

Parameters
  • df (pandas dataframe) – dataframe to process.

  • col (int) – Index of column to bin.

  • bins (int) – Number of bins for distributing column values.

  • cutoffs (list) – List of bin limits. If None, the limits are computed as percentiles. (Default: None).

Returns

Returns the data frame with the values of the specified column binned, i.e. the values are replaced by the associated bin number.
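
The effect of explicit cutoffs can be shown with a small sketch. It assumes `col` is a positional column index and that values are assigned to bins with `np.digitize`; the function name and the sample columns are hypothetical.

```python
import numpy as np
import pandas as pd

def discretize_dataframe_sketch(df, col, bins=2, cutoffs=None):
    out = df.copy()                      # leave the caller's frame untouched
    name = out.columns[col]              # assumption: col is a positional index
    values = out[name].to_numpy()
    if cutoffs is None:
        # Fall back to equally spaced percentiles when no limits are given.
        percentiles = [100.0 / bins * (i + 1) for i in range(bins - 1)]
        cutoffs = [np.percentile(values, p) for p in percentiles]
    out[name] = np.digitize(values, cutoffs)
    return out

df = pd.DataFrame({"dose": [0.1, 0.4, 2.5, 7.0],
                   "response": [1.0, 0.9, 0.4, 0.1]})
binned = discretize_dataframe_sketch(df, col=0, cutoffs=[0.5, 5.0])
```

Two cutoffs produce three bins, so the number of bins actually used follows from the list of limits rather than from `bins`.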

data_utils.drop_impute_and_scale_dataframe(df, scaling='std', imputing='mean', dropna='all')[source]

Drop columns according to the dropna strategy, impute the remaining missing values, and scale the data in a pandas dataframe.

Parameters
  • df (pandas dataframe) – dataframe to process

  • scaling (string) – String describing type of scaling to apply. ‘maxabs’ [-1,1], ‘minmax’ [0,1], ‘std’, or None, optional (Default ‘std’)

  • imputing (string) – String describing type of imputation to apply. ‘mean’ replace missing values with mean value along the column, ‘median’ replace missing values with median value along the column, ‘most_frequent’ replace missing values with most frequent value along column (Default: ‘mean’).

  • dropna (string) – String describing strategy for handling missing values. ‘all’: drop a column if all of its values are NA. ‘any’: drop a column if any of its values is NA. (Default: ‘all’).

Returns

Returns the data frame after handling missing values and scaling.
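
The three stages (drop, impute, scale) can be sketched with plain pandas operations. This is an illustration under stated assumptions, not the module's implementation: the function name is hypothetical, ‘std’ is taken to use the population standard deviation, and constant columns (which would divide by zero) are not handled.

```python
import numpy as np
import pandas as pd

def drop_impute_and_scale_sketch(df, scaling="std", imputing="mean", dropna="all"):
    # 1. Drop columns that are entirely ('all') or partially ('any') missing.
    out = df.dropna(axis=1, how=dropna) if dropna else df.copy()
    # 2. Fill remaining gaps with a per-column statistic.
    if imputing == "mean":
        fill = out.mean()
    elif imputing == "median":
        fill = out.median()
    else:  # 'most_frequent'
        fill = out.mode().iloc[0]
    out = out.fillna(fill)
    # 3. Scale each column independently.
    if scaling == "std":
        out = (out - out.mean()) / out.std(ddof=0)
    elif scaling == "minmax":
        out = (out - out.min()) / (out.max() - out.min())
    elif scaling == "maxabs":
        out = out / out.abs().max()
    return out

df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [np.nan, np.nan, np.nan],
                   "c": [0.0, 5.0, 10.0]})
result = drop_impute_and_scale_sketch(df, scaling="minmax")
```

Column `b` is removed because every value is missing, while the single gap in `a` is filled with the column mean before scaling.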

data_utils.impute_and_scale_array(mat, scaling=None)[source]

Impute missing values with mean and scale data included in numpy array.

Parameters
  • mat (numpy array) – Array to scale

  • scaling (string) – String describing type of scaling to apply. Options recognized: ‘maxabs’, ‘minmax’, ‘std’. ‘maxabs’ : scales data to range [-1, 1]. ‘minmax’ : scales data to range [0, 1]. ‘std’ : scales data to normal variable with mean 0 and standard deviation 1. (Default: None, no scaling).

Returns

Returns the numpy array imputed with the mean value of the column and scaled by the method specified. If no scaling method is specified, it returns the imputed numpy array.
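
Column-mean imputation on a numpy array can be sketched as follows. The helper name is hypothetical, and missing values are assumed to be encoded as NaN.

```python
import numpy as np

def impute_with_column_mean(mat):
    mat = np.array(mat, dtype=float)           # work on a float copy
    col_means = np.nanmean(mat, axis=0)        # per-column mean, ignoring NaN
    rows, cols = np.where(np.isnan(mat))       # positions of missing entries
    mat[rows, cols] = col_means[cols]          # fill each gap with its column mean
    return mat

raw = np.array([[1.0, np.nan],
                [3.0, 4.0],
                [np.nan, 8.0]])
imputed = impute_with_column_mean(raw)
```

Scaling, if requested, would then be applied to the imputed array.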

data_utils.load_X_data(train_file, test_file, drop_cols=None, n_cols=None, shuffle=False, scaling=None, dtype=<class 'numpy.float32'>, seed=7102)[source]

Load training and testing unlabeled data from the files specified and construct corresponding training and testing pandas DataFrames. Columns to load can be selected or dropped. Order of rows can be shuffled. Data can be rescaled. Training and testing partitions (coming from the respective files) are preserved. This function assumes that the files contain a header with column names.

Parameters
  • train_file (filename) – Name of the file to load the training data.

  • test_file (filename) – Name of the file to load the testing data.

  • drop_cols (list) – List of column names to drop from the files being loaded. (Default: None, all the columns are used).

  • n_cols (integer) – Number of columns to load from the files. (Default: None, all the columns are used).

  • shuffle (boolean) – Boolean flag to indicate row shuffling. If True the rows are re-ordered, if False the order in which rows are read is preserved. (Default: False, no permutation of the loading row order).

  • scaling (string) – String describing type of scaling to apply. Options recognized: ‘maxabs’, ‘minmax’, ‘std’. ‘maxabs’ : scales data to range [-1, 1]. ‘minmax’ : scales data to range [0, 1]. ‘std’ : scales data to normal variable with mean 0 and standard deviation 1. (Default: None, no scaling).

  • dtype (data type) – Data type to use for the output pandas DataFrames. (Default: DEFAULT_DATATYPE defined in default_utils).

  • seed (int) – Value to initialize or re-seed the generator. (Default: DEFAULT_SEED defined in default_utils).

Returns

  • X_train (pandas DataFrame) – Data for training loaded in a pandas DataFrame and pre-processed as specified.

  • X_test (pandas DataFrame) – Data for testing loaded in a pandas DataFrame and pre-processed as specified.
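
The loading steps described for this function (read with header, drop columns, limit the column count, optionally shuffle) can be approximated with pandas alone. The sketch below is an assumption-laden illustration: `load_X_sketch` is a hypothetical name, `n_cols` is assumed to keep the first columns, scaling is omitted, and in-memory text stands in for files.

```python
import io
import pandas as pd

def load_X_sketch(train_file, test_file, drop_cols=None, n_cols=None,
                  shuffle=False, seed=7102):
    frames = []
    for source in (train_file, test_file):
        df = pd.read_csv(source)               # header row supplies column names
        if drop_cols:
            df = df.drop(columns=drop_cols)
        if n_cols:
            df = df.iloc[:, :n_cols]           # assumption: keep the first n_cols
        if shuffle:
            df = df.sample(frac=1, random_state=seed).reset_index(drop=True)
        frames.append(df.astype("float32"))
    return frames[0], frames[1]

train_csv = io.StringIO("id,f1,f2\n1,0.5,2.0\n2,1.5,4.0\n3,2.5,6.0\n")
test_csv = io.StringIO("id,f1,f2\n4,3.5,8.0\n")
X_train, X_test = load_X_sketch(train_csv, test_csv, drop_cols=["id"])
```

The two partitions are processed separately here, so rows never move between the training and testing sets.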

data_utils.load_X_data2(train_file, test_file, drop_cols=None, n_cols=None, shuffle=False, scaling=None, validation_split=0.1, dtype=<class 'numpy.float32'>, seed=7102)[source]

Load training and testing unlabeled data from the files specified. Further split training data into training and validation partitions, and construct corresponding training, validation and testing pandas DataFrames. Columns to load can be selected or dropped. Order of rows can be shuffled. Data can be rescaled. Training and testing partitions (coming from the respective files) are preserved, but training is split into training and validation partitions. This function assumes that the files contain a header with column names.

Parameters
  • train_file (filename) – Name of the file to load the training data.

  • test_file (filename) – Name of the file to load the testing data.

  • drop_cols (list) – List of column names to drop from the files being loaded. (Default: None, all the columns are used).

  • n_cols (integer) – Number of columns to load from the files. (Default: None, all the columns are used).

  • shuffle (boolean) – Boolean flag to indicate row shuffling. If True the rows are re-ordered, if False the order in which rows are read is preserved. (Default: False, no permutation of the loading row order).

  • scaling (string) – String describing type of scaling to apply. Options recognized: ‘maxabs’, ‘minmax’, ‘std’. ‘maxabs’ : scales data to range [-1, 1]. ‘minmax’ : scales data to range [0, 1]. ‘std’ : scales data to normal variable with mean 0 and standard deviation 1. (Default: None, no scaling).

  • validation_split (float) – Fraction of training data to set aside for validation. (Default: 0.1, ten percent of the training data is used for the validation partition).

  • dtype (data type) – Data type to use for the output pandas DataFrames. (Default: DEFAULT_DATATYPE defined in default_utils).

  • seed (int) – Value to initialize or re-seed the generator. (Default: DEFAULT_SEED defined in default_utils).

Returns

  • X_train (pandas DataFrame) – Data for training loaded in a pandas DataFrame and pre-processed as specified.

  • X_val (pandas DataFrame) – Data for validation loaded in a pandas DataFrame and pre-processed as specified.

  • X_test (pandas DataFrame) – Data for testing loaded in a pandas DataFrame and pre-processed as specified.
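
How `validation_split` carves a validation partition out of the training data can be sketched as below. Which rows end up in the validation partition is not stated in this entry; the sketch assumes the trailing rows are used, and `split_train_validation` is a hypothetical helper.

```python
import numpy as np

def split_train_validation(X, validation_split=0.1):
    n_val = int(round(len(X) * validation_split))   # rows set aside for validation
    n_train = len(X) - n_val
    # Assumption: the last n_val rows form the validation partition.
    return X[:n_train], X[n_train:]

X = np.arange(40, dtype=np.float32).reshape(20, 2)
X_train, X_val = split_train_validation(X, validation_split=0.1)
```

With 20 training rows and the default fraction of 0.1, 18 rows remain for training and 2 are held out.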

data_utils.load_Xy_data2(train_file, test_file, class_col=None, drop_cols=None, n_cols=None, shuffle=False, scaling=None, validation_split=0.1, dtype=<class 'numpy.float32'>, seed=7102)[source]

Load training and testing data from the files specified, with a column indicated to use as label. Further split training data into training and validation partitions, and construct corresponding training, validation and testing pandas DataFrames, separated into data (i.e. features) and labels. Labels to output can be integer labels (for classification) or continuous labels (for regression). Columns to load can be selected or dropped. Order of rows can be shuffled. Data can be rescaled. Training and testing partitions (coming from the respective files) are preserved, but training is split into training and validation partitions. This function assumes that the files contain a header with column names.

Parameters
  • train_file (filename) – Name of the file to load the training data.

  • test_file (filename) – Name of the file to load the testing data.

  • class_col (integer) – Index of the column to use as the label. (Default: None, this would cause the function to fail, a label has to be indicated at calling).

  • drop_cols (list) – List of column names to drop from the files being loaded. (Default: None, all the columns are used).

  • n_cols (integer) – Number of columns to load from the files. (Default: None, all the columns are used).

  • shuffle (boolean) – Boolean flag to indicate row shuffling. If True the rows are re-ordered, if False the order in which rows are loaded is preserved. (Default: False, no permutation of the loading row order).

  • scaling (string) – String describing type of scaling to apply. Options recognized: ‘maxabs’, ‘minmax’, ‘std’. ‘maxabs’ : scales data to range [-1, 1]. ‘minmax’ : scales data to range [0, 1]. ‘std’ : scales data to normal variable with mean 0 and standard deviation 1. (Default: None, no scaling).

  • validation_split (float) – Fraction of training data to set aside for validation. (Default: 0.1, ten percent of the training data is used for the validation partition).

  • dtype (data type) – Data type to use for the output pandas DataFrames. (Default: DEFAULT_DATATYPE defined in default_utils).

  • seed (int) – Value to initialize or re-seed the generator. (Default: DEFAULT_SEED defined in default_utils).

Returns

  • X_train (pandas DataFrame) – Data features for training loaded in a pandas DataFrame and pre-processed as specified.

  • y_train (pandas DataFrame) – Data labels for training loaded in a pandas DataFrame.

  • X_val (pandas DataFrame) – Data features for validation loaded in a pandas DataFrame and pre-processed as specified.

  • y_val (pandas DataFrame) – Data labels for validation loaded in a pandas DataFrame.

  • X_test (pandas DataFrame) – Data features for testing loaded in a pandas DataFrame and pre-processed as specified.

  • y_test (pandas DataFrame) – Data labels for testing loaded in a pandas DataFrame.
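
Separating a label column, identified by its index, from the feature columns might look like the sketch below. The helper name and the sample columns are hypothetical, and `class_col` is assumed to be a positional index into the loaded columns.

```python
import pandas as pd

def split_features_and_label(df, class_col):
    label_name = df.columns[class_col]       # positional index -> column name
    y = df[[label_name]]                     # labels kept as a one-column DataFrame
    X = df.drop(columns=[label_name])        # everything else is a feature
    return X, y

df = pd.DataFrame({"label": [0, 1, 1],
                   "f1": [0.2, 0.4, 0.6],
                   "f2": [1.0, 2.0, 3.0]})
X, y = split_features_and_label(df, class_col=0)
```

The labels are left untouched in this sketch, so the same split serves integer labels for classification and continuous labels for regression.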

data_utils.load_Xy_data_noheader(train_file, test_file, classes, usecols=None, scaling=None, dtype=<class 'numpy.float32'>)[source]

Load training and testing data from the files specified, with the first column to use as label. Construct corresponding training and testing pandas DataFrames, separated into data (i.e. features) and labels. Labels to output are one-hot encoded (categorical). Columns to load can be selected. Data can be rescaled. Training and testing partitions (coming from the respective files) are preserved. This function assumes that the files do not contain a header.

Parameters
  • train_file (filename) – Name of the file to load the training data.

  • test_file (filename) – Name of the file to load the testing data.

  • classes (integer) – Number of total classes to consider when building the categorical (one-hot) label encoding.

  • usecols (list) – List of column indices to load from the files. (Default: None, all the columns are used).

  • scaling (string) – String describing type of scaling to apply. Options recognized: ‘maxabs’, ‘minmax’, ‘std’. ‘maxabs’ : scales data to range [-1, 1]. ‘minmax’ : scales data to range [0, 1]. ‘std’ : scales data to normal variable with mean 0 and standard deviation 1. (Default: None, no scaling).

  • dtype (data type) – Data type to use for the output pandas DataFrames. (Default: DEFAULT_DATATYPE defined in default_utils).

Returns

  • X_train (pandas DataFrame) – Data features for training loaded in a pandas DataFrame and pre-processed as specified.

  • Y_train (pandas DataFrame) – Data labels for training loaded in a pandas DataFrame. One-hot encoding (categorical) is used.

  • X_test (pandas DataFrame) – Data features for testing loaded in a pandas DataFrame and pre-processed as specified.

  • Y_test (pandas DataFrame) – Data labels for testing loaded in a pandas DataFrame. One-hot encoding (categorical) is used.
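
Reading a headerless file whose first column holds the label, and one-hot encoding that label, can be sketched as follows. The function name is hypothetical, labels are assumed to be integers in the range 0 to classes - 1, and in-memory text stands in for a file.

```python
import io
import numpy as np
import pandas as pd

def load_noheader_sketch(source, classes, usecols=None):
    df = pd.read_csv(source, header=None, usecols=usecols)   # no header row
    labels = df.iloc[:, 0].to_numpy().astype(int)            # first column is the label
    Y = np.eye(classes, dtype=np.float32)[labels]            # one row of the identity per label
    X = df.iloc[:, 1:].astype(np.float32)                    # remaining columns are features
    return X, pd.DataFrame(Y)

csv_text = io.StringIO("0,1.5,2.5\n2,3.5,4.5\n1,5.5,6.5\n")
X, Y = load_noheader_sketch(csv_text, classes=3)
```

Passing `classes` explicitly fixes the width of the label matrix even when a partition does not contain every class.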

data_utils.load_Xy_one_hot_data(train_file, test_file, class_col=None, drop_cols=None, n_cols=None, shuffle=False, scaling=None, dtype=<class 'numpy.float32'>, seed=7102)[source]

Load training and testing data from the files specified, with a column indicated to use as label. Construct corresponding training and testing pandas DataFrames, separated into data (i.e. features) and labels. Labels to output are one-hot encoded (categorical). Columns to load can be selected or dropped. Order of rows can be shuffled. Data can be rescaled. Training and testing partitions (coming from the respective files) are preserved. This function assumes that the files contain a header with column names.

Parameters
  • train_file (filename) – Name of the file to load the training data.

  • test_file (filename) – Name of the file to load the testing data.

  • class_col (integer) – Index of the column to use as the label. (Default: None, this would cause the function to fail, a label has to be indicated at calling).

  • drop_cols (list) – List of column names to drop from the files being loaded. (Default: None, all the columns are used).

  • n_cols (integer) – Number of columns to load from the files. (Default: None, all the columns are used).

  • shuffle (boolean) – Boolean flag to indicate row shuffling. If True the rows are re-ordered, if False the order in which rows are read is preserved. (Default: False, no permutation of the loading row order).

  • scaling (string) – String describing type of scaling to apply. Options recognized: ‘maxabs’, ‘minmax’, ‘std’. ‘maxabs’ : scales data to range [-1, 1]. ‘minmax’ : scales data to range [0, 1]. ‘std’ : scales data to normal variable with mean 0 and standard deviation 1. (Default: None, no scaling).

  • dtype (data type) – Data type to use for the output pandas DataFrames. (Default: DEFAULT_DATATYPE defined in default_utils).

  • seed (int) – Value to initialize or re-seed the generator. (Default: DEFAULT_SEED defined in default_utils).

Returns

  • X_train (pandas DataFrame) – Data features for training loaded in a pandas DataFrame and pre-processed as specified.

  • y_train (pandas DataFrame) – Data labels for training loaded in a pandas DataFrame. One-hot encoding (categorical) is used.

  • X_test (pandas DataFrame) – Data features for testing loaded in a pandas DataFrame and pre-processed as specified.

  • y_test (pandas DataFrame) – Data labels for testing loaded in a pandas DataFrame. One-hot encoding (categorical) is used.

data_utils.load_Xy_one_hot_data2(train_file, test_file, class_col=None, drop_cols=None, n_cols=None, shuffle=False, scaling=None, validation_split=0.1, dtype=<class 'numpy.float32'>, seed=7102)[source]

Load training and testing data from the files specified, with a column indicated to use as label. Further split training data into training and validation partitions, and construct corresponding training, validation and testing pandas DataFrames, separated into data (i.e. features) and labels. Labels to output are one-hot encoded (categorical). Columns to load can be selected or dropped. Order of rows can be shuffled. Data can be rescaled. Training and testing partitions (coming from the respective files) are preserved, but training is split into training and validation partitions. This function assumes that the files contain a header with column names.

Parameters
  • train_file (filename) – Name of the file to load the training data.

  • test_file (filename) – Name of the file to load the testing data.

  • class_col (integer) – Index of the column to use as the label. (Default: None, this would cause the function to fail, a label has to be indicated at calling).

  • drop_cols (list) – List of column names to drop from the files being loaded. (Default: None, all the columns are used).

  • n_cols (integer) – Number of columns to load from the files. (Default: None, all the columns are used).

  • shuffle (boolean) – Boolean flag to indicate row shuffling. If True the rows are re-ordered, if False the order in which rows are loaded is preserved. (Default: False, no permutation of the loading row order).

  • scaling (string) – String describing type of scaling to apply. Options recognized: ‘maxabs’, ‘minmax’, ‘std’. ‘maxabs’ : scales data to range [-1, 1]. ‘minmax’ : scales data to range [0, 1]. ‘std’ : scales data to normal variable with mean 0 and standard deviation 1. (Default: None, no scaling).

  • validation_split (float) – Fraction of training data to set aside for validation. (Default: 0.1, ten percent of the training data is used for the validation partition).

  • dtype (data type) – Data type to use for the output pandas DataFrames. (Default: DEFAULT_DATATYPE defined in default_utils).

  • seed (int) – Value to initialize or re-seed the generator. (Default: DEFAULT_SEED defined in default_utils).

Returns

  • X_train (pandas DataFrame) – Data features for training loaded in a pandas DataFrame and pre-processed as specified.

  • y_train (pandas DataFrame) – Data labels for training loaded in a pandas DataFrame. One-hot encoding (categorical) is used.

  • X_val (pandas DataFrame) – Data features for validation loaded in a pandas DataFrame and pre-processed as specified.

  • y_val (pandas DataFrame) – Data labels for validation loaded in a pandas DataFrame. One-hot encoding (categorical) is used.

  • X_test (pandas DataFrame) – Data features for testing loaded in a pandas DataFrame and pre-processed as specified.

  • y_test (pandas DataFrame) – Data labels for testing loaded in a pandas DataFrame. One-hot encoding (categorical) is used.

data_utils.load_csv_data(train_path, test_path=None, sep=', ', nrows=None, x_cols=None, y_cols=None, drop_cols=None, onehot_cols=None, n_cols=None, random_cols=False, shuffle=False, scaling=None, dtype=None, validation_split=None, return_dataframe=True, return_header=False, seed=7102)[source]

Load data from the files specified. Columns corresponding to data features and labels can be specified. A one-hot encoding can be used for either features or labels. If validation_split is specified, training data is further split into training and validation partitions. pandas DataFrames are used to load and pre-process the data. If specified, those DataFrames are returned. Otherwise just values are returned. Labels to output can be integer labels (for classification) or continuous labels (for regression). Columns to load can be specified, randomly selected or a subset can be dropped. Order of rows can be shuffled. Data can be rescaled. This function assumes that the files contain a header with column names.

Parameters
  • train_path (filename) – Name of the file to load the training data.

  • test_path (filename) – Name of the file to load the testing data. (Optional).

  • sep (character) – Character used as column separator. (Default: ‘,’, comma separated values).

  • nrows (integer) – Number of rows to load from the files. (Default: None, all the rows are used).

  • x_cols (list) – List of columns to use as features. (Default: None).

  • y_cols (list) – List of columns to use as labels. (Default: None).

  • drop_cols (list) – List of columns to drop from the files being loaded. (Default: None, all the columns are used).

  • onehot_cols (list) – List of columns to one-hot encode. (Default: None).

  • n_cols (integer) – Number of columns to load from the files. (Default: None).

  • random_cols (boolean) – Boolean flag to indicate random selection of columns. If True a number of n_cols columns is randomly selected, if False the specified columns are used. (Default: False).

  • shuffle (boolean) – Boolean flag to indicate row shuffling. If True the rows are re-ordered, if False the order in which rows are read is preserved. (Default: False, no permutation of the loading row order).

  • scaling (string) – String describing type of scaling to apply. Options recognized: ‘maxabs’, ‘minmax’, ‘std’. ‘maxabs’ : scales data to range [-1, 1]. ‘minmax’ : scales data to range [0, 1]. ‘std’ : scales data to normal variable with mean 0 and standard deviation 1. (Default: None, no scaling).

  • dtype (data type) – Data type to use for the output pandas DataFrames. (Default: None).

  • validation_split (float) – Fraction of training data to set aside for validation. (Default: None, no validation partition is constructed).

  • return_dataframe (boolean) – Boolean flag to indicate that the pandas DataFrames used for data pre-processing are to be returned. (Default: True, pandas DataFrames are returned).

  • return_header (boolean) – Boolean flag to indicate if the column headers are to be returned. (Default: False, no column headers are separately returned).

  • seed (int) – Value to initialize or re-seed the generator. (Default: DEFAULT_SEED defined in default_utils).

Returns

Tuples of data features and labels are returned, for train, validation and testing partitions, together with the column names (headers). The specific objects to return depend on the options selected.

data_utils.lookup(df, query, ret, keys, match='match')[source]

Dataframe lookup.

Parameters
  • df (pandas dataframe) – dataframe for retrieving values.

  • query (string) – String for searching.

  • ret (int/string or list) – Names or indices of columns to be returned.

  • keys (list) – List of strings or integers specifying the names or indices of columns to look into.

  • match (string) – String describing strategy for matching keys to query.

Returns

Returns a list of the values in the dataframe whose columns match the specified query and have been selected to be returned.
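
The general shape of such a lookup is sketched below. Only one matching strategy is shown, exact and case-sensitive equality, which is an assumption; the strategies accepted by `match` are not listed in this entry. `lookup_sketch` and the sample columns are hypothetical.

```python
import pandas as pd

def lookup_sketch(df, query, ret, keys):
    # Build a row mask that is True when any key column equals the query.
    mask = pd.Series(False, index=df.index)
    for key in keys:
        mask |= df[key] == query
    # Collect the requested column(s) for the matching rows as a flat list.
    return df.loc[mask, ret].to_numpy().ravel().tolist()

df = pd.DataFrame({"id": ["A1", "B2", "C3"],
                   "alias": ["x", "A1", "z"],
                   "value": [10, 20, 30]})
matches = lookup_sketch(df, "A1", ret="value", keys=["id", "alias"])
```

A row is selected when the query appears in any of the key columns, so the first two rows match here.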

data_utils.scale_array(mat, scaling=None)[source]

Scale data included in numpy array.

Parameters
  • mat (numpy array) – Array to scale

  • scaling (string) – String describing type of scaling to apply. Options recognized: ‘maxabs’, ‘minmax’, ‘std’. ‘maxabs’ : scales data to range [-1, 1]. ‘minmax’ : scales data to range [0, 1]. ‘std’ : scales data to normal variable with mean 0 and standard deviation 1. (Default: None, no scaling).

Returns

Returns the numpy array scaled by the method specified. If no scaling method is specified, it returns the numpy array unmodified.
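
The three scaling options can be written out directly in numpy to make their output ranges concrete. This is a sketch under stated assumptions rather than the module's code: scaling is applied per column, ‘std’ uses the population standard deviation, constant columns are not handled, and rejecting unknown option strings is a choice made only for this sketch.

```python
import numpy as np

def scale_array_sketch(mat, scaling=None):
    mat = np.asarray(mat, dtype=float)
    if scaling is None:
        return mat                                   # no scaling requested
    if scaling == "maxabs":
        return mat / np.abs(mat).max(axis=0)         # result lies in [-1, 1]
    if scaling == "minmax":
        lo, hi = mat.min(axis=0), mat.max(axis=0)
        return (mat - lo) / (hi - lo)                # result lies in [0, 1]
    if scaling == "std":
        return (mat - mat.mean(axis=0)) / mat.std(axis=0)   # mean 0, std 1
    raise ValueError(f"unrecognized scaling: {scaling!r}")

mat = np.array([[-2.0, 1.0],
                [0.0, 3.0],
                [4.0, 5.0]])
maxabs = scale_array_sketch(mat, "maxabs")
minmax = scale_array_sketch(mat, "minmax")
std = scale_array_sketch(mat, "std")
```

Note that ‘maxabs’ preserves the sign of the data and does not shift it, whereas ‘minmax’ always maps the column minimum to 0 and the maximum to 1.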

data_utils.to_categorical(y, num_classes=None)[source]

Converts a class vector (integers) to a binary class matrix, e.g. for use with categorical_crossentropy.

Parameters
  • y – class vector to be converted into a matrix (integers from 0 to num_classes - 1).

  • num_classes – total number of classes.

Returns

A binary matrix representation of the input. The classes axis is placed last.
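
The encoding can be reproduced in a few lines of numpy. The sketch below is illustrative only: `to_categorical_sketch` is a hypothetical name, the input is assumed to be a one-dimensional vector of non-negative integers, and the float32 output type is an assumption.

```python
import numpy as np

def to_categorical_sketch(y, num_classes=None):
    y = np.asarray(y, dtype=int).ravel()
    if num_classes is None:
        num_classes = int(y.max()) + 1               # infer from the largest label
    out = np.zeros((y.size, num_classes), dtype=np.float32)
    out[np.arange(y.size), y] = 1                    # set one position per row
    return out

labels = [0, 2, 1]
encoded = to_categorical_sketch(labels)
decoded = np.argmax(encoded, axis=1)                 # inverse of the encoding
```

Taking the argmax along the last axis recovers the original integer labels, which is the operation performed by convert_to_class.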