feature_selection_utils module

feature_selection_utils.select_decorrelated_features(data, method='pearson', threshold=None, random_seed=None)

This function selects features whose mutual absolute correlation coefficients are smaller than a threshold. Missing values are allowed in data. The correlation coefficient of two features is calculated from the observations that are present in both features. Features with only one or no value present and features with a zero standard deviation are not considered for selection.

data: numpy array or pandas data frame of numeric values, with a shape of [n_samples, n_features].

method: string indicating the method used for calculating the correlation coefficient. ‘pearson’ indicates the Pearson correlation coefficient; ‘kendall’ indicates the Kendall Tau correlation coefficient; ‘spearman’ indicates the Spearman rank correlation coefficient. Default is ‘pearson’.

threshold: float. If two features have an absolute correlation coefficient higher than threshold, one of the features is removed. If threshold is None, a feature is removed only when the two features are exactly identical. Default is None.

random_seed: positive integer, seed of the random generator for ordering the features. If it is None, features are not re-ordered before feature selection and thus the first feature is always selected. Default is None.

indices: 1-D numpy array containing the indices of selected features.
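
The behaviour described above can be illustrated with a minimal sketch. It is not the module's implementation: the helper name `decorrelated_sketch`, the greedy keep-or-drop order, and the restriction to the Pearson coefficient with a numeric threshold are all assumptions made for illustration. It shows how a correlation can be computed only on rows observed in both features, and how constant or nearly empty features are excluded up front.

```python
import numpy as np

def decorrelated_sketch(data, threshold):
    """Greedy selection sketch (hypothetical helper, not the module's code)."""
    data = np.asarray(data, dtype=float)
    present = ~np.isnan(data)

    # Exclude features with fewer than two observed values or zero spread.
    candidates = []
    for j in range(data.shape[1]):
        values = data[present[:, j], j]
        if values.size >= 2 and np.std(values) > 0:
            candidates.append(j)

    selected = []
    for j in candidates:
        keep = True
        for k in selected:
            both = present[:, j] & present[:, k]  # rows observed in both features
            if both.sum() < 2:
                continue  # too few shared rows to estimate a correlation
            x, y = data[both, j], data[both, k]
            if np.std(x) == 0 or np.std(y) == 0:
                continue  # correlation is undefined on the shared rows
            r = np.corrcoef(x, y)[0, 1]  # Pearson coefficient
            if abs(r) > threshold:
                keep = False
                break
        if keep:
            selected.append(j)
    return np.array(selected, dtype=int)

data = np.array([
    [1.0, 2.0, 5.0, 7.0],
    [2.0, 4.0, 1.0, 7.0],
    [3.0, np.nan, 4.0, 7.0],
    [4.0, 8.0, 2.0, 7.0],
    [5.0, 10.0, 3.0, 7.0],
])
indices = decorrelated_sketch(data, threshold=0.9)
print(indices)
```

In this toy input, feature 1 is a multiple of feature 0 on the rows they share, and feature 3 is constant, so only features 0 and 2 remain. Because the first candidate is always kept, the result depends on feature order, which is the situation the random_seed parameter is documented to address.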

feature_selection_utils.select_features_by_missing_values(data, threshold=0.1)

This function returns the indices of the features whose missing rates are smaller than the threshold.

data: numpy array or pandas data frame of numeric values, with a shape of [n_samples, n_features].

threshold: float in the range of [0, 1]. Features with a missing rate smaller than threshold will be selected. Default is 0.1.

indices: 1-D numpy array containing the indices of selected features.
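
As a rough illustration of the selection rule, the sketch below computes a per-feature missing rate with plain NumPy and keeps the features whose rate is strictly below the threshold. The helper name `missing_rate_sketch` is hypothetical, and treating NaN as the missing-value marker is an assumption; the module's own code may differ.

```python
import numpy as np

def missing_rate_sketch(data, threshold=0.1):
    """Indices of features whose fraction of NaN entries is below threshold."""
    data = np.asarray(data, dtype=float)       # also accepts a numeric data frame
    missing_rate = np.isnan(data).mean(axis=0)  # fraction of NaN per column
    return np.where(missing_rate < threshold)[0]

data = np.array([
    [1.0, np.nan, 3.0],
    [2.0, np.nan, np.nan],
    [3.0, 6.0, 5.0],
    [4.0, 8.0, 6.0],
])
indices_default = missing_rate_sketch(data)               # threshold=0.1
indices_loose = missing_rate_sketch(data, threshold=0.3)
print(indices_default, indices_loose)
```

The three columns have missing rates of 0, 0.5 and 0.25, so the default threshold keeps only the first feature, while a threshold of 0.3 also keeps the third.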

feature_selection_utils.select_features_by_variation(data, variation_measure='var', threshold=None, portion=None, draw_histogram=False, bins=100, log=False)

This function evaluates the variations of individual features and returns the indices of features with large variations. Missing values are ignored in evaluating variation.

data: numpy array or pandas data frame of numeric values, with a shape of [n_samples, n_features].

variation_measure: string indicating the metric used for evaluating feature variation. ‘var’ indicates variance; ‘std’ indicates standard deviation; ‘mad’ indicates median absolute deviation. Default is ‘var’.

threshold: float. Features with a variation larger than threshold will be selected. Default is None.

portion: float in the range of [0, 1]. It is the portion of features to be selected based on variation. The number of selected features will be the smaller of int(portion * n_features) and the total number of features with non-missing variations. Default is None. threshold and portion cannot both be set to a numeric value at the same time.

draw_histogram: boolean, whether to draw a histogram of feature variations. Default is False.

bins: positive integer, the number of bins in the histogram. Default is the smaller of 50 and the number of features with non-missing variations.

log: boolean, indicating whether the histogram should be drawn on log scale. Default is False.

indices: 1-D numpy array containing the indices of selected features. If both threshold and portion are None, indices will be an empty array.
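
The interplay of threshold and portion can be sketched as follows. This is an illustration under stated assumptions rather than the module's code: the helper name `variation_sketch` is hypothetical, only the variance measure is covered, the population variance (`ddof=0`) is used, an error is raised when both arguments are given, and the selected indices are returned in ascending order. Histogram drawing is omitted.

```python
import warnings
import numpy as np

def variation_sketch(data, threshold=None, portion=None):
    """Select features by variance, ignoring NaN (hypothetical helper)."""
    if threshold is not None and portion is not None:
        raise ValueError("set either threshold or portion, not both")
    data = np.asarray(data, dtype=float)

    # nanvar warns on all-NaN columns and returns NaN for them; silence the warning.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", RuntimeWarning)
        variation = np.nanvar(data, axis=0)

    valid = np.where(~np.isnan(variation))[0]  # features with a defined variation
    if threshold is not None:
        return valid[variation[valid] > threshold]
    if portion is not None:
        n_select = min(int(portion * data.shape[1]), valid.size)
        order = valid[np.argsort(-variation[valid], kind="stable")]
        return np.sort(order[:n_select])       # the n_select most variable features
    return np.array([], dtype=int)             # neither criterion given

data = np.array([
    [1.0, 10.0, 0.0, 1.0],
    [2.0, 10.0, 10.0, 1.0],
    [3.0, 10.0, np.nan, 2.0],
    [4.0, 10.0, 20.0, 2.0],
])
by_threshold = variation_sketch(data, threshold=1.0)
by_portion = variation_sketch(data, portion=0.25)
neither = variation_sketch(data)
print(by_threshold, by_portion, neither)
```

Here the column variances are 1.25, 0, about 66.7 and 0.25, so a threshold of 1.0 keeps features 0 and 2, while a portion of 0.25 keeps int(0.25 * 4) = 1 feature, the most variable one.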