P1_utils module

P1_utils.aprior(gamma_hat)
P1_utils.bprior(gamma_hat)
P1_utils.calculate_concordance_correlation_coefficient(u, v)

This function calculates the concordance correlation coefficient between two input 1-D numpy arrays.

u: 1-D numpy array of a variable.

v: 1-D numpy array of a variable.

ccc: a numeric value of concordance correlation coefficient between the two input variables.
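
For reference, here is a minimal sketch of the concordance correlation coefficient using Lin's standard definition, 2*cov(u, v) / (var(u) + var(v) + (mean(u) - mean(v))^2). The helper name `ccc_sketch` is hypothetical, and the use of population (1/n) moments is an assumption; the module's own implementation may differ in such details.

```python
import numpy as np

def ccc_sketch(u, v):
    """Concordance correlation coefficient of two 1-D arrays (illustrative sketch)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    mean_u, mean_v = u.mean(), v.mean()
    # Population (1/n) covariance and variances; assumed, not taken from P1_utils.
    cov_uv = np.mean((u - mean_u) * (v - mean_v))
    var_u = np.mean((u - mean_u) ** 2)
    var_v = np.mean((v - mean_v) ** 2)
    return 2.0 * cov_uv / (var_u + var_v + (mean_u - mean_v) ** 2)

u = np.array([1.0, 2.0, 3.0])
print(ccc_sketch(u, u))        # identical arrays agree perfectly
print(ccc_sketch(u, u + 1.0))  # a constant offset lowers agreement
```

Unlike the Pearson correlation coefficient, this measure penalizes differences in location and scale, which is why the shifted copy scores below 1.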

P1_utils.combat_batch_effect_removal(data, batch_labels, model=None, numerical_covariates=None)

This function corrects for batch effect in data.

data: pandas data frame of numeric values, with a size of (n_features, n_samples).

batch_labels: pandas series, with a length of n_samples. It should provide the batch labels of samples. Its indices are the same as the column names (sample names) in “data”.

model: an object of patsy.design_info.DesignMatrix. It is a design matrix describing the covariate information on the samples that could cause batch effects. If not provided, this function will attempt to coarsely correct based only on the information provided in “batch_labels”.

numerical_covariates: a list of the names of covariates in “model” that are numerical rather than categorical.

corrected: pandas data frame of numeric values, with a size of (n_features, n_samples). It is the data with batch effects corrected.
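
The input layout described above can be sketched as follows. The data are synthetic, and `naive_batch_centering` is a hypothetical stand-in that only re-centers each batch on the overall feature mean. It is not the ComBat procedure and is included only so that the example runs without P1_utils.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
samples = ["s1", "s2", "s3", "s4", "s5", "s6"]
features = ["g1", "g2", "g3", "g4"]

# data has shape (n_features, n_samples); its columns are the sample names.
data = pd.DataFrame(rng.normal(size=(4, 6)), index=features, columns=samples)
# batch_labels is indexed by the same sample names as the columns of data.
batch_labels = pd.Series(["a", "a", "a", "b", "b", "b"], index=samples)
# Simulate a batch effect by shifting every feature in batch "b".
data[["s4", "s5", "s6"]] += 2.0

def naive_batch_centering(data, batch_labels):
    """Hypothetical stand-in: move each batch's feature means to the overall means."""
    if list(data.columns) != list(batch_labels.index):
        raise ValueError("batch_labels must be indexed by the columns of data")
    grand_mean = data.mean(axis=1)
    corrected = data.copy()
    for batch in batch_labels.unique():
        cols = batch_labels[batch_labels == batch].index
        batch_mean = data[cols].mean(axis=1)
        corrected[cols] = data[cols].sub(batch_mean, axis=0).add(grand_mean, axis=0)
    return corrected

corrected = naive_batch_centering(data, batch_labels)
print(corrected.shape)  # same (n_features, n_samples) layout as the input
```

With P1_utils available, the same `data` and `batch_labels` objects would presumably be passed as the first two arguments of combat_batch_effect_removal.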

P1_utils.coxen_multi_drug_gene_selection(source_data, target_data, drug_response_data, drug_response_col, tumor_col, drug_col, prediction_power_measure='lm', num_predictive_gene=100, generalization_power_measure='ccc', num_generalizable_gene=50, union_of_single_drug_selection=False)

This function uses the COXEN approach to select genes for predicting the response of multiple drugs. It assumes that no missing data exist. It works in three modes.

(1) If union_of_single_drug_selection is True, prediction_power_measure must be either ‘pearson’ or ‘mutual_info’. This function runs coxen_single_drug_gene_selection for every drug with the given parameter settings and takes the union of the genes selected for every drug as the output. The size of the selected gene set may be larger than num_generalizable_gene.

(2) If union_of_single_drug_selection is False and prediction_power_measure is ‘lm’, this function uses a linear model to fit the response of multiple drugs using the expression of a gene, while the drugs are one-hot encoded. The p-value associated with the coefficient of gene expression is used as the prediction power measure, according to which num_predictive_gene genes will be selected. Then, among the predictive genes, num_generalizable_gene generalizable genes will be selected.

(3) If union_of_single_drug_selection is False and prediction_power_measure is ‘pearson’ or ‘mutual_info’, for each drug this function ranks the genes according to their power of predicting the response of the drug. It then forms the union of an equal number of top-ranked predictive genes from every drug, such that the union contains at least num_predictive_gene genes. Then, num_generalizable_gene generalizable genes will be selected from this union.

source_data: pandas data frame of gene expressions of tumors, for which drug response is known. Its size is [n_source_samples, n_features].

target_data: pandas data frame of gene expressions of tumors, for which drug response needs to be predicted. Its size is [n_target_samples, n_features]. source_data and target_data have the same set of features and the orders of features must match.

drug_response_data: pandas data frame of drug response that must include a column of drug response values, a column of tumor IDs, and a column of drug IDs.

drug_response_col: non-negative integer or string. If integer, it is the column index of drug response in drug_response_data. If string, it is the column name of drug response.

tumor_col: non-negative integer or string. If integer, it is the column index of tumor IDs in drug_response_data. If string, it is the column name of tumor IDs.

drug_col: non-negative integer or string. If integer, it is the column index of drugs in drug_response_data. If string, it is the column name of drugs.

prediction_power_measure: string. ‘pearson’ uses the absolute value of Pearson correlation coefficient to measure prediction power of a gene; ‘mutual_info’ uses the mutual information to measure prediction power of a gene; ‘lm’ uses the linear regression model to select predictive genes for multiple drugs. Default is ‘lm’.

num_predictive_gene: positive integer indicating the number of predictive genes to be selected.

generalization_power_measure: string. ‘pearson’ indicates the Pearson correlation coefficient; ‘ccc’ indicates the concordance correlation coefficient. Default is ‘ccc’.

num_generalizable_gene: positive integer indicating the number of generalizable genes to be selected.

union_of_single_drug_selection: boolean, indicating whether the final gene set should be the union of genes selected for every drug.

indices: 1-D numpy array containing the indices of selected genes.
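
One way to read mode (3) is sketched below: take the top k genes from each drug's ranking and grow k until the union is large enough. The function name `union_of_top_ranked` and the exact stopping rule are assumptions made for illustration, not the module's code.

```python
import numpy as np

def union_of_top_ranked(rankings, num_predictive_gene):
    """Union of the top-k genes of every drug, with the smallest k that suffices.

    rankings: list of 1-D arrays, one per drug, each holding gene indices
    sorted from most to least predictive (hypothetical input format).
    """
    n_genes = len(rankings[0])
    union = np.array([], dtype=int)
    for k in range(1, n_genes + 1):
        # Same number of genes (k) is taken from every drug.
        union = np.unique(np.concatenate([r[:k] for r in rankings]))
        if union.size >= num_predictive_gene:
            break
    return union

rankings = [np.array([0, 1, 2, 3]), np.array([1, 0, 3, 2])]
print(union_of_top_ranked(rankings, 3))
```

Because every drug contributes the same number of genes at each step, the union can overshoot the requested size, which matches the "at least num_predictive_gene" wording above.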

P1_utils.coxen_single_drug_gene_selection(source_data, target_data, drug_response_data, drug_response_col, tumor_col, prediction_power_measure='pearson', num_predictive_gene=100, generalization_power_measure='ccc', num_generalizable_gene=50, multi_drug_mode=False)

This function selects genes for drug response prediction using the COXEN approach. The COXEN approach is designed for selecting genes to predict the response of tumor cells to a specific drug. This function assumes no missing data exist.

source_data: pandas data frame of gene expressions of tumors, for which drug response is known. Its size is [n_source_samples, n_features].

target_data: pandas data frame of gene expressions of tumors, for which drug response needs to be predicted. Its size is [n_target_samples, n_features]. source_data and target_data have the same set of features and the orders of features must match.

drug_response_data: pandas data frame of drug response values for a drug. It must include a column of drug response values and a column of tumor IDs.

drug_response_col: non-negative integer or string. If integer, it is the column index of drug response in drug_response_data. If string, it is the column name of drug response.

tumor_col: non-negative integer or string. If integer, it is the column index of tumor IDs in drug_response_data. If string, it is the column name of tumor IDs.

prediction_power_measure: string. ‘pearson’ uses the absolute value of Pearson correlation coefficient to measure prediction power of a gene; ‘mutual_info’ uses the mutual information to measure prediction power of a gene. Default is ‘pearson’.

num_predictive_gene: positive integer indicating the number of predictive genes to be selected.

generalization_power_measure: string. ‘pearson’ indicates the Pearson correlation coefficient; ‘ccc’ indicates the concordance correlation coefficient. Default is ‘ccc’.

num_generalizable_gene: positive integer indicating the number of generalizable genes to be selected.

multi_drug_mode: boolean, indicating whether the function runs as an auxiliary function of COXEN gene selection for multiple drugs. Default is False.

indices: 1-D numpy array containing the indices of selected genes, if multi_drug_mode is False; 1-D numpy array of indices of sorting all genes according to their prediction power, if multi_drug_mode is True.
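
The two-step selection can be sketched roughly as follows, using plain numpy arrays instead of data frames. Step 1 ranks genes by the absolute Pearson correlation with the response. Step 2 keeps the predictive genes whose co-expression pattern with the other predictive genes is most similar between source and target. The function name `coxen_sketch`, the co-expression scoring, and the sorted output are assumptions based on the general COXEN idea, not a copy of the module's code.

```python
import numpy as np

def coxen_sketch(source, target, response, num_predictive, num_generalizable):
    """Illustrative two-step COXEN-style selection on numpy arrays."""
    # Step 1: prediction power = |Pearson r| between each source gene and the response.
    xs = source - source.mean(axis=0)
    ys = response - response.mean()
    r = (xs * ys[:, None]).sum(axis=0) / np.sqrt((xs ** 2).sum(axis=0) * (ys ** 2).sum())
    predictive = np.argsort(-np.abs(r))[:num_predictive]

    # Step 2: gene-by-gene correlation matrices among the predictive genes.
    cor_source = np.corrcoef(source[:, predictive], rowvar=False)
    cor_target = np.corrcoef(target[:, predictive], rowvar=False)
    k = len(predictive)
    score = np.empty(k)
    for i in range(k):
        others = np.arange(k) != i  # leave out the gene's correlation with itself
        score[i] = np.corrcoef(cor_source[i, others], cor_target[i, others])[0, 1]

    keep = np.argsort(-score)[:num_generalizable]
    return np.sort(predictive[keep])

rng = np.random.default_rng(1)
source = rng.normal(size=(30, 20))   # [n_source_samples, n_features]
target = rng.normal(size=(25, 20))   # [n_target_samples, n_features]
response = 2.0 * source[:, 3] + rng.normal(scale=0.1, size=30)
selected = coxen_sketch(source, target, response, num_predictive=10, num_generalizable=5)
print(selected)
```

The sketch scores generalization with the Pearson correlation only; the ‘ccc’ option described above would swap in the concordance correlation coefficient at that point.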

P1_utils.design_mat(mod, numerical_covariates, batch_levels)
P1_utils.generalization_feature_selection(data1, data2, measure, cutoff)

This function uses the Pearson correlation coefficient or the concordance correlation coefficient, as chosen by measure, to select the features that are generalizable between data1 and data2.

data1: 2D numpy array of the first dataset with a size of (n_samples_1, n_features).

data2: 2D numpy array of the second dataset with a size of (n_samples_2, n_features).

measure: string. ‘pearson’ indicates the Pearson correlation coefficient; ‘ccc’ indicates the concordance correlation coefficient. The signature sets no default, so this argument must be provided.

cutoff: a positive number for selecting generalizable features. If cutoff < 1, this function selects the features with a correlation coefficient >= cutoff. If cutoff >= 1, it must be an integer indicating the number of features to be selected based on correlation coefficient.

fid: 1-D numpy array containing the indices of selected features.
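
The two meanings of cutoff can be illustrated with a small sketch. The helper `select_by_cutoff` is hypothetical and takes precomputed per-feature scores; returning the indices in ascending order is also an assumption.

```python
import numpy as np

def select_by_cutoff(scores, cutoff):
    """Pick feature indices by threshold (cutoff < 1) or by count (cutoff >= 1)."""
    scores = np.asarray(scores, dtype=float)
    if cutoff < 1:
        # Threshold mode: keep every feature whose score reaches the cutoff.
        return np.where(scores >= cutoff)[0]
    # Count mode: keep the int(cutoff) features with the highest scores.
    top = np.argsort(-scores)[: int(cutoff)]
    return np.sort(top)

scores = [0.9, 0.2, 0.75, 0.5]
print(select_by_cutoff(scores, 0.7))  # threshold mode
print(select_by_cutoff(scores, 3))    # count mode
```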

P1_utils.generate_gene_set_data(data, genes, gene_name_type='entrez', gene_set_category='c6.all', metric='mean', standardize=False, data_dir='../../Data/examples/Gene_Sets/MSigDB.v7.0/')

This function generates genomic data summarized at the gene set level.

data: numpy array or pandas data frame of numeric values, with a shape of [n_samples, n_features].

genes: 1-D array or list of gene names with a length of n_features. It indicates which gene a genomic feature belongs to.

gene_name_type: string, indicating the type of gene name used in genes. ‘entrez’ indicates Entrez gene ID and ‘symbols’ indicates HGNC gene symbol. Default is ‘entrez’.

gene_set_category: string, indicating the gene sets for which data will be calculated. ‘c2.cgp’ indicates gene sets affected by chemical and genetic perturbations; ‘c2.cp.biocarta’ indicates BioCarta gene sets; ‘c2.cp.kegg’ indicates KEGG gene sets; ‘c2.cp.pid’ indicates PID gene sets; ‘c2.cp.reactome’ indicates Reactome gene sets; ‘c5.bp’ indicates GO biological processes; ‘c5.cc’ indicates GO cellular components; ‘c5.mf’ indicates GO molecular functions; ‘c6.all’ indicates oncogenic signatures. Default is ‘c6.all’.

metric: string, indicating the way to calculate gene-set-level data. ‘mean’ calculates the mean of gene features belonging to the same gene set. ‘sum’ calculates the summation of gene features belonging to the same gene set. ‘max’ calculates the maximum of gene features. ‘min’ calculates the minimum of gene features. ‘abs_mean’ calculates the mean of absolute values. ‘abs_maximum’ calculates the maximum of absolute values. Default is ‘mean’.

standardize: boolean, indicating whether to standardize features before calculation. Standardization transforms each feature to have a zero mean and a unit standard deviation.

gene_set_data: a data frame of calculated gene-set-level data. Column names are the gene set names.
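
The summarization step can be sketched as below. The helper `summarize_gene_sets` and the in-memory `gene_sets` dictionary are hypothetical; the real function reads gene set definitions from data_dir and returns a data frame, whereas this sketch returns a plain dictionary and silently skips gene sets with no matching features, which is an assumption.

```python
import numpy as np

def summarize_gene_sets(data, genes, gene_sets, metric="mean"):
    """Reduce [n_samples, n_features] data to one value per sample and gene set."""
    reducers = {
        "mean": np.mean,
        "sum": np.sum,
        "max": np.max,
        "min": np.min,
        "abs_mean": lambda x, axis: np.mean(np.abs(x), axis=axis),
        "abs_maximum": lambda x, axis: np.max(np.abs(x), axis=axis),
    }
    reduce = reducers[metric]
    data = np.asarray(data, dtype=float)
    genes = np.asarray(genes)
    summary = {}
    for name, members in gene_sets.items():
        # Features whose gene belongs to this gene set.
        mask = np.isin(genes, list(members))
        if mask.any():
            summary[name] = reduce(data[:, mask], axis=1)
    return summary

data = np.array([[1.0, -2.0, 3.0],
                 [4.0, 5.0, -6.0]])
genes = ["A", "B", "C"]
gene_sets = {"S1": ["A", "B"], "S2": ["C", "D"], "S3": ["Z"]}
print(summarize_gene_sets(data, genes, gene_sets, metric="mean"))
```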

P1_utils.it_sol(sdat, g_hat, d_hat, g_bar, t2, a, b, conv=0.0001)
P1_utils.postmean(g_hat, g_bar, n, d_star, t2)
P1_utils.postvar(sum2, n, a, b)