Data preprocessing - feature selection examples

Background

Data preprocessing is an important front-end step in data analysis that prepares data for subsequent analysis. It not only enables the subsequent analysis by processing and transforming data, but also influences the quality of subsequent analysis sometimes significantly. Several common examples of data preprocessing are data standardization and normalization to remove/suppress noise, removal of batch effect to combine datasets for larger studies, and generation of new representations of data to enable new analyses. Feature selection can be viewed as a kind of data preprocessing for prediction analysis. Its goal is to select a (minimum) subset of available features, based on which prediction models with a good performance can be constructed. And the performance can be evaluated from multiple aspects, such as the prediction accuracy and the speed of constructing the prediction model.

The data preprocessing methods can generate data partitions to enable flexible cross-validation analysis, normalize and remove batch effects from gene expression data of cancer cells, and generate genomic representations at the gene set level for cancer cells. The feature selection methods can filter features based on missing values and variations, and perform feature decorrelation. Features without much variation might not be useful for prediction and highly-correlated features are not necessary to be all included in the prediction model. We also implement and extend the co-expression extrapolation (COXEN) gene selection method for Pilot 1 project [3], which can select predictive and generalizable genes for predicting drug response in the precision oncology applications.

General Data Preprocessing Functions

generate_cross_validation_partition

To flexibly generate data partitions for cross-validation analysis, such as partitioning of grouped samples into sets that do not share groups.

Data Preprocessing Functions Specific to Pilot 1 Applications

quantile_normalizationa

To perform quantile normalization of genomic data [1] with tolerance of missing values. [see example code]

combat_batch_effect_removal

To perform ComBat analysis [2] on gene expression data to remove batch effects. [see example code]

generate_gene_set_data

To calculate genomic representations at gene set level, such as the average expression values of genes in a pathway and the total number of SNP mutations in a genetic pathway. [see example code]

General Feature Selection Functions

select_features_by_missing_values

To remove features with (many) missing values. [see example code]

select_features_by_variation

To remove features with no or small variations. [see example code]

select_decorrelated_features

To select a subset of features that are not identical or highly correlated with each other. [see example code]

Feature (Gene) Selection Functions Specific to Pilot 1 Applications

coxen_single_drug_gene_selection

To perform co-expression extrapolation (COXEN) analysis [3] that selects predictive and generalizable genes for predicting the response of tumor cells to a specific drug.

coxen_multi_drug_gene_selection

To extend the COXEN approach for selecting genes to predict the response of tumor cells to multiple drugs in precision oncology applications.

Running the example

The code demonstrates feature selection methods that CANDLE provides.

It can be run by executing python M16_test.py

Download data

Code

# download all the data if needed from the repo
data_url = 'http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Data_For_Testing/'
file_name = 'small_drug_descriptor_data_unique_samples.txt'
drug_descriptor = candle.get_file(file_name, data_url+file_name, cache_subdir='examples')

file_name = 'small_drug_response_data.txt'
response_data = candle.get_file(file_name, data_url+file_name, cache_subdir='examples')

file_name = 'Gene_Expression_Full_Data_Unique_Samples.txt'
gene_expression = candle.get_file(file_name, data_url+file_name, cache_subdir='examples')

file_name = 'CCLE_NCI60_Gene_Expression_Full_Data.txt'
ccle_nci60 = candle.get_file(file_name, data_url+file_name, cache_subdir='examples')

Output

Importing candle utils for keras
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Data_For_Testing/small_drug_descriptor_data_unique_samples.txt
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Data_For_Testing/small_drug_response_data.txt
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Data_For_Testing/Gene_Expression_Full_Data_Unique_Samples.txt
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Data_For_Testing/CCLE_NCI60_Gene_Expression_Full_Data.txt

Download gene set

Code

# download all the gene_set files needed
data_url = 'http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/'
for gene_set_category in ['c2.cgp','c2.cp.biocarta','c2.cp.kegg','c2.cp.pid','c2.cp.reactome','c5.bp','c5.cc','c5.mf','c6.all']:
    for gene_name_type in ['entrez', 'symbols']:
        file_name = gene_set_category+'.v7.0.'+gene_name_type+'.gmt'
        local_file = candle.get_file(file_name, data_url+file_name, cache_subdir='examples/Gene_Sets/MSigDB.v7.0')
# extract base directory for gene_set data files
data_dir = local_file.split(file_name)[0]
print('Gene Set data is locally stored at ', data_dir)

Output

Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cgp.v7.0.entrez.gmt
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cgp.v7.0.symbols.gmt
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.biocarta.v7.0.entrez.gmt
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.biocarta.v7.0.symbols.gmt
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.kegg.v7.0.entrez.gmt
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.kegg.v7.0.symbols.gmt
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.pid.v7.0.entrez.gmt
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.pid.v7.0.symbols.gmt
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.reactome.v7.0.entrez.gmt
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.reactome.v7.0.symbols.gmt
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c5.bp.v7.0.entrez.gmt
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c5.bp.v7.0.symbols.gmt
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c5.cc.v7.0.entrez.gmt
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c5.cc.v7.0.symbols.gmt
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c5.mf.v7.0.entrez.gmt
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c5.mf.v7.0.symbols.gmt
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c6.all.v7.0.entrez.gmt
Origin =  http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c6.all.v7.0.symbols.gmt
Gene Set data is locally stored at  /Users/hsyoo/projects/CANDLE/Benchmarks/common/../Data/examples/Gene_Sets/MSigDB.v7.0/

Select features based on missing values

Code

print('Testing select_features_by_missing_values')
print('Drug descriptor dataframe includes 10 drugs (rows) and 10 drug descriptor features (columns)')
data = pd.read_csv(drug_descriptor, sep='\t', engine='c', na_values=['na', '-', ''], header=0, index_col=0, low_memory=False)
print(data)
print('Select features with missing rates smaller than 0.1')
id = candle.select_features_by_missing_values(data, threshold=0.1)
print('Feature IDs', id)
print('Select features with missing rates smaller than 0.3')
id = candle.select_features_by_missing_values(data.values, threshold=0.3)
print('Feature IDs', id)

Output

Testing select_features_by_missing_values
Drug descriptor dataframe includes 10 drugs (rows) and 10 drug descriptor features (columns)
               MW     AMW      Sv      Se  ...     Mv  Psi_e_1d  Psi_e_1s  VE3sign_X
Drug_1     475.40   8.804  34.718  54.523  ...  0.643       NaN       NaN        NaN
Drug_10    457.71  10.898  29.154  43.640  ...  0.694       NaN       NaN     -2.752
Drug_100   561.80   6.688  49.975  83.607  ...  0.595       NaN       NaN     -4.335
Drug_1000  362.51   6.840  32.794  52.461  ...  0.619       NaN       NaN     -9.968
Drug_1001  628.83   7.763  51.593  81.570  ...  0.637       NaN       NaN     -2.166
Drug_1002  377.19  10.777  26.191  36.578  ...  0.748       NaN       NaN     -1.526
Drug_1003  371.42   8.254  30.896  45.473  ...  0.687       NaN       NaN     -4.983
Drug_1004  453.60   8.100  37.949  55.872  ...  0.678       NaN       NaN     -4.100
Drug_1005  277.35   7.704  23.940  35.934  ...  0.665       NaN       NaN     -5.234
Drug_1006  409.47   8.189  34.423  50.356  ...  0.688       NaN       NaN     -2.513

[10 rows x 10 columns]
Select features with missing rates smaller than 0.1
Feature IDs [0 1 2 3 4 5 6]
Select features with missing rates smaller than 0.3
Feature IDs [0 1 2 3 4 5 6 9]

Select features based on variation

Code

print('Testing select_features_by_variation')
print('Select features with a variance larger than 100')
id = candle.select_features_by_variation(data, variation_measure='var', threshold=100, portion=None, draw_histogram=False)
print('Feature IDs', id)
print('Select the top 2 features with the largest standard deviation')
id = candle.select_features_by_variation(data, variation_measure='std', portion=0.2)
print('Feature IDs', id)

Output

Testing select_features_by_variation
Select features with a variance larger than 100
Feature IDs [0 3 5]
Select the top 2 features with the largest standard deviation
Feature IDs [0 5]

Select decorrelated features

Code

print('Testing select_decorrelated_features')
print('Select features that are not identical to each other and are not all missing.')
id = candle.select_decorrelated_features(data, threshold=None, random_seed=None)
print('Feature IDs', id)
print('Select features whose absolute mutual Spearman correlation coefficient is smaller than 0.8')
id = candle.select_decorrelated_features(data, method='spearman', threshold=0.8, random_seed=10)
print('Feature IDs', id)

Output

Testing select_decorrelated_features
Select features that are not identical to each other and are not all missing.
Feature IDs [0 1 2 3 4 5 6 9]
Select features whose absolute mutual Spearman correlation coefficient is smaller than 0.8
Feature IDs [0 2 6 9]

Generate cross-validation partitions of data

Code

print('Testing generate_cross_validation_partition')
print('Generate 5-fold cross-validation partition of 10 samples twice')
p = candle.generate_cross_validation_partition(range(10), n_folds=5, n_repeats=2, portions=None, random_seed=None)
print(p)
print('Drug response data of 5 cell lines treated by various drugs.')
data = pd.read_csv(response_data, sep='\t', engine='c', na_values=['na', '-', ''], header=0, index_col=None, low_memory=False)
print(data)
print('Generate partition indices to divide the data into 4 sets without shared cell lines for 5 times.')
p = candle.generate_cross_validation_partition(data.CELL, n_folds=5, n_repeats=1, portions=[1, 1, 1, 2], random_seed=1)
print(p)

Output

Testing generate_cross_validation_partition
Generate 5-fold cross-validation partition of 10 samples twice
[[[0, 5], [1, 2, 3, 4, 6, 7, 8, 9]], [[1, 6], [0, 2, 3, 4, 5, 7, 8, 9]], [[2, 7], [0, 1, 3, 4, 5, 6, 8, 9]], [[3, 8], [0, 1, 2, 4, 5, 6, 7, 9]], [[4, 9], [0, 1, 2, 3, 5, 6, 7, 8]], [[5, 8], [0, 1, 2, 3, 4, 6, 7, 9]], [[3, 9], [0, 1, 2, 4, 5, 6, 7, 8]], [[2, 4], [0, 1, 3, 5, 6, 7, 8, 9]], [[1, 7], [0, 2, 3, 4, 5, 6, 8, 9]], [[0, 6], [1, 2, 3, 4, 5, 7, 8, 9]]]
Drug response data of 5 cell lines treated by various drugs.
   SOURCE        CELL     DRUG     AUC   EC50   EC50se   R2fit      HS
0    CCLE  CCLE.22RV1   CCLE.1  0.7153  5.660   0.6867  0.9533  0.6669
1    CCLE  CCLE.22RV1  CCLE.10  0.9579  7.023   0.7111  0.4332  4.0000
2    CCLE  CCLE.22RV1  CCLE.11  0.4130  7.551   0.0385  0.9948  1.3380
3    CCLE  CCLE.22RV1  CCLE.12  0.8004  5.198  11.7100  0.9944  4.0000
4    CCLE  CCLE.22RV1  CCLE.13  0.5071  7.149   0.3175  0.8069  1.0150
..    ...         ...      ...     ...    ...      ...     ...     ...
95   CCLE    CCLE.697  CCLE.12  0.7869  5.278  20.1200  0.8856  4.0000
96   CCLE    CCLE.697  CCLE.13  0.4433  7.474   0.0265  0.9978  3.7080
97   CCLE    CCLE.697  CCLE.14  0.4337  7.466   0.0106  0.9996  3.4330
98   CCLE    CCLE.697  CCLE.15  0.8721  3.097  29.1300  0.4884  0.2528
99   CCLE    CCLE.697  CCLE.16  0.7955  7.496   0.1195  0.9396  1.9560

[100 rows x 8 columns]
Generate partition indices to divide the data into 4 sets without shared cell lines for 5 times.
[[[68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91], [44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67], [92, 93, 94, 95, 96, 97, 98, 99], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43]], [[44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67], [92, 93, 94, 95, 96, 97, 98, 99], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91]], [[92, 93, 94, 95, 96, 97, 98, 99], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43], [44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91]], [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23], [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43], [68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91], [44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 92, 93, 94, 95, 96, 97, 98, 99]], [[24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43], [68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91], [44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 92, 93, 94, 95, 96, 97, 98, 99]]]
Using TensorFlow backend.
...
found 2 batches
found 0 numerical covariates...
found 0 categorical variables:
Standardizing Data across genes.
Fitting L/S model and finding priors
Finding parametric adjustments

Quantile normalization of gene expression data

Code

print('Testing quantile_normalization')
print('Gene expression data of 897 cell lines (columns) and 17741 genes (rows).')
data = pd.read_csv(gene_expression, sep='\t', engine='c', na_values=['na', '-', ''], header=0, index_col=[0, 1], low_memory=False)
print(data)
print('Before normalization')
third_quartile = data.quantile(0.75, axis=0)
print('Max difference of third quartile between cell lines is ' + str(np.round(a=np.max(third_quartile) - np.min(third_quartile), decimals=2)))
second_quartile = data.quantile(0.5, axis=0)
print('Max difference of median between cell lines is ' + str(np.round(a=np.max(second_quartile) - np.min(second_quartile), decimals=2)))
first_quartile = data.quantile(0.25, axis=0)
print('Max difference of first quartile between cell lines is ' + str(np.round(a=np.max(first_quartile) - np.min(first_quartile), decimals=2)))
norm_data = candle.quantile_normalization(np.transpose(data))
norm_data = np.transpose(norm_data)
print('After normalization')
third_quartile = norm_data.quantile(0.75, axis=0)
print('Max difference of third quartile between cell lines is ' + str(np.round(a=np.max(third_quartile) - np.min(third_quartile), decimals=2)))
second_quartile = norm_data.quantile(0.5, axis=0)
print('Max difference of median between cell lines is ' + str(np.round(a=np.max(second_quartile) - np.min(second_quartile), decimals=2)))
first_quartile = norm_data.quantile(0.25, axis=0)
print('Max difference of first quartile between cell lines is ' + str(np.round(a=np.max(first_quartile) - np.min(first_quartile), decimals=2)))

Output

Testing quantile_normalization
Gene expression data of 897 cell lines (columns) and 17741 genes (rows).
                     CCL_61  CCL_62  CCL_63  ...  CCL_1076  CCL_1077  CCL_1078
entrezID geneSymbol                          ...
1        A1BG          0.99    0.03    0.36  ...      2.56      3.55      3.04
29974    A1CF          4.03    3.03    0.00  ...      0.00      0.03      0.00
2        A2M           2.68    0.03    0.16  ...      0.77      0.31      1.20
144568   A2ML1         0.07    0.07    0.01  ...      0.01      0.00      1.09
127550   A3GALT2       0.15    0.00    0.06  ...      2.34      0.00      0.03
...                     ...     ...     ...  ...       ...       ...       ...
440590   ZYG11A        0.41    0.06    1.70  ...      0.75      3.44      2.44
79699    ZYG11B        4.45    4.23    3.08  ...      4.25      3.61      3.68
7791     ZYX           4.65    5.72    6.67  ...      7.78      4.12      5.97
23140    ZZEF1         4.14    3.98    3.90  ...      4.62      3.76      3.54
26009    ZZZ3          4.77    5.01    3.90  ...      4.38      3.46      3.60

[17741 rows x 897 columns]
Before normalization
Max difference of third quartile between cell lines is 1.86
Max difference of median between cell lines is 2.25
Max difference of first quartile between cell lines is 0.5
After normalization
Max difference of third quartile between cell lines is 0.01
Max difference of median between cell lines is 0.02
Max difference of first quartile between cell lines is 0.06

Generate gene-set-level data

print('Testing generate_gene_set_data')
gene_set_data = candle.generate_gene_set_data(np.transpose(norm_data), [i[0] for i in norm_data.index], gene_name_type='entrez',
                                       gene_set_category='c6.all', metric='mean', standardize=True, data_dir=data_dir)
print('Generate gene-set-level data of 897 cell lines and 189 oncogenic signature gene sets')
print(gene_set_data)
gene_set_data = candle.generate_gene_set_data(np.transpose(norm_data), [i[1] for i in norm_data.index], gene_name_type='symbols',
                                       gene_set_category='c2.cp.kegg', metric='sum', standardize=True, data_dir=data_dir)
print('Generate gene-set-level data of 897 cell lines and 186 KEGG pathways')
print(gene_set_data)

Output

Testing generate_gene_set_data
Generate gene-set-level data of 897 cell lines and 189 oncogenic signature gene sets
          GLI1_UP.V1_DN  GLI1_UP.V1_UP  ...  LEF1_UP.V1_DN  LEF1_UP.V1_UP
CCL_61        -0.031096       0.283946  ...       0.096461      -0.329343
CCL_62         0.362855      -0.101684  ...       0.426951      -0.477634
CCL_63        -0.304989      -0.165160  ...       0.036932      -0.201916
CCL_64        -0.037737      -0.043124  ...       0.154256      -0.210188
CCL_65         0.102477       0.438871  ...      -0.166487       0.287382
...                 ...            ...  ...            ...            ...
CCL_1074       0.508978       0.137934  ...       0.148213       0.166717
CCL_1075      -0.145029       0.216169  ...      -0.067391       0.258455
CCL_1076      -0.357758       0.337235  ...       0.008950       0.186134
CCL_1077       0.086597      -0.266070  ...       0.217244      -0.276022
CCL_1078       0.374237      -0.428383  ...       0.312984      -0.303721

[897 rows x 189 columns]
Generate gene-set-level data of 897 cell lines and 186 KEGG pathways
          KEGG_GLYCOLYSIS_GLUCONEOGENESIS  ...  KEGG_VIRAL_MYOCARDITIS
CCL_61                           6.495365  ...              -30.504868
CCL_62                          30.679006  ...               -7.205641
CCL_63                          10.534238  ...               -5.414998
CCL_64                           6.142140  ...              -10.555601
CCL_65                          -0.303868  ...               -9.784998
...                                   ...  ...                     ...
CCL_1074                        -1.945281  ...                6.891960
CCL_1075                       -21.373730  ...                0.612092
CCL_1076                       -11.711818  ...              -10.353794
CCL_1077                       -11.576702  ...              -31.679962
CCL_1078                       -10.355489  ...              -26.232325

[897 rows x 186 columns]

Combat batch normalization on gene expression data

Code

print('Testing combat_batch_effect_removal')
print('Gene expression data of 60 NCI60 cell lines and 1018 CCLE cell lines with 17741 genes.')
data = pd.read_csv(ccle_nci60, sep='\t', engine='c', na_values=['na', '-', ''], header=0, index_col=[0, 1], low_memory=False)
print(data)
resource = np.array([i.split('.')[0] for i in data.columns])

print('Before removal of batch effect between NCI60 and CCLE datasets')
# Identify NCI60 cell lines and quantile normalize their gene expression data
id = np.where(resource == 'NCI60')[0]
norm_data_NCI60 = candle.quantile_normalization(np.transpose(data.iloc[:, id]))
print('Average third quartile of NCI60 cell lines is ' + str(np.round(a=np.mean(norm_data_NCI60.quantile(0.75, axis=1)), decimals=2)))
print('Average median of NCI60 cell lines is ' + str(np.round(a=np.mean(norm_data_NCI60.quantile(0.5, axis=1)), decimals=2)))
print('Average first quartile of NCI60 cell lines is ' + str(np.round(a=np.mean(norm_data_NCI60.quantile(0.25, axis=1)), decimals=2)))

# Identify CCLE cell lines and quantile normalize their gene expression data
id = np.where(resource == 'CCLE')[0]
norm_data_CCLE = candle.quantile_normalization(np.transpose(data.iloc[:, id]))
print('Average third quartile of CCLE cell lines is ' + str(np.round(a=np.mean(norm_data_CCLE.quantile(0.75, axis=1)), decimals=2)))
print('Average median of CCLE cell lines is ' + str(np.round(a=np.mean(norm_data_CCLE.quantile(0.5, axis=1)), decimals=2)))
print('Average first quartile of CCLE cell lines is ' + str(np.round(a=np.mean(norm_data_CCLE.quantile(0.25, axis=1)), decimals=2)))

# Combine normalized data of NCI60 cell lines and CCLE cell lines
norm_data = pd.concat((norm_data_NCI60, norm_data_CCLE), axis=0)
norm_data = np.transpose(norm_data)

# Apply ComBat algorithm to remove the batch effect between NCI60 and CCLE
corrected_data = candle.combat_batch_effect_removal(norm_data, pd.Series([i.split('.')[0] for i in norm_data.columns], index=norm_data.columns))

print('After removal of batch effect between NCI60 and CCLE datasets')

resource = np.array([i.split('.')[0] for i in corrected_data.columns])
id = np.where(resource == 'NCI60')[0]
corrected_data_NCI60 = np.transpose(corrected_data.iloc[:, id])
print('Average third quartile of NCI60 cell lines is ' + str(np.round(a=np.mean(corrected_data_NCI60.quantile(0.75, axis=1)), decimals=2)))
print('Average median of NCI60 cell lines is ' + str(np.round(a=np.mean(corrected_data_NCI60.quantile(0.5, axis=1)), decimals=2)))
print('Average first quartile of NCI60 cell lines is ' + str(np.round(a=np.mean(corrected_data_NCI60.quantile(0.25, axis=1)), decimals=2)))

# Identify CCLE cell lines and quantile normalize their gene expression data
id = np.where(resource == 'CCLE')[0]
corrected_data_CCLE = np.transpose(corrected_data.iloc[:, id])
print('Average third quartile of CCLE cell lines is ' + str(np.round(a=np.mean(corrected_data_CCLE.quantile(0.75, axis=1)), decimals=2)))
print('Average median of CCLE cell lines is ' + str(np.round(a=np.mean(corrected_data_CCLE.quantile(0.5, axis=1)), decimals=2)))
print('Average first quartile of CCLE cell lines is ' + str(np.round(a=np.mean(corrected_data_CCLE.quantile(0.25, axis=1)), decimals=2)))

Output

Testing combat_batch_effect_removal
Gene expression data of 60 NCI60 cell lines and 1018 CCLE cell lines with 17741 genes.
                     NCI60.786-0|CCL_1  ...  CCLE.ZR7530|CCL_1078
entrezID geneSymbol                     ...
1        A1BG                     0.00  ...                  3.04
29974    A1CF                     0.00  ...                  0.00
2        A2M                      0.00  ...                  1.20
144568   A2ML1                    0.00  ...                  1.09
127550   A3GALT2                  0.00  ...                  0.03
...                                ...  ...                   ...
440590   ZYG11A                   0.01  ...                  2.44
79699    ZYG11B                   3.37  ...                  3.68
7791     ZYX                      7.05  ...                  5.97
23140    ZZEF1                    4.05  ...                  3.54
26009    ZZZ3                     4.10  ...                  3.60

[17741 rows x 1078 columns]
Before removal of batch effect between NCI60 and CCLE datasets
Average third quartile of NCI60 cell lines is 4.0
Average median of NCI60 cell lines is 1.71
Average first quartile of NCI60 cell lines is 0.01
Average third quartile of CCLE cell lines is 4.88
Average median of CCLE cell lines is 2.75
Average first quartile of CCLE cell lines is 0.14
Adjusting data
After removal of batch effect between NCI60 and CCLE datasets
Average third quartile of NCI60 cell lines is 4.81
Average median of NCI60 cell lines is 2.65
Average first quartile of NCI60 cell lines is 0.23
Average third quartile of CCLE cell lines is 4.83
Average median of CCLE cell lines is 2.72
Average first quartile of CCLE cell lines is 0.13

References

  1. Bolstad BM, Irizarry RA, Astrand M, et al. (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias Bioinformatics. 2003 Jan 22;19(2):185-93.

  2. Johnson WE, Rabinovic A, and Li C (2007) Adjusting batch effects in microarray expression data using Empirical Bayes methods Biostatistics 8(1):118-127.

  3. Lee JK, Havaleshko DM, Cho H, et al. (2007) A strategy for predicting the chemosensitivity of human cancers and its application to drug discovery Proc Natl Acad Sci USA, 2007 Aug 7; 104(32):13086-91. Epub 2007 Jul 31