Data preprocessing - feature selection examples
===============================================

Background
----------

Data preprocessing is an important front-end step in data analysis that prepares data for subsequent analysis. It not only enables that analysis by processing and transforming the data, but can also influence the quality of the results, sometimes significantly. Common examples of data preprocessing are standardization and normalization to remove or suppress noise, removal of batch effects so that datasets can be combined into larger studies, and generation of new representations of the data that enable new analyses.

Feature selection can be viewed as a kind of data preprocessing for prediction analysis. Its goal is to select a (minimum) subset of the available features from which prediction models with good performance can be constructed. Performance can be evaluated from multiple aspects, such as prediction accuracy and the speed of constructing the prediction model.

The data preprocessing methods can generate data partitions for flexible cross-validation analysis, normalize and remove batch effects from gene expression data of cancer cells, and generate genomic representations of cancer cells at the gene-set level. The feature selection methods can filter features based on missing values and variation, and perform feature decorrelation; features with little variation are unlikely to be useful for prediction, and it is unnecessary to include all of a group of highly correlated features in a prediction model. We also implement and extend the co-expression extrapolation (COXEN) gene selection method [3] for the Pilot 1 project, which selects predictive and generalizable genes for predicting drug response in precision oncology applications.
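The grouped partitioning idea mentioned above (cross-validation sets that do not share groups, e.g. cell lines) can be illustrated with a minimal pure-Python sketch. ``grouped_folds`` is a hypothetical helper written for this illustration only, not the CANDLE implementation; the actual ``candle.generate_cross_validation_partition`` function is demonstrated later in this document.

.. code:: python

   def grouped_folds(group_labels, n_folds):
       # Assign each distinct group label to one fold, then place every
       # sample into its group's fold, so no fold pair shares a group.
       groups = sorted(set(group_labels))
       fold_of_group = {g: i % n_folds for i, g in enumerate(groups)}
       folds = [[] for _ in range(n_folds)]
       for sample_index, g in enumerate(group_labels):
           folds[fold_of_group[g]].append(sample_index)
       return folds

   # six samples from three cell lines 'A', 'B', 'C'; no fold mixes cell lines
   folds = grouped_folds(['A', 'A', 'B', 'C', 'B', 'C'], n_folds=3)
   # folds == [[0, 1], [2, 4], [3, 5]]

Unlike this sketch, the CANDLE function also supports repeated partitioning and unequal portions across sets.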
General Data Preprocessing Functions
------------------------------------

``generate_cross_validation_partition``
   To flexibly generate data partitions for cross-validation analysis, such as partitioning of grouped samples into sets that do not share groups.

Data Preprocessing Functions Specific to Pilot 1 Applications
-------------------------------------------------------------

``quantile_normalization``
   To perform quantile normalization of genomic data [1] with tolerance of missing values. `[see example code] <#quantile-normalization-of-gene-expression-data>`__

``combat_batch_effect_removal``
   To perform ComBat analysis [2] on gene expression data to remove batch effects. `[see example code] <#combat-batch-normalization-on-gene-expression-data>`__

``generate_gene_set_data``
   To calculate genomic representations at the gene-set level, such as the average expression of the genes in a pathway or the total number of SNP mutations in a genetic pathway. `[see example code] <#generate-gene-set-level-data>`__

General Feature Selection Functions
-----------------------------------

``select_features_by_missing_values``
   To remove features with (many) missing values. `[see example code] <#select-features-based-on-missing-values>`__

``select_features_by_variation``
   To remove features with no or small variation. `[see example code] <#select-features-based-on-variation>`__

``select_decorrelated_features``
   To select a subset of features that are not identical to or highly correlated with each other. `[see example code] <#select-decorrelated-features>`__

Feature (Gene) Selection Functions Specific to Pilot 1 Applications
-------------------------------------------------------------------

``coxen_single_drug_gene_selection``
   To perform co-expression extrapolation (COXEN) analysis [3], which selects predictive and generalizable genes for predicting the response of tumor cells to a specific drug.
``coxen_multi_drug_gene_selection``
   To extend the COXEN approach for selecting genes to predict the response of tumor cells to multiple drugs in precision oncology applications.

Running the example
-------------------

The code demonstrates the data preprocessing and feature selection methods that CANDLE provides. It can be run by executing ``python M16_test.py``.

Download data
~~~~~~~~~~~~~

Code

.. code:: python

   import numpy as np
   import pandas as pd
   import candle

   # download all the data if needed from the repo
   data_url = 'http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Data_For_Testing/'
   file_name = 'small_drug_descriptor_data_unique_samples.txt'
   drug_descriptor = candle.get_file(file_name, data_url + file_name, cache_subdir='examples')

   file_name = 'small_drug_response_data.txt'
   response_data = candle.get_file(file_name, data_url + file_name, cache_subdir='examples')

   file_name = 'Gene_Expression_Full_Data_Unique_Samples.txt'
   gene_expression = candle.get_file(file_name, data_url + file_name, cache_subdir='examples')

   file_name = 'CCLE_NCI60_Gene_Expression_Full_Data.txt'
   ccle_nci60 = candle.get_file(file_name, data_url + file_name, cache_subdir='examples')

Output

.. code:: bash

   Importing candle utils for keras
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Data_For_Testing/small_drug_descriptor_data_unique_samples.txt
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Data_For_Testing/small_drug_response_data.txt
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Data_For_Testing/Gene_Expression_Full_Data_Unique_Samples.txt
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Data_For_Testing/CCLE_NCI60_Gene_Expression_Full_Data.txt

Download gene set
~~~~~~~~~~~~~~~~~

Code

.. code:: python

   # download all the gene_set files needed
   data_url = 'http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/'
   for gene_set_category in ['c2.cgp', 'c2.cp.biocarta', 'c2.cp.kegg', 'c2.cp.pid',
                             'c2.cp.reactome', 'c5.bp', 'c5.cc', 'c5.mf', 'c6.all']:
       for gene_name_type in ['entrez', 'symbols']:
           file_name = gene_set_category + '.v7.0.' + gene_name_type + '.gmt'
           local_file = candle.get_file(file_name, data_url + file_name,
                                        cache_subdir='examples/Gene_Sets/MSigDB.v7.0')

   # extract base directory for gene_set data files
   data_dir = local_file.split(file_name)[0]
   print('Gene Set data is locally stored at ', data_dir)

Output

::

   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cgp.v7.0.entrez.gmt
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cgp.v7.0.symbols.gmt
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.biocarta.v7.0.entrez.gmt
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.biocarta.v7.0.symbols.gmt
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.kegg.v7.0.entrez.gmt
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.kegg.v7.0.symbols.gmt
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.pid.v7.0.entrez.gmt
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.pid.v7.0.symbols.gmt
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.reactome.v7.0.entrez.gmt
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c2.cp.reactome.v7.0.symbols.gmt
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c5.bp.v7.0.entrez.gmt
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c5.bp.v7.0.symbols.gmt
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c5.cc.v7.0.entrez.gmt
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c5.cc.v7.0.symbols.gmt
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c5.mf.v7.0.entrez.gmt
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c5.mf.v7.0.symbols.gmt
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c6.all.v7.0.entrez.gmt
   Origin = http://ftp.mcs.anl.gov/pub/candle/public/benchmarks/Pilot1/uno/Candle_Milestone_16_Version_12_15_2019/Data/Gene_Sets/MSigDB.v7.0/c6.all.v7.0.symbols.gmt
   Gene Set data is locally stored at  /Users/hsyoo/projects/CANDLE/Benchmarks/common/../Data/examples/Gene_Sets/MSigDB.v7.0/

Select features based on missing values
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Code

.. code:: python

   print('Testing select_features_by_missing_values')
   print('Drug descriptor dataframe includes 10 drugs (rows) and 10 drug descriptor features (columns)')
   data = pd.read_csv(drug_descriptor, sep='\t', engine='c', na_values=['na', '-', ''],
                      header=0, index_col=0, low_memory=False)
   print(data)
   print('Select features with missing rates smaller than 0.1')
   id = candle.select_features_by_missing_values(data, threshold=0.1)
   print('Feature IDs', id)
   print('Select features with missing rates smaller than 0.3')
   id = candle.select_features_by_missing_values(data.values, threshold=0.3)
   print('Feature IDs', id)

Output

.. code:: bash

   Testing select_features_by_missing_values
   Drug descriptor dataframe includes 10 drugs (rows) and 10 drug descriptor features (columns)
                  MW     AMW      Sv      Se  ...     Mv  Psi_e_1d  Psi_e_1s  VE3sign_X
   Drug_1     475.40   8.804  34.718  54.523  ...  0.643       NaN       NaN        NaN
   Drug_10    457.71  10.898  29.154  43.640  ...  0.694       NaN       NaN     -2.752
   Drug_100   561.80   6.688  49.975  83.607  ...  0.595       NaN       NaN     -4.335
   Drug_1000  362.51   6.840  32.794  52.461  ...  0.619       NaN       NaN     -9.968
   Drug_1001  628.83   7.763  51.593  81.570  ...  0.637       NaN       NaN     -2.166
   Drug_1002  377.19  10.777  26.191  36.578  ...  0.748       NaN       NaN     -1.526
   Drug_1003  371.42   8.254  30.896  45.473  ...  0.687       NaN       NaN     -4.983
   Drug_1004  453.60   8.100  37.949  55.872  ...  0.678       NaN       NaN     -4.100
   Drug_1005  277.35   7.704  23.940  35.934  ...  0.665       NaN       NaN     -5.234
   Drug_1006  409.47   8.189  34.423  50.356  ...  0.688       NaN       NaN     -2.513

   [10 rows x 10 columns]
   Select features with missing rates smaller than 0.1
   Feature IDs [0 1 2 3 4 5 6]
   Select features with missing rates smaller than 0.3
   Feature IDs [0 1 2 3 4 5 6 9]
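Underneath, selection by missing values amounts to thresholding each feature's fraction of missing entries. The following minimal pandas sketch mirrors the idea with a toy dataframe; ``features_below_missing_rate`` is a hypothetical helper written for illustration, not the CANDLE function.

.. code:: python

   import numpy as np
   import pandas as pd

   def features_below_missing_rate(df, threshold):
       # fraction of NaN entries per feature (column), then threshold it
       missing_rate = df.isna().mean(axis=0)
       return np.where(missing_rate.values < threshold)[0]

   df = pd.DataFrame({'a': [1.0, 2.0, 3.0],
                      'b': [1.0, np.nan, np.nan],
                      'c': [np.nan, 5.0, 6.0]})
   ids = features_below_missing_rate(df, threshold=0.5)
   # ids -> [0, 2]: 'a' has no missing values, 'c' has a missing rate of 1/3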
Select features based on variation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Code

.. code:: python

   print('Testing select_features_by_variation')
   print('Select features with a variance larger than 100')
   id = candle.select_features_by_variation(data, variation_measure='var', threshold=100,
                                            portion=None, draw_histogram=False)
   print('Feature IDs', id)
   print('Select the top 2 features with the largest standard deviation')
   id = candle.select_features_by_variation(data, variation_measure='std', portion=0.2)
   print('Feature IDs', id)

Output

::

   Testing select_features_by_variation
   Select features with a variance larger than 100
   Feature IDs [0 3 5]
   Select the top 2 features with the largest standard deviation
   Feature IDs [0 5]

Select decorrelated features
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Code

.. code:: python

   print('Testing select_decorrelated_features')
   print('Select features that are not identical to each other and are not all missing.')
   id = candle.select_decorrelated_features(data, threshold=None, random_seed=None)
   print('Feature IDs', id)
   print('Select features whose absolute mutual Spearman correlation coefficient is smaller than 0.8')
   id = candle.select_decorrelated_features(data, method='spearman', threshold=0.8, random_seed=10)
   print('Feature IDs', id)

Output

::

   Testing select_decorrelated_features
   Select features that are not identical to each other and are not all missing.
   Feature IDs [0 1 2 3 4 5 6 9]
   Select features whose absolute mutual Spearman correlation coefficient is smaller than 0.8
   Feature IDs [0 2 6 9]
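Variation-based filtering boils down to computing a per-feature dispersion measure and keeping the features above a threshold. A minimal NumPy sketch of the variance case, assuming features in columns (``features_by_variance`` is an illustrative helper, not the CANDLE function, which also supports standard deviation, coefficient of variation, and portion-based selection):

.. code:: python

   import numpy as np

   def features_by_variance(x, threshold):
       # keep features (columns) whose variance exceeds the threshold;
       # nanvar ignores missing values, mirroring CANDLE's tolerance of NaNs
       variances = np.nanvar(np.asarray(x, dtype=float), axis=0)
       return np.where(variances > threshold)[0]

   x = np.array([[1.0, 10.0, 5.0],
                 [1.0, 20.0, 5.1],
                 [1.0, 30.0, 4.9]])
   ids = features_by_variance(x, threshold=1.0)
   # ids -> [1]: column 0 is constant and column 2 barely varies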
Generate cross-validation partitions of data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Code

.. code:: python

   print('Testing generate_cross_validation_partition')
   print('Generate 5-fold cross-validation partition of 10 samples twice')
   p = candle.generate_cross_validation_partition(range(10), n_folds=5, n_repeats=2,
                                                  portions=None, random_seed=None)
   print(p)
   print('Drug response data of 5 cell lines treated by various drugs.')
   data = pd.read_csv(response_data, sep='\t', engine='c', na_values=['na', '-', ''],
                      header=0, index_col=None, low_memory=False)
   print(data)
   print('Generate partition indices to divide the data into 4 sets without shared cell lines for 5 times.')
   p = candle.generate_cross_validation_partition(data.CELL, n_folds=5, n_repeats=1,
                                                  portions=[1, 1, 1, 2], random_seed=1)
   print(p)

Output

::

   Testing generate_cross_validation_partition
   Generate 5-fold cross-validation partition of 10 samples twice
   [[[0, 5], [1, 2, 3, 4, 6, 7, 8, 9]], [[1, 6], [0, 2, 3, 4, 5, 7, 8, 9]],
    [[2, 7], [0, 1, 3, 4, 5, 6, 8, 9]], [[3, 8], [0, 1, 2, 4, 5, 6, 7, 9]],
    [[4, 9], [0, 1, 2, 3, 5, 6, 7, 8]], [[5, 8], [0, 1, 2, 3, 4, 6, 7, 9]],
    [[3, 9], [0, 1, 2, 4, 5, 6, 7, 8]], [[2, 4], [0, 1, 3, 5, 6, 7, 8, 9]],
    [[1, 7], [0, 2, 3, 4, 5, 6, 8, 9]], [[0, 6], [1, 2, 3, 4, 5, 7, 8, 9]]]
   Drug response data of 5 cell lines treated by various drugs.
      SOURCE        CELL     DRUG     AUC   EC50   EC50se   R2fit      HS
   0    CCLE  CCLE.22RV1   CCLE.1  0.7153  5.660   0.6867  0.9533  0.6669
   1    CCLE  CCLE.22RV1  CCLE.10  0.9579  7.023   0.7111  0.4332  4.0000
   2    CCLE  CCLE.22RV1  CCLE.11  0.4130  7.551   0.0385  0.9948  1.3380
   3    CCLE  CCLE.22RV1  CCLE.12  0.8004  5.198  11.7100  0.9944  4.0000
   4    CCLE  CCLE.22RV1  CCLE.13  0.5071  7.149   0.3175  0.8069  1.0150
   ..    ...         ...      ...     ...    ...      ...     ...     ...
   95   CCLE    CCLE.697  CCLE.12  0.7869  5.278  20.1200  0.8856  4.0000
   96   CCLE    CCLE.697  CCLE.13  0.4433  7.474   0.0265  0.9978  3.7080
   97   CCLE    CCLE.697  CCLE.14  0.4337  7.466   0.0106  0.9996  3.4330
   98   CCLE    CCLE.697  CCLE.15  0.8721  3.097  29.1300  0.4884  0.2528
   99   CCLE    CCLE.697  CCLE.16  0.7955  7.496   0.1195  0.9396  1.9560

   [100 rows x 8 columns]
   Generate partition indices to divide the data into 4 sets without shared cell lines for 5 times.
   [[[68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91],
     [44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67],
     [92, 93, 94, 95, 96, 97, 98, 99],
     [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25,
      26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43]],
    [[44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67],
     [92, 93, 94, 95, 96, 97, 98, 99],
     [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
     [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 68, 69, 70, 71,
      72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91]],
    [[92, 93, 94, 95, 96, 97, 98, 99],
     [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
     [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43],
     [44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
      68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91]],
    [[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23],
     [24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43],
     [68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91],
     [44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67,
      92, 93, 94, 95, 96, 97, 98, 99]],
    [[24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43],
     [68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91],
     [44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67],
     [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23,
      92, 93, 94, 95, 96, 97, 98, 99]]]
   Using TensorFlow backend.
   ...
   found 2 batches
   found 0 numerical covariates...
   found 0 categorical variables:
   Standardizing Data across genes.
   Fitting L/S model and finding priors
   Finding parametric adjustments

Quantile normalization of gene expression data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Code

.. code:: python

   print('Testing quantile_normalization')
   print('Gene expression data of 897 cell lines (columns) and 17741 genes (rows).')
   data = pd.read_csv(gene_expression, sep='\t', engine='c', na_values=['na', '-', ''],
                      header=0, index_col=[0, 1], low_memory=False)
   print(data)

   print('Before normalization')
   third_quartile = data.quantile(0.75, axis=0)
   print('Max difference of third quartile between cell lines is ' +
         str(np.round(a=np.max(third_quartile) - np.min(third_quartile), decimals=2)))
   second_quartile = data.quantile(0.5, axis=0)
   print('Max difference of median between cell lines is ' +
         str(np.round(a=np.max(second_quartile) - np.min(second_quartile), decimals=2)))
   first_quartile = data.quantile(0.25, axis=0)
   print('Max difference of first quartile between cell lines is ' +
         str(np.round(a=np.max(first_quartile) - np.min(first_quartile), decimals=2)))

   norm_data = candle.quantile_normalization(np.transpose(data))
   norm_data = np.transpose(norm_data)

   print('After normalization')
   third_quartile = norm_data.quantile(0.75, axis=0)
   print('Max difference of third quartile between cell lines is ' +
         str(np.round(a=np.max(third_quartile) - np.min(third_quartile), decimals=2)))
   second_quartile = norm_data.quantile(0.5, axis=0)
   print('Max difference of median between cell lines is ' +
         str(np.round(a=np.max(second_quartile) - np.min(second_quartile), decimals=2)))
   first_quartile = norm_data.quantile(0.25, axis=0)
   print('Max difference of first quartile between cell lines is ' +
         str(np.round(a=np.max(first_quartile) - np.min(first_quartile), decimals=2)))

Output

::

   Testing quantile_normalization
   Gene expression data of 897 cell lines (columns) and 17741 genes (rows).
                        CCL_61  CCL_62  CCL_63  ...  CCL_1076  CCL_1077  CCL_1078
   entrezID geneSymbol                          ...
   1        A1BG          0.99    0.03    0.36  ...      2.56      3.55      3.04
   29974    A1CF          4.03    3.03    0.00  ...      0.00      0.03      0.00
   2        A2M           2.68    0.03    0.16  ...      0.77      0.31      1.20
   144568   A2ML1         0.07    0.07    0.01  ...      0.01      0.00      1.09
   127550   A3GALT2       0.15    0.00    0.06  ...      2.34      0.00      0.03
   ...                     ...     ...     ...  ...       ...       ...       ...
   440590   ZYG11A        0.41    0.06    1.70  ...      0.75      3.44      2.44
   79699    ZYG11B        4.45    4.23    3.08  ...      4.25      3.61      3.68
   7791     ZYX           4.65    5.72    6.67  ...      7.78      4.12      5.97
   23140    ZZEF1         4.14    3.98    3.90  ...      4.62      3.76      3.54
   26009    ZZZ3          4.77    5.01    3.90  ...      4.38      3.46      3.60

   [17741 rows x 897 columns]
   Before normalization
   Max difference of third quartile between cell lines is 1.86
   Max difference of median between cell lines is 2.25
   Max difference of first quartile between cell lines is 0.5
   After normalization
   Max difference of third quartile between cell lines is 0.01
   Max difference of median between cell lines is 0.02
   Max difference of first quartile between cell lines is 0.06
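The shrinking quartile differences above are exactly what quantile normalization is designed to produce: every sample is forced onto a shared reference distribution. A minimal NumPy sketch of the core idea (``quantile_normalize`` is an illustrative helper, not ``candle.quantile_normalization``; it handles neither ties specially nor missing values, which the CANDLE function tolerates):

.. code:: python

   import numpy as np

   def quantile_normalize(x):
       # x: samples in rows, features in columns, no missing values
       ranks = np.argsort(np.argsort(x, axis=1), axis=1)  # per-sample value ranks
       reference = np.sort(x, axis=1).mean(axis=0)        # mean of sorted profiles
       return reference[ranks]                            # map ranks onto reference

   x = np.array([[2.0, 4.0, 6.0],
                 [1.0, 3.0, 5.0]])
   normed = quantile_normalize(x)
   # both rows become the shared reference distribution [1.5, 3.5, 5.5]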
Generate gene-set-level data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Code

.. code:: python

   print('Testing generate_gene_set_data')
   gene_set_data = candle.generate_gene_set_data(np.transpose(norm_data), [i[0] for i in norm_data.index],
                                                 gene_name_type='entrez', gene_set_category='c6.all',
                                                 metric='mean', standardize=True, data_dir=data_dir)
   print('Generate gene-set-level data of 897 cell lines and 189 oncogenic signature gene sets')
   print(gene_set_data)

   gene_set_data = candle.generate_gene_set_data(np.transpose(norm_data), [i[1] for i in norm_data.index],
                                                 gene_name_type='symbols', gene_set_category='c2.cp.kegg',
                                                 metric='sum', standardize=True, data_dir=data_dir)
   print('Generate gene-set-level data of 897 cell lines and 186 KEGG pathways')
   print(gene_set_data)

Output

::

   Testing generate_gene_set_data
   Generate gene-set-level data of 897 cell lines and 189 oncogenic signature gene sets
             GLI1_UP.V1_DN  GLI1_UP.V1_UP  ...  LEF1_UP.V1_DN  LEF1_UP.V1_UP
   CCL_61        -0.031096       0.283946  ...       0.096461      -0.329343
   CCL_62         0.362855      -0.101684  ...       0.426951      -0.477634
   CCL_63        -0.304989      -0.165160  ...       0.036932      -0.201916
   CCL_64        -0.037737      -0.043124  ...       0.154256      -0.210188
   CCL_65         0.102477       0.438871  ...      -0.166487       0.287382
   ...                 ...            ...  ...            ...            ...
   CCL_1074       0.508978       0.137934  ...       0.148213       0.166717
   CCL_1075      -0.145029       0.216169  ...      -0.067391       0.258455
   CCL_1076      -0.357758       0.337235  ...       0.008950       0.186134
   CCL_1077       0.086597      -0.266070  ...       0.217244      -0.276022
   CCL_1078       0.374237      -0.428383  ...       0.312984      -0.303721

   [897 rows x 189 columns]
   Generate gene-set-level data of 897 cell lines and 186 KEGG pathways
             KEGG_GLYCOLYSIS_GLUCONEOGENESIS  ...  KEGG_VIRAL_MYOCARDITIS
   CCL_61                           6.495365  ...              -30.504868
   CCL_62                          30.679006  ...               -7.205641
   CCL_63                          10.534238  ...               -5.414998
   CCL_64                           6.142140  ...              -10.555601
   CCL_65                          -0.303868  ...               -9.784998
   ...                                   ...  ...                     ...
   CCL_1074                        -1.945281  ...                6.891960
   CCL_1075                       -21.373730  ...                0.612092
   CCL_1076                       -11.711818  ...              -10.353794
   CCL_1077                       -11.576702  ...              -31.679962
   CCL_1078                       -10.355489  ...              -26.232325

   [897 rows x 186 columns]
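At its core, a gene-set-level representation simply aggregates the expression of a set's member genes per sample; the CANDLE function adds ``.gmt`` parsing, gene-name mapping, and optional standardization on top of this. A minimal pandas sketch with made-up gene and sample names:

.. code:: python

   import pandas as pd

   # expression: samples (rows) x genes (columns); illustrative toy values
   expression = pd.DataFrame([[1.0, 2.0, 3.0],
                              [4.0, 5.0, 6.0]],
                             index=['CCL_a', 'CCL_b'],
                             columns=['G1', 'G2', 'G3'])
   gene_set = ['G1', 'G3', 'G_not_measured']   # hypothetical gene set

   # keep only the member genes actually measured, then aggregate per sample
   members = [g for g in gene_set if g in expression.columns]
   set_score = expression[members].mean(axis=1)   # metric='mean'; .sum(axis=1) for 'sum'
   # set_score: CCL_a -> 2.0, CCL_b -> 5.0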
Combat batch normalization on gene expression data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Code

.. code:: python

   print('Testing combat_batch_effect_removal')
   print('Gene expression data of 60 NCI60 cell lines and 1018 CCLE cell lines with 17741 genes.')
   data = pd.read_csv(ccle_nci60, sep='\t', engine='c', na_values=['na', '-', ''],
                      header=0, index_col=[0, 1], low_memory=False)
   print(data)
   resource = np.array([i.split('.')[0] for i in data.columns])

   print('Before removal of batch effect between NCI60 and CCLE datasets')
   # Identify NCI60 cell lines and quantile normalize their gene expression data
   id = np.where(resource == 'NCI60')[0]
   norm_data_NCI60 = candle.quantile_normalization(np.transpose(data.iloc[:, id]))
   print('Average third quartile of NCI60 cell lines is ' +
         str(np.round(a=np.mean(norm_data_NCI60.quantile(0.75, axis=1)), decimals=2)))
   print('Average median of NCI60 cell lines is ' +
         str(np.round(a=np.mean(norm_data_NCI60.quantile(0.5, axis=1)), decimals=2)))
   print('Average first quartile of NCI60 cell lines is ' +
         str(np.round(a=np.mean(norm_data_NCI60.quantile(0.25, axis=1)), decimals=2)))

   # Identify CCLE cell lines and quantile normalize their gene expression data
   id = np.where(resource == 'CCLE')[0]
   norm_data_CCLE = candle.quantile_normalization(np.transpose(data.iloc[:, id]))
   print('Average third quartile of CCLE cell lines is ' +
         str(np.round(a=np.mean(norm_data_CCLE.quantile(0.75, axis=1)), decimals=2)))
   print('Average median of CCLE cell lines is ' +
         str(np.round(a=np.mean(norm_data_CCLE.quantile(0.5, axis=1)), decimals=2)))
   print('Average first quartile of CCLE cell lines is ' +
         str(np.round(a=np.mean(norm_data_CCLE.quantile(0.25, axis=1)), decimals=2)))

   # Combine normalized data of NCI60 cell lines and CCLE cell lines
   norm_data = pd.concat((norm_data_NCI60, norm_data_CCLE), axis=0)
   norm_data = np.transpose(norm_data)

   # Apply ComBat algorithm to remove the batch effect between NCI60 and CCLE
   corrected_data = candle.combat_batch_effect_removal(norm_data,
                                                      pd.Series([i.split('.')[0] for i in norm_data.columns],
                                                                index=norm_data.columns))

   print('After removal of batch effect between NCI60 and CCLE datasets')
   resource = np.array([i.split('.')[0] for i in corrected_data.columns])
   id = np.where(resource == 'NCI60')[0]
   corrected_data_NCI60 = np.transpose(corrected_data.iloc[:, id])
   print('Average third quartile of NCI60 cell lines is ' +
         str(np.round(a=np.mean(corrected_data_NCI60.quantile(0.75, axis=1)), decimals=2)))
   print('Average median of NCI60 cell lines is ' +
         str(np.round(a=np.mean(corrected_data_NCI60.quantile(0.5, axis=1)), decimals=2)))
   print('Average first quartile of NCI60 cell lines is ' +
         str(np.round(a=np.mean(corrected_data_NCI60.quantile(0.25, axis=1)), decimals=2)))

   # Identify CCLE cell lines and quantile normalize their gene expression data
   id = np.where(resource == 'CCLE')[0]
   corrected_data_CCLE = np.transpose(corrected_data.iloc[:, id])
   print('Average third quartile of CCLE cell lines is ' +
         str(np.round(a=np.mean(corrected_data_CCLE.quantile(0.75, axis=1)), decimals=2)))
   print('Average median of CCLE cell lines is ' +
         str(np.round(a=np.mean(corrected_data_CCLE.quantile(0.5, axis=1)), decimals=2)))
   print('Average first quartile of CCLE cell lines is ' +
         str(np.round(a=np.mean(corrected_data_CCLE.quantile(0.25, axis=1)), decimals=2)))

Output

::

   Testing combat_batch_effect_removal
   Gene expression data of 60 NCI60 cell lines and 1018 CCLE cell lines with 17741 genes.
                        NCI60.786-0|CCL_1  ...  CCLE.ZR7530|CCL_1078
   entrezID geneSymbol                     ...
   1        A1BG                     0.00  ...                  3.04
   29974    A1CF                     0.00  ...                  0.00
   2        A2M                      0.00  ...                  1.20
   144568   A2ML1                    0.00  ...                  1.09
   127550   A3GALT2                  0.00  ...                  0.03
   ...                                ...  ...                   ...
   440590   ZYG11A                   0.01  ...                  2.44
   79699    ZYG11B                   3.37  ...                  3.68
   7791     ZYX                      7.05  ...                  5.97
   23140    ZZEF1                    4.05  ...                  3.54
   26009    ZZZ3                     4.10  ...                  3.60

   [17741 rows x 1078 columns]
   Before removal of batch effect between NCI60 and CCLE datasets
   Average third quartile of NCI60 cell lines is 4.0
   Average median of NCI60 cell lines is 1.71
   Average first quartile of NCI60 cell lines is 0.01
   Average third quartile of CCLE cell lines is 4.88
   Average median of CCLE cell lines is 2.75
   Average first quartile of CCLE cell lines is 0.14
   Adjusting data
   After removal of batch effect between NCI60 and CCLE datasets
   Average third quartile of NCI60 cell lines is 4.81
   Average median of NCI60 cell lines is 2.65
   Average first quartile of NCI60 cell lines is 0.23
   Average third quartile of CCLE cell lines is 4.83
   Average median of CCLE cell lines is 2.72
   Average first quartile of CCLE cell lines is 0.13

References
----------

1. Bolstad BM, Irizarry RA, Astrand M, et al. (2003) *A comparison of normalization methods for high density oligonucleotide array data based on variance and bias.* Bioinformatics 19(2):185-193.
2. Johnson WE, Rabinovic A, and Li C (2007) *Adjusting batch effects in microarray expression data using Empirical Bayes methods.* Biostatistics 8(1):118-127.
3. Lee JK, Havaleshko DM, Cho H, et al. (2007) *A strategy for predicting the chemosensitivity of human cancers and its application to drug discovery.* Proc Natl Acad Sci USA 104(32):13086-13091.