Initial Model Search

Parameter Sweep for ECFP Depth and #bits

Next, a sweep of the ECFP parameters was performed for both the LR and SVM methods, considering the following:

EC Depth = [4,5,6]

EC #Bits = [1024, 2048, 4096]
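The grid above could be generated along the following lines. This is a minimal sketch, not the script used for the sweep: the helper name `morgan_bits` and the stand-in molecule are hypothetical, and the source does not state how "EC Depth" maps onto RDKit's Morgan radius, so the sketch simply passes the depth value through as the radius.

```python
from itertools import product

from rdkit import Chem
from rdkit.Chem import rdFingerprintGenerator

DEPTHS = [4, 5, 6]
BIT_SIZES = [1024, 2048, 4096]


def morgan_bits(smiles, radius, n_bits):
    """Return a 0/1 NumPy vector of length n_bits for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    gen = rdFingerprintGenerator.GetMorganGenerator(radius=radius, fpSize=n_bits)
    return gen.GetFingerprintAsNumPy(mol)


# ASSUMPTION: the sweep's depth value is used directly as the Morgan radius.
# Ethanol ("CCO") is only a stand-in molecule for illustration.
fingerprints = {
    (depth, n_bits): morgan_bits("CCO", depth, n_bits)
    for depth, n_bits in product(DEPTHS, BIT_SIZES)
}

for (depth, n_bits), fp in fingerprints.items():
    print(depth, n_bits, len(fp), int(fp.sum()))
```

Each of the nine (depth, bits) combinations yields a feature vector of a different length or content, which is then fed to both classifiers.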

Again, 125 train/test splits were performed, and the distribution of MCC values was calculated for each configuration.

https://imgur.com/p5UAUFx

Here, the means of the MCC values for each model were calculated.

| Method_EC Depth_#Bits | Mean MCC Value |
|---|---|
| svm_6_1024 | 0.594463 |
| lr_6_1024 | 0.594463 |
| svm_5_1024 | 0.642099 |
| lr_5_1024 | 0.642099 |
| svm_4_1024 | 0.648357 |
| lr_4_1024 | 0.648357 |
| lr_6_2048 | 0.653864 |
| svm_6_2048 | 0.653864 |
| lr_5_2048 | 0.658755 |
| svm_5_2048 | 0.658755 |
| svm_6_4096 | 0.658897 |
| lr_6_4096 | 0.658897 |
| svm_5_4096 | 0.667104 |
| lr_5_4096 | 0.667104 |
| svm_4_4096 | 0.667121 |
| lr_4_4096 | 0.667121 |
| lr_4_2048 | 0.673339 |
| svm_4_2048 | 0.673339 |

The best-performing configurations were SVM and LR at depth 4 with 2048 bits, each giving an average MCC value of 0.67 +/- 0.1.
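The ranking step can be sketched with the standard library alone. The labels follow the `method_depth_bits` pattern used above. The function name `summarize` is hypothetical and the scores in `demo` are made-up placeholders, not results from the sweep.

```python
from itertools import product
from statistics import mean, stdev


def summarize(scores_by_config):
    """Return (label, mean, std) tuples sorted by ascending mean MCC."""
    rows = [
        (label, mean(values), stdev(values))
        for label, values in scores_by_config.items()
    ]
    return sorted(rows, key=lambda row: row[1])


# All 18 configuration labels: 2 methods x 3 depths x 3 bit sizes.
labels = [
    f"{method}_{depth}_{bits}"
    for method, depth, bits in product(["svm", "lr"], [4, 5, 6], [1024, 2048, 4096])
]

# Placeholder per-split MCC values (illustrative only).
demo = {
    "lr_4_2048": [0.60, 0.70, 0.65],
    "lr_6_1024": [0.55, 0.60, 0.50],
}

for label, avg, spread in summarize(demo):
    print(f"{label} {avg:.3f} +/- {spread:.3f}")
```

Sorting in ascending order puts the best configuration on the last line of the output.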

**Picking Initial Methods**

With the goal of classifying potential S4 compounds, an initial search for a well-suited classification method was undertaken. All compounds in the database that met the following criteria were used in the model search:

* Compounds with SMILES strings

* Compounds with an Ion Activity value of either 0 or 1

This resulted in 575 compounds being used. Next, the following sklearn models (with default settings) were used to classify compounds as either 0 or 1 (Ion Activity Assay), with RDKit ECFP4 (2048-bit) fingerprints as the inputs: KNN, Linear SVM, Random Forest, Naive Bayes, Decision Trees, and Logistic Regression. To determine which model was the most accurate, an 80/20 train/test split was performed 125 times. For each split, a model was built for each method, and the Matthews correlation coefficient (MCC), which takes all four cells of the confusion matrix into account, was calculated as the measure of model accuracy. The distributions of these MCC scores for each model were then compared.

https://imgur.com/y8uHYSi
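The repeated-split evaluation described above might look roughly like this. It is a sketch under stated assumptions, not the original script. The data are synthetic random bit vectors standing in for the fingerprints, only two of the six models are shown, the function name `repeated_mcc` is hypothetical, and using the loop index as the split seed is a choice made here for reproducibility.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC


def repeated_mcc(X, y, models, n_splits=125, test_size=0.2):
    """Collect one MCC value per model for each random train/test split."""
    scores = {name: [] for name in models}
    for seed in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed
        )
        for name, make_model in models.items():
            model = make_model().fit(X_tr, y_tr)
            scores[name].append(matthews_corrcoef(y_te, model.predict(X_te)))
    return scores


# Synthetic stand-in data (ASSUMPTION: not the real compound set).
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 64))
y = (X[:, :8].sum(axis=1) > 3).astype(int)

# Default settings, as in the text; each split gets a fresh model instance.
models = {"lr": LogisticRegression, "svm": LinearSVC}

# Only 5 splits here to keep the demo quick; the text uses 125.
scores = repeated_mcc(X, y, models, n_splits=5)
for name, values in scores.items():
    print(name, round(float(np.mean(values)), 3), round(float(np.std(values)), 3))
```

Keeping the per-split values in lists, instead of reducing to a mean straight away, is what allows the distributions to be compared afterwards.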

Treating the MCC values as distributions, the two-sample Kolmogorov-Smirnov test was used to compute a p-value for the similarity of the distributions for each pair of methods:

**Method A, Method B, p-value**

mcc_knn,mcc_knn,1.0

mcc_knn,mcc_svm,0.13700610573284444

mcc_knn,mcc_rf,0.007449442574861611

mcc_knn,mcc_nb,8.296026497590731e-38

mcc_knn,mcc_dt,3.5280572995108e-21

mcc_knn,mcc_lr,0.987342261870452

mcc_svm,mcc_knn,0.13700610573284444

mcc_svm,mcc_svm,1.0

mcc_svm,mcc_rf,4.409900257709484e-05

mcc_svm,mcc_nb,8.500551823859001e-41

mcc_svm,mcc_dt,3.535605015038742e-25

mcc_svm,mcc_lr,0.18293778552780215

mcc_rf,mcc_knn,0.007449442574861611

mcc_rf,mcc_svm,4.409900257709484e-05

mcc_rf,mcc_rf,1.0

mcc_rf,mcc_nb,1.677584074335309e-31

mcc_rf,mcc_dt,1.4852492791766038e-16

mcc_rf,mcc_lr,0.03647438799031367

mcc_nb,mcc_knn,8.296026497590731e-38

mcc_nb,mcc_svm,8.500551823859001e-41

mcc_nb,mcc_rf,1.677584074335309e-31

mcc_nb,mcc_nb,1.0

mcc_nb,mcc_dt,1.096954798445088e-14

mcc_nb,mcc_lr,8.296026497590731e-38

mcc_dt,mcc_knn,3.5280572995108e-21

mcc_dt,mcc_svm,3.535605015038742e-25

mcc_dt,mcc_rf,1.4852492791766038e-16

mcc_dt,mcc_nb,1.096954798445088e-14

mcc_dt,mcc_dt,1.0

mcc_dt,mcc_lr,1.2305157079847292e-20

mcc_lr,mcc_knn,0.987342261870452

mcc_lr,mcc_svm,0.18293778552780215

mcc_lr,mcc_rf,0.03647438799031367

mcc_lr,mcc_nb,8.296026497590731e-38

mcc_lr,mcc_dt,1.2305157079847292e-20

mcc_lr,mcc_lr,1.0
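A pairwise comparison of this kind can be sketched with `scipy.stats.ks_2samp`. Whether the original analysis used this exact function is an assumption. The helper name `pairwise_ks_pvalues` is hypothetical and the two score lists are synthetic placeholders.

```python
import random
from itertools import combinations

from scipy.stats import ks_2samp


def pairwise_ks_pvalues(scores):
    """Map each unordered pair of method names to its two-sample KS p-value."""
    return {
        (a, b): ks_2samp(scores[a], scores[b]).pvalue
        for a, b in combinations(sorted(scores), 2)
    }


# Synthetic MCC-like samples (illustrative only, not the reported results).
rng = random.Random(0)
demo_scores = {
    "mcc_a": [rng.gauss(0.65, 0.10) for _ in range(125)],
    "mcc_b": [rng.gauss(0.20, 0.10) for _ in range(125)],
}

pvalues = pairwise_ks_pvalues(demo_scores)
for (a, b), p in pvalues.items():
    print(f"{a},{b},{p}")

# Comparing a sample with itself gives a p-value of 1, matching the
# self-comparison rows in the table above.
same = ks_2samp(demo_scores["mcc_a"], demo_scores["mcc_a"]).pvalue
```

The table lists both orderings of every pair. Because the test is symmetric, the sketch computes each unordered pair only once.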

The Linear SVM and Logistic Regression methods were best, with average MCC values of 0.67 +/- 0.11 and 0.64 +/- 0.11 respectively. At the 0.05 level, their MCC distributions differed significantly from those of Random Forest, Naive Bayes, and Decision Trees, but not from KNN or from one another.

Moving forward, we will use Linear SVMs and LR as our base methods and try some light parameter searching to determine whether we can improve the performance.

8-9-2019

**Data Processing**

To supply data for building the ML model, the Ion Regulation data set was downloaded from http://tinyurl.com/OSM-Series4CompData as a .csv on Friday, August 9, 2019.

The attached Python script was run to keep the Potency vs Parasite (uMol), Ion Regulation Activity, Ion Regulation Test Set, and Smiles columns. All data rows containing NaNs were dropped.

The attached output file contains the relevant data to be used in our model building.
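For readers without the attachment, the filtering step could be approximated with pandas as below. This is a sketch and not the attached script. It assumes the CSV headers match the column names given above exactly and that NaN rows are dropped after the column selection. The inline CSV rows are made up for illustration.

```python
import io

import pandas as pd

# ASSUMPTION: these header strings match the downloaded CSV exactly.
KEEP = [
    "Smiles",
    "Potency vs Parasite (uMol)",
    "Ion Regulation Activity",
    "Ion Regulation Test Set",
]


def clean(df):
    """Keep only the modelling columns and drop rows with any missing value."""
    return df[KEEP].dropna().reset_index(drop=True)


# Made-up rows standing in for the real download.
raw = io.StringIO(
    "Smiles,Potency vs Parasite (uMol),Ion Regulation Activity,Ion Regulation Test Set,Notes\n"
    "CCO,1.2,1,A,x\n"
    "c1ccccc1,,0,B,y\n"
    "CCN,0.5,0,,z\n"
    "CCC,3.4,0,B,\n"
)

cleaned = clean(pd.read_csv(raw))
print(cleaned)
# cleaned.to_csv("output.csv", index=False)  # hypothetical output filename
```

Selecting the columns before calling `dropna` means a gap in an unrelated column (such as the placeholder `Notes` column here) does not discard an otherwise usable row.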