13th August 2019 @ 06:57

Next, a sweep of ECFP parameters was performed for both the LR and SVM methods, considering the following:

 

EC Depth = [4,5,6]

EC #Bits = [1024, 2048, 4096]

 

Again, 125 train/test splits were performed, and the distribution of MCC values was calculated for each parameter combination.
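A sweep like the one described above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the script used for this entry: the function names `ecfp_matrix` and `sweep` are hypothetical, `LinearSVC` and `LogisticRegression` are assumed to be the sklearn classes behind "SVM" and "LR", the 80/20 split is carried over from the earlier entry, and the "EC Depth" value is assumed to be passed straight through as the Morgan radius (if it is meant as a diameter, as in ECFP4, it would need to be halved).

```python
from itertools import product

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC


def ecfp_matrix(smiles_list, depth, n_bits):
    """Build a Morgan/ECFP bit matrix with RDKit.

    Assumption: `depth` is used directly as the Morgan radius, and every
    SMILES string is valid (MolFromSmiles returns None otherwise).
    """
    # Imported lazily so the sweep logic can be exercised without RDKit.
    from rdkit import Chem
    from rdkit.Chem import rdFingerprintGenerator

    gen = rdFingerprintGenerator.GetMorganGenerator(radius=depth, fpSize=n_bits)
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    return np.array([gen.GetFingerprintAsNumPy(m) for m in mols])


def sweep(smiles_list, y, featurize=ecfp_matrix, depths=(4, 5, 6),
          bit_sizes=(1024, 2048, 4096), n_splits=125, seed=0):
    """Return mean MCC per configuration, keyed like 'svm_4_2048'."""
    y = np.asarray(y)
    results = {}
    for depth, n_bits in product(depths, bit_sizes):
        X = featurize(smiles_list, depth, n_bits)
        scores = {"svm": [], "lr": []}
        for i in range(n_splits):
            # A different random 80/20 split on every iteration.
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=0.2, random_state=seed + i
            )
            # Fresh default-settings models for each split.
            for name, model in (("svm", LinearSVC()), ("lr", LogisticRegression())):
                model.fit(X_tr, y_tr)
                scores[name].append(matthews_corrcoef(y_te, model.predict(X_te)))
        for name, values in scores.items():
            results[f"{name}_{depth}_{n_bits}"] = float(np.mean(values))
    return results
```

Passing the featurizer in as an argument keeps the split-and-score loop independent of how the fingerprints are generated, so a different fingerprint type could be swapped in without touching the evaluation code.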

 

https://imgur.com/p5UAUFx

 

Here, the means of the MCC values for each model were calculated. 

 

Method_ECDepth_#Bits    Mean MCC
svm_6_1024              0.594463
lr_6_1024               0.594463
svm_5_1024              0.642099
lr_5_1024               0.642099
svm_4_1024              0.648357
lr_4_1024               0.648357
lr_6_2048               0.653864
svm_6_2048              0.653864
lr_5_2048               0.658755
svm_5_2048              0.658755
svm_6_4096              0.658897
lr_6_4096               0.658897
svm_5_4096              0.667104
lr_5_4096               0.667104
svm_4_4096              0.667121
lr_4_4096               0.667121
lr_4_2048               0.673339
svm_4_2048              0.673339

 

The best methods were SVM and LR at depth 4, with 2048 bits. This gave an average MCC value of 0.67 +/- 0.1 for both. 
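As a small illustration of how per-configuration MCC scores can be reduced to a "mean +/- spread" summary and ranked, here is a minimal sketch using only the standard library. The function name `summarize` is hypothetical, and it assumes the "+/-" figure is a sample standard deviation, which the notebook does not state.

```python
from statistics import mean, stdev


def summarize(scores_by_model):
    """Return (name, mean, sample std) rows sorted by ascending mean MCC."""
    rows = [(name, mean(vals), stdev(vals)) for name, vals in scores_by_model.items()]
    return sorted(rows, key=lambda row: row[1])


def format_rows(rows):
    """Render rows as 'name mean +/- std' lines, best configuration last."""
    return [f"{name} {m:.2f} +/- {s:.2f}" for name, m, s in rows]
```

Sorting in ascending order mirrors the layout of the table above, where the best configuration appears on the last row.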

12th August 2019 @ 18:44

Picking Initial Methods

With the goal of classifying potential S4 compounds, an initial search for a well-suited classification method was undertaken. All compounds in the database that met the following criteria were used in the model search:

* Compounds with SMILES strings

* Compounds with an Ion Activity value of either 0 or 1

This resulted in 575 compounds being used. Next, the following sklearn models were used with default settings to classify compounds as either 0 or 1 (Ion Activity Assay), with RDKit ECFP4 (2048-bit) fingerprints as the inputs: KNN, Linear SVM, Random Forest, Naive Bayes, Decision Trees, and Logistic Regression.

To determine which model performed best, an 80/20 train/test split was repeated 125 times. In each iteration, a model was built for each method and the Matthews correlation coefficient (MCC) was calculated as the measure of model performance, since it takes all four cells of the confusion matrix into account. The distributions of these MCC scores were then compared across models.
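The repeated-split comparison described above might look something like the sketch below. It is an illustration under assumptions rather than the code actually run: the function names are hypothetical, and the specific estimator classes (for example `LinearSVC` for "Linear SVM" and `GaussianNB` for "Naive Bayes") are guesses at what "default settings" refers to. `X` is assumed to be a precomputed fingerprint matrix and `y` the 0/1 Ion Activity labels.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier


def make_models():
    """Fresh default-settings classifiers (the exact classes are assumptions)."""
    return {
        "knn": KNeighborsClassifier(),
        "svm": LinearSVC(),
        "rf": RandomForestClassifier(),
        "nb": GaussianNB(),
        "dt": DecisionTreeClassifier(),
        "lr": LogisticRegression(),
    }


def repeated_split_mcc(X, y, n_splits=125, test_size=0.2, seed=0):
    """Collect one MCC score per model per random train/test split."""
    scores = {name: [] for name in make_models()}
    for i in range(n_splits):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed + i
        )
        # Every model sees the same split, so the scores are paired by iteration.
        for name, model in make_models().items():
            model.fit(X_tr, y_tr)
            scores[name].append(matthews_corrcoef(y_te, model.predict(X_te)))
    return {name: np.asarray(vals) for name, vals in scores.items()}
```

Each returned array holds `n_splits` MCC values, which is the per-model distribution that the comparison then operates on.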

https://imgur.com/y8uHYSi

Treating the MCC values as distributions, the two-sample Kolmogorov-Smirnov test was applied to each pair of methods to obtain a p-value for the null hypothesis that the two MCC distributions are the same:

Method A, Method B, p-value
mcc_knn,mcc_knn,1.0
mcc_knn,mcc_svm,0.13700610573284444
mcc_knn,mcc_rf,0.007449442574861611
mcc_knn,mcc_nb,8.296026497590731e-38
mcc_knn,mcc_dt,3.5280572995108e-21
mcc_knn,mcc_lr,0.987342261870452
mcc_svm,mcc_knn,0.13700610573284444
mcc_svm,mcc_svm,1.0
mcc_svm,mcc_rf,4.409900257709484e-05
mcc_svm,mcc_nb,8.500551823859001e-41
mcc_svm,mcc_dt,3.535605015038742e-25
mcc_svm,mcc_lr,0.18293778552780215
mcc_rf,mcc_knn,0.007449442574861611
mcc_rf,mcc_svm,4.409900257709484e-05
mcc_rf,mcc_rf,1.0
mcc_rf,mcc_nb,1.677584074335309e-31
mcc_rf,mcc_dt,1.4852492791766038e-16
mcc_rf,mcc_lr,0.03647438799031367
mcc_nb,mcc_knn,8.296026497590731e-38
mcc_nb,mcc_svm,8.500551823859001e-41
mcc_nb,mcc_rf,1.677584074335309e-31
mcc_nb,mcc_nb,1.0
mcc_nb,mcc_dt,1.096954798445088e-14
mcc_nb,mcc_lr,8.296026497590731e-38
mcc_dt,mcc_knn,3.5280572995108e-21
mcc_dt,mcc_svm,3.535605015038742e-25
mcc_dt,mcc_rf,1.4852492791766038e-16
mcc_dt,mcc_nb,1.096954798445088e-14
mcc_dt,mcc_dt,1.0
mcc_dt,mcc_lr,1.2305157079847292e-20
mcc_lr,mcc_knn,0.987342261870452
mcc_lr,mcc_svm,0.18293778552780215
mcc_lr,mcc_rf,0.03647438799031367
mcc_lr,mcc_nb,8.296026497590731e-38
mcc_lr,mcc_dt,1.2305157079847292e-20
mcc_lr,mcc_lr,1.0
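A pairwise comparison of this kind can be sketched with `scipy.stats.ks_2samp`. The helper name `pairwise_ks` is hypothetical, and it is an assumption that the two-sample test from SciPy was the one used. Because the test is symmetric, the sketch visits each unordered pair once, whereas the table above lists every pair in both orders along with the self-comparisons.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ks_2samp


def pairwise_ks(scores):
    """Return {(method_a, method_b): p_value} for every unordered pair.

    `scores` maps a method name to its array of MCC values.
    """
    return {
        (a, b): float(ks_2samp(np.asarray(scores[a]), np.asarray(scores[b])).pvalue)
        for a, b in combinations(sorted(scores), 2)
    }


def significant_pairs(p_values, alpha=0.05):
    """Pairs whose MCC distributions differ at the chosen significance level."""
    return sorted(pair for pair, p in p_values.items() if p < alpha)
```

With several pairwise tests run at once, a multiple-comparison correction on `alpha` may be worth considering. The sketch leaves that choice to the caller.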

The Linear SVM and Logistic Regression methods were best, with average MCC values of 0.67 +/- 0.11 and 0.64 +/- 0.11 respectively. Their MCC distributions differed significantly (p < 0.05) from those of Random Forest, Naive Bayes, and Decision Trees, but not from one another (p = 0.18) or from KNN (p = 0.14 for SVM, p = 0.99 for LR).

Moving forward, we will use Linear SVM and LR as our base methods and carry out a light parameter search to determine whether performance can be improved.

 

 

9th August 2019 @ 20:59

 

8-9-2019

Data Processing

To supply data for building the ML model, the Ion Regulation data set was downloaded from http://tinyurl.com/OSM-Series4CompData as a .csv file on Friday, August 9, 2019.

Ran the attached Python script to keep the Potency vs Parasite (uMol), Ion Regulation Activity, Ion Regulation Test Set, and Smiles columns. All rows containing NaNs were dropped.
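For readers without access to the attachment, the column selection and NaN filtering could be sketched with pandas as below. This is not the attached script: the function names and file paths are hypothetical, the column names are taken from the description above and may not match the CSV headers exactly, and it is assumed that rows are dropped after the columns are selected (so NaNs in other columns do not remove a row).

```python
import pandas as pd

# Column names as given in this entry; the exact CSV headers are an assumption.
KEEP = [
    "Potency vs Parasite (uMol)",
    "Ion Regulation Activity",
    "Ion Regulation Test Set",
    "Smiles",
]


def select_and_clean(df, columns=KEEP):
    """Keep only the listed columns, then drop any row that contains a NaN."""
    return df[columns].dropna().reset_index(drop=True)


def process_file(in_path, out_path):
    """Read the downloaded CSV, clean it, and write the result (paths are placeholders)."""
    cleaned = select_and_clean(pd.read_csv(in_path))
    cleaned.to_csv(out_path, index=False)
    return cleaned
```

Selecting the columns before calling `dropna` means that a missing value in an unrelated column does not discard an otherwise usable compound.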

The attached output file contains the relevant data to be used in our model building.
