12th August 2019 @ 18:44

Picking Initial Methods

With the goal of classifying potential S4 compounds, a search for a well-suited classification method was undertaken. All compounds in the database meeting the following criteria were used in the model search:

* Compounds with SMILES strings

* Compounds with an Ion Activity value of either 0 or 1

This resulted in 575 compounds being used. The following sklearn models (with default settings) were then used to classify compounds, with the Ion Activity Assay result (0 or 1) as the class and RDKit ECFP4 (2048-bit) fingerprints as the inputs: KNN, Linear SVM, Random Forest, Naive Bayes, Decision Tree, and Logistic Regression. To determine which model was the most accurate, an 80/20 train/test split was made 125 times; on each split, a model was built for each method and the Matthews correlation coefficient (MCC) was calculated as a balanced measure of model performance. The distributions of these MCC scores for each model were then compared.
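The comparison loop can be sketched as below. Synthetic random bit vectors stand in for the 2048-bit ECFP4 fingerprints (on the real data, X would come from RDKit's Morgan fingerprint generator), and BernoulliNB is assumed for the Naive Bayes step since the inputs are binary; fewer repeats are used here to keep the sketch quick.

```python
# Sketch of the repeated train/test model comparison described above.
# The data here is synthetic; real X/y would be ECFP4 fingerprints and
# Ion Activity labels for the 575 compounds.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 128))  # stand-in for 2048-bit fingerprints
y = rng.integers(0, 2, size=200)         # stand-in Ion Activity labels

models = {
    "knn": KNeighborsClassifier,
    "svm": LinearSVC,
    "rf": RandomForestClassifier,
    "nb": BernoulliNB,                   # assumption: Bernoulli NB for bit inputs
    "dt": DecisionTreeClassifier,
    "lr": LogisticRegression,
}

# The notebook uses 125 splits; 10 here so the sketch runs quickly.
mcc_scores = {name: [] for name in models}
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    for name, cls in models.items():
        clf = cls().fit(X_tr, y_tr)
        mcc_scores[name].append(matthews_corrcoef(y_te, clf.predict(X_te)))

for name, scores in mcc_scores.items():
    print(f"{name}: {np.mean(scores):.2f} +/- {np.std(scores):.2f}")
```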

MCC score distributions for each method: https://imgur.com/y8uHYSi

Treating the MCC scores as distributions, a two-sample Kolmogorov-Smirnov test was run for each pair of methods, giving p-values for the null hypothesis that the two MCC distributions are the same:

Pairwise KS p-values (symmetric; values to 3 significant figures):

          KNN        SVM        RF         NB         DT         LR
KNN       1          0.137      7.45e-3    8.30e-38   3.53e-21   0.987
SVM       0.137      1          4.41e-5    8.50e-41   3.54e-25   0.183
RF        7.45e-3    4.41e-5    1          1.68e-31   1.49e-16   0.0365
NB        8.30e-38   8.50e-41   1.68e-31   1          1.10e-14   8.30e-38
DT        3.53e-21   3.54e-25   1.49e-16   1.10e-14   1          1.23e-20
LR        0.987      0.183      0.0365     8.30e-38   1.23e-20   1
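Each entry in the table comes from a two-sample KS comparison, which scipy provides as `ks_2samp`. The sketch below shows one such comparison on two illustrative MCC samples (normally distributed stand-ins, not the notebook's actual scores):

```python
# One pairwise two-sample Kolmogorov-Smirnov comparison, as used to build
# the p-value table above. The two MCC samples here are illustrative.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
mcc_svm = rng.normal(0.67, 0.11, size=125)  # stand-in SVM MCC scores
mcc_nb = rng.normal(0.35, 0.12, size=125)   # stand-in Naive Bayes MCC scores

stat, p_value = ks_2samp(mcc_svm, mcc_nb)
print(f"KS statistic = {stat:.3f}, p = {p_value:.3g}")
```

A small p-value indicates the two MCC samples are unlikely to be draws from the same underlying distribution, i.e. the methods genuinely differ in performance.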

The Linear SVM and Logistic Regression methods performed best, with mean MCC values of 0.67 +/- 0.11 and 0.64 +/- 0.11 respectively. Their MCC distributions were statistically significantly different from those of all the other methods, but not significantly different from one another.

Moving forward, we will use Linear SVM and Logistic Regression as our base methods and carry out some light parameter searching to determine whether performance can be improved.
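A minimal sketch of what that light parameter search might look like, using sklearn's GridSearchCV with MCC as the scoring metric. The grids over C are illustrative choices, not values taken from the notebook, and the synthetic X/y again stand in for the fingerprint data:

```python
# Light parameter search over the two best methods, scored by MCC.
# Synthetic data stands in for the ECFP4 fingerprints / Ion Activity labels.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import make_scorer, matthews_corrcoef

rng = np.random.default_rng(2)
X = rng.integers(0, 2, size=(200, 128))  # stand-in fingerprints
y = rng.integers(0, 2, size=200)         # stand-in labels

mcc = make_scorer(matthews_corrcoef)
param_grid = {"C": [0.01, 0.1, 1, 10]}   # illustrative grid

searches = {
    "svm": GridSearchCV(LinearSVC(), param_grid, scoring=mcc, cv=5),
    "lr": GridSearchCV(LogisticRegression(max_iter=1000),
                       param_grid, scoring=mcc, cv=5),
}
for name, search in searches.items():
    search.fit(X, y)
    print(name, search.best_params_, round(search.best_score_, 2))
```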