10th October 2019 @ 21:42

In order to make final predictions, I will use a very simple ensemble method: both my IC50 prediction and classification methods will be used in combination to make a final prediction on the compounds.


Note: One of the test compounds (OSM-LO-1) has a SMILES structure that can't be parsed by RDKit, so I will just make a hand-noted prediction of '99999' in its stead.
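For the record, a minimal sketch of that fallback, assuming a hypothetical `predict_fn` wrapper around whichever model is making the call:

```python
from rdkit import Chem

PLACEHOLDER_IC50 = 99999  # sentinel for compounds RDKit can't parse (e.g. OSM-LO-1)

def predict_or_placeholder(smiles, predict_fn):
    """predict_fn is a hypothetical wrapper mapping an RDKit Mol to an IC50 (nM)."""
    mol = Chem.MolFromSmiles(smiles)  # returns None when the SMILES won't parse
    if mol is None:
        return PLACEHOLDER_IC50
    return predict_fn(mol)
```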

The final output of my method, using the IC50 prediction and an LR classifier, is:

| ID | IC50 (nM) | Class Active | IC50 Alone | Ensemble |
|---|---|---|---|---|
| OSM-LO-6 | 44.7 | 1 | Yes | Yes |
| OSM-LO-5 | 55.5 | 1 | Yes | Yes |
| OSM-LO-2 | 76.4 | 1 | Yes | Yes |
| OSM-LO-14 | 101.6 | 0 | Yes | No |
| OSM-S-692 | 106.1 | 0 | Yes | No |
| OSM-LO-8 | 118.1 | 1 | Yes | Yes |
| OSM-S-666 | 134.6 | 1 | Yes | Yes |
| OSM-S-683 | 143.1 | 1 | Yes | Yes |
| OSM-LO-10 | 226.8 | 1 | Yes | No |
| OSM-LO-4 | 229.8 | 1 | Yes | No |
| OSM-S-694 | 244.9 | 0 | Yes | No |
| OSM-LO-9 | 248.9 | 0 | Yes | No |
| OSM-S-680 | 257.8 | 0 | Yes | No |
| OSM-LO-7 | 266.3 | 1 | Yes | No |
| OSM-S-691 | 315.5 | 0 | No | No |
| OSM-S-690 | 335.0 | 1 | No | No |
| OSM-S-556 | 338.6 | 1 | No | No |
| OSM-S-685 | 344.5 | 0 | No | No |
| OSM-S-689 | 360.2 | 0 | No | No |
| OSM-LO-11 | 397.1 | 1 | No | No |
| OSM-S-662 | 426.8 | 0 | No | No |
| OSM-S-668 | 453.9 | 0 | No | No |
| OSM-S-693 | 456.2 | 0 | No | No |
| OSM-S-669 | 465.1 | 0 | No | No |
| OSM-S-672 | 565.1 | 0 | No | No |
| OSM-S-687 | 579.2 | 0 | No | No |
| OSM-S-670 | 601.4 | 0 | No | No |
| OSM-S-678 | 614.9 | 0 | No | No |
| OSM-S-673 | 668.7 | 0 | No | No |
| OSM-S-676 | 788.7 | 0 | No | No |
| OSM-S-675 | 896.7 | 0 | No | No |
| OSM-LO-12 | 997.8 | 0 | No | No |
| OSM-S-651 | 1947.5 | 0 | No | No |
| OSM-LO-1 | 99999 (placeholder) | 0 | No | No |

Here, due to the uncertainty in the accuracy of the model, I used a cutoff of ~300 nM on the predicted IC50 alone for its contribution to the ensemble method.
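A rough sketch of the combination logic (not the literal script: the ~300 nM cutoff is only stated for the 'IC50 Alone' column, and the Ensemble column in the table appears to apply a stricter IC50 threshold alongside the classifier vote):

```python
def ensemble_call(pred_ic50_nm, class_active, ic50_cutoff_nm=300.0):
    """Hypothetical combination rule: a compound is an ensemble 'active' only
    when the predicted IC50 falls below the cutoff AND the LR classifier also
    votes active. The cutoff value used for the Ensemble column itself is an
    assumption; ~300 nM is the stated cutoff for the IC50-alone call."""
    return (pred_ic50_nm < ic50_cutoff_nm) and bool(class_active)
```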

Also, due to uncertainty about how the final model will be evaluated, I've included all 3 metrics here. I would suggest relying on the ensemble scoring as the final metric, resulting in a total of 6 predicted actives. Interestingly, there was discordance on 2 compounds, which don't have the core structure of the other Series 4 compounds.

10th October 2019 @ 19:27

Using the `classifier_splits.py` script, I looked at the ability to properly classify Series 4 compounds as active/inactive (anything below 1 µM is active, everything above is inactive). I used ECFP4 fingerprints (2048 bits) as input and 3 different models. For each model, I ran the calculation 100 times and computed the mean and standard deviation of the MCC:

| Model | Mean MCC | Std MCC |
|---|---|---|
| LogisticRegression | 0.45 | 0.11 |
| LinearSVM | 0.41 | 0.10 |
| KNN (5 neighbors) | 0.35 | 0.09 |

They all have roughly the same performance, with LogisticRegression (lbfgs solver) being the best, but still only moderate. A sketch of the protocol is below.
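This is a minimal sketch of the evaluation as described, not the actual `classifier_splits.py`; the 85/15 split fraction is borrowed from the regression runs below and is an assumption here:

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import matthews_corrcoef
from sklearn.model_selection import train_test_split

def ecfp4(smiles, n_bits=2048):
    # ECFP4 corresponds to a Morgan fingerprint of radius 2 in RDKit
    mol = Chem.MolFromSmiles(smiles)
    return np.array(AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits))

def repeated_mcc(smiles_list, ic50_nm, n_repeats=100, test_size=0.15):
    X = np.array([ecfp4(s) for s in smiles_list])
    y = (np.asarray(ic50_nm) < 1000).astype(int)  # active = IC50 below 1 uM
    scores = []
    for seed in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, random_state=seed)
        clf = LogisticRegression(solver="lbfgs", max_iter=1000).fit(X_tr, y_tr)
        scores.append(matthews_corrcoef(y_te, clf.predict(X_te)))
    return float(np.mean(scores)), float(np.std(scores))
```

Swapping `LogisticRegression` for `LinearSVC` or `KNeighborsClassifier(n_neighbors=5)` gives the other two rows.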

10th October 2019 @ 17:53

The first step is to see if a neural network can accurately predict IC50 for Series 4 within the data itself. Using an 85/15 train/test split (repeated 10 times) and the `regression-splits.py` code, we get the following output:


| Error (MUE) | FPTrainCoverage | FPTestCoverage | FPDistance | TrainSE | TestSE | DiffSE |
|---|---|---|---|---|---|---|
| 0.407 | 0.761 | 0.298 | 1.511 | 283.2 | 241.4 | -41.8 |
| 0.625 | 0.751 | 0.304 | 1.501 | 284.3 | 230.7 | -53.5 |
| 0.485 | 0.757 | 0.308 | 1.468 | 283.2 | 237.7 | -45.6 |
| 0.426 | 0.735 | 0.408 | 1.521 | 279.2 | 264.9 | -14.3 |
| 0.503 | 0.747 | 0.358 | 1.560 | 282.0 | 252.0 | -30.0 |
| 0.487 | 0.754 | 0.347 | 1.524 | 280.5 | 259.1 | -21.3 |
| 0.457 | 0.750 | 0.312 | 1.522 | 281.1 | 252.8 | -28.3 |
| 0.390 | 0.749 | 0.321 | 1.357 | 282.2 | 248.4 | -33.8 |
| 0.490 | 0.754 | 0.344 | 1.582 | 282.3 | 247.5 | -34.8 |
| 0.372 | 0.752 | 0.318 | 1.268 | 283.4 | 243.7 | -39.7 |


Here, the error is in pIC50 units (just log10(IC50), not the negative log). The additional columns are internal metrics for the predictive performance of an FP training set in a neural network (a sketch of how I read them follows the list):

* FPTrainCoverage: fraction of 'on' bits in the training set (should be > 0.75)
* FPTestCoverage: fraction of 'on' bits in the test set (a measure of diversity)
* FPDistance: L2 norm between the average train and test FPs (should be roughly < 1 for accuracies of ~1 pIC50 unit or better)
* FPTrain/TestSE: Shannon entropy of the train/test FP sets
* DiffSE: difference in Shannon entropy between the two sets
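Here is that sketch, with `fp_train`/`fp_test` as (n_molecules, n_bits) 0/1 arrays; these are my reading of the definitions, and the exact formulas in `regression-splits.py` may differ:

```python
import numpy as np

def fp_diagnostics(fp_train, fp_test):
    # Coverage: fraction of bits that are 'on' in at least one molecule
    train_cov = (fp_train.sum(axis=0) > 0).mean()
    test_cov = (fp_test.sum(axis=0) > 0).mean()

    # FPDistance: L2 norm between the average train and test fingerprints
    fp_dist = np.linalg.norm(fp_train.mean(axis=0) - fp_test.mean(axis=0))

    def shannon_entropy(fps):
        # Per-bit Shannon entropy of the bit frequencies, summed over bits
        p = np.clip(fps.mean(axis=0), 1e-12, 1 - 1e-12)
        return float(-(p * np.log2(p) + (1 - p) * np.log2(1 - p)).sum())

    train_se, test_se = shannon_entropy(fp_train), shannon_entropy(fp_test)
    return train_cov, test_cov, fp_dist, train_se, test_se, test_se - train_se
```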

Overall, happy with the performance of this, as we seem to have _decent_ coverage of the input space (~0.75). 

In this, we used ECFP4 fingerprints from RDKit (2048 bits) as our FPs, and a multi-layer NN (3 trainable layers with 2 extremely modest Dropout layers).
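The post doesn't name the framework, so here is a minimal Keras sketch matching that shape; the layer widths, dropout rate, and optimizer are all assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(n_bits=2048):
    # 3 trainable Dense layers with 2 modest Dropout layers between them;
    # only the overall shape comes from the post -- sizes/rates are guesses.
    model = tf.keras.Sequential([
        layers.Dense(512, activation="relu", input_shape=(n_bits,)),
        layers.Dropout(0.1),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.1),
        layers.Dense(1),  # predicts pIC50, here plain log10(IC50 in nM)
    ])
    model.compile(optimizer="adam", loss="mae")  # MAE matches the MUE column
    return model
```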
