4th April 2017 @ 11:00


This post presents the negative results that have led to the final design of my semi-supervised project submission. While the following experiments individually appear disjointed, they were intended to be combined into a generative classification paradigm with the following experimental design:

  1. Train baseline multitask classification models for classifying Ion Regulation activity (One shot machine learning)
  2. Train a generative model to sample new Series 4 molecules (VAE)
  3. Rank the synthetic Series 4 molecules by approximate Ligand Lipophilicity Efficiency (aLLE), calculated as pEC50 - cLogP. The pEC50 values would be predicted by a regression model (PGN), on the assumption that pEC50 values are correlated with Ion Regulation Activity. Select the best ranking synthetic Series 4 molecules
  4. Classify the synthetic Series 4 compounds and compile these results as a separate task to support training a final classification model
  5. Train a final multitask classification model incorporating the additional synthetic data
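
The ranking in step 3 can be sketched as follows. This is only an illustration, assuming EC50 values are given in uM and that cLogP has already been computed elsewhere; the helper names, molecule names, and numbers are all hypothetical and not taken from the project.

```python
import math

def pec50_from_ec50_um(ec50_um):
    """Convert an EC50 in micromolar to pEC50 (-log10 of the molar EC50)."""
    return 6.0 - math.log10(ec50_um)

def alle(pec50, clogp):
    """Approximate Ligand Lipophilicity Efficiency: pEC50 - cLogP."""
    return pec50 - clogp

def rank_by_alle(candidates):
    """Sort (name, predicted pEC50, cLogP) tuples by descending aLLE."""
    scored = [(name, alle(p, c)) for name, p, c in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical candidates: (name, predicted pEC50, cLogP)
candidates = [
    ("mol_a", 7.0, 3.5),
    ("mol_b", 6.5, 1.0),
    ("mol_c", 8.0, 5.0),
]
ranking = rank_by_alle(candidates)
```

Note that the most potent candidate (mol_c) ranks last here because its high cLogP outweighs its potency, which is the behaviour an efficiency metric is meant to capture.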

This approach aimed to utilise all the available data to expand the applicability domain of any classification model to be more predictive of unseen compounds.


One shot models from DeepChem were implemented using the competition and semi-supervised datasets to predict the Ion Regulation activity of the molecules in the test set. While a ROC AUC of ~0.7 was initially achieved with these models, they do not output predictions for individual molecules, which is a requirement for the competition. Since my entry into this competition was fairly late (and well after when it should have finished), there was no time to analyse the open source code to make it work. As such, development of these models was discontinued as they were too opaque to analyse.

Multiclass prediction was not well supported by DeepChem at the time, so OneVsAll and OneVsOne hacks were devised to enable multiclass classification with multitask binary classification models. These hacks were not deployed because of the limited amount of data available for the “Slightly active” class and the discontinuation of the One shot models, which could have used this limited data to make useful predictions.
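
The hack itself is not recorded in this post. As a rough sketch of the general OneVsAll idea only, each class can be turned into its own binary task, and the class whose task gives the highest probability is taken as the prediction. The helper names and probabilities below are hypothetical.

```python
def one_vs_all_tasks(labels, classes):
    """Expand multiclass labels into one binary task column per class."""
    return {c: [1 if y == c else 0 for y in labels] for c in classes}

def predict_multiclass(task_probs):
    """Pick the class whose binary task gives the highest probability."""
    return max(task_probs, key=task_probs.get)

classes = ["Active", "Slightly active", "Inactive"]
labels = ["Active", "Inactive", "Slightly active", "Active"]

# One binary label vector per class, usable as tasks in a multitask model
tasks = one_vs_all_tasks(labels, classes)

# Hypothetical per-task probabilities for a single molecule
predicted = predict_multiclass(
    {"Active": 0.2, "Slightly active": 0.7, "Inactive": 0.4}
)
```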

A Variational Autoencoder (VAE) was trained on ~18,000 SMILES structures from a combined Nature, Novartis, and OSM anti-malarial screening dataset with the aim of sampling additional Series 4 structures (triazolopyrazines) for the semi-supervised classification model. Unfortunately, this model only achieved 70% accuracy which is less than the 95% accuracy found in the literature for drug-like molecules [1]. This factor, coupled with the lack of time, meant the VAE was not used for sampling any molecules.

A Progressive Neural Network model for predicting EC50 values was trained. Since a previous analysis found EC50 values to correlate with Ion Regulation Activity, it was hypothesised this model could aid in the selection of additional Series 4 molecules sampled by the VAE. This model addressed the overfitting found in prior results by only utilising molecules annotated with Ion Regulation activity, resulting in similar, and sane, internal and external validation error metrics. While this model performed marginally better than other models with 0.64 MAE, the discontinuation of the VAE made this model redundant. 


[1] Automatic chemical design using a data-driven continuous representation of molecules

31st March 2017 @ 03:03


A classification model for the PfATP4 Ion Regulation Assay was experimentally selected from various neural network architectures, sampling strategies, and featurisations.  


This project aimed to create a classification model for the PfATP4 Ion Regulation Assay that would be predictive for the Series 4 OSM compounds within the provided dataset, as well as those in the unseen validation dataset.


The provided OSM Competition dataset contained 478 structures annotated with Ion Regulation Activity data after curation.

While the dataset featured three classes, consisting of Active, Slightly Active, and Inactive, only seven molecules were found for the Slightly Active class. These molecules were removed as there were too few for accurate modelling.

The remaining molecules were divided into training and testing datasets based on their Ion Regulation Testset designation. This resulted in a Training dataset with 442 molecules and Test dataset with 29 molecules.

Additional datasets were composed from screening data in the literature [1][2]. Molecules with less than 2 uM XC50 activity and at least 75% growth inhibition of either wild-type or drug-resistant Plasmodium falciparum strains were selected, giving a 5723 molecule subset from [1] and 5693 molecules from [2]. These molecules were initially assigned a dummy class; in subsequent modelling they were either given a putative predicted class (Nature) or left unlabelled.


A semi-supervised machine learning paradigm adapted from the machine learning algorithms implemented in the DeepChem project [3] was used to construct QSAR models from both the labelled and unlabelled datasets. All molecules were featurised either by graph convolutional techniques or with 1024-bit ECFP4 descriptors. The Training dataset was split 80/10/10 into train, test, and internal validation sets for model construction and internal validation before testing on the external validation dataset.


The following results present the performance for the Bypass Multitask Neural Network classification model with ECFP4 descriptors as ranked by ROC AUC in both internal and external validation datasets.

Classification Matrix:

                  Predicted Positive  Predicted Negative
Actual Positive   16                  0
Actual Negative   5                   8

Performance Statistics

Measure Performance
Sensitivity 1.00
Specificity 0.614
Balanced Accuracy 0.808
Precision 0.762
Correctly Classified 24
Incorrectly Classified 5
Accuracy 0.828
ROC AUC 0.784
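
Most of these statistics follow from the classification matrix by the standard definitions. The sketch below recomputes them from the four matrix counts and reproduces the tabulated values to within about 0.001; the function name is hypothetical. ROC AUC is not included because it depends on the per-compound probabilities rather than on the matrix.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Derive common classification statistics from a 2x2 confusion matrix."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
    }

# Counts taken from the classification matrix above
metrics = confusion_metrics(tp=16, fn=0, fp=5, tn=8)
```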

Individual OSM compound results

OSM ID Actual IR Class Predicted IR Class Probability
OSM-S-201 0 Active 0.978
OSM-S-366 0 Inactive 0.225
OSM-S-175 1 Active 0.903
OSM-S-218 1 Active 0.943
OSM-S-272 1 Active 0.535
OSM-S-279 1 Active 0.931
OSM-S-293 1 Inactive 0.001
OSM-S-353 1 Active 0.754
OSM-S-376 1 Active 0.966
OSM-S-378 1 Active 0.972
OSM-S-379 1 Active 0.954
OSM-S-389 1 Active 0.988
OSM-S-390 1 Active 0.986
OSM-S-363 0 Inactive 0.323
OSM-S-364 0 Active 0.663
OSM-S-372 0 Inactive 0.183
OSM-S-373 0 Active 0.832
OSM-S-374 0 Active 0.895
OSM-S-375 0 Inactive 0.489
OSM-S-382 0 Inactive 0.000
OSM-S-386 0 Active 0.983
OSM-S-387 0 Inactive 0.017
OSM-S-388 0 Inactive 0.000
OSM-S-369 1 Active 0.814
OSM-S-370 1 Active 0.914
OSM-S-371 1 Active 0.957
OSM-S-383 1 Active 0.943
OSM-S-384 1 Active 0.890
OSM-S-385 1 Active 0.994


This model trades off specificity for greater positive prediction power with perfect sensitivity observed for this testset. 


[1] Gamo F-J, Sanz LM, Vidal J, de Cozar C, Alvarez E, Lavandera J-L, et al. (2010). Thousands of chemical starting points for antimalarial lead identification. Nature 465: 305-310.

[2] Plouffe D, Brinker A, McNamara C, Henson K, Kato N, Kuhen K, et al. (2008). In silico activity profiling reveals the mechanism of action of antimalarials discovered in a high-throughput screen. Proceedings of the National Academy of Sciences of the United States of America 105: 9059-9064.





24th March 2017 @ 02:38


Multitask machine learning algorithms train and predict on more than one output. These models have been found to have higher prediction performance than single task models, especially in domains where data is limited. This competition features a small dataset, so the utilisation of all available relevant data is crucial to produce a useful model for the unseen validation chemicals. Previous data analysis has found the included ChemBL EC50 data to be non-linearly correlated with the OSM EC50 data, so it is hypothesised that non-linear multitask modelling methodologies will feature higher performance than single task models.


This experiment aims to implement multitask models using the OSM and ChemBL EC50 data in the provided competition dataset and compare their testset prediction performance to single task models.


The ChemBL EC50 data was extracted from a previous analysis as the Mean_AltEC50 and appended to the training dataset. The numerical Mean_AltEC50 values were stored in an adjacent column to the OSM EC50 values. 


Multitask variants of the Progressive Neural Network (DT-PGN), Deep Neural Network (DT-DNN), and Graph Convolution (DT-GraphConv) machine learning algorithms modelled both tasks in the training dataset, while a Progressive Neural Network only modelling OSM EC50 was chosen as the representative single task model (ST-PGN). The PGN and DNN models mapped 1024-bit ECFP fingerprints to their respective endpoints, while the DT-GraphConv model mapped graph featurisations of each molecule to the same endpoints. An 80/10/10 training/test/validation split of the dataset was used to train and evaluate each model. All model hyperparameters were optimised for the best prediction performance on the held-out validation set, which consists of the 37 molecules in the combined OSM Testset.

Hyperparameter ST-PGN DT-PGN DT-DNN
Layers 2 2 2
Layer dimensions 1000, 500 1500, 1500 1500, 1500
Dropouts per layer 0.15, 0.1 0.1, 0.1 0.1, 0.1
Number of epochs 100 100 100
Optimizer Adam Adam Adam
Batch size


32 32
Penalty 0.0001 0.001 0.0001
Learning rate 0.001 0.001 0.001

DT-GraphConv architecture/hyperparameters:

  • Total Layers: 10
  • Layer Configuration: 2x(Convolutional, Normalization, Pooling)
  • Number of epochs: 100
  • Optimizer: Adam
  • Batch size: 128
  • Learning rate: 0.001 


The multitask DT-PGN and DT-DNN models featured higher external testset performance (lower MAE) than the singletask PGN model, while the multitask DT-GraphConv model featured lower external testset performance (higher MAE) than the singletask PGN model. Raw predictions for each Testset molecule are in the attached spreadsheet.

There is a substantial prediction performance difference between the Internal Validation and External Testset for all models.

Model  Training (MAE) Internal Validation (MAE) External Testset BC (MAE)
ST-PGN 0.680365333 8.040726457 2.957931574
DT-PGN 0.77677925 6.45179363 2.527557844
DT-GraphConv 1.57414035 5.825520611 3.824818007
DT-DNN 1.026544839 5.363081693 2.791228748


  • Multitask models perform better than their singletask counterparts for OSM EC50 prediction.
  • Multitask Graph Convolutional models continue to underperform compared to previous findings.
  • The substantial performance difference between the Internal and External Validation datasets may indicate the molecules in the external testset are not well represented in the training dataset. Future experiments should substitute the training/test/validation splitting of the training dataset with a K-fold cross validation methodology in order to maximise the usage of chemicals in the training set.
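
As a minimal sketch of the K-fold idea suggested above, the function below shuffles the molecule indices and yields K train/validation splits so that every molecule is used for validation exactly once. The function name, sample count, and seed are hypothetical; an existing library implementation would normally be used in practice.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, validation) index lists for k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

# Hypothetical example: 20 molecules, 5 folds
splits = list(k_fold_indices(n_samples=20, k=5))
```
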
22nd March 2017 @ 12:27


The OSM competition spreadsheet contains a column labelled "Alternative EC50 from Chembl (uM)". While it is currently unclear how these values were acquired, their presence in the spreadsheet allows for a brief analysis to determine if they correlate to the desired modelling target, "Potency vs Parasite (uMol)". A correlation between these two activities could enable multitask regression modelling which could feature enhanced performance for the Test datasets.


Determine the correlation between OSM and ChemBL EC50 values within the provided competition dataset.


Since multiple ChemBL EC50 values may be present within a single cell, all 359 ChemBL EC50 values were extracted from the competition dataset and converted from text to columns in Microsoft Excel. This resulted in the formation of multiple columns containing values for each row (OSM molecule). These values were averaged in order to consolidate the multiple values to a single representative value in a new column called "Mean_AltEC50". Potency vs Parasite (uMol) EC50 values were then carefully inserted adjacent to their corresponding ChemBL EC50 data. 
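
The averaging step can also be scripted rather than done by hand. The sketch below assumes that multiple values in a cell are separated by commas, which may not match the actual spreadsheet; the function name, molecule IDs, and values are hypothetical.

```python
def mean_alt_ec50(cell, sep=","):
    """Average every numeric value found in one spreadsheet cell."""
    values = [float(v) for v in cell.split(sep) if v.strip()]
    # Return None for empty cells so missing data stays visibly missing
    return sum(values) / len(values) if values else None

# Hypothetical cells keyed by molecule ID
cells = {"mol_1": "0.5, 1.5", "mol_2": "2.0", "mol_3": ""}
mean_by_id = {mol_id: mean_alt_ec50(cell) for mol_id, cell in cells.items()}
```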


The OSM and ChemBL EC50 values were graphed with a scatterplot in Microsoft Excel. Linear, logarithmic, power, and exponential trendlines were fitted to this data. The R^2 values were used as a measure of the correlation between the OSM and ChemBL EC50 values.
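
The four trendline fits can be approximated outside a spreadsheet by fitting a straight line after transforming one or both axes. The sketch below assumes, as I understand spreadsheet trendlines to work, that R^2 for the logarithmic, exponential, and power fits is computed on the transformed data; all values must be positive for the logarithms to exist. The function names and data are hypothetical.

```python
import math

def linear_r2(xs, ys):
    """R^2 of an ordinary least-squares line y = a + b*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return (sxy * sxy) / (sxx * syy)

def trendline_r2(xs, ys):
    """R^2 for four trendline types, each fitted as a line after transforming."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    return {
        "linear": linear_r2(xs, ys),        # y = a + b*x
        "logarithmic": linear_r2(lx, ys),   # y = a + b*ln(x)
        "exponential": linear_r2(xs, ly),   # ln(y) = ln(a) + b*x
        "power": linear_r2(lx, ly),         # ln(y) = ln(a) + b*ln(x)
    }

# Hypothetical data that follow y = 2 * x^1.5 exactly
xs = [0.5, 1.0, 2.0, 4.0, 8.0]
ys = [2 * x ** 1.5 for x in xs]
r2 = trendline_r2(xs, ys)
```

With data generated from a power law, only the power fit reaches an R^2 of 1, which mirrors how a non-linear relationship can score poorly on a linear trendline.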


The logarithmic, exponential, and linear trendlines display a poor correlation between OSM and ChemBL EC50 values of less than 0.1 R^2. However, the power trendline features a better correlation with 0.186 R^2. 

Trendline Type R^2
Logarithmic 0.02357






The correlation between ChemBL and OSM EC50 values is non-linear. As such, this correlation could be utilised by multitask neural network models to potentially enhance their predictive performance compared to single task models. The performance of dual task models compared to single task models should be investigated in a follow up experiment.

Future analyses should generate some form of identification that is compatible with Excel's VLOOKUP function instead of relying on sorting the entire dataset.
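
The same keyed lookup can be done in a script, which avoids any dependence on row order. A minimal sketch, with hypothetical molecule IDs and values, that pairs the two EC50 columns by ID:

```python
def join_by_id(osm_ec50, chembl_mean_ec50):
    """Pair both EC50 values for every molecule ID present in both tables."""
    shared = sorted(osm_ec50.keys() & chembl_mean_ec50.keys())
    return [(mol_id, osm_ec50[mol_id], chembl_mean_ec50[mol_id]) for mol_id in shared]

# Hypothetical values keyed by molecule ID
osm_ec50 = {"mol_1": 0.4, "mol_2": 10.0, "mol_3": 1.2}
chembl_mean_ec50 = {"mol_3": 0.9, "mol_1": 0.6}

paired = join_by_id(osm_ec50, chembl_mean_ec50)
```

Molecules without a value in both columns (mol_2 here) are dropped rather than silently misaligned.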

22nd March 2017 @ 11:25


To assess the prediction performance of the Progressive Neural Network model on the held out "B" and "C" test sets.


The molecules labelled with "B" and "C" Ion regulation Test Set were combined to create a single, 37 molecule Test Dataset. An additional class was also created by transforming the associated "Potency vs Parasite (uMol)" values for these molecules by log10(EC50 + 1).


A Progressive Neural Network model was constructed using the datasets described in Part 1 and with the hyperparameters listed below. This model was used to predict the 37 log10(EC50 + 1) transformed "Potency vs Parasite (uMol)" values in the test set. The log10(x + 1) transformation was then reversed for all predictions to enable comparison with the true "Potency vs Parasite (uMol)" values of the test set.
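
The transformation and its inverse are simple enough to state directly. A minimal sketch, with hypothetical function names:

```python
import math

def to_log_scale(ec50_um):
    """Forward transform applied to the targets: log10(EC50 + 1)."""
    return math.log10(ec50_um + 1.0)

def from_log_scale(value):
    """Inverse transform applied to the predictions: 10**value - 1."""
    return 10.0 ** value - 1.0

# Round trip for the 10 uM value that appears for many test set molecules
transformed = to_log_scale(10.0)
recovered = from_log_scale(transformed)
```

Adding 1 before taking the logarithm keeps the transformed value non-negative for any non-negative EC50.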

Progressive neural network hyperparameters:

  • Layers: 2
  • Layer dimensions: 1000
  • Number of epochs: 50
  • Dropouts per layer: 0.25
  • Optimizer: Adam
  • Batch size: 100
  • Loss: Root Mean Square Error


The calculated Root Mean Squared Error for the Progressive neural network model for assessing the combined test set was 4.1340 uMol. 

The true and predicted Potency vs Parasite (uMol) values are displayed below. 

OSM Code Ion Regulation Test Set PotencyuMol PGN_ST_predictions
OSM-S-367 A,B 8.1938 2.404632188
OSM-S-380 A,B 0.11 3.151741719
OSM-S-175 B 0.3475 7.300927026
OSM-S-201 B 4.5956 7.719285267
OSM-S-204 B 0.9018 5.808532719
OSM-S-218 B 0.1105 0.366073106
OSM-S-254 B 0.7744 1.420859794
OSM-S-272 B 0.1078 0.68316042
OSM-S-278 B 4.2154 5.616926461
OSM-S-279 B 0.314275 2.844591687
OSM-S-293 B 0.13 0.987342693
OSM-S-353 B 0.1137 1.776545003
OSM-S-366 B 0.4349 1.629969458
OSM-S-376 B 0.5767 1.354778073
OSM-S-377 B 0.01668 0.153477093
OSM-S-378 B 10 2.057914063
OSM-S-379 B 0.3292 2.85889783
OSM-S-381 B 0.02432 0.957832692
OSM-S-389 B 0.1408 2.532740452
OSM-S-390 B 0.074 1.758208853
OSM-S-363 C 10 2.540437817
OSM-S-364 C 10 0.619554596
OSM-S-368 C 2.239 1.436717336
OSM-S-369 C 0.251 0.985902954
OSM-S-370 C 1.995 3.386147645
OSM-S-371 C 0.372 4.859706705
OSM-S-372 C 10 7.033399774
OSM-S-373 C 10 14.63480484
OSM-S-374 C 10 9.961609289
OSM-S-375 C 10 1.156782548
OSM-S-382 C 10 10.05324279
OSM-S-383 C 0.135 1.212617192
OSM-S-384 C 0.928 1.202246959
OSM-S-385 C 8.586 2.308562764
OSM-S-386 C 4.801 4.868107745
OSM-S-387 C 10 1.059725732
OSM-S-388 C 10 14.21659369
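
For reference, the RMSE reported above is the square root of the mean squared difference between the true and predicted values. The sketch below uses only the first three rows of the table as input, so its result is not the 37 molecule figure; the function name is hypothetical.

```python
import math

def rmse(true_values, predicted_values):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(true_values, predicted_values)]
    return math.sqrt(sum(squared) / len(squared))

# First three rows of the table above: OSM-S-367, OSM-S-380, OSM-S-175
true_um = [8.1938, 0.11, 0.3475]
predicted_um = [2.404632188, 3.151741719, 7.300927026]
partial_rmse = rmse(true_um, predicted_um)
```

Because the errors are squared before averaging, a few large misses dominate the result.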


Many EC50 predictions from this initial modelling effort are in the wrong order of magnitude relative to the actual assay result, which indicates the need to reduce the RMSE well below 4 uM in order to produce a predictive model. This could be achieved by further model tuning (at the risk of overfitting the Testset), multitask/transfer learning of related assay activities to make better use of the limited data, and dataset augmentation to hopefully expand the applicability domain of in silico models and enhance prediction performance for the Series 4 compounds of the Test Sets.
