18th January 2014 @ 23:21

I've just noticed a problem with SMILES that makes me wonder whether we ought to be using it at all for this project. Comments welcome in case I've missed something. (Those with OpenIDs or Google accounts can easily log in and comment below, or I will link this post on G+ here)

OSM recently had a meeting. Several of the compounds inherited in Series 4 contain a difluoromethyl group, which is a pain to make. We were wondering whether we could synthesize a compound with a trifluoromethyl group instead, since those should be much easier to make (Action Item is here). It appeared, via quick searches in the meeting, that no such compound had been made. The R = CH3 compound was known. I therefore needed to make sure that the R = CF3 compound was not known, and, as a control, to see whether the R = H compound was known.

 Substitution Analysis

I therefore went to the spreadsheet of knowns in this series, which contains SMILES strings for all the compounds. In ChemDraw I constructed the strings for each compound and searched the sheet, coming up blank. OK, I thought, we should make these compounds. But then I happened to notice that the strings I had generated actually looked *nothing* like the strings in the sheet (compare the red strings below). So I copied one of the strings from the sheet and pasted it into ChemDraw, giving me the structure with circles in the aromatic rings. It seemed to matter how you drew the structure. "Surely not," I laughed, but on quick inspection I can see that others have known about this for some time. This is astonishing, given how widespread the use of SMILES is in medchem, and in itself it means we should stop using SMILES for this project. Not to downplay the inherent difficulty of dealing with variations in how molecules are represented, but we can't live with this kind of ambiguity if we are searching compound databases.

 Smiles comparison

But it gets worse. I copied the SMILES string from the sheet and generated a structure in ChemDraw. I then used that structure to generate a new SMILES string from within ChemDraw, and the new string was different from the original. It seems as though the string depends on which software is used to generate it. Can that be right?

 Smiles generation

Luckily there are other means. Chris Southan has been at pains to point out the benefits to the consortium of using InChI and InChIKey. As you can see above (blue and green respectively), those perform just fine, and are immune to the way the compound is drawn. Unless there are any objections, shall we move to InChIKey from this point on? Is there a benefit to using SMILES in e.g. similarity searching?
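To make the difference concrete, here is a minimal sketch of the two kinds of lookup, using only the Python standard library. Everything in it is an assumption for illustration: the sheet layout (`name`, `smiles`, `inchikey` columns), the function names, and the rows themselves, which are simple textbook molecules rather than Series 4 compounds. The InChIKeys are assumed to have been generated beforehand by whatever tool produced the sheet; nothing here computes them.

```python
import csv
import io

# Hypothetical stand-in for the spreadsheet of knowns. The rows are simple,
# well-known molecules used only for illustration, not Series 4 compounds.
SHEET_CSV = """name,smiles,inchikey
benzene,c1ccccc1,UHOVQNZJYSORNB-UHFFFAOYSA-N
toluene,Cc1ccccc1,YXFVVABEGXRONW-UHFFFAOYSA-N
pyridine,c1ccncc1,JUJWROOIHBZHMG-UHFFFAOYSA-N
"""


def load_sheet(text):
    """Parse the CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def find_by_smiles(rows, smiles):
    """Exact text match on the SMILES column, like a spreadsheet search."""
    return [row["name"] for row in rows if row["smiles"] == smiles]


def find_by_inchikey(rows, inchikey):
    """Exact match on the InChIKey column, ignoring case and stray spaces."""
    wanted = inchikey.strip().upper()
    return [row["name"] for row in rows if row["inchikey"].strip().upper() == wanted]


rows = load_sheet(SHEET_CSV)

# Toluene written with alternating double bonds (Kekule form). It is the same
# molecule as the sheet's aromatic-form entry, but a different string.
kekule_toluene = "CC1=CC=CC=C1"
toluene_key = "YXFVVABEGXRONW-UHFFFAOYSA-N"

print(find_by_smiles(rows, kekule_toluene))   # text search on SMILES misses
print(find_by_inchikey(rows, toluene_key))    # key search finds the row
```

The SMILES search fails only because the query string was written differently from the one stored in the sheet, which is the problem described above. The key search works on the assumption that every tool involved produces standard InChIKeys for the same structure.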

(comment below or here) The ChemDraw files for the schemes are below if you want to play. Author of this post: Mat Todd.

Attached Files
Simple Smiles Generation.cdx
SMILES issue.cdx
Substitution Analysis for Post.cdx
Simple Smiles Generation.png
SMILES issue.png
Substitution Analysis for Post.png