Searching too many chemicals to count

cheminformatics

LNP

Searching trillions of structures on a laptop

Author

Akshay Balsubramani

Too large to count? Browse instead

Some of the most exciting algorithmic advances in applied drug discovery involve manipulating chemical spaces that are truly vast. When the space cannot be sampled and cleverly stored in memory, the modular structure relating chemicals in the space to each other is key to concisely manipulating and learning on it.

The ubiquitous cheminformatics package RDKit has some wonderful functionality for searching over synthon spaces that are represented in this manner. The functionality runs blazing fast, even on inconceivably large virtual spaces, because it does not scale in the combinatorics of putting the synthons together, only in the number of distinct synthons themselves.

Many synthon formats have blossomed in the last few years. Most of the time, they are equivalent to each other, finding different ways of encoding fragments and how to put them together.

Here we collect some useful code for dealing with these formats and manipulating them. We show how some of this functionality will be useful, in a way that’s complementary to existing canonical demonstrations of it, working in the RDKit ecosystem as usual.

Synthons in standard form

Representing spaces implicitly with synthons

A synthon is a convenient unit of retrosynthetic analysis in chemistry. For us it has a precise computational meaning – a synthon is a fragment with reactive centers indicated by dummy atoms, which are removed when an in silico bond formation (“reaction”) occurs. Here is a simple example, a bond formation between two reactants, namely the ester linkage of the ionizable lipid MC3.

CODE

from PIL import Image
import io

file_pfx = "../../files/chem/"

decomp_tools_path = file_pfx + "combinatorial_chemistry_decomposition_tools.py"

from importlib.machinery import SourceFileLoader
ccd_tools = SourceFileLoader("ccd_tools", decomp_tools_path).load_module()


mc3_smiles = "CCCCCC=CCC=CCCCCCCCCC(CCCCCCCCC=CCC=CCCCCC)OC(=O)CCCN(C)C"
mc3_head = "[1*]C(=O)CCCN(C)C"
mc3_tail = "CCCCCC=CCC=CCCCCCCCCC(CCCCCCCCC=CCC=CCCCCC)O[1*]"

smarts1 = mc3_head + "." + mc3_tail + ">>" + mc3_smiles

bio = io.BytesIO(ccd_tools.draw_reaction(smarts1, width=2000, height=400))
Image.open(bio)

This has advantages and disadvantages.

On one hand, this representation is unambiguous. Given two synthons, I know exactly how to put them together without any further information.
On the other hand, this way of representing synthons is not synthetically plausible as it does not seek to mimic any physical reaction mechanism. All carbons and heteroatoms are conserved, and only bonds between existing atoms are formed.

Putting these synthons together in a file is a beautifully concise way of representing a space. A synthon file is a file with one synthon per line, in a very specific format which we review below because of the scarcity of other documentation.

Each synthon has at least one attachment point to other synthons, which allows it to participate in in silico reactions.

Column name	Description	Notes
SMILES	Synthon SMILES with attachment atoms marked as `[1], [2], …` (or the transuranic labels `[U]`, `[Np]`, `[Pu]`, `[Am]` for up to 4).	Required
synton_id	String identifier, e.g. `3-1`.	Required
synton#	Integer role (0, 1, 2, …) within its reaction.	Required
reaction_id	Name/ID grouping synthons into one reaction.	Required
release	Free-text “release” tag.	Optional

A collection of synthons combinatorially describes a space according to the following parsing rule.

For each reaction_id, take one synthon for each synton# present, giving a set of multiple synthons.
Replace any pairs of matching attachment labels (e.g. [1*], [2*], …) by a direct bond between the linked atoms.

Summary: RDKit synthon file format

Header line must literally read SMILES synton_id synton# reaction_id
or (with optional column)
SMILES synton_id synton# reaction_id release
(synton spelling is deliberate).
Separator:
- Tab by default, any ASCII whitespace accepted.
- A comma-separated variant is also allowed (SMILES,synton_id,synton_role,reaction_id,…).
Validation: loader checks header and correct field count per row; else throws std::runtime_error.
Interrupts: if the process receives SIGINT, the reader exits early with cancelled = true.

Synthons from reaction chemistry

Conventionally, virtual spaces are generated by applying in silico SMARTS reactions to realistic fragments in ways that mimic known chemistry. As I’ve written, such reactions and fragments can be generated for known small-molecule design spaces, such as ionizable lipid design for LNP engineering – for instance, carboxylic acid fragments and alcohols can combine to make ester linkages. This approach leads to virtual spaces which tend to have more developable and otherwise synthesizable molecules, by systematically generating variations on existing reaction schemas.

In such situations, we may want to take advantage of fast synthon-based functionality provided by RDKit. Doing this is a non-trivial process which uses knowledge of the underlying chemistry. Here we will work through how it can be done for an example of interest – LNP engineering.

Examining a reaction

Overall, this example illustrates how to convert any reaction schema – comprised of lists of fragments and reaction SMARTS combining them – into a synthon file. Each reaction is used to build synthons under its reaction_id. The number of reactants is reflected in the maximum synton# for that reaction_id. For each such reactant, possible synthons are appended to the synthon file by referring to the fragment files.

The starting point is the reactions used to comprise the synthon space. We’ll start with a single reaction for simplicity. Let’s take the esterification reaction, which we discuss in another post on LNP engineering, as a prime example. This reaction is, as written, a coupling between an acid and an alcohol that involves dehydration.

CODE

smarts_esterification = (
    "[#6:4][C:2](=[O:3])[OX2H1:1]."
    "[#6X4,c;!$(C([OX2H])[O,S,#7,#15]):6][OX2H1]"
    ">>"
    "[#6:6][O:1][C:2](=[O:3])[#6:4]"
)

bio = io.BytesIO(ccd_tools.draw_reaction(smarts_esterification))
Image.open(bio)

But the dehydration involves removal of a heteroatom, and therefore won’t fit in the RDKit synthon schema. Instead, the heteroatom must be taken out of one of the reactants. In this case, the natural chemical choice is the alcohol, leading to a reaction that does fit into the RDKit synthon schema. (Recall that the attachment points are labeled with transuranics [U], [Np], etc as specified above.)

CODE

smarts_synthon_esterification = (
    "[#6:4][C:2](=[O:3])[OX2H1:1][U]."
    "[#6X4,c;!$(C([OX2H])[O,S,#7,#15]):6][U]"
    ">>"
    "[#6:6][O:1][C:2](=[O:3])[#6:4]"
)

bio = io.BytesIO(ccd_tools.draw_reaction(smarts_synthon_esterification))
Image.open(bio)

This dictates how we assemble the synthons. The reaction is asymmetric, and the reactants play asymmetric roles. So we should always think carefully about the chemical intuition behind associating a certain set of synthons with a reactant.

By this reasoning, we realize that the above esterification reaction is not enough to fully specify the synthons. Consider, for instance, two different occurrences of esterification, in the ionizable lipids ALC-0315 and SM-102:

Figure 1: Adapted from Jorgensen et al. 2023

If we’re considering ALC-0315, the acid synthons correspond to lipophilic tails, and the alcohol synthons to ionizable heads. On the other hand, if we’re considering SM-102, it is the alcohol synthons that are the tails, while the acid synthons correspond to ionizable heads. This starkly demonstrates how the synthon definition depends on the context of the synthesis being considered, not just on the synthesis reaction itself.

Outline

So let’s specify our demonstration scenario in more detail.

We’ll demonstrate making variations on SM-102, the same situation as outlined in another blog post on scoring ultralarge spaces. The scenario involves modifying the amine ionizable head moiety and the lipophilic alcohol tail moieties, while maintaining the two-tailed structure and spacer carbons between them.
We’ll use synthons derived from the head and tail fragments used there, with the reactive sites at the ester linkages only.

Assembling synthons for a reaction

Now we focus on the key conversion from fragments into synthons, ranging over the reactants in the above esterification reaction and assembling possible synthons for each. This is not trivial – in addition to removing heteroatoms, we must also mark reaction sites. This requires us to look up new information which is left technically ambiguous in the standard way of representing reactions.¹

This task is an opportunity for useful insight into the chemistry of these virtual spaces. We range over the fragments and detect appropriate chemical moieties (acid groups, alcohol groups, etc.) to determine which fragments can participate in which roles, as synthons in the reaction.

Acid synthons

According to the esterification reaction outlined above, the first reactant should be a carboxylic acid, with two acid moieties that will form ester linkages with tails. As explained further in the other post on ultralarge scoring, we can create variants of SM-102 by varying this head fragment systematically according to a comprehensive list of thousands of primary amines known in the literature. We substitute each of the hydrogens on the primary amine with an alkyl spacer chain leading to an acid reactive moiety, which will join with a tail synthon. This substitution reaction results in a set of synthons, corresponding to the first (acid) reactant in the esterification.

CODE

primary_amine_to_head_synthon = "[#6:1][NX3;H2:2]>>[#6:1][N:2](CCCCCCCC(=O)[OX2H0:4][3*])CCCCCC(=O)[OX2H0:6][5*]"
bio = io.BytesIO(ccd_tools.draw_reaction(primary_amine_to_head_synthon, width=1000, height=400))
Image.open(bio)

CODE

import pandas as pd
head_pfx = "../../files/LNP_catalog/"
df = pd.read_csv(head_pfx + 'frags_amines_TS.csv.gz', compression='gzip')
df

	SMILES	Name	Amine_type
0	C1CNC2NCCNC2N1	Whitehead_0	secondary_amine \| secondary_amine \| secondary_...
1	C1CNCCCNCCCNC1	Whitehead_1	secondary_amine \| secondary_amine \| secondary_...
2	C1CNCCCNCCNCCCNC1	Whitehead_2	secondary_amine \| secondary_amine \| secondary_...
3	C1CNCCN1	Whitehead_3	secondary_amine \| secondary_amine
4	C1CNCCNC1	Whitehead_4\|CSSS00000210015	secondary_amine \| secondary_amine
...	...	...	...
39531	N#CCn1nnc(N)n1	Klarich_34290744	primary_amine \| aromatic_pyridinic \| aromatic_...
39532	Nc1nnc2nccnn12	Klarich_2289036	primary_aniline \| aromatic_pyridinic \| aromati...
39533	CNC(=O)n1nnnc1N	Klarich_5510474	secondary_amine \| aromatic_pyridinic \| aromati...
39534	Nc1ncc2n[nH]nc2n1	Klarich_18562019	primary_aniline \| aromatic_pyridinic \| aromati...
39535	N=C(N)Nc1nnn[nH]1	Klarich_4404328	primary_amine \| primary_amine \| secondary_anil...

39536 rows × 3 columns

Alcohol synthons

Synthons for the alcohol reactants are easy to determine using the reasoning we’ve outlined. Each new bond simply attaches to a hydroxyl carbon, i.e. the \(\alpha\) carbon of the alcohol.

CODE

alcohol_tail_to_synthon1 = "[#6:1][OX2H1]>>[#6:1][3*]"
alcohol_tail_to_synthon2 = "[#6:1][OX2H1]>>[#6:1][5*]"
bio = io.BytesIO(ccd_tools.draw_reaction(alcohol_tail_to_synthon1, width=700, height=200))
Image.open(bio)

Footnotes

Further processing such as protecting atoms for reactions is designed to address this issue.↩︎

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{balsubramani,
  author = {Balsubramani, Akshay},
  title = {Searching Too Many Chemicals to Count},
  langid = {en}
}

For attribution, please cite this work as:

Balsubramani, Akshay. n.d. “Searching Too Many Chemicals to Count.”