Searching inconceivably large spaces in drug discovery
Author
Akshay Balsubramani
How to predict on chemical space without enumerating it
In modern AI-aided drug discovery, the chemical space addressed by the drug discovery funnel is truly vast. Virtual screening of large compound libraries is much higher-throughput than traditional primary screening. However, the chemical space that can be explored is still limited by our ability to enumerate and score compounds; this does not scale to many billions of compounds, because each compound incurs a scoring cost no matter how fast the calculation is.
To address this problem, we exploit the fact that we only want the highest-scoring structures from a library. This is very similar to the situation in online marketing or advertisement matching, in which recommendations must be made from relatively little data, and the methods that succeed at such tasks are deeply understood. One prominent example is also one of the oldest and most versatile: Thompson sampling.
We’ll explore how Thompson sampling can be used to score chemical structures without enumerating them, and how it can be applied to the problem of virtual screening in drug discovery. For this application in chemistry, we’ll use a convenient implementation from a recent paper (Klarich et al. 2024).
Defining a chemical space implicitly
The celebrated Pfizer-BioNTech and Moderna mRNA vaccines for COVID-19 use modular lipid nanoparticles (LNPs) to deliver mRNA. These formulations use different ionizable lipids, known as ALC-0315 (Pfizer-BioNTech) and SM-102 (Moderna), which are often viewed in a modular manner:
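For reference, the two structures can be written as SMILES strings (these are the standard published structures), with a quick RDKit parse check:

import rdkit
from rdkit import Chem

# Reference SMILES for the two ionizable lipids
smiles = {
    "ALC-0315": "CCCCCCCCC(CCCCCC)C(=O)OCCCCCCN(CCCCO)CCCCCCOC(=O)C(CCCCCC)CCCCCCCC",
    "SM-102": "CCCCCCCCCCCOC(=O)CCCCCN(CCO)CCCCCCCC(=O)OC(CCCCCCCC)CCCCCCCC",
}
for name, smi in smiles.items():
    mol = Chem.MolFromSmiles(smi)
    print(name, "parses OK:", mol is not None)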
Their “reaction skeletons” are similar (ester linkages connecting two tails with some branching to a tertiary amine head group) but have important differences (e.g. their ester groups face opposite directions with respect to the head group). Suppose we want to try varying these components from one of these basic structures.
Take SM-102 as an example; we might want to vary the head and tails while keeping the ester linkers intact. Let’s say we want to modify it precisely, using the large and diverse collection of hundreds of primary amine heads and hydrophobic tails outlined in previous posts. We will perform the following modifications:
Try a number of different head groups corresponding to different primary amines.
Keep the alkyl chain spacers separating the ester groups from the amine nitrogen (in SM-102, the two tails have 5 and 7 spacer carbons before their carbonyl groups).
Try different tail groups, varying the length, branching, and saturation.
In silico “reaction” rule
To implement Thompson sampling, we need to write these modifications as a multi-component in silico “reaction” in an unambiguous way. This will have several components per the above description:
A head group, which is a primary amine.
The first tail (as an alcohol).
The second tail (as an alcohol).
We insert the alkyl spacer carbons on either side.
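To make this concrete, here is a minimal sketch of how such a multi-component coupling could be expressed as an RDKit reaction SMARTS. The particular SMARTS, the identical 5-carbon spacers on both arms, and the hexylamine/undecanol inputs are simplifying assumptions for illustration, not the exact rule used downstream:

from rdkit import Chem
from rdkit.Chem import AllChem

# Hypothetical 3-component "reaction": a primary-amine head plus two tail
# alcohols. The product attaches a 5-carbon spacer capped by an ester to
# each tail (an assumed simplification; SM-102's two arms actually differ).
rxn = AllChem.ReactionFromSmarts(
    "[NX3;H2:1][#6:2].[OX2H1:3][#6:4].[OX2H1:5][#6:6]>>"
    "[#6:2][N:1](CCCCCC(=O)[O:3][#6:4])CCCCCC(=O)[O:5][#6:6]"
)
head = Chem.MolFromSmiles("NCCCCCC")        # hexylamine, a stand-in head
tail = Chem.MolFromSmiles("OCCCCCCCCCCC")   # undecanol, a stand-in tail
products = rxn.RunReactants((head, tail, tail))
prod = products[0][0]
Chem.SanitizeMol(prod)
print(Chem.MolToSmiles(prod))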
import pandas as pd

# Head-group fragment lists (empty here; extended below with the Klarich et al. amines)
frags_heads, names_heads = [], []

amines_klarich = pd.read_csv(
    "https://raw.githubusercontent.com/PatWalters/TS/refs/heads/main/data/primary_amines_ok.smi",
    header=None, sep=" ", names=["SMILES", "NAME"],
)
print("{} amines from Klarich et al. 2024 library.".format(len(amines_klarich)))
frags_heads.extend(amines_klarich["SMILES"].tolist())
names_heads.extend(["_".join(str(x).split()) for x in amines_klarich["NAME"].tolist()])
Next, we gather a set of tail groups we might want to try in this variant screening. Drug designers in this field vary the tail groups in a number of ways, and we include a representative sample of these variations:
Varying the length of the tail (8, 10, 12, 14 carbons)
Varying the branching of the tail (unbranched and 2-branched tails explored)
Varying the saturation of the tail (each branch of each tail can contain a double bond somewhere along its length)
We can systematically generate a set of tail groups that vary in these ways. For virtual screening purposes, we’ll do this combinatorially, generating all possible combinations of these variations.
Our implementation is easily extensible here: we use core RDKit functionality to build and edit individual atoms within the mol object and convert back into SMILES, which is far more robust than pure SMILES string manipulation.
from itertools import combinations, product

from rdkit import Chem
from rdkit import RDLogger

# 1-line switch-off for all RDKit warnings/info
RDLogger.DisableLog("rdApp.*")

# design-space parameters -----------------------------------------------------
lengths = [8, 10, 12, 14]    # carbons / tail
unsat_choices = (0, 1, 2)    # number of C=C per tail
spacing_allowed = {2, 3}     # gap(s) between successive C=C bonds
# -----------------------------------------------------------------------------

def spacing_ok(db_positions):
    """Return True iff every spacing between successive double bonds is allowed."""
    if len(db_positions) < 2:
        return True
    db_positions = sorted(db_positions)
    return all(b - a in spacing_allowed for a, b in zip(db_positions, db_positions[1:]))

def build_linear_tail(n_carbon, db_pos):
    """Linear n_carbon tail ending in -OH, with optional internal C=C bonds."""
    mol = Chem.RWMol()
    c_atoms = [mol.AddAtom(Chem.Atom("C")) for _ in range(n_carbon)]
    o_atom = mol.AddAtom(Chem.Atom("O"))
    for i in range(n_carbon - 1):
        bond_type = Chem.BondType.DOUBLE if i in db_pos else Chem.BondType.SINGLE
        mol.AddBond(c_atoms[i], c_atoms[i + 1], bond_type)
    mol.AddBond(c_atoms[-1], o_atom, Chem.BondType.SINGLE)
    return Chem.MolToSmiles(mol, canonical=True)

def build_branched_tail(n1, n2, db1, db2):
    """Secondary alcohol core with two hydrophobic tails."""
    mol = Chem.RWMol()
    core = mol.AddAtom(Chem.Atom("C"))
    o_at = mol.AddAtom(Chem.Atom("O"))
    mol.AddBond(core, o_at, Chem.BondType.SINGLE)

    def add_chain(start_atom, length, db_pos_set):
        prev = start_atom
        for idx in range(length):
            c = mol.AddAtom(Chem.Atom("C"))
            bond_type = (
                Chem.BondType.DOUBLE if (idx + 1) in db_pos_set else Chem.BondType.SINGLE
            )
            mol.AddBond(prev, c, bond_type)
            prev = c

    add_chain(core, n1, set(db1))
    add_chain(core, n2, set(db2))
    return Chem.MolToSmiles(mol, canonical=True)

frags = set()

# ---------- unbranched tails ----------
for L, d in product(lengths, unsat_choices):
    interior = range(1, L - 1)  # keep core-C and C-OH bonds single
    if d > len(interior):
        continue
    for db in combinations(interior, d):
        if not spacing_ok(db):
            continue
        try:
            smi = build_linear_tail(L, db)
            Chem.SanitizeMol(Chem.MolFromSmiles(smi))
            frags.add(smi)
        except Exception:
            pass  # drop impossible structures silently

# ---------- branched tails ----------
for L1, L2, d1, d2 in product(lengths, lengths, unsat_choices, unsat_choices):
    interior1 = range(1, L1)  # skip bond 0 (core-C1)
    interior2 = range(1, L2)
    if d1 > len(interior1) or d2 > len(interior2):
        continue
    for db1 in combinations(interior1, d1):
        if not spacing_ok(db1):
            continue
        for db2 in combinations(interior2, d2):
            if not spacing_ok(db2):
                continue
            try:
                smi = build_branched_tail(L1, L2, db1, db2)
                Chem.SanitizeMol(Chem.MolFromSmiles(smi))
                frags.add(smi)
            except Exception:
                pass

frags_tails = [Chem.CanonSmiles(f) for f in frags if Chem.MolFromSmiles(f) is not None]
names_tails = ["Tail_fragment"] * len(frags_tails)
tails_df = pd.DataFrame([frags_tails, names_tails], index=["SMILES", "NAME"]).T
tails_df.to_csv("tail_frags.smi", index=False, header=False, sep=" ")
Note that this type of combinatorial expansion is much easier to do computationally than in an assay, playing to the strengths of virtual screening.
The implementation involves helper functions build_linear_tail and build_branched_tail, which automatically generate tails of a particular nature. When all the combinations are expanded, we have a set of many hundreds of tail groups that can be used in the screening.
import mols2grid

mols2grid.display([Chem.MolFromSmiles(x) for x in frags_tails], n_cols=7, n_rows=4)
This is too many structures to enumerate
These modifications are sensible for the field, yet they apply narrowly to the single base structure SM-102 (we could do the same for any of the hundreds of other purchasable lipids of interest, for instance).
But we have already made our way into a combinatorial explosion.
# Write this in scientific notation
total_size = len(frags_heads) * len(frags_tails) * len(frags_tails)
print(f"Total size of the design space: {total_size:.2e} structures.")
Total size of the design space: 4.23e+11 structures.
Scoring a subset of structures
In a drug development pipeline, the overall goal is to pick structures which do well in experiments. Since experiments are often resource-intensive and costly to perform, AI in drug development seeks to eliminate the need for testing unviable structures, and more quickly focus on promising regions of chemical space for further development. Accordingly, there are many models in drug discovery that predict everything from physicochemical properties to in vivo responses and the results of high-throughput assays.
A model to score these structures
We’ll demonstrate this for LNP development, using our recurring example from previous posts. We’ll use a model from a recent, popular paper (Xu et al. 2024) that attempts to predict the transfection efficiency of an LNP containing a given ionizable lipid.
To do this fully, we’ll need to train such a model on a set of known LNPs with measured transfection efficiencies. A popular resource here is the AGILE study (Xu et al. 2024), which provides a dataset of 1200 LNPs with their transfection efficiencies. It also provides a model pretrained on this dataset; with some effort, this can be downloaded and used to score the giant virtual space we have generated.
For now, we’ll train a new transfection-efficiency model on the 1200 structures in the AGILE dataset.
# import torch, yaml, argparse
# from rdkit import Chem
# from torch_geometric.data import Batch
# from models.graph_model import GraphModel            # model wrapper in AGILE
# from utils.mol_graph import mol_to_graph_data_obj    # RDKit -> PyG graph

# def load_model(ckpt_path: str):
#     ckpt = torch.load(ckpt_path, map_location="cpu")
#     hparams = ckpt["hyper_parameters"]["model_params"]
#     net = GraphModel(**hparams)
#     net.load_state_dict(ckpt["state_dict"], strict=False)
#     net.eval()
#     return net

# https://github.com/bowang-lab/AGILE/raw/refs/heads/main/ckpt/pretrained_agile_60k/checkpoints/model.pth
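Since retraining the full AGILE graph model is beyond the scope of this post, here is a minimal stand-in sketch of "train a new model on the AGILE dataset": a Morgan-fingerprint ridge regression on (SMILES, transfection efficiency) pairs. The file name agile_lnp_data.csv and its column names are assumptions; substitute the actual AGILE data export:

import numpy as np
import pandas as pd
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.linear_model import Ridge

# Assumed columns: SMILES, transfection (swap in the real AGILE export)
df = pd.read_csv("agile_lnp_data.csv")

def featurize(smiles, n_bits=2048):
    """Morgan fingerprint (radius 2) as a dense numpy vector."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=2, nBits=n_bits)
    arr = np.zeros(n_bits, dtype=np.float64)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

X = np.stack([featurize(s) for s in df["SMILES"]])
y = df["transfection"].to_numpy()

surrogate = Ridge(alpha=1.0).fit(X, y)
print("train R^2:", surrogate.score(X, y))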
Thompson sampling: act according to your beliefs
Thompson sampling (TS) is a decision-making method used to efficiently learn which option is best when facing uncertain outcomes. It was originally developed in the context of clinical trials, and is now widely used in machine learning and experimental design. It balances two paradigmatic goals in tension:
Exploration: Testing uncertain or less-studied compounds to improve overall knowledge.
Exploitation: Focusing on compounds already showing promise to efficiently optimize outcomes.
Thompson sampling works with a distribution over experimental outcomes representing the scientist’s current beliefs. At any point in time, these beliefs can be sampled to suggest a best course of action (e.g. compounds to test next). Thompson sampling consists of repeatedly taking this action and updating the belief distribution according to Bayes’ rule.
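To make this loop concrete, here is a toy sketch (not the Klarich et al. implementation): Gaussian beliefs over each compound’s mean activity are sampled, acted on greedily, and updated by Bayes’ rule after each noisy "assay". All numbers here are made up for illustration:

import numpy as np

rng = np.random.default_rng(0)
true_activity = rng.uniform(0, 1, size=20)   # hidden ground truth
mu = np.zeros(20)                            # belief means
var = np.ones(20)                            # belief variances
noise_var = 0.05                             # assumed assay noise

for step in range(200):
    # 1. Sample a plausible activity for every compound from current beliefs
    sampled = rng.normal(mu, np.sqrt(var))
    # 2. Act greedily on the sample: "assay" the apparent best compound
    pick = int(np.argmax(sampled))
    obs = true_activity[pick] + rng.normal(0, np.sqrt(noise_var))
    # 3. Bayes update of the Gaussian belief for the tested compound
    precision = 1 / var[pick] + 1 / noise_var
    mu[pick] = (mu[pick] / var[pick] + obs / noise_var) / precision
    var[pick] = 1 / precision

print("best by belief:", int(np.argmax(mu)), "| true best:", int(np.argmax(true_activity)))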
It is well-understood, both empirically and theoretically, that such a strategy can work even if the initial belief distribution is dramatically misspecified. As more data is collected, the belief distribution converges to the true distribution of outcomes, and the sampling strategy becomes more focused on the options that are truly best.
A typical experimental context involves screening compounds to identify which yield the highest activity. Initially the experimentalist has little information, and belief distributions have high uncertainty, so Thompson sampling often samples high values even for options with lower estimated means, encouraging exploration. If a compound performs well, the experimentalist’s belief in its future performance shifts upward, increasing the likelihood that future samples will lead to selecting it again. Conversely, poor performance lowers future expectations. Over time, this naturally guides experiments toward the most promising compounds, while still periodically exploring less-understood ones to confirm or revise the current understanding. Eventually, the uncertainty ceases to dominate, and sampling tends to favor options with better estimated outcomes.
This approach elegantly and naturally balances exploration and exploitation, much as method development is often approached anyway: initially trying diverse approaches before refining the most promising ones.
The TS algorithm is one of the workhorses of modern machine learning; much of the craft lies in making it work well in practice. In this post, we’ll construct a truly vast library of ionizable lipids from the literature and perform this virtual scoring effort in a realistic lipid nanoparticle engineering context. This setting is paradigmatic of many small-molecule drug discovery subfields, combining very broad applicability with a truly vast design space.
To finish setting up the TS algorithm, we need to precisely define the scoring function that will be used to evaluate the structures. The implementation we use requires the evaluation function to be defined separately in its evaluators.py file.
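As a sketch, a scorer for our setting might be wrapped like this. This assumes the Evaluator interface (an evaluate(mol) method and a counter property) following the evaluators.py pattern in the Klarich et al. TS repository; load_model and predict_transfection are hypothetical stand-ins for the AGILE scoring pieces sketched above:

from rdkit import Chem
from evaluators import Evaluator  # base class in the TS repository

class TransfectionEvaluator(Evaluator):
    """Score an assembled ionizable lipid with a transfection-efficiency model."""

    def __init__(self, ckpt_path):
        self.num_evaluations = 0
        self.model = load_model(ckpt_path)  # hypothetical loader, as above

    @property
    def counter(self):
        # number of structures scored so far
        return self.num_evaluations

    def evaluate(self, mol):
        self.num_evaluations += 1
        # hypothetical wrapper around the model's forward pass
        return float(predict_transfection(self.model, Chem.MolToSmiles(mol)))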
Klarich, Kathryn, Brian Goldman, Trevor Kramer, Patrick Riley, and W. Patrick Walters. 2024. “Thompson Sampling─An Efficient Method for Searching Ultralarge Synthesis on Demand Databases.” Journal of Chemical Information and Modeling 64 (4): 1158–71.
Xu, Yue, Shihao Ma, Haotian Cui, Jingan Chen, Shufen Xu, Fanglin Gong, Alex Golubovic, et al. 2024. “AGILE Platform: A Deep Learning Powered Approach to Accelerate LNP Development for mRNA Delivery.” Nature Communications 15 (1): 6305.