CODE
= "combinatorial_chemistry_decomposition_tools.py"
decomp_tools_path
from importlib.machinery import SourceFileLoader
= SourceFileLoader("ccd_tools", decomp_tools_path).load_module() ccd_tools
Akshay Balsubramani
This post follows up on another post which sets up how to decompose chemical structures in silico into fragments. That post shows how a single common fragment composition reaction (esterification) can be viewed in reverse as a cleavage reaction of ester linkages, and in this manner can be repeatedly applied to ionizable lipids in the literature to yield a common set of building block fragments.
Here we’ll focus on putting together a fairly comprehensive set of reactions in ionizable lipid design for LNPs. These reactions collectively form a nearly universal set for:
We pick up where the previous post left off. The first step is to recall the previous post’s code, which was saved to a .py file at the end of that post. We can import this code from the assumed file name combinatorial_chemistry_decomposition_tools.py
. For explanations and documentation, refer to that post.
The esterification/saponification example outlined in the other post shows clearly how a given reaction principle can be applied in reverse to decompose a molecule into fragments, which can be reconstituted for even greater variety.
from rdkit import Chem
from PIL import Image
import io, os
rxn_smarts_dict = {}
rxn_smarts_dict['esterification'] = "[#6:4][C:2](=[O:3])[OX2H1:1].[#6X4,c;!$(C([OX2H])[O,S,#7,#15]):6][OX2H1]>>[#6:6][O:1][C:2](=[O:3])[#6:4]"
bio = io.BytesIO(ccd_tools.draw_reaction(rxn_smarts_dict["amide_coupling"]))
Image.open(bio)
NameError: name 'ccd_tools' is not defined
Another very useful reaction is amide formation using a primary amine with an acid to make an amide (implying the reverse amide cleavage reaction).
We can test all of this on standard LNP ionizable lipids, such as ALC-0315, which has the following SMILES, retrieved from the known_IL_combined
file written in this post.
We’ve compiled a fairly comprehensive set of the known ILs of commercial relevance in drug development, comprising the majority of known named LNPs. This is procured from purchasable LNP catalogs in another post.
SMILES | catalog | name | |
---|---|---|---|
0 | CC(C)=CCCC(C)CCSCC(C)C(=O)OCCOC(=O)CCN(CCCN(C)... | Medchem | 4A3-Cit |
1 | CC(C)CCCC(C)CCCC(C)CC(=O)OCC(COC(=O)CCCN(C)C)(... | Broadpharm | BP Lipid 517 |
2 | CC(C)CCCC(C)CCCC(C)CC(=O)OCC(COC(=O)CCN1CC(O)C... | Broadpharm | BP Lipid 533 |
3 | CC(C)CCCC(C)CCCC(C)CC(=O)OCC(COC(=O)CCN1CCCC1)... | Broadpharm | BP Lipid 521 |
4 | CC(C)CCCC(C)CCCC(C)CC(=O)OCC(COC(=O)CCN1CCN(C)... | Broadpharm | BP Lipid 525 |
... | ... | ... | ... |
509 | CCCCSSCCCCCCOC(=O)CCNCCCN(C)CCCN(CCC(=O)OCCCCC... | Medchem | Lipid 9 |
510 | [2H]C(COC(=O)CCCCCCCCCCCCCCCCC)(CN(C)C)OC(=O)C... | Broadpharm | 18:0 DAP |
511 | [2H]N(C(=O)C(CCCCCOC(=O)CCCCCCC/C=C\CCCCCCCC)N... | Broadpharm | 119-23 |
512 | [2H]N(CCN(C)C)C(=O)C(CCSCCC(=O)OCCCCCCCCCCCCCC... | Broadpharm | CP-LC-1074 |
513 | [3H]N(C(=O)CCCCC(CCSCCC(=O)OCCCCCC(C)C)SCCC(=O... | Broadpharm | CP-LC-1428 |
514 rows × 3 columns
This is a set of >500 structures comprising ionizable lipid leads and lead candidates in various research programs. We can now decompose all these molecules jointly at all their ester linkages and examine the fragments that result. Notably, very few distinct fragments are required to describe these ionizable lipids.
We will show this now along with an implementation. To set this up, some utility code processes the molecule info dictionary to determine the fragment composition of any of the input structures. The first step to doing this is simply to plot how many fragments the structures have by this decomposition. This is a reflection of the number of ester linkages in the molecule, because every one is cleaved by this in silico reaction.
def process_molecule(smiles, all_molecule_info):
# Returns a list of fragments comprising the molecule.
if smiles not in all_molecule_info:
return [smiles]
components = all_molecule_info[smiles]['components']
return_list = []
for component in components:
return_list.extend(process_molecule(component, all_molecule_info))
return return_list
all_smiles = list(all_structures['SMILES'])
all_fragments, all_molecule_info = full_decompose(
all_smiles, reaction_types=["esterification"], one_reaction_only=True
)
print("{} total fragments represent {} structures: ".format(len(all_fragments), len(all_smiles)))
import matplotlib.pyplot as plt
plt.hist([len(process_molecule(x, all_molecule_info)) for x in all_smiles])
plt.xlabel("Number of fragments per molecule")
plt.ylabel("Number of molecules")
plt.title("Fragmentation of molecules by saponification")
plt.show()
392 total fragments represent 514 structures:
We see that some structures are not decomposable in this manner because they lack ester linkages, but these constitute <15% of the structures.
The rest have more fragments, with the mode being structures with 3 fragments. By itself, this attests to the structure of the chemical space explored so far. In this manner, we can almost universally express the modular structure of ionizable lipids.
This code produces infix expressions with R1
as the placeholder for the reaction (esterification) and F0
, F1
, … as the placeholders for the fragments.
Infix expression for SM-102: (_(_F82_R1_F57_)_R1_F145_)
Displaying these fragments, we see that they are the same ones obtained in the use of full_decompose
above.
As I mentioned, parse trees and infix strings are essentially equivalent, and we can convert any infix string to the corresponding SMILES as well with knowledge of the decomposition’s fragment IDs. The algorithm to do this is a classic application of stacks devised by the computer scientist Edsger Dijkstra, called the shunting yard algorithm.
The algorithm uses two stacks: the reaction stack and the output queue. Reading tokens left‑to‑right, it immediately pushes SMILES fragments/structures onto the output. When it encounters a reaction, it can compare its precedence and associativity with the reaction on the top of the reaction stack, and accordingly pop operators on the stack to the output before the new operator is pushed. Left parentheses are pushed onto the reaction stack to mark sub‑expressions; when a right parenthesis appears, reactions are popped to the output until the matching left parenthesis is removed to finish evaluating the sub-expression. After all tokens are processed, any remaining reactions are popped to the output queue. The resulting postfix sequence can be evaluated in a single pass with one stack, giving the final SMILES result.
import random, itertools
def evaluate_expression(
infix_str, frags_program,
random_choice=None, neighborhood_size=10,
rseed=42
):
"""
Returns a list of SMILES strings that are the possible results of evaluating the infix expression infix_str.
frags_program is a list of SMILES strings that correspond to fragment IDs in the expression.
random_choice is the number of random fragments to subsample and add to the expression (None for no subsampling).
frag_hints is a dictionary mapping each fragment SMILES string to a list of SMILES denoting similar fragments.
basic_mode is a flag that, if True, doesn't vary around any arguments and just synthesizes the string as given.
"""
tokens = infix_str.split('_')
# Base case: if token is a SMILES, return its value
if len(tokens) == 1:
if tokens[0].startswith('F'): # It's a fragment
this_frag_smiles = frags_program[int(tokens[0][1:])]
toret = [ this_frag_smiles ] # Add the fragment that constituted the original structure
frags_to_add = []
if random_choice is not None:
random.seed(rseed)
frags_to_add = random.choices(frags_to_add, k=random_choice)
toret.extend(frags_to_add)
possible_results = [Chem.CanonSmiles(x) for x in list(set(toret))]
else: # It's a SMILES string
possible_results = [ Chem.CanonSmiles(tokens[0]) ]
return possible_results
# Handle Parentheses
for i, ch in enumerate(tokens):
if ch == '(':
start = i
elif ch == ')':
end = i
mid_values = evaluate_expression(
'_'.join(tokens[start+1:end]), frags_program,
random_choice=random_choice,
neighborhood_size=neighborhood_size, rseed=rseed)
toret = []
leftstr = '_'.join(tokens[:start])
rightstr = '_'.join(tokens[end+1:])
for left_val in mid_values:
strlst = []
if len(leftstr) > 0:
strlst.append(leftstr)
strlst.append(str(left_val))
if len(rightstr) > 0:
strlst.append(rightstr)
toret.extend(evaluate_expression(
'_'.join(strlst), frags_program, random_choice=random_choice,
neighborhood_size=neighborhood_size, rseed=rseed))
return toret
# Handle reactions
for i in reversed(range(len(tokens))):
if tokens[i].startswith('R'):
toret = []
left_vals = evaluate_expression(
'_'.join(tokens[:i]), frags_program, random_choice=random_choice,
neighborhood_size=neighborhood_size, rseed=rseed)
right_vals = evaluate_expression(
'_'.join(tokens[i+1:]), frags_program, random_choice=random_choice,
neighborhood_size=neighborhood_size, rseed=rseed)
reaction_code = tokens[i]
for comp1, comp2 in itertools.product(left_vals, right_vals):
reactants = [comp1, comp2]
reaction_name = rxn_names[reaction_code]
products = react_fragments(
reactants, reaction_type=reaction_name, decompose_mode=False, one_reaction_only=False)
toret.extend(list(set(list_flatten(products))))
return toret
Now that we’ve set up synthesis and decomposition routines, it’s interesting to use them together in larger pipelines. But this is not straightforward. Decomposition is not unique – there are many ways to decompose a structure, corresponding to different orders of peeling off the fragments. The same is true for synthesis, with reactive sites not being unique either.
For our purposes, the in silico decomposition reaction should be reversible. This means that the decomposition reaction should be possible to run in reverse, and the products should be possible to reassemble into the original structure. Otherwise, the decomposition into fragments loses valuable information about the structure.
We can check that when we take a structure and decompose it into fragments, we can run these fragments through the appropriate synthesis routines and recover the structure. This is useful to check for in general, and we start with the already-decomposed structure of SM-102.
There are two structures instead of one when we reconstitute the fragments. It’s worth understanding why – this is because they can react on either of the two carboxylic acid groups on the head, which is asymmetric. Crucially, SM-102 itself is in the reconstituted set:
This is a key desirable property to keep in mind when we start coupling decomposition and synthesis to create combinatorially varied virtual spaces:
When we decompose a structure and reconstitute the fragments (with the same reaction settings), we always recover at least the original structure.
We can now decompose all these structures at once to get a sense of which fragments they share, and sanity-check our insights so far.
def decompose_dataset(
input_lipids,
reaction_types=["esterification"]
):
input_lipids = [Chem.CanonSmiles(x) for x in input_lipids]
frags_program = input_lipids
oldlen = 0
while oldlen != len(frags_program):
oldlen = len(frags_program)
frags_program, molecule_info_program = full_decompose(
frags_program,
reaction_types=reaction_types,
one_reaction_only=True,
print_debug=False,
)
input_lipid_desc = {}
for lpd in input_lipids:
input_lipid_desc[lpd] = generate_infix_smiles_decomp(
lpd, molecule_info_program, frags_program
)
return input_lipid_desc, molecule_info_program, frags_program
# Get the fragment counts for each lipid in input_lipid_desc. This should be a dictionary mapping fragment names to counts in the lipid.
def compute_fragment_counts(
input_lipid_desc,
frags_program,
default_pfx = 'all_chems/'
):
all_frags = []
uq_lipid_frags = []
aggregate_counts = {}
for smi in input_lipid_desc:
these_counts = np.zeros(len(frags_program)).astype(int)
tag = input_lipid_desc[smi]
frags = [x for x in tag.split('_') if x.startswith('F')]
all_frags += frags
uq_lipid_frags += list(set(frags))
for f in frags:
frags_program_index = int(f[1:])
these_counts[frags_program_index] += 1
aggregate_counts[smi] = these_counts
uaf = np.unique(all_frags, return_counts=True)
uaf_dict = dict(zip(uaf[0], uaf[1]))
ulf = np.unique(uq_lipid_frags, return_counts=True)
ulf_dict = dict(zip(ulf[0], ulf[1]))
name_arr = ['F' + str(x) for x in range(len(frags_program))]
lipid_tags_df = pd.DataFrame.from_dict({'SMILES': input_lipid_desc.keys(), 'Tag': input_lipid_desc.values()})
rxn_codes_df = pd.DataFrame.from_dict({'Reaction Name': rxn_codes.keys(), 'Reaction code': rxn_codes.values()})
#print(uaf_dict)
frag_count_df = pd.DataFrame.from_dict({
'Name': name_arr,
'SMILES': frags_program,
'Count_in_library': [uaf_dict.get(x, 0) for x in name_arr],
'Number_of_unique_lipids': [ulf_dict.get(x, 0) for x in name_arr]
})
frag_count_df = pd.concat([frag_count_df, pd.DataFrame.from_dict(aggregate_counts)], axis=1)
if default_pfx is not None:
if not os.path.exists(default_pfx):
os.makedirs(default_pfx)
lipid_tags_df.to_csv(f'{default_pfx}lipid_tags.tsv', sep='\t', index=False)
rxn_codes_df.to_csv(f'{default_pfx}rxn_codes.tsv', sep='\t', index=False)
frag_count_df.to_csv(f'{default_pfx}fragments.tsv', sep='\t', index=False)
return lipid_tags_df, frag_count_df, rxn_codes_df
How much are we gaining by decomposing structures in this straightforward manner? It depends on the set of structures and how much redundancy they share. For a varied set such as those in the catalogs here which are discovered through completely different pipelines and routes, the amount of redundancy we would expect is not high.
It is surprising therefore that the decompositions are nontrivial, as we saw with most structures being cleaved. This results in significantly lighter (less complex) fragments, which can be easily plotted.
import matplotlib.pyplot as plt
input_lipid_desc, molecule_info_program, frags_program = decompose_dataset(all_smiles, reaction_types=['esterification'])
lipid_tags_df, frag_count_df, rxn_codes_df = compute_fragment_counts(input_lipid_desc, frags_program)
frag_count_df.set_index('SMILES', inplace=True)
newcols = list(frag_count_df.columns)
for i in range(3, frag_count_df.shape[1]):
sndx = np.where(np.array(all_smiles) == frag_count_df.columns[i])[0]
if len(sndx) > 0:
newcols[i] = all_structures['name'][sndx[0]]
frag_count_df.columns = newcols
newrows = list(frag_count_df.index)
for i in range(frag_count_df.shape[0]):
sndx = np.where(np.array(all_smiles) == frag_count_df.columns[i])[0]
if len(sndx) > 0:
newcols[i] = all_structures['name'][sndx[0]]
frag_count_df.index = newrows
import rdkit.Chem.Descriptors
data_chemwts = [Chem.Descriptors.ExactMolWt(Chem.MolFromSmiles(x)) for x in all_smiles]
data_fragwts = [Chem.Descriptors.ExactMolWt(Chem.MolFromSmiles(x)) for x in frags_program]
data1 = data_chemwts + data_fragwts
lbls1 = ['structures']*len(data_chemwts) + ['fragments']*len(data_fragwts)
def plot_hist_molwts(data1, data2, fam_name):
plt.figure(figsize=(8, 4))
plt.hist(data1, bins=30, label=f'Lipids (n = {len(data1)})', alpha=0.5, density=True)
plt.hist(data2, bins=30, label=f'Fragments (n = {len(data2)})', alpha=0.5, density=True)
plt.xlabel("Molecular weight")
plt.title(f"{fam_name}: Distribution of molecular weights")
plt.legend()
plt.show()
plot_hist_molwts(data_chemwts, data_fragwts, 'All lipids')
Unfortunately, it is clear that there are still a few very heavy fragments, resulting from lipids that are not cleaved:
from rdkit.Chem import Descriptors
desc_molwt_ndces = np.argsort([Descriptors.MolWt(Chem.MolFromSmiles(x)) for x in frags_program])[::-1]
mols2grid.display([Chem.MolFromSmiles(x) for x in np.array(frags_program)[desc_molwt_ndces]],
mol_col="mol", n_cols=7, n_rows=4,
title="Fragments sorted by molecular weight (descending order)")
The first few dozen of these fragments consist of many huge structures that must each be painstakingly constructed, and therefore are appropriately viewed as compositions of fragments themselves, through other non-esterification reactions. In order to decompose using multiple reactions at once, we refer to another follow-up post.