Decompositions with a flexible reaction set

cheminformatics

LNP

Combinatorial chemistry and in silico decompositions, part 2

Author

Akshay Balsubramani

Introduction

This post follows up on another post which sets up how to decompose chemical structures in silico into fragments. That post shows how a single common fragment composition reaction (esterification) can be viewed in reverse as a cleavage reaction of ester linkages, and in this manner can be repeatedly applied to ionizable lipids in the literature to yield a common set of building block fragments.

Here we’ll focus on putting together a fairly comprehensive set of reactions in ionizable lipid design for LNPs. These reactions collectively form a nearly universal set for:

Getting combinatorial libraries of fragments through iterated decompositions
Ligating these fragments in chemically sensible ways to put together even more variety

We pick up where the previous post left off. The first step is to recall the previous post’s code, which was saved to a .py file at the end of that post. We can import this code from the assumed file name combinatorial_chemistry_decomposition_tools.py. For explanations and documentation, refer to that post.

CODE

decomp_tools_path = "combinatorial_chemistry_decomposition_tools.py"

from importlib.machinery import SourceFileLoader
ccd_tools = SourceFileLoader("ccd_tools", decomp_tools_path).load_module()

Establishing a set of reactions

The esterification/saponification example outlined in the other post shows clearly how a given reaction principle can be applied in reverse to decompose a molecule into fragments, which can be reconstituted for even greater variety.

CODE

from rdkit import Chem
from PIL import Image
import io, os

rxn_smarts_dict = {}
rxn_smarts_dict['esterification'] = "[#6:4][C:2](=[O:3])[OX2H1:1].[#6X4,c;!$(C([OX2H])[O,S,#7,#15]):6][OX2H1]>>[#6:6][O:1][C:2](=[O:3])[#6:4]"

bio = io.BytesIO(ccd_tools.draw_reaction(rxn_smarts_dict["amide_coupling"]))
Image.open(bio)

NameError: name 'ccd_tools' is not defined

Another very useful reaction is amide formation using a primary amine with an acid to make an amide (implying the reverse amide cleavage reaction).

CODE

rxn_smarts_dict["amide_coupling"] = "[#6:1][C:2](=[O:3])[OX2H1:4].[#6:6][NX3;H2;!$(NC=O):5]>>[#6:6][NX3R0;H1:5][CR0:2](=[O:3])[#6:1]"

bio = io.BytesIO(ccd_tools.draw_reaction(rxn_smarts_dict["amide_coupling"]))
Image.open(bio)

We can test all of this on standard LNP ionizable lipids, such as ALC-0315, which has the following SMILES, retrieved from the known_IL_combined file written in this post.

CODE

import pandas as pd
all_structures = pd.read_csv('known_IL_combined.tsv', sep='\t', header=None, names=['SMILES', 'catalog', 'name'])

smiles_sm102 = all_structures['SMILES'][all_structures['name'] == 'SM-102'].values[0]
Chem.MolFromSmiles(smiles_sm102)

Decomposing known chemical structures

We’ve compiled a fairly comprehensive set of the known ILs of commercial relevance in drug development, comprising the majority of known named LNPs. This is procured from purchasable LNP catalogs in another post.

CODE

all_structures = pd.read_csv('known_IL_combined.tsv', sep='\t', header=None, names=['SMILES', 'catalog', 'name'])
all_structures

	SMILES	catalog	name
0	CC(C)=CCCC(C)CCSCC(C)C(=O)OCCOC(=O)CCN(CCCN(C)...	Medchem	4A3-Cit
1	CC(C)CCCC(C)CCCC(C)CC(=O)OCC(COC(=O)CCCN(C)C)(...	Broadpharm	BP Lipid 517
2	CC(C)CCCC(C)CCCC(C)CC(=O)OCC(COC(=O)CCN1CC(O)C...	Broadpharm	BP Lipid 533
3	CC(C)CCCC(C)CCCC(C)CC(=O)OCC(COC(=O)CCN1CCCC1)...	Broadpharm	BP Lipid 521
4	CC(C)CCCC(C)CCCC(C)CC(=O)OCC(COC(=O)CCN1CCN(C)...	Broadpharm	BP Lipid 525
...	...	...	...
509	CCCCSSCCCCCCOC(=O)CCNCCCN(C)CCCN(CCC(=O)OCCCCC...	Medchem	Lipid 9
510	[2H]C(COC(=O)CCCCCCCCCCCCCCCCC)(CN(C)C)OC(=O)C...	Broadpharm	18:0 DAP
511	[2H]N(C(=O)C(CCCCCOC(=O)CCCCCCC/C=C\CCCCCCCC)N...	Broadpharm	119-23
512	[2H]N(CCN(C)C)C(=O)C(CCSCCC(=O)OCCCCCCCCCCCCCC...	Broadpharm	CP-LC-1074
513	[3H]N(C(=O)CCCCC(CCSCCC(=O)OCCCCCC(C)C)SCCC(=O...	Broadpharm	CP-LC-1428

514 rows × 3 columns

This is a set of >500 structures comprising ionizable lipid leads and lead candidates in various research programs. We can now decompose all these molecules jointly at all their ester linkages and examine the fragments that result. Notably, very few distinct fragments are required to describe these ionizable lipids.

We will show this now along with an implementation. To set this up, some utility code processes the molecule info dictionary to determine the fragment composition of any of the input structures. The first step to doing this is simply to plot how many fragments the structures have by this decomposition. This is a reflection of the number of ester linkages in the molecule, because every one is cleaved by this in silico reaction.

CODE

def process_molecule(smiles, all_molecule_info):
    # Returns a list of fragments comprising the molecule.
    if smiles not in all_molecule_info:
        return [smiles]
    components = all_molecule_info[smiles]['components']
    return_list = []
    for component in components:
        return_list.extend(process_molecule(component, all_molecule_info))
    return return_list

CODE

all_smiles = list(all_structures['SMILES'])
all_fragments, all_molecule_info = full_decompose(
    all_smiles, reaction_types=["esterification"], one_reaction_only=True
)
print("{} total fragments represent {} structures: ".format(len(all_fragments), len(all_smiles)))

import matplotlib.pyplot as plt
plt.hist([len(process_molecule(x, all_molecule_info)) for x in all_smiles])
plt.xlabel("Number of fragments per molecule")
plt.ylabel("Number of molecules")
plt.title("Fragmentation of molecules by saponification")
plt.show()

392 total fragments represent 514 structures:

We see that some structures are not decomposable in this manner because they lack ester linkages, but these constitute <15% of the structures.

The rest have more fragments, with the mode being structures with 3 fragments. By itself, this attests to the structure of the chemical space explored so far. In this manner, we can almost universally express the modular structure of ionizable lipids.

This code produces infix expressions with R1 as the placeholder for the reaction (esterification) and F0, F1, … as the placeholders for the fragments.

CODE

rxn_str = generate_infix_smiles_decomp(
    smiles_sm102, all_molecule_info, all_fragments
)
print("Infix expression for SM-102:\t{}".format(rxn_str))

Infix expression for SM-102:    (_(_F82_R1_F57_)_R1_F145_)

Displaying these fragments, we see that they are the same ones obtained in the use of full_decompose above.

CODE

frag_ndces = [int(x[1:]) for x in rxn_str.split("_") if x.startswith("F")]
list_prods = np.array(all_fragments)[frag_ndces]
mols2grid.display([Chem.MolFromSmiles(x) for x in list_prods],mol_col="mol", n_cols=7, n_rows=4)

Infix strings → structures

As I mentioned, parse trees and infix strings are essentially equivalent, and we can convert any infix string to the corresponding SMILES as well with knowledge of the decomposition’s fragment IDs. The algorithm to do this is a classic application of stacks devised by the computer scientist Edsger Dijkstra, called the shunting yard algorithm.

The algorithm uses two stacks: the reaction stack and the output queue. Reading tokens left‑to‑right, it immediately pushes SMILES fragments/structures onto the output. When it encounters a reaction, it can compare its precedence and associativity with the reaction on the top of the reaction stack, and accordingly pop operators on the stack to the output before the new operator is pushed. Left parentheses are pushed onto the reaction stack to mark sub‑expressions; when a right parenthesis appears, reactions are popped to the output until the matching left parenthesis is removed to finish evaluating the sub-expression. After all tokens are processed, any remaining reactions are popped to the output queue. The resulting postfix sequence can be evaluated in a single pass with one stack, giving the final SMILES result.

CODE

import random, itertools


def evaluate_expression(
    infix_str, frags_program, 
    random_choice=None, neighborhood_size=10, 
    rseed=42
):
    """
    Returns a list of SMILES strings that are the possible results of evaluating the infix expression infix_str. 
    frags_program is a list of SMILES strings that correspond to fragment IDs in the expression. 
    random_choice is the number of random fragments to subsample and add to the expression (None for no subsampling).
    frag_hints is a dictionary mapping each fragment SMILES string to a list of SMILES denoting similar fragments.
    basic_mode is a flag that, if True, doesn't vary around any arguments and just synthesizes the string as given.
    """
    tokens = infix_str.split('_')
    # Base case: if token is a SMILES, return its value
    if len(tokens) == 1:
        if tokens[0].startswith('F'):     # It's a fragment
            this_frag_smiles = frags_program[int(tokens[0][1:])]
            toret = [ this_frag_smiles ]    # Add the fragment that constituted the original structure
            frags_to_add = []
            if random_choice is not None:
                random.seed(rseed)
                frags_to_add = random.choices(frags_to_add, k=random_choice)
            toret.extend(frags_to_add)
            possible_results = [Chem.CanonSmiles(x) for x in list(set(toret))]
        else:     # It's a SMILES string
            possible_results = [ Chem.CanonSmiles(tokens[0]) ]
        return possible_results
    
    # Handle Parentheses
    for i, ch in enumerate(tokens):
        if ch == '(':
            start = i
        elif ch == ')':
            end = i
            mid_values = evaluate_expression(
                '_'.join(tokens[start+1:end]), frags_program, 
                random_choice=random_choice, 
                neighborhood_size=neighborhood_size, rseed=rseed)
            toret = []
            leftstr = '_'.join(tokens[:start])
            rightstr = '_'.join(tokens[end+1:])
            for left_val in mid_values:
                strlst = []
                if len(leftstr) > 0:
                    strlst.append(leftstr)
                strlst.append(str(left_val))
                if len(rightstr) > 0:
                    strlst.append(rightstr)
                toret.extend(evaluate_expression(
                    '_'.join(strlst), frags_program, random_choice=random_choice, 
                    neighborhood_size=neighborhood_size, rseed=rseed))
            return toret

    # Handle reactions
    for i in reversed(range(len(tokens))):
        if tokens[i].startswith('R'):
            toret = []
            left_vals = evaluate_expression(
                '_'.join(tokens[:i]), frags_program, random_choice=random_choice, 
                neighborhood_size=neighborhood_size, rseed=rseed)
            right_vals = evaluate_expression(
                '_'.join(tokens[i+1:]), frags_program, random_choice=random_choice, 
                neighborhood_size=neighborhood_size, rseed=rseed)
            reaction_code = tokens[i]
            for comp1, comp2 in itertools.product(left_vals, right_vals):
                reactants = [comp1, comp2]
                reaction_name = rxn_names[reaction_code]
                products = react_fragments(
                    reactants, reaction_type=reaction_name, decompose_mode=False, one_reaction_only=False)
                toret.extend(list(set(list_flatten(products))))
            return toret

From synthesis to decomposition and back again

Now that we’ve set up synthesis and decomposition routines, it’s interesting to use them together in larger pipelines. But this is not straightforward. Decomposition is not unique – there are many ways to decompose a structure, corresponding to different orders of peeling off the fragments. The same is true for synthesis, with reactive sites not being unique either.

Decomposition and synthesis: in silico inverses of each other

For our purposes, the in silico decomposition reaction should be reversible. This means that the decomposition reaction should be possible to run in reverse, and the products should be possible to reassemble into the original structure. Otherwise, the decomposition into fragments loses valuable information about the structure.

We can check that when we take a structure and decompose it into fragments, we can run these fragments through the appropriate synthesis routines and recover the structure. This is useful to check for in general, and we start with the already-decomposed structure of SM-102.

CODE

lps = set([Chem.CanonSmiles(x) for x in evaluate_expression(rxn_str, all_fragments)])
mols2grid.display([Chem.MolFromSmiles(x) for x in lps], n_cols=7, n_rows=4)

There are two structures instead of one when we reconstitute the fragments. It’s worth understanding why – this is because they can react on either of the two carboxylic acid groups on the head, which is asymmetric. Crucially, SM-102 itself is in the reconstituted set:

CODE

print(smiles_sm102 in lps)

True

This is a key desirable property to keep in mind when we start coupling decomposition and synthesis to create combinatorially varied virtual spaces:

When we decompose a structure and reconstitute the fragments (with the same reaction settings), we always recover at least the original structure.

Decompose all chemicals at once

We can now decompose all these structures at once to get a sense of which fragments they share, and sanity-check our insights so far.

CODE

def decompose_dataset(
    input_lipids,
    reaction_types=["esterification"]
):
    input_lipids = [Chem.CanonSmiles(x) for x in input_lipids]
    frags_program = input_lipids
    oldlen = 0
    while oldlen != len(frags_program):
        oldlen = len(frags_program)
        frags_program, molecule_info_program = full_decompose(
            frags_program,
            reaction_types=reaction_types,
            one_reaction_only=True,
            print_debug=False,
        )
    
    input_lipid_desc = {}
    for lpd in input_lipids:
        input_lipid_desc[lpd] = generate_infix_smiles_decomp(
            lpd, molecule_info_program, frags_program
        )
    return input_lipid_desc, molecule_info_program, frags_program


# Get the fragment counts for each lipid in input_lipid_desc. This should be a dictionary mapping fragment names to counts in the lipid.
def compute_fragment_counts(
    input_lipid_desc, 
    frags_program, 
    default_pfx = 'all_chems/'
):
    all_frags = []
    uq_lipid_frags = []
    aggregate_counts = {}
    for smi in input_lipid_desc:
        these_counts = np.zeros(len(frags_program)).astype(int)
        tag = input_lipid_desc[smi]
        frags = [x for x in tag.split('_') if x.startswith('F')]
        all_frags += frags
        uq_lipid_frags += list(set(frags))
        for f in frags:
            frags_program_index = int(f[1:])
            these_counts[frags_program_index] += 1
        aggregate_counts[smi] = these_counts
    uaf = np.unique(all_frags, return_counts=True)
    uaf_dict = dict(zip(uaf[0], uaf[1]))
    ulf = np.unique(uq_lipid_frags, return_counts=True)
    ulf_dict = dict(zip(ulf[0], ulf[1]))
    name_arr = ['F' + str(x) for x in range(len(frags_program))]

    lipid_tags_df = pd.DataFrame.from_dict({'SMILES': input_lipid_desc.keys(), 'Tag': input_lipid_desc.values()})
    rxn_codes_df = pd.DataFrame.from_dict({'Reaction Name': rxn_codes.keys(), 'Reaction code': rxn_codes.values()})

    #print(uaf_dict)
    frag_count_df = pd.DataFrame.from_dict({
        'Name': name_arr, 
        'SMILES': frags_program, 
        'Count_in_library': [uaf_dict.get(x, 0) for x in name_arr], 
        'Number_of_unique_lipids': [ulf_dict.get(x, 0) for x in name_arr]
    })
    frag_count_df = pd.concat([frag_count_df, pd.DataFrame.from_dict(aggregate_counts)], axis=1)
    if default_pfx is not None:
        if not os.path.exists(default_pfx):
            os.makedirs(default_pfx)
        lipid_tags_df.to_csv(f'{default_pfx}lipid_tags.tsv', sep='\t', index=False)
        rxn_codes_df.to_csv(f'{default_pfx}rxn_codes.tsv', sep='\t', index=False)
        frag_count_df.to_csv(f'{default_pfx}fragments.tsv', sep='\t', index=False)
    return lipid_tags_df, frag_count_df, rxn_codes_df

How much are we gaining by decomposing structures in this straightforward manner? It depends on the set of structures and how much redundancy they share. For a varied set such as those in the catalogs here which are discovered through completely different pipelines and routes, the amount of redundancy we would expect is not high.

It is surprising therefore that the decompositions are nontrivial, as we saw with most structures being cleaved. This results in significantly lighter (less complex) fragments, which can be easily plotted.

CODE

import matplotlib.pyplot as plt

input_lipid_desc, molecule_info_program, frags_program = decompose_dataset(all_smiles, reaction_types=['esterification'])
lipid_tags_df, frag_count_df, rxn_codes_df = compute_fragment_counts(input_lipid_desc, frags_program)
frag_count_df.set_index('SMILES', inplace=True)

newcols = list(frag_count_df.columns)
for i in range(3, frag_count_df.shape[1]):
    sndx = np.where(np.array(all_smiles) == frag_count_df.columns[i])[0]
    if len(sndx) > 0:
        newcols[i] = all_structures['name'][sndx[0]]
frag_count_df.columns = newcols

newrows = list(frag_count_df.index)
for i in range(frag_count_df.shape[0]):
    sndx = np.where(np.array(all_smiles) == frag_count_df.columns[i])[0]
    if len(sndx) > 0:
        newcols[i] = all_structures['name'][sndx[0]]
frag_count_df.index = newrows

import rdkit.Chem.Descriptors
data_chemwts = [Chem.Descriptors.ExactMolWt(Chem.MolFromSmiles(x)) for x in all_smiles]
data_fragwts = [Chem.Descriptors.ExactMolWt(Chem.MolFromSmiles(x)) for x in frags_program]
data1 = data_chemwts + data_fragwts
lbls1 = ['structures']*len(data_chemwts) + ['fragments']*len(data_fragwts)

def plot_hist_molwts(data1, data2, fam_name):
    plt.figure(figsize=(8, 4))
    plt.hist(data1, bins=30, label=f'Lipids (n = {len(data1)})', alpha=0.5, density=True)
    plt.hist(data2, bins=30, label=f'Fragments (n = {len(data2)})', alpha=0.5, density=True)
    plt.xlabel("Molecular weight")
    plt.title(f"{fam_name}: Distribution of molecular weights")
    plt.legend()
    plt.show()


plot_hist_molwts(data_chemwts, data_fragwts, 'All lipids')

CODE

frag_count_df.to_csv('known_IL_combined_fragments.tsv', sep='\t')

Unfortunately, it is clear that there are still a few very heavy fragments, resulting from lipids that are not cleaved:

CODE

from rdkit.Chem import Descriptors
desc_molwt_ndces = np.argsort([Descriptors.MolWt(Chem.MolFromSmiles(x)) for x in frags_program])[::-1]
mols2grid.display([Chem.MolFromSmiles(x) for x in np.array(frags_program)[desc_molwt_ndces]], 
                  mol_col="mol", n_cols=7, n_rows=4, 
                  title="Fragments sorted by molecular weight (descending order)")

The first few dozen of these fragments consist of many huge structures that must each be painstakingly constructed, and therefore are appropriately viewed as compositions of fragments themselves, through other non-esterification reactions. In order to decompose using multiple reactions at once, we refer to another follow-up post.