Sourcing chemical structures for in silico discovery

cheminformatics

LNP

Getting ionizable lipid structures from online catalogs

Author

Akshay Balsubramani

Online catalogs of chemical structures

The building blocks of an ionizable lipid are fragments of different moieties in the lipid. These fragments are shared in common across combinatorial libraries, providing insight and control into their variations.

In the papers in which they appear, the fragments are typically dictated by synthesis or other resource constraints. For in silico AI/ML, we often want to mine the SMILES and lipids corresponding to known structures, which is a daunting task given the number that have been studied. Scouring the web for catalogs and literature online is the only way to keep up with such structures.

We go through the process of mining two such catalogs – Broadpharm and Medchem – which together cover the vast majority of available fragments and structures.

These cover a large fraction of the available literature, including structures that are not under patent. We will scrape each of these catalogs and combine them into a data frame that is the result of this notebook.

Procedure

In each case, getting the necessary structures from the catalogs is a two-step process:

Retrieve structure images from the catalog: This is done by scraping the catalog for the internal IDs of chemical structures. These correspond to structure images in the catalog. Each image is then downloaded from the catalog and saved to a local directory. The implementation below takes a URL and returns a data frame of the structures in the catalog.

CODE

import requests, os
from urllib.parse import urlparse

def scrape_url(url):
    """
    Scrapes the HTML content from a given URL and returns it as a string.
    
    Args:
        url (str): The URL to scrape
        
    Returns:
        str: The HTML content of the page
        
    Raises:
        ValueError: If the URL is invalid
        requests.RequestException: If the request fails
    """
    # Validate URL
    parsed_url = urlparse(url)
    if not parsed_url.scheme or not parsed_url.netloc:
        raise ValueError("Invalid URL. Please provide a complete URL including http:// or https://")
    
    # Set a user agent to mimic a browser request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    
    # Make the request
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raise an exception for 4XX/5XX responses
        return response.text
    except requests.exceptions.RequestException as e:
        raise requests.RequestException(f"Failed to retrieve the webpage: {e}")


def retrieve_chemstructure_images(IDs, names, imageID_to_url, name_to_path):
    img_path = {}
    for i in range(len(IDs)):
        if len(names[i]) == 0:
            continue
        cpd_name = names[i]
        image_url = imageID_to_url(IDs[i])
        if '/' in cpd_name:
            cpd_name = cpd_name.replace('/', '-')
        image_path = name_to_path(cpd_name)
        if not os.path.exists(image_path):
            img_data = requests.get(image_url).content
            print(f'Downloading {image_url} to {image_path}')
            with open(image_path, 'wb') as handler:
                handler.write(img_data)
        img_path[cpd_name] = image_path
    return img_path

Convert the structure images to SMILES strings: This is done using deep computer vision models that are trained for this specific task. We use the DECIMER transformer model for this purpose.

The SMILES string conversion requires some manual verification of the answers, so it is not yet fully automated. Though further automatic SMILES conversion presents some challenges, there is scope for addressing these as needed. ¹

CODE

def predict_SMILES_from_images(img_path_dict):
    # To run this, DECIMER must be installed, requiring opencv-python and keras-preprocessing
    from DECIMER import predict_SMILES as DECIMER_predict_SMILES
    chem_smiles = {}
    i = 0
    for cpd_name, image_path in img_path_dict.items():
        SMILES_str = DECIMER_predict_SMILES(image_path)
        #SMILES_str = openchemie_predict_SMILES(image_path)
        chem_smiles[cpd_name] = SMILES_str
        i += 1
        print(i, cpd_name)
    return chem_smiles

Examples

A look at Broadpharm’s IL catalog

Broadpharm is one of the largest vendors for ionizable lipids in LNP design and has a detailed and organized catalog with solid coverage of the literature. The structures available there can be inspected from their website.

CODE

in_url = "https://broadpharm.com/product-categories/lipid/ionizable-lipid"
in_str = scrape_url(in_url)

Get catalog IDs and corresponding chemical names of all structures

Each structure has a unique catalog ID which is necessary to retrieve it from within the catalog. But we typically want to store it under a different name – often the structure’s trademarked name, or another referent that is useful for looking it up in the literature.

We first retrieve the mapping between the catalog IDs and chemical names – a quick step that only involves some HTML traversals. BeautifulSoup is a standard HTML/XML parser that makes this process easier.

The code here is specific to the format of the Broadpharm catalog, and would change with time and other catalogs. The best way to write this code is therefore adaptively, with a code generation model.

CODE

from bs4 import BeautifulSoup
import pandas as pd

# Parse the HTML data using BeautifulSoup
soup = BeautifulSoup(in_str, 'html.parser')


# Initialize an empty list to store the extracted data
compound_data = []

for tr in soup.find_all('tr'):
    td = tr.find_all('td')
    row = [i.text for i in td]
    if len(row) != 6:
        pass  # print(row)
    elif row[1] != '':
        compound = {
            'Product ID': row[0],
            'Name': row[1],
            'Molecular Structure': row[2],
            'Molecular Weight': row[3],
            'Purity': row[4],
            'Pricing': row[5]
        }
        compound_data.append(compound)
        

# Convert the list of dictionaries to a Pandas DataFrame
compound_df = pd.DataFrame(compound_data)

IDs = compound_df['Product ID'].tolist()
names = compound_df['Name'].tolist()

Using this mapping, it is easy to retrieve the images of all structures with their catalog IDs.

CODE

file_path_pfx = "../../files/"

broadpharm_imageID_to_url = lambda x: f'https://broadpharm.com/web/images/mol_images/{x}.gif'
broadpharm_name_to_path = lambda x: f'{file_path_pfx}broadpharm_lipids/{x}.png'
img_path = retrieve_chemstructure_images(IDs, names, broadpharm_imageID_to_url, broadpharm_name_to_path)

Predict SMILES from image paths in dataframe

The next step is to predict the SMILES strings from the image paths in the dataframe, which is where the DECIMER model comes in.

(This could also be done using the openchemie toolkit, from which the image recognition model (Qian et al. 2023) has compared favorably to other methods from the literature (Rajan et al. 2024). )

CODE

from datetime import date
dtime = date.today().strftime("%Y-%m-%d")

catalog_name = f'Broadpharm'
chem_smiles = predict_SMILES_from_images(img_path)

CODE

new_fname = f'{catalog_name}_smiles.tsv'
pd.Series(chem_smiles).to_csv(new_fname, sep='\t', header=False)

Medchem’s IL catalog

CODE

in_url = "https://www.medchemexpress.com/search.html?q=ionizable+lipid&type=inhibitors-and-agonists"

# scrape the url manually to yield the string below, because the website appears to not be statically rendered.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument(
    "user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

driver = webdriver.Chrome(options=options)

driver.get("https://www.medchemexpress.com/search.html?q=ionizable+lipid&type=inhibitors-and-agonists")
# driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")

in_str = driver.page_source
driver.quit()

Get catalog IDs and corresponding chemical names of all structures

Again, BeautifulSoup comes to the rescue, though this code needs to be adapted to the specific structure of the Medchem catalog this time. The code can in general be written by an LLM-based agent.

CODE

from bs4 import BeautifulSoup
import requests, os

import pandas as pd


# Parse the HTML data using BeautifulSoup
new_soup = BeautifulSoup(in_str, 'html.parser')

names = []
IDs = []
i = 0
for x in new_soup.find_all('li'):
    if x.dl is not None:
        if len(x.dl.tr.text) > 0:
            names.append(x.dl.tr.text.strip().split('\n')[0])
        else:
            if len(x.dl.tr.a.contents) >= 2:
                names.append(x.dl.tr.a.contents[1].contents[0])
            else:
                names.append(x.dl.tr.a.contents[0].contents[0])
        if len(x.dl.dt.text) > 0:
            IDs.append(x.dl.dt.text)
        else:
            IDs.append(x.dl.dt.a.contents[0])
        i += 1

Armed with this mapping, we now again retrieve the images of all structures.

CODE

medchem_imageID_to_url = lambda x: f'https://file.medchemexpress.com/product_pic/{x}.gif'
medchem_name_to_path = lambda x: f'medchem_lipids/{x}.png'
img_path = retrieve_chemstructure_images(IDs, names, medchem_imageID_to_url, medchem_name_to_path)

Predict SMILES from image paths in dataframe

Again we proceed very similarly to the Broadpharm catalog, using model predictions to convert the images to SMILES strings.

CODE

from datetime import date
dtime = date.today().strftime("%Y-%m-%d")

catalog_name = f'Medchem_{}'
chem_smiles = predict_SMILES_from_images(img_path)

CODE

new_fname = f'{catalog_name}_smiles.tsv'
pd.Series(chem_smiles).to_csv(new_fname, sep='\t', header=False)

Consolidate known lipid structures

We can now merge all the catalogs that have been collected, collapsing duplicate entries appropriately and logging the source(s) of any structure. Doing a quick RDKit canonical-SMILES comparison during this process corrects for the indeterminacy in how the structure is represented (there is more than one valid SMILES for a molecule, e.g. depending on the choice of starting atom). So only unique structures are included in the final database.

CODE

import numpy as np, pandas as pd
smiles_df_paths = ['Broadpharm_smiles.tsv', 'Medchem_smiles.tsv']
consolidated_smiles = {}
for s in smiles_df_paths:
    cat_name = s.split('_smiles')[0]
    consolidated_smiles[cat_name] = dict(np.array(pd.read_csv(s, sep='\t', header=None)))

FileNotFoundError: [Errno 2] No such file or directory: 'Broadpharm_smiles.tsv'

CODE

from rdkit import Chem

new_df = {
    'Catalog': [],
    'Name': [],
    'SMILES': []
}
for catname in consolidated_smiles.keys():
    thiscat = consolidated_smiles[catname]
    thiscat_list = [x for x in zip(*thiscat.items())]
    new_df['Name'].extend(list(thiscat_list[0]))
    new_df['SMILES'].extend(list(thiscat_list[1]))
    new_df['Catalog'].extend([catname] * len(thiscat))
new_df = pd.DataFrame(new_df)
new_df['SMILES'] = [Chem.CanonSmiles(x) for x in new_df['SMILES']]

# We find that the model occasionally mislabels explicitly drawn hydrogens as isotopic, so we fix that.
new_df['SMILES'] = [x.replace('[2H]', '[H]') for x in new_df['SMILES']]
new_df['SMILES'] = [x.replace('[3H]', '[H]') for x in new_df['SMILES']]

# Consolidate so that new_df['SMILES'] are all unique. For any duplicates, combine them by concatenating the contents of each of their other columns.
new_df_combined = new_df.groupby('SMILES').agg(lambda x: ' | '.join(set(x))).reset_index()

This combined set of SMILES can be further analyzed, or used to seed a virtual space.

CODE

new_df_combined.to_csv('../../files/known_IL_combined.tsv', sep='\t', header=False, index=False)

Directly sourcing fragments

By far the most comprehensive way to ensure broad virtual screening is to collect fragments, not structures, because an ultralarge virtual space defines combinatorially many structures from its fragments.

To mine such fragments, we turn to a growing body of literature using libraries that are combinatorially defined, where the study itself specifies fragments rather than the entire library of structures. Such studies almost always explicitly enumerate the library, and we can massively amplify their power by incorporating their fragments into ultralarge libraries for virtual screening.

Amine head fragments

We can retrieve a vast variety of amine fragments from non-LNP catalogs, for use in synthesis. We download an .sdf file directly from the Chemspace catalog at this link.

CODE

import zipfile


def local_zip_download(zip_path, output_dir):
    new_fname = zip_path.split('/')[-1]
    # Check if new_fname is a zip file
    if not new_fname.endswith('.zip'):
        return ''
    out_path = output_dir + new_fname
    with requests.get(zip_path, stream=True) as r:
        r.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return out_path


def extract_sdf_from_zip(zip_path, output_dir, download=False):
    if download:
        zip_path = local_zip_download(zip_path, output_dir)
    fnames_written = []
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        for file_name in zip_ref.namelist():
            if file_name.lower().endswith(".sdf"):
                zip_ref.extract(file_name, output_dir)
                print(f"Extracted: {file_name} to {output_dir}")
                fnames_written.append(file_name)
    return fnames_written

Chemspace stores these structures as a compressed .sdf file.

CODE

from rdkit import Chem
from rdkit.Chem import PandasTools

chemspace_file_path = '../../files/Chemspace_Amine_Fragments_Set.zip'
sdf_dir = '../../files/'
sdf_fnames = extract_sdf_from_zip(chemspace_file_path, sdf_dir)

chemspace_sdf = PandasTools.LoadSDF(sdf_dir + sdf_fnames[0], removeHs=False)
chemspace_sdf['SMILES'] = [Chem.MolToSmiles(x) for x in chemspace_sdf['ROMol']]
chemspace_sdf['Name'] = chemspace_sdf['CHEMSPACE_ID']
chemspace_sdf

Extracted: Chemspace_Amine_Fragments_Set.sdf to ../../files/

	CHEMSPACE_ID	CHEMSPACE_URL	ID	ROMol	SMILES	Name
0	CSSS00007999857	https://chem-space.com/CSSS00007999857			CNC1CCCN(C2Cc3ccccc3C2)C1	CSSS00007999857
1	CSSS00012024712	https://chem-space.com/CSSS00012024712			Oc1cccc2c1CCCN2	CSSS00012024712
2	CSSS02018321032	https://chem-space.com/CSSS02018321032			O=S(=O)(Cc1ccc(F)cc1)N1CC2CCC(C1)N2	CSSS02018321032
3	CSSS00102954942	https://chem-space.com/CSSS00102954942			CN(C)c1nccc2c1CCNC2.Cl	CSSS00102954942
4	CSSS00133055275	https://chem-space.com/CSSS00133055275			Cc1ccc2oc(C(=O)NCC3CCCCN3)cc2c1.Cl	CSSS00133055275
...	...	...	...	...	...	...
18884	CSSS06359898475	https://chem-space.com/CSSS06359898475			Cl.FCCNC1CCOC1	CSSS06359898475
18885	CSSS00021525833	https://chem-space.com/CSSS00021525833			CC(C)(C)CN1CCC(C2CCNCC2)C1	CSSS00021525833
18886	CSSS00015926175	https://chem-space.com/CSSS00015926175			C1CN[C@H]2COC[C@H]2C1	CSSS00015926175
18887	CSSS00000685022	https://chem-space.com/CSSS00000685022			CNCC(=O)Nc1c(C)cc(C)cc1C.Cl	CSSS00000685022
18888	CSSS00027672546	https://chem-space.com/CSSS00027672546			Cc1ccc(C)c(C(=O)NC2CNCCC2C)c1.Cl	CSSS00027672546

18889 rows × 6 columns

Similar databases of amine fragments are available from other, smaller sources.

CODE

frags_file_path = 'http://fchgroup.net/files/fragments/FCHGroup_fragment-like_amines.zip'
sdf_dir = '../../files/'
sdf_fnames = extract_sdf_from_zip(frags_file_path, sdf_dir, download=True)

fch_sdf = PandasTools.LoadSDF(sdf_dir + sdf_fnames[0], removeHs=False)
fch_sdf['SMILES'] = [Chem.MolToSmiles(x) for x in fch_sdf['ROMol']]
fch_sdf['Name'] = [f"FCH_{i}" for i in range(fch_sdf.shape[0])]
fch_sdf

Extracted: FCHGroup_fragment-like_amines.sdf to ../../files/

	ID	URL	ROMol	SMILES	Name
0		https://chem-space.com/CSC000156186		Clc1ccc(CNc2cnccn2)cc1	FCH_0
1		https://chem-space.com/CSC000251315		CC(Nc1ncccn1)C1CC2CCC1C2	FCH_1
2		https://chem-space.com/CSC000156714		N#Cc1cccnc1NC1CCCC1	FCH_2
3		https://chem-space.com/CSC116272021		CC(C)(C)NCC(=O)NC(=O)NC1CCCCC1	FCH_3
4		https://chem-space.com/CSC116272401		COc1ccccc1C=CC1NC(=O)c2ccccc2N1	FCH_4
...	...	...	...	...	...
2150		https://chem-space.com/CSC116267824		O=C(NC[C@]12CNC[C@@]1(C(F)(F)F)C2)C1CCCC1	FCH_2150
2151		https://chem-space.com/CSC150710329		NC1CC12CC(NC(=O)c1cccc3c1CCN3)C2	FCH_2151
2152		https://chem-space.com/CSC116268621		Cc1cc(=O)[nH]c(NC2CCCc3c(C)cccc32)n1	FCH_2152
2153		https://chem-space.com/CSC150714667		C[C@H](NC1CCCC(F)(F)CC1)C(=O)N(C)C	FCH_2153
2154		https://chem-space.com/CSC150711165		Cn1cc(CN[C@@]23CCC[C@@H]2C3)ccc1=O	FCH_2154

2155 rows × 5 columns

Amines typically act as nucleophiles in organic chemistry and LNP chemistry, particularly when participating in synthesis. This means that the head fragments that scientists in LNP discovery typically care about are primary and secondary amines.

Primary amines

Primary amines are particularly useful reagents to have, so we devote special attention to their enumeration here for purposes of downstream use in other workflows. First, we gather a set of primary amines that are well studied as head groups in ionizable lipids, from a seminal line of work on short-RNA LNP delivery (Akinc et al. 2008).

CODE

import mols2grid

frags_whitehead_amines = ['NCCCCCO', 'CCNCCCNCC', 'CNCCOCCNC', 'CN(C)CCCN', 'CCN(CC)CCCN', 'NCCN(CCO)CCO', 'NCCN', 'NCCNCCO', 'NCCN(CCN)CCN', 'CN(CCN)CCN', 'CCN(CC)CCNCCN', 'N/C=C1\\\\CCCN1', 'NCCNCCNCCNCCNCCN', 'NCCCOCCOCCOCCCN', 'NCCOCCO', 'CC(C)C[C@H](N)CO', 'NCCNCCN1CCN(CCN)CC1', 'NCCN1CCNCC1', 'CC(N)CNCC(C)N', 'CCN(CCCNC)CCCNC', 'CN(C)CCCNCCCN', 'CNCCN(CCNC)CCNC', 'CC(CN)N', 'CN(CCCN)CCCN', 'C(CN)CN1CCN(CCCN)CC1', 'C1=CC(=CC(=C1)CN)CN', 'C1=CC2=C(C=C1N)OCCO2', 'COCCCN', 'C1COC(O1)CCN', 'C1OC(C)(C)OC1CN', 'COCCN', 'CCOCCN', 'COC(OC)CN', 'C1CC(OC1)CN', 'C(CO)CN', 'C[C@@H](O)CN', 'C[C@H](O)CCN', 'CC(CO)(CO)N', 'C(CCO)CN', 'C(COCCO)N', 'C(O)C(C)(C)CN', 'CCC(CO)(CO)N', 'C(CCCO)CCN', 'C1C[C@H](O)CC[C@H]1N', 'N(C)CCNC', 'N(CC)CCNCC', 'N(C(C)C)CCNC(C)C', 'N(CC)CCCNCC', 'N(C)CCOCCOCCNC', 'N1CCCNCC1', 'CCCN(CC)CN', 'C1CCN(C1)CCN', 'CN(C)CCN', 'CC(CN(C)C)N', 'C1CCN(CC1)CCN', 'C1COCCN1CCN', 'C1COCCN1CCCN', 'C1=CN(C=N1)CCCN', 'CNCCN', 'CNCCCN', 'CCCNCCN', 'C(CNCCNCCN)N', 'C(CN)CN', 'C(CCN)CN', 'CCCCNCCN', 'N(CCNCCO)CCO', 'CCCCCCCCCCCCCN', 'C(CN)CNCCN', 'CCCNCCCN', 'C(CCNCCCN)CNCCCN', 'CC(C)N(C(C)C)CCN', 'C(CNCCNCCNCCN)N', 'C(CN)CNCCN', 'C(CNCCN)N', 'CCCCCCCCCCCCN(CCN)CCN', 'C1CC[C@@H]([C@H](C1)N)N', 'C1CC[C@@H]([C@@H](C1)N)N', 'C1CCN(CC1)N', 'C1C=CC[C@H]([C@@H]1N)N', 'C(CNCCN)CNCCN', 'CCNCCN', 'C=CCCN', 'C1CNC[C@H]1N', 'C1CC(CC(C1)N)N', 'C1CNCCC1N', 'N1CCNCC1', 'N1CCNC[C@H]1C', 'CC1CCC(CC1)CCN', 'CCC1CCC(CC1)CCN', 'N1CCNCC1c1ccccc1', 'N1CCCNCCCNCCC1', 'N1CCNCCNCCNCC1', 'N1CCCNCCNCCNCC1', 'N1CCCNCCNCCCNCC1', 'N1(C)CCCNCCN(C)CCCNCC1', 'N1CCCNCCCNCCCNCC1', 'N1CCNCCNCCOCC1', 'C1C[C@H](CC[C@@H]1N)N', 'C1CC(CC(C1)CN)CN', 'C1=CC(=CC(=C1)N)N', 'C1=CC(=CC=C1N)N', 'CC1=CC=CC(=C1N)N', 'CC1=C(C=C(C=C1)N)N', 'CC1=C(C=CC=C1N)N', 'CC1=CC(=C(C=C1)N)N', 'C1=CC2=C(C(=C1)N)C(=CC=C2)N', 'N1C(C)NC(C)NC1C', 'C1(=NC(=NC(=N1)N)N)N', 'C1CC(CCC1CN)CN', 'C(CCCCN)CCCN', 'C(COCCOCCN)N', 'C(COCCOCCOCCN)N', 'C(CCOCCCN)COCCCN', 'COCCCN', 'N(C)CCOCCOCCNC', 'C(CO)N', 'N(CCNCCO)CCO', 'CCCN', 'CCCCN', 'CCCCCN', 'CCCCCCN', 'CNCCCCCCNC', 'CCCNCCCN', 'C(CN)CNCCCN', 'C(CCCNCCCCCCN)CCN', 'C(CN)CNCCCNCCCN', 'CC(C)[NH]', 'CC(C)(C)[NH]', 'CCC(CC)N', 'CCC(C)(C)N', 'CC(C)CN', 'CCC(C)N', 'CC(C)C(C)N', 'CCCC(C)N', 'CC(C)CCN', 'CC(C)(C)CCN', 'CC(C)(C)CC(C)(C)N', 'CCC(C)CN', 'CCCCC(CC)CN', 'CC(CO)N', 'CC(C)C[C@@H](CO)N', 'CC(C)(C)C(CO)N', 'C(CO)(CO)(CO)N', 'CC(C)OCCCN', 'N(C(C)(C)C)CCNC(C)(C)C', 'C(CCN)CCN', 'C(CCCN)CCN', 'C(CCCCN)CCN', 'C(CCCCCN)CCN', 'C(CCCCCCCN)CCN', 'C(CCCCCCCCCN)CCN', 'CC(CC(C)(C)CCN)CN', 'C(CCCN(C)Cc1ccccc1)N', 'CN1CCN(CC1)CCCCN', 'CCN1CCC(CC1)N', 'CN1CCN(CC1)CCN', 'CN1CCOCC1CN', 'CCCN1CCC(CC1)N', 'C(CN(C)Cc1ccccc1)N', 'CN(C)CCOCCN', 'C1CCN(CC1)CCOCCN', 'C1CCN(C1)CCOCCN', 'CN(C1=CC=CC=C1)C(=O)CN', 'CN1C=CN=C1CN', 'CN1CCCC1CN', 'CN1CCCC1C(=O)N', 'CN1CCC(CC1)CN', 'CN1CCC(CC1)N', 'CC(C)N1CCN(CC1)CCCN', 'C1CN(CCN1)C(=O)CCN', 'C(C)(CN(C)C)(C)CN', 'CC(C)NCCN', 'C1[CH]C1N', 'C1CC(C1)N', 'C1CCC(C1)N', 'CCN1CCC[C@H]1CN', 'CCN1CCC[C@@H]1CN', 'C1CCC(CC1)N', 'C1CCC(CC1)CN', 'CC(C)(CN(C)C)CN', 'C1CC2CC1CC2N', 'C1CCCC(CCC1)N', 'C1C2CC3CC1CC(C2)(C3)N', 'C1C2CC3CC1CC(C2)(C3)CN', 'C1=CC=C(C=C1)CCCN', 'C1=CC=C(C=C1)CCCCN', 'CC(C)(C)C1=CC=C(C=C1)N', 'CC(C)(C)C1=CC=CC=C1N', 'CCC1=C(C(=CC=C1)CC)N', 'COC1=CC=CC=C1N', 'COC1=CC=CC=C1CN', 'CC1=CC=C(C=C1)C(C)N', 'CC(C)C1=CC=CC=C1N', 'CC(C)C1=CC=C(C=C1)N', 'CC(C)OC1=CC=CC(=C1)N', 'COC1=CC=C(C=C1)CCN', 'CCOC1=C(C=CC=C1)N', 'CCCCCCCCOC1=CC=C(C=C1)N', 'CCOC1=CC=CC=C1CN', 'CC(C)(C)C1=CC(=C(C=C1)C(C)(C)C)N', 'COC1=CC(=C(C=C1)OC)N', 'CC(C)(C)c1ccc(OC)c(c1)N', 'C(Cc1cc(OC)c(OC)cc1)N', 'C(c1cc(C)c(c(OC)c1)OC)N', 'C1=CC=C(C=C1)NCCN', 'C1=CC=C(C=C1)CNCCN', 'CNC1=CC=CC=C1N', 'C1=CC(=CC(=C1)N)CN', 'Cc1c(N)c(C)c(C)c(c1C)N', 'C1=CC=C(C(=C1)CN)N', 'CC1=CC(=C(C=C1C)N)N', 'c1(F)cc(ccc1)N', 'c1(F)ccc(cc1)N', 'C1=CC(=C(C=C1)F)CN', 'C(c1cc(F)ccc1)N', 'C1=CC(=C(C=C1)F)CCN', 'c1(cc(F)ccc1F)N', 'C1=C(C=C(C(=C1)F)F)N', 'C1=C(C=C(C=C1F)F)N', 'C(c1cc(F)ccc1F)N', 'C1=C(C=C(C(=C1F)F)F)N', 'c1(F)cc(F)c(c(F)c1)N', 'C(F)(F)(F)c1cc(ccc1)N', 'C(F)(F)(F)c1c(ccc(c1)F)N', 'CC1=CC=C(C=C1F)N', 'COC1=CC=C(C=C1F)N', 'c1(OC(F)F)ccc(cc1)N', 'C(F)(F)(F)Oc1ccc(cc1)N', 'CC1CC(CCC1N)CC2CCC(C(C2)C)N', 'N1C2NCCNC2NCC1', 'C1CC(C2=C1C=CC=C2)N', 'C1=CC2=C(C=C1)CC(C2)N', 'C1CC2=C(C1)C(=CC=C2)N', 'C1OC2=CC=C(C=C2O1)N', 'C1CCC2=C(C1)C=CC(=C2)N', 'C1C2=C(C=CC(=C2)N)C3=C1C=C(C=C3)N', 'c1(cc2cc3ccccc3cc2cc1)N', 'C1=CC=C2C(=C1)C3=C(C=CC=C3)C(=C2N)N', 'C1=CC=C(C=C1)C2=CC(=CC=C2)N', 'C1=CC=C(C=C1)C2=CC=C(C=C2)N', 'Cc1c(N)ccc(c1)-c1ccc(c(C)c1)N', 'C1=C(C=C(C(=C1)N)N)C2=CC=C(C(=C2)N)N', 'Cc1c(N)c(C)cc(c1)-c1cc(C)c(c(c1)C)N', 'C1=CC=C(C=C1)CC2=CC=CC=C2N', 'C1=CC(=CC=C1CC2=CC=C(C=C2)N)N', 'CNC1=CC=C(C=C1)CC2=CC=C(C=C2)N', 'C1=CC=C(C=C1)C(C2=CC=CC=C2)N', 'C(CC(c1ccccc1)c1ccccc1)N', 'C1=CC=C(C=C1)OC2=CC=CC=C2N', 'C1=CC=C(C=C1)OC2=CC=C(C=C2)N', 'c1(Oc2ccc(F)cc2)ccc(cc1)N', 'C1=CC(=CC=C1N)OC2=CC=C(C=C2)N', 'C1=CC=C(C=C1)COC2=CC(=CC=C2)N', 'C1=CC=C(C=C1)NC2=CC=C(C=C2)N', 'C1=CC=C(C=C1)NC2=CC=CC=C2N', 'C1=CC=C(C=C1)[C@H]([C@H](C2=CC=CC=C2)N)N', 'C1=CC=C(C=C1)C(C2=CC=CC=C2)(C3=CC=CC=C3)N', 'Nc1c(cc(cc1-c1ccccc1)-c1ccccc1)-c1ccccc1', 'C1=CC=C2C(=C1)C=CC(=C2C3=C(C=CC4=CC=CC=C43)N)N', 'N(c1ccccc1)c1ccc(cc1)-c1ccc(cc1)Nc1ccccc1', 'C1(c2c(cccc2)-c2ccccc21)(c1ccc(N)cc1)c1ccc(cc1)N', 'CC(NCCN(CCNC(C)C)CCNC(C)C)C']
frags_whitehead_amines = np.unique([Chem.CanonSmiles(x) for x in frags_whitehead_amines])
frags_whitehead_df = pd.DataFrame({
    'SMILES': frags_whitehead_amines, 
    'Name': [f"Whitehead_{i}" for i in range(len(frags_whitehead_amines))]
})

print("{} heads from Whitehead et al. 2014 library.".format(len(frags_whitehead_amines)))
mols2grid.display([Chem.MolFromSmiles(x) for x in frags_whitehead_amines],mol_col="mol")

252 heads from Whitehead et al. 2014 library.

The Descriptor Libraries Project from the Molecular Sciences Software Institute (MolSSI) runs quantum mechanical calculations on a subset of individual fragments. These are usefully sorted by molecule class. The primary amines are displayed below.

CODE

molssi_amines_primary_df = pd.read_excel("https://descriptor-libraries.molssi.org/primary-amines/content/primary-amines_library.xlsx", index_col=0)
molssi_amines_primary_df

	Canonical_SMILES	Compound_Name	Reaxys	Drug_Repurposing_Hub	Enamine	Notes
ID
1	COCCN	N1	1.0	NaN	NaN	NaN
2	NCC1CCCCC1	N2	1.0	NaN	NaN	NaN
3	NCCc1ccccc1Cl	N3	1.0	NaN	NaN	NaN
4	NCc1ccc(-c2ccccc2)cc1	N4	1.0	NaN	NaN	NaN
5	NC1CCOCC1	N5	1.0	NaN	NaN	NaN
...	...	...	...	...	...	...
4268	CN1CCC(N)CC1	N-TAK-901	NaN	1.0	NaN	NaN
4269	NC[C@@H]1C[C@H]1c1cccc2c1CCO2	N-tasimelteon	NaN	1.0	NaN	NaN
4270	CN1CCN(CCCN)CC1	N-TEN-010	NaN	1.0	NaN	NaN
4271	NC1CCC1	N-VU0360172	NaN	1.0	NaN	NaN
4272	CCC[C@H](N)c1ccccc1	N-WP1130	NaN	1.0	NaN	NaN

4272 rows × 6 columns

Secondary amines

Secondary amines typically offer less scope for versatile modification in syntheses, since they are already participating in a covalent bond with a carbon. MolSSI has a library of secondary amines that can be used for this purpose.

CODE

molssi_amines_secondary_df = pd.read_excel("https://descriptor-libraries.molssi.org/secondary-amines/content/secondary-amines_library.xlsx", index_col=0)
molssi_amines_secondary_df

	Canonical_SMILES	Compound_Name	Reaxys	Drug_Repurposing_Hub	Enamine	Notes
ID
1	C1COCCN1	SecN1	1.0	NaN	NaN	NaN
2	C1CCNCC1	SecN2	1.0	NaN	NaN	NaN
3	C1CCNC1	SecN3	1.0	NaN	NaN	NaN
4	CNC	SecN4	1.0	NaN	NaN	NaN
5	CNCCCN1c2ccccc2CCc2ccccc21	SecN5	1.0	NaN	NaN	NaN
...	...	...	...	...	...	...
3845	CN1CC2(CCNC2)C1	SecNEnamine5	NaN	NaN	1.0	NaN
3846	C1COCC2(C1)CNC2	SecNEnamine6	NaN	NaN	1.0	NaN
3847	C1CC(C2CNCCO2)C1	SecNEnamine7	NaN	NaN	1.0	NaN
3848	CC1(C)CNCCS1(=O)=O	SecNEnamine8	NaN	NaN	1.0	NaN
3849	CC1(C)CNCC2(CCOC2)O1	SecNEnamine9	NaN	NaN	1.0	NaN

3849 rows × 6 columns

Consolidating amine fragments

Taken together, these represent a hefty cross-section of amine fragments in commercial use. There is overlap with manufacturers’ databases, such as those of Enamine and WuXi. With the right access permissions, those too can be mined as necessary.

Having gathered this set, it’s useful to inspect these to get an idea of the staggering variety of drug-like amine fragments available (primarily because of their use in small-molecule drug design over the years).

CODE

smiles_list = []
name_list = []

molssi_amines_primary_df.rename(columns={'Canonical_SMILES': 'SMILES', 'Compound_Name': 'Name'}, inplace=True)
molssi_amines_secondary_df.rename(columns={'Canonical_SMILES': 'SMILES', 'Compound_Name': 'Name'}, inplace=True)

for new_df in [frags_whitehead_df, molssi_amines_primary_df, molssi_amines_secondary_df, chemspace_sdf, fch_sdf]:
    new_smiles_list = [Chem.CanonSmiles(x) for x in new_df['SMILES'].values]
    new_names_list = [f"{new_df['Name'].values[i]}" for i in range(new_df.shape[0])]
    for i in range(new_df.shape[0]):
        if new_smiles_list[i] in smiles_list:
            ndx = np.where(np.array(smiles_list) == new_smiles_list[i])[0][0]
            name_list[ndx] = name_list[ndx] + "|" + new_names_list[i]
        else:
            smiles_list.append(new_smiles_list[i])
            name_list.append(new_names_list[i])

frags_amines_df = pd.DataFrame({
    'SMILES': smiles_list,
    'Name': name_list
})

frags_amines_df

	SMILES	Name
0	C1CNC2NCCNC2N1	Whitehead_0
1	C1CNCCCNCCCNC1	Whitehead_1
2	C1CNCCCNCCNCCCNC1	Whitehead_2
3	C1CNCCN1	Whitehead_3
4	C1CNCCNC1	Whitehead_4\|CSSS00000210015
...	...	...
26647	CNC(=O)c1ccc(C)c(NC2COCCC2C)c1	FCH_2133
26648	CC(=O)N1CC(NC2CC(=O)N(C3CCCCC3)C2)C1	FCH_2134
26649	CNC1(C(=O)N2CC3CCCC3(N)C2)CCC1	FCH_2142
26650	NC1CC12CC(NC(=O)c1cccc3c1CCN3)C2	FCH_2151
26651	C[C@H](NC1CCCC(F)(F)CC1)C(=O)N(C)C	FCH_2153

26652 rows × 2 columns

CODE

amines_all = set(frags_amines_df["SMILES"])
print(f"Number of amine fragments: {len(amines_all)}")

display_df = pd.DataFrame(list(amines_all), columns=["SMILES"])
mols2grid.display(display_df)

Number of amine fragments: 26652

There is a comprehensive variety of chemical material available here, which can be used to conduct virtual screening.

Classification and filtering of amine fragments

The above fragment databases do not discriminate between these finer distinctions, but filtering one from the other is a simple matter of detecting the salient atoms. For completeness, we give quite a comprehensive version here, which classifies nitrogens in the molecule, including even aromatic amine nitrogens (anilines) and aromatic non-amine nitrogens (pyridines and pyrroles).

CODE

def classify_nitrogens(sm):
    """
    Classify each of the nitrogens in a molecule.
    """
    m = Chem.MolFromSmiles(sm)
    if m is None:
        raise ValueError('bad SMILES')
    res = []
    for a in m.GetAtoms():
        if a.GetAtomicNum() != 7:
            continue
        if a.GetFormalCharge() > 0 and a.GetDegree() == 4:
            cat = 'quaternary_ammonium'
        elif a.GetIsAromatic():
            cat = 'aromatic_pyrrolic' if a.GetTotalNumHs() else 'aromatic_pyridinic'
        else:
            ani = any(nb.GetIsAromatic() and nb.GetAtomicNum() == 6 for nb in a.GetNeighbors())
            d = a.GetDegree()
            if d == 1:
                cat = 'primary_' + ('aniline' if ani else 'amine')
            elif d == 2:
                cat = 'secondary_' + ('aniline' if ani else 'amine')
            elif d == 3:
                cat = 'tertiary_' + ('aniline' if ani else 'amine')
            else:
                cat = 'other_n'
        res.append((a.GetIdx(), cat))
    return res or [('none',)]


def summarize_amines(smiles_list):
    """
    Summarize the amine classifications for a list of molecules. 
    Returns a dictionary mapping each SMILES to a list of its classifications, 
    for each nitrogen in the molecule. 
    """
    answers = {}
    for s in smiles_list:
        nitr_classifications = [classify_nitrogens(s)[i][1] for i in range(len(classify_nitrogens(s)))]
        answers[s] = nitr_classifications
    return answers



import mols2grid

candidate_amines = [
    "CCN",               # primary amine
    "CCNC",              # secondary amine
    "CCN(CC)CN",          # tertiary amine
    "c1ccc(N)cc1",       # primary aniline
    "c1ccc(NC)cc1",      # secondary aniline
    "c1ccc(N(C)C)cc1",   # tertiary aniline
    "c1ccncc1",          # aromatic pyridinic
    "c1c[nH]cc1",        # aromatic pyrrolic
    "C[N+](C)(C)C",      # quaternary ammonium, 
    "CC(C)(C)[NH]",      # primary amine
]
for s in candidate_amines:
    print("{}\t → \t{}".format(
        s, ", ".join([classify_nitrogens(s)[i][1] for i in range(len(classify_nitrogens(s)))])
    ))
mols2grid.display([Chem.MolFromSmiles(x) for x in candidate_amines])

CCN  →  primary_amine
CCNC     →  secondary_amine
CCN(CC)CN    →  tertiary_amine, primary_amine
c1ccc(N)cc1  →  primary_aniline
c1ccc(NC)cc1     →  secondary_aniline
c1ccc(N(C)C)cc1  →  tertiary_aniline
c1ccncc1     →  aromatic_pyridinic
c1c[nH]cc1   →  aromatic_pyrrolic
C[N+](C)(C)C     →  quaternary_ammonium
CC(C)(C)[NH]     →  primary_amine

The examples here illustrate this useful functionality in mining this molecule class (amine heads). Note the last example, with the aminyl radical (a common artifact of the image-to-SMILES conversion, especially for old or low-quality images). This is easily converted into a primary amine, and falls with the primary amines here for the purposes of this classification.

This can be used to quickly annotate the head groups in the library we’ve sourced here.

CODE

amine_classifications = summarize_amines(frags_amines_df["SMILES"])

CODE

new_col = {s: " | ".join(amine_classifications[s]) for s in amine_classifications}
frags_amines_df["Amine_type"] = frags_amines_df["SMILES"].map(new_col)
frags_amines_df.to_csv("../../files/frags_amines.csv.gz", compression="gzip", index=False)

CODE

frags_amines_df

	SMILES	Name	Amine_type
0	C1CNC2NCCNC2N1	Whitehead_0	secondary_amine \| secondary_amine \| secondary_...
1	C1CNCCCNCCCNC1	Whitehead_1	secondary_amine \| secondary_amine \| secondary_...
2	C1CNCCCNCCNCCCNC1	Whitehead_2	secondary_amine \| secondary_amine \| secondary_...
3	C1CNCCN1	Whitehead_3	secondary_amine \| secondary_amine
4	C1CNCCNC1	Whitehead_4\|CSSS00000210015	secondary_amine \| secondary_amine
...	...	...	...
26647	CNC(=O)c1ccc(C)c(NC2COCCC2C)c1	FCH_2133	secondary_amine \| secondary_aniline
26648	CC(=O)N1CC(NC2CC(=O)N(C3CCCCC3)C2)C1	FCH_2134	tertiary_amine \| secondary_amine \| tertiary_amine
26649	CNC1(C(=O)N2CC3CCCC3(N)C2)CCC1	FCH_2142	secondary_amine \| tertiary_amine \| primary_amine
26650	NC1CC12CC(NC(=O)c1cccc3c1CCN3)C2	FCH_2151	primary_amine \| secondary_amine \| secondary_an...
26651	C[C@H](NC1CCCC(F)(F)CC1)C(=O)N(C)C	FCH_2153	secondary_amine \| tertiary_amine

26652 rows × 3 columns

Enumerating a variety of fragments for design

The other type of fragment which occurs very commonly in ionizable lipid design is a hydrophobic tail. Ionizable lipids typically contain multiple hydrophobic tails, which are the focus of significant engineering efforts. The tails are understood to play a large role in the lipid’s molecular shape, packing, and self-assembly behavior. They constitute the bulk of a lipid sterically, and are a paradigmatic example of hydrophobic “skeletons” that are common in organic chemistry.

For ionizable lipid design, many properties of their lipid tails, including its length, degree of branching and saturation, and partial charge distribution can all be varied to create a wide variety of lipids.

Branching existing tails (introducing alkyl side chains) is a strategy to increase the hydrophobic cross-section without requiring an additional separate tail, and can improve endosomal escape potency by promoting a more cone-shaped molecular packing (Hajj et al. 2019). Modern vaccine lipids illustrate this principle: both SM-102 and ALC-0315 (Pfizer/BioNTech) possess branched tails that behave similarly to multi-tail structures, thought to contribute to their high efficacy. Function appears to be quite sensitive to structure here, and branching close to the headgroup can be counterproductive, possibly due to steric effects (Padilla et al. 2025).

Tail length also matters: longer tails strengthen hydrophobic interactions and nanoparticle stability, but if too long can impede endosomal release or result in slow biodegradation (Mrksich et al. 2024).

Varying tails in virtual screening

Varying all these properties simultaneously results in (combinatorially) many tail structures, which are carefully pruned to arrive at chemical screening libraries in the literature. Experimental screening constraints typically make such densely explored tail libraries impractical (Hashiba et al. 2024).

But virtual screening is a much lower-cost and higher-throughput process by design. At this early stage in the drug discovery pipeline, we are looking to prospect over as many structures as possible and certainly do not wish to rule out potentially viable structures, especially when doing QSAR on sensitive signals with activity cliffs. So for these purposes, it’s often useful to programmatically create large libraries of tail structures by systematically varying the properties of the tail.

Let’s take a look at what this looks like for the prototypical LNP IL, known as MC3. This consists of an amide head linked by an ester group with a branched, partially unsaturated hydrophobic tail. Synthetically as well, MC3 can be viewed as the coupling of a branched, saturated and mono/diunsaturated alcohol (dilinoleyl alcohol) with a headgroup acid (dimethylamino butanoic acid).

So in this case, if we wanted to examine the effects of changing the tail, we might substitute it with other branched, partially saturated alcohols. Here’s code to generate such a library in an independently extensible manner which may be useful. We use core RDKit functionality to change individual atoms within the mol object and reconvert back into SMILES, which is much more robust than doing pure SMILES string manipulation.

CODE

from rdkit import Chem
from rdkit import RDLogger

# 1‑line switch‑off for all RDKit warnings/info
RDLogger.DisableLog("rdkit")


def spacing_ok(db_positions, spacings_allowed={2, 3}):
    """Return True iff every spacing between successive double bonds is allowed."""
    if len(db_positions) < 2:
        return True
    db_positions = sorted(db_positions)
    return all(b - a in spacings_allowed for a, b in zip(db_positions, db_positions[1:]))


def build_linear_tail(n_carbon, db_pos):
    """HO–(CH2)n‑1–CH2–  with optional internal C=C bonds."""
    mol = Chem.RWMol()

    c_atoms = [mol.AddAtom(Chem.Atom("C")) for _ in range(n_carbon)]
    o_atom  = mol.AddAtom(Chem.Atom("O"))

    for i in range(n_carbon - 1):
        bond_type = Chem.BondType.DOUBLE if i in db_pos else Chem.BondType.SINGLE
        mol.AddBond(c_atoms[i], c_atoms[i + 1], bond_type)

    mol.AddBond(c_atoms[-1], o_atom, Chem.BondType.SINGLE)
    return Chem.MolToSmiles(mol, canonical=True)


def build_branched_tail(n1, n2, db1, db2):
    """Tertiary alcohol core with two hydrophobic tails."""
    mol  = Chem.RWMol()
    core = mol.AddAtom(Chem.Atom("C"))
    o_at = mol.AddAtom(Chem.Atom("O"))

    mol.AddBond(core, o_at, Chem.BondType.SINGLE)

    def add_chain(start_atom, length, db_pos_set):
        prev = start_atom
        for idx in range(length):
            c = mol.AddAtom(Chem.Atom("C"))
            bond_type = (
                Chem.BondType.DOUBLE if (idx + 1) in db_pos_set else Chem.BondType.SINGLE
            )
            mol.AddBond(prev, c, bond_type)
            prev = c

    add_chain(core, n1, set(db1))
    add_chain(core, n2, set(db2))

    return Chem.MolToSmiles(mol, canonical=True)

Tail variations in practice

To demonstrate this code, we could construct a set of branched diunsaturated tails, inspired by the dilinoleyl tail of MC3.

CODE

mc3_smiles = "CCCCC\C=C/C\C=C/CCCCCCCCC(CCCCCCCC\C=C/C\C=C/CCCCC)OC(=O)CCCN(C)C"
tail_smiles = "CCCCC\C=C/C\C=C/CCCCCCCCC(CCCCCCCC\C=C/C\C=C/CCCCC)[1*]"
mols2grid.display(
    [Chem.MolFromSmiles(mc3_smiles), 
     Chem.MolFromSmiles(tail_smiles)], 
    size=(300, 200))

We can make a bunch of variations on this tail. We keep the branching but vary the length of each of the two branches:

We try asymmetric-length branches, with double bonds at various places along the length of each branch. It doesn’t make sense to have branches of wildly different lengths, so we disallow branched tails where the branches differ in length by more than 6.
We try different spacings between the double bonds, to generate tails with different lability and flexibility (including conjugated double bonds).

We can do this by systematically changing the atoms in the mol object.

CODE

from itertools import combinations, product

from rdkit import RDLogger
lg = RDLogger.logger()
lg.setLevel(RDLogger.CRITICAL) # Set the logging level to CRITICAL to suppress all messages below this level


# design‑space parameters -----------------------------------------------------
branched_lengths        = [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]   # carbons / tail
unsat_choices  = (0, 1, 2)                 # number of C=C per tail
spacings_allowed = {2, 3, 4, 5}
# -----------------------------------------------------------------------------

frags = set()


for L1, L2, d1, d2 in product(branched_lengths, branched_lengths, unsat_choices, unsat_choices):
    interior1 = range(1, L1)                     # skip bond 0 (core‑C1)
    interior2 = range(1, L2)

    if d1 > len(interior1) or d2 > len(interior2):
        continue

    if abs(L1 - L2) > 6:
        continue

    for db1 in combinations(interior1, d1):
        if not spacing_ok(db1, spacings_allowed=spacings_allowed):
            continue
        for db2 in combinations(interior2, d2):
            if not spacing_ok(db2, spacings_allowed=spacings_allowed):
                continue
            try:
                smi = build_branched_tail(L1, L2, db1, db2)
                Chem.SanitizeMol(Chem.MolFromSmiles(smi))
                frags.add(smi)
            except Exception:
                pass

A random sample of these confirms that we are including a huge variety of diversity.

CODE

import random

print(f"Number of tails: {len(frags)}")

print("Sample of 1000 tails:")
# Choose a random set of 1000 tails
mols = [Chem.MolFromSmiles(x) for x in random.choices(list(frags), k=1000)]

mols2grid.display(mols)

Number of tails: 179697
Sample of 1000 tails:

(This lays the groundwork for constructing ultralarge spaces of structures for virtual screening. In commonly occurring situations, it will be difficult to optimize beyond a certain point in such spaces; the outputs of screening inherit any issues in the model, and the model is typically operating at the edge of what is statistically reasonable from its training data.

However, note that the combinatorics favor longer tails with greater unsaturation, because there are more of them. In some situations, this can be an overwhelming and crucial bias to account for in design programs; I’ve seen multiply unsaturated long highly branched tails be problematic in design programs before.)

Saving for later reference

These are saved for the construction of other virtual spaces. We’ve done enumerative combinatorics on a massive scale to construct these, but they constitute a tiny file.

CODE

import gzip

bt_file_path = "../../files/branched_tail_alcohols.smi.gz"

with gzip.open(bt_file_path, "wt") as f:
    for i, smi in enumerate(sorted(smiles_dict.values())):
        f.write(f"{smi} branched-tail-alcohol-{i}\n")

print("Size of file:{} kb".format(os.path.getsize(bt_file_path)/1024))

Size of file:661.875 kb

For completeness, this can be reloaded as follows:

CODE

import gzip
# Read the gzipped file
with gzip.open("../../files/branched_tail_alcohols.smi.gz", "rt") as f:
    lines = f.readlines()
    smiles_dict = {}
    for line in lines:
        parts = line.strip().split()
        if len(parts) >= 2:
            smiles = parts[0]
            name = parts[1]
            smiles_dict[name] = smiles
        
# Example: print first 5 entries
for i, (name, smiles) in enumerate(list(smiles_dict.items())[:5]):
    print(f"{name}: {smiles}")

branched-tail-alcohol-0: CC=CC=CCCCC(O)=CC=CCCCCC
branched-tail-alcohol-1: CC=CC=CCCCC(O)=CC=CCCCCCC
branched-tail-alcohol-2: CC=CC=CCCCC(O)=CC=CCCCCCCC
branched-tail-alcohol-3: CC=CC=CCCCC(O)=CC=CCCCCCCCC
branched-tail-alcohol-4: CC=CC=CCCCC(O)=CC=CCCCCCCCCC

Discussion

These literature-sourced and catalog-sourced lipids constitute an initial knowledge base of ionizable lipid chemistry. This is a basic version of a virtual library, rich in real-world examples and starting points. This library is a possibly very biased sampling of molecules that are effective in different scenarios, so there are practical challenges in training encoders (embeddings of what known successful lipids look like). Instead, we show in another post how to use structures to identify fragments for the next stage of combinatorial expansion. From there, other posts explore the many uses of ML to expand the space further, model it, and use it to guide synthesis.

In practice, assembling such a dataset is itself a form of valuable IP: different companies and groups often put significant effort into curating proprietary databases of formulations and molecules from both publications and internal experiments. However, the general methods to do so (the text mining and database queries implemented here) are widely accessible, and sharing non-proprietary portions (e.g. structures of lipids already published) can be a benefit for initial exploration.

For instance, it is common to attempt to incorporate literature data to bootstrap drug discovery programs, due to the relative scarcity of internal screening data. Particularly when starting a program and at the initial stages, this type of AI/ML can be a unique data-driven option to guide and refine existing hypotheses about efficacy in the vastness of chemical space.

Exporting code

For future such reference, it’s useful to collect all the code from this post in a .py file.

CODE

all_code = '''
# ======== Structure sourcing code ========

import requests, os
from urllib.parse import urlparse

def scrape_url(url):
    """
    Scrapes the HTML content from a given URL and returns it as a string.
    
    Args:
        url (str): The URL to scrape
        
    Returns:
        str: The HTML content of the page
        
    Raises:
        ValueError: If the URL is invalid
        requests.RequestException: If the request fails
    """
    # Validate URL
    parsed_url = urlparse(url)
    if not parsed_url.scheme or not parsed_url.netloc:
        raise ValueError("Invalid URL. Please provide a complete URL including http:// or https://")
    
    # Set a user agent to mimic a browser request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    
    # Make the request
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raise an exception for 4XX/5XX responses
        return response.text
    except requests.exceptions.RequestException as e:
        raise requests.RequestException(f"Failed to retrieve the webpage: {e}")


def retrieve_chemstructure_images(IDs, names, imageID_to_url, name_to_path):
    img_path = {}
    for i in range(len(IDs)):
        if len(names[i]) == 0:
            continue
        cpd_name = names[i]
        image_url = imageID_to_url(IDs[i])
        if '/' in cpd_name:
            cpd_name = cpd_name.replace('/', '-')
        image_path = name_to_path(cpd_name)
        if not os.path.exists(image_path):
            img_data = requests.get(image_url).content
            print(f'Downloading {image_url} to {image_path}')
            with open(image_path, 'wb') as handler:
                handler.write(img_data)
        img_path[cpd_name] = image_path
    return img_path


def openchemie_predict_SMILES(image_path):
    import torch
    # from molscribe import MolScribe
    # from huggingface_hub import hf_hub_download
    # ckpt_path = hf_hub_download('yujieq/MolScribe', 'swin_base_char_aux_1m.pth')
    # model = MolScribe(ckpt_path, device=torch.device('cpu'))
    # output = model.predict_image_file(image_path, return_atoms_bonds=True, return_confidence=True)
    from openchemie import OpenChemIE
    from PIL import Image
    model = OpenChemIE(device=torch.device('cpu')) # change to cuda for gpu
    img3 = Image.open(image_path)
    mres = model.extract_molecules_from_figures([img3])
    return mres


def predict_SMILES_from_images(img_path_dict):
    chem_smiles = {}
    i = 0
    for cpd_name, image_path in img_path_dict.items():
        # To run this, DECIMER must be installed, requiring opencv-python and keras-preprocessing
        from DECIMER import predict_SMILES as DECIMER_predict_SMILES
        SMILES_str = DECIMER_predict_SMILES(image_path)
        #SMILES_str = openchemie_predict_SMILES(image_path)
        chem_smiles[cpd_name] = SMILES_str
        i += 1
        print(i, cpd_name)
    return chem_smiles


import zipfile

def local_zip_download(zip_path, output_dir):
    new_fname = zip_path.split('/')[-1]
    # Check if new_fname is a zip file
    if not new_fname.endswith('.zip'):
        return ''
    out_path = output_dir + new_fname
    with requests.get(zip_path, stream=True) as r:
        r.raise_for_status()
        with open(out_path, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
    return out_path


def extract_sdf_from_zip(zip_path, output_dir, download=False):
    if download:
        zip_path = local_zip_download(zip_path, output_dir)
    fnames_written = []
    with zipfile.ZipFile(zip_path, 'r') as zip_ref:
        for file_name in zip_ref.namelist():
            if file_name.lower().endswith(".sdf"):
                zip_ref.extract(file_name, output_dir)
                print(f"Extracted: {file_name} to {output_dir}")
                fnames_written.append(file_name)
    return fnames_written





# ======== Cheminformatic code ========


from rdkit import Chem
from rdkit import RDLogger

# 1‑line switch‑off for all RDKit warnings/info
RDLogger.DisableLog("rdkit")


def classify_nitrogens(sm):
    m = Chem.MolFromSmiles(sm)
    if m is None:
        raise ValueError('bad SMILES')
    res = []
    for a in m.GetAtoms():
        if a.GetAtomicNum() != 7:
            continue
        if a.GetFormalCharge() > 0 and a.GetDegree() == 4:
            cat = 'quaternary_ammonium'
        elif a.GetIsAromatic():
            cat = 'aromatic_pyrrolic' if a.GetTotalNumHs() else 'aromatic_pyridinic'
        else:
            ani = any(nb.GetIsAromatic() and nb.GetAtomicNum() == 6 for nb in a.GetNeighbors())
            d = a.GetDegree()
            if d == 1:
                cat = 'primary_' + ('aniline' if ani else 'amine')
            elif d == 2:
                cat = 'secondary_' + ('aniline' if ani else 'amine')
            elif d == 3:
                cat = 'tertiary_' + ('aniline' if ani else 'amine')
            else:
                cat = 'other_n'
        res.append((a.GetIdx(), cat))
    return res or [('none',)]


def classify_nitrogens(sm):
    """
    Classify each of the nitrogens in a molecule.
    """
    m = Chem.MolFromSmiles(sm)
    if m is None:
        raise ValueError('bad SMILES')
    res = []
    for a in m.GetAtoms():
        if a.GetAtomicNum() != 7:
            continue
        if a.GetFormalCharge() > 0 and a.GetDegree() == 4:
            cat = 'quaternary_ammonium'
        elif a.GetIsAromatic():
            cat = 'aromatic_pyrrolic' if a.GetTotalNumHs() else 'aromatic_pyridinic'
        else:
            ani = any(nb.GetIsAromatic() and nb.GetAtomicNum() == 6 for nb in a.GetNeighbors())
            d = a.GetDegree()
            if d == 1:
                cat = 'primary_' + ('aniline' if ani else 'amine')
            elif d == 2:
                cat = 'secondary_' + ('aniline' if ani else 'amine')
            elif d == 3:
                cat = 'tertiary_' + ('aniline' if ani else 'amine')
            else:
                cat = 'other_n'
        res.append((a.GetIdx(), cat))
    return res or [('none',)]


def summarize_amines(smiles_list):
    """
    Summarize the amine classifications for a list of molecules. 
    Returns a dictionary mapping each SMILES to a list of its classifications, 
    for each nitrogen in the molecule. 
    """
    answers = {}
    for s in smiles_list:
        nitr_classifications = [classify_nitrogens(s)[i][1] for i in range(len(classify_nitrogens(s)))]
        answers[s] = nitr_classifications
    return answers


def spacing_ok(db_positions, spacings_allowed={2, 3}):
    """Return True iff every spacing between successive double bonds is allowed."""
    if len(db_positions) < 2:
        return True
    db_positions = sorted(db_positions)
    return all(b - a in spacings_allowed for a, b in zip(db_positions, db_positions[1:]))


def build_linear_tail(n_carbon, db_pos):
    """HO–(CH2)n‑1–CH2–  with optional internal C=C bonds."""
    mol = Chem.RWMol()

    c_atoms = [mol.AddAtom(Chem.Atom("C")) for _ in range(n_carbon)]
    o_atom  = mol.AddAtom(Chem.Atom("O"))

    for i in range(n_carbon - 1):
        bond_type = Chem.BondType.DOUBLE if i in db_pos else Chem.BondType.SINGLE
        mol.AddBond(c_atoms[i], c_atoms[i + 1], bond_type)

    mol.AddBond(c_atoms[-1], o_atom, Chem.BondType.SINGLE)
    return Chem.MolToSmiles(mol, canonical=True)


def build_branched_tail(n1, n2, db1, db2):
    """Tertiary alcohol core with two hydrophobic tails."""
    mol  = Chem.RWMol()
    core = mol.AddAtom(Chem.Atom("C"))
    o_at = mol.AddAtom(Chem.Atom("O"))

    mol.AddBond(core, o_at, Chem.BondType.SINGLE)

    def add_chain(start_atom, length, db_pos_set):
        prev = start_atom
        for idx in range(length):
            c = mol.AddAtom(Chem.Atom("C"))
            bond_type = (
                Chem.BondType.DOUBLE if (idx + 1) in db_pos_set else Chem.BondType.SINGLE
            )
            mol.AddBond(prev, c, bond_type)
            prev = c

    add_chain(core, n1, set(db1))
    add_chain(core, n2, set(db2))

    return Chem.MolToSmiles(mol, canonical=True)

'''

CODE

with open("../../files/sourcing_ionizable_lipid_structures.py", "w", encoding="utf-8") as f:
    f.write(all_code)

References

Akinc, Akin, Andreas Zumbuehl, Michael Goldberg, Elizaveta S Leshchiner, Valentina Busini, Naushad Hossain, Sergio A Bacallado, et al. 2008. “A Combinatorial Library of Lipid-Like Materials for Delivery of RNAi Therapeutics.” Nature Biotechnology 26 (5): 561–69.

Hajj, Khalid A, Rebecca L Ball, Sarah B Deluty, Shridhar R Singh, Daria Strelkova, Christopher M Knapp, and Kathryn A Whitehead. 2019. “Branched-Tail Lipid Nanoparticles Potently Deliver mRNA in Vivo Due to Enhanced Ionization at Endosomal pH.” Small 15 (6): 1805097.

Hashiba, Kazuki, Masamitsu Taguchi, Sachiko Sakamoto, Ayaka Otsu, Yoshiki Maeda, Yuichi Suzuki, Hirofumi Ebe, Arimichi Okazaki, Hideyoshi Harashima, and Yusuke Sato. 2024. “Impact of Lipid Tail Length on the Organ Selectivity of mRNA-Lipid Nanoparticles.” Nano Letters 24 (41): 12758–67.

Mrksich, Kaitlin, Marshall S Padilla, Ryann A Joseph, Emily L Han, Dongyoon Kim, Rohan Palanki, Junchao Xu, and Michael J Mitchell. 2024. “Influence of Ionizable Lipid Tail Length on Lipid Nanoparticle Delivery of mRNA of Varying Length.” Journal of Biomedical Materials Research Part A 112 (9): 1494–1505.

Padilla, Marshall S., Kaitlin Mrksich, Yiming Wang, Rebecca M. Haley, Jacqueline J. Li, Emily L. Han, Rakan El-Mayta, et al. 2025. “Branched Endosomal Disruptor (BEND) Lipids Mediate Delivery of mRNA and CRISPR-Cas9 Ribonucleoprotein Complex for Hepatic Gene Editing and t Cell Engineering.” Nature Communications 16 (1): 996. https://doi.org/10.1038/s41467-024-55137-6.

Qian, Yujie, Jiang Guo, Zhengkai Tu, Zhening Li, Connor W Coley, and Regina Barzilay. 2023. “MolScribe: Robust Molecular Structure Recognition with Image-to-Graph Generation.” Journal of Chemical Information and Modeling 63 (7): 1925–34.

Rajan, Kohulan, Henning Otto Brinkhaus, Achim Zielesny, and Christoph Steinbeck. 2024. “Advancements in Hand-Drawn Chemical Structure Recognition Through an Enhanced DECIMER Architecture.” Journal of Cheminformatics 16 (1): 78. https://doi.org/10.1186/s13321-024-00872-7.

Footnotes

It is even possible to verify the SMILES strings against the original images at scale, by using RDKit to generate structures from the SMILES strings and comparing them to the original images. This can be the basis for a RL-trained system for improving the SMILES string conversion accuracy. Though this may sound complex, the implementation details can be quite straightforward. This is a topic for another post.↩︎

Reuse

CC BY 4.0