Getting ionizable lipid structures from online catalogs
Author
Akshay Balsubramani
Online catalogs of chemical structures
The building blocks of an ionizable lipid are fragments of different moieties in the lipid. These fragments are shared in common across combinatorial libraries, providing insight and control into their variations.
In the papers in which they appear, the fragments are typically dictated by synthesis or other resource constraints. For in silico AI/ML, we often want to mine the SMILES and lipids corresponding to known structures, which is a daunting task given the number that have been studied. Scouring the web for catalogs and literature online is the only way to keep up with such structures.
We go through the process of mining two such catalogs – Broadpharm and Medchem – which together cover the vast majority of available fragments and structures.
These cover a large fraction of the available literature, including structures that are not under patent. We will scrape each of these catalogs and combine them into a data frame that is the result of this notebook.
Procedure
In each case, getting the necessary structures from the catalogs is a two-step process:
Retrieve structure images from the catalog: This is done by scraping the catalog for the internal IDs of chemical structures. These correspond to structure images in the catalog. Each image is then downloaded from the catalog and saved to a local directory. The implementation below takes a URL and returns a data frame of the structures in the catalog.
CODE
import requests, osfrom urllib.parse import urlparsedef scrape_url(url):""" Scrapes the HTML content from a given URL and returns it as a string. Args: url (str): The URL to scrape Returns: str: The HTML content of the page Raises: ValueError: If the URL is invalid requests.RequestException: If the request fails """# Validate URL parsed_url = urlparse(url)ifnot parsed_url.scheme ornot parsed_url.netloc:raiseValueError("Invalid URL. Please provide a complete URL including http:// or https://")# Set a user agent to mimic a browser request headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36' }# Make the requesttry: response = requests.get(url, headers=headers, timeout=10) response.raise_for_status() # Raise an exception for 4XX/5XX responsesreturn response.textexcept requests.exceptions.RequestException as e:raise requests.RequestException(f"Failed to retrieve the webpage: {e}")def retrieve_chemstructure_images(IDs, names, imageID_to_url, name_to_path): img_path = {}for i inrange(len(IDs)):iflen(names[i]) ==0:continue cpd_name = names[i] image_url = imageID_to_url(IDs[i])if'/'in cpd_name: cpd_name = cpd_name.replace('/', '-') image_path = name_to_path(cpd_name)ifnot os.path.exists(image_path): img_data = requests.get(image_url).contentprint(f'Downloading {image_url} to {image_path}')withopen(image_path, 'wb') as handler: handler.write(img_data) img_path[cpd_name] = image_pathreturn img_path
Convert the structure images to SMILES strings: This is done using deep computer vision models that are trained for this specific task. We use the DECIMER transformer model for this purpose.
The SMILES string conversion requires some manual verification of the answers, so it is not yet fully automated. Though further automatic SMILES conversion presents some challenges, there is scope for addressing these as needed. 1
CODE
def predict_SMILES_from_images(img_path_dict):# To run this, DECIMER must be installed, requiring opencv-python and keras-preprocessingfrom DECIMER import predict_SMILES as DECIMER_predict_SMILES chem_smiles = {} i =0for cpd_name, image_path in img_path_dict.items(): SMILES_str = DECIMER_predict_SMILES(image_path)#SMILES_str = openchemie_predict_SMILES(image_path) chem_smiles[cpd_name] = SMILES_str i +=1print(i, cpd_name)return chem_smiles
Examples
A look at Broadpharm’s IL catalog
Broadpharm is one of the largest vendors for ionizable lipids in LNP design and has a detailed and organized catalog with solid coverage of the literature. The structures available there can be inspected from their website.
Get catalog IDs and corresponding chemical names of all structures
Each structure has a unique catalog ID which is necessary to retrieve it from within the catalog. But we typically want to store it under a different name – often the structure’s trademarked name, or another referent that is useful for looking it up in the literature.
We first retrieve the mapping between the catalog IDs and chemical names – a quick step that only involves some HTML traversals. BeautifulSoup is a standard HTML/XML parser that makes this process easier.
The code here is specific to the format of the Broadpharm catalog, and would change with time and other catalogs. The best way to write this code is therefore adaptively, with a code generation model.
CODE
from bs4 import BeautifulSoupimport pandas as pd# Parse the HTML data using BeautifulSoupsoup = BeautifulSoup(in_str, 'html.parser')# Initialize an empty list to store the extracted datacompound_data = []for tr in soup.find_all('tr'): td = tr.find_all('td') row = [i.text for i in td]iflen(row) !=6:pass# print(row)elif row[1] !='': compound = {'Product ID': row[0],'Name': row[1],'Molecular Structure': row[2],'Molecular Weight': row[3],'Purity': row[4],'Pricing': row[5] } compound_data.append(compound)# Convert the list of dictionaries to a Pandas DataFramecompound_df = pd.DataFrame(compound_data)IDs = compound_df['Product ID'].tolist()names = compound_df['Name'].tolist()
Using this mapping, it is easy to retrieve the images of all structures with their catalog IDs.
The next step is to predict the SMILES strings from the image paths in the dataframe, which is where the DECIMER model comes in.
(This could also be done using the openchemie toolkit, from which the image recognition model (Qian et al. 2023) has compared favorably to other methods from the literature (Rajan et al. 2024). )
CODE
from datetime import datedtime = date.today().strftime("%Y-%m-%d")catalog_name =f'Broadpharm'chem_smiles = predict_SMILES_from_images(img_path)
in_url ="https://www.medchemexpress.com/search.html?q=ionizable+lipid&type=inhibitors-and-agonists"# scrape the url manually to yield the string below, because the website appears to not be statically rendered.from selenium import webdriverfrom selenium.webdriver.chrome.options import Optionsoptions = Options()options.add_argument("--headless")options.add_argument("--disable-blink-features=AutomationControlled")options.add_argument("user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ""AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36")driver = webdriver.Chrome(options=options)driver.get("https://www.medchemexpress.com/search.html?q=ionizable+lipid&type=inhibitors-and-agonists")# driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")in_str = driver.page_sourcedriver.quit()
Get catalog IDs and corresponding chemical names of all structures
Again, BeautifulSoup comes to the rescue, though this code needs to be adapted to the specific structure of the Medchem catalog this time. The code can in general be written by an LLM-based agent.
CODE
from bs4 import BeautifulSoupimport requests, osimport pandas as pd# Parse the HTML data using BeautifulSoupnew_soup = BeautifulSoup(in_str, 'html.parser')names = []IDs = []i =0for x in new_soup.find_all('li'):if x.dl isnotNone:iflen(x.dl.tr.text) >0: names.append(x.dl.tr.text.strip().split('\n')[0])else:iflen(x.dl.tr.a.contents) >=2: names.append(x.dl.tr.a.contents[1].contents[0])else: names.append(x.dl.tr.a.contents[0].contents[0])iflen(x.dl.dt.text) >0: IDs.append(x.dl.dt.text)else: IDs.append(x.dl.dt.a.contents[0]) i +=1
Armed with this mapping, we now again retrieve the images of all structures.
We can now merge all the catalogs that have been collected, collapsing duplicate entries appropriately and logging the source(s) of any structure. Doing a quick RDKit canonical-SMILES comparison during this process corrects for the indeterminacy in how the structure is represented (there is more than one valid SMILES for a molecule, e.g. depending on the choice of starting atom). So only unique structures are included in the final database.
CODE
import numpy as np, pandas as pdsmiles_df_paths = ['Broadpharm_smiles.tsv', 'Medchem_smiles.tsv']consolidated_smiles = {}for s in smiles_df_paths: cat_name = s.split('_smiles')[0] consolidated_smiles[cat_name] =dict(np.array(pd.read_csv(s, sep='\t', header=None)))
FileNotFoundError: [Errno 2] No such file or directory: 'Broadpharm_smiles.tsv'
CODE
from rdkit import Chemnew_df = {'Catalog': [],'Name': [],'SMILES': []}for catname in consolidated_smiles.keys(): thiscat = consolidated_smiles[catname] thiscat_list = [x for x inzip(*thiscat.items())] new_df['Name'].extend(list(thiscat_list[0])) new_df['SMILES'].extend(list(thiscat_list[1])) new_df['Catalog'].extend([catname] *len(thiscat))new_df = pd.DataFrame(new_df)new_df['SMILES'] = [Chem.CanonSmiles(x) for x in new_df['SMILES']]# We find that the model occasionally mislabels explicitly drawn hydrogens as isotopic, so we fix that.new_df['SMILES'] = [x.replace('[2H]', '[H]') for x in new_df['SMILES']]new_df['SMILES'] = [x.replace('[3H]', '[H]') for x in new_df['SMILES']]# Consolidate so that new_df['SMILES'] are all unique. For any duplicates, combine them by concatenating the contents of each of their other columns.new_df_combined = new_df.groupby('SMILES').agg(lambda x: ' | '.join(set(x))).reset_index()
This combined set of SMILES can be further analyzed, or used to seed a virtual space.
By far the most comprehensive way to ensure broad virtual screening is to collect fragments, not structures, because an ultralarge virtual space defines combinatorially many structures from its fragments.
To mine such fragments, we turn to a growing body of literature using libraries that are combinatorially defined, where the study itself specifies fragments rather than the entire library of structures. Such studies almost always explicitly enumerate the library, and we can massively amplify their power by incorporating their fragments into ultralarge libraries for virtual screening.
Amine head fragments
We can retrieve a vast variety of amine fragments from non-LNP catalogs, for use in synthesis. We download an .sdf file directly from the Chemspace catalog at this link.
CODE
import zipfiledef local_zip_download(zip_path, output_dir): new_fname = zip_path.split('/')[-1]# Check if new_fname is a zip fileifnot new_fname.endswith('.zip'):return'' out_path = output_dir + new_fnamewith requests.get(zip_path, stream=True) as r: r.raise_for_status()withopen(out_path, "wb") as f:for chunk in r.iter_content(chunk_size=8192): f.write(chunk)return out_pathdef extract_sdf_from_zip(zip_path, output_dir, download=False):if download: zip_path = local_zip_download(zip_path, output_dir) fnames_written = []with zipfile.ZipFile(zip_path, 'r') as zip_ref:for file_name in zip_ref.namelist():if file_name.lower().endswith(".sdf"): zip_ref.extract(file_name, output_dir)print(f"Extracted: {file_name} to {output_dir}") fnames_written.append(file_name)return fnames_written
Chemspace stores these structures as a compressed .sdf file.
CODE
from rdkit import Chemfrom rdkit.Chem import PandasToolschemspace_file_path ='../../files/Chemspace_Amine_Fragments_Set.zip'sdf_dir ='../../files/'sdf_fnames = extract_sdf_from_zip(chemspace_file_path, sdf_dir)chemspace_sdf = PandasTools.LoadSDF(sdf_dir + sdf_fnames[0], removeHs=False)chemspace_sdf['SMILES'] = [Chem.MolToSmiles(x) for x in chemspace_sdf['ROMol']]chemspace_sdf['Name'] = chemspace_sdf['CHEMSPACE_ID']chemspace_sdf
Extracted: Chemspace_Amine_Fragments_Set.sdf to ../../files/
CHEMSPACE_ID
CHEMSPACE_URL
ID
ROMol
SMILES
Name
0
CSSS00007999857
https://chem-space.com/CSSS00007999857
CNC1CCCN(C2Cc3ccccc3C2)C1
CSSS00007999857
1
CSSS00012024712
https://chem-space.com/CSSS00012024712
Oc1cccc2c1CCCN2
CSSS00012024712
2
CSSS02018321032
https://chem-space.com/CSSS02018321032
O=S(=O)(Cc1ccc(F)cc1)N1CC2CCC(C1)N2
CSSS02018321032
3
CSSS00102954942
https://chem-space.com/CSSS00102954942
CN(C)c1nccc2c1CCNC2.Cl
CSSS00102954942
4
CSSS00133055275
https://chem-space.com/CSSS00133055275
Cc1ccc2oc(C(=O)NCC3CCCCN3)cc2c1.Cl
CSSS00133055275
...
...
...
...
...
...
...
18884
CSSS06359898475
https://chem-space.com/CSSS06359898475
Cl.FCCNC1CCOC1
CSSS06359898475
18885
CSSS00021525833
https://chem-space.com/CSSS00021525833
CC(C)(C)CN1CCC(C2CCNCC2)C1
CSSS00021525833
18886
CSSS00015926175
https://chem-space.com/CSSS00015926175
C1CN[C@H]2COC[C@H]2C1
CSSS00015926175
18887
CSSS00000685022
https://chem-space.com/CSSS00000685022
CNCC(=O)Nc1c(C)cc(C)cc1C.Cl
CSSS00000685022
18888
CSSS00027672546
https://chem-space.com/CSSS00027672546
Cc1ccc(C)c(C(=O)NC2CNCCC2C)c1.Cl
CSSS00027672546
18889 rows × 6 columns
Similar databases of amine fragments are available from other, smaller sources.
CODE
frags_file_path ='http://fchgroup.net/files/fragments/FCHGroup_fragment-like_amines.zip'sdf_dir ='../../files/'sdf_fnames = extract_sdf_from_zip(frags_file_path, sdf_dir, download=True)fch_sdf = PandasTools.LoadSDF(sdf_dir + sdf_fnames[0], removeHs=False)fch_sdf['SMILES'] = [Chem.MolToSmiles(x) for x in fch_sdf['ROMol']]fch_sdf['Name'] = [f"FCH_{i}"for i inrange(fch_sdf.shape[0])]fch_sdf
Extracted: FCHGroup_fragment-like_amines.sdf to ../../files/
ID
URL
ROMol
SMILES
Name
0
https://chem-space.com/CSC000156186
Clc1ccc(CNc2cnccn2)cc1
FCH_0
1
https://chem-space.com/CSC000251315
CC(Nc1ncccn1)C1CC2CCC1C2
FCH_1
2
https://chem-space.com/CSC000156714
N#Cc1cccnc1NC1CCCC1
FCH_2
3
https://chem-space.com/CSC116272021
CC(C)(C)NCC(=O)NC(=O)NC1CCCCC1
FCH_3
4
https://chem-space.com/CSC116272401
COc1ccccc1C=CC1NC(=O)c2ccccc2N1
FCH_4
...
...
...
...
...
...
2150
https://chem-space.com/CSC116267824
O=C(NC[C@]12CNC[C@@]1(C(F)(F)F)C2)C1CCCC1
FCH_2150
2151
https://chem-space.com/CSC150710329
NC1CC12CC(NC(=O)c1cccc3c1CCN3)C2
FCH_2151
2152
https://chem-space.com/CSC116268621
Cc1cc(=O)[nH]c(NC2CCCc3c(C)cccc32)n1
FCH_2152
2153
https://chem-space.com/CSC150714667
C[C@H](NC1CCCC(F)(F)CC1)C(=O)N(C)C
FCH_2153
2154
https://chem-space.com/CSC150711165
Cn1cc(CN[C@@]23CCC[C@@H]2C3)ccc1=O
FCH_2154
2155 rows × 5 columns
Amines typically act as nucleophiles in organic chemistry and LNP chemistry, particularly when participating in synthesis. This means that the head fragments that scientists in LNP discovery typically care about are primary and secondary amines.
Primary amines
Primary amines are particularly useful reagents to have, so we devote special attention to their enumeration here for purposes of downstream use in other workflows. First, we gather a set of primary amines that are well studied as head groups in ionizable lipids, from a seminal line of work on short-RNA LNP delivery (Akinc et al. 2008).
Secondary amines typically offer less scope for versatile modification in syntheses, since they are already participating in a covalent bond with a carbon. MolSSI has a library of secondary amines that can be used for this purpose.
Taken together, these represent a hefty cross-section of amine fragments in commercial use. There is overlap with manufacturers’ databases, such as those of Enamine and WuXi. With the right access permissions, those too can be mined as necessary.
Having gathered this set, it’s useful to inspect these to get an idea of the staggering variety of drug-like amine fragments available (primarily because of their use in small-molecule drug design over the years).
CODE
smiles_list = []name_list = []molssi_amines_primary_df.rename(columns={'Canonical_SMILES': 'SMILES', 'Compound_Name': 'Name'}, inplace=True)molssi_amines_secondary_df.rename(columns={'Canonical_SMILES': 'SMILES', 'Compound_Name': 'Name'}, inplace=True)for new_df in [frags_whitehead_df, molssi_amines_primary_df, molssi_amines_secondary_df, chemspace_sdf, fch_sdf]: new_smiles_list = [Chem.CanonSmiles(x) for x in new_df['SMILES'].values] new_names_list = [f"{new_df['Name'].values[i]}"for i inrange(new_df.shape[0])]for i inrange(new_df.shape[0]):if new_smiles_list[i] in smiles_list: ndx = np.where(np.array(smiles_list) == new_smiles_list[i])[0][0] name_list[ndx] = name_list[ndx] +"|"+ new_names_list[i]else: smiles_list.append(new_smiles_list[i]) name_list.append(new_names_list[i])frags_amines_df = pd.DataFrame({'SMILES': smiles_list,'Name': name_list})frags_amines_df
SMILES
Name
0
C1CNC2NCCNC2N1
Whitehead_0
1
C1CNCCCNCCCNC1
Whitehead_1
2
C1CNCCCNCCNCCCNC1
Whitehead_2
3
C1CNCCN1
Whitehead_3
4
C1CNCCNC1
Whitehead_4|CSSS00000210015
...
...
...
26647
CNC(=O)c1ccc(C)c(NC2COCCC2C)c1
FCH_2133
26648
CC(=O)N1CC(NC2CC(=O)N(C3CCCCC3)C2)C1
FCH_2134
26649
CNC1(C(=O)N2CC3CCCC3(N)C2)CCC1
FCH_2142
26650
NC1CC12CC(NC(=O)c1cccc3c1CCN3)C2
FCH_2151
26651
C[C@H](NC1CCCC(F)(F)CC1)C(=O)N(C)C
FCH_2153
26652 rows × 2 columns
CODE
amines_all =set(frags_amines_df["SMILES"])print(f"Number of amine fragments: {len(amines_all)}")display_df = pd.DataFrame(list(amines_all), columns=["SMILES"])mols2grid.display(display_df)
Number of amine fragments: 26652
There is a comprehensive variety of chemical material available here, which can be used to conduct virtual screening.
Classification and filtering of amine fragments
The above fragment databases do not discriminate between these finer distinctions, but filtering one from the other is a simple matter of detecting the salient atoms. For completeness, we give quite a comprehensive version here, which classifies nitrogens in the molecule, including even aromatic amine nitrogens (anilines) and aromatic non-amine nitrogens (pyridines and pyrroles).
CODE
def classify_nitrogens(sm):""" Classify each of the nitrogens in a molecule. """ m = Chem.MolFromSmiles(sm)if m isNone:raiseValueError('bad SMILES') res = []for a in m.GetAtoms():if a.GetAtomicNum() !=7:continueif a.GetFormalCharge() >0and a.GetDegree() ==4: cat ='quaternary_ammonium'elif a.GetIsAromatic(): cat ='aromatic_pyrrolic'if a.GetTotalNumHs() else'aromatic_pyridinic'else: ani =any(nb.GetIsAromatic() and nb.GetAtomicNum() ==6for nb in a.GetNeighbors()) d = a.GetDegree()if d ==1: cat ='primary_'+ ('aniline'if ani else'amine')elif d ==2: cat ='secondary_'+ ('aniline'if ani else'amine')elif d ==3: cat ='tertiary_'+ ('aniline'if ani else'amine')else: cat ='other_n' res.append((a.GetIdx(), cat))return res or [('none',)]def summarize_amines(smiles_list):""" Summarize the amine classifications for a list of molecules. Returns a dictionary mapping each SMILES to a list of its classifications, for each nitrogen in the molecule. """ answers = {}for s in smiles_list: nitr_classifications = [classify_nitrogens(s)[i][1] for i inrange(len(classify_nitrogens(s)))] answers[s] = nitr_classificationsreturn answersimport mols2gridcandidate_amines = ["CCN", # primary amine"CCNC", # secondary amine"CCN(CC)CN", # tertiary amine"c1ccc(N)cc1", # primary aniline"c1ccc(NC)cc1", # secondary aniline"c1ccc(N(C)C)cc1", # tertiary aniline"c1ccncc1", # aromatic pyridinic"c1c[nH]cc1", # aromatic pyrrolic"C[N+](C)(C)C", # quaternary ammonium, "CC(C)(C)[NH]", # primary amine]for s in candidate_amines:print("{}\t → \t{}".format( s, ", ".join([classify_nitrogens(s)[i][1] for i inrange(len(classify_nitrogens(s)))]) ))mols2grid.display([Chem.MolFromSmiles(x) for x in candidate_amines])
The examples here illustrate this useful functionality in mining this molecule class (amine heads). Note the last example, with the aminyl radical (a common artifact of the image-to-SMILES conversion, especially for old or low-quality images). This is easily converted into a primary amine, and falls with the primary amines here for the purposes of this classification.
This can be used to quickly annotate the head groups in the library we’ve sourced here.
new_col = {s: " | ".join(amine_classifications[s]) for s in amine_classifications}frags_amines_df["Amine_type"] = frags_amines_df["SMILES"].map(new_col)frags_amines_df.to_csv("../../files/frags_amines.csv.gz", compression="gzip", index=False)
CODE
frags_amines_df
SMILES
Name
Amine_type
0
C1CNC2NCCNC2N1
Whitehead_0
secondary_amine | secondary_amine | secondary_...
1
C1CNCCCNCCCNC1
Whitehead_1
secondary_amine | secondary_amine | secondary_...
2
C1CNCCCNCCNCCCNC1
Whitehead_2
secondary_amine | secondary_amine | secondary_...
3
C1CNCCN1
Whitehead_3
secondary_amine | secondary_amine
4
C1CNCCNC1
Whitehead_4|CSSS00000210015
secondary_amine | secondary_amine
...
...
...
...
26647
CNC(=O)c1ccc(C)c(NC2COCCC2C)c1
FCH_2133
secondary_amine | secondary_aniline
26648
CC(=O)N1CC(NC2CC(=O)N(C3CCCCC3)C2)C1
FCH_2134
tertiary_amine | secondary_amine | tertiary_amine
26649
CNC1(C(=O)N2CC3CCCC3(N)C2)CCC1
FCH_2142
secondary_amine | tertiary_amine | primary_amine
26650
NC1CC12CC(NC(=O)c1cccc3c1CCN3)C2
FCH_2151
primary_amine | secondary_amine | secondary_an...
26651
C[C@H](NC1CCCC(F)(F)CC1)C(=O)N(C)C
FCH_2153
secondary_amine | tertiary_amine
26652 rows × 3 columns
Enumerating a variety of fragments for design
The other type of fragment which occurs very commonly in ionizable lipid design is a hydrophobic tail. Ionizable lipids typically contain multiple hydrophobic tails, which are the focus of significant engineering efforts. The tails are understood to play a large role in the lipid’s molecular shape, packing, and self-assembly behavior. They constitute the bulk of a lipid sterically, and are a paradigmatic example of hydrophobic “skeletons” that are common in organic chemistry.
For ionizable lipid design, many properties of their lipid tails, including its length, degree of branching and saturation, and partial charge distribution can all be varied to create a wide variety of lipids.
Branching existing tails (introducing alkyl side chains) is a strategy to increase the hydrophobic cross-section without requiring an additional separate tail, and can improve endosomal escape potency by promoting a more cone-shaped molecular packing (Hajj et al. 2019). Modern vaccine lipids illustrate this principle: both SM-102 and ALC-0315 (Pfizer/BioNTech) possess branched tails that behave similarly to multi-tail structures, thought to contribute to their high efficacy. Function appears to be quite sensitive to structure here, and branching close to the headgroup can be counterproductive, possibly due to steric effects (Padilla et al. 2025).
Tail length also matters: longer tails strengthen hydrophobic interactions and nanoparticle stability, but if too long can impede endosomal release or result in slow biodegradation (Mrksich et al. 2024).
Varying tails in virtual screening
Varying all these properties simultaneously results in (combinatorially) many tail structures, which are carefully pruned to arrive at chemical screening libraries in the literature. Experimental screening constraints typically make such densely explored tail libraries impractical (Hashiba et al. 2024).
But virtual screening is a much lower-cost and higher-throughput process by design. At this early stage in the drug discovery pipeline, we are looking to prospect over as many structures as possible and certainly do not wish to rule out potentially viable structures, especially when doing QSAR on sensitive signals with activity cliffs. So for these purposes, it’s often useful to programmatically create large libraries of tail structures by systematically varying the properties of the tail.
So in this case, if we wanted to examine the effects of changing the tail, we might substitute it with other branched, partially saturated alcohols. Here’s code to generate such a library in an independently extensible manner which may be useful. We use core RDKit functionality to change individual atoms within the mol object and reconvert back into SMILES, which is much more robust than doing pure SMILES string manipulation.
CODE
from rdkit import Chemfrom rdkit import RDLogger# 1‑line switch‑off for all RDKit warnings/infoRDLogger.DisableLog("rdkit")def spacing_ok(db_positions, spacings_allowed={2, 3}):"""Return True iff every spacing between successive double bonds is allowed."""iflen(db_positions) <2:returnTrue db_positions =sorted(db_positions)returnall(b - a in spacings_allowed for a, b inzip(db_positions, db_positions[1:]))def build_linear_tail(n_carbon, db_pos):"""HO–(CH2)n‑1–CH2– with optional internal C=C bonds.""" mol = Chem.RWMol() c_atoms = [mol.AddAtom(Chem.Atom("C")) for _ inrange(n_carbon)] o_atom = mol.AddAtom(Chem.Atom("O"))for i inrange(n_carbon -1): bond_type = Chem.BondType.DOUBLE if i in db_pos else Chem.BondType.SINGLE mol.AddBond(c_atoms[i], c_atoms[i +1], bond_type) mol.AddBond(c_atoms[-1], o_atom, Chem.BondType.SINGLE)return Chem.MolToSmiles(mol, canonical=True)def build_branched_tail(n1, n2, db1, db2):"""Tertiary alcohol core with two hydrophobic tails.""" mol = Chem.RWMol() core = mol.AddAtom(Chem.Atom("C")) o_at = mol.AddAtom(Chem.Atom("O")) mol.AddBond(core, o_at, Chem.BondType.SINGLE)def add_chain(start_atom, length, db_pos_set): prev = start_atomfor idx inrange(length): c = mol.AddAtom(Chem.Atom("C")) bond_type = ( Chem.BondType.DOUBLE if (idx +1) in db_pos_set else Chem.BondType.SINGLE ) mol.AddBond(prev, c, bond_type) prev = c add_chain(core, n1, set(db1)) add_chain(core, n2, set(db2))return Chem.MolToSmiles(mol, canonical=True)
Tail variations in practice
To demonstrate this code, we could construct a set of branched diunsaturated tails, inspired by the dilinoleyl tail of MC3.
We can make a bunch of variations on this tail. We keep the branching but vary the length of each of the two branches:
We try asymmetric-length branches, with double bonds at various places along the length of each branch. It doesn’t make sense to have branches of wildly different lengths, so we disallow branched tails where the branches differ in length by more than 6.
We try different spacings between the double bonds, to generate tails with different lability and flexibility (including conjugated double bonds).
We can do this by systematically changing the atoms in the mol object.
CODE
from itertools import combinations, productfrom rdkit import RDLoggerlg = RDLogger.logger()lg.setLevel(RDLogger.CRITICAL) # Set the logging level to CRITICAL to suppress all messages below this level# design‑space parameters -----------------------------------------------------branched_lengths = [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20] # carbons / tailunsat_choices = (0, 1, 2) # number of C=C per tailspacings_allowed = {2, 3, 4, 5}# -----------------------------------------------------------------------------frags =set()for L1, L2, d1, d2 in product(branched_lengths, branched_lengths, unsat_choices, unsat_choices): interior1 =range(1, L1) # skip bond 0 (core‑C1) interior2 =range(1, L2)if d1 >len(interior1) or d2 >len(interior2):continueifabs(L1 - L2) >6:continuefor db1 in combinations(interior1, d1):ifnot spacing_ok(db1, spacings_allowed=spacings_allowed):continuefor db2 in combinations(interior2, d2):ifnot spacing_ok(db2, spacings_allowed=spacings_allowed):continuetry: smi = build_branched_tail(L1, L2, db1, db2) Chem.SanitizeMol(Chem.MolFromSmiles(smi)) frags.add(smi)exceptException:pass
A random sample of these confirms that we are including a huge variety of diversity.
CODE
import randomprint(f"Number of tails: {len(frags)}")print("Sample of 1000 tails:")# Choose a random set of 1000 tailsmols = [Chem.MolFromSmiles(x) for x in random.choices(list(frags), k=1000)]mols2grid.display(mols)
Number of tails: 179697
Sample of 1000 tails:
(This lays the groundwork for constructing ultralarge spaces of structures for virtual screening. In commonly occurring situations, it will be difficult to optimize beyond a certain point in such spaces; the outputs of screening inherit any issues in the model, and the model is typically operating at the edge of what is statistically reasonable from its training data.
However, note that the combinatorics favor longer tails with greater unsaturation, because there are more of them. In some situations, this can be an overwhelming and crucial bias to account for in design programs; I’ve seen multiply unsaturated long highly branched tails be problematic in design programs before.)
Saving for later reference
These are saved for the construction of other virtual spaces. We’ve done enumerative combinatorics on a massive scale to construct these, but they constitute a tiny file.
CODE
import gzipbt_file_path ="../../files/branched_tail_alcohols.smi.gz"with gzip.open(bt_file_path, "wt") as f:for i, smi inenumerate(sorted(smiles_dict.values())): f.write(f"{smi} branched-tail-alcohol-{i}\n")print("Size of file:{} kb".format(os.path.getsize(bt_file_path)/1024))
Size of file:661.875 kb
For completeness, this can be reloaded as follows:
CODE
import gzip# Read the gzipped filewith gzip.open("../../files/branched_tail_alcohols.smi.gz", "rt") as f: lines = f.readlines() smiles_dict = {}for line in lines: parts = line.strip().split()iflen(parts) >=2: smiles = parts[0] name = parts[1] smiles_dict[name] = smiles# Example: print first 5 entriesfor i, (name, smiles) inenumerate(list(smiles_dict.items())[:5]):print(f"{name}: {smiles}")
These literature-sourced and catalog-sourced lipids constitute an initial knowledge base of ionizable lipid chemistry. This is a basic version of a virtual library, rich in real-world examples and starting points. This library is a possibly very biased sampling of molecules that are effective in different scenarios, so there are practical challenges in training encoders (embeddings of what known successful lipids look like). Instead, we show in another post how to use structures to identify fragments for the next stage of combinatorial expansion. From there, other posts explore the many uses of ML to expand the space further, model it, and use it to guide synthesis.
In practice, assembling such a dataset is itself a form of valuable IP: different companies and groups often put significant effort into curating proprietary databases of formulations and molecules from both publications and internal experiments. However, the general methods to do so (the text mining and database queries implemented here) are widely accessible, and sharing non-proprietary portions (e.g. structures of lipids already published) can be a benefit for initial exploration.
For instance, it is common to attempt to incorporate literature data to bootstrap drug discovery programs, due to the relative scarcity of internal screening data. Particularly when starting a program and at the initial stages, this type of AI/ML can be a unique data-driven option to guide and refine existing hypotheses about efficacy in the vastness of chemical space.
Exporting code
For future such reference, it’s useful to collect all the code from this post in a .py file.
CODE
all_code ='''# ======== Structure sourcing code ========import requests, osfrom urllib.parse import urlparsedef scrape_url(url): """ Scrapes the HTML content from a given URL and returns it as a string. Args: url (str): The URL to scrape Returns: str: The HTML content of the page Raises: ValueError: If the URL is invalid requests.RequestException: If the request fails """ # Validate URL parsed_url = urlparse(url) if not parsed_url.scheme or not parsed_url.netloc: raise ValueError("Invalid URL. Please provide a complete URL including http:// or https://") # Set a user agent to mimic a browser request headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36' } # Make the request try: response = requests.get(url, headers=headers, timeout=10) response.raise_for_status() # Raise an exception for 4XX/5XX responses return response.text except requests.exceptions.RequestException as e: raise requests.RequestException(f"Failed to retrieve the webpage: {e}")def retrieve_chemstructure_images(IDs, names, imageID_to_url, name_to_path): img_path = {} for i in range(len(IDs)): if len(names[i]) == 0: continue cpd_name = names[i] image_url = imageID_to_url(IDs[i]) if '/' in cpd_name: cpd_name = cpd_name.replace('/', '-') image_path = name_to_path(cpd_name) if not os.path.exists(image_path): img_data = requests.get(image_url).content print(f'Downloading {image_url} to {image_path}') with open(image_path, 'wb') as handler: handler.write(img_data) img_path[cpd_name] = image_path return img_pathdef openchemie_predict_SMILES(image_path): import torch # from molscribe import MolScribe # from huggingface_hub import hf_hub_download # ckpt_path = hf_hub_download('yujieq/MolScribe', 'swin_base_char_aux_1m.pth') # model = MolScribe(ckpt_path, device=torch.device('cpu')) # output = model.predict_image_file(image_path, return_atoms_bonds=True, return_confidence=True) from openchemie import OpenChemIE from PIL import Image model = OpenChemIE(device=torch.device('cpu')) # change to cuda for gpu img3 = Image.open(image_path) mres = model.extract_molecules_from_figures([img3]) return mresdef predict_SMILES_from_images(img_path_dict): chem_smiles = {} i = 0 for cpd_name, image_path in img_path_dict.items(): # To run this, DECIMER must be installed, requiring opencv-python and keras-preprocessing from DECIMER import predict_SMILES as DECIMER_predict_SMILES SMILES_str = DECIMER_predict_SMILES(image_path) #SMILES_str = openchemie_predict_SMILES(image_path) chem_smiles[cpd_name] = SMILES_str i += 1 print(i, cpd_name) return chem_smilesimport zipfiledef local_zip_download(zip_path, output_dir): new_fname = zip_path.split('/')[-1] # Check if new_fname is a zip file if not new_fname.endswith('.zip'): return '' out_path = output_dir + new_fname with requests.get(zip_path, stream=True) as r: r.raise_for_status() with open(out_path, "wb") as f: for chunk in r.iter_content(chunk_size=8192): f.write(chunk) return out_pathdef extract_sdf_from_zip(zip_path, output_dir, download=False): if download: zip_path = local_zip_download(zip_path, output_dir) fnames_written = [] with zipfile.ZipFile(zip_path, 'r') as zip_ref: for file_name in zip_ref.namelist(): if file_name.lower().endswith(".sdf"): zip_ref.extract(file_name, output_dir) print(f"Extracted: {file_name} to {output_dir}") fnames_written.append(file_name) return fnames_written# ======== Cheminformatic code ========from rdkit import Chemfrom rdkit import RDLogger# 1‑line switch‑off for all RDKit warnings/infoRDLogger.DisableLog("rdkit")def classify_nitrogens(sm): m = Chem.MolFromSmiles(sm) if m is None: raise ValueError('bad SMILES') res = [] for a in m.GetAtoms(): if a.GetAtomicNum() != 7: continue if a.GetFormalCharge() > 0 and a.GetDegree() == 4: cat = 'quaternary_ammonium' elif a.GetIsAromatic(): cat = 'aromatic_pyrrolic' if a.GetTotalNumHs() else 'aromatic_pyridinic' else: ani = any(nb.GetIsAromatic() and nb.GetAtomicNum() == 6 for nb in a.GetNeighbors()) d = a.GetDegree() if d == 1: cat = 'primary_' + ('aniline' if ani else 'amine') elif d == 2: cat = 'secondary_' + ('aniline' if ani else 'amine') elif d == 3: cat = 'tertiary_' + ('aniline' if ani else 'amine') else: cat = 'other_n' res.append((a.GetIdx(), cat)) return res or [('none',)]def classify_nitrogens(sm): """ Classify each of the nitrogens in a molecule. """ m = Chem.MolFromSmiles(sm) if m is None: raise ValueError('bad SMILES') res = [] for a in m.GetAtoms(): if a.GetAtomicNum() != 7: continue if a.GetFormalCharge() > 0 and a.GetDegree() == 4: cat = 'quaternary_ammonium' elif a.GetIsAromatic(): cat = 'aromatic_pyrrolic' if a.GetTotalNumHs() else 'aromatic_pyridinic' else: ani = any(nb.GetIsAromatic() and nb.GetAtomicNum() == 6 for nb in a.GetNeighbors()) d = a.GetDegree() if d == 1: cat = 'primary_' + ('aniline' if ani else 'amine') elif d == 2: cat = 'secondary_' + ('aniline' if ani else 'amine') elif d == 3: cat = 'tertiary_' + ('aniline' if ani else 'amine') else: cat = 'other_n' res.append((a.GetIdx(), cat)) return res or [('none',)]def summarize_amines(smiles_list): """ Summarize the amine classifications for a list of molecules. Returns a dictionary mapping each SMILES to a list of its classifications, for each nitrogen in the molecule. """ answers = {} for s in smiles_list: nitr_classifications = [classify_nitrogens(s)[i][1] for i in range(len(classify_nitrogens(s)))] answers[s] = nitr_classifications return answersdef spacing_ok(db_positions, spacings_allowed={2, 3}): """Return True iff every spacing between successive double bonds is allowed.""" if len(db_positions) < 2: return True db_positions = sorted(db_positions) return all(b - a in spacings_allowed for a, b in zip(db_positions, db_positions[1:]))def build_linear_tail(n_carbon, db_pos): """HO–(CH2)n‑1–CH2– with optional internal C=C bonds.""" mol = Chem.RWMol() c_atoms = [mol.AddAtom(Chem.Atom("C")) for _ in range(n_carbon)] o_atom = mol.AddAtom(Chem.Atom("O")) for i in range(n_carbon - 1): bond_type = Chem.BondType.DOUBLE if i in db_pos else Chem.BondType.SINGLE mol.AddBond(c_atoms[i], c_atoms[i + 1], bond_type) mol.AddBond(c_atoms[-1], o_atom, Chem.BondType.SINGLE) return Chem.MolToSmiles(mol, canonical=True)def build_branched_tail(n1, n2, db1, db2): """Tertiary alcohol core with two hydrophobic tails.""" mol = Chem.RWMol() core = mol.AddAtom(Chem.Atom("C")) o_at = mol.AddAtom(Chem.Atom("O")) mol.AddBond(core, o_at, Chem.BondType.SINGLE) def add_chain(start_atom, length, db_pos_set): prev = start_atom for idx in range(length): c = mol.AddAtom(Chem.Atom("C")) bond_type = ( Chem.BondType.DOUBLE if (idx + 1) in db_pos_set else Chem.BondType.SINGLE ) mol.AddBond(prev, c, bond_type) prev = c add_chain(core, n1, set(db1)) add_chain(core, n2, set(db2)) return Chem.MolToSmiles(mol, canonical=True)'''
CODE
withopen("../../files/sourcing_ionizable_lipid_structures.py", "w", encoding="utf-8") as f: f.write(all_code)
References
Akinc, Akin, Andreas Zumbuehl, Michael Goldberg, Elizaveta S Leshchiner, Valentina Busini, Naushad Hossain, Sergio A Bacallado, et al. 2008. “A Combinatorial Library of Lipid-Like Materials for Delivery of RNAi Therapeutics.”Nature Biotechnology 26 (5): 561–69.
Hajj, Khalid A, Rebecca L Ball, Sarah B Deluty, Shridhar R Singh, Daria Strelkova, Christopher M Knapp, and Kathryn A Whitehead. 2019. “Branched-Tail Lipid Nanoparticles Potently Deliver mRNA in Vivo Due to Enhanced Ionization at Endosomal pH.”Small 15 (6): 1805097.
Hashiba, Kazuki, Masamitsu Taguchi, Sachiko Sakamoto, Ayaka Otsu, Yoshiki Maeda, Yuichi Suzuki, Hirofumi Ebe, Arimichi Okazaki, Hideyoshi Harashima, and Yusuke Sato. 2024. “Impact of Lipid Tail Length on the Organ Selectivity of mRNA-Lipid Nanoparticles.”Nano Letters 24 (41): 12758–67.
Mrksich, Kaitlin, Marshall S Padilla, Ryann A Joseph, Emily L Han, Dongyoon Kim, Rohan Palanki, Junchao Xu, and Michael J Mitchell. 2024. “Influence of Ionizable Lipid Tail Length on Lipid Nanoparticle Delivery of mRNA of Varying Length.”Journal of Biomedical Materials Research Part A 112 (9): 1494–1505.
Padilla, Marshall S., Kaitlin Mrksich, Yiming Wang, Rebecca M. Haley, Jacqueline J. Li, Emily L. Han, Rakan El-Mayta, et al. 2025. “Branched Endosomal Disruptor (BEND) Lipids Mediate Delivery of mRNA and CRISPR-Cas9 Ribonucleoprotein Complex for Hepatic Gene Editing and t Cell Engineering.”Nature Communications 16 (1): 996. https://doi.org/10.1038/s41467-024-55137-6.
Qian, Yujie, Jiang Guo, Zhengkai Tu, Zhening Li, Connor W Coley, and Regina Barzilay. 2023. “MolScribe: Robust Molecular Structure Recognition with Image-to-Graph Generation.”Journal of Chemical Information and Modeling 63 (7): 1925–34.
Rajan, Kohulan, Henning Otto Brinkhaus, Achim Zielesny, and Christoph Steinbeck. 2024. “Advancements in Hand-Drawn Chemical Structure Recognition Through an Enhanced DECIMER Architecture.”Journal of Cheminformatics 16 (1): 78. https://doi.org/10.1186/s13321-024-00872-7.
Footnotes
It is even possible to verify the SMILES strings against the original images at scale, by using RDKit to generate structures from the SMILES strings and comparing them to the original images. This can be the basis for a RL-trained system for improving the SMILES string conversion accuracy. Though this may sound complex, the implementation details can be quite straightforward. This is a topic for another post.↩︎