Getting ionizable lipid structures from online catalogs
Author
Akshay Balsubramani
Online catalogs of chemical structures
The building blocks of an ionizable lipid are fragments corresponding to its different moieties. These fragments are shared across combinatorial libraries, which gives insight into, and control over, how those libraries vary.
In the papers in which they appear, the fragments are typically dictated by synthesis or other resource constraints. For in silico AI/ML, we often want to mine the SMILES strings of known lipid structures, which is a daunting task given the number that have been studied. Scouring the web for catalogs and literature online is the only way to keep up with such structures.
We walk through mining two such catalogs, Broadpharm and Medchem, which together cover the vast majority of available fragments and structures and a large fraction of the literature, including structures that are not under patent.
We scrape each of these catalogs and combine the results into a single data frame, which is the output of this notebook.
Procedure
In each case, getting the necessary structures from the catalogs is a two-step process:
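Retrieve structure images from the catalog: This is done by scraping the catalog for the internal IDs of chemical structures. The helper scrape_url takes a URL and returns the page's HTML as a string, from which the IDs and names are parsed; retrieve_chemstructure_images then downloads the structure image for each ID.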
CODE
import os
import requests
from urllib.parse import urlparse


def scrape_url(url):
    """
    Scrapes the HTML content from a given URL and returns it as a string.

    Args:
        url (str): The URL to scrape

    Returns:
        str: The HTML content of the page

    Raises:
        ValueError: If the URL is invalid
        requests.RequestException: If the request fails
    """
    # Validate URL
    parsed_url = urlparse(url)
    if not parsed_url.scheme or not parsed_url.netloc:
        raise ValueError("Invalid URL. Please provide a complete URL including http:// or https://")

    # Set a user agent to mimic a browser request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }

    # Make the request
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raise an exception for 4XX/5XX responses
        return response.text
    except requests.exceptions.RequestException as e:
        raise requests.RequestException(f"Failed to retrieve the webpage: {e}")


def retrieve_chemstructure_images(IDs, names, imageID_to_url, name_to_path):
    img_path = {}
    for i in range(len(IDs)):
        # Skip entries without a name
        if len(names[i]) == 0:
            continue
        cpd_name = names[i]
        image_url = imageID_to_url(IDs[i])
        # Slashes in a name would be read as path separators
        if '/' in cpd_name:
            cpd_name = cpd_name.replace('/', '-')
        image_path = name_to_path(cpd_name)
        # Download only images that are not already on disk
        if not os.path.exists(image_path):
            img_data = requests.get(image_url).content
            print(f'Downloading {image_url} to {image_path}')
            with open(image_path, 'wb') as handler:
                handler.write(img_data)
        img_path[cpd_name] = image_path
    return img_path
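retrieve_chemstructure_images expects two callbacks, imageID_to_url and name_to_path, whose definitions depend on the catalog. A minimal sketch of what such callbacks might look like is given here. The URL scheme, the base URL, the directory name and the function names make_image_url and make_image_path are all hypothetical placeholders, not the actual layout of either catalog.

```python
import os


def make_image_url(catalog_id, base_url="https://example.com/structure-images"):
    """Map a catalog ID to an image URL (hypothetical scheme: <base_url>/<ID>.png)."""
    return f"{base_url}/{catalog_id}.png"


def make_image_path(cpd_name, out_dir="structure_images", ext=".png"):
    """Map a compound name to a local file path, replacing path separators."""
    safe_name = cpd_name.replace("/", "-").strip()
    return os.path.join(out_dir, safe_name + ext)


# Possible usage with the function defined above (not executed here):
#   os.makedirs("structure_images", exist_ok=True)
#   img_path = retrieve_chemstructure_images(IDs, names, make_image_url, make_image_path)
```

The output directory has to exist before the download loop runs, since the images are written with a plain open(..., 'wb').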
Convert the structure images to SMILES strings: This is done using deep computer vision models that are trained for this specific task. We use the DECIMER transformer model for this purpose, though more advanced workflows like that of OpenChemIE could also be used.
The SMILES string conversion requires some manual verification of the answers, so it is not yet fully automated. Though further automatic SMILES conversion presents some challenges, there is scope for addressing these as needed.
CODE
# To run this, DECIMER must be installed, requiring opencv-python and keras-preprocessing
from DECIMER import predict_SMILES as DECIMER_predict_SMILES


def predict_SMILES_from_images(img_path_dict):
    chem_smiles = {}
    i = 0
    for cpd_name, image_path in img_path_dict.items():
        SMILES_str = DECIMER_predict_SMILES(image_path)
        # SMILES_str = openchemie_predict_SMILES(image_path)
        chem_smiles[cpd_name] = SMILES_str
        i += 1
        print(i, cpd_name)
    return chem_smiles
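Since the predicted SMILES need manual verification, it can help to triage which predictions to look at first. The sketch below is a crude string-level pre-screen, not a chemical validity check: it flags empty predictions, unbalanced parentheses or brackets, disconnected fragments (a "." in the string) and wildcard atoms ("*"). The function name flag_for_review and the choice of checks are assumptions for illustration; a full check would parse each string with a cheminformatics toolkit.

```python
def _balanced(s, open_ch, close_ch):
    """Return True if open_ch/close_ch are properly nested in s."""
    depth = 0
    for ch in s:
        if ch == open_ch:
            depth += 1
        elif ch == close_ch:
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


def flag_for_review(chem_smiles):
    """Return {name: [reasons]} for predictions that look suspicious."""
    flagged = {}
    for name, smi in chem_smiles.items():
        reasons = []
        if not smi:
            reasons.append("empty prediction")
        else:
            if not _balanced(smi, "(", ")"):
                reasons.append("unbalanced parentheses")
            if not _balanced(smi, "[", "]"):
                reasons.append("unbalanced brackets")
            if "." in smi:
                reasons.append("disconnected fragments")
            if "*" in smi:
                reasons.append("wildcard atom")
        if reasons:
            flagged[name] = reasons
    return flagged
```

Anything this pre-screen passes can still be wrong, so it only orders the manual review rather than replacing it.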
Examples
A look at Broadpharm’s IL catalog
Broadpharm is one of the largest vendors for ionizable lipids in LNP design and has a detailed and organized catalog with solid coverage of the literature. The structures available there can be inspected from the website.
Get catalog IDs and corresponding chemical names of all structures
Each structure has a unique catalog ID which is necessary to retrieve it from within the catalog. But we typically want to store it under a different name – often the structure’s trademarked name, or another referent that is useful for looking it up in the literature.
We first retrieve the mapping between the catalog IDs and chemical names – a quick step that only involves some HTML traversals. BeautifulSoup is a standard HTML/XML parser that makes this process easier.
CODE
from bs4 import BeautifulSoup
import pandas as pd

# Parse the HTML data using BeautifulSoup
soup = BeautifulSoup(in_str, 'html.parser')

# Initialize an empty list to store the extracted data
compound_data = []
for tr in soup.find_all('tr'):
    td = tr.find_all('td')
    row = [i.text for i in td]
    if len(row) != 6:
        pass  # print(row)
    elif row[1] != '':
        compound = {
            'Product ID': row[0],
            'Name': row[1],
            'Molecular Structure': row[2],
            'Molecular Weight': row[3],
            'Purity': row[4],
            'Pricing': row[5]
        }
        compound_data.append(compound)

# Convert the list of dictionaries to a Pandas DataFrame
compound_df = pd.DataFrame(compound_data)
IDs = compound_df['Product ID'].tolist()
names = compound_df['Name'].tolist()
Using this mapping, it is easy to retrieve the images of all structures with their catalog IDs.
The next step is to predict the SMILES strings from the image paths in the dataframe. This is done using the openchemie toolkit, from which we use an image recognition model (Qian et al. 2023) that has compared favorably to other methods from the literature (Rajan et al. 2024).
CODE
from datetime import date

dtime = date.today().strftime("%Y-%m-%d")
catalog_name = f'Broadpharm_{dtime}'
chem_smiles = predict_SMILES_from_images(img_path)
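The merge step later in this notebook reads files such as Broadpharm_smiles.tsv as headerless, two-column, tab-separated tables of name and SMILES. A minimal sketch of how the predicted dictionary might be written in that layout follows. The function name save_smiles_tsv is hypothetical, and the exact file naming used by the author is an assumption.

```python
import csv


def save_smiles_tsv(chem_smiles, out_path):
    """Write a {name: SMILES} dict as a headerless two-column TSV file."""
    with open(out_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh, delimiter="\t", lineterminator="\n")
        for name, smiles in chem_smiles.items():
            writer.writerow([name, smiles])
    return out_path


# Possible usage (not executed here):
#   save_smiles_tsv(chem_smiles, "Broadpharm_smiles.tsv")
```

Using the csv module rather than manual string joining means that names containing tabs or quotes are quoted consistently, which pandas.read_csv can read back.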
CODE
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

in_url = "https://www.medchemexpress.com/search.html?q=ionizable+lipid&type=inhibitors-and-agonists"

# Scrape the URL with a headless browser to yield the string below,
# because the website appears not to be statically rendered.
options = Options()
options.add_argument("--headless")
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument(
    "user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

driver = webdriver.Chrome(options=options)
driver.get(in_url)
# driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
in_str = driver.page_source
driver.quit()
Get catalog IDs and corresponding chemical names of all structures
Again, BeautifulSoup comes to the rescue, though this code needs to be adapted to the specific structure of the Medchem catalog.
CODE
from bs4 import BeautifulSoup
import requests, os
import pandas as pd

# Parse the HTML data using BeautifulSoup
new_soup = BeautifulSoup(in_str, 'html.parser')

names = []
IDs = []
i = 0
for x in new_soup.find_all('li'):
    if x.dl is not None:
        # Compound name: prefer the row text, else fall back to the link contents
        if len(x.dl.tr.text) > 0:
            names.append(x.dl.tr.text.strip().split('\n')[0])
        else:
            if len(x.dl.tr.a.contents) >= 2:
                names.append(x.dl.tr.a.contents[1].contents[0])
            else:
                names.append(x.dl.tr.a.contents[0].contents[0])
        # Catalog ID: prefer the <dt> text, else fall back to the link contents
        if len(x.dl.dt.text) > 0:
            IDs.append(x.dl.dt.text)
        else:
            IDs.append(x.dl.dt.a.contents[0])
        i += 1
Armed with this mapping, we now again retrieve the images of all structures.
We can now merge all the catalogs that have been collected, collapsing duplicate entries appropriately and logging the source(s) of any structure. Doing a quick RDKit canonical-SMILES comparison during this process corrects for the indeterminacy in how the structure is represented (there is more than one valid SMILES for a molecule, e.g. depending on the choice of starting atom). So only unique structures are included in the final database.
CODE
import numpy as np

smiles_df_paths = ['Broadpharm_smiles.tsv', 'Medchem_smiles.tsv']
consolidated_smiles = {}
for s in smiles_df_paths:
    cat_name = s.split('_smiles')[0]
    # Each file is a headerless two-column TSV: name, SMILES
    consolidated_smiles[cat_name] = dict(np.array(pd.read_csv(s, sep='\t', header=None)))
CODE
from rdkit import Chem

new_df = {
    'Catalog': [],
    'Name': [],
    'SMILES': []
}
for catname in consolidated_smiles.keys():
    thiscat = consolidated_smiles[catname]
    thiscat_list = [x for x in zip(*thiscat.items())]
    new_df['Name'].extend(list(thiscat_list[0]))
    new_df['SMILES'].extend(list(thiscat_list[1]))
    new_df['Catalog'].extend([catname] * len(thiscat))
new_df = pd.DataFrame(new_df)
new_df['SMILES'] = [Chem.CanonSmiles(x) for x in new_df['SMILES']]

# Consolidate so that new_df['SMILES'] are all unique. For any duplicates, combine them
# by concatenating the contents of each of their other columns.
new_df_combined = new_df.groupby('SMILES').agg(lambda x: ' | '.join(set(x))).reset_index()
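The consolidation logic can also be sketched without pandas, which makes the bookkeeping explicit: every record is keyed by its canonical SMILES, and the names and source catalogs of duplicates are accumulated under that key. The function merge_catalogs and its canonicalize parameter are hypothetical names for illustration. In practice one would pass the Chem.CanonSmiles function used above; the default here is the identity so that the sketch runs without RDKit.

```python
from collections import defaultdict


def merge_catalogs(records, canonicalize=lambda s: s):
    """Merge (catalog, name, smiles) records that share a canonical SMILES.

    canonicalize: callable mapping a SMILES string to its canonical form
    (assumption: e.g. rdkit's Chem.CanonSmiles; identity by default).
    """
    merged = defaultdict(lambda: {"names": set(), "catalogs": set()})
    for catalog, name, smiles in records:
        key = canonicalize(smiles)
        merged[key]["names"].add(name)
        merged[key]["catalogs"].add(catalog)
    # Sort the sets so that the output is deterministic
    return {
        key: {"names": sorted(v["names"]), "catalogs": sorted(v["catalogs"])}
        for key, v in merged.items()
    }
```

Sorting the accumulated names and catalogs gives a reproducible result across runs, whereas joining an unordered set can change the order of the concatenated entries.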
This combined set of SMILES can be further analyzed, or used to seed a virtual space.
We can retrieve a vast variety of amine fragments from non-LNP catalogs, for use in synthesis. We download an .sdf file directly from the Chemspace catalog at this link.
CODE
from rdkit.Chem import PandasTools

chemspace_file_path = 'Chemspace_Amine_Fragments_Set.sdf'
pdsdf = PandasTools.LoadSDF(chemspace_file_path, removeHs=False)
pdsdf
# sdf_entries[0].split('\n')
       CHEMSPACE_ID     CHEMSPACE_URL                            ID  ROMol
0      CSSS00007999857  https://chem-space.com/CSSS00007999857       <rdkit.Chem.rdchem.Mol object at 0x28191b7d0>
1      CSSS00012024712  https://chem-space.com/CSSS00012024712       <rdkit.Chem.rdchem.Mol object at 0x28191b8b0>
2      CSSS02018321032  https://chem-space.com/CSSS02018321032       <rdkit.Chem.rdchem.Mol object at 0x28191b5a0>
3      CSSS00102954942  https://chem-space.com/CSSS00102954942       <rdkit.Chem.rdchem.Mol object at 0x28191b530>
4      CSSS00133055275  https://chem-space.com/CSSS00133055275       <rdkit.Chem.rdchem.Mol object at 0x28191b4c0>
...    ...              ...                                      ...  ...
18884  CSSS06359898475  https://chem-space.com/CSSS06359898475       <rdkit.Chem.rdchem.Mol object at 0x2a06fe570>
18885  CSSS00021525833  https://chem-space.com/CSSS00021525833       <rdkit.Chem.rdchem.Mol object at 0x2a06fe5e0>
18886  CSSS00015926175  https://chem-space.com/CSSS00015926175       <rdkit.Chem.rdchem.Mol object at 0x2a06fe650>
18887  CSSS00000685022  https://chem-space.com/CSSS00000685022       <rdkit.Chem.rdchem.Mol object at 0x2a06fe6c0>
18888  CSSS00027672546  https://chem-space.com/CSSS00027672546       <rdkit.Chem.rdchem.Mol object at 0x2a06fe730>

18889 rows × 4 columns
It’s useful to inspect this to get an idea of the staggering variety of drug-like amine fragments available (primarily because of their use in small-molecule drug design over the years).