Sourcing structures for in silico LNP discovery

cheminformatics
LNP
Getting ionizable lipid structures from online catalogs
Author

Akshay Balsubramani

Online catalogs of chemical structures

The building blocks of an ionizable lipid are fragments corresponding to its different moieties. These fragments are shared across combinatorial libraries, which gives insight into, and control over, how the lipids in a library vary.

In the papers in which they appear, the fragments are typically dictated by synthesis or other resource constraints. For in silico AI/ML, we often want to mine the SMILES corresponding to known lipid structures, which is a daunting task given the number that have been studied. In practice, keeping up with such structures means scouring online catalogs and the literature.

We go through the process of mining two such catalogs, Broadpharm and Medchem (MedChemExpress), which together cover a large fraction of the fragments and structures reported in the literature, including structures that are not under patent. We will scrape each of these catalogs and combine them into a data frame that is the result of this notebook.

Procedure

In each case, getting the necessary structures from the catalogs is a two-step process:

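  1. Retrieve structure images from the catalog: This is done by scraping the catalog page for the internal IDs of chemical structures, then downloading the image for each ID. The code below defines scrape_url, which takes a URL and returns the page's HTML as a string, and retrieve_chemstructure_images, which downloads the image for each ID and returns a mapping from compound name to local image path.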
CODE
import requests, os
from urllib.parse import urlparse

def scrape_url(url):
    """
    Scrapes the HTML content from a given URL and returns it as a string.
    
    Args:
        url (str): The URL to scrape
        
    Returns:
        str: The HTML content of the page
        
    Raises:
        ValueError: If the URL is invalid
        requests.RequestException: If the request fails
    """
    # Validate URL
    parsed_url = urlparse(url)
    if not parsed_url.scheme or not parsed_url.netloc:
        raise ValueError("Invalid URL. Please provide a complete URL including http:// or https://")
    
    # Set a user agent to mimic a browser request
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
    }
    
    # Make the request
    try:
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raise an exception for 4XX/5XX responses
        return response.text
    except requests.exceptions.RequestException as e:
        # Chain the original exception so the underlying cause stays visible in the traceback
        raise requests.RequestException(f"Failed to retrieve the webpage: {e}") from e


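def retrieve_chemstructure_images(IDs, names, imageID_to_url, name_to_path):
    """
    Downloads the structure image for each catalog ID, skipping files already on disk.

    Returns:
        dict: Maps each compound name to the local path of its image
    """
    img_path = {}
    for cpd_id, cpd_name in zip(IDs, names):
        # Skip entries without a name
        if len(cpd_name) == 0:
            continue
        image_url = imageID_to_url(cpd_id)
        # Slashes in a name would be read as directory separators
        cpd_name = cpd_name.replace('/', '-')
        image_path = name_to_path(cpd_name)
        if not os.path.exists(image_path):
            # Create the output directory on first use
            os.makedirs(os.path.dirname(image_path) or '.', exist_ok=True)
            print(f'Downloading {image_url} to {image_path}')
            try:
                response = requests.get(image_url, timeout=10)
                response.raise_for_status()  # Do not save error pages as images
            except requests.RequestException as e:
                print(f'Skipping {cpd_name}: {e}')
                continue
            with open(image_path, 'wb') as handler:
                handler.write(response.content)
        img_path[cpd_name] = image_path
    return img_path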

  2. Convert the structure images to SMILES strings: This is done using deep computer vision models that are trained for this specific task. We use the DECIMER transformer model for this purpose, though more advanced workflows like that of OpenChemIE could also be used.

The SMILES conversion still requires some manual verification of the predictions, so it is not yet fully automated. Automating it further presents some challenges, but there is scope for addressing these as needed (see footnote 1).

CODE
# To run this, DECIMER must be installed, requiring opencv-python and keras-preprocessing
from DECIMER import predict_SMILES as DECIMER_predict_SMILES

def predict_SMILES_from_images(img_path_dict):
    chem_smiles = {}
    i = 0
    for cpd_name, image_path in img_path_dict.items():
        SMILES_str = DECIMER_predict_SMILES(image_path)
        #SMILES_str = openchemie_predict_SMILES(image_path)
        chem_smiles[cpd_name] = SMILES_str
        i += 1
        print(i, cpd_name)
    return chem_smiles
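
Because the predicted strings still need manual checking, a cheap first pass can flag outputs that are obviously malformed before anyone looks at them. The sketch below is a purely syntactic heuristic written for illustration; the function name `looks_like_valid_smiles` is hypothetical and is not part of DECIMER or RDKit. It only checks bracket balance and ring-closure pairing, so passing it does not mean the predicted structure is correct.

```python
def looks_like_valid_smiles(smiles):
    """Crude syntactic screen: balanced brackets and paired ring-closure labels.

    Assumption: this is an illustrative heuristic, not a SMILES parser.
    """
    if not smiles:
        return False
    depth = 0            # nesting depth of branch parentheses
    in_bracket = False   # True while inside a [...] atom specification
    open_rings = set()   # ring-closure labels seen an odd number of times
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if in_bracket:
            # Digits inside [...] are isotopes, charges or H counts, not ring labels
            if ch == '[':
                return False
            if ch == ']':
                in_bracket = False
        elif ch == '[':
            in_bracket = True
        elif ch == ']':
            return False
        elif ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
            if depth < 0:
                return False
        elif ch == '%':
            # Two-digit ring-closure label such as %12
            label = smiles[i + 1:i + 3]
            if len(label) != 2 or not label.isdigit():
                return False
            open_rings ^= {label}
            i += 2
        elif ch.isdigit():
            open_rings ^= {ch}
        i += 1
    return depth == 0 and not in_bracket and not open_rings


# Toy predictions (made-up names) to show how suspicious outputs could be collected
toy_predictions = {'ok': 'C1CCCCC1', 'unclosed ring': 'C1CCCC', 'unclosed branch': 'CC(C'}
flagged = [name for name, s in toy_predictions.items() if not looks_like_valid_smiles(s)]
```

Anything in `flagged` would go to the front of the manual-review queue; strings that pass still need to be checked against the source image.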

Examples

A look at Broadpharm’s IL catalog

Broadpharm is one of the largest vendors for ionizable lipids in LNP design and has a detailed and organized catalog with solid coverage of the literature. The structures available there can be inspected from the website.

CODE
in_url = "https://broadpharm.com/product-categories/lipid/ionizable-lipid"
in_str = scrape_url(in_url)

Get catalog IDs and corresponding chemical names of all structures

Each structure has a unique catalog ID which is necessary to retrieve it from within the catalog. But we typically want to store it under a different name – often the structure’s trademarked name, or another referent that is useful for looking it up in the literature.

We first retrieve the mapping between the catalog IDs and chemical names – a quick step that only involves some HTML traversals. BeautifulSoup is a standard HTML/XML parser that makes this process easier.

CODE
from bs4 import BeautifulSoup
import pandas as pd

# Parse the HTML data using BeautifulSoup
soup = BeautifulSoup(in_str, 'html.parser')


# Initialize an empty list to store the extracted data
compound_data = []

for tr in soup.find_all('tr'):
    td = tr.find_all('td')
    row = [i.text for i in td]
    if len(row) != 6:
        pass  # print(row)
    elif row[1] != '':
        compound = {
            'Product ID': row[0],
            'Name': row[1],
            'Molecular Structure': row[2],
            'Molecular Weight': row[3],
            'Purity': row[4],
            'Pricing': row[5]
        }
        compound_data.append(compound)
        

# Convert the list of dictionaries to a Pandas DataFrame
compound_df = pd.DataFrame(compound_data)

IDs = compound_df['Product ID'].tolist()
names = compound_df['Name'].tolist()

Using this mapping, it is easy to retrieve the images of all structures with their catalog IDs.

CODE
broadpharm_imageID_to_url = lambda x: f'https://broadpharm.com/web/images/mol_images/{x}.gif'
broadpharm_name_to_path = lambda x: f'broadpharm_lipids/{x}.png'
img_path = retrieve_chemstructure_images(IDs, names, broadpharm_imageID_to_url, broadpharm_name_to_path)
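
The path lambda above relies on `retrieve_chemstructure_images` replacing `/` in compound names, but other characters (spaces, parentheses, colons) can also be awkward in file names. A minimal sketch of a more defensive path builder follows. `make_name_to_path` is a hypothetical helper, not part of the original notebook, and the set of characters it keeps is an assumption.

```python
import os
import re


def make_name_to_path(out_dir, ext='.png'):
    """Return a name -> path function that sanitizes compound names.

    Hypothetical helper for illustration; the character whitelist is an assumption.
    The output directory is created up front so that later writes do not fail.
    """
    os.makedirs(out_dir, exist_ok=True)

    def name_to_path(cpd_name):
        # Keep letters, digits, dot, underscore and hyphen; collapse anything else to '-'
        safe = re.sub(r'[^A-Za-z0-9._-]+', '-', cpd_name).strip('-')
        return os.path.join(out_dir, f'{safe}{ext}')

    return name_to_path
```

It could be swapped in as `broadpharm_name_to_path = make_name_to_path('broadpharm_lipids')`. Two different names can collapse to the same sanitized file name, so a collision check may be worth adding for large catalogs.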

Predict SMILES from image paths in dataframe

The next step is to predict the SMILES strings from the image paths collected above. As written, predict_SMILES_from_images calls DECIMER. The commented-out line in that function switches to the image recognition model from the openchemie toolkit (Qian et al. 2023), which has compared favorably to other methods from the literature (Rajan et al. 2024).

CODE
from datetime import date
dtime = date.today().strftime("%Y-%m-%d")

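# Plain catalog name, so that the output file matches the name read in the consolidation step
# (append f'_{dtime}' to the name to keep dated copies)
catalog_name = 'Broadpharm'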
chem_smiles = predict_SMILES_from_images(img_path)
CODE
new_fname = f'{catalog_name}_smiles.tsv'
pd.Series(chem_smiles).to_csv(new_fname, sep='\t', header=False)

Medchem’s IL catalog

CODE
in_url = "https://www.medchemexpress.com/search.html?q=ionizable+lipid&type=inhibitors-and-agonists"

# scrape the url manually to yield the string below, because the website appears to not be statically rendered.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
options.add_argument("--disable-blink-features=AutomationControlled")
options.add_argument(
    "user-agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36"
)

driver = webdriver.Chrome(options=options)

driver.get(in_url)
# driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")

in_str = driver.page_source
driver.quit()

Get catalog IDs and corresponding chemical names of all structures

Again, BeautifulSoup comes to the rescue, though this code needs to be adapted to the specific structure of the Medchem catalog.

CODE
from bs4 import BeautifulSoup
import requests, os

import pandas as pd


# Parse the HTML data using BeautifulSoup
new_soup = BeautifulSoup(in_str, 'html.parser')

names = []
IDs = []
i = 0
for x in new_soup.find_all('li'):
    if x.dl is not None:
        if len(x.dl.tr.text) > 0:
            names.append(x.dl.tr.text.strip().split('\n')[0])
        else:
            if len(x.dl.tr.a.contents) >= 2:
                names.append(x.dl.tr.a.contents[1].contents[0])
            else:
                names.append(x.dl.tr.a.contents[0].contents[0])
        if len(x.dl.dt.text) > 0:
            IDs.append(x.dl.dt.text)
        else:
            IDs.append(x.dl.dt.a.contents[0])
        i += 1

Armed with this mapping, we now again retrieve the images of all structures.

CODE
medchem_imageID_to_url = lambda x: f'https://file.medchemexpress.com/product_pic/{x}.gif'
medchem_name_to_path = lambda x: f'medchem_lipids/{x}.png'
img_path = retrieve_chemstructure_images(IDs, names, medchem_imageID_to_url, medchem_name_to_path)
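
Both catalogs serve images from `.gif` URLs while the local paths end in `.png`, so the extension on disk may not match the actual encoding, and a failed request can leave a non-image file behind. A quick way to check what was actually saved is to look at each file's leading bytes. The sketch below is an illustration under that assumption; `sniff_image_format` is a hypothetical helper and only recognizes three common formats.

```python
def sniff_image_format(path):
    """Guess the image format from the file's leading bytes (magic numbers)."""
    with open(path, 'rb') as fh:
        head = fh.read(8)
    if head.startswith(b'\x89PNG\r\n\x1a\n'):
        return 'png'
    if head[:6] in (b'GIF87a', b'GIF89a'):
        return 'gif'
    if head.startswith(b'\xff\xd8\xff'):
        return 'jpeg'
    return None  # unknown, e.g. an HTML error page saved as an image
```

For example, `[p for p in img_path.values() if sniff_image_format(p) is None]` would list downloads worth re-fetching before they are passed to the SMILES prediction step.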

Predict SMILES from image paths in dataframe

Again we proceed very similarly to the Broadpharm catalog, using model predictions to convert the images to SMILES strings.

CODE
from datetime import date
dtime = date.today().strftime("%Y-%m-%d")

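# Plain catalog name, so that the output file matches the name read in the consolidation step
# (append f'_{dtime}' to the name to keep dated copies)
catalog_name = 'Medchem'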
chem_smiles = predict_SMILES_from_images(img_path)
CODE
new_fname = f'{catalog_name}_smiles.tsv'
pd.Series(chem_smiles).to_csv(new_fname, sep='\t', header=False)

Consolidate known lipid structures

We can now merge all the catalogs that have been collected, collapsing duplicate entries appropriately and logging the source(s) of any structure. Doing a quick RDKit canonical-SMILES comparison during this process corrects for the indeterminacy in how the structure is represented (there is more than one valid SMILES for a molecule, e.g. depending on the choice of starting atom). So only unique structures are included in the final database.

CODE
import numpy as np
smiles_df_paths = ['Broadpharm_smiles.tsv', 'Medchem_smiles.tsv']
consolidated_smiles = {}
for s in smiles_df_paths:
    cat_name = s.split('_smiles')[0]
    consolidated_smiles[cat_name] = dict(np.array(pd.read_csv(s, sep='\t', header=None)))
CODE
from rdkit import Chem

new_df = {
    'Catalog': [],
    'Name': [],
    'SMILES': []
}
for catname in consolidated_smiles.keys():
    thiscat = consolidated_smiles[catname]
    thiscat_list = [x for x in zip(*thiscat.items())]
    new_df['Name'].extend(list(thiscat_list[0]))
    new_df['SMILES'].extend(list(thiscat_list[1]))
    new_df['Catalog'].extend([catname] * len(thiscat))
new_df = pd.DataFrame(new_df)
new_df['SMILES'] = [Chem.CanonSmiles(x) for x in new_df['SMILES']]

# Consolidate so that new_df['SMILES'] are all unique. For any duplicates, combine them by concatenating the contents of each of their other columns.
# Sort the distinct values before joining so the output is reproducible across runs
new_df_combined = new_df.groupby('SMILES').agg(lambda x: ' | '.join(sorted(set(x)))).reset_index()
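
The grouping step can be checked on a toy input. The following is a dependency-free sketch of the same idea (group records by SMILES, then join the distinct catalog and name values), assuming the SMILES strings have already been canonicalized. The records and the `merge_by_smiles` name are illustrative and are not real catalog entries.

```python
def merge_by_smiles(records, sep=' | '):
    """Collapse (catalog, name, smiles) records so each SMILES appears once.

    Assumption: SMILES strings are already canonical, so equal strings mean
    equal structures.
    """
    merged = {}
    for catalog, name, smiles in records:
        entry = merged.setdefault(smiles, {'Catalog': set(), 'Name': set()})
        entry['Catalog'].add(catalog)
        entry['Name'].add(name)
    # Sort before joining so the output is reproducible across runs
    return {
        smiles: {col: sep.join(sorted(vals)) for col, vals in entry.items()}
        for smiles, entry in merged.items()
    }


# Made-up records: the first two share a structure but come from different catalogs
toy_records = [
    ('CatalogA', 'lipid-1', 'CCN(CC)CC'),
    ('CatalogB', 'Lipid 1', 'CCN(CC)CC'),
    ('CatalogA', 'lipid-2', 'CCCCN'),
]
merged = merge_by_smiles(toy_records)
```

The shared structure ends up as a single entry that lists both catalogs and both names, which is the behavior the pandas aggregation is meant to provide.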

This combined set of SMILES can be further analyzed, or used to seed a virtual space.

CODE
new_df_combined.to_csv('known_IL_combined.tsv', sep='\t', header=False, index=False)

Chemspace fragment database

We can retrieve a vast variety of amine fragments from non-LNP catalogs, for use in synthesis. We download an .sdf file directly from the Chemspace catalog at this link.

CODE
from rdkit.Chem import PandasTools

chemspace_file_path = 'Chemspace_Amine_Fragments_Set.sdf'
pdsdf = PandasTools.LoadSDF(chemspace_file_path, removeHs=False)
pdsdf
#sdf_entries[0].split('\n')
       CHEMSPACE_ID     CHEMSPACE_URL                           ID  ROMol
0      CSSS00007999857  https://chem-space.com/CSSS00007999857      <rdkit.Chem.rdchem.Mol object at 0x28191b7d0>
1      CSSS00012024712  https://chem-space.com/CSSS00012024712      <rdkit.Chem.rdchem.Mol object at 0x28191b8b0>
2      CSSS02018321032  https://chem-space.com/CSSS02018321032      <rdkit.Chem.rdchem.Mol object at 0x28191b5a0>
3      CSSS00102954942  https://chem-space.com/CSSS00102954942      <rdkit.Chem.rdchem.Mol object at 0x28191b530>
4      CSSS00133055275  https://chem-space.com/CSSS00133055275      <rdkit.Chem.rdchem.Mol object at 0x28191b4c0>
...    ...              ...                                     ... ...
18884  CSSS06359898475  https://chem-space.com/CSSS06359898475      <rdkit.Chem.rdchem.Mol object at 0x2a06fe570>
18885  CSSS00021525833  https://chem-space.com/CSSS00021525833      <rdkit.Chem.rdchem.Mol object at 0x2a06fe5e0>
18886  CSSS00015926175  https://chem-space.com/CSSS00015926175      <rdkit.Chem.rdchem.Mol object at 0x2a06fe650>
18887  CSSS00000685022  https://chem-space.com/CSSS00000685022      <rdkit.Chem.rdchem.Mol object at 0x2a06fe6c0>
18888  CSSS00027672546  https://chem-space.com/CSSS00027672546      <rdkit.Chem.rdchem.Mol object at 0x2a06fe730>

18889 rows × 4 columns

It’s useful to inspect this to get an idea of the staggering variety of drug-like amine fragments available (primarily because of their use in small-molecule drug design over the years).

CODE
import mols2grid

mols2grid.display(pdsdf,mol_col="ROMol", n_cols=7)

This represents a hefty cross-section of the amine fragments in commercial use. There is overlap with manufacturers’ databases, such as those of Enamine and WuXi. With the right access permissions, those too can be mined as necessary.

References

Qian, Yujie, Jiang Guo, Zhengkai Tu, Zhening Li, Connor W Coley, and Regina Barzilay. 2023. “MolScribe: Robust Molecular Structure Recognition with Image-to-Graph Generation.” Journal of Chemical Information and Modeling 63 (7): 1925–34.
Rajan, Kohulan, Henning Otto Brinkhaus, Achim Zielesny, and Christoph Steinbeck. 2024. “Advancements in Hand-Drawn Chemical Structure Recognition Through an Enhanced DECIMER Architecture.” Journal of Cheminformatics 16 (1): 78. https://doi.org/10.1186/s13321-024-00872-7.

Footnotes

  1. It is even possible to verify the SMILES strings against the original images at scale, by using RDKit to render structures from the SMILES strings and comparing the renderings to the original images.