Exploring perturbation data in drug discovery

dataviz
genomics
Some trajectories of AI-guided exploration
Author

Akshay Balsubramani

Modified

June 19, 2025

Genesis

One of the major challenges ahead in AI for science is eliciting tacit knowledge from either data sources or human behavior (often language), as such knowledge often contains inductive biases that are invaluable for human-level deduction.

Single-cell multiplexed phenotyping data occupies a vital role in drug discovery, with several important characteristics:

  • Patterns learned are causal, not just correlative
  • Throughput is high enough for large-scale AI methods to learn patterns at superhuman levels

Companies like Recursion and Insitro are wonderful examples of this: they have recognized the power of this technology in conjunction with AI.

A new single-cell perturbation dataset named “X-Atlas/Orion” came out yesterday from Xaira. This is quite an exciting time for computation in biology, because advanced quantitative analysis methods from the past few years¹ have made it possible to recover unforeseen domain-specific structure in such complex high-throughput data. Single-cell data analysis is a great case study of this.

Given all the recent advances promising AI augmentation for a scientist in drug discovery, I wanted to see what I could do with a few hypotheses where my knowledge is slightly better than a layperson’s. I’ll document where I used AI assistants and where I’m incorporating my own inductive biases into the process.² The role of both the AI and the human scientist in this type of open-ended analysis is to perform sweeping multi-modal reasoning.

I’ll look to incorporate those inductive biases into the search using the AI scientist assistance available at this time. This might be illuminating in my particular case: I have zero formal training in either chemistry or biology, but I have been very lucky to work with world-class collaborators in many areas of basic cell biology and drug discovery, so I have some intuitions about a range of problems. Let’s see how far these can be refined with this data.

I’ll augment this notebook over the next few sessions of asking questions about this dataset. This will demonstrate some of the many ways of talking to any dataset, particularly such highly structured single-cell perturbation data.

Scenario

As I’ve discussed previously, interfaces play a key role in the development of AI systems: they sit between generalist human language and specialist domains with their own languages.

Broadly speaking, the knowledge in this dataset has several layers, each of which we can think of as communicating to a different, broader reasoning process.

  • Statistical patterns are implicit in the dataset; descriptive analysis of the data in this manner is useful, and gains descriptive power when the dataset is richer and more structured. Statistical analysis transforms a soup of unaligned barcoded reads into an embedded space.
  • Domain-specific knowledge about the dataset includes metadata and information about the genes, like the STRING protein-protein interaction database referred to in the paper, the Gene Ontology (GO), and many other relevant databases. All are condensations of hard-won tacit knowledge directly about the genes being perturbed.
  • Pan-domain context is everything needed to make the dataset relevant beyond the (cell biology) domain. It typically depends on the question being asked and allows the dataset to be useful in endeavors well outside the disciplinary domain. For uses of this data set in drug discovery, it might be conveyed in protein-ligand complex structures, information about potential ligand candidates for particular proteins in a pathway, safety heuristics and commercial constraints, etc.
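To make the first layer concrete, here is a minimal sketch of about the simplest statistical summary of perturbation data I can think of: average the counts per perturbation (a pseudobulk), then take a log2 fold change against control cells. The counts, the gene and perturbation names, and the `non-targeting` control label are all made up for illustration; none of them come from X-Atlas/Orion, and a real analysis would also normalize for sequencing depth.

```python
import math
from collections import defaultdict

# Toy, made-up counts: each cell is (perturbation label, {gene: count}).
# "non-targeting" is an assumed name for the control label.
cells = [
    ("non-targeting", {"GENE_A": 10, "GENE_B": 4}),
    ("non-targeting", {"GENE_A": 14, "GENE_B": 6}),
    ("GENE_A_KD", {"GENE_A": 2, "GENE_B": 5}),
    ("GENE_A_KD", {"GENE_A": 2, "GENE_B": 7}),
]


def pseudobulk_means(cells):
    """Average each gene's count over all cells sharing a perturbation label."""
    sums = defaultdict(lambda: defaultdict(float))
    n_cells = defaultdict(int)
    for label, counts in cells:
        n_cells[label] += 1
        for gene, count in counts.items():
            sums[label][gene] += count
    return {
        label: {gene: total / n_cells[label] for gene, total in genes.items()}
        for label, genes in sums.items()
    }


def log2_fold_changes(means, control="non-targeting", pseudocount=1.0):
    """log2 ratio of perturbed mean to control mean, per perturbation and gene."""
    ctrl = means[control]
    effects = {}
    for label, genes in means.items():
        if label == control:
            continue
        effects[label] = {
            gene: math.log2(
                (genes.get(gene, 0.0) + pseudocount) / (ctrl[gene] + pseudocount)
            )
            for gene in ctrl
        }
    return effects


effects = log2_fold_changes(pseudobulk_means(cells))
for gene, lfc in effects["GENE_A_KD"].items():
    print(f"{gene}: {lfc:+.2f}")
```

In this toy setup, a strongly negative value for `GENE_A` under `GENE_A_KD` is what a working knockdown would look like, while `GENE_B` barely moves.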

Statistical knowledge about the dataset is almost always conveyed in statistical terms, not natural language terms.³ Within-domain and pan-domain knowledge, however, are often conveyed in natural language and sometimes through statistical means like foundation models.
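One common bridge between statistical results and domain knowledge expressed in language is an over-representation test: given the genes that respond to a perturbation, ask whether a named annotation (a GO term, say) is hit more often than chance would suggest. Below is a sketch using a one-sided hypergeometric test. The gene names, term names, and hit list are hypothetical placeholders, not real GO annotations or X-Atlas results.

```python
from math import comb


def hypergeom_enrichment_p(n_universe, n_annotated, n_hits, n_overlap):
    """P(overlap >= n_overlap) when n_hits genes are drawn without replacement
    from n_universe genes, of which n_annotated carry the annotation."""
    total = comb(n_universe, n_hits)
    tail = sum(
        comb(n_annotated, k) * comb(n_universe - n_annotated, n_hits - k)
        for k in range(n_overlap, min(n_annotated, n_hits) + 1)
    )
    return tail / total


def enrich(hits, annotations, universe):
    """Return {term name: enrichment p-value} for a set of hit genes."""
    universe = set(universe)
    hits = set(hits) & universe
    p_values = {}
    for term, genes in annotations.items():
        genes = set(genes) & universe
        p_values[term] = hypergeom_enrichment_p(
            len(universe), len(genes), len(hits), len(hits & genes)
        )
    return p_values


# Hypothetical universe of 20 measured genes and two made-up annotation terms.
universe = [f"G{i}" for i in range(1, 21)]
annotations = {
    "toy_term_1": {"G1", "G2", "G3", "G4"},
    "toy_term_2": {"G10", "G11", "G12", "G13", "G14"},
}
hits = {"G1", "G2", "G3", "G15", "G16"}  # genes assumed to respond

p_values = enrich(hits, annotations, universe)
for term, p in sorted(p_values.items(), key=lambda item: item[1]):
    print(f"{term}: p = {p:.4f}")
```

With many terms, the p-values would need multiple-testing correction before any of them is read as a finding; the sketch leaves that out.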

It’s clear that AI/ML systems can handle all these functions, but there is a need to translate between embeddings and human language to properly integrate this information, and especially to query it. Focusing on the X-Atlas dataset, the relevant translations are:

  • Embedding ⇆ embedding: scGPT (Cui et al. 2024) and successors are foundation models trying to integrate an increasing number of multi-omic single-cell data modalities.
  • Embedding ⇆ natural language: BioReason is a recent foundation model that can reason about the dataset in natural language.
  • Natural language ⇆ natural language: This is asking a question about the dataset in language and getting the answer back in language, as you would with a full-fledged assistant. Systems designed for this include Biomni and FutureHouse.

All of these and more will be applied to this dataset in short order.
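As a toy illustration of the embedding ⇆ natural language direction only, the crudest possible bridge is a nearest-neighbor lookup: take a query vector and return the human-readable labels of the closest known vectors. This is not how scGPT, BioReason, or any of the systems above work; the three-dimensional vectors and the perturbation labels are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_labels(query, labeled_vectors, k=2):
    """Return the k labels whose vectors are most similar to the query."""
    ranked = sorted(
        labeled_vectors.items(),
        key=lambda item: cosine(query, item[1]),
        reverse=True,
    )
    return [label for label, _ in ranked[:k]]


# Hypothetical perturbation embeddings (made-up numbers).
labeled = {
    "perturbation_A": [1.0, 0.0, 0.0],
    "perturbation_B": [0.9, 0.1, 0.0],
    "perturbation_C": [0.0, 1.0, 0.0],
}
query = [1.0, 0.05, 0.0]

neighbors = nearest_labels(query, labeled, k=2)
print(neighbors)
```

The returned labels are the "language" side of the interface; everything richer (explanations, reasoning over the neighbors) is what the foundation models and assistants above are meant to provide.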

Goals

This enterprise is common in drug discovery and other industries, often phrased as:

View a dataset through as many therapeutically interesting lenses as possible.

This is a natural goal to have, because datasets are costly to generate in frontier science. But it is a very high-level interdisciplinary goal, and it is often unclear how to achieve it.

Therefore, the goal for this post suggests itself:

Document various therapeutically interesting chains of reasoning and the AI augmentation used to get them, from an individual scientist’s perspective.

Of course, this is also the task of an AI agent. It would be interesting to write about that, but it is outside the scope here. Instead, this post is an opportunity to see how much AI can compensate for the lack of database-specific knowledge of a computational scientist in biochemistry and drug discovery (me) equipped with some weak domain-specific inductive biases.

Outline

For what is a growing list of explorations, I’m going to use the single-cell perturbation dataset linked here, from the above paper. I’ll go through the HCT116 data here; the other dataset comes from the HEK293T cell line.

Caveats abound – this is a cell line from a colorectal carcinoma, not well-regulated human tissue. But it is living tissue that is often used in drug discovery research and, as measured by the X-Atlas authors, it has the higher signal-to-noise ratio and the better-quality perturbations of the two datasets.

We will fill in:

  • Recapitulating the paper and basic analysis
  • Comparative gene network analysis
  • Recapitulating analyses from other recent papers
  • GPCR biology-focused inquiries

References

Cui, Haotian, Chloe Wang, Hassaan Maan, Kuan Pang, Fengning Luo, Nan Duan, and Bo Wang. 2024. “scGPT: Toward Building a Foundation Model for Single-Cell Multi-Omics Using Generative AI.” Nature Methods 21 (8): 1470–80.

Footnotes

  1. Such as cellOT, the source of the title image.

  2. The code itself is largely vibe-coded iteratively, but with significant detail in the prompts, including instructions on packages and databases to use; the idea is not to document the coding process, but to document the reasoning process.

  3. Except for qualitative observations.

Reuse

CC BY 4.0