Akshay Balsubramani

: akshay { at } akshay { dot } bio
: b-akshay
: @_bakshay

Technical blog linked here.

Research Overview

Welcome! I work on and with AI/biotech ventures, based out of the San Francisco Bay Area. I {partner, build, code, write} with {executives, builders, and scientists} in AI, biotech, and drug discovery.

I have put together drug discovery efforts in a variety of domains, most recently as a Principal Scientist at Sanofi, on the mRNA Platform Design team.

I have worked in synthetic biology, building machine learning for drug discovery at Octant Bio.

I have been a genomics researcher, with postdoctoral training in the lab of Anshul Kundaje at Stanford University. My genomics research has focused on computationally elucidating gene networks and epigenetic regulation of transcription at the single-cell level.

I have been a machine learning and statistics researcher, working on methods for semi-supervised and representation learning and decision theory, as well as work on more foundational aspects of probability and stochastic processes. I completed my PhD at UC San Diego in machine learning, advised by Yoav Freund. I spent summers interning with the Machine Learning group at Microsoft Research NYC, the video content analysis group at Google Research Mountain View, and the Interaction and Intent group at Microsoft Research Silicon Valley.

Research Manuscripts

For an updated list, see my Google Scholar page.
Click on each paper title for a very unofficial one-sentence summary.

Preprints

P-value peeking and estimating extrema. [arXiv]
In old and new statistical hypothesis tests, the reported p-values can be optimally adapted to any sampling strategy, eliminating the need to pre-specify sample sizes.
Akshay Balsubramani. Preprint.

Sharp finite-sample concentration of independent variables. [arXiv]
Empirical distributions from i.i.d. data are concentrated, with a very simple information-theoretic proof.
Akshay Balsubramani. Preprint.

Linking generative adversarial learning and binary classification. [arXiv]
Generative adversarial learning of a distribution, using a classifier learned by risk minimization, is always equivalent to f-divergence minimization.
Akshay Balsubramani. Preprint.

Muffled semi-supervised learning. [arXiv] [code]
There are several ways to achieve significant off-the-shelf improvements on supervised classification performance using unlabeled data, by "muffling" supervised recommendations by imputing the opposite labels on unlabeled data.
Akshay Balsubramani, Yoav Freund. Preprint.

Learning to abstain from binary prediction. [arXiv]
The problem of binary classification with an abstaining predictor centers around the tradeoff between abstaining and making a prediction error. We characterize this tradeoff optimally well, both theoretically and empirically with efficient algorithms that use labeled and unlabeled data.
Akshay Balsubramani. Preprint.

PAC-Bayes iterated logarithm bounds for martingale mixtures. [arXiv]
Any mixture of stochastic processes with high probability stays within an optimally characterized range of its conditional mean, at all times along its sample path, and with respect to all "posterior" mixing distributions.
Akshay Balsubramani. Preprint.

Sharp finite-time iterated-logarithm martingale concentration. [arXiv]
Any stochastic process with high probability stays within a narrow, optimally characterized range of its conditional mean, at all times along its sample path.
Akshay Balsubramani. Preprint.

Papers

Integrative single-cell analysis of cardiogenesis identifies developmental trajectories and non-coding mutations in congenital heart disease. [bioRxiv]
Individual heart cells develop and change in a way that can be meticulously traced through the genome and epigenome, uncovering new insights into genetic variations causing heart disease.
Mohamed Ameen*, Laksshman Sundaram*, Mengcheng Shen, Abhimanyu Banerjee, Soumya Kundu, Surag Nair, Anna Shcherbina, Mingxia Gu, Kitchener D. Wilson, Avyay Varadarajan, Nirmal Vadgama, Akshay Balsubramani, Joseph C. Wu, Jesse Engreitz, Kyle Farh, Ioannis Karakikes, Kevin C. Wang, Thomas Quertermous, William Greenleaf, Anshul Kundaje.
Cell, 2022.

Accelerating in silico saturation mutagenesis using compressed sensing. [bioRxiv]
Interrogating neural network models for base-pair-level effects is a common task which can be made far more efficient (~>100x) using appropriate computational insights.
Jacob Schreiber, Surag Nair, Akshay Balsubramani, Anshul Kundaje.
Bioinformatics, 2022.

Domain adaptive neural networks improve cross-species prediction of transcription factor binding. [bioRxiv]
Transcription factor binding contains interesting and learnable signal that is species-agnostic, as well as species-specific binding that can be learned with domain adaptation methods from AI.
Kelly Cochran, Divyanshi Srivastava, Avanti Shrikumar, Akshay Balsubramani, Anshul Kundaje, Shaun Mahony.
Genome Research, 2022. Prev. in Workshop on Machine Learning in Computational Biology, 2019.

WILDS: A benchmark of in-the-wild distribution shifts. [arXiv]
"In the wild" shifts between training and test distributions are commonplace and benchmarked in this paper, with significant effects on performance of predictive models.
Pang Wei Koh*, Shiori Sagawa*, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weihua Hu, Michihiro Yasunaga, Richard Lanas Phillips, Sara Beery, Jure Leskovec, Anshul Kundaje, Emma Pierson, Sergey Levine, Chelsea Finn, Percy Liang.
International Conference on Machine Learning (ICML), 2021.

A genome-wide atlas of co-essential modules assigns function to uncharacterized genes. [paper] [bioRxiv]
By measuring essentiality of genes in a broad spectrum of cancer cell lines using CRISPR, we can infer known and unknown functional relationships between genes.
Michael Wainberg*, Roarke A. Kamber*, Akshay Balsubramani*, Robin M. Meyers, Nasa Sinnott-Armstrong, Daniel Hornburg, Lihua Jiang, Joanne Chan, Ruiqi Jian, Mingxin Gu, Anna Shcherbina, Michael M. Dubreuil, Kaitlyn Spees, Wouter Meuleman, Michael P. Snyder, Anshul Kundaje, Michael C. Bassik.
Nature Genetics, 2021.

Learning transport cost from subset correspondence. [arXiv]
Information about partial correspondences between analogous datasets can be used to learn custom metrics for use by optimal transport methods.
Ruishan Liu, Akshay Balsubramani, James Zou.
International Conference on Learning Representations (ICLR), 2020 (conference track).

An adaptive nearest neighbor rule for classification. [arXiv] [code] [demo] [spotlight]
Nearest-neighbor classifiers can be modified to give robust, provable, practically checkable confidence sets by choosing the neighborhood size according to local label noise.
Akshay Balsubramani, Sanjoy Dasgupta, Yoav Freund, Shay Moran.
Neural Information Processing Systems (NeurIPS), 2019.

Semantically decomposing the latent spaces of generative adversarial networks. [arXiv] [code] [demo]
When learning a latent space for generating data, any given axis of variation in the data can be disentangled from the rest in the latent space, using an efficient model-agnostic pairwise training strategy.
Chris Donahue, Zachary C. Lipton, Akshay Balsubramani, Julian McAuley.
International Conference on Learning Representations (ICLR), 2018 (conference track).

The ENCODE-DREAM Challenge to predict genome-wide binding of regulatory proteins to DNA. [pdf]
An open challenge to design a genome-wide predictor of transcription factor binding.
Akshay Balsubramani, Nathan Boley, James C. Costello, Laura M. Heiser, Tim Jeske, Robert Kueffner, Jin Wook Lee, Rani K. Powers, Anshul Kundaje.
Machine Learning Challenges as a Research Tool, NIPS, 2017.

Optimal binary autoencoding with pairwise correlations. [arXiv] [code] [discussion]
Efficient and practical biconvex learning of binary autoencoders is strongly optimal, using pairwise correlations between encoding and decoding layers.
Akshay Balsubramani.
International Conference on Learning Representations (ICLR), 2017 (conference track).

Sequential nonparametric testing with the law of the iterated logarithm. [arXiv]
When performing non-parametric testing of the difference in mean between two distributions (and many other problems besides), we devise rigorous sequential tests that use as few samples as possible, adapting to the unknown mean difference.
Akshay Balsubramani*, Aaditya Ramdas*.
Conference on Uncertainty in Artificial Intelligence (UAI), 2016.

Optimal binary classifier aggregation for general losses. [arXiv] [spotlight]
The minimax optimal way to combine a set of binary classifiers of varying competences with unlabeled data is an artificial neuron, with a sigmoid-shaped transfer function that only depends on the evaluation loss function.
Akshay Balsubramani, Yoav Freund.
Neural Information Processing Systems (NIPS), 2016. Short version in Workshop on Learning Faster from Easy Data, NIPS, 2015.

Instance-dependent regret bounds for dueling bandits. [paper]
Online learning from limited (bandit) pairwise feedback between actions is easy when a few actions are better than the rest and the matrix of pairwise preferences is well-conditioned.
Akshay Balsubramani, Zohar Karnin, Robert Schapire, Masrour Zoghi.
Conference on Learning Theory (COLT), 2016.

Scalable semi-supervised aggregation of classifiers. [arXiv]
There is an efficient way to use unlabeled data to combine the trees of a random forest, which often performs better than random forests for binary classification.
Akshay Balsubramani, Yoav Freund.
Neural Information Processing Systems (NIPS), 2015.

Optimally combining classifiers using unlabeled data. [arXiv]
The minimax optimal way to combine a set of binary classifiers of known competences with unlabeled data resembles a weighted majority vote, and is efficiently learnable.
Akshay Balsubramani, Yoav Freund.
Conference on Learning Theory (COLT), 2015.

The fast convergence of incremental PCA. [arXiv]
Natural algorithms for incremental linear-time and -space principal component analysis (PCA) converge quickly to the optimum, despite the problem's nonconvexity.
Akshay Balsubramani, Sanjoy Dasgupta, Yoav Freund.
Neural Information Processing Systems (NIPS), 2013.

Workshop Only

An empirical comparison of sparse vs. embedding techniques on many-class text classification.
Rare features can be usefully predictive in (text) classification problems with many classes and features.
Akshay Balsubramani, Omid Madani.
Workshop on Extreme Classification, NIPS, 2013.

Theses

Playing games to reduce supervision in learning. [pdf]

Ph.D. Dissertation, 2016.

The utility of abstaining in binary classification. [arXiv]
Research Exam (requirement for M.S.), UC San Diego. March 2013.

* Indicates equal authorship.

Other writing

I maintain a blog where I post research-related content that hasn't made it into papers (yet).

Miscellaneous

Before the PhD, I was an Associate at Strand Life Sciences, where I did statistical genomics, developing tools for genomics researchers. Previously, I received a B.S. (High Honors) in Electrical Engineering and Computer Science at UC Berkeley. On the way to that degree, I minored in (quantum) physics at Berkeley as well. Before that, I lived in various parts of India, the US, and Singapore.

Some suggestions on research which I believe in.

I used to play the violin (and occasionally still do); before college, I got a distinction in it (unfortunately recordings are lost!). I also played the Carnatic classical style, which is less polyphonic but melodically richer than the Western European classical tradition.

I have always enjoyed traveling and do so whenever the opportunity arises. I like running, occasionally structured. In my free time, I sometimes write on history and philosophy tidbits I find interesting.

This site is (still and perennially) under construction.