Topic statement
The world that we work and play in has increasingly been shaped by machine learning, whose success and generalizability arise from applying a set of fundamental statistical ideas that describe the behavior of populations. These ideas have been vital to designing algorithms for learning from the massive collections of structured user data that arise in large consumer-facing technology companies.
But as time has passed, the initial optimistic flush of new consumer benefits from these powerful ML ideas has faded somewhat. The ability to collect data and run experiments freely at scale, once tech’s major drawing card for the application of ML ideas, has shown itself to have increasingly alarming consequences on a societal level.
Still, applying these ideas has never been more rewarding! They are fueling scientific revelations in real time in the natural sciences, particularly in basic biology and biochemical discovery.
Era-defining new ways of biochemically measuring and manipulating the environments of cells, and the biomolecules within them, have emerged. Sequencing and microfluidics revolutions have catalyzed existing technologies and enabled a variety of new ones.
The accumulation of new technologies and data has led to a dizzying kaleidoscope of data-driven scientific discoveries of depth and power. This has often been driven by a small set of fundamental quantitative and algorithm-design themes, which are employed again and again in various forms on new problems and data to distill the scientifically disorienting flood of new measurements and experiments.
This blog will focus on such examples in action, demonstrating a growing toolbox of cross-platform algorithms to describe biochemical and other observations.
The living environment of a cell is an inexplicably complex place:

> youtube: https://youtu.be/VdmbpAo9JR4
In each cell, there is a vast assortment of intricately tuned molecular machines (proteins) which perform a crowded multi-way dance in which they interact with each other, especially through each other’s functional chemical groups of a few atoms each. And there are millions of cells of different types, interacting in medically relevant tissue- and organ-level responses measured by experiments. All this represents a daunting challenge to understand.
Fortunately, we live in a time of wonders, and a variety of biochemical reactions have made it increasingly possible to observe this dance at an atomic level of detail. All this experimental data has made understanding seem tantalizingly within reach of quantitative models and learning.
Physics, a mature natural science, is often looked to as an example in this respect. It is one area in which very complex behavior has famously been well described by quantitative methods, leading to a fairly succinct understanding of the world we observe. There are some powerful mathematical threads at play here: basic statistical mechanics ideas, developed into information theory, have been foundational to ML in helping describe the structure of datasets. From a statistical point of view, large populations exhibit remarkable structures that can be understood in common across fields, and we hope to bring some of that perspective to these posts.
Many of us who make and use algorithms need to introduce them into new teams of people and new data. In these situations, it helps when they are developed in an idealized way:
1. Concept - Implementable idea - A concept explained with pseudocode.
2. Code - Usable implementation - Code that runs the algorithm’s pseudocode.
3. Explanation - Documented usage - The code in action, with documented usage examples.
But reality doesn’t typically meet this ideal. Instead, a team adopting this type of code often needs to reconstruct understanding on the fly, which can lead to escalating implementation and maintenance difficulties.
Reference resources like textbooks and some research papers focus on the concept (1), rather than the code (2) or the explanation (3) – and the concepts are conveyed in a different language, possibly originating with some regrettably dense math. Meanwhile, implementations typically have the opposite emphasis, focusing on (2) and sometimes documented usage (3). But they can be difficult to modify and repurpose without understanding what is being done and why; and it takes far longer to reverse-engineer this understanding.
So there is scope to bridge this gap with something like Knuth’s “literate programming” [@knuth1984literate], in which algorithms are explained and used in context, inline with reproducible code. This is our aim, focusing on methods with simple APIs, which can be implemented fairly cleanly with the basic ML Python software stack. There is a huge variety of these, with ML and statistics algorithms research producing fresh ideas constantly – and we believe they should be more commonly accessible tools in the natural sciences.
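As a rough illustration of the concept, code, and explanation pattern described above, here is a minimal sketch using Shannon entropy, chosen only because information theory is mentioned earlier in this post. The function name `shannon_entropy` and the toy inputs are assumptions made for this example, not part of any existing library or of the posts to come. It uses only the Python standard library.

```python
import math
from collections import Counter


def shannon_entropy(observations):
    """Concept: H = sum_x p(x) * log2(1 / p(x)), where p(x) is the
    empirical frequency of each distinct value in `observations`.

    Returns the entropy in bits. (Illustrative helper, hypothetical name.)
    """
    observations = list(observations)
    if not observations:
        raise ValueError("need at least one observation")
    n = len(observations)
    counts = Counter(observations)  # occurrences of each distinct value
    # Code: a direct translation of the formula above.
    return sum((c / n) * math.log2(n / c) for c in counts.values())


# Explanation: documented usage on toy data.
fair = shannon_entropy("HTHT")      # two equally frequent outcomes
constant = shannon_entropy("AAAA")  # a single outcome, no uncertainty
print(fair, constant)               # prints: 1.0 0.0
```

The docstring carries the concept, the function body is the usable implementation, and the last lines show documented usage, so a reader can check each layer against the others.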
“An algorithm must be seen to be believed.” - Don Knuth