Introduction: getting player rankings from one-on-one matches

To sports fans like me, the problem of determining a global ranking of competitors or teams from a sparse collection of one-on-one matches is a familiar one. In competitive endeavors like soccer, tennis, or chess, the systems for doing this can be strange and opaque to varying degrees, but they ultimately come up with a "reasonable" ranking.

I wanted to post about one such method that might be my favorite. It is a beautiful application of linear algebra -- one that is widely known in some disciplines and totally revelatory in others. It relies on a collection of tools for calculus -- measuring local change -- on graphs.

Measuring change on graphs

This calculus consists of tools that are extremely simple to implement, but the Python toolboxes that implement them often don't expose or document them in a way that's readily broken down and understood.

In this post, I'll quickly lay out some of the basics of this "graph calculus," and we'll see an efficient way to implement the calculus on a graph.

The graph could represent a manifold, about which I've written previously; graph calculus is a powerful toolbox for efficiently and nonparametrically manipulating functions on these manifolds, mimicking the calculus tools originally invented to model physical phenomena.

# !pip install -q scanpy
# !pip install anndata

import sys, numpy as np, time, scipy, scipy.linalg, scipy.sparse, scipy.sparse.linalg, scipy.stats
import os, sklearn
import matplotlib.pyplot as plt
%matplotlib inline

import anndata, scanpy as sc

print('Packages imported.')
Packages imported.

Getting global rankings from matches

There's a very neat way to conceptualize the problem of getting a global ranking from pairwise comparisons. It relies on two steps: making a graph out of the matches played, and then computing a unique global ranking on this graph.

Making a graph from match results

Suppose we have results from a number of one-on-one matches between a group of players. If we draw these players as circles, we can connect them with arrows pointing at the winners.

This is a (directed) graph! If we're required to rank the players, we'd like higher-ranked players to beat lower-ranked ones. For many input graphs, there might not be a ranking that perfectly satisfies this, but we'd at least want the ranking to agree with the match results most of the time, and more strongly when the ranking gap is bigger. We'd generally consider any ranking that does this to be "reasonable" given the match result graph.
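To make this concrete, here's a tiny sketch with made-up players and results (all names and outcomes are hypothetical), turning a list of matches into a sparse adjacency matrix whose entry $(x, y)$ counts the matches in which $x$ lost to $y$ -- the same construction used for real data later in the post.

import numpy as np, scipy.sparse

players = ['Ana', 'Bo', 'Cy', 'Di']                      # hypothetical players
matches = [('Bo', 'Ana'), ('Cy', 'Ana'), ('Cy', 'Bo'),   # hypothetical (loser, winner) pairs
           ('Di', 'Bo'), ('Di', 'Cy')]
idx = {p: i for i, p in enumerate(players)}
rows = [idx[loser] for loser, winner in matches]
cols = [idx[winner] for loser, winner in matches]
match_graph = scipy.sparse.csr_matrix(
    (np.ones(len(matches)), (rows, cols)), shape=(len(players), len(players))
)
print(match_graph.toarray())   # entry (x, y): number of matches in which x lost to y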

In this post, we show that:

We can calculate a "reasonable" ranking of players for any set of match results, such that the match results follow the ranking as much as possible.

Computing a ranking on graph nodes

We can rely on an amazing linear-algebra framework that was first discovered by physicists and is now touched on in undergraduate vector calculus classes. I'll lay this out in the next section, "Formulating a graph calculus", which can be read standalone, or skipped by those who aren't interested in the details and derivation.

After that, the section "Global rankings from the graph" contains the recipe for computing a "potential function" over nodes, from which the ranking is easily computed.

Finally, the last section contains experiments before a summary.

Formulating a graph calculus

We deal with an undirected connected weighted graph on $n$ vertices represented by a symmetric adjacency matrix $K \in \mathbb{R}^{n \times n}$. This stands in for a manifold. From a data science perspective, we can think of this in a couple of ways:

  • As a binary (0-1) matrix where $K (x, y) = 1$ iff $x$ and $y$ have a bilateral symmetric relationship defined between them (e.g. they have played a match).
  • As a neighborhood graph, where $K (x, y) = 0$ unless $x$ and $y$ are nearby neighbors.
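As a minimal sketch of the second interpretation (on made-up random points, using scikit-learn's kneighbors_graph purely for illustration; the experiments later build their neighborhood graph with scanpy instead):

import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.default_rng(0).standard_normal((100, 5))   # made-up points in 5 dimensions
knn = kneighbors_graph(X, n_neighbors=5, mode='connectivity')
K_nbrs = ((knn + knn.T) > 0).astype(float)   # symmetrize: connect x and y if either is a neighbor of the other
print(K_nbrs.shape, K_nbrs.nnz)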

On this graph $K$, we deal with arrays and matrices, which play the roles of the scalar and vector fields of multivariate calculus:

  • Scalar field - represented by a function on the graph vertices: a vector $g \in \mathbb{R}^{n}$.
  • Vector field - represented by a function on the graph edges: a matrix $G \in \mathbb{R}^{n \times n}$.

Definitions of graph calculus

The gradient of a scalar field $g$ is a vector field $\nabla g$ connecting neighboring vertices in $K$ by their differences: $$ [\nabla g] (x, y) := ( g (y) - g (x) ) \cdot \textbf{1} (K (x, y) > 0) $$ It is a natural way to represent changes in a function over the manifold.

Flows

A vertex $x$ has degree $d (x) = \sum_{y} K (x, y)$; define $D := \mbox{diag} (d)$ to be the diagonal matrix constructed from $d$. Any vector field $G$ has a symmetric part $\textsf{sym}\;G$, with $[\textsf{sym}\;G] (x, y) = \frac{1}{2} (G (x, y) + G (y, x))$.

The field $G$ is called antisymmetric if $\textsf{sym}\;(G) = 0$. Such a field is also called a flow -- the prototypical example is the gradient flow $G := \nabla g$.

def symmetric_part(A):
    """ Returns the symmetrized version of the input matrix. """
    return 0.5*(A + A.T)

def asymmetric_part(A):
    """ Returns the anti-symmetrized version of the input matrix. """
    return 0.5*(A - A.T)

def grad_op_graph(fn, adj_mat):
    """
    Gradient of a function on a (n x n) graph.
    Args: 
        fn: A function on vertices (n-dimensional vector)
        adj_mat: (n x n) Adjacency matrix of graph
    Returns:
        Matrix ((n x n) directed graph) of gradients on edges.
    """
    # Entry (x, y) is K(x, y) * (fn(y) - fn(x)): the difference along each edge, weighted by the edge weight.
    return adj_mat.multiply(fn) - adj_mat.multiply(fn).T

Differential operators: divergence and curl

Given a vector field $G$, the divergence of $G$ is a scalar field measuring the net flux of signal leaving each vertex. $$ [\textsf{div}\; G] (x) := \frac{1}{2} \sum_{y \neq x} \frac{K (x, y)}{d(x)} (G (x, y) - G (y, x)) $$ The curl of $G$ is a vector field that's basically a symmetrized version of G's local magnitude along an edge. $$ [\textsf{curl}\; G] (x, y) := \frac{\textbf{1} (K (x, y) > 0) }{2 K (x, y)} (G (x, y) + G (y, x)) $$

Second derivatives: the Laplacian

With these differential operators defined, we can write some properties of second derivatives that correspond to their analogues in vector calculus. Importantly, we define the Laplacian of $g$. $$ [\nabla^2 g] (x) := \text{scalar Laplacian of } g = -[\textsf{div} (\nabla g )] (x) = g(x) - \frac{1}{d(x)} \sum_{y \neq x} K (x, y) g(y) = [(I - D^{-1} K) g] (x) $$ The definitions of the operators also make clear that $\textsf{div} (\textsf{curl}\; G) = 0$, exactly as in vector calculus. Similarly, $\textsf{curl} (\nabla g) = 0$ as well.

def compute_transitions_asym(adj_mat):
    W = symmetric_part(adj_mat)
    recip = np.reciprocal(np.asarray(W.sum(axis=0)).astype(np.float64))
    recip[np.isinf(recip)] = 0
    return scipy.sparse.spdiags(recip, 0, W.shape[0], W.shape[0]).dot(W)

def laplacian_op_graph(adj_mat, normalize=True):
    """
    Laplacian of a function over graph vertices.
    
    Args:
        adj_mat: adjacency matrix for data kNN graph.
        normalize: Whether to degree-normalize the Laplacian matrix. 
    Returns: 
        Laplacian matrix (V x V).
    """
    if not normalize:
        lapmat = scipy.sparse.diags(np.ravel(adj_mat.sum(axis=0))) - adj_mat
    else:
        lapmat = scipy.sparse.identity(adj_mat.shape[0]) - compute_transitions_asym(adj_mat)
    return lapmat

def div_op_graph(field_mat, adj_mat):
    """
    Divergence of a function in a neighborhood on the data manifold. 
    Args:
        field_mat: a vector field (function on edges, i.e. (V x V) matrix-valued)
        adj_mat: adjacency matrix for data kNN graph.
    Returns: 
        Array of vertex-wise divergence values.
    """
    trans_mat = compute_transitions_asym(adj_mat)
    return np.ravel(trans_mat.multiply(asymmetric_part(field_mat)).sum(axis=1))

def curl_op_graph(field_mat, adj_mat):
    """
    Curl of a function on edges of a (cell x cell) graph.
    Args: 
        field_mat: a vector field (function on edges, i.e. (V x V) matrix-valued)
        adj_mat: (cell x cell) Adjacency matrix of graph
    Returns:
        Matrix of curl associated with each edge.
    """
    # Take reciprocals of the edge weights on a float copy, so the caller's adjacency matrix isn't modified in place.
    recip_adj = adj_mat.astype(np.float64)
    np.reciprocal(recip_adj.data, out=recip_adj.data)
    return recip_adj.multiply(symmetric_part(field_mat))
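As a quick sanity check (a sketch assuming the operator implementations above are in scope), we can verify numerically on a small random 0-1 graph that $\textsf{div} (\textsf{curl}\; G) = 0$ and $\textsf{curl} (\nabla g) = 0$, as stated above.

# Sanity check of div(curl G) = 0 and curl(grad g) = 0 on a small random 0-1 graph.
rng = np.random.default_rng(0)
n = 20
A = (rng.random((n, n)) < 0.3).astype(float)
A = np.triu(A, 1); A = A + A.T                                  # symmetric 0-1 adjacency, zero diagonal
adj = scipy.sparse.csr_matrix(A)

g = rng.standard_normal(n)                                      # a scalar field on vertices
G = scipy.sparse.csr_matrix(rng.standard_normal((n, n)) * A)    # an arbitrary edge field

print(np.abs(div_op_graph(curl_op_graph(G, adj), adj)).max())   # div(curl G): expect 0
print(abs(curl_op_graph(grad_op_graph(g, adj), adj)).sum())     # curl(grad g): expect 0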

Connections to physics and the fundamental theorem

These notions are very familiar to physicists. In fact, all of them -- the divergence, curl, and Laplacian -- were invented to quantitatively describe physics phenomena, and later became part of vector calculus. They are now central to continuous ways of describing the physical world, as in Maxwell's equations of electrodynamics, fluid mechanics, and relativity. Maxwell's equations in particular have strong roots in the fundamental theorems of vector calculus describing divergence, flux, and integration by parts.

To tie all this together mathematically, we can prove a fundamental theorem of graph calculus that deals with the relationship between a signal's Laplacian in a region and its flux exiting the region.

For any functions $g, h$ on vertices and a non-empty subset $\Omega \subseteq V$, $$ \begin{aligned} \sum_{x \in \Omega} \nabla^2 g(x) h(x) d(x) = \frac{1}{2} \sum_{x, y \in \Omega} [\nabla g] (x, y) [\nabla h] (x, y) K (x, y) - \sum_{x \in \Omega , y \in \Omega^c} [\nabla g] (x, y) h(x) K (x, y) \end{aligned} $$

$\textbf{Proof}$ $$ \begin{aligned} \sum_{x \in \Omega} \nabla^2 g(x) h(x) d(x) &= \sum_{x \in \Omega} h(x) \sum_{y \neq x} K (x, y) (g(x) - g(y)) = \sum_{x \in \Omega} \sum_{y \neq x} K (x, y) (g(x) - g(y)) h(x) \\ &= \sum_{x \in \Omega} \sum_{y \in \Omega} K (x, y) (g(x) - g(y)) h(x) + \sum_{x \in \Omega} \sum_{y \in \Omega^c} K (x, y) (g(x) - g(y)) h(x) \qquad \qquad (*) \end{aligned} $$ The term $\sum_{x \in \Omega} \sum_{y \in \Omega} K (x, y) (g(x) - g(y)) h(x)$ can also be written as $\sum_{y \in \Omega} \sum_{x \in \Omega} K (y, x) (g(y) - g(x)) h(y)$, interchanging the roles of $x$ and $y$. Since $K$ is symmetric, averaging these two expressions gives $\frac{1}{2} \sum_{x \in \Omega} \sum_{y \in \Omega} K (x, y) (g(x) - g(y)) (h(x) - h(y))$. Thus we deduce that $$\sum_{x \in \Omega} \sum_{y \in \Omega} K (x, y) (g(x) - g(y)) h(x) = \frac{1}{2} \sum_{x \in \Omega} \sum_{y \in \Omega} K (x, y) [\nabla g] (x, y) [\nabla h] (x, y) $$ Substituting this in the equation $(*)$ above gives the result.

Special cases

As special cases, we can get fundamental forms of Gauss's law (the divergence theorem) describing flux and the Laplacian, and Green's identities describing integration by parts:

$\textbf{Green's identities}$ For any functions $g, h$ on vertices, $$ \begin{aligned} \sum_{x \in \Omega} \nabla^2 g(x) d(x) &= - \sum_{x \in \Omega} d(x) \sum_{ y \in \Omega^c} [\nabla g] (x, y) \frac{K (x, y)}{d(x)} \;\;\forall \Omega \subseteq V \qquad &(\text{Gauss's Law}) \\ \sum_{x \in V} h (x) \nabla^2 g (x) d(x) &= \frac{1}{2} \sum_{x, y \in V} [\nabla g] (x, y) [\nabla h] (x, y) K (x, y) \qquad &(\text{First Green's identity}) \\ \sum_{x \in V} h (x) \nabla^2 g (x) d(x) &= \sum_{x \in V} g (x) \nabla^2 h (x) d(x) \qquad &(\text{Second Green's identity}) \end{aligned} $$
$\textbf{Other second derivatives on the graph}$ For any vector field $G$ on edges, $$ [\nabla (\textsf{div}\; G)] (x, y) = \textbf{1} (K (x, y) > 0) \left( [\textsf{div}\; G] (y) - [\textsf{div}\; G] (x) \right) = \frac{\textbf{1} (K (x, y) > 0)}{2} \sum_{z \in V} \left[ \frac{K (y, z)}{d(y)} (G_{yz} - G_{zy}) - \frac{K (x, z)}{d(x)} (G_{xz} - G_{zx}) \right] $$ $$ [\Delta G] (x, y) := \text{vector Laplacian of } G = [\textsf{curl} (\textsf{curl}\; G) - \nabla (\textsf{div}\; G ) ] (x, y) = \frac{1}{2 K (x, y)^2} (G (x, y) + G (y, x)) - \left[ \nabla (\textsf{div}\; G ) \right] (x, y) $$
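As another quick numerical sanity check (a sketch, again assuming the implementations above are in scope), the First Green's identity can be verified directly on a small random 0-1 graph:

# Numerical check of the First Green's identity on a random 0-1 graph.
rng = np.random.default_rng(1)
n = 30
A = (rng.random((n, n)) < 0.2).astype(float)
A = np.triu(A, 1); A = A + A.T                        # symmetric 0-1 adjacency, zero diagonal
adj = scipy.sparse.csr_matrix(A)

g, h = rng.standard_normal(n), rng.standard_normal(n)
d = np.ravel(adj.sum(axis=1))                         # vertex degrees

lhs = np.sum(h * d * laplacian_op_graph(adj).dot(g))  # sum_x h(x) * Laplacian(g)(x) * d(x)
dg = g[None, :] - g[:, None]                          # gradient of g: entry (x, y) is g(y) - g(x)
dh = h[None, :] - h[:, None]
rhs = 0.5 * np.sum(dg * dh * A)                       # (1/2) sum_{x,y} grad(g) * grad(h) * K
print(np.allclose(lhs, rhs))                          # expect True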

Interpreting the calculus

Just as in standard vector calculus, the most useful way of understanding these operators is understanding what they measure locally: the gradient measures differences of a vertex function along edges, the divergence measures the net flow leaving each vertex, and the curl measures the locally cyclic (symmetric) part of a flow along each edge.

A note on normalization and defining this calculus

Writing up this calculus involves more detailed choices than may first appear. There are a couple of different normalizations of the Laplacian, corresponding to different ways of calculating the divergence and gradient.

I've chosen the "non-self-adjoint" way to write this, which is less symmetric mathematically; but I prefer it for the nice interpretations it gives to divergences and gradients.

To be specific, we can define inner products on the vertices $ \langle a, b \rangle_{V} = \sum_x a(x) b(x) d(x) $ and on the edges $ \langle A, B \rangle_{E} = \sum_{(x, y) \sim e} A(x, y) B(x, y) K (x, y) $. Having defined this, the divergence is defined to be the "adjoint" of the gradient, satisfying the following equation for any functions $f$ over vertices and $G$ over edges:

$$ - \langle f, \textsf{div}\; G \rangle_{V} = \langle \nabla f, G \rangle_{E} $$
$\textbf{Proof}$ $$ \begin{aligned} - \langle f, \textsf{div}\; G \rangle_{V} &= - \sum_x d(x) f(x) \sum_{y \neq x} \frac{K (x, y)}{d(x)} (G (x, y) - G (y, x)) \\ &= \sum_x \sum_{y \neq x} f(x) K (x, y) (G (y, x) - G (x, y)) = \sum_{(x, y) \sim e} (f(y) - f(x)) K(x, y) G (x, y) = \langle \nabla f, G \rangle_{E} \end{aligned} $$

Global rankings from the graph

Decomposing functions over edges

In physics and math, a fundamental result of vector calculus is that the divergence and curl operators, used to model fluids locally, can be used to decompose any vector field. (This power is why Maxwell's equations completely describe any electromagnetic field.) The resulting scalar and vector potentials are wonderful ways to understand the local behavior of these fluids. We can show the same result for functions on graphs!

This is the Helmholtz decomposition: any function over edges $G$ can be decomposed as $$ G = - \nabla \Phi + U + S $$

where

  • $\Phi$ is a vertex potential function (with $\textsf{curl} (\nabla \Phi) = 0$)
  • $S$ is a symmetric edge potential function: the symmetric matrix closest to $G$, i.e. the minimizer of $|| S - G ||$ over symmetric matrices
  • $U$ is a harmonic (globally conservative) function on edges with $\textsf{div} ( U) = 0$ and $\textsf{curl} ( U) = 0$

Given the above properties, all these functions are unique ($\Phi$ up to constant shift) and efficiently calculable:

$$ \Phi = (\nabla^2)^{+} [ \textsf{div}\; G ] \qquad \qquad S = \frac{1}{2} (G + G^{\top}) $$

where $(\nabla^2)^{+}$ is the pseudoinverse of the Laplacian $\nabla^2$.

def helmholtz(
    field_mat, 
    adj_mat, 
    given_divergence=None
):
    """
    Compute Helmholtz decomposition of input edge function on graph.
    Args:
        adj_mat: adjacency matrix for data kNN graph.
        field_mat: a vector field (function over edges, i.e. matrix-valued)
        given_divergence: a given divergence function over vertices. 
            Replaces the use of field_mat if the source/sink info is known.
    Returns: 
        Pair of (vertex potential P, edge potential S) such that
        field_mat = - ∇P + S + U
        where U is a "harmonic" edge flow which is divergence-free, and 0 when field_mat is symmetric.
    """
    laplacian_op = laplacian_op_graph(adj_mat)
    if given_divergence is not None:
        sympart = scipy.sparse.csr_matrix(adj_mat.shape)
        potential = scipy.sparse.linalg.lsmr(
            laplacian_op, 
            given_divergence
        )
    else:
        sympart = symmetric_part(field_mat)
        potential = scipy.sparse.linalg.lsmr(
            laplacian_op, 
            div_op_graph(field_mat, adj_mat)
        )
    return (potential[0], sympart)

In words, any function on edges $G$ can be broken up into:

  1. A gradient flow $- \nabla \Phi$ that is globally acyclic (curl-free)
  2. A curl flow $S$ that is locally cyclic (div-free)
  3. A harmonic flow $U$ that is locally cyclic but globally acyclic (div-free and curl-free)

This translates a cornerstone theorem of fluid dynamics [Chorin and Marsden 1990] to our graph manifold. A small numerical check of the decomposition follows.
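Here's a minimal numerical sketch of the decomposition on a toy 4-node graph, assuming the helmholtz, asymmetric_part, symmetric_part, and div_op_graph functions defined above are in scope: we recover $\Phi$ and $S$, reconstruct the harmonic piece $U = F + \nabla \Phi$ from the anti-symmetric part $F = G - S$, and check that $U$ is indeed divergence-free and curl-free.

# Toy check of the Helmholtz decomposition G = -grad(Phi) + U + S on a small connected graph.
A = np.array([[0., 1., 1., 0.],
              [1., 0., 1., 1.],
              [1., 1., 0., 1.],
              [0., 1., 1., 0.]])
adj = scipy.sparse.csr_matrix(A)

rng = np.random.default_rng(2)
G = rng.standard_normal((4, 4)) * A                   # an arbitrary flow supported on the edges

phi, S = helmholtz(scipy.sparse.csr_matrix(G), adj)
grad_phi = A * (phi[None, :] - phi[:, None])          # grad(Phi): entry (x, y) is Phi(y) - Phi(x) on edges
U = asymmetric_part(G) + grad_phi                     # harmonic piece: U = F + grad(Phi), with F = G - S

print(np.allclose(div_op_graph(scipy.sparse.csr_matrix(U), adj), 0, atol=1e-5))   # div U ~ 0
print(np.allclose(symmetric_part(U), 0))                                          # curl U = 0 (U is anti-symmetric)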

Proof of decomposition

This section can be skipped without loss of meaning - it has a proof of the decomposition.

First, take $S = \frac{1}{2} (G + G^{\top})$, which possesses the desired properties, since it is the unique projection of $G$ on to symmetric matrices [Fan and Hoffman 1955]. This leaves the anti-symmetric edge vector $F = G - S$, which can further be decomposed by a lemma.

$\textbf{Lemma. }$ For any anti-symmetric edge vector $F$, there is a unique function $\Phi$ (up to constant shift) on vertices, and a unique function $U$ on edges with $\textsf{div} ( U) = 0$ and $\textsf{curl} ( U) = 0$, such that $$ F = - \nabla \Phi + U $$ Also, $\Phi$ can be directly calculated: $\Phi = (\nabla^2)^{+} (\textsf{div}\; F)$.

$\textbf{Proof of lemma}$ First note that $\textsf{curl} (F) = 0$, since $F$ is anti-symmetric; combined with $\textsf{curl} ( \nabla \Phi) = 0$, this forces $\textsf{curl} ( U) = 0$.

To show uniqueness of $\Phi$ up to shift, assume that $\Phi$ and $\Phi_2$ are two functions on vertices with properties as defined in the lemma statement. Since $\textsf{div}\;U = 0$ by assumption, $ \textsf{div}\; F = - \textsf{div} (\nabla \Phi ) = \nabla^2 \Phi $. By similar reasoning, $ \textsf{div}\; F = \nabla^2 \Phi_2$.

It can be verified that since the graph is connected, $\nabla^2$ has a one-dimensional null space (eigenvalue $0$), spanned by the corresponding eigenvector $\textbf{1}$, the all-ones vector. So $\nabla^2 \Phi = \nabla^2 \Phi_2 \iff \nabla^2 (\Phi - \Phi_2 ) = 0 \iff \Phi - \Phi_2 = c \textbf{1}$ for some constant $c$, proving that $\Phi$ is unique up to constant shift.

This defines a unique $\nabla \Phi$, showing that $U = F + \nabla \Phi$ is unique in the decomposition given $F$.

This also shows that $\nabla^2$ is of full rank on the subspace orthogonal to the one-dimensional span of $\textbf{1}$, so that we can recover $\Phi = (\nabla^2)^{+} (\textsf{div}\; F)$ up to constant shift.

This ends the proof.

Given the lemma, the rest of the decomposition follows by noting that:

  • $\textsf{div}\; S = 0$ since $S$ is symmetric, so $\textsf{div}\; F = \textsf{div}\; G$, so $\Phi = (\nabla^2)^{+} (\textsf{div}\; G)$.
  • For a given $G$, $S$ is unique, so $F$ is unique, so by the lemma, $\Phi$ and $U$ are unique.

Interpreting the decomposition

We can show that the potential $\Phi$ solves a corresponding least-squares problem, finding the function whose negative gradient best approximates $G$: $$ \min_{\Phi} \frac{1}{2} || \nabla \Phi + G ||^2 $$ Setting the gradient of this objective to zero gives the normal equations $\nabla^2 \Phi = \textsf{div}\; G$, whose solution is exactly $\Phi = (\nabla^2)^{+}\, \textsf{div}\; G$.

We are often looking for a part of the flow that is the gradient of some potential function. The symmetric part of the flow $S$ can therefore be ruled out immediately as having no part to play in any gradient.

This leaves the anti-symmetric part $F := - \nabla \Phi + U$. We have shown that $U$ is the divergence-free part of $F$, which is cyclic flow without sources/sinks. Meanwhile, the source/sink-induced component of $F$ is the gradient flow $- \nabla \Phi$.

This decomposition is a very general fact; the potential it produces has a relationship to social choice theory as well, generalizing foundational Borda and Kemeny ranking methods [Jiang et al. 2011].

Trying it out: finding a global potential

To illustrate, I'll apply these tools to a basic example, where the graph is constructed from the match results of a recent NBA season.

NBA data

We use a Kaggle dataset of NBA games to get match results from the season. (This is downloaded and unzipped into the folder archive.) We'll compute the Helmholtz potential and implied ranking of teams.

Let's first read the data and make the graph.

import pandas as pd
# Read the games and teams tables from the downloaded data.
games_df = pd.read_csv('archive/games.csv')
teams_df = pd.read_csv('archive/teams.csv')
team_to_abbrev = dict(zip(teams_df['TEAM_ID'], teams_df['ABBREVIATION']))
games_df['HOME_TEAM_ID'] = games_df['HOME_TEAM_ID'].replace(team_to_abbrev)
games_df['VISITOR_TEAM_ID'] = games_df['VISITOR_TEAM_ID'].replace(team_to_abbrev)
teams = np.unique(games_df['HOME_TEAM_ID'])
num_teams = len(teams)

# Make the graph from games_df
games_df = games_df[games_df['SEASON'] == 2021]

rows_ndx = games_df['HOME_TEAM_ID'].replace( dict(zip(teams, range(len(teams)))) )
cols_ndx = games_df['VISITOR_TEAM_ID'].replace( dict(zip(teams, range(len(teams)))) )

# Handle the two different directions of edge
rows1 = rows_ndx[games_df['HOME_TEAM_WINS'] == 0]
cols1 = cols_ndx[games_df['HOME_TEAM_WINS'] == 0]
rows2 = rows_ndx[games_df['HOME_TEAM_WINS'] == 1]
cols2 = cols_ndx[games_df['HOME_TEAM_WINS'] == 1]

rows_index = np.concatenate((rows1, cols2))
cols_index = np.concatenate((cols1, rows2))

mtx = scipy.sparse.csr_matrix((np.ones(len(rows_index)), (rows_index, cols_index)), shape=(num_teams, num_teams))
adj_mat = scipy.sparse.csr_matrix((mtx > 0).toarray().astype(int))

Now we compute the Helmholtz vertex and edge potentials. The vertex potential gives us the ranking of the teams that we want: sorting by increasing potential lists the teams from strongest to weakest.

vertex_potential, edge_potential = helmholtz(mtx, adj_mat)
teams[np.argsort(vertex_potential)]
array(['PHX', 'GSW', 'DAL', 'MIA', 'CHI', 'MEM', 'BOS', 'UTA', 'PHI',
       'MIL', 'TOR', 'MIN', 'DEN', 'CLE', 'BKN', 'ATL', 'LAC', 'CHA',
       'NYK', 'SAC', 'SAS', 'WAS', 'NOP', 'LAL', 'POR', 'IND', 'DET',
       'OKC', 'ORL', 'HOU'], dtype=object)
# For comparison: the official standings table from the same dataset.
ranking_df = pd.read_csv('archive/ranking.csv')
#ranking_df[ranking_df['SEASON_ID'] == 22021].iloc[:10, :]

Interpolating reprogramming potential from fibroblasts to final cell fates

Several tools exist to use this type of formalism to infer cell state from measurements of single cells. A particularly popular and seminal approach has been to use RNA velocity - the idea that we can project a cell's future state by observing its pool of unspliced RNA in addition to the standard spliced RNA. Computational RNA velocity methods create a vector field in transcriptome (gene expression) space, which can be used to infer the potential landscape on which cells lie.

All this has major links to the famous developmental biologist Waddington's idea of a "developmental landscape". This was introduced to express the idea that cell state transitions are essentially low-dimensional phenomena occurring within a high-dimensional context of many genes, proteins, and other biomolecules. All these biochemical changes occur in a coordinated fashion to push cells from one low-dimensional state to another, which Waddington conceptualized as balls rolling "down" a landscape of "developmental potential."

We use a landmark single-cell RNA-seq dataset that reprogrammed mouse fibroblasts to stem cells [Schiebinger et al. 2019], obtainable from the CellRank package.

#adta = sc.read('velo-vascular.h5')
adta = sc.read('waddington-OT.h5')
cmap_jumbo = ["#f7ff00","#ff8300","#f000ff","#001eff","#33ccff","#74ee15","#fb9a99","#ff3300","#cab2d6","#e6194b","#3cb44b","#ffe119","#4363d8","#f58231","#911eb4","#46f0f0","#d220c8","#bcf60c","#fabebe","#008080","#e6beff","#9a6324","#fffac8","#aaffc3","#808000","#ffd8b1","#000075","#7d87b9","#bec1d4"]
sc.pl.embedding(adta, color='day', basis='X_force_directed')
 

If we take the first and last day to represent "sources" and "sinks" of the landscape, we can use the RNA-seq derived manifold over cells to impute a developmental potential and view it over the landscape. This scales impressively well to this large dataset due to its sparse neighborhood graph, finishing the computation in under a minute on a laptop CPU for more than 200K cells.

given_div = np.zeros(adta.shape[0])
given_div[adta.obs['day'].astype(float) < 1.0] = 1
given_div[adta.obs['day'].astype(float) > 17.0] = -1

itime = time.time()
sc.pp.pca(adta)
print(time.time() - itime)
sc.pp.neighbors(adta, random_state=0)
print(time.time() - itime)

itime = time.time()
hhd = helmholtz(
    adta.obsp['connectivities'], 
    adta.obsp['connectivities'], 
    given_divergence=given_div
)
print(time.time() - itime)

adta.obs['wpotential'] = hhd[0]
sc.pl.embedding(adta, color='wpotential', basis='X_force_directed')

Finally, we see pretty good concordance between the ranking of cells by potential and their temporal order in the experiment. We wouldn't expect perfect concordance here because some cells in the sample progress differently from others. But it is a sanity check that the potential is being decently interpolated, solely from initial and terminal cell states.

ab1 = scipy.stats.rankdata(hhd[0])
ab2 = scipy.stats.rankdata(adta.obs['day'])
#plt.scatter(ab2, ab1)
plt.scatter(adta.obs['day'], ab1, s=0.001)

Summary: calculus on graphs as an efficient, practical tool

Calculus on graphs uses a simple toolkit with a huge variety of applications. In this post, I've developed a calculus on graphs which faithfully follows its vector-based analogue famously devised for physics. I've shown theoretically how this encapsulates common notions of local change on a graph, and practically implemented it very efficiently. To illustrate the flexibility and power of the framework, I've used it for a couple of applications, showing how it can:

  • Form a complete ranking of players from a sparse set of one-on-one matches.
  • Scale up to large-scale single-cell data and flexibly incorporate prior information.

All the code developed here is available on github.

This post can be cited as:

@misc{balsubramani2022graphcalculus,
  title   = "Graph calculus",
  author  = "Balsubramani, Akshay",
  year    = "2022",
  url     = "https://akshay.bio/blog/graph-calculus/"
}

References

  1. Chorin, A.J. and Marsden, J.E. 1990. A mathematical introduction to fluid mechanics. Springer.
  2. Fan, K. and Hoffman, A.J. 1955. Some metric inequalities in the space of matrices. Proceedings of the American Mathematical Society 6, 1, 111–116.
  3. Jiang, X., Lim, L.-H., Yao, Y., and Ye, Y. 2011. Statistical ranking and combinatorial Hodge theory. Mathematical Programming 127, 1, 203–244.
  4. Schiebinger, G., Shu, J., Tabaka, M., et al. 2019. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell 176, 4, 928–943.