Introduction: neighborhood graphs and data manifolds
Graphs are extremely versatile structured representations of datasets of various kinds, and neighborhood-based graphs are a particularly interesting context for algorithms.
Nearest-neighbor graphs are increasingly widely used to represent the landscape of a dataset in terms of its local relationships, allowing algorithms to manipulate and learn from a “data manifold.” This is implicitly the type of learning that spectral methods for dimensionality reduction (Saul et al. 2006) have relied on for several decades. More recently, it has been the basis of the visualization methods tSNE and UMAP, which are famously widely used to look at the broad sweep of heterogeneity across a dataset.
I’m a big fan of these methods for data science, because they tend to be:
Expressive: The graph is nonparametric, and represents the data at the finest possible resolution, with each point linked to only its immediate neighborhood. This is what makes it possible for algorithms like tSNE and UMAP to purport to visualize the whole space.
Efficient: The \(k\)-NN graph is represented by a very sparse matrix, with only linearly many nonzero elements (roughly \(nk\) for \(n\) points). The methods then often amount to sparse linear algebra on these matrices, which processors can do very quickly (see the sketch after this list).
Concise: We can typically realize the above advantages with relatively straightforward, short implementations (though this can vary).
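As a quick illustration of the efficiency point above, here is a minimal sketch of building a \(k\)-NN graph as a sparse adjacency matrix with scikit-learn. The toy data and variable names are illustrative only.
CODE
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Toy data: 5000 points in 20 dimensions (illustrative only).
X = np.random.RandomState(0).randn(5000, 20)

# Sparse k-NN adjacency matrix (CSR), then symmetrized so edges go both ways.
A = kneighbors_graph(X, n_neighbors=15, mode='connectivity', include_self=False)
A = A.maximum(A.T)

print(A.shape, A.nnz)              # roughly n*k nonzeros, out of n^2 possible entries
print(A.nnz / np.prod(A.shape))    # fraction of nonzero entries -- very small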
When parts of a graph (nodes or vertices) are labeled, the unlabeled parts can be extrapolated from the labels by using a fundamental assumption of homophily:
Local neighborhoods of a graph have similar labels.
In this context, the classic methods of predicting the labels “diffuse” them over the graph into the unlabeled examples by essentially playing out random walks on the graph. Confusingly, there are two classic methods that do this, both often called “label diffusion” (or “label propagation”). They are separate, but related; I compare them below and cover one of them, harmonic extension, in detail in this post.
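To make the “diffusion” picture concrete before getting into either method: one smoothing step is just a sparse matrix-vector product with the row-normalized adjacency matrix. A minimal sketch (not either method in full; the names are illustrative):
CODE
import numpy as np
import scipy.sparse

def diffusion_step(adj_mat, signal):
    # Row-normalize the adjacency matrix into a random-walk transition matrix P,
    # then replace each node's value with the weighted average of its neighbors' values.
    # (Assumes no isolated nodes, i.e. every row of adj_mat has a nonzero sum.)
    deg = np.ravel(adj_mat.sum(axis=1))
    P = scipy.sparse.diags(1.0 / deg).dot(adj_mat)
    return P.dot(signal)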
Load data
An important area of application of graph methods in the life sciences is in the analysis of single-cell data. A single-cell experiment can measure the biochemical activity of thousands of cells at once, yielding a relatively comprehensive sample of a wide range of cell types, including some which cannot be physically dissociated from each other.
Because extensive cell type heterogeneity is both typical and interesting, single-cell researchers want to squeeze as much as possible out of each dataset without averaging over cell-to-cell variation. So researchers have favored a form of analysis that extensively uses the nearest-neighbor graph1 between the cells as a “data manifold.” Single-cell datasets are therefore a good testbed for neighborhood-based methods.
I’ll demonstrate label propagation on single-cell gene expression data from the GTEx consortium, as I’ve written about previously.
CODE
import sys, numpy as np, time, scipy
import scipy.sparse, scipy.sparse.linalg   # used by harmonic_extension below
import matplotlib.pyplot as plt
%matplotlib inline
print('Packages imported.')

import anndata, scanpy as sc

gtex9_all_adta_name = 'GTEx_8_tissues_snRNAseq_atlas_071421.public_obs.h5ad'
fname = gtex9_all_adta_name
adta = sc.read(fname)

cmap_jumbo = ["#f7ff00", "#ff8300", "#f000ff", "#001eff", "#33ccff", "#74ee15", "#fb9a99", "#ff3300",
              "#cab2d6", "#e6194b", "#3cb44b", "#ffe119", "#4363d8", "#f58231", "#911eb4", "#46f0f0",
              "#d220c8", "#bcf60c", "#fabebe", "#008080", "#e6beff", "#9a6324", "#fffac8", "#aaffc3",
              "#808000", "#ffd8b1", "#000075", "#7d87b9", "#bec1d4"]

sc.pl.umap(adta, color=['Broad cell type'])  # , palette=cmap_jumbo)
sc.pl.umap(adta, color=['Tissue'])
We’ll be predicting the Tissue label of the cells, denoting the tissue of origin. I’ll create train/test splits based on the label annotations.
CODE
num_labeled = 1000
adta.obs['split'] = ['unlabeled'] * adta.shape[0]
# adta.obs['split'][np.random.choice(len(adta.obs['split']), size=num_labeled, replace=False)] = 'labeled'
#
# The cells also have a `subtissue` label, which is `nan` for some cells for which no subtissue
# can be denoted. Using tissue labels on these cells, we'll try to predict the rest.
labeled_mask = adta.obs['Broad cell type'] == 'Unknown'
labeled_mask = adta.obs['Tissue composition'] == 'Other'   # Note: this mask overrides the previous one.
adta.obs['split'][labeled_mask] = 'labeled'
What people mean by “label diffusion”
There are two popular methods that people talk about using the generic descriptor “label diffusion.”
[Laplacian regularization](https://akshay.bio/variable-expectations/graph-based-regularization): soft regularization of the labels to be smooth over the graph.
This method (harmonic extension): hard clamping of the labels, with harmonic (maximally smooth) interpolation over the graph between labeled points.
Similarity between the methods
They share important similarities:
Inputs: a graph with nodes corresponding to data observations, and labels provided for a subset of the nodes. Given the input graph structure, label propagation does not need any more information about the data’s feature representation.
Both are ultra-efficient to implement in practice, just involving multiplying a vector by a sparse matrix dictated by the graph structure, a few times in succession.
The graph is typically constructed in an unsupervised way from the data, so taken all together, label diffusion is a semi-supervised algorithm.
All this is very different from traditional parametric machine learning algorithms, which learn from the feature representation of the input examples, as well as their labels.
Difference between the methods
Despite all this, there is a significant difference between the two methods.
Harmonic extension, the method covered in this post, has no parameters. It assumes the given labels are completely correct, so it only predicts on the unlabeled data, interpolating from the neighborhood of each point.
In contrast, Laplacian regularization assumes the labels could be noisy. In predicting on the labeled data, it trades off between trusting the given labels and interpolating them using the graph structure, with one scalar parameter governing this tradeoff.
Harmonic extension (clamped labels)
One method clamps the values on the labeled nodes, and then interpolates the unlabeled values so that each node’s value is the average of its neighbors’ values (Zhu, Ghahramani, and Lafferty 2003).
This method of interpolation sensibly follows the graph structure, analogous to linear interpolation in a parametric setting. The interpolated function on the unlabeled data is unique, and it’s computationally easy to find: it’s called the harmonic extension of the given labels to the unlabeled data.
CODE
def harmonic_extension(
    labeled_signal,      # n-row matrix: labeled rows set to their fixed values, unlabeled rows arbitrary.
    adj_mat,             # (n x n) adjacency matrix
    labeled_ndces,       # Boolean mask indicating which cells are in the labeled set.
    method='iterative',
    eps_tol=0.01         # Min. relative error between consecutive iterations before stopping (normally <= 20 iterations).
):
    """
    Given a graph and a continuous label signal over labeled_ndces, impute the rest of the
    signal with its harmonic extension (hard clamping).
    Returns predictions of the signal over the entire graph.
    """
    labels = labeled_signal[labeled_ndces]
    if scipy.sparse.issparse(labels):
        labels = labels.toarray()
    num_labeled = np.sum(labeled_ndces)
    num_unlabeled = adj_mat.shape[0] - num_labeled
    # Row-normalized adjacency (random-walk transition) matrix, and its labeled/unlabeled blocks.
    pmat = scipy.sparse.diags(1.0 / np.ravel(adj_mat.sum(axis=0))).dot(adj_mat)
    p_uu = pmat[~labeled_ndces, :][:, ~labeled_ndces]
    p_ul = pmat[~labeled_ndces, :][:, labeled_ndces]
    inv_sofar = p_ul.dot(labels)
    if method == 'iterative':
        # Power series (I - P_uu)^{-1} = I + P_uu + P_uu^2 + ...
        cummat = p_ul.dot(labels)
        cumu = []
        stop_crit = False
        while not stop_crit:
            cummat = p_uu.dot(cummat)
            rel_err = np.square(cummat).sum() / np.square(inv_sofar).sum()
            inv_sofar = inv_sofar + cummat
            cumu.append(inv_sofar)
            if rel_err <= eps_tol:
                stop_crit = True
        # Reassemble full-length signals: clamp labeled values, fill unlabeled values from the series.
        for i in range(len(cumu)):
            to_add = np.zeros(labeled_signal.shape)
            to_add[labeled_ndces] = labels
            to_add[~labeled_ndces] = cumu[i]
            cumu[i] = to_add
        return cumu
    elif method == 'direct':
        # Solve (I - P_uu) x = P_ul y directly. lsmr expects a 1-D right-hand side, so this mode
        # handles one signal column at a time, and it returns only the unlabeled entries.
        toret = scipy.sparse.linalg.lsmr(scipy.sparse.identity(num_unlabeled) - p_uu, inv_sofar)
        return toret[0]
Details
The harmonic extension
The description of the interpolation method I gave above comes with an algorithm: clamp the labels of all labeled nodes, and interpolate the labels of unlabeled nodes as the average of their neighbors’ values. More specifically,
Given a graph \(G\), there is a unique function \(f^H\) on the vertices of \(G\) with \((\Delta f^H) \vert_{ U} = 0\) and \(f^H \vert_{ L} (l) = y_l\). This is the harmonic extension of labels \(y_L\) on \(G\).
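Here \(\Delta\) denotes the graph Laplacian; assuming the unnormalized form \(\Delta = D - A\) (with \(A\) the adjacency matrix and \(D\) the diagonal degree matrix, matching the quadratic form used later), writing out one coordinate shows that \((\Delta f) \vert_{ U} = 0\) is exactly the neighbor-averaging condition described above: \[ (\Delta f)(u) = d_u f(u) - \sum_{v \in N(u)} A_{u v} f(v) = 0 \quad\Longleftrightarrow\quad f(u) = \frac{1}{\sum_{v \in N(u)} A_{u v}} \sum_{v \in N(u)} A_{u v} f(v) \]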
The harmonic extension has a more technical description which shows how to compute it, from (Zhu, Ghahramani, and Lafferty 2003). To state this, suppose \(G\) has adjacency matrix \(A\) and diagonal degree matrix \(D\), and denote the normalized adjacency matrix \[ P = D^{-1} A =
\left[
\begin{array}{c|c}
P_{LL} & P_{LU} \\ \hline
P_{UL} & P_{UU}
\end{array}
\right] \] where the submatrices index the labeled and unlabeled examples.
\(\textbf{Formal result}\)
If \[\begin{align*} \displaystyle f^{H} \in \underset{\substack{f \in \mathbb{R}^{n}, \\ f \vert_{ L} (l) = y_l \;\forall l \in L }}{\text{argmin}} \; f^\top \Delta f
\end{align*}\] then it can be computed as \[ f^{H} \vert_{ U} = (I - P_{UU})^{-1} P_{UL} \, y_L \]
Implementation notes
At first, this looks like it requires inverting a matrix the size of the unlabeled data, which is infeasible at scale. But a key observation here is that \((I - P_{UU})^{-1} = I + P_{UU} + P_{UU}^2 + \dots\), so \[ f^{H} \vert_{ U} = (I + P_{UU} + P_{UU}^2 + \dots) P_{UL} \, y_L \] This can be accumulated term by term efficiently, with just sparse matrix multiplications. The \(i^{th}\) term \(t_i = P_{UU}^{i-1} P_{UL} \, y_L\) can be written recursively: \[ t_1 = P_{UL} \, y_L, \qquad t_{i+1} = P_{UU} \, t_i \]
This gives an easy way to calculate successive terms \(t_1, t_2, \dots\). The terms get provably exponentially smaller2, so the accumulation can be stopped after relatively few iterations.
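As a sanity check of this truncated power series, here is a small sketch on a toy random graph, comparing a direct solve of \((I - P_{UU}) \, x = P_{UL} \, y_L\) against the accumulated series terms. The toy graph, seed, and names are illustrative only.
CODE
import numpy as np
import scipy.sparse
import scipy.sparse.linalg

rng = np.random.RandomState(0)
n, n_labeled = 200, 40

# Toy symmetric weighted adjacency matrix.
A = scipy.sparse.random(n, n, density=0.05, random_state=rng)
A = (A + A.T).tocsr()
P = scipy.sparse.diags(1.0 / np.ravel(A.sum(axis=1))).dot(A)

labeled = np.zeros(n, dtype=bool)
labeled[:n_labeled] = True
y_L = rng.rand(n_labeled)

P_uu = P[~labeled, :][:, ~labeled]
P_ul = P[~labeled, :][:, labeled]
b = P_ul.dot(y_L)

# Direct solve of (I - P_uu) x = P_ul y_L.
x_direct = scipy.sparse.linalg.spsolve(
    scipy.sparse.identity(n - n_labeled).tocsc() - P_uu.tocsc(), b)

# Truncated power series: t_1 = P_ul y_L, t_{i+1} = P_uu t_i.
x_series, t = b.copy(), b.copy()
for _ in range(300):
    t = P_uu.dot(t)
    x_series += t

print(np.abs(x_direct - x_series).max())   # shrinks as more terms are accumulated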
Interpretations
Bayesian interpretation
The harmonic extension can be interpreted as the mean predicted value of a model.
In this model, signals over nodes have joint Gaussian probability \(p (f) \propto \exp( - f^\top \Delta f) = \exp( - \sum_{x < y} A_{x y} (f_{x} - f_{y})^2 )\), so that \[ f \vert_{ U} \mid f \vert_{ L} \sim \mathcal{N} \left( f^{H} \vert_{ U} , \ \tfrac{1}{2} (D_{UU} - A_{UU})^{-1} \right) \]
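To see why the conditional mean is exactly the harmonic extension, apply the standard Gaussian conditioning formula with precision proportional to \(\Delta = D - A\), partitioned into labeled and unlabeled blocks as before (the constant factor in the precision affects only the covariance, not the mean): \[\begin{align*} \mathbb{E}\left[ f \vert_{ U} \,\middle|\, f \vert_{ L} = y_L \right] &= -\Delta_{UU}^{-1} \Delta_{UL} \, y_L = (D_{UU} - A_{UU})^{-1} A_{UL} \, y_L \\ &= (I - D_{UU}^{-1} A_{UU})^{-1} D_{UU}^{-1} A_{UL} \, y_L = (I - P_{UU})^{-1} P_{UL} \, y_L = f^{H} \vert_{ U} \end{align*}\]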
Random walks
The completed function at any unlabeled node \(u\) is actually a weighted average over the labels, which can be interpreted in terms of where random walks from \(u\) first hit the labeled set. The harmonic extension is \( f^{H} (u) = \sum_{l \in L} P_{u, l} \, y_l \) on \(u \in U\), where \[ P_{u, l} := \Pr (\text{a random walk starting at $u$ reaches $L$ first through $l$}) \]
\(\textbf{Why?}\)
The proof is short, relying on the fact that \(P_{u, l} = \frac{1}{\sum_{v \in N(u)} A_{u, v}} \sum_{v \in N(u)} A_{u, v} P_{v, l}\), which follows from the definitions: the first step of a walk from \(u\) goes to a neighbor \(v\) with probability proportional to \(A_{u, v}\). Then \[ f^{H} (u)
= \sum_{l \in L} P_{u, l} y_l
= \sum_{l \in L} \sum_{v \in N(u)} \frac{A_{u, v}}{\sum_{w \in N(u)} A_{u, w}} P_{v, l} y_l
= \frac{1}{\sum_{v \in N(u)} A_{u, v}} \sum_{v \in N(u)} A_{u, v} f^{H} (v), \] which is the definition of a harmonic function: each value is the weighted average of its neighbors’ values.
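This random-walk picture is easy to check numerically: on a small graph, averaging the label at the point where each simulated walk from \(u\) first hits the labeled set should approximately reproduce \(f^{H}(u)\). A rough sketch (toy graph and names are illustrative; agreement is only approximate due to sampling noise):
CODE
import numpy as np

rng = np.random.RandomState(1)
n, n_labeled = 30, 8

# Small dense toy graph with random nonnegative edge weights.
A = rng.rand(n, n) * (rng.rand(n, n) < 0.3)
A = A + A.T
np.fill_diagonal(A, 0)

labeled = np.arange(n) < n_labeled
y_L = rng.rand(n_labeled)
P = A / A.sum(axis=1, keepdims=True)    # random-walk transition matrix

# Exact harmonic extension on the unlabeled nodes via the linear system.
P_uu, P_ul = P[np.ix_(~labeled, ~labeled)], P[np.ix_(~labeled, labeled)]
f_U = np.linalg.solve(np.eye(n - n_labeled) - P_uu, P_ul.dot(y_L))

# Monte Carlo: record the label where each walk from node u first hits the labeled set.
u = n_labeled          # the first unlabeled node
hits = []
for _ in range(5000):
    node = u
    while not labeled[node]:
        node = rng.choice(n, p=P[node])
    hits.append(y_L[node])

print(f_U[0], np.mean(hits))   # should roughly agree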
Worked example
First, I’ll prepare the existing data and label matrix to respect the current splits in adta.obs['split'], with labels in adta.obs['Tissue'].
I’m preparing this (multiclass classification) data to be predicted in a standard way:
Encode the classes as one-hot vectors (\(K\) classes at the corners of the \(K\)-simplex)
Predict these continuous-valued vector signals for each node of the graph
CODE
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
label_mat = ohe.fit_transform(adta.obs[['Tissue']])
all_labels = label_mat.copy()
label_mat[adta.obs['split'] == 'unlabeled', :] = 0
label_mat.eliminate_zeros()   # Just for cleanup

cumuvals = harmonic_extension(
    label_mat,
    adta.obsp['connectivities'],
    np.array(adta.obs['split'] == 'labeled')
)
matching_labels = np.argmax(cumuvals[-1], axis=1) == np.ravel(all_labels.argmax(axis=1))
print("Accuracy of harmonic extension on unlabeled data: {}".format(
    matching_labels[adta.obs['split'] == 'unlabeled'].mean()))
Accuracy of harmonic extension on unlabeled data: 0.2901955135210316
Summary and extras
I’ve gone through a method that’s simple and blindingly fast to use with the neighborhood-based graphs that underlie homophily-based nonparametric learning. It is best implemented in the default iterative mode, but there is also a direct mode that eschews the iteration and instead solves a system of linear equations, giving (almost) exactly the same result.
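For example, reusing harmonic_extension, adta, and all_labels from the worked example above, the two modes can be compared on a single label column. (This is just a sketch, not part of the original pipeline; with the implementation above, 'direct' mode expects a one-dimensional signal and returns only the unlabeled entries.)
CODE
one_col = np.ravel(all_labels[:, 0].toarray())   # the first tissue class as a 0/1 signal
split_labeled = np.array(adta.obs['split'] == 'labeled')
split_unlabeled = np.array(adta.obs['split'] == 'unlabeled')

iterative_vals = harmonic_extension(one_col, adta.obsp['connectivities'], split_labeled)
direct_vals = harmonic_extension(one_col, adta.obsp['connectivities'], split_labeled, method='direct')

print(np.abs(iterative_vals[-1][split_unlabeled] - direct_vals).max())
# Small but nonzero: the iterative mode truncates the power series at eps_tol.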
There are many more applications of the diffusion framework. (Sharp et al. 2022) use it to get state-of-the-art results versus GCNs on a suite of computer graphics tasks, while (Huang et al. 2021) use it to get state-of-the-art results with very few parameters on basic node prediction problems.
References
Huang, Qian, Horace He, Abhay Singh, Ser-Nam Lim, and Austin R Benson. 2021. “Combining Label Propagation and Simple Models Out-Performs Graph Neural Networks.”
Saul, Lawrence K, Kilian Q Weinberger, Fei Sha, Jihun Ham, and Daniel D Lee. 2006. “Spectral Methods for Dimensionality Reduction.” Semi-Supervised Learning 3.
Sharp, Nicholas, Souhaib Attaiki, Keenan Crane, and Maks Ovsjanikov. 2022. “DiffusionNet: Discretization Agnostic Learning on Surfaces.” ACM Transactions on Graphics (TOG) 41 (3): 1–16.
Zhu, Xiaojin, Zoubin Ghahramani, and John D Lafferty. 2003. “Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions.” In Proceedings of the 20th International Conference on Machine Learning (ICML-03), 912–19.
Footnotes
Built from PCA-derived representations of the data.↩︎
This is because all eigenvalues of the matrix \(P\) have magnitudes \(\leq 1\).↩︎