Introduction: neighborhood graphs and data manifolds

Graphs are extremely versatile structured representations of datasets of many kinds, and neighborhood-based graphs are a particularly interesting setting for algorithms.

Nearest-neighbor graphs are increasingly widely used to represent the landscape of a dataset in terms of its local relationships, allowing algorithms to manipulate and learn from a "data manifold." This is implicitly the type of learning that spectral methods for dimensionality reduction [Saul et al. 2006] have relied on for several decades. More recently, it has been the basis of the visualization methods tSNE and UMAP, which are famously widely used to look at the broad sweep of heterogeneity across a dataset.

I'm a big fan of these methods for data science, because they tend to be:

  • Expressive: The graph is nonparametric, and represents the data at the finest possible resolution, with each point linked to only its immediate neighborhood. This is what makes it possible for algorithms like tSNE and UMAP to purport to visualize the whole space.
  • Efficient: The $k$-NN graph is represented by a very sparse matrix, with only linearly many nonzero elements (see the short sketch after this list). The methods then often amount to linear algebra on these matrices, which processors can do very quickly.
  • Concise: We can typically realize the above advantages with relatively straightforward, short implementations (though this can vary).
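
Here's the promised sketch of the "efficient" point: a minimal, self-contained example on synthetic data (using scikit-learn; nothing here is reused later in the post) showing how sparse a $k$-NN graph is, and how cheap it is to multiply against.

import numpy as np
from sklearn.neighbors import kneighbors_graph

X = np.random.randn(10000, 50)                                 # 10,000 points in 50 dimensions
A = kneighbors_graph(X, n_neighbors=15, mode='connectivity')   # sparse CSR adjacency matrix

print(A.shape, A.nnz)                  # (10000, 10000) with only 150,000 nonzeros: linear in n
signal = np.random.randn(10000)
smoothed = A.dot(signal) / 15          # one sparse matrix-vector product: very fast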

When some of a graph's nodes (vertices) are labeled, labels for the remaining nodes can be extrapolated by using a fundamental assumption of homophily:

Local neighborhoods of a graph have similar labels.

In this context, the classic methods of predicting the labels "diffuse" them over the graph into the unlabeled examples by essentially playing out random walks on the graph. Confusingly, there are two classic methods that do this, both often called "label diffusion" (or "label propagation"). They are separate, but related. I cover both in this post.

Load data

An important area of application of graph methods in the life sciences is in the analysis of single-cell data. A single-cell experiment can measure the biochemical activity of thousands of cells at once, yielding a relatively comprehensive sample of a wide range of cell types, including some which cannot be physically dissociated from each other.

With extensive cell type heterogeneity being typical and interesting, single-cell researchers want to squeeze as much as possible out of each dataset without averaging over cell-to-cell variation. So researchers have favored a form of analysis that extensively uses the nearest-neighbor graph 1 between the cells as a "data manifold." Single-cell datasets are a good way of trying out neighborhood-based methods.
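
The dataset loaded below already carries such a neighbor graph in its AnnData object (.obsp['connectivities']), so we won't need to build one. Starting from a raw expression matrix, a standard scanpy recipe would look roughly like this sketch (not run here; adata is a placeholder for a preprocessed AnnData object):

import scanpy as sc

# Assuming `adata` holds a normalized, log-transformed expression matrix.
sc.pp.pca(adata, n_comps=50)              # PCA-derived representation (see footnote 1)
sc.pp.neighbors(adata, n_neighbors=15)    # k-NN graph built on the PCA coordinates
# The graph is now stored as sparse matrices in
# adata.obsp['distances'] and adata.obsp['connectivities'].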

import sys, numpy as np, time, scipy
import scipy.sparse, scipy.sparse.linalg

import matplotlib.pyplot as plt
%matplotlib inline

import anndata, scanpy as sc

print('Packages imported.')
Packages imported.

I'll demonstrate label propagation on single-cell gene expression data from the GTEx consortium, as I've written about previously.

# !curl https://storage.googleapis.com/gtex_analysis_v9/snrna_seq_data/GTEx_8_tissues_snRNAseq_atlas_071421.public_obs.h5ad -o GTEx_8_tissues_snRNAseq_atlas_071421.public_obs.h5ad

gtex9_all_adta_name = 'GTEx_8_tissues_snRNAseq_atlas_071421.public_obs.h5ad'
fname = gtex9_all_adta_name
adta = sc.read(fname)

cmap_jumbo = ["#f7ff00","#ff8300","#f000ff","#001eff","#33ccff","#74ee15","#fb9a99","#ff3300","#cab2d6","#e6194b","#3cb44b","#ffe119","#4363d8","#f58231","#911eb4","#46f0f0","#d220c8","#bcf60c","#fabebe","#008080","#e6beff","#9a6324","#fffac8","#aaffc3","#808000","#ffd8b1","#000075","#7d87b9","#bec1d4"]

sc.pl.umap(adta, color=['Broad cell type'])#, palette=cmap_jumbo)
sc.pl.umap(adta, color=['Tissue'])

We'll be predicting the Tissue label of the cells, denoting the tissue of origin. I'll create train/test splits based on the label annotations.

#hide-output
num_labeled = 1000   # Only used by the commented-out random split below.

adta.obs['split'] = ['unlabeled']*adta.shape[0]

# Alternative: label a uniformly random subset of cells instead.
# adta.obs['split'][np.random.choice(len(adta.obs['split']), size=num_labeled, replace=False)] = 'labeled'

# Treat the cells annotated with 'Other' tissue composition as the labeled set.
labeled_mask = adta.obs['Tissue composition'] == 'Other'

adta.obs.loc[labeled_mask, 'split'] = 'labeled'

What people mean by "label diffusion"

There are two popular methods that people talk about using the generic descriptor "label diffusion." They share important similarities:

  • Inputs: a graph with nodes corresponding to data observations, and labels provided for a subset of the nodes. Given the input graph structure, label propagation does not need any more information about the data's feature representation.
  • Efficiency: both are ultra-efficient in practice, amounting to a few successive multiplications of a vector by a sparse matrix dictated by the graph structure.
  • Semi-supervision: the graph is typically constructed in an unsupervised way from the data, so taken all together, label diffusion is a semi-supervised algorithm.

All this is very different from traditional parametric machine learning algorithms, which learn from the feature representation of the input examples, as well as their labels.

Difference between the methods

Despite these similarities, there is a significant difference between the two methods.

The first method we'll describe (harmonic extension) has no parameters. It assumes the given labels are completely correct, so it only tries to predict on the unlabeled data, interpolating from the neighborhood of each point.

In contrast, the second method (Laplacian regularization) assumes the labels could be noisy. In predicting on the labeled data, it trades off between trusting the given labels and interpolating them using the graph structure. The method has one scalar parameter governing this tradeoff.

Worked examples

If you just want to see worked examples, skip ahead to the worked examples below. But first, we'll set up and motivate each of the label diffusion methods.

Harmonic extension (clamped labels)

One method clamps the values on the labeled nodes, and then interpolates the unlabeled values so that each node's value is the average of its neighbors' values [Zhu et al. 2003].

This method of interpolation sensibly follows the graph structure, analogous to linear interpolation in a parametric setting. The interpolated function on the unlabeled data is unique, and it's computationally easy to find it: it's called the harmonic extension of the given labels to the unlabeled data.

Code

def harmonic_extension(
    labeled_signal, # (n x |Y|) matrix with labeled rows set to their fixed values, and unlabeled rows set to arbitrary values.
    adj_mat, # (n x n) adjacency matrix
    labeled_ndces, # Boolean mask indicating which cells are in the labeled set.
    method='iterative', 
    eps_tol=0.01 # Stop once the newest power-series term is this small relative to the accumulated sum (normally <= 20 iterations)
):
    """
    Given a graph and a continuous label signal over labeled_ndces, 
    impute the rest of the signal with its harmonic extension (hard clamping). 
    Returns n-vector predicting signal over the entire graph.
    """
    labels = labeled_signal[labeled_ndces]
    if scipy.sparse.issparse(labels):
        labels = labels.toarray()
    num_labeled = np.sum(labeled_ndces)
    num_unlabeled = adj_mat.shape[0] - num_labeled
    pmat = scipy.sparse.diags(1.0/np.ravel(adj_mat.sum(axis=0))).dot(adj_mat)   # Normalized adjacency P = D^{-1} A
    p_uu = pmat[~labeled_ndces, :][:, ~labeled_ndces]   # P_UU
    p_ul = pmat[~labeled_ndces, :][:, labeled_ndces]    # P_UL
    inv_sofar = p_ul.dot(labels)
    if method == 'iterative':
        # Power series (I - P_uu)^{-1} = I + P_uu + P_uu^2 + ...
        cummat = p_ul.dot(labels)
        cumu = []
        stop_crit = False
        while not stop_crit:
            cummat = p_uu.dot(cummat)
            rel_err = np.square(cummat).sum()/np.square(inv_sofar).sum()
            inv_sofar = inv_sofar + cummat
            cumu.append(inv_sofar)
            if rel_err <= eps_tol:
                stop_crit = True
        # Add unlabeled indices back into their respective places.
        for i in range(len(cumu)):
            to_add = np.zeros(labeled_signal.shape)
            to_add[labeled_ndces] = labels
            to_add[~labeled_ndces] = cumu[i]
            cumu[i] = to_add
        return cumu
    elif method == 'direct':
        # Directly solve (I - P_uu) f_U = P_ul y_L with LSMR.
        # Note: lsmr expects a dense 1-D right-hand side, so this path assumes a single label column.
        toret = scipy.sparse.linalg.lsmr(scipy.sparse.identity(num_unlabeled) - p_uu, inv_sofar)
        return toret[0]

Details

The harmonic extension

The description of the interpolation method I gave above comes with an algorithm: clamp the labels of all labeled nodes, and interpolate the labels of unlabeled nodes as the average of their labeled neighbors. More specifically,

Given a graph $G$, there is a unique function $f^H$ on the vertices of $G$ with $(\Delta f^H) \vert_{ U} = 0$ and $f^H \vert_{ L} (l) = y_l$, where $\Delta = D - A$ is the graph Laplacian ($A$ being the adjacency matrix and $D$ the diagonal degree matrix). This is the harmonic extension of labels $y_L$ on $G$.

The harmonic extension has a more technical description, from [Zhu et al. 2003], which shows how to compute it. To state this, write the normalized adjacency matrix $P = D^{-1} A = \left[ \begin{array}{c|c} P_{LL} & P_{LU} \\ \hline P_{UL} & P_{UU} \end{array} \right]$, where the blocks index the labeled and unlabeled examples.

$\textbf{Formal result}$ Then $f^{H} \in \underset{\substack{f \in \mathbb{R}^{n}, \\ f \vert_{ L} (l) = y_l \, \forall l \in L }}{\text{argmin}} f^\top \Delta f $. Also, $$ f^{H} \vert_{ U} = (I - P_{UU})^{-1} P_{UL} \, y_L $$

Implementation notes

At first, this looks like it requires inverting a matrix the size of the unlabeled data, which would be infeasible at this scale. But a key observation here is that $(I - P_{UU})^{-1} = I + P_{UU} + P_{UU}^2 + \dots$, so $$ f^{H} \vert_{ U} = (I + P_{UU} + P_{UU}^2 + \dots) P_{UL} \, y_L $$ This can be accumulated term by term efficiently, with just sparse matrix multiplications. The $i^{th}$ term $t_i = P_{UU}^{i-1} P_{UL} \, y_L$ can be computed recursively: $$ t_1 = P_{UL} \, y_L \;\text{ , }\; t_i = P_{UU} \, t_{i-1} $$ This gives an easy way to calculate successive terms $t_1, t_2, \dots$. The terms get provably exponentially smaller 2, so the sum can be truncated after relatively few iterations.
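
As a quick sanity check of this recipe, here's a self-contained toy example (separate from the pipeline in this post): on a 5-node path graph with its endpoints labeled 0 and 1, the harmonic extension is just linear interpolation, and the truncated power series agrees with the direct solve.

import numpy as np

# Path graph 0-1-2-3-4; nodes 0 and 4 are labeled with 0 and 1.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
P = A / A.sum(axis=1, keepdims=True)                 # P = D^{-1} A
labeled = np.array([True, False, False, False, True])
y_L = np.array([0.0, 1.0])

P_UU = P[~labeled][:, ~labeled]
P_UL = P[~labeled][:, labeled]

f_direct = np.linalg.solve(np.eye(3) - P_UU, P_UL @ y_L)   # closed form
f_series = np.zeros(3)
t = P_UL @ y_L                                             # t_1
for _ in range(200):                                       # accumulate t_1 + t_2 + ...
    f_series += t
    t = P_UU @ t                                           # t_{i+1} = P_UU t_i
print(f_direct, f_series)                                  # both are [0.25, 0.5, 0.75]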

Interpretations

Bayesian interpretation

The harmonic extension can be interpreted as the mean prediction of a Gaussian random field model.

In this model, signals over nodes have joint Gaussian density $p (f) \propto \exp( - f^\top \Delta f) = \exp( - \frac{1}{2} \sum_{x, y} A_{x y} (f_{x} - f_{y})^2 )$, so that $$ f \vert_{ U} \mid f \vert_{ L} \sim \mathcal{N} \left( f^{H} \vert_{ U} , \tfrac{1}{2} (D_{UU} - A_{UU})^{-1} \right) $$
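
Here's a tiny self-contained check of this interpretation on the same toy path graph as above: the conditional mean $(D_{UU} - A_{UU})^{-1} A_{UL} \, y_L$ of the Gaussian random field matches the harmonic extension.

import numpy as np

# Same toy path graph: nodes 0 and 4 labeled with 0 and 1; harmonic extension is [0.25, 0.5, 0.75].
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
labeled = np.array([True, False, False, False, True])
y_L = np.array([0.0, 1.0])

D = np.diag(A.sum(axis=1))
L_uu = (D - A)[~labeled][:, ~labeled]         # unlabeled block of the Laplacian
A_ul = A[~labeled][:, labeled]
mu_U = np.linalg.solve(L_uu, A_ul @ y_L)      # Gaussian conditional mean
print(mu_U)                                   # [0.25, 0.5, 0.75]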

Random walks

The completed function for any unlabeled $u$ is actually a weighted average over the labels, which can be interpreted in terms of where random walks from $u$ hit the labeled set. The harmonic extension is $ f^{H} (u) = \sum_{l \in L} P_{u, l} y_l $ on $u \in U$ 3, where $$ P_{u, l} := Pr (\text{a random walk starting at $u$ reaches $L$ first through $l$}) $$

$\textbf{Why?}$ The proof is short, relying on a first-step analysis of the walk: for unlabeled $u$ with degree $d_u = \sum_{v} A_{u, v}$, we have $P_{u, l} = \sum_{v \in N(u)} \frac{A_{u, v}}{d_u} P_{v, l}$, which follows from the definitions. Then $$ f^{H} (u) = \sum_{l \in L} P_{u, l} y_l = \sum_{l \in L} \sum_{v \in N(u)} \frac{A_{u, v}}{d_u} P_{v, l} y_l = \frac{1}{d_u} \sum_{v \in N(u)} A_{u, v} f^{H} (v) , $$ which is the definition of a harmonic function.
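
A small Monte Carlo illustration of this interpretation (again a self-contained toy on the path graph): estimate $P_{u, l}$ by simulating random walks from the middle node and recording which labeled endpoint each walk reaches first. The weighted label average approaches the harmonic extension's value of 0.5 at that node.

import numpy as np

rng = np.random.default_rng(0)

# Path graph 0-1-2-3-4; nodes 0 and 4 are labeled with y = 0 and 1.
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
labels = {0: 0.0, 4: 1.0}

def walk_until_labeled(start):
    """Run a random walk from `start` until it hits a labeled node; return that node."""
    node = start
    while node not in labels:
        node = rng.choice(neighbors[node])
    return node

hits = [walk_until_labeled(2) for _ in range(20000)]
estimate = np.mean([labels[h] for h in hits])
print(estimate)     # ~0.5, the harmonic extension at node 2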

Worked example

First, I'll prepare the existing data and label matrix to respect the current splits in adta.obs['split'], with labels in adta.obs['Tissue'].

I'm preparing this (multiclass classification) data to be predicted in a standard way:

  • Encode the classes as one-hot vectors ($K$ classes at the corners of the $K$-simplex)
  • Predict these continuous-valued vector signals for each node of the graph

#hide-output
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
label_mat = ohe.fit_transform(adta.obs[['Tissue']])
all_labels = label_mat.copy()
unlabeled_mask = np.array(adta.obs['split'] == 'unlabeled')
label_mat[unlabeled_mask, :] = 0   # Zero out rows of unlabeled cells (scipy may warn about the changed sparsity structure)
label_mat.eliminate_zeros()   # Just for cleanup

cumuvals = harmonic_extension(label_mat, adta.obsp['connectivities'], np.array(adta.obs['split'] == 'labeled'))
matching_labels = np.argmax(cumuvals[-1], axis=1) == np.ravel(all_labels.argmax(axis=1))
print("Accuracy of harmonic extension on unlabeled data: {}".format(matching_labels[adta.obs['split'] == 'unlabeled'].mean()))
Accuracy of harmonic extension on unlabeled data: 0.2901955135210316

Label propagation (regularized labels)

Like the harmonic functions method, the label propagation algorithm of [Zhou et al. 2003] is a classic method for manipulating a vector of values (labels) on nodes of a graph. It propagates the label signal from each node to its neighbors, converging to a smoothed version of the labels over the graph.

Code

def label_propagation(
    labeled_signal, # (n x |Y|) matrix with unlabeled rows set to arbitrary values.
    adj_mat, # (n x n) adjacency matrix
    labeled_ndces, # Boolean mask indicating which cells are in the labeled set.
    param_alpha=0.8, 
    method='iterative', 
    eps_tol=0.01   # Stop once the relative change between consecutive iterations of F falls below this (normally <= 20 iterations)
):
    """
    From Zhou et al. 2003 "Learning with local and global consistency".
    Returns an n-vector of real-valued relative confidences.
    """
    labeled_signal[~labeled_ndces, :] = 0   # Zero out unlabeled rows (note: modifies the input in place)
    dw_invsqrt = scipy.sparse.diags(
        np.reciprocal(np.sqrt(np.ravel(adj_mat.sum(axis=0))))
    )
    R = dw_invsqrt.dot(adj_mat).dot(dw_invsqrt)   # Symmetrically normalized adjacency D^{-1/2} A D^{-1/2}
    F = labeled_signal.copy()
    if scipy.sparse.issparse(F):
        F = F.toarray()
    cumu = []
    if method == 'iterative':
        stop_crit = False
        while not stop_crit:
            F_new = np.array((param_alpha*R.dot(F)) + ((1-param_alpha)*labeled_signal))
            rel_err = np.square(F_new - F).sum()/np.square(F_new).sum()
            F = F_new
            cumu.append(F)  # np.argmax(F, axis=1)
            print(rel_err)
            if rel_err <= eps_tol:
                stop_crit = True
        return cumu
    elif method == 'direct':
        # Solve (I - alpha*R) F = Y directly; the solution is proportional to the iterative fixed point.
        # Note: lsmr expects a dense 1-D right-hand side, so this path assumes a single label column.
        return scipy.sparse.linalg.lsmr(scipy.sparse.identity(R.shape[0]) - param_alpha*R, labeled_signal)

Details

Algorithm

The node labels are given by a matrix $Y \in [0,1]^{n \times |\mathcal{Y}|}$, with each labeled node being a one-hot encoded row, and unlabeled rows zeroed out.

To fill in the entries of $Y$, the method of [Zhou et al. 2003] takes a parameter $\alpha \in (0,1)$, computes the normalized adjacency matrix $R = D^{-1/2} A D^{-1/2}$, and iterates the following until convergence to solve for prediction scores $F$: $$ F \leftarrow \alpha R F + (1-\alpha) Y $$ When this has converged, predictions are made by thresholding the resulting scores $F$. The iteration always converges, regardless of how $F$ is initialized! As before, the convergence is exponential, so it just takes a few iterations in practice.
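
As a quick illustration that the iteration converges to the same fixed point from any starting $F$, here's a self-contained toy run on a 5-node path graph (the only inputs are $\alpha$, $R$, and $Y$; nothing here is reused later):

import numpy as np

# Path graph with one-hot labels on the two endpoints.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0
d_invsqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
R = d_invsqrt @ A @ d_invsqrt                 # normalized adjacency D^{-1/2} A D^{-1/2}
Y = np.zeros((5, 2))
Y[0, 0] = 1.0
Y[4, 1] = 1.0

alpha = 0.8

def iterate(F, n_steps=200):
    for _ in range(n_steps):
        F = alpha * (R @ F) + (1 - alpha) * Y
    return F

F_a = iterate(np.random.randn(5, 2))          # random initialization
F_b = iterate(np.zeros((5, 2)))               # zero initialization
print(np.abs(F_a - F_b).max())                # ~0: both reach the same fixed point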

$\alpha$ is the only parameter, and controls the tradeoff between respecting the known labels $Y$ and the graph structure $R$.

  • $\alpha \to 0$: Respects known labels, ignores graph structure
  • $\alpha \to 1$: Respects graph structure, ignores known labels

Motivation: smoothness-regularized optimization

This algorithm is exactly equivalent to solving a least-squares problem, with the label predictions regularized to be "smooth" with respect to the graph structure.

More specifically, defining a parameter $\mu$ such that $\alpha = \frac{1}{1 + \mu}$, the iteration converges to an $F^*$ which solves an optimization problem: $$ \min_{F \in \mathbb{R}^{n \times |\mathcal{Y}| }} \mu \underbrace{ \left\lVert Y - F \right\rVert^2 }_{\text{loss}} + \underbrace{\operatorname{tr}\left( F^\top (I - R) F \right)}_{\text{smoothness}}$$

Here the "loss" term forces the predictions $F$ to be close to the labels $Y$, while the "smoothness" term forces the predictions to vary smoothly over the graph, so that the differences between neighboring nodes are not large.

Worked example

Label propagation can be run on this graph, with these labels. We'll use the one-hot encoded label_mat from the harmonic extension example above.

#hide-output
cumuvals = label_propagation(label_mat, adta.obsp['connectivities'], np.array(adta.obs['split'] == 'labeled'), param_alpha=0.999)

real_labels = np.ravel(np.argmax(all_labels, axis=1))
for i in range(len(cumuvals)):
    ml = np.ravel(np.argmax(cumuvals[i], axis=1))
    print( "Iteration {}: \tAccuracy = {}".format(i, (real_labels == ml)[adta.obs['split'] == 'unlabeled'].mean() ))

Using label propagation for ranking

Connections to Personalized PageRank

Label propagation is essentially an application of the Personalized PageRank (PPR) kernel (see [Page et al. 1999]), which is formulated in terms of random walks on the graph that are biased towards the labeled (query) nodes.

The paper [Zhou et al. 2003] formalizes these connections, and discusses how to use label propagation for ranking. The intuition for using this label propagation algorithm for ranking is very simple:

"We... assign a positive ranking score to each query and zero to the remaining points which are ranked with respect to the queries. All points then spread their ranking score to their nearby neighbors via the weighted network. The spread process is repeated until a global stable state is achieved, and all points except queries are ranked according to their final ranking scores." ([Zhou et al. 2003])

These connections have been explored much more recently in the context of graph neural networks, for which this PPR kernel regularizes learning effectively [Klicpera et al. 2018]. So this same linear-algebraic iteration continues to find use in the GNN context.

In animated form

We'll pick 2 points to be the query, and highlight the diffusion of ranks from these two seed query points.

carr = np.zeros((adta.shape[0], 1))
carr[0, 0] = 1
carr[100000, 0] = 1

#hide-output
itime = time.time()
cumuvals = label_propagation(carr, adta.obsp['connectivities'], np.ravel(carr == 1), param_alpha=0.999, method='iterative')
print("Time taken for label propagation: {}".format(time.time() - itime))

First, we'll import the animation code from an earlier post.

import requests
url = 'https://raw.githubusercontent.com/b-akshay/blog-tools/main/create_animation.py'
r = requests.get(url)

# Save the module under the same name we'll import it as.
with open('create_animation.py', 'w') as f:
    f.write(r.text)

# now we can import
from create_animation import *

#hide-output

itime = time.time()

coords = adta.obsm['X_umap']
animation = create_animation(coords, cumuvals)

print("Time taken to generate frames: {}".format(time.time() - itime))

itime = time.time()
from IPython.display import HTML

anim_html = animation.to_jshtml()
print("Time taken to produce html animation: {}".format(time.time() - itime))

#HTML(animation.to_html5_video())
HTML(anim_html)

Summary and extras

All the code developed here is available on github.

I've gone through two methods that are by far the most commonly used algorithms for label diffusion. They're simple and blindingly fast to use with the neighborhood-based graphs commonly used for homophily-based nonparametric learning.

Both are best implemented in the default iterative mode, but there is also a direct mode that eschews the iteration, instead solving a system of linear equations. This will give almost exactly the same result.

There are many more applications of the diffusion framework. [Sharp et al. 2022] uses it to get state-of-the-art results vs. GCNs in a suite of computer graphics tasks. Meanwhile, [Huang et al. 2020] uses it to get state-of-the-art results using very few parameters on basic node prediction problems. I'll write about some of these in future posts.

Footnotes

1. Built from PCA-derived representations of the data.

2. This is because the spectral radius of $P_{UU}$ is strictly less than 1 (as long as every unlabeled node can reach a labeled node), since $P_{UU}$ is a strict sub-block of the row-stochastic matrix $P$.

3. Note that $P_{l, l} = 1$ for $l \in L$.

References

  1. Saul, L.K., Weinberger, K.Q., Sha, F., Ham, J., and Lee, D.D. 2006. Spectral methods for dimensionality reduction. Semi-supervised learning 3.
  2. Zhu, X., Ghahramani, Z., and Lafferty, J.D. 2003. Semi-supervised learning using gaussian fields and harmonic functions. Proceedings of the 20th International conference on Machine learning (ICML-03), 912–919.
  3. Zhou, D., Bousquet, O., Lal, T., Weston, J., and Schölkopf, B. 2003. Learning with Local and Global Consistency. Advances in Neural Information Processing Systems, MIT Press.
  4. Page, L., Brin, S., Motwani, R., and Winograd, T. 1999. The PageRank citation ranking: Bringing order to the web. Stanford InfoLab.
  5. Zhou, D., Weston, J., Gretton, A., Bousquet, O., and Schölkopf, B. 2003. Ranking on data manifolds. Advances in neural information processing systems 16.
  6. Klicpera, J., Bojchevski, A., and Günnemann, S. 2018. Predict then propagate: Graph neural networks meet personalized pagerank. arXiv preprint arXiv:1810.05997.
  7. Sharp, N., Attaiki, S., Crane, K., and Ovsjanikov, M. 2022. Diffusionnet: Discretization agnostic learning on surfaces. ACM Transactions on Graphics (TOG) 41, 3, 1–16.
  8. Huang, Q., He, H., Singh, A., Lim, S.-N., and Benson, A.R. 2020. Combining label propagation and simple models out-performs graph neural networks. arXiv preprint arXiv:2010.13993.