Introduction: neighborhood graphs and data manifolds
Graphs are extremely versatile structured representations of datasets of various kinds, and neighborhood-based graphs are a particularly interesting context for algorithms.
Nearest-neighbor graphs are increasingly widely used to represent the landscape of a dataset in terms of its local relationships, allowing algorithms to manipulate and learn from a “data manifold.” This is implicitly the type of learning that spectral methods for dimensionality reduction (Saul et al. 2006) have relied on for several decades. More recently, it has been the basis of the visualization methods tSNE and UMAP, which are famously widely used to look at the broad sweep of heterogeneity across a dataset.
I’m a big fan of these methods for data science, because they tend to be:
Expressive: The graph is nonparametric, and represents the data at the finest possible resolution, with each point linked to only its immediate neighborhood. This is what makes it possible for algorithms like tSNE and UMAP to purport to visualize the whole space.
Efficient: The \(k\)-NN graph is represented by a very sparse matrix, with only linearly many nonzero elements (roughly \(nk\) for \(n\) points). The methods then often amount to sparse linear algebra on these matrices, which processors can do very quickly (see the sketch after this list).
Concise: We can typically realize the above advantages with relatively straightforward, short implementations (though this can vary).
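As a quick illustration of the efficiency point above, here is a minimal sketch of building a \(k\)-NN graph as a sparse adjacency matrix with scikit-learn. The toy data and variable names are illustrative only.
CODE
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Toy data: 5000 points in 20 dimensions (illustrative only).
X = np.random.RandomState(0).randn(5000, 20)

# Sparse k-NN adjacency matrix (CSR), then symmetrized so edges go both ways.
A = kneighbors_graph(X, n_neighbors=15, mode='connectivity', include_self=False)
A = A.maximum(A.T)

print(A.shape, A.nnz)              # roughly n*k nonzeros, out of n^2 possible entries
print(A.nnz / np.prod(A.shape))    # fraction of nonzero entries -- very small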
When parts of a graph (nodes or vertices) are labeled, the unlabeled parts can be extrapolated from the labels by using a fundamental assumption of homophily:
Local neighborhoods of a graph have similar labels.
In this context, the classic methods of predicting the labels “diffuse” them over the graph into the unlabeled examples by essentially playing out random walks on the graph. Confusingly, there are two classic methods that do this, both often called “label diffusion” (or “label propagation”). They are separate, but related; I compare them below and cover one of them, harmonic extension, in detail in this post.
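To make the “diffusion” picture concrete before getting into either method: one smoothing step is just a sparse matrix-vector product with the row-normalized adjacency matrix. A minimal sketch (not either method in full; the names are illustrative):
CODE
import numpy as np
import scipy.sparse

def diffusion_step(adj_mat, signal):
    # Row-normalize the adjacency matrix into a random-walk transition matrix P,
    # then replace each node's value with the weighted average of its neighbors' values.
    # (Assumes no isolated nodes, i.e. every row of adj_mat has a nonzero sum.)
    deg = np.ravel(adj_mat.sum(axis=1))
    P = scipy.sparse.diags(1.0 / deg).dot(adj_mat)
    return P.dot(signal)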
Load data
An important area of application of graph methods in the life sciences is in the analysis of single-cell data. A single-cell experiment can measure the biochemical activity of thousands of cells at once, yielding a relatively comprehensive sample of a wide range of cell types, including some which cannot be physically dissociated from each other.
Because extensive cell type heterogeneity is both typical and interesting, single-cell researchers want to squeeze as much as possible out of each dataset without averaging over cell-to-cell variation. So researchers have favored a form of analysis that extensively uses the nearest-neighbor graph1 between the cells as a “data manifold.” Single-cell datasets are therefore a good testbed for neighborhood-based methods.
I’ll demonstrate label propagation on single-cell gene expression data from the GTEx consortium, as I’ve written about previously.
CODE
import sys, numpy as np, time, scipy
import scipy.sparse, scipy.sparse.linalg   # used by harmonic_extension below
import matplotlib.pyplot as plt
%matplotlib inline
print('Packages imported.')

import anndata, scanpy as sc

gtex9_all_adta_name = 'GTEx_8_tissues_snRNAseq_atlas_071421.public_obs.h5ad'
fname = gtex9_all_adta_name
adta = sc.read(fname)

cmap_jumbo = ["#f7ff00", "#ff8300", "#f000ff", "#001eff", "#33ccff", "#74ee15", "#fb9a99", "#ff3300",
              "#cab2d6", "#e6194b", "#3cb44b", "#ffe119", "#4363d8", "#f58231", "#911eb4", "#46f0f0",
              "#d220c8", "#bcf60c", "#fabebe", "#008080", "#e6beff", "#9a6324", "#fffac8", "#aaffc3",
              "#808000", "#ffd8b1", "#000075", "#7d87b9", "#bec1d4"]

sc.pl.umap(adta, color=['Broad cell type'])  # , palette=cmap_jumbo)
sc.pl.umap(adta, color=['Tissue'])
We’ll be predicting the Tissue label of the cells, denoting the tissue of origin. I’ll create train/test splits based on the label annotations.
CODE
num_labeled = 1000
adta.obs['split'] = ['unlabeled'] * adta.shape[0]
# adta.obs['split'][np.random.choice(len(adta.obs['split']), size=num_labeled, replace=False)] = 'labeled'
#
# The cells also have a `subtissue` label, which is `nan` for some cells for which no subtissue
# can be denoted. Using tissue labels on these cells, we'll try to predict the rest.
labeled_mask = adta.obs['Broad cell type'] == 'Unknown'
labeled_mask = adta.obs['Tissue composition'] == 'Other'   # Note: this mask overrides the previous one.
adta.obs['split'][labeled_mask] = 'labeled'
What people mean by “label diffusion”
There are two popular methods that people talk about using the generic descriptor “label diffusion.”
[Laplacian regularization](https://akshay.bio/variable-expectations/graph-based-regularization): soft regularization of the labels to be smooth over the graph.
This method (harmonic extension): hard clamping of the labels, with harmonic (maximally smooth) interpolation over the graph between labeled points.
Similarity between the methods
They share important similarities:
Inputs: a graph with nodes corresponding to data observations, and labels provided for a subset of the nodes. Given the input graph structure, label propagation does not need any more information about the data’s feature representation.
Both are ultra-efficient to implement in practice, just involving multiplying a vector by a sparse matrix dictated by the graph structure, a few times in succession.
The graph is typically constructed in an unsupervised way from the data, so taken all together, label diffusion is a semi-supervised algorithm.
All this is very different from traditional parametric machine learning algorithms, which learn from the feature representation of the input examples, as well as their labels.
Difference between the methods
Despite all this, there is a significant difference between the two methods.
Harmonic extension, the method covered in this post, has no parameters. It assumes the given labels are completely correct, so it only predicts on the unlabeled data, interpolating from the neighborhood of each point.
In contrast, Laplacian regularization assumes the labels could be noisy. In predicting on the labeled data, it trades off between trusting the given labels and interpolating them using the graph structure, with one scalar parameter governing this tradeoff.
Harmonic extension (clamped labels)
One method clamps the values on the labeled nodes, and then interpolates the unlabeled values so that each node’s value is the average of its neighbors’ values (Zhu, Ghahramani, and Lafferty 2003).
This method of interpolation sensibly follows the graph structure, analogous to linear interpolation in a parametric setting. The interpolated function on the unlabeled data is unique, and it’s computationally easy to find: it’s called the harmonic extension of the given labels to the unlabeled data.
CODE
def harmonic_extension(
    labeled_signal,      # n-row matrix: labeled rows set to their fixed values, unlabeled rows arbitrary.
    adj_mat,             # (n x n) adjacency matrix
    labeled_ndces,       # Boolean mask indicating which cells are in the labeled set.
    method='iterative',
    eps_tol=0.01         # Min. relative error between consecutive iterations before stopping (normally <= 20 iterations).
):
    """
    Given a graph and a continuous label signal over labeled_ndces, impute the rest of the
    signal with its harmonic extension (hard clamping).
    Returns predictions of the signal over the entire graph.
    """
    labels = labeled_signal[labeled_ndces]
    if scipy.sparse.issparse(labels):
        labels = labels.toarray()
    num_labeled = np.sum(labeled_ndces)
    num_unlabeled = adj_mat.shape[0] - num_labeled
    # Row-normalized adjacency (random-walk transition) matrix, and its labeled/unlabeled blocks.
    pmat = scipy.sparse.diags(1.0 / np.ravel(adj_mat.sum(axis=0))).dot(adj_mat)
    p_uu = pmat[~labeled_ndces, :][:, ~labeled_ndces]
    p_ul = pmat[~labeled_ndces, :][:, labeled_ndces]
    inv_sofar = p_ul.dot(labels)
    if method == 'iterative':
        # Power series (I - P_uu)^{-1} = I + P_uu + P_uu^2 + ...
        cummat = p_ul.dot(labels)
        cumu = []
        stop_crit = False
        while not stop_crit:
            cummat = p_uu.dot(cummat)
            rel_err = np.square(cummat).sum() / np.square(inv_sofar).sum()
            inv_sofar = inv_sofar + cummat
            cumu.append(inv_sofar)
            if rel_err <= eps_tol:
                stop_crit = True
        # Reassemble full-length signals: clamp labeled values, fill unlabeled values from the series.
        for i in range(len(cumu)):
            to_add = np.zeros(labeled_signal.shape)
            to_add[labeled_ndces] = labels
            to_add[~labeled_ndces] = cumu[i]
            cumu[i] = to_add
        return cumu
    elif method == 'direct':
        # Solve (I - P_uu) x = P_ul y directly. lsmr expects a 1-D right-hand side, so this mode
        # handles one signal column at a time, and it returns only the unlabeled entries.
        toret = scipy.sparse.linalg.lsmr(scipy.sparse.identity(num_unlabeled) - p_uu, inv_sofar)
        return toret[0]
Details
The harmonic extension
The description of the interpolation method I gave above comes with an algorithm: clamp the labels of all labeled nodes, and interpolate the labels of unlabeled nodes as the average of their neighbors’ values. More specifically,
Given a graph \(G\), there is a unique function \(f^H\) on the vertices of \(G\) with \((\Delta f^H) \vert_{ U} = 0\) and \(f^H \vert_{ L} (l) = y_l\). This is the harmonic extension of labels \(y_L\) on \(G\).
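Here \(\Delta\) denotes the graph Laplacian; assuming the unnormalized form \(\Delta = D - A\) (with \(A\) the adjacency matrix and \(D\) the diagonal degree matrix, matching the quadratic form used later), writing out one coordinate shows that \((\Delta f) \vert_{ U} = 0\) is exactly the neighbor-averaging condition described above: \[ (\Delta f)(u) = d_u f(u) - \sum_{v \in N(u)} A_{u v} f(v) = 0 \quad\Longleftrightarrow\quad f(u) = \frac{1}{\sum_{v \in N(u)} A_{u v}} \sum_{v \in N(u)} A_{u v} f(v) \]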
The harmonic extension has a more technical description which shows how to compute it, from (Zhu, Ghahramani, and Lafferty 2003). To state this, suppose \(G\) has adjacency matrix \(A\) and diagonal degree matrix \(D\), and denote the normalized adjacency matrix \[ P = D^{-1} A =
\left[
\begin{array}{c|c}
P_{LL} & P_{LU} \\ \hline
P_{UL} & P_{UU}
\end{array}
\right] \] where the submatrices index the labeled and unlabeled examples.
\(\textbf{Formal result}\)
If \[\begin{align*} \displaystyle f^{H} \in \underset{\substack{f \in \mathbb{R}^{n}, \\ f \vert_{ L} (l) = y_l \;\forall l \in L }}{\text{argmin}} \; f^\top \Delta f
\end{align*}\] then it can be computed as \[ f^{H} \vert_{ U} = (I - P_{UU})^{-1} P_{UL} \, y_L \]
Implementation notes
At first, this looks like it requires inverting a matrix the size of the unlabeled data, which is infeasible at scale. But a key observation here is that \((I - P_{UU})^{-1} = I + P_{UU} + P_{UU}^2 + \dots\), so \[ f^{H} \vert_{ U} = (I + P_{UU} + P_{UU}^2 + \dots) P_{UL} \, y_L \] This can be accumulated term by term efficiently, with just sparse matrix multiplications. The \(i^{th}\) term \(t_i = P_{UU}^{i-1} P_{UL} \, y_L\) can be written recursively: \[ t_1 = P_{UL} \, y_L, \qquad t_{i+1} = P_{UU} \, t_i \]
This gives an easy way to calculate successive terms \(t_1, t_2, \dots\). The terms get provably exponentially smaller2, so the accumulation can be stopped after relatively few iterations.
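As a sanity check of this truncated power series, here is a small sketch on a toy random graph, comparing a direct solve of \((I - P_{UU}) \, x = P_{UL} \, y_L\) against the accumulated series terms. The toy graph, seed, and names are illustrative only.
CODE
import numpy as np
import scipy.sparse
import scipy.sparse.linalg

rng = np.random.RandomState(0)
n, n_labeled = 200, 40

# Toy symmetric weighted adjacency matrix.
A = scipy.sparse.random(n, n, density=0.05, random_state=rng)
A = (A + A.T).tocsr()
P = scipy.sparse.diags(1.0 / np.ravel(A.sum(axis=1))).dot(A)

labeled = np.zeros(n, dtype=bool)
labeled[:n_labeled] = True
y_L = rng.rand(n_labeled)

P_uu = P[~labeled, :][:, ~labeled]
P_ul = P[~labeled, :][:, labeled]
b = P_ul.dot(y_L)

# Direct solve of (I - P_uu) x = P_ul y_L.
x_direct = scipy.sparse.linalg.spsolve(
    scipy.sparse.identity(n - n_labeled).tocsc() - P_uu.tocsc(), b)

# Truncated power series: t_1 = P_ul y_L, t_{i+1} = P_uu t_i.
x_series, t = b.copy(), b.copy()
for _ in range(300):
    t = P_uu.dot(t)
    x_series += t

print(np.abs(x_direct - x_series).max())   # shrinks as more terms are accumulated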
Interpretations
Bayesian interpretation
The harmonic extension can be interpreted as the mean predicted value of a model.
In this model, signals over nodes have joint Gaussian probability \(p (f) \propto \exp( - f^\top \Delta f) = \exp( - \sum_{x < y} A_{x y} (f_{x} - f_{y})^2 )\), so that \[ f \vert_{ U} \mid f \vert_{ L} \sim \mathcal{N} \left( f^{H} \vert_{ U} , \ \tfrac{1}{2} (D_{UU} - A_{UU})^{-1} \right) \]
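To see why the conditional mean is exactly the harmonic extension, apply the standard Gaussian conditioning formula with precision proportional to \(\Delta = D - A\), partitioned into labeled and unlabeled blocks as before (the constant factor in the precision affects only the covariance, not the mean): \[\begin{align*} \mathbb{E}\left[ f \vert_{ U} \,\middle|\, f \vert_{ L} = y_L \right] &= -\Delta_{UU}^{-1} \Delta_{UL} \, y_L = (D_{UU} - A_{UU})^{-1} A_{UL} \, y_L \\ &= (I - D_{UU}^{-1} A_{UU})^{-1} D_{UU}^{-1} A_{UL} \, y_L = (I - P_{UU})^{-1} P_{UL} \, y_L = f^{H} \vert_{ U} \end{align*}\]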
Random walks
The completed function at any unlabeled node \(u\) is actually a weighted average over the labels, which can be interpreted in terms of where random walks from \(u\) first hit the labeled set. The harmonic extension is \( f^{H} (u) = \sum_{l \in L} P_{u, l} \, y_l \) on \(u \in U\), where \[ P_{u, l} := \Pr (\text{a random walk starting at $u$ reaches $L$ first through $l$}) \]
\(\textbf{Why?}\)
The proof is short, relying on the fact that \(P_{u, l} = \frac{1}{\sum_{v \in N(u)} A_{u, v}} \sum_{v \in N(u)} A_{u, v} P_{v, l}\), which follows from the definitions: the first step of a walk from \(u\) goes to a neighbor \(v\) with probability proportional to \(A_{u, v}\). Then \[ f^{H} (u)
= \sum_{l \in L} P_{u, l} y_l
= \sum_{l \in L} \sum_{v \in N(u)} \frac{A_{u, v}}{\sum_{w \in N(u)} A_{u, w}} P_{v, l} y_l
= \frac{1}{\sum_{v \in N(u)} A_{u, v}} \sum_{v \in N(u)} A_{u, v} f^{H} (v), \] which is the definition of a harmonic function: each value is the weighted average of its neighbors’ values.
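This random-walk picture is easy to check numerically: on a small graph, averaging the label at the point where each simulated walk from \(u\) first hits the labeled set should approximately reproduce \(f^{H}(u)\). A rough sketch (toy graph and names are illustrative; agreement is only approximate due to sampling noise):
CODE
import numpy as np

rng = np.random.RandomState(1)
n, n_labeled = 30, 8

# Small dense toy graph with random nonnegative edge weights.
A = rng.rand(n, n) * (rng.rand(n, n) < 0.3)
A = A + A.T
np.fill_diagonal(A, 0)

labeled = np.arange(n) < n_labeled
y_L = rng.rand(n_labeled)
P = A / A.sum(axis=1, keepdims=True)    # random-walk transition matrix

# Exact harmonic extension on the unlabeled nodes via the linear system.
P_uu, P_ul = P[np.ix_(~labeled, ~labeled)], P[np.ix_(~labeled, labeled)]
f_U = np.linalg.solve(np.eye(n - n_labeled) - P_uu, P_ul.dot(y_L))

# Monte Carlo: record the label where each walk from node u first hits the labeled set.
u = n_labeled          # the first unlabeled node
hits = []
for _ in range(5000):
    node = u
    while not labeled[node]:
        node = rng.choice(n, p=P[node])
    hits.append(y_L[node])

print(f_U[0], np.mean(hits))   # should roughly agree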
Worked example
First, I’ll prepare the existing data and label matrix to respect the current splits in adta.obs['split'], with labels in adta.obs['Tissue'].
I’m preparing this (multiclass classification) data to be predicted in a standard way:
Encode the classes as one-hot vectors (\(K\) classes at the corners of the \(K\)-simplex)
Predict these continuous-valued vector signals for each node of the graph
CODE
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()
label_mat = ohe.fit_transform(adta.obs[['Tissue']])
all_labels = label_mat.copy()
label_mat[adta.obs['split'] == 'unlabeled', :] = 0
label_mat.eliminate_zeros()   # Just for cleanup

cumuvals = harmonic_extension(
    label_mat,
    adta.obsp['connectivities'],
    np.array(adta.obs['split'] == 'labeled')
)
matching_labels = np.argmax(cumuvals[-1], axis=1) == np.ravel(all_labels.argmax(axis=1))
print("Accuracy of harmonic extension on unlabeled data: {}".format(
    matching_labels[adta.obs['split'] == 'unlabeled'].mean()))
Accuracy of harmonic extension on unlabeled data: 0.2901955135210316
Summary and extras
I’ve gone through a method that’s simple and blindingly fast to use with the neighborhood-based graphs that underlie homophily-based nonparametric learning. It is best implemented in the default iterative mode, but there is also a direct mode that eschews the iteration and instead solves a system of linear equations, giving (almost) exactly the same result.
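For example, reusing harmonic_extension, adta, and all_labels from the worked example above, the two modes can be compared on a single label column. (This is just a sketch, not part of the original pipeline; with the implementation above, 'direct' mode expects a one-dimensional signal and returns only the unlabeled entries.)
CODE
one_col = np.ravel(all_labels[:, 0].toarray())   # the first tissue class as a 0/1 signal
split_labeled = np.array(adta.obs['split'] == 'labeled')
split_unlabeled = np.array(adta.obs['split'] == 'unlabeled')

iterative_vals = harmonic_extension(one_col, adta.obsp['connectivities'], split_labeled)
direct_vals = harmonic_extension(one_col, adta.obsp['connectivities'], split_labeled, method='direct')

print(np.abs(iterative_vals[-1][split_unlabeled] - direct_vals).max())
# Small but nonzero: the iterative mode truncates the power series at eps_tol.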
There are many more applications of the diffusion framework. (Sharp et al. 2022) use it to get state-of-the-art results versus GCNs on a suite of computer graphics tasks, while (Huang et al. 2021) use it to get state-of-the-art results with very few parameters on basic node prediction problems.
References
Huang, Qian, Horace He, Abhay Singh, Ser-Nam Lim, and Austin R Benson. 2021. “Combining Label Propagation and Simple Models Out-Performs Graph Neural Networks.”
Saul, Lawrence K, Kilian Q Weinberger, Fei Sha, Jihun Ham, and Daniel D Lee. 2006. “Spectral Methods for Dimensionality Reduction.” Semi-Supervised Learning 3.
Sharp, Nicholas, Souhaib Attaiki, Keenan Crane, and Maks Ovsjanikov. 2022. “DiffusionNet: Discretization Agnostic Learning on Surfaces.” ACM Transactions on Graphics (TOG) 41 (3): 1–16.
Zhu, Xiaojin, Zoubin Ghahramani, and John D Lafferty. 2003. “Semi-Supervised Learning Using Gaussian Fields and Harmonic Functions.” In Proceedings of the 20th International Conference on Machine Learning (ICML-03), 912–19.
Footnotes
Built from PCA-derived representations of the data.↩︎
This is because all eigenvalues of the matrix \(P\) have magnitudes \(\leq 1\).↩︎