Constructing data-driven neighborhood graphs

Categories: graphs, ML

Building neighborhood graphs from datasets

Author: Akshay Balsubramani
Modified: April 15, 2023

Introduction: constructing graphs from data

Graphs play a crucial role in data analysis, helping to intuitively visualize and understand complex data structures, and neighborhood-based graphs are a particularly useful substrate for algorithms to work with.

Nearest-neighbor graphs are increasingly widely used to represent the landscape of data in terms of their local relationships, allowing algorithms to manipulate and learn from a “data manifold.” This is implicitly the type of learning performed by spectral methods for dimensionality reduction (Saul et al. 2006), which have been around for several decades. More recently, it’s been the basis of the visualization methods tSNE and UMAP, which are famously widely used to look at the broad sweep of heterogeneity across a dataset.

I’m a big fan of using neighborhood-based graphs for data science, because they tend to be:

  • Expressive: The graph is nonparametric, and represents the data at the finest possible resolution, with each point linked to only its immediate neighborhood. This is what makes it possible for algorithms like tSNE and UMAP to purport to visualize the whole space.

  • Efficient: The \(k\)-NN graph is represented by a very sparse matrix, with only linearly many nonzero elements. The methods then often amount to linear algebra on these matrices, which processors can do very quickly.

  • Concise: We can often realize the above advantages with relatively straightforward, short implementations.

There’s a whole world of sophisticated approaches to constructing such graphs and using them for analysis. Since they’re so useful, a broadening array of such methods keeps appearing in advanced applications.

Here I’ll outline some techniques for graph construction, with a particular emphasis on sparse graphs and efficient, expressive construction. Efficient numpy/scipy code for these approaches is often hard to find, if it exists at all, so I’ll include code implementations and usage examples as well.

Manifold learning and affinity matrices

All of the methods I’ll discuss are based on the same basic setup - a weighted graph with \(n\) vertices represented by an \(n \times n\) adjacency matrix \(A \geq 0\).

  • \(A\) can be symmetric (undirected graphs) or asymmetric (directed graphs).
  • \(A\) can be binary (unweighted graphs) or real-valued (weighted graphs).

The setting we consider here is a \(d\)-dimensional, compact, smooth Riemannian manifold \((\mathcal{M}, g)\) embedded1 in \(\mathbb{R}^p\), with the ambient dimension \(p\) being much larger than the intrinsic dimension \(d\). Our observations are a set of points sampled from the manifold, according to some density supported on \(\mathcal{M}\).

Starting from pairwise distances in the ambient Euclidean space, the characteristic workflow for a majority of manifold learning approaches (Coifman and Lafon 2006; Roweis and Saul 2000; Tenenbaum, Silva, and Langford 2000) is to construct an affinity matrix \(A\), which can be thought of as capturing local information in the adjacency matrix of a weighted graph.

Affinity matrices are sometimes referred to as kernel matrices.2 Pairs of points in close proximity (as measured in the ambient space) are assigned a high affinity, and pairs that are far away are assigned a low (or zero) affinity. The most common practice is to build \(A\) as a kNN graph.

The goal of \(A\) therefore is to encode only local relationships: \(\mathcal{M}\) is smooth, so the ambient Euclidean metric locally agrees with the Riemannian metric. Following this reasoning, the weighted graph \(A\) is a reasonably faithful representation of the latent manifold \(\mathcal{M}\).

Usage of \(A\) occurs through the associated graph Laplacian in computational areas ranging from machine learning to geometry. The graph Laplacian has a multitude of interpretations and applications as a discrete counterpart to the continuous Laplacian \(\nabla^2\) (the Laplace-Beltrami operator; see e.g. (Von Luxburg, Belkin, and Bousquet 2008) for theory).

In summary, we’ll generally think of \(A\) as being the \(k\)-nearest-neighbors graph of \(X = \{ x_1 , x_2, \dots, x_n \}\), which is very sparse. So \(A\) represents the manifold of data points \(\mathcal{M}\) for use in analysis.

Constructing the nearest neighbor graph

The task of nearest-neighbor search and retrieval is fundamental to modern ML - it’s the way any vector database indexes its data. The authoritative open benchmark on this is a good place to start in choosing a method to use. I like pynndescent here, one of the top-performing methods on the benchmark. It’s fast and simple to use and understand, without extraneous dependencies, and written by the same author as the excellent UMAP method. I often use the single-cell package scanpy, which uses pynndescent for its nearest neighbor search.
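
To make this concrete, here’s a minimal sketch of building a sparse kNN distance graph with pynndescent (the NNDescent constructor and its neighbor_graph attribute are as I recall them from the library’s docs, so double-check against your installed version; knn_adjacency and its arguments are names I’m inventing for illustration). scanpy’s sc.pp.neighbors builds a similar graph and, in recent versions, stores it in adata.obsp.

CODE
import numpy as np, scipy, scipy.sparse
import pynndescent

def knn_adjacency(data_mat, n_neighbors=15):
    """Sparse kNN distance graph from a data matrix (rows = points), via approximate NN search."""
    index = pynndescent.NNDescent(data_mat, n_neighbors=n_neighbors)
    indices, distances = index.neighbor_graph       # each of shape (n, n_neighbors)
    n = data_mat.shape[0]
    rows = np.repeat(np.arange(n), n_neighbors)
    A = scipy.sparse.csr_matrix(
        (distances.ravel(), (rows, indices.ravel())), shape=(n, n))
    A.setdiag(0)            # each point is typically returned as its own nearest neighbor
    A.eliminate_zeros()
    return A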

Diffusion maps

Diffusion maps are a powerful tool for understanding and visualizing the structure of complex data sets, including graphs. This technique, rooted in the field of manifold learning, leverages the concept of diffusion processes on a data set – random walks derived from the neighborhood graph \(A\). The principal components of these random walks provide a low-dimensional basis that captures the coarse-grained (“zoomed-out”) structure of the dataset.

All this can be done extremely efficiently. The diffusion map is particularly useful in cases where the data lies on a non-linear manifold, enabling us to capture complex geometries that traditional methods might overlook.

The first step here is to construct the diffusion process’s transition kernel on the neighborhood graph \(A\). More on this is in (Coifman et al. 2005) and other papers.

Many graph toolboxes have built-in functions for this, but I’ll implement them from scratch here.3

Setup

Diffusion map embeddings come in many related flavors that are all based on the same basic setup. This section can be skipped to cut straight to the implementation.

Density normalization

Define \(d_i = \sum_{j \in [n]} A_{ij}\), an estimate of the local sampling density at point \(i\), and construct \(D = \text{diag}(d)\). Then we normalize with a parameter \(\alpha \in \{0, 0.5, 1 \}\), to make a kernel matrix \[ K := D^{- \alpha} A D^{- \alpha} \] This is a standard normalization in single-cell analysis (Haghverdi, Buettner, and Theis 2015).

Computing a transition matrix

We compute the transition matrix of a random walk over the weighted graph represented by \(K\), normalizing \(K\) by its row sums \(p(x) := \sum_{y} K (x, y)\). Defining \(D_p = \text{diag}(p(x))\), this transition matrix is \[ T (x, y) = \frac{K (x, y)}{p (x)} = D_p^{-1} K \]

\(\textbf{Details}\)

It’s much better computationally to work with the symmetrized version of \(T\): \[ T^s (x, y) = \frac{K (x, y)}{\sqrt{ p(x) p(y) }} = D_p^{-1/2} K D_p^{-1/2} = D_p^{1/2} T D_p^{-1/2} \] \(T^s\) and \(T\) have the same eigenvalues, but somewhat different eigenvectors related by the following equivalences: \[ T \psi = D_p^{-1} K \psi = \lambda \psi \iff D_p^{1/2} T \psi = D_p^{-1/2} K \psi = \lambda D_p^{1/2} \psi \iff T^s D_p^{1/2} \psi = \lambda D_p^{1/2} \psi \] So for an eigenvalue \(\lambda\), right eigenvectors \(\psi\) in the unsymmetrized space correspond to eigenvectors \(D_p^{1/2} \psi := \chi\) when symmetrized.

We decompose the symmetrized matrix to get eigenvectors \(\chi\), and then compute \(D_p^{-1/2} \chi\) as the corresponding right eigenvectors of \(T\) as needed. It can be verified that the stationary distribution of \(T\) is \(\phi_0 (x) = \frac{p(x)}{\sum_y p(y)}\), which is the top left eigenvector.4
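
To see these identities concretely, here’s a tiny dense-matrix sanity check on a random 5-node kernel:

CODE
import numpy as np

rng = np.random.default_rng(0)
K = rng.random((5, 5)); K = 0.5*(K + K.T)      # a small symmetric kernel
p = K.sum(axis=1)
T = K / p[:, None]                             # D_p^{-1} K
Ts = K / np.sqrt(np.outer(p, p))               # D_p^{-1/2} K D_p^{-1/2}
assert np.allclose(np.sort(np.linalg.eigvals(T).real), np.sort(np.linalg.eigvalsh(Ts)))  # same spectrum
lam, chi = np.linalg.eigh(Ts)                  # eigenvectors chi of the symmetrized matrix
psi = chi / np.sqrt(p)[:, None]                # right eigenvectors psi = D_p^{-1/2} chi of T
assert np.allclose(T @ psi, psi * lam)         # T psi = lambda psi, column by column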

Diffusion map embedding

The natural embedding associated with this transition kernel \(T\) is the diffusion map embedding.

\(\textbf{Details}\)

By the spectral decomposition, \(T = \sum_{i=1}^{n} \lambda_i \psi_i \phi_i^\top\), where \(\phi_{i}\) and \(\psi_{i}\) are the left and right eigenvectors of \(T\), respectively.

The standard diffusion map embedding is defined (Coifman et al. 2005) to be
\[ \Psi_t (x) := \{ \lambda_i^t \psi_i (x) \}_{i=1}^n = \left\{ \lambda_i^t \frac{\phi_i (x)}{\phi_0 (x)} \right\}_{i=1}^n \]

Letting the \(t\)-step transition matrix be \(T^t\), the diffusion distance between two points is defined to be an average distance between random walks started at those points (Nadler et al. 2006). \[ D_t^2 (x, y) := \sum_z \frac{(T^t (x, z) - T^t (y, z))^2}{\phi_0 (z)} = \sum_{i \geq 1} \lambda_i^{2t} (\psi_{i} (x) - \psi_{i} (y))^2 = \| \Psi_t (x) - \Psi_t (y) \|^2 \]

In practice, we calculate the embedding from the symmetric kernel’s eigenvectors \(\chi_i\), converting them to the right eigenvectors of \(T\) via \(\psi_i = D_p^{-1/2} \chi_i\).

This involves computing the top few eigenvectors of \(T\). As \(T\) is sparse and we only need a few eigenvectors, this can be done very efficiently with iterative sparse eigensolvers.

CODE
import sys, numpy as np, time, scipy, scipy.linalg, scipy.sparse
import matplotlib.pyplot as plt
%matplotlib inline

import anndata, scanpy as sc


def symmetric_part(A):
    """ Returns the symmetrized version of the input matrix. """
    return 0.5*(A + A.T)


def compute_transitions(adj_mat, alpha=1.0, sym=True, normalize=True):
    """Compute symmetrized transition matrix. 
    alpha : The exponent of the diffusion kernel. 
    * 1 = Laplace-Beltrami (density)-normalized [default]. 
    * 0.5 = normalized graph Laplacian (Fokker-Planck dynamics, cf. "Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators" NIPS2006).
    * 0 = classical graph Laplacian. 
    """
    similarity_mat = symmetric_part(adj_mat)
    dens = np.asarray(similarity_mat.sum(axis=0))  # dens[i] is an estimate for the sampling density at point i.
    K = scipy.sparse.spdiags(np.power(dens, -alpha), 0, similarity_mat.shape[0], similarity_mat.shape[0])
    W = scipy.sparse.csr_matrix(K.dot(similarity_mat).dot(K))
    if not normalize:
        return W
    else:
        z = np.sqrt(np.asarray(W.sum(axis=0)).astype(np.float64))    # sqrt(density)
        recipsqrt = np.reciprocal(z)
        recipsqrt[np.isinf(recipsqrt)] = 0
        Zmat = scipy.sparse.spdiags(recipsqrt, 0, W.shape[0], W.shape[0])
        return Zmat.dot(W).dot(Zmat) if sym else Zmat.power(2).dot(W)
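
Putting the pieces together, here’s a sketch of how the embedding itself could be computed from compute_transitions using scipy’s sparse eigensolver (diffusion_map_embedding, n_comps, and t are names I’m making up; it assumes the graph has no isolated nodes):

CODE
import scipy.sparse.linalg

def diffusion_map_embedding(adj_mat, n_comps=10, alpha=1.0, t=1):
    """Sketch: top eigenvectors of the symmetrized transition matrix, converted to diffusion coordinates."""
    T_sym = compute_transitions(adj_mat, alpha=alpha, sym=True, normalize=True)
    evals, evecs = scipy.sparse.linalg.eigsh(T_sym, k=n_comps, which='LM')   # sparse symmetric eigensolver
    order = np.argsort(-evals)                      # sort by decreasing eigenvalue
    evals, evecs = evals[order], evecs[:, order]
    K_mat = compute_transitions(adj_mat, alpha=alpha, sym=True, normalize=False)
    p = np.asarray(K_mat.sum(axis=0)).flatten()     # p(x), the row sums of the kernel
    psi = evecs / np.sqrt(p)[:, None]               # right eigenvectors psi = D_p^{-1/2} chi
    return psi[:, 1:] * (evals[1:] ** t)            # drop the trivial top eigenvector; scale by lambda^t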

This is a well-known embedding that is a standard tool in single-cell analysis (Haghverdi et al. 2016).

PageRank and GCN kernels

Personalized PageRank

The Personalized PageRank (PPR) graph kernel is an effective tool for learning node embeddings of a graph. By personalizing the PageRank algorithm for each node, the PPR kernel captures rich connectivity patterns within sparse networks.

The PPR kernel (Gasteiger, Weißenberger, and Günnemann 2019) is defined using a (normalized) adjacency matrix \(A\) and a “teleport probability” parameter \(\alpha \in [0,1]\). It can be interpreted as the induced distribution of a random “surfer” who, having started at an initial vertex \(x\), teleports back to \(x\) from wherever it is after each step with some probability \(\alpha\).

This means that if \(e_{x}\) is the indicator vector for \(x\), then the PPR kernel \(\pi_{PPR}\) satisfies: \[ \pi_{PPR} (e_{x}) = (1 - \alpha) A \pi_{PPR} (e_{x}) + \alpha e_{x} \] Solving,

\[ \pi_{PPR} (e_{x}) := \alpha \left( I - (1-\alpha) A \right)^{-1} e_{x} \]

Variations on this have been quite helpful for deep learning in recent years (Klicpera, Bojchevski, and Günnemann 2019).

Computing PPR efficiently

\(\textbf{Details}\)

It’s useful to compute this iteratively, using the identity

\[ \left( I - (1-\alpha) A \right)^{-1} = I + \sum_{k=1}^{\infty} ((1-\alpha) A)^{k} \]

This interpretably expands the inverse in terms of powers of \(A\), which is sparse. The resulting algorithm is actually identical to what I’ve covered elsewhere as “label propagation.”

CODE
def personalized_pagerank_kernel(adj_mat, apply_to_data=None, param_alpha=0.05, eps_tol=0.5):
    """
    Approximate the personalized PageRank kernel param_alpha*(I - (1-param_alpha)*R)^{-1},
    applied to `apply_to_data` if given, using power iterations.
    """
    R = compute_transitions(adj_mat, alpha=0.0, sym=False, normalize=True).T
    labeled_signal = scipy.sparse.identity(R.shape[0])
    if apply_to_data is not None:
        labeled_signal = labeled_signal.dot(apply_to_data)
    F = labeled_signal    # start the iteration from the signal itself
    rel_err = 1.0
    while rel_err > eps_tol:
        F_new = ((1-param_alpha)*R.dot(F)) + (param_alpha*labeled_signal)
        resid = F_new - F
        # Relative change per iteration, handling both sparse and dense signals.
        resid_err = np.sum(np.square(resid.data)) if scipy.sparse.issparse(resid) else np.sum(np.square(resid))
        denom_err = np.sum(np.square(F_new.data)) if scipy.sparse.issparse(F_new) else np.sum(np.square(F_new))
        rel_err = resid_err/denom_err
        F = F_new
    return F
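
A hypothetical usage: smoothing per-node signals (say, an \(n \times c\) one-hot label matrix) over a kNN graph. Here adj_knn and labels_onehot are placeholder names.

CODE
# Diffuse one-hot labels over the graph with PPR smoothing; tighten eps_tol for a closer approximation.
smoothed_labels = personalized_pagerank_kernel(
    adj_knn, apply_to_data=labels_onehot, param_alpha=0.1, eps_tol=1e-6)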

Graph convolutional networks

The kernel used for graph convolutional networks (GCNs) (Wu et al. 2019) is a variant of this, and has been shown to be useful in learning. It’s slightly different from the standard Laplacian-based diffusion kernel, in that it includes additional self-loops on each node of the graph for numerical stability.

CODE
def gcn_kernel(adj_mat, self_loops=True):
    """
    Compute a graph convolutional net kernel.
    """
    if self_loops:
        adj_mat = adj_mat + scipy.sparse.identity(adj_mat.shape[0])
    norm_adj = compute_transitions(adj_mat, alpha=0.0, sym=True, normalize=True)
    return norm_adj
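
For instance, the “simplified GCN” of (Wu et al. 2019) amounts to propagating node features a few times with this kernel before fitting a linear model. A minimal sketch of that propagation step (sgc_features, feat_mat, and n_layers are names I’m inventing):

CODE
def sgc_features(adj_mat, feat_mat, n_layers=2):
    """Sketch: SGC-style propagation S^K X, where S is the self-loop GCN kernel above."""
    S = gcn_kernel(adj_mat, self_loops=True)
    propagated = feat_mat
    for _ in range(n_layers):
        propagated = S.dot(propagated)     # one round of neighborhood averaging
    return propagated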

Another variant of this kernel (Wang et al. 2021) tries to more precisely interpolate the continuous diffusion process5, eschewing the self-loops since it’s already numerically stable enough.

This kernel has two parameters, terminal_time (T) and num_steps_propagation (K); it computes a diffusion process up to a time T in discrete chunks with resolution governed by K. Starting as before with the normalized adjacency matrix \(A\), the diffusion process is computed as

\[ X_{T} := \underbrace{\left( \left(1 - \frac{T}{K}\right) I + \frac{T}{K} A \right)^{K}}_{\text{new kernel}} X \]

\(K\) should be set as high as possible, but \(T\)’s interpretation is less straightforward. The paper (Wang et al. 2021) discusses the role of both parameters.

  • …there exists an optimal terminal time T for each dataset with the best feature separability….
  • Either a smaller T (under-smooth) or a larger T (over-smooth) will mix the features up and make them more indistinguishable, which eventually leads to lower accuracy.
  • …We can see that, with fixed optimal T, too large step size ∆t (i.e., too small propagation steps K) will lead to feature collapse, while gradually increasing the propagation steps K makes the nodes of different classes more separable…
CODE
def graphconv_euler_kernel(adj_mat, apply_to_data=None, terminal_time=5, num_steps_propagation=3):
    """
    Compute the Euler-Maruyama-discretized graph diffusion kernel.

    adj_mat : The adjacency matrix of the graph.
    apply_to_data : The data to which the kernel is applied. If None, then the kernel is applied to the identity matrix.
    terminal_time : The terminal time T of the diffusion process.
    num_steps_propagation : The number of steps K in the Euler-Maruyama discretization.
    """
    norm_adj = compute_transitions(adj_mat, alpha=0.0, sym=True, normalize=True)
    step_length = 1.0*terminal_time/num_steps_propagation   # step size Delta_t = T/K
    if apply_to_data is None:
        apply_to_data = scipy.sparse.identity(norm_adj.shape[0])
    result = apply_to_data
    for i in range(num_steps_propagation):
        # One Euler step of the diffusion: X <- (1 - Delta_t) X + Delta_t * A X
        result = (1-step_length)*result + step_length*norm_adj.dot(result)
    return result
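
A hypothetical usage, choosing \(K\) large enough that the step size \(T/K\) stays below 1 (adj_knn and feat_mat are placeholder names):

CODE
# Diffuse an n x d feature matrix up to T = 2 with K = 20 Euler steps (step size 0.1).
diffused_feats = graphconv_euler_kernel(
    adj_knn, apply_to_data=feat_mat, terminal_time=2, num_steps_propagation=20)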

More meaningful sparse graph neighborhoods

There are other robust methods for graph construction, effectively trying to derive sparse graph structures with individually meaningful connections, starting from the kNN graph. A great example of these concepts, with intuitive and clear explanations, is given by (Saul 2020) – I’ll go over that, while of course providing code for it.

Robust neighborhood graphs

The first step is to build the kNN graph, where each node is connected to its \(k\) nearest neighbors; we do this as above, giving a sparse adjacency matrix \(K\). Though this is a good efficient heuristic, we want to build a graph with more meaningful edges. As (Saul 2020) says:

“In practice, we find that, even for small values of k, this graph may be too permissive as a representation of pairs of inputs that should be mapped to nearby outputs. Therefore, the rest of our procedure is designed to extract a subset of edges in this graph that represent more robust neighborhood relationships.”

This proceeds in steps that are good examples of how to identify meaningful connective structure in the graph:

  • Identify connections that unambiguously exist: The first \(s\) powers of the matrix \(K\) are accumulated, playing out a random walk on the graph for \(s\) steps. The matrix \(R\) is defined by accumulating these matrices: \(R = K + \dots + K^{s}\). Finally, form the matrix \(M_{ij} = R_{ij} R_{ji}\), which is only nonzero when \(i\) and \(j\) are mutually reachable according to \(R\).
CODE
def diffusion_mutualreach_kernel(adj_mat, s=1):
    """Accumulate R = K + K^2 + ... + K^s, then return M with M_ij = R_ij * R_ji (mutual reachability)."""
    from functools import reduce
    powers_so_far = []
    kmat = adj_mat
    powers_so_far.append(kmat)
    for i in range(s-1):
        kmat = kmat.dot(adj_mat)
        powers_so_far.append(kmat)
    result = reduce(lambda x, y: x + y, powers_so_far)
    return result.multiply(result.T)
  • Identify a minimal set of essential connections: Build a graph \(U\) such that \(U_{ij} = K_{ij} + K_{ji} − K_{ij} K_{ji}\), which encodes if either node is among the other’s \(k\) nearest neighbors. Construct a minimum spanning tree of \(U\). The edges are weighted by the Euclidean distances, and the largest connected component is considered. The adjacency matrix of the minimum weighted spanning tree is denoted by \(T\), which “…encodes a minimal set of edges that preserve the large-scale connectivity of the inputs” (Saul 2020).
CODE
def mst_connections_kernel(adj_mat):
    """Minimum spanning tree of the 'either-direction' neighbor graph U (assumes adj_mat holds kNN distances)."""
    import scipy.sparse.csgraph
    umat = adj_mat + adj_mat.T - adj_mat.multiply(adj_mat.T)   # U_ij = K_ij + K_ji - K_ij*K_ji
    treemat = scipy.sparse.csgraph.minimum_spanning_tree(umat)
    # minimum_spanning_tree stores each edge only once; symmetrize to get the tree's adjacency matrix.
    return treemat + treemat.T

The results are then combined into a final matrix: \[ E_{ij} = K_{ij} (T_{ij} + M_{ij}) \]

In words, this matrix connects \(i\) and \(j\) if \(j\) is among \(i\)’s \(k\) nearest neighbors, AND if \(i\) and \(j\) are either unambiguously connected according to \(R\) or their connection is essential according to the tree \(T\). As (Saul 2020) puts it:

“By construction, this neighborhood graph connects only a subset of those pairs of inputs that are kNN; in particular, it retains those pairs needed to keep the graph connected, as well as those whose mutuality suggests an extra degree of robustness. Note that these neighborhood relations are not, in general, symmetric, so that we may sometimes regard \(x_j\) as nearby to \(x_i\) (when \(E_{ij}\) = 1) but not vice versa (when \(E_{ji}\) = 0). This happens, most notably, when \(x_i\) is an outlier.”

CODE
def saul_nonparametric_kernel(adj_mat, s=1):
    Mmat = diffusion_mutualreach_kernel(adj_mat, s=s)
    Tmat = mst_connections_kernel(adj_mat)
    return adj_mat.multiply(Mmat + Tmat)
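
A hypothetical end-to-end usage, combining this with the kNN construction sketched earlier (knn_adjacency and data_mat are placeholder names; the adjacency is assumed to hold distances):

CODE
A_knn = knn_adjacency(data_mat, n_neighbors=10)    # sparse kNN distance graph
E = saul_nonparametric_kernel(A_knn, s=2)          # s-step mutual reachability + spanning-tree backbone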

Higher-order motif counting

As shown above, scoring an edge \((i,j)\) in the graph based on higher-order paths between \(i\) and \(j\) is a powerful way to understand the structure of the graph. This can be based on counting motifs like triangles through the edge, lending interpretability to the process. It is also ultra-efficient with linear algebra, as I write about separately; a minimal sketch of the triangle-counting idea is below.
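
This sketch counts, for each existing edge, the number of triangles passing through it (edge_triangle_counts is a name I’m inventing; it assumes a CSR/CSC adjacency matrix):

CODE
def edge_triangle_counts(adj_mat):
    """Sketch: count triangles through each existing edge via the sparse product A^2, masked to edges of A."""
    A = adj_mat.copy()
    A.data = np.ones_like(A.data)      # binarize: keep the edge pattern, drop weights
    A = A.maximum(A.T)                 # symmetrize
    return A.dot(A).multiply(A)        # entry (i,j): number of common neighbors, kept only on existing edges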

Learning graph structures from data

All the methods covered here use \(k\)-NN graphs and look at graph construction as an essentially local problem, learning each neighborhood roughly separately from far-away ones. This post has covered some advanced methods for constructing graphs, especially sparse ones, but it is not an exhaustive list; there are numerous other techniques and strategies waiting to be explored.

In other posts, I’ll cover some useful tools that learn these edge structures in a more global way.

References

Belkin, Mikhail, Partha Niyogi, and Vikas Sindhwani. 2006. “Manifold Regularization: A Geometric Framework for Learning from Labeled and Unlabeled Examples.” Journal of Machine Learning Research 7 (11).
Coifman, Ronald R, and Stéphane Lafon. 2006. “Diffusion Maps.” Applied and Computational Harmonic Analysis 21 (1): 5–30.
Coifman, Ronald R, Stephane Lafon, Ann B Lee, Mauro Maggioni, Boaz Nadler, Frederick Warner, and Steven W Zucker. 2005. “Geometric Diffusions as a Tool for Harmonic Analysis and Structure Definition of Data: Diffusion Maps.” Proceedings of the National Academy of Sciences 102 (21): 7426–31.
Crane, Keenan, Clarisse Weischedel, and Max Wardetzky. 2017. “The Heat Method for Distance Computation.” Communications of the ACM 60 (11): 90–99.
Gasteiger, Johannes, Stefan Weißenberger, and Stephan Günnemann. 2019. “Diffusion Improves Graph Learning.” Advances in Neural Information Processing Systems 32.
Haghverdi, Laleh, Florian Buettner, and Fabian J Theis. 2015. “Diffusion Maps for High-Dimensional Single-Cell Analysis of Differentiation Data.” Bioinformatics 31 (18): 2989–98.
Haghverdi, Laleh, Maren Büttner, F Alexander Wolf, Florian Buettner, and Fabian J Theis. 2016. “Diffusion Pseudotime Robustly Reconstructs Lineage Branching.” Nature Methods 13 (10): 845–48.
Klicpera, Johannes, Aleksandar Bojchevski, and Stephan Günnemann. 2019. “Combining Neural Networks with Personalized Pagerank for Classification on Graphs.” In International Conference on Learning Representations.
Nadler, Boaz, Stéphane Lafon, Ronald R Coifman, and Ioannis G Kevrekidis. 2006. “Diffusion Maps, Spectral Clustering and Reaction Coordinates of Dynamical Systems.” Applied and Computational Harmonic Analysis 21 (1): 113–27.
Roweis, Sam T, and Lawrence K Saul. 2000. “Nonlinear Dimensionality Reduction by Locally Linear Embedding.” Science 290 (5500): 2323–26.
Saul, Lawrence K. 2020. “A Tractable Latent Variable Model for Nonlinear Dimensionality Reduction.” Proceedings of the National Academy of Sciences 117 (27): 15403–8.
Saul, Lawrence K, Kilian Q Weinberger, Fei Sha, Jihun Ham, and Daniel D Lee. 2006. “Spectral Methods for Dimensionality Reduction.” Semi-Supervised Learning 3.
Solomon, Justin, Fernando De Goes, Gabriel Peyré, Marco Cuturi, Adrian Butscher, Andy Nguyen, Tao Du, and Leonidas Guibas. 2015. “Convolutional Wasserstein Distances: Efficient Optimal Transportation on Geometric Domains.” ACM Transactions on Graphics (ToG) 34 (4): 1–11.
Tenenbaum, Joshua B, Vin de Silva, and John C Langford. 2000. “A Global Geometric Framework for Nonlinear Dimensionality Reduction.” Science 290 (5500): 2319–23.
Von Luxburg, Ulrike, Mikhail Belkin, and Olivier Bousquet. 2008. “Consistency of Spectral Clustering.” The Annals of Statistics, 555–86.
Wang, Yifei, Yisen Wang, Jiansheng Yang, and Zhouchen Lin. 2021. “Dissecting the Diffusion Process in Linear Graph Convolutional Networks.” Advances in Neural Information Processing Systems 34: 5758–69.
Wu, Felix, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. 2019. “Simplifying Graph Convolutional Networks.” In International Conference on Machine Learning, 6861–71. PMLR.

Footnotes

  1. Smoothly, i.e. diffeomorphically.↩︎

  2. However, they don’t abide by the technical ML definition of kernels, as in general they may fail to be positive semidefinite and thus are not strictly kernels.↩︎

  3. There is a host of different normalizations in different papers, which are used in different situations, with almost the same code. It’s short and illuminating code that allows us to control how fast we do the computation.↩︎

  4. Similarly, any unsymmetrized left eigenvector \(\phi\) of \(T\), i.e. \(\phi^\top T = \phi^\top D_p^{-1} K = \lambda \phi^\top\), satisfies \(\left( D_p^{-1/2} \phi \right)^\top T^s = \lambda \left( D_p^{-1/2} \phi \right)^\top\); so left eigenvectors of \(T\) correspond to eigenvectors \(D_p^{-1/2} \phi\) of \(T^s\).↩︎

  5. It uses Euler-Maruyama discretization of the continuous process, instead of just taking its discrete-time version.↩︎