Introduction: thinning out edges in a graph

Approximating a given graph by a graph with fewer edges is called sparsification. It turns out that every weighted undirected graph can be efficiently sparsified into a graph that is "close" to it on all scales, fine and coarse. This is done by a remarkable algorithm, whose principles are so simple and elegant that I couldn't resist trying it myself on some gene network data.

Since I found a scarcity of well-documented modular Python implementations of this algorithm 1, I implemented a version in this notebook that directly uses the ideas of the seminal work that introduced the algorithm ([Spielman and Srivastava 2011]). The version we'll discuss here scales easily on a laptop CPU to graphs with tens of thousands of vertices, and tens of millions of edges.

Let's get started.

What do we get from sparsification?

A sparsification method operates on an undirected weighted graph $G$. It creates a sparsifier $H$ -- a graph with the same vertices as $G$ but many fewer edges -- which is similar to $G$ in terms of connectivity. There are some clear benefits to this:

  • Dense graphs are produced by many routines; kernel and similarity matrices are dense by nature, and sparsification enables us to store and manipulate them efficiently.
  • The sparse graphs produced by the method of [Spielman and Srivastava 2011] that we implement can be produced at any desired level of sparsity, which translates proportionally into gains in storage and calculation efficiency.
  • As an intermediate computation, we implement an efficient and principled way to locally characterize the connectivity of the graph.

Sparsification preserves cuts

Remarkably, the sparsification method preserves graph cuts -- that is, any subset of vertices cuts approximately the same weight of edges in $H$ as in $G$. The method guarantees this for every subset of vertices simultaneously. This is a powerful property -- it ensures the topology of the graph is preserved after sparsification, at coarse and fine scales.

Sparsifying a graph by resampling edges

The core algorithm here is easy to describe. Given $G$, it produces a sparsified graph with at most num_edges edges, constructed using a simple procedure:

  1. Calculate weights over edges of $G$.
  2. Sample num_edges edges with replacement from the distribution induced by the weights.

We can write this as a function, with the weight calculation abstracted out. The weight calculation requires a precomputed quantity for each edge, called the effective resistance, which we will discuss next. For now, it is included in the API of our function.

import sys, numpy as np, pandas as pd, time, scipy, scipy.linalg
import scipy.sparse as sp
import scipy.sparse.csgraph   # needed for sp.csgraph.laplacian below
import matplotlib.pyplot as plt
import anndata, scanpy as sc
%matplotlib inline

print('Packages imported.')
Packages imported.

def sample_sparsifier(adj_mat, num_edges, R, edges=None, ultrasparsifier=False):
    """
    Params:
    ------
    adj_mat : sp.csr_matrix
        The adjacency matrix of the graph.
    num_edges : int
        The number of samples for the sparsifier.
    R : array-like
        Effective resistance (exact or approximate) of each nonzero entry of adj_mat.
    edges : tuple of (rows, cols) arrays, or None
        The edges that R is indexed by; defaults to adj_mat.nonzero().

    Returns:
    ------
    H : sp.csr_matrix
        The adjacency matrix of the sparsified graph with at most num_edges edges.
    """
    rows, cols = adj_mat.nonzero() if edges is None else edges
    weights = np.ravel(np.asarray(adj_mat[rows, cols]))   # Weight of each edge.
    probs = compute_sampling_probs(adj_mat, R, ultrasparsifier=ultrasparsifier)
    sampled = np.random.choice(len(probs), num_edges, p=probs, replace=True)
    H = sp.lil_matrix(adj_mat.shape)
    for idx in sampled:
        # Each sampled edge contributes its weight, rescaled by 1 / (num_edges * sampling probability).
        H[rows[idx], cols[idx]] += weights[idx] / (num_edges * probs[idx])
    return H.tocsr()

Edge sampling weights using effective resistances

How do we compute edge weights to guide this sparsification-by-sampling procedure? We want to put higher weight on edges that are bottlenecks to the flow of random walks on the graph, as such edges are more essential to preserving the overall topology of the graph.

One intuitive way of measuring this property between any pair of vertices $a$ and $b$ is the commute time $C_{ab}$ - the expected number of steps that a random walker would take to diffuse from $a$ to $b$ and back, which is higher around bottlenecks. It turns out that if the commute times are used as edge sampling weights, the resulting sparsifier preserves the graph topology beautifully! This is what we'll implement.

Effective resistance

We'll find it more convenient to compute a quantity called the effective resistance $R_{ab}$ across each edge $(a, b)$ [Klein and Randić 1993]. This is equivalent to computing the commute time $C_{ab}$, as $R_{ab}$ is proportional to $C_{ab}$ in any given graph (see [Chandra et al. 1996]).

$R_{ab}$ can be understood in other equivalent ways:

  • $R_{ab}$ is proportional to the probability the edge appears in a random spanning tree of the graph (for unweighted graphs). Bottleneck edges that are essential to the topology of the graph will be in the most spanning trees, and will have highest weight.
  • $R_{ab}$ is the effective electrical resistance between two nodes in an electrical network laid out according to the graph, with a resistor placed on each edge according to the weights. It is a vast generalization to arbitrary networks of the well-known formulas for series and parallel resistances (a quick numerical check of this follows after this list).
  • As such, the effective resistance is a natural graph-based measure of distance that accounts for the multiplicity of paths between any two nodes.
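
To make the series/parallel interpretation concrete, here is a quick standalone check (illustrative only, not part of the sparsification pipeline below): on a tiny weighted path and triangle constructed here, the quantity $\chi^\top L^{+} \chi$ matches the classical resistor formulas. The helper `resistance` and the toy graphs are just assumptions for this demo.

# Illustrative check of the series/parallel interpretation (not part of the pipeline).
import numpy as np
from scipy.sparse.csgraph import laplacian

def resistance(adj, a, b):
    # R_ab = chi^T L^+ chi, with chi the signed indicator of the pair (a, b).
    L_pinv = np.linalg.pinv(laplacian(np.asarray(adj, dtype=float)))
    chi = np.zeros(adj.shape[0]); chi[a], chi[b] = 1.0, -1.0
    return chi @ L_pinv @ chi

# Series: path a - b - c with conductances 2 and 4 (resistances 0.5 and 0.25).
path = np.array([[0, 2, 0], [2, 0, 4], [0, 4, 0]], dtype=float)
print(resistance(path, 0, 2), 1/2 + 1/4)                 # both 0.75

# Parallel: a direct a-b edge of weight 1 in parallel with the two-hop path a - c - b.
tri = np.array([[0, 1, 2], [1, 0, 4], [2, 4, 0]], dtype=float)
print(resistance(tri, 0, 1), 1 / (1 + 1 / (1/2 + 1/4)))  # both ~0.4286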

def compute_sampling_probs(adj_mat, R, ultrasparsifier=False):
    """
    R : effective resistance of each nonzero entry of adj_mat (same order as adj_mat.nonzero()).
    ultrasparsifier : Change probabilities so that no edge gets scaled up much. 
    Only guarantees spectral closeness up to 2 * epsilon, but with the benefit of lowering variance.
    """
    rows, cols = adj_mat.nonzero()
    weights = np.ravel(np.asarray(adj_mat[rows, cols]))
    probs = weights * np.ravel(np.asarray(R))   # Sampling probability of an edge proportional to w_e * R_e.
    probs /= np.sum(probs)
    if ultrasparsifier:
        # Mix in a term inversely proportional to the smaller endpoint degree of each edge.
        degrees = np.asarray(adj_mat.sum(1)).ravel()
        mins = 1 / np.minimum(degrees[rows], degrees[cols]) / adj_mat.shape[0]
        probs = (probs + mins) / 2
    probs /= np.sum(probs)
    return np.ravel(probs)

Effective resistances and the graph Laplacian

For any vertex $v$, let $e_{v} \in \mathbb{R}^{n}$ be the vector with a 1 in position $v$ and zero elsewhere: the $v^{th}$ vertex basis vector. For any edge $e = (u, v)$, let $\chi_{e} := e_{u} - e_{v}$: the $e^{th}$ edge basis vector. Note that $\chi_{e}^{\top} \chi_{f} = 0$ whenever edges $e$ and $f$ share no vertex.

The graph Laplacian $L$ is a matrix $L = D - W$, where $D = diag(\textbf{1}^{\top} W)$. It can be written in terms of the edges: $L = \sum_{e \in E} w_{e} \chi_{e} \chi_{e}^\top$.

It turns out that using the electrical circuit theory interpretation of effective resistance (Kirchhoff's current law and Ohm's law), the effective resistance can be written in terms of the pseudoinverse $L^{+}$ of the Laplacian: for any pair of vertices $e = (u, v)$,

$$ R_{u v} = \chi_{e}^{\top} L^{+} \chi_{e} $$

def compute_exact_resistances(adj_mat):
    # Compute the effective resistance across every edge of the graph,
    # via the pseudoinverse of the graph Laplacian.
    if scipy.sparse.issparse(adj_mat):
        L = sp.csgraph.laplacian(adj_mat).tocsr()
        L_inv = np.linalg.pinv(L.todense())
    else:
        L = sp.csgraph.laplacian(adj_mat)
        L_inv = np.linalg.pinv(L)
    rows, cols = adj_mat.nonzero()
    R = []
    for i, j in zip(rows, cols):
        # R_ij = chi^T L^+ chi = L^+[i,i] + L^+[j,j] - 2 L^+[i,j]
        R.append(L_inv[i, i] + L_inv[j, j] - L_inv[j, i] - L_inv[i, j])
    return rows, cols, np.array(R)

The sparsifier preserves graph cuts

Putting this all together, we get a graph sparsifier.

This sparsifier is special -- it preserves all graph cuts.

More formally, for any set of vertices $T \subseteq V$, let $p_G (T)$ and $p_H (T)$ be the total weight of edges crossing the cut between $T$ and its complement in graph $G$ and in its sparsifier $H$, respectively. Then for any $\epsilon \in (0,1)$, $$ (1 - \epsilon) p_G (T) \leq p_H (T) \leq (1 + \epsilon) p_G (T) $$ if we draw $H$ to have $O \left(\frac{n \log n }{ \epsilon^2 } \right)$ edges.

$\textbf{Details}$ The Laplacian $L$ completely characterizes the connectivity of the graph $G$, because the cut weight of any subset of vertices $T \subseteq V$ -- which we call $p_{G} (T)$ -- can be written in terms of $L$. If $C(T)$ is the cut-set of $T$, i.e. the set of edges with exactly one endpoint in $T$, then since $L = \sum_{e \in E} w_{e} \chi_{e} \chi_{e}^\top$, we can write: $$p_{G} (T) = \sum_{e \in C(T)} w_{e} = \sum_{e \in E} w_{e} (\textbf{1}_{T}^{\top} \chi_{e})^2 = \textbf{1}_{T}^{\top} L \textbf{1}_{T}$$ As a linear operator, the Laplacian therefore characterizes all graph cuts. The sparsifier we have constructed approximates the original graph very strictly -- their Laplacian matrices are spectrally close, i.e. close as linear operators. In detail, [Spielman and Srivastava 2011] prove that the sparsifier's Laplacian $L_{H}$ satisfies $(1 - \epsilon) x^{\top} L_{G} x \leq x^{\top} L_{H} x \leq (1 + \epsilon) x^{\top} L_{G} x$ for all $x$. Substituting $x = \textbf{1}_{T}$ gives $p_{G} (T) = x^{\top} L_{G} x$ and $p_{H} (T) = x^{\top} L_{H} x$, yielding the result.
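
As a quick numerical sanity check (a minimal sketch using the functions defined above; the synthetic graph, random subset, and edge budget are arbitrary choices for illustration), we can draw a sparsifier of a small dense random graph and compare the cut weight $\textbf{1}_{T}^{\top} L \textbf{1}_{T}$ of a random subset $T$ in both graphs:

# Illustrative check, not part of the main pipeline: cut weights before and after sparsification.
np.random.seed(0)
n_toy = 200
A_toy = np.random.rand(n_toy, n_toy)
A_toy = (A_toy + A_toy.T) / 2                       # symmetric weighted graph
np.fill_diagonal(A_toy, 0)

rows_toy, cols_toy, R_toy = compute_exact_resistances(A_toy)
H_toy = sample_sparsifier(A_toy, num_edges=4000, R=R_toy)
H_toy = (H_toy + H_toy.T) / 2                       # symmetrize: each sample lands on one orientation

L_G = sp.csgraph.laplacian(sp.csr_matrix(A_toy)).tocsr()
L_H = sp.csgraph.laplacian(H_toy).tocsr()

x = (np.random.rand(n_toy) < 0.5).astype(float)     # indicator vector of a random vertex subset T
print('cut weight in G:', x @ (L_G @ x))            # the two numbers should be close
print('cut weight in H:', x @ (L_H @ x))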

Experiments highlight scalability issues

Existing implementations do compute $R_{uv}$ using the above method. Unfortunately, computing the pseudoinverse $L^{+}$ exactly is very slow, as it requires solving a linear system of equations in $n$ variables. We can see this by timing the graph processing package pygsp -- a state-of-the-art package for such graph algorithms, which computes the resistances exactly -- on a range of networks.


Loading the data

I'll load a network of human genes to demonstrate this method. The network, from this recent paper, is constructed so that nearby genes have similar functions.

#hide-output
amat = np.load('GLS_p.npy')
gene_names = pd.read_csv('genes.txt', header=None)
coess_adj = -np.log(amat)
coess_adj[np.isinf(coess_adj)] = 0

adj_mat = coess_adj

Trying it out

Running the pygsp version of graph sparsification, we run into scalability issues in directly using $R_{uv}$. At time of writing, it takes ~30 seconds on my laptop to process a 2000-node subgraph of the initial adjacency matrix, and ~6 hours to process the entire matrix. The runtime scales poorly as the number of nodes increases.

from pygsp import reduction, graphs

times_list = []
kn_list = [100, 160, 260, 470, 670, 1070, 1600, 2500, 4000]

for kn in kn_list:
    # Build a pygsp graph from a kn-node subgraph of the adjacency matrix, and time its sparsification.
    G = graphs.Graph(adj_mat[:kn, :kn])
    itime = time.time()
    Gs = reduction.graph_sparsify(G, epsilon=0.99)
    times_list.append(time.time() - itime)

plt.plot(kn_list, times_list)
plt.xlabel('Number of nodes')
plt.ylabel('Time (s)')
plt.xscale('log')
plt.yscale('log')
plt.show()

We can check that our exact-resistance implementation above behaves similarly, with comparable runtime.

Conclusions

This leads to an inescapable conclusion:

Exact computation is a non-starter for large graphs.

Can we get away with approximately computing $L^{+}$? There are a couple of major hurdles to doing this efficiently:

  • The computation is prone to numerical instability on individual edges, and it's difficult to guarantee that all the edges are "uniformly" well approximated.
  • $L$ is often sparse (e.g. for kNN graphs), but $L^{+}$ is dense; how can we do this in less than $O( n^2 )$ space?

The best approximation, in a sense, is an elegant algorithm by [Spielman and Srivastava 2011] that approximates the resistances across all edges of the graph simultaneously. This beautiful method warrants a bit more detail.

num_edges = 20000
kn = 2000
am = adj_mat[:kn, :kn]
itime = time.time()
rows, cols, R = compute_exact_resistances(am)
print(time.time() - itime)
#compute_sampling_probs(adj_mat, R, ultrasparsifier=False)

sparsified_graph = sample_sparsifier(am, num_edges, R, ultrasparsifier=True)
print(time.time() - itime)
6.543221950531006
14.336249828338623

Approximating resistances over edges with random projections

So let's tackle the problem of approximately computing resistances over edges. To understand and implement the algorithm for doing this, we'll need to rewrite the Laplacian as $L = B^\top S^2 B$, where $$ B_{e, v} = \begin{cases} 1,\;\;\;\text{ if e has destination vertex $v$} \\ -1,\;\;\;\text{ if e has source vertex $v$} \\ 0,\;\;\;\text{ if e does not hit vertex $v$} \end{cases} \qquad , \qquad S_{e_1, e_2} = \begin{cases} \sqrt{w_{e_1}},\;\;\text{ if } e_1 = e_2 \\ 0,\;\;\text{ otherwise} \end{cases} $$ (When the graph is undirected, we orient it arbitrarily to apply the "destination" and "source" designations of each edge.)

A little linear algebra ([Spielman and Srivastava 2011]) shows that the resistance $R_{u v}$ between any two vertices $u$ and $v$ is the squared Euclidean distance between the $u^{th}$ and $v^{th}$ columns of the $|E| \times n$ matrix $$ M := S B L^{+} $$
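
Before using this fact at scale, here is a tiny standalone check (the 4-node weighted graph and its edge orientation are arbitrary choices for illustration, not part of the pipeline): the squared distances between columns of $M = S B L^{+}$ agree with $\chi_{e}^{\top} L^{+} \chi_{e}$ for every edge.

# Illustrative check on a toy 4-node weighted graph: column distances of M = S B L^+ equal resistances.
import numpy as np
from scipy.sparse.csgraph import laplacian

A_toy = np.array([[0., 1., 2., 0.],
                  [1., 0., 0., 3.],
                  [2., 0., 0., 1.],
                  [0., 3., 1., 0.]])
edges_toy = [(0, 1), (0, 2), (1, 3), (2, 3)]        # one arbitrary orientation per edge
w_toy = np.array([A_toy[u, v] for u, v in edges_toy])

B_toy = np.zeros((len(edges_toy), A_toy.shape[0]))
for e, (u, v) in enumerate(edges_toy):
    B_toy[e, u], B_toy[e, v] = 1., -1.              # signed edge-incidence matrix
S_toy = np.diag(np.sqrt(w_toy))

L_pinv = np.linalg.pinv(laplacian(A_toy))
M_toy = S_toy @ B_toy @ L_pinv                      # |E| x n

for u, v in edges_toy:
    chi = np.zeros(A_toy.shape[0]); chi[u], chi[v] = 1., -1.
    print(np.linalg.norm(M_toy[:, u] - M_toy[:, v]) ** 2, chi @ L_pinv @ chi)   # should agree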

Scaling up the implementation with random projections

Here we run into an implementation hurdle:

Computing and storing $M$ explicitly is not scalable, as each vertex is represented by $|E|$ coordinates, a number far larger than the number of vertices $n$.

The core of our implementation, following [Spielman and Srivastava 2011], is in overcoming this hurdle. We efficiently approximate the matrix $M$ by representing its $n$ columns in a reduced $\log (n)$-dimensional space, such that their pairwise distances are preserved. This is done using a random projection.

A random projection is a linear operator $Q \in \mathbb{R}^{k \times |E|}$ that reduces the $|E|$-dimensional data to a smaller number of dimensions $k$. The reduced data can be stacked together as the columns of a $k \times n$ matrix $Z$: $$ Z = Q M = Q S B L^{+} $$ Resistances can be calculated from $Z$ for any edge by reading off the relevant columns; the resistance is the squared Euclidean distance $\vert \vert \cdot \vert \vert_2^2$ between these columns.

def compute_effective_resistances(Z, edges):
    """
    Parameters:
    ----------
    Z : (k x n) matrix from which to efficiently compute approximate effective resistances.
    edges : A tuple of two lists: the row and column indices of the respective edges.
    
    Returns:
    --------
    R : Effective resistances for each edge.
    """
    rows, cols = edges
    R = []
    for i, j in zip(rows, cols):
        R.append(np.linalg.norm(Z[:, i] - Z[:, j]) ** 2)
    return np.array(R)

For this use, we want a random projection matrix $Q$ that preserves pairwise distances between $n$ points. This happens when the entries of $Q$ are suitably scaled i.i.d. random variables ([Arriaga and Vempala 2006]), and we will use two such special cases below: Gaussian and sparse random projections ([Li et al. 2006]).

How many random projection dimensions $k$ to construct, as the number of data $n$ goes up? Incredibly, [Spielman and Srivastava 2011] shows that $k \sim \log(n)$ dimensions suffice. This relates to a major benefit of using random projections: they provide a seamless way to trade off between computation and accuracy (i.e., choosing $k$), and require very little computation to get a good result 2.
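
To get a feel for why such a small $k$ can work, here is a tiny illustration (the dimensions and synthetic data are arbitrary choices, unrelated to the graph pipeline): projecting random high-dimensional points down with sklearn's GaussianRandomProjection approximately preserves their pairwise distances.

# Illustrative only: random projections approximately preserve pairwise distances.
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(0)
X_demo = rng.randn(100, 5000)                 # 100 points in a 5000-dimensional space
proj = GaussianRandomProjection(n_components=64, random_state=0)
X_low = proj.fit_transform(X_demo)            # the same 100 points in 64 dimensions

d_high = np.linalg.norm(X_demo[0] - X_demo[1])
d_low = np.linalg.norm(X_low[0] - X_low[1])
print(d_high, d_low, d_low / d_high)          # the ratio should be close to 1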

Computing $QSB$ efficiently

This is all very well, except that computing the matrices $Q, S, B$ explicitly involves at the very least looping over $|E|$ edges, which again does not scale. However, these matrices only figure into the calculation through the $k \times n$ matrix $P := Q S B$, which we can calculate by alternative means.

In essence, we calculate $P$ from $A$ by computing each of the $k$ coordinates in turn for every node. Coordinate $i$ of a node is a sum of contributions over the edges around that node, each weighted by its orientation and by $\sqrt{\cdot}$ of its edge weight. This can be computed for all nodes at once in vectorized form.

In more detail, let's implement these steps.

A) Preprocess the adjacency matrix

  • Take the elementwise $\sqrt{\cdot}$ of $A$.
  • Orient the graph $A$ by blocking out the lower triangle of the matrix with zeroes.
  • Arbitrarily assign some of the remaining edges to be negative.
  • Finally, anti-symmetrize the matrix again to re-fill the lower triangle, forming a new graph $A_2$.

It's easy to add a little code at the appropriate commented-out place to arbitrarily change the orientation of some edges by making them negative; it's a good sanity check to see that this does not affect the results.

def preprocess_adjmat(adj_mat):
    # Take the elementwise square root of the weights and keep one orientation per edge.
    if not scipy.sparse.issparse(adj_mat):
        adj2 = np.triu(np.sqrt(adj_mat))
    else:
        adj2 = adj_mat.copy()         # avoid mutating the input in place
        adj2.data = np.sqrt(adj2.data)
        adj2 = scipy.sparse.triu(adj2)

    # Arbitrarily assign some edges to be negative. 
    pass

    # Anti-symmetrize the matrix again to re-fill the lower triangle with opposite signs.
    adj2 = adj2 - adj2.T
    return adj2

B) Compute the $k \times n \times n$ random map $T$, which should be zero outside edges

We'll focus on cases where $Q$ is composed of i.i.d. entries. In these cases, $T$ is accordingly i.i.d. distributed on entries corresponding to edges, and zero elsewhere.

import sklearn.random_projection

def draw_random_featmap(
    adj_mat, 
    num_components = 40, 
    projection_mode='sparse', 
    verbose=False
):
    if not scipy.sparse.issparse(adj_mat):
        num_edges = np.prod(adj_mat.shape)
    else:
        num_edges = len(adj_mat.data)
    # A dummy dataset with one feature per edge; fit() only needs its shape.
    dummy_data = scipy.sparse.csr_matrix((adj_mat.shape[0], num_edges))
    
    if (num_components * num_edges) > 1e6:
        projection_mode = 'sparse'   # Otherwise memory can't handle it
    
    itime = time.time()
    if projection_mode == 'sparse':
        form = sklearn.random_projection.SparseRandomProjection(n_components=num_components, random_state=42)
    else:
        form = sklearn.random_projection.GaussianRandomProjection(n_components=num_components, random_state=42)
    form.fit(dummy_data)
    if verbose:
        print(time.time() - itime)

    embmat = form.components_
    list_tensor = []
    if scipy.sparse.issparse(adj_mat):
        nnzrows, nnzcols = adj_mat.nonzero()
    for component_index in range(num_components):
        # Scatter this component's per-edge entries into an n x n map, zero outside edges.
        row = embmat[component_index, :]
        newdata = np.ravel(row.toarray()) if scipy.sparse.issparse(row) else np.ravel(row)
        if scipy.sparse.issparse(adj_mat):
            random_map = scipy.sparse.csr_matrix((newdata, (nnzrows, nnzcols)), shape=adj_mat.shape)
        else:
            random_map = np.reshape(newdata, adj_mat.shape)
        list_tensor.append(random_map)
    return list_tensor

C) Compute each coordinate in the random map

We do this in a vectorized way for all nodes in the graph, for each coordinate $i = 1, \dots, k$: we elementwise multiply the corresponding feature map (slice of $T$) by $A_2$, and then take the row sums of that matrix to get a vector $v_i \in \mathbb{R}^{n}$. Stacking the $v_i$ as rows gives the $k \times n$ matrix $P := Q S B$.

def compute_reduced_vertices(list_tensor_random, adj_mat):
    # Row i of the returned (k x n) matrix holds the i-th random coordinate of every vertex.
    adj2 = preprocess_adjmat(adj_mat)
    if scipy.sparse.issparse(adj2):
        list_apply_tensor = [np.asarray(x.multiply(adj2).sum(axis=1)).ravel() for x in list_tensor_random]
    else:
        list_apply_tensor = [np.asarray(np.multiply(x, adj2).sum(axis=1)).ravel() for x in list_tensor_random]
    return np.array(list_apply_tensor)

D) Putting it together to compute $QSB$

Combining steps A-C gives us the $k \times n$ matrix $P := Q S B$: we draw the random feature maps from step B, contract each one against the preprocessed adjacency matrix $A_2$ from steps A and C, and stack the results. A minimal sketch of this combination is shown below.

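Here is a minimal sketch (my own combination of the pieces above, not verbatim from the original routine) of how the helpers from steps A-C combine into a single function returning $P = Q S B$; the function name `compute_QSB` and the parameter choices are assumptions for illustration.

def compute_QSB(adj_mat, num_components=40):
    # Draw the k random feature maps (step B), then contract each against the signed,
    # square-rooted adjacency matrix (steps A and C) to get the (k x n) matrix P = QSB.
    list_tensor = draw_random_featmap(adj_mat, num_components=num_components)
    return compute_reduced_vertices(list_tensor, adj_mat)

# Example usage on a small submatrix of the gene network loaded above.
P = compute_QSB(sp.csr_matrix(adj_mat[:500, :500]), num_components=20)
print(P.shape)   # (20, 500)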

Matrix inversion as gradient descent

There are huge time and space savings by using $Z$ instead of $R$, because this choice represents each of the $n$ points by $k \sim \log (n)$ coordinates instead of $|E|$ coordinates. However, the time to naively compute $Z$ is dominated by the $O (n^{3})$ cost of computing $L^{+}$, so we need to compute $Z$ without explicitly computing $L^{+}$.

We do this by writing $$ L Z^{\top} \approx B^{\top} S Q^{\top} $$ and finding $Z = \arg \min_{Z} \vert\vert L Z^{\top} - B^{\top} S Q^{\top} \vert\vert_{2}^{2} $ as the least-squares linear solution to this system.

So we can efficiently find $Z$ by gradient descent on the objective function $$ f(Z) := \vert\vert L Z^{\top} - B^{\top} S Q^{\top} \vert\vert_{2}^{2} = \vert\vert Z L^{\top} - Q S B \vert\vert_{2}^{2} = \vert\vert Z L - Q S B \vert\vert_{2}^{2} $$ whose gradient is $\nabla f(Z) = 2 (Z L - Q S B) L$ (using the symmetry of $L$), so each descent step adds a multiple of $(Q S B - Z L) L$ to $Z$ -- exactly the update in the code below.

def calculate_reduced_resistances(
    adj_mat, 
    epsilon=1e-1, eta=1e-3, max_iters=1000, convergence_after = 10,
    tolerance=1e-2, log_every=10, compute_exact_loss=False
):
    """ 
    Computes the Z matrix using gradient descent.
    
    Parameters:
    -----------
    adj_mat : sp.csr_matrix
        The adjacency matrix of the graph.
    epsilon : float
        Tolerance for deviations w.r.t. spectral norm of the sparsifier. Smaller epsilon leads to a higher
        dimensionality of the Z matrix.
    eta : float
        Step size for the gradient descent.
    max_iters : int
        Maximum number of iterations.
    convergence_after : int
        If the loss did not decrease significantly for this number of iterations, the gradient descent will abort.
    tolerance : float
        The minimum amount of energy decrease that is expected per iteration. If for a certain number of iterations
        no overall energy decrease is detected, the gradient descent will abort.
    log_every : int
        Log the loss after each log_every iterations.
    compute_exact_loss : bool
        Only for debugging. If set, it computes the actual pseudoinverse without down-projection and checks if
        the pairwise distances in Z's columns match it with respect to the Frobenius norm.
        
    Returns:
    --------
    Z : ndarray, shape [k, N]
        Matrix from which to efficiently compute approximate resistances.
    """
    # Compute L, S, B
    L = sp.csr_matrix(sp.csgraph.laplacian(adj_mat))
    rows, cols = adj_mat.nonzero()
    weights = np.ravel(np.sqrt(np.asarray(adj_mat[rows, cols])))
    S = sp.diags(weights, 0)
    # Construct signed edge incidence matrix
    num_vertices = adj_mat.shape[0]
    num_edges = S.shape[0]
    assert(num_edges == len(rows) and num_edges == len(cols))
    B = sp.coo_matrix((
        ([1] * num_edges) + ([-1] * num_edges),
        (list(range(num_edges)) * 2, list(rows) + list(cols))
    ), shape=[num_edges, num_vertices])

    k = int(np.ceil(np.log(B.shape[1] / epsilon**2)))
    # Draw the random feature maps and contract them against the graph to get Y_red = QSB (k x n).
    list_tensor = draw_random_featmap(adj_mat, num_components=k)
    Y_red = compute_reduced_vertices(list_tensor, adj_mat)

    if compute_exact_loss:
        # Debugging only: ground-truth Gram matrix of the columns of M = S B L^+
        # (pairwise distances are derived from it).
        M = S.dot(B).dot(np.linalg.pinv(L.todense()))
        pairwise_dist_gnd = M.T.dot(M)
    
    # Use gradient descent to solve for Z
    Z = np.random.randn(k, L.shape[1])
    best_loss = np.inf
    best_iter = np.inf
    for it in range(max_iters):
        eta_t = eta * (1.0 / np.sqrt(it + 10))   # Decaying learning rate
        residual = Y_red - L.dot(Z.T).T          # Q S B - Z L  (L is symmetric)
        loss = np.linalg.norm(residual)
        if it % log_every == 0: 
            print(f'Loss before iteration {it}: {loss}')
            if compute_exact_loss:
                pairwise_dist = Z.T.dot(Z)
                exact_loss = np.linalg.norm(pairwise_dist - pairwise_dist_gnd)
                print(f'Loss w.r.t. exact pairwise distances {exact_loss}')
        if loss + tolerance < best_loss:
            best_loss = loss
            best_iter = it
        elif it > best_iter + convergence_after:
            # No improvement for convergence_after iterations
            print(f'Convergence after {it - 1} iterations.')
            break
        Z += eta_t * L.dot(residual.T).T
    return Z
    

Putting it all together to sparsify graphs

All this results in a streamlined procedure for sparsifying a graph to have any desired number of edges num_edges. To summarize, given a graph $G$, we:

  1. Calculate reduced-dimension representation $Z$ of the graph $G$.
  2. Compute effective resistances over edges in $G$.
  3. Run the weighted edge bootstrap to generate the sparsified graph $H$, sampling edges with probability proportional to $w_e R_e$ (edge weight times effective resistance).

def spectral_sparsify(
    adj_mat, q=None, epsilon=1e-1, eta=1e-3, max_iters=1000, convergence_after=100,
    tolerance=1e-2, log_every=10, prevent_vertex_blow_up=False
):
    """ Computes a spectral sparsifier of the graph given by an adjacency matrix.
    
    Parameters:
    ----------
    adj_mat : sp.csr_matrix, shape [N, N]
        The adjacency matrix of the graph.
    q : int or None
        The number of samples for the sparsifier. If None, q will be set to N * log(N) / (epsilon ** 2)
    epsilon : float
        The desired spectral similarity of the sparsifier.
    eta : float
        Step size for the gradient descent when computing resistances.
    max_iters : int
        Maximum number of iterations when computing resistances.
    convergence_after : int
        If the loss did not decrease significantly for this amount of iterations, the gradient descent will abort.
    tolerance : float
        The minimum amount of energy decrease that is expected for iterations. If for a certain number of iterations
        no overall energy decrease is detected, the gradient descent will abort.
    prevent_vertex_blow_up : bool
        If the probabilities will be tweaked in order to ensure that the vertices are not blown up too much. 
        Note that this will only guarantee a spectral closeness up to a factor of 2 * epsilon.
    
    Returns:
    --------
    H : sp.csr_matrix, shape [N, N]
        Sparsified graph with at most q edges.
    """
    if q is None:
        q = int(np.ceil(adj_mat.shape[0] * np.log(adj_mat.shape[0]) / (epsilon ** 2)))   # or 6n*ln(n) , by https://www.cs.ubc.ca/~nickhar/Cargese3.pdf
    edges = adj_mat.nonzero()
    Z = calculate_reduced_resistances(adj_mat, epsilon=epsilon, max_iters=max_iters, convergence_after=convergence_after, eta=eta, tolerance=tolerance, log_every=log_every, compute_exact_loss=False)
    R = compute_effective_resistances(Z, edges)
    return sample_sparsifier(adj_mat, q, R, edges, ultrasparsifier=prevent_vertex_blow_up)
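
As a minimal usage sketch (the subgraph size, edge budget, and optimization parameters here are arbitrary choices for illustration), we can sparsify a 2000-node subgraph of the gene network loaded above:

# Usage sketch: sparsify a 2000-node subgraph down to at most 20,000 sampled edges.
kn = 2000
am = sp.csr_matrix(adj_mat[:kn, :kn])

itime = time.time()
H = spectral_sparsify(am, q=20000, epsilon=1e-1, max_iters=200, log_every=50)
print('elapsed (s):', time.time() - itime)
print('stored entries in G:', am.nnz, '; stored entries in H (at most q):', H.nnz)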

Summary

  • The weighted edge bootstrap algorithm of [Spielman and Srivastava 2011] guarantees a sparsifier with high probability for any undirected weighted graph.
  • An efficient implementation, described in [Spielman and Srivastava 2011], uses random projections to construct a subspace of dimension $\log n$ in which the pairwise distances between the $n$ vertices are approximately preserved.
  • Though simple in concept, this sparsification method is an extremely useful and powerful primitive in designing efficient graph algorithms.

Footnotes

1. The pygsp package deserves special mention for including sparsification in a broader context of algorithms that I will explore in future posts. But its actual implementation is not efficient enough to scale as ours does.

2. By Johnson-Lindenstrauss theory and related results ([Dasgupta and Gupta 2003]), no method can use substantially fewer dimensions to achieve a given reconstruction error guarantee.

References

  1. Spielman, D.A. and Srivastava, N. 2011. Graph sparsification by effective resistances. SIAM Journal on Computing 40, 6, 1913–1926.
  2. Klein, D.J. and Randić, M. 1993. Resistance distance. Journal of mathematical chemistry 12, 1, 81–95.
  3. Chandra, A.K., Raghavan, P., Ruzzo, W.L., Smolensky, R., and Tiwari, P. 1996. The electrical resistance of a graph captures its commute and cover times. computational complexity 6, 4, 312–340.
  4. Arriaga, R.I. and Vempala, S. 2006. An algorithmic theory of learning: Robust concepts and random projection. Machine learning 63, 2, 161–182.
  5. Dasgupta, S. and Gupta, A. 2003. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms 22, 1, 60–65.
  6. Li, P., Hastie, T.J., and Church, K.W. 2006. Very sparse random projections. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 287–296.