Graph sparsification
Removing edges while maintaining connectivity
- Introduction: thinning out edges in a graph
- Edge sampling weights using effective resistances
- Experiments highlight scalability issues
- Approximating resistances over edges with random projections
- Putting it all together to sparsify graphs
- Footnotes
- References
Introduction: thinning out edges in a graph
Approximating a given graph by a graph with fewer edges is called sparsification. It turns out that every weighted undirected graph can be efficiently sparsified into a graph that is "close" to it on all scales, fine and coarse. This is done by a remarkable algorithm, whose principles are so simple and elegant that I couldn't resist trying it myself on some gene network data.
Since I found a scarcity of well-documented modular Python implementations of this algorithm 1, I implemented a version in this notebook that directly uses the ideas of the seminal work that introduced the algorithm ([Spielman and Srivastava 2011]). The version we'll discuss here scales easily on a laptop CPU to graphs with tens of thousands of vertices, and tens of millions of edges.
Let's get started.
What do we get from sparsification?
A sparsification method operates on an undirected weighted graph $G$. It creates a sparsifier $H$ -- a graph with the same vertices as $G$ but many fewer edges -- which is similar to $G$ in terms of connectivity. There are some clear benefits to this:
- Dense graphs are produced by many routines; kernel and similarity matrices are dense by nature, and sparsification enables us to store and manipulate them efficiently.
- The sparse graphs produced by the method of [Spielman and Srivastava 2011] that we implement can be produced at any desired level of sparsity, which translates proportionally into gains in storage and calculation efficiency.
- As an intermediate computation, we implement an efficient and principled way to locally characterize the connectivity of the graph.
Sparsification preserves cuts
Remarkably, the sparsification method preserves graph cuts -- that is, any subset of vertices has approximately the same weight of edges crossing its boundary in $H$ as in $G$. The method guarantees this for every subset of vertices simultaneously. This is a powerful property -- it ensures the topology of the graph is preserved after sparsification, at coarse and fine scales.
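To make this property concrete, here is a minimal sketch (my own, not from the original pipeline) of how one could empirically check cut preservation once a sparsifier is in hand. It assumes dense NumPy adjacency matrices A_G (original) and A_H (sparsifier) over the same vertex set; both names are hypothetical placeholders.
import numpy as np

def cut_weight(A, in_T):
    """Total weight of edges with exactly one endpoint in the vertex set T (boolean mask in_T)."""
    return A[in_T][:, ~in_T].sum()

def max_cut_distortion(A_G, A_H, num_trials=200, seed=0):
    """Worst relative deviation |p_H(T) - p_G(T)| / p_G(T) over random vertex subsets T."""
    rng = np.random.default_rng(seed)
    n = A_G.shape[0]
    worst = 0.0
    for _ in range(num_trials):
        in_T = rng.random(n) < 0.5
        p_G, p_H = cut_weight(A_G, in_T), cut_weight(A_H, in_T)
        if p_G > 0:
            worst = max(worst, abs(p_H - p_G) / p_G)
    return worst  # Should be at most roughly epsilon for a good sparsifier.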
Sparsifying a graph by resampling edges
The core algorithm here is easy to describe. Given $G$, it produces a sparsified graph with num_edges edges, constructed using a simple procedure:
- Calculate weights over the edges of $G$.
- Sample num_edges edges with replacement from the distribution induced by the weights.
We can write this as a function, with the weight calculation abstracted out. The weight calculation requires a precomputed quantity for each edge, called the effective resistance, which we will discuss next. For now, it is included in the API of our function.
import sys, numpy as np, pandas as pd, time, scipy, scipy.linalg
import scipy.sparse as sp
import matplotlib.pyplot as plt
import anndata, scanpy as sc
%matplotlib inline
print('Packages imported.')
def sample_sparsifier(adj_mat, num_edges, R, ultrasparsifier=False):
    """
    Params:
    ------
    adj_mat : sp.csr_matrix
        The adjacency matrix of the graph.
    num_edges : int
        The number of samples for the sparsifier.
    R : array-like
        Effective resistances (possibly approximate), one per nonzero entry of adj_mat.
    ultrasparsifier : bool
        Passed through to compute_sampling_probs.
    Returns:
    ------
    H : sp.csr_matrix
        The adjacency matrix of the sparsified graph with at most num_edges edges.
    """
    rows, cols = adj_mat.nonzero()
    weights = np.ravel(np.asarray(adj_mat[rows, cols]))  # Original edge weights.
    probs = compute_sampling_probs(adj_mat, R, ultrasparsifier=ultrasparsifier)
    sampled = np.random.choice(len(probs), num_edges, p=probs, replace=True)
    H = sp.lil_matrix(adj_mat.shape)
    for idx in sampled:
        # Each sampled edge contributes its weight, rescaled by 1 / (num_edges * probability).
        H[rows[idx], cols[idx]] += weights[idx] / num_edges / probs[idx]
    return H.tocsr()
Edge sampling weights using effective resistances
How do we compute edge weights to guide this sparsification-by-sampling procedure? We want to put higher weight on edges that are bottlenecks to the flow of random walks on the graph, as such edges are more essential to preserving the overall topology of the graph.
One intuitive way of measuring this property between any pair of vertices $a$ and $b$ is the commute time $C_{ab}$ - the expected number of steps that a random walker would take to diffuse from $a$ to $b$ and back, which is higher around bottlenecks. It turns out that if the commute times are used as edge sampling weights, the resulting sparsifier preserves the graph topology beautifully! This is what we'll implement.
Effective resistance
We'll find it more convenient to compute a quantity called the effective resistance $R_{ab}$ across each edge $(a, b)$ [Klein and Randić 1993]. This is equivalent to computing the commute time $C_{ab}$, as $R_{ab}$ is proportional to $C_{ab}$ in any given graph (see [Chandra et al. 1996]).
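Concretely, the relationship is $C_{ab} = 2 W \, R_{ab}$, where $W = \sum_{e \in E} w_{e}$ is the total edge weight of the graph, so the two quantities differ only by a graph-wide constant.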
$R_{ab}$ can be understood in other equivalent ways:
- $R_{ab}$ is proportional to the probability the edge appears in a random spanning tree of the graph (for unweighted graphs). Bottleneck edges that are essential to the topology of the graph will be in the most spanning trees, and will have highest weight.
- $R_{ab}$ is the effective electrical resistance between two nodes in an electrical network laid out according to the graph, with a resistor placed on each edge according to the weights. It is a vast generalization to arbitrary networks of the well-known formulas for series and parallel resistances.
- As such, the effective resistance is a natural graph-based measure of distance that accounts for the multiplicity of paths between any two nodes.
def compute_sampling_probs(adj_mat, R, ultrasparsifier=False):
    """
    R : array of effective resistances, one per nonzero entry of adj_mat.
    ultrasparsifier : Change probabilities so that no edge gets scaled up too much.
        Only guarantees spectral closeness up to 2 * epsilon, but with the benefit of lowering variance.
    """
    rows, cols = adj_mat.nonzero()
    weights = np.ravel(np.asarray(adj_mat[rows, cols]))
    probs = weights * np.asarray(R)
    probs /= np.sum(probs)
    if ultrasparsifier:
        # Mix in a term inversely proportional to the smaller endpoint degree,
        # so that low-degree vertices are not starved of sampled edges.
        degrees = np.ravel(np.asarray(adj_mat.sum(1)))
        mins = 1.0 / np.minimum(degrees[rows], degrees[cols]) / adj_mat.shape[0]
        probs = (probs + mins) / 2
        probs /= np.sum(probs)
    return np.ravel(probs)
Effective resistances and the graph Laplacian
For any vertex $v$, let $e_{v} \in \mathbb{R}^{n}$ be the vector with a 1 in position $v$ and zero elsewhere: the $v^{th}$ vertex basis vector. For any edge $e = (u, v)$, let $\chi_{e} := e_{u} - e_{v}$: the $e^{th}$ edge basis vector. Note that for distinct edges $e \neq f$, $\chi_{e}^{\top} \chi_{f} = 0$.
The graph Laplacian $L$ is a matrix $L = D - W$, where $D = diag(\textbf{1}^{\top} W)$. It can be written in terms of the edges: $L = \sum_{e \in E} w_{e} \chi_{e} \chi_{e}^\top$.
It turns out that using the electrical circuit theory interpretation of effective resistance (Kirchhoff's current law and Ohm's law), the effective resistance can be written in terms of the pseudoinverse $L^{+}$ of the Laplacian: for any pair of vertices $e = (u, v)$,
$$ R_{u v} = \chi_{e}^{\top} L^{+} \chi_{e} $$
def compute_exact_resistances(adj_mat):
    # Compute the effective resistance across each edge (nonzero entry) of the graph.
if scipy.sparse.issparse(adj_mat):
L = sp.csgraph.laplacian(adj_mat).tocsr()
L_inv = np.linalg.pinv(L.todense())
else:
L = sp.csgraph.laplacian(adj_mat)
L_inv = np.linalg.pinv(L)
rows, cols = adj_mat.nonzero()
R = []
for i, j in zip(rows, cols):
R.append(L_inv[i, i] + L_inv[j, j] - L_inv[j, i] - L_inv[i, j])
return (rows, cols, R)
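As a quick sanity check (my own example, not from the original notebook), consider a unit-weight triangle: between any two adjacent vertices, the direct 1-ohm edge sits in parallel with the 2-ohm two-hop path, so every edge should report an effective resistance of $1 \cdot 2 / (1 + 2) = 2/3$.
# Sanity check on a unit-weight triangle graph: every edge is in parallel with a
# two-hop path of resistance 2, so R should equal (1 * 2) / (1 + 2) = 2/3.
A_tri = sp.csr_matrix(np.array([[0., 1., 1.],
                                [1., 0., 1.],
                                [1., 1., 0.]]))
tri_rows, tri_cols, tri_R = compute_exact_resistances(A_tri)
for i, j, r in zip(tri_rows, tri_cols, tri_R):
    print(f'R({i},{j}) = {r:.4f}')  # ~0.6667 for each (directed) nonzero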
This sparsifier is special -- it preserves all graph cuts.
More formally, for any set of vertices $T \subseteq V$, let $p_G (T)$ and $p_H (T)$ be the weight of edges crossing the cut between $T$ and its complement in graph $G$ and its sparsifier $H$, respectively. Then for any $\epsilon \in (0,1)$, $$ (1 - \epsilon) p_G (T) \leq p_H (T) \leq (1 + \epsilon) p_G (T) $$ if we draw $H$ to have $O \left(\frac{n \log n }{ \epsilon^2 } \right)$ edges.
$\textbf{Details}$
The Laplacian $L$ completely characterizes the connectivity of the graph $G$, because the cut weight $p_{G} (T)$ of any subset of vertices $T \subseteq V$ can be written in terms of $L$. If $C(T)$ is the cut-set of $T$, i.e. the set of edges with exactly one endpoint in $T$, then since $L = \sum_{e \in E} w_{e} \chi_{e} \chi_{e}^\top$, we can write: $$p_{G} (T) = \sum_{e \in C(T)} w_{e} = \sum_{e \in E} w_{e} (\textbf{1}_{T}^{\top} \chi_{e})^2 = \textbf{1}_{T}^{\top} L \textbf{1}_{T}$$ As a linear operator, the Laplacian therefore characterizes all graph cuts. The sparsifier we have constructed very strictly approximates the original graph -- their Laplacian matrices are spectrally close, i.e. close as linear operators. In detail, [Spielman and Srivastava 2011] proves that the sparsifier's Laplacian $L_{H}$ satisfies $(1 - \epsilon) x^{\top} L_{G} x \leq x^{\top} L_{H} x \leq (1 + \epsilon) x^{\top} L_{G} x$ for all $x$. Substituting $x = \textbf{1}_{T}$, we get $p_{G} (T) = x^{\top} L_{G} x$ and $p_{H} (T) = x^{\top} L_{H} x$, giving the result.
Experiments highlight scalability issues
Existing implementations typically compute $R_{uv}$ exactly using the formula above.
Unfortunately, computing the pseudoinverse $L^{+}$ exactly is extremely slow, as it requires solving linear systems in $n$ variables.
We can see this by benchmarking against the graph processing package pygsp, which computes resistances exactly and is a state-of-the-art package for such graph algorithms.
Loading the data
I'll load a network of human genes to demonstrate this method. The network, from this recent paper, is constructed so that nearby genes have similar functions.
#hide-output
amat = np.load('GLS_p.npy')
gene_names = pd.read_csv('genes.txt', header=None)
coess_adj = -np.log(amat)
coess_adj[np.isinf(coess_adj)] = 0
adj_mat = coess_adj
Trying it out
Running the pygsp version of graph sparsification, we run into scalability issues from directly using $R_{uv}$.
At the time of writing, it takes ~30 seconds on my laptop to process a 2000-node subgraph of the initial adjacency matrix, and roughly 6 hours to process the entire matrix.
The runtime doesn't scale well at all as the number of nodes increases.
from pygsp import reduction, graphs

times_list = []
kn_list = [100, 160, 260, 470, 670, 1070, 1600, 2500, 4000]
for kn in kn_list:
    # Build a pygsp graph from a kn-node subgraph of the adjacency matrix.
    G = graphs.Graph(adj_mat[:kn, :kn])
    itime = time.time()
    Gs = reduction.graph_sparsify(G, epsilon=0.99)
    times_list.append(time.time() - itime)

plt.plot(kn_list, times_list)
plt.xlabel('Number of nodes')
plt.ylabel('Time (s)')
plt.xscale('log')
plt.yscale('log')
plt.show()
We can check that our exact implementation performs similarly, with a comparable runtime.
Conclusions
This leads to an inescapable conclusion:
Exact computation is a non-starter for large graphs.
Can we get away with approximately computing $L^{+}$? There are a couple of major hurdles to doing this efficiently:
- The computation is plagued by instability problems that can occur over individual edges, and it's difficult to guarantee that all edges are "uniformly" well approximated.
- $L$ is often sparse (e.g. for kNN graphs), but $L^{+}$ is dense; how can we do this in less than $O( n^2 )$ space?
The best approximation, in a sense, is an elegant algorithm by [Spielman and Srivastava 2011] that approximates resistances over all edges of the graph simultaneously. This beautiful method warrants a bit more detail.
num_edges = 20000
kn = 2000
am = adj_mat[:kn, :kn]
itime = time.time()
rows, cols, R = compute_exact_resistances(am)
print(time.time() - itime)
#compute_sampling_probs(adj_mat, R, ultrasparsifier=False)
sparsified_graph = sample_sparsifier(am, num_edges, R, ultrasparsifier=True)
print(time.time() - itime)
Approximating resistances over edges with random projections
So let's tackle the problem of approximately computing resistances over edges. To understand and implement the algorithm for doing this, we'll need to rewrite the Laplacian as $L = B^\top S^2 B$, where $$ B_{e, v} = \begin{cases} 1,\;\;\;\text{ if e has destination vertex $v$} \\ -1,\;\;\;\text{ if e has source vertex $v$} \\ 0,\;\;\;\text{ if e does not hit vertex $v$} \end{cases} \qquad , \qquad S_{e_1, e_2} = \begin{cases} \sqrt{w_{e_1}},\;\;\text{ if } e_1 = e_2 \\ 0,\;\;\text{ otherwise} \end{cases} $$ (When the graph is undirected, we orient it arbitrarily to apply the "destination" and "source" designations of each edge.)
A little linear algebra ([Spielman and Srivastava 2011]) shows that the resistance $R_{u v}$ between any two vertices $u$ and $v$ is the squared Euclidean distance between the $u^{th}$ and $v^{th}$ columns of the $|E| \times n$ matrix $$ M := S B L^{+} $$
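To spell out that bit of linear algebra: using $L = B^{\top} S^{2} B$ and the pseudoinverse identity $L^{+} L L^{+} = L^{+}$, for any pair $e = (u, v)$,
$$ R_{u v} = \chi_{e}^{\top} L^{+} \chi_{e} = \chi_{e}^{\top} L^{+} \left( B^{\top} S^{2} B \right) L^{+} \chi_{e} = \Vert S B L^{+} \chi_{e} \Vert_{2}^{2} = \Vert M e_{u} - M e_{v} \Vert_{2}^{2} $$
so the effective resistance across an edge is exactly the squared distance between the corresponding columns of $M$.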
Scaling up the implementation with random projections
Here we run into an implementation hurdle:
Computing and storing $M$ explicitly is not scalable, as each data point is represented by $|E|$ coordinates, far more than the number of data points $n$.
The core of our implementation, following [Spielman and Srivastava 2011], is in overcoming this hurdle. We efficiently approximate the matrix $M$ by representing its $n$ columns in a reduced $\log (n)$-dimensional space, such that their pairwise distances are preserved. This is done using a random projection.
A random projection is a linear operator $Q \in \mathbb{R}^{k \times |E|}$ that reduces the $|E|$-dimensional data to a smaller number of dimensions $k$. The reduced data can be stacked together as the columns of a $k \times n$ matrix $Z$: $$ Z = Q M = Q S B L^{+} $$ Resistances can be calculated from $Z$ for any edge by reading off the relevant columns; the resistance is the squared Euclidean distance $\vert \vert \cdot \vert \vert_2^2$ between these columns.
def compute_effective_resistances(Z, edges):
"""
Parameters:
----------
Z : (k x n) matrix from which to efficiently compute approximate effective resistances.
edges : A tuple of two lists: the row and column indices of the respective edges.
Returns:
--------
R : Effective resistances for each edge.
"""
rows, cols = edges
R = []
for i, j in zip(rows, cols):
R.append(np.linalg.norm(Z[:, i] - Z[:, j]) ** 2)
return np.array(R)
For this use, we want a random projection matrix $Q$ that preserves pairwise distances between $n$ points. This happens when the entries are drawn i.i.d. from suitable distributions ([Arriaga and Vempala 2006]), and we will explore two such special cases:
- Gaussian entries: These are typically good enough, and are well-known in applied math as the basis for the Johnson-Lindenstrauss lemma [Dasgupta and Gupta 2003].
- {-1,0,1}-distributed entries: These form the basis for sparse random projections, which are much more efficient to compute [Li et al. 2006].
How many random projection dimensions $k$ do we need as the number of data points $n$ grows? Incredibly, [Spielman and Srivastava 2011] shows that $k \sim \log(n)$ dimensions suffice. This relates to a major benefit of using random projections: they provide a seamless way to trade off between computation and accuracy (i.e., the choice of $k$), and require very little computation to get a good result 2.
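As a standalone illustration (using sklearn's random projection module on synthetic data, not part of the sparsification pipeline itself), here is a quick check that both projection types preserve pairwise distances at the dimension suggested by Johnson-Lindenstrauss theory:
import numpy as np
import scipy.sparse
from scipy.spatial.distance import pdist
from sklearn.random_projection import (
    GaussianRandomProjection, SparseRandomProjection, johnson_lindenstrauss_min_dim)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))  # 200 points living in |E| = 5000 dimensions.
k = johnson_lindenstrauss_min_dim(n_samples=200, eps=0.3)
for proj in (GaussianRandomProjection(n_components=k, random_state=0),
             SparseRandomProjection(n_components=k, random_state=0)):
    Xp = proj.fit_transform(X)
    Xp = np.asarray(Xp.todense()) if scipy.sparse.issparse(Xp) else np.asarray(Xp)
    ratios = pdist(Xp) / pdist(X)
    print(type(proj).__name__, round(ratios.min(), 3), round(ratios.max(), 3))  # Both close to 1.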
Computing $QSB$ efficiently
This is all very well, except that computing the matrices $Q, S, B$ explicitly involves at the very least looping over $|E|$ edges, which again does not scale. However, these matrices only figure into the calculation through the $k \times n$ matrix $P := Q S B$, which we can calculate by alternative means.
In essence, we calculate $P$ from $A$ by computing each of the $k$ coordinates of every node in turn. We do this by summing the random map's contributions over the edges around the node, weighted according to their orientation and by $\sqrt{\cdot}$ of the edge weights. This can be vectorized over all nodes at once.
In more detail, let's implement these steps.
A) Preprocess the adjacency matrix
- Take the elementwise $\sqrt{\cdot}$ of $A$.
- Orient the graph $A$ by blocking out the lower triangle of the matrix with zeroes.
- Arbitrarily assign some of the remaining edges to be negative.
- Finally, anti-symmetrize the matrix again to re-fill the lower triangle, forming a new graph $A_2$.
It's easy to add a little code here, at the appropriate commented-out place, to arbitrarily flip the orientation of some edges by making them negative; it's a good sanity check to see that this does not affect the results.
def preprocess_adjmat(adj_mat):
    # Elementwise square root of the weights, without mutating the caller's matrix.
    if not scipy.sparse.issparse(adj_mat):
        adj2 = np.triu(np.sqrt(adj_mat))
    else:
        adj2 = adj_mat.copy()
        adj2.data = np.sqrt(adj2.data)
        adj2 = scipy.sparse.triu(adj2)
    # (Optionally, arbitrarily flip the sign of some of these upper-triangle edges here.)
    # Anti-symmetrize the matrix again, re-filling the lower triangle with opposite signs.
    adj2 = adj2 - adj2.T
    return adj2
B) Draw the random feature maps
We draw the $k \times |E|$ random projection matrix $Q$ (sparse or Gaussian entries), and scatter each of its $k$ rows onto the nonzero pattern of the adjacency matrix, giving one random "feature map" per output coordinate.
import sklearn.random_projection
def draw_random_featmap(
    adj_mat,
    num_components=40,
    projection_mode='sparse',
    verbose=False
):
    # One random projection coordinate per nonzero entry of the adjacency matrix.
    if not scipy.sparse.issparse(adj_mat):
        num_edges = np.prod(adj_mat.shape)
    else:
        num_edges = len(adj_mat.data)
    dummy_data = scipy.sparse.csr_matrix((adj_mat.shape[0], num_edges))
    if (num_components * num_edges) > 1e6:
        projection_mode = 'sparse'  # Otherwise memory can't handle it.
    itime = time.time()
    if projection_mode == 'sparse':
        form = sklearn.random_projection.SparseRandomProjection(n_components=num_components, random_state=42)
    else:
        form = sklearn.random_projection.GaussianRandomProjection(n_components=num_components, random_state=42)
    form.fit(dummy_data)  # Only the shape of dummy_data matters here.
    if verbose:
        print(time.time() - itime)
    embmat = form.components_  # (num_components x num_edges) projection matrix Q.
    list_tensor = []
    if scipy.sparse.issparse(adj_mat):
        nnzrows, nnzcols = adj_mat.nonzero()
    for component_index in range(num_components):
        row = embmat[component_index, :]
        newdata = np.ravel(row.toarray()) if scipy.sparse.issparse(row) else np.ravel(row)
        if scipy.sparse.issparse(adj_mat):
            # Scatter this row of Q onto the sparsity pattern of the graph.
            random_map = scipy.sparse.csr_matrix((newdata, (nnzrows, nnzcols)), shape=adj_mat.shape)
        else:
            random_map = np.reshape(newdata, adj_mat.shape)
        list_tensor.append(random_map)
    return list_tensor
C) Compute each coordinate in the random map
We do this in a vectorized way over all nodes in the graph, for each coordinate $i = 1, \dots, k$. We elementwise multiply the corresponding feature map (an element of the list returned by draw_random_featmap) by $A_2$, and then take the row sums of that matrix to get a column vector $v_i \in \mathbb{R}^{n}$.
def compute_reduced_vertices(list_tensor_random, adj_mat):
    # Elementwise-multiply each random feature map by the oriented sqrt-weight matrix
    # and take row sums, giving one n-dimensional coordinate vector per feature map.
    adj2 = preprocess_adjmat(adj_mat)
    list_apply_tensor = [x.multiply(adj2).sum(axis=1) for x in list_tensor_random]
    return list_apply_tensor
D) Putting it together to compute $QSB$
Stacking the $k$ column vectors $v_1, \dots, v_k$ from step C) as rows gives the $k \times n$ matrix $P := Q S B$. A hypothetical end-to-end sketch of steps A)-C) follows.
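Here the subgraph size and the number of components $k$ are arbitrary choices for illustration, not settings from the original analysis.
# Chain steps A)-C): draw k random feature maps and stack the resulting
# per-node row sums into the k x n matrix P = Q S B. Illustrative only.
sub = sp.csr_matrix(adj_mat[:500, :500])
list_tensor = draw_random_featmap(sub, num_components=40)
P = np.asarray(np.hstack(compute_reduced_vertices(list_tensor, sub))).T
print(P.shape)  # (40, number of nodes)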
Matrix inversion as gradient descent
There are huge time and space savings from using $Z$ instead of $M$, because this choice represents each of the $n$ points by $k \sim \log (n)$ coordinates instead of $|E|$ coordinates. However, the time to naively compute $Z$ is dominated by the $O (n^{3})$ cost of computing $L^{+}$, so we need to compute $Z$ without explicitly computing $L^{+}$.
We do this by writing $$ L Z^{\top} \approx B^{\top} S Q^{\top} $$ and finding $Z = \arg \min_{Z} \vert\vert L Z^{\top} - B^{\top} S Q^{\top} \vert\vert_{2}^{2} $ as the least-squares linear solution to this system.
So we can efficiently find $Z$ by gradient descent on the objective function $$ f(Z) := \vert\vert L Z^{\top} - B^{\top} S Q^{\top} \vert\vert_{2}^{2} = \vert\vert Z L^{\top} - Q S B \vert\vert_{2}^{2} = \vert\vert Z L - Q S B \vert\vert_{2}^{2} $$
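For reference (a small derivation of my own, consistent with the update used in the function below): writing $C := Q S B$, the gradient of the objective with respect to $Z$ is
$$ \nabla_{Z} f(Z) = 2 \, (Z L - C) \, L^{\top} = 2 \, (Z L - C) \, L, $$
using the symmetry of $L$. Each gradient descent step is therefore $Z \leftarrow Z + \eta_t \, (C - Z L) \, L$ (with the constant factor absorbed into the step size $\eta_t$), which is exactly what the update line in the code implements.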
def calculate_reduced_resistances(
adj_mat,
epsilon=1e-1, eta=1e-3, max_iters=1000, convergence_after = 10,
tolerance=1e-2, log_every=10, compute_exact_loss=False
):
"""
Computes the Z matrix using gradient descent.
Parameters:
-----------
adj_mat : sp.csr_matrix
The adjacency matrix of the graph.
epsilon : float
Tolerance for deviations w.r.t. spectral norm of the sparsifier. Smaller epsilon lead to a higher
dimensionality of the Z matrix.
eta : float
Step size for the gradient descent.
max_iters : int
Maximum number of iterations.
convergence_after : int
If the loss did not decrease significantly for this amount of iterations, the gradient descent will abort.
tolerance : float
The minimum amount of energy decrease that is expected for iterations. If for a certain number of iterations
no overall energy decrease is detected, the gradient descent will abort.
log_every : int
Log the loss after each log_every iterations.
compute_exact_loss : bool
Only for debugging. If set it computes the actual pseudo inverse without down-projection and checks if
        the pairwise distances in Z's columns are the same with respect to the Frobenius norm.
Returns:
--------
Z : ndarray, shape [k, N]
Matrix from which to efficiently compute approximate resistances.
"""
# Compute L, S, B
L = sp.csr_matrix(sp.csgraph.laplacian(adj_mat))
rows, cols = adj_mat.nonzero()
weights = np.sqrt(np.array(adj_mat[rows, cols].tolist()))
S = sp.diags(weights, [0])
# Construct signed edge incidence matrix
num_vertices = adj_mat.shape[0]
num_edges = S.shape[0]
assert(num_edges == len(rows) and num_edges == len(cols))
B = sp.coo_matrix((
([1] * num_edges) + ([-1] * num_edges),
(list(range(num_edges)) * 2, list(rows) + list(cols))
), shape=[num_edges, num_vertices])
    # Number of random projection dimensions (~ log(n) / epsilon^2, per Johnson-Lindenstrauss).
    k = int(np.ceil(np.log(B.shape[1]) / epsilon**2))
    # Y_red is the k x n matrix Q S B, computed via the random feature maps.
    list_tensor = draw_random_featmap(adj_mat, num_components=k)
    Y_red = np.asarray(np.hstack(compute_reduced_vertices(list_tensor, adj_mat))).T
# Use gradient descent to solve for Z
Z = np.random.randn(k, L.shape[1])
best_loss = np.inf
best_iter = np.inf
for it in range(max_iters):
        eta_t = eta * (1.0 / np.sqrt(it + 10))  # Decaying step size.
residual = Y_red - sp.csr_matrix.dot(Z, L)
loss = np.linalg.norm(residual)
if it % log_every == 0:
print(f'Loss before iteration {it}: {loss}')
            if compute_exact_loss:
                # Debugging only: assumes a ground-truth matrix pairwise_dist_gnd has been
                # precomputed from the exact pseudoinverse.
                pairwise_dist = Z.T.dot(Z)
                exact_loss = np.linalg.norm(pairwise_dist - pairwise_dist_gnd)
                print(f'Loss w.r.t. exact pairwise distances {exact_loss}')
if loss + tolerance < best_loss:
best_loss = loss
best_iter = it
elif it > best_iter + convergence_after:
# No improvement for 10 iterations
print(f'Convergence after {it - 1} iterations.')
break
Z += eta_t * L.dot(residual.T).T
return Z
Putting it all together to sparsify graphs
All this results in a streamlined procedure for sparsifying a graph to have any desired number of edges num_edges.
To summarize, given a graph $G$, we:
- Calculate reduced-dimension representation $Z$ of the graph $G$.
- Compute effective resistances over edges in $G$.
- Run the weighted edge bootstrap to generate the sparsified graph $H$, sampling each edge with probability proportional to its weight times its effective resistance.
def spectral_sparsify(
adj_mat, q=None, epsilon=1e-1, eta=1e-3, max_iters=1000, convergence_after=100,
tolerance=1e-2, log_every=10, prevent_vertex_blow_up=False
):
""" Computes a spectral sparsifier of the graph given by an adjacency matrix.
Parameters:
----------
adj_mat : sp.csr_matrix, shape [N, N]
The adjacency matrix of the graph.
    q : int or None
        The number of samples for the sparsifier. If None, q will be set to N * log(N) / (epsilon ** 2).
epsilon : float
The desired spectral similarity of the sparsifier.
eta : float
Step size for the gradient descent when computing resistances.
max_iters : int
Maximum number of iterations when computing resistances.
convergence_after : int
If the loss did not decrease significantly for this amount of iterations, the gradient descent will abort.
tolerance : float
The minimum amount of energy decrease that is expected for iterations. If for a certain number of iterations
no overall energy decrease is detected, the gradient descent will abort.
    prevent_vertex_blow_up : bool
        If True, the probabilities are tweaked in order to ensure that no vertex is blown up too much.
        Note that this will only guarantee spectral closeness up to a factor of 2 * epsilon.
Returns:
--------
H : sp.csr_matrix, shape [N, N]
Sparsified graph with at most q edges.
"""
if q is None:
q = int(np.ceil(adj_mat.shape[0] * np.log(adj_mat.shape[0]) / (epsilon ** 2))) # or 6n*ln(n) , by https://www.cs.ubc.ca/~nickhar/Cargese3.pdf
edges = adj_mat.nonzero()
Z = calculate_reduced_resistances(adj_mat, epsilon=epsilon, max_iters=max_iters, convergence_after=convergence_after, eta=eta, tolerance=tolerance, compute_exact_loss=False)
R = compute_effective_resistances(Z, edges)
    return sample_sparsifier(adj_mat, q, R, ultrasparsifier=prevent_vertex_blow_up)
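Finally, an illustrative end-to-end run on a small subgraph of the gene network loaded earlier; the subgraph size, q, and epsilon here are arbitrary demonstration values, not tuned settings from the original post.
# Illustrative end-to-end sparsification of a small subgraph.
sub = sp.csr_matrix(adj_mat[:500, :500])
H = spectral_sparsify(sub, q=5000, epsilon=0.5, max_iters=200)
print(f'{sub.nnz} nonzeros before, {H.nnz} nonzeros after sparsification')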
Summary
- The weighted edge bootstrap algorithm of [Spielman and Srivastava 2011] guarantees a sparsifier with high probability for any undirected weighted graph.
- An efficient implementation, described in [Spielman and Srivastava 2011], uses random projections to construct a subspace of dimension $\log n$ in which the pairwise distances between the $n$ vertices are approximately preserved.
- Though simple in concept, this sparsification method is an extremely useful and powerful primitive in designing efficient graph algorithms.
Footnotes
1. The pygsp package deserves special mention for including sparsification in a broader context of algorithms that I will explore in future posts. But its actual implementation is not efficient enough to scale as ours does.↩
2. By Johnson-Lindenstrauss theory and related lower bounds, no method can use substantially fewer dimensions while achieving a given reconstruction error guarantee.↩
- Spielman, D.A. and Srivastava, N. 2011. Graph sparsification by effective resistances. SIAM Journal on Computing 40, 6, 1913–1926.
- Klein, D.J. and Randić, M. 1993. Resistance distance. Journal of Mathematical Chemistry 12, 1, 81–95.
- Chandra, A.K., Raghavan, P., Ruzzo, W.L., Smolensky, R., and Tiwari, P. 1996. The electrical resistance of a graph captures its commute and cover times. Computational Complexity 6, 4, 312–340.
- Arriaga, R.I. and Vempala, S. 2006. An algorithmic theory of learning: Robust concepts and random projection. Machine learning 63, 2, 161–182.
- Dasgupta, S. and Gupta, A. 2003. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms 22, 1, 60–65.
- Li, P., Hastie, T.J., and Church, K.W. 2006. Very sparse random projections. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, 287–296.