Detecting motifs in networks

graphs
linear-algebra
Some linear algebra one-liners
Author

Akshay Balsubramani

Triangles and derived graphs

Counting the number of triangles in a graph is an essential operation in graph theory. I’ll detail an extremely efficient approach to this task, which is especially useful in analysis of relatively sparse real-world graphs where tightly connected groups can indicate significant relationships.

This material is known as a “triad census,” and appears in many packages for network analysis in the context of social networks. Interpreting triangles is useful for the same reason that interpreting edges is useful in networks – it provides a rich description of interactions among small local neighborhoods.

Pairs of vertices in directed networks have just a few possible connections between them – unconnected, connected in one or the other direction, or connected in both directions. Triplets of vertices have a richer spectrum of 16 possible interrelations between them, corresponding to triangle “motifs.” Being able to localize these motifs and quantify their behavior allows analysts to sensitively calculate local changes in graphs, and break them down in interpretable ways.

Warmup: counting triangles with linear algebra on undirected graphs

An undirected network may be modeled as a graph, like the one shown below.

Figure 1

Adjacency matrix. Let \(A\) be the adjacency matrix representation of the graph, defined as follows. The entries of \(A\) are either 0 or 1; and \(a_{i,j}\) equals 1 exactly where there is an edge connecting nodes \(i\) and \(j\). For instance, for the graph shown above,

\[ A = \begin{bmatrix} 0 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \end{bmatrix} \]

Given this representation of the graph, here is a way to count triangles using linear algebra.

First, let \(A \cdot B\) denote matrix multiplication. That is, \(E = A \cdot B\) means \(e_{i,j} = \sum_k a_{i,k} b_{k, j}\). Then \([A \cdot A ]_{i, j}\) counts the number of paths of length 2 from node \(i\) to node \(j\).

Next, let \(A \circ B\) denote elementwise multiplication. That is, \(E = A \circ B\) means \(e_{i, j} = a_{i, j} b_{i, j}\).

Then the matrix \(C = (A \cdot A) \circ A\) counts the number of triangles shared by each pair of vertices, i.e. each element \(c_{i,j}\), counts the number of triangles in which both \(i\) and \(j\) appear. For the example shown above, it turns out that \[ C = \begin{bmatrix} 0 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 \end{bmatrix} \]

This reflects the fact that there is exactly one triangle in the graph, which is formed by nodes 1, 2, and 5.

Enumerating all directed triads (3-node motifs)

Extending this example, there are a number of undirected and directed triads exhaustively listed by (Benson, Gleich, and Leskovec 2016). Here are all the possible motifs involving 3 connected vertices. We can compute them using linear algebra, using variations of the above reasoning.

Figure 2

The naming scheme is unfortunately hard to initially read, but descriptive. The three digits in the code (\(m\) \(a\) \(n\)) give the counts of mutual (reciprocal) dyads ⇄, asymmetric (one-way) dyads →, and null dyads ∅ in the triad. The triad characteristics can be summarized as below.

Code # arcs Description Connected? Name
003 0 ∅ ∅ ∅ empty graph
012 1 A → B single edge
102 2 A ⇄ B single mutual dyad
021D 2 A → B, A → C (divergent “fan-out”) divergent 2-fan
021U 2 B → A, C → A (convergent “fan-in”) convergent 2-fan
021C 2 A → B → C (two-step chain) open 2-path
111D 3 A ⇄ B, A → C mutual dyad with one outgoing tail
111U 3 A ⇄ B, C → A mutual dyad with one incoming head
030T 3 A → B → C and A → C feed-forward loop (transitive triple)
030C 3 3-cycle A → B → C → A directed cycle of length 3
201 4 A ⇄ B, A ⇄ C two mutual dyads sharing a node
120D 4 A ⇄ B, A → C, B → C diverging mutual core
120U 4 A ⇄ B, C → A, C → B converging mutual core
120C 4 A ⇄ B, B → C, C → A mixed cycle with one reciprocal edge
210 5 A ⇄ B, A ⇄ C, B → C nearly complete, one single arc missing
300 6 A ⇄ B ⇄ C ⇄ A (all six arcs) complete bidirected triangle

Building a systematic implementation

We can systematically organize how to implement all these triads. We rely on a few building blocks with direct interpretations.

building block meaning size
A adjacency (u → v) n×n
P = A @ A # two-step paths u→·→v n×n
shared_out = A @ A.T # targets shared by u and v n×n
shared_in = A.T @ A # sources shared by u and v n×n
DO on every existing arc u→v put out-deg(v) n×n sparse
DI on every existing arc u→v put in-deg(u) n×n sparse

These can be used together to count all possible triads. Here are some examples.

  • 030T (feed-forward loop) – edge u→v is in as many FFLs as the two tails share outgoing targets (shared_out[u,v]) – edge v→w uses shared incoming sources (shared_in[v,w]) – edge u→w is counted by plain length-2 paths P[u,w].

Adding the three matrices and masking with A grabs the right value for every arc.

  • 030C (directed 3-cycle) u→v belongs to as many cycles as there are two-step paths v→·→u .

  • 021C / 021D / 021U These depend only on degrees once paths that would turn them into 030T or 030C are removed.

CODE
import numpy as np, scipy.sparse as sp

# --------------------------------------------------------------------- #
# helpers                                                               #
# --------------------------------------------------------------------- #
def _row_mask(A, vec):
    """place vec[row] onto the pattern of A (row–based)"""
    r, c = A.nonzero()
    return sp.csr_matrix((vec[r], (r, c)), shape=A.shape, dtype=np.int32)

def _col_mask(A, vec):
    """place vec[col] onto the pattern of A (col–based)"""
    r, c = A.nonzero()
    return sp.csr_matrix((vec[c], (r, c)), shape=A.shape, dtype=np.int32)

# --------------------------------------------------------------------- #
# main                                                                  #
# --------------------------------------------------------------------- #
def edge_motif_matrix(adj, motif_id):
    """
    Parameters
    ----------
    adj       :  n×n 0/1 scipy.sparse (no self-loops)
    motif_id  :  1–13  (Milo numbering, connected triads only)

    Returns
    -------
    M  : sparse matrix  with  M[i,j] = # triads of that motif arc i→j belongs to
    """
    if motif_id not in range(1,14):
        raise ValueError("motif_id must be an integer from 1 to 13")

    A   = adj.astype(bool).astype(np.int8).tocsr()
    AT  = A.T
    n   = A.shape[0]

    # basic reused pieces ------------------------------------------------
    P           = A @ A              # 2-step paths
    PT          = P.T
    shared_out  = A  @ AT            # # common targets
    shared_in   = AT @ A             # # common sources

    out_deg = np.asarray(A.sum(1)).ravel()
    in_deg  = np.asarray(A.sum(0)).ravel()

    DO = _row_mask(A, out_deg)       # out-deg(source)
    DI = _row_mask(A, in_deg)        #  in-deg(source)   (used for 111U)

    Mrec  = A.multiply(AT)           # directed arcs that are reciprocal
    U     = A - Mrec                 # strictly asymmetric arcs
    Z     = (Mrec > 0).astype(np.int8)    # undirected graph on mutual dyads
    mut_deg = np.asarray(Z.sum(1)).ravel()

    # quick masks carrying mutual-degree on desired positions
    MUT_SRC = _row_mask(A, mut_deg)          # deg_mut(source)  on every arc
    MUT_DST = _col_mask(A, mut_deg)          # deg_mut(dest)    on every arc

    # extra second-order pieces where needed
    ZAT   = Z @ AT
    ZA    = Z @ A
    AZ    = A  @ Z
    ATZ   = AT @ Z
    ZZ    = Z  @ Z

    # ------------------------------------------------------------------ #
    # motif-specific kernels                                             #
    # ------------------------------------------------------------------ #
    TRIAD_TO_ID={'021C':1,'021D':2,'021U':3,'030T':4,'030C':5,'201':6,
             '120C':7,'120D':8,'120U':9,'210':10,'300':11,'111D':12,'111U':13}
    FORMULAE = {
        # 021C  (open 2-path)
        1 : lambda: A.multiply(DO - shared_out - PT),
        # 021D  (divergent ‘fan-out’)
        2 : lambda: A.multiply(DO - 1),
        # 021U  (convergent ‘fan-in’)
        3 : lambda: A.multiply(DI - 1),
        # 030T  (feed-forward loop)
        4 : lambda: A.multiply(shared_out + shared_in + P),
        # 030C  (directed 3-cycle)
        5 : lambda: A.multiply(PT),
        # 201   (two mutual dyads sharing one node)
        6 : lambda: Mrec.multiply(_row_mask(Mrec, np.maximum(mut_deg-1, 0))),
        # 120C  (cycle with one mutual)
        7 : lambda: (
                Mrec.multiply(PT) +
               (U).multiply(ZAT + ZA.T)        # exclude reciprocal arcs here
        ),
        # 120D  (diverging mutual core)
        8 : lambda: (
                Mrec.multiply(shared_out) +
               (U).multiply(ZA)
        ),
        # 120U  (converging mutual core)
        9 : lambda: (
                Mrec.multiply(shared_in) +
               (U).multiply(AZ)
        ),
        # 210   (near-complete; two mutual + one single)
        10: lambda: (
                Mrec.multiply(ZA + ATZ) +
                U.multiply(ZZ)
        ),
        # 300   (complete bidirected triangle)
        11: lambda: Mrec.multiply(ZZ),
        # 111D  (mutual + tail-out)
        12: lambda: (
                Mrec.multiply(DO - 1) +
               (U).multiply(MUT_SRC)
        ),
        # 111U  (mutual + tail-in)
        13: lambda: (
                Mrec.multiply(DI - 1) +
               (U).multiply(MUT_DST)
        ),
    }

    M = FORMULAE[motif_id]().tocsr()
    M.data[M.data < 0] = 0           # clip any negatives from subtractions
    return M
CODE
import numpy as np, scipy.sparse as sp

# Example: a feed-forward loop (motif 4 → 030T)
A = sp.csr_matrix(np.array([[0,1,1],
                            [0,0,1],
                            [0,0,0]]))

ffl_per_edge = edge_motif_matrix(A, 4).toarray()
print(ffl_per_edge)
# every arc carries one feed-forward loop

A “census” counting such triads for a graph is a common way of getting a higher-order signature of the graph’s connectivity. It can be implemented extremely quickly for very sparse graphs using disjoint-set data structures. We implement it here to be completely tensor-native, and additionally preserve a rich layer of local information about edge participation in these triads.

Trusses: cohesive subgraphs of a graph

A truss is a subgraph of a graph that is highly connected, like a clique. It’s a useful concept because trusses, unlike cliques, are efficient to find, so I’ve used them before and will cover them here.1

A \(k\)-truss is a subgraph of \(k\) vertices, where every pair participates in \(k-2\) triangles in the truss (J. Cohen 2008). This is a tight-knit subgraph, but a bit less restrictive than a \(k\)-clique, in which every pair by definition participates in \(k-2\) clique triangles, but there are also connection requirements on other vertices outside the triangle.

Detecting a truss can be done very efficiently (J. Cohen 2009). Starting from the given graph, the idea is to compute the number of triangles each edge participates in, and then remove edges that participate in fewer than \(k-2\) triangles. This is done iteratively until no more edges can be removed; the edges that remain are the edges in the truss.

These are very efficient with linear algebra, using the triangle enumeration operations we’ve just defined (Low et al. 2018). After initializing gmat to be the given adjacency matrix adj_mat, the algorithm repeats the following steps until no more edges can be removed:

  • Enumerate \(300\) (undirected triangle motif) counts along each edge. We do this with triangle_motif_kernel(gmat, ID='300')
  • Remove edges that participate in fewer than \(k-2\) triangles. We do this by thresholding the \(M_4\) counts from the previous step.
CODE
def get_k_truss(adj_mat, k):
    """Returns the k-truss of a graph.

    Parameters
    ----------
    adj_mat : sparse matrix
        Adjacency matrix of the graph. (n x n), where n is the number of vertices.
    k : int
        The k-truss to be extracted.
    
    Returns
    -------
    gmat : sparse matrix
        (n x n) adjacency matrix of the k-truss. 
        Includes all n vertices, but only edges in the k-truss. 
    """
    k_truss = adj_mat.copy()
    k_truss.data = np.ones_like(k_truss.data)
    while True: # repeat until no edges are removed
        tri_count = triangle_motif_kernel(k_truss, ID='M4')
        gmat_new = adj_mat.multiply(tri_count >= k-2)
        if (k_truss != gmat_new).nnz == 0:
            break
        k_truss = gmat_new
    # Binarize the k-truss to get its sparsity structure, and return those edges of the original matrix.
    k_truss.data = np.ones_like(k_truss.data)
    return adj_mat.multiply(k_truss)

Higher-order motifs

The classic paper (Ponstein 1966) devises an ingenious algorithm to count arbitrary cycles. This is one of the most general ways to count motifs, and we might implement it at some point because there is no implementation to be found. Instead, it’s easier to pursue construction of a 4th-order version of a clique, as was introduced more recently.

Higher-order connected subgraphs

(J. D. Cohen 2019) looks for higher-order connected subgraphs, which he calls “trapezes.” These can be enumerated efficiently too, by adapting the above techniques.

CODE
import sys, numpy as np, time, scipy, scipy.linalg, scipy.sparse
import matplotlib.pyplot as plt
%matplotlib inline

import anndata, scanpy as sc


def triangle_motif_kernel(gmat, ID='030C'):
    """
    Compute an adjacency matrix between nodes, 
    whose (i,j)th entry is the number of motifs that i and j are both involved in.
    gmat: A Scipy sparse adjacency matrix.
    """
    Bmat = gmat.multiply(gmat.T)
    Umat = gmat - Bmat
    if ID == '030C':
        c = np.dot(Umat, Umat).multiply(Umat.T)
        return c + c.T
    elif ID == '120C':
        c = np.dot(Bmat, Umat).multiply(Umat.T) + np.dot(Umat, Bmat).multiply(Umat.T) + np.dot(Umat, Umat).multiply(Bmat)
        return c + c.T
    elif ID == '210':
        c = np.dot(Bmat, Bmat).multiply(Umat) + np.dot(Bmat, Umat).multiply(Bmat) + np.dot(Umat, Bmat).multiply(Bmat)
        return c + c.T
    elif ID == '300':
        c = np.dot(Bmat, Bmat).multiply(Bmat)
        return c
    elif ID == '030T':
        c = np.dot(Umat, Umat).multiply(Umat) + np.dot(Umat, Umat.T).multiply(Umat) + np.dot(Umat.T, Umat).multiply(Umat)
        return c + c.T
    elif ID == '120D':
        c = np.dot(Umat, Bmat).multiply(Umat) + np.dot(Bmat, Umat.T).multiply(Umat.T) + np.dot(Umat.T, Umat).multiply(Bmat)
        return c
    elif ID == '120U':
        c = np.dot(Umat.T, Bmat).multiply(Umat.T) + np.dot(Bmat, Umat).multiply(Umat) + np.dot(Umat, Umat.T).multiply(Bmat)
        return c

References

Benson, Austin R, David F Gleich, and Jure Leskovec. 2016. “Higher-Order Organization of Complex Networks.” Science 353 (6295): 163–66.
Cohen, Jonathan. 2008. “Trusses: Cohesive Subgraphs for Social Network Analysis.” National Security Agency Technical Report 16 (3.1): 1–29.
———. 2009. “Graph Twiddling in a Mapreduce World.” Computing in Science & Engineering 11 (4): 29–41.
Cohen, Jonathan D. 2019. “Trusses and Trapezes: Easily-Interpreted Communities in Social Networks.” arXiv Preprint arXiv:1907.09417.
Low, Tze Meng, Daniele G Spampinato, Anurag Kutuluru, Upasana Sridhar, Doru Thom Popovici, Franz Franchetti, and Scott McMillan. 2018. “Linear Algebraic Formulation of Edge-Centric k-Truss Algorithms with Adjacency Matrices.” In 2018 IEEE High Performance Extreme Computing Conference (HPEC), 1–7. IEEE.
Ponstein, Jacob. 1966. “Self-Avoiding Paths and the Adjacency Matrix of a Graph.” SIAM Journal on Applied Mathematics 14 (3): 600–609.

Footnotes

  1. The term comes from mechanical engineering of structures like bridges and roofs, and the idea comes from social network analysis.↩︎

Reuse

Citation

BibTeX citation:
@online{balsubramani,
  author = {Balsubramani, Akshay},
  title = {Detecting Motifs in Networks},
  langid = {en}
}
For attribution, please cite this work as:
Balsubramani, Akshay. n.d. “Detecting Motifs in Networks.”