Building graphs from gene sets

dataviz
genomics
Gene-gene graphs and the Gene Ontology
Author

Akshay Balsubramani

Modified

October 1, 2023

Gene sets from the Gene Ontology (GO)

I’m very interested in the theme of using gene-gene graphs to guide and regularize learning, where it can be a remarkably effective technique to improve the efficacy of ML models on genes.

Much of the existing knowledge about gene function in various contexts can be described in terms of gene sets. They require some more intermediate modifications to guide and regularize learning. I’ll write about these here, to put them in one place for downstream applications.

As a prototype example, the GO is a controlled vocabulary for describing gene function – an indispensable tool for performing analyses over protein-coding genes. The GO is basically composed of gene sets corresponding to GO terms (nodes) and relationships between them (edges). This knowledge is essentially manually assembled, painstakingly from decades of experiments in the field. Building a gene-gene graph from the GO is a useful way to richly capture this knowledge.

I’ll cover how to build a gene-gene graph using GO. The first step is to get the relevant gene sets we need from GO.

Getting all GO gene sets

We’ll use the EnrichR API via the GSEAPy package to get gene sets corresponding to GO terms.

CODE
import gseapy

def get_goterm_dict_enrichr(organism='Human'):
    go_mf = gseapy.get_library(name='GO_Molecular_Function_2023', organism='Human')
    go_cc = gseapy.get_library(name='GO_Cellular_Component_2023', organism='Human')
    go_bp = gseapy.get_library(name='GO_Biological_Process_2023', organism='Human')
    go_dict = {}
    go_dict.update(go_mf)
    go_dict.update(go_bp)
    go_dict.update(go_cc)
    return go_dict

#all_libs = gseapy.get_library_name(organism='Human')
go_dict = get_goterm_dict_enrichr()

Here is a list of the protein-coding genes that we’ll be considering in this post.

CODE
import anndata, scanpy as sc
import pandas as pd, matplotlib.pyplot as plt, time
from sklearn import preprocessing
sc.settings.verbosity = 0
%matplotlib inline

import requests, os, numpy as np
import scipy, scipy.sparse

gene_names_url = 'https://raw.githubusercontent.com/kundajelab/coessentiality-browser/master/data/vizdf_GLS01_CO99.csv'
gene_names = np.ravel(pd.read_csv(gene_names_url, header=0, index_col=None, sep="\t")['gene_names'])

To get a basic idea of the gene sets in GO, we can plot the distribution of cluster sizes.

CODE
def membership_mat(genesets_dct, gene_names, file_path='memberships.npz'):
    """ Returns a sparse matrix of gene memberships to clusters.
    The matrix is of shape (n_genes, n_clusters), and is saved to disk as a sparse matrix.
    """
    if not os.path.exists(file_path):
        # Reverse the key-value mapping of the dictionary
        genesets_dict_rev = { g: [] for g in gene_names }
        for key, value in genesets_dct.items():
            for gene in value:
                if gene in genesets_dict_rev:
                    genesets_dict_rev[gene].append(key)
        set_names = list(genesets_dct.keys())
        nclusts = len(genesets_dct)
        clust_mmships = np.zeros((len(gene_names), nclusts))
        for i in range(len(set_names)):
            if set_names[i] in genesets_dct:
                clust_mmships[np.isin(gene_names, genesets_dct[set_names[i]]), i] = 1
            if i%250 == 0:
                print(i)
        clust_mmships = scipy.sparse.csr_matrix(clust_mmships)
        scipy.sparse.save_npz(file_path, clust_mmships)
    else:
        clust_mmships = scipy.sparse.load_npz(file_path)
    return clust_mmships

clust_mmships_GO = membership_mat(go_dict, gene_names, file_path='GO_memberships.npz')

plt.hist(np.sum(clust_mmships_GO.toarray(), axis=0), bins=100, density=True, cumulative=False, range=(0,100))
plt.title("Distribution of cluster sizes (GO)")
plt.xlabel("Cluster size (number of genes)")
plt.ylabel("Fraction of clusters")
plt.show()

We see that most of the gene sets are small, with a long tail of larger gene sets. So the modules being highlighted here are mostly relatively small and informative.

Data-driven gene sets

Data like gene expression can be used to partition genes into gene modules. One example of great functional and conceptual relevance is in the paper (Wainberg et al. 2021) (which I was involved in). The paper uses GO-derived gene sets as a crucial component to build a functionally relevant graph which faithfully recapitulates known biology, so it’s a great template to build upon. So I’ll delve deeper into some of the computational ideas from this paper.

The goal is to construct a network in which neighboring genes have similar functions. The most common way of constructing such a network is through co-expression analysis. But this has serious limitations with spurious correlations, because correlation is not causation.

A much better idea to construct functionally relevant gene networks is to use CRISPR knockout data instead of co-expression data. The idea here is that two genes may be functionally related if they are coessential – necessary for cell survival – in the same cell types. Therefore, performing a regression of coessentiality scores between a pair of genes gives a sensitive measure of functional similarity between the genes. And a gene-gene graph can be constructed from performing these regressions (through savvy implementation, they can be done in parallel on a laptop) through a generalized least squares (GLS) procedure.

I’ll write elsewhere about how to construct this graph, as it deserves a separate post. But for now, here’s code to load a precomputed version of the graph, roughly adapting associated code from (Wainberg et al. 2021).

CODE
def build_knn(mat, k=10, symmetrize_type='inclusive'):
    """
    Build symmetric kNN graph, returning a sparse matrix. 

    Parameters
    ----------
    mat : numpy array
        Input matrix of similarities.
    k : int
        Number of nearest neighbors.
    symmetrize_type : str
        Type of symmetrization to use.
        * 'mutual': min(A, A^T), making degree \leq k
        * 'inclusive': max(A, A^T), making degree \geq k
    
    Returns
    -------
    sparse_adj : scipy.sparse.csr_matrix
        Sparse adjacency matrix of the kNN graph.
    """
    sparse_adj = scipy.sparse.csr_matrix(np.zeros(mat.shape))
    for i in range(mat.shape[0]):
        matarr = mat[i,:]
        nbrs = np.argsort(matarr)[::-1][:k]    # Highest k similarities
        sparse_adj[i, nbrs] = 1.0   #matarr[nbrs]
    if (symmetrize_type == 'mutual'):
        return sparse_adj.minimum(sparse_adj.transpose())
    elif (symmetrize_type == 'inclusive'):
        return sparse_adj.maximum(sparse_adj.transpose())
    else:
        print("Mode not yet implemented.")
        return sparse_adj

A clustering / community detection algorithm can be used to partition the graph into gene modules. This has already been done for the paper (Wainberg et al. 2021), and the resulting gene sets are loaded below.

CODE
# Read in clusterONE analysis results.
def get_coess_module_dict():
    clusterone_url = 'https://raw.githubusercontent.com/kundajelab/coessentiality/master/clusterOne_clusters.tsv'
    cone_clusts = pd.read_csv(clusterone_url, sep="\t", index_col=0, header=0)
    clustdata = cone_clusts.iloc[:,10:].fillna(value='')
    genesets_dict = {}
    for i in range(clustdata.shape[0]):
        a = clustdata.iloc[i,:].values
        genesets_dict[clustdata.index[i]] = a[a != '']
    return genesets_dict

genesets_dict = get_coess_module_dict()
clust_mmships_coessentiality = membership_mat(genesets_dict, gene_names, file_path='clusterone_memberships.npz')

plt.hist(np.sum(clust_mmships_coessentiality.toarray(), axis=0), bins=100, density=True, cumulative=False, range=(0,100))
plt.title("Distribution of cluster sizes (coessentiality-based modules)")
plt.xlabel("Cluster size (number of genes)")
plt.ylabel("Fraction of clusters")
plt.show()

Graphs from gene sets

The next step is to process these gene sets into a neighborhood-based matrix representing the gene sets.

Pairwise similarity between genes

One way to do this is by creating a graph gg_jaccsims where each node is a gene, and each edge is a Jaccard similarity measure between the GO terms of the two genes. The Jaccard similarity is the number of sets that both genes are in (“intersection”), divided by the number of sets that either gene is in (“union”).

CODE
def pairwise_jaccard_graph(input_memberships):
    """ Given a bunch of entities (like genes) and their binary memberships in some sets (like GO terms), 
    construct graph of Jaccard similarities between the entities. 

    Parameters
    ----------
    input_memberships : scipy.sparse.csr_matrix
        Sparse matrix of set memberships. Each row is an entity, each column is a set.

    Returns
    -------
    numpy.ndarray
        Matrix of pairwise Jaccard similarities. 
    """
    a = input_memberships.dot(input_memberships.T).toarray()
    tmp = np.add(np.add(-a.T, np.diagonal(a)).T, np.diagonal(a))
    return np.nan_to_num(np.divide(a, tmp))


# Build and save pairwise Jaccard similarities between genes, according to the clusterings given.
gg_jaccsims_GO = pairwise_jaccard_graph(clust_mmships_GO)
gg_jaccsims_coessentiality = pairwise_jaccard_graph(clust_mmships_coessentiality)
/var/folders/5b/ps6ymxr90tj0jglr7hvc98zm0000gn/T/ipykernel_91723/703627734.py:17: RuntimeWarning: invalid value encountered in divide
  return np.nan_to_num(np.divide(a, tmp))

There are stark differences between the data-derived gene sets and GO. The data-derived gene sets are significantly smaller and more specific, resulting in higher similarities between genes using our method above.

CODE
plt.hist(gg_jaccsims_GO[gg_jaccsims_GO > 0], bins=150, cumulative=False, density=True, label='GO')
plt.hist(gg_jaccsims_coessentiality[gg_jaccsims_coessentiality > 0], bins=150, cumulative=False, density=True, label='Coessentiality')
plt.title("Distribution of nonzero Jaccard similarities between genes")
plt.xlabel("Jaccard similarity")
plt.ylabel("Fraction of gene pairs")
plt.legend()
plt.show()

The graphs are not particularly sparse, so we sparsify this graph by building a symmetric kNN matrix from it.

CODE
jaccard_graph_GO = build_knn(gg_jaccsims_GO, k=10, symmetrize_type='inclusive')
jaccard_graph_coess = build_knn(gg_jaccsims_coessentiality, k=10, symmetrize_type='inclusive')
/Users/akshay/opt/anaconda3/envs/env-chemML/lib/python3.10/site-packages/scipy/sparse/_index.py:146: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
  self._set_arrayXarray(i, j, x)

This is ultimately a graph which is much easier to deal with for computation, as I’ve written about elsewhere. By using the Jaccard similarity, we normalize for the fact that some genes are in many more GO terms than others.

Gene sets as neighborhoods

Here’s another way of building a graph from gene sets. Each gene set can be represented in the overall gene-gene graph by a clique – a subgraph where each gene in the gene set is connected to every other gene in it. Adding the resulting cliques together builds a graph that seamlessly deals with overlapping gene sets. When the gene sets are relatively small, as above, this resulting graph is also relatively sparse.

We normalize each graph by the number of nodes in the gene set, which also normalizes the added degrees of each node in the clique. We save the resulting graph for further use.

CODE
def create_clique_adj_matrix(master_list, active_list, weight='reciprocal'):
    """ Create clique composed of the elements in active_list that are in master_list.
    """
    n = len(master_list)
    data, row_indices, col_indices = [], [], []
    # Create a dictionary for faster look-up
    active_list = np.intersect1d(master_list, active_list)
    gene_to_index = {gene: index for index, gene in enumerate(master_list)}
    for gene1 in active_list:
        for gene2 in active_list:
            if gene1 != gene2:
                row = gene_to_index[gene1]
                col = gene_to_index[gene2]
                row_indices.append(row)
                col_indices.append(col)
                if weight=='reciprocal':
                    data.append(1.0/len(active_list))
                elif weight=='binary':
                    data.append(1.0)
    return scipy.sparse.csr_matrix((data, (row_indices, col_indices)), shape=(n, n))


def genesets_clique_matrix(geneset_dict, gene_names, file_path='clique_graph.npz'):
    """ Return dictionary mapping GO terms to their clique adjacency matrices. 
    """
    if not os.path.exists(file_path):
        matrix_return = scipy.sparse.csr_matrix((len(gene_names), len(gene_names)))
        i = 0
        for termname in geneset_dict:
            i += 1
            matrix_return += create_clique_adj_matrix(gene_names, geneset_dict[termname])
        scipy.sparse.save_npz(file_path, matrix_return)
    else:
        matrix_return = scipy.sparse.load_npz(file_path)
    return matrix_return
CODE
clique_graph = genesets_clique_matrix(go_dict, gene_names, file_path='GO_clique_graph.npz')
clique_graph_GO = build_knn(clique_graph, k=10, symmetrize_type='inclusive')

Processing the graph

We can now process the graph to make it more useful for downstream analysis, by applying several data science techniques to it, following (Wainberg et al. 2021).

Mixing in a connected component

One potential issue with using the GO graph for learning is that it is not fully connected - there are many genes that are not connected to any other genes, because they don’t appear in GO’s gene modules.

CODE
all_genes_go_dict = list(set([item for sublist in [go_dict[k] for k in go_dict.keys()] for item in sublist]))
num_genes_connected = len(np.intersect1d(gene_names, all_genes_go_dict))
print("{} genes are in at least one GO gene set. {} are in no GO gene sets.".format(
    num_genes_connected, len(gene_names) - num_genes_connected))
14139 genes are in at least one GO gene set. 3495 are in no GO gene sets.

The same is true of the coessentiality-derived gene modules.

CODE
all_genes_go_dict = list(set([item for sublist in [genesets_dict[k] for k in genesets_dict.keys()] for item in sublist]))
num_genes_connected = len(np.intersect1d(gene_names, all_genes_go_dict))
print("{} genes are in at least one coessentiality-derived gene set. {} are in no coessentiality-derived GO gene sets.".format(
    num_genes_connected, len(gene_names) - num_genes_connected))
15374 genes are in at least one coessentiality-derived gene set. 2260 are in no coessentiality-derived GO gene sets.

One way to solve this issue is to use the fully connected graph derived from coessentiality data, and mix it in with the GO graph.

CODE
def coessentiality_GLS_graphs(
    file_path='GLS_pvals_10NN.npz', fname_GLS_p = 'GLS_p.npy', 
    GLS_p_url='https://mitra.stanford.edu/bassik/coessentiality/GLS_p.npy', 
    min_logp=1000
):
    """ Return sparse matrix of coessentiality (-log p) values between genes, calculated using GLS.
    """
    if not os.path.exists(fname_GLS_p):
        response = requests.get(GLS_p_url)
        with open(fname_GLS_p, 'wb') as fp:
            fp.write(response.content)
    GLS_p = np.load(fname_GLS_p)
    res = -np.log(GLS_p)
    res[np.isinf(res)] = min_logp
    if not os.path.exists(file_path):
        GLS_pvals_100 = build_knn(res, k=100, symmetrize_type='inclusive')
        GLS_pvals_10 = build_knn(GLS_pvals_100.toarray(), k=10, symmetrize_type='inclusive')
        scipy.sparse.save_npz(file_path, GLS_pvals_10)
    GLS_pvals_10 = scipy.sparse.load_npz(file_path)
    return GLS_pvals_10, res

Putting it all together, mixing the two graphs’ adjacency matrices together with some weighting, gives a function for getting a gene-gene graph.

CODE
def get_gene_gene_graph(
    mode='coessentiality', 
    graph_construction='jaccard', 
    file_path='gene_gene_graph.npz', 
    frac_gg_graph=0.99
):
    if os.path.exists(file_path):
        return scipy.sparse.load_npz(file_path)
    else:
        # Build the fully connected graph from coessentiality data
        GLS_pvals_10, res = coessentiality_GLS_graphs()
        if mode == 'coessentiality':
            geneset_dict = get_coess_module_dict()
            fprefix = 'clusterone_'
        elif mode == 'GO':
            geneset_dict = get_goterm_dict_enrichr()
            fprefix = 'GO_'
        if graph_construction == 'clique':
            clique_graph = genesets_clique_matrix(geneset_dict, gene_names, file_path=fprefix+'clique_graph.npz')
            gg_graph = build_knn(clique_graph, k=10, symmetrize_type='inclusive')
        elif graph_construction == 'jaccard':
            clust_mmships = membership_mat(geneset_dict, gene_names, file_path=fprefix+'memberships.npz')
            gg_jaccsims = pairwise_jaccard_graph(clust_mmships)
            gg_graph = build_knn(gg_jaccsims, k=10, symmetrize_type='inclusive')
        # Construct the combined graph
        gene_gene_graph = scipy.sparse.csr_matrix(
            (frac_gg_graph * gg_graph) + 
            ((1-frac_gg_graph) * GLS_pvals_10)
        )
        scipy.sparse.save_npz(file_path, gene_gene_graph)
CODE
gg_graph = get_gene_gene_graph(
    graph_construction='jaccard', 
    file_path='gene_gene_graph.npz'
)
gg_clique = get_gene_gene_graph(
    graph_construction='clique', 
    file_path='gene_gene_graph_clique.npz'
)

Smoothing with diffusion

This involves using diffusion maps to emphasize the more global coarse-scale structure of the graph, smoothing out high-frequency components to a degree such that the top few components capture most of the variance in the graph.

This is a useful technique for downstream learning, in the same way that PCA-based dimensionality reduction is used in computational workflows. In addition, it can help to provide a more visualizable representation of the input graph.

CODE
def compute_transitions(adj_mat, alpha=1.0, sym=True):
    """Compute symmetrized transition matrix. 
    alpha : The exponent of the diffusion kernel. 
    * 1 = Laplace-Beltrami (density)-normalized [default]. 
    * 0.5 = normalized graph Laplacian (Fokker-Planck dynamics, cf. "Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators" NIPS2006).
    * 0 = classical graph Laplacian. 
    """
    similarity_mat = symmetric_part(adj_mat)
    dens = np.asarray(similarity_mat.sum(axis=0))  # dens[i] is an estimate for the sampling density at point i.
    K = scipy.sparse.spdiags(np.power(dens, -alpha), 0, similarity_mat.shape[0], similarity_mat.shape[0])
    W = scipy.sparse.csr_matrix(K.dot(similarity_mat).dot(K))
    z = np.sqrt(np.asarray(W.sum(axis=0)).astype(np.float64))    # sqrt(density)
    Zmat = scipy.sparse.spdiags(1.0/z, 0, W.shape[0], W.shape[0])
    return Zmat.dot(W).dot(Zmat) if sym else Zmat.power(2).dot(W)

def compute_eigen(adj_mat, n_comps=2, sym=True):
    """
    Compute eigendecomposition of sparse transition kernel matrix.
    (Computing in 64-bit can matter for analysis use beyond just visualization!).
    sym indicates that the transition matrix should be symmetrized.
    NOTE: use np.linalg.eigh if we want to support non-sparse matrices.
    """
    if sym:
        eigvals, evecs = scipy.sparse.linalg.eigsh(adj_mat.astype(np.float64), k=n_comps)
    else:
        # eigvals, evecs = scipy.sparse.linalg.eigs(adj_mat.astype(np.float64), k=n_comps)   # DON'T USE without further thresholding: complex-number underflow issues
        evecs, eigvals, _ = scipy.sparse.linalg.svds(adj_mat.astype(np.float64), k=n_comps)
    #eigvals, evecs = eigvals.astype(np.float32), evecs.astype(np.float32)
    sorted_ndces = np.argsort(np.abs(eigvals))[::-1]
    return eigvals[sorted_ndces], evecs[:, sorted_ndces]

""" Returns the symmetrized version of the input matrix. (The anti-symmetrized version is A - symmetric_part(A) .) """
def symmetric_part(A):
    return 0.5*(A + A.T)


"""
Compute diffusion map embedding.
sym_compute: Whether to compute the SVD on a symmetric matrix.
sym_return: Whether to return the SVD of the symmetrized transition matrix.
Assumes that {sym_return => sym_compute} holds, i.e. doesn't allow sym_compute==False and sym_return==True.
Returns (n_comps-1) components, excluding the first which is the same for all data.
"""
def diffmap_proj(
    adj_mat, 
    t=None, 
    min_energy_frac=0.95, 
    n_dims=None,      # Number of dims to return
    n_comps=None,     # Number of comps to use in computation.
    return_eigvals=False, 
    embed_type='diffmap', 
    sym_compute=True, 
    sym_return=False
):
    if n_comps is None:     # When data are high-d dimensional with log2(d) \leq 14-16 as for scRNA, 2K eigenvector computation is tolerably fast; change otherwise
        n_comps = min(2000, adj_mat.shape[0]-1)
    if n_dims is None:
        n_dims = n_comps - 1
    eigvals, evecs = compute_eigen(compute_transitions(adj_mat, sym=sym_compute), n_comps=n_comps, sym=sym_compute)
    if sym_compute:
        evecs_sym = evecs
        evecs_unsym = np.multiply(evecs, np.outer(1.0/evecs[:,0].astype(np.float64), np.ones(evecs.shape[1])))
    else:
        evecs_unsym = evecs
    if sym_return:
        if not sym_compute:
            return None
        eigvecs_normalized = preprocessing.normalize(np.real(evecs_sym), axis=0, norm='l2')
    else:
        eigvecs_normalized = preprocessing.normalize(np.real(evecs_unsym), axis=0, norm='l2')
    
    if t is None:     # Use min_energy_frac to determine the fraction of noise variance 
        t = min_t_for_energy(eigvals, n_dims+1, min_energy_frac)
    frac_energy_explained = np.cumsum(np.power(np.abs(eigvals), t)/np.sum(np.power(np.abs(eigvals), t)))[n_dims]
    print("{} dimensions contain about {:.2f} fraction of the variance in the first {} dimensions (Diffusion time = {:.3f})".format(
        n_dims+1, frac_energy_explained, n_comps, t))
    if embed_type=='naive':
        all_comps = eigvecs_normalized
    elif embed_type=='diffmap':
        all_comps = np.power(np.abs(eigvals), t) * eigvecs_normalized
    elif embed_type=='commute': # Check the correctness!
        all_comps = np.power((1-np.abs(eigvals)), -t/2) * eigvecs_normalized
    if not return_eigvals:
        return all_comps[:,1:(n_dims+1)]     # Return n_dims dimensions, skipping the first trivial one.
    else:
        return (all_comps[:,1:(n_dims+1)], eigvals)    # Return the eigenvalues as well.

def min_t_for_energy(eigvals, desired_dim, min_energy_frac, max_t=None):
    # Calc upper bound for t principal eigengap (if g=lbda1/lbda2, then g^t < 100 implies t < log(100)/log(g) ). Don't change unless you know what you're doing!
    if max_t is None:
        max_t = np.log(100)/np.log(max(eigvals[0]/eigvals[1], 1.01))
    f = lambda t: (np.sum(heat_eigval_dist(eigvals, t)[:desired_dim]) - min_energy_frac)
    if f(0)*f(max_t) >= 0:    # since f is always nondecreasing this means the zero isn't in the interval
        return max_t if f(0) < 0 else 0
    return scipy.optimize.brentq(f, 0, max_t)

def heat_eigval_dist(eigvals, t):
    return np.power(np.abs(eigvals), t)/np.sum(np.power(np.abs(eigvals), t))
CODE
def smooth_anneal_graph(in_graph, n_cmps=100, reduced_dim=50, min_energy_frac=0.9):
    itime = time.time()
    emap_heat = diffmap_proj(in_graph, n_comps=n_cmps, n_dims=40, min_energy_frac=min_energy_frac, embed_type='diffmap', return_eigvals=False)
    print("Diffusion components computed. Time: {:.2f}".format(time.time() - itime))
    ann_heat = anndata.AnnData(X=emap_heat[:, :40])
    ann_heat.obs['gene_names'] = gene_names
    sc.pp.neighbors(ann_heat)
    sc.tl.umap(ann_heat)
    return ann_heat


gene_gene_graph = gg_graph
ann_heat = smooth_anneal_graph(gene_gene_graph, n_cmps=100, reduced_dim=50, min_energy_frac=0.9)
path_final_graph = 'coess_atlas_gene_graph.npz'
if not os.path.exists(path_final_graph):
    scipy.sparse.save_npz(path_final_graph, ann_heat.obsp['connectivities'])

itime = time.time()
#Plot eigenvalue spectrum of resulting graph.
emap_naive, eigvals = diffmap_proj(gene_gene_graph, t=0, n_comps=50, embed_type='naive', return_eigvals=True)
print("Laplacian eigenmap computed. Time: {:.2f} sec.".format(time.time() - itime))
plt.title("Eigenvalue spectrum of Laplacian")
plt.xlabel("Eigenvalue index")
plt.ylabel("Eigenvalue magnitude")
plt.scatter(x=range(50), y=np.abs(eigvals)[:50])
plt.show()
41 dimensions contain about 0.90 fraction of the variance in the first 100 dimensions (Diffusion time = 95.917)
Diffusion components computed. Time: 4.27
50 dimensions contain about 1.00 fraction of the variance in the first 50 dimensions (Diffusion time = 0.000)
Laplacian eigenmap computed. Time: 1.78 sec.

Verify that the overall graph includes known gene modules

Read modules highlighted in the paper.

CODE
url_modules = 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8763319/bin/NIHMS1768167-supplement-Supp_Data_2.zip'

import gzip, requests
from zipfile import ZipFile

file_path = 'NIHMS1768167-supplement-Supp_Data_2.zip'

response = requests.get(url_modules)
with open(file_path, 'wb') as fp:
    fp.write(response.content)

# opening the zip file in READ mode 
with ZipFile(file_path, 'r') as zip: 
    # printing all the contents of the zip file 
    zip.printdir() 
  
    # extracting all the files 
    print('Extracting all the files now...') 
    zip.extractall() 
    print('Done!') 


xl = pd.ExcelFile('Supplementary_Data_2.xlsx')
print(xl.sheet_names)

ref_modules = xl.parse('Manual Annotation References')

module_dict = {}
module_delims = list(np.where(~pd.isna(ref_modules['Module name']))[0]) + [ref_modules.shape[0]]
for i in range(len(module_delims)-1):
    these_genes_df = ref_modules.iloc[module_delims[i]:module_delims[i+1], :]
    module_dict[these_genes_df.iloc[0, :2]['Module name']] = np.ravel(these_genes_df['Gene name'])
File Name                                             Modified             Size
Supplementary_Data_2.xlsx                      2021-03-05 15:28:02     11721905
Extracting all the files now...
Done!
['Co-essential Modules', 'Manual Annotation References']
CODE
for m in module_dict:
    ann_heat.obs['selected'] = np.isin(gene_names, module_dict[m]).astype(int)
    arrsize = ann_heat.obs['selected']*150 + 4
    #sc.pl.umap(ann_heat, color='selected', title=m, size=arrsize, cmap='viridis')

First, we need some utility code to deal with gene sets. This computes similarities between genes based on their membership in gene sets. Given a particular universe of gene sets, the similarity between two genes is the Jaccard index between their set memberships.

CODE
ann_heat.obs.index = ann_heat.obs['gene_names']
mods = ann_heat.obs['gene_names'].copy()
for m in module_dict:
    mods[module_dict[m]] = m
ann_heat.obs['module'] = mods
for m in module_dict:
    print(m, (ann_heat[module_dict[m]].obsp['connectivities'] > 0).mean())
mTORC1 regulation 0.4512000000000001
mTORC2 / PI3K / AKT signaling 0.7653061224489793
Autophagy 0.5041551246537397
Notch/TGF-beta signaling 0.24691358024691357
p53 signaling 0.796875
IFN signaling 0.5051903114186852
Glycolysis 0.42666666666666664
Chromatin remodeling 0.7551020408163263
snRNA transcription 0.3076923076923077
Cell cycle 0.6599999999999999
Mitochondrial respiration 0.11763831052944679
CODE
ann_heat.obs.index = ann_heat.obs['gene_names']
mods = ann_heat.obs['gene_names'].copy()
for m in module_dict:
    mods[module_dict[m]] = m
ann_heat.obs['module'] = mods
for m in module_dict:
    print(m, (ann_heat[module_dict[m]].obsp['connectivities'] > 0).mean())
mTORC1 regulation 0.20480000000000004
mTORC2 / PI3K / AKT signaling 0.061224489795918366
Autophagy 0.34349030470914127
Notch/TGF-beta signaling 0.043209876543209874
p53 signaling 0.0234375
IFN signaling 0.05536332179930796
Glycolysis 0.017777777777777778
Chromatin remodeling 0.5408163265306122
snRNA transcription 0.33136094674556216
Cell cycle 0.08
Mitochondrial respiration 0.03822129684711481
CODE
ann_heat.obs.index = ann_heat.obs['gene_names']
ann_heat.obs['module'] = ann_heat.obs['gene_names']
for m in module_dict:
    ann_heat[module_dict[m]].obs['module'] = m

for m in module_dict:
    print(m, (ann_heat[module_dict[m]].obsp['connectivities'] > 0).mean())
mTORC1 regulation 0.4512000000000001
mTORC2 / PI3K / AKT signaling 0.7653061224489793
Autophagy 0.5041551246537397
Notch/TGF-beta signaling 0.24691358024691357
p53 signaling 0.796875
IFN signaling 0.5051903114186852
Glycolysis 0.42666666666666664
Chromatin remodeling 0.7551020408163263
snRNA transcription 0.3076923076923077
Cell cycle 0.6599999999999999
Mitochondrial respiration 0.11763831052944679

Incorporating precision medicine data

GO information is tightly related to other indirect ways of getting at gene function. The interplay between gene regulation and drug responses is a rich source of information about gene function. These sources are particularly prevalent in medicine and can be very useful in drug development.

The paper (Chandak, Huang, and Zitnik 2023) builds a knowledge graph with multiple types of entities for this purpose. This provides a useful way to contextually view the relationships between genes, drugs, and diseases.

CODE
import pandas as pd
primekg_df = pd.read_csv('primeKG/kg.csv')

primekg_df[primekg_df['display_relation'] == 'interacts with']['y_type'].value_counts()
primekg_df#['x_type'].value_counts()
relation display_relation x_index x_id x_type x_name x_source y_index y_id y_type y_name y_source
0 protein_protein ppi 0 9796 gene/protein PHYHIP NCBI 8889 56992 gene/protein KIF15 NCBI
1 protein_protein ppi 1 7918 gene/protein GPANK1 NCBI 2798 9240 gene/protein PNMA1 NCBI
2 protein_protein ppi 2 8233 gene/protein ZRSR2 NCBI 5646 23548 gene/protein TTC33 NCBI
3 protein_protein ppi 3 4899 gene/protein NRF1 NCBI 11592 11253 gene/protein MAN1B1 NCBI
4 protein_protein ppi 4 5297 gene/protein PI4KA NCBI 2122 8601 gene/protein RGS20 NCBI
... ... ... ... ... ... ... ... ... ... ... ... ...
8100493 anatomy_protein_absent expression absent 66747 4720 anatomy cerebellar vermis UBERON 5259 140 gene/protein ADORA3 NCBI
8100494 anatomy_protein_absent expression absent 63824 1377 anatomy quadriceps femoris UBERON 58254 105378952 gene/protein KLF18 NCBI
8100495 anatomy_protein_absent expression absent 63826 1379 anatomy vastus lateralis UBERON 58254 105378952 gene/protein KLF18 NCBI
8100496 anatomy_protein_absent expression absent 64523 2084 anatomy heart left ventricle UBERON 58254 105378952 gene/protein KLF18 NCBI
8100497 anatomy_protein_absent expression absent 67302 5384 anatomy nasal cavity epithelium UBERON 58254 105378952 gene/protein KLF18 NCBI

8100498 rows × 12 columns

References

Chandak, Payal, Kexin Huang, and Marinka Zitnik. 2023. “Building a Knowledge Graph to Enable Precision Medicine.” Scientific Data 10 (1): 67.
Wainberg, Michael, Roarke A Kamber, Akshay Balsubramani, Robin M Meyers, Nasa Sinnott-Armstrong, Daniel Hornburg, Lihua Jiang, et al. 2021. “A Genome-Wide Atlas of Co-Essential Modules Assigns Function to Uncharacterized Genes.” Nature Genetics 53 (5): 638–49.