Gene sets from gene function

dataviz
genomics
Deriving gene sets to steer genomic learning
Author

Akshay Balsubramani

Gene sets from all the genomic databases

When gene-gene graphs are used to guide and regularize learning, the result can be a remarkably effective technique to improve the efficacy of AI/ML models on genes.

Much of the existing knowledge about gene function in various contexts can be described in terms of gene sets. They require some more intermediate modifications to guide and regularize learning. We’ll write about these here, to put them in one place for downstream applications.

Beyond real-time Gene Ontology (GO) enrichments

As a prototypical example, the GO is a controlled vocabulary for describing gene function – an indispensable tool for performing analyses over protein-coding genes. The GO is basically composed of gene sets corresponding to GO terms (nodes) and relationships between them (edges). This knowledge is essentially manually assembled, painstakingly from decades of experiments in the field. Building a gene-gene graph from the GO is a useful way to richly capture this knowledge.

The GO has various limitations because it represents a biased set of experimental observations over the decades, with much more coverage over some genes than others, using the imperfectly precise English language to define terms. Its limited nature enables statistical tests and lookups to be done on gene sets in essentially real time. We’ve seen how to implement this in code, for the GO and a few other important databases of gene-gene associations and protein-protein interactions.

Rather than running against the hundreds of existing genomic databases, just a handful are used there to run a statistical test every time. Such methods follow the standard API and usage pattern of “gene set enrichment analyses” that are pervasive in genomics (Subramanian et al. 2005). These analyses typically do not expose the gene set itself.

Getting all gene sets

We’ll use the EnrichR API via the GSEAPy package to get gene sets corresponding to all sorts of things beyond GO. Gene Ontology is terrific for “what does my gene do in the cell?” — but biology’s diversity spans tissues, perturbations, diseases, drugs, phenotypes and multi-omics layers that GO was never meant to capture. GSEApy surfaces >250 libraries gathered by Enrichr; together they let us probe who, where, when and under-what-perturbation a gene list lights up.

Question we can now ask Example libraries Typical use-case
Which mechanistic pathway is shifting? KEGG 2025, Reactome 2024, WikiPathways 2024, BioPlanet 2019, BioCarta 2016 Map DE-genes onto curated signalling cascades; compare orthogonal databases for consensus.
Which protein complex or physical interaction is enriched? CORUM, BioPlex 2017, PPI_Hub_Proteins, Transcription_Factor_PPIs Detect multi-subunit machines (e.g. ribosome, spliceosome) driving a phenotype.
What cell type or tissue signature do these genes match? ARCHS4_Tissues, Azimuth 2023, Allen Brain Atlas 2021, Tabula Sapiens, CellMarker 2024 Deconvolve bulk RNA-seq; annotate scRNA-seq clusters; trace tumour micro-environment origins.
Are these genes linked to a disease or trait? DisGeNET, OMIM_Expanded, GWAS_Catalog 2025, Human Phenotype Ontology, Orphanet 2021 Prioritise variants; generate therapeutic hypotheses; connect mouse QTL to human disorders.
Which perturbation (drug, CRISPR, kinase, ageing) mimics or reverses my signature? LINCS_L1000_Chem_Pert_up/down, Achilles_fitness_decrease, DepMap_CRISPR 2023, Aging_Perturbations_GEO, RummaGEO_DrugPerturbations 2025 Connectivity mapping for drug repurposing; infer causal regulators; identify synthetic-lethal partners.
Which regulators target this gene set? ChEA 2022, ENCODE_TF_ChIP-seq 2015, TRANSFAC/JASPAR PWMs, TargetScan microRNA 2017 Upstream TF / miRNA prediction; build regulatory networks; explain motif enrichments.
What molecular feature binds/defines these proteins? Pfam_Domains 2019, InterPro_Domains, GlyGen_Glycosylated_Proteins 2022, COMPARTMENTS 2025 Find shared domains, PTMs, sub-cellular localization—useful for drug-targetability and biomarker design.
Are these genes essential under certain screens or contexts? DepMap_WG_CRISPR_Screens 2019, Achilles, CCLE_Proteomics 2020 Cross-reference dependency maps with omics to flag context-specific vulnerabilities.
How do metabolites, lncRNAs, viruses or microbes intersect? HMDB_Metabolites, Metabolomics_Workbench 2022, lncHUB_Co-Expression, Virus-Host_PPI 2020, Microbe_Perturbations_GEO Integrative multi-omics; host–pathogen interactions; non-coding RNA function discovery.

Practical pay-offs

  • Contextual precision: Tissue and perturbation libraries shrink false positives when analysing heterogeneous samples (e.g. GTEx_Tissues vs. GO-CC).

  • Cross-modal validation: Concordant hits in pathway and perturbation libraries strengthen mechanistic claims (e.g. “mTOR pathway” + “Rapamycin-down” drug set).

  • Hypothesis generation: Drug/CRISPR libraries turn enrichment into actionable screens, repurposing ideas or synthetic-lethal pairs.

  • Rare-disease insights: Phenotype-centric sets (HPO, MGI phenotypes, KOMP2) connect human and mouse genetics, aiding variant interpretation.

  • End-to-end omics: Domain-specific sets like localized and metabolite sets bridge transcriptomics to proteomics and metabolomics in a single pipeline.

In short, GO is a solid backbone, but a wider ecosystem of evidence allows us to answer richer integrative biological questions, from broad single-cell annotation to focused drug discovery hypothesis generation.

Basic statistics

The number of genomic databases covered by this data is vast. The databases covered here include all those in the table above, and many more.

CODE
import gseapy
names = gseapy.get_library_name()
print("{} databases found:".format(len(names)))
print('\n'.join(names))
218 databases found:
ARCHS4_Cell-lines
ARCHS4_IDG_Coexp
ARCHS4_Kinases_Coexp
ARCHS4_TFs_Coexp
ARCHS4_Tissues
Achilles_fitness_decrease
Achilles_fitness_increase
Aging_Perturbations_from_GEO_down
Aging_Perturbations_from_GEO_up
Allen_Brain_Atlas_10x_scRNA_2021
Allen_Brain_Atlas_down
Allen_Brain_Atlas_up
Azimuth_2023
Azimuth_Cell_Types_2021
BioCarta_2013
BioCarta_2015
BioCarta_2016
BioPlanet_2019
BioPlex_2017
CCLE_Proteomics_2020
COMPARTMENTS_Curated_2025
COMPARTMENTS_Experimental_2025
CORUM
COVID-19_Related_Gene_Sets
COVID-19_Related_Gene_Sets_2021
Cancer_Cell_Line_Encyclopedia
CellMarker_2024
CellMarker_Augmented_2021
ChEA_2013
ChEA_2015
ChEA_2016
ChEA_2022
Chromosome_Location
Chromosome_Location_hg19
ClinVar_2019
DGIdb_Drug_Targets_2024
DSigDB
Data_Acquisition_Method_Most_Popular_Genes
DepMap_CRISPR_GeneDependency_CellLines_2023
DepMap_WG_CRISPR_Screens_Broad_CellLines_2019
DepMap_WG_CRISPR_Screens_Sanger_CellLines_2019
Descartes_Cell_Types_and_Tissue_2021
Diabetes_Perturbations_GEO_2022
DisGeNET
Disease_Perturbations_from_GEO_down
Disease_Perturbations_from_GEO_up
Disease_Signatures_from_GEO_down_2014
Disease_Signatures_from_GEO_up_2014
DrugMatrix
Drug_Perturbations_from_GEO_2014
Drug_Perturbations_from_GEO_down
Drug_Perturbations_from_GEO_up
ENCODE_Histone_Modifications_2013
ENCODE_Histone_Modifications_2015
ENCODE_TF_ChIP-seq_2014
ENCODE_TF_ChIP-seq_2015
ENCODE_and_ChEA_Consensus_TFs_from_ChIP-X
ESCAPE
Elsevier_Pathway_Collection
Enrichr_Libraries_Most_Popular_Genes
Enrichr_Submissions_TF-Gene_Coocurrence
Enrichr_Users_Contributed_Lists_2020
Epigenomics_Roadmap_HM_ChIP-seq
FANTOM6_lncRNA_KD_DEGs
GO_Biological_Process_2021
GO_Biological_Process_2023
GO_Biological_Process_2025
GO_Cellular_Component_2021
GO_Cellular_Component_2023
GO_Cellular_Component_2025
GO_Molecular_Function_2021
GO_Molecular_Function_2023
GO_Molecular_Function_2025
GTEx_Aging_Signatures_2021
GTEx_Tissue_Expression_Down
GTEx_Tissue_Expression_Up
GTEx_Tissues_V8_2023
GWAS_Catalog_2019
GWAS_Catalog_2023
GWAS_Catalog_2025
GeDiPNet_2023
GeneSigDB
Gene_Perturbations_from_GEO_down
Gene_Perturbations_from_GEO_up
Genes_Associated_with_NIH_Grants
Genome_Browser_PWMs
GlyGen_Glycosylated_Proteins_2022
HDSigDB_Human_2021
HDSigDB_Mouse_2021
HMDB_Metabolites
HMS_LINCS_KinomeScan
HomoloGene
HuBMAP_ASCT_plus_B_augmented_w_RNAseq_Coexpression
HuBMAP_ASCTplusB_augmented_2022
HumanCyc_2015
HumanCyc_2016
Human_Gene_Atlas
Human_Phenotype_Ontology
IDG_Drug_Targets_2022
InterPro_Domains_2019
Jensen_COMPARTMENTS
Jensen_DISEASES
Jensen_DISEASES_Curated_2025
Jensen_DISEASES_Experimental_2025
Jensen_TISSUES
KEA_2013
KEA_2015
KEGG_2013
KEGG_2015
KEGG_2016
KEGG_2019_Human
KEGG_2019_Mouse
KEGG_2021_Human
KOMP2_Mouse_Phenotypes_2022
Kinase_Perturbations_from_GEO_down
Kinase_Perturbations_from_GEO_up
L1000_Kinase_and_GPCR_Perturbations_down
L1000_Kinase_and_GPCR_Perturbations_up
LINCS_L1000_CRISPR_KO_Consensus_Sigs
LINCS_L1000_Chem_Pert_Consensus_Sigs
LINCS_L1000_Chem_Pert_down
LINCS_L1000_Chem_Pert_up
LINCS_L1000_Ligand_Perturbations_down
LINCS_L1000_Ligand_Perturbations_up
Ligand_Perturbations_from_GEO_down
Ligand_Perturbations_from_GEO_up
MAGMA_Drugs_and_Diseases
MAGNET_2023
MCF7_Perturbations_from_GEO_down
MCF7_Perturbations_from_GEO_up
MGI_Mammalian_Phenotype_Level_4_2021
MGI_Mammalian_Phenotype_Level_4_2024
MSigDB_Computational
MSigDB_Hallmark_2020
MSigDB_Oncogenic_Signatures
Metabolomics_Workbench_Metabolites_2022
Microbe_Perturbations_from_GEO_down
Microbe_Perturbations_from_GEO_up
MoTrPAC_2023
Mouse_Gene_Atlas
NCI-60_Cancer_Cell_Lines
NCI-Nature_2016
NIBR_DRUGseq_2025_down
NIBR_DRUGseq_2025_up
NURSA_Human_Endogenous_Complexome
OMIM_Disease
OMIM_Expanded
Old_CMAP_down
Old_CMAP_up
Orphanet_Augmented_2021
PFOCR_Pathways_2023
PPI_Hub_Proteins
PanglaoDB_Augmented_2021
Panther_2015
Panther_2016
PerturbAtlas
PerturbAtlas_MouseGenePerturbationSigs
PerturbSeq_ReplogleK562
PerturbSeq_ReplogleRPE1
Pfam_Domains_2019
Pfam_InterPro_Domains
PheWeb_2019
PhenGenI_Association_2021
Phosphatase_Substrates_from_DEPOD
ProteomicsDB_2020
Proteomics_Drug_Atlas_2023
RNA-Seq_Disease_Gene_and_Drug_Signatures_from_GEO
Rare_Diseases_AutoRIF_ARCHS4_Predictions
Rare_Diseases_AutoRIF_Gene_Lists
Rare_Diseases_GeneRIF_ARCHS4_Predictions
Rare_Diseases_GeneRIF_Gene_Lists
Reactome_2022
Reactome_Pathways_2024
RummaGEO_DrugPerturbations_2025
RummaGEO_GenePerturbations_2025
Rummagene_kinases
Rummagene_signatures
Rummagene_transcription_factors
SILAC_Phosphoproteomics
SubCell_BarCode
SynGO_2022
SynGO_2024
SysMyo_Muscle_Gene_Sets
TF-LOF_Expression_from_GEO
TF_Perturbations_Followed_by_Expression
TG_GATES_2020
TISSUES_Curated_2025
TISSUES_Experimental_2025
TRANSFAC_and_JASPAR_PWMs
TRRUST_Transcription_Factors_2019
Table_Mining_of_CRISPR_Studies
Tabula_Muris
Tabula_Sapiens
TargetScan_microRNA
TargetScan_microRNA_2017
The_Kinase_Library_2023
The_Kinase_Library_2024
Tissue_Protein_Expression_from_Human_Proteome_Map
Tissue_Protein_Expression_from_ProteomicsDB
Transcription_Factor_PPIs
UK_Biobank_GWAS_v1
Virus-Host_PPI_P-HIPSTer_2020
VirusMINT
Virus_Perturbations_from_GEO_down
Virus_Perturbations_from_GEO_up
WikiPathway_2021_Human
WikiPathway_2023_Human
WikiPathways_2013
WikiPathways_2015
WikiPathways_2016
WikiPathways_2019_Human
WikiPathways_2019_Mouse
WikiPathways_2024_Human
WikiPathways_2024_Mouse
dbGaP
huMAP
lncHUB_lncRNA_Co-Expression
miRTarBase_2017
CODE
def get_goterm_dict_enrichr(organism='Human'):
    go_mf = gseapy.get_library(name='GO_Molecular_Function_2025', organism=organism)
    go_cc = gseapy.get_library(name='GO_Cellular_Component_2025', organism=organism)
    go_bp = gseapy.get_library(name='GO_Biological_Process_2025', organism=organism)
    go_dict = {}
    go_dict.update(go_mf)
    go_dict.update(go_bp)
    go_dict.update(go_cc)
    return go_dict

#all_libs = gseapy.get_library_name(organism='Human')
go_dict = get_goterm_dict_enrichr()
geneset_sizes = [len(go_dict[k]) for k in go_dict]
CODE
import numpy as np
import matplotlib.pyplot as plt

plt.hist(np.log(geneset_sizes), bins=100, density=True, cumulative=False)
plt.title("Distribution of gene set sizes (GO)")
plt.xlabel("log(Gene set size (number of genes))")
plt.ylabel("Density")
plt.show()

CODE
a = gseapy.get_library(name='ENCODE_TF_ChIP-seq_2015', organism='Human')
CODE
['ARCHS4_Cell-lines', 'ARCHS4_IDG_Coexp', 'ARCHS4_Kinases_Coexp', 'ARCHS4_TFs_Coexp', 'ARCHS4_Tissues', 'Achilles_fitness_decrease', 'Achilles_fitness_increase', 'Aging_Perturbations_from_GEO_down', 'Aging_Perturbations_from_GEO_up', 'Allen_Brain_Atlas_10x_scRNA_2021', 'Allen_Brain_Atlas_down', 'Allen_Brain_Atlas_up', 'Azimuth_2023', 'Azimuth_Cell_Types_2021', 'BioCarta_2013', 'BioCarta_2015', 'BioCarta_2016', 'BioPlanet_2019', 'BioPlex_2017', 'CCLE_Proteomics_2020', 'COMPARTMENTS_Curated_2025', 'COMPARTMENTS_Experimental_2025', 'CORUM', 'COVID-19_Related_Gene_Sets', 'COVID-19_Related_Gene_Sets_2021', 'Cancer_Cell_Line_Encyclopedia', 'CellMarker_2024', 'CellMarker_Augmented_2021', 'ChEA_2013', 'ChEA_2015', 'ChEA_2016', 'ChEA_2022', 'Chromosome_Location', 'Chromosome_Location_hg19', 'ClinVar_2019', 'DGIdb_Drug_Targets_2024', 'DSigDB', 'Data_Acquisition_Method_Most_Popular_Genes', 'DepMap_CRISPR_GeneDependency_CellLines_2023', 'DepMap_WG_CRISPR_Screens_Broad_CellLines_2019', 'DepMap_WG_CRISPR_Screens_Sanger_CellLines_2019', 'Descartes_Cell_Types_and_Tissue_2021', 'Diabetes_Perturbations_GEO_2022', 'DisGeNET', 'Disease_Perturbations_from_GEO_down', 'Disease_Perturbations_from_GEO_up', 'Disease_Signatures_from_GEO_down_2014', 'Disease_Signatures_from_GEO_up_2014', 'DrugMatrix', 'Drug_Perturbations_from_GEO_2014', 'Drug_Perturbations_from_GEO_down', 'Drug_Perturbations_from_GEO_up', 'ENCODE_Histone_Modifications_2013', 'ENCODE_Histone_Modifications_2015', 'ENCODE_TF_ChIP-seq_2014', 'ENCODE_TF_ChIP-seq_2015', 'ENCODE_and_ChEA_Consensus_TFs_from_ChIP-X', 'ESCAPE', 'Elsevier_Pathway_Collection', 'Enrichr_Libraries_Most_Popular_Genes', 'Enrichr_Submissions_TF-Gene_Coocurrence', 'Enrichr_Users_Contributed_Lists_2020', 'Epigenomics_Roadmap_HM_ChIP-seq', 'FANTOM6_lncRNA_KD_DEGs', 'GO_Biological_Process_2021', 'GO_Biological_Process_2023', 'GO_Biological_Process_2025', 'GO_Cellular_Component_2021', 'GO_Cellular_Component_2023', 'GO_Cellular_Component_2025', 'GO_Molecular_Function_2021', 'GO_Molecular_Function_2023', 'GO_Molecular_Function_2025', 'GTEx_Aging_Signatures_2021', 'GTEx_Tissue_Expression_Down', 'GTEx_Tissue_Expression_Up', 'GTEx_Tissues_V8_2023', 'GWAS_Catalog_2019', 'GWAS_Catalog_2023', 'GWAS_Catalog_2025', 'GeDiPNet_2023', 'GeneSigDB', 'Gene_Perturbations_from_GEO_down', 'Gene_Perturbations_from_GEO_up', 'Genes_Associated_with_NIH_Grants', 'Genome_Browser_PWMs', 'GlyGen_Glycosylated_Proteins_2022', 'HDSigDB_Human_2021', 'HDSigDB_Mouse_2021', 'HMDB_Metabolites', 'HMS_LINCS_KinomeScan', 'HomoloGene', 'HuBMAP_ASCT_plus_B_augmented_w_RNAseq_Coexpression', 'HuBMAP_ASCTplusB_augmented_2022', 'HumanCyc_2015', 'HumanCyc_2016', 'Human_Gene_Atlas', 'Human_Phenotype_Ontology', 'IDG_Drug_Targets_2022', 'InterPro_Domains_2019', 'Jensen_COMPARTMENTS', 'Jensen_DISEASES', 'Jensen_DISEASES_Curated_2025', 'Jensen_DISEASES_Experimental_2025', 'Jensen_TISSUES', 'KEA_2013', 'KEA_2015', 'KEGG_2013', 'KEGG_2015', 'KEGG_2016', 'KEGG_2019_Human', 'KEGG_2019_Mouse', 'KEGG_2021_Human', 'KOMP2_Mouse_Phenotypes_2022', 'Kinase_Perturbations_from_GEO_down', 'Kinase_Perturbations_from_GEO_up', 'L1000_Kinase_and_GPCR_Perturbations_down', 'L1000_Kinase_and_GPCR_Perturbations_up', 'LINCS_L1000_CRISPR_KO_Consensus_Sigs', 'LINCS_L1000_Chem_Pert_Consensus_Sigs', 'LINCS_L1000_Chem_Pert_down', 'LINCS_L1000_Chem_Pert_up', 'LINCS_L1000_Ligand_Perturbations_down', 'LINCS_L1000_Ligand_Perturbations_up', 'Ligand_Perturbations_from_GEO_down', 'Ligand_Perturbations_from_GEO_up', 'MAGMA_Drugs_and_Diseases', 'MAGNET_2023', 'MCF7_Perturbations_from_GEO_down', 'MCF7_Perturbations_from_GEO_up', 'MGI_Mammalian_Phenotype_Level_4_2021', 'MGI_Mammalian_Phenotype_Level_4_2024', 'MSigDB_Computational', 'MSigDB_Hallmark_2020', 'MSigDB_Oncogenic_Signatures', 'Metabolomics_Workbench_Metabolites_2022', 'Microbe_Perturbations_from_GEO_down', 'Microbe_Perturbations_from_GEO_up', 'MoTrPAC_2023', 'Mouse_Gene_Atlas', 'NCI-60_Cancer_Cell_Lines', 'NCI-Nature_2016', 'NIBR_DRUGseq_2025_down', 'NIBR_DRUGseq_2025_up', 'NURSA_Human_Endogenous_Complexome', 'OMIM_Disease', 'OMIM_Expanded', 'Old_CMAP_down', 'Old_CMAP_up', 'Orphanet_Augmented_2021', 'PFOCR_Pathways_2023', 'PPI_Hub_Proteins', 'PanglaoDB_Augmented_2021', 'Panther_2015', 'Panther_2016', 'PerturbAtlas', 'PerturbAtlas_MouseGenePerturbationSigs', 'PerturbSeq_ReplogleK562', 'PerturbSeq_ReplogleRPE1', 'Pfam_Domains_2019', 'Pfam_InterPro_Domains', 'PheWeb_2019', 'PhenGenI_Association_2021', 'Phosphatase_Substrates_from_DEPOD', 'ProteomicsDB_2020', 'Proteomics_Drug_Atlas_2023', 'RNA-Seq_Disease_Gene_and_Drug_Signatures_from_GEO', 'Rare_Diseases_AutoRIF_ARCHS4_Predictions', 'Rare_Diseases_AutoRIF_Gene_Lists', 'Rare_Diseases_GeneRIF_ARCHS4_Predictions', 'Rare_Diseases_GeneRIF_Gene_Lists', 'Reactome_2022', 'Reactome_Pathways_2024', 'RummaGEO_DrugPerturbations_2025', 'RummaGEO_GenePerturbations_2025', 'Rummagene_kinases', 'Rummagene_signatures', 'Rummagene_transcription_factors', 'SILAC_Phosphoproteomics', 'SubCell_BarCode', 'SynGO_2022', 'SynGO_2024', 'SysMyo_Muscle_Gene_Sets', 'TF-LOF_Expression_from_GEO', 'TF_Perturbations_Followed_by_Expression', 'TG_GATES_2020', 'TISSUES_Curated_2025', 'TISSUES_Experimental_2025', 'TRANSFAC_and_JASPAR_PWMs', 'TRRUST_Transcription_Factors_2019', 'Table_Mining_of_CRISPR_Studies', 'Tabula_Muris', 'Tabula_Sapiens', 'TargetScan_microRNA', 'TargetScan_microRNA_2017', 'The_Kinase_Library_2023', 'The_Kinase_Library_2024', 'Tissue_Protein_Expression_from_Human_Proteome_Map', 'Tissue_Protein_Expression_from_ProteomicsDB', 'Transcription_Factor_PPIs', 'UK_Biobank_GWAS_v1', 'Virus-Host_PPI_P-HIPSTer_2020', 'VirusMINT', 'Virus_Perturbations_from_GEO_down', 'Virus_Perturbations_from_GEO_up', 'WikiPathway_2021_Human', 'WikiPathway_2023_Human', 'WikiPathways_2013', 'WikiPathways_2015', 'WikiPathways_2016', 'WikiPathways_2019_Human', 'WikiPathways_2019_Mouse', 'WikiPathways_2024_Human', 'WikiPathways_2024_Mouse', 'dbGaP', 'huMAP', 'lncHUB_lncRNA_Co-Expression', 'miRTarBase_2017']

Here is a list of the protein-coding genes that we’ll be considering in this post.

CODE
import anndata, scanpy as sc
import pandas as pd, matplotlib.pyplot as plt, time
from sklearn import preprocessing
sc.settings.verbosity = 0
%matplotlib inline

import requests, os, numpy as np
import scipy, scipy.sparse

gene_names_url = 'https://raw.githubusercontent.com/kundajelab/coessentiality-browser/master/data/vizdf_GLS01_CO99.csv'
gene_names = np.ravel(pd.read_csv(gene_names_url, header=0, index_col=None, sep="\t")['gene_names'])

To get a basic idea of the gene sets in GO, we can plot the distribution of cluster sizes.

CODE
def membership_mat(genesets_dct, gene_names, file_path='memberships.npz'):
    """ Returns a sparse matrix of gene memberships to clusters.
    The matrix is of shape (n_genes, n_clusters), and is saved to disk as a sparse matrix.
    """
    if not os.path.exists(file_path):
        # Reverse the key-value mapping of the dictionary
        genesets_dict_rev = { g: [] for g in gene_names }
        for key, value in genesets_dct.items():
            for gene in value:
                if gene in genesets_dict_rev:
                    genesets_dict_rev[gene].append(key)
        set_names = list(genesets_dct.keys())
        nclusts = len(genesets_dct)
        clust_mmships = np.zeros((len(gene_names), nclusts))
        for i in range(len(set_names)):
            if set_names[i] in genesets_dct:
                clust_mmships[np.isin(gene_names, genesets_dct[set_names[i]]), i] = 1
            if i%250 == 0:
                print(i)
        clust_mmships = scipy.sparse.csr_matrix(clust_mmships)
        scipy.sparse.save_npz(file_path, clust_mmships)
    else:
        clust_mmships = scipy.sparse.load_npz(file_path)
    return clust_mmships

clust_mmships_GO = membership_mat(go_dict, gene_names, file_path='GO_memberships.npz')

plt.hist(np.sum(clust_mmships_GO.toarray(), axis=0), bins=100, density=True, cumulative=False, range=(0,100))
plt.title("Distribution of cluster sizes (GO)")
plt.xlabel("Cluster size (number of genes)")
plt.ylabel("Fraction of clusters")
plt.show()

We see that most of the gene sets are small, with a long tail of larger gene sets. So the modules being highlighted here are mostly relatively small and informative.

Data-driven gene sets

Data like gene expression can be used to partition genes into gene modules. One example of great functional and conceptual relevance is in our paper (Wainberg et al. 2021). The paper uses GO-derived gene sets as a crucial component to build a functionally relevant graph which faithfully recapitulates known biology, so it’s a great template to build upon. So we can delve deeper into some of the computational ideas from this paper.

The goal is to construct a network in which neighboring genes have similar functions. The most common way of constructing such a network is through co-expression analysis. But this has serious limitations with spurious correlations, because correlation is much more common than causation.

A much better idea to construct functionally relevant gene networks is to use CRISPR knockout data instead of co-expression data. The idea here is that two genes may be functionally related if they are coessential – necessary for cell survival – in the same cell types. Therefore, performing a regression of coessentiality scores between a pair of genes gives a sensitive measure of functional similarity between the genes. And a gene-gene graph can be constructed from performing these regressions (through savvy implementation, they can be done in parallel on a laptop) through a generalized least squares (GLS) procedure.

How to construct this graph is outside the scope of this post. But for now, here’s code to load a precomputed version of the graph, roughly adapting associated code from (Wainberg et al. 2021).

CODE
def build_knn(mat, k=10, symmetrize_type='inclusive'):
    """
    Build symmetric kNN graph, returning a sparse matrix. 

    Parameters
    ----------
    mat : numpy array
        Input matrix of similarities.
    k : int
        Number of nearest neighbors.
    symmetrize_type : str
        Type of symmetrization to use.
        * 'mutual': min(A, A^T), making degree \leq k
        * 'inclusive': max(A, A^T), making degree \geq k
    
    Returns
    -------
    sparse_adj : scipy.sparse.csr_matrix
        Sparse adjacency matrix of the kNN graph.
    """
    sparse_adj = scipy.sparse.csr_matrix(np.zeros(mat.shape))
    for i in range(mat.shape[0]):
        matarr = mat[i,:]
        nbrs = np.argsort(matarr)[::-1][:k]    # Highest k similarities
        sparse_adj[i, nbrs] = 1.0   #matarr[nbrs]
    if (symmetrize_type == 'mutual'):
        return sparse_adj.minimum(sparse_adj.transpose())
    elif (symmetrize_type == 'inclusive'):
        return sparse_adj.maximum(sparse_adj.transpose())
    else:
        print("Mode not yet implemented.")
        return sparse_adj

A clustering / community detection algorithm can be used to partition the graph into gene modules. We already did this for the paper (Wainberg et al. 2021), and the resulting gene sets are loaded below.

CODE
# Read in clusterONE analysis results.
def get_coess_module_dict():
    clusterone_url = 'https://raw.githubusercontent.com/kundajelab/coessentiality/master/clusterOne_clusters.tsv'
    cone_clusts = pd.read_csv(clusterone_url, sep="\t", index_col=0, header=0)
    clustdata = cone_clusts.iloc[:,10:].fillna(value='')
    genesets_dict = {}
    for i in range(clustdata.shape[0]):
        a = clustdata.iloc[i,:].values
        genesets_dict[clustdata.index[i]] = a[a != '']
    return genesets_dict

genesets_dict = get_coess_module_dict()
clust_mmships_coessentiality = membership_mat(genesets_dict, gene_names, file_path='clusterone_memberships.npz')

plt.hist(np.sum(clust_mmships_coessentiality.toarray(), axis=0), bins=100, density=True, cumulative=False, range=(0,100))
plt.title("Distribution of cluster sizes (coessentiality-based modules)")
plt.xlabel("Cluster size (number of genes)")
plt.ylabel("Fraction of clusters")
plt.show()

Graphs from gene sets

The next step is to process these gene sets into a neighborhood-based matrix representing the gene sets.

Pairwise similarity between genes

One way to do this is by creating a graph gg_jaccsims where each node is a gene, and each edge is a Jaccard similarity measure between the GO terms of the two genes. The Jaccard similarity is the number of sets that both genes are in (“intersection”), divided by the number of sets that either gene is in (“union”).

CODE
def pairwise_jaccard_graph(input_memberships):
    """ Given a bunch of entities (like genes) and their binary memberships in some sets (like GO terms), 
    construct graph of Jaccard similarities between the entities. 

    Parameters
    ----------
    input_memberships : scipy.sparse.csr_matrix
        Sparse matrix of set memberships. Each row is an entity, each column is a set.

    Returns
    -------
    numpy.ndarray
        Matrix of pairwise Jaccard similarities. 
    """
    a = input_memberships.dot(input_memberships.T).toarray()
    tmp = np.add(np.add(-a.T, np.diagonal(a)).T, np.diagonal(a))
    return np.nan_to_num(np.divide(a, tmp))


# Build and save pairwise Jaccard similarities between genes, according to the clusterings given.
gg_jaccsims_GO = pairwise_jaccard_graph(clust_mmships_GO)
gg_jaccsims_coessentiality = pairwise_jaccard_graph(clust_mmships_coessentiality)

There are stark differences between the data-derived gene sets and GO. The data-derived gene sets are significantly smaller and more specific, resulting in higher similarities between genes using our method above.

CODE
plt.hist(gg_jaccsims_GO[gg_jaccsims_GO > 0], bins=150, cumulative=False, density=True, label='GO')
plt.hist(gg_jaccsims_coessentiality[gg_jaccsims_coessentiality > 0], bins=150, cumulative=False, density=True, label='Coessentiality')
plt.title("Distribution of nonzero Jaccard similarities between genes")
plt.xlabel("Jaccard similarity")
plt.ylabel("Fraction of gene pairs")
plt.legend()
plt.show()

The graphs are not particularly sparse, so we sparsify this graph by building a symmetric kNN matrix from it.

CODE
jaccard_graph_GO = build_knn(gg_jaccsims_GO, k=10, symmetrize_type='inclusive')
jaccard_graph_coess = build_knn(gg_jaccsims_coessentiality, k=10, symmetrize_type='inclusive')

This is ultimately a graph which is much easier to deal with for computation, as we’ve seen discussed elsewhere. By using the Jaccard similarity, we normalize for the fact that some genes are in many more GO terms than others.

Gene sets as neighborhoods

Here’s another way of building a graph from gene sets. Each gene set can be represented in the overall gene-gene graph by a clique – a subgraph where each gene in the gene set is connected to every other gene in it. Adding the resulting cliques together builds a graph that seamlessly deals with overlapping gene sets. When the gene sets are relatively small, as above, this resulting graph is also relatively sparse.

As we have seen, gene set sizes vary hugely, and we must normalize them accordingly when incorporating them. We normalize each graph by the number of nodes in the gene set, which also normalizes the added degrees of each node in the clique. We save the resulting graph for further use.

CODE
def create_clique_adj_matrix(master_list, active_list, weight='reciprocal'):
    """ Create clique composed of the elements in active_list that are in master_list.
    """
    n = len(master_list)
    data, row_indices, col_indices = [], [], []
    # Create a dictionary for faster look-up
    active_list = np.intersect1d(master_list, active_list)
    gene_to_index = {gene: index for index, gene in enumerate(master_list)}
    for gene1 in active_list:
        for gene2 in active_list:
            if gene1 != gene2:
                row = gene_to_index[gene1]
                col = gene_to_index[gene2]
                row_indices.append(row)
                col_indices.append(col)
                if weight=='reciprocal':
                    data.append(1.0/len(active_list))
                elif weight=='binary':
                    data.append(1.0)
    return scipy.sparse.csr_matrix((data, (row_indices, col_indices)), shape=(n, n))


def genesets_clique_matrix(geneset_dict, gene_names, file_path='clique_graph.npz'):
    """ Return dictionary mapping GO terms to their clique adjacency matrices. 
    """
    if not os.path.exists(file_path):
        matrix_return = scipy.sparse.csr_matrix((len(gene_names), len(gene_names)))
        i = 0
        for termname in geneset_dict:
            i += 1
            matrix_return += create_clique_adj_matrix(gene_names, geneset_dict[termname])
        scipy.sparse.save_npz(file_path, matrix_return)
    else:
        matrix_return = scipy.sparse.load_npz(file_path)
    return matrix_return
CODE
clique_graph = genesets_clique_matrix(go_dict, gene_names, file_path='GO_clique_graph.npz')
clique_graph_GO = build_knn(clique_graph, k=10, symmetrize_type='inclusive')

Processing the graph

We can now process the graph to make it more useful for downstream analysis, by applying several data science techniques to it, following (Wainberg et al. 2021).

Mixing in a connected component

One potential issue with using the GO graph for learning is that it is not fully connected - there are many genes that are not connected to any other genes, because they don’t appear in GO’s gene modules.

CODE
all_genes_go_dict = list(set([item for sublist in [go_dict[k] for k in go_dict.keys()] for item in sublist]))
num_genes_connected = len(np.intersect1d(gene_names, all_genes_go_dict))
print("{} genes are in at least one GO gene set. {} are in no GO gene sets.".format(
    num_genes_connected, len(gene_names) - num_genes_connected))

The same is true of the coessentiality-derived gene modules.

CODE
all_genes_go_dict = list(set([item for sublist in [genesets_dict[k] for k in genesets_dict.keys()] for item in sublist]))
num_genes_connected = len(np.intersect1d(gene_names, all_genes_go_dict))
print("{} genes are in at least one coessentiality-derived gene set. {} are in no coessentiality-derived GO gene sets.".format(
    num_genes_connected, len(gene_names) - num_genes_connected))

One way to solve this issue is to use the fully connected graph derived from coessentiality data, and mix it in with the GO graph.

CODE
def coessentiality_GLS_graphs(
    file_path='GLS_pvals_10NN.npz', fname_GLS_p = 'GLS_p.npy', 
    GLS_p_url='https://mitra.stanford.edu/bassik/coessentiality/GLS_p.npy', 
    min_logp=1000
):
    """ Return sparse matrix of coessentiality (-log p) values between genes, calculated using GLS.
    """
    if not os.path.exists(fname_GLS_p):
        response = requests.get(GLS_p_url)
        with open(fname_GLS_p, 'wb') as fp:
            fp.write(response.content)
    GLS_p = np.load(fname_GLS_p)
    res = -np.log(GLS_p)
    res[np.isinf(res)] = min_logp
    if not os.path.exists(file_path):
        GLS_pvals_100 = build_knn(res, k=100, symmetrize_type='inclusive')
        GLS_pvals_10 = build_knn(GLS_pvals_100.toarray(), k=10, symmetrize_type='inclusive')
        scipy.sparse.save_npz(file_path, GLS_pvals_10)
    GLS_pvals_10 = scipy.sparse.load_npz(file_path)
    return GLS_pvals_10, res

Putting it all together, mixing the two graphs’ adjacency matrices together with some weighting, gives a function for getting a gene-gene graph.

CODE
def get_gene_gene_graph(
    mode='coessentiality', 
    graph_construction='jaccard', 
    file_path='gene_gene_graph.npz', 
    frac_gg_graph=0.99
):
    if os.path.exists(file_path):
        return scipy.sparse.load_npz(file_path)
    else:
        # Build the fully connected graph from coessentiality data
        GLS_pvals_10, res = coessentiality_GLS_graphs()
        if mode == 'coessentiality':
            geneset_dict = get_coess_module_dict()
            fprefix = 'clusterone_'
        elif mode == 'GO':
            geneset_dict = get_goterm_dict_enrichr()
            fprefix = 'GO_'
        if graph_construction == 'clique':
            clique_graph = genesets_clique_matrix(geneset_dict, gene_names, file_path=fprefix+'clique_graph.npz')
            gg_graph = build_knn(clique_graph, k=10, symmetrize_type='inclusive')
        elif graph_construction == 'jaccard':
            clust_mmships = membership_mat(geneset_dict, gene_names, file_path=fprefix+'memberships.npz')
            gg_jaccsims = pairwise_jaccard_graph(clust_mmships)
            gg_graph = build_knn(gg_jaccsims, k=10, symmetrize_type='inclusive')
        # Construct the combined graph
        gene_gene_graph = scipy.sparse.csr_matrix(
            (frac_gg_graph * gg_graph) + 
            ((1-frac_gg_graph) * GLS_pvals_10)
        )
        scipy.sparse.save_npz(file_path, gene_gene_graph)
CODE
gg_graph = get_gene_gene_graph(
    graph_construction='jaccard', 
    file_path='gene_gene_graph.npz'
)
gg_clique = get_gene_gene_graph(
    graph_construction='clique', 
    file_path='gene_gene_graph_clique.npz'
)

Smoothing with diffusion

This involves using diffusion maps to emphasize the more global coarse-scale structure of the graph, smoothing out high-frequency components to a degree such that the top few components capture most of the variance in the graph.

This is a useful technique for downstream learning, in the same way that PCA-based dimensionality reduction is used in computational workflows. In addition, it can help to provide a more visualizable representation of the input graph.

CODE
def compute_transitions(adj_mat, alpha=1.0, sym=True):
    """Compute symmetrized transition matrix. 
    alpha : The exponent of the diffusion kernel. 
    * 1 = Laplace-Beltrami (density)-normalized [default]. 
    * 0.5 = normalized graph Laplacian (Fokker-Planck dynamics, cf. "Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators" NIPS2006).
    * 0 = classical graph Laplacian. 
    """
    similarity_mat = symmetric_part(adj_mat)
    dens = np.asarray(similarity_mat.sum(axis=0))  # dens[i] is an estimate for the sampling density at point i.
    K = scipy.sparse.spdiags(np.power(dens, -alpha), 0, similarity_mat.shape[0], similarity_mat.shape[0])
    W = scipy.sparse.csr_matrix(K.dot(similarity_mat).dot(K))
    z = np.sqrt(np.asarray(W.sum(axis=0)).astype(np.float64))    # sqrt(density)
    Zmat = scipy.sparse.spdiags(1.0/z, 0, W.shape[0], W.shape[0])
    return Zmat.dot(W).dot(Zmat) if sym else Zmat.power(2).dot(W)

def compute_eigen(adj_mat, n_comps=2, sym=True):
    """
    Compute eigendecomposition of sparse transition kernel matrix.
    (Computing in 64-bit can matter for analysis use beyond just visualization!).
    sym indicates that the transition matrix should be symmetrized.
    NOTE: use np.linalg.eigh if we want to support non-sparse matrices.
    """
    if sym:
        eigvals, evecs = scipy.sparse.linalg.eigsh(adj_mat.astype(np.float64), k=n_comps)
    else:
        # eigvals, evecs = scipy.sparse.linalg.eigs(adj_mat.astype(np.float64), k=n_comps)   # DON'T USE without further thresholding: complex-number underflow issues
        evecs, eigvals, _ = scipy.sparse.linalg.svds(adj_mat.astype(np.float64), k=n_comps)
    #eigvals, evecs = eigvals.astype(np.float32), evecs.astype(np.float32)
    sorted_ndces = np.argsort(np.abs(eigvals))[::-1]
    return eigvals[sorted_ndces], evecs[:, sorted_ndces]

""" Returns the symmetrized version of the input matrix. (The anti-symmetrized version is A - symmetric_part(A) .) """
def symmetric_part(A):
    return 0.5*(A + A.T)


"""
Compute diffusion map embedding.
sym_compute: Whether to compute the SVD on a symmetric matrix.
sym_return: Whether to return the SVD of the symmetrized transition matrix.
Assumes that {sym_return => sym_compute} holds, i.e. doesn't allow sym_compute==False and sym_return==True.
Returns (n_comps-1) components, excluding the first which is the same for all data.
"""
def diffmap_proj(
    adj_mat, 
    t=None, 
    min_energy_frac=0.95, 
    n_dims=None,      # Number of dims to return
    n_comps=None,     # Number of comps to use in computation.
    return_eigvals=False, 
    embed_type='diffmap', 
    sym_compute=True, 
    sym_return=False
):
    if n_comps is None:     # When data are high-d dimensional with log2(d) \leq 14-16 as for scRNA, 2K eigenvector computation is tolerably fast; change otherwise
        n_comps = min(2000, adj_mat.shape[0]-1)
    if n_dims is None:
        n_dims = n_comps - 1
    eigvals, evecs = compute_eigen(compute_transitions(adj_mat, sym=sym_compute), n_comps=n_comps, sym=sym_compute)
    if sym_compute:
        evecs_sym = evecs
        evecs_unsym = np.multiply(evecs, np.outer(1.0/evecs[:,0].astype(np.float64), np.ones(evecs.shape[1])))
    else:
        evecs_unsym = evecs
    if sym_return:
        if not sym_compute:
            return None
        eigvecs_normalized = preprocessing.normalize(np.real(evecs_sym), axis=0, norm='l2')
    else:
        eigvecs_normalized = preprocessing.normalize(np.real(evecs_unsym), axis=0, norm='l2')
    
    if t is None:     # Use min_energy_frac to determine the fraction of noise variance 
        t = min_t_for_energy(eigvals, n_dims+1, min_energy_frac)
    frac_energy_explained = np.cumsum(np.power(np.abs(eigvals), t)/np.sum(np.power(np.abs(eigvals), t)))[n_dims]
    print("{} dimensions contain about {:.2f} fraction of the variance in the first {} dimensions (Diffusion time = {:.3f})".format(
        n_dims+1, frac_energy_explained, n_comps, t))
    if embed_type=='naive':
        all_comps = eigvecs_normalized
    elif embed_type=='diffmap':
        all_comps = np.power(np.abs(eigvals), t) * eigvecs_normalized
    elif embed_type=='commute': # Check the correctness!
        all_comps = np.power((1-np.abs(eigvals)), -t/2) * eigvecs_normalized
    if not return_eigvals:
        return all_comps[:,1:(n_dims+1)]     # Return n_dims dimensions, skipping the first trivial one.
    else:
        return (all_comps[:,1:(n_dims+1)], eigvals)    # Return the eigenvalues as well.

def min_t_for_energy(eigvals, desired_dim, min_energy_frac, max_t=None):
    # Calc upper bound for t principal eigengap (if g=lbda1/lbda2, then g^t < 100 implies t < log(100)/log(g) ). Don't change unless you know what you're doing!
    if max_t is None:
        max_t = np.log(100)/np.log(max(eigvals[0]/eigvals[1], 1.01))
    f = lambda t: (np.sum(heat_eigval_dist(eigvals, t)[:desired_dim]) - min_energy_frac)
    if f(0)*f(max_t) >= 0:    # since f is always nondecreasing this means the zero isn't in the interval
        return max_t if f(0) < 0 else 0
    return scipy.optimize.brentq(f, 0, max_t)

def heat_eigval_dist(eigvals, t):
    return np.power(np.abs(eigvals), t)/np.sum(np.power(np.abs(eigvals), t))
CODE
def smooth_anneal_graph(in_graph, n_cmps=100, reduced_dim=50, min_energy_frac=0.9):
    itime = time.time()
    emap_heat = diffmap_proj(in_graph, n_comps=n_cmps, n_dims=40, min_energy_frac=min_energy_frac, embed_type='diffmap', return_eigvals=False)
    print("Diffusion components computed. Time: {:.2f}".format(time.time() - itime))
    ann_heat = anndata.AnnData(X=emap_heat[:, :40])
    ann_heat.obs['gene_names'] = gene_names
    sc.pp.neighbors(ann_heat)
    sc.tl.umap(ann_heat)
    return ann_heat


gene_gene_graph = gg_graph
ann_heat = smooth_anneal_graph(gene_gene_graph, n_cmps=100, reduced_dim=50, min_energy_frac=0.9)
path_final_graph = 'coess_atlas_gene_graph.npz'
if not os.path.exists(path_final_graph):
    scipy.sparse.save_npz(path_final_graph, ann_heat.obsp['connectivities'])

itime = time.time()
#Plot eigenvalue spectrum of resulting graph.
emap_naive, eigvals = diffmap_proj(gene_gene_graph, t=0, n_comps=50, embed_type='naive', return_eigvals=True)
print("Laplacian eigenmap computed. Time: {:.2f} sec.".format(time.time() - itime))
plt.title("Eigenvalue spectrum of Laplacian")
plt.xlabel("Eigenvalue index")
plt.ylabel("Eigenvalue magnitude")
plt.scatter(x=range(50), y=np.abs(eigvals)[:50])
plt.show()

Verify that the overall graph includes known gene modules

Read modules highlighted in the paper.

CODE
url_modules = 'https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8763319/bin/NIHMS1768167-supplement-Supp_Data_2.zip'

import gzip, requests
from zipfile import ZipFile

file_path = 'NIHMS1768167-supplement-Supp_Data_2.zip'

response = requests.get(url_modules)
with open(file_path, 'wb') as fp:
    fp.write(response.content)

# opening the zip file in READ mode 
with ZipFile(file_path, 'r') as zip: 
    # printing all the contents of the zip file 
    zip.printdir() 
  
    # extracting all the files 
    print('Extracting all the files now...') 
    zip.extractall() 
    print('Done!') 


xl = pd.ExcelFile('Supplementary_Data_2.xlsx')
print(xl.sheet_names)

ref_modules = xl.parse('Manual Annotation References')

module_dict = {}
module_delims = list(np.where(~pd.isna(ref_modules['Module name']))[0]) + [ref_modules.shape[0]]
for i in range(len(module_delims)-1):
    these_genes_df = ref_modules.iloc[module_delims[i]:module_delims[i+1], :]
    module_dict[these_genes_df.iloc[0, :2]['Module name']] = np.ravel(these_genes_df['Gene name'])
CODE
for m in module_dict:
    ann_heat.obs['selected'] = np.isin(gene_names, module_dict[m]).astype(int)
    arrsize = ann_heat.obs['selected']*150 + 4
    #sc.pl.umap(ann_heat, color='selected', title=m, size=arrsize, cmap='viridis')

First, we need some utility code to deal with gene sets. This computes similarities between genes based on their membership in gene sets. Given a particular universe of gene sets, the similarity between two genes is the Jaccard index between their set memberships.

CODE
ann_heat.obs.index = ann_heat.obs['gene_names']
mods = ann_heat.obs['gene_names'].copy()
for m in module_dict:
    mods[module_dict[m]] = m
ann_heat.obs['module'] = mods
for m in module_dict:
    print(m, (ann_heat[module_dict[m]].obsp['connectivities'] > 0).mean())
CODE
ann_heat.obs.index = ann_heat.obs['gene_names']
mods = ann_heat.obs['gene_names'].copy()
for m in module_dict:
    mods[module_dict[m]] = m
ann_heat.obs['module'] = mods
for m in module_dict:
    print(m, (ann_heat[module_dict[m]].obsp['connectivities'] > 0).mean())
CODE
ann_heat.obs.index = ann_heat.obs['gene_names']
ann_heat.obs['module'] = ann_heat.obs['gene_names']
for m in module_dict:
    ann_heat[module_dict[m]].obs['module'] = m

for m in module_dict:
    print(m, (ann_heat[module_dict[m]].obsp['connectivities'] > 0).mean())

Incorporating precision medicine data

GO information is tightly related to other indirect ways of getting at gene function. The interplay between gene regulation and drug responses is a rich source of information about gene function. These sources are particularly prevalent in medicine and can be very useful in drug development.

The paper (Chandak, Huang, and Zitnik 2023) builds a knowledge graph with multiple types of entities for this purpose. This provides a useful way to contextually view the relationships between genes, drugs, and diseases.

CODE
import pandas as pd
primekg_df = pd.read_csv('primeKG/kg.csv')

primekg_df[primekg_df['display_relation'] == 'interacts with']['y_type'].value_counts()
primekg_df#['x_type'].value_counts()

References

Chandak, Payal, Kexin Huang, and Marinka Zitnik. 2023. “Building a Knowledge Graph to Enable Precision Medicine.” Scientific Data 10 (1): 67.
Subramanian, Aravind, Pablo Tamayo, Vamsi K Mootha, Sayan Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda Paulovich, et al. 2005. “Gene Set Enrichment Analysis: A Knowledge-Based Approach for Interpreting Genome-Wide Expression Profiles.” Proceedings of the National Academy of Sciences 102 (43): 15545–50.
Wainberg, Michael, Roarke A Kamber, Akshay Balsubramani, Robin M Meyers, Nasa Sinnott-Armstrong, Daniel Hornburg, Lihua Jiang, et al. 2021. “A Genome-Wide Atlas of Co-Essential Modules Assigns Function to Uncharacterized Genes.” Nature Genetics 53 (5): 638–49.

Reuse