Looking up gene sets in databases

dataviz

genomics

Some wrapper code for gProfiler

Author

Akshay Balsubramani

Modified

December 22, 2022

Looking up the significance of gene sets

The Gene Ontology (GO) is a controlled vocabulary for describing gene function. It is organized as a directed acyclic graph with three main branches: biological process, molecular function, and cellular component. The GO is a useful tool for summarizing the results of an analysis over protein-coding genes.

Querying the GO is often performed alongside queries to other databases, such as those for protein-protein interaction. Here is code for querying several major databases all at once, using the g:Profiler web service (Raudvere et al. 2019). Along with the excellent gget package (Luebbert and Pachter 2023), this covers many common gene-set lookup use cases. Here is a wrapper, adapting an excellent existing Python port for the gProfiler API to report the heterogeneous results of the lookup in a pandas dataframe.

import requests
import numpy as np, pandas as pd

BASE_URL = "http://biit.cs.ut.ee/gprofiler/"
HEADERS = {'User-Agent': 'python-gprofiler'}

def gprofiler(
    query, topk=20, organism='hsapiens', ordered_query=False, significant=True, 
    exclude_iea=False, region_query=False, max_p_value=1.0, max_set_size=0, 
    correction_method='analytical', hier_filtering='none', 
    domain_size='annotated', custom_bg=[], numeric_ns='', no_isects=False, 
    png_fn=None, include_graph=False, src_filter=None, mode='enrich'
):
    ''' Annotate gene list functionally
    Interface to the g:Profiler tool for finding enrichments in gene lists. Organism
    names are constructed by concatenating the first letter of the name and the
    family name. Example: human - 'hsapiens', mouse - 'mmusculus'. If requesting PNG
    output, the request is directed to the g:GOSt tool in case 'query' is a vector
    and the g:Cocoa (compact view of multiple queries) tool in case 'query' is a
    list. PNG output can fail (return FALSE) in case the input query is too large.
    In such case, it is advisable to fall back to a non-image request.
    Returns a pandas DataFrame with enrichment results or None
    '''
    query_url = ''
    mode = 'enrich'   # Currently gprofiler gconvert gives an HTTP 405 error.
    if mode == 'enrich':
        my_url = BASE_URL + 'gcocoa.cgi'
    elif mode == 'lookup':
        my_url = BASE_URL + 'convert.cgi'
    wantpng = True if png_fn else False
    output_type = 'mini_png' if wantpng else 'mini'
    if wantpng:
        raise NotImplementedError('PNG Output not implemented')
        return
    if include_graph:
        raise NotImplementedError('Biogrid Interactions not implemented (include_graph)')
        return

    # Query
    qnames = list(query)
    if not qnames:
        raise ValueError('Missing query')
    query_url = ' '.join(qnames)

    # Significance threshold
    if correction_method == 'gSCS':
        correction_method = 'analytical'
    if correction_method not in ('analytical', 'fdr', 'bonferroni'):
        raise ValueError("Multiple testing correction method not recognized (correction_method)")

    # Hierarchical filtering
    if hier_filtering not in ('none', 'moderate', 'strong'):
        raise ValueError("hier_filtering must be one of \"none\", \"moderate\" or \"strong\"")
    if hier_filtering == 'strong':
        hier_filtering = 'compact_ccomp'
    elif hier_filtering == 'moderate':
        hier_filtering = 'compact_rgroups'
    else:
        hier_filtering = ''

    # Domain size
    if domain_size not in ('annotated', 'known'):
        raise ValueError("domain_size must be one of \"annotated\" or \"known\"")

    # Custom background
    if isinstance(custom_bg, list):
        custom_bg = ' '.join(custom_bg)
    else:
        raise TypeError('custom_bg need to be a list')

    # Max. set size
    if max_set_size < 0:
        max_set_size = 0

    # HTTP request
    if mode == 'enrich':
        query_params = {
            'organism': organism,
            'query': query_url,
            'output': output_type,
            'analytical': '1',
            'sort_by_structure': '1',
            'ordered_query': '1' if ordered_query else '0',
            'significant': '1' if significant else '0',
            'no_iea': '1' if exclude_iea else '0',
            'as_ranges': '1' if region_query else '0',
            'omit_metadata': '0' if include_graph else '1',
            'user_thr': str(max_p_value),
            'max_set_size': str(max_set_size),
            'threshold_algo': correction_method,
            'hierfiltering': hier_filtering,
            'domain_size_type': domain_size,
            'custbg_file': '',
            'custbg': custom_bg,
            'prefix': numeric_ns,
            'no_isects': '1' if no_isects else '0'
        }
    elif mode == 'lookup':
        query_params = {
            'query': query_url, 
            'target': 'GO'
        }
    if src_filter:
        for i in src_filter:
            query_params['sf_' + i] = '1'
    raw_query = requests.post(my_url, data=query_params, headers=HEADERS)

    # Here PNG request parsing would go, but not implementing that
    if wantpng:
        pass

    # Requested text
    split_query = raw_query.text.split('\n')

    # Here interaction parsing would go, but not implementing that
    if include_graph:
        pass

    # Parse main result body
    split_query = [
        s.split('\t')
        for s in split_query
        if s and not s.startswith('#')
    ]
    
    enrichment = pd.DataFrame(split_query)
    if mode == 'enrich':
        colnames = [
            "query.number", "significant", "p.value", "term.size", 
            "query.size", "overlap.size", "recall", "precision", 
            "term.id", "domain", "subgraph.number", "term.name", 
            "relative.depth", "intersection"
        ]
        numeric_colnames = [
            "query.number", "p.value", "term.size", 
            "query.size", "overlap.size", "recall", 
            "precision", "subgraph.number", "relative.depth"
        ]
    elif mode == 'lookup':
        return enrichment
        #colnames = ["alias.number", "alias", "target.number", "target", "name", "description", "namespace"]
        #numeric_colnames = ["alias.number", "target.number"]
    
    if enrichment.shape[1] > 0:
        enrichment.columns = colnames
        enrichment.index = enrichment['term.id']
        numeric_columns = numeric_colnames
        for column in numeric_columns:
            enrichment[column] = pd.to_numeric(enrichment[column])
        if mode == 'enrich':
            enrichment['significant'] = enrichment['significant'] == '!'
    else:
        enrichment = None
    # Only report most significant results
    if (enrichment is not None) and (enrichment.shape[0] > 0):
        x = np.array(np.argsort(enrichment['p.value']))
        enrichment = enrichment.iloc[x[:topk], :]
    return enrichment


def viz_gprofiler(selected_genes, topk=20):
    go_results = gprofiler(selected_genes, topk=topk)
    database_colors = { 'MF': '#CB3C19', 'BP': '#FA9D18', 'CC': '#198520', 'keg': '#D9869B', 'rea': '#335ACC', 'tf': '#4F6E9B', 'mir': '#53B8AD', 'hpa': '#542DB1', 'cor': '#6AAB19', 'hp': '#8C0784'}
    database_IDs = {'MF': 'Molecular function', 'BP': 'Biological process', 'CC': 'Cellular component', 'keg': 'KEGG', 'rea': 'REAC', 'tf': 'TF', 'mir': 'MIRNA', 'hpa': 'HPA', 'cor': 'CORUM', 'hp': 'HP'}
    top_go_logpvals = -np.log10(go_results['p.value'].astype(float))
    top_go_dbIDs = go_results['domain']
    top_go_termnames = go_results['term.name']
    top_go_termIDs = go_results['term.id']
    top_go_queryhits = go_results['intersection']
    bar_colors = np.array([database_colors[x] for x in top_go_dbIDs])
    panel_data = []
    ordi = np.arange(len(top_go_dbIDs))[::-1]
    for c in np.unique(bar_colors):
        trace_locs = np.where(bar_colors == c)[0]
        trace_locs = trace_locs[::-1]      # Reverse because it's better.
        panel_data.append({
            'name': database_IDs[top_go_dbIDs[trace_locs[0]]], 
            'x': top_go_logpvals[trace_locs],
            # 'y': ordi[trace_locs], 
            'y': top_go_termIDs[trace_locs],
            'hovertext': [
                "-log10(p): {}<br>{}<br>{}".format(
                    str(round(top_go_logpvals[t], 2)), 
                    top_go_termnames[t], 
                    top_go_termIDs[t]
                ) for t in trace_locs], 
            'text': top_go_termnames[trace_locs]
        })
    return panel_data

go_results = gprofiler(['NF1', 'KRAS', 'RAF1'])

go_results

	query.number	significant	p.value	term.size	query.size	overlap.size	recall	precision	term.id	domain	subgraph.number	term.name	relative.depth	intersection
term.id
GO:0021896	1	True	0.000012	2	3	2	0.667	1.000	GO:0021896	BP	11	forebrain astrocyte differentiation	2	KRAS,NF1
GO:0021897	1	True	0.000012	2	3	2	0.667	1.000	GO:0021897	BP	11	forebrain astrocyte development	3	KRAS,NF1
REAC:R-HSA-6802949	1	True	0.000018	51	3	3	1.000	0.059	REAC:R-HSA-6802949	rea	8	Signaling by RAS mutants	3	RAF1,KRAS,NF1
KEGG:01521	1	True	0.000055	79	3	3	1.000	0.038	KEGG:01521	keg	50	EGFR tyrosine kinase inhibitor resistance	1	RAF1,KRAS,NF1
REAC:R-HSA-6802953	1	True	0.000056	4	3	2	0.667	0.500	REAC:R-HSA-6802953	rea	8	RAS signaling downstream of NF1 loss-of-f...	3	KRAS,NF1
REAC:R-HSA-6802957	1	True	0.000057	74	3	3	1.000	0.041	REAC:R-HSA-6802957	rea	8	Oncogenic MAPK signaling	2	RAF1,KRAS,NF1
HP:0002967	1	True	0.000283	32	3	3	1.000	0.094	HP:0002967	hp	23	Cubitus valgus	1	RAF1,KRAS,NF1
GO:0046578	1	True	0.001030	219	3	3	1.000	0.014	GO:0046578	BP	62	regulation of Ras protein signal transduction	2	RAF1,KRAS,NF1
HP:0030063	1	True	0.001190	51	3	3	1.000	0.059	HP:0030063	hp	43	Neuroepithelial neoplasm	3	RAF1,KRAS,NF1
GO:0035020	1	True	0.001220	15	3	2	0.667	0.133	GO:0035020	BP	62	regulation of Rac protein signal transduc...	3	KRAS,NF1
HP:0030061	1	True	0.001410	54	3	3	1.000	0.056	HP:0030061	hp	43	Neuroectodermal neoplasm	2	RAF1,KRAS,NF1
HP:0100836	1	True	0.001410	54	3	3	1.000	0.056	HP:0100836	hp	41	Malignant neoplasm of the central nervous...	3	RAF1,KRAS,NF1
HP:0030060	1	True	0.001410	54	3	3	1.000	0.056	HP:0030060	hp	43	Nervous tissue neoplasm	1	RAF1,KRAS,NF1
KEGG:04014	1	True	0.001450	233	3	3	1.000	0.013	KEGG:04014	keg	15	Ras signaling pathway	1	RAF1,KRAS,NF1
REAC:R-HSA-5621575	1	True	0.001960	21	3	2	0.667	0.095	REAC:R-HSA-5621575	rea	14	CD209 (DC-SIGN) signaling	1	RAF1,KRAS
REAC:R-HSA-5673001	1	True	0.001980	239	3	3	1.000	0.013	REAC:R-HSA-5673001	rea	58	RAF/MAP kinase cascade	3	RAF1,KRAS,NF1
REAC:R-HSA-5684996	1	True	0.002140	245	3	3	1.000	0.012	REAC:R-HSA-5684996	rea	58	MAPK1/MAPK3 signaling	2	RAF1,KRAS,NF1
HP:0010991	1	True	0.002160	62	3	3	1.000	0.048	HP:0010991	hp	52	Abnormality of the abdominal musculature	1	RAF1,KRAS,NF1
REAC:R-HSA-5673000	1	True	0.002800	25	3	2	0.667	0.080	REAC:R-HSA-5673000	rea	58	RAF activation	4	RAF1,KRAS
KEGG:04010	1	True	0.002980	296	3	3	1.000	0.010	KEGG:04010	keg	22	MAPK signaling pathway	1	RAF1,KRAS,NF1

References

Luebbert, Laura, and Lior Pachter. 2023. “Efficient Querying of Genomic Reference Databases with Gget.” Bioinformatics 39 (1): btac836.

Raudvere, Uku, Liis Kolberg, Ivan Kuzmin, Tambet Arak, Priit Adler, Hedi Peterson, and Jaak Vilo. 2019. “G: Profiler: A Web Server for Functional Enrichment Analysis and Conversions of Gene Lists (2019 Update).” Nucleic Acids Research 47 (W1): W191–98.

Reuse

CC BY 4.0