Looking up gene function in databases

dataviz

genomics

Cross-referencing many databases at once

Author

Akshay Balsubramani

Gene sets in the Gene Ontology

The Gene Ontology (GO) is a controlled vocabulary for describing gene function. Along with other pathway/phenotype databases, the GO is a mainstay for interpreting gene lists. Using it is a good introduction to the highly structured forms of hard-won tacit knowledge that are essential in most areas of biology.

The GO is organized as a directed acyclic graph with three main branches: biological process, molecular function, and cellular component. The GO is a useful tool for summarizing the results of an analysis over protein-coding genes.

Querying the GO is often performed alongside queries to other databases, such as those for protein-protein interaction. Here is code for querying several major databases all at once.

Wrapper and visualizer

The g:Profiler web service (Raudvere et al. 2019) unifies access to many of these resources.
Since late‑2023 the official Python client gprofiler‑official has replaced ad‑hoc wrappers.
This notebook shows a minimal, modern workflow:

1. Install & import gprofiler‑official.
2. Run an enrichment query with GProfiler.profile.
3. Inspect and visualise the results.

Compared with the former version (deprecated from 2022, below), no manual request building is needed – the client handles paging, retries, and dataframe‑ready output.

CODE

from gprofiler import GProfiler
import pandas as pd, plotly.express as px
gp = GProfiler(return_dataframe=True)


genes = ['NF1', 'KRAS', 'RAF1']
go_results = gp.profile(organism='hsapiens', query=genes, no_evidences=False)

go_results

	source	native	name	p_value	significant	description	term_size	query_size	intersection_size	effective_domain_size	precision	recall	query	parents	intersections	evidences
0	WP	WP:WP2253	Pilocytic astrocytoma	4.909064e-08	True	Pilocytic astrocytoma	9	3	3	8752	1.000000	0.333333	query_1	[WP:000000]	[NF1, KRAS, RAF1]	[[WP], [WP], [WP]]
1	WP	WP:WP2586	Aryl hydrocarbon receptor pathway	8.871381e-06	True	Aryl hydrocarbon receptor pathway	46	3	3	8752	1.000000	0.065217	query_1	[WP:000000]	[NF1, KRAS, RAF1]	[[WP], [WP], [WP]]
2	GO:BP	GO:0021896	forebrain astrocyte differentiation	2.035805e-05	True	"The process in which a relatively unspecializ...	3	3	2	21026	0.666667	0.666667	query_1	[GO:0030900, GO:0048708]	[NF1, KRAS]	[[ISS, IEA], [IEA]]
3	GO:BP	GO:0021897	forebrain astrocyte development	2.035805e-05	True	"The process aimed at the progression of an as...	3	3	2	21026	0.666667	0.666667	query_1	[GO:0014002, GO:0021896]	[NF1, KRAS]	[[ISS, IEA], [IEA]]
4	HP	HP:0012209	Juvenile myelomonocytic leukemia	2.669185e-05	True	Juvenile myelomonocytic leukemia (JMML) is a l...	18	3	3	5080	1.000000	0.166667	query_1	[HP:0012324]	[NF1, KRAS, RAF1]	[[HP], [HP], [HP]]
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
166	HP	HP:0000823	Delayed puberty	4.835465e-02	True	Passing the age when puberty normally occurs w...	208	3	3	5080	1.000000	0.014423	query_1	[HP:0001510, HP:0008373]	[NF1, KRAS, RAF1]	[[HP], [HP], [HP]]
167	KEGG	KEGG:04218	Cellular senescence	4.866513e-02	True	Cellular senescence	155	3	2	8484	0.666667	0.012903	query_1	[KEGG:00000]	[KRAS, RAF1]	[[KEGG], [KEGG]]
168	KEGG	KEGG:04150	mTOR signaling pathway	4.866513e-02	True	mTOR signaling pathway	155	3	2	8484	0.666667	0.012903	query_1	[KEGG:00000]	[KRAS, RAF1]	[[KEGG], [KEGG]]
169	HP	HP:0004912	Hypophosphatemic rickets	4.968087e-02	True	Hypophosphatemic rickets	25	3	2	5080	0.666667	0.080000	query_1	[HP:0002148, HP:0002748]	[KRAS, RAF1]	[[HP], [HP]]
170	CORUM	CORUM:5923	RAF1-BRAF complex, RAS stimulated	4.993169e-02	True	RAF1-BRAF complex, RAS stimulated	1	2	1	3383	0.500000	1.000000	query_1	[CORUM:0000000]	[RAF1]	[[CORUM]]

171 rows × 16 columns

What is the best way to visualize this information? This type of integrative database search presents several challenges. A major one is that the databases are not comparable to each other. Using Fisher’s exact test to compare the query with gene sets and ranking by p-value is necessarily inexact but does have the advantage of simultaneously optimizing precision and recall of the retrieved results.

A generally appropriate solution is to filter quite loosely by p-value, using it only to short-list a set of enrichment terms. Together, these terms present a portrait of gene function that is meticulously curated along several distinct axes of evidence, from protein-protein interactions to downstream gene function and knockout experiments.

It’s important to display these short-listed terms all concurrently side-by-side in a way that distinguishes their sources of evidence and displays clear descriptions. As the descriptions are largely in technical English, interactive plots (such as those generated by Plotly) are a good solution to display them.

We can visualize all this information for any gene set, providing a rich portrait of the functional associations of that gene set.

CODE

def show_plotly_fig(go_results):
    go_results['-log10p'] = -np.log10(go_results['p_value'])
    go_results['label'] = go_results['name'] + '  (' + go_results['native'] + ')'
    fig = px.bar(
        go_results.sort_values('-log10p').tail(20),
        x='-log10p',
        y='label',
        orientation='h',
        color='source',
        hover_data=['native']
    )
    fig.update_layout(
        yaxis_title='', 
        xaxis_title='-log10(adj p)' if 'negative_log10_of_adjusted_p_value' in go_results else 'p‑value', 
        plot_bgcolor='white'
    )
    fig.update_xaxes(
        mirror=True,
        ticks='outside',
        showline=True,
        linecolor='black'
    )
    fig.update_yaxes(
        mirror=True,
        ticks='outside',
        showline=True,
        linecolor='black',
        gridcolor='lightgrey'
    )
    fig.show()

show_plotly_fig(go_results)

Unable to display output for mime type(s): application/vnd.plotly.v1+json

The function of a gene is necessarily contextual. A gene set does not necessarily mean something in isolation, but also in the context of what other genes are present. Expanding and contracting the gene sets we look up will also lead to different GO term results. This is an important consideration in using these databases, making the real-time nature of the above lookup crucial.

[Deprecated] : old wrapper (2022)

Here for completeness we include some old code which used HTTP requests directly using the API and manually did the rest. This adapts an excellent existing Python port for the gProfiler API to report the heterogeneous results of the lookup in a pandas dataframe.

This section is now deprecated, in view of the more complete functionality that g:Profiler has added above.

CODE

import requests

BASE_URL = "http://biit.cs.ut.ee/gprofiler/"
HEADERS = {'User-Agent': 'python-gprofiler'}

def gprofiler(
    query, topk=20, organism='hsapiens', ordered_query=False, significant=True, 
    exclude_iea=False, region_query=False, max_p_value=1.0, max_set_size=0, 
    correction_method='analytical', hier_filtering='none', 
    domain_size='annotated', custom_bg=[], numeric_ns='', no_isects=False, 
    png_fn=None, include_graph=False, src_filter=None, mode='enrich'
):
    ''' Annotate gene list functionally
    Interface to the g:Profiler tool for finding enrichments in gene lists. Organism
    names are constructed by concatenating the first letter of the name and the
    family name. Example: human - 'hsapiens', mouse - 'mmusculus'. If requesting PNG
    output, the request is directed to the g:GOSt tool in case 'query' is a vector
    and the g:Cocoa (compact view of multiple queries) tool in case 'query' is a
    list. PNG output can fail (return FALSE) in case the input query is too large.
    In such case, it is advisable to fall back to a non-image request.
    Returns a pandas DataFrame with enrichment results or None
    '''
    query_url = ''
    mode = 'enrich'   # Currently gprofiler gconvert gives an HTTP 405 error.
    if mode == 'enrich':
        my_url = BASE_URL + 'gcocoa.cgi'
    elif mode == 'lookup':
        my_url = BASE_URL + 'convert.cgi'
    wantpng = True if png_fn else False
    output_type = 'mini_png' if wantpng else 'mini'
    if wantpng:
        raise NotImplementedError('PNG Output not implemented')
        return
    if include_graph:
        raise NotImplementedError('Biogrid Interactions not implemented (include_graph)')
        return

    # Query
    qnames = list(query)
    if not qnames:
        raise ValueError('Missing query')
    query_url = ' '.join(qnames)

    # Significance threshold
    if correction_method == 'gSCS':
        correction_method = 'analytical'
    if correction_method not in ('analytical', 'fdr', 'bonferroni'):
        raise ValueError("Multiple testing correction method not recognized (correction_method)")

    # Hierarchical filtering
    if hier_filtering not in ('none', 'moderate', 'strong'):
        raise ValueError("hier_filtering must be one of \"none\", \"moderate\" or \"strong\"")
    if hier_filtering == 'strong':
        hier_filtering = 'compact_ccomp'
    elif hier_filtering == 'moderate':
        hier_filtering = 'compact_rgroups'
    else:
        hier_filtering = ''

    # Domain size
    if domain_size not in ('annotated', 'known'):
        raise ValueError("domain_size must be one of \"annotated\" or \"known\"")

    # Custom background
    if isinstance(custom_bg, list):
        custom_bg = ' '.join(custom_bg)
    else:
        raise TypeError('custom_bg need to be a list')

    # Max. set size
    if max_set_size < 0:
        max_set_size = 0

    # HTTP request
    if mode == 'enrich':
        query_params = {
            'organism': organism,
            'query': query_url,
            'output': output_type,
            'analytical': '1',
            'sort_by_structure': '1',
            'ordered_query': '1' if ordered_query else '0',
            'significant': '1' if significant else '0',
            'no_iea': '1' if exclude_iea else '0',
            'as_ranges': '1' if region_query else '0',
            'omit_metadata': '0' if include_graph else '1',
            'user_thr': str(max_p_value),
            'max_set_size': str(max_set_size),
            'threshold_algo': correction_method,
            'hierfiltering': hier_filtering,
            'domain_size_type': domain_size,
            'custbg_file': '',
            'custbg': custom_bg,
            'prefix': numeric_ns,
            'no_isects': '1' if no_isects else '0'
        }
    elif mode == 'lookup':
        query_params = {
            'query': query_url, 
            'target': 'GO'
        }
    if src_filter:
        for i in src_filter:
            query_params['sf_' + i] = '1'
    raw_query = requests.post(my_url, data=query_params, headers=HEADERS)

    # Here PNG request parsing would go, but not implementing that
    if wantpng:
        pass

    # Requested text
    split_query = raw_query.text.split('\n')

    # Here interaction parsing would go, but not implementing that
    if include_graph:
        pass

    # Parse main result body
    split_query = [
        s.split('\t')
        for s in split_query
        if s and not s.startswith('#')
    ]
    
    enrichment = pd.DataFrame(split_query)
    if mode == 'enrich':
        colnames = [
            "query.number", "significant", "p.value", "term.size", 
            "query.size", "overlap.size", "recall", "precision", 
            "term.id", "domain", "subgraph.number", "term.name", 
            "relative.depth", "intersection"
        ]
        numeric_colnames = [
            "query.number", "p.value", "term.size", 
            "query.size", "overlap.size", "recall", 
            "precision", "subgraph.number", "relative.depth"
        ]
    elif mode == 'lookup':
        return enrichment
        #colnames = ["alias.number", "alias", "target.number", "target", "name", "description", "namespace"]
        #numeric_colnames = ["alias.number", "target.number"]
    
    if enrichment.shape[1] > 0:
        print(enrichment.shape)
        enrichment.columns = colnames
        enrichment.index = enrichment['term.id']
        numeric_columns = numeric_colnames
        for column in numeric_columns:
            enrichment[column] = pd.to_numeric(enrichment[column])
        if mode == 'enrich':
            enrichment['significant'] = enrichment['significant'] == '!'
    else:
        enrichment = None
    # Only report most significant results
    if (enrichment is not None) and (enrichment.shape[0] > 0):
        x = np.array(np.argsort(enrichment['p.value']))
        enrichment = enrichment.iloc[x[:topk], :]
    return enrichment

CODE

def viz_gprofiler(selected_genes, topk=20):
    go_results = gprofiler(selected_genes, topk=topk)
    database_colors = { 'MF': '#CB3C19', 'BP': '#FA9D18', 'CC': '#198520', 'keg': '#D9869B', 'rea': '#335ACC', 'tf': '#4F6E9B', 'mir': '#53B8AD', 'hpa': '#542DB1', 'cor': '#6AAB19', 'hp': '#8C0784'}
    database_IDs = {'MF': 'Molecular function', 'BP': 'Biological process', 'CC': 'Cellular component', 'keg': 'KEGG', 'rea': 'REAC', 'tf': 'TF', 'mir': 'MIRNA', 'hpa': 'HPA', 'cor': 'CORUM', 'hp': 'HP'}
    top_go_logpvals = -np.log10(go_results['p_value'].astype(float))
    top_go_dbIDs = go_results['domain']
    top_go_termnames = go_results['name']
    top_go_termIDs = go_results['native']
    bar_colors = np.array([database_colors[x] for x in top_go_dbIDs])
    panel_data = []
    ordi = np.arange(len(top_go_dbIDs))[::-1]
    for c in np.unique(bar_colors):
        trace_locs = np.where(bar_colors == c)[0]
        trace_locs = trace_locs[::-1]      # Reverse because it's better.
        panel_data.append({
            'name': database_IDs[top_go_dbIDs[trace_locs[0]]], 
            'x': top_go_logpvals[trace_locs],
            # 'y': ordi[trace_locs], 
            'y': top_go_termIDs[trace_locs],
            'hovertext': [
                "-log10(p): {}<br>{}<br>{}".format(
                    str(round(top_go_logpvals[t], 2)), 
                    top_go_termnames[t], 
                    top_go_termIDs[t]
                ) for t in trace_locs], 
            'text': top_go_termnames[trace_locs]
        })
    return panel_data

CODE

go_results = gprofiler(genes, mode='enrich')
go_results

Along with the excellent gget package (Luebbert and Pachter 2023), this covers many common gene-set lookup use cases.

	query.number	significant	p.value	term.size	query.size	overlap.size	recall	precision	term.id	domain	subgraph.number	term.name	relative.depth	intersection
term.id
GO:0021896	1	True	0.000012	2	3	2	0.667	1.000	GO:0021896	BP	11	forebrain astrocyte differentiation	2	KRAS,NF1
GO:0021897	1	True	0.000012	2	3	2	0.667	1.000	GO:0021897	BP	11	forebrain astrocyte development	3	KRAS,NF1
REAC:R-HSA-6802949	1	True	0.000018	51	3	3	1.000	0.059	REAC:R-HSA-6802949	rea	8	Signaling by RAS mutants	3	RAF1,KRAS,NF1
KEGG:01521	1	True	0.000055	79	3	3	1.000	0.038	KEGG:01521	keg	50	EGFR tyrosine kinase inhibitor resistance	1	RAF1,KRAS,NF1
REAC:R-HSA-6802953	1	True	0.000056	4	3	2	0.667	0.500	REAC:R-HSA-6802953	rea	8	RAS signaling downstream of NF1 loss-of-f…	3	KRAS,NF1
REAC:R-HSA-6802957	1	True	0.000057	74	3	3	1.000	0.041	REAC:R-HSA-6802957	rea	8	Oncogenic MAPK signaling	2	RAF1,KRAS,NF1
HP:0002967	1	True	0.000283	32	3	3	1.000	0.094	HP:0002967	hp	23	Cubitus valgus	1	RAF1,KRAS,NF1
GO:0046578	1	True	0.001030	219	3	3	1.000	0.014	GO:0046578	BP	62	regulation of Ras protein signal transduction	2	RAF1,KRAS,NF1
HP:0030063	1	True	0.001190	51	3	3	1.000	0.059	HP:0030063	hp	43	Neuroepithelial neoplasm	3	RAF1,KRAS,NF1
GO:0035020	1	True	0.001220	15	3	2	0.667	0.133	GO:0035020	BP	62	regulation of Rac protein signal transduc…	3	KRAS,NF1
HP:0030061	1	True	0.001410	54	3	3	1.000	0.056	HP:0030061	hp	43	Neuroectodermal neoplasm	2	RAF1,KRAS,NF1
HP:0100836	1	True	0.001410	54	3	3	1.000	0.056	HP:0100836	hp	41	Malignant neoplasm of the central nervous…	3	RAF1,KRAS,NF1
HP:0030060	1	True	0.001410	54	3	3	1.000	0.056	HP:0030060	hp	43	Nervous tissue neoplasm	1	RAF1,KRAS,NF1
KEGG:04014	1	True	0.001450	233	3	3	1.000	0.013	KEGG:04014	keg	15	Ras signaling pathway	1	RAF1,KRAS,NF1
REAC:R-HSA-5621575	1	True	0.001960	21	3	2	0.667	0.095	REAC:R-HSA-5621575	rea	14	CD209 (DC-SIGN) signaling	1	RAF1,KRAS
REAC:R-HSA-5673001	1	True	0.001980	239	3	3	1.000	0.013	REAC:R-HSA-5673001	rea	58	RAF/MAP kinase cascade	3	RAF1,KRAS,NF1
REAC:R-HSA-5684996	1	True	0.002140	245	3	3	1.000	0.012	REAC:R-HSA-5684996	rea	58	MAPK1/MAPK3 signaling	2	RAF1,KRAS,NF1
HP:0010991	1	True	0.002160	62	3	3	1.000	0.048	HP:0010991	hp	52	Abnormality of the abdominal musculature	1	RAF1,KRAS,NF1
REAC:R-HSA-5673000	1	True	0.002800	25	3	2	0.667	0.080	REAC:R-HSA-5673000	rea	58	RAF activation	4	RAF1,KRAS
KEGG:04010	1	True	0.002980	296	3	3	1.000	0.010	KEGG:04010	keg	22	MAPK signaling pathway	1	RAF1,KRAS,NF1

References

Luebbert, Laura, and Lior Pachter. 2023. “Efficient Querying of Genomic Reference Databases with Gget.” Bioinformatics 39 (1): btac836.

Raudvere, Uku, Liis Kolberg, Ivan Kuzmin, Tambet Arak, Priit Adler, Hedi Peterson, and Jaak Vilo. 2019. “G: Profiler: A Web Server for Functional Enrichment Analysis and Conversions of Gene Lists (2019 Update).” Nucleic Acids Research 47 (W1): W191–98.

Reuse

CC BY 4.0

Questions? Email.

References

Luebbert, Laura, and Lior Pachter. 2023. “Efficient Querying of Genomic Reference Databases with Gget.” Bioinformatics 39 (1): btac836.

Reuse

CC BY 4.0