Looking up gene function in databases

dataviz
genomics
Cross-referencing many databases at once
Author

Akshay Balsubramani

Gene sets in the Gene Ontology

The Gene Ontology (GO) is a controlled vocabulary for describing gene function. Along with other pathway/phenotype databases, the GO is a mainstay for interpreting gene lists. Using it is a good introduction to the highly structured forms of hard-won tacit knowledge that are essential in most areas of biology.

The GO is organized as a directed acyclic graph with three main branches: biological process, molecular function, and cellular component. The GO is a useful tool for summarizing the results of an analysis over protein-coding genes.

Querying the GO is often performed alongside queries to other databases, such as those for protein-protein interaction. Here is code for querying several major databases all at once.

Wrapper and visualizer

The g:Profiler web service (Raudvere et al. 2019) unifies access to many of these resources.
Since late‑2023 the official Python client gprofiler‑official has replaced ad‑hoc wrappers.
This notebook shows a minimal, modern workflow:

1. Install & import gprofiler‑official.
2. Run an enrichment query with GProfiler.profile.
3. Inspect and visualise the results.

Compared with the former version (deprecated from 2022, below), no manual request building is needed – the client handles paging, retries, and dataframe‑ready output.

CODE
from gprofiler import GProfiler
import pandas as pd, plotly.express as px
gp = GProfiler(return_dataframe=True)


genes = ['NF1', 'KRAS', 'RAF1']
go_results = gp.profile(organism='hsapiens', query=genes, no_evidences=False)

go_results
source native name p_value significant description term_size query_size intersection_size effective_domain_size precision recall query parents intersections evidences
0 WP WP:WP2253 Pilocytic astrocytoma 4.909064e-08 True Pilocytic astrocytoma 9 3 3 8752 1.000000 0.333333 query_1 [WP:000000] [NF1, KRAS, RAF1] [[WP], [WP], [WP]]
1 WP WP:WP2586 Aryl hydrocarbon receptor pathway 8.871381e-06 True Aryl hydrocarbon receptor pathway 46 3 3 8752 1.000000 0.065217 query_1 [WP:000000] [NF1, KRAS, RAF1] [[WP], [WP], [WP]]
2 GO:BP GO:0021896 forebrain astrocyte differentiation 2.035805e-05 True "The process in which a relatively unspecializ... 3 3 2 21026 0.666667 0.666667 query_1 [GO:0030900, GO:0048708] [NF1, KRAS] [[ISS, IEA], [IEA]]
3 GO:BP GO:0021897 forebrain astrocyte development 2.035805e-05 True "The process aimed at the progression of an as... 3 3 2 21026 0.666667 0.666667 query_1 [GO:0014002, GO:0021896] [NF1, KRAS] [[ISS, IEA], [IEA]]
4 HP HP:0012209 Juvenile myelomonocytic leukemia 2.669185e-05 True Juvenile myelomonocytic leukemia (JMML) is a l... 18 3 3 5080 1.000000 0.166667 query_1 [HP:0012324] [NF1, KRAS, RAF1] [[HP], [HP], [HP]]
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
166 HP HP:0000823 Delayed puberty 4.835465e-02 True Passing the age when puberty normally occurs w... 208 3 3 5080 1.000000 0.014423 query_1 [HP:0001510, HP:0008373] [NF1, KRAS, RAF1] [[HP], [HP], [HP]]
167 KEGG KEGG:04218 Cellular senescence 4.866513e-02 True Cellular senescence 155 3 2 8484 0.666667 0.012903 query_1 [KEGG:00000] [KRAS, RAF1] [[KEGG], [KEGG]]
168 KEGG KEGG:04150 mTOR signaling pathway 4.866513e-02 True mTOR signaling pathway 155 3 2 8484 0.666667 0.012903 query_1 [KEGG:00000] [KRAS, RAF1] [[KEGG], [KEGG]]
169 HP HP:0004912 Hypophosphatemic rickets 4.968087e-02 True Hypophosphatemic rickets 25 3 2 5080 0.666667 0.080000 query_1 [HP:0002148, HP:0002748] [KRAS, RAF1] [[HP], [HP]]
170 CORUM CORUM:5923 RAF1-BRAF complex, RAS stimulated 4.993169e-02 True RAF1-BRAF complex, RAS stimulated 1 2 1 3383 0.500000 1.000000 query_1 [CORUM:0000000] [RAF1] [[CORUM]]

171 rows × 16 columns

What is the best way to visualize this information? This type of integrative database search presents several challenges. A major one is that the databases are not comparable to each other. Using Fisher’s exact test to compare the query with gene sets and ranking by p-value is necessarily inexact but does have the advantage of simultaneously optimizing precision and recall of the retrieved results.

A generally appropriate solution is to filter quite loosely by p-value, using it only to short-list a set of enrichment terms. Together, these terms present a portrait of gene function that is meticulously curated along several distinct axes of evidence, from protein-protein interactions to downstream gene function and knockout experiments.

It’s important to display these short-listed terms all concurrently side-by-side in a way that distinguishes their sources of evidence and displays clear descriptions. As the descriptions are largely in technical English, interactive plots (such as those generated by Plotly) are a good solution to display them.

We can visualize all this information for any gene set, providing a rich portrait of the functional associations of that gene set.

CODE
def show_plotly_fig(go_results):
    go_results['-log10p'] = -np.log10(go_results['p_value'])
    go_results['label'] = go_results['name'] + '  (' + go_results['native'] + ')'
    fig = px.bar(
        go_results.sort_values('-log10p').tail(20),
        x='-log10p',
        y='label',
        orientation='h',
        color='source',
        hover_data=['native']
    )
    fig.update_layout(
        yaxis_title='', 
        xaxis_title='-log10(adj p)' if 'negative_log10_of_adjusted_p_value' in go_results else 'p‑value', 
        plot_bgcolor='white'
    )
    fig.update_xaxes(
        mirror=True,
        ticks='outside',
        showline=True,
        linecolor='black'
    )
    fig.update_yaxes(
        mirror=True,
        ticks='outside',
        showline=True,
        linecolor='black',
        gridcolor='lightgrey'
    )
    fig.show()

show_plotly_fig(go_results)
Unable to display output for mime type(s): application/vnd.plotly.v1+json

The function of a gene is necessarily contextual. A gene set does not necessarily mean something in isolation, but also in the context of what other genes are present. Expanding and contracting the gene sets we look up will also lead to different GO term results. This is an important consideration in using these databases, making the real-time nature of the above lookup crucial.

[Deprecated] : old wrapper (2022)

Here for completeness we include some old code which used HTTP requests directly using the API and manually did the rest. This adapts an excellent existing Python port for the gProfiler API to report the heterogeneous results of the lookup in a pandas dataframe.

This section is now deprecated, in view of the more complete functionality that g:Profiler has added above.

CODE
import requests

BASE_URL = "http://biit.cs.ut.ee/gprofiler/"
HEADERS = {'User-Agent': 'python-gprofiler'}

def gprofiler(
    query, topk=20, organism='hsapiens', ordered_query=False, significant=True, 
    exclude_iea=False, region_query=False, max_p_value=1.0, max_set_size=0, 
    correction_method='analytical', hier_filtering='none', 
    domain_size='annotated', custom_bg=[], numeric_ns='', no_isects=False, 
    png_fn=None, include_graph=False, src_filter=None, mode='enrich'
):
    ''' Annotate gene list functionally
    Interface to the g:Profiler tool for finding enrichments in gene lists. Organism
    names are constructed by concatenating the first letter of the name and the
    family name. Example: human - 'hsapiens', mouse - 'mmusculus'. If requesting PNG
    output, the request is directed to the g:GOSt tool in case 'query' is a vector
    and the g:Cocoa (compact view of multiple queries) tool in case 'query' is a
    list. PNG output can fail (return FALSE) in case the input query is too large.
    In such case, it is advisable to fall back to a non-image request.
    Returns a pandas DataFrame with enrichment results or None
    '''
    query_url = ''
    mode = 'enrich'   # Currently gprofiler gconvert gives an HTTP 405 error.
    if mode == 'enrich':
        my_url = BASE_URL + 'gcocoa.cgi'
    elif mode == 'lookup':
        my_url = BASE_URL + 'convert.cgi'
    wantpng = True if png_fn else False
    output_type = 'mini_png' if wantpng else 'mini'
    if wantpng:
        raise NotImplementedError('PNG Output not implemented')
        return
    if include_graph:
        raise NotImplementedError('Biogrid Interactions not implemented (include_graph)')
        return

    # Query
    qnames = list(query)
    if not qnames:
        raise ValueError('Missing query')
    query_url = ' '.join(qnames)

    # Significance threshold
    if correction_method == 'gSCS':
        correction_method = 'analytical'
    if correction_method not in ('analytical', 'fdr', 'bonferroni'):
        raise ValueError("Multiple testing correction method not recognized (correction_method)")

    # Hierarchical filtering
    if hier_filtering not in ('none', 'moderate', 'strong'):
        raise ValueError("hier_filtering must be one of \"none\", \"moderate\" or \"strong\"")
    if hier_filtering == 'strong':
        hier_filtering = 'compact_ccomp'
    elif hier_filtering == 'moderate':
        hier_filtering = 'compact_rgroups'
    else:
        hier_filtering = ''

    # Domain size
    if domain_size not in ('annotated', 'known'):
        raise ValueError("domain_size must be one of \"annotated\" or \"known\"")

    # Custom background
    if isinstance(custom_bg, list):
        custom_bg = ' '.join(custom_bg)
    else:
        raise TypeError('custom_bg need to be a list')

    # Max. set size
    if max_set_size < 0:
        max_set_size = 0

    # HTTP request
    if mode == 'enrich':
        query_params = {
            'organism': organism,
            'query': query_url,
            'output': output_type,
            'analytical': '1',
            'sort_by_structure': '1',
            'ordered_query': '1' if ordered_query else '0',
            'significant': '1' if significant else '0',
            'no_iea': '1' if exclude_iea else '0',
            'as_ranges': '1' if region_query else '0',
            'omit_metadata': '0' if include_graph else '1',
            'user_thr': str(max_p_value),
            'max_set_size': str(max_set_size),
            'threshold_algo': correction_method,
            'hierfiltering': hier_filtering,
            'domain_size_type': domain_size,
            'custbg_file': '',
            'custbg': custom_bg,
            'prefix': numeric_ns,
            'no_isects': '1' if no_isects else '0'
        }
    elif mode == 'lookup':
        query_params = {
            'query': query_url, 
            'target': 'GO'
        }
    if src_filter:
        for i in src_filter:
            query_params['sf_' + i] = '1'
    raw_query = requests.post(my_url, data=query_params, headers=HEADERS)

    # Here PNG request parsing would go, but not implementing that
    if wantpng:
        pass

    # Requested text
    split_query = raw_query.text.split('\n')

    # Here interaction parsing would go, but not implementing that
    if include_graph:
        pass

    # Parse main result body
    split_query = [
        s.split('\t')
        for s in split_query
        if s and not s.startswith('#')
    ]
    
    enrichment = pd.DataFrame(split_query)
    if mode == 'enrich':
        colnames = [
            "query.number", "significant", "p.value", "term.size", 
            "query.size", "overlap.size", "recall", "precision", 
            "term.id", "domain", "subgraph.number", "term.name", 
            "relative.depth", "intersection"
        ]
        numeric_colnames = [
            "query.number", "p.value", "term.size", 
            "query.size", "overlap.size", "recall", 
            "precision", "subgraph.number", "relative.depth"
        ]
    elif mode == 'lookup':
        return enrichment
        #colnames = ["alias.number", "alias", "target.number", "target", "name", "description", "namespace"]
        #numeric_colnames = ["alias.number", "target.number"]
    
    if enrichment.shape[1] > 0:
        print(enrichment.shape)
        enrichment.columns = colnames
        enrichment.index = enrichment['term.id']
        numeric_columns = numeric_colnames
        for column in numeric_columns:
            enrichment[column] = pd.to_numeric(enrichment[column])
        if mode == 'enrich':
            enrichment['significant'] = enrichment['significant'] == '!'
    else:
        enrichment = None
    # Only report most significant results
    if (enrichment is not None) and (enrichment.shape[0] > 0):
        x = np.array(np.argsort(enrichment['p.value']))
        enrichment = enrichment.iloc[x[:topk], :]
    return enrichment
CODE
def viz_gprofiler(selected_genes, topk=20):
    go_results = gprofiler(selected_genes, topk=topk)
    database_colors = { 'MF': '#CB3C19', 'BP': '#FA9D18', 'CC': '#198520', 'keg': '#D9869B', 'rea': '#335ACC', 'tf': '#4F6E9B', 'mir': '#53B8AD', 'hpa': '#542DB1', 'cor': '#6AAB19', 'hp': '#8C0784'}
    database_IDs = {'MF': 'Molecular function', 'BP': 'Biological process', 'CC': 'Cellular component', 'keg': 'KEGG', 'rea': 'REAC', 'tf': 'TF', 'mir': 'MIRNA', 'hpa': 'HPA', 'cor': 'CORUM', 'hp': 'HP'}
    top_go_logpvals = -np.log10(go_results['p_value'].astype(float))
    top_go_dbIDs = go_results['domain']
    top_go_termnames = go_results['name']
    top_go_termIDs = go_results['native']
    bar_colors = np.array([database_colors[x] for x in top_go_dbIDs])
    panel_data = []
    ordi = np.arange(len(top_go_dbIDs))[::-1]
    for c in np.unique(bar_colors):
        trace_locs = np.where(bar_colors == c)[0]
        trace_locs = trace_locs[::-1]      # Reverse because it's better.
        panel_data.append({
            'name': database_IDs[top_go_dbIDs[trace_locs[0]]], 
            'x': top_go_logpvals[trace_locs],
            # 'y': ordi[trace_locs], 
            'y': top_go_termIDs[trace_locs],
            'hovertext': [
                "-log10(p): {}<br>{}<br>{}".format(
                    str(round(top_go_logpvals[t], 2)), 
                    top_go_termnames[t], 
                    top_go_termIDs[t]
                ) for t in trace_locs], 
            'text': top_go_termnames[trace_locs]
        })
    return panel_data
CODE
go_results = gprofiler(genes, mode='enrich')
go_results

Along with the excellent gget package (Luebbert and Pachter 2023), this covers many common gene-set lookup use cases.

query.number significant p.value term.size query.size overlap.size recall precision term.id domain subgraph.number term.name relative.depth intersection
term.id
GO:0021896 1 True 0.000012 2 3 2 0.667 1.000 GO:0021896 BP 11 forebrain astrocyte differentiation 2 KRAS,NF1
GO:0021897 1 True 0.000012 2 3 2 0.667 1.000 GO:0021897 BP 11 forebrain astrocyte development 3 KRAS,NF1
REAC:R-HSA-6802949 1 True 0.000018 51 3 3 1.000 0.059 REAC:R-HSA-6802949 rea 8 Signaling by RAS mutants 3 RAF1,KRAS,NF1
KEGG:01521 1 True 0.000055 79 3 3 1.000 0.038 KEGG:01521 keg 50 EGFR tyrosine kinase inhibitor resistance 1 RAF1,KRAS,NF1
REAC:R-HSA-6802953 1 True 0.000056 4 3 2 0.667 0.500 REAC:R-HSA-6802953 rea 8 RAS signaling downstream of NF1 loss-of-f… 3 KRAS,NF1
REAC:R-HSA-6802957 1 True 0.000057 74 3 3 1.000 0.041 REAC:R-HSA-6802957 rea 8 Oncogenic MAPK signaling 2 RAF1,KRAS,NF1
HP:0002967 1 True 0.000283 32 3 3 1.000 0.094 HP:0002967 hp 23 Cubitus valgus 1 RAF1,KRAS,NF1
GO:0046578 1 True 0.001030 219 3 3 1.000 0.014 GO:0046578 BP 62 regulation of Ras protein signal transduction 2 RAF1,KRAS,NF1
HP:0030063 1 True 0.001190 51 3 3 1.000 0.059 HP:0030063 hp 43 Neuroepithelial neoplasm 3 RAF1,KRAS,NF1
GO:0035020 1 True 0.001220 15 3 2 0.667 0.133 GO:0035020 BP 62 regulation of Rac protein signal transduc… 3 KRAS,NF1
HP:0030061 1 True 0.001410 54 3 3 1.000 0.056 HP:0030061 hp 43 Neuroectodermal neoplasm 2 RAF1,KRAS,NF1
HP:0100836 1 True 0.001410 54 3 3 1.000 0.056 HP:0100836 hp 41 Malignant neoplasm of the central nervous… 3 RAF1,KRAS,NF1
HP:0030060 1 True 0.001410 54 3 3 1.000 0.056 HP:0030060 hp 43 Nervous tissue neoplasm 1 RAF1,KRAS,NF1
KEGG:04014 1 True 0.001450 233 3 3 1.000 0.013 KEGG:04014 keg 15 Ras signaling pathway 1 RAF1,KRAS,NF1
REAC:R-HSA-5621575 1 True 0.001960 21 3 2 0.667 0.095 REAC:R-HSA-5621575 rea 14 CD209 (DC-SIGN) signaling 1 RAF1,KRAS
REAC:R-HSA-5673001 1 True 0.001980 239 3 3 1.000 0.013 REAC:R-HSA-5673001 rea 58 RAF/MAP kinase cascade 3 RAF1,KRAS,NF1
REAC:R-HSA-5684996 1 True 0.002140 245 3 3 1.000 0.012 REAC:R-HSA-5684996 rea 58 MAPK1/MAPK3 signaling 2 RAF1,KRAS,NF1
HP:0010991 1 True 0.002160 62 3 3 1.000 0.048 HP:0010991 hp 52 Abnormality of the abdominal musculature 1 RAF1,KRAS,NF1
REAC:R-HSA-5673000 1 True 0.002800 25 3 2 0.667 0.080 REAC:R-HSA-5673000 rea 58 RAF activation 4 RAF1,KRAS
KEGG:04010 1 True 0.002980 296 3 3 1.000 0.010 KEGG:04010 keg 22 MAPK signaling pathway 1 RAF1,KRAS,NF1

References

Luebbert, Laura, and Lior Pachter. 2023. “Efficient Querying of Genomic Reference Databases with Gget.” Bioinformatics 39 (1): btac836.

Raudvere, Uku, Liis Kolberg, Ivan Kuzmin, Tambet Arak, Priit Adler, Hedi Peterson, and Jaak Vilo. 2019. “G: Profiler: A Web Server for Functional Enrichment Analysis and Conversions of Gene Lists (2019 Update).” Nucleic Acids Research 47 (W1): W191–98.

Reuse

CC BY 4.0

Questions? Email.

References

Luebbert, Laura, and Lior Pachter. 2023. “Efficient Querying of Genomic Reference Databases with Gget.” Bioinformatics 39 (1): btac836.
Raudvere, Uku, Liis Kolberg, Ivan Kuzmin, Tambet Arak, Priit Adler, Hedi Peterson, and Jaak Vilo. 2019. “G: Profiler: A Web Server for Functional Enrichment Analysis and Conversions of Gene Lists (2019 Update).” Nucleic Acids Research 47 (W1): W191–98.

Reuse