The Gene Ontology (GO) is a controlled vocabulary for describing gene function. Along with other pathway/phenotype databases, the GO is a mainstay for interpreting gene lists. Using it is a good introduction to the highly structured forms of hard-won tacit knowledge that are essential in most areas of biology.
The GO is organized as a directed acyclic graph with three main branches: biological process, molecular function, and cellular component. The GO is a useful tool for summarizing the results of an analysis over protein-coding genes.
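Because the GO is a directed acyclic graph rather than a tree, a term can have more than one parent, and a gene annotated to a term is implicitly annotated to all of that term's ancestors. The sketch below walks that structure on a toy fragment. The two sets of parent links are copied from the `parents` column of the g:Profiler output shown later in this notebook. The `PARENTS` dictionary and the `ancestors` helper are illustrative names, not part of any GO library.

```python
# Toy fragment of the GO graph: child term -> list of parent terms.
# Only these two entries come from the notebook's output; a real graph
# would be loaded from the full ontology.
PARENTS = {
    "GO:0021897": ["GO:0014002", "GO:0021896"],
    "GO:0021896": ["GO:0030900", "GO:0048708"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:  # a DAG can reach one term by several paths
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("GO:0021897")))
# ['GO:0014002', 'GO:0021896', 'GO:0030900', 'GO:0048708']
```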
Querying the GO is often performed alongside queries to other databases, such as those for protein-protein interaction. Here is code for querying several major databases all at once.
Wrapper and visualizer
The g:Profiler web service (Raudvere et al. 2019) unifies access to many of these resources.
Since late 2023 the official Python client gprofiler-official has replaced ad-hoc wrappers.
This notebook shows a minimal, modern workflow:
1. Install & import gprofiler-official.
2. Run an enrichment query with GProfiler.profile.
3. Inspect and visualise the results.
Compared with the former version (deprecated from 2022, below), no manual request building is needed: the client sends the query for us and returns dataframe-ready output.
CODE
from gprofiler import GProfiler
import pandas as pd
import plotly.express as px

gp = GProfiler(return_dataframe=True)
genes = ['NF1', 'KRAS', 'RAF1']
go_results = gp.profile(organism='hsapiens', query=genes, no_evidences=False)
go_results
| | source | native | name | p_value | significant | description | term_size | query_size | intersection_size | effective_domain_size | precision | recall | query | parents | intersections | evidences |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | WP | WP:WP2253 | Pilocytic astrocytoma | 4.909064e-08 | True | Pilocytic astrocytoma | 9 | 3 | 3 | 8752 | 1.000000 | 0.333333 | query_1 | [WP:000000] | [NF1, KRAS, RAF1] | [[WP], [WP], [WP]] |
| 1 | WP | WP:WP2586 | Aryl hydrocarbon receptor pathway | 8.871381e-06 | True | Aryl hydrocarbon receptor pathway | 46 | 3 | 3 | 8752 | 1.000000 | 0.065217 | query_1 | [WP:000000] | [NF1, KRAS, RAF1] | [[WP], [WP], [WP]] |
| 2 | GO:BP | GO:0021896 | forebrain astrocyte differentiation | 2.035805e-05 | True | "The process in which a relatively unspecializ... | 3 | 3 | 2 | 21026 | 0.666667 | 0.666667 | query_1 | [GO:0030900, GO:0048708] | [NF1, KRAS] | [[ISS, IEA], [IEA]] |
| 3 | GO:BP | GO:0021897 | forebrain astrocyte development | 2.035805e-05 | True | "The process aimed at the progression of an as... | 3 | 3 | 2 | 21026 | 0.666667 | 0.666667 | query_1 | [GO:0014002, GO:0021896] | [NF1, KRAS] | [[ISS, IEA], [IEA]] |
| 4 | HP | HP:0012209 | Juvenile myelomonocytic leukemia | 2.669185e-05 | True | Juvenile myelomonocytic leukemia (JMML) is a l... | 18 | 3 | 3 | 5080 | 1.000000 | 0.166667 | query_1 | [HP:0012324] | [NF1, KRAS, RAF1] | [[HP], [HP], [HP]] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 166 | HP | HP:0000823 | Delayed puberty | 4.835465e-02 | True | Passing the age when puberty normally occurs w... | 208 | 3 | 3 | 5080 | 1.000000 | 0.014423 | query_1 | [HP:0001510, HP:0008373] | [NF1, KRAS, RAF1] | [[HP], [HP], [HP]] |
| 167 | KEGG | KEGG:04218 | Cellular senescence | 4.866513e-02 | True | Cellular senescence | 155 | 3 | 2 | 8484 | 0.666667 | 0.012903 | query_1 | [KEGG:00000] | [KRAS, RAF1] | [[KEGG], [KEGG]] |
| 168 | KEGG | KEGG:04150 | mTOR signaling pathway | 4.866513e-02 | True | mTOR signaling pathway | 155 | 3 | 2 | 8484 | 0.666667 | 0.012903 | query_1 | [KEGG:00000] | [KRAS, RAF1] | [[KEGG], [KEGG]] |
| 169 | HP | HP:0004912 | Hypophosphatemic rickets | 4.968087e-02 | True | Hypophosphatemic rickets | 25 | 3 | 2 | 5080 | 0.666667 | 0.080000 | query_1 | [HP:0002148, HP:0002748] | [KRAS, RAF1] | [[HP], [HP]] |
| 170 | CORUM | CORUM:5923 | RAF1-BRAF complex, RAS stimulated | 4.993169e-02 | True | RAF1-BRAF complex, RAS stimulated | 1 | 2 | 1 | 3383 | 0.500000 | 1.000000 | query_1 | [CORUM:0000000] | [RAF1] | [[CORUM]] |

171 rows × 16 columns
What is the best way to visualize this information? This type of integrative database search presents several challenges. A major one is that the databases are not directly comparable to each other: they differ in size and in the kind of evidence they record (the effective_domain_size column above ranges from 3383 to 21026). Ranking terms by the p-value of Fisher's exact test on the overlap between the query and each gene set is necessarily inexact, but it has the advantage of favoring terms that score well on both precision and recall.
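To make the link between the overlap counts, precision, recall, and the test concrete, here is a small worked sketch using the counts from row 0 of the table above (WP:WP2253). It computes the one-sided hypergeometric tail probability, which is the usual form of Fisher's exact test for over-representation. The helper name `hypergeom_tail` is ours, and the result is an uncorrected probability, so it is not expected to equal the `p_value` column, which g:Profiler reports after multiple-testing correction.

```python
from math import comb


def hypergeom_tail(domain_size, term_size, query_size, overlap):
    """P(X >= overlap) when query_size genes are drawn from domain_size genes,
    term_size of which belong to the term (one-sided Fisher's exact test)."""
    total = comb(domain_size, query_size)
    upper = min(term_size, query_size)
    hits = sum(
        comb(term_size, i) * comb(domain_size - term_size, query_size - i)
        for i in range(overlap, upper + 1)
    )
    return hits / total


# Counts from row 0 of the table above (WP:WP2253, Pilocytic astrocytoma).
domain_size, term_size, query_size, overlap = 8752, 9, 3, 3

precision = overlap / query_size  # fraction of the query found in the term
recall = overlap / term_size      # fraction of the term found in the query
p_raw = hypergeom_tail(domain_size, term_size, query_size, overlap)

print(f"precision={precision:.6f} recall={recall:.6f} raw p={p_raw:.3e}")
```

The precision and recall computed this way match the corresponding columns of that row (1.000000 and 0.333333). A term only gets a very small tail probability when the overlap is large relative to both the query and the term, which is why ranking by p-value tends to reward both quantities at once.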
A generally appropriate solution is to filter quite loosely by p-value, using it only to short-list a set of enrichment terms. Together, these terms present a portrait of gene function that is meticulously curated along several distinct axes of evidence, from protein-protein interactions to downstream gene function and knockout experiments.
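One way to sketch this short-listing step with pandas is shown below. The function name `shortlist`, the threshold, and the per-source cap are illustrative choices rather than anything prescribed by g:Profiler. The small `demo` frame stands in for `go_results` and reuses a few rows from the table above.

```python
import pandas as pd


def shortlist(results, max_p=0.05, per_source=5):
    """Keep terms with p_value <= max_p, then the best few from each source."""
    kept = results[results["p_value"] <= max_p]
    return (
        kept.sort_values("p_value", kind="stable")  # stable: ties keep input order
        .groupby("source", sort=False)
        .head(per_source)
        .reset_index(drop=True)
    )


# Tiny stand-in for `go_results`, with values copied from the table above.
demo = pd.DataFrame({
    "source": ["WP", "WP", "GO:BP", "KEGG", "KEGG"],
    "name": [
        "Pilocytic astrocytoma",
        "Aryl hydrocarbon receptor pathway",
        "forebrain astrocyte differentiation",
        "Cellular senescence",
        "mTOR signaling pathway",
    ],
    "p_value": [4.909064e-08, 8.871381e-06, 2.035805e-05,
                4.866513e-02, 4.866513e-02],
})

top = shortlist(demo, max_p=0.05, per_source=1)
print(top)
```

Capping the number of terms per source keeps a large database such as GO or HP from crowding out the smaller ones, which matters because the p-values are not comparable across sources.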
It’s important to display these short-listed terms side by side, in a way that distinguishes their sources of evidence and shows clear descriptions. Because the descriptions are largely technical English and are often long, interactive plots (such as those generated by Plotly) are a good way to display them.
We can visualize all this information for any gene set, providing a rich portrait of the functional associations of that gene set.
[Interactive Plotly figure; not rendered in this export.]
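The plotting code itself is not shown in this export, so what follows is only a sketch of one plausible way to build such a figure, not the notebook's original code. It assumes the column names returned by `gp.profile` above. The helper names `prepare_for_plot` and `plot_enrichment` and the choice of axes are ours. Plotly is imported inside the plotting function so that the data preparation can be used without it.

```python
import numpy as np
import pandas as pd


def prepare_for_plot(results):
    """Add a -log10(p) column and order terms by source, then significance."""
    table = results.copy()
    table["neg_log10_p"] = -np.log10(table["p_value"])
    return table.sort_values(["source", "neg_log10_p"]).reset_index(drop=True)


def plot_enrichment(results):
    """Scatter of terms coloured by source, with descriptions on hover."""
    import plotly.express as px  # local import: only needed when plotting

    table = prepare_for_plot(results)
    return px.scatter(
        table,
        x="neg_log10_p",
        y="name",
        color="source",
        size="intersection_size",
        hover_data=["native", "description", "term_size"],
    )


# Small stand-in for `go_results`, with rows copied from the table above.
demo = pd.DataFrame({
    "source": ["WP", "WP", "KEGG", "CORUM"],
    "native": ["WP:WP2253", "WP:WP2586", "KEGG:04218", "CORUM:5923"],
    "name": ["Pilocytic astrocytoma", "Aryl hydrocarbon receptor pathway",
             "Cellular senescence", "RAF1-BRAF complex, RAS stimulated"],
    "description": ["Pilocytic astrocytoma", "Aryl hydrocarbon receptor pathway",
                    "Cellular senescence", "RAF1-BRAF complex, RAS stimulated"],
    "p_value": [4.909064e-08, 8.871381e-06, 4.866513e-02, 4.993169e-02],
    "term_size": [9, 46, 155, 1],
    "intersection_size": [3, 3, 2, 1],
})

plot_table = prepare_for_plot(demo)
print(plot_table[["source", "native", "neg_log10_p"]])

# With Plotly installed, the figure would be drawn with:
# plot_enrichment(go_results).show()
```

Colouring by `source` keeps the different lines of evidence visually distinct, and putting the description in the hover text avoids crowding the axis labels with long definitions.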
The function of a gene is necessarily contextual: what a gene set means depends not only on its own members but also on which other genes are present. Expanding or contracting the gene set we look up will therefore change the GO terms returned. This is an important consideration when using these databases, and it makes the real-time nature of the above lookup crucial.
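A simple way to see this effect is to compare the term IDs returned for two versions of a gene set. The helper below is a minimal sketch; `compare_terms` is an illustrative name, and the IDs in the demo are made-up placeholders rather than real query results.

```python
def compare_terms(terms_a, terms_b):
    """Split two collections of term IDs into shared and query-specific sets."""
    a, b = set(terms_a), set(terms_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Placeholder IDs (hypothetical), standing in for the `native` column
# of two separate enrichment results.
diff = compare_terms(["T1", "T2"], ["T2", "T3"])
print(diff)

# With the client from above, the comparison would look like:
# full = gp.profile(organism='hsapiens', query=['NF1', 'KRAS', 'RAF1'])
# part = gp.profile(organism='hsapiens', query=['NF1', 'KRAS'])
# diff = compare_terms(full['native'], part['native'])
```

Terms that appear only for the larger or only for the smaller set are the ones whose enrichment depends on the genes that were added or removed.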
[Deprecated] : old wrapper (2022)
For completeness, here is the old code, which sent HTTP requests directly to the g:Profiler API and parsed the responses by hand. It adapts an excellent existing Python port of the g:Profiler API client to report the heterogeneous results of the lookup in a pandas DataFrame.
This section is now deprecated in favor of the official client used above, which offers more complete functionality.
CODE
import numpy as np
import pandas as pd
import requests

BASE_URL = "http://biit.cs.ut.ee/gprofiler/"
HEADERS = {'User-Agent': 'python-gprofiler'}


def gprofiler(
    query, topk=20, organism='hsapiens', ordered_query=False,
    significant=True, exclude_iea=False, region_query=False,
    max_p_value=1.0, max_set_size=0, correction_method='analytical',
    hier_filtering='none', domain_size='annotated', custom_bg=[],
    numeric_ns='', no_isects=False, png_fn=None, include_graph=False,
    src_filter=None, mode='enrich'
):
    '''
    Annotate gene list functionally

    Interface to the g:Profiler tool for finding enrichments in gene lists.
    Organism names are constructed by concatenating the first letter of the
    name and the family name. Example: human - 'hsapiens', mouse - 'mmusculus'.
    If requesting PNG output, the request is directed to the g:GOSt tool in
    case 'query' is a vector and the g:Cocoa (compact view of multiple
    queries) tool in case 'query' is a list. PNG output can fail (return
    FALSE) in case the input query is too large. In such case, it is
    advisable to fall back to a non-image request.

    Returns a pandas DataFrame with enrichment results or None
    '''
    query_url = ''
    mode = 'enrich'  # Currently gprofiler gconvert gives an HTTP 405 error.
    if mode == 'enrich':
        my_url = BASE_URL + 'gcocoa.cgi'
    elif mode == 'lookup':
        my_url = BASE_URL + 'convert.cgi'
    wantpng = True if png_fn else False
    output_type = 'mini_png' if wantpng else 'mini'
    if wantpng:
        raise NotImplementedError('PNG Output not implemented')
    if include_graph:
        raise NotImplementedError('Biogrid Interactions not implemented (include_graph)')

    # Query
    qnames = list(query)
    if not qnames:
        raise ValueError('Missing query')
    query_url = ' '.join(qnames)

    # Significance threshold
    if correction_method == 'gSCS':
        correction_method = 'analytical'
    if correction_method not in ('analytical', 'fdr', 'bonferroni'):
        raise ValueError("Multiple testing correction method not recognized (correction_method)")

    # Hierarchical filtering
    if hier_filtering not in ('none', 'moderate', 'strong'):
        raise ValueError("hier_filtering must be one of \"none\", \"moderate\" or \"strong\"")
    if hier_filtering == 'strong':
        hier_filtering = 'compact_ccomp'
    elif hier_filtering == 'moderate':
        hier_filtering = 'compact_rgroups'
    else:
        hier_filtering = ''

    # Domain size
    if domain_size not in ('annotated', 'known'):
        raise ValueError("domain_size must be one of \"annotated\" or \"known\"")

    # Custom background
    if isinstance(custom_bg, list):
        custom_bg = ' '.join(custom_bg)
    else:
        raise TypeError('custom_bg need to be a list')

    # Max. set size
    if max_set_size < 0:
        max_set_size = 0

    # HTTP request
    if mode == 'enrich':
        query_params = {
            'organism': organism,
            'query': query_url,
            'output': output_type,
            'analytical': '1',
            'sort_by_structure': '1',
            'ordered_query': '1' if ordered_query else '0',
            'significant': '1' if significant else '0',
            'no_iea': '1' if exclude_iea else '0',
            'as_ranges': '1' if region_query else '0',
            'omit_metadata': '0' if include_graph else '1',
            'user_thr': str(max_p_value),
            'max_set_size': str(max_set_size),
            'threshold_algo': correction_method,
            'hierfiltering': hier_filtering,
            'domain_size_type': domain_size,
            'custbg_file': '',
            'custbg': custom_bg,
            'prefix': numeric_ns,
            'no_isects': '1' if no_isects else '0'
        }
    elif mode == 'lookup':
        query_params = {'query': query_url, 'target': 'GO'}
    if src_filter:
        for i in src_filter:
            query_params['sf_' + i] = '1'
    raw_query = requests.post(my_url, data=query_params, headers=HEADERS)

    # Here PNG request parsing would go, but not implementing that
    if wantpng:
        pass

    # Requested text
    split_query = raw_query.text.split('\n')

    # Here interaction parsing would go, but not implementing that
    if include_graph:
        pass

    # Parse main result body
    split_query = [
        s.split('\t') for s in split_query
        if s and not s.startswith('#')
    ]
    enrichment = pd.DataFrame(split_query)
    if mode == 'enrich':
        colnames = [
            "query.number", "significant", "p.value", "term.size",
            "query.size", "overlap.size", "recall", "precision",
            "term.id", "domain", "subgraph.number", "term.name",
            "relative.depth", "intersection"
        ]
        numeric_colnames = [
            "query.number", "p.value", "term.size", "query.size",
            "overlap.size", "recall", "precision", "subgraph.number",
            "relative.depth"
        ]
    elif mode == 'lookup':
        return enrichment
        #colnames = ["alias.number", "alias", "target.number", "target", "name", "description", "namespace"]
        #numeric_colnames = ["alias.number", "target.number"]

    if enrichment.shape[1] > 0:
        print(enrichment.shape)
        enrichment.columns = colnames
        enrichment.index = enrichment['term.id']
        numeric_columns = numeric_colnames
        for column in numeric_columns:
            enrichment[column] = pd.to_numeric(enrichment[column])
        if mode == 'enrich':
            enrichment['significant'] = enrichment['significant'] == '!'
    else:
        enrichment = None

    # Only report most significant results
    if (enrichment is not None) and (enrichment.shape[0] > 0):
        x = np.array(np.argsort(enrichment['p.value']))
        enrichment = enrichment.iloc[x[:topk], :]
    return enrichment
Luebbert, Laura, and Lior Pachter. 2023. “Efficient Querying of Genomic Reference Databases with Gget.” Bioinformatics 39 (1): btac836.
Raudvere, Uku, Liis Kolberg, Ivan Kuzmin, Tambet Arak, Priit Adler, Hedi Peterson, and Jaak Vilo. 2019. “g:Profiler: A Web Server for Functional Enrichment Analysis and Conversions of Gene Lists (2019 Update).” Nucleic Acids Research 47 (W1): W191–98.