Interactive browsers are entering wider use across biochemical fields that routinely work with extremely high-dimensional, structured data. I recently wrote about building the foundations of such a browser for visualizing chemical space, in which the user can interactively define subsets of chemicals.
Browsers like this need to support general learning tasks without making unnecessary assumptions, and their potential is realized through interactive algorithms driven by user selections. Interactive computation is the theme of this post series. In this post, I explore it by adding a heatmap to the browser and augmenting it to group subsets of chemicals based on their fingerprints.
Loading the previous browser
First, I load the interactive browser built in the previous post in the series, which covers the basics of constructing an interactive interface with subselection capabilities. The code to run the browser is imported from the file produced there.
CODE
import requests

url = 'https://raw.githubusercontent.com/b-akshay/blog-tools/main/interactive-browser/part-1/scatter_chemviz.py'
r = requests.get(url)

# make sure your filename is the same as how you want to import
with open('VE_browser.py', 'w') as f:
    f.write(r.text)

# now we can import
from VE_browser import *
I’m going to add another panel to the interface, with profound implications. This is a heatmap that displays feature-level values for every observation, i.e. it displays the “raw data” – in this case, the values of the individual fingerprint bits of each molecule. It allows the user to define subgroups of the data based directly on their representations, rather than on a derived 2D scatterplot.
Selecting features to view
In this case, we could use my original featurization of this chemical data: 2048 Morgan (extended-connectivity) fingerprint bits per molecule.
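For reference, here’s roughly how such a featurization can be computed with RDKit. This is a sketch rather than the exact code from the earlier post; the helper name, the radius of 2, and the SMILES input are assumptions.

CODE
from rdkit import Chem
from rdkit.Chem import AllChem
import numpy as np

def morgan_fingerprints(smiles_list, n_bits=2048, radius=2):
    """Morgan (extended-connectivity) fingerprint bit vectors, one row per molecule."""
    fps = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)          # assumes valid SMILES
        bv = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        fps.append(np.array(list(bv), dtype=np.uint8))
    return np.vstack(fps)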
The heatmap takes up less than half the screen horizontally - a few hundred pixels. So there are clearly too many features to view individually - more than the number of pixels available to display the heatmap. This is quite common in data visualization, and which features are selected for display matters a great deal.
As a first pass, I’ve implemented a simple function that selects the maximum-variance features in the data.
This runs at interactive speeds, and provides a guardrail against rendering astronomically many features.
Note: This will fail if the data are normalized to have unit variance per feature. So you may need to change this for your particular purpose.
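The actual function lives in the notebook; below is a minimal sketch of what interesting_feat_ndces could look like, with the num_feats cap as an assumed parameter.

CODE
import numpy as np
import scipy.sparse

def interesting_feat_ndces(data, num_feats=100):
    """Indices of the num_feats highest-variance features (columns) of data."""
    if scipy.sparse.issparse(data):
        data = data.toarray()                  # fine at this scale
    feat_variances = np.var(data, axis=0)
    # Highest variance first. If features are standardized to unit variance,
    # this ranking is uninformative (see the note above).
    return np.argsort(feat_variances)[::-1][:num_feats]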
Timing
What do I mean by “interactive speeds”? Running it on this data (>2K features) takes about 0.02s on my laptop. This is far below the ~1 second threshold beyond which the user is perceptually thrown out of the feeling of interactivity, a hard limit on our interactive designs that I’ve written about recently. Timing this feature selection step is easy, and good practice.
CODE
%timeit interesting_feat_ndces(anndata_all.X)
20.3 ms ± 877 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Displaying metadata
Though it’s possible to interactively narrow down to a manageable number of fingerprint bits, the bits are not easy to interpret individually. So we opt instead to display human-interpretable descriptors calculated with various RDKit utilities, as I wrote about earlier.
These are already stored as metadata columns in the .obs dataframe; there are 208 such columns we display as a heatmap.
CODE
print("Shape of displayed metadata: {}".format(anndata_all.obs.iloc[:, :-8].values.shape))
Shape of displayed metadata: (5903, 208)
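As a reminder of where those columns come from, here’s roughly how a full table of RDKit 2D descriptors can be computed. The helper name is for illustration only, and the exact descriptor set from the earlier post may differ.

CODE
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors

def rdkit_descriptor_table(smiles_list):
    """One row of RDKit 2D descriptors per molecule, one column per descriptor."""
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)          # assumes valid SMILES
        rows.append({name: fn(mol) for name, fn in Descriptors.descList})
    return pd.DataFrame(rows, index=smiles_list)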
Co-clustering to group the selected data
We focus on one operation in particular: ordering the rows and columns of the heatmap to emphasize its internal structure. This can be done with co-clustering – assigning joint cluster labels to the rows and columns of the heatmap. The browser does this with a function compute_coclustering, which groups the rows and columns and returns the cluster indices and IDs.
An implementation with linear algebra
There are many methods for co-clustering, including work linking it to information theory (Dhillon, Mallela, and Modha 2003). I’ll use one of the most efficient: the spectral partitioning method of (Dhillon 2001), which is implemented in scikit-learn. Here’s a wrapper that sets it up for use by the browser.
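The wrapper itself is in the notebook; below is a minimal sketch of what compute_coclustering could look like, assuming the number of clusters is passed in and the data matrix is nonnegative (as fingerprint bits and min-max-normalized descriptors are).

CODE
import numpy as np
from sklearn.cluster import SpectralCoclustering

def compute_coclustering(data, n_clusters=5, random_state=0):
    """Jointly cluster rows and columns of a nonnegative matrix.
    Returns row/column orderings (grouped by cluster) and the cluster labels."""
    model = SpectralCoclustering(n_clusters=n_clusters, random_state=random_state)
    model.fit(data)
    row_order = np.argsort(model.row_labels_)       # rows grouped by cluster label
    col_order = np.argsort(model.column_labels_)    # columns grouped by cluster label
    return row_order, col_order, model.row_labels_, model.column_labels_

Timing a call like this on the full displayed matrix gives: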
300 ms ± 2.93 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Displaying and viewing the resulting heatmap
Generating the heatmap in Plotly
Putting it all together, here’s some wrapper code that translates all these configuration and style options into a Plotly heatmap.
\({\normalsize \textbf{Details}}\)
The code in hm_hovertext() creates tooltip text for the heatmap. This is done pretty simply right now, but is where the performance bottleneck is.
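To make that concrete, a simplified version of hm_hovertext might look like the sketch below (the exact fields shown per cell are an assumption); the nested Python loop over every cell is precisely why it becomes the bottleneck.

CODE
def hm_hovertext(data, row_names, col_names):
    """Per-cell hover strings for a heatmap: one string per (row, column) entry."""
    hover = []
    for r, row_name in enumerate(row_names):
        hover.append([
            "Molecule: {}<br>Feature: {}<br>Value: {:.3g}".format(row_name, col_names[c], data[r, c])
            for c in range(len(col_names))
        ])
    return hover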
The code uses several display configuration options from the params dictionary defined above. The code could be fewer lines, but everyone likes different display configurations!
The main implementation details here arise from the differences in handling discrete and continuous colorscales, which affect subset selection and the annotation of points.
Further details, like the displayed name of the color variable, are set to generic defaults that are exposed in the code and can be changed.
This defines a dummy function hm_row_scatter. The job of this function is to return a column of scatterplot points alongside the rows to serve as a link between the scatterplot and the heatmap. Making this link work for significantly better subset selection is a task in itself. There is also additional code there I’ve left unexplained for now, involving customizable “row annotations” that can be used to modify the order of what’s displayed.
Delving into all this would make our discussion here very long, and is the subject of a subsequent post.
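Stripped of those configuration options, the core of the construction reduces to something like the sketch below, reusing the hm_hovertext sketch above; the real display_heatmap_cb adds the colorscale handling, row annotations, and linked scatter column discussed in this section.

CODE
import plotly.graph_objects as go

def basic_heatmap_figure(data_df, colorscale='Viridis'):
    """Minimal Plotly heatmap with per-cell hover text (cf. display_heatmap_cb)."""
    z = data_df.values
    hover = hm_hovertext(z, list(data_df.index), list(data_df.columns))
    fig = go.Figure(go.Heatmap(
        z=z,
        x=list(data_df.columns),
        y=list(data_df.index),
        text=hover,
        hoverinfo='text',
        colorscale=colorscale
    ))
    fig.update_layout(margin=dict(l=0, r=0, t=0, b=0))
    return fig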
The end product
The code from this notebook is available as a file; setting it up in an environment will deploy the app.
There are a couple of interesting stories along the way which merit their own code snippets.
Viewing the heatmap statically
One bit of code here normalizes the heatmap data and then renders the result as static HTML.
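As a sketch of that step: the exact normalization used by custom_colwise_norm_df is in the notebook, so per-column min-max scaling is assumed here, and data_df stands for the metadata table used in the callback below.

CODE
import plotly.graph_objects as go

def colwise_minmax_norm(df):
    """Scale each column of a DataFrame to [0, 1] (one plausible form of custom_colwise_norm_df)."""
    col_range = (df.max(axis=0) - df.min(axis=0)).replace(0, 1.0)
    return (df - df.min(axis=0)) / col_range

norm_df = colwise_minmax_norm(data_df)
fig = go.Figure(go.Heatmap(z=norm_df.values, colorscale='Viridis'))
fig.write_html('heatmap_static.html')           # static, standalone HTML snapshot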
Recall that for a fully interactive experience, we should keep the overall computation time under ~1s when the user selects a chunk of the scatterplot.
1.45 s ± 79.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
427 ms ± 5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
309 ms ± 5.83 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The present design does not meet this budget; large heatmaps cannot be rendered without the user feeling like they have to pause their thoughts. Profiling the code in display_heatmap_cb shows that most of the time is spent in hm_hovertext. We leave optimization of this function to a later post.
Warning: The bottleneck of the interface at present is putting together the tooltip text that displays when cells are hovered upon.
The main heatmap callback
The central use of the heatmap is as an aid in exploring and showing structure in (i.e., clustering) whatever data the user has selected. So at the most basic level, the callback that updates the heatmap fires when the user selection changes. We add this Dash code to the app.
CODE
"""Update the main heatmap panel."""@app.callback( Output('main-heatmap', 'figure'), [Input('landscape-plot', 'selectedData')])def update_heatmap( selected_points):if (selected_points isnotNone) and ('points'in selected_points): selected_IDs = [p['text'].split('<br>')[0] for p in selected_points['points']]else: selected_IDs = []iflen(selected_IDs) ==0: selected_IDs = data_df.index subsetted_data = data_df.loc[np.array(selected_IDs)] subsetted_data = custom_colwise_norm_df(subsetted_data)print(f"Subsetted data: {subsetted_data.shape}") row_annotations =Nonereturn display_heatmap_cb( subsetted_data, 'MolWt', row_annots=row_annotations, xaxis_label=True, yaxis_label=True )
Summary: interactive clustering
We’ve seen how to display the raw data in a heatmap, and cocluster it to visually organize information on the fly at a user-defined resolution.
Takeaways
Rule of thumb: No more than ~500K entries in the displayed heatmap.
\({\normalsize \textbf{Explanation}}\)
Screen capability: A screen has finite resolution, and a row or column of the heatmap needs at least a pixel to be intelligibly distinguished. So if, for example, a 1000 x 500 pixel area of the interface is devoted to the heatmap, roughly 500K entries is the most it can possibly display.
User perception: This also goes back to one of the central principles I talked about in a previous post - all computations should occur in less than a second in order to make a human user feel like the experience is interactive. One of the most intensive computations the browser performs is rendering the heatmap and its tens of thousands of individual entries, each with a different tooltip that displays upon hovering. This becomes a bottleneck at scales around \(\sim 10^5\) entries or more – much more than that cannot comfortably be displayed on most desktop/laptop screens, let alone mobile.
This turns out to be a pretty stringent requirement - even 1000 observations of 500 features each can take a while to render, as we’ll see. Such considerations are tied to the framework, and using better heatmap renderers can give significant speedups.
But the datasets we deal with in biochemical sciences are often several orders of magnitude larger. I’ll write next about bridging that gap of scale.
Up ahead: scaling up
As the capabilities of this browser grow, we wonder: how much algorithmic functionality can we enable at interactive speeds?
Next up is a crucial pit stop along this journey, adding several upgrades to the browser’s abilities that allow the user to zoom into data-driven subsets.
After that, we’ll open up some corners of the algorithmic toolbox on these subsets, and demonstrate what is possible at these speeds.
One key thing to remember is that the process is already bottlenecked by runtime considerations. Around 10,000 points seems to be the most that can be visualized at perceptually interactive speeds on a CPU-bound local machine.1 This sets a practical limit on the size of the datasets passed into learning algorithms.
In the next posts, we’ll look at what algorithmic and visualization functionality we can include to make the user’s life easier.
References
Dhillon, Inderjit S. 2001. “Co-Clustering Documents and Words Using Bipartite Spectral Graph Partitioning.” In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 269–74.
Dhillon, Inderjit S., Subramanyam Mallela, and Dharmendra S. Modha. 2003. “Information-Theoretic Co-Clustering.” In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 89–98.
Footnotes
However, much more than that is possible using GPUs, with technologies like WebGL.↩︎