Interactive browser, part 3: subset selection and scaling
Repurposing for single-cell data
- Introduction
- Loading the previous browser
- Single-cell gene expression data
- The browser
- Selecting subclusters of the data from the heatmap
- Summary
- Bonus changes: refactoring coclustering for sparse data
- References
Introduction
Previous posts in the series: [Part 1] [Part 2]
I've been writing about how to explore a dataset by building a data-driven interactive browser, which I've demonstrated on chemical data. Building this browser means working with variables that the user defines interactively, and the potential of this workflow is realized through interactive computation driven by those user selections.
In this series of posts, I'm developing and expanding a general, performant interactive data browser to explore the theme of interactive computation -- a favorite of mine.
This particular post lays the groundwork for trying some powerful interactive algorithms, by focusing on helping the user select a subset of observations interactively, using the clustering and visualization capabilities built in the previous posts in this series.
Loading the previous browser
I'll start by loading the interactive browser created in the previous post, which covers how to build a basic interactive interface with interactive clustering deployed in a heatmap. The code to run the browser is imported from the file produced in that post.
import requests
url = 'https://raw.githubusercontent.com/b-akshay/blog-tools/main/interactive-browser/part-3/browser_main.py'
r = requests.get(url)
# make sure your filename is the same as how you want to import
with open('VE_browser.py', 'w') as f:
    f.write(r.text)
# now we can import
from VE_browser import *
As in a previous post, we are rendering a new type of data, so the workflow is structured in two parts:
- Preprocessing script: Given a raw dataset, writes a dataframe that's ready to be viewed by a front-end.
- Interactive application: Given a dataframe, view it interactively in a browser.
Single-cell gene expression data
Beyond the application of visualizing chemical space that I initially wrote about, such browsers have started to enter the toolbox for various biological and chemical datasets. A fruitful field of application for such browsers, and the interactive algorithms they enable, is single-cell data.
Here, I'll demonstrate on single-cell data from the GTEx consortium, a large worldwide effort profiling gene expression across tissues.
This is a beautiful dataset for demonstrating systems biology on single-cell data, as in the recent paper introducing it [Eraslan et al. 2022]. It is also well suited to demonstrating new algorithms, as it is large (>200K cells and >17K genes) and comes well-annotated and preprocessed. The data are processed by a documented pipeline, which learns representations that attempt to correct for nuisance variation and batch effects.
Dataset overview
The dataset is downloaded from here.
This dataset is particularly notable for being the largest of its kind using a single-nucleus RNA-seq (snRNA-seq) protocol, which attempts to isolate each cell nucleus and sample unbiasedly from the RNA transcripts within it. This contrasts with more common and cheaper single-cell (scRNA-seq) protocols, which try to sample from both the nucleus and the cytoplasm (interior material) of each cell; isolating intact cells for scRNA-seq is much trickier than isolating nuclei for snRNA-seq. The snRNA-seq dataset I'll be using here takes advantage of these characteristics, using cryogenically frozen tissues collected earlier by the same GTEx consortium, as detailed here.
As we can see, there are many cell types, whose annotations broadly match the cluster structure revealed by nonparametric dimensionality reduction and projection tools - in this case, UMAP [Becht et al. 2019].
import scanpy as sc, time
itime = time.time()
gtex9_adta = sc.read('GTEx_8_tissues_snRNAseq_atlas_071421.public_obs.h5ad')
print("Data read. Time: {}".format(time.time() - itime))
sc.pl.umap(gtex9_adta, color='Broad cell type')
Preprocessing the data
There are ~209k observations and ~17k features, far too many to visualize all at once, so we need to do a little preprocessing to display a subset of this dataset interactively.
The approach I cover here is to subsample down to 10k cells, which is typically a comfortable number for the browser, for both rendering and perceptual reasons.
Selecting a subset of smoothed data
The standard way to subset rows is to randomly subsample observations. This is easy to implement, but can lead to unacceptable levels of sparsity in the data. At the extremes, this is obvious; for instance, sampling just a hundred observations from our 200k risks missing entire observation clusters and the features uniquely found in them.
To make up for this, the situation often calls for a more sophisticated method of subsampling, sometimes accompanied by a nonparametric smoothing of the data so that signal isn't lost to sparsity. Here, I precede subsampling with a nonparametric smoothing step that, for each feature, mixes in a little of each observation's neighbors. Since an observation's neighbors typically have similar feature representations, this respects the initial representation of the data while making it much less sparse. And it amounts to a couple of sparse matrix multiplications.
So the workflow I demonstrate here is:
- (Optional) Smooth data using neighborhood structure
- Subsample 10k observations and recompute neighborhood graph, using the learned representation that Eraslan et al. used in their analysis [Eraslan et al. 2022].
Some simple code to do so is below.
(Optional) Smooth the data
from sklearn.preprocessing import normalize
def smooth_anndata(adta, as_nonsparse=False):
    # Row-normalize the kNN connectivity graph so each row sums to 1,
    # then use it (applied twice) to mix each observation with its neighbors.
    w = adta.obsp['connectivities']
    w_norm = normalize(w, norm='l1', axis=1)
    X_smooth = (w_norm.dot(w_norm)).dot(adta.X)
    adta_smooth = adta.copy()
    adta_smooth.X = X_smooth.toarray() if as_nonsparse else X_smooth
    return adta_smooth
# Smooth and resample data.
itime = time.time()
gtex9_adta_smooth = smooth_anndata(gtex9_adta)
print("Data smoothed. Time: {}".format(time.time() - itime))
itime = time.time()
gtex9_adta_10k_smooth = sc.pp.subsample(gtex9_adta_smooth, n_obs=10000, random_state=42, copy=True)
sc.pp.neighbors(gtex9_adta_10k_smooth, use_rep='X_vae_mean')
print("Neighbors computed. Time: {}".format(time.time() - itime))
sc.tl.umap(gtex9_adta_10k_smooth)
sc.pl.umap(gtex9_adta_10k_smooth, color='Broad cell type')
The overall cluster structure (coarse close-range topology) of the subsample is the same, which is one indication[1] that we can proceed with the subsample and our newly computed neighborhood graph.
[1] In general, a 2D projection like UMAP is not a good way to reliably capture structure in a plot by eye. Straight lines, angles, and distances can be very deceptive (this is largely true even when reducing 3D to 2D, 'viewing from an angle'). But it's what we have, and pretty good for close-range cluster structure, which gets warped less by all the nonlinear shenanigans.
itime = time.time()
gtex9_adta_10k_smooth.obs['x'] = gtex9_adta_10k_smooth.obsm['X_umap'][:, 0]
gtex9_adta_10k_smooth.obs['y'] = gtex9_adta_10k_smooth.obsm['X_umap'][:, 1]
gtex9_adta_10k_smooth.write('GTEx_8_tissues_snRNAseq_atlas_10k_smooth.h5', compression='gzip')
print("Data written. Time: {}".format(time.time() - itime))
The browser
The modifications made to the code in this post ultimately result in an improved browser, the code for which is available here. Setting it up in an environment will deploy the app.
Additional changes
In moving to a dataset outside chemical space, I'm also making a couple of other modifications.
- Removing chemical functionality: The chemical images no longer need to be displayed, which entails removing the html <img> component chem2D-image and its dependencies; they are pretty modular and can be added back if needed.
- Handling anndata objects: Some changes were made to update_heatmap() and to display_heatmap_cb(), as the object to be rendered is now an anndata object instead of a pandas dataframe. One particular change of note is to interesting_feat_ndces(), to handle sparse data. The variances are now calculated in a way that does not involve constructing a dense matrix as an intermediate step.
def sparse_variance(data, axis=0):
    """
    Input: a sparse matrix.
    Output: variance along the selected axis.
    """
    # Var[X] = E[X^2] - (E[X])^2, computed without densifying the matrix.
    sqmat = data.power(2).mean(axis=axis)
    return np.ravel(sqmat - np.square(np.ravel(data.mean(axis=axis))))

def interesting_feat_ndces(fit_data, num_feats_todisplay=500):
    # Guard against missing/empty input before touching its shape.
    if (fit_data is None) or (np.prod(fit_data.shape) == 0):
        return np.arange(num_feats_todisplay)
    num_feats_todisplay = min(fit_data.shape[1], num_feats_todisplay)
    if scipy.sparse.issparse(fit_data):
        interestingnesses = sparse_variance(fit_data, axis=0)
    else:
        interestingnesses = np.std(fit_data, axis=0)
    # Rank features by spread and keep the top few hundred for display.
    feat_ndces = np.argsort(interestingnesses)[::-1][:num_feats_todisplay]
    return feat_ndces
- Color schemes and normalization: Strange things can happen when the wrong colormap is used: a discrete colormap for continuous ordinal data, for example. This is important enough that I've started splitting some of this code out into functions. We've already seen a lot of the work done in the colormap code posted in the first post; a rough sketch of the kind of helper involved follows this list.
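To make that concrete, here is a minimal sketch of the kind of colormap helper this refers to. The helper names, the palette choice, and the discrete-vs-continuous heuristic are illustrative assumptions on my part rather than the browser's exact code, though get_discrete_cmap plays the same role as the function hm_row_scatter calls below (one color per category).

import pandas as pd
import plotly.colors

def get_discrete_cmap(num_colors_needed):
    # Cycle a qualitative Plotly palette so every category gets its own color,
    # even when there are more categories than base colors.
    base = plotly.colors.qualitative.Alphabet
    return [base[i % len(base)] for i in range(num_colors_needed)]

def choose_colormap(series, max_discrete_levels=50):
    # Hypothetical heuristic: non-numeric or low-cardinality columns get a
    # discrete palette; everything else gets a continuous colorscale.
    if (not pd.api.types.is_numeric_dtype(series)) or series.nunique() <= max_discrete_levels:
        return get_discrete_cmap(series.nunique())
    return 'Viridis'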
Selecting subclusters of the data from the heatmap
The heatmap, with its automatic coclustering, can be used as a vitally important tool in scaling things up. It depicts the structure of the data in a more informative way than a simple 2D projection like PCA or UMAP.
A really common interactive task is to select a group of observations on the heatmap, and to iteratively refine this subselection.
gtex9_adta_10k_smooth = sc.read('GTEx_8_tissues_snRNAseq_atlas_10k_smooth.h5')
Selecting data in heatmap
A key tool for doing that is an easy way to select a subset of rows from the heatmap. I do this by anchoring a selectable point alongside each row in the heatmap, constituting a parallel scatterplot whose state is kept linked to the main scatterplot in the browser.
The code for this is fairly boilerplate, implementing the logic of subset selection. I've included it below. A few notes for the interested:
${\normalsize \textbf{Notes}}$
- The arguments include the dataframe in question (data_df), and the column in it that's being used as the scatterplot color variable (color_var), both of which are used in the original scatterplot. The other required argument (hm_point_names) contains the names of the points (heatmap rows) being displayed, in their display order.
- If there are 30 or fewer points being displayed, text labels will be shown alongside the points. This is a somewhat arbitrary choice that makes things easier when users do narrow down to a small subcluster of data.
def hm_row_scatter(anndata_obj, color_var, hm_point_names, num_of_category=0, bg_marker_size=params['bg_marker_size_factor']):
    df_data = anndata_obj.obs
    num_colors_needed = len(np.unique(df_data[color_var]))
    colorscale_list = get_discrete_cmap(num_colors_needed)
    row_scat_traces = []
    hmscat_mode = 'markers'
    # Decide if few enough points are around to display row labels
    if len(hm_point_names) <= 30:
        hmscat_mode = 'markers+text'
    cnt = 0
    for idx in np.unique(df_data[color_var]):
        val = df_data.loc[df_data[color_var] == idx, :]
        point_ids_this_trace = list(val.index)
        hm_point_where_this_trace = np.isin(hm_point_names, point_ids_this_trace)
        hm_point_names_this_trace = hm_point_names[hm_point_where_this_trace]
        num_in_trace = len(hm_point_names_this_trace)
        hm_point_ndces_this_trace = np.where(hm_point_where_this_trace)[0]    # this trace's row indices in heatmap
        y_coords_this_trace = np.arange(len(hm_point_names))[hm_point_ndces_this_trace]
        # At this point, rows are sorted in order of co-clustering.
        trace_color = colorscale_list[cnt]
        cnt += 1
        pt_text = ["{}".format(point_ids_this_trace[i]) for i in range(len(point_ids_this_trace))]
        new_trace = {
            'name': str(idx),
            'x': np.ones(num_in_trace)*-1*num_of_category,
            'y': y_coords_this_trace,
            'xaxis': 'x2',
            'hoverinfo': 'text',
            'text': [x for x in hm_point_names_this_trace],
            'mode': hmscat_mode,
            'textposition': 'middle left',
            'textfont': hm_font_macro,
            'marker': {
                'size': bg_marker_size,
                'opacity': params['marker_opacity_factor'],
                'symbol': 'circle',
                'color': trace_color
            },
            'selected': style_selected,
            'type': 'scatter'
        }
        row_scat_traces.append(new_trace)
    return row_scat_traces
To understand what this is doing: some key code was actually introduced in a previous post, though I didn't comment on it much there. The code in question is from the main heatmap creation code in display_heatmap_cb, creating an alternate x-axis.
def display_heatmap_cb(...):
    ...
    row_scat_traces = hm_row_scatter(working_object, color_var, hm_point_names)
    ...
    return {
        'data': [ hm_trace ] + row_scat_traces,
        'layout': {
            'xaxis': {
                'automargin': True,
                'showticklabels': False,
                'showgrid': False, 'showline': False, 'zeroline': False, #'visible': False,
                #'style': {'display': 'none'},
                'domain': [scatter_frac_domain, 1]
            },
            ...
            'xaxis2': {
                'showgrid': False, 'showline': False, 'zeroline': False, 'visible': False,
                'domain': [0, scatter_frac_domain],
                'range': [-1, 0.2]
            }
            ...
        }
        ...
    }
This creates the auxiliary scatterplot axis and physically links it to the main one.
- The heatmap's new row scatterplot has its selected data linked to the main scatterplot.
@app.callback(
    Output('landscape-plot', 'selectedData'),
    [Input('main-heatmap', 'selectedData')]
)
def update_stored_heatmap_data(hm_selection):
    return hm_selection
Similarly, landscape-plot.selectedData is added as an input dependency of landscape-plot.figure, so that the selection updates on the main scatterplot; a rough sketch of that callback's signature follows.
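This is only a sketch of that dependency, with placeholder names for the callback and its other inputs; the body is elided since it is just the main scatterplot construction from earlier posts.

@app.callback(
    Output('landscape-plot', 'figure'),
    [Input('color-selection', 'value'),
     Input('landscape-plot', 'selectedData')]
)
def update_landscape(color_var, landscape_selection):
    # Rebuild the main scatterplot, highlighting the current selection
    # (possibly just pushed over from the heatmap's row scatterplot).
    ...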
- A displayed component num-selected-counter is added to count how many points are selected and display the result to the user. It occupies little space, and the size of a selection can otherwise be difficult to gauge even to within an order of magnitude.
...
html.Div(
    id='num-selected-counter',
    className='three columns',
    children='# selected: ',
    style={
        # 'display': 'none',
        'textAlign': 'center',
        'color': params['font_color'],
        'padding-top': '0px'
    }
)
...
...
@app.callback(
    Output('num-selected-counter', 'children'),
    [Input('landscape-plot', 'selectedData')]
)
def update_numselected_counter(
    landscape_data
):
    num_selected = len(get_pointIDs(landscape_data))
    return '# selected: {}'.format(num_selected)
- The callback function that displays the heatmap is changed to take the currently displayed color variable as an argument.
@app.callback(
    Output('main-heatmap', 'figure'),
    [Input('landscape-plot', 'selectedData'),
     Input('color-selection', 'value')])
def update_heatmap(
    selected_points,
    color_var
):
    if (selected_points is not None) and ('points' in selected_points):
        selected_IDs = [p['text'].split('<br>')[0] for p in selected_points['points']]
    else:
        selected_IDs = []
    if len(selected_IDs) == 0:
        selected_IDs = data_df.index
    subsetted_data = data_df.loc[np.array(selected_IDs)]
    subsetted_data = custom_colwise_norm_df(subsetted_data)
    row_annotations = None
    heatmap_fig = display_heatmap_cb(
        subsetted_data, color_var,
        row_annots=row_annotations,
        xaxis_label=True, yaxis_label=True,
        col_alphabet=False, mode='sklearn'
    )
    return heatmap_fig
Summary
After establishing a way to see the data in the first post, and a way to see the substructure of that data in heatmap form in the second post, we've done two things here:
- Switched to a larger single-cell dataset and viewed its features.
- Established a way to select subsets using the interactively viewed substructure.
Next: interactive visualization
Next, I'll share something I've been anticipating for a while - building some simple methods for interactive visualization into our browser.
This sets us up perfectly to do truly interactive computation in the browser, because supervised learning and other advanced algorithms run at interactive speeds on 10k observations. Future posts will head in this direction.
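As a rough illustration of that claim, here's a hypothetical timing sketch on the subsampled data, reusing interesting_feat_ndces from above to restrict to the most variable features. The classifier choice and parameters are placeholders, and actual timings will depend on the machine.

from sklearn.linear_model import LogisticRegression
import time

# Restrict to the most variable features, then time a simple classifier fit
# against the broad cell type labels.
feat_ndces = interesting_feat_ndces(gtex9_adta_10k_smooth.X, num_feats_todisplay=500)
X = gtex9_adta_10k_smooth.X[:, feat_ndces]
y = gtex9_adta_10k_smooth.obs['Broad cell type'].astype(str).values

itime = time.time()
clf = LogisticRegression(max_iter=200).fit(X, y)
print("Classifier fit on 10k observations. Time: {}".format(time.time() - itime))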
Bonus changes: refactoring coclustering for sparse data
On truly large, sparse data, we'd like the algorithm to be blazing fast.
Such data result in very sparse graphs, which can be dealt with efficiently to find spectral cuts as per the algorithm.
It's just a matter of rewriting the coclustering algorithm code from scikit-learn and plugging the new implementation into our interface. We are looking at refactoring just a small part of the scikit-learn code. But rather than accessing private parts of the code piecemeal, I'll opt here to rewrite it explicitly - it's not long, and quite insightful for future use.
Old code
The old code looks something like this:
def _fit(self, X):
    normalized_data, row_diag, col_diag = _scale_normalize(X)
    n_sv = 1 + int(np.ceil(np.log2(self.n_clusters)))
    u, v = self._svd(normalized_data, n_sv, n_discard=1)
    z = np.vstack((row_diag[:, np.newaxis] * u, col_diag[:, np.newaxis] * v))
    _, labels = self._k_means(z, self.n_clusters)

    n_rows = X.shape[0]
    self.row_labels_ = labels[:n_rows]
    self.column_labels_ = labels[n_rows:]

    self.rows_ = np.vstack([self.row_labels_ == c for c in range(self.n_clusters)])
    self.columns_ = np.vstack(
        [self.column_labels_ == c for c in range(self.n_clusters)]
    )
This consists of a few steps, starting from the $n \times d$ data matrix $X$:
- The data matrix $X$ is normalized: if the row sums of $X$ are arranged in an $n \times n$ diagonal matrix $R$ and the column sums in a $d \times d$ diagonal matrix $C$, the data are normalized according to $$ X' := R^{-1/2} X C^{-1/2} $$ This is the normalization function _scale_normalize() here.
- The normalized data matrix's top few left and right singular vectors are computed and normalized.
- The computed singular vectors are clustered (by k-means) to get the co-clusters, and thereby ordered for display. This use of singular-vector-based representations of the matrix's respective rows and columns is akin to spectral clustering.
The old code's implementation runs fast on sparse data, since it uses SVD routines that scale well on sparse inputs.
New code
I replace the above function by porting each step from sklearn:
- The data matrix $X$ is normalized: This is basically unchanged from sklearn's way of doing it, _scale_normalize() here. The main difference is that we expect the input matrix to be nonnegative.
def scale_normalize(X):
    """
    Normalize nonnegative ``X`` by scaling rows and columns independently.
    Returns the normalized matrix and the row and column scaling factors.
    """
    row_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=1))).squeeze()
    col_diag = np.asarray(1.0 / np.sqrt(X.sum(axis=0))).squeeze()
    # Guard against empty rows/columns, whose zero sums give inf/nan above.
    row_diag[np.isinf(row_diag)] = 0
    col_diag[np.isinf(col_diag)] = 0
    row_diag = np.where(np.isnan(row_diag), 0, row_diag)
    col_diag = np.where(np.isnan(col_diag), 0, col_diag)
    if scipy.sparse.issparse(X):
        n_rows, n_cols = X.shape
        r = scipy.sparse.dia_matrix((row_diag, [0]), shape=(n_rows, n_rows))
        c = scipy.sparse.dia_matrix((col_diag, [0]), shape=(n_cols, n_cols))
        an = r * X * c
    else:
        an = row_diag[:, np.newaxis] * X * col_diag
    return an, row_diag, col_diag
- The normalized data matrix's top few left and right singular vectors are computed and normalized. This involves using a carefully chosen SVD routine, sklearn.utils.extmath.randomized_svd(), which uses state-of-the-art methods that do the computations in a nearly optimal low-rank subspace [Halko et al. 2011]. Other routines from scipy.sparse.linalg can be used for the same purpose, namely svds(), or eigsh() for symmetric matrices; a minimal sketch of the svds() route follows the code below.
from sklearn.utils.extmath import randomized_svd
def spectral_coembed(
    X,
    n_clusters=25,
    n_components=2,
    n_discard=1
):
    """
    Args:
        X: (n x d) data matrix used, n observations by d features.
    Returns:
        Spectral coembedding of (n + d) rows and columns of X; first the rows, then the columns.
        Uses (n_components - n_discard) components.
    """
    normalized_data, row_diag, col_diag = scale_normalize(X)
    n_sv = 1 + int(np.ceil(np.log2(n_clusters))) if n_components is None else n_components
    u, _, vt = randomized_svd(normalized_data, n_sv)
    u = u[:, n_discard:]
    v = vt[n_discard:].T
    z = np.vstack((row_diag[:, np.newaxis] * u, col_diag[:, np.newaxis] * v))
    return z
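For reference, here is a minimal sketch of the scipy.sparse.linalg.svds() alternative mentioned above, assuming the same normalized_data and n_sv as inside spectral_coembed; svds does not guarantee an ordering of the returned components, so they are sorted largest-first here to match randomized_svd's convention.

from scipy.sparse.linalg import svds

# Compute the top n_sv singular triplets of the (sparse) normalized matrix.
u, s, vt = svds(scipy.sparse.csr_matrix(normalized_data), k=n_sv)
# Sort components by singular value, largest first, before discarding the leading one.
order = np.argsort(s)[::-1]
u, vt = u[:, order], vt[order]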
- The computed rows/columns are ordered for display using the first coordinate of their embedding, i.e. the singular vector corresponding to the second-largest singular value.
from sklearn.cluster import KMeans
def cocluster_from_embedding(X, n_clusters=25, cluster=True):
    """
    Args:
        X: (n x d) data matrix used, n observations by d features.
    Returns:
        Indices to display the matrix with.
    """
    z = spectral_coembed(X, n_clusters=n_clusters)
    n_rows = X.shape[0]
    if cluster:
        kmodel = KMeans(n_clusters)
        kmodel.fit(z)
        labels = kmodel.labels_
        row_labels_ = labels[:n_rows]
        column_labels_ = labels[n_rows:]
        return np.argsort(row_labels_), np.argsort(column_labels_)
    else:
        fiedler_vec = z[:, 0]
        return np.argsort(fiedler_vec[:n_rows]), np.argsort(fiedler_vec[n_rows:])
This function cocluster_from_embedding is now callable from the main coclustering function that the interface uses, as shown below. (I also include, and show here, a small tweak to select a default number of clusters based on the dimensions of the matrix.)
def compute_coclustering(
    fit_data,
    num_clusters=1,
    mode='custom'
):
    if num_clusters == 1:
        # Default: choose the number of clusters from the matrix dimensions.
        num_clusters = 1 + int(np.ceil(np.log2(min(fit_data.shape[0], fit_data.shape[1]))))
    if mode == 'sklearn':
        if scipy.sparse.issparse(fit_data):
            fit_data = fit_data.toarray()
        row_labels, col_labels = cocluster_core_sklearn(fit_data, num_clusters)
        return (np.argsort(row_labels), np.argsort(col_labels))
    elif mode == 'custom':
        return cocluster_from_embedding(fit_data)
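As a quick usage sketch (with the subsampled data from earlier standing in as the input), the returned orderings can be applied directly to reorder a matrix for display:

# Row and column display orderings from the custom (sparse-friendly) coclustering.
row_order, col_order = compute_coclustering(gtex9_adta_10k_smooth.X, mode='custom')
display_matrix = gtex9_adta_10k_smooth.X[row_order, :][:, col_order]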
Finally, we slightly change the heatmap rendering in the function display_heatmap_cb(), making a couple of accommodations for the changed configuration options.
- Eraslan, G., Drokhlyansky, E., Anand, S., et al. 2022. Single-nucleus cross-tissue molecular reference maps toward understanding disease gene function. Science 376, 6594, eabl4290.
- Becht, E., McInnes, L., Healy, J., et al. 2019. Dimensionality reduction for visualizing single-cell data using UMAP. Nature biotechnology 37, 1, 38–44.
- Halko, N., Martinsson, P.-G., and Tropp, J.A. 2011. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM review 53, 2, 217–288.