Visualization tools for science and computation

dataviz

Overview

Increasing capabilities in domain-specific languages

Author

Akshay Balsubramani

Bootstrapping visualization functionality from scratch

Persistent information, such as that which might be useful to an organization or a long-running project, is recorded in several different ways. The most common are those meant for general human consumption: chiefly text documents of various kinds, as well as slides and other primarily textual media.

But information also exists in scientific programs and other forms of tacit knowledge which are not primarily explicit textual communications. Visualizing and analyzing this content can be more challenging.

To bridge this gap, there are bespoke visualizations, which tend to be specific to particular fields. They can be extremely powerful both for visualizing data and for analyzing it.

We can explain the situation in terms of the types of manipulations that the user performs on the content. The visual system is one of the highest-bandwidth ways for us to absorb information. And the highest-bandwidth way for us to emit information is typically text. This explains the ubiquity of slide decks and other similar universal visualization tools for forming a mental picture with text.

Interactive dashboards and other visualization platforms allow us to perform rich and precise actions, like selecting a subset quickly, which are also high-bandwidth ways of emitting information. Coupling these outputs to our richest inputs (visual) is an immediately appealing strategy, and explains why rich visual dashboards have opened up new frontiers in interactive data analysis.

On the other hand, the more general-purpose textual formats do not allow as flexible a grammar of actions to be taken on the data. More specific formats, up to and including interactive visualization and domain-specific dashboarding, are the gold standard for what can be done with appropriate visualization tools. Between these two extremes, general-purpose textual formats on one hand and interactive dashboarding on the other, lies a range of trade-offs in which accessibility is traded for power. Increasingly powerful strategies tend to operate in increasingly domain-specific ways, often working with domain-specific data rather than language. The fruits of these strategies are powerful data-driven decisions.

Organizations are often interested in exploring this trade-off, because it arises whenever a resource-strapped organization attempts to do the most it can with what it has. There are various ways in which interactive dashboarding ideas can be realized cheaply, vastly improving the power of default workflows and resources.

In many cases, these represent domain-specific languages for manipulation of data, which can be significantly more powerful and precise than text for their particular purposes. We can build agents and small language models that use tools to perform these domain-specific manipulations, which can be a smart way of using the power of computation for discovery.

General considerations

Colors and how we perceive them are extremely important in conveying information in data. A post describes some ways to work with them that are state-of-the-art for the applied data scientist. Choosing colormaps, or emphasizing particular areas of distinction between them, is hugely effective.
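
As one small illustration of working with perceived color (a sketch of our own here, not necessarily the approach taken in the post mentioned above): a sequential colormap is usually easier to read when its lightness changes monotonically from one end to the other. The snippet below uses the WCAG relative-luminance formula for sRGB as a rough proxy for perceived lightness. The function names and the two example color ramps are hypothetical, chosen only for illustration.

```python
def srgb_to_linear(c: float) -> float:
    """Undo the sRGB transfer curve for one channel value in [0, 1]."""
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb: tuple[int, int, int]) -> float:
    """WCAG relative luminance of an 8-bit sRGB color (0 = black, 1 = white)."""
    r, g, b = (srgb_to_linear(v / 255.0) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def is_monotonic_in_luminance(colors: list[tuple[int, int, int]]) -> bool:
    """True if luminance strictly increases or strictly decreases along the ramp."""
    lums = [relative_luminance(c) for c in colors]
    increasing = all(a < b for a, b in zip(lums, lums[1:]))
    decreasing = all(a > b for a, b in zip(lums, lums[1:]))
    return increasing or decreasing


# Hypothetical ramps, for illustration only.
dark_to_light = [(8, 29, 88), (34, 94, 168), (65, 182, 196), (199, 233, 180), (255, 255, 217)]
rainbow_like = [(0, 0, 255), (0, 255, 0), (255, 0, 0)]

print(is_monotonic_in_luminance(dark_to_light))
print(is_monotonic_in_luminance(rainbow_like))
```

Relative luminance is only a coarse stand-in for perception; a more careful check would work in a perceptually uniform color space.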

Efficient real-time computation can be vital to unlocking the power of data and tacit knowledge. When the visualizations that we describe are linked to computational algorithms and models in the backend, the user gains the power to perform bespoke computational analyses, discovering new structures in data and expressing their preferences through their behavior.

Translating between different visual languages

There are many posts where we’ve needed to implement a translation from one domain-specific visual language to another. Typically, these languages work around and are inspired by the standard HTML/CSS/JavaScript ecosystem, and it eases communication with humans enormously when one can seamlessly flip between them. They have widely varying server- and client-side requirements, and very often their usage depends on the nature of the data, in ways that a data analyst or agent encounters regularly. We write these posts as guides for such analysts and agents, to use as reliable tools.

One such language is Mermaid, a Markdown-inspired text-only representation of rich interactive flowcharts and subnetworks, which works well with the graphs we find in data analysis, as we detail in a post.
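
To make the text-only nature of Mermaid concrete, here is a minimal sketch that turns a directed edge list into Mermaid flowchart text. The helper name `to_mermaid`, the synthetic node ids (`n0`, `n1`, ...), and the example edges are assumptions made for illustration; they are not taken from the post.

```python
def to_mermaid(edges: list[tuple[str, str]], direction: str = "LR") -> str:
    """Render a directed edge list as Mermaid flowchart text."""
    ids: dict[str, str] = {}
    lines = [f"flowchart {direction}"]

    def node_id(label: str) -> str:
        # Labels may contain spaces or punctuation, so each one gets a simple
        # synthetic id and is declared once with a quoted display label.
        if label not in ids:
            ids[label] = f"n{len(ids)}"
            quoted = label.replace('"', "#quot;")  # Mermaid entity code for a quote
            lines.append(f'    {ids[label]}["{quoted}"]')
        return ids[label]

    for src, dst in edges:
        a, b = node_id(src), node_id(dst)
        lines.append(f"    {a} --> {b}")
    return "\n".join(lines)


# Hypothetical analysis pipeline, used only as example input.
example_edges = [
    ("raw data", "embedding"),
    ("embedding", "clusters"),
    ("embedding", "heatmap"),
]
diagram = to_mermaid(example_edges)
print(diagram)
```

The resulting string can be pasted into any Mermaid renderer; since it is plain text, it is also easy for an agent to produce or edit.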

Data browser

Interactive browsers are tools for working with data, which couple our sensory responses to powerful data manipulations. A lot of that power derives from the paths that intelligent pattern-recognizing agents can trace as they range over a dataset with analysis tools.

We can teach the same tools, which constitute both unsupervised and supervised ways of finding structure in the data, to agents that learn to use them. When software-based agents browse these tools, they can process information much faster than humans can, because they are limited by the speed of computation, not the speed of their visual systems.

The potential of this workflow can be realized through interactive computation that is driven by user selections. A theme of several posts is the vast amount that can be done with basic tensor-based computations and clever statistical techniques for summarization and structure learning. The posts form a roughly sequential progression, each building on the previous one.

Seeing layers of information through interactivity

We describe a way to interactively view the standard UMAP and t-SNE plots used to visualize embeddings. This adds interactive layers of visualization through hovering and clicking, allowing far more information to be packed into a scatterplot than into any static visualization.

This is demonstrated on chemical data, where each data point represents a molecule with a corresponding molecular graph representation, which happens to be how human chemists normally view it. The result is often naturally called a “data browser.”

(In organizational contexts, teams are often unable or unwilling to incorporate major external dependencies - to serve such contexts, we emphasize that this functionality can largely be realized in pure HTML.)
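
As a minimal sketch of the dependency-free idea, the snippet below writes a standalone HTML page containing an inline SVG scatterplot in which each point carries a `<title>` child, which browsers show as a native hover tooltip. The function name `scatter_html`, the layout parameters, and the example points are hypothetical, and the actual data browser is richer than this.

```python
from html import escape


def scatter_html(points, width=400, height=300, pad=20):
    """points: list of (x, y, label). Returns a standalone HTML string."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, x1 = min(xs), max(xs)
    y0, y1 = min(ys), max(ys)

    def scale(v, lo, hi, size):
        # Map a data coordinate into the padded pixel range.
        span = (hi - lo) or 1.0
        return pad + (v - lo) / span * (size - 2 * pad)

    circles = []
    for x, y, label in points:
        cx = scale(x, x0, x1, width)
        cy = height - scale(y, y0, y1, height)  # the SVG y axis points down
        circles.append(
            f'<circle cx="{cx:.1f}" cy="{cy:.1f}" r="4" fill="steelblue">'
            f"<title>{escape(label)}</title></circle>"
        )
    svg = (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        + "".join(circles)
        + "</svg>"
    )
    return f"<!DOCTYPE html><html><body>{svg}</body></html>"


# Hypothetical 2-D embedding coordinates with molecule names as labels.
page = scatter_html([(0.0, 0.0, "ethanol"), (1.0, 2.0, "benzene"), (0.5, 1.0, "C<O test")])
```

Saving `page` to a `.html` file gives a plot that opens in any browser with no server and no external libraries.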

Interactive computation by selection

These scatterplots of data have grown to be exceptionally popular ways of using embeddings. Viewers of these scatterplots often want to select patches, clusters, or other subsets of points in the plots. Allowing them to do this in an interface unlocks a range of interactive computation possibilities.
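
Selecting a patch of points ultimately reduces to a geometric test. A minimal sketch, assuming the selection arrives as a closed polygon such as a lasso outline, is the standard ray-casting point-in-polygon check below. The function names and sample coordinates are hypothetical, and an actual interface may use a different selection mechanism.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon given as [(x, y), ...]?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        xa, ya = polygon[i]
        xb, yb = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed
        # by a ray cast from (x, y) towards +x.
        if (ya > y) != (yb > y):
            x_cross = xa + (y - ya) * (xb - xa) / (yb - ya)
            if x < x_cross:
                inside = not inside
    return inside


def select(points, polygon):
    """Indices of the points that fall inside the selection polygon."""
    return [i for i, (px, py) in enumerate(points) if point_in_polygon(px, py, polygon)]


# Hypothetical lasso (a square) and embedding points.
lasso = [(0, 0), (2, 0), (2, 2), (0, 2)]
points = [(1, 1), (3, 1), (0.5, 1.5), (-1, -1)]
selected = select(points, lasso)
```

The returned indices are what downstream computations, such as clustering the selected subset, would consume.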

Truly interactive computation can enable users to discover unknown structure in data intuitively, by iteratively applying a set of basic tools. We implement interactive clustering and co-clustering functionality in the data browser from above, showing how this vastly expands the structure that can be detected.
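
To give a flavor of computation driven by a selection, here is a sketch that clusters only the rows a user has selected. It uses plain k-means (Lloyd's algorithm) as a stand-in; the clustering and co-clustering methods in the data browser itself may differ. The function names, the farthest-point initialization, and the toy embedding are all assumptions for illustration.

```python
import math


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm on a list of equal-length coordinate tuples."""
    # Deterministic farthest-point initialization (an arbitrary choice here).
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: index of the nearest center for every point.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def cluster_selection(embedding, selected, k):
    """Cluster only the selected rows; returns {row index: cluster label}."""
    subset = [embedding[i] for i in selected]
    return dict(zip(selected, kmeans(subset, k)))


# Toy embedding; rows 0-3 are the hypothetical user selection.
embedding = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0), (0, 0.1)]
result = cluster_selection(embedding, [0, 1, 2, 3], k=2)
```

Because only the selected subset is clustered, the cost scales with the size of the selection rather than the whole dataset, which helps keep the interaction responsive.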

Taking this theme further, another complementary and similarly ubiquitous way to view data matrices is the heatmap. We show how to add heatmaps to the toolkit of interactive, selectable plots of data, giving full implementations of algorithms that can be run with heatmap and scatterplot selection combined.
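
A sketch of what a combined selection can compute: rows chosen in the scatterplot and columns chosen in the heatmap together pick out a submatrix, and a simple contrast compares the selected rows against the rest for each chosen column. This is only one hypothetical example of such a computation, not the algorithms from the post; the function names and the toy matrix are assumptions.

```python
def submatrix(matrix, rows, cols):
    """Entries of a row-major matrix restricted to the selected rows and columns."""
    return [[matrix[i][j] for j in cols] for i in rows]


def column_contrast(matrix, rows, cols):
    """For each chosen column: mean over selected rows minus mean over the other rows."""
    if not rows:
        raise ValueError("the row selection must not be empty")
    chosen = set(rows)
    rest = [i for i in range(len(matrix)) if i not in chosen]
    out = {}
    for j in cols:
        inside = sum(matrix[i][j] for i in rows) / len(rows)
        # If every row is selected there is nothing to contrast against.
        outside = sum(matrix[i][j] for i in rest) / len(rest) if rest else 0.0
        out[j] = inside - outside
    return out


# Toy data matrix: 4 samples (rows) by 3 features (columns).
data = [
    [1, 0, 5],
    [1, 0, 5],
    [0, 3, 5],
    [0, 3, 5],
]
picked = submatrix(data, rows=[0, 1], cols=[0, 1])
contrast = column_contrast(data, rows=[0, 1], cols=[0, 1])
```

Columns with a large positive or negative contrast are the features that most distinguish the selected points from the remainder.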

Chemistry-specific visualizations

The HTML/JavaScript stack is a standard idiom for building any type of visualization, whether inside a company or for public domain use. Chemists also operate highly visually, with structured diagrams and associated notations providing high-bandwidth information in ways that are not captured by text. To serve chemistry, therefore, visualizations are an extremely important tool.

We show how to make convenient self-contained HTML plots to visualize embeddings of molecules and model predictions on them, as occurs frequently in molecular design for drug discovery. This uses relevant machinery to render molecular visualizations as vector/raster images, and then stores them in hoverable tooltips. We even allow model predictions to be shown on an atomic level, enabling quite a variety of information to be packed into the static HTML file.

Large virtual spaces are increasingly made in ways that respect the fragment-based structure of molecules. We develop a few tools to visualize such decompositions more easily.
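
To illustrate how rendered molecule images can live inside a single static HTML file, as in the tooltip approach described above, the sketch below embeds an SVG drawing as a base64 data URI in an `<img>` tag. The SVG here is a hand-written placeholder; in practice the depiction would come from a cheminformatics toolkit, which is not shown. The function names, the CSS class, and the prediction value are hypothetical.

```python
import base64
from html import escape


def svg_data_uri(svg_text: str) -> str:
    """Encode SVG markup as a data URI usable in an <img src=...> attribute."""
    encoded = base64.b64encode(svg_text.encode("utf-8")).decode("ascii")
    return f"data:image/svg+xml;base64,{encoded}"


def tooltip_card(name: str, svg_text: str, prediction: float) -> str:
    """HTML fragment for one data point: an embedded image plus a caption."""
    return (
        '<div class="tooltip">'
        f'<img src="{svg_data_uri(svg_text)}" alt="{escape(name, quote=True)}">'
        f"<p>{escape(name)}: predicted {prediction:.2f}</p>"
        "</div>"
    )


# Placeholder drawing (two line segments), standing in for a molecule depiction.
placeholder_svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" width="80" height="40">'
    '<line x1="10" y1="30" x2="40" y2="10" stroke="black"/>'
    '<line x1="40" y1="10" x2="70" y2="30" stroke="black"/>'
    "</svg>"
)
card = tooltip_card("ethanol", placeholder_svg, 0.8312)
```

Because the image bytes travel inside the HTML itself, the file stays self-contained; showing and hiding the card on hover is left to a few lines of CSS or JavaScript not included here.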

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{balsubramani,
  author = {Balsubramani, Akshay},
  title = {Visualization Tools for Science and Computation},
  langid = {en}
}

For attribution, please cite this work as:

Balsubramani, Akshay. n.d. “Visualization Tools for Science and Computation.”