Procedurally generating data for tuning technical LLMs

LLM
ML
Turning tacit technical knowledge into prediction
Author

Akshay Balsubramani

Synthetic data for LLM tuning

Domain-specific languages, which are overloaded subsets of natural language, exist in many technical fields and facilitate the transmission of information in highly structured, domain-specific ways.

Translating between (a) these domain-specific languages on one hand, and (b) the statistical patterns that data science finds in the domain's data on the other, enables people to usefully talk to their data in a domain-specific way.

Teaching a model to do this translation is a form of instruction tuning, which typically relies on a significant quantity of curated instructional data. There are various ways to assemble such a dataset.

  • In general-purpose and non-technical domains, it is common to generate a large number of instructions (with an LLM or otherwise), and then curate a subset of them.
  • In technical domains, the instructions typically follow a few templates, generated by the researchers. The models show degraded performance when asked to generate instructions outside of these templates, especially as part of multi-step reasoning.

There is a middle way that I’ve stumbled into over the course of a variety of projects in the chemistry and biology of drug discovery. The domains and stakeholders differ, but they share a workflow that has proven useful.

The reasons why come down to the structure of valid instructional data, so it is useful to first understand the nature of the data in question.

Language in technical domains

Human language is highly overloaded in technical domains, and each domain uses only a highly specialized subset of it.

  • In chemistry, most strings are not admissible training examples; the common encoding languages SMILES and SELFIES describe how to encode a molecular graph as a string, a necessary precursor for all further learning.
  • In biology and biochemistry, we are sometimes luckier in that DNA/RNA/amino acid sequences are easily representable as text, and almost any sequence of letters can be physically realistic. However, generating “typical” sequences for a particular organism, protein function, or other area of biology remains a nontrivial and significant problem.

Notably, not just any string is “valid” - the valid strings of these languages are a highly informative and significant subset. And unlike for natural language, there is no existing corpus from which to mine this information.
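As a minimal sketch of how sparse the valid subset can be, consider a toy SMILES-like language in which a valid string is a sequence of atoms, each optionally followed by a bond symbol. The validity check below is invented for illustration (real SMILES validity would be checked with a cheminformatics toolkit such as RDKit); it only shows that random strings over the alphabet are rarely valid.

```python
import random
import re

# Toy validity check for a SMILES-like fragment language: a valid string is a
# sequence of atoms, each optionally followed by a bond symbol. This is NOT
# real SMILES parsing; it illustrates that valid strings form a sparse,
# structured subset of all strings over the alphabet.
TOKEN = re.compile(r"^(?:[CNOcno][=#]?)+$")

def is_valid(s: str) -> bool:
    return bool(TOKEN.match(s))

alphabet = list("CNOcno=#X")   # atoms, bonds, and a junk character
random.seed(0)
samples = ["".join(random.choices(alphabet, k=6)) for _ in range(1000)]
valid = [s for s in samples if is_valid(s)]
print(f"{len(valid)} of {len(samples)} random strings are valid")
```

The same sparsity is what makes mining or generating valid strings a nontrivial step rather than a formality.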

So broadly speaking, there are two productive options:

  • Generate a large number of valid strings.
  • Mine the existing data for valid strings.

Both are discussed below.

The structure of a valid string

This problem arises frequently across the broad spectrum of technical and scientific applications, where two common characteristics appear:

  • Domain experts recognize a valid string in their domain – they “know it when they see it.”
    • This is true in each field in a different way; tacit human intuitions have developed around data modalities that did not exist a generation ago. Domain experts know what normal, high-quality, low-quality, and dysregulated data look like in modalities of their expertise.
    • We can see this through the variety of domain-specific interfaces that are essential to their areas of application. Examples abound: 2D and 3D visualization in chemistry; 1D browsing of highly multilayered data in genomics; and 3D visualization of protein complexes, in conjunction with multilayered data, in protein biology. Significantly, all these areas also use the common 2D language of nonparametric scatterplots to visualize high-dimensional embeddings - a UMAP or t-SNE plot is nearly ubiquitous in a multimodal, computation-heavy paper.
  • There is some way to decompose each sample into constituent components. There’s also a way to recombine components into “valid” strings. 1
    • In chemistry and protein biology, these components and the rules for putting them together are key to defining “valid” space in many settings.
      • “Ultralarge” chemical spaces represent the universe of molecules that are synthesizable on demand. These spaces are fragments put together by reactions; all of them are languages over strings representing molecules.
      • Proteins and their complexes with each other and other ligands naturally decompose along certain structural moieties of interest, like binding pockets, active sites, and other higher-order structures.
    • In genomic sequence biology too, these components are the units of analysis.
      • Sometimes this is very well-defined – toolkits for synthetic biology or guide RNA design, for example.
      • The situation is different in long-sequence contexts, where functional regions are less well-defined and can overlap in the genome. There the decomposition is more of an additive decomposition, of relative signal strengths contributed by different components, and the recombination adds those together and uses sampling to generate discretized sequences.
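A sketch of how such component decompositions generate combinatorial spaces: a handful of fragments and linkers already multiply out into a much larger candidate space. The fragment strings below are illustrative placeholders, and plain concatenation stands in for real reaction chemistry.

```python
from itertools import product

# Tiny combinatorial "chemical space": scaffolds joined to caps via linkers.
# Concatenation stands in for real reactions, and the fragment strings are
# placeholders; real make-on-demand spaces apply the same recipe at the
# scale of billions of molecules from a few thousand building blocks.
cores = ["c1ccccc1", "C1CCNCC1"]          # two scaffold fragments
linkers = ["C(=O)N", "O", "S(=O)(=O)"]    # three linkers ("reactions")
caps = ["C", "CC", "CCO"]                 # three capping fragments

space = ["".join(parts) for parts in product(cores, linkers, caps)]
print(len(space))  # 2 * 3 * 3 = 18 candidates
```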

Generating valid instructions programmatically

Instructional data, especially very domain-specific instructions, are like this too!

They’re crucially decomposable into a small set of sentence structures and a small set of tools, which together constitute a combinatorial variety of instruction sequences.

In trying to generate valid instruction tuning data for specialized technical domains, I’ve found my workflows converging to a useful pattern:

  • Assemble an unclean dataset of instructions. Often this is done by recording human instructional data. Of course, we could always use a more powerful LLM to bootstrap this.

  • Break down the instructions into their constituent components. This is typically easily done with some powerful (or not-so-powerful) model.

  • Generate synthetic instructions by reconstituting the components, which may require mild assistance from a more powerful LLM. The key here is that the instructions are being generated programmatically, and the LLM, if required, is only called batch-wise on roughly generated combinations of components.

  • Fine-tune the model on the synthetic instructions. This might be many more instructions than the original dataset, which is normally acceptable because the model being tuned can be cheap/small.

  • Evaluate the model on the original instructions.

Typically, the synthetic instruction set is much bigger than the original instruction set, because it contains a lot of different combinations of components that the original instructions omitted.
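The steps above can be sketched end to end. Here a regex stands in for the (possibly LLM-assisted) decomposition step, and the instruction and tool names are invented for illustration; the point is that a few seed instructions already yield a larger synthetic set by recombination.

```python
import re
from itertools import product

# Hypothetical seed instructions following a "Use <tool> to <action>" pattern.
raw_instructions = [
    "Use dock_ligand to dock aspirin into the pocket",
    "Use fetch_structure to fetch the structure of 1ABC",
    "Use score_pose to score the top-ranked pose",
]

# Decompose: split each instruction into a sentence template and a tool slot.
# (In practice an LLM might do this step; a regex suffices for the sketch.)
templates, tools = set(), set()
for inst in raw_instructions:
    tool, action = re.match(r"Use (\w+) to (.+)", inst).groups()
    tools.add(tool)
    templates.add("Use {tool} to " + action)

# Recombine: every template paired with every tool, generated programmatically.
synthetic = [t.format(tool=tool)
             for t, tool in product(sorted(templates), sorted(tools))]
print(len(raw_instructions), "seed ->", len(synthetic), "synthetic")
```

Already at this toy scale the synthetic set is three times the seed set; with more templates and tools the gap grows combinatorially.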

The decomposition and recombination are not arbitrary; they encode a lot of information about how to generalize over the data. Real induction happens here: generalization beyond the training set of original instructions, because the combinatorially generated valid instructions allow the model to train on instructions very different from those that actually exist.

When doing domain-specific instruction tuning, we know the units of variation in the combinatorial variety of the dataset:

  • The phrases used as instructions
  • The tools being used, and what they are being used for

These are linked by a generation process (often described as a tree/graph); the tools should be matched to the instructions, and then the combination will be valid. This allows us to rapidly accumulate tools and their instructions, and programmatically generate combinations of prompts. In practice, this is useful for encouraging a model to refocus on a particular part of the space to fix particular problems; when we see an instruction path that the model could have taken but did not, we can fine-tune the model on more instructions like it.
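One way to realize this linkage is a compatibility map from tools to the instruction phrases they can satisfy, so that only valid combinations are emitted. All tool, phrase, and target names below are hypothetical; the map plays the role of the generation tree/graph.

```python
from itertools import product

# Compatibility map: each tool is tagged with the kind of phrase it
# satisfies, so recombination only produces valid (phrase, tool) pairings.
# All names here are invented for illustration.
phrases = {
    "visualize": ["Plot {target}", "Render {target} in 3D"],
    "retrieve":  ["Fetch {target}", "Download {target}"],
}
tools = {
    "umap_plot":  "visualize",
    "pymol_view": "visualize",
    "pdb_fetch":  "retrieve",
}
targets = ["the embedding", "structure 1ABC"]

prompts = [
    template.format(target=target) + f" using {tool}"
    for tool, kind in sorted(tools.items())
    for template, target in product(phrases[kind], targets)
]
print(len(prompts))  # 3 tools x 2 templates x 2 targets = 12 valid prompts
```

To refocus a model on a problem area, one would enlarge the relevant phrase or target lists and regenerate just that slice of the space.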

(This is all quite similar to the process of generating ultralarge chemical spaces, for the same set of reasons. Like domain-specific instructions, those ultralarge chemical spaces encode combinatorial variety from relatively few parts.)

Tacit human-language data

There is a wide range of viewpoints on human language as tacit knowledge in the domains we have discussed.

  • On one hand, biologists and chemists are trained to think of data, particularly experimental data, as being in a non-textual domain-specific language.

  • On the other hand, there is a wealth of text-based biochemical data available, because these sciences are ultimately built by and for humans. And we typically want our models to be English-queryable and instructable.

Both of these perspectives are widely acknowledged and hold simultaneously. Incorporating high-dimensional statistical embeddings based on all relevant data is vital, and so is recording the thoughts scientists leave behind. Such spoken, and especially written, communication is often overloaded with a large amount of context, which glues individual datasets together.

Footnotes

  1. These have been abstracted under names like context-free, factorized, or modular grammars, but those distinctions are not relevant here. What is needed are operational ways to break down and build up strings. ↩︎

Reuse

CC BY 4.0