The fundamental ideas underpinning modern AI and machine learning involve recognizing statistical patterns in sampled training data - patterns that generalize usefully to a test-time setting. The basic probabilistic questions involved here are universal:
How much can be learned about a probability distribution from a finite sample of it?
What happens when the test-time distribution being evaluated is different from the training distribution?
Which patterns and representations usefully generalize to other distributions and datasets?
To begin understanding these basic principles of machine learning, we need to understand concentration of samples from a distribution. This understanding has been put together progressively over time. The story is related to the origins of the field, but goes much further back, to the heart of probability and information theory. It is also central to statistical mechanics, a field studying the collective (macroscopic) behavior of large populations of (microscopic) atoms.
In the late 19th century, as scientists tried to relate the properties of bulk matter to its individual atoms, they sought a quantitative theory of this collective behavior. For many decades the field stalled at a basic level of understanding, with unexplained observations plentiful and unifying explanations scarce - the problem was daunting because of the huge numbers of microscopic entities involved.
Progress came from a revolutionary insight by Boltzmann. When he considered a discretized probability-calculation scenario in 1877, he launched the field of statistical mechanics. The argument he used remains foundational in an AI context even today, when models learned by loss minimization, like deep neural networks, dominate.
So we explore Boltzmann’s core statistical mechanics understanding in detail. With this entry point, we ultimately describe modern-day (loss-minimization-based) AI modeling in statistical mechanics terms. This particular link between modeling and statistical mechanics is foundational, and contains many basic insights into models. It allows us to see the essential unity between the basic concepts underlying loss minimization, information theory, and statistical physics.
Context among related work
Starting from Boltzmann, the history of these ideas is long and vast, going back to their development in physics in the late 19th and early 20th centuries; many of these concepts are well known in modern ML as well. Here is a brief discussion - more detailed elaboration will be in the notes.
As we have stated, these ideas have a long history in the statistical mechanics of Boltzmann and Gibbs. In the postwar years of the 1940s, the basics resurfaced for discrete alphabets with the flowering of information theory. The field’s engineering focus, built on semiconductor advances in computing, led to a different approach and emphasis, centered on coding and the processing of probabilistic uncertainty (information).
These ideas were being developed in statistics as well, by Fisher and many others. The different focus there is reflected in the primary role of “exponential families,” particularly in connection with sufficient statistics. Interestingly, it took longer to appreciate the connections between this branch of statistics and the ongoing work in information theory. Methods of information geometry have more recently been developed around some of these ideas.
Presently, many ML researchers are aware of these ideas through their appearance in “energy-based models” (EBMs), which take the energy function of a Gibbs ensemble as a starting point for modeling. EBMs are a key part of the modern deep learning toolkit, as a class of models for which a wide range of interpretations and tools is available.
For EBMs, the connections between the learning loss-minimization perspective and the statistical mechanics perspective are applied by formal analogy, as a way to avoid dealing with a probability-normalization constraint. The model class and distribution are typically given a priori, without using max-entropy or the observation constraints associated with the energy function. Other works use statistical mechanics notions like energy and temperature as powerful tools to analyze black-box computation.
Our purpose here is to center this view, allowing us to elucidate the statistical mechanics view of prediction in full generality.
Scope
The purpose here is to comprehensively lay out the statistical mechanics view of loss minimization scenarios. Such scenarios are everywhere in modern AI systems. This viewpoint enables us to cross-fertilize ideas between our world of data modeling on one hand, and statistical physics and information theory on the other.
For the most part, these are well-established concepts that do not require complex technical machinery to handle real-world modeling scenarios. So we will prove many general results, but with as little technical overhead as possible. Self-contained derivations of these results are helpful for several reasons:
The results have well-understood interpretations in the real world, describing how observed bulk properties emerge from a few basic underlying principles. These interpretations can be translated into data-modeling situations.
The methods used for the derivations, from optimization, game theory, and convexity, are familiar tools to AI researchers. But in the context of the results in physics and information theory, they lead to sometimes nonstandard derivations of known results. And these derivations’ broad scope and applicability to new problems and loss functions give the results significant new power, considerably broadening their applications.
Following developments in AI and modeling, the powerful results of physics and information theory can be extended: to new models, to systems whose size or regularization does not occur in observable physics, and more.
Concentration: Boltzmann’s probability calculation
Start with the first basic question from earlier: How much can be learned about a probability distribution from a finite sample of it?
In probability theory terms,
given a distribution \(P\) over a set of outcomes \(\mathcal{X}\), we are looking to understand the consequences of repeatedly sampling from \(P\). Boltzmann’s insight was that if the set of outcomes \(\mathcal{X}\) is finite, this question can be tractably answered by explicitly counting the possibilities.
Suppose \(\mathcal{X}\) is finite: \(\mathcal{X} = \{x_1, x_2, \dots, x_D \}\). We draw \(n\) samples from this set independently according to a probability distribution \(P = (P_1, \dots, P_D)\), and we observe the frequencies of each outcome. Let \(n_i\) be the number of times we observe outcome \(x_i\), so that \(\sum_{i} n_{i} = n\). The observed probability of outcome \(x_i\) is then \(Q_i := n_i / n\), so \(Q\) is another probability distribution over \(\mathcal{X}\).
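As a concrete illustration of this setup (a minimal sketch; the specific \(P\), \(n\), and the use of NumPy are our own illustrative choices, not part of the argument), here is how the observed distribution \(Q\) arises from sampling:

```python
import numpy as np

# Minimal sketch: draw n i.i.d. samples from a finite distribution P
# and form the observed (empirical) distribution Q = counts / n.
rng = np.random.default_rng(0)
P = np.array([0.5, 0.3, 0.2])   # true distribution over D = 3 outcomes (illustrative)
n = 1000                        # sample size (illustrative)

counts = rng.multinomial(n, P)  # n_i: number of times outcome x_i was observed
Q = counts / n                  # observed distribution; sums to 1 by construction
print("counts:", counts, " Q:", Q)
```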
Boltzmann calculated the probability of observing a particular set of frequencies \(\{n_1, \dots, n_D\}\) in this situation: \[
\begin{aligned}
\frac{1}{n} \log \text{Pr} (n_1, \dots, n_D)
&= - \text{D} (Q \mid \mid P ) + \frac{1}{2 n} \log \left( \frac{2 \pi n}{ \prod_{i=1}^{D} (2 \pi n Q_i) } \right) + \Theta (1 / n^2) \\
\end{aligned}
\]
This is an extremely accurate and powerful approximation even for moderate sample sizes \(n\), and it tells us the likelihood of observing any particular histogram of outcomes.
We’ve calculated the chance of observing a particular set of frequencies given \(P\).
The real distribution \(P\) only enters the picture through its divergence from the observed distribution \(Q\). This \(- \text{D} (Q \mid \mid P )\) is also by far the dominant term, as all the others are of order \(\frac{\log n}{n}\).
In short, observing the histogram \(Q\) alone does not determine \(P\), but it does give us enough information to precisely quantify the likelihood of deviations of \(Q\) from \(P\).
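To see this concretely, here is a small numerical check of the approximation (a hedged sketch using SciPy; the particular \(P\), \(n\), and histogram below are arbitrary illustrative choices):

```python
import numpy as np
from scipy.stats import multinomial

# Sanity-check the approximation
#   (1/n) log Pr(n_1,...,n_D) ≈ -D(Q||P) + (1/(2n)) log( 2πn / Π_i (2πn Q_i) )
# against the exact multinomial log-probability.
P = np.array([0.5, 0.3, 0.2])
n = 500
counts = np.array([260, 145, 95])   # one particular observed histogram (sums to n)
Q = counts / n

exact = multinomial.logpmf(counts, n, P) / n
kl = np.sum(Q * np.log(Q / P))                                   # D(Q || P)
correction = np.log(2 * np.pi * n / np.prod(2 * np.pi * n * Q)) / (2 * n)

print("exact:", exact, " approx:", -kl + correction)             # agree to ~Θ(1/n²)
```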
Proof
Boltzmann’s proof remains the most direct one more than a century later, and it is well worth recapitulating. Discretizing the probability space, and discretizing quanta of probability (at resolution \(\frac{1}{n}\)), gives essentially a generalized balls-in-bins problem, which Boltzmann solved. The calculation is simply a matter of accounting for the differently weighted “bins” (outcomes), and the combinatorially many ways of throwing “balls” (quanta of probability) into them.
We can rewrite the multinomial coefficient using Stirling’s approximation, \(\log n! = n \log n - n + \frac{1}{2} \log (2 \pi n) + \Theta(1/n)\).
\[\begin{align*}
\log \left( \frac{n!}{(n Q_1)! \cdots (n Q_D)!} \right)
&= n \log n - n + \frac{1}{2} \log (2 \pi n) - \sum_{i=1}^{D} \left[ n Q_i \log n Q_i - n Q_i + \frac{1}{2} \log (2 \pi n Q_i) \right] + \Theta (1/n) \\
&= n \log n + \frac{1}{2} \log (2 \pi n) - n \sum_{i=1}^{D} Q_i \log n Q_i - \frac{1}{2} \sum_{i=1}^{D} \log (2 \pi n Q_i) + \Theta (1/n) \\
&= n \text{H} (Q) + \frac{1}{2} \log (2 \pi n) - \frac{1}{2} \sum_{i=1}^{D} \log (2 \pi n Q_i) + \Theta (1/n)
\end{align*}\]
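As a quick check of this expansion (our own illustrative addition; the histogram below is arbitrary), one can compute the exact log-multinomial coefficient via the log-Gamma function and compare:

```python
import numpy as np
from scipy.special import gammaln

# Check: log( n! / Π_i (n Q_i)! ) ≈ n H(Q) + (1/2) log(2πn) - (1/2) Σ_i log(2πn Q_i)
counts = np.array([90, 60, 30, 20])     # observed counts n_i (sum to n = 200)
n = counts.sum()
Q = counts / n

exact = gammaln(n + 1) - np.sum(gammaln(counts + 1))      # exact, via log-Gamma
H = -np.sum(Q * np.log(Q))                                # entropy H(Q), in nats
approx = n * H + 0.5 * np.log(2 * np.pi * n) - 0.5 * np.sum(np.log(2 * np.pi * counts))

print("exact:", exact, " approx:", approx)                # differ by Θ(1/n)
```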
The probability of observing the frequencies \(n_1, \dots, n_D\) is this multinomial coefficient multiplied by \(\prod_{i=1}^{D} P_i^{n_i}\), which contributes \(n \sum_{i=1}^{D} Q_i \log P_i\) to the logarithm. Since \(n \text{H} (Q) + n \sum_{i=1}^{D} Q_i \log P_i = - n \sum_{i=1}^{D} Q_i \log \frac{Q_i}{P_i}\), we therefore get \[
\begin{aligned}
\text{Pr} (n_1, \dots, n_D)
&= \exp \left( - n \sum_{i=1}^{D} Q_i \log \frac{Q_i}{P_i} + \frac{1}{2} \log \left( \frac{2 \pi n}{ \prod_{i=1}^{D} (2 \pi n Q_i) } \right) + \Theta (1/n) \right) \\
\end{aligned}
\]
Taking the logarithm, dividing by \(n\) (so the \(\Theta(1/n)\) remainder becomes \(\Theta(1/n^2)\)), and writing \(\sum_{i} Q_i \log \frac{Q_i}{P_i} = \text{D} (Q \mid \mid P)\) gives the result.
Impact and consequences
What Boltzmann called his “probability calculations” launched the field of statistical mechanics, and inspired the field of information theory decades later. This is because discretizing the space is a fully general technique, containing all the essential elements used to study concentration and collective behavior in statistical mechanics.
The calculation shows the degeneracy of the histogram \(Q\) - the “macrostate” of the \(n\)-sample dataset - over “microstates,” i.e. the individual outcomes of each of the \(n\) samples. This was the idea that allowed physicists to quantify observable bulk properties of matter (macrostates) in terms of the unobservable configurations of its individual atoms (microstates).
In AI and data science, the system being studied (the “matter”) is a dataset comprising \(n\) examples, whose individual values constitute the microstate. The macrostate consists of our coarse-grained observations about the dataset, as we develop further in the following sections.
Enter entropy
In this calculation, the log-multinomial coefficient \(\log \left( \frac{n!}{(n Q_1)! \cdots (n Q_D)!} \right)\) is \(\approx n \text{H} (Q)\); this leading term dominates, and Stirling’s formula is in fact the only approximation we have made in the calculation. How accurate is it?
We can quantify this by comparing the exact log-multinomial coefficient to its approximation (including the \(\frac{1}{2} \log\) correction terms), exponentiated, for different \(Q\) (and \(D\)), plotted against \(n\); a small numerical sketch of this comparison follows below. Even for moderate \(n\), the relative error is below 1%.
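Here is such a sketch (using SciPy’s log-Gamma function; the distributions and sample sizes are arbitrary illustrative choices, and a plot would show the same quantity against \(n\)):

```python
import numpy as np
from scipy.special import gammaln

def log_multinomial(counts):
    """Exact log of n! / (n_1! ... n_D!), via the log-Gamma function."""
    n = counts.sum()
    return gammaln(n + 1) - np.sum(gammaln(counts + 1))

def stirling_approx(counts):
    """n H(Q) plus the (1/2) log correction terms from the calculation above."""
    n = counts.sum()
    Q = counts / n
    return (n * -np.sum(Q * np.log(Q))
            + 0.5 * np.log(2 * np.pi * n)
            - 0.5 * np.sum(np.log(2 * np.pi * counts)))

for Q in [np.array([0.5, 0.5]), np.array([0.6, 0.3, 0.1])]:
    for n in [100, 1000, 10000]:
        counts = np.round(n * Q).astype(int)     # a histogram with frequencies Q
        ratio = np.exp(log_multinomial(counts) - stirling_approx(counts))
        print(f"D={len(Q)}  n={n:6d}  relative error = {abs(ratio - 1):.2e}")
```

The printed relative error shrinks roughly like \(1/n\), reflecting the \(\Theta(1/n)\) Stirling remainder.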
This multinomial coefficient is the number of ways to get the same observed macrostate (the histogram \(Q\)) from particular microstates (the individual values of all the examples in the dataset). In other words, microstates can be counted in terms of the entropy of the observed macrostate.
Putting all this together, we arrive at some powerful insights.
Entropy is a measure of the microscopic multiplicity, or degeneracy, associated with a limited set of macroscopic/average observations of the underlying microstate. High-entropy configurations are exponentially more likely than other configurations - they dominate observed configurations for large \(n\).
Therefore, the macrostate maximizing entropy is the “most likely” state. Entropy maximization accounts for the many microstates that are consistent with a set of macroscopic observations.
In the macroscopically sized samples of atoms we observe in everyday life, our observations are almost deterministic, even of highly disordered systems like gaseous and liquid matter. This is because they are made of moles of individual constituents, and our handful of bulk properties observed about them corresponds to only a handful of constraints.
In the statistical mechanical view of AI, when trying to minimize loss over datasets, we can similarly say that the entropy of the learned distribution tends to be nearly maximal. (We later prove this, in a very general sense.)
What we’ve described, with a discrete “alphabet” of possible states, is the concept called the asymptotic equipartition property in information theory – high-entropy sequences occur with much greater multiplicity than low-entropy ones (Cover and Thomas 2006). Boltzmann’s “probability calculation” shows exactly why, with entropy emerging as a measure of (log) multiplicity.
References
Cover, Thomas M, and Joy A Thomas. 2006. Elements of Information Theory. Wiley-Interscience.