Foundations of concentration and entropy

statmech
Where entropy comes from
Author: Akshay Balsubramani
Published: July 23, 2024
Modified: August 19, 2024

Introduction

The fundamental ideas of artificial intelligence involve recognizing statistical patterns in training data that generalize usefully to a test-time setting. The foundational probabilistic questions here are universal:

  • How much can be learned about a probability distribution from a finite sample of it?
  • What happens when the test-time distribution being evaluated is different from the training distribution?
  • Which patterns and representations usefully generalize to other distributions and datasets?

To begin answering these basic questions of machine learning, we need to understand how samples from a distribution concentrate. The story goes back to the origins of the field, and much further back, to the heart of probability and information theory. It is also central to statistical mechanics, a field founded to predict the properties of bulk matter from its individual atoms.

Understanding of these matters stalled at a rudimentary level for many decades, until Boltzmann had a revolutionary insight in the late 19th century.
By considering a simplified, discretized scenario in 1877 (see (Boltzmann 1877), trans. (Sharp and Matschinsky 2015)), he launched the field of statistical mechanics (Ellis 1999). His argument remains foundational in an AI context even today, so we will develop and refine it in detail to describe the statistical mechanics of modeling.

Boltzmann’s probability calculation

Distributions spread probability over a set of outcomes \(\mathcal{X}\). Given a distribution \(P\) over \(\mathcal{X}\), we are looking to understand the consequences of repeatedly sampling from \(P\).

Suppose we have a finite set of outcomes \(\mathcal{X} = \{x_1, x_2, \dots, x_D \}\). We draw \(n\) samples from this set independently according to a probability distribution \(P = (P_1, \dots, P_D)\), and we observe the frequencies of each outcome. Let \(n_i\) be the number of times we observe outcome \(x_i\), so that \(\sum_{i} n_{i} = n\). The observed probability of outcome \(x_i\) is then \(Q_i := n_i / n\), so \(Q\) is another probability distribution over \(\mathcal{X}\).
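As a concrete illustration of this setup, here is a minimal sketch (assuming numpy is available; the variable names are illustrative, not from the text) that draws \(n\) samples from a distribution \(P\) over \(D\) outcomes and forms the observed distribution \(Q\):

CODE
import numpy as np

rng = np.random.default_rng(0)
D, n = 4, 1000
P = np.array([0.1, 0.2, 0.3, 0.4])    # true distribution over D outcomes

counts = rng.multinomial(n, P)        # n_1, ..., n_D: observed counts, summing to n
Q = counts / n                        # observed distribution Q
print(counts, Q)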

Boltzmann calculated the probability of observing a particular set of frequencies \(\{n_1, \dots, n_D\}\) in this situation:

\[ \begin{aligned} \frac{1}{n} \log \text{Pr} (x_1 = n_1, \dots, x_D = n_D) &= - \text{D} (Q \mid \mid P ) + \frac{1}{2 n} \log \left( \frac{2 \pi n}{ \prod_{i=1}^{D} (2 \pi n Q_i) } \right) + \Theta (1 / n^2) \\ \end{aligned} \]

This approximation is extremely accurate even for moderate sample sizes \(n\), and it tells us the likelihood of observing any particular configuration of outcomes.

  • We’ve calculated the chance of observing a particular set of frequencies given \(P\).

  • The real distribution \(P\) only enters the picture through its divergence from the observed distribution \(Q\). This \(- \text{D} (Q \mid \mid P )\) is also by far the dominant term, as all the others are \(O(1/n)\).

In short, observing the histogram \(Q\) alone does not determine \(P\), but it does give us enough information to precisely quantify the likelihood of deviations of \(Q\) from \(P\)!
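As a numerical sanity check of this formula, the following sketch (assuming scipy is available; the names are illustrative) compares the exact log-probability of a set of observed counts with \(- n \text{D} (Q \mid \mid P)\) plus the correction term:

CODE
import numpy as np
from scipy.stats import multinomial

rng = np.random.default_rng(0)
P = np.array([0.1, 0.2, 0.3, 0.4])
n = 1000
counts = rng.multinomial(n, P)     # observed counts (none are zero here, since n * min(P) is large)
Q = counts / n

exact = multinomial.logpmf(counts, n, P)                            # exact log Pr(x_1 = n_1, ..., x_D = n_D)
kl = np.sum(Q * np.log(Q / P))                                      # D(Q || P)
correction = 0.5 * np.log(2 * np.pi * n / np.prod(2 * np.pi * counts))
print(exact, -n * kl + correction)                                  # the two agree up to O(1/n)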

Impact and consequences

What Boltzmann called his “probability calculations” (Boltzmann 1877) launched the field of statistical mechanics, and inspired the field of information theory decades later. This is because discretizing the space is a fully general technique, containing all the essential elements used to study concentration and collective behavior in statistical mechanics.

The calculation shows the degeneracy of the histogram \(Q\) - the “macrostate” of the \(n\)-sample dataset - over the “microstates,” i.e. the individual outcomes of each of the \(n\) samples. This is the idea that allowed physicists to quantify observable bulk properties of matter (macrostates) in terms of unobservable configurations of each of its atoms (microstates).

Proof

Boltzmann’s proof remains the most direct one nearly a century and a half later. It can be skipped; the argument above stands on its own.

\(\textbf{Proof}\)

\[\begin{align*} \text{Pr} (x_1 = n_1, \dots, x_D = n_D) &= \frac{n!}{n_1! \cdots n_D!} P_1^{n_1} \cdots P_D^{n_D} \\ &= \frac{n!}{n_1! \cdots n_D!} \exp \left( n_1 \log P_1 + \cdots + n_D \log P_D \right) \\ &= \frac{n!}{(n Q_1)! \cdots (n Q_D)!} \exp \left( n \sum_{i=1}^{D} Q_i \log P_i \right) \\ &= \frac{n!}{(n Q_1)! \cdots (n Q_D)!} \exp \left( - n \text{H} (Q, P) \right) \end{align*}\]

Here we have substituted \(n_i = n Q_i\), and \(\text{H} (Q, P) := - \sum_{i=1}^{D} Q_i \log P_i\) is the cross-entropy.

We can rewrite the multinomial coefficient using Stirling’s approximation 1

\[\begin{align*} \log \left( \frac{n!}{(n Q_1)! \cdots (n Q_D)!} \right) &= n \log n - n + \frac{1}{2} \log (2 \pi n) - \sum_{i=1}^{D} \left[ n Q_i \log (n Q_i) - n Q_i + \frac{1}{2} \log (2 \pi n Q_i) \right] + \Theta (1/n) \\ &= n \log n + \frac{1}{2} \log (2 \pi n) - n \sum_{i=1}^{D} Q_i \log (n Q_i) - \frac{1}{2} \sum_{i=1}^{D} \log (2 \pi n Q_i) + \Theta (1/n) \\ &= n \text{H} (Q) + \frac{1}{2} \log (2 \pi n) - \frac{1}{2} \sum_{i=1}^{D} \log (2 \pi n Q_i) + \Theta (1/n) \end{align*}\]

Therefore, the probability of observing the frequencies \(n_1, \dots, n_D\) is \[ \begin{aligned} \text{Pr} (x_1 = n_1, \dots, x_D = n_D) &= \exp \left( - n \sum_{i=1}^{D} Q_i \log \frac{Q_i}{P_i} + \frac{1}{2} \log \left( \frac{2 \pi n}{ \prod_{i=1}^{D} (2 \pi n Q_i) } \right) + \Theta (1/n) \right) \\ \end{aligned} \]

Taking logarithms and dividing by \(n\) gives the result.
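For completeness, here is a quick numerical check of the Stirling expansion used above (a sketch assuming scipy; \(\log n!\) is computed as gammaln(n + 1)):

CODE
import numpy as np
from scipy.special import gammaln

for n in [10, 40, 1000]:
    exact = gammaln(n + 1)                                        # log n!
    stirling = n * np.log(n) - n + 0.5 * np.log(2 * np.pi * n)    # three-term Stirling expansion
    print(n, exact - stirling)                                    # gap shrinks like 1/(12 n)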

Enter entropy

In this calculation, the log-multinomial coefficient \(\log \left( \frac{n!}{(n Q_1)! \cdots (n Q_D)!} \right)\) is \(\approx n \text{H} (Q)\), with the approximation being very accurate for even moderate \(n\).

How accurate is this? We can quantify the relative error, i.e. the difference between the log-multinomial coefficient and \(n \text{H} (Q)\), as a fraction of the former. Do this for different distributions (and \(D\)), and plot it vs. \(n\): even for moderate \(n\), the relative error is only a few percent, and it shrinks as \(n\) grows.

Computing this coefficient is the same kind of question as “how many ways are there to arrange the letters in the word ‘ABRACADABRA’?” - a multinomial coefficient, in that case \(\frac{11!}{5! \, 2! \, 2! \, 1! \, 1!} = 83160\).

CODE
import numpy as np
from scipy.special import gammaln   # log-gamma function, for log-factorials without overflow

num_outcomes = 25
n = 1000
# Uniform histogram over the outcomes, so each count is n / num_outcomes = 40
p_dist = np.ones(num_outcomes) * (1.0 / num_outcomes)
counts = n * p_dist

# Log-multinomial coefficient: log( n! / ((n p_1)! * ... * (n p_D)!) )
log_multinomial = gammaln(n + 1) - np.sum(gammaln(counts + 1))

# Entropy approximation n * H(p), and its relative error
n_entropy = -n * np.sum(p_dist * np.log(p_dist))
print(log_multinomial, n_entropy, (n_entropy - log_multinomial) / log_multinomial)

This multinomial coefficient is the number of ways to get the same observed macrostate (the histogram \(Q\)) from particular microstates, i.e. the unobserved individual configurations of each datapoint. In other words, the entropy of the observed macrostate measures (the log of) the number of microstates consistent with it.

Entropy is a measure of the microscopic multiplicity, or degeneracy: how many underlying microstates are compatible with a limited set of macroscopic/average observations.

Since entropy measures the log of the multinomial coefficient, high-entropy macrostates are backed by exponentially more microstates than other macrostates - for large enough \(n\), the overwhelming majority of possible microstates give rise to a high-entropy observed configuration. The sketch below illustrates the scale of this effect.
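To see this concretely, we can compare the number of microstates behind a near-uniform histogram with the number behind a skewed histogram at the same \(n\) (a sketch assuming scipy; the two example histograms are arbitrary choices):

CODE
import numpy as np
from scipy.special import gammaln

def log_multiplicity(counts):
    # Log of the multinomial coefficient n! / (n_1! ... n_D!): the log-count of microstates
    counts = np.asarray(counts)
    return gammaln(counts.sum() + 1) - np.sum(gammaln(counts + 1))

uniform = [25, 25, 25, 25]    # high-entropy macrostate, n = 100
skewed = [85, 5, 5, 5]        # low-entropy macrostate, n = 100
print(log_multiplicity(uniform) - log_multiplicity(skewed))   # log of the ratio of microstate counts: about 78 nats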

This happens overwhelmingly in the macroscopically sized samples of atoms we observe in everyday life: because they contain so many atoms, our observations of them are almost deterministic, even for highly disordered systems like gases and liquids. In the statistical mechanical view, when minimizing loss over datasets, the entropy of the distribution tends to be nearly maximal (as we will prove).

References

Boltzmann, Ludwig. 1877. Über die Beziehung zwischen dem zweiten Hauptsatze der mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung, respective den Sätzen über das Wärmegleichgewicht. Wiener Berichte. Vol. 76. K. k. Hof- und Staatsdruckerei.
Ellis, Richard S. 1999. “The Theory of Large Deviations: From Boltzmann’s 1877 Calculation to Equilibrium Macrostates in 2D Turbulence.” Physica D: Nonlinear Phenomena 133 (1-4): 106–36.
Sharp, Kim, and Franz Matschinsky. 2015. “Translation of Ludwig Boltzmann’s Paper ‘On the Relationship Between the Second Fundamental Theorem of the Mechanical Theory of Heat and Probability Calculations Regarding the Conditions for Thermal Equilibrium’, Sitzungberichte der Kaiserlichen Akademie der Wissenschaften, Mathematisch-Naturwissen Classe, Abt. II, LXXVI 1877, pp. 373–435 (Wien. Ber. 1877, 76: 373–435). Reprinted in Wiss. Abhandlungen, Vol. II, Reprint 42, pp. 164–223, Barth, Leipzig, 1909.” Entropy 17 (4): 1971–2009.

Footnotes

  1. That is, \[ \log n! = n \log n - n + \frac{1}{2} \log (2 \pi n) + \Theta (1/n) \]