Foundations of concentration and entropy

statmech
Where entropy comes from
Author

Akshay Balsubramani

Foundations: entropy, concentration, and learning

The fundamental ideas underpinning modern AI and machine learning involve recognizing statistical patterns in sampled training data that generalize usefully to a test-time setting. The basic probabilistic questions involved are universal:

  • How much can be learned about a probability distribution from a finite sample of it?

  • What happens when the test-time distribution being evaluated is different from the training distribution?

  • Which patterns and representations usefully generalize to other distributions and datasets?

To begin understanding these basic principles of machine learning, it is useful to understand how samples from a distribution concentrate. This understanding was assembled progressively over time. The story is related to the origins of the field, but goes much further back, to the heart of probability and information theory. It is also central to statistical mechanics, the field studying the collective (macroscopic) behavior of large populations of (microscopic) atoms.

In the late 19th century, as scientists tried to relate the properties of bulk matter to its individual atoms, they sought a quantitative theory of this collective behavior. The field stalled at a basic level of understanding for decades, with unexplained observations plentiful and unifying explanations scarce - the problem was daunting because of the huge numbers of microscopic entities involved.

Progress came from a revolutionary insight by Boltzmann. When he considered a discretized probability-calculation scenario in 1877, he launched the field of statistical mechanics. The argument he used remains foundational in an AI context even today, when models learned by loss minimization, like deep neural networks, dominate.

Here’s an overview of Boltzmann’s core statistical-mechanics insight. With this entry point, we can ultimately describe modern-day (loss-minimization-based) AI modeling in statistical mechanics terms. This particular link between modeling and statistical mechanics is foundational, and contains many basic insights into models. It allows us to see the essential unity among the basic concepts underlying loss minimization, information theory, and statistical physics.

Concentration

Start with the first basic question from earlier: how much can be learned about a probability distribution from a finite sample of it? In probability theory terms, given a distribution \(P\) over a set of outcomes \(\mathcal{X}\), we are looking to understand the consequences of repeatedly sampling from \(P\). Boltzmann’s insight was that if the set of outcomes \(\mathcal{X}\) is finite, this question can be tractably answered by explicitly counting the possibilities.

Suppose \(\mathcal{X}\) is finite: \(\mathcal{X} = \{x_1, x_2, \dots, x_D \}\). We draw \(n\) samples from this set independently according to a probability distribution \(P = (P_1, \dots, P_D)\), and we observe the frequencies of each outcome. Let \(n_i\) be the number of times we observe outcome \(x_i\), so that \(\sum_{i} n_{i} = n\). The observed probability of outcome \(x_i\) is then \(Q_i := n_i / n\), so \(Q\) is another probability distribution over \(\mathcal{X}\). Boltzmann calculated the probability of observing a particular set of frequencies \(\{n_1, \dots, n_D\}\) in this situation: \[ \begin{aligned} \frac{1}{n} \log \text{Pr} (x_1 = n_1, \dots, x_D = n_D) &= - \text{D} (Q \mid \mid P ) + \frac{1}{2 n} \log \left( \frac{2 \pi n}{ \prod_{i=1}^{D} (2 \pi n Q_i) } \right) + \Theta (1 / n^2) \\ \end{aligned} \] This is an extremely accurate approximation even for moderate sample sizes \(n\), and it tells us the likelihood of observing any specific configuration of outcomes.

  • We’ve calculated the chance of observing a particular set of frequencies given \(P\).
  • The real distribution \(P\) only enters the picture through its divergence from the observed distribution \(Q\). This \(- \text{D} (Q \mid \mid P )\) is also by far the dominant term, as all the others vanish as \(O \left( \frac{D \log n}{n} \right)\).

In short, observing the histogram \(Q\) alone does not determine \(P\), but it does give us enough information to precisely quantify the likelihood of deviations of \(Q\) from \(P\).
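
Before walking through the proof, here is a quick numerical sanity check of the statement above - a minimal sketch, not from the original text, using scipy’s exact multinomial log-pmf on a small illustrative distribution over \(D = 4\) outcomes (the distribution and variable names here are our own choices).

CODE
import numpy as np
from scipy.stats import multinomial
from scipy.special import rel_entr

rng = np.random.default_rng(0)
P = np.array([0.5, 0.3, 0.15, 0.05])     # an illustrative distribution, D = 4

for n in [100, 1000, 10000]:
    counts = rng.multinomial(n, P)       # one sampled histogram of n draws
    Q = counts / n                       # the observed distribution
    exact = multinomial.logpmf(counts, n=n, p=P) / n          # (1/n) log Pr
    kl = rel_entr(Q, P).sum()                                 # D(Q || P)
    # Stirling correction term (zero counts guarded to avoid log(0))
    corr = np.log(2*np.pi*n / np.prod(2*np.pi*np.maximum(counts, 1))) / (2*n)
    print(f"n={n:>6}:  exact {exact:+.6f}   -D(Q||P) {-kl:+.6f}   "
          f"-D(Q||P)+correction {-kl + corr:+.6f}")

The corrected value should track the exact per-sample log-probability closely even at moderate \(n\), with \(- \text{D} (Q \mid \mid P )\) already carrying most of it.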

Boltzmann’s probability calculation

Boltzmann’s proof remains the most direct one, even after well over a century, and it is well worth recapitulating. Discretizing the probability space, and discretizing probability into quanta of size \(\frac{1}{n}\), gives essentially a generalized balls-in-bins problem, which Boltzmann solved. The calculation is simply a matter of accounting for the differently weighted “bins” (outcomes) and the combinatorially many ways of throwing “balls” (quanta of probability) into them.

\[\begin{align*} \text{Pr} (x_1 = n_1, \dots, x_D = n_D) &= \frac{n!}{n_1! \cdots n_D!} P_1^{n_1} \cdots P_D^{n_D} \\ &= \frac{n!}{n_1! \cdots n_D!} \exp \left( n_1 \log P_1 + \cdots + n_D \log P_D \right) \\ &= \frac{n!}{(n Q_1)! \cdots (n Q_D)!} \exp \left( n \sum_{i=1}^{D} Q_i \log P_i \right) \\ &= \frac{n!}{(n Q_1)! \cdots (n Q_D)!} \exp \left( - n \text{H} (Q, P) \right) \end{align*}\] where \(\text{H}(Q, P) = - \sum_{i} Q_i \log P_i\) is the cross-entropy. We can rewrite the multinomial coefficient using Stirling’s approximation (\(\log n! = n \log n - n + \frac{1}{2} \log (2 \pi n) + \Theta (1/n)\)): \[\begin{align*} \log \left( \frac{n!}{(n Q_1)! \cdots (n Q_D)!} \right) &= n \log n - n + \frac{1}{2} \log (2 \pi n) - \sum_{i=1}^{D} \left[ n Q_i \log (n Q_i) - n Q_i + \frac{1}{2} \log (2 \pi n Q_i) \right] + \Theta (1/n) \\ &= - n \sum_{i=1}^{D} Q_i \log Q_i + \frac{1}{2} \log (2 \pi n) - \frac{1}{2} \sum_{i=1}^{D} \log (2 \pi n Q_i) + \Theta (1/n) \\ &= n \text{H} (Q) + \frac{1}{2} \log \left( \frac{2 \pi n}{ \prod_{i=1}^{D} (2 \pi n Q_i) } \right) + \Theta (1/n) \end{align*}\] Therefore, the probability of observing the frequencies \(n_1, \dots, n_D\) is \[ \begin{aligned} \text{Pr} (x_1 = n_1, \dots, x_D = n_D) &= \exp \left( - n \sum_{i=1}^{D} Q_i \log \frac{Q_i}{P_i} + \frac{1}{2} \log \left( \frac{2 \pi n}{ \prod_{i=1}^{D} (2 \pi n Q_i) } \right) + \Theta (1/n) \right) \end{aligned} \] Taking logarithms and dividing by \(n\) gives the result.
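
As a sanity check on the counting at the heart of this argument (an illustration of ours, not part of Boltzmann’s proof), we can brute-force a tiny case: enumerate every ordered sequence of draws and confirm that the number of sequences producing each histogram is exactly the multinomial coefficient.

CODE
from itertools import product
from collections import Counter
from math import factorial, prod

D, n = 3, 5                                      # tiny alphabet and sample size
microstates_per_histogram = Counter()
for seq in product(range(D), repeat=n):          # all D**n ordered sequences
    hist = tuple(seq.count(i) for i in range(D)) # the histogram (macrostate)
    microstates_per_histogram[hist] += 1

for hist, count in sorted(microstates_per_histogram.items()):
    multinom = factorial(n) // prod(factorial(k) for k in hist)
    assert count == multinom                     # matches n!/(n_1! ... n_D!)
    print(hist, count)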

Impact and consequences

What Boltzmann called his “probability calculations” launched the field of statistical mechanics, and inspired the field of information theory decades later. This is because discretizing the space is a fully general technique, containing all the essential elements used to study concentration and collective behavior in statistical mechanics.

The calculation quantifies the degeneracy of the histogram \(Q\) - the “macrostate” of the \(n\)-sample dataset - that is, how many “microstates” (the individual outcomes of each of the \(n\) samples) produce it. This was the idea that allowed physicists to quantify observable bulk properties of matter (macrostates) in terms of unobservable configurations of its individual atoms (microstates).

In AI and data science, the system being studied (the “matter”) is a dataset of examples; the full configuration of those examples is the microstate. The macrostate consists of our coarse-grained observations about the dataset, as we develop further in the following sections.

CODE
import numpy as np
from scipy.special import binom, gammaln
from scipy.stats import entropy, dirichlet
import matplotlib.pyplot as plt
import time

def log_multinomial(freqs):
    """Exact log of the multinomial coefficient n! / (n_1! ... n_D!),
    computed as a telescoping product of binomial coefficients."""
    if len(freqs) == 1:
        return 0
    cumsum = 0
    for i in range(len(freqs)):
        these_freqs = freqs[:len(freqs)-i]
        cumsum += log_binomial(sum(these_freqs), these_freqs[-1])
    return cumsum

# A function for computing np.log(n!/(k!(n-k)!)) via log-gamma
def log_binomial(n, k):
    if k == 0:
        return 0
    log_binom = gammaln(n+1) - gammaln(k+1) - gammaln(n-k+1)
    return log_binom

# The (1/2) log( 2 pi n / prod_i (2 pi n_i) ) correction term from the derivation above
def correction_term(freqs):
    n = np.sum(freqs)
    corrected_freqs = np.maximum(freqs, 1)   # Convert 0s to 1s to avoid log(0)
    return 0.5*(np.log(2*np.pi*n) - np.sum(np.log(2*np.pi*corrected_freqs)))


def return_approximations(test_distribution, n_values):
    exact_values = []
    approx_values = []
    corrected_values = []
    itime = time.time()
    for n in n_values:
        # Round n * P to get a representative histogram of counts for n samples
        freqs = np.squeeze(np.round(n*test_distribution))

        # Exact log multinomial coefficient for these counts
        logmult = log_multinomial(freqs)
        print('*', time.time()-itime)
        # Leading-order approximation: n * H(Q), with Q the normalized histogram
        entropy_approx = (np.sum(freqs))*entropy(freqs/np.sum(freqs))
        print('**', time.time() - itime)
        # Entropy approximation plus the Stirling correction term (compare to logmult)
        corrected_approx = entropy_approx + correction_term(freqs)
        print('***', time.time()-itime)

        exact_values.append(logmult*(1.0/n))
        approx_values.append(entropy_approx*(1.0/n))
        corrected_values.append(corrected_approx*(1.0/n))

        results_str = "Log multinomial coefficient (n = {}): {:.2f} \nEntropy approximation: {:.2f} \nCorrected approximation error: {:.4f}\n".format(
            n, 
            logmult, 
            entropy_approx, 
            corrected_approx
        )
        print(results_str)
    return exact_values, approx_values, corrected_values
CODE
num_outcomes = 50000

# Sample sizes to sweep, from 0.2 to 10 samples per possible outcome
n_values = np.geomspace(num_outcomes*0.2, num_outcomes*10, 15).astype(int)

# A near-uniform test distribution, drawn from a symmetric Dirichlet
alpha = np.ones(num_outcomes)
test_distribution_unif = dirichlet.rvs(alpha, size=1, random_state=42)

# A more uneven test distribution, from squared uniform random weights
distr = np.random.rand(num_outcomes)
test_distribution_mult = np.square(distr/np.linalg.norm(distr))

exact_values_unif, approx_values_unif, corrected_values_unif = return_approximations(test_distribution_unif, n_values)
exact_values_mult, approx_values_mult, corrected_values_mult = return_approximations(test_distribution_mult, n_values)
* 35.781620025634766
** 35.784404039382935
*** 35.78456711769104
Log multinomial coefficient (n = 10000): 30337.26 
Entropy approximation: 34457.23 
Corrected approximation error: -11492.5833

* 71.8799660205841
** 71.89258432388306
*** 71.89283013343811
Log multinomial coefficient (n = 13223): 61547.20 
Entropy approximation: 69163.83 
Corrected approximation error: 23157.9197

* 106.77520728111267
** 106.77786612510681
*** 106.77804708480835
Log multinomial coefficient (n = 17486): 106644.15 
Entropy approximation: 118807.16 
Corrected approximation error: 72614.7712

* 141.91067218780518
** 141.91306018829346
*** 141.91323924064636
Log multinomial coefficient (n = 23124): 167866.46 
Entropy approximation: 185425.25 
Corrected approximation error: 138755.5875

* 176.77868127822876
** 176.78119611740112
*** 176.78134417533875
Log multinomial coefficient (n = 30578): 249107.63 
Entropy approximation: 272701.01 
Corrected approximation error: 225057.7763

* 213.09714722633362
** 213.09945130348206
*** 213.09959602355957
Log multinomial coefficient (n = 40436): 353073.85 
Entropy approximation: 382948.34 
Corrected approximation error: 333710.4028

* 248.56120133399963
** 248.56363129615784
*** 248.5637743473053
Log multinomial coefficient (n = 53472): 489503.28 
Entropy approximation: 525866.67 
Corrected approximation error: 474289.1029

* 282.86229038238525
** 282.86300110816956
*** 282.8631360530853
Log multinomial coefficient (n = 70710): 667751.78 
Entropy approximation: 710701.25 
Corrected approximation error: 656041.8695

* 317.3422632217407
** 317.344455242157
*** 317.34459710121155
Log multinomial coefficient (n = 93506): 903786.78 
Entropy approximation: 953419.10 
Corrected approximation error: 894951.6779

* 351.53981709480286
** 351.54245018959045
*** 351.54259300231934
Log multinomial coefficient (n = 123650): 1214624.86 
Entropy approximation: 1270965.67 
Corrected approximation error: 1208076.5294

* 386.5101990699768
** 386.5247552394867
*** 386.5249032974243
Log multinomial coefficient (n = 163512): 1625147.14 
Entropy approximation: 1688154.66 
Corrected approximation error: 1620307.7373

* 421.18227100372314
** 421.18491435050964
*** 421.1850690841675
Log multinomial coefficient (n = 216224): 2168945.49 
Entropy approximation: 2238649.68 
Corrected approximation error: 2165386.0623

* 455.2906770706177
** 455.29137206077576
*** 455.29150915145874
Log multinomial coefficient (n = 285930): 2889129.35 
Entropy approximation: 2965642.38 
Corrected approximation error: 2886610.8010

* 489.443008184433
** 489.4453902244568
*** 489.4455361366272
Log multinomial coefficient (n = 378107): 3841814.35 
Entropy approximation: 3925109.05 
Corrected approximation error: 3840036.5819

* 524.6148092746735
** 524.6298611164093
*** 524.632287979126
Log multinomial coefficient (n = 500000): 5103202.81 
Entropy approximation: 5193286.47 
Corrected approximation error: 5101957.4502

* 34.337100982666016
** 34.33779311180115
*** 34.337929248809814
Log multinomial coefficient (n = 10000): 31856.24 
Entropy approximation: 36172.14 
Corrected approximation error: -9769.6848

* 69.52981305122375
** 69.53298211097717
*** 69.53313398361206
Log multinomial coefficient (n = 13223): 85310.88 
Entropy approximation: 95652.34 
Corrected approximation error: 49710.9583

* 104.17428612709045
** 104.17524290084839
*** 104.17538022994995
Log multinomial coefficient (n = 17486): 134223.01 
Entropy approximation: 149734.26 
Corrected approximation error: 103793.0798

* 138.92219305038452
** 138.92463207244873
*** 138.9250910282135
Log multinomial coefficient (n = 23124): 177867.66 
Entropy approximation: 197840.79 
Corrected approximation error: 151899.7323

* 172.93774509429932
** 172.9392650127411
*** 172.939795255661
Log multinomial coefficient (n = 30578): 262318.40 
Entropy approximation: 287697.83 
Corrected approximation error: 240114.1937

* 206.99364233016968
** 206.99462819099426
*** 206.99506497383118
Log multinomial coefficient (n = 40436): 355412.40 
Entropy approximation: 385993.99 
Corrected approximation error: 336342.1445

* 242.07200717926025
** 242.07436895370483
*** 242.07451105117798
Log multinomial coefficient (n = 53472): 494564.61 
Entropy approximation: 530752.80 
Corrected approximation error: 478133.5754

* 276.1028690338135
** 276.1038360595703
*** 276.10427808761597
Log multinomial coefficient (n = 70710): 669264.54 
Entropy approximation: 711084.62 
Corrected approximation error: 655109.6513

* 310.1585440635681
** 310.1612181663513
*** 310.16139006614685
Log multinomial coefficient (n = 93506): 901102.99 
Entropy approximation: 948711.79 
Corrected approximation error: 888877.7308

* 344.24847316741943
** 344.24931812286377
*** 344.24945402145386
Log multinomial coefficient (n = 123650): 1210659.68 
Entropy approximation: 1264205.73 
Corrected approximation error: 1200116.2049

* 378.32297706604004
** 378.3256959915161
*** 378.325856924057
Log multinomial coefficient (n = 163512): 1623062.55 
Entropy approximation: 1682702.82 
Corrected approximation error: 1613970.1375

* 412.3446829319
** 412.3457441329956
*** 412.346009016037
Log multinomial coefficient (n = 216224): 2166734.56 
Entropy approximation: 2232598.72 
Corrected approximation error: 2158940.0862

* 447.0017261505127
** 447.00399518013
*** 447.0044090747833
Log multinomial coefficient (n = 285930): 2885785.21 
Entropy approximation: 2957905.95 
Corrected approximation error: 2879061.1674

* 482.25389528274536
** 482.2566730976105
*** 482.2569410800934
Log multinomial coefficient (n = 378107): 3838070.08 
Entropy approximation: 3916517.42 
Corrected approximation error: 3832227.9385

* 516.502767086029
** 516.5038421154022
*** 516.5039961338043
Log multinomial coefficient (n = 500000): 5098787.84 
Entropy approximation: 5183640.32 
Corrected approximation error: 5093722.9396
CODE
# Plotting
absc = n_values/num_outcomes

fig, axs = plt.subplots(1, 2, figsize=(8, 5))
axs[0].set_xscale('log')
axs[0].plot(absc, exact_values_unif, label='Exact log-multinomial coefficient', marker='o')
axs[0].plot(absc, approx_values_unif, label='Entropy approximation', linestyle='--', marker='s')
axs[0].plot(absc, corrected_values_unif, label='Corrected entropy approximation', linestyle=':', marker='^')
axs[0].set_xlabel('Number of samples per possible outcome')
axs[0].set_ylabel('Nats / sample')
axs[0].title.set_text('Near-uniform')
axs[0].legend()
axs[0].grid(True)

axs[1].set_xscale('log')
axs[1].plot(absc, exact_values_mult, label='Exact log-multinomial coefficient', marker='o')
axs[1].plot(absc, approx_values_mult, label='Entropy approximation', linestyle='--', marker='s')
axs[1].plot(absc, corrected_values_mult, label='Corrected entropy approximation', linestyle=':', marker='^')
axs[1].set_xlabel('Number of samples per possible outcome')
axs[1].set_ylabel('Nats / sample')
axs[1].title.set_text('Random multinomial')
axs[1].legend()
axs[1].grid(True)

fig.suptitle(f'Approximating multinomial coefficients ({num_outcomes} outcomes)')
plt.tight_layout()
plt.show()
fig.savefig(f'multinomial_approximations_{num_outcomes}.png', dpi=300)

This multinomial coefficient is the number of ways to get the same observed macrostate (the histogram \(Q\)) from particular microstates (the individual values of all the examples in the dataset). In other words, the number of microstates consistent with the observed macrostate is counted by its entropy.

Putting all this together, we arrive at some powerful insights.

  • Entropy measures the microscopic multiplicity, or degeneracy, of a macrostate: how many underlying microstates are consistent with a limited set of macroscopic/average observations. High-entropy configurations are exponentially more likely than other configurations - they dominate observed configurations for large \(n\) (a short numerical check of this multiplicity appears after this list).

  • Therefore, the macrostate maximizing entropy is the “most likely” state. Entropy maximization accounts for the many microstates that are consistent with a set of macroscopic observations.
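
As a concrete check of the first point above (an illustration of ours, using the same scipy helpers as earlier), we can compare the log-multiplicity of a near-uniform histogram with that of a skewed one over \(D = 4\) outcomes; the gap grows roughly like \(n\) times the entropy gap, so the high-entropy macrostate is exponentially more numerous.

CODE
import numpy as np
from scipy.special import gammaln
from scipy.stats import entropy

def log_multiplicity(counts):
    """log of n!/(n_1! ... n_D!): how many microstates produce this histogram."""
    counts = np.asarray(counts)
    return gammaln(counts.sum() + 1) - gammaln(counts + 1).sum()

for n in [40, 400, 4000]:
    high_entropy = np.full(4, n // 4)                      # uniform histogram
    low_entropy = np.array([7*n//10, n//10, n//10, n//10]) # skewed histogram
    gap = log_multiplicity(high_entropy) - log_multiplicity(low_entropy)
    entropy_gap = entropy(high_entropy) - entropy(low_entropy)  # entropy() normalizes counts
    print(f"n={n:>5}: log-multiplicity gap {gap:10.1f}   n * entropy gap {n*entropy_gap:10.1f}")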

In the macroscopically sized samples of atoms we observe in everyday life, our observations are almost deterministic, even for highly disordered systems like gases and liquids. This is because such systems comprise moles of individual constituents, while the handful of bulk properties we observe about them corresponds to only a handful of constraints.

In the statistical mechanical view of AI, when trying to minimize loss over datasets, we can similarly say that the entropy of the learned distribution tends to be nearly maximal. (This can be proved quite generally.)

What we’ve described, with a discrete “alphabet” of possible states, is the concept called the asymptotic equipartition property in information theory – high-entropy sequences occur with much greater multiplicity than low-entropy ones. Boltzmann’s “probability calculation” shows exactly why, with entropy emerging as a measure of (log) multiplicity.
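
To make the equipartition idea concrete, here is a small illustrative simulation (ours, not from the original): for i.i.d. Bernoulli(0.3) sequences, the per-symbol log-probability \(-\frac{1}{n}\log \text{Pr}(x_1, \dots, x_n)\) concentrates around the entropy \(\text{H}(0.3)\) as \(n\) grows, so almost all the probability ends up on roughly \(e^{n \text{H}}\) “typical” sequences of nearly equal probability.

CODE
import numpy as np
from scipy.stats import entropy

p = 0.3
H = entropy([p, 1 - p])                     # entropy of one symbol, in nats
rng = np.random.default_rng(1)

for n in [10, 100, 1000, 10000]:
    seqs = rng.random((2000, n)) < p        # 2000 sampled length-n sequences
    ones = seqs.sum(axis=1)
    log_probs = ones*np.log(p) + (n - ones)*np.log(1 - p)   # log Pr(sequence)
    per_symbol = -log_probs / n
    print(f"n={n:>6}: mean {per_symbol.mean():.4f}  std {per_symbol.std():.4f}  H {H:.4f}")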


Citation

BibTeX citation:
@online{balsubramani,
  author = {Balsubramani, Akshay},
  title = {Foundations of Concentration and Entropy},
  langid = {en}
}
For attribution, please cite this work as:
Balsubramani, Akshay. n.d. “Foundations of Concentration and Entropy.”