import numpy as np
import scipy.special

def unit_ball_volume(d):
    """Volume V_d of a unit ball in d dimensions."""
    return (np.pi ** (0.5 * d)) / scipy.special.gamma(0.5 * d + 1)
Akshay Balsubramani
September 1, 2023
The kNN graph is an indispensable tool to nonparametrically take advantage of manifold structure in embeddings for modeling. However, its essentially discrete nature can be limiting for some analyses. It’s often convenient to take a more continuous perspective on modeling the data manifold.
In this post, I’m working through code to interpolate over the kNN graph to model continuous distributions on the manifold, all the way down[^1] to fine-grained neighborhoods in the dataset. This opens up a whole new world of possibilities for modeling and analysis.
There is a wonderful way to do this, using some extremely simple modeling assumptions with profound consequences.
Our embedding consists of some noisy observations from a manifold, embedded in a latent space. We want to try and model the data density on the manifold from these observations and their kNN graph.
Let’s start with some assumptions.
Even if these assumptions can’t be satisfied exactly, they are typically satisfied approximately, and the resulting model can be very useful. It turns out that these assumptions are enough to specify a local probability model for the data – an (inhomogeneous) Poisson point process (PPP).
Any process that satisfies the assumptions above is a PPP! The PPP is a simple model that parametrizes everything in terms of a rate function \(\lambda (x)\) over points in the latent space \(x \in \mathcal{X}\). The rate function \(\lambda (x)\) is the expected number of points per unit volume near \(x\) (so the expected count in a small neighborhood \(A\) of \(x\) is approximately \(\lambda(x) \, \text{vol}(A)\)); it is the key quantity that determines the PPP.
The PPP is a very natural model for the data density on the manifold. I like the axiomatic justification presented above, because it isolates the key assumption - that the underlying data density being estimated doesn’t change over small neighborhoods that encompass a few data points each. This is true, for instance, when the data are smoothly distributed on a low-dimensional embedded manifold.[^2]
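To make the rate-function picture concrete, here is a minimal sketch (not from the original analysis) of sampling an inhomogeneous PPP on the unit square by thinning a homogeneous one; the rate function `bump_rate` and the bound `rate_max` below are illustrative choices.

```python
import numpy as np

def sample_ppp(rate_fn, rate_max, dim=2, rng=None):
    """Sample an inhomogeneous PPP on the unit cube [0, 1]^dim by thinning.

    rate_fn  : maps an (m, dim) array of points to m nonnegative rates
    rate_max : an upper bound on rate_fn over the cube (assumed, not checked)
    """
    rng = np.random.default_rng(rng)
    # Homogeneous PPP with rate rate_max on the unit cube (volume 1):
    # a Poisson(rate_max) number of uniformly placed points.
    n_points = rng.poisson(rate_max)
    points = rng.uniform(size=(n_points, dim))
    # Keep each point independently with probability rate_fn(x) / rate_max.
    keep = rng.uniform(size=n_points) < rate_fn(points) / rate_max
    return points[keep]

# Example: a rate function peaked at the center of the unit square.
bump_rate = lambda x: 500.0 * np.exp(-20.0 * np.sum((x - 0.5) ** 2, axis=1))
points = sample_ppp(bump_rate, rate_max=500.0)
```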
I’ll explore some of its applications.
In a PPP setting, we can calculate the distribution function of the distance \(r_k (x)\) from a given point \(x\) to its \(k^{th}\) nearest neighbor, by counting points inside a ball \(\mathcal{B}(x,r)\) of radius \(r\) around \(x\). This ball has volume \(\text{vol}(\mathcal{B}(x, r)) = V_d r^d\), where \(V_{d}\)[^3] is the volume of the \(d\)-dimensional ball of radius 1.
In detail, the cdf of \(r_1\) at any \(x\) is given by \[ D_1 (R) := \text{Pr}(r_1 \leq R) = 1 - \text{Pr}(N (\mathcal{B}(x, R)) = 0) = 1 - \exp \left( - V_{d} (\lambda (x)) R^{d} \right) \]
We can extend this to the distribution of the distance to the \(k^{th}\) nearest neighbor \(r_k\): \[ D_{k} (R) := \text{Pr}(r_k \leq R) = 1 - \text{Pr}(N (\mathcal{B}(x, R)) < k) = 1 - \exp \left( - V_{d} (\lambda (x)) R^{d} \right) \sum_{n=0}^{k-1} \frac{(V_{d} (\lambda (x)) R^{d})^{n}}{n!} \]
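As a sanity check, \(D_k\) is easy to evaluate numerically: the count \(N(\mathcal{B}(x,R))\) is Poisson with mean \(\lambda(x) V_d R^d\), and \(r_k \leq R\) exactly when that count is at least \(k\). The sketch below uses the `unit_ball_volume` helper from the top of the post; the numbers in the example are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import poisson

def knn_distance_cdf(R, k, rate, d):
    """D_k(R) = Pr(r_k <= R) at a point whose locally constant rate is `rate`, in d dims."""
    mean_count = rate * unit_ball_volume(d) * np.asarray(R, dtype=float) ** d
    # Pr(N(B(x, R)) >= k) for a Poisson(mean_count) count.
    return poisson.sf(k - 1, mean_count)

# Illustrative numbers: distance to the 5th neighbor, rate 100, intrinsic dimension 3.
print(knn_distance_cdf(R=0.3, k=5, rate=100.0, d=3))
```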
The distribution of distances to each data point’s nearest neighbor gives insights into the density of the data in that space. This line of thinking leads to formulating a maximum-likelihood estimate for the local density at any given point \(x\). It’s a powerful nonparametric estimate that doesn’t require a priori information about the density function. The form is especially simple for \(k=1\), i.e. only considering each point’s single nearest neighbor.
Differentiating the density of \(r_1\) (that is, \(D_1'\)) with respect to \(r_1\), setting it to zero, and solving for the rate gives the maximum-likelihood estimate for the local density at \(x\) under the PPP model using the single nearest neighbor (Otto et al. 2023): \[ \hat{\rho} (r_{1} \mid d) = \frac{(d-1) }{d r_{1}^{d} V_{d}} \] To interpret this, notice that \(\hat{\rho} (s \mid d) \cdot \text{vol}(\mathcal{B}(x, s)) = \frac{(d-1) }{d} \approx 1\), so the MLE basically just smooths the probability mass of \(x\) over a ball going out to the nearest neighbor of \(x\).
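In code, the 1-NN estimate is a one-liner given the first column of the kNN distance matrix. This is a sketch following the formula above, reusing the `unit_ball_volume` helper from the top of the post; the function name and interface are mine.

```python
import numpy as np

def density_MLE_1nn(nn1_distances, manifold_dim):
    """MLE of the local density at each point from its single nearest neighbor.

    nn1_distances : array, shape (n_samples,), distance to each point's 1st neighbor
    manifold_dim  : scalar or array, intrinsic dimension d (shared or per point)
    """
    d = np.asarray(manifold_dim, dtype=float)
    return (d - 1.0) / (d * nn1_distances ** d * unit_ball_volume(d))
```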
One of the signature uses of this probability model in machine learning comes from (Levina and Bickel 2004), who use these assumptions to estimate the intrinsic dimension of the manifold that the data lie on. The data manifold is some subset of the embedding space where the data actually lie. When the data-generating process has just a few interesting degrees of freedom, the data often lie on a lower-dimensional embedded manifold.
This method estimates the “intrinsic” dimension of that lower-dimensional space where the data lie. The key idea is that the volumes of neighborhoods behave differently for different values of intrinsic dimension. By the third assumption above, the probability of a neighborhood is proportional to its volume, so we can estimate the intrinsic dimension by fitting a power law to the neighborhood volumes using the Hill estimator.
The resulting estimator is very simple, kNN-graph-based, and occupies a pretty unique place in the literature. Here’s a useful implementation I made a while back for kNN graphs. More recently, there is also the scikit-dimension
package (Bac et al. 2021) for this and other intrinsic dimension estimation routines.
def intrinsic_dim_MLE(nbr_distances, bias_correction=True):
    """Returns the MLE of the local intrinsic dimension at each data point.
    Uses the neighborhood size given by the number of columns in `nbr_distances`.

    Parameters
    ----------
    nbr_distances : array, shape (n_samples, neighborhood_size)
        Distances to the k nearest neighbors of each point.
        Entry (i, j) is the distance from point i to its (j+1)st nearest neighbor.
    bias_correction : bool, optional (default=True)
        If True, apply bias correction to the estimator.

    Returns
    -------
    idim_fn : array, shape (n_samples,)
        Estimated intrinsic dimension at each data point.
    """
    log_dists = np.log(nbr_distances)
    # log(r_k / r_j) for j = 1, ..., k-1, where r_k is the farthest of the k neighbors.
    log_ratios = log_dists[:, -1:] - log_dists[:, :-1]
    idim_fn = np.reciprocal(log_ratios.mean(axis=1))
    if bias_correction:
        k = nbr_distances.shape[1]
        idim_fn = idim_fn * ((k - 2) / (k - 1))
    return idim_fn
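As a quick usage sketch (not from the original post): generate data lying on a 5-dimensional linear subspace of a 20-dimensional space, build the kNN distance matrix with scikit-learn, and check that the estimated intrinsic dimension comes out near 5.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
# 5 intrinsic dimensions, linearly embedded in 20 ambient dimensions.
X = rng.standard_normal((5000, 5)) @ rng.standard_normal((5, 20))

k = 20
# Querying the training set returns each point itself at distance 0,
# so ask for k+1 neighbors and drop the first column.
nbr_distances, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
nbr_distances = nbr_distances[:, 1:]

print(np.median(intrinsic_dim_MLE(nbr_distances)))   # roughly 5
```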
Another way to think about this estimator is that it comes from a probability calculation over the neighborhood of a point \(x\). If \(X\) is drawn uniformly from a ball of radius \(R\) around \(x\) in \(d\)-dimensional space, then (Block et al. 2022): \[ d = \left[ \mathbb{E} \left( \log \frac{R}{\| X - x \|} \right) \right]^{-1} \] The empirical version of this estimator in the neighborhood around \(x\) is exactly the one from (Levina and Bickel 2004) above.
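A quick simulation makes this identity concrete: draw points uniformly from a \(d\)-dimensional ball (centered at the origin for simplicity) and compare the reciprocal of the empirical mean of \(\log(R/\|X\|)\) to \(d\). The sampling construction below is a standard one, not from the post.

```python
import numpy as np

def uniform_ball_sample(n, d, R=1.0, rng=None):
    """Draw n points uniformly from the d-dimensional ball of radius R centered at 0."""
    rng = np.random.default_rng(rng)
    directions = rng.standard_normal((n, d))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    radii = R * rng.uniform(size=n) ** (1.0 / d)   # the radius has cdf (r/R)^d
    return directions * radii[:, None]

d, R = 7, 2.0
X = uniform_ball_sample(200_000, d, R, rng=0)
print(1.0 / np.mean(np.log(R / np.linalg.norm(X, axis=1))))   # close to 7
```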
There are various ways to convert this into an estimate of probability density over the embedding space. We’ve already seen an intuitive way of doing this with the MLE estimate of the density at each data point, which smooths the probability of the point over a 1-NN ball around it.
1-NN methods are often plagued by high variance; what happens when we use more than one neighbor? A straightforward, direct way of estimating the local density at any \(x\) extends the 1-NN idea to the \(k\)-NN ball, smoothing the probability of \(k\) data points over the \(k\)-NN ball around \(x\).
To motivate it, fix \(k \geq 1\) and suppose the data \(x_1, \dots, x_n\) are drawn from a density \(\rho (x)\). We can think of the probability of any one point landing in the neighborhood ball as \(p_{k} \approx \frac{k}{n}\). Then the number of points that fall into this neighborhood is like flipping \(n\) coins, each with probability \(p_{k}\) of coming up heads – it’s a \(\text{Binomial}(n, p_{k})\) random variable. Its mean is \(n p_{k} \approx k\), and the fraction of points landing in the neighborhood has mean \(p_{k}\) and variance \(\frac{p_{k} ( 1-p_{k} )}{n} \to 0\) as \(n \to \infty\), so the count concentrates around its mean.
The neighborhood ball is also small enough that \(\rho\) is approximately constant over it. So we can estimate \(\rho\) by dividing the number of points in the neighborhood by the volume of the neighborhood, giving a family of estimates \(\hat{\rho}_{k}\) for \(k \geq 1\): \[ \hat{\rho}_{k} (x) := \frac{k}{r_{k}^{d} V_{d}} \implies \log \hat{\rho}_{k} (x) = \log \frac{k}{V_{d}} - d \log r_{k}\] recalling that \(r_k (x)\) is the distance from \(x\) to its \(k^{th}\) nearest neighbor in \(d\) dimensions.
The following function returns \(\left\{ \hat{\rho}_{1} (x_i), \hat{\rho}_{2} (x_i), \dots, \hat{\rho}_{k} (x_i) \right\}_{i=1}^{n}\), given the kNN graph.
def knn_density(nbr_distances, manifold_dim=None):
    """Returns kNN log-density estimates at each data point, for a range of k at once.

    Parameters
    ----------
    nbr_distances : array, shape (n_samples, neighborhood_size)
        Distances to the k nearest neighbors of each point.
        Entry (i, j) is the distance from point i to its (j+1)st nearest neighbor.
    manifold_dim : array, optional (default=None)
        Local dimension of the neighborhood around each data point.
        If None, it is estimated with `intrinsic_dim_MLE`.

    Returns
    -------
    log_density_fn : array, shape (n_samples, neighborhood_size)
        Estimated log-density at each point using kNN.
        Entry (i, j) is the estimated log-density at point i using (j+1) nearest neighbors.
    """
    num_neighbors = nbr_distances.shape[1]
    if manifold_dim is None:
        manifold_dim = intrinsic_dim_MLE(nbr_distances)
    # Column vector of local dimensions, for broadcasting across the k columns.
    manifold_dim = np.reshape(manifold_dim, (-1, 1))
    # log rho_hat_k(x) = log k - d(x) log r_k(x) - log V_{d(x)}, for k = 1, ..., k_max.
    log_density_fn = -manifold_dim * np.log(nbr_distances)
    log_density_fn += np.log(1 + np.arange(num_neighbors))
    log_density_fn -= np.log(unit_ball_volume(manifold_dim))
    return log_density_fn
This shows that with the neighborhood graph already computed and with vectorized computation, it takes no extra effort to compute \(\hat{\rho}_k\) for a whole range of values of \(k\) at once. The estimates can then be averaged over \(k\) to obtain a final, more stable estimate of the distribution.
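One way that averaging might look in code, as a sketch building on `knn_density` above: take a geometric mean of the \(\hat{\rho}_k\) over a range of \(k\) (i.e., average the log-densities), then normalize over the data points to get a discrete probability distribution on the dataset. The averaging range and the normalization are my own illustrative choices.

```python
import numpy as np
from scipy.special import logsumexp

def averaged_knn_density(nbr_distances, min_k=2, manifold_dim=None):
    """Geometric-mean average of the kNN density estimates over k >= min_k,
    normalized into a discrete probability distribution over the data points."""
    log_dens = knn_density(nbr_distances, manifold_dim=manifold_dim)
    # Average the log-densities over the chosen range of k (columns min_k-1 onward).
    avg_log_dens = log_dens[:, min_k - 1:].mean(axis=1)
    # Normalize so the result sums to 1 over the n data points.
    return np.exp(avg_log_dens - logsumexp(avg_log_dens))
```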
This density estimate is the basis for the Kozachenko–Leonenko estimator, the best-known entropy estimator for continuous distributions on manifolds. The entropy of a distribution is often used as a measure of its complexity, and being able to estimate it efficiently is a useful tool.
The entropy of the distribution \(\rho\) is defined as \[ H (\rho) := - \int_{x \in \mathbb{R}^{d}} \rho (x) \log \rho (x) dx \]
The Kozachenko–Leonenko estimator (Berrett, Samworth, and Yuan 2019) is \[ \hat{H}_{k} (x_1, \dots, x_n) := - \frac{1}{n} \sum_{i=1}^{n} \log \left( \hat{\rho}_{k} (x_i) \right) + \log \left( \frac{k}{e^{\Psi(k)}} \right) \] where \(\Psi (k) := - \gamma + \sum_{j=1}^{k-1} \frac{1}{j}\) is the digamma function evaluated at the positive integer \(k\).
Up to an additive correction factor, this is just the plug-in estimator for the entropy of \(\hat{\rho}_{k}\), averaged over the \(n\) data points. The correction \(\log k - \Psi(k)\) accounts for the bias of the plug-in estimate; it vanishes as \(k \to \infty\), and the overall estimator is consistent as \(n \to \infty\).
This is simple to implement in a general form, in terms of the density estimates \(\hat{\rho}_{k}\) defined already above.
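Here’s a sketch of what that might look like, building on `knn_density` above and using `scipy.special.digamma` for \(\Psi\); the function name and interface are mine, and it follows the formula exactly as written.

```python
import numpy as np
from scipy.special import digamma

def kozachenko_leonenko_entropy(nbr_distances, k=None, manifold_dim=None):
    """Entropy estimate from kNN distances: the plug-in average of -log(rho_hat_k)
    plus the additive correction log(k) - digamma(k)."""
    log_density_fn = knn_density(nbr_distances, manifold_dim=manifold_dim)
    if k is None:
        k = nbr_distances.shape[1]   # default: use the full neighborhood
    plug_in = -np.mean(log_density_fn[:, k - 1])   # -(1/n) sum_i log(rho_hat_k(x_i))
    return plug_in + np.log(k) - digamma(k)
```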
[^1]: As far as statistically possible, actually.

[^2]: Specifically, it’s true whenever the mean value theorem of calculus holds for the density; we’re already implicitly assuming this, in doing the density-based modeling and viewing it as meaningful.

[^3]: \(V_{d} := \frac{\pi^{d/2}}{\Gamma(1 + \frac{d}{2})}\)