Probability densities on embeddings

Tools for modeling distributions on data neighborhoods
graphs
genomics
statistics
Author

Akshay Balsubramani

Modified

September 1, 2023

Probabilities over neighborhoods

The kNN graph is an indispensable tool to nonparametrically take advantage of manifold structure in embeddings for modeling. However, its essentially discrete nature can be limiting for some analyses. It’s often convenient to take a more continuous perspective on modeling the data manifold.

In this post, I’m working through code to interpolate over the kNN graph to model continuous distributions on the manifold, all the way down1 to fine-grained neighborhoods in the dataset. This opens up a whole new world of possibilities for modeling and analysis, asking things like:

  • How can we efficiently estimate the density of the data in a particular region of the manifold?
  • How does this density vary across the manifold?
  • What’s the intrinsic complexity of different neighborhoods of the embedding?

There is a wonderful way to do this, using some extremely simple modeling assumptions with profound consequences.

Modeling assumptions

Our embedding consists of some noisy observations from a manifold, embedded in a latent space. We want to try and model the data density on the manifold from these observations and their kNN graph.

Let’s start with some assumptions.

Assumptions
  • Only one point can be observed per location in latent space: Our observations are fine-grained / noisy enough for this to be reasonable.
  • Probabilities of disjoint neighborhoods are independent: This is a very natural assumption, for small neighborhoods.
  • Neighborhood volume determines neighborhood probabilities: Within a small neighborhood, probabilities are uniform in latent space. This is true for any density that’s smooth enough to do calculus on, which is very natural here.2

Even if these assumptions can’t be satisfied exactly, they are typically satisfied approximately, and the resulting model can be very useful. It turns out that these assumptions are enough to specify a local probability model for the data – an (inhomogeneous) Poisson point process (PPP).

Poisson point process

Any process that satisfies the assumptions above is a PPP! The PPP is a simple model parametrizing things in terms of a rate function \(\lambda (x)\) over points in the latent space \(x \in \mathcal{X}\). The rate (or intensity) function \(\lambda (x)\) gives the expected number of points per unit volume around \(x\), and is the key quantity that determines the PPP.

Defining property of a Poisson process

The number of points in any region \(A \subseteq \mathcal{X}\) is a Poisson random variable with mean \(\lambda_A := \int_A \lambda (x) dx\).

The PPP is a very natural model for the data density on the manifold. I like the axiomatic justification of it presented above, because it isolates the key assumption: that the underlying data density being estimated doesn’t change over small neighborhoods that each encompass a few data points. This is true, for instance, when the data are smoothly distributed on a low-dimensional embedded manifold.
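
To make the defining property concrete, here's a minimal sketch that simulates a homogeneous PPP on the unit square: draw a Poisson count for the region, then scatter that many points uniformly within it. The rate value below is an arbitrary illustrative choice.

CODE
import numpy as np

rng = np.random.default_rng(0)

# Homogeneous PPP on the unit square: rate `lam`, region volume 1.
lam, volume = 50.0, 1.0
n_points = rng.poisson(lam * volume)       # Poisson count with mean lam * volume
points = rng.uniform(size=(n_points, 2))   # given the count, locations are uniform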

I’ll explore some of its applications.

Distances to neighbors

In a PPP setting, we can calculate the distribution function of the distance \(r_k (x)\) from a given point \(x\) to its \(k^{th}\) nearest neighbor, by counting points inside a ball \(\mathcal{B}(x,r)\) of radius \(r\) around \(x\). This ball has volume \(\text{vol}(\mathcal{B}(x, r)) = V_d r^d\), where \(V_{d}\)3 is the volume of the \(d\)-dimensional ball of radius 1.

In detail, the cdf of \(r_1\) at any \(x\) is given by \[ D_1 (R) := \text{Pr}(r_1 \leq R) = 1 - \text{Pr}(N (\mathcal{B}(x, R)) = 0) = 1 - \exp \left( - V_{d} (\lambda (x)) R^{d} \right) \]

CODE
import numpy as np, scipy.special

def unit_ball_volume(d):
    """Volume V_d of a unit ball in d dimensions"""
    return (np.pi ** (0.5*d)) / scipy.special.gamma(0.5*d + 1)

We can extend this to the distribution of the distance to the \(k^{th}\) nearest neighbor \(r_k\): \[ D_{k} (R) := \text{Pr}(r_k \leq R) = 1 - \text{Pr}(N (\mathcal{B}(x, R)) < k) = 1 - \exp \left( - V_{d} (\lambda (x)) R^{d} \right) \sum_{n=0}^{k-1} \frac{(V_{d} (\lambda (x)) R^{d})^{n}}{n!} \]
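
Here's a hedged sketch of this cdf as code, reusing `unit_ball_volume` from above; the function name and signature are my own, and it assumes a locally constant rate.

CODE
import scipy.stats

def knn_distance_cdf(R, k, rate, d):
    """CDF D_k(R) of the distance to the k-th nearest neighbor, under a PPP
    with locally constant rate `rate` in d dimensions (a sketch)."""
    # N(B(x, R)) is Poisson with mean rate * V_d * R^d, and r_k <= R iff N >= k.
    mean_count = rate * unit_ball_volume(d) * (R ** d)
    return 1.0 - scipy.stats.poisson.cdf(k - 1, mean_count)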

Maximum-likelihood density estimates

The distribution of distances to each data point’s nearest neighbor gives insights into the density of the data in that space. This line of thinking leads to formulating a maximum-likelihood estimate for the local density at any given point \(x\). It’s a powerful nonparametric estimate that doesn’t require a priori information about the density function. The form is especially simple for \(k=1\), i.e. only considering each point’s single nearest neighbor.

Differentiating \(D_1\) with respect to \(R\) gives the likelihood of the observed nearest-neighbor distance; setting the derivative of its logarithm (with respect to \(R\)) to zero at the observed \(r_1\) and solving for the density gives the maximum-likelihood estimate of the local density at \(x\) under the PPP model using the single nearest neighbor (Otto et al. 2023): \[ \hat{\rho} (r_{1} \mid d) = \frac{d-1}{d \, r_{1}^{d} V_{d}} \] To interpret this, notice that \(\hat{\rho} (s \mid d) \, \text{vol}(\mathcal{B}(x, s)) = \frac{d-1}{d} \approx 1\), so the estimate basically just smooths the probability mass of \(x\) over a ball going out to the nearest neighbor of \(x\).
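
Given the distance from each point to its single nearest neighbor, this is one line of NumPy. A minimal sketch (the function name and signature here are mine, not from Otto et al. 2023):

CODE
def mle_density_1nn(nn_distances, d):
    """MLE of the local density from each point's 1-NN distance (sketch).

    nn_distances : distances from each point to its single nearest neighbor
    d : dimension to use (e.g. the intrinsic dimension)
    """
    # rho_hat(r_1 | d) = (d - 1) / (d * V_d * r_1^d)
    return (d - 1.0) / (d * unit_ball_volume(d) * nn_distances ** d)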

Local intrinsic dimension

One of the signature uses of this probability distribution in machine learning is by (Levina and Bickel 2004), who use these assumptions to estimate the intrinsic dimension of the manifold that the data lie on. The data manifold is the subset of the embedding space where the data actually lie. When the data-generating process has just a few interesting degrees of freedom, the data often lie on a lower-dimensional embedded manifold.

This method estimates the “intrinsic” dimension of that lower-dimensional space where the data lie. The key idea is that the volume of a neighborhood of radius \(r\) scales as \(r^d\), where \(d\) is the intrinsic dimension. By the third assumption above, the probability of a neighborhood is proportional to its volume, so we can estimate the intrinsic dimension by fitting this power law to the observed neighbor distances using the Hill estimator.

The resulting estimator is very simple, kNN-graph-based, and occupies a pretty unique place in the literature. Here’s a useful implementation I made a while back for kNN graphs. More recently, there is also the scikit-dimension package (Bac et al. 2021) for this and other intrinsic dimension estimation routines.

CODE
def intrinsic_dim_MLE(nbr_distances, bias_correction=True):
    """ Returns MLE estimator of local intrinsic dimension at each data point. 
    Uses neighborhood size given by number of columns in `nbr_distances`.

    Parameters
    ----------
    nbr_distances : array, shape (n_samples, neighborhood_size)
        Distances to the k nearest neighbors of each point. 
        Entry (i, j) is the distance from point i to its (j+1)st nearest neighbor.
    bias_correction : bool, optional (default=True)
        If True, apply bias correction to the estimator.
    
    Returns
    -------
    idim_fn : array-like, shape (n_samples,)
        Estimated intrinsic dimension at each data point.
    """
    # Levina-Bickel MLE: for each point, average log(T_k / T_j) over the
    # neighbors j = 1, ..., k-1, where T_j is the distance to the j-th
    # nearest neighbor; the estimate is the reciprocal of that average.
    log_dists = np.log(nbr_distances)
    idim_fn = np.reciprocal((log_dists[:, -1:] - log_dists[:, :-1]).mean(axis=1))
    if bias_correction:
        # Use (k-2) instead of (k-1) in the denominator to reduce bias.
        k = nbr_distances.shape[1]
        idim_fn = idim_fn * ((k - 2) / (k - 1))
    return idim_fn

Another way to think about this estimator is that it comes from a probability calculation over the neighborhood of a point \(x\). If \(X\) is drawn uniformly from a ball of radius \(R\) around \(x\) in \(d\)-dimensional space, then (Block et al. 2022): \[ d = \left[ \mathbb{E} \left( \log \frac{R}{\| X - x \|} \right) \right]^{-1} \] The empirical version of this estimator in the neighborhood around \(x\) is exactly the one from (Levina and Bickel 2004) above.
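
As a quick sanity check of this identity, here's a Monte Carlo sketch: sample uniformly from a ball and plug the samples into the expectation. The dimension, radius, and sample count below are arbitrary illustrative choices.

CODE
rng = np.random.default_rng(0)
d, R, n_samples = 5, 2.0, 200_000

# Sample uniformly from a d-ball of radius R around the origin:
# a random direction, scaled by R * U^(1/d).
directions = rng.standard_normal((n_samples, d))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
radii = R * rng.uniform(size=n_samples) ** (1.0 / d)
X = directions * radii[:, None]

d_hat = 1.0 / np.mean(np.log(R / np.linalg.norm(X, axis=1)))
print(d_hat)   # should come out close to 5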

A continuous probability density

There are various ways to convert this into an estimate of probability density over the embedding space. We’ve already seen an intuitive way of doing this with the MLE estimate of the density at each data point, which smooths the probability of the point over a 1-NN ball around it.

Efficient estimation with \(k > 1\) neighbors

1-NN methods are often plagued by high variance; what happens when we use more than one neighbor? A straightforward, direct way of estimating the local density at any \(x\) extends the 1-NN idea to the \(k\)-NN ball, smoothing the probability of \(k\) data points over the \(k\)-NN ball around \(x\).

To motivate it, fix \(k \geq 1\) and suppose the data \(x_1, \dots, x_n\) are drawn from a density \(\rho (x)\). We can think of the probability of a single point falling in the neighborhood ball as \(p_{k} \approx \frac{k}{n}\). Then the number of points that fall into this neighborhood is like flipping \(n\) coins, each with probability \(p_{k}\) of heads – it’s a \(\text{Binomial}(n, p_{k})\) random variable. So its mean is \(n p_{k} \approx k\), and the fraction of points falling in the neighborhood has variance \(\frac{p_{k} ( 1-p_{k} )}{n} \to 0\) as \(n \to \infty\).

The neighborhood ball is also small enough that \(\rho\) is approximately constant over it. So we can estimate \(\rho\) by dividing the fraction of points in the neighborhood by the volume of the neighborhood, giving a family of estimates \(\hat{\rho}_{k}\) for \(k \geq 1\): \[ \hat{\rho}_{k} (x) := \frac{k}{n \, r_{k}^{d} V_{d}} \implies \log \hat{\rho}_{k} (x) = \log \frac{k}{n V_{d}} - d \log r_{k}\] recalling that \(r_k (x)\) is the distance from \(x\) to its \(k^{th}\) nearest neighbor in \(d\) dimensions.

The following function returns \(\left\{ \log \hat{\rho}_{1} (x_i), \log \hat{\rho}_{2} (x_i), \dots, \log \hat{\rho}_{k} (x_i) \right\}_{i=1}^{n}\), given the kNN graph.

CODE
def knn_density(nbr_distances, manifold_dim=None):
    """Returns a function that estimates the density of a point using kNN.

    Parameters
    ----------
    nbr_distances : array, shape (n_samples, neighborhood_size)
        Distances to the k nearest neighbors of each point. 
        Entry (i, j) is the distance from point i to its (j+1)st nearest neighbor.
    manifold_dim : array, optional (default=None)
        Local dimension of neighborhood around each data point.

    Returns
    -------
    log_density_fn : array, shape (n_samples, neighborhood_size)
        Estimated log-density at each point using kNN.
        Entry (i, j) is the estimated log-density at point i using (j+1) nearest neighbors.
    """
    num_samples, num_neighbors = nbr_distances.shape
    if manifold_dim is None:
        manifold_dim = intrinsic_dim_MLE(nbr_distances)
    # Column of local dimensions, to broadcast across the neighbor columns.
    manifold_dim = np.reshape(manifold_dim, (-1, 1))
    # log rho_k(x_i) = log(k) - log(n) - log(V_d) - d_i * log(r_k(x_i))
    log_density_fn = -manifold_dim * np.log(nbr_distances)
    log_density_fn += np.log(1 + np.arange(num_neighbors)) - np.log(num_samples)
    log_density_fn -= np.log(unit_ball_volume(manifold_dim))
    return log_density_fn

This shows that with the neighborhood graph already computed and with vectorized computation, it takes no extra effort to compute \(\hat{\rho}_k\) for a range of values of \(k\) at once. The estimates can then be averaged across \(k\) to obtain a single combined density estimate, as sketched below.
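
For instance, one reasonable way to combine them (a sketch, assuming `nbr_distances` is the kNN distance matrix used throughout): average the per-\(k\) densities in density space via logsumexp, giving a single log-density per point.

CODE
from scipy.special import logsumexp

log_rho = knn_density(nbr_distances)                            # (n_samples, k)
log_rho_avg = logsumexp(log_rho, axis=1) - np.log(log_rho.shape[1])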

Estimating the entropy

This density estimate is the basis for the Kozachenko–Leonenko estimator, the best-known entropy estimator for continuous distributions on manifolds. The entropy of a distribution is often used as a measure of its complexity, and being able to estimate it efficiently is a useful tool.

The entropy of the distribution \(\rho\) is defined as \[ H (\rho) := - \int_{x \in \mathbb{R}^{d}} \rho (x) \log \rho (x) dx \]

The Kozachenko–Leonenko estimator (Berrett, Samworth, and Yuan 2019) is \[ \hat{H}_{k} (x_1, \dots, x_n) := - \frac{1}{n} \sum_{i=1}^{n} \log \left( \hat{\rho}_{k} (x_i) \right) + \log \left( \frac{k}{e^{\Psi(k)}} \right) \] where \(\Psi (k) = - \gamma + \sum_{j=1}^{k-1} \frac{1}{j}\) is the digamma function evaluated at the integer \(k\).

Up to an additive correction factor, this is just the plug-in estimator for the entropy: the average of \(-\log \hat{\rho}_{k}\) over the \(n\) data points. The correction factor \(\log k - \Psi(k) \approx \frac{1}{2k}\) doesn’t depend on \(n\); it removes the bias that the raw plug-in estimate retains even as \(n \to \infty\), which is what makes the estimator consistent.

This is simple to implement in a general form, in terms of the density estimates \(\hat{\rho}_{k}\) defined already above.
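
Here's a minimal sketch of that implementation, reusing `knn_density` from above (the function name `kl_entropy` is mine). The commented-out usage example compares it against the analytic entropy of a 2-D standard Gaussian, \(\frac{d}{2} \log (2 \pi e) \approx 2.84\).

CODE
import scipy.special

def kl_entropy(nbr_distances, manifold_dim=None):
    """Kozachenko-Leonenko entropy estimate from kNN distances (a sketch).
    Uses k = number of columns of `nbr_distances`."""
    k = nbr_distances.shape[1]
    log_rho_k = knn_density(nbr_distances, manifold_dim=manifold_dim)[:, -1]
    # -(1/n) sum_i log rho_k(x_i) + log(k) - Psi(k)
    return -np.mean(log_rho_k) + np.log(k) - scipy.special.digamma(k)

# Example usage (assumes scikit-learn for the kNN search):
# from sklearn.neighbors import NearestNeighbors
# X = np.random.default_rng(0).standard_normal((5000, 2))
# dists, _ = NearestNeighbors(n_neighbors=11).fit(X).kneighbors(X)
# print(kl_entropy(dists[:, 1:], manifold_dim=2))   # compare to ~2.84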

References

Bac, Jonathan, Evgeny M Mirkes, Alexander N Gorban, Ivan Tyukin, and Andrei Zinovyev. 2021. “Scikit-Dimension: A Python Package for Intrinsic Dimension Estimation.” Entropy 23 (10): 1368.
Berrett, Thomas B, Richard J Samworth, and Ming Yuan. 2019. “Efficient Multivariate Entropy Estimation via \(k\)-Nearest Neighbour Distances.” The Annals of Statistics 47 (1): 288–318.
Block, Adam, Zeyu Jia, Yury Polyanskiy, and Alexander Rakhlin. 2022. “Intrinsic Dimension Estimation Using Wasserstein Distance.” The Journal of Machine Learning Research 23 (1): 14124–60.
Levina, Elizaveta, and Peter Bickel. 2004. “Maximum Likelihood Estimation of Intrinsic Dimension.” Advances in Neural Information Processing Systems 17.
Otto, Dominik Jenz, Cailin Jordan, Brennan Dury, Christine Dien, and Manu Setty. 2023. “Quantifying Cell-State Densities in Single-Cell Phenotypic Landscapes Using Mellon.” bioRxiv. https://doi.org/10.1101/2023.07.09.548272.

Footnotes

  1. As far as statistically possible, actually.↩︎

  2. Specifically, it’s true whenever the mean value theorem of calculus holds for the density; we’re already implicitly assuming this, in doing the density-based modeling and viewing it as meaningful.↩︎

  3. \(V_{d} := \frac{\pi^{d/2}}{\Gamma(1 + \frac{d}{2})}\)↩︎