Random projections and dimensionality reduction

graphs
ML
Efficient, general applied tools
Author

Akshay Balsubramani

Modified

October 3, 2022

Outline

Data in machine learning and statistics very often come in the form of an \(n \times d\) matrix \(X\), where each row is one of \(n\) samples and each of the \(d\) columns is a feature describing the data. This is often followed by the fundamental preprocessing step of dimensionality reduction – to find a low-dimensional representation of a high-dimensional dataset. Many algorithms for this are commonly used.

  • PCA falls into this category: a linear method that finds the directions of maximal variance in the data. Particularly when the data are sparse, PCA can be less efficient and interpretable than ideal, and it only captures linear structure in the data.
  • Non-negative matrix factorization works very efficiently on very sparse matrices – the scikit-learn implementation is good to use. As a direct matrix factorization, it mostly learns linear substructure in the data.
  • Nonlinear dimensionality reduction methods are very powerful, and tend to be based on distance matrices or neighborhood graphs. They include diffusion maps / Laplacian eigenmaps, UMAP and topological methods, t-SNE, and others, which are covered in other posts.

Learned solutions like autoencoders achieve the best compression performance and generalizability of the reduced representation, while learned sparse dictionaries can couple these with interpretability at the cost of complexity.

But there are many cases where we want to use a simpler, more interpretable, and ultra-fast method to capture structure in the data. The best such alternatives to the learned methods are based on random matrices, and involve no learning. These tools arise in fundamental examples of modern numerical linear algebra as well (Woodruff 2014), and so they are used in many other areas of machine learning. Here are some examples of such methods, which are very useful and cheap ways to harness randomization in practical data analysis.

Random projections

Random projections are a simple and fast way to do this reduction. The idea is that we can project the data onto a random subspace of the original space, and pairwise distances between the data points will still be approximately preserved. The original result of this kind was the Johnson-Lindenstrauss lemma, which states that for any \(n\) points in \(\mathbb{R}^d\), a Gaussian projection \(f: \mathbb{R}^d \to \mathbb{R}^k\) to \(k = \frac{8 \log(n/\delta)}{\epsilon^2}\) dimensions will preserve the pairwise distances between the points up to a factor of \(1 \pm \epsilon\): \[ (1-\epsilon) \|x - y\|_2^2 \leq \|f(x) - f(y)\|_2^2 \leq (1+\epsilon) \|x - y\|_2^2 \] for all \(x, y\) among the \(n\) points, with probability \(1 - \delta\).

The projection \(f\) is just multiplication by a random matrix with i.i.d. standard normal entries, scaled by \(\frac{1}{\sqrt{k}}\) (the scikit-learn implementation is a good reference).

CODE
import numpy as np, sklearn.random_projection

target_dim = 100
source_dim = 20000

# Dense Gaussian projection matrix of shape (target_dim, source_dim), with
# i.i.d. N(0, 1/target_dim) entries (sklearn's private helper does the scaling).
dense_basis = sklearn.random_projection._gaussian_random_matrix(
    n_components=target_dim, n_features=source_dim, random_state=None)
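
Applying the projection is then a single matrix multiplication. Here's a minimal illustration, with a synthetic data matrix `X` (an assumption of this sketch) standing in for real data:

CODE
# Synthetic data, purely for illustration: n examples x source_dim features.
X = np.random.randn(200, source_dim)

# Each row is projected down to target_dim dimensions.
X_reduced = X @ dense_basis.T   # shape (200, target_dim)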

This is a very strong result (Dasgupta and Gupta 2003):

  • It assumes nothing about the data
  • \(f\) is a simple linear projection, with each coefficient being an independent standard normal random variable
  • In general, it’s theoretically impossible to reduce to substantially fewer dimensions than this

But the projection matrix itself is dense, and so this is not a very efficient method. And the components aren’t easily interpretable, as they mix all the features.

Sparser projections

A more efficient and interpretable method is to use a sparse random projection. There have been many proposed methods – the most useful to me is the sparsest of all.

Instead of mixing all \(d\) features in each reduced dimension, it’s possible to find a faithful projection that just uses \(\sqrt{d}\) features. This projection is almost as good as the dense one in practice (P. Li, Hastie, and Church 2006).

This is a massive improvement over standard random projections. The efficiency gains are obvious (\(\sqrt{d}\) speed and memory improvements).

The construction could not be simpler: each element of the projection matrix (to reduce to \(k\) components) is independently drawn to be \(\{ - \sqrt{\frac{\sqrt{d}}{k}}, 0, \sqrt{\frac{\sqrt{d}}{k}} \}\) with probabilities \(\{ \frac{1}{2\sqrt{d}}, 1 - \frac{1}{\sqrt{d}}, \frac{1}{2\sqrt{d}} \}\) respectively, so that each reduced dimension touches only about \(\sqrt{d}\) of the \(d\) features on average.

As usual I recommend the scikit-learn implementation for this, as below:

CODE
# Sparse projection matrix: density='auto' gives a nonzero fraction of 1/sqrt(source_dim),
# following (P. Li, Hastie, and Church 2006). Transposed to shape (source_dim, target_dim).
sparse_basis = sklearn.random_projection._sparse_random_matrix(
    n_components=target_dim, n_features=source_dim, 
    density='auto', random_state=42
).transpose()
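
Applying it looks the same as before, except the multiplication stays sparse end to end. A minimal sketch, using a synthetic sparse matrix for illustration:

CODE
import scipy.sparse

# Synthetic sparse data, purely for illustration: n examples x source_dim features.
X_sparse = scipy.sparse.random(500, source_dim, density=0.01, format='csr')

# Sparse-times-sparse multiplication; the reduced matrix is small, so densify it if convenient.
X_reduced = np.asarray((X_sparse @ sparse_basis).todense())   # shape (500, target_dim)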

This sparse random projection follows the basic idea of the Bloom filter, a fundamental probabilistic data structure for set membership queries. Each projected dimension is an extremely noisy measurement of any data point because it only looks at a few features. But after combining many sufficiently independent projected dimensions, the noise cancels out and the original distances are approximately preserved.
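
A quick empirical check of this claim (continuing with the synthetic `X_sparse` and `sparse_basis` above, so the numbers are illustrative only) compares pairwise distances before and after the projection:

CODE
from scipy.spatial.distance import pdist

# Ratios of projected to original pairwise distances; they should concentrate around 1.
orig_dists = pdist(np.asarray(X_sparse.todense()))
proj_dists = pdist(np.asarray((X_sparse @ sparse_basis).todense()))
ratios = proj_dists / orig_dists
print(ratios.min(), ratios.mean(), ratios.max())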

A general construction

Perhaps the most general way to construct these types of random projections comes from a basic and remarkable result of (Arriaga and Vempala 2006). They describe1 how to build a random projection matrix by drawing its entries i.i.d. from any symmetric distribution with variance 1.

If the resulting matrix is \(R\), the matrix \(\frac{1}{\sqrt{k}} R\) is a good projection matrix! The dense Johnson-Lindenstrauss projection, and the very sparse projection above, are special cases of this.2 But this actually occurs under very general conditions.3
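
As a minimal sketch of this recipe, here's a Rademacher (\(\pm 1\)) projection, whose entries are symmetric with variance 1; the \(\frac{1}{\sqrt{k}}\) scaling is applied explicitly, and the synthetic `X` from above is reused for illustration.

CODE
# Entries drawn i.i.d. from a symmetric, variance-1 distribution (here +1/-1), then scaled by 1/sqrt(k).
R = np.random.choice([-1.0, 1.0], size=(target_dim, source_dim))
rademacher_basis = R / np.sqrt(target_dim)

X_reduced_rademacher = X @ rademacher_basis.T   # reusing the synthetic X from above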

Sketching

All the given examples mix many inputs into each reduced component, so that each of the \(n\) data points can be approximately reconstructed. Can we go sparser? Not if we insist on reconstructing all the data directly: if each of the \(n\) data points mapped to only a single component, for instance, reconstruction would clearly be impossible, so we can’t proceed along these lines.

But what if we don’t care about reconstructing each row of the matrix \(X\), and only need to approximate the linear operator \(X\)? Among other things, this preserves the results of all linear regressions on the transformed data. It turns out that we can do this, with each data point (row) of \(X\) mapping to just one reduced-dimension component. The resulting procedure enables some of the best data “sketches” known, and it is ultra-efficient.

Sparse subspace embeddings

The idea is to find a matrix \(S\) such that \(\| S x \|_2 \approx \| x \|_2\) for all \(x\) in the column space of \(X\). The matrix \(S\) is called a sketching matrix, and the compressed matrix \(SX\) is called a sketch (a subspace embedding of \(X\)). The line of research that produced this has developed a host of fundamental and practical tools over the years, which operate quickly even in sparse and streaming settings (Cormode and Muthukrishnan 2005). Work by (Clarkson and Woodruff 2017) significantly generalized this, culminating in the scipy implementation that adapts to both the sparsity and geometry of the input data.
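
The scipy routine is a one-liner to use; a minimal sketch on synthetic data (sizes chosen arbitrarily for illustration) looks like this:

CODE
from scipy.linalg import clarkson_woodruff_transform

# Sketch a tall (n x d) matrix down to sketch_size rows.
A = np.random.randn(100000, 50)
A_sketch = clarkson_woodruff_transform(A, sketch_size=2500, seed=0)
print(A_sketch.shape)   # (2500, 50)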

Theory

I’ll briefly summarize some of the theory from (Clarkson and Woodruff 2017) about their “CWT” matrix \(S\).

If \(X\) is \(n \times d\) with rank \(r\), and \(S\) is a \(k \times n\) “CWT matrix” with \(k \geq \Omega(\epsilon^{-2} r^2 )\), then with high probability \(SX\) behaves like \(X\):

\[ (1-\epsilon) \| X w \|_2^2 \leq \| S X w \|_2^2 \leq (1+\epsilon) \| X w \|_2^2 \] \[ (1-\epsilon) \| X \|_F^2 \leq \| S X \|_F^2 \leq (1+\epsilon) \| X \|_F^2 \]

Under the same conditions, instead of solving the least squares problem \(\min_{W} \| X W - B \|_F^2\), we can solve it on the sketched data \((S X, S B)\) and get a near-optimal solution: > If \(W^{*} \in \arg\min_{W} \| S X W - S B \|_F^2\), then \[ \| X W^{*} - B \|_F^2 \leq (1 + \epsilon) \min_{W} \| X W - B \|_F^2 \]

For the full statements, (Clarkson and Woodruff 2017) is the original reference.

Implementation

Best of all, this amazing matrix \(S\) is easy to construct and use, and again a good implementation is in scipy.linalg.clarkson_woodruff_transform. I’ve isolated the code behind it below for easier access, because the underlying helper isn’t directly importable.

CODE
from scipy._lib._util import check_random_state, rng_integers
from scipy.sparse import csc_matrix

def cwt_matrix(n_rows, n_columns, seed=None):
    """
    Generate a matrix S which represents a Clarkson-Woodruff transform.

    Given the desired size of matrix, the method returns a matrix S of size
    (n_rows, n_columns) where each column has all the entries set to 0
    except for one position which has been randomly set to +1 or -1 with
    equal probability.

    Parameters
    ----------
    n_rows : int
        Number of rows of S
    n_columns : int
        Number of columns of S
    seed : {None, int, `numpy.random.Generator`, `numpy.random.RandomState`}, optional
        If `seed` is None (or `np.random`), the `numpy.random.RandomState`
        singleton is used.
        If `seed` is an int, a new ``RandomState`` instance is used,
        seeded with `seed`.
        If `seed` is already a ``Generator`` or ``RandomState`` instance then
        that instance is used.

    Returns
    -------
    S : (n_rows, n_columns) csc_matrix
        The returned matrix has ``n_columns`` nonzero entries.

    Notes
    -----
    Given a matrix A, with probability at least 9/10,
    .. math:: \|SA\| = (1 \pm \epsilon)\|A\|
    Where the error epsilon is related to the size of S.
    """
    rng = check_random_state(seed)
    rows = rng_integers(rng, 0, n_rows, n_columns)
    cols = np.arange(n_columns+1)
    signs = rng.choice([1, -1], n_columns)
    S = csc_matrix((signs, rows, cols), shape=(n_rows, n_columns))
    return S


num_data = 1000000

sparse_sketch_matrix = cwt_matrix(target_dim, num_data)
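
As a sanity check of the least-squares guarantee above, here's a minimal sketch-and-solve comparison on synthetic data (the sizes and noise level are arbitrary choices for illustration):

CODE
# Synthetic regression problem: n examples, d features, with a planted solution.
n, d = 100000, 50
X_data = np.random.randn(n, d)
true_w = np.random.randn(d)
b = X_data @ true_w + 0.1 * np.random.randn(n)

# Sketch down to k rows with the CWT matrix, then solve least squares on the sketch.
S = cwt_matrix(5000, n, seed=0)
w_full, *_ = np.linalg.lstsq(X_data, b, rcond=None)
w_sketch, *_ = np.linalg.lstsq(S @ X_data, S @ b, rcond=None)

print(np.linalg.norm(X_data @ w_full - b))    # optimal residual
print(np.linalg.norm(X_data @ w_sketch - b))  # should be only slightly larger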

A simple system of hierarchical sparsifiers

There’s a technique by (M. Li, Miller, and Peng 2013) (via (Kapralov et al. 2017)) that I really like. It’s not the most efficient way to get a sparsifier, but it’s an extremely simple way of sparsifying any \(n \times n\) positive semi-definite matrix \(K\). The specific idea is to interpolate between the matrix \(K\) and a scaled version of the identity matrix \(I_{n}\), by mixing progressively more of \(I_{n}\) into \(K\). This is a very natural way to sparsify, because the identity matrix is the sparsest nontrivial matrix, lacking any edges between vertices.

The algorithm constructs a sequence of graphs interpolating between \(K\) and \(I_{n}\): \[ K \to K(d) \to K(d-1) \to \dots \to K(1) \to K(0) \to I \] such that every graph in the sequence is a coarse approximation of the previous one. Though the graphs get sparser in the sequence, every graph in the sequence approximates all cuts of the previous graph.

Theoretical guarantees

Consider any PSD matrix \(K\) with maximum and minimum eigenvalues of \(\lambda_u\) and \(\lambda_l\) respectively. Let \(d = \lceil \log_2 (\lambda_u / \lambda_l) \rceil\) and define \(K (l) = K + \frac{\lambda_u}{2^{l}} I_{n}\). Write \(p_G (T)\) as the weight of edges incident to subset \(T\) of vertices in graph \(G\). Then for any subset \(T\) of vertices,

  • \(p_{K} (T) \leq p_{K (d)} (T) \leq 2 p_{K} (T)\).
  • \(p_{K (l)} (T) \leq p_{K (l-1)} (T) \leq 2 p_{K (l)} (T)\) for all \(l\).
  • \(p_{K (0)} (T) \leq 2 \lambda_u p_{I_{n}} (T) \leq 2 p_{K (0)} (T)\).

It’s easy to show that it works for any PSD matrix. (Kapralov et al. 2017) applies this to the Laplacian of a graph. \(\lambda_{l}\) can be bounded away from 0 for Laplacians, and \(\lambda_{u}\) bounded away from 1 otherwise - it boils down to power iteration with \(L\) vs \(1-L\).

CODE
import scipy.sparse

def sparsifiers_to_identity(kernel_mat):
    """Given a kernel matrix, return a list of matrices interpolating spectrally between it and the identity. 

    Parameters
    ----------
    kernel_mat : sparse matrix
        A kernel matrix with maximum eigenvalue <= 1. 
        Could be, but isn't limited to, a normalized graph Laplacian.
    
    Returns
    -------
    list_of_sparsifiers: list
        A list of matrices where `list_of_sparsifiers[0]` is `kernel_mat` itself 
        and `list_of_sparsifiers[-1]` is within a factor of 2 of the identity matrix.
    """
    # Number of interpolation levels: d = ceil(log2(lambda_u / lambda_l)), with lambda_u assumed <= 1.
    # (Densifying to get the smallest eigenvalue; a sparse eigensolver would be preferable at scale.)
    dense_kernel = kernel_mat.toarray() if scipy.sparse.issparse(kernel_mat) else np.asarray(kernel_mat)
    min_eigval = np.linalg.eigvalsh(dense_kernel)[0]
    num_sparsifiers = int(np.ceil(-np.log2(min_eigval)))
    gamma_multiplier = 1.0   # Upper bound on the max eigenvalue. Here assumed to be 1.
    list_of_sparsifiers = [kernel_mat]
    # K(l) = K + (lambda_u / 2^l) * I for l = d down to 0: progressively closer to the identity.
    for l in range(num_sparsifiers, -1, -1):
        shift = gamma_multiplier / (2 ** l)
        sparsifier = shift * scipy.sparse.identity(kernel_mat.shape[0]) + kernel_mat
        list_of_sparsifiers.append(sparsifier)
    return list_of_sparsifiers
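
A quick illustrative call, using a small synthetic kernel normalized so its top eigenvalue is 1 (the assumption made by the function above):

CODE
# Full-rank synthetic PSD kernel, normalized so its maximum eigenvalue is 1.
A = np.random.randn(200, 400)
K = A @ A.T
K = K / np.linalg.eigvalsh(K)[-1]

chain = sparsifiers_to_identity(scipy.sparse.csr_matrix(K))
print(len(chain))   # roughly log2(lambda_max / lambda_min) + 2 matrices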

Landmark (Fréchet) embeddings

One more kind of embedding that’s worth mentioning is the landmark embedding. This is a very simple embedding based on the idea of landmark sets of points from the data set: the data set is embedded by how far each point is from the landmark sets, according to some distance metric.

This type of embedding tends to be very good at recovering data near the landmarks, and relies more on the given metric to extrapolate away from them. If the landmarks are chosen well at multiple scales, this can be a very powerful tool.

Bourgain’s embedding

The archetypal landmark embedding of this kind is the one proposed in (Bourgain 1985), a seminal and beautiful paper. It has been extended many times (see e.g. (Johnson, Lindenstrauss, and Schechtman 1987), (Linial, London, and Rabinovich 1995), (Krauthgamer et al. 2004)), many intuitive descriptions of it exist, and it’s been used in modern ML as well (Xiao, Zhong, and Zheng 2018). I’ll outline it as follows.

  • Set parameters \(n_{\text{levels}}\) and \(n_{\text{batches}}\) (both to \(\sim \log n\))
  • for \(L\) in \(1, \dots, n_{\text{levels}}\):
    • for \(B\) in \(1, \dots, n_{\text{batches}}\):
      • Form a batch \(X_{L,B}\) by sampling each data point with probability \(2^{-L}\)

The landmark embedding is then given by each point’s distances to the landmark sets. For any point \(x\), \[ f(x) = \underbrace{(d(x, X_{1, 1}), \dots, d(x, X_{L, B}))}_{n_{\text{levels}} \cdot n_{\text{batches}} \text{ components}} \] where \(d(x, X_{L, B}) = \min_{y \in X_{L, B}} d(x, y)\) is the distance from \(x\) to the landmark set.

The exponentially gridded “levels” are different scales of the embedding, allowing the metric to adapt flexibly to any scale.

CODE
import math

def bourgain_embedding(X, dist='L2'):
    """A small, explicitly multiscale embedding that uses random subsets of the data to embed against. 

    Parameters
    ----------
    X : array
        The data to embed (n examples x d features).
    dist : string
        The distance metric to use. One of ['L2', 'L1']. Default is 'L2'.
    """
    np.random.seed(123)
    # Map the metric name to a distance function between two points.
    if dist == 'L2':
        dist_fn = lambda a, b: np.linalg.norm(np.asarray(a) - np.asarray(b))
    elif dist == 'L1':
        dist_fn = lambda a, b: np.sum(np.abs(np.asarray(a) - np.asarray(b)))
    else:
        raise ValueError("dist must be one of ['L2', 'L1'].")
    X_emb = []
    R = range(len(X))
    n = len(X)
    num_levels = int(math.ceil(math.log(n)/math.log(2)-1))
    num_batches = int(math.ceil(math.log(n)))
    counter = 0   # Total number of distance evaluations, for bookkeeping.
    for lvl in range(0, num_levels+1):
        for t in range(num_batches):
            # Landmark set for this (level, batch) pair: a random subset of ~2**lvl points.
            S = np.random.choice(R, 2**lvl)
            for j in R:
                x = X[j]
                # Distance from x to the landmark set = minimum distance over its members.
                d = min([dist_fn(x, X[s]) for s in S])
                counter += len(S)
                if lvl == 0 and t == 0:
                    X_emb.append([d])
                else:
                    X_emb[j].append(d)
    print(counter)
    return X_emb
""" Usage example: 
    X = [[(i+j)*(i+j+1)/2+j for j in range(0, 10+1)] for i in range(0, 50)]
    X_emb = bourgain_embedding(X, dist='L2')
    l2 = lambda a, b: np.linalg.norm(np.asarray(a) - np.asarray(b))
    l = []
    for x in range(len(X)):
        for y in range(len(X)):
            if x != y:
                d1 = l2(X[x], X[y])
                d2 = l2(X_emb[x], X_emb[y])
                l.append(d2*1.0/(d1*1.0))
    print("distortion = %s" % (max(l)/min(l)))
    print("upper bound for distortion = %s " % math.log(len(X)))
"""

An empirically useful embedding

Bourgain’s embedding has the drawback of using explicitly multiscale information in the form of these landmark sets, leading to a somewhat involved construction procedure. But every level is constructed through the same primitive – uniform random sampling of the dataset. Empirically, I’ve found that this primitive by itself makes for a cheap, versatile, and controllable embedding that exemplifies the characteristics of such Fréchet constructions.

CODE
from scipy.spatial import distance_matrix

def frechet_euclidean_embedding(X, landmarks=None, num_landmarks=1000):
    """ Compute a Frechet embedding of the data against a subset of landmarks.

    Parameters
    ----------
    X : array
        The data to embed (n examples x d features).
    landmarks : array
        The indices of the landmarks to embed against. If None, a random subset of size `num_landmarks` will be chosen.
    num_landmarks : int
        The number of landmarks to embed against. Only used if `landmarks` is None.
    
    Returns
    -------
    feature_mat : array
        The distance matrix between the non-landmark data points and the landmarks.
    landmark_mask : array
        A boolean array indicating which points are landmarks.
    """
    if landmarks is None:
        if num_landmarks > X.shape[0]:
            raise ValueError('num_landmarks must be <= the number of data points.')
        # Sample landmark indices without replacement so exactly num_landmarks are used.
        landmarks = np.random.choice(X.shape[0], num_landmarks, replace=False)
    landmark_mask = np.zeros(X.shape[0], dtype=bool)
    landmark_mask[landmarks] = True
    landmark_data = X[landmark_mask, :]
    nonlandmark_data = X[~landmark_mask, :]
    # Each non-landmark point is featurized by its Euclidean distances to all the landmarks.
    feature_mat = distance_matrix(nonlandmark_data, landmark_data)
    return feature_mat, landmark_mask
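
A quick illustrative call on synthetic data (sizes arbitrary):

CODE
# Embed 5000 synthetic points against 100 randomly chosen landmarks.
X_demo = np.random.randn(5000, 30)
feats, mask = frechet_euclidean_embedding(X_demo, num_landmarks=100)
print(feats.shape)   # (4900, 100): non-landmark points x landmarks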

References

Arriaga, Rosa I, and Santosh Vempala. 2006. “An Algorithmic Theory of Learning: Robust Concepts and Random Projection.” Machine Learning 63 (2): 161–82.
Bourgain, Jean. 1985. “On Lipschitz Embedding of Finite Metric Spaces in Hilbert Space.” Israel Journal of Mathematics 52 (1): 46–52.
Clarkson, Kenneth L, and David P Woodruff. 2017. “Low-Rank Approximation and Regression in Input Sparsity Time.” Journal of the ACM (JACM) 63 (6): 1–45.
Cormode, Graham, and S. Muthukrishnan. 2005. “An Improved Data Stream Summary: The Count-Min Sketch and Its Applications.” Journal of Algorithms 55 (1): 58–75.
Dasgupta, Sanjoy, and Anupam Gupta. 2003. “An Elementary Proof of a Theorem of Johnson and Lindenstrauss.” Random Structures & Algorithms 22 (1): 60–65.
Johnson, William B, Joram Lindenstrauss, and Gideon Schechtman. 1987. “On Lipschitz Embedding of Finite Metric Spaces in Low Dimensional Normed Spaces.” In Geometrical Aspects of Functional Analysis, 177–84. Springer.
Kapralov, Michael, Yin Tat Lee, CN Musco, Christopher Paul Musco, and Aaron Sidford. 2017. “Single Pass Spectral Sparsification in Dynamic Streams.” SIAM Journal on Computing 46 (1): 456–77.
Krauthgamer, Robert, James R Lee, Manor Mendel, and Assaf Naor. 2004. “Measured Descent: A New Embedding Method for Finite Metrics.” In 45th Annual IEEE Symposium on Foundations of Computer Science, 434–43. IEEE.
Li, Mu, Gary L Miller, and Richard Peng. 2013. “Iterative Row Sampling.” In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, 127–36. IEEE.
Li, Ping, Trevor J Hastie, and Kenneth W Church. 2006. “Very Sparse Random Projections.” In Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 287–96.
Linial, Nathan, Eran London, and Yuri Rabinovich. 1995. “The Geometry of Graphs and Some of Its Algorithmic Applications.” Combinatorica 15 (2): 215–45.
Woodruff, David P. 2014. “Sketching as a Tool for Numerical Linear Algebra.” Foundations and Trends in Theoretical Computer Science 10 (1–2): 1–157.
Xiao, Chang, Peilin Zhong, and Changxi Zheng. 2018. “BourGAN: Generative Networks with Metric Embeddings.” In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2275–86.

Footnotes

  1. Thm. 1 in (Arriaga and Vempala 2006)↩︎

  2. Respectively with standard normal entries, and entries drawn from \(\{ - d^{1/4}, 0, d^{1/4} \}\) with probabilities \(\{ \frac{1}{2\sqrt{d}}, 1 - \frac{1}{\sqrt{d}}, \frac{1}{2\sqrt{d}} \}\).↩︎

  3. As long as the distribution has a decently controlled fourth moment or higher moments (Bernstein’s moment condition, which is automatic if the data are bounded). Practically, if the data don’t have massive outliers, it works out.↩︎