import numpy as np, sklearn.random_projection

target_dim = 100
source_dim = 20000
dense_basis = sklearn.random_projection._gaussian_random_matrix(
    n_components=target_dim, n_features=source_dim, random_state=None)
Akshay Balsubramani
October 3, 2022
Data in machine learning and statistics very often come in the form of an \(n \times d\) matrix \(X\), where each row is one of \(n\) samples and each of the \(d\) columns is a feature describing the data. This is often followed by the fundamental preprocessing step of dimensionality reduction – to find a low-dimensional representation of a high-dimensional dataset. Many algorithms for this are commonly used.
A direct matrix factorization like PCA (the scikit-learn implementation is good to use) learns mostly linear substructure in the data. Learned solutions like autoencoders achieve the best compression performance and generalizability of the reduced representation, while learned sparse dictionaries can couple these with interpretability at the cost of complexity.
But there are many cases where we want a simpler, more interpretable, and ultra-fast method to capture structure in the data. The best such alternatives to the learned methods are based on random matrices, and involve no learning. These tools arise in fundamental examples of modern numerical linear algebra as well (Woodruff 2014), and so they are used in many other areas of machine learning. Here are some examples of such methods, which are very useful and cheap ways to harness randomization in practical data analysis.
Random projections are a simple and fast way to solve this issue. The idea is that we can project the data onto a random subspace of the original space, and the data will still be well-separated. The original result of this kind was the Johnson-Lindenstrauss lemma, which states that for any \(n\) points in \(\mathbb{R}^d\), a Gaussian projection \(f: \mathbb{R}^d \to \mathbb{R}^k\) to \(k = \frac{8 \log(n/\delta)}{\epsilon^2}\) dimensions will preserve the pairwise distances between the points up to a factor of \(1 \pm \epsilon\): \[ (1-\epsilon) \|x - y\|_2^2 \leq \|f(x) - f(y)\|_2^2 \leq (1+\epsilon) \|x - y\|_2^2 \] for all \(x, y \in \mathbb{R}^d\) with probability \(1 - \delta\).
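To get a feel for the numbers, here is the bound above evaluated at some hypothetical values of \(n\), \(\epsilon\), and \(\delta\) (note that \(d\) doesn't appear at all):

n, eps, delta = 1_000_000, 0.1, 0.01
k = int(np.ceil(8 * np.log(n / delta) / eps**2))
print(k)    # 14737: the target dimension depends only on n, eps, and delta, not on d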
The projection \(f\) is just a random matrix with i.i.d. standard normal entries, scaled appropriately by the number of components (the scikit-learn implementation is a good reference).
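In practice the estimator interface is the easiest way to apply it; a minimal sketch on hypothetical data, using the shapes defined above:

from sklearn.random_projection import GaussianRandomProjection

X = np.random.randn(500, source_dim)                      # hypothetical data: 500 points in 20000 dimensions
gaussian_proj = GaussianRandomProjection(n_components=target_dim, random_state=0)
X_proj = gaussian_proj.fit_transform(X)                   # shape (500, 100)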
This is a very strong result, with an elementary proof (Dasgupta and Gupta 2003): the target dimension \(k\) depends only on the number of points \(n\) and the distortion \(\epsilon\), not on the original dimension \(d\).
But the projection matrix itself is dense, and so this is not a very efficient method. And the components aren’t easily interpretable, as they mix all the features.
A more efficient and interpretable method is to use a sparse random projection. There have been many proposed methods – the most useful to me is the sparsest of all.
Instead of mixing all \(d\) features in each reduced dimension, it’s possible to find a faithful projection that just uses \(\sqrt{d}\) features. This projection is almost as good as the dense one in practice (P. Li, Hastie, and Church 2006).
This is a massive improvement over standard random projections. The efficiency gains are obvious (\(\sqrt{d}\) speed and memory improvements).
The construction could not be simpler: each element of the projection matrix (to reduce to \(k\) components) is independently drawn to be \(\{ -\frac{d^{1/4}}{\sqrt{k}}, 0, +\frac{d^{1/4}}{\sqrt{k}} \}\) with probabilities \(\{ \frac{1}{2\sqrt{d}}, 1 - \frac{1}{\sqrt{d}}, \frac{1}{2\sqrt{d}} \}\) respectively, so each of the \(k\) rows touches only \(\sim \sqrt{d}\) of the \(d\) features.
As usual I recommend the scikit-learn implementation for this, as below:
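A minimal sketch of that usage on hypothetical data (with density='auto', scikit-learn uses exactly the \(1/\sqrt{d}\) density above):

from sklearn.random_projection import SparseRandomProjection

X = np.random.randn(500, source_dim)                      # hypothetical data: 500 points in 20000 dimensions
sparse_proj = SparseRandomProjection(n_components=target_dim, density='auto', random_state=0)
X_reduced = sparse_proj.fit_transform(X)                  # shape (500, 100)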
This sparse random projection follows the basic idea of the Bloom filter, a fundamental probabilistic data structure for set membership queries. Each projected dimension is an extremely noisy measurement of any data point because it only looks at a few features. But after combining many sufficiently independent projected dimensions, the noise cancels out and the original distances are approximately preserved.
Perhaps the most general way to think of constructing these types of random projections is an amazing basic result by (Arriaga and Vempala 2006). They describe1 how to construct a random projection matrix by drawing its entries i.i.d. from any symmetric distribution with variance 1.
If the resulting matrix is \(R\), the matrix \(\frac{1}{\sqrt{k}} R\) is a good projection matrix! The dense Johnson-Lindenstrauss projection, and the very sparse projection above, are special cases of this.2 But this actually occurs under very general conditions.3
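As an illustration of how little is needed, here's a sketch of that recipe with Rademacher (\(\pm 1\)) entries, which are symmetric with variance 1 (the function name and defaults below are mine, not from any library):

def variance_one_projection(k, d, rng=None):
    """Arriaga-Vempala-style projection: i.i.d. symmetric variance-1 entries, scaled by 1/sqrt(k)."""
    rng = np.random.default_rng(rng)
    R = rng.choice([-1.0, 1.0], size=(k, d))    # any symmetric, variance-1 distribution works here
    return R / np.sqrt(k)

rademacher_basis = variance_one_projection(target_dim, source_dim)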
All the given examples mix many features together to form each reduced component, which is what lets every one of the \(n\) data points be approximately reconstructed. Can we go sparser? Not if we insist on reconstructing all the data directly: if each of the \(n\) data points mapped to only a single component, for instance, this is obviously impossible, and we can't proceed along these lines.
But what if we don't care about reconstructing each row of the matrix \(X\), only about approximating \(X\) as a linear operator? Among other things, this preserves the results of all linear regressions on the transformed data. It turns out that we can do this, with each data point (row) of \(X\) mapping to just one reduced-dimension component. The resulting procedure enables some of the best data "sketches" known, and it is ultra-efficient.
The idea is to find a matrix \(S\) such that \(\|Sx\|\) is a good approximation of \(\|x\|\) for all \(x\) in the column space of \(X\). The matrix \(S\) is called a sketching matrix, and the subspace embedding \(SX\) is called a sketch. The line of research that produced this has developed a host of fundamental and practical tools over the years, which operate quickly even in sparse and streaming settings (Cormode and Muthukrishnan 2005). Work by (Clarkson and Woodruff 2017) significantly generalized this, culminating in the scipy implementation that adapts to both the sparsity and geometry of the input data.
I’ll briefly summarize some of the theory from (Clarkson and Woodruff 2017) about their “CWT” matrix \(S\).
If \(X\) is \(n \times d\) with rank \(r\), and \(S\) is a \(k \times n\) “CWT matrix”, then with high probability if \(k \geq \Omega(\epsilon^{-2} r^2 )\), \(SX\) is close to \(X\):
\[ (1-\epsilon) \| X w \|_2^2 \leq \| S X w \|_2^2 \leq (1+\epsilon) \| X w \|_2^2 \quad \text{for all } w \in \mathbb{R}^d , \] \[ (1-\epsilon) \| X \|_F^2 \leq \| S X \|_F^2 \leq (1+\epsilon) \| X \|_F^2 . \]
Under the same conditions, instead of solving the least squares problem \(\min_{W} \| X W - B \|_F^2\), we can solve it on the sketched data \((S X, S B)\) and get a solution that is nearly as good: if \(W^{*} \in \arg\min_{W} \| S X W - S B \|_F^2\), then \[ \| X W^{*} - B \|_F^2 \leq (1+\epsilon) \min_{W} \| X W - B \|_F^2 \]
For fuller statements, (Clarkson and Woodruff 2017) is the original reference.
Best of all, this amazing matrix \(S\) is easy to construct and use, and again a good implementation is in scipy.linalg.clarkson_woodruff_transform. I've isolated the code behind it below for easier access, because the underlying helper isn't directly importable.
from scipy._lib._util import check_random_state, rng_integers
from scipy.sparse import csc_matrix
def cwt_matrix(n_rows, n_columns, seed=None):
"""
Generate a matrix S which represents a Clarkson-Woodruff transform.
Given the desired size of matrix, the method returns a matrix S of size
(n_rows, n_columns) where each column has all the entries set to 0
except for one position which has been randomly set to +1 or -1 with
equal probability.
Parameters
----------
n_rows : int
Number of rows of S
n_columns : int
Number of columns of S
seed : {None, int, `numpy.random.Generator`, `numpy.random.RandomState`}, optional
If `seed` is None (or `np.random`), the `numpy.random.RandomState`
singleton is used.
If `seed` is an int, a new ``RandomState`` instance is used,
seeded with `seed`.
If `seed` is already a ``Generator`` or ``RandomState`` instance then
that instance is used.
Returns
-------
S : (n_rows, n_columns) csc_matrix
The returned matrix has ``n_columns`` nonzero entries.
Notes
-----
Given a matrix A, with probability at least 9/10,
.. math:: \|SA\| = (1 \pm \epsilon)\|A\|
Where the error epsilon is related to the size of S.
"""
rng = check_random_state(seed)
rows = rng_integers(rng, 0, n_rows, n_columns)
cols = np.arange(n_columns+1)
signs = rng.choice([1, -1], n_columns)
    S = csc_matrix((signs, rows, cols), shape=(n_rows, n_columns))
return S
num_data = 1000000
sparse_sketch_matrix = cwt_matrix(target_dim, num_data)
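To see the least-squares guarantee from earlier in action, here is a small self-contained sketch on synthetic data (the shapes and noise level below are arbitrary choices of mine):

from scipy.linalg import clarkson_woodruff_transform

n, d, k = 100000, 50, 2000                                # tall regression problem, sketched down to k rows
X_reg = np.random.randn(n, d)
b = X_reg @ np.random.randn(d) + 0.1 * np.random.randn(n)

# Sketch X and b together so they are hit by the same random S.
sketched = clarkson_woodruff_transform(np.column_stack([X_reg, b]), k)
SX, Sb = sketched[:, :-1], sketched[:, -1]

w_full = np.linalg.lstsq(X_reg, b, rcond=None)[0]
w_sketch = np.linalg.lstsq(SX, Sb, rcond=None)[0]
# The sketched solution's residual on the original problem should be within a (1 + eps) factor of optimal.
print(np.linalg.norm(X_reg @ w_sketch - b) / np.linalg.norm(X_reg @ w_full - b))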
There's a technique by (M. Li, Miller, and Peng 2013) (via (Kapralov et al. 2017)) that I really like. It's not the most efficient way to get a sparsifier, but it's an extremely simple way of sparsifying any \(n \times n\) positive semi-definite matrix \(K\). The specific idea is to interpolate between the matrix \(K\) and a scaled version of the identity matrix \(I_{n}\), by mixing progressively more of \(I_{n}\) into \(K\). This is a very natural way to sparsify, because the identity matrix is the sparsest nontrivial matrix, lacking any edges between vertices.
The algorithm constructs a sequence of graphs interpolating between \(K\) and \(I_{n}\): \[ K \to K(d) \to K(d-1) \to \dots \to K(1) \to K(0) \to I \] such that every graph in the sequence is a coarse approximation of the previous one. Though the graphs get sparser in the sequence, every graph in the sequence approximates all cuts of the previous graph.
It's easy to show that this works for any PSD matrix. (Kapralov et al. 2017) applies it to the Laplacian of a graph. \(\lambda_{l}\) can be bounded away from 0 for Laplacians, and \(\lambda_{u}\) bounded away from 1 otherwise; it boils down to power iteration with \(L\) vs. \(I - L\).
import scipy.sparse

def sparsifiers_to_identity(kernel_mat):
    """Given a kernel matrix, return a list of matrices interpolating spectrally between it and the identity.
    Each matrix in the chain is a coarse spectral approximation of its neighbors.
    Parameters
    ----------
    kernel_mat : sparse matrix
        A kernel matrix with eigenvalues in (0, 1].
        Could be, but isn't limited to, a normalized graph Laplacian.
    Returns
    -------
    list_of_sparsifiers: list
        A list of matrices where `list_of_sparsifiers[0]` is close to `kernel_mat`
        and `list_of_sparsifiers[-1]` is close to the identity matrix.
    """
    list_of_sparsifiers = []
    # The smallest eigenvalue determines how many halving steps the interpolation needs.
    dense_kernel = kernel_mat.toarray() if scipy.sparse.issparse(kernel_mat) else np.asarray(kernel_mat)
    min_eigval = np.min(np.linalg.eigvalsh(dense_kernel))
    num_sparsifiers = np.ceil(-np.log2(min_eigval)).astype(int)
    gamma_multiplier = 1  # Initialize to an upper bound on the max eigenvalue. Here assumed to be 1.
    for i in range(num_sparsifiers + 1):
        sparsifier = gamma_multiplier * scipy.sparse.identity(kernel_mat.shape[0]) + kernel_mat
        list_of_sparsifiers.append(sparsifier)
        gamma_multiplier *= 0.5
    list_of_sparsifiers.append(kernel_mat)
    # Reverse so the list runs from (close to) kernel_mat up to (close to) the identity, as documented.
    return list_of_sparsifiers[::-1]
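A toy usage sketch on a synthetic well-conditioned PSD matrix with eigenvalues in (0, 1] (the construction below is made up purely for illustration):

Q, _ = np.linalg.qr(np.random.randn(200, 200))
eigs = np.linspace(0.05, 1.0, 200)
K = scipy.sparse.csr_matrix((Q * eigs) @ Q.T)             # synthetic PSD matrix with known spectrum
chain = sparsifiers_to_identity(K)
print(len(chain))                                         # ceil(-log2(0.05)) + 2 = 7 matrices in the chain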
One more kind of embedding that's worth mentioning is the landmark embedding. This is a very simple embedding, based on choosing landmark sets of points from the data set. A landmark embedding then represents each point by how far it is from the landmark sets, according to some distance metric.
This type of embedding tends to be very good at recovering data near the landmarks, and relies more on the given metric to extrapolate away from them. If the landmarks are chosen well at multiple scales, this can be a very powerful tool.
The archetypal landmark embedding of this kind is the one proposed in (Bourgain 1985), a seminal and beautiful paper. It has been extended many times (see e.g. (Johnson, Lindenstrauss, and Schechtman 1987), (Linial, London, and Rabinovich 1995), (Krauthgamer et al. 2004)), many intuitive descriptions of it exist, and it's been used in modern ML as well (Xiao, Zhong, and Zheng 2018). I'll outline it as follows.
Pick landmark sets \(X_{l, b}\) for levels \(l = 1, \dots, L\) and batches \(b = 1, \dots, B\), where each \(X_{l, b}\) is a uniformly random subset of the data whose size grows exponentially with the level \(l\). The landmark embedding is then the embedding of the data set by its distances to these landmark sets, where the distance from a point to a set is the distance to the set's closest element. For any point \(x\), \[ f(x) = \underbrace{(d(x, X_{1, 1}), \dots, d(x, X_{L, B}))}_{L \cdot B \text{ components}} \]
The exponentially gridded “levels” are different scales of the embedding, allowing the metric to adapt flexibly to any scale.
import math

def bourgain_embedding(X, dist='L2'):
    """A small, explicitly multiscale embedding that uses random subsets of the data to embed against.
    Parameters
    ----------
    X : array
        The data to embed (n examples x d features).
    dist : string
        The distance metric to use. One of ['L2', 'L1']. Default is 'L2'.
    Returns
    -------
    X_emb : list
        The embedded data: one list of distances (one per level-batch pair) per example.
    """
    # Resolve the metric name to an actual distance function between two points.
    if dist == 'L2':
        dist_fn = lambda a, b: np.linalg.norm(np.asarray(a) - np.asarray(b))
    elif dist == 'L1':
        dist_fn = lambda a, b: np.sum(np.abs(np.asarray(a) - np.asarray(b)))
    else:
        raise ValueError("dist must be one of ['L2', 'L1'].")
    np.random.seed(123)
    X_emb = []
    n = len(X)
    R = range(n)
    num_levels = int(math.ceil(math.log(n)/math.log(2) - 1))
    num_batches = int(math.ceil(math.log(n)))
    counter = 0  # total number of distance computations, as a cost diagnostic
    for lvl in range(0, num_levels + 1):
        for t in range(num_batches):
            # Landmark set for this (level, batch): a random subset of size 2**lvl.
            S = np.random.choice(R, 2**lvl)
            for j in R:
                x = X[j]
                # Distance from x to the landmark set = distance to its closest element.
                d = min([dist_fn(x, X[s]) for s in S])
                counter += len(S)
                if lvl == 0 and t == 0:
                    X_emb.append([d])
                else:
                    X_emb[j].append(d)
    print(counter)
    return X_emb
""" Usage example:
X = [ [(i+j)*(i+j+1)/2+j for j in range(0,10+1)] for i in range(0,50)]
X_emb = bourgain_embedding(X,dist)
l = []
for x in range(len(X)):
for y in range(len(X)):
if x != y:
d1 = dist(X[x],X[y])
d2 = dist(X_emb[x],X_emb[y])
l.append(d2*1.0/(d1*1.0))
#print x,y,d1,d2
print "distortion = %s" % (max(l)/min(l))
print "upper bound for distortion = %s " % math.log(len(X))
"""
Bourgain's embedding has the drawback of using explicitly multiscale information in the form of these landmark sets, leading to a somewhat involved construction procedure. But every level is constructed through the same procedure – uniform random sampling of the dataset. Having empirically investigated this primitive, I think it makes for a cheap, versatile, and controllable embedding that exemplifies the characteristics of such Fréchet constructions.
from scipy.spatial import distance_matrix
def frechet_euclidean_embedding(X, landmarks=None, num_landmarks=1000):
""" Compute a Frechet embedding of the data against a subset of landmarks.
Parameters
----------
X : array
The data to embed (n examples x d features).
landmarks : array
The indices of the landmarks to embed against. If None, a random subset of size `num_landmarks` will be chosen.
num_landmarks : int
The number of landmarks to embed against. Only used if `landmarks` is None.
Returns
-------
    feature_mat : array
        The matrix of distances from each non-landmark data point to each landmark.
landmark_mask : array
A boolean array indicating which points are landmarks.
"""
    if landmarks is None:
        if num_landmarks > X.shape[0]:
            raise ValueError('num_landmarks must be <= the number of data points.')
        # Sample without replacement so we get exactly num_landmarks distinct landmarks.
        landmarks = np.random.choice(X.shape[0], num_landmarks, replace=False)
landmark_mask = np.zeros(X.shape[0], dtype=bool)
landmark_mask[landmarks] = True
landmark_data = X[landmark_mask, :]
nonlandmark_data = X[~landmark_mask, :]
feature_mat = distance_matrix(nonlandmark_data, landmark_data)
return feature_mat, landmark_mask
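A quick usage sketch on synthetic data (the shapes and landmark count are arbitrary):

X = np.random.randn(5000, 50)                             # hypothetical data: 5000 points, 50 features
feats, is_landmark = frechet_euclidean_embedding(X, num_landmarks=200)
# One row per non-landmark point, one column (a distance) per landmark.
print(feats.shape)                                        # (4800, 200)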
Thm. 1 in (Arriaga and Vempala 2006)↩︎
Respectively with standard normal entries, and entries drawn from \(\{ -d^{1/4}, 0, d^{1/4} \}\) with probabilities \(\{ \frac{1}{2\sqrt{d}}, 1 - \frac{1}{\sqrt{d}}, \frac{1}{2\sqrt{d}} \}\).↩︎
As long as the distribution has a decently controlled fourth moment or higher moments (Bernstein's moment condition, which is automatic if the data are bounded). Practically, if the data don't have massive outliers, it works out.↩︎