SVD-based imputation methods

graphs
statistics
Using the SVD to perform ultra-efficient imputation
Author

Akshay Balsubramani

Modified

February 4, 2023

SVD-based imputation of missing values

Generic low-rank matrix imputation

There is a beautiful line of methods that use the SVD to impute missing values in a matrix. The idea goes back a long way (a method of this kind was famously benchmarked on DNA microarrays by (Troyanskaya et al. 2001)), and it remains a very active area of research.

Modern methods of this type act roughly in two steps:

  • Estimate/define missingness probabilities for each entry in the matrix.
  • Use a low-rank SVD-based representation to estimate entries in the matrix, importance-weighting entries according to their missingness probabilities.
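As a toy sketch of this two-step template (not the specific algorithm below; the column-wise probability model and all sizes here are invented for illustration), one can estimate per-column observation probabilities from the mask and inverse-weight the observed entries before truncating the SVD:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 50, 3
X = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))  # low-rank ground truth
P = (rng.random((n, d)) < 0.7).astype(float)  # 1 = observed, 0 = missing

# Step 1: estimate missingness via per-column observation probabilities
p_hat = P.mean(axis=0)

# Step 2: inverse-probability-weight the observed entries, then truncate the SVD;
# E[(P * X) / p_hat] = X, so the top components estimate the full matrix
U, S, Vt = np.linalg.svd((P * X) / p_hat, full_matrices=False)
X_hat = (U[:, :k] * S[:k]) @ Vt[:k]

rel_err = np.linalg.norm(X_hat - X) / np.linalg.norm(X)
```

Even this crude version recovers the matrix far better than zero-filling the missing entries, which is the basic reason the template works.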

I implemented a useful and efficient recent example of this type: an algorithm by (Bhattacharya and Chatterjee 2022), which continues work from (Chatterjee 2015) and adds a bit more to this template. Here’s how it works.

We’re given an \(n \times d\) data matrix \(X\) with missing entries, and an \(n \times d\) binary mask \(P\) with entries \(P_{ij} = 1\) if \(X_{ij}\) is observed and \(P_{ij} = 0\) if it is missing. \(X\) needs to have all its entries bounded in some reasonable range, i.e. \(\max_{ij} (X_{ij}) - \min_{ij} (X_{ij})\) shouldn’t be too big for this to work.

First, min-max scale the data \(X\) so that each entry is in \([-1, 1]\). Then, do the following:

  • Compute the truncated SVD of the \(n \times d\) masked data matrix \(Y := P \circ X\), truncated to cut off all components with singular value \(\leq 2.3 \max (\sqrt{n}, \sqrt{d})\). Call this resulting truncated-SVD approximation \(M\). Clip the entries of \(M\) to \([-1, 1]\) to form the matrix \(\hat{M}\).
  • Do the same to the matrix \(P\) instead of the matrix \(Y\). Call the resulting truncated-SVD approximation \(M_2\) and its clipped version \(\hat{M_2}\).
  • Calculate the matrix \(U\) by entry-wise dividing \(\hat{M}\) by \(\hat{M_2}\) (entries where \(\hat{M_2} = 0\) are set to 0). Clip the entries of \(U\) to \([-1, 1]\) to form the matrix \(\hat{U}\).
  • Finally, undo the min-max scaling of \(\hat{U}\); the resulting entries are the imputed values.

The “magic” value 2.3 comes from (Gavish and Donoho 2014): it approximates their optimal hard threshold \(4 / \sqrt{3} \approx 2.309\) for singular values, so it has some theoretical justification.
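A quick numerical check of what this threshold does (matrix size chosen arbitrarily): the largest singular value of a pure-noise matrix with entries bounded in \([-1, 1]\) falls below \(2.3 \max(\sqrt{n}, \sqrt{d})\), so thresholding at that level discards components that could be explained by noise alone:

```python
import numpy as np

rng = np.random.default_rng(0)
n = d = 500
noise = rng.uniform(-1, 1, size=(n, d))  # bounded noise, no signal

top_sv = np.linalg.svd(noise, compute_uv=False)[0]
threshold = 2.3 * np.sqrt(max(n, d))
# top_sv is around (1/sqrt(3)) * (sqrt(n) + sqrt(d)) ~ 26, well under the ~51 threshold
print(top_sv < threshold)  # True
```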

CODE
import os, sys, time
import numpy as np, pandas as pd, scipy, scipy.sparse as sp
import requests
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import anndata, scanpy as sc
sc.settings.verbosity = 0

# fname = 'GTEx_8_tissues_snRNAseq_atlas_071421.public_obs.h5ad'
# url_str = f'https://storage.googleapis.com/gtex_analysis_v9/snrna_seq_data/{fname}'
# response = requests.get(url_str)
# with open(file_path, 'wb') as fp:
#     fp.write(response.content)
# adta = sc.read(file_path)
# adta = sc.pp.subsample(adta, n_obs=int(num_subsample), random_state=random_state, copy=True)

adta = sc.read('../GTEx_8_snRNAseq_atlas_30k.h5ad')
sc.pp.neighbors(adta)
sc.tl.umap(adta)
CODE
from sklearn.preprocessing import MinMaxScaler
from sklearn.utils.extmath import randomized_svd


def svd_missing_impute(datamat, maskmat, num_comps=10):
    """ Compute a low-rank SVD approximation to a matrix with missing values.

    Parameters
    ----------
    datamat : ndarray
        (n x d) matrix with missing values.
    maskmat : ndarray
        (n x d) matrix with 1s where datamat is observed, and 0s where it is missing.
    num_comps : int, optional
        Number of singular values to compute. Default is 10.

    Returns
    -------
    estmat : ndarray
        (n x d) matrix with missing values imputed.
    """
    scaler = MinMaxScaler(feature_range=(-1, 1))
    tf_X = scaler.fit_transform(datamat)
    # Truncated SVDs of the masked, scaled data (Y := P ∘ X) and of the mask itself
    U_X, S_X, Vt_X = randomized_svd(tf_X * maskmat, n_components=num_comps, random_state=0)
    U_P, S_P, Vt_P = randomized_svd(maskmat, n_components=num_comps, random_state=0)
    # Hard-threshold singular values at 2.3 * sqrt(max(n, d))
    thresh_singular_value = 2.3 * np.sqrt(np.max(datamat.shape))
    S_X[S_X < thresh_singular_value] = 0
    S_P[S_P < thresh_singular_value] = 0
    est_X = np.clip(U_X @ np.diag(S_X) @ Vt_X, -1, 1)
    est_P = np.clip(U_P @ np.diag(S_P) @ Vt_P, -1, 1)
    # Entry-wise division by the estimated observation probabilities;
    # entries where est_P == 0 get weight 0
    weight_mat = np.zeros_like(est_P)
    nonzero = est_P != 0
    weight_mat[nonzero] = 1.0 / est_P[nonzero]
    estmat = np.clip(weight_mat * est_X, -1, 1)
    return scaler.inverse_transform(estmat)
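A quick end-to-end check of the algorithm on synthetic data, with the pipeline steps inlined so the snippet stands alone (the rank-1 test matrix and all sizes are invented for illustration; the scaling is done per column, as MinMaxScaler would do it):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 400, 200
# Rank-1 bounded test matrix; hide 30% of the entries at random
X = rng.uniform(-1, 1, n)[:, None] * rng.uniform(-1, 1, d)[None, :]
P = (rng.random((n, d)) < 0.7).astype(float)

# Min-max scale each column to [-1, 1]
lo, hi = X.min(axis=0), X.max(axis=0)
tf_X = 2 * (X - lo) / (hi - lo) - 1

thresh = 2.3 * np.sqrt(max(n, d))

def usvt(A):
    # Keep singular values above the threshold, then clip entries to [-1, 1]
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return np.clip((U * np.where(S > thresh, S, 0.0)) @ Vt, -1, 1)

M_hat, M2_hat = usvt(P * tf_X), usvt(P)
# Entry-wise division, with zero weight wherever M2_hat == 0
W = np.zeros_like(M2_hat)
nonzero = M2_hat != 0
W[nonzero] = 1.0 / M2_hat[nonzero]
U_hat = np.clip(W * M_hat, -1, 1)
X_imp = (U_hat + 1) / 2 * (hi - lo) + lo  # undo the scaling

# Relative error on the hidden entries only
rel_err = np.linalg.norm((X_imp - X)[P == 0]) / np.linalg.norm(X[P == 0])
```

The hidden entries come back with small relative error, while zero-filling them would give a relative error of exactly 1.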

A method that tries to respect biological zeros

(Linderman et al. 2022) propose a method that tries to respect biological zeros, which are often meaningful. It is very efficient, builds on the same SVD ideas discussed above, and is worth implementing in a future post.

References

Bhattacharya, Sohom, and Sourav Chatterjee. 2022. “Matrix Completion with Data-Dependent Missingness Probabilities.” IEEE Transactions on Information Theory 68 (10): 6762–73.
Chatterjee, Sourav. 2015. “Matrix Estimation by Universal Singular Value Thresholding.” The Annals of Statistics, 177–214.
Gavish, Matan, and David L Donoho. 2014. “The Optimal Hard Threshold for Singular Values Is \(4 / \sqrt{3}\).” IEEE Transactions on Information Theory 60 (8): 5040–53.
Linderman, George C, Jun Zhao, Manolis Roulis, Piotr Bielecki, Richard A Flavell, Boaz Nadler, and Yuval Kluger. 2022. “Zero-Preserving Imputation of Single-Cell RNA-Seq Data.” Nature Communications 13 (1): 192.
Troyanskaya, Olga, Michael Cantor, Gavin Sherlock, Pat Brown, Trevor Hastie, Robert Tibshirani, David Botstein, and Russ B Altman. 2001. “Missing Value Estimation Methods for DNA Microarrays.” Bioinformatics 17 (6): 520–25.