Hello, I have mapping results for >1K samples and >1B genes. When I load one sample into a pandas DataFrame, it takes about 1 GB of memory, so loading all samples into memory at once is not possible.
Is there a way to build a sparse matrix sample by sample (column by column), ideally backed by disk?
Will I then be able to apply algorithms like total-sum normalization to the combined data, or do I need to find a distributed way to handle it?
Yeah, that should be possible, either in memory or on disk. I would note that with scverse (and the Python ecosystem more generally), samples are rows instead of columns.
I don’t know what format your data is in right now, but the code would look roughly like:
import numpy as np
from scipy import sparse

N_SAMPLES = ...  # number of samples
N_GENES = ...    # number of genes
samples = ...

# Build the three CSR arrays (data, indices, indptr) one row at a time
indptr = np.zeros(N_SAMPLES + 1, dtype=np.int64)
indices_list = []
data_list = []
for i, sample in enumerate(samples):
    row_dense = read_sample(sample)  # 1-D array of length N_GENES
    row_indices = np.nonzero(row_dense)[0]
    row_data = row_dense[row_indices]
    indptr[i + 1] = indptr[i] + len(row_indices)
    indices_list.append(row_indices)
    data_list.append(row_data)

X = sparse.csr_matrix(
    (np.concatenate(data_list), np.concatenate(indices_list), indptr),
    shape=(N_SAMPLES, N_GENES),
)
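As for the normalization question: total-sum normalization works on a CSR matrix directly, without densifying it, by scaling each row with a sparse diagonal matrix. A minimal sketch (the tiny count matrix here is made up for illustration):

```python
import numpy as np
from scipy import sparse

# hypothetical tiny sample-by-gene count matrix (rows = samples)
X = sparse.csr_matrix(np.array([[1.0, 0.0, 3.0],
                                [0.0, 2.0, 2.0]]))

# total-sum normalization: scale each row so its entries sum to 1
row_sums = np.asarray(X.sum(axis=1)).ravel()
X_norm = sparse.diags(1.0 / row_sums) @ X

print(X_norm.toarray())
```

The result stays sparse throughout, so the same pattern scales to the full matrix built above.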