Hello, I have mapping results for >1K samples and >1B genes. When I load one sample into a pandas DataFrame, it takes about 1 GB of memory, so loading all samples into memory at once is not possible.
Is there a way to build a sparse matrix sample by sample (column by column), ideally backed by disk?
Will I then be able to apply algorithms like total-sum normalization to the combined data, or do I need to find a distributed way to handle it?
Yeah, that should be possible, either in memory or on disk. I would note that with scverse (and the Python ecosystem more generally), samples are rows instead of columns.
I don’t know what format your data is in right now, but the code would look roughly like:
import numpy as np
from scipy import sparse

N_SAMPLES = ...  # number of samples
N_GENES = ...    # number of genes
samples = ...

# Build the three CSR arrays (data, indices, indptr) one row at a time
indptr = np.zeros(N_SAMPLES + 1, dtype=np.int64)
indices_list = []
data_list = []
for i, sample in enumerate(samples):
    row_dense = read_sample(sample)  # 1-D array of length N_GENES
    row_indices = np.nonzero(row_dense)[0]
    row_data = row_dense[row_indices]
    indptr[i + 1] = indptr[i] + len(row_indices)
    indices_list.append(row_indices)
    data_list.append(row_data)

X = sparse.csr_matrix(
    (np.concatenate(data_list), np.concatenate(indices_list), indptr),
    shape=(N_SAMPLES, N_GENES),
)
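As for the normalization question: total-sum normalization works on a CSR matrix directly, without densifying it, by scaling each row with a sparse diagonal matrix. A minimal sketch (the tiny count matrix here is made up for illustration):

```python
import numpy as np
from scipy import sparse

# hypothetical tiny sample-by-gene count matrix (rows = samples)
X = sparse.csr_matrix(np.array([[1.0, 0.0, 3.0],
                                [0.0, 2.0, 2.0]]))

# total-sum normalization: scale each row so its entries sum to 1
row_sums = np.asarray(X.sum(axis=1)).ravel()
X_norm = sparse.diags(1.0 / row_sums) @ X

print(X_norm.toarray())
```

The result stays sparse throughout, so the same pattern scales to the full matrix built above.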