Build a large AnnData object column by column

Hello, I have mapping results for >1K samples and >1B genes. When I load a single sample as a pandas DataFrame, it takes about 1 GB of memory, so loading all samples into memory at once will not be possible.

Is there a way to build a sparse matrix sample by sample (column by column), ideally backed on disk?

Will I be able to apply algorithms such as total-sum normalization to the combined data, or do I need to find another, distributed way to handle this data?

Hey @SilasK,

Yeah, that should be possible either in memory or on disk. I would note that with scverse (and the Python ecosystem more generally), samples are rows instead of columns.

I don’t know what format your data is in right now, but the code would look roughly like:

import numpy as np
from scipy import sparse

samples = ...

# CSR buffers: indptr[i] marks where row i starts in the flat indices/data arrays
indptr = np.zeros(N_SAMPLES + 1, dtype=np.int64)
indices_list = []
data_list = []

for i, sample in enumerate(samples):
    # Read one sample as a dense 1-D array of length N_VARIABLES
    row_dense = read_sample(sample)
    # flatnonzero returns a flat index array (np.nonzero would return a tuple)
    row_indices = np.flatnonzero(row_dense)
    row_data = row_dense[row_indices]
    # Each row's entries start where the previous row's ended
    indptr[i + 1] = indptr[i] + len(row_indices)
    indices_list.append(row_indices)
    data_list.append(row_data)

# Assemble the CSR matrix from the concatenated per-sample buffers
X = sparse.csr_matrix(
    (
        np.concatenate(data_list),
        np.concatenate(indices_list),
        indptr
    ),
    shape=(len(samples), N_VARIABLES)
)
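
For the disk-backed and normalization parts of the question, one option is to wrap X in an AnnData, normalize it, and write it to an .h5ad file that can later be opened in backed mode. This is a minimal sketch, assuming scanpy is installed; the file name counts.h5ad is just a placeholder:

import anndata as ad
import scanpy as sc

# Wrap the sparse matrix in an AnnData; samples are the obs (rows)
adata = ad.AnnData(X=X)
adata.obs_names = [str(s) for s in samples]

# Total-sum normalization (counts per million here) on the in-memory matrix
sc.pp.normalize_total(adata, target_sum=1e6)

# Persist to disk; the file can later be reopened without loading X fully,
# e.g. ad.read_h5ad("counts.h5ad", backed="r")
adata.write_h5ad("counts.h5ad")

If even the sparse matrix is too large to hold in memory, you would need to write the rows to disk incrementally or look at a distributed/out-of-core backend, but for sparse mapping counts the in-memory route above is usually worth trying first.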