Hello, I have mapping results for >1K samples and >1B genes. When I load one sample into a pandas DataFrame, it takes about 1 GB of memory, so loading all samples into memory at once is not possible.
Is there a way to build a sparse matrix sample by sample (column by column), ideally backed by disk?
Will I then be able to apply algorithms like total-sum normalization to the combined data, or do I need to find a distributed way to handle it?
Yeah, that should be possible, either in memory or on disk. I would note that with scverse (and the Python ecosystem more generally), samples are rows instead of columns.
I don’t know what format your data is in right now, but the code would look roughly like:
import numpy as np
from scipy import sparse

N_SAMPLES = ...  # number of samples
N_GENES = ...    # number of genes
samples = ...

# Build the three CSR arrays (data, indices, indptr) one row at a time
indptr = np.zeros(N_SAMPLES + 1, dtype=np.int64)
indices_list = []
data_list = []
for i, sample in enumerate(samples):
    row_dense = read_sample(sample)  # 1-D array of length N_GENES
    row_indices = np.nonzero(row_dense)[0]
    row_data = row_dense[row_indices]
    indptr[i + 1] = indptr[i] + len(row_indices)
    indices_list.append(row_indices)
    data_list.append(row_data)

X = sparse.csr_matrix(
    (np.concatenate(data_list), np.concatenate(indices_list), indptr),
    shape=(N_SAMPLES, N_GENES),
)
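As for the normalization question: total-sum normalization works on a CSR matrix directly, without densifying it, by scaling each row with a sparse diagonal matrix. A minimal sketch (the tiny count matrix here is made up for illustration):

```python
import numpy as np
from scipy import sparse

# hypothetical tiny sample-by-gene count matrix (rows = samples)
X = sparse.csr_matrix(np.array([[1.0, 0.0, 3.0],
                                [0.0, 2.0, 2.0]]))

# total-sum normalization: scale each row so its entries sum to 1
row_sums = np.asarray(X.sum(axis=1)).ravel()
X_norm = sparse.diags(1.0 / row_sums) @ X

print(X_norm.toarray())
```

The result stays sparse throughout, so the same pattern scales to the full matrix built above.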