Getting normalized expression

Hi,

I am working on a large scRNA-seq atlas with 2 million cells. I want to get the normalized expression. However, since it is returned as a data frame/NumPy array, I run out of memory every time I try to retrieve it. I need it for performing DE. Is there any other way to get it, or to perform DE between clusters without using it?

Thanks,
Kam

Yes, of course, there’s a direct way to run DE from a scVI-trained model without calling get_normalized_expression first: model.differential_expression(…). You can state whether you want group vs. group or group vs. all, which column to group by, and so on.

If you still need the whole normalized expression itself and do not have enough memory, you can extract it in smaller chunks of the adata.

Hi,

Thanks for your response. I tried the direct way using model.differential_expression. It throws an out-of-memory error, probably due to the size of the adata. Is there a way to resolve this? Or maybe extracting the normalized data in smaller chunks would help; could you please guide me on how to do it?

It should be something like:

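import scipy.sparse as sp

chunk_size = 50000  # cells per chunk; lower this if memory is still tight
all_chunks = []

for start in range(0, adata.n_obs, chunk_size):
    end = min(start + chunk_size, adata.n_obs)

    # Dense (cells x genes) array for this chunk only
    x = model.get_normalized_expression(
        adata=adata[start:end],
        return_numpy=True,
    )

    # Convert right away so the dense chunk can be freed
    all_chunks.append(sp.csr_matrix(x))

# Stack the per-chunk matrices back into one (n_obs x n_genes) matrix
normalized = sp.vstack(all_chunks, format="csr")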

This stores each chunk as a sparse matrix.

Do you really need to run DE on all cells? Usually we run one group vs. all, or group vs. group, e.g.:

de_df = model.differential_expression(
    groupby="cell_type",
    group1="B_cell",
    group2="T_cell",
)

Thanks a lot!
I am looking for cell-type-specific markers, so I have generated clusters and am performing DE between each cluster and the other clusters.

Can I store the normalized expression as a sparse matrix in an h5ad object for future use?

Yes, that’s the idea.

For the clusters, just set groupby to your cluster column and group1 and group2 to the cluster IDs.
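A minimal sketch of the one-cluster-vs-rest loop described above, assuming a trained scVI model and a cluster column in adata.obs. The helper name markers_per_cluster and the column name "leiden" are made up for illustration; groupby and group1 are the arguments shown earlier in this thread. Leaving group2 unset should compare group1 against all remaining cells.

```python
def markers_per_cluster(model, cluster_key, clusters):
    """Run one-vs-rest DE for each cluster and collect the result tables.

    `model` is assumed to be a trained scVI model; anything exposing a
    compatible `differential_expression` method works.
    """
    results = {}
    for cluster in clusters:
        # group2 is left unset, so the cluster is compared against the rest
        results[cluster] = model.differential_expression(
            groupby=cluster_key,
            group1=cluster,
        )
    return results
```

Usage would look like markers = markers_per_cluster(model, "leiden", adata.obs["leiden"].unique()), where "leiden" stands in for your own cluster column.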


You will want to add a filter that sets values below a certain threshold (e.g. 1e-5) to zero, to increase sparsity. However, we usually do not recommend using normalized counts for downstream tests such as Wilcoxon or the t-test.


Is there an option to do that when getting the normalized expression?

You need to do that on the raw data, as preprocessing with scanpy/anndata, before training the model.

I meant that at this step you can apply a filter to x, setting values below this threshold to zero.
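A minimal sketch of that filter, applied to one dense chunk x before it is converted to sparse. The helper name sparsify_chunk is hypothetical, and the 1e-5 default is just the example value mentioned above; pick a threshold that suits your data.

```python
import numpy as np
import scipy.sparse as sp


def sparsify_chunk(x, threshold=1e-5):
    """Zero out entries below `threshold` and return a CSR matrix."""
    x = np.asarray(x, dtype=np.float32)
    # Keep entries at or above the threshold, replace the rest with 0
    filtered = np.where(x >= threshold, x, 0.0).astype(np.float32)
    # csr_matrix stores only the non-zero entries
    return sp.csr_matrix(filtered)
```

In the chunk loop this would take the place of sp.csr_matrix(x). The stacked result could then be kept in the AnnData object, for example as adata.layers["scvi_normalized"] (a layer name made up here), and saved with adata.write_h5ad.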


It might also be worth computing posterior predictive samples: those are counts sampled from the negative binomial distribution using the parameters learned by scVI.
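To illustrate what "sampled from the negative binomial distribution" means, here is a toy NumPy sketch. It does not call scVI: the means mu and inverse dispersions theta are made-up numbers standing in for the parameters a trained model would provide, and sample_nb_counts is a hypothetical helper.

```python
import numpy as np


def sample_nb_counts(mu, theta, n_samples=1, seed=0):
    """Draw counts from a negative binomial with mean `mu` and
    inverse dispersion `theta` (variance = mu + mu**2 / theta)."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    theta = np.asarray(theta, dtype=float)
    # NumPy parameterizes by (n, p); this choice gives mean mu
    p = theta / (theta + mu)
    return rng.negative_binomial(theta, p, size=(n_samples,) + mu.shape)


# Made-up parameters for three genes in one cell
counts = sample_nb_counts(mu=[0.2, 5.0, 40.0], theta=[1.0, 2.0, 10.0], n_samples=4)
print(counts.shape)  # (4, 3)
```

The draws are non-negative integers, so they look like raw counts rather than normalized values.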

Thanks a lot! Could you please tell me the use of computing the posterior predictive samples? I am new to this topic and learning about it.

See the scvi.model.SCVI page in the scvi-tools documentation. We described its use in the scvi-hub manuscript: https://www.nature.com/articles/s41592-025-02799-9