Is the output of `get_normalized_expression` batch-corrected or not?

Hello!
First things first, thank you for this awesome tool.

I know this question is similar to previous questions on this forum, but that’s part of the confusion I’m running into.

Some background: I have scRNA-seq data from several independent experiments (which I’ll refer to as batches) that I am hoping to pool and use for eQTL mapping. I am hoping to use scVI to integrate these data and correct for the batch effects that come from pooling multiple independent experiments.

I used `scvi.model.SCVI.setup_anndata` with `batch_key="experiment"` (that’s the categorical variable that encodes the scRNA-seq experiment), and I’ve trained an scVI model using that AnnData object. Now my question is this: is the output of `model.get_normalized_expression()` expected to be batch-corrected? In other words, can I use this data as input for eQTL mapping, or do I need to do additional batch-correction steps (e.g. calculating PEER factors)?
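For anyone following along, here is a minimal sketch of the setup described above. It assumes `adata` is an AnnData object holding raw counts with an `experiment` column in `adata.obs`; the helper name `fit_scvi_and_normalize` is made up for illustration, and everything is left at default settings, which may not match the actual run.

```python
def fit_scvi_and_normalize(adata, batch_key="experiment"):
    """Register batches, train scVI, and return the model plus normalized expression.

    `adata` is assumed to contain raw counts; `batch_key` names the obs column
    that encodes the experiment (an assumption taken from the post above).
    """
    import scvi  # imported here so the sketch can be read without scvi-tools installed

    # Tell scVI which obs column identifies the batch.
    scvi.model.SCVI.setup_anndata(adata, batch_key=batch_key)

    # Build and train the model with default hyperparameters.
    model = scvi.model.SCVI(adata)
    model.train()

    # Called with defaults; this is the output the question is about.
    normalized = model.get_normalized_expression()
    return model, normalized
```

Usage would look like `model, normalized = fit_scvi_and_normalize(adata)`.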

Some threads in this forum seem to suggest that the output of this function is batch-corrected (e.g. "Differential expression with scvi - batch correction?"). Other threads suggest that the output is not batch-corrected (e.g. "How to extract batch-corrected expression matrix from trained scVI vae model").

I’m really just hoping to get a straight answer on this. The tool and documentation are phenomenal overall, but this nuance is tripping me up.

Thank you so much in advance!

The output is not batch-corrected. See the other post. I strongly advise against using `get_normalized_expression` for eQTL mapping. scVI learns gene-gene dependencies, and this will lead to trans-eQTLs that are not backed by the data. This is such a sensitive field that you should follow best practices for sc-eQTL mapping and not apply a less well-validated approach (like using scVI to normalize counts). You can still use the latent space to identify cell types or similar cells across samples.
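For the latent-space use mentioned above, a minimal sketch could look like the following. It assumes `model` is an already trained scVI model and `adata` is the AnnData object it was trained on. The helper name `cluster_on_latent` and the keys `X_scVI` and `leiden_scvi` are arbitrary choices for illustration, and Leiden clustering via scanpy is just one possible way to group similar cells.

```python
def cluster_on_latent(adata, model, latent_key="X_scVI"):
    """Store the scVI latent space in `adata.obsm` and cluster cells on it.

    Only the latent representation is used here; the expression values in
    `adata` are left untouched for any downstream analysis.
    """
    import scanpy as sc  # imported here so the sketch can be read without scanpy installed

    # Low-dimensional embedding learned by the trained model.
    adata.obsm[latent_key] = model.get_latent_representation()

    # Neighbor graph and clustering computed on the latent space, not on counts.
    sc.pp.neighbors(adata, use_rep=latent_key)
    sc.tl.leiden(adata, key_added="leiden_scvi")
    return adata
```

The resulting cluster labels would end up in `adata.obs["leiden_scvi"]` and can be annotated as cell types, while the eQTL analysis itself stays on the raw data.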


Thanks so much for the response, this is very helpful. I’ll stick to using the scVI output for cell-type identification and rely on the raw data for downstream analysis.