DE analysis between two batch-specific clusters

Hi team,

Thank you so much for the wonderful tool.

I have a question about performing DE analysis. I have two datasets, each from a different batch. I have performed integration using scVI with the batch specified in the batch_key. With the trained model, I want to compare two clusters of cells using the differential_expression method from scVI.

The two clusters are batch-specific (i.e. each of them with cells only from one batch). Having said that, there are also clusters in the dataset that are overlapping across batches.

I was wondering what the output of differential expression means if I were to compare between these two batch-specific clusters, with the batch_correction=True? Supposedly it is correcting for batch effects so the output of above comparison would be meaningless? Or is the model trained in a way that any genes that differentiate batch A from batch B as a whole is filtered out and the remaining difference between the batch-specific clusters is being shown?

Hope I’m making sense, if not please let me know! Thank you.

I would run with batch_correction=False in this case as this option should only be used when group1 and group2 populations have members from multiple batches.

It’s hard to say, but the effect sizes can become skewed as there is an issue of out-of-distribution normalized/decoded expression data. If you’re comparing cell type A vs B and A and B are from different batches, the model has no notion of what cell type A would look like in cell type B’s batch.

We are still exploring the benefits and limitations of the built in DE procedure, but maybe one way to think about it is from the linear model perspective. If you were only comparing cell type A and B and they were confounded with batch effects, then I’m not sure you would model this batch effect in a linear model.

What do you think? @Valentine_Svensson @PierreBoyeau

Anecdotally, when I run these kinds of tests with batch_correction = False I get genes that are globally higher in one batch versus the other. But when I use batch_correction = True I get genes that are higher (or lower) only in the cluster of interest in the second batch.

I’m not completely sure why this happens, but it’s been pretty convenient. Thinking in linear terms, it could be that you’re effectively doing y ~ batch + batch:cluster + cluster and it subtracts the effect you’d see with a model like y ~ batch when using batch_correction = True (again, just anecdotal).

For exploring data, I’d suggest trying all variations and plotting a couple of top genes from each setting. This can help you figure out what’s going on in the data. This way you can see if the settings helps you find cluster-specific genes or not. If it’s not helping, and you’re finding the top genes from the tests don’t have a pattern of being cluster-specific, well, then it’s harder.


Thanks @adamgayoso and @Valentine_Svensson for your comments!
I think I’m getting the impression that while the DE analysis from the scVI trained model between batch-specific clusters would not make much statistical sense (without knowing much about what’s going on underneath), I can still go about doing some exploration to identify which genes may be higher/lower in one cluster compared to another. I think I’m seeing some DE genes that are biologically meaningful so I might look into them further with various plotting. Thanks again!
One more question, @Valentine_Svensson, does this logic (be it anecdotal) also apply to scenarios where I’m comparing batch-specific cluster to a cluster with mixed populations of cells from different batches?

Hi Josh,

I wouldn’t focus much on potential biological importance of the genes. Rather, check if the top genes have the pattern of specifically being highly expressed in only one of the batch specific clusters. Then after you have identified genes with that expression pattern, you can interpret them.

At that point, it is less important how you identified genes following this pattern. Imagine if you had known a famous marker for some cell type X. You look at the expression of that gene across your clusters and batches, and notice that it is only high in a cluster that is only present in one batch. I think this would at least convince me that in that one batch you captured “X cells”.

Yes I would think could work with several batches. But you can also just try and see, running the .differential_expression() and .get_normalized_expression() methods should just take a couple of seconds :).

Thank you very much @Valentine_Svensson for your input. One further question (sorry!): If I were to get the batch-corrected expression using .get_normalized_expression() that mirrors what I have done in .differential_expression() with batch_correction=True, should I be using the transform_batch parameter like so: transform_batch=['batch1','batch2'] (assuming there are two batches)? And NOT transform_batch=['batch1'] which would project the expression to one batch?

Hi Josh,

I believe you’d want to transform_batch='batch1', but I’m saying this mostly because I’m not sure what providing two batches to as a list to transform_batch does… Does it average the expressions or give both of them? @adamgayoso do you know?


So for get_normalized_expression, if return_mean is true it averages over them.

What happens if you put return_mean = False?