Differential Expression and Batch Correction

Hello, thank you for this excellent tool! I’ve reviewed the online documentation and many posts (e.g. 57) on this forum regarding batch correction and the .differential_expression() method, but I’m still confused about its application to our analysis.

We are running a typical scRNA-seq atlas analysis made up of multiple batches of samples from multiple patients who share a rare disease. For technical reasons, some patients were run in more than one batch; these replicates do not have any relevant biological differences. For example:

Batch 1: patient 1, patient 2, patient 3 (hashed)
Batch 2: patient 3
Batch 3: patient 4
Batch 4: patient 5
Batch 5: patient 5

We preprocess and then integrate with scVI, providing batch and patient as two categorical_covariate_keys in the model. We then perform Leiden clustering and compute the differentially expressed genes (DEGs) between clusters, with the goal of using the DEGs to manually annotate cell clusters, as described in the tutorial.

When we run model.differential_expression(groupby="leiden"), should we set batch_correction=True? We don’t typically do this and have seen biologically reasonable results without it.

Our goal is to get the differentially expressed genes between one cluster and all the other clusters, so that we can better understand the cell identities. There are usually multiple batches in each cluster, and we don’t want the DEGs to be dominated by batch effects between clusters. So “averaging across batches” seems to be what we want. In practice though, I don’t see how I could define batchid1 and batchid2 for this experiment.

A more in-depth explanation of batch_correction in the API or here would probably be helpful if possible, as I feel like many users have asked questions about it.

Thank you for your time!

Hi, for the initial model I recommend setting batch_id to the individual samples. Setting it to anything else tends to give weaker integration, because sample-level differences (e.g. donor sex, but also ambient counts) are not corrected.
Whether to include batch correction in DE depends on the case. For DE between cell types or clusters (your analysis here) it can make sense, but it is not suitable for studying genes that differ between groups such as healthy and diseased. With batch correction enabled, cells are projected to each batch, and differential expression is computed on these projected cells. I would recommend disabling batch correction by default and enabling it only if genes that are expected to differ between donor groups come out as significantly DE. This happens, for example, when group A contains 95% cells from female donors and group B 95% cells from male donors, so that X chromosome genes appear as markers of group B.
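As a rough illustration of this recommendation, the sketch below first checks how strongly each cluster is dominated by a single donor, then runs one-vs-rest DE for a cluster with batch_correction switched on or off. The helper names, the "leiden" and "patient" labels, and the 0.9 threshold are assumptions made for the example, not part of scvi-tools. batchid1 and batchid2 are deliberately left unset; as far as I understand the scvi-tools documentation, all registered batch categories are then used.

```python
from collections import Counter, defaultdict


def dominant_donor_fraction(clusters, donors):
    """Per cluster, return (most common donor, fraction of cells from that donor)."""
    counts = defaultdict(Counter)
    for cluster, donor in zip(clusters, donors):
        counts[cluster][donor] += 1
    result = {}
    for cluster, counter in counts.items():
        donor, n = counter.most_common(1)[0]
        result[cluster] = (donor, n / sum(counter.values()))
    return result


def cluster_markers(model, cluster, batch_correction=False):
    """One-vs-rest DE for a single cluster.

    `model` is assumed to be a trained scvi-tools model. group2 is not
    passed, so the cluster is compared against all remaining cells, and
    batchid1/batchid2 are not passed, so the library defaults apply.
    """
    return model.differential_expression(
        groupby="leiden",
        group1=[cluster],
        batch_correction=batch_correction,
    )


# Toy labels standing in for adata.obs["leiden"] and adata.obs["patient"].
clusters = ["0"] * 20 + ["1"] * 20
donors = ["patient1"] * 19 + ["patient2"] + ["patient1"] * 10 + ["patient2"] * 10

fractions = dominant_donor_fraction(clusters, donors)
# Hypothetical cutoff: flag clusters where one donor supplies >= 90% of cells.
skewed = [c for c, (_, frac) in fractions.items() if frac >= 0.9]
print(skewed)
```

A cluster flagged this way is a candidate for a second run with batch_correction=True, so the two marker lists can be compared. The composition check is only a proxy for the confounding described above, so it is worth inspecting the DE results themselves as well.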
I would also highly recommend running other DE methods such as PyDESeq2 to compute adjusted p-values and correct for covariates. Neither is performed by scvi-tools DE.
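To make "adjusted p-values" concrete, here is a minimal sketch of the generic Benjamini-Hochberg procedure, which is a common way to control the false discovery rate across many genes. It is a textbook version written for illustration, with a hypothetical function name, and it is not taken from PyDESeq2 or scvi-tools.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    n = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted


# Toy p-values for four genes.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
print(adjusted)
```

In a real analysis the adjusted values would come from the DE package itself; the sketch only shows what the adjustment does to a list of raw p-values.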
