Hello, thank you for this excellent tool! I’ve reviewed the online documentation and many posts (e.g. 57) on this forum regarding batch correction and the .differential_expression() method, but I’m still confused about its application to our analysis.
We are running a typical scRNA-seq atlas analysis made up of multiple batches of samples from multiple patients who share a rare disease. Some of these batches are replicates from different patients, run separately for technical reasons, and do not have any relevant biological differences. For example:
Batch1: patient1, patient2, patient3 (hashed)
Batch2: patient3
Batch3: patient4
Batch4: patient5
Batch5: patient5
We preprocess and then integrate with scVI, providing batch and patient as two categorical_covariate_keys in the model. We then perform Leiden clustering and calculate the differentially expressed genes (DEGs) between clusters, with the goal of using the DEGs to manually annotate cell clusters, as described in the tutorial.
When we run model.differential_expression(groupby="leiden"), should we set batch_correction=True? We don’t typically do this and have seen biologically reasonable results without it.
Our goal is to get the differentially expressed genes between one cluster and all the other clusters, so that we can better understand the cell identities. There are usually multiple batches in each cluster, and we don’t want the DEGs to be dominated by batch effects between clusters. So “averaging across batches” seems to be what we would want to do. In practice though, I don’t see how I could define batchid1 and batchid2 for this experiment.
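To make the question concrete, here is a minimal sketch of what I imagine defining batchid1 and batchid2 would involve for a one-vs-rest comparison. It only uses the standard library and toy labels. The helper names (batches_per_cluster, one_vs_rest_de_kwargs) and the cluster assignments are hypothetical, and I am assuming from my reading of the API docs that batchid1 and batchid2 expect subsets of the categories registered under batch_key, which may not match our setup where batch is passed through categorical_covariate_keys.

```python
from collections import Counter, defaultdict


def batches_per_cluster(clusters, batches):
    """Count cells per batch within each cluster (hypothetical helper)."""
    counts = defaultdict(Counter)
    for cluster, batch in zip(clusters, batches):
        counts[cluster][batch] += 1
    return counts


def one_vs_rest_de_kwargs(clusters, batches, cluster):
    """Build keyword arguments for a one-vs-rest comparison (hypothetical helper).

    Assumption: batchid1 / batchid2 take lists of batch categories for
    group1 and for the remaining cells, respectively.
    """
    counts = batches_per_cluster(clusters, batches)
    in_cluster = set(counts[cluster])
    in_rest = set()
    for other, per_batch in counts.items():
        if other != cluster:
            in_rest.update(per_batch)
    return {
        "groupby": "leiden",
        "group1": cluster,
        "batch_correction": True,
        "batchid1": sorted(in_cluster),
        "batchid2": sorted(in_rest),
    }


# Toy per-cell labels; the cluster assignments are made up for illustration.
clusters = ["0", "0", "0", "1", "1", "1"]
batches = ["Batch1", "Batch2", "Batch2", "Batch1", "Batch3", "Batch4"]

kwargs = one_vs_rest_de_kwargs(clusters, batches, "0")
print(kwargs["batchid1"], kwargs["batchid2"])

# With a trained model, the call I have in mind would then be:
# de = model.differential_expression(**kwargs)
```

Is this roughly the intended usage, or should batchid1 and batchid2 simply be left at their defaults when comparing one cluster against all others?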
A more in-depth explanation of batch_correction in the API or here would probably be helpful if possible, as I feel like many users have asked questions about it.
Thank you for your time!