Recommendation for transform_batch / categorical_covariate_keys to obtain "batch corrected" counts


Thanks for making this great package! I’m working with a multi-batch dataset with several donors, generated from multiple studies with their own batch effects (technology and varying sequencing depth). I am interested in generating a “batch-corrected” count matrix for downstream analysis. I see that in scvi.model.SCVI.setup_anndata() there are options for categorical_covariate_keys and continuous_covariate_keys. So then I could use the various batch effects like “technology” and “donor” as categorical covariates, for example.

I also see that once I build the model, there is a function model.get_normalized_expression, which can take a transform_batch argument, but how should I combine this with the categorical_covariate_keys above?

I saw a similar question here but didn’t see a specific recommendation: link.

Additionally, I see that transform_batch requires one to specify a particular batch to treat each sample as if it came from, as noted here: link. In that case, would it make more sense to average over the larger batch effects such as “technology” but not the individual-to-individual batches like “donor”? Would that be reasonable?

Overall, I want to remove technical artifacts such as sequencing depth and cell vs. nuclei effects from my datasets, without removing biological states such as tissue location or sex. Thanks!

Hi, to use transform_batch you have to register the batch with batch_key instead of categorical_covariate_keys. You can provide a list of batches, and the output is the average over those batches (e.g., all samples for a specific technology).
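As a rough sketch of the suggestion above (not a definitive recipe): the column names "sample" and "technology" in adata.obs and both helper functions are hypothetical, and I am assuming the sample-level column is what gets registered as batch_key. The library_size=1e4 scaling is also just an illustrative choice. Note that the output is normalized expression averaged over the chosen batches rather than integer counts.

```python
def batches_for_technology(batch_labels, tech_labels, technology):
    """Return the sorted unique batch labels whose cells come from `technology`.

    Hypothetical helper: works on any two equal-length iterables of labels,
    e.g. adata.obs["sample"] and adata.obs["technology"].
    """
    return sorted({b for b, t in zip(batch_labels, tech_labels) if t == technology})


def corrected_expression(adata, technology, batch_col="sample", tech_col="technology"):
    """Hypothetical wrapper: train SCVI, then decode every cell as if it came
    from each batch of one technology and average the results."""
    import scvi  # imported lazily; requires scvi-tools to be installed

    # Register the batch via batch_key so that transform_batch can refer to it.
    scvi.model.SCVI.setup_anndata(adata, batch_key=batch_col)
    model = scvi.model.SCVI(adata)
    model.train()

    # All batches (here: samples) belonging to the chosen technology.
    targets = batches_for_technology(adata.obs[batch_col], adata.obs[tech_col], technology)

    # Passing a list to transform_batch averages the decoded expression over those batches.
    return model.get_normalized_expression(
        adata, transform_batch=targets, library_size=1e4
    )
```

Something like corrected_expression(adata, "10x") would then return one matrix in which every cell is expressed as if it had been profiled in the samples of that technology; the technology name here is only an example.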