Impact of batch on TotalVI results

Good evening, thank you for maintaining TotalVI! We are applying a pre-existing model and classifier built from TotalVI to a new dataset. This new dataset includes samples across several different cancer types, patients, techniques (multiome, single-cell, single-nuclei), and labs. Some cancers have samples from multiple techniques while others do not. We would like to compare our cell types across cancers as best as possible. When we run with batch set to different variables, we obtain very different results. Since our conclusions will differ based upon what we ultimately choose as batch, we would like to understand better what the batch variable is doing.

Are you able to provide guidance or information on what the batch variable is doing in the model to help us to select the best batch option? Any input is very much appreciated, thank you!

Hey Jennifer,

The best batch option also depends on what you are looking for. For example, if you are analysing the biological signal in your data, you should use only technical covariates as batch (donor, site, etc.). You can also add a flag indicating which technique was used for each sample, since this can be another source of bias.

As for what batch_key does specifically: that depends on the model, but in all cases it tries to remove noise that originates from technical factors, in other words unwanted variation.
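To make this a bit more concrete, here is a minimal sketch of how you could build a purely technical batch column and check whether it is confounded with cancer type before handing it to the model. The metadata table, the column names (`cancer_type`, `technique`, `lab`, `tech_batch`) and the `protein_expression` key are made up for illustration, so swap in whatever is in your own `adata.obs` and `adata.obsm`.

```python
import pandas as pd

# Hypothetical per-cell metadata standing in for adata.obs
obs = pd.DataFrame(
    {
        "cancer_type": ["BRCA", "BRCA", "LUAD", "LUAD"],
        "technique": ["single-cell", "single-nuclei", "single-cell", "multiome"],
        "lab": ["lab_A", "lab_A", "lab_B", "lab_B"],
    }
)

# Combine only the technical factors into one batch label.
# Biological variables such as cancer type are deliberately left out.
obs["tech_batch"] = (obs["technique"] + "_" + obs["lab"]).astype("category")

# Cross-tabulate batch against cancer type. If a batch level contains
# a single cancer type only, batch and biology are confounded there and
# correcting for that batch may also remove cancer-specific signal.
confounding = pd.crosstab(obs["tech_batch"], obs["cancer_type"])
print(confounding)

# With scvi-tools installed and a real AnnData object, the column would be
# passed as batch_key (not executed here; the obsm key is an assumption):
# scvi.model.TOTALVI.setup_anndata(
#     adata,
#     protein_expression_obsm_key="protein_expression",
#     batch_key="tech_batch",
# )
```

In this toy table every batch level contains only one cancer type, which is the situation where the choice of batch changes the conclusions the most. On real data the same cross-tab is a quick way to see which cancers share a technique or lab and can therefore be compared more safely.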

See the following resources for more information:
