Is anyone aware of benchmarks that have tested how large differences in library size (i.e., mean total counts per cell) affect integration with scVI and query mapping with scArches? I am working with an scVI model trained on a reference that has substantially lower counts per cell than the query (reference mean total count ~1000, query mean total count ~5000). After query mapping, I see a relationship between a cell's total counts and its similarity to the reference. I'm curious whether other people have looked at this before I move on to downsampling experiments, since in my case several other biological factors are correlated with total counts per cell.
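For context, the downsampling experiment I have in mind would look roughly like this (just a sketch; the helper name, the 1000-count target, and the use of scanpy's `pp.downsample_counts` are placeholders on my side):

```python
import numpy as np
import scanpy as sc

def downsample_query_to_reference_depth(query_adata, reference_mean_counts=1000, seed=0):
    """Downsample raw query counts so no cell exceeds the reference mean depth."""
    adata = query_adata.copy()
    totals = np.asarray(adata.X.sum(axis=1)).ravel()
    # Cap each cell at the (assumed) reference mean total count before re-mapping
    target = np.minimum(totals, reference_mean_counts).astype(int)
    sc.pp.downsample_counts(adata, counts_per_cell=target, random_state=seed)
    return adata
```

The idea would be to re-map the downsampled query and check whether the relationship between total counts and similarity to the reference persists.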
From the user guide I understand that the current scVI default is to treat the library size as observed, using the sum of counts per cell: "the recent default for scVI is to treat library size as observed, equal to the total RNA UMI count of a cell."
Is training the reference model with `use_observed_lib_size=False` likely to make a difference here?
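In case it helps frame the question, this is roughly how I would set that up (a sketch; `layer="counts"`, `batch_key="batch"`, and the training arguments are illustrative choices on my side, not taken from the user guide):

```python
import scvi

# Reference training with a latent library-size variable instead of observed totals
scvi.model.SCVI.setup_anndata(ref_adata, layer="counts", batch_key="batch")
ref_model = scvi.model.SCVI(ref_adata, use_observed_lib_size=False)
ref_model.train()
ref_model.save("ref_model", overwrite=True)

# scArches-style query mapping onto the frozen reference
query_model = scvi.model.SCVI.load_query_data(query_adata, "ref_model")
query_model.train(max_epochs=200, plan_kwargs={"weight_decay": 0.0})
query_latent = query_model.get_latent_representation(query_adata)
```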
Thanks a lot!