Differences in library sizes between reference and query

Is anyone aware of benchmarks that have tested how large differences in library size (i.e. mean total counts per cell) affect integration with scVI and query mapping with scArches? I am working with an scVI model trained on a reference with significantly lower counts per cell than the query (reference mean total count ~1000, query mean total count ~5000). After query mapping, I see a relationship between a cell’s total counts and its similarity to the reference. I’m curious whether other people have looked at this before I go into downsampling experiments, since in my case a number of other biological factors are correlated with total counts per cell.
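For context, this is roughly how I’m quantifying “similarity to the reference” (a minimal sketch only; `adata_ref`/`adata_query`, the trained reference model `vae_ref`, and the query-mapped model `vae_q` are placeholders for my own objects):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.neighbors import NearestNeighbors

# Latent coordinates for the reference and the mapped query cells
z_ref = vae_ref.get_latent_representation(adata_ref)
z_query = vae_q.get_latent_representation(adata_query)

# Distance from each query cell to its nearest reference neighbour in latent space
nn = NearestNeighbors(n_neighbors=1).fit(z_ref)
dist, _ = nn.kneighbors(z_query)

# Correlation between per-cell total counts and distance to the reference
total_counts = np.asarray(adata_query.X.sum(axis=1)).ravel()
print(spearmanr(total_counts, dist.ravel()))
```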

From the user guide, I gather that the default for the scVI model is to use the sum of counts per cell as the library size:

> the recent default for scVI is to treat library size as observed, equal to the total RNA UMI count of a cell.

Is training the reference model with `use_observed_lib_size=False` likely to make a difference here?
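For concreteness, this is the kind of change I mean (a sketch only; I’m assuming raw counts live in `adata_ref.layers["counts"]`, the batch column is `"batch"`, and that `use_observed_lib_size` is passed through to the underlying module):

```python
import scvi

# Register the reference counts with scvi-tools
scvi.model.SCVI.setup_anndata(adata_ref, layer="counts", batch_key="batch")

# With use_observed_lib_size=False, scVI learns a latent library-size factor
# per cell instead of fixing it to the observed total UMI count
vae_ref = scvi.model.SCVI(adata_ref, use_observed_lib_size=False)
vae_ref.train()
```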

Thanks a lot!

Hi, in my hands it’s usually not a good idea. Purely for the latent space it might be fine, but I wouldn’t trust downstream methods like `get_normalized_expression` or differential expression. In general, I would suggest training from scratch and checking whether the same structure shows up. It doesn’t necessarily mean that structure is a feature you are interested in, and additionally adding `total_counts` as a continuous covariate key might then help.
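Something along these lines, as a sketch (assuming total counts are already stored in `adata.obs["total_counts"]` and raw counts in `adata.layers["counts"]`):

```python
import scvi

# Register total counts as a continuous covariate so the model can account
# for count depth explicitly rather than encoding it in the latent space
scvi.model.SCVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",
    continuous_covariate_keys=["total_counts"],
)
vae = scvi.model.SCVI(adata)
vae.train()
```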

Hi @cane11, thanks for the tips. By training from scratch, do you mean training an scVI model on the concatenated reference and query, instead of using query mapping?

Hi Emma, yes, after concatenation, or just use the query data if you are not interested in the reference data.
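Roughly like this (a sketch, assuming both objects share a `"counts"` layer and a `"batch"` column):

```python
import anndata as ad
import scvi

# Combine reference and query on their shared genes
adata_full = ad.concat([adata_ref, adata_query], label="dataset", join="inner")

# Train a fresh model on the combined object, optionally registering
# total_counts as a continuous covariate as suggested above
scvi.model.SCVI.setup_anndata(adata_full, layer="counts", batch_key="batch")
scvi.model.SCVI(adata_full).train()
```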