I was wondering whether an imbalance in the number of cells across cell types, batches, or datasets significantly affects scVI training.
For example, when I subset a large atlas to at most 10k cells per cell type and to roughly the same number of cells as the other datasets, I seem to obtain better results. Is this expected, or is imbalance normally handled by the scVI loss functions?
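For concreteness, the per-cell-type cap can be sketched roughly as below. This is only a minimal illustration on toy labels: the helper name `cap_cells_per_label`, the `cell_type` column name, and the cap of 10 used in the toy call are assumptions made for the example, not anything provided by scVI.

```python
import numpy as np


def cap_cells_per_label(labels, max_cells=10_000, seed=0):
    """Return sorted indices that keep at most `max_cells` cells per label.

    Hypothetical helper for illustration; not part of scvi-tools.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep = []
    for label in np.unique(labels):
        idx = np.flatnonzero(labels == label)
        if idx.size > max_cells:
            # Randomly downsample only the over-represented labels.
            idx = rng.choice(idx, size=max_cells, replace=False)
        keep.append(idx)
    return np.sort(np.concatenate(keep))


# Toy labels standing in for adata.obs["cell_type"] (assumed column name).
labels = np.array(["T cell"] * 50 + ["B cell"] * 8 + ["NK"] * 3)
keep_idx = cap_cells_per_label(labels, max_cells=10)

# With a real AnnData object this would be: adata_sub = adata[keep_idx].copy()
```

Rare cell types are left untouched here; only labels above the cap are downsampled.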
I don’t think we have tested this extensively. Since version 1.2.0 you can define custom train indices, which would allow testing stratified sampling, but we haven’t done so yet. In scvi.criticism, we don’t see significantly worse performance for low-abundance cell types compared with downsampling highly abundant cells after training (the metrics are slightly noisy, and values increase with more cells).
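If someone wants to try the stratified variant, building the indices could look roughly like the sketch below. It only constructs per-label train/validation indices with NumPy; the helper name `stratified_split` and the 90/10 split are assumptions for illustration, and how the indices are handed to the model should be checked against the scvi-tools data splitter documentation for your version.

```python
import numpy as np


def stratified_split(labels, train_frac=0.9, seed=0):
    """Split cell indices into train/validation sets, label by label.

    Hypothetical helper for illustration; not part of scvi-tools.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train, val = [], []
    for label in np.unique(labels):
        # Shuffle the cells of this label, then cut at the train fraction.
        idx = rng.permutation(np.flatnonzero(labels == label))
        n_train = max(1, int(round(train_frac * idx.size)))
        train.append(idx[:n_train])
        val.append(idx[n_train:])
    return np.sort(np.concatenate(train)), np.sort(np.concatenate(val))


# Toy labels standing in for adata.obs["cell_type"] (assumed column name).
labels = np.array(["T cell"] * 90 + ["B cell"] * 10)
train_idx, val_idx = stratified_split(labels, train_frac=0.9)
```

Each label contributes the same fraction to the training set, so rare cell types are guaranteed at least one training cell rather than being left to chance by a purely random split.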