scVI: effect of unbalanced number of cells in batches

jay-b · November 19, 2024, 8:56am

I was wondering whether an imbalance in the number of cells across cell types, batches, or datasets significantly affects scVI training.

For example, when I subset a large atlas to at most 10k cells per cell type, so that it has roughly the same number of cells as the other datasets, I seem to obtain better results. Is this expected, or is imbalance normally handled by the scVI loss function?
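
For reference, the kind of subsetting described above can be sketched roughly as follows. This is only an illustration in plain Python: the function name `cap_cells_per_group` and the toy labels are made up for the example, and the cap is set to 10 instead of 10k so the effect is visible.

```python
import random
from collections import defaultdict


def cap_cells_per_group(labels, max_per_group=10_000, seed=0):
    """Return sorted row indices keeping at most `max_per_group` cells per label.

    Hypothetical helper for illustration; it is not part of scvi-tools.
    """
    rng = random.Random(seed)

    # Collect the row indices belonging to each label.
    by_group = defaultdict(list)
    for i, label in enumerate(labels):
        by_group[label].append(i)

    keep = []
    for idx in by_group.values():
        # Randomly downsample only the groups that exceed the cap.
        if len(idx) > max_per_group:
            idx = rng.sample(idx, max_per_group)
        keep.extend(idx)
    return sorted(keep)


# Toy labels: one abundant cell type and two rare ones.
labels = ["T cell"] * 50 + ["B cell"] * 8 + ["pDC"] * 3
keep = cap_cells_per_group(labels, max_per_group=10)
print(len(keep))
```

With an AnnData object, the labels would come from an `obs` column (for example `adata.obs["cell_type"]`, where the column name is an assumption) and the returned indices could be used as `adata[keep].copy()`.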

cane11 · November 27, 2024, 6:51am

I don’t think we have tested this extensively. Since 1.2.0 you can define custom train indices, which would allow testing stratified sampling, but we haven’t done so yet. In scvi.criticism, we don’t see significantly worse performance for low-abundance cell types compared with downsampling highly abundant cells after training (the metrics are slightly noisy, and values increase with more cells).
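
As a starting point for such a test, stratified train/validation indices could be built along these lines. This is a minimal sketch in plain Python, not something we have run against scVI: `stratified_split` and the toy batch labels are hypothetical, and how the resulting indices are passed to the model is deliberately left out, since the exact argument should be taken from the scvi-tools documentation for your version.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_frac=0.9, seed=0):
    """Split row indices into train/validation with the same fraction per label.

    Hypothetical helper for illustration; it is not part of scvi-tools.
    """
    rng = random.Random(seed)

    # Collect the row indices belonging to each label (batch or cell type).
    by_group = defaultdict(list)
    for i, label in enumerate(labels):
        by_group[label].append(i)

    train, val = [], []
    for idx in by_group.values():
        rng.shuffle(idx)
        # Keep at least one cell per group in the training set.
        n_train = max(1, round(train_frac * len(idx)))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:])
    return sorted(train), sorted(val)


# Toy example: one large batch and one small batch.
batches = ["atlas"] * 90 + ["small_study"] * 10
train_idx, val_idx = stratified_split(batches, train_frac=0.9)
print(len(train_idx), len(val_idx))
```

Splitting within each group guarantees that small batches or rare cell types appear in both sets in the same proportion, which a purely random split does not.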
