SCANVI soft labeling

I trained SCVI and SCANVI models on a dataset. To test the probabilities output by soft labeling, I withheld one cluster from training. The hypothesis was that the model should classify cells in this cluster with low confidence (a lower max probability). I ran this exercise 10 times, each time training on all clusters except the ‘Unseen cluster’. Here is what I get:

In the left plot, I show the fraction of cells with max probability below 0.95, with each of the 10 runs colored separately. As expected, this fraction is quite low (~2%) among training cells and higher among cells from the ‘Unseen cluster’. In the right plot, I show the median and interquartile range (25th to 75th percentile) of max probability for the 10 runs, for training cells and ‘Unseen’ cells.
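
For anyone who wants to reproduce the two summaries, here is a minimal sketch of how they can be computed from a table of soft predictions. It assumes the soft labels are available as a cells-by-labels array of probabilities (the kind of output I get from `SCANVI.predict` with `soft=True`). The function name `summarize_max_prob` and the toy array are made up for illustration and are not part of scvi-tools.

```python
import numpy as np


def summarize_max_prob(soft, threshold=0.95):
    """Summarize classifier confidence from soft label probabilities.

    soft: (n_cells, n_labels) array-like, one row of probabilities per cell.
    Returns the fraction of cells whose max probability is below `threshold`,
    plus the 25th, 50th and 75th percentiles of the max probability.
    """
    soft = np.asarray(soft, dtype=float)
    max_prob = soft.max(axis=1)  # per-cell confidence
    q25, median, q75 = np.percentile(max_prob, [25, 50, 75])
    return {
        "frac_below": float((max_prob < threshold).mean()),
        "q25": float(q25),
        "median": float(median),
        "q75": float(q75),
    }


# Toy stand-in for one run's soft predictions (hypothetical values).
soft = np.array([
    [0.99, 0.01],
    [0.60, 0.40],
    [0.97, 0.03],
    [0.50, 0.50],
])
summary = summarize_max_prob(soft)
print(summary)
```

Running this once per model run, separately for training cells and ‘Unseen’ cells, gives the numbers behind both plots.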

What I find strange is the variation in the output of SCANVI.predict across 10 runs on the same training data. The variation is quite large. Is this the expected stochasticity between model runs?

Thanks for sharing this. Indeed, this is weird, and we are looking into improvements to scANVI’s classifier component. It also seems related to: