When I run DE analysis with scVI multiple times, I get lists of genes that don’t overlap much (after applying my own significance filters, which are the same across DE runs). How do people deal with this stochasticity in scVI DE analysis? Does it mean my groups are not very different from each other?
The DE analysis is based on sampling, so there will be variability between runs.
However, the resulting summary statistics should be (mostly) consistent between runs.
What orders of magnitude for fold changes and posterior probabilities are you getting for the top genes?
As you suspect, if the top genes have small fold changes and a lot of variability (leading to posterior probabilities close to 0.5), the results can change a lot between runs.
In my own analysis, I tend to consider results “successful” when the posterior probabilities are below ~0.1 or above ~0.9 and the fold changes are at least 2x in either direction. (These thresholds are largely arbitrary.)
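As a concrete sketch, those filters could be applied to a DE results table like this (the proba_de / lfc_mean column names mimic scVI’s DE output, but treat them and the toy values as assumptions):

```python
import pandas as pd

# Hypothetical DE results table; column names and values are assumptions
# standing in for the output of scVI's differential expression.
de = pd.DataFrame(
    {
        "proba_de": [0.98, 0.52, 0.04, 0.95],
        "lfc_mean": [2.3, 0.1, -0.2, -1.5],  # log2 fold changes
    },
    index=["geneA", "geneB", "geneC", "geneD"],
)

# Posterior probability below 0.1 or above 0.9 = a confident call
# (clearly DE or clearly not DE); near 0.5 = too uncertain to keep.
confident = (de["proba_de"] < 0.1) | (de["proba_de"] > 0.9)

# At least a 2x fold change in either direction, i.e. |log2 FC| >= 1.
strong = de["lfc_mean"].abs() >= 1.0

hits = de[confident & strong]
```

Here geneB is dropped for an uncertain posterior probability and geneC for a negligible fold change, leaving geneA and geneD.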
Many thanks for your reply and sorry for my delayed response, I’ve been on holidays.
Thanks for your clarification and for sharing your own criteria for DE gene selection. I am using similar values for selection. However, I suspect the signal in my dataset is not super strong. To get more robust results, I’ve decided to run model.differential_expression multiple times (e.g. 50), write the results to a DataFrame, and then group by gene name, taking the mean of all the DE values. This yields a more or less stable list of DE genes.
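A minimal sketch of that averaging step, with toy DataFrames standing in for the real model.differential_expression outputs (the proba_de column name and the values are assumptions):

```python
import pandas as pd

# Toy stand-ins for repeated model.differential_expression(...) calls;
# in practice each run returns a DataFrame indexed by gene name.
runs = [
    pd.DataFrame({"proba_de": [0.90, 0.40]}, index=["geneA", "geneB"]),
    pd.DataFrame({"proba_de": [0.94, 0.36]}, index=["geneA", "geneB"]),
    pd.DataFrame({"proba_de": [0.92, 0.44]}, index=["geneA", "geneB"]),
]

# Stack all runs into one long DataFrame, then average every
# DE statistic per gene across runs.
stacked = pd.concat(runs)
averaged = stacked.groupby(stacked.index).mean()
```

Applying the significance filters to `averaged` rather than to a single run then gives a more stable gene list.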
Thanks again for your always timely help!
Sounds like a reasonable idea!
I think you can achieve the same result if you increase the number of samples: the default n_samples = 5000 can be changed to e.g. 100,000.
Oh, great, thanks for the suggestion, Valentine!
Though you will have to quadruple the number of samples to reduce the standard deviation of the estimates by half.
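This scaling can be checked empirically: the standard error of a Monte Carlo mean over n samples behaves like σ/√n, so quadrupling n halves it. A small illustrative numpy sketch (toy standard-normal draws, not scVI samples):

```python
import numpy as np

rng = np.random.default_rng(0)

def se_of_mean(n_samples, n_repeats=2000):
    """Empirical standard deviation of the mean of n_samples draws,
    estimated over n_repeats independent repetitions."""
    means = rng.standard_normal((n_repeats, n_samples)).mean(axis=1)
    return means.std()

se_small = se_of_mean(1_000)  # ~ 1 / sqrt(1000)
se_large = se_of_mean(4_000)  # ~ 1 / sqrt(4000)

ratio = se_small / se_large   # ~ 2: 4x the samples, half the error
```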
Interesting discussion. Could you please elaborate on why one would need to quadruple the samples, and how that connects to the standard error?