Has something changed in differential expression in the most recent version of scvi-tools? Previously, plots of bayes_factor vs. lfc_median looked very similar to plots of prob_de vs. lfc_median. The prob_de plots still look OK, but the bayes_factor plots look like this:
Additionally, comparing the LFC estimates to the ground truth from simulated data gives wildly inaccurate results. Pseudobulk differential expression on the same dataset gives a perfect correlation (Pearson r = 1.0).
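For readers who want to reproduce this kind of check, here is a minimal sketch of a pseudobulk LFC and its Pearson correlation with a ground truth. It is not the poster's actual pipeline: the function names, the counts-per-million scaling, and the pseudocount are all assumptions made for illustration.

```python
import math

def pseudobulk_lfc(counts_a, counts_b, pseudocount=1.0, scale=1e6):
    """log2 fold change (group A over group B) per gene from summed counts.

    counts_a, counts_b: lists of cells, each cell a list of raw counts per gene.
    pseudocount and scale are illustrative defaults, not values from the thread.
    """
    n_genes = len(counts_a[0])
    # Sum raw counts over all cells in each group (the "pseudobulk" profile).
    bulk_a = [sum(cell[g] for cell in counts_a) for g in range(n_genes)]
    bulk_b = [sum(cell[g] for cell in counts_b) for g in range(n_genes)]
    # Rescale each profile to a common depth so sequencing depth cancels out.
    tot_a, tot_b = sum(bulk_a), sum(bulk_b)
    cpm_a = [x / tot_a * scale for x in bulk_a]
    cpm_b = [x / tot_b * scale for x in bulk_b]
    return [math.log2((a + pseudocount) / (b + pseudocount))
            for a, b in zip(cpm_a, cpm_b)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Given a list of true per-gene LFCs from the simulation, `pearson(pseudobulk_lfc(cells_a, cells_b), true_lfc)` gives the correlation quoted above; the same `pearson` call can be applied to the model's lfc_median values.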
Does the second plot look better with the older scvi-tools version? See the other post about the changes.
Could you try manually setting a very small pseudocount, such as 1e-10?
The Bayes factor plot likely looks fine for a two-way comparison (change or no change), as opposed to the new three-way comparison (up/unchanged/down). There might be something wrong in the computation there.
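To illustrate why the two plots used to look alike, here is a small sketch that assumes the two-way Bayes factor is the natural-log odds of the DE probability. The function name and the `eps` guard are assumptions for illustration, and the three-way computation is not described in this thread, so it is not sketched here.

```python
import math

def bayes_factor_from_prob(p_de, eps=1e-8):
    """Natural-log Bayes factor for 'DE' vs. 'not DE' given P(DE).

    Assumes the two-way definition log(p / (1 - p)); eps is an assumed
    guard that keeps the logarithm finite at p = 0 or p = 1.
    """
    return math.log(p_de + eps) - math.log(1.0 - p_de + eps)
```

Under this assumption the Bayes factor is a monotone transform of prob_de, so a bayes_factor vs. lfc_median plot should have the same shape as the prob_de plot with a stretched vertical axis.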
Changing the pseudocounts gives the same results. Setting test_mode='two' does reproduce the Bayes factor plots as before, but the LFC estimates are still off. Restricting the LFC comparison to genes expressed in >10% of cells increased the correlation to ~0.75, but I am still getting better LFC estimates with pseudobulk. I don't have the bottom plot from a previous version (I first tested this years ago), but the results were certainly more accurate. I'll keep an eye on the other post for the kwargs to reproduce previous versions.
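The >10% expression filter mentioned above can be sketched as follows. This is only an illustration under the assumption that "expressed" means a nonzero raw count; the function name and the input layout (a list of cells, each a list of per-gene counts) are hypothetical.

```python
def expressed_gene_mask(counts, min_fraction=0.10):
    """Return one boolean per gene: True if the gene has a nonzero count
    in strictly more than min_fraction of cells.

    counts: list of cells, each cell a list of raw counts per gene.
    """
    n_cells = len(counts)
    n_genes = len(counts[0])
    # Fraction of cells in which each gene is detected (count > 0).
    fractions = [
        sum(1 for cell in counts if cell[g] > 0) / n_cells
        for g in range(n_genes)
    ]
    return [f > min_fraction for f in fractions]
```

The mask can then be used to subset both the estimated and the true LFC vectors before computing the correlation.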
I found the issue: it has to do with total counts per cell type. If a gene does not have at least 10 counts in a given cell type, the LFC estimates are off:
I forgot to mention that this is on the log1p scale, so the model seems to require a fairly high number of counts per cell type for accurate LFC estimates. There also seems to be a high number of false positives detected. Pseudobulk does not suffer from either of these issues. Any suggestions or solutions from the developers here?
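A quick way to flag the affected genes is sketched below. It assumes the threshold of 10 is read as log1p(total counts per gene within the cell type), which corresponds to roughly 22,000 raw counts since expm1(10) ≈ 22025.5. The function name and input layout are hypothetical.

```python
import math

def low_count_genes(counts, cell_types, cell_type, min_log1p_total=10.0):
    """Indices of genes whose log1p(total counts) within one cell type
    falls below min_log1p_total.

    counts: list of cells, each cell a list of raw counts per gene.
    cell_types: one label per cell, aligned with counts.
    """
    # Keep only the cells that belong to the requested cell type.
    rows = [cell for cell, ct in zip(counts, cell_types) if ct == cell_type]
    n_genes = len(counts[0])
    # Total raw counts per gene across those cells.
    totals = [sum(cell[g] for cell in rows) for g in range(n_genes)]
    return [g for g, t in enumerate(totals) if math.log1p(t) < min_log1p_total]
```

Genes returned by this function for either cell type in a comparison are the ones whose LFC estimates would be treated with caution under this reading.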
Regarding the false positives, it seems that genes altered in one cell type are showing up as differentially expressed in other cell types in which they are not altered. This appears to be due to the normalization. Here are three example false-positive genes: the top plots show library-size-normalized, log1p-transformed values, and the bottom plots show normalized values extracted from the scVI model. The raw counts and all associated metadata are identical between batches for these genes in this cell type. These genes are altered in other cell types, and that appears to affect the normalization in this cell type. I have many examples of this.
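For reference, the library-size normalization plus log1p used for the top plots can be sketched like this. The target sum of 1e4 is an assumption (the post does not state one), and the function name is hypothetical.

```python
import math

def log1p_library_normalize(counts, target_sum=1e4):
    """Scale each cell to target_sum total counts, then apply log1p.

    counts: list of cells, each cell a list of raw counts per gene.
    target_sum is an assumed value, not one taken from the thread.
    """
    normalized = []
    for cell in counts:
        total = sum(cell)
        # Cells with no counts are left at zero to avoid dividing by zero.
        scale = target_sum / total if total > 0 else 0.0
        normalized.append([math.log1p(x * scale) for x in cell])
    return normalized
```

Because each cell is scaled using only its own counts, cells with identical raw counts get identical values regardless of batch or of what happens in other cell types. Values decoded from a model do not have that guarantee, which is consistent with the discrepancy described above.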