Issue with scVI VAE reconstruction: High sensitivity to library size and poor rank preservation

Hello :smiley: ,

I am running the scVI VAE with the goal of analyzing both the learned cell embeddings and the reconstructed gene expression (I have a MERFISH sample).

I want to verify whether the gene ranking is preserved within each reconstructed cell. To do so, I compared the reconstructed profile (R’) with the ground truth (R) for each cell, using the Spearman correlation to assess rank preservation, and several problems arose from that!
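A minimal sketch of this kind of rank comparison, assuming `R` (observed counts) and `R_hat` (reconstruction) are dense cells × genes arrays. The function name `rowwise_spearman` and the toy values are only illustrative, not part of scVI:

```python
import numpy as np
from scipy.stats import rankdata


def rowwise_spearman(observed, reconstructed):
    """Spearman correlation between matching rows of two (n_rows, n_cols) arrays."""
    observed = np.asarray(observed, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    # Average ranks, so ties (e.g. the many zeros in count data) are handled
    ranks_obs = rankdata(observed, axis=1)
    ranks_rec = rankdata(reconstructed, axis=1)
    ranks_obs -= ranks_obs.mean(axis=1, keepdims=True)
    ranks_rec -= ranks_rec.mean(axis=1, keepdims=True)
    num = (ranks_obs * ranks_rec).sum(axis=1)
    den = np.sqrt((ranks_obs ** 2).sum(axis=1) * (ranks_rec ** 2).sum(axis=1))
    # A constant row has no rank variance, so its correlation is undefined (NaN)
    with np.errstate(invalid="ignore", divide="ignore"):
        return np.where(den > 0, num / den, np.nan)


# Toy data (hypothetical): 3 cells x 4 genes
R = np.array([[0, 1, 5, 2], [3, 0, 0, 7], [1, 1, 1, 1]])
R_hat = np.array([[0.1, 0.9, 4.0, 2.5], [6.0, 0.2, 0.1, 2.0], [0.5, 0.4, 0.6, 0.3]])

per_cell = rowwise_spearman(R, R_hat)      # one value per cell
per_gene = rowwise_spearman(R.T, R_hat.T)  # one value per gene
```

Transposing the inputs gives the per-gene version of the same score, so the per-cell and per-gene plots can share one function.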

The first thing I noticed is a strong bias toward high expression and large library size.

The model appears extremely sensitive to library size and gene abundance: the higher a gene’s expression (and the higher the cell’s library size), the better the ranking is preserved (plots 1 & 2). Overall, the rank correlation between R’ and R is low.
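One way to gauge how much of this trend could come from count sampling alone is a toy simulation. This is a sketch under stated assumptions, not my actual data: the gene proportions `props` are drawn from an arbitrary Dirichlet distribution, counts are sampled as Poisson, and the "reconstruction" is the true mean itself, i.e. a perfect model.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_genes = 300
# Hypothetical true gene proportions for one cell type (sum to 1)
props = rng.dirichlet(np.full(n_genes, 0.3))


def mean_spearman_at_depth(library_size, n_cells=200):
    """Average Spearman between sampled counts and the true proportions."""
    rhos = []
    for _ in range(n_cells):
        # Observed counts: Poisson noise around the true mean
        counts = rng.poisson(library_size * props)
        if counts.min() == counts.max():
            continue  # constant vector: correlation undefined
        rho, _ = spearmanr(counts, props)
        rhos.append(rho)
    return float(np.mean(rhos))


# Average rank correlation at three hypothetical library sizes
results = {depth: mean_spearman_at_depth(depth) for depth in (50, 500, 5000)}
for depth, rho in results.items():
    print(f"library size {depth}: mean Spearman {rho:.2f}")
```

In this toy setting the average correlation increases with library size even though the mean is exactly right, because at low depth most genes are tied at zero and the remaining counts are dominated by sampling noise. It does not prove that the same mechanism explains my plots, but it gives a baseline to compare against.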

It feels like the model focuses only on a subset of high-signal genes (e.g., cell-type markers or pathway specific) to generate embeddings, while ignoring the signal of the majority of other genes.

I attempted to fix this by testing different loss functions (ZINB, NB, Poisson), but unexpectedly, all of them give practically the same Spearman correlation, around 0.38.

My questions are:

1) Is this poor rank preservation in the reconstruction normal/expected behavior for scVI/VAEs with all three losses (ZINB/NB/Poisson)? Or could it arise from my specific dataset, or is it more likely a hyperparameter/loss function tuning issue?

2) Is reconstruction quality indeed correlated with library size?

3) If I do care about ranking, should I discard the scVI VAE completely? My impression is that, for a VAE, pushing for a higher correlation would likely force the model to memorize noise rather than learn biology.

Thank you very much! I hope my description was clear and that I am not overlooking some crucial insight :sweat_smile:

Plot 1: per-gene correlation (R’ vs. R) as a function of raw expression.

Plot 2: per-cell correlation as a function of library size.