Issue with scVI VAE reconstruction: High sensitivity to library size and poor rank preservation

Hello :smiley: ,

I am running the scVI VAE on a MERFISH sample, with the goal of analyzing both the learned cell embeddings and the reconstructed gene expression.

I want to verify whether the gene ranking is preserved within each reconstructed cell. To do so, I compared the reconstructed profile (R’) with the ground truth (R) for each cell, using the Spearman correlation to assess rank preservation, and several problems came up!
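For reference, the per-cell comparison described above could be computed roughly as follows. This is only a minimal sketch: the function names (`average_ranks`, `spearman`, `per_cell_spearman`) are made up for illustration, and `observed` / `reconstructed` are assumed to be dense cells × genes arrays holding R and R’. Ties get their average rank, which matters for sparse counts with many zeros.

```python
import numpy as np


def average_ranks(values):
    """Rank a 1-D array (1 = smallest); tied values share their mean rank."""
    values = np.asarray(values, dtype=float).ravel()
    order = np.argsort(values, kind="stable")
    ordinal = np.empty(values.size, dtype=float)
    ordinal[order] = np.arange(1, values.size + 1)
    # Average the ordinal ranks within each group of equal values.
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    rank_sums = np.bincount(inverse, weights=ordinal)
    return rank_sums[inverse] / counts[inverse]


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    if rx.std() == 0 or ry.std() == 0:
        return float("nan")  # undefined when one vector is constant
    return float(np.corrcoef(rx, ry)[0, 1])


def per_cell_spearman(observed, reconstructed):
    """One Spearman value per cell (row), computed across genes (columns)."""
    observed = np.asarray(observed)
    reconstructed = np.asarray(reconstructed)
    return np.array([spearman(o, r) for o, r in zip(observed, reconstructed)])
```

Transposing both matrices before the call gives the per-gene version (one value per gene, computed across cells).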

The first thing I noticed is a strong bias toward high expression / large library size.

I observed that the model is extremely sensitive to library size and gene abundance: the higher a gene’s expression (and the higher the cell’s library size), the better the ranking is preserved (plots 1 and 2). Overall, the rank correlation between R’ and R is low.

It feels like the model focuses only on a subset of high-signal genes (e.g., cell-type markers or pathway specific) to generate embeddings, while ignoring the signal of the majority of other genes.

I attempted to fix this by testing different loss functions (ZINB, NB, Poisson), but unexpectedly all of them give practically the same Spearman correlation, around 0.38.

My questions are:

1 ) Is this poor rank preservation in my reconstruction normal/expected behavior for scVI/VAEs with all three losses (ZINB/NB/Poisson)? Or could the problem arise from my specific dataset, or is it more likely a hyperparameter/loss function tuning issue?

2 ) Is reconstruction quality indeed correlated with library size?

3 ) If I do care about ranking, should I discard the scVI VAE completely? I think that, for the VAE, pushing for a higher correlation will likely force the model to memorize noise rather than learn biology.

Thank you very much! I hope my description was clear and that I am not overlooking some crucial insight :sweat_smile:

(Plot 1: per-gene correlation (R’ vs. R) as a function of raw expression.

Plot 2: per-cell correlation as a function of library size.)

  1. What you see is largely expected behavior for scVI VAEs on noisy count data like MERFISH, especially given the metric you chose for the comparison.
    The reconstruction loss naturally gives more weight to highly expressed genes, so the model learns to reconstruct them better, while for lowly expressed genes the relative noise is large.
    You can, however, tune the KL weight in the loss function to favor reconstruction, but as noted, this is also partly inherent to your data (did you select HVGs before the analysis?). You could also try increasing the latent space size. And yes, the parameters can surely be tuned beyond the defaults.
    Either way, the model should still be good at batch integration, cell clustering, and noise removal, and you can check this with a UMAP, provided training converged (otherwise, yes, something else is wrong).
  2. Yes, by design.
  3. You can of course try tuning it first, or try other methods such as plain PCA. But if biological conservation matters to you, you should still use the VAE model.
    In addition to everything above, check your rankings with get_normalized_expression. It returns library-size-normalized reconstructions, which might show better rank preservation.
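
The get_normalized_expression check could look roughly like the sketch below. It assumes `model` is a trained scvi-tools model exposing `get_normalized_expression(library_size=..., return_numpy=...)` and that `counts` is a dense cells × genes array of raw counts, in the same cell and gene order the model was set up with; the helper names are hypothetical. Note that a per-cell scale factor cannot change the ranking of genes within a cell, so any effect of the normalization should be visible mainly per gene, across cells, which is what this computes.

```python
import numpy as np


def average_ranks(values):
    """Rank a 1-D array (1 = smallest); tied values share their mean rank."""
    values = np.asarray(values, dtype=float).ravel()
    order = np.argsort(values, kind="stable")
    ordinal = np.empty(values.size, dtype=float)
    ordinal[order] = np.arange(1, values.size + 1)
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    return np.bincount(inverse, weights=ordinal)[inverse] / counts[inverse]


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    if rx.std() == 0 or ry.std() == 0:
        return float("nan")  # undefined when one vector is constant
    return float(np.corrcoef(rx, ry)[0, 1])


def per_gene_spearman_normalized(model, counts):
    """Per-gene Spearman between normalized counts and normalized reconstruction."""
    counts = np.asarray(counts, dtype=float)
    # Observed counts rescaled so that each cell sums to 1 (empty cells stay 0).
    totals = counts.sum(axis=1, keepdims=True)
    observed = counts / np.clip(totals, 1.0, None)
    # Reconstruction on the same scale: expected frequencies per cell.
    reconstructed = np.asarray(
        model.get_normalized_expression(library_size=1, return_numpy=True)
    )
    n_genes = counts.shape[1]
    return np.array(
        [spearman(observed[:, g], reconstructed[:, g]) for g in range(n_genes)]
    )
```

Comparing these per-gene values with the ones obtained on raw counts would show how much of the library-size effect the normalization removes.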