Assessing scVI fit by gene


Again, thank you for the great package!
If I understand correctly, because the scVI model is a VAE, during fitting it reconstructs the expression data for each cell from the latent z, the cell's UMI count, and whatever batch variables are passed to the model.
How best to determine if the fit turned out to be good? Are there recommended cutoffs on the absolute error in reconstruction? Can the error be assessed on per-gene level?

Thank you


(Not part of scVI dev team)

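One way I have assessed this in the past is by looking at the mean-mean, variance-variance, and mean-variance relationships. For each gene, you can ask whether the reconstructed expression captures these relationships, i.e. whether the per-gene mean and variance of the reconstructed counts match those of the observed counts.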
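A minimal sketch of this check, assuming you already have two cells x genes count matrices: the observed counts and counts reconstructed by the model. The names `observed` and `reconstructed` and the simulated Poisson data are placeholders for illustration only, not part of the scVI API.

```python
import numpy as np


def per_gene_summary(counts):
    """Return the per-gene mean and variance of a cells x genes matrix."""
    counts = np.asarray(counts, dtype=float)
    return counts.mean(axis=0), counts.var(axis=0)


def log_correlation(a, b):
    """Pearson correlation between log1p-transformed per-gene statistics."""
    return float(np.corrcoef(np.log1p(a), np.log1p(b))[0, 1])


# Placeholder data: in practice `observed` is your raw count matrix and
# `reconstructed` holds counts generated by the fitted model.
rng = np.random.default_rng(0)
n_cells, n_genes = 500, 50
gene_means = rng.gamma(shape=2.0, scale=5.0, size=n_genes)
observed = rng.poisson(gene_means, size=(n_cells, n_genes))
reconstructed = rng.poisson(gene_means, size=(n_cells, n_genes))

obs_mean, obs_var = per_gene_summary(observed)
rec_mean, rec_var = per_gene_summary(reconstructed)

# Mean-mean and variance-variance agreement across genes.
mean_mean_r = log_correlation(obs_mean, rec_mean)
var_var_r = log_correlation(obs_var, rec_var)
print(f"mean-mean r = {mean_mean_r:.3f}, variance-variance r = {var_var_r:.3f}")
```

For the mean-variance relationship, plotting `obs_var` against `obs_mean` and `rec_var` against `rec_mean` on the same log-log axes shows whether the two curves overlap.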

You can also ask which genes are harder to reconstruct (have higher reconstruction error).
The reconstruction error itself carries a lot of information. For example, clustering on the per-gene, per-cell reconstruction error recovers the latent-space-based clustering (to an extent).
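As a rough illustration of ranking genes by reconstruction error, here is a sketch that uses the mean absolute error per gene. The choice of absolute error, the function name `per_gene_abs_error`, and the toy matrices are assumptions made for the example; any other error measure could be swapped in.

```python
import numpy as np


def per_gene_abs_error(observed, reconstructed):
    """Return (residuals, per-gene mean absolute error) for cells x genes inputs."""
    observed = np.asarray(observed, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)
    residuals = observed - reconstructed           # cells x genes
    return residuals, np.abs(residuals).mean(axis=0)


# Toy data: 3 cells x 3 genes (low, medium, and high expression).
observed = np.array([[0, 5, 100],
                     [1, 4, 120],
                     [0, 6, 90]])
reconstructed = np.array([[0.2, 5.0, 110.0],
                          [0.8, 5.0, 110.0],
                          [0.2, 5.0, 100.0]])

residuals, gene_error = per_gene_abs_error(observed, reconstructed)

# Gene indices sorted from hardest to easiest to reconstruct.
hardest_first = np.argsort(gene_error)[::-1]
print(gene_error, hardest_first)
```

The `residuals` matrix is the per-gene, per-cell error that could be passed to a clustering routine in place of the expression matrix.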

As a first contributor I am only allowed one image for this post, but I summarized these observations here in case you are interested.


Haha, I have your link bookmarked since you posted it. Finally it’s time to carefully read it : )
Thank you

I do endorse the answer by @saketkc. One caveat, though: because of the shape of the NB distribution, genes with higher means will be more difficult to reconstruct (by definition they have higher reconstruction error, or lower log likelihood). Another potential idea is to look at posterior dispersion indices ([1605.07604] Posterior Dispersion Indices), which might control for this effect somewhat better. I can post code for how to get these soon.
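To make the caveat about higher means concrete, here is a small sketch that evaluates the negative binomial log likelihood at an observation equal to the mean, for increasing means. It assumes the mean / inverse-dispersion parameterization (`mu`, `theta`); the value `theta = 2.0` and the function name `nb_log_likelihood` are arbitrary choices for illustration.

```python
import math


def nb_log_likelihood(x, mu, theta):
    """Log pmf of a negative binomial with mean mu and inverse dispersion theta."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


theta = 2.0                      # assumed inverse dispersion, shared across genes
means = [1, 10, 100]

# Best case for each gene: the observed count equals the predicted mean.
log_liks = [nb_log_likelihood(mu, mu, theta) for mu in means]
for mu, ll in zip(means, log_liks):
    print(f"mu = {mu:>3}: log likelihood at x = mu is {ll:.3f}")

# Sanity check that the pmf is normalized (the tail beyond 2000 is negligible here).
total_mass = sum(math.exp(nb_log_likelihood(x, 10, theta)) for x in range(2000))
```

Even with a perfect point prediction, the log likelihood drops as the mean grows, because the NB variance `mu + mu**2 / theta` grows with it. Raw per-gene log likelihoods are therefore not directly comparable across genes with very different expression levels.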