How to interpret the latent space in scVI

Hi scverse community,

From what I understand, the dimensionality of the latent space is fixed before training, i.e., when the model is created. After training, each cell can be represented in the latent space by an n-dimensional vector, and this latent space reproduces the statistical properties of the original data.

For this reason, it sounds a lot like a reduced-dimensionality representation such as PCA. So my questions are:

  • Can I assume that the latent dimensions (usually stored in X_scvi) are ordered from most to least explained variance?
  • And does it then make sense to run sc.pp.neighbors on, e.g., only the first 5 of 10 latent dimensions?
  • Or, for example, would it make more sense to just train a model with 5 latent dimensions in total?

Thanks to anyone who can help me understand!

Vittorio


Unlike PCA, scVI’s latent variables are not ordered in any meaningful way nor are they interpretable since the model is not a linear method. It can just be thought of as a low-dimensional, compressed representation that the model has learned when trying to optimize its objective, i.e., the ELBO, which includes the reconstruction loss (how well the model is able to recover the full gene counts from the compressed representation) and a KL prior term (for regularization).
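If it helps to make those two terms concrete, here is a minimal sketch of inspecting them on a trained model. It assumes a trained scvi.model.SCVI instance called model, and the history key names are an assumption that may differ between scvi-tools versions:

# training history of the objective and its components, recorded during model.train()
# (key names are an assumption; check model.history.keys() for your scvi-tools version)
elbo = model.history["elbo_train"]                     # full objective per epoch
recon = model.history["reconstruction_loss_train"]     # reconstruction term
kl = model.history["kl_local_train"]                   # KL regularization toward the prior

print(elbo.tail(), recon.tail(), kl.tail(), sep="\n")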

Regarding your second and third questions, I think it makes more sense to train the model with 5 latent dimensions instead of subsetting from a larger dimensionality, since the model effectively requires all latent variables to reconstruct gene counts. In other words, we don’t expect a subset of the latent variables to contain all the information. I wouldn’t be surprised, though, if subsetting leads to some fairly reasonable neighbor graph or UMAP.
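For illustration, a minimal sketch of that suggestion using standard scvi-tools and scanpy calls (adjust the setup_anndata arguments, e.g. layer or batch_key, to your data):

import scanpy as sc
import scvi

# assumes `adata` holds raw counts and has already been preprocessed as needed
scvi.model.SCVI.setup_anndata(adata)
model = scvi.model.SCVI(adata, n_latent=5)   # train directly with 5 latent dimensions
model.train()

# use the full 5-dimensional latent space downstream
adata.obsm["X_scVI"] = model.get_latent_representation(adata)
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)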


Thanks for the clear and complete answer!

Unlike PCA, scVI’s latent variables are not ordered in any meaningful way nor are they interpretable since the model is not a linear method.

I am not sure that is true. The KL divergence is a measure of information and different latent dimensions carry different amounts of information. You can compute dimension-wise KLD with the following

import numpy as np

# get the latent posterior parameters for your data
# (return_dist=True returns the per-cell posterior mean and variance for each latent dimension)
mu, var = model.get_latent_representation(adata=adata, return_dist=True)

# KL divergence to the standard-normal prior, summed over cells for each latent dimension:
# KL = -0.5 * (1 + log(var) - mu^2 - var)
kld_per_dimension = -0.5 * np.sum(
    1 + np.log(var) - mu**2 - var,
    axis=0,
)

# indices of the latent dimensions, sorted by descending KLD
kld_importance = np.argsort(kld_per_dimension)[::-1]

Sorting by descending KLD is equivalent to sorting by the information passing through the latent dimensions.
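To make that concrete, a hedged sketch of using the ranking downstream, continuing from the variables above (the obsm key X_scVI_top and the cutoff of 5 dimensions are illustrative choices, not an scvi-tools convention):

import scanpy as sc

# keep only the most informative dimensions, e.g. the top 5 by KLD
top_dims = kld_importance[:5]
latent = model.get_latent_representation(adata)     # cells x n_latent array
adata.obsm["X_scVI_top"] = latent[:, top_dims]

sc.pp.neighbors(adata, use_rep="X_scVI_top")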

A good overview of how KLDs can be used to sort latent dimensions can be found here: https://arxiv.org/pdf/1804.03599.pdf


Alternatively, you could use LDVAE (a linearly decoded VAE), which is more interpretable.
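For reference, a minimal sketch of that route, assuming scvi.model.LinearSCVI (the LDVAE implementation in scvi-tools) and its get_loadings method:

import scvi

scvi.model.LinearSCVI.setup_anndata(adata)
ldvae = scvi.model.LinearSCVI(adata, n_latent=10)
ldvae.train()

# the linear decoder gives per-gene loadings for each latent dimension,
# similar in spirit to PCA loadings
loadings = ldvae.get_loadings()
adata.obsm["X_ldvae"] = ldvae.get_latent_representation(adata)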


@gregjohnso Thanks for that information! This is very interesting.

@zvittorio Just to clarify, the output of scvi.model.SCVI.get_latent_representation is not ordered in a meaningful way in the sense that, by default, the columns of the output array themselves are not ordered, whether by KLD or some other measure. You could, of course, compute the KLD yourself and then rearrange these latent variables.

OK, I see. It makes sense that some dimensions of the latent representation are more “meaningful” than others, but it is good to know that they are not ordered in any way by default.
So, selecting a subset of them as one would in PCA does not make sense unless it is backed by some kind of KLD (or similar) ordering.
Thanks again to all of you for the insightful comments @martinkim0 @gregjohnso @davemcg