How to interpret the latent space in scVI

Hi scverse community,

From what I understand, the dimensionality of the latent space is fixed before training, i.e., when the model is created. After training, each cell can be represented in the latent space by an n-dimensional vector, and this latent space reproduces the statistical properties of the original data.

For this reason, it sounds a lot like a reduced-dimensionality representation such as PCA. So my questions are:

  • Can I assume that the latent dimensions (usually stored in X_scvi) are ordered from most to least explained variance?
  • And does it then make sense to run sc.pp.neighbors on, e.g., only the first 5 of 10 latent dimensions?
  • Or, for example, would it make more sense to just train a model with 5 latent dimensions in total?

Thanks to anyone who can help me understand!

Vittorio


Unlike PCA, scVI’s latent variables are not ordered in any meaningful way nor are they interpretable since the model is not a linear method. It can just be thought of as a low-dimensional, compressed representation that the model has learned when trying to optimize its objective, i.e., the ELBO, which includes the reconstruction loss (how well the model is able to recover the full gene counts from the compressed representation) and a KL prior term (for regularization).
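If it helps to make those two terms concrete, here is a minimal sketch of inspecting them on a trained model. It assumes a trained scvi.model.SCVI instance called model, and the history key names are an assumption that may differ between scvi-tools versions:

# training history of the objective and its components, recorded during model.train()
# (key names are an assumption; check model.history.keys() for your scvi-tools version)
elbo = model.history["elbo_train"]                     # full objective per epoch
recon = model.history["reconstruction_loss_train"]     # reconstruction term
kl = model.history["kl_local_train"]                   # KL regularization toward the prior

print(elbo.tail(), recon.tail(), kl.tail(), sep="\n")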

Regarding your second and third questions, I think it makes more sense to train the model with 5 latent dimensions instead of subsetting from a larger dimensionality, since the model effectively requires all latent variables to reconstruct gene counts. In other words, we don’t expect a subset of the latent variables to contain all the information. I wouldn’t be surprised, though, if subsetting leads to some fairly reasonable neighbor graph or UMAP.
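For illustration, a minimal sketch of that suggestion using standard scvi-tools and scanpy calls (adjust the setup_anndata arguments, e.g. layer or batch_key, to your data):

import scanpy as sc
import scvi

# assumes `adata` holds raw counts and has already been preprocessed as needed
scvi.model.SCVI.setup_anndata(adata)
model = scvi.model.SCVI(adata, n_latent=5)   # train directly with 5 latent dimensions
model.train()

# use the full 5-dimensional latent space downstream
adata.obsm["X_scVI"] = model.get_latent_representation(adata)
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)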


Thanks for the clear and complete answer!

Unlike PCA, scVI’s latent variables are not ordered in any meaningful way nor are they interpretable since the model is not a linear method.

I am not sure that is true. The KL divergence is a measure of information and different latent dimensions carry different amounts of information. You can compute dimension-wise KLD with the following

import numpy as np

# get the latent posterior parameters for your data
# (return_dist=True returns the per-cell posterior mean and variance for each latent dimension)
mu, var = model.get_latent_representation(adata=adata, return_dist=True)

# KL divergence to the standard-normal prior, summed over cells for each latent dimension:
# KL = -0.5 * (1 + log(var) - mu^2 - var)
kld_per_dimension = -0.5 * np.sum(
    1 + np.log(var) - mu**2 - var,
    axis=0,
)

# indices of the latent dimensions, sorted by descending KLD
kld_importance = np.argsort(kld_per_dimension)[::-1]

Sorting by descending KLD is equivalent to sorting by the information passing through the latent dimensions.
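To make that concrete, a hedged sketch of using the ranking downstream, continuing from the variables above (the obsm key X_scVI_top and the cutoff of 5 dimensions are illustrative choices, not an scvi-tools convention):

import scanpy as sc

# keep only the most informative dimensions, e.g. the top 5 by KLD
top_dims = kld_importance[:5]
latent = model.get_latent_representation(adata)     # cells x n_latent array
adata.obsm["X_scVI_top"] = latent[:, top_dims]

sc.pp.neighbors(adata, use_rep="X_scVI_top")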

A good overview of how KLDs can be used to sort latent dimensions can be found here: https://arxiv.org/pdf/1804.03599.pdf


Alternatively, you could use LDVAE (a linearly decoded VAE), which is more interpretable.
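For reference, a minimal sketch of that route, assuming scvi.model.LinearSCVI (the LDVAE implementation in scvi-tools) and its get_loadings method:

import scvi

scvi.model.LinearSCVI.setup_anndata(adata)
ldvae = scvi.model.LinearSCVI(adata, n_latent=10)
ldvae.train()

# the linear decoder gives per-gene loadings for each latent dimension,
# similar in spirit to PCA loadings
loadings = ldvae.get_loadings()
adata.obsm["X_ldvae"] = ldvae.get_latent_representation(adata)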


@gregjohnso Thanks for that information! This is very interesting.

@zvittorio Just to clarify, the output of scvi.model.SCVI.get_latent_representation is not ordered in a meaningful way in the sense that, by default, the columns of the output array themselves are not ordered, whether by KLD or some other measure. You could, of course, compute the KLD yourself and then rearrange these latent variables.

OK, I see. It makes sense that some dimensions of the latent representation are more “meaningful” than others, but it is good to know that they are not ordered in any way by default.
So, selecting a subset of them as one would in PCA does not make sense unless it is backed by some kind of KLD (or similar) ordering.
Thanks again to all of you for the insightful comments @martinkim0 @gregjohnso @davemcg