Identifying Population-Specific Loadings with LDVAE

Hi there!

Thank you all very much for your great work on all the packages! I have a question on using LDVAE to identify population/cluster-specific genes contributing to the variation in that group:

My understanding is that LDVAE (trained on all populations) gives us per-gene weights (one per dimension of the latent space) that can be used for interpretability. However, I am interested in identifying the top genes for each cluster/population present in the data. To extract the top loadings for each population, I currently train LDVAE on each group separately and follow the standard pipeline. Is there a better way to extract cluster-specific loadings (given the drawbacks of training LDVAE on each population separately)?

Thanks so much for your time and help!


The reason to use an LDVAE is to attempt to identify sets of co-expressed genes, where each latent dimension will correspond to one such set of genes. This way the activity of that set of genes is summarized along one axis. In this framing, I would make the assumption that a cluster/population is defined as an extreme on each axis (in particular if using the logistic latent space option).
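To make the "one latent dimension per gene set" idea concrete, here is a minimal sketch of reading the top genes off a loadings table. It assumes a genes x latent-dimensions pandas DataFrame, which is the shape I would expect from get_loadings() on a trained LDVAE model; the helper name top_genes_per_dimension, the column names Z_0 and Z_1, and the toy values are made up for illustration.

```python
import pandas as pd


def top_genes_per_dimension(loadings: pd.DataFrame, n_top: int = 3) -> dict:
    """Return the n_top genes with the largest positive weight on each latent axis."""
    return {
        dim: loadings[dim].sort_values(ascending=False).head(n_top).index.tolist()
        for dim in loadings.columns
    }


# Toy stand-in for a loadings table (rows: genes, columns: latent dimensions).
# With a trained model you would pass the real loadings table instead.
loadings = pd.DataFrame(
    {"Z_0": [2.0, 0.1, -1.5, 0.3], "Z_1": [-0.2, 1.8, 0.0, 1.1]},
    index=["geneA", "geneB", "geneC", "geneD"],
)

top = top_genes_per_dimension(loadings, n_top=2)
print(top)
```

Sorting in ascending order instead gives the genes at the opposite extreme of each axis, which can be just as informative.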

If you are interested in individual genes that are enriched in specific populations or clusters, the best way to get those would be by using the .differential_expression() method. See this part of the intro tutorial: "Introduction to scvi-tools".
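As a rough sketch of what that looks like in practice: calling model.differential_expression(groupby=...) with your cluster column returns one table with a row per gene and comparison, and you can then filter it per cluster. The helper below only does the filtering step. The column names comparison, lfc_mean and bayes_factor are what I would expect in that table, so please check them against your own output; the helper name, the thresholds and the toy values are purely illustrative.

```python
import pandas as pd


def markers_per_cluster(
    de_df: pd.DataFrame, n_top: int = 3, min_bayes_factor: float = 3.0
) -> dict:
    """Pick up-regulated genes for each 'cluster vs Rest' comparison in a DE table."""
    markers = {}
    for comparison, df in de_df.groupby("comparison", sort=False):
        # Keep genes that are higher in the cluster and have reasonable evidence.
        keep = df[(df["lfc_mean"] > 0) & (df["bayes_factor"] > min_bayes_factor)]
        keep = keep.sort_values("lfc_mean", ascending=False)
        markers[comparison] = keep.index.tolist()[:n_top]
    return markers


# Toy stand-in for the DE table (index: gene names).
# With a trained model this would come from something like
# de_df = model.differential_expression(groupby="cluster")
de_df = pd.DataFrame(
    {
        "comparison": ["T vs Rest"] * 3 + ["B vs Rest"] * 3,
        "lfc_mean": [2.5, -1.0, 0.2, -0.8, 3.1, 0.4],
        "bayes_factor": [5.0, 4.0, 0.5, 4.5, 6.0, 3.5],
    },
    index=["CD3E", "MS4A1", "LYZ", "CD3E", "MS4A1", "LYZ"],
)

markers = markers_per_cluster(de_df)
print(markers)
```

This works on the single model trained on all populations, so there is no need to retrain per group.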

The .differential_expression() method is agnostic to the structure of the model, so it works for a standard VAE, a cVAE (when there are batches), and LDVAE alike. The ‘interpretability’ aspect of LDVAE is that the axes of the latent representation vectors are directly tied to collections of genes. It turns out that, by using the statistical framework developed for the .differential_expression() method, you can get interpretability (or at least explainability) for arbitrary areas of the representation space, even if the decoding function is non-linear.

Now, if you want to learn which latent dimensions are associated with groups/clusters/treatments/etc., that would call for a different solution, but it doesn’t sound like this is what you are looking for?

Hope this helps!
