Embedding number for visualization

vgettaa · August 29, 2024, 2:13am

Recently I found that the standard deviation (SD) of some embedding scores after scVI integration was close to 0 (about 0.02; the scores are stored in seurat_object@reductions[["integrated.scvi"]]@cell.embeddings). This includes some of the top embeddings (scvi_1, scvi_4, scvi_9; the figure shows the embedding scores of cells belonging to different clusters). I wonder whether an embedding with an SD close to 0 is helpful for visualization. Why do the results turn out like this, and how many embeddings should I choose? The analysis was based on Seurat V5, using the IntegrateLayers function.

cane11 · August 30, 2024, 12:03am

Hi. Unfortunately, I really don’t get what you are doing here. Are those different runs of scVI compared with each other? What’s your expectation and why are you doing it? Sharing some code would also be helpful.

vgettaa · August 30, 2024, 9:25am

Hi, thanks for the reply. I used scVI to integrate 6 scRNA-seq samples. After integration, I used FindNeighbors, FindClusters, and RunUMAP for clustering and visualization. FindNeighbors and RunUMAP both take a parameter specifying which embedding dimensions to use. In PCA, the top PCs capture most of the variance in the data, and the PCs are ordered by decreasing variance (the first PC has the highest variance). But the SDs of the first, fourth, and ninth embedding scores output by scVI were close to 0 in the figure above, which suggested to me that these embeddings are unhelpful for clustering. So I am confused about the results and don't know how many embeddings I should choose for visualization.

cane11 · August 30, 2024, 2:43pm

I see. You should use all latent dimensions. The order of the latent dimensions does not reflect their importance. While axes with a higher estimated variance (qz_v) contain less information, filtering these out is unconventional for VAEs.

vgettaa · August 31, 2024, 7:04am

Maybe axes with higher estimated variance contain more information? I am not familiar with VAEs or the UMAP algorithm. Should I use all latent dimensions for clustering and visualization? (The integration function IntegrateLayers outputs 30 latent dimensions.) Thanks!

cane11 · August 31, 2024, 7:30am

I think with IntegrateLayers you don’t have access to the estimated variance of each cell, only to the estimated mean position in latent space (so what you are displaying here is the variance across these means). I was talking about the per-cell variance in a VAE.
I would strongly recommend not reducing the number of latent dimensions post hoc, and instead using all latent dimensions for downstream analysis.
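
To make the distinction concrete, here is a minimal sketch with made-up numbers. It is not scvi-tools or Seurat code: the names qz_m (per-cell posterior mean) and the values in mean_spread and cell_var are assumptions chosen for illustration, and only qz_v is a name used in this thread. It shows that the SD of the means across cells and the per-cell estimated variance are two separate quantities.

```python
import random
import statistics

random.seed(0)

n_cells, n_dims = 200, 3

# Hypothetical encoder outputs (illustration only, not real scVI output).
# Dimension 0: means spread widely across cells, small per-cell variance.
# Dimension 1: means nearly constant across cells, small per-cell variance.
# Dimension 2: means nearly constant across cells, large per-cell variance.
mean_spread = [1.0, 0.02, 0.02]
cell_var = [0.05, 0.05, 0.95]

# qz_m[i][d]: posterior mean of cell i in latent dimension d.
qz_m = [[random.gauss(0.0, mean_spread[d]) for d in range(n_dims)]
        for _ in range(n_cells)]
# qz_v[i][d]: posterior variance of cell i in latent dimension d.
qz_v = [[cell_var[d] for d in range(n_dims)] for _ in range(n_cells)]

# SD of the means across cells: what you get from the columns of a
# stored embedding matrix that only contains mean positions.
sd_across_cells = [statistics.stdev([row[d] for row in qz_m])
                   for d in range(n_dims)]

# Average per-cell variance: the quantity referred to as qz_v above.
mean_per_cell_var = [statistics.fmean([row[d] for row in qz_v])
                     for d in range(n_dims)]

for d in range(n_dims):
    print(f"dim {d}: SD across cells = {sd_across_cells[d]:.3f}, "
          f"mean per-cell variance = {mean_per_cell_var[d]:.3f}")
```

In this toy setup dimensions 1 and 2 have the same low SD across cells but very different per-cell variances, so one cannot be read off from the other.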

vgettaa · August 31, 2024, 10:33am

Thanks for the reply! Actually, I don't know how VAEs relate to latent dimensions. Let me express this more clearly. In my understanding, the output of IntegrateLayers contains an n x 30 matrix, where n is the number of cells and 30 is the number of latent dimensions. The matrix gives the position of each cell in a 30-dimensional space, and it is difficult to distinguish cells along dimensions of low variance. I had thought these 30 latent dimensions were similar to principal components. But if the importance of a latent dimension is unrelated to its variance, then I agree all latent dimensions should be used for downstream analysis.

cane11 · September 9, 2024, 8:18pm

Yes, use all latent dimensions.

vgettaa · September 22, 2024, 11:27am

Hi. I tried using 21 dims to plot the UMAP, removing the dims with low SD (in the figure above: integratedscvi_29, integratedscvi_10, integratedscvi_27, integratedscvi_1, integratedscvi_11, integratedscvi_4, integratedscvi_9, integratedscvi_15, integratedscvi_18). Compared to the UMAP plotted with all 30 dims, the two plots look similar. So, for the UMAP, maybe I should rank the dims by SD and choose the high ones?

cane11 · September 22, 2024, 4:18pm

The ones with low variance might be important for low-abundance cell types. I would strongly recommend against subsetting the latent dimensions based on some criterion. Standard VAEs are not designed to have a sparse set of important latent dimensions.
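
As a rough illustration of the first point, here is a toy sketch with synthetic data. It is not real scVI output: the cell counts, the offset of 0.5, and the threshold of 0.25 are assumptions picked for the example. A dimension can have a low overall SD and still be the only one that separates a rare population.

```python
import random
import statistics

random.seed(1)

n_common, n_rare = 990, 10

# Synthetic two-dimensional "latent space" (illustration only).
# dim 0: high variance, but identical distribution for both populations.
# dim 1: near 0 for common cells, shifted to about 0.5 for the rare cells.
cells = []
for i in range(n_common + n_rare):
    is_rare = i >= n_common
    dim0 = random.gauss(0.0, 1.0)
    dim1 = random.gauss(0.5 if is_rare else 0.0, 0.01)
    cells.append((dim0, dim1, is_rare))

sd_dim0 = statistics.stdev([c[0] for c in cells])
sd_dim1 = statistics.stdev([c[1] for c in cells])
print(f"SD dim 0 = {sd_dim0:.3f}, SD dim 1 = {sd_dim1:.3f}")

# A simple cut on the low-SD dimension recovers the rare population.
true_rare = {i for i, c in enumerate(cells) if c[2]}
recovered = {i for i, c in enumerate(cells) if c[1] > 0.25}
print("rare cells recovered from dim 1:", recovered == true_rare)
```

Ranking by SD would drop dim 1 here, and with it the only signal that distinguishes the rare cells. A UMAP that looks similar overall would not necessarily reveal that loss.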

vgettaa · September 23, 2024, 5:04am

Thanks! I think I should take some time to learn about VAEs.
