Recently I found that the standard deviation (SD) of some embedding scores after scVI integration was close to 0 (about 0.02; the scores are stored in seurat_object@reductions[["integrated.scvi"]]@cell.embeddings). This includes some of the top embeddings (scvi_1, scvi_4, scvi_9), as shown in the following figure of embedding scores for cells belonging to different clusters. I wonder whether an embedding with an SD close to 0 is helpful for visualization. Why do the results turn out like this, and how many embeddings should I choose? The analysis was based on Seurat v5 and the IntegrateLayers function.
Hi. Unfortunately, I really don’t get what you are doing here. Are those different runs of scVI compared with each other? What’s your expectation and why are you doing it? Sharing some code would additionally be helpful.
Hi, thanks for the reply. I used scVI to integrate 6 scRNA-seq samples. After integration, I used FindNeighbors, FindClusters, and RunUMAP for clustering and visualization. FindNeighbors and RunUMAP both take a parameter that specifies which embeddings to use. With PCA, the top PCs capture most of the variance in the data, and the PCs are ordered by decreasing variance (the first PC has the highest variance). But in the figure above, the SDs of the first, fourth, and ninth scVI embeddings were close to 0, which suggested to me that these embeddings are unhelpful for clustering. So I am confused by the results and don't know how many embeddings I should choose for visualization.
I see. You should use all latent dimensions. The order of the latent dimensions does not indicate importance. While axes with a higher estimated variance (qz_v) contain less information, filtering these out is unconventional for VAEs.
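To illustrate the point about ordering, here is a minimal sketch on simulated data. The matrix `latent` is a hypothetical stand-in for an n x 30 embedding, not real scVI output. It checks that shuffling the latent dimensions leaves all cell-to-cell Euclidean distances unchanged, so anything built on those distances should not depend on the column order, whereas PCA scores come out sorted by variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for an n_cells x n_latent embedding (simulated).
n_cells, n_latent = 200, 30
scales = rng.uniform(0.02, 1.0, size=n_latent)
latent = rng.normal(size=(n_cells, n_latent)) * scales


def pairwise_sq_dists(x):
    # Squared Euclidean distances between all pairs of rows.
    sq = (x ** 2).sum(axis=1)
    return sq[:, None] + sq[None, :] - 2.0 * x @ x.T


# Shuffle the order of the latent dimensions.
perm = rng.permutation(n_latent)
shuffled = latent[:, perm]

# Distances between cells do not depend on the column order.
same_distances = bool(
    np.allclose(pairwise_sq_dists(latent), pairwise_sq_dists(shuffled))
)

# PCA scores, by contrast, are ordered by decreasing variance.
centered = latent - latent.mean(axis=0)
_, singular_values, _ = np.linalg.svd(centered, full_matrices=False)
pc_variances = singular_values ** 2 / (n_cells - 1)
pcs_sorted = bool(np.all(np.diff(pc_variances) <= 0))

print(same_distances, pcs_sorted)
```

So "top" only has a meaning for PCs; for the latent dimensions the index is just a label.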
Maybe axes with higher estimated variance contain more information? I am not familiar with the VAE and UMAP algorithms. Should I use all latent dimensions for clustering and visualization? (The integration function IntegrateLayers outputs 30 latent dimensions.) Thanks!
I think with IntegrateLayers you don't have access to the estimated variance of each cell, only to the estimated mean position in latent space (so what you are displaying here is the variance across these means). I was speaking about the per-cell variance in a VAE.
I would strongly recommend not reducing the number of latent dimensions post hoc, and instead using all latent dimensions for downstream analysis.
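To make the two kinds of variance concrete, here is a small sketch with simulated arrays. The names `qz_m` and `qz_v` follow the notation used above for the per-cell posterior mean and variance; the values are made up, and it is my assumption (not checked here) that the scvi-tools Python API can return both via `get_latent_representation(return_dist=True)`.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_latent = 500, 30

# Hypothetical posterior parameters, simulated here (not real scVI output).
# qz_m: posterior mean of each cell, i.e. its position in latent space.
# qz_v: posterior variance of each cell along each latent dimension.
qz_m = rng.normal(size=(n_cells, n_latent)) * rng.uniform(0.02, 1.0, size=n_latent)
qz_v = rng.uniform(0.05, 0.5, size=(n_cells, n_latent))

# Quantity 1: spread of the posterior means across cells.
# This is what an SD computed on the cell embeddings measures.
sd_across_cells = qz_m.std(axis=0, ddof=1)

# Quantity 2: per-cell posterior variance, averaged over cells for a summary.
# This needs qz_v and cannot be recovered from the means alone.
mean_posterior_var = qz_v.mean(axis=0)

print(sd_across_cells.shape, mean_posterior_var.shape)
```

Both summaries have one value per latent dimension, but they are computed from different arrays and answer different questions.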
Thanks for the reply! Actually, I don't know how the VAE relates to the latent dimensions. Let me express this more clearly. In my understanding, the output of IntegrateLayers contains an n x 30 matrix, where n is the number of cells and 30 is the number of latent dimensions. The matrix gives the position of each cell in a 30-dimensional space, and it is difficult to distinguish cells along dimensions with low variance. I thought these 30 latent dimensions were similar to principal components. So if the importance of a latent dimension is unrelated to its variance, then I think all latent dimensions should be used for downstream analysis.
Yes, use all latent dimensions.
Hi. I tried plotting the UMAP with 21 dims, removing the dims with low SD (in the figure above: integratedscvi_29, integratedscvi_10, integratedscvi_27, integratedscvi_1, integratedscvi_11, integratedscvi_4, integratedscvi_9, integratedscvi_15, integratedscvi_18). Compared with the UMAP plotted with all 30 dims, the two plots look similar. So, for the UMAP, maybe I should rank the dims by SD and choose the high ones?
The ones with low variance might be important for low-abundance cell types. I would strongly recommend against subsetting the latent dimensions based on some criterion. Standard VAEs are not designed to have a sparse set of important latent dimensions.
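A toy example of how this can happen, on simulated data only (all sizes, shifts, and the threshold are made-up values for illustration): dimension 0 has by far the lowest SD across all cells, because only 1% of the cells are shifted along it, yet it is the one dimension that identifies that rare population.

```python
import numpy as np

rng = np.random.default_rng(2)
n_common, n_rare, n_latent = 1980, 20, 5
n_cells = n_common + n_rare

# Simulated latent space (hypothetical, not real scVI output).
latent = rng.normal(size=(n_cells, n_latent))
is_rare = np.zeros(n_cells, dtype=bool)
is_rare[-n_rare:] = True

# Dimension 0: tight around 0 for all cells, shifted for the rare population.
latent[:, 0] = rng.normal(0.0, 0.01, size=n_cells)
latent[is_rare, 0] += 0.3

# SD across all cells, per dimension: dimension 0 looks "uninformative".
sd_per_dim = latent.std(axis=0, ddof=1)
lowest_sd_dim = int(np.argmin(sd_per_dim))

# A simple threshold on that same dimension recovers the rare cells.
predicted_rare = latent[:, lowest_sd_dim] > 0.15
accuracy = float((predicted_rare == is_rare).mean())

print(lowest_sd_dim, sd_per_dim.round(3), accuracy)
```

Ranking by SD would drop dimension 0 first here, and with it the only signal for the rare cells. A UMAP that looks similar overall can still hide that kind of loss.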
Thanks! I think I should take some time to learn about VAEs.