@Justin_Hong Thanks so much for the MrVI tool and preprint. Really enjoyed reading it.
I am just wondering whether the input should be an anndata object of highly variable genes or just the whole cellxgene matrix? I presume HVG selection beforehand will bias the analysis depending on what you use as the batch key for HVG selection?
Also, any help or further documentation on interpretation of the output would be fab. As I understood it, the distance metrics can compare across samples, clusters or other condition variables, but it wasn’t clear whether I can account for sample variation and disease state at the same time?
Thanks for any insight you can offer.
Hi @Nusob888,
I’m glad you like MrVI, and I hope you find it useful. In our experience, using highly variable genes has helped reduce noise in the latent space. We generally used 2000 genes (somewhat arbitrarily). Whether or not you use batch_key will also matter, as you suggested, though we did not look closely at its effect for the MrVI paper.
Apologies for the sparse documentation. We will have better documentation in the future, but in the meantime feel free to ask questions about the model on Discourse.
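To make the gene-selection idea concrete, here is a minimal NumPy sketch that keeps the most variable genes by a simple dispersion score (variance divided by mean). This is a simplified stand-in, not the procedure used for the MrVI paper and not scanpy's algorithm; in practice `scanpy.pp.highly_variable_genes` with `n_top_genes=2000` (and optionally `batch_key`) would be the usual route. The function name `top_dispersion_genes` and the toy counts are made up for illustration.

```python
import numpy as np

def top_dispersion_genes(counts, n_top):
    """Return column indices of the n_top genes with the highest variance/mean ratio."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=0)
    var = counts.var(axis=0)
    # Dispersion = variance / mean; genes with zero mean get a dispersion of 0.
    dispersion = np.divide(var, mean, out=np.zeros_like(var), where=mean > 0)
    # Take the n_top most dispersed genes and return them in ascending index order.
    return np.sort(np.argsort(dispersion)[::-1][:n_top])

# Toy cells x genes count matrix (hypothetical values): 4 cells, 4 genes.
toy_counts = np.array([
    [1, 0, 0, 2],
    [1, 0, 0, 4],
    [1, 0, 0, 2],
    [1, 8, 0, 4],
])
selected = top_dispersion_genes(toy_counts, n_top=2)
print(selected)
```

With real data you would pass 2000 for `n_top` and subset the anndata object to the selected genes before training.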
To answer your question, the current model does not incorporate sample metadata (e.g. disease) during training. Rather, the model is only given categorical sample IDs with no groupings of samples otherwise. Then, the distance matrices can be evaluated based on groupings to understand if there is a correlation (i.e. same group → smaller distances). This is how we conducted a “guided” analysis in the preprint. Hope this answers your question!
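As a rough illustration of this kind of guided analysis, the sketch below takes a sample-by-sample distance matrix and a grouping of the samples (e.g. disease status) and compares the mean within-group and between-group distances. The matrix values, the group labels and the function name `within_between_means` are assumptions made up for illustration; they are not part of the MrVI API.

```python
import numpy as np

def within_between_means(dist, groups):
    """Mean distance between sample pairs in the same group vs. in different groups."""
    dist = np.asarray(dist, dtype=float)
    groups = np.asarray(groups)
    # Use each unordered pair of distinct samples once (upper triangle, no diagonal).
    i, j = np.triu_indices(dist.shape[0], k=1)
    same = groups[i] == groups[j]
    return dist[i, j][same].mean(), dist[i, j][~same].mean()

# Toy 4-sample distance matrix (hypothetical values).
toy_dist = np.array([
    [0.0, 0.2, 0.9, 1.0],
    [0.2, 0.0, 0.8, 0.9],
    [0.9, 0.8, 0.0, 0.3],
    [1.0, 0.9, 0.3, 0.0],
])
toy_groups = ["healthy", "healthy", "disease", "disease"]
within, between = within_between_means(toy_dist, toy_groups)
print(f"within-group mean: {within:.2f}, between-group mean: {between:.2f}")
```

If the grouping matters for the cell states in question, the within-group mean should come out clearly smaller than the between-group mean.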
Hi Justin,
Thank you for the clarification.
So am I right in thinking that the envisioned workflow would be to integrate the data with either another method (e.g. scVI) or MrVI, cluster and annotate the cell types as metadata, and then look at the distances as a guided analysis of cell type composition?
Similarly, if one were interested in other sources of grouping such as by transcriptomic perturbations, one could create meta groupings of samples from the distances and then perform differential expression analysis thereafter across cell types of interest?
Yes, cell types can be annotated via another method or MrVI in the u latent space. The distances do not provide a guided analysis of cell type composition, since the model does not account for differences in sample abundance, just the sample-specific cell states.
MrVI would be great for grouping transcriptomic perturbations or samples via the distances. Subsequently, you could take the grouped samples, and do DE analysis across the groups for different cell types of interest.
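One possible way to group samples from a distance matrix is hierarchical clustering. The sketch below uses SciPy's `linkage` and `fcluster` on a made-up sample-by-sample matrix; the choice of average linkage and of two groups is an assumption for illustration, not something the model prescribes.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Hypothetical sample-by-sample distance matrix (symmetric, zero diagonal).
sample_names = ["s1", "s2", "s3", "s4"]
sample_dist = np.array([
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
])

# linkage expects the condensed (upper-triangle) form of the matrix.
tree = linkage(squareform(sample_dist), method="average")
# Cut the tree into two groups; the number of groups is the analyst's choice.
group_ids = fcluster(tree, t=2, criterion="maxclust")
sample_groups = dict(zip(sample_names, group_ids.tolist()))
print(sample_groups)
```

The resulting group labels could then be added to the sample metadata and used as the condition for DE analysis within each cell type of interest.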
That's great, thanks Justin.
I have performed a test run. However, due to the sparse documentation, it is difficult for me to gauge how best to plot/cluster the data. Do you have any idea how far off a guided tutorial is?
For now, the best thing to do is to plot both the u and z latent representations to get an understanding of how the data integrates, then look at the average distance matrix in different clusters of your data. We will likely not have a guided tutorial for this version of MrVI in the near future since we are working hard on the next version of it. We will have more documentation for the next version!
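The averaging step could look roughly like the sketch below. It assumes the per-cell distances are already available as an array of shape (n_cells, n_samples, n_samples) and that each cell has a cluster label; the function name `mean_distance_per_cluster` and the toy values are made up for illustration.

```python
import numpy as np

def mean_distance_per_cluster(cell_dists, cluster_labels):
    """Average the per-cell (n_samples x n_samples) distance matrices within each cluster."""
    cell_dists = np.asarray(cell_dists, dtype=float)
    cluster_labels = np.asarray(cluster_labels)
    return {
        label: cell_dists[cluster_labels == label].mean(axis=0)
        for label in np.unique(cluster_labels).tolist()
    }

# Toy input: 3 cells, 2 samples (hypothetical values).
toy_cell_dists = np.array([
    [[0.0, 1.0], [1.0, 0.0]],
    [[0.0, 3.0], [3.0, 0.0]],
    [[0.0, 5.0], [5.0, 0.0]],
])
toy_clusters = ["T cell", "T cell", "B cell"]
avg = mean_distance_per_cluster(toy_cell_dists, toy_clusters)
print(avg["T cell"])
```

Each averaged matrix can then be plotted as a heatmap to see which samples differ within that cluster.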
Hi Justin,
Again, thanks for this great tool. Along the lines of this conversation, I need a bit of clarity regarding the per-cell sample-by-sample distances. For each cell I get a square sample-by-sample matrix. How can I obtain the row and column names for each matrix, given that the diagonal of each matrix is 0 irrespective of which sample/subject the cell originates from? Sorry for the bother. Thanks.
Sorry for the super late response! This is something we will add with the next version of this package, but in the meantime you can get it like this:
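# sample_key is the name of the obs column holding the sample IDs.
# Keep one row per sample, then order the samples by the "_scvi_sample" column.
sample_order = adata.obs.loc[
    lambda x: ~x[sample_key].duplicated(keep="first")
].sort_values("_scvi_sample")[sample_key].values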
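For anyone who wants to try this without a trained model, here is a small self-contained sketch that runs the same expression on a toy `obs` table and then uses the result to label one cell's distance matrix. The toy table, the donor names, the `_scvi_sample` codes and the distance values are all made up for illustration; with real data, `obs` would be `adata.obs`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for adata.obs: one row per cell (hypothetical values).
sample_key = "sample"
obs = pd.DataFrame({
    "sample": ["donor_B", "donor_A", "donor_B", "donor_C"],
    "_scvi_sample": [1, 0, 1, 2],
})

# Same expression as above: one row per sample, ordered by "_scvi_sample".
sample_order = obs.loc[
    lambda x: ~x[sample_key].duplicated(keep="first")
].sort_values("_scvi_sample")[sample_key].values

# One cell's sample-by-sample distance matrix (made-up numbers).
cell_dist = np.array([
    [0.0, 0.4, 0.7],
    [0.4, 0.0, 0.5],
    [0.7, 0.5, 0.0],
])
labelled = pd.DataFrame(cell_dist, index=sample_order, columns=sample_order)
print(labelled)
```

The same row and column labels apply to every cell's matrix, since the sample order does not depend on the cell.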