MrVI input and interpretation

@Justin_Hong Thanks so much for the MrVI tool and preprint. Really enjoyed reading it.

I am just wondering whether the input should be an AnnData object restricted to highly variable genes or the whole cell x gene matrix? I presume HVG selection beforehand will bias the analysis depending on what you use as the batch key for HVG selection?

Also, any help or further documentation on interpreting the output would be fab. As I understood it, the distance metrics can compare across samples, clusters or other condition variables, but it wasn’t clear whether I can account for sample variation and differences across disease states at the same time?

Thanks for any insight you can offer.

Hi @Nusob888,

I’m glad you like MrVI, and I hope you find it useful. In our experience, using highly variable genes has been helpful to reduce noise in the latent space. We generally used 2000 genes (somewhat arbitrarily). Whether or not you use batch_key will also matter, as you suggested, though we did not look into the effect of this closely for the MrVI paper.

Apologies for the sparse documentation. We will have better documentation in the future, but in the meantime feel free to ask questions about the model on Discourse.

To answer your question, the current model does not incorporate sample metadata (e.g. disease) during training. Rather, the model is only given categorical sample IDs with no groupings of samples otherwise. Then, the distance matrices can be evaluated based on groupings to understand if there is a correlation (i.e. same group → smaller distances). This is how we conducted a “guided” analysis in the preprint. Hope this answers your question!
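A minimal sketch of the "same group → smaller distances" check described above, assuming you already have one sample x sample distance matrix (for example, averaged over cells) and one metadata label per sample. The function name `within_between_means`, the labels, and the toy numbers are illustrative assumptions, not part of MrVI.

```python
import numpy as np


def within_between_means(dist, groups):
    """Mean distance over same-group pairs and over different-group pairs."""
    dist = np.asarray(dist, dtype=float)
    groups = np.asarray(groups)
    # same[i, j] is True when samples i and j share a metadata label
    same = groups[:, None] == groups[None, :]
    # Count each unordered pair once and skip the zero diagonal
    upper = np.triu(np.ones_like(same, dtype=bool), k=1)
    return dist[same & upper].mean(), dist[~same & upper].mean()


# Toy example: 4 samples with hypothetical disease labels
groups = ["healthy", "healthy", "disease", "disease"]
dist = np.array(
    [
        [0.0, 0.2, 1.0, 1.1],
        [0.2, 0.0, 0.9, 1.0],
        [1.0, 0.9, 0.0, 0.3],
        [1.1, 1.0, 0.3, 0.0],
    ]
)
within, between = within_between_means(dist, groups)
```

If `within` is clearly smaller than `between`, the metadata grouping is associated with the distances. This is only a descriptive summary, not a statistical test.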


Hi Justin,

Thank you for the clarification.

So am I right in thinking that the envisioned workflow might be to integrate the data with either another method (e.g. scVI) or MrVI, cluster and annotate the cell types as metadata, and then proceed to look at distances as a guided analysis of cell type composition?

Similarly, if one were interested in other sources of grouping such as by transcriptomic perturbations, one could create meta groupings of samples from the distances and then perform differential expression analysis thereafter across cell types of interest?

Yes, cell types can be annotated via another method or MrVI in the u latent space. The distances do not provide a guided analysis of cell type composition, since the model does not account for differences in sample abundance, just the sample-specific cell states.

MrVI would be great for grouping transcriptomic perturbations or samples via the distances. Subsequently, you could take the grouped samples, and do DE analysis across the groups for different cell types of interest.
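One possible way to turn a sample x sample distance matrix into sample groups is hierarchical clustering. The sketch below uses SciPy average linkage on a precomputed matrix; the function name `group_samples`, the choice of linkage, and the number of groups are assumptions for illustration, not the method used in the preprint.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def group_samples(dist, n_groups):
    """Cut an average-linkage tree built from precomputed distances."""
    dist = np.asarray(dist, dtype=float)
    # Enforce exact symmetry and a zero diagonal before condensing
    dist = (dist + dist.T) / 2.0
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")
    # One integer group label per sample, in the order of the matrix rows
    return fcluster(tree, t=n_groups, criterion="maxclust")


# Toy example: samples 0 and 1 are close, samples 2 and 3 are close
dist = np.array(
    [
        [0.0, 0.2, 1.0, 1.1],
        [0.2, 0.0, 0.9, 1.0],
        [1.0, 0.9, 0.0, 0.3],
        [1.1, 1.0, 0.3, 0.0],
    ]
)
labels = group_samples(dist, n_groups=2)
```

The resulting labels could then be stored as a per-sample column and used as the grouping variable for a DE analysis within each cell type of interest.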

That's great, thanks Justin.

I have performed a test run. However, due to the sparse documentation, it is difficult for me to gauge how best to plot/cluster the data. Do you have an idea of how far off a guided tutorial is?


For now, the best thing to do is to plot both the u and z latent representations to get an understanding of how the data integrates, then to look at the average distance matrix in different clusters of your data. We will likely not have a guided tutorial for this version of MrVI in the near future since we are working hard on the next version of it. We will have more documentation for the next version!

Hi Justin,
Thanks again for this great tool. Along the lines of this conversation, I just need a bit of clarity regarding the cell x sample x sample distances. For each cell I get a square sample x sample matrix. I was wondering how to obtain the row and column names for each matrix, given that the diagonal of each matrix is 0 irrespective of which sample/subject the cell originates from. Sorry for the bother. Thanks

Sorry for the super late response! This is something we will add with the next version of this package, but in the meantime you can get it like this:

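# sample_key: name of the adata.obs column holding your sample IDs
# Keep one row per sample, then order samples by their internal "_scvi_sample" code
sample_order = (
    adata.obs.loc[lambda x: ~x[sample_key].duplicated(keep="first")]
    .sort_values("_scvi_sample")[sample_key]
    .values
)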
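To show how this ordering can be used for row and column names, here is a small self-contained sketch on a toy `obs` table. The column name `donor`, the sample IDs, and the all-zero placeholder distances are made up for illustration. It also assumes, as the snippet above implies, that `_scvi_sample` holds the integer code that indexes the sample axes of the distance array.

```python
import numpy as np
import pandas as pd

sample_key = "donor"  # hypothetical obs column holding the sample IDs

# Toy stand-in for adata.obs: 5 cells from 3 samples
obs = pd.DataFrame(
    {
        sample_key: ["s_b", "s_a", "s_b", "s_c", "s_a"],
        "_scvi_sample": [1, 0, 1, 2, 0],
    }
)

# Same logic as above: one row per sample, sorted by the internal code
sample_order = (
    obs.loc[lambda x: ~x[sample_key].duplicated(keep="first")]
    .sort_values("_scvi_sample")[sample_key]
    .values
)

# Placeholder for the (n_cells, n_samples, n_samples) distance array
dists = np.zeros((len(obs), len(sample_order), len(sample_order)))

# Label the sample x sample matrix of a single cell
cell_idx = 0
labelled = pd.DataFrame(dists[cell_idx], index=sample_order, columns=sample_order)
```

The same `sample_order` applies to every cell's matrix, because the sample axes do not depend on which sample a cell came from.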

@Justin_Hong thanks for this super useful package.
You mentioned elsewhere that there will be a new version of MrVI including tutorials. Is there any news on that?

Maybe in the meantime, could you more specifically tell us how you would visualize the resulting cell_sample_sample_distances (and/or cell_sample_sample_representations?) So how can they be used to create a heatmap or some other form of visualization to understand how the samples group together?

Thank you !!

@mihem thanks for your interest in MrVI. We’re hoping to get our work out in the coming months, along with all the code.

For the cell_sample_sample_distances, we typically use a seaborn clustermap to organize the samples based on the MrVI distances. We also usually average the distance matrices over cells known to belong to some homogeneous population (e.g. based on cell type labels). We now also have some ways of arriving at these groups from the distances themselves, by grouping cells with similar distance matrices. After performing this, it’s a good first step to look at the metadata of the samples within the groups. It can tell you which metadata have the strongest correlation with the MrVI distances.
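The post does not say how cells with similar distance matrices are grouped, so the following is only one plausible sketch: flatten the upper triangle of each cell's matrix into a feature vector and cluster the cells with Ward linkage. The function name, the linkage method, and the toy data are assumptions and are not taken from MrVI.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage


def cluster_cells_by_distance_pattern(dists, n_groups):
    """Group cells whose sample x sample distance matrices look alike."""
    dists = np.asarray(dists, dtype=float)
    n_samples = dists.shape[1]
    # Each cell becomes a vector of its unique pairwise sample distances
    rows, cols = np.triu_indices(n_samples, k=1)
    features = dists[:, rows, cols]  # shape (n_cells, n_pairs)
    tree = linkage(features, method="ward")
    return fcluster(tree, t=n_groups, criterion="maxclust")


# Toy data: 10 cells, 3 samples, two distinct distance patterns plus small noise
rng = np.random.default_rng(0)
pattern_a = np.array([[0.0, 1.0, 1.0], [1.0, 0.0, 0.1], [1.0, 0.1, 0.0]])
pattern_b = np.array([[0.0, 0.1, 1.0], [0.1, 0.0, 1.0], [1.0, 1.0, 0.0]])
dists = np.stack([pattern_a] * 5 + [pattern_b] * 5)
dists = dists + rng.normal(scale=0.01, size=dists.shape)

cell_groups = cluster_cells_by_distance_pattern(dists, n_groups=2)
```

Hierarchical clustering needs memory that grows quadratically with the number of cells, so for tens of thousands of cells it is probably better to subsample or to use a more scalable clustering method.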

@Justin_Hong Thank you so much for your quick response!

I understand your explanations but have difficulties applying them.

cell_sample_sample_distances is a three-dimensional array of shape (22691, 37, 37) in my case. Could you provide some code to produce a heatmap in the way you explained it? Sorry, I mostly work with data frames in R (obviously 2D), and it is not obvious to me how to arrive at a heatmap from this output (seaborn expects a 2D array).

Thanks a lot!

@mihem Yes, so each element along the first axis of your output represents the sample x sample distance matrix implied by one of the cells in the input. So here you would either take the mean across the entire first axis, or, if you believe there to be cell-state-specific heterogeneity in the nature of sample effects, you would split the first axis into groups based on, say, cell types, before taking the mean and plotting heatmaps for each of those groups. Does that make sense?
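The averaging described above can be sketched as follows, using a random toy array in place of the real (n_cells, n_samples, n_samples) output. The function names, the cell type labels, and the plotting choices are assumptions for illustration. The plotting helper passes a linkage built from the distances themselves to `seaborn.clustermap`, because by default `clustermap` treats rows as feature vectors rather than as precomputed distances.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform


def mean_distances(dists, cell_labels=None):
    """Average the per-cell matrices over all cells or within each label."""
    dists = np.asarray(dists, dtype=float)
    if cell_labels is None:
        return {"all": dists.mean(axis=0)}
    cell_labels = np.asarray(cell_labels)
    return {
        label: dists[cell_labels == label].mean(axis=0)
        for label in np.unique(cell_labels)
    }


def plot_sample_clustermap(mean_dist, sample_names):
    """Clustered heatmap of one averaged sample x sample distance matrix."""
    # Imported here so the averaging code above runs without plotting libraries
    import pandas as pd
    import seaborn as sns

    sym = (mean_dist + mean_dist.T) / 2.0
    np.fill_diagonal(sym, 0.0)
    tree = linkage(squareform(sym, checks=False), method="average")
    frame = pd.DataFrame(sym, index=sample_names, columns=sample_names)
    return sns.clustermap(frame, row_linkage=tree, col_linkage=tree, cmap="viridis")


# Toy stand-in for cell_sample_sample_distances: 6 cells, 3 samples
rng = np.random.default_rng(0)
dists = rng.random((6, 3, 3))
cell_types = np.array(["B", "B", "T", "T", "T", "B"])  # hypothetical labels

overall = mean_distances(dists)["all"]        # one (3, 3) matrix over all cells
per_type = mean_distances(dists, cell_types)  # one (3, 3) matrix per cell type
```

Calling `plot_sample_clustermap(per_type["T"], sample_names)` with the sample names in the order of the sample axes would then give one heatmap per cell type.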

Yes, thank you for the explanation.