MrVI input and interpretation

@Justin_Hong Thanks so much for the MrVI tool and preprint. Really enjoyed reading it.

I am just wondering whether the input should be an anndata object of highly variable genes or just the whole cellxgene matrix? I presume hvg selection before hand will bias the analysis depending on what you use as the batch key for hvg selection?

Also, any help of further documentation on interpretation of the output would be fab. As I understood it, the distance metrics can compare across samples, clusters or other condition variable, but it wasn’t clear if I can account for sample variation and across disease states at the same time?

Thanks for any insight you can offer.

Hi @Nusob888,

I’m glad you like MrVI, and I hope you find it useful. In our experience using highly variable genes has been helpful to reduce noise in the latent space. We generally used 2000 genes (somewhat arbitrarily). Whether or not you use batch_key will also matter as you suggested, though we did not look into the effect of this closely for the the MrVI paper.

Apologies for the sparse documentation. We will have better documentation in the future, but in the mean time feel free to ask questions about the model in discourse.

To answer your question, the current model does not incorporate sample metadata (e.g. disease) during training. Rather, the model is only given categorical sample IDs with no groupings of samples otherwise. Then, the distance matrices can be evaluated based on groupings to understand if there is a correlation (i.e. same group → smaller distances). This is how we conducted a “guided” analysis in the preprint. Hope this answers your question!

2 Likes

Hi Justin,

Thank you for the clarification.

So am I right in thinking that the envisioned workflow might be to integrate the data with either another method (e.g. SCVI) or use mrVI. Cluster and annotate the cell types as metadata. Then proceed to look at distances as a guided analysis of cell type composition?

Similarly, if one were interested in other sources of grouping such as by transcriptomic perturbations, one could create meta groupings of samples from the distances and then perform differential expression analysis thereafter across cell types of interest?

Yes, cell types can be annotated via another method or MrVI in the u latent space. The distances do not provide a guided analysis of cell type composition, since the model does not account for differences in sample abundance, just the sample-specific cell states.

MrVI would be great for grouping transcriptomic perturbations or samples via the distances. Subsequently, you could take the grouped samples, and do DE analysis across the groups for different cell types of interest.

Thats great, thanks Justin.

I have performed a test run. However, due to the sparse documentation, it is difficult for me to gauge how best to plot/cluster the data. Do you have idea how far things are from a guided tutorial?

1 Like

For now, the best thing to do is to plot both the u and z latent representations to get an understanding of how the data integrates, then to look at the average distance matrix in different clusters if your data. We will likely not have a guided tutorial for this version of MrVI in the near future since we are working hard on the next version of it. We will have more documentation for the next version!

Hi Justin,
Again thanks for this great tool. Along the lines of this conversation, I just need a bit of clarity regarding the cell, sample,sample distances. For each cell i get a square matrix of sample, sample representation, I was wondering how to obtain the column and rownames for each matrix, given that the diagonal for each matrix is 0 irrespective of which sample/subject the cell originates from. Sorry for the bother.Thanks

Sorry for the super late response! This is something we will add with the next version of this package, but in the mean time you can get it like this

sample_order = adata.obs.loc[
        lambda x: ~x[sample_key].duplicated(keep="first")
    ].sort_values("_scvi_sample")[sample_key].values

@Justin_Hong thanks for this super useful package.
You mentioned elsewhere that there will be a new version including tutorials of MrVI, are there any news on that?

Maybe in the meantime, could you more specifically tell us how you would visualize the resulting cell_sample_sample_distances (and/or cell_sample_sample_representations?) So how can they be used to create a heatmap or some other form of visualization to understand how the samples group together?

Thank you !!

@mihem thanks for your interest in MrVI. We’re hoping to get our work soon in the coming months along with all the code.

For the cell_sample_sample_distances, we typically use a seaborn clustermap to organize the samples based on the MrVI distances. We also usually average the distance matrices over cells known to belong to some homogenous population (e.g. based on cell type labels). Although we now have some ways of arriving at these groups from the distances themselves by grouping the cells with similar distance matrices. After performing this, it’s a good first step to look at the metadata of the samples within the groups. It can tell you which metadata have the strongest correlation with the MrVI distances.

@Justin_Hong Thank you so much for your quick response!

I understand your explanations but have difficulties applying them.

cell_sample_sample_distances is a three dimensional array of dimension of shape 22691, 37, 37 in may case. Could you provide some code to produce a heatmap in the way you explained it? Sorry, I am mostly working with dataframe in R (obviously 2D), and don’t find this super obvious how to arrive at a heatmap based on this output (seaborn expect a 2day array).

Thanks a lot!

@mihem Yes, so each element along the first axis of your output represents the sample x sample distance matrix implied by one of the cells in the input. So here you would either take the mean across the entire first axis, or if you believe there to be cell-state-specific heterogeneity in the nature of sample effects, you would split the first axis into groups based on, say cell types, before taking the mean and plotting heatmaps for each of those groups. Does that make sense?

Yes, thank you for the explanation.

Hi @mihem just to update here in case you missed it, we the new manuscript is up on biorxiv along with an updated version in scvi-tools. You can see all the details in the tweetorial: x.com

1 Like

Hi Justin,

Love the updated manuscript. I am looking to revisit MrVI.

I still had a few questions, some of which are the same as my prior questions:

  1. hvg selection.
    I note on the quick start tutorial, you have taken the top 10,000 genes and not assigned a batch variable.

I note that previously you stated you did not benchmark how this affects the model, but I wonder whether this is a crucial thing to clarify. Reasons being:

  • hvgs selected with a batch label can ignore batch specific hvgs, I would anticipate that this removes sample level variability depending on how you select batch labels for non-multiplexed data
  • globally selecting hvgs also seems like it would ignore batch specific bias. e.g. if one dataset is hugely over-represented, then would this not bias the hvg selection method?

On this note, it is still unclear what method of hvg selection you performed in your manuscript (particularly interested in the Crohn’s benchmarking). The reproducibility repo seems to have disappeared.

  1. Can one use mrVI as the integration tool in place of scVI? e.g. use the u embeddings to create neighbourhood graphs and downstream clustering etc.

Thank again for all the great work! look forward to hearing your thoughts

Hi @Nusob888,

Thanks for revisiting our package, and thank you for the questions.

I have just made our reproducibility repo public here: GitHub - YosefLab/mrvi-reproducibility: Nextflow pipeline for MrVI pipelines. The crohn’s analysis is in the notebooks folder in a ipynb. The HVG selection is described in the methods section of the manuscript as:

We selected 10,000 highly variable genes using “seurat_v3” flavor in scanpy and using the respective 10X chemistry as batch key. We also filtered cells expressing more than 300 genes, 1K counts, or more than 20 mitochondrial reads, as these cells contained low-quality events.

We found that adding the batch key actually would overly index on the batch specific HVGs and miss sample subset specific genes. So other approaches may serve as a compromise (e.g., selecting genes that are highly variable in some proportion of the same). But yes, this has not been thoroughly benchmarked on our side and that kind of work would be appreciated. It seems to be an ongoing open question in single-cell integration, not just bound to sample-level variability analysis.

  1. Yes, we would encourage looking at the u space as another way to integrate single-cell data as a substitute for scVI. We actually considered advertising the method in this way, but we still found the sample analysis perspective more compelling. But in general, we did find the scib scores to be equal or slightly stronger than scVI.

Please let us know if you have more questions/find success in using mrVI!

Hi @Justin_Hong , Do you know if MRVI has been taken off scvi-tools? I can no longer find it in scvi.external.MRVI since I updated scvi-tools

Apologies I figured this out. I forgot it had to be installed from the GitHub.

On a side note, is MRVI compatible with scarches? if not, will this be adapted in future versions?

Hi @Nusob888, glad you were able to resolve your issue. MRVI is not compatible with scarches and there are no plans at the moment to support it. However, we acknowledge this could be a very valuable direction to take this work!

Thanks Justin. Can I just check another aspect of the tutorial.

I am trying to do abundance analysis for healthy vs disease 1 (our of 5 disease states). How can we do this with the code provided?

sample_cov_keys = ["Status"]  # Replace with your sample covariate of interest
model.sample_info["Status"] = model.sample_info["Status"].cat.reorder_categories(
    ["Healthy", "Covid"]
)  # Reorder categories such that the coefficient corresponds to Covid
de_res = model.differential_expression(
    sample_cov_keys=sample_cov_keys,
    store_lfc=True,
)

It seems this can only handle two labels is that correct? do you have any case use examples for multiple labels?

The reproducibility code is not yet available for: https://www.biorxiv.org/content/10.1101/2024.01.03.573877v2.supplementary-material

but it seems like they only categorise the samples into two age groups, so presumably, there is no way around this?