MrVI input and interpretation

Hi, I’m slightly confused because you ask for abundance but copied the code for expression. Can you clarify? Abundance is meant for pairwise comparisons, similar to tools like milo. What type of output would you expect for more than two groups? For expression, you get DE estimates for the learnt linear model coefficient in latent space, which is also always pairwise in comparison to the reference category similar to tools like classical DE.

Ah thanks, sorry for the code error.

Yes, agree the abundance is always going to be pairwise, but I was wondering if there is a way to perform this without having to subset the data and regenerated the u embeddings for a subset of the data?

I guess I am thinking of examples where people might generate a multi-disease atlas. It would be a useful feature to have.

on another note, I have been having issues with reproducibility of MRVI described here: Reproducibility issue of MRVI · Issue #105 · YosefLab/mrvi · GitHub

Thanks a bunch in advance

Hi @Nusob888, thanks for your question. What do you mean exactly by performing DA without having to subset the data an regenerating the u embeddings? Technically you can get a log density score for every cell and sample then just cache these to compute the final DA comparisons for any given subset of cells. Did you want something like this to be exposed to the user?

Also, aware of your issue. I moved it to the scvi-tools repo since we have deprecated the mrvi repo entirely. I will be investigating it this week.

That’s amazing thanks.

And I hadn’t realised this, yes I guess something in the API that allows the user to select the conditions they want to do either DA or DE. Currently if I make an MrVI model of let’s say 16 datasets across 6 disease labels and 1 control label, when I run the following code from the quick start tutorial, I get an error.

model.sample_info["Status"] = model.sample_info["Status"].cat.reorder_categories(
    ["Healthy", "Covid"]

I will try to get the exact error message to you when I can, unfortunately our clusters GPU nodes are in demand. But essentially, it cannot reorder two labels from 7. Since we can only do pairwise comparisons, I couldn’t quite see how I can isolate my two labels of interest, without having to re-run a less complex model.

I am probably just being a bit too impatient and the final official release will be much clearer, but it’s giving me too many ideas on how to implement it on a dataset I am working on!

Hello! Reading through the MRVI tutorials and this thread, I have two questions that I think are relevant here:

  1. Is it recommended to run MRVI with a high # of variable features? I noticed in the quickstart tutorial 10k genes were used and in the analysis of Tahoe100M, 15k genes were used. This is much greater than the 2-3k typically used in scRNAseq analysis, so I was wondering if that’s due to MRVI working best with large #’s of genes or if it was just a coincednce.
  2. I take it from the tutorial and this thread that MRVI is best run on an adata subset by the variable features. However, doesn’t that mean that it won’t be possible to detect differential expression in other genes? If just 2-3k genes are used as variable features, that leaves ~30k genes that won’t be used during in MRVI’s differential expression analysis.

I think that the number of HVG you keep is a function of the diversity in your data, but also poses a computational limitation. You can do several trials and errors with this number.

Noise and uninformative genes will not help for modeling, nor appear as significant in a DE analysis.

Hi, I generally agree. When developing MrVI we realized that it’s performance doesn’t drop as quickly when selecting more hvg compared to a standard scVI model, where selecting 10k genes usually leads to much worse batch integration. It is not entirely clear, why we see this improvement.

For downsteam applications, it is desired to have more genes and be able to check e.g. differential expression for those genes. We also have one application, where we skipped hvg entirely and got good results, however, this comes with additional computational costs and batch integration performance might be worse than on e.g. 10k genes.

Thank you for the explanation. If I don’t subset the adata by the HVG, but instead keep the HVG encoded in adata.var, it seems that will that allow me to use just HVG during MrVI batch integration while retaining all genes for differential expression analysis. Are there any potential downsides to this or is this not recommended?

To train only on hvg, you have to subset the AnnData file. However, you can create one subset data and then put the results back in the full AnnData. However, genes not used for training can’t be used for downstream functions in MrVI such as differential expression.