Thank you for developing a bunch of amazing tools in the single-cell field!
I encountered an issue related to the integration of multiple multiome datasets.
As I understand it, totalVI is specialized for CITE-seq (protein + scRNA), and MultiVI is designed for paired and unpaired data to make better use of single-modality data. I’m curious whether totalVI’s focus on CITE-seq is driven by the relatively small size of the protein matrix, which would limit its applicability to other types of datasets?
I have a total of 9 multiome datasets and want to integrate them without introducing batch effects, similar to what scVI does.
May I ask what the best approach to integrate these multiome datasets would be?
Additionally, there are normalization options in cellranger-arc aggr, such as --normalize=depth|none [default: depth]. I believe that overcorrecting with --normalize=depth could potentially affect the subsequent integration step with scvi-tools. Could you provide some advice regarding this?
Hi, thanks for your question. Yes, totalVI is only suitable for analyzing paired CITE-seq data since it jointly models both RNA and protein expression in the same latent space. In addition to this, it generates RNA and protein counts from learned negative binomial distributions, including foreground and background protein parameters. You can find more information in our user guide.
MultiVI accepts paired and unpaired data for RNA, ATAC, and protein expression. It does so by inferring modality-specific latent representations and then merging them into a common one, thus making it suitable for modeling unpaired data. What are the modalities present across your datasets? If the datasets are unpaired, then MultiVI would be the appropriate choice.
Regarding depth normalization, all scvi-tools models expect raw count data as input, so normalization is not recommended.
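As a quick sanity check before passing a matrix to scvi-tools, one can verify that it still looks like raw counts. The sketch below is only an illustration: `looks_like_raw_counts` is a hypothetical helper, not part of scvi-tools, and it assumes a dense NumPy array (for a sparse matrix, the stored values would need to be checked instead). It catches scaled or log-transformed values, but it cannot tell whether integer counts were downsampled upstream.

```python
import numpy as np


def looks_like_raw_counts(matrix: np.ndarray) -> bool:
    """Return True if every entry is a non-negative integer value."""
    values = np.asarray(matrix, dtype=float)
    if values.size == 0:
        return False
    non_negative = bool(np.all(values >= 0))
    integer_valued = bool(np.all(values == np.floor(values)))
    return non_negative and integer_valued


# Toy count matrix: 2 cells x 3 features
raw = np.array([[0, 3, 1], [2, 0, 5]])
# Library-size scaling followed by log1p, the kind of transform to avoid here
normalized = np.log1p(raw / raw.sum(axis=1, keepdims=True) * 1e4)

print(looks_like_raw_counts(raw))
print(looks_like_raw_counts(normalized))
```

The first call should report that the toy matrix looks like counts, and the second should flag the transformed matrix.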
Apologies for the delayed response. I utilized fully paired multiome data from four different libraries and employed MOFA+ to integrate the datasets. But I also attempted to use MultiVI for the integration, as described here.
However, I encountered an error message when I used the same code from the page.
I assumed that the error was caused by using the modality (in my case, all cells are assigned as ‘paired’) as a single batch during scvi.model.MULTIVI.setup_anndata. Therefore, I also tried using fully_paired=True in the scvi.model.MULTIVI function. However, I encountered the same error during training.
- Could you please provide any advice on resolving this error?
- In your opinion, what is the most effective way to integrate several fully paired multiome datasets?
- version info
scvi : v1.0.4
pandas : v2.1.0
Hi @jjuhh, sorry you are running into this error. We have noticed an increased frequency of NaN errors, particularly in MultiVI and GIMVI, and are looking into a solution for these. I believe this is occurring for scvi-tools >= 1.0.0. Could you try the following and see if the issue resolves?
- Try setting scvi.settings.seed = 0 in
- Try installing
Thank you for your response and for paying attention to this issue.
I was able to resolve the problem by simply running scvi.model.MULTIVI.setup_anndata() without specifying batch_key="modality" before the
I believe this error arose from using only one batch (in my case, all cells are ‘paired’ in the modality column) rather than multiple batches. It seems that the fully_paired = True flag in scvi.model.MULTIVI() is currently ignored. If this flag becomes functional in the future, errors like the one I encountered may not occur.
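The workaround above amounts to not registering a batch column that has only one category. A minimal sketch of that check follows. `usable_batch_key` and the `library` labels are hypothetical, and the idea that a single-category batch triggers the error is the hypothesis stated above, not a confirmed diagnosis.

```python
from typing import Hashable, Optional, Sequence


def usable_batch_key(column_name: str, values: Sequence[Hashable]) -> Optional[str]:
    """Return column_name if it has at least two distinct values, else None."""
    return column_name if len(set(values)) > 1 else None


# All cells are paired, so "modality" carries no batch information
modality = ["paired"] * 6
# Hypothetical per-library labels for the same six cells
library = ["lib1", "lib1", "lib2", "lib2", "lib3", "lib4"]

print(usable_batch_key("modality", modality))
print(usable_batch_key("library", library))
```

The returned value could then be passed as `batch_key` to `scvi.model.MULTIVI.setup_anndata`; when it is `None`, no batch column is registered, which matches the workaround described here.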
On another note, I have a question: can I use the code below to integrate fully paired multiome data? Does it make sense?
mvi = scvi.model.MULTIVI(
    adata_mvi,  # AnnData previously registered with setup_anndata
    fully_paired=True,  # allows the simplification of the model if the data is fully paired. "Currently ignored."
    n_genes=(adata_mvi.var["feature_types"] == "Gene Expression").sum(),
    n_regions=(adata_mvi.var["feature_types"] == "Peaks").sum(),
)
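One possible way to organize the whole workflow is sketched below, under stated assumptions. `count_modalities` and `fit_multivi` are hypothetical names, not scvi-tools functions. The sketch assumes the AnnData holds raw counts, that its `feature_types` column uses the "Gene Expression" and "Peaks" labels from the snippet above, and that features are ordered with all genes before all peaks, which, as far as I know, the MultiVI tutorial asks the user to enforce. The optional `batch_key` would typically name a per-library column rather than `modality`.

```python
from typing import Sequence, Tuple


def count_modalities(feature_types: Sequence[str]) -> Tuple[int, int]:
    """Count genes and peaks, requiring all genes to precede all peaks."""
    labels = list(feature_types)
    n_genes = labels.count("Gene Expression")
    n_regions = labels.count("Peaks")
    if n_genes + n_regions != len(labels):
        raise ValueError("unexpected feature type present")
    if labels[:n_genes] != ["Gene Expression"] * n_genes:
        raise ValueError("features must be ordered genes first, then peaks")
    return n_genes, n_regions


def fit_multivi(adata, batch_key=None):
    """Set up and train MultiVI on an AnnData of raw counts (not run here)."""
    import scvi  # imported lazily so the helper above works without scvi-tools

    n_genes, n_regions = count_modalities(adata.var["feature_types"])
    scvi.model.MULTIVI.setup_anndata(adata, batch_key=batch_key)
    model = scvi.model.MULTIVI(adata, n_genes=n_genes, n_regions=n_regions)
    model.train()
    return model, model.get_latent_representation()


# Only the pure-Python helper is exercised here
print(count_modalities(["Gene Expression", "Gene Expression", "Peaks"]))
```

Calling `fit_multivi(adata_mvi, batch_key="library")` with a hypothetical `library` column would return the trained model together with a latent representation that can be used for neighbors and clustering.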