Integration of Multiple Multiome Datasets

jjuhh · August 16, 2023, 4:12am

Thank you for developing a bunch of amazing tools in the single-cell field!

I encountered an issue related to the integration of multiple multiome datasets.

As I understand it, totalVI is specialized for CITE-seq (protein + scRNA), and MultiVI is designed for paired and unpaired data so that single-modality data can also be used. I’m curious whether totalVI’s focus on CITE-seq is driven by the relatively small size of the protein matrix, which would limit its applicability to other types of datasets.

I have a total of 9 multiome datasets and want to integrate them without introducing batch effects, similar to what scVI does.

May I ask what the best approach to integrate these multiome datasets would be?

Additionally, there are normalization options in cellranger-arc aggr, such as --normalize=depth|none [default: depth]. I believe that overcorrecting with --normalize=depth could affect the subsequent integration step with scvi-tools. Could you provide some advice regarding this?

Thank you!!
Juhyun

martinkim0 · August 18, 2023, 9:16pm

Hi, thanks for your question. Yes, totalVI is only suitable for analyzing paired CITE-seq data since it jointly models both RNA and protein expression in the same latent space. In addition to this, it generates RNA and protein counts from learned negative binomial distributions, including foreground and background protein parameters. You can find more information in our user guide.

MultiVI accepts paired and unpaired data for RNA, ATAC, and protein expression. It does so by inferring modality-specific latent representations and then merging them into a common one, thus making it suitable for modeling unpaired data. What are the modalities present across your datasets? If the datasets are unpaired, then MultiVI would be the appropriate choice.

Regarding depth normalization, all scvi-tools models expect raw count data as input, so normalization is not recommended.
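
A quick way to sanity-check the raw-count expectation before calling setup_anndata is sketched below. The helper name `looks_like_raw_counts` is hypothetical (it is not part of scvi-tools), and it only tests that values are non-negative whole numbers. It would catch log-transformed or scaled matrices, but it would not flag counts that were downsampled upstream and are still integers.

```python
from typing import Iterable


def looks_like_raw_counts(values: Iterable[float]) -> bool:
    """Return True if every value is a non-negative whole number.

    Hypothetical helper for illustration; not an scvi-tools function.
    NaN values fail the `>= 0` comparison, so they return False.
    """
    return all(v >= 0 and float(v).is_integer() for v in values)


# Integer-valued entries pass, even when stored as floats.
print(looks_like_raw_counts([0, 3, 1.0, 12]))

# Log-normalized values such as log2(x + 1) do not.
print(looks_like_raw_counts([0.0, 1.58, 2.32]))
```

For an AnnData object with a sparse counts layer, one could pass the stored non-zero values, for example `adata_mvi.layers["counts"].data` (assuming the layer is a SciPy sparse matrix named "counts").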

jjuhh · November 9, 2023, 10:04pm

Apologies for the delayed response. I utilized fully paired multiome data from four different libraries and employed MOFA+ to integrate the datasets. But I also attempted to use MultiVI for the integration, as described here.

However, I encountered an error message when I used the same code from the page.
스크린샷 2023-11-09 오후 4.59.51
스크린샷 2023-11-09 오후 5.00.07

I assumed that the error was caused by using the modality column (in my case, all cells are assigned as ‘paired’) as a single batch in scvi.model.MULTIVI.setup_anndata. Therefore, I also tried passing fully_paired=True to scvi.model.MULTIVI. However, I encountered the same error during training.

Could you please provide any advice on resolving this error?
In your opinion, what is the most effective way to integrate several fully paired multiome datasets?

version info
scvi : v1.0.4
pandas : v2.1.0

martinkim0 · November 14, 2023, 7:27pm

Hi @jjuhh, sorry you are running into this error. We have noticed an increased frequency of NaN errors, particularly in MultiVI and GIMVI, and are looking into a solution for these. I believe this is occurring for scvi-tools >= 1.0.0. Could you try the following and see if the issue resolves?

Try setting scvi.settings.seed = 0 in scvi-tools==1.0.4
Try installing scvi-tools==0.20.3

jjuhh · November 14, 2023, 8:18pm

Thank you for your response and for paying attention to this issue.

I was able to resolve the problem by simply running scvi.model.MULTIVI.setup_anndata() without specifying batch_key="modality" before the train() step.

I believe this error arose from using only one batch (in my case, all cells are ‘paired’ in the modality column) rather than multiple batches. It seems that the fully_paired = True flag in scvi.model.MULTIVI() is currently ignored. If this flag becomes functional in the future, errors like the one I encountered may not occur.
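
The single-batch situation described above can be checked before setup_anndata with a small sketch like the following. Both helper names are hypothetical and only use the standard library. The idea that a one-category batch column triggers the NaN error is the hypothesis from this post, not something the developers confirmed in this thread.

```python
from collections import Counter
from typing import Hashable, Iterable


def batch_category_counts(labels: Iterable[Hashable]) -> dict:
    """Count how many cells fall into each batch category."""
    return dict(Counter(labels))


def has_multiple_batches(labels: Iterable[Hashable]) -> bool:
    """Return True if the column holds more than one distinct category."""
    return len(set(labels)) > 1


# Fully paired multiome data: every cell carries the same modality label.
modality = ["paired"] * 4
print(batch_category_counts(modality))
print(has_multiple_batches(modality))
```

With an AnnData object, the labels could come from `adata_mvi.obs["modality"]`. If `has_multiple_batches` returns False, the column carries no batch information, and a column such as "sampleName" may be the more informative covariate.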

On another note, I have a question: Can I use the code below to integrate fully paired-multiome data? Does it make sense?

import scvi

# Register the raw counts layer and treat the sample of origin as a covariate.
scvi.model.MULTIVI.setup_anndata(
    adata_mvi,
    categorical_covariate_keys=["sampleName"],
    layer="counts",
)

%%time
mvi = scvi.model.MULTIVI(
    adata_mvi,
    fully_paired=True,  # allows the simplification of the model if the data is fully paired. "Currently ignored."
    n_genes=(adata_mvi.var["feature_types"] == "Gene Expression").sum(),
    n_regions=(adata_mvi.var["feature_types"] == "Peaks").sum(),
)

%%time
mvi.train(
    early_stopping=True,
    use_gpu="cuda:1",
    adversarial_mixing=False,
    plan_kwargs={"reduce_lr_on_plateau": True, "optimizer": "AdamW"},
    batch_size=500,
)

Thank you!
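
One detail of the train() call above depends on the installed version. As far as I understand, scvi-tools 1.0 introduced the Lightning-style `accelerator` and `devices` arguments and later releases dropped `use_gpu`. The sketch below picks device arguments from a version string. The helper name `gpu_train_kwargs` and the exact version cutoff are assumptions, so check the release notes for the version you have installed.

```python
def gpu_train_kwargs(scvi_version: str, gpu_index: int) -> dict:
    """Return device-selection keyword arguments for `model.train()`.

    Hypothetical helper; the 1.0 cutoff is an assumption, not taken
    from this thread.
    """
    major, minor = (int(part) for part in scvi_version.split(".")[:2])
    if (major, minor) < (1, 0):
        # Older releases take a device string through `use_gpu`.
        return {"use_gpu": f"cuda:{gpu_index}"}
    # 1.0 and later accept Lightning-style device arguments.
    return {"accelerator": "gpu", "devices": [gpu_index]}


print(gpu_train_kwargs("0.20.3", 1))
print(gpu_train_kwargs("1.0.4", 1))
```

The result could then be unpacked into the call, for example `mvi.train(early_stopping=True, batch_size=500, **gpu_train_kwargs(scvi.__version__, 1))`.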

asur · March 6, 2024, 9:45pm

Just curious whether there has been any update on this from the developers. So far, the 1.2 branch on GitHub doesn’t show the fully_paired flag as active.
