Integration of Multiple Multiome Datasets

Thank you for developing a bunch of amazing tools in the single-cell field!

I encountered an issue related to the integration of multiple multiome datasets.

As I understand it, totalVI is specialized for CITE-seq (protein + scRNA), and MultiVI is designed for paired and unpaired data so that single-modality data can be put to better use. I’m curious whether totalVI's focus on CITE-seq is due to the relatively small size of the protein matrix, and whether that limits its applicability to other types of datasets.

I have a total of 9 multiome datasets and want to integrate them without introducing batch effects, similar to what scVI does.

May I ask what the best approach to integrate these multiome datasets would be?

Additionally, there are normalization options in cellranger-arc aggr, such as --normalize=depth|none (default: depth). I believe that overcorrecting with --normalize=depth could affect the subsequent integration step with scvi-tools. Could you provide some advice regarding this?

Thank you!!
Juhyun

Hi, thanks for your question. Yes, totalVI is only suitable for analyzing paired CITE-seq data since it jointly models both RNA and protein expression in the same latent space. In addition to this, it generates RNA and protein counts from learned negative binomial distributions, including foreground and background protein parameters. You can find more information in our user guide.

MultiVI accepts paired and unpaired data for RNA, ATAC, and protein expression. It does so by inferring modality-specific latent representations and then merging them into a common one, thus making it suitable for modeling unpaired data. What are the modalities present across your datasets? If the datasets are unpaired, then MultiVI would be the appropriate choice.

Regarding depth normalization, all scvi-tools models expect raw count data as input, so normalization is not recommended.
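One way to sanity-check that a matrix has not been rescaled before passing it to a model is to confirm that every entry is a non-negative whole number. The sketch below is a hypothetical helper (`looks_like_raw_counts` is not part of scvi-tools) written in plain Python; on real data one would typically pass a small dense slice of the count layer rather than the full matrix.

```python
def looks_like_raw_counts(rows):
    """Return True if every value is a finite, non-negative whole number.

    `rows` is any iterable of rows of numbers, e.g. a list of lists or a
    small dense slice of a count matrix. Hypothetical helper, not an
    scvi-tools function.
    """
    for row in rows:
        for value in row:
            v = float(value)
            # NaN and inf are not whole numbers; negatives cannot be counts.
            if v < 0 or not v.is_integer():
                return False
    return True


raw = [[0, 3, 1], [2, 0, 7]]                    # integer counts
scaled = [[0.0, 1.37, 0.46], [0.9, 0.0, 3.2]]   # size-factor-scaled values

print(looks_like_raw_counts(raw))     # True
print(looks_like_raw_counts(scaled))  # False
```

Note that this only detects rescaled or transformed values. A matrix that passes is integer-valued, but the check cannot tell whether reads were subsampled upstream.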

Apologies for the delayed response. I utilized fully paired multiome data from four different libraries and employed MOFA+ to integrate the datasets. But I also attempted to use MultiVI for the integration, as described here.

However, I encountered an error message when I used the same code from the page.
[Screenshot 2023-11-09, 4.59.51 PM]
[Screenshot 2023-11-09, 5.00.07 PM]

I assumed that the error was caused by using the modality column (in my case, all cells are labeled ‘paired’) as a single batch in scvi.model.MULTIVI.setup_anndata. Therefore, I also tried passing fully_paired=True to scvi.model.MULTIVI. However, I encountered the same error during training.

  1. Could you please provide any advice on resolving this error?
  2. In your opinion, what is the most effective way to integrate several fully paired multiome datasets?
  • version info
    scvi : v1.0.4
    pandas : v2.1.0

Hi @jjuhh, sorry you are running into this error. We have noticed an increased frequency of NaN errors, particularly in MultiVI and GIMVI, and are looking into a solution for these. I believe this is occurring for scvi-tools >= 1.0.0. Could you try the following and see if the issue resolves?

  • Try setting scvi.settings.seed = 0 in scvi-tools==1.0.4
  • Try installing scvi-tools==0.20.3

Thank you for your response and for paying attention to this issue.

I was able to resolve the problem by simply running scvi.model.MULTIVI.setup_anndata() without specifying batch_key="modality" before the train() step.

I believe this error arose from using only one batch (in my case, all cells are ‘paired’ in the modality column) rather than multiple batches. It seems that the fully_paired=True flag in scvi.model.MULTIVI() is currently ignored. If this flag becomes functional in the future, errors like the one I encountered may no longer occur.
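The single-batch hypothesis above can be checked before setup. This is a minimal sketch, assuming the covariate column has been pulled out as a plain list of labels (for an AnnData object, something like `adata_mvi.obs["modality"].tolist()`); `batch_summary` is a hypothetical helper, not part of scvi-tools.

```python
from collections import Counter


def batch_summary(labels):
    """Count cells per category and flag columns with a single category."""
    counts = Counter(labels)
    return {"counts": dict(counts), "single_batch": len(counts) < 2}


# Every cell is 'paired', as in the fully paired multiome case.
modality = ["paired"] * 5
# A sample-level column with several categories (made-up labels).
sample_name = ["lib1", "lib1", "lib2", "lib3", "lib3"]

print(batch_summary(modality))
print(batch_summary(sample_name))
```

If `single_batch` comes back true, the column carries no batch information, so a sample- or library-level column would be the more informative covariate to register.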

On another note, I have a question: can I use the code below to integrate fully paired multiome data? Does it make sense?

scvi.model.MULTIVI.setup_anndata(
    adata_mvi,
    categorical_covariate_keys=["sampleName"],
    layer="counts"
)

%%time
mvi = scvi.model.MULTIVI(
    adata_mvi,
    fully_paired = True, # allows the simplification of the model if the data is fully paired. "Currently ignored."
    n_genes=(adata_mvi.var["feature_types"] == "Gene Expression").sum(),
    n_regions=(adata_mvi.var["feature_types"] == "Peaks").sum()
)

%%time
mvi.train(early_stopping=True,
          use_gpu='cuda:1',
          adversarial_mixing=False,
          plan_kwargs={"reduce_lr_on_plateau":True, "optimizer":"AdamW"},
          batch_size=500
         )

Thank you!

Just curious if there has been any update on this from the developers. So far, the 1.2 branch on git doesn’t show the fully_paired flag as active.