Failure to remove a batch_key effect / effect of number of LVs

I have a dataset of 700k cells from two conditions: unstimulated and stimulated with a virus. When I set the batch_key to “condition”, scVI was unable to remove the batch effect: in the UMAP plot, the cells cluster by stimulation condition rather than by cell type.

What can I do to address this issue?
Could changing the number of highly variable genes (HVGs) used affect this?

Does the number of latent variables and layers scale with the number of cells? Should I specify more LVs and layers if I have more cells?
I also saw that the results were a bit different when I increased n_epochs. Ideally, should I go for many more epochs since I have a lot of cells?

Here is a UMAP colored by cell type and by condition, the effect I want to remove (a standard PCA/kNN/UMAP workflow yields a similar UMAP).

Increasing the number of epochs until convergence (using early stopping) still did not remove the batch effect of the condition.

Hi Marwansha,

Are all your 700k cells from just two samples / batches that are confounded with your virus treatment?

Integration of samples with scVI tends to work better if you include all samples / batches. For example, if you have 8 untreated samples and 8 treated samples, for a total of 16 samples but only 2 conditions, then integration typically works better if you give ‘sample’ as batch_key than if you give ‘condition’ as batch_key.
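
One quick way to answer the question above is to cross-tabulate the candidate batch column against the condition column. This is only a sketch: the toy table stands in for adata.obs, and the column names 'sample' and 'condition' as well as the values are placeholders, not taken from the original data.

```python
import pandas as pd

# Toy stand-in for adata.obs; the column names 'sample' and 'condition'
# and all values are assumptions for illustration only.
obs = pd.DataFrame(
    {
        "sample": ["s1", "s1", "s2", "s2", "s3", "s4"],
        "condition": ["unstim", "unstim", "stim", "stim", "unstim", "stim"],
    }
)

# Rows are samples, columns are conditions, entries are cell counts.
counts = pd.crosstab(obs["sample"], obs["condition"])
print(counts)

# Number of distinct conditions observed within each sample.
conditions_per_sample = (counts > 0).sum(axis=1)

# True when every sample contains cells from exactly one condition,
# i.e. each sample is nested within a single condition.
fully_nested = bool((conditions_per_sample == 1).all())
print("each sample belongs to a single condition:", fully_nested)
```

If the table has only two rows (one sample per condition), sample and condition cannot be told apart, which is the confounded situation asked about above.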

With only 700k cells, I think the default of 10 latent variables should be sufficient. When there are many more cells, increasing the number of LVs can help, but not necessarily for batch integration, which tends to benefit from a small ‘bottleneck’.

Hope this helps!

Thank you a lot for your response.

My library design was quite complex: each library contained samples from different individuals under different conditions (within the same library, one individual was stimulated with a virus while another was not), and cell identities were then demultiplexed using genotyping data.

The visualization shows each cell type clustering and separating distinctly by condition, with no library batch effect. I was trying to remove this separation using scVI.

I tried a variety of model settings: different numbers of latent variables (10, 20, 30, 50, 10), numbers of layers (1, 2, 4, 10), dispersion settings (gene, gene-batch), and gene likelihood models (ZINB and NB). All models reached convergence (with the early stopping option).

*Do you have any suggestions for how to get this to work? Should I try adding the library together with the condition, perhaps as categorical keys (batches)?*

Hi Marwansha,

Since you want to account for (‘remove’) variation due to ‘condition’, the ‘condition’ indicator should be part of your ‘batch’ in the batch correction. Did you try this?

For example, if you have one column ‘individual’ and another column ‘condition’ in your adata.obs, you can make a new column adata.obs['individual_condition'] = adata.obs['individual'] + '-' + adata.obs['condition'], then use 'individual_condition' as batch_key when setting up the model.
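
The suggestion above could be sketched as follows. The helper names add_combined_batch and fit_scvi are hypothetical, and the toy obs table stands in for adata.obs. The astype(str) step is an addition not in the original suggestion: it is there because `+` does not concatenate pandas categorical columns. scvi-tools is imported lazily inside fit_scvi, so the column construction can be tried without the package installed.

```python
import pandas as pd


def add_combined_batch(obs, columns, new_column):
    """Join several obs columns into one categorical batch column."""
    # Cast to str first: categorical columns do not support '+'.
    combined = obs[columns[0]].astype(str)
    for col in columns[1:]:
        combined = combined + "-" + obs[col].astype(str)
    obs[new_column] = combined.astype("category")
    return obs


def fit_scvi(adata, batch_key, n_latent=10):
    """Hypothetical wrapper: set up and train scVI with the given batch_key."""
    import scvi  # imported here so the helper above works without scvi-tools

    scvi.model.SCVI.setup_anndata(adata, batch_key=batch_key)
    model = scvi.model.SCVI(adata, n_latent=n_latent)
    model.train()
    # Store the latent space for downstream neighbors / UMAP.
    adata.obsm["X_scVI"] = model.get_latent_representation()
    return model


# Toy stand-in for adata.obs (values are illustrative only).
obs = pd.DataFrame(
    {
        "individual": ["ind1", "ind1", "ind2", "ind2"],
        "condition": ["unstim", "stim", "unstim", "stim"],
    }
)
obs = add_combined_batch(obs, ["individual", "condition"], "individual_condition")
print(obs["individual_condition"].tolist())
```

On real data, this would be called as add_combined_batch(adata.obs, ['individual', 'condition'], 'individual_condition'), followed by fit_scvi(adata, batch_key='individual_condition').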

The choice of how to group cells with the batch_key argument will have a much larger effect than latent dimensionality, number of layers, or other technical settings.

I would be interested in seeing how your UMAP changes with different choices for what you put in the batch_key!

Hope this helps,

Hi Valentine,

Just to let you know, I tried exactly what you suggested, and I think scVI / scANVI failed to remove this strong condition effect (which is mainly a biological effect: stimulated vs. unstimulated cells).

I know it is a tool effect, since after I switched to Harmony it worked normally.

I would be happy to try again with any specific suggestions for model parameters you can offer.