Batch effect not entirely removed in PeakVI

skwalker · May 18, 2022, 3:32pm

Hi,

I’m trying to use PeakVI to analyze a scATAC-seq dataset with two batches, where one is a multi-modal dataset with the cells co-profiled with scRNA-seq and the other run independently (just scATAC). The biggest difference in these two batches is that the counts are both lower and sparser for cells coming from the multimodal batch. While PeakVI seems to decrease this batch effect, it’s still present after training a PeakVI model with the batch_key representing the variable for multimodal vs scATAC. I still see the batches shifted on the UMAP from the model’s latent representation, and the clusters are quite skewed towards either scATAC cells or multimodal cells. In addition, the distributions of cell library size are still shifted between scATAC vs multimodal, when I calculate the library sizes by summing over the rounded reconstructed probabilities for each region [even when I run get_accessibility_estimates with normalize_cells and/or setting transform_batch to one of the batches]. I had initially thought that the cell-specific factor mentioned in the paper would take this into account, but I find that every cell gets a value of nearly 1, even though the raw library sizes have quite a wide distribution (and are shifted between the two batches). So I was wondering if you would have some advice on somehow tweaking the model somehow (or maybe the network architecture?) to more strongly remove the batch effect, or perhaps there is some way to force a more spread out distribution of cell-specific factors?

Thanks,
Sarah

adamgayoso · August 8, 2022, 7:12am

Are you using the same exact genomic features for both datasets? This could have a large impact.

skwalker · August 9, 2022, 1:17pm

Yes I’m using the same genomic features as I called peaks on the entire dataset combined. And I remove peaks that only have counts in one batch vs the other.

I’m also just curious why the cell-specific factor, which from my understanding should be taking adjusting for differences in library size, is so peaked at just under 1 (as in the paper as well, fig 1c). Especially in this case, I would perhaps expect one batch to have a slightly different peak than the other.

skwalker · August 10, 2022, 5:11pm

Actually, I’m so sorry, that is slightly incorrect. For this analysis, I am using a binned tile matrix of 5kb width. I have tried both binarizing the tile matrix before training the PeakVI model and using the full counts (with the idea that perhaps that would affect the library size factors), to the same result.

Topic		Replies	Views
Failure to remove a batch_key/ effect of number of LVs scvi-tools integration , scvi	6	413	February 9, 2024
scVI batch correction clusters all cells from sample in a circle (potential artifact) scvi-tools integration , scvi	14	92	June 3, 2025
Scvi - denoising single-cell/single-nucleus transcription data scvi-tools scvi	3	255	August 8, 2024
Input peak counts vs binarized scvi-tools peakvi	1	467	December 22, 2021
Insufficient batch correction for certain cell-types scvi-tools integration , scvi	8	425	May 15, 2024

Batch effect not entirely removed in PeakVI

Related topics