Batch effect not entirely removed in PeakVI


I’m trying to use PeakVI to analyze a scATAC-seq dataset with two batches, where one is a multi-modal dataset with the cells co-profiled with scRNA-seq and the other run independently (just scATAC). The biggest difference in these two batches is that the counts are both lower and sparser for cells coming from the multimodal batch. While PeakVI seems to decrease this batch effect, it’s still present after training a PeakVI model with the batch_key representing the variable for multimodal vs scATAC. I still see the batches shifted on the UMAP from the model’s latent representation, and the clusters are quite skewed towards either scATAC cells or multimodal cells. In addition, the distributions of cell library size are still shifted between scATAC vs multimodal, when I calculate the library sizes by summing over the rounded reconstructed probabilities for each region [even when I run get_accessibility_estimates with normalize_cells and/or setting transform_batch to one of the batches]. I had initially thought that the cell-specific factor mentioned in the paper would take this into account, but I find that every cell gets a value of nearly 1, even though the raw library sizes have quite a wide distribution (and are shifted between the two batches). So I was wondering if you would have some advice on somehow tweaking the model somehow (or maybe the network architecture?) to more strongly remove the batch effect, or perhaps there is some way to force a more spread out distribution of cell-specific factors?


Are you using the same exact genomic features for both datasets? This could have a large impact.

Yes I’m using the same genomic features as I called peaks on the entire dataset combined. And I remove peaks that only have counts in one batch vs the other.

I’m also just curious why the cell-specific factor, which from my understanding should be taking adjusting for differences in library size, is so peaked at just under 1 (as in the paper as well, fig 1c). Especially in this case, I would perhaps expect one batch to have a slightly different peak than the other.

Actually, I’m so sorry, that is slightly incorrect. For this analysis, I am using a binned tile matrix of 5kb width. I have tried both binarizing the tile matrix before training the PeakVI model and using the full counts (with the idea that perhaps that would affect the library size factors), to the same result.