Hi,
I’m trying to use PeakVI to analyze a scATAC-seq dataset with two batches: one is a multimodal dataset in which the cells were co-profiled with scRNA-seq, and the other was run independently (scATAC only). The biggest difference between the two batches is that counts are both lower and sparser for cells from the multimodal batch. PeakVI seems to reduce this batch effect, but it is still present after training a model with `batch_key` set to the variable that distinguishes multimodal from scATAC cells. The batches are still shifted on the UMAP of the model’s latent representation, and the clusters are quite skewed towards either scATAC or multimodal cells. In addition, the distributions of per-cell library size are still shifted between the two batches when I calculate library sizes by summing the rounded reconstructed probabilities over all regions, even when I run `get_accessibility_estimates` with `normalize_cells` and/or with `transform_batch` set to one of the batches.

I had initially thought that the cell-specific factor mentioned in the paper would account for this, but every cell gets a value of nearly 1, even though the raw library sizes have quite a wide distribution (and are shifted between the two batches). Do you have any advice on tweaking the model (or maybe the network architecture) to remove the batch effect more strongly? Or is there some way to force a more spread-out distribution of cell-specific factors?
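For concreteness, the library-size check described above amounts to something like the following sketch. The helper name `library_sizes_by_batch` and the synthetic probabilities are illustrative only; in practice `probs` would be the cell-by-region matrix returned by `get_accessibility_estimates`, and `batches` would be the per-cell batch labels.

```python
import numpy as np


def library_sizes_by_batch(probs, batches):
    """Sum rounded accessibility probabilities per cell, grouped by batch.

    probs   : (n_cells, n_regions) array of reconstructed probabilities
    batches : length n_cells sequence of batch labels
    Returns a dict mapping each batch label to its per-cell library sizes.
    """
    probs = np.asarray(probs, dtype=float)
    batches = np.asarray(batches)
    # Round each probability to 0 or 1, then count "open" regions per cell.
    lib_sizes = np.rint(probs).sum(axis=1)
    return {str(b): lib_sizes[batches == b] for b in np.unique(batches)}


# Synthetic stand-in for model output (NOT real data): the first 50 cells
# mimic the sparser multimodal batch, the last 50 the scATAC-only batch.
rng = np.random.default_rng(0)
probs = np.vstack([
    rng.uniform(0.0, 0.6, size=(50, 200)),
    rng.uniform(0.2, 1.0, size=(50, 200)),
])
batches = ["multimodal"] * 50 + ["scATAC"] * 50

sizes = library_sizes_by_batch(probs, batches)
for name, values in sizes.items():
    print(name, "median library size:", np.median(values))
```

If the batch effect were fully removed, the two medians (and the full distributions) would be expected to overlap; in my case they stay shifted.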
Thanks,
Sarah