Normalizing scATACseq data


I am opening a discussion on good practices to deal with scATACseq data. Can anyone please tell me:

  1. Whether or not you use a binarized count matrix as input to any model (PeakVI or other)
  2. How do you subset your tiles? My methodology is the following:
  • Remove tiles (= regions) appearing in less than 5% of the cells. This gives me ~100k tiles (from originally ~6M)
  • Do you further filter tiles, or do you use them directly as input to PeakVI? What I did is use the scanpy preprocessing function “highly_variable_genes” (with flavor = seurat) after log-normalizing the binary count data.

Any feedback on personal experience is more than welcome :slight_smile:



Hi Philippe,
thank you for opening this discussion. I usually follow the pre-processing of the muon package, which uses TF-IDF (log-scaled version) for normalization (that’s essentially a scaling by a size factor with some upweighting of rare peaks). I am hesitant to use a binarized version of the data matrix, because it over-simplifies the data. We tested PeakVI vs. scVI with a Poisson distribution and saw qualitatively better results for the latter. In fact, Laura Martens and colleagues have shown that the Poisson distribution is better suited for single-cell ATACseq data than PeakVI (see implementation in scVI tools).
As far as the filtering goes, I usually follow the muon tutorial linked above, which mostly filters highly variable features by minimum expression and dispersion (similar to the seurat flavor in scanpy). I think that filtering out peaks present in less than 5% of the cells is OK unless you expect a rare cell population. In that case, I would consider rerunning the peak calling per cell type after clustering (and annotation), and then merging the resulting peak sets. In this setting, you may also want to adjust your filtering procedure to keep peaks specific to rare cell types. I personally tend to keep more peaks than is beneficial for my computation.
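One way to adjust the filtering for rare populations, as suggested above, is to keep a peak if it passes the 5% cutoff within any cluster rather than across all cells. A small sketch; `keep_peaks` is a hypothetical helper written for this post, not a library function:

```python
import numpy as np

def keep_peaks(binary, labels, min_frac=0.05):
    """Keep a peak if it is open in at least `min_frac` of the cells of
    ANY cluster, so peaks specific to a rare population survive.

    binary : (cells x peaks) 0/1 array; labels : per-cell cluster ids.
    Returns a boolean mask over peaks.
    """
    keep = np.zeros(binary.shape[1], dtype=bool)
    for lab in np.unique(labels):
        sub = binary[labels == lab]
        keep |= sub.mean(axis=0) >= min_frac
    return keep

# Toy example: peak 2 is open only in a rare cluster of 5 cells, so a
# global 5% cutoff would be borderline, but the per-cluster rule keeps it
rng = np.random.default_rng(1)
binary = (rng.random((100, 3)) < 0.01).astype(int)
binary[:5, 2] = 1
labels = np.array(["rare"] * 5 + ["common"] * 95)
mask = keep_peaks(binary, labels)
```

The same idea carries over to re-called, per-cell-type peaks: apply the presence cutoff within the population the peak was called in, then merge.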