Normalizing scATACseq data

Hello,

I am opening a discussion on good practices for dealing with scATACseq data. Could anyone please tell me:

  1. Whether or not you use a binarized count matrix as input to any model (PeakVI or other)
  2. How do you subset your tiles? My methodology is the following (a rough code sketch is included after this list):
  • Remove tiles (= regions) appearing in less than 5% of the cells. This leaves me with ~100k tiles (from originally ~6M).
  • Do you further filter tiles, or do you use them directly as input for PeakVI? What I did was apply scanpy's preprocessing function “highly_variable_genes” (with flavor = seurat) after log-normalizing the binary count data.
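For reference, here is a minimal sketch of the steps above, assuming an AnnData object `adata` holding the cells × tiles count matrix; the cutoff values are placeholders, not recommendations:

```python
import scanpy as sc

# 1. Keep tiles detected (non-zero) in at least 5% of cells
min_cells = int(0.05 * adata.n_obs)
sc.pp.filter_genes(adata, min_cells=min_cells)

# 2. Stash the raw counts so a model like PeakVI can still use them later
adata.layers["counts"] = adata.X.copy()

# 3. Log-normalize, then select highly variable tiles (seurat flavor)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, flavor="seurat", n_top_genes=50_000)  # placeholder cutoff
adata = adata[:, adata.var["highly_variable"]].copy()
```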

Any feedback on personal experience is more than welcome :slight_smile:

Best

Philippe

Hi Philippe,
thank you for opening this discussion. I usually follow the pre-processing of the muon package, which uses TF-IDF (the log-scaled version) for normalization (that’s essentially a scaling by a size factor with some upweighting of rare peaks). I am hesitant to use a binarized version of the data matrix, because it over-simplifies the data. We tested PeakVI against scVI with a Poisson distribution and saw qualitatively better results for the latter. In fact, Laura Martens and colleagues have shown that a Poisson model of the fragment counts is better suited to single-cell ATACseq data than the binarized approach used by PeakVI (see the implementation in scvi-tools).
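To make that concrete, here is a rough sketch of what I mean, assuming raw tile/peak counts in `adata`; the exact calls may differ slightly between muon and scvi-tools versions, so treat this as an outline rather than a recipe:

```python
import scvi
from muon import atac as ac

# Keep the raw counts for the count-based models below
adata.layers["counts"] = adata.X.copy()

# Log-scaled TF-IDF normalization, as in the muon scATAC workflow
ac.pp.tfidf(adata, scale_factor=1e4)

# scVI with a Poisson likelihood on the (non-binarized) counts
scvi.model.SCVI.setup_anndata(adata, layer="counts")
poisson_vi = scvi.model.SCVI(adata, gene_likelihood="poisson")
poisson_vi.train()

# PeakVI on the same counts, for comparison
scvi.model.PEAKVI.setup_anndata(adata, layer="counts")
peakvi = scvi.model.PEAKVI(adata)
peakvi.train()

adata.obsm["X_poisson_scvi"] = poisson_vi.get_latent_representation()
adata.obsm["X_peakvi"] = peakvi.get_latent_representation()
```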
As far as the filtering goes, I usually follow the muon tutorial mentioned above, which mostly selects highly variable features by minimum expression and dispersion (similar to the seurat flavor in scanpy). I think that filtering out peaks present in less than 5% of the cells is OK unless you expect a rare cell population. In that case, I would consider rerunning the peak calling per cell type after clustering (and annotation), and then merging the peaks into a union set. In this setting, you may also want to adjust your filtering procedure to keep peaks that are only accessible in rare cell types. I personally tend to keep far more peaks than is beneficial for my computation.
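If you do go the route of per-cell-type peak calling, the merging step is just a union of the resulting intervals. Here is a toy sketch; the BED file names are placeholders, and in practice `bedtools merge` (or pyranges) on the concatenated, sorted peaks achieves the same thing:

```python
import pandas as pd

# Per-cell-type peak calls (placeholder file names)
bed_files = ["peaks_celltype_A.bed", "peaks_celltype_B.bed", "peaks_rare_population.bed"]

def read_bed3(path):
    # Read only chrom/start/end from a BED file; extra columns are ignored
    df = pd.read_csv(path, sep="\t", header=None).iloc[:, :3]
    df.columns = ["chrom", "start", "end"]
    return df

peaks = pd.concat(read_bed3(f) for f in bed_files).sort_values(["chrom", "start"])

# Merge overlapping intervals per chromosome into a union peak set
merged = []
for chrom, grp in peaks.groupby("chrom"):
    cur_start, cur_end = None, None
    for start, end in zip(grp["start"], grp["end"]):
        if cur_start is None:
            cur_start, cur_end = start, end
        elif start <= cur_end:          # overlaps the current interval -> extend it
            cur_end = max(cur_end, end)
        else:
            merged.append((chrom, cur_start, cur_end))
            cur_start, cur_end = start, end
    merged.append((chrom, cur_start, cur_end))

union_peaks = pd.DataFrame(merged, columns=["chrom", "start", "end"])
```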