I am working with a scATAC-seq / multi-timepoint reprogramming dataset and want to run PeakVI on a combined AnnData object across samples/timepoints. The issue is that peaks were called separately for each sample, so the peak coordinates differ slightly between samples even when they appear to represent the same biological region.
For example, for what looks like the same locus, I see:
* Day0: chr1:3210643-3211508
* Day4: chr1:3210640-3211510
* Day7: chr1:3210651-3211509
When I compare exact peak names across samples, the overlap is almost zero, even though biologically these regions seem to correspond to the same accessible site.
Example overlap counts:
* total peaks:
* Day0: 292,251
* Day4: 253,628
* Day7: 198,323
* exact intersections:
* Day0 ∩ Day4: 20
* Day0 ∩ Day7: 14
* Day4 ∩ Day7: 25
So exact string matching is clearly not the right way to harmonize features here. My question is: What is the recommended way to construct a shared feature space for PeakVI when peak boundaries differ slightly across samples?
More specifically: Using 10x / Cell Ranger / Cell Ranger ARC outputs, what is the best practice to harmonize peaks before building the final AnnData for PeakVI?
My concern is that if I use only the exact intersection of peaks, I will lose genuine timepoint-specific accessibility changes, which are biologically important in my context as a reprogramming trajectory. And also when running it as is I see 0 integration of timepoints (which should be happening based on what we know about the biology) with all different resolutions Any guidance on best practice for preprocessing before PeakVI would be very helpful.