Hi scverse community,
I’m building a single-cell maturation atlas integrating 23 publicly available and in-house datasets spanning embryonic day 7.75 through postnatal day 84 in mice (~58,000 cells after QC). I’m running into a classic biology vs. batch correction tradeoff and would appreciate any advice.
Dataset composition:
My atlas spans 6 distinct sequencing chemistries:
- 10x Chromium (whole tissue dissociation)
- Drop-seq (whole tissue)
- SCRB-seq / mcSCRB-seq (LP-FACS isolated CMs)
- 8bp-STRT-seq (manually picked cells)
- iCell8
- Parse Biosciences (split-pool combinatorial indexing)
Cell counts per batch group are highly imbalanced — from 48 cells (smallest dataset) to ~16,500 (largest dataset).
HVG selection:
I selected 3,987 HVGs using scanpy’s highly_variable_genes across all datasets before scVI training. Raw integer counts are stored in a dedicated layer.
What I’ve tried:
v1: chemistry-level batch key (6 batches), n_latent=30, n_layers=2
Result: under-corrected. Batches separated clearly in UMAP. Biology partially preserved but batch dominated.
v2: dataset-level batch key (23 batches), n_latent=30, n_layers=2, full ~32k cells
Result: over-corrected. Excellent batch metrics (ASW_label/batch=0.905, PCR_batch=0.958) but biology completely lost (NMI=0.323, ARI=0.128). The UMAP colored by numeric timepoint showed no gradient.
v3: dataset-level batch key, downsampled to 300 cells/dataset/timepoint (~10k cells)
Result: same over-correction as v2. Downsampling alone didn’t help.
v4: chemistry-level batch key, downsampled, n_latent=50, n_layers=3
Result: best so far. NMI improved to 0.515, ARI to 0.242, cLISI=0.975. Biology is partially preserved and shows the expected patterns in scVI denoised expression. However, iLISI remains low (0.070) and the chemistry-level batch groups still partially separate in UMAP.
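For anyone who wants to compare against the same settings, the four runs above could be set up roughly as follows. This is a minimal sketch rather than the exact training script: the `obs` column names (`chemistry`, `dataset`) and the layer name `counts` are placeholders, `RunConfig` and `train_run` are hypothetical helpers, v3 is assumed to reuse v2's `n_latent` and `n_layers`, and everything not listed for v1 to v4 is left at scvi-tools defaults.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    batch_key: str     # obs column used as the scVI batch key (placeholder name)
    n_latent: int      # size of the latent space
    n_layers: int      # hidden layers in encoder/decoder
    downsampled: bool  # True if the 300 cells/dataset/timepoint cap was applied


# Settings as listed for v1-v4. v3's n_latent/n_layers are an assumption
# (taken to be unchanged from v2), since only the batch key and cap are stated.
RUNS = {
    "v1": RunConfig("chemistry", 30, 2, False),
    "v2": RunConfig("dataset", 30, 2, False),
    "v3": RunConfig("dataset", 30, 2, True),
    "v4": RunConfig("chemistry", 50, 3, True),
}


def train_run(adata, cfg, counts_layer="counts"):
    """Train one scVI model on the raw-count layer; return (model, latent)."""
    # Imported lazily so the config table can be inspected without scvi-tools.
    import scvi

    scvi.model.SCVI.setup_anndata(
        adata, layer=counts_layer, batch_key=cfg.batch_key
    )
    model = scvi.model.SCVI(adata, n_latent=cfg.n_latent, n_layers=cfg.n_layers)
    model.train()
    return model, model.get_latent_representation()
```

A run would then look like `model, latent = train_run(adata, RUNS["v4"])`, with the latent stored in something like `adata.obsm["X_scVI"]` before computing the integration metrics.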
Key observations:
- Dataset-level batch key always over-corrects regardless of downsampling, covariates (log_n_counts, pct_mito), or network depth
- Chemistry-level batch key preserves biology but leaves residual batch structure
- iLISI is unreliable in my dataset due to massive cell count imbalance across timepoints (6 cells at E8.0 vs 9,714 at E16.5)
- The developmental trajectory spans such a wide biological range (cardiac crescent → mature adult CM) that scVI struggles to distinguish biology from batch
- scANVI is not ideal here because of transcriptional heterogeneity within timepoints
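To make the imbalance point concrete, here is a minimal sketch of a per-dataset, per-timepoint cap like the one used in v3, applied to toy labels with the E8.0 and E16.5 cell counts quoted above. It uses only the standard library; `cap_per_group` is a hypothetical helper, and the dataset names `A` and `B` are made up for illustration.

```python
import random
from collections import Counter, defaultdict


def cap_per_group(datasets, timepoints, cap=300, seed=0):
    """Return sorted cell indices, keeping at most `cap` cells per (dataset, timepoint)."""
    groups = defaultdict(list)
    for i, key in enumerate(zip(datasets, timepoints)):
        groups[key].append(i)

    rng = random.Random(seed)
    keep = []
    for members in groups.values():
        # Small groups are kept whole; only oversized groups are subsampled.
        if len(members) <= cap:
            keep.extend(members)
        else:
            keep.extend(rng.sample(members, cap))
    return sorted(keep)


# Toy labels mimicking the imbalance: 6 cells at E8.0 vs 9,714 at E16.5.
datasets = ["A"] * 6 + ["B"] * 9714
timepoints = ["E8.0"] * 6 + ["E16.5"] * 9714

keep = cap_per_group(datasets, timepoints, cap=300)
per_timepoint = Counter(timepoints[i] for i in keep)
print(per_timepoint)
```

In this toy case the cap keeps all 6 E8.0 cells and 300 of the E16.5 cells, so the ratio drops from roughly 1600:1 to 50:1 but does not disappear. With AnnData the same indices could be used as `adata[keep].copy()`, passing the relevant `adata.obs` columns as the two label lists.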
Questions:
- Is there a principled way to choose batch key granularity for datasets with mixed chemistries and highly imbalanced sizes?
- Has anyone had success with continuous covariates (sequencing depth, pct_mito) improving biology preservation when using dataset-level batch keys?
- For a continuous developmental trajectory dataset, are NMI/ARI appropriate metrics or are they fundamentally limited by the continuous nature of the data?
- Any recommendations for preserving trajectory structure without scANVI supervision?
Thanks in advance!