Integration challenges in a multi-dataset single cell type maturation atlas

Hi scverse community,

I’m building a single-cell maturation atlas integrating 23 publicly available and in-house datasets spanning embryonic day 7.75 through postnatal day 84 in mice (~58,000 cells after QC). I’m running into a classic biology vs. batch correction tradeoff and would appreciate any advice.

Dataset composition:
My atlas spans 6 distinct sequencing chemistries:

  • 10x Chromium (whole tissue dissociation)
  • Drop-seq (whole tissue)
  • SCRB-seq / mcSCRB-seq (LP-FACS-isolated cardiomyocytes, CMs)
  • 8bp-STRT-seq (manually picked cells)
  • iCell8
  • Parse Biosciences (split-pool combinatorial indexing)

Cell counts per dataset are highly imbalanced, ranging from 48 cells (smallest dataset) to ~16,500 (largest dataset).

HVG selection:
I selected 3,987 HVGs using scanpy’s highly_variable_genes across all datasets before scVI training. Raw integer counts are stored in a dedicated layer.

What I’ve tried:

v1: chemistry-level batch key (6 batches), n_latent=30, n_layers=2
Result: under-corrected. Batches separated clearly in UMAP. Biology partially preserved but batch dominated.

v2: dataset-level batch key (23 batches), n_latent=30, n_layers=2, full ~32k cells
Result: over-corrected. Excellent batch metrics (ASW_label/batch=0.905, PCR_batch=0.958) but biology completely lost (NMI=0.323, ARI=0.128). A UMAP colored by numeric timepoint showed no gradient.

v3: dataset-level batch key, downsampled to 300 cells/dataset/timepoint (~10k cells)
Result: same over-correction as v2. Downsampling alone didn’t help.

v4: chemistry-level batch key, downsampled, n_latent=50, n_layers=3
Result: best so far. NMI improved to 0.515, ARI to 0.242, cLISI=0.975. Biology is partially preserved, and the scVI denoised expression shows the expected patterns. However, iLISI remains low (0.070) and the chemistry groups still partially separate in the UMAP.
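
For context, capping at a fixed number of cells per dataset and timepoint (as in v3 and v4) can be sketched roughly as below. This is a minimal illustration on plain tuples, not the exact pipeline code: `downsample_per_group`, its arguments, and the toy input are hypothetical names, and in practice the same grouping would be applied to the cell metadata table.

```python
import random
from collections import defaultdict


def downsample_per_group(cells, max_per_group=300, seed=0):
    """Keep at most `max_per_group` cells per (dataset, timepoint) pair.

    `cells` is a list of (cell_id, dataset, timepoint) tuples (hypothetical layout).
    Groups smaller than the cap are kept in full.
    """
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    groups = defaultdict(list)
    for cell_id, dataset, timepoint in cells:
        groups[(dataset, timepoint)].append(cell_id)

    kept = []
    for key in sorted(groups):  # sorted for a deterministic output order
        ids = groups[key]
        if len(ids) > max_per_group:
            ids = rng.sample(ids, max_per_group)  # sample without replacement
        kept.extend(ids)
    return kept


# Toy input: one large group and one tiny group (made-up cell IDs).
toy_cells = [(f"a{i}", "dataset_A", "E16.5") for i in range(1000)]
toy_cells += [(f"b{i}", "dataset_B", "E8.0") for i in range(6)]
print(len(downsample_per_group(toy_cells)))
```

Note that a cap like this only trims the large groups; rare groups (such as a timepoint with a handful of cells) stay just as rare in absolute terms.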

Key observations:

  • Dataset-level batch key always over-corrects regardless of downsampling, covariates (log_n_counts, pct_mito), or network depth
  • Chemistry-level batch key preserves biology but leaves residual batch structure
  • iLISI is unreliable in my dataset due to massive cell count imbalance across timepoints (6 cells at E8.0 vs 9,714 at E16.5)
  • The developmental trajectory spans such a wide biological range (cardiac crescent → mature adult CM) that scVI struggles to distinguish biology from batch
  • scANVI is not ideal here because of transcriptional heterogeneity within timepoints
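
One way to see how imbalance alone can depress iLISI-type scores: LISI is based on the inverse Simpson index of batch labels in each cell's neighbourhood, so even under perfect mixing the neighbourhood composition can only match the global batch proportions. The sketch below computes that ceiling. It assumes the score is min-max rescaled to [0, 1] using the number of batches, which is my understanding of the scaled variant and should be checked against the implementation actually used. Only the values 48 and 16,500 come from my data; the other counts are made up for illustration.

```python
def inverse_simpson(counts):
    """Inverse Simpson index: effective number of equally sized batches."""
    total = sum(counts)
    return 1.0 / sum((c / total) ** 2 for c in counts)


def perfect_mixing_ceiling(counts):
    """Scaled score reached if every neighbourhood mirrored the global batch mix.

    Assumes a rescaling of (index - 1) / (n_batches - 1); this is an assumption
    about the scaled metric, not taken from a specific library.
    """
    n_batches = len(counts)
    return (inverse_simpson(counts) - 1.0) / (n_batches - 1)


balanced = [1000] * 6                                # hypothetical balanced case
imbalanced = [16500, 9000, 4000, 1500, 500, 48]      # 16500 and 48 from the post, rest hypothetical

print(perfect_mixing_ceiling(balanced))
print(perfect_mixing_ceiling(imbalanced))
```

With strongly skewed batch sizes, the ceiling sits well below 1, so a low raw value is hard to interpret without this reference point.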

Questions:

  1. Is there a principled way to choose batch key granularity for datasets with mixed chemistries and highly imbalanced sizes?
  2. Has anyone had success with continuous covariates (sequencing depth, pct_mito) improving biology preservation when using dataset-level batch keys?
  3. For a continuous developmental trajectory dataset, are NMI/ARI appropriate metrics or are they fundamentally limited by the continuous nature of the data?
  4. Any recommendations for preserving trajectory structure without scANVI supervision?
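
To make question 3 concrete: instead of cluster-overlap metrics, a trajectory-aware score could, for example, ask whether latent-space neighbours are close in developmental time. The sketch below is an ad hoc illustration, not an established benchmark metric; `knn_time_gap` and its inputs are hypothetical, and the brute-force neighbour search is only suitable for small toy inputs.

```python
import math


def knn_time_gap(latent, timepoints, k=5):
    """Mean absolute timepoint difference between each cell and its k nearest
    neighbours in latent space (lower means neighbours are closer in time).

    `latent` is a list of coordinate tuples, `timepoints` a list of numbers.
    """
    n = len(latent)
    per_cell = []
    for i in range(n):
        # Brute-force Euclidean distances to all other cells.
        dists = sorted(
            (math.dist(latent[i], latent[j]), j) for j in range(n) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        gap = sum(abs(timepoints[i] - timepoints[j]) for j in neighbours)
        per_cell.append(gap / len(neighbours))
    return sum(per_cell) / n


# Toy 1-D latent space with 20 cells.
toy_latent = [(float(t),) for t in range(20)]
ordered_time = list(range(20))                      # latent order follows time
scrambled_time = [(7 * t) % 20 for t in range(20)]  # same values, order destroyed

print(knn_time_gap(toy_latent, ordered_time, k=2))
print(knn_time_gap(toy_latent, scrambled_time, k=2))
```

A score like this does not depend on discretising the trajectory into clusters, but it does assume that the sampled timepoint is a reasonable proxy for maturation state, which may not hold given the heterogeneity within timepoints.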

Thanks in advance!

Hi, in my opinion these cases are very difficult. You could look at the organoid atlases HNOCA and HEOCA, which dealt with similar problems, and try applying their approach to your data.

thank you, I’ll look into that and give it a shot