Integration challenges in a multi-dataset single cell type maturation atlas

grad_student_01 · April 23, 2026, 6:03pm

Hi scverse community,

I’m building a single-cell maturation atlas integrating 23 publicly available and in-house datasets spanning embryonic day 7.75 through postnatal day 84 in mice (~58,000 cells after QC). I’m running into a classic biology vs. batch correction tradeoff and would appreciate any advice.

Dataset composition:
My atlas spans 6 distinct sequencing chemistries:

10x Chromium (whole tissue dissociation)
Drop-seq (whole tissue)
SCRB-seq / mcSCRB-seq (LP-FACS isolated CMs)
8bp-STRT-seq (manually picked cells)
iCell8
Parse Biosciences (split-pool combinatorial indexing)

Cell counts per batch group are highly imbalanced — from 48 cells (smallest dataset) to ~16,500 (largest dataset).

HVG selection:
I selected 3,987 HVGs using scanpy’s highly_variable_genes across all datasets before scVI training. Raw integer counts are stored in a dedicated layer.

What I’ve tried:

v1: chemistry-level batch key (6 batches), n_latent=30, n_layers=2
Result: under-corrected. Batches separated clearly in UMAP. Biology partially preserved but batch dominated.

v2: dataset-level batch key (23 batches), n_latent=30, n_layers=2, full ~32k cells
Result: over-corrected. Excellent batch metrics (ASW_label/batch=0.905, PCR_batch=0.958) but biology completely lost (NMI=0.323, ARI=0.128). The UMAP colored by numeric timepoint showed no gradient.

v3: dataset-level batch key, downsampled to 300 cells/dataset/timepoint (~10k cells)
Result: same over-correction as v2. Downsampling alone didn’t help.

v4: chemistry-level batch key, downsampled, n_latent=50, n_layers=3
Result: best so far. NMI improved to 0.515, ARI to 0.242, cLISI=0.975. Biology is partially preserved, and the scVI denoised expression shows the expected patterns. However, iLISI remains low (0.070), and the chemistry groups still partially separate in the UMAP.
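
For reference, the per-group downsampling used in v3 and v4 can be sketched roughly as below. This is a minimal, pure-Python illustration and not the actual pipeline code: the function name `downsample_per_group`, the tuple layout of `cells`, and the toy dataset names are all hypothetical.

```python
import random
from collections import defaultdict


def downsample_per_group(cells, max_per_group=300, seed=0):
    """Keep at most `max_per_group` cells per (dataset, timepoint) pair.

    `cells` is a list of (cell_id, dataset, timepoint) tuples. This layout
    is an assumption made for the sketch, not the structure of the atlas.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for cell_id, dataset, timepoint in cells:
        groups[(dataset, timepoint)].append(cell_id)

    kept = []
    for key in sorted(groups):
        ids = groups[key]
        # Only groups above the cap are subsampled; small groups pass through.
        if len(ids) > max_per_group:
            ids = rng.sample(ids, max_per_group)
        kept.extend(ids)
    return kept


# Toy input (hypothetical): one large group and one very small group.
toy_cells = [(f"a{i}", "dataset_A", "E16.5") for i in range(1000)]
toy_cells += [(f"b{i}", "dataset_B", "E8.0") for i in range(6)]

kept_ids = downsample_per_group(toy_cells, max_per_group=300)
print(len(kept_ids))  # 306: 300 capped cells plus the 6-cell group
```

Capping only shrinks the large groups. The smallest groups keep their original size, so the ratio between groups narrows but the rare timepoints stay rare.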

Key observations:

Dataset-level batch key always over-corrects regardless of downsampling, covariates (log_n_counts, pct_mito), or network depth
Chemistry-level batch key preserves biology but leaves residual batch structure
iLISI is unreliable in my dataset due to massive cell count imbalance across timepoints (6 cells at E8.0 vs 9,714 at E16.5)
The developmental trajectory spans such a wide biological range (cardiac crescent → mature adult CM) that scVI struggles to distinguish biology from batch
scANVI is not ideal here because of transcriptional heterogeneity within timepoints
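
One way to see how size imbalance alone can cap iLISI is to compute the score an ideally mixed embedding would get. The sketch below is an idealized calculation, not a real LISI implementation: it assumes every neighborhood mirrors the global batch proportions, and it assumes the reported iLISI uses the (LISI - 1) / (B - 1) rescaling. The batch sizes are hypothetical; only the 48 and 16,500 extremes come from the numbers above.

```python
def inverse_simpson(counts):
    """Effective number of categories: 1 / sum(p_i^2)."""
    total = sum(counts)
    return 1.0 / sum((c / total) ** 2 for c in counts)


def ilisi_ceiling(batch_sizes):
    """Scaled iLISI if every neighborhood mirrored the global batch mix.

    Uses the rescaling (LISI - 1) / (B - 1). Treating this as the scaling
    behind the reported iLISI is an assumption of this sketch.
    """
    n_batches = len(batch_sizes)
    return (inverse_simpson(batch_sizes) - 1.0) / (n_batches - 1)


# Six equally sized batches versus six strongly imbalanced ones.
# The intermediate sizes are made up for illustration.
balanced_sizes = [1000] * 6
imbalanced_sizes = [16500, 3000, 800, 400, 100, 48]

balanced_ceiling = ilisi_ceiling(balanced_sizes)
imbalanced_ceiling = ilisi_ceiling(imbalanced_sizes)
print(round(balanced_ceiling, 3), round(imbalanced_ceiling, 3))
```

Under these assumptions, the balanced case reaches the maximum of 1, while the imbalanced case tops out far lower even with perfect mixing. A low iLISI on imbalanced data is therefore hard to interpret without comparing it to this kind of ceiling.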

Questions:

Is there a principled way to choose batch key granularity for datasets with mixed chemistries and highly imbalanced sizes?
Has anyone had success with continuous covariates (sequencing depth, pct_mito) improving biology preservation when using dataset-level batch keys?
For a continuous developmental trajectory dataset, are NMI/ARI appropriate metrics or are they fundamentally limited by the continuous nature of the data?
Any recommendations for preserving trajectory structure without scANVI supervision?
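
To make the NMI/ARI question concrete, here is a toy sketch of how ARI behaves when a continuous axis is cut into discrete labels in two slightly different ways. It uses a hand-written ARI over a hypothetical 1D pseudotime, not the atlas data or any particular benchmarking package. The cell count, stage width, and cut offset are arbitrary choices for illustration.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index computed from the contingency table of two labelings."""
    n = len(labels_a)
    pair_counts = Counter(zip(labels_a, labels_b))
    a_counts = Counter(labels_a)
    b_counts = Counter(labels_b)
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_a = sum(comb(c, 2) for c in a_counts.values())
    sum_b = sum(comb(c, 2) for c in b_counts.values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_pairs - expected) / (max_index - expected)


# 120 cells ordered along a hypothetical pseudotime axis.
pseudotime = list(range(120))

# "Annotated" stages: four equal-width bins along the axis.
stage_labels = [t // 30 for t in pseudotime]

# "Clusters": same bin width, but the cut points are shifted by half a bin.
# The ordering of cells along the trajectory is perfectly preserved.
cluster_labels = [(t + 15) // 30 for t in pseudotime]

ari_shifted = adjusted_rand_index(stage_labels, cluster_labels)
ari_identical = adjusted_rand_index(stage_labels, stage_labels)
print(round(ari_shifted, 3), round(ari_identical, 3))
```

In this toy case the shifted cuts score well below 1, even though no cell is out of order along the trajectory. That suggests part of a low ARI on a continuum can come from where the boundaries fall rather than from lost biology.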

Thanks in advance!

cane11 · April 23, 2026, 6:22pm

Hi, in my opinion these cases are very difficult. You could look at the organoid atlases HNOCA and HEOCA, which dealt with similar problems, and try applying their solution to your data.

grad_student_01 · April 24, 2026, 3:40pm

Thank you, I’ll look into that and give it a shot.
