What model to use when integrating batches of scRNA-seq matrices containing >150,000 T and innate lymphoid cell (ILC) sub-populations

Hi everyone,
I have already identified >150,000 pan-T and pan-ILC cells from >200 gene expression matrices across >15 publicly available human scRNA-seq datasets; next I will re-cluster and interrogate them. I have also screened published scRNA-seq work similar to my downstream analysis scenario and list a few examples below:

| Study | Cells | Population | Datasets | Integration approach |
|---|---|---|---|---|
| Cheng et al., 2021, Cell 184, 792–809 | >130,000 | myeloid cells | diverse datasets | Scanpy + Scanorama, re-clustered |
| Zheng et al., 2021, Science 374, 1462 | >390,000 | T cells | diverse datasets | Seurat + Harmony (mini-clusters used), re-clustered |
| Zheng et al., medRxiv doi:10.1101/2021.09.17.21263540 | >67,000 | T cells | one dataset | Seurat + STACAS R package, re-clustered |
| Schnell et al., 2021, Cell 184, 6281–6298 | >84,000 | Th17 cells | one dataset | Seurat + SCTransform, re-clustered |

As is well known, T cells and ILCs form a transcriptional continuum, and the same continuity is seen among the various T-cell subpopulations. My present data also carry a strong technical batch effect (droplet-based and plate-based chemistries; 5′-tag, 3′-tag and full-length protocols; varying sequencing depth and saturation). Given both of these, the mnnCorrect algorithm and its derivatives (e.g. BBKNN used in Scanpy; Polański et al., Bioinformatics 2020, 36(3):964–965) could be unsuitable for my downstream integration pipeline, because they assume that differences between the same cell type across batches (caused by batch effects) are smaller than differences between distinct cell types within a batch. Additionally, the CCA-plus-anchors-based integration used in Seurat may lead to overcorrection and large computing resource consumption, though the STACAS R package may mitigate Seurat's overcorrection to a certain degree. Can the powerful scvi-tools give me a better solution? If so, which models should I use for data integration?

Hi Duan,

I'm not sure what you mean by ‘transcriptional shift continuity’ or what the nature of the issues you mention would be. But when I do these sorts of analyses I use the SCVI model, with the AnnData set up to have a batch column; I find it works better if you make sure all batches from the original publications are represented. (Given your description, this would presumably be the ‘>200’ number you mention.)

You can use the learned representation for clustering.

I would focus on investigating the gene expression predictions of known markers for T and ILC subsets. If you just use the observed batches (which is the default), the gene expression predictions will show you the expression explained by batches.

It seems your concern is that the learned representation does not contain variation you are interested in, namely variation between batches? You can investigate between-batch variation using the .differential_expression() method by selecting a cell population and passing the method the IDs of the batches you want to compare. The most important point is that you have access to the generative model of gene expression, f(Z, c), where Z is the latent representation and c are the known batches. Whether you want the effect of some phenomenon to show up in Z or in c is a modeling choice, but the model will represent that variation in the generative model either way, and in both cases you can investigate it through the .differential_expression() or .get_normalized_expression() methods.


Hi Valentine, thanks for your reply, and sorry for my unclear wording. Briefly, for example: if I wanted to further re-cluster and characterise the T/NK cells, or more narrowly the NK sub-population, in the atlas-level integrated lung dataset posted on the scvi-tools tutorial page, would scVI still be up to the task? And would it remain so if both the T/NK cell number and the number of donor samples (batches) were considerably larger?
The term “transcriptional shift continuity” relates to cell heterogeneity and cell type abundance. For example, immune and non-immune cells differ substantially in their transcriptomes, whereas CD4+ and CD8+ T subclusters differ much less and sit in adjacent embeddings in a UMAP plot. In my knowledge and practice, BBKNN in Scanpy and CCA-plus-anchors in Seurat, both well-known integration algorithms, each have limitations when handling datasets like those in my initial post. As a beginner in algorithms and AI, I need guidance on my downstream T and ILC integration pipeline. The generative model you mention is still a blind box to me, and I still wonder whether scvi-tools is suitable.

Hi Duan,

Yes, the subclustering you describe is typically what I do, though I annotate based on known markers rather than de novo clustering.

I think I understand what you mean by ‘transcriptional shift’ now. And as you mention, the analysis can be done in two ways. Either you fix a broader cell class and investigate which genes change in expression between conditions/batches in that broad class. Or you can define smaller more refined classes with e.g. an activation signature, then count ‘Activated CD4+’ vs ‘Resting CD4+’ cells (just as an example).

One of the main reasons I like using scVI is the ability to ‘open the blind box’ using the .get_normalized_expression() and .differential_expression() methods. For example, if you see a cell population you expect to be homogeneously represented after integration split into two groups, you can ask the model (using the .differential_expression() method) why the cells are represented by different coordinates. Then you get a list of genes which differ in expression levels between those groups. Next, you can use .get_normalized_expression() to investigate how those expression levels depend on various metadata in your dataset. You can also interpret the set of differential genes themselves, to learn whether specific pathways are regulated.


Following your example, will those cells split into two groups be merged into one after I check things per your suggestion? By adjusting parameters like ‘resolution’ in the Seurat pipeline? By modifying the model? Or something else?

For the ‘activated vs resting’ example, I’d keep them separate. Mostly because it’s easy to lose track of what you are looking at or talking about if you use hierarchical groups, and it can get ambiguous which level of the hierarchy you’re currently at.

Another question: what is the difference between ‘gene’ and ‘gene-batch’ when setting ‘dispersion’ in scvi.model.SCVI()?

Hi Duan,

This relates to the over-dispersion parameter of the negative binomial (and ZINB) observation distribution. The ‘gene’ setting gives the model one over-dispersion parameter per gene. The ‘gene-batch’ setting allows each gene to have a different over-dispersion parameter in each batch. The idea is that you might have data from two experiments where one uses a noisier technology than the other.