Hi everyone,
I have identified >150,000 pan-T and pan-ILC cells from >200 gene expression matrices across >15 publicly available human scRNA-seq datasets. Next, I will re-cluster and interrogate them. I have also screened published scRNA-seq studies whose analysis scenarios resemble mine, and list a few below:
Reference ________Cell number ________Cell type ________Data source ________Integration and re-clustering method
Cheng et al., Cell 184, 792–809 (2021) ________>130,000 ________myeloid cells ________diverse datasets ________Scanpy + Scanorama
Zheng et al., Science 374, 1462 (2021) ________>390,000 ________T cells ________diverse datasets ________Seurat + Harmony (mini-clusters used)
Zheng et al., medRxiv doi:10.1101/2021.09.17.21263540 ________>67,000 ________T cells ________one dataset ________Seurat + STACAS R package
Schnell et al., Cell 184, 6281–6298 (2021) ________>84,000 ________Th17 cells ________one dataset ________Seurat + sctransform
T cells and ILCs are known to form a transcriptional continuum rather than sharply separated states, and the same continuity is seen among T cell subpopulations. On top of this continuity, my data carry strong technical batch effects (droplet-based and plate-based platforms; 5′-tag, 3′-tag and full-length protocols; varying sequencing depth and saturation). mnnCorrect and its derivatives (e.g. bbknn as used in Scanpy; Polański et al., Bioinformatics 2020, 36(3), 964–965) require that the differences between cells of the same type across batches, caused by batch effects, be smaller than the differences between cells of different types within a batch, so they may be unsuitable for my downstream integration pipeline. In addition, the CCA-plus-anchors integration used in Seurat may lead to overcorrection and heavy consumption of computing resources, although the STACAS R package may mitigate the overcorrection to some degree.
Can scvi-tools give me a better solution? If so, which models should I use for data integration?
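To make the mnnCorrect assumption mentioned above concrete, here is a minimal sketch of how it could be checked on labelled data. Everything in it is illustrative: the function name `mnn_assumption_distances`, the toy two-gene vectors, and the use of Euclidean distance between group centroids are my own assumptions, not part of mnnCorrect, bbknn, or scvi-tools.

```python
import math
from collections import defaultdict


def centroid(vectors):
    """Per-gene mean of a list of equal-length expression vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mnn_assumption_distances(cells):
    """Compare batch shift against cell-type separation (hypothetical helper).

    cells: iterable of (batch, cell_type, expression_vector).
    Returns (largest same-type distance across batches,
             smallest between-type distance within a batch),
    both measured between group centroids.
    """
    groups = defaultdict(list)
    for batch, cell_type, vector in cells:
        groups[(batch, cell_type)].append(vector)
    centroids = {key: centroid(vecs) for key, vecs in groups.items()}

    keys = sorted(centroids)
    batch_shifts, type_gaps = [], []
    for i, (b1, t1) in enumerate(keys):
        for b2, t2 in keys[i + 1:]:
            d = euclidean(centroids[(b1, t1)], centroids[(b2, t2)])
            if t1 == t2 and b1 != b2:
                batch_shifts.append(d)   # same type, different batch
            elif b1 == b2 and t1 != t2:
                type_gaps.append(d)      # same batch, different type
    return max(batch_shifts), min(type_gaps)


# Toy data: batch B is batch A shifted by +3 on both genes.
toy_cells = [
    ("A", "Th17", [1.0, 0.0]), ("A", "Th17", [1.2, 0.1]),
    ("A", "Treg", [0.0, 1.0]), ("A", "Treg", [0.1, 1.1]),
    ("B", "Th17", [4.0, 3.0]), ("B", "Th17", [4.2, 3.1]),
    ("B", "Treg", [3.0, 4.0]), ("B", "Treg", [3.1, 4.1]),
]
shift, gap = mnn_assumption_distances(toy_cells)
assumption_holds = shift < gap
print(f"batch shift={shift:.2f}, type gap={gap:.2f}, holds={assumption_holds}")
```

In this toy case the batch shift is larger than the gap between the two cell types, so the assumption fails. That is the situation I suspect in my own data, where neighbouring T cell states are close and the platforms differ a lot.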