The best practice for detecting differential gene expression between 2 types of samples

Changxu_Fan · November 12, 2021, 12:30am

Dear community,

The datasets I have come from WT or KO animals, each with multiple batches. They contain roughly the same cell types, although KO samples have very few cells in certain cell types. The goal is to match all datasets by cell type, and detect gene expression dysregulation in KO cells for each cell type. The classic Seurat pipeline would first run CCA to align all datasets, and then use the uncorrected (but normalized) counts for differential expression between WT and KO.

I was wondering how to approach this within the scvi-tools framework. Should I use batch-corrected data for DEG detection, or uncorrected values? Also, is there a way to identify a KO-cell-specific gene expression signature without running DEG detection? I’m asking because there are cases where the genes in an entire pathway are up- or downregulated in a concerted manner, but only to a small extent. These genes wouldn’t be called DEGs although they are meaningful. Is there an existing framework that directly compares KO and WT cells at the pathway level without running gene-wise comparisons first?

Thank you so much!!

adamgayoso · November 12, 2021, 1:19am

Generally, the scvi-tools approach would be the same, except you would use SCVI to integrate the datasets. @Valentine_Svensson might have better insight on your DE questions.
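
A minimal sketch of that integration step, assuming an AnnData object with raw counts in adata.X and a hypothetical obs column named "batch". The helper batches_by_condition is not part of scvi-tools; it only checks the design described in the question, where every batch comes from either a WT or a KO animal.

```python
from collections import defaultdict


def batches_by_condition(batches, conditions):
    """Group batch labels by condition; fail if a batch spans two conditions."""
    seen = {}
    grouped = defaultdict(set)
    for batch, condition in zip(batches, conditions):
        if seen.setdefault(batch, condition) != condition:
            raise ValueError(f"batch {batch!r} appears in more than one condition")
        grouped[condition].add(batch)
    return {condition: sorted(names) for condition, names in grouped.items()}


def integrate_with_scvi(adata, batch_key="batch"):
    """Train one SCVI model on all samples and store its latent space.

    Assumes raw counts in adata.X and batch labels in adata.obs[batch_key]
    (the column name is a placeholder).
    """
    import scvi  # imported lazily so the helper above runs without scvi-tools

    scvi.model.SCVI.setup_anndata(adata, batch_key=batch_key)
    model = scvi.model.SCVI(adata)
    model.train()
    # The latent representation is what you would cluster or annotate on.
    adata.obsm["X_scVI"] = model.get_latent_representation()
    return model
```

Clustering or label transfer on adata.obsm["X_scVI"] then gives cell type labels that are shared across WT and KO samples.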

Valentine_Svensson · January 20, 2022, 7:55am

The key is that you want the cell labels to be consistent between the datasets. I have seen many papers where people annotate cells independently in WT and KO datasets before combining them and then running DE for KO vs WT. This has the problem that you can’t be sure that the labeling (which I prefer to call ‘stratification’) puts e.g. ‘B cell’ on a consistent population in the two datasets. There are many cases where classical cell type markers are perturbed by treatment. You won’t know if you are actually doing ‘cell type marker discovery DE’ or ‘per cell type perturbation DE’ (which is what you want to do).

Your proposed workflow of integrating with Seurat to stratify cells, then taking the observed counts to do per-group (“cell type”) DE, is reasonable (actually much more reasonable than the ‘cell type marker discovery DE’ workflow for Seurat). One question that is underappreciated in the field, though: how many independent samples (replicates) do you have for WT and KO?

My favorite aspect of scVI is that it allows you to perform both the batch-consistent stratification task and the ‘per-group perturbation DE’ with the same model! This ensures that 1) the genes that are DE between batches, 2) the genes that are DE between stratification groups (cell types), and 3) the genes perturbed by treatment are all independent, which avoids confounded effects.

Think about how gene expression is represented in a batch-integrated scVI model. The expression of a gene is determined by f(z, b) (omitting various scaling factors). If the model trains correctly, z will represent variation that is not due to b. If your batches b are divided into populations b_wt and b_ko, then you can define a set Z_1, a ‘region’ of z values that corresponds to a cell type, and compare f(Z_1, b_ko) with f(Z_1, b_wt) over that same region.
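
One way to write this comparison down, as a sketch rather than a description of what scVI computes internally: the gene index g, the averaging over the batches within each condition, and the threshold \delta are all notation and assumptions added here.

$$
\mathrm{LFC}_g(z) = \log_2 \frac{\frac{1}{|b_{ko}|} \sum_{b \in b_{ko}} f_g(z, b)}{\frac{1}{|b_{wt}|} \sum_{b \in b_{wt}} f_g(z, b)}, \qquad z \in Z_1
$$

A gene would then be called perturbed in the cell type defined by Z_1 when the posterior probability

$$
P\left( \left| \mathrm{LFC}_g(z) \right| > \delta \;\middle|\; z \in Z_1 \right)
$$

is high. Because the same z values enter the numerator and the denominator, differences between cell types cancel out and only the effect of b remains.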

There are still some limitations. scVI assumes that all cells within a potential Z_1 region are independent, but they usually come from three or so replicated samples. I’m hoping that in the future scVI will account for this replication structure when calculating posterior probabilities, to avoid inflated false positive rates.

Some practical advice: when the batches contribute unequal numbers of observed cells to a Z_1 region (as in your case), you get some confounding between batches and perturbations (KO). This means that when you run the .differential_expression() method you will get markers for both treatment and batches. I have found that setting batch_correction=True in the .differential_expression() method corrects this, and you get genes that are DE between treatments after filtering out markers for batches. I’m not entirely sure why and how this works, but I have had great success with it. Highly recommended!
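
A hedged sketch of such a call for one cell type, assuming a trained SCVI model and hypothetical obs columns "cell_type" and "condition" with the values "KO" and "WT". The helper names are made up for illustration; only .differential_expression() and its idx1, idx2 and batch_correction arguments come from scvi-tools.

```python
def condition_masks(cell_types, conditions, cell_type, case="KO", control="WT"):
    """Boolean masks selecting case and control cells within one cell type."""
    pairs = list(zip(cell_types, conditions))
    idx_case = [t == cell_type and c == case for t, c in pairs]
    idx_control = [t == cell_type and c == control for t, c in pairs]
    return idx_case, idx_control


def ko_vs_wt_de(model, adata, cell_type,
                cell_type_key="cell_type", condition_key="condition"):
    """KO vs WT differential expression within one cell type.

    The obs column names are placeholders; adapt them to your data.
    """
    import numpy as np  # imported lazily so condition_masks has no dependencies

    idx_ko, idx_wt = condition_masks(
        adata.obs[cell_type_key].tolist(),
        adata.obs[condition_key].tolist(),
        cell_type,
    )
    return model.differential_expression(
        adata=adata,
        idx1=np.asarray(idx_ko),   # KO cells of this cell type
        idx2=np.asarray(idx_wt),   # WT cells of the same cell type
        batch_correction=True,     # the setting recommended in this post
    )
```

Calling ko_vs_wt_de(model, adata, "B cell") would return the usual scvi-tools DE table for KO versus WT B cells.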

If your replication is more complicated (which tends to happen for logistical reasons in molecular biology experiments), I unfortunately have to recommend that you do the integration and stratification with scVI to get consistent cell type labels, then use a generalized linear mixed model (for example with the lme4 R package) to do differential expression analysis between perturbations.
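
As an illustration only, one common form of such a model for gene g in cell c of a single cell type, with a random intercept for the replicate r(c) that the cell came from. The count distribution, the size factor s_c and all symbols are assumptions made for this sketch, not something specified in this thread.

$$
y_{gc} \sim \mathrm{NegBin}(\mu_{gc}, \phi_g), \qquad
\log \mu_{gc} = \log s_c + \beta_{g0} + \beta_{g1}\,\mathrm{KO}_c + u_{g,\,r(c)}, \qquad
u_{g,r} \sim \mathcal{N}(0, \sigma_g^2)
$$

Here \mathrm{KO}_c is 1 for KO cells and 0 for WT cells, and the test of interest is \beta_{g1} = 0. The random effect u_{g,r} lets cells from the same replicate share a deviation, so they are no longer treated as independent observations. In lme4, models of this kind are typically fit with glmer or glmer.nb, with a term such as (1 | sample) for the replicate.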

I think the next great challenge in the field is how to properly handle complex replication structure in experiments to avoid ‘pseudoreplication’.

Best,
/Valentine

aj95b · July 8, 2025, 7:40pm

Do you recommend setting up the AnnData and training separate scVI models on subsets of the samples to perform cell-type-specific differential expression analysis? Or is it better to train only one model and use the model.differential_expression method for the different cell types?

ori-kron-wis · July 9, 2025, 1:53pm

Train one model.
Having more samples matters for global batch correction, noise removal, and statistical power, especially if one of your cell_type groups is small.
