The best practice for detecting differential gene expression between 2 types of samples

Dear community,

The datasets I have come from WT or KO animals, each with multiple batches. They contain roughly the same cell types, although KO samples have very few cells in certain cell types. The goal is to match all datasets by cell type, and detect gene expression dysregulation in KO cells for each cell type. The classic Seurat pipeline would first run CCA to align all datasets, and then use the uncorrected (but normalized) counts for differential expression between WT and KO.

I was wondering how to approach this within the scvi-tools framework. Should I use batch-corrected data for DEG detection, or uncorrected values? Also, is there a way to identify a KO-specific gene expression signature without running DEG detection? I’m asking because there are cases where the genes in an entire pathway are up- or downregulated in a concerted manner, but only to a small extent. These genes wouldn’t be called DEGs, although they are meaningful. Is there an existing framework that directly compares KO and WT cells at the pathway level without running gene-wise comparisons first?

Thank you so much!!

Generally, the scvi-tools approach would be the same, except you would use SCVI to integrate the datasets. @Valentine_Svensson might have better insight on your DE questions.

The key is that you want the cell labels to be consistent between the datasets. I have seen many papers where people annotate cells independently in WT and KO datasets before combining them and then running DE for KO vs WT. This has the problem that you can’t be sure that the labeling (which I prefer to call ‘stratification’) puts e.g. ‘B cell’ on a consistent population in the two datasets. There are many cases where classical cell type markers are perturbed by treatment. You won’t know if you are actually doing ‘cell type marker discovery DE’ or ‘per cell type perturbation DE’ (which is what you want to do).

Your proposed workflow of integration with Seurat to stratify cells, then taking the observed counts to do per-group (“cell type”) DE is reasonable (actually much more reasonable than the ‘cell type marker discovery DE’ workflow for Seurat). Something underappreciated in the field, though, is the number of independent samples (replicates): how many do you have for WT and KO?

My favorite aspect of scVI is that it allows you to perform both the batch-consistent stratification task and the ‘per-group perturbation DE’ with the same model! This ensures that 1) the genes that are DE between batches, 2) the genes that are DE between stratification groups (cell types), and 3) the genes perturbed by treatment are all independent, which avoids confounded effects.

Think about how gene expression is represented in a batch-integrated scVI model. The expression of a gene is determined by f(z, b) (omitting various scaling factors). If the model trains correctly, z will represent variation not due to b. If your batches b are divided into populations b_wt and b_ko, then by defining a set Z_1, a ‘region’ of z values that corresponds to a cell type, you can compare f(Z_1, b_ko) with f(Z_1, b_wt) over that same region.
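To make the f(z, b) comparison concrete, here is a minimal toy sketch. It is not scVI's actual decoder: the function `f`, the `BATCH_OFFSET` values, and the region `Z_1` are all made up for illustration. It only shows the idea of holding a region of z fixed and swapping the batch argument between KO and WT batches.

```python
import math

# Hypothetical per-batch effects for one gene (made-up numbers, not from a real model).
BATCH_OFFSET = {"wt_1": 0.0, "wt_2": 0.1, "ko_1": 0.9, "ko_2": 1.1}


def f(z, b, weight=0.5):
    """Toy stand-in for a decoder: expression of one gene given latent z and batch b."""
    return math.exp(weight * z + BATCH_OFFSET[b])


def mean_expression(region, batches):
    """Average f(z, b) over every z in the region and every batch in the group."""
    values = [f(z, b) for z in region for b in batches]
    return sum(values) / len(values)


def log2_fold_change(region, batches_ko, batches_wt):
    """Compare f(Z_1, b_ko) with f(Z_1, b_wt) over the same region of z."""
    ko = mean_expression(region, batches_ko)
    wt = mean_expression(region, batches_wt)
    return math.log2(ko / wt)


# A made-up set of latent values standing in for one cell type.
Z_1 = [-0.2, 0.0, 0.1, 0.3]
lfc = log2_fold_change(Z_1, ["ko_1", "ko_2"], ["wt_1", "wt_2"])
print(f"log2 fold change KO vs WT in Z_1: {lfc:.2f}")
```

Because the same z values are used on both sides, the difference comes only from the batch argument, which is the point of comparing within one region of the latent space.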

There are still some limitations. scVI assumes all cells within a potential Z_1 region are independent, but they usually come from 3 or so replicated samples. I’m hoping that in the future scVI will account for this replication structure when calculating posterior probabilities, to avoid inflated false positive rates.

Some practical advice: when the batches contribute unequal numbers of observed cells to a Z_1 region (as in your case), you get some confounding between batches and perturbations (KO). This means that when you run the .differential_expression() method you will get markers for both treatment and batches. I have found that setting batch_correction=True in the .differential_expression() method corrects this, and you get genes that are DE between treatments after filtering out markers for batches. I’m not entirely sure why and how this works, but I have had great success with it. Highly recommended!
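A minimal sketch of how this call might look for one cell type, assuming an already trained scvi-tools model whose `.differential_expression()` accepts boolean masks through `idx1` and `idx2`. The helper name `ko_vs_wt_de` and the `"KO"`/`"WT"` labels are hypothetical; the only setting taken from the advice above is `batch_correction=True`.

```python
def ko_vs_wt_de(model, cell_types, conditions, cell_type,
                ko_label="KO", wt_label="WT"):
    """Run KO-vs-WT DE within one cell type using a trained model.

    `cell_types` and `conditions` are per-cell labels, in the same order
    as the cells the model was trained on.
    """
    pairs = list(zip(cell_types, conditions))
    # Boolean masks selecting the KO and WT cells of the requested cell type.
    idx_ko = [t == cell_type and c == ko_label for t, c in pairs]
    idx_wt = [t == cell_type and c == wt_label for t, c in pairs]
    if not any(idx_ko) or not any(idx_wt):
        raise ValueError(f"No KO or no WT cells found for {cell_type!r}")
    # batch_correction=True is the setting recommended in the post above.
    return model.differential_expression(
        idx1=idx_ko, idx2=idx_wt, batch_correction=True
    )
```

With an AnnData object this might be called as `ko_vs_wt_de(model, adata.obs["cell_type"], adata.obs["condition"], "B cell")`, where the column names are assumptions about how the data is annotated. For cell types with very few KO cells, the error check at least makes an empty comparison fail loudly.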

If your replication is more complicated (which tends to happen for logistical reasons in molecular biology experiments), I unfortunately have to recommend that you do the integration and stratification with scVI to get consistent cell type labels, then use a generalized linear mixed model (for example with the lme4 R package) to do differential expression analysis between perturbations.
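As a rough sketch of the hand-off to a mixed model, the helper below builds a long-format table for one gene within one cell type, keeping the sample (replicate) label so it can enter the model as a random effect. The function name, the column names, and the use of the log of total counts as an offset are all assumptions for illustration, not a prescribed format.

```python
import math


def build_glmm_rows(counts, gene_idx, samples, conditions, keep):
    """Build one row per cell for a single gene.

    counts:     list of per-cell count vectors (cells x genes)
    gene_idx:   column of the gene to model
    samples:    per-cell replicate labels (the random-effect grouping)
    conditions: per-cell WT/KO labels (the fixed effect)
    keep:       per-cell booleans, e.g. membership in one cell type
    """
    rows = []
    for cell_counts, sample, condition, use in zip(counts, samples, conditions, keep):
        total = sum(cell_counts)
        if not use or total == 0:
            continue  # skip cells outside the stratum or with no counts
        rows.append({
            "count": cell_counts[gene_idx],
            "condition": condition,
            "sample": sample,
            "log_total": math.log(total),  # assumed library-size offset
        })
    return rows


# The rows could be written to CSV and fit in R with a formula along the lines of
# count ~ condition + (1 | sample) + offset(log_total), Poisson family (an assumption).
```

The important part is that `sample` travels with every cell, so the model can treat cells from the same animal as correlated rather than independent.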

I think the next great challenge in the field is how to properly handle complex replication structure in experiments, to avoid ‘pseudoreplication’.

Best,
/Valentine
