Differential expression between datasets

Hi all,

I was wondering if scVI’s differential expression module still works in the following case:

I integrated two datasets with scANVI:

  • dataset1: healthy lung tissue, mixture of immune, stromal, endothelial and epithelial cells, 10x
  • dataset2: lung cancer tissue, also a mixture of the same cell types, Smart-seq2

Judging from the UMAP plot, the integration worked reasonably well.

For each cell type, I would like to compare healthy vs. tumor samples.

With any classical DE method (e.g. a linear model), I’d expect the results of this comparison to mainly reflect the technical differences between the 10x and the Smart-seq2 platforms. Since scANVI appears to successfully reduce the batch effects in the latent representation, I was wondering if such comparisons between datasets from different platforms become feasible as well.

Best regards,


As you suspect, you will have the same issue as with a linear model: you cannot know whether the fold changes you find are due to technical differences or to disease status.

To do this kind of differential expression, where you have confounding between data source and condition, the easiest solution is to find more data sources. Then you can treat each data source as a replicate.

If you have three independent healthy lung data sets and three independent lung cancer data sets, you can integrate these to harmonize the cell type annotations, then do within-cell-type differential expression between cancer vs healthy across the six total data sets.
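To make the replicate idea concrete, here is a minimal sketch of a check for whether condition is confounded with data source. It uses plain Python on hypothetical per-cell metadata; the field names `dataset` and `condition` and the rule of "at least two data sets per condition" are assumptions for illustration, not part of scVI.

```python
from collections import defaultdict


def datasets_per_condition(cells):
    """Map each condition to the set of data sets in which it occurs."""
    seen = defaultdict(set)
    for cell in cells:
        seen[cell["condition"]].add(cell["dataset"])
    return dict(seen)


def is_confounded(cells):
    """True if any condition is represented by fewer than two data sets."""
    return any(len(ds) < 2 for ds in datasets_per_condition(cells).values())


# Hypothetical metadata, one record per cell.
# Case 1: one healthy and one tumor data set (the situation in the question).
two_datasets = [
    {"dataset": "healthy_10x", "condition": "healthy"},
    {"dataset": "tumor_ss2", "condition": "tumor"},
]

# Case 2: three independent data sets per condition.
six_datasets = [
    {"dataset": f"{cond}_{i}", "condition": cond}
    for cond in ("healthy", "tumor")
    for i in range(3)
]

print(is_confounded(two_datasets))  # True
print(is_confounded(six_datasets))  # False
```

With only one data set per condition, any difference between conditions is also a difference between data sets, so the two cannot be separated.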


Hi Valentine,

thanks for your quick reply and for your suggestion!
To make sure I got it right: By “treating the datasets as replicates” you mean

  • integrating them using scVI while specifying dataset as the batch variable, and then
  • simply using SCVI.differential_expression() for the comparison?

Or are you rather referring to “pseudo-bulk” comparisons, such that every dataset becomes a sample?


At the moment scVI can’t deal with hierarchical samples (I think?). I use generalized linear mixed models, but pseudo-bulk should also work. For pseudo-bulk, you can split the total data by the (harmonized) cell type annotation and by dataset, so that each cell type and dataset combination becomes one sample.
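A minimal sketch of the pseudo-bulk split, assuming raw counts are available per cell and that the harmonized cell type and the dataset are stored as per-cell labels. The record layout, gene names and the `pseudobulk` helper are hypothetical, and plain Python is used only to keep the example dependency-free.

```python
from collections import defaultdict


def pseudobulk(cells, genes):
    """Sum raw counts over all cells sharing (cell_type, dataset)."""
    sums = defaultdict(lambda: [0] * len(genes))
    for cell in cells:
        key = (cell["cell_type"], cell["dataset"])
        for j, count in enumerate(cell["counts"]):
            sums[key][j] += count
    # One profile (gene -> summed count) per cell type and dataset.
    return {key: dict(zip(genes, total)) for key, total in sums.items()}


# Hypothetical toy data: counts are ordered like `genes`.
genes = ["GENE_A", "GENE_B"]
cells = [
    {"cell_type": "T cell", "dataset": "healthy_1", "counts": [1, 0]},
    {"cell_type": "T cell", "dataset": "healthy_1", "counts": [2, 3]},
    {"cell_type": "T cell", "dataset": "tumor_1", "counts": [0, 5]},
    {"cell_type": "B cell", "dataset": "tumor_1", "counts": [4, 4]},
]

profiles = pseudobulk(cells, genes)
print(profiles[("T cell", "healthy_1")])  # {'GENE_A': 3, 'GENE_B': 3}
```

Each resulting profile can then be treated as one sample in a bulk-style comparison of cancer vs. healthy within a cell type, with the datasets acting as replicates.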


For future reference: I found this article really helpful in explaining the point of using mixed effects models for single-cell DE analyses.