I was wondering if scVI’s differential expression module still works in the following case:
I integrated two datasets with scANVI:
dataset1: healthy lung tissue, mixture of immune, stromal, endothelial and epithelial cells, 10x
dataset2: lung cancer tissue, also a mixture of the same cell types, Smart-seq2
Judging from the UMAP plot, the integration worked reasonably well.
For each cell type, I would like to compare healthy vs. tumor samples.
With any classical DE method (e.g. a linear model), I’d expect the results of this comparison to mainly reflect the technical differences between the 10x and the Smart-seq2 platforms. Since scANVI appears to successfully reduce the batch effects in the latent representation, I was wondering if such comparisons between datasets from different platforms become feasible as well.
As you suspect, you will have the same issue as with a linear model: you cannot know whether the fold changes you find are due to technical differences or to disease status.
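A quick way to see the problem is to cross-tabulate data source against condition. This is only a minimal sketch with pandas; the `obs` table and its `dataset` and `condition` columns are hypothetical stand-ins for your own per-cell metadata.

```python
import pandas as pd

# Hypothetical per-cell metadata; the column names and labels are assumptions.
obs = pd.DataFrame({
    "dataset": ["healthy_10x"] * 3 + ["tumor_ss2"] * 3,
    "condition": ["healthy"] * 3 + ["tumor"] * 3,
})

# Number of cells for each (dataset, condition) combination.
design = pd.crosstab(obs["dataset"], obs["condition"])

# How many data sources contribute cells to each condition.
datasets_per_condition = (design > 0).sum(axis=0)

# If every condition comes from a single data source, any healthy-vs-tumor
# difference is indistinguishable from a dataset (platform) difference.
no_replication = bool((datasets_per_condition <= 1).all())

print(design)
print("condition confounded with data source:", no_replication)
```

In the two-dataset setup from the question, each condition is covered by exactly one data source, so the check flags the design as confounded.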
To do this kind of differential expression, where you have confounding between data source and condition, the easiest solution is to find more data sources. Then you can treat each data source as a replicate.
If you have three independent healthy lung data sets and three independent lung cancer data sets, you can integrate these to harmonize the cell type annotations, then do within-cell-type differential expression between cancer vs healthy across the six total data sets.
At the moment scVI can’t deal with hierarchical samples (I think?). I use generalized linear mixed models, but pseudo-bulk should also work. For pseudo-bulk, you can split the total data by the (harmonized) cell type annotation and data set.
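The pseudo-bulk split could look roughly like the sketch below. It uses simulated counts and plain pandas; the dataset names (`ds0` to `ds5`), the cell type labels, and the `condition_of` mapping are all made up for illustration, and summing raw counts per group is one common choice of aggregation, not the only one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_cells, n_genes = 120, 5

# Simulated raw counts (cells x genes); replace with your own count matrix.
counts = pd.DataFrame(
    rng.poisson(2.0, size=(n_cells, n_genes)),
    columns=[f"gene{i}" for i in range(n_genes)],
)

# Hypothetical per-cell metadata: six data sets, harmonized cell type labels.
obs = pd.DataFrame({
    "dataset": rng.choice([f"ds{i}" for i in range(6)], size=n_cells),
    "cell_type": rng.choice(["T cell", "fibroblast"], size=n_cells),
})

# Assumed design: three healthy and three tumor data sets.
condition_of = {"ds0": "healthy", "ds1": "healthy", "ds2": "healthy",
                "ds3": "tumor", "ds4": "tumor", "ds5": "tumor"}

# Sum counts over all cells sharing a (cell type, data set) pair,
# giving one pseudo-bulk sample per pair.
pseudobulk = counts.groupby([obs["cell_type"], obs["dataset"]]).sum()

# Sample-level table with the condition of each pseudo-bulk sample.
sample_info = pseudobulk.index.to_frame(index=False)
sample_info["condition"] = sample_info["dataset"].map(condition_of)

print(pseudobulk.head())
print(sample_info.head())
```

Each row of `pseudobulk` can then be treated as one replicate in a bulk-style test of tumor vs. healthy within a cell type, with the data set as the unit of replication.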