Feature selection and effects on DGE analysis?

Dear scvi team,
I have a question about selecting highly variable genes and doing feature reduction on that basis. If we go with, let's say, 2000 genes, can we later test differential gene expression only for those 2000 genes? In this step, do we assume that the genes differentially expressed between clusters are covered by the highly variable genes? And if I want to perform DGE for more genes, would I need to adjust the feature selection step accordingly?
Thanks & best wishes

Hi!

Thanks for using scvi-tools!

So yes, this is how it works. When you filter genes before fitting the scVI model, you can only perform DE for those genes (the algorithm never saw the other ones, so it can’t make inferences about them).

And indeed, if you’d like to perform DE with more genes, they need to be in the input data given to scVI up front. In the manuscript we discuss that adding more genes may be problematic if your dataset is small (as a rule of thumb, never use more genes than cells). You can check your latent space: if the cell types become blurry, then you’re probably not fitting well!
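
To make the two points above concrete, here is a minimal sketch in plain Python. It is not part of scvi-tools: the helper name `choose_model_genes`, its arguments, and the gene names in the usage lines are all hypothetical. It builds the gene list to keep before fitting (highly variable genes plus any extra genes you want to test for DE) and refuses a list that breaks the "no more genes than cells" rule of thumb.

```python
def choose_model_genes(hvg, genes_of_interest, all_genes, n_cells):
    """Return the genes to keep before fitting: HVGs plus extra genes wanted for DE.

    Hypothetical helper for illustration only; not an scvi-tools function.
    """
    available = set(all_genes)

    # Extra genes can only be added if they exist in the dataset at all.
    missing = [g for g in genes_of_interest if g not in available]
    if missing:
        raise ValueError(f"Genes not present in the dataset: {missing}")

    # Ordered union: HVGs first, then extra genes, without duplicates.
    selected = list(dict.fromkeys(list(hvg) + list(genes_of_interest)))

    # Rule of thumb from the reply above: never more genes than cells.
    if len(selected) > n_cells:
        raise ValueError(
            f"{len(selected)} genes selected but only {n_cells} cells; "
            "reduce the gene list before fitting."
        )
    return selected


# Usage with made-up gene names: two HVGs, one extra gene wanted for DE.
model_genes = choose_model_genes(
    hvg=["GeneA", "GeneB"],
    genes_of_interest=["GeneB", "GeneC"],
    all_genes=["GeneA", "GeneB", "GeneC", "GeneD"],
    n_cells=500,
)
```

You would then subset your data to `model_genes` before setting up and training the model, so that every gene you later want to test was actually seen during fitting.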

Hope that helps!
Romain
