For data integration, the scvi-tools tutorial recommends subsetting to “something between 1000 and 10000 highly variable genes”. As discussed here, fewer genes improve the separation of cell types, and I usually go with something between 4k and 6k.
However, when using the trained model for differential gene expression analysis, the “hypothesis space” is limited to those highly variable genes. Would you say that’s not an issue because “non-variable” genes are not DE anyway, or would you recommend training a second model on all genes for DE analysis only?
In this particular case, I want to identify cluster-specific markers, but the same question applies for comparisons between patient groups.
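For context, the subsetting step I mean is conceptually just “rank genes by a variability statistic and keep the top N”. Here is a toy numpy sketch of that idea, using a simple dispersion ranking — this is deliberately much simpler than what e.g. scanpy’s `highly_variable_genes` actually does (no mean binning, no batch awareness), and all names and data here are illustrative:

```python
import numpy as np

def top_hvgs(counts, n_top=5000):
    """Rank genes by dispersion (variance / mean) and keep the top n_top.

    counts: (cells x genes) array of raw counts. Toy stand-in for a real
    HVG selection; real pipelines bin by mean expression and handle batches.
    """
    mean = counts.mean(axis=0)
    var = counts.var(axis=0)
    dispersion = np.where(mean > 0, var / np.maximum(mean, 1e-12), 0.0)
    keep = np.argsort(dispersion)[::-1][:n_top]
    return np.sort(keep)  # column indices of the selected genes

# Simulated counts: gene-specific rates drawn from a gamma, then Poisson noise.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=rng.gamma(2.0, 1.0, size=200), size=(100, 200))

idx = top_hvgs(counts, n_top=50)
subset = counts[:, idx]  # this subset is what the model would be trained on
```

The DE question above then amounts to: anything outside `idx` is invisible to the trained model.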
I have authored the post you linked and wanted to clarify that, if I remember correctly, we did not conclude there that using fewer genes is better. In fact, I saw hardly any difference between 2000, 3000 and 27000 genes on my particular data set – but that is anecdotal evidence, of course. You could train that second model on all genes and compare the two with Geary’s C, as suggested by Adam in the thread you linked. I am just a bystander though: I am a user of scvi-tools, not a contributor.
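For reference, the statistic itself is easy to compute: per gene, Geary’s C over the kNN graph of a model’s latent space tells you whether neighbouring cells have similar expression (values well below 1) or not (around 1). A minimal dense numpy sketch — a practical implementation (scanpy ships one, if I recall correctly) works on sparse graphs instead:

```python
import numpy as np

def gearys_c(w, x):
    """Geary's C of values x over a (dense) weighted adjacency matrix w.

    Well below 1: neighbours have similar values (strong autocorrelation).
    Around 1: no autocorrelation.
    """
    n = len(x)
    num = (w * (x[:, None] - x[None, :]) ** 2).sum()
    denom = ((x - x.mean()) ** 2).sum()
    return (n - 1) * num / (2 * w.sum() * denom)

# Toy graph: a chain of 50 "cells", each connected to its neighbours.
n = 50
w = np.zeros((n, n))
i = np.arange(n - 1)
w[i, i + 1] = 1
w[i + 1, i] = 1

x_smooth = np.linspace(0, 1, n)          # varies smoothly along the graph
rng = np.random.default_rng(0)
x_shuffled = rng.permutation(x_smooth)   # same values, random placement

c_smooth = gearys_c(w, x_smooth)         # close to 0
c_shuffled = gearys_c(w, x_shuffled)     # close to 1
```

To compare two models, you would compute this per gene over each model’s latent-space kNN graph and see which one preserves expression structure better.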
Personally I will keep using pseudobulk methods for comparisons between patient groups, but think you do raise an interesting question.
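To spell out what I mean by pseudobulk: sum raw counts to one profile per sample and cell type, then hand that matrix to a bulk DE tool (DESeq2, edgeR, …) with the samples as replicates. A minimal pandas sketch, with made-up sample/cell-type labels and gene names:

```python
import numpy as np
import pandas as pd

# Toy single-cell count matrix (6 cells x 4 genes) and per-cell annotations.
rng = np.random.default_rng(0)
counts = pd.DataFrame(rng.poisson(2, size=(6, 4)),
                      columns=["geneA", "geneB", "geneC", "geneD"])
obs = pd.DataFrame({"sample": ["p1", "p1", "p1", "p2", "p2", "p2"],
                    "cell_type": ["T", "T", "B", "T", "B", "B"]})

# Sum counts per (sample, cell type): one pseudobulk profile per combination.
pseudobulk = counts.groupby([obs["sample"], obs["cell_type"]]).sum()
# The rows of `pseudobulk` (patients, not cells) are now the units for the
# between-group comparison, which avoids treating cells as replicates.
```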
Hi, thanks for clarifying!
I was referring to this statement in the scvi-tools getting started tutorial:
For scVI, we recommend anywhere from 1,000 to 10,000 HVGs, but it will be context-dependent.
In fact I saw hardly any difference between 2000, 3000 and 27000 genes on my particular data set
This is interesting, because I had the impression (also only anecdotal) that using fewer genes helped when working with rather heterogeneous data (e.g. 10x + Smart-seq2). Also, in the single-cell integration benchmark, using HVGs tended to achieve better results than training the full model.
But it may very well be that when working with a single dataset only (with batch effects being limited to between-sample effects rather than between-dataset) it doesn’t matter as much.
Just sharing my anecdotes as well: I largely identify the same structure with all genes as with a small subset of a couple of thousand genes. Maybe a little bit more ‘refined’ when using the small subset of genes. But overall I have switched my workflow to using all genes.
The biggest reason is that invariably I will wonder about some specific gene that I read about in a paper that happened to not be in the top variable genes list. Including all genes makes it easier for me to relate results to literature.
I think “how DE can it be if it’s not variable?” is a reasonable argument. But it’s not clear that selecting the top N genes removes only genes that are not variable at all – a gene just below the cutoff can still be quite variable.
If you go with the idea of fitting one model on a subset of genes for clustering, and another model with all genes for the sake of DE, I think we enter an unfortunate “double dipping” paradigm. I’d suggest just trying the clustering/annotation with a model fitted on all genes; you might see that it works fine.
As an aside, I wonder if it would be possible to make models that learn representation spaces of the cells for sets of genes, similar to how scLVM did for KEGG pathways: it fitted a low-dimensional factor model to each pathway. That way you could have a Z1 space representing highly variable genes or genes known to define cell types, and a Z2 space representing expression of all other genes. Then use Z1 for clustering/annotation, and both Z1 and Z2 for e.g. DE analysis.
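A rough numpy sketch of that idea, with truncated SVD standing in for the per-gene-set factor model (a linear simplification of what scLVM-style models do; data and the HVG mask are toy assumptions):

```python
import numpy as np

def factor_space(x, n_factors):
    """Low-dimensional representation of a gene subset via truncated SVD.

    A linear stand-in for fitting a factor model to one gene set.
    x: (cells x genes) expression for that gene set.
    """
    xc = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(xc, full_matrices=False)
    return u[:, :n_factors] * s[:n_factors]

rng = np.random.default_rng(0)
expr = rng.normal(size=(300, 100))   # toy cells x genes matrix
hvg_mask = np.zeros(100, dtype=bool)
hvg_mask[:20] = True                 # pretend the first 20 genes are the HVGs

z1 = factor_space(expr[:, hvg_mask], n_factors=10)   # for clustering/annotation
z2 = factor_space(expr[:, ~hvg_mask], n_factors=10)  # remaining genes
z_full = np.hstack([z1, z2])                         # joint space for DE analysis
```

Clustering on `z1` then uses only the informative genes, while `z_full` still covers the whole transcriptome for downstream comparisons.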