All genes or highly variable genes?

Hi scverse!

I was wondering if there is anything arguing against running scVI/totalVI on all genes, rather than highly-variable genes (HVGs) only. I am aware that with PCA-based methods (scanpy, Seurat), excluding genes not exceeding Poisson noise was crucial to increase signal. This is because PCA assumes normally distributed values, making normalization, logging and scaling necessary to stabilize the variance, which would blow up the noise in uninformative genes.

scvit-tools is different however. It uses the Negative Binomial likelihood to compare output (denoised counts) to input (raw UMIs), so there is no transformation that could inflate noise. Given that, I do not see any reason to subset to HVGs other than to make it computationally cheaper. Running on all genes would have some advantages:

  • We use all information in the data, some of which may be overlooked by HVG methods
  • We have denoised counts for all genes, so can also test DE for all genes, not just HVGs

Any opinions and considerations I am missing here?
@Valentine_Svensson, I saw you use all genes, not HVGs, in your blog post (*). Does that mean you agree it might have advantages?



(*) Valentine’s blog post using all genes, not HVGs:

Using HVG might improve cell type separation in the latent space. We haven’t rigorously benchmarked this. You could imagine that if the decoder has to reconstruct many genes that don’t differ between cell states much then it might blur the real differences to some extent.

The authors of scIB have this great webpage which compares scVI using HVG and all genes for integrating datasets. Using all genes reduces batch effect correction power, while there is no clear consensus on bio conservation.

Thank you for the insights. I might just stick to using HVGs for now, but would be curious for someone to benchmark this.

I strive to keep as many genes as possible for the sake of downstream analysis. You’re bound to see a gene mentioned in a paper and will want to check how it’s expressed in your data (is it high in a specific cell type? Is it DE between conditions? etc).

Of course, the model will be easier to fit with fewer genes (not to mention quicker). Even selecting a panel of genes known to mark cell types can be extremely effective if your aim is to see whether you have expected cell types in your data. But for me so much of the value with these models is to do follow-up analysis on things you didn’t think of when you started out, without having to refit a model.

Just an open-ended idea: Let’s ignore fitting time and scaling, and focus on how good you can get the embedding to batch-integrate and represent cell types. Maybe there’s some way to create a model structure where you have two representations? One Z that has a representation based on HVG’s / panel genes, and another Z' that describes variation in the other genes / all genes or something like this?


Yeah you could probably allocate some dimensions of the latent space to decode HVGs and then remaining dimensions decode non HVGs + HVGs? Maybe? Could easily run into identifiability issues with this type of thing.

Hi @adamgayoso and @Valentine_Svensson,

thanks for sharing your thoughts, very insightful!

I ran some experiments to offer anecdotal evidence (no rigorous benchmark either). I took a cancer data set with a few dozen samples, for which we spent quite some time manually annotating cell types at a detailed hierarchy (using Seurat’s WNN and re-running it for every major cell type to get finer subsets resolved into clusters, which we then named according to literature markers). I then visualized these ‘groundtruth’ cell type labels in UMAP plots generated on neighbor graphs from totalVI with 2k, 3k (HVGs) and 27k genes (all genes). For comparison, I also threw in the WNN UMAP on the same 2k genes and ~100 ADTs as totalVI. Amazingly, finer cell type subsets blurred in the WNN UMAP, but were resolved well in all three totalVI instances. In all cases, there were the same ~ 100 ADTs, and for WNN we used scTransform to normalize the mRNA counts.

I conclude that in this example there is very little difference between running totalVI on 2k, 3k and 27k genes, as judged by grouping of manually curated cell types in the UMAP embedding. And totalVI is able to use the information stored in 2k genes + ~ 100 ADTs more efficiently than WNN.

What I’m curious about next is whether totalVI does benefit at all from running it on each major cell type (T / B / NK / …) separately, as WNN clearly does, or whether it already balances all signals in the data in an optimal way. It’s possible that manual annotation of clusters becomes more comfortable when diving into T cells only, simply because different values for the resolution parameter might be ideal for T / B / NK / … cells. And UMAP is less crowded of course when looking at T cells only, so overplotting is less severe when inspecting marker gene expression.

Hi Felix,

Nice to hear about your experience!

Just intuitively, I don’t see why using less data would improve the ability to learn cell representations. With more data, even if it’s from other cell types, the model should have a higher chance to learn gene-gene correlations, which should be how the neural network decoders can model the expression of all genes.

In a linear setting (like e.g. PCA), more data might mean you need more linear factors to accurately ‘decode’ the factors into gene expression. But (again, just intuition, would be interesting to test) the neural networks used as decoders here are very flexible and should be able to encode a lot of information in local support with the default 10-dimensional representation.

For visualization and sub-clustering, you might want to filter out e.g. T cells from a first pass annotation, then tSNE/UMAP/PyMDE these alone based on the Z representation learned from all cells, and perform whichever clustering strategy you’re using on these alone.

Of course, would be interesting to see what results you get by running scVI/TotalVI on a subset of the cells in comparison to the original! Just throwing out some thoughts.


It would be interesting if you want to quantify this, to check the silhouette score of your cell type labels with respect to each method, as well as Geary’s C with respect to the respective neighbor graphs and the plain log normalized expression data. Geary’s C would allow you to better understand if adding more genes really contributes to the impact on nearest neighbors in the latent space.

Hi again,

@Valentine_Svensson thanks for the clear write-up of very interesting thoughts, this is great!

@adamgayoso Thanks for pointing me towards Geary’s C. Love that scanpy has it implemented!

If I go to a conference this year, I might extend this benchmark and put it on my poster, for which all your input is very helpful.



Hi @adamgayoso,

I have one final question. I just saw that totalVI registered 1 label in the anndata object, and I was wondering what that means. I ask this to exclude the slim chance that totalVI somehow is able to use pre-existing cell type labels to improve model fitting, which in my experiment here would of course be unfair to WNN. So, what does the ‘label’ mean in below screen shot?

Hi @FelixTheStudent,

I’m Justin another dev for scvi-tools. The labels are present because the underlying TOTALVAE module allows learning gene dispersions different across labels (scvi.module.TOTALVAE - scvi-tools). However, we do not expose that feature in the TOTALVI model at the moment which is why there is no labels_key kwarg. The log message is mainly an artifact of keeping the model compatible with the module.

1 Like