HVGs and model fit

pseudonym2 · January 23, 2022, 3:45pm

Dear scVI-Team,

In your tutorials it is mentioned that 1k to 10k HVGs are recommended for scVI models. However, I noticed that the number of HVGs has a strong influence on model fit, as assessed by the loss value and by the agreement between the training and validation curves in the training history. I often saw lower loss and better agreement between the curves when choosing a lower number of HVGs (as low as 400 genes). My question is: should I therefore consider the model with fewer HVGs to describe my sample better than the one with a higher number of HVGs?
From my understanding it is possible that fewer genes may reduce noise and therefore lead to better model fit, but I am unsure whether I might be neglecting some detail in return. I would be happy to hear your opinion on this.

Best

Justin_Hong · January 24, 2022, 6:46pm

Hi there,

Thank you for using scvi-tools. In general, when it comes to comparing different model architectures, it is misleading to use the ELBO as a metric for performance. In this case, for example, changing the number of input genes (and which genes are used) changes the underlying model. If you do want to compare the latent representations of these models against each other, it makes more sense to compute p(x_G | z) for some common subset of genes G. This is not built into the codebase, but it is something you can try to compute yourself if desired.
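A minimal sketch of what computing p(x_G | z) on a common gene subset could look like for a single cell. It assumes a plain negative binomial likelihood parameterized by a mean `mu` and an inverse dispersion `theta`, and that these per-gene parameters have already been decoded from z by each trained model. The helper names (`nb_log_pmf`, `shared_gene_log_lik`), the gene names, and all numbers are hypothetical and not part of scvi-tools.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with
    mean mu > 0 and inverse dispersion theta > 0."""
    log_theta_mu = math.log(theta + mu)
    return (
        theta * (math.log(theta) - log_theta_mu)
        + x * (math.log(mu) - log_theta_mu)
        + math.lgamma(x + theta)
        - math.lgamma(theta)
        - math.lgamma(x + 1)
    )


def shared_gene_log_lik(counts, params, shared_genes):
    """Sum log p(x_g | z) over the shared genes for one cell.

    counts: dict gene -> observed count
    params: dict gene -> (mu, theta) decoded from z by one model
    """
    return sum(nb_log_pmf(counts[g], *params[g]) for g in shared_genes)


# Made-up decoded parameters for one cell from two hypothetical models.
counts = {"GeneA": 3, "GeneB": 0, "GeneC": 7}
model_a = {"GeneA": (2.5, 1.0), "GeneB": (0.2, 0.5)}  # fewer HVGs
model_b = {"GeneA": (3.1, 1.2), "GeneB": (0.1, 0.5), "GeneC": (6.0, 2.0)}  # more HVGs

# Common subset G: only genes that both models were trained on.
shared = sorted(set(model_a) & set(model_b))
ll_a = shared_gene_log_lik(counts, model_a, shared)
ll_b = shared_gene_log_lik(counts, model_b, shared)
print(shared, ll_a, ll_b)
```

Because both sums run over the same genes, `ll_a` and `ll_b` measure the same quantity and can be compared, ideally averaged over held-out cells. A model trained with a zero-inflated likelihood would need the corresponding log-density instead of `nb_log_pmf`.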

pseudonym2 · January 28, 2022, 11:28am

Dear Justin,

thanks for your response. I understand your point saying that we are basically comparing two different models here, A) with say 500 features and B) with 2000 features. Naturally the models must be different.

I also see that I may not compare the loss values obtained from both models directly. However, as far as my knowledge goes (which is not very far in the field of ML), I would assume model A) which accurately predicts unknown data because train and validation loss are nearly equal to be more accurate than model B) which seems to overfit (higher validation loss than training loss), thus not accurately predicting unknown data?

Is this not what feature selection is all about, to find a set of features so that accurate predictions can be made from the model?

I apologize for kind of repeating my question.

Justin_Hong · January 28, 2022, 9:44pm

The reason these still aren’t very comparable is that, unlike a classic classification scenario where you have a feature set and are trying to classify inputs, with VAEs the validation loss speaks to how well the model can reconstruct unknown data from the embedding. So good agreement between validation and training loss for a given number of features means the model can learn a good representation for that set of genes. For a model with a different set of input genes, the task is different than before (the model is trying to create a latent embedding representative of this new set of genes). If there is poor agreement between the curves now, it does not mean that the previous model is strictly better, but rather that the current model is not finding a good representation for this set of genes, and perhaps you can improve the model for this task by tweaking hyperparameters.

To summarize, in the simple classification scenario, a different set of inputs does not change the output of the model, a vector of logits or a discrete prediction, which makes models taking different inputs comparable. However, with VAEs the set of inputs also determines the set of outputs and affects the implications of the loss curves.
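The dependence of the loss on the input genes can be made explicit with a simplified form of the VAE objective. This is a sketch that assumes genes are conditionally independent given z and ignores library size and batch covariates; G denotes the set of input genes of a given model.

```latex
\mathrm{ELBO}(x)
  = \mathbb{E}_{q(z \mid x)}\Big[ \sum_{g \in G} \log p(x_g \mid z) \Big]
  - \mathrm{KL}\big( q(z \mid x) \,\|\, p(z) \big)
```

The reconstruction term is a sum over the genes in G, so two models trained on different gene sets optimize different quantities, and their loss values are not on a common scale.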

pseudonym2 · March 6, 2022, 4:46pm

Considering a simpler case in which I want to compare two models with the same number of HVGs and all but one parameter (say, the number of neurons in a layer) kept fixed, which metric should I assess then? Would I use something like model.get_marginal_ll() for that?
