HVGs and model fit

Dear scVI-Team,

In your tutorials it is mentioned that 1k to 10k HVGs are recommended for scVI models. However, I noticed that the number of HVGs has a strong influence on model fit, as assessed by the loss value and by the agreement between the training and validation curves in the training history. I often see lower loss and better agreement between the curves when choosing a lower number of HVGs (as low as 400 genes). My question is: should I therefore consider the model with fewer HVGs to describe my sample better than the one with more HVGs?
From my understanding it is possible that fewer genes may reduce noise and therefore lead to better model fit, but I am unsure whether I might be neglecting some detail in return. I would be happy to hear your opinion on this.


Hi there,

Thank you for using scvi-tools. In general, it is misleading to use the ELBO as a performance metric when comparing different model architectures. In this case, changing the number of input genes (and which genes are used) changes the underlying model. If you do want to compare the latent representations of these models against each other, it makes more sense to compute p(x_G | z) for some common subset of genes G. This is not built into the codebase, but it is something you can try to compute yourself if desired.
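
As a rough illustration of the shared-subset idea, here is a minimal sketch in plain Python. It assumes a negative binomial observation model (one of the gene likelihoods scVI supports) and assumes you have already obtained, for each model, the decoded mean and inverse dispersion per gene for a given cell; how you extract those from a trained model is not shown. The function names and the toy numbers are hypothetical and only for illustration.

```python
import math


def nb_log_prob(x, mu, theta, eps=1e-8):
    """Log-probability of count x under a negative binomial
    parameterized by mean mu and inverse dispersion theta."""
    log_theta_mu = math.log(theta + mu + eps)
    return (
        theta * (math.log(theta + eps) - log_theta_mu)
        + x * (math.log(mu + eps) - log_theta_mu)
        + math.lgamma(x + theta)
        - math.lgamma(theta)
        - math.lgamma(x + 1)
    )


def shared_gene_log_likelihood(counts, means, thetas, gene_names, shared_genes):
    """Sum the per-gene log-likelihoods of one cell over the shared genes only.

    counts, means and thetas are aligned with gene_names (the model's own
    gene order); genes outside shared_genes are ignored.
    """
    index = {gene: i for i, gene in enumerate(gene_names)}
    total = 0.0
    for gene in shared_genes:
        i = index[gene]
        total += nb_log_prob(counts[i], means[i], thetas[i])
    return total


# Hypothetical toy values for a single cell (not real data).
genes_a = ["g1", "g2"]              # model A was trained on 2 genes
genes_b = ["g1", "g2", "g3", "g4"]  # model B was trained on 4 genes
shared = ["g1", "g2"]               # common subset G

ll_a = shared_gene_log_likelihood(
    [0, 3], [0.5, 2.5], [1.0, 2.0], genes_a, shared
)
ll_b = shared_gene_log_likelihood(
    [0, 3, 1, 7], [0.4, 3.1, 1.2, 6.0], [1.0, 2.0, 1.5, 3.0], genes_b, shared
)
print(f"model A, shared genes: {ll_a:.3f}")
print(f"model B, shared genes: {ll_b:.3f}")
```

Because both sums run over the same genes G, the two numbers refer to the same reconstruction task, which is not the case for the full losses of models trained on different gene sets. In practice you would average this over held-out cells.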

Dear Justin,

Thanks for your response. I understand your point that we are basically comparing two different models here: A) with, say, 500 features and B) with 2000 features. Naturally the models must be different.

I also see that I should not compare the loss values obtained from the two models directly. However, as far as my knowledge goes (which is not very far in the field of ML), would I not be right to assume that model A), whose training and validation losses are nearly equal and which therefore predicts unseen data accurately, is more accurate than model B), which seems to overfit (higher validation loss than training loss) and therefore does not predict unseen data accurately?

Is this not what feature selection is all about, to find a set of features so that accurate predictions can be made from the model?

I apologize for kind of repeating my question.

The reason these still aren’t very comparable is that, unlike a classic classification scenario where you have a feature set and are trying to classify inputs, with VAEs the validation loss speaks to how well the model can reconstruct unseen data from the embedding. So good agreement between validation and training loss for a given number of features means the model can learn a good representation for that set of genes. For a model with a different set of input genes, the task is different than before (the model is trying to create a latent embedding representative of this new set of genes). If there is poor agreement between the curves now, it does not mean that the previous model is strictly better, but rather that the current model is not finding a good representation for this set of genes, and perhaps you can improve the model for this task by tweaking hyperparameters.

To summarize, in the simple classification scenario, a different set of inputs does not change the output of the model, a vector of logits or a discrete prediction, which makes models taking different inputs comparable. However, with VAEs the set of inputs also determines the set of outputs and affects the implications of the loss curves.


Considering a simpler case, in which I want to compare two models with the same set of HVGs and all but one hyperparameter (say, the number of neurons in a layer) kept fixed, which metric should I assess then? Would I use something like model.get_marginal_ll() for that?