Tuning/setting scvi.model.SCVI parameters


Is there any recommendation for tuning the parameters in the scvi.model.SCVI() function?

I have been modifying the parameters until the clusters make more biological sense (such as immune cells far from adipocytes and adipocyte subpopulations closer in the UMAP). Does this sound reasonable? Should I simply stick to the default?

Are you seeing a lot of variation across hyperparameter settings? What are you currently changing? We rarely stray far from the defaults, even when we do change them.

I varied the number of hidden layers (1-4) and the dimensionality of the latent space (10-40). I modified these values because I expected the different adipocyte types to cluster closely together and the other cell types to sit far apart. Moreover, with higher values, brown adipocytes from different studies cluster together, but with values closer to the defaults some brown adipocytes cluster with white adipocytes.

1-4 would be much too small. If I changed defaults I might try n_layers=2, n_latent=30, gene_likelihood="nb"
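For reference, the suggestion above could be wired up roughly as follows. This is a minimal sketch, assuming scvi-tools is installed and `adata` holds raw counts; the column name `"batch"` and the helper `fit_scvi` are placeholders for illustration, not something from this thread.

```python
# Non-default settings suggested in this thread; all other arguments keep their defaults.
SUGGESTED_KWARGS = {"n_layers": 2, "n_latent": 30, "gene_likelihood": "nb"}


def fit_scvi(adata, batch_key="batch", **overrides):
    """Train an SCVI model on `adata` (raw counts) and return it.

    `batch_key` is a hypothetical column name in `adata.obs`.
    """
    # Imported lazily so the settings above can be inspected without scvi-tools.
    import scvi

    kwargs = {**SUGGESTED_KWARGS, **overrides}
    scvi.model.SCVI.setup_anndata(adata, batch_key=batch_key)
    model = scvi.model.SCVI(adata, **kwargs)
    model.train()
    return model
```

After training, `model.get_latent_representation()` returns the embedding that would typically be stored in `adata.obsm` and used for neighbors/UMAP.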

Thank you for the suggestion, Adam. Could you explain the reasoning behind these choices, especially gene_likelihood? If I understood it correctly, nb should do better with overdispersed data, which should be my case because of the very different cell types coming from different studies?
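As a side note on overdispersion: as far as I am aware, both "nb" and the default "zinb" likelihood model overdispersion, and they differ in the extra zero-inflation component. A rough way to see how overdispersed the counts are is the per-gene variance-to-mean ratio, which is about 1 for Poisson data and larger for negative binomial data. The sketch below uses simulated counts only; the function name `dispersion_index` is made up for illustration, and a sparse `adata.X` would need to be densified (or subsampled) first.

```python
import numpy as np


def dispersion_index(counts):
    """Per-gene variance-to-mean ratio for a cells x genes count matrix."""
    counts = np.asarray(counts, dtype=float)
    mean = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)
    ratio = np.full(mean.shape, np.nan)
    # Genes with zero mean are left as NaN instead of dividing by zero.
    np.divide(var, mean, out=ratio, where=mean > 0)
    return ratio


rng = np.random.default_rng(0)
# Simulated data: both matrices have mean 5 per gene.
poisson_counts = rng.poisson(5.0, size=(2000, 50))
# Negative binomial with n=2, p=2/7 has mean 5 and variance 17.5.
nb_counts = rng.negative_binomial(n=2, p=2 / 7, size=(2000, 50))

poisson_ratio = np.nanmedian(dispersion_index(poisson_counts))
nb_ratio = np.nanmedian(dispersion_index(nb_counts))
```

Here `poisson_ratio` comes out close to 1 and `nb_ratio` is clearly larger, which is the pattern to look for in real counts.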

I was also wondering about the impact of parameter changes.
I am working on two datasets that I cannot integrate successfully with scVI, whereas Harmony works well.
I will now try changing the default parameters in the hope that this improves the integration.
As you suggested, I will start with n_layers=2, n_latent=30, gene_likelihood="nb".

For the integration I am using 4000 HVG (see below)

In your experience, what is the impact of changing the number of HVGs?

Any other advice on what I could check in the data that might be preventing integration?

I would ensure that the same exact genes are used to compare methods. I would also ensure the loss has converged.
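A quick way to confirm that two pipelines really use the same genes is to compare the gene lists directly. This is a minimal sketch with made-up gene lists; `compare_gene_sets` is a hypothetical helper, not part of scvi-tools or Harmony.

```python
def compare_gene_sets(genes_a, genes_b):
    """Summarise the overlap between two collections of gene names."""
    a, b = set(genes_a), set(genes_b)
    return {
        "shared": sorted(a & b),
        "only_a": sorted(a - b),
        "only_b": sorted(b - a),
        "identical": a == b,
    }


# Toy example with three genes per method.
report = compare_gene_sets(
    ["CD3E", "FOXP3", "IL2RA"],
    ["FOXP3", "IL2RA", "CTLA4"],
)
```

With AnnData objects, the inputs would be the `var_names` of the object given to each method.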

What genes are being used by harmony?

Thanks for your help Adam.

Both models use the same 4000 HVGs.
The loss also seems to converge:
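Beyond eyeballing the curve, convergence can be checked numerically by comparing the mean loss over the last two windows of epochs. The sketch below runs on a synthetic loss curve; the window size and tolerance are arbitrary assumptions, and the helper names are made up for illustration.

```python
import numpy as np


def relative_improvement(loss, window=10):
    """Relative drop in mean loss between the last two windows of `window` epochs."""
    loss = np.asarray(loss, dtype=float)
    if loss.size < 2 * window:
        raise ValueError("need at least 2 * window epochs")
    previous = loss[-2 * window:-window].mean()
    recent = loss[-window:].mean()
    return (previous - recent) / abs(previous)


def has_converged(loss, window=10, tol=1e-3):
    """True if the loss changed by less than `tol` (relative) over the last window."""
    return bool(abs(relative_improvement(loss, window)) < tol)


# Synthetic curve: exponential decay towards a plateau at 1000.
epochs = np.arange(400)
loss_curve = 1000.0 + 500.0 * np.exp(-epochs / 30.0)

converged_full = has_converged(loss_curve)        # plateau reached
converged_early = has_converged(loss_curve[:40])  # still dropping quickly
```

In the scvi-tools versions I have used, the per-epoch training loss is available after `model.train()` as a DataFrame under `model.history["elbo_train"]`, which can be flattened with `.to_numpy().ravel()` and passed in as `loss`.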

Any other advice on how to understand why the data does not integrate?
Would the choice of dispersion or gene_likelihood have a big impact on the model?

I would need to understand more about your data. How many cells? How many batches? Also how are you defining poor integration? Finally, there may be implementation differences between R and Python for HVG selection so to fairly compare methods it’s ideal to use the same exact 4000 genes.

Harmony does tend to achieve higher batch correction scores than scVI, but scVI tends to preserve biological variation better.
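To put a number on batch mixing rather than judging it from a UMAP, one simple option is the fraction of each cell's nearest neighbours that come from a different batch, computed on the integrated embedding. The sketch below is a basic stand-in for this idea, not a specific published metric, and it runs on simulated embeddings. It uses brute-force distances, so it would need a subsample (for example only the Tregs) rather than all cells.

```python
import numpy as np


def neighbor_batch_mixing(embedding, batches, k=15):
    """Mean fraction of each cell's k nearest neighbours that belong to another batch."""
    embedding = np.asarray(embedding, dtype=float)
    batches = np.asarray(batches)
    # Brute-force squared Euclidean distances (memory grows with n_cells ** 2).
    sq_norm = (embedding ** 2).sum(axis=1)
    dist = sq_norm[:, None] + sq_norm[None, :] - 2.0 * embedding @ embedding.T
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbour
    neighbors = np.argpartition(dist, k, axis=1)[:, :k]
    other_batch = batches[neighbors] != batches[:, None]
    return float(other_batch.mean())


rng = np.random.default_rng(0)
labels = np.repeat(["batch_1", "batch_2"], 200)

# Well mixed: both batches drawn from the same distribution.
mixed = rng.normal(size=(400, 10))
# Poorly mixed: the second batch is shifted away from the first.
separated = mixed.copy()
separated[200:] += 10.0

mixed_score = neighbor_batch_mixing(mixed, labels)
separated_score = neighbor_batch_mixing(separated, labels)
```

With two equally sized batches, a score near 0.5 indicates good mixing and a score near 0 indicates that the batches stay apart. This only measures batch mixing, so it says nothing about how well biological variation is preserved.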

It’s quite a big object: 390938 cells × 33538 genes with only 2 batches.
I am using harmonypy, so the same 4000 HVGs are used with both methods.

By poor integration I mean that, for example, I can find Tregs in both datasets, but after integration with scVI they form two adjacent clusters in the UMAP, whereas Harmony puts them into one cluster.

This is only the T cell compartment from the datasets.


scVI (zoomed in):

This UMAP is already better than the ones I got with the default parameters.
The integration improved when I changed the default parameters to n_layers=2, n_latent=30, gene_likelihood="nb".
I am now planning to run hyperparameter optimization with the tuner.
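Before reaching for the tuner, it can help to write down the search space explicitly. The sketch below is a plain grid over the ranges discussed in this thread and does not use the scvi-tools tuner, whose interface is not reproduced here; the values in `SEARCH_SPACE` are assumptions for illustration.

```python
from itertools import product

# Ranges mentioned in this thread: 1-4 hidden layers, 10-40 latent dimensions.
SEARCH_SPACE = {
    "n_layers": [1, 2, 3, 4],
    "n_latent": [10, 20, 30, 40],
    "gene_likelihood": ["zinb", "nb"],
}


def candidate_configs(space):
    """Expand a dict of value lists into a list of keyword-argument dicts."""
    keys = list(space)
    return [dict(zip(keys, values)) for values in product(*(space[k] for k in keys))]


configs = candidate_configs(SEARCH_SPACE)
```

Each entry in `configs` could then be passed as keyword arguments to `scvi.model.SCVI(adata, **config)` and scored on the same criterion, for example validation loss or a batch mixing metric, so that runs stay comparable.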