Model complexity selection

Hi everybody,

I am currently using scvi-tools to perform batch correction and dimensionality reduction on a dataset consisting of multiple patients. I am not sure about the complexity of the model that is being used for the dimensionality reduction.

Here is an example of the code I am currently running:

import scanpy as sc
import scvi

adata_da_HVG = adata_da.copy()
sc.pp.highly_variable_genes(adata_da_HVG, n_top_genes=5000, subset=True)

scvi.data.setup_anndata(
    adata_da_HVG,
    batch_key="donor_id",
    layer="counts",
    labels_key="cell_group",
)

model_da = scvi.model.SCVI(adata_da_HVG)
model_da.train()

adata_da_HVG.obsm["X_scVI"] = model_da.get_latent_representation()

I wonder if there is a way to visualize the model complexity to look at the bias-variance tradeoff for this particular dataset. I would like to check whether I need more layers to perform the dimensionality reduction more effectively.

-Anatoly

I believe you can investigate the bias-variance trade-off by looking at the training curves. Your model_da should have a field .history that holds a dictionary of training metrics recorded per epoch. It definitely has the reconstruction error for the training set. I can't remember how to make training also record the reconstruction error for the test set. @adamgayoso, is this in the documentation somewhere?
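
To see the training curve, you can do something like this (the exact key names in model.history differ between scvi-tools versions, so list them first; "reconstruction_loss_train" is my guess):

import matplotlib.pyplot as plt

# List everything that was recorded per epoch during training.
print(list(model_da.history.keys()))

# Plot the training reconstruction loss over epochs ("reconstruction_loss_train"
# is an assumed key name; pick the matching entry from the printout above).
recon_train = model_da.history["reconstruction_loss_train"]
plt.plot(recon_train)
plt.xlabel("epoch")
plt.ylabel("training reconstruction loss")
plt.show()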

Once you have these curves for alternative models, you can compare them. Curves with higher reconstruction error on both the training and test sets indicate more bias. Curves where the test error increases relative to the training error indicate more variance. (Caveat: I always get the concepts of bias and variance for models mixed up; I hope I'm getting them right now.)
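
As a rough, untested sketch of that comparison: train models of different depth on the same data and overlay their training curves. n_layers is a real SCVI argument; the "reconstruction_loss_train" key name may differ between versions, and once validation losses are recorded (see the reply below about check_val_every_n_epoch) those curves can be overlaid in the same way.

import matplotlib.pyplot as plt
import scvi

# Train SCVI models of increasing depth and keep their training histories.
histories = {}
for n_layers in (1, 2, 3):
    m = scvi.model.SCVI(adata_da_HVG, n_layers=n_layers)
    m.train()
    histories[n_layers] = m.history

# Overlay the per-epoch training reconstruction loss for each model.
for n_layers, hist in histories.items():
    plt.plot(hist["reconstruction_loss_train"], label=f"n_layers={n_layers}")
plt.xlabel("epoch")
plt.ylabel("training reconstruction loss")
plt.legend()
plt.show()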

Now, how to quantify the complexity of the model? That is hard, I think… You could try to count all the parameters, but there are also various forms of regularization, both for the neural networks and for the latent variables (and other parameters). I'm not sure how that factors into the definition of model complexity.
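
For the parameter count at least, you can get a number out of the underlying PyTorch module (treat it as a crude proxy, since it ignores the regularization mentioned above):

# Count the trainable parameters of the PyTorch module behind the SCVI model.
n_params = sum(p.numel() for p in model_da.module.parameters() if p.requires_grad)
print(f"Trainable parameters: {n_params}")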

/Valentine

Two options here. Note that by default the “test” set is really called the “validation” set. This convention is due to the fact that the validation set can be used for early stopping. I can go into more detail on this if it’s confusing.

  1. Add check_val_every_n_epoch=1 as an argument to train to record all the losses on the validation set. Note that training will be slower; you can set this value to 10 or 20 if you want.
  2. Call model.get_reconstruction_error(indices=model.validation_indices). I believe higher is better here, whereas the trainer history logs the negative of this quantity. A snippet combining both options is below.
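
Roughly, in one untested snippet (check_val_every_n_epoch, get_reconstruction_error, and validation_indices are the argument/attribute names as I remember them; double-check against your scvi-tools version):

model_da = scvi.model.SCVI(adata_da_HVG)

# Option 1: record the validation losses during training (every epoch here;
# set this to 10 or 20 to keep training faster).
model_da.train(check_val_every_n_epoch=1)

# Option 2: evaluate the reconstruction error on the held-out validation cells.
# Higher is better here; the training history logs the negative of this quantity.
print(model_da.get_reconstruction_error(indices=model_da.validation_indices))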

Hi Adam,
Thanks for your explanation. When checking the model.history dict, I saw several entries, such as train_loss_step, train_loss_epoch, and some others. What is the difference between them? Which one should be used to assess the goodness of the model (as an alternative to the reconstruction_error you suggested above)? Also, assuming that early_stopping_monitor="elbo_validation", which value in this dict corresponds to the loss value printed during training (illustrated below)?

Epoch 40/40: 100%|██████████| 40/40 [00:44<00:00, 1.11s/it, loss=588, v_num=1]

Many thanks,

Hi tu.le,

Yes, some quantities are split up, and during training the "loss" used for the optimization objective can be computed in a form that is equivalent as far as the gradients are concerned. The full reconstruction error requires more calculation, but not all of its terms contribute to the optimization. So for a human, reconstruction_error is the quantity that matters, while a computer calculating gradients can't tell the difference between reconstruction_error and, e.g., train_loss_step.
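
To see what is actually being logged, you can inspect the history yourself. Something like this (assuming the history entries are pandas DataFrames, as in recent scvi-tools versions, and that the key names match the ones mentioned above):

# List everything that was logged, then compare the composite training loss
# (likely the value shown in the progress bar) with the reconstruction term.
for key, values in model_da.history.items():
    print(key, len(values))

print(model_da.history["train_loss_epoch"].tail())
print(model_da.history["reconstruction_loss_train"].tail())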

Best,
/Valentine

Hi Valentine,
Thanks for your explanation. It’s clear to me now.
Best,
Tu