Which scvi-tools releases support modelling with extra covariates?

Dear scVI-developers and community,

I stumbled upon the following GitHub issue:

As I am very interested in modelling with extra categorical and continuous covariates that can be passed to setup_anndata, I am wondering since which release the correction of the latent space has actually been implemented (rather than the covariates merely being passed to setup_anndata without being used further). I understand this is the case for the latest release, but what about older releases (e.g. 0.8)?

Is this different for categorical and continuous covariates?

Thanks for providing this great tool and continuously improving it.


This was implemented in v0.9.0. See release notes here. Before v0.9.0, only the batch_key parameter worked.
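For reference, a minimal sketch of registering extra covariates with the v0.9.0+ API (the column names "donor" and "percent_mito" are hypothetical placeholders, and `adata` is assumed to be an already-loaded AnnData object with raw counts in a "counts" layer):

```python
import scvi

# Register extra covariates alongside the batch key (v0.9.0+ API).
# "donor" and "percent_mito" are hypothetical columns in adata.obs.
scvi.data.setup_anndata(
    adata,
    batch_key="batch",
    layer="counts",
    categorical_covariate_keys=["donor"],        # passed to the model so the
    continuous_covariate_keys=["percent_mito"],  # latent space is corrected for them
)
model = scvi.model.SCVI(adata)
```

Before v0.9.0 these keyword arguments either did not exist or were registered without affecting the model.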

Dear Adam,

thanks a lot, this is what I was looking for.

Hi,

I have now switched to the latest version of scvi-tools to make use of extra covariates. However, I noticed that much of the user interface has changed, and I encountered something strange while training my models.

I have a rather large dataset of >100k cells, which I previously trained with:
model.train(n_epochs=5, n_iter_kl_warmup=1600, n_epochs_kl_warmup=None, frequency=1)

This converged very quickly in v0.8.1.
Now I am training a similar set of cells in v0.13.0 with:

model.train(max_epochs=5, early_stopping=True, check_val_every_n_epoch=1)

However, this doesn’t converge as quickly; in fact, even after 50 epochs I did not reach convergence, although the model is learning. This was also independent of covariate usage.

Am I missing something?

Can you please post the full code for both versions? We’d like to know which model you are using.

Furthermore, how are you measuring convergence?

Dear Adam,
This is the relevant code for both versions; I left out the data-loading part. The dataset size is ~550k cells. Please also see the ELBO plots. In this regard, I may not have used the term convergence in a very technical sense; I was just referring to "no substantial further loss reduction per epoch". Clearly, the training parameters are different, as some of the keywords from the old version are not available in the latest version. Please let me know whether you need any further information regarding the versions of any dependencies.
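Since "convergence" is used informally here, one way to make it concrete is a simple heuristic on the ELBO history: declare convergence once the mean relative change over the last few epochs drops below a tolerance. The function name `has_converged`, the window, and the tolerance below are my own illustration, not part of scvi-tools:

```python
import pandas as pd

def has_converged(elbo: pd.Series, window: int = 5, rel_tol: float = 1e-3) -> bool:
    """Heuristic: converged when the mean relative ELBO change over the last
    `window` epochs drops below `rel_tol`. Name and thresholds are illustrative."""
    if len(elbo) < window + 1:
        return False
    # per-epoch relative change, restricted to the most recent `window` epochs
    rel_change = elbo.pct_change().abs().tail(window)
    return bool(rel_change.mean() < rel_tol)

# Synthetic ELBO curves: one flattening out, one still decreasing steeply
flat = pd.Series([1000.0, 900.0, 850.0, 840.0, 839.5, 839.4, 839.35, 839.34, 839.33, 839.33])
steep = pd.Series([1000.0, 900.0, 800.0, 700.0, 600.0, 500.0])
```

With a trained model, something like `has_converged(model.history["elbo_train"].squeeze())` could then be used to check the "no substantial further loss reduction" criterion explicitly.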

I have to say the resulting UMAP plots (not part of the code here) are very comparable in quality, which is easy to judge as the dataset is only PBMCs, which have a well-known composition.

v0.13.0

import scanpy as sc
import scvi

# normalize data
sc.pp.normalize_total(adata, target_sum=1e4, exclude_highly_expressed=True)
sc.pp.log1p(adata)
adata.raw = adata

# find HVGs
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True, flavor="cell_ranger", batch_key="orig.ident")

# setup anndata
scvi.data.setup_anndata(adata, batch_key="orig.ident", layer="counts")

# train model
model = scvi.model.SCVI(adata)
model.train(max_epochs=500, early_stopping=True)

# plot training history
train_elbo = model.history["elbo_train"][1:]
test_elbo = model.history["elbo_validation"]
ax = train_elbo.plot()
test_elbo.plot(ax=ax)

[image: ELBO training history, v0.13.0]

v0.8.1:

import scanpy as sc
import scvi
import pandas as pd
import matplotlib.pyplot as plt

# normalize data
sc.pp.normalize_total(adata, target_sum=1e4, exclude_highly_expressed=True)
sc.pp.log1p(adata)
adata.raw = adata

# find HVGs
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True, flavor="cell_ranger", batch_key="orig.ident")

# setup anndata
scvi.data.setup_anndata(adata, batch_key="orig.ident", layer="counts")

# train model
model = scvi.model.SCVI(adata)
model.train(n_epochs=3, n_iter_kl_warmup=1600, n_epochs_kl_warmup=None, frequency=1)

# plot training history
train_test_results = pd.DataFrame(model.trainer.history).rename(columns={"elbo_train_set": "Train", "elbo_test_set": "Test"})
print(train_test_results)
ax = train_test_results.plot()
ax.set_xlabel("Epoch")
ax.set_ylabel("Error")
plt.show()

[image: ELBO training history, v0.8.1]

A few things:

  1. Could you provide the actual history dataframe values? The scale of these plots is quite different.
  2. The way you run the latest version is not the same as before. You need to add plan_kwargs=dict(n_steps_kl_warmup=1200, n_epochs_kl_warmup=None)
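Putting that suggestion together, the v0.13-style call would look roughly like this (a sketch assuming `adata` has already been registered via scvi.data.setup_anndata as in the code above; the warmup values are taken from the suggestion, not tuned):

```python
import scvi

# Assumes adata was already registered via scvi.data.setup_anndata(...)
model = scvi.model.SCVI(adata)
model.train(
    max_epochs=5,
    early_stopping=True,
    check_val_every_n_epoch=1,
    # Restore the v0.8-style iteration-based KL warmup instead of the
    # default epoch-based schedule:
    plan_kwargs=dict(n_steps_kl_warmup=1200, n_epochs_kl_warmup=None),
)
```

Without overriding `plan_kwargs`, the newer releases warm up the KL term over epochs rather than iterations, which can make early-epoch ELBO curves look quite different between versions.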

I think the main problem was the different scales. Sorry for the bother, and thanks!