Since I am very interested in modelling with extra categorical and continuous covariates passed to setup_anndata, I am wondering: as of which release is the correction of the latent space for these covariates actually implemented (rather than the covariates merely being registered during anndata setup without being used further)? I understand this is the case for the latest release, but what about older releases (e.g. 0.8)?
Is this different for categorical and continuous covariates?
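For reference, this is the kind of setup I have in mind (a minimal sketch; "donor" and "percent_mito" are placeholder names for obs columns in my own data):

import scvi

# minimal sketch of passing extra covariates to setup_anndata
# ("donor" and "percent_mito" are placeholders for my own obs columns)
scvi.data.setup_anndata(
    adata,
    batch_key="orig.ident",
    layer="counts",
    categorical_covariate_keys=["donor"],        # extra categorical covariate(s)
    continuous_covariate_keys=["percent_mito"],  # extra continuous covariate(s)
)
model = scvi.model.SCVI(adata)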
Thanks for providing this great tool and continuously improving it.
I have now switched to the latest version of scvi to make use of extra covariates. However, much of the user interface has changed, and I encountered something strange while training my models.
I have a rather large dataset of >100k cells, which I previously trained with:
model.train(n_epochs=5, n_iter_kl_warmup=1600, n_epochs_kl_warmup=None, frequency=1)
This converged very quickly in v0.8.1.
Now I am training a similar set of cells in v0.13.0 with the code shown further below.
However, this does not converge as quickly; in fact, even after 50 epochs I had not reached convergence, although the model is clearly still learning. This was also independent of whether covariates were used.
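For what it's worth, my attempt at reproducing the old warmup settings in the new API looks like this (a sketch; I am assuming plan_kwargs is forwarded to the training plan and that n_iter_kl_warmup corresponds to n_steps_kl_warmup there):

# sketch: mapping the old v0.8.1 warmup keywords onto v0.13.0 plan_kwargs
# (assumes n_iter_kl_warmup corresponds to n_steps_kl_warmup in the new training plan)
model.train(
    max_epochs=50,
    plan_kwargs={
        "n_steps_kl_warmup": 1600,   # old: n_iter_kl_warmup=1600
        "n_epochs_kl_warmup": None,  # old: n_epochs_kl_warmup=None
    },
)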
Dear Adam,
This is the relevant code for both versions; I left out the data loading part. The dataset size is ~550k cells. Please also see the ELBO plots. In this regard, I may not have used the term "convergence" in a strictly technical sense; I was just referring to "no substantial further loss reduction per epoch". Clearly, the model parameters differ, as some of the keywords from the old version are not available in the latest version. Please let me know whether you need any further information regarding the versions of any dependency packages.
I have to say the resulting UMAP plots (not part of the code here) are very comparable in quality, which is easy to judge since the dataset consists only of PBMCs, which have a well-known composition.
v0.13.0:
import scanpy as sc
import scvi

# normalize data
sc.pp.normalize_total(adata, target_sum=1e4, exclude_highly_expressed=True)
sc.pp.log1p(adata)
adata.raw = adata
# find HVGs
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True, flavor="cell_ranger", batch_key="orig.ident")
# setup anndata
scvi.data.setup_anndata(adata, batch_key="orig.ident", layer="counts")
# train model
model = scvi.model.SCVI(adata)
model.train(max_epochs=500, early_stopping=True)
# plot training history
train_elbo = model.history["elbo_train"][1:]  # drop the first epoch for a more readable plot
test_elbo = model.history["elbo_validation"]
ax = train_elbo.plot()
test_elbo.plot(ax=ax)
v0.8.1:
import matplotlib.pyplot as plt
import pandas as pd
import scanpy as sc
import scvi

# normalize data
sc.pp.normalize_total(adata, target_sum=1e4, exclude_highly_expressed=True)
sc.pp.log1p(adata)
adata.raw = adata
# find HVGs
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True, flavor="cell_ranger", batch_key="orig.ident")
# setup anndata
scvi.data.setup_anndata(adata, batch_key="orig.ident", layer="counts")
# train model
model = scvi.model.SCVI(adata)
model.train(n_epochs=3, n_iter_kl_warmup=1600, n_epochs_kl_warmup=None, frequency=1)
# plot training history
train_test_results = pd.DataFrame(model.trainer.history).rename(
    columns={"elbo_train_set": "Train", "elbo_test_set": "Test"}
)
print(train_test_results)
ax = train_test_results.plot()
ax.set_xlabel("Epoch")
ax.set_ylabel("Error")
plt.show()
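In case it is relevant, I have also been considering loosening the early stopping in v0.13.0 so training runs longer before halting (a sketch; I am assuming these keyword arguments are forwarded to scvi.train.Trainer, and the exact names may differ between releases):

# sketch: making early stopping less aggressive in v0.13.0
# (assumes these kwargs are forwarded to scvi.train.Trainer; names may differ by release)
model.train(
    max_epochs=500,
    early_stopping=True,
    early_stopping_monitor="elbo_validation",  # metric watched for improvement
    early_stopping_patience=45,                # epochs without improvement before stopping
    early_stopping_min_delta=0.0,              # minimum change that counts as improvement
)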