Different UMAPs for same dataset with scANVI

I am using scVI and scANVI to integrate datasets from two species. The degree of species integration on the UMAP differs substantially depending on whether scANVI is trained as a new model or initialized from an existing scVI model (see code below). I have confirmed that the environment is the same and that the random seed is set to the same value in both scripts. The data is identical, and adata_subset is created with the same subset of highly variable genes.

Here is the code for training an scVI model first and then passing it to scANVI:
import scanpy as sc
import scvi
scvi.model.SCVI.setup_anndata(adata_subset, batch_key="10x_batch", layer="raw", labels_key="cell_type")
model = scvi.model.SCVI(adata_subset, dispersion="gene-batch")
model.train()
lvae = scvi.model.SCANVI.from_scvi_model(model, adata=adata_subset, labels_key="cell_type", unlabeled_category="Unknown")
lvae.train()
adata_subset.obsm["X_scANVI"] = lvae.get_latent_representation(adata_subset)
sc.pp.neighbors(adata_subset, use_rep="X_scANVI")
sc.tl.umap(adata_subset)

And here is the code for training scANVI directly:
import scanpy as sc
import scvi
scvi.model.SCANVI.setup_anndata(adata_subset, labels_key="cell_type", unlabeled_category="Unknown", layer="raw", batch_key="10x_batch")
lvae = scvi.model.SCANVI(adata_subset, dispersion="gene-batch")
lvae.train()
adata_subset.obsm["X_scANVI"] = lvae.get_latent_representation(adata_subset)
sc.pp.neighbors(adata_subset, use_rep="X_scANVI")
sc.tl.umap(adata_subset)

Unfortunately I cannot share the UMAPs at this time, but hopefully this code is sufficient to hint at what could be happening. Thanks!
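
Since the UMAPs cannot be shared, one way to compare the two runs is to score species mixing directly on the latent space, which also removes the UMAP layout as a source of variation. The following is a minimal sketch under stated assumptions: `neighbor_mixing` is a hypothetical helper (not part of scvi-tools or scanpy), it uses plain NumPy with brute-force distances, and it is only a rough indicator, not a validated integration metric.

```python
import numpy as np


def neighbor_mixing(latent, groups, k=15):
    """Mean fraction of each cell's k nearest neighbors that belong to another group.

    Hypothetical helper for illustration. Values near 0 mean the groups
    are separated in latent space; higher values mean more mixing.
    """
    latent = np.asarray(latent, dtype=float)
    groups = np.asarray(groups)
    # Brute-force pairwise squared Euclidean distances.
    # This is O(n^2) in memory, so subsample large datasets first.
    sq = np.sum(latent**2, axis=1)
    dist = sq[:, None] + sq[None, :] - 2.0 * latent @ latent.T
    np.fill_diagonal(dist, np.inf)  # a cell is not its own neighbor
    # Indices of the k closest cells in each row (unordered within the k).
    nn = np.argpartition(dist, k, axis=1)[:, :k]
    other = groups[nn] != groups[:, None]
    return float(other.mean())
```

With the objects from the scripts above, a call could look like `neighbor_mixing(adata_subset.obsm["X_scANVI"], adata_subset.obs["species"].to_numpy())`, where the `species` column name is an assumption. Running it on the latent space from each script gives two numbers that can be compared directly.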

It is expected that the two scripts yield different results. We highly recommend training scVI first and then training scANVI. Training a classifier on a bad embedding can have unwanted side effects. If you think that training only scANVI gives better embeddings, I would recommend looking into MrVI with a cell-type bias, which uses a slightly different strategy and is better tested in these cases.

Can you explain why they would yield different results? I thought that scANVI was an extension of the scVI model, and that the extra training time for scANVI without a pretrained scVI model simply included the scVI training time.

Is there anything necessarily wrong with using scANVI without training an scVI model first? We had favorable results going directly to scANVI.

I will look into MrVI too, thank you for the recommendation.

scANVI was built with the idea of first pretraining an scVI model that does not take cell types into consideration. Using scANVI directly does not train an scVI model first (see e.g. the "Seed labeling with scANVI" tutorial in the scvi-tools documentation for how to correctly use scANVI).
Training it directly with a classifier increases reliance on correct cell-type labels and might have negative side effects (I have observed this several times). In your case, relying on labels might be helpful, but I would then recommend the MrVI (or scPoli) approach to enforcing cell-type labels.

Hmm, I see where I was confused. I was looking at the API documentation, and its example goes directly to scANVI. Is there a scenario where this usage is appropriate, or is the API outdated? Thank you!

That example is not a tutorial but a demonstration of the API. In particular, considering "Metric Mirages in Cell Embeddings", I am not comfortable with increasing the effect that cell-type labels have on embeddings.

This is a lot to think about, thank you very much!