scANVI relabels known cells with known types incorrectly

KevinMenden · May 27, 2021, 12:21pm

Hi scvi-tools Team,

I have been trying out scVI and scANVI and am evaluating how well label transfer works. For my current test setup, I have 4 different datasets (human liver), which I have manually labeled. To test scANVI, I overwrite the labels for one dataset with “Unknown”. I pretty much just follow your tutorial for scANVI.

The label transfer works pretty well for the dataset with the “Unknown” cell types. However, when I use the SCANVI.predict() function, it generates wrong labels for cells whose types are actually known (so not marked as “Unknown”). And bad ones at that.

Now my question: is that expected, i.e. is scANVI supposed to predict labels for these cells as well? And the more difficult question: any idea why it behaves this way? I’m still assuming that I’m just doing something wrong, but I can’t figure out what. I have tried both starting from a pre-trained scVI model and training a scANVI model from scratch. Any help would be highly appreciated!

I’ll put code and a figure below. It’s pretty clear when looking at the NKT cluster in celltype_scanvi and then comparing to the same cluster in C_scANVI. This cluster doesn’t contain unlabeled cells.

Cheers,
Kevin

import numpy as np
import scanpy as sc
import scvi

# adata: AnnData holding the four concatenated liver datasets (loaded beforehand)
# Start with every cell marked as unlabeled
adata.obs["celltype_scanvi"] = "Unknown"

# Restore the manual labels for datasets 0, 1, 2; the remaining dataset stays "Unknown".
# .loc avoids pandas chained assignment, which can silently write to a copy.
labeled = adata.obs["batch"].isin(["0", "1", "2"])
adata.obs.loc[labeled, "celltype_scanvi"] = adata.obs.loc[labeled, "celltype"].astype(str)

adata.obs["celltype_scanvi"] = adata.obs["celltype_scanvi"].astype("str")

# Sanity check: number of cells per label, including "Unknown"
np.unique(adata.obs["celltype_scanvi"], return_counts=True)

scvi.data.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",
    labels_key="celltype_scanvi",
)

lvae = scvi.model.SCANVI(adata, "Unknown", n_latent=30, n_layers=2)

lvae.train(n_samples_per_label=100)

adata.obs["C_scANVI"] = lvae.predict(adata)
adata.obsm["X_scANVI"] = lvae.get_latent_representation(adata)
sc.pp.neighbors(adata, use_rep="X_scANVI")
sc.tl.umap(adata)

sc.pl.umap(adata, color=["celltype_scanvi", "C_scANVI", "batch"], ncols=1, frameon=False)
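
One way to put a number on the mismatch described above is to compare celltype_scanvi with C_scANVI on the labeled cells only. This is a minimal sketch in pandas, not part of scvi-tools: the helper name known_label_agreement and the toy table (including the "T cell" and "Hepatocyte" labels) are made up for illustration, and in practice obs would be adata.obs.

```python
import pandas as pd


def known_label_agreement(obs, true_key="celltype_scanvi",
                          pred_key="C_scANVI", unlabeled="Unknown"):
    """Per cell type, the fraction of labeled cells whose prediction
    matches the label that was given to the model."""
    # Drop the cells that were deliberately masked as unlabeled
    labeled = obs[obs[true_key].astype(str) != unlabeled]
    truth = labeled[true_key].astype(str)
    pred = labeled[pred_key].astype(str)
    # Mean of a boolean series = fraction of matches within each group
    return (truth == pred).groupby(truth).mean().sort_values()


# Toy stand-in for adata.obs (hypothetical values)
obs = pd.DataFrame({
    "celltype_scanvi": ["NKT", "NKT", "Hepatocyte", "Hepatocyte", "Unknown"],
    "C_scANVI": ["T cell", "NKT", "Hepatocyte", "Hepatocyte", "NKT"],
})
agreement = known_label_agreement(obs)
print(agreement)
```

Cell types at the top of the sorted result are the ones that get relabeled most often, which makes it easier to tell whether the problem is limited to a few clusters.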

adamgayoso · May 27, 2021, 4:10pm

I’ll have to look at this again more closely later, but a few quick comments:

What version of scvi-tools are you using? In the latest version, the workflow is to first pretrain an SCVI model. This used to happen implicitly, but we separated it out for code reasons.

vae = scvi.model.SCVI(adata, n_layers=2, n_latent=30)
vae.train()
scanvi_model = scvi.model.SCANVI.from_scvi_model(vae, 'Unknown')
scanvi_model.train(25)

though I see now this tutorial was not properly updated with this workflow.

In your workflow, how many epochs is scanvi trained for?

As for whether scANVI is supposed to predict labels for the already-labeled cells as well: yes, though the accuracy should be higher.

KevinMenden · May 28, 2021, 10:44am

Thanks for the answer!

I’m using version 0.10.0

I have tried both, training an SCVI model before and then starting from that, or training a SCANVI model directly.
In the former case, it trains for ~60 epochs SCVI and then ~7 epochs SCANVI. In the latter case just ~60 epochs SCANVI.

Okay. Given that I wouldn’t want to re-label cells which I have already labeled, I could of course extract the predictions and only apply them to the unknown cells myself. Maybe having this as an option would be helpful (i.e. keep the labels of cells with known labels fixed)?
Of course this still doesn’t answer the question of why it behaves so weirdly for some clusters.
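
The workaround described here, keeping the manual labels and taking predictions only for the "Unknown" cells, could be sketched in pandas roughly as follows. The helper name keep_known_labels, the celltype_final column, and the toy values are hypothetical; in practice obs would be adata.obs after calling predict().

```python
import pandas as pd


def keep_known_labels(obs, label_key="celltype_scanvi",
                      pred_key="C_scANVI", unlabeled="Unknown"):
    """Return the original label where one exists and the scANVI
    prediction only where the cell was marked as unlabeled."""
    known = obs[label_key].astype(str)
    pred = obs[pred_key].astype(str)
    # Series.where keeps `known` where the condition is True, else uses `pred`
    return known.where(known != unlabeled, pred)


# Toy stand-in for adata.obs (hypothetical values)
obs = pd.DataFrame({
    "celltype_scanvi": ["NKT", "Unknown", "Hepatocyte"],
    "C_scANVI": ["T cell", "NKT", "Hepatocyte"],
})
obs["celltype_final"] = keep_known_labels(obs)
print(obs)
```

This only hides the disagreement on labeled cells; it does not explain why the classifier disagrees with its own training labels.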

adamgayoso · May 28, 2021, 3:52pm

Keeping the labels of already-labeled cells fixed should probably be a default. We will make a note of it.

This seems like a small number of epochs. How many cells do you have?

KevinMenden · May 29, 2021, 7:43am

It’s about 140k cells. I will simply try more epochs then. I’ll let you know if it helps.

galenxing · May 31, 2021, 8:53pm

Hey Kevin,

scANVI predicting known celltypes incorrectly is something I’ve also observed – but haven’t extensively tested.

A few more suggestions to potentially improve results:

If your smallest cell type has more than 100 cells, I would set the n_samples_per_label arg in lvae.train() to that count. (This way you’ll train on more cells each epoch.)
I agree with Adam about increasing the number of scANVI epochs. I would even train for something like 50 epochs, since with the n_samples_per_label param you’re subsampling the training set.

Note, you’ll need to install the latest version of scvi-tools off of master. I just fixed a bug in max_epochs for scANVI: “fix scANVI max_epochs bug when pretrained” (Pull Request #1079, YosefLab/scvi-tools on GitHub).
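
As a rough illustration of the first suggestion, the size of the smallest labeled cell type can be read off the label column before training. This is a sketch under assumptions: the helper name smallest_labeled_class_size and the toy counts are invented, and the fallback to 100 simply mirrors the value used in the original script.

```python
import pandas as pd


def smallest_labeled_class_size(labels, unlabeled="Unknown"):
    """Number of cells in the least frequent labeled cell type."""
    labels = labels.astype(str)
    counts = labels[labels != unlabeled].value_counts()
    return int(counts.min())


# Toy stand-in for adata.obs["celltype_scanvi"] (hypothetical counts)
labels = pd.Series(["NKT"] * 150 + ["Hepatocyte"] * 400 + ["Unknown"] * 50)

n_min = smallest_labeled_class_size(labels)
# Use the smallest class size only if it is larger than 100, as suggested
n_samples_per_label = n_min if n_min > 100 else 100
print(n_samples_per_label)

# The value would then be passed on to training, for example:
# lvae.train(max_epochs=50, n_samples_per_label=n_samples_per_label)
```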

adamgayoso · June 7, 2021, 4:13pm

@KevinMenden We’d like to further troubleshoot this. Are you able to share your data with us?

KevinMenden · June 7, 2021, 5:18pm

Hi both,

sorry for not responding, I was on vacation.

I will try out your ideas. Yes the data are public so I can share them with you. I can basically send you the datasets as processed by me and the scripts I use.

Any preference about how to share the data with you?

adamgayoso · June 7, 2021, 7:21pm

I think a Google colab notebook (like our tutorials) that reproduces the issue is easiest, but even sharing the data and script is sufficient (dropbox, google, etc.)

Thanks!

KevinMenden · June 8, 2021, 7:07am

Alright, I’ll send you something tomorrow!

KevinMenden · June 9, 2021, 6:02am

Okay I’ve uploaded the labeled datasets (in .h5ad format) and the script I used here:

You should be able to just run the notebook from within that folder. I’ll install the patched scANVI version now and try to set max_epochs higher. Didn’t work with the current version.

KevinMenden · June 9, 2021, 8:24am

Quick update from my side:

increasing the scANVI epochs to 50 didn’t really help
additionally removing the subsampling did help

Without the subsampling and training scANVI for 50 epochs, it looks much better now and basically all cells are labeled correctly. A few labels have changed but those probably make sense.

nrclaudio · March 28, 2023, 9:29am

Hey,

I’m having the exact same problem here. What does Kevin mean by removing the subsampling? Just setting n_samples_per_label to the number of cells in the least frequent cell type?

Thanks

martinkim0 · April 18, 2023, 8:50pm

Hi @nrclaudio, sorry for the late reply. This would mean setting n_samples_per_label=None, which is the default option.
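
Putting the thread together, a training call without subsampling might look like the sketch below. The wrapper train_without_subsampling is a hypothetical helper, not an scvi-tools function; it only relies on the max_epochs and n_samples_per_label arguments of train() that are discussed above. The commented usage follows the snippet posted earlier in this thread for scvi-tools 0.10.0, and signatures in newer releases may differ.

```python
def train_without_subsampling(model, max_epochs=50):
    """Train a scANVI-like model on all labeled cells.

    n_samples_per_label=None disables the per-label subsampling,
    and max_epochs=50 is the value that was tried in this thread.
    """
    model.train(max_epochs=max_epochs, n_samples_per_label=None)
    return model


# Usage, assuming scvi-tools is installed and adata is already set up:
#
# import scvi
# vae = scvi.model.SCVI(adata, n_layers=2, n_latent=30)
# vae.train()
# scanvi_model = scvi.model.SCANVI.from_scvi_model(vae, "Unknown")
# scanvi_model = train_without_subsampling(scanvi_model, max_epochs=50)
# adata.obs["C_scANVI"] = scanvi_model.predict(adata)
```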
