Label transfer with SCVI-SCANVI pipeline changes (predicts wrong) labels in ref data

Avaptel18 · November 21, 2022, 4:42pm

Hi, I am following the tutorial "Integration and label transfer with Tabula Muris" on the scvi-tools docs page. Everything works fine, and I save the predicted labels in a new metadata column:

adata.obs["C_scANVI"] = lvae.predict(adata)  # saving predicted labels in new column

However, when I check the ref labels, some of them are predicted differently from what they were before I trained the SCVI model (when I concatenated the reference with my query data).
I don’t understand why this happens.
My understanding is that I am training the SCVI model on the labelled ref data and then using SCANVI to transfer labels to the ‘Unknown’ cells in the query dataset.
Why is it predicting some of the ref data labels wrongly?
Any advice, please? Am I doing something wrong here?
Thanks!

As you can see in the screenshot, the label of the ref data cell ‘Cell cycle_TCGGTCTGTGAGAGGG-1_32_1-1’ changed from Trm-c to CTL-c.

adamgayoso · December 2, 2022, 5:34am

Can you describe how many of the training labels are wrong?

The predict function makes a prediction for every cell, including the reference data, of which by default 90% is used for training and 10% for validation. Either scANVI is getting it wrong because something is systematically off, and/or there is noise in the training labels.
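
To put a number on how many reference labels changed, the original and predicted labels can be compared for the labeled cells only. The following is a minimal sketch, not part of scvi-tools: the helper name `reference_mismatch_rate` is hypothetical, and the toy lists stand in for the `celltype_scanvi` and `C_scANVI` columns used in this thread.

```python
def reference_mismatch_rate(original, predicted, unlabeled="Unknown"):
    """Fraction of labeled (reference) cells whose prediction differs from the original label."""
    # Keep only cells that had a real label; query cells are marked `unlabeled`.
    pairs = [(o, p) for o, p in zip(original, predicted) if o != unlabeled]
    if not pairs:
        raise ValueError("no labeled cells to compare")
    wrong = sum(1 for o, p in pairs if o != p)
    return wrong / len(pairs)


# Toy data standing in for adata.obs["celltype_scanvi"] and adata.obs["C_scANVI"].
original = ["Trm-c", "Trm-c", "CTL-c", "Unknown", "Unknown"]
predicted = ["Trm-c", "CTL-c", "CTL-c", "CTL-c", "Trm-c"]

rate = reference_mismatch_rate(original, predicted)
print(f"{rate:.1%} of reference labels changed")
```

With real data, the two `adata.obs` columns can be passed directly, since the helper only iterates over them.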

Avaptel18 · December 6, 2022, 5:33pm

Hi Adam,
SCANVI predicted 35.2% of the labels in the reference dataset wrong.

Here is some of my code. It is probably not enough for you to figure out what’s wrong, but I am providing it anyway in case there is an obvious mistake.

#################
import scanpy as sc
import scvi

adata.layers["counts"] = adata.X.copy()  # preserve count layer
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
adata.raw = adata

sc.pp.highly_variable_genes(
    adata,
    flavor="seurat_v3",
    n_top_genes=3000,  # 3000 HVGs selected
    layer="counts",
    batch_key="batch",
    subset=True,
)

scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
vae = scvi.model.SCVI(adata, n_layers=4, n_latent=30)
vae.train()

vae
SCVI Model with the following params:
n_hidden: 128, n_latent: 30, n_layers: 4, dropout_rate: 0.1, dispersion: gene,
gene_likelihood: zinb, latent_distribution: normal
Training status: Trained

Transfer of annotations with scANVI

adata.obs["celltype_scanvi"] = "Unknown"
ss2_idx = adata.obs["batch"] == "1"  # reference cells
# .loc avoids pandas chained assignment
adata.obs.loc[ss2_idx, "celltype_scanvi"] = adata.obs.loc[ss2_idx, "Ident2"]

scvi.model.SCANVI.setup_anndata(
    adata,
    layer="counts",
    batch_key="batch",
    labels_key="celltype_scanvi",
    unlabeled_category="Unknown",
)

lvae = scvi.model.SCANVI.from_scvi_model(
    vae,
    "Unknown",
    adata=adata,
    labels_key="celltype_scanvi",
)

lvae.train(max_epochs=20, n_samples_per_label=100)

lvae
ScanVI Model with the following params:
unlabeled_category: Unknown, n_hidden: 128, n_latent: 30, n_layers: 4, dropout_rate: 0.1,
dispersion: gene, gene_likelihood: zinb
Training status: Trained

#################

Am I missing something here?
Please help!
Thank you.

adamgayoso · December 11, 2022, 1:13am

It’s hard for me to diagnose without understanding what kinds of mistakes it’s making. Is it predicting random or related cell types?

Can you try training the scanvi part for longer?
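
One way to see what kinds of mistakes are being made is to tally the most frequent (original, predicted) pairs among the mislabeled reference cells. This is a minimal sketch under the same assumptions as the thread's column names; the helper `top_confusions` is hypothetical and the toy lists are illustrative only.

```python
from collections import Counter


def top_confusions(original, predicted, unlabeled="Unknown", n=10):
    """Most frequent (original, predicted) pairs among mislabeled reference cells."""
    counts = Counter(
        (o, p)
        for o, p in zip(original, predicted)
        if o != unlabeled and o != p  # labeled cells whose label changed
    )
    return counts.most_common(n)


# Toy data standing in for adata.obs["celltype_scanvi"] and adata.obs["C_scANVI"].
original = ["Trm-c", "Trm-c", "Trm-c", "CTL-c", "Unknown"]
predicted = ["CTL-c", "CTL-c", "Trm-c", "Trm-c", "CTL-c"]

confusions = top_confusions(original, predicted)
for (orig_label, pred_label), count in confusions:
    print(f"{orig_label} -> {pred_label}: {count}")
```

If most of the counts fall on pairs of closely related cell types, the errors are more likely to reflect ambiguous or noisy labels than a systematic failure.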

Avaptel18 · December 11, 2022, 12:00pm

Hi Adam, it’s predicting all related cell types. I can try training scANVI for longer and see if it improves the predictions.
Ideally, what percentage of correct predictions should I expect?
Thanks!

adamgayoso · December 12, 2022, 6:15am

The accuracy on the labeled data should be near 100%. We should be able to expose the training accuracy in the model history to make this easier to check.

Avaptel18 · December 12, 2022, 3:42pm

Hi Adam, that would be great! Thanks.
I tried increasing max_epochs for SCANVI, and it decreased the wrong predictions from 35% to 26%. I can try increasing it further, but something seems to be missing: I am training SCANVI with 36 label types and it now predicts only about 18. I don’t understand why it’s omitting half of the cell types.
I am specifying n_samples_per_label=100, but all ref labels have over 140 cells.
Is there a specific parameter I can set so that it predicts all cell types?
Thanks!

I have 145000 cells in the ref dataset with 36 immune cell types (skin CD45+ cells: https://www.science.org/doi/10.1126/sciimmunol.abl9165?url_ver=Z39.88-2003&rfr_id=ori:rid:crossref.org&rfr_dat=cr_pub%20%200pubmed).
My query dataset is also skin CD45+ cells (102000 cells), and I think all 36 cell types should be present in it.
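
To find out exactly which of the trained labels never appear among the predictions, the two label sets can be compared. This is a minimal sketch; the helper `never_predicted` is hypothetical, and "Treg" in the toy data is an illustrative label rather than one confirmed in this thread.

```python
def never_predicted(original, predicted, unlabeled="Unknown"):
    """Labels the model was trained on that do not occur in any prediction."""
    trained = {o for o in original if o != unlabeled}
    return sorted(trained - set(predicted))


# Toy data standing in for adata.obs["celltype_scanvi"] and adata.obs["C_scANVI"].
original = ["Trm-c", "CTL-c", "Treg", "Unknown"]
predicted = ["Trm-c", "Trm-c", "CTL-c", "Trm-c"]

missing = never_predicted(original, predicted)
print(missing)
```

Cross-checking the missing labels against their cell counts in the reference may show whether the omitted types are the rarer ones.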

miyang · July 31, 2023, 2:09pm

Hello,
Is there some sort of p-value or significance level for the predicted cell types?
Thanks!

martinkim0 · July 31, 2023, 3:47pm

Hi, passing in soft=True to SCANVI’s predict method returns prediction probabilities from the cell type classifier. However, these shouldn’t be interpreted as p-values or significance levels, nor are they typically well calibrated (i.e. the classifier is confident even when giving wrong predictions).
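
As a rough illustration of how such probabilities might be used, the sketch below keeps the top label per cell only when its probability clears a cutoff. It assumes the probabilities have been converted to a plain mapping of cell to {label: probability}, for example with `lvae.predict(adata, soft=True).to_dict(orient="index")`. The helper `confident_calls`, the 0.9 cutoff, and the "Uncertain" fallback are all hypothetical choices, and the cutoff is not a significance level.

```python
def confident_calls(probs, threshold=0.9, fallback="Uncertain"):
    """Top label per cell, or `fallback` when the top probability is below `threshold`."""
    calls = {}
    for cell, row in probs.items():
        label, p = max(row.items(), key=lambda kv: kv[1])
        calls[cell] = label if p >= threshold else fallback
    return calls


# Toy probabilities; real values would come from the classifier's soft predictions.
probs = {
    "cell_1": {"Trm-c": 0.97, "CTL-c": 0.03},
    "cell_2": {"Trm-c": 0.55, "CTL-c": 0.45},
}

calls = confident_calls(probs)
print(calls)
```

Because the classifier can be confident and still wrong, a filter like this only flags cells the model itself is unsure about.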
