Limited cell types in reference dataset for scANVI

ayyldz · December 2, 2025, 10:58am

Thank you for developing such powerful tools, very useful!
I am trying to annotate some immature cell populations in an adult dataset, using a fetal dataset as the reference because these immature populations are abundant there. However, since the mature cell types are absent from this reference, scANVI predicts everything in the adult data as immature. I am sure this is not an uncommon problem, and I wonder what the best way to handle these annotation errors is. Since I am interested in annotating the immature populations in the adult dataset and I know which adult cells are mature, can I add some mature labels from the adult data to improve the annotation? That is, rather than masking the adult labels completely, leave some mature cells labelled and set the rest to unknown for scANVI to annotate. In that case, what is a good fraction of labelled cells per mature cell type? Something like 25-30%?
I would appreciate any other suggestions as well.

ori-kron-wis · December 4, 2025, 8:09am

Hey @ayyldz ,

First, I think there is no magic number for how many annotated cells you need. It depends on the dataset and the problem, and on how well the annotated cells you have represent all the cell types and their intraclass heterogeneity.

I think you can validate this yourself with a train-valid-test experiment for different numbers of annotated cells (you set some labelled cells aside and check whether your model predicts them correctly).
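
A minimal sketch of such a hold-out experiment in plain Python. The helper names (`mask_labels`, `heldout_accuracy`) and the `"Unknown"` placeholder are assumptions made for illustration and are not part of scvi-tools. In practice, `predicted` would be the model's predictions after training on the masked labels.

```python
import random

UNKNOWN = "Unknown"  # assumed placeholder for unlabelled cells


def mask_labels(labels, keep_fraction, seed=0):
    """Keep `keep_fraction` of the cells of each label and mask the rest.

    Returns (masked_labels, heldout_indices); at least one cell per label
    always stays labelled.
    """
    rng = random.Random(seed)
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)

    masked = list(labels)
    heldout = []
    for indices in by_label.values():
        rng.shuffle(indices)
        n_keep = max(1, round(keep_fraction * len(indices)))
        for idx in indices[n_keep:]:
            masked[idx] = UNKNOWN
            heldout.append(idx)
    return masked, sorted(heldout)


def heldout_accuracy(true_labels, predicted, heldout):
    """Fraction of held-out cells whose prediction matches the known label."""
    if not heldout:
        return float("nan")
    hits = sum(true_labels[i] == predicted[i] for i in heldout)
    return hits / len(heldout)
```

One way to use this is to loop over a few values of `keep_fraction` (for example 0.1, 0.25, 0.5), retrain on each `masked` label vector, and compare `heldout_accuracy` across runs to see where adding more labelled cells stops helping.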

Secondly, perhaps I didn’t understand, but if you know which cells are mature in the adult dataset, why not work only on the other part of the dataset (the suspected immature cells) and use it as a query to a reference scANVI model trained on the fetal dataset? The predictions will be your initial labels for the immature part of the adult dataset. Then you can focus on the adult dataset only and try to improve those predictions in other ways, with all labels used (mature + immature).

ayyldz · December 4, 2025, 1:06pm

Thank you @ori-kron-wis !
This helps a lot but also raises new questions. Maybe I should have explained my situation better.
So I do have some cells that I labelled as mature and some as potentially immature in my adult dataset. The idea is to let scANVI re-annotate both of these clusters, since there might be some immature cells buried among the mature cells (they lie on a continuum, so we may have mislabelled some of them as mature or vice versa). That is where my problem started: with label transfer, the whole adult population gets an immature label because there are very few mature cells (n=8) in my fetal dataset. So I was experimenting with adding some mature cell labels from the adult data and letting scANVI label the rest in a joint latent space. What you suggest, trying different thresholds, makes sense for this joint space.
As for the last paragraph, you refer to reference mapping, and I am not sure I understand the correct approach in my case. I can subset the potentially immature adult cells, use them as a query to project onto the fetal reference, and annotate them. Then I go to the adult-only dataset, with known mature labels plus predicted immature labels, and run the experiment you describe (train-validate-test) to fine-tune the annotations? Meaning I hold out some cells (mature and immature) for training and some for testing until I find a good threshold to redefine both labels? Am I interpreting this correctly, or did you mean something else?
Thanks a lot once again!

ori-kron-wis · December 4, 2025, 1:40pm

Basically, yes, that's what I meant, so a 2-step analysis:

1. Subset the potentially immature adult cells and use them as a query to project onto the fetal dataset (remove those 8 mature cells from the fetal dataset) with a scANVI model. That will give you an initial labelling of the immature adult cells.
   Don't mix the 2 datasets (don't add annotated adult cells to the fetal dataset).
2. Then train another scANVI model on the adult dataset only (which now comprises the annotations of immature cells projected from the fetal dataset plus the % of mature cells you manually annotated). Then optimize this flow, as there are probably some wrong annotations, both from the query model and from the manual process. This can be tuned using a train-valid-test scheme.
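
The two steps above can be sketched roughly as follows. This is an untested outline under several assumptions: the `AnnData` objects `fetal`, `adult_candidates`, and `adult`, the column names `cell_type` and `sample`, the `"Unknown"` category, and all function names are hypothetical, and training settings are left at defaults apart from the query step. `build_adult_labels` is a plain-Python helper for combining the two label sources before step 2.

```python
UNKNOWN = "Unknown"  # assumed name of the unlabelled category


def transfer_from_fetal(fetal, adult_candidates, labels_key="cell_type",
                        batch_key="sample", drop_labels=()):
    """Step 1: scANVI reference on fetal cells, adult candidates as query."""
    import scvi  # lazy import so the pure-Python helper works without scvi-tools

    # Drop reference labels with too few cells (e.g. the 8 mature fetal cells).
    drop = fetal.obs[labels_key].isin(list(drop_labels)).to_numpy()
    fetal = fetal[~drop].copy()

    scvi.model.SCVI.setup_anndata(fetal, batch_key=batch_key, labels_key=labels_key)
    scvi_ref = scvi.model.SCVI(fetal)
    scvi_ref.train()
    scanvi_ref = scvi.model.SCANVI.from_scvi_model(
        scvi_ref, unlabeled_category=UNKNOWN, labels_key=labels_key
    )
    scanvi_ref.train()

    query = adult_candidates.copy()
    query.obs[labels_key] = UNKNOWN  # hide adult labels from the query model
    scvi.model.SCANVI.prepare_query_anndata(query, scanvi_ref)
    scanvi_query = scvi.model.SCANVI.load_query_data(query, scanvi_ref)
    scanvi_query.train(max_epochs=100, plan_kwargs={"weight_decay": 0.0})
    return dict(zip(query.obs_names, scanvi_query.predict()))


def build_adult_labels(cell_ids, manual_mature, transferred, unknown=UNKNOWN):
    """Combine manual mature labels with labels transferred in step 1.

    Manual labels take precedence; cells found in neither source stay unknown.
    """
    labels = {}
    for cell in cell_ids:
        if cell in manual_mature:
            labels[cell] = manual_mature[cell]
        else:
            labels[cell] = transferred.get(cell, unknown)
    return labels


def refine_on_adult(adult, labels_key="cell_type", batch_key="sample"):
    """Step 2: a separate scANVI model trained on the adult data only."""
    import scvi

    scvi.model.SCVI.setup_anndata(adult, batch_key=batch_key, labels_key=labels_key)
    scvi_adult = scvi.model.SCVI(adult)
    scvi_adult.train()
    scanvi_adult = scvi.model.SCANVI.from_scvi_model(
        scvi_adult, unlabeled_category=UNKNOWN, labels_key=labels_key
    )
    scanvi_adult.train()
    return scanvi_adult.predict()
```

Before step 2, the combined labels would be written into `adult.obs[labels_key]`, with a held-out portion set to `"Unknown"` so the re-annotation can be checked against the labels that were set aside.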
