Transferring labels from a reference dataset to a query dataset with a more diverse cell population using scANVI

Hello,

Firstly, thank you for providing such an excellent set of tools and tutorials.

I’m currently using scANVI to transfer cell labels from a reference dataset of NK cells (labeled according to NK cell subtypes) to a query dataset consisting of CD45+ cells from peripheral blood. In the initial results, all cells in the query dataset have been labeled as different NK cell subtypes.

I have a few questions regarding this process:

  1. Is it problematic to use a reference dataset with a much narrower variety of cells than the query dataset? Ideally, I only want to label the NK cells in the query dataset.

  2. Can I restrict the labeling to cells that meet a higher probability threshold?

  3. When using scanvi.predict to return probabilities, what would be a reliable probability threshold to consider the labeling as accurate?

Thank you for your assistance.

Hi, scANVI was not developed for this use case. Specifically, its predicted probabilities are not calibrated, and it cannot predict a cell type that was not observed in the reference.
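To make question 2 concrete, here is a minimal sketch of restricting labels to a probability cutoff. It assumes the soft predictions (for example from `predict(soft=True)`, which to my knowledge returns one probability per label and cell) have been converted to a plain dictionary. The function name, the cutoff of 0.9, the "Unassigned" label, and the NK subtype names are all hypothetical. Because the probabilities are not calibrated, a cutoff like this is arbitrary and will not reliably reject cell types that are absent from the reference.

```python
def threshold_labels(soft_predictions, cutoff=0.9, unknown="Unassigned"):
    """Keep the top label only if its probability reaches the cutoff.

    soft_predictions: dict mapping cell id -> dict of label -> probability.
    The cutoff is an arbitrary assumption, not a recommended value.
    """
    labels = {}
    for cell_id, probs in soft_predictions.items():
        # Label with the highest predicted probability for this cell.
        best_label = max(probs, key=probs.get)
        labels[cell_id] = best_label if probs[best_label] >= cutoff else unknown
    return labels


# Toy input with hypothetical subtype names and made-up probabilities.
soft_predictions = {
    "cell_1": {"NK_CD56bright": 0.97, "NK_CD56dim": 0.03},
    "cell_2": {"NK_CD56bright": 0.55, "NK_CD56dim": 0.45},
}
labels = threshold_labels(soft_predictions, cutoff=0.9)
```

Note that a non-NK cell can still receive a high probability for one NK subtype, since the model only ever chooses among the labels it was trained on.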
The number of tools that can detect query-specific cell types is quite limited. To address this need, we have developed popV (GitHub: YosefLab/PopV) and have tested it in similar cases beyond the actual manuscript, with good results. You will likely need to disable the use of a cell ontology, as the NK cell subsets are not part of the Cell Ontology. We have tested it in these settings, and it was rather straightforward to find a good decision boundary (usually, agreement of >5/7 algorithms highlights a confidently transferred label).
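The >5/7 decision boundary can be sketched as a simple vote count. This is only an illustration of the idea, not PopV's own code: it assumes you have already collected the label each of the seven algorithms assigned to every cell, and the function name, the "Unassigned" label, and the subtype names are hypothetical.

```python
from collections import Counter


def consensus_labels(per_method_labels, min_agreement=6, unknown="Unassigned"):
    """Keep the majority label only when enough algorithms agree.

    per_method_labels: dict mapping cell id -> list of labels, one per algorithm.
    min_agreement=6 corresponds to ">5 of 7" algorithms agreeing.
    Returns a dict mapping cell id -> (label, number of agreeing algorithms).
    """
    result = {}
    for cell_id, votes in per_method_labels.items():
        # Most frequent label and how many algorithms voted for it.
        label, count = Counter(votes).most_common(1)[0]
        result[cell_id] = (label if count >= min_agreement else unknown, count)
    return result


# Toy votes from seven algorithms (hypothetical labels).
per_method_labels = {
    "cell_a": ["NK_CD56dim"] * 7,
    "cell_b": ["NK_CD56dim"] * 4 + ["NK_CD56bright"] * 3,
}
consensus = consensus_labels(per_method_labels)
```

Cells that fall below the agreement threshold, such as non-NK cells in a CD45+ query, end up as "Unassigned" instead of being forced into an NK subtype.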

Thank you so much, will give this a try!

Do you think there would be any added value in first annotating NK cells using PopV with the built-in cell ontology, to narrow the query dataset down to NK cells only, and then, in a second run, not using the cell ontology but using the reference dataset and the >5/7 algorithm boundary as you suggest?

I think the two results will, in the best case, be very similar. I assume NK cells are defined more stringently in your dataset, as you are interested in these cells. I would go directly to your reference dataset, as I disagreed with some of the T cell labels in Tabula Sapiens.