Hi scvi-tools Team,
I have been trying to transfer labels from one dataset to another using scVI and scANVI.
My query dataset covers different timepoints of the developing mouse midbrain (from embryonic day 7 until postnatal day 7), while the reference dataset comprises whole cortex and hippocampus from ~8-week-old mice.
The predictions do not look meaningful, and I am not sure whether I am doing something wrong, whether it is a problem with the tool, or whether the datasets are simply not comparable.
To illustrate the problem, here is the script that I ran.
Note: I am new to bioinformatics, and for a few months now I have been struggling to find a reference dataset or database that I can use to annotate my embryonic mouse data, hence the use of adult data as a reference.
Thank you in advance,
I’m not super familiar with the new scANVI API, but the way it works (unless something has changed…) is that scANVI learns a common latent representation for both the reference data and the query data. The predictions are then based on which cells overlap in that representation. If I were in your situation, to ‘debug’ what is going on, I’d start by getting the representations for both the reference and query cells and visualizing them together, e.g. with UMAP as you are doing in your notebook, to see whether the query cells overlap with the reference cells at all in that common representation.
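As a rough numeric complement to eyeballing the UMAP, one could ask, for every query cell, what fraction of its nearest neighbours in the latent space are reference cells. The sketch below uses only NumPy. The helper name reference_neighbor_fraction and the obs column "dataset" are hypothetical, and it assumes the latent matrix was obtained with something like model.get_latent_representation(adata) on an AnnData holding both reference and query cells.

```python
import numpy as np


def reference_neighbor_fraction(latent, is_reference, k=15):
    """For each query cell, return the fraction of its k nearest
    latent-space neighbours that are reference cells."""
    latent = np.asarray(latent, dtype=float)
    is_reference = np.asarray(is_reference, dtype=bool)
    if k >= latent.shape[0]:
        raise ValueError("k must be smaller than the number of cells")

    query_idx = np.flatnonzero(~is_reference)
    query = latent[query_idx]

    # Squared Euclidean distances, shape (n_query, n_cells).
    # This is a dense matrix, so subsample first on large datasets.
    sq_norms = (latent ** 2).sum(axis=1)
    d2 = sq_norms[query_idx][:, None] + sq_norms[None, :] - 2.0 * query @ latent.T

    # A cell should not count as its own neighbour.
    d2[np.arange(query_idx.size), query_idx] = np.inf

    # Indices of the k smallest distances per query cell (unordered).
    nearest = np.argpartition(d2, k - 1, axis=1)[:, :k]
    return is_reference[nearest].mean(axis=1)


# Assumed usage (placeholder names, not taken from the original script):
# latent = model.get_latent_representation(adata)
# is_ref = (adata.obs["dataset"] == "reference").to_numpy()
# frac = reference_neighbor_fraction(latent, is_ref)
```

Values near zero for most query cells would suggest that the query cells sit apart from the reference in the latent space. Values near the overall proportion of reference cells would suggest good mixing.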
If the cells do not overlap in the common embedding, it indicates the cells are too different to be compared. You can investigate why this is by running the .differential_expression() method between your reference and query data. (If they do not overlap, this will probably be a lot of genes though…)
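A minimal sketch of that comparison is below. It assumes the model was trained on a single AnnData containing both datasets, with a hypothetical obs column "dataset" holding the values "reference" and "query". The helper name is made up, and the "proba_de" column is what I understand the "change" mode to return, so please check the column names against your installed scvi-tools version. Keep in mind that reference versus query is confounded with dataset and batch here, so some of the top genes may be technical rather than biological.

```python
import pandas as pd


def top_reference_vs_query_genes(
    model,
    adata,
    group_key="dataset",      # hypothetical obs column
    reference="reference",    # hypothetical category names
    query="query",
    n_top=20,
) -> pd.DataFrame:
    """Run the model's differential expression between reference and
    query cells and return the n_top genes most likely to differ."""
    de = model.differential_expression(
        adata,
        groupby=group_key,
        group1=reference,
        group2=query,
        mode="change",
    )
    # Assumption: "proba_de" is the per-gene posterior probability of
    # differential expression reported in "change" mode.
    return de.sort_values("proba_de", ascending=False).head(n_top)
```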
Another thing I would look at is the expression of some markers according to the model. Given that your dataset has cell type labels such as ‘GABAergic-Lamp5’, I would run .get_normalized_expression(gene_list = ['Slc6a1', 'Lamp5']) (Slc6a1 is a marker for GABA neurons). Then I would plot the results with X = Slc6a1 expression and Y = Lamp5 expression, colored by the scANVI predicted cluster labels, with reference cells and query cells in two separate panels (but with the same axes). I would do this to check 1) whether the reference labeling is consistent with the assumed biology, and 2) whether the predictions are completely off.
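The two-panel plot could be sketched roughly as follows with matplotlib. The function name marker_scatter is hypothetical, and it takes plain arrays so it does not depend on any particular scvi-tools version. The commented usage at the bottom assumes the expression values come from get_normalized_expression, the labels from the scANVI model's predict method, and a hypothetical "dataset" column in adata.obs.

```python
import numpy as np
import matplotlib.pyplot as plt


def marker_scatter(x_expr, y_expr, labels, is_reference,
                   x_name="Slc6a1", y_name="Lamp5"):
    """Scatter two marker genes against each other, colored by label,
    with reference and query cells in side-by-side panels."""
    x_expr = np.asarray(x_expr, dtype=float)
    y_expr = np.asarray(y_expr, dtype=float)
    labels = np.asarray(labels).astype(str)
    is_reference = np.asarray(is_reference, dtype=bool)

    categories = np.unique(labels)
    colors = plt.get_cmap("tab20")(np.linspace(0, 1, len(categories)))

    # sharex/sharey keeps both panels on identical axes.
    fig, axes = plt.subplots(1, 2, sharex=True, sharey=True, figsize=(10, 4))
    panels = ((is_reference, "reference"), (~is_reference, "query"))
    for ax, (mask, title) in zip(axes, panels):
        for cat, color in zip(categories, colors):
            sel = mask & (labels == cat)
            ax.scatter(x_expr[sel], y_expr[sel], s=4, color=[color], label=cat)
        ax.set_title(title)
        ax.set_xlabel(f"{x_name} (normalized expression)")
    axes[0].set_ylabel(f"{y_name} (normalized expression)")
    axes[1].legend(markerscale=3, fontsize="small")
    return fig, axes


# Assumed usage (placeholder names, not taken from the original script):
# expr = model.get_normalized_expression(adata, gene_list=["Slc6a1", "Lamp5"])
# labels = model.predict(adata)
# is_ref = (adata.obs["dataset"] == "reference").to_numpy()
# marker_scatter(expr["Slc6a1"], expr["Lamp5"], labels, is_ref)
```

For the reference panel it may also be worth coloring by the original annotations instead of the predictions, since that is what check 1) is really about.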
My impression in general is that there are a lot of changes in gene expression during the development of the brain, so it might be hard to compare these data… But given your situation I still think it’s warranted to try to figure out what the issues might be!