Hello everyone! I’m new here (and relatively new to the world of scRNA-seq data analysis). I recently generated an scRNA-seq dataset for my PhD project and have been doing my best to analyze it with scanpy and scANVI. I was wondering if I could get some help with this. Here’s the relevant background:
My query dataset comes from an E11.5 transgenic mouse embryo, FACS-sorted for mCherry expression. It’s a mix of cell types whose composition I only vaguely know. Additionally, there are 30 artificial genes that I’ve inserted into the genome (they don’t exist in the WT transcriptome). ~15,000 cells.
The reference dataset is from the Ontogeny of Mouse, Graphed (https://omg.gs.washington.edu/). I’ve filtered for cells from E11.25, E11.5 and E11.75 embryos. ~500,000 cells.
My questions are as follows:
- How should I best handle the 30 artificial genes? Is it better to append them to the reference object and fill them with 0s (after highly variable gene selection)? Or should I copy my query object, remove the 30 genes, run the cell type prediction, and then copy the predicted labels back to the original object? (The second option is what the first sketch below does.)
- I notice a significant difference in the predicted cell type composition depending on whether I concatenate the two datasets and train & predict on the combined object, or train solely on the reference dataset and then predict only on the query. I unfortunately don’t know enough about machine learning to understand this difference; intuitively, adding only 15,000 cells to such a huge reference shouldn’t affect the training much. (My reference-only workflow is the second sketch below.)
- I noticed that the cell type populations in my reference dataset vary hugely in size. There are 147 cell types in total; here is a plot of the counts:
Several cell types have a count of 1.
Should I use n_samples_per_label to cap the number of cells seen per label? How could I determine a sensible value for it? I tried setting it to 100 and noticed that the predictions became strongly biased toward some of the less frequent cell types. (The third sketch below shows a possible workaround for the rare labels.)
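For the first question, here’s a minimal sketch of the second option I described. `adata_query`, `adata_ref`, and the column names are placeholders for my actual objects:

```python
# adata_query: my ~15k-cell query AnnData (includes the 30 transgenes)
# adata_ref:   the ~500k-cell reference AnnData

# Predict on a copy restricted to genes the reference actually has,
# then copy the labels back; the cell barcodes are unchanged, so the
# predictions map back 1:1.
shared_genes = adata_query.var_names.intersection(adata_ref.var_names)
adata_query_sub = adata_query[:, shared_genes].copy()

# ... run the scANVI label transfer on adata_query_sub (next sketch) ...

adata_query.obs["predicted_cell_type"] = adata_query_sub.obs["predicted_cell_type"]
```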
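For the second question, this is roughly my reference-only workflow, loosely following the scvi-tools scArches-style tutorial. The `"batch"`/`"cell_type"` keys and the epoch counts are just what I’ve been using, not recommendations:

```python
import scvi

# Train scVI, then scANVI, on the reference alone.
scvi.model.SCVI.setup_anndata(adata_ref, batch_key="batch")
vae = scvi.model.SCVI(adata_ref)
vae.train()

scanvi = scvi.model.SCANVI.from_scvi_model(
    vae, labels_key="cell_type", unlabeled_category="Unknown"
)
# n_samples_per_label=100 is the setting from my third question
scanvi.train(max_epochs=20, n_samples_per_label=100)

# Pad/reorder the query genes to match the reference model
# (missing genes get filled with zeros, as far as I understand)
scvi.model.SCANVI.prepare_query_anndata(adata_query_sub, scanvi)

# scArches-style surgery: the reference weights stay frozen, so the
# 15k query cells never influence what was learned from the reference
scanvi_query = scvi.model.SCANVI.load_query_data(adata_query_sub, scanvi)
scanvi_query.train(max_epochs=100, plan_kwargs={"weight_decay": 0.0})
adata_query_sub.obs["predicted_cell_type"] = scanvi_query.predict()
```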
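And for the third question, one workaround I’ve been considering for the rare labels (I’m not sure it’s the right call):

```python
# Fold cell types below a minimum count into the unlabeled category
# before training, so scANVI never tries to learn (or predict) classes
# with only a handful of reference cells.
min_cells = 50  # arbitrary threshold; I'd sweep a few values
counts = adata_ref.obs["cell_type"].value_counts()
rare_labels = counts[counts < min_cells].index

adata_ref.obs["cell_type_clean"] = adata_ref.obs["cell_type"].astype(str)
adata_ref.obs.loc[
    adata_ref.obs["cell_type_clean"].isin(rare_labels), "cell_type_clean"
] = "Unknown"  # must match the unlabeled_category passed to from_scvi_model
```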
I’m happy to go into more detail if anyone needs it. Thanks a lot in advance!