Hey everyone. Historically I have done my cell annotations with a combination of manual approaches and SingleR with reference databases (I come from the Bioconductor world originally).
I wanted to give CellAssign a try for a recent project because some of the cell types aren’t in the databases, and this could be a great way to “automate” the manual annotation since I get to assign my own markers with it.
I followed this guide → https://docs.scvi-tools.org/en/stable/user_guide/notebooks/cellassign_tutorial.html and get no errors, but cells get dumped into incorrect categories. The default 400 epochs puts everything into the last two categories, and when I try 50 epochs everything gets dumped into the “other” category… (I have 210,000 cells, 10x platform, 5′ sequencing kits)
I have an RTX GPU, so re-running model.train() takes almost no time, and I am happy to try other approaches if people have suggestions…
My two ideas are:
1. After bdata = adata[:, marker_gene_mat.index].copy() I am going to have a lot of empty cells. Are the all-zero cells confusing the model? The T cells are tricky because CD4 and CD8A transcripts won’t be detected in all of them. If I remove those cells, how hard is it to extrapolate their labels later from the cells where the markers were detected? EDIT: I attempted to control for this (see first comment); it didn’t fix my problem.
2. The tutorial doesn’t want log-transformed data, but perhaps I need to format the counts differently than I am. I have tried raw and normalized raw counts…
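One quick way to rule out the second idea is to check whether the matrix going into the model still looks like raw UMI counts. The sketch below is only a heuristic and not part of scvi-tools; the helper name `looks_like_raw_counts` and the toy values are made up for illustration, and it assumes a dense NumPy array (for a scipy sparse matrix, passing its `.data` attribute should work the same way).

```python
import numpy as np

def looks_like_raw_counts(x, n_check=10000):
    """Heuristic check: raw UMI counts are non-negative whole numbers."""
    # Only inspect the first n_check values to keep this cheap on big matrices.
    values = np.asarray(x).ravel()[:n_check]
    return bool(np.all(values >= 0) and np.all(values == np.round(values)))

# Toy cells x genes matrix (hypothetical values)
raw = np.array([[0, 3, 1], [2, 0, 5]])
logged = np.log1p(raw)  # log-transformed copy, no longer integer-valued

print(looks_like_raw_counts(raw))
print(looks_like_raw_counts(logged))
```

Library-size-normalized values would generally fail this check too, since they are usually no longer whole numbers.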
There also seems to be a VERY old R cellassign package that uses TensorFlow. Is it the same thing as this? Given my background, I think I would be better at troubleshooting an SCE object than an AnnData object, but I have no idea whether the projects are linked or just share a name.
Okay, here are some updates… Setting sc.pp.filter_cells(bdata, min_genes=1) only removed 14 cells out of the 200K, so I don’t think that was my problem. I also removed one subset of cells and 2 genes to try to simplify things, and set min_genes to 2 for my remaining 39 marker genes; 2 is the size of my smallest marker set (my CD4 dump) and 7 is the size of my largest (Tfh)…
This approach didn’t give me any different results from before, at either 400 or 50 epochs… =(
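For anyone wanting to see what the min_genes filter is doing, here is a minimal sketch that counts how many marker genes are detected per cell. It uses a toy dense matrix with made-up values rather than a real AnnData object, and `markers_detected_per_cell` is a hypothetical helper, not a scanpy function.

```python
import numpy as np

def markers_detected_per_cell(counts):
    """Number of marker genes with at least one count in each cell (rows = cells)."""
    counts = np.asarray(counts)
    return (counts > 0).sum(axis=1)

# Toy cells x marker-genes matrix (hypothetical values)
toy = np.array([
    [0, 0, 0, 0],  # no marker detected
    [1, 0, 0, 0],  # one marker detected
    [2, 0, 5, 0],  # two markers detected
    [1, 1, 1, 3],  # four markers detected
])

n_detected = markers_detected_per_cell(toy)
keep = n_detected >= 2  # same idea as filtering with min_genes=2
print(n_detected, int(keep.sum()))
```

Looking at the distribution of `n_detected` before filtering shows how many cells a given threshold would remove.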
I have mostly solved it. I was able to improve my results by changing how I exported the count matrix and by moving my “dump” other category to the end. I don’t know whether the order actually makes a difference, but I went from 160,000 cells in “other” to none, so I am happy about that.
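As a rough illustration of the reordering, a binary marker matrix (genes as rows, cell types as columns) can be built from per-type marker lists with the “other” column placed last. The marker lists and the helper `build_marker_matrix` below are hypothetical, not my actual panel, and this is only a sketch of one way to lay the matrix out.

```python
import pandas as pd

# Hypothetical marker lists; "other" has no markers of its own
markers = {
    "other": [],
    "CD4 T": ["CD4", "IL7R"],
    "CD8 T": ["CD8A", "CD8B"],
}

def build_marker_matrix(markers, last="other"):
    """Binary genes x cell-types DataFrame with the `last` column at the end."""
    genes = sorted({g for gene_list in markers.values() for g in gene_list})
    order = [ct for ct in markers if ct != last] + [last]
    return pd.DataFrame(
        {ct: [int(g in markers[ct]) for g in genes] for ct in order},
        index=genes,
    )

marker_gene_mat = build_marker_matrix(markers)
print(marker_gene_mat)
```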
I have a bunch of stuff on my plate currently, but maybe at the end of the summer I can write up a little R tutorial on formatting and exporting data properly for use with scvi CellAssign, for the non-scanpy / Bioconductor people… I am sure my teething pains have more to do with my not being as proficient with scvi and scanpy than with anything else.
It shouldn’t… it would be great if you could verify this.
By the way, I realize we have a bug in the tutorial regarding size factors. Using scran is ideal, but if you use the sum of UMI counts (library size), it needs to be normalized by the mean library size:
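A minimal sketch of that normalization, assuming a cells x genes count matrix (a dense NumPy array here; scipy sparse matrices should also work since only a row sum is taken). The function name `mean_normalized_size_factors` and the toy values are illustrative and not taken from the tutorial.

```python
import numpy as np

def mean_normalized_size_factors(counts):
    """Library size per cell divided by the mean library size across cells."""
    # Row sums give total UMI counts per cell; ravel() flattens the
    # (n_cells, 1) matrix that scipy sparse sums return.
    lib_size = np.asarray(counts.sum(axis=1)).ravel().astype(float)
    return lib_size / lib_size.mean()

# Toy cells x genes counts (hypothetical values)
toy = np.array([[1, 1, 0], [2, 2, 0], [3, 3, 0]])
size_factor = mean_normalized_size_factors(toy)
print(size_factor)
```

With an AnnData object this would presumably be stored as something like adata.obs["size_factor"] = mean_normalized_size_factors(adata.X), computed on the full count matrix before subsetting to marker genes, and then referenced through the size_factor_key argument when setting up the data as in the tutorial.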