Having some difficulties with CellAssign, help please =)

Hey everyone. Historically I have done my cell annotations with a combination of manual approaches and SingleR with reference databases (I come from the Bioconductor world originally).

I wanted to give CellAssign a try for a recent project because some of the cell types aren't in the reference databases, and this could be a great way to "automate" the manual annotation since I get to supply my own markers.

I followed this guide → https://docs.scvi-tools.org/en/stable/user_guide/notebooks/cellassign_tutorial.html and get no errors, but cells get dumped into incorrect categories. The default 400 epochs puts everything into the last two categories, and when I try 50 epochs everything gets dumped into the "other" category… (I have 210,000 cells, 10X platform, 5' seq kits.)

I have an RTX GPU, so re-running model.train() takes trivial time, and I am happy to try other approaches if people have suggestions…
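For reference, this is roughly what I'm running, following the tutorial (the names `bdata` and `marker_gene_mat` are the tutorial's; exact arguments may differ between scvi-tools versions):

```python
from scvi.external import CellAssign

# register the raw counts and the per-cell size factors with scvi-tools
CellAssign.setup_anndata(bdata, size_factor_key="size_factor")

# marker_gene_mat: binary genes x cell types marker matrix (pandas DataFrame)
model = CellAssign(bdata, marker_gene_mat)
model.train(max_epochs=400)  # also tried 50

# model.predict() returns per-cell-type probabilities; argmax = hard label
predictions = model.predict()
bdata.obs["cellassign_predictions"] = predictions.idxmax(axis=1).values
```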

My ideas so far are:

  1. After bdata = adata[:, marker_gene_mat.index].copy() I am going to have a lot of empty cells (see the sketch right after this list). Are the all-zero cells confusing the model? Those T cells are tricky because the CD4 transcript and CD8A won't be detected in all of them. If I remove them, how hard is it to extrapolate their labels later from the detected ones? EDIT: I attempted to control for this, see my first comment; it didn't fix my problem.

  2. The tutorial doesn't want log data, but perhaps I need to format the counts differently than I am. I have tried raw counts and normalized raw counts…

  3. There seems to be a very old r-cellassign package that also uses TensorFlow. Is it the same thing as this? With my background I would be better at troubleshooting an SCE object than an AnnData, but I have no idea if the projects are linked or just share a name.
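Here is a sketch of what I mean in idea 1, assuming `adata` and `marker_gene_mat` as in the tutorial:

```python
import numpy as np
import scanpy as sc

# subset to just the marker genes, as in the tutorial
bdata = adata[:, marker_gene_mat.index].copy()

# how many cells detect none of the markers at all?
detected = np.asarray((bdata.X > 0).sum(axis=1)).ravel()
print(f"{(detected == 0).sum()} cells detect zero marker genes")

# drop the all-zero cells; their barcodes remain in adata for labelling later
sc.pp.filter_cells(bdata, min_genes=1)
```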

Oh, and 41 genes are being used in the celltype.csv marker matrix.

Thanks in advance for troubleshooting help!

Okay, here are some updates… after running sc.pp.filter_cells(bdata, min_genes=1), it only removed 14 cells out of the ~200K, so I don't think that was my problem. I also removed one subset of cells and 2 genes to try and simplify things, and set min_genes to 2 for my remaining 39 genes, since 2 is my smallest marker set (my CD4 dump) and 7 is my largest (Tfh)…
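(Those per-type counts are just the column sums of the binary marker matrix, assuming it is genes x cell types as in the tutorial:)

```python
# number of markers assigned to each cell type, smallest to largest
print(marker_gene_mat.sum(axis=0).sort_values())
```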

This approach didn't give me any different results compared to before, at either 400 or 50 epochs… =(

Moving on to test hypothesis number 2.

They won’t confuse the model, but you might consider adding more markers to the marker matrix.

Just the plain UMI counts are the input.

This is a reimplementation of the R version. The training is a bit different and it should be much more scalable than the original version.

What are you currently using for the size factors? In the tutorial, I believe it’s using size factors computed originally with scran.

> They won't confuse the model, but you might consider adding more markers to the marker matrix.

Awesome possum, I removed them anyway and learned they weren't the problem. Good to know for the future, though, and I will leave them in next time.

> Just the plain UMI counts are the input.

Great!
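In case it helps anyone else, here is a quick sanity check that the matrix really holds raw UMI counts, i.e. non-negative integers (written for either a sparse or dense `bdata.X`):

```python
import numpy as np
import scipy.sparse as sp

# raw UMI counts should be non-negative integers
x = bdata.X.data if sp.issparse(bdata.X) else np.asarray(bdata.X)
assert np.all(x >= 0) and np.allclose(x, np.round(x)), "X does not look like raw counts"
```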

> What are you currently using for the size factors? In the tutorial, I believe it's using size factors computed originally with scran.

I will double-check this tonight. I am pretty sure I was just using computeSumFactors, but I should go back to my pipeline to verify.
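For the record, my plan is to export the scran size factors to a csv from R and attach them on the Python side, roughly like this (the file name and the barcode alignment are my assumptions, not from the tutorial):

```python
import pandas as pd

# hypothetical export from the R side:
#   write.csv(data.frame(size_factor = sizeFactors(sce)), "size_factors.csv")
sf = pd.read_csv("size_factors.csv", index_col=0)

# the csv index must be the same cell barcodes used in the AnnData
adata.obs["size_factor"] = sf.loc[adata.obs_names, "size_factor"].values
```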

I have mostly solved it. I was able to improve my results by changing how I exported the count matrix, as well as by moving my "dump" other category to the end of the marker matrix. I don't know if the column order actually makes a difference, but I went from 160,000 cells in "other" to none, so I am happy about that.
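The reordering itself was trivial, something like this ("other" is just what I named my dump category):

```python
# move the catch-all "other" column to the last position in the marker matrix
cols = [c for c in marker_gene_mat.columns if c != "other"] + ["other"]
marker_gene_mat = marker_gene_mat[cols]
```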

I have a bunch of stuff on my plate currently, but maybe at the end of the summer I can write up a little R tutorial on formatting and exporting data properly for use with scvi-tools CellAssign, for the Bioconductor people who don't use scanpy… I am sure my teething pains have more to do with not being proficient with scvi-tools and scanpy than with anything else.

It shouldn't make a difference… it would be great if you could confirm this.

By the way, I realize we have a bug in the tutorial regarding size factors. Using scran is ideal, but if you use the sum of UMI counts (library size) instead, it needs to be normalized by the mean library size:

```python
import numpy as np

# per-cell library size, flattened to 1-D so it can be stored in .obs
lib_size = np.asarray(adata.X.sum(1)).ravel()
adata.obs["size_factor"] = lib_size / np.mean(lib_size)
```