First of all, thanks so much for creating this amazing set of tools! I'm a new user, so my apologies if I've missed relevant docs that would answer this question.
I’m interested in creating a large “atlas” from multiple (~20) independent cohorts which collectively span ~1000 donors. I have two related questions:
1. Is there a recommended minimum number of cells per donor, i.e., per batch (I'm using the donor as the "batch" identifier)? I understand the suggestion to keep the number of cells greater than the number of genes, but I'm not sure whether that applies within individual batches as well.
2. There appear to be systematic technical differences in expression between the cohorts. Is there a way to include the cohort as an additional "batch" covariate for the model fitting? Does that even make sense, given that each donor belongs to exactly one cohort, so the model already has the freedom to fit those differences on a per-donor basis?
Any other thoughts/suggestions you might have on integrating very many batches would be much appreciated! For example, would it be better to integrate a few big batches and then map the rest into that latent space with something like scArches? Are there SCVI flags (use_layer_norm, use_batch_norm, etc.) that might be appropriate for handling many cells/batches?
Thanks in advance for any advice you can offer!
I would be interested to learn a bit more about what you're trying to do. What you described are reasonable things to try, though. Please feel free to email me (firstlast at berkeley dot edu) if you'd like to schedule a meeting!
Thanks for your kind reply and the pointer to that other relevant post. I will learn more about the categorical_covariate_keys argument.
I’m also very happy to chat more about this specific application. I’ll email you.
P.S. Sorry for the delay in replying; I closed the tab and missed the notification of your reply.