Hi all, and sorry for the longish, perhaps naive set of questions. I’m relatively new to the Python ecosystem after having worked in R for a while. It’s long overdue, but I’m re-creating some of my workflows with these tools and am hitting a few friction points. I was wondering if any of you would be willing to share thoughts on the fairly common use case of exploring structure within a subpopulation of a larger data set.
Let’s say it’s a typical "atlas" project, integrating data from a complex tissue across various samples/studies. You do a low-level annotation to get broad classes of cell types (epithelial, immune, etc.) and then want to explore the diversity within each. If I were doing this in Seurat, for example, my workflow might be: integration > low-resolution clustering/annotation > subset > re-do feature selection on the relevant population > re-integrate > cluster at an appropriate resolution > annotate
But working through something simple like this, I’ve had a few questions:
1. I think I’d like to have convenient access to counts and log1p for all features
If I’m using a count-based integration method (e.g. scVI) but also want to re-do feature selection after subsetting, I need to be able to get counts for all features. Many tutorials store log1p values in .raw, but I could store the full count matrix there instead. The log1p values would then go in .X or a layer, but those would only cover the HVGs, which is kind of annoying for plotting, where I may want access to genes outside the HVG set.
What do you all do here? I’ve seen comments about un-transforming the log1p values from .raw, or storing the raw counts somewhere like .uns. Or perhaps you don’t re-do feature selection after subsetting and just include more than 2k features in your HVGs from the start. It all just feels a little clunky.
2. With scVI, do you tend to re-train after any change in the dataset composition (e.g. subsetting on a specific population, or even just removing a cluster of doublets)?
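On the un-transforming option, my understanding is that expm1 only undoes the log1p step, so it gives back normalized values rather than counts; getting the counts back also needs the per-cell library sizes. A small numpy sketch of what I mean, assuming a total-count normalization to 1e4 followed by log1p (the toy matrix and the `target_sum` value are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy raw counts: 5 cells x 20 genes (values are made up).
counts = rng.poisson(5.0, size=(5, 20)).astype(np.float64)

target_sum = 1e4                         # assumed normalization target
lib = counts.sum(axis=1, keepdims=True)  # per-cell library size
log_norm = np.log1p(counts / lib * target_sum)

# expm1 inverts log1p, which recovers the normalized values only.
norm = np.expm1(log_norm)

# Recovering counts also requires undoing the per-cell scaling,
# so the library sizes have to be stored somewhere.
recovered = np.rint(norm * lib / target_sum)
```

So this route only works if the library sizes (or the original scale factors) were kept alongside the log1p values.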
I would assume that the latent space is dependent on the cells that you decided to remove, so it makes sense to re-train without them there. I typically would do this with other integration methods, but admittedly the guts of scVI are a bit beyond me, so perhaps I’m missing something.
3. More related to scArches: I was really intrigued by the lifelong-learning pitch, that it makes models easily extensible by incorporating new datasets into existing models without going back to the start and re-training on everything. However, in practice, is it better for any reason to just re-train from scratch?
Let’s take the hypothetical yet-to-be-published atlas example again, where I’m working with a cohort of data that’s been integrated, annotated, analyzed, etc. Then new data comes along (new publication, new data collection, etc.) that I’d like to incorporate into the atlas itself (not just query against it). With scArches, it’s pretty straightforward to incorporate this data, but would you? Should I only really consider it for use cases where it’s more of a “query onto reference” situation, or is it valid to use it to keep updating some integrated collection of data? With the latter, presumably the results then become dependent on the order and grouping in which the different pieces were brought together.
Again, sorry for the long post of questions, and double sorry if I’m missing obvious things here. At the very least, hopefully it shows what someone migrating from R without much Python experience may be thinking! I have been loving the tools on this side of the fence, and I appreciate all your work on them. I just need to rewire my brain a bit to actually use them!