Hello everyone,
I’m new to scArches and currently exploring how it works in combination with SCVI, particularly for integrating new datasets into a reference atlas.
I’m following this tutorial, where a SCVI model is first trained on a reference dataset and then extended using treeArches to incorporate a new query dataset. From what I understand, the entire query dataset is treated as a single new batch during this “surgery” step when adapting the model. Also, it seems that SCVI expects raw count data for both training and mapping steps.
This leads me to my question:
What if my query dataset contains multiple batches itself (for example, samples from different sources or sequencing runs)?
Should I split the query dataset by batch and integrate each one individually into the reference? Or is there a better approach that allows the model to recognize and correct for the internal batch effects within the query dataset during the surgery step?
Any guidance or best practices would be greatly appreciated!
Thanks in advance!