How to handle batch effects within the query dataset when using scArches + SCVI?

Hello everyone,

I’m new to scArches and currently exploring how it works in combination with SCVI, particularly for integrating new datasets into a reference atlas.

I’m following this tutorial, where a SCVI model is first trained on a reference dataset and then extended using treeArches to incorporate a new query dataset. From what I understand, the entire query dataset is treated as a single new batch during this “surgery” step when adapting the model. Also, it seems that SCVI expects raw count data for both training and mapping steps.
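Because of the raw-count requirement, a quick sanity check before training or mapping is to confirm that the matrix really holds non-negative integers. Here is a minimal sketch; `looks_like_raw_counts` is a hypothetical helper of my own (not part of scvi-tools or scArches), and it only inspects the values handed to it, so it is a heuristic rather than a guarantee.

```python
def looks_like_raw_counts(values, max_checked=100_000):
    """Return True if the inspected values are all non-negative integers.

    `values` is any iterable of numbers (hypothetical usage: `adata.X.data`
    for a sparse matrix). Only the first `max_checked` values are inspected.
    """
    for i, v in enumerate(values):
        if i >= max_checked:
            break
        v = float(v)
        # Normalized or log-transformed data typically has fractional values.
        if v < 0 or not v.is_integer():
            return False
    return True
```

If this returns False for the query, the matrix has probably been normalized or log-transformed, and the raw counts would need to be recovered (for example from a separate layer, if one was stored) before mapping.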

This leads me to my question:

What if my query dataset contains multiple batches itself (for example, samples from different sources or sequencing runs)?

Should I split the query dataset by batch and integrate each one individually into the reference? Or is there a better approach that allows the model to recognize and correct for the internal batch effects within the query dataset during the surgery step?

Any guidance or best practices would be greatly appreciated!

Thanks in advance!

Hey,

I think the tutorial you linked to is a bit old and perhaps confusing: it uses the “batch” column as the “batch_key”, but that column is identical to the “study” column, which is also what separates query from reference.

Anyway, you can check this tutorial: Reference mapping with SCVI-Tools — scvi-tools

There, the query dataset consists of several “tech” batches, and they are all used together as the query data, so you do not need to split the query and map each batch separately. Your query data should still be close to your reference (in terms of species, tissue, cell types and so on) for reference mapping (“surgery”) to work well.
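Roughly, mapping a multi-batch query in one go could look like the sketch below. Treat it as an illustration under assumptions, not the tutorial's exact code: `tag_query_batches`, `map_multibatch_query`, `ref_model_dir` and the `"tech"` column name are placeholders, and I am assuming the reference model was set up with the same batch column. As far as I understand, a query batch label that already exists in the reference is treated as that reference batch, so the first helper renames colliding labels to make sure each query batch is added as a new category.

```python
def tag_query_batches(query_batches, reference_batches, prefix="query_"):
    """Prefix query batch labels that collide with reference batch labels.

    Hypothetical helper: keeps each query batch distinct from every
    reference batch so it is registered as a new batch during surgery.
    """
    ref = set(reference_batches)
    return [f"{prefix}{b}" if b in ref else b for b in query_batches]


def map_multibatch_query(adata_query, ref_model_dir, batch_col="tech"):
    """Map a query AnnData with several batches onto a trained SCVI reference."""
    # Imported lazily so the helper above works without scvi-tools installed.
    import scvi

    # Assumption: the reference model used batch_key=batch_col in setup_anndata.
    if batch_col not in adata_query.obs:
        raise KeyError(f"query is missing the batch column {batch_col!r}")

    # Align the query genes with the genes the reference model was trained on.
    scvi.model.SCVI.prepare_query_anndata(adata_query, ref_model_dir)

    # All query batches go in together; unseen labels become new batches.
    query_model = scvi.model.SCVI.load_query_data(adata_query, ref_model_dir)

    # Short fine-tuning run; the epoch count here is only a placeholder.
    query_model.train(max_epochs=200, plan_kwargs={"weight_decay": 0.0})
    return query_model.get_latent_representation()
```

For example, `tag_query_batches(["10x", "smartseq2"], ["10x", "celseq"])` gives `["query_10x", "smartseq2"]`, which you could write back into the batch column of the query before calling `map_multibatch_query`.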

That said, there are models beyond SCVI, such as SysVI, that might be helpful if you want to integrate a query that is very different from the reference. It might help in your workflow, although scArches is not fully supported in this model yet.