Query data formatting to map onto a reference

Hey scvi-tools team!

I was wondering what the requirements are on the query AnnData object for prepare_query_anndata(). The documentation just says the query object should be:

organized in the same way as data used to train model

but it’s not clear what exactly this means. Looking at the code, I guess that query_data.layers['counts'] is not supported even if the model was trained on an AnnData object that has the count data stored in a layer? I guess this might be reformatted in setup_anndata for the original model training? Also, does organizing both objects in the same way also mean that query_data.var and query_data.obs need to be consistent?

So for each kwarg used in the setup_anndata method for the original training data, your query data should have the corresponding data under those keys. E.g. if your count data was stored in .layers['counts'] originally, then it should be stored there for the query data as well. The reformatting that setup_anndata did on the original data will be repeated on the query data. While the query_data.var should match, the query_data.obs can be different (which I assume would be the case most of the time).
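To make the rule above concrete, here is a minimal sketch of a pre-flight check. The helper name missing_setup_fields is hypothetical (it is not part of scvi-tools), and the kwarg names it looks at (layer, batch_key, labels_key, size_factor_key, categorical_covariate_keys, continuous_covariate_keys) are assumed to be the ones your model's setup_anndata was called with. It only relies on `in` checks against .layers and .obs, so it should work on an AnnData object as well as on the small stand-in used here.

```python
from types import SimpleNamespace


def missing_setup_fields(adata, setup_kwargs):
    """Return the fields named in setup_kwargs that are absent from adata.

    Hypothetical helper, not an scvi-tools function. `setup_kwargs` is the
    dict of keyword arguments originally passed to setup_anndata.
    """
    missing = []

    # The count layer must exist under the same name as in the reference.
    layer = setup_kwargs.get("layer")
    if layer is not None and layer not in adata.layers:
        missing.append(f"layers[{layer!r}]")

    # Collect every .obs column that setup_anndata was told to read.
    obs_keys = []
    for name in ("batch_key", "labels_key", "size_factor_key"):
        if setup_kwargs.get(name) is not None:
            obs_keys.append(setup_kwargs[name])
    for name in ("categorical_covariate_keys", "continuous_covariate_keys"):
        obs_keys.extend(setup_kwargs.get(name) or [])

    for key in obs_keys:
        if key not in adata.obs:
            missing.append(f"obs[{key!r}]")
    return missing


# Stand-in for a query AnnData: has a 'counts' layer and a 'batch' column only.
query = SimpleNamespace(layers={"counts": None}, obs={"batch": None})
problems = missing_setup_fields(
    query, {"layer": "counts", "batch_key": "batch", "labels_key": "cell_type"}
)
print(problems)  # ["obs['cell_type']"]
```

This only checks that the keys are present. It does not check that query_data.var matches the reference genes, which still has to hold.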

Thanks a lot @Justin_Hong! I think I found a bug or an incompatibility between query data prepping functions that I reported here.

Is there a way to only load the reference model schema from the model file without having to load the reference data? Then one could prep the query data depending on the model attributes. I guess this would be achievable by exposing the _get_loaded_data() function that is defined in _archesmixin.py. Otherwise I only see model load functions that require also loading the anndata object.


Thanks for filing that issue! I’ll take a look and respond on it.

While it isn’t as extensive as the view_anndata_setup method, there is a static method on every model class called view_setup_args which tells you which kwargs were used to call setup_anndata on a model. This can be done without loading the data. Does this address your problem?
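For reference, a minimal sketch of that call, assuming the reference model is an SCVI model that was saved to a directory with model.save(). The wrapper name show_setup_args and the path in the usage comment are hypothetical. The import is done inside the function so that defining the helper does not itself require scvi-tools to be installed.

```python
def show_setup_args(model_dir):
    """Print the setup_anndata kwargs stored with a saved model.

    Hypothetical wrapper. `model_dir` is assumed to be the directory that
    was passed to model.save() for the reference model.
    """
    # Lazy import: scvi-tools is only needed when the helper is called.
    import scvi

    # Static method, so no model instance and no reference AnnData is loaded.
    # Swap SCVI for whichever model class the reference was trained with.
    scvi.model.SCVI.view_setup_args(model_dir)


# Usage (hypothetical path):
# show_setup_args("path/to/reference_model")
```

The printed kwargs (for example the layer name and batch_key) are the ones the query AnnData needs to provide under the same keys.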
