Matching features in query data and reference model

I have a question on matching features between a trained xVI model and a query dataset (apologies if this is a duplicate).

By default, load_query_data expects the query features to match the reference (fair enough). Setting inplace_subset_query_vars=True subsets the query in place, but it throws a KeyError if some of the reference features are missing from the query data's var_names.

First off, the KeyError here is not very helpful: even if only a few genes are missing, it reports all the variables as invalid:

KeyError: "Values [**all the genes**] are not valid obs/ var names or indices."
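
As a workaround for the uninformative message, the genes that are actually absent can be listed by hand before calling load_query_data. This is only a minimal sketch in plain Python: the helper name missing_features and the gene names are made up for illustration, and in practice the two lists would come from the reference var_names and the query adata.var_names.

```python
def missing_features(reference_names, query_names):
    """Return the reference features absent from the query, in reference order."""
    query_set = set(query_names)  # set lookup keeps this fast for many genes
    return [name for name in reference_names if name not in query_set]


# Hypothetical feature names, for illustration only.
reference_names = ["GeneA", "GeneB", "GeneC", "GeneD"]
query_names = ["GeneD", "GeneA", "GeneX"]

missing = missing_features(reference_names, query_names)
fraction = len(missing) / len(reference_names)
print(f"{len(missing)}/{len(reference_names)} reference features missing "
      f"({fraction:.0%}): {missing}")
```

Extra genes that exist only in the query (GeneX here) are ignored, since only the reference features matter for the model.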

In case of missing features, the scArches paper recommends zero-filling the matrix (as long as less than 10% of features are missing). So is there an easy way to check which features were used for training directly from the saved model, if there is no adata object saved with it? Or do I necessarily have to go back to the original reference anndata?

Ideally one should be able to share just a trained model w/o the big reference adata attached to it, so being able to access the missing genes would be important (like in older scvi-tools versions I could just read the var_names.csv). Perhaps load_query_data could also have an optional parameter for zero filling with a message on which fraction of features are missing.
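
To make the zero-filling idea concrete, here is a rough sketch of what I mean, written with NumPy on a dense matrix. It is not scvi-tools code: the function name zero_fill_to_reference and the max_missing_fraction parameter are hypothetical, the 10% default is just the rule of thumb mentioned above, and real AnnData matrices are often sparse, which this does not handle.

```python
import numpy as np


def zero_fill_to_reference(X, query_names, reference_names, max_missing_fraction=0.1):
    """Reorder query columns to the reference order, zero-filling missing features.

    Hypothetical helper, not part of any library. Returns the padded matrix
    and the list of reference features that were absent from the query.
    """
    X = np.asarray(X)
    column_of = {name: j for j, name in enumerate(query_names)}
    missing = [name for name in reference_names if name not in column_of]
    fraction = len(missing) / len(reference_names)
    if fraction > max_missing_fraction:
        raise ValueError(
            f"{fraction:.1%} of reference features are missing "
            f"(limit {max_missing_fraction:.1%})"
        )
    # Start from zeros so that missing features stay zero-filled.
    padded = np.zeros((X.shape[0], len(reference_names)), dtype=X.dtype)
    for j_ref, name in enumerate(reference_names):
        j_query = column_of.get(name)
        if j_query is not None:
            padded[:, j_ref] = X[:, j_query]
    return padded, missing


# Toy data: 20 reference genes, the query lacks "g0" and has one extra gene.
reference_names = [f"g{i}" for i in range(20)]
query_names = reference_names[1:][::-1] + ["extra"]
X = np.arange(2 * len(query_names)).reshape(2, len(query_names))

padded, missing = zero_fill_to_reference(X, query_names, reference_names)
print(f"zero-filled {len(missing)} of {len(reference_names)} features: {missing}")
```

The result has one column per reference feature in reference order, so query-only genes are dropped and the missing ones are all zeros.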

Thanks in advance!

I think this method we added is what you’re looking for. Please let me know if it’s missing anything!

Usage is shown in this tutorial:

That’s brilliant, thanks for the pointer!