Cannot transfer setup without `extend_categories = True`

Hi, I’m having the same problem as described here, but with a totalVI model, and I’m running the latest scvi-tools version.

I want to map a query (Q) to 2 different references, R1 and R2, using the same code.
Q->R1 worked fine, and it uses the exact same datasets as your CITE-seq tutorial (the Seurat dataset).

In my code, I calculate the final UMAP by concatenating Q and R1 and then computing the latent x_totalvi on the full object.

adata_ref.obs.loc[:, 'is_reference'] = 'Reference'
adata_query.obs.loc[:, 'is_reference'] = 'Query'
adata_full = adata_query.concatenate(adata_ref , batch_key="batch")
adata_full.obsm[latent_choice] = vae_q.get_latent_representation(adata_full)

I now have another model for R2 and I can use it for mapping the query, but the exact same lines above fail after concatenating Q and R2, at the vae_q.get_latent_representation(adata_full) call,

with this error:

ValueError: Category 0 not found in source registry. Cannot transfer setup without extend_categories = True.

  • I don’t know which category it is complaining about, because neither the query nor the reference has a column named 0
  • the reference is really small and has no metadata apart from the batch

AnnData object with n_obs × n_vars = 10849 × 4000
    obs: 'n_counts', 'batch'
    var: 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches'
    uns: 'hvg'
    obsm: 'protein_expression'

I train the reference like so:

arches_params = dict(
    # (arguments not shown in the original post)
)

vae = TOTALVI(Ref, **arches_params)
vae.train(max_epochs=250)
vae.save("cite_reference_model", overwrite=True)

Anndata setup with scvi-tools version 0.16.3.
Setup via `TOTALVI.setup_anndata` with arguments:
│   'protein_expression_obsm_key': 'protein_expression',
│   'protein_names_uns_key': None,
│   'batch_key': 'batch',
│   'layer': 'counts',
│   'size_factor_key': None,
│   'categorical_covariate_keys': None,
│   'continuous_covariate_keys': None
         Summary Statistics         
┃     Summary Stat Key     ┃ Value ┃
│         n_cells          │ 10849 │
│          n_vars          │ 4000  │
│         n_labels         │   1   │
│         n_batch          │   2   │
│ n_extra_categorical_covs │   0   │
│ n_extra_continuous_covs  │   0   │
│        n_proteins        │  14   │
                   Data Registry                   
┃ Registry Key ┃       scvi-tools Location        ┃
│      X       │      adata.layers['counts']      │
│    labels    │    adata.obs['_scvi_labels']     │
│    batch     │     adata.obs['_scvi_batch']     │
│   proteins   │ adata.obsm['protein_expression'] │
                     labels State Registry                      
┃      Source Location      ┃ Categories ┃ scvi-tools Encoding ┃
│ adata.obs['_scvi_labels'] │     0      │          0          │
                  batch State Registry                   
┃  Source Location   ┃ Categories ┃ scvi-tools Encoding ┃
│ adata.obs['batch'] │   PBMC5k   │          0          │
│                    │  PBMC10k   │          1          │

The query’s batch column instead has two other strings, “set1” and “set2”:

Name: batch, Length: 57669, dtype: category
Categories (2, object): ['set1', 'set2']

I think I understand that the registry can be updated to account for the new info in the query, but I am not sure I totally follow the documentation on Data Handling.

However, this doesn’t change the batch info in the AnnData manager:

vae.adata_manager.transfer_fields(adata_target=query, extend_categories=True)

attrdict({'categorical_mapping': array(['PBMC5k', 'PBMC10k'], dtype=object), 'original_key': 'batch'})
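
As a rough mental model of the check behind this error, here is a minimal sketch in plain Python. It is a simplified stand-in, not the scvi-tools implementation: `transfer_categories` is a hypothetical helper, and the assumption is only that each category found in the target must already be in the source mapping unless extension is allowed. The sketch returns a new mapping and leaves the source list untouched; if the real transfer works similarly (an assumption), that might be why the source manager’s batch mapping above still shows only the two reference batches.

```python
def transfer_categories(source_mapping, target_values, extend_categories=False):
    """Hypothetical helper: map target categories onto a source registry."""
    mapping = list(source_mapping)  # work on a copy, never mutate the source
    for category in sorted(set(target_values)):
        if category not in mapping:
            if not extend_categories:
                raise ValueError(
                    f"Category {category} not found in source registry. "
                    "Cannot transfer setup without `extend_categories = True`."
                )
            mapping.append(category)  # new categories go after the known ones
    return mapping


source = ["PBMC5k", "PBMC10k"]

# Query batches are unknown to the reference, so they must be appended.
extended = transfer_categories(source, ["set1", "set2", "set1"], extend_categories=True)

# A column holding "0"/"1" fails the same way as the error in this post.
try:
    transfer_categories(source, ["0", "1"])
    error_message = None
except ValueError as err:
    error_message = str(err)

print(extended)
print(error_message)
```

In this picture, the message names whichever unseen value is checked first, so a "Category 0" complaint suggests the batch column of the object passed in contains a value 0 at that point, even if neither input was created with one.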

I am not sure what to do here.

I think the workaround of simply concatenating the totalVI embeddings calculated independently for R and Q would be appropriate, but I want to have standardized code across my Q2R mappings, so I would like to understand why the above lines work for other Q2R mappings (the same code also works on other experiments) but not for this particular one.

thank you!

It didn’t allow me to add the link to the documentation, so here it is:

Well, as per usual, raising the issue was instrumental to finding the solution. I think I actually sorted out the problem by changing the concatenation from the old concatenate method to the new anndata.concat function:

adata_full = ad.concat([adata_ref, adata_query])

No problem getting the full totalVI embedding from this AnnData, but I would still like to know whether you’d advise concatenating the ref and query (totalVI/scVI) embeddings or whether we should recalculate them on the full object.

They should be pretty much the same, but I’m not sure what’s best.