Hi, I’m having the same problem as described here, but with a totalVI model, and I’m running the latest scvi-tools version.
I want to map a query to two different references, R1 and R2, using the same code.
Q->R1 worked fine; it uses the exact same datasets as your CITE-seq tutorial (the Seurat dataset).
In my code, I compute the final UMAP by concatenating Q and R1 and then computing the x_totalvi latent representation on the full object:
adata_ref.obs.loc[:, 'is_reference'] = 'Reference'
adata_query.obs.loc[:, 'is_reference'] = 'Query'
adata_full = adata_query.concatenate(adata_ref, batch_key="batch")
adata_full.obsm[latent_choice] = vae_q.get_latent_representation(adata_full)
I now have another model for R2 and can use it to map the query, but the exact same lines above fail when concatenating Q and R2, at the vae_q.get_latent_representation(adata_full) call,
with this error:
ValueError: Category 0 not found in source registry. Cannot transfer setup without
extend_categories = True.
- I don’t know which category it is complaining about, because neither the query nor the reference has a column named 0.
- The reference is really small and has no metadata apart from the batch.
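To make the error concrete, below is a minimal pure-Python sketch of the kind of check that presumably produces it. The function name `encode_with_registry` and its arguments are hypothetical, and this is not the scvi-tools implementation. It only mimics the behaviour the message describes: every value in the new data must already be in the categories saved at setup, unless extension is explicitly allowed.

```python
def encode_with_registry(values, registry, extend_categories=False):
    """Map each value to its integer code using a saved list of categories.

    Hypothetical helper: a toy stand-in for a category registry,
    not scvi-tools code.
    """
    # Work on a copy so the saved registry is never modified in place.
    categories = list(registry)
    for value in values:
        if value not in categories:
            if not extend_categories:
                raise ValueError(
                    f"Category {value} not found in source registry. "
                    "Cannot transfer setup without extend_categories = True."
                )
            # New categories are appended, so existing codes stay stable.
            categories.append(value)
    codes = [categories.index(value) for value in values]
    return codes, categories


reference_batches = ["PBMC5k", "PBMC10k"]

# Values already in the registry encode without problems.
codes, categories = encode_with_registry(["PBMC10k", "PBMC5k"], reference_batches)

# Unseen values are only accepted when extension is requested.
new_codes, new_categories = encode_with_registry(
    ["set1", "set2", "PBMC5k"], reference_batches, extend_categories=True
)
```

In this toy version, any value absent from the saved categories (for example the string "0") raises the same kind of ValueError, so the message may refer to a value inside a registered column of the concatenated object rather than to a column named 0. Note also that the extended categories come back as a new list and the original registry is left untouched.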
Ref
AnnData object with n_obs × n_vars = 10849 × 4000
obs: 'n_counts', 'batch'
var: 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches'
uns: 'hvg'
obsm: 'protein_expression'
I train the reference like so:
TOTALVI.setup_anndata(
    Ref,
    layer="counts",
    batch_key="batch",
    protein_expression_obsm_key="protein_expression",
)
arches_params = dict(
    use_layer_norm="both",
    use_batch_norm="none",
    n_layers_decoder=2,
    n_layers_encoder=2,
)
vae = TOTALVI(Ref, **arches_params)
vae.train(max_epochs=250)
vae.save("cite_reference_model", overwrite=True)
vae.adata_manager.view_registry()
Anndata setup with scvi-tools version 0.16.3.
Setup via `TOTALVI.setup_anndata` with arguments:
{
│ 'protein_expression_obsm_key': 'protein_expression',
│ 'protein_names_uns_key': None,
│ 'batch_key': 'batch',
│ 'layer': 'counts',
│ 'size_factor_key': None,
│ 'categorical_covariate_keys': None,
│ 'continuous_covariate_keys': None
}
Summary Statistics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃ Summary Stat Key ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│ n_cells │ 10849 │
│ n_vars │ 4000 │
│ n_labels │ 1 │
│ n_batch │ 2 │
│ n_extra_categorical_covs │ 0 │
│ n_extra_continuous_covs │ 0 │
│ n_proteins │ 14 │
└──────────────────────────┴───────┘
Data Registry
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Registry Key ┃ scvi-tools Location ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ X │ adata.layers['counts'] │
│ labels │ adata.obs['_scvi_labels'] │
│ batch │ adata.obs['_scvi_batch'] │
│ proteins │ adata.obsm['protein_expression'] │
└──────────────┴──────────────────────────────────┘
labels State Registry
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ Source Location ┃ Categories ┃ scvi-tools Encoding ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ adata.obs['_scvi_labels'] │ 0 │ 0 │
└───────────────────────────┴────────────┴─────────────────────┘
batch State Registry
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃ Source Location ┃ Categories ┃ scvi-tools Encoding ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ adata.obs['batch'] │ PBMC5k │ 0 │
│ │ PBMC10k │ 1 │
└────────────────────┴────────────┴─────────────────────┘
The query’s batch column instead has two other strings, “set1” and “set2”:
query.obs["batch"]
[...]
Name: batch, Length: 57669, dtype: category
Categories (2, object): ['set1', 'set2']
I think I understand that the registry can be updated to account for the new information in the query, but I am not sure I fully follow the Data Handling documentation.
However, the following doesn’t change the batch info in the AnnData manager:
vae.adata_manager.transfer_fields(adata_target=query, extend_categories=True)
vae.adata_manager.get_state_registry("batch")
attrdict({'categorical_mapping': array(['PBMC5k', 'PBMC10k'], dtype=object), 'original_key': 'batch'})
I am not sure what to do here.
I think the workaround of simply concatenating the totalVI embeddings computed independently for R and Q would be appropriate, but I want standardized code across my Q2R mappings, so I would like to understand why the above lines work for other Q2R mappings (the same code also works on other experiments) but not for this particular one.
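A minimal sketch of that workaround, under stated assumptions: the helper `stack_embeddings` is hypothetical, plain lists of floats stand in for the arrays a model would return, and both embeddings are assumed to live in the same latent space (otherwise stacking them is not meaningful).

```python
def stack_embeddings(ref_latent, query_latent):
    """Stack two per-cell embeddings and record where each row came from.

    Hypothetical helper; rows are plain lists of floats standing in for
    the arrays a model would return.
    """
    ref_dims = {len(row) for row in ref_latent}
    query_dims = {len(row) for row in query_latent}
    # Both embeddings must share one latent dimensionality to be stacked.
    if len(ref_dims | query_dims) != 1:
        raise ValueError("Embeddings have inconsistent latent dimensions.")
    # Query rows first, mirroring adata_query.concatenate(adata_ref).
    stacked = [list(row) for row in query_latent] + [list(row) for row in ref_latent]
    origin = ["Query"] * len(query_latent) + ["Reference"] * len(ref_latent)
    return stacked, origin


# Toy values only; real embeddings would have one row per cell.
ref_latent = [[0.1, 0.2], [0.3, 0.4]]
query_latent = [[0.5, 0.6]]
stacked, origin = stack_embeddings(ref_latent, query_latent)
```

The `origin` list plays the role of the is_reference column, so the stacked rows can still be coloured by Query and Reference on the UMAP.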
Thank you!