Cannot transfer setup without `extend_categories = True`

bio-la · December 15, 2022, 9:56am

Hi, I’m having the same problem as described here, but with a totalVI model, and I’m running the latest scvi-tools version.

github.com/scverse/scvi-tools

Issue re-running the scVI query model with the concatenated dataset

opened 08:23PM - 06 Jan 21 UTC

closed 07:29PM - 11 Jan 21 UTC

davemcg

bug

# The issue

1. I build an scVI model
2. I save it
3. Later I load it the sc…VI and pull in *new* and *old* data
4. I (succesfully) run the old data with the scVI model (`vae_query`)
5. I concatenate new and old data and try to run the `vae_query` to extract the latent dims for the full data
6. I get an error about the categories being wrong

# The colab notebook for anyone to run

```python
https://github.com/davemcg/scEiaD/blob/colab/colab/Query_scEiaD_with_scVI.ipynb
```

# Error in question

I'm somewhat convinced I've just done some stupid anndata thing.

```pytb
INFO     Input adata not setup with scvi. attempting to transfer anndata setup
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-47-5e8ea921bcb3> in <module>()
      2 adata_full_HVG = adata_full[:, var_names[0]].copy()
      3 adata_full_HVG.obs['batch'] = adata_full_HVG.obs['scEiaD_batch']
----> 4 vae_query.get_latent_representation(adata_full_HVG)
      5
      6 #adata_full.obsm['X_scvi'] = vae_query.get_latent_representation(adata_scEiaD_HVG)

5 frames
/usr/local/lib/python3.6/dist-packages/scvi/data/_anndata.py in _make_obs_column_categorical(adata, column_key, alternate_column_key, categorical_dtype)
    690             raise ValueError(
    691                 'Making .obs["{}"] categorical failed. Expected categories: {}. '
--> 692                 "Received categories: {}. ".format(column_key, mapping, received_categories)
    693             )
    694     adata.obs[alternate_column_key] = codes

ValueError: Making .obs["batch"] categorical failed.
Expected categories: ['E-MTAB-7316_10xv2_Donor1' 'E-MTAB-7316_10xv2_Donor2' 'E-MTAB-7316_10xv2_Donor3' 'EGAD00001006350_10xv2_D1_Ch' 'EGAD00001006350_10xv2_D1_Re' 'EGAD00001006350_10xv2_D3_Ch' 'EGAD00001006350_10xv2_D3_Re' 'EGAD00001006350_10xv2_D4_Ch' 'EGAD00001006350_10xv2_D4_Re' 'SRP151023_10xv2_NA' 'SRP170761_10xv2_NA' 'SRP222001_10xv2_retina1' 'SRP222001_10xv2_retina2' 'SRP222001_10xv2_retina3' 'SRP222958_DropSeq_retina2' 'SRP222958_DropSeq_retina6' 'SRP222958_DropSeq_retina8' 'SRP223254_10xv2_NA' 'SRP223254_10xv2_rep2' 'SRP238587_10xv2_NA' 'SRP255195_10xv2_H1' 'SRP255195_10xv2_H2' 'SRP255195_10xv2_H3' 'SRP255195_10xv2_H4' 'SRP255195_10xv2_H5' 'SRR12130660']. Received categories: Index(['E-MTAB-7316_10xv2_Donor1', 'E-MTAB-7316_10xv2_Donor2', 'E-MTAB-7316_10xv2_Donor3', 'EGAD00001006350_10xv2_D1_Ch', 'EGAD00001006350_10xv2_D1_Re', 'EGAD00001006350_10xv2_D3_Ch', 'EGAD00001006350_10xv2_D3_Re', 'EGAD00001006350_10xv2_D4_Ch', 'EGAD00001006350_10xv2_D4_Re', 'OGVFB_Hufnagel_iPSC_RPE_10xv2_None', 'SRP050054_DropSeq_retina1', 'SRP050054_DropSeq_retina2', 'SRP050054_DropSeq_retina3', 'SRP050054_DropSeq_retina4', 'SRP050054_DropSeq_retina5', 'SRP050054_DropSeq_retina6', 'SRP050054_DropSeq_retina7', 'SRP073242_SMARTSeq_v2_NA', 'SRP075719_DropSeq_Batch1', 'SRP075719_DropSeq_Batch2', 'SRP075720_SMARTSeq_v2_Batch1', 'SRP075720_SMARTSeq_v2_Batch2', 'SRP106476_SMARTerSeq_v3_NA', 'SRP131661_10xv2_3-F-56', 'SRP131661_10xv2_3-F-57', 'SRP131661_10xv2_3-M-5/6', 'SRP131661_10xv2_3-M-7/8', 'SRP131661_10xv2_3-M-8', 'SRP131661_10xv2_3-M-8/9', 'SRP131661_10xv2_3-M-9', 'SRP136739_SMARTSeq_v4_NA', 'SRP151023_10xv2_NA', 'SRP157927_10xv2_Macaque1', 'SRP157927_10xv2_Macaque2', 'SRP157927_10xv2_Macaque3', 'SRP157927_10xv2_Macaque4', 'SRP158081_10xv2_Rep1', 'SRP158081_10xv2_Rep2', 'SRP158081_10xv2_Rep3', 'SRP158081_SMARTSeq_v2_Rep1', 'SRP158528_10xv2_Macaque1', 'SRP158528_10xv2_Macaque2', 'SRP158528_10xv2_Macaque3', 'SRP158528_10xv2_Macaque4', 'SRP159286_SCRBSeq_NA',
'SRP161678_SMARTSeq_v4_NA', 'SRP166660_10xv2_run1', 'SRP166660_10xv2_run2', 'SRP168426_10xv2_E2', 'SRP168426_10xv2_F2', 'SRP170038_SMARTSeq_v2_NA', 'SRP170761_10xv2_NA', 'SRP186396_SMARTSeq_v2_NA', 'SRP186407_10xv2_NA', 'SRP194595_10xv3_Donor1', 'SRP194595_10xv3_Donor2', 'SRP194595_10xv3_Donor3', 'SRP200599_10xv2_NA', 'SRP212151_10xv2_Batch1', 'SRP212151_10xv2_Batch2', 'SRP212151_10xv2_Batch3', 'SRP218652_10xv3_donor1', 'SRP218652_10xv3_donor2', 'SRP218652_10xv3_donor3', 'SRP218652_10xv3_donor4', 'SRP218652_10xv3_donor5', 'SRP218652_10xv3_donor6', 'SRP218652_10xv3_donor7', 'SRP222001_10xv2_retina1', 'SRP222001_10xv2_retina2', 'SRP222001_10xv2_retina3', 'SRP222958_DropSeq_retina2', 'SRP222958_DropSeq_retina6', 'SRP222958_DropSeq_retina8', 'SRP223254_10xv2_NA', 'SRP223254_10xv2_rep2', 'SRP238587_10xv2_NA', 'SRP255195_10xv2_H1', 'SRP255195_10xv2_H2', 'SRP255195_10xv2_H3', 'SRP255195_10xv2_H4', 'SRP255195_10xv2_H5', 'SRP255195_10xv3_H1', 'SRP257883_10xv3_donor_22', 'SRP257883_10xv3_donor_23', 'SRP257883_10xv3_donor_24', 'SRP257883_10xv3_donor_25', 'SRR12130660'], dtype='object').
```

#### Versions: 0.8.1

> VERSION

I want to map a query to two different references, R1 and R2, using the same code.
Q->R1 worked fine, and it uses the exact same datasets as your CITE-seq tutorial (the Seurat dataset).

In my code, I calculate the final UMAP by concatenating Q and R1 and then computing the latent x_totalvi on the full object:

adata_ref.obs.loc[:, 'is_reference'] = 'Reference'
adata_query.obs.loc[:, 'is_reference'] = 'Query'

adata_full = adata_query.concatenate(adata_ref, batch_key="batch")
adata_full.obsm[latent_choice] = vae_q.get_latent_representation(adata_full)

I now have another model for R2, and I can use it to map the query, but the exact same lines above fail when concatenating Q and R2, at the vae_q.get_latent_representation(adata_full) call,

with this error:

ValueError: Category 0 not found in source registry. Cannot transfer setup without extend_categories = True.

I don’t know which category it is complaining about, because neither the query nor the reference has a column named 0.
The reference is really small and has no metadata apart from the batch:
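
A possible reading of the error, offered as an assumption rather than a confirmed diagnosis: the deprecated AnnData.concatenate method writes a batch annotation into obs under batch_key, using increasing numbers ('0', '1', ...) as categories by default. With batch_key="batch" that would replace an existing obs["batch"] column, so a model that was itself set up with batch_key="batch" would then see the category 0 instead of its registered batch names. The toy sketch below uses plain dicts in place of AnnData.obs; legacy_concatenate and check_categories are hypothetical stand-ins, not anndata or scvi-tools code, and the registered categories are assumed from the registry and query output shown later in this post.

```python
# Toy model only: lists of dicts stand in for AnnData.obs.
# `legacy_concatenate` and `check_categories` are hypothetical helpers.

def legacy_concatenate(tables, batch_key="batch"):
    """Stack per-cell records and write each table's position into `batch_key`."""
    merged = []
    for position, table in enumerate(tables):
        for row in table:
            # Any existing value under `batch_key` is overwritten here.
            merged.append({**row, batch_key: str(position)})
    return merged


def check_categories(values, registered):
    """Raise on the first value that is not a registered category."""
    for value in values:
        if value not in registered:
            raise ValueError(f"Category {value} not found in source registry.")


query = [{"batch": "set1"}, {"batch": "set2"}]
reference = [{"batch": "PBMC5k"}, {"batch": "PBMC10k"}]
# Assumed registry of the query model: reference batches plus query batches.
registered = ["PBMC5k", "PBMC10k", "set1", "set2"]

full = legacy_concatenate([query, reference], batch_key="batch")
batches = [row["batch"] for row in full]
print(batches)  # ['0', '0', '1', '1']

try:
    check_categories(batches, registered)
    error = None
except ValueError as exc:
    error = str(exc)
print(error)  # Category 0 not found in source registry.
```

If this reading is right, the collision would only show up when the model's batch column is literally named "batch", which might explain why the same lines run for other references.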

Ref

AnnData object with n_obs × n_vars = 10849 × 4000
    obs: 'n_counts', 'batch'
    var: 'highly_variable', 'highly_variable_rank', 'means', 'variances', 'variances_norm', 'highly_variable_nbatches'
    uns: 'hvg'
    obsm: 'protein_expression'

I train the reference like so:

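from scvi.model import TOTALVI

# Register the reference AnnData with scvi-tools (counts layer, batch column, protein matrix)
TOTALVI.setup_anndata(
    Ref,
    layer="counts",
    batch_key="batch",
    protein_expression_obsm_key="protein_expression",
)

# Architecture settings for the reference model
arches_params = dict(
    use_layer_norm="both",
    use_batch_norm="none",
    n_layers_decoder=2,
    n_layers_encoder=2,
)

# Train the reference model and save it for later query mapping
vae = TOTALVI(Ref, **arches_params)
vae.train(max_epochs=250)
vae.save("cite_reference_model", overwrite=True)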

vae.adata_manager.view_registry()

Anndata setup with scvi-tools version 0.16.3.
Setup via `TOTALVI.setup_anndata` with arguments:
{
│   'protein_expression_obsm_key': 'protein_expression',
│   'protein_names_uns_key': None,
│   'batch_key': 'batch',
│   'layer': 'counts',
│   'size_factor_key': None,
│   'categorical_covariate_keys': None,
│   'continuous_covariate_keys': None
}
         Summary Statistics         
┏━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┓
┃     Summary Stat Key     ┃ Value ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━┩
│         n_cells          │ 10849 │
│          n_vars          │ 4000  │
│         n_labels         │   1   │
│         n_batch          │   2   │
│ n_extra_categorical_covs │   0   │
│ n_extra_continuous_covs  │   0   │
│        n_proteins        │  14   │
└──────────────────────────┴───────┘
                   Data Registry                   
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ Registry Key ┃       scvi-tools Location        ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│      X       │      adata.layers['counts']      │
│    labels    │    adata.obs['_scvi_labels']     │
│    batch     │     adata.obs['_scvi_batch']     │
│   proteins   │ adata.obsm['protein_expression'] │
└──────────────┴──────────────────────────────────┘
                     labels State Registry                      
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃      Source Location      ┃ Categories ┃ scvi-tools Encoding ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ adata.obs['_scvi_labels'] │     0      │          0          │
└───────────────────────────┴────────────┴─────────────────────┘
                  batch State Registry                   
┏━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━┓
┃  Source Location   ┃ Categories ┃ scvi-tools Encoding ┃
┡━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━┩
│ adata.obs['batch'] │   PBMC5k   │          0          │
│                    │  PBMC10k   │          1          │
└────────────────────┴────────────┴─────────────────────┘

The query’s batch column instead has two other values, “set1” and “set2”:

query.obs["batch"]
[...]
Name: batch, Length: 57669, dtype: category
Categories (2, object): ['set1', 'set2']

I think I understand that the registry can be updated to account for the new info in the query, but I’m not sure I fully follow the documentation on Data Handling.

This, however, doesn’t change the batch info in the AnnData manager:

vae.adata_manager.transfer_fields(adata_target=query, extend_categories=True)
vae.adata_manager.get_state_registry("batch")

attrdict({'categorical_mapping': array(['PBMC5k', 'PBMC10k'], dtype=object), 'original_key': 'batch'})
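
One possible explanation for the unchanged output, stated as an assumption about the API: AnnDataManager.transfer_fields appears to build and return a new manager for the target AnnData rather than modifying the manager it is called on. If so, vae.adata_manager would be expected to keep listing only PBMC5k and PBMC10k, and the extended categories would live on the returned object. The sketch below mimics only that pattern; ToyManager is hypothetical and is not scvi-tools code.

```python
# Toy sketch of a "transfer returns a new manager" pattern.
# `ToyManager` is a hypothetical class, not part of scvi-tools.

class ToyManager:
    def __init__(self, categories):
        self.categories = list(categories)

    def transfer_fields(self, target_values, extend_categories=False):
        """Validate target values against this registry and return a NEW manager."""
        merged = list(self.categories)
        for value in target_values:
            if value not in merged:
                if not extend_categories:
                    raise ValueError(
                        f"Category {value} not found in source registry. "
                        "Cannot transfer setup without extend_categories = True."
                    )
                merged.append(value)  # new categories go after the existing ones
        return ToyManager(merged)


ref_manager = ToyManager(["PBMC5k", "PBMC10k"])
query_manager = ref_manager.transfer_fields(["set1", "set2"], extend_categories=True)

print(ref_manager.categories)    # ['PBMC5k', 'PBMC10k']
print(query_manager.categories)  # ['PBMC5k', 'PBMC10k', 'set1', 'set2']
```

Under that assumption, the equivalent check here would be to keep the return value of vae.adata_manager.transfer_fields(adata_target=query, extend_categories=True) and call get_state_registry("batch") on that returned manager instead of on vae.adata_manager.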

I am not sure what to do here.

I think the workaround of simply concatenating the totalVI embeddings calculated independently for R and Q would be appropriate, but I want standardized code across my Q2R mappings, so I would like to understand why the lines above work for other Q2R mappings (the same code also works on other experiments) but not for this particular one.

thank you!

bio-la · December 15, 2022, 9:58am

It didn’t allow me to add the link to the documentation, so here it is:

bio-la · December 15, 2022, 1:58pm

Well, as usual, raising the issue was instrumental in finding the solution. I think I actually sorted out the problem by switching the concatenation from the old AnnData.concatenate method to the newer anndata.concat function:

import anndata as ad
adata_full = ad.concat([adata_ref, adata_query])

There is no problem getting the full totalVI embedding from this AnnData, but I would still like to know whether you’d advise concatenating the ref and query (totalVI/scVI) embeddings or recalculating them on the full object.

They should be pretty much the same, but I’m not sure what’s best.

thanks!
