Error in scvi.model.TOTALVI.setup_anndata when loading protein-only data


This is my first post here, so I apologize if I have done something wrong.

I have been using TOTALVI for protein-only CITE-seq analyses for quite a while, and I’m sure I have been able to get good results using scvi-tools 1.2.X or 1.4.X.

However, the following code generates an error:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotnine as p9
import scvi
import scanpy as sc

sc.set_figure_params(figsize=(4, 4), color_map='cividis')
scvi.settings.seed = int(20021208)

adata = sc.read_h5ad(save_path_1+'/Hybridization_ann_single.h5')
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
adata.raw = adata

scvi.model.TOTALVI.setup_anndata(
    adata,
    protein_expression_obsm_key='protein_expression',
    batch_key='assignment',
    layer='counts')

And the full error traceback is as follows:

ValueError                                Traceback (most recent call last)
<ipython-input-42-769f92bccbda> in <module>
      3     protein_expression_obsm_key = 'protein_expression',
      4     batch_key='assignment',
----> 5     layer='counts')

5 frames
/usr/local/lib/python3.7/dist-packages/scvi/model/ in setup_anndata(cls, adata, protein_expression_obsm_key, protein_names_uns_key, batch_key, layer, size_factor_key, categorical_covariate_keys, continuous_covariate_keys, **kwargs)
   1246             fields=anndata_fields, setup_method_args=setup_method_args
   1247         )
-> 1248         adata_manager.register_fields(adata, **kwargs)
   1249         cls.register_manager(adata_manager)

/usr/local/lib/python3.7/dist-packages/scvi/data/ in register_fields(self, adata, source_registry, **transfer_kwargs)
    175                     field_registry[
    176                         _constants._STATE_REGISTRY_KEY
--> 177                     ] = field.register_field(adata)
    179             # Compute and set summary stats for the given field.

/usr/local/lib/python3.7/dist-packages/scvi/data/fields/ in register_field(self, adata)
     95     def register_field(self, adata: AnnData) -> dict:
---> 96         super().register_field(adata)
     97         if self.correct_data_format:
     98             _verify_and_correct_data_format(adata, self.attr_name, self.attr_key)

/usr/local/lib/python3.7/dist-packages/scvi/data/fields/ in register_field(self, adata)
     65             stored directly on the AnnData/MuData object.
     66         """
---> 67         self.validate_field(adata)
     68         return dict()

/usr/local/lib/python3.7/dist-packages/scvi/data/fields/ in validate_field(self, adata)
     84         x = self.get_field_data(adata)
---> 86         if self.is_count_data and not _check_nonnegative_integers(x):
     87             logger_data_loc = (
     88                 "adata.X" if self.attr_key is None else f"adata.layers[{self.attr_key}]"

/usr/local/lib/python3.7/dist-packages/scvi/data/ in _check_nonnegative_integers(data, n_to_check)
    204         raise TypeError("data type not understood")
--> 206     inds = np.random.choice(len(data), size=(n_to_check,))
    207     check = jax.device_put(data.flat[inds], device=jax.devices("cpu")[0])
    208     negative, non_integer = _is_not_count_val(check)

For your reference, the structure of adata right before calling setup_anndata is as follows:

AnnData object with n_obs × n_vars = 134136 × 0
    obs: 'assignment', 'IGg_singlet', 'UMI_antibody_raw'
    var: 'gene_ids', 'feature_types'
    uns: 'random_seed', 'log1p', 'assignment_colors'
    obsm: 'X_ADT_umap', 'protein_expression'
    layers: 'counts'

adata.obsm['protein_expression'] has the expected shape (134136 x 136).

A brief search suggests the error has to do with empty pandas.DataFrame objects. However, I would not expect gene expression to be processed with pandas at all. Is it?

Hi, thanks for using scvi-tools.

Your anndata appears to have n_vars = 0, which explains the empty pandas dataframe error (the dataframe being the data structure underlying the .layers object). Are you sure you have the counts layer formatted correctly?

In this case, the counts layer is a scipy.sparse.csr.csr_matrix of shape (134136, 0). But, still, if there’s no RNA, then n_vars would certainly be zero…
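
The last frame of the traceback above draws random indices with np.random.choice(len(data), ...). Here is a minimal sketch of why that step can fail on a layer with zero columns. It assumes the check ends up sampling from a zero-length array of values. The variable names and the sample size of 20 are illustrative and not taken from scvi-tools.

```python
import numpy as np

# Stand-in for the values held by an empty counts layer:
# a matrix with zero columns contains no entries at all.
empty_values = np.zeros((5, 0)).ravel()

# Assumed sample size; the real default is not visible in the traceback.
n_to_check = 20


def sampling_fails(values, n_to_check):
    """Return True if drawing n_to_check random indices from `values` raises ValueError."""
    try:
        np.random.choice(len(values), size=(n_to_check,))
    except ValueError:
        return True
    return False


empty_fails = sampling_fails(empty_values, n_to_check)
nonempty_fails = sampling_fails(np.ones(3), n_to_check)
print("empty layer fails:", empty_fails)
print("non-empty layer fails:", nonempty_fails)
```

If this reading is right, the ValueError comes from the zero-column counts layer itself and not from adata.obsm['protein_expression'].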

For protein only, I would not advise using totalVI, unless you somehow made the RNA part of the anndata a single gene of, say, all zeros. Is that what you’re doing?

Hi @adamgayoso : What would you recommend for protein-only data?

We don’t have a model at the moment. I’d be curious if you tried using one fake gene with all ones as input – I don’t think it will have much if any effect.
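
For anyone who wants to try the single fake gene idea, here is a rough sketch of how such an object could be built. The helper names dummy_counts and add_dummy_gene and the gene name "dummy_gene" are hypothetical, and whether totalVI trains sensibly on the result is untested here.

```python
import numpy as np


def dummy_counts(n_obs):
    """Return an (n_obs, 1) matrix of ones to act as a single fake gene."""
    return np.ones((n_obs, 1), dtype=np.float32)


def add_dummy_gene(adata, gene_name="dummy_gene"):
    """Build a new AnnData with one all-ones gene, keeping obs and obsm.

    Hypothetical helper; anndata and pandas are imported lazily so that
    dummy_counts can be used on its own.
    """
    import anndata as ad
    import pandas as pd

    X = dummy_counts(adata.n_obs)
    var = pd.DataFrame(index=[gene_name])
    new = ad.AnnData(X=X, obs=adata.obs.copy(), var=var)
    # Carry over the protein matrix and any other per-cell arrays.
    for key in adata.obsm.keys():
        new.obsm[key] = adata.obsm[key].copy()
    # Same layer name as in the original post.
    new.layers["counts"] = X.copy()
    return new
```

The result of add_dummy_gene(adata) could then be passed to scvi.model.TOTALVI.setup_anndata with the same protein_expression_obsm_key, batch_key and layer arguments as in the original post.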