Error in scvi.model.TOTALVI.setup_anndata when loading protein-only data

Hi,

This is my first post here, so my apologies if I did something wrong.

I have been using TOTALVI for protein-only CITE-seq analyses for quite a while, and I’m sure I have been able to get good results using scvi-tools 1.2.X or 1.4.X.

However, the following code generates an error:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotnine as p9
import scvi
import scanpy as sc
import scipy.io

sc.set_figure_params(figsize=(4, 4), color_map='cividis')
scvi.settings.seed = int(20021208)
np.random.seed(20021208)

adata = sc.read_h5ad(save_path_1+'/Hybridization_ann_single.h5')
SCITO_scvi_ann = scvi.data.organize_cite_seq_10x(adata)
adata.layers["counts"] = adata.X.copy()
sc.pp.normalize_total(adata, target_sum=1e4)
adata.raw = adata
scvi.model.TOTALVI.setup_anndata(
    adata,
    protein_expression_obsm_key = 'protein_expression',
    batch_key='assignment',
    layer='counts')

And the full error traceback is as follows:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-42-769f92bccbda> in <module>
      3     protein_expression_obsm_key = 'protein_expression',
      4     batch_key='assignment',
----> 5     layer='counts')

5 frames
/usr/local/lib/python3.7/dist-packages/scvi/model/_totalvi.py in setup_anndata(cls, adata, protein_expression_obsm_key, protein_names_uns_key, batch_key, layer, size_factor_key, categorical_covariate_keys, continuous_covariate_keys, **kwargs)
   1246             fields=anndata_fields, setup_method_args=setup_method_args
   1247         )
-> 1248         adata_manager.register_fields(adata, **kwargs)
   1249         cls.register_manager(adata_manager)
   1250 

/usr/local/lib/python3.7/dist-packages/scvi/data/_manager.py in register_fields(self, adata, source_registry, **transfer_kwargs)
    175                     field_registry[
    176                         _constants._STATE_REGISTRY_KEY
--> 177                     ] = field.register_field(adata)
    178 
    179             # Compute and set summary stats for the given field.

/usr/local/lib/python3.7/dist-packages/scvi/data/fields/_layer_field.py in register_field(self, adata)
     94 
     95     def register_field(self, adata: AnnData) -> dict:
---> 96         super().register_field(adata)
     97         if self.correct_data_format:
     98             _verify_and_correct_data_format(adata, self.attr_name, self.attr_key)

/usr/local/lib/python3.7/dist-packages/scvi/data/fields/_base_field.py in register_field(self, adata)
     65             stored directly on the AnnData/MuData object.
     66         """
---> 67         self.validate_field(adata)
     68         return dict()
     69 

/usr/local/lib/python3.7/dist-packages/scvi/data/fields/_layer_field.py in validate_field(self, adata)
     84         x = self.get_field_data(adata)
     85 
---> 86         if self.is_count_data and not _check_nonnegative_integers(x):
     87             logger_data_loc = (
     88                 "adata.X" if self.attr_key is None else f"adata.layers[{self.attr_key}]"

/usr/local/lib/python3.7/dist-packages/scvi/data/_utils.py in _check_nonnegative_integers(data, n_to_check)
    204         raise TypeError("data type not understood")
    205 
--> 206     inds = np.random.choice(len(data), size=(n_to_check,))
    207     check = jax.device_put(data.flat[inds], device=jax.devices("cpu")[0])
    208     negative, non_integer = _is_not_count_val(check)

For your reference, the structure of adata right before calling setup_anndata is as follows:

AnnData object with n_obs × n_vars = 134136 × 0
    obs: 'assignment', 'IGg_singlet', 'UMI_antibody_raw'
    var: 'gene_ids', 'feature_types'
    uns: 'random_seed', 'log1p', 'assignment_colors'
    obsm: 'X_ADT_umap', 'protein_expression'
    layers: 'counts'

adata.obsm['protein_expression'] has the expected shape ( 134136 x 136 )

A brief search suggests the error has to do with empty pandas.DataFrame objects. However, I wouldn’t expect gene expression to be processed in pandas in any way…?

Hi, thanks for using scvi-tools.

Your anndata appears to have n_vars = 0 which explains why there is an empty pandas dataframe error (the data structure underlying the .layers object). Are you sure you have the counts layer formatted correctly?

@Justin_Hong
In this case, the counts layer is a scipy.sparse.csr.csr_matrix of shape (134136, 0). But, still, if there’s no RNA, then n_vars would certainly be zero…
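
A minimal sketch of why a zero-column sparse layer could trip the count check, using only NumPy and SciPy. It assumes the check samples indices from the matrix's stored values (`.data`), which is a reading of the `np.random.choice(len(data), ...)` line in the traceback rather than something confirmed in this thread. The matrix size and `n_to_check` value are placeholders.

```python
import numpy as np
import scipy.sparse as sp

# A zero-gene counts layer, shaped like the one described above but with fewer cells.
counts = sp.csr_matrix((1000, 0), dtype=np.float32)

# Assumption: for sparse input, only the stored (non-zero) values are sampled.
# With zero columns there are no stored values at all.
values = counts.data
n_to_check = 20  # placeholder sample size


def sample_indices(values, n_to_check):
    """Mimic the sampling line from the traceback."""
    return np.random.choice(len(values), size=(n_to_check,))


try:
    sample_indices(values, n_to_check)
    raised = False
except ValueError as err:
    # NumPy refuses to draw samples from an empty population.
    raised = True
    print(f"ValueError: {err}")
```

If this reading is right, the failure comes from sampling an empty array, so any layer with no stored values would hit it, not only a pandas object.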

For protein only, I would not advise using totalVI, unless you somehow made the RNA part of the anndata a single gene of, say, all zeros. Is that what you’re doing?

Hi @adamgayoso : What would you recommend for protein-only data?

We don’t have a model at the moment. I’d be curious if you tried using one fake gene with all ones as input – I don’t think it will have much if any effect.
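
A hedged sketch of the "one fake gene with all ones" idea, for anyone who wants to try it. The helper names `make_dummy_counts` and `with_dummy_gene` and the label `dummy_gene` are hypothetical, not part of scvi-tools. The sketch assumes the protein counts already sit in `adata.obsm['protein_expression']` as in the post above, and it has not been validated against any particular scvi-tools version.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp


def make_dummy_counts(n_obs: int) -> sp.csr_matrix:
    """Return an (n_obs, 1) sparse count matrix whose single fake gene is all ones."""
    return sp.csr_matrix(np.ones((n_obs, 1), dtype=np.float32))


def with_dummy_gene(adata, gene_name: str = "dummy_gene"):
    """Rebuild a protein-only AnnData around one placeholder gene.

    `gene_name` is a hypothetical label chosen for illustration.
    """
    import anndata as ad  # imported lazily so make_dummy_counts works without anndata

    counts = make_dummy_counts(adata.n_obs)
    new = ad.AnnData(
        X=counts,
        obs=adata.obs.copy(),
        var=pd.DataFrame(index=[gene_name]),
    )
    # Carry the protein counts over unchanged.
    new.obsm["protein_expression"] = adata.obsm["protein_expression"].copy()
    # Keep raw counts in a layer, matching the setup call in the original post.
    new.layers["counts"] = counts.copy()
    return new


# Usage, mirroring the call from the original post (not run here):
# adata_fixed = with_dummy_gene(adata)
# scvi.model.TOTALVI.setup_anndata(
#     adata_fixed,
#     protein_expression_obsm_key="protein_expression",
#     batch_key="assignment",
#     layer="counts",
# )
```

With one gene present, `n_vars` is 1 and the counts layer has stored values, so the sampling step in the traceback should at least have something to draw from. Whether the resulting model behaves sensibly is a separate question that this sketch does not answer.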