Questions about running differential expression

Hi,

I am new to scvi tools and I am trying to follow the tutorial of CITESeq analysis with totalIVI. However, I am encountering an error, “raise KeyError(f”{not_found} not in index")" while running the differential expression command. The error was saying a lot of them is not in the index. Here are the set up of the object and I am hoping someone could give me some insight on how to solve that or what is causing this index error. Thank you!!

1 Like

can you paste the full traceback here?

Thank you for the reply! Here’s the full traceback. P.S. The list of missing index goes a lot longer but I don’t have enough character so I cut some in the middle.

DE…: 0%| | 0/18 [00:39<?, ?it/s]

KeyError Traceback (most recent call last)
Cell In [25], line 1
----> 1 de_df = vae.differential_expression(groupby = “cell_calls”)
2 de_df.head(5)

File ~/miniconda3/envs/scanpy_env/lib/python3.10/site-packages/scvi/model/_totalvi.py:763, in TOTALVI.differential_expression(self, adata, groupby, group1, group2, idx1, idx2, mode, delta, batch_size, all_stats, batch_correction, batchid1, batchid2, fdr_target, silent, protein_prior_count, scale_protein, sample_protein_mixing, include_protein_background, **kwargs)
749 model_fn = partial(
750 self._expression_for_de,
751 scale_protein=scale_protein,
(…)
755 batch_size=batch_size,
756 )
757 col_names = np.concatenate(
758 [
759 np.asarray(_get_var_names_from_manager(adata_manager)),
760 self.protein_state_registry.column_names,
761 ]
762 )
→ 763 result = _de_core(
764 adata_manager,
765 model_fn,
766 groupby,
767 group1,
768 group2,
769 idx1,
770 idx2,
771 all_stats,
772 cite_seq_raw_counts_properties,
773 col_names,
774 mode,
775 batchid1,
776 batchid2,
777 delta,
778 batch_correction,
779 fdr_target,
780 silent,
781 **kwargs,
782 )
784 return result

File ~/miniconda3/envs/scanpy_env/lib/python3.10/site-packages/scvi/model/base/_utils.py:267, in de_core(adata_manager, model_fn, groupby, group1, group2, idx1, idx2, all_stats, all_stats_fn, col_names, mode, batchid1, batchid2, delta, batch_correction, fdr, silent, **kwargs)
265 res = res.sort_values(by=sort_key, ascending=False)
266 if mode == “change”:
→ 267 res[f"is_de_fdr
{fdr}"] = _fdr_de_prediction(res[“proba_de”], fdr=fdr)
268 if idx1 is None:
269 g2 = “Rest” if group2 is None else group2

File ~/miniconda3/envs/scanpy_env/lib/python3.10/site-packages/scvi/model/base/_utils.py:288, in _fdr_de_prediction(posterior_probas, fdr)
286 raise ValueError(“posterior_probas should be 1-dimensional”)
287 sorted_genes = np.argsort(-posterior_probas)
→ 288 sorted_pgs = posterior_probas[sorted_genes]
289 cumulative_fdr = (1.0 - sorted_pgs).cumsum() / (1.0 + np.arange(len(sorted_pgs)))
290 d = (cumulative_fdr <= fdr).sum()

File ~/miniconda3/envs/scanpy_env/lib/python3.10/site-packages/pandas/core/series.py:984, in Series.getitem(self, key)
981 key = np.asarray(key, dtype=bool)
982 return self._get_values(key)
→ 984 return self._get_with(key)

File ~/miniconda3/envs/scanpy_env/lib/python3.10/site-packages/pandas/core/series.py:1019, in Series._get_with(self, key)
1015 if key_type == “integer”:
1016 # We need to decide whether to treat this as a positional indexer
1017 # (i.e. self.iloc) or label-based (i.e. self.loc)
1018 if not self.index._should_fallback_to_positional:
→ 1019 return self.loc[key]
1020 else:
1021 return self.iloc[key]

File ~/miniconda3/envs/scanpy_env/lib/python3.10/site-packages/pandas/core/indexing.py:967, in _LocationIndexer.getitem(self, key)
964 axis = self.axis or 0
966 maybe_callable = com.apply_if_callable(key, self.obj)
→ 967 return self._getitem_axis(maybe_callable, axis=axis)

File ~/miniconda3/envs/scanpy_env/lib/python3.10/site-packages/pandas/core/indexing.py:1194, in _LocIndexer._getitem_axis(self, key, axis)
1191 if hasattr(key, “ndim”) and key.ndim > 1:
1192 raise ValueError(“Cannot index with multidimensional key”)
→ 1194 return self._getitem_iterable(key, axis=axis)
1196 # nested tuple slicing
1197 if is_nested_tuple(key, labels):

File ~/miniconda3/envs/scanpy_env/lib/python3.10/site-packages/pandas/core/indexing.py:1132, in _LocIndexer._getitem_iterable(self, key, axis)
1129 self._validate_key(key, axis)
1131 # A collection of keys
→ 1132 keyarr, indexer = self._get_listlike_indexer(key, axis)
1133 return self.obj._reindex_with_indexers(
1134 {axis: [keyarr, indexer]}, copy=True, allow_dups=True
1135 )

File ~/miniconda3/envs/scanpy_env/lib/python3.10/site-packages/pandas/core/indexing.py:1330, in _LocIndexer._get_listlike_indexer(self, key, axis)
1327 ax = self.obj._get_axis(axis)
1328 axis_name = self.obj._get_axis_name(axis)
→ 1330 keyarr, indexer = ax._get_indexer_strict(key, axis_name)
1332 return keyarr, indexer

File ~/miniconda3/envs/scanpy_env/lib/python3.10/site-packages/pandas/core/indexes/base.py:5796, in Index._get_indexer_strict(self, key, axis_name)
5793 else:
5794 keyarr, indexer, new_indexer = self._reindex_non_unique(keyarr)
→ 5796 self._raise_if_missing(keyarr, indexer, axis_name)
5798 keyarr = self.take(indexer)
5799 if isinstance(key, Index):
5800 # GH 42790 - Preserve name from an Index

File ~/miniconda3/envs/scanpy_env/lib/python3.10/site-packages/pandas/core/indexes/base.py:5859, in Index._raise_if_missing(self, key, indexer, axis_name)
5856 raise KeyError(f"None of [{key}] are in the [{axis_name}]“)
5858 not_found = list(ensure_index(key)[missing_mask.nonzero()[0]].unique())
→ 5859 raise KeyError(f”{not_found} not in index")

KeyError: ‘[239, 238, 237, 235, 234, 233, 236, 231, 230, 229, 232, 245, 244, 243, 241, 240, 242, 255, 260, 259, 258, 257, 256, 254, 248, 252, 251, 250, 249, 253, 247, 246, 274, 273, 272, 271, 270, 269, 268, 266, 265, 264, 263, 262, 261, 267, 286, 293, 292, 291, 290, 289, 288, 287, 285, 284, 283, 282, 281, 280, 279, 278, 277, 276, 275, 307, 306, 305, 304, 303, 302, 301, 299, 298, 297, 296, 295, 294, 300, 317, 322, 321, 320, 319, 318, 315, 316, 314, 313, 312, 311, 310, 309, 308, 337, 336, 335, 334, 333, 332, 330, 331, 328, 327, 326, 325, 324, 323, 329, 348, 349, 350, 351, 355, 353, 354, 356, 352, 347, 344, 345, 343, 342, 341, 340, 339, 338, 346, 370, 369, 368, 367, 366, 365, 364, 361, 362, 360, 359, 358, 357, 363, 384, 383, 389, 382, 386, 387, 388, 385, 381, 375, 379, 378, 377, 376, 374, 373, 372, 371, 380, 405, 406, 407, 408, 413, 410, 411, 412, 404, 409, 402, 403, 400, 401, 390, 391, 392, 394, 393, 396, 397, 398, 399, 395, 433, 430, 431, 432, 434, 440, 436, 437, 438, 441, 429, 435, 428, 439, 426, 427, 414, 416, 417, 418, 419, 415, 421, 422, 423, 424, 425, 420, 453, 461, 460, 459, 458, 457, 456, 455, 454, 452, 462, 450, 449, 448, 447, 446, 445, 444, 443, 442, 451, 474, 476, 477, 481, 479, 480, 473, 478, 472, 475, 470, 469, 468, 467, 466, 465, 464, 463, 471, 500, 501, 502, 503, 504, 505, 506, 511, 508, 509, 512, 513, 514, 499, 507, 498, 510, 496, 482, 483, 485, 497, 486, 487, 488, 484, 490, 491, 492, 493, 494, 495, 489, 529, 530, 531, 532, 536, 534, 535, 537, 528, 533, 527, 525, 517, 524, 523, 522, 521, 520, 519, 518, 516, 515, 526, 556, 555, 553, 557, 554, 558, 564, 560, 561, 562, 563, 565, 559, 552, 538, 550, 551, 539, 540, 541, 543, 542, 545, 546, 547, 548, 549, 544, 584, … 32071, 32072, 32073, 32074, 32075, 32076, 32077, 32078, 32079, 32080, 32081, 32082, 32083, 32084, 32085, 32086, 32087, 32088, 32089, 32090, 32091, 32092, 32093, 32094, 32095, 32096, 32097, 32098, 32099, 32100, 32101, 32102, 32103, 32104, 32105, 32106, 32107, 32108, 32109, 32110, 32111, 32112, 32113, 32114, 32115, 32116, 32117, 32118, 32119, 32120, 32121, 32122, 32123, 32124, 32125, 32126, 32127, 32128, 32129, 32130, 32131, 32132, 32133, 32134, 32135, 32136, 32137, 32138, 32139, 32140, 32141, 32142, 32143] not in index’

Just FYI, I am trying the same workflow for another set of data but still get stuck at the same step with the differential expression and same error. So I am wondering is there something wrong with me setting up the model. I am hoping someone could help me as I really need some help with using scvitools! Could attach the object and Jupyter notebook if needed. Thank you in advance!!

Here attached is the printout of vae.view_anndata_setup(adata)

Is it the case that you have cells without protein data?

I had cells without protein data but I filtered them with these command already and the resulting mdata should only contain the cells with both RNA and protein

common_idx = list(set(adata_RNA.obs_names).intersection(set(adata_FB.obs_names)))
adata_RNA_subset = adata_RNA[common_idx].copy()
adata_FB_subset = adata_FB[common_idx].copy()

I am attaching the two adata, RNA and protein and the Jupiter notebook that I used to combine to make mdata and run the totalVI analysis. I had no problem of running and replicating the tutorial of CITE-seq analysis with totalVI for PBMC10k and PBMC5k so the environment and package shouldn’t be the issue.
For the RNA, I already did some filtering and, normalize and log transfer the data. For protein, it is just raw count data. Please help and let me know how I can amend my object so that the differential expression analysis could be run. Please see the google drive for the data and notebook. Thank you very much!!!

https://drive.google.com/drive/folders/1HLKdIb0lFIea3UJ3r3-zzT-1TLCpPWam?usp=sharing

Sorry for the delay. In the screenshot you sent with the training code, there is a warning about potentially missing protein measurements, which are detected when a cell has all zero protein counts. Is this warning gone when you subset to 561 cells?

Ok I understand the issue.

In the notebook you shared you can replace

adata.obsm["protein_expression"] = mdata["adt"].layers["counts"].A.copy()

with

adata.obsm["protein_expression"] = mdata["adt"].to_df("counts")

The issue arises due to the first line giving a numpy array, so our library generates integer based index for the protein names. This causes an issue with accessing a pandas series in your traceback.

We will push a fix soon. In the meantime, the recommended workflow is to use dataframes. Alternatively, you can use mudata throughout by following this tutorial:

Finally, it’s important to note that the differential expression function may give strange results due to cells with missing proteins. You can filter these cells and supply a custom adata to vae.differential_expression(new_adata)

Alternatively, advanced usage can alter the batchid1 and batchid2 arguments.

Thanks Adam for your suggestion! I can confirm the below command is working now.
adata.obs['leiden'] = mdata.obs['leiden_totalVI']
de_df = vae.differential_expression(groupby = 'leiden', delta=0.5, batch_correction=True)

Really appreciated your help! One more question, when you mentioned my DE analysis may give strange results due to cells with missing proteins. May I ask how should I filter the cells with missing proteins? Thanks again for the help!!