Scirpy detects cells with dual chains: two VJ chains, two VDJ chains, or both.
I would like to select these cells and “split” them into two artificially defined cells, with the cell_id containing the suffix “primary_pair” or “secondary pair”.
(Whether this is a relevant way of including the double IR model into repertoire analysis is another question, on which your opinion is welcome.)
I managed to select the cells with dual chains (this example uses the data from scirpy’s tutorial):
But I don’t manage to modify the chain information in this anndata object to as to divide them into a ‘primary_chain’ and ‘secondary chain’ dataset.
More precisely, I have understood how to use pairs.obsm['chain_indices'] to subset VJ chains in pairs.obsm['airr'], but I do not manage to make new anndata objects out of these chains.
I suggest to decompose the AnnData object into individual AirrCell objects which are much more convenient for manipulating them than the awkward array.
Here’s a minimal example:
import scirpy as ir
adata = ir.datasets.wu2020_3k()["airr"]
ir.tl.chain_qc(adata)
# split into two anndata objects
single_chain = adata[adata.obs["chain_pairing"].isin(["orphan VDJ", "single pair", "orphan VJ"])]
multiple_chains = adata[adata.obs["chain_pairing"].isin(["two full chains", "extra VJ", "extra VDJ"])]
# decompose first AnnData into AirrCells
airr_cells = ir.io.to_airr_cells(single_chain)
# decompose second AnnData into AirrCells and generate new cells on-the-fly
airr_cells = ir.io.to_airr_cells(single_chain)
for airr_cell in ir.io.to_airr_cells(multiple_chains):
new_cell1 = ir.io.AirrCell(cell_id=f'{airr_cell.cell_id}_1')
new_cell2 = ir.io.AirrCell(cell_id=f'{airr_cell.cell_id}_2')
# stupid example: add every second chain to cell1 / cell2
for i, chain in enumerate(airr_cell.chains):
if i % 2:
new_cell1.add_chain(chain)
else:
new_cell2.add_chain(chain)
airr_cells.append(new_cell1)
airr_cells.append(new_cell2)
# Finally, convert back to AnnData
adata_new = ir.io.from_airr_cells(airr_cells)
Your approach with AirrCells simplifies the problem.
However, the final object does not have the expected number of cells of each type.
Running your example (slightly modified to answer my problem):
import scirpy as ir
adata = ir.datasets.wu2020_3k()["airr"]
ir.tl.chain_qc(adata)
# split into two anndata objects
single_chain = adata[adata.obs["chain_pairing"].isin(
["orphan VDJ", "single pair", "orphan VJ"])]
multiple_chains = adata[adata.obs["chain_pairing"].isin(
["two full chains", "extra VJ", "extra VDJ"])]
# decompose first AnnData into AirrCells
airr_cells = ir.io.to_airr_cells(single_chain)
# decompose second AnnData into AirrCells and generate new cells on-the-fly
for airr_cell in ir.io.to_airr_cells(multiple_chains):
new_cell1 = ir.io.AirrCell(cell_id=f'{airr_cell.cell_id}_1')
new_cell2 = ir.io.AirrCell(cell_id=f'{airr_cell.cell_id}_2')
# Add first two chains into one cell and the additional chains into another cell
for i, chain in enumerate(airr_cell.chains):
if i <= 2: # first two chains go into first cell
new_cell1.add_chain(chain)
elif i <= 4: # additional chains go into second cell
new_cell2.add_chain(chain)
airr_cells.append(new_cell1)
airr_cells.append(new_cell2)
# Finally, convert back to AnnData
adata_new = ir.io.from_airr_cells(airr_cells)
and looking at the resulting cells:
ir.tl.chain_qc(adata_new) ## compute chain_pairing information
adata_new.obs['chain_pairing'].value_counts()
, I get the following chain pairings:
chain_pairing
single pair 1685
orphan VDJ 969
orphan VJ 270
no IR 121
extra VJ 78
extra VDJ 51
Name: count, dtype: int64
Whereas here, the expected behavior should be to have only single pairs and orphan pairs.
Surprinsingly, cells in single_chain has more than 2 chains:
n_chains = [len(airr_cell.chains) for airr_cell in ir.io.to_airr_cells(single_chain)]
import pandas as pd
pd.Series(n_chains).value_counts()
This seems to not be compliant with the AirrCell format and the dual IR model, and would explain the unexpected behavior.
if i <= 1: # first two chains go into first cell
new_cell1.add_chain(chain)
elif i <= 3: # additional chains go into second cell
new_cell2.add_chain(chain)
, one obtains the following:
chain_pairing
single pair 1745
orphan VDJ 1052
orphan VJ 368
no IR 9
The unexpected behavior now is that some cells have no IR chain.
My guess is that airr_cell.chains contains VDJ and VJ chains, but also other chain types, which would explain why the aforementioned n_chains takes values above 2.
But I do not find a way to filter them.
There are two attributes you’ll probably want to consider:
chain["locus"] → from that you can infer if it’s VJ or VDJ (there’s pre-defined lists in AirrCell.VJ_LOCI and AirrCell.VDJ_LOCI)
chain["productive"] → scirpy by default ignores non-productive chains. This is why you get “no IR” sometimes. If you absolutely want to keep those, you can do so using the filter parameter in pp.index_chains.