Define dual IR pairs as new cell

Hello,

Scirpy detects cells with dual chains: two VJ chains, two VDJ chains, or both.
I would like to select these cells and “split” them into two artificially defined cells, with the cell_id containing the suffix “primary_pair” or “secondary pair”.
(Whether this is a relevant way of including the double IR model into repertoire analysis is another question, on which your opinion is welcome.)

I managed to select the cells with dual chains (this example uses the data from scirpy’s tutorial):

adata_tcr = ir.io.read_10x_vdj(proj_dir+"scirpy_example_data/GSM4385993_C144_filtered_contig_annotations.csv.gz")
ir.pp.index_chains(adata_tcr)
ir.tl.chain_qc(adata_tcr)
pairs = adata_tcr[adata_tcr.obs["chain_pairing"] == "extra VJ"]

But I don’t manage to modify the chain information in this anndata object to as to divide them into a ‘primary_chain’ and ‘secondary chain’ dataset.

More precisely, I have understood how to use pairs.obsm['chain_indices'] to subset VJ chains in pairs.obsm['airr'], but I do not manage to make new anndata objects out of these chains.

Thanks.

Hi, interesting question.

I suggest to decompose the AnnData object into individual AirrCell objects which are much more convenient for manipulating them than the awkward array.

Here’s a minimal example:

import scirpy as ir
adata = ir.datasets.wu2020_3k()["airr"]
ir.tl.chain_qc(adata)

# split into two anndata objects
single_chain = adata[adata.obs["chain_pairing"].isin(["orphan VDJ", "single pair", "orphan VJ"])]
multiple_chains = adata[adata.obs["chain_pairing"].isin(["two full chains", "extra VJ", "extra VDJ"])]

# decompose first AnnData into AirrCells
airr_cells = ir.io.to_airr_cells(single_chain)
# decompose second AnnData into AirrCells and generate new cells on-the-fly
airr_cells = ir.io.to_airr_cells(single_chain)
for airr_cell in ir.io.to_airr_cells(multiple_chains):
    new_cell1 = ir.io.AirrCell(cell_id=f'{airr_cell.cell_id}_1')
    new_cell2 = ir.io.AirrCell(cell_id=f'{airr_cell.cell_id}_2')
    # stupid example: add every second chain to cell1 / cell2
    for i, chain in enumerate(airr_cell.chains):
        if i % 2:
            new_cell1.add_chain(chain)
        else:
            new_cell2.add_chain(chain)
    
    airr_cells.append(new_cell1)
    airr_cells.append(new_cell2)

# Finally, convert back to AnnData
adata_new = ir.io.from_airr_cells(airr_cells)




Regarding this question, I’m not sure what you would gain compared to setting dual_ir="any" in the define_clonotype_clusters step.

Thank you very much.

Your approach with AirrCells simplifies the problem.
However, the final object does not have the expected number of cells of each type.

Running your example (slightly modified to answer my problem):

import scirpy as ir
adata = ir.datasets.wu2020_3k()["airr"]
ir.tl.chain_qc(adata)

# split into two anndata objects
single_chain = adata[adata.obs["chain_pairing"].isin(
    ["orphan VDJ", "single pair", "orphan VJ"])]
multiple_chains = adata[adata.obs["chain_pairing"].isin(
    ["two full chains", "extra VJ", "extra VDJ"])]

# decompose first AnnData into AirrCells
airr_cells = ir.io.to_airr_cells(single_chain)
# decompose second AnnData into AirrCells and generate new cells on-the-fly
for airr_cell in ir.io.to_airr_cells(multiple_chains):
    new_cell1 = ir.io.AirrCell(cell_id=f'{airr_cell.cell_id}_1')
    new_cell2 = ir.io.AirrCell(cell_id=f'{airr_cell.cell_id}_2')
    # Add first two chains into one cell and the additional chains into another cell
    for i, chain in enumerate(airr_cell.chains):
        if i <= 2: # first two chains go into first cell
            new_cell1.add_chain(chain)
        elif i <= 4: # additional chains go into second cell
            new_cell2.add_chain(chain)

    airr_cells.append(new_cell1)
    airr_cells.append(new_cell2)

# Finally, convert back to AnnData
adata_new = ir.io.from_airr_cells(airr_cells)

and looking at the resulting cells:

ir.tl.chain_qc(adata_new) ## compute chain_pairing information
adata_new.obs['chain_pairing'].value_counts()

, I get the following chain pairings:

chain_pairing
single pair    1685
orphan VDJ      969
orphan VJ       270
no IR           121
extra VJ         78
extra VDJ        51
Name: count, dtype: int64

Whereas here, the expected behavior should be to have only single pairs and orphan pairs.

Surprinsingly, cells in single_chain has more than 2 chains:

n_chains = [len(airr_cell.chains) for airr_cell in ir.io.to_airr_cells(single_chain)]
import pandas as pd
pd.Series(n_chains).value_counts()

This seems to not be compliant with the AirrCell format and the dual IR model, and would explain the unexpected behavior.

Am I missing something here?

enumerate starts at 0 in Python, i.e. <= 2 includes 3 chains.

Pardon the stupid mistake…

Replacing the two concerned lines with:

        if i <= 1: # first two chains go into first cell
            new_cell1.add_chain(chain)
        elif i <= 3: # additional chains go into second cell
            new_cell2.add_chain(chain)

, one obtains the following:

chain_pairing
single pair    1745
orphan VDJ     1052
orphan VJ       368
no IR             9

The unexpected behavior now is that some cells have no IR chain.

My guess is that airr_cell.chains contains VDJ and VJ chains, but also other chain types, which would explain why the aforementioned n_chains takes values above 2.
But I do not find a way to filter them.

There are two attributes you’ll probably want to consider:

  • chain["locus"] → from that you can infer if it’s VJ or VDJ (there’s pre-defined lists in AirrCell.VJ_LOCI and AirrCell.VDJ_LOCI)
  • chain["productive"] → scirpy by default ignores non-productive chains. This is why you get “no IR” sometimes. If you absolutely want to keep those, you can do so using the filter parameter in pp.index_chains.

Thanks much, this solved it!