How to retrieve the V, D, J gene segment identifiers and CDR3 sequence given a clone_id?

I have mdata which has ‘gex’ and ‘airr’ slots (the mdata structure is below).

Then I did:

ir.pp.ir_dist(mdata)
ir.tl.define_clonotypes(mdata, receptor_arms="all", dual_ir="primary_only")

This ran successfully, and I see that it created two columns in mdata.obs:
mdata.obs["airr:clone_id"]
mdata.obs["airr:clone_id_size"]

After visualizing the airr:clone_id values across my samples, I found that clone_id ‘747’ is of interest.

How do I retrieve the actual V, D, J gene segment identifiers and CDR3 sequences of the TRA/TRB chains that define this clonotype?

Thanks

FYI: The mdata structure is as follows:

MuData object with n_obs × n_vars = 83740 × 32285
  obs:	'sample_id', 'treatment', 'sort', 'tissue', 'sample_id_augmented'
  uns:	'sample_id_colors', 'treatment_colors'
  2 modalities
    gex:	83740 x 32285
      obs:	'leiden_0.2', 'leiden_0.1', 'leiden_0.15', 'leiden_0.25', 'cd4', 'cd8', 'foxp3', 'leiden_0.3', 'celltype_id', 'sample_id', 'treatment', 'sort', 'tissue', 'sample_id_augmented'
      uns:	'log1p', 'pca', 'neighbors', 'umap', 'leiden_0.2', 'leiden_0.2_colors', 'leiden_0.1', 'leiden_0.1_colors', 'leiden_0.15', 'leiden_0.15_colors', 'leiden_0.25', 'leiden_0.3', 'leiden_0.3_colors', 'tissue_colors', 'treatment_colors', 'gex:celltype_id_colors', 'celltype_id_colors', 'sort_colors'
      obsm:	'X_pca', 'X_umap'
      varm:	'PCs'
      layers:	'counts', 'normalized', 'log1p'
      obsp:	'distances', 'connectivities'
    airr:	54647 x 0
      obs:	'receptor_type', 'receptor_subtype', 'chain_pairing', 'clone_id', 'clone_id_size'
      uns:	'chain_indices', 'ir_dist_nt_identity', 'clone_id'
      obsm:	'airr', 'chain_indices'

Hi @lesande,

I think an example of what you want to achieve is shown in the scirpy T cell tutorial:

with ir.get.airr_context(mdata, "junction_aa", ["VJ_1", "VDJ_1", "VJ_2", "VDJ_2"]):
    cdr3_ct_159 = (
        # TODO astype(str) is required due to a bug in pandas ignoring `dropna=False`. It seems fixed in pandas 2.x
        mdata.obs.loc[lambda x: x["airr:cc_aa_tcrdist"] == "159"]
        .astype(str)
        .groupby(
            [
                "VJ_1_junction_aa",
                "VDJ_1_junction_aa",
                "VJ_2_junction_aa",
                "VDJ_2_junction_aa",
                "airr:receptor_subtype",
            ],
            observed=True,
            dropna=False,
        )
        .size()
        .reset_index(name="n_cells")
    )
cdr3_ct_159

You can adapt that to your case by filtering on airr:clone_id == "747" instead of airr:cc_aa_tcrdist, and by additionally requesting the v_call, d_call and j_call fields, which contain the gene segments according to the AIRR Rearrangement standard.
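
For the clonotype from the question, a minimal sketch of that adjustment might look like the following. The helper names summarize_clone and clone_receptor_table are made up for illustration and are not part of scirpy. The sketch assumes that ir.get.airr_context accepts a list of AIRR fields and adds them to mdata.obs as "{chain}_{field}" columns, as the "VJ_1_junction_aa" column in the tutorial snippet suggests. Only VJ_1 and VDJ_1 are requested because the clonotypes were defined with dual_ir="primary_only".

```python
import pandas as pd


def summarize_clone(
    obs: pd.DataFrame,
    clone_id: str,
    columns: list[str],
    clone_col: str = "airr:clone_id",
) -> pd.DataFrame:
    """Count cells per unique combination of `columns` within one clonotype.

    Hypothetical helper, not a scirpy function.
    """
    # Compare as strings so it does not matter whether clone_id is stored
    # as a string, a categorical, or an integer.
    subset = obs.loc[obs[clone_col].astype(str) == str(clone_id), columns]
    return (
        # astype(str) turns missing values into strings so that rows with a
        # missing chain are kept by groupby (same workaround as in the tutorial).
        subset.astype(str)
        .groupby(columns, observed=True, dropna=False)
        .size()
        .reset_index(name="n_cells")
    )


def clone_receptor_table(mdata, clone_id: str) -> pd.DataFrame:
    """Gene segments and CDR3 sequences of the primary chains of one clonotype.

    Hypothetical helper; assumes the "{chain}_{field}" column naming.
    """
    import scirpy as ir  # imported here so summarize_clone works without scirpy

    fields = ["v_call", "d_call", "j_call", "junction_aa"]
    chains = ["VJ_1", "VDJ_1"]  # primary TRA-like and TRB-like chains
    columns = [f"{chain}_{field}" for chain in chains for field in fields]
    # The context manager adds the requested columns to mdata.obs only
    # temporarily; they are removed again when the block exits.
    with ir.get.airr_context(mdata, fields, chains):
        return summarize_clone(mdata.obs, clone_id, columns)
```

Calling clone_receptor_table(mdata, "747") should then return one row per distinct receptor configuration in that clonotype, with the number of cells in n_cells. Because ir_dist was run with nucleotide identity (the 'ir_dist_nt_identity' key in uns), the nucleotide junction field may be worth adding to fields as well; d_call is expected to be empty for the VJ (TRA) chain.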
