Defines clonotype clusters using TCRdist distance matrix

Hello, apologize in advance, I’m new in this field so my question may seems super unprofessional.

Another apology for the long post. Here’s the TLDR: I want to use the TCRdist distance matrix to define clonotype clusters, but I’m not sure if I’m doing it right.

The first question is how to incorporate TCRdist matrix into Scirpy framework?

I knew Scirpy now has a TCRdistDistanceCalculator, however, it seems like all the DistanceCalculators in Scirpy (including this one) only compute pairwise distance between CDR3 sequences.

While the CDR3s are indeed the most important determinants of TCR specificity, I would like to study the impact of other CDR sequences as well.

If I understand correctly, the only way that Scirpy will consider CDR1 and CDR2 is through setting the same_v_gene parameter to True in tl.define_clonotypes() or tl.define_clonotype_clusters(), but instead of strictly enforcing clonotypes to have the same V-genes, I would like to define clonotypes based on sequence similarities in those regions alongside CDR3.

Therefore, I would like to use TCRdist3 distance matrix as input to define clonotype clusters in Scirpy. That’s quite a challenge for me because I’m newbie to almost everything and am especially bad at coding.

Thanksfully, Theis lab offers a step-by-step tutorial on how to incorporate TCRdist distance matrix into Scirpy framework. In that tutorial, they write a function to replace pp.ir_dist() in the standard Scirpy workflow, returning pairwise distance matrices for all VJ/VDJ sequences in adata.uns.

from scipy.sparse import csr_matrix


def add_dists(adata, df_dist_alpha, df_dist_beta, name, cutoff):

    adata.uns[f"ir_dist_aa_{name}"] = {
        "params": {"metric": f"{name}", "sequence": "aa", "cutoff": cutoff}
    }

    for chain, dists in [("VJ", df_dist_alpha), ("VDJ", df_dist_beta)]:
        if dists is None:
            continue
        dist_values = dists.values + 1
        dist_values[dist_values > (cutoff + 1)] = 0
        dist_values = csr_matrix(dist_values)
        adata.uns[f"ir_dist_aa_{name}"][chain] = {
            "seqs": dists.index.tolist(),
            "distances": dist_values,
        }

add_dists(adata_tcrdist, df_tcrdist_alpha, df_tcrdist_beta, "tcrdist", 60)

I appreciate their work from the bottom of my heart, however, perhaps because cells with the same cdr3_a_aa or cdr3_b_aa don’t necessarily have the same V or J genes, I found duplicated CDR3-alpha or -beta sequences in the output distance matrices.

I thought all the VJ/VDJ sequences in the distance matrices generated from pp.ir_dist() should be unique?

As Dr. Sturm suggested in this post, a possible workaround for this is to map the cell indices to each row in the distance matrix, likes the distance_result returned by tl.define_clonotypes() or tl.define_clonotype_clusters().

I also want to summarize cells with identical TCR configuration into a single node. To this end, I…

  1. Summarize the TCR configuration for each cell (by concatenating VJ_1_v_call, VJ_1_j_call, VJ_1_junction_aa, VDJ_1_v_call, VDJ_1_j_call and VDJ_1_junction_aa into one string for each cell)
  2. Create a dataframe that map each cell index into its corresponding TCR configuration (the cell index is the index of dataframe, and the first column is the TCR configuration)
  3. Write a function to map the cell indices to each row in the pairwise distance matrix:
def tcrdist_clone_uns(adata, df_dist, name, cutoff, cell_tcr_map):
    adata.uns[f"cc_aa_{name}"] = {}

    adata.uns[f"cc_aa_{name}"]["cell_indices"] = {}
    for row_idx, tcr_conf in enumerate(df_dist):
        adata.uns[f"cc_aa_{name}"]["cell_indices"][str(row_idx)] = np.array(cell_tcr_map[cell_tcr_map["TCR_conf"]==tcr_conf].index)

    for dists in [df_dist]:
        if dists is None:
            continue
        dist_values = dists.values + 1
        dist_values[dist_values > (cutoff + 1)] = 0
        dist_values = csr_matrix(dist_values)
        adata.uns[f"cc_aa_{name}"]["distances"] = dist_values.astype('uint8')

But I’m not sure if I’m doing it right? Please correct me if there’s any mistake!

A final question is how to finish defining clonotype clusters: how to assign clonotype id and clonotype size to each cell?

I copied the source code from Scirpy directly, but again, I’m not sure if this is correct:

from scirpy.util.graph import igraph_from_sparse_matrix
import igraph
import itertools

clonotype_dist = mdata_tcrdist["airr"].uns["cc_aa_tcrdist"]['distances']
clonotype_dist.eliminate_zeros()

## https://github.com/scverse/scirpy/blob/v0.16.1/src/scirpy/ir_dist/_clonotype_neighbors.py#L109
## See also what define_clonotype_clusters() will return: 
## ↑ (https://scirpy.scverse.org/en/latest/generated/scirpy.tl.define_clonotype_clusters.html)
cell_indices = mdata_tcrdist["airr"].uns["cc_aa_tcrdist"]['cell_indices'] 

g = igraph_from_sparse_matrix(clonotype_dist, matrix_type="distance")
part = g.clusters(mode="weak") ## https://igraph.org/python/doc/api/igraph.Graph.html#clusters


## https://github.com/scverse/scirpy/blob/main/src/scirpy/tl/_clonotypes.py
clonotype_cluster_series = pd.Series(data=None, index=mdata_tcrdist.obs_names, dtype=str)
clonotype_cluster_size_series = pd.Series(data=None, index=mdata_tcrdist.obs_names, dtype=int)

## clonotype cluster = graph partition
idx, values = zip(
    *itertools.chain.from_iterable(
        zip(cell_indices[str(ct_id)], itertools.repeat(str(clonotype_cluster)))
        for ct_id, clonotype_cluster in enumerate(part.membership)
    )
)
clonotype_cluster_series = pd.Series(values, index=idx).reindex(mdata_tcrdist.obs_names)
clonotype_cluster_size_series = clonotype_cluster_series.groupby(clonotype_cluster_series).transform("count")
mdata_tcrdist["airr"].obs["cc_aa_tcrdist"] = clonotype_cluster_series
mdata_tcrdist["airr"].obs["cc_aa_tcrdist_size"] = clonotype_cluster_size_series

If my questions are still confusing, please check the Jupyter notebook and test data I prepared for this post.

Thank you very much!

Hi @lywang,

thanks for your interest, I appreciate the amount of effort you put into researching this.

I knew Scirpy now has a TCRdistDistanceCalculator, however, it seems like all the DistanceCalculators in Scirpy (including this one) only compute pairwise distance between CDR3 sequences.

correct. In principle, it wouldn’t be very hard to include CDR1/CDR2 as well, there just hasn’t been a lot of demand yet. Feel free to open a feature request, and we can consider it in the future. Either way, it won’t be a quick solution, so I’m proposing a workaround below.

I thought all the VJ/VDJ sequences in the distance matrices generated from pp.ir_dist() should be unique?

When scirpy generates such a matrix, the sequences are unique. This is done for computational efficiency, but I don’t think there would matter for the downstream code if they were not unique.

A final question is how to finish defining clonotype clusters: how to assign clonotype id and clonotype size to each cell?

Defining clonotypes is hard, especially when you allow for different settings of dual_ir and receptor_arm. Instead, I propose to stick with scirpy as much as possible, and work around the CDR1/2 not being supported through a custom distance calculator class. Example code:

import scirpy as ir
import awkward as ak
import numpy as np
from typing import Sequence
from scipy.sparse import csr_matrix
from functools import reduce

# 10x example dataset obtained from here: https://www.10xgenomics.com/datasets/integrated-gex-totalseqc-and-tcr-analysis-of-connect-generated-library-from-5k-cmv-t-cells-2-standard
adata = ir.io.read_10x_vdj("/home/sturm/Downloads/5k_human_antiCMV_T_TBNK_connect_Multiplex_vdj_t_all_contig_annotations.json")
ir.pp.index_chains(adata)
ir.tl.chain_qc(adata)

# The sequences for CDR1/2/3 are stored in the .obsm["airr"] awkward array in the AIRR fields `cdr1_aa`, `cdr2_aa` and `cdr3_aa/junction_aa`, e.g.
adata.obsm["airr"]["cdr1_aa"]

# Let's merge CDR1/2/3 into a single string that is separated by `_`, i.e. f"{CDR1_aa}_{CDR2_aa}_{CDR3_aa}"
# We can do that very efficiently using awkward array string operations.
cdr_merged = ak.str.join_element_wise(
    adata.obsm["airr"]["cdr1_aa"],
    adata.obsm["airr"]["cdr2_aa"],
    adata.obsm["airr"]["junction_aa"],
    np.array("_"),
)

# Unfortunately, `ir.pp.ir_dist` currently does not allow to manually specify from which field the sequence is taken.
# We therefore need to store our "merged" CDR sequences in the `junction_aa` field.
# We keep a copy of the original in `junction_aa_orig`
adata.obsm["airr"]["junction_aa_orig"] = adata.obsm["airr"]["junction_aa"]
adata.obsm["airr"]["junction_aa"] = cdr_merged

# Let's define a scirpy `DistanceCalculator` class that splits the unique "merged" CDR3 sequences,
# calculates the distances on each of them separately and finally adds up the distances
class TCRDistAllCDRDistanceCalculator(ir.ir_dist.metrics.DistanceCalculator):
    def calc_dist_mat(
        self, seqs: Sequence[str], seqs2: Sequence[str] | None = None
    ) -> csr_matrix:
        assert seqs2 is None, "Workaround only implemented for symmetric case"
        cdr1, cdr2, cdr3 = zip(*[x.split("_") for x in seqs])

        tcrdist = ir.ir_dist.metrics.TCRdistDistanceCalculator(self.cutoff)
        res = [
            tcrdist.calc_dist_mat(cdr1),
            tcrdist.calc_dist_mat(cdr2),
            tcrdist.calc_dist_mat(cdr3) 
        ]

        # mask that indicates which values are non-zero in ALL result matrices. 
        # If one entry is 0 in any of the matrices it indicates the distance is > the cutoff.  
        mask_is_not_zero = reduce(lambda x, y: x.multiply(y), res).astype(bool)

        # Subtract 1, which as added to work better with sparse matrices.
        resm1 = []
        for r in res:
            r = r.copy()
            r.data -= 1
            resm1.append(r)

        # sum up result matrics. Could also add them up and assign different weights for CDR1/2 and CDR3 as 
        # the original tcrdist does afaik        
        # Multiply with the boolean indicator mask to avoid adding -1 in cases where there was already a zero
        res_mat = sum([x.multiply(mask_is_not_zero) for x in resm1])
        
        # Add the +1 again
        res_mat.data += 1

        return res_mat

# Now run ir_dist, define_clonotype_clusters as usual, referring to the "custom" metric. 
ir.pp.ir_dist(adata, metric=TCRDistAllCDRDistanceCalculator(cutoff=20), sequence="aa")
ir.tl.define_clonotype_clusters(adata, sequence='aa', metric="custom")
ir.tl.clonotype_network(adata, clonotype_key="cc_aa_custom", min_cells=2, base_size=2)
ir.pl.clonotype_network(adata)

Hope that helps.

Hello @grst, thank you very much for your reply!

The example code really helps me a lot! But I got some questions about it.

  1. It seems like the pairwise distance matrices returned by pp.ir_dist() were not “offset by 1”? I thought the real distance == 0 will be represented as 1 in the distance matrices, but I got 0 for all the values in the main diagonal in the matrices.

  2. When I removed the code that first subtract 1 and then add 1 in class TCRDistAllCDRDistanceCalculator, I got 3 for all the values in the main diagonal in the distance matrices.
    Should I scaled the values in the CSR distance matrices of CDR1/2/3 so that the real summed distance == 0 will be represented as 1 (and maybe apply different weight to CDR3 in the same time)? For example:

res = [
tcrdist.calc_dist_mat(cdr1) * 1/5,
tcrdist.calc_dist_mat(cdr2) * 1/5,
tcrdist.calc_dist_mat(cdr3) * 3/5
]

If I didn’t explain myself well, please take a look of the Jupyter notebook in which I tried to run the example code using the 10x dataset. (I added a lot of print() inside the example code because I have no idea what the intermediate results should look like… sorry if that’s annoying :sweat_smile:).

Again, thank you for your patience reply and help!

I think the problem in my code is here:

# Add the +1 again
        res_mat.data += 1

.data doesn’t contain an entry for the ‘actual’ zeros, only for distances > 0.
So instead, we need to add 1 for each entry that is non-zero all of the original distance matrices. We have that value already in mask_is_not_zero.

So maybe something like that could work

res_mat += mask_is_not_zero.astype(int)