Hello, apologize in advance, I’m new in this field so my question may seems super unprofessional.

Another apology for the long post. Here’s the TLDR: **I want to use the TCRdist distance matrix to define clonotype clusters, but I’m not sure if I’m doing it right.**

The first question is **how to incorporate TCRdist matrix into Scirpy framework**?

I knew Scirpy now has a TCRdistDistanceCalculator, however, it seems like all the DistanceCalculators in Scirpy (including this one) only compute pairwise distance between **CDR3** sequences.

While the CDR3s are indeed the most important determinants of TCR specificity, I would like to study the impact of **other CDR sequences** as well.

If I understand correctly, the only way that Scirpy will consider CDR1 and CDR2 is through setting the *same_v_gene* parameter to True in `tl.define_clonotypes()`

or `tl.define_clonotype_clusters()`

, but instead of strictly enforcing clonotypes to have the same V-genes, I would like to define clonotypes based on sequence similarities in those regions alongside CDR3.

Therefore, I would like to use TCRdist3 distance matrix as input to define clonotype clusters in Scirpy. That’s quite a challenge for me because I’m newbie to almost everything and am especially bad at coding.

Thanksfully, Theis lab offers a step-by-step tutorial on how to incorporate TCRdist distance matrix into Scirpy framework. In that tutorial, they write a function to replace `pp.ir_dist()`

in the standard Scirpy workflow, returning pairwise distance matrices for all VJ/VDJ sequences in adata.uns.

```
from scipy.sparse import csr_matrix
def add_dists(adata, df_dist_alpha, df_dist_beta, name, cutoff):
adata.uns[f"ir_dist_aa_{name}"] = {
"params": {"metric": f"{name}", "sequence": "aa", "cutoff": cutoff}
}
for chain, dists in [("VJ", df_dist_alpha), ("VDJ", df_dist_beta)]:
if dists is None:
continue
dist_values = dists.values + 1
dist_values[dist_values > (cutoff + 1)] = 0
dist_values = csr_matrix(dist_values)
adata.uns[f"ir_dist_aa_{name}"][chain] = {
"seqs": dists.index.tolist(),
"distances": dist_values,
}
add_dists(adata_tcrdist, df_tcrdist_alpha, df_tcrdist_beta, "tcrdist", 60)
```

I appreciate their work from the bottom of my heart, however, perhaps because cells with the same cdr3_a_aa or cdr3_b_aa don’t necessarily have the same V or J genes, I found duplicated CDR3-alpha or -beta sequences in the output distance matrices.

I thought all the VJ/VDJ sequences in the distance matrices generated from `pp.ir_dist()`

should be unique?

As Dr. Sturm suggested in this post, a possible workaround for this is to map the cell indices to each row in the distance matrix, likes the **distance_result** returned by `tl.define_clonotypes()`

or `tl.define_clonotype_clusters()`

.

I also want to summarize cells with identical TCR configuration into a single node. To this end, I…

- Summarize the TCR configuration for each cell (by concatenating VJ_1_v_call, VJ_1_j_call, VJ_1_junction_aa, VDJ_1_v_call, VDJ_1_j_call and VDJ_1_junction_aa into one string for each cell)
- Create a dataframe that map each cell index into its corresponding TCR configuration (the cell index is the index of dataframe, and the first column is the TCR configuration)
- Write a function to map the cell indices to each row in the pairwise distance matrix:

```
def tcrdist_clone_uns(adata, df_dist, name, cutoff, cell_tcr_map):
adata.uns[f"cc_aa_{name}"] = {}
adata.uns[f"cc_aa_{name}"]["cell_indices"] = {}
for row_idx, tcr_conf in enumerate(df_dist):
adata.uns[f"cc_aa_{name}"]["cell_indices"][str(row_idx)] = np.array(cell_tcr_map[cell_tcr_map["TCR_conf"]==tcr_conf].index)
for dists in [df_dist]:
if dists is None:
continue
dist_values = dists.values + 1
dist_values[dist_values > (cutoff + 1)] = 0
dist_values = csr_matrix(dist_values)
adata.uns[f"cc_aa_{name}"]["distances"] = dist_values.astype('uint8')
```

But I’m not sure if I’m doing it right? Please correct me if there’s any mistake!

A final question is how to finish defining clonotype clusters: **how to assign clonotype id and clonotype size to each cell**?

I copied the source code from Scirpy directly, but again, I’m not sure if this is correct:

```
from scirpy.util.graph import igraph_from_sparse_matrix
import igraph
import itertools
clonotype_dist = mdata_tcrdist["airr"].uns["cc_aa_tcrdist"]['distances']
clonotype_dist.eliminate_zeros()
## https://github.com/scverse/scirpy/blob/v0.16.1/src/scirpy/ir_dist/_clonotype_neighbors.py#L109
## See also what define_clonotype_clusters() will return:
## ↑ (https://scirpy.scverse.org/en/latest/generated/scirpy.tl.define_clonotype_clusters.html)
cell_indices = mdata_tcrdist["airr"].uns["cc_aa_tcrdist"]['cell_indices']
g = igraph_from_sparse_matrix(clonotype_dist, matrix_type="distance")
part = g.clusters(mode="weak") ## https://igraph.org/python/doc/api/igraph.Graph.html#clusters
## https://github.com/scverse/scirpy/blob/main/src/scirpy/tl/_clonotypes.py
clonotype_cluster_series = pd.Series(data=None, index=mdata_tcrdist.obs_names, dtype=str)
clonotype_cluster_size_series = pd.Series(data=None, index=mdata_tcrdist.obs_names, dtype=int)
## clonotype cluster = graph partition
idx, values = zip(
*itertools.chain.from_iterable(
zip(cell_indices[str(ct_id)], itertools.repeat(str(clonotype_cluster)))
for ct_id, clonotype_cluster in enumerate(part.membership)
)
)
clonotype_cluster_series = pd.Series(values, index=idx).reindex(mdata_tcrdist.obs_names)
clonotype_cluster_size_series = clonotype_cluster_series.groupby(clonotype_cluster_series).transform("count")
mdata_tcrdist["airr"].obs["cc_aa_tcrdist"] = clonotype_cluster_series
mdata_tcrdist["airr"].obs["cc_aa_tcrdist_size"] = clonotype_cluster_size_series
```

If my questions are still confusing, please check the Jupyter notebook and test data I prepared for this post.

Thank you very much!