Hello, apologize in advance, I’m new in this field so my question may seems super unprofessional.
Another apology for the long post. Here’s the TLDR: I want to use the TCRdist distance matrix to define clonotype clusters, but I’m not sure if I’m doing it right.
The first question is how to incorporate TCRdist matrix into Scirpy framework?
I knew Scirpy now has a TCRdistDistanceCalculator, however, it seems like all the DistanceCalculators in Scirpy (including this one) only compute pairwise distance between CDR3 sequences.
While the CDR3s are indeed the most important determinants of TCR specificity, I would like to study the impact of other CDR sequences as well.
If I understand correctly, the only way that Scirpy will consider CDR1 and CDR2 is through setting the same_v_gene parameter to True in tl.define_clonotypes()
or tl.define_clonotype_clusters()
, but instead of strictly enforcing clonotypes to have the same V-genes, I would like to define clonotypes based on sequence similarities in those regions alongside CDR3.
Therefore, I would like to use TCRdist3 distance matrix as input to define clonotype clusters in Scirpy. That’s quite a challenge for me because I’m newbie to almost everything and am especially bad at coding.
Thanksfully, Theis lab offers a step-by-step tutorial on how to incorporate TCRdist distance matrix into Scirpy framework. In that tutorial, they write a function to replace pp.ir_dist()
in the standard Scirpy workflow, returning pairwise distance matrices for all VJ/VDJ sequences in adata.uns.
from scipy.sparse import csr_matrix
def add_dists(adata, df_dist_alpha, df_dist_beta, name, cutoff):
adata.uns[f"ir_dist_aa_{name}"] = {
"params": {"metric": f"{name}", "sequence": "aa", "cutoff": cutoff}
}
for chain, dists in [("VJ", df_dist_alpha), ("VDJ", df_dist_beta)]:
if dists is None:
continue
dist_values = dists.values + 1
dist_values[dist_values > (cutoff + 1)] = 0
dist_values = csr_matrix(dist_values)
adata.uns[f"ir_dist_aa_{name}"][chain] = {
"seqs": dists.index.tolist(),
"distances": dist_values,
}
add_dists(adata_tcrdist, df_tcrdist_alpha, df_tcrdist_beta, "tcrdist", 60)
I appreciate their work from the bottom of my heart, however, perhaps because cells with the same cdr3_a_aa or cdr3_b_aa don’t necessarily have the same V or J genes, I found duplicated CDR3-alpha or -beta sequences in the output distance matrices.
I thought all the VJ/VDJ sequences in the distance matrices generated from pp.ir_dist()
should be unique?
As Dr. Sturm suggested in this post, a possible workaround for this is to map the cell indices to each row in the distance matrix, likes the distance_result returned by tl.define_clonotypes()
or tl.define_clonotype_clusters()
.
I also want to summarize cells with identical TCR configuration into a single node. To this end, I…
- Summarize the TCR configuration for each cell (by concatenating VJ_1_v_call, VJ_1_j_call, VJ_1_junction_aa, VDJ_1_v_call, VDJ_1_j_call and VDJ_1_junction_aa into one string for each cell)
- Create a dataframe that map each cell index into its corresponding TCR configuration (the cell index is the index of dataframe, and the first column is the TCR configuration)
- Write a function to map the cell indices to each row in the pairwise distance matrix:
def tcrdist_clone_uns(adata, df_dist, name, cutoff, cell_tcr_map):
adata.uns[f"cc_aa_{name}"] = {}
adata.uns[f"cc_aa_{name}"]["cell_indices"] = {}
for row_idx, tcr_conf in enumerate(df_dist):
adata.uns[f"cc_aa_{name}"]["cell_indices"][str(row_idx)] = np.array(cell_tcr_map[cell_tcr_map["TCR_conf"]==tcr_conf].index)
for dists in [df_dist]:
if dists is None:
continue
dist_values = dists.values + 1
dist_values[dist_values > (cutoff + 1)] = 0
dist_values = csr_matrix(dist_values)
adata.uns[f"cc_aa_{name}"]["distances"] = dist_values.astype('uint8')
But I’m not sure if I’m doing it right? Please correct me if there’s any mistake!
A final question is how to finish defining clonotype clusters: how to assign clonotype id and clonotype size to each cell?
I copied the source code from Scirpy directly, but again, I’m not sure if this is correct:
from scirpy.util.graph import igraph_from_sparse_matrix
import igraph
import itertools
clonotype_dist = mdata_tcrdist["airr"].uns["cc_aa_tcrdist"]['distances']
clonotype_dist.eliminate_zeros()
## https://github.com/scverse/scirpy/blob/v0.16.1/src/scirpy/ir_dist/_clonotype_neighbors.py#L109
## See also what define_clonotype_clusters() will return:
## ↑ (https://scirpy.scverse.org/en/latest/generated/scirpy.tl.define_clonotype_clusters.html)
cell_indices = mdata_tcrdist["airr"].uns["cc_aa_tcrdist"]['cell_indices']
g = igraph_from_sparse_matrix(clonotype_dist, matrix_type="distance")
part = g.clusters(mode="weak") ## https://igraph.org/python/doc/api/igraph.Graph.html#clusters
## https://github.com/scverse/scirpy/blob/main/src/scirpy/tl/_clonotypes.py
clonotype_cluster_series = pd.Series(data=None, index=mdata_tcrdist.obs_names, dtype=str)
clonotype_cluster_size_series = pd.Series(data=None, index=mdata_tcrdist.obs_names, dtype=int)
## clonotype cluster = graph partition
idx, values = zip(
*itertools.chain.from_iterable(
zip(cell_indices[str(ct_id)], itertools.repeat(str(clonotype_cluster)))
for ct_id, clonotype_cluster in enumerate(part.membership)
)
)
clonotype_cluster_series = pd.Series(values, index=idx).reindex(mdata_tcrdist.obs_names)
clonotype_cluster_size_series = clonotype_cluster_series.groupby(clonotype_cluster_series).transform("count")
mdata_tcrdist["airr"].obs["cc_aa_tcrdist"] = clonotype_cluster_series
mdata_tcrdist["airr"].obs["cc_aa_tcrdist_size"] = clonotype_cluster_size_series
If my questions are still confusing, please check the Jupyter notebook and test data I prepared for this post.
Thank you very much!