TCR Metrics for Pairwise Distance Matrix

First, I want to make the distinction between “clonotypes” and “clonotype clusters” clear:

  • A clonotype refers to T cells with the same origin and exactly the same CDR3 nucleotide sequence.
  • A clonotype cluster is a group of similar clonotypes that likely recognize the same epitope, as defined by some distance metric.

This implies that for defining clonotypes, the only relevant metric is the identity metric.
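To make that concrete, here is a minimal sketch of what grouping by the identity metric amounts to. It is not scirpy's implementation: the function name `group_by_identity`, the cell barcodes and the sequences are made up for illustration, and it only looks at a single chain per cell.

```python
from collections import defaultdict


def group_by_identity(cdr3_nt_by_cell):
    """Group cells whose CDR3 nucleotide sequences are exactly identical.

    Hypothetical helper for illustration only; one chain per cell is assumed.
    """
    groups = defaultdict(list)
    for cell, seq in cdr3_nt_by_cell.items():
        groups[seq].append(cell)
    return dict(groups)


# Toy data (invented): cell_3 differs from the others by one nucleotide.
cells = {
    "cell_1": "TGTGCCAGCAGC",
    "cell_2": "TGTGCCAGCAGC",
    "cell_3": "TGTGCCAGCAGT",
}
clonotypes = group_by_identity(cells)
print(clonotypes)
```

With the identity metric a single mismatch is enough to put a cell into a different clonotype, so no distance threshold is involved at this step.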

As for defining clonotype clusters, clonotype networks and database queries, I don’t think that any of the metrics available in scirpy behave fundamentally differently. Ultimately, we don’t know which metric is best (in the sense that it best captures which receptors recognize the same antigen), because there is insufficient gold-standard data for benchmarking. It is very likely, though, that the “alignment” metric captures this better than the “levenshtein” distance, because it takes the properties of the individual amino acids into account (at a higher computational cost). The “hamming” distance is more useful for B cells than for T cells.
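To illustrate how two of these metrics differ in principle, here is a small pure-Python sketch of Hamming and Levenshtein distances on made-up CDR3 amino acid sequences. These are textbook definitions, not the code scirpy runs, and returning `None` for sequences of unequal length in the Hamming case is my own convention for this example.

```python
def hamming(a, b):
    """Number of mismatching positions; only defined for equal lengths.

    Returning None for unequal lengths is a convention chosen for this sketch.
    """
    if len(a) != len(b):
        return None
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


# Toy sequences (invented for illustration).
ref = "CASSLGQAYEQYF"
substituted = "CASSLGQGYEQYF"  # one substitution, same length
shortened = "CASSLGAYEQYF"     # one residue deleted

print(hamming(ref, substituted), levenshtein(ref, substituted))
print(hamming(ref, shortened), levenshtein(ref, shortened))
```

Both functions treat every amino acid exchange as equally costly. An alignment-based metric scores exchanges with a substitution matrix instead, which is the property I am referring to above.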

The overall network structure is affected more by the distance threshold you set (a higher cutoff will lead to larger network components) and by whether you set receptor_arms and dual_ir to “all” or “any”. This is also demonstrated to some extent in this thread: “Is it necessary to remove orphan-VJ/VDJ cells in practice?”
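The effect of the cutoff can be sketched with a toy example. This is an illustration under simplifying assumptions, not how scirpy builds its networks: the sequences are invented, two sequences are connected when their Levenshtein distance is less than or equal to the cutoff, and receptor_arms and dual_ir are not modeled because each "cell" has a single chain. The helper name `component_sizes` is hypothetical.

```python
from itertools import combinations


def levenshtein(a, b):
    """Edit distance with unit costs for substitution, insertion and deletion."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def component_sizes(seqs, cutoff):
    """Sizes of connected components when edges link pairs with distance <= cutoff."""
    parent = list(range(len(seqs)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if levenshtein(seqs[i], seqs[j]) <= cutoff:
            parent[find(i)] = find(j)

    sizes = {}
    for i in range(len(seqs)):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return sorted(sizes.values(), reverse=True)


# Toy CDR3 amino acid sequences (invented for illustration).
seqs = [
    "CASSLGQAYEQYF",
    "CASSLGQGYEQYF",
    "CASSLGGGYEQYF",
    "CASSPGGGYEAYF",
    "CASRPGTEAFF",
]

for cutoff in (0, 1, 2):
    print(cutoff, component_sizes(seqs, cutoff))
```

As the cutoff goes up, singletons merge into one growing component, while the clearly unrelated sequence stays on its own.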

Hope that helps!