TCR Metrics for Pairwise Distance Matrix

Hello scverse Community,
I’m currently working through one of the tutorials and find myself needing a bit more clarity regarding the various metrics used to calculate pairwise distance matrices for TCR analysis, particularly in relation to different types of analyses. My queries are as follows:

  1. Clonotype Definition: Are there specific metrics preferred for defining clonotypes? How do these metrics influence the definition process?
  2. Clonotype Networks: When constructing clonotype networks, which metrics are most effective? How do they impact the overall network structure?
  3. Database Queries: In the context of querying databases for clonotype information, what are the recommended metrics for ensuring efficient and accurate data retrieval?
    I’m aiming to grasp the nuances between these metrics better and understand how they uniquely apply to each analysis type (clonotype definition, clonotype networks, and database queries). Any insights, experiences, or resources you could share would be greatly appreciated.
    Thank you in advance!

First, I want to make the distinction between “clonotypes” and “clonotype clusters” clear:

  • A clonotype refers to T cells with the same origin and exactly the same CDR3 nucleotide sequence
  • A clonotype cluster is a group of similar clonotypes that likely recognize the same epitope, as defined by some distance metric.

This implies that for defining clonotypes, the only relevant metric is the identity metric.

As for defining clonotype clusters, clonotype networks and database queries, I don’t think that any of the metrics available in scirpy behaves fundamentally different. Ultimately, we don’t know which metric is best (in that it captures best which receptors recognize the same antigen), because there is insufficient gold standard data for benchmarking. It is very likely though that the “alignment” metric captures this better than the “levenshtein” distance because it takes the properties of the individual amino acids into account (at a higher computational cost). The “hamming” distance is more useful for B cells than for T cells.

The overall network structure is more affected by the distance threshold you set (a higher cutoff will lead to larger network components) and whether you set receptor_arms and dual_ir to all or any. This is also demonstrated to some extent in this thread: Is it necessary to remove orphan-VJ/VDJ cells in practice?

Hope that helps!

Dear Gregor,

Thank you for your assistance with my query. Your input has been invaluable. I’m currently working with a custom TCR dataset that includes TCRs experimentally linked to specific antigens. My focus is on identifying exact matches for the CDR3b sequence. However, when I run my script (referenced below), it outputs a mix of exact and non-exact matches. I initially thought the “identity” metric would filter for only exact matches. I’m unclear about the similarity criteria used to include these additional sequences. Any clarification or guidance you could offer would be greatly appreciated.

Looking forward to your insights.

ir.tl.ir_query(
    mdata,
    adata_tcr,
    metric="identity",
    sequence="aa",
    receptor_arms="VDJ",
    dual_ir="primary_only",
)
ir.tl.ir_query_annotate_df(
    mdata,
    adata_tcr,
    metric="identity",
    sequence="aa",
    include_ref_cols=["antigen",],
).tail()
ir.tl.ir_query_annotate(
    mdata,
    adata_tcr,
    metric="identity",
    sequence="aa",
    include_ref_cols=["antigen"],
    strategy="most-frequent",
)

This code should return only exact matches. Can you provide a full reproducible example?

I tried with the example datasets from the tutorial and only get exact matches:

import scirpy as ir
import numpy.testing as npt

mdata = ir.datasets.wu2020_3k()
vdjdb = ir.datasets.vdjdb()

ir.pp.ir_dist(mdata, vdjdb, metric="identity", sequence="aa")
ir.tl.ir_query(
    mdata,
    vdjdb,
    metric="identity",
    sequence="aa",
    receptor_arms="VDJ",
    dual_ir="primary_only",
)

# airr_context adds the VDJ_1 junction_aa tempoarily as column in vdjdb.obs
with ir.get.airr_context(vdjdb, "junction_aa", "VDJ_1"):
    ir.tl.ir_query_annotate(
        mdata,
        vdjdb,
        metric="identity",
        sequence="aa",
        include_ref_cols=[
            "VDJ_1_junction_aa",
        ],
    )

# rename the annotated column in `mdata["airr"].obs
mdata["airr"].obs["VDJDB_VDJ_1_junction_aa"] = mdata["airr"].obs["VDJ_1_junction_aa"]
del mdata["airr"].obs["VDJ_1_junction_aa"]

# Add the VDJ_1_junction_aa column to mdata["airr"].obs and compare it with the annotations from VDJDB. For most of the chains no exact match will be found so let's only check the values that are not null. 
with ir.get.airr_context(mdata["airr"], "junction_aa", "VDJ_1"):
    display(mdata["airr"].obs.dropna(how="any"))

    import numpy.testing as npt

    mask = ~mdata["airr"].obs["VDJDB_VDJ_1_junction_aa"].isnull()
    npt.assert_equal(
        mdata["airr"].obs.loc[mask, "VDJDB_VDJ_1_junction_aa"].values.astype(str),
        mdata["airr"].obs.loc[mask, "VDJ_1_junction_aa"].values.astype(str),
    )