Interpreting scirpy's ir_dist_aa_alignment distances

Hi,

I’m using scirpy for computing the distances between CDR3 AA sequences.
For that I’ve called pp.ir_dist specifying metric = 'alignment', sequence = 'aa', and cutoff = 15.
Trying to understand the behavior, I looked at an example entry in the CDR33 AA distance matrix that was created in my mdata object:

mdata["airr"].uns["ir_dist_aa_alignment"]["VDJ"]["distances"][6,8]
12.0

The corresponding CDR3 AA sequences are:

mdata["airr"].uns["ir_dist_aa_alignment"]["VDJ"]["seqs"][6]
'AARGTGGITAQLV'
mdata["airr"].uns["ir_dist_aa_alignment"]["VDJ"]["seqs"][8]
'AARTGGITAQLV'

If I run R’s Biostrings::pairwiseAlignment, using BLOSUM62 as the substitution matrix, with gapOpening = -11 and gapExtension = -11 (consistent with scirpy’s ir_dist_aa_alignment parameters) I get:

Biostrings::pairwiseAlignment('AARGTGGITAQLV','AARGTGGITAQLV',substitutionMatrix="BLOSUM62",gapOpening=-11,gapExtension=-11)
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: AARGTGGITAQLV
subject: AARGTGGITAQLV
score: 62

Biostrings::pairwiseAlignment('AARTGGITAQLV','AARTGGITAQLV',substitutionMatrix="BLOSUM62",gapOpening=-11,gapExtension=-11)
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: AARTGGITAQLV
subject: AARTGGITAQLV
score: 56

Biostrings::pairwiseAlignment('AARGTGGITAQLV','AARTGGITAQLV',substitutionMatrix="BLOSUM62",gapOpening=-11,gapExtension=-11)
Global PairwiseAlignmentsSingleSubject (1 of 1)
pattern: AARGTGGITAQLV
subject: AAR-TGGITAQLV
score: 34

Following scirpy’s AlignmentDistanceCalculator function, the distance should be:
min(62,56) - 34 = 22, and with the offset of 1, the entry in mdata["airr"].uns["ir_dist_aa_alignment"]["VDJ"]["distances"][6,8] should therefore be 23 (or actually 0 since 23 > cutoff = 15 I called the pp.ir_dist function with.

Can anyone help pointing me in the right direction?

1 Like

It seems the parasail implementation of the NW algorithm is inconsistent with the Biostrings package:

>>> import parasail
>>> parasail.nw_trace_scan_16("AARGTGGITAQLV", "AARGTGGITAQLV", 11, 11, parasail.blosum62).score
62
>>> parasail.nw_trace_scan_16("AARTGGITAQLV", "AARTGGITAQLV", 11, 11, parasail.blosum62).score
56
>>> res = parasail.nw_trace_scan_16("AARGTGGITAQLV", "AARTGGITAQLV", 11, 11, parasail.blosum62)
>>> res.score 
45
>>> 56 - 45 + 1
12

Even though they arrive at the same alignment

>>> res.cigar.decode
'3=1I9='

which means 3 matches followed by 1 insertion followed by 9 matches which is exactly the same as in your example.

If you find out why they are different, please let me know!

1 Like