How to edit and filter awkward array created by scirpy?

I used scirpy to read an AIRR table generated by MiXCR. The values in the v/d/j/c_call columns have “*00” appended to the gene name, and I want to remove that suffix.
scirpy has functions for retrieving values, but I can’t figure out how to directly edit the awkward array.
For example, mdata.mod['airr'].obsm['airr'][0][0]['c_call'] gives “TRBC1*00”. After assigning a value manually with mdata.mod['airr'].obsm['airr'][0][0]['c_call'] = "TRBC1", the value is unchanged.

Additionally, I want to remove some chains from the awkward array based on matching “junction_aa”, “v_call”, and “j_call”. These are contaminations that I want to remove from the analysis. Is there a way to do that?

Hi @racng,

this is an excellent question!

You can slice the awkward array in .obsm["airr"] and manipulate its values. For instance,
you can retrieve the c_call values of all chains using

>>> mdata["airr"].obsm["airr"]["c_call"]
[['TRBC2*00', 'TRBC2*00'],
 ['TRBC1*00', 'TRAC*00', 'TRBC1*00'],
 ['TRBC2*00', 'TRBC2*00', 'TRAC*00'],
 ['TRBC2*00', 'TRAC*00'],
 ['TRBC2*00', 'TRAC*00']]
type: 3000 * var * ?string

This is still an awkward array. There are ways of manipulating awkward arrays directly, and while they are computationally efficient, they are not always beginner-friendly. Let’s therefore convert it to a Python list of lists that you can easily modify:

import awkward as ak
c_calls = ak.to_list(mdata["airr"].obsm["airr"]["c_call"])

Now you can walk that list and build a new one, manipulating values one-by-one:

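c_calls_new = []
for cell in c_calls:
    tmp_cell = []
    for c_gene in cell:
        # missing calls are None and are kept as they are
        if c_gene is not None:
            c_gene = c_gene.replace("*00", "")
        tmp_cell.append(c_gene)
    c_calls_new.append(tmp_cell)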

You can now re-assign the list to the awkward array:

mdata["airr"].obsm["airr"]["c_call"] = c_calls_new
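Since the same suffix shows up in v_call, d_call, j_call and c_call, it may be convenient to wrap the stripping step in a small helper. The following is only a sketch: `strip_allele_suffix` is a name I made up for illustration, and it assumes that the suffix is always exactly "*00" and that missing calls are None.

```python
def strip_allele_suffix(calls, suffix="*00"):
    """Return a copy of a nested list of gene calls with `suffix` removed.

    `calls` holds one list per cell, each with one entry per chain.
    Missing calls (None) and calls without the suffix are passed through.
    """
    return [
        [
            gene[: -len(suffix)]
            if gene is not None and gene.endswith(suffix)
            else gene
            for gene in cell
        ]
        for cell in calls
    ]


# Toy input standing in for ak.to_list(mdata["airr"].obsm["airr"]["c_call"])
example = [["TRBC2*00", "TRAC*00"], ["TRBC1*00", None], []]
cleaned = strip_allele_suffix(example)
print(cleaned)  # [['TRBC2', 'TRAC'], ['TRBC1', None], []]

# With the real data, the helper would be applied once per field, e.g.:
# for field in ["v_call", "d_call", "j_call", "c_call"]:
#     calls = ak.to_list(mdata["airr"].obsm["airr"][field])
#     mdata["airr"].obsm["airr"][field] = strip_allele_suffix(calls)
```

Only a trailing "*00" is removed, so calls that carry a real allele annotation such as "*01" stay untouched.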

And appreciate that the *00 suffix has been removed:

ir.get.airr(mdata, airr_variable="c_call")
VJ_1_c_call VDJ_1_c_call VJ_2_c_call VDJ_2_c_call

The reason why

mdata.mod['airr'].obsm['airr'][0][0]['c_call'] = "TRBC1"

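doesn’t affect your values is that awkward arrays are essentially immutable: the only in-place change they support is assigning a whole field by name, as done above. Indexing with [0][0] returns a new record object (the awkward equivalent of a dictionary) rather than a reference into the array stored in .obsm, so your edit changes only that temporary object and is in vain.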

Regarding your second question:
My suggestion here would be not to remove those values, but to use the filtering capabilities of scirpy.pp.index_chains. That way, the chains will be ignored by all scirpy functions that use AIRR data.

For instance, you can define a list of custom filters, e.g.

filters = [
   # these are the default filters that you'll need to re-specify here if you want to keep them
   "productive",
   "require_junction_aa",
   # custom filters via callback functions - return True to keep the chain
   lambda x: x['c_call'] != "TRBC2",
   lambda x: ~x['junction_aa'].contains("*")
]

and pass it to index_chains like this:

scirpy.pp.index_chains(mdata, filter=filters)

(Note: from scirpy v0.14 on, these functions need to be numba-compilable, since index_chains will switch to a more efficient numba implementation that is >100x faster.)

Alternatively, if you prefer to remove those chains entirely, you can subset the awkward array directly.
Again, the easiest (but inefficient) way would be to convert the entire array to a list of dictionaries using ak.to_list, filter that list using a Python loop, and reassign mdata["airr"].obsm["airr"] = ak.Array(filtered_list_of_dicts). You can also create boolean masks from the awkward array and use them for subsetting it:

arr = mdata["airr"].obsm["airr"]
# True for every chain to keep, False for every chain to drop
mask = arr["c_call"] != "TRBC2"
# write the filtered array back
mdata["airr"].obsm["airr"] = arr[mask]

Hope that helps! If you have any specific questions regarding awkward arrays, consider asking in their forum. From my experience the authors are super helpful and responsive.