Hi @racng,
this is an excellent question!
You can slice the awkward array in .obsm["airr"] and manipulate its values. For instance, you can retrieve all c_call variables from all chains using
>>> mdata["airr"].obsm["airr"]["c_call"]
[['TRBC2*00', 'TRBC2*00'],
['TRBC1*00', 'TRAC*00', 'TRBC1*00'],
['TRBC2*00', 'TRBC2*00', 'TRAC*00'],
...,
['TRBC2*00', 'TRAC*00'],
['TRBC2*00', 'TRAC*00']]
--------------------------------------------
type: 3000 * var * ?string
This is still an awkward array. Awkward arrays can be manipulated directly, and while that is computationally efficient, it is not always beginner-friendly. Let’s therefore convert it to a Python list of lists that you can easily modify:
import awkward as ak
c_calls = ak.to_list(mdata["airr"].obsm["airr"]["c_call"])
Now you can walk that list and build a new one, manipulating values one-by-one:
c_calls_new = []
for cell in c_calls:
    tmp_cell = []
    for c_gene in cell:
        if c_gene is not None:
            # keep only the gene name, i.e. everything before the "*"
            tmp_cell.append(c_gene.split("*")[0])
        else:
            tmp_cell.append(None)
    c_calls_new.append(tmp_cell)
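If you prefer something more compact, the same loop can be sketched with a small helper and a nested list comprehension. The helper name strip_allele and the toy input are made up for illustration; in practice c_calls would come from ak.to_list as shown above.

```python
def strip_allele(gene):
    """Drop the allele suffix, e.g. 'TRBC2*00' -> 'TRBC2'. None stays None."""
    return None if gene is None else gene.split("*")[0]


# Toy stand-in for ak.to_list(mdata["airr"].obsm["airr"]["c_call"])
c_calls = [["TRBC2*00", "TRBC2*00"], ["TRBC1*00", None, "TRAC*00"], []]

# One inner list per cell, one entry per chain; the nesting is preserved
c_calls_new = [[strip_allele(gene) for gene in cell] for cell in c_calls]
print(c_calls_new)
# [['TRBC2', 'TRBC2'], ['TRBC1', None, 'TRAC'], []]
```

Either way, the result has the same number of cells and chains per cell as the input, which is what the re-assignment step relies on.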
You can now re-assign the list to the awkward array:
mdata["airr"].obsm["airr"]["c_call"] = c_calls_new
And appreciate that the *00 suffix has been removed:
ir.get.airr(mdata, airr_variable="c_call")
|                        | VJ_1_c_call | VDJ_1_c_call | VJ_2_c_call | VDJ_2_c_call |
|------------------------|-------------|--------------|-------------|--------------|
| LN1_GTAGGCCAGCGTAGTG-1 |             | TRBC2        |             | TRBC2        |
| RN2_AGAGCGACAGATTGCT-1 | TRAC        | TRBC1        |             |              |
| LN1_GTCATTTCAATGAAAC-1 | TRAC        | TRBC1        |             |              |
| LN2_GACACGCAGGTAGCTG-2 |             | TRBC2        |             |              |
| LN2_GCACTCTCAGGGATTG-2 | TRAC        | TRBC1        |             |              |
The reason why
mdata.mod['airr'].obsm['airr'][0][0]['c_call'] = "TRBC1"
doesn’t affect your values is that only “Record types” (the awkward equivalent of a dictionary) are mutable. Selecting an index [0] returns a new, immutable view of the array rather than a reference into the stored data, so your edit never reaches the array in mdata.
Regarding your second question:
My suggestion here would be to not actually remove those values, but to use the filtering capabilities of scirpy.pp.index_chains. That way, the chains will be ignored by all scirpy functions that use AIRR data.
For instance, you can define a list of custom filters, e.g.
filters = [
    # these are the default filters that you'll need to re-specify here if you want to keep them
    "productive",
    "require_junction_aa",
    # custom filters via callback functions - return True to keep the chain
    lambda x: x["c_call"] != "TRBC2",
    lambda x: x["junction_aa"] is not None and "*" not in x["junction_aa"],
]
and pass it to index_chains like this:
scirpy.pp.index_chains(mdata, filter=filters)
(Note: from scirpy v0.14 on, these functions need to be numba-compilable, since index_chains will switch to a more efficient numba implementation that is >100x faster.)
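To check what such callbacks do before handing them to index_chains, you can try them on plain dictionaries first. This is only a sketch: the function names and the example chains are invented, the dicts merely stand in for the chain records that scirpy passes to the callback, and I am assuming the constant gene lives in the AIRR field c_call as in the first part of this answer.

```python
def not_trbc2(chain):
    """Keep every chain whose constant gene is not TRBC2."""
    return chain["c_call"] != "TRBC2"


def no_stop_codon(chain):
    """Keep chains that have a junction_aa without a stop codon ('*')."""
    junction = chain["junction_aa"]
    return junction is not None and "*" not in junction


# Made-up chains, for illustration only
example_chains = [
    {"c_call": "TRBC2", "junction_aa": "CASSLGTDTQYF"},  # dropped: TRBC2
    {"c_call": "TRBC1", "junction_aa": "CASS*GTDTQYF"},  # dropped: stop codon
    {"c_call": "TRAC", "junction_aa": None},             # dropped: no junction
    {"c_call": "TRAC", "junction_aa": "CAVRDSNYQLIW"},   # kept
]

kept = [c for c in example_chains if not_trbc2(c) and no_stop_codon(c)]
print(kept)
# [{'c_call': 'TRAC', 'junction_aa': 'CAVRDSNYQLIW'}]
```

Named functions like these can go into the filters list in place of the lambdas. Whether a particular function is numba-compilable is a separate question that this sketch does not address.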
Alternatively, if you prefer to remove those chains entirely, you can subset the awkward array directly.
Again, the easiest (but inefficient) way would be to convert the entire array to a list of dictionaries using ak.to_list, filter that list using a Python loop and reassign mdata["airr"].obsm["airr"] = ak.Array(filtered_list_of_dicts). You can also create a boolean mask of the awkward array and use that for subsetting it:
arr = mdata["airr"].obsm["airr"]
# True for every chain that should be kept
mask = arr["c_call"] != "TRBC2"
# assign (=, not ==) the subsetted array back
mdata["airr"].obsm["airr"] = arr[mask]
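For completeness, the list-based route mentioned above could look roughly like this. The helper drop_chains and the toy data are made up for illustration; the real input would be ak.to_list(mdata["airr"].obsm["airr"]), which has one list of chain dictionaries per cell.

```python
def drop_chains(cells, keep):
    """Return a copy of `cells` that contains only chains for which keep(chain) is True.

    Cells are never removed, only chains within them, so the outer list
    keeps its length (one entry per cell).
    """
    return [[chain for chain in cell if keep(chain)] for cell in cells]


# Toy stand-in for ak.to_list(mdata["airr"].obsm["airr"])
cells = [
    [{"c_call": "TRBC2"}, {"c_call": "TRAC"}],
    [{"c_call": "TRBC1"}],
    [],
]

filtered_list_of_dicts = drop_chains(cells, lambda chain: chain["c_call"] != "TRBC2")
print(filtered_list_of_dicts)
# [[{'c_call': 'TRAC'}], [{'c_call': 'TRBC1'}], []]

# Afterwards, as described above:
# mdata["airr"].obsm["airr"] = ak.Array(filtered_list_of_dicts)
```

Keeping empty cells in place matters here, because the array in obsm needs one entry per cell.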
Hope that helps! If you have any specific questions regarding awkward arrays, consider asking in their forum. In my experience, the authors are super helpful and responsive.