How to make the union set of 2 scRNA-seq matrices?

Hello Scanpy,
We have 2 scRNA-seq libraries that share some barcodes. We want to make a union set of these 2 libraries without duplicating the shared barcodes, but we are not experienced with AnnData. It looks like we cannot do it with the simple |, &, - or ^ operators from pandas.
Could you please help us with this question?
Thanks!
Best,
YJ

We tried:

CKP1 = sc.read_10x_mtx(path='D:/ZGY/MST_matrix/KP9CKP11-5_11/CKP/', var_names='gene_symbols', cache=True) 
CKP1.var_names_make_unique()
CKP2 = sc.read_10x_mtx(path='D:/ZGY/MST_matrix/KP10CKP12-5_11/CKP/', var_names='gene_symbols', cache=True)
CKP2.var_names_make_unique()

dup_index = CKP1.obs_names.intersection(CKP2.obs_names)    # find the duplicated index between 2 libraries
dup_index
Index(['AAACCCACACCTTCGT-1', 'AAACCCACACTACAGT-1', 'AAACCCACAGATACCT-1',
       'AAACCCACAGCGTTTA-1', 'AAACCCACAGGCCTGT-1', 'AAACCCACATGAGATA-1',
       'AAACCCAGTAATCAAG-1', 'AAACCCAGTCGCCTAG-1', 'AAACCCAGTGTCATCA-1',
       'AAACCCAGTGTCCATA-1',
       ...
       'TTTGTTGCATAGAGGC-1', 'TTTGTTGCATGAGATA-1', 'TTTGTTGGTCGTACTA-1',
       'TTTGTTGGTGCGGCTT-1', 'TTTGTTGGTTGTGTTG-1', 'TTTGTTGTCAAAGCCT-1',
       'TTTGTTGTCACCCTTG-1', 'TTTGTTGTCCTCGCAT-1', 'TTTGTTGTCGCTTAAG-1',
       'TTTGTTGTCTCGCAGG-1'],
      dtype='object', length=9697)

CKP1_uni=CKP1-CKP1[dup_index,:]    # slice the unique part of CKP1
TypeError: unsupported operand type(s) for -: 'AnnData' and 'AnnData'
CKP2_uni=CKP2-CKP2[dup_index,:]    # slice the unique part of CKP2
TypeError: unsupported operand type(s) for -: 'AnnData' and 'AnnData'
CKP_intersection=CKP1[dup_index,:]    # slice the intersection part of CKP1 and CKP2
View of AnnData object with n_obs × n_vars = 9697 × 32285
    var: 'gene_ids', 'feature_types'

adata = CKP1_uni.concatenate(CKP2_uni, CKP_intersection, batch_categories=['CKP1_uni', 'CKP2_uni', 'CKP_intersection'])    # merge these 3 parts
NameError: name 'CKP1_uni' is not defined

Hi YJ,

Index objects act like sets. With the first bit you already got most of the way to how I would solve this. Here’s what I would do:

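import anndata

# Barcodes (observation names) of each library
idx1 = CKP1.obs.index
idx2 = CKP2.obs.index
dup_index = idx1.intersection(idx2)        # barcodes present in both libraries
unique_idx1 = idx1.difference(dup_index)   # barcodes only in CKP1
unique_idx2 = idx2.difference(dup_index)   # barcodes only in CKP2

CKP1_uni = CKP1[unique_idx1, :].copy()
CKP2_uni = CKP2[unique_idx2, :].copy()
CKP_intersection = CKP1[dup_index, :].copy()   # shared barcodes, values taken from CKP1

# The module-level function is anndata.concat; it takes a list of AnnData objects
adata = anndata.concat([CKP1_uni, CKP2_uni, CKP_intersection])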

The error you are seeing (TypeError: unsupported operand type(s) for -: 'AnnData' and 'AnnData') occurs because subtraction is not defined for AnnData objects, so adata1 - adata2 has no meaning. Instead, I build the unique and shared sets of barcodes from the indices, then slice each object by those.
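
To make the set logic concrete, here is a minimal sketch on toy data. The barcodes are made up for illustration, and the two indices stand in for CKP1.obs_names and CKP2.obs_names. It uses the named Index methods rather than the operators.

```python
import pandas as pd

# Toy stand-ins for CKP1.obs_names and CKP2.obs_names (hypothetical barcodes)
idx1 = pd.Index(["AAAC-1", "CCCA-1", "GGGT-1"])
idx2 = pd.Index(["CCCA-1", "GGGT-1", "TTTG-1"])

shared = idx1.intersection(idx2)   # barcodes found in both libraries
only1 = idx1.difference(shared)    # barcodes found only in the first library
only2 = idx2.difference(shared)    # barcodes found only in the second library
union = idx1.union(idx2)           # every barcode, each listed once

print(list(shared), list(only1), list(only2), len(union))
```

The three pieces (only1, only2, shared) do not overlap and together cover the union, which is why slicing by them and concatenating gives each barcode exactly once.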

Now, I am not sure what this data is. But if you have two sequenced libraries of the same samples, I would probably add up the molecule counts from both CKP1[dup_index, :] and CKP2[dup_index, :]. Though it’s hard to know whether the UMIs are unique between them.

Hope this helps!
/Valentine

Hello Valentine,
Thanks for the solution! We appreciate it! You saved our data!
We’ll revisit this post once we publish our data and acknowledge you!
Thanks!
Best,
YJ