How to make the union of two scRNA-seq matrices?

Hello Scanpy,
We have 2 scRNA-seq libraries that share some barcodes. We want to build the union of these 2 libraries, keeping each duplicated barcode only once. We are not experienced with AnnData, though, and it looks like we cannot do it with the simple |, &, - or ^ operators the way we would in pandas.
Could you please help us with this question?

We tried:

CKP1 = sc.read_10x_mtx(path='D:/ZGY/MST_matrix/KP9CKP11-5_11/CKP/', var_names='gene_symbols', cache=True) 
CKP2 = sc.read_10x_mtx(path='D:/ZGY/MST_matrix/KP10CKP12-5_11/CKP/', var_names='gene_symbols', cache=True)

dup_index = CKP1.obs_names.intersection(CKP2.obs_names)    # find the duplicated index between 2 libraries
      dtype='object', length=9697)

CKP1_uni=CKP1-CKP1[dup_index,:]    # slice the unique part of CKP1
TypeError: unsupported operand type(s) for -: 'AnnData' and 'AnnData'
CKP2_uni=CKP2-CKP2[dup_index,:]    # slice the unique part of CKP2
TypeError: unsupported operand type(s) for -: 'AnnData' and 'AnnData'
CKP_intersection=CKP1[dup_index,:]    # slice the intersection part of CKP1 and CKP2
View of AnnData object with n_obs × n_vars = 9697 × 32285
    var: 'gene_ids', 'feature_types'

adata = CKP1_uni.concatenate(CKP2_uni, CKP_intersection, batch_categories=['CKP1_uni', 'CKP2_uni', 'CKP_intersection'])    # merge these 3 parts
NameError: name 'CKP1_uni' is not defined

Hi YJ,

Index objects act like sets. Your first step already gets you most of the way to how I would solve this. Here’s what I would do:

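import anndata

# Barcode indices of the two libraries
idx1 = CKP1.obs.index
idx2 = CKP2.obs.index

# Barcodes shared by both libraries, and barcodes unique to each
dup_index = idx1.intersection(idx2)
unique_idx1 = idx1.difference(dup_index)
unique_idx2 = idx2.difference(dup_index)

# Slice each part; .copy() turns the views into standalone AnnData objects
CKP1_uni = CKP1[unique_idx1, :].copy()
CKP2_uni = CKP2[unique_idx2, :].copy()
CKP_intersection = CKP1[dup_index, :].copy()

# Stack the three parts along the cell axis (anndata.concat needs anndata >= 0.7;
# on older versions use CKP1_uni.concatenate(CKP2_uni, CKP_intersection) instead).
# merge="same" keeps var columns that are identical across the parts.
adata = anndata.concat([CKP1_uni, CKP2_uni, CKP_intersection], merge="same")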

The error you are seeing (TypeError: unsupported operand type(s) for -: 'AnnData' and 'AnnData') occurs because subtraction is not defined for AnnData objects, so adata1 - adata2 has no meaning. Instead, I build the unique and shared index sets and then slice by those.
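
To see the set behaviour in isolation, here is a minimal sketch on plain pandas Index objects. The barcodes are made up and only stand in for CKP1.obs_names and CKP2.obs_names; nothing here touches AnnData itself.

```python
import pandas as pd

# Toy barcodes (invented values) standing in for the two libraries' obs_names.
idx1 = pd.Index(["AAAC-1", "AAAG-1", "AACT-1", "AAGT-1"])
idx2 = pd.Index(["AAAG-1", "AAGT-1", "ACCA-1"])

shared = idx1.intersection(idx2)   # barcodes present in both libraries
only1 = idx1.difference(shared)    # barcodes found only in library 1
only2 = idx2.difference(shared)    # barcodes found only in library 2
union = idx1.union(idx2)           # every barcode exactly once

print("shared:", list(shared))
print("only in 1:", list(only1))
print("only in 2:", list(only2))
print("union size:", len(union))
```

The three pieces (only1, only2, shared) do not overlap and together cover the union, which is why slicing by them and concatenating gives each barcode once.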

Now, I am not sure what this data is. But if you have two sequenced libraries of the same samples, I would probably add up the molecule counts from both CKP1[dup_index, :] and CKP2[dup_index, :], though it’s hard to know whether the UMIs are unique between them.
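
As a rough illustration of what adding up the counts could look like, here is a toy sketch using small dense pandas DataFrames (cells × genes) in place of the real count matrices. The barcodes, gene names and counts are invented, and it assumes both libraries really contain the same cells and the same genes. Real 10x matrices are sparse and much larger, so treat this as a picture of the arithmetic rather than something to run on the full data.

```python
import pandas as pd

genes = ["g1", "g2", "g3"]  # hypothetical gene names

# Toy count matrices: rows are barcodes, columns are genes.
counts1 = pd.DataFrame(
    [[1, 0, 2], [0, 3, 0], [4, 0, 0]],
    index=["bc1", "bc2", "bc3"],
    columns=genes,
)
counts2 = pd.DataFrame(
    [[5, 1, 0], [0, 0, 7]],
    index=["bc2", "bc4"],
    columns=genes,
)

# Align on barcodes and genes; a barcode missing from one library counts as 0 there,
# so shared barcodes are summed and unique barcodes are carried over unchanged.
merged = counts1.add(counts2, fill_value=0).astype(int)

print(merged)
```

Here bc2 is the only shared barcode, so its row is the sum of both libraries, while bc1, bc3 and bc4 keep their original counts. Whether summing is appropriate still depends on the UMI question above.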

Hope this helps!

Hello Valentine,
Thanks for the solution! We really appreciate it. You saved our data!
We’ll revisit this post once we publish our data and acknowledge you!