Group/sum rows based on obs feature

mbahin · March 23, 2022, 9:36am

Hi,

Thanks for the nice anndata package!

I’m using it in the context of single-cell sequencing data (with Scanpy).
I’d like to group rows based on an obs column, i.e. sum the var values over groups of obs. In the obs DataFrame, I added an ID column, where the same ID is shared by several obs. I’d like to collapse the X values based on this ID.

So far, I tried something like this (with orig being the initial anndata):

counts = pd.DataFrame(orig.X, index=orig.obs.New_ID, columns=["ID"])
new_counts = counts.reset_index().groupby("New_ID")[total_counts.columns].sum()
df = anndata.AnnData(X=new_counts.values, obs=orig.obs.New_ID.unique(), var=orig.var)

But I think I’m going about it the wrong way, and I’m having trouble manipulating it because orig.X is a sparse matrix.

Could you please help me with this?

Cheers,
Mathieu

Valentine_Svensson · March 24, 2022, 2:15pm

I have a strategy for this that I copy between notebooks that looks like this:

    from anndata import AnnData
    import numpy as np
    import pandas as pd
    from scipy import sparse

    groupby_object = adata.obs.groupby(['New_ID'], observed = True)
    
    X = adata.X

    N_obs = groupby_object.ngroups
    N_var = X.shape[1]
    X_summed = sparse.lil_matrix((N_obs, N_var))

    group_names = []
    index_names = []
    row = 0
    for group_columns, idx_ in groupby_object.indices.items():
        X_summed[row] = X[idx_].sum(0)
        row += 1
        group_names.append(group_columns)
        index_names.append('_'.join(map(str, group_columns)))

    if sparse.isspmatrix_csr(X):
        X_summed = X_summed.tocsr()
    else:
        X_summed = np.array(X_summed.todense())
    
    # group_list must hold the groupby keys; for this case, group_list = ['New_ID']
    obs = pd.DataFrame(group_names, columns = group_list, index = index_names)

    new_adata = AnnData(X = X_summed, obs = obs, var = adata.var)

Hope this helps!

/Valentine

mbahin · March 24, 2022, 3:48pm

Hi Valentine,

Thanks for your answer.
Is group_list supposed to be New_ID? Can you confirm that your new obs is a DataFrame with just one column holding the New_ID, with the index being the merged row names belonging to that New_ID?

In the meantime, I came up with a solution where I:

1. Create a non-sparse DataFrame from the initial one (I had trouble manipulating sparse data, but probably because I’m not used to it).
2. Add the New_ID to the DataFrame.
3. Sum rows after a groupby on the New_ID.
4. Recreate an anndata from the summed row values (made sparse), the unique New_ID list and the initial var.

df = pd.DataFrame.sparse.from_spmatrix(initial_df.X, index=initial_df.obs_names, columns=initial_df.var_names)
df["New_ID"] = New_ID
new_df = df.groupby("New_ID")[df.columns].sum()
new_ad = anndata.AnnData(X=csr_matrix(new_df.values), obs=pd.DataFrame(sorted(df["New_ID"].unique()), columns=["New_ID"]), var=initial_df.var)

It may be less computationally efficient, since I go through non-sparse data at some point…

Cheers,
Mathieu

ivirshup · March 25, 2022, 1:08pm

So, a nice interface for this kind of thing is in the works (though in a separate anndata-stats package), but a fairly fast and simple way to do this at the moment is:

import pandas as pd, anndata as ad

def sum_by(adata: ad.AnnData, col: str) -> ad.AnnData:
    adata.strings_to_categoricals()
    assert isinstance(adata.obs[col].dtype, pd.CategoricalDtype)

    indicator = pd.get_dummies(adata.obs[col])

    return ad.AnnData(
        indicator.values.T @ adata.X,
        var=adata.var,
        obs=pd.DataFrame(index=indicator.columns)
    )

This assumes the result will be dense, which will probably be the more efficient format if you’re expecting many (>100?) cells per group.
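To see why the indicator product gives per-group sums, here is a minimal sketch on toy data. It assumes a dense NumPy array standing in for adata.X and a made-up categorical series standing in for adata.obs[col]; the names labels and X are illustrative only and not part of anndata.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (hypothetical): 5 observations, 3 variables, 2 groups.
labels = pd.Series(["a", "b", "a", "b", "a"], dtype="category")
X = np.arange(15, dtype=float).reshape(5, 3)

# One row per observation, one column per category.
indicator = pd.get_dummies(labels)

# (n_groups x n_obs) @ (n_obs x n_vars) -> (n_groups x n_vars) group sums.
summed = indicator.values.T.astype(float) @ X

# Cross-check against an explicit per-category sum.
expected = np.vstack(
    [X[(labels == c).to_numpy()].sum(axis=0) for c in indicator.columns]
)
assert np.allclose(summed, expected)
```

Each row of the transposed indicator selects the observations of one category, so the matrix product adds up exactly those rows of X.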

If you want to maintain sparsity, you can get a bit fancier with:

import pandas as pd, numpy as np, anndata as ad
from scipy import sparse

def sum_by(adata: ad.AnnData, col: str) -> ad.AnnData:
    adata.strings_to_categoricals()
    assert isinstance(adata.obs[col].dtype, pd.CategoricalDtype)

    cat = adata.obs[col].values
    indicator = sparse.coo_matrix(
        (
            np.broadcast_to(True, adata.n_obs),
            (cat.codes, np.arange(adata.n_obs))
        ),
        shape=(len(cat.categories), adata.n_obs),
    )

    return ad.AnnData(
        indicator @ adata.X,
        var=adata.var,
        obs=pd.DataFrame(index=cat.categories)
    )

That said, Valentine’s approach is much more flexible and can be used to do more complex computations.
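As an illustration of that flexibility, here is a rough sketch of the same groupby-indices loop computing a per-group mean instead of a sum. It assumes toy stand-ins for adata.obs and adata.X (the obs frame and X below are made up), and swaps in mean purely as an example of a different reduction.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins (hypothetical) for adata.obs and adata.X.
obs = pd.DataFrame({"New_ID": ["a", "a", "b", "b", "b"]})
X = sparse.csr_matrix(np.arange(15, dtype=float).reshape(5, 3))

group_names = []
rows = []
# .indices maps each group key to the integer row positions of its members.
for key, idx in obs.groupby("New_ID", observed=True).indices.items():
    group_names.append(key)
    # Any per-group reduction can go here; mean is only an example.
    rows.append(np.asarray(X[idx].mean(axis=0)).ravel())

# Stack the per-group vectors into a dense (n_groups x n_vars) table.
result = pd.DataFrame(np.vstack(rows), index=group_names)
```

The result here is dense; wrapping it back into an AnnData with the original var would follow the same pattern as the snippets above.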

Valentine_Svensson · March 27, 2022, 11:04pm

Yes, sorry, I adapted it to your case but missed an instance of group_list. Everything should work if you do group_list = ['New_ID'].

/Valentine

mbahin · March 28, 2022, 9:42am

Hi to both of you,

Thanks for the quick replies and the nice code. I’ll be able to do what I need with that!

@Valentine_Svensson: that’s what I guessed; I just mentioned it in case other people run into the post later.

@ivirshup: likewise, in the get_dummies call, col should not be “quoted”.

Cheers,
Mathieu

ivirshup · March 28, 2022, 1:17pm

Ah yes, thanks for the catch! Fixed that.
