Group/sum rows based on obs feature

Hi,

Thanks for the nice anndata! :slight_smile:

I’m using it in the context of single-cell sequencing data (with Scanpy).
I’d like to group rows based on an obs column, i.e. sum up var values for each group of obs. In the obs DataFrame, I add a column that is an ID, and the same ID is shared by several obs. I’d like to collapse the X matrix values based on this ID.
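
For illustration, the New_ID column in obs looks something like this (toy example):

    import pandas as pd

    # several obs (cells) share the same New_ID; I'd like one summed row per New_ID
    obs = pd.DataFrame(
        {"New_ID": ["sample1", "sample1", "sample2", "sample2", "sample2"]},
        index=["cell1", "cell2", "cell3", "cell4", "cell5"],
    )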

So far, I tried something like this (with orig being the initial anndata):

counts = pd.DataFrame(orig.X, index=orig.obs.New_ID, columns=["ID"])
new_counts = counts.reset_index().groupby("New_ID")[total_counts.columns].sum()
df = anndata.AnnData(X=new_counts.values, obs=orig.obs.New_ID.unique(), var=orig.var)

But I think I’m going about it the wrong way, and I have some issues manipulating it because orig.X is a sparse matrix.

Could you please help me on this?

Cheers,
Mathieu

I have a strategy for this that I copy between notebooks that looks like this:

    from anndata import AnnData
    import numpy as np
    import pandas as pd
    from scipy import sparse

    groupby_object = adata.obs.groupby(['New_ID'], observed = True)
    
    X = adata.X

    N_obs = groupby_object.ngroups
    N_var = X.shape[1]
    # lil_matrix allows efficient row-by-row assignment while building the result
    X_summed = sparse.lil_matrix((N_obs, N_var))

    group_names = []
    index_names = []
    row = 0
    # sum the rows of X belonging to each group of obs
    for group_columns, idx_ in groupby_object.indices.items():
        X_summed[row] = X[idx_].sum(0)
        row += 1
        group_names.append(group_columns)
        index_names.append('_'.join(map(str, group_columns)))

    if sparse.isspmatrix_csr(X):
        X_summed = X_summed.tocsr()
    else:
        X_summed = np.array(X_summed.todense())
    
    obs = pd.DataFrame(group_names, columns = group_list, index = index_names)

    new_adata = AnnData(X = X_summed, obs = obs, var = adata.var)

Hope this helps!

/Valentine

Hi Valentine,

Thanks for your answer.
Is group_list supposed to be New_ID? And can you confirm that the new obs is a DataFrame with a single column holding the New_ID, with the index being the merged row names belonging to each New_ID?

In the meantime, I came up with a solution where I:

  1. Create a non-sparse DataFrame from the initial one (I had trouble manipulating the sparse data, but probably just because I’m not used to it).
  2. Add the New_ID to the DataFrame.
  3. Sum rows after a groupby on the New_ID.
  4. Recreate an anndata from the summed row values (made sparse again), the unique New_ID list and the initial var:

    import anndata
    import pandas as pd
    from scipy.sparse import csr_matrix

    df = pd.DataFrame.sparse.from_spmatrix(initial_df.X, index=initial_df.obs_names, columns=initial_df.var_names)
    df["New_ID"] = New_ID  # New_ID is the list/array of IDs, one per obs
    new_df = df.groupby("New_ID")[initial_df.var_names].sum()
    new_ad = anndata.AnnData(X=csr_matrix(new_df.values), obs=pd.DataFrame(sorted(df["New_ID"].unique()), columns=["New_ID"]), var=initial_df.var)

It’s maybe less efficient computationally, since I go through non-sparse data at some point…

Cheers,
Mathieu

So, a nice interface for this kind of thing is in the works (though in a separate anndata-stats package), but a fairly fast and simple way to do this at the moment is:

import pandas as pd, anndata as ad

def sum_by(adata: ad.AnnData, col: str) -> ad.AnnData:
    adata.strings_to_categoricals()
    assert pd.api.types.is_categorical_dtype(adata.obs[col])

    # one-hot indicator matrix: obs x groups
    indicator = pd.get_dummies(adata.obs[col])

    # summing obs within each group is a matrix product: (groups x obs) @ (obs x var)
    return ad.AnnData(
        indicator.values.T @ adata.X,
        var=adata.var,
        obs=pd.DataFrame(index=indicator.columns)
    )

This assumes the result will be dense, which will probably be the more efficient format if you’re expecting many (>100?) cells per group.
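
For example, the call is just (assuming the grouping column is called New_ID, as above):

    pseudobulk = sum_by(adata, "New_ID")
    # pseudobulk has one obs row per New_ID category, the same var as adata, and a dense X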

If you want to maintain sparsity, you can get a bit fancier with:

import pandas as pd, numpy as np, anndata as ad
from scipy import sparse

def sum_by(adata: ad.AnnData, col: str) -> ad.AnnData:
    adata.strings_to_categoricals()
    assert pd.api.types.is_categorical_dtype(adata.obs[col])

    cat = adata.obs[col].values
    # sparse indicator matrix with one row per category and one column per obs
    indicator = sparse.coo_matrix(
        (
            np.broadcast_to(True, adata.n_obs),
            (cat.codes, np.arange(adata.n_obs))
        ),
        shape=(len(cat.categories), adata.n_obs),
    )

    # sparse @ sparse keeps the per-group sums sparse
    return ad.AnnData(
        indicator @ adata.X,
        var=adata.var,
        obs=pd.DataFrame(index=cat.categories)
    )
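
Usage is the same as the dense version; e.g. (again assuming a New_ID column in obs):

    pseudobulk = sum_by(adata, "New_ID")
    pseudobulk.X          # per-group sums, kept sparse when adata.X is sparse
    pseudobulk.obs_names  # the New_ID categories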

That said, Valentine’s approach is much more flexible and can be used to do more complex computations.

Yes, sorry, I adapted it to your case but missed an instance of group_list. Everything should work if you set group_list = ['New_ID'].

/Valentine

Hi to both of you,

Thanks for the quick replies and the nice pieces of code. I’ll be able to do what I need with them! :slight_smile:

@Valentine_Svensson: that’s what I guessed, I just mentioned it in case other people run into this post later.

@ivirshup: in the same manner, in the get_dummies function, the col should not be “quoted”.
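
That is, inside sum_by the call should presumably be:

    indicator = pd.get_dummies(adata.obs[col])    # use the col argument
    # rather than looking up a literal column named "col":
    # indicator = pd.get_dummies(adata.obs["col"])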

Cheers,
Mathieu

Ah yes, thanks for the catch! Fixed that.