I’m using it in the context of single-cell sequencing data (with Scanpy).
I’d like to group rows based on an obs column, i.e. sum up the var values for each group of obs. In the obs DataFrame, I added a column that holds an ID, and the same ID is shared by several obs. I’d like to collapse the X values based on this ID.
So far, I have tried something like this (with orig being the initial anndata):
Thanks for your answer.
Is group_list supposed to be New_ID? Can you confirm that your new obs is a DataFrame with just one column holding the New_ID, with the index being the merged row names belonging to that New_ID?
In the meantime, I came up with a solution where I:
Create a non-sparse DataFrame from the initial one (I had trouble manipulating sparse data, but that is probably because I’m not used to it).
Add the New_ID to the DataFrame.
Sum the rows after a groupby on the New_ID.
Recreate an anndata from the summed row values (converted back to sparse), the list of unique New_ID values and the initial var.
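For anyone reading along, the steps above could be sketched roughly like this. It is only a minimal sketch on toy data: the matrix, the cell and gene names and the New_ID labels are made up for illustration and stand in for orig.X, orig.obs_names, orig.var_names and the New_ID column of orig.obs.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins (assumptions, not data from this thread): 4 obs x 3 var.
X = sparse.csr_matrix(
    np.array([[1, 0, 2], [0, 3, 0], [4, 0, 0], [0, 0, 5]], dtype=float)
)
obs_names = ["cell1", "cell2", "cell3", "cell4"]
var_names = ["geneA", "geneB", "geneC"]
new_id = pd.Series(["a", "b", "a", "b"], index=obs_names, name="New_ID")

# 1. Create a non-sparse DataFrame from the sparse matrix.
df = pd.DataFrame(X.toarray(), index=obs_names, columns=var_names)

# 2. Add the New_ID column (aligned on the obs names).
df["New_ID"] = new_id

# 3. Sum the rows that share the same New_ID.
summed = df.groupby("New_ID").sum()

# 4. Convert the summed values back to sparse and keep the group labels.
summed_X = sparse.csr_matrix(summed.to_numpy())
group_ids = summed.index.to_list()

# With anndata, the collapsed object could then be rebuilt along the lines of:
#   ad.AnnData(summed_X, obs=pd.DataFrame(index=group_ids), var=orig.var)
```

Note that the dense DataFrame in step 1 can get large for big datasets, since every zero is stored explicitly.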
So, a nice interface for this kind of thing is in the works (though in a separate anndata-stats package), but a fairly fast and simple way to do this at the moment is:
import pandas as pd, anndata as ad

def sum_by(adata: ad.AnnData, col: str) -> ad.AnnData:
    adata.strings_to_categoricals()
    # is_categorical_dtype is deprecated in recent pandas; check the dtype directly
    assert isinstance(adata.obs[col].dtype, pd.CategoricalDtype)

    # One indicator column per category: shape (n_obs, n_groups)
    indicator = pd.get_dummies(adata.obs[col])

    return ad.AnnData(
        indicator.values.T @ adata.X,
        var=adata.var,
        obs=pd.DataFrame(index=indicator.columns)
    )
This assumes the result will be dense, which will probably be the more efficient format if you’re expecting many (>100?) cells per group.
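To see why multiplying by the indicator matrix gives per-group sums, here is a small worked example on toy data. The values and group labels are made up for illustration, and it uses a plain NumPy array instead of adata.X, so treat it as a sketch of the idea rather than a drop-in replacement.

```python
import numpy as np
import pandas as pd

# Toy stand-ins (assumptions): 4 obs x 3 var, two groups "a" and "b".
groups = pd.Series(["a", "b", "a", "b"], dtype="category")
X = np.array([[1, 0, 2], [0, 3, 0], [4, 0, 0], [0, 0, 5]], dtype=float)

# One column per category; row i has a 1 in the column of obs i's group.
indicator = pd.get_dummies(groups, dtype=float)  # shape (n_obs, n_groups)

# (n_groups, n_obs) @ (n_obs, n_vars): each output row adds up the rows
# of X that belong to one group.
summed = indicator.to_numpy().T @ X

# The same result via an explicit groupby, for comparison.
expected = pd.DataFrame(X).groupby(groups, observed=True).sum().to_numpy()
```

Here the row for group "a" is the sum of rows 0 and 2 of X, and the row for group "b" is the sum of rows 1 and 3.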
If you want to maintain sparsity, you can get a bit fancier with:
import pandas as pd, numpy as np, anndata as ad
from scipy import sparse

def sum_by(adata: ad.AnnData, col: str) -> ad.AnnData:
    adata.strings_to_categoricals()
    # is_categorical_dtype is deprecated in recent pandas; check the dtype directly
    assert isinstance(adata.obs[col].dtype, pd.CategoricalDtype)

    cat = adata.obs[col].values
    # Sparse (n_groups, n_obs) indicator: entry (g, i) is set when obs i is in group g
    indicator = sparse.coo_matrix(
        (
            np.broadcast_to(True, adata.n_obs),
            (cat.codes, np.arange(adata.n_obs))
        ),
        shape=(len(cat.categories), adata.n_obs),
    )

    return ad.AnnData(
        indicator @ adata.X,
        var=adata.var,
        obs=pd.DataFrame(index=cat.categories)
    )
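If it helps, this is what the sparse indicator looks like on a toy example. The labels and values are invented for illustration, and a plain scipy matrix stands in for adata.X. The category codes are used as row indices and the obs positions as column indices, so the product stays sparse.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins (assumptions): 4 obs x 3 var, two groups "a" and "b".
cat = pd.Categorical(["a", "b", "a", "b"])
X = sparse.csr_matrix(
    np.array([[1, 0, 2], [0, 3, 0], [4, 0, 0], [0, 0, 5]], dtype=float)
)
n_obs = X.shape[0]

# Row index = category code of each obs, column index = position of the obs.
indicator = sparse.coo_matrix(
    (np.ones(n_obs), (cat.codes, np.arange(n_obs))),
    shape=(len(cat.categories), n_obs),
)

# Sparse @ sparse keeps the result sparse.
summed = indicator @ X
```

One thing to keep in mind: missing values in the column get the code -1 in a pandas Categorical, so they would need to be dropped or filled before building the indicator this way.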
That said, Valentine’s approach is much more flexible and can be used to do more complex computations.