Current status of dask support (and on-disk sparse arrays)

I’d like to process a large scRNA-seq dataset using dask lazy arrays executed via ray. The dataset is in a compressed h5ad file whose adata.X loads as a scipy.sparse.csr_array. I followed the Dask + Zarr anndata guide (written or at least influenced by @ivirshup?) to wrap adata.X in a dask array, then ran into a simple bug at sc.pp.scale. I submitted a ticket for it (scanpy issue #2491), where I also asked for more info.

Since then I’ve read more and realized this is a better place to get more information. What’s the current status on unifying anndata and scanpy with dask (as well as zarr and sparse), and how can I help?


Parallelization via dask relies on chunking data, and for large data it’s beneficial to use a datastore like zarr to read chunks directly from disk. Useful background includes this comment that shows how to write a sparse array to zarr, while this comment discusses a formalization of sparse arrays in zarr. It seems like CSR and CSC sparse matrices (e.g. scipy.sparse) are more efficient, but to my knowledge they can’t easily be chunked because their index pointer array (indptr) is a cumulative sum, while COO matrices (e.g. pydata’s sparse) can. Then there’s the Dask + Zarr guide I already mentioned, which shows how to write anndata to zarr using write_dispatched. But before I even get to that step I hit a bug when trying to convert adata.X from a scipy.sparse.csr_matrix to a sparse.COO array: in anndata._core.anndata.AnnData.X the np.array() call at line 635 wraps scipy.sparse matrices but tries to densify sparse.COO matrices.
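To make the cumulative-sum remark concrete, here is a minimal pure-Python sketch of how I understand the difference. The helper names slice_csr_rows and slice_coo_rows are hypothetical, and this is not how anndata, zarr, or dask handle it. It only shows that a row chunk of a CSR matrix needs its indptr re-based, while COO entries are independent of one another.

```python
def slice_csr_rows(data, indices, indptr, start, stop):
    """Return a standalone CSR triple for rows [start, stop).

    indptr is a running count of stored values, so a chunk's pointers
    must be shifted to start at zero before the chunk is usable alone.
    """
    lo, hi = indptr[start], indptr[stop]
    chunk_indptr = [p - lo for p in indptr[start:stop + 1]]
    return data[lo:hi], indices[lo:hi], chunk_indptr


def slice_coo_rows(rows, cols, vals, start, stop):
    """Return COO triplets for rows [start, stop), with rows re-based.

    Each (row, col, value) entry stands alone, so chunking is a filter.
    """
    keep = [i for i, r in enumerate(rows) if start <= r < stop]
    return (
        [rows[i] - start for i in keep],
        [cols[i] for i in keep],
        [vals[i] for i in keep],
    )


# A 4 x 3 example matrix:
# [[1, 0, 2],
#  [0, 0, 3],
#  [4, 5, 0],
#  [0, 6, 0]]
data = [1, 2, 3, 4, 5, 6]
indices = [0, 2, 2, 0, 1, 1]      # column of each stored value
indptr = [0, 2, 3, 5, 6]          # cumulative count of values per row
coo_rows = [0, 0, 1, 2, 2, 3]     # explicit row of each stored value

# Take the last two rows as one chunk in both layouts.
csr_chunk = slice_csr_rows(data, indices, indptr, 2, 4)
coo_chunk = slice_coo_rows(coo_rows, indices, data, 2, 4)
```

In this toy case the CSR chunk is still cheap to build, but every chunk needs its own re-based indptr, which is the bookkeeping I suspect makes an on-disk chunked CSR layout awkward.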

Some notes: I’m new to scanpy and anndata so feel free to correct me if I’m way off base. I use zarr, xarray, dask, and ray for processing and performing ML on microscopy data. I originally included many more links to make life easier, but as a new poster I’m limited to only including 2.


It seems like adding dask support to scanpy should start with adding test data that includes an adata.X that is a dask.array. Most changes like this one should be simple given that dask arrays largely follow the numpy API.
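As a rough illustration of what such a test could look like, here is a sketch that assumes numpy and dask are installed. The function scale_columns is a hypothetical stand-in written only with array operators and methods, not scanpy's actual sc.pp.scale. The same code path is run on a numpy array and on a chunked dask array and the results are compared.

```python
import numpy as np
import dask.array as da


def scale_columns(x):
    """Center each column and divide by its standard deviation.

    Hypothetical stand-in for a scaling step. It uses only methods and
    operators shared by numpy and dask arrays.
    """
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std = std + (std == 0)  # replace zero std with 1 to avoid dividing by zero
    return (x - mean) / std


# Small dense test matrix; the middle column is constant on purpose.
dense = np.array(
    [
        [1.0, 2.0, 5.0],
        [3.0, 2.0, 7.0],
        [5.0, 2.0, 9.0],
        [7.0, 2.0, 11.0],
    ]
)

# The same data as a lazy dask array split into two row chunks.
lazy = da.from_array(dense, chunks=(2, 3))

expected = scale_columns(dense)     # eager numpy result
lazy_result = scale_columns(lazy)   # still a lazy dask array
computed = lazy_result.compute()    # triggers the actual computation
```

A test fixture along these lines would fail as soon as a function falls back to something that only works on in-memory numpy arrays.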

All conversation regarding sparse on-disk arrays can move here.