I’d like to process a large scRNA-seq dataset using dask lazy arrays processed via ray. The dataset is in a compressed h5ad file whose `adata.X` loads as a `scipy.sparse.csr_array`. I followed the Dask + Zarr anndata guide (written or at least influenced by @ivirshup?) to wrap `adata.X` in a dask array, then ran into a simple bug at `sc.pp.scale` that I filed a ticket for (scanpy issue #2491) and asked about there; a trimmed-down sketch of what I ran is below.
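For context, this is roughly the workflow, stripped down (the file name and chunk size are placeholders):

```python
import anndata as ad
import dask.array as da
import scanpy as sc

# Placeholder path; adata.X loads as a scipy.sparse CSR array
adata = ad.read_h5ad("dataset.h5ad")

# Wrap X in a dask array, chunking along the observation axis
adata.X = da.from_array(adata.X, chunks=(10_000, adata.shape[1]))

# This is where I hit the bug reported in scanpy issue #2491
sc.pp.scale(adata)
```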
Since then I’ve read more and realized this is a better place to ask. What’s the current status of unifying `anndata` and `scanpy` with `dask` (as well as `zarr` and `sparse`), and how can I help?
Parallelization via dask relies on chunking data, and for large data it’s beneficial to use a datastore like zarr so chunks can be read directly from disk. Useful background includes this comment, which shows how to write a sparse array to zarr, and this comment, which discusses a formalization of sparse arrays in zarr. It seems like CSR and CSC sparse matrices (e.g. `scipy.sparse`) are more efficient, but to my knowledge they can’t easily be chunked because the index pointer is a cumulative sum over rows/columns, while COO matrices (e.g. pydata’s `sparse`) can be, since every nonzero carries absolute coordinates (see the sketch after this paragraph).
Then there’s the Dask + Zarr guide I already mentioned, which shows how to write anndata to zarr using `write_dispatched`; my rough reading of that step is sketched below.
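This is how I understand the `write_dispatched` step from the guide (the output path and chunk size are mine, and I may be off on details of the callback, so treat it as a sketch rather than the guide verbatim):

```python
import zarr
from anndata.experimental import write_dispatched

def write_chunked(func, store, k, elem, dataset_kwargs, iospec):
    """Write each element, overriding chunking for dense array elements."""
    if iospec.encoding_type == "array" and hasattr(elem, "shape") and len(elem.shape) > 0:
        dataset_kwargs = {**dataset_kwargs, "chunks": (1_000, *elem.shape[1:])}
    func(store, k, elem, dataset_kwargs=dataset_kwargs)

# Reusing the adata loaded in the earlier sketch
z = zarr.open_group("adata.zarr", mode="w")
write_dispatched(z, "/", adata, callback=write_chunked)
```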
But before I even get to that step I hit a bug when trying to convert `adata.X` from a `scipy.sparse.csr_matrix` to a `sparse.COO` array: in the `X` setter in `anndata._core.anndata`, the `np.array()` call at line 635 merely wraps `scipy.sparse` matrices but tries to densify `sparse.COO` arrays. A minimal illustration of that asymmetry is below.
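If I understand correctly, the difference comes down to `__array__`; here’s a minimal repro of the two behaviors as I see them (this illustrates numpy and pydata/sparse semantics, not the anndata code path itself):

```python
import numpy as np
import scipy.sparse
import sparse

csr = scipy.sparse.random(100, 50, density=0.1, format="csr", dtype=np.float32)
coo = sparse.COO.from_scipy_sparse(csr)

# scipy.sparse matrices don't implement __array__, so np.array() just
# wraps the matrix in a 0-d object array without densifying it
print(np.array(csr).shape)  # ()

# sparse.COO does implement __array__, so the same call attempts to
# densify, and pydata/sparse raises unless SPARSE_AUTO_DENSIFY is set
np.array(coo)  # RuntimeError: Cannot convert a sparse array to dense automatically...
```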
Some notes: I’m new to scanpy and anndata, so feel free to correct me if I’m way off base. I use zarr, xarray, dask, and ray for processing and performing ML on microscopy data. I originally included many more links to make life easier, but as a new poster I’m limited to two.