I’d like to process a large scRNA-seq dataset as dask lazy arrays, with the computation executed via ray. The dataset is in a compressed h5ad file that loads `adata.X` as a `scipy.sparse.csr_array`. I followed the Dask + Zarr anndata guide (written, or at least influenced, by @ivirshup?) to wrap `adata.X` in a dask array, then ran into a simple bug in `sc.pp.scale` that I filed as scanpy issue #2491, where I also asked for more info. Since then I’ve read more and realized this is a better place to ask. What’s the current status of unifying `dask` (as well as `sparse`) support, and how can I help?
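For context, this is roughly what I did before hitting the bug. It’s a minimal sketch: the file path and chunk size are placeholders, and I’m wrapping the in-memory matrix directly rather than reading chunks from zarr as the guide does.

```python
import anndata as ad
import dask.array as da
import scanpy as sc

adata = ad.read_h5ad("counts.h5ad")  # placeholder path; X loads as scipy CSR

# Wrap X in a lazy dask array, chunked along the cell axis so each
# chunk is a contiguous block of rows.
adata.X = da.from_array(adata.X, chunks=(10_000, adata.shape[1]))

sc.pp.scale(adata)  # the call that fails for me (scanpy issue #2491)
```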
Parallelization via dask relies on chunking data, and for large data it’s beneficial to use a datastore like zarr so that chunks can be read directly from disk. Useful background includes one comment that shows how to write a sparse array to zarr and another that discusses a formalization of sparse arrays in zarr. It seems like CSR and CSC sparse matrices (e.g. `scipy.sparse`) are more efficient, but to my knowledge they can’t easily be chunked because `indptr` is a cumulative sum, while COO arrays (e.g. pydata’s `sparse`) can be split anywhere (see the sketch below). Then there’s the Dask + Zarr guide I already mentioned, which shows how to write anndata to zarr using `write_dispatched`.
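To illustrate the chunking point, here’s a sketch of my understanding (the zarr layout is improvised, not any formal spec): a COO array is just flat coordinate and value vectors, so any contiguous slice of them is self-contained, whereas CSR’s `indptr` is a running total that would have to be rebased for every chunk.

```python
import sparse
import zarr

# A random 2-D COO array: coords has shape (2, nnz), data has shape (nnz,).
x = sparse.random((100_000, 2_000), density=0.01, format="coo")

# Because coords/data are plain vectors, zarr can store and serve them
# in independent chunks; any slice along nnz stands on its own.
g = zarr.open_group("X_coo.zarr", mode="w")
g.create_dataset("coords", data=x.coords, chunks=(2, 1_000_000))
g.create_dataset("data", data=x.data, chunks=(1_000_000,))
g.attrs["shape"] = list(x.shape)

# CSR, by contrast, stores indptr (a cumulative sum of row lengths),
# so a slice of data/indices is not interpretable without rebasing it.
```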
But before I even get to that step, I hit a bug when trying to convert `adata.X` from a `scipy.sparse.csr_matrix` to a `sparse.COO` array: the `np.array()` call at line 635 is supposed to wrap `scipy.sparse` matrices but tries to densify them instead.
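For reference, here’s the conversion I’m attempting as a minimal sketch (the random matrix is a stand-in for `adata.X`, and whether the generic constructor or only the `np.array()` path densifies presumably depends on the `sparse`/`scipy` versions involved):

```python
import scipy.sparse
import sparse

# Stand-in for adata.X; the real matrix comes from a compressed h5ad file.
csr = scipy.sparse.random(1_000, 500, density=0.01, format="csr")

# The conversion I attempted; on my versions this reaches the np.array()
# call mentioned above and tries to densify the matrix.
x = sparse.COO(csr)

# The explicit constructor, which I'd expect to be the supported path:
x = sparse.COO.from_scipy_sparse(csr)
```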
Some notes: I’m new to scanpy and anndata, so feel free to correct me if I’m way off base. I use zarr, xarray, dask, and ray for processing and performing ML on microscopy data. I originally included many more links to make life easier, but as a new poster I’m limited to including only 2.