How can I operate on a View of an AnnData object when the object itself is too large to load into memory?

Hi All,

I am new to using AnnData and seem to be struggling with some basic usage. I am working with large (>10 GB) snRNA-seq files from Allen Brain in .h5ad format. I can only load these as backed AnnData objects (backed='r') because my RAM is not sufficient to load them fully into memory. If I index the backed object to get a View based on metadata, I cannot call functions like .to_memory() or .to_df() on this View without crashing my Python kernel. I have monitored memory usage, and during these operations it seems to be loading the entire AnnData object into memory, rather than just the indexed View I am calling .to_memory() or .to_df() on. I have confirmed this behavior even on absurdly small indexed Views (4 x 10, i.e., just a few cells and genes).
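
For concreteness, here is roughly what I am doing (the file path, obs column, and gene names below are just placeholders for my actual data):

```python
import anndata as ad

# Open the large .h5ad file in backed mode so X stays on disk
adata = ad.read_h5ad("allen_brain.h5ad", backed="r")  # placeholder path

# Index by metadata to get a small View (a handful of cells and genes)
view = adata[adata.obs["cell_type"] == "L5 ET", ["Gad1", "Gad2"]]  # placeholder names

# Either of these crashes the kernel / appears to load the full object:
small = view.to_memory()
df = view.to_df()
```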

Am I missing something basic here? I want to be able to extract small subsets of data for specific cells from these AnnData objects based on metadata, and then conduct routine analysis with pandas, etc. Is this not possible if I can't load the entire AnnData object into RAM at any point?

Many thanks in advance

Hello,

I am happy to look into this behavior a bit if you can open an issue; in theory, that should not be happening.

However, we have new APIs to handle this: please have a look at anndata.experimental.read_elem_lazy and anndata.experimental.read_lazy in the anndata documentation. These rely on dask/xarray and can handle obs and var lazily as well.

Also have a look at our tutorial notebook, "Lazily Accessing Remotely Stored Data", in the anndata documentation.
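
To give you a rough idea, something like the untested sketch below should cover your use case. It assumes a recent anndata with the experimental lazy API plus its optional dask/xarray dependencies, and the path, obs column, and gene names are placeholders:

```python
import anndata as ad

# Open the whole file lazily: X and layers become dask arrays, and obs/var
# are read lazily too, so nothing large is pulled into RAM up front.
adata = ad.experimental.read_lazy("allen_brain.h5ad")  # placeholder path

# Build a boolean mask from a single obs column (only that column is read).
mask = (adata.obs["cell_type"] == "L5 ET").values  # placeholder names

# Subsetting stays lazy; materialize only the small slice you need.
subset = adata[mask, ["Gad1", "Gad2"]].to_memory()
df = subset.to_df()  # small pandas DataFrame for downstream analysis
```

The idea is that reading is deferred to dask, so only the chunks touched by your subset should actually be pulled from disk, rather than the whole X matrix.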

However, this behavior sounds buggy, so it would be great if you could open an issue so I can look into it when I'm back in the office 🙂 Also feel free to submit a fix!