Memory Usage in Multiple New Formats

Hi,
there are now multiple ways of loading an AnnData object. I will list some: read_h5ad, read_h5ad(backed='r'), read_zarr, and read_lazy (in anndata.experimental). I am having a hard time understanding the difference between backed mode and read_lazy.
Can you give me a comparison of how these methods use memory when doing the following operations:
1) adata = ad.read_xx(file)
2) subset = adata[:100].copy()

I am looking for a description like: for read_h5ad(), during 1) the entire object is loaded into memory; during 2) a copy is created in which memory is allocated according to the number of observations. At the end of these instructions, memory will hold an entire copy of the AnnData object plus the data for 100 cells.
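Concretely, this is the pattern I mean (the file name is just a placeholder):

```python
import anndata as ad

# 1) open/load the file -- this is the step where the readers differ
adata = ad.read_h5ad("data.h5ad")  # or read_zarr, backed='r', read_lazy, ...

# 2) materialize the first 100 observations as an independent copy
subset = adata[:100].copy()
```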

Best!

In short, backed mode is a bit of a hodge-podge where you hold a reference to a file and lazily construct data structures on the fly. It only works with h5ad, and internally it relies on anndata.io.sparse_dataset as a data structure.
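For example, here is roughly what backed mode looks like in practice (a minimal sketch; the file path is a placeholder, and I'm assuming X is stored as a sparse matrix):

```python
import anndata as ad

# Backed mode: obs, var, etc. are read into memory, but X stays on disk
adata = ad.read_h5ad("data.h5ad", backed="r")

print(adata.isbacked)  # True
print(type(adata.X))   # a file-backed sparse dataset, not an in-memory matrix
```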

So you can use anndata.io.sparse_dataset without backed mode, and it may be simpler, especially for zarr. Here, the read into memory happens immediately upon subsetting the dataset itself; i.e., adata_backed_or_with_sparse_dataset_in_X[subset] will not load anything (in theory), but accessing X on that view will load it into memory because of the [subset]. See the sketch below.
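Something like this (the file name is a placeholder, and I'm assuming X is stored as an encoded sparse matrix):

```python
import h5py
from anndata.io import sparse_dataset

f = h5py.File("data.h5ad", "r")
X = sparse_dataset(f["X"])  # a handle to the on-disk matrix; nothing read yet

sub = X[:100]  # the read happens here: an in-memory scipy sparse matrix
```

The zarr version should look the same, just with a zarr group in place of the h5py one.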

read_elem_lazy and read_lazy use dask and xarray to read the on-disk data. Memory should never be allocated unless you explicitly request it via to_memory (anndata API) or compute (dask API). So adata_read_with_read_lazy[subset][other_subset][other_other_subset] will never allocate memory, nor will accessing obs or X. However, once you do …X.compute() or adata_read_with_read_lazy[subset][other_subset][other_other_subset].to_memory(), the data will be brought back into memory.
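In code, that behaviour looks roughly like this (a sketch based on the description above; read_lazy is experimental, so the import path and semantics may shift):

```python
from anndata.experimental import read_lazy

adata = read_lazy("data.h5ad")  # dask/xarray-backed; (almost) nothing loaded

view = adata[:100][:10]         # chained subsetting is pure bookkeeping
obs = view.obs                  # still lazy
X = view.X                      # still lazy (a dask array)

dense = view.X.compute()        # dask API: materializes X for this view
small = view.to_memory()        # anndata API: materializes the whole view
```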

I’m open to a guide by the way! We’re working on something now that would tie in nicely, so I think we will make one as a byproduct of that work anyway :wink: