[AnnData] Lazily create .obsm on disk

Hi!

Is there a way to create a large .obsm on disk for an existing AnnData object without holding it in memory? My use case would look something like this:

import numpy as np
import anndata as ad

handle = "tmp.h5ad"
n_obs = 100000
n_var = 2000

# make some dummy AD, write out and read back in as file backed
mat = ad.AnnData(np.random.rand(n_obs, n_var))

mat.write(handle)
mat = ad.read_h5ad(handle, backed=True)

# hypothetically, the user would now switch to a lower-memory machine

# Q: how can I avoid memory allocation in this step?
mat.obsm["test"] = np.random.rand(n_obs, n_var) 

As for why: I would like to create dimensionality reductions under the assumption that even just storing the coordinates is too much for memory. Obviously, creating the AnnData object takes more memory than that, but I imagine a user may want to switch machines between creating the object and working with it, and so might run into this issue.

Thanks! Best,

Jesko

I think you should be able to create a new array inside the obsm group of the on-disk anndata object, label it with the appropriate metadata (e.g. f["obsm"]["array"].attrs.update({"encoding-type": "array", "encoding-version": "0.2.0"})), and then be able to read it from somewhere else.

Happy to go into more detail if that wasn’t clear!
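To make the metadata step concrete, here is a minimal sketch using h5py only. The temporary file, the tiny array, and the dataset name "array" are placeholders for illustration. The file is a stand-in that contains nothing but an obsm group, so it is not a complete .h5ad. Note that h5py attributes are set key by key (or with .update), not by assigning a dict to .attrs.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in file; a real .h5ad would already contain an obsm group.
path = os.path.join(tempfile.mkdtemp(), "demo_attrs.h5ad")

with h5py.File(path, "w") as f:
    obsm = f.create_group("obsm")
    # Tiny placeholder array, only here so there is something to label.
    dset = obsm.create_dataset("array", data=np.zeros((4, 2), dtype="float32"))
    # Encoding metadata as suggested above.
    dset.attrs["encoding-type"] = "array"
    dset.attrs["encoding-version"] = "0.2.0"

# Re-open read-only and collect the attributes into a plain dict.
with h5py.File(path, "r") as f:
    labels = dict(f["obsm"]["array"].attrs)
```

After this, labels holds the two encoding entries that were attached to the dataset.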

Thanks for your help, that sounds like a good approach!

However, I am not yet sure how I would initialize the new array inside the obsm group without allocating memory. In my example above the call to np.random.rand(n_obs, n_var) does use this memory. In other words, I am looking for a way to express f["obsm"]["array"] = <empty array of shape (n_obs, n_var)> with as little memory usage as possible. Any idea how I could achieve that?

Sure! You’d just go through h5py or zarr (whichever you’re using) here. So like:

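import h5py

# open the existing .h5ad in append mode
f = h5py.File(path_to_h5ad, "a")
# allocate the dataset on disk without building it in memory first
f["obsm"].create_dataset(
    "array",
    shape=(n_obs, n_var),
    dtype=...,
    chunks=...,
    compression=...
)
f.close()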

The h5py docs on datasets should help a bit more with the specifics.

You could probably also use dask if you wanted, but that may take a little more figuring out.
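For completeness, a rough sketch of the whole flow with h5py: allocate the dataset on disk, label it, then fill it one block of rows at a time so that only a single block is ever held in memory. The temporary path, the dataset name "test", the sizes, the float32 dtype, the chunk shape, and gzip compression are all assumptions picked for illustration. In practice you would open your existing .h5ad, and the random blocks would be replaced by whatever computes your coordinates.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical path; in practice this would be an existing .h5ad file.
path = os.path.join(tempfile.mkdtemp(), "demo_obsm.h5ad")
n_obs, n_dim = 10_000, 50  # small illustrative sizes
block = 1_000              # number of rows written per step

with h5py.File(path, "a") as f:
    # A real .h5ad already has this group; require_group creates it if missing.
    obsm = f.require_group("obsm")
    # Allocates the dataset on disk; no (n_obs, n_dim) array is built in memory.
    dset = obsm.create_dataset(
        "test",
        shape=(n_obs, n_dim),
        dtype="float32",
        chunks=(block, n_dim),
        compression="gzip",
    )
    dset.attrs["encoding-type"] = "array"
    dset.attrs["encoding-version"] = "0.2.0"

    # Fill block by block; the random data stands in for real coordinates.
    rng = np.random.default_rng(0)
    for start in range(0, n_obs, block):
        stop = min(start + block, n_obs)
        dset[start:stop] = rng.random((stop - start, n_dim), dtype=np.float32)

# Read back only a small slice to check the result without loading everything.
with h5py.File(path, "r") as f:
    stored_shape = f["obsm"]["test"].shape
    first_rows = f["obsm"]["test"][:5]
```

Matching the chunk shape to the write block keeps each write aligned to whole chunks, but that is a design choice rather than a requirement.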

Right! For some reason I was stuck in the frame of mind of having to go through AnnData for this. Using h5py directly is perfect here, thanks a lot!
