[AnnData] Lazily create .obsm on disk

jeskowagner · May 5, 2022, 1:45pm

Hi!

Is there a way to create a large .obsm on disk for an existing AnnData object without holding it in memory? My use case would look something like this:

import numpy as np
import anndata as ad

handle = "tmp.h5ad"
n_obs = 100000
n_var = 2000

# make some dummy AD, write out and read back in as file backed
mat = ad.AnnData(np.random.rand(n_obs, n_var))

mat.write(handle)
mat = ad.read_h5ad(handle, backed="r")

# hypothetically, the user would now switch to a lower-memory machine

# Q: how can I avoid memory allocation in this step?
mat.obsm["test"] = np.random.rand(n_obs, n_var)

As for why: I would like to compute dimensionality reductions under the assumption that even the coordinates alone are too large to hold in memory. Obviously, creating the AnnData object above needs more memory than that, but I imagine a user may switch machines between creating the object and working with it, and so could run into this issue.

Thanks! Best,

Jesko

ivirshup · May 6, 2022, 1:03pm

I think you should be able to create a new array inside the obsm group of the on-disk anndata object, label it with the appropriate metadata (e.g. f["obsm"]["array"].attrs.update({"encoding-type": "array", "encoding-version": "0.2.0"})), and be good to read it from somewhere else.

Happy to go into more detail if that wasn’t clear!

jeskowagner · May 6, 2022, 2:39pm

Thanks for your help, that sounds like a good approach!

However, I am not yet sure how I would initialize the new array inside the obsm group without allocating memory. In my example above the call to np.random.rand(n_obs, n_var) does use this memory. In other words, I am looking for a way to express f["obsm"]["array"] = <empty array of shape (n_obs, n_var)> with as little memory usage as possible. Any idea how I could achieve that?

ivirshup · May 7, 2022, 4:11pm

Sure! You’d just go through h5py or zarr (whichever you’re using) here. So like:

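import h5py

# open the existing file in append mode so its current contents are kept
f = h5py.File(path_to_h5ad, "a")
f["obsm"].create_dataset(
    "array",
    shape=(n_obs, n_var),  # h5py takes `shape`, not `size`
    dtype=...,
    chunks=...,
    compression=...
)
f.close()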

These docs should help a bit more with the specifics for h5py.
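
To make that a bit more concrete, here is a rough sketch of the whole round trip: create the dataset, attach the metadata mentioned earlier in the thread, and fill it block by block so that only one block of rows is in memory at a time. Everything specific in it is an assumption for illustration: the dataset name "test", the float32 dtype, the gzip compression, the block size of 1000 rows, and the temporary file that only stands in for a real .h5ad (it contains nothing but an empty obsm group).

```python
import os
import tempfile

import h5py
import numpy as np

n_obs, n_dims = 10_000, 50
block = 1_000  # rows written per step (hypothetical value, tune to your memory)

# Stand-in for an existing .h5ad file: only the "obsm" group matters here.
path = os.path.join(tempfile.mkdtemp(), "tmp.h5ad")
with h5py.File(path, "w") as f:
    f.create_group("obsm")

with h5py.File(path, "a") as f:
    # Declares the dataset on disk; no (n_obs, n_dims) array is built in memory.
    dset = f["obsm"].create_dataset(
        "test",
        shape=(n_obs, n_dims),
        dtype="float32",
        chunks=(block, n_dims),
        compression="gzip",
    )
    # Metadata suggested earlier in the thread for a dense array.
    dset.attrs.update({"encoding-type": "array", "encoding-version": "0.2.0"})

    # Fill block by block; only `block` rows are held in memory at once.
    rng = np.random.default_rng(0)
    for start in range(0, n_obs, block):
        stop = min(start + block, n_obs)
        dset[start:stop] = rng.random((stop - start, n_dims), dtype=np.float32)

# Reopen read-only to check what ended up on disk.
with h5py.File(path, "r") as f:
    result_shape = f["obsm/test"].shape
    result_attrs = dict(f["obsm/test"].attrs)
```

In a real workflow the random block would be replaced by whatever computes the coordinates for rows start:stop, and path would point at the actual .h5ad file.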

You could probably also use dask if you wanted, but that may take a little more figuring out.

jeskowagner · May 10, 2022, 10:24am

Right! For some reason I was stuck in the frame of mind of having to go through AnnData for this. Using h5py directly is perfect here, thanks a lot!
