AnnData X memory layout

Hi,

I imagine this is a question for @ivirshup, but I'm happy for any answer.

Specifically, I was wondering about the memory layout of data stored in the AnnData.X slot. My understanding was that, on disk, data is stored column-major (F-CONTIGUOUS). However, AnnData seems to hold data in memory as C-CONTIGUOUS, i.e. row-major.

Example:

import numpy as np
from anndata import AnnData, read_h5ad

X = np.array(np.random.randn(10, 5), order="F")
print(X.flags.f_contiguous)
# True
AnnData(X=X).write("tmp.h5ad")
print(read_h5ad("tmp.h5ad").X.flags)
#   C_CONTIGUOUS : True
#   F_CONTIGUOUS : False

My question is: is this intentional? I am aware that not all numpy functions deal well with column-major arrays, so I could envision this being deliberate. However, I often find myself performing column-wise operations (e.g. standardization). Example use case:

def scaler(x: np.ndarray) -> None:
    # Standardize each column in place (mean 0, std 1).
    x -= x.mean(axis=0)
    x /= x.std(axis=0)

c_arr = np.random.random((10_000_000, 10))
f_arr = np.array(c_arr, order="F")

%time scaler(c_arr)
# Wall time: 528 ms

%time scaler(f_arr)
# Wall time: 271 ms

Of course, other functionality might benefit from a C-CONTIGUOUS layout. But even PCA seems to benefit from the F-CONTIGUOUS layout:

from sklearn.decomposition import PCA
pca = PCA()
%time pca.fit_transform(c_arr)
# Wall time: 1.55 s

%time pca.fit_transform(f_arr)
# Wall time: 1.35 s

I understand these examples are quite artificial; I am just giving these timings to give an idea of why column-major might be interesting in some setups. Could column-major become an option in the future?

Cheers,
Jesko

In general, I think we try to respect the memory order the user has set for arrays in an AnnData object (e.g. we don't intentionally switch from column- to row-major, IIRC). But I think the issue here is that HDF5 itself is row-major.

We may be able to support "F" ordered arrays for zarr. I think this would require us explicitly passing order to zarr, since zarr.create defaults to "C".
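
A minimal sketch of what passing the order through might look like, using the zarr v2 API (the shapes and chunking here are just for illustration):

import numpy as np
import zarr

# Sketch: zarr.create accepts an order argument ("C" or "F"), so
# supporting Fortran-ordered arrays would mean forwarding it
# explicitly instead of relying on the "C" default.
z = zarr.create(shape=(10, 5), chunks=(10, 5), dtype="f8", order="F")
z[:] = np.random.randn(10, 5)
print(z.order)
# F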

But maybe zarr will stop letting users specify the dimension order: Remove 'order' from Specs and make 'C' default · Issue #126 · zarr-developers/zarr-specs · GitHub

Interesting, clearly I did not remember the HDF5 format right. It seems that how the layout is interpreted depends on the language; see for example this Julia discussion: Multi-dimensional arrays have reversed order if loaded by h5py in Python3 · Issue #785 · JuliaIO/HDF5.jl · GitHub
The discussion there is quite instructive, I believe.

I think, given the Zarr issue you linked, integrating support for column-major through Zarr seems unwise.

The only solution I can think of is to add an option to read_h5ad to read in column-major, which would trigger np.asfortranarray during reading (and the reverse conversion during writing).
But I see how this would be tricky with backed data, and it adds computational overhead to IO, not least in memory. So perhaps this is out of scope unless HDF5 introduces a flag for this.
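
As a rough sketch of what I mean (read_h5ad_fortran is a made-up name, not an existing API; the conversion copies X once, which is exactly the overhead I mentioned):

import numpy as np
from anndata import read_h5ad

def read_h5ad_fortran(path):
    # Hypothetical helper: read the row-major data from disk, then
    # convert a dense X to column-major in memory (one full copy).
    adata = read_h5ad(path)
    if isinstance(adata.X, np.ndarray):
        adata.X = np.asfortranarray(adata.X)
    return adata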

This might be infeasible, but could one somehow do it through the chunking of HDF5 datasets?

I.e. one would expose options to write X with specified chunk sizes (I think this is not yet the case, but it might be useful in any case?). Then, during reading, if the chunks are much longer than they are wide, trigger conversion to column-major (this would need very clear documentation though).
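
A rough sketch of that heuristic with plain h5py (the file name and the "taller than wide" test are made up; this bypasses AnnData entirely and is only meant to illustrate the idea):

import h5py
import numpy as np

# Write X with tall, narrow chunks as a hint that columns are read
# together.
with h5py.File("tmp_chunked.h5", "w") as f:
    f.create_dataset("X", data=np.random.randn(10_000, 10), chunks=(10_000, 1))

# On read, use the chunk shape to decide whether to convert.
with h5py.File("tmp_chunked.h5", "r") as f:
    dset = f["X"]
    X = dset[...]
    if dset.chunks is not None and dset.chunks[0] > dset.chunks[1]:
        X = np.asfortranarray(X)

print(X.flags.f_contiguous)
# True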

I think it would be fairly easy for us to say the order we expect the dimensions to be in. E.g. I could just add some metadata saying the array is meant to be in Fortran order, or really any dimension order. I think this is broadly "fine" in Python, since numpy supports arbitrary array order.
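
For example, something along these lines (the attribute name "order" is made up and not part of the AnnData on-disk format):

import h5py
import numpy as np

# Hypothetical: record the intended in-memory order as an HDF5
# attribute on X, and honour it on read.
with h5py.File("tmp.h5ad", "r+") as f:
    f["X"].attrs["order"] = "F"

with h5py.File("tmp.h5ad", "r") as f:
    X = f["X"][...]
    if f["X"].attrs.get("order", "C") == "F":
        X = np.asfortranarray(X)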

My main concern would be about interoperability with other languages.

Interoperability definitely outweighs any performance benefits here. In my mind, the data on disk would have to stay row-major, with the option to add a flag (maybe in uns?) if column-major is preferred in memory. If so, AnnData could automatically convert the array when needed. But of course, with backed files that would again come with quite some overhead, so it might be no good.