Reading in a 1.1 million cell HDF5 dataset


I am trying to read in a huge dataset from the Allen atlas: the gene expression matrix (HDF5), a 5.3 GB file. I tried the script below, but even on UC Berkeley's Savio computational cluster it got stuck at the `counts =` step, and I paused it after 10 hours. Is there a more efficient way to read the data in, or should I just let it run for much longer than 10 hours?

Thank you!

```python
from scipy.sparse import csr_matrix
import numpy as np
import h5py
import scanpy as sc
import anndata

hf = h5py.File('expression_matrix.hdf5', 'r')

counts = csr_matrix(np.array(hf.get('data').get('counts')))
genes = list(hf.get('data').get('gene'))
cells = list(hf.get('data').get('samples'))

adata = anndata.AnnData(counts.transpose())

adata.obs.index = cells
adata.var.index = genes
```

What data format is this?

@salwanbutrus I checked and it's an HDF5 dataset. I'm afraid there's not much you can do (given what I understand). The authors should have saved an HDF5 file with the components to build a sparse array instead of the dense array.
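For context, "the components to build a sparse array" means storing the three CSR arrays (`data`, `indices`, `indptr`) rather than the dense matrix. A minimal sketch with scipy (the small matrix here is just a stand-in, not anything from the Allen file):

```python
import numpy as np
from scipy import sparse

# A tiny dense matrix standing in for an expression matrix
dense = np.array([[0, 2, 0],
                  [1, 0, 0],
                  [0, 0, 3]])
mat = sparse.csr_matrix(dense)

# These three arrays fully describe the CSR matrix, and are what an
# HDF5 file could store instead of the dense array:
print(mat.data)     # non-zero values:            [2 1 3]
print(mat.indices)  # column of each non-zero:    [1 0 2]
print(mat.indptr)   # row boundaries into data:   [0 1 2 3]

# Rebuilding the matrix from the components:
rebuilt = sparse.csr_matrix((mat.data, mat.indices, mat.indptr), shape=dense.shape)
assert (rebuilt.toarray() == dense).all()
```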

Though it may work to write a for loop and build an AnnData for each chunk of cells (e.g., 100 cells), then concatenate the AnnDatas together (as they will all hold sparse data). Holding the whole dataset dense in memory is probably what's causing the issue.
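A rough sketch of that chunking idea, using a plain NumPy array as a stand-in for the h5py dataset (an h5py `Dataset` slices the same way); wrapping each sparse chunk in an AnnData and concatenating would follow the same pattern:

```python
import numpy as np
from scipy import sparse

def dense_to_csr_in_chunks(dataset, chunk_size=100):
    """Convert a row-sliceable dense dataset to CSR, chunk_size rows at a time."""
    chunks = []
    for start in range(0, dataset.shape[0], chunk_size):
        # Only chunk_size rows are ever dense in memory at once
        chunks.append(sparse.csr_matrix(dataset[start:start + chunk_size]))
    return sparse.vstack(chunks, format="csr")

# Stand-in for hf["data"]["counts"]
fake_counts = np.random.default_rng(0).poisson(0.1, size=(250, 40))
counts = dense_to_csr_in_chunks(fake_counts, chunk_size=100)
assert (counts.toarray() == fake_counts).all()
```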

> Though it may work to write a for loop and make the anndata for a chunk of cells (e.g., 100 cells).

I think I would probably just work with the arrays directly here, then construct the anndata from that. anndata already has some code for doing this internally, which you could just copy:

That code:
```python
from scipy import sparse

def idx_chunks_along_axis(shape: tuple, axis: int, chunk_size: int):
    """Gives indexer tuples chunked along an axis.

    Parameters
    ----------
    shape
        Shape of array to be chunked
    axis
        Axis to chunk along
    chunk_size
        Size of chunk along axis

    Returns
    -------
    An iterator of tuples for indexing into an array of passed shape.
    """
    total = shape[axis]
    cur = 0
    mutable_idx = [slice(None) for i in range(len(shape))]
    while cur + chunk_size < total:
        mutable_idx[axis] = slice(cur, cur + chunk_size)
        yield tuple(mutable_idx)
        cur += chunk_size
    mutable_idx[axis] = slice(cur, None)
    yield tuple(mutable_idx)

def read_dense_as_csr(dataset, axis_chunk=6000):
    sub_matrices = []
    for idx in idx_chunks_along_axis(dataset.shape, 0, axis_chunk):
        dense_chunk = dataset[idx]
        sub_matrix = sparse.csr_matrix(dense_chunk)
        sub_matrices.append(sub_matrix)  # collect each chunk so it can be stacked
    return sparse.vstack(sub_matrices, format="csr")

counts = read_dense_as_csr(hf["data"]["counts"])
```
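One more thing to watch for when setting `adata.obs.index` and `adata.var.index`: h5py usually returns string datasets as `bytes`, so the gene and sample names may need decoding first (whether this particular file stores them as bytes is an assumption):

```python
# h5py string datasets typically come back as bytes objects;
# the gene names below are a stand-in for hf["data"]["gene"][:]
raw_genes = [b"Gad1", b"Slc17a7", b"Pvalb"]

genes = [g.decode("utf-8") for g in raw_genes]
print(genes)  # ['Gad1', 'Slc17a7', 'Pvalb']
```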

Great, thank you both!