Reading in a 1.1 million cell HDF5 dataset

salwanbutrus · March 25, 2022, 12:08am

Hello,

I am trying to read in a huge dataset from the Allen atlas. File name is Gene expression matrix (HDF5) 5.3 GB. I tried the following script but even on UC Berkeley’s Savio computational cluster, I’m pausing it after 10 hours, stuck at the “counts =” step. Is there a more efficient way to read the data in? Should I just let it run for much longer than 10 hours?

Thank you!

from scipy.sparse import csr_matrix
import numpy as np
import h5py
import scanpy as sc
import anndata

hf = h5py.File('expression_matrix.hdf5', 'r')

counts = csr_matrix(np.array(hf.get('data').get('counts')))
genes = list(hf.get('data').get('gene'))
cells = list(hf.get('data').get('samples'))

adata = anndata.AnnData(counts.transpose())

adata.obs.index = cells
adata.var.index = genes```

adamgayoso · March 25, 2022, 12:48am

What data format is this?

adamgayoso · March 25, 2022, 1:52am

@salwanbutrus I checked and it’s a hdf5dataset. I’m afraid there’s not much you can do (given what I understand). The authors should have saved an hdf5 with the components to build a sparse array instead of the dense array.

Though it may work to write a for loop and make the anndata for a chunk of cells (e.g., 100 cells). Then concatenate the anndatas together (as they will all have sparse format data). That whole dataset as dense in memory is probably causing issues.

ivirshup · March 25, 2022, 12:33pm

Though it may work to write a for loop and make the anndata for a chunk of cells (e.g., 100 cells).

I think I would probably just work with the arrays directly here, then construct the anndata from that. anndata already has some code for doing this internally, which you could just copy:

That code

from scipy import sparse

def idx_chunks_along_axis(shape: tuple, axis: int, chunk_size: int):
    """\
    Gives indexer tuples chunked along an axis.

    Params
    ------
    shape
        Shape of array to be chunked
    axis
        Axis to chunk along
    chunk_size
        Size of chunk along axis

    Returns
    -------
    An iterator of tuples for indexing into an array of passed shape.
    """
    total = shape[axis]
    cur = 0
    mutable_idx = [slice(None) for i in range(len(shape))]
    while cur + chunk_size < total:
        mutable_idx[axis] = slice(cur, cur + chunk_size)
        yield tuple(mutable_idx)
        cur += chunk_size
    mutable_idx[axis] = slice(cur, None)
    yield tuple(mutable_idx)


def read_dense_as_csr(dataset, axis_chunk=6000):
    sub_matrices = []
    for idx in idx_chunks_along_axis(dataset.shape, 0, axis_chunk):
        dense_chunk = dataset[idx]
        sub_matrix = sparse.csr_matrix(dense_chunk)
        sub_matrices.append(sub_matrix)
    return sparse.vstack(sub_matrices, format="csr")


counts = read_dense_as_csr(hf["data"])

salwanbutrus · March 26, 2022, 11:54pm

Great, thank you both!

Topic		Replies	Views
Trouble Reading .h5 files scanpy	1	3269	April 13, 2023
Converting tab-delimited files to adata in a memory-efficient way scanpy	3	221	February 21, 2024
Build a large anndata object column by colum anndata	1	405	September 29, 2022
Current status of dask support (and on disk sparse arrays) scanpy	1	694	May 25, 2023
Read_10x_h5 Error scanpy	1	812	May 23, 2023

Reading in a 1.1 million cell HDF5 dataset

Related topics