Hello,
I am trying to read in a huge dataset from the Allen atlas. The file is the "Gene expression matrix (HDF5)", 5.3 GB. I tried the following script, but even on UC Berkeley's Savio computational cluster I'm pausing it after 10 hours, still stuck at the "counts =" step. Is there a more efficient way to read the data in, or should I just let it run for much longer than 10 hours?
Thank you!
```python
from scipy.sparse import csr_matrix
import numpy as np
import h5py
import scanpy as sc
import anndata

hf = h5py.File('expression_matrix.hdf5', 'r')
counts = csr_matrix(np.array(hf.get('data').get('counts')))
genes = list(hf.get('data').get('gene'))
cells = list(hf.get('data').get('samples'))
adata = anndata.AnnData(counts.transpose())
adata.obs.index = cells
adata.var.index = genes
```
What data format is this?
@salwanbutrus I checked and it's an HDF5 dataset. I'm afraid there's not much you can do (as far as I understand). The authors should have saved an HDF5 file with the components needed to build a sparse array, instead of the dense array.
It may work, though, to write a for loop that builds an AnnData for a chunk of cells at a time (e.g., 100 cells) and then concatenates the AnnDatas together (since they will all hold data in sparse format). Keeping that whole dataset dense in memory is probably what is causing the issue. A rough sketch of that idea is below.
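A minimal sketch, assuming the matrix is stored genes × cells under `data/counts` with names under `data/gene` and `data/samples`, as in the script above; the 10,000-cell chunk size is an arbitrary choice:

```python
import anndata
import h5py
from scipy import sparse

with h5py.File('expression_matrix.hdf5', 'r') as hf:
    counts_dset = hf['data']['counts']       # assumed genes x cells on disk
    genes = list(hf['data']['gene'][:])      # may need .decode() if stored as bytes
    cells = list(hf['data']['samples'][:])   # may need .decode() if stored as bytes

    chunk_size = 10_000  # arbitrary; tune to available memory
    adatas = []
    for start in range(0, counts_dset.shape[1], chunk_size):
        stop = min(start + chunk_size, counts_dset.shape[1])
        # Read only this block of cells, transpose to cells x genes, then sparsify
        block = sparse.csr_matrix(counts_dset[:, start:stop].T)
        chunk = anndata.AnnData(block)
        chunk.obs_names = cells[start:stop]
        chunk.var_names = genes
        adatas.append(chunk)

adata = anndata.concat(adatas)  # stack the chunks along the cell (obs) axis
```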
> Though it may work to write a for loop and make the anndata for a chunk of cells (e.g., 100 cells).
I think I would probably just work with the arrays directly here, then construct the AnnData from that. anndata already has some code for doing this internally, which you could just copy. That code:
```python
from scipy import sparse


def idx_chunks_along_axis(shape: tuple, axis: int, chunk_size: int):
    """\
    Gives indexer tuples chunked along an axis.

    Params
    ------
    shape
        Shape of array to be chunked
    axis
        Axis to chunk along
    chunk_size
        Size of chunk along axis

    Returns
    -------
    An iterator of tuples for indexing into an array of passed shape.
    """
    total = shape[axis]
    cur = 0
    mutable_idx = [slice(None) for i in range(len(shape))]
    while cur + chunk_size < total:
        mutable_idx[axis] = slice(cur, cur + chunk_size)
        yield tuple(mutable_idx)
        cur += chunk_size
    mutable_idx[axis] = slice(cur, None)
    yield tuple(mutable_idx)
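
# For example, with a hypothetical (10, 3) array, idx_chunks_along_axis((10, 3), 0, 4)
# yields indexers selecting rows 0-3, then 4-7, then 8-9, each over all columns.
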
def read_dense_as_csr(dataset, axis_chunk=6000):
    sub_matrices = []
    for idx in idx_chunks_along_axis(dataset.shape, 0, axis_chunk):
        dense_chunk = dataset[idx]
        sub_matrix = sparse.csr_matrix(dense_chunk)
        sub_matrices.append(sub_matrix)
    return sparse.vstack(sub_matrices, format="csr")

counts = read_dense_as_csr(hf["data"]["counts"])
```
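With `counts` read in as a sparse matrix, the rest of the original script should work essentially unchanged. Roughly, continuing from the open `hf` handle and imports in the question (and again assuming the on-disk layout is genes × cells):

```python
import anndata

genes = list(hf["data"]["gene"][:])      # may need .decode() if stored as bytes
cells = list(hf["data"]["samples"][:])

# Transpose so that rows (obs) are cells and columns (var) are genes,
# keeping the matrix sparse throughout.
adata = anndata.AnnData(counts.T.tocsr())
adata.obs_names = cells
adata.var_names = genes
```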