Converting tab-delimited files to adata in a memory-efficient way

jorvis · February 13, 2024, 7:45pm

We are processing matrix files with 30k+ rows and 700k+ columns. Converting them like this can take 20+ hours and 500GB+ of RAM to create the h5ad file:

    import pandas as pd
    import scanpy as sc

    adata = sc.read('expression.tab', first_column_names=True, cache=False).transpose()
    adata.var = pd.read_table('genes.tab', sep='\t', index_col=0, header=0)
    adata.obs = pd.read_table('observations.tab', sep='\t', index_col=0, header=0)

These have a LOT of zeros, so I was hoping to load X in chunks as sparse matrices to save memory before the write to disk. When I try it this way:

    # create empty AnnData object (var and obs read in the same as before)
    adata = sc.AnnData(obs=var, var=obs)
    reader = pd.read_csv(matrix_file_path, sep='\t', index_col=0, chunksize=500)

    for chunk in reader:
        chunk_sparse = sparse.csr_matrix(chunk.values)
        if hasattr(adata, 'X'):
            adata.X = sparse.vstack([adata.X, chunk_sparse])
        else:
            adata.X = chunk_sparse

    adata.write(args.output_file)

On a smaller test file, this ends with the error:

adata.X = sparse.vstack([adata.X, chunk_sparse]):
ValueError: Data matrix has wrong shape (500, 53616), need to be (15182, 53616).

Is there a best practice for getting through this without 1TB of RAM?

File content samples:

$ head -n 3 genes.tab 
gene	gene_symbol
ENSG00000175899	A2M
ENSG00000278540	ACACA

$ cut -f 1-3 observations.tab | head -n 3
observations	chip	cell_id
X2	T151	2
X3	T151	3

$ cut -f 1-4 expression.tab | head -n 3
	X2	X3	X5
ENSG00000175899	0.69	0.69	0
ENSG00000278540	0.69	0	0

ivirshup · February 15, 2024, 4:55pm

jorvis:

    for chunk in reader:
        chunk_sparse = sparse.csr_matrix(chunk.values)
        if hasattr(adata, 'X'):
            adata.X = sparse.vstack([adata.X, chunk_sparse])
        else:
            adata.X = chunk_sparse

anndata wants the matrix you assign to X to have the shape (n_obs, n_var). Assigning a single 500-row chunk breaks that, which is what gives you the error.

If you create the full matrix first and then assign it, you should be fine. I would suggest:

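    # var and obs are read in the same way as before
    adata = sc.AnnData(obs=var, var=obs)
    reader = pd.read_csv(matrix_file_path, sep='\t', index_col=0, chunksize=500)
    # convert each chunk to sparse, then stack once so X gets its full shape on assignment
    adata.X = sparse.vstack([sparse.csr_matrix(chunk.values) for chunk in reader])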

Iteratively concatenating the arrays can use more memory, since more intermediate matrices may be hanging around at once. I would suggest doing it all at once, as in my example.
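
For anyone who wants to try the idea without the full files, here is a minimal self-contained sketch of the same chunk-then-stack approach. The helper name `read_tab_as_sparse`, the `dtype` parameter and its float32 default are assumptions for illustration, not part of scanpy or anndata, and the tiny in-memory table just mirrors the `expression.tab` sample from the question.

```python
import io

import numpy as np
import pandas as pd
from scipy import sparse


def read_tab_as_sparse(path_or_buffer, chunksize=500, dtype=np.float32):
    """Hypothetical helper: read a genes x cells tab-delimited matrix in row
    chunks and return (csr_matrix, row_labels, column_labels)."""
    reader = pd.read_csv(path_or_buffer, sep="\t", index_col=0, chunksize=chunksize)
    blocks, row_labels, col_labels = [], [], None
    for chunk in reader:
        if col_labels is None:
            col_labels = chunk.columns.to_numpy()
        row_labels.extend(chunk.index)
        # Only one dense chunk exists at a time; just its sparse copy is kept.
        # float32 is an assumption to halve memory; pass np.float64 if needed.
        blocks.append(sparse.csr_matrix(chunk.to_numpy(dtype=dtype)))
    # Stack once at the end instead of re-concatenating on every iteration.
    matrix = sparse.vstack(blocks, format="csr")
    return matrix, np.asarray(row_labels), col_labels


# Tiny stand-in for expression.tab (rows are genes, columns are observations).
demo = io.StringIO(
    "\tX2\tX3\tX5\n"
    "ENSG00000175899\t0.69\t0.69\t0\n"
    "ENSG00000278540\t0.69\t0\t0\n"
)
matrix, genes, cells = read_tab_as_sparse(demo, chunksize=1)
print(matrix.shape, matrix.nnz)
```

The result is still genes x cells, like the file. If you want the usual cells x genes layout that the original `sc.read(...).transpose()` call produced, something like `anndata.AnnData(X=matrix.T.tocsr(), obs=obs, var=var)` should work, with `obs` and `var` read from `observations.tab` and `genes.tab` as before.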

jorvis · February 17, 2024, 9:13am

This is great, thank you. One test on a smaller dataset produced a file successfully this way, and a medium-sized test has now been going for 12+ hours. It is still running, but memory usage is 10% of what it was previously, so great progress.

ivirshup · February 21, 2024, 2:22pm

Good to hear! FWIW, I think you could get away with using larger chunk sizes for greater performance.
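
As a rough way to pick a chunk size, the dense part of each chunk costs about rows x columns x bytes per value. The sketch below is only back-of-the-envelope arithmetic: the function names are made up for illustration, the 8 bytes per value assumes pandas parses the values as float64, the 16 GB budget is a hypothetical figure, and parser overhead and the accumulated sparse blocks are ignored.

```python
def dense_chunk_bytes(n_rows, n_cols, itemsize=8):
    """Approximate size of one dense chunk: rows x columns x bytes per value."""
    return n_rows * n_cols * itemsize


def rows_for_budget(budget_bytes, n_cols, itemsize=8):
    """Largest chunk size (in rows) whose dense values fit within the budget."""
    return max(1, budget_bytes // (n_cols * itemsize))


n_cols = 700_000  # the "700k+ columns" mentioned in the original post

# chunksize=500 already means roughly 2.8 GB of dense values per chunk.
print(dense_chunk_bytes(500, n_cols) / 1e9)  # 2.8

# Rows per chunk that would fit in a hypothetical 16 GB budget.
print(rows_for_budget(16 * 10**9, n_cols))  # 2857
```

So on the full-size files there is some room to go above 500 rows per chunk, but not by orders of magnitude unless a lot of memory is free.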
