Converting tab-delimited files to adata in a memory-efficient way

We are processing matrix files with 30k+ rows and 700k+ columns, and converting them like this can take 20+ hours and 500GB+ of RAM to produce the h5ad file:

    import scanpy as sc
    import pandas as pd

    adata = sc.read('', first_column_names=True, cache=False).transpose()
    adata.var = pd.read_table('', sep='\t', index_col=0, header=0)
    adata.obs = pd.read_table('', sep='\t', index_col=0, header=0)

These have a LOT of zeros, so I was hoping to chunk-load X with sparse matrices to possibly save on memory before the write to disk. When I try it this way:

    from scipy import sparse

    # create empty AnnData object (var and obs read in the same as before)
    adata = sc.AnnData(obs=var, var=obs)
    reader = pd.read_csv(matrix_file_path, sep='\t', index_col=0, chunksize=500)

    for chunk in reader:
        chunk_sparse = sparse.csr_matrix(chunk.values)
        if hasattr(adata, 'X'):
            adata.X = sparse.vstack([adata.X, chunk_sparse])
        else:
            adata.X = chunk_sparse


it ends with the following error on a smaller test file:

    adata.X = sparse.vstack([adata.X, chunk_sparse])
    ValueError: Data matrix has wrong shape (500, 53616), need to be (15182, 53616).

Is there a best practice for getting through this without 1TB of RAM?

File content samples:

    $ head -n 3 
    gene	gene_symbol
    ENSG00000175899	A2M
    ENSG00000278540	ACACA

    $ cut -f 1-3 | head -n 3
    observations	chip	cell_id
    X2	T151	2
    X3	T151	3

    $ cut -f 1-4 | head -n 3
    	X2	X3	X5
    ENSG00000175899	0.69	0.69	0
    ENSG00000278540	0.69	0	0

anndata wants the matrix you are assigning to X to have the shape (n_obs, n_var), which is what is giving you the error: each 500-row chunk fails that check on its own.
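As a toy illustration of the shape check (made-up sizes, not the real data): a single chunk can never match (n_obs, n_var), but the stack of all chunks can:

```python
from scipy import sparse

# Toy sizes, not the real data: .X must be exactly (n_obs, n_var)
n_obs, n_var, chunksize = 1500, 40, 500

# Each chunk alone has only `chunksize` rows, so assigning it to .X fails
chunks = [sparse.random(chunksize, n_var, density=0.1, format="csr")
          for _ in range(n_obs // chunksize)]
print(chunks[0].shape)            # (500, 40): too short for the shape check

# Stacking all chunks yields the full-height matrix that passes the check
full = sparse.vstack(chunks, format="csr")
print(full.shape)                 # (1500, 40)
```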

If you create the full matrix first and then assign it, you should be fine. I would suggest:

    adata = sc.AnnData(obs=var, var=obs)
    reader = pd.read_csv(matrix_file_path, sep='\t', index_col=0, chunksize=500)
    adata.X = sparse.vstack([sparse.csr_matrix(chunk.values) for chunk in reader])

Iteratively concatenating the arrays has the potential to use more memory, since intermediate copies of the growing matrix can still be hanging around. I would suggest doing it all at once, as in my example.
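A small sketch of the difference (toy sizes): repeated vstack re-copies the entire accumulated matrix on every iteration, so total copying grows quadratically with the number of chunks, while a single vstack copies each chunk's data once:

```python
from scipy import sparse

# Toy chunks standing in for the reader's output
chunks = [sparse.random(500, 40, density=0.1, format="csr") for _ in range(10)]

# Iterative: chunk i gets re-copied on every later iteration
acc = chunks[0]
for c in chunks[1:]:
    acc = sparse.vstack([acc, c], format="csr")

# One-shot: a single pass over the data
once = sparse.vstack(chunks, format="csr")
print(acc.shape == once.shape)    # True: same result, different peak memory
```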

This is great, thank you. I ran one test on a smaller dataset, which produced a file successfully this way and worked well, and I have a medium-sized test that has been running for the last 12+ hours. Memory usage is down to about 10% of what it was previously, though, so great progress.


Good to hear! FWIW, I think you could get away with using larger chunk sizes for better performance.
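For instance, the same pipeline against a hypothetical in-memory stand-in for the matrix file (the `io.StringIO` data below is made up): a larger `chunksize` means fewer Python-level iterations and fewer intermediate CSR blocks to stack, at the cost of holding a bigger dense chunk in memory at a time.

```python
import io
import pandas as pd
from scipy import sparse

# Made-up tab-delimited data standing in for the real matrix file
tsv = "id\tX2\tX3\n" + "\n".join(f"g{i}\t{i % 2}\t0" for i in range(10))

# chunksize here is large relative to the 10-row file; scale up accordingly
reader = pd.read_csv(io.StringIO(tsv), sep="\t", index_col=0, chunksize=5)
X = sparse.vstack([sparse.csr_matrix(chunk.values) for chunk in reader],
                  format="csr")
print(X.shape)    # (10, 2)
```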