We are processing matrix files with 30k+ rows and 700k+ columns, and converting them like this can take 20+ hours and 500GB+ of RAM to produce the h5ad file:
import scanpy as sc
import pandas as pd

adata = sc.read('expression.tab', first_column_names=True, cache=False).transpose()
adata.var = pd.read_table('genes.tab', sep='\t', index_col=0, header=0)
adata.obs = pd.read_table('observations.tab', sep='\t', index_col=0, header=0)
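For scale, a rough back-of-envelope for the dense float64 footprint of a matrix that size, before the transpose or the write makes any intermediate copies:

rows, cols = 30_000, 700_000
print(rows * cols * 8 / 1e9)   # ~168 GB of float64 for the values alone, before any copies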
These matrices have a LOT of zeros, so I was hoping to chunk-load X as sparse matrices to save memory before the write to disk. When I try it this way:
from scipy import sparse

# create empty AnnData object (var and obs read in the same as before)
adata = sc.AnnData(obs=var, var=obs)

reader = pd.read_csv(matrix_file_path, sep='\t', index_col=0, chunksize=500)
for chunk in reader:
    chunk_sparse = sparse.csr_matrix(chunk.values)
    if hasattr(adata, 'X'):
        adata.X = sparse.vstack([adata.X, chunk_sparse])
    else:
        adata.X = chunk_sparse

adata.write(args.output_file)
On a smaller test file, this ends with the following error:
adata.X = sparse.vstack([adata.X, chunk_sparse]):
ValueError: Data matrix has wrong shape (500, 53616), need to be (15182, 53616).
Is there a best practice for getting through this without 1TB of RAM?
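To be clear about what I was hoping to end up with, here is an untested sketch of the direction I had in mind: collect the CSR chunks in a list and only build the AnnData once X is complete, so the shape check against obs/var never sees a partial matrix. The output filename is a placeholder, and I may still have the obs/var orientation backwards, mirroring the transpose in the dense approach:

import numpy as np
import pandas as pd
import scanpy as sc
from scipy import sparse

var = pd.read_table('genes.tab', sep='\t', index_col=0, header=0)
obs = pd.read_table('observations.tab', sep='\t', index_col=0, header=0)

# accumulate CSR chunks in a list, stack once at the end
chunks = []
for chunk in pd.read_csv('expression.tab', sep='\t', index_col=0, chunksize=500):
    chunks.append(sparse.csr_matrix(chunk.values, dtype=np.float32))
X = sparse.vstack(chunks, format='csr')

# the file has genes as rows, so build genes x observations, then transpose
# to match the .transpose() in the dense approach
adata = sc.AnnData(X=X, obs=var, var=obs).transpose()
adata.write('expression.h5ad')  # placeholder output path

The idea is that the vstack happens once at the end instead of re-copying a growing matrix on every chunk, and float32 halves the memory for the non-zero values. I'm still not sure whether holding all the chunks plus the stacked result will fit for the full-size files, hence the question.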
File content samples:
$ head -n 3 genes.tab
gene gene_symbol
ENSG00000175899 A2M
ENSG00000278540 ACACA
$ cut -f 1-3 observations.tab | head -n 3
observations chip cell_id
X2 T151 2
X3 T151 3
$ cut -f 1-4 expression.tab | head -n 3
X2 X3 X5
ENSG00000175899 0.69 0.69 0
ENSG00000278540 0.69 0 0