Related to the topic of large datasets: I have been working with datasets of millions of cells, training scVI models on all human genes, and I keep hitting out-of-memory errors before training even starts, when the scVI constructor is called, i.e.
model = scvi.model.SCVI(adata, n_layers=n_layers, n_latent=n_latent, n_hidden=n_hidden, gene_likelihood="nb")
runs out of memory (even on machines with as much as 384 GB of RAM). Is there a workaround for this, or am I missing something? The AnnData object is in backed mode:
adata = ad.read_h5ad(h5ad_file, backed="r+")
Is the constructor loading the entire dataset into memory for some reason?
I previously commented on this in this thread but never got a response:
I tried training on a dataset from Braun et al. 2022 (Science) on a machine with 32 GB of RAM, but I ran into the same memory issue. Downsampling the number of cells helped, but it had some influence on the mapping.
Can you explicitly define a size_factor in model.setup_anndata? You can set it to the counts per cell. Currently, scVI loads the full data into memory to infer the library size. We are aware of this issue and it will be fixed soon-ish.
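A minimal sketch of what that could look like (the 10,000-row chunk size and the "size_factor" obs column name are arbitrary choices here, not scvi-tools requirements): compute the per-cell total counts in chunks so the backed matrix is never fully loaded, store them in adata.obs, and pass the column name via size_factor_key to setup_anndata.

import numpy as np
import anndata as ad
import scvi

adata = ad.read_h5ad(h5ad_file, backed="r+")

# Sum counts per cell in row chunks so the full matrix stays on disk.
chunk_size = 10_000
size_factors = np.zeros(adata.n_obs)
for start in range(0, adata.n_obs, chunk_size):
    end = min(start + chunk_size, adata.n_obs)
    block = adata.X[start:end]  # reads only this slice from the backed file
    size_factors[start:end] = np.asarray(block.sum(axis=1)).ravel()

adata.obs["size_factor"] = size_factors

# Register the precomputed size factors so scVI does not need the full
# matrix in memory to infer library sizes.
scvi.model.SCVI.setup_anndata(adata, size_factor_key="size_factor")
model = scvi.model.SCVI(adata, n_layers=n_layers, n_latent=n_latent, n_hidden=n_hidden, gene_likelihood="nb")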
Hi @cane11, thanks, I'll try this. If I have previously trained models using the default (with a smaller training dataset) and I want to do apples-to-apples comparisons of all of these models, I would need to retrain those with the new size factor, correct?
The difference is very marginal (the activation function of the output layer is softplus when using a size factor, while it is softmax by default otherwise). I would still recommend retraining if you want to make sure they are fully comparable.