Related to the topic of large datasets: I have been working with datasets of millions of cells, training scVI models on all human genes, and I keep hitting out-of-memory errors before training even starts, when the scVI constructor is called, i.e.
model = scvi.model.SCVI(adata, n_layers=n_layers, n_latent=n_latent, n_hidden=n_hidden, gene_likelihood="nb")
runs out of memory (even on machines with as much as 384 GB of RAM). Is there a workaround for this, or am I missing something? The AnnData object is in backed mode:
adata = ad.read_h5ad(h5ad_file, backed="r+")
Is the constructor loading the entire dataset into memory for some reason?
I previously commented on this in this thread but never got a response:
I tried training on a dataset from Braun et al. 2022 (Science) on a machine with 32 GB of RAM, but I ran into the same memory issue. Downsampling the number of cells helped, but it had some influence on the mapping.
Can you explicitly define a size_factor in model.setup_anndata? You can set it to the counts per cell. Currently, scVI loads the full data into memory to infer the library size. We are aware of this issue and it will be fixed soon-ish.
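A minimal sketch of what that could look like (the 10,000-row chunk size and the "size_factor" obs column name are arbitrary choices here, not scvi-tools requirements): compute the per-cell total counts in chunks so the backed matrix is never fully loaded, store them in adata.obs, and pass the column name via size_factor_key to setup_anndata.

import numpy as np
import anndata as ad
import scvi

adata = ad.read_h5ad(h5ad_file, backed="r+")

# Sum counts per cell in row chunks so the full matrix stays on disk.
chunk_size = 10_000
size_factors = np.zeros(adata.n_obs)
for start in range(0, adata.n_obs, chunk_size):
    end = min(start + chunk_size, adata.n_obs)
    block = adata.X[start:end]  # reads only this slice from the backed file
    size_factors[start:end] = np.asarray(block.sum(axis=1)).ravel()

adata.obs["size_factor"] = size_factors

# Register the precomputed size factors so scVI does not need the full
# matrix in memory to infer library sizes.
scvi.model.SCVI.setup_anndata(adata, size_factor_key="size_factor")
model = scvi.model.SCVI(adata, n_layers=n_layers, n_latent=n_latent, n_hidden=n_hidden, gene_likelihood="nb")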
Hi @cane11, thanks, I'll try this. If I have previously trained models using the default (with a smaller training dataset) and I want to do apples-to-apples comparisons of all of these models, I would need to retrain those with the new size factor, correct?
The difference is very marginal (the activation function of the output layer is softplus when using a size factor, while it is softmax by default otherwise). I would still recommend retraining if you want to make sure they are fully comparable.