SCVI tools with large datasets

I have two hardware partitions: one is a large-memory partition on which I can concatenate many AnnDatas into a single large cell-by-gene matrix. The other is a GPU partition with limited memory. How can I load the large AnnData into scvi-tools if it won't fit within the GPU partition's memory limit?

I see that AnnData 0.10 has been released with extended on-disk support. Is there documentation for this, or for other methods, to train an scVI/scANVI model with a "larger-than-memory" counts matrix?

Thanks in advance for any input.

I believe scVI mini-batches the cells it loads in, so the whole dataset shouldn't ever be loaded onto the GPU. Specifically, this paragraph from their publication:

"A second level of stochasticity comes from subsampling from the training set (possible because the cells are identically independently distributed when conditioned on the latent variables). We then have an online optimization procedure that can handle massive datasets, used by both scVI and other methods that exploit neural networks18,19,20,21. At each iteration, we focus only on a small subset of the data randomly sampled (M = 128 data points) and do not need to go through the entire dataset. Therefore, there is no need to store the entire dataset in memory. Because the number of genes is in practice limited to a few tens of thousands, these mini-batches of cells can be handled easily by a GPU. Now, our objective function is continuous and end-to-end differentiable, which allows us to use automatic differentiation operators."

Deep generative modeling for single-cell transcriptomics | Nature Methods
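To make the idea concrete, here is a minimal sketch in plain NumPy (with made-up dataset shapes; M = 128 is from the paper) of the subsampling that paragraph describes: each training step materializes only a random mini-batch of cells, so the full cell-by-gene matrix never needs to sit on the GPU.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up shapes; the in-memory array stands in for the full counts matrix.
n_cells, n_genes, M = 100_000, 2_000, 128

counts = rng.poisson(1.0, size=(n_cells, n_genes)).astype(np.float32)

def next_minibatch():
    """Randomly subsample M cells, as in scVI's stochastic training loop."""
    idx = rng.choice(n_cells, size=M, replace=False)
    # Only this (M, n_genes) slice would be copied to the GPU.
    return counts[idx]

batch = next_minibatch()
print(batch.shape)  # (128, 2000)
```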

Hi crhodes,

I believe everything will run correctly if you read in the AnnData file with the option backed='r+' and then run the setup_anndata method of the SCVI model. Did you try this?

I would expect it to be extremely slow, however, because every mini-batch of data has to move from the hard drive through RAM to the GPU.
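To see why, here is a rough illustration (using h5py directly, with made-up shapes) of what each training step has to do in backed mode: gather a random set of rows from the HDF5 file into RAM before anything can be sent to the GPU. Random row gathers like this from disk are far slower than slicing an in-memory array.

```python
import h5py
import numpy as np

rng = np.random.default_rng(0)

# Create a small stand-in HDF5 file; a real counts matrix would be far larger.
with h5py.File("counts.h5", "w") as f:
    f.create_dataset("X", data=rng.poisson(1.0, size=(1_000, 200)).astype(np.float32))

with h5py.File("counts.h5", "r") as f:
    X = f["X"]
    # h5py fancy indexing requires indices in increasing order.
    idx = np.sort(rng.choice(X.shape[0], size=128, replace=False))
    minibatch = X[idx]  # disk -> RAM copy; this happens for every mini-batch
    print(minibatch.shape)  # (128, 200)
```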

Hope this helps,

Related to the topic of large datasets: I have been working with datasets of millions of cells, training scVI models on all human genes, and I have been hitting out-of-memory errors before training even starts, when the scVI constructor is called, i.e.

model = scvi.model.SCVI(adata, n_layers=n_layers, n_latent=n_latent, n_hidden=n_hidden, gene_likelihood="nb")

runs out of memory (and I have been using machines with as much as 384 GB of RAM). Is there a workaround for this, or is there something I am missing? The AnnData object is in backed mode:

adata = ad.read_h5ad(h5ad_file, backed="r+")