Is there any way to increase the integration speed for a very large dataset (~800,000 cells/nuclei)? I want to keep the number of epochs high, around 80 as it gives me good results. I also saw in one of the discussions that increasing batch_size to 1024 would speed it up.
Anything else other than using GPUs?
If you have a (very) large amount of RAM, you can convert the sparse count matrix (.X in the AnnData object) to a dense array. During training, a minibatch of cells is selected and the corresponding ‘slice’ of gene molecule counts is converted on the fly to a dense matrix. This conversion step, which happens on the CPU, is a bottleneck for training speed. If you convert the .X matrix to a dense array up front, you won’t need to do this conversion on the fly. Remember, however, that a dense matrix needs 4 bytes (32 bits) of RAM for each (cell, gene) pair. So for your data, if you have 30,000 genes, you will need 4 bytes * 800,000 * 30,000 = 96 GB of RAM to hold the dense UMI count data.
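To double-check that figure, the dense-matrix footprint is just the product of three numbers (plain Python; the cell and gene counts are the ones from this thread):

```python
# Dense float32 count matrix: 4 bytes (32 bits) per (cell, gene) entry.
n_cells = 800_000
n_genes = 30_000
bytes_per_entry = 4  # float32

total_bytes = n_cells * n_genes * bytes_per_entry
total_gb = total_bytes / 1e9  # decimal GB

print(f"{total_gb:.0f} GB")  # 96 GB
```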
Epochs will run faster with larger minibatch sizes than the default. However, in my experience this causes training to need more epochs before reaching the same reconstruction error. I tried to optimize the minibatch size and found that the default was optimal across a few different datasets, reaching a lower reconstruction error in less wall-clock time. I would recommend you also evaluate this, but make sure to measure time until finished, as well as keeping track of the loss curves for the different minibatch sizes you’re evaluating.
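One way to see why larger minibatches tend to need more epochs: the number of gradient updates per epoch shrinks as the minibatch grows. A quick back-of-the-envelope in Python (the 128 here is an assumption about scvi-tools’ default batch size; check the docs for your version):

```python
import math

# Gradient updates per epoch = ceil(n_cells / batch_size).
# Fewer updates per epoch is one reason larger minibatches can need
# more epochs to reach the same reconstruction error.
n_cells = 800_000

updates = {bs: math.ceil(n_cells / bs) for bs in (128, 1024, 5000)}
print(updates)  # {128: 6250, 1024: 782, 5000: 160}
```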
Hope this helps!
Thank you for the advice! I am currently training the scVI model on 600k cells with 80 epochs and a batch size of 5000, and it is running significantly faster. I will assess the integration quality with these parameters and let you know! But I would also like to try converting the sparse count matrix to a dense one. How can I do that? Here is my code for generating the AnnData file:
# subset the merged Seurat object to the integration features
pdac.merged.hv = pdac.merged[int.features]
# convert the Seurat object to AnnData, keeping raw counts
adata = convertFormat(pdac.merged.hv, from="seurat",
to="anndata", main_layer="counts", drop_single_values=FALSE)
# run setup_anndata, use column Study.Sample.ID for batch
scvi$model$SCVI$setup_anndata(adata, batch_key = 'Study.Sample.ID')
# create the model
model = scvi$model$SCVI(adata, n_latent = as.integer(30), n_layers = as.integer(2), gene_likelihood = "nb")
# train the model
model$train(max_epochs = as.integer(80), accelerator = "cpu", batch_size = as.integer(5000))
The code for converting .X to dense is something like this (in Python):
adata.X = adata.X.toarray()
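For completeness, here is a minimal self-contained demonstration of that conversion on a toy matrix (assuming scipy; an AnnData .X holding raw counts is typically a scipy CSR matrix). In the R/reticulate session above, the equivalent would likely be `adata$X = adata$X$toarray()`.

```python
import numpy as np
from scipy import sparse

# Toy UMI count matrix: 3 cells x 4 genes, stored sparse (CSR),
# the way AnnData usually stores raw counts in .X.
X_sparse = sparse.csr_matrix(
    np.array([[0, 2, 0, 1],
              [5, 0, 0, 0],
              [0, 0, 3, 0]], dtype=np.float32)
)

# Densify once up front, instead of densifying each minibatch slice
# on the CPU during every training step.
X_dense = X_sparse.toarray()

print(type(X_dense).__name__, X_dense.shape)  # ndarray (3, 4)
```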
I performed scVI integration and I’m happy with the result (clear separation of immune, epithelial, and stromal cells). I ran 200 epochs with a batch size of 2000 to increase processing speed on the dense matrix (3,000 protein-coding genes intersected across studies).
Just a follow-up question. I want to 1) filter out a few cells from the scVI-integrated dataset, 2) annotate the main clusters, and 3) perform scANVI integration on the same data, before 4) using scArches to extend my core atlas with additional single-nucleus datasets.
If I filter out cells in the initial integrated dataset (step 1), do I need to perform scVI integration again before doing scANVI? What is the best practice for this scenario?
I believe you can initialize the scANVI model with the previously fitted SCVI model even if you are only using a subset of the data that was used to fit the SCVI model. So in this case you shouldn’t need to fit a new scVI integration model.
I think the steps you propose sound reasonable, as I understand them: 1) fit the SCVI model; 2) filter and annotate the data; 3) seed-based annotation with scANVI initialized from the SCVI model; 4) extend the scANVI model with scArches.
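For reference, the Python-side calls for steps 3–4 would look roughly like the sketch below. Treat it as pseudocode: the `labels_key`, `unlabeled_category`, and variable names are placeholders for your own data, and the exact signatures should be checked against the scvi-tools docs for your version.

```
# 3) seed-based annotation: initialize scANVI from the fitted scVI model,
#    passing the filtered/annotated AnnData subset
scanvi_model = scvi.model.SCANVI.from_scvi_model(
    scvi_model,
    adata=adata_filtered,
    labels_key="cell_type",          # placeholder column name
    unlabeled_category="Unknown",    # placeholder label
)
scanvi_model.train()

# 4) scArches-style extension with a new single-nucleus query dataset
scvi.model.SCANVI.prepare_query_anndata(adata_query, scanvi_model)
query_model = scvi.model.SCANVI.load_query_data(adata_query, scanvi_model)
query_model.train()
```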
Hope this helps,