scVI data set size runtime question

Hello,

I have several data sets that I am running scVI on via reticulate in R. One large data set of over 350K cells runs for 23 epochs and takes about 2 hours to complete. Another data set of about 25K cells runs for 308 epochs and takes over 6 hours to complete.

My question is: why does training take longer on a smaller data set than on a much larger one? Are there additional parameters I could set to make training on smaller data sets faster?

Below is the code I am running:

# Call integration
sc <- import('scanpy', convert = FALSE)
scvi <- import('scvi', convert = FALSE)
scvi$settings$progress_bar_style = 'tqdm'

print("Converting seurat object to anndata...")
DefaultAssay(so) <- "RNA"
so <- FindVariableFeatures(so, selection.method = "vst", nfeatures = 2000)

# Get top genes and subset original matrix to include only top 2000 genes
top_genes <- head(VariableFeatures(so), 2000)
so_vargenes <- so[top_genes]

adata <- sc$AnnData(
  X   = t(as.matrix(GetAssayData(so_vargenes, slot = 'counts'))),
  obs = so_vargenes[[]],
  var = GetAssay(so_vargenes)[[]]
)

# run setup_anndata
scvi$data$setup_anndata(adata, batch_key = 'patient_id')

# create the model
model <- scvi$model$SCVI(adata, use_cuda = TRUE)

# train the model
model$train()

Output:

INFO     Using batches from adata.obs["patient_id"]
INFO     No label_key inputted, assuming all cells have same label
INFO     Using data from adata.X
INFO     Computing library size prior per batch
INFO     Successfully registered anndata object containing 25934 cells, 2000 vars, 62 batches, 1 labels, and 0 proteins. Also registered 0 extra categorical covariates and 0 extra continuous covariates.
INFO     Please do not further modify adata until model is trained.

INFO     Training for 308 epochs
INFO     KL warmup phase exceeds overall training phase. If your applications rely on the posterior quality, consider training for more epochs or reducing the kl warmup.
INFO     KL warmup for 400 epochs
Training...:   0%|                      

The output says to reduce the KL warmup, which I tried to do according to Issue 735 by setting n_iter_kl_warmup = 0, but the KL warmup is still 400 epochs:

model$train(n_iter_kl_warmup = 0)
INFO     Training for 308 epochs
INFO     KL warmup phase exceeds overall training phase. If your applications rely on the posterior quality, consider training for more epochs or reducing the kl warmup.
INFO     KL warmup for 400 epochs
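
My best guess is that the epoch-based warmup has its own argument: in scvi-tools versions from around this release, train() also appears to take an n_epochs_kl_warmup argument (defaulting to 400), so perhaps that is the one to zero out (untested):

model$train(n_epochs_kl_warmup = 0L)   # untested guess; 0L passes a Python int from R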

Any help is greatly appreciated - thanks,
s2hui

Hi s2hui,

This is odd; in both datasets you should expect ~60,000 iterations in total ([number of cells] / [batch size (128)] * [number of epochs]).
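
To make that arithmetic concrete with the numbers from your two runs:

# Rough total iterations = cells / minibatch size (default 128) * epochs
(350000 / 128) * 23     # large dataset (~350K cells): ~62,900 iterations
(25934  / 128) * 308    # small dataset (25934 cells): ~62,400 iterations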

Are you using the same number of genes (2,000) in both datasets? The more genes you use, the slower it will get.

It will also be slower the more batches you have. Do you have a similar number of batches in both datasets?

As a side note, these times seem a bit slow to me. I see you're setting use_cuda = TRUE; have you checked if scVI is actually using the GPU?
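
From R you can check whether PyTorch sees a CUDA device:

# TRUE means PyTorch (and therefore scVI) can use the GPU
torch <- reticulate::import('torch')
torch$cuda$is_available()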

/Valentine


Thanks for your reply!
I am using 2000 genes in both data sets; I did notice that before I added that step it was unbearably slow.
I have the same number of batches in each data set as well (roughly 75).
Also, I had copied the line of code that creates the model from a tutorial and didn't notice the use_cuda = TRUE parameter. I don't have access to GPUs, so I will take that parameter out.
Could any of these, or some combination of them, be the reason my analysis takes hours to complete?
The other thing I was reading up on was early stopping. Might this be something I should consider setting? I wasn't sure how to do it correctly via R/reticulate, though.
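
My best guess at the reticulate mechanics, from skimming the docs, is something like the sketch below, though the keyword names are probably version-dependent and may be wrong:

# Hypothetical sketch: argument names vary by scvi-tools version, so
# check train()'s signature. From R, pass Python ints as R integers
# (the L suffix) and Python dicts via reticulate::dict().
library(reticulate)
model$train(
  n_epochs = 100L,                    # integer, not the double 100
  early_stopping_kwargs = dict(       # hypothetical keyword
    early_stopping_metric = 'elbo',   # hypothetical key/value
    patience = 10L
  )
)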

Thanks,
s2hui

Ok, yes, without a GPU it will be pretty slow (I recently experienced just how slow when I accidentally installed PyTorch without GPU support; quite a pain!).

Given what you’re describing though, I don’t see why one dataset is so much slower than the other…

Are you sure both datasets have integer counts in a sparse matrix? If the smaller dataset happens to have normalized values that are not sparse, it will be slower. (I see you're using a slot called 'counts', though, so it should be counts, but at least it's something you can check.)
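
A quick check from R, assuming the usual dgCMatrix sparse representation Seurat uses for counts:

# Verify the counts slot is a sparse matrix of whole numbers
m <- GetAssayData(so, slot = 'counts')
class(m)                  # expect "dgCMatrix" (sparse)
all(m@x == round(m@x))    # TRUE if all stored values are integer counts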

When I want to take a quick look at the data, I set max_epochs to a relatively small number, usually on the order of 1e6 / [number of cells], which would be about 40 in this case. Then, after looking at the results for a bit, I rerun the model fitting with a larger number of epochs. So to get started and see something, try 40 epochs. It will still take you about an hour, though… With a GPU it takes ~10 minutes.
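
In code the heuristic is just the following (in some scvi-tools versions the argument is n_epochs rather than max_epochs, so check your version's signature):

# Heuristic: quick-look epochs ~ 1e6 / number of cells
n_cells <- 25934
as.integer(ceiling(1e6 / n_cells))   # ~39, so round up to 40
model$train(n_epochs = 40L)          # or max_epochs = 40L, depending on version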

/Valentine

Great, thank you! I will try your suggestions and in the meantime look into getting access to a GPU, as it seems to make a huge difference in run time.