Hello,
I have several data sets that I am running scvi on via reticulate in R. One large data set (over 350K cells) trains for 23 epochs and takes about 2 hours to complete. Another data set of about 25K cells trains for 308 epochs and takes over 6 hours to complete.
My question is: why does training take longer on a smaller data set than on a much larger one? Are there additional parameters I could set to make training on smaller data sets faster?
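For example, I considered simply capping the number of epochs myself. This is only a sketch of what I have in mind, since I have not verified that model$train() accepts an n_epochs argument in my version of scvi:

# Hypothetical: cap training at 100 epochs instead of the default schedule
model$train(n_epochs = 100L)

But I am not sure whether cutting training short like this would hurt the quality of the latent space, so I would welcome advice on the right approach.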
Below is the code I am running:
# Call integration
library(Seurat)
library(reticulate)

sc   <- import('scanpy', convert = FALSE)
scvi <- import('scvi', convert = FALSE)
scvi$settings$progress_bar_style <- 'tqdm'

print("Converting seurat object to anndata...")
DefaultAssay(so) <- "RNA"
so <- FindVariableFeatures(so, selection.method = "vst", nfeatures = 2000)

# Get top genes and subset the original object to only the top 2000 variable genes
top_genes   <- head(VariableFeatures(so), 2000)
so_vargenes <- so[top_genes, ]

# Build the AnnData object; counts are transposed so rows are cells and columns are genes
adata <- sc$AnnData(
  X   = t(as.matrix(GetAssayData(so_vargenes, slot = 'counts'))),
  obs = so_vargenes[[]],
  var = GetAssay(so_vargenes)[[]]
)

# Register the AnnData with scvi, batching on patient_id
scvi$data$setup_anndata(adata, batch_key = 'patient_id')

# Create the model
model <- scvi$model$SCVI(adata, use_cuda = TRUE)

# Train the model with default settings
model$train()
Output:
INFO     Using batches from adata.obs["patient_id"]
INFO     No label_key inputted, assuming all cells have same label
INFO     Using data from adata.X
INFO     Computing library size prior per batch
INFO     Successfully registered anndata object containing 25934 cells, 2000 vars, 62 batches, 1 labels, and 0 proteins. Also registered 0 extra categorical covariates and 0 extra continuous covariates.
INFO     Please do not further modify adata until model is trained.
INFO     Training for 308 epochs
INFO     KL warmup phase exceeds overall training phaseIf your applications rely on the posterior quality, consider training for more epochs or reducing the kl warmup.
INFO     KL warmup for 400 epochs
Training...:   0%|
The output says to reduce the KL warmup, which I tried to do per Issue 735 by setting n_iter_kl_warmup = 0, but the KL warmup is still 400 epochs:
model$train(n_iter_kl_warmup=0)
INFO     Training for 308 epochs
INFO     KL warmup phase exceeds overall training phaseIf your applications rely on the posterior quality, consider training for more epochs or reducing the kl warmup.
INFO     KL warmup for 400 epochs
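Since the message counts the warmup in epochs, I wonder whether the epoch-based parameter is the one I actually need to change. This is what I would try next, though only as a guess; I have not confirmed that model$train() forwards an n_epochs_kl_warmup argument in my version:

# Hypothetical: disable the epoch-based KL warmup instead of the iteration-based one
model$train(n_epochs_kl_warmup = 0L)

If that is the wrong knob, a pointer to the right one would be great.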
Any help is greatly appreciated - thanks,
s2hui