All CPU cores of the cluster are used; I would like to restrict this behavior when I call the model$train function.
I also wanted to know whether I should start the integration from counts or from log-normalized counts; it's not clear to me.
Any ideas would be more than welcome,
thanks!
In fact I'm looking for an integration based on SingleCellExperiment objects, and not the Seurat approach as described here:
library(reticulate)
library(sceasy)
library(SingleCellExperiment)
library(ggplot2)
## subset to a predefined list of HVGs
sce.combined <- sce.combined[top_hvgs, ]
sc <- import("scanpy", convert = FALSE)
scvi <- import("scvi", convert = FALSE)
## convert the sce to an AnnData object
adata <- sceasy::convertFormat(sce.combined, from = "sce", to = "anndata",
                               main_layer = "counts",
                               transfer_layers = c("logcounts", "normcounts"),
                               drop_single_values = FALSE)
## run setup_anndata
scvi$model$SCVI$setup_anndata(adata, batch_key = "sample_id")
## create the model
model <- scvi$model$SCVI(adata)
## train the model
model$train(accelerator = "cpu",
            max_epochs = 10L)
## get the latent representation and the expression normalized by this latent space
adata$obsm["X_scVI"] <- model$get_latent_representation()
adata$obsm["X_normalized_scVI"] <- model$get_normalized_expression()
## go back to an sce to use familiar R plotting functions
sce <- SingleCellExperiment(
  assays = list(X_normalized_scVI = t(reticulate::py_to_r(adata$obsm["X_normalized_scVI"]))),
  colData = reticulate::py_to_r(adata$obs),
  reducedDims = list(X_scVI = reticulate::py_to_r(adata$obsm["X_scVI"]))
)
## PCA using expression values normalized by the latent space
sce <- scater::runPCA(sce, ncomponents = 30, exprs_values = "X_normalized_scVI",
                      name = "PCA_SCVI")
## plot
r4 <- scater::plotReducedDim(sce, dimred = "PCA_SCVI", colour_by = "sample_id") +
  ggtitle("PCA_SCVI - sample")
The trainer complains:
lightning/pytorch/trainer/connectors/data_connector.py:425: The 'train_dataloader' does not have many workers which may be a bottleneck. Consider increasing the value of the num_workers argument to num_workers=79 in the DataLoader to improve performance.
Epoch 2/2: 100%|█| 2/2 [00:58<00:00, 29.19s/it, v_num=1, train_loss_step=8.36e+3, train_loss_
Trainer.fit stopped: max_epochs=2 reached.
scvi-tools models require the raw counts for integration.
Other downstream analysis tasks might require the log-normalized counts (as in Seurat or Scanpy).
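If you want to double-check that the model is actually seeing raw counts, a quick sanity check in R could look like this (a sketch, assuming the counts assay of sce.combined is a sparse dgCMatrix):

## non-zero raw counts should all be integers
cnt <- SummarizedExperiment::assay(sce.combined, "counts")
stopifnot(all(cnt@x == round(cnt@x)))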
The trainer's complaint is about the number of data-loading workers.
You can set it with something like:
## run setup_anndata and adjust backend settings
scvi$model$SCVI$setup_anndata(adata)
scvi$settings$dl_num_workers = 79L
scvi$settings$persistent_workers = TRUE  ## try also with FALSE
scvi$settings$num_threads = 3L           ## number of CPU threads
## create the model
model <- scvi$model$SCVI(adata)
## train the model
model$train()
This will make the run use 3 CPU threads.
There is some overhead to using workers, and the persistent workers might still be left alive afterwards, so you may need to kill them manually.
Not sure if it will bring you any added advantage, but try it.
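If persistent workers are indeed left hanging after training, you could track them down from R; a sketch, assuming a Linux system with pgrep available (the PID below is a placeholder):

system("pgrep -af python")   ## list remaining Python processes with their command lines
# tools::pskill(12345)       ## then terminate a leftover worker by its PID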
Thanks.
It gives me the following message, and all the cores are still active even when setting dl_num_workers to a lower value:
Epoch 5/50: 8%| | 4/50 [01:01<12:41, 16.54s/it, v_num=1, train_loss_step=7.86e+3, train_loss
/data2/USERS/anaconda3/envs/R-4.4.1/lib/python3.12/multiprocessing/popen_fork.py:66: RuntimeWarning: os.fork() was called. os.fork() is incompatible with multithreaded code, and JAX is multithreaded, so this will likely lead to a deadlock.
Nope, there is no impact from setting 1 everywhere; all cores still run at ~100%.
I work on an Ubuntu 18.04.6 LTS (GNU/Linux 5.4.0-150-generic x86_64) system with 80 cores and no job scheduler.
To change a system environment variable in R, do it with Sys.setenv(OMP_NUM_THREADS = "8") and Sys.setenv(MKL_NUM_THREADS = "8"),
and not through the reticulate package using os (a Python package).
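For example, at the very top of the R session, before reticulate initializes Python (a sketch; OPENBLAS_NUM_THREADS is an extra variable I am assuming your BLAS honors):

## cap the thread pools before Python and the BLAS libraries initialize
Sys.setenv(OMP_NUM_THREADS = "8",
           MKL_NUM_THREADS = "8",
           OPENBLAS_NUM_THREADS = "8")
library(reticulate)
scvi <- import("scvi", convert = FALSE)
scvi$settings$num_threads = 8L   ## also cap torch threads via scvi-tools settings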