I have an AnnData object with data from six patients. When I loop through the patient IDs, subset the data, and run the standard scVI workflow, the first patient ID takes about 3 minutes to complete. However, the second patient subset takes hours. If I run the second patient ID on its own, it also finishes in about 3 minutes.
I would highly appreciate your input on how to make sure a finished scVI process is fully removed from memory, for example when looping through patient IDs.
Hi, your workflow isn't entirely clear to me. Why do you want to run it separately per patient? Are you seeing the GPU memory or the system RAM filling up?
That could explain the slowdown. Disabling memory pinning on the GPU would then help.
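A rough sketch of how pinning could be switched off, assuming your scvi-tools version forwards datasplitter_kwargs from train() to the data loaders (the exact keyword may differ between releases):

# Sketch only: turn off pinned (page-locked) host memory for the training
# data loaders; the `datasplitter_kwargs` forwarding is an assumption here.
model.train(datasplitter_kwargs={"pin_memory": False})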
Thank you very much for your time. I run scVI per patient and use the model for SOLO doublet detection. Currently I use the CPU on our HPC (Python v3.9.19, scvi-tools v1.1.6.post2).
I set scvi.settings.num_threads = 38 and, in the Slurm job, --nodes=1, --ntasks=1, --cpus-per-task=38.
It feels like the CPU resources stay occupied after training the model for the first patient ID. Do you have any idea what I could try in order to free them?
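For reference, the per-patient step is roughly the standard workflow, something like this sketch (argument values are illustrative, not my exact script):

import scvi

def run_patient(adata, patient_id):
    # Subset to one patient and copy so the view is materialized
    adata_i = adata[adata.obs["patient_id"] == patient_id].copy()

    # Standard scVI training on the subset
    scvi.model.SCVI.setup_anndata(adata_i)
    model_i = scvi.model.SCVI(adata_i)
    model_i.train()

    # SOLO doublet detection on top of the trained scVI model
    solo = scvi.external.SOLO.from_scvi_model(model_i)
    solo.train()
    return solo.predict()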
I don't know if this is good coding practice, but importing scvi inside the function and running it in a separate Process seems to free up the CPU after training a model:
import gc
from multiprocessing import Process

def scvi_workflow(adata, patient_id):
    import scvi  # Import scvi inside the worker process
    scvi.settings.num_threads = 8  # Set scvi threads
    adata_i = adata[adata.obs['patient_id'] == patient_id]
    [...]  # Train model etc.
    del adata_i, model_i
    gc.collect()
    return None

for patient_id in adata.obs['patient_id'].cat.categories:
    p = Process(target=scvi_workflow, args=(adata, patient_id))
    p.start()
    p.join()
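One caveat in case others copy this: with the 'spawn' start method (the default on macOS and Windows, unlike 'fork' on Linux), the loop has to sit under a main guard, roughly:

if __name__ == '__main__':
    for patient_id in adata.obs['patient_id'].cat.categories:
        p = Process(target=scvi_workflow, args=(adata, patient_id))
        p.start()
        p.join()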
My first intuition: do you enable persistent workers? Without it, using multiple data-loading workers doesn't really speed things up; with it set, the workers stay alive even after training, as we don't kill the dataloader. Deleting the model should be sufficient, though; is the gc step really needed? See What are the (dis) advantages of persistent_workers - #8 by albanD - vision - PyTorch Forums for a longer discussion. It would be helpful to see a more complete version of the script.
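For reference, a sketch of how this setting is typically passed through, assuming your scvi-tools version exposes datasplitter_kwargs on train() and forwards extra keywords to the PyTorch DataLoader (the forwarding is an assumption, names may differ between releases):

# Sketch: use several data-loading workers but shut them down after training.
model.train(
    datasplitter_kwargs={
        "num_workers": 4,             # parallel data-loading processes
        "persistent_workers": False,  # do not keep workers alive afterwards
    }
)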
Hi @cane11, thank you very much for the pointers! I tried but could not replicate the behavior, either with other data or with my own. However, I realized that server updates affecting the CPU distribution were running while the problem occurred. Maybe that caused the behavior, but I can't be sure. My apologies.