Hello,
I’ve been exploring multi-GPU training in the hope that it will speed up analysis of our larger (~15-20 million nuclei) datasets. In my initial testing with scvi-tools version 1.4.2, activating multi-GPU training appears to duplicate the whole process under the hood. For a node with 8 GPUs, CPU memory usage also increases roughly 8-fold, and it looks like 8 independent processes are being initialized, based on repeated printing of things like the scvi-tools version. Is that the expected behavior? If so, could you explain why it’s necessary and whether there are workarounds? Here is the basic script I’m using:
###################### Load packages ######################
import scvi
import torch
import scanpy as sc
import numpy as np
import pandas as pd
from datetime import datetime
import warnings
###################### Set settings ######################
scvi.settings.seed = 0
warnings.filterwarnings('ignore')
torch.set_float32_matmul_precision("high")
datadir = "/data/"
print(f"{datetime.now()} -- scvi-tools version: {scvi.__version__}")
###################### Load reference dataset ######################
adata = sc.read_h5ad(datadir + "Multiregion/Data/SEA-AD_supertype_downsampled_1k.2025-10-01.h5ad") # Demo anndata contains ~205k nuclei
print(f"{datetime.now()} -- Loaded dataset")
###################### Run scVI ######################
scvi.model.SCVI.setup_anndata(
    adata,
    batch_key="method",  # Sequencing chemistry (10xMulti vs 10xV3.1)
    categorical_covariate_keys=["library_prep"],  # Unique ID for every 10x library
)
model = scvi.model.SCVI(
    adata,
    n_layers=2,
    dispersion="gene-batch",  # We find this generally improves batch integration
)
model.train(
    max_epochs=500,
    accelerator="gpu",
    devices=-1,
    strategy="ddp_find_unused_parameters_true",
)
@ktravaglini ,
You are correct. With N GPUs there are N independent Python processes, each of which runs the script from line 1 and loads the full AnnData into its own memory. This is expected DDP behavior in PyTorch. There’s no way around it right now, short of re-engineering how we train models and run multi-GPU.
You can, however, run with a smaller memory footprint by reading the AnnData in on-disk mode (backed='r'), though overall runtime will be a bit slower.
Also, make sure that:
- scvi.settings.dl_num_workers=0
- scvi.settings.dl_persistent_workers=False
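To illustrate why backed mode shrinks the per-process footprint, here is a toy sketch using plain h5py (the library that anndata's backed='r' mode uses under the hood for .h5ad files). The file name and matrix are made up for the example; the point is that the dataset stays on disk and only requested slices are materialized:

```python
import numpy as np
import h5py

# Write a small toy matrix to disk (stand-in for a large .h5ad)
with h5py.File("toy.h5", "w") as f:
    f.create_dataset("X", data=np.arange(12, dtype=np.float32).reshape(4, 3))

# "Backed"-style read: opening the file does NOT load the matrix;
# only the slice you index is brought into memory. This is why each
# DDP replica pays far less RAM than a full in-memory copy.
with h5py.File("toy.h5", "r") as f:
    X = f["X"]      # lazy handle, no data loaded yet
    row = X[2, :]   # loads just this one row
    print(row)      # -> [6. 7. 8.]
```

The trade-off, as noted above, is extra disk I/O per minibatch, which is what slows overall runtime.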
Hello @ori-kron-wis,
Thanks for the quick reply; good to know this is expected behavior and that I haven’t mucked anything up! I suspected we’d have to use backed mode in the end, even with higher-memory nodes.
Are there recommendations on I/O paradigms for training datasets of this scale? We are running this on AWS and currently streaming the h5ad file into memory from S3 via an s3-fuse mechanism (OK with smaller-scale data, and even medium-scale on a single GPU). I was planning to benchmark downloading the files to solid-state scratch drives and running everything as either h5ad or zarr files in backed mode. I’ve also been tracking remote streaming via zarr v3 (see the Guide/Roadmap in the anndata 0.12.10 documentation). Any advice you have here would be extremely helpful!
Thanks, Kyle
Yes! A few options in fact:
- Use one of our custom dataloaders for reading very large data from disk (e.g. Lamin).
- Use annbatch, a zarr-based custom dataloader. It is still under development but shows great promise for very large AnnData objects. The branch to use is scverse/scvi-tools at Ori-annbatch on GitHub; check out its test function to see how to run it. annbatch is created by the folks behind anndata.
- Prompted by your question, I actually managed to add a shared-memory DDP multi-GPU option today. See the scverse/scvi-tools branch Ori-reduce-DPP-memory-footprint on GitHub. You can use it with model.train(..., datasplitter_kwargs={"share_memory": True}); it will save memory.
You will need to install scvi-tools from those branches in order to use these tools.
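For reference, installing from a development branch is typically done with pip's git support. A hedged sketch (the branch name is taken from the post above; swap in Ori-annbatch to try the annbatch dataloader instead):

```shell
# Install scvi-tools directly from a GitHub development branch
pip install "git+https://github.com/scverse/scvi-tools@Ori-reduce-DPP-memory-footprint"
```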
Awesome. I’ll look through/try all three and let you know how it goes/if I have any questions. Thank you! -Kyle