Hello,
I’ve been exploring multi-GPU training in the hope that it will speed up analysis of our larger (~15-20 million nuclei) datasets. In my initial testing with scvi-tools version 1.4.2, activating multi-GPU training seems to duplicate the process under the hood: on a node with 8 GPUs, CPU memory usage also increases roughly 8-fold. Judging by the repeated printing of things like the scvi-tools version, 8 independent processes are being initialized. Is that the expected behavior? If so, could you explain why it’s necessary and whether there are workarounds? Here is the basic script I’m using:
###################### Load packages ######################
import scvi
import torch
import scanpy as sc
import numpy as np
import pandas as pd
from datetime import datetime
import warnings
###################### Set settings ######################
scvi.settings.seed = 0
warnings.filterwarnings('ignore')
torch.set_float32_matmul_precision("high")
datadir = "/data/"
print(f"{datetime.now()} -- scvi-tools version: {scvi.__version__}")
###################### Load reference dataset ######################
adata = sc.read_h5ad(datadir + "Multiregion/Data/SEA-AD_supertype_downsampled_1k.2025-10-01.h5ad") # Demo anndata contains ~205k nuclei
print(f"{datetime.now()} -- Loaded dataset")
###################### Run scVI ######################
scvi.model.SCVI.setup_anndata(
    adata,
    batch_key="method",  # Sequencing chemistry (10xMulti vs 10xV3.1)
    categorical_covariate_keys=["library_prep"],  # Unique ID for every 10x library
)
model = scvi.model.SCVI(
    adata,
    n_layers=2,
    dispersion="gene-batch",  # We find this generally improves batch integration
)
model.train(
    max_epochs=500,
    accelerator="gpu",
    devices=-1,
    strategy="ddp_find_unused_parameters_true",
)
@ktravaglini,
You are correct. There is one independent Python process per GPU (8 in your case), and each one runs the script from line 1 and loads the full AnnData into its own memory. This is expected DDP behavior in PyTorch. There’s no way around it right now, unless we change how we train the models and run multi-GPU jobs.
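To make the duplication concrete, here is a minimal sketch (not scvi-tools code). It assumes the launcher exports a LOCAL_RANK environment variable to each spawned process, as Lightning's subprocess launcher and torchrun do. The helper names and the 50 GB figure are hypothetical.

```python
import os


def local_rank() -> int:
    """Return this process's rank on the node (0 when LOCAL_RANK is not set)."""
    # Assumption: the DDP launcher sets LOCAL_RANK for every spawned process.
    return int(os.environ.get("LOCAL_RANK", "0"))


def log_once(message: str) -> bool:
    """Print only on rank 0 and report whether the message was printed."""
    if local_rank() == 0:
        print(message)
        return True
    return False


def estimated_host_memory_gb(adata_gb: float, n_gpus: int) -> float:
    """Rough host RAM needed when every rank loads its own copy of the data."""
    return adata_gb * n_gpus


# Hypothetical numbers: a 50 GB in-memory AnnData on an 8-GPU node.
log_once(f"Estimated host RAM: {estimated_host_memory_gb(50, 8)} GB")
```

Guarding prints with log_once only quiets the repeated output. Every rank still executes the rest of the script, including the data load.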
You can, however, reduce the memory footprint by reading the adata in on-disk (backed) mode (backed='r'), though the overall runtime will be somewhat slower.
Also, make sure that:
- scvi.settings.dl_num_workers=0
- scvi.settings.dl_persistent_workers=False
Hello @ori-kron-wis,
Thanks for the quick reply. Good to know this is expected behavior and I haven’t mucked anything up! I suspected we’d have to use backed mode in the end, even with higher-memory nodes.
Are there recommendations on I/O paradigms for training on datasets of this scale? We are running this on AWS and currently streaming the h5ad file into memory from S3 via an s3-fuse mechanism (fine for smaller-scale data and even medium-scale data on a single GPU). I was planning to benchmark downloading the files to solid-state scratch drives and running everything as either h5ad or zarr files in backed mode. I’ve also been tracking remote streaming via zarr-v3 (see the Guide/Roadmap page of the anndata 0.12.10 documentation). Any advice you have here would be extremely helpful!
Thanks, Kyle
Yes! A few options, in fact:
- Use one of our custom dataloaders for reading very large data from disk (e.g., Lamin).
- Use annbatch, a zarr-based custom dataloader. It is still under development but shows great promise for very large adata. The branch to use is Ori-annbatch in scverse/scvi-tools on GitHub; check out its test function to see how to run it. annbatch is created by the folks from anndata.
- Given your question, I actually managed to put together a shared-memory DDP multi-GPU addition today. See the Ori-reduce-DPP-memory-footprint branch in scverse/scvi-tools on GitHub. You can use it with model.train(…, datasplitter_kwargs={"share_memory": True}). It will save memory.
You will need to install scvi-tools from those branches in order to use those tools.
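For readers unfamiliar with the idea behind a shared-memory option, here is a generic illustration using Python's standard multiprocessing.shared_memory module and numpy. It is not the implementation in the branch above. The publish and attach helpers are hypothetical names, and both ends run in one process here only to keep the sketch short.

```python
from multiprocessing import shared_memory

import numpy as np


def publish(array: np.ndarray) -> shared_memory.SharedMemory:
    """Copy an array once into a named shared-memory block."""
    shm = shared_memory.SharedMemory(create=True, size=array.nbytes)
    view = np.ndarray(array.shape, dtype=array.dtype, buffer=shm.buf)
    view[:] = array
    return shm


def attach(name: str, shape: tuple, dtype: np.dtype):
    """Attach to an existing block by name and view it without copying."""
    shm = shared_memory.SharedMemory(name=name)
    return shm, np.ndarray(shape, dtype=dtype, buffer=shm.buf)


counts = np.arange(12, dtype=np.float32).reshape(3, 4)
owner = publish(counts)
try:
    # Another rank would call attach() with the same name, shape, and dtype.
    reader, shared_view = attach(owner.name, counts.shape, counts.dtype)
    total = float(shared_view.sum())
    del shared_view  # drop the numpy view before closing the block
    reader.close()
finally:
    owner.close()
    owner.unlink()  # the creating process removes the block exactly once
```

The try/finally matters: a block that is never unlinked is what Python's resource tracker reports as a leaked shared_memory object at shutdown.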
Awesome. I’ll look through/try all three and let you know how it goes/if I have any questions. Thank you! -Kyle
Hi Ori,
I ran into errors when I initially tried your shared memory branch and wanted to set aside some time to understand what happened. I had some more time today to dive deeper, and the outcome always appears to be:
(base) root@3fce2f775bac:~/capsule/code# ./run
+ python -u run_mapping_one_donor.py
Seed set to 0
2026-04-03 22:17:08.053822 -- scvi-tools version: 1.4.2
2026-04-03 22:21:42.624367 -- Loaded reference dataset
/src/scvi-tools/src/scvi/train/_trainrunner.py:116: UserWarning: early_stopping was automatically disabled due to the use of DDP
self.trainer = self._trainer_cls(
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
💡 Tip: For seamless cloud logging and experiment tracking, try installing [litlogger](https://pypi.org/project/litlogger/) to enable LitLogger, which logs metrics and artifacts automatically to the Lightning Experiments platform.
Initializing distributed: GLOBAL_RANK: 0, MEMBER: 1/8
[rank: 6] Seed set to 0
[rank: 2] Seed set to 0
2026-04-03 22:21:47.626748 -- scvi-tools version: 1.4.2
2026-04-03 22:21:47.627048 -- scvi-tools version: 1.4.2
[rank: 5] Seed set to 0
2026-04-03 22:21:47.660674 -- scvi-tools version: 1.4.2
[rank: 1] Seed set to 0
2026-04-03 22:21:47.684899 -- scvi-tools version: 1.4.2
[rank: 3] Seed set to 0
2026-04-03 22:21:47.838124 -- scvi-tools version: 1.4.2
[rank: 7] Seed set to 0
2026-04-03 22:21:47.852862 -- scvi-tools version: 1.4.2
[rank: 4] Seed set to 0
2026-04-03 22:21:47.891436 -- scvi-tools version: 1.4.2
2026-04-03 22:24:30.637870 -- Loaded reference dataset
2026-04-03 22:24:30.714274 -- Loaded reference dataset
2026-04-03 22:24:30.716654 -- Loaded reference dataset
2026-04-03 22:24:30.717344 -- Loaded reference dataset
2026-04-03 22:24:30.720324 -- Loaded reference dataset
2026-04-03 22:24:30.743129 -- Loaded reference dataset
2026-04-03 22:24:30.743129 -- Loaded reference dataset
/src/scvi-tools/src/scvi/train/_trainrunner.py:116: UserWarning: early_stopping was automatically disabled due to the use of DDP
self.trainer = self._trainer_cls(
/src/scvi-tools/src/scvi/train/_trainrunner.py:116: UserWarning: early_stopping was automatically disabled due to the use of DDP
self.trainer = self._trainer_cls(
/src/scvi-tools/src/scvi/train/_trainrunner.py:116: UserWarning: early_stopping was automatically disabled due to the use of DDP
self.trainer = self._trainer_cls(
/src/scvi-tools/src/scvi/train/_trainrunner.py:116: UserWarning: early_stopping was automatically disabled due to the use of DDP
self.trainer = self._trainer_cls(
/src/scvi-tools/src/scvi/train/_trainrunner.py:116: UserWarning: early_stopping was automatically disabled due to the use of DDP
self.trainer = self._trainer_cls(
/src/scvi-tools/src/scvi/train/_trainrunner.py:116: UserWarning: early_stopping was automatically disabled due to the use of DDP
self.trainer = self._trainer_cls(
/src/scvi-tools/src/scvi/train/_trainrunner.py:116: UserWarning: early_stopping was automatically disabled due to the use of DDP
self.trainer = self._trainer_cls(
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/8
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 8 processes
----------------------------------------------------------------------------------------------------
/opt/conda/lib/python3.12/multiprocessing/resource_tracker.py:254: UserWarning: resource_tracker: There appear to be 1 leaked shared_memory objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '
./run: line 5: 1281 Bus error (core dumped) python -u run_mapping_one_donor.py "$@"
The end state leaves 7 zombie processes on the GPUs and objects still in CPU memory:
(base) root@3fce2f775bac:~/capsule/code# nvidia-smi
Fri Apr 3 22:46:51 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.105.08 Driver Version: 580.105.08 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA L4 On | 00000000:9F:00.0 Off | 0 |
| N/A 29C P8 12W / 72W | 3MiB / 23034MiB | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA L4 On | 00000000:A1:00.0 Off | 0 |
| N/A 46C P0 31W / 72W | 339MiB / 23034MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA L4 On | 00000000:A3:00.0 Off | 0 |
| N/A 48C P0 31W / 72W | 339MiB / 23034MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA L4 On | 00000000:A5:00.0 Off | 0 |
| N/A 46C P0 31W / 72W | 339MiB / 23034MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA L4 On | 00000000:AE:00.0 Off | 0 |
| N/A 50C P0 32W / 72W | 339MiB / 23034MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA L4 On | 00000000:B0:00.0 Off | 0 |
| N/A 47C P0 31W / 72W | 339MiB / 23034MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA L4 On | 00000000:B2:00.0 Off | 0 |
| N/A 47C P0 31W / 72W | 339MiB / 23034MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA L4 On | 00000000:B4:00.0 Off | 0 |
| N/A 46C P0 32W / 72W | 339MiB / 23034MiB | 100% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| 1 N/A N/A 2138 C /opt/conda/bin/python 330MiB |
| 2 N/A N/A 2139 C /opt/conda/bin/python 330MiB |
| 3 N/A N/A 2140 C /opt/conda/bin/python 330MiB |
| 4 N/A N/A 2141 C /opt/conda/bin/python 330MiB |
| 5 N/A N/A 2142 C /opt/conda/bin/python 330MiB |
| 6 N/A N/A 2143 C /opt/conda/bin/python 330MiB |
| 7 N/A N/A 2144 C /opt/conda/bin/python 330MiB |
+-----------------------------------------------------------------------------------------+
Ok, let’s continue the discussion in the PR itself (I updated it), as it has become a development thread.
Can you try to release the zombie processes in the code? Feel free to add your commits to it.