Hello,
I’ve been exploring multi-GPU training in the hope that it will speed up analysis of our larger (~15-20 million nuclei) datasets. In my initial testing with scvi-tools version 1.4.2, activating multi-GPU training appears to duplicate the whole process under the hood. For a node with 8 GPUs, CPU memory usage also increases roughly 8-fold, and it looks like 8 independent processes are being initialized, based on repeated printing of things like the scvi-tools version. Is that the expected behavior? If so, could you explain why it’s necessary and whether there are workarounds? Here is the basic script I’m using:
###################### Load packages ######################
import scvi
import torch
import scanpy as sc
import numpy as np
import pandas as pd
from datetime import datetime
import warnings
###################### Set settings ######################
scvi.settings.seed = 0
warnings.filterwarnings('ignore')
torch.set_float32_matmul_precision("high")
datadir = "/data/"
print(f"{datetime.now()} -- scvi-tools version: {scvi.__version__}")
###################### Load reference dataset ######################
adata = sc.read_h5ad(datadir + "Multiregion/Data/SEA-AD_supertype_downsampled_1k.2025-10-01.h5ad") # Demo anndata contains ~205k nuclei
print(f"{datetime.now()} -- Loaded dataset")
###################### Run scVI ######################
scvi.model.SCVI.setup_anndata(
    adata,
    batch_key="method",  # Sequencing chemistry (10xMulti vs 10xV3.1)
    categorical_covariate_keys=["library_prep"],  # Unique ID for every 10x library
)
model = scvi.model.SCVI(
    adata,
    n_layers=2,
    dispersion="gene-batch",  # We find this generally improves batch integration
)
model.train(
    max_epochs=500,
    accelerator="gpu",
    devices=-1,
    strategy="ddp_find_unused_parameters_true",
)
@ktravaglini ,
You are correct. With N GPUs there are N independent Python processes, each of which runs the script from line 1 and loads the full AnnData into its own memory. This is expected DDP behavior in PyTorch. There’s no way around it right now, short of re-engineering how we train models and run multi-GPU.
You can, however, run with a smaller memory footprint by reading the AnnData in on-disk mode (backed='r'), though overall runtime will be a bit slower.
Also, make sure that:
- scvi.settings.dl_num_workers=0
- scvi.settings.dl_persistent_workers=False
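To illustrate why backed mode shrinks the per-process footprint, here is a toy sketch using plain h5py (the library that anndata's backed='r' mode uses under the hood for .h5ad files). The file name and matrix are made up for the example; the point is that the dataset stays on disk and only requested slices are materialized:

```python
import numpy as np
import h5py

# Write a small toy matrix to disk (stand-in for a large .h5ad)
with h5py.File("toy.h5", "w") as f:
    f.create_dataset("X", data=np.arange(12, dtype=np.float32).reshape(4, 3))

# "Backed"-style read: opening the file does NOT load the matrix;
# only the slice you index is brought into memory. This is why each
# DDP replica pays far less RAM than a full in-memory copy.
with h5py.File("toy.h5", "r") as f:
    X = f["X"]      # lazy handle, no data loaded yet
    row = X[2, :]   # loads just this one row
    print(row)      # -> [6. 7. 8.]
```

The trade-off, as noted above, is extra disk I/O per minibatch, which is what slows overall runtime.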
Hello @ori-kron-wis,
Thanks for the quick reply; good to know this is expected behavior and that I haven’t mucked anything up! I suspected we’d have to use backed mode in the end, even with higher-memory nodes.
Are there recommendations on I/O paradigms for training datasets of this scale? We are running this on AWS and currently streaming the h5ad file into memory from S3 via an s3-fuse mechanism (OK with smaller-scale data, and even medium-scale on a single GPU). I was planning to benchmark downloading the files to solid-state scratch drives and running everything as either h5ad or zarr files in backed mode. I’ve also been tracking remote streaming via zarr v3 (see the Guide/Roadmap in the anndata 0.12.10 documentation). Any advice you have here would be extremely helpful!
Thanks, Kyle
Yes! A few options in fact:
- Use one of our custom dataloaders for reading very large data from disk (e.g. Lamin).
- Use annbatch, a zarr-based custom dataloader. It is still under development but shows great promise for very large AnnData objects. The branch to use is scverse/scvi-tools at Ori-annbatch on GitHub; check out its test function to see how to run it. annbatch is created by the folks behind anndata.
- Prompted by your question, I actually managed to add a shared-memory DDP multi-GPU option today. See the scverse/scvi-tools branch Ori-reduce-DPP-memory-footprint on GitHub. You can use it with model.train(..., datasplitter_kwargs={"share_memory": True}); it will save memory.
You will need to install scvi-tools from those branches in order to use these tools.
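For reference, installing from a development branch is typically done with pip's git support. A hedged sketch (the branch name is taken from the post above; swap in Ori-annbatch to try the annbatch dataloader instead):

```shell
# Install scvi-tools directly from a GitHub development branch
pip install "git+https://github.com/scverse/scvi-tools@Ori-reduce-DPP-memory-footprint"
```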
Awesome. I’ll look through/try all three and let you know how it goes/if I have any questions. Thank you! -Kyle