I need to process several MuData (`mdata`) modalities serially in RAPIDS, updating the shared embeddings in the `mdata` object between modalities. My total cluster VRAM is only ~350 GB, so my strategy is to re-initialize the RMM pool between modalities. I set `rmm_pool_size` conservatively (~50% of per-device VRAM), with the maximum pool size (`rmm_maximum_pool_size`) capped at 90%. However, I get errors when I call `client.restart()`.
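For concreteness, here is the pool-size arithmetic behind those percentages (a sketch; the even 8-way split of ~350 GB is my assumption, based on the 8 workers visible in the logs below):

```python
# Sketch of the per-device RMM pool-size arithmetic described above.
# Assumes 8 GPUs (the logs show 8 workers) sharing ~350 GB total VRAM.
TOTAL_VRAM_GB = 350
N_GPUS = 8

per_device_gb = TOTAL_VRAM_GB / N_GPUS          # 43.75 GB per device
initial_pool_gb = round(per_device_gb * 0.50)   # conservative initial pool (~50%)
max_pool_gb = round(per_device_gb * 0.90)       # hard cap (~90%)

print(initial_pool_gb, max_pool_gb)  # 22 39
```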
Here is how I’m attempting to re-initialize the cluster:
cluster = LocalCUDACluster()  # RMM pooling configured as described above; intermediate results are persisted
client = Client(cluster)

for i, m in enumerate(modalities):
    # <do RAPIDS work on modality m>
    if i > 0:
        client.restart()  # <-- raises the TimeoutError below
Here is the relevant portion of my logs:
TimeoutError
2026-04-09 20:21:42,970 - distributed.scheduler - ERROR - Workers ['ucx://127.0.0.1:40950', 'ucx://127.0.0.1:53050', 'ucx://127.0.0.1:59470', 'ucx://127.0.0.1:37139', 'ucx://127.0.0.1:48204', 'ucx://127.0.0.1:53934', 'ucx://127.0.0.1:50543', 'ucx://127.0.0.1:59670'] did not shut down within 120s; force closing
2026-04-09 20:21:42,981 - distributed.nanny - WARNING - Restarting worker
2026-04-09 20:21:42,982 - distributed.nanny - WARNING - Restarting worker
2026-04-09 20:21:42,984 - distributed.nanny - WARNING - Restarting worker
2026-04-09 20:21:42,985 - distributed.nanny - WARNING - Restarting worker
2026-04-09 20:21:42,986 - distributed.nanny - WARNING - Restarting worker
2026-04-09 20:21:42,987 - distributed.nanny - WARNING - Restarting worker
2026-04-09 20:21:42,989 - distributed.nanny - WARNING - Restarting worker
2026-04-09 20:21:42,990 - distributed.nanny - WARNING - Restarting worker
2026-04-09 20:21:43,054 - distributed.scheduler - ERROR - 8/8 nanny worker(s) did not shut down within 120s: {'ucx://127.0.0.1:40950', 'ucx://127.0.0.1:53050', 'ucx://127.0.0.1:59470', 'ucx://127.0.0.1:37139', 'ucx://127.0.0.1:48204', 'ucx://127.0.0.1:53934', 'ucx://127.0.0.1:50543', 'ucx://127.0.0.1:59670'}
Traceback (most recent call last):
File "/nfs/turbo/umms-welchjd-code/code/mkarikom/base_conda_envs/rapids_singlecell_25.12/lib/python3.13/site-packages/distributed/utils.py", line 818, in wrapper
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nfs/turbo/umms-welchjd-code/code/mkarikom/base_conda_envs/rapids_singlecell_25.12/lib/python3.13/site-packages/distributed/scheduler.py", line 6666, in restart_workers
raise TimeoutError(
...<2 lines>...
)
TimeoutError: 8/8 nanny worker(s) did not shut down within 120s: {'ucx://127.0.0.1:40950', 'ucx://127.0.0.1:53050', 'ucx://127.0.0.1:59470', 'ucx://127.0.0.1:37139', 'ucx://127.0.0.1:48204', 'ucx://127.0.0.1:53934', 'ucx://127.0.0.1:50543', 'ucx://127.0.0.1:59670'}
2026-04-09 20:21:43,055 - distributed.scheduler - ERROR - 8/8 nanny worker(s) did not shut down within 120s: {'ucx://127.0.0.1:40950', 'ucx://127.0.0.1:53050', 'ucx://127.0.0.1:59470', 'ucx://127.0.0.1:37139', 'ucx://127.0.0.1:48204', 'ucx://127.0.0.1:53934', 'ucx://127.0.0.1:50543', 'ucx://127.0.0.1:59670'}
Traceback (most recent call last):
File "/nfs/turbo/umms-welchjd-code/code/mkarikom/base_conda_envs/rapids_singlecell_25.12/lib/python3.13/site-packages/distributed/utils.py", line 818, in wrapper
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nfs/turbo/umms-welchjd-code/code/mkarikom/base_conda_envs/rapids_singlecell_25.12/lib/python3.13/site-packages/distributed/scheduler.py", line 6526, in restart
await self.restart_workers(
...<4 lines>...
)
File "/nfs/turbo/umms-welchjd-code/code/mkarikom/base_conda_envs/rapids_singlecell_25.12/lib/python3.13/site-packages/distributed/utils.py", line 818, in wrapper
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nfs/turbo/umms-welchjd-code/code/mkarikom/base_conda_envs/rapids_singlecell_25.12/lib/python3.13/site-packages/distributed/scheduler.py", line 6666, in restart_workers
raise TimeoutError(
...<2 lines>...
)
TimeoutError: 8/8 nanny worker(s) did not shut down within 120s: {'ucx://127.0.0.1:40950', 'ucx://127.0.0.1:53050', 'ucx://127.0.0.1:59470', 'ucx://127.0.0.1:37139', 'ucx://127.0.0.1:48204', 'ucx://127.0.0.1:53934', 'ucx://127.0.0.1:50543', 'ucx://127.0.0.1:59670'}
2026-04-09 20:21:43,056 - distributed.core - ERROR - Exception while handling op restart