I need to process several MuData (`mdata`) modalities serially in RAPIDS, updating the shared embeddings in the `mdata` object between modalities. My total cluster VRAM is only ~350 GB, so my strategy is to re-initialize the RMM pool between modalities. I set `rmm_pool_size` conservatively (~50% of per-device VRAM), with the maximum pool size (`rmm_maximum_pool_size`) capped at 90%. However, I get errors when I call `client.restart()`.
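For concreteness, here is the pool-size arithmetic behind those percentages (a sketch; the even 8-way split of ~350 GB is my assumption, based on the 8 workers visible in the logs below):

```python
# Sketch of the per-device RMM pool-size arithmetic described above.
# Assumes 8 GPUs (the logs show 8 workers) sharing ~350 GB total VRAM.
TOTAL_VRAM_GB = 350
N_GPUS = 8

per_device_gb = TOTAL_VRAM_GB / N_GPUS          # 43.75 GB per device
initial_pool_gb = round(per_device_gb * 0.50)   # conservative initial pool (~50%)
max_pool_gb = round(per_device_gb * 0.90)       # hard cap (~90%)

print(initial_pool_gb, max_pool_gb)  # 22 39
```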
Here is how I’m attempting to re-initialize the cluster:
cluster = LocalCUDACluster()  # RMM pooling configured as described above; intermediate results are persisted
client = Client(cluster)

for i, m in enumerate(modalities):
    # <do RAPIDS work on modality m>
    if i > 0:
        client.restart()  # <-- raises the TimeoutError below
Here is the relevant portion of my logs:
TimeoutError
2026-04-09 20:21:42,970 - distributed.scheduler - ERROR - Workers ['ucx://127.0.0.1:40950', 'ucx://127.0.0.1:53050', 'ucx://127.0.0.1:59470', 'ucx://127.0.0.1:37139', 'ucx://127.0.0.1:48204', 'ucx://127.0.0.1:53934', 'ucx://127.0.0.1:50543', 'ucx://127.0.0.1:59670'] did not shut down within 120s; force closing
2026-04-09 20:21:42,981 - distributed.nanny - WARNING - Restarting worker
2026-04-09 20:21:42,982 - distributed.nanny - WARNING - Restarting worker
2026-04-09 20:21:42,984 - distributed.nanny - WARNING - Restarting worker
2026-04-09 20:21:42,985 - distributed.nanny - WARNING - Restarting worker
2026-04-09 20:21:42,986 - distributed.nanny - WARNING - Restarting worker
2026-04-09 20:21:42,987 - distributed.nanny - WARNING - Restarting worker
2026-04-09 20:21:42,989 - distributed.nanny - WARNING - Restarting worker
2026-04-09 20:21:42,990 - distributed.nanny - WARNING - Restarting worker
2026-04-09 20:21:43,054 - distributed.scheduler - ERROR - 8/8 nanny worker(s) did not shut down within 120s: {'ucx://127.0.0.1:40950', 'ucx://127.0.0.1:53050', 'ucx://127.0.0.1:59470', 'ucx://127.0.0.1:37139', 'ucx://127.0.0.1:48204', 'ucx://127.0.0.1:53934', 'ucx://127.0.0.1:50543', 'ucx://127.0.0.1:59670'}
Traceback (most recent call last):
File "/nfs/turbo/umms-welchjd-code/code/mkarikom/base_conda_envs/rapids_singlecell_25.12/lib/python3.13/site-packages/distributed/utils.py", line 818, in wrapper
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nfs/turbo/umms-welchjd-code/code/mkarikom/base_conda_envs/rapids_singlecell_25.12/lib/python3.13/site-packages/distributed/scheduler.py", line 6666, in restart_workers
raise TimeoutError(
...<2 lines>...
)
TimeoutError: 8/8 nanny worker(s) did not shut down within 120s: {'ucx://127.0.0.1:40950', 'ucx://127.0.0.1:53050', 'ucx://127.0.0.1:59470', 'ucx://127.0.0.1:37139', 'ucx://127.0.0.1:48204', 'ucx://127.0.0.1:53934', 'ucx://127.0.0.1:50543', 'ucx://127.0.0.1:59670'}
2026-04-09 20:21:43,055 - distributed.scheduler - ERROR - 8/8 nanny worker(s) did not shut down within 120s: {'ucx://127.0.0.1:40950', 'ucx://127.0.0.1:53050', 'ucx://127.0.0.1:59470', 'ucx://127.0.0.1:37139', 'ucx://127.0.0.1:48204', 'ucx://127.0.0.1:53934', 'ucx://127.0.0.1:50543', 'ucx://127.0.0.1:59670'}
Traceback (most recent call last):
File "/nfs/turbo/umms-welchjd-code/code/mkarikom/base_conda_envs/rapids_singlecell_25.12/lib/python3.13/site-packages/distributed/utils.py", line 818, in wrapper
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nfs/turbo/umms-welchjd-code/code/mkarikom/base_conda_envs/rapids_singlecell_25.12/lib/python3.13/site-packages/distributed/scheduler.py", line 6526, in restart
await self.restart_workers(
...<4 lines>...
)
File "/nfs/turbo/umms-welchjd-code/code/mkarikom/base_conda_envs/rapids_singlecell_25.12/lib/python3.13/site-packages/distributed/utils.py", line 818, in wrapper
return await func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/nfs/turbo/umms-welchjd-code/code/mkarikom/base_conda_envs/rapids_singlecell_25.12/lib/python3.13/site-packages/distributed/scheduler.py", line 6666, in restart_workers
raise TimeoutError(
...<2 lines>...
)
TimeoutError: 8/8 nanny worker(s) did not shut down within 120s: {'ucx://127.0.0.1:40950', 'ucx://127.0.0.1:53050', 'ucx://127.0.0.1:59470', 'ucx://127.0.0.1:37139', 'ucx://127.0.0.1:48204', 'ucx://127.0.0.1:53934', 'ucx://127.0.0.1:50543', 'ucx://127.0.0.1:59670'}
2026-04-09 20:21:43,056 - distributed.core - ERROR - Exception while handling op restart