Any suggestions for speeding up model training (VAE and SOLO) on an M2 Mac?

Hello scverse!
This is my first time asking a question, so let me know what I can improve.
I’ve been trying to train the VAE and SOLO models, but using my MPS GPU throws the same error mentioned in this post (Error when training model on M3 Max MPS), so I’ve been running on CPU only. It is very slow because of the dataset size (700000 x 35000), but I was wondering if you all had any suggestions for making sure training runs at the maximum possible speed.

I’ve been running this code:

import scvi

# Global scvi-tools settings
scvi.settings.dl_num_workers = 11  # data loader worker processes
scvi.settings.batch_size = 2048
scvi.settings.num_threads = 10  # PyTorch CPU threads

vae = scvi.model.SCVI(adata)

If it helps, here is the startup output of the code above:

GPU available: True (mps), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/opt/miniconda3/envs/scanpy_env/lib/python3.9/site-packages/lightning/pytorch/trainer/ GPU available but not used. You can set it by doing `Trainer(accelerator='gpu')`.
/opt/miniconda3/envs/scanpy_env/lib/python3.9/site-packages/lightning/pytorch/trainer/connectors/ Consider setting `persistent_workers=True` in 'train_dataloader' to speed up the dataloader worker initialization.

Specs: M2 Max, 94 GB RAM
CPU usage during training: 50-65%
RAM usage during training: ~40 GB

I haven’t tried this recently, so I’m not sure if it’s stable, but you can install an MPS-supported version of PyTorch and then use the MPS backend by passing in:
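As a minimal sketch of what that could look like, assuming scvi-tools 1.x, where `train()` accepts Lightning's `accelerator` and `devices` arguments (an assumption, not confirmed in this thread): the helper below (`pick_accelerator`, a hypothetical name) returns `"mps"` when PyTorch reports the backend as available and falls back to `"cpu"` otherwise. Given the MPS error described in the question, the CPU fallback may still be what ends up being used.

```python
# Sketch only: choose a Lightning accelerator string, preferring MPS
# when the installed PyTorch build reports that it is usable.
def pick_accelerator() -> str:
    """Return "mps" if PyTorch can use the Apple GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"  # PyTorch is not installed, so there is no MPS backend
    mps = getattr(torch.backends, "mps", None)  # absent in old PyTorch builds
    if mps is not None and mps.is_available():
        return "mps"
    return "cpu"


accelerator = pick_accelerator()
print(f"accelerator={accelerator}")

# Assumed usage with the model from the question (not run here):
# vae = scvi.model.SCVI(adata)
# vae.train(accelerator=accelerator, devices=1)
```

If the MPS run fails with the error from the linked post, passing `accelerator="cpu"` explicitly should give the same behaviour as the current CPU-only setup.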