scANVI fails and returns NaNs after a few epochs

Hello,

Backstory:

I successfully installed scvi-tools on Apple Silicon. Training worked on a small dataset (20k cells, 3k genes), but it became extremely slow on a larger one (20k cells, 17k genes). So I decided to use accelerated PyTorch training on MPS. However, the aten::remainder.Tensor_out operation is not yet implemented for MPS, so I get the following error:

NotImplementedError: Could not run 'aten::index.Tensor' with arguments from the 'MPS' backend.

This was described in this GitHub issue, which suggested setting the PYTORCH_ENABLE_MPS_FALLBACK=1 environment variable so that the operation falls back to the CPU.
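For anyone reproducing this, here is a minimal sketch of setting the fallback variable from Python rather than from the shell. It assumes the variable has to be in place before `torch` is first imported, which is the usual guidance; the `mps_available` name is only for illustration.

```python
import os

# Assumption: the variable should be set before the first `import torch`
# in the process, so this goes at the very top of the script or notebook.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

try:
    import torch

    # True only on a build with MPS support running on supported hardware.
    mps_available = bool(torch.backends.mps.is_available())
except (ImportError, AttributeError):
    # torch is not installed, or the installed version predates the MPS backend.
    mps_available = False

print("MPS fallback:", os.environ["PYTORCH_ENABLE_MPS_FALLBACK"])
print("MPS available:", mps_available)
```

The shell equivalent is to export `PYTORCH_ENABLE_MPS_FALLBACK=1` before launching Python.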

With that set, training starts, but it always fails after a few epochs, sometimes even during the first epoch, and returns the error:

ValueError: Expected parameter loc (Tensor of shape (X, Y)) of distribution Normal(loc: torch.Size([X, Y]), scale: torch.Size([X, Y])) to satisfy the constraint Real(), but found invalid values

Can anyone please help with this?

PyTorch MPS support is not fully operational yet (i.e., some tensor operations fail, as happened to you). Therefore, M1 support is restricted to CPU. We do not anticipate much of a speedup for scvi models on MPS anyway.
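A rough sketch of how one might pin training to the CPU on Apple Silicon is below. The helper `choose_accelerator` is hypothetical (not part of scvi-tools), and the `accelerator` argument mentioned in the comment is assumed to be available in scvi-tools 1.x; older releases used a different argument, so check the version you have installed.

```python
import platform


def choose_accelerator(system=None, machine=None):
    """Hypothetical helper: return "cpu" on Apple Silicon, "auto" elsewhere."""
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        # Avoid MPS entirely, since some tensor operations are not implemented there.
        return "cpu"
    return "auto"


accelerator = choose_accelerator()
print("Training accelerator:", accelerator)

# Assumed usage with scvi-tools 1.x, where `model` is an already set-up scANVI model:
# model.train(accelerator=accelerator)
```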

:frowning_face:

Thanks a lot for the quick reply!

I'm fairly new to this space. Since all my personal compute resources are Apple Silicon based, are there any platforms with GPU support that you could recommend (something like SageMaker)?

Thanks!