MacBook M1/M2 MPS acceleration with scVI

Has anyone recently gotten scVI (ideally 1.0.4) working with "GPU" (well, MPS) acceleration on an Apple ARM M1, M2, or M3? I've tried a variety of incantations when installing torch and jax, and it either doesn't see the GPU at all, or it does and then throws a tensor error that suggests something is very borked somewhere in the software chain.
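For reference, this is roughly how I'm checking whether torch sees the Metal device and how I'm asking scVI to use it (a minimal sketch on synthetic data; it assumes scvi-tools ~1.0, where train() accepts the Lightning-style accelerator/devices arguments):

import torch
import scvi

# Does this torch build include, and currently detect, the Metal backend?
print(torch.backends.mps.is_built(), torch.backends.mps.is_available())

# Tiny synthetic dataset, just enough to exercise the training loop
adata = scvi.data.synthetic_iid()
scvi.model.SCVI.setup_anndata(adata)

model = scvi.model.SCVI(adata)
# These arguments are forwarded to the Lightning Trainer
model.train(max_epochs=5, accelerator="mps", devices=1)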

ValueError: Expected parameter loc (Tensor of shape (128, 30)) of distribution Normal(loc: torch.Size([128, 30]), scale: torch.Size([128, 30])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        ...,
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan],
        [nan, nan, nan,  ..., nan, nan, nan]], device='mps:0',
       grad_fn=<LinearBackward0>)

Same issue here. I am on a Mac M3

I had similar issues and found this:

I think this is a pytorch issue, not an scVI issue.

I think this is the pytorch issue where they track mps compatibility:

I think the specific function that’s incompatible (at least for my usage) was

aten::_standard_gamma
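If you want to poke at that op directly, a quick probe is to sample from a Gamma distribution on the MPS device, since that goes through aten::_standard_gamma under the hood (values here are arbitrary):

import torch

# Gamma sampling calls aten::_standard_gamma; on older torch builds this
# raised NotImplementedError on MPS unless PYTORCH_ENABLE_MPS_FALLBACK=1
# forced a CPU fallback.
g = torch.distributions.Gamma(torch.ones(3, device="mps"),
                              torch.ones(3, device="mps"))
print(g.sample())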

I’ve tried testing out the nightly PyTorch versions with the MPS backend and have had no success. Technically it should work, since they’ve implemented the lgamma kernel, which was the last one needed to fully support running scVI. But it looks like there might be issues with the implementation or numerical instabilities, since I’ve also experienced NaNs in the first epoch of training.

So yes, this is an issue on the PyTorch end - unfortunately there’s not much we can do to support the Metal backend.

Thanks - I’ll keep track of this thread and I hope if anyone gets scVI working on a nightly (or better, stable) branch of pytorch they will report it here!

I ran into the same issue with torch 2.2.2 on the MPS backend, where torch.lgamma produces -inf.
Very confusing to debug.
Thanks for tracking it here!
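In case it helps anyone reproduce it, this is roughly the kind of sanity check I'd run (shapes and values are arbitrary; any -inf or large CPU/MPS mismatch points at the MPS kernel):

import torch

x = torch.rand(128, 30) + 0.5        # strictly positive inputs
cpu = torch.lgamma(x)
mps = torch.lgamma(x.to("mps")).cpu()
# lgamma is finite for positive inputs, so any -inf or large gap is a red flag
print(torch.isinf(mps).any().item(), (cpu - mps).abs().max().item())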

Hi, thanks for the pointer. We tried it beforehand and got NaN errors (that was with a not-yet-released torch version). I looked deeper into it, and it’s a very strange one. Apparently MPS handles broadcast_all oddly (and the line torch.lgamma(theta) produces inf values). This is an example of the behavior:

import torch
from torch.distributions.utils import broadcast_all

# lgamma(1) = 0, so both results should be all zeros after broadcasting
b = torch.full((5, 3), 1., device='mps')
c = torch.full((5, 1), 1., device='mps')
b, c = broadcast_all(b, c)
torch.lgamma(b), torch.lgamma(c)
(tensor([[0., 0., 0.],
         [0., 0., 0.],
         [0., 0., 0.],
         [0., 0., 0.],
         [0., 0., 0.]], device='mps:0'),
 tensor([[ 0.0000,  0.0000,  0.0000],
         [ 0.0000,  0.0000, -0.0906],
         [-0.0906, -0.0906, -0.0906],
         [-0.0906, -0.0906, -0.0906],
         [-0.0906, -0.0906, -0.0906]], device='mps:0'))

The actual underlying issue is something the MPS-torch team has to look into (my guess is that it’s related to how the broadcast views point into memory). However, this also points to two ways in which scVI does work on MPS. First: gene_likelihood='poisson' is fully supported. Second: if you use dispersion='gene-label' when setting up the model, the broadcast_all call has no effect and I’m not getting NaN errors (if you don’t use a labels_key, this setup has no effect on the training procedure and is safe; it won’t work for scANVI though). The speedup I’m getting on an M1 Max in High Power mode is about 80% with a very large batch size, which makes it possible to test the GPU’s capabilities. I would be curious to get more people to benchmark it (especially on M3 chips).
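Concretely, the two setups that have been working for me look roughly like this (a sketch on synthetic data; argument names follow scvi-tools ~1.0, and the accelerator/devices arguments are passed through to Lightning):

import scvi

adata = scvi.data.synthetic_iid()
scvi.model.SCVI.setup_anndata(adata)

# Option 1: the Poisson likelihood has no dispersion parameter theta,
# so the problematic broadcast_all/lgamma path on theta is never hit
model = scvi.model.SCVI(adata, gene_likelihood="poisson")
model.train(max_epochs=5, accelerator="mps", devices=1)

# Option 2: gene-label dispersion gives theta a shape where broadcast_all
# is effectively a no-op; without a labels_key this should behave like the
# default during training, but it will not work for scANVI
model = scvi.model.SCVI(adata, dispersion="gene-label")
model.train(max_epochs=5, accelerator="mps", devices=1)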

I would conclude that using MPS, even with the proposed change, should be considered experimental only, as we use broadcasting in various places and across different models, and there is an open PyTorch MPS issue about exactly this: MPS lgamma function changes results when using broadcasting · Issue #132605 · pytorch/pytorch · GitHub