metal-SingleCell — a work-in-progress Metal/MLX port of rapids-singlecell for Apple Silicon (feedback welcome)

gingerii · July 2, 2026, 3:30pm

TL;DR: I’ve been working on an independent Metal/MLX re-implementation of rapids-singlecell so the scanpy/squidpy GPU workflow can run on the Apple M-series GPU. It has a drop-in API and covers ~30 pp/tl/gr functions, validated against CPU references. To be clear up front: this does not approach rapids-singlecell’s speed on NVIDIA hardware; that’s a different bandwidth class and far better-optimized CUDA than I can write. What it does offer is a GPU option for Mac users who currently have none, and a useful speedup over CPU scanpy for a good number of functions. A ~20-person beta ran the tutorials without issues. Repo: gingerii/metal-SingleCell on GitHub. Feedback and suggestions are very welcome, especially on the Leiden kernel (see below).

Why: rapids-singlecell is CUDA/CuPy-only, and there’s no Apple-silicon path because the M-series GPU has no native sparse-matrix support. This project builds that missing sparse substrate (CSR container, SpMM, segmented reductions, a few custom mx.fast.metal_kernel kernels) and puts the scanpy/squidpy front-end on top.
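To make "sparse substrate" concrete for readers who have not written one, here is a minimal NumPy sketch of a CSR sparse-times-dense product (SpMM) expressed as a gather followed by a segmented sum over rows. It only illustrates the idea and is not the project's Metal/MLX code; the function name `csr_spmm` and the toy matrix are assumptions made for this example.

```python
import numpy as np

def csr_spmm(indptr, indices, data, B):
    """Multiply a CSR matrix (indptr, indices, data) by a dense matrix B.

    Illustrative only: one gather per nonzero, then a segmented sum per row.
    """
    n_rows = len(indptr) - 1
    # Row id of every stored nonzero; empty rows simply contribute no entries.
    row_ids = np.repeat(np.arange(n_rows), np.diff(indptr))
    # Gather the rows of B selected by the column indices and scale them.
    contrib = data[:, None] * B[indices]
    out = np.zeros((n_rows, B.shape[1]), dtype=contrib.dtype)
    # Segmented reduction: accumulate each contribution into its output row.
    np.add.at(out, row_ids, contrib)
    return out

# Toy 3x3 matrix [[1, 0, 2], [0, 0, 0], [0, 3, 0]] in CSR form.
indptr = np.array([0, 2, 2, 3])
indices = np.array([0, 2, 1])
data = np.array([1.0, 2.0, 3.0])
B = np.arange(6, dtype=np.float64).reshape(3, 2)

result = csr_spmm(indptr, indices, data, B)
```

On a GPU the accumulation step is the hard part, since many nonzeros write to the same output row; that is where segmented reductions or custom kernels come in.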

Drop-in API — swap the import prefix and an existing pipeline runs on the GPU, writing back into the same scanpy slots (.X, .obsm, .obsp, .var, .uns):

import scanpy as sc
import metasinglecell as msc          # pp / tl / gr mirror sc.pp, sc.tl, sq.gr

msc.pp.normalize_total(adata, target_sum=1e4)
msc.pp.log1p(adata)
msc.pp.highly_variable_genes(adata, n_top_genes=2000)
msc.pp.pca(adata)
msc.pp.neighbors(adata)
msc.tl.leiden(adata, backend="gpu")
msc.tl.umap(adata)
sc.pl.umap(adata, color="leiden")      # plot with scanpy as usual

What’s covered: most of the scanpy pp/tl + squidpy gr surface (~30 functions) — QC, normalize/log1p, Pearson residuals, HVG, PCA, neighbors, UMAP, Leiden/Louvain, kmeans, rank_genes_groups, diffmap, draw_graph, Harmony, scrublet, bbknn, and the spatial gr set. Four tutorial notebooks mirror the rapids-singlecell ones.

Accuracy (checked against CPU references on real data: PBMC, a 1.3M-neuron atlas, a 2M-cell Xenium cohort): normalize/Pearson residuals match to within Δ≈1e-6, HVG gene-overlap 1.000, PCA subspace agreement 0.97–0.99, rank-genes top-k overlap 1.000, co-occurrence correlation 1.000 vs squidpy. I’ve tried to be careful here since correctness matters more than speed.
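For anyone who wants to run a similar comparison on their own data, here is a small NumPy sketch of two such agreement scores. The exact metric definitions behind the numbers above are not spelled out in this post, so both functions are assumptions: `subspace_similarity` uses the mean squared cosine of the principal angles between two sets of components, and `set_overlap` is the shared fraction of two gene lists.

```python
import numpy as np

def subspace_similarity(U, V):
    """Mean squared cosine of the principal angles between span(U) and span(V).

    U and V are (n_features, k) matrices whose columns are components.
    Returns 1.0 for identical subspaces and 0.0 for orthogonal ones.
    """
    Qu, _ = np.linalg.qr(U)   # orthonormal basis of span(U)
    Qv, _ = np.linalg.qr(V)   # orthonormal basis of span(V)
    cosines = np.linalg.svd(Qu.T @ Qv, compute_uv=False)
    return float(np.mean(np.clip(cosines, 0.0, 1.0) ** 2))

def set_overlap(a, b):
    """Fraction of items shared by two selections (e.g. two HVG gene lists)."""
    a, b = set(a), set(b)
    return len(a & b) / max(len(a), len(b), 1)

rng = np.random.default_rng(0)
U = rng.standard_normal((50, 5))
V = U @ rng.standard_normal((5, 5))   # same span, different basis
same_span = subspace_similarity(U, V)
```

A subspace score is used rather than an element-wise comparison because principal components are only defined up to sign and, for near-equal eigenvalues, up to rotation.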

On speed: against CPU scanpy on the same Mac, several functions do benefit from the GPU (HVG, UMAP, sparse PCA, Pearson residuals, and Louvain once graphs are large). But plenty of things are CPU-favored on this hardware — kNN above ~250k, Leiden below ~50k, Harmony, and t-SNE above ~30k — and the README documents each, including when to just use scanpy/igraph instead. I’d rather under-promise here.
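As a rough illustration of how those crossover points could be encoded in a pipeline, here is a hypothetical helper. `suggest_backend` and the function labels are invented for this example and are not part of the package; the thresholds are the approximate figures quoted above, and the README remains the authority.

```python
# Approximate CPU-favored regimes quoted in the post (cell counts).
CPU_FAVORED = {
    "neighbors": lambda n: n > 250_000,   # kNN above ~250k cells
    "leiden": lambda n: n < 50_000,       # Leiden below ~50k cells
    "harmony": lambda n: True,            # Harmony is listed as CPU-favored
    "tsne": lambda n: n > 30_000,         # t-SNE above ~30k cells
}

# Functions the post lists as benefiting from the GPU (labels are illustrative).
GPU_FAVORED = {"highly_variable_genes", "umap", "sparse_pca", "pearson_residuals"}

def suggest_backend(func_name, n_cells):
    """Return 'cpu', 'gpu', or 'check-readme' for a function label and cell count."""
    rule = CPU_FAVORED.get(func_name)
    if rule is not None and rule(n_cells):
        return "cpu"
    if func_name in GPU_FAVORED:
        return "gpu"
    # Anything else (including Leiden on large graphs, which roughly ties
    # with CPU igraph at ~1M cells) is left to the README's guidance.
    return "check-readme"

choices = {name: suggest_backend(name, 100_000) for name in ("neighbors", "leiden", "umap", "tsne")}
```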

Where I could really use help: parallel Leiden. I’m not a strong Metal programmer, and Leiden was by far the hardest piece. The current design (graph/leiden.py, built on graph/louvain.py) is coloring-free synchronous local-moving plus a Traag-style refinement phase, with a random half-commit (commit_prob) to break the symmetric-swap oscillation that fully synchronous moves otherwise cause. It’s correct (modularity is equal to or a touch above igraph Leiden), but on speed it only roughly ties CPU igraph Leiden at ~1M cells and is slower below that. Profiling puts ~75% of the runtime in the refinement phase.
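To show the symmetric-swap problem for readers who have not hit it, here is a toy NumPy sketch of synchronous local-moving with a random commit mask. It is a dense, CPU-only illustration written for this post and not the code in graph/leiden.py: the function names, the dense adjacency matrix, and the resolution of 1 are all assumptions; only the `commit_prob` idea comes from the design described above.

```python
import numpy as np

def propose_moves(A, labels):
    """For every node at once, pick the community with the best modularity gain."""
    n = A.shape[0]
    deg = A.sum(axis=1)
    two_m = deg.sum()
    onehot = np.eye(n)[labels]                 # node -> community indicator
    k_ic = A @ onehot                          # edge weight from node i to community c
    tot = onehot.T @ deg                       # total degree of each community
    tot_excl = tot[None, :] - onehot * deg[:, None]   # exclude node i from its own community
    gain = k_ic - deg[:, None] * tot_excl / two_m
    best = gain.argmax(axis=1)
    idx = np.arange(n)
    improves = gain[idx, best] > gain[idx, labels] + 1e-12
    return np.where(improves, best, labels)

def local_move(A, labels, commit_prob, seed=0, max_iter=50):
    """Synchronous local-moving; each proposed move is committed with probability commit_prob."""
    rng = np.random.default_rng(seed)
    labels = labels.copy()
    for it in range(max_iter):
        proposal = propose_moves(A, labels)
        if np.array_equal(proposal, labels):
            return labels, it                  # no node wants to move: converged
        commit = rng.random(len(labels)) < commit_prob
        labels = np.where(commit, proposal, labels)
    return labels, max_iter

# Two nodes joined by one edge, each starting in its own community.
A = np.array([[0.0, 1.0], [1.0, 0.0]])
start = np.array([0, 1])

sync_labels, sync_iters = local_move(A, start, commit_prob=1.0)   # both always commit
half_labels, half_iters = local_move(A, start, commit_prob=0.5)   # random half-commit
```

With `commit_prob=1.0` both nodes jump into each other's community on every pass, so they trade places until `max_iter` and never merge. With `commit_prob=0.5` some pass eventually commits only one of the two moves, and the pair ends up in a single community.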

I did try the obvious “make it fast” route — a fully-fused single-dispatch color+move kernel like cuGraph’s — and concluded it’s fundamentally incompatible with Metal’s memory model: no grid-wide barrier across threadgroups, and relaxed-only atomics (no acquire/release/seq_cst, no native float atomic_max). So the sanctioned cross-threadgroup ordering is the dispatch boundary, which is what the current multi-dispatch path already uses. Given those constraints, I’d love suggestions on: (a) a cheaper refinement phase, (b) a better convergence/commit rule than random half-commit, or (c) any GPU-Leiden approach that lives happily within Metal’s relaxed-atomics / no-grid-barrier limits. Even a “you’re thinking about this wrong” would help.

Status: early — v0.0.1, BSD-3, one beta round done. It’s an independent, unaffiliated project that models its API/workflows on (and asks users to cite) rapids-singlecell, scanpy, squidpy, and MLX — not endorsed by scverse, NVIDIA, or Apple.
