Distances -> connectivities in neighbor graph construction

Following up on an earlier Twitter discussion: has anyone evaluated how Scanpy converts distances to connectivities? For example, Seurat uses the Jaccard similarity of neighborhoods, while Scanpy uses an internal UMAP function, which at a quick glance looked like some sort of modified Gaussian kernel.

  1. Should scanpy consider adding the Seurat method (also used by PhenoGraph)? It’s probably quite fast with numba.
  2. Should scanpy consider implementing the UMAP method on its own, so that it can be used with other kNN distance matrix constructions and so that scanpy’s implementation no longer needs umap at all (it could use pynndescent for neighbors)?

We do support other graph weighting methods, but are definitely interested in having more. I think the main thing blocking this is the implementation of how neighbors weighting is handled. You can read some context on how I would prefer to do this here (among other places):

I am thinking this is something I would like to put Michal or some of our new RAs on.

Should scanpy consider adding the seurat method

I’m not against it, but I’m not sure I’d recommend people use it. Personally, I think the UMAP neighbor weighting makes more sense than using Jaccard. The Jaccard weight is going to be influenced by the number of observations present, which we’d really prefer to avoid.
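For reference, the Jaccard weighting under discussion can be sketched in a few lines. This is only an illustration in plain Python, not Seurat's or PhenoGraph's actual code: the function name `jaccard_connectivities` is made up here, and whether a cell counts as a member of its own neighborhood is an assumption of this sketch.

```python
def jaccard_connectivities(knn_indices):
    """Weight each directed kNN edge (i, j) by the Jaccard overlap of the two neighborhoods.

    knn_indices[i] lists the neighbors of cell i. Hypothetical helper for illustration.
    """
    # Assumption: each neighborhood includes the cell itself.
    neighborhoods = [set(nbrs) | {i} for i, nbrs in enumerate(knn_indices)]
    weights = {}
    for i, nbrs in enumerate(knn_indices):
        for j in nbrs:
            if i == j:
                continue  # skip self-edges
            shared = len(neighborhoods[i] & neighborhoods[j])
            total = len(neighborhoods[i] | neighborhoods[j])
            weights[(i, j)] = shared / total
    return weights


# Toy graph: cells 0-2 are mutual neighbors, cell 3 attaches to cells 2 and 1.
knn = [[1, 2], [0, 2], [0, 1], [2, 1]]
weights = jaccard_connectivities(knn)
print(weights[(0, 1)])  # 1.0, identical neighborhoods
print(weights[(3, 2)])  # 0.5, two shared cells out of four in the union
```

Note that the weight depends only on which cells appear in each neighborhood, not on the distances themselves.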

implementing the umap method on its own

Sure. It could even be easy to vendor (which all of UMAP was back in the day). Why not just import it from UMAP though?
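To give a sense of how small a standalone version could be, here is a simplified sketch of the UMAP-style weighting as I understand it: each cell's distances are shifted by the distance to its nearest neighbor (`rho`), scaled by a per-cell bandwidth (`sigma`) found by bisection so that the weights sum to `log2(k)`, and then symmetrized with a fuzzy union. The function names are made up for this example, `k` here counts neighbors excluding the cell itself, and options such as local connectivity or set-operation mixing are left out, so this is not a drop-in replacement for what UMAP does.

```python
import math


def smooth_knn(dists, n_iter=64, tol=1e-5):
    """Find (rho, sigma) for one cell's neighbor distances (self excluded)."""
    target = math.log2(len(dists))
    nonzero = [d for d in dists if d > 0]
    rho = min(nonzero) if nonzero else 0.0  # distance to the nearest neighbor
    lo, hi, sigma = 0.0, math.inf, 1.0
    for _ in range(n_iter):
        total = sum(math.exp(-max(d - rho, 0.0) / sigma) for d in dists)
        if abs(total - target) < tol:
            break
        if total > target:  # weights too large, shrink the bandwidth
            hi = sigma
            sigma = (lo + hi) / 2
        else:  # weights too small, grow the bandwidth
            lo = sigma
            sigma = sigma * 2 if hi == math.inf else (lo + hi) / 2
    return rho, sigma


def umap_like_connectivities(knn_indices, knn_dists):
    """Directed exponential-kernel weights, symmetrized by a fuzzy union."""
    directed = {}
    for i, (nbrs, dists) in enumerate(zip(knn_indices, knn_dists)):
        rho, sigma = smooth_knn(dists)
        for j, d in zip(nbrs, dists):
            directed[(i, j)] = math.exp(-max(d - rho, 0.0) / sigma)
    sym = {}
    for (i, j), w in directed.items():
        w_rev = directed.get((j, i), 0.0)
        # Fuzzy union: w + w_rev - w * w_rev
        sym[(i, j)] = sym[(j, i)] = w + w_rev - w * w_rev
    return sym


# Four points on a line at x = 0, 1, 3, 6 with their 3 nearest neighbors.
knn_indices = [[1, 2, 3], [0, 2, 3], [1, 0, 3], [2, 1, 0]]
knn_dists = [[1.0, 3.0, 6.0], [1.0, 2.0, 5.0], [2.0, 3.0, 3.0], [3.0, 5.0, 6.0]]
connectivities = umap_like_connectivities(knn_indices, knn_dists)
```

Because the input is just kNN indices and distances, something like this would work with any neighbor search backend.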

Again, the main thing stopping this is the implementation of the Neighbors class. The main issue is that, on instantiation, the Neighbors class looks through the AnnData object and tries to guess what weighting method was used. It’s then unclear how that information gets used downstream (which gets tied up in some old code that doesn’t have tests). Maybe this means getting rid of the Neighbors class; maybe it means making a bunch of Neighbors subclasses.

I recently switched to PyMDE for visualization (see “PyMDE: Minimum-Distortion Embedding”, pymde 0.1.15 documentation).

The nearest neighbors calculation in it is extremely fast. I don’t remember how it does it, but maybe it has some functions that could be useful as well for kNN matrix construction?


It uses PyNNDescent, which UMAP also uses (same author), though at one point not too long ago it wouldn’t be used in scanpy unless pynndescent was explicitly installed.

Just want to add that pynndescent is also super easy to use on its own for creating the same output as sklearn.