Distances -> connectivities in neighbor graph construction

adamgayoso · April 18, 2022, 1:19am

Following up on a prior Twitter discussion, I’m wondering if anyone has evaluated how Scanpy converts distances to connectivities. For example, Seurat uses the Jaccard similarity of neighborhoods, while Scanpy uses an internal UMAP function, which at a quick glance looks like some sort of modified Gaussian kernel.

Should scanpy consider adding the Seurat method (also used by Phenograph)? It’s probably quite fast with numba.
Should scanpy consider implementing the UMAP method on its own, so that it can be used with other KNN distance matrix constructions and so that scanpy no longer needs umap itself (it could use pynndescent for neighbors)?
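
To make the Jaccard option concrete, here is a minimal pure-Python sketch of weighting kNN edges by neighborhood overlap. It is an illustration under stated assumptions, not Seurat's or Phenograph's actual code: the name `jaccard_connectivities` is hypothetical, each neighborhood is assumed to include the point itself, and no pruning of weak edges is applied.

```python
def jaccard_connectivities(knn_indices):
    """Weight each directed kNN edge i -> j by the Jaccard overlap of the
    two neighborhoods: |N(i) & N(j)| / |N(i) | N(j)|.

    knn_indices[i] lists the neighbors of point i (self excluded).
    Assumption for this sketch: a neighborhood also contains the point itself.
    """
    neighborhoods = [set(row) | {i} for i, row in enumerate(knn_indices)]
    weights = {}
    for i, row in enumerate(knn_indices):
        for j in row:
            shared = len(neighborhoods[i] & neighborhoods[j])
            total = len(neighborhoods[i] | neighborhoods[j])
            weights[(i, j)] = shared / total
    return weights


# Toy graph: points 0-2 form one tight group, points 3-4 another,
# and both 3 and 4 also list point 2 as a neighbor.
knn_indices = [[1, 2], [0, 2], [0, 1], [2, 4], [3, 2]]
weights = jaccard_connectivities(knn_indices)
```

Edges inside a group get weight 1.0 because the neighborhoods coincide, while the edge from 3 to 2 only shares point 2 across five distinct points. Note that the distances themselves are never used, only the neighbor sets.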

ivirshup · April 19, 2022, 8:31am

We do support other graph weighting methods, but are definitely interested in having more. I think the main thing blocking this is how neighbor weighting is currently implemented. You can read some context on how I would prefer to do this here (among other places):

github.com/scverse/scanpy

Comment by ivirshup to Switch t-SNE implementation to openTSNE

scverse:master ← pavlin-policar:opentsne

Happy new year! And thanks for opening this PR @pavlin-policar.

First a general question. What is the scope of this PR? Will this just be single dataset TSNE calculation, with integration/ `ingest` functionality happening separately, or would you like to do it all at once?

In terms of workflow, I think I'd like it to look similar to UMAP

* One function for calculating the graph/ manifold
* One function for computing the embedding

If possible, I would like it if the user could specify an arbitrary manifold (e.g. the umap weighted one) to pass to the embedding step, but this is icing.

> It would also make sense to add a tsne option to sc.pp.neighbors

I would prefer for this to be a separate function, maybe `neighbors_tsne`? This could use the entire neighbor calculating workflow from `openTSNE`. How different are the arguments to the various `affinity` methods? At first glance they look pretty similar. I'd like to have the option of choosing which one, but does it make sense to have all the methods available through one function?

> noticed that sc.tl.umap and now sc.tl.tsne add their parameters to adata.uns. ... Determining which affinity kernel to use would then be as simple as looking into adata.uns to find which parameter value sc.pp.neighbors was called with.

+1. Do you need to know what the affinity method was if you're just calculating an embeddings? Or does that only become important when you want to add new data?

I am thinking this is something I would like to put Michal or some of our new RAs on.

> Should scanpy consider adding the seurat method

I’m not against it, but I’m not sure I’d recommend people use it. Personally, I think the UMAP neighbor weighting makes more sense than using Jaccard. The Jaccard weight is going to be influenced by the number of observations present, which we’d really prefer to avoid.

> implementing the umap method on its own

Sure. It could even be easy to vendor (as all of UMAP was back in the day). Why not just import it from UMAP, though?
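
For context on what a standalone version would involve, here is a rough pure-Python sketch of the UMAP-style weighting: each directed edge gets weight exp(-max(0, d - rho_i) / sigma_i), where rho_i is the distance to the nearest distinct neighbor and sigma_i is found by bisection so that a row's weights sum to log2(k), and the graph is then symmetrized with a fuzzy union. This is a simplification and not UMAP's or scanpy's code: the function names are hypothetical, k here counts neighbors without the point itself, and UMAP's `local_connectivity` and `set_op_mix_ratio` options are assumed to sit at their defaults of 1.

```python
import math


def umap_style_weights(knn_dists, n_iter=64, tol=1e-5):
    """Directed weights exp(-max(0, d - rho_i) / sigma_i) for each kNN row.

    knn_dists[i] holds distances from point i to its k >= 2 nearest
    neighbors (self excluded; a simplification made for this sketch).
    """
    weights = []
    for dists in knn_dists:
        target = math.log2(len(dists))
        positive = [d for d in dists if d > 0.0]
        rho = min(positive) if positive else 0.0  # nearest distinct neighbor
        lo, hi, sigma = 0.0, math.inf, 1.0
        for _ in range(n_iter):
            row = [math.exp(-max(0.0, d - rho) / sigma) for d in dists]
            total = sum(row)
            if abs(total - target) < tol:
                break
            if total > target:  # kernel too wide: shrink sigma
                hi = sigma
                sigma = (lo + hi) / 2.0
            else:  # kernel too narrow: grow sigma
                lo = sigma
                sigma = sigma * 2.0 if math.isinf(hi) else (lo + hi) / 2.0
        weights.append(row)
    return weights


def fuzzy_union(knn_indices, weights):
    """Symmetrize directed weights: w_ij + w_ji - w_ij * w_ji."""
    directed = {
        (i, j): w
        for i, (idx, row) in enumerate(zip(knn_indices, weights))
        for j, w in zip(idx, row)
    }
    sym = {}
    for (i, j), w in directed.items():
        w_rev = directed.get((j, i), 0.0)
        sym[(i, j)] = sym[(j, i)] = w + w_rev - w * w_rev
    return sym


# Four points on a line at positions 0, 1, 3, 6 with k = 3.
knn_indices = [[1, 2, 3], [0, 2, 3], [1, 0, 3], [2, 1, 0]]
knn_dists = [[1.0, 3.0, 6.0], [1.0, 2.0, 5.0], [2.0, 3.0, 3.0], [3.0, 5.0, 6.0]]
directed = umap_style_weights(knn_dists)
connectivities = fuzzy_union(knn_indices, directed)
```

Because the inputs are just kNN indices and distances, a function of this shape would work with any neighbor search backend, which is the point of the question above.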

Again, the main thing stopping this is the implementation of the Neighbors class. The main issue is that, on instantiation, the Neighbors class looks through the AnnData object and tries to guess what weighting method was used. It’s then unclear how that information gets used downstream (which gets tied up in some old code that doesn’t have tests). Maybe this means getting rid of the Neighbors class; maybe this means making a bunch of Neighbors subclasses.

Valentine_Svensson · April 22, 2022, 1:23am

I recently switched to PyMDE for visualization: PyMDE: Minimum-Distortion Embedding — pymde 0.1.15 documentation.

The nearest neighbors calculation in it is extremely fast. I don’t remember how it does it, but maybe it has some functions that could be useful as well for kNN matrix construction?

/Valentine

adamgayoso · April 22, 2022, 4:18pm

It uses PyNNDescent, which UMAP also uses (same author), though at one point not too long ago it wouldn’t be used in scanpy unless pynndescent was explicitly installed.

I also just want to add that pynndescent is super easy to use on its own for creating the same output as sklearn.
