Question about diffusion map "gaussian" implementation

Hi,

I am curious about the Gaussian implementation of diffusion maps in scanpy. The documentation claims that it reproduces the procedure laid out in Haghverdi et al. However, the code that determines sigma seems to just use the median of the squared distances within each cell's nearest neighborhood (from neighbors/__init__.py, in _compute_connectivities_diffmap):

# choose sigma, the heuristic here doesn't seem to make much of a difference,
# but is used to reproduce the figures of Haghverdi et al. (2016)
if self.knn:
    # as the distances are not sorted
    # we have decay within the n_neighbors first neighbors
    sigmas_sq = np.median(distances_sq, axis=1)
else:
    # the last item is already in its sorted position through argpartition
    # we have decay beyond the n_neighbors neighbors
    sigmas_sq = distances_sq[:, -1] / 4
sigmas = np.sqrt(sigmas_sq)

# compute the symmetric weight matrix
if not issparse(self._distances):
    Num = 2 * np.multiply.outer(sigmas, sigmas)
    Den = np.add.outer(sigmas_sq, sigmas_sq)
    W = np.sqrt(Num / Den) * np.exp(-Dsq / Den)
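For anyone reading along, here is a minimal standalone sketch of the kernel this snippet computes: each cell i gets its own bandwidth sigma_i (the median of its k-nearest-neighbor distances), and pairs are weighted with the symmetrized, locally scaled Gaussian sqrt(2 sigma_i sigma_j / (sigma_i^2 + sigma_j^2)) * exp(-d_ij^2 / (sigma_i^2 + sigma_j^2)). Note that in the snippet Dsq is the full matrix of squared pairwise distances, while distances_sq holds only the k nearest-neighbor distances per cell. The function name and toy data below are mine, not scanpy's.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def local_scale_kernel(X, n_neighbors=10):
    """Locally scaled Gaussian kernel with per-point bandwidths
    (same functional form as the scanpy snippet above; names are mine)."""
    Dsq = squareform(pdist(X)) ** 2                      # full squared pairwise distances
    # per-point sigma^2: median squared distance to the n_neighbors nearest points
    # (excluding the point itself, which sits at distance 0)
    knn_dsq = np.sort(Dsq, axis=1)[:, 1:n_neighbors + 1]
    sigmas_sq = np.median(knn_dsq, axis=1)
    sigmas = np.sqrt(sigmas_sq)

    Num = 2 * np.multiply.outer(sigmas, sigmas)          # 2 * sigma_i * sigma_j
    Den = np.add.outer(sigmas_sq, sigmas_sq)             # sigma_i^2 + sigma_j^2
    W = np.sqrt(Num / Den) * np.exp(-Dsq / Den)          # symmetric weight matrix
    return W

# toy example
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
W = local_scale_kernel(X)
print(W.shape, np.allclose(W, W.T))                      # (200, 200) True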

The original paper, on the other hand, looks at the average dimensionality of the manifold across the entire data set, through the definitions in section 2.3 (eqs. 10-12).
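To make the contrast concrete, my reading of that section (going from memory, so treat this as a sketch of the idea rather than the paper's exact procedure) is that a single global sigma is chosen from the averaged kernel sum S(sigma) = (1/N^2) sum_ij exp(-d_ij^2 / (2 sigma^2)): the slope of log S against log sigma estimates half the average intrinsic dimensionality, and sigma is taken where that slope is maximal. Something like the following (all names here are my own placeholders, not scanpy code):

import numpy as np
from scipy.spatial.distance import pdist, squareform

def global_sigma_by_dimensionality(X, sigmas=None):
    """Pick one global sigma from the log-log slope of the averaged kernel sum.
    This is my paraphrase of the 'average dimensionality' idea, not scanpy code."""
    Dsq = squareform(pdist(X)) ** 2
    if sigmas is None:
        # candidate sigmas spanning the range of observed distances
        sigmas = np.geomspace(np.sqrt(Dsq[Dsq > 0].min()), np.sqrt(Dsq.max()), 64)
    # averaged kernel sum S(sigma)
    S = np.array([np.exp(-Dsq / (2 * s ** 2)).mean() for s in sigmas])
    # slope of log S vs log sigma ~ d/2, the average intrinsic dimensionality
    slopes = np.gradient(np.log(S), np.log(sigmas))
    best = np.argmax(slopes)
    return sigmas[best], 2 * slopes[best]    # (sigma, estimated dimensionality)

# usage on toy data
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
sigma, dim = global_sigma_by_dimensionality(X)
print(sigma, dim)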

My question is: how are these the same thing? If the medians of each neighborhood are, in fact, being used for the sigmas, then this corresponds to an entirely different operator in the continuum limit of the Markov kernel. Specifically, it doesn't give you the Laplacian of the data manifold, although the dependence on the density still drops out.

Otherwise, what motivates the choice? I am interested in keeping this x-dependence in my graph Laplacian, but I can't reproduce the procedure of Haghverdi et al., and this implementation would be much easier and faster. I just need to motivate it in the context of my problem.

If anyone can help, thank you!!

Hey there, did you ever get any answers for this? I'm trying to better understand the distances and connectivities in the backend.

I did wind up figuring this out. The median is just a choice, effectively an approximation for the underlying noise model. You might find my recent preprint helpful: https://arxiv.org/pdf/2401.03095.pdf. I explicitly discuss this implementation and how it affects the interpretation of the resulting diffusion map.

Awesome, thank you for sharing this! I’ll definitely take a look at your preprint.

If you have a moment, would you be able to take a look at some of these questions I had about diffusion maps to see if you have any insight?