Question about diffusion map "gaussian" implementation

Hi,

I am curious about the Gaussian implementation of diffusion maps in scanpy. The documentation claims that it reproduces the procedure laid out in Haghverdi et al. However, the code that determines sigma seems to just use the median of the squared distances within each cell's nearest neighborhood (from neighbors/__init__.py, in _compute_connectivities_diffmap):

# choose sigma, the heuristic here doesn't seem to make much of a difference,
# but is used to reproduce the figures of Haghverdi et al. (2016)
if self.knn:
    # as the distances are not sorted
    # we have decay within the n_neighbors first neighbors
    sigmas_sq = np.median(distances_sq, axis=1)
else:
    # the last item is already in its sorted position through argpartition
    # we have decay beyond the n_neighbors neighbors
    sigmas_sq = distances_sq[:, -1] / 4
sigmas = np.sqrt(sigmas_sq)

# compute the symmetric weight matrix
if not issparse(self._distances):
    Num = 2 * np.multiply.outer(sigmas, sigmas)
    Den = np.add.outer(sigmas_sq, sigmas_sq)
    W = np.sqrt(Num / Den) * np.exp(-Dsq / Den)
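For anyone reading along, here is a minimal standalone sketch of the kernel this snippet computes: each cell i gets its own bandwidth sigma_i (the median of its k-nearest-neighbor distances), and pairs are weighted with the symmetrized, locally scaled Gaussian sqrt(2 sigma_i sigma_j / (sigma_i^2 + sigma_j^2)) * exp(-d_ij^2 / (sigma_i^2 + sigma_j^2)). Note that in the snippet Dsq is the full matrix of squared pairwise distances, while distances_sq holds only the k nearest-neighbor distances per cell. The function name and toy data below are mine, not scanpy's.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def local_scale_kernel(X, n_neighbors=10):
    """Locally scaled Gaussian kernel with per-point bandwidths
    (same functional form as the scanpy snippet above; names are mine)."""
    Dsq = squareform(pdist(X)) ** 2                      # full squared pairwise distances
    # per-point sigma^2: median squared distance to the n_neighbors nearest points
    # (excluding the point itself, which sits at distance 0)
    knn_dsq = np.sort(Dsq, axis=1)[:, 1:n_neighbors + 1]
    sigmas_sq = np.median(knn_dsq, axis=1)
    sigmas = np.sqrt(sigmas_sq)

    Num = 2 * np.multiply.outer(sigmas, sigmas)          # 2 * sigma_i * sigma_j
    Den = np.add.outer(sigmas_sq, sigmas_sq)             # sigma_i^2 + sigma_j^2
    W = np.sqrt(Num / Den) * np.exp(-Dsq / Den)          # symmetric weight matrix
    return W

# toy example
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
W = local_scale_kernel(X)
print(W.shape, np.allclose(W, W.T))                      # (200, 200) True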

The original paper, on the other hand, looks at the average dimensionality of the manifold across the entire data set, through the definitions in section 2.3 (eqs. 10-12).
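To make the contrast concrete, my reading of that section (going from memory, so treat this as a sketch of the idea rather than the paper's exact procedure) is that a single global sigma is chosen from the averaged kernel sum S(sigma) = (1/N^2) sum_ij exp(-d_ij^2 / (2 sigma^2)): the slope of log S against log sigma estimates half the average intrinsic dimensionality, and sigma is taken where that slope is maximal. Something like the following (all names here are my own placeholders, not scanpy code):

import numpy as np
from scipy.spatial.distance import pdist, squareform

def global_sigma_by_dimensionality(X, sigmas=None):
    """Pick one global sigma from the log-log slope of the averaged kernel sum.
    This is my paraphrase of the 'average dimensionality' idea, not scanpy code."""
    Dsq = squareform(pdist(X)) ** 2
    if sigmas is None:
        # candidate sigmas spanning the range of observed distances
        sigmas = np.geomspace(np.sqrt(Dsq[Dsq > 0].min()), np.sqrt(Dsq.max()), 64)
    # averaged kernel sum S(sigma)
    S = np.array([np.exp(-Dsq / (2 * s ** 2)).mean() for s in sigmas])
    # slope of log S vs log sigma ~ d/2, the average intrinsic dimensionality
    slopes = np.gradient(np.log(S), np.log(sigmas))
    best = np.argmax(slopes)
    return sigmas[best], 2 * slopes[best]    # (sigma, estimated dimensionality)

# usage on toy data
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 3))
sigma, dim = global_sigma_by_dimensionality(X)
print(sigma, dim)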

My question is: how are these the same thing? If the medians of each neighborhood are, in fact, being used for the sigmas, then this corresponds to an entirely different operator in the continuum limit of the Markov kernel. Specifically, it doesn't give you the Laplacian of the data manifold, although the dependence on the density still drops out.

Otherwise, what motivates the choice? I am interested in keeping this x-dependence in my graph Laplacian, but I can't reproduce the procedure of Haghverdi et al., and this implementation would be much easier and faster. I just need to motivate it in the context of my problem.

If anyone can help, thank you!!

Hey there, did you ever get any answers for this? I'm trying to better understand the distances and connectivities in the backend.

I did wind up figuring this out. The median is just a choice, effectively an approximation for the underlying noise model. You might find my recent preprint helpful: https://arxiv.org/pdf/2401.03095.pdf. I explicitly discuss this implementation and how it affects the interpretation of the resulting diffusion map.

Awesome, thank you for sharing this! I’ll definitely take a look at your preprint.

If you have a moment, would you be able to take a look at some of these questions I had about diffusion maps to see if you have any insight?