Hi,
I am curious about the Gaussian implementation of diffusion maps in scanpy. The documentation claims that it reproduces the procedure laid out in Haghverdi et al. However, the code that determines the kernel widths seems to just use the median of the squared distances within each nearest-neighbor neighborhood (from `neighbors/__init__.py`, in `_compute_connectivities_diffmap`):
```python
# choose sigma, the heuristic here doesn't seem to make much of a difference,
# but is used to reproduce the figures of Haghverdi et al. (2016)
if self.knn:
    # as the distances are not sorted
    # we have decay within the n_neighbors first neighbors
    sigmas_sq = np.median(distances_sq, axis=1)
else:
    # the last item is already in its sorted position through argpartition
    # we have decay beyond the n_neighbors neighbors
    sigmas_sq = distances_sq[:, -1] / 4
sigmas = np.sqrt(sigmas_sq)
# compute the symmetric weight matrix
if not issparse(self._distances):
    Num = 2 * np.multiply.outer(sigmas, sigmas)
    Den = np.add.outer(sigmas_sq, sigmas_sq)
    W = np.sqrt(Num / Den) * np.exp(-Dsq / Den)
```
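For concreteness, here is a minimal standalone sketch of what that dense branch computes; `X` and `n_neighbors` are my own stand-ins, not scanpy internals:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# toy data; X and n_neighbors are illustrative stand-ins, not scanpy internals
X = np.random.rand(200, 10)
n_neighbors = 15

# squared distances to each point's n_neighbors nearest neighbors
# (query with n_neighbors + 1 and drop the self-distance column)
nn = NearestNeighbors(n_neighbors=n_neighbors + 1).fit(X)
knn_dists, _ = nn.kneighbors(X)
distances_sq = knn_dists[:, 1:] ** 2

# per-cell bandwidth: median of the squared distances within each neighborhood
sigmas_sq = np.median(distances_sq, axis=1)
sigmas = np.sqrt(sigmas_sq)

# full pairwise squared distances (the Dsq of the dense branch above)
Dsq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)

# symmetric, locally scaled Gaussian kernel:
# W_ij = sqrt(2 s_i s_j / (s_i^2 + s_j^2)) * exp(-d_ij^2 / (s_i^2 + s_j^2))
Num = 2 * np.multiply.outer(sigmas, sigmas)
Den = np.add.outer(sigmas_sq, sigmas_sq)
W = np.sqrt(Num / Den) * np.exp(-Dsq / Den)
```

So the per-cell sigma_i is just the median kNN distance of cell i, plugged into the symmetric kernel; nothing in this construction looks at the data set as a whole.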
The original paper, on the other hand, chooses the kernel width by looking at the average dimensionality of the manifold across the entire data set, via the definitions in its section 2.3 (eqs. 10-12).
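My (possibly wrong) reading of that section is a Lafon-style scan over a single global sigma, written out below so the contrast with the per-cell medians is clear. This function and its details are my paraphrase of that kind of heuristic, not the paper's code:

```python
import numpy as np

def global_sigma_scan(X, sigmas=np.logspace(-2, 2, 50)):
    """Lafon-type heuristic (my paraphrase): scan a single global sigma,
    track the log of the average kernel sum, and read off the intrinsic
    dimensionality from the maximum slope of that curve versus log(sigma);
    sigma is then chosen in the region of maximum slope."""
    Dsq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    n = X.shape[0]
    log_S = np.array(
        [np.log(np.exp(-Dsq / (2 * s**2)).sum() / n**2) for s in sigmas]
    )
    slopes = np.gradient(log_S, np.log(sigmas))
    best = np.argmax(slopes)
    return sigmas[best], 2 * slopes[best]  # (chosen sigma, dimension estimate)
```

However I read it, the paper's heuristic produces one sigma informed by the whole data set, whereas the scanpy code above produces a different sigma_i per cell.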
My question is: how are these the same thing? If the medians of each neighborhood are, in fact, being used for the sigmas, then this corresponds to an entirely different operator in the continuum limit of the Markov kernel. Specifically, it does not give you the Laplacian of the data manifold, although the dependence on the sampling density still drops out.
Otherwise, what motivates this choice? I am interested in keeping this x-dependence in my graph Laplacian, but I can't reproduce the procedure of Haghverdi et al., and this implementation would be much easier and faster (roughly what I sketch below). But I need to motivate it in the context of my problem.
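For context, this is roughly the construction I would like to justify: keep the per-cell sigma_i (median heuristic) in the symmetric kernel `W` from above, then apply the usual alpha = 1 density normalization and row-normalize. A minimal sketch of my own, not scanpy code:

```python
import numpy as np

def transition_matrix(W):
    """alpha = 1 (anisotropic) normalization of a symmetric kernel W:
    divide out the density estimate q_i = sum_j W_ij on both sides, then
    row-normalize; this is the step that makes the sampling density drop
    out in the continuum limit for a fixed global sigma."""
    q = W.sum(axis=1)
    K = W / np.outer(q, q)           # density-normalized kernel
    P = K / K.sum(axis=1)[:, None]   # row-stochastic Markov matrix
    return P
```

With `W` built from the per-cell median sigmas this is cheap to compute; what I can't pin down is which continuum operator it converges to once sigma varies with x.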
If anyone can help–thank you!!