Changing the order of the anndata objects in anndata.concat changes the leiden clustering

Hi everyone,

When I change the order of anndata objects in anndata concat, like the following:

adata1 = ad.concat([day1_data, day2_data, day3_data], join='outer')
adata2 = ad.concat([day2_data, day1_data, day3_data], join='outer')

And doing basic processing with normalization, PCA, neighbouring graph, UMAP, and leiden clustering, I got the same UMAP (at least no visual differences I can spot) but fairly significant clustering differences. Like some clusters being merged, some new ones pop up, etc. Did anyone else notice a similar thing? Also, I wonder why?

Hi @osmanmerdan, are you using igraph or leidenalg ? Could you share an environment listing i.e., import scanpy; scanpy.logging.print_versions() for where this happens? From the igraph R documentation, I could see the algorithm having this property “The Leiden algorithm consists of three phases: (1) local moving of nodes, (2) refinement of the partition and (3) aggregation of the network based on the refined partition.”

Dear @ilan-gold , I have included an example Jupyter notebook using a publicly available dataset. In my case, I was using Scanpy 1.11.0 and anndata 0.11.1, and I saw this behavior using both the leidenalg and igraph implementations. In the example I have included, I observed that the PCA analysis yielded different floating-point values for the two datasets, which may be the reason for the differing end results. Google Colab

Hmm @osmanmerdan this strikes me as a bug with scipy’s PCA, no? Maybe a bug report there would be a good start? I think if you cut down your example to just that final cell with arpack using scipy directly, you have a great bug report

Hi @ilan-gold, I have opened a discussion here. And it turns out that the np.float32 precision in the normalized count matrix, plus how arpack works, can cause these kinds of numerical inconsistencies. So, I have converted the adata.X data type to np.float64 and ran the clustering again.

Here is the interesting part for me: now the PCA values are equal within the 1e-12 limits. However, neighbouring/ clustering is still different. .obsp['distances'] .obsp['connectivities'] entries are different.

I went further and changed the method to compute connectivities from method="umap" to method="gauss". But it also resulted in different connectivity arrays.

Updated notebook: Google Colab

Hi @osmanmerdan I have a hunch that the leiden algorithm is not stable up to ordering. From wikipedia I see that there is a “Local Moving of Nodes” phase which I appears to be unstable up to ordering. Does this make sense?

While I don’t know a whole lot about the inner workings of the algorithms underlying umap or neighbors, I assume given that neighbors relies on ANN (specifically, the default is using PyNNDescentTransformer), the ordering again matters because of the approximate nature. For example, I notice the there are a factor of 10 more difference in connectivities than distances, which really signals to me that this is down to some sort of tie-breaking that is getting propagated down from the distances to the connectivities (and then to leiden and umap). So distances are slightly different, which means connectivities are too, which means leiden is even worse etc.

Here’s a quick sample that I think highlights this fact:

import numpy as np
from pynndescent import PyNNDescentTransformer
from sklearn.datasets import make_blobs
from sklearn.utils import shuffle

n_samples = 1000
n_features = 10
n_neighbors = 10
random_state = 42

X, _ = make_blobs(n_samples=n_samples, n_features=n_features, random_state=random_state)

transformer1 = PyNNDescentTransformer(n_neighbors=n_neighbors, random_state=42)
X_nn_1 = transformer1.fit_transform(X)

X_shuffled, indices = shuffle(X, np.arange(len(X)), random_state=123)

transformer2 = PyNNDescentTransformer(n_neighbors=n_neighbors, random_state=42)
X_nn_2 = transformer2.fit_transform(X_shuffled)

inverse_indices = np.zeros_like(indices)
inverse_indices[indices] = np.arange(len(indices))

X_nn_2_mapped = X_nn_2[inverse_indices, :][:, inverse_indices] # reorder both rows and columns

np.testing.assert_equal(X, X_shuffled[inverse_indices]) # should be equal
np.testing.assert_almost_equal(X_nn_1.toarray(), X_nn_2_mapped.toarray()) # off by 174 for me

Short of going into the scanpy code to check the output of nndescent output, I think this probably signals the issue really is the ordering, and it starts as soon as you compute neighbors.