Changing the order of the anndata objects in anndata.concat changes the leiden clustering

osmanmerdan · September 17, 2025, 10:50am

Hi everyone,

When I change the order of anndata objects in anndata concat, like the following:

adata1 = ad.concat([day1_data, day2_data, day3_data], join='outer')
adata2 = ad.concat([day2_data, day1_data, day3_data], join='outer')

And doing basic processing with normalization, PCA, neighbouring graph, UMAP, and leiden clustering, I got the same UMAP (at least no visual differences I can spot) but fairly significant clustering differences. Like some clusters being merged, some new ones pop up, etc. Did anyone else notice a similar thing? Also, I wonder why?

ilan-gold · September 22, 2025, 5:35pm

Hi @osmanmerdan, are you using igraph or leidenalg ? Could you share an environment listing i.e., import scanpy; scanpy.logging.print_versions() for where this happens? From the igraph R documentation, I could see the algorithm having this property “The Leiden algorithm consists of three phases: (1) local moving of nodes, (2) refinement of the partition and (3) aggregation of the network based on the refined partition.”

osmanmerdan · September 29, 2025, 6:57pm

Dear @ilan-gold , I have included an example Jupyter notebook using a publicly available dataset. In my case, I was using Scanpy 1.11.0 and anndata 0.11.1, and I saw this behavior using both the leidenalg and igraph implementations. In the example I have included, I observed that the PCA analysis yielded different floating-point values for the two datasets, which may be the reason for the differing end results. Google Colab

ilan-gold · September 30, 2025, 8:38am

Hmm @osmanmerdan this strikes me as a bug with scipy’s PCA, no? Maybe a bug report there would be a good start? I think if you cut down your example to just that final cell with arpack using scipy directly, you have a great bug report

osmanmerdan · October 1, 2025, 5:31pm

Hi @ilan-gold, I have opened a discussion here. And it turns out that the np.float32 precision in the normalized count matrix, plus how arpack works, can cause these kinds of numerical inconsistencies. So, I have converted the adata.X data type to np.float64 and ran the clustering again.

Here is the interesting part for me: now the PCA values are equal within the 1e-12 limits. However, neighbouring/ clustering is still different. .obsp['distances'] .obsp['connectivities'] entries are different.

I went further and changed the method to compute connectivities from method="umap" to method="gauss". But it also resulted in different connectivity arrays.

Updated notebook: Google Colab

ilan-gold · October 7, 2025, 10:04am

Hi @osmanmerdan I have a hunch that the leiden algorithm is not stable up to ordering. From wikipedia I see that there is a “Local Moving of Nodes” phase which I appears to be unstable up to ordering. Does this make sense?

While I don’t know a whole lot about the inner workings of the algorithms underlying umap or neighbors, I assume given that neighbors relies on ANN (specifically, the default is using PyNNDescentTransformer), the ordering again matters because of the approximate nature. For example, I notice the there are a factor of 10 more difference in connectivities than distances, which really signals to me that this is down to some sort of tie-breaking that is getting propagated down from the distances to the connectivities (and then to leiden and umap). So distances are slightly different, which means connectivities are too, which means leiden is even worse etc.

Here’s a quick sample that I think highlights this fact:

import numpy as np
from pynndescent import PyNNDescentTransformer
from sklearn.datasets import make_blobs
from sklearn.utils import shuffle

n_samples = 1000
n_features = 10
n_neighbors = 10
random_state = 42

X, _ = make_blobs(n_samples=n_samples, n_features=n_features, random_state=random_state)

transformer1 = PyNNDescentTransformer(n_neighbors=n_neighbors, random_state=42)
X_nn_1 = transformer1.fit_transform(X)

X_shuffled, indices = shuffle(X, np.arange(len(X)), random_state=123)

transformer2 = PyNNDescentTransformer(n_neighbors=n_neighbors, random_state=42)
X_nn_2 = transformer2.fit_transform(X_shuffled)

inverse_indices = np.zeros_like(indices)
inverse_indices[indices] = np.arange(len(indices))

X_nn_2_mapped = X_nn_2[inverse_indices, :][:, inverse_indices] # reorder both rows and columns

np.testing.assert_equal(X, X_shuffled[inverse_indices]) # should be equal
np.testing.assert_almost_equal(X_nn_1.toarray(), X_nn_2_mapped.toarray()) # off by 174 for me

Short of going into the scanpy code to check the output of nndescent output, I think this probably signals the issue really is the ordering, and it starts as soon as you compute neighbors.

osmanmerdan · October 9, 2025, 7:45am

Hi @ilan-gold thank you very much for the detailed explanation and example. I totally agree. To me, the mystery is somewhat resolved. It is like a snowball effect, tiny differences add up.

Topic		Replies	Views
Re-Clustering Clusters of Anndata scanpy	2	3718	November 8, 2022
Leiden clustering gives me different results when I run it scanpy	1	492	April 1, 2024
Leidenalg clustering error in Trajectory inference tutorial scanpy	3	1717	September 29, 2022
ValueError: `X_pca` does not have enough PCs. Rerun `sc.pp.pca` with adjusted `n_comps` scanpy	1	865	August 5, 2022
Sc.pl.heatmap ,clusters, ascending order scanpy	0	427	January 3, 2023

Changing the order of the anndata objects in anndata.concat changes the leiden clustering

Related topics